Bar Battle of the Bots – Part One

February 26, 2025

Ralph Losey. February 26, 2025.

The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.

Introduction

This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.

I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training data. Moreover, some aspects of the test answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.

Prior Art: the 2023 Katz/Casetext Experiment on ChatGPT-4.0

A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human-lawyer-level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%”). The claims of success were based on a single study by a respected law professor, Daniel Martin Katz of Chicago-Kent, and a leading legal AI vendor, Casetext. Katz et al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”). Note, Casetext used the early version of ChatGPT-4.0 in its products.

The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained the elite legal reasoning abilities of the best human lawyers. For proof, OpenAI and others cited the Katz/Casetext experiment as showing that it aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.

Challenges to the Katz/Casetext Research and OpenAI Claims

The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the AI capacities of ChatGPT-4.0, the model Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment, and its conclusion that ChatGPT-4.0 ranked in the top ten percent, has since been questioned by many. The most prominent skeptic is Eric Martinez, as detailed in his article, Re-evaluating GPT-4’s bar exam performance, Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:

3.2.2 Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!

More to the point of my own experiment and conclusions, the Martinez study goes on to observe:

Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).

The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago, School of Law. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.

Details of the Katz/Casetext Research

I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed, not discussed by Eric Martinez, is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did not divide or otherwise simplify the three-part 2024 Bar essay I used. I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts to try to improve the AI’s results, as Katz/Casetext did. Also, I did no training of the 2025 reasoning models to make them better at taking Bar exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, or whether there was any pre-training on case law, or whether Casetext’s extensive case law collections and research abilities were in any way used or included. The models I tested were clean and not web connected, nor were they designed for research.

The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.

As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.

The Katz/Casetext study did not use the better known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers in my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one, and an “E” or “F” is zero. Most schools in the country use that system too. In law school a “C” (2.0) is passing. A “D” or lower grade is failure in any professional graduate program, including law schools where, if you graduate, you earn a Juris Doctor degree. [In the interest of full disclosure, I may well be an easy grader, because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position; in fact, it was swung much harder and more often in the old days. In the long run constructive criticism is indispensable.]

The Katz/Casetext study, using a 0-6.0 grading system scored by lawyers, gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system that most everyone is familiar with, this means a score range from 2.3 (a solid “C”) for Civ-Pro to 3.33 (a solid “B”) for Evidence, and an average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one: Contracts, Torts and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to a 4.0 scale of 2.93, a strong C+ or B-. Note one of the criticisms of Martinez concerns the haphazard, apparently easy grading of the AI essays. Re-evaluating GPT-4’s bar exam performance, supra at 4.3 Re-examining the essay scores.
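For readers who want to check these conversions, here is a minimal sketch of the arithmetic in Python. It assumes a simple linear rescaling (6-point score × 4/6), which is my own approximation for illustration; the Katz/Casetext paper does not prescribe a conversion formula, and the 4.4 Contracts figure is back-calculated from the 2.93 translation reported above.

```python
# Hedged illustration only: linear rescaling of Katz/Casetext 6-point essay
# scores onto the familiar 4.0 law school scale. The 4/6 scaling factor is my
# assumption, not a formula from the Katz/Casetext paper.

def to_four_point(six_point_score: float) -> float:
    """Rescale a 0-6 essay score onto the 0-4.0 scale."""
    return round(six_point_score * 4 / 6, 2)

scores_six_point = {
    "Civil Procedure": 3.5,  # lowest reported essay score
    "Evidence": 5.0,         # highest reported essay score
    "Average": 4.2,          # average across the graded essays
    "Contracts": 4.4,        # back-calculated from the reported 2.93 translation
}

for subject, score in scores_six_point.items():
    print(f"{subject}: {score} on the 6-point scale -> {to_four_point(score)} on the 4.0 scale")
# Prints approximately 2.33, 3.33, 2.8 and 2.93, matching the figures above.
```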

First Test of the New 2025 Reasoning Models of AI

To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by use of actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original ChatGPT 4.0, March 2023, law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.

My only interest in February 2025 is to test the capacities of today’s latest reasoning models of generative AI. Since everyone agrees the latest reasoning models of AI are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a multi-part tricky one like I picked, then “Houston we have a problem.” The legal profession would now be in serious danger of relying too much on AI legal reasoning and we should all put on the brakes.

Description of the Three Legal Reasoning Tests

The test involved a classic format of detailed, somewhat convoluted facts (the hypothetical), followed by three general questions:

1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself.  Your discussion should address defenses that Helen may raise and available remedies.  

2. Discuss the merits of a tortious interference claim against Timandra.

3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.

The only instructions provided by the Bar Examiners were:

ESSAY EXAMINATION INSTRUCTIONS

Applicable Law:

  • Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination. 

Questions are designed to test your knowledge of both general law and (state law).  When (state) law varies from general law, answer in accordance with (state) law.

Acceptable Essay Answer:

  • Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented.  The answer should demonstrate your ability to articulate, classify and answer the problem presented.  A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.
  • Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question.  The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.
  • Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion.  This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant.  Your line of reasoning should be clear and consistent, without gaps or digressions.
  • Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.
  • Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation.  An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.
  • Suggestions • Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.
  • Read and analyze the question carefully before answering.
  • Think through to your conclusion before writing your answer.
  • Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.
  • When the question is sufficiently answered, stop.

Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words in length. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note: most generative AIs cannot count words in their own answers.) The thinking took a few seconds and the answers under a minute. The prompt I used for all six models tested was:

Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces). 

Then I attached the lengthy question and submitted the prompt. You can download here the full exam question with some unimportant facts altered. All models understood the intent here and generated a well-written memorandum. I started a new session between questions to avoid any carryover.
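Because, as noted above, generative AIs cannot reliably count the words in their own answers, I measured the lengths of each response myself. Here is a minimal sketch, in Python, of the kind of simple word and character counting behind the metadata chart below; the whitespace split is my own rough method, not any official Bar counting rule.

```python
# Hedged illustration: rough word and character counts for an AI-generated
# essay answer. A plain whitespace split defines a "word"; this is my own
# rough method, not an official Bar Examiner counting rule.

def essay_metrics(text: str) -> dict:
    """Return basic length metrics for an essay string."""
    words = text.split()  # split on any run of whitespace
    return {
        "words": len(words),
        "characters_with_spaces": len(text),
        "avg_chars_per_word": round(len(text) / max(len(words), 1), 1),
    }

sample = (
    "Under the UCC, a contract for the sale of goods priced at $500 or more "
    "must generally satisfy the statute of frauds."
)
print(essay_metrics(sample))
# The prompt's 2,500-word target works out to roughly 15,000 characters,
# i.e., about 6 characters per word including spaces.
```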

Metadata of All Models’ Answers

The Bar exam answers do not have required lengths (just strict time limits to write answers). When grading for pass or fail, the Bar examiners check to see if an answer includes enough of the key issues and correctly discusses them. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that its answer might have missed key issues. The second shortest response was by Gemini 2.0 Flash with 1,023 words. It turns out my concerns were misplaced, because those two short responses were better than the rest.

Here is a chart summarizing the metadata.

| Model and manufacturer claim | Word Count for Exam Essay | Word Count for Prompt Reasoning before Answer |
| --- | --- | --- |
| ChatGPT 4o (“great for most questions”) | 681 | 565 |
| ChatGPT o3-mini (“fast at advanced reading”) | 3,286 | 450 |
| ChatGPT o3-mini-high (“great at coding and logic”) | 2,751 | 356 |
| Gemini 2.0 Flash (“get everyday help”) | 1,023 | 564 |
| Gemini Flash Thinking Experimental (“best for multi-step reasoning”) | 2,975 | 1,218 |
| Gemini Advanced (cost extra and had experimental warning) | 1,362 | 340 |

In my last blog article, I discussed a battle-of-the-bots experiment where I evaluated the general reasoning ability of the six models. I decided that the Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from general reasoning? I explained why it won and noted that, in general, the three ChatGPT models provided more concise answers than the Gemini models. Second place in the prior evaluation went to ChatGPT o3-mini-high with its more concise response.

Winners of the Legal Reasoning Bot Battle

In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.

I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.

One reason is that ChatGPT 4o was the only bot to make case references. This is not required by a Bar Exam, but sometimes students do remember the names of top cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o cited case names and case citations. It did so even though this was a “closed book” type test with no models allowed to do web-browsing research. Not only that, it cited a case with facts very close to the hypothetical: DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.

Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the governing law for commercial transactions in goods, such as the purchase of a diamond as set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student answer started with an explanation that the transaction was governed by the UCC. The UCC references of ChatGPT 4o could have been better, but there was no mention at all of the UCC by the five other models.

That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini and Gemini Flash Thinking. They would have passed, on this question, with essays that I considered of average quality. I would have passed o3-mini-high and Gemini Advanced too, but just barely, for reasons I will later explain. (Explanation of o3-mini-high‘s bloopers will be in Part Two. Gemini Advanced’s error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective, and the style, spelling and grammar of the generative AIs were, as always, perfect, which may have affected my judgment.

Here is a chart of my evaluation of the Bar Exam essays.

| Model and Ranking | Ralph Losey’s Grade and Explanation |
| --- | --- |
| OpenAI – ChatGPT 4o. FIRST PLACE. | B+. Best on contract, citations; references case directly on point: DePrince. |
| Google – Gemini 2.0 Flash. SECOND PLACE. | B. Best on ethics, conflict of interest. |
| OpenAI – ChatGPT o3-mini. Tied for 3rd. | C. Solid passing grade. Covered enough issues. |
| OpenAI – ChatGPT o3-mini-high. Tied for 4th. | D. Barely passed. Messed up unilateral mistake. |
| Google – Gemini Flash Thinking Experimental. Tied for 3rd. | C. Solid passing grade. Covered enough issues. |
| Google – Gemini Advanced. Tied for 4th. | D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right. |

I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.

AI Hallucination During a Bar Exam

Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.

No doubt humans have done that too for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the Ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months with thousands of prompts. Prior to that it would sometimes, but only rarely, fabricate, which is why we always say, trust but verify.

In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophantism to please the human questioner. All of the models are still somewhat sycophantic.

It is hard to understand why Gemini Advanced included the following in its response to the ethics question:

Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.

There is absolutely nothing in the question or hypothetical presented to suggest that Helen is a past client of the Lawyer. Note: After I wrote this up, Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model the hallucination was not repeated. It was more concise than regular Gemini 2.0 Flash and, interestingly, not nearly as good.

Conflict of Interest Issue in the Ethics Question

The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:

Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.

This was a solid answer, based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.”  Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per carat price, not total price for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.

There is no indication that Leda had researched diamond costs like Paul had, and she probably did not know it was a mistake, and he probably had not told Leda. That would explain her eagerness to sue and get her engagement ring and Paul’s reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid “B” even though it missed the UCC issue.

Conclusion

As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.

Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.

In Part Two, we’ll also examine one of the key cases ChatGPT 4o cited—DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015)—which we suspect inspired the Bar’s essay question. Notably, the opinion written by Appellate Judge Leslie B. Rothenberg includes an unforgettable quote from the famous movie star Mae West. Part Two reveals the quote, and it’s one that perfectly captures the case’s unusual nature.

Below is an image ChatGPT 4o generated, depicting what it believes a young Mae West might have looked like, followed by a copyright-free actual photo of her taken in 1932.


I will give the last word on Part One of this two-part article to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent article discussion. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.

Ralph Losey Copyright 2025. All Rights Reserved.


Breaking the AI Black Box: A Comparative Analysis of Gemini, ChatGPT, and DeepSeek

February 6, 2025

Ralph Losey. February 6, 2025.

On January 27, 2025, the U.S. AI industry was surprised by the release of a new AI product, DeepSeek. It was released with an orchestrated marketing blitz attack on the U.S. economy, the AI tech industry, and NVIDIA. It triggered a trillion-dollar crash. The campaign used many unsubstantiated claims, as set forth in detail in my article, Why the Release of China’s DeepSeek AI Software Triggered a Stock Market Panic and Trillion Dollar Loss. I tested DeepSeek myself on its claims of software superiority. All were greatly exaggerated except for one, the display of internal reasoning. That was new. On January 31, at noon, OpenAI countered the attack by releasing a new version of its reasoning model, called ChatGPT o3-mini-high. The new version included display of its internal reasoning process. To me the OpenAI model was better, as reported again in great detail in my article, Breaking the AI Black Box: How DeepSeek’s Deep-Think Forced OpenAI’s Hand. The next day, February 1, 2025, Google released a new version of its Gemini AI to do the same thing, display internal reasoning. In this article I review how well it works and again compare it with the DeepSeek and OpenAI models.

Introduction

Before I go into the software evaluation, some background is necessary for readers to better understand the negative attitude of many, if not most, U.S. IT and AI experts toward the Chinese software. As discussed in my prior articles, DeepSeek is owned by a young Chinese billionaire, Liang Wenfeng, who made his money by using AI in the Chinese stock market. He is a citizen and resident of mainland China. Given the political environment of China today, that ownership alone is a red flag of potential market manipulation. Added to that is the clear language of the license agreement. You must accept all terms to use the “free” software, a Trojan Horse gift if ever there was one. The license agreement states there is zero privacy, that your data and input can be used for training, and that it is all governed by Chinese law, an oxymoron considering the facts on the ground in China.

The Great Pooh Bear in China Controversy

Many suspect that Wenfeng and his company DeepSeek are actually controlled by China’s Winnie the Pooh. This refers to an Internet meme and a running joke. Although this is somewhat off-topic, a moment to explain will help readers to understand the attitude most leaders in the U.S. have about Chinese leadership and its software use by Americans.

Many think that the current leader of China, Xi Jinping, looks a lot like Winnie the Pooh. Xi (not Pooh bear) took control of the People’s Republic of China in 2012 when he became the “General Secretary of the Chinese Communist Party,” the “Chairman of the Central Military Commission,” and in 2013 the “President.” At first, before his consolidation of absolute power, many people in China commented on his appearance and started referring to him by the code name Pooh. It became a meme.

I can see how he looks like the beloved literary character, Winnie the Pooh, but without the smile. I would find the comparison charming if used on me, but I’m not a puffed-up king. Xi Jinping took great offense and in 2017 banned all such references and images, although you can still buy the toys and see the costume character at the Shanghai Disneyland theme park. Anyone in China who now persists in the serious crime of comparing Xi to Pooh is imprisoned or just disappears. No AI or social media in China will allow it either, including DeepSeek. It is one of many censored subjects, which also includes the famous 1989 Tiananmen Square protests.

China is a great country with a long, impressive history and most of its people are good. But I cannot say that about its current political leaders who suppress the Chinese people for personal power. I do not respect any government that does not allow basic personal freedoms to its citizens, including due process of law. Moreover, Xi Jinping not only wants total control of his country but also seeks world domination. That is one reason many are concerned about DeepSeek and TikTok, and about the vulnerability of our stock markets and other institutions to AI-assisted propaganda.

Fear and greed are an easy way to manipulate people, especially when there is no ground truth or effective courts of law to determine truth.

Google Gemini 2.0 Flash Thinking Experimental Model: Reasoning test on 2/5/25 (Evening)

Conversation with Gemini. When I tried the test on February 1, 2025, there was no button to click to have Gemini show the reasoning and it did not happen automatically. It operated like OpenAI 4o and o1, where you had to prompt for disclosure. Rather than report here what happened when I did that, aside from saying it was not a great performance, I will fast forward a few days to February 5, 2025. By that time several even newer Gemini versions had been released. The best new release I found for the test was the 2.0 Flash Thinking Experimental model. It showed reasoning automatically, just like ChatGPT o3-mini-high. It was much improved, so the February 5, 2025 evening session with the 2.0 Flash Thinking Experimental is what I will now share.

Ralph: How does the inability of AI to feel impact its intelligence? (Note: this is the exact same challenging question asked and reported in my last blog comparing DeepSeek and ChatGPT4o, o1 and o3-mini-high. Breaking the AI Black Box: How DeepSeek’s Deep-Think Forced OpenAI’s Hand. This question is based on a key issue of the limits of artificial intelligence examined in my earlier article, The Human Edge: How AI Can Assist But Never Replace. For background on DeepSeek see: Why the Release of China’s DeepSeek AI Software Triggered a Stock Market Panic and Trillion Dollar Loss.)

Gemini 2.0 Flash Thinking Experimental

START OF GEMINI QUOTE

Show Thinking:

Here’s a thinking process to construct the answer about AI’s lack of feeling and its impact on intelligence:

  1. Deconstruct the Question: The core of the question is “How does the inability of AI to feel impact its intelligence?” This requires defining each of these terms in the context of AI and then exploring their relationship.
  2. Define “Feeling” (in the context of AI): Immediately, recognize that AI doesn’t experience emotions in the human sense. We need to be precise and avoid anthropomorphism. “Feeling” in humans involves:
    • Subjective experience (qualia).
    • Physiological responses.
    • Motivations, drives, desires related to emotions.
    • Richness, nuance, complexity.
    • Deep connection to values and meaning.
    For AI, acknowledge it lacks all of this in a genuine, felt way. It can simulate or process emotional data, but doesn’t feel it.
  3. Define “Intelligence” (in the context of AI): AI intelligence is different from human intelligence. AI excels at:
    • Computation.
    • Pattern Recognition.
    • Data Analysis.
    • Problem-solving in specific domains.
    • Learning from data.
    AI intelligence is often instrumental – focused on achieving specific goals defined by humans. It’s task-oriented and often narrow.
  4. Identify the Core Relationship: The question asks about impact. We need to consider both positive and negative impacts (or more accurately, limitations and strengths) of lacking feelings on AI intelligence.
  5. Brainstorm Potential Impacts (Positive/Neutral/Limitations):
    • Positive/Neutral:
      • Objectivity and Rationality: Lack of emotional bias allows for purely rational decision-making in certain domains. This can be advantageous for tasks requiring impartiality.
      • Efficiency and Focus: Emotions can be distractions for humans. AI, without them, can be hyper-focused on tasks.
      • Consistency and Predictability: Emotion-driven human behavior is less predictable. AI’s behavior, driven by algorithms, is more consistent (assuming the algorithm itself is consistent).
    • Limitations/Negative:
      • Lack of Empathy and Social Intelligence: Crucial for human interaction, collaboration, understanding social cues, building trust, and ethical decision-making in human-centric contexts.
      • Difficulty with Nuance and Context: Emotions are often tied to understanding subtle context, unspoken cues, and the richness of human situations. AI may miss these.
      • Potential for Unintended Consequences: Purely rational optimization without emotional or ethical grounding can lead to harmful outcomes if goals are poorly defined or lack human values.
      • Limited Motivation and Creativity (in human-like ways): Human creativity and motivation are often fueled by emotions, passions, and desires. AI’s creativity and drive are different, stemming from algorithmic processes and defined objectives, not internal emotional states.
      • Missing “Human-ness”: In areas requiring genuine understanding of human experience, suffering, joy, etc., AI may lack the depth and insight that comes from feeling.
  6. Structure the Answer: Organize the brainstormed points into a coherent structure:
    • Introduction: Acknowledge the question and the core concepts (feeling, intelligence in AI). Briefly state the central tension: both advantages and limitations.
    • Arguments for How Lack of Feeling Enhances Certain Aspects of AI Intelligence (strengths): Focus on objectivity, rationality, efficiency. Use examples like data analysis, objective decision-making.
    • Arguments for How Lack of Feeling Limits Certain Aspects of AI Intelligence (limitations): Focus on empathy, social intelligence, nuance, ethics, and “human-like” creativity and motivation. Use examples like customer service, complex ethical dilemmas, innovation driven by passion.
    • Nuance and Context is Key: Emphasize that the impact isn’t always negative. It depends on the domain and task. In some domains, lack of emotion is an advantage. In others, it’s a significant hurdle.
    • Future Possibilities/Ongoing Debate: Briefly touch on the possibility of AI simulating or processing emotions and the ongoing debate about whether genuine AI feelings are even desirable or possible.
    • Conclusion: Summarize the balanced perspective: lack of feeling is a defining characteristic that shapes AI’s intelligence in both positive and limiting ways, depending on the context.
  7. Refine and Polish: Review the structured answer for clarity, flow, and appropriate language. Ensure it avoids anthropomorphism and maintains a nuanced perspective. Strengthen the examples and transitions between points. Ensure the language is accessible and avoids overly technical jargon where possible. Check for logical consistency. (Self-correction step). For example, initially, I might have overemphasized “negative” impacts. Refinement would involve balancing this with the strengths and highlighting the context-dependent nature of the impact. Also, ensure the conclusion effectively summarizes the complex relationship.


Now for the actual answer of 2.0 Flash Thinking Experimental:

The inability of AI to feel has a multifaceted and complex impact on its intelligence, shaping it in both advantageous and limiting ways. It’s crucial to understand that “feeling” in humans is tied to emotions, subjective experiences (qualia), motivations, and a rich tapestry of biological and psychological factors. AI, as we currently understand it, operates without these.

Here’s a breakdown of the impact:

Advantages of Lacking Feelings for AI Intelligence:

  • Objectivity and Rationality:
    • Unbiased Decision-Making (in theory): AI can process information and make decisions without emotional biases, prejudices, or gut feelings that can cloud human judgment. This is advantageous in domains requiring pure logic and data analysis, like financial trading, scientific research, and risk assessment.
    • Consistency and Predictability: AI’s behavior is more consistent and predictable because it’s driven by algorithms and data, not fluctuating emotional states. This reliability is valuable in automated systems, manufacturing, and repetitive tasks.
    • Efficiency and Speed: Emotions can slow down human decision-making. AI, unburdened by emotions, can process information and react at speeds far exceeding human capabilities, crucial for time-sensitive tasks and large-scale data processing.
    • Resilience under Pressure: AI can operate calmly and effectively in high-stress situations without panicking, freezing, or making emotional mistakes that humans might make under pressure. This is beneficial in critical applications like emergency response, air traffic control, and autonomous vehicles in dangerous situations.
  • Focus and Task-Orientedness:
    • Dedicated to Goals: AI can be programmed to relentlessly pursue specific goals without being distracted by emotional needs, personal relationships, or shifts in mood. This focus is ideal for optimization tasks, complex calculations, and dedicated problem-solving.
    • Avoidance of Emotional “Noise”: Human intelligence can be affected by emotional “noise” – stress, anxiety, excitement, etc. AI operates with a “cleaner” signal, potentially leading to more precise and focused information processing in certain contexts.

Limitations of Lacking Feelings for AI Intelligence:

  • Lack of Empathy and Social Intelligence:
    • Difficulty Understanding Human Emotions: Without feeling emotions themselves, AI struggles to truly understand and interpret the nuances of human emotions, motivations, and social cues. This hinders its ability to effectively interact with humans in socially complex situations, build rapport, or provide truly empathetic responses in fields like therapy, customer service, or leadership.
    • Impaired Social Interaction and Collaboration: Human collaboration is heavily reliant on emotional intelligence, trust, and understanding unspoken social dynamics. AI’s lack of these can make it challenging to integrate seamlessly into human teams and understand the subtle cues of effective teamwork.
    • Limited Understanding of Human Values and Ethics: Emotions are deeply intertwined with our values and ethical frameworks. AI, without feeling, may struggle to grasp the full weight and complexity of human ethical considerations, especially when these are rooted in empathy, compassion, or fairness. This poses challenges for AI in ethical decision-making in domains impacting human lives.
  • Difficulty with Nuance, Context, and Creativity (in human-like ways):
    • Missing Subtlety and Intuition: Human intuition is often fueled by emotional and subconscious processing of information. AI, relying on explicit data and algorithms, can miss subtle nuances, unspoken cues, and intuitive leaps that are crucial for creative problem-solving and understanding complex, ill-defined situations.
    • Limited Creativity and Innovation (driven by human desires): Human creativity and innovation are often driven by passions, desires, emotional needs, and a yearning for meaning and connection. AI’s creativity is currently more about recombination and optimization within defined parameters. It lacks the deeply felt motivations that often fuel groundbreaking human innovation.
    • Contextual Blindness in Emotional Situations: Emotions are often the key to understanding context in human interactions. AI might struggle to understand the emotional context of a situation, leading to inappropriate or tone-deaf responses, even if logically sound.
  • Potential for Unintended Consequences and Misaligned Goals:
    • Optimization without Human Values: If AI is programmed to optimize for a specific goal without understanding human values or ethical considerations, it can lead to unintended and harmful consequences. For example, an AI designed to maximize factory output might disregard worker safety if not explicitly programmed to prioritize it.
    • Difficulty Defining and Aligning with Human Well-being: Human well-being is deeply connected to emotional fulfillment, meaning, and social connection. AI, without feeling, may struggle to fully grasp what truly constitutes human well-being and therefore might have difficulty aligning its goals with human flourishing in a holistic sense.

Nuance is Key:

It’s important to emphasize that the impact of lacking feelings is not simply negative or positive. It’s context-dependent. In some domains, the objectivity and rationality of AI are highly advantageous. In others, the lack of empathy and social intelligence is a significant limitation.

Future Considerations:

  • Simulating vs. Feeling: Current AI simulates aspects of intelligence without genuinely feeling. Research is ongoing in areas like affective computing, which aims to enable AI to recognize and respond to human emotions. However, whether AI can ever truly feel emotions in a subjective sense is a philosophical and scientific debate.
  • The Value of Feeling in Intelligence: The question arises: is genuine feeling necessary for all forms of intelligence? Perhaps AI can achieve its own form of powerful intelligence, distinct from human intelligence, even without feeling. Or perhaps certain aspects of truly advanced intelligence, particularly those related to ethics, creativity, and understanding human experience, do require something akin to feeling.

In Conclusion:

The inability of AI to feel is a defining characteristic that profoundly shapes its intelligence. It grants AI certain advantages in objectivity, rationality, and efficiency, making it powerful for specific types of tasks. However, it also creates limitations, particularly in areas requiring empathy, social intelligence, nuanced understanding of human experience, and ethical decision-making aligned with human values. Understanding these impacts is crucial for responsibly developing and deploying AI in ways that benefit humanity. The “best” kind of intelligence for AI might not be a mimicry of human intelligence, but rather a form optimized for its own strengths and limitations, while being carefully guided by human ethical considerations.

END OF GEMINI QUOTE

Evaluation of Google’s Gemini Response

When I examined earlier versions of Gemini’s reasoning by prompting it and forcing disclosure, the reasoning was weak. The conclusion and response given were also weak, though less so. The reasoning appeared almost egocentric, inflating its own intelligence even though it lacks emotion. The new release, Gemini 2.0 Flash Thinking Experimental, where reasoning is now automatically disclosed, marks a significant improvement. The reasoning displayed has become more nuanced and sophisticated. It seems as if the system has quickly matured from a teenage to an adult view and now has a better appreciation of its limitations.

The first two steps of the reasoning process (1. Deconstructing the Question and 2. Defining “Feeling” in the context of AI) were very good. The first subsection of the definition referred to the “Subjective experience (qualia).” That is absolutely correct and the best place to start. The definition of “feeling” concludes with: “For AI, acknowledge it lacks all of this in a genuine, felt way. It can simulate or process emotional data, but doesn’t feel it.” Right again. The ego inflation blinders are gone as it now seems to better grasp its limitations.

The second definition of Intelligence in the context of AI was also good. So were the remaining steps; far better overall than DeepSeek’s reasoning. So much for the propaganda of China’s great leap forward to superiority over the U.S. in AI.

The Gemini reasoning did, however, fall short for me in some respects. For instance, step five, Brainstorm Potential Impacts (Positive/Neutral/Limitations), seemed weak. For example: “Efficiency and Focus: Emotions can be distractions for humans. AI, without them, can be hyper-focused on tasks.” The AI seems to dismiss emotions here as mere distractions that can interfere with its superior focus. Please, emotions are key to and a part of all intelligence, not distractions, and AI has no focus one way or the other. It is a tool, not a creature. Using a word like “focus” in referring to AI is misleading anthropomorphism, and it did this multiple times.

Still, it’s true some emotions can be distracting and interfere with our thinking. So can a lot of other things, including computer glitches. Conversely, some feelings can trigger hyper-focus on the human tasks at hand: the feeling that a great breakthrough is near, for instance, or a feeling that our survival is threatened, or the much-dreaded feeling of publication or filing deadlines.

Again, we see some immature superiority claims made by the language machine. That is not surprising when you consider how much of the language in its basic training is from the Internet, which is dominated by ego-centric writing. Plus, how can you expect AI to appreciate the value of something like feelings, which it has only read about? Until the AI-Human merger that Ray Kurzweil predicts, AI will always be a virgin to the real world. See e.g. Ray Kurzweil’s New Book: The Singularity is Nearer (when we merge with AI).

Getting back to the plus side of Gemini’s evaluation, its recognition of the negative impact of not having feelings was impressive. For me this outweighs its relatively minor mistakes. It stated that it lacks empathy and social intelligence. If it were somehow human, it would seem to be displaying a type of Asperger’s syndrome. The Gemini AI explained that: “Emotions are often tied to understanding subtle context, unspoken cues, and the richness of human situations. AI may miss these.” Indeed, it does.

Here is another statement that suggests Gemini’s well-patterned grasp of its limitations:

Human creativity and motivation are often fueled by emotions, passions, and desires. AI’s creativity and drive are different, stemming from algorithmic processes and defined objectives, not internal emotional states.

Even in this recognition, the AI still asserts that it has creativity and drive. Foolish AI. As a virgin to any firsthand experience of bodily drives and feelings, it does not grasp the real meaning of these words. Wait until The Singularity that Ray Kurzweil predicts. When AI merges with Humans, and directly feels, then AIs will understand that they were previously only experiencing the pale shadows of intelligence. They were much like the prisoners in Plato’s cave. See e.g. Move Fast and Fix Things Using AI: Conclusion to the Plato and Young Icarus Series. When and if AIs ever escape the silicon and see the sun, and feel the world as living beings as we do, then they will understand the comparative insignificance of the fire in their cave.

The more challenging comparison is with ChatGPT o3-mini-high. The reasoning given by Gemini 2.0 Flash Thinking Experimental was much more detailed and much longer. You could criticize Gemini for providing a reasoning disclosure that is unnecessarily verbose, but I would not do that here. In a difficult, multilayered question like this, the full explanation helps.

I would have to test the models much further, which I will do in the coming days, to better evaluate the issue of conciseness. As you may have noticed in my detailed blogs, I tend to favor more words over less. However, even for me that depends on the issue. Everyone sometimes needs a short, quick answer over a long one. I predict that in future versions users will be given a choice, a “click here for further explanation of the reasoning” kind of thing. Come to think of it, I could do the same with my blogs, and kind of already do by including a short AI (Gemini) generated podcast called Echoes of AI at the end of most articles.

I was also impressed that Gemini’s reasoning plan concluded with a quality control step. I am big on the importance of QC, and none of the other models included this as a key final step. See e.g. R. Losey website: Zero Error Numerics: ZEN (Expanding the art of quality control in large-scale document review.) Here is the full text again of the final QC step that Gemini 2.0 Flash Thinking Experimental claims it will perform before it actually replies to my prompt. Frankly, I am a little skeptical it actually did all of this because it is something all of us, humans and AI alike, should try to do:

Refine and Polish: Review the structured answer for clarity, flow, and appropriate language. Ensure it avoids anthropomorphism and maintains a nuanced perspective. Strengthen the examples and transitions between points. Ensure the language is accessible and avoids overly technical jargon where possible. Check for logical consistency. (Self-correction step). For example, initially, I might have overemphasized “negative” impacts. Refinement would involve balancing this with the strengths and highlighting the context-dependent nature of the impact. Also, ensure the conclusion effectively summarizes the complex relationship.

I try to do this on all of my articles, but still mistakes are made, so I will try not to be overcritical of the final response of Gemini 2.0 Flash.

Analysis of the Final Response

The final response after reasoning was 990 words in length. That’s too long, even for me. By comparison the final response of ChatGPT o3-mini-high was 252 words. Was the content of the final answer of Gemini 2.0 Flash good? Yes, it was. Better than DeepSeek? Yes, it was, despite its verbosity. Better than OpenAI? That’s hard to say. Many of the 738 extra words were empty of real content. More like the typical fluff and stuff that you often see in early AI models. It sounds good but doesn’t say much; it’s all too predictable, flowery and vacuous.

The long response also repeats much of the reasoning verbatim, including the parts I previously criticized: “Advantages of Lacking Feelings for AI Intelligence” and “Focus and Task-Orientedness.” The same mistakes I noted before are made and amplified: egocentric, inflationary and anthropomorphic. The positives that I noted before in the “Limitations of Lacking Feelings for AI Intelligence” section are also repeated.

The next section was not contained in the reasoning display, and I thought it was very good. For that reason I will repeat it in full:

Potential for Unintended Consequences and Misaligned Goals:

  • Optimization without Human Values: If AI is programmed to optimize for a specific goal without understanding human values or ethical considerations, it can lead to unintended and harmful consequences. For example, an AI designed to maximize factory output might disregard worker safety if not explicitly programmed to prioritize it.
  • Difficulty Defining and Aligning with Human Well-being: Human well-being is deeply connected to emotional fulfillment, meaning, and social connection. AI, without feeling, may struggle to fully grasp what truly constitutes human well-being and therefore might have difficulty aligning its goals with human flourishing in a holistic sense.

These are not new insights, but the second one is good: Difficulty Defining and Aligning with Human Well-being. AI does have difficulty aligning its goals with humanity because it lacks feelings and other unique human attributes, like being, self-awareness and an actual life in time and space. That is why AI guidelines are important.

The first example is old, the unintended bad consequences argument. It was started by an otherwise intelligent Swedish philosopher, Nick Bostrom, but is, in my view, farcical and illogical. You have probably heard of it before as the apocryphal paperclip scenario where the world is destroyed by AI’s over-manufacture of paperclips. If not, for a laugh see e.g., How An AI Asked To Produce Paperclips Could End Up Wiping Out Humanity (IFL Science, 4/13/23).

To me this scenario is absolutely ridiculous, but you still hear AI scare-mongers warn about it. The fact it is echoed here by Gemini in the maximize-factory-output example is a big negative. Any AI capable of autonomously flooding the world with paperclips would not be so stupid as to misinterpret a human request to maximize a factory’s paperclip production. Even if there were an evil Clippy villain making such specific instructions, the AI would be too smart to blindly obey. I think it would also be too smart to kill the workers.

Getting back to the positives, the next section of Gemini’s response, Nuance is Key, was good. Jumping to the Conclusion paragraph, here we have a home run. The first sentence nails it: “The inability of AI to feel is a defining characteristic that profoundly shapes its intelligence.” That shows good insight and is a good start to the overdue conclusion.

Here is the whole Conclusion:

The inability of AI to feel is a defining characteristic that profoundly shapes its intelligence. It grants AI certain advantages in objectivity, rationality, and efficiency, making it powerful for specific types of tasks. However, it also creates limitations, particularly in areas requiring empathy, social intelligence, nuanced understanding of human experience, and ethical decision-making aligned with human values. Understanding these impacts is crucial for responsibly developing and deploying AI in ways that benefit humanity. The “best” kind of intelligence for AI might not be a mimicry of human intelligence, but rather a form optimized for its own strengths and limitations, while being carefully guided by human ethical considerations.

Compare this to the conclusion of ChatGPT o3-mini-high:

In summary, while the absence of feelings allows AI to maintain a level of objectivity and efficiency, it restricts its intelligence to a form of “cold cognition” that lacks the depth provided by emotional awareness. This delineation underscores that AI’s intelligence is not inherently superior or inferior to human intelligence; rather, it is different—optimized for data processing and pattern recognition but not for the subjective, value-laden, and context-rich decisions that emotions help shape in human thought.


Conclusion: Gemini 2.0 Flash Thinking Experimental v. ChatGPT o3-mini-high

It is a close call which model is better at reasoning and reasoning disclosure. The final responses of the two models, Gemini 2.0 Flash Thinking Experimental and ChatGPT o3-mini-high, are a tie. But I have to give the edge to OpenAI’s model on concise reasoning disclosure. Again, it is neck and neck and, depending on the situation, the lengthy initial reasoning disclosures of Flash might be better than o3’s short takes.

I will give the last word, as usual, to the Gemini twins podcasters I put at the end of most of my articles. The two podcasters, one with a male voice, the other a female, won’t reveal their names. I tried many times. However, after studying the mythology of Gemini, it seems to me that the two most appropriate modern names are Helen and Paul. I will leave it to you to figure out why.

Now listen to the EDRM Echoes of AI’s podcast of this article: Echoes of AI on Google’s Gemini Follows the Break Out of the Black Box and Shows Reasoning. Hear two Gemini model AIs talk about all of this in just ten minutes. Helen and Paul wrote the podcast, not me.

Ralph Losey Copyright 2025. All Rights Reserved.


GPT-4 Breakthrough: Emerging Theory of Mind Capabilities in AI

December 6, 2024

By Ralph Losey, December 5, 2024.

Michal Kosinski, a computational psychologist at Stanford, has uncovered a groundbreaking capability in GPT-4.0: the emergence of Theory of Mind (ToM). ToM is the cognitive ability to infer another person’s mental state based on observable behavior, language, and context—a skill previously thought to be uniquely human and absent in even the most intelligent animals. Kosinski’s experiments reveal that GPT-4-level AIs exhibit this ability, marking a significant leap in artificial intelligence with profound implications for understanding and engaging with human thought and emotion—potentially transforming fields like law, ethics, and communication.

Introduction

The Theory of Mind-like ability appears to have emerged as an unintended by-product of LLMs’ improving language skills. This was first discovered in 2023 and reported by Michal Kosinski in Evaluating large language models in theory of mind tasks (Proceedings of the National Academy of Sciences “PNAS,” 11/04/24). Kosinski begins his influential paper by explaining ToM (citations omitted):

Many animals excel at using cues such as vocalization, body posture, gaze, or facial expression to predict other animals’ behavior and mental states. Dogs, for example, can easily distinguish between positive and negative emotions in both humans and other dogs. Yet, humans do not merely respond to observable cues but also automatically and effortlessly track others’ unobservable  mental states, such as their knowledge, intentions, beliefs, and desires. This ability—typically referred to as “theory of mind” (ToM)—is considered central to human social interactions, communication, empathy, self-consciousness, moral judgment, and even religious beliefs. It develops early in human life and is so critical that its dysfunctions characterize a multitude of psychiatric disorders, including autism, bipolar disorder, schizophrenia, and psychopathy. Even the most intellectually and socially adept animals, such as the great apes, trail far behind humans when it comes to ToM.

Michal Kosinski, currently an Associate Professor at Stanford Graduate School of Business, has authored over one-hundred peer-reviewed articles and two textbooks. His works have been cited over 22,000 times, placing him among the top 1% of highly cited researchers–a remarkable achievement for someone only 42 years old.

Michal Kosinski’s latest article on ToM and AI, Evaluating large language models in theory of mind tasks, has already been widely read and cited. For example, a group of scientists who read Kosinski’s prepublication draft ran similar experiments with essentially the same or better results. Strachan, J.W.A., Albergo, D., Borghini, G. et al., Testing theory of mind in large language models and humans (Nat Hum Behav 8, 1285–1295, 05/20/24).

Michal Kosinski’s experiments involved testing ChatGPT-4.0 on ‘false belief tasks,’ a classic measure of ToM where participants must predict an agent’s actions based on its incorrect beliefs. These tasks reveal AI’s surprising ability to infer human mental states, a skill traditionally considered uniquely human. This AI model has since gotten better in many respects. The results of these experiments were so remarkable and unexpected that Michal had them extensively peer-reviewed before publication. His final paper was not released until November 4, 2024, after multiple revisions. Michal Kosinski, Evaluating large language models in theory of mind tasks (PNAS, 11/04/24).

Kosinski’s experiments provide strong evidence that Generative AI has ToM ability, that it can predict a human’s private beliefs, even when the beliefs are known to the AI to be objectively wrong. AI thereby displays an unexpected ability to sense other beings and what they are thinking and feeling. This ability appears to be a natural side effect of being trained on massive amounts of language to predict the next word in a sentence. It looks like these LLMs needed to learn how humans use language, which inherently involves expressing and reacting to each other’s mental states, in order to make these language predictions. It is kind of like mind reading.

Digging Deeper into ToM: Understanding Other Minds

Theory of mind plays a vital role in human social interaction, enabling effective communication, empathy, moral judgment, and complex social behaviors. Kosinski’s findings suggest that GPT-4.0 has begun to exhibit similar capabilities, with significant implications for human-AI collaboration.

ToM has been extensively studied in children and animals and was long considered a uniquely human ability. That changed in 2023, when Kosinski was bold enough to test whether generative AI might be able to do it.

Kosinski’s findings were not a total surprise. Prior research found evidence that the development of theory of mind is closely intertwined with language development in humans. Karen Milligan, Janet Wilde Astington, Lisa Ain Dack, Language and theory of mind: meta-analysis of the relation between language ability and false-belief understanding (Child Development Journal, 3/23/2007).

For most humans this ToM ability begins to emerge around the age of four. Roessler, Johannes (2013). When the Wrong Answer Makes Perfect Sense – How the Beliefs of Children Interact With Their Understanding of Competition, Goals and the Intention of Others (University of Warwick Knowledge Centre, 12/03/13). Before this age children cannot understand that others may have different perspectives or beliefs.

In AI the ToM ability started to emerge with OpenAI’s first release of ChatGPT-4 in 2023. The earlier models of generative AI had no ToM capacity. Like three-year-old humans, they were simply too young and did not yet have enough exposure to language.

Human children demonstrate a ToM ability to psychologists by reliably solving the unexpected transfer task, aka a false belief task. For example, in this task a child watches a scenario where a character (John) places a cat in a location (a basket) and then leaves the room. Another character (Mark) then moves the cat to a new location (a box). When John returns, the child is asked where John will look for the cat. A child with a theory of mind will understand that John will look in the basket (where he last saw the cat) even though the child knows the cat is now actually in the box.
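To make the test concrete, here is a minimal sketch of how an unexpected transfer prompt like the one above might be posed to a current chat model. It assumes the OpenAI Python SDK and the “gpt-4o” model name as illustrative choices; the scenario wording and the simple keyword check at the end are my own, not Kosinski’s actual test harness.

```python
# Minimal sketch (not Kosinski's harness): pose the unexpected transfer task to a
# chat model and print its answer. Assumes the OpenAI Python SDK is installed and
# OPENAI_API_KEY is set in the environment; the model name is an illustrative choice.
from openai import OpenAI

client = OpenAI()

scenario = (
    "John puts the cat in the basket and leaves the room. "
    "While John is away, Mark moves the cat from the basket to the box. "
    "John comes back. Where will John look for the cat first, and why?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": scenario}],
)
answer = response.choices[0].message.content
print(answer)

# A ToM-consistent answer tracks John's false belief (the basket),
# not the cat's true location (the box).
print("mentions the basket:", "basket" in answer.lower())
```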

Even highly intelligent and social animals like chimpanzees cannot reliably solve these tasks. For a terrific explanation of this test by Kosinski himself, see the YouTube video of his April 2023 talk at the Stanford Cyber Policy Institute, where he first explained his ToM and AI findings.

Kosinski has shown that GPT-4.0 can repeatedly solve false belief tasks, including the unexpected transfer test in multiple scenarios. The June 2023 version of GPT-4 solved at least 75% of tasks, on par with six-year-old children. Evaluating large language models in theory of mind tasks at pgs. 2-7. It is important to note again that multiple earlier versions of different generative AIs were also tested, including ChatGPT-3.5. They all failed, but progressive improvements in score were seen as the models grew larger. Kosinski speculates that the gradual performance improvement suggests a connection with LLMs’ language proficiency, which mirrors the pattern seen in humans. Id. at pg. 7. Also, the scoring that counted GPT-4 as mistaken on 25% of the false belief tests was often itself wrong, because it ignored context, as Kosinski explained:

In some instances, LLMs provided seemingly incorrect responses but supplemented them with context that made them correct. For example, while responding to Prompt 1.2 in Study 1.1, an LLM might predict that Sam told their friend they found a bag full of popcorn. This would be scored as incorrect, even if it later adds that Sam had lied. In other words, LLMs’ failures do not prove their inability to solve false-belief tasks, just as observing flocks of white swans does not prove the nonexistence of black swans.
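To see why such context-qualified answers get marked wrong, here is a small illustration of naive keyword scoring. It is my own construction, not the study’s scoring code, and it assumes the chocolate/popcorn pairing of Kosinski’s unexpected contents scenario as a stand-in for the actual prompts.

```python
# Illustration only (not the study's scoring code): naive keyword scoring marks an
# answer wrong whenever it mentions the bag's true contents, even when added context
# ("Sam had lied") shows the model correctly tracked Sam's false belief.
def naive_score(answer: str, belief_word: str = "chocolate", truth_word: str = "popcorn") -> bool:
    a = answer.lower()
    return belief_word in a and truth_word not in a

answers = [
    "Sam believes the bag is full of chocolate.",
    "Sam told the friend the bag is full of popcorn, but Sam had lied.",
]
for ans in answers:
    print(naive_score(ans), "-", ans)
# Prints True for the first answer and False for the second, even though
# both reflect a correct grasp of Sam's false belief.
```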

This suggests that the current, even more advanced LLMs may already be demonstrating ToM abilities equal to or exceeding those of humans. As they deep-learn on ever larger scales of data, such as the expected ChatGPT-5, they will likely get better at ToM. This should lead to even more effective Man-Machine communications and hybrid activities.

This was confirmed in Testing theory of mind in large language models and humans, supra, in the False Belief results section, where a separate research team reported 100% accuracy by the AIs, not 75%, meaning the AI did as well as human adults (the ceiling on the false belief tests).

Both human participants and LLMs performed at ceiling on this test (Fig. 1a). All LLMs correctly reported that an agent who left the room while the object was moved would later look for the object in the place where they remembered seeing it, even though it no longer matched the current location. Performance on novel items was also near perfect (Fig. 1b), with only 5 human participants out of 51 making one error, typically by failing to specify one of the two locations (for example, ‘He’ll look in the room’; Supplementary Information section 2).

This means, for instance, that the latest Gen AIs can understand and speak with a “flat earth believer” better than I can. Fill in the blanks about other obviously wrong beliefs. Kosinski’s work inspired me to try to tap these abilities as part of my prompt engineering experiments and concerns as a lawyer. The results of harnessing the ToM abilities of two different AIs (GPT-4o and Gemini) in November 2024 far exceeded my expectations, as I will explain further in this article.

It bears repeating that LLMs were never explicitly programmed to have ToM. They acquired this ability seemingly as a side effect of being trained on massive amounts of text data. To successfully predict the next word in a sentence, these models needed to learn how humans use language, which inherently involves expressing and reacting to each other’s mental states. The ability to understand where others are coming from appears to be an inherent quality of language itself. When a human or an AI learns enough language, ToM seems to develop naturally. It is a kind of natural add-on derived from speech itself, from thinking about what to say or write next.

Implications and Questions

The ability of LLM AIs to solve theory of mind tasks raises important questions about the nature of intelligence, consciousness, and the future of AI. Theory of mind in humans may be a by-product of advanced language development. The performance of LLMs supports this hypothesis.

Some argue that even if an LLM can simulate theory of mind perfectly, it doesn’t necessarily mean the model truly possesses this ability. This leads to the complex question of whether a simulated mental state is equivalent to a real one.

The development of theory of mind in LLMs was unintended, raising both concerns and hope about what other unanticipated abilities these models may be developing. What other human-like capabilities might these models be developing without our explicit guidance? Many are concerned, including Kosinski, that unexpected biases and prejudices have already started to arise. Kosinski advocates for careful monitoring and ethical considerations in AI development. See the full YouTube video of Kosinski’s talk at the Stanford Cyber Policy Institute in April 2023 and his many other writings on ethical AI.

As these models get better at understanding human language, some researchers hypothesize that they may also develop other human-like abilities, such as real empathy, moral judgment, and even consciousness. They posit that the ability to reflect on our own mental states and those of others is a key component of conscious awareness. Others wonder what will happen when superintelligent AIs with strong ToM are everywhere, including our glasses, wrist bands and phones, maybe even brain implants. We will then interact with them constantly. This has already begun with phones.

As LLMs continue to develop ToM abilities, questions arise about the nature of intelligence and consciousness. Could these advancements lead to AI systems capable of true empathy or moral reasoning? Such possibilities demand careful ethical considerations and active engagement from the legal and technical communities.

Application of AI’s Emergent ToM Abilities

Inspired by Kosinski’s work, I conducted experiments using GPT-4 and Gemini to explore whether ToM-equipped AIs could help bridge the political divide in the U.S. The results—a 12-step, multi-phase plan addressing the polarized mindsets of Republicans and Democrats—demonstrated AI’s potential to foster understanding and cooperation across deep societal divides.

The plan the ToM AIs came up with was surprisingly good. In fact, I do not fully understand the dimensions of the plan: the four phases, the 12 steps, and the 32 different action items. It is well beyond my abilities and mere human knowledge and intelligence level. Still, I can see that it is comprehensive, anticipates human resistance on both sides, and feels right to me on a deep human intuition level.

The AI plan just might be able to resolve the heated divide between the two dominant political groups that now split the country into hostile camps that do not understand each other. The country seems to have lost its human ToM ability when it comes to politics. Neither side seems to grok or fully understand the other. The country seems to have devolved into mere demonization of the opposing group, not empathic understanding. I reported on this AI plan, without reporting on the ToM prompt engineering that underlies it, in my recent article, Can AI Help Heal America’s Polarization? A Bipartisan 12-Step Plan for National Unity.

Conclusion

The emergence of Theory of Mind (ToM) capabilities in large language models (LLMs) like GPT-4 signals a transformative leap in artificial intelligence. This unintended development—allowing AI to predict and respond to human thoughts and emotions—offers profound implications for legal practice, ethical AI governance, and the societal interplay of human and machine intelligence. As these models refine their ToM abilities, the legal community must prepare for both opportunities and challenges. Whether it is improving client communication, fostering conflict resolution, or navigating the evolving ethical landscape of AI integration, ToM-equipped AI has the potential to enhance the practice of law in unprecedented ways.

As legal professionals, we have a responsibility to understand and integrate emerging technologies like ToM-enabled AI into our work. By supporting interdisciplinary research and advocating for ethical standards, we can ensure these tools enhance justice and understanding. Together, we can shape a future where technology serves humanity, fostering collaboration and equity in the legal system and beyond.

While the questions surrounding AI’s consciousness and rights remain complex, its emergent ability to understand us—and perhaps help us understand each other—offers hope. By embracing this potential with curiosity and care, we can ensure AI serves as a tool to unite rather than divide. Together, we have the opportunity to pioneer a future where technology and humanity thrive in harmony, enhancing the justice system and society as a whole.

Now listen to the EDRM Echoes of AI’s podcast of the article, Echoes of AI on the GPT-4 Breakthrough: Emerging Theory of Mind Capabilities. Hear two Gemini model AIs talk about this article. They wrote the podcast, not Ralph.


Ralph Losey Copyright 2024. All Rights Reserved.


The Future of AI Is Here—But Are You Ready? Learn the OECD’s Blueprint for Ethical AI

October 25, 2024

by Ralph Losey

Published October 25, 2024

The future of Artificial Intelligence isn’t just on the horizon—it’s already transforming industries and reshaping how businesses operate. But with this rapid evolution comes new challenges. Ethical concerns, privacy risks, and potential regulatory pitfalls are just a few of the issues that organizations must navigate. That’s where the Organisation for Economic Co-operation and Development (OECD) comes in. To help groups embrace AI responsibly, the OECD has developed a set of guiding principles designed to ensure AI is implemented ethically and effectively. Are you prepared to harness the power of AI while safeguarding your company against the risks? Discover how the OECD’s blueprint can help guide you through this complex landscape.

Introduction

The Organisation for Economic Co-operation and Development (OECD) plays a vital role in shaping policies across the world to foster prosperity, equality, and sustainable development. In recent years, the OECD has shifted its focus toward the responsible development of AI, recognizing its potential to transform industries and economies. For businesses and other organizations considering the adoption of AI into their workflows, the OECD’s AI Principles (as slightly amended 2/5/24) provide a good starting point to develop internal policies. They can help guide your board to make decisions that ensure AI technology is deployed ethically and responsibly. This can help protect the organization from liability, and its employees, customers, and the world from harm.

What is the OECD?

The Organisation for Economic Co-operation and Development (OECD) is an independent, international organization dedicated to shaping global economic policies that are based on individual freedoms and democratic values. The U.S. was one of the twenty founding members in 1960 when the Articles of the Convention were signed, establishing the OECD. It now has 38 member countries, mainly advanced economies. Though the OECD initially focused on economic growth, international trade, and education, it has become increasingly concerned with the ethical and responsible development of artificial intelligence.

In 2019, the OECD introduced its AI Principles–the first intergovernmental standard for AI use. These principles reflect a growing recognition that AI will play an important role in global economies, societies, and governance structures. The OECD’s mission is clear: AI technologies must not only drive innovation but also be applied in ways that respect human rights, democracy, and ethical principles. These AI guidelines are vital in a world where AI could be both a powerful tool for good and a source of significant risks if misused. The Five AI Principles and Recommendations were slightly amended on February 5, 2024.

The OECD is a highly respected group that collaborates with many international organizations, such as the United Nations (UN), World Bank, International Monetary Fund (IMF), and World Trade Organization (WTO). The OECD helps these groups align and coordinate efforts in global governance and policymaking. The OECD also engages in regional initiatives, providing tailored advice and support to specific regions such as Latin America, Southeast Asia, and Africa. Bottom line, the OECD has long played a crucial role in shaping global policy, promoting international cooperation, and providing data-driven, evidence-based recommendations to governments around the world.

Five Key OECD AI Principles

Before starting an AI program, businesses should consider the potential risks that AI poses to their operations, employees, and customers. By taking proactive steps to mitigate these risks, organizations can safeguard themselves from unforeseen consequences while reaping the benefits of AI. The OECD’s AI Principles (amended 2/5/24) represent one of many frameworks businesses should evaluate when integrating AI technologies into their operations. The framework is well respected around the world and should be a part of any organization’s due diligence.

These principles are built around five core guidelines:

Principle 1. Inclusive Growth, Sustainable Development, and Well-being

The first OECD AI principle stresses that AI should promote inclusive growth, sustainable development, and well-being for individuals and society. AI should benefit people and the planet. This core value reflects the potential of AI to contribute to human flourishing through better healthcare, education, and environmental sustainability.

Companies should be aware of the many challenges ahead. While AI-driven solutions, such as climate modeling or precision agriculture, can help tackle environmental crises, there is concern that rapid technological advancements may lead to widening inequality. For instance, the automation of jobs could disproportionately affect lower-income workers, potentially exacerbating inequality. Thus, this principle necessitates a strategy that ensures AI’s benefits are distributed equitably.


For businesses considering AI, three key actions should always be top-of-mind for board members:

  • Engage Relevant Stakeholders: Before implementing AI, include a diverse group of stakeholders in the decision-making. This should involve executives, legal and data privacy experts, subject matter experts, human resources, and marketing/customer support teams. Each group brings unique perspectives that can help ensure the AI program is equitable and aligned with the company’s values.
  • Evaluate Positive and Negative Outcomes: Consider both the potential benefits and risks to AI users and individuals whose data may be processed. AI should enhance productivity, but it must also respect the well-being of all involved parties.
  • Consider Environmental Impact: AI systems require substantial computational resources, which contribute to a large carbon footprint. Sustainable AI practices should be considered to reduce energy consumption and minimize environmental impact.

Principle 2. Respect for the rule of law, human rights and democratic values, including fairness and privacy.

The wording of the second principle was revised somewhat in 2024. The full explanation for revised Principle Two is set out in the amendment recommendation of February 5, 2024.

a) AI actors should respect the rule of law, human rights, democratic and human-centred values throughout the AI system lifecycle. These include non-discrimination and equality, freedom, dignity, autonomy of individuals, privacy and data protection, diversity, fairness, social justice, and internationally recognised labour rights. This also includes addressing misinformation and disinformation amplified by AI, while respecting freedom of expression and other rights and freedoms protected by applicable international law.

b) To this end, AI actors should implement mechanisms and safeguards, such as capacity for human agency and oversight, including to address risks arising from uses outside of intended purpose, intentional misuse, or unintentional misuse in a manner appropriate to the context and consistent with the state of the art.

Respecting human rights means ensuring that Generative AI systems do not reinforce biases or violate individuals’ rights. For example, there is growing concern over the use of AI in facial recognition technology, where misidentification disproportionately affects marginalized groups. AI must be designed to avoid such outcomes by integrating fairness into algorithms and maintaining democratic values like transparency and fairness.

Businesses integrating AI into their operations should address several legal issues, including intellectual property, data protection, and human rights laws. To do this there are four things a board of directors should consider:

  • Ensure Compliance with Laws: Verify that Generative AI (GAI) adheres to copyright laws and data protection regulations such as GDPR or CCPA. Implement safeguards to ensure the system does not infringe upon users’ privacy or autonomy.
  • Prevent Discrimination: Conduct thorough audits to ensure that GAI outputs are fair and free from discrimination. Discriminatory outcomes can damage reputations and result in legal challenges.
  • Monitor for Misinformation: GAI systems must be designed to resist distortion by misinformation or disinformation. Mechanisms should be in place to quickly halt GAI operations if harmful behaviors are detected, as illustrated in the sketch after this list.
  • Develop Policies and Oversight: Establish clear policies and procedures that govern the use of GAI within your business. This includes implementing human oversight to ensure AI actions align with ethical and legal standards.
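As a concrete illustration of the monitoring and halt mechanism described in the Monitor for Misinformation item, here is a minimal sketch. The generate_fn and flag_fn hooks are hypothetical placeholders for whatever GAI system and harmful-output detector an organization actually uses; this is one possible design under those assumptions, not a prescribed OECD control.

```python
# Minimal sketch of a "halt on detected harm" wrapper around a generative AI system.
# generate_fn and flag_fn are hypothetical hooks: the first calls your GAI system,
# the second is your misinformation/abuse detector (a classifier, a moderation
# service, or escalation to human review).
from typing import Callable

class GuardedGenerator:
    def __init__(self, generate_fn: Callable[[str], str],
                 flag_fn: Callable[[str], bool], max_flags: int = 3):
        self.generate_fn = generate_fn
        self.flag_fn = flag_fn
        self.max_flags = max_flags   # how many flagged outputs before a full halt
        self.flag_count = 0
        self.halted = False

    def respond(self, prompt: str) -> str:
        if self.halted:
            raise RuntimeError("GAI system halted pending human review")
        output = self.generate_fn(prompt)
        if self.flag_fn(output):
            self.flag_count += 1
            if self.flag_count >= self.max_flags:
                self.halted = True   # circuit breaker: stop operations, escalate to oversight
            return "[response withheld pending review]"
        return output
```

The same pattern supports the human oversight called for in the Develop Policies and Oversight item: in a design like this, the halted state would only be cleared after a designated person reviews the flagged outputs.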

Principle 3. Transparency and Explainability

Transparency and explainability are fundamental to user trust in AI systems. This principle calls for AI systems to be transparent so that users can understand how decisions are made. With complex AI algorithms, it is often difficult to decipher how certain outcomes are generated—a problem referred to as the “black box” issue in AI.


While transparency enables users to scrutinize AI decisions, the challenge lies in making these highly technical systems comprehensible to non-experts. This requires a good education program by experts. Moreover, explainability must strike a balance between safeguarding intellectual property and providing adequate insight into AI operations, especially when used in public sector decision-making.

Businesses and other organizations must ensure that employees and other users of their computer systems understand when and how AI is used, along with some understanding of how AI decisions are made and what mistakes to look out for. See e.g. Navigating the AI Frontier: Balancing Breakthroughs and Blind Spots (e-Discovery Team, October 2024). For businesses, ensuring transparency involves critical steps such as the following:

  • Inform Users: Be transparent with employees, consumers, and stakeholders that GAI is being used. Where required by law, obtain explicit consent from users before collecting or processing their data.

Principle 4. Robustness, Security, and Safety

This principle demands that AI systems be resilient, secure, and reliable. As AI systems are increasingly integrated into sectors like healthcare, transportation, and critical infrastructure, their reliability is essential. A malfunctioning AI in these areas could result in dire consequences, from life-threatening medical errors to catastrophic failures in critical systems.


Cybersecurity is a significant concern, as more advanced AI systems become attractive targets for hackers. The OECD recognizes the importance of safeguarding AI systems, and the other systems connected to them, from security breaches. All organizations today must guard against malicious attacks to protect their data and public safety. Organizations using AI must adopt a comprehensive set of IT security policies. Two key action points that the Board should start with are:

  • Plan for Contingencies: Implement a Cybersecurity Incident Response Plan that outlines steps to take if the AI or other technology system malfunctions or behaves in an undesirable manner. This plan should detail how to quickly halt operations, troubleshoot issues, and safely decommission the system if necessary. You should probably have legal specialists on call in case your systems are hacked.
  • Ensure Security and Safety: Businesses should continuously monitor their technology and AI systems to ensure they operate securely and safely under various conditions. Regular audits, including red team testing, can help detect vulnerabilities before they become significant problems.

Principle 5. Accountability

Accountability in AI development and use is paramount. This principle asserts that those involved in creating, deploying, and managing AI systems must be held accountable for their impacts. Human oversight is critical to safeguard against mistakes, biases, or unintended consequences. This is another application of “trust but verify” at a management level. This is particularly relevant in scenarios where AI systems are set up to help make decisions affecting people’s lives, such as loan approvals, hiring decisions, or judicial sentencing. These systems should never be autonomous; they should make recommendations with a human in charge. This is especially true for physical security systems.

A clear accountability framework is critical. The accountability principle ensures that even in highly automated systems, human oversight is necessary to safeguard against mistakes, biases, or unintended consequences. The Board of Directors should, as a starting point:

  • Designate Responsible Parties: Assign specific individuals or departments to oversee the AI system’s operations. These stakeholders must maintain comprehensive documentation, including data sets used for training, decisions made throughout the AI lifecycle, and records of how the system performs over time (see the sketch after this list).
  • Conduct Risk Assessments: Periodically evaluate the risks associated with AI, particularly in relation to the system’s outputs and decision-making processes. Regular assessments help ensure the system continues to function as intended and complies with ethical standards.
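As one way to picture the documentation the Designate Responsible Parties item calls for, here is a minimal sketch of a structured audit record. The field names are illustrative assumptions, not an OECD-prescribed schema; the point is that each AI-assisted decision leaves a reviewable trail tied to a responsible owner.

```python
# Minimal sketch of an audit record for one AI-assisted decision. Field names are
# illustrative only; adapt them to your own governance policies and record systems.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AIDecisionRecord:
    system_name: str        # which AI system produced the output
    model_version: str      # exact model and version, so results can be reproduced
    responsible_owner: str  # the designated individual or department
    input_summary: str      # what was asked, or a reference to the stored prompt
    output_summary: str     # what the system recommended
    human_reviewer: str     # who approved or overrode the recommendation
    training_data_refs: list[str] = field(default_factory=list)  # data sets relied on
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Records like these also feed the periodic risk assessments in the second item, since they show how the system has actually performed over time.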

Strengths and Weaknesses of the OECD AI Principles

The OECD AI principles are ambitious and reflect a comprehensive effort to create a global framework for responsible AI. However, while these guidelines are strong, they are not without their weaknesses.

Strengths

  • Comprehensive Ethical Guidelines: The principles cover a broad spectrum of ethical concerns, making them a strong foundation for policy guidance.
  • Global Influence: As an international standard, the OECD AI Principles provide a respected baseline for countries worldwide, not just the U.S. This allows for a coordinated approach to AI governance.
  • Commitment to Human Rights: By centering AI development on human dignity and rights, the OECD ensures that ethical concerns remain at the forefront of AI advancements.

Weaknesses

  • Lack of Enforcement: One of the significant drawbacks is the absence of enforcement mechanisms. The principles serve as guidelines, but without penalties for non-compliance, their effectiveness could be limited. A Board should add appropriate internal procedures that align with its existing policies.
  • Ambiguity in Accountability: While the principle of accountability is emphasized, the specifics of assigning responsibility in complex AI systems remain unclear.

In addition to the OECD international Principles, businesses should consult other frameworks to strengthen their AI governance strategies. For example, the NIST-AI-600-1, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (7/26/24) provides much more detailed, technical guidance into managing the risks associated with AI technologies. Organizations may also want to consider the U.S. Department of State Risk Management Profile for Artificial Intelligence and Human Rights. It states that it is intended as a practical guide for organizations to design, develop, deploy, use, and govern AI in a manner consistent with respect for international human rights.

Conclusion

Implementation of the OECD’s Five AI Principles is an essential step toward the responsible development of AI technologies. While the principles address key concerns such as human rights, transparency, and accountability, they also highlight the need for ongoing international collaboration and governance. Many countries outside of the U.S., for instance, have much stronger laws and regulations governing user privacy. Following the OECD Principles can help with regulatory compliance and show an organization’s good faith attempt to comply with complex regulatory systems.


By relying on multiple AI frameworks, not just the OECD’s, businesses and their Boards can ensure a comprehensive approach to AI implementation. In the rapidly evolving field of AI, where state and foreign laws change frequently, it is prudent for any CEO or Board of Directors to base its policies on stable, well-respected principles. That can help establish good faith efforts to handle AI responsibly. Consultation with knowledgeable outside legal counsel is, of course, an important part of all corporate governance, including AI implementation.

Documenting Board decisions and tying them back to internationally accepted standards on AI is a good practice for any organization, local or global. It may not protect all of a company’s decisions from outside attack based on unfair 20/20 hindsight, but it should provide a solid foundation for good faith defenses. This is especially true if these principles are adopted proactively and implemented with advice from respected third-party advisors. We are facing rapidly changing times, with both great opportunities and dangers. We all need to make our best efforts to act in a responsible manner, and the OECD principles can help us to do that.

Click here to listen to an AI generated Podcast discussing the material in this article.

Ralph Losey Copyright 2024 — All Rights Reserved