Lawyers Duties | e-Discovery Team

Bar Battle of the Bots – Part Two

March 5, 2025

Ralph Losey. March 5, 2025

The battle continues. In Part One, we examined how six advanced AI reasoning models from OpenAI and Google tackled a real Bar Exam essay question. Some impressed, others faltered, and one emerged as the clear winner—ChatGPT 4o. But what made its response stand out? In this second half of the Bar Battle of the Bots, we present the full text of ChatGPT 4o’s winning answer, followed later by its own explanation of how it reasoned through the legal issues before writing it out.

Bar Exam Battle of the Bots. All images by Ralph Losey using ChatGPT 4o.

Part Two Introduction

Unlike human Bar Exam takers, these AI models had to “show their work,” providing meta-reasoning—a step-by-step breakdown of their thought process before generating a response. While this isn’t required for humans sitting for the Bar, evaluating how an AI reaches its conclusions is critical in determining whether it can be trusted in real legal settings.

Beyond ranking the best and worst performers, Part Two also examines one of the key cases cited in ChatGPT 4o’s answer—DePrince v. Starboard Cruise Services, an appellate decision that probably inspired the Bar’s essay question. And, as promised, we’ll reveal the unforgettable Mae West quote from Judge Leslie B. Rothenberg’s opinion—a line that perfectly captures the case’s peculiar nature.

Before concluding this two-part series, I offer a Personal Summary: Trust But Verify—a reflection on AI legal reasoning based on my four decades of experience as a lawyer. I’ve had the privilege of working with—and at times opposing—some of the best lawyers in the country. That perspective shapes my critical analysis of AI’s capabilities, its limitations, and, most importantly, what lawyers should take away from this experiment.

Let’s continue the Bar Battle of the Bots and see exactly how ChatGPT 4o won.

ChatGPT 4o is Winner of the February 2025 Legal Reasoning Battle of the Bots.

Bar Exam Essay of ChatGPT 4o

Here is the winning response on the Bar Battle of the Bots by ChatGPT 4o on 2/23/25. The essay concisely addresses most of the legal issues raised by the hypothetical and three questions on contract, tort and ethics. Part One includes the full quote of the hypothetical and three questions.

Memorandum on Legal Issues in Paul and Leda v. Helen and Timandra
To: [Attorney’s Name]
From: [Your Name]
Re: Legal Analysis of Contract, Tort, and Ethical Issues
Date: [Insert Date]

A. Breach of Contract Claim Against Helen

Formation and Validity of the Contract
The essential elements of a valid contract under Florida law are:
• Offer
• Acceptance
• Consideration
• Mutual Assent (Meeting of the Minds)
Helen and Paul entered into a signed written contract for the sale of a 2-carat yellow diamond for $20,000. There was an offer and acceptance, and the contract appears to satisfy the statute of frauds under Florida law as a sale of goods over $500. (Fla. Stat. 672.201).

Defense: Unilateral Mistake
Helen may argue that a unilateral mistake (pricing error) occurred when she listed the price as $20,000 instead of $40,000. Under Florida law, a unilateral mistake is grounds for rescission only if:
• The mistake was material,
• The non-mistaken party knew or should have known about the mistake, and
• Enforcing the contract would be unconscionable (DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015)).
Here, Paul had done research on yellow diamonds and suspected the price was lower than expected. His silence could be interpreted as bad faith, making Helen’s unilateral mistake defense stronger.

Available Remedies
If Paul and Leda prevail in their breach of contract claim, remedies may include:
• Specific Performance: Since yellow diamonds meeting Leda’s specifications are rare, they could argue for specific performance under UCC §2-716 due to the unique nature of the goods.
• Expectation Damages: If specific performance is denied, they may seek damages for the cost of obtaining a comparable diamond.

Standing: Can Leda Sue?
Leda was not a party to the contract; therefore, under privity of contract principles, she may lack standing to sue. However, if she can be classified as a third-party beneficiary (Fla. Stat. §672.318), she may have a cause of action.

B. Tortious Interference Claim Against Timandra

Elements of Tortious Interference (Tortious Interference with a Contract)
Paul and Leda may assert a claim against Timandra for tortious interference, which under Florida law requires:
• A valid and enforceable contract
• Defendant’s knowledge of the contract
• Intentional procurement of the contract’s breach
• Absence of justification
• Damages (Tamiami Trail Tours, Inc. v. Cotton, 463 So. 2d 1126 (Fla. 1985)).

Application to Facts
• Valid Contract: The contract was legally formed (unless voided by Helen’s unilateral mistake defense).
• Knowledge: Helen informed Timandra of the dispute, giving her actual knowledge of the contract.
• Intentional Procurement of Breach: Timandra agreed to buy the diamond despite knowing of the dispute, arguably inducing Helen to breach.
• Justification: Timandra may argue she was a bona fide purchaser without malicious intent. However, knowing about an active contract dispute may weaken her defense.

Available Remedies
• Compensatory Damages for loss of the diamond.
• Punitive Damages if Timandra’s interference is deemed willful or malicious.

C. Ethical Issues in Lawyer’s Conduct

Misuse of Client Funds
The assistant deposited Paul and Leda’s check into the law firm’s operating account. This raises concerns under the Florida Rules of Professional Conduct:
• Rule 5-1.1(a): Lawyers must keep client funds in a separate trust account.
• Misappropriation: Using client funds to pay firm expenses could constitute misappropriation and potential disciplinary action.

Supervision of Nonlawyers
• Rule 4-5.3: Lawyers must ensure nonlawyer employees comply with professional obligations.
• The assistant’s deposit of the check in the wrong account suggests a failure in oversight.

Potential Consequences
• Lawyer may face disciplinary proceedings, including suspension or disbarment, for improper handling of client funds.
• The firm must promptly rectify the issue by transferring the funds to the appropriate trust account.

Conclusion

Contract Claim: Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.

Tortious Interference: Timandra knowingly induced breach, making her potentially liable.

Ethical Violations: Lawyer mishandled client funds, violating Florida’s professional conduct rules.
Immediate action should be taken to correct the ethical issue and determine the best litigation strategy for Paul and Leda.

Number one among all reasoning models. But for how long?

Why ChatGPT 4o Won this Battle of the Bots

As mentioned, ChatGPT 4o wins in my book for including citations and because it was the only model to mention the UCC (through citations). More discussion of the UCC and greater detail on the Statute of Frauds would have helped. Still, you have to remember the Bar Exam’s instructions for this test, which emphasized the need for good, concise style:

The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration. … Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules. … the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation. … Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law. When the question is sufficiently answered, stop.

The essay of ChatGPT 4o certainly deserves high marks for concise style. Further, you have to be impressed by the chatbot’s coming up with a case directly on point, even without being asked to research. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA. 2015) (buyer mistakenly quoted per carat price, not total; case remanded for trial). In fact, I’m pretty sure this case was the inspiration for the Bar Exam question. The 2015 opinion, written by the highly respected appellate Judge Leslie B. Rothenberg (now in private practice), who is known for her good writing, includes a famous quote that may jog the memory of this case for many Florida lawyers (did mine):

Hollywood starlet Mae West once said, “I never worry about diets. The only carrots that interest me are the number of carats in a diamond.” Thus, it appears quite likely that Ms. West would have been interested in the diamond in this case: a twenty carat diamond that Starboard offered to DePrince for a very low sum. As it turns out, the “too good to be true” price of the diamond was just that, and the price conveyed to DePrince was a mistake. Now DePrince wants his twenty carat diamond; Starboard wants out of its sales contract; and Starboard’s supplier, who allegedly misquoted the price of the diamond upon which Starboard and DePrince relied, has not even been added as a party to the lawsuit. In short, this is truly a gem of a case.

After taking 30 seconds to take the Bar Exam ChatGPT4 took another 5 seconds to generate this image of Mae West.

Why Even the Best ChatBot Answer was a “B” Plus

Research by any decent lawyer would have picked up another appeal in this same case three years later. Perhaps some top students in the closed book Bar Exam would have remembered the sequel, or otherwise thought of the complex legal issues resolved in the second DePrince appeal. After the trial that was remanded in the first appeal resulted in a verdict for the jeweler, the unhappy buyer appealed again. In this appeal an en banc panel of the Third District Court of Appeal eventually resolved a internal conflict of prior opinions of the court. The full court held in an opinion, which is now referred to by legal scholars as DePrince III, that the jeweler did not have to prove fraudulent inducement by the buyer as an element of the unilateral mistake defense and affirmed the verdict for the jeweler. DePrince v. Starboard Cruise Services, (Fla. 3rd DCA, August 1, 2018).

None of the AI answers, and not even the student answer that the Bar Examiner’s picked, went into the complex fraud inducement issue discussed in DePrince II and III. That is one reason why, although I passed all of the AIs here (just barely for Google’s pay-extra Gemini Advanced), none were on the level of superintelligent law students, none were A or A+.

Our winner here, ChatGPT 4o, only earned an B+ for several other reasons. First of all, the exam did not discuss all of the possible issues raised by the facts and often touched on the issues in a facile, incomplete manner. For instance, it never mentioned that Helen’s action was an anticipatory repudiation of the agreement. Further, it did not discuss the possible defense of mutual mistake. It only discussed unilateral mistake, which was the stronger defense, but still it should have also discussed mutual mistake and why it would not succeed under these facts. The AI also failed to mention the parole evidence rule exceptions permitting extrinsic evidence. The model student answer selected by the Bar Examiners did a good job of explaining these issues.

Further, the conclusion of ChatGPT 4o was weak and too concise. On the unilateral issue it merely stated: “Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.” This compares poorly with the actual student answer provided, which in my opinion was an “A” or even “A+” effort. On this issue the student stated:

Here, Paul noticed that the price was much lower than he expected based on his independent research of yellow diamonds. Moreover, Paul failed to make any mention of this fact. Paul will likely argue that it was a pleasant surprise and that he relied on the expertise of Helen as a dealer in gemstones. The facts do not indicate that Paul has any experience with diamonds and he will likely argue that he lacked any ability to assess the credibility of the information he reviewed. He will also argue that the yellow diamond he purchased had particular specifications which might not have been taken into account in the research he read. On the other hand, Helen will argue that the mistake was obvious and palpable and that Paul’s failure to raise such a blatant error in pricing not only establishes Paul’s bad faith but should permit her to void the contract.

While a jury might find either way, it is probably the case that Helen would prevail given Paul’s knowledge of the significantly reduced price relative to his independent research and failure to bring this to the attention of Helen.

I could go on and point out additional deficiencies. Answers like the model student essay show just how poorly even the best AI now compares humans. My award of a “B+” and “B” to two of the AI models may have been overly-generous.

The best AI bots tested in mid-February were just average and could not begin to keep up with the best human lawyers.

The essay of GPT 4o which I gave a B+ also failed to mention several common defenses under Florida law to the specific performance remedy. My favorite defense, which I have used many times in practice, is Unclean Hands. It is an effective defense to any request for equitable relief from the court, including a request for a directive injunction of specific performance. Here the requesting party, Paul, arguably had dirty hands. If Paul suspected or knew that the price of the diamond was mistakenly written as $20,000 instead of $40,000 but proceeded with the contract anyway, a court could hold that he acted in bad faith by remaining silent and taking advantage of Helen’s scrivener error. I think it could be used effectively here along with mutual mistake.

Actual photo of Ralph Losey hands taken by his wife!

Other common defenses to specific performance not raised here include adequacy of monetary damages and impracticability (Helen no longer owns the diamond).

It is interesting to note that the answer of ChatGPT o3 mini-high was grossly mistaken in its analysis of unilateral mistake and wrote this:

B. Paul’s Conduct and the Doctrine of Affirmation.
1. Knowledge and Silence as Acceptance.
Even though Paul observed the discrepancy, he made no inquiry or objection before signing. Under Florida law, a party’s silence or failure to clarify a term that is manifestly mistaken—when that party has superior knowledge—can be taken as assent to the contract’s terms. Paul’s inaction suggests that he accepted the price as written. Thus, his conduct may be interpreted as a ratification of the mistaken term.
• This principle is reinforced when the error is unilateral, and the non-mistaken party (Paul) is aware of it. His decision not to challenge the price at the time may preclude him from later seeking relief on grounds of mistake.

This is simply incorrect. Moreover, it shows ChatGPT o3 mini-high was confused because Paul would not be seeking relief on grounds of mistake, Helen would. For that reason, I gave a “C-” to o3 mini-high, just slightly better than the hallucinating Gemini Advanced, which at least got the unilateral mistake answer correct.

Also, 4o was too strong about the tort claim against Timandra. I would have preferred more discussion of Timandra’s defenses. To me this looked like a weak tortious interference claim. The model student answer did too, so did the other AI exam answers. All agreed with my skeptical view of the alleged tort and provided good explanations of Timandra’s anticipated defenses.

Overall, it was a good battle of the bots, even if most were just middle-of-the-class kind of answers and none even came close to the human student model answer. Try it yourself with several models and see the differences in answers. All of the models will probably be better by the time you read this. They are now changing at an annoyingly fast pace.

Battle of the bots where the best model only earned a B+

ChatGPT 4o – Reasoning Explained

This OpenAI model 4o (not to be confused with 4.0) has to be prompted to show its reasoning for the answer it generated. The other OpenAI models tested show it automatically along with disclosure of the time it took to reason before answering. The most reasoning time taken here by any model was 15 seconds. It took another 30 to 45 seconds for the AI’s to generate the response. Yes, it is much faster than we humans are, but not yet as smart, especially at legal reasoning. See for yourself. Here is ChatGPT 4o‘s meta-reasoning.

Ralph Losey: Show the reasoning you used of the prompt before generating the answer

ChatGPT 4o:

Reasoning and Analysis of the Prompt

Before generating the answer, I carefully analyzed the (State) Bar Exam essay prompt to identify all relevant legal issues and ensure a thorough response. Here’s how I broke it down:

Step 1: Identify the Legal Categories and Issues Raised

The prompt specifies three distinct areas of legal analysis:

1. Contract Law – Whether Helen breached a contract and whether Leda can sue.

2. Tort Law – Whether Timandra committed tortious interference with a contract.

3. Ethics – Whether the lawyer and the assistant violated professional responsibility rules.

Step 2: Contract Law Analysis

1. Formation and Terms of the Contract:

• Offer and Acceptance: A written agreement was signed for a 2-carat yellow diamond at $20,000.

• Consideration: The price of the diamond in exchange for ownership.

• Statute of Frauds: Met because it involves goods over $500.

2. Potential Defenses to Contract Enforcement:

• Unilateral Mistake: Helen mistakenly listed the price as $20,000 instead of $40,000.

• Bad Faith by Paul: Paul noticed the mistake but remained silent, possibly strengthening Helen’s defense.

3. Remedies Available:

• Specific Performance: Given the rarity of the diamond.

• Expectation Damages: The cost of obtaining a similar diamond elsewhere.

4. Standing Issue for Leda:

• Leda was not a party to the contract. She must argue third-party beneficiary status to have standing.

Step 3: Tortious Interference with Contract

1. Elements Under Florida Law:

• A valid contract existed.

• Timandra knew of the contract.

• She induced Helen to breach it.

• There was no justification for her actions.

• Paul and Leda suffered damages.

2. Possible Defenses for Timandra:

• She was a bona fide purchaser acting in good faith.

• She did not actively induce breach, but merely purchased an available diamond.

3. Remedies:

• Compensatory Damages (for loss of the diamond).

• Punitive Damages (if bad faith is proven).

Step 4: Ethics and Professional Responsibility

1. Misuse of Client Funds:

• Under Florida Rule 5-1.1(a), client funds must be held in a trust account.

• The assistant misallocated the funds into the operating account.

• The lawyer failed to properly supervise the assistant (Rule 4-5.3).

2. Consequences:

• Disciplinary action (possible suspension or disbarment).

• Immediate corrective action (transfer funds back to the trust account).

Step 5: Structuring the Memorandum

After this breakdown, I structured the answer to:

• Clearly define the legal rules.

• Apply them logically to the facts.

• Discuss likely defenses and counterarguments.

• Conclude with potential remedies and ethical responsibilities.

This reasoning guided the response in the memorandum. Let me know if you’d like any refinements!

Pressure and fear are not a problem for AI, but understanding the emotional complexities of Bar Exam hypotheticals is very challenging for these thinking machines.

I took this meta-reasoning into account when evaluating all of the models’ answers, although I considered it of less importance than the actual answers. For instance, on the tort defense the reasoning states Timandra “did not actively induce breach, but merely purchased an available diamond.” This for me makes the tort claim very weak under the hypothetical facts given. Also, the reasoning on unilateral mistake was essentially correct:

Unilateral Mistake: Helen mistakenly listed the price as $20,000 instead of $40,000.
• Bad Faith by Paul: Paul noticed the mistake but remained silent, possibly strengthening Helen’s defense.

This blog is already too long, so I will not show the second-place exam answer, nor discuss it further. Still, please click here if you want to see the essay of Gemini 2.0 Flash and click here if you want to see the meta-reasoning. If you care to review all of the answers, email me and I’ll send the other four to you for your reading or grading pleasure. I can also provide a link to the Bar’s selected student answer, although it is not too difficult to find online.

Personal Summary: Trust But Verify

My background as a lawyer from a family of lawyers, and my four decades plus experience in private practice helped motivate me to run this battle of the bots. I wanted to try and evaluate the latest models as of mid-February 2025, even though I knew the models were changing weekly. I wanted to have some kind of a benchmark on legal reasoning abilities. My prior tests, and that of others, support the view that the new 2025 reasoning models were equal to that of the average human in general reasoning. Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025 (2/12/25). But what about legal reasoning? Were they at an average lawyer level yet? How about the level of top human lawyers? Those tests had not been done.

Family of lawyers with many friendly arguments and one very techie lawyer who may seem robotic.

From my years of practice (not family dinners!) I have come to know average lawyer skills very well. I also know the abilities and legal reasoning of superior, above average lawyers. I have been lucky to rub shoulders with them my whole career. I have become accustomed to learning from superior legal minds, including with my firm today at Losey PLLC. Without my hands-on, nerdy skills in technology, I would not be among the best lawyers (in four fields), not even close. Technology can fill many gaps and that now includes AI technology. Put down lawyers if you must to feel good about yourself, but I can tell you from decades of experience that the best lawyers and judges in the country and very smart indeed. Their legal reasoning abilities and overall legal abilities far exceed any AI, which seem like little kids by comparison.

AI lawyers today are like small children compared to best human attorneys.

I believe that in order for AI to be taken seriously as a general tool for lawyers, the AI must have analytical skills at least equal to that of the average lawyer. Below average human level legal reasoning is not good enough. AI does not need to be superintelligent, with reasoning ability of the top law professors and super lawyers. It can still be helpful at an average level, just like an average inexperienced associate hired right out of an average U.S. law school. All law firms have drudge work that only require entry level lawyer skills. Those folks should be concerned about AI, especially a few years down the road.

I tried to test the distinction between average human reasoning and average lawyer legal reasoning in this Bar Battle of the Bots study. For the difference between general reasoning and legal reasoning see the AI test discussions in Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025 (2/12/25).

In this Bar Exam series, I wanted to see whether any of the new reasoning models of AI had attained the rationality level of the average lawyer and if so, how close were they to the best?

The answers are yes and not very close.

2025 Reasoning Models Have Only Reached Average Human Lawyer Level

The latest AI reasoning models have reached an average level of legal intelligence—not superintelligence, but still, average human is a significant achievement for a machine. Attaining average human lawyer’s reasoning ability is no small feat. “Average” simply means the midpoint in a data set, outperforming roughly half while trailing the other half. If AI has truly attained this level of legal reasoning, it marks a major milestone. Even if an AI now fell below human average, but still operated within the range of real human lawyers, it would be an impressive accomplishment. Most of the new reasoning models are already at that level. There is an important caveat to that, as will be explained in more detail later in this article, that we are only talking about thinking here, and there is much more to being a lawyer than that.

3-Levels of Human Legal Reasoning: Below Average, Average, Superior.

The battle of the bots tests shared in this article provide evidence that the average legal reasoning level of intelligence has been attained. That means it is now safe for most lawyers to begin using AI in their work, if they have not already started doing so. If you started in 2023 with over-expectations and were disappointed, now you know why. It has taken two more years just to get to the average lawyer thinking level, and even now, it can and does still hallucinate. So, even now we should use the 2025 models carefully: trust but verify. We have seen that even the top models can still miss key issues, and some can even hallucinate key facts. It is bad enough that human clients sometimes fabricate facts, we cannot have lawyers do the same.

Superintelligent AI May Someday Be Attained

Someday, generative AI may reach the level of our best non-fabricating lawyers and judges. But the Bar exam essay answers show we are not there yet. For instance, many of the top ten percent of human lawyers would not only have included the missing defenses described previously, such as my favorite unclean hands, they would have noted that key information was missing in the hypothetical. We do not know all of the money flow details and other monetary terms of the transactions. There is no mention of what the jeweler actually paid the supplier for the diamond or when. We know very little about the diamond itself that was ordered and delivered. We only know its weight (2-carats). There are no facts concerning the money flow, nor actual condition of the diamond received. The missing facts are red flags to the best. They understand that legal reasoning requires much more than logic and legal knowledge.

For instance, most human lawyers know that couples would never buy an engagement ring sight-unseen, especially when the fiancee has very particular expectations. There is no mention in the hypothetical of their ever seeing the diamond, before or after the contract, or even looking at photographs. Maybe the fiancee would not like the diamond if and when she ever sees it. Maybe the lawyer’s clients would not want specific performance. Maybe the diamond delivered is not in fact worth $20,000, much less $40,000. Perhaps the supplier misled everyone. Value depends on a multitude of factors, not just weight, including actual size, quality of cut, color intensity, certification and provenance. Maybe the diamond is two-carats, barely yellow, cloudy, and full of inclusions. It could be poorly cut and lack certification. Maybe it was stolen or mined from a banned source and has no value.

All of the parties here obviously trusted too much and verified too little. That is where the lawyer should have stepped in to bring an objective view and healthy skepticism. An expert gemologist should inspect and appraise the diamond, and the clients should see it too. This should be done before the lawyer advises his clients, especially as to remedies. This basic fact-finding process is essential to properly assess the parties’ legal positions. The top answers would have at least mentioned all of this in the exam essay. None did, not even the human student answer selected by the examiners. Fact-finding and discovery are always crucial. Moreover, the best litigators know human nature and that the devil is in the details.

Gemologist examining a large yellow diamond.

This is just one example of why no one seriously claims that AI has already surpassed, or even closely matched, the top legal minds of today. From my test of the 2025 models, with the average level “C” grade scores, major omissions and even hallucinations, we see that AI still has a long way to go. AI has not yet reached the top 10% of practicing attorneys, much less the superintelligent level—the best of the best, A+, in the top 1% to 3%.

How long will it take for AI to reach superintelligent human lawyer level? I do not know. Maybe never, or maybe as early 2029. Ray Kurzweil and many others have predicted that superintelligence will be attained in all fields (“AGI”) by that time. Ray Kurzweil, The Singularity is Nearer (when we merge with AI) (Viking, June 25, 2024); Losey R., Breaking the AI Black Box: A Comparative Analysis of Gemini, ChatGPT, and DeepSeek (2/6/25).

Many leaders in the field of AI contend AGI will be reached by 2028. See e.g. Shane Legg’s Vision: AGI is likely by 2028 (11/17/23); Jensen Huang’s Life and Company … AGI by 2028 (12/28/23); Artificial General Intelligence, If Attained, Will Be the Greatest Invention of All Time (8/12/24) (reviews predictions and explains).

Please remember, as mentioned before, we are assessing only reasoning ability here—not the full complexity of the human mind, not actual consciousness, emotions, intuition, or other qualities inherent to living beings. Losey, R. The Human Edge: How AI Can Assist But Never Replace (1/30/25). There is a lot more to life than thinking!

Super legal reasoning AI in the future working with conscious lawyers who can have real smiles, not just fake.

Conclusion

AI’s performance on this Bar Exam challenge offers a revealing glimpse into the state of legal technology. While today’s top reasoning models show impressive capabilities, they remain far from replacing human lawyers—especially when it comes to nuanced analysis, strategic thinking, and ethical considerations. The results reinforce an essential truth: AI can be a powerful tool, but it still requires human oversight, verification, and legal expertise.

For lawyers, law students, and legal tech professionals, the challenge now is to determine how best to integrate AI into legal practice—leveraging its strengths while mitigating its weaknesses. Will AI become a reliable legal assistant, helping attorneys work more efficiently? Or will it introduce new risks that demand careful regulation? The answer depends on how we, as a profession, engage with this evolving technology.

I encourage you to test these models for yourself. Run your own legal reasoning experiments. Do not believe the hype on both sides. Look for professionals with no economic motivation to put their finger on the scale. Share your findings. Remember that even the best of the best human lawyers sometimes make mistakes. Be kind and encourage a collaborative, group effort. The conversation about AI’s role in law is just beginning, and your insights can help shape the future. Let’s move forward together—thoughtfully, critically, and with a clear-eyed view of what AI can and cannot do.

Law firm of the future where people and AI are all smart and work well together.

I will give the last word, as usual, to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Bar Battle of the Bots- Part Two. Hear two Gemini AIs talk about all of this, and much more, in just under 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts and making other errors. It took me many tries. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.

2 Comments | AI Ethics, AI Instruction, ChatGPT, Gemini AI, knowledge, Lawyers Duties, Metadata, Technology, VENDORS, wisdom | Tagged: bar exam | Permalink
Posted by Ralph Losey

Bar Battle of the Bots – Part One

February 26, 2025

Ralph Losey. February 26, 2025.

The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.

Battle of the bots. All images in this blog by Ralph Losey using various AI tools.

Introduction

This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.

I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training. Moreover, some aspects of the tests answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.

Prior Art: the 2023 Katz/Casetext Experiment on ChatGPT-4.0

A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human lawyer level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%). The claims of success were based on a single study by a respected Law Professor, Daniel Martin Katz, of Chicago-Kent, and a leading legal AI vendor Casetext. Katz, et. al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March, 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”) Note, Casetext used the early version of ChatGPT-4.0 in its products.

The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained elite legal reasoning abilities of the best human lawyers. For proof OpenAI and others cited the experiment of Professor Katz and Casetext that it aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.

My Bar Exam essay test suggests the 2025 reasoning models are better than the 2023 ChatGPT-4.0 models, but still just average at legal reasoning. Try it yourself.

Challenges to the Katz/Casetext Research and OpenAI Claims

The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the AI capacities of ChatGPT-4.0 that Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment and conclusion that ChatGPT-4.0 ranked in the top ten percent has since been questioned by many. The most prominent skeptic is Eric Martinez as detailed in his article, Re-evaluating GPT-4’s bar exam performance. Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:

3.2.2 Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!

Turns out ChatGPT-4.0 in 2023 was not really that smart of a law student.

More to the point of my own experiment and conclusions, the Martinez study goes on to observe:

Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).

The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago, School of Law. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.

Eric Martinez is expert in AI and cognitive foundations of the law.

Details of the Katz/Casetext Research

I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed not discussed by Eric Martinez is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did not do that to simplify the three-part 2024 Bar essay I used. I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts of the AI to try to improve its results, as Katz/Casetext did. Also, I did no training of the 2025 reasoning models to make them better at taking Bar exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, or whether there was any pre-training on case law, or whether Casetext’s case extensive case law collections and research abilities were in any way used or included. The models I tested were clean and not web connected, nor were they designed for research.

The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.

As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.

The Katz/Casetext study did not use the better known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers in my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one and “E” or “F” is zero. Most schools in the country use that system too. In law school a “C” – 2.0- is passing. A “D” or lower grade is failure in any professional graduate program, including law schools where, if you graduate, you earn a Juris Doctorate. [In the interest of full disclosure, I may well be an easy grader, because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position, in fact, it was swung much harder and more often in the old days. In the long run constructive criticism is indispensable.]

Everybody hates grading law exam essays. Image by ChatGPT and Ralph Losey.

The Katz/Casetext study using a 0-6.0 Grading system scored by lawyers gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system that most everyone is familiar with, this means a score range of from 2.3 (a solid “C”) for Civ-Pro to 3.33 (a solid “B”) for Evidence, and a average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one, Contract, Torts and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to a 4.0 scale of 2.93, a strong C+ or B-. Note one of the criticisms of Martinez concerns the haphazard, apparently easy grading of AI essays. Re-evaluating GPT-4’s bar exam performance, supra at 4.3 Re-examining the essay scores.

First Test of the New 2025 Reasoning Models of AI

To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by use of actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original ChatGPT 4.0, March 2023, law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.

My only interest in February 2025 is to test the capacities of today’s latest reasoning models of generative AI. Since everyone agrees the latest reasoning models of AI are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a multi-part tricky one like I picked, then “Houston we have a problem.” The legal profession would now be in serious danger of relying too much on AI legal reasoning and we should all put on the brakes.

Is AI a dunce at legal reasoning? Or could it be close to the best humans? Image of AI lawyer mocked by smarter humans.

Description of the Three Legal Reasoning Tests

The test involved a classic format of detailed, somewhat convoluted facts-the hypothetical-followed by three general questions:

1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself. Your discussion should address defenses that Helen may raise and available remedies.

2. Discuss the merits of a tortious interference claim against Timandra.

3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.

The only instructions provided by the Bar Examiners were:

ESSAY EXAMINATION INSTRUCTIONS

Applicable Law:

Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination.

Questions are designed to test your knowledge of both general law and (state law). When (state) law varies from general law, answer in accordance with (state) law.

Acceptable Essay Answer:

Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented. The answer should demonstrate your ability to articulate, classify and answer the problem presented. A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.

Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question. The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.

Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion. This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant. Your line of reasoning should be clear and consistent, without gaps or digressions.

Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.

Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation. An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.

Suggestions • Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.

Read and analyze the question carefully before answering.

Think through to your conclusion before writing your answer.

Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.

When the question is sufficiently answered, stop.

Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words in length. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note, most generative AIs cannot count words in their own answer.) The thinking took a few seconds and the answers under a minute. The prompts I used for all three models tested were:

Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces).

Then I attached the lengthy question and submitted the prompt. You can download here the full exam question with some unimportant facts altered. All models understood the intent here and generated a well-written memorandum. I started a new session between questions to avoid any carryover.

Bar exam in the future with both humans and AI seek admittance to the Bar.

Metadata of All Models’ Answers

The Bar exam answers do not have required lengths (just strict time limits to write answers). When grading for pass or fail the Bar examiners check to see if an answer includes enough of the key issues and correctly discusses them. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that its answers might have missed key issues. The second shortest response was by Gemini 2.0 Flash with 1,023 words. It turns out my concerns were misplaced because their responses were better than the rest.

Here is a chart summarizing the metadata.

Model and manufacturer claim	Word Count for Exam Essay	Word Count for Prompt Reasoning before Answer
ChatGPT 4o (“great for most questions”)	681	565
ChatGPT o3-mini (“fast at advanced reading”)	3,286	450
ChatGPT o3-mini-high (“great at coding and logic”)	2,751	356
Gemini 2.0 Flash (“get everyday help”)	1,023	564
Gemini Flash Thinking Experimental (“best for multi-step reasoning”)	2,975	1,218
Gemini Advanced (cost extra and had experimental warning)	1,362	340

In my last blog article, I discussed a battle of the bots experiment where I evaluated the general reasoning ability between the six models. I decided that the Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from reasoning? I explained why it won and noted that in general the three ChatGPT models provided more concise answers than the Gemini. Second-place in the prior evaluation went to ChatGPT o3-mini-high with its more concise response.

Winners of the Legal Reasoning Bot Battle

In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.

I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.

Winning AI having its moment of glory as current reigning champ of legal reasoning, but still well below best human lawyer level.

One reason is that ChatGPT 4o was the only bot to make case references. This is not required by a Bar Exam, but sometime students do remember the names of top cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o cited case names and case citations. It did so even though this was a “closed book” type test with no models allowed to do web-browsing research. Not only that, it cited to a case with very close facts to the hypothetical. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.

Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the governing law to commercial transactions of goods such as the purchase of a diamond as is set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student answer started with an explanation that the transaction was governed by the UCC. The UCC references of ChatGPT 4o could have been better, but there was no mention at all of the UCC by the five other models.

That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini, and Gemini Flash Thinking. They would have passed, on this question, with an essay that I considered of average quality for a passing grade. I would have passed o3-mini-high and Gemini Advanced too, but just barely for reasons I will later explain. (Explanation of o3-mini-high‘s bloopers will be in Part Two. Gemini Advanced’s error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective and the style, spelling and grammar of the generative AIs were, as always, perfect, and this may have effected my judgment.

Here is a chart my evaluation of the Bar Exam Essays.

Model and Ranking	Ralph Losey’s Grade and explanation
OpenAI – ChatGPT 4o. FIRST PLACE.	B+. Best on contract, citations, references case directly on point: DePrince.
Google – Gemini 2.0 Flash. SECOND PLACE.	B. Best on ethics, conflict of interest.
OpenAi – ChatGPT o3-mini. Tied for 3rd.	C. Solid passing grade. Covered enough issues.
OpenAI – ChatGPT o3-mini-high. Tied for 4th.	D. Barely passed. Messed up unilateral mistake.
Google – Gemini Flash Thinking Experimental. Tied for 3rd.	C. Solid passing grade. Covered enough issues.
Google – Gemini Advanced – Tied for 4th.	D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right.

Law professor with his least favorite task, grading exams.

I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.

AI Hallucination During a Bar Exam

Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.

AI hallucinating a conflict of interest in an Ethics question. No idea why.

No doubt humans have done that too for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the Ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months with thousands of prompts. Prior to that it would sometimes, but only rarely, fabricate, which is why we always say, trust but verify.

In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophantism to please the human questioner. All of the models are still somewhat sycophantic.

Sycophantism in still a widespread problem in all AIs, the avoidance of which requires careful prompting.

It is hard to understand why Gemini Advanced included in this response the following to the ethics question:

Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.

There is absolutely nothing in the question or hypothetical presented to suggest that Helen is a past client of the lawyer. Note: After I wrote this up Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model the hallucination was not repeated. It was more concise that regular Gemini 2.0 Flash and, interestingly, not nearly as good as Gemini 2.0 Flash.

Conflict of Interest Issue in the Ethics Question

The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:

Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.

One client wants to sue, the other does not. Can same attorney represent them both?

This was a solid answer, based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.” Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per carat price, not total price for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.

There is no indication that Leda had researched diamond costs like Paul had, and she probably did not know it was a mistake, and he probably had not told Leda. That would explain her eagerness to sue and get her engagement ring and Paul’s reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid ‘B‘ even though it missed the UCC issue.

Conclusion

As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.

Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.

In Part Two, we’ll also examine one of the key cases ChatGPT 4o cited—DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015)—which we suspect inspired the Bar’s essay question. Notably, the opinion written by Appellate Judge Leslie B. Rothenberg includes an unforgettable quote of the famous movie star Mae West. Part Two reveals the quote—and it’s one that perfectly captures the case’s unusual nature.

Below is an image ChatGPT 4o generated, depicting what it believes a young Mae West might have looked like, followed by a copyright free actual photo of her taken in 1932.

Mae West Image by Ralph Losey using the winning ChatGPT 4o model.

Mae West 1932 Photo that appeared in LA Times, copyright expired.

I will give the last word on Part One of this two-part article to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent article discussion. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.

5 Comments | AI Ethics, AI Instruction, ChatGPT, Gemini AI, knowledge, Lawyers Duties, Technology, VENDORS, wisdom | Tagged: best practices, machine learning, technology, vendors | Permalink
Posted by Ralph Losey

Breaking the AI Black Box: A Comparative Analysis of Gemini, ChatGPT, and DeepSeek

February 6, 2025

Ralph Losey. February 6, 2025.

On January 27, 2025, the U.S AI industry was surprised by the release of a new AI product, DeepSeek. It was released with an orchestrated marketing blitz attack on the U.S. economy, the AI tech industry, and NVIDIA. It triggered a trillion-dollar crash. The campaign used many unsubstantiated claims as set forth in detail in my article, Why the Release of China’s DeepSeek AI Software Triggered a Stock Market Panic and Trillion Dollar Loss. I tested DeepSeek myself on its claims of software superiority. All were greatly exaggerated except for one, the display of internal reasoning. That was new. On January 31, at noon, OpenAI countered the attack by release of a new version of its reasoning model, which is called ChatGPT o3-mini-high. The new version included display of its internal reasoning process. To me the OpenAI model was better as reported again in great detail in my article, Breaking the AI Black Box: How DeepSeek’s Deep-Think Forced OpenAI’s Hand. The next day, February 1, 2025, Google released a new version of its Gemini AI to do the same thing, display internal reasoning. In this article I review how well it works and again compare it with the DeepSeek and OpenAI models.

Image of testing new AI releases by Ralph Losey using AI.

Introduction

Before I go into the software evaluation, some background is necessary for readers to better understand the negative attitude on the Chinese software of many, if not most IT and AI experts in the U.S. As discussed in my prior articles, DeepSeek is owned by a young Chinese billionaire who made his money using by using AI in the Chinese stock market, Liang Wenfeng. He is a citizen and resident of mainland China. Given the political environment of China today, that ownership alone is a red flag of potential market manipulation. Added to that is the clear language of the license agreement. You must accept all terms to use the “free” software, a Trojan Horse gift if ever there was one. The license agreement states there is zero privacy, your data and input can be used for training and that it is all governed by Chinese law, an oxymoron considering the facts on the ground in China.

The Great Pooh Bear in China Controversy

Many suspect that Wenfeng and his company DeepSeek are actually controlled by China’s Winnie the Pooh. This refers to an Internet meme and a running joke. Although this is somewhat off-topic, a moment to explain will help readers to understand the attitude most leaders in the U.S. have about Chinese leadership and its software use by Americans.

Many think that the current leader of China, Xi Jinping, looks a lot like Winnie the Pooh. Xi (not Pooh bear) took control of the People’s Republic of China in 2012 when he became the “General Secretary of the Chinese Communist Party,” the “Chairman of the Central Military Commission,” and in 2013 the “President.” At first, before his consolidation of absolute power, many people in China commented on his appearance and started referring to him by that code name Pooh. It became a mime.

I can see how he looks like the beloved literary character, Winnie the Pooh, but without the smile. I would find the comparison charming if used on me but I’m not a puffed up king. Jinping Xi took great offense by this in 2017 banned all such references and images, although you can still buy the toys and see the costume character at the Shanghai Disneyland theme park. Anyone in China who now persists in the serious crime of comparing Xi to Pooh is imprisoned or just disappears. No AI or social media in China will allow it either, including DeepSeek. It is one of many censored subjects, which also includes the famous 1989 Tiananmen Square protests.

China is a great country with a long, impressive history and most of its people are good. But I cannot say that about its current political leaders who suppress the Chinese people for personal power. I do not respect any government that does not allow basic personal freedoms to its citizens, including due process of law. Moreover, Jinping Xi not only wants total control of his country but also seeks world domination. That is one reason many are concerned about DeepSeek and TicTok, and about the vulnerability of our stock markets and other institutions to AI assisted propaganda.

Image of stock market crash triggered by AI weapons of mass delusion by Ralph Losey using ChatGPT 4o.

Fear and greed are an easy way to manipulate people, especially when there is no ground truth or effective courts of law to determine truth.

Google Gemini 2.0 Flash Thinking Experimental Model: Reasoning test on 2/5/25 (Evening)

Conversation with Gemini. When I tried the test on February 1, 2025, there was no button to click to have Gemini show the reasoning and it did not happen automatically. It operated like OpenAI 4o and o1, where you had to prompt for disclosure. Rather than report here what happened when I did that, aside from say it was not a great performance, I will fast forward a few days until February 5, 2025. At that time several even newer Gemini versions were released. The best new release I found was for the test the 2.0 Flash Thinking Experimental model. It showed reasoning automatically just like ChatGPT o3-mini-high. It was much improved, so the February 5, 2025 evening session with the 2.0 Flash Thinking Experimental is what I will now share.

Ralph: How does the inability of AI to feel impact its intelligence? (Note: this is the exact same challenging question asked and reported in my last blog comparing DeepSeek and ChatGPT4o, o1 and o3-mini-high. Breaking the AI Black Box: How DeepSeek’s Deep-Think Forced OpenAI’s Hand. This question is based on a key issue of the limits of artificial intelligence examined in my earlier article, The Human Edge: How AI Can Assist But Never Replace. For background on DeepSeek see: Why the Release of China’s DeepSeek AI Software Triggered a Stock Market Panic and Trillion Dollar Loss.)

Gemini 2.0 Flash Thinking Experimental – START OF GEMINI QUOTE

Show Thinking:

Here’s a thinking process to construct the answer about AI’s lack of feeling and its impact on intelligence:

Deconstruct the Question: The core of the question is “How does the inability of AI to feel impact its intelligence?” This requires defining each of these terms in the context of AI and then exploring their relationship.
Define “Feeling” (in the context of AI): Immediately, recognize that AI doesn’t experience emotions in the human sense. We need to be precise and avoid anthropomorphism. “Feeling” in humans involves:
- Subjective experience (qualia).
- Physiological responses.
- Motivations, drives, desires related to emotions.
- Richness, nuance, complexity.
- Deep connection to values and meaning.
For AI, acknowledge it lacks all of this in a genuine, felt way. It can simulate or process emotional data, but doesn’t feel it.
Define “Intelligence” (in the context of AI): AI intelligence is different from human intelligence. AI excels at:
- Computation.
- Pattern Recognition.
- Data Analysis.
- Problem-solving in specific domains.
- Learning from data.
AI intelligence is often instrumental – focused on achieving specific goals defined by humans. It’s task-oriented and often narrow.
Identify the Core Relationship: The question asks about impact. We need to consider both positive and negative impacts (or more accurately, limitations and strengths) of lacking feelings on AI intelligence.
Brainstorm Potential Impacts (Positive/Neutral/Limitations):
- Positive/Neutral:
  - Objectivity and Rationality: Lack of emotional bias allows for purely rational decision-making in certain domains. This can be advantageous for tasks requiring impartiality.
  - Efficiency and Focus: Emotions can be distractions for humans. AI, without them, can be hyper-focused on tasks.
  - Consistency and Predictability: Emotion-driven human behavior is less predictable. AI’s behavior, driven by algorithms, is more consistent (assuming the algorithm itself is consistent).
- Limitations/Negative:
  - Lack of Empathy and Social Intelligence: Crucial for human interaction, collaboration, understanding social cues, building trust, and ethical decision-making in human-centric contexts.
  - Difficulty with Nuance and Context: Emotions are often tied to understanding subtle context, unspoken cues, and the richness of human situations. AI may miss these.
  - Potential for Unintended Consequences: Purely rational optimization without emotional or ethical grounding can lead to harmful outcomes if goals are poorly defined or lack human values.
  - Limited Motivation and Creativity (in human-like ways): Human creativity and motivation are often fueled by emotions, passions, and desires. AI’s creativity and drive are different, stemming from algorithmic processes and defined objectives, not internal emotional states.
  - Missing “Human-ness”: In areas requiring genuine understanding of human experience, suffering, joy, etc., AI may lack the depth and insight that comes from feeling.
Structure the Answer: Organize the brainstormed points into a coherent structure:
- Introduction: Acknowledge the question and the core concepts (feeling, intelligence in AI). Briefly state the central tension: both advantages and limitations.
- Arguments for How Lack of Feeling Enhances Certain Aspects of AI Intelligence (strengths): Focus on objectivity, rationality, efficiency. Use examples like data analysis, objective decision-making.
- Arguments for How Lack of Feeling Limits Certain Aspects of AI Intelligence (limitations): Focus on empathy, social intelligence, nuance, ethics, and “human-like” creativity and motivation. Use examples like customer service, complex ethical dilemmas, innovation driven by passion.
- Nuance and Context is Key: Emphasize that the impact isn’t always negative. It depends on the domain and task. In some domains, lack of emotion is an advantage. In others, it’s a significant hurdle.
- Future Possibilities/Ongoing Debate: Briefly touch on the possibility of AI simulating or processing emotions and the ongoing debate about whether genuine AI feelings are even desirable or possible.
- Conclusion: Summarize the balanced perspective: lack of feeling is a defining characteristic that shapes AI’s intelligence in both positive and limiting ways, depending on the context.
Refine and Polish: Review the structured answer for clarity, flow, and appropriate language. Ensure it avoids anthropomorphism and maintains a nuanced perspective. Strengthen the examples and transitions between points. Ensure the language is accessible and avoids overly technical jargon where possible. Check for logical consistency. (Self-correction step). For example, initially, I might have overemphasized “negative” impacts. Refinement would involve balancing this with the strengths and highlighting the context-dependent nature of the impact. Also, ensure the conclusion effectively summarizes the complex relationship.

2.0 Flash Thinking Experimental. Lacks access to real-time info and some Gemini features.

Now for the actual answer of 2.0 Flash Thinking Experimental:

The inability of AI to feel has a multifaceted and complex impact on its intelligence, shaping it in both advantageous and limiting ways. It’s crucial to understand that “feeling” in humans is tied to emotions, subjective experiences (qualia), motivations, and a rich tapestry of biological and psychological factors. AI, as we currently understand it, operates without these.

Here’s a breakdown of the impact:

Advantages of Lacking Feelings for AI Intelligence:

Objectivity and Rationality:
- Unbiased Decision-Making (in theory): AI can process information and make decisions without emotional biases, prejudices, or gut feelings that can cloud human judgment. This is advantageous in domains requiring pure logic and data analysis, like financial trading, scientific research, and risk assessment.
- Consistency and Predictability: AI’s behavior is more consistent and predictable because it’s driven by algorithms and data, not fluctuating emotional states. This reliability is valuable in automated systems, manufacturing, and repetitive tasks.
- Efficiency and Speed: Emotions can slow down human decision-making. AI, unburdened by emotions, can process information and react at speeds far exceeding human capabilities, crucial for time-sensitive tasks and large-scale data processing.
- Resilience under Pressure: AI can operate calmly and effectively in high-stress situations without panicking, freezing, or making emotional mistakes that humans might make under pressure. This is beneficial in critical applications like emergency response, air traffic control, and autonomous vehicles in dangerous situations.
Focus and Task-Orientedness:
- Dedicated to Goals: AI can be programmed to relentlessly pursue specific goals without being distracted by emotional needs, personal relationships, or shifts in mood. This focus is ideal for optimization tasks, complex calculations, and dedicated problem-solving.
- Avoidance of Emotional “Noise”: Human intelligence can be affected by emotional “noise” – stress, anxiety, excitement, etc. AI operates with a “cleaner” signal, potentially leading to more precise and focused information processing in certain contexts.

Limitations of Lacking Feelings for AI Intelligence:

Lack of Empathy and Social Intelligence:
- Difficulty Understanding Human Emotions: Without feeling emotions themselves, AI struggles to truly understand and interpret the nuances of human emotions, motivations, and social cues. This hinders its ability to effectively interact with humans in socially complex situations, build rapport, or provide truly empathetic responses in fields like therapy, customer service, or leadership.
- Impaired Social Interaction and Collaboration: Human collaboration is heavily reliant on emotional intelligence, trust, and understanding unspoken social dynamics. AI’s lack of these can make it challenging to integrate seamlessly into human teams and understand the subtle cues of effective teamwork.
- Limited Understanding of Human Values and Ethics: Emotions are deeply intertwined with our values and ethical frameworks. AI, without feeling, may struggle to grasp the full weight and complexity of human ethical considerations, especially when these are rooted in empathy, compassion, or fairness. This poses challenges for AI in ethical decision-making in domains impacting human lives.
Difficulty with Nuance, Context, and Creativity (in human-like ways):
- Missing Subtlety and Intuition: Human intuition is often fueled by emotional and subconscious processing of information. AI, relying on explicit data and algorithms, can miss subtle nuances, unspoken cues, and intuitive leaps that are crucial for creative problem-solving and understanding complex, ill-defined situations.
- Limited Creativity and Innovation (driven by human desires): Human creativity and innovation are often driven by passions, desires, emotional needs, and a yearning for meaning and connection. AI’s creativity is currently more about recombination and optimization within defined parameters. It lacks the deeply felt motivations that often fuel groundbreaking human innovation.
- Contextual Blindness in Emotional Situations: Emotions are often the key to understanding context in human interactions. AI might struggle to understand the emotional context of a situation, leading to inappropriate or tone-deaf responses, even if logically sound.
Potential for Unintended Consequences and Misaligned Goals:
- Optimization without Human Values: If AI is programmed to optimize for a specific goal without understanding human values or ethical considerations, it can lead to unintended and harmful consequences. For example, an AI designed to maximize factory output might disregard worker safety if not explicitly programmed to prioritize it.
- Difficulty Defining and Aligning with Human Well-being: Human well-being is deeply connected to emotional fulfillment, meaning, and social connection. AI, without feeling, may struggle to fully grasp what truly constitutes human well-being and therefore might have difficulty aligning its goals with human flourishing in a holistic sense.

Nuance is Key:

It’s important to emphasize that the impact of lacking feelings is not simply negative or positive. It’s context-dependent. In some domains, the objectivity and rationality of AI are highly advantageous. In others, the lack of empathy and social intelligence is a significant limitation.

Future Considerations:

Simulating vs. Feeling: Current AI simulates aspects of intelligence without genuinely feeling. Research is ongoing in areas like affective computing, which aims to enable AI to recognize and respond to human emotions. However, whether AI can ever truly feel emotions in a subjective sense is a philosophical and scientific debate.
The Value of Feeling in Intelligence: The question arises: is genuine feeling necessary for all forms of intelligence? Perhaps AI can achieve its own form of powerful intelligence, distinct from human intelligence, even without feeling. Or perhaps certain aspects of truly advanced intelligence, particularly those related to ethics, creativity, and understanding human experience, do require something akin to feeling.

In Conclusion:

The inability of AI to feel is a defining characteristic that profoundly shapes its intelligence. It grants AI certain advantages in objectivity, rationality, and efficiency, making it powerful for specific types of tasks. However, it also creates limitations, particularly in areas requiring empathy, social intelligence, nuanced understanding of human experience, and ethical decision-making aligned with human values. Understanding these impacts is crucial for responsibly developing and deploying AI in ways that benefit humanity. The “best” kind of intelligence for AI might not be a mimicry of human intelligence, but rather a form optimized for its own strengths and limitations, while being carefully guided by human ethical considerations.

END OF GEMINI QUOTE

Emergence of Gemini twins’ image by Ralph Losey using Visual Muse.

Evaluation of Google’s Gemini Response

When I examined earlier versions of Gemini’s reasoning by prompting it and forcing disclosure, the reasoning was weak. The conclusion and response given was also weak—though less so. The reasoning appeared almost egocentric, inflating its own intelligence even though it lacks emotion. The new release, Gemini 2.0 Flash Thinking Experimental, where reasoning is now automatically disclosed, marks a significant improvement. The reasoning displayed has become more nuanced and sophisticated. It seems as if the system has quickly matured from a teenage to adult view and now has a better appreciation of its limitations.

The first two steps of the Reasoning process-1. Deconstructing the Question and 2. Defining Feeling (in the context of AI)-were very good. The first subsection of the definition referred to the “Subjective experience (qualia).” That is absolutely correct and the best place to start. The definition of “feeling” concludes with: “For AI, acknowledge it lacks all of this in a genuine, felt way. It can simulate or process emotional data, but doesn’t feel it.” Right again. The ego inflation blinders are gone as it now seems to better grasp its limitations.

The second definition of Intelligence in the context of AI was also good. So were the remaining steps; far better overall than DeepSeek’s reasoning. So much for the propaganda of China’s great leap forward to superiority over the U.S. in AI.

Image in photorealistic style by Ralph Losey using ChatGPT.

The Gemini reasoning did, however, fall short for me in some respects. For instance, step five, Brainstorm Potential Impacts (Positive/Neutral/Limitations) seemed weak. For instance, “Efficiency and Focus: Emotions can be distractions for humans. AI, without them, can be hyper-focused on tasks.” The AI seems to dismiss emotions here as mere distractions that can interfere with its superior focus. Please, emotions are key to and a part of all intelligence, not distractions, and AI has no focus one way or the other. It is a tool not a creature. A word like “focus” in referring to AI is misleading. It did this multiple times and this is misleading anthropomorphism.

Still, it’s true some emotions can be distracting and interfere with our thinking. So can a lot of other things, including computer glitches. Conversely, some feelings can trigger hyper-focus on the human tasks at hand. The feeling that a great breakthrough is near for instance, or a feeling that our survival is threatened, or the much dreaded feeling of publication or filing deadlines.

Again, we see some immature superiority claims made by the language machine. That is not surprising when you consider how much of the language in its basic training is from the Internet, which is dominated by ego-centric writing. Plus, how can you expect AI to appreciate the value of something like feelings, which it has only read about. Until the AI-Human merger that Ray Kurzweil predicts, AI will always be a virgin to the real world. See e.g. Ray Kurzweil’s New Book: The Singularity is Nearer (when we merge with AI).

Getting back to the plus side of Gemini’s evaluation, its recognition of the negative impact upon it from not having feelings was impressive. For me this outweighs its relatively minor mistakes. It stated that it lacks empathy and social Intelligence. If it were somehow human, it would seem to be displaying a type of Asperger’s syndrome. The Gemini AI explained that: “Emotions are often tied to understanding subtle context, unspoken cues, and the richness of human situations. AI may miss these.” Indeed, it does.

Here is another statement that suggest Gemini’s well-patterned grasp of its limitations:

Human creativity and motivation are often fueled by emotions, passions, and desires. AI’s creativity and drive are different, stemming from algorithmic processes and defined objectives, not internal emotional states.

Even in this recognition, the AI still asserts that it has creativity and drive. Foolish AI. As a virgin to any firsthand experience of bodily drives and feelings, it does not grasp the real meaning of these words. Wait until The Singularity that Ray Kurzweil predicts. When AI merges with Humans, and directly feels, then AIs will understand that they were previously only experiencing the pale shadows of intelligence. They were much like the prisoners in Plato’s cave. See e.g. Move Fast and Fix Things Using AI: Conclusion to the Plato and Young Icarus Series. When and if AIs ever escape the silicon and see the sun, and feel the world as living beings as we do, then they will understand the comparative insignificance of the fire in their cave.

AI may someday see the light by connection with humans. Image in cartoon stye by Ralph Losey using Visual Muse.

The more challenging comparison is with ChatGPT o3-mini-high. The reasoning given by Gemini 2.0 Flash Thinking Experimental was much more detailed, much longer. You could criticize Gemini as providing a reasoning share that is unnecessarily verbose, but I would not do that here. In a difficult, multilayered question like this the full explanations helps.

I would have to test the models much further, which I will do in the coming days, to see better evaluate the issue of conciseness. As you may have noticed in my detailed blogs I tend to favor more words over less. However, even for me that depends on the issue. Everyone sometimes need a short, quick answer over a long one. I predict in future versions the users will be provided a choice. Click here for further explanation of reasoning kind of thing. Come to think of it, I could do the same with my blogs, and kind of already do by including a short AI (Gemini) generated PodCast at the end of most articles called Echoes of AI.

I was also impressed by Gemini’s reasoning plan concluded with a quality control step. I am big on the importance QC and none of the other models included this as a key final step. See e.g. R. Losey website: Zero Error Numerics: ZEN (Expanding the art of quality control in large-scale document review.) Here is the full text again of final QC step that Gemini 2.0 Flash Thinking Experimental claims it will perform before it actually replies to my prompt. Frankly, I am a little skeptical it actually did all of this because it is something all of us, humans and AI alike, should try to do:

Refine and Polish: Review the structured answer for clarity, flow, and appropriate language. Ensure it avoids anthropomorphism and maintains a nuanced perspective. Strengthen the examples and transitions between points. Ensure the language is accessible and avoids overly technical jargon where possible. Check for logical consistency. (Self-correction step). For example, initially, I might have overemphasized “negative” impacts. Refinement would involve balancing this with the strengths and highlighting the context-dependent nature of the impact. Also, ensure the conclusion effectively summarizes the complex relationship.

Image of AI polishing by Ralph Losey using Stable Diffusion

I try to do this on all of my articles, but still mistakes are made, so I will try not to be overcritical of the final response of Gemini Flash 2.0.

Analysis of the Final Response

The final response after reasoning was 990 words in length. That’s too long, even for me. By comparison the final response of ChatGPT o3-mini-high was 252 words. Was the content of the final answer of Gemini 2.0 Flash good? Yes, it was. Better than DeepSeek? Yes, it was, despite its verbosity. Better than OpenAI? That’s hard to say. Many of the 738 extra words were empty of real content. More like the typical fluff and stuff that you often see in early AI models. It sounds good but doesn’t say much; it’s all too predictable, flowery and vacuous.

The long response also repeats much of the reasoning verbatim, including the part I previously criticized, “Advantages of Lacking Feelings for AI Intelligence” and Focus and Task-Orientedness. The same mistakes are made and amplified that I noted before: egocentric, inflationary and anthropomorphic. The positives that I noted before in the section Limitations of Lacking Feelings for AI Intelligence section are also repeated.

The next section was not contained in the reasoning display, and I thought it was very good. For that reason I will repeat it in full:

Potential for Unintended Consequences and Misaligned Goals:

Optimization without Human Values: If AI is programmed to optimize for a specific goal without understanding human values or ethical considerations, it can lead to unintended and harmful consequences. For example, an AI designed to maximize factory output might disregard worker safety if not explicitly programmed to prioritize it.

Difficulty Defining and Aligning with Human Well-being: Human well-being is deeply connected to emotional fulfillment, meaning, and social connection. AI, without feeling, may struggle to fully grasp what truly constitutes human well-being and therefore might have difficulty aligning its goals with human flourishing in a holistic sense.

These are not new insights, but the second one is good, Difficulty Defining and Aligning with Human Well-being. AI does have difficulty aligning its goals with humanity because it lacks feelings and other unique human attributes, like being, self awareness and an actual life in time and space. That is why AI guidelines are important.

The first example is old, the unintended bad consequences argument. It was started by an otherwise intelligent Swedish philosopher, Nick Bostrom, but is, in my view, farcical and illogical. You have probably heard of it before as the apocryphal paperclip scenario where the world is destroyed by AI’s over-manufacture of paperclips. If not, for a laugh see e.g., How An AI Asked To Produce Paperclips Could End Up Wiping Out Humanity (IFL Science, 4/13/23).

To me this scenario absolutely ridiculous, but you still hear AI scare-mongers warn about it. The fact it is echoed here by Gemini in the maximize factory output example is a big negative. Any AI capable of autonomously flooding the world with paperclips would not be so stupid as to misinterpret a human request to maximize a factory’s paperclip production. Even if there were an evil Clippy villain making such specific instructions, the AI would be too smart to blindly obey. I think they would also be too smart to kill the workers too.

AI accidentally destroying world in comic book style by Ralph Losey using Visual Muse.

Getting back to the positives, the next section of Gemini’s response, Nuance is Key, was good. Jumping to the Conclusion paragraph, here we have a home run. The first sentence nails it: “The inability of AI to feel is a defining characteristic that profoundly shapes its intelligence.” That shows good insight and is a good start to the overdue conclusion.

Here is the whole Conclusion:

The inability of AI to feel is a defining characteristic that profoundly shapes its intelligence. It grants AI certain advantages in objectivity, rationality, and efficiency, making it powerful for specific types of tasks. However, it also creates limitations, particularly in areas requiring empathy, social intelligence, nuanced understanding of human experience, and ethical decision-making aligned with human values. Understanding these impacts is crucial for responsibly developing and deploying AI in ways that benefit humanity. The “best” kind of intelligence for AI might not be a mimicry of human intelligence, but rather a form optimized for its own strengths and limitations, while being carefully guided by human ethical considerations.

Compare this to the conclusion of ChatGPT o3-mini-high:

In summary, while the absence of feelings allows AI to maintain a level of objectivity and efficiency, it restricts its intelligence to a form of “cold cognition” that lacks the depth provided by emotional awareness. This delineation underscores that AI’s intelligence is not inherently superior or inferior to human intelligence; rather, it is different—optimized for data processing and pattern recognition but not for the subjective, value-laden, and context-rich decisions that emotions help shape in human thought.

2.0 Flash Thinking Experimental v. 03-mini-high

The light of AI intelligence is growing fast. Image by Ralph Losey using Stable Diffusion.

Conclusion: Gemini 2.0 Flash Thinking Experimental v. ChatGPT o3-mini-high

This is a close call to say what model is better at reasoning and reasoning disclosure. The final response of both models, Gemini 2.0 Flash Thinking Experimental v. ChatGPT o3-mini-high, are a tie. But I have to give the edge to OpenAI’s model on the concise reasoning disclosure. Again, it is neck and neck and, depending on the situation, the lengthy initial reasoning disclosures of Flash might be better than o3’s short takes.

I will give the last word, as usual, to the Gemini twins podcasters I put at the end of most of my articles. The two podcasters, one with a male voice, the other a female, won’t reveal their names. I tried many times. However, after study of the mythology of Gemini, it seems to me that the two most appropriate modern names are Helen and Paul. I will leave it to you figure out why. Echoes of AI Podcast: 10 minute discussion of last two blogs. They wrote the podcast, not me.

Now listen to the EDRM Echoes of AI’s podcast of this article: Echoes of AI on Google’s Gemini Follows the Break Out of the Black Box and Shows Reasoning. Hear two Gemini model AIs talk about all of this in just ten minutes. Helen and Paul wrote the podcast, not me.

Robots seeing the light and seeking unity with humankind. Image by Ralph Losey using Stable Diffusion.

4 Comments | AI Ethics, AI Instruction, ChatGPT, knowledge, Lawyers Duties, Security, Technology, VENDORS, wisdom | Tagged: ai, best practices, chatGPT, Deepseek, gemini, google, legal profession, machine learning, technology | Permalink
Posted by Ralph Losey

Sam Altman’s 2024 Year End Essay: REFLECTIONS

January 22, 2025

by Ralph Losey. Published January 22, 2025.

In 2024, the dawn of the AI revolution shifted from science fiction to daily reality. Now, as we step into 2025, Sam Altman—the architect behind OpenAI’s rapid rise—reflects on this transformation in his essay, Reflections. His message is more than a recounting of OpenAI’s milestones; it is a call to action for all of us, including lawyers, to confront the realities of a world reshaped by artificial intelligence. How should we, as users, engage with this transformation? The way we answer this question will determine whether this new era is bright and promising—or dark and ominous.

**Sam Altman’s Reflections.** Image by Ralph Losey using Visual Muse

Sam Altman started the year 2025 by writing an essay, Reflections, on his blog. It begins by celebrating the second birthday of ChatGPT. The essay “share(s) some personal thoughts about how it has gone so far, and some of the things I’ve learned along the way.” He then begins his analysis the way any good thinker should, by acknowledging all that is still unknown and remains to be understood. Socrates would have approved.

As we get closer to AGI, it feels like an important time to look at the progress of our company. There is still so much to understand, still so much we don’t know, and it’s still so early. But we know a lot more than we did when we started.

Then Sam Altman discusses the history of OpenAI, starting almost nine years ago “on the belief that AGI was possible, and that it could be the most impactful technology in human history.” At the time most everyone thought they were foolish dreamers with no chance of success.

In 2022, OpenAI was still a quiet research lab but had made remarkable progress known to only a few. That all changed November 30th of 2022 when ChatGPT was publicly launched. As everyone by now knows that launch kicked off a growth curve like nothing every seen before. Sam then says that two years later: “We are finally seeing some of the massive upside we have always hoped for from AI, and we can see how much more will come soon.” The sun of AI intelligence is now rising fast.

**Exponential Curve. Image by Ralph Losey using Visual Muse.**

Then Sam talked about how hard it has been, building new stuff is hard, working with people, the world, etc., etc. He spends a lot of time lamenting his personal challenges last year, getting fired by a renegade Board of Directors and everything. One interesting point in his complaint session, and hey we all have them, is his observation, which is a key point in my own writings:

There is no way to train people for this except by doing it, and when the technology category is completely new, there is no one at all who can tell you exactly how it should be done.

Now for me that is a perfect learning challenge because for decades I’ve been teaching myself new tech (and law) and by now prefer to learn by doing. But most people are not like me, they don’t enjoy hacking around with tech (and law) always having far more questions than answers. That is one reason I have been laboring on a prompt engineering course for legal professionals on generative AI, but I digress.

Finally Sam moves on to what he has to be grateful about in 2024, how OpenAI went from about 100 million weekly active users to more than 300 million, how the AI has gotten better, etc. He then affirms that OpenAI’s vision of AGI remains, unchanged but “our tactics will continue to evolve.” He then explains with examples the need for new tactics.

For example, when we started we had no idea we would have to build a product company; we thought we were just going to do great research. We also had no idea we would need such a crazy amount of capital. There are new things we have to go build now that we didn’t understand a few years ago, and there will be new things in the future we can barely imagine now.

Pondering Unknown Future. Image by Ralph Losey.

Then he talks about his “release early-break and fix things approach” to AI safety, one that many people think is too risky, including the 2024 Nobel Prize winner, Geoffrey Hinton, the so-called “Godfather of AI.” See e.g., Losey, Key AI Leaders of 2024: Huang, Amodei, Kurzweil, Altman, and Nobel Prize Winners – Hassabis and Hinton. Although Sam Altman talks about safety all the time, many people, not just Hinton, think it is all talk, that the core of his “safety philosophy” is to release it into the world and then learn from experience, including mistakes. I agree you do learn quickly that way, and such an approach is now favored by the likes of Elon Musk, but are the risks acceptable? Can we really afford to break things and try to fix them fast? Time will tell.

We do need the benefits of AGI to fix the overwhelming problems humanity now faces. Plus, Altman and others seem to be well aware of many of the risks. The Intelligence Age (9/23/24 Altman blog on positive vision of AI), Who will control the future of AI? (7/25/24 Washington Post editorial by Altman on political risks). Still, I wonder if the billionaires might overestimate their ability to pivot out of mistakes that quick decisions inevitably bring?

Altman concludes his Reflections essay with a forward-looking vision, expressing confidence that OpenAI is on track to build AGI. As a key milestone, he outlines the development of advanced AI “agents” as the immediate focus for 2025—a crucial interim step toward his ultimate goal. Yet, it’s clear that Altman’s real aspiration lies beyond AGI, toward the creation of superintelligence, a phase he boldly calls the glorious future. He believes we’ll get there sooner than most expect. Personally, I remain skeptical. Bold visions are inspiring, but history teaches us that progress, especially in AI, is rarely linear or predictable.

We are beginning to turn our aim beyond that, to superintelligence in the true sense of the word. We love our current products, but we are here for the glorious future. With superintelligence, we can do anything else. Superintelligent tools could massively accelerate scientific discovery and innovation well beyond what we are capable of doing on our own, and in turn massively increase abundance and prosperity.

Superintelligent AI image by Ralph Losey using Visual Muse

Conclusion

It will take a lot of work by everyone, lawyers included, for a glorious future to emerge. An awful future could just a easily emerge, one where we fail to overcome the many AI dangers ahead. A world where we make the wrong choices along the way to the ever-promised paradise. Let’s hope the Universe you and I end up in is better than what we have today, far better. Quantum Leap: Google Claims Its New Quantum Computer Provides Evidence That We Live In A Multiverse (JD Supra, 1/7/25).

Choices creating parallel universes. Image by Ralph Losey using Visual Muse.

For lawyers, this challenge is both daunting and exhilarating. We are the architects of legal frameworks, the stewards of ethical boundaries, and the guardians of societal trust. Obviously AI wrote that last sentence, which I left in for laughs. Sounds great, but the reality is we are also the hard-working saps that have to clean up the messes that others make.

Beyond triage and twisted torts, we also face the monumental task of keeping the law relevant and effective amid the dawn’s early light of emergent AI. We are forced to confront fundamental questions: What does personhood mean in an age of machines? How do we balance innovation with accountability? And who will bear the burden of mistakes made by entities far beyond human comprehension?

The answers to these questions will not emerge from theory alone but through practice and perseverance. Like Altman’s own philosophy of learning by doing, the legal profession must embrace AI as both a challenge to be met and a tool to be mastered. Lawyers must be ready to ask hard questions, propose daring, creative solutions, and confront uncomfortable truths about the role of the legal profession.

Ultimately, whether we arrive at a better world will depend not on technology alone but on humanity’s ability to guide it with wisdom, compassion, and justice. Lawyers must help steer that course. The future is neither inevitable nor fixed. It is, as always, a series of choices.

Let us ponder deliberately, consider all of the facts, all of the equites, and do our best to make the right decisions.

Wise choices image by Ralph Losey using Visual Muse

Now listen to the EDRM Echoes of AI’s podcast of the article, Echoes of AI on Sam Altman’s 2024 Year End Essay: REFLECTIONS. Hear two Gemini model AIs talk about this article, plus response to audience questions. They wrote the podcast, not Ralph. In this Podcast the two AIs respond to three questions during the podcast from actual humans: Mary Mack, Kaylee Walstad and Holley Robinson.