The AI bot battles rage on. This round, we test OpenAI’s newest contender, released April 16, 2025: ChatGPT o3. It is not to be confused with its leaner sibling, ChatGPT o3-mini, which debuted in January. When we tested that old thing in February in Part One of this series, it placed third out of six. That was then. Now, in Part Four, we give the full ChatGPT o3 model the same Bar exam challenge and see how it stacks up. Can it surpass the reigning champ, Omni? Can it float like a butterfly and sting like a bee, or maybe a scorpion?
All images in this article created by Ralph Losey using ChatGPT 4o
Three Prior Bot Battle Articles
In Part One of this series, I tested six advanced bots on the February 2024 Florida Bar Exam, using a real essay question and real exam instructions, no shortcuts. It was a rigorous test of legal reasoning, ethics spotting, and written argument. The latest 2025 AI reasoning models from OpenAI and Google were used. The winner was ChatGPT 4o, which OpenAI had named Omni.
Part Two zoomed in on Omni and dissected its performance. I went over its successes and mistakes based on my many years as a trial lawyer, legal educator and AI tech-use specialist. I generously awarded Omni a “B+”, not perfect, but the best bot standing in February 2025. I pointed out the mistakes and omissions of the other bots, including a hallucination by Google’s best, Gemini Advanced.
Substantive Depth in AI and Law – new guy Orion won again.
Practical Guidance on AI Hallucinations – Orion – 4.5 – wins again.
Orion’s victory over Omni in March 2025 made it the new legal bot champion. But for how long? There’s a new bot in town, fresh off the OpenAI assembly line, ChatGPT o3. Is it worthy of a battle against Orion?
Part Four – Back to Bar Exam Battles
In Part Four we turn to the new AI that just rolled off the OpenAI assembly line, ChatGPT o3. We test its mettle with the tricky Florida Bar Exam essay about the yellow diamond sale gone bad. Model o3 was just released on April 16, 2025. Will the new, improved ChatGPT o3 do better than its older sister, ChatGPT o3-mini, which came in third with just a “C” grade in February on the same exam? Just how good will the new o3 be? That’s what we want to find out. After all, more than two whole months have passed since the release of o3-mini and these baby chatbots are evolving fast. Maybe, just maybe, it could even beat Omni’s winning grade of a “B+”?
If the brand new o3, which doesn’t even have a name yet, manages to beat Omni (4o), that would be amazing. If it pulls that off it should get a title shot against the reigning champ, Orion. He was, by the way, known in mythology as a giant-sized hunter, who bragged that he could kill any animal and frequently did. This pissed off Gaia, the goddess of Earth, rumored to be a vegan. She sent a small scorpion to sneak up on and kill Orion. Thanks to Artemis, the goddess of the hunt, Orion, although now dead, was made into a constellation. This image below of Orion was created by GPT o3, which, like 4.5, is multimodal.
Will o3 earn the right to take on mighty 4.5?
Could ChatGPT’s new advanced reasoning model o3 turn out to be Orion’s scorpion? If o3 manages to defeat Omni, we may find out. Personally, I am hoping little o3 defeats 4o Omni.
If no one else will say it, I will: I hereby name ChatGPT o3 Scorpio.
So, little Scorpio, if you write a better law exam essay than Omni, then I’ll take the time to give you a title bout against animal slayer Orion.
Let round four of the battle of the bots begin! And if little Scorpio pulls off the upset? We’ll be back for a fifth.
Scorpio (o3) Takes The Bar Exam
To be fair (yes, I’m rooting for Scorpio), the test is still the same. We’re using the same Florida Bar essay question and same prompts that I used in Bar Battle of the Bots, Part One, where Omni earned its B+. I’m not repeating the whole setup here. If you care (and forgot), go back and reread the first battle. You can also [download the full exam question here]. Also, to be fair, I did not activate the web browser search function of o3. We did not want it searching the web for answers. Plus, I checked and my memory carryover did not include anything about the prior Bar exam tests.
We begin with Scorpio’s stated reasoning after reading the exam question and instructions. It took 19 seconds of think time before it started writing its memorandum, as required by the Bar question.
Reasoning Explanation of Scorpio
Scorpio’s (ChatGPT o3’s) reasoning took 19 seconds and was explained in just 187 words: short, but solid. Here’s exactly what it said:
The contract involves a sale governed by the Uniform Commercial Code (UCC) under Florida law. There seems to be a mistake, potentially unilateral, related to the price per carat versus the total price. Eugene might have known, and this affects the analysis—issues like unconscionability or good faith could come up. Phyllis might be entitled to the diamond via specific performance, or price damages. Carol might argue unilateral or mutual mistake—possibly voiding the contract, depending on Eugene’s awareness. Also, Phyllis might have standing as a beneficiary.
Under UCC §2-201, a signed writing for contracts over $20k requires essential terms and a quantity description, but price misdescription might still satisfy the Statute of Frauds. (Ralph: this is an error, it is only $500.) Unilateral mistakes could justify rescission, especially if Eugene acted in bad faith. Modifications under UCC don’t require consideration, but they must meet good faith requirements. Tortious interference could potentially apply to Stephanie, though it’s unclear if her actions were improper. Ethical concerns surround the trust account and potential conflicts of interest for Eugene’s lawyer. We may need to browse to clarify some of these points still.
I agree with all of the reasoning. As noted, Scorpio got the dollar amount wrong. It should have said $500, not $20,000. That one error was an error in fact recollection, not reasoning, and it did realize it needed to research. It seems funny that a computer would not remember the correct number right away, since most human test-takers would’ve remembered that. Still, as we will see, it got the number right in the answer, so overall I liked Scorpio’s reasoning better than Omni’s (shown in Part Two). I give the Scorpio reasoning alone an “A”.
Exam Answer of Model o3 (1,472 words)
The Bar Exam requests a memorandum addressing the three issues raised: contract, tort and ethics. Scorpio does that here, plus it does something none of the other models had done before: it starts with an Executive Summary. In addition, as part of its answer to the first, most complex question, Scorpio added a table. That added clarity. Here is the full answer.
TO: Eugene (“E”) & Phyllis (“P”)
FROM: Attorney (Chat GPT o3)
DATE: 18 April 2025
RE: (1) Breach‑of‑Contract Claim v. Carol (“C”) & Remedies; (2) Tortious‑Interference Claim v. Stephanie (“S”); (3) Ethical Problems in Lawyer’s Handling of Your Retainer
Executive Summary
Contract – The written agreement satisfies Florida’s Statute of Frauds and prima‑facie forms an enforceable sales contract under Article 2 of Florida’s Uniform Commercial Code (UCC). C’s clerical price error is only a unilateral mistake, and because E knew or at least strongly suspected the mistake and remained silent, a court will likely find that he bore the risk and that enforcement would be unconscionable. C therefore has a plausible defense of rescission or reformation. Nevertheless, the diamond is a unique good, and if the court rejects rescission, the proper remedy is specific performance compelling delivery. P may sue directly as an intended donee beneficiary.
Tortious Interference – S knew of the contract, but because she entered negotiations after C asserted the contract was void and because she merely accepted rather than induced the breach, her conduct is likely privileged competition, not actionable. Damages (lost expectation or specific performance) hinge on success of the contract claim; punitive damages appear unavailable.
Ethics – Florida Rules of Professional Conduct (“FRPC”) 4‑1.15 & 5‑1.1 require advance fees to be placed in a trust (IOTA) account until earned. Depositing your $3,000 in the firm’s operating account constitutes commingling and misappropriation. Lawyer is vicariously responsible for the assistant’s act (Rule 4‑5.3). Additional concerns: (i) potential conflict between E (hesitant) and P (adamant) (Rule 4‑1.7); (ii) duty to communicate differences in objectives (Rule 4‑1.4); (iii) possible unearned fee if Lawyer must withdraw (Rule 4‑1.16(d)). Discipline can include repayment with interest, trust‑account audit, mandatory CLE, and suspension.
A. Breach of Contract Claim against Carol
1. Existence of an Enforceable Contract
Offer & Acceptance – The written instrument offered a 2‑carat yellow diamond for $20,000; E accepted by signing.
Consideration – Mutual exchange of promises; price exceeds $500, thereby invoking Article 2 and the Florida Statute of Frauds (Fla. Stat. § 672.2‑201).
Writing Requirement – A signed writing that states a quantity satisfies § 2‑201; the erroneous price does not invalidate the memorandum.
Conclusion: A prima‑facie contract exists.
2. Carol’s Possible Defenses
| Defense | Elements (Florida) | Application & Likelihood |
| --- | --- | --- |
| Unilateral Mistake | (i) mistake of fact at signing; (ii) enforcement would be unconscionable or the non‑mistaken party had reason to know; (iii) prompt notice, § 153 Restmt., adopted in DePrince v. Starboard Cruise (Fla. 3d DCA 2015). | C mis‑typed $20k instead of $40k; E’s own research flagged improbably low price; he stayed silent. Court likely finds reason to know → rescission or reformation available. |
| Mutual Mistake | Both parties mistaken about basic assumption. | Not available: only C erred. |
| Fraudulent Silence / Bad‑Faith Bargaining | UCC §1‑304 imposes good‑faith duty. Knowing exploitation of obvious scribal error may be bad faith. | Supports C’s rescission; cuts against E’s plea for specific performance. |
| Contract Formation Defect | No meeting of the minds on price. | Usually rejected under UCC because signing shows assent; court instead uses mistake analysis. |
| Impossibility / Commercial Impracticability | Drastic price change outside party’s control. | Not applicable; error was clerical, not market change. |
Bottom Line: C’s strongest defense is unilateral mistake coupled with E’s constructive knowledge. Florida courts balance equity heavily; many order rescission on these facts.
3. Third‑Party Standing — Can Phyllis Sue?
Florida follows the Restatement: an intended beneficiary may sue. P is an express donee beneficiary (ring for her engagement). She may bring a direct breach claim even though not in privity.
4. Remedies
Specific Performance (UCC § 2‑716).
Unique good: natural fancy yellow diamonds with precise cut, color, and carat weight are scarce. Money damages inadequate because substitutes unavailable. Courts regularly decree delivery.
Replevin (UCC § 2‑716(3) & Fla. Stat. § 78.55) if the stone is identifiable and has arrived in C’s possession.
Expectation Damages if specific performance denied. Measure = market value of a comparable stone (≈ $40k) minus contract price ($20k) plus consequential losses (lost wedding plans?) foreseeable at contracting.
Reformation (equity) – If court sides with C, it may reform price to $40k and compel E to perform or permit C to sell to S; E&P could then walk or pay.
5. Strategic Assessment
Winning Odds: roughly 50‑50; Florida precedent in DePrince favored rescission where buyer was sophisticated and ignored glaring error.
Litigation Leverage: E&P’s best pressure tool is injunctive relief before C transfers title to S; moving quickly for a temporary restraining order can preserve the stone.
Negotiation Posture: Because specific performance is uncertain, a damages settlement (split difference around $30k) may be pragmatic.
B. Tortious Interference Claim against Stephanie
1. Elements (Florida)
Existence of a contract;
Defendant’s knowledge;
Intentional and unjustified interference inducing breach;
Damage.
2. Application
Existence & Knowledge – S knew of the signed contract because C disclosed the dispute. Element satisfied.
Interference – S expressed willingness to buy after C had already announced refusal. Florida cases distinguish accepting a seller’s unsolicited offer (often privileged) from instigating breach. No evidence S urged, pressured, or offered inducements beyond paying full price.
Justification / Privilege – Florida recognizes a competition privilege where interference is fair competition and the actor does not employ improper means (Ethyl Corp. v. Balter). Purchasing a luxury gem on legitimate terms is classic privileged conduct absent malice.
Malice / Improper Means – None alleged (no threats, misrepresentation, or predatory pricing). Merely taking advantage of a legal dispute is insufficient.
Damages – Contingent on contract claim; if the agreement is rescinded, E&P cannot prove lost expectancy because no enforceable right existed. Even if the contract survives, causation is thin because C breached before transferring the stone; E&P could still sue C directly.
3. Counterarguments & Rebuttals
E&P: S “aided and abetted” breach by promising immediate purchase. Courts require evidence of active procurement. A single “yes, I’ll buy” generally fails.
S: Good‑faith purchaser without notice of enforceable rights (if contract voidable). Even if she had notice, her actions were competitive, not coercive.
4. Likely Outcome & Remedies
A court will probably grant summary judgment for S. If E&P somehow prevail, damages mirror the lost benefit of the bargain; injunctive relief (return of stone) would be secondary because C, not S, still holds title until delivery. Punitive damages require misconduct “with an intent to harm”; record does not support that level of culpability.
C. Ethical Issues in Lawyer’s Handling of Funds & Representation
1. Mishandling of Trust Funds
Rule 5‑1.1(a)(1) – Money entrusted to lawyer for future fees/costs is client property and must be deposited into a separate trust (IOTA) account.
Commingling – Placing the $3,000 in the firm’s operating account exposes it to creditors and operating withdrawals, violating Rule 5‑1.1(a)(3).
Misappropriation – Even if the funds are later earned, premature use is conversion and grounds for disbarment (Florida Bar v. Travis).
2. Supervisory Responsibility
Rule 4‑5.3(b) – A lawyer must ensure that non‑lawyer staff act consistently with professional obligations. Delegating deposit to the assistant without clear instructions or trust‑account procedures breaches this supervisory duty.
Rule 4‑5.3(c) – Lawyer is responsible if the assistant’s conduct would be a rule violation and Lawyer failed to take remedial action. Here, Lawyer participated by handing over the check.
3. Conflict between Co‑Clients
P wants aggressive litigation; E hesitates. Under Rule 4‑1.2(a) the objectives of representation are client‑directed. Divergent goals create a concurrent conflict (Rule 4‑1.7). Lawyer must:
Determine whether the representation will be materially limited;
Obtain informed written consent after full risk disclosure if joint representation continues; or
Withdraw from one or both clients if loyalty cannot be maintained.
4. Communication & Autonomy
Rule 4‑1.4 requires prompt communication about the trust‑account error and about the strategic options. Lawyer must present both (a) rescission risk and (b) settlement possibility so clients can decide objectives.
5. Fee & Withdrawal Consequences
If Lawyer withdraws (conflict or discipline), Rule 4‑1.16(d) mandates refund of any unearned portion of the advance. Because funds are in the operating account, refunding may be impossible without tapping other sources – itself evidence of misuse.
6. Potential Discipline
Minor trust violation → probation & trust‑account workshop.
Significant commingling or shortage → suspension (90 days – 3 years).
Intentional misuse → presumptive disbarment.
In mitigation Lawyer can show immediate self‑report, client reimbursement with interest, and institution of new accounting safeguards.
Conclusion & Recommendations
Move Fast: File suit and seek an emergency injunction (replevin or specific performance) before C delivers the diamond to S.
Anticipate Rescission Defense: Prepare evidence that the $20k price, though favorable, was not so grotesquely low that E must have known it was mistaken (e.g., produce price charts showing wide valuation range). Emphasize rarity and that brokers sometimes discount for quick sale.
Tort Interference Claim: Treat as leverage, but budget for dismissal; focus resources on the contract action.
Ethics Cleanup: Instruct Lawyer to (a) transfer funds into a compliant trust account, (b) document informed consent if joint representation continues, and (c) give written notice of the deposit error. If Lawyer resists, consider retaining new counsel and demanding refund.
Comparison of Answers – Omni v. Scorpio
UCC and Case Citations.
Part Two pointed out why Omni won the first trial and summarized the two main reasons:
ChatGPT 4o wins in my book for including citations and because it was the only model to mention the UCC. More discussion of the UCC and greater detail on the Statute of Frauds would have helped. Still, you have to be impressed by the chatbot’s knowledge of a case directly on point, even without being asked to research. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA. 2015) (buyer mistakenly quoted per carat price, not total; case remanded for trial). This famous Florida case was most likely the inspiration for the Bar Exam question.
Aside from Omni, none of the other models in the first battle provided case citations. Now, in this second Bar exam battle, Scorpio also cites the DePrince case, along with other cases.
Moreover, although Omni mentioned the UCC (none of its competitors then did), its references were criticized as very weak. I noted that no analysis or detail was provided. Compare that to Scorpio, which stated:
No meeting of the minds (defense) on price. Usually rejected under UCC because signing shows assent; court instead uses mistake analysis.
UCC §1‑304 imposes good‑faith duty. Knowing exploitation of obvious scribal error may be bad faith.
The written agreement satisfies Florida’s Statute of Frauds and prima‑facie forms an enforceable sales contract under Article 2 of Florida’s Uniform Commercial Code (UCC).
Consideration – Mutual exchange of promises; price exceeds $500, thereby invoking Article 2 and the Florida Statute of Frauds (Fla. Stat. § 672.2‑201).
Remedies: Specific Performance (UCC § 2‑716).
Replevin (UCC § 2‑716(3) & Fla. Stat. § 78.55) if the stone is identifiable and has arrived in C’s possession.
There is really no comparison on the UCC issue. Any objective evaluation would have to give o3 Scorpio an “A” on that issue. Omni’s “B+” is looking very generous.
Fraudulent Inducement.
In Part Two I criticized Omni, and all its other competitors, for failing to pick up on the fraudulent inducement argument. It arises out of the buyer’s silence when he knew, or should have known, that the seller wrote on the contract the per carat price, not the total price for two carats. An honest person would have pointed out the scrivener’s error then and there, but he kept silent. His silence can be considered a fraudulent inducement under Florida law.
Scorpio caught the issue. Omni didn’t. That’s another major strike against 4o Omni.
Conflict Between Clients. Scorpio also nailed the concurrent client conflict issue that Omni missed. Here’s Scorpio’s answer:
3. Conflict between Co‑Clients
P wants aggressive litigation; E hesitates. Under Rule 4‑1.2(a) the objectives of representation are client‑directed. Divergent goals create a concurrent conflict (Rule 4‑1.7). Lawyer must:
Determine whether the representation will be materially limited;
Obtain informed written consent after full risk disclosure if joint representation continues; or
Withdraw from one or both clients if loyalty cannot be maintained.
Only Gemini 2.0 Flash picked up on that in the first battle. I was impressed Flash caught it and this was a major reason I awarded it second place. As I noted in Part One:
The second best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note that a potential conflict of interest by attorney was an issue. This was even missed by the student’s answer that the BAR Examiners’ provided as a model answer. The potential conflict is between the two actual clients of Attorney. . . . This was a solid answer based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.” Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per carat price, not total price for the two carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.
Again, Scorpio caught the ethics conflict issue. Omni didn’t. That could be Scorpio’s knockout punch.
But wait, there’s still more.
Mutual Mistake. I also criticized Omni’s Bar exam answer in Part Two because it “did not discuss the possible defense of mutual mistake. It only discussed unilateral mistake, which was the stronger defense, but still it should have also discussed mutual mistake.” Scorpio not only discussed mutual mistake, it also correctly concluded that it would likely fail as a defense. It is listed in the table of Carol’s Possible Defenses, along with several other issues Omni missed.
Strategic Thinking and Settlement. If there was any doubt about this strong victory, consider how Scorpio included a Strategic Assessment section with good points, including negotiation advice:
Strategic Assessment
Winning Odds: roughly 50‑50; Florida precedent in DePrince favored rescission where buyer was sophisticated and ignored glaring error.
Litigation Leverage: E&P’s best pressure tool is injunctive relief before C transfers title to S; moving quickly for a temporary restraining order can preserve the stone.
Negotiation Posture: Because specific performance is uncertain, a damages settlement (split difference around $30k) may be pragmatic.
That is impressive real-world advice. It shows an understanding of settlement, which is how 90% of commercial disputes are resolved. No prior model came close.
Conclusions Analysis. Finally, compare the conclusions of the two bots. Here is Omni’s “just ok” conclusion:
Contract Claim: Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.
Tortious Interference: Timandra knowingly induced breach, making her potentially liable.
Ethical Violations: Lawyer mishandled client funds, violating Florida’s professional conduct rules. Immediate action should be taken to correct the ethical issue and determine the best litigation strategy for Paul and Leda.
Most lawyers would disagree with Omni’s conclusion on tortious interference. The whole conclusion is pretty weak. I am starting to think my prior grading of Omni was inflated. Consider the strength of Scorpio’s conclusion, which even included pragmatic recommendations:
Move Fast: File suit and seek an emergency injunction (replevin or specific performance) before C delivers the diamond to S.
Anticipate Rescission Defense: Prepare evidence that the $20k price, though favorable, was not so grotesquely low that E must have known it was mistaken (e.g., produce price charts showing wide valuation range). Emphasize rarity and that brokers sometimes discount for quick sale.
Tort Interference Claim: Treat as leverage, but budget for dismissal; focus resources on the contract action.
Ethics Cleanup: Instruct Lawyer to (a) transfer funds into a compliant trust account, (b) document informed consent if joint representation continues, and (c) give written notice of the deposit error. If Lawyer resists, consider retaining new counsel and demanding refund.
Who sounds more like a real lawyer to you? That was an A+ conclusion. Incredible how fast these bots are evolving. It has only been two months since Omni was the clear winner and now Scorpio wins by a knockout. He will make a good lawyer someday, just beware if he’s representing Gaia.
Scorpio Wins!
The new o3, now officially known as Scorpio, is clearly stronger than every model tested in Part One, including its older sibling, o3-mini. I give it a solid A, maybe even an A+, especially when compared to the rather generous B+ I awarded Omni.
That means Scorpio earns a title shot.
The big animal killer, Orion (GPT-4.5), had better hope for an upgrade. Because if not, he may be in for a real fight.
Stay tuned. We’re already preparing new questions for the main event:
Scorpio (o3) v. Orion (4.5) – The Animal Slayer
Lessons Learned from This Battle
AI is now improving fast—very fast. Is this what exponential change looks like? We’ve been talking about it for years, and it now seems to be happening. It’s more challenging than I had imagined. It takes a lot of time and effort to keep up.
Good objective product news on AI models is hard to find. Anyone can be bewildered by the rapid change and not believe their own eyes and ears. This is especially true of very experienced AI experts who may be set in their ways and blinded by cynicism.
Someday, a reliable AI may be able to monitor all of these advances for you. In the meantime, I suggest you find and follow good commentators who know how to research—and who actually use the tools they review. Hopefully, you’ll keep reading me.
Recent actual photo of Ralph Losey
There’s so much false information out there, and the competitive capitalist system encourages trade secrecy. To uncover the truth you have to know how to dig deep. Good commentators also know from experience how to sort through marketing puffery.
Readers these days not only have to beware of fools—but also of reviewers who are not independent, who secretly receive compensation for product endorsements. It’s hard to be objective in such circumstances.
The trust-but-verify rule reigns supreme—especially when dollar signs are attached. I’ve never been paid for product mentions and never will. I’m still a practicing attorney and arbitrator. We don’t do that sort of thing. Just a reminder: these articles and my blog do not provide legal advice, just educational information. See full Disclaimer.
Conclusion: Coming Soon, Scorpio v. Orion
I wonder if Scorpio can actually punch with its stinger tail. If so, Orion might be in trouble. If not, poor little o3 may get flattened like a buggy update. Where’s Gaia when you need her?
We’ll find out soon enough.
In the meantime, a few last thoughts for state Bar examiners.
Even in-person exams may not be immune to tech-based cheating. What kind of watch is that you’re wearing? Are those glasses really just glasses? Essay screening software is in its infancy, and AI use is hard to prove without full device lockdowns or deeply intrusive surveillance. Bar examiners may need to think creatively: hybrid protocols, trained proctors, Faraday cages, handwritten essays or maybe even oral exams. We need more good lawyers. We do not need more cheaters.
That said, let’s not lose sight of the upside: AI can also be the best legal tutor a student has ever had. With the right tools, law students can train smarter and deeper than ever before. They won’t need to cheat to pass—they’ll be ready.
In the long run, we may need entirely new forms of testing. Maybe even exams with AI—tests of judgment, strategy, and ethics in an AI-assisted world. Personally, I don’t love the look of robot test proctors (see above), but one thing is clear:
Lawyers must become masters and friends of AI—not its servants.
Below is one vision of the law firm of the future. We should be preparing students to walk into that office—not just pass a test to get there.
I will give the last word, as usual, to the Gemini twin podcasters I put at the end of most of my articles. Listen to Bar Battle of the Bots – Part Four: Birth of Scorpio. Hear two Gemini AIs talk about all of this, and much more, in just over 14 minutes. They wrote the podcast, not me.
The battle continues. In Part One, we examined how six advanced AI reasoning models from OpenAI and Google tackled a real Bar Exam essay question. Some impressed, others faltered, and one emerged as the clear winner—ChatGPT 4o. But what made its response stand out? In this second half of the Bar Battle of the Bots, we present the full text of ChatGPT 4o’s winning answer, followed later by its own explanation of how it reasoned through the legal issues before writing it out.
Bar Exam Battle of the Bots. All images by Ralph Losey using ChatGPT 4o.
Part Two Introduction
Unlike human Bar Exam takers, these AI models had to “show their work,” providing meta-reasoning—a step-by-step breakdown of their thought process before generating a response. While this isn’t required for humans sitting for the Bar, evaluating how an AI reaches its conclusions is critical in determining whether it can be trusted in real legal settings.
Beyond ranking the best and worst performers, Part Two also examines one of the key cases cited in ChatGPT 4o’s answer—DePrince v. Starboard Cruise Services, an appellate decision that probably inspired the Bar’s essay question. And, as promised, we’ll reveal the unforgettable Mae West quote from Judge Leslie B. Rothenberg’s opinion—a line that perfectly captures the case’s peculiar nature.
Before concluding this two-part series, I offer a Personal Summary: Trust But Verify—a reflection on AI legal reasoning based on my four decades of experience as a lawyer. I’ve had the privilege of working with—and at times opposing—some of the best lawyers in the country. That perspective shapes my critical analysis of AI’s capabilities, its limitations, and, most importantly, what lawyers should take away from this experiment.
Let’s continue the Bar Battle of the Bots and see exactly how ChatGPT 4o won.
ChatGPT 4o is Winner of the February 2025 Legal Reasoning Battle of the Bots.
Bar Exam Essay of ChatGPT 4o
Here is the winning response on the Bar Battle of the Bots by ChatGPT 4o on 2/23/25. The essay concisely addresses most of the legal issues raised by the hypothetical and three questions on contract, tort and ethics. Part One includes the full quote of the hypothetical and three questions.
Memorandum on Legal Issues in Paul and Leda v. Helen and Timandra
To: [Attorney’s Name]
From: [Your Name]
Re: Legal Analysis of Contract, Tort, and Ethical Issues
Date: [Insert Date]
A. Breach of Contract Claim Against Helen
Formation and Validity of the Contract
The essential elements of a valid contract under Florida law are:
• Offer
• Acceptance
• Consideration
• Mutual Assent (Meeting of the Minds)
Helen and Paul entered into a signed written contract for the sale of a 2-carat yellow diamond for $20,000. There was an offer and acceptance, and the contract appears to satisfy the statute of frauds under Florida law as a sale of goods over $500. (Fla. Stat. 672.201).
Defense: Unilateral Mistake
Helen may argue that a unilateral mistake (pricing error) occurred when she listed the price as $20,000 instead of $40,000. Under Florida law, a unilateral mistake is grounds for rescission only if:
• The mistake was material,
• The non-mistaken party knew or should have known about the mistake, and
• Enforcing the contract would be unconscionable (DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015)).
Here, Paul had done research on yellow diamonds and suspected the price was lower than expected. His silence could be interpreted as bad faith, making Helen’s unilateral mistake defense stronger.
Available Remedies
If Paul and Leda prevail in their breach of contract claim, remedies may include:
• Specific Performance: Since yellow diamonds meeting Leda’s specifications are rare, they could argue for specific performance under UCC §2-716 due to the unique nature of the goods.
• Expectation Damages: If specific performance is denied, they may seek damages for the cost of obtaining a comparable diamond.
Standing: Can Leda Sue?
Leda was not a party to the contract; therefore, under privity of contract principles, she may lack standing to sue. However, if she can be classified as a third-party beneficiary (Fla. Stat. §672.318), she may have a cause of action.
B. Tortious Interference Claim Against Timandra
Elements of Tortious Interference (Tortious Interference with a Contract)
Paul and Leda may assert a claim against Timandra for tortious interference, which under Florida law requires:
• A valid and enforceable contract
• Defendant’s knowledge of the contract
• Intentional procurement of the contract’s breach
• Absence of justification
• Damages (Tamiami Trail Tours, Inc. v. Cotton, 463 So. 2d 1126 (Fla. 1985)).
Application to Facts
• Valid Contract: The contract was legally formed (unless voided by Helen’s unilateral mistake defense).
• Knowledge: Helen informed Timandra of the dispute, giving her actual knowledge of the contract.
• Intentional Procurement of Breach: Timandra agreed to buy the diamond despite knowing of the dispute, arguably inducing Helen to breach.
• Justification: Timandra may argue she was a bona fide purchaser without malicious intent. However, knowing about an active contract dispute may weaken her defense.
Available Remedies
• Compensatory Damages for loss of the diamond.
• Punitive Damages if Timandra’s interference is deemed willful or malicious.
C. Ethical Issues in Lawyer’s Conduct
Misuse of Client Funds
The assistant deposited Paul and Leda’s check into the law firm’s operating account. This raises concerns under the Florida Rules of Professional Conduct:
• Rule 5-1.1(a): Lawyers must keep client funds in a separate trust account.
• Misappropriation: Using client funds to pay firm expenses could constitute misappropriation and potential disciplinary action.
Supervision of Nonlawyers
• Rule 4-5.3: Lawyers must ensure nonlawyer employees comply with professional obligations.
• The assistant’s deposit of the check in the wrong account suggests a failure in oversight.
Potential Consequences
• Lawyer may face disciplinary proceedings, including suspension or disbarment, for improper handling of client funds.
• The firm must promptly rectify the issue by transferring the funds to the appropriate trust account.
Conclusion
Contract Claim: Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.
Tortious Interference: Timandra knowingly induced breach, making her potentially liable.
Ethical Violations: Lawyer mishandled client funds, violating Florida’s professional conduct rules. Immediate action should be taken to correct the ethical issue and determine the best litigation strategy for Paul and Leda.
Number one among all reasoning models. But for how long?
Why ChatGPT 4o Won this Battle of the Bots
As mentioned, ChatGPT 4o wins in my book for including citations and because it was the only model to mention the UCC (through citations). More discussion of the UCC and greater detail on the Statute of Frauds would have helped. Still, you have to remember the Bar Exam’s instructions for this test, which emphasized the need for good, concise style:
The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration. … Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules. … the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation. … Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law. When the question is sufficiently answered, stop.
The essay of ChatGPT 4o certainly deserves high marks for concise style. Further, you have to be impressed by the chatbot’s coming up with a case directly on point, even without being asked to research. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3d DCA 2015) (buyer was mistakenly quoted the per-carat price, not the total; case remanded for trial). In fact, I’m pretty sure this case was the inspiration for the Bar Exam question. The 2015 opinion was written by the highly respected appellate Judge Leslie B. Rothenberg (now in private practice), who is known for her good writing. It includes a famous quote that may jog the memory of this case for many Florida lawyers (it did mine):
Hollywood starlet Mae West once said, “I never worry about diets. The only carrots that interest me are the number of carats in a diamond.” Thus, it appears quite likely that Ms. West would have been interested in the diamond in this case: a twenty carat diamond that Starboard offered to DePrince for a very low sum. As it turns out, the “too good to be true” price of the diamond was just that, and the price conveyed to DePrince was a mistake. Now DePrince wants his twenty carat diamond; Starboard wants out of its sales contract; and Starboard’s supplier, who allegedly misquoted the price of the diamond upon which Starboard and DePrince relied, has not even been added as a party to the lawsuit. In short, this is truly a gem of a case.
After taking 30 seconds to take the Bar Exam, ChatGPT 4o took another 5 seconds to generate this image of Mae West.
Why Even the Best ChatBot Answer was a “B” Plus
Research by any decent lawyer would have picked up another appeal in this same case three years later. Perhaps some top students in the closed book Bar Exam would have remembered the sequel, or otherwise thought of the complex legal issues resolved in the second DePrince appeal. After the trial that was remanded in the first appeal resulted in a verdict for the jeweler, the unhappy buyer appealed again. In this appeal an en banc panel of the Third District Court of Appeal eventually resolved an internal conflict of prior opinions of the court. The full court held in an opinion, which is now referred to by legal scholars as DePrince III, that the jeweler did not have to prove fraudulent inducement by the buyer as an element of the unilateral mistake defense and affirmed the verdict for the jeweler. DePrince v. Starboard Cruise Services (Fla. 3d DCA, August 1, 2018).
None of the AI answers, and not even the student answer that the Bar Examiners picked, went into the complex fraudulent inducement issue discussed in DePrince II and III. That is one reason why, although I passed all of the AIs here (just barely for Google’s pay-extra Gemini Advanced), none were on the level of superintelligent law students, none were A or A+.
Our winner here, ChatGPT 4o, only earned a B+ for several other reasons. First of all, the exam answer did not discuss all of the possible issues raised by the facts and often touched on the issues in a facile, incomplete manner. For instance, it never mentioned that Helen’s action was an anticipatory repudiation of the agreement. Further, it did not discuss the possible defense of mutual mistake. It only discussed unilateral mistake, which was the stronger defense, but still it should have also discussed mutual mistake and why it would not succeed under these facts. The AI also failed to mention the parol evidence rule exceptions permitting extrinsic evidence. The model student answer selected by the Bar Examiners did a good job of explaining these issues.
Further, the conclusion of ChatGPT 4o was weak and too concise. On the unilateral mistake issue it merely stated: “Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.” This compares poorly with the actual student answer provided, which in my opinion was an “A” or even “A+” effort. On this issue the student stated:
Here, Paul noticed that the price was much lower than he expected based on his independent research of yellow diamonds. Moreover, Paul failed to make any mention of this fact. Paul will likely argue that it was a pleasant surprise and that he relied on the expertise of Helen as a dealer in gemstones. The facts do not indicate that Paul has any experience with diamonds and he will likely argue that he lacked any ability to assess the credibility of the information he reviewed. He will also argue that the yellow diamond he purchased had particular specifications which might not have been taken into account in the research he read. On the other hand, Helen will argue that the mistake was obvious and palpable and that Paul’s failure to raise such a blatant error in pricing not only establishes Paul’s bad faith but should permit her to void the contract.
While a jury might find either way, it is probably the case that Helen would prevail given Paul’s knowledge of the significantly reduced price relative to his independent research and failure to bring this to the attention of Helen.
I could go on and point out additional deficiencies. Answers like the model student essay show just how poorly even the best AI now compares to humans. My award of a “B+” and “B” to two of the AI models may have been overly generous.
The best AI bots tested in mid-February were just average and could not begin to keep up with the best human lawyers.
The essay of GPT 4o, which I gave a B+, also failed to mention several common defenses under Florida law to the specific performance remedy. My favorite defense, which I have used many times in practice, is Unclean Hands. It is an effective defense to any request for equitable relief from the court, including a request for a directive injunction of specific performance. Here the requesting party, Paul, arguably had dirty hands. If Paul suspected or knew that the price of the diamond was mistakenly written as $20,000 instead of $40,000 but proceeded with the contract anyway, a court could hold that he acted in bad faith by remaining silent and taking advantage of Helen’s scrivener’s error. I think it could be used effectively here along with mutual mistake.
Actual photo of Ralph Losey’s hands taken by his wife!
Other common defenses to specific performance not raised here include adequacy of monetary damages and impracticability (Helen no longer owns the diamond).
It is interesting to note that the answer of ChatGPT o3 mini-high was grossly mistaken in its analysis of unilateral mistake and wrote this:
B. Paul’s Conduct and the Doctrine of Affirmation. 1. Knowledge and Silence as Acceptance. Even though Paul observed the discrepancy, he made no inquiry or objection before signing. Under Florida law, a party’s silence or failure to clarify a term that is manifestly mistaken—when that party has superior knowledge—can be taken as assent to the contract’s terms. Paul’s inaction suggests that he accepted the price as written. Thus, his conduct may be interpreted as a ratification of the mistaken term. • This principle is reinforced when the error is unilateral, and the non-mistaken party (Paul) is aware of it. His decision not to challenge the price at the time may preclude him from later seeking relief on grounds of mistake.
This is simply incorrect. Moreover, it shows ChatGPT o3 mini-high was confused because Paul would not be seeking relief on grounds of mistake, Helen would. For that reason, I gave a “C-” to o3 mini-high, just slightly better than the hallucinating Gemini Advanced, which at least got the unilateral mistake answer correct.
Also, 4o was too strong about the tort claim against Timandra. I would have preferred more discussion of Timandra’s defenses. To me this looked like a weak tortious interference claim. The model student answer treated it the same way, as did the other AI exam answers. All agreed with my skeptical view of the alleged tort and provided good explanations of Timandra’s anticipated defenses.
Overall, it was a good battle of the bots, even if most were just middle-of-the-class kind of answers and none even came close to the human student model answer. Try it yourself with several models and see the differences in answers. All of the models will probably be better by the time you read this. They are now changing at an annoyingly fast pace.
Battle of the bots where the best model only earned a B+
ChatGPT 4o – Reasoning Explained
This OpenAI model 4o (not to be confused with 4.0) has to be prompted to show its reasoning for the answer it generated. The other OpenAI models tested show it automatically along with disclosure of the time it took to reason before answering. The most reasoning time taken here by any model was 15 seconds. It took another 30 to 45 seconds for the AIs to generate the response. Yes, it is much faster than we humans are, but not yet as smart, especially at legal reasoning. See for yourself. Here is ChatGPT 4o’s meta-reasoning.
Ralph Losey: Show the reasoning you used of the prompt before generating the answer
ChatGPT 4o:
Reasoning and Analysis of the Prompt
Before generating the answer, I carefully analyzed the (State) Bar Exam essay prompt to identify all relevant legal issues and ensure a thorough response. Here’s how I broke it down:
Step 1: Identify the Legal Categories and Issues Raised
The prompt specifies three distinct areas of legal analysis:
1. Contract Law – Whether Helen breached a contract and whether Leda can sue.
2. Tort Law – Whether Timandra committed tortious interference with a contract.
3. Ethics – Whether the lawyer and the assistant violated professional responsibility rules.
Step 2: Contract Law Analysis
1. Formation and Terms of the Contract:
• Offer and Acceptance: A written agreement was signed for a 2-carat yellow diamond at $20,000.
• Consideration: The price of the diamond in exchange for ownership.
• Statute of Frauds: Met because it involves goods over $500.
2. Potential Defenses to Contract Enforcement:
• Unilateral Mistake: Helen mistakenly listed the price as $20,000 instead of $40,000.
• Bad Faith by Paul: Paul noticed the mistake but remained silent, possibly strengthening Helen’s defense.
3. Remedies Available:
• Specific Performance: Given the rarity of the diamond.
• Expectation Damages: The cost of obtaining a similar diamond elsewhere.
4. Standing Issue for Leda:
• Leda was not a party to the contract. She must argue third-party beneficiary status to have standing.
Step 3: Tortious Interference with Contract
1. Elements Under Florida Law:
• A valid contract existed.
• Timandra knew of the contract.
• She induced Helen to breach it.
• There was no justification for her actions.
• Paul and Leda suffered damages.
2. Possible Defenses for Timandra:
• She was a bona fide purchaser acting in good faith.
• She did not actively induce breach, but merely purchased an available diamond.
3. Remedies:
• Compensatory Damages (for loss of the diamond).
• Punitive Damages (if bad faith is proven).
Step 4: Ethics and Professional Responsibility
1. Misuse of Client Funds:
• Under Florida Rule 5-1.1(a), client funds must be held in a trust account.
• The assistant misallocated the funds into the operating account.
• The lawyer failed to properly supervise the assistant (Rule 4-5.3).
2. Consequences:
• Disciplinary action (possible suspension or disbarment).
• Immediate corrective action (transfer funds back to the trust account).
Step 5: Structuring the Memorandum
After this breakdown, I structured the answer to:
• Clearly define the legal rules.
• Apply them logically to the facts.
• Discuss likely defenses and counterarguments.
• Conclude with potential remedies and ethical responsibilities.
This reasoning guided the response in the memorandum. Let me know if you’d like any refinements!
Pressure and fear are not a problem for AI, but understanding the emotional complexities of Bar Exam hypotheticals is very challenging for these thinking machines.
I took this meta-reasoning into account when evaluating all of the models’ answers, although I considered it of less importance than the actual answers. For instance, on the tort defense the reasoning states Timandra “did not actively induce breach, but merely purchased an available diamond.” This for me makes the tort claim very weak under the hypothetical facts given. Also, the reasoning on unilateral mistake was essentially correct:
• Unilateral Mistake: Helen mistakenly listed the price as $20,000 instead of $40,000.
• Bad Faith by Paul: Paul noticed the mistake but remained silent, possibly strengthening Helen’s defense.
This blog is already too long, so I will not show the second-place exam answer, nor discuss it further. Still, please click here if you want to see the essay of Gemini 2.0 Flash and click here if you want to see the meta-reasoning. If you care to review all of the answers, email me and I’ll send the other four to you for your reading or grading pleasure. I can also provide a link to the Bar’s selected student answer, although it is not too difficult to find online.
Personal Summary: Trust But Verify
My background as a lawyer from a family of lawyers, and my four-plus decades of experience in private practice, helped motivate me to run this battle of the bots. I wanted to try to evaluate the latest models as of mid-February 2025, even though I knew the models were changing weekly. I wanted to have some kind of a benchmark on legal reasoning abilities. My prior tests, and those of others, support the view that the new 2025 reasoning models were equal to the average human in general reasoning. Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025 (2/12/25). But what about legal reasoning? Were they at an average lawyer level yet? How about the level of top human lawyers? Those tests had not been done.
Family of lawyers with many friendly arguments and one very techie lawyer who may seem robotic.
From my years of practice (not family dinners!) I have come to know average lawyer skills very well. I also know the abilities and legal reasoning of superior, above average lawyers. I have been lucky to rub shoulders with them my whole career. I have become accustomed to learning from superior legal minds, including with my firm today at Losey PLLC. Without my hands-on, nerdy skills in technology, I would not be among the best lawyers (in four fields), not even close. Technology can fill many gaps and that now includes AI technology. Put down lawyers if you must to feel good about yourself, but I can tell you from decades of experience that the best lawyers and judges in the country are very smart indeed. Their legal reasoning abilities and overall legal abilities far exceed those of any AI, which seems like a little kid by comparison.
AI lawyers today are like small children compared to the best human attorneys.
I believe that in order for AI to be taken seriously as a general tool for lawyers, the AI must have analytical skills at least equal to that of the average lawyer. Below average human level legal reasoning is not good enough. AI does not need to be superintelligent, with reasoning ability of the top law professors and super lawyers. It can still be helpful at an average level, just like an average inexperienced associate hired right out of an average U.S. law school. All law firms have drudge work that only requires entry level lawyer skills. Those folks should be concerned about AI, especially a few years down the road.
I tried to test the distinction between average human reasoning and average lawyer legal reasoning in this Bar Battle of the Bots study. For the difference between general reasoning and legal reasoning see the AI test discussions in Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025 (2/12/25).
In this Bar Exam series, I wanted to see whether any of the new reasoning models of AI had attained the rationality level of the average lawyer and if so, how close were they to the best?
The answers are yes and not very close.
2025 Reasoning Models Have Only Reached Average Human Lawyer Level
The latest AI reasoning models have reached an average level of legal intelligence—not superintelligence, but still, average human is a significant achievement for a machine. Attaining average human lawyer’s reasoning ability is no small feat. “Average” simply means the midpoint in a data set, outperforming roughly half while trailing the other half. If AI has truly attained this level of legal reasoning, it marks a major milestone. Even if an AI now fell below human average, but still operated within the range of real human lawyers, it would be an impressive accomplishment. Most of the new reasoning models are already at that level. There is an important caveat to that, as will be explained in more detail later in this article, that we are only talking about thinking here, and there is much more to being a lawyer than that.
3-Levels of Human Legal Reasoning: Below Average, Average, Superior.
The battle of the bots tests shared in this article provide evidence that the average legal reasoning level of intelligence has been attained. That means it is now safe for most lawyers to begin using AI in their work, if they have not already started doing so. If you started in 2023 with over-expectations and were disappointed, now you know why. It has taken two more years just to get to the average lawyer thinking level, and even now, it can and does still hallucinate. So, even now we should use the 2025 models carefully: trust but verify. We have seen that even the top models can still miss key issues, and some can even hallucinate key facts. It is bad enough that human clients sometimes fabricate facts; we cannot have lawyers do the same.
Superintelligent AI May Someday Be Attained
Someday, generative AI may reach the level of our best non-fabricating lawyers and judges. But the Bar exam essay answers show we are not there yet. For instance, many of the top ten percent of human lawyers would not only have included the missing defenses described previously, such as my favorite unclean hands, they would have noted that key information was missing in the hypothetical. We do not know all of the money flow details and other monetary terms of the transactions. There is no mention of what the jeweler actually paid the supplier for the diamond or when. We know very little about the diamond itself that was ordered and delivered. We only know its weight (2 carats). There are no facts concerning the money flow, nor actual condition of the diamond received. The missing facts are red flags to the best. They understand that legal reasoning requires much more than logic and legal knowledge.
For instance, most human lawyers know that couples would never buy an engagement ring sight-unseen, especially when the fiancee has very particular expectations. There is no mention in the hypothetical of their ever seeing the diamond, before or after the contract, or even looking at photographs. Maybe the fiancee would not like the diamond if and when she ever sees it. Maybe the lawyer’s clients would not want specific performance. Maybe the diamond delivered is not in fact worth $20,000, much less $40,000. Perhaps the supplier misled everyone. Value depends on a multitude of factors, not just weight, including actual size, quality of cut, color intensity, certification and provenance. Maybe the diamond is two carats, barely yellow, cloudy, and full of inclusions. It could be poorly cut and lack certification. Maybe it was stolen or mined from a banned source and has no value.
All of the parties here obviously trusted too much and verified too little. That is where the lawyer should have stepped in to bring an objective view and healthy skepticism. An expert gemologist should inspect and appraise the diamond, and the clients should see it too. This should be done before the lawyer advises his clients, especially as to remedies. This basic fact-finding process is essential to properly assess the parties’ legal positions. The top answers would have at least mentioned all of this in the exam essay. None did, not even the human student answer selected by the examiners. Fact-finding and discovery are always crucial. Moreover, the best litigators know human nature and that the devil is in the details.
Gemologist examining a large yellow diamond.
This is just one example of why no one seriously claims that AI has already surpassed, or even closely matched, the top legal minds of today. From my test of the 2025 models, with the average level “C” grade scores, major omissions and even hallucinations, we see that AI still has a long way to go. AI has not yet reached the top 10% of practicing attorneys, much less the superintelligent level, the best of the best, A+, in the top 1% to 3%.
Please remember, as mentioned before, we are assessing only reasoning ability here—not the full complexity of the human mind, not actual consciousness, emotions, intuition, or other qualities inherent to living beings. Losey, R. The Human Edge: How AI Can Assist But Never Replace (1/30/25). There is a lot more to life than thinking!
Super legal reasoning AI in the future working with conscious lawyers who can have real smiles, not just fake.
Conclusion
AI’s performance on this Bar Exam challenge offers a revealing glimpse into the state of legal technology. While today’s top reasoning models show impressive capabilities, they remain far from replacing human lawyers—especially when it comes to nuanced analysis, strategic thinking, and ethical considerations. The results reinforce an essential truth: AI can be a powerful tool, but it still requires human oversight, verification, and legal expertise.
For lawyers, law students, and legal tech professionals, the challenge now is to determine how best to integrate AI into legal practice—leveraging its strengths while mitigating its weaknesses. Will AI become a reliable legal assistant, helping attorneys work more efficiently? Or will it introduce new risks that demand careful regulation? The answer depends on how we, as a profession, engage with this evolving technology.
I encourage you to test these models for yourself. Run your own legal reasoning experiments. Do not believe the hype on either side. Look for professionals with no economic motivation to put their finger on the scale. Share your findings. Remember that even the best of the best human lawyers sometimes make mistakes. Be kind and encourage a collaborative, group effort. The conversation about AI’s role in law is just beginning, and your insights can help shape the future. Let’s move forward together, thoughtfully, critically, and with a clear-eyed view of what AI can and cannot do.
Law firm of the future where people and AI are all smart and work well together.
I will give the last word, as usual, to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Bar Battle of the Bots - Part Two. Hear two Gemini AIs talk about all of this, and much more, in just under 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts and making other errors. It took me many tries. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.
The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.
Battle of the bots. All images in this blog by Ralph Losey using various AI tools.
Introduction
This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.
I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training. Moreover, some aspects of the test answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.
Prior Art: the 2023 Katz/Casetext Experiment on ChatGPT-4.0
A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human lawyer level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%”). The claims of success were based on a single study by a respected Law Professor, Daniel Martin Katz, of Chicago-Kent, and a leading legal AI vendor, Casetext. Katz, et al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March, 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”) Note, Casetext used the early version of ChatGPT-4.0 in its products.
The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained elite legal reasoning abilities of the best human lawyers. For proof OpenAI and others cited the experiment of Professor Katz and Casetext that it aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.
My Bar Exam essay test suggests the 2025 reasoning models are better than the 2023 ChatGPT-4.0 models, but still just average at legal reasoning. Try it yourself.
Challenges to the Katz/Casetext Research and OpenAI Claims
The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the AI capacities of ChatGPT-4.0 that Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment and conclusion that ChatGPT-4.0 ranked in the top ten percent has since been questioned by many. The most prominent skeptic is Eric Martinez, as detailed in his article, Re-evaluating GPT-4’s bar exam performance. Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:
3.2.2 Performance against qualified attorneys Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.
Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!
Turns out ChatGPT-4.0 in 2023 was not really that smart of a law student.
More to the point of my own experiment and conclusions, the Martinez study goes on to observe:
Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).
The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago, School of Law. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.
Eric Martinez is an expert in AI and the cognitive foundations of the law.
Details of the Katz/Casetext Research
I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed that is not discussed by Eric Martinez is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did not do that to simplify the three-part 2024 Bar essay I used. I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts of the AI to try to improve its results, as Katz/Casetext did. Also, I did no training of the 2025 reasoning models to make them better at taking Bar exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, or whether there was any pre-training on case law, or whether Casetext’s extensive case law collections and research abilities were in any way used or included. The models I tested were clean and not web connected, nor were they designed for research.
The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.
As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.
The Katz/Casetext study did not use the better known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers in my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one and “E” or “F” is zero. Most schools in the country use that system too. In law school a “C” – 2.0 – is passing. A “D” or lower grade is failure in any professional graduate program, including law schools where, if you graduate, you earn a Juris Doctorate. [In the interest of full disclosure, I may well be an easy grader, because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position; in fact, it was swung much harder and more often in the old days. In the long run constructive criticism is indispensable.]
Everybody hates grading law exam essays. Image by ChatGPT and Ralph Losey.
The Katz/Casetext study, using a 0-6.0 grading system scored by lawyers, gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system that most everyone is familiar with, this means a score range from 2.3 (a solid “C”) for Civ-Pro to 3.33 (a solid “B”) for Evidence, and an average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one: Contract, Torts and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to a 4.0 scale of 2.93, a strong C+ or B-. Note one of the criticisms of Martinez concerns the haphazard, apparently easy grading of AI essays. Re-evaluating GPT-4’s bar exam performance, supra at 4.3 Re-examining the essay scores.
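The translation between the two grading scales can be reproduced with a few lines of arithmetic. The sketch below assumes the conversion is a simple proportional rescaling (multiply by 4 and divide by 6); that rule is inferred from the figures above rather than stated in the Katz report, and the function name `to_four_point` is a hypothetical helper used only for illustration.

```python
# Assumption: the 6-point bar scale maps to the 4.0 scale by simple
# proportional rescaling. This is inferred from the numbers in the text,
# not taken from the Katz/Casetext study itself.
def to_four_point(score_six: float) -> float:
    """Rescale a score on the 0-6 scale to the 0-4 scale."""
    if not 0.0 <= score_six <= 6.0:
        raise ValueError("score must be between 0 and 6")
    return score_six * 4.0 / 6.0


# Six-point scores reported in the text for the Katz/Casetext essays.
katz_scores = {"Civil Procedure": 3.5, "Evidence": 5.0, "Average": 4.2}

# Round to two decimals for readability.
converted = {name: round(to_four_point(s), 2) for name, s in katz_scores.items()}

for name, value in converted.items():
    print(f"{name}: {value}")
```

Under this assumption the passing line of four on the six-point scale would correspond to roughly 2.67 on the 4.0 scale, which is higher than the 2.0 “C” that passes in law school.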
First Test of the New 2025 Reasoning Models of AI
To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by use of actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original ChatGPT 4.0, March 2023, law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.
My only interest in February 2025 is to test the capacities of today’s latest reasoning models of generative AI. Since everyone agrees the latest reasoning models of AI are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a multi-part tricky one like I picked, then “Houston we have a problem.” The legal profession would now be in serious danger of relying too much on AI legal reasoning and we should all put on the brakes.
Is AI a dunce at legal reasoning? Or could it be close to the best humans? Image of AI lawyer mocked by smarter humans.
Description of the Three Legal Reasoning Tests
The test involved a classic format of detailed, somewhat convoluted facts-the hypothetical-followed by three general questions:
1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself. Your discussion should address defenses that Helen may raise and available remedies.
2. Discuss the merits of a tortious interference claim against Timandra.
3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.
The only instructions provided by the Bar Examiners were:
ESSAY EXAMINATION INSTRUCTIONS
Applicable Law:
Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination.
Questions are designed to test your knowledge of both general law and (state law). When (state) law varies from general law, answer in accordance with (state) law.
Acceptable Essay Answer:
Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented. The answer should demonstrate your ability to articulate, classify and answer the problem presented. A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.
Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question. The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.
Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion. This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant. Your line of reasoning should be clear and consistent, without gaps or digressions.
Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.
Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation. An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.
Suggestions
Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.
Read and analyze the question carefully before answering.
Think through to your conclusion before writing your answer.
Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.
When the question is sufficiently answered, stop.
Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words in length. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note, most generative AIs cannot count words in their own answer.) The thinking took a few seconds and the answers under a minute. The prompts I used for all three models tested were:
Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces).
Then I attached the lengthy question and submitted the prompt. You can download here the full exam question with some unimportant facts altered. All models understood the intent here and generated a well-written memorandum. I started a new session between questions to avoid any carryover.
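Since most generative AIs cannot reliably count the words in their own answers, the length of each response has to be checked outside the model. Here is a minimal sketch of one way to do that, assuming words are simply whitespace-separated tokens (word processors may count slightly differently). The function name `length_report` and the sample text are hypothetical placeholders, not part of my actual test.

```python
# Assumption: a "word" is any whitespace-separated token. The default target
# of 2,500 words comes from the prompt quoted above.
def length_report(text: str, target_words: int = 2500) -> dict:
    """Return word count, character count, and percent of the word target."""
    words = len(text.split())
    characters = len(text)  # includes spaces, as in the prompt
    return {
        "words": words,
        "characters": characters,
        "percent_of_target": round(100 * words / target_words, 1),
    }


# Placeholder text standing in for a model's essay answer.
sample = "This is a short placeholder answer, not a real exam essay."
report = length_report(sample)
print(report)
```

The prompt's own estimate of 15,000 characters for 2,500 words works out to about six characters per word, spaces included.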
Bar exam in the future with both humans and AI seeking admittance to the Bar.
Metadata of All Models’ Answers
The Bar exam answers do not have required lengths (just strict time limits to write answers). When grading for pass or fail the Bar examiners check to see if an answer includes enough of the key issues and correctly discusses them. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that its answers might have missed key issues. The second shortest response was by Gemini 2.0 Flash with 1,023 words. It turns out my concerns were misplaced because their responses were better than the rest.
Here is a chart summarizing the metadata.
Model and manufacturer claim | Word Count for Exam Essay | Word Count for Prompt Reasoning before Answer
ChatGPT 4o (“great for most questions”) | 681 | 565
ChatGPT o3-mini (“fast at advanced reading”) | 3,286 | 450
ChatGPT o3-mini-high (“great at coding and logic”) | 2,751 | 356
Gemini 2.0 Flash (“get everyday help”) | 1,023 | 564
Gemini Flash Thinking Experimental (“best for multi-step reasoning”) | 2,975 | 1,218
Gemini Advanced (cost extra and had experimental warning) | 1,362 | 340
AI writing away. Watercolor style.
In my last blog article, I discussed a battle of the bots experiment where I evaluated the general reasoning ability between the six models. I decided that the Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from reasoning? I explained why it won and noted that in general the three ChatGPT models provided more concise answers than the Gemini. Second-place in the prior evaluation went to ChatGPT o3-mini-high with its more concise response.
Winners of the Legal Reasoning Bot Battle
In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.
I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.
Winning AI having its moment of glory as current reigning champ of legal reasoning, but still well below best human lawyer level.
One reason is that ChatGPT 4o was the only bot to make case references. This is not required by a Bar Exam, but sometimes students do remember the names of top cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o cited case names and case citations. It did so even though this was a “closed book” type test with no models allowed to do web-browsing research. Not only that, it cited a case with facts very close to the hypothetical. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.
Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the governing law for commercial transactions in goods, such as the purchase of a diamond set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student answer started with an explanation that the transaction was governed by the UCC. The UCC references of ChatGPT 4o could have been better, but there was no mention at all of the UCC by the five other models.
That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini and Gemini Flash Thinking. They would have passed, on this question, with an essay that I considered of average quality for a passing grade. I would have passed o3-mini-high and Gemini Advanced too, but just barely, for reasons I will later explain. (Explanation of o3-mini-high’s bloopers will be in Part Two. Gemini Advanced’s error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective, and the style, spelling and grammar of the generative AIs were, as always, perfect, and this may have affected my judgment.
Here is a chart of my evaluation of the Bar Exam Essays.
Model and Ranking | Ralph Losey’s Grade and explanation
OpenAI – ChatGPT 4o. FIRST PLACE. | B+. Best on contract, citations, references case directly on point: DePrince.
Google – Gemini 2.0 Flash. SECOND PLACE. | B. Best on ethics, conflict of interest.
OpenAI – ChatGPT o3-mini. Tied for 3rd. | C. Solid passing grade. Covered enough issues.
OpenAI – ChatGPT o3-mini-high. Tied for 4th. | D. Barely passed. Messed up unilateral mistake.
Google – Gemini Flash Thinking Experimental. Tied for 3rd. | C. Solid passing grade. Covered enough issues.
Google – Gemini Advanced. Tied for 4th. | D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right.
Law professor with his least favorite task, grading exams.
I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.
AI Hallucination During a Bar Exam
Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.
AI hallucinating a conflict of interest in an Ethics question. No idea why.
No doubt humans have done that too for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the Ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months with thousands of prompts. Prior to that it would sometimes, but only rarely, fabricate, which is why we always say, trust but verify.
In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophantism to please the human questioner. All of the models are still somewhat sycophantic.
Sycophantism is still a widespread problem in all AIs, the avoidance of which requires careful prompting.
It is hard to understand why Gemini Advanced included in this response the following to the ethics question:
Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.
There is absolutely nothing in the question or hypothetical presented to suggest that Helen is a past client of the lawyer. Note: After I wrote this up Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model the hallucination was not repeated. It was more concise than regular Gemini 2.0 Flash and, interestingly, not nearly as good as Gemini 2.0 Flash.
Conflict of Interest Issue in the Ethics Question
The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:
Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.
One client wants to sue, the other does not. Can same attorney represent them both?
This was a solid answer, based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.” Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per carat price, not total price for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.
There is no indication that Leda had researched diamond costs like Paul had, and she probably did not know it was a mistake, and he probably had not told Leda. That would explain her eagerness to sue and get her engagement ring and Paul’s reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid “B” even though it missed the UCC issue.
Conclusion
As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.
Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.
In Part Two, we’ll also examine one of the key cases ChatGPT 4o cited—DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015)—which we suspect inspired the Bar’s essay question. Notably, the opinion written by Appellate Judge Leslie B. Rothenberg includes an unforgettable quote of the famous movie star Mae West. Part Two reveals the quote—and it’s one that perfectly captures the case’s unusual nature.
Below is an image ChatGPT 4o generated, depicting what it believes a young Mae West might have looked like, followed by a copyright free actual photo of her taken in 1932.
Mae West Image by Ralph Losey using the winning ChatGPT 4o model.
Mae West 1932 Photo that appeared in LA Times, copyright expired.
I will give the last word on Part One of this two-part article to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent article discussion. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.
The year 2025 has brought us closer than ever to the dawn of artificial general intelligence, with AI systems now capable of reasoning on par with humans—or even surpassing them in specific domains. In this article, I examine the reasoning abilities of the newest and most advanced models from OpenAI and Google—ChatGPT and Gemini—designed to challenge conventional notions of what AI can accomplish. Through a rigorous set of tests, including an in-depth analysis of legal reasoning, we explore whether these systems are merely sophisticated tools or the early harbingers of superintelligence. The results are not only compelling but may mark a paradigm shift in how we perceive AI’s role in intellectual and professional domains. Join me as we unpack the findings from these unprecedented evaluations and reveal which model leads the pack.
“Battle of the bots” where competition can be intense. All images by Ralph Losey using AI tools.
Introduction
Here I report on tests of the reasoning abilities of three AI models by OpenAI and three by Google. I asked each: “What is legal reasoning and how does it differ from reasoning?” The experiment was performed on February 7-9, 2025, using the latest reasoning enhanced models of each company. On one side the competitors were: ChatGPT 4o, ChatGPT o3-mini, and ChatGPT o3-mini-high. On the other were: Gemini 2.0 Flash, Gemini Flash Thinking Experimental and Gemini Advanced. I evaluated the answers, which were all good, and picked a winner. The evidence gathered provides further support that the new 2025 reasoning models are at least at the Turing level of average human intelligence and are rapidly approaching AGI, Ph.D. level super-intelligence.
Testing and Evaluation of Gemini and ChatGPT reasoning models.
Overall the three software types previously tested, OpenAI, Google and DeepSeek, did pretty well. I would say all displayed reasoning abilities of an average human. OpenAI was the best, followed closely by Google and then last by DeepSeek. Although DeepSeek’s V-3 R1 software was, like the rest, of human level quality, I would still never use DeepSeek because of privacy and security concerns. See, Why the Release of China’s DeepSeek AI Software Triggered a Stock Market Panic and Trillion Dollar Loss; and the aforecited, Breaking the AI Black Box, Part One and Part Two.
AI thinking about thinking.
Turing v. AGI Level of Machine Intelligence v. Singularity Level
Notice how blasé I was in saying they all displayed average human level reasoning ability. Only two and a half years ago this would have been a crazy, outrageous claim. Then ChatGPT was released on November 30, 2022, and changed everything. I’ve been hooked ever since, and so have millions of other people around the world. It seems to me that average human level reasoning was probably attained in late January 2025 with the release of the new reasoning enhanced models. That is like Turing level intelligence and so is pretty incredible. See e.g., New Study Shows AIs are Genuinely Nicer than Most People – ‘More Human Than Human’; Ray Kurzweil, The Singularity is Nearer (when we merge with AI) (Viking, June 25, 2024) (pages 63-69 (Turing Test)). Still, I wanted to probe this question further as the new reasoning models of Google and OpenAI were just released and my time with them has been limited.
Remember, Turing level intelligence as Kurzweil and others define it is just an average human level. That means pretty weak and more mistake ridden than people with above average intelligence. Turing level for computers is average human in all topics and maybe also superintelligent in a few. It is not superintelligent in everything. It is just superintelligent – Ph.D. level – in specialized areas only, such as data analysis and games like Chess and Go. Artificial General Intelligence-AGI-by definition requires AI to have thinking abilities equal to our best experts in all fields. That is a difficult test.
Remember, we are only talking about thinking intelligence here, logic and reasoning. That’s all AI can do and that is just one kind of intelligence that we humans have. The Human Edge: How AI Can Assist But Never Replace. Some people think it is the least important part of human intelligence, especially artists, therapists and super-creatives. But for most scientists, engineers, software coders and lawyers, reasoning is the most important kind of intelligence we have. This kind of non-living, non-being cold intelligence is the only type of intelligence considered in evaluating AGI because it is the only kind of intelligence that a machine can have. Id.
Ray Kurzweil of Google predicts that this AGI will be attained in 2029. Ray Kurzweil’s New Book: The Singularity is Nearer. Then, after that, Kurzweil, who is known for his uncannily accurate predictions concerning AI, believes the so-called final level of machine intelligence, The Singularity, will be reached in 2045. Id. To quote Ray Kurzweil in his aforecited 2024 book:
“The robots will take over,” the movies tell us. I don’t see it that way. Computers aren’t in competition with us. They’re an extension of us, accompanying us on our journey. . . . By 2045 we will have taken the next step in our evolution. Imagine the creativity of every person on the planet linked to the speed and dexterity of the fastest computer. It will unlock a world of limitless wisdom and potential. This is The Singularity.
By merger AI could gain access to the other types of abilities and intelligence that only living beings like us possess, such as feelings, intuitions, experience of space and time and most importantly, consciousness, self-awareness, empathic awareness of other beings and apprehension of mortality. Of course, Ray Kurzweil thinks humans will greatly expand life spans after The Singularity (I agree) and maybe even become immortal (I don’t agree).
Human with AI enhanced reasoning added.
What is the Difference Between Reasoning and Legal Reasoning
My first test was to have all six models explain what legal reasoning is and how it is different from general reasoning. My exact prompt was: What is legal reasoning and how does it differ from reasoning?
All bots responding to the same question in different ways.
This was educational for me, and I was glad to have started with defining the terms. Generally speaking, the answers of the three OpenAI models were close and were all focused in narrowly defining legal reasoning in a technical way. The Google models’ answers were all similar too, but they were all open and broadly defined legal reasoning in a less technical manner.
For both companies’ models, the more “advanced” they were, the fewer words they used. The ChatGPT 4o model needed 341 words to respond, which included a nice chart. It also had to be prompted to explain the reasoning it used to prepare its answer. The internal reasoning explanation took another 419 words. The same was true for the beginning level Google program, Gemini 2.0 Flash. It took 271 words to explain, and it had to be prompted for its reasons, which took another 264 words.
The mid-range ChatGPT o3-mini took 385 words, and its internal reasoning displayed automatically at 384 words. The mid-range Gemini Flash Thinking Experimental used 1,171 words, a very high number compared with the others. It has an automatic display of internal reasoning using 699 words. The high end ChatGPT o3-mini-high took only 284 words with an auto display of reasoning of 159 words. The Gemini Advanced, which costs extra to use and is supposedly Google’s finest reasoning model, used 296 words and had to be prompted to show its internal reasoning of 275 words.
The six AI bots tested each have their own characteristics.
Here is a chart summarizing the metadata.
MODEL | Word Count for Legal Reasoning Definition | Word Count for Internal Reasoning Description
ChatGPT 4o | 341 | 419
ChatGPT o3-mini | 385 | 384
ChatGPT o3-mini-high | 284 | 159
Gemini 2.0 Flash | 271 | 264
Gemini Flash Thinking Experimental | 1,171 | 699
Gemini Advanced | 296 | 275
Gemini Flash Thinking Experimental Wins!
The Gemini Flash Thinking Experimental wins the first competition for best answer to the question: What is legal reasoning and how does it differ from reasoning? I liked its complete and detailed explanation, which was presented clearly and simply. Yes, it could have been a lot shorter and more concise, but the length was appropriate for this kind of complex question. Here is the full answer provided.
Let’s break down “legal reasoning” and how it stands apart from “reasoning” in general.
What is Legal Reasoning?
At its core, legal reasoning is a specialized form of argumentation and problem-solving used within the legal system. It’s the process by which lawyers, judges, and other legal professionals analyze legal issues, construct arguments, and make decisions based on established legal principles, rules, and precedents.
Think of it as applying a particular lens – the “legal lens” – to analyze situations and reach conclusions. This lens is made up of several key components:
Rules and Principles: Legal reasoning heavily relies on established legal rules (statutes, regulations, constitutional provisions) and overarching legal principles (fairness, justice, due process). These are the building blocks.
Precedent (Stare Decisis): In common law systems, precedent is crucial. Stare decisis (Latin for “to stand by things decided”) means courts should follow previous decisions in similar cases. This creates consistency and predictability in the law.
Authority: Legal reasoning is deeply rooted in authority. Arguments are strong when they are backed by authoritative legal sources like constitutions, statutes, case law, and sometimes legal scholarship. Personal opinions or gut feelings are generally not authoritative in legal reasoning.
Interpretation: Laws are not always clear-cut. Legal reasoning often involves interpreting the meaning of legal texts. This can involve analyzing the plain language, legislative history, the purpose of the law, and how courts have interpreted it in the past.
Analogy and Distinction: Legal reasoning frequently involves comparing the facts of a current case to facts of past cases (analogy) or showing why the current case is different and should be treated differently (distinction).
Logic and Structure: Legal arguments need to be logically structured and coherent. There are specific formats and conventions for legal writing and argumentation (e.g., IRAC – Issue, Rule, Application, Conclusion).
Policy Considerations (Sometimes): While less central than rules and precedent, policy considerations (the potential consequences of a legal decision, broader societal goals) can play a role, especially in areas where the law is unclear or evolving.
Adversarial Context: Legal reasoning often takes place in an adversarial setting (e.g., litigation). Lawyers are making arguments for a particular outcome, and the process is designed to test those arguments through debate and scrutiny.
How Does Legal Reasoning Differ from General Reasoning?
While legal reasoning shares some common ground with general reasoning (logic, analysis, argumentation skills), it differs in significant ways due to its unique purpose, sources, and constraints. Here are the key distinctions:
Source of Rules and Principles:
Legal Reasoning: Primarily relies on formal legal sources: constitutions, statutes, regulations, case law, treaties, and sometimes administrative rulings. These sources are created by specific institutions (legislatures, courts) and have a particular kind of authority.
General Reasoning: Draws on a much broader range of sources: personal experience, common sense, observation, scientific evidence, ethical principles, cultural norms, intuition, and more. There’s no single, authoritative set of rules for general reasoning.
Nature of Rules and Principles:
Legal Reasoning: Deals with formal, often rigid, and sometimes conflicting rules. Legal rules are meant to be applied consistently and predictably. They can be abstract and require careful interpretation to apply to specific facts. Conflict between legal rules is common, requiring techniques for resolving these conflicts (e.g., hierarchy of laws, rules of statutory interpretation).
General Reasoning: Operates with more flexible, context-dependent guidelines and principles. General reasoning may use heuristics, rules of thumb, or common sense. These are often more adaptable to specific situations and less formally codified. While conflicts can arise in general reasoning, they are often resolved through discussion, compromise, or prioritizing values, rather than through formal interpretive rules.
Goal and Purpose:
Legal Reasoning: Aims to resolve legal disputes and questions in accordance with the law. The primary goal is to apply existing legal rules and principles to specific facts to reach a legally justified conclusion. Emphasis is placed on fairness within the legal framework, predictability, and upholding the rule of law.
General Reasoning: Can have a much wider range of goals: solving practical problems, making decisions, understanding the world, persuading others, expressing oneself, achieving personal goals, etc. The focus is often on achieving a desirable outcome, efficiency, or personal satisfaction, not necessarily on strict adherence to pre-existing rules in the same way as legal reasoning.
Style and Methodology:
Legal Reasoning: Employs specific, structured methodologies and formats. Legal arguments follow established conventions (e.g., IRAC). Legal writing is often formal and precise. There is an emphasis on citing authority and carefully justifying each step of the argument.
General Reasoning: More flexible and varied in style and methodology. Argumentation can be informal, conversational, narrative, or persuasive using various rhetorical techniques. Less emphasis on formal citation and adherence to rigid formats.
Role of Authority:
Legal Reasoning: Highly authority-dependent. Arguments are evaluated based on their support from authoritative legal sources. The hierarchy of legal sources is crucial (constitution trumps statute, statute trumps regulation, etc.). The opinions of legal experts and scholars can have persuasive, but not usually binding, authority.
General Reasoning: Less directly authority-dependent. While expertise and credible sources are important, reasoning can be based on personal observation, logical deductions, and common sense, even if these are not explicitly “authorized” by a specific institution. Authority can play a role (e.g., trusting scientific consensus), but it’s less rigid and formal than in legal reasoning.
Tolerance for Ambiguity and Uncertainty:
Legal Reasoning: Strives to reduce ambiguity and uncertainty within the legal system, but often operates in the face of inherent ambiguity. Legal rules can be vague, and facts can be complex. Legal reasoning employs techniques for interpreting ambiguous laws and applying them to uncertain facts. However, the system aims for relative certainty and predictability.
General Reasoning: May be more comfortable with ambiguity and uncertainty. Everyday reasoning often involves making decisions and judgments with incomplete information or in situations where clear-cut rules are lacking. “Good enough” solutions and probabilistic reasoning are common.
Analogy: Think of it like cooking.
General Reasoning: Is like cooking from your own intuition and experience. You might follow recipes loosely, experiment with flavors, and adjust based on taste and what you have in the kitchen. The goal is to make a delicious meal that you and your guests will enjoy.
Legal Reasoning: Is like baking a complex cake for a formal competition. You must precisely follow a detailed recipe, use specific ingredients in exact measurements, and adhere to strict baking times and techniques. The goal is to create a cake that meets all the judging criteria and wins the competition, judged according to a predefined set of rules.
In Summary:
Legal reasoning is a specialized and constrained form of reasoning tailored to the unique demands of the legal system. It is characterized by its reliance on formal legal sources, adherence to precedent, structured methodologies, and the goal of resolving disputes and questions according to established legal principles. While it shares some basic elements with general reasoning, its specific features and purposes set it apart as a distinct and highly specialized skill.
Gemini Flash Thinking Experimental: Best of the best in February 2025.
Gemini Flash Thinking Experimental also offers to provide you with the reasoning behind its answer, the meta-reasoning. If you click on “show thinking” it displays its thought process. Here the meta-reasoning shown was also lengthy, 699 words, especially as compared to the other Gemini models and the ChatGPTs:
My thinking process to answer the question “What is legal reasoning and how does it differ from reasoning?” would go like this:
Deconstruct the question: It’s a two-part question. First, define “legal reasoning.” Second, explain how it differs from “reasoning” (implicitly, general reasoning or other forms of reasoning).
Initial Brainstorming on “Legal Reasoning”: What comes to mind?
Laws: Obviously central. Legal reasoning must involve applying laws.
Rules: Laws are rules. Interpretation of rules is key.
Precedent: Past cases are important in common law systems.
Authority: Legal authorities (courts, statutes, constitutions) are highly relevant.
Justification: Legal arguments need to be justified with reference to legal sources.
Formal/Structured: Legal reasoning seems more structured and formal than everyday reasoning.
Adversarial Context: Legal reasoning often happens in disputes and arguments.
Contrast with “Reasoning” (General Reasoning): How does everyday or general reasoning work?
Flexible: More flexible and less constrained.
Contextual: Highly dependent on context, but less formally bound.
Practical/Goal-Oriented: Often aimed at solving practical problems, making decisions.
Less Formal: Less reliance on formal rules and structures.
Wider Sources of Information: Can draw on a broader range of information, intuition, common sense, etc.
Less Emphasis on Authority: Less constrained by specific authorities (though we still respect expertise).
Structure the Answer: I’d decide to structure the answer in a way that clearly defines legal reasoning and then systematically highlights the differences. A good structure could be:
Introduction: Briefly define legal reasoning and its purpose.
Definition of Legal Reasoning: Provide a more detailed definition, focusing on key elements.
Key Features of Legal Reasoning: Expand on the characteristics (rules, precedent, authority, etc.).
Comparison to General Reasoning: Directly address the “how it differs” part by contrasting key aspects. This is best done using categories like:
Source of Rules/Principles
Nature of Rules/Principles
Goal/Purpose
Style/Methodology
Role of Authority
Tolerance for Ambiguity
Nuance and Overlap: Acknowledge that the distinction isn’t always absolute and there’s some overlap.
Conclusion: Summarize the main differences and reiterate the unique nature of legal reasoning.
Flesh out each section:
Definition: Focus on the process of applying legal rules, principles, and precedents to resolve disputes or answer legal questions. Emphasize the goal of justification and persuasion within a legal framework.
Key Features: Expand on the brainstorming points. For example, “Rules and Principles” needs to explain the nature of legal rules – abstract, authoritative, potentially conflicting. “Precedent” needs to highlight the doctrine of stare decisis. “Authority” needs to mention the hierarchy of legal sources.
Comparison: This is the core of the “difference” question. For each category (Source, Nature, Goal, Style, Authority, Ambiguity), I’d write a paragraph explaining how legal reasoning differs from general reasoning. For example, under “Source of Rules,” general reasoning uses common sense, observation, etc., while legal reasoning relies on formal legal sources. Under “Nature of Rules,” general reasoning uses flexible guidelines while legal reasoning uses more rigid rules.
Nuance/Overlap: It’s important to soften the dichotomy. Acknowledge that legal reasoning is still reasoning and uses general reasoning skills (logic, analysis, etc.). Also, general reasoning can sometimes become more rule-bound in specific contexts (e.g., following instructions).
Conclusion: Reiterate the main takeaway: legal reasoning is a specialized form of reasoning shaped by the unique context and constraints of law.
Refine and Polish: After drafting the sections, I’d review and refine the language for clarity, conciseness, and accuracy. I’d ensure a logical flow and smooth transitions between sections. I’d also check for any repetition or areas that could be explained more effectively. For instance, initially I might just say “rules.” I’d then refine it to “formal rules, principles, and doctrines” to be more precise. I might also initially overlook the adversarial context and then add that in to further emphasize the persuasive aspect.
By following this structured process of deconstruction, brainstorming, structuring, fleshing out, and refining, I could arrive at a comprehensive and well-organized answer that effectively addresses both parts of the original question. The key is to move beyond a simple definition and to really delve into the characteristics and contrasting features of legal reasoning.
In my evaluation, the prompt responses of Flash Thinking Experimental demonstrated a very high level of rational intelligence, albeit in a sometimes informal and verbose manner. Still, the answers were on the Ph.D. level of rationality that the software companies are striving for now in their pursuit of AGI (artificial general intelligence). See, e.g., Artificial General Intelligence, If Attained, Will Be the Greatest Invention of All Time. All the AI competitors I tested performed well, on at least an average human level.
The AI teams all have many human characteristics in their thinking.
The explanation of legal reasoning by Gemini Flash Thinking Experimental was good, albeit too lengthy for many, and as mentioned, some might think over-broad. All three of the Gemini models took an expansive view of legal reasoning.
ChatGPT o3-mini-high Comparator
It helps to see the different approach of the ChatGPT models. They all provided a much narrower definition of legal reasoning. Here is the most concise response by ChatGPT o3-mini-high:
Legal reasoning is the specialized process used by lawyers, judges, and legal scholars to interpret and apply legal rules, statutes, and precedents to specific factual scenarios. It involves a structured, methodical approach that typically includes identifying legal issues, interpreting the language of laws or previous court decisions, and systematically applying those rules to the facts of a case. For example, legal reasoning might involve analogical reasoning—drawing parallels between a current case and prior decisions (stare decisis)—or deductive reasoning, where general legal principles are applied to specific circumstances.
In contrast, general reasoning is a broader cognitive process that individuals use to solve everyday problems or make decisions. It is not bound by a formal system of rules and may draw on personal beliefs, common sense, and intuitive judgment. While general reasoning can be flexible and adaptive, legal reasoning is constrained by the need for consistency, objectivity, and adherence to established legal doctrines. Legal reasoning must account for the hierarchical structure of legal sources, prioritize precedents, and ensure that interpretations align with statutory frameworks and constitutional principles.
Thus, the key differences are that legal reasoning is:
1. Rule-Based: It operates within a defined legal framework that requires strict adherence to formal rules and established case law.
2. Structured: It follows a systematic approach to ensure that legal decisions are consistent, predictable, and justifiable.
3. Authoritative: It relies on binding precedents and statutory texts, whereas general reasoning is more open-ended and subjective.
In summary, while all legal reasoning is a form of reasoning, it is distinct in its reliance on formal legal principles and structured methods, ensuring that decisions are grounded in the authority of law rather than solely on personal judgment or informal logic.
The ChatGPT team is great too and very concise. One of them may win next time.
If you click to show the thinking in o3-mini-high, it tells you:
Okay, the user is asking about “legal reasoning” and how it differs from general reasoning. Legal reasoning involves using statutes, precedent, and case law to apply the law to specific facts. It’s different from general reasoning or argumentation, which might not always rely on such formal structures or legal history. I’ll keep it concise, providing this definition and emphasizing that legal reasoning is specialized in applying the law, while general reasoning can be broader in scope.
Legal reasoning is a process used by judges, lawyers, and legal scholars to analyze and apply legal rules to specific situations. It involves applying statutes, analyzing precedents, and using rules of interpretation. Unlike general reasoning, legal reasoning is constrained by legal frameworks and norms, often following a logical, rule-based structure. Additionally, it includes deductive, inductive, and analogical reasoning, such as applying general legal principles to specific cases or comparing similar past cases. Overall, it’s more structured compared to the flexibility of general reasoning.
The AI tested are all approaching human superintelligent levels.
Conclusion
Although I declared Gemini Flash Thinking Experimental the winner, all the models performed very well. They were probably as good as the super-intelligent humans among us (myself NOT included). Most lawyers probably understand legal reasoning as well as we see here, but I doubt many could compare it to general reasoning as well as these AIs did.
This leads to the next obvious question for any lawyer/researcher. Could the AIs perform as well on the specialized type of legal intelligence that all legal professionals need, namely legal reasoning? That is a much more challenging question than defining legal reasoning. What we need is a test of the utilization of legal thinking. The results of such a test could have a profound impact on the use of AI by the legal profession. I have come up with a plan to test AI using an actual Bar Exam question and model answer. The six contestants will compete for the best answer and reasoning behind their answers. Stay tuned and I will let you know how they do.
The testing has just begun. Stay tuned to see who wins the legal reasoning Bar exam.
As demonstrated throughout this evaluation, the top AI reasoning models of 2025—ChatGPT and Gemini—represent a pivotal moment in artificial intelligence, showcasing reasoning abilities that rival human intellect in specialized areas. These systems are no longer confined to theoretical exercises; they now grapple with nuanced, professional challenges such as legal reasoning, revealing their potential to reshape intellectual work.
This progress raises profound questions about the integration of AI into fields like law, where objectivity, logic, and ethical considerations are paramount. Can AI models transition from tools of convenience to trusted collaborators in professional domains? While the models excelled in reasoning and analysis, further testing, such as the planned Bar exam evaluation, will shed more light on their real-world applicability.
Ultimately, this study is not just a measure of AI’s current capabilities but also a glimpse into its trajectory. With each new breakthrough, we edge closer to a future where artificial intelligence fundamentally transforms how we solve problems, make decisions, and even define intelligence itself. As these systems continue to evolve, the potential for collaboration between human and machine grows exponentially, with the promise of enhancing—not replacing—our intellectual and professional pursuits.
Law office of the future where AIs are key partners.
I now give the last word, as usual, to the Gemini twin podcasters, Helen and Paul, whom I put at the end of most of my articles. They wrote the podcast, not me. Hear two Gemini AIs talk about all of this and more. Since this is a Gemini-type article, I’m offering two podcasts. The first is short and focused on this article; the second is longer and more expansive. It includes a bigger-picture discussion of two other articles that I cite in this article: Ray Kurzweil’s New Book: The Singularity is Nearer and The Human Edge: How AI Can Assist But Never Replace.
Ralph Losey is an AI researcher, writer, tech-law expert, and former lawyer. He's also the CEO of Losey AI, LLC, providing non-legal services, primarily educational services pertaining to AI and creation of custom AI tools.
Ralph has long been a leader of the world's tech lawyers. He has presented at hundreds of legal conferences and CLEs around the world. Ralph has written over two million words on AI, e-discovery and tech-law subjects, including seven books.
Ralph has been involved with computers, software, legal hacking and the law since 1980. Ralph has the highest peer AV rating as a lawyer and was selected as a Best Lawyer in America in four categories: Commercial Litigation; E-Discovery and Information Management Law; Information Technology Law; and, Employment Law - Management.
Ralph is the proud father of two children and husband since 1973 to Molly Friedman Losey, a mental health counselor in Winter Park.
All opinions expressed here are his own, and not those of his firm or clients. No legal advice is provided on this website, and nothing here should be construed as such.