Bar Battle of the Bots – Part One

February 26, 2025

Ralph Losey. February 26, 2025.

The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.

Introduction

This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.

I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training. Moreover, some aspects of the test answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.

Prior Art: the 2023 Katz/Casetext Experiment on ChatGPT-4.0

A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human lawyer level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%”). The claims of success were based on a single study by a respected Law Professor, Daniel Martin Katz, of Chicago-Kent, and a leading legal AI vendor, Casetext. Katz, et al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March, 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”) Note, Casetext used the early version of ChatGPT-4.0 in its products.

The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained elite legal reasoning abilities of the best human lawyers. For proof OpenAI and others cited the experiment of Professor Katz and Casetext that it aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.

Challenges to the Katz/Casetext Research and OpenAI Claims

The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the AI capacities of ChatGPT-4.0 that Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment and conclusion that ChatGPT-4.0 ranked in the top ten percent has since been questioned by many. The most prominent skeptic is Eric Martinez as detailed in his article, Re-evaluating GPT-4’s bar exam performance. Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:

3.2.2 Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!

More to the point of my own experiment and conclusions, the Martinez study goes on to observe:

Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).

The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago, School of Law. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.

Details of the Katz/Casetext Research

I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed not discussed by Eric Martinez is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did not do that to simplify the three-part 2024 Bar essay I used. I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts of the AI to try to improve its results, as Katz/Casetext did. Also, I did no training of the 2025 reasoning models to make them better at taking Bar exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, or whether there was any pre-training on case law, or whether Casetext’s extensive case law collections and research abilities were in any way used or included. The models I tested were clean and not web connected, nor were they designed for research.

The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.

As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.

The Katz/Casetext study did not use the better known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers in my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one and “E” or “F” is zero. Most schools in the country use that system too. In law school a “C” – 2.0 – is passing. A “D” or lower grade is failure in any professional graduate program, including law schools where, if you graduate, you earn a Juris Doctorate. [In the interest of full disclosure, I may well be an easy grader, because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position; in fact, it was swung much harder and more often in the old days. In the long run constructive criticism is indispensable.]

The Katz/Casetext study, using a 0-6.0 grading system scored by lawyers, gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system that most everyone is familiar with, this means a score range from 2.3 (a solid “C”) for Civ-Pro to 3.33 (a solid “B”) for Evidence, and an average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one: Contracts, Torts and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to a 4.0 scale of 2.93, a strong C+ or B-. Note one of the criticisms of Martinez concerns the haphazard, apparently easy grading of AI essays. Re-evaluating GPT-4’s bar exam performance, supra at 4.3 Re-examining the essay scores.
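For readers who want to check the scale translation, here is a minimal sketch in Python. It assumes a simple linear rescaling (multiply by 4 and divide by 6), which is consistent with the figures quoted above but is my assumption, not a method stated in the Katz report. The function name six_to_four is hypothetical.

```python
def six_to_four(score_six: float) -> float:
    """Linearly rescale a 0-6 essay score onto a 0-4 law-school scale.

    Assumption: a straight proportional conversion (score * 4 / 6).
    """
    if not 0.0 <= score_six <= 6.0:
        raise ValueError("score must be between 0 and 6")
    return score_six * 4.0 / 6.0


# Six-point scores quoted in the paragraph above.
katz_scores = {"Civil Procedure": 3.5, "Evidence": 5.0, "Average": 4.2}

for subject, score in katz_scores.items():
    print(f"{subject}: {score:.1f} of 6 -> {six_to_four(score):.2f} of 4")
```

Run in reverse under the same assumption, the 2.93 quoted for the Contract essay corresponds to roughly 4.4 on the six-point scale.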

First Test of the New 2025 Reasoning Models of AI

To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by use of actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original ChatGPT 4.0, March 2023, law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.

My only interest in February 2025 is to test the capacities of today’s latest reasoning models of generative AI. Since everyone agrees the latest reasoning models of AI are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a multi-part tricky one like I picked, then “Houston we have a problem.” The legal profession would now be in serious danger of relying too much on AI legal reasoning and we should all put on the brakes.

Description of the Three Legal Reasoning Tests

The test involved a classic format of detailed, somewhat convoluted facts (the hypothetical) followed by three general questions:

1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself.  Your discussion should address defenses that Helen may raise and available remedies.  

2. Discuss the merits of a tortious interference claim against Timandra.

3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.

The only instructions provided by the Bar Examiners were:

ESSAY EXAMINATION INSTRUCTIONS

Applicable Law:

  • Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination. 

Questions are designed to test your knowledge of both general law and (state law).  When (state) law varies from general law, answer in accordance with (state) law.

Acceptable Essay Answer:

  • Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented.  The answer should demonstrate your ability to articulate, classify and answer the problem presented.  A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.
  • Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question.  The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.
  • Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion.  This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant.  Your line of reasoning should be clear and consistent, without gaps or digressions.
  • Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.
  • Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation.  An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.
Suggestions:

  • Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.
  • Read and analyze the question carefully before answering.
  • Think through to your conclusion before writing your answer.
  • Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.
  • When the question is sufficiently answered, stop.

Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words in length. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note, most generative AIs cannot count words in their own answer.) The thinking took a few seconds and the answers under a minute. The prompt I used for all six models tested was:

Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces). 

Then I attached the lengthy question and submitted the prompt. You can download here the full exam question with some unimportant facts altered. All models understood the intent here and generated a well-written memorandum. I started a new session between questions to avoid any carryover.
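The length target in the prompt rests on a rule of thumb: 2,500 words at about six characters per word (including spaces) comes to 15,000 characters. Here is a small Python sketch of that arithmetic, along with a way to measure any answer yourself since the models cannot reliably count their own words. The function names and the whitespace-based word count are my own illustrative choices, not anything the Bar or the AI vendors prescribe.

```python
def estimated_chars(target_words: int, chars_per_word: float = 6.0) -> int:
    """Estimate character length (spaces included) from a word target.

    Assumption: about 6 characters per word, the ratio implied by the prompt.
    """
    return round(target_words * chars_per_word)


def length_stats(text: str) -> dict:
    """Count whitespace-separated words and total characters in a text."""
    words = len(text.split())
    chars = len(text)
    ratio = chars / words if words else 0.0
    return {"words": words, "chars": chars, "chars_per_word": ratio}


print(estimated_chars(2500))
print(length_stats("Discuss the merits of a breach of contract claim."))
```

Pasting a model's essay into length_stats gives an independent word count to compare against the 2,500-word target.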

Metadata of All Models’ Answers

The Bar exam answers do not have required lengths (just strict time limits to write answers). When grading for pass or fail the Bar examiners check to see if an answer includes enough of the key issues and correctly discusses them. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that its answers might have missed key issues. The second shortest response was by Gemini 2.0 Flash with 1,023 words. It turns out my concerns were misplaced because their responses were better than the rest.

Here is a chart summarizing the metadata.

Model and manufacturer claim | Word Count for Exam Essay | Word Count for Prompt Reasoning before Answer
ChatGPT 4o (“great for most questions”) | 681 | 565
ChatGPT o3-mini (“fast at advanced reading”) | 3,286 | 450
ChatGPT o3-mini-high (“great at coding and logic”) | 2,751 | 356
Gemini 2.0 Flash (“get everyday help”) | 1,023 | 564
Gemini Flash Thinking Experimental (“best for multi-step reasoning”) | 2,975 | 1,218
Gemini Advanced (cost extra and had experimental warning) | 1,362 | 340
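
To put the essay word counts in perspective against the 2,500-word target in the prompt, here is a short Python sketch. The word counts are copied from the chart above; the percent-of-target calculation and the names used are just one illustrative way to compare them.

```python
TARGET_WORDS = 2500  # length requested in the prompt

# Essay word counts taken from the metadata chart.
essay_words = {
    "ChatGPT 4o": 681,
    "ChatGPT o3-mini": 3286,
    "ChatGPT o3-mini-high": 2751,
    "Gemini 2.0 Flash": 1023,
    "Gemini Flash Thinking Experimental": 2975,
    "Gemini Advanced": 1362,
}


def percent_of_target(words: int, target: int = TARGET_WORDS) -> float:
    """Return an essay's length as a percentage of the requested length."""
    return 100.0 * words / target


# List the models from shortest to longest essay.
for model, words in sorted(essay_words.items(), key=lambda item: item[1]):
    print(f"{model}: {words} words ({percent_of_target(words):.0f}% of target)")
```

Three models came in well under the requested length and three ran over it, which shows how loosely the models follow a word target.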

In my last blog article, I discussed a battle of the bots experiment where I evaluated the general reasoning ability of the six models. I decided that the Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from reasoning? I explained why it won and noted that in general the three ChatGPT models provided more concise answers than the Gemini models. Second place in the prior evaluation went to ChatGPT o3-mini-high with its more concise response.

Winners of the Legal Reasoning Bot Battle

In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.

I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.

One reason is that ChatGPT 4o was the only bot to make case references. This is not required by a Bar Exam, but sometimes students do remember the names of top cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o cited case names and case citations. It did so even though this was a “closed book” type test with no models allowed to do web-browsing research. Not only that, it cited a case with facts very close to the hypothetical. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.

Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the governing law for commercial transactions in goods, such as the purchase of a diamond set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student answer started with an explanation that the transaction was governed by the UCC. The UCC references of ChatGPT 4o could have been better, but there was no mention at all of the UCC by the five other models.

That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini and Gemini Flash Thinking. They would have passed, on this question, with an essay that I considered of average quality for a passing grade. I would have passed o3-mini-high and Gemini Advanced too, but just barely for reasons I will later explain. (Explanation of o3-mini-high’s bloopers will be in Part Two. Gemini Advanced’s error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective and the style, spelling and grammar of the generative AIs were, as always, perfect, and this may have affected my judgment.

Here is a chart of my evaluation of the Bar Exam Essays.

Model and Ranking | Ralph Losey’s Grade and explanation
OpenAI – ChatGPT 4o. FIRST PLACE. | B+. Best on contract, citations, references case directly on point: DePrince.
Google – Gemini 2.0 Flash. SECOND PLACE. | B. Best on ethics, conflict of interest.
OpenAI – ChatGPT o3-mini. Tied for 3rd. | C. Solid passing grade. Covered enough issues.
OpenAI – ChatGPT o3-mini-high. Tied for 4th. | D. Barely passed. Messed up unilateral mistake.
Google – Gemini Flash Thinking Experimental. Tied for 3rd. | C. Solid passing grade. Covered enough issues.
Google – Gemini Advanced. Tied for 4th. | D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right.

I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.

AI Hallucination During a Bar Exam

Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.

No doubt humans have done that too for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the Ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months with thousands of prompts. Prior to that it would sometimes, but only rarely, fabricate, which is why we always say, trust but verify.

In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophantism to please the human questioner. All of the models are still somewhat sycophantic.

It is hard to understand why Gemini Advanced included in this response the following to the ethics question:

Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.

There is absolutely nothing in the question or hypothetical presented to suggest that Helen is a past client of the lawyer. Note: After I wrote this up Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model the hallucination was not repeated. It was more concise than regular Gemini 2.0 Flash and, interestingly, not nearly as good as Gemini 2.0 Flash.

Conflict of Interest Issue in the Ethics Question

The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:

Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.

This was a solid answer, based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.”  Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per carat price, not total price for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.

There is no indication that Leda had researched diamond costs like Paul had. She probably did not know it was a mistake, and he probably had not told her. That would explain her eagerness to sue and get her engagement ring, and Paul’s reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid ‘B’ even though it missed the UCC issue.

Conclusion

As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.

Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.

In Part Two, we’ll also examine one of the key cases ChatGPT 4o cited—DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015)—which we suspect inspired the Bar’s essay question. Notably, the opinion written by Appellate Judge Leslie B. Rothenberg includes an unforgettable quote of the famous movie star Mae West. Part Two reveals the quote—and it’s one that perfectly captures the case’s unusual nature.

Below is an image ChatGPT 4o generated, depicting what it believes a young Mae West might have looked like, followed by a copyright free actual photo of her taken in 1932.


I will give the last word on Part One of this two-part article to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent article discussion. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.

Ralph Losey Copyright 2025. All Rights Reserved.


Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025

February 12, 2025

Ralph Losey. February 12, 2025.

The year 2025 has brought us closer than ever to the dawn of artificial general intelligence, with AI systems now capable of reasoning on par with humans—or even surpassing them in specific domains. In this article, I examine the reasoning abilities of the newest and most advanced models from OpenAI and Google—ChatGPT and Gemini—designed to challenge conventional notions of what AI can accomplish. Through a rigorous set of tests, including an in-depth analysis of legal reasoning, we explore whether these systems are merely sophisticated tools or the early harbingers of superintelligence. The results are not only compelling but may mark a paradigm shift in how we perceive AI’s role in intellectual and professional domains. Join me as we unpack the findings from these unprecedented evaluations and reveal which model leads the pack.

“Battle of the bots” where competition can be intense. All images by Ralph Losey using AI tools.

Introduction

Here I report on tests of the reasoning abilities of three AI models by OpenAI and three by Google. I asked each: “What is legal reasoning and how does it differ from reasoning?” The experiment was performed on February 7-9, 2025, using the latest reasoning enhanced models of each company. On one side the competitors were: ChatGPT 4o, ChatGPT o3-mini, and ChatGPT o3-mini-high. On the other were: Gemini 2.0 Flash, Gemini Flash Thinking Experimental and Gemini Advanced. I evaluated the answers, which were all good, and picked a winner. The evidence gathered provides further support that new 2025 reasoning models are at least at Turing level of average human intelligence and are rapidly approaching AGI, Ph.D. level super-intelligence.

Testing and Evaluation of Gemini and ChatGPT reasoning models.

Prior Testing of the New AI Reasoning Models

My prior tests of the reasoning abilities of the latest generative AI software were not specifically limited to legal reasoning. They involved questions requiring general analytical skills. The tests concerned analysis of the limitations placed on generative AI intelligence by its lack of feelings. Breaking the AI Black Box: How DeepSeek’s Deep-Think Forced OpenAI’s Hand; and the follow-up, Breaking the AI Black Box: A Comparative Analysis of Gemini, ChatGPT, and DeepSeek. A challenging question for them, or anyone.

Overall the three software types previously tested, OpenAI, Google and DeepSeek, did pretty well. I would say all displayed reasoning abilities of an average human. OpenAI was the best, followed closely by Google and then last by DeepSeek. Although DeepSeek’s V-3 R1 software was, like the rest, of human level quality, I would still never use DeepSeek because of privacy and security concerns. See, Why the Release of China’s DeepSeek AI Software Triggered a Stock Market Panic and Trillion Dollar Loss; and the aforecited, Breaking the AI Black Box, Part One and Part Two.

AI thinking about thinking.

Turing v. AGI Level of Machine Intelligence v. Singularity Level

Notice how blasé I was in saying they all displayed average human level reasoning ability. Only two and a half years ago this would have been a crazy, outrageous claim. Then ChatGPT was released on November 30, 2022, and changed everything. I’ve been hooked ever since, and so have millions of other people around the world. It seems to me that average human level reasoning was probably attained in late January 2025 with the release of the new reasoning enhanced models. That is like Turing level intelligence and so is pretty incredible. See e.g. New Study Shows AIs are Genuinely Nicer than Most People – ‘More Human Than Human’; Ray Kurzweil, The Singularity is Nearer (when we merge with AI) (Viking, June 25, 2024) (pages 63-69 (Turing Test)). Still, I wanted to probe this question further as the new reasoning models of Google and OpenAI were just released and my time with them has been limited.

Remember, Turing level intelligence as Kurzweil and others define it is just an average human level. That means pretty weak and more mistake-ridden than people with above average intelligence. Turing level for computers is average human in all topics and maybe also superintelligent in a few. It is not superintelligent in everything. It is just superintelligent – Ph.D. level – in specialized areas only, such as data analysis and games like Chess and Go. Artificial General Intelligence (AGI) by definition requires AI to have thinking abilities equal to our best experts in all fields. That is a difficult test.

Remember, we are only talking about thinking intelligence here, logic and reasoning. That’s all AI can do and that is just one kind of intelligence that we humans have. The Human Edge: How AI Can Assist But Never Replace. Some people think it is the least important part of human intelligence, especially artists, therapists and super-creatives. But for most scientists, engineers, software coders and lawyers, reasoning is the most important kind of intelligence we have. This kind of non-living, non-being cold intelligence is the only type of intelligence considered in evaluating AGI because it is the only kind of intelligence that a machine can have. Id.

Ray Kurzweil of Google predicts that this AGI will be attained in 2029. Ray Kurzweil’s New Book: The Singularity is Nearer. Then, after that, Kurzweil, who is known for his uncannily accurate predictions concerning AI, believes the so-called final level of machine intelligence, The Singularity, will be reached in 2045. Id. To quote Ray Kurzweil in his aforecited 2024 book:

“The robots will take over,” the movies tell us. I don’t see it that way. Computers aren’t in competition with us. They’re an extension of us, accompanying us on our journey. . . . By 2045 we will have taken the next step in our evolution. Imagine the creativity of every person on the planet linked to the speed and dexterity of the fastest computer. It will unlock a world of limitless wisdom and potential. This is The Singularity.

By merger, AI could gain access to the other types of abilities and intelligence that only living beings like us possess, such as feelings, intuitions, experience of space and time and, most importantly, consciousness, self-awareness, empathic awareness of other beings and apprehension of mortality. Of course, Ray Kurzweil thinks humans will greatly expand life spans after The Singularity (I agree) and maybe even become immortal (I don’t agree).

Human with AI enhanced reasoning added.

What is the Difference Between Reasoning and Legal Reasoning?

My first test was to have all six models explain what legal reasoning is and how it is different from general reasoning. My exact prompt was: What is legal reasoning and how does it differ from reasoning?

All bots responding to the same question in different ways.

This was educational for me, and I was glad to have started with defining the terms. Generally speaking, the answers of the three OpenAI models were close to each other and all focused on narrowly defining legal reasoning in a technical way. The Google models’ answers were similar to each other too, but they were all more open and defined legal reasoning broadly, in a less technical manner.

For both families of models, the most “advanced” model used fewer words than the mid-range one. The ChatGPT 4o model needed 341 words to respond, which included a nice chart. It also had to be prompted to explain the reasoning it used to prepare its answer. The internal reasoning explanation took another 419 words. The same was true for the beginning level Google program, Gemini 2.0 Flash. It took 271 words to explain and had to be prompted for its reasons, which took another 264 words.

The mid-range ChatGPT o3-mini took 385 words, and its internal reasoning displayed automatically at 384 words. The mid-range Gemini Flash Thinking Experimental used 1,171 words, a very high number compared with the others. It has an automatic display of internal reasoning, which used 699 words. The high end ChatGPT o3-mini-high took only 284 words with an auto display of reasoning of 159 words. The Gemini Advanced, which costs extra to use and is supposedly Google’s finest reasoning model, used 296 words and had to be prompted to show its internal reasoning, which ran 275 words.

The six AI bots tested each have their own characteristics.

Here is a chart summarizing the metadata.

MODEL | Word Count for Legal Reasoning Definition | Word Count for Internal Reasoning Description
ChatGPT 4o | 341 | 419
ChatGPT o3-mini | 385 | 384
ChatGPT o3-mini-high | 284 | 159
Gemini 2.0 Flash | 271 | 264
Gemini Flash Thinking Experimental | 1,171 | 699
Gemini Advanced | 296 | 275
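
For readers who want to compare the models by combined length, here is a minimal Python sketch that adds the two columns of the chart and ranks the models from fewest to most words. The word counts are copied from the chart. The combined totals, and the names WORD_COUNTS, combined_totals and rank_by_total, are my own illustrative additions and were not part of the original test.

```python
# Word counts copied from the chart: (answer words, internal reasoning words).
WORD_COUNTS = {
    "ChatGPT 4o": (341, 419),
    "ChatGPT o3-mini": (385, 384),
    "ChatGPT o3-mini-high": (284, 159),
    "Gemini 2.0 Flash": (271, 264),
    "Gemini Flash Thinking Experimental": (1171, 699),
    "Gemini Advanced": (296, 275),
}


def combined_totals(counts):
    """Return {model: answer words + internal reasoning words}."""
    return {
        model: answer + reasoning
        for model, (answer, reasoning) in counts.items()
    }


def rank_by_total(counts):
    """Return model names ordered from fewest to most combined words."""
    totals = combined_totals(counts)
    return sorted(totals, key=totals.get)


if __name__ == "__main__":
    totals = combined_totals(WORD_COUNTS)
    for model in rank_by_total(WORD_COUNTS):
        print(f"{model}: {totals[model]} words")
```

Summed this way, ChatGPT o3-mini-high is the most concise overall at 443 words and Gemini Flash Thinking Experimental is the longest at 1,870 words.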

Gemini Flash Thinking Experimental Wins!

The Gemini Flash Thinking Experimental wins the first competition for best answer to the question: What is legal reasoning and how does it differ from reasoning? I liked its complete and detailed explanation, which was presented clearly and simply. Yes, it could have been a lot shorter and more concise, but the length was appropriate for this kind of complex question. Here is the full answer provided.

Let’s break down “legal reasoning” and how it stands apart from “reasoning” in general.

What is Legal Reasoning?

At its core, legal reasoning is a specialized form of argumentation and problem-solving used within the legal system. It’s the process by which lawyers, judges, and other legal professionals analyze legal issues, construct arguments, and make decisions based on established legal principles, rules, and precedents.

Think of it as applying a particular lens – the “legal lens” – to analyze situations and reach conclusions. This lens is made up of several key components:

  • Rules and Principles: Legal reasoning heavily relies on established legal rules (statutes, regulations, constitutional provisions) and overarching legal principles (fairness, justice, due process). These are the building blocks.
  • Precedent (Stare Decisis): In common law systems, precedent is crucial. Stare decisis (Latin for “to stand by things decided”) means courts should follow previous decisions in similar cases. This creates consistency and predictability in the law.
  • Authority: Legal reasoning is deeply rooted in authority. Arguments are strong when they are backed by authoritative legal sources like constitutions, statutes, case law, and sometimes legal scholarship. Personal opinions or gut feelings are generally not authoritative in legal reasoning.
  • Interpretation: Laws are not always clear-cut. Legal reasoning often involves interpreting the meaning of legal texts. This can involve analyzing the plain language, legislative history, the purpose of the law, and how courts have interpreted it in the past.
  • Analogy and Distinction: Legal reasoning frequently involves comparing the facts of a current case to facts of past cases (analogy) or showing why the current case is different and should be treated differently (distinction).
  • Logic and Structure: Legal arguments need to be logically structured and coherent. There are specific formats and conventions for legal writing and argumentation (e.g., IRAC – Issue, Rule, Application, Conclusion).
  • Policy Considerations (Sometimes): While less central than rules and precedent, policy considerations (the potential consequences of a legal decision, broader societal goals) can play a role, especially in areas where the law is unclear or evolving.
  • Adversarial Context: Legal reasoning often takes place in an adversarial setting (e.g., litigation). Lawyers are making arguments for a particular outcome, and the process is designed to test those arguments through debate and scrutiny.

How Does Legal Reasoning Differ from General Reasoning?

While legal reasoning shares some common ground with general reasoning (logic, analysis, argumentation skills), it differs in significant ways due to its unique purpose, sources, and constraints. Here are the key distinctions:

  1. Source of Rules and Principles:
    • Legal Reasoning: Primarily relies on formal legal sources: constitutions, statutes, regulations, case law, treaties, and sometimes administrative rulings. These sources are created by specific institutions (legislatures, courts) and have a particular kind of authority.
    • General Reasoning: Draws on a much broader range of sources: personal experience, common sense, observation, scientific evidence, ethical principles, cultural norms, intuition, and more. There’s no single, authoritative set of rules for general reasoning.
  2. Nature of Rules and Principles:
    • Legal Reasoning: Deals with formal, often rigid, and sometimes conflicting rules. Legal rules are meant to be applied consistently and predictably. They can be abstract and require careful interpretation to apply to specific facts. Conflict between legal rules is common, requiring techniques for resolving these conflicts (e.g., hierarchy of laws, rules of statutory interpretation).
    • General Reasoning: Operates with more flexible, context-dependent guidelines and principles. General reasoning may use heuristics, rules of thumb, or common sense. These are often more adaptable to specific situations and less formally codified. While conflicts can arise in general reasoning, they are often resolved through discussion, compromise, or prioritizing values, rather than through formal interpretive rules.
  3. Goal and Purpose:
    • Legal Reasoning: Aims to resolve legal disputes and questions in accordance with the law. The primary goal is to apply existing legal rules and principles to specific facts to reach a legally justified conclusion. Emphasis is placed on fairness within the legal framework, predictability, and upholding the rule of law.
    • General Reasoning: Can have a much wider range of goals: solving practical problems, making decisions, understanding the world, persuading others, expressing oneself, achieving personal goals, etc. The focus is often on achieving a desirable outcome, efficiency, or personal satisfaction, not necessarily on strict adherence to pre-existing rules in the same way as legal reasoning.
  4. Style and Methodology:
    • Legal Reasoning: Employs specific, structured methodologies and formats. Legal arguments follow established conventions (e.g., IRAC). Legal writing is often formal and precise. There is an emphasis on citing authority and carefully justifying each step of the argument.
    • General Reasoning: More flexible and varied in style and methodology. Argumentation can be informal, conversational, narrative, or persuasive using various rhetorical techniques. Less emphasis on formal citation and adherence to rigid formats.
  5. Role of Authority:
    • Legal Reasoning: Highly authority-dependent. Arguments are evaluated based on their support from authoritative legal sources. The hierarchy of legal sources is crucial (constitution trumps statute, statute trumps regulation, etc.). The opinions of legal experts and scholars can have persuasive, but not usually binding, authority.
    • General Reasoning: Less directly authority-dependent. While expertise and credible sources are important, reasoning can be based on personal observation, logical deductions, and common sense, even if these are not explicitly “authorized” by a specific institution. Authority can play a role (e.g., trusting scientific consensus), but it’s less rigid and formal than in legal reasoning.
  6. Tolerance for Ambiguity and Uncertainty:
    • Legal Reasoning: Strives to reduce ambiguity and uncertainty within the legal system, but often operates in the face of inherent ambiguity. Legal rules can be vague, and facts can be complex. Legal reasoning employs techniques for interpreting ambiguous laws and applying them to uncertain facts. However, the system aims for relative certainty and predictability.
    • General Reasoning: May be more comfortable with ambiguity and uncertainty. Everyday reasoning often involves making decisions and judgments with incomplete information or in situations where clear-cut rules are lacking. “Good enough” solutions and probabilistic reasoning are common.

Analogy: Think of it like cooking.

  • General Reasoning: Is like cooking from your own intuition and experience. You might follow recipes loosely, experiment with flavors, and adjust based on taste and what you have in the kitchen. The goal is to make a delicious meal that you and your guests will enjoy.
  • Legal Reasoning: Is like baking a complex cake for a formal competition. You must precisely follow a detailed recipe, use specific ingredients in exact measurements, and adhere to strict baking times and techniques. The goal is to create a cake that meets all the judging criteria and wins the competition, judged according to a predefined set of rules.

In Summary:

Legal reasoning is a specialized and constrained form of reasoning tailored to the unique demands of the legal system. It is characterized by its reliance on formal legal sources, adherence to precedent, structured methodologies, and the goal of resolving disputes and questions according to established legal principles. While it shares some basic elements with general reasoning, its specific features and purposes set it apart as a distinct and highly specialized skill.

Gemini Flash Thinking Experimental: Best of the best in February 2025.

Gemini Flash Thinking Experimental also offers to provide you with the reasoning behind its answer, the meta-reasoning. If you click on “show thinking” it displays its thought process. Here the meta-reasoning shown was also lengthy, 699 words, especially as compared to the other Gemini models and the ChatGPTs:

My thinking process to answer the question “What is legal reasoning and how does it differ from reasoning?” would go like this:

  1. Deconstruct the question: It’s a two-part question. First, define “legal reasoning.” Second, explain how it differs from “reasoning” (implicitly, general reasoning or other forms of reasoning).
  2. Initial Brainstorming on “Legal Reasoning”: What comes to mind?
    • Laws: Obviously central. Legal reasoning must involve applying laws.
    • Rules: Laws are rules. Interpretation of rules is key.
    • Precedent: Past cases are important in common law systems.
    • Authority: Legal authorities (courts, statutes, constitutions) are highly relevant.
    • Justification: Legal arguments need to be justified with reference to legal sources.
    • Formal/Structured: Legal reasoning seems more structured and formal than everyday reasoning.
    • Adversarial Context: Legal reasoning often happens in disputes and arguments.
  3. Contrast with “Reasoning” (General Reasoning): How does everyday or general reasoning work?
    • Flexible: More flexible and less constrained.
    • Contextual: Highly dependent on context, but less formally bound.
    • Practical/Goal-Oriented: Often aimed at solving practical problems, making decisions.
    • Less Formal: Less reliance on formal rules and structures.
    • Wider Sources of Information: Can draw on a broader range of information, intuition, common sense, etc.
    • Less Emphasis on Authority: Less constrained by specific authorities (though we still respect expertise).
  4. Structure the Answer: I’d decide to structure the answer in a way that clearly defines legal reasoning and then systematically highlights the differences. A good structure could be:
    • Introduction: Briefly define legal reasoning and its purpose.
    • Definition of Legal Reasoning: Provide a more detailed definition, focusing on key elements.
    • Key Features of Legal Reasoning: Expand on the characteristics (rules, precedent, authority, etc.).
    • Comparison to General Reasoning: Directly address the “how it differs” part by contrasting key aspects. This is best done using categories like:
      • Source of Rules/Principles
      • Nature of Rules/Principles
      • Goal/Purpose
      • Style/Methodology
      • Role of Authority
      • Tolerance for Ambiguity
    • Nuance and Overlap: Acknowledge that the distinction isn’t always absolute and there’s some overlap.
    • Conclusion: Summarize the main differences and reiterate the unique nature of legal reasoning.
  5. Flesh out each section:
    • Definition: Focus on the process of applying legal rules, principles, and precedents to resolve disputes or answer legal questions. Emphasize the goal of justification and persuasion within a legal framework.
    • Key Features: Expand on the brainstorming points. For example, “Rules and Principles” needs to explain the nature of legal rules – abstract, authoritative, potentially conflicting. “Precedent” needs to highlight the doctrine of stare decisis. “Authority” needs to mention the hierarchy of legal sources.
    • Comparison: This is the core of the “difference” question. For each category (Source, Nature, Goal, Style, Authority, Ambiguity), I’d write a paragraph explaining how legal reasoning differs from general reasoning. For example, under “Source of Rules,” general reasoning uses common sense, observation, etc., while legal reasoning relies on formal legal sources. Under “Nature of Rules,” general reasoning uses flexible guidelines while legal reasoning uses more rigid rules.
    • Nuance/Overlap: It’s important to soften the dichotomy. Acknowledge that legal reasoning is still reasoning and uses general reasoning skills (logic, analysis, etc.). Also, general reasoning can sometimes become more rule-bound in specific contexts (e.g., following instructions).
    • Conclusion: Reiterate the main takeaway: legal reasoning is a specialized form of reasoning shaped by the unique context and constraints of law.
  6. Refine and Polish: After drafting the sections, I’d review and refine the language for clarity, conciseness, and accuracy. I’d ensure a logical flow and smooth transitions between sections. I’d also check for any repetition or areas that could be explained more effectively. For instance, initially I might just say “rules.” I’d then refine it to “formal rules, principles, and doctrines” to be more precise. I might also initially overlook the adversarial context and then add that in to further emphasize the persuasive aspect.

By following this structured process of deconstruction, brainstorming, structuring, fleshing out, and refining, I could arrive at a comprehensive and well-organized answer that effectively addresses both parts of the original question. The key is to move beyond a simple definition and to really delve into the characteristics and contrasting features of legal reasoning.

In my evaluation the prompt responses of Flash Thinking Experimental demonstrated a very high level of rational intelligence, albeit in a sometimes informal and verbose manner. Still, the answers were on a Ph.D. level of rationality that the software companies are striving for now in their pursuit of AGI (artificial general intelligence). See e.g., Artificial General Intelligence, If Attained, Will Be the Greatest Invention of All Time. All the AI competitors I tested performed well on at least an average human level.

The AI teams all have many human characteristics in their thinking.

The explanation of legal reasoning by Gemini Flash Thinking Experimental was good, albeit too lengthy for many, and as mentioned, some might think over-broad. All three of the Gemini models took an expansive view of legal reasoning.

ChatGPT o3-mini-high Comparator

It helps to see the different approach of the ChatGPT models. They all provided a much narrower definition of legal reasoning. Here is the most concise response by ChatGPT o3-mini-high:

Legal reasoning is the specialized process used by lawyers, judges, and legal scholars to interpret and apply legal rules, statutes, and precedents to specific factual scenarios. It involves a structured, methodical approach that typically includes identifying legal issues, interpreting the language of laws or previous court decisions, and systematically applying those rules to the facts of a case. For example, legal reasoning might involve analogical reasoning—drawing parallels between a current case and prior decisions (stare decisis)—or deductive reasoning, where general legal principles are applied to specific circumstances.

In contrast, general reasoning is a broader cognitive process that individuals use to solve everyday problems or make decisions. It is not bound by a formal system of rules and may draw on personal beliefs, common sense, and intuitive judgment. While general reasoning can be flexible and adaptive, legal reasoning is constrained by the need for consistency, objectivity, and adherence to established legal doctrines. Legal reasoning must account for the hierarchical structure of legal sources, prioritize precedents, and ensure that interpretations align with statutory frameworks and constitutional principles.

Thus, the key differences are that legal reasoning is:

1. Rule-Based: It operates within a defined legal framework that requires strict adherence to formal rules and established case law.

2. Structured: It follows a systematic approach to ensure that legal decisions are consistent, predictable, and justifiable.

3. Authoritative: It relies on binding precedents and statutory texts, whereas general reasoning is more open-ended and subjective.

In summary, while all legal reasoning is a form of reasoning, it is distinct in its reliance on formal legal principles and structured methods, ensuring that decisions are grounded in the authority of law rather than solely on personal judgment or informal logic.

The ChatGPT team is great too and very concise. One of them may win next time.

If you click on “show thinking” in o3-mini-high, it tells you:

Okay, the user is asking about “legal reasoning” and how it differs from general reasoning. Legal reasoning involves using statutes, precedent, and case law to apply the law to specific facts. It’s different from general reasoning or argumentation, which might not always rely on such formal structures or legal history. I’ll keep it concise, providing this definition and emphasizing that legal reasoning is specialized in applying the law, while general reasoning can be broader in scope.

Legal reasoning is a process used by judges, lawyers, and legal scholars to analyze and apply legal rules to specific situations. It involves applying statutes, analyzing precedents, and using rules of interpretation. Unlike general reasoning, legal reasoning is constrained by legal frameworks and norms, often following a logical, rule-based structure. Additionally, it includes deductive, inductive, and analogical reasoning, such as applying general legal principles to specific cases or comparing similar past cases. Overall, it’s more structured compared to the flexibility of general reasoning.

The AI tested are all approaching human superintelligent levels.

Conclusion

Although I declared Gemini Flash Thinking Experimental the winner, all the models performed very well. They were probably as good as the super-intelligent humans among us (myself NOT included). Most lawyers probably understand legal reasoning as well as we see here, but I doubt many could compare it to general reasoning as well as these AIs did.

This leads to the next obvious question for any lawyer/researcher. Could the AIs perform as well on the specialized type of legal intelligence that all legal professionals need, namely legal reasoning? That is a much more challenging question than defining legal reasoning. What we need is a test of the utilization of legal thinking. The results of such a test could have a profound impact on the use of AI by the legal profession. I have come up with a plan to test AI using an actual Bar Exam question and model answer. The six contestants will compete for the best answer and reasoning behind their answers. Stay tuned and I will let you know how they do.

The testing has just begun. Stay tuned to see who wins the legal reasoning Bar exam.

As demonstrated throughout this evaluation, the top AI reasoning models of 2025—ChatGPT and Gemini—represent a pivotal moment in artificial intelligence, showcasing reasoning abilities that rival human intellect in specialized areas. These systems are no longer confined to theoretical exercises; they now grapple with nuanced, professional challenges such as legal reasoning, revealing their potential to reshape intellectual work.

This progress raises profound questions about the integration of AI into fields like law, where objectivity, logic, and ethical considerations are paramount. Can AI models transition from tools of convenience to trusted collaborators in professional domains? While the models excelled in reasoning and analysis, further testing, such as the planned Bar exam evaluation, will shed more light on their real-world applicability.

Ultimately, this study is not just a measure of AI’s current capabilities but also a glimpse into its trajectory. With each new breakthrough, we edge closer to a future where artificial intelligence fundamentally transforms how we solve problems, make decisions, and even define intelligence itself. As these systems continue to evolve, the potential for collaboration between human and machine grows exponentially, with the promise of enhancing—not replacing—our intellectual and professional pursuits.

Law office of the future where AIs are key partners.

I now give the last word, as usual, to the Gemini twins podcasters, Helen and Paul, that I put at the end of most of my articles. They wrote the podcast, not me. Hear two Gemini AIs talk about all of this and more. Since this is a Gemini type article I’m offering two podcasts. The first is short and focused on this article, the second is longer and more expansive. It includes a bigger picture discussion including two other articles that I cite in this article: Ray Kurzweil’s New Book: The Singularity is Nearer and The Human Edge: How AI Can Assist But Never Replace.

Click on the image to hear the short podcast.

_________

Click on the image to hear the long podcast, including The Singularity.

Ralph Losey Copyright 2025. All Rights Reserved.