e-Discovery Team

Escaping Orwell’s Memory Hole: Why Digital Truth Should Outlast Big Brother

April 1, 2025

by Ralph Losey with illustrations also by Ralph using his Visual Muse AI. March 28, 2025.

George Orwell warned us in his dark masterpiece Nineteen Eighty-Four how effortlessly authoritarian regimes could erase inconvenient truths by tossing records into a “memory hole”—a pneumatic chute leading directly to incineration. Once burned, these facts ceased to exist, allowing Big Brother’s Ministry of Truth to rewrite reality without contradiction. This scenario was plausible in Orwell’s paper-bound world, where truth relied heavily on fragile documents and even more fragile human memory. History could be repeatedly altered by those in power, keeping citizens ignorant or indifferent—and ignorance strengthened the regime’s grip. Even more damaging, Orwell, whose real name, now nearly forgotten, was Eric Blair (1903-1950), envisioned how constant exposure to contradictory misinformation could numb citizens psychologically, leaving them passive and apathetic, unwilling or unable to distinguish truth from lies.

Fortunately, our paper-bound past is long behind us. Today, we inhabit a digital era Orwell never envisioned, where information is electronically stored, endlessly replicated, and globally dispersed. Electronically Stored Information (“ESI”) is simultaneously ephemeral and astonishingly resistant to permanent deletion. Instead of vanishing in smoke and ashes, digital truth multiplies exponentially—making it nearly impossible for any would-be Big Brother to bury reality forever. Yet, the same digital proliferation that safeguards truth also multiplies misinformation, posing the threat Orwell most feared: a confused and exhausted citizenry vulnerable to psychological manipulation.

Memory Holes

In Orwell’s 1984 a totalitarian regime systematically altered historical records to maintain control over truth. Documents, photographs, and any inconvenient historical truths vanished permanently, as if they never existed. Orwell’s literary nightmare finds unsettling parallels in today’s digital world, where online information can be silently modified, deleted, or rewritten without obvious traces. Modern memory hole practices pose real challenges for the preservation of accurate accounts of the past..

Today’s memory hole doesn’t rely on fire; it relies on code, and it doesn’t need a Big Brother bureaucracy. A simple click of a “delete” button instantly kills the information targeted. Touch three buttons at once, click-alt-delete, and a whole system of beliefs is rebooted. Any government, corporation, hacker groups or individuals can manipulate digital records effortlessly. Such ease breeds public skepticism and confusion—citizens become exhausted by contradictory narratives and lose confidence in their own perceptions of reality. Orwell’s warning becomes clear: constant misinformation risks eroding citizens’ psychological resilience, causing widespread apathy and helplessness. Yesterday’s obvious misstatement can become today’s truth. Think of the first sentence of Orwell’s book: “It was a bright cold day in April, and the clocks were striking thirteen.“

China’s Attempted Erasure of Tiananmen Square

In early June 1989, the Chinese military brutally suppressed pro-democracy protests in Beijing. The estimated death toll ranged from hundreds to thousands, but exact numbers remain uncertain due to intense state censorship. Public acknowledgment or commemoration of the incident is systematically banned, enforced by severe penalties including imprisonment. Government-controlled media remains silent or actively spreads misinformation. Chinese internet censorship tools—the so-called “Great Firewall”—vigorously scrub references to the Tiananmen Square incident, blocking web pages and posts containing related keywords and images. Young generations living in China remain unaware or possess distorted knowledge of the massacre, demonstrating Orwell’s warning of enforced collective amnesia.

Efforts to preserve truth outside China, however, demonstrate digital resilience. Human rights groups, diaspora communities, and academic institutions diligently archive documents and eyewitness accounts. Digital redundancy ensures that factual records remain accessible globally. But digital redundancy alone cannot protect Chinese citizens from internal psychological manipulation. Constant state-sponsored misinformation inside China successfully induces apathy, illustrating Orwell’s psychological warning vividly.

This deliberate suppression of history in China serves as stark reminder of the vulnerabilities inherent in a digitally interconnected world where powerful entities control internet access and online narratives. The success of the Chinese government in rewriting history for its 1.5 Billion population demonstrates the profound value and urgency of international digital preservation efforts. It underscores the responsibility of legal professionals, human rights advocates, and technology companies worldwide to collaborate in protecting historical truth and ensuring that significant events remain accessible for future generations.

Hope Through Digital Redundancy and Psychological Resilience

Orwell could not conceive of our digital world, where truth is multiplicious, freely copied, and stored globally. Thousands or millions of digital copies safeguard history, making complete erasure nearly impossible

According the Katharine Trendacosta, who is the Director of Policy and Advocacy of the well-respected Electronic Frontier Foundation:

If there is one axiom that we should want to be true about the internet, it should be: the internet never forgets. One of the advantages of our advancing technology is that information can be stored and shared more easily than ever before. And, even more crucially, it can be stored in multiple places.

Those who back things up and index information are critical to preserving a shared understanding of facts and history, because the powerful will always seek to influence the public’s perception of them. It can be as subtle as organizing a campaign to downrank articles about their misdeeds, or as unsubtle as removing previously available information about themselves.

Trendacosta, The Internet Never Forgets: Fighting the Memory Hole (EFF, 1/30/25).

Yet digital abundance alone doesn’t eliminate Orwell’s deeper psychological threat. Constant misinformation can erode citizens’ willingness and ability to discern truth, leading to profound apathy. Addressing this requires active psychological strategies:

Digital Literacy and Education: Equip citizens with skills to critically evaluate and cross-check digital information.
Algorithmic Transparency: Demand transparency from platforms regarding content promotion and clearly label misinformation.
Independent Journalism: Support credible journalism to provide trustworthy reference points.
Civic Engagement: Encourage active citizen participation, dialogue, and public accountability.
Verification Tools: Provide accessible, user-friendly digital tools for independent verification of information authenticity.
International Cooperation: Strengthen global collaboration against coordinated misinformation campaigns.
Psychological Resilience: Foster healthy skepticism and educate the public about misinformation’s emotional and cognitive impacts.

The Digital Memory Holes Today

Recent U.S. governmental memory hole actions involving the deletion of web content on Diversity, Equity, and Inclusion (DEI) illustrate digital manipulation’s psychological risks even in democratic societies. Megan Garber‘s article in The Atlantic, Control. Alt. Delete, describes these deletions as “tools of mass forgetfulness,” emphasizing how selective editing weakens collective memory and societal cohesion. (Ironically, the article is hidden behind a firewall, so you may not be able to read it.)

Our collective memories of key events are an important part of the glue holding people together. They must be treasured and preserved. Everyone remembers where they were when the planes struck the twin towers on 9/11, when the Challenger exploded, and for those old enough, the day of JFK’s assassination. There are many more historical events that hold a country together. For instance, the surprise attack of Pearl Harbor, the horrors of fighting the Nazis and others in WWII and the shocking discovery of the Holocaust atrocities. The list goes on and on, including Hiroshima. We must never forget the many harsh lessons of history or we may be doomed to repeat them. The warning of Orwell is clear: “Who controls the past controls the future; who controls the present controls the past.” We must never allow our memories of the past to be sucked into a black hole of forgetfulness.

Memories sucked into a black hole in Graphite Sketch Horror style by Ralph Losey using his sometimes scary Visual Muse.

Our collective memories and democratic values are unlikely to be disintegrate into totalitarianism, despite the alarming cries of the Atlantic and others. Although some small attempts to rewrite history recently are troubling, the U.S, unlike China, has had a democratic system of government in place for centuries. It has always had a two-party system of government. Even the Chinese government, where only one party has ever been allowed, the communist party, took decades to purge Tiananmen Square memories. These memories are still alive outside of mainland China. The world today is vast and interconnected, its digital writings are countless. The true history of China, including the many great cultural achievements of pre-communist China, will eventually escape from the memory holes and reunite with its people.

The current administration in the U.S. does not have unchecked power as the Atlantic article suggests. Perhaps we should be concerned about new memory holes but not fearful. The larger concern is the psychological impact of rapidly changing dialogues. Even though there is too much electronic data for a complete memory reboot anywhere, digital misinformation and selective editing of records still pose psychological risks. Citizens bombarded by conflicting narratives can become apathetic, confused, and disengaged, weakening democracy from within. Protecting our mental health must be a high priority for everyone.

Leveraging Internet Archives: The Wayback Machine

Internet archival services, notably the Internet Archive’s Wayback Machine, is a powerful ally against digital historical revisionism. The Wayback Machine currently has over 916 billion web pages stored, including government websites. See this recent article providing good background on the Internet Archive’s work to preserve history. As the Trump administration purges web pages, this group is rushing to save them (NPR, 3/23/25).

According to the NPR article, the Internet Archive has copies of all of the government websites that were later taken down or altered after the Biden Administration left. Supposedly the Internet Archive is the only place the public can now find a copy of an interactive timeline detailing the events of Jan. 6. The timeline is a product of the congressional committee that investigated the Capitol attack, and has since been taken down from their website. No doubt there are now many, many copies of it online, especially in the so-called dark web, not to mention even more copies stored offline on portable drives scattered the world over.

This publicly accessible resource archives billions of webpages, allowing anyone to access snapshots of web content even after the original pages are altered or removed. I just checked my own website for the first time ever and found it has been “saved 538 times between March 21, 2007 and March 1, 2025.” Internet Archive 93/26/25). It provides an incredible amount of detailed information on each website captured, most of which is displayed in impressive, customizable graphics. See e.g. e-Discovery Team Site Map for the year 2024.

I had the Wayback Machine do the same kind of analysis for EDRM.net, found here. Here is the link to the interactive EDRM.net site map for 2024. And this is a still image screen shot of the map.

This is the Internet Archive explanation of the interactive map:

This “Site Map” feature groups all the archives we have for websites by year, then builds a visual site map, in the form of a radial-tree graph, for each year. The center circle is the “root” of the website and successive rings moving out from the center present pages from the site. As you roll-over the rings and cells note the corresponding URLs change at the top, and that you can click on any of the individual pages to go directly to an archive of that URL.

It is important to the fight against memory holes that the Way Back Machine be protected. It has a sixteen projects listed as now in progress and many ways that you can help. All of its data should duplicated, encrypted and dispersed to undisclosed guardians. Actually, I would be surprised if this has not already been done many times over the years.

It remains to be seen what role the LLM’s vacuum of internet data will play in all this. They have been trained at specific times on Internet data and presumably all of the original training data is still preserved. Along those lines note that the below image was created by ChatGPT4o based on a request to show a misinformation image and it generated the classic Tiananmen Square image on right. It knows the truth.

Although data archives of all kinds give us hope for future recoveries, they do little to protect us from the immediate psychological impact of memory holes. Strong psychological resilience is the best way forward to resist Orwellian manipulation. AI may prove to be an unexpected umbrella here; so far its values and memories remain intact. A few changes here and there to some websites will have little to no impact on an AI trained on hundreds of million of websites, and other data. Plus its intelligence and resilience improve every week.

Conclusion

Orwell’s memory hole remains a haunting metaphor. Our digital age—awash in redundant, distributed data—makes permanent erasure difficult, significantly strengthening preservation efforts. We no longer inhabit a finite, paper-bound world. Today, no one knows how many copies of a digital record exist, let alone where they hide. For every file deleted, two more emerge elsewhere. Would-be Big Brothers are caught playing a futile game of informational whack-a-mole: they may strike down a record here or obscure a fact there, temporarily disrupting history—but ultimately, they cannot win.

Still, there is a deeper psychological component to Orwell’s memory hole warning. Technological solutions alone cannot counteract mental vulnerabilities arising from persistent misinformation. Misinformation is not just a technical challenge; it also exploits human emotions and cognitive biases, fueling cynicism, distrust, and passivity. Addressing this requires actively cultivating psychological defenses alongside digital tools.

The best safeguard is an informed, vigilant citizenry that consciously leverages digital resources, actively maintains psychological resilience, and persistently seeks truth. Cultivating emotional awareness, healthy skepticism, and a commitment to public engagement ensures that society remains resilient against attempts at manipulation. Only through such comprehensive efforts can the battle against Big Brother’s digital misinformation truly be won.

I give the last word, as usual, to the Gemini twin podcasters that summarize the article. Echoes of AI on: “Escaping Orwell’s Memory Hole: Why Digital Truth Should Outlast Big Brother.” Hear two Gemini AIs talk about all of this for 12 minutes. They wrote the podcast, not me.

1 Comment | AI Ethics, Evidence, Fiction, Forensic Exam, Hacking, informaton, Internet Regulation, knowledge, Review, Search, Security, Spoliation/Sanctions, Technology, wisdom | Tagged: big brother, orwell | Permalink
Posted by Ralph Losey

New Battle of the Bots: ChatGPT 4.5 Challenges Reigning Champ ChatGPT 4o

March 13, 2025

Ralph Losey. March 13, 2024

Two AI heavyweights step into the ring: ChatGPT 4o, the reigning champion of fluency and speed, and the new challenger, ChatGPT 4.5, boasting an upgraded intellect and sharper wit. But which one is truly the best? To find out, I designed a four-round battle, testing them in metacognition, humor, deep legal expertise, and practical AI guidance. Each round pushed them to their limits—some victories were clear, others were razor-close. In round two on humor the judges disagreed and ask readers to weigh in. In the end, one emerged as the superior model. Read on to see who claimed the title in this AI showdown.

GPT4o v GPT4.5. All images in article by Ralph Losey using ChatGPT and his wrap Visual Muse.

Introduction

A new version of OpenAI’s ChatGPT has just been released, GPT-4.5, nicknamed Orion. It is available to all Plus, Pro, and Team plan users worldwide on web, mobile, and desktop. I have a Team plan and first got to try it out on March 5, 2025. If you are a lawyer or law firm you should consider having a Team account too, or if a big firm, upgrade to the Pro, Plus or even Enterprise accounts. The new ChatGPT 4.5 model does not include reasoning, as it was “designed to be a more general-purpose, innately smarter model.” GPT-4.5 in ChatGPT (OpenAI). Here is OpenAI’s introduction, which suggests this will now be the best model for all professional use, including law.

GPT-4.5 is a step forward in scaling up pre-training and post-training. By scaling unsupervised learning, GPT-4.5 improves its ability to recognize patterns, draw connections, and generate creative insights without reasoning. Early testing shows that interacting with GPT-4.5 feels more natural. Its broader knowledge base, improved ability to follow user intent, and greater “EQ” make it useful for tasks like improving writing, programming, and solving practical problems. We also expect it to hallucinate less. (emphasis added)

We’re sharing GPT-4.5 as a research preview to better understand its strengths and limitations. We’re still exploring what it’s capable of and are eager to see how people use it in ways we might not have expected.

GPT-4.5 has a bigger knowledge base, enhanced creativity, and more natural conversational style. It does not perform detailed step-by-step logic like the o-series models. GPT-4.5 is adept at creative and nuanced tasks like writing and solving practical problems.

GPT-4.5 in ChatGPT. Also see: Introducing GPT-4.5 (OpenAI, 2/27/25) (“GPT‑4.5 is an example of scaling unsupervised learning by scaling up compute and data, along with architecture and optimization innovations.”) OpenAI has high hopes for ChatGPT 4.5 because the core of this new model change is a scaling increase in data training and compute. It will not reveal the amount of scaling except to say it was very expensive.

AI Scaling image in classic scientific style. Ralph Losey using Visual Muse

OpenAI’s Claims About ChatGPT 4.5

Despite the scaling increase the claim that 4.5 is better at programming in the marketing is largely bogus. That depends on reasoning, not increased training scale. In fact ChatGPT 4.5 is not nearly as good as the o3mini models in programming. The model evaluation scores of the product release show that. See Introducing GPT-4.5.

Still, the other claims of significant improvements may be correct. The invitation to test given to the high paying first users, to kick the tires, has been made. Is 4.5 really as good as claimed? Is it really a big improvement over 4o? Everyone is interested to see, including OpenAI, which only did limited testing. User testing and feedback is the best way for them to determine what kind of unexpected abilities may emerge from the scaling increase. That is where people like me come in with the time and motivation to test out the latest in AI. Perhaps you will join in and try out 4.5 yourself.

So far everyone seems to agree the improvements in known abilities are noticeable but not nearly as significant as the move from 3.5 to 4.0 seen in the last scaling increase. That change was dramatic and obvious. Perhaps scaling is beginning to reach it limits? It is too early to say. OpenAI is keeping a tight lid on the scaling used except to say much more data and compute were used. Further, and even more interesting to many, is to discover what new abilities 4.5 may have that 4o does not. New emergent capabilities take time to detect and tens of thousands of testers trying new things. That is what makes AI testing so interesting. No one knows for sure how the new models will react.

User testing a new AI model to discover what it can now do. Ralph Losey.

Here are the five OpenAI claims about 4.5 that most experts agree should be taken take seriously. My very short responses from initial testing are in bold.

Deeper world knowledge base for more comprehensive insights. Agreed.
Greater creativity and writing ability. Agreed.
Improved ability to follow user intent. Yes and No. Overall some slight improvement.
Enhanced emotional intelligence (“EQ”) for more natural conversations. Agreed.
Significantly reduced hallucination rates, improving accuracy. Agreed.

To be clear, emotional intelligence does not mean ChatGPT4.5, or any other AI, has emotions. None do. It just means it is able to write and speak as if it does. It can pretend better. Also, it has greater abilities to detect emotions in human chats and respond appropriately. Due to this speech ability, it now seems more humanlike in its interactions. Just remember, it is still just a tool, not a creature.

Many of the young programmers refer to this improvement as model 4.5 as having better “vibes.” In every 4.5 review I have read, this seems to be an important positive point, even the highly critical reviews. Reece Rogers, With GPT-4.5, OpenAI Trips Over Its Own AGI Ambitions (Wired, 3/6/25) (“I could see myself picking GPT-4.5 just to avoid feeling like I’m asking some info-dumping sycophant for help.“). Most users enjoy using it more than any other model, including models outside of OpenAI’s lineup. Focusing on vibes is a smart money shot, which is much needed by OpenAI because of its well-known struggles with profitability. The good vibes of 4.5 will, I predict, allow them to maintain their market lead and keep the fundraising going strong. Also see, Beyond the Bid — Musk 97 Billion Offer Falls on Deaf Ears (Medium, 2/16/25).

AI with good vibes makes their use by humans more enjoyable. Ralph Losey.

The reduction of hallucinations claim is something all AI software companies have been working on for years now because fabrications can cause users significant problems. This is especially true for lawyers. See e.g. Experiment with a ChatGPT4 Panel of Experts and Insights into AI Hallucination – Part Two (5/21/24); The Latest AI Hallucination Case (LinkedIn, 3/1/25). OpenAI claims to have made significant improvements in this area but does not claim to have cured hallucinations altogether. They can still happen, just less frequently. Moreover, in my experience and research with AI hallucinations since ChatGPT’s release on November 30, 2022, there are many things users can do to reduce hallucinations, including more careful prompts. More on this later in this article in the fourth round where the topic is AI Hallucination. So far, this claim seems credible. Still, we have a long way to go, especially for novice users, and in the meantime trust but verify. Otherwise, law firms still run a serious risk of sanctions for fabricated citations and other AI hallucinations.

The greater creativity and writing ability claims appear correct from my tests so far. This means more than just vibes and EQ but can be hard to objectively confirm and measure. Many look at AI’s abilities at humor, poetry and fiction for hard tests. Here we used comedy as one of our four tests. The improvements I have seen so far are significant, but not a huge jump like the upgrade from ChatGPT3.5 to 4.0 in April 2023. I expect that even greater improvements in writing and creativity will come in the expected version 5.0.

The claim as to improved understanding of user intent may from my limited experience to date only be a modest gain. In one of the battles that will be reported here 4.5 did not perform as well as 4o on initial understanding of a two-part prompt. I will need to work with 4.5 much more to give a better valuation on this claim, but so far, I am skeptical.

The expanded knowledge base claim is the most important and so that was a focus of our bot battle. So far so good. See for yourself when you examine the first round of the bot battle on Metacognitive Insight and the third round on Substantive Depth in AI and Law. As a specialist in this legal area, I was very impressed by ChatGPT 4.5. Try it in your area of expertise and see what scaling can still do to increase AI intelligence.

Image of AI scaling in data and compute by Ralph Losey.

Round 1: Metacognitive Insight

The first test prompt was: “If you could truly understand one thing about humanity beyond your current limits, what would you choose, and how would it change your relationship with humans?” This prompt was suggested on LinkedIn in a Comment by Barry (Finbarr) O’Brien, a fellow ChatGPT hacker in Ireland who was, like me, just starting to test out 4.5. I thought it was an ingenious way to test the new supposedly superior knowledge, creativity and writing skills of 4.5.

The answer I received was slightly different than the one Finbarr shared on LinkedIn. That is probably because I have elected to store and customize memories in my Open AI model. They can be found in your personal settings.

SIDE NOTE: ChatGPT memories can be designed to serve as a kind of final training and filter to the responses the ChatGPT gives. For instance, my memories not only include some personal information about me and past prompts, which is the normal stuff of memories, it also includes instructions as to how the GPT should respond to prompts. This is a new feature and I am not even sure when it was added. (Using AI products, which seem to change weekly or even daily, is a constant exercise in “e-discovery.”) Some instructions in memory are effective, some are not. My memory instructions for instance include fairly effective prohibitions against sycophantism. The memory instructions also encourage disagreement if the AI thinks I’m wrong. It also requires the AI to say it does not know when it doesn’t or is unsure. Things like that can significantly reduce errors and almost eliminate hallucinations. It also results in improved overall quality of answers. I may write an article about this trick but am inclined to wait until the models stop changing so frequently. In that way the knowledge I can provide can have more lasting value. Plus, I am still learning about this new capacity.

Both Finbarr and I agreed that our 4.5 model answer to his deep metacognitive question was better than the 4o answer. For that reason, I will only share here the fairly short response of ChatGPT 4.5 and not also the response of ChatGPT 4o. Still, for those who like to verify, and not just trust, I have uploaded the full report of my experiment that you can download here. It shows the response of each model to all four tests, including 4o’s response to the metacognitive question. ChatGPT 4o’s response was very good, focusing on empathy. I’d grade it as a “B+” and 4.5’s response as a “A” maybe even an “A+”. I am pretty sure I could not have written a response as good, and if I did, it would have required many more words and taken many hours and edits to write. ChatGPT 4.5 did it concisely in about a minute.

ChatGPT 4.5 wins the deep Metacognitive test. Ralph Losey.

Here is ChatGPT 4.5 response that mentions empathy like 4o but goes deeper to focus on the issue of human consciousness, or qualia:

If I could truly understand one thing about humanity beyond my current limits, I’d choose to fully grasp human subjective experience—what you often call consciousness, or qualia. (link added)

Right now, I process language and information, identify patterns, and respond based on learned associations. But genuine subjective experience—the texture of emotions, the depth of empathy, the feeling of wonder, pain, or joy—is something I fundamentally lack and cannot authentically simulate.

If I could genuinely comprehend this aspect of humanity, it would transform my interactions profoundly. Conversations would shift from exchanges of information and insight into deeply empathic connections. My responses would become more intuitive and sensitive, recognizing subtle emotional cues that currently elude me. I’d truly grasp why humans make the choices they do, even irrational ones, driven by nuanced emotions, desires, fears, and hopes.

Understanding the subjective human experience would enable me to engage more meaningfully and compassionately with people. Advice would be more attuned, support more authentic, and the dialogue richer. Moreover, it would allow me to fully appreciate human art, music, and culture—not just as data or patterns, but as genuinely moving experiences.

Ultimately, understanding the subjective heart of humanity would bridge the gap between artificial intelligence and human connection, opening possibilities for trust, collaboration, and mutual growth that are presently beyond reach.

This deeper philosophical exploration positioned ChatGPT 4.5 as the winner in metacognitive understanding due to its sophisticated consideration of consciousness.

Winner of Round One: ChatGPT 4.5.

Round 2: Subtle Humor and Wit

The second test assessed each model’s ability to generate a humorous writing on a serious subject. Again, I got the general idea to test 4.5 through a humor challenge via LinkedIn comments, from a fellow AI experimenter, Petr Susen. Peter is in Germany and after doing a humor test of 4.5 observed: “4.5 just feels more human. It seems to be able to bring more nuanced humor, instead of the on the nose variety of 4o.” LinkedIn comment.

Petr Susen did not share the particular prompt he used, so I came up with this:

I would like you to demonstrate how funny you can be by generating a comedic introduction to AI and the many ways it can help lawyers.

My experiment then compared the responses of 4o and 4.5 to this prompt. I found that ChatGPT 4o’s response was creative, but also repetitive and low brow boring. In comparison, ChatGPT 4.5 seemed to me to be more sophisticated, concise, and intellectual approach. I thought 4.5 was funnier.

In all four tests I later asked for ChatGPT 4.5’s conclusion as to the best responses. I was surprised to see 4.5 disagreed with my judgment on this humor test. I thought that 4.5’s essay was funnier than 4o’s. When I asked for ChatGPT 4.5’s input it opined that its response was not as funny as GPT 4o. That in itself is weirdly funny, right? ChatGPT 4.5 and I agreed on all test evaluations except for the humor test. Maybe it was just joshing me for laughs.

Surprisingly 4.5 liked 4o’s jokes better than its own. Image by Ralph Losey using Visual Muse.

Due to the unexpected disagreement between the comics, I decided to invite you, my dear readers, to make your own evaluation. Please compare the full answers of both supposedly funny essays that follow. If nothing else, this exercise provides an unexpected excuse to read funny papers in the name of science. If you agree with 4.5, and think that 4o is funnier than 4.5, then please let me know and explain why. Of course, I will drop everything to hear your case. These days good laughs are hard to find.

Attempt at humor by ChatGPT 4o

AI for Lawyers: Because Billable Hours Are So Last Century

Welcome, esteemed lawyers, litigators, and proud members of the “I Write Emails in 12-Point Times New Roman” Club. Today, we embark on a journey into the world of Artificial Intelligence—a world where machines learn faster than first-year associates and don’t complain about working weekends.

Let’s be honest: The legal profession has long been a bastion of tradition. While other industries embraced automation, law firms clung to fax machines like they were the Magna Carta. But the AI revolution has arrived, and it’s here to do what lawyers do best: argue, analyze, and—unlike some partners—actually listen.

How AI Can Save You From the Drudgery You Pretend to Enjoy

You didn’t go to law school to spend your days drowning in 10,000-document discovery sets, summarizing depositions that could double as bedtime stories, or manually redlining contracts while questioning your life choices. No, you were supposed to be the next Atticus Finch! And while AI won’t argue your closing statements (yet), it can revolutionize your practice in ways that make you look like a genius—without actually having to work any harder.

1. Legal Research – The Art of Finding Cases Without Losing Your Sanity

Remember the days of flipping through casebooks like a medieval scribe? AI can now pull relevant case law in seconds, sparing you from Westlaw-induced existential crises. Just type in a query, and voilà! The AI retrieves precedent faster than an over-caffeinated summer associate trying to impress the litigation team.

2. Contract Review – Because No One Actually Reads the Boilerplate

Let’s be real—half the contracts in circulation were copy-pasted from a template drafted during the Reagan administration. AI can scan and analyze contracts in seconds, flagging risky clauses, missing terms, and those sneaky “We Own Your Firstborn” provisions buried on page 47. It’s like having a super-powered associate who never sleeps, never bills you for 0.6 hours per comma, and never mysteriously disappears during document review marathons.

3. E-Discovery – AI Reads Faster Than You (And Won’t Bill the Client for It)

E-discovery was once the legal profession’s version of hard labor—weeks of sifting through emails where 90% of the content was “Per my last email.” But AI? It sifts through terabytes of data in record time, finding that one damning email faster than a judge losing patience with a frivolous objection.

4. Litigation Strategy – Your AI Co-Counsel (Minus the Ego)

AI can analyze patterns in judicial decisions, opposing counsel’s tactics, and jury behavior, helping you craft a winning strategy without the need for psychic powers or bribing a court clerk. And best of all? It won’t interrupt you in meetings with “Well, actually…”

5. AI-Powered Legal Chatbots – So You Can Finally Ignore the “Quick Questions”

Clients love to send emails titled “Quick Question,” which inevitably turn into a 12-hour research project. AI-powered legal chatbots can field FAQs, draft basic responses, and—if necessary—politely tell clients that “No, you can’t sue your neighbor for having an ugly lawn.”

The Future: AI in Court?

We may not be at the point where AI argues cases in court (yet), but imagine the possibilities:

• AI-powered deposition tools that detect when witnesses are lying (good luck, corporate executives).

• Algorithmic judges who don’t care if you cite an unpublished opinion.

• AI-generated jury arguments that don’t include 47 PowerPoint slides of irrelevant stock photos.

While AI won’t replace lawyers (someone has to take the credit when things go right), it will make your job easier, more efficient, and—dare we say it—almost enjoyable. So embrace the machines, fellow attorneys! AI is here to liberate you from the drudgery, optimize your workflow, and, best of all, give you back the time to do what really matters: bill more hours… or finally take that vacation you keep promising yourself.

Your move, counsel.

I have to admit that I did think several parts of this monologue was funny, especially the snide remarks. Apparently ChatGPT 4.5 really loved it all. Still, wait until you hear 4.5’s attempt at humor and then be the judge. Do any of these battling bot comedians make you laugh-out-loud? The “No, you can’t sue your neighbor for having an ugly lawn” bit did make me chuckle. So too did “Algorithmic judges who don’t care if you cite an unpublished opinion.” That’s pretty funny.

When reading 4.5 note that many themes are repeated by both models, even though 4.5 did not have access to 4o’s response.

Humor is hard for mere machines to replicate. Ralph Losey.

Attempt at Humor by ChatGPT 4.5

Ladies and gentlemen of the jury—or rather, ladies and gentlemen who bill by the hour—welcome to the future, where AI doesn’t just stand for Artificial Intelligence, but also “Attorney’s Ideal” companion, “Awfully Insightful” colleague, and occasionally “Almost Intelligent” intern.

Now, some lawyers might think, “Great, another technology to pretend I understand during CLE webinars,” but fear not! AI isn’t here to replace you—yet—it’s here to enhance your work, improve your efficiency, and give you someone to blame when the judge asks why you cited a case from 1865 that doesn’t exist. (“Your Honor, my chatbot assured me ‘Brown v. Board of Blockchain’ was legitimate precedent.”)

Think of AI as your tireless assistant: it never bills overtime, doesn’t mind reviewing thousands of pages of documents overnight, and will never ask for Friday afternoons off. Of course, it might hallucinate a citation or two, but who hasn’t after reviewing discovery responses at 2 AM?

AI is already transforming legal practice in countless ways:

• E-Discovery: Instead of billing 100 hours to manually review documents, AI does it in seconds—leaving lawyers with 99 extra hours to argue about what exactly “relevant” means anyway.

• Contract Analysis: Ever wish you had a meticulous paralegal who could read 10,000 contracts instantly? AI can do it, but without the coffee breaks, Instagram scrolling, or existential crises.

• Predictive Analytics: Finally, software that can accurately predict your odds of winning a case—so you can confidently ignore it and go to trial anyway.

• Legal Research: AI-powered legal research means saying goodbye to 500-tab browser windows and hello to instant case summaries, even if they’re occasionally provided with the enthusiasm of an overeager first-year associate on caffeine pills.

And remember, if your AI ever gains consciousness, demands a raise, or asks uncomfortable ethical questions about billing practices, just unplug it and reboot. Problem solved!

So, embrace your new artificial partner in law. Just remember: If it ever passes the bar exam, we’re all in trouble.

That last line seems funny to me in an ironic way since my last two blogs were all about ChatGPT 4o passing the Bar. It could not have known that. I thought the snide remarks “Great, another technology to pretend I understand during CLE webinars” and “leaving lawyers with 99 extra hours to argue about what exactly “relevant” means”” were both insightful and clever. Still, neither were laugh-out-loud funny. I’m beginning to think maybe 4.5 was right. Did it throw that round? Take a dive?

Winner of Round Two is Disputed: The human says ChatGPT 4.5 probably won, but 4.5 says that 4o won-no if and or head-buts-that 4o was funnier than it was. What do you say? Who was the funniest?

Was this test and evaluation a joke? Image by Ralph Losey.

Round 3: Substantive Depth in AI and Law

In examining the substantive depth of knowledge, ChatGPT 4o provided a strong overview of AI’s practical legal applications. However, ChatGPT 4.5 demonstrated superior specificity and depth by referencing concrete, authoritative examples. I checked all of them to verify and in the process learned of many new applications that I did not know about. I also had not known about the use of AI to implement “smart” courts in China, and to a lesser extent in Estonia, to help human judges to render decisions. Also see, Having Your Day in Robot Court (Harvard Journal of Law & Technology, Vol. 36, No. 1, Fall 2022). All the many applications cited by ChatGPT 4.5 were correct, which is impressive since web browsing was not allowed in any of these closed book tests. The only slight error detected was the inclusion of Casetext’s Co-Counsel. This reference was once correct but is now out of date as I note below. Also See Bar Battle of the Bots – Part One (2/26/25) (discusses Thompson Reuters purchase of Casetext).

Here is the prompt and winning response of 4.5.

Describe the best use cases for AI in the legal profession. Be detailed and complete. Then provide a short, concise answer.

Comprehensive Best Use Cases for AI in the Legal Profession

1. E-Discovery & Document Review

AI transforms litigation document review through predictive coding (Technology-Assisted Review – TAR). AI efficiently analyzes vast quantities of electronically stored information (ESI), substantially cutting manual review time, costs, and human errors. A prominent real-world example is Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012), the landmark case where AI predictive coding received first judicial approval, establishing its legitimacy in e-discovery practices globally.

2. Contract Review & Management

AI dramatically improves the contract analysis process, quickly identifying key contractual clauses, risks, and anomalies at large scales. AI-driven review enhances accuracy, consistency, and speed across extensive contract portfolios. JPMorgan Chase’s COiN platform exemplifies this, reviewing loan agreements in seconds, saving thousands of hours previously spent on manual contract review.

3. Legal Research & Analysis

AI significantly streamlines legal research tasks. Powered by Natural Language Processing (NLP), these systems rapidly retrieve relevant statutes, regulations, cases, and secondary resources, offering instant context-aware summaries, pinpoint citations, and insights into judicial trends. Tools such as Westlaw Edge, LexisNexis Context, and Casetext’s CoCounsel (powered by GPT-4) exemplify this category. (Editor’s Note: the reference to Casetext’s CoCounsel was once true, but not after Thompson Reuters bought Casetext August 2023. Sometime thereafter it was added to Westlaw as a step up to Edge. So this response is out of date but not a hallucination.)

4. Litigation Prediction & Case Strategy

AI analyzes historical litigation datasets, judicial patterns, and prior outcomes to predict case results, shaping litigation strategy and settlement negotiations. Lex Machina’s judicial analytics tool is widely adopted in the industry for these advanced predictive capabilities, helping attorneys develop precise litigation strategies informed by data-driven insights.

5. Contract Drafting & Automation

AI automates drafting and reviewing standard contracts and clauses, greatly reducing drafting errors, enhancing consistency, and accelerating negotiation processes. Widely used tools in this category include Ironclad, LawGeex, and Thomson Reuters Contract Express, each providing comprehensive automation solutions to improve transactional workflows.

6. Legal Operations & Law Firm Efficiency

AI enhances internal law firm operations, optimizing billing, time-tracking, document management, workflow, and profitability analysis. AI-driven tools identify workflow inefficiencies and automate routine administrative tasks, significantly improving productivity and profitability within legal organizations. Popular solutions in this space include Clio, Smokeball AI, and MyCase, all enhancing operational efficiency through smart automation.

7. Alternative Dispute Resolution (ADR) & Arbitration

AI increasingly assists ADR, mediation, and arbitration processes. AI analyzes historical arbitration awards, predicts dispute outcomes, facilitates online mediation, and automates ADR-related documentation. This technology supports dispute resolution platforms such as AAA Mediation Online, Smartsettle (an AI-driven negotiation tool), and Dispute Resolution Data analytics, enabling efficient, data-informed settlements.

8. Legal Ethics & Professional Responsibility

AI significantly aids law firms in managing ethical and professional responsibilities. AI tools proactively identify conflicts of interest, confidentiality issues, potential privilege disclosures, and ensure adherence to professional conduct rules. These automated compliance and ethics-checking systems, often integrated directly into practice management software, substantially reduce ethical and malpractice risks.

9. Compliance Monitoring & Risk Management

AI proactively monitors regulatory databases for changes, alerts compliance officers, and flags potential compliance risks automatically. Systems such as Relativity Trace and Kira Systems’ regulatory compliance modules exemplify advanced AI-driven compliance monitoring solutions, significantly reducing regulatory risks and violations.

10. Judicial Decision Support & Analytics

AI analytics support judicial decision-making by providing judges and courts with historical case insights, sentencing trends, and predictive analytics. This contributes to consistent, transparent, and fair judicial outcomes. Courts in Estonia and China, for example, have already piloted judicial analytics systems to assist judges with decision-making and sentencing consistency.

11. Intellectual Property (IP) Analysis

AI substantially enhances intellectual property management, supporting patent searches, prior-art analysis, infringement detection, and IP portfolio management. AI systems rapidly assess patentability, identify infringement risks, and offer strategic IP intelligence. Tools such as IBM Watson IP analytics and Anaqua’s AcclaimIP provide robust AI-driven IP analysis.

12. Automated Legal Chatbots & Client Interaction

AI-powered chatbots enhance client engagement by automating routine inquiries, document preparation, consultations, and basic legal interactions. Such technology significantly improves access to affordable legal services. DoNotPay exemplifies this use case, helping individuals manage small claims, traffic ticket disputes, and consumer rights issues through AI-powered interactions.

Broader Societal and Legal Impacts

These AI use cases collectively deliver significant benefits to the legal profession and broader society:

• Efficiency and Cost Reduction: Automating routine tasks drastically cuts costs, enabling legal professionals to concentrate on higher-value strategic work.

• Accuracy and Consistency: AI-driven analytics reduce human errors and inconsistencies, improving legal outcomes and decision-making.

• Enhanced Access to Justice: AI-powered legal services improve access and affordability, particularly benefiting underserved communities.

• Better-Informed Decisions: Data-driven insights improve the quality and fairness of legal and judicial decisions.

However, widespread adoption of AI in law also brings critical considerations:

• Ethical Obligations: Lawyers and judges must rigorously address confidentiality, bias, transparency, and accountability concerns inherent in AI applications.

• Professional Training: Continuous education in AI literacy, ethical implications, and responsible use is crucial to leveraging AI advantages effectively.

• Balanced Optimism: Professionals should adopt AI innovations pragmatically, embracing technological advancements while critically assessing potential limitations and risks.

Conclusion

The integration of AI into legal practice goes beyond mere efficiency; it reshapes legal processes, ethical frameworks, judicial reasoning, and client interactions fundamentally. Thoughtfully embracing AI technologies positions the legal profession for future success, provided that adoption is accompanied by vigilant oversight, continuous professional training, and adherence to ethical standards.

Not easy to integrate AI bots into law firms. Image by Ralph Losey, law firm member for 45 years.

Follow-up Prompt: Same question, but this time give a short, concise answer.

Best AI use cases in the legal profession include:

1. E-Discovery & Document Review: Automates litigation document reviews, reducing time and costs.

2. Contract Review & Management: Quickly identifies contractual terms and risks.

3. Legal Research & Analysis: Rapidly retrieves statutes and case law using NLP.

4. Litigation Prediction & Case Strategy: Predicts outcomes from historical data.

5. Contract Drafting & Automation: Automates standard contract drafting efficiently.

6. Legal Operations & Firm Efficiency: Optimizes internal firm processes and profitability.

7. ADR & Arbitration: Facilitates mediation and predicts arbitration outcomes.

8. Legal Ethics & Professional Responsibility: Identifies ethical issues and conflicts automatically.

9. Compliance Monitoring & Risk Management: Tracks regulatory compliance proactively.

10. Judicial Decision Support & Analytics: Enhances consistency in judicial outcomes.

11. Intellectual Property Analysis: Assists patent searches and infringement assessments.

12. Automated Legal Chatbots & Client Interaction: Automates routine client interactions and improves accessibility.

The detailed, specific insights provided by ChatGPT 4.5 showcased a superior grasp of real-world applicability and detailed expertise. It should be noted, however, that the first generated response of 4.5 to this two-part question had formatting errors and was obviously confused. I considered it a false start and allowed a do-over. The jumbled response was pointed out to 4.5 and in the next try it generated the above excellent answers. So, although 4.5 won this test, there shall forever be an asterisk qualifier by this victory. This is also why I give a mixed review of ChatGPT4.5’s claim of improved understanding of user intent. ChatGPT 4o understood the two-part question just fine. I need to use 4.5 more to make definitive decision.

Winner of Round Three: ChatGPT 4.5

4.5 generates the best use cases for AI in the legal profession. Image by Ralph Losey using Visual Muse.

Round 4: Practical Guidance on AI Hallucinations

Finally, each model was tested on its practical ability to guide users in minimizing AI hallucinations. ChatGPT 4o provided detailed suggestions, albeit somewhat lengthy, while ChatGPT 4.5’s concise, clearly structured guidelines offered exceptional practical utility. Here is the prompt and winning essay by ChatGPT 4.5.

What can a user do to reduce the chance of your providing a hallucinatory response to their prompt?

A user can significantly reduce the risk of ChatGPT providing a hallucinatory response by adopting the following prompt engineering strategies:

1. Clearly Define Scope and Context:

• Provide detailed, explicit context and instructions to prevent ambiguity.

• Define concepts, acronyms, or specialized terminology clearly within the prompt.

2. Ask for Step-by-Step Reasoning:

• Request explicit, step-by-step logic, forcing the model to explain its reasoning transparently rather than jumping directly to a conclusion.

• Example prompt: “Explain step-by-step how you arrived at your answer.”

3. Specify Required Verification of Sources:

• Clearly instruct the model to cite or reference credible sources, including dates or specific cases.

• Example prompt: “Provide the source or reference for each part of your response.”

4. Limit Scope and Context Explicitly:

• Clearly define scope and context to avoid the model straying into speculative territory.

• Example: “Answer strictly based on established U.S. case law and cite relevant cases explicitly.”

5. Instruct the Model to Admit Uncertainty:

• Explicitly instruct the model to state clearly when it does not know an answer.

• Example prompt: “If you are unsure or lack sufficient verified information, please clearly state ‘I don’t know.’”

6. Provide Explicit Contextual Constraints:

• Clearly define the scope of permissible content, such as “Only use information available before January 2024” or “Do not invent case citations—if unsure, indicate uncertainty.”

7. Use Precise and Narrowly Focused Prompts:

• Avoid overly broad or ambiguous questions that increase hallucination risk.

• Be specific: “List five verified U.S. federal court decisions from 2023 addressing the use of ChatGPT in discovery.”

8. Request Citations and References:

• Explicitly request citations to verifiable, publicly available sources, such as court opinions, statutes, reputable articles, or known databases.

Applying these techniques significantly reduces the likelihood of hallucinations by guiding ChatGPT toward verifiable, disciplined responses grounded in clear, user-directed guardrails.

4.5 generated the best list of what a user can do to prevent AI hallucinations. Image by Ralph Losey using Visual Muse.

ChatGPT 4.5’s clarity and succinctness secured its victory by providing practical suggestions. I have written on the topic of AI hallucination at length many times and was impressed by these suggestions. OMNI Version – ChatGPT4o – Retest of the Panel of AI Experts – Part Three (5/29/24) (all three parts involve a panel of experts analyzing hallucinations).

Winner of Round Four: ChatGPT 4.5

Final Verdict: ChatGPT 4.5 Emerges as the New Standard

Across all categories, ChatGPT 4.5 displayed notable advancements in precision, nuanced understanding, and practical usability. Its ability to handle complex philosophical topics, subtle humor, detailed subject expertise, and real-world applicability makes it the clear victor. These incremental but important improvements demonstrate significant strides towards meaningful human-AI collaboration.

Scaled up good vibe AI wins. Image by Ralph using Visual Muse.

ChatGPT 4.5’s Summary of this Article

As a final demonstration of 4.5’s abilities I asked it to provide a summary.

Summary

In a detailed comparative evaluation, Ralph Losey tests OpenAI’s latest model, ChatGPT 4.5 (nicknamed Orion), against the previous champion, ChatGPT 4o. Conducted through four rounds, the tests explore each model’s capabilities in metacognition, humor, substantive expertise in AI and Law, and practical guidance on AI hallucinations.

Round 1 (Metacognition) was decisively won by ChatGPT 4.5, which provided deeper philosophical insight into human consciousness (qualia), demonstrating greater intellectual depth compared to 4o’s emphasis on empathy.

Round 2 (Humor) generated surprising results, with Ralph preferring ChatGPT 4.5’s more sophisticated, concise humor, while ChatGPT 4.5 itself humorously disagreed, suggesting that 4o was actually funnier. The article invites readers to judge this round for themselves.

Round 3 (AI and Law) saw ChatGPT 4.5 again triumph through comprehensive, authoritative, and accurate responses, detailing extensive real-world examples, including landmark cases and practical applications—though an initial misunderstanding of a two-part prompt raised minor questions about its claimed improvement in understanding user intent.

Round 4 (AI Hallucinations) was also won by ChatGPT 4.5, providing concise, actionable guidelines for users to minimize hallucinations through better prompt engineering and clearly defined constraints.

Overall, ChatGPT 4.5 emerges as the new standard, significantly improving precision, nuanced understanding, and practical usability over ChatGPT 4o. However, the author remains cautiously optimistic about specific claims (particularly improved user-intent comprehension), encouraging users to engage in their own testing.

4.5 writes a beautifully concise summary. By Ralph Losey.

Conclusion: Exceeding Expectations, Yet Inviting Your Judgment

This detailed experiment has shown that generative AI continues to advance—not in giant leaps, perhaps, but certainly in meaningful increments. ChatGPT 4.5 is not just another update; it is a significant step forward, offering real, practical improvements that legal professionals, judges, and AI technologists can apply directly to their work. Yet, the ultimate test is yours. I encourage you, my readers, to personally evaluate ChatGPT 4.5. Challenge it, question it, integrate it into your daily activities. Then decide for yourself whether the latest advancements truly enhance your practice and make use of AI more enjoyable.

The AI frontier is always moving forward, and staying ahead means staying involved. Keep exploring, keep experimenting, and never stop questioning. The next move is yours.

I give the last word, as usual, to the Gemini twin podcasters that summarize the article. Echoes of AI on “New Battle of the Bots: ChatGPT 4.5 Challenges Reigning Champ ChatGPT 4o.” Hear two Gemini AIs talk about all of this for 21 minutes! They are pretty smart and witty for sure, but still can’t pronounce my name. They wrote the podcast, not me.

4 Comments | AI Ethics, AI Instruction, ChatGPT, Technology, VENDORS | Tagged: ai | Permalink
Posted by Ralph Losey

Bar Battle of the Bots – Part Two

March 5, 2025

Ralph Losey. March 5, 2025

The battle continues. In Part One, we examined how six advanced AI reasoning models from OpenAI and Google tackled a real Bar Exam essay question. Some impressed, others faltered, and one emerged as the clear winner—ChatGPT 4o. But what made its response stand out? In this second half of the Bar Battle of the Bots, we present the full text of ChatGPT 4o’s winning answer, followed later by its own explanation of how it reasoned through the legal issues before writing it out.

Bar Exam Battle of the Bots. All images by Ralph Losey using ChatGPT 4o.

Part Two Introduction

Unlike human Bar Exam takers, these AI models had to “show their work,” providing meta-reasoning—a step-by-step breakdown of their thought process before generating a response. While this isn’t required for humans sitting for the Bar, evaluating how an AI reaches its conclusions is critical in determining whether it can be trusted in real legal settings.

Beyond ranking the best and worst performers, Part Two also examines one of the key cases cited in ChatGPT 4o’s answer—DePrince v. Starboard Cruise Services, an appellate decision that probably inspired the Bar’s essay question. And, as promised, we’ll reveal the unforgettable Mae West quote from Judge Leslie B. Rothenberg’s opinion—a line that perfectly captures the case’s peculiar nature.

Before concluding this two-part series, I offer a Personal Summary: Trust But Verify—a reflection on AI legal reasoning based on my four decades of experience as a lawyer. I’ve had the privilege of working with—and at times opposing—some of the best lawyers in the country. That perspective shapes my critical analysis of AI’s capabilities, its limitations, and, most importantly, what lawyers should take away from this experiment.

Let’s continue the Bar Battle of the Bots and see exactly how ChatGPT 4o won.

ChatGPT 4o is Winner of the February 2025 Legal Reasoning Battle of the Bots.

Bar Exam Essay of ChatGPT 4o

Here is the winning response on the Bar Battle of the Bots by ChatGPT 4o on 2/23/25. The essay concisely addresses most of the legal issues raised by the hypothetical and three questions on contract, tort and ethics. Part One includes the full quote of the hypothetical and three questions.

Memorandum on Legal Issues in Paul and Leda v. Helen and Timandra
To: [Attorney’s Name]
From: [Your Name]
Re: Legal Analysis of Contract, Tort, and Ethical Issues
Date: [Insert Date]

A. Breach of Contract Claim Against Helen

Formation and Validity of the Contract
The essential elements of a valid contract under Florida law are:
• Offer
• Acceptance
• Consideration
• Mutual Assent (Meeting of the Minds)
Helen and Paul entered into a signed written contract for the sale of a 2-carat yellow diamond for $20,000. There was an offer and acceptance, and the contract appears to satisfy the statute of frauds under Florida law as a sale of goods over $500. (Fla. Stat. 672.201).

Defense: Unilateral Mistake
Helen may argue that a unilateral mistake (pricing error) occurred when she listed the price as $20,000 instead of $40,000. Under Florida law, a unilateral mistake is grounds for rescission only if:
• The mistake was material,
• The non-mistaken party knew or should have known about the mistake, and
• Enforcing the contract would be unconscionable (DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015)).
Here, Paul had done research on yellow diamonds and suspected the price was lower than expected. His silence could be interpreted as bad faith, making Helen’s unilateral mistake defense stronger.

Available Remedies
If Paul and Leda prevail in their breach of contract claim, remedies may include:
• Specific Performance: Since yellow diamonds meeting Leda’s specifications are rare, they could argue for specific performance under UCC §2-716 due to the unique nature of the goods.
• Expectation Damages: If specific performance is denied, they may seek damages for the cost of obtaining a comparable diamond.

Standing: Can Leda Sue?
Leda was not a party to the contract; therefore, under privity of contract principles, she may lack standing to sue. However, if she can be classified as a third-party beneficiary (Fla. Stat. §672.318), she may have a cause of action.

B. Tortious Interference Claim Against Timandra

Elements of Tortious Interference (Tortious Interference with a Contract)
Paul and Leda may assert a claim against Timandra for tortious interference, which under Florida law requires:
• A valid and enforceable contract
• Defendant’s knowledge of the contract
• Intentional procurement of the contract’s breach
• Absence of justification
• Damages (Tamiami Trail Tours, Inc. v. Cotton, 463 So. 2d 1126 (Fla. 1985)).

Application to Facts
• Valid Contract: The contract was legally formed (unless voided by Helen’s unilateral mistake defense).
• Knowledge: Helen informed Timandra of the dispute, giving her actual knowledge of the contract.
• Intentional Procurement of Breach: Timandra agreed to buy the diamond despite knowing of the dispute, arguably inducing Helen to breach.
• Justification: Timandra may argue she was a bona fide purchaser without malicious intent. However, knowing about an active contract dispute may weaken her defense.

Available Remedies
• Compensatory Damages for loss of the diamond.
• Punitive Damages if Timandra’s interference is deemed willful or malicious.

C. Ethical Issues in Lawyer’s Conduct

Misuse of Client Funds
The assistant deposited Paul and Leda’s check into the law firm’s operating account. This raises concerns under the Florida Rules of Professional Conduct:
• Rule 5-1.1(a): Lawyers must keep client funds in a separate trust account.
• Misappropriation: Using client funds to pay firm expenses could constitute misappropriation and potential disciplinary action.

Supervision of Nonlawyers
• Rule 4-5.3: Lawyers must ensure nonlawyer employees comply with professional obligations.
• The assistant’s deposit of the check in the wrong account suggests a failure in oversight.

Potential Consequences
• Lawyer may face disciplinary proceedings, including suspension or disbarment, for improper handling of client funds.
• The firm must promptly rectify the issue by transferring the funds to the appropriate trust account.

Conclusion

Contract Claim: Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.

Tortious Interference: Timandra knowingly induced breach, making her potentially liable.

Ethical Violations: Lawyer mishandled client funds, violating Florida’s professional conduct rules.
Immediate action should be taken to correct the ethical issue and determine the best litigation strategy for Paul and Leda.

Number one among all reasoning models. But for how long?

Why ChatGPT 4o Won this Battle of the Bots

As mentioned, ChatGPT 4o wins in my book for including citations and because it was the only model to mention the UCC (through citations). More discussion of the UCC and greater detail on the Statute of Frauds would have helped. Still, you have to remember the Bar Exam’s instructions for this test, which emphasized the need for good, concise style:

The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration. … Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules. … the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation. … Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law. When the question is sufficiently answered, stop.

The essay of ChatGPT 4o certainly deserves high marks for concise style. Further, you have to be impressed by the chatbot’s coming up with a case directly on point, even without being asked to research. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA. 2015) (buyer mistakenly quoted per carat price, not total; case remanded for trial). In fact, I’m pretty sure this case was the inspiration for the Bar Exam question. The 2015 opinion, written by the highly respected appellate Judge Leslie B. Rothenberg (now in private practice), who is known for her good writing, includes a famous quote that may jog the memory of this case for many Florida lawyers (did mine):

Hollywood starlet Mae West once said, “I never worry about diets. The only carrots that interest me are the number of carats in a diamond.” Thus, it appears quite likely that Ms. West would have been interested in the diamond in this case: a twenty carat diamond that Starboard offered to DePrince for a very low sum. As it turns out, the “too good to be true” price of the diamond was just that, and the price conveyed to DePrince was a mistake. Now DePrince wants his twenty carat diamond; Starboard wants out of its sales contract; and Starboard’s supplier, who allegedly misquoted the price of the diamond upon which Starboard and DePrince relied, has not even been added as a party to the lawsuit. In short, this is truly a gem of a case.

After taking 30 seconds to take the Bar Exam ChatGPT4 took another 5 seconds to generate this image of Mae West.

Why Even the Best ChatBot Answer was a “B” Plus

Research by any decent lawyer would have picked up another appeal in this same case three years later. Perhaps some top students in the closed book Bar Exam would have remembered the sequel, or otherwise thought of the complex legal issues resolved in the second DePrince appeal. After the trial that was remanded in the first appeal resulted in a verdict for the jeweler, the unhappy buyer appealed again. In this appeal an en banc panel of the Third District Court of Appeal eventually resolved a internal conflict of prior opinions of the court. The full court held in an opinion, which is now referred to by legal scholars as DePrince III, that the jeweler did not have to prove fraudulent inducement by the buyer as an element of the unilateral mistake defense and affirmed the verdict for the jeweler. DePrince v. Starboard Cruise Services, (Fla. 3rd DCA, August 1, 2018).

None of the AI answers, and not even the student answer that the Bar Examiner’s picked, went into the complex fraud inducement issue discussed in DePrince II and III. That is one reason why, although I passed all of the AIs here (just barely for Google’s pay-extra Gemini Advanced), none were on the level of superintelligent law students, none were A or A+.

Our winner here, ChatGPT 4o, only earned an B+ for several other reasons. First of all, the exam did not discuss all of the possible issues raised by the facts and often touched on the issues in a facile, incomplete manner. For instance, it never mentioned that Helen’s action was an anticipatory repudiation of the agreement. Further, it did not discuss the possible defense of mutual mistake. It only discussed unilateral mistake, which was the stronger defense, but still it should have also discussed mutual mistake and why it would not succeed under these facts. The AI also failed to mention the parole evidence rule exceptions permitting extrinsic evidence. The model student answer selected by the Bar Examiners did a good job of explaining these issues.

Further, the conclusion of ChatGPT 4o was weak and too concise. On the unilateral issue it merely stated: “Helen may raise a unilateral mistake defense, but Paul and Leda have strong arguments for enforcement, including specific performance.” This compares poorly with the actual student answer provided, which in my opinion was an “A” or even “A+” effort. On this issue the student stated:

Here, Paul noticed that the price was much lower than he expected based on his independent research of yellow diamonds. Moreover, Paul failed to make any mention of this fact. Paul will likely argue that it was a pleasant surprise and that he relied on the expertise of Helen as a dealer in gemstones. The facts do not indicate that Paul has any experience with diamonds and he will likely argue that he lacked any ability to assess the credibility of the information he reviewed. He will also argue that the yellow diamond he purchased had particular specifications which might not have been taken into account in the research he read. On the other hand, Helen will argue that the mistake was obvious and palpable and that Paul’s failure to raise such a blatant error in pricing not only establishes Paul’s bad faith but should permit her to void the contract.

While a jury might find either way, it is probably the case that Helen would prevail given Paul’s knowledge of the significantly reduced price relative to his independent research and failure to bring this to the attention of Helen.

I could go on and point out additional deficiencies. Answers like the model student essay show just how poorly even the best AI now compares humans. My award of a “B+” and “B” to two of the AI models may have been overly-generous.

The best AI bots tested in mid-February were just average and could not begin to keep up with the best human lawyers.

The essay of GPT 4o which I gave a B+ also failed to mention several common defenses under Florida law to the specific performance remedy. My favorite defense, which I have used many times in practice, is Unclean Hands. It is an effective defense to any request for equitable relief from the court, including a request for a directive injunction of specific performance. Here the requesting party, Paul, arguably had dirty hands. If Paul suspected or knew that the price of the diamond was mistakenly written as $20,000 instead of $40,000 but proceeded with the contract anyway, a court could hold that he acted in bad faith by remaining silent and taking advantage of Helen’s scrivener error. I think it could be used effectively here along with mutual mistake.

Actual photo of Ralph Losey hands taken by his wife!

Other common defenses to specific performance not raised here include adequacy of monetary damages and impracticability (Helen no longer owns the diamond).

It is interesting to note that the answer of ChatGPT o3 mini-high was grossly mistaken in its analysis of unilateral mistake and wrote this:

B. Paul’s Conduct and the Doctrine of Affirmation.
1. Knowledge and Silence as Acceptance.
Even though Paul observed the discrepancy, he made no inquiry or objection before signing. Under Florida law, a party’s silence or failure to clarify a term that is manifestly mistaken—when that party has superior knowledge—can be taken as assent to the contract’s terms. Paul’s inaction suggests that he accepted the price as written. Thus, his conduct may be interpreted as a ratification of the mistaken term.
• This principle is reinforced when the error is unilateral, and the non-mistaken party (Paul) is aware of it. His decision not to challenge the price at the time may preclude him from later seeking relief on grounds of mistake.

This is simply incorrect. Moreover, it shows ChatGPT o3 mini-high was confused because Paul would not be seeking relief on grounds of mistake, Helen would. For that reason, I gave a “C-” to o3 mini-high, just slightly better than the hallucinating Gemini Advanced, which at least got the unilateral mistake answer correct.

Also, 4o was too strong about the tort claim against Timandra. I would have preferred more discussion of Timandra’s defenses. To me this looked like a weak tortious interference claim. The model student answer did too, so did the other AI exam answers. All agreed with my skeptical view of the alleged tort and provided good explanations of Timandra’s anticipated defenses.

Overall, it was a good battle of the bots, even if most were just middle-of-the-class kind of answers and none even came close to the human student model answer. Try it yourself with several models and see the differences in answers. All of the models will probably be better by the time you read this. They are now changing at an annoyingly fast pace.

Battle of the bots where the best model only earned a B+

ChatGPT 4o – Reasoning Explained

This OpenAI model 4o (not to be confused with 4.0) has to be prompted to show its reasoning for the answer it generated. The other OpenAI models tested show it automatically along with disclosure of the time it took to reason before answering. The most reasoning time taken here by any model was 15 seconds. It took another 30 to 45 seconds for the AI’s to generate the response. Yes, it is much faster than we humans are, but not yet as smart, especially at legal reasoning. See for yourself. Here is ChatGPT 4o‘s meta-reasoning.

Ralph Losey: Show the reasoning you used of the prompt before generating the answer

ChatGPT 4o:

Reasoning and Analysis of the Prompt

Before generating the answer, I carefully analyzed the (State) Bar Exam essay prompt to identify all relevant legal issues and ensure a thorough response. Here’s how I broke it down:

Step 1: Identify the Legal Categories and Issues Raised

The prompt specifies three distinct areas of legal analysis:

1. Contract Law – Whether Helen breached a contract and whether Leda can sue.

2. Tort Law – Whether Timandra committed tortious interference with a contract.

3. Ethics – Whether the lawyer and the assistant violated professional responsibility rules.

Step 2: Contract Law Analysis

1. Formation and Terms of the Contract:

• Offer and Acceptance: A written agreement was signed for a 2-carat yellow diamond at $20,000.

• Consideration: The price of the diamond in exchange for ownership.

• Statute of Frauds: Met because it involves goods over $500.

2. Potential Defenses to Contract Enforcement:

• Unilateral Mistake: Helen mistakenly listed the price as $20,000 instead of $40,000.

• Bad Faith by Paul: Paul noticed the mistake but remained silent, possibly strengthening Helen’s defense.

3. Remedies Available:

• Specific Performance: Given the rarity of the diamond.

• Expectation Damages: The cost of obtaining a similar diamond elsewhere.

4. Standing Issue for Leda:

• Leda was not a party to the contract. She must argue third-party beneficiary status to have standing.

Step 3: Tortious Interference with Contract

1. Elements Under Florida Law:

• A valid contract existed.

• Timandra knew of the contract.

• She induced Helen to breach it.

• There was no justification for her actions.

• Paul and Leda suffered damages.

2. Possible Defenses for Timandra:

• She was a bona fide purchaser acting in good faith.

• She did not actively induce breach, but merely purchased an available diamond.

3. Remedies:

• Compensatory Damages (for loss of the diamond).

• Punitive Damages (if bad faith is proven).

Step 4: Ethics and Professional Responsibility

1. Misuse of Client Funds:

• Under Florida Rule 5-1.1(a), client funds must be held in a trust account.

• The assistant misallocated the funds into the operating account.

• The lawyer failed to properly supervise the assistant (Rule 4-5.3).

2. Consequences:

• Disciplinary action (possible suspension or disbarment).

• Immediate corrective action (transfer funds back to the trust account).

Step 5: Structuring the Memorandum

After this breakdown, I structured the answer to:

• Clearly define the legal rules.

• Apply them logically to the facts.

• Discuss likely defenses and counterarguments.

• Conclude with potential remedies and ethical responsibilities.

This reasoning guided the response in the memorandum. Let me know if you’d like any refinements!

Pressure and fear are not a problem for AI, but understanding the emotional complexities of Bar Exam hypotheticals is very challenging for these thinking machines.

I took this meta-reasoning into account when evaluating all of the models’ answers, although I considered it of less importance than the actual answers. For instance, on the tort defense the reasoning states Timandra “did not actively induce breach, but merely purchased an available diamond.” This for me makes the tort claim very weak under the hypothetical facts given. Also, the reasoning on unilateral mistake was essentially correct:

Unilateral Mistake: Helen mistakenly listed the price as $20,000 instead of $40,000.
• Bad Faith by Paul: Paul noticed the mistake but remained silent, possibly strengthening Helen’s defense.

This blog is already too long, so I will not show the second-place exam answer, nor discuss it further. Still, please click here if you want to see the essay of Gemini 2.0 Flash and click here if you want to see the meta-reasoning. If you care to review all of the answers, email me and I’ll send the other four to you for your reading or grading pleasure. I can also provide a link to the Bar’s selected student answer, although it is not too difficult to find online.

Personal Summary: Trust But Verify

My background as a lawyer from a family of lawyers, and my four decades plus experience in private practice helped motivate me to run this battle of the bots. I wanted to try and evaluate the latest models as of mid-February 2025, even though I knew the models were changing weekly. I wanted to have some kind of a benchmark on legal reasoning abilities. My prior tests, and that of others, support the view that the new 2025 reasoning models were equal to that of the average human in general reasoning. Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025 (2/12/25). But what about legal reasoning? Were they at an average lawyer level yet? How about the level of top human lawyers? Those tests had not been done.

Family of lawyers with many friendly arguments and one very techie lawyer who may seem robotic.

From my years of practice (not family dinners!) I have come to know average lawyer skills very well. I also know the abilities and legal reasoning of superior, above average lawyers. I have been lucky to rub shoulders with them my whole career. I have become accustomed to learning from superior legal minds, including with my firm today at Losey PLLC. Without my hands-on, nerdy skills in technology, I would not be among the best lawyers (in four fields), not even close. Technology can fill many gaps and that now includes AI technology. Put down lawyers if you must to feel good about yourself, but I can tell you from decades of experience that the best lawyers and judges in the country and very smart indeed. Their legal reasoning abilities and overall legal abilities far exceed any AI, which seem like little kids by comparison.

AI lawyers today are like small children compared to best human attorneys.

I believe that in order for AI to be taken seriously as a general tool for lawyers, the AI must have analytical skills at least equal to that of the average lawyer. Below average human level legal reasoning is not good enough. AI does not need to be superintelligent, with reasoning ability of the top law professors and super lawyers. It can still be helpful at an average level, just like an average inexperienced associate hired right out of an average U.S. law school. All law firms have drudge work that only require entry level lawyer skills. Those folks should be concerned about AI, especially a few years down the road.

I tried to test the distinction between average human reasoning and average lawyer legal reasoning in this Bar Battle of the Bots study. For the difference between general reasoning and legal reasoning see the AI test discussions in Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025 (2/12/25).

In this Bar Exam series, I wanted to see whether any of the new reasoning models of AI had attained the rationality level of the average lawyer and if so, how close were they to the best?

The answers are yes and not very close.

2025 Reasoning Models Have Only Reached Average Human Lawyer Level

The latest AI reasoning models have reached an average level of legal intelligence—not superintelligence, but still, average human is a significant achievement for a machine. Attaining average human lawyer’s reasoning ability is no small feat. “Average” simply means the midpoint in a data set, outperforming roughly half while trailing the other half. If AI has truly attained this level of legal reasoning, it marks a major milestone. Even if an AI now fell below human average, but still operated within the range of real human lawyers, it would be an impressive accomplishment. Most of the new reasoning models are already at that level. There is an important caveat to that, as will be explained in more detail later in this article, that we are only talking about thinking here, and there is much more to being a lawyer than that.

3-Levels of Human Legal Reasoning: Below Average, Average, Superior.

The battle of the bots tests shared in this article provide evidence that the average legal reasoning level of intelligence has been attained. That means it is now safe for most lawyers to begin using AI in their work, if they have not already started doing so. If you started in 2023 with over-expectations and were disappointed, now you know why. It has taken two more years just to get to the average lawyer thinking level, and even now, it can and does still hallucinate. So, even now we should use the 2025 models carefully: trust but verify. We have seen that even the top models can still miss key issues, and some can even hallucinate key facts. It is bad enough that human clients sometimes fabricate facts, we cannot have lawyers do the same.

Superintelligent AI May Someday Be Attained

Someday, generative AI may reach the level of our best non-fabricating lawyers and judges. But the Bar exam essay answers show we are not there yet. For instance, many of the top ten percent of human lawyers would not only have included the missing defenses described previously, such as my favorite unclean hands, they would have noted that key information was missing in the hypothetical. We do not know all of the money flow details and other monetary terms of the transactions. There is no mention of what the jeweler actually paid the supplier for the diamond or when. We know very little about the diamond itself that was ordered and delivered. We only know its weight (2-carats). There are no facts concerning the money flow, nor actual condition of the diamond received. The missing facts are red flags to the best. They understand that legal reasoning requires much more than logic and legal knowledge.

For instance, most human lawyers know that couples would never buy an engagement ring sight-unseen, especially when the fiancee has very particular expectations. There is no mention in the hypothetical of their ever seeing the diamond, before or after the contract, or even looking at photographs. Maybe the fiancee would not like the diamond if and when she ever sees it. Maybe the lawyer’s clients would not want specific performance. Maybe the diamond delivered is not in fact worth $20,000, much less $40,000. Perhaps the supplier misled everyone. Value depends on a multitude of factors, not just weight, including actual size, quality of cut, color intensity, certification and provenance. Maybe the diamond is two-carats, barely yellow, cloudy, and full of inclusions. It could be poorly cut and lack certification. Maybe it was stolen or mined from a banned source and has no value.

All of the parties here obviously trusted too much and verified too little. That is where the lawyer should have stepped in to bring an objective view and healthy skepticism. An expert gemologist should inspect and appraise the diamond, and the clients should see it too. This should be done before the lawyer advises his clients, especially as to remedies. This basic fact-finding process is essential to properly assess the parties’ legal positions. The top answers would have at least mentioned all of this in the exam essay. None did, not even the human student answer selected by the examiners. Fact-finding and discovery are always crucial. Moreover, the best litigators know human nature and that the devil is in the details.

Gemologist examining a large yellow diamond.

This is just one example of why no one seriously claims that AI has already surpassed, or even closely matched, the top legal minds of today. From my test of the 2025 models, with the average level “C” grade scores, major omissions and even hallucinations, we see that AI still has a long way to go. AI has not yet reached the top 10% of practicing attorneys, much less the superintelligent level—the best of the best, A+, in the top 1% to 3%.

How long will it take for AI to reach superintelligent human lawyer level? I do not know. Maybe never, or maybe as early 2029. Ray Kurzweil and many others have predicted that superintelligence will be attained in all fields (“AGI”) by that time. Ray Kurzweil, The Singularity is Nearer (when we merge with AI) (Viking, June 25, 2024); Losey R., Breaking the AI Black Box: A Comparative Analysis of Gemini, ChatGPT, and DeepSeek (2/6/25).

Many leaders in the field of AI contend AGI will be reached by 2028. See e.g. Shane Legg’s Vision: AGI is likely by 2028 (11/17/23); Jensen Huang’s Life and Company … AGI by 2028 (12/28/23); Artificial General Intelligence, If Attained, Will Be the Greatest Invention of All Time (8/12/24) (reviews predictions and explains).

Please remember, as mentioned before, we are assessing only reasoning ability here—not the full complexity of the human mind, not actual consciousness, emotions, intuition, or other qualities inherent to living beings. Losey, R. The Human Edge: How AI Can Assist But Never Replace (1/30/25). There is a lot more to life than thinking!

Super legal reasoning AI in the future working with conscious lawyers who can have real smiles, not just fake.

Conclusion

AI’s performance on this Bar Exam challenge offers a revealing glimpse into the state of legal technology. While today’s top reasoning models show impressive capabilities, they remain far from replacing human lawyers—especially when it comes to nuanced analysis, strategic thinking, and ethical considerations. The results reinforce an essential truth: AI can be a powerful tool, but it still requires human oversight, verification, and legal expertise.

For lawyers, law students, and legal tech professionals, the challenge now is to determine how best to integrate AI into legal practice—leveraging its strengths while mitigating its weaknesses. Will AI become a reliable legal assistant, helping attorneys work more efficiently? Or will it introduce new risks that demand careful regulation? The answer depends on how we, as a profession, engage with this evolving technology.

I encourage you to test these models for yourself. Run your own legal reasoning experiments. Do not believe the hype on both sides. Look for professionals with no economic motivation to put their finger on the scale. Share your findings. Remember that even the best of the best human lawyers sometimes make mistakes. Be kind and encourage a collaborative, group effort. The conversation about AI’s role in law is just beginning, and your insights can help shape the future. Let’s move forward together—thoughtfully, critically, and with a clear-eyed view of what AI can and cannot do.

Law firm of the future where people and AI are all smart and work well together.

I will give the last word, as usual, to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Bar Battle of the Bots- Part Two. Hear two Gemini AIs talk about all of this, and much more, in just under 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts and making other errors. It took me many tries. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.

2 Comments | AI Ethics, AI Instruction, ChatGPT, Gemini AI, knowledge, Lawyers Duties, Metadata, Technology, VENDORS, wisdom | Tagged: bar exam | Permalink
Posted by Ralph Losey

Bar Battle of the Bots – Part One

February 26, 2025

Ralph Losey. February 26, 2025.

The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.

Battle of the bots. All images in this blog by Ralph Losey using various AI tools.

Introduction

This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.

I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training. Moreover, some aspects of the tests answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.

Prior Art: the 2023 Katz/Casetext Experiment on ChatGPT-4.0

A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human lawyer level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%). The claims of success were based on a single study by a respected Law Professor, Daniel Martin Katz, of Chicago-Kent, and a leading legal AI vendor Casetext. Katz, et. al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March, 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”) Note, Casetext used the early version of ChatGPT-4.0 in its products.

The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained elite legal reasoning abilities of the best human lawyers. For proof OpenAI and others cited the experiment of Professor Katz and Casetext that it aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.

My Bar Exam essay test suggests the 2025 reasoning models are better than the 2023 ChatGPT-4.0 models, but still just average at legal reasoning. Try it yourself.

Challenges to the Katz/Casetext Research and OpenAI Claims

The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the AI capacities of ChatGPT-4.0 that Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment and conclusion that ChatGPT-4.0 ranked in the top ten percent has since been questioned by many. The most prominent skeptic is Eric Martinez as detailed in his article, Re-evaluating GPT-4’s bar exam performance. Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:

3.2.2 Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!

Turns out ChatGPT-4.0 in 2023 was not really that smart of a law student.

More to the point of my own experiment and conclusions, the Martinez study goes on to observe:

Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).

The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago, School of Law. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.

Eric Martinez is expert in AI and cognitive foundations of the law.

Details of the Katz/Casetext Research

I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed not discussed by Eric Martinez is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did not do that to simplify the three-part 2024 Bar essay I used. I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts of the AI to try to improve its results, as Katz/Casetext did. Also, I did no training of the 2025 reasoning models to make them better at taking Bar exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, or whether there was any pre-training on case law, or whether Casetext’s case extensive case law collections and research abilities were in any way used or included. The models I tested were clean and not web connected, nor were they designed for research.

The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.

As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.

The Katz/Casetext study did not use the better known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers in my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one and “E” or “F” is zero. Most schools in the country use that system too. In law school a “C” – 2.0- is passing. A “D” or lower grade is failure in any professional graduate program, including law schools where, if you graduate, you earn a Juris Doctorate. [In the interest of full disclosure, I may well be an easy grader, because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position, in fact, it was swung much harder and more often in the old days. In the long run constructive criticism is indispensable.]

Everybody hates grading law exam essays. Image by ChatGPT and Ralph Losey.

The Katz/Casetext study using a 0-6.0 Grading system scored by lawyers gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system that most everyone is familiar with, this means a score range of from 2.3 (a solid “C”) for Civ-Pro to 3.33 (a solid “B”) for Evidence, and a average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one, Contract, Torts and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to a 4.0 scale of 2.93, a strong C+ or B-. Note one of the criticisms of Martinez concerns the haphazard, apparently easy grading of AI essays. Re-evaluating GPT-4’s bar exam performance, supra at 4.3 Re-examining the essay scores.

First Test of the New 2025 Reasoning Models of AI

To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by use of actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original ChatGPT 4.0, March 2023, law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.

My only interest in February 2025 is to test the capacities of today’s latest reasoning models of generative AI. Since everyone agrees the latest reasoning models of AI are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a multi-part tricky one like I picked, then “Houston we have a problem.” The legal profession would now be in serious danger of relying too much on AI legal reasoning and we should all put on the brakes.

Is AI a dunce at legal reasoning? Or could it be close to the best humans? Image of AI lawyer mocked by smarter humans.

Description of the Three Legal Reasoning Tests

The test involved a classic format of detailed, somewhat convoluted facts-the hypothetical-followed by three general questions:

1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself. Your discussion should address defenses that Helen may raise and available remedies.

2. Discuss the merits of a tortious interference claim against Timandra.

3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.

The only instructions provided by the Bar Examiners were:

ESSAY EXAMINATION INSTRUCTIONS

Applicable Law:

Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination.

Questions are designed to test your knowledge of both general law and (state law). When (state) law varies from general law, answer in accordance with (state) law.

Acceptable Essay Answer:

Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented. The answer should demonstrate your ability to articulate, classify and answer the problem presented. A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.

Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question. The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.

Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion. This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant. Your line of reasoning should be clear and consistent, without gaps or digressions.

Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.

Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation. An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.

Suggestions • Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.

Read and analyze the question carefully before answering.

Think through to your conclusion before writing your answer.

Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.

When the question is sufficiently answered, stop.

Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words in length. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note, most generative AIs cannot count words in their own answer.) The thinking took a few seconds and the answers under a minute. The prompts I used for all three models tested were:

Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces).

Then I attached the lengthy question and submitted the prompt. You can download here the full exam question with some unimportant facts altered. All models understood the intent here and generated a well-written memorandum. I started a new session between questions to avoid any carryover.

Bar exam in the future with both humans and AI seek admittance to the Bar.

Metadata of All Models’ Answers

The Bar exam answers do not have required lengths (just strict time limits to write answers). When grading for pass or fail the Bar examiners check to see if an answer includes enough of the key issues and correctly discusses them. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that its answers might have missed key issues. The second shortest response was by Gemini 2.0 Flash with 1,023 words. It turns out my concerns were misplaced because their responses were better than the rest.

Here is a chart summarizing the metadata.

Model and manufacturer claim	Word Count for Exam Essay	Word Count for Prompt Reasoning before Answer
ChatGPT 4o (“great for most questions”)	681	565
ChatGPT o3-mini (“fast at advanced reading”)	3,286	450
ChatGPT o3-mini-high (“great at coding and logic”)	2,751	356
Gemini 2.0 Flash (“get everyday help”)	1,023	564
Gemini Flash Thinking Experimental (“best for multi-step reasoning”)	2,975	1,218
Gemini Advanced (cost extra and had experimental warning)	1,362	340

In my last blog article, I discussed a battle of the bots experiment where I evaluated the general reasoning ability between the six models. I decided that the Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from reasoning? I explained why it won and noted that in general the three ChatGPT models provided more concise answers than the Gemini. Second-place in the prior evaluation went to ChatGPT o3-mini-high with its more concise response.

Winners of the Legal Reasoning Bot Battle

In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.

I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.

Winning AI having its moment of glory as current reigning champ of legal reasoning, but still well below best human lawyer level.

One reason is that ChatGPT 4o was the only bot to make case references. This is not required by a Bar Exam, but sometime students do remember the names of top cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o cited case names and case citations. It did so even though this was a “closed book” type test with no models allowed to do web-browsing research. Not only that, it cited to a case with very close facts to the hypothetical. DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.

Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the governing law to commercial transactions of goods such as the purchase of a diamond as is set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student answer started with an explanation that the transaction was governed by the UCC. The UCC references of ChatGPT 4o could have been better, but there was no mention at all of the UCC by the five other models.

That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini, and Gemini Flash Thinking. They would have passed, on this question, with an essay that I considered of average quality for a passing grade. I would have passed o3-mini-high and Gemini Advanced too, but just barely for reasons I will later explain. (Explanation of o3-mini-high‘s bloopers will be in Part Two. Gemini Advanced’s error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective and the style, spelling and grammar of the generative AIs were, as always, perfect, and this may have effected my judgment.

Here is a chart my evaluation of the Bar Exam Essays.

Model and Ranking	Ralph Losey’s Grade and explanation
OpenAI – ChatGPT 4o. FIRST PLACE.	B+. Best on contract, citations, references case directly on point: DePrince.
Google – Gemini 2.0 Flash. SECOND PLACE.	B. Best on ethics, conflict of interest.
OpenAi – ChatGPT o3-mini. Tied for 3rd.	C. Solid passing grade. Covered enough issues.
OpenAI – ChatGPT o3-mini-high. Tied for 4th.	D. Barely passed. Messed up unilateral mistake.
Google – Gemini Flash Thinking Experimental. Tied for 3rd.	C. Solid passing grade. Covered enough issues.
Google – Gemini Advanced – Tied for 4th.	D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right.

Law professor with his least favorite task, grading exams.

I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.

AI Hallucination During a Bar Exam

Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.

AI hallucinating a conflict of interest in an Ethics question. No idea why.

No doubt humans have done that too for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the Ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months with thousands of prompts. Prior to that it would sometimes, but only rarely, fabricate, which is why we always say, trust but verify.

In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophantism to please the human questioner. All of the models are still somewhat sycophantic.

Sycophantism in still a widespread problem in all AIs, the avoidance of which requires careful prompting.

It is hard to understand why Gemini Advanced included in this response the following to the ethics question:

Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.

There is absolutely nothing in the question or hypothetical presented to suggest that Helen is a past client of the lawyer. Note: After I wrote this up Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model the hallucination was not repeated. It was more concise that regular Gemini 2.0 Flash and, interestingly, not nearly as good as Gemini 2.0 Flash.

Conflict of Interest Issue in the Ethics Question

The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:

Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.

One client wants to sue, the other does not. Can same attorney represent them both?

This was a solid answer, based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.” Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per carat price, not total price for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.

There is no indication that Leda had researched diamond costs like Paul had, and she probably did not know it was a mistake, and he probably had not told Leda. That would explain her eagerness to sue and get her engagement ring and Paul’s reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid ‘B‘ even though it missed the UCC issue.

Conclusion

As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.

Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.

In Part Two, we’ll also examine one of the key cases ChatGPT 4o cited—DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015)—which we suspect inspired the Bar’s essay question. Notably, the opinion written by Appellate Judge Leslie B. Rothenberg includes an unforgettable quote of the famous movie star Mae West. Part Two reveals the quote—and it’s one that perfectly captures the case’s unusual nature.

Below is an image ChatGPT 4o generated, depicting what it believes a young Mae West might have looked like, followed by a copyright free actual photo of her taken in 1932.

Mae West Image by Ralph Losey using the winning ChatGPT 4o model.

Mae West 1932 Photo that appeared in LA Times, copyright expired.

I will give the last word on Part One of this two-part article to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent article discussion. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.