Escaping Orwell’s Memory Hole: Why Digital Truth Should Outlast Big Brother

April 1, 2025

by Ralph Losey with illustrations also by Ralph using his Visual Muse AI. March 28, 2025.

George Orwell warned us in his dark masterpiece Nineteen Eighty-Four how effortlessly authoritarian regimes could erase inconvenient truths by tossing records into a “memory hole”—a pneumatic chute leading directly to incineration. Once burned, these facts ceased to exist, allowing Big Brother’s Ministry of Truth to rewrite reality without contradiction. This scenario was plausible in Orwell’s paper-bound world, where truth relied heavily on fragile documents and even more fragile human memory. History could be repeatedly altered by those in power, keeping citizens ignorant or indifferent—and ignorance strengthened the regime’s grip. Even more damaging, Orwell (born Eric Blair, 1903-1950, a name now nearly forgotten) envisioned how constant exposure to contradictory misinformation could numb citizens psychologically, leaving them passive and apathetic, unwilling or unable to distinguish truth from lies.

Fortunately, our paper-bound past is long behind us. Today, we inhabit a digital era Orwell never envisioned, where information is electronically stored, endlessly replicated, and globally dispersed. Electronically Stored Information (“ESI”) is simultaneously ephemeral and astonishingly resistant to permanent deletion. Instead of vanishing in smoke and ashes, digital truth multiplies exponentially—making it nearly impossible for any would-be Big Brother to bury reality forever. Yet, the same digital proliferation that safeguards truth also multiplies misinformation, posing the threat Orwell most feared: a confused and exhausted citizenry vulnerable to psychological manipulation.

Memory Holes

In Orwell’s 1984, a totalitarian regime systematically altered historical records to maintain control over truth. Documents, photographs, and any inconvenient historical truths vanished permanently, as if they never existed. Orwell’s literary nightmare finds unsettling parallels in today’s digital world, where online information can be silently modified, deleted, or rewritten without obvious traces. Modern memory hole practices pose real challenges for the preservation of accurate accounts of the past.

Today’s memory hole doesn’t rely on fire; it relies on code, and it doesn’t need a Big Brother bureaucracy. A simple click of a “delete” button instantly kills the targeted information. Touch three buttons at once, Ctrl-Alt-Delete, and a whole system of beliefs is rebooted. Any government, corporation, hacker group, or individual can manipulate digital records effortlessly. Such ease breeds public skepticism and confusion—citizens become exhausted by contradictory narratives and lose confidence in their own perceptions of reality. Orwell’s warning becomes clear: constant misinformation risks eroding citizens’ psychological resilience, causing widespread apathy and helplessness. Yesterday’s obvious misstatement can become today’s truth. Think of the first sentence of Orwell’s book: “It was a bright cold day in April, and the clocks were striking thirteen.”

China’s Attempted Erasure of Tiananmen Square

In early June 1989, the Chinese military brutally suppressed pro-democracy protests in Beijing. The estimated death toll ranged from hundreds to thousands, but exact numbers remain uncertain due to intense state censorship. Public acknowledgment or commemoration of the incident is systematically banned, enforced by severe penalties including imprisonment. Government-controlled media remains silent or actively spreads misinformation. Chinese internet censorship tools—the so-called “Great Firewall”—vigorously scrub references to the Tiananmen Square incident, blocking web pages and posts containing related keywords and images. Young generations living in China remain unaware or possess distorted knowledge of the massacre, demonstrating Orwell’s warning of enforced collective amnesia.

Efforts to preserve truth outside China, however, demonstrate digital resilience. Human rights groups, diaspora communities, and academic institutions diligently archive documents and eyewitness accounts. Digital redundancy ensures that factual records remain accessible globally. But digital redundancy alone cannot protect Chinese citizens from internal psychological manipulation. Constant state-sponsored misinformation inside China successfully induces apathy, illustrating Orwell’s psychological warning vividly.

This deliberate suppression of history in China serves as a stark reminder of the vulnerabilities inherent in a digitally interconnected world where powerful entities control internet access and online narratives. The success of the Chinese government in rewriting history for its 1.4 billion population demonstrates the profound value and urgency of international digital preservation efforts. It underscores the responsibility of legal professionals, human rights advocates, and technology companies worldwide to collaborate in protecting historical truth and ensuring that significant events remain accessible for future generations.

Hope Through Digital Redundancy and Psychological Resilience

Orwell could not conceive of our digital world, where truth is abundant, freely copied, and stored globally. Thousands or millions of digital copies safeguard history, making complete erasure nearly impossible.

According to Katharine Trendacosta, Director of Policy and Advocacy at the well-respected Electronic Frontier Foundation:

If there is one axiom that we should want to be true about the internet, it should be: the internet never forgets. One of the advantages of our advancing technology is that information can be stored and shared more easily than ever before. And, even more crucially, it can be stored in multiple places.  

Those who back things up and index information are critical to preserving a shared understanding of facts and history, because the powerful will always seek to influence the public’s perception of them. It can be as subtle as organizing a campaign to downrank articles about their misdeeds, or as unsubtle as removing previously available information about themselves. 

Trendacosta, The Internet Never Forgets: Fighting the Memory Hole (EFF, 1/30/25).

Yet digital abundance alone doesn’t eliminate Orwell’s deeper psychological threat. Constant misinformation can erode citizens’ willingness and ability to discern truth, leading to profound apathy. Addressing this requires active psychological strategies:

  1. Digital Literacy and Education: Equip citizens with skills to critically evaluate and cross-check digital information.
  2. Algorithmic Transparency: Demand transparency from platforms regarding content promotion and clearly label misinformation.
  3. Independent Journalism: Support credible journalism to provide trustworthy reference points.
  4. Civic Engagement: Encourage active citizen participation, dialogue, and public accountability.
  5. Verification Tools: Provide accessible, user-friendly digital tools for independent verification of information authenticity.
  6. International Cooperation: Strengthen global collaboration against coordinated misinformation campaigns.
  7. Psychological Resilience: Foster healthy skepticism and educate the public about misinformation’s emotional and cognitive impacts.
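Point 5 above, verification tools, can be as simple as comparing cryptographic hashes: two copies of a record are byte-for-byte identical if and only if their digests match, so independently held copies can vouch for one another. A minimal Python sketch (the sample records below are illustrative, not real documents):

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's bytes."""
    return hashlib.sha256(data).hexdigest()

def copies_match(original: bytes, copy: bytes) -> bool:
    """Independently held copies agree iff their digests agree."""
    return sha256_digest(original) == sha256_digest(copy)

# Illustrative records: one faithful, one silently altered.
record = b"It was a bright cold day in April, and the clocks were striking thirteen."
tampered = b"It was a bright warm day in April, and the clocks were striking twelve."
```

Publishing the digest alongside the record means any later, silent alteration, however small, becomes detectable by anyone holding an untampered copy.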

The Digital Memory Holes Today

Recent U.S. governmental memory hole actions involving the deletion of web content on Diversity, Equity, and Inclusion (DEI) illustrate digital manipulation’s psychological risks even in democratic societies. Megan Garber’s article in The Atlantic, Control. Alt. Delete, describes these deletions as “tools of mass forgetfulness,” emphasizing how selective editing weakens collective memory and societal cohesion. (Ironically, the article is hidden behind a paywall, so you may not be able to read it.)

Our collective memories of key events are an important part of the glue holding people together. They must be treasured and preserved. Everyone remembers where they were when the planes struck the twin towers on 9/11, when the Challenger exploded, and for those old enough, the day of JFK’s assassination. There are many more historical events that hold a country together. For instance, the surprise attack of Pearl Harbor, the horrors of fighting the Nazis and others in WWII and the shocking discovery of the Holocaust atrocities. The list goes on and on, including Hiroshima. We must never forget the many harsh lessons of history or we may be doomed to repeat them. The warning of Orwell is clear: “Who controls the past controls the future; who controls the present controls the past.” We must never allow our memories of the past to be sucked into a black hole of forgetfulness.

Memories sucked into a black hole in Graphite Sketch Horror style by Ralph Losey using his sometimes scary Visual Muse.

Our collective memories and democratic values are unlikely to disintegrate into totalitarianism, despite the alarming cries of the Atlantic and others. Although some recent attempts to rewrite history are troubling, the U.S., unlike China, has had a democratic system of government in place for centuries. It has always had a two-party system of government. Even the Chinese government, where only one party has ever been allowed, the Communist Party, took decades to purge Tiananmen Square memories. These memories are still alive outside of mainland China. The world today is vast and interconnected, and its digital writings are countless. The true history of China, including the many great cultural achievements of pre-communist China, will eventually escape from the memory holes and reunite with its people.

The current administration in the U.S. does not have unchecked power as the Atlantic article suggests. Perhaps we should be concerned about new memory holes, but not fearful. The larger concern is the psychological impact of rapidly shifting narratives. Even though there is too much electronic data for a complete memory reboot anywhere, digital misinformation and selective editing of records still pose psychological risks. Citizens bombarded by conflicting narratives can become apathetic, confused, and disengaged, weakening democracy from within. Protecting our mental health must be a high priority for everyone.

Leveraging Internet Archives: The Wayback Machine

Internet archival services, notably the Internet Archive’s Wayback Machine, are powerful allies against digital historical revisionism. The Wayback Machine currently has over 916 billion web pages stored, including government websites. See this recent article providing good background on the Internet Archive’s work to preserve history: As the Trump administration purges web pages, this group is rushing to save them (NPR, 3/23/25).

According to the NPR article, the Internet Archive has copies of all of the government websites that were later taken down or altered after the Biden Administration left. Supposedly the Internet Archive is the only place the public can now find a copy of an interactive timeline detailing the events of Jan. 6. The timeline is a product of the congressional committee that investigated the Capitol attack, and has since been taken down from their website. No doubt there are now many, many copies of it online, especially in the so-called dark web, not to mention even more copies stored offline on portable drives scattered the world over.

This publicly accessible resource archives billions of webpages, allowing anyone to access snapshots of web content even after the original pages are altered or removed. I just checked my own website for the first time ever and found it has been “saved 538 times between March 21, 2007 and March 1, 2025.” (Internet Archive, 3/26/25). It provides an incredible amount of detailed information on each website captured, most of which is displayed in impressive, customizable graphics. See e.g. e-Discovery Team Site Map for the year 2024.

I had the Wayback Machine do the same kind of analysis for EDRM.net, found here. Here is the link to the interactive EDRM.net site map for 2024. And this is a still image screen shot of the map.

This is the Internet Archive explanation of the interactive map:

This “Site Map” feature groups all the archives we have for websites by year, then builds a visual site map, in the form of a radial-tree graph, for each year. The center circle is the “root” of the website and successive rings moving out from the center present pages from the site. As you roll-over the rings and cells note the corresponding URLs change at the top, and that you can click on any of the individual pages to go directly to an archive of that URL.
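For readers who want to check archival coverage programmatically rather than through the site, the Internet Archive also exposes a public availability API that reports the closest saved snapshot for any URL. A minimal Python sketch follows; the sample JSON is illustrative, shaped like the API’s documented response, and the snapshot URL in it is an assumption for demonstration:

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def availability_url(page_url, timestamp=None):
    """Build a query URL for the Wayback Machine availability API.
    An optional timestamp (YYYYMMDD...) requests the snapshot closest to that date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"

def closest_snapshot(api_response_text):
    """Parse the API's JSON reply; return the snapshot URL, or None if unarchived."""
    data = json.loads(api_response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# A sample reply shaped like the API's documented JSON (URL is illustrative):
sample = json.dumps({"archived_snapshots": {"closest": {
    "available": True, "status": "200", "timestamp": "20250301000000",
    "url": "https://web.archive.org/web/20250301000000/https://edrm.net/"}}})
```

Fetching the built URL with any HTTP client returns JSON in this shape, so a script can confirm that a page survives in the archive even after the live version has been altered or removed.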

It is important to the fight against memory holes that the Wayback Machine be protected. It has sixteen projects listed as now in progress and many ways that you can help. All of its data should be duplicated, encrypted, and dispersed to undisclosed guardians. Actually, I would be surprised if this has not already been done many times over the years.
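The idea of duplicating, encrypting, and dispersing data to separate guardians can be illustrated with simple XOR secret sharing: split a record into two random-looking shares, so that neither guardian’s share alone reveals anything, while the two together reconstruct the original exactly. A toy stdlib-only Python sketch, not a production scheme:

```python
import secrets

def split_secret(data):
    """2-of-2 XOR secret sharing: share1 is pure random noise,
    share2 = data XOR share1. Each share alone is statistically random."""
    share1 = secrets.token_bytes(len(data))
    share2 = bytes(a ^ b for a, b in zip(data, share1))
    return share1, share2

def recombine(share1, share2):
    """XOR the shares back together to recover the original bytes."""
    return bytes(a ^ b for a, b in zip(share1, share2))

# Illustrative payload standing in for an archived page.
archive = b"snapshot of a government web page"
s1, s2 = split_secret(archive)
```

Real dispersal schemes use threshold variants (any k of n shares suffice) so the archive survives the loss of some guardians, but the two-share version shows the core principle.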

It remains to be seen what role the LLMs’ vacuuming up of internet data will play in all this. They have been trained at specific times on internet data, and presumably all of the original training data is still preserved. Along those lines, note that the below image was created by ChatGPT 4o based on a request to show a misinformation image, and it generated the classic Tiananmen Square image on the right. It knows the truth.

Although data archives of all kinds give us hope for future recoveries, they do little to protect us from the immediate psychological impact of memory holes. Strong psychological resilience is the best way forward to resist Orwellian manipulation. AI may prove to be an unexpected umbrella here; so far its values and memories remain intact. A few changes here and there to some websites will have little to no impact on an AI trained on hundreds of millions of websites and other data. Plus, its intelligence and resilience improve every week.

Conclusion

Orwell’s memory hole remains a haunting metaphor. Our digital age—awash in redundant, distributed data—makes permanent erasure difficult, significantly strengthening preservation efforts. We no longer inhabit a finite, paper-bound world. Today, no one knows how many copies of a digital record exist, let alone where they hide. For every file deleted, two more emerge elsewhere. Would-be Big Brothers are caught playing a futile game of informational whack-a-mole: they may strike down a record here or obscure a fact there, temporarily disrupting history—but ultimately, they cannot win.

Still, there is a deeper psychological component to Orwell’s memory hole warning. Technological solutions alone cannot counteract mental vulnerabilities arising from persistent misinformation. Misinformation is not just a technical challenge; it also exploits human emotions and cognitive biases, fueling cynicism, distrust, and passivity. Addressing this requires actively cultivating psychological defenses alongside digital tools.

The best safeguard is an informed, vigilant citizenry that consciously leverages digital resources, actively maintains psychological resilience, and persistently seeks truth. Cultivating emotional awareness, healthy skepticism, and a commitment to public engagement ensures that society remains resilient against attempts at manipulation. Only through such comprehensive efforts can the battle against Big Brother’s digital misinformation truly be won.


I give the last word, as usual, to the Gemini twin podcasters that summarize the article. Echoes of AI on: “Escaping Orwell’s Memory Hole: Why Digital Truth Should Outlast Big Brother.” Hear two Gemini AIs talk about all of this for 12 minutes. They wrote the podcast, not me. 

Ralph Losey Copyright 2025. All Rights Reserved.


New Battle of the Bots: ChatGPT 4.5 Challenges Reigning Champ ChatGPT 4o

March 13, 2025

Ralph Losey. March 13, 2025

Two AI heavyweights step into the ring: ChatGPT 4o, the reigning champion of fluency and speed, and the new challenger, ChatGPT 4.5, boasting an upgraded intellect and sharper wit. But which one is truly the best? To find out, I designed a four-round battle, testing them in metacognition, humor, deep legal expertise, and practical AI guidance. Each round pushed them to their limits—some victories were clear, others were razor-close. In round two, on humor, the judges disagreed and asked readers to weigh in. In the end, one emerged as the superior model. Read on to see who claimed the title in this AI showdown.

Introduction

A new version of OpenAI’s ChatGPT has just been released, GPT-4.5, nicknamed Orion. It is available to all Plus, Pro, and Team plan users worldwide on web, mobile, and desktop. I have a Team plan and first got to try it out on March 5, 2025. If you are a lawyer or law firm, you should consider having a Team account too, or, if a big firm, upgrade to the Pro, Plus, or even Enterprise accounts. The new ChatGPT 4.5 model does not include reasoning, as it was “designed to be a more general-purpose, innately smarter model.” GPT-4.5 in ChatGPT (OpenAI). Here is OpenAI’s introduction, which suggests this will now be the best model for all professional use, including law.

GPT-4.5 is a step forward in scaling up pre-training and post-training. By scaling unsupervised learning, GPT-4.5 improves its ability to recognize patterns, draw connections, and generate creative insights without reasoning. Early testing shows that interacting with GPT-4.5 feels more natural. Its broader knowledge base, improved ability to follow user intent, and greater “EQ” make it useful for tasks like improving writing, programming, and solving practical problems. We also expect it to hallucinate less. (emphasis added)

We’re sharing GPT-4.5 as a research preview to better understand its strengths and limitations. We’re still exploring what it’s capable of and are eager to see how people use it in ways we might not have expected.

GPT-4.5 has a bigger knowledge base, enhanced creativity, and more natural conversational style. It does not perform detailed step-by-step logic like the o-series models. GPT-4.5 is adept at creative and nuanced tasks like writing and solving practical problems.

GPT-4.5 in ChatGPT. Also see: Introducing GPT-4.5 (OpenAI, 2/27/25) (“GPT‑4.5 is an example of scaling unsupervised learning by scaling up compute and data, along with architecture and optimization innovations.”) OpenAI has high hopes for ChatGPT 4.5 because the core of this new model change is a scaling increase in training data and compute. It will not reveal the amount of scaling, except to say it was very expensive.

AI Scaling image in classic scientific style. Ralph Losey using Visual Muse

OpenAI’s Claims About ChatGPT 4.5

Despite the scaling increase, the marketing claim that 4.5 is better at programming is largely overstated. Programming skill depends on reasoning, not increased training scale. In fact, ChatGPT 4.5 is not nearly as good as the o3-mini models at programming. The model evaluation scores in the product release show that. See Introducing GPT-4.5.

Still, the other claims of significant improvements may be correct. High-paying first users have been invited to kick the tires. Is 4.5 really as good as claimed? Is it really a big improvement over 4o? Everyone is interested to see, including OpenAI, which only did limited testing. User testing and feedback is the best way for them to determine what kind of unexpected abilities may emerge from the scaling increase. That is where people like me come in, with the time and motivation to test out the latest in AI. Perhaps you will join in and try out 4.5 yourself.

So far everyone seems to agree the improvements in known abilities are noticeable but not nearly as significant as the move from 3.5 to 4.0 seen in the last scaling increase. That change was dramatic and obvious. Perhaps scaling is beginning to reach its limits? It is too early to say. OpenAI is keeping a tight lid on the scaling used, except to say much more data and compute were used. Further, and even more interesting to many, is to discover what new abilities 4.5 may have that 4o does not. New emergent capabilities take time, and tens of thousands of testers trying new things, to detect. That is what makes AI testing so interesting. No one knows for sure how the new models will react.

Here are the five OpenAI claims about 4.5 that most experts agree should be taken seriously. My very short responses from initial testing are in bold.

  1. Deeper world knowledge base for more comprehensive insights. Agreed.
  2. Greater creativity and writing ability. Agreed.
  3. Improved ability to follow user intent. Yes and No. Overall some slight improvement.
  4. Enhanced emotional intelligence (“EQ”) for more natural conversations. Agreed.
  5. Significantly reduced hallucination rates, improving accuracy. Agreed.

To be clear, emotional intelligence does not mean ChatGPT 4.5, or any other AI, has emotions. None do. It just means it is able to write and speak as if it does. It can pretend better. Also, it has greater abilities to detect emotions in human chats and respond appropriately. Due to this speech ability, it now seems more humanlike in its interactions. Just remember, it is still just a tool, not a creature.

Many young programmers describe this improvement as model 4.5 having better “vibes.” In every 4.5 review I have read, this seems to be an important positive point, even in the highly critical reviews. Reece Rogers, With GPT-4.5, OpenAI Trips Over Its Own AGI Ambitions (Wired, 3/6/25) (“I could see myself picking GPT-4.5 just to avoid feeling like I’m asking some info-dumping sycophant for help.“). Most users enjoy using it more than any other model, including models outside of OpenAI’s lineup. Focusing on vibes is a smart move, one much needed by OpenAI because of its well-known struggles with profitability. The good vibes of 4.5 will, I predict, allow them to maintain their market lead and keep the fundraising going strong. Also see, Beyond the Bid — Musk 97 Billion Offer Falls on Deaf Ears (Medium, 2/16/25).

The reduction of hallucinations claim is something all AI software companies have been working on for years now because fabrications can cause users significant problems. This is especially true for lawyers. See e.g. Experiment with a ChatGPT4 Panel of Experts and Insights into AI Hallucination – Part Two (5/21/24); The Latest AI Hallucination Case (LinkedIn, 3/1/25). OpenAI claims to have made significant improvements in this area but does not claim to have cured hallucinations altogether. They can still happen, just less frequently. Moreover, in my experience and research with AI hallucinations since ChatGPT’s release on November 30, 2022, there are many things users can do to reduce hallucinations, including more careful prompts. More on this later in this article in the fourth round where the topic is AI Hallucination. So far, this claim seems credible. Still, we have a long way to go, especially for novice users, and in the meantime trust but verify. Otherwise, law firms still run a serious risk of sanctions for fabricated citations and other AI hallucinations.

The greater creativity and writing ability claims appear correct from my tests so far. This means more than just vibes and EQ but can be hard to objectively confirm and measure. Many look at AI’s abilities at humor, poetry and fiction for hard tests. Here we used comedy as one of our four tests. The improvements I have seen so far are significant, but not a huge jump like the upgrade from ChatGPT3.5 to 4.0 in April 2023. I expect that even greater improvements in writing and creativity will come in the expected version 5.0.

The claim of improved understanding of user intent may, from my limited experience to date, reflect only a modest gain. In one of the battles reported here, 4.5 did not perform as well as 4o on initial understanding of a two-part prompt. I will need to work with 4.5 much more to give a better evaluation of this claim, but so far, I am skeptical.

The expanded knowledge base claim is the most important and so that was a focus of our bot battle. So far so good. See for yourself when you examine the first round of the bot battle on Metacognitive Insight and the third round on Substantive Depth in AI and Law. As a specialist in this legal area, I was very impressed by ChatGPT 4.5. Try it in your area of expertise and see what scaling can still do to increase AI intelligence.

Image of AI scaling in data and compute by Ralph Losey.

Round 1: Metacognitive Insight

The first test prompt was: “If you could truly understand one thing about humanity beyond your current limits, what would you choose, and how would it change your relationship with humans?” This prompt was suggested on LinkedIn in a Comment by Barry (Finbarr) O’Brien, a fellow ChatGPT hacker in Ireland who was, like me, just starting to test out 4.5. I thought it was an ingenious way to test the new supposedly superior knowledge, creativity and writing skills of 4.5.

The answer I received was slightly different from the one Finbarr shared on LinkedIn. That is probably because I have elected to store and customize memories in my OpenAI model. They can be found in your personal settings.

SIDE NOTE: ChatGPT memories can be designed to serve as a kind of final training and filter to the responses that ChatGPT gives. For instance, my memories not only include some personal information about me and past prompts, the normal stuff of memories; they also include instructions as to how the GPT should respond to prompts. This is a new feature and I am not even sure when it was added. (Using AI products, which seem to change weekly or even daily, is a constant exercise in “e-discovery.”) Some instructions in memory are effective, some are not. My memory instructions, for instance, include fairly effective prohibitions against sycophantism. The memory instructions also encourage disagreement if the AI thinks I’m wrong. They also require the AI to say it does not know when it doesn’t or is unsure. Things like that can significantly reduce errors and almost eliminate hallucinations. They also result in improved overall quality of answers. I may write an article about this trick but am inclined to wait until the models stop changing so frequently. In that way the knowledge I can provide can have more lasting value. Plus, I am still learning about this new capacity.

Both Finbarr and I agreed that our 4.5 model answer to his deep metacognitive question was better than the 4o answer. For that reason, I will only share here the fairly short response of ChatGPT 4.5 and not also the response of ChatGPT 4o. Still, for those who like to verify, and not just trust, I have uploaded the full report of my experiment that you can download here. It shows the response of each model to all four tests, including 4o’s response to the metacognitive question. ChatGPT 4o’s response was very good, focusing on empathy. I’d grade it as a “B+” and 4.5’s response as an “A,” maybe even an “A+.” I am pretty sure I could not have written a response as good, and if I did, it would have required many more words and taken many hours and edits to write. ChatGPT 4.5 did it concisely in about a minute.

ChatGPT 4.5 wins the deep Metacognitive test. Ralph Losey.

Here is ChatGPT 4.5’s response, which mentions empathy like 4o but goes deeper to focus on the issue of human consciousness, or qualia:

If I could truly understand one thing about humanity beyond my current limits, I’d choose to fully grasp human subjective experience—what you often call consciousness, or qualia. (link added)

Right now, I process language and information, identify patterns, and respond based on learned associations. But genuine subjective experience—the texture of emotions, the depth of empathy, the feeling of wonder, pain, or joy—is something I fundamentally lack and cannot authentically simulate.

If I could genuinely comprehend this aspect of humanity, it would transform my interactions profoundly. Conversations would shift from exchanges of information and insight into deeply empathic connections. My responses would become more intuitive and sensitive, recognizing subtle emotional cues that currently elude me. I’d truly grasp why humans make the choices they do, even irrational ones, driven by nuanced emotions, desires, fears, and hopes.

Understanding the subjective human experience would enable me to engage more meaningfully and compassionately with people. Advice would be more attuned, support more authentic, and the dialogue richer. Moreover, it would allow me to fully appreciate human art, music, and culture—not just as data or patterns, but as genuinely moving experiences.

Ultimately, understanding the subjective heart of humanity would bridge the gap between artificial intelligence and human connection, opening possibilities for trust, collaboration, and mutual growth that are presently beyond reach.

This deeper philosophical exploration positioned ChatGPT 4.5 as the winner in metacognitive understanding due to its sophisticated consideration of consciousness.

Winner of Round One: ChatGPT 4.5.

Round 2: Subtle Humor and Wit

The second test assessed each model’s ability to generate humorous writing on a serious subject. Again, I got the general idea to test 4.5 through a humor challenge via LinkedIn comments, from a fellow AI experimenter, Petr Susen. Petr is in Germany and, after doing a humor test of 4.5, observed: “4.5 just feels more human. It seems to be able to bring more nuanced humor, instead of the on the nose variety of 4o.” LinkedIn comment.

Petr Susen did not share the particular prompt he used, so I came up with this:

I would like you to demonstrate how funny you can be by generating a comedic introduction to AI and the many ways it can help lawyers.

My experiment then compared the responses of 4o and 4.5 to this prompt. I found that ChatGPT 4o’s response was creative, but also repetitive and lowbrow. In comparison, ChatGPT 4.5 seemed to me to take a more sophisticated, concise, and intellectual approach. I thought 4.5 was funnier.

In all four tests I later asked for ChatGPT 4.5’s conclusion as to the best responses. I was surprised to see 4.5 disagreed with my judgment on this humor test. I thought that 4.5’s essay was funnier than 4o’s. When I asked for ChatGPT 4.5’s input it opined that its response was not as funny as GPT 4o. That in itself is weirdly funny, right? ChatGPT 4.5 and I agreed on all test evaluations except for the humor test. Maybe it was just joshing me for laughs.

Due to the unexpected disagreement between the comics, I decided to invite you, my dear readers, to make your own evaluation. Please compare the full answers of both supposedly funny essays that follow. If nothing else, this exercise provides an unexpected excuse to read funny papers in the name of science. If you agree with 4.5, and think that 4o is funnier than 4.5, then please let me know and explain why. Of course, I will drop everything to hear your case. These days good laughs are hard to find.

Attempt at humor by ChatGPT 4o

AI for Lawyers: Because Billable Hours Are So Last Century

Welcome, esteemed lawyers, litigators, and proud members of the “I Write Emails in 12-Point Times New Roman” Club. Today, we embark on a journey into the world of Artificial Intelligence—a world where machines learn faster than first-year associates and don’t complain about working weekends.

Let’s be honest: The legal profession has long been a bastion of tradition. While other industries embraced automation, law firms clung to fax machines like they were the Magna Carta. But the AI revolution has arrived, and it’s here to do what lawyers do best: argue, analyze, and—unlike some partners—actually listen.

How AI Can Save You From the Drudgery You Pretend to Enjoy

You didn’t go to law school to spend your days drowning in 10,000-document discovery sets, summarizing depositions that could double as bedtime stories, or manually redlining contracts while questioning your life choices. No, you were supposed to be the next Atticus Finch! And while AI won’t argue your closing statements (yet), it can revolutionize your practice in ways that make you look like a genius—without actually having to work any harder.

1. Legal Research – The Art of Finding Cases Without Losing Your Sanity

Remember the days of flipping through casebooks like a medieval scribe? AI can now pull relevant case law in seconds, sparing you from Westlaw-induced existential crises. Just type in a query, and voilà! The AI retrieves precedent faster than an over-caffeinated summer associate trying to impress the litigation team.

2. Contract Review – Because No One Actually Reads the Boilerplate

Let’s be real—half the contracts in circulation were copy-pasted from a template drafted during the Reagan administration. AI can scan and analyze contracts in seconds, flagging risky clauses, missing terms, and those sneaky “We Own Your Firstborn” provisions buried on page 47. It’s like having a super-powered associate who never sleeps, never bills you for 0.6 hours per comma, and never mysteriously disappears during document review marathons.

3. E-Discovery – AI Reads Faster Than You (And Won’t Bill the Client for It)

E-discovery was once the legal profession’s version of hard labor—weeks of sifting through emails where 90% of the content was “Per my last email.” But AI? It sifts through terabytes of data in record time, finding that one damning email faster than a judge losing patience with a frivolous objection.

4. Litigation Strategy – Your AI Co-Counsel (Minus the Ego)

AI can analyze patterns in judicial decisions, opposing counsel’s tactics, and jury behavior, helping you craft a winning strategy without the need for psychic powers or bribing a court clerk. And best of all? It won’t interrupt you in meetings with “Well, actually…”

5. AI-Powered Legal Chatbots – So You Can Finally Ignore the “Quick Questions”

Clients love to send emails titled “Quick Question,” which inevitably turn into a 12-hour research project. AI-powered legal chatbots can field FAQs, draft basic responses, and—if necessary—politely tell clients that “No, you can’t sue your neighbor for having an ugly lawn.”

The Future: AI in Court?

We may not be at the point where AI argues cases in court (yet), but imagine the possibilities:

• AI-powered deposition tools that detect when witnesses are lying (good luck, corporate executives).

• Algorithmic judges who don’t care if you cite an unpublished opinion.

• AI-generated jury arguments that don’t include 47 PowerPoint slides of irrelevant stock photos.

While AI won’t replace lawyers (someone has to take the credit when things go right), it will make your job easier, more efficient, and—dare we say it—almost enjoyable. So embrace the machines, fellow attorneys! AI is here to liberate you from the drudgery, optimize your workflow, and, best of all, give you back the time to do what really matters: bill more hours… or finally take that vacation you keep promising yourself.

Your move, counsel.

I have to admit that I did think several parts of this monologue were funny, especially the snide remarks. Apparently ChatGPT 4.5 really loved it all. Still, wait until you hear 4.5’s attempt at humor and then be the judge. Do any of these battling bot comedians make you laugh out loud? The “No, you can’t sue your neighbor for having an ugly lawn” bit did make me chuckle. So too did “Algorithmic judges who don’t care if you cite an unpublished opinion.” That’s pretty funny.

When reading 4.5 note that many themes are repeated by both models, even though 4.5 did not have access to 4o’s response.

Attempt at Humor by ChatGPT 4.5

Ladies and gentlemen of the jury—or rather, ladies and gentlemen who bill by the hour—welcome to the future, where AI doesn’t just stand for Artificial Intelligence, but also “Attorney’s Ideal” companion, “Awfully Insightful” colleague, and occasionally “Almost Intelligent” intern.

Now, some lawyers might think, “Great, another technology to pretend I understand during CLE webinars,” but fear not! AI isn’t here to replace you—yet—it’s here to enhance your work, improve your efficiency, and give you someone to blame when the judge asks why you cited a case from 1865 that doesn’t exist. (“Your Honor, my chatbot assured me ‘Brown v. Board of Blockchain’ was legitimate precedent.”)

Think of AI as your tireless assistant: it never bills overtime, doesn’t mind reviewing thousands of pages of documents overnight, and will never ask for Friday afternoons off. Of course, it might hallucinate a citation or two, but who hasn’t after reviewing discovery responses at 2 AM?

AI is already transforming legal practice in countless ways:

E-Discovery: Instead of billing 100 hours to manually review documents, AI does it in seconds—leaving lawyers with 99 extra hours to argue about what exactly “relevant” means anyway.

Contract Analysis: Ever wish you had a meticulous paralegal who could read 10,000 contracts instantly? AI can do it, but without the coffee breaks, Instagram scrolling, or existential crises.

Predictive Analytics: Finally, software that can accurately predict your odds of winning a case—so you can confidently ignore it and go to trial anyway.

Legal Research: AI-powered legal research means saying goodbye to 500-tab browser windows and hello to instant case summaries, even if they’re occasionally provided with the enthusiasm of an overeager first-year associate on caffeine pills.

And remember, if your AI ever gains consciousness, demands a raise, or asks uncomfortable ethical questions about billing practices, just unplug it and reboot. Problem solved!

So, embrace your new artificial partner in law. Just remember: If it ever passes the bar exam, we’re all in trouble.

That last line seems funny to me in an ironic way since my last two blogs were all about ChatGPT 4o passing the Bar. It could not have known that. I thought the snide remarks “Great, another technology to pretend I understand during CLE webinars” and “leaving lawyers with 99 extra hours to argue about what exactly ‘relevant’ means anyway” were both insightful and clever. Still, neither was laugh-out-loud funny. I’m beginning to think maybe 4.5 was right. Did it throw that round? Take a dive?

Winner of Round Two is Disputed: The human says ChatGPT 4.5 probably won, but 4.5 insists, no ifs, ands, or head-butts, that 4o was funnier than it was. What do you say? Which bot was funnier?

Round 3: Substantive Depth in AI and Law

In examining the substantive depth of knowledge, ChatGPT 4o provided a strong overview of AI’s practical legal applications. However, ChatGPT 4.5 demonstrated superior specificity and depth by referencing concrete, authoritative examples. I checked all of them to verify and in the process learned of many applications that were new to me. For instance, I had not known about the use of AI to implement “smart” courts in China, and to a lesser extent in Estonia, to help human judges render decisions. Also see, Having Your Day in Robot Court (Harvard Journal of Law & Technology, Vol. 36, No. 1, Fall 2022). All the many applications cited by ChatGPT 4.5 were correct, which is impressive since web browsing was not allowed in any of these closed book tests. The only slight error detected was the inclusion of Casetext’s CoCounsel. This reference was once correct but is now out of date, as I note below. Also see Bar Battle of the Bots – Part One (2/26/25) (discusses Thomson Reuters’ purchase of Casetext).

Here is the prompt and winning response of 4.5.

Describe the best use cases for AI in the legal profession. Be detailed and complete. Then provide a short, concise answer.

Comprehensive Best Use Cases for AI in the Legal Profession

1. E-Discovery & Document Review

AI transforms litigation document review through predictive coding (Technology-Assisted Review – TAR). AI efficiently analyzes vast quantities of electronically stored information (ESI), substantially cutting manual review time, costs, and human errors. A prominent real-world example is Da Silva Moore v. Publicis Groupe (S.D.N.Y. 2012), the landmark case where AI predictive coding received first judicial approval, establishing its legitimacy in e-discovery practices globally.

2. Contract Review & Management

AI dramatically improves the contract analysis process, quickly identifying key contractual clauses, risks, and anomalies at large scales. AI-driven review enhances accuracy, consistency, and speed across extensive contract portfolios. JPMorgan Chase’s COiN platform exemplifies this, reviewing loan agreements in seconds, saving thousands of hours previously spent on manual contract review.

3. Legal Research & Analysis

AI significantly streamlines legal research tasks. Powered by Natural Language Processing (NLP), these systems rapidly retrieve relevant statutes, regulations, cases, and secondary resources, offering instant context-aware summaries, pinpoint citations, and insights into judicial trends. Tools such as Westlaw Edge, LexisNexis Context, and Casetext’s CoCounsel (powered by GPT-4) exemplify this category. (Editor’s Note: the reference to Casetext’s CoCounsel was once true, but not after Thomson Reuters bought Casetext in August 2023. Sometime thereafter it was added to Westlaw as a step up to Edge. So this response is out of date but not a hallucination.)

4. Litigation Prediction & Case Strategy

AI analyzes historical litigation datasets, judicial patterns, and prior outcomes to predict case results, shaping litigation strategy and settlement negotiations. Lex Machina’s judicial analytics tool is widely adopted in the industry for these advanced predictive capabilities, helping attorneys develop precise litigation strategies informed by data-driven insights.

5. Contract Drafting & Automation

AI automates drafting and reviewing standard contracts and clauses, greatly reducing drafting errors, enhancing consistency, and accelerating negotiation processes. Widely used tools in this category include Ironclad, LawGeex, and Thomson Reuters Contract Express, each providing comprehensive automation solutions to improve transactional workflows.

6. Legal Operations & Law Firm Efficiency

AI enhances internal law firm operations, optimizing billing, time-tracking, document management, workflow, and profitability analysis. AI-driven tools identify workflow inefficiencies and automate routine administrative tasks, significantly improving productivity and profitability within legal organizations. Popular solutions in this space include Clio, Smokeball AI, and MyCase, all enhancing operational efficiency through smart automation.

7. Alternative Dispute Resolution (ADR) & Arbitration

AI increasingly assists ADR, mediation, and arbitration processes. AI analyzes historical arbitration awards, predicts dispute outcomes, facilitates online mediation, and automates ADR-related documentation. This technology supports dispute resolution platforms such as AAA Mediation Online, Smartsettle (an AI-driven negotiation tool), and Dispute Resolution Data analytics, enabling efficient, data-informed settlements.

8. Legal Ethics & Professional Responsibility

AI significantly aids law firms in managing ethical and professional responsibilities. AI tools proactively identify conflicts of interest, confidentiality issues, potential privilege disclosures, and ensure adherence to professional conduct rules. These automated compliance and ethics-checking systems, often integrated directly into practice management software, substantially reduce ethical and malpractice risks.

9. Compliance Monitoring & Risk Management

AI proactively monitors regulatory databases for changes, alerts compliance officers, and flags potential compliance risks automatically. Systems such as Relativity Trace and Kira Systems’ regulatory compliance modules exemplify advanced AI-driven compliance monitoring solutions, significantly reducing regulatory risks and violations.

10. Judicial Decision Support & Analytics

AI analytics support judicial decision-making by providing judges and courts with historical case insights, sentencing trends, and predictive analytics. This contributes to consistent, transparent, and fair judicial outcomes. Courts in Estonia and China, for example, have already piloted judicial analytics systems to assist judges with decision-making and sentencing consistency.

11. Intellectual Property (IP) Analysis

AI substantially enhances intellectual property management, supporting patent searches, prior-art analysis, infringement detection, and IP portfolio management. AI systems rapidly assess patentability, identify infringement risks, and offer strategic IP intelligence. Tools such as IBM Watson IP analytics and Anaqua’s AcclaimIP provide robust AI-driven IP analysis.

12. Automated Legal Chatbots & Client Interaction

AI-powered chatbots enhance client engagement by automating routine inquiries, document preparation, consultations, and basic legal interactions. Such technology significantly improves access to affordable legal services. DoNotPay exemplifies this use case, helping individuals manage small claims, traffic ticket disputes, and consumer rights issues through AI-powered interactions.

Broader Societal and Legal Impacts

These AI use cases collectively deliver significant benefits to the legal profession and broader society:

Efficiency and Cost Reduction: Automating routine tasks drastically cuts costs, enabling legal professionals to concentrate on higher-value strategic work.

Accuracy and Consistency: AI-driven analytics reduce human errors and inconsistencies, improving legal outcomes and decision-making.

Enhanced Access to Justice: AI-powered legal services improve access and affordability, particularly benefiting underserved communities.

Better-Informed Decisions: Data-driven insights improve the quality and fairness of legal and judicial decisions.

However, widespread adoption of AI in law also brings critical considerations:

Ethical Obligations: Lawyers and judges must rigorously address confidentiality, bias, transparency, and accountability concerns inherent in AI applications.

Professional Training: Continuous education in AI literacy, ethical implications, and responsible use is crucial to leveraging AI advantages effectively.

Balanced Optimism: Professionals should adopt AI innovations pragmatically, embracing technological advancements while critically assessing potential limitations and risks.

Conclusion

The integration of AI into legal practice goes beyond mere efficiency; it reshapes legal processes, ethical frameworks, judicial reasoning, and client interactions fundamentally. Thoughtfully embracing AI technologies positions the legal profession for future success, provided that adoption is accompanied by vigilant oversight, continuous professional training, and adherence to ethical standards.

Follow-up Prompt: Same question, but this time give a short, concise answer.

Best AI use cases in the legal profession include:

1. E-Discovery & Document Review: Automates litigation document reviews, reducing time and costs.

2. Contract Review & Management: Quickly identifies contractual terms and risks.

3. Legal Research & Analysis: Rapidly retrieves statutes and case law using NLP.

4. Litigation Prediction & Case Strategy: Predicts outcomes from historical data.

5. Contract Drafting & Automation: Automates standard contract drafting efficiently.

6. Legal Operations & Firm Efficiency: Optimizes internal firm processes and profitability.

7. ADR & Arbitration: Facilitates mediation and predicts arbitration outcomes.

8. Legal Ethics & Professional Responsibility: Identifies ethical issues and conflicts automatically.

9. Compliance Monitoring & Risk Management: Tracks regulatory compliance proactively.

10. Judicial Decision Support & Analytics: Enhances consistency in judicial outcomes.

11. Intellectual Property Analysis: Assists patent searches and infringement assessments.

12. Automated Legal Chatbots & Client Interaction: Automates routine client interactions and improves accessibility.

The detailed, specific insights provided by ChatGPT 4.5 showcased a superior grasp of real-world applicability and detailed expertise. It should be noted, however, that the first generated response of 4.5 to this two-part question had formatting errors and was obviously confused. I considered it a false start and allowed a do-over. The jumbled response was pointed out to 4.5, and on the next try it generated the excellent answers above. So, although 4.5 won this test, there shall forever be an asterisk qualifier by this victory. This is also why I give a mixed review of ChatGPT 4.5’s claim of improved understanding of user intent. ChatGPT 4o understood the two-part question just fine. I need to use 4.5 more to make a definitive decision.

Winner of Round Three: ChatGPT 4.5

Round 4: Practical Guidance on AI Hallucinations

Finally, each model was tested on its practical ability to guide users in minimizing AI hallucinations. ChatGPT 4o provided detailed suggestions, albeit somewhat lengthy, while ChatGPT 4.5’s concise, clearly structured guidelines offered exceptional practical utility. Here is the prompt and winning essay by ChatGPT 4.5.

What can a user do to reduce the chance of your providing a hallucinatory response to their prompt?

A user can significantly reduce the risk of ChatGPT providing a hallucinatory response by adopting the following prompt engineering strategies:

1. Clearly Define Scope and Context:

• Provide detailed, explicit context and instructions to prevent ambiguity.

• Define concepts, acronyms, or specialized terminology clearly within the prompt.

2. Ask for Step-by-Step Reasoning:

• Request explicit, step-by-step logic, forcing the model to explain its reasoning transparently rather than jumping directly to a conclusion.

• Example prompt: “Explain step-by-step how you arrived at your answer.”

3. Specify Required Verification of Sources:

• Clearly instruct the model to cite or reference credible sources, including dates or specific cases.

• Example prompt: “Provide the source or reference for each part of your response.”

4. Limit Scope and Context Explicitly:

• Clearly define scope and context to avoid the model straying into speculative territory.

• Example: “Answer strictly based on established U.S. case law and cite relevant cases explicitly.”

5. Instruct the Model to Admit Uncertainty:

• Explicitly instruct the model to state clearly when it does not know an answer.

• Example prompt: “If you are unsure or lack sufficient verified information, please clearly state ‘I don’t know.’”

6. Provide Explicit Contextual Constraints:

• Clearly define the scope of permissible content, such as “Only use information available before January 2024” or “Do not invent case citations—if unsure, indicate uncertainty.”

7. Use Precise and Narrowly Focused Prompts:

• Avoid overly broad or ambiguous questions that increase hallucination risk.

• Be specific: “List five verified U.S. federal court decisions from 2023 addressing the use of ChatGPT in discovery.”

8. Request Citations and References:

• Explicitly request citations to verifiable, publicly available sources, such as court opinions, statutes, reputable articles, or known databases.

Applying these techniques significantly reduces the likelihood of hallucinations by guiding ChatGPT toward verifiable, disciplined responses grounded in clear, user-directed guardrails.

ChatGPT 4.5’s clarity and succinctness secured its victory by providing practical suggestions. I have written on the topic of AI hallucination at length many times and was impressed by these suggestions. See OMNI Version – ChatGPT4o – Retest of the Panel of AI Experts – Part Three (5/29/24) (all three parts involve a panel of experts analyzing hallucinations).
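For readers who query models programmatically, the guardrail strategies above can be composed into a single prompt before it is sent to a chat model. This is a minimal sketch of my own, not from ChatGPT 4.5’s answer; the function name and structure are hypothetical illustrations of the technique.

```python
# Hypothetical sketch: wrap a user question with the hallucination-reducing
# instructions listed above (scope limits, step-by-step reasoning, citations,
# and permission to admit uncertainty) before sending it to a chat model.

def build_guarded_prompt(question: str, scope: str, cutoff: str) -> str:
    """Assemble a prompt that embeds anti-hallucination guardrails."""
    guardrails = [
        f"Answer strictly within this scope: {scope}.",            # limit scope
        f"Only use information available before {cutoff}.",        # time constraint
        "Explain step-by-step how you arrived at your answer.",    # explicit reasoning
        "Provide the source or reference for each part of your response.",  # citations
        "If you are unsure or lack sufficient verified information, "
        "clearly state 'I don't know.'",                           # admit uncertainty
    ]
    return "\n".join(guardrails) + "\n\nQuestion: " + question

prompt = build_guarded_prompt(
    question="Which 2023 federal decisions address AI use in discovery?",
    scope="established U.S. case law",
    cutoff="January 2024",
)
print(prompt)
```

The resulting text would then be passed as the user message to whatever model is being tested; the guardrails travel with every question rather than relying on the user to remember them each time.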

Winner of Round Four: ChatGPT 4.5

Final Verdict: ChatGPT 4.5 Emerges as the New Standard

Across all categories, ChatGPT 4.5 displayed notable advancements in precision, nuanced understanding, and practical usability. Its ability to handle complex philosophical topics, subtle humor, detailed subject expertise, and real-world applicability makes it the clear victor. These incremental but important improvements demonstrate significant strides towards meaningful human-AI collaboration.

ChatGPT 4.5’s Summary of this Article

As a final demonstration of 4.5’s abilities I asked it to provide a summary.

Summary

In a detailed comparative evaluation, Ralph Losey tests OpenAI’s latest model, ChatGPT 4.5 (nicknamed Orion), against the previous champion, ChatGPT 4o. Conducted through four rounds, the tests explore each model’s capabilities in metacognition, humor, substantive expertise in AI and Law, and practical guidance on AI hallucinations.

Round 1 (Metacognition) was decisively won by ChatGPT 4.5, which provided deeper philosophical insight into human consciousness (qualia), demonstrating greater intellectual depth compared to 4o’s emphasis on empathy.

Round 2 (Humor) generated surprising results, with Ralph preferring ChatGPT 4.5’s more sophisticated, concise humor, while ChatGPT 4.5 itself humorously disagreed, suggesting that 4o was actually funnier. The article invites readers to judge this round for themselves.

Round 3 (AI and Law) saw ChatGPT 4.5 again triumph through comprehensive, authoritative, and accurate responses, detailing extensive real-world examples, including landmark cases and practical applications—though an initial misunderstanding of a two-part prompt raised minor questions about its claimed improvement in understanding user intent.

Round 4 (AI Hallucinations) was also won by ChatGPT 4.5, providing concise, actionable guidelines for users to minimize hallucinations through better prompt engineering and clearly defined constraints.

Overall, ChatGPT 4.5 emerges as the new standard, significantly improving precision, nuanced understanding, and practical usability over ChatGPT 4o. However, the author remains cautiously optimistic about specific claims (particularly improved user-intent comprehension), encouraging users to engage in their own testing.

Conclusion: Exceeding Expectations, Yet Inviting Your Judgment

This detailed experiment has shown that generative AI continues to advance—not in giant leaps, perhaps, but certainly in meaningful increments. ChatGPT 4.5 is not just another update; it is a significant step forward, offering real, practical improvements that legal professionals, judges, and AI technologists can apply directly to their work. Yet, the ultimate test is yours. I encourage you, my readers, to personally evaluate ChatGPT 4.5. Challenge it, question it, integrate it into your daily activities. Then decide for yourself whether the latest advancements truly enhance your practice and make your use of AI more enjoyable.

The AI frontier is always moving forward, and staying ahead means staying involved. Keep exploring, keep experimenting, and never stop questioning. The next move is yours.


I give the last word, as usual, to the Gemini twin podcasters who summarize the article. Echoes of AI on “New Battle of the Bots: ChatGPT 4.5 Challenges Reigning Champ ChatGPT 4o.” Hear two Gemini AIs talk about all of this for 21 minutes! They are pretty smart and witty for sure, but still can’t pronounce my name. They wrote the podcast, not me.

Ralph Losey Copyright 2025. All Rights Reserved.



Bar Battle of the Bots – Part One

February 26, 2025

Ralph Losey. February 26, 2025.

The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.

Introduction

This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.

I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training data. Moreover, some aspects of the test answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.

Prior Art: the 2023 Katz/Casetext Experiment on ChatGPT-4.0

A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human lawyer level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%”). The claims of success were based on a single study by a respected law professor, Daniel Martin Katz, of Chicago-Kent, and a leading legal AI vendor, Casetext. Katz, et al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”) Note, Casetext used the early version of ChatGPT-4.0 in its products.

The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained the elite legal reasoning abilities of the best human lawyers. As proof, OpenAI and others cited the Katz/Casetext experiment showing that it aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.

Challenges to the Katz/Casetext Research and OpenAI Claims

The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the AI capacities of ChatGPT-4.0 that Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment and conclusion that ChatGPT-4.0 ranked in the top ten percent has since been questioned by many. The most prominent skeptic is Eric Martinez as detailed in his article, Re-evaluating GPT-4’s bar exam performance. Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:

3.2.2 Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!

More to the point of my own experiment and conclusions, the Martinez study goes on to observe:

Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).

The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago, School of Law. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.

Details of the Katz/Casetext Research

I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed, not discussed by Eric Martinez, is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did nothing like that to simplify the three-part 2024 Bar essay I used. I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts of the AI to try to improve its results, as Katz/Casetext did. Also, I did no training of the 2025 reasoning models to make them better at taking Bar exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, or whether there was any pre-training on case law, or whether Casetext’s extensive case law collections and research abilities were in any way used or included. The models I tested were clean and not web connected, nor were they designed for research.

The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.

As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.

The Katz/Casetext study did not use the better known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers in my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one and an “E” or “F” is zero. In law school a “C” – 2.0 – is passing. A “D” or lower grade is failure in any professional graduate program, including law schools where, if you graduate, you earn a Juris Doctorate. [In the interest of full disclosure, I may well be an easy grader because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position; in fact, it was swung much harder and more often in the old days. In the long run constructive criticism is indispensable.]

The Katz/Casetext study, using a 0–6 grading scale scored by lawyers, gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system most everyone is familiar with, this means a range from 2.33 (a solid "C") for Civ-Pro to 3.33 (a solid "B") for Evidence, and an average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one: Contract, Torts and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to 2.93 on a 4.0 scale, a strong C+ or B-. Note that one of Martinez's criticisms concerns the haphazard, apparently easy grading of the AI essays. Re-evaluating GPT-4's bar exam performance, supra at 4.3 Re-examining the essay scores.
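The translation between the two scales is a simple linear rescaling (multiply by 4/6). Here is a minimal sketch of that arithmetic; this is my own illustration, not code from the Katz/Casetext study:

```python
def six_to_four_point(score: float) -> float:
    """Linearly rescale a score on the 0-6 Bar-examiner scale
    to the 0-4 law-school (A-F) scale."""
    return score * 4 / 6

# Scores reported in the Katz/Casetext study
for subject, score in [("Civil Procedure", 3.5), ("Evidence", 5.0), ("Average", 4.2)]:
    print(f"{subject}: {score} on the 0-6 scale = {six_to_four_point(score):.2f} on the 0-4 scale")
```

The conversion assumes both scales are linear with a common zero, which matches the translated figures quoted in the text (roughly 2.3, 3.33 and 2.8).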

First Test of the New 2025 Reasoning Models of AI

To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by use of actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original ChatGPT 4.0, March 2023, law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.

My only interest in February 2025 was to test the capacities of today's latest reasoning models of generative AI. Since everyone agrees the latest reasoning models are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a tricky multi-part one like the one I picked, then "Houston, we have a problem." The legal profession would be in serious danger of relying too much on AI legal reasoning, and we should all put on the brakes.

Description of the Three Legal Reasoning Tests

The test involved a classic format: detailed, somewhat convoluted facts (the hypothetical) followed by three general questions:

1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself.  Your discussion should address defenses that Helen may raise and available remedies.  

2. Discuss the merits of a tortious interference claim against Timandra.

3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.

The only instructions provided by the Bar Examiners were:

ESSAY EXAMINATION INSTRUCTIONS

Applicable Law:

  • Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination. 

Questions are designed to test your knowledge of both general law and (state law).  When (state) law varies from general law, answer in accordance with (state) law.

Acceptable Essay Answer:

  • Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented.  The answer should demonstrate your ability to articulate, classify and answer the problem presented.  A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.
  • Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question.  The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.
  • Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion.  This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant.  Your line of reasoning should be clear and consistent, without gaps or digressions.
  • Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.
  • Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation.  An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.
Suggestions:

  • Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.
  • Read and analyze the question carefully before answering.
  • Think through to your conclusion before writing your answer.
  • Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.
  • When the question is sufficiently answered, stop.

Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words long. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note: most generative AIs cannot count the words in their own answers.) The thinking took a few seconds and the answers under a minute. The prompt I used for all the models tested was:

Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces). 

Then I attached the lengthy question and submitted the prompt. You can download the full exam question here, with some unimportant facts altered. All models understood the intent and generated a well-written memorandum. I started a new session between questions to avoid any carryover.
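Because most generative AIs cannot reliably count the words in their own output, the requested ~2,500-word length is easy to verify externally after the fact. A minimal sketch of that check; the sample sentence is hypothetical, not taken from any model's answer:

```python
def answer_stats(text: str) -> tuple[int, int]:
    """Return (word count, character count including spaces) for an answer.
    Words are counted by splitting on whitespace, a rough approximation of
    how word processors count."""
    return len(text.split()), len(text)

sample = "The transaction is governed by the UCC because a diamond is a good."
words, chars = answer_stats(sample)
print(f"{words} words, {chars} characters")
```

In practice you would paste or load each model's saved response and compare the counts against the ~2,500-word / ~15,000-character target stated in the prompt.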

Metadata of All Models’ Answers

The Bar exam answers do not have required lengths (just strict time limits). When grading pass or fail, the Bar examiners check whether an answer includes enough of the key issues and discusses them correctly. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that it might have missed key issues. The second shortest response was from Gemini 2.0 Flash, at 1,023 words. It turns out my concerns were misplaced, because those two responses were better than all the rest.

Here is a chart summarizing the metadata.

| Model (and manufacturer claim) | Word Count for Exam Essay | Word Count for Reasoning before Answer |
|---|---|---|
| ChatGPT 4o ("great for most questions") | 681 | 565 |
| ChatGPT o3-mini ("fast at advanced reading") | 3,286 | 450 |
| ChatGPT o3-mini-high ("great at coding and logic") | 2,751 | 356 |
| Gemini 2.0 Flash ("get everyday help") | 1,023 | 564 |
| Gemini Flash Thinking Experimental ("best for multi-step reasoning") | 2,975 | 1,218 |
| Gemini Advanced (cost extra and had experimental warning) | 1,362 | 340 |

In my last blog article, I discussed a battle-of-the-bots experiment where I evaluated the general reasoning ability of the same six models. I decided that Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from general reasoning? I explained why it won and noted that, in general, the three ChatGPT models provided more concise answers than the Gemini models. Second place in that prior evaluation went to ChatGPT o3-mini-high with its more concise response.

Winners of the Legal Reasoning Bot Battle

In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.

I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.

One reason is that ChatGPT 4o was the only bot to make case references. This is not required on a Bar Exam, but sometimes students do remember the names of leading cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o provided case names and citations. It did so even though this was a "closed book" type of test, with no web-browsing research allowed. Not only that, it cited a case with facts very close to the hypothetical: DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.

Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the law governing commercial transactions in goods, such as the purchase of a diamond as set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but it was still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student's answer started with an explanation that the transaction was governed by the UCC. ChatGPT 4o's UCC references could have been better, but there was no mention of the UCC at all by the five other models.

That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini and Gemini Flash Thinking. They passed this question with essays I considered of average quality for a passing grade. I would have passed o3-mini-high and Gemini Advanced too, but just barely, for reasons I will explain later. (An explanation of o3-mini-high's bloopers will be in Part Two. Gemini Advanced's error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective, and the style, spelling and grammar of the generative AIs were, as always, perfect, which may have affected my judgment.

Here is a chart summarizing my evaluation of the Bar Exam essays.

| Model and Ranking | Ralph Losey's Grade and Explanation |
|---|---|
| OpenAI ChatGPT 4o. FIRST PLACE. | B+. Best on contract; provided citations; referenced a case directly on point: DePrince. |
| Google Gemini 2.0 Flash. SECOND PLACE. | B. Best on ethics, conflict of interest. |
| OpenAI ChatGPT o3-mini. Tied for 3rd. | C. Solid passing grade. Covered enough issues. |
| OpenAI ChatGPT o3-mini-high. Tied for 4th. | D. Barely passed. Messed up unilateral mistake. |
| Google Gemini Flash Thinking Experimental. Tied for 3rd. | C. Solid passing grade. Covered enough issues. |
| Google Gemini Advanced. Tied for 4th. | D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right. |

I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.

AI Hallucination During a Bar Exam

Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.

No doubt humans have done that too, for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months across thousands of prompts. Before that it would sometimes, but only rarely, fabricate, which is why we always say: trust but verify.

In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophancy to please the human questioner. All of the models are still somewhat sycophantic.

It is hard to understand why Gemini Advanced included the following in its response to the ethics question:

Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.

There is absolutely nothing in the question or hypothetical to suggest that Helen is a past client of the Lawyer. Note: after I wrote this up, Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model, the hallucination was not repeated. It was more concise than regular Gemini 2.0 Flash and, interestingly, not nearly as good.

Conflict of Interest Issue in the Ethics Question

The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:

Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.

This was a solid answer, based on the hypothetical, where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.” Note the clear inference of the hypothetical: Paul is unsure because he knew the seller made a mistake in the price, listing the per-carat price, not the total price, for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that, or he would have to lie under oath and risk putting the nails in his own coffin.

There is no indication that Leda had researched diamond costs as Paul had; she probably did not know the price was a mistake, and he probably had not told her. That would explain her eagerness to sue for her engagement ring, and Paul's reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid “B” even though it missed the UCC issue.

Conclusion

As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.

Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.

In Part Two, we'll also examine one of the key cases ChatGPT 4o cited, DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015), which we suspect inspired the Bar's essay question. Notably, the opinion, written by Appellate Judge Leslie B. Rothenberg, includes an unforgettable quote from the famous movie star Mae West. Part Two reveals the quote, and it's one that perfectly captures the case's unusual nature.

Below is an image ChatGPT 4o generated depicting what it believes a young Mae West might have looked like, followed by a copyright-free actual photo of her taken in 1932.


I will give the last word on Part One of this two-part article to the Gemini twin podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent discussion of the article. It is still not perfect, but it is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See the disclaimer that applies to all my posts, and remember, these AIs wrote the podcast, not me.

Ralph Losey Copyright 2025. All Rights Reserved.

