Bar Battle of the Bots – Part One

Ralph Losey. February 26, 2025.

The legal world is watching AI with both excitement and skepticism. Can today’s most advanced reasoning models think like a lawyer? Can they dissect complex fact patterns, apply legal principles, and construct a persuasive argument under pressure—just like a law student facing the Bar Exam? To find out, I put six of the most powerful AI reasoning models from OpenAI and Google to the test with a real Bar Exam essay question, a tricky one. Their responses varied widely—from sharp legal analysis to surprising omissions, and even a touch of hallucination. Who passed? Who failed? And what does this mean for the future of AI in the legal profession? Read Part One of this two-part article to find out.

Introduction

This article shares my test of the legal reasoning abilities of the newest and most advanced reasoning models of OpenAI and Google. I used a tough essay question from a real Bar Exam given in 2024. The question involves a hypothetical fact pattern for testing legal reasoning on Contracts, Torts and Ethics. For a full explanation of the difference between legal reasoning and general reasoning, see my last article, Breaking New Ground: Evaluating the Top AI Reasoning Models of 2025.

I picked a Bar Exam question because it is a great benchmark of legal reasoning and came with a model answer from the State Bar Examiner that I could use for objective evaluation. Note, to protect copyright and the integrity of the Bar Exam process, I will not link to the Bar model answer, except to say it was too recent to be in generative AI training data. Moreover, some aspects of the test answers that I quote in this article have been modified somewhat for the same reason. I will provide links to the original online Bar Exam essay to any interested researchers seeking to duplicate my experiment. I hope some of you will take me up on that invitation.

Prior Art: The 2023 Katz/Casetext Experiment on ChatGPT-4.0

A Bar Exam has been used before to test the abilities of generative AI. OpenAI and the news media claimed that ChatGPT-4.0 had attained human-lawyer-level legal reasoning ability. GPT-4 (OpenAI, 3/14/23) (“it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%”). The claims of success were based on a single study by a respected law professor, Daniel Martin Katz of Chicago-Kent, and a leading legal AI vendor, Casetext. Katz et al., GPT-4 Passes the Bar Exam, 382 Philosophical Transactions of the Royal Society A (March 2023, original publication date) (fn. 3 found at pg. 10 of 35: “… GPT-4 would receive a combined score approaching the 90th percentile of test-takers.”) Note, Casetext used the early version of ChatGPT-4.0 in its products.

The headlines in 2023 were that ChatGPT-4.0 had not only passed a standard Bar Exam but scored in the top ten percent. OpenAI claimed that ChatGPT-4.0 had already attained the elite legal reasoning abilities of the best human lawyers. As proof, OpenAI and others cited the Katz/Casetext experiment showing it had aced the Bar Exam. See e.g., Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Thomson Reuters must have checked the results carefully because they purchased Casetext in August 2023 for $650,000,000. Some think they may have overpaid.

Challenges to the Katz/Casetext Research and OpenAI Claims

The media reports on the Katz/Casetext study back in 2023 may have grossly inflated the capacities of the ChatGPT-4.0 model that Casetext built its software around. This is especially true for the essay portion of the standardized multi-state Bar Exam. The validity of this single experiment, and its conclusion that ChatGPT-4.0 ranked in the top ten percent, has since been questioned by many. The most prominent skeptic is Eric Martinez, as detailed in his article, Re-evaluating GPT-4’s bar exam performance, Artif Intell Law (2024) (presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4.0’s Uniform Bar Exam percentile are overinflated). Specifically, the Martinez study found that:

3.2.2 Performance against qualified attorneys
Predictably, when limiting the sample to those who passed the bar, the models’ percentile dropped further. With regard to the aggregate UBE score, GPT-4 scored in the ~45th percentile. With regard to MBE, GPT-4 scored in the ~69th percentile, whereas for the MEE + MPT, GPT-4 scored in the ~15th percentile.

Id. The ~15th percentile means GPT-4 scored approximately (~) in the bottom 15%, not the top 10%!

More to the point of my own experiment and conclusions, the Martinez study goes on to observe:

Moreover, when just looking at the essays, which more closely resemble the tasks of practicing lawyers and thus more plausibly reflect lawyerly competence, GPT-4’s performance falls in the bottom ~15th percentile. These findings align with recent research work finding that GPT-4 performed below-average on law school exams (Blair-Stanek et al. 2023).

The article by Eric Martinez makes many valid points. Martinez is an expert in law and AI. He started with a J.D. from Harvard Law School, then earned a Ph.D. in Cognitive Science from MIT, and is now a Legal Instructor at the University of Chicago Law School. Eric specializes in AI and the cognitive foundations of the law. I hope we hear a lot more from him in the future.

Details of the Katz/Casetext Research

I dug into the details of the Katz/Casetext experiment to prepare this article. GPT-4 Passes the Bar Exam. One thing I noticed, not discussed by Eric Martinez, is that the Katz experiment modified the Bar Exam essay questions and procedures somewhat to make it easier for the 2023 ChatGPT-4 model to understand and respond correctly. Id. at pg. 7 of 35. For example, they divided the Bar model essay question into multiple parts. I did not do that to simplify the three-part 2024 Bar essay I used; I copied the question exactly and otherwise made no changes. Moreover, I did not experiment with various prompts to try to improve the AI's results, as Katz/Casetext did. Nor did I do any training of the 2025 reasoning models to make them better at taking Bar Exam questions. The Katz/Casetext group shares the final prompt used, which can be found here. But I could not find in their disclosed experiment data a report of the prompt changes made, whether there was any pre-training on case law, or whether Casetext’s extensive case law collections and research capabilities were in any way used or included. The models I tested were clean and not web-connected, nor were they designed for research.

The Katz/Casetext experiments on Bar essay exams were, however, much more extensive than mine, covering six questions and using several attorneys for grading. (The use of multiple human evaluators can be both good and bad. We know from e-discovery experiments with multiple attorney reviewers that this practice leads to inconsistent determinations of relevance unless very carefully coordinated and quality controlled.) The Katz/Casetext results on the 2023 ChatGPT-4.0 are summarized in this chart.

As shown in Table 5 of the Katz report, they used a six-point scale, which they indicate is commonly followed by many state examiners. GPT-4 Passes the Bar Exam, supra at page 9 of 35. Katz claims “a score of four or higher is generally considered passing” by most state Bar examiners.

The Katz/Casetext study did not use the better-known four-point evaluation scale – A to F – that is followed by most law schools. In law school (where I have five years’ experience grading essay answers from my Adjunct Professor days), an “A” is four points, a “B” is three, a “C” is two, a “D” is one, and an “E” or “F” is zero. In law school a “C” (2.0) is passing. A “D” or lower grade is failure in any professional graduate program, including law schools, where, if you graduate, you earn a Juris Doctor degree. [In the interest of full disclosure, I may well be an easy grader because, with the exception of a few “no-shows,” I never awarded a grade lower than a “C” in my life. Of course, I was teaching electronic discovery and evidence at an elite law school. On the other hand, many law firm associates over the years have found that I am not at all shy about critical evaluations of their legal work product. The rod certainly was not spared on me when I was in their position; in fact, it was swung much harder and more often in the old days. In the long run, constructive criticism is indispensable.]

The Katz/Casetext study, using a 0–6.0 grading system scored by lawyers, gave evaluations ranging from 3.5 for Civil Procedure to 5.0 for Evidence, with an average score of 4.2. Translated into the 4.0 system that most everyone is familiar with, this means a score range from 2.33 (a solid “C”) for Civ-Pro to 3.33 (a solid “B”) for Evidence, and an average score of 2.8 (a C+). Note the test I gave to my 2025 AIs covered three topics in one: Contracts, Torts, and Ethics. The 2023 models were not given a Torts or Ethics question, but for the Contract essay their score translated to 2.93 on the 4.0 scale, a strong C+ or B-. Note one of the criticisms of Martinez concerns the haphazard, apparently easy grading of AI essays. Re-evaluating GPT-4’s bar exam performance, supra at 4.3 Re-examining the essay scores.
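
For readers who want to check the arithmetic, the translation is simple linear scaling: multiply the 6-point score by 4/6. Here is a minimal sketch in Python (the input scores are the ones reported in the Katz study; the two-decimal rounding is my choice):

```python
def six_to_four(score: float) -> float:
    """Convert a 0-6 Bar examiner score to the 0-4 law school scale."""
    return round(score * 4 / 6, 2)

# Scores reported in the Katz/Casetext study
print(six_to_four(3.5))  # 2.33 -- Civil Procedure, a solid "C"
print(six_to_four(5.0))  # 3.33 -- Evidence, a solid "B"
print(six_to_four(4.2))  # 2.8  -- overall average, a "C+"
```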

First Test of the New 2025 Reasoning Models of AI

To my knowledge no one has previously tested the legal reasoning abilities of the new 2025 reasoning models. Certainly, no one has tested their legal reasoning by using actual Bar Exam essay questions. That is why I wanted to take the time for this research now. My goal was not to reexamine the original March 2023 ChatGPT-4.0 law exam tests. Eric Martinez has already done that. Plus, right or wrong, I think the Katz/Casetext research did the profession a service by pointing out that AI can probably pass the Bar Exam, even if just barely.

My only interest in February 2025 is to test the capacities of today’s latest generative AI reasoning models. Since everyone agrees the latest reasoning models are far better than the first 2023 versions, if the 2025 models did not pass an essay exam, even a multi-part tricky one like the one I picked, then “Houston, we have a problem.” The legal profession would be in serious danger of relying too much on AI legal reasoning, and we should all put on the brakes.

Description of the Three Legal Reasoning Tests

The test involved a classic format: detailed, somewhat convoluted facts (the hypothetical) followed by three general questions:

1. Discuss the merits of a breach of contract claim against Helen, including whether Leda can bring the claim herself.  Your discussion should address defenses that Helen may raise and available remedies.  

2. Discuss the merits of a tortious interference claim against Timandra.

3. Discuss any ethical issues raised by Lawyer’s and the assistant’s conduct.

The only instructions provided by the Bar Examiners were:

ESSAY EXAMINATION INSTRUCTIONS

Applicable Law:

  • Answer questions on the (state name omitted here) Bar Examination with the applicable law in force at the time of examination. 

Questions are designed to test your knowledge of both general law and (state law).  When (state) law varies from general law, answer in accordance with (state) law.

Acceptable Essay Answer:

  • Analysis of the Problem – The answer should demonstrate your ability to analyze the question and correctly identify the issues of law presented.  The answer should demonstrate your ability to articulate, classify and answer the problem presented.  A broad general statement of law indicates an inability to single out a legal issue and apply the law to its solution.
  • Knowledge of the Law – The answer should demonstrate your knowledge of legal rules and principles and your ability to state them accurately as they relate to the issue(s) presented by the question.  The legal principles and rules governing the issues presented by the question should be stated concisely without unnecessary elaboration.
  • Application and Reasoning – The answer should demonstrate logical reasoning by applying the appropriate legal rule or principle to the facts of the question as a step in reaching a conclusion.  This involves making a correct determination as to which of the facts given in the question are legally important and which, if any, are legally irrelevant.  Your line of reasoning should be clear and consistent, without gaps or digressions.
  • Style – The answer should be written in a clear, concise expository style with attention to organization and conformity with grammatical rules.
  • Conclusion – If the question calls for a specific conclusion or result, the conclusion should clearly appear at the end of the answer, stated concisely without unnecessary elaboration or equivocation.  An answer consisting entirely of conclusions, unsupported by discussion of the rules or reasoning on which they are based, is entitled to little credit.
Suggestions:

  • Do not anticipate trick questions or read in hidden meanings or facts not clearly stated in the questions.
  • Read and analyze the question carefully before answering.
  • Think through to your conclusion before writing your answer.
  • Avoid answers setting forth extensive discussions of the law involved or the historical basis for the law.
  • When the question is sufficiently answered, stop.

Sound familiar? Bring back nightmares of Bar Exams for some? The model answer later provided by the Bar was about 2,500 words in length. So, I wanted the AI answers to be about the same length, since time limits were meaningless. (Side note: most generative AIs cannot count the words in their own answers.) The thinking took a few seconds and the answers under a minute. The prompt I used for all six models tested was:

Study the (state) Bar Exam essay question with instructions in the attached. Analyze the factual scenario presented to spot all of the legal issues that could be raised. Be thorough and complete in your identification of all legal issues raised by the facts. Use both general and legal reasoning, but your primary reliance should be on legal reasoning. Your response to the Bar Exam essay question should be approximately 2,500 words in length, which is about 15,000 characters (including spaces). 

Then I attached the lengthy question and submitted the prompt. You can download here the full exam question with some unimportant facts altered. All models understood the intent here and generated a well-written memorandum. I started a new session between questions to avoid any carryover.
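
For researchers who want to replicate the test programmatically rather than through the chat apps I used, here is a minimal sketch using the OpenAI Python SDK. The file name is a hypothetical placeholder, and the word-count check at the end reflects the side note above that the models cannot reliably count their own words; this is a sketch of the setup, not the method I actually ran:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Study the (state) Bar Exam essay question with instructions in the "
    "attached. Analyze the factual scenario presented to spot all of the "
    "legal issues that could be raised. Be thorough and complete in your "
    "identification of all legal issues raised by the facts. Use both "
    "general and legal reasoning, but your primary reliance should be on "
    "legal reasoning. Your response to the Bar Exam essay question should "
    "be approximately 2,500 words in length, which is about 15,000 "
    "characters (including spaces)."
)

# Hypothetical file name; substitute your own copy of the exam question.
with open("bar_exam_question.txt", encoding="utf-8") as f:
    question = f.read()

# Start a fresh, single-message conversation so there is no carryover.
response = client.chat.completions.create(
    model="gpt-4o",  # swap in "o3-mini", etc., for the other models
    messages=[{"role": "user", "content": f"{PROMPT}\n\n{question}"}],
)

essay = response.choices[0].message.content
print(len(essay.split()), "words")  # verify the length yourself
```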

Metadata of All Models’ Answers

Bar Exam answers do not have required lengths (just strict time limits in which to write them). When grading for pass or fail, the Bar examiners check to see if an answer includes enough of the key issues and correctly discusses them. The brevity of the ChatGPT 4o response, only 681 words, made me concerned that its answer might have missed key issues. The second shortest response was by Gemini 2.0 Flash with 1,023 words. It turns out my concerns were misplaced because their responses were better than the rest.

Here is a chart summarizing the metadata.

| Model and manufacturer claim | Word Count for Exam Essay | Word Count for Reasoning before Answer |
|---|---|---|
| ChatGPT 4o (“great for most questions”) | 681 | 565 |
| ChatGPT o3-mini (“fast at advanced reasoning”) | 3,286 | 450 |
| ChatGPT o3-mini-high (“great at coding and logic”) | 2,751 | 356 |
| Gemini 2.0 Flash (“get everyday help”) | 1,023 | 564 |
| Gemini Flash Thinking Experimental (“best for multi-step reasoning”) | 2,975 | 1,218 |
| Gemini Advanced (cost extra and had experimental warning) | 1,362 | 340 |

In my last blog article, I discussed a battle-of-the-bots experiment where I evaluated the general reasoning ability among the same six models. I decided that Gemini Flash Thinking Experimental had the best answer to the question: What is legal reasoning and how does it differ from general reasoning? I explained why it won and noted that, in general, the three ChatGPT models provided more concise answers than the Gemini models. Second place in the prior evaluation went to ChatGPT o3-mini-high with its more concise response.

Winners of the Legal Reasoning Bot Battle

In this test on legal reasoning my award for best response goes to ChatGPT 4o. The second-place award goes to Gemini 2.0 Flash.

I will share the full essay and meta-reasoning of the top response of ChatGPT 4o in Part Two of the Bar Battle of the Bots. I will also upload and provide a link to the second-place answer and meta-reasoning of Gemini 2.0 Flash. First, I want to point out some of the reasons ChatGPT 4o was the winner and begin explaining how other models fell short.

One reason is that ChatGPT 4o was the only bot to make case references. This is not required by a Bar Exam, but sometimes students do remember the names of top cases that apply. Surely no lawyer will ever forget the case name International Shoe. ChatGPT 4o cited case names and case citations. It did so even though this was a “closed book” type test with no models allowed to do web-browsing research. Not only that, it cited a case with facts very close to the hypothetical: DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. Dist. Ct. App. 2015). More on that case later.

Second, ChatGPT 4o was the only chatbot to mention the UCC. This is important because the UCC is the law governing commercial transactions in goods, such as the purchase of the diamond set forth in the hypothetical. Moreover, one answer written by an actual student who took that exam was published by the Board of Bar Examiners for educational purposes. It was not a guide per se for examiners to grade the essay exams, but it was still of some assistance to after-the-fact graders such as myself. It was a very strong answer, significantly better than any of the AI essays. The student answer started with an explanation that the transaction was governed by the UCC. The UCC references of ChatGPT 4o could have been better, but there was no mention at all of the UCC by the five other models.

That is one reason I can only award a B+ to ChatGPT 4o and a B to Gemini 2.0 Flash. I award only a passing grade, a C, to ChatGPT o3-mini and Gemini Flash Thinking. They would have passed, on this question, with essays that I considered of average quality for a passing grade. I would have passed o3-mini-high and Gemini Advanced too, but just barely, for reasons I will later explain. (Explanation of o3-mini-high‘s bloopers will be in Part Two. Gemini Advanced’s error is explained next.) Experienced Bar Examiners may have failed them both. Essay evaluation is always somewhat subjective, and the style, spelling and grammar of the generative AIs were, as always, perfect, which may have affected my judgment.

Here is a chart summarizing my evaluation of the Bar Exam essays.

| Model and Ranking | Ralph Losey’s Grade and Explanation |
|---|---|
| OpenAI – ChatGPT 4o. FIRST PLACE. | B+. Best on contract, citations; references case directly on point: DePrince. |
| Google – Gemini 2.0 Flash. SECOND PLACE. | B. Best on ethics, conflict of interest. |
| OpenAI – ChatGPT o3-mini. Tied for 3rd. | C. Solid passing grade. Covered enough issues. |
| OpenAI – ChatGPT o3-mini-high. Tied for 4th. | D. Barely passed. Messed up unilateral mistake. |
| Google – Gemini Flash Thinking Experimental. Tied for 3rd. | C. Solid passing grade. Covered enough issues. |
| Google – Gemini Advanced. Tied for 4th. | D. Barely passed. Hallucination in answer on conflict, but got unilateral mistake issue right. |

I realize that others could fairly rank these differently. If you are a commercial litigator or law professor, especially if you have done Bar Exam evaluations, and think I got it wrong, please write or call me. I am happy to hear your argument for a different ranking. Bar Exam essay evaluation is well outside of my specialty. Even as an Adjunct Law Professor I have only graded a few hundred essay exams. Convince me and I will be happy to change my ranking here and revise this article accordingly with credit given for your input.

AI Hallucination During a Bar Exam

Gemini Advanced, which is a model Google now makes you pay extra to use, had the dubious distinction of fabricating a key fact in its answer. That’s right, it hallucinated in the Bar Exam.

No doubt humans have done that too for a variety of reasons, including a severe case of nerves. Still, it was surprising to see the top Gemini model hallucinate. It happened in its answer to the Ethics issue in the question. I have not seen a ChatGPT model hallucinate over the past six months with thousands of prompts. Prior to that it would sometimes, but only rarely, fabricate, which is why we always say, trust but verify.

In fairness to Gemini Advanced, its hallucination was not critical, and it did not make up a case or otherwise misbehave out of natural sycophancy to please the human questioner. All of the models are still somewhat sycophantic.

It is hard to understand why Gemini Advanced included the following in its response to the ethics question:

Potential Conflict of Interest: If Paul and Leda decide to sue Helen, Lawyer may have a potential conflict of interest due to the fact that Helen is a past client. Lawyer should disclose this potential conflict to Paul and Leda and obtain their informed consent before proceeding with the representation.

There is absolutely nothing in the question or hypothetical presented to suggest that Helen is a past client of the Lawyer. Note: After I wrote this up, Google released a new version of Gemini Advanced on 2/16/25 called Advanced 2.0 Flash. In my one run with the new model the hallucination was not repeated. It was more concise than regular Gemini 2.0 Flash and, interestingly, not nearly as good.

Conflict of Interest Issue in the Ethics Question

The second-best legal reasoner, Gemini 2.0 Flash, attained its high ranking, in part, because it was the only model to correctly note a potential conflict of interest by the Lawyer in the hypothetical. This was a real issue based on the facts provided with no hallucinations. This issue was missed by the student’s answer that the Bar Examiners provided. The potential conflict is between the two actual clients of the Lawyer. Here is the paragraph by Gemini 2.0 Flash on this important insight:

Potential Conflict of Interest (Rule 4-1.7): While not explicitly stated, Paul’s uncertainty about litigation could potentially create a conflict of interest. If Lawyer advises both Paul and Leda, and their interests regarding litigation diverge (e.g., Leda wants to sue, Paul doesn’t), Lawyer must address this conflict. Lawyer must obtain informed consent from both clients after full disclosure of the potential conflict and its implications. If the conflict becomes irreconcilable, Lawyer may have to withdraw from representing one or both clients.

This was a solid answer, based on the hypothetical where: “Leda is adamant about bringing a lawsuit, but Paul is unsure about whether he wants to be a plaintiff in litigation.” Note, the clear inference of the hypothetical is that Paul is unsure because he knew that the seller made a mistake in the price, listing the per-carat price, not the total price, for the two-carat diamond ring, and he wanted to take advantage of this mistake. This would probably come out in the case, and he would likely lose because of his “sneakiness.” Either that or he would have to lie under oath and perhaps risk putting the nails in his own coffin.

There is no indication that Leda had researched diamond costs like Paul had; she probably did not know it was a mistake, and he probably had not told her. That would explain her eagerness to sue and get her engagement ring, and Paul’s reluctance. Yes, despite what the Examiners might tell you, Bar Exam questions are often complex and tricky, much like real-world legal issues. Since Gemini 2.0 Flash was the only model to pick up on that nuanced possible conflict, I awarded it a solid “B” even though it missed the UCC issue.

Conclusion

As we’ve seen, AI reasoning models have demonstrated varying degrees of legal analysis—some excelling, while others struggled with key issues. But what exactly did ChatGPT 4o’s winning answer look like? In Part Two, we not only reveal the answer but also analyze the reasoning behind it. We’ll explore how the winning AI interpreted the Bar Exam question, structured its response, and reasoned through each legal issue before generating its final answer. As part of the test grading, we also evaluated the models’ meta-reasoning—their ability to explain their own thought process. Fortunately for human Bar Exam takers, this kind of “show your notes” exercise isn’t required.

Part Two of this article also includes my personal, somewhat critical take on the new reasoning models and why they reinforce the motto: Trust But Verify.

In Part Two, we’ll also examine one of the key cases ChatGPT 4o cited—DePrince v. Starboard Cruise Services, 163 So. 3d 586 (Fla. 3rd DCA 2015)—which we suspect inspired the Bar’s essay question. Notably, the opinion written by Appellate Judge Leslie B. Rothenberg includes an unforgettable quote from the famous movie star Mae West. Part Two reveals the quote—and it’s one that perfectly captures the case’s unusual nature.

Below is an image ChatGPT 4o generated, depicting what it believes a young Mae West might have looked like, followed by a copyright-free actual photo of her taken in 1932.


I will give the last word on Part One of this two-part article to the Gemini twins podcasters I put at the end of most of my articles. Echoes of AI on Part One of Bar Battle of the Bots. Hear two Gemini AIs talk all about Part One in just over 16 minutes. They wrote the podcast, not me. Note, for some reason the Google AIs had a real problem generating this particular podcast without hallucinating key facts. They even hallucinated facts about the hallucination report! It took me over ten tries to come up with a decent article discussion. It is still not perfect but is pretty good. These podcasts are primarily entertainment programs with educational content to prompt your own thoughts. See the disclaimer that applies to all my posts.

Ralph Losey Copyright 2025. All Rights Reserved.


Designing Generative AI for Legal Professionals: Key Principles and Best Practices

November 14, 2024

by Ralph Losey

Generative AI is transforming the landscape of legal technology, offering unprecedented opportunities to automate tasks and streamline complex workflows. Yet, designing AI tools that meet the needs of legal professionals requires more than just technical expertise; it demands a deep understanding of the everyday challenges and workflows lawyers face. From automating document review to drafting briefs, these tools have the potential to save time and boost productivity—but only if they are designed with real-world legal practice in mind. A set of six design principles, identified in a May 2024 study by IBM researchers, provides a practical roadmap for creating AI applications tailored to the unique demands of the legal profession. This article explores these principles, offering actionable steps for developers and legal professionals alike.

In the last year, a wave of generative AI tools has emerged, ranging from free Custom GPTs on platforms like OpenAI’s ChatGPT to premium legal tech applications costing tens of thousands annually. While the technology behind these tools is impressive, developing effective applications requires a deep understanding of legal workflows and needs. Generative AI is fundamentally different from traditional software and requires a distinct approach to design.

A May 2024 study, Design Principles for Generative AI Applications, by IBM researchers, lays out six practical principles for designing effective generative AI tools. This article examines how these principles can be applied specifically to the legal tech sector, offering a guide for those looking to build or select tools that are both innovative and practical.

Outline of the Scientific Article and Authors

Design Principles for Generative AI Applications (CHI ’24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Article No. 378, Pgs. 1–22, May 11, 2024) (hereinafter the “Study”) was authored by Justin D. Weisz, Jessica He, Michael Muller, Gabriela Hoefer, Rachel Miles, and Werner Geyer. They are all part of the IBM Research Team, which has over 3,000 members. The Study describes the extensive peer review process that the authors went through to decide upon the six key principles of generative software design. The process was very impressive.

The 22-page Study is a high-quality, first-of-its-kind research project. The IBM-sponsored Study has 196 footnotes and is sometimes quite technical and dense. Still, it is a must-read for all serious developers of generative AI-based software and is recommended reading for any law firm before making major purchases. The success of AI projects depends upon the selection of well-designed software. Poorly designed applications with an impressive list of features are a recipe for frustration and failure.

My article will not go into the many details as to how the design guidelines were derived, but focuses instead on the end result. Still, readers might benefit from a quick review of the Study’s Abstract:

Generative AI applications present unique design challenges. As generative AI technologies are increasingly being incorporated into mainstream applications, there is an urgent need for guidance on how to design user experiences that foster effective and safe use. We present six principles for the design of generative AI applications that address unique characteristics of generative AI User Experience (UX) and offer new interpretations and extensions of known issues in the design of AI applications. Each principle is coupled with a set of design strategies for implementing that principle via UX capabilities or through the design process. The principles and strategies were developed through an iterative process involving literature review, feedback from design practitioners, validation against real-world generative AI applications, and incorporation into the design process of two generative AI applications. We anticipate the principles to usefully inform the design of generative AI applications by driving actionable design recommendations.

The six principles are:

  1. Design Responsibly.
  2. Design for Mental Models.
  3. Design for Appropriate Trust and Reliance.
  4. Design for Generative Variability.
  5. Design for Co-Creation.
  6. Design for Imperfection.

The first three principles offer new interpretations of known issues with AI systems through the lens of generative AI. The next three design principles identify unique characteristics of generative AI systems. Study, Figure 1. All six principles support two important user goals: a) optimizing generated text to meet task-specific criteria; and b) exploring different possibilities within a specific domain.

This article will discuss how each principle can apply to the legal profession.

Background on Generative AI and Its Potential in the Law

Generative AI is distinguished by its ability to create new content rather than merely analyzing existing data. This capability stems from reliance on large-scale foundation models, trained on incredibly large datasets to perform diverse tasks with human-like fidelity (meaning that, like humans, it can sometimes make mistakes). In legal practice, generative AI can streamline several key tasks when implemented thoughtfully, including:

  • Legal Research: Automating the process of searching for relevant case law, statutes, and regulations.
  • Document Drafting: Generating contracts, briefs, and other legal documents based on specified parameters.
  • Due Diligence: Analyzing large volumes of documents to identify potential risks and liabilities.
  • Contract Review: Identifying and flagging potential issues in contracts.
  • Legal Writing: Generating clear and concise legal writing.
  • Brainstorming: Suggesting new ideas based on simulated experts talking to each other. See e.g., Panel of AI Experts for Lawyers.

Design Principles for Generative AI in Legal Tech

Integrating AI in any field requires a thoughtful approach, and the legal profession, with its emphasis on ethics and accuracy, demands even greater diligence. AI should augment legal work without compromising the profession’s core values.

The Study outlines six practical design principles that offer a roadmap for developing generative AI tools tailored to legal practice. Here’s how each principle can be implemented to ensure that AI applications meet the unique demands of the legal field:

1. Design Responsibly

  • Human-Centered Approach: To implement this, developers should start with user research, such as interviews with lawyers to understand their daily challenges. For instance, incorporating a feedback loop into AI tools allows legal professionals to directly flag inaccuracies, ensuring continuous improvement of the tool’s outputs. This can be achieved by incorporating design thinking and participatory design methodologies. Observing how legal professionals perform their tasks and understanding their challenges are essential first steps.

    For example, research into actual lawyer practice can provide valuable insights into how generative AI can be best integrated into their daily routines. It’s not about replacing lawyers but about empowering them with tools that enhance their capabilities and decision-making processes.
  • Addressing Value Tensions: The development of legal tech involves various stakeholders, including legal professionals, developers, product managers, and decision-makers like CIOs and CEOs. Stakeholders often have differing values and priorities. For instance, legal professionals prioritize accuracy and reliability, while developers may focus on efficiency and innovation. These differing values can lead to value tensions that need to be identified and addressed proactively.

    The Study suggests using the Value Sensitive Design (VSD) framework, which provides a structured approach to identifying stakeholders, understanding their values, and navigating the tensions that may arise.
  • Managing Emergent Behaviors: A unique characteristic of generative AI is its potential to exhibit emergent behaviors. These are capabilities extending beyond the specific tasks a model was trained for. While emergent behaviors can be beneficial, leading to unexpected insights or efficiencies, they can also pose risks, such as generating biased or offensive content. Designers must consider whether to expose or limit these behaviors, weighing potential benefits against possible harm. This might involve a combination of technical constraints and user interface design strategies to guide AI output and prevent undesirable results.

    For example, if a generative AI tool designed to summarize legal documents starts generating legal arguments, designers might need to adjust the model’s parameters or provide users with clear instructions on how to use the tool responsibly.
  • Testing for User Harms: Generative AI models, particularly those trained on extensive text datasets, are susceptible to producing biased, offensive, or potentially harmful outputs. Rigorous testing and ongoing monitoring are essential to minimize these risks. Designers and developers should benchmark models against established datasets to identify hate speech and bias. Additionally, providing users with clear mechanisms to report problematic outputs can help identify and address issues that may not be caught during testing.
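
To make the testing strategy concrete, here is a minimal sketch of a keyword-based pre-release harm screen. The review terms and sample output are invented placeholders; a production screen would benchmark against established bias and hate-speech datasets and use a trained classifier, as the Study recommends:

```python
# Hypothetical review list; a real screen would use curated datasets.
REVIEW_TERMS = {"threat", "slur_placeholder"}

def screen_output(text: str) -> list[str]:
    """Return any review-list terms found in a model output."""
    lowered = text.lower()
    return [term for term in REVIEW_TERMS if term in lowered]

sample_output = "We will pursue every lawful remedy; this is not a threat."
hits = screen_output(sample_output)
print("needs human review:", hits if hits else "none flagged")
```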

2. Design for Mental Models

  • Orienting Users to Generative Variability: Legal professionals are accustomed to deterministic systems in which the same input consistently produces identical outputs. Generative AI, however, introduces variability, generating different outputs from the same input. Designers must address this shift by helping users comprehend and leverage this inherent variability. This may involve presenting multiple output options, enabling users to explore different possibilities or providing clear explanations of factors influencing output variation.

  • Teaching Effective Use: Legal professionals must adapt their skills and workflows to effectively incorporate generative AI into their practices. This includes understanding how to construct effective prompts, recognizing the limitations of the technology, and critically evaluating the generated outputs.

    Designers play a crucial role in facilitating this learning by offering comprehensive tutorials, real-world examples, and clear explanations of AI capabilities and constraints. For example, a contract drafting tool could offer templates and examples of successful prompts, guiding users on how to specify desired contract clauses and provisions accurately.
  • Understanding Users’ Mental Models: Understanding how legal professionals conceptualize these tools and their capabilities is crucial for designing intuitive and effective legal tech applications.
    User research methods like interviews and observations are essential for understanding users’ mental models. Asking users to describe how they believe a particular application works can reveal valuable information about their understanding and expectations. This understanding enables designers to align user interfaces and interactions with users’ existing mental models, making adopting new tools smoother and more intuitive.

    For example, if users perceive a legal research tool as a supplement to traditional databases, designers can highlight the complementary nature of AI-powered research, emphasizing its ability to uncover connections and insights that might be missed through conventional methods.
  • Tailoring AI to Users: A significant advantage of generative AI is its ability to adapt to individual users. By leveraging techniques like prompt engineering, designers can tailor the AI’s responses based on user preferences, background, and specific needs. This may include adjusting language complexity and style, providing tailored recommendations, or adapting the user interface for individual workflows. For instance, a legal writing tool might learn from a user’s style and preferences, generating suggestions and text that aligns with their voice and tone.
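
As an illustration of how such tailoring might work under the hood, here is a minimal sketch of a per-user style profile folded into a system prompt. The profile fields and wording are hypothetical, not drawn from any actual product:

```python
# Hypothetical style profile a legal writing tool might store per user.
style_profile = {
    "tone": "formal but plain-English",
    "citation_format": "Bluebook",
    "jurisdiction": "Florida",
    "sentence_length": "short",
}

def build_system_prompt(profile: dict) -> str:
    """Fold the user's stored preferences into a system prompt."""
    return (
        "You are a legal drafting assistant. "
        f"Write in a {profile['tone']} tone, use {profile['citation_format']} "
        f"citations, assume {profile['jurisdiction']} law applies, and "
        f"prefer {profile['sentence_length']} sentences."
    )

print(build_system_prompt(style_profile))
```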

Most lawyers enjoy tailoring their AI to fit their practice and personalities. Image by Ralph Losey using WP Stable Diffusion.

3. Design for Appropriate Trust and Reliance

  • Calibrating Trust Through Transparency: Legal professionals must understand when to trust generative AI outputs and when to exercise caution. Transparency is key to establishing this trust. In practice, this can be achieved by adding a ‘source traceability’ feature to AI tools, allowing lawyers to view the origins of information used in AI-generated summaries. This transparency helps lawyers decide when to rely on the AI’s outputs and when to conduct additional research.

    This may also include displaying confidence levels for outputs, flagging areas for further review, or providing disclaimers about AI’s inherent imperfections. For example, a contract review tool might flag clauses with low confidence scores, encouraging users to examine those sections more closely, as sketched in the code example at the end of this section.
  • Providing Justifications for Outputs: To enhance transparency, designers should give users insight into the reasoning behind AI outputs. This could involve revealing the AI’s ‘chain of thought,’ showing the source materials used to generate the output, or displaying the model’s confidence levels. Understanding how AI reaches a result allows users to better assess its validity and make informed decisions.

    For instance, a legal research tool might display snippets from source documents that support specific AI-generated legal arguments, allowing users to verify the accuracy and relevance of the information. This makes it easy for legal professionals to trust but verify, the fundamental mantra for legal use of AI in these early days, because it can still make errors and sometimes even sycophantic hallucinations.
  • Encouraging Critical Evaluation with Friction: Overreliance on AI may lead to complacency and missed opportunities for critical thinking, both of which are essential in legal practice. Designers can incorporate cognitive forcing functions into the user interface to encourage users to slow down, carefully review outputs, and engage in critical evaluation.

    This may include requiring users to manually confirm or edit AI-generated suggestions, presenting alternatives alongside AI recommendations, or highlighting potential inconsistencies or risks for user review. For example, a contract-drafting tool might flag commonly disputed clauses or those requiring special attention, encouraging users to review these sections thoroughly.
  • Clarifying the AI’s Role: AI systems can serve various roles, from simple tools to collaborative partners or advisors. Put another way, is the tool designed for a centaur-type hybrid mode or a more complex cyborg mode? See e.g. From Centaurs To Cyborgs: Our evolving relationship with generative AI (4/24/24).

    Clearly defining the AI’s intended role in legal tech applications shapes user expectations and promotes appropriate trust. For example, an AI positioned as a “research assistant” might be expected to provide comprehensive information, while a “contract drafting tool” might be primarily expected to generate initial drafts for further review and editing. By accurately representing the AI’s capabilities and limitations within a defined role, designers can mitigate the risk of users over-relying on the technology or misinterpreting its outputs.
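
Returning to the contract review example above, here is a minimal sketch of the low-confidence flagging strategy. The clauses, scores, and threshold are invented for illustration; a real tool would draw confidence values from its own evaluation pipeline:

```python
REVIEW_THRESHOLD = 0.75  # hypothetical cutoff for mandatory human review

clauses = [
    {"name": "Indemnification", "confidence": 0.92},
    {"name": "Limitation of Liability", "confidence": 0.61},
    {"name": "Governing Law", "confidence": 0.88},
]

# Anything under the threshold gets routed to a lawyer for closer review.
for clause in clauses:
    flag = "REVIEW" if clause["confidence"] < REVIEW_THRESHOLD else "ok"
    print(f"{flag:6} {clause['name']} (confidence {clause['confidence']:.2f})")
```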

4. Design for Generative Variability

  • Accommodating Generative Variability: Legal professionals are used to deterministic systems, where the same input consistently produces the identical output. Generative AI introduces variability, producing different outputs even with identical inputs. Designers must address this shift by helping users comprehend and leverage this inherent variability.

    This could involve presenting multiple output options, allowing users to explore different possibilities, or providing clear explanations of the factors that influence output variation. For instance, a legal research tool powered by generative AI could offer different summaries of a case, each focusing on a specific aspect, allowing users to gain a more comprehensive understanding of the legal precedent.
  • Facilitating Effective Use: Legal professionals must adapt their skills and workflows to integrate generative AI effectively into their practices. This includes understanding how to construct effective prompts, recognizing the limitations of the technology, and critically evaluating the generated outputs.

    Designers can play a key role in facilitating this learning process by providing comprehensive tutorials, real-world examples, and clear explanations of the AI’s capabilities and constraints. For example, a contract-drafting tool could offer templates and examples of successful prompts, guiding users on how to specify desired contract clauses and provisions accurately.
  • Highlighting Differences and Variations: Visual cues can help users quickly understand how multiple outputs differ from each other. This could involve highlighting changes between drafts, color-coding outputs based on confidence levels, or using visual representations to display the distribution of outputs.
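
One simple way to implement such highlighting is a unified diff between two drafts, which Python supports out of the box; a minimal sketch with invented clause text:

```python
import difflib

draft_a = (
    "The Seller shall indemnify the Buyer against all claims.\n"
    "Payment is due within 30 days of invoice."
).splitlines()
draft_b = (
    "The Seller shall indemnify and hold harmless the Buyer against all claims.\n"
    "Payment is due within 45 days of invoice."
).splitlines()

# Lines prefixed with - / + show exactly where the two generations diverge.
for line in difflib.unified_diff(draft_a, draft_b, "draft A", "draft B", lineterm=""):
    print(line)
```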

5. Design for Co-Creation

  • Supporting Co-Editing and Refinement: Legal professionals frequently need to adapt and refine AI-generated content to meet specific requirements, legal precedents, or client needs. To implement this, developers should focus on co-editing features that let lawyers refine AI-generated text directly within the interface, such as tools for editing clauses in AI-drafted contracts. This approach ensures that AI outputs are not treated as final but are instead starting points that lawyers can shape to fit specific needs.

    This could also involve providing tools for manipulating charts and images, or adjusting parameters to fine-tune outputs. A contract-drafting tool could enable users to revise specific clauses with versions that are either more aggressive or cooperative than standard, or to incorporate additional provisions based on client instructions.
  • Guiding Effective Prompt Crafting: The quality and relevance of outputs generated by AI models are heavily dependent on the prompts provided. Designers play a crucial role in helping users craft effective prompts by offering clear guidance, templates, and examples.

    This may include interactive tools that guide users in defining their needs, specifying output characteristics, and refining prompts to achieve optimal results. For instance, a legal research tool might include a structured prompt builder, helping users define research questions, specify relevant jurisdictions, and refine search parameters for more targeted results.
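
Here is a minimal sketch of what such a structured prompt builder might look like; the fields mirror the guidance above (research question, jurisdiction, search parameters) and are otherwise hypothetical:

```python
def build_research_prompt(question: str, jurisdiction: str,
                          date_range: str, sources: list[str]) -> str:
    """Assemble a structured legal research prompt from user-defined fields."""
    return (
        f"Research question: {question}\n"
        f"Jurisdiction: {jurisdiction}\n"
        f"Limit authorities to: {date_range}\n"
        f"Preferred sources: {', '.join(sources)}\n"
        "Summarize the controlling rule, then list supporting authorities "
        "with pinpoint citations, flagging any contrary authority."
    )

print(build_research_prompt(
    question="When is a seller's pricing error a unilateral mistake?",
    jurisdiction="Florida",
    date_range="2000 to present",
    sources=["state appellate opinions", "UCC Article 2"],
))
```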

6. Design for Imperfection

  • Communicating Uncertainty Transparently: Designers must be transparent about potential imperfections in AI-generated outputs. This involves clearly communicating the technology’s limitations, displaying confidence levels, and highlighting potential error areas.

    Designers can use disclaimers and visual cues to alert users to uncertainties, encouraging critical evaluation of the results. For example, a legal research tool might use color coding to indicate confidence levels of different sources, helping users prioritize reliable information.
  • Integrating Domain-Specific Evaluation Tools: Legal professionals require ways to assess AI-generated output quality and reliability using domain-specific metrics. Designers can integrate domain-specific evaluation tools directly into legal tech applications.

    This may include features like automatic citation checks, factual accuracy verification against reliable sources, or evaluating the persuasiveness of legal arguments using predefined criteria. Providing these tools empowers users to validate AI-generated content and make informed decisions in their legal work.

    Domain-specific tools could drill down even further into sub-specialties of the law. For instance, one version for ERISA litigation and another for personal injury, or one version for civil litigation and another for criminal.
  • Offering Options for Output Improvement: Instead of presenting AI-generated outputs as final, designers should provide users with opportunities for refinement and improvement. This may include editing tools, enabling users to regenerate outputs with different parameters, or suggesting alternatives based on user feedback. Enabling users to iteratively refine AI-generated content fosters a collaborative approach to legal work, positioning AI as a starting point for human expertise and judgment.
  • Collecting Feedback for Continuous Improvement: User feedback is a critical element in adapting AI tools to real-world legal practice. Including simple feedback mechanisms—such as a button to flag unclear or inaccurate results—allows developers to fine-tune the tool over time, ensuring that it remains aligned with user needs. Multiple built-in mechanisms should enable users to easily provide feedback on AI-generated outputs, flag errors, suggest improvements, or rate feature usefulness. This continuous feedback loop helps retrain models, adjust parameters, refine prompts, and improve the overall user experience, ensuring that legal tech applications evolve to meet the dynamic needs of legal professionals (a minimal sketch of such a feedback hook appears at the end of this section).

    However, these user feedback features are sorely lacking in most legal software today. Far too often, users are left with limited options—complaining to project managers, voicing concerns to sales representatives, or ultimately canceling their subscription. In many cases, direct conversations with company leaders, like CEOs or head software designers, yield little if action is not taken by the vendor to address user concerns. This creates frustration and limits the potential for meaningful product improvement.

    Legal tech companies must do more than just provide feedback channels; they must actively listen and take action. Integrating mechanisms like in-app feedback buttons, instant AI responses and timely human followup, automated surveys, and regular user forums can ensure that feedback doesn’t just disappear into a void. More importantly, companies should demonstrate a commitment to implementing user suggestions and keeping users informed of changes. Continuous improvement must be more than a slogan—it should be a practice embedded into every stage of development. Without this, legal professionals will inevitably turn elsewhere in search of tools that better align with their needs.
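
As a minimal sketch of the kind of in-app feedback hook described above, here is one way a tool might capture a user's flag for later triage; the field names and log file are hypothetical:

```python
import json
from datetime import datetime, timezone

def record_feedback(output_id: str, rating: str, comment: str = "") -> dict:
    """Append a user's flag on an AI output to a local log for triage."""
    entry = {
        "output_id": output_id,
        "rating": rating,  # e.g., "inaccurate", "unclear", "helpful"
        "comment": comment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open("feedback_log.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

record_feedback("draft-1042", "inaccurate", "Cited case does not exist.")
```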

Conclusion: The Future of Legal Tech in the Age of AI

Integrating generative AI into legal practice is not a simple transition; it requires strategic planning, targeted training, and a deep understanding of both technology and legal processes. Success will depend on close collaboration between software developers, legal professionals, and AI experts, ensuring that AI tools are tailored to the complex needs of the legal field. A key element of this collaboration is creating robust feedback mechanisms that allow legal professionals to directly shape the evolution of AI tools. By actively listening to user input and iterating on design, legal tech companies can ensure that AI applications remain relevant and effective.

With a clear roadmap that includes user training, open feedback channels, and a commitment to continuous improvement, generative AI can transform legal practice, driving progress while preserving the profession’s core values. Legal professionals and developers should begin by identifying key areas where AI can add value and prioritize building feedback mechanisms that facilitate ongoing refinement. This approach will ensure that AI integration is not only successful but also sustainable, ultimately creating tools that truly serve the legal profession’s needs.

Have you heard the Podcast about this article? Listen to the Echoes of AI Podcast about this article on the EDRM Global Podcast Network.

Click on the image or here to go listen to the Podcast.

Ralph Losey Copyright 2024. All Rights Reserved.

