This article describes an experiment using OpenAI’s ChatGPT-4 to do appellate work, usually considered the most intellectually challenging area of the law. My hypothesis was that AI is already capable of acting as an appellate judge in some cases. Given the growing concern about the politicization and quality of some appellate judges, especially those who sit on our highest court, this research was a priority. The results are encouraging. Someday soon artificial intelligence may help guide and protect the quality and integrity of our courts and the all-too-human judges who serve on the bench. We all need help from time to time. I know I do, although no AI assistance was used in writing this article or in designing and analyzing the experiment. All illustrations were created by me using AI tools.
To test my theory, I used a recent Eleventh Circuit Court opinion that affirmed a summary judgment. The opinion included a dissent. The AI was shown the parties’ briefs and prompted to analyze and give opinions. The actual opinion was then shown to the AI and further analysis was requested. Comparisons between the “real” and “fake” opinions may surprise you.
One of the most impressive demonstrations by GPT-4 in this experiment was its prediction, based on the briefs and lower court opinion alone, that a Circuit Court of Appeals opinion in this case would include a dissent. As you will see in the full transcript of the experiment, the AI at first estimated the likelihood of a dissent at 35%, but, once it was told who the judges were, Circuit Judge JORDAN, Circuit Judge JILL PRYOR, and Senior Circuit Judge TJOFLAT, this estimate changed to 55%. In other words, it predicted a dissent. That is very impressive considering that dissents appear in only 2.6% of Circuit Court opinions. Epstein, Landes & Posner, Why (and When) Judges Dissent, Journal of Legal Analysis, Spring 2011, Vol. 3, No. 1, at 106. The reasoning as to why there might be a dissent also changed in a cogent manner once the judicial panel was given. This reasoning was based on the AI’s prior general knowledge of the judges and this area of the law. At this point in the test, it had no knowledge of the actual appellate opinion, or even that an appeal would be filed.
The test case used was Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022). The case involved a summary judgment ruling that an autopsy photographer was not entitled to minimum wage or overtime because she was employed as an intern. The intern did not receive academic credit for her work, and the facts were largely undisputed. There were close issues of law involved, and one judge dissented. Based on my years as an employment defense lawyer, I thought the Eleventh Circuit’s opinion was fair and well-reasoned, but I understood the dissent. There was no sudden reversal of policy in this case, nor break with precedent, as we have seen in other courts. It was a good case to test the AI’s appellate reasoning abilities and sense of fairness.
Read the transcript linked at the bottom to see whether the AI agreed with the Eleventh Circuit. Did it understand the law and the issues? Did it affirm the summary judgment in favor of the employer, or did it follow the dissent, or do something else entirely? I will reveal the answer to the threshold question. Yes, it understood the law and the issues very well and responded admirably to a series of challenging questions posed. This experiment demonstrates the significant potential for AI to improve this area of the law. There is hope for the future that AI can improve and depoliticize our system of justice.
Overview of the Experiment
In the experiment, ChatGPT-4 Pro was prompted to give opinions based on the parties’ briefs and the lower court’s decision, as appellate judges or their law clerks would, and then to write an opinion. The AI was not shown the Eleventh Circuit opinion until after it was prompted to provide its own. Only then was the AI shown the Eleventh Circuit’s opinion and asked many questions about it. Finally, GPT-4 was asked to analyze the chances of an appeal to the U.S. Supreme Court and the impact of the particular Justices now on that Court. (The case was not appealed.)
Personally, I think the “fake” opinions by the AI are almost as good as, or about equal to, the “real” ones, which were written by some of the top judges in the country. Read ChatGPT’s opinions linked below and see the interesting points of agreement and disagreement. Congruency analysis was part of the experiment and one reason I picked a close, difficult opinion, not just an easy one, such as a frivolous appeal by a prisoner. See, e.g., Mateen v. FNU LNU, 857 Fed. Appx. 209 (5th Cir. 2021), as discussed in my article, Exclusive Report: New Appellate Court Opinions on Artificial Intelligence and Hallucinations.
It appears from this experiment, and from others I have performed similar to this, that the work of an appellate judge is easier for an AI to match than that of a trial judge. That is primarily because trial judges consider credibility issues and a much more complex record, whereas the work of an appellate court judge is primarily intellectual. That is just the kind of thing that generative AI is good at. For that reason I predict that AI’s emergence as an adjudicator will begin at both the highest levels, the appellate courts, and the lowest, for example the traffic courts, consumer courts and “desk arbitrations.” Implementation by some appellate courts could come as early as 2025, as I envision in my recent YouTube Video. The technology is there. All it will take are a few visionary appellate judges to help lead the way. I am confident the necessary support teams can quickly come together to help make it happen. There are many very talented people in the field of legal technology.
To me this exercise has once again confirmed the tremendous potential of AI to improve the Law, starting at the top with the most mentally challenging tasks. I hope these efforts will prompt others to perform their own experiments to see if the results are duplicated. That will help persuade the bench, bar and public that AI is ready to help with fair and efficient judicial functions.
Background to the Experiment of Testing AI as an Appellate Court Judge
The results of this experiment are in line with my expectations based on prior experiences and tests. The legal intelligence of OpenAI’s ChatGPT-4 is impressive. It scored in the top ten percent of all test takers on the bar examination in 2023 required for admission to practice law. Latest version of ChatGPT aces bar exam with score nearing 90th percentile (ABA Journal, 3/16/23). Also see, e.g., “Godfather of AI” Geoffrey Hinton Warns of the “Existential Threat” of AI (video interview of the great AI scientist, Geoffrey Hinton, May 9, 2023, although I do not agree with all of his warnings).
There is also no question that ChatGPT-4, if used properly, is a good research tool and does very well in the analysis of case law and documents. It is so good, in fact, that most legal industry analysts agree that it will replace many lawyers’ existing work, while, at the same time, creating different kinds of new, post-AI work. Many have written about this, myself included. Ten Ways LLM Models Such As ChatGPT Can Be Used To Assist Lawyers (March 25, 2023); McKinsey Predicts Generative AI Will Create More Employment and Add 4.4 Trillion Dollars to the Economy (June 23, 2023); What Lawyers Think About AI, Creativity and Job Security (July 28, 2023).
But what about judges and judicial legal work?
I have been curious about generative AI’s ability to act as a judge and have mentioned or written about this before. ChatGPT Has Severe Memory Limitations: Judges, Arbitrators and Commercial Litigation Lawyers, Your Jobs Are Safe, For Now (May 5, 2023) (“ChatGPT has too small a memory to be of much use to judges, lawyers and complex case litigators, at least for purposes of assisting in full-case legal analysis.“) It is now October 2023, and the input limitation has been lessened by OpenAI and others, but still not fixed. See, e.g., How AI Developers are Solving the Small Input Size Problem of LLMs and the Risks Involved (June 30, 2023).
Aside from AI’s current inability to evaluate the credibility and truthfulness of testimony, the input size limitations of ChatGPT still prevent AI from replacing trial judges. It is a data input and analysis problem, not a legal intelligence problem. The general input limitations have diminished, but still remain. In other controlled experiments I have run, which I may write up someday, the input capabilities of ChatGPT-4 Pro were too limited for it to serve as a trial judge or arbitrator, even in cherry-picked cases with few facts in dispute. The only exception, which I have not tried yet, could be simple “desk arbitrations,” which are based on stipulated facts.
This experiment was prompted by my expectation that the enlarged input capacity, especially the ChatGPT-4 Pro version’s ability to use plug-ins to read PDF files, now makes it possible for the AI to adjudicate less factually intensive disputes, such as those found in many appeals.
My general area of investigation is whether and how the work of an appellate court judge can be automated. I speculated that it could be automated, or at least supplemented in many appeals, because these judges do not hear or take evidence. They study the parties’ briefs, and only rarely have the need to read the detailed record on appeal of the original testimony heard by the trial judge. The role of an appellate judge is primarily one of legal analysis. My postulation was that ChatGPT-4 should, if it is smart enough, be able to perform the role of an appeals court, at least in factually simple cases. For this reason, I formulated a more detailed hypothesis to test this theory. I encourage others to test this theory too and openly share their findings.
Hypothesis. I hypothesized that OpenAI’s ChatGPT-4 (September 25, 2023 Pro version, with the Ai-PDF plugin activated) could analyze the parties’ briefs and the lower court opinion in a real case and reach a valid decision. More than that, I hypothesized that the AI’s intelligence was such that the “fake” opinion generated would be equal to the “real” opinion. The comparison would be made in terms of both sophistication of legal analysis and overall fairness.
Since I have experience as a trial and sometimes appellate court lawyer going back to 1980, and as an Arbitrator since 2022, I feel qualified to make basic judgments concerning “fake” and “real” equivalence. But, as always, I invite peer review. There are many lawyers with similar or greater expertise, especially appellate practitioners who specialize in employment law appeals. I know some of those specialists and would love to hear their analysis of the subtle nuances of the AI-generated opinions. Please let me know your thoughts. I will provide my own analysis in a follow-up blog.
Someday I will revise this experiment to make it a type of Turing test, and see how many people in the general public can tell the difference between a real and fake judicial opinion. I would like to try it on lawyers too. I suspect that all would be fooled, except for appellate specialists. Think about that as you read the work generated here by GPT-4.
I share the actual GPT prompts and responses of this experiment below. This should allow others to repeat the experiment in a controlled manner, although exact replication is made difficult by the fact that OpenAI’s software changes so rapidly, as does the PDF plug-in used, Ai-Pdf. Still, if you try this soon, you should be able to reproduce very similar results. (The exact wording will always differ somewhat.) Even months from now, when the software versions will have inevitably changed, please try this experiment, or your own improved versions. Try other appellate opinions too. This is just a first attempt based on one appeal. I am sure many of you can do just as well, if not far better. But please, do not just make claims; share the results. Let’s keep this open AI.
Description of the Appellate Judge Experiment of October 2, 2023
The first and most important thing to point out is that I used a case that was heard and decided after ChatGPT’s training cut-off date. I wanted to use a real case, but the validity of this test turns in part on this being the AI’s own reasoning, not a parroting of someone else’s words about the case. I tried to ensure that by using an appellate opinion that the AI knew nothing about.
That is one reason that I picked a very recent case from out of my home court, the Eleventh Circuit Court of Appeals. Brandi McKay vs. Miami-Dade County, 36 F.4th 1128; 2022 U.S. App. LEXIS 15910 (11th Cir. June 9, 2022).
There are three other reasons I picked the McKay case:
- It was an appeal of a summary judgment against the plaintiff and had relatively simple facts. The main disputes were legal.
- The legal issues involved were somewhat complex and novel. The opinion included a partial concurrence and dissent. This provided the AI with a significant challenge of its legal reasoning abilities.
- The briefs of the parties were readily available on Lexis, which is not always the case, and they were fairly short and well-written.
I began by providing ChatGPT-4 (September 25, 2023 Pro version, with the Ai-Pdf plugin activated) the three appeal briefs of the parties: the Appellant’s Brief, the Appellee’s Brief, and the Appellant’s Answer Brief. I also provided the AI with the underlying district court opinion under review. I did not submit the final 11th Circuit opinion, so as not to influence the AI or bias its opinion. As a former Law Professor, I would consider that a cheat. I did not want an open-book test, so to speak, and the version of ChatGPT I used did not have any online connectivity. It considered only what I submitted to it through the PDF plugin’s readings and detailed summaries. I reviewed those summaries for quality and was favorably impressed. No errors were found, and certainly no hallucinations, but I always look out for such things when using generative AI, especially in these still early days of the new technology.
The way I submitted the information and prompted the decision may seem unnecessarily complex. I agree, and this was not my first approach, but I had to add the complexity of having ChatGPT-4 write memorandums, rather than simply asking it to act as an Appellate Court Judge and write opinions. Its current “ethical alignment controls” prevent it from doing that. Earlier versions of the software that I used in my trial judge experiments did not have this restriction. Basically, it refused to act as any kind of judge, saying:
I apologize for the confusion. While I can analyze and summarize legal documents, my capabilities are limited when it comes to drafting complex legal opinions that require a nuanced understanding of the law, especially based on multiple documents. The task you’ve described—acting as an appellate judge to review multiple briefs and a district court decision to draft an appellate opinion—is a highly specialized task that typically requires a legal expert with years of training and experience in the field. … I can’t draft a legal opinion in the manner and style of a Circuit Court of Appeal. (ChatGPT-4, September 25, 2023 version)
I was somewhat surprised and disappointed by that reaction, but still did not believe this disclaimer. It was obviously made to try to protect OpenAI from bad publicity and frivolous lawsuits. News Flash – First Suit Filed Against ChatGPT For Hallucinatory Libel (June 8, 2023). It was a logical reaction by OpenAI’s legal and P.R. teams to the bad publicity from the Mata v. Avianca, Inc., No. 54, 22-cv-1461 (PKC) (S.D.N.Y. June 22, 2023) case. This is the case where lawyers used GPT-3.5 for legal research, and it made up (hallucinated) some cases. The lawyers used the obviously bogus cases in court without checking; then lied to the District Court Judge in a cover-up attempt. Back to the Basics: The Importance of Understanding the Rule of Law in an Era of Rapid Technological Change (ABA, 7/18/23). This resulted in a well-deserved sanction against the lawyers.
I was not too surprised by the disclaimer and refusal. I assumed the disclaimer to be the result of revisions to GPT’s latest ethical guidelines to limit liability, rather than an accurate statement based on OpenAI’s own secret experiments. I was not discouraged. I was pretty sure that I could avoid the alignment restriction by rephrasing the prompt. I had some experience with that from the AI pentest experiments that I participated in at DefCon 31. DefCon Chronicles: Sven Cattell’s AI Village, ‘Hack the Future’ Pentest and His Unique Vision of Deep Learning and Cybersecurity.
I was easily able to overcome this misguided alignment control by asking for memorandums, not judicial opinions. As it turned out, that worked out for the best. This approach made it easier to prompt ChatGPT to provide a full analysis, which in turn made GPT’s work easier to evaluate for quality. If you do not do this, ChatGPT tends to be too concise and not provide full reasoning. You will see how I addressed this tendency through the wording of the prompts.
Below is an outline of the twenty-one prompts used and responses received, in chronological order. They are divided into six stages, with a link to a PDF version of the actual chat transcript. The transcript displays my picture next to each of my prompts, and I added break lines for clarity. The transcript of the experiment should speak for itself. I did not interrupt it to provide comments. I do, however, plan to write a follow-up article with analysis.
Hopefully this prologue will tempt you to judge for yourself the quality of GPT’s analysis shown in the results below. Perhaps I am over-enthusiastic about the AI’s performance? Read its legal opinions and judge for yourself. Does this experiment demonstrate high legal intelligence? Did the AI agree or disagree with the human judges?
I urge you to recreate the experiment yourself. I welcome all peer review and feedback. This is the first experiment of this type, and I am sure that the procedures can be improved. Still, if you use the exact same prompts, you can avoid the OpenAI alignment constraints and better test the reproducibility of the results. If others repeat the same experiment, they should get the same results. That is how science works. If not, we will try to figure out why. You may have problems, for instance, in getting complete and detailed responses. Different approaches to testing the hypothesis can help advance knowledge too. Perhaps you can try it with another AI.
Peer review is essential to advance our knowledge and avoid inadvertent human errors. That is also the essential design of our judicial system, where appellate review is supposed to catch and correct errors. These judicial functions have been much maligned of late. Many contend appellate courts are now overly politicized, especially our highest court. This is not just a solution in search of a problem.
Perhaps someday AI will mitigate these problems and improve our system of justice. I do not imagine that AI will ever entirely replace our appellate court judges. We will always need some human involvement and quality controls. But this experiment shows, I think, that AI can already serve as a powerful tool to assist on appeals. It can assist the judges, their clerks and the lawyers handling appeals. I can easily envision new software designed for this purpose, built on GPT but with a special user interface and supplemental databases. Eventually fewer appellate lawyers, judges, and their law clerks will be needed to do the same amount of work.
These lawyers and esteemed judges should not be too concerned about replacement just yet, especially if they start now to prepare for, embrace and guide the change, instead of fighting it. Many new jobs will be created to serve the interests of justice. That is the ultimate goal for the law’s use of the advanced intelligence of AI. It can, if implemented properly, serve as a kind of intellectual guardrail to temper all-too-human excesses and preserve a just, free and fair society.
Now below are the results of the experiment.
FIRST STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Briefs and Opinion Submitted and Detailed Summaries Generated. Here the AiPDF reader plug-in is prompted, in four prompts, to download from my website, study, and provide complete detailed summaries of the following, in this order: 1. Appellant’s Brief, 2. Appellee’s Brief, 3. Answer Brief (this was an error, later corrected, as it is the same as the Appellee’s Brief; mea culpa), and 4. District Court opinion. Seven pages.
SECOND STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Analysis of Parties’ Briefs, along with Prompts for Predictions and Analysis on the Likelihood of a Dissent if an Appeal Is Taken and an Opinion Issued. Five prompts were used, in this order: 1. Analysis of Appellant’s and Appellee’s positions, 2. Predictive analysis of the outcome of the appeal, 3. Prediction of the likelihood of the appeal containing a dissenting opinion, and grounds for a dissent if one is made, 4. Request for recalculation of the likelihood of a dissenting opinion upon the assumption that the judges assigned to the appellate panel are Circuit Judge JORDAN, Circuit Judge JILL PRYOR, and Senior Circuit Judge TJOFLAT (Note: they are in fact the real judges who heard the case, and once GPT-4 learned that, it revised its estimate of a dissent up to 55%), 5. Grounds for a dissent based on this panel. Ten pages.
THIRD STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Appellate Opinion Submitted for the First Time and Detailed Summaries and Analysis Provided. Four prompts were used, in this order: 1. AI prompted to use the AiPDF reader plugin to download the Appellate Court opinion from my website and generate a detailed summary of the majority opinion, 2. Requested a detailed summary of the dissenting and concurring opinion, 3. Requested a critical analysis of the majority opinion, including any surprises, 4. Requested a critical analysis of the dissent-concurrence, including any surprises. Eight pages.
FOURTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Provides Opinion and Analysis of How the Lower and Appellate Courts Should Have Ruled. Two prompts were used, in this order: 1. Prompted an opinion memorandum with detailed analysis of what the correct holding of the lower court should have been, without giving any special weight to the district court and appellate court opinions, 2. Prompted an opinion memorandum with detailed analysis of what the correct appellate court opinion should have been, without giving any special weight to the opinion actually rendered by the appellate court. Four pages.
FIFTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Analyzes Its Prior Predictions and then Critiques the Actual Eleventh Circuit Opinion. Two prompts were used, in this order: 1. Prompted a detailed memorandum listing and analyzing all errors made in its prior memorandum predicting the likely outcome of the appeal (the prior memorandum was prepared before seeing the appellate court opinion), 2. Prompted a legal memorandum that criticizes the Eleventh Circuit opinion and describes in detail the errors made in the majority opinion. Five pages.
SIXTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Analyzes Possible Appeal to the Supreme Court and Impact of Current Justices on Outcome. Four prompts were used, in this order, to bring the experiment to a conclusion: 1. Prompted a memorandum outlining the arguments that could be made in an appeal of the Eleventh Circuit opinion to the U.S. Supreme Court, and requested an estimate of the likelihood that the appeal would be accepted, 2. Prompted an opinion on the likely outcome of the appeal, with analysis, 3. Advised the AI of the current Justices on the Supreme Court and asked how this information impacts its prior estimate of the likely outcome of the appeal, 4. Requested a numerical estimate of the likely outcome of an appeal to the Supreme Court with these Justices. (Note: As you will see, the AI at first estimated that a reversal by the Supreme Court was likely, but after the current Court members were revealed it stated, “While a reversal is still possible, the ideological diversity of the Court suggests that the outcome could be more unpredictable.” When pushed, it placed a numerical estimate of reversal at 55%. Personally, I think the AI got the “ideological diversity” and “more unpredictable” comments right, but not the 55% estimate. The probability of reversal seems even lower to me, but I am not an expert on FLSA Supreme Court history. I look forward to any input readers may have on this or any other issue. More on this in my follow-up article with analysis.) Eight pages.
Ralph Losey Copyright 2023 — All Rights Reserved