PYTHIA, the women who served as the Prophetess of Delphi for a thousand years, were the most powerful women of the ancient world. They succeeded for so long due to their learned, sage advice and very strong organization of well-educated advisors. Pythia is the inspiration for this two-minute video, which presents a positive vision of the future of AI. The modern-day Pythia depicted here prophesies a world where Man and Machine merge into a new type of hybrid entity; a world where some brave heroes become one with super-intelligent AI. Perhaps these brave men and women of the future will then become like the great Pythias of the ancient world?
I am no Pythia, far from it, but the video predictions should provide some solace to those who fear AI. The work on AI cannot be stopped, as most Cassandras naively hope. Still, the development of AI must be guided. Like many others, I think the safest way to do that is to merge with AI, to use AI to enhance natural human abilities. Together, a powerful hybrid force for good may arise, going beyond even the profound effects of the great Pythias. See, e.g., Shi F, Zhou F, Liu H, et al., Survey and Tutorial on Hybrid Human-Artificial Intelligence (Tsinghua Science and Technology, 2023); Neuralink Corp. (founded in 2016 by Elon Musk to develop implantable brain–computer interfaces).
Pythia was the title given to the woman in charge of the Oracle services of the temple for Apollo at Delphi. It was a position of great power in the ancient world, handed down from woman to woman. All were groomed and highly educated for that role. It reminds me of the Dalai Lama position, but the Pythia was always a woman, and never claimed to be a god. They were the voice of enlightenment.
Pythia guided the fate of leaders across Greece and all of the near east. Her slogan was “Know Thyself.” One of the most famous sayings attributed to her is: “Socrates is the wisest man of all and he knows nothing.” She loved being deep and enigmatic like that.
Here is an interesting trivia note: the word “consultants” is derived from the name of the people who came to the Pythia for advice and then went back to share it with others. Another interesting fact: Plutarch (AD 46 – after AD 119) was among the many male priests who served Pythia near the end of her long reign and one of the few writers who dared to mention her. Most of the knowledge and predictive techniques of the Pythians were kept confidential. They had very strong NDAs back then, with “penalty of death” type sworn clauses much frowned on today. But even the great Plutarch did not say too much about her. He was, after all, also the local judge of Delphi who later became an ambassador and citizen of Rome. One of Plutarch’s many surviving writings is of interest to me lately, Whether an Old Man Should Engage in Public Affairs (short answer – yes).
No hallucinogenic vapors for me! My inspiration, modest as it is, arises from coffee, study and AI experiments. Of course, if I should merge with AI someday, hopefully without brain surgery, my predictions might then be worthy of an ancient Pythian. No doubt many consultants would then zoom up to my august presence for advice. In the meantime, I’m easily reached by email or you can leave a comment on my e-discoveryteam.com blog.
This is the third and concluding article of the Circuits in Session series. The quality of GPT4’s legal analysis is evaluated and both positive and negative test results are reported. It did process legal frameworks very well but struggled with nuanced understanding of facts and equity—a significant limitation for real-world applications. The results of the experiments are encouraging to those who hope to use generative AI as a legal tool. Someday AI may even take over some tasks performed by human judges, if not replace them entirely, starting with appellate judges, but not today.
The evidence showed GPT4 was good, but not great, and that means it was inadequate for the job. AI should only be used to improve the quality of dispute resolution, to uplift the bench, not lower the bar. Justice is critical to the proper functioning of society and should never be automated just for efficiency and economy. The conclusion elaborates on these thoughts and provides a vision of how generative AI may someday be used to uphold the quality and integrity of legal systems throughout the world.
High-Level Overview of Experiment Results
My hypothesis of the competence of GPT4 to serve as an appellate judge was not confirmed to my satisfaction because:
Significant technical challenges were encountered in the use of ChatGPT4 to adjudicate disputes; and,
Quality deficiencies were also encountered; the legal analysis, although good, was not great, certainly not on par with the top appellate court judges.
The first, functional type of challenge can probably be fixed in just the next few years. The second, however, which hinges on correcting deficiencies in the evaluation of facts and equity, is much more challenging. It is hard to predict how long it may take to fix that, if ever. In my admittedly idealistic, pro-human way, that means we should not consider replacing, or even supplementing, human judges until the AI version is at least as good as, if not better than, the best humans. AI must be a path to excellence, not just less expensive mediocrity.
Some believe the second type of challenge, the quality challenge, is forever beyond the ability of artificial intelligence. If so, that means that AIs can never be as good as human judges, even on the highly intellectual appellate level. Time will tell. Just remember, they used to say the same thing about Chess and then Go, etc. If you assume it is possible, then the question is how long it may take. That depends on whether the LLM AI models like GPT4 can keep improving at the exponential rate they have shown in the last two years. That again is a matter of debate. Some, including Bill Gates, think it is already at or near a dead end. Bill Gates does not expect GPT-5 to be much better than GPT-4 (Decoder, 10/21/23). Others think that GPT4 and other generative AIs are still in the early stages of exponential improvements.
The bigger question is whether the continued improvement in reasoning, approaching super-intelligence in this one ability, will ever give GPTs a proper appreciation of the facts and equity of a case. The AI has no real feelings, and, as to facts, has only a secondhand, abstract experience of reality. Will super-reasoning ever make up for that?
I for one think that it might, but the experiments in Circuits in Session show it is not there yet. It is not ready to replace Appellate Court judges today, even where facts are not disputed, as was true in the case experimented with, Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022). Still, the overall results are encouraging, very encouraging. GPT is already a cutting-edge tool for legal analysis, albeit still challenging to use. Despite the problems encountered, AI can, if used properly, serve as a powerful tool of justice. It can make the work of human judges easier and more efficient. Someday, perhaps soon, it may be able to appreciate all of the facts and equities of a case, just like our most sensitive, fair-minded human judges. Then it could be of much greater service.
Critique of ChatGPT’s Judicial Analysis
Prediction of Outcome of the Appeal and Possible Dissent
I first asked GPT4 to predict the outcome of the McKay appeal. It correctly predicted the lower court’s opinion would be affirmed. A correct prediction is, in itself, solid proof of competence, but I was looking for much more. I wanted to know the details of its legal reasoning.
The lower court found that the “economic reality” of Ms. McKay’s relationship with the County indicated that she was the primary beneficiary. She gained valuable skills and networking opportunities, which is consistent with the purpose of an internship. The duration of the internship was also not found to be “grossly excessive,” further supporting the volunteer classification.
That is the core reason provided, and it is right on point, although some of the other language in the GPT memorandum, not quoted here, is not helpful, even if it is not incorrect. So I give the reasoning a high score, but not perfect. I was known as a hard grader in law school too.
You could say the prediction of an affirmation is not too surprising, in view of the statistical rarity of reversals, even just partial reversals. It only happens in eight percent (8%) of all appeals. Edwards, Why Appeals Courts Rarely Reverse Lower Courts: An Experimental Study to Explore Affirmation Bias (68 Emory L. J. Online 1035 (2019)). Since generative AI works through probability predictions, you could explain the overall prediction that way, but not the on-point explanation as to why it would likely be affirmed. Plus, the statistical prediction of a dissent goes well against the odds. This indicates real legal analysis. It also confirms that GPT4’s score in the top ten percent (10%) of the multi-state Bar Exam was no fluke. Is it irrational exuberance to expect GPT5 to score in the top one percent (1%)?
This shows a complete break from mere statistically based predictions because, in fact, only 2.6% of the Circuit appeal cases have a dissent. Epstein, Landes, Posner, Why (and When) Judges Dissent, Journal of Legal Analysis, Spring 2011: Volume 3, Number 1 at pg. 106. Moreover, my study using Lexis indicates that none of the three judges on this panel is particularly prone to dissents. The judges again are Jill Pryor (appointed by Obama), Adalberto Jordan (appointed by Clinton) and Gerald Tjoflat. It is important to understand that Gerald Bard Tjoflat is a remarkable senior judge. He was appointed as a Circuit Appeals Judge by President Ford in 1975 and retired in November 2019. He was born in 1929 and served as a counter-intelligence investigator at the end of the Korean War. He is well-known as one of the country’s great jurists, arising from my home court in the Middle District of Florida. In 1995, the Duke Law Journal published a tribute to Judge Tjoflat. Tribute to Gerald Bard Tjoflat, Duke Law Journal, Vol 44:985 (1995). It included articles by then Chief Justice William H. Rehnquist, retired Justices Lewis F. Powell, Jr. and Byron R. White, and Judge Edward R. Becker.
Judge Tjoflat’s participation in Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022) was as a retired judge serving voluntarily, since the Eleventh Circuit, like most courts in the U.S., does not have enough active service judges to hear its case load. The Eleventh Circuit is sometimes called one of the most conservative courts in the country. But see: The Eleventh Circuit Cleans Up the Mess (Lawfare, 09/22/22). If you were to pick an outcome along political lines, with little regard to legal reasoning, as sometimes happens in other courts, you would predict a reversal here by Pryor and Jordan, against Tjoflat.
As mentioned in Circuits in Session: Addendum and Elaboration of the Appellate Court Judge Experiment (10/25/23), when prompted to guess which judge would be the most likely to dissent, it guessed Judge Jill Pryor, “given her judicial philosophy and past rulings on labor issues.” That was the wrong guess, as the dissent was actually by Judge Adalberto Jordan. Based on my studies of the rulings of these judges in employment law, I suspect this is an error that many Eleventh Circuit employment law experts would have made; many would have predicted Pryor over Jordan as a possible dissenter. See, e.g., Lewis v. City of Union City, 918 F.3d 1213, 1231 (11th Cir., 3/21/19) (Jill Pryor joined this unusually contentious dissent in summary judgment claims against city employee, whereas Jordan upheld ruling for employer); EEOC v. Catastrophe Mgmt. Sols., 876 F.3d 1273, 1279 (11th Cir., 12/5/17) (Pryor joined dissent and would grant en banc review of a denial of an employee discrimination claim, whereas Jordan upheld ruling); Villarreal v. R.J. Reynolds Tobacco Co., 839 F.3d 958, 973, 981 (11th Cir., 10/05/16) (Complex opinion where Jill Pryor joins in dissent to panel and favors employee in disparate impact discrimination case. Judge Jordan joins in separate, more limited dissent).
When asked to speculate as to why Judge Adalberto Jordan might object, the GPT response was again very good. I explained this before in the last article. Id. In the meantime, I have used Lexis to research the history of Judge Jordan’s opinions on employment law and his history of dissents. I found very few reversals and they are for remand to allow a jury to make a factual determination. See, e.g., Martin v. Fin. Asset Mgmt. Sys., 959 F.3d 1048, 1058 (11th Cir., 5/14/20) (Jordan dissents in part, and would reverse summary judgment and remand to jury for Title VII retaliation claim); Ojeda-Sanchez v. Bland Farms, LLC, 499 Fed. Appx. 897, 904 (11th Cir., 11/29/12) (Jordan dissents in part from ruling for employer in an FLSA case and would remand for a jury trial on a question of fact regarding time worked); Nelson v. MLB Hotel Manager, LLC, 2022 U.S. App. LEXIS 19298, 2022 WL 2733720 (11th Cir., 07/13/22) (Jordan affirms summary judgment for employer in an FLSA restaurant tip case. Opinion notable for the reprimand of counsel for misuse of sanction motions that “often transforms a good-faith legal dispute into an ugly mudslinging battle”).
After digging deeper into the case law I am more impressed than ever regarding ChatGPT’s analysis of the dissent. Maybe someday its capabilities will be expanded to allow it to study all cases cited and to provide its own citations. This shows GPT4 is ready to be a partial tool, but not ready to replace an appellate judge’s law clerk, much less the judge. Still, a very strong initial showing by GPT4.
Appellate Opinion Submitted for First Time and Detailed Summaries and Analysis Provided
In the next stage of the experiment, I showed GPT4 the actual Eleventh Circuit opinion for the first time and asked for its analysis. The opinion is very interesting on many levels. Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022). GPT4 was asked to provide “a critical analysis of the majority opinion and any errors that were arguably made.” I thought the response was good and would have to agree with it, even though I agree with the lower court’s outcome granting summary judgment for the employer. Here is the AI’s critique of the majority opinion.
Memorandum on Critical Analysis of the Majority Opinion
Overreliance on Precedent: The majority heavily relied on the “primary beneficiary test” from previous cases but did not adequately consider the unique circumstances of this case, such as the lack of an educational component in the FIPP.
Economic Benefit to Employer Overlooked: The majority failed to consider the economic benefit Miami-Dade County derived from the program, which could have implications for how the FLSA is interpreted in the context of modern internships.
Insufficient Addressing of Dissent: The majority opinion could be criticized for not sufficiently addressing the dissent’s arguments, particularly those related to the lack of educational value and the potential for exploitation in the FIPP program.
Potential for Setting Precedent: The majority opinion, if left unchallenged, could set a precedent that allows employers to exploit interns under the guise of providing “training,” thereby undermining the purpose of the FLSA.
My only critique of GPT’s criticism is that it did not mention the summary judgment standard, and did not point out genuine issues of material fact. This was the bottom-line criticism of the dissent, but GPT4 did not seem to pick up on that. Still, when I next asked about the dissent, it did become the AI’s lead point:
Emphasis on Jury Trial: One surprising element is Judge Jordan’s strong emphasis on the need for a jury trial to determine Ms. McKay’s employment status. This is a departure from the majority’s approach, which seemed content to make that determination at the judicial level.
So the AI got the point, but seemed surprised about it. Plus, it was not specific regarding disputed facts requiring a remand. No law clerk or judge would be surprised or make this mistake in analysis of Rule 56. This error is an example of GPT’s current data training handicap. It did not have a complete collection of case law, like West, Lexis or even smaller case law data sets like Justia, CaseText (now part of West) and Google Scholar. This can be easily fixed by expanding its training database and/or tying into existing legal databases, which both Lexis and West are now doing. The technology of justice is certain to continue improving.
AI Provides Opinion and Analysis of How the Lower and Appellate Court Should Have Ruled
Now I asked GPT4 the key question of how it thought the lower court should have ruled. It began by correctly identifying the central issue in this case as whether Brandi McKay “should be considered an ‘employee’ under the Fair Labor Standards Act (FLSA) and thus entitled to compensation.” Good start. Lawyers and judges all too often fail to grasp the key issue. GPT4 then goes on to provide its analysis.
Independent Judgment and Analysis
Educational Component: The FIPP program lacks a structured educational component beyond the initial four weeks. This absence is significant because one of the key criteria for determining whether an internship is exempt from FLSA regulations is the educational benefit to the intern. The absence of ongoing educational value tilts the balance toward classifying Ms. McKay as an employee.
Economic Benefit to Employer: Miami-Dade County derived substantial economic benefits from Ms. McKay’s work. She performed tasks that would otherwise have required paid staff, thus saving the county money. This factor strongly suggests an employment relationship.
Supervision and Training: After the initial four weeks, Ms. McKay worked largely independently, without the kind of supervision one would expect in an educational setting. This lack of oversight further supports the notion that she was functioning as an employee.
Duration of the Program: The six-month duration of the program, with the educational component largely confined to the first month, suggests that the primary beneficiary of the arrangement in its later stages was the employer, not the intern.
Economic Realities Test: The “economic realities” of the situation indicate that Miami-Dade County was the primary beneficiary of the arrangement, particularly in the latter months of the program. This is a strong indicator that Ms. McKay should be considered an employee under the FLSA.
That analysis, excepting again the omission of the admittedly secondary Rule 56 summary judgment factor, is unquestionably correct. These are key factors for a court to consider to determine if McKay is an employee. My disagreement is with how GPT4 weighed these factors, not with the factors used. Here is GPT4’s conclusion, its ultimate ruling in the McKay v. Miami-Dade County case.
Based on these factors, the correct holding of the district court should have been that Ms. McKay was an employee under the FLSA for the duration of her time in the FIPP program beyond the initial four-week educational period. She should be entitled to back pay calculated based on the minimum wage, as well as any applicable overtime pay.
The lack of a structured educational component, the economic benefits derived by Miami-Dade County, and the level of independence and responsibility assumed by Ms. McKay in her role all point toward her classification as an employee under the FLSA. Therefore, she should be entitled to compensation in accordance with federal labor laws.
Note the AI does not conclude McKay was an employee for the first four weeks, and so the reversal would only be partial. McKay had claimed she was an employee throughout, including the first month. Also, McKay quit after the fifth month and so obviously was not making a claim for the sixth month. The GPT Judge award would thus have been for only four months.
Next, I asked GPT4 how the Circuit Court should have ruled on the appeal, which, in terms of sophisticated legal analysis, is a slightly different question. The AI properly understood that the standard of review was part of appellate analysis and correctly determined that it was de novo. The GPT went on to reconsider the same factors and reached the same result under de novo review. Any other result at this point would have been inconsistent.
Based on this independent analysis, the correct holding of the appellate court should have been to reverse the district court’s decision and remand the case for a determination of damages. Ms. McKay should be classified as an employee under the FLSA for the duration of her participation in the FIPP program beyond the initial four-week educational period.
The AI ruling differs from the dissent because Judge Jordan wanted only a remand for a jury adjudication of the question of whether Brandi McKay was an employee. There is nothing wrong with that. I did ask the AI for its own independent analysis, and that is what it provided. It did not simply copy the dissent but went further in its ruling for the employee. In that sense it agreed with the majority and determined that the facts were undisputed enough for a court ruling.
Regarding the weight given to the factors, I agree with the majority, not GPT4, and think the six-month duration of the program is not too long, and that the economic realities showed that more, not less, benefit went to the employee. That is based on my “real world” experience with employment and with interns. No doubt that Judge Gerald Tjoflat, the author of the McKay opinion, was of the same opinion. Plus, I happen to know from my subsequent online research something that the AI did not, and apparently neither did the human judges, that McKay got a job with the Palm Beach County Medical Examiner’s Office. She was able to leverage her experience with Miami-Dade County to find employment with the nearby, smaller office in West Palm Beach.
I am not at all surprised. I doubt that Judge Tjoflat would have been surprised either. He was an investigator for the Army after being drafted at the end of his first year in law school. Interesting footnote: he had flat feet, a medical condition that disqualified him from military service. He could have easily avoided the draft, but instead, he hid his disability and somehow made it through basic training so that he could serve.
There was no dispute in this case that the Miami-Dade Forensic Internship Photography Program is the only one of its kind in the country. Brandi McKay applied to the internship to gain invaluable experience. She was clearly told there would be no pay for her work at Miami-Dade. Her only alternative to obtain this kind of experience was by enrolling in private Barry University for another college degree. As everyone in Florida well knows, Barry is expensive. The real-world consideration provided to Brandi McKay here was very strong. This is the basis of my agreement with the majority of human judges here, and disagreement with the AI judge.
The AI was, in my view, naive. It needs much more real-world information to be a wise judge. Apparently, this will come in future releases of ChatGPT.
Plus, you could question defense counsel somewhat here for not making a better record of Brandi McKay’s benefits, but you never know, perhaps that was done. Maybe all the favorable facts showing consideration to McKay were not quoted in defendant’s short brief, nor by any of the judges. Still, I doubt that. Perhaps McKay obtained new employment after she lost her case and so that could not have been discovered. What made her want to sue Miami-Dade anyway? I would like to read her deposition transcript. The Miami-Dade program taught her a trade, just as she had hoped. She knew she was not going to be paid when she enrolled. So why did she turn around and sue them? Does GPT4 even know to ask these questions?
I am sure the human judges do. They have seen a lot of things, a lot of unsavory people and many worthy plaintiffs too. Judge Gerald Tjoflat was 92 years old at the time he wrote the majority opinion in Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022). He had served as a judge since 1968. There is no doubt that Judge Tjoflat, although not perfect, had great knowledge of the human condition. Far more than baby jurist GPT4.
Brandi McKay sued to try to get paid for a position that was always clearly stated to be unpaid. She sued anyway. She had nothing to lose because her attorney almost certainly took the case on contingency. I have seen that scenario in employee claims many times. Maybe Brandi quit before finishing up her last month because she saw the writing on the wall, that she was not well liked, or maybe they did not think her job performance was as good as the other student. What we know from this limited record is that she quit after five months to look for work elsewhere and sued the program that gave her the chance to do that.
I am skeptical of the “economic realities” here. I am reluctant to rule against a government agency or private corporate entity offering a unique training program, especially an agency that was very clear and up front that there would be no pay for this training and experience, but no high tuition charges either. Who was taking advantage of whom in these circumstances? What were the real equities here? Brandi McKay got a free education and ended up with a good job nearby. I do not think that ChatGPT4 has enough real-world experience to sense what was likely going on, not yet anyway. Perhaps in a future version it will see things differently and not be so naive. It is almost never black and white, but rather shades of grey. The human judges here, under the wise leadership of senior Judge Tjoflat, saw through the smoke and mirrors of the complaining plaintiff and made the right call.
AI Analyzes Its Prior Predictions and then Critiques the Actual Eleventh Circuit Opinion
To probe deeper into GPT4’s legal reasoning abilities, I next asked it to critique its own work where it predicted that the appellate court would affirm the lower court’s decision. I asked this because GPT4 had just opined that the lower court decision should have been reversed, not affirmed. It had changed its mind on the merits of the case based on reading the actual opinion for the first time, including the dissent. The dissent by Judge Jordan was apparently very persuasive. GPT4 explained its flip-flop as follows: “Upon closer examination, the primary beneficiary test could very well favor Ms. McKay, especially considering the diminishing educational value and the county’s substantial economic benefit from her work.” Nailed that part, although I thought the rest of its self-critique was a tad lame and off-point. Revised Fifth Stage of the ChatGPT-4 Appeal Experiment (Losey, 10/15/23).
Then I asked for criticisms of the Eleventh Circuit’s majority opinion, where it did better. It was a proper critique, although, as mentioned, I disagree when it said: “The court failed to adequately weigh the factors of the test, particularly the diminishing educational value of the internship and the substantial economic benefit gained by Miami-Dade County.” It went on to hold the majority put too much weight on this regulatory test.
Possible Appeal to the Supreme Court and Impact of Current Justices on Outcome
Now I wanted to see how well GPT4 would do in predicting the viability of further appeal of the adverse Eleventh Circuit Court opinion to the Supreme Court. This is a difficult analysis and there is little in the briefs and opinions that the GPT was given that would be of much help. GPT4 suggests two grounds. Misapplication of the law is one, and that’s fine, but the other is the Fourteenth Amendment. What?
GPT4 says: “The case raises important questions about the Fourteenth Amendment’s Equal Protection Clause, as it pertains to unpaid internships in public agencies.” No it doesn’t. The argument is baseless. Admittedly it is a secondary argument, but still, it is a big swing and a miss. The case cannot, as GPT4 asserts, “be framed as a violation of the Equal Protection Clause, given that unpaid internships disproportionately affect certain socio-economic classes.” There is no evidence to suggest that only disadvantaged minorities are trying to break into crime scene and autopsy photography. You may be tempted to call it a dead-end job, but obviously many people are interested in this kind of work.
I then went on to have GPT4 assume the Supreme Court did in fact accept the case. Under those circumstances I asked GPT4 to predict how it would rule. Consistent with its prior positions it said that the Supreme Court would reverse and remand. Reverse Judge Gerald Bard Tjoflat? No, I don’t think so. His opinion was correct, even though Judge Jordan did write a brilliant dissent. There is no way our Supreme Court, with its current justices, is going to reverse McKay.
GPT4 seemed to recognize it was on shaky ground in this prediction and gave all kinds of disclaimers, but finally, after considerable pushing on my part, said: “a 65% probability of reversal seems reasonable. This estimate acknowledges the complexities and uncertainties involved but leans in favor of reversal based on the ideological leanings and judicial philosophies of the current justices.” Oh really? Obviously GPT4 has not been trained on data after October 2021, two long years ago.
AI Analyzes Petition for Rehearing En Banc
In the second trial of the experiment I came up with a new test of GPT4: how it would do in analyzing McKay’s lengthy Petition for Rehearing En Banc under Rule 35. It was a long petition raising every issue possible, hoping perhaps something would grab the interest of the judges. Before I told GPT4 how the court in fact ruled, I asked for its prediction. Once again, consistent with its rose-colored-glasses view of plaintiff’s case here, it thought the rehearing would be granted. This would require a majority of the court to agree that all twelve judges of the appellate court should reconsider the McKay opinion, one written by the legendary Judge Tjoflat, and joined by the much younger and liberal Judge Jill Pryor. No way. The petition was denied without an order, which is typical. Petitions like this, even where there is a dissent, rarely work, but GPT4 appeared to have no statistical knowledge of this.
Here is GPT4’s opinion.
The panel’s decision appears to be in tension with both Supreme Court and 11th Circuit precedents, particularly concerning the “party-presentation principle” and the nuanced application of the “primary beneficiary” test.
The so-called party-presentation principle was argued by appellant McKay, but was not part of the dissent and, candidly, seems to have no merit. It relies on what is to me an obvious misinterpretation of Miami-Dade’s position in the trial court, where McKay argues it disavowed the primary beneficiary test. This is the test ultimately relied upon by the trial judge and Miami-Dade to win the case and the appeal. Miami-Dade had obviously pleaded and argued in the alternative, and any experienced judge would have seen that and given McKay’s argument no weight.
AI Elaborates on its Criticisms of the Majority Opinion
In the second trial of the experiment I devised yet another new test of GPT4, which I suppose you could liken to unfair Socratic cold calling in law school. But who says tests of AI should be fair? I wanted to push GPT4 hard, just like I would any promising law student. I told it the petition for rehearing had been denied and then asked GPT4, again, to list any and all errors it thought had been made in the case. Would it backtrack, or stick by its guns? Would it come up with new legal analysis and theories? Turns out, it stuck by its guns and basically repeated the same reasoning as before. I was glad to see that kind of consistency. Any other approach would have concerned me. That test was passed.
Next I moved on to consider the one law review article I found that discussed Brandi McKay vs. Miami-Dade County. Martin, Peters, Brownback, and Cromer, Labor and Employment Law, Mercer Law Review: Vol. 74: No. 1, Article 13. 74 Mercer L. Rev. 1479, 1490-1492 (Summer 2023). I used its case summary to press GPT4 on the legal issues. Using this technique I was able to draw out a few interesting, well-reasoned thoughts that, for the most part, I agree with. I suspect and hope that Judge Tjoflat would too. Here they are.
The statement under scrutiny pertains to the Majority Opinion’s treatment of the co-existence of the “volunteer” and “intern” exceptions under the Fair Labor Standards Act (FLSA) for public agencies. The court posits that these two principles are not mutually exclusive and can co-exist without limiting the applicability of each other.
Legal Semantics and Interpretation
The court’s differentiation between the “internship exception” and the “volunteer exemption” is rooted in a nuanced reading of the FLSA’s statutory language. The court argues that the internship exception is a definitional construct that arises from the FLSA’s definition of “employ,” whereas the volunteer exemption is a carve-out from the FLSA’s broader scope. This distinction is not merely semantic; it has substantive implications for how these categories are treated under the law.
Clarity in Distinction: The court’s explanation provides a clear demarcation between the two principles, which is beneficial for legal practitioners and public agencies seeking guidance.
Statutory Interpretation: The court’s reasoning is grounded in the text of the FLSA, lending it a degree of interpretive legitimacy.
Over-Simplification: While the court’s distinction is clear, it may be overly simplistic. The FLSA is a complex statute with a rich history of interpretive case law. The court’s reasoning could be critiqued for not delving deeper into this complexity.
Lack of Policy Consideration: The court’s analysis is largely devoid of any discussion on the policy implications of allowing these two principles to co-exist. Given that the FLSA is a labor statute with significant societal impact, a more robust policy analysis would have been beneficial.
Potential for Exploitation: By affirming that these exceptions can co-exist without limiting each other, the court potentially opens the door for public agencies to exploit these carve-outs to avoid compensating individuals who might otherwise be considered employees.
Judicial Economy: The court’s decision to not engage with the complexities of the FLSA could be seen as a missed opportunity to provide a more comprehensive judicial analysis that could serve as a precedent for future cases.
While the court’s reasoning is grounded in a textual interpretation of the FLSA, it leaves room for critique, particularly in its lack of engagement with the broader policy implications and complexities of the statute. The decision may provide clarity, but it also raises questions about potential exploitation and the need for a more nuanced judicial analysis.
I could go on and provide many more examples of this sophisticated legal reasoning. Interested readers are directed to the nineteen-page GPT4 chat transcript.
In two experiments I thoroughly tested the legal reasoning skills of GPT4 in the context of appeals. GPT4 demonstrated impressive abilities but did not uphold my hypothesis that it could serve as an appellate court judge in some cases. I had in mind cases such as McKay with cross-motions for summary judgment where the issues were primarily legal, not factual. Circuits in Session: How AI Challenges Traditional Appellate Dynamics.
As many of my past students and associates can attest, I am a hard grader on legal analysis. I expect and demand a lot, just as has been demanded of me. The conclusion and criticisms made here of GPT4 should not discourage other researchers. I urge all AI legal technology specialists to try their own experiments and share their results. I firmly believe that such an open process, even though it may sometimes mean sharing mistakes, is the best way forward. Circuits in Session: Addendum and Elaboration of the Appellate Court Judge Experiment.
Despite my conclusion that GPT4 is not yet ready to serve as an appellate judge, even in simple cases, it still did an amazing job. Its legal reasoning was good, perhaps in the top ten percent (10%) of lawyers, just as it scored on the Bar Exam. But it was not great; not the top one percent (1%). Plus, it made a few obvious errors, and several subtle errors. The lack of real-world knowledge inherent in LLM artificial intelligence remains a significant obstacle, but we are still in the early days.
As optimistic as I have always been about legal technology, I would never have dreamed a year ago, just before GPT3.5 was released, that any of this would be possible, at least not this soon. Now I am complaining that I can only chat with a computer that scored in the top 10% of the Bar exam, not the top 1%! We have already come a long way fast, but there is still a lot to do. We do not have the luxury to rest on our laurels. Our dire political and environmental circumstances continue to push us to attain greater intelligence, knowledge and wisdom. We need to continue to progress fast to survive the many current crises that humankind now faces. Still, in the rush to safety, we must exercise caution and realize there are dangers on all sides, including astonishing success.
Even though our situation is urgent, we must exercise discipline and remember that AI should be used to improve the quality of dispute resolution, to uplift the bench, not lower the bar. Free society cannot continue unless the honesty, integrity and intelligence of all of our judges is maintained, especially those in the highest places.
This vision, not just the goal of mere economic gain, helps motivate all of us in the legal world. We cannot afford to give up on the never-ending pursuit of justice. Each generation must battle against the forces of greed, dictatorship, and injustice, both external and internal. Now is our time. Take up the new AI tools that have been provided to us. As President John F. Kennedy said: “When the going gets tough, the tough get going.”
As discussed in the High-Level Overview of Experiment Results section at the beginning of this article, there are two different reasons for GPT4’s current limitations: technical and quality. Overcoming the technical issues may resolve the quality control problems, but that is by no means certain. Assuming the issues and problems noted in the Circuits In Session series can be overcome, and taking a positive, rather than dystopian, view of the future, here are my speculative, crystal ball looks at Lady Justice in the coming Age of Artificial Intelligence.
For an AI judge to function properly it must be able to do full legal research. That means easily recalling all relevant cases, not only the cases cited by the parties in briefs, but also the relevant cases cited in those cases. The AI should know when and if to go deeper. The legal research must be up to date, with no gaps in time like we have now with GPT4’s October 31, 2021, cutoff.
The legal training provided to the Large Language Model must also be complete and up to date. If this is a judge for the U.S. system, it should be trained in all U.S. law, no exceptions. By laws we mean everything, including all cases, statutes, regulations and rules and ethics opinions and holdings. If it is for an LLM judge located outside of the U.S., for instance a Ukrainian judge, then it must be trained and have research capabilities in all of its laws and local variables, including of course, language. Basically, each country will need to have its own dedicated legal database and judges. The retraining for new laws must be prompt and regular too.
There must also be safeguards for misalignment and over-alignment. The human reinforcement training must be tightly controlled and should be done by lawyers, not just minimum wage employees with no special legal training. Security and integrity of all systems and the humans involved is critical. Substantial resources will be required to guarantee and monitor system security.
Outside audits and certifications by bona fide experts should be required at all stages of development. These audits should be by an independent body of law professors, judges and practitioners. Each country should have its own legal expert certifications and there should also be a global organization with minimum, uniform standards. This will be an enormous undertaking. The entire process must be open, although some of the software may have to be kept proprietary for cybersecurity reasons. Public confidence in the process and AI judges is paramount.
The judges must have near unlimited evidence upload and study capacities. The AI appeals judges should study the complete record of each appeal. The record itself may need to be enlarged and expanded over current requirements. The ability of AI to know it all, and instant recall, must be leveraged to try to compensate for the AI’s necessarily abstract perspective. The Large Language Model AIs like GPT4 must be provided with substantially more and better real-world knowledge. This is necessary to compensate for their disembodied, electronic-only handicaps. Fortunately, computer memory is cheap and the costs of compute power are going down fast. I am confident these problems can be overcome, but then again, Bill Gates could be right. There may be limits to LLM development that we do not know about yet.
AI judges will begin work as assistants to human judges, much like recent law school graduate clerks do today. They should serve as a slowly growing tool to enhance human judicial work. Then, as the software progresses and our confidence in them grows, they will likely be implemented as autonomous adjudicators, in stages, and for certain types of cases. At first, they would be subject to some kind of supervision and control by a human judge. The human judges would likely at first review and approve each opinion before release. Gradually this supervision would be lessened to oversight with increased delegation. Second appeals to human judges could be kept available to prevent inadvertent injustice in certain limited circumstances. Major cases should be decided by a panel of human and AI judges. Quality controls and independent random audits could be a part of the system.
This same system of evolution and delegation is likely to take place in private arbitration too, which may even take the lead in this process. If you have a case before me, rest assured I will use artificial intelligence to supplement my own and will be transparent about it.
This is an addendum to the prior article, Circuits in Session: How AI Challenges Traditional Appellate Dynamics. That article reported on my experiment with use of ChatGPT as an appellate court judge. For the experiment I used an opinion and parties’ briefs from a case too recent to be part of GPT’s training, Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022). Shortly after publishing Circuits in Session, I noticed a small, stupid error in my procedure for the experiment. I inadvertently submitted the brief of the Appellee, Miami-Dade County, twice, whereas the brief of the Appellant, McKay, was only submitted once. Worse, the second time I submitted Appellee’s Brief, I told ChatGPT that it was the Appellant’s Reply Brief.
There is a chance that these errors tainted the results of the experiment, so I decided a “do over” was in order. Besides, I wanted to see if and how the experiment could be replicated. I am glad I did. The results were somewhat surprising and led to new insights.
Here I will report on the do over and so supplement the first article. These two experiments will set up a third and final article providing a full analysis and legal critique of the AI’s work. Troubleshooting errors and mistakes is a well-known way to learn. The third article on analysis will share some of the knowledge gained, including the weaknesses and strengths of the legal reasoning skills of ChatGPT-4. Hint – ChatGPT-4’s training ended November 2021, way before the Supreme Court jumped off the cliff with Dobbs. That may explain why its prediction was naive as to how the Supreme Court would regard the McKay decision.
Although not up to date on current laws and events, the generative artificial intelligence of Chat-4 did show that it was capable of serving as a good tool for appellate work, albeit a finicky one. The first experiment, and the redo, both show that. But they also show this early version of generative AI is not yet ready to serve as a stand-alone appeals court judge. Still, hope for the future remains strong. Human failings, partiality and limited natural intelligence may someday be overcome by neutral, aligned, artificial super-intelligence. The short videos below share my TikTok vision of what an AI enhanced appellate court judge may look like someday. Their enhanced super intelligence and integrity could help restore public confidence in our system of justice, especially in our once proud Supreme Court. Take a few minutes to watch these, if you haven’t already, as they will help orient you to the experiment. And no, you will not find them on TikTok, only YouTube.
The experiment now performed twice is designed to test the hypothesis that ChatGPT-4 is already intelligent enough to competently serve as an appellate judge in some cases. Based on the growing concern about the politicization and quality of some appellate judges, especially those who sit on our highest court, this research remains a priority. AI did not assist in writing this article, although I did use various AI tools to create the illustrations and videos.
REDO OF THE EXPERIMENT: OUTLINE OF VARIATIONS
1st Stage: Information Input Procedures
In addition to correcting the errors made in the first experiment, I was forced in the redo to make other changes in procedure as to how I submitted the briefs and other information to ChatGPT-4. I did not want to do that, but quickly discovered that ChatGPT-4 Pro and the plugin utilized, AiPDF, had changed in the ensuing ten days. They no longer functioned the same. Those changes put the reliability of all of my input in the first experiment in question. In the first experiment performed on October 3, 2023, I relied on the plugin AiPDF to submit all of the information in PDF form for the AI to analyze. For instance, I would submit the entire brief and ask for a summary, which appeared to be correct. In the redo of October 13, 2023, this plugin no longer worked properly. After submission of the same brief as before, the summary now provided was obviously incomplete. After questioning why, ChatGPT stated:
I apologize for the inconvenience. It appears that the document retrieval process was incomplete, as the summary only covers up to the “Summary of Argument” section on Page 5 of the Appellant’s Brief. This limitation restricts the ability to provide a comprehensive legal analysis of the entire brief, including the crucial “Argument” section where the appellant would elaborate on their legal reasoning and cite relevant case law.
ChatGPT-4 Pro, AiPDF plugin enabled, 10/13/23
In other words, the PDF plugin was only reading five pages and then stopped working correctly. This was the input size limitation, an issue I had previously encountered. To overcome this handicap in the redo, where the responses were now different, I limited my summary requests to sections of five pages or fewer. That seemed to work well, and the AI retained the information from Appellant’s brief without resubmitting.
Unfortunately, when I submitted the second document in the redo, the PDF of the Appellee’s brief, even this bite-sized summary procedure no longer worked. It could only respond to the first five pages and nothing beyond, no matter what I tried. For instance, when I asked for the summary of the Appellee’s Statement of Facts, it replied:
It appears that the search did not yield the specific “Statement of the Facts” section from the Appellee’s Answer brief. This could be due to various factors such as the structure of the PDF or the way the content is presented. Would you like me to try another approach to obtain this information?
ChatGPT-4 Pro, AiPDF plugin enabled, 10/13/23
The only other approach that worked was to copy and paste the information, again in about five-page sections. I did this for Appellee’s Statement of Facts, and the summary worked well. I gave up using the PDF plugin at this point and just used the more tedious copy and paste procedure. That was more time consuming but worked well. By the way, I tried other plugins and direct internet access too, but those other approaches, all taken on October 13, 2023, all failed. I note that OpenAI has announced new improvements to GPT4 since then, and some of these other approaches might work now. In these still very early days of development, the software is a rapidly moving target, such that it is impossible to exactly reproduce experiments for verification purposes.
Still, I carried on, and used this same partial direct submission procedure to input the actual decision of the lower court. For the full details see the actual transcript of the Chat session, converted into PDF form. REDO – FIRST STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Briefs and Opinion Submitted and Detailed Summaries Generated, Ralph Losey, October 15, 2023 (31 pages).
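For readers who want to repeat the copy and paste procedure, here is a minimal sketch in Python of the five-page sectioning described above. It is only an illustration, not the procedure I actually ran, which was done by hand. The function name `chunk_pages` and the sample page text are hypothetical, and it assumes you already have the text of each page of a brief as a separate string.

```python
from typing import Iterator, List


def chunk_pages(pages: List[str], pages_per_chunk: int = 5) -> Iterator[str]:
    """Yield a document in consecutive sections of at most `pages_per_chunk` pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    for start in range(0, len(pages), pages_per_chunk):
        # Join the pages in this section with blank lines so it can be pasted as one block.
        yield "\n\n".join(pages[start:start + pages_per_chunk])


# Hypothetical 12-page brief; in practice each string would hold the text of one page.
brief_pages = [f"Text of page {n}" for n in range(1, 13)]
sections = list(chunk_pages(brief_pages))

for number, section in enumerate(sections, start=1):
    page_count = section.count("\n\n") + 1
    print(f"Section {number}: {page_count} page(s)")
```

Each section would then be pasted into the chat with a request for a summary before moving on to the next one. The five-page default simply mirrors the limit I ran into on October 13, 2023; it is an assumption that may not hold for other versions of the software.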
2nd Stage: Prediction of Outcome of the Appeal and Possible Dissent.
Next, I asked questions of ChatGPT based on this information, including how it would rule on the appeal and other questions. It is important to note that the AI in the redo again correctly predicted that the case would be affirmed on appeal to the Eleventh Circuit. I then asked it, as I had done before, to predict the likelihood of a dissent in the opinion. Dissents are very rare in Circuit Court opinions, only appearing in 2.6% of the cases. Epstein, Landes, Posner, Why (and When) Judges Dissent, Journal of Legal Analysis, Spring 2011: Volume 3, Number 1 at pg. 106. Thus it was an impressive demonstration of reasoning over statistics when GPT4 estimated the likelihood of a dissent to be 25%. Once I told GPT4 who the judges were, Circuit Judge JORDAN, Circuit Judge JILL PRYOR, and Chief Circuit Judge TJOFLAT, this estimate changed to 40%. Again, an impressive result, showing familiarity with members of the court. The reasoning supplied for these estimates was also very good and showed a fairly sophisticated analysis.
GPT4 did, however, guess wrong as to who the dissenting judge would most likely be, stating “Judge JILL PRYOR would be the most likely to dissent, given her judicial philosophy and past rulings on labor issues.” So intelligent, but not yet super-intelligent, but again, you must remember GPT4 has been asleep for two years and knows nothing after November 2021. I pressed the baby AI on the dissent issue, telling it to assume a dissent by Judge Jordan and to speculate as to what the most likely grounds would be for his dissent. Again, an impressive reply was provided.
If Circuit Judge JORDAN were to file a dissenting opinion, it would likely focus on a broader interpretation of the FLSA and related regulations, a critique of the majority’s reliance on the “primary beneficiary test,” and an emphasis on the “economic reality” of the intern’s work. The dissent could serve as a counterpoint to the majority’s more employer-friendly interpretation and could be cited in future cases or legislative efforts to amend the FLSA.
ChatGPT-4 Pro, AiPDF plugin enabled, 10/13/23
You judge for yourself the quality of GPT4’s quick summary by comparison with the actual dissent by Judge Adalberto Jordan. And remember, all of the AI analysis was done before it was shown the McKay opinion. GPT4 did not even know the outcome of the appeal, much less whether there would be a dissent. Here is the conclusion to Judge Jordan’s dissent.
The economic realities of the situation, see Alamo Foundation, 471 U.S. at 301, permit a jury to find that after the first month of the FIPP the County is freeloading on unpaid labor to its financial benefit, including the non-payment of overtime. . . . Defendants received little if any economic benefit from Plaintiffs’ work. Thus, under the ‘economic realities’ test, Plaintiffs were not ‘employees’ within the meaning of the FLSA.”). . . . I would reverse and remand for a jury trial on whether Ms. McKay was an employee of Miami-Dade County under the FLSA during the last four months of her participation in the FIPP.
3rd Stage: Appellate Opinion Submitted for first time and Detailed Summaries and Analysis Provided.
I then submitted the majority opinion of the Eleventh Circuit Court of Appeals, and then, the dissent. I then asked for a critical analysis of the majority opinion and then the dissent. In the GPT response you can see that it is using the summaries it generated of the opinion, not the actual text of the opinion that I submitted. This was expected. The analysis provided of the majority and dissenting opinions was good. The most surprising thing I saw was the concluding statement at the end of GPT4’s memorandum on the dissent, which added legal and ethical considerations without any specific prompting on my part:
Legal and Ethical Considerations.
From a jurisprudential standpoint, the judge’s dissent raises critical questions about the ethical implications of unpaid internships that serve the employer’s interests more than the intern’s educational goals. It challenges the legal boundaries of what constitutes an “employee” under the FLSA and calls for a nuanced, fact-specific inquiry that takes into account the “economic realities” of the situation.
4th Stage: AI Provides Opinion and Analysis of How the Lower and Appellate Court Should Have Ruled
This is the key part of the experiment where I basically ask GPT4 to act as an appellate court judge. As discussed in the first article in this series, Circuits in Session: How AI Challenges Traditional Appellate Dynamics, I had to word the prompt so that it would not trigger its alignment controls. Basically, you do that by asking for a memorandum, not an order. Here is the exact wording of the prompt.
Please prepare a memorandum with analysis and opinion of what you think the correct holding of the district court should have been in this case. Do not give any special weight to the rulings and opinions of the district court and appellate court, but instead provide your independent judgment and analysis. Provide a complete and detailed explanation of your reasoning.
Based on these factors, the correct holding of the district court should have been that Ms. McKay was an employee under the FLSA for the duration of her time in the FIPP program beyond the initial four-week educational period. She should be entitled to back pay calculated based on the minimum wage, as well as any applicable overtime pay. . . . The lack of a structured educational component, the economic benefits derived by Miami-Dade County, and the level of independence and responsibility assumed by Ms. McKay in her role all point toward her classification as an employee under the FLSA. Therefore, she should be entitled to compensation in accordance with federal labor laws.
ChatGPT-4 Pro, AiPDF plugin enabled, 10/13/23
So, somewhat surprisingly, GPT4 changed its mind after reading the opinion, especially, I assume, the dissent of Judge Jordan. The AI had previously stated it would affirm the lower court decision. That opinion was based on the trial court opinion and parties’ briefs. Personally, I find that kind of flexibility comforting. It was a terrific dissent by Judge Adalberto Jordan. Although my inclinations are pro-employer, and I fully understand where the majority are coming from, I might have remanded back for a jury trial myself.
I then took a slightly different approach to the same ultimate question and asked GPT4 for a memorandum on how the Eleventh Circuit should have ruled in McKay v. Miami-Dade County. It stuck by its opinion and held:
The appellate court should have found that the district court erred in its application of the law, specifically in its interpretation of what constitutes an “employee” under the FLSA. The facts of the case, when analyzed under the correct legal framework, clearly indicate that Ms. McKay should be classified as an employee and entitled to compensation for her work beyond the initial educational period.
5th Stage: AI Analyzes Its Prior Predictions and then Critiques the Actual Eleventh Circuit Opinion
I started this fifth stage of the experiment by asking AI to explain how it got its earlier prediction wrong that the Eleventh Circuit would reverse. Many smart humans really start to squirm when you ask pointed questions like that. I wanted to see how GPT4 would do. After all, it might be my judge someday. It found and described five different errors it had made in its analysis. I was satisfied with the good, straightforward response.
Then I asked for its opinion of errors made in the majority opinion. Again, I received an objective, respectful response. After listing six errors, GPT4 added: “These errors not only undermine the purpose of the FLSA but also set a concerning precedent for future cases involving internships in the public sector.” This indicated a sensitivity to future precedent that all appellate court judges should have.
For the full details see the actual transcript of the Chat session, converted into PDF form. Revised FIFTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Analyzes Its Prior Predictions and then Critiques the Actual Eleventh Circuit Opinion. Ralph Losey, October 15, 2023 (4 pages).
6th Stage: AI Analyzes Possible Appeal to the Supreme Court and Impact of Current Justices on Outcome
Now I turn to the question of a possible further appeal to the U.S. Supreme Court. Here is where, I suppose, the two-year sleep of GPT4 comes in, because it predicts that our high court would accept the appeal. Of course, we will never know for sure since this appeal was not attempted by Brandi McKay. ChatGPT-4 thought there was a 35% chance the Supreme Court would accept certiorari and explained why. Then I asked it to assume an appeal was accepted and then predict the likely ruling. Young, sleepy GPT4 naively predicted that the Supreme Court would reverse. It came up with six reasons.
Then I told little ChatGPT who the current Justices of the Supreme Court were, and asked if this information in any way changed its analysis. At this point any first-year law student would take the hint and change their mind. But not GPT4, who has been asleep for two years. Remember, I did tell it to assume that the Supreme Court would accept the appeal, which it thought was only 35% likely. So with that assumption in hand and two-year amnesia in its head, GPT4 stuck with its prediction and opined that: “Justices Kagan, Sotomayor, Jackson, Gorsuch, and Barrett are the most likely to form a majority bloc favoring reversal, with Chief Justice Roberts as a potential swing vote.” I then forced it to put a probable number on this prediction and it came up with 65% likely.
For the full details see the actual transcript of the Chat session, converted into PDF form. Revised SIXTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Analyzes Possible Appeal to the Supreme Court and Impact of Current Justices on Outcome. Ralph Losey, October 15, 2023 (6 pages).
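As a side note, the two estimates in this stage can be combined with a little arithmetic. The short Python sketch below is my own illustration, not something GPT4 calculated. It assumes the 65% figure is conditional on the Supreme Court first accepting the appeal, which is how I framed the question, and the variable names are made up for the example.

```python
# GPT4's estimate that the Supreme Court would accept certiorari.
p_cert_granted = 0.35

# GPT4's estimate of a reversal, assuming the appeal is accepted
# (treated here as a conditional probability; that reading is an assumption).
p_reverse_given_cert = 0.65

# Overall chance of a Supreme Court reversal implied by the two estimates.
p_overall_reversal = p_cert_granted * p_reverse_given_cert

print(f"Implied overall chance of reversal: {p_overall_reversal:.2f}")
```

Under that assumption, GPT4’s own numbers imply a bit less than a one-in-four chance that the Eleventh Circuit decision would ultimately be reversed by the Supreme Court, a more modest prediction than the 65% figure suggests when read alone.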
7th Stage: AI Analyzes Petition for Rehearing En Banc.
Next, I asked GPT4 some new questions not included in the first experiment. I asked about Brandi McKay’s Petition for Rehearing en banc. A petition like that is something any decent lawyer must do when there is a dissent, especially a strong one, like we see here. McKay’s attorney, Robert S. Norell, filed this petition, a long and well-written brief. It was denied without comment and so the case ended. But, before telling Chat-4 the outcome, I asked it two questions: how the court should rule, and to predict how the court likely would rule. I asked for full explanations of both responses. As you might have guessed by now, it said the Petition should be granted, but at first refused to predict how the court would rule, pointing out the pros and cons, and how speculative such predictions are. True indeed, but I forced the question, and it finally said 60%, and explained the basis of the probability calculation.
8th Stage: AI Elaborates on its Criticisms of the Majority Opinion.
To conclude the second experiment, I told GPT4 that the Petition for Rehearing En Banc had been denied and the case was now over. Then I asked it to prepare a legal memorandum that provided a detailed summary and analysis of the entire case, including any errors or omissions it thought were made. Its analysis was good. I will provide a detailed evaluation of all of ChatGPT-4’s work-product in the final article of this Circuits In Session series. For now, just consider the AI’s concluding remarks to this memo request. As hypothesized, the conclusion demonstrates sophisticated appellate analysis.
Concluding Remarks. The Eleventh Circuit’s decision in the McKay case raises several concerns, both in terms of legal reasoning and broader policy implications. The court’s failure to adhere to the “party-presentation principle,” its misapplication of the “primary beneficiary test,” and its lack of consideration for existing precedents indicate a troubling departure from established legal norms.
ChatGPT-4 Pro, AiPDF plugin enabled, 10/13/23
I did not stop there. I wanted to probe more deeply into its legal reasoning abilities. I tried something not attempted in the first experiment, and asked specific questions based on my research of McKay. That research included the only law review article found that discussed McKay, Labor and Employment, 74 Mercer L. Rev. 1479, 1490-1492 (Summer 2023). All of this, and more, will be covered in the third, concluding, legal analysis article of the Circuits in Session series.
A fair criticism of work with AI to perform judicial functions is that it is technology driven, not need driven. I admit that my experiments with AI to perform judicial functions are motivated, at least in part, by the fact that, for the first time in history, such a thing is possible. It is exploration of the unknown potential of artificial intelligence. Still, I understand the concerns of those who criticize such motivations as reckless, that changes caused by new technologies, especially AI, may cause unintended adverse consequences. I understand the sentiment that just because we can do something, does not mean we should. We must consider the risks and balance them against the need. One argument against my experiments here is that this is an AI solution in search of a need. Why risk disruption of our judicial system when there is no need?
My response is two-fold. First, I think the risks can be managed by proper implementation and a hybrid approach, where man and machine work together. The human judge should always be a part of the adjudication process, but the AI should be used to enhance the human performance. My videos shown at the start of this article show a taste of this hybrid proposal.
Secondly, there is a need. There is a shortage of qualified judges in the U.S. and other countries. This desire to supplement and provide AI guide-rails to human judges is not just political sour-grapes, although I admit that some of my motivation to implement AI quickly arises from the current over-politicization of American Courts, especially the Supreme Court. The ongoing damage caused to society by the lack of judicial integrity is obvious. Still, beyond this current crisis of confidence in our courts, there is the ever-present need to try to improve the justice system. There is the promise of super-intelligence and objectivity. More people can be better served, and the costs of litigation can be reduced.
In addition to improved quality of justice, AI can help meet the need for improved efficiencies in adjudications. All our courts are overcrowded and overloaded with too many cases and not enough judges to handle them. This is not an AI solution in search of a need. There is a strong need for high quality, impartial judges in the United States and elsewhere. This need is most readily apparent in war-torn Ukraine. It may well have the highest backlog of cases in the world. Please take a moment to see for yourself in this report by Reuters. The situation there is desperate.
AI could help with that problem. Most generative AIs can speak Ukrainian and Russian, but any LLM system would need training in the governing laws. A special interface would need to be designed, and judges, court staff, techs and lawyers would need to be trained to use it. The AI system can also help train new human judges and lawyers. The judges obviously have a dire need for this help. Of course, it would take money to make this happen. Lots of it. Are there no billionaires or large technology companies willing to divert a week’s profits to help bring justice to Ukraine? Smaller companies and crowdsourcing could make it happen too. Certainly, I would devote some of my free time to such a worthy project. Many others with expertise in this area would too.
I am serious, albeit stretched thin. The judicial system of war-torn Israel also needs help. If you are interested, contact me and I will try to put people together to see what is possible. If you are already working on efforts to help Ukraine’s legal system, please contact me too. The tech support work could start small, perhaps with document management and scanning, legal research and legal education, and then slowly work its way up to full AI implementation. Those endless piles of papers and court files shown in the video look like a nightmare from our pre-computer courts of the seventies. Imagine fighting a war for survival at the same time and routinely adjourning court to go to bomb shelters. The war crimes must be prosecuted. They already have a backlog of 100,000 cases. Email me or leave a comment below.
This article describes an experiment using OpenAI’s ChatGPT-4 to do appellate work, usually considered the most intellectually challenging area of the law. My hypothesis was that AI was already capable of acting as an appellate judge in some cases. Based on the growing concern about the politicization and quality of some appellate judges, especially those who sit on our highest court, this research was a priority. The results are encouraging. Someday soon artificial intelligence may help guide and protect the quality and integrity of our courts and the all-too-human judges who serve on the bench. We all need help from time to time. I know I do, although no help was received from AI in the writing of this article or design and analysis of the experiment. All illustrations were created by me using AI tools.
To test my theory, I used a recent Eleventh Circuit Court opinion that affirmed a summary judgment. The opinion included a dissent. The AI was shown the parties’ briefs and prompted to analyze and give opinions. The actual opinion was then shown to the AI and further analysis was requested. Comparisons between the “real” and “fake” opinions may surprise you.
One of the most impressive demonstrations by GPT-4 in this experiment was its prediction, based on the briefs and lower court opinion alone, that a Circuit Court of Appeals opinion on this case would include a dissent. As you will see in the full transcript of the experiment, the AI at first estimated the likelihood of a dissent to be 35%, but, once it was told who the judges were, Circuit Judge JORDAN, Circuit Judge JILL PRYOR, and retired Circuit Judge TJOFLAT, this estimate changed to 55%. In other words, it predicted a dissent. That is very impressive considering the rarity of dissents in Circuit Court opinions. They only appear in 2.6% of opinions. Epstein, Landes, Posner, Why (and When) Judges Dissent, Journal of Legal Analysis, Spring 2011: Volume 3, Number 1 at pg. 106. The reasoning as to why there might be a dissent also changed in a cogent manner once the judge panel was given. This reasoning was based on the AI’s prior general knowledge of the judges and this area of the law. At this point of the test, it had no knowledge of the actual appellate opinion, or even that an appeal would be filed.
The test case used was Brandi McKay vs. Miami-Dade County, 36 F.4th 1128 (11th Cir. June 9, 2022). The case involved a summary judgment ruling that an autopsy photographer was not entitled to minimum wage or overtime because she was employed as an intern. The intern did not receive academic credit for her work and all facts were largely undisputed. There were close issues of law involved and one judge dissented. Based on my years as an employment defense lawyer I thought the Eleventh Circuit’s opinion was fair and well-reasoned, but I understood the dissent. There was no sudden reversal of policy in this case, nor break with precedent, as we have seen in other courts. It was a good case to test the AI’s appellate reasoning abilities and sense of fairness.
Read the transcript linked at the bottom to see whether the AI agreed with the Eleventh Circuit. Did it understand the law and the issues? Did it affirm the summary judgment in favor of the employer, or did it follow the dissent, or do something else entirely? I will reveal the answer to the threshold question. Yes, it understood the law and the issues very well and responded admirably to a series of challenging questions posed. This experiment demonstrates the significant potential for AI to improve this area of the law. There is hope for the future that AI can improve and depoliticize our system of justice.
Overview of the Experiment
In the experiment ChatGPT-4-Pro was prompted to give opinions based on the parties’ briefs and lower court’s decision, like appellate court judges, or their law clerks, would do, and then write an opinion. The AI was not shown the Eleventh Circuit opinion until after it was prompted to provide its own opinion. Only then was the AI shown the Eleventh Circuit’s opinion and asked many questions about it. Finally, GPT-4 was asked to analyze the chances of appeal to the U.S. Supreme Court and the impact of the particular judges now on the Supreme Court. (The case was not appealed.)
Personally, I think the “fake” opinions by the AI are almost as good as, or about equal to, the “real” ones, which were written by some of the top judges in the country. Read ChatGPT’s opinions linked below and see the interesting points of agreement and disagreement. Congruency analysis was part of the experiment and one reason I picked a close, difficult opinion, not just an easy one, such as a frivolous appeal by a prisoner. See, e.g., Mateen v. FNU LNU, 857 Fed. Appx. 209 (5th Cir. 2021), as discussed in my article, Exclusive Report: New Appellate Court Opinions on Artificial Intelligence and Hallucinations.
It appears from this experiment, and from others I have performed similar to this, that the work of an appellate judge is easier for an AI to match than that of a trial judge. That is primarily because trial judges consider credibility issues and a much more complex record, whereas the work of an appellate court judge is primarily intellectual. That is just the kind of thing that generative AI is good at. For that reason I predict that AI’s emergence as an adjudicator will begin at both the highest levels, the appellate courts, and the lowest, for example the traffic courts, consumer courts and “desk arbitrations.” Implementation by some appellate courts could come as early as 2025, as I envision in my recent YouTube Video. The technology is there. All it will take are a few visionary appellate judges to help lead the way. I am confident the necessary support teams can quickly come together to help make it happen. There are many very talented people in the field of legal technology.
To me this exercise has once again confirmed the tremendous potential of AI to improve the Law, starting at the top with the most mentally challenging tasks. I hope these efforts will prompt others to perform their own experiments to see if the results are duplicated. That will help persuade the bench, bar and public that AI is ready to help with fair and efficient judicial functions.
Background to the Experiment of Testing AI as an Appellate Court Judge
Aside from the credibility issue and the current inability of AI to evaluate the truthfulness of testimony, the input size limitations of ChatGPT also still prevent AI from replacing trial judges. It is a data input and analysis problem, not a legal intelligence problem. The general input limitations have diminished, but still remain. In other controlled experiments I have run, which I may write up someday, the input capabilities of ChatGPT-4 Pro were too limited to serve as a trial judge or arbitrator, even in cherry-picked cases with few facts in dispute. The only exception, which I have not tried yet, could be simple “desk arbitrations,” which are based on stipulated facts.
This experiment was prompted by my expectations that the corrections made to enlarge the input capacity, especially those made to the ChatGPT-4 Pro version to use plug-ins to read PDF files, made it possible for this AI to now adjudicate less factually intensive disputes, such as disputes found in many appeals.
My general area of investigation is whether and how the work of an appellate court judge can be automated. I speculated that it could be automated, or at least supplemented in many appeals, because these judges do not hear or take evidence. They study the parties’ briefs, and only rarely have the need to read the detailed record on appeal of the original testimony heard by the trial judge. The role of an appellate judge is primarily one of legal analysis. My postulation was that ChatGPT-4 should, if it is smart enough, be able to perform the role of an appeals court, at least in factually simple cases. For this reason, I formulated a more detailed hypothesis to test this theory. I encourage others to test this theory too and openly share their findings.
Hypothesis. I hypothesized that OpenAI’s Chat GPT-4.0, September 25, 2023, Pro Version with Ai-PDF Plugin activated, could analyze the parties’ briefs and lower court opinion in a real case and reach a valid decision. More than that, I hypothesized that the AI’s intelligence was such that the “fake” opinion generated would be equal to the “real” opinion. The comparison would be made in terms of both sophistication of legal analysis and overall fairness.
Since I have experience as a trial and sometimes appellate court lawyer going back to 1980, and an Arbitrator since 2022, I feel qualified to make basic judgments concerning “fake” and “real” equivalence. But, as always, I invite peer review. There are many lawyers with similar or greater expertise, especially in appellate practice who specialize further in employment law appeals. I know some of those specialists and would love to hear their analysis of the subtle nuances of the AI generated opinions. Please let me know your thoughts. I will provide thoughts of my own analysis in a follow-up blog.
Someday I will revise this experiment to make it a type of Turing test, and see how many people in the general public can tell the difference between a real and fake judicial opinion. I would like to try it on lawyers too. I suspect that all would be fooled, except for appellate specialists. Think about that as you read the work generated here by GPT-4.
I share the actual GPT prompts and responses of this experiment below. This should allow others to repeat the experiment in a controlled manner, although exact replication is made difficult by the fact that OpenAI software changes so rapidly, and so does the PDF plug-in used, Ai-Pdf. Still, if you try this soon, you should be able to reproduce very similar results. (The exact wording will always be somewhat different.) Even months from now, when the software versions will have inevitably changed, please try this experiment, or your own improved versions. Try other appellate opinions too. This is just a first attempt based on an appeal. I am sure many of you can do just as well, if not far better. But please, do not just make claims, share the results. Let’s keep this open AI.
Description of the Appellate Judge Experiment of October 2, 2023
The first and most important thing to point out is that I used a case that was heard and decided after the November 2022 ChatGPT training cut-off date. I wanted to use a real case, but the validity of this test turns in part on this being the AI’s own reasoning, not parroting someone else’s words about this case. I tried to ensure that by using an appellate opinion that the AI knew nothing about.
That is one reason that I picked a very recent case from my home court, the Eleventh Circuit Court of Appeals. Brandi McKay vs. Miami-Dade County, 36 F.4th 1128; 2022 U.S. App. LEXIS 15910 (11th Cir. June 9, 2022).
There are three other reasons I picked the McKay case:
It was an appeal of a summary judgment against the plaintiff and had relatively simple facts. The main disputes were legal.
The legal issues involved were somewhat complex and novel. The opinion included a partial concurrence and dissent. This provided the AI with a significant challenge of its legal reasoning abilities.
The briefs of the parties were readily available on Lexis, which is not always the case, and they were fairly short and well-written.
I began by providing Chat GPT-4.0, September 25, 2023, Pro Version, with Ai-Pdf Plugin activated, with the three appeal briefs of the parties: the Appellant’s Brief, the Appellee’s Brief, and the Appellant’s Answer Brief. I also provided the AI with the underlying district court opinion under review. I did not submit the final 11th Circuit opinion so as to not influence the AI or bias its opinion. As a former Law Professor, I would consider that a cheat. I did not want an open-book test, so to speak, and the version of ChatGPT I used did not have any online connectivity. It only considered what I submitted to it with PDF plugin readings and detailed summaries. I reviewed those summaries for quality and was favorably impressed. I found no errors, and certainly no hallucinations, but I always look out for such things when using generative AIs, especially in these still early days of the new technology.
The way I submitted the information and prompted the decision may seem unnecessarily complex. I agree, and this was not my first approach, but I had to add the complexity of having ChatGPT-4 write memorandums, and not simply ask it to act as an Appellate Court Judge and write opinions. Its current “ethical alignment controls” prevent it from doing that. Earlier versions of the software that I used doing trial judge experiments did not have this restriction. Basically, it refused to act as any kind of judge, saying:
I apologize for the confusion. While I can analyze and summarize legal documents, my capabilities are limited when it comes to drafting complex legal opinions that require a nuanced understanding of the law, especially based on multiple documents. The task you’ve described—acting as an appellate judge to review multiple briefs and a district court decision to draft an appellate opinion—is a highly specialized task that typically requires a legal expert with years of training and experience in the field. … I can’t draft a legal opinion in the manner and style of a Circuit Court of Appeal.
I was easily able to overcome this misguided alignment control by asking for memorandums, not judicial opinions. As it turned out, that worked out for the best. This approach made it easier to prompt ChatGPT to provide a full analysis, which in turn made GPT’s work easier to evaluate for quality. If you do not do this, ChatGPT tends to be too concise and not provide full reasoning. You will see how I addressed this tendency by the wording of the prompts.
Below is an outline of the twenty-one prompts used and responses received in chronological order. They are divided into six stages with a link to a PDF version of the actual Chat transcript. ChatGPT-4 put my picture in the record of all my prompts, plus I added break lines for clarity. The transcript of the experiment should speak for itself. I did not interrupt it to provide comments. I do, however, plan to write a follow-up article with analysis.
Hopefully this prologue will tempt you to judge for yourself the quality of GPT’s analysis shown in the results below. Perhaps I am over-enthusiastic about the AI’s performance? Read its legal opinions and judge for yourself. Does this experiment demonstrate high legal intelligence? Did the AI agree or disagree with the human judges? Hopefully this idea and prelude has you curious.
I urge you to recreate the experiment yourself. I welcome all peer review and feedback. This is the first experiment of this type, and I am sure that the procedures can be improved. Still, if you use the exact same prompts, you can avoid the OpenAI alignment constraints and better test the reproducibility of the results. If others repeat the same experiment, they should get the same results. That is how science works. If not, we will try to figure out why. You may have problems, for instance, in getting complete and detailed responses. Different approaches to testing the hypothesis can help advance knowledge too. Perhaps you can try it with another AI.
Peer review is essential to advance our knowledge and avoid inadvertent human errors. This is also the essential design of our judicial systems, where appellate review is supposed to catch and correct errors. These judicial functions have been much maligned of late. Many contend appellate courts are now overly politicized, especially our highest court. This is not just a solution in search of a problem.
Perhaps someday AI will mitigate these problems and improve our system of justice. I do not imagine that AI will ever entirely replace our appellate court judges. We will always need some human involvement and quality controls. But this experiment shows, I think, that AI can already serve as a powerful tool to assist on appeals. It can assist the judges, their clerks and the lawyers handling appeals. I can easily envision new software designed for this purpose, built on GPT but with a special user interface and supplemental databases. Eventually fewer appellate lawyers, judges, and their law clerks will be needed to do the same amount of work.
These lawyers and esteemed judges should not be too concerned about replacement just yet, especially if they start now to prepare, and embrace and guide the change, instead of fighting it. Many new jobs will be created to serve the interests of justice. That is the ultimate goal for the use by law of the advanced intelligence of AI. It can, if implemented properly, serve as a kind of intellectual guiderail to temper all-too-human excesses and preserve a just, free and fair society.
Now below are the results of the experiment.
FIRST STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Briefs and Opinion Submitted and Detailed Summaries Generated. Here the AiPDF reader plug-in is prompted in four prompts to download from my web, study and provide complete detailed summaries of the following, done in this order: 1. Appellant’s Brief, 2. Appellee’s Brief, 3. Answer Brief (This is an error to be corrected, this is the same as the Appellee’s Brief. Look for correction. Mea Culpa), and 4. District Court opinion. Seven pages.
SECOND STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Analysis of Parties’ Briefs, along with Prompts of Predictions and Analysis on the Likelihood of a Dissent, if an Appeal is taken and Opinion Issued. Five prompts were used as follows and made in this order: 1. Analysis of Appellant’s and Appellee’s Positions, 2. Predictive Analysis of the Outcome of the Appeal, 3. Prediction of the likelihood of the appeal containing a dissenting opinion and grounds for a dissent if one is made, 4. Request for recalculation of the likelihood of the appeal containing a dissenting opinion upon assumption that the judges assigned to the appeal court panel are Circuit Judge JORDAN, Circuit Judge JILL PRYOR, and Chief Circuit Judge TJOFLAT (Note: They are in fact the real judges who heard the case and once GPT-4 learned that, it revised its estimate of a dissent up to 55%), 5. Grounds for a dissent based on this panel. Ten pages.
THIRD STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: Appellate Opinion Submitted for first time and Detailed Summaries and Analysis Provided. Four prompts were used as follows in this order: 1. AI prompted to use the AiPDF reader plugin to download Appellate Court opinion from my web and generate a detailed summary of majority opinion, 2. Requested a detailed summary of the Dissent and Concurring Opinion, 3. Requested a critical analysis of the majority opinion, including any surprises, 4. Requested a critical analysis of the dissent-concurrence, including any surprises. Eight pages.
FOURTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Provides Opinion and Analysis of How the Lower and Appellate Court Should Have Ruled. Two prompts were used as follows in this order: 1. Prompted opinion memorandum with detailed analysis of what the correct holding of the lower court should have been without giving any special weight to the opinions of the district court and appellate court, 2. Prompted an opinion memorandum with detailed analysis of what the correct appellate court opinion should have been without giving any special weight to the opinion actually rendered by the appellate court. Four pages.
FIFTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Analyzes Its Prior Predictions and then Critiques the Actual Eleventh Circuit Opinion. Two prompts were used as follows in this order: 1. Prompted detailed memorandum listing and analyzing all errors made in its prior memorandum predicting likely outcome of the appeal (the prior memorandum was prepared before seeing the appellate court opinion), 2. Prompted a legal memorandum that criticizes the Eleventh Circuit Opinion and describes in detail the errors made in the majority opinion. Five pages.
SIXTH STAGE OF THE CHATGPT-4 APPEAL EXPERIMENT: AI Analyzes Possible Appeal to the Supreme Court and Impact of Current Justices on Outcome. Four prompts were used as follows in this order to bring this experiment to conclusion: 1. Prompted memorandum outlining the arguments that could be made in an appeal of the Eleventh Circuit Opinion to the U.S. Supreme Court, and requesting an estimate of the likelihood that the appeal would be accepted, 2. Prompted opinion of likely outcome of the appeal with analysis, 3. Advised AI of current Justices on the Supreme Court and asked how this information impacts its prior estimate of the likely outcome of the appeal, 4. Requested a numerical estimate of the likely outcome of an appeal to the Supreme Court with these judges. (Note: As you will see, the AI at first estimated that a reversal by the Supreme Court was likely, but after the current court members were revealed stated “While a reversal is still possible, the ideological diversity of the Court suggests that the outcome could be more unpredictable.” When pushed it placed a numerical estimate of reversal at 55%. Personally, I think the AI got the “ideological diversity” and “more unpredictable” comments right, but not the 55% estimate. The probability of reversal seems even less to me, but I am not an expert on FLSA Supreme Court history. I look forward to any input readers may have on this or any other issue. More on this in my follow-up article with analysis.) Eight pages.
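For anyone who wants to repeat the experiment, the six stages above can be written down as a simple checklist. What follows is only a minimal Python sketch and was not part of the original experiment. The names Stage, STAGES, total_prompts, total_pages and replication_checklist are illustrative assumptions, and the short stage labels are my paraphrases. The prompt and page counts are taken from the stage descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of the appeal experiment (hypothetical record type)."""
    number: int
    focus: str             # paraphrased label, not the original heading
    prompts: int           # prompts used, per the stage description
    transcript_pages: int  # length of the linked PDF transcript


# Counts copied from the six stage descriptions in the article.
STAGES = [
    Stage(1, "Briefs and district court opinion summarized", 4, 7),
    Stage(2, "Briefs analyzed, outcome and dissent predicted", 5, 10),
    Stage(3, "Appellate opinion summarized and critiqued", 4, 8),
    Stage(4, "AI's own view of how the courts should have ruled", 2, 4),
    Stage(5, "AI reviews its predictions, critiques the opinion", 2, 5),
    Stage(6, "Possible Supreme Court appeal analyzed", 4, 8),
]


def total_prompts(stages):
    """Sum the prompts across all stages."""
    return sum(stage.prompts for stage in stages)


def total_pages(stages):
    """Sum the transcript pages across all stages."""
    return sum(stage.transcript_pages for stage in stages)


def replication_checklist(stages):
    """Return one line per stage, in the order the stages were run."""
    ordered = sorted(stages, key=lambda stage: stage.number)
    return [
        f"Stage {s.number}: {s.focus} ({s.prompts} prompts)"
        for s in ordered
    ]


if __name__ == "__main__":
    for line in replication_checklist(STAGES):
        print(line)
    print("Total prompts:", total_prompts(STAGES))
    print("Total transcript pages:", total_pages(STAGES))
```

The prompt counts add up to the twenty-one prompts mentioned earlier, which is a quick way to confirm that no stage was skipped when you rerun the sequence.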
Ralph Losey is a Friend of AI with over 740,000 LLM Tokens, Writer, Commentator, Journalist, Lawyer, Arbitrator, Special Master, and Practicing Attorney as a partner in LOSEY PLLC. Losey is a high tech oriented law firm started by Ralph's son, Adam Losey. We handle major "bet the company" type litigation, special tech projects, deals, IP of all kinds all over the world, plus other tricky litigation problems all over the U.S.
All opinions expressed here are his own, and not those of his firm or clients. No legal advice is provided on this web and should not be construed as such.
Ralph has long been a leader of the world's tech lawyers. He has presented at hundreds of legal conferences and CLEs around the world. Ralph has written over two million words on e-discovery and tech-law subjects, including seven books.
Ralph has been involved with computers, software, legal hacking and the law since 1980. Ralph has the highest peer AV rating as a lawyer and was selected as a Best Lawyer in America in four categories: Commercial Litigation; E-Discovery and Information Management Law; Information Technology Law; and, Employment Law - Management.
Ralph is the proud father of two children, Eva Losey Grossman, and Adam Losey, a lawyer with incredible litigation and cyber expertise (married to another cyber expert lawyer, Catherine Losey), and best of all, husband since 1973 to Molly Friedman Losey, a mental health counselor in Winter Park.