The Problem of Deepfakes and AI-Generated Evidence: Is it time to revise the rules of evidence? – Part One


Ralph Losey. Published September 19, 2024.

On April 19, 2024, the Advisory Committee on Evidence Rules for federal courts faced a critical question: Does AI-generated evidence, including deepfakes, demand new rules? The Committee's surprising answer: "not yet." Was that the right call? Will they change their mind when they meet again in November, right after the elections?

Part One analyzes the various rule change proposals. Chief among them is the proposal by Judge Paul Grimm (retired) and Professor Maura Grossman, who are well known to all legal tech readers. Several other interesting proposals were considered and discussed. You will get the nerd's inside view of the key driving facts at play here: the danger of deepfakes, the power of audio-video evidence, jury prejudice and the Liar's Dividend. Part One also explains why the Evidence Rules Committee chose not to act and why you should care.

Part Two will complete the story and look at what comes next at the Committee's meeting of November 8, 2024. It will also include a discussion of a second, slightly revised proposal by Paul Grimm and Maura Grossman, which they just submitted, and of the latest article by Paul Grimm, Maura Grossman and six other experts: Deepfakes in Court: How Judges Can Proactively Manage Alleged AI-Generated Material in National Security Cases. They are all trying, once again, to push the Committee into action. Let us hope they succeed.

Summary of the Evidence Committee’s Decision and the Leadership of its Official Reporter

The Committee, under the strong leadership of its official Reporter for the last twenty-eight years, Daniel J. Capra, considered multiple proposals to amend the Rules of Evidence, but rejected them all. Professor Capra cited the need for further development. For now, courts must manage the significant new challenges of AI with existing rules.

The key segment of the Committee's work is the 26-page memorandum found at Tab 1-A of the 358-page agenda book. It was written by Professor Daniel J. Capra of Fordham University School of Law, also an Adjunct Professor at Columbia Law. Dan Capra is a man almost my age, very powerful and respected in academic and judicial circles. He is a true legend in the fields of evidence, legal ethics and education, but he is no nerd. His comments and the transcript of his interaction with two of the top tech-nerds in law, Judge Paul Grimm (retired) and Professor Maura Grossman, make clear that Professor Capra lacks hands-on experience with, and deep understanding of, generative AI.

That is a handicap to his leadership of the Committee on AI issues. His knowledge is theoretical only, and AI is just one of many, many topics that he reads about. He does not teach AI and the law, as both Grimm and Grossman do. This may explain why he wanted to just wait things out, again. He recommended, and the Committee agreed, apparently with no dissenters, to reject the concerns of almost all of the hands-on nerds, including all of the legal experts proposing rule changes. They all warn that generative AI and deepfakes threaten to interfere with our evidence-based system of justice. Deepfakes may even make it impossible to protect our upcoming election from interference. Daniel Capra gives some consideration to the danger, but thinks the concerns are overblown and that the Committee should continue to study and defer any action.

Evaluation of the Pending Danger of Deepfakes

For authority that the dangers of deepfakes are overblown, and so no rule changes are necessary, Professor Capra cites two articles. Professor Capra's Memorandum to the Committee at pgs. 25-26 (pgs. 38-39 of 358). The first is unpersuasive, to say the least: a 2019 article in The Verge, Deepfake Propaganda Is Not a Real Problem (Mar. 15, 2019). The article was written by Russell Brandom, who claims expertise on "the web, the culture, the law, the movies, and whatever else seems interesting."

The second article is better: Riana Pfefferkorn, Deepfakes in the Courtroom, 29 Public Interest Law Journal 245, 259 (2020). Still, it was written in 2020 and is now well out of date. My research discovered that Riana Pfefferkorn has published a much more recent paper pertaining to deepfakes, Addressing Computer-Generated Child Sex Abuse Imagery: Legal Framework and Policy Implications (Lawfare, February 2024). In the Introduction at page 2 of this well written paper she says:

Given the current pace of technological advancement in the field of generative ML, it will soon become significantly easier to generate images that are indistinguishable from actual photographic images depicting the sexual abuse of real children.

For Ms. Pfefferkorn the problem of deepfakes is now very real and urgent. At page 25 of the paper she asserts: "There is an urgent need, exacerbated by the breakneck pace of advancements in machine learning, for Congress to invest in solving this technical challenge."

Professor Capra and the Committee see no "urgent need" to act. They do so in part because of their belief that new technology will emerge (or already exists) that can detect deepfakes, so the problem will just go away. Professor Capra has one expert to support that view, Grant Fredericks, the president of Forensic Video Solutions. I looked at the company website and saw no claims to the development or use of any new technologies. Capra relies on the vendor's promises to detect fake videos and keep them out of evidence, "both because they can be discovered using the advanced tools of his (Fredericks) trade and because the video's proponent would be unable to answer basic questions to authenticate it (who created the video, when, and with what technology)." Professor Capra's Memorandum to the Committee at pg. 26 (pg. 39 of 358).

Capra's memorandum to the Committee at first discusses why GenAI fraud detection is so difficult. He explains the cat-and-mouse competition between image generation software makers and fraud detection software companies. Oddly enough, his explanation seems correct to me, and so appears to impeach his later conclusion and the opinion of his expert, Fredericks. Here is the part of Capra's memorandum that I agree with:

Generally speaking, there is an arms race between deepfake technology and the technology that can be employed to detect deepfakes. . . . any time new software is developed to detect fakes, deepfake creators can use that to their advantage in their discriminator models. A New York Times report reviewed some of the currently available programs that try to detect deepfakes. The programs varied in their accuracy. None were accurate 100 percent of the time.

Memorandum to the Committee at pg. 17 (pg. 30 of 358).

Professor Capra supports his statement that "none were accurate 100 percent of the time" by citing a NYT article, Another Side of the A.I. Boom: Detecting What A.I. Makes (NYT, May 19, 2023). I read the article and it states that there are now more than a dozen companies offering tools to identify whether something was made with artificial intelligence, including Sensity AI, Optic, Reality Defender and FakeCatcher. The article repeats Professor Capra's arms race scenario, but adds that the detector software always lags behind. That is common in cybersecurity too, where the defender is always at a disadvantage. Here is a quote from the NYT article:

Detection tools inherently lag behind the generative technology they are trying to detect. By the time a defense system is able to recognize the work of a new chatbot or image generator, like Google Bard or Midjourney, developers are already coming up with a new iteration that can evade that defense. The situation has been described as an arms race or a virus-antivirus relationship where one begets the other, over and over.

That has always been my understanding too, which is why I cannot believe that new technology is around the corner to finally make detection foolproof, or that Grant Fredericks has a magic potion. I think it is more likely that the spy-versus-spy race will continue and uncertainty will be with us for a long time. Still, I sincerely hope that Professor Capra is right, and the fake image dangers are overstated. That's my hope, but reason and science tell me that's a risky assumption and we should mitigate our risks by making some modest revisions to the rules now. I would start with the two short proposals of Grimm and Grossman (as slightly revised in September 2024 and explained in Part Two).
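To make the arms-race point concrete: deepfake generators of the GAN family are trained directly against a detector. The generator's whole objective is to produce output the current discriminator cannot flag, so any published detection model can simply be plugged in as the new discriminator and trained against. Here is a minimal, illustrative PyTorch sketch of that feedback loop on toy one-dimensional "images"; all names, sizes and numbers are my own assumptions for illustration, not anything from Capra's memorandum or the detection vendors discussed above.

```python
# Minimal GAN-style arms race sketch (toy data, illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 16  # size of a toy "image"

# A detector (discriminator): outputs the probability that input is real.
detector = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
# A forger (generator): maps random noise to a fake "image".
forger = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, DIM))

d_opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
g_opt = torch.optim.Adam(forger.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_batch(n=64):
    # Stand-in for authentic media: samples from a fixed distribution.
    return torch.randn(n, DIM) * 0.5 + 1.0

for step in range(2000):
    # 1. Train the detector to separate real from fake.
    real, noise = real_batch(), torch.randn(64, 8)
    fake = forger(noise).detach()
    d_loss = bce(detector(real), torch.ones(64, 1)) + \
             bce(detector(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2. Train the forger *against the current detector*: it is rewarded
    #    whenever the detector labels its output as real. Publishing a
    #    better detector just hands the forger a better training signal.
    fooled = detector(forger(torch.randn(64, 8)))
    g_loss = bce(fooled, torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

print("detector's average 'real' score on fresh fakes:",
      detector(forger(torch.randn(256, 8))).mean().item())
```

Step 2 is the whole point: the forger's loss is defined by the detector's output, which is why a detector that is state of the art today becomes training fodder for tomorrow's fakes.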

Professor Capra’s Discussion of the Proposed Rule Amendments

There were four rule change proposals before the Committee in April 2024: one by Professor Andrea Roth of the University of California, Berkeley, School of Law, a second by Professor Rebecca Delfino of Loyola Law School and a third by Judge Paul Grimm (retired) and Professor Maura Grossman, already well known to most of my readers. I omit discussion here of a fourth proposal, by John LaMonaga, in the interests of time, but you can learn about it in Professor Capra's Memorandum to the Committee at pgs. 23-25 (pgs. 36-38 of 358). Also see John P. LaMonaga, A Break from Reality: Modernizing Authentication Standards for Digital Video Evidence in the Era of Deepfakes, 69 Am. U. L. Rev. 1945, 1984 (2020).

Professor Andrea Roth’s Rule Proposals

Professor Roth's suggestions are nerdy, interesting and forward thinking. They are found in Professor Capra's Memorandum to the Committee at pgs. 10-13 (pgs. 23-26 of 358), and Capra's critical comments on the proposals follow at pgs. 13-16 (pgs. 26-29 of 358). I urge interested readers to check out her proposals for themselves. Capra's comments seem a bit overly critical, and I look forward to hearing more from her in the future.

Here is Capra's comment on one of her proposals, to add a new, independent subdivision to Rule 702, Testimony by Expert Witnesses:

The proposal addresses what could be thought to be a gap in the rules. Expert witnesses must satisfy reliability requirements for their opinions, but it is a stretch, to say the least, to call machine learning output an “opinion of an expert witness.”

Memorandum to the Committee at pg. 11 (pg. 24 of 358).

Oh really? A stretch, to say the least. Obviously Capra is not familiar with my work, and that of many others in AI, on the use of generative AI personas as experts. See e.g., Panel of AI Experts for Lawyers; and Panel of Experts for Everyone About Anything. Also see: Du, Li, Torralba, Tenenbaum and Mordatch, Improving Factuality and Reasoning in Language Models through Multiagent Debate (5/23/23).

For me, Andrea Roth's proposals are not a stretch, to say the least, but common sense, based on my everyday use of generative AI.

Andrea Roth also suggests that Rule 806, Attacking and Supporting the Declarant's Credibility, be amended to allow opponents to "impeach" machine output in the same way they would impeach hearsay testimony from a human witness. Professor Capra of course criticizes that too, but this time more kindly, saying at page 13 of his memo:

The goal here is to treat machine learning — which is thinking like a human — the same way that a human declarant may be treated. Thought must be given to whether all the forms of impeachment are properly applicable to machine learning. . . . The question is whether an improper signal is given by applying 806 wholesale to machine related evidence, when in fact not all the forms of impeachment are workable as applied to machines. That said, assuming that some AI-related rule is necessary, it seems like a good idea, eventually, to have a rule addressing the permitted forms of impeachment of machine learning evidence.

I thought Andrea Roth's suggestion was a good one. I routinely cross-examine AIs on their outputs and opinions. It is an essential prompt engineering skill, used to make sure their opinions are reliable and to understand the sources of those opinions.

Due to concerns over the length of this article I must defer further discussion of Professor Andrea Roth’s work and proposals for another day.

Professor Rebecca Delfino’s Proposal to Remove Juries From Deepfake Authenticity Findings

Professor Rebecca Delfino of Loyola Law School is a member of the Committee's expert panel. She is very concerned about the dangers of the powerful emotional impact of audiovisuals on jurors, and about the costs involved in authenticity determinations. Her recent writings on these issues include: Deepfakes on Trial: A Call to Expand the Trial Judge's Gatekeeping Role To Protect Legal Proceedings from Technological Fakery, 74 Hastings L.J. 293 (2023); The Deepfake Defense—Exploring the Limits of the Law and Ethical Norms in Protecting Legal Proceedings from Lying Lawyers, 84 Ohio St. L.J. 1068 (2024); and Pay-To-Play: Access to Justice in the Era of AI and Deepfakes, 55 Seton Hall L. Rev. __ (forthcoming 2025) (Abstract: "The introduction of deepfake and AI evidence in legal proceedings will trigger a failure of the adversarial system because the law currently offers no effective solution to secure access to justice to pay for this evidence for those who lack resources."). Professor Delfino argues that the danger of deepfakes demands that the judge decide authenticity, not the jury.

Sub-Issue on Jury Determinations and the Psychological Impact of Deepfakes

I am inclined to agree with Professor Delfino. The important oral presentation of Paul Grimm and Maura Grossman to the Committee shows that they do too. We have a transcript of it in the Fordham Law Review: Daniel J. Capra, Deepfakes Reach the Advisory Committee on Evidence Rules, 92 Fordham L. Rev. 2491 (2024), at pgs. 2421-2437.

Paul and Maura make a formidable team of presenters, including several notable moments where Maura showed Capra and his Committee a few deepfakes she made. In the first she put Paul Grimm's head on Dan Capra's body, and vice versa, which caused Dan to quip, "I think you lose in that trade, Paul." Then she asked the panel to close their eyes and listen to what turned out to be a fake audio of President Biden directing the Treasury Department to make a payment of $10,000 to Daniel Capra. Id. at pgs. 2426-2427.

I thought this was a great ploy. Maura then told the Committee she made it in seconds using free software on the internet, and that with more work it would sound exactly like the President. Id. at 2426-2427. Professor Capra, who has been stung before by surprise audios, did not seem amused, and his ultimately negative recommendations show he was not persuaded.

Here are excerpts of the transcript of the next section of their presentation to the Committee.

PROF. GROSSMAN. Because there are two problems that these deepfakes and that generative AI cause. One is we’re moving into a world where none of us are going to be able to tell what is real from not real evidence—which of these videos are real, which of these aren’t. And I’m very worried about the cynicism and the attitude that people are going to have if they can’t trust a single thing anymore because I can’t use any of my senses to tell reality.

And the other is what they call the liar's dividend, is why not doubt everything, even if it's in fact real, because now I can say, "How do you know it's not a deepfake?", and we saw a lot of that in the January 6 cases. Some of the defendants said, "That wasn't me there" or "How do you know it was me?" Elon Musk used that defense already. So you're going to have both problems: one where it really is fake, and now every case going to require an expert; and the other where it really is real evidence, and you don't want to become so cynical that you don't believe any of it.

Deepfakes Reach the Advisory Committee on Evidence Rules, supra at pgs. 2427-2428.

To quote an NPR article on the "Liar's Dividend":

When we entered this age of deepfakes, anybody can deny reality. … That is the classic liar’s dividend.

The liar’s dividend is a term coined by law professors Bobby Chesney and Danielle Citron in a 2018 paper laying out the challenges deepfakes present to privacy, democracy, and national security. The idea is, as people become more aware of how easy it is to fake audio and video, bad actors can weaponize that skepticism. “Put simply: a skeptical public will be primed to doubt the authenticity of real audio and video evidence,” Chesney and Citron wrote.

Shannon Bond, People are trying to claim real videos are deepfakes. The courts are not amused (NPR, 5/8/23).

Back to the transcript of the presentation of Grossman and Grimm to the Committee. Judge Grimm went on to explain why, under the current rules, the jury may often have to make the final determination of authenticity. He emphasized that even if the jury decides the evidence is inauthentic, the jurors will still be tainted by the process, as they cannot unsee what they have seen. Instructions from a judge to disregard the video will be ineffective.

JUDGE GRIMM: Now there’s one monkey wrench in the machinery: When you’re dealing with authentication, you’re dealing with conditional relevance if there’s a challenge to whether or not the evidence is authentic. And so, if you’re going to have a factual situation where one side comes in and says, “This is the voice recording on my voicemail, this is the threatening message that was left on my voicemail, that’s Bill, I’ve known Bill for 10 years, I am familiar with Bill’s voice, that is plausible evidence from which a reasonable factfinder could find that it was Bill.”

If Bill comes in and says, “That was left at 12:02 PM last Saturday, at 12:02 PM I have five witnesses who will testify that I was at some other place doing something else where I couldn’t possibly have left that,” that is plausible evidence that it was not Bill.

And when that occurs, the judge doesn't make the final determination under Rule 104(a). The jury does. And that's a concern because the jury gets both versions now. It gets the plausible version that it is; it gets the plausible version that it's not. The jury has to resolve that factual dispute before they know whether they can listen to that voicemail and take it into consideration as Bill's voice in determining the outcome of the case.

PROF. GROSSMAN: Can I add just one thing? Two studies you should know about. One is jurors are 650 percent more likely to believe evidence if it's audiovisual, so if that comes in and they see it or hear it, they are way more likely to believe it. (Rebecca A. Delfino, Deepfakes on Trial: A Call to Expand the Trial Judge's Gatekeeping Role to Protect Legal Proceedings from Technological Fakery, 74 Hastings L.J. 293, 311 fn.101–02 (2023)).

And number two, there are studies that show that a group of you could play a card game. I could show you a video of the card game, and in my video it would be a deepfake, and I would have one of you cheating. Half of you would be willing to swear to an affidavit that you actually saw the cheating even though you didn't because that video—that audio/video, the deepfake stuff—is so powerful as evidence that it almost changes perception. (See Wade, Green & Nash, Can Fabricated Evidence Induce False Eyewitness Testimony?, 24 Applied Cognitive Psych. 899 (2010)).

CHAIR SCHILTZ: But why would judges be any more resistant to the power of this than jurors?

JUDGE GRIMM: Well, for the same reason that we believe that in a bench trial the judge is going to be able to distinguish between the admissible versus the non-admissible.

CHAIR SCHILTZ: I know, but it is often fictional, right? There are certain things that I really am no better at than a juror is, like telling a real picture from an unreal picture, or deciding which of these two witnesses to believe—between the witness who says, “That’s his voice,” and the witness who said, “It couldn’t have been me.” Why am I any better at that than a juror?

JUDGE GRIMM: You might be better than a juror because you, as the judicial officer, can have it set up so that you have a hearing beforehand, which is a hearing on admissibility that the jury is not going to hear; and you have the witnesses come in, and you hear them; or you have a certificate under Rule 902(13). Also, you will be a repeat player.

PROF. GROSSMAN: Right. And you would at least know the questions to ask: How was this algorithm trained? Was it tested? What was it tested on? Who did the testing? Were they arm’s length? What’s the error rate?

JUDGE GRIMM: And order the discovery that the other side can have, to be able to have the opportunity to challenge it by bringing that in.

CHAIR SCHILTZ: Yes, I get that part.

Deepfakes Reach the Advisory Committee on Evidence Rules, supra at pgs. 2427-2428.

The Chair, Hon. Patrick J. Schiltz, asks good questions here and understands the issue. Anyone should be far more comfortable having a judge, especially one like Judge Schiltz, make the hard calls instead of a room of randomly called jurors. There is no question in my mind that judges are far better qualified than jurors to make these determinations. All three experts were making that point: Paul Grimm, Maura Grossman and Rebecca Delfino.
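Professor Grossman's checklist in the transcript above (How was the algorithm trained? What was it tested on? Who did the testing? What's the error rate?) maps directly onto standard machine learning validation. Here is a minimal sketch, with entirely made-up data and a dummy detector standing in for any proffered tool, of how an error rate could be estimated and bounded on a held-out test set; nothing in it comes from the Committee record.

```python
# Estimating a detector's error rate on a held-out test set (toy sketch).
import math
import random

random.seed(42)

def detector(item):
    # Stand-in for any proffered deepfake detector; here a dummy that is
    # right 90% of the time, purely for illustration.
    return item["label"] if random.random() < 0.9 else 1 - item["label"]

# Held-out test set the detector never trained on: label 1 = fake, 0 = real.
test_set = [{"label": random.randint(0, 1)} for _ in range(500)]

errors = sum(detector(item) != item["label"] for item in test_set)
n = len(test_set)
p = errors / n  # observed error rate

# 95% confidence interval (normal approximation): the kind of uncertainty
# bound a court should ask for, rather than a bare accuracy claim.
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"error rate = {p:.3f} +/- {margin:.3f} (n={n})")
```

The point of Grossman's "arm's length" question is that this test set must come from someone other than the vendor, and must resemble the kind of media actually at issue; a detector validated only on its maker's own samples tells the court very little.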

Back to Professor Rebecca Delfino’s Proposal

Here is Professor Capra's explanation to the Committee of how Professor Delfino's proposed rule changes would work. Unfortunately, I have not found any argument from her on her proposal, just Capra's explanation, and he ultimately rejected it.

Professor Rebecca Delfino argues that the danger of deepfakes demands that the judge decide authenticity, not the jury. She contends that "[c]ountering juror skepticism and doubt over the authenticity of audiovisual images in the era of fake news and deepfakes calls for reallocating the fact finding authority to determine the authenticity of audiovisual evidence." She contends that jurors cannot be trusted to fairly analyze whether a video is a deepfake, because deepfakes appear to be genuine, and "seeing is believing." Professor Delfino suggests that Rule 901 should be amended to add a new subdivision (c), which would provide:

901(c). Notwithstanding subdivision (a), to satisfy the requirement of authenticating or identifying an item of audiovisual evidence, the proponent must produce evidence that the item is what the proponent claims it is in accordance with subdivision (b). The court must decide any question about whether the evidence is admissible.

She explains that the new Rule 901(c) “would relocate the authenticity of digital audiovisual evidence from Rule 104(b) to the category of relevancy in Rule 104(a)” and would “expand the gatekeeping function of the court by assigning the responsibility of deciding authenticity issues solely to the judge.”

The proposed rule would operate as follows: After the pretrial hearing to determine the authenticity of the evidence, if the court finds that the item is more likely than not authentic, the court admits the evidence. The court would instruct the jury that it must accept as authentic the evidence that the court has determined is genuine. The court would also instruct the jury not to doubt the authenticity simply because of the existence of deepfakes. This new rule would take the authenticity determination away from the jury entirely. Memorandum to the Committee at pgs. 22-23 (pgs. 35-36 of 358).

This proposal sounds feasible to me. It could help reduce the costs of expert battles and counter the Liar's Dividend and the CSI Effect. Professor Capra made a few helpful comments as to how Professor Delfino's language would benefit from a few minor changes. But those are moot points because he respectfully declined to endorse the proposal, noting that: "Given the presence of deepfakes in society, it may well be that jurors will do their own assessment, regardless of the instruction." He seems to miss the point of minimizing the psychological impact on jurors by keeping deepfake videos and audios out of the jury room.

Paul Grimm and Maura Grossman's Two Rule Proposals

Two rule change proposals were made in early 2024 by Paul Grimm and Maura Grossman. (They were revised slightly and resubmitted in September 2024, as explained in Part Two of this article.) Paul and Maura are both well known to my readers as progressive leaders in law and technology. They have been working on these evidentiary issues for years. See e.g., The GPTJudge: Justice in a Generative AI World, 23 Duke Law & Technology Review 1-34 (2023).

They were invited to present their proposals to the Committee to modify Rule 901(b)(9) for AI evidence and add a new Rule 901(c) for "Deepfake Evidence." The transcript of their presentation was referred to previously. Deepfakes Reach the Advisory Committee on Evidence Rules, 92 Fordham L. Rev. 2491 (2024), at pgs. 2421-2437. I recommend you read it in full.

Here are the two rule changes Paul and Maura proposed (language proposed for deletion is shown in brackets):

901(b) Examples. The following are examples only—not a complete list—of evidence that satisfies the requirement [of Rule 901(b)]:
(9) Evidence about a Process or System. For an item generated by a process or system:
(A) evidence describing it and showing that it produces [an accurate] a valid and reliable result; and
(B) if the proponent concedes that the item was generated by artificial intelligence, additional evidence that:
(i) describes the software or program that was used; and
(ii) shows that it produced valid and reliable results in this instance.

Proposed New Rule 901(c) to address “Deepfakes”:

901(c): Potentially Fabricated or Altered Electronic Evidence. If a party challenging the authenticity of computer-generated or other electronic evidence demonstrates to the court that it is more likely than not either fabricated, or altered in whole or in part, the evidence is admissible only if the proponent demonstrates that its probative value outweighs its prejudicial effect on the party challenging the evidence.

As you can see, their proposed new Rule 901(c) makes it clear that a judge may take the jury out of the "fake or real" determination in close questions, and in so doing take away most of the potential prejudicial impact upon jurors. The burden of possible unconscious prejudice and emotional impact from viewing inadmissible deepfake media would be borne solely by the judge. As discussed, the judge is better trained for that and will have the benefit of pretrial hearings and expert testimony. The jury retains its traditional power over all other determinations of justiciable facts. Note that this proposal does not go as far as Professor Delfino's in taking determinations away from the jury and expanding the gatekeeper role of the judge. More on 901(c) in general will follow, but first the proposed revisions to Rule 901(b)(9).

Accuracy v. Reliability and Validity

Professor Capra killed both of the Grimm and Grossman proposals after asking for input from only one expert on his panel, who happened to be the only one on the panel proposing a competing rule change, Professor Rebecca Wexler. You might expect her to oppose Grimm and Grossman's proposal, lobbying instead for her own rival proposals. To her credit, she did not. Instead, in Capra's own words, she "supported the proposals but suggested that they should be extended beyond AI." Memorandum to the Committee at pgs. 9-10 (22-23 of 358). As to the amendment to Rule 901(b)(9), Professor Wexler said:

Re: the first Grimm/Grossman proposal, it may well be that the standard for authenticating system/process evidence should require a showing that the system/process produces “valid” and “reliable” results, rather than merely accurate results. . . .

I can understand the push to add a reliability requirement to 901(b)(9). It’s true that ML systems could rely on an opaque logic that gives accurate results most of the time but then sometimes goes off the rails and creates some seemingly illogical output. But manually coded systems can do the same thing. They could be deliberately or mistakenly programmed to fail in unexpected conditions, or even once every hundred runs on the same input data. So if reliability is important, why not make it a broader requirement?

Memorandum to the Committee at pg. 9 (22 of 358).

Still, Capra seemed to give little weight to her input and stuck with his objection. He continued to insist that the use of the words "valid and reliable" instead of "accurate" in Rule 901(b)(9) is an unnecessary and confusing complication. It appears that he does not fully understand the nerdy, AI-based technical reasons behind this change. Notice that Capra once again relies on a vendor, Evidently AI, to try to support his attempt to get technical. Professor Capra says in his Memorandum to the Committee at page 7 (20 of 358):

The proposal (on Rule 901(b)(9)) distinguishes the terms “validity,” “reliability,” and “accuracy.” That is complicated and perhaps may be unnecessary for a rule of evidence. . . . As to “accuracy”, the proposal rejects the term, but in fact there is a good deal of material on machine learning that emphasizes “accuracy.” See, e.g., https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall  . . . The whole area is complicated enough without adding distinctions that may not make a difference.

Too complicated, really? Meaningless distinctions? Maura Grossman and Paul Grimm, who have extensive experience actually using these evidence rules in court, and who are both bona fide nerds (especially Maura), were not, to my knowledge, given an opportunity to respond to these criticisms. I have not talked to them about this, but I imagine they were not pleased.
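The distinction Capra waves away is real, and easy to demonstrate. "Accuracy" is just the fraction of correct answers, and on skewed data a useless classifier can score very high. A short, illustrative sketch with invented numbers (nothing here comes from the Committee record or the Evidently AI page Capra cites):

```python
# Why "accurate" is a weaker standard than "valid and reliable" (toy data).
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Imagine 1,000 proffered videos, of which only 50 are deepfakes.
# label 1 = fake, 0 = authentic.
y_true = [1] * 50 + [0] * 950

# A "detector" that simply calls everything authentic.
y_pred = [0] * 1000

print("accuracy: ", accuracy_score(y_true, y_pred))                 # 0.95
print("recall:   ", recall_score(y_true, y_pred, zero_division=0))  # 0.0 -- catches no fakes
print("precision:", precision_score(y_true, y_pred, zero_division=0))
```

A system like this is 95 percent "accurate" and utterly useless for the one question that matters. "Valid" (it measures what it claims to measure) and "reliable" (it does so consistently across conditions and over time) are the terms measurement science actually uses, which is presumably why Grimm and Grossman chose them.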

Obviously fake image of Judge Grimm as an unhappy robot, by Ralph Losey using Visual Muse.

To be continued … Part Two of this article will complete the analysis of the Grimm-Grossman rule proposals and look at what comes next with the Rules Committee meeting of November 8, 2024. It will also include a discussion of the second, slightly revised proposal by Paul Grimm and Maura Grossman that they just submitted, and of the new article by Judge Paul Grimm (retired), Professor Maura Grossman and six other experts: Deepfakes in Court: How Judges Can Proactively Manage Alleged AI-Generated Material in National Security Cases. They are all trying, once again, to push the Committee into action. Let us hope they succeed. Don't look up, but an election is coming.

Ralph Losey Copyright 2024 — All Rights Reserved


BREAKING NEWS: Eleventh Circuit Judge Admits to Using ChatGPT to Help Decide a Case and Urges Other Judges and Lawyers to Follow Suit


Ralph Losey. Published June 3, 2024.

The Eleventh Circuit published a groundbreaking concurring opinion on May 28, 2024, by Judge Kevin C. Newsom, on the use of generative AI to help decide contract interpretation issues. Snell v. United Specialty Ins. Co., 2024 U.S. App. LEXIS 12733; ___ F.4th ___ (11th Cir. May 28, 2024). The case centered on the interpretation of an insurance policy. Circuit Judge Kevin C. Newsom not only admits to using ChatGPT to help him make his decision, but praises its utility and urges other judges and lawyers to do so too. His analysis is impeccable and his writing is superb. That is bold judicial leadership. Good news. I love his opinion and bet that you will too.

The only way to do the Concurring Opinion justice is to quote all of it, all 6,485 words. I know that's a lot of words, but unlike ChatGPT, which is a good writer, Judge Newsom is a great writer. Judge Kevin C. Newsom, a Harvard law graduate from Birmingham, Alabama, is creative in his wise and careful use of AI. Judge Newsom added photos to his opinion and, as I have been doing recently in my articles, quoted in full the transcripts of the ChatGPT sessions he relied upon. He leads by doing, and his analysis is correct, especially his commentary on AI and human hallucinations.

Judge Newsom has an interesting, personal story to tell, and, unlike ChatGPT, he tells it in an amusing and self-effacing way. This is the first case of its kind and deserves careful study by lawyers and judges all over the world. Help me to get the word out by sharing his Concurring Opinion with your friends and colleagues. Your clients should see it too.

To spice it up a little, and because I can make my blogs as long as I want, which is unheard of these days, I add a few obvious editorial comments along the way (in red italics), including bolding. I do this to point out a few things, and add some deserved praise of this way cool opinion.

So settle in and prepare yourself for an interesting, clever read. I promise that it will be the best concurring opinion to an insurance contract case that you have ever read. Plus, since you are probably an AI enthusiast like me, you will want to cite and quote parts of this opinion for years to come. Hint: the parts that I embolden are probably going to be seen often in the coming days of AI jurisprudence. (If you see any errors, they are mine, not Judge Newsom's, as I rushed without assistance to get this out to you quickly.)


Newsom, Circuit Judge, concurring:

I concur in the Court's judgment and join its opinion in full. I write separately (and I'll confess this is a little unusual) simply to pull back the curtain on the process by which I thought through one of the issues in this case—and using my own experience here as backdrop, to make a modest proposal regarding courts' interpretations of the words and phrases used in legal instruments.

Here’s the proposal, which I suspect many will reflexively condemn as heresy, but which I promise to unpack if given the chance: Those, like me, who believe that “ordinary meaning” is the foundational rule for the evaluation of legal texts should consider—consider—whether and how AI-powered large language models like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude might—might—inform the interpretive analysis. There, having thought the unthinkable, I’ve said the unsayable.

Now let me explain myself.

I

First, a bit of background. [*26]  As today’s majority opinion recounts, both in the district court and before us, the parties litigated this as an “ordinary meaning” case. In particular, they waged war over whether James Snell’s installation of an in-ground trampoline, an accompanying retaining wall, and a decorative wooden “cap” fit within the common understanding of the term “landscaping” as used in the insurance policy that Snell had purchased from United Specialty Insurance Company.

So, for instance, the district court observed that “whether the claims are covered depends upon whether the performance of ‘landscaping’ would include Snell’s installation of the trampoline.” Doc. 23 at 10. Because the policy didn’t define the term “landscaping,” the court said, the coverage determination turned on whether Snell’s trampoline-related work fit the “common, everyday meaning of the word.” Id. at 10-11. Having reviewed multiple dictionary definitions provided by the parties, the court concluded that Snell’s work didn’t constitute “landscaping.” Id. at 13. As the majority opinion explains, the plain-meaning battle continued on appeal, with the parties “expend[ing] significant energy parsing the words of the policy, including [*27]  whether the site work necessary to install the trampoline was ‘landscaping.'” Maj. Op. at 17. Snell insisted, for example, that the district court had erred by “ignor[ing] the plain meaning of undefined terms” in the policy—most notably, “landscaping.” Br. of Appellant at 20, 21.

Now, as it turned out, we managed to resolve this case without having to delve too deeply into the definitional issue that the parties featured—due in large part to (1) a quirk of Alabama law that, according to the state supreme court, makes every insurance application ipso facto part of the policy that it precedes and (2) the fact that in his application Snell had expressly denied that his work included "any recreational or playground equipment construction or erection." Maj. Op. at 17-18 (quotation marks omitted). Combined, those two premises yield the majority opinion's controlling conclusion: "Snell's insurance application—which Alabama law requires us to consider part of the policy—expressly disclaims the work he did here" and thus defeats his claim. Id. at 18.

Importantly, though, that off-ramp wasn’t always obviously available to us—or at least as I saw things, to me. Accordingly, I spent hours [*28]  and hours (and hours) laboring over the question whether Snell’s trampoline-installation project qualified as “landscaping” as that term is ordinarily understood. And it was midway along that journey that I had the disconcerting thought that underlies this separate writing: Is it absurd to think that ChatGPT might be able to shed some light on what the term “landscaping” means? Initially, I answered my own question in the affirmative: Yes, Kevin, that is positively absurd. But the longer and more deeply I considered it, the less absurd it seemed.

But I’m getting ahead of myself. I should tell the full story, from beginning to end. In what follows, I’ll first explain how my initial efforts to pinpoint the ordinary meaning of the term “landscaping” left me feeling frustrated and stuck, and ultimately led me—initially half-jokingly, later more seriously—to wonder whether ChatGPT and other AI-powered large language models (“LLMs”) might provide a helping hand. Next, I’ll explore what I take to be some of the strengths and weaknesses of using LLMs to aid in ordinary-meaning interpretation. Finally, given the pros and cons as I see them, I’ll offer a few ideas about how we—judges, lawyers, [*29]  academics, and the broader AI community—might make LLMs more valuable to the interpretive enterprise.


II

First things first. I’m unabashedly a plain-language guy—firmly of the view that “[t]he ordinary meaning rule is the most fundamental semantic rule of interpretation” and that it should govern our reading not only of “constitutions, statutes, [and] rules,” but also, as relevant here, of “private instruments.” Antonin Scalia & Bryan A. Garner, Reading Law: The Interpretation of Legal Texts 69 (2012). Accordingly, I take it as gospel truth that absent a clear indication that some idiosyncratic, specialized meaning was intended, “[w]ords are to be understood in their ordinary, everyday meanings.” Id.; accord, e.g., Shiloh Christian Ctr. v. Aspen Specialty Ins. Co., 65 F.4th 623, 629-30 (11th Cir. 2023) (Newsom, J.) (evaluating an insurance policy’s “plain language”); Heyman v. Cooper, 31 F.4th 1315, 1319-20 (11th Cir. 2022) (Newsom, J.) (evaluating a municipal ordinance’s “ordinary meaning”); Barton v. United States AG, 904 F.3d 1294, 1298-99 (11th Cir. 2018) (Newsom, J.) (evaluating a federal statute’s “ordinary meaning”).

So, following the district court's lead, I did here what any self-respecting textualist would do when trying to assess the ordinary meaning of a particular word, here "landscaping": I went to the dictionaries. In his brief, Snell had served up a buffet of definitions, ranging [*30]  from Dictionary.com's—"to improve the appearance of (an area of land, a highway, etc.) as by planting trees, shrubs, or grass, or altering the contours of the ground"—to Wikipedia's—"any activity that modifies the visible features of an area of land." See Br. of Appellant at 22-23. My own research revealed, in addition, that Webster's defined "landscaping" as "to modify or ornament (a natural landscape) by altering the plant cover," Merriam-Webster's Collegiate Dictionary 699 (11th ed. 2014), and that Oxford defined it to mean "improv[ing] the aesthetic appearance of (an area) by changing its contours, adding ornamental features, or by planting trees and shrubs," Oxford Dictionary of English 991 (3d ed. 2010).

As occasionally happens, the dictionaries left a little something to be desired. From their definitions alone, it was tough to discern a single controlling criterion. Must an improvement be natural to count as “landscaping”? Maybe, but that would presumably exclude walkways and accent lights, both of which intuitively seemed (to me, anyway) to qualify. Perhaps “landscaping” work has to be done for aesthetic reasons? That, though, would rule out, for instance, a project [*31]  to regrade a yard, say away from a house’s foundation to prevent basement flooding. I once regraded my own yard, and while my efforts did nothing to improve my house’s curb appeal, they served an important functional purpose—and for what it’s worth, I definitely thought I was engaged in “landscaping.”

After languishing in definitional purgatory for a while, I decided to look at the case from a different perspective—and I do mean look. The record contains a series of photographs of Snell's trampoline-related project. Here's one, which shows his prep work—in particular, the empty sand pit and the below-ground retaining wall that reinforced its borders:

[Photograph from the opinion omitted.]

And another, which depicts the finished product, including both the polypropylene mat (the fun part) and the decorative wooden "cap":

[Photograph from the opinion omitted.]

I'm not particularly proud of it, but I'll confess that the photos affected the way I thought about the case. Nothing in them really struck me as particularly "landscaping"-y. The problem, of course, was that I couldn't articulate why. And visceral, gut-instinct decisionmaking has always given me the willies—I definitely didn't want to be that guy. So in a way, I felt like I was back to square one. (RL – Is Judge Newsom a great writer or what! Oddly enough, I understand exactly what he is saying here. Also note this kind of speech is way beyond ChatGPT's ability to predict and generate, no matter what its temperature setting.)

And that’s [*32]  when things got weird. Perhaps in a fit of frustration, and most definitely on what can only be described as a lark, I said to one of my clerks, “I wonder what ChatGPT thinks about all this.” So he ran a query: “What is the ordinary meaning of ‘landscaping’?” Here’s what ChatGPT said in response:

“Landscaping” refers to the process of altering the visible features of an area of land, typically a yard, garden or outdoor space, for aesthetic or practical purposes. This can include activities such as planting trees, shrubs, flowers, or grass, as well as installing paths, fences, water features, and other elements to enhance the appearance and functionality of the outdoor space.

Interesting, for two reasons. First, ChatGPT’s explanation seemed more sensible than I had thought it might—and definitely less nutty than I had feared. Second, it squared with my own impression—informed by my own experience writing, reading, speaking, and listening to American English in the real world—that ordinary people might well use the word “landscaping” (1) to include more than just botanical and other natural improvements and (2) to cover both aesthetic and functional objectives. In fact, several of the [*33]  examples that ChatGPT flagged—”paths, fences, [and] water features”—jibed with the sorts of things that had sprung to mind when I first started thinking about the case.

Suffice it to say, my interest was piqued. But I definitely didn’t want to fall into the trap of embracing ChatGPT’s definition just because it aligned with my priors. (Bad.) So, in what might have been a mistake—more on that later—we went ahead and asked it the ultimate question: “Is installing an in-ground trampoline ‘landscaping'”? ChatGPT responded as follows:

“Yes, installing an in-ground trampoline can be considered a part of landscaping. Landscaping involves altering the visible features of an outdoor area for aesthetic or practical purposes, and adding an in-ground trampoline would modify the appearance and function of the space. It’s a deliberate change to the outdoor environment, often aimed at enhancing the overall landscape and usability of the area.”

For good measure, I posed the same questions to Google's Bard (since replaced by Gemini). The precise details aren't particularly important, but the upshot is that both models' answers indicated that the trampoline-related work Snell had performed—the excavation [*34]  of the pit, the construction of the retaining wall, the installation of the mat, and the addition of the decorative wooden cap—just might be landscaping.

As I’ve said, intervening developments—in particular, the focus on a statement in Snell’s application, which Alabama law makes part of his policy and in which Snell disclaimed doing any recreation-related work—alleviated the need to settle on the ordinary meaning of the term “landscaping.” But my own process of working through the plain-language issue was a valuable one, if only because it got me thinking about what was to me a previously unimaginable possibility: Might LLMs be useful in the interpretation of legal texts? Having initially thought the idea positively ludicrous, I think I’m now a pretty firm “maybe.” At the very least, it seems to me, it’s an issue worth exploring.

So let’s explore.


III

In what follows, I’ll try to sketch out what I take to be some of the primary benefits and risks of using LLMs—to be clear, as one implement among several in the textualist toolkit—to inform ordinary-meaning analyses of legal instruments.


A

I’ll start with the pros as I see them, and then turn to the cons.

1. LLMs train on ordinary-language [*35]  inputs. (bold in original) Let me begin with what I take to be the best reason to think that LLMs might provide useful information to those engaged in the interpretive enterprise. Recall what is (for many of us, anyway) the “most fundamental semantic rule of interpretation”: Absent a clear indication that they bear some technical or specialized sense, the words and phrases used in written legal instruments “are to be understood in the ordinary, everyday meanings.” Scalia & Garner, Reading Law, supra, at 69. The premise underlying the ordinary-meaning rule is that “[i]n everyday life, the people to whom rules are addressed continually understand and apply them.” Id. at 71. Accordingly, the ordinary-meaning rule, as its name suggests, has always emphasized “common language,” Nix v. Hedden, 149 U.S. 304, 307, 13 S. Ct. 881, 37 L. Ed. 745, Treas. Dec. 14045 (1893), “common speech,” Sonn v. Magone, 159 U.S. 417, 421, 16 S. Ct. 67, 40 L. Ed. 203 (1895), and “common parlance,” Helix Energy Sols. Grp. v. Hewitt, 598 U.S. 39, 52, 143 S. Ct. 677, 214 L. Ed. 2d 409 (2023)—in short, as I’ve explained it elsewhere, “how people talk,” United States v. Caniff, 916 F.3d 929, 941 (11th Cir. 2019) (Newsom, J., concurring in part and dissenting in part), vacated and superseded, 955 F.3d 1183 (11th Cir. 2020).

The ordinary-meaning rule's foundation in the common speech of common people matters here because LLMs are quite literally "taught" using data that aim to reflect and capture how individuals use language in their everyday lives. Specifically, the models train on a mind-bogglingly enormous [*36]  amount of raw data taken from the internet—GPT-3.5 Turbo, for example, trained on between 400 and 500 billion words—and at least as I understand LLM design, those data run the gamut from the highest-minded to the lowest, from Hemingway novels and Ph.D. dissertations to gossip rags and comment threads. Because they cast their nets so widely, LLMs can provide useful statistical predictions about how, in the main, ordinary people ordinarily use words and phrases in ordinary life. So, for instance, and as relevant here, LLMs can be expected to offer meaningful insight into the ordinary meaning of the term "landscaping" because the internet data on which they train contain so many uses of that term, from so many different sources—e.g., professional webpages, DIY sites, news stories, advertisements, government records, blog posts, and general online chatter about the topic.

To be sure, LLMs' training data aren't a perfect [*37]  universe from which to draw hard-and-fast conclusions about ordinary meaning, principally because they don't capture what I'll call "pure offline" usages—i.e., those that neither (1) occur online in the first instance nor (2) originate offline, in hard copy, but are eventually digitized and uploaded to some online site. And indeed, the absence of offline usages from the training pool—and in particular, the implications for underrepresented populations—strikes me as a sufficiently serious concern that I've broken it out for separate discussion below. See infra at 21-23. Even so, those omissions aside, it seems to me scarcely debatable that the LLMs' training data are at the very least relevant to the ordinary-meaning analysis. In fact, an LLM's dataset may well be the most "perfectly imperfect" on offer because (1) scads of people either use the internet or create content that finds its way onto the internet (or more likely both), (2) the information available online reflects people's use of terminology in a wide array of contexts and settings, from the sublime to the ridiculous, and (3) there's little reason (that I can think of) to worry that writers and speakers whose communications [*38]  end up online manipulate the inputs (i.e., their words) in a way that might artificially skew the data.

Put simply, ordinary-meaning interpretation aims to capture how normal people use language in their everyday lives—and the bulk of the LLMs' training data seem to reflect exactly that.

2. LLMs can "understand" context. So far as I can tell, researchers powering the AI revolution have created, and are continuing to develop, increasingly sophisticated ways to convert language (and I'm not making this up) into math that computers can "understand." See Yonathan A. Arbel & David A. Hoffman, Generative Interpretation, 99 N.Y.U. L. Rev. (forthcoming 2024) (manuscript at 26) (describing "attention mechanism," a feature of LLMs that facilitates the recognition of how words are used in context). The combination of the massive datasets used for training and this cutting-edge "mathematization" of language enables LLMs to absorb and assess the use of terminology in context and empowers them to detect language patterns at a granular level. So, for instance, modern LLMs can easily discern the difference—and distinguish—between the flying-mammal "bat" that uses echolocation and may or may not be living in your attic, on the one hand, [*39]  and the wooden "bat" that Shohei Ohtani uses to hit dingers, on the other. See id. And that, as I understand it, is just the tip of the iceberg. LLM predictions about how we use words and phrases have gotten so sophisticated that they can (for better or worse) produce full-blown conversations, write essays and computer code, draft emails to co-workers, etc. And as anyone who has used them can attest, modern LLMs' results are often sensible—so sensible, in fact, that they can border on the creepy. Now let's be clear, LLMs aren't perfect—and again, we'll discuss their shortcomings in due course. But let's be equally clear about what they are: high-octane language-prediction machines capable of probabilistically mapping, among other things, how ordinary people use words and phrases in context. (RL – Excellent reasoning here by Judge Newsom, again I think he's got it right. Kudos to him and his clerks.)

3. LLMs are accessible. LLMs are readily accessible (and increasingly so) to judges, lawyers, and, perhaps most importantly, ordinary citizens. In recent years, the use of LLMs has proliferated, and as with all other internet-related tools, one can only assume that usage will continue to accelerate, likely at an exponential rate. The LLMs' easy accessibility is important in at least two respects. [*40]  First, it offers the promise of "democratizing" the interpretive enterprise, both (as already explained) by leveraging inputs from ordinary people and by being available for use by ordinary people. Second, it provides judges, lawyers, and litigants an inexpensive research tool. My "landscaping"-related queries, for instance, while no doubt imperfect, cost me nothing. To be sure, querying a more advanced LLM may come with a pricetag, at least for now. But so does, for example, searching the Oxford English Dictionary, the online version of which exists behind a paywall. And I'd be willing to bet that the costs associated with even the more advanced LLMs pale in comparison to subscriptions for Westlaw and Lexis, which power most modern legal research, including some involving dictionaries. And of course there's always the promise that open-source LLMs might soon approximate the for-profit models' productivity.

4. LLM research is relatively transparent. Using LLMs to facilitate ordinary-meaning interpretation may actually enhance the transparency and reliability of the interpretive enterprise itself, at least vis-à-vis current [*41]  practice. Two brief observations.

First, although we tend to take dictionaries for granted, as if delivered by a prophet, the precise details of their construction aren't always self-evident. Who exactly compiles them, and by what criteria do the compilers choose and order the definitions within any given entry? To be sure, we're not totally in the dark; the online version of Merriam-Webster's, for instance, provides a useful primer explaining "[h]ow . . . a word get[s] into" that dictionary. It describes a process by which human editors spend a couple of hours a day "reading a cross section of published material" and looking for new words, usages, and spellings, which they then mark for inclusion (along with surrounding context) in a "searchable text database" that totals "more than 70 million words drawn from a great variety of sources"—followed, as I understand things, by a step in which a "definer" consults the available evidence and exercises his or her judgment to "decide[] . . . the best course of action by reading through the citations and using the evidence in them to adjust entries or create new ones."

Such explainers [*42]  aside, Justice Scalia and Bryan Garner famously warned against “an uncritical approach to dictionaries.” Antonin Scalia & Bryan A. Garner, A Note on the Use of Dictionaries, 16 Green Bag 2d 419, 420 (2013). They highlighted as risks, for instance, that a volume could “have been hastily put together by two editors on short notice, and very much on the cheap,” and that without “consult[ing] the prefatory material” one might not be able “to understand the principles on which the dictionary [was] assembled” or the “ordering of [the] senses” of a particular term. Id. at 420, 423.

To be clear, I’m neither a nihilist nor a conspiracy theorist, but I do think that we textualists need to acknowledge (and guard against the fact) that dictionary definitions present a few known unknowns. See id. at 419-28; cf. Thomas R. Lee & Stephen C. Mouritsen, The Corpus and the Critics, 88 U. Chi. L. Rev. 275, 286-88 (2021) (highlighting potential interpretive pitfalls associated with dictionaries). And while I certainly appreciate that we also lack perfect knowledge about the training data used by cutting-edge LLMs, many of which are proprietary in nature, see supra notes 6 & 8, I think it’s fair to say that we do know both (1) what LLMs are learning from—namely, tons and tons of internet data—and (2) one of the things that makes LLMs so useful—namely, their ability [*43]  to accurately predict how normal people use language in their everyday lives.

A second transparency-related thought: When a judge confronts a case that requires a careful assessment of a word’s meaning, he’ll typically consult a range of dictionary definitions, engage in a “comparative weighing,” Scalia & Garner, A Note, supra, at 422, and, in his written opinion, deploy one, two, or a few of them. The cynic, of course, will insist that the judge just dictionary-shopped for the definitions that would enable him to reverse-engineer his preferred outcome. See James J. Brudney & Lawrence Baum, Oasis or Mirage: The Supreme Court’s Thirst for Dictionaries in the Rehnquist and Roberts Eras, 55 Wm. & Mary L. Rev. 483, 539 (2013). I’m not so jaded; I trust that ordinary-meaning-focused judges genuinely seek out definitions that best fit the context of the instruments that they’re charged with interpreting. See, e.g., Hoever v. Marks, 993 F.3d 1353, 1366-68 (11th Cir. 2021) (en banc) (Newsom, J., concurring in judgment in part and dissenting in part) (choosing, based on contextual clues, from among competing definitions of the word “for”). Even so, I have to admit (1) that the choice among dictionary definitions involves a measure of discretion and (2) that judges seldom “show their work”—that is, they rarely explain in [*44]  any detail the process by which they selected one definition over others. Contrast my M.O. in this case, which I would recommend as a best practice: full disclosure of both the queries put to the LLMs (imperfect as mine might have been) and the models’ answers.

Anyway, I don’t mean to paint either too grim a picture of our current, dictionary-centric practice—my own opinions are chock full of dictionary definitions, I hope to good effect—or too rosy a picture of the LLMs’ potentiality. My point is simply that I don’t think using LLMs entails any more opacity or involves any more discretion than is already inherent in interpretive practices that we currently take for granted—and in fact, that on both scores it might actually involve less.

5. LLMs hold advantages over other empirical interpretive methods. One final point before moving on. Recently, some empiricists have begun to critique the traditional dictionary-focused approach to plain-meaning interpretation. Some, for instance, have conducted wide-ranging surveys of ordinary citizens, seeking to demonstrate that dictionaries don’t always capture ordinary understandings of legal texts. See, e.g., Kevin P. Tobia, Testing Ordinary Meaning [*45] , 134 Harv. L. Rev. 726 (2020). Others have turned to corpus linguistics, which aims to gauge ordinary meaning by quantifying the patterns of words’ usages and occurrences in large bodies of language. See, e.g., Thomas R. Lee & Stephen C. Mouritsen, Judging Ordinary Meaning, 127 Yale L.J. 788, 795 (2018).

On balance, reliance on LLMs seems to me preferable to both. The survey method is interesting, but it seems wildly impractical—judges and lawyers have neither the time nor the resources to poll ordinary citizens on a widespread basis. By contrast, as already explained, LLMs are widely available and easily accessible. And corpus methods have been challenged on the ground, among others, that those tasked with compiling the data exercise too much discretion in selecting among the inputs. See, e.g., Jonathan H. Choi, Measuring Clarity in Legal Text, 91 U. Chi. L. Rev. 1, 26 (2024). For reasons already explained, I don’t think LLM-based methods necessarily carry the same risk.

For all these reasons, and perhaps others I haven’t identified, it seems to me that it’s at least worth considering whether and how we might leverage LLMs in the ordinary-meaning enterprise—again, not as the be all and end all, but rather as one aid to be used alongside dictionaries, the semantic canons, [*46]  etc.


B

Now, let’s examine a few potential drawbacks. I suppose it could turn out that one or more of them are deal-killers. I tend to doubt it, but let’s put them on the table.

1. LLMs can “hallucinate.” First, the elephant in the room: What about LLMs’ now-infamous “hallucinations”? Put simply, an LLM “hallucinates” when, in response to a user’s query, it generates facts that, well, just aren’t true—or at least not quite true. See, e.g., Arbel & Hoffman, supra, at 48-50. Remember the lawyer who got caught using ChatGPT to draft a brief when it ad-libbed case citations—which is to say cited precedents that didn’t exist? See, e.g., Benjamin Weiser, Here’s What Happens When Your Lawyer Uses ChatGPT, N.Y. Times (May 29, 2023). To me, this is among the most serious objections to using LLMs in the search for ordinary meaning. Even so, I don’t think it’s a conversation-stopper. For one thing, LLM technology is improving at breakneck speed, and there’s every reason to believe that hallucinations will become fewer and farther between. Moreover, hallucinations would seem to be most worrisome when asking a specific question that has a specific answer—less so, it seems to me, when more generally seeking the “ordinary meaning” [*47]  of some word or phrase. Finally, let’s shoot straight: Flesh-and-blood lawyers hallucinate too. Sometimes, their hallucinations are good-faith mistakes. But all too often, I’m afraid, they’re quite intentional—in their zeal, attorneys sometimes shade facts, finesse (and even omit altogether) adverse authorities, etc. So at worst, the “hallucination” problem counsels against blind-faith reliance on LLM outputs—in exactly the same way that no conscientious judge would blind-faith rely on a lawyer’s representations. (RL – I love this part about human lawyers also hallucinating. This corresponds with my own experience as I have written before. I plan to quote this often. Pretty soon ChatGPT will be able to predict it!)

2. LLMs don’t capture offline speech, and thus might not fully account for underrepresented populations’ usages. I flagged this one earlier, but I think it’s a serious enough concern to merit separate treatment. Here’s the objection, as I see it: People living in poorer communities (perhaps disproportionately minorities and those in rural areas) are less likely to have ready internet access and thus may be less likely to contribute to the sources from which LLMs draw in crafting their responses to queries. Accordingly, the argument goes, their understandings—as manifested, for instance, in their written speech—won’t get “counted” in the LLMs’ ordinary-meaning assessment.

As [*48]  I say, I think this is a serious issue. Even so, I don’t believe it fatally undermines LLMs’ utility, at least as one tool among many for evaluating ordinary meaning. Ideally, of course, the universe of information from which any source of meaning draws would capture every conceivable input. But we should guard against overreaction. Presumably, LLMs train not only on data that were born (so to speak) online but also on material that was created in the physical world and only thereafter digitized and uploaded to the internet. And there is (I think) less reason to fear that those in underserved communities are at a dramatic comparative disadvantage with respect to the latter category. Moreover, to the extent we’re worried about a lack of real-world, documentary evidence representing underrepresented populations’ usages, then we have bigger fish to fry, because there’s reason to doubt the utility of dictionaries, as well—which, as Merriam-Webster’s editors have explained, also rely on hard-copy sources to evaluate terms’ ordinary meanings. See supra at 16-17 & note 12. (RL – I share Judge Newsom’s commendable concern here about the bias of sorts built into the data, but, like him, I conclude that in this legal context at least it is not disqualifying.)

Anyway, the risk that certain communities’ word-usage outputs aren’t adequately reflected in LLMs’ training-data inputs [*49]  is real, and I’d note it as a candidate for improvement, but I don’t think it’s either fatal or insurmountable.14

3. Lawyers, judges, and would-be litigants might try to manipulate LLMs. I suppose there’s a risk that lawyers and judges might try to use LLMs strategically to reverse-engineer a preferred answer—say, by shopping around among the available models or manipulating queries. Maybe, but that’s an evergreen issue, isn’t it? Although they shouldn’t, lawyers and judges can cast about for advantageous dictionary definitions and exploit the interpretive canons, but no one thinks that’s a sufficient reason to abandon those as interpretive tools. And if anything, I tend to think that the LLMs are probably less vulnerable to manipulation than dictionaries and canons, at least when coupled with (as I’ve tried to provide here) full disclosure of one’s research process. (RL – Very clever observation. It is also important to acknowledge that Judge Newsom is being fully transparent in his disclosure of his use of AI. In fact, this is one of the most transparent and personally revealing opinions I have ever read.)

Relatedly, might prospective litigants seek to corrupt the inputs—the data on which the LLMs train and base their responses to user queries—in an effort to rig the system to spit out their preferred interpretations? It’s a real concern—perhaps especially considering that the same AI companies that have developed and [*50]  are training the LLMs might themselves be litigants. But given the nature of the technology as I understand it, hardly insurmountable. For one thing, most models embody some training “cutoff”—for instance, though things might have changed, it was once common knowledge that GPT-4 learned on data up to and including September 2021. See OpenAI, GPT-4 Technical Report 10 (arXiv:2303.08774, 2024). Accordingly, it would likely be difficult, if not impossible, to pollute the inputs retroactively. More fundamentally, it seems almost inconceivable that a would-be malefactor could surreptitiously flood any given dataset with enough new inputs to move the needle—remember, just by way of example, that GPT-3.5 Turbo trained on more than 400 billion words. Finally, while I tend to doubt that any AI company would conclude that corrupting its own product in order to obtain an interpretive advantage in a single case was in its long-term business interest, that risk, it seems to me, could be mitigated, if not eliminated, by querying multiple models rather than just one. (RL – I agree with this observation and the Judge’s conclusions.)

4. Reliance on LLMs will lead us into dystopia. Would the consideration of LLM outputs in interpreting legal texts inevitably put us [*51]  on some dystopian path toward “robo judges” algorithmically resolving human disputes? I don’t think so. As Chief Justice Roberts recently observed, the law will always require “gray area[]” decisionmaking that entails the “application of human judgment.” Chief Justice John G. Roberts, Jr., 2023 Year-End Report on the Federal Judiciary 6 (Dec. 31, 2023). And I hope it’s clear by this point that I am not—not, not, not—suggesting that any judge should ever query an LLM concerning the ordinary meaning of some word (say, “landscaping”) and then mechanistically apply it to her facts and render judgment. My only proposal—and, again, I think it’s a pretty modest one—is that we consider whether LLMs might provide additional datapoints to be used alongside dictionaries, canons, and syntactical context in the assessment of terms’ ordinary meaning. That’s all; that’s it. (RL – My only criticism of Judge Newsom’s Concurring Opinion is that he does not go further. I for one think judges should go much further in their use of generative AI, as I have written about previously. Appellate judges may be among the first to be routinely enhanced with AI. See, e.g., Circuits in Session: How AI Challenges Traditional Appellate Dynamics (e-Discovery Team, 10/13/23); Circuits in Session: Addendum and Elaboration of the Appellate Court Judge Experiment (e-Discovery Team, 10/26/23); Circuits in Session: Analysis of the Quality of ChatGPT4 as an Appellate Court Judge (e-Discovery Team, 11/01/23). But I understand why Judge Newsom does not do that here. One step at a time, and this opinion is an important first step.)

IV

Which brings me to my final question: If I’m not all wet, and it’s at least worth considering whether LLMs have a role to play in the interpretation of legal instruments, how might we maximize their utility? I’ve already flagged a few suggestions for improvement along the way—more data, from more sources, representing a more [*52]  representative cross-section of Americans. But beyond the obvious, what else? (RL – No, Judge Newsom, you are not all wet. You are squeaky clean, on point and over-modest. If only we had more judges like you. Certainly LLMs have a very important role to play in the interpretation of legal instruments.)

First, I think it’ll be helpful to clarify the objective. Remember that in my clumsy first crack at this, I asked two different models two different questions: (1) “What is the ordinary meaning of ‘landscaping’?”; and (2) “Is an in-ground trampoline ‘landscaping’?” Which is the proper question? In retrospect, if my contention is—as it is—that LLMs might aid in the search for the ordinary, everyday meaning of common words and phrases, then it seems pretty clear to me that my first, more general query is the more appropriate one. The models’ highest and best use is (like a dictionary) helping to discern how normal people use and understand language, not in applying a particular meaning to a particular set of facts to suggest an answer to a particular question.

Second, and relatedly, how can we best query LLMs? Those in the know refer to the question a user asks a model as a “prompt.” I’ll confess that I gave relatively little thought to my own prompts—they were just the questions that immediately sprang to mind. But research indicates that the models can be sensitive to prompts and that the results can vary accordingly. [*53]  See, e.g., Arbel & Hoffman, supra, at 36. So it may be wise for users to try different prompts, and, importantly, to report the prompts they use and the range of results they obtain. Id. at 36-37. Better still to do all that and query multiple models to ensure that the results are consistent—or, in statistics-speak, “robust.” (RL – How can we best query LLMs? That is indeed the key question of the day and the whole idea behind Prompt Engineering, a subject that has been the focus of my studies and experiments for some time now. I promise you, Judge Newsom, that many have been working hard on this challenge and should have a solution for this soon.)
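
(RL – To make the Judge’s point about prompt sensitivity concrete, here is a minimal Python sketch of the kind of disclosure-friendly, multi-prompt, multi-model routine he describes. It is only an illustration under stated assumptions: query_llm() is a hypothetical placeholder, not any vendor’s real API, and the prompts are invented examples.)

# A minimal sketch of prompt-robustness testing with full disclosure.
# Assumption: query_llm() is a hypothetical stand-in for a real API call.

PROMPTS = [
    "What is the ordinary meaning of 'landscaping'?",
    "How do ordinary speakers of American English use the word 'landscaping'?",
    "Define 'landscaping' as a typical homeowner would understand the word.",
]

MODELS = ["model-a", "model-b"]  # ideally models from different vendors

def query_llm(model: str, prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your provider.
    return f"[{model}'s answer to: {prompt}]"

def build_disclosure_log() -> list:
    # Ask every prompt of every model and keep the full record, so the
    # whole research process can be disclosed along with the results.
    log = []
    for model in MODELS:
        for prompt in PROMPTS:
            log.append({"model": model, "prompt": prompt,
                        "answer": query_llm(model, prompt)})
    return log

for entry in build_disclosure_log():
    print(entry["model"], "|", entry["prompt"], "->", entry["answer"])

Reporting the whole log, rather than a single cherry-picked answer, is what makes the exercise transparent rather than result-driven.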

Third, we need to clarify the particular output we’re after. The questions I asked sought a discrete, one-time answer. In particular, I asked for a single definition of “landscaping” and, separately, whether installation of an in-ground trampoline qualified. One potential challenge is that this approach obscures the fact, already explained, that LLMs make probabilistic, predictive judgments about language. With that in mind, some who have considered how LLMs might be used to interpret contracts have suggested that users seek not just answers but also “confidence” levels. See id. at 23. So, for instance, an LLM might reveal that its prediction about a provision’s meaning is “high” or, by contrast, only “ambiguous.” Alternatively, but to the same end, a researcher might ask an LLM the same question multiple times and note the percentage of instances in which it agrees that, say, installation of an in-ground [*54]  trampoline is landscaping. See Christoph Engel & Richard H. McAdams, Asking GPT for the Ordinary Meaning of Statutory Terms 15 (Max Planck Inst. Discussion Paper 2024/5).15 (RL – Yes, asking the same question multiple times is one of many ways to improve the reliability of AI answers. Again, that is a question of prompt engineering.)
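
(RL – A toy sketch of the repeated-asking idea just described. The mock_llm_yes_no() function below is invented for illustration; a real version would call an LLM at a temperature above zero, so that repeated runs can differ, and then report the agreement rate as a rough confidence proxy.)

import random

def mock_llm_yes_no(prompt: str) -> str:
    # Invented mock standing in for a real, non-deterministic LLM call.
    return random.choice(["yes", "yes", "yes", "no"])

def agreement_rate(prompt: str, n: int = 50) -> float:
    # Ask the same question n times; return the fraction answering "yes".
    yes = sum(1 for _ in range(n) if mock_llm_yes_no(prompt) == "yes")
    return yes / n

question = "Is installing an in-ground trampoline 'landscaping'? Answer yes or no."
print(f"Model agreed in {agreement_rate(question):.0%} of runs")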

Fourth and finally, there are temporal considerations to mull. The ordinary-meaning rule has an important corollary—namely, that “[w]ords must be given the meaning they had when the text was adopted.” Scalia & Garner, Reading Law, supra, at 78 (emphasis added). That principle—“originalism,” if you will—most obviously applies to constitutional and statutory texts. See, e.g., United States v. Pate, 84 F.4th 1196, 1201 (11th Cir. 2023) (en banc) (“[W]hen called on to resolve a dispute over a statute’s meaning, [a court] normally seeks to afford the law’s terms their ordinary meaning at the time Congress adopted them.” (quoting Niz-Chavez v. Garland, 593 U.S. 155, 160, 141 S. Ct. 1474, 209 L. Ed. 2d 433 (2021))). To be clear, though, it holds for private instruments like contracts and insurance policies, as well. See Scalia & Garner, Reading Law, supra, at 78. This case didn’t present a particularly knotty temporal difficulty because the policy in question was issued relatively recently, in 2017. But cases routinely arise in which the relevant question is what a particular term meant in the more distant past—say, in 1787, 1868, or 1964. If LLMs are to be deployed to aid more broadly in the search for ordinary meaning, it would be enormously [*55]  helpful—if it’s not already possible—for AI engineers to devise a way in which queries could be limited to particular timeframes. Cf. Engel & McAdams, supra, at 34-37.16 (RL – I agree this should be quite feasible, and I understand the constitutional construction hints and implications here.)
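
(RL – One way engineers might honor this temporal corollary is to filter the underlying evidence by date before measuring usage. The dated mini-corpus below is invented solely to show the mechanics; a real system would need reliably dated training or reference data.)

# Invented, dated mini-corpus. Illustration only, not real data.
corpus = [
    {"year": 1964, "text": "the landscaping around the courthouse lawn"},
    {"year": 2005, "text": "new landscaping was installed by the patio"},
    {"year": 2019, "text": "landscaping now includes hardscape features"},
]

def usage_in_window(term: str, start: int, end: int) -> int:
    # Count occurrences of term in documents dated within [start, end].
    return sum(doc["text"].count(term)
               for doc in corpus
               if start <= doc["year"] <= end)

print(usage_in_window("landscaping", 2000, 2020))  # prints 2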

V

In his most recent year-end report on the state of the federal judiciary, Chief Justice Roberts cautioned that the “use of AI requires caution and humility.” Roberts, supra, at 5. I wholeheartedly agree. Importantly, though, I also agree with what I take to be the report’s assumption that AI is here to stay. Now, it seems to me, is the time to figure out how to use it profitably and responsibly. It’s in that spirit that I’ve offered these preliminary thoughts about whether and how LLMs might aid lawyers and judges in the interpretive enterprise. Plenty of questions remain, and I’m sure I haven’t even identified all of them. But—and this is my bottom line—I think that LLMs have promise. At the very least, it no longer strikes me as ridiculous to think that an LLM like ChatGPT might have something useful to say about the common, everyday meaning of the words and phrases used in legal texts.

Just my two cents.


Appendix

Below are the questions that I asked OpenAI’s ChatGPT and Google’s Bard models, along with their responses.


ChatGPT

Question: What is the ordinary meaning of “landscaping” [*56] ?

Response: “Landscaping” refers to the process of altering the visible features of an area of land, typically a yard, garden or outdoor space, for aesthetic or practical purposes. This can include activities such as planting trees, shrubs, flowers, or grass, as well as installing paths, fences, water features, and other elements to enhance the appearance and functionality of the outdoor space.

Question: Is installing an in-ground trampoline “landscaping”?

Response: Yes, installing an in-ground trampoline can be considered a part of landscaping. Landscaping involves altering the visible features of an outdoor area for aesthetic or practical purposes, and adding an in-ground trampoline would modify the appearance and function of the space. It’s a deliberate change to the outdoor environment, often aimed at enhancing the overall landscape and usability of the area.


Bard

Question: What is the ordinary meaning of “landscaping”?

Response: The ordinary meaning of “landscaping” is the process of modifying the visible features of an area of land, including:

• Adding or removing vegetation, such as trees, shrubs, and flowers

• Installing hardscape elements, such as walkways, patios, and retaining walls [*57] 

• Creating or enhancing natural features, such as ponds, waterfalls, and rock gardens

Landscaping can be used to improve the aesthetics of an area, increase its functionality, and/or protect it from erosion. It can also be used to create a sense of privacy, define spaces, and add value to a property.

Here are some examples of landscaping projects:

• Planting a new lawn or garden

• Adding a patio or walkway

• Building a retaining wall

• Installing a sprinkler system

• Creating a pond or waterfall

• Planting trees and shrubs

• Adding mulch or gravel

• Edging a flower bed

• Trimming hedges

Landscaping can be a DIY project, or it can be hired out to a professional landscaping company. The cost of landscaping will vary depending on the size and scope of the project.

Question: Is installing an in-ground trampoline “landscaping”?

Response: Whether or not installing an in-ground trampoline is considered landscaping depends on how you define the term. If you define landscaping broadly to include any modification to the visible features of an area of land, then installing an in-ground trampoline would be considered landscaping. However, if you define landscaping more narrowly to focus on the planting and maintenance [*58]  of vegetation, then installing an in-ground trampoline would not be considered landscaping.

Here is a table that summarizes the different ways to define landscaping and whether or not installing an in-ground trampoline would be considered landscaping under each definition:

• Broad definition (any modification to the visible features of an area of land): Yes, installing an in-ground trampoline would be considered landscaping.

• Narrower definition (planting and maintenance of vegetation): No, it would not.

Ultimately, whether or not installing an in-ground trampoline is considered landscaping is a matter of opinion. There is no right or wrong answer.


Footnotes


1 Even for me.

2 For what it’s worth, I don’t think the governing Alabama statute says that. In relevant part, it provides that “[e]very insurance contract shall be construed according to the entirety of its terms and conditions as set forth in the policy and as . . . modified by any rider, endorsement or application which is part of the policy.” Ala. Code § 27-14-17(a). The absence of a comma before the clause “which is part of the policy” strongly indicates, to my mind, that an insured’s “application” should be deemed to “modif[y]” the contract only if it is made “part of the policy.” Be that as it may, as the majority opinion correctly notes, the Alabama Supreme Court has read § 27-14-17(a), in effect, to incorporate every insured’s application into the terms of his policy, and we’re stuck with that interpretation, even if erroneous. See Maj. Op. at 18 (citing Atlanta Cas. Co. v. Russell, 798 So. 2d 664, 667 (Ala. 2001)).

3 Alabama law governs the interpretation of the insurance contract at issue in this case, see St. Paul Fire & Marine Ins. Co. v. ERA Oxford Realty Co. Greystone, LLC, 572 F.3d 893, 894 n.1 (11th Cir. 2009), and privileges “ordinary meaning” in that endeavor, see Safeway Ins. Co. of Alabama v. Herrera, 912 So. 2d 1140, 1144 (Ala. 2005).

4 Generally, Bard’s response to my general question—“What is the ordinary meaning of ‘landscaping’?”—was pretty similar to ChatGPT’s, though notably longer. When asked the more specific question—“Is installing an in-ground trampoline ‘landscaping’?”—Bard was more equivocal than ChatGPT had been. I’ve included my questions and the models’ responses in an appendix for readers’ reference.

5 See Christoph Engel & Richard H. McAdams, Asking GPT for the Ordinary Meaning of Statutory Terms 10-11 (Max Planck Inst. Discussion Paper 2024/5).

6 I’ll confess to a bit of uncertainty about exactly what data LLMs use for training. This seems like an area ripe for a transparency boost, especially as LLMs become increasingly relevant to legal work. But here’s what I think I’ve gathered from some sleuthing. A significant chunk of the raw material used to train many LLMs—i.e., the “stuff” from which the models learn—comes from something called the Common Crawl, which is, in essence, a massive data dump from the internet. See, e.g., Yiheng Liu, et al., Understanding LLMs: A Comprehensive Overview from Training to Inference 6-8 (arXiv:2401.02038, 2024). The Common Crawl isn’t “the entire web”; rather, it’s a collection of samples from online sites, which AI companies further refine for training purposes. See Stefan Baack, Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI 5, 16-24, Mozilla Insights (Feb. 2024). That said, the samples are massive. (RL – Yes, that is a large part, but not all of it, and most agree with Judge Newsom that greater transparency is required from OpenAI and other vendors on this issue.)

7 To be clear, I do mean “predictions.” As I understand things, the LLM that underlies a user interface like ChatGPT creates, in effect, a complex statistical “map” of how people use language—that, as machine-learning folks would say, is the model’s “objective function.” How does it do it? Well, to dumb it way down, drawing on its seemingly bottomless reservoir of linguistic data, the model learns what words are most likely to appear where, and which ones are most likely to precede or follow others—and by doing so, it can make probabilistic, predictive judgments about ordinary meaning and usage. See Yonathan A. Arbel & David A. Hoffman, Generative Interpretation, 99 N.Y.U. L. Rev. (forthcoming 2024) (manuscript at 24-29); Engel & McAdams, supra, at 10-11. (RL – Sounds correct to me.)
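
(RL – For readers who want to see the statistical idea at its simplest, here is a toy bigram model in Python. It is a radical simplification of what a transformer-based LLM actually does, but it shows how next-word probabilities can be estimated from observed usage. The tiny corpus is invented for illustration.)

from collections import Counter, defaultdict

# Tiny invented corpus, just enough to estimate bigram probabilities.
words = ("the gardener planted trees and the gardener planted shrubs "
         "and the homeowner planted grass").split()

counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1  # tally which word follows which

def p_next(prev: str, nxt: str) -> float:
    # Probability that nxt follows prev, per the observed corpus.
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(p_next("planted", "trees"))  # 1 of 3 occurrences -> 0.333...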

8 So far as I understand things, it’s next to impossible to pinpoint exactly what training data an LLM draws on when answering a particular question, but from what I’ve seen, I think it’s fair to say that it’s a pretty wide cross-section.

9 I’ll bracket for the time being whether LLMs might be useful (or less so) in the fraction of cases in which we’re focused on technical or specialized meaning, rather than ordinary meaning. See Scalia & Garner, Reading Law, supra, at 73.

10 See Purchase, Oxford English Dictionary, https://www.oed.com/purchase (last visited May 23, 2024).

11 Westlaw, for instance, allows paid subscribers to access the latest edition of Black’s Law Dictionary. Lexis permits its users to access similar offerings, including Ballentine’s Law Dictionary.

12 Help: How does a word get into a Merriam-Webster dictionary?, Merriam-Webster (last visited May 23, 2024), https://www.merriam-webster.com/help/faq-words-into-dictionary [https://perma.cc/446C-WYMN].

13 Id.

14 A quasi-related issue: Some words have acquired “regionalized” meanings over time. So, for instance, the noun “toboggan” can refer to either (1) a “long flat-bottomed light sled,” (2) a “downward course or sharp decline,” or (3) a “stocking cap.” Merriam-Webster’s Collegiate Dictionary, supra, at 1313. Notably, though, the third sense is “chiefly Southern [and] Midland.” Id. When we asked ChatGPT, “What is the ordinary meaning of ‘toboggan’?”, it responded with only the first, sled-based explanation. The lesson is simply that interpreters using LLMs for assistance would be wise to remember, as always, that “context is king,” Wachovia Bank, N.A. v. United States, 455 F.3d 1261, 1267 (11th Cir. 2006), and, accordingly, that they might need to adjust their queries to account for its influence.

15 Some might worry that seeking a range of responses could cause the LLM to respond with uncommon usages. Of course, if the rogue results are rare, then, almost by definition, they won’t move the “ordinary meaning” needle. And if, by contrast, they’re not rare—and thus aren’t rogues at all—then perhaps they indicate that we need to rethink our intuitions about what the “ordinary meaning” really is. Fine, and good.

16 Relatedly, might we have a “start date” problem? Are we limited to ordinary understandings that post-date the launch of the internet? Or might it be that the information contained on the internet is so extensive that it can aid in understanding historical usages, as well?

Ralph Losey Copyright 2024 (excluding the court opinion)


AI Copyright and the Litigious Life of Harmenszoon van Rijn Rembrandt: as explained by a talking portrait of a robot

March 28, 2024

Ralph Losey. Published March 28, 2024.

Video, AI image in style of Rembrandt, research and words by Ralph Losey, an admirer of Rembrandt who is sympathetic to his litigious life.

Here is the transcript of the five minute talk by the robot portrait. (⏱ = 0.5 second pause in speech)

Hi,

I am a robot image created by Ralph Losey, roughly in the style of Rembrandt, one of his favorite artists. I think I also look like the work of another Dutch Master, Vermeer. My headphone is kind of like a big pearl earring?

Ralph used a variety of digital tools to make me, primarily an AI tool called Midjourney, but several others too. Ralph says they are like paint brushes and, like a typical lawyer, claims copyright. It remains to be seen whether courts will agree with that position.

Ralph has also created an AI tool of his own, a GPT designed to interface with the Dall-E software of OpenAI. He calls his software Visual Muse. And even claims copyright to that too!

I wonder what Rembrandt would say about all of this? Unfortunately, he knew lawyers and litigation all too well.  

Rembrandt Harmenszoon van Rijn lived from 1606 to 1669. He was a multimodal master of all of the visual media of his day: painting, printmaking and drawing. He was also well known for a variety of themes and styles, including his many selfies.

Rembrandt enjoyed early success in painting and in marriage to Saskia, the daughter of a successful Dutch lawyer. He and Saskia lived extravagantly, at first, and he overspent on a big house and many purchases of art. Tragically, their first three children died shortly after birth. The fourth child survived, but Saskia died within a year from tuberculosis. Rembrandt spent the rest of his life with fame and beautiful women, but no fortune. He was broke; worse than that, he was hounded by creditors and their lawyers.

Rembrandt became embroiled in a never-ending series of lawsuits a few years after his wife died. It all started with his seduction of Geertje Dircx, the young woman employed in his mansion as a wet nurse for the child. I can easily imagine how that affair came about. Ironically, a few years later, Geertje became pregnant and sued Rembrandt for breach of promise of marriage, seeking alimony. She had good lawyers. He paid and agreed to alimony. Geertje later ended up in a special women’s prison anyway, which cost Rembrandt still more money.

Then Rembrandt began a relationship with his 23-year-old maid, Hendrickje Stoffels. His young mistress, Hendrickje, was recognized as the nude in Rembrandt’s painting, Bathsheba at Her Bath. Based on that, the Reformed Church charged her with, quote, committing the acts of a whore with Rembrandt the painter. She admitted her guilt and was banned from receiving communion. Nothing happened to Rembrandt.

Still, it was all downhill from there for Rembrandt, financially at least. He had another child with Hendrickje. More expenses, but he never married her. Ultimately Rembrandt filed for a type of voluntary bankruptcy, called a cessio bonorum, to avoid incarceration. Yes, they would jail debtors then for failure to pay, even famous artists like Rembrandt. The bankruptcy just delayed things. When he died in 1669, he had outlived his major creditors, but was still buried in a rented grave. Rented grave? Who knew such a thing even existed?

As a result of his philandering and extravagant living, Rembrandt became all too familiar with lawyers, litigation, and the protection and secretion of assets. ⏱⏱ His difficult financial and family situation is one cause of his prodigious output of art. He had to keep working to pay his creditors, and his lawyers! By some accounts he created 600 paintings, 400 etchings and 2,000 drawings.

No one would mistake me for a Rembrandt or Vermeer. But I wonder, am I even an original work? Can I be protected? Or can anyone steal me and do with me what they will? ⏱⏱ I certainly hope not. I would rather litigate than live like that! ⏱⏱ Wouldn’t you? ⏱⏱ 

Ralph Losey Copyright 2024 – All Rights Reserved