Epiphanies or Illusions? Testing AI’s Ability to Find Real Knowledge Patterns – Part One

Ralph Losey, August 4, 2025.

Humans are inherently pattern-seeking creatures. Our ancestors depended upon recognizing recurring patterns in nature to survive and thrive, such as the changing of seasons, the migration of animals and the cycles of plant growth. This evolutionary advantage allowed early humans to anticipate danger, secure food sources, and adapt to ever-changing environments. Today, the recognition and interpretation of patterns remains a cornerstone of human intelligence, influencing how we learn, reason, and make decisions.

Pattern recognition is also at the core of artificial intelligence. In this article, I will test the ability of advanced AI, specifically ChatGPT, to uncover meaningful new patterns across different fields of knowledge. The goal is ambitious: to discover genuine epiphanies—true moments of insight that expand human understanding and open new doors of knowledge—while avoiding the pitfalls of apophenia, the human tendency to perceive illusions or false connections. This experiment probes an age-old tension: can AI reliably distinguish between genuine breakthroughs and compelling yet misleading illusions?

Video by Ralph Losey using SORA AI.

We will begin by exploring the risks of apophenia, understanding how this psychological tendency can mislead human and possibly AI perception. Throughout, videos created by AI will help illustrate key points and vividly communicate these ideas. There are twelve new videos in Part One and another fourteen in Part Two.

Are the patterns real? Video by Ralph Losey using SORA AI.

Apophenia: Avoiding the Pitfalls of False Patterns

We humans are masters of pattern detection, but we do have hindrances to this ability. Primary among them is our limited information and knowledge, but there is also our tendency to see patterns that are not there. We tend to assume the rustling we hear in the bushes is a tiger ready to pounce when really it is just the breeze. Evolution tends to favor this false-alarm bias. So, although we can and frequently do miss real patterns, failing to recognize the underlying connections between things, we often make them up too.

Here it is hoped that AI will boost our abilities on both fronts. It will help us to uncover true new patterns, genuine epiphanies, moments where profound insights emerge clearly from the complexity of data. At the same time, AI may expose illusions, false connections we mistakenly believe are real due to our natural cognitive biases. Even though we have made great progress over the millennia in understanding the Universe, we still have a long way to go to see all of the patterns, to fully understand the Universe, and to free ourselves of superstitions and delusions. We are especially weak at seeing patterns intertwined with different fields of knowledge.

Apophenia is a mental condition in which people think they see patterns that are not there and sometimes even hallucinate them. Most of the time when people see patterns, for instance, faces in the clouds, they know they cannot be real and there is no problem. But sometimes when people see other images, for instance, rocks on Mars that look like a face, or even images on toast, they delude themselves into believing all sorts of nonsense. For instance, the 10-year-old grilled cheese sandwich below, which supposedly bears the image of the Virgin Mary, sold on eBay to an online casino in 2004 for $28,000.

In a similar vein, some people suffering from apophenia are prone to posit meaning – causality – in unrelated random events. Sometimes the perception of a new pattern is a spark of genius that is later verified. Such pattern recognition can lead to great discoveries, or detect real tigers in the bush. Epiphanies are rare but transformative moments, like Einstein’s visualization at age 16 of chasing a beam of light, Newton’s realization of gravity beneath the apple tree, or the insights behind Darwin’s theory of evolution. They genuinely advance human understanding. Apophenia, by contrast, deceives with illusions: patterns that seem meaningful but lead nowhere.

It is probably more often the case that when people “see” new connections and then act upon them with no attempt to verify, they are dead wrong. When that happens, psychologists call it apophenia, the tendency to see meaningful patterns where none exist. This can lead to strange and aberrant behaviors: the burning of witches, superstitious cosmologies, jumping at shadows, gambling addiction.

Unfortunately, it is a natural human tendency to think you see meaningful patterns or connections in random or unrelated data. That is a major reason casinos make so much money from poor souls suffering from a form of apophenia called the Gambler’s Fallacy. Careful scientists look out for defects in their own thinking and guide their experiments accordingly.
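
The Gambler’s Fallacy is easy to demonstrate. Below is a minimal simulation, a sketch assuming nothing beyond Python’s standard library, showing that a streak of tails tells you nothing about the next flip:

```python
import random

random.seed(42)
flips = [random.random() < 0.5 for _ in range(1_000_000)]  # True = heads

# Probability of heads overall vs. immediately after a run of five tails
after_streak = [flips[i] for i in range(5, len(flips))
                if not any(flips[i - 5:i])]  # previous five flips were all tails

print(f"P(heads) overall:       {sum(flips) / len(flips):.4f}")
print(f"P(heads) after 5 tails: {sum(after_streak) / len(after_streak):.4f}")
# Both print roughly 0.5000 -- the streak carries no predictive information.
```

Run it and both probabilities land at about one half. The felt pattern of being “due for a win” is pure apophenia.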

In everyday life, apophenia can also cause some people, even scientists, academics and professionals, to develop phobic fears of conspiracies and other severe paranoid delusions. Think of John Nash, the Nobel Prize-winning mathematician, and the movie A Beautiful Mind, which so dramatically portrayed his paranoid schizophrenia and involuntary hospitalization in 1959. Think of politics in the U.S. today. Are there really lizard people among us? In some cases, as we’ve seen with Nash, apophenia can be a symptom of severe schizophrenia.

A man looking distressed, surrounded by glowing numbers and mathematical symbols, evoking a sense of confusion and complexity.
Mental anguish & insanity from severe apophenia. Image by Losey using Sora inspired by Beautiful Mind movie.

The Greek roots of the now generally accepted medical term apophenia are:

  • Apo- (ἀπο-): Meaning “away from,” “detached,” “from,” “off,” or “apart”.
  • Phainein (φαίνειν): Meaning “to show,” “to appear,” or “to make known”.

The word was coined by Klaus Conrad, an otherwise apparently despicable person whom I am reluctant to cite, but feel I must, given the general acceptance of the word and diagnosis today. Conrad was a German psychiatrist and Nazi who experimented on German soldiers returning from the eastern front during WWII. He introduced the term in his 1958 publication on this mental illness. Per Wikipedia:

He defined it as “unmotivated seeing of connections [accompanied by] a specific feeling of abnormal meaningfulness”.[4] [5] He described the early stages of delusional thought as self-referential over-interpretations of actual sensory perceptions, as opposed to hallucinations.

Apophenia has also come to describe a human propensity to unreasonably seek definite patterns in random information, such as can occur in gambling.

Apophenia can be considered a commonplace effect of brain function. Taken to an extreme, however, it can be a symptom of psychiatric dysfunction, for example, as a symptom in schizophrenia,[7] where a patient sees hostile patterns (for example, a conspiracy to persecute them) in ordinary actions.

Apophenia is also typical of conspiracy theories, where coincidences may be woven together into an apparent plot.[8]

Video by Ralph Losey using SORA AI.

Can AI Be Infected with a Human Illness?

It is possible that generative AI, based as it is on human language, may have the same propensities. That is as yet unknown, so my experiments here were on the lookout for such errors. It could be one of the causes of AI hallucinations.

In information science, the mistake of seeing a connection that is not real, an apophenia of sorts, leads to what is called a false positive. This technical term is well known in e-discovery law, where AI is used to search large document collections. When the patterns analyzed suggest a document is relevant, and it is not, that mistake is called a false positive. It is like a human apophenia. The AI can also detect patterns that cause it to predict a document is irrelevant when in fact the document is relevant; that is a false negative. There was a pattern, a connection, that was not seen. That can be a bad thing in e-discovery because it often leads to withholding production of a relevant document, which can in turn lead to court sanctions.

In e-discovery it is well known that AI consistently has far lower false positive and false negative rates than human reviewers, at least in large document reviews. Generative AI may also be more reliable and astute than we are, but maybe not. This is a new field. So we should always be on the lookout for false positives and false negatives in AI pattern recognition. That is one lesson I learned well, and sometimes the hard way, in my ten years of working with predictive coding type AI in e-discovery (2012-2022). In the experiments described in this article we will look for apophenic mistakes.
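
For readers who want to see how these error rates are actually computed in a document review, here is a minimal sketch. The function and the ten-document labels are hypothetical, not drawn from any real review:

```python
def review_metrics(predicted: list[bool], actual: list[bool]) -> dict:
    """Compare AI relevance predictions against ground-truth labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives (apophenia-like)
    fn = sum(not p and a for p, a in zip(predicted, actual))  # false negatives (missed patterns)
    return {
        "precision": tp / (tp + fp),  # how many flagged documents were really relevant
        "recall": tp / (tp + fn),     # how many relevant documents were found
    }

# Hypothetical ten-document review: AI predictions vs. true relevance
pred  = [True, True, False, True, False, False, True, False, True, False]
truth = [True, False, False, True, False, True, True, False, True, False]
print(review_metrics(pred, truth))  # {'precision': 0.8, 'recall': 0.8}
```

In e-discovery terms, low precision means too many false positives, and low recall means relevant documents, real patterns, were missed.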

Video by Ralph Losey using SORA AI.

It is my hope that advanced AI, properly trained and validated, can provide a counterbalance to human gullibility by rigorously filtering signal from noise. Unlike the human brain, which often leaps to conclusions, AI can be programmed to ground its pattern recognition in evidence, statistical rigor, and cross-validation—if we build it that way and supervise it wisely.

Still, we must beware that the pattern-recognizing systems of AI may suffer from some of our delusionary tendencies. The best practices discussed here will consider both the positive and negative aspects of AI pattern recognition. We must avoid the traps of apophenia. We must stay true to the scientific method and verify any new patterns purportedly discovered. Thus all opinions reached here will necessarily be lightly held and subject to further experimentation by others.
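
What might such verification look like in practice? One standard statistical safeguard is a permutation test: deliberately destroy the suspected pattern by shuffling the data, then see how often pure chance reproduces it. Below is a minimal sketch using NumPy and entirely hypothetical data, not results from my experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired observations in which an AI claims to see a pattern
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)  # weak real relationship, for illustration

observed = np.corrcoef(x, y)[0, 1]

# Permutation test: shuffle y to break any real pairing, re-measure correlation
perms = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                  for _ in range(10_000)])
p_value = np.mean(np.abs(perms) >= abs(observed))

print(f"observed r = {observed:.3f}, permutation p = {p_value:.4f}")
# A small p suggests a real pattern; a large p suggests apophenia-style noise.
```

The design choice matters: the test asks not “does the pattern look meaningful?” but “how often would randomness alone produce a pattern this strong?”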

Video by Ralph Losey using SORA AI.

From Data to Insight: The Power of New Pattern Recognition

Modern AI models, including neural networks and transformer architectures like GPT-4, excel at uncovering subtle patterns in massive datasets far beyond human capability. This ability transforms raw data into actionable insights, thereby creating new knowledge in many fields, including the following:

Protein Structures: Models like Google DeepMind’s AlphaFold have already revolutionized protein structure prediction, achieving high success rates in predicting the 3D shapes of proteins from their amino acid sequences. This ability is crucial for understanding protein function and designing new drugs and medical therapies. The 2024 Nobel Prize in Chemistry was awarded in part to Demis Hassabis and John Jumper of DeepMind for their work on AlphaFold (shared with David Baker for computational protein design).

A scientist analyzes molecular structures and data visualizations related to AlphaFold 2 on a futuristic screen, featuring protein models and DNA sequences.
Image by Ralph Losey using his Visual Muse AI tool.

Medical Science: Generative AI models are now being used extensively in medical research, including the analysis and proposal of new molecules with desired properties to discover new drugs and accelerate FDA approval. For example, Insilico Medicine uses its AI platform, Pharma.AI, to develop drug candidates, including ISM001_055 for idiopathic pulmonary fibrosis (IPF). Insilico Medicine lists over 250 publications on its website reporting on its ongoing research, including a recent paper on its IPF discovery: A generative AI-discovered TNIK inhibitor for idiopathic pulmonary fibrosis: a randomized phase 2a trial (Nature Medicine, June 3, 2025). This discovery is especially significant because it is the first entirely AI-discovered drug to reach FDA Phase II clinical trials. Below is an infographic from Insilico Medicine showing some of its current work:

Infographic displaying the statistics and achievements of Insilico Medicine, an AI-driven biotech company, detailing development candidates, IND approvals, study phases, and global presence.
Insilico PDF infographic, found 7/23/25 in its 2-pg. overview.

Also see Fronteo, a Japan-based research company, and its Drug Discovery AI Factory.

Materials Science: Google DeepMind’s Graph Networks for Materials Exploration (“GNoME”) has already identified millions of new stable crystals, significantly expanding our knowledge of materials science. This represents an order-of-magnitude increase in known stable materials. Merchant and Cubuk, Millions of new materials discovered with deep learning (DeepMind, 2023). Also see 10 Top Startups Advancing Machine Learning for Materials Science (6/22/25).

Climate Science and Environmental Monitoring: Generative AI models are beginning to improve climate simulations, leading to more accurate predictions of climate patterns and future changes. For example, Microsoft’s Aurora forecasting model is trained on Earth science data to go beyond traditional weather forecasting and model the interactions between the atmosphere, land, and oceans. This helps scientists anticipate events like cyclones, air quality shifts, and ocean waves with greater accuracy, allowing communities to prepare for environmental disasters and adapt to climate change. See, e.g., Stanley et al., A Foundation Model for the Earth System (Nature, May 2025).

Video by Losey using Sora AI.

Historical and Artistic Revelations

AI is also helping with historical research. A new AI system was recently used to analyze one of the most famous Latin inscriptions: the Res Gestae Divi Augusti. It has always been thought to be simply an autobiographical inscription, which literally translates from the Latin as “Deeds of the Divine Augustus.” But when a specialty generative AI, Aeneas (again based on Google’s models), compared this text with a large database of other Latin texts, the famous Res Gestae Divi Augusti inscription was found to share subtle language parallels with other Roman legal documents. The analysis uncovered “imperial political discourse,” or messaging focused on maintaining imperial power, an insight, a pattern, that had never been seen before. Assael, Sommerschield, Cooley, et al., Contextualizing ancient texts with generative neural networks (Nature, July 2025).

The paper explains that the communicative power of these inscriptions is shaped not only by the written text itself “but also by their physical form and placement[2,3]” and that “about 1,500 new Latin inscriptions are discovered every year.” So the patterns analyzed included not only the words, but a number of other complex factors. The authors assert in the Abstract that their work with AI analysis shows:

… how integrating science and humanities can create transformative tools to assist historians and advance our understanding of the past.

Roman citizens reacting to propaganda. A Ralph Losey video.

In art and music, pattern detection has mapped the evolution of artistic styles in tandem with technological change. In a 2025 studio-lab experiment reported by Deruty and Grachten, a generative AI bass model (“BassNet”) unexpectedly rendered multiple melodic lines within single harmonic tones, exposing previously unnoticed structures in popular music bass compositions. This discovery was written up in Deruty and Grachten, Insights on Harmonic Tones from a Generative Music Experiment (arXiv, June 2025). Their paper shows how AI can surface new musical patterns and deepen our understanding of human auditory perception.

As explained in the Abstract:

During a studio-lab experiment involving researchers, music producers, and an AI model for music generating bass-like audio, it was observed that the producers used the model’s output to convey two or more pitches with a single harmonic complex tone, which in turn revealed that the model had learned to generate structured and coherent simultaneous melodic lines using monophonic sequences of harmonic complex tones. These findings prompt a reconsideration of the long-standing debate on whether humans can perceive harmonics as distinct pitches and highlight how generative AI can not only enhance musical creativity but also contribute to a deeper understanding of music.

Video by Losey using Sora AI.

Legal Practice: From Precedent to Prediction

The legal profession has benefited from traditional rule-based and statistical AI for over a decade, with predictive coding and similar applications. It is now starting to apply the new generative AI models in a variety of new ways. For instance, generative AI can be used to uncover latent themes and trends in judicial decisions that human analysis has overlooked.

This was done in a 2024 study using ChatGPT-4 to perform a thematic analysis on hundreds of theft cases from Czech courts. Drápal, Savelka, Westermann, Using Large Language Models to Support Thematic Analysis in Empirical Legal Studies (arXiv, February 2024).

The goal of the analysis was to discover classes of typical thefts. GPT-4 analyzed fact patterns described in the opinions and human experts did the same. The AI not only replicated many of the themes identified by the human experts but, as the report states, also uncovered a new one that the humans had missed: a pattern of “theft from gym” incidents. This shows that generative AI can sift through vast case datasets and detect nuanced fact patterns, or criminal modus operandi, that were previously undetected by experts (here, three law students under the supervision of a law professor).
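
To make the method concrete, here is a minimal sketch of this style of LLM-assisted thematic coding, assuming the OpenAI Python client. The model name, prompt, and fact summaries are illustrative stand-ins, not the study’s actual protocol:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical fact summaries standing in for the Czech theft opinions
cases = [
    "Defendant removed a locker padlock at a fitness center and took a wallet.",
    "Defendant took copper wiring from a construction site at night.",
]

def classify_theft(fact_summary: str) -> str:
    """Ask the model to assign a short thematic label to a theft fact pattern."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "You label theft fact patterns with a short theme, "
                        "e.g. 'theft from gym' or 'construction site theft'."},
            {"role": "user", "content": fact_summary},
        ],
    )
    return response.choices[0].message.content

for case in cases:
    print(classify_theft(case))
```

In the actual study the labels were then compared against the themes coded independently by the human experts, which is how the overlooked “theft from gym” theme surfaced.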

Video by Losey using Sora AI.

Another study in early 2025 applied Anthropic’s Claude 3 Opus to analyze thousands of UK court rulings on summary judgment, developing a new functional taxonomy of legal topics for those cases. Sargeant, Izzidien and Steffek, Topic classification of case law using a large language model and a new taxonomy for UK law: AI insights into summary judgment (Springer, February 2025). The AI was prompted to classify each case by topic and identify cross-cutting themes.

The results revealed distinct patterns in how summary judgments are applied across different legal domains. In particular, the AI found trends and shifts over time and across courts – insights that allow a new, improved understanding of when and in what types of cases summary judgments tend to be granted. These patterns were found despite the fact that U.K. case law lacks traditional topic labels. This kind of AI-augmented analysis illustrates how generative models can discover hidden trends in case law that practitioners can put to practical use.

Surprising abilities of AI helping lawyers. Video by Losey.

Even sitting judges have begun to leverage generative AI to inform their decision-making, revealing new analytical angles in litigation. A notable 2024 concurrence by Judge Kevin Newsom of the Eleventh Circuit admitted to experimenting with ChatGPT to interpret an ambiguous insurance term (whether an in-ground trampoline counted as “landscaping”). Snell v. United Specialty Ins. Co., 102 F.4th 1208 (11th Cir. May 28, 2024). See also Ralph Losey, Breaking News: Eleventh Circuit Judge Admits to Using ChatGPT to Help Decide a Case and Urges Other Judges and Lawyers to Follow Suit (e-Discovery Team, June 3, 2024) (includes the full text of the opinion and Appendix, with Losey’s inserted editorial comments and praise of Judge Newsom’s language).

After querying the LLM, Judge Newsom concluded that “LLMs have promise… it no longer strikes me as ridiculous to think that an LLM like ChatGPT might have something useful to say about the common, everyday meaning of the words and phrases used in legal texts.” In other words, the generative AI was used as a sort of massive-scale case law analyst, tapping into patterns of ordinary usage across language data to shed light on a legal ambiguity. This marked the first known instance of a U.S. appellate judge integrating an LLM’s linguistic pattern analysis into a written opinion, signaling that generative models can surface insights on word meaning and context that enrich judicial reasoning.

A digital illustration of a judge in a courtroom setting, seated at a desk with a gavel. The judge, named Judge Newsom, is shown in a professional attire with glasses, and a holographic display behind him showing data and AI-related graphics, conveying a futuristic legal environment.
Image by Ralph Losey using his Visual Muse AI.

My Ask of AI to Find New Patterns

Now for the promised experiment: to try to find at least one new connection, one previously unknown, undetected pattern linking different fields of knowledge. I used a combination of existing OpenAI and Google models to help me in this seemingly quixotic quest. To be honest, I did not have much real hope for success, at least not until the release of the promised ChatGPT5 and whatever Google calls its counterpart, which I predict will be released the following week (or day). Plus, the whole thing seemed a bit grandiose, even for me, to try to get AI to boldly go where no one has gone before.

Absurd, but still I tried. I won’t go through all of the prompt engineering involved, except to say it involved my usual complex, multi-layered, multi-prompt, multimodal-hybrid approach. I tempered my goals by directing ChatGPT4o, when I started the process, to seek new patterns that were useful, not Nobel Prize-winning breakthroughs, just useful new patterns. I directed it to find five such new patterns and gave it some guidance as to fields of knowledge to consider, including, of course, law. I asked for five new insights thinking that with such a big ask I might get one success.

Note, I write these words before I have received the response, but after I have written the above to help guide ChatGPT4o. Who knows, it might achieve some small modicum of success. Still, it feels like a crazy quixotic quest. Incidentally, Miguel de Cervantes’ (1547-1616) character Don Quixote (1605) does seem to be a person afflicted with apophenia. Will my AI suffer a similar fate?

Don Quixote in modern world. Video by Losey using Sora.

I designed the experiment specifically with this tension in mind between epiphanies, representing genuine insights and real advances in knowledge, and illusions, which are merely plausible yet misleading patterns. One of my goals was to probe AI’s capacity to distinguish one from the other.

Overview of Prompt Strategy and Time Spent

First, I spent about an hour with ChatGPT4o setting up my request by feeding it a copy of the article as written so far. I also chatted with it about the possibility of AI finding new patterns between different fields of knowledge. Then I just told ChatGPT4o to do it: find a new interconnecting pattern. ChatGPT4o “thought” (processed only) for just a few minutes. Then it generated a response that purported to provide me with the requested five new patterns. It did so based on its existing training and review of this article.

As requested, it did not use its browser capabilities to search the web for answers. It just “looked within” and came up with five insights it thought were new. Almost that easy. I lowered my expectations accordingly before reading the output.

That was the easy part. After reading the response, I spent about 14 hours over the next several days doing quality control. The QC work used multiple other AIs, by both OpenAI and Google, to have them go online and research these claims, evaluate their validity – both good and bad – engage in “deep-think,” look for errors, especially signs of AI apophenia, and otherwise invite contrarian-type criticisms from them. After that, I also asked the other AIs for suggested improvements they might make to the wording of the five claims and to rank them by importance. The various rewordings were not too helpful, but the rankings were, and so were many of the editorial comments.

The 14 hours of QC does not include the approximately 6 hours of machine time by the Gemini and OpenAI models to do deep-think and independent research on the web to verify or disprove the claims. My 14 hours did include traditional Google searches to double-check all citations as per my “trust but verify” motto. It also included my time to read (I’m pretty fast) and skim most of the key articles that the AI research turned up, although frankly some of the articles cited were beyond my knowledge levels. I tried to up my game, but it was hard. These other models also generated hundreds of pages of both critical and supportive analysis, which I also had to read. Finally, I probably put another 24 hours into research and writing this article (it took over a week), so this is one of my larger projects. I did not record the number of hours it took to design and generate the 26 videos because that was recreational.

Surrealistic depiction of time in robot space by a Ralph Losey video.

Part Two of this article is where I will make the reveal. Was this experiment another comic story of a Don Quixote type (me) and his sidekick Sancho (AI), lost in an apophenic neurosis? Or is it perhaps another story altogether? Neither hot nor cold? Stay tuned for Part Two and find out.

PODCAST

As usual, we give the last words to the Gemini AI podcasters who chat between themselves about the article. It is part of our hybrid multimodal approach. They can be pretty funny at times and provide some good insights. This episode is called Echoes of AI: Epiphanies or Illusions? Testing AI’s Ability to Find Real Knowledge Patterns. Part One. Hear the young AIs talk about this article for 25 minutes. They wrote the podcast, not me.

An illustration featuring two anonymous AI podcasters sitting in front of microphones, discussing the theme 'Epiphanies or Illusions? Testing AI’s Ability to Find Real Knowledge Patterns.' The background has a digital, tech-inspired design.
Click here to listen to the podcast.

Ralph Losey Copyright 2025 – All Rights Reserved.


Report on the First Scientific Experiment to Test the Impact of Generative AI on Complex, Knowledge-Intensive Work

April 29, 2024

A first-of-its-kind experiment testing the use of AI found a 40% increase in quality and a 12% increase in productivity. The tests involved 18 different realistic tasks assigned to 244 different consultants at the Boston Consulting Group. The Harvard Business School has published a preliminary report of the mammoth study: Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality (Harvard Business School, Working Paper 24-013) (hereinafter “Working Paper”). The Working Paper is analyzed here with an eye on its significance for the legal profession.

My last article, From Centaurs To Cyborgs: Our evolving relationship with generative AI, explained that you should expect the unexpected when using generative AI. It also promised that use of sound hybrid prompt engineering methods, such as the Centaur and Cyborg methods, would bring more delight than fright. The Working Paper provides solid evidence of that claim. It reports on a scientific study conducted by AI experts, work experts and experimental scientists. They tested 244 consultants from the Boston Consulting Group (“BCG”). The Working Paper, although still in draft form, shares the key data from the experiment. Appendix E of the Working Paper discusses the conceptual model of the Centaur and Cyborg methods of AI usage, which I wrote about in From Centaurs To Cyborgs.

Harvard, Wharton, Warwick, MIT and BCG Experiment

This was an impressive scientific experiment involving a very large research group. The co-authors of the Working Paper are: Harvard’s Fabrizio Dell’Acqua, Edward McFowland III, and Karim Lakhani; Warwick Business School’s Hila Lifshitz-Assaf; Wharton’s Ethan Mollick; and MIT’s Katherine Kellogg. Further, Saran Rajendran, Lisa Krayer, and François Candelon ran the experiment on the BCG side. The generative AI used and tested here was ChatGPT4 (April 2023 version with no special training). For more background and detail on the Working Paper, see the video lecture by Professor Ethan Mollick to Stanford students, Navigating the Jagged Technological Frontier (details of experiment setup starting at 18:15).

The 244 high-level BCG consultants were a diverse group who volunteered from offices around the world. They dedicated substantial time performing the 18 assigned tasks under the close supervision of the Working Paper author-scientists. Try getting that many lawyers in a global law firm to do the same.

The experiment included several important control groups and other rigorous experimental controls. The primary control was the unfortunate group of randomly selected BCG consultants who were not given ChatGPT4. They had to perform a series of assigned tasks in their usual manner, with computers of course, but without a generative AI tool. The control group comparisons provide strong evidence that use of AI tools on appropriate consulting tasks significantly improves both quality and productivity.

That qualification of “appropriate tasks” is important and involves another control group of tasks. The scientists designed, and included in the experiment, work tasks that they knew could not be done well with the help of AI, that is, not without extensive guidance, which was not provided. They knew that although these tasks were problematic for ChatGPT4, they could be done, and done well, without the use of AI. Working Paper at pg. 13. A pretty devious type of test for the poor guinea pig consultants. The authors called the tasks that they knew to be beyond ChatGPT4’s then-current abilities work “beyond the jagged technological frontier.” In the authors’ words:

Our results demonstrate that AI capabilities cover an expanding, but uneven, set of knowledge work we call a “jagged technological frontier.” Within this growing frontier, AI can complement or even displace human work; outside of the frontier, AI output is inaccurate, less useful, and degrades human performance. However, because the capabilities of AI are rapidly evolving and poorly understood, it can be hard for professionals to grasp exactly what the boundary of this frontier might be at a given. (sic)

Working Paper at pg. 1.

The improvement in quality for tasks appropriate for GPT4 – work tasks inside the frontier – was remarkable, overall 40%, although somewhat inconsistent between sub-groups, as will be explained. Productivity also went up, although to a lesser degree. There was no increase in quality or productivity for workers trying to use GPT4 for tasks beyond the AI’s ability, those outside the frontier. In fact, when GPT4 was used for those outside tasks, the answers of the AI-assisted consultants were 19 percentage points less likely to be correct. That is an important take-away lesson for legal professionals. Know what LLMs can do reliably, and what they cannot.

The scientists who designed these experiments themselves had difficulty coming up with work tasks that they knew would be outside ChatGPT4’s abilities:

In our study, since AI proved surprisingly capable, it was difficult to design a task in this experiment outside the AI’s frontier where humans with high human capital doing their job would consistently outperform AI.

Working Paper at pg. 19. It was hard, but the business experts finally came up with a consulting task that would make little ChatGPT4 look like a dunce.

The authors were vague in this draft report about the specific tasks “outside the frontier” used in the tests, and I hope this is clarified, since it is very important. But it looks like they designed an experiment where consultants with ChatGPT4 would use it to analyze data in a spreadsheet and omit important details found only in interviews with “company insiders.” The AI, and consultants relying on the AI, were likely to miss important details in the interviews and so make errors in recommendations. To quote the Working Paper at page 13:

To be able to solve the task correctly, participants would have to look at the quantitative data using subtle but clear insights from the interviews. While the spreadsheet data alone was designed to seem to be comprehensive, a careful review of the interview notes revealed crucial details. When considered in totality, this information led to a contrasting conclusion to what would have been provided by AI when prompted with the exercise instructions, the given data, and the accompanying interviews.

In other words, it looks like the Working Paper authors designed tasks where they knew ChatGPT4 would likely make errors and gloss over important details in interview summaries. They knew that the human-only expert control group would likely notice the importance of these details in the interviews and so make better recommendations in their final reports. Working Paper, Section 3.2 – Quality Disruptor – Outside the frontier at pages 13-15.

This is comparable to an attorney relying solely on ChatGPT4 to study a transcript of a deposition that they did not take or attend, asking GPT4 to summarize it. If the attorney only reads the summary, and the summary misses key details, which is known to happen, especially in long transcripts and where insider facts and language are involved, then the attorney can miss key facts and reach incorrect conclusions. This is a case of over-delegation to an AI, past the jagged frontier. Attorneys should read the transcript, or have been at the deposition and so recall key insider facts, and thereby be in a position to evaluate the accuracy and completeness of the AI summary. Trust but verify.

The 19 percentage-point decline in performance for work outside the frontier is a big warning flag to be careful, to go slow at first and know what generative AI can and cannot do well. See: Losey, From Centaurs To Cyborgs (4/24/24). Humans must remain in the loop for many of the tasks of complex knowledge work.

Still, the positive findings of increased quality and productivity for appropriate tasks, those within the jagged frontier, are very encouraging to workers in the consulting fields, including attorneys. This large experiment on volunteer BCG guinea pigs provides the first controlled experimental evidence of the impact of ChatGPT4 on various kinds of consulting work. It confirms the many ad hoc reports that generative AI allows you to improve both the quality and productivity of your work, faster and better. You just have to know what you are doing, know the jagged line, and intelligently use both Centaur and Cyborg type methods.

Appendix E of the Working Paper discusses these methods. To quote from Appendix E – Centaur and Cyborg Practices:

By studying the knowledge work of 244 professional consultants as they used AI to complete a real-world, analytic task, we found that new human-AI collaboration practices and reconfigurations are emerging as humans attempt to navigate the jagged frontier. Here, we detail a typology of practices we observed, which we conceptualize as Centaur and Cyborg practices.

Centaur behavior. … Users with this strategy switch between AI and human tasks, allocating responsibilities based on the strengths and capabilities of each entity. They discern which tasks are best suited for human intervention and which can be efficiently managed by AI. From a frontier perspective, they are highly attuned to the jaggedness of the frontier and not conducting full sub-tasks with genAI but rather dividing the tasks into sub-tasks where the core of the task is done by them or genAI. Still, they use genAI to improve the output of many sub-tasks, even those led by them.

Cyborg behavior. … Users do not just have a clear division of labor here between genAI and themselves; they intertwine their efforts with AI at the very frontier of capabilities. This manifests at the subtask level, when for an external observer it might even be hard to demarcate whether the output was produced by the human or the AI as they worked tightly on each of the activities related to the sub task.

As discussed at length in my many articles on generative AI, close supervision and verification is required for most of the work by legal professionals. It is an ethical imperative. For instance, no new case found by AI should ever be cited without human verification. The Working Paper calls this blurred division of labor Cyborg behavior.

Excerpts from the Working Paper

Here are a few more excerpts from the Working Paper and a key chart. Readers are encouraged to read the full report. The details are important, as the outside the frontier tests showed. I begin with a lengthy quote from the Abstract. (The image inserted is my own, generated using my GPT for Dall-E, Visual Muse: illustrating concepts with style.)

In our study conducted with Boston Consulting Group, a global management consulting firm, we examine the performance implications of AI on realistic, complex, and knowledge-intensive tasks. The pre-registered experiment involved 758 consultants comprising about 7% of the individual contributor-level consultants at the company. After establishing a performance baseline on a similar task, subjects were randomly assigned to one of three conditions: no AI access, GPT-4 AI access, or GPT-4 AI access with a prompt engineering overview.

We suggest that the capabilities of AI create a “jagged technological frontier” where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI.

For each one of a set of 18 realistic consulting tasks within the frontier of AI capabilities, consultants using AI were significantly more productive (they completed 12.2% more tasks on average, and completed tasks 25.1% more quickly), and produced significantly higher quality results (more than 40% higher quality compared to a control group). Consultants across the skills distribution benefited significantly from having AI augmentation, with those below the average performance threshold increasing by 43% and those above increasing by 17% compared to their own scores.

For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI. Further, our analysis shows the emergence of two distinctive patterns of successful AI use by humans along a spectrum of human-AI integration. One set of consultants acted as “Centaurs,” like the mythical half-horse/half-human creature, dividing and delegating their solution-creation activities to the AI or to themselves. Another set of consultants acted more like “Cyborgs,” completely integrating their task flow with the AI and continually interacting with the technology.

Key Chart Showing Quality Improvements

The key chart in the Working Paper is Figure 2, found at pages 9 and 28. It shows the underlying data on quality improvement. In the words of the Working Paper:

Figure 2 uses the composite human grader score and visually represents the performance distribution across the three experimental groups, with the average score plotted on the y-axis. A comparison of the dashed lines and the overall distributions of the experimental conditions clearly illustrates the significant performance enhancements associated with the use of GPT-4. Both AI conditions show clear superior performance to the control group not using GPT-4.

The version of the chart shown below has additions by one of the co-authors, Professor Ethan Mollick (Wharton), who added the red arrow comments not found in the published version. (Note the “y-axis” in the chart is the vertical scale labeled “Density.” In XY charts, “density” generally refers to the distribution of a variable, i.e., the probability of the data distribution. The horizontal “x-axis” is the overall quality performance measurement.)

Professor Mollick provides this helpful highlight of the main findings of the study, both quality and productivity:

[F]or 18 different tasks selected to be realistic samples of the kinds of work done at an elite consulting company, consultants using ChatGPT-4 outperformed those who did not, by a lot. On every dimension. Every way we measured performance. Consultants using AI finished 12.2% more tasks on average, completed tasks 25.1% more quickly, and produced 40% higher quality results than those without. Those are some very big impacts.

Centaurs and Cyborgs on the Jagged Frontier, (One Useful Thing, 9/16/23).

Preliminary Analysis of the Working Paper

I was surprised at first to see that the quality score of the “some additional training” group did not go up more than the approximate 8% shown in the chart. Digging deeper, I found a YouTube video by Professor Mollick on this study where he said, at 19:14, that the training, which he created, consisted only of a five-to-ten-minute seminar. In other words, very cursory, and yet it still had an impact on performance.

Another thing to emphasize about the study is how carefully the tasks for the tests were selected and how realistic the challenges were. Again, here is a quote from Ethan Mollick’s excellent article, Centaurs and Cyborgs on the Jagged Frontier (One Useful Thing, 9/16/23). Also see Mollick’s interesting new book, Co-Intelligence: Living and Working with AI (4/2/24).

To test the true impact of AI on knowledge work, we took hundreds of consultants and randomized whether they were allowed to use AI. We gave those who were allowed to use AI access to GPT-4 . . . We then did a lot of pre-testing and surveying to establish baselines, and asked consultants to do a wide variety of work for a fictional shoe company, work that the BCG team had selected to accurately represent what consultants do. There were creative tasks (“Propose at least 10 ideas for a new shoe targeting an underserved market or sport.”), analytical tasks (“Segment the footwear industry market based on users.”), writing and marketing tasks (“Draft a press release marketing copy for your product.”), and persuasiveness tasks (“Pen an inspirational memo to employees detailing why your product would outshine competitors.”). We even checked with a shoe company executive to ensure that this work was realistic – they were. And, knowing AI, these are tasks that we might expect to be inside the frontier.

Most of the tasks listed for this particular test do not seem like legal work, but there are several general similarities. For example, the creative task of brainstorming new ideas, the analytical tasks, and the persuasiveness tasks. Legal professionals do not write inspirational memos to employees, like BCG consultants, but we do write memos to judges trying to persuade them to rule in our favor.

Another surprising finding of the Working Paper is that the use of ChatGPT by BCG consultants on average reduced the range of ideas that the subjects generated. This is shown in Figure 1 below.

Figure 1. Distribution of Average Within Subject Semantic Similarity by experimental condition: Group A (Access to ChatGPT), Group B (Access to ChatGPT + Training), Group C (No access to ChatGPT), and GPT Only (Simulated ChatGPT Sessions).

We also observe that the GPT Only group has the highest degree of between semantic similarity, measured across each of the simulated subjects. These two results taken together point toward an interesting conclusion: the variation across responses produced by ChatGPT is smaller than what human subjects would produce on their own, and as a result when human subjects use ChatGPT there is a reduction in the variation in the eventual ideas they produce. This result is perhaps surprising. One would assume that ChatGPT, with its expansive knowledge base, would instead be able to produce many very distinct ideas, compared to human subjects alone. Moreover, the assumption is that when a human subject is also paired with ChatGPT the diversity of their ideas would increase.

While Figure 1 indicates access to ChatGPT reduces variation in the human-generated ideas, it provides no commentary on the underlying quality of the submitted ideas. We obtained evaluations of each subject’s idea list along the dimension of creativity, ranging from 1 to 10, and present these results in Table 1. The idea lists provided by subjects with access to ChatGPT are evaluated as having significantly higher quality than those subjects without ChatGPT. Taken in conjunction with the between semantic similarity results, it appears that access to ChatGPT helps each individual construct higher quality ideas lists on average; however, these ideas are less variable and therefore are at risk of being more redundant.

So there is hope for creative brainstormers, at least with GPT4-level generative AI. Generative AI is clearly more redundant than humans. As quoted in my last article, Professor Mollick says its ideas are a bit homogenous and same-y in aggregate. Losey, From Centaurs To Cyborgs: Our evolving relationship with generative AI (04/24/24). Great phrase, one that ChatGPT4 could never have come up with.
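
For the curious, within-subject semantic similarity of the kind plotted in Figure 1 is typically computed from text embeddings. Here is a minimal sketch, assuming the sentence-transformers and scikit-learn libraries and hypothetical idea strings, not the authors’ actual pipeline:

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

# Hypothetical ideas from one subject's brainstorming list
ideas = [
    "A modular shoe with swappable soles for different sports",
    "A biodegradable running shoe made from mushroom leather",
    "A shoe with interchangeable soles for running and hiking",
]

embeddings = model.encode(ideas)
sims = cosine_similarity(embeddings)

# Average pairwise similarity within the subject's own idea list
pairs = list(combinations(range(len(ideas)), 2))
avg_within = sum(sims[i][j] for i, j in pairs) / len(pairs)
print(f"average within-subject similarity: {avg_within:.3f}")
# Higher averages mean a less diverse (more redundant) idea list.
```

A higher average similarity score is exactly the “same-y” redundancy the Working Paper observed in the ChatGPT-assisted groups.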

Also see: Mika Koivisto and Simone Grassini, Best humans still outperform artificial intelligence in a creative divergent thinking task (Nature, Scientific Reports, 2/20/24) (“AI has reached at least the same level, or even surpassed, the average human’s ability to generate ideas in the most typical test of creative thinking. Although AI chatbots on average outperform humans, the best humans can still compete with them.“); Losey, ChatGPT-4 Scores in the Top One Percent of Standard Creativity Tests (e-Discovery Team, 7/21/23) (“Generative Ai is still far from the quality of the best human artists. Not yet. … Still, the day may come when Ai can compete with the greatest human creatives in all fields. … More likely, the top 1% in all fields will be humans and Ai working together in a hybrid manner.”).

AI As a ‘Skill Leveler’

As mentioned, the improvement in quality was not consistent between subgroups. The consultants with the lowest pre-AI test scores improved the most with AI. They became much better than they were before. The same goes for the middle-of-the-pack pre-AI scorers. They also improved, but by a lesser amount. The consultants at the top end of pre-AI scores also improved, but by an even smaller amount than those behind them. Still, with their small AI improvements, the pre-AI winners maintained their leadership. The same consulting experts still outscored everyone. No one caught up with them. What are the implications of this finding for future work? For training programs? For hiring decisions?

Here is Professor Ethan Mollick’s take on the significance of this finding.

It (AI) works as a skill leveler. The consultants who scored the worst when we assessed them at the start of the experiment had the biggest jump in their performance, 43%, when they got to use AI. The top consultants still got a boost, but less of one. Looking at these results, I do not think enough people are considering what it means when a technology raises all workers to the top tiers of performance. It may be like how it used to matter whether miners were good or bad at digging through rock… until the steam shovel was invented and now differences in digging ability do not matter anymore. AI is not quite at that level of change, but skill leveling is going to have a big impact.

Ethan Mollick, Centaurs and Cyborgs on the Jagged Frontier: I think we have an answer on whether AIs will reshape work (One Useful Thing, 9/16/23).

My only criticism of Professor Mollick’s analysis is that it glosses over the differences that remained, after AI, between the very best and the rest. In the field I know, law, not business consulting, the difference between the very good lawyers, the B or B+, and the great lawyers, the A or A+, is still very significant. All attorneys with skill levels in the B to A+ range can legitimately be considered top-tier legal professionals, especially as compared to the majority of lawyers in the average and below-average range. But the impact of these skill differences on client services can still be tremendous, especially in matters of great complexity or importance. Just watch when two top-tier lawyers go against each other in court, one good and one truly great.

Further Analysis of Skill Leveling

What does the leveling phenomenon of “average becoming good” mean for the future of work? Does it mean that every business consultant with ChatGPT will soon be able to provide top-tier consulting advice? Will every business consultant on the street with ChatGPT soon be able to “pen an inspirational memo to employees detailing why your product would outshine competitors“? Will their lower-priced memos be just as good as top-tier BCG memos? Is generative AI setting the stage for a new type of John Henry moment for knowledge workers, as Professor Mollick suggests? Will this near leveling of the playing field hold true for all types of knowledge workers, not only business consultants, but also doctors and lawyers?

To answer these questions it is important to note that the results of this first study on business consulting work do not show a complete leveling. Not all of the consultants became John Henry superstars. Instead, the study showed the differences continued, but were less pronounced. The gap narrowed, but did not disappear. The race only became more competitive.

Moreover, the names of the individual winners and also-rans remained the same. It is just that the “losers” (seems like too harsh a term) now did not “lose” by as much. In the race to quality the same consultants were still leading, but the rest of the pack was not as far behind. Everyone got a boost, even the best. But will this continue as AI advances? Or eventually will some knowledge workers do far better with the AI steam hammers or shovels than others, no matter where they started out? Moreover, under what circumstances, including pricing differentials, do consumers choose the good professionals who are not quite as good as those on the medalist stand?

The study results show that the pre-AI winners, those at the very top of their fields before the generative AI revolution, were able to use the new AI tools as well as the others. For that reason, their quality and productivity were also enhanced. They remained on top, still kept their edge. But in the future, assuming AI gets better, will that edge continue? Will there be new winners and also-rans? Or will everyone eventually tie for first, at least as far as quality and productivity are concerned? Will all knowledge workers end up the same, all equal in quality and productivity?

That seems unlikely, no matter how good AI gets. I cannot see this happening anytime soon, at least in the legal field. (I assume the same is also true for the medical field.) In law the analysis and persuasion challenges are far greater than those in most other knowledge fields. The legal profession is far too complex for AI to create a complete leveling of performance, at least not in the foreseeable future. I expect the differentials among medical professionals will also continue.

Moreover, although not studied in this report, it seems obvious that some legal workers will become far better at using AI than others. In this first study of business consultants, all started at the same level of inexperience with generative AI. Only some were given training. The training provided, only five to ten minutes, was still enough to move the needle. The group with this almost trivial amount of training did perform better, although not enough to close the gap.

With significant training, or experience, the improvements should be much greater. Maybe quality will increase by 70%, instead of the 40% we saw with little or no training. Maybe productivity will increase by at least 50%, instead of just 12%. That is what I would expect based on my experience with lawyers since 2012 using predictive coding. After lawyer skill-sets develop for use of generative AI, all of the performance metrics may soar.

Conclusion

In this experiment, where some professionals were given access to ChatGPT4 and some were not, a significant, but not complete, leveling of performance was measured. It was not a complete leveling because the names at the very top of the leaderboard for quality and productivity remained the same. I believe this is because the test subjects were all ChatGPT virgins. They had not previously learned prompt engineering methods, even the beginning basics of the Centaur or Cyborg approaches. It was all new to them.

As part of the experiment some were given ten minutes of basic training in prompt engineering and some were given none. In the next few years some professionals will receive substantial GPT training and attain mastery of the new AI tools. Many will not. When that happens, the names on the top of the leaderboard will likely change, and change dramatically.

History shows that times of great change are times of opportunity. The deck will be reshuffled. Who will learn and readily adapt to the AI enhancements and who will not? Which corporations and law firms will prosper in the age of generative AI, and which will fail? The only certainty here is the uncertainty of surprising change.

In the future every business may well have access to top-tier business consultants. All may be able to pen an inspirational memo to employees. But will this near leveling between the best and the rest have the same impact on the legal profession? The medical profession? I think not, especially as some in the profession gain skills in generative AI much faster than others. The competition between lawyers and law firms will remain, but the names at the top of the leaderboard will change.

From a big picture perspective the small differentials between good and great lawyers are not that important. Of far greater importance is the likely social impact of the near leveling of lawyers. The gain in skills of the vast majority of lawyers will make it possible, for the first time, for high quality legal services to become available to all.

Consumer law and other legal services could become available to everyone, at affordable rates, and without a big reduction in quality. In the future, as AI creates a more level playing field, the poor and middle class will have access to good lawyers too. These will be affordable good lawyers who, when ethically assisted by AI, are made far more productive. This can be accomplished by responsible use of AI. This positive social change seems likely. Equal justice for all will then become a common reality, not just an ideal.

Ralph Losey Copyright 2024. All Rights Reserved.

