Bottom Line Driven Proportional Review

January 15, 2012

I have been working on the problem of out-of-control e-discovery costs since 2006. At that time I phased out my general trial practice, went full-time e-discovery, and started this blog. (By the way, did you notice the new ® in the blog title? It means the U.S. Patent and Trademark Office granted me the trademark to e-Discovery Team.) I focused on the expense side because it was obvious that crazy high e-discovery cost was a core problem of civil litigation. It still is. Indeed, the high price of e-discovery, and the uncertainty of  these costs, are the main reasons most attorneys still avoid e-discovery like the plague. For more reasons see Tell Me Why?

The primary expense of e-discovery comes from the document search and review process; most estimate that it constitutes from 60% to 80% of the total. The core expense of the review process comes from the final manual quality control checks of each document to be produced to verify relevancy and to protect confidentiality by redaction and privilege logging. Confidentiality protection is an enormous problem in litigation. See Anonymous, An Open Letter to the Judiciary – Can We Talk? Parts One and Two.

Further, you cannot just dispense with final manual review. As I explained in my series Secrets of Search, Parts One, Two and Three, we are not going to turn that over to the Borg anytime soon. I’ve asked around and no law firms do that now. No experts advocate that approach either, not even the most extreme advocates for automation (of which I’m one). The only exception I have heard of is in non-litigation circumstances, such as second reviews with production to the government. Automated review is nowhere near good enough to go it alone. You use predictive coding to speed up the final manual review to be sure, but only a fool (or con artist trying to get at a producing party’s secrets) trusts coding software today without human verification.

My thinking and experiments since 2006 have focused on how to control the final review costs. By early 2008 I came up with one possible method that looked promising. I have been testing and refining this invention ever since with several e-discovery teams. I have also talked about it with many other attorneys, friend and foe, and used this new method in many lawsuits, big and small. I am now ready to write publicly about my proposed fix for the first time. I call it Bottom Line Driven Proportional Review and Production. A more technical description for it, the one I used in a legal methods patent application, is: System and Method for Establishing, Managing, and Controlling the Time, Cost, and Quality of Information Retrieval and Production in Electronic Discovery. But I usually just call it Bottom Line Driven Review, and who knows, if it catches on – and I think it should because it really works – I may trademark that phrase too.

In the meantime, try it out. The more attorneys that use this method, the more accepted it will be by judges. Right now they are hearing it from my teams for the first time, and, like anything new, it takes some explaining and getting used to. But, once understood, it appears obvious, and I expect all thinking clients will demand that their attorneys use this approach. It saves money.

Bottom Line Driven Review

The bottom line in e-discovery production is what it costs. Believe me, clients care about that …. a lot! In Bottom Line Driven Proportional Review and Production everything starts with the bottom line. What is the production going to cost? Despite what some lawyers and vendors may tell you, that is not an impossible question to answer. It takes an experienced lawyer’s skill to answer, but after a while, you can get quite good at such estimation. It is basically a matter of man-hours estimation. With my method it becomes a reliable art that you can count on.

Price estimation is second nature to me, and an obvious thing to do before you begin work on any big project. That is primarily because I worked as a construction estimator out of college to save up money for law school back in the seventies. Believe me, estimating review costs is basically the same thing, projecting materials and labor costs. In construction you come up with prices per square foot. In e-discovery you estimate prices per file, as I will explain in detail later.

My new strategy and methodology is based on the bottom line. It is based on projected review costs, defensible culling, and best-practices of review. Under this method the producing party determines the number of documents to be subjected to costly final review by calculating backwards from the bottom line of what they are willing, or required, to pay for the production.

The process begins with the producing party calculating the maximum amount of money appropriate to spend on ESI production. A budget. This requires not only an understanding of the ESI production requests, but also a careful evaluation of the merits of the case. The amount selected for the budget should be proportional to the monies and issues in the case. Any more than that is unduly burdensome and prohibited under Rule 26(b)(2)(C), Federal Rules of Civil Procedure, and other rules that underlie what is now generally known as the Proportionality Principle. See Rule 1, Rule 26(b)(2)(C), Rule 26(b)(2)(B), and Rule 26(g), Federal Rules of Civil Procedure; Commentary on Proportionality in Electronic Discovery, 11 SEDONA CONF. J. 289 (2010); Oot, Kershaw & Roitblat, Mandating Reasonableness in a Reasonable Inquiry, Denver University Law Review, 87:2, 522-559 (2010); see also Rule 403 of the Federal Rules of Evidence (inadmissibility of cumulative evidence).

The budget becomes the bottom line that drives the review and keeps the costs proportional. The producing party seeks to keep the total costs within that budget. The budget should either be by agreement of the parties, or at least without objection, or by court order. The failure to estimate and project future costs, and to decide in advance to conduct the review so as to stay within the budget, is the primary reason that e-discovery costs are so high.

After analysis of the case merits and determination of the maximum expense for production proportional to a case, the responding party makes a good faith estimate of the likely maximum number of documents that can be reviewed within that budget. The document count represents the number of documents that you estimate can be reviewed for final decisions of relevance, confidentiality, privilege and other issues, and still remain within your budget. The review costs you estimate must be based on best practices and be accurate (no puffing).

The producing party then uses smart search techniques and quality controls to find the documents most likely to be responsive within the number of documents that the budget allows. This is usually based on relevancy ranking, and thus the need for hybrid multimodal best practices in the search and review. Predictive coding is inherently rank based and so it makes bottom line driven review especially easy to do. That is one reason I am especially pleased to see the price of predictive coding software finally coming down. It can be done without predictive coding ranking to be sure, but it is harder to be accurate, especially with recall. Using best methods allows you to get the most bang for your buck, the core truth, and thus persuades the requesting party or court to go along with your budgetary limits. More on the new gold standards in a minute.
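The rank-and-cap idea can be sketched in a few lines. This is only an illustration, assuming each document already carries a relevance score from whatever ranking tool you use; the function name, the document IDs, and the scores are hypothetical and do not come from any particular software.

```python
def select_for_review(scored_docs, max_docs):
    """Return the IDs of the highest-ranked documents that fit within the cap.

    scored_docs: list of (doc_id, relevance_score) pairs. The scores are
                 assumed to come from a ranking tool (hypothetical values below).
    max_docs:    the number of documents the budget allows for final review.
    """
    # Sort from most to least likely relevant, then keep only what the budget covers.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:max_docs]]


# Hypothetical scores, for illustration only.
docs = [("A", 0.91), ("B", 0.15), ("C", 0.78), ("D", 0.42), ("E", 0.66)]
selected = select_for_review(docs, max_docs=3)
print(selected)
```

In practice the cap would be in the thousands and the scores would be refreshed as training continues, but the principle is the same: the budget sets the count, and the ranking decides which documents fill it.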

Example

An example may help clarify how it works. If you set a proportional cost for a case of $100,000, and estimate that it will cost you $5.00 per file for the final manual review before production of the ESI at issue, then you can review no more than 20,000 documents and stay within budget. It is basically that simple. No higher math is required.
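For readers who like to see the arithmetic spelled out, here is a minimal sketch of the backwards calculation from budget to document count. The function name is my own label for illustration; the figures are the ones from the example above.

```python
def max_reviewable_documents(budget, cost_per_file):
    """Number of documents that can receive final manual review within budget."""
    if cost_per_file <= 0:
        raise ValueError("cost_per_file must be positive")
    # Floor division: a partially funded document does not get reviewed.
    return int(budget // cost_per_file)


cap = max_reviewable_documents(budget=100_000, cost_per_file=5.00)
print(cap)  # 20000
```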

The only difficult part is the legal analysis to determine a budget proportional to the real merits of the case. But that is nothing new. What is the golden mean in litigation expense? How to balance just, with speedy and inexpensive? The essence of the ideal proportionality question has preoccupied lawyers for decades. It has also preoccupied scientists, mathematicians, and artists for centuries. They claim to have found an answer that they call the golden mean or golden ratio.

In law this is the perennial Goldilocks question. How much is too much? Too little? Just right? How much is an appropriate spend to produce documents? The issue is old. I have been dealing with this problem for over thirty years. What’s new is applying that legal analysis to a modern-day high-volume-ESI search and review plan. Unfortunately, unlike art and math, there is no accepted golden ratio in the law, so it has to be recalculated and reargued for each case. (Side Note: If the golden ratio were accepted in law as an ideal proportionality, the number is 1.61803399, aka Phi. That would mean 38% is the perfect proportion, since a whole divided in the golden ratio splits into parts of roughly 62% and 38%. I have argued that when applied to litigation that means the total cost of litigation should never exceed 38% of the amount at issue. In turn, the total cost of discovery should not exceed 38% of the total litigation cost, and the cost of document production should not exceed 38% of the total costs of discovery. (It’s like Russian dolls that get proportionally smaller.) Thus for a $1 Million case you should not spend more than $54,872 for document productions (38% of $1,000,000 is $380,000; 38% of $380,000 is $144,400; 38% of $144,400 is $54,872). See Losey, R., Beware of the ESI-discovery-tail wagging the poor old merits-of-the-dispute dog. But I digress too far.)
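The Russian-doll arithmetic in the side note can be checked with a short sketch. The function name and the three-level structure are just my shorthand for litigation, discovery, and document production; the 38% figure and the $1 Million case are the ones used above.

```python
def nested_caps(amount_at_issue, percent=38, levels=3):
    """Apply the same percentage cap repeatedly, like nested Russian dolls.

    Returns the cap at each level: total litigation, then discovery,
    then document production (when levels is 3).
    """
    caps = []
    current = amount_at_issue
    for _ in range(levels):
        current = current * percent / 100
        caps.append(current)
    return caps


litigation, discovery, production = nested_caps(1_000_000)
print(f"Litigation cap: ${litigation:,.0f}")
print(f"Discovery cap:  ${discovery:,.0f}")
print(f"Production cap: ${production:,.0f}")
```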

Estimation for bottom line driven review is essentially a method for marshaling evidence to support an undue burden argument under Rule 26(b)(2)(C). It is basically the same thing we have been doing to support motions for protective orders in the paper production world for over sixty years. The only difference is that now the facts are technological, the numbers and variety of documents are enormous, sometimes astronomical, and the methods of review are very complex and not yet standardized.

The calculation of projected cost per file to review can be quite complicated, and is frequently misunderstood, or is not based on best practices. Still, in essence this cost projection is also fairly simple. You basically project how long it will take to do the review and the total cost of the time. Thus, for example, and this is a gross oversimplification, in a review project of 20,000 documents (after computer assisted culling – it probably started as 100,000 or 200,000), if the average review and coding rate is 50 files per hour, it will take 400 hours to complete. If the projected total cost for the reviewer time, including supervision and other costs, is $250 per hour, the projected total cost for the review is $100,000 ($5.00 per file).
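Here is that simplified projection as a small sketch, using the same numbers. The function name is hypothetical, and the single blended hourly figure is the same gross simplification described above, not a full cost model.

```python
def project_review_cost(doc_count, files_per_hour, hourly_cost):
    """Project review hours, total cost, and cost per file.

    hourly_cost is a blended figure that is assumed to already include
    supervision and other costs, as in the example in the text.
    """
    hours = doc_count / files_per_hour
    total = hours * hourly_cost
    per_file = total / doc_count
    return hours, total, per_file


hours, total, per_file = project_review_cost(
    doc_count=20_000, files_per_hour=50, hourly_cost=250
)
print(f"{hours:.0f} hours, ${total:,.0f} total, ${per_file:.2f} per file")
```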

This may seem high when you consider the cost of contract lawyers is $50 per hour or less for their time, but you have to also include expensive partner and senior associate management time, direct supervision, quality control reviews, and privilege logging, etc. Do not be fooled by promises of $1.00 per file charges by contract review companies (or even less than that). That does not include the law firm of record time and expenses to supervise, etc., and often is based on a pre-culling rate for file count. In this business one of the hardest aspects for good estimation is getting true apples-to-apples comparisons from vendors.

Also, quality control is important and best practices can be expensive, even though with bottom line driven review the total cost is still dramatically less than old-school. As I said before, when you talk about capping the number of documents you review, you also have to talk about finding the most likely relevant documents for this capped review. You have to provide the most bang for your buck, the most truth. That, along with transparency to earn trust, is a key to the success of this method, a key to persuading the other side or court to accept this new approach to reasonable discovery.

Estimate of Projected Costs

Another key to persuade the requesting party or court is to be sure your estimate is realistic. You cannot just dream up estimates, or puff the likely expense. The estimate must be based on knowledge of the types of documents that you will be reviewing in your particular case. It must be based on the times that you find it takes for manual review of the documents. Some document collections are faster and easier to review than others. The speed is measured in files per hour. (I like the sound of that, plus pages per hour is just a relic of the paper world. Computer files don’t have pages, only paper print-outs do.)

The typical speeds we see today in final manual review are anywhere from 25 files per hour, for collections with a lot of dense long documents and crappy review software, to 100 or even 200 files per hour for collections with easier to skim documents and the best software. Putting aside the question of the wide divergence in the quality of review software, we tend to see faster files per hour rates in email-heavy cases with few attachments, than we do in cases with a high percentage of complex documents and spreadsheets. You have to know your case, know your ESI, to make a proper estimate.
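To see how much the review speed alone moves the price, here is a rough sketch that converts files per hour into cost per file. It assumes the same $250 blended hourly figure from the earlier example applies at every speed, which is my simplification for illustration; real projects will have different blended rates.

```python
def cost_per_file(hourly_cost, files_per_hour):
    """Blended hourly cost divided by review speed gives cost per file."""
    return hourly_cost / files_per_hour


HOURLY_COST = 250  # assumption: one blended rate, regardless of review speed

for speed in (25, 50, 100, 200):
    price = cost_per_file(HOURLY_COST, speed)
    print(f"{speed:>3} files/hour -> ${price:.2f} per file")
```

Under that assumption the same collection costs eight times more per file at 25 files per hour than at 200, which is why you have to know your ESI before you estimate.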

The projected costs must also be based on best practices for economical review and not be inflated. You can’t justify $500 per hour partners to do all of the review (although they may be needed as subject matter experts to do the seed set review in predictive coding). Of course, old-fashioned full manual review is out of the question. It has to be hybrid with computers doing most of the first-pass document culling. Even if you use technology assisted review, you still must also use best practices for methods. You cannot justify old-fashioned stupid review methods, such as batch out to reviewers solely based on first in, chronological, or just random. It has to be a best practice based multimodal type of review, where, for instance, you batch out documents for manual review based on issues, clusters, language, or other smart review methods. Best practice also means quality control and a random button as discussed at length in the Secrets of Search series. If you do not use the nine best practices to get the most bang for your buck, the core truth, the requesting party or court may not agree to limit the number of documents to be reviewed.

The processes behind the estimate should also be transparent. This means you should be willing to disclose it to the requesting party. That is how you can convince them that the estimate is reasonable and that you are not still stuck in the old paradigm of hide-the-ball discovery games. I cannot overstate how important it is to develop trust between counsel on discovery and often the only way to do that is through transparency. You do not have to disclose all of your trade secrets, but you have to keep the requesting party pretty well informed and involved in the process. That is what cooperation looks like.

In general, I have found that in 2011, $5.00 per document was a good place to start in projecting costs for review of a typical email collection (an email is one file, and each attachment is another). This price includes the expensive redaction and privilege logging processes. Review with a simple relevant or irrelevant coding is the easiest and cheapest to do, and is also fairly rare. There are usually multiple additional factors to consider.

The five dollars per file is a starting point of estimation, a rule of thumb that is often correct, but sometimes way off. It is comparable to the rule of thumb in construction estimation where you start with the typical costs to build on a square footage basis. But in some cases, especially ones involving cross-border issues, the costs could go much higher, as high as $15 per file. In others, where the review is simple, it could go as low as $2.00 per file. It is just like construction where various buildings in different locations have different costs.

The $5.00 per file price is based on my recent experiences in 2011. In 2009 the average cost was more like $6.50 per file, and I expect average costs will keep going down a little in 2012 and then level off.

It is important to note that you can justify starting with a much higher number based on legal precedent alone. For instance, the Department of Justice spent $9.09 per document (or file, same thing) for review in the Fannie Mae case, even though it used contract lawyers for the review work. In re Fannie Mae Securities Litig., 552 F.3d 814, 817 (D.C. Cir. 2009) ($6,000,000/660,000 emails). There were no comments by the court that this price was excessive when the government later came back and sought cost shifting. My current $5.00 per file general rule is also lower than the $6.09 per document that Verizon paid for a massive second review project that enjoyed large economies of scale and, again, utilized contract review lawyers.  Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010 ($14,000,000 to review 2.3 million documents in four months).
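The per-document figures cited above follow directly from the reported totals. A quick sketch of the division, using only the dollar amounts and document counts given in the citations (the dictionary labels are my own shorthand):

```python
def cost_per_document(total_cost, doc_count):
    """Total review spend divided by the number of documents reviewed."""
    return total_cost / doc_count


# Totals and counts as reported in the citations above.
precedents = {
    "Fannie Mae": (6_000_000, 660_000),
    "Verizon second review": (14_000_000, 2_300_000),
}

for name, (total_cost, doc_count) in precedents.items():
    print(f"{name}: ${cost_per_document(total_cost, doc_count):.2f} per document")
```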

So, if your experience suggests a starting review rate higher than $5.00 per file, there is legal justification to use a higher number. Just be prepared to go to the next steps and back it up.

The price per file is just a starting point, a way to get a quick picture, a quick estimate, without doing all of the detail work. A more accurate picture starts to emerge with sample reviews and more detailed analysis of the tasks required in the review and the actual data to be reviewed. You have to, as I like to say, get your hands dirty in the digital mud. You have to know your ESI collection. Even in just one type of ESI, the one most common in e-discovery today, email and attachments, the variances in email collections can be tremendous.

Once you get your hands on the data you need to start to break down and analyze the time involved in the various tasks required in the review project. Here, as in construction estimation, the spreadsheet is your friend. This move to actual examination of the ESI at issue, and study of the specific review tasks that need to be performed in your case, is equivalent to the move in construction estimation from rough estimates based on average per square foot prices, to a careful study of the building’s plans and specifications, and a site visit with inspection and measurements of all relevant conditions. No builder would bid on a project without first doing the detailed real world estimation work.

Even in the same organization, and just dealing with email, the variances between custodians can be tremendous. Some for instance may have large amounts of privileged communications. This kind of email takes the most time to review, and if relevant, to log. High percentages of confidential documents, especially partially confidential, can also significantly drive up the costs of review. All of the many unique characteristics of ESI collections can affect the speed of review and total costs of review. That is why you have to look at your data and test sample the emails in your collection to make accurate predictions. Estimation in the blind is never adequate. It would be like bidding on a building without first studying the plans and specs.
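One way to turn test samples into a projection is sketched below. Everything in it is hypothetical: the custodian labels, sample sizes, timings, collection sizes, and the $250 blended hourly cost are invented purely to show the mechanics, and a real estimate would break the work into more tasks than a single review rate.

```python
def project_from_sample(sample_docs, sample_hours, collection_docs, hourly_cost):
    """Extrapolate a custodian's review cost from a timed sample.

    Returns (files_per_hour, projected_cost).
    """
    files_per_hour = sample_docs / sample_hours
    projected_hours = collection_docs / files_per_hour
    return files_per_hour, projected_hours * hourly_cost


HOURLY_COST = 250  # hypothetical blended rate

# Hypothetical sample results: one custodian is quick to review, the other
# is heavy with privileged material and much slower.
custodians = {
    "custodian_1": {"sample_docs": 200, "sample_hours": 2.0, "collection_docs": 6_000},
    "custodian_2": {"sample_docs": 200, "sample_hours": 8.0, "collection_docs": 4_000},
}

total_cost = 0.0
total_docs = 0
for name, c in custodians.items():
    rate, cost = project_from_sample(
        c["sample_docs"], c["sample_hours"], c["collection_docs"], HOURLY_COST
    )
    total_cost += cost
    total_docs += c["collection_docs"]
    print(f"{name}: {rate:.0f} files/hour, projected ${cost:,.0f}")

print(f"Overall: ${total_cost:,.0f} for {total_docs:,} files "
      f"(${total_cost / total_docs:.2f} per file)")
```

With these invented numbers the slower custodian holds fewer files but accounts for most of the projected cost, which is exactly the kind of variance a flat per-file rule of thumb hides.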

Even when you have dealt with a particular client’s email collection before, a repeat customer so to speak, the estimates can still vary widely depending on the type of lawsuit, the issues, and on the amount of money in controversy or general importance of the case.

Although this may seem counter-intuitive, the truth is, the complex, big-ticket cases are the easiest ones for e-discovery, especially if your goal is to do it in a proportional manner. If there is a billion dollars at issue, a reasonable budget for ESI review is pretty big. On the other hand, proportional e-discovery in small cases is a real challenge, no matter how simple they supposedly are. Many cases that are small in monetary value are still very complex. And complex or not, all cases today have a lot of ESI.

The medium size to small cases are where my bottom line driven proportional review has the highest application for cost control and the greatest promise to bring e-discovery to the masses.

The Quest for Gold

In Secrets of Search Parts One, Two and Three, I outlined the five key characteristics of search today, using the rubric of secrets. To support my outline I used the latest scientific research on legal search, and focused on the work of William Webber. Re-examining the Effectiveness of Manual Review. In Part Three I summarized my ideas on search and review using the symbol of the Pythagoreans, the five-sided polygon, or pentagon.

With this blog on Bottom Line Driven Proportional Review I add a sixth idea, where the process gets real and takes money into consideration. Here I have shared my method to use estimation, projections, budget, cooperation, transparency, and the legal doctrine of proportionality to control the costs of search and review. With this final piece my proposal for a new gold standard of search and review is complete.

Bottom Line Driven Review is a method to try to control the key problem in electronic discovery law today, the runaway costs of review. The number of documents we have to review seems to double every two to three years, so this new legal method is imperative. New and better software, especially predictive coding type, is also important. As shown, the ranking of relevancy and other categories built into the latest algorithms is, under my bottom line driven analysis, an especially helpful new capability. You rank the documents within your budget limit that the computer predicts, based on your training, will be the most relevant to your case. But new technology alone is not enough. We must also have new legal methods. Technology and law have to work together, grounded in science, to create a new gold standard.

In Secrets of Search Part II, I proposed a new gold standard, one that would replace the now disgraced old-gold brute-force manual review unassisted by technology. I drew upon the findings in the latest scientific research, legal literature, and my over thirty years of experience with discovery to create a first draft list of the nine criteria of the new gold. The first criterion listed was Bottom Line Driven Proportional Review, which I promised to explain later and have now done so. Here is how I put it in Part II:

The old gold standard of average human reviewers, working in dungeons <smile>, unassisted by smart technology, and not properly managed, has been exposed as a fraud. What else do you call a 28% overlap rate? We must now develop a new gold standard, a new best practice for big data review. And we must do so with the help and guidance of science and testing. The exact contours of the new gold are now under development in dozens of law firms, private companies, and universities around the world. Although we do not know all of the details, we know it will involve:

  1. Bottom Line Driven Proportional Review where the projected costs of review are estimated at the beginning of a project (more on this in a future blog);
  2. High quality tech assisted review, with predictive coding type software, and multiple expert review of key seed-set training documents using both subject matter experts (attorneys) and AI experts (technologists);
  3. Direct supervision and feedback by the responsible lawyer(s) (merits counsel) signing under 26(g);
  4. Extensive quality control methods, including training and more training, sampling, positive feedback loops, clever batching, and sometimes, quick reassignment or firing of reviewers who are not working well on the project;
  5. Experienced, well motivated human reviewers who know and like the AI agents (software tools) they work with;
  6. New tools and psychological techniques (e.g. game theory, story telling) to facilitate prolonged concentration (beyond just coffee, $, and fear) to keep attorney reviewers engaged and motivated to perform the complex legal judgment tasks required to correctly review thousands of usually boring documents for days on end (voyeurism will only take you so far);
  7. Highly skilled project managers who know and understand their team, both human and computer, and the new tools and techniques under development to help coach the team;
  8. Strategic cooperation between opposing counsel with adequate disclosures to build trust and mutually acceptable relevancy standards; and,
  9. Final, last-chance review of a production set before going out the door by spot checking, judgmental sampling (i.e. search for those attorney domains one more time), and random sampling.

I have probably missed a few key factors. This is a group effort and I cannot talk to everyone, nor read all of the literature. If you think I have missed something key here, please let me know. Of course we also need understanding clients who demand competence, and judges willing to get involved when needed to rein in intransigent non-cooperators and to enforce fair proportionality. Also, you should always go for confidentiality and clawback agreements and orders.

I repeated this nine-point list of the new gold in Part III of Secrets of Search, and again repeated my invitation for input with a comment on standards that bears repetition:

I have probably missed a few key factors. This is a group effort and I cannot talk to everyone, nor read all of the literature. If you think I have missed something key here, please let me know. I will be at Legal Tech New York for three days with four presentations. Seek me out and let’s talk. You can reach me at ralph.losey@gmail.com.

You may note that I am herewith joining the call of other leaders in the field to develop best practice standards, notably including Jason Baron, and have overcome my initial reluctance to go there for a variety of reasons. See Jason R. Baron, Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search, XVII RICH. J.L. & TECH. 9, at 29-33. My concerns on arbitrary standards and unfounded malpractice claims remain, but I think we have no choice but to develop some basic industry standards. The nine characteristics of good document review outlined above constitute a first modest step in that direction.

I will be at LegalTech NY on January 30th, 31st and February 1st. My invitation for dialogue and input from readers continues. Seek me out and let’s talk, but spare me the sales pitches, please. (I am, however, open to writing pitches.) My main focus right now is the quest for a new gold standard of search and review. I know that many of you share this quest, so let’s use the power of groups, or teamwork, to make it happen. What do you think of my nine-point first draft list? Any suggestions to add new criteria, consolidate, or add to what any of these nine mean?

Two of my readers have already responded to my outreach for input, Bill Hamilton and Larry Chapin. They provided some more concrete details on criterion number six in the list, new tools and psychological techniques. They submitted an essay on one such new technique, which is actually of ancient origin, and well-known to the best trial lawyers, namely the use of story and storytelling to improve legal review. Storytelling: The Shared Quest For Excellence in Document Review. If you have ideas for an article, please send me an email with outline or first draft and I will consider it for possible publication. As coaches everywhere love to say, there is no “i” in team, even if there is in e-Discovery Team ® <grin>.

Conclusion

I have a dream, like all humans do. It is one of the key attributes that separates us from machines. My dream is not as noble or stirring as the public dreams of Martin Luther King or John Lennon. Those were grand dreams indeed. But my dream is important to me, and you can probably relate to it. I dream of a day where man and computer work together to bring truth and justice for all, not just the elite few who can afford it now. I dream of a day where e-discovery is affordable and used in all size cases. This dream of truth and justice for all is deeply rooted in my psyche. I suspect it is in yours too. We all grew up understanding the importance of Superman’s never-ending battle for truth, justice, and the American way. Join me in this battle. Join the e-discovery team fight for truth, justice, and the American way. (For my many readers outside of the U.S., my American way reference is not meant to be nationalistic or exclusionary, but rather to refer to the highest ideals of a great country.) All professionals in the field are invited. So too are computers, especially the latest generation of super smart ones. Yeah, their programmers too. All behind the scenes coders, techs, and scientists are an important part of the e-Discovery Team.

High tech lawyers working with computers and their handlers are key to my version of the archetypical American dream of truth and justice. Techs and computers helped bring about the nightmare we must now overcome – the explosion of ESI that hides the truth and makes justice too expensive. They helped get us into this mess, they can help get us out. We cannot turn back.

Jason Baron’s depressing prophecy of information dystopia, where we all drown in a flood of information, is no prophetic dream. It is a realistic assessment of the current state of the law and the discovery of electronic evidence. The reality today is that the vast majority of lawyers avoid the discovery of information in computers, even though that is where the truth lies. They have a prejudice against it. They believe in the inherent superiority of paper. We all know that most of the truth left paper filing cabinets over a decade ago (with the sole exception, perhaps, of the federal government), yet most lawyers still look there, and only there, for justice. Jason’s dream of extreme information overload is a projection of the current reality getting worse. He speaks the truth, but only if we don’t do something about it, if we don’t continue in the never-ending battle. Truth and justice can triumph. They must.

The motivation is clear. So is the solution. Imagine a world where the fool’s errand, the paper chase, comes to an end. Imagine a world where these old ways, based as they are on ignorance and delusion, are replaced by an affordable and effective process of e-discovery. Imagine a world where all the people, the litigants in all size cases, can all afford to do e-discovery. Imagine all the people living life in justice. It isn’t hard to do. As John Lennon said: You may say I’m a dreamer, but I’m not the only one. I hope some day you’ll join us. And the world will be as one.

I have a dream of a new method of technology assisted discovery, where Man and Machine work together to find the core truth. This day will come; in fact, it is already here. As William Gibson said: “The future is already here – it’s just not evenly distributed yet.” The key facts you need to try a case and to do justice can be found in any size case, big and small, at an affordable price. But you have to open your mind. You have to embrace change and adopt new legal and technical methodologies. The Bottom Line Driven Review method is, I suggest, an important part of that answer. It is working for me today; it can work for you too. Our dreams can come true. The nightmare scenarios of justice for only the super-rich can be avoided. The battle for truth and justice must continue.

I see a way out, where we can overcome, where truth and justice can be attained for all the people. I see a day where the truth in our computers can be found and brought to the courtroom for justice to prevail.

Although I also have a dream of a new generation of tech-smart lawyers, who understand and apply the new methods, the new gold, to keep e-discovery available for all, we do not need to wait for this slow gradual change. We can win the battle now, even without the young geek Supermen. The time for change is now – in this generation, not the next. As King said in his I Have a Dream speech, this is no time to take the tranquilizing drug of gradualism.

Join me in the dream of e-truth and justice today. As a team we can get there, we can and shall overcome, we shall be free of the paper-prejudices of the pre-computer world. And when that day comes, let the bells of truth and justice ring throughout the world. To quote the end of King’s famous speech:

And when this happens, when we allow freedom to ring … [we] will be able to join hands and sing in the words of the old Negro spiritual, “Free at last! free at last! thank God Almighty, we are free at last!”


Storytelling: The Shared Quest For Excellence in Document Review

January 8, 2012

Guest Blog by William F. Hamilton and Lawrence C. Chapin.

Bill Hamilton is an attorney with nearly thirty years of experience in business litigation who is a partner at Quarles & Brady. Bill also serves as the Dean of the E-Discovery Department of Bryan University, which includes an online educational program in e-discovery project management. Bill is also an Adjunct Law Professor teaching Electronic Discovery and Digital Evidence at the University of Florida, and has frequently contributed to this blog. See, e.g., The E-Discovery Crisis: An Immediate Challenge to Our Nation’s Law Schools, and The E-Discovery Sanctions Cube.

Larry Chapin is an attorney with 30+ years of experience, including corporate Wall Street law, who now works as a contract review lawyer in New York City. Larry has taught at the New School for Social Research in NYC and currently serves on the Board of Directors for an asset management company in Stockholm, Sweden. Larry is the first graduate of our e-Discovery Team Training program. He contributed a must-read blog here earlier this year entitled Contract Coders: e-Discovery’s “Wasting Asset”?

_____________

EDITOR’S NOTE: Over the last several blogs on the Secrets of Search we have examined the latest scientific research on manual and automated reviews. The research shows that although brute-force manual linear review is as dead as a doornail, or should be, there is still an important place for skilled human reviewers and review, even in the latest predictive coding models. But the emphasis is on skilled human reviewers and skilled methods. Simply asking some lawyers to look at documents all day on a computer screen for weeks on end and decide relevance or not is unacceptable. If that is how you conduct manual reviews, and just bid things out to the lowest paid reviewers, then you are inviting error. You probably would be better off turning it over to the Borg, and just skipping final quality control reviews altogether. But if you care about quality, if you are diligent in the protection of client confidentiality – and as a lawyer you have a clear ethical duty to do so – then you must improve and innovate on manual review. This guest blog by a professional reviewer, Larry Chapin, and an expert in e-discovery and project management, Bill Hamilton, helps show the way.

In Part III of Secrets of Search I listed a nine-point checklist for quality reviews. Point number six was: “New tools and psychological techniques (e.g. game theory, story telling) to facilitate prolonged concentration … ” This guest blog will flesh out a new approach that Chapin and Hamilton have developed to use storytelling to improve the quality of contract reviews. I think this is a great idea. Lawsuits are essentially a battle of competing stories. They can become high drama, as the Casey Anthony trial that took place across from my office in Orlando showed in 2011. Good trial lawyers already know the importance of story to a case. They should quickly understand this idea and appreciate how this new review technique could help their cause. All attorneys, and especially companies that do contract review work, should look into including this new technique in their projects. Feel free to email Bill Hamilton or Larry Chapin to see how they may be able to assist.

_____________

By putting its faith in logic, control and optimization, command-and-control management has lost sight of the crucial role that passion plays in human action.

Stephen Denning, The Leader’s Guide to Storytelling

_________

Storytelling: The Shared Quest For Excellence in Document Review

by William F. Hamilton and Lawrence C. Chapin

What is the future of large-scale human document reviews? With the startling advances of search technology, is human document review about to be consigned to the dustbin of history? Some believe so. Yet, others think that the death of human review has been grossly exaggerated. There is no doubt that computer assisted reviews will be increasingly important for large and even moderate scale reviews. However, the contest between human and computer, between manual and automated review is far from over. In this blog, Ralph Losey recently discussed some of the implications of the fascinating work of information scientist William Webber. It seems that in the proper setting, the best human reviewers can still out-perform the automated review.

Watson may be the Jeopardy winner, and IBM’s Deep Blue the chess champion, but the identification and evaluation of documents in the litigation context stretches the utility of computer algorithms. In the document review setting, well-trained, well-led and properly motivated women and men are, in fact, able to excel. How can we build reviews to maximize human review performance? What can be done about the powerful disincentives of long hours of dreadfully monotonous work at rates of pay already low and still in decline? Put more constructively, what can be done to tap the intelligence, marshal the talents, and harness the energies of the contract lawyers who fill the ranks in the typical review? How do we rid ourselves of the upstairs-downstairs mentality that isolates and confines our reviewers, turns them into servants and cripples their reviews?

We believe that the answer may be found in an approach to document review that harkens back to a simpler time, before litigators faced the enormous volumes of documents common in our digital age. That is to say, answers are to be found in building reviews around the art of storytelling. Shakespeare was right: all the world’s a stage, and all the men and women merely players. Certainly, litigation is drama. It is the drama of competing and clashing human passions. It is the stuff of stories. Document reviews must be understood as a central player in the litigation storytelling process.

A fundamental shift in the way that lawyers think, speak, and conduct document reviews is required. We propose a new paradigm. We propose building “story-centric” reviews. First, though, let’s face it. Storytelling usually gets a hard knock. It’s for children. It’s the stuff of fairy tales. Storytelling is said to have no place in the hard-edged, logic-driven, command-and-control culture to which the legal and business communities have grown accustomed. Euphemisms – like “business narrative” – have been invented so that stories might have a place of some kind in the working world.

Yet, storytelling has long been a part of lawyering. Good trial lawyers have always known that cases are won on the strength of their story. Even crazy ones can be convincing. Empirical studies show that appellate briefs, too, are more persuasive if they tell stories rather than rely on logic alone. A case can’t resonate with a judge or jury – emotionally, intellectually, or intuitively – unless it’s tied to a compelling story. The litigation team itself can’t know what evidence most belongs before the court unless it knows the story to which the evidence belongs. The discovery process serves to yield the elements and the contours of the story, and shed light on the connections between the cause and effect that are at its heart. It is the job of the entire team – including document reviewers – to construct the most persuasive story possible, and to diminish and discredit the tale told by the other side.

Our experience, unfortunately, is that too many lawyers separate document review from that creative process. They fail to see document reviewers for what they are: investigators sharing fully in the common tasks of discerning, shaping, and telling the client’s story. This kind of engagement requires that the review structure and evaluation adopt the elements and language of the story. It’s an orientation that triggers active reviewer participation and has real potential to address the problems now plaguing review. We believe that the failure to engage the review team in this way results in a process that is less true and just than it might be.

Suggestions to Add Story to Document Reviews

Accordingly, we offer a series of suggestions for the use of storytelling in the discovery process, toward building a story-centric review.

First, at the outset, use the client’s story and its themes to define the goal of the review project. Articulate clearly the central purpose of every reviewer’s contribution: to enable the story to be told. The story needs to remain the constant center of their focus. We might liken reviewers to crew members who sailed in search of new lands during the great age of exploration. Not every day was filled with adventure. During more days than we realize, their ships were becalmed on windless seas. What got them through those days was their purpose for being there, the vision of things that had launched their journey. So they kept their focus, mindful always of the possibility of a sighting and the promise of discovery. In that way, let the story of the case be what drives and sustains the review team. Remember that the critical document, like new land, may be just a moment away. Everyone needs to stay alert.

Project metrics should be designed to reflect this orientation. Story-centric metrics should measure: linkage, the degree to which documents pull the story together tightly to help tell the tale; gravity, the degree to which the document collection gives weight, heft and power to the tale; and resonance, the degree to which documents provide compound richness to the story.

“Linkage Docs” provide the basic story line. They establish the necessary cause and effect that transforms otherwise isolated facts into a real story. They reflect the fact that every story is composed of details that unfold at a particular time and place, over a span of time, and that involve human motivations and conflicts. They are the sinews, the connecting tissues without which a story does not exist. For example, in a case involving a business breach of contract for failure to maintain premises, a document that shows the defendant’s financial distress shortly before the breach establishes linkage. Linkage allows the story to begin to congeal.

Links are related to gravity, but different. “Gravity Docs” are those documents that move the story events out of stasis towards resolution. They function as a pivotal column or anchor that marks a transition, direction or resolution within the story. We ultimately want links that tie to these pivotal columns. The documents with gravity are the turning point documents.

Finally, “Resonance Docs” are those documents that strike a chord in us. They evoke sympathies in ways that align us with the actors in the story. They establish decisive commonalities between persons hearing the story and those persons within it. In helping the story ring true, they persuade us. They lead us safely past any temptation to turn to unpersuasive clichés, triteness, and banality in telling the story. A document that provides resonance will tie story links (sub-plots) and pivotal gravity markers (the main plot) together.

The story can have links and pivoting documents, and still be unpersuasive. Resonating documents provide understanding, the “now I get it” feeling, and are often documents that directly speak to human motivation and intention (or give rise to strong presumptions of actual motivation). The irony of the traditional review is that a review team shackled by traditional coding blinders can row past a proverbial “smoking gun” document and not recognize its value to the story. Reviewers should not resemble the galley rowers portrayed in Ben-Hur who are driven to exhaustion as the pace of the review escalates to ramming speed.

The review team must be able to recognize documents with story-centric values, not merely label documents as responsive or non-responsive according to abstract coding rules. A good review team requires graphics. The review team’s identification of Linkage Docs, Gravity Docs and Resonance Docs composes the story as the review progresses. The review team needs to literally see the story mapped as it develops. The story-centric review replaces the traditional white board with a large story board that simultaneously shapes and is shaped by the review.

Linkage, gravity, and resonance can be seen as three overlapping circles. In practice, depending upon the story, the circles may vary in size and shape (e.g. oblong), but in the overlapping section we are likely to find the 7±2 documents that the trial team needs to tell the winning story.
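For teams that track these tags in a review database, the overlapping section of the three circles is simply a set intersection. The short Python sketch below is only an illustration of that idea: the document IDs, the tag assignments, and the helper name `core_documents` are all hypothetical and are not drawn from any real review or review platform.

```python
# Hypothetical document IDs tagged by the review team (illustrative only).
linkage_docs = {"DOC-001", "DOC-004", "DOC-007", "DOC-012"}
gravity_docs = {"DOC-004", "DOC-007", "DOC-009"}
resonance_docs = {"DOC-003", "DOC-004", "DOC-007", "DOC-015"}


def core_documents(*tagged_sets):
    """Return the documents that appear in every tagged set.

    This is the overlapping section of the circles: documents that
    were tagged for linkage AND gravity AND resonance.
    """
    if not tagged_sets:
        return set()
    return set.intersection(*(set(s) for s in tagged_sets))


core = core_documents(linkage_docs, gravity_docs, resonance_docs)
print(sorted(core))
```

In this made-up example only DOC-004 and DOC-007 carry all three tags, so they are the candidates for the story board's center. A real review would of course still need human judgment to decide which of the overlapping documents actually belong in front of the judge or jury.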

So invest your own time in a solid understanding of the client’s story. Invest more time still in discussing it with the review team, so that together you reach a shared grasp of its themes and important facts. This initial investment may turn out to be substantial, but the rewards will be enormous. Don’t make the mistake of taking more time to talk about the software the team will be using than on the story they will be helping to tell.

On one project of which we are aware, the trial and discovery teams developed a highly detailed, rule-based review book. It was more than one hundred and fifty pages long, but devoted fewer than one hundred words in not even ten lines of text to actually telling the client’s story. Don’t do that. Don’t let a narrow focus on the chains of logic obscure the compelling threads of the underlying narrative.

Second, use storytelling with the review team to create a sense of quest. Remember again our metaphor of voyage. The reviewers are, of course, engaged in a real pursuit – weaving a tight, compelling story worthy of being told. Beyond that, quests intimate a feeling of authentic commitment – even a passion – among members of the review team. The power of the story transforms the document review experience. Stories have a unique ability to bind members of the team to a broader purpose, and to each other. As we work together, we are reminded of the human drama that has already unfolded for our client. We remember, too, that our own stories are still unfolding in our work together. On several levels, then, we feel connected. The present has new and important depth.

The organization of the review teams is critical to a sense of quest. The reviewers must identify with the quest to face its hardships and celebrate its victories. The review itself should be seen as a story that has drama, disappointments, dead ends, clues, and ultimately triumph. Banish forever the factory concept of document review as a mass production based on the principles of Taylorism and Fordism.

Third, use of a lawsuit’s stories serves to continually define and redefine the team’s analytical tasks, and to sharpen their focus as the review progresses. Use graphics and models to demonstrate the elements and cohesion of the story as the review is taking place. If the reviewers can’t understand and relate to your story, no judge or jury ever will. Emphasize that the story being told to them is provisional, and that their investigation may, in fact, bring about a retelling of the story. Reiterate key themes as you talk to the members of the team. Challenge them to discern both its strengths and its weaknesses. Provide opportunities for them to share their impressions and their hunches, their discoveries and concerns. This might be as simple and productive as it was on one recent project in which every day or so, one of the law firm’s associates on the case went among the reviewers and asked them, “What are you finding? What do you think?”

It is hard to exaggerate the importance of these interactions. They’re not drive-by questions that are all too easily answered with a yes or no. They are chances for leaders to demonstrate deep listening. They are open-ended invitations to contribute to the group’s learning. They are small streams of one-on-one talk that contribute to what Denning has called the river of conversation that keeps the project moving forward. They are also brief opportunities for members of the team to be acknowledged and affirmed in their work. The goal is to create short but meaningful exercises in team building and fleshing out the lawsuit’s story.

Fourth, share “discoveries” among the team. After all, many of the decisions made by reviewers are close calls, and need to be shared and socialized for consistency and accuracy. In part, this question of sharing is a matter for science.  There are, no doubt, a wide variety of wiki-like technologies that might be brought to bear for purposes of shared learning. But there are several things to be remembered in that regard. First, the technologies seem to be variations on the same theme. That is, they provide ways in which reviewers can articulate their rule-based questions, which are then migrated upwards for consideration by someone on the trial team. The review team is then given access to a database containing all the questions and their answers.  There are many other technologies available for broader, more open learning, but sadly they are rarely employed. It is ironic that in this digital era that has spawned massive reviews, few of the readily available social networking and communications tools have been applied to “humanize” the review process. Then again, the reason is clear: non-story-centric reviews seem to have little use for creativity and collaboration.

The reviewers should be organized into “review teams.” Review teams should ideally be small teams (10-15 reviewers) located in physical proximity. The identification of Linkage Docs, Gravity Docs, and Resonance Docs should be quickly shared and celebrated. Review team members should encourage one another. Review metrics should not exclusively focus on number of documents reviewed per hour. All genuine work and creativity has valleys and plateaus. A review should not be a forced march. The football team regroups in the huddle before each play as it creatively marches down the field. A good, productive review will have its own rhythm. To facilitate this rhythm the successes of one review team should be shared with other teams. Success encourages success and friendly goal-oriented competition. Reporting, feedback, and encouragement should be emphasized.

Why have we ignored the lessons of sports competition in our document reviews? Sports motivation coaches are paid millions to inspire athletes and teams. Yet in million dollar reviews, and where even more is at stake in the litigation, we tolerate performance that would be banished elsewhere. What is needed are genuine “review coaches.”

Fifth, collaboration thrives on human face-to-face contact. The 2009 Text Retrieval Conference (TREC) validated this important point. The TREC team sponsored by the School of Information Sciences of the University of Pittsburgh was provided with shared digital space that allowed them to communicate with each other and to store and organize results. Early on, communication between the searchers consisted mostly of texting, with very little actual, verbal communication. Later on, as tasks became more difficult and the need to collaborate became greater, real talk between the searchers virtually replaced texting, as trust and familiarity developed.

The Pittsburgh team results suggest that while wiki-like technologies are useful in knowledge sharing, trust-based communication such as that involved in document review will gravitate towards ordinary face-to-face communication. It also reminds us that, especially in knowledge sharing exercises, “talk is work” as Stephen Denning has said. This may be another surprise for readers. Absolute silence may not simply mean a focused project. It may signal a failure to share critical information.

It is precisely in such spontaneous conversations that members of the team draw from the pool of cognitive diversity. A good team will comprise individuals with different strengths, training and backgrounds. When left to themselves, high-functioning teams learn to take full advantage of their diversity. Good leaders will make sure that team members know their neighbors. Sadly, that rarely happens. On one project related to the life sciences, one reviewer had nearly a decade of law firm experience in that field. But the rest of the team never found out, because the supervisors never thought or wanted to ask. In another project involving the global capital markets, one of the reviewers had two decades of high-level experience trading financial instruments. He decided not to reveal that to anyone. Somehow, the message had gotten across to him that the smart approach to “surviving” document review was to “keep your head down.” It’s a saying you hear a lot on the project floor. What a terrible reflection upon the kind of “supervision” and “management” to which document reviewers are commonly subjected!

Sixth, use storytelling to generate the connections that will make document review a meaningful experience. The most profound concerns about document review have always revolved around the lack of connection between the purposes of the work and those doing it. Storytelling, on the other hand, is all about connections. Remember what stories are: accounts of causally connected events. So, document review is really an investigation into the nature of those connections. Further, stories are a shared human experience; we all have our own stories. In working together to formulate the story of the case, our own stories become part of the story of the group.

Storytelling establishes common meanings and transmits the values characteristic of high-performing teams. Denning writes that the most striking thing about being part of a great team is the meaningfulness of the experience. “People talk about being part of something larger than themselves, of being connected, or being generative … their experiences as part of truly great teams stand out as singular periods of life lived to the fullest.” We have seen the reviewers’ faces light up, their smiles appear, and genuine excitement erupt when participating in story-centric reviews.

In our view, these are issues of leadership, more than management. The dominant language of document review management reflects the values of traditional command-and-control culture. Such management is about structure, schedules, budgets and the like. This management operates out of hierarchical schemes and derives its presumed effectiveness from the power of authority. Naturally, such things have their place in well-run reviews, as most published literature attests. Metrics matter; things need to be measured and counted. But traditional measures of performance are not always the most revealing.

Consider, for example, the story told in the movie Moneyball about Billy Beane’s discovery that the “five tools” traditionally used to evaluate baseball players missed the mark. Metrics such as batting average and speed on the bases mattered, but they were really pointing to something else that was the most telling factor between ball players on winning and losing teams, that is, on-base percentage. What mattered was how often batters got on base by any means. What if the metrics relied upon in review command-and-control structures – such as documents per reviewer per hour – are off the mark?

Seventh, remember that the document review may have to be explained and defended. If challenged as to its reasonableness, the review will have its own story to be told. The McDermott case now is a powerful reminder of what may be at stake. Stories about the labors of well equipped, fully engaged, and highly motivated reviewers are bound to be the most persuasive stories of all.

Conclusion

Good storytelling lies at the very heart of good litigation. Neither the information revolutions of the digital age, nor the dizzying advances of technology have changed that.

The challenge lawyers face is that of adapting the storytelling art to the requirements and capacities of our day. Discovery and review must articulate the client’s most compelling story. It must disable the counter-story told by the other side. Story-centric reviews serve as powerful levers for the other assets – both human and hard – committed to the work of review excellence. This is important work. Justice depends on a compelling story and injustices arise when we forget that.



Secrets of Search – Part II

December 18, 2011

This is Part Two of the blog that I started last week on the Secrets of Search, which was in turn a sequel to two blogs before that: Spilling the Beans on a Dirty Little Secret of Most Trial Lawyers and Tell Me Why?  In Secrets of Search – Part One we left off with a review of some of the analysis on fuzziness of recall measurements included in the August 2011 research report of information scientist, William Webber: Re-examining the Effectiveness of Manual Review. We begin part two with the meat of his report and another esoteric search secret. This will finally set the stage for the deepest secret of all and the seventh insight into trial lawyer resistance to e-discovery.

Summarizing Part One of this Blog Post and the First Two Secrets of Search

I can quickly summarize the first two secrets with popular slang: keyword search sucks, and so does manual review (although not quite as bad), and because most manual review sucks, most so-called objective measurements of precision and recall are unreliable. Sorry to go all negative on you, but only by outing these not-so-little search secrets can we establish a solid foundation for our efforts with the discovery of electronic evidence. The truth must be told, even if it sucks.

I also explained that keyword search would not be so bad if it were not done blindly like a game of Go Fish, where it achieves really pathetic recall percentages in the 4% to 20% range (the TREC batch tasks). It still has a place with smarter software and improved, cooperation based Where’s Waldo type methods and quality controls. In that same vein I explained that manual review can probably also be made good enough for accurate scientific measurements. But, in order to do so, the manual reviews would have to replicate the state-of-the-art methods we have developed in private practice, and that is expensive. I concluded that we should come up with the money for better scientific research so we could afford to do that. We could then develop and test a new gold standard for objective search measurements. Scientific research could then test, accurately measure, and guide the latest hybrid processes the profession is developing for computer assisted review.

Another conclusion you could also fairly draw is that since the law already accepts linear manual review and keyword search as reasonable methods to respond to discovery requests, the law has set a very low standard and so we do not need better science. All you need to do to establish that an alternative method is legally reasonable is to show that it does as well as the previously accepted keyword and manual methods. That kind of comparison sets a low hurdle, one that even our existing fuzzy research proves we have already met. This means we already have a green light under the law, or logically we should have, to proceed with computer assisted review. Judge Peck’s article on predictive coding stated an obvious logical conclusion based upon the evidence.

You could, and I think should, also conclude that any expectation that computer assisted reviews have to be near perfect to be acceptable is misplaced. The claim that some vendors make as to near perfection by their search methods is counter to existing scientific research. It is wrong, mere marketing puff, because the manual based measurements of recall and precision are too fuzzy to measure that closely. If any computer assisted or other type of review comes up with 44% recall, it might in fact be perfect by an actual objective standard, and vice versa. Allegedly objective measurements of high recall rates in search are, for the time being at least, an illusion. It is a dangerous delusion too because this misinformation could be used against producing parties to try to drive up the costs of production for ulterior motives. Let’s start getting real about objective recall claims.
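For readers who want to see the arithmetic behind that 44% point, here is a minimal Python sketch. It assumes the usual definition of recall (relevant documents retrieved, divided by all relevant documents), and every number in it is invented purely for illustration: a hypothetical collection with 100 truly relevant documents, a review that finds every one of them, and a hypothetical fuzzy gold standard in which the human assessors marked only 44 of those documents correctly and mistakenly marked 56 irrelevant ones as relevant. None of these figures come from TREC or from Webber's report.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the review retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical ground truth: documents 0-99 are the truly relevant ones.
truly_relevant = set(range(100))

# A hypothetical review that found every truly relevant document.
retrieved = set(truly_relevant)

# A hypothetical fuzzy gold standard built by fallible human assessors:
# they got 44 of the relevant documents right (0-43) and wrongly marked
# 56 irrelevant documents (100-155) as relevant.
gold_standard = set(range(44)) | set(range(100, 156))

true_recall = recall(retrieved, truly_relevant)
measured_recall = recall(retrieved, gold_standard)

print(f"true recall:     {true_recall:.2f}")
print(f"measured recall: {measured_recall:.2f}")
```

Under these made-up assumptions the review is perfect, yet it scores only 0.44 when graded against the flawed gold standard. The same mechanics can run the other way, which is why a measurement is only as trustworthy as the human judgments it is scored against.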

In any event, most computer assisted search is already better than average keyword or manual search, so it should be accepted as reasonable under the law without confidence inflation. We don’t need perfection in the law, we don’t need to keep reviewing and re-reviewing to try to reach some magic, way-too-high measure of recall. Although we should always try to get more and more of the truth, we should always try to improve, we should also remember that there is only so much truth that any of us can afford when faced with big data sets and limited financial resources.

As I have said time and again when discussing e-discovery efforts in general, including preservation related efforts, the law demands reasonable efforts not perfection. Now science buttresses this position in document productions by showing that we have never had perfection in search of large numbers of documents, not with manual, and certainly not with keyword, and, here is the kicker, it is not possible to objectively measure it anyway!

At least not yet. Not until we start taking our ignorance of the processes of search and discovery as a disease. Then maybe we will start allocating our charitable and scientific efforts accordingly, so we can have better measurements. Then with reliable and more accurate measurements, with solid gold objective standards, we can create more clearly defined best practices, ones that are not surrounded with marketing fluff. More on this later, but first let’s move onto another secret that comes out of Webber’s research. I’m afraid it will complicate matters even further, but life is often like that. We live in a very complex and imperfect world.

The Third Search Secret (Known Only to a Very Few): e-Discovery Watson May Still Not Be Able to Beat Our Champions

Webber’s report reveals that there is more to the man versus machine question than we first thought. His drill down analysis of the 2009 TREC interactive tasks shows that the computer assisted reviews were not the hands down victors over human reviewers as we first thought, at least not victors over many of the well-trained, exceptional reviewer men and women. Putting aside the whole fuzziness issue, Webber’s research suggests that the TREC and EDI tests so far have been the equivalent of putting Watson up against the average Jeopardy contestants, you know, the poor losers you see each week who, like me, usually fail to guess anything right.

The real test of IBM’s Watson, the real proof, didn’t come until Watson went up against the champions, the true professionals at the game. We have not seen that yet in TREC or the EDI studies. But the current organizers know this, and they are trying to level the playing field with multi-pass reviews and, as Webber notes, trying to answer the question we lawyers really want to know, the one that has not been answered yet, namely which Watson, which method can an attorney most reliably employ to create a production consistent with their conception of relevance.

Webber in his research and report digs deep into the TREC 2009 results and looks at the precision and recall rates of individual first pass reviewers. Re-examining the Effectiveness of Manual Review. He found that while Grossman and Cormack were accurate to say that overall two of the top machines did better than man, the details showed that:

Only for Topic 203 does the best automated system clearly outperform the best manual reviewer. As before, the professional manual review team for Topic 207 stands out. Several reviewers outperform the best automated system, and even the weaker individual reviewers have both precision and recall above 0.5.

This means the best team of professional reviewers who participated in Topic 207 actually beat the best machines! They did this in spite of the mentioned inequities in training, supervision, and appeal. Did you know that secret? I’m told that topic 207 was an easy one having to do with junk filters, but still, easy or not, the human team won.

There is still more to this secret. When you drill down even further you find that certain individual reviewers on each topic team actually beat the best machines on that topic in some way, even if their entire human team did not. That’s right, the top machines were defeated by a few champion humans in almost every event. Humans won even though they were disadvantaged by not having an even playing field. I guarantee that this is a secret you have never heard before (unless you went to China) because Webber just discovered it from his painstaking analysis of the 2009 TREC results. Chin up, contract reviewers, the reports of your death have been greatly exaggerated. Watson has not beaten you yet. In fact, Watson still needs you to set up the gold standard to determine who wins.

Webber’s research shows that a competition between the best Watsons and best reviewers is still a very close race where humans often win. Please note this analysis assumes no time limits or cost limits for the human review, which are, of course, false assumptions in legal practice. This is why pure manual review is still, or should be, as dead as a doornail. The future is a team approach where humans use machines in a nonlinear fashion, not vice versa. More on this later.

Webber’s findings are the result of something that is not a secret to anyone who has ever been involved in a large search project: all reviewers are not created equal. Some are far better than others. There are many good psychological, intelligence, and project management and methodology reasons for this, especially the management and methodology issues. See, e.g., the must read guest blog by contract review attorney Larry Chapin, Contract Coders: e-Discovery’s “Wasting Asset”?

The facts supporting Webber’s findings on individual reviewer excellence are shown in Figure 2 of his paper on the variability in review team reliability. Re-examining the Effectiveness of Manual Review. The small red crosses in each figure (except flawed task 205) show the computer’s best efforts. Note how many individual reviewers (a bin is 500 documents that were reviewed by one specific reviewer) were able to beat the computer’s best efforts in either precision, or recall, or both. They are shown as either to the right or above the red cross. If above this means they were more precise. If to the right, they had better recall.

William Webber summarizes these findings in his blog recently by saying:

The best reviewers have a reliability at or above that of the technology-assisted system, with recall at 0.7 and precision at 0.9, while other reviewers have recall and precision scores as low as 0.1. This suggests that using more reliable reviewers, or (more to the point) a better review process, would lead to substantially more consistent and better quality review. In particular, the assessment process at TREC provided only for assessors to receive written instructions from the topic authority, not for the TA to actively manage the assessment process, by (for instance) performing an early check on assessments and correcting misconceptions of relevance or excluding unreliable assessors. Now, such supervision of review teams by overseeing attorneys may (regrettably) not always occur in real productions, but it should surely represent best practice.

Webber, W., How Accurate Can Manual Review Be? IREvalEtAl (12/15/11). Better review process and project management are key, which is the next part of the secret.

How to Be Better Than Borg

Webber’s research shows that some of the human reviewers in TREC stood out as better than Borg. They beat the machines. Does this really surprise anyone in the review industry? Sure, human review may be (should be) dead as a way to review all documents in large-scale reviews, but it is alive and well as the most reliable method for final check of computer suggested coding, a final check for classifications like privilege before production.

This is a picture of humans and machines working together as a team, as friends, but not as Borg implants where machines dictate, nor as human slaves where smart machines are not allowed. I know that George Socha, whom I quoted in Tell Me Why?, much like one of my fictional heroes, Jean-Luc Picard, was glad to escape the Borg enslavement. So too would most contract lawyers who are stuck in dead-end review jobs with cruel employers. By the way, this embarrassing, unprofessional, contract-lawyers-as-slaves mentality was shown dramatically by some of the reader comments to Contract Coders: e-Discovery’s “Wasting Asset”? They report incredible incidents of abuse by some law firms. Some of the private complaints I have heard from document reviewers about abuse and mismanagement are even worse than these public comments. The primary rule of any relationship must always be mutual respect. That applies to contract lawyers, and, if they are a part of your team, even to artificial intelligence agents like Watson, Siri, and their predictive coding cousins. Get to know and understand your entire team and to appreciate their respective strengths and weaknesses.

Webber’s study shows that the quality of the individual human reviewers on a team is paramount. He makes several specific recommendations in section 3.4 of his report for improving review team quality, including:

Dual assessment, for instance, can help catch random errors of inattention, while second review by an authoritative reviewer such as the supervising attorney can correct misconceptions of relevance during the review process, and adjust for assessor errors once it is complete [Webber et al., 2010]. …

[S]ignificant divergence from the median appears to be a partial, though not infallible, indicator of reviewer unreliability. A simple approach to improving review team quality is to exclude those reviewers whose proportion relevant are significantly different from the median, and re-apportion their work to the more reliable reviewers. …

Fully excluding reviewers based solely on the proportion of documents they find relevant is a crude technique. Nevertheless, the results of this section suggest that this proportion is a useful, if only partial, indicator of reliability, one which could be combined with additional evidence to alert review managers when their review process is diverging from a controlled state. It may be that review teams with better processes, such as the team from Topic 207, already use such techniques. Therefore, they need to be considered when a benchmark for manual review quality is being established, against which automatic techniques can be compared.
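To make the median-based check in the passages quoted above concrete, here is a minimal sketch in Python. It is not Webber's procedure: his report says only that significant divergence from the median is a partial indicator of unreliability. The function name, the use of the median absolute deviation (MAD) as the yardstick, the cutoff of 3, and the sample data are all assumptions made purely for illustration.

```python
from statistics import median


def flag_outlier_reviewers(proportion_relevant, cutoff=3.0):
    """Return the reviewers whose proportion of documents coded relevant
    lies far from the team median.

    "Far" here means more than `cutoff` median absolute deviations (MAD)
    from the median. Both the MAD yardstick and the default cutoff are
    illustrative assumptions, not values taken from Webber's report.
    """
    values = list(proportion_relevant.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Every reviewer at the median: anyone who differs at all stands out.
        return sorted(r for r, v in proportion_relevant.items() if v != med)
    return sorted(
        r for r, v in proportion_relevant.items()
        if abs(v - med) / mad > cutoff
    )


# Hypothetical team: share of each reviewer's documents coded relevant.
team = {"A": 0.10, "B": 0.12, "C": 0.11, "D": 0.09, "E": 0.45}
flagged = flag_outlier_reviewers(team)
print(flagged)
```

As the quoted text cautions, a flag like this is only a prompt for a supervising attorney to take a closer look, not proof that the reviewer is wrong.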

Webber’s conclusion summarizes his findings and bears close scrutiny, so I quote it here in full:

5. CONCLUSIONS. The original review from which Roitblat et al. draw their data cost $14 million, and took four months of 100-hour weeks to complete. The cost, effort, and delay underline the need for automated review techniques, provided they can be shown to be reliable. Given the strong disagreement between manual reviews, even some loss in review accuracy might be acceptable for the efficiency gained. If, though, automated methods can conclusively be demonstrated to be not just cheaper, but more reliable, than manual review, then the choice requires no hesitation. Moreover, such an achievement for automated text-processing technology would mark an epoch not just in the legal domain, but in the wider world.

Two recent studies have examined this question, and advanced evidence that automated retrieval is at least as consistent as manual review [Roitblat et al., 2010], and in fact seems to be more reliable [Grossman and Cormack, 2011]. These results are suggestive, but (we argue) not conclusive as they stand. For the latter study in particular (leaving questions of potential bias in the appeals process aside), it is questionable whether the assessment processes employed in the track truly are representative of a good quality manual review process.

We have provided evidence of the greatly varying quality of reviewers within each review team, indicating a lack of process control (unsurprising since for four of the seven topics the reviewers were not a genuine team). The best manual reviewers were found to be as good as the best automated systems, even with the asymmetry in the evaluation setup. The one, professional team that does manage greater internal consistency in their assessors is also the one team that, as group, outperforms the best automated method. We have also pointed out a simple, statistically based method for improving process control, by observing the proportion of documents found relevant by each assessor, and counseling or excluding those who appear to be outliers.

Above all, it seems that previous studies (and this one, too) have not directly addressed the crucial question, which is not how much different review methods agreed or disagree with each other (as in the study by Roitblat et al. [2010]), nor even how close automated or manual review methods turn out to have come to the topic authority’s gold standard (as in the study by Grossman and Cormack [2011]). Rather, it is this: which method can a supervising attorney, actively involved in the process of production, most reliably employ to achieve their overriding goal, to create a production consistent with their conception of relevance. There is good, though (we argue) so far inconclusive, evidence that an automated method of production can be as reliable a means to this end as a (much more expensive) full manual review. Quantifying the tradeoff between manual effort and automation, and validating protocols for verifying the correctness of either approach in practice, are particularly relevant in the multi-stage, hybrid work-flows of contemporary legal review and production. Given the importance of the question, we believe that it merits the effort of a more conclusive empirical answer.

The evidence shows that it is at least very difficult, perhaps even impossible (I await more science to form a definite opinion), for us humans to maintain the concentration necessary to review tens of thousands of documents, day in and day out, for weeks. Sure we can do it for a few hours, and for 500 or so documents, but for 8-10 hours a day with tens or hundreds of thousands of documents for weeks on end? I doubt it. We need help. We need suggestive coding. We need a team that includes smart computers.

Know Your Team’s Strengths and Weaknesses

The challenge to human reviewers becomes ridiculously hard when you ask them to not only make relevancy calls, but, at the same time, to also make privilege calls, and confidentiality calls, and, here is the worst, multiple case issues categorization calls, a/k/a, issue tagging. Experience shows that the human mind cannot really handle more than five or six case issues at a time, at least when reviewing all day. But I keep hearing tales of lawyers asking reviewers to make ten to twenty case issue calls for weeks on end. If you think it is hard to get consistent relevancy calls, just think of the problem of putting relevant docs into ten to twenty buckets. Might as well throw darts. That is a scientific experiment I’d like to see, one testing the efficacy of case issue tags. How many categorizations can humans really handle before it becomes a complete waste of time?

I call on e-discovery lawyers everywhere to better understand their team members and stop asking them to do the impossible. Issue tagging must be kept simple and straightforward for the human members of your team to deal with it. Using ten to twenty case-issue tags is a complete waste of time, perhaps with the exception of seed-set training, as thereafter Watson has no such limitations. But insofar as the final, out-the-door review goes, do not encumber your humans with mission impossible tasks. Know your team members, their strengths and weaknesses. Know what the humans do best, like catch obvious bloopers beyond the ken of present day AI agents, and do not expect them to be as tireless as machines.

The review process improvements mentioned by Webber, and other safeguards touted by most professional review companies who truly understand and care about the strengths and weaknesses of their team, will certainly mitigate the problems inherent in all human review. In my mind the most important of these are experience, training, mutual respect, good working conditions, motivation, and quality controls, including quick terminations or reassignments when called for. More innovative methods are, I believe, just around the corner, such as game theory applications discussed by Lawrence Chapin in Contract Coders: e-Discovery’s “Wasting Asset”? But the bottom line will always be that computers are much better at complex repetitive drudgery tasks such as reviewing tens of thousands, or millions, of documents. Thankfully our minds are not designed for this, whereas computers are.

Reviewers Need Subject Matter Expertise and Money Motivation

Based on my experience as a reviewer and supervisor, the human challenges to make review determinations over large scales of data are magnified when the human reviewers are not themselves subject matter experts, and magnified even further when the reviewers have no experience in the process. This was not only true of all of the student volunteer reviewers at TREC, but is also sometimes true in real world practice as well. That is just inviting error. Training is part of the solution to that.

It is also my supposition that in our culture the errors are magnified again when there is no, or inadequate, compensation provided. All TREC reviewers were unpaid volunteers except for the professional review team members. They were paid by the companies they work for, although those companies were not paid, and the rate of pay to the individuals is unknown. Still, can you be surprised that the top reviewers, the ones who beat the machines, were all paid, and only a few of the student teams came close? In our culture money is a powerful motivator. That is another reason to have better funded experiments that come closer to real world conditions. The test subjects in our experiments should be paid.

The same principle applies in the real world too. Contract review companies should stop competing on price alone and we consumers should stop being fooled by that. Quality is job number one, or should be. Do you really think the company with the lowest price is providing the best service? Do you think their attorney reviewers don’t resent this kind of low pay, sometimes in the $15-$20 per hour range? Most of these lawyers have six-figure student loans to pay off. They deserve a fair wage and, I hypothesize, will perform better if they are paid better.

To test my money-motivation theory I’d love to see an experiment where one review team is paid $25 an hour, and another is paid $75. Be real and let them know which team they are on. Then ask both to review the same documents involving weeks of grueling, boring work. Add in the typical vagaries of relevance, and equal supervision and training, and then see which team does better. Maybe add another variation where there is a stick added to the carrot and you can be fired for too many mistakes. Anyone willing to fund such a study? A contract review company perhaps? (Doubtful!) Better yet, perhaps there is a tech company out there willing to do so, one that competes with cheap human review teams? They should be motivated by money to finance such research (why would most contract review companies want this investigated?). The research would, of course, have to be done by bona fide third-party scientists in a peer review setting. We don’t want the profit motive messing with the truth and objective science.

Secret of Sampling

There is one more fundamental thing you need to understand about the TREC tests, indeed all scientific tests, one which I suppose you could also call a secret since so few people seem to know it, and that is, no one, I repeat, no person, ever sat down and looked at all of the 685,592 documents under consideration in the 2010 TREC Legal Track interactive tasks. No one has ever looked at all of the documents in any TREC task. No person, much less a team of subject matter experts with three-pass reviews as I discussed in Part One, has determined the individual relevancy, or not, of all of these documents by which to judge the results of the software assisted reviews. All that happened (and I don’t mean that with a negative connotation), is that a random sample of the 685,592 documents was reviewed by a variety of people.

I have no trouble with sampling and do not think it really matters that only a random sample of the 685,592 corpus was reviewed. Sampling and math are the most powerful tools in every information scientist’s pocket. It seems like magic (much like the hash algorithms), but random sampling has been proven time and again to be reliable. For instance, a sample of 2,345 documents is needed to know the contents of 100,000, with a 95% confidence level and a +/-2% confidence interval. Yet for a collection of 1,000,000 with the same confidence levels, a sample of only 2,395 is required (just 50 more to sample 900,000 more documents). If you add another zero and seek to know about 10,000,000 documents, you need only sample 2,400.
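For readers who want to check those sample sizes, here is a minimal sketch in Python using the standard sample size formula with a finite population correction. The function name is made up for illustration, and two inputs are assumptions: a z-score of 1.96 for the 95% confidence level, and a worst-case prevalence of 50%, which is what calculators of this kind typically default to. Rounding to the nearest whole document reproduces the figures quoted above; a more cautious practice would round up.

```python
def sample_size(population, z=1.96, margin=0.02, p=0.5):
    """Sample size needed for a given margin of error.

    z      : z-score for the confidence level (1.96 is roughly 95%)
    margin : half-width of the confidence interval (0.02 means +/-2%)
    p      : assumed prevalence; 0.5 is the worst case (largest sample)
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks it slightly for smaller corpora.
    return round(n0 / (1 + (n0 - 1) / population))


for corpus in (100_000, 1_000_000, 10_000_000):
    print(corpus, sample_size(corpus))
```

The uncorrected figure works out to about 2,401, which is why the required sample barely moves as the corpus grows from one hundred thousand to ten million documents.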

To play with the metrics yourself I suggest you see the calculator at http://www.surveysystem.com/sscalc.htm. For a good explanation of sampling see: Application of Simple Random Sampling (SRS) in eDiscovery, Manuscript By Doug Stewart, submitted to the Organizing Committee of the Fourth DESI Workshop on Setting Standards for Electronically Stored Information in Discovery Proceedings on April 20, 2011. Sampling is important. As I have been saying for over two years now, all e-discovery software should include a sampling button as a basic feature. (Many vendors have taken my advice, and I keep asking some of them to whom I made specific demands, to now call the new feature the Ralph Button, but they just laugh. Oh well:)

If the Human Review is Unreliable, Then so is the Gold Standard

The problem with average human review and the comparative measurements of computer assisted alternatives is not with the sampling techniques used to measure. The problem is that if the sample set created by average Joe or Jane reviewer is flawed, then so is the projection. Sampling has the same weakness as AI agent software, including predictive coding seed sets. If the seeds selected are bad, then the trees they grow will be bad too. They won’t look at all like what you wanted and the errors will magnify as the trees grow. It is the same old problem of garbage in, garbage out. I addressed this in Part One of this article, in the section, The Second Search Secret (Known Only to a Few): The Gold Standard to Measure Review is Really Made Out of Lead, but it bears repetition. It is a critical point that has been swept under the carpet until now.

Like it or not, aside from a few top reviewers working with relatively small sets, like the champs in TREC, most human review of relevancy in large-scale reviews is basically garbage, unless it is very carefully managed and constantly safeguarded by statistical sampling and other procedures. Also, if there is no clear definition of relevance, or if relevance is a constantly moving target, or both as is often the case, then the reviewers’ work will be poor (inconsistent), no matter what methods you use. Note this clear understanding of relevance is often missing in real world reviews for a variety of reasons, including the requesting party’s refusal to clarify under mistaken notions of work product protection, vigorous advocacy, and the like.

Even in TREC, where they claim to have clear relevancy definitions and the review sets were not that large, I’m told by Webber that:

TREC assessors disagree with themselves between 15% to 19% of the times when shown the same document twice (due to undetected duplication in the corpus).

That’s right, the same reviewers looking at the same document at different times disagreed with themselves between 15% and 19% of the time. For authority Webber refers to: Scholer et al., Quantifying Test Collection Quality Based on the Consistency of Relevance Judgements. As you start adding multiple reviewers to a project the disagreement rates naturally get much higher. That is in accord with most everyone’s experience and the scientific tests. If people cannot agree with themselves on questions of relevance, how can you expect them to agree with others? Despite a few champs, human relevancy review is generally very fuzzy.

Some Things Can Still Be Seen Through the Fuzzy Lenses

The exception to the fuzzy measurements problem, which I noted in Part One, is that the measures are not too vague for purposes of comparison, at least that is what the scientists tell me. Also, and this is very important, when you add the utility measures of time and money to review evaluation, which in the real world of litigation we must do but which has not yet been done in scientific testing, and do not just rely on the abstract measures of precision and recall, then computer assisted review must always win, at least in large-scale projects. We never have the time and money to manually review hundreds of thousands, or millions, of documents, just because they are in the custody of a person of interest. I don’t care what kind of cheap, poor quality labor you use. As Jason Baron likes to point out, at a fast review speed of 100 files per hour, and a cost of $50 per hour for a reviewer, it would still take $500 Million and 10 Million hours to review the 1 Billion emails in the White House.
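The arithmetic behind that White House example is easy to reproduce. The short Python sketch below plugs in the speed, rate, and volume quoted above; the function name is hypothetical, and the model deliberately ignores everything except reviewer hours (no supervision, quality control, or hosting costs).

```python
def manual_review_cost(documents, docs_per_hour, hourly_rate):
    """Return (reviewer_hours, total_cost) for a purely manual, linear review."""
    hours = documents / docs_per_hour
    return hours, hours * hourly_rate


# Figures from the example above: 1 billion emails, 100 files per hour, $50 per hour.
hours, cost = manual_review_cost(1_000_000_000, 100, 50)
print(f"{hours:,.0f} hours, ${cost:,.0f}")
```

Changing the inputs to match a real project is a quick way to see how fast a linear manual review budget runs away as the corpus grows.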

When you consider the utility measures of time and cost, it is obvious that pure manual review is dead. Even our weak, fuzzy comparative testing lens shows that manual and computer review precision and recall are about equal, and maybe the computer is even leading (hard to tell with these fuzzy lenses on). But when you add the time and costs measures, the race is not even close. Computers are far faster and should also be much cheaper. The need for computer assisted review to cull down the corpus, and then assist in the coding, is painfully obvious. The EDI study of a $14 Million review project by all too human contract coders with an overlap rate of only 28% proved that. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.

Going for the Gold

The old gold standard of average human reviewers, working in dungeons <smile>, unassisted by smart technology, and not properly managed, has been exposed as a fraud. What else do you call a 28% overlap rate? We must now develop a new gold standard, a new best practice for big data review. And we must do so with the help and guidance of science and testing. The exact contours of the new gold are now under development in dozens of law firms, private companies, and universities around the world. Although we do not know all of the details, we know it will involve:

  1. Bottom Line Driven Proportional Review where the projected costs of review are estimated at the beginning of a project (more on this in a future blog);
  2. High quality tech assisted review, with predictive coding type software, and multiple expert review of key seed-set training documents using both subject matter experts (attorneys) and AI experts (technologists);
  3. Direct supervision and feedback by the responsible lawyer(s) (merits counsel) signing under 26(g);
  4. Extensive quality control methods, including training and more training, sampling, positive feedback loops, clever batching, and sometimes, quick reassignment or firing of reviewers who are not working well on the project;
  5. Experienced, well motivated human reviewers who know and like the AI agents (software tools) they work with;
  6. New tools and psychological techniques (e.g. game theory, story telling) to facilitate prolonged concentration (beyond just coffee, $, and fear) to keep attorney reviewers engaged and motivated to perform the complex legal judgment tasks required to correctly review thousands of usually boring documents for days on end (voyeurism will only take you so far);
  7. Highly skilled project managers who know and understand their team, both human and computer, and the new tools and techniques under development to help coach the team;
  8. Strategic cooperation between opposing counsel with adequate disclosures to build trust and mutually acceptable relevancy standards; and,
  9. Final, last-chance review of a production set before going out the door by spot checking, judgmental sampling (i.e. search for those attorney domains one more time), and random sampling.

I have probably missed a few key factors. This is a group effort and I cannot talk to everyone, nor read all of the literature. If you think I have missed something key here, please let me know. Of course we also need understanding clients who demand competence, and judges willing to get involved when needed to rein in intransigent non-cooperators and to enforce fair proportionality. Also, you should always go for confidentiality and clawback agreements and orders.

Technology Assisted Review

When I say technology assisted review, now a popular phrase, in the best practices list above, I mean the same thing as computer assisted review. I mean a review method where computerized processes are used to cull down the corpus, and then again to assist in the coding. In the first step technology is used to cull out final selections of documents from a larger corpus for humans to review before final production. The probably irrelevant documents are culled out and not subject to any further human reviews, except perhaps for quality control random sampling. Keyword search is one very primitive example of that computer assisted culling. Concept search is another more recent, advanced example. There are many others. Think for instance of Axcellerate’s 40 automatically populated filters, which they collectively refer to as their Predictive Analytics step that I described in Part One of Secrets of Search.

These days the software is so smart that technology assisted review can not only intelligently cull out likely irrelevant documents, it can also make predictions for how the remaining relevant documents should be categorized. That is the second step where all of the remaining documents are reviewed by software to predict key classifications like privileged, confidential, hot, and maybe even a few case specific issues. The software predicts how a human will likely code a document and batches documents out in groups accordingly. This predictive coding, combined with efficient document batching (putting into sets of documents for human review), makes the human review work easier and more efficient. For instance, one reviewer, or small review team, might be assigned all of the probable privileged documents, another the probable confidential for redaction, a third the probable hot documents, and the remaining documents divided into teams by case issue tags, or maybe by date, or custodian, all depending on the specifics of the case. It is an art, but one that can and should be measured and guided by science.

I contrast this kind of technology assisted review with pure Borg type computer controlled review, where there is complete computer delegation, where the computer does all, with little or no human involvement, except for the first seed set generation of relevancy patterns. Here we trust the AI agent and produce all documents determined to be relevant and not-privileged. No human does a double-check of the computer’s coding before the documents go out the door. In my opinion, we are still far away from such total delegation, although I don’t rule it out someday. (Resistance is futile.) Do you agree?

Is anyone out there relying on 100% computer review with no human eye quality controls? Conversely, as to the opposite, is there anyone out there who still uses pure (100%) human review? Who has humans (lawyers or paralegals) review all documents in a custodian collection (assuming, as you should, that there are thousands or tens of thousands of documents in the collection)? Is there anyone who does not rely on some little brother of Watson to review and cull out at least some of the corpus first?

More Research Please

The fuzzy standard of most human review is an inconvenient truth known to all information scientists. As we have seen, it has been known to TREC researchers since at least 2000 with the study by Ellen Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000).  Yet I for one have not heard much discussion about it. This flaw cuts to the core of information science, because without accurate, objective measurements, there can be no science. For that reason scientists have come up with many techniques to try to overcome the inherent fuzziness of relevancy determinations, in and outside of legal search. I concede they are making progress, and TREC legal track is, for instance, getting better every year, but, like Voorhees and Webber, I insist there is still a long way to go.

Maybe the best software programs (whatever they are) are far better than our best reviewers under ideal conditions (that’s what I think), maybe not. But the truth is, we don’t really know what our real precision and recall rates are now, we don’t really know how much of the truth we are finding. The measures are, after all, so vague, so human dependent. What are we to make of our situation in legal review where the Roitblat et al study shows an overlap rate of only 28%? Here is Webber’s more precise information science language explanation that he made in reviewing my blog article in his blog:

The most interesting part of Ralph’s post, and the most provocative, both for practitioners and for researchers, arises from his reflections on the low levels of assessor agreement, at TREC and elsewhere, surveyed in the background section of my SIRE paper. Overlap (measured as the Jaccard coefficient; that is, size of intersection divided by size of union) between relevant sets of assessors is typically found to be around 0.5, and in some (notably, legal) cases can be as low as 0.28. If one assessor were taken as the gold standard, and the effectiveness of the other evaluated against it, then these overlaps would set an upper limit on F1 score (harmonic mean of precision and recall) of 0.66 and 0.44, respectively. Ralph then provocatively asks, if this is the ground truth on which we are basing our measures of effectiveness, whether in research or in quality assurance and validation of actual productions, then how meaningful are the figures we report? At the most, we need to normalize reported effectiveness scores to account for natural disagreement between human assessors (something which can hardly be done without task-specific experimentation, since it varies so greatly between tasks). But if our upper bound F1 is 0.66, then what are we to make of rules-of-thumb such as “75% recall is the threshold for an acceptable production”?
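The step from overlap to F1 in the passage quoted above can be checked with a little algebra. If one assessor's relevant set is treated as the gold standard and the other's is scored against it, then with the overlap J measured as a Jaccard coefficient, F1 = 2J / (1 + J). The Python sketch below, with function names chosen only for illustration, computes the overlap for two sets of document IDs and converts it to F1. It yields roughly 0.67 for an overlap of 0.5 and 0.44 for an overlap of 0.28, in line with the 0.66 and 0.44 quoted.

```python
def jaccard(set_a, set_b):
    """Overlap: size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def f1_from_jaccard(j):
    """F1 of one assessor scored against the other, given their overlap j."""
    return 2 * j / (1 + j)


# Hypothetical document IDs marked relevant by two reviewers.
reviewer_one = {1, 2, 3}
reviewer_two = {2, 3, 4}
overlap = jaccard(reviewer_one, reviewer_two)

print(overlap, f1_from_jaccard(overlap))
print(f1_from_jaccard(0.28))
```

The relationship follows from counting the shared documents once: F1 is twice the intersection divided by the sum of the two set sizes, while the Jaccard coefficient divides the same intersection by the size of the union.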

As Webber well knows, this means that such 75% or higher rules-of-thumb for acceptable recall are just wishful thinking. It means they should be disregarded because they are counter to the actual evidence of measurement deficiencies. The evidence instead shows that in legal review the maximum F1 score (the harmonic mean of precision and recall) that can be measured objectively is only 44%. Demands in litigation for objective search recall rates higher than 44% fly in the face of the EDI study. Such demands are unreasonable on their face, never mind the legal precedent for accepting keyword search or manual review. I understand that the research also shows that technology assisted reviews are at least as good as manual, but that begs the real question as to how good either of them is!

I personally find it hard to believe that with today’s technology assisted reviews we are not in fact doing much better than 44% or 66% recall, but then I think back to the lawyers in the 1980s in the Blair and Maron study: We are confident our search terms uncovered 75% of the relevant evidence. Well, who knows, maybe they did, but the measurements were wrong. Who knows how well any of us are doing in big data reviews? The fuzziness of the measures is an inconvenient truth that must be faced. The 44% max objective rate creates a lack of confidence interval that must be corrected. We have to significantly improve the gold standard; we have to upgrade the quality of reviews used for measurements.

This is one reason I call for more research, and better funded research. We need to know how much of the truth we are finding; we need a recall rate we can count on to do justice. Large corporations should especially step up to the plate and fund pure scientific research, not just product development. I trust you that it works, but, as President Reagan said, I still want you to verify. I still want you to show me exactly how well it works, and I want you to do it with objective, peer-reviewed science, and to use a gold standard that I can trust.

Trust But Verify

As it now stands, the confidence levels are too low and the error margins too wide for me to entirely trust Watson, much less his little brothers. The computer was, after all, trained by humans, and they can be unreliable. Garbage in, garbage out. I will only trust a computer trained by several humans, checking against each other, and all of them experts, well paid experts at that. Even then, I’d like to have a final expert review of the documents finally selected for production before they actually go out the door. After all, the determinations and samples are based on all too human judgments. If the stakes are high, and they usually are in litigation, especially where privileges and confidential information are involved, there needs to be a final check before documents are produced. That is the true gold standard in my world. Do you agree? Please leave a comment below.

Apology and Holiday Greetings from Ralph

Now I must apologize to my readers. I promised a two-part blog on Secrets of Search where the deepest secret would be revealed in Part Two, along with the seventh insight into why most lawyers in the world do not want to do e-discovery. But admit it, this Part Two is already too long, isn’t it (over 7,100 words)? How long can we mere mortals maintain our attention on this stuff? You already have a lot to think about here. So, it looks like I lied before. It now seems to me better to wait and finish this article in a Part Three, rather than ask you to read on and on.

So stay tuned friends, I promise this soap opera will finally come to a conclusion next time, when we are all much fresher and finally ready to hear the truth, the whole truth, and nothing but the truth about the secrets of search. (And yes, I really have four monitors at my desk, actually I have five when you include my personal MacBook Pro, which is by far my favorite computer.) Oh yeah, and the next blog may be late too. We’ll see how busy Santa keeps me. Happy Holidays!

