e-Discovery Team’s 2016 TREC Report: Once Again Proving the Effectiveness of Our Standard Method of Predictive Coding

February 24, 2017

Our Team’s Final Report of its participation in the 2016 TREC ESI search Conference has now been published online by NIST and can be found here, and the final corrected version can be found here. TREC stands for Text Retrieval Conference. It is co-sponsored by a group within the National Institute of Standards and Technology (NIST), which in turn is an agency of the U.S. Commerce Department. The stated purpose of the annual TREC conference is to encourage research in information retrieval from large text collections.

The other co-sponsor of TREC is the United States Department of Defense. That’s right, the DOD is the official co-sponsor of this event, although TREC almost never mentions that. Can you guess why the DOD is interested? No one talks about it at TREC, but I have some purely speculative ideas. Recall that the NSA is part of the DOD.

We participated in one of several TREC programs in both 2015 and 2016, the one closest to legal search, called the Total Recall Track. The leaders and administrators of this Track were Professors Gordon Cormack and Maura Grossman. They also participated each year in their own Track.

One of the core purposes of all of the Tracks is to demonstrate the robustness of core retrieval technology. Moreover, one of the primary goals of TREC is:

[T]o speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems.

Our participation in TREC in 2015 and 2016 has demonstrated substantial improvements in retrieval methodologies. That is what we set out to do. That is the whole point of the collaboration between the Department of Commerce and Department of Defense to establish TREC.

The e-Discovery Team has a commercial interest in participation in TREC, not a defense or police interest. Still, judging from the FBI’s struggles to search email last year, the federal government needs help. We were very unimpressed by the FBI’s prolonged efforts to review the Clinton email collection. I was one of the few e-discovery lawyers to correctly call the whole Clinton email server “scandal” a political tempest in a teapot. I still do, and I am still outraged by how her email review was handled by the FBI, especially with the last-minute “revelations.”

The executive agencies of the federal government have been conspicuously absent from TREC. They seem incapable of effective search, which may well be a good thing. Still, we have to believe that the NSA and other defense agencies are able to do a far better job at large-scale search than the FBI. Consider their ongoing large-scale metadata and text interception efforts, including the once Top Secret PRISM operation. Maybe it is a good thing the NSA does not share its abilities with the FBI, especially these days. Who knows? We certainly will not.

The e-Discovery Team’s commercial interest is to transfer Predictive Coding technology from our research labs into commercial products, namely to transfer our Predictive Coding 4.0 Method using Kroll Discovery EDR software to commercial products. In our case, at the present time, “commercial products” means our search methods, time and consultations. But who knows, it may be reduced to a robot product someday, like our Mr. EDR.

The e-Discovery Team method can be used on other document review platforms as well, not just Kroll’s, but only if they have strong active machine learning features. Active machine learning is what everyone at TREC was testing, although we appear to have been the only participant to focus on a particular method of operation. And we were the only team led by a practicing attorney, not an academic or software company. (Catalyst also fielded a team in 2015 and 2016, headed by Information Science Ph.D. Jeremy Pickens.)

The e-Discovery Team wanted to test the hybrid multimodal software methods we use in legal search to demonstrate substantial improvements in retrieval methodologies on real-world problems. We have now done so twice, participating in both the 2015 and 2016 Total Recall Tracks. The results in 2016 were even better than in 2015. We obtained remarkable results in document review speed, recall and precision, although, as we admit, the search challenges presented at TREC 2016 were easier than most projects we see in legal discovery. Still, to use the quaint language of TREC, we have demonstrated the robustness of our methods and software.

These demonstrations, and all of the reporting and analysis involved, have taken hundreds of hours of our time, but there was no other venue around to test our retrieval methodologies on real-world problems. The demonstrations are now over. We have proven our case. Our standard Predictive Coding method has been tested and its effectiveness demonstrated. No one else has tested and proven their predictive coding methods as we have done. We have proven that our hybrid multimodal method of AI-Enhanced document review is the gold standard. We will continue to make improvements in our method and software, but we are done with participation in federal government programs to prove our standard, even one run by the National Institute of Standards and Technology.


To prove our point that we have now demonstrated substantial improvements in retrieval methodologies, we quote below Section 5.1 of our official TREC report, but we urge you to read the whole thing. It is 164 pages. This section of our report covers our primary research question only. We investigated three additional research questions not included below.


Section 5.1 First and Primary Research Question

What Recall, Precision and Effort levels will the e-Discovery Team attain in TREC test conditions over all thirty-four topics using the Team’s Predictive Coding 4.0 hybrid multimodal search methods and Kroll Ontrack’s software, eDiscovery.com Review (EDR)?

Again, as in the 2015 Total Recall Track, the Team attained very good results with high levels of Recall and Precision in all topics, including perfect or near perfect results in several topics using the corrected gold standard. The Team did so even though it only used five of the eight steps in its usual methodology, intentionally severely constrained the amount of human effort expended on each topic and worked on a dataset stripped of metadata. The Team’s enthusiasm for the record-setting results, which were significantly better than its 2015 effort, is tempered by the fact that the search challenges presented in most of the topics in 2016 were not difficult and the TREC relevance judgments had to be corrected in most topics.  …

This next chart uses the corrected standard. It is the primary reference chart we use to measure our results. Unfortunately, it is not possible to make any comparisons with BMI standards because we do not know the order in which the BMI documents were submitted.



The average results obtained across all thirty-four topics at the time of reasonable call using the corrected standard are shown below in bold. The average scores using the uncorrected standard are shown for comparison in parentheses.

  • 91.57% Recall (75.46%)
  • 65.90% Precision (57.12%)
  • 76.64% F1 (57.69%)
  • 124 Docs Reviewed Effort (124)
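F1 is the harmonic mean of precision and recall. As a quick check, applying it to the averaged corrected figures above reproduces the reported 76.64%. Note that an average of per-topic F1 scores need not equal the F1 of averaged precision and recall, so this shortcut is not guaranteed to reproduce every average in the list. A minimal Python sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average corrected-standard figures reported above.
avg_precision = 0.6590
avg_recall = 0.9157

print(f"{f1_score(avg_precision, avg_recall):.2%}")  # 76.64%
```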

At the time of reasonable call the Team had recall scores greater than 90% in twenty-two of the thirty-four topics and greater than 80% in five more topics. Recall of greater than 95% was attained in fourteen topics. These Recall scores under the corrected standard are shown in the below chart. The results are far better than we anticipated, including six topics with total recall – 100%, and two topics with both total recall and perfect precision, topic 417 Movie Gallery and topic 434 Bacardi Trademark.


At the time of reasonable call the Team had precision scores greater than 90% in thirteen of the thirty-four topics and greater than 75% in three more topics. Precision of greater than 95% was attained in nine topics. These Precision scores under the corrected standard are shown in the below chart. Again, the results were, in our experience, incredibly good, including three topics with perfect precision at the time of the reasonable call.


At the time of reasonable call the Team had F1 scores greater than 90% in twelve of the thirty-four topics and greater than 75% in two more. F1 of greater than 95% was attained in eight topics. These F1 scores under the corrected standard are shown in the below chart. Note that there were two topics with a perfect score, Movie Gallery (100%) and Bacardi Trademark (100%), and three more that were near perfect: Felon Disenfranchisement (98.5%), James V. Crosby (97.57%), and Elian Gonzalez (97.1%).


We were lucky to attain two perfect scores in 2016 (we attained one in 2015), in topic 417 Movie Gallery and topic 434 Bacardi Trademark. The perfect score of 100% F1 was obtained in topic 417 by locating all 5,945 documents relevant under the corrected standard after reviewing only 66 documents. This topic was filled with form letters and was a fairly simple search.

The perfect score of 100% F1 was obtained in topic 434 Bacardi Trademark by locating all 38 documents relevant under the corrected standard after reviewing only 83 documents. This topic had some legal issues involved that required analysis, but the reviewing attorney, Ralph Losey, is an SME in trademark law, so this did not pose any problems. The issues were easy and not critical to understanding relevance. This was a simple search involving distinct language and players. All but one of the 38 relevant documents were found by tested, refined keyword search. One additional relevant document was found by a similarity search. Predictive coding searches were run after the keyword searches and nothing new was uncovered. Here machine learning merely performed a quality assurance role to verify that all relevant documents had indeed been found.

The Team proved once again, as it did in 2015, that perfect recall and perfect precision are possible, albeit rare, using the Team’s methods on fairly simple search projects.

The Team’s top ten projects attained remarkably high scores, with an average Recall of 95.66%, average Precision of 97.28% and average F1 of 96.42%. The top ten are shown in the chart below.


In addition to Recall, Precision and F1, the Team per TREC requirements also measured the effort involved in each topic search. We measured effort by the number of documents that were actually human-reviewed prior to submission and coded relevant or irrelevant. We also measured effort by the total human time expended for each topic. Overall, the Team human-reviewed only 6,957 documents to find all the 34,723 relevant documents within the overall corpus of 9,863,366 documents. The total time spent by the Team to review the 6,957 documents, and do all the search and analysis and other work using our Hybrid Multimodal Predictive Coding 4.0 method, was 234.25 hours.

It is typical in legal search to try to measure the efficiency of a document review by the number of documents classified by an attorney in an hour. For instance, a typical contract review attorney can read and classify an average of 50 documents per hour. The Team classified 9,863,366 documents by review of 6,957 documents taking a total time of 234.25 hours. The Team’s overall review rate for the entire corpus was thus 42,106 files per hour (9,863,366/234.25).
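The rate above is the corpus size divided by total hours. A short sketch of that arithmetic, using only the figures reported in this section; note that it counts documents classified, while the share of documents a person actually read is far smaller:

```python
# Figures reported for the 34 topics of the 2016 Total Recall Track.
corpus_size = 9_863_366   # documents classified
docs_reviewed = 6_957     # documents actually human-reviewed
team_hours = 234.25       # total Team time, all topics

# Throughput counts documents classified, not documents read.
classification_rate = corpus_size / team_hours
reviewed_share = docs_reviewed / corpus_size

print(f"{classification_rate:,.0f} documents classified per hour")  # 42,106 documents classified per hour
print(f"{reviewed_share:.3%} of the corpus was human-reviewed")
```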

In legal search it is also typical, indeed mandatory, to measure the costs of review and bill clients accordingly. If we here assume a high attorney hourly rate of $500 per hour, then the total cost of the review of all 34 Topics would be $117,125. That is a cost of just over $0.01 per document. In a traditional legal review, where a lawyer reviews one document at a time, the cost would be far higher. Even if you assume a low attorney rate of $50 per hour and a review speed of 50 files per hour, the total cost to review every document for every issue would be $9,863,366. That is a cost of $1.00 per document, which is actually low by legal search standards.[13]
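The cost comparison above can be reproduced from the reported totals. In this sketch the $500 and $50 hourly rates and the 50-documents-per-hour speed are the assumptions stated in the paragraph, not measured values:

```python
corpus_size = 9_863_366   # documents classified in the 34 topics
team_hours = 234.25       # total Team time, all topics

# Team method, billed at an assumed high rate of $500/hour.
team_cost = team_hours * 500
team_cost_per_doc = team_cost / corpus_size

# Linear review at an assumed low $50/hour and 50 documents/hour: $1.00 per document.
linear_cost_per_doc = 50 / 50
linear_cost = corpus_size * linear_cost_per_doc

print(f"Team:   ${team_cost:,.0f} total, ${team_cost_per_doc:.4f} per document")
print(f"Linear: ${linear_cost:,.0f} total, ${linear_cost_per_doc:.2f} per document")
```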

Analysis of project duration is also very important in legal search. Instead of the 234.25 hours expended by our Team using Predictive Coding 4.0, traditional linear review would have taken 197,267 hours (9,863,366/50). In other words, the review of thirty-four projects, which we did part-time, after work, over one summer, would have taken a team of two lawyers using traditional methods, working 8 hours a day, every day, over 33 years! These kinds of comparisons are common in legal search.
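The duration figure assumes two lawyers, each reviewing 50 documents per hour, 8 hours a day, every day of the year. Under those stated assumptions, the arithmetic is:

```python
corpus_size = 9_863_366  # documents in the 34 topics
docs_per_hour = 50       # assumed contract-reviewer speed
lawyers = 2
hours_per_day = 8        # working every day, no weekends off

linear_hours = corpus_size / docs_per_hour
calendar_days = linear_hours / (lawyers * hours_per_day)
years = calendar_days / 365

print(f"{linear_hours:,.0f} hours, or about {years:.1f} years")  # 197,267 hours, or about 33.8 years
```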

Detailed descriptions of the searches run in all thirty-four topics are included in the Appendix.


We also reproduce below Section 1.0, Summary of Team Efforts, from our 2016 TREC Report. For more information on what we learned in the 2016 TREC, see also Complete Description in 30,114 Words and 10 Videos of the e-Discovery Team’s “Predictive Coding 4.0” Method of Electronic Document Review. Nine new insights that we learned in the 2016 research are summarized in the diagram below and described more specifically in the article.



Excerpt From Team’s 2016 Report

1.1 Summary of Team’s Efforts. The e-Discovery Team’s 2016 Total Recall Track Athome project started June 3, 2016, and concluded on August 31, 2016. Using a single expert reviewer in each topic the Team classified 9,863,366 documents in thirty-four review projects.

The topics searched in 2016 and their issue names are shown in the chart below. Also included are the first names of the e-Discovery Team member who did the review for that topic, the total time spent by that reviewer and the number of documents manually reviewed to find all of the relevant documents in that topic. The total time of all reviewers on all projects was 234.25 hours. All relevant documents, totaling 34,723 by Team count, were found by manual review of 6,957 documents. The thirteen topics in red were considered mandatory by TREC and the remaining twenty-one were optional. The e-Discovery Team did all topics.


They were all one-person, solo efforts, although there was coordination and communications between Team members on the Subject Matter Expert (SME) type issues encountered. This pertained to questions of true relevance and errors found in the gold standard for many of these topics. A detailed description of the search for each topic is contained in the Appendix.

In each topic the assigned Team attorney personally read and evaluated for true relevance every email that TREC returned as a relevant document, and every email that TREC unexpectedly returned as irrelevant. Some of these were read and studied multiple times before we made our final calls on true relevance, determinations that took into consideration and gave some deference to the TREC assessor adjudications, but were not bound by them. Many other emails that the Team members considered irrelevant, and TREC agreed, were also personally reviewed as part of their search efforts. As mentioned, there were sometimes consultations and discussions between Team members as to the unexpected TREC opinions on relevance.

This contrasts sharply with participants in the Sandbox division. They make no effort to determine where, or why, their software made errors in predicting relevance. They accept as a matter of faith the correctness of all of TREC’s prior relevance assessments. To these participants, who were all academic institutions, the ground truth itself as to relevance or not was of no relevance. Apparently, that did not matter to their research.

All thirty-four topics presented search challenges to the Team that were easier, some far easier, than the Team typically faces as attorneys leading legal document review projects. (If the Bush email had not been altered by omission of metadata, the searches would have been even easier.) The details of the searches performed in each of the thirty-four topics are included in the Appendix. The search challenges presented by these topics were roughly equivalent to the most simplistic challenges that the e-Discovery Team might face in projects involving relatively simple legal disputes. A few of the search topics in 2016 included quasi-legal issues, more than were found in the 2015 Total Recall Track. This is a revision that the Team requested and appreciated because it allowed some, albeit very limited, testing of legal judgment and analysis in determination of true relevance in these topics. In legal search relevancy, legal analysis skills are obviously very important. In most of the 2016 Total Recall topics, however, no special legal training or analysis was required for a determination of true relevance.

At Home participants were asked to track and report their manual efforts. The e-Discovery Team did this by recording the number of documents that were human reviewed and classified prior to submission. More were reviewed after submission as part of the Team’s TREC relevance checking. Virtually all documents human reviewed were also classified, although not all documents classified were used for active training of the software classifier. The Team also tracked effort by number of attorney hours worked, as is traditional in legal services. Although the amount of time varied somewhat by topic, the average time spent per topic was only 6.89 hours. The average review and classification speed across all projects was 42,106 files per hour (9,863,366/234.25).

Again, for the full picture and complete details of our work, please see the complete 164-page report to TREC of the e-Discovery Team’s Participation in the 2016 Total Recall Track.





e-Discovery Team’s 2015 TREC Report

December 8, 2015

Here the e-Discovery Team will explain how and what they did in the 2015 Text Retrieval Conference (TREC) conducted by the National Institute of Standards and Technology (NIST). In 2015 TREC had eight different types of text retrieval experiments going on; each one is called a Track. Our Team participated in one of them, the Total Recall Track, using our Hybrid Multimodal method of CAR (Computer Assisted Review) and Kroll Ontrack software.

The Total Recall Track Coordinators were Adam Roegiest, University of Waterloo; Gordon V. Cormack, University of Waterloo; Maura R. Grossman, Wachtell Lipton Rosen & Katz; and, Charles L. A. Clarke, University of Waterloo. Our team was composed of Ralph Losey, Jackson Lewis; Jim Sullivan, Kroll Ontrack; Tony Reichenberger, Kroll Ontrack; and Mr. EDR. We played in all three Total Recall AtHome games, the only participant to do so in 2015:

  1. AtHome1 – Jeb Bush Email with a collection of 290,099 emails;
  2. AtHome2 – Black Hat World Forums with a collection of 465,149 forum posts (use caution in visiting these sites); and,
  3. AtHome3 – Local News Reports with a collection of 902,434 articles.

We did not play in the fully automated TREC Recall Sandbox this year, but maybe next year.

The primary purpose of the e-Discovery Team’s participation in the TREC 2015 Total Recall Track was to test the effectiveness of our legal search methods and software. As to software, we used Kroll Ontrack’s EDR. As to methods, we used a modified form of Ralph Losey’s hybrid multimodal method. Due to the way the Total Recall Track tests were set up, we could not use our standard eight-step hybrid multimodal workflow. This is the method we normally use in lawsuits to find evidence. It is shown in the animation below.

We had to depart from this standard workflow in order to fit the experiments we were given. Plus, like all of the other participants in the Total Recall Track, we were working under very tight time limitations. We had 45 days to complete 30 different document reviews involving a total of 16,576,820 documents. Some of the variations from our standard workflow were made just to save time. That is, for instance, the primary reason we skipped Step Three – Random Baseline and Step Seven – ZEN Quality Assurance Tests. We had to skip our usual Step One – ESI Discovery Communications because of the way the experiments were set up. There was no one to communicate with. It was all automated. The standard eight-step workflow, showing the three steps we omitted entirely in TREC, is shown next.
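The 16,576,820 total appears to follow from classifying every document in a collection once per topic, with ten topics run against each of the three AtHome collections listed earlier. A quick check of that arithmetic, and of the resulting daily average over 45 days (the per-topic multiplication is our reading of how the total was derived):

```python
# Collection sizes of the three AtHome tasks.
collections = {
    "AtHome1 Jeb Bush Email": 290_099,
    "AtHome2 Black Hat World Forums": 465_149,
    "AtHome3 Local News Reports": 902_434,
}
TOPICS_PER_COLLECTION = 10
DAYS = 45

# Each topic classifies its whole collection, so sizes are counted once per topic.
total_docs = sum(size * TOPICS_PER_COLLECTION for size in collections.values())

print(f"{total_docs:,} documents across 30 reviews")              # 16,576,820 documents across 30 reviews
print(f"{total_docs / DAYS:,.0f} documents per day on average")  # 368,374 documents per day on average
```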

Predictive Coding Search diagram by Ralph Losey

Our usual first step of ESI Discovery Communications normally takes hours to learn and refine the scope of relevance of the search project. In the Total Recall Track there were no communications as to what we were searching for, except for two- or three-word descriptions of what was relevant, given at the beginning of a search. That was it! We were then supposed to learn and deduce the scope of relevance from the instant feedback given to our document submissions to TREC. This was a kind of micro-feedback process that we never see in real search, especially legal search, but it did make for an interesting experiment, especially for the TREC Recall Sandbox participants. (We did not participate in that.) They used fully automated processes and never even looked at any documents.

By omitting our standard Step Three, the upfront Random Sample step that we normally use to determine prevalence, the Team saved a great deal of time. This omission did, however, make it harder to estimate when the Team thought they had found 90% of the relevant documents — 90% recall. Our ultimate goal for each of the thirty topics was to get as high a recall point as possible, with 90% representing an ideal, before determining that we made what we considered to be a reasonable effort (the point where we would normally stop a real-world legal review and begin quality assurance tests, including ei-Recall). It was a pleasant surprise for our Team to see how often we would exceed this 90% recall goal, even under the new and difficult conditions TREC presented, and even without our normal steps One, Three and Seven.
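A minimal sketch of why the upfront random sample matters: a spot projection of prevalence gives an estimated total number of relevant documents, against which recall can be tracked as the search proceeds. All numbers below are hypothetical, and the function names are illustrative; this is not the Team’s actual Step Three procedure.

```python
def projected_relevant(sample_relevant: int, sample_size: int, corpus_size: int) -> float:
    """Spot projection: prevalence in a simple random sample times corpus size."""
    return sample_relevant / sample_size * corpus_size


def estimated_recall(found_so_far: int, projected_total: float) -> float:
    """Share of the projected relevant documents found so far."""
    return found_so_far / projected_total


# Hypothetical figures for illustration only; not taken from the TREC runs.
corpus_size = 290_099
projected_total = projected_relevant(sample_relevant=12, sample_size=1_500,
                                     corpus_size=corpus_size)

for found in (1_500, 2_000, 2_200):
    recall = estimated_recall(found, projected_total)
    status = "meets 90% goal" if recall >= 0.90 else "keep searching"
    print(f"found {found:,}: estimated recall {recall:.0%} ({status})")
```

A point estimate like this ignores sampling error; at low prevalence the uncertainty around it can be wide, so in practice it would be paired with a confidence interval.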

Again, to save time and be able to do thirty large document review projects in only forty-five days, we skipped most of the Zero Error Numerics procedures too. We typically follow this Seventh Step for quality control purposes, including a random sampling analysis, ei-Recall. It is risky to omit these procedures, which is why we never cut these corners when doing real legal work for clients. But we had to do so in order to timely review 16,576,820 documents in 45 days. That is an average of 368,374 documents every day.
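ei-Recall, as we understand it, estimates recall as a range rather than a single point by taking a random sample of the documents coded irrelevant (the null set) once the search is done. The sketch below follows that general idea under stated assumptions: the Wilson score interval, the 95% level, and every name and number are our own illustrative choices, and the published ei-Recall procedure may compute its interval differently.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def elusion_recall_range(true_positives: int, null_set_size: int,
                         sample_relevant: int, sample_size: int) -> tuple[float, float]:
    """Recall range from a random sample of the documents coded irrelevant.

    The elusion-rate interval is projected onto the null set to bound the number
    of missed relevant documents (false negatives); recall = TP / (TP + FN).
    """
    low_rate, high_rate = wilson_interval(sample_relevant, sample_size)
    fn_low, fn_high = low_rate * null_set_size, high_rate * null_set_size
    return (true_positives / (true_positives + fn_high),
            true_positives / (true_positives + fn_low))


# Hypothetical numbers for illustration only.
low, high = elusion_recall_range(true_positives=2_000, null_set_size=285_000,
                                 sample_relevant=2, sample_size=1_500)
print(f"estimated recall between {low:.1%} and {high:.1%}")
```

Skipping this step saves the time needed to draw and review the elusion sample, but it also removes the only direct check on how many relevant documents were left behind.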

Although the e-Discovery Team robot, Mr. EDR, was only down for an hour for maintenance, the humans on the team had to sleep several hours each day. They had other tasks to perform too, namely their day jobs, so most of their work was done at night and on weekends. As a legal review team we are used to hard work, deadlines and project management challenges. Still, completing 30 projects in 45 days was the biggest challenge any of us had ever faced. The entire Team, and support staff, were given awards like the one shown below when they completed the challenge.


Even without all of the quality control procedures we normally use, and the random baseline in step three, we did very well. We made some mistakes, especially at first. But after we got a better idea as to how the games should be played from the first few topics we tackled, we sped through the remaining topics with only minor errors in a few. It was possible because only the top, most experienced players on the e-Discovery Team worked on the TREC games. That included e-Discovery Team coach, Ralph Losey, who led 17 topics, and three of Kroll Ontrack’s best experts, Team Captain Jim Sullivan (aka Mr. Perfect), who led 8 topics, Tony Reichenberger, who led 4, and Joe White, who led 1, plus a host of other supporting staff. The help of Kroll Ontrack team players made a huge difference, plus, so Coach says, did Mr. EDR, the Kroll Ontrack software.

Here is the Poster we prepared and presented at the TREC event. Most every participant included a poster like this to summarize their experiments. As you can see the poster summarizes our achievements and our main experiment.

Yes, you read the poster right. In one of the thirty topics we did, Topic 3484, we got a perfect score of 100% Recall and 100% Precision! We are pretty proud of that one. (And now you know how Jim Sullivan got his Mr. Perfect nickname.) More information on what we did will be forthcoming soon.

We cannot compare how we did with the other participants until after TREC makes its full, final report in early 2016. Those are the TREC guidelines. At that time we will put together our official Final Report to let everyone know the full metrics, analysis of how we did, comparisons with the scores of others, the final answers to our three research questions, and provide a full description of our search efforts on all thirty topics. In the meantime, here is more information that we can provide at this time as to what the e-Discovery Team did in TREC 2015.

e-Discovery Team Members

The Team is made up of three attorneys — Ralph Losey, Jim Sullivan, and Tony Reichenberger – and one “robot,” Mr. EDR. The Team members are all attorneys who specialize in legal search and review of Electronically Stored Information (ESI). They are experts in the search for evidence in large unorganized data collections, such as corporate email, text messages, and social media. Their work as search attorneys is typically performed in the context of legal proceedings. Their work includes the search, review, analysis, classification, production and admission of ESI as evidence in courts in the United States and other countries.


The Team leader is Ralph C. Losey, J.D., a practicing attorney, shareholder and National e-Discovery Counsel of Jackson Lewis P.C., a U.S. law firm with over 800 attorneys and fifty-five offices. He has over 35 years of legal experience, including document reviews. Losey is also a speaker and blogger at e-DiscoveryTeam.com, where he has written over a million words on e-discovery. Over the past four years, Losey has participated in multiple public and private experiments testing predictive coding methods and has written over fifty articles on the subject.

Jim Sullivan, J.D. (shown left above) and Tony Reichenberger, J.D. (shown right above) are attorney search and review specialists employed by Kroll Ontrack, Inc. That is the primary e-discovery vendor used by Losey and his law firm. Kroll Ontrack is a global e-Discovery software, processing and project management company (eDiscovery.com). The Team robot, Mr. EDR, is the Team’s personalization of Kroll Ontrack’s software, eDiscovery.com Review (EDR). The human Team members relied heavily on Mr. EDR’s active machine learning features, including his ranking and probable relevance predictions. They think of him as a fifth team member.

Research Questions Considered at TREC 2015 Recall Track

Background to questions considered: It is generally accepted in the legal search community that the use of predictive coding type search algorithms can improve the search and review of documents in legal proceedings.[1] The use of predictive coding has also been approved, and even encouraged, by various courts around the world, including numerous courts in the U.S.[2]

Although there is agreement on use of predictive coding, there is controversy and disagreement as to the most effective methods of use.[3] There are, for instance, proponents for a variety of different methods to find training documents for predictive coding. Some advocate for the use of chance selection alone, others for the use of top ranked documents alone, others for a combination of top ranked and mid-level ranked documents where classification is unsure, and still others, including Losey, call for the use of a combination of all three of these selection processes and more.[4]

Many attorneys and predictive coding software vendors advocate the use of predictive coding search methods alone, and forgo other search methods when they do so, such as keyword search, concept searches, similarity searches and linear review. e-Discovery Team members reject that approach and instead advocate for a hybrid multimodal approach they call Predictive Coding 3.0.[5]

The e-Discovery Team approach includes all types of search methods, with primary reliance placed on predictive coding. The Team also uses a multimodal selection of methods within the context of predictive coding to locate suitable training documents, including high-ranking documents, some mid-level ranked uncertain documents, and documents found by all other search methods, such as keyword search.
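The training-document selection strategies discussed above (chance selection, top-ranked documents, mid-level uncertain documents, and a combination of all of them plus other searches) can be sketched as follows. This assumes a classifier that assigns each document a probability of relevance; the function names, the 0.5 uncertainty midpoint and the keyword-hit input are hypothetical illustrations, not EDR features or the Team’s actual procedure.

```python
import random


def select_training_candidates(scores: dict[str, float], k: int,
                               seed: int = 0) -> dict[str, list[str]]:
    """Pick k candidate training documents under each selection strategy.

    `scores` maps a document id to a classifier's probability of relevance.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=lambda doc: scores[doc], reverse=True)
    return {
        # Relevance-based: the highest-ranked documents.
        "top_ranked": ranked[:k],
        # Uncertainty-based: mid-level documents closest to a 0.5 probability.
        "uncertain": sorted(scores, key=lambda doc: abs(scores[doc] - 0.5))[:k],
        # Chance selection: a simple random draw from the whole collection.
        "random": rng.sample(sorted(scores), k),
    }


def multimodal_candidates(scores: dict[str, float], k: int,
                          keyword_hits: list[str], seed: int = 0) -> list[str]:
    """Combine all strategies plus keyword-search hits, dropping duplicates."""
    picks = select_training_candidates(scores, k, seed)
    combined = picks["top_ranked"] + picks["uncertain"] + picks["random"] + keyword_hits
    return list(dict.fromkeys(combined))  # de-duplicate, keep first-seen order


# Hypothetical scores for 21 documents, from 0.00 up to 1.00.
scores = {f"doc{i}": i / 20 for i in range(21)}
batch = multimodal_candidates(scores, k=3, keyword_hits=["doc7"])
print(batch)
```

In a real review the candidates would still be read and coded by the attorney before any of them is used for training.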

Three Research Questions:

  1. Primary Question: What Recall, Precision and Effort levels will the e-Discovery Team attain in TREC test conditions over all 30 Topics using the Team’s Predictive Coding 3.0 hybrid multimodal search methods and Kroll Ontrack’s software, eDiscovery.com Review (EDR)?
  2. How will the Team’s results using its semi-automated, supervised learning method compare with other AtHome Recall Track participants?
  3. What are the ideal ratios, if any, for relevant and irrelevant training examples to maximize effectiveness of active machine learning with EDR?

Overview Of Team Participation in 2015 TREC Recall Track

The e-Discovery Team participated in all thirty of the Recall Track Athome topics. It did not participate in the fully automated TREC Recall sandbox. The thirty topics were divided into three sets, each searching a different collection of documents.

The first set of Athome Topics searched a corpus of emails of Florida Governor Jeb Bush. There were 290,099 emails in the Jeb Bush Email collection. Almost all metadata of these emails and associated attachments and images had been stripped and converted to pure text files. The ten topics searched, and their names, which were the only guidance on relevance initially provided, are shown below.

Topic 100 School and Preschool Funding
Topic 101 Judicial Selection
Topic 102 Capital Punishment
Topic 103 Manatee Protection
Topic 104 New Medical Schools
Topic 105 Affirmative Action
Topic 106 Terri Schiavo
Topic 107 Tort Reform
Topic 108 Manatee County
Topic 109 Scarlet Letter Law

Ralph Losey personally searched each of these ten Topics. In all of these Topics he was at least somewhat assisted by a single “contract review attorney.”[6] The contract attorneys in most of these ten Topics did a majority of the document review, but had only limited involvement in initial keyword searches, and no involvement in predictive coding searches or related decisions.

All participants in the 2015 Recall Track were required to complete all ten of the Bush Email Topics. Completion of the other twenty Topics in two other data collections was optional.

The second set of Athome Topics searched a corpus of posts taken from Black Hat World Forums. There were 465,149 posts in that collection. Again, almost all metadata of these posts and associated images had been stripped and converted to pure text files. The ten topics searched, and their names, which again were the only guidance initially provided on relevance, are shown below.

Topic 2052 Paying for Amazon Book Reviews
Topic 2108 CAPTCHA Services
Topic 2129 Facebook Accounts
Topic 2130 Surely Bitcoins can be Used
Topic 2134 PayPal Accounts
Topic 2158 Using TOR for Anonymous Internet Browsing
Topic 2225 Rootkits
Topic 2322 Web Scraping
Topic 2333 Article Spinner Spinning
Topic 2461 Offshore Host Sites

The third set of Athome Topics searched a corpus of 902,434 online Local News Articles, again in text only format. The ten topics searched, and their names, which again were the only guidance initially provided on relevance, are shown below.

Topic 3089 Pickton Murders
Topic 3133 Pacific Gateway
Topic 3226 Traffic Enforcement Cameras
Topic 3290 Rooster Turkey Chicken Nuisance
Topic 3357 Occupy Vancouver
Topic 3378 Rob McKenna Gubernatorial Candidate
Topic 3423 Rob Ford Cut the Waist
Topic 3431 Kingston Mills Lock Murders
Topic 3481 Fracking
Topic 3484 Paul and Cathy Lee Martin

The Team found the News Articles less difficult to work with than our typical legal search of corporate ESI.

 Summary of the Team’s 45-Day Effort

The e-Discovery Team’s 2015 Recall Track Athome project started July 14, 2015, and concluded on August 28, 2015. Using both one- and two-attorney review teams, the e-Discovery Team classified 17,014,085 documents in 30 different review projects. That is an average of 378,091 documents classified per day for 45 days, with the average project completed in 1.5 days.
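The throughput figures above are simple averages. Below is a minimal Python check of that arithmetic. It uses only the totals reported in this paragraph, and the function names are illustrative, not part of any tool the Team used.

```python
# Totals reported for the 2015 Recall Track Athome project.
TOTAL_DOCS = 17_014_085  # documents classified across all 30 projects
DAYS = 45                # length of the effort, as reported
PROJECTS = 30            # review projects (Topics)


def docs_per_day(total_docs: int, days: int) -> int:
    """Average documents classified per day, rounded to the nearest document."""
    return round(total_docs / days)


def days_per_project(days: int, projects: int) -> float:
    """Average calendar days per project if the total time is spread evenly."""
    return days / projects


print(docs_per_day(TOTAL_DOCS, DAYS))    # 378091
print(days_per_project(DAYS, PROJECTS))  # 1.5
```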

Ten of the 30 topics were performed by Ralph Losey with a single assistant, a Kroll Ontrack contract review attorney. The other 20 topics were performed by a single attorney Team member, either Ralph Losey, Jim Sullivan, Tony Reichenberger, or Joe White.

All attorneys used the e-Discovery Team’s Predictive Coding 3.0 hybrid multimodal search techniques and were assisted by Mr. EDR. They relied on active machine learning and other search techniques to find relevant documents and effective training documents. The various types of searches included in the Team’s multimodal approach are shown in the search pyramid, below.

Multimodal Search Pyramid

The standard eight-step workflow used by the Team in legal search projects is shown in the diagram below.[7] To complete each review project within an average of 1.5 days, standard Steps Three and Seven were omitted or severely constrained, as will be further explained.

Predictive Coding Search diagram by Ralph Losey

The first step, ESI Discovery Communications, also had to be drastically changed to fit the format of the Recall Track Athome experiments. The only communication under the TREC protocol was a very short, often two-word, description of relevance. This was supplemented by instant feedback in the form of yes-or-no responses as to whether particular submitted documents were relevant. In the e-Discovery Team’s typical workflow, discovery communications involve three things:

(1) detailed requests for information contained in court documents, such as subpoenas or Requests for Production;

(2) input from a qualified SME, typically a legal expert with deep knowledge of the factual issues in the case, and thus deep knowledge of what the presiding judge in the legal proceeding will hold to be relevant and discoverable; and

(3) dialogues with the requesting party, other parties, and the client to clarify the search target.

The Team never receives a request for production with just a two- or three-word description, as encountered in the TREC experiments. When the Team does receive longer but still vague requests, which are common, it seeks clarification in discussions (Step One). In practice, if the parties disagree about relevance, which is also common, the presiding judge is brought in by motion practice to make final relevance rulings. Again, none of this was possible in the TREC experiments.

All of our usual practices in Step One had to be adjusted to the submission format of the 30 Athome Topics. The most profound impact of these adjustments was that the four attorneys on the Team often lacked a clear understanding of the intended scope of relevance. They also often lacked a clear understanding of the rationale behind the automated TREC relevance rulings on particular documents. These protocol changes minimized the importance of the SME role in the active machine learning process. Instead, this role often shifted almost entirely to the analytics of Mr. EDR, who could often see patterns, and correctly predict relevance, beyond the grasp of the human attorney reviewers. This is uncommon in legal search. There, the attorney reviewers always have some understanding of relevance, and the role of the SME is important.
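Under the TREC protocol, Step One shrinks to a short topic name plus instant yes-or-no answers on submitted documents. The toy Python loop below sketches the shape of that interaction under stated assumptions. The corpus, the simple term-weight update, and the `is_relevant` callback standing in for the TREC assessor are all hypothetical. This is neither Mr. EDR’s algorithm nor the actual TREC server interface.

```python
from typing import Callable, Dict, List, Set


def feedback_loop(
    docs: Dict[str, Set[str]],           # hypothetical corpus: doc id -> set of terms
    seed_terms: Set[str],                # e.g. words from the short Topic name
    is_relevant: Callable[[str], bool],  # stand-in for the instant yes/no assessor
    batch_size: int = 1,
    budget: int = 10,
) -> List[str]:
    """Rank, submit a batch, learn from yes/no answers, repeat until the budget is spent."""
    weights: Dict[str, float] = {t: 1.0 for t in seed_terms}
    unjudged = set(docs)
    found: List[str] = []
    submitted = 0
    while unjudged and submitted < budget:
        # Score each unjudged document by summing its term weights; ties break on id.
        ranked = sorted(
            unjudged,
            key=lambda d: (-sum(weights.get(t, 0.0) for t in docs[d]), d),
        )
        for doc_id in ranked[: min(batch_size, budget - submitted)]:
            unjudged.discard(doc_id)
            submitted += 1
            relevant = is_relevant(doc_id)
            if relevant:
                found.append(doc_id)
            # Push the weights of this document's terms up (yes) or down (no).
            delta = 1.0 if relevant else -0.5
            for t in docs[doc_id]:
                weights[t] = weights.get(t, 0.0) + delta
    return found


corpus = {
    "d1": {"manatee", "protection", "boat"},
    "d2": {"manatee", "county", "budget"},
    "d3": {"boat", "speed", "zone"},
    "d4": {"county", "road", "budget"},
    "d5": {"speed", "zone", "manatee"},
}
answers = {"d1", "d3", "d5"}  # hypothetical relevance judgments
print(feedback_loop(corpus, {"manatee", "protection"}, lambda d: d in answers, budget=4))
# ['d1', 'd5', 'd3']
```

In this toy run the loop finds all three relevant documents after four submissions. One submission, d2, draws a “no,” and that negative answer is what lowers the weights of the off-topic terms `county` and `budget`.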

Ralph Losey was in charge of all e-Discovery Team efforts in all 30 Topics. Losey personally did the search and review work on 17 Topics, including all 10 of the mandatory Athome1 Bush Email Topics. Losey was assisted in the 10 Bush Email Topics by a single Kroll Ontrack contract review attorney. Ann Bottolene, a contract attorney, assisted in 4 of these Topics, and David Jensen, Review Team Manager, Kroll Ontrack, acted as a contract attorney to assist in 6 Topics. The remaining 13 Athome Topics were completed by Kroll Ontrack’s Jim Sullivan, who did the review work on 8 Topics; Tony Reichenberger on 4; and Joe White on 1.

Short Answers to Research Questions

Research Question # 1 (Primary Question): What Recall, Precision and Effort levels will the e-Discovery Team attain in TREC test conditions over all 30 Topics using the Team’s Predictive Coding 3.0 hybrid multimodal search methods and Kroll Ontrack’s software, eDiscovery Review (EDR)?

Short Answer to Primary Question: The TREC-measured results demonstrated high levels of Recall and Precision with relatively little human review effort using the e-Discovery Team’s methods and Mr. EDR. Four attorneys and one robot were able to review and classify 17,014,085 documents in 45 days under TREC test conditions. They attained total Recall of all relevant documents in all 30 Topics by human review of only 32,916 documents. They did so with two-attorney teams in the 10 Bush Email Topics, and one-attorney teams in the 20 other Topics. In Topic 3484, 100% Recall was attained with 100% Precision. This is considered impossible in a search project as large as this one, where Topic 3484 searched a collection of 902,434 Local News Articles. On many other Topics the Team attained near-perfect scores. In total, very high scores were recorded in 18 of the 30 Topics, with good results obtained in all. These results are especially notable given the low human effort involved in the supervised learning. This initial report will include a Full Answer to this first research question.
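Recall, Precision, and Effort have simple definitions. Recall is the share of all relevant documents found. Precision is the share of documents identified as relevant that really are relevant. Effort is the share of the collection a human actually reviewed. The minimal Python sketch below computes Effort from the totals reported above. The per-topic counts in the last line are hypothetical, because this paragraph reports only percentages for Topic 3484.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of all relevant documents that were found."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Share of the documents identified as relevant that really are relevant."""
    return true_positives / (true_positives + false_positives)


def review_effort(docs_reviewed: int, collection_size: int) -> float:
    """Fraction of the documents classified that a human actually looked at."""
    return docs_reviewed / collection_size


# Reported totals across all 30 Topics.
effort = review_effort(32_916, 17_014_085)
print(f"{effort:.2%}")  # 0.19%

# Hypothetical counts: every relevant document found and nothing irrelevant
# produced, the pattern reported for Topic 3484 (actual counts not given here).
print(recall(50, 0), precision(50, 0))  # 1.0 1.0
```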

Research Question # 2: How will the Team’s results using its semi-automated, supervised learning method compare with those of other Athome Recall Track participants?

Short Answer: We are still studying the results. The Final Report will include an analysis of these results.

Research Question # 3: What are the ideal ratios, if any, of relevant to irrelevant training examples to maximize the effectiveness of active machine learning with EDR?

Short Answer: We are still studying the results and performing further experiments on the relative efficacy of various training ratios, primarily the balance between positive and negative training documents. The Final Report will include an analysis of this question. What we can say at this time is that training with Mr. EDR does not function effectively if only relevant documents are used. A certain number of irrelevant training documents must be included in the training.



[1] Predictive Coding is defined by The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (January 2013) (Grossman-Cormack Glossary) as: “An industry-specific term generally used to describe a Technology Assisted Review process involving the use of a Machine Learning Algorithm to distinguish Relevant from Non-Relevant Documents, based on Subject Matter Expert(s) Coding of a Training Set of Documents.” A Technology Assisted Review process is defined as: “A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection. … TAR processes generally incorporate Statistical Models and/or Sampling techniques to guide the process and to measure overall system effectiveness.” Also see: Grossman & Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology, Vol. XVII, Issue 3, Article 11 (2011).

[2] Da Silva Moore v. Publicis Groupe, 868 F. Supp. 2d 137 (S.D.N.Y. 2012), and numerous cases later citing and following this landmark decision by Judge Andrew Peck.

[3] Grossman & Cormack, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014; Grossman & Cormack, Comments on “The Implications of Rule 26(g) on the Use of Technology-Assisted Review,” 7 Federal Courts Law Review 286 (2014); Herbert Roitblat, series of five OrcaTec blog posts (1, 2, 3, 4, 5), May-August 2014; Herbert Roitblat, Daubert, Rule 26(g) and the eDiscovery Turkey, OrcaTec blog, August 11, 2014; Hickman & Schieneman, The Implications of Rule 26(g) on the Use of Technology-Assisted Review, 7 FED. CTS. L. REV. 239 (2013); Losey, R., Predictive Coding 3.0, part one (e-Discovery Team, 10/11/15).

[4] Id.; Webber, Random vs active selection of training examples in e-discovery (Evaluating e-Discovery blog, 7/14/14).

[5] Losey, R., Predictive Coding 3.0, part two (e-Discovery Team, 10/18/15).

[6] “Contract review attorney,” or simply “contract attorney,” is a term now in common parlance in the legal profession. It refers to licensed attorneys who do document review on a project-by-project basis. Their pay under a project contract is usually by the hour, at a far lower rate than that of law firm attorneys, typically only $55 to $75 per hour. Their only responsibility is to review documents under the direct supervision of law firm attorneys who have much higher billing rates.

[7] Losey, R., Predictive Coding 3.0, part two (e-Discovery Team, 10/18/15) contains a complete description of all eight steps.


Short Blog this Month Because I’m Busy at TREC

August 1, 2015

This month’s blog will be short and sweet, for TREC reasons, as I will explain. (TREC, by the way, stands for Text REtrieval Conference, and is sponsored by NIST, the National Institute of Standards and Technology.) Since I am tied up participating in TREC right now, I have no time to write my usual long monthly blog. But I do have time to continue my almost daily micro-blogging, in other words, my tweets. I invite you to tune in to these micro-blogs and follow me on Twitter. What? Don’t yet have a Twitter account? Well then, it is about time you did. It is free, easy to sign up and easy to use. It is my favorite social media.

On Twitter you can see my latest thoughts about this and that, mostly e-discovery, tech-law, and cybersecurity. Sometimes I will go off topic and just share random things I find interesting or funny. Best yet, it is never more than 140 characters. It is all quick and concise pointers and mini thoughts. Like most everyone else I often include photos (worth a thousand words), but the new thing now is to include short videos. (Coming soon.) I like to check in with all the people and publishers that I follow and read almost every day, and I like to share my thoughts and interesting web-finds on Twitter. I find it relaxing and fun. Typically my tweets will include links to articles on the web that I find interesting. I do not tweet about personal stuff, nor follow anyone who does. Nothing wrong with that I suppose, just not my style.

If you are a reader here, I think you will like Twitter and the mini-blogs I post there several times a week, sometimes several times a day. So drink the Kool-Aid and follow me @ralphlosey. Hopefully I will inform you and even make you smile from time to time. I have about 2,982 followers. Come join the fun and get yourself a free Twitter account.

I can get boring, of course, but no worries: there are plenty of other people on Twitter who tweet about subjects you care about. Most are far more interesting than me, and their numbers are growing every day. Right now there are 304 million monthly active users. I follow 134 people, for instance, which is a relatively small number, and most of them do not post much. (You can see who I follow by going to my page at Twitter.com/RalphLosey.) The people I follow include a few select news organizations, scientists, and researchers, and even a few interesting CEOs, such as Elon Musk, Bill Gates, Gregory Bufithis and John Tredennick. No, I don’t follow celebrities or sports heroes, but you can if you want. There are 500 million tweets sent per day, which is about 6,000 tweets per second, and it is all searchable. There is something for everyone. And yes, you can find evidence there too.

TREC: Why I’m Too Busy to Write My Usual Lengthy Blog This Month

I’m too busy to write my usual long blog this month (the opposite of my tweets) because all of my free time is being eaten up by participation in this year’s TREC. Yes, I am totally preoccupied as a participant in 2015 TREC, the Recall Track, which, as you probably know, is the successor to the prior years’ Legal Track. I have reported here on TREC many times in the past. See, e.g., Analysis of the Official Report on the 2011 TREC Legal Track – Part One, Part Two, and Part Three; The Legal Implications of What Science Says About Recall. This is, however, the first time I have participated in TREC. (I was enrolled in TREC 2012, but the Legal Track was cancelled, and there was no successor track until this year’s TREC, in 2015.) Of course, I hope to make a good showing. Unlike my other research projects, including EDI’s Oracle study, the results on this will be public.

The idea of TREC, like the EDI Oracle study, is not to compete and see who is the best searcher in the world, but to learn and experiment. The point of science is the extension of knowledge. It is not a sport with winners and losers. Still, anybody (or any corporation, including e-discovery vendors) always wants to know how well their kung-fu compares with others. That adds to the excitement of a world-wide public experiment of various e-discovery search software and search methods. Be it under the governance of the Electronic Discovery Institute and Oracle, or that of the U.S. government’s National Institute of Standards and Technology, the chances to test and compare your relative search expertise are few and far between. It is amazing how few self-proclaimed search experts and vendors are willing to participate. You can draw your own conclusions.

This year’s TREC group experiment in Recall, which is the point of legal search, is headed up by Gordon Cormack and Maura Grossman. Judge Peck’s recent opinion in Rio Tinto appointing Maura Grossman as Special Master in that case made clear his opinion of her qualifications in predictive coding:

Ms. Grossman is one of the most knowledgeable lawyers (if not the most knowledgeable lawyer) about TAR, having supervised the National Institute of Standards and Technology’s TREC (text retrieval conference) Legal Track, as well as having conducted studies and written papers on TAR issues with computer science Professor Gordon Cormack.

I agree with Judge Peck’s conclusion (well, I might quibble with the “the most knowledgeable lawyer” comment, but that would be my own vanity talking). In TREC 2015, Maura Grossman, JD, PhD, and Professor Gordon Cormack serve as the supervisors and judges of the TREC Recall Track. Judge Peck got all of that right. It is a lot of work, and their role as test-designers and umpires proves their qualifications. Of course, the participants in the TREC Recall Track are not chopped liver either. I am proud to be one of the few attorneys willing to put my name out there as a player in TREC. Grossman and Cormack have done a good job setting up an interesting game for everyone to test their skills and software. All who are too timid to participate in TREC have surely missed out. Whether umpires are better at baseball than the players is beside the point. There is room for both in the Hall of Fame.

TREC is my chance to prove what I have been saying about predictive coding in the fifty articles I have written on the subject over the past four years. The challenge at this year’s TREC is to perform thirty search projects, all dreamed up by Grossman and Cormack, and, in my case, to complete these thirty search projects in a little over five weeks. (They give us a little longer than that, but that is all the time I have, and even in these five weeks, I’m still holding down my day job, albeit at a reduced summer schedule.) Unlike EDI, which I did alone, at TREC I have a small team from my favorite vendor to help me, and, of course, their top of the line software. Hundreds of contract review lawyers are not needed, just great software and the right hybrid, multimodal AI-enhanced search techniques.



It certainly remains to be seen how my e-Discovery Team of legal search and review experts will do this year. TREC is ongoing as we speak, and does not end until August 31, 2015. Still, rest assured dear readers, I will do my best to try to excel. My team is up against a strong international field. Who knows how well we will do against the many search experts from all over the world who are participating in this year’s TREC. Maybe U.S. lawyers like me will do well, maybe not. But win, lose or draw, I will report it here. I will also submit an official paper to TREC.

TREC is not about winning or proving you are the most knowledgeable, or not; it is about science. Although we have only been working on TREC for two weeks now, I have already devised several new twists and techniques to improve my standard hybrid, multimodal AI-enhanced document reviews. The search challenges presented are giving me several new ideas for more improvements. We are testing and learning new things on a daily basis, trying out various software techniques, supplements, and even new software features. This is all exciting and very challenging work. The thirty document review projects we are running now, almost all at once, are using methods far beyond, and in some ways departing from, the standard 8-step workflow shown above and in my other writings. There is far more to the latest art and science of predictive coding than simply adding continuous training, the so-called CAL techniques. I will elaborate on some of this in future writings, at least in a general way, but do not have time now.

Let me close by making clear that TREC is not a competition and is not about winning. It is about science and advancing the field of knowledge of search, of information retrieval. It is about conducting research to improve your technologies and methods. What really matters in TREC is learning from the intriguing experiments that NIST sponsors each year. Everyone who dares to participate is a winner because they will learn from these search challenges. Those who sit on the sidelines can advertise and market their self-proclaimed excellence in legal search all they want. But, as they say, talk is cheap. Be bold. Prove your claims. I say, walk your talk. Spend the time and effort needed in basic research and open comparisons. It is too late to join this year’s TREC, but consider doing so next year. Then, and only then, will you have credibility in my book.

Analysis of the Official Report on the 2011 TREC Legal Track – Part Three

September 23, 2012

This is part three of my description and analyses of the official report of the 2011 TREC Legal Track search project. I urge you to read the original report, published in July 2012: Overview of the TREC 2011 Legal Track. And of course, be sure to read Part One and Part Two first.

Participants’ Papers

In addition to the official Overview of the TREC 2011 Legal Track, you should read the papers that the 2011 participants submitted. To make that easier to do (they are currently difficult to find), I list them all here.

These participant reports are very interesting in their own right, and I will conclude with comments about each paper. Some are quite short and written in nearly incomprehensible techno-speak, while others are lengthy and well-written, albeit still with technical components. But I dare say you cannot really understand what goes on in any TREC Legal Track, especially this one, without study of these original papers by the participants.

Beijing University of Posts and Telecommunications

The Beijing University team’s article is called Discovery Based on Relevant Feedback. The authors are Jiayue Zhang, Wenyi Yang, Xi Wang, Lihua Wu, Yongtian Zhang, Weiran Xu, Guang Chen, and Jun Guo, who are all with the School of Information and Communication Engineering at the Beijing University of Posts and Telecommunications. Their experiment was to try out a method of searching the emails, attachments and loose files that combined indri with relevance feedback. Relevance feedback is a standard information retrieval technique in which judgments on retrieved documents are used to refine the search. But what is indri? Wikipedia explained that indri is the name of one of the largest living lemurs, native to Madagascar. Digging further, I learned that indri is also the name of a search engine that is part of The Lemur Project, which developed the Lemur Toolkit. Wikipedia explained that the Lemur Toolkit is:

an open-source software framework for building language modeling and information retrieval software, and the INDRI search engine. This toolkit is used for developing search engines, text analysis tools, browser toolbars, and data resources in the area of IR. 

So it appears the Chinese research team was using open source software, namely INDRI, to test how it works with relevance feedback of the kind provided in the 2011 TREC Legal Track. The short report described what they did without many specifics. It looks like they used keywords selected by their researchers for each of the three topics as part of the process. Their results, along with those of all the other participants, are shown in Overview of the TREC 2011 Legal Track. Look in the inscrutable results charts under the abbreviation priindAM. As far as I can tell, their experiment with INDRI and keywords in this environment did not prove very effective. Another nail in the coffin of keywords.

Recommind, Inc.

Recommind’s report is called simply Recommind at TREC 2011 Legal Track and was written by Peter Zeinoun, Aaron Laliberte, Jan Puzicha, Howard Sklar and Craig Carpenter. The report states that they used Recommind’s Axcelerate® Review and Analysis software, version 4.2. They employed a multimodal method that they described as using:

…various search and entity extraction methodologies including keywords, phrase extraction, and concept searches. Relevant documents were mined for additional terms that could be used to enhance the efficacy of the search. The team then used additional analytics within the Axcelerate System to examine different documents that contained responsive keywords for each Topic and at times all Topics, applying training and relevancy analysis to identify various document sets in different ways.

The Recommind report goes on to give a detailed and cogent summary of their efforts on the task. The description of their interpretation of relevancy for each topic was particularly interesting. It shows how flexible a thing relevancy is, and thus demonstrates once again the fuzzy lens problem of trying to measure recall and precision.

The report then goes on to describe what they call their patented Predictive Coding process and the extensive quality control steps they took. It also describes the Probabilistic Latent Semantic Analysis the software uses, along with a Context Optimized Relevancy Engine.

The report concludes with summaries and charts purporting to show how well their methods did as compared with other participants. This part apparently got them into some trouble with TREC, so all I can say is read the Recommind report yourself, and compare it with the official summary and its concluding charts, and the charts of other participants. I do not know enough to evaluate the competing claims, and I am not going to comment on what their marketing department may or may not have done, but certainly both Recommind and the official reports show that they did well.

Helioid, Inc.

Helioid’s report is called Learning to Rank from Relevance Feedback for e-Discovery and was written by Peter Lubell-Doughtie and Kenneth Hamilton. Here is how they describe their method:

Our approach begins with language modeling and then, as feedback from the user is received, we combine relevance feedback and learning to rank on the query level to improve result rankings using information from user interaction.

The report is filled with highly technical language, most of it far more impenetrable than that. Obviously it was not designed for lawyers to read, only other information retrieval scientists. Apparently by their participation they learned that their learning to rank methods did worse than their query expansion methods, which I think just means intelligently expanded keyword search terms, much like concept searches.
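For readers curious what “begins with language modeling” can mean in practice, here is a textbook sketch of query-likelihood ranking with Dirichlet smoothing. This is the language-modeling family that search engines such as INDRI build on. It is not Helioid’s code, and the toy documents and the smoothing parameter `mu` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def query_likelihood(query: List[str], doc: List[str], coll: Counter,
                     coll_len: int, mu: float = 2000.0) -> float:
    """log P(query | doc) under a unigram language model with Dirichlet smoothing."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = coll[term] / coll_len  # background probability of the term
        if p_coll == 0:
            continue  # a term absent from the whole collection cannot discriminate
        score += math.log((tf[term] + mu * p_coll) / (len(doc) + mu))
    return score


docs: Dict[str, List[str]] = {
    "a": "amazon book review paid".split(),
    "b": "paypal account verified".split(),
    "c": "buy amazon review service cheap".split(),
}
coll = Counter(t for words in docs.values() for t in words)
coll_len = sum(coll.values())
query = ["amazon", "review"]
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], coll, coll_len, mu=10.0),
                 reverse=True)
print(ranking)  # ['a', 'c', 'b']
```

Document b still gets a finite score even though it contains neither query term. Smoothing with collection statistics is what keeps an unseen term from zeroing out a document. A small `mu` is used here only because the toy collection is tiny.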

Indian Statistical Institute

The Indian Statistical Institute report is titled Cluster-based Relevance Feedback: Legal Track 2011. It was written by Kripabandhu Ghosh, Prasenjit Majumder and Swapan Kumar Parui. Apparently they used a combination of Boolean keyword search and machine learning predictive coding type search. Like the researchers from Beijing University, they used the INDRI search engine of the Lemur 4.11 toolkit for Boolean retrieval. They used Terrier 3.0 software for their Rocchio algorithm relevance feedback techniques. These Wikipedia article links are interesting if you want to learn more.

It seems like the Indian team used keyword search, Boolean query expansion (building on keywords like concept search), and document clustering to help build the seed set and supplement the training received from the documents marked as relevant by the Topic Authorities. Apparently these techniques and the mentioned open source software allowed them to do very well on one of the three topics (401).
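The Rocchio algorithm mentioned above is a classic relevance-feedback formula. It moves the query vector toward the centroid of documents judged relevant and away from the centroid of those judged non-relevant: q' = αq + β·mean(relevant) − γ·mean(non-relevant). A minimal Python sketch follows. The weights α = 1.0, β = 0.75, γ = 0.15 are common textbook defaults assumed here, not the settings the Indian team or Terrier used, and the example vectors are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]  # sparse term -> weight


def centroid(vectors: List[Vector]) -> Vector:
    """Average of a list of sparse term-weight vectors."""
    if not vectors:
        return {}
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query: Vector, relevant: List[Vector], nonrelevant: List[Vector],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> Vector:
    """Move the query toward relevant documents and away from non-relevant ones."""
    rel_c, non_c = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(rel_c) | set(non_c)
    updated = {t: alpha * query.get(t, 0.0) + beta * rel_c.get(t, 0.0)
               - gamma * non_c.get(t, 0.0) for t in terms}
    # Negative weights are conventionally clipped (here: dropped).
    return {t: w for t, w in updated.items() if w > 0}


new_query = rocchio(
    {"fracking": 1.0},
    relevant=[{"fracking": 1.0, "shale": 1.0, "gas": 1.0}],
    nonrelevant=[{"gas": 1.0, "prices": 1.0}],
)
print(sorted(new_query))  # ['fracking', 'gas', 'shale']
```

In the example, “prices” appears only in the non-relevant document, so it is pushed below zero and dropped. “Gas,” which appears in both, is damped. That damping is the job non-relevant examples do in a feedback loop.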

OpenText Corporation

The OpenText report is entitled Learning Task Experiments in the TREC 2011 Legal Track and was written by Stephen Tomlinson of Ontario, Canada. They used their own software called OpenText Search Server®, eDOCS Edition. Their report points out that they have participated in every TREC Legal Track since it started in 2006, for which they are to be congratulated.

Like most of the other participants they seemed to rely heavily on keyword Boolean searches in the initial training. Their relevancy ranking was based on an adjusted keyword counting system, kind of like number of keywords per page. I have seen from experience how poor this kind of ranking can be in commercial, pre-predictive coding type review software.
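The report does not spell out how OpenText adjusted its keyword counts. The sketch below therefore assumes the crudest version, keyword hits per 100 words, purely to show why this kind of ranking can disappoint: a short, keyword-stuffed document outranks a substantive one. The documents and the `keyword_density` helper are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def keyword_density(text: str, keywords: Set[str]) -> float:
    """Keyword hits per 100 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return 100.0 * hits / len(words)


def rank_by_density(docs: Dict[str, str], keywords: Set[str]) -> List[Tuple[str, float]]:
    """Rank documents from highest to lowest keyword density (ties broken by id)."""
    scored = [(doc_id, keyword_density(text, keywords)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "memo": "Please review the attached contract before the board meeting on Friday",
    "spam": "contract contract contract deals",
}
ranked = rank_by_density(docs, {"contract"})
print(ranked[0][0])  # spam
```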

Most of their report was in incomprehensible shorthand tech-speak, so I am not sure exactly what method they used or how well it worked. Apparently they were trying to compare experimental feedback-based, topic-based and Boolean-based techniques. They summarized their results in regular language by saying:

Generally speaking, approaches based on relevance feedback were found to outperform the other approaches.

I think this means that once again keyword Boolean search, no matter how beefed up and expanded, was found to be the worst approach.

Technology Concepts & Design, Inc.

The Technology Concepts & Design, Inc. (“TCDI”) report, called Auto-Relevancy and Responsiveness Baseline II, was written by Cody Bennett. The subtitle of the report says it all (smile): Improving Concept Search to Establish a Subset with Maximized Recall for Automated First Pass and Early Assessment Using Latent Semantic Indexing [LSI], Bigrams and WordNet 3.0 Seeding. I had never heard of WordNet, so I consulted Wikipedia, which explained:

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.

I think this means they used enhanced keyword searches with concept search type expansion of keywords for each topic. Seems similar to the other participants’ description, but they used different software to do it. As the Legal Track Results page shows, TCDI used the automatic (Borg) approach in all of its test runs, and not the TechAssist (Hybrid) approach. They relied upon mathematics, more than Man, including a couple of my favorites, the Golden Ratio and prime numbers. See, e.g., Good, Better, Best: a Tale of Three Proportionality Cases – Part One and Bottom Line Driven Proportional Review.
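WordNet-style seeding amounts, at its simplest, to expanding each query term with the other members of its synonym set. The Python sketch below uses a tiny hand-built synonym table as a hypothetical stand-in for WordNet. It does not call WordNet or any WordNet library, and it is not TCDI’s method.

```python
from typing import List, Set

# Hypothetical, hand-built synonym sets standing in for a thesaurus such as WordNet.
SYNSETS: List[Set[str]] = [
    {"lawsuit", "suit", "litigation", "action"},
    {"agreement", "contract", "deal"},
]


def expand_query(terms: List[str], synsets: List[Set[str]] = SYNSETS) -> Set[str]:
    """Return the original terms plus every synonym sharing a synset with one of them."""
    expanded: Set[str] = set(terms)
    for term in terms:
        for synset in synsets:
            if term in synset:
                expanded |= synset
    return expanded


print(sorted(expand_query(["contract", "breach"])))
# ['agreement', 'breach', 'contract', 'deal']
```

The same mechanism cuts both ways. Expanding “lawsuit” also pulls in “action,” a word with many unrelated senses. That is the polysemy problem the TCDI abstract addresses.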

Several things in the TCDI report abstract caught my interest and led to my admittedly limited insights into the TCDI approach:

We experiment with manipulating the features at build time by indexing bigrams created from EDRM data and seeding the LSI index with thesaurus-like WordNet 3.0 strata. From experimentation, this produces fewer false positives and a smaller, more focused relevant set. The method allows concept searching using bigrams and WordNet senses in addition to singular terms increasing polysemous value and precision; steps towards a unification of Semantic and Statistical. …

The result of the normalized cosine distance score for each document in each topic is then shifted based on the foundation of primes, golden standard, and golden ratio. This results in ‘best cutoff’ using naturally occurring patterns in probability of expected relevancy with limit approaching. …

Overall the influence of humans involved (TAs) was very minimal, as their assessments were not allowed to modify any rank or probability of documents. However, the identification of relevant documents by TAs at low LSI thresholds provided a feedback loop to affect the natural cutoff.

This all seems very pro-Borg to me. I can just imagine the scientists’ thoughts: Pesky humans! Do not let them modify the evaluation of documents. They will just muck things up with their supposed legal judgments and such. I have talked with coder math types having attitudes like this before.

The report does step into English from time to time and even includes legal argument, which, naturally, I disagree with. Indeed the following assertion is made without any authority that I can see, either legal or factual:

But, there is always one more document which may be relevant and nowhere near similar due to semantic ambiguity. The most important documents to a case arguably may be those which are in this outlier area, and more expensive to obtain.

Really? The most important documents are the ones that you did not find? Damn the expense, keep looking, because we are sure the outliers are key to the case! So much for proportionality. But then, proportionality has always been an argument for clients, not vendors. Still, I do not mean to be too critical. TCDI does end their report with a conciliatory statement that I totally endorse:

The application of a hybrid feature approach / complex concepts to Latent Semantic Indexing using very simple automated parsing and query construction appears promising in generating a high Recall set based solely on initial topic modeling (Request for Production). …

This automated study is not about replacing the human intelligence required to successfully complete an end-to-end review. It is one part of a display of how automated and human assisted workflows can in tandem guide a historically expensive process into a realm of data proportionality and expectation.

So it appears our disagreements are relatively minor, perhaps even just latent semantic and attitude based. The important thing is we agree in principle to the hybrid approach and to proportionality. Hey, they even used my word hybrid, so I have got to like this company and report author.

University of South Florida

The University of South Florida report is entitled Modeling Concept and Context to Improve Performance in eDiscovery. It was written by H. S. Hyman and Warren Fridy III. The abstract of the report starts with an interesting sentence that puts legal search in perspective with other kinds of search:

One condition of eDiscovery making it unique from other, more routine forms of IR is that all documents retrieved are settled by human inspection.

I guess this means that other areas of search do not have the gold/lead standard fuzzy lens issues we have. The paper abstract goes on to make two more good points:

Automated IR tools are used to reduce the size of a corpus search space to produce smaller sets of documents to be reviewed. However, a limitation associated with automated tools is they mainly employ statistical use of search terms that can result in poor performance when measured by recall and precision. One reason for this limitation is that relevance – the quality of matching a document to user criteria – is dynamic and fluid, whereas a query – representing the translation of a user’s IR goal – is fixed.

In other words, to put it plainly, keyword search sucks in legal search of chaotic data sets like email and loose file collections. The relevance determinations are a moving target – too fluid and dynamic for keyword search alone to work. Keywords have to be used very carefully.
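The abstract judges keyword search by recall and precision. For readers who have not worked with those two measures, here is a minimal Python sketch of how they are computed. The document IDs and relevance judgments are made-up toy data, not anything from TREC 2011.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for retrieved document IDs judged
    against the set of document IDs known to be relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant documents actually found
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical toy data: six relevant documents in the collection, and a
# keyword query that returns five documents, three of them relevant.
relevant_docs = {"d1", "d2", "d3", "d4", "d5", "d6"}
keyword_hits = {"d1", "d2", "d3", "d9", "d10"}
print(recall_precision(keyword_hits, relevant_docs))  # (0.5, 0.6)
```

Recall tells you what fraction of the relevant documents the search found; precision tells you what fraction of what it found is relevant. A keyword query can look respectable on one measure and still be poor on the other.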

The University of South Florida researchers have a good handle on the problem and were testing one possible solution that combines concept and context modeling to enhance search term performance. They used a hybrid multimodal approach with the following basic strategy to solve the unique problems of e-discovery:

In answering this question we propose an approach to model three constructs: (1) Concepts underlying the fixed search terms and queries, (2) Context of the domain and the corpus, and (3) Elimination terms used as counter-measures for reduction of nonrelevant documents.

This is one of the better-written papers, with frequent use of only 19th grade English.
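The paper’s actual models are not reproduced here, but a rough, hypothetical Python sketch may help show how concept terms and elimination terms could work together as a filter. The term lists, the simple tokenizer and the keep-if-any-concept-and-no-eliminator rule are my assumptions for illustration only, and the context construct is left out entirely.

```python
import re

def tokenize(text):
    """Lowercased word tokens from a deliberately simple regex tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def concept_filter(docs, concept_terms, elimination_terms):
    """Return IDs of documents that mention at least one concept term
    and none of the elimination (counter-measure) terms."""
    concepts = {t.lower() for t in concept_terms}
    eliminators = {t.lower() for t in elimination_terms}
    kept = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        if words & concepts and not words & eliminators:
            kept.append(doc_id)
    return kept

# Hypothetical request about gas trading; "trading" is ambiguous in email.
docs = {
    "e1": "Please review the gas trading contract before Friday",
    "e2": "Fantasy football league: trading players this week",
    "e3": "Lunch order for the team",
}
concepts = ["trading", "contract", "swap"]   # assumed concept expansion of the request
eliminators = ["fantasy", "football"]        # assumed counter-measures for a known false hit
print(concept_filter(docs, concepts, eliminators))  # ['e1']
```

Without the elimination terms, the fantasy football email would come back as a hit on “trading.” That is the kind of nonrelevant noise the authors describe using elimination terms to reduce.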

Ursinus College

The Ursinus College report is titled Latent Semantic Indexing with Selective Query Expansion. It was written by Andy Garron and April Kontostathis. They are one of the participants that tried out an automatic (Borg) approach. Their one sentence description of the task was concise and accurate:

The E-Discovery simulation includes an opportunity for machine learning based on relevance feedback – i.e. training systems to improve search results over multiple iterations after consulting with a Topic Authority (TA).

Here is the description they provide of their multi-dimensional latent semantic indexing approach:

The system we implemented for both 2010 and 2011 is based on Latent Semantic Indexing (LSI), a search method that attempts to draw out the meaning of terms. In particular, we implemented Essential Dimensions of LSI (EDLSI), which combines standard Vector Space retrieval with LSI in a “best of both worlds” approach. In 2011, teams are allowed multiple submissions for each query (“runs”), after each run they receive relevance judgments for a number of documents. This procedure lends itself intuitively to selective query expansion. In selective query expansion, we modify the query using information from documents that are known to be relevant in order to train the system to produce better retrieval results. We implemented selective query expansion as a machine learning feature in our system.

So it appears they are trying to make their machines more intuitive in query expansion, meaning, I think, the selection of new keywords to add to training. I know that Data on Star Trek never really attained the human hunch capacity, but maybe the robots from Ursinus will do better.
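To make the Ursinus description a little more concrete, here is a small Python sketch of selective query expansion in a plain vector space model, along with the kind of weighted blend EDLSI uses to combine vector space and LSI scores. It is a sketch under stated assumptions, not the Ursinus code: the expansion rule (adding the averaged term weights of relevant documents, Rocchio style), the `beta` value and the 0.2 blend weight are my assumptions, and the LSI scores themselves, which would come from a low-rank singular value decomposition, are not computed here.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector of a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def expand_query(query_vec, relevant_vecs, beta=0.5):
    """Selective query expansion (assumed Rocchio-style rule): add the averaged
    term weights of documents already judged relevant to the query."""
    expanded = Counter({t: float(w) for t, w in query_vec.items()})
    for vec in relevant_vecs:
        for t, w in vec.items():
            expanded[t] += beta * w / len(relevant_vecs)
    return expanded

def edlsi_blend(lsi_scores, vs_scores, weight=0.2):
    """EDLSI-style blend: weight * LSI score + (1 - weight) * vector space score.
    The LSI scores are taken as given; no SVD is computed here."""
    docs = set(lsi_scores) | set(vs_scores)
    return {d: weight * lsi_scores.get(d, 0.0) + (1 - weight) * vs_scores.get(d, 0.0)
            for d in docs}

# Hypothetical toy collection and query.
docs = {
    "d1": "energy trading contract signed",
    "d2": "trading desk risk report",
    "d3": "holiday party schedule",
}
vecs = {d: tf_vector(text) for d, text in docs.items()}
query = tf_vector("contract")

before = {d: cosine(query, v) for d, v in vecs.items()}
# Suppose the topic authority judged d1 relevant after the first run.
expanded = expand_query(query, [vecs["d1"]])
after = {d: cosine(expanded, v) for d, v in vecs.items()}
print(before["d2"], round(after["d2"], 3))  # 0.0 0.144
```

Before feedback, d2 scores zero because it lacks the word “contract.” Once d1 is judged relevant and its terms are folded into the query, d2 picks up a small score through the shared term “trading.” That is the basic idea of using known relevant documents to improve the next run.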

University of Melbourne

The University of Melbourne report, Melbourne at the TREC 2011 Legal Track, was written by William Webber, a frequent contributor to this blog, and Phil Farrelly. They tried both TechAssist (Hybrid) and Automatic (Borg) approaches. It looks like they used keyword term occurrences with binary weights as part of the seed set generation. This was a very short report and, unfortunately, I did not understand most of it.

I asked William about the report and he admitted the report was rushed. He also admitted that the experiment they tried this year did not work out too well, for various reasons. William then sent me his informal explanation of what his team did, for publication in this blog. This time he used language that I could understand. Here is a slightly edited version of what he sent:

What the Melbourne team did at TREC 2011 was fairly mainstream predictive coding (“text classification” in technical jargon). The Support Vector Machine (SVM) is a standard text classification algorithm, that I imagine is widely used in predictive coding, including by vendors in the U.S.  “Active learning” refers to the way we selected documents for coding to improve the classifier: instead of picking documents at random, we chose those documents that the classifier was “most unsure about;” these are the documents that the classifier might give a 50% probability of relevance to, as you were encountering in the Kroll OnTrack system. [William is referring to my descriptions in the seven-part search narrative using Inview.]

The initial set of documents for coding were selected by simple keyword queries. All the above is fairly standard predictive coding. As an experiment, we tried two different sources for responsiveness coding. One was to ask the official TREC topic authority for assessments; the other was to ask an assessor internal to the team (who had e-discovery experience, though he was not a lawyer) for annotations. We wanted to see how well you could do if your annotations were made by someone other than the person who was defining the conception of relevance.

In the event, we did better with the internal than with the official annotations. However, our scores were uniformly poor, so little can be concluded from this finding. Whether our poor scores were due to a bug in our system, or to not getting enough annotations from the official topic authority (we found the turnaround to be very slow, for whatever reason), or why, I’m not sure.
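William’s description maps onto a fairly standard loop: keyword queries pick the initial documents, a human codes them, an SVM is trained, and the documents the SVM is least sure about are sent for coding next. Below is a rough Python sketch of that loop. It is not the Melbourne system. The Pegasos-style training routine is a simple stand-in for whatever SVM package they actually used, and the documents, query terms, responsiveness calls and parameter values are all hypothetical.

```python
import random

def features(text):
    """Binary bag-of-words features plus a constant bias feature."""
    f = {tok: 1.0 for tok in text.lower().split()}
    f["__bias__"] = 1.0
    return f

def dot(w, x):
    """Score of a sparse feature dict x under weight dict w."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def keyword_seed_set(docs, query_terms, limit=None):
    """Initial documents for coding: keyword hits, ranked by how many
    distinct query terms each document contains (binary term weights)."""
    terms = {t.lower() for t in query_terms}
    scored = [(len(terms & set(text.lower().split())), d) for d, text in docs.items()]
    ranked = [d for score, d in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]
    return ranked if limit is None else ranked[:limit]

def train_linear_svm(examples, epochs=1000, lam=0.01, seed=0):
    """Linear SVM fit with Pegasos-style stochastic subgradient steps.
    examples: list of (feature_dict, label) pairs with label +1 or -1."""
    rng = random.Random(seed)
    w, t = {}, 0
    for _ in range(epochs):
        for x, y in rng.sample(examples, len(examples)):
            t += 1
            eta = 1.0 / (lam * t)
            violated = y * dot(w, x) < 1   # hinge loss is active for this example
            for f in w:                    # regularization: shrink all weights
                w[f] *= 1.0 - eta * lam
            if violated:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w

def most_uncertain(w, pool, k=1):
    """Uncertainty sampling: the k unreviewed documents whose score is
    closest to the SVM decision boundary at zero."""
    return sorted(pool, key=lambda d: abs(dot(w, features(pool[d]))))[:k]

# Hypothetical collection; a deliberately broad keyword query picks the seeds.
collection = {
    "a": "gas trading contract",
    "b": "power trading swap",
    "c": "holiday party invite",
    "d": "fantasy football picks",
}
seeds = keyword_seed_set(collection, ["trading", "party", "football"])
# Hypothetical responsiveness calls from the assessor on the seed documents.
calls = {"a": +1, "b": +1, "c": -1, "d": -1}
w = train_linear_svm([(features(collection[d]), calls[d]) for d in seeds])

# Documents not yet reviewed; ask the assessor about the least certain one next.
pool = {
    "u1": "trading contract amendment",
    "u2": "party football",
    "u3": "weekly report",
}
print(seeds, most_uncertain(w, pool))
```

The document that shares no vocabulary with the coded examples tends to score near the decision boundary, so it is the one picked for review. In a real project that document would go to the topic authority, or to an internal assessor as Melbourne also tried, for the next round of coding.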

University of Waterloo

The University of Waterloo report, University of Waterloo at TREC 2011: A Social Networking Approach to the Legal Learning Track, was written by Robert Warren and David R. Cheriton. This team used a truly unique approach that I can summarize as the Borg go Facebook. Here is their more sciency explanation:

The goal of the experiments this year was the exploration of whether social network analysis could be applied to the problem of e-discovery and legal document retrieval. We also opted for a fully automatic approach in that only responsiveness judgments from the topic authorities were used in the learning component of the system.

To perform this social media experiment they used, I kid you not, the Wumpus Search Engine. They claim to have counted 255,964 individuals who sent or received documents within the Enron dataset. Although most of their paper is not really intelligible to me, despite their efforts to explain by an example involving phone calls to pizza stores, I gather they were looking for patterns of who talked to whom as a way to find relevant evidence for seed sets. Apparently it did not work out too well, but they want to try again by adding the fourth dimension, i.e., time.
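The Waterloo paper is the place to go for their actual method. As a rough illustration of the general who-talked-to-whom idea only, here is a hypothetical Python sketch that builds a contact graph from sender and recipient fields, expands outward from a custodian tied to known relevant documents, and flags the messages touching anyone in that neighborhood. The names, the one-hop expansion rule and the message format are my assumptions, and the time dimension they want to add next is not modeled.

```python
from collections import defaultdict

def build_contact_graph(messages):
    """Undirected graph of who exchanged mail with whom.
    messages: list of (sender, [recipients]) pairs."""
    graph = defaultdict(set)
    for sender, recipients in messages:
        for r in recipients:
            if r != sender:
                graph[sender].add(r)
                graph[r].add(sender)
    return graph

def expand_from_seeds(graph, seed_people, hops=1):
    """Everyone within `hops` links of the seed custodians."""
    reached = set(seed_people)
    frontier = set(seed_people)
    for _ in range(hops):
        frontier = {n for p in frontier for n in graph.get(p, ())} - reached
        reached |= frontier
    return reached

def candidate_messages(messages, people):
    """Indexes of messages sent or received by anyone in `people`."""
    return [i for i, (sender, recipients) in enumerate(messages)
            if sender in people or people & set(recipients)]

# Hypothetical mail log; "alice" is a custodian on known relevant documents.
messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["dave"]),
    ("erin", ["frank"]),
]
graph = build_contact_graph(messages)
near = expand_from_seeds(graph, {"alice"}, hops=1)
print(sorted(near))                         # ['alice', 'bob', 'carol']
print(candidate_messages(messages, near))   # [0, 1]
```

The flagged messages would then be candidates for a seed set. Whether that beats simply reading the documents is exactly the question their experiment was trying to answer.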


Reading these original reports should inoculate you against the nonsense that you may read elsewhere about the TREC Legal Track. The participants’ reports also provide important information on the science involved that you will not find in the Overview. For instance, the reports provide more perspective on the fuzzy lens and gold standard problems, and what we can do to improve quality controls for review. The reports also provide critical insight into the kinds of tests and experiments the participants were running.

Study of these materials will also prepare you to read the post-hoc scientific analysis of the 2011 results that will surely follow in the next few years. I look forward to these future studies and commentaries from information scientists. They will provide, as they have for prior conferences, detailed analysis, interpretation and critiques of the 2011 conference. Sometimes these materials can be difficult to understand. They are often written for fellow academics and scientists, not lawyers, but I encourage you to make the effort.

I also encourage the scientists who write these reports about Legal Track to try to curb their tendency to overuse technical jargon, inside language and abbreviations. We lawyers know a lot about obscure technical talk too. We practically invented it. We can slip into technical lawyerese any time we want and I guarantee that scientists will not understand it. But, unless we are writing arcane articles intended only for other specialists, our goal is communication to a wider audience, to all members of our e-discovery team. Lawyers train themselves to process and understand very complex issues and facts so that they can explain things in 12th grade English. I am not pleading with you to go that far, but how about 19th grade English, and not so much techno-shorthand? How about writing for the whole team of professionals passionate about legal search, not just other information retrieval scientists? I know you need some techno-speak and math, but remember that these studies will be read by lawyers and technologists too, so try to include segments for us.

Like it or not, scientists are important members of a modern-day interdisciplinary e-discovery team, especially when it comes to advanced search techniques. Along with technologists, information scientists and the insights they offer are key to efficient and effective legal search. Still, at the end of the day, an e-discovery team must be led by an attorney. Legal search concerns lawsuits and the gathering and production of evidence to be used in courts of law. Rule 26(g) of the Federal Rules of Civil Procedure places the full responsibility for compliant legal search on legal counsel, not consultants.

Justice is not an academic pursuit or government study. It is a real life imperative, and often a messy one at that. Lawyers have a legal and ethical duty to lead and supervise their e-discovery teams in all kinds of conditions. Lawyers can meet these challenges if they have a good team and can understand what the technologists and scientists are saying. If lawyers will take the time to study these TREC reports, and understand the language of technology and science, they will be better able to discharge their duties as team leaders.
