Analysis of the Official Report on the 2011 TREC Legal Track – Part Three

This is part three of my description and analysis of the official report of the 2011 TREC Legal Track search project. I urge you to read the original report that was published in July 2012, Overview of the TREC 2011 Legal Track, and of course, be sure to read Part One and Part Two first.

Participants’ Papers

In addition to the official Overview of the TREC 2011 Legal Track, you should read the papers that the 2011 participants submitted. To make that easier to do (they are currently difficult to find), I list them all here.

These participant reports are very interesting in their own right and I will conclude with comments about each paper. Some are quite short and written in nearly incomprehensible techno-speak, while others are lengthy and well-written, albeit still with technical components. But I dare say you cannot really understand what goes on in any TREC Legal Track, especially this one, without study of these original papers by the participants.

Beijing University of Posts and Telecommunications

The Beijing University team’s article is called Discovery Based on Relevant Feedback. The authors are Jiayue Zhang, Wenyi Yang, Xi Wang, Lihua Wu, Yongtian Zhang, Weiran Xu, Guang Chen, and Jun Guo, who are all with the School of Information and Communication Engineering at the Beijing University of Posts and Telecommunications. Their experiment was to try out a method of searching our emails, attachments and loose files that combined both indri and relevant feedback. I assume relevant feedback is a typical machine learning type of code, but what is indri? Wikipedia explained that indri is the name of one of the largest living lemurs that are native to Madagascar. Digging further I learned that indri is also the name for a search engine that is part of The Lemur Project that developed the Lemur Toolkit. Wikipedia explained that the Lemur Toolkit is:

an open-source software framework for building language modeling and information retrieval software, and the INDRI search engine. This toolkit is used for developing search engines, text analysis tools, browser toolbars, and data resources in the area of IR. 

So it appears the Chinese research team was using open source software, namely INDRI, to test how it works on relevant feedback of the kind provided in the 2011 TREC Legal Track. The short report described what they did without many specifics, but it looks like they used keywords selected by their researchers for each of the three topics as part of the process. Their results, along with those of all the other participants, are shown in Overview of the TREC 2011 Legal Track. Look in the inscrutable results charts under the abbreviation priindAM. As far as I can tell, their experiment with INDRI and keywords in this environment did not prove very effective. Another nail in the coffin of keywords.

Recommind, Inc.

Recommind’s report is called simply Recommind at TREC 2011 Legal Track and was written by Peter Zeinoun, Aaron Laliberte, Jan Puzicha, Howard Sklar and Craig Carpenter. The report states that they used Recommind’s Axcelerate® Review and Analysis software, version 4.2. They employed a multimodal method that they described as using:

…various search and entity extraction methodologies including keywords, phrase extraction, and concept searches. Relevant documents were mined for additional terms that could be used to enhance the efficacy of the search. The team then used additional analytics within the Axcelerate System to examine different documents that contained responsive keywords for each Topic and at times all Topics, applying training and relevancy analysis to identify various document sets in different ways.

The Recommind report goes on to give a detailed and cogent summary of their efforts on the task. The description of their interpretation of relevancy for each topic was particularly interesting. It shows how flexible a thing relevancy is, and thus demonstrates once again the fuzzy lens problem of trying to measure recall and precision.

The report then goes on to describe what they call their patented Predictive Coding process and the extensive quality control steps they took. It also describes the Probabilistic Latent Semantic Analysis the software uses, along with a Context Optimized Relevancy Engine.

The report concludes with summaries and charts purporting to show how well their methods did as compared with other participants. This part apparently got them into some trouble with TREC, so all I can say is read the Recommind report yourself, and compare it with the official summary and its concluding charts, and the charts of other participants. I do not know enough to evaluate the competing claims, and I am not going to comment on what their marketing department may or may not have done, but certainly both Recommind and the official reports show that they did well.

Helioid, Inc.

Helioid’s report is called Learning to Rank from Relevance Feedback for e-Discovery and was written by Peter Lubell-Doughtie and Kenneth Hamilton. Here is how they describe their method:

Our approach begins with language modeling and then, as feedback from the user is received, we combine relevance feedback and learning to rank on the query level to improve result rankings using information from user interaction.

The report is filled with highly technical language, most of it far more impenetrable than that. Obviously it was not designed for lawyers to read, only other information retrieval scientists. Apparently by their participation they learned that their learning to rank methods did worse than their query expansion methods, which I think just means intelligently expanded keyword search terms, much like concept searches.

Indian Statistical Institute

The Indian Statistical Institute report is titled Cluster-based Relevance Feedback: Legal Track 2011. It was written by Kripabandhu Ghosh, Prasenjit Majumder and Swapan Kumar Parui. Apparently they used a combination of Boolean keyword search and machine learning predictive coding type search. Like the researchers from Beijing University, they used the INDRI search engine of the Lemur 4.11 toolkit for Boolean retrieval, and they used Terrier 3.0 software for their Rocchio algorithm relevance feedback techniques. These Wikipedia article links are interesting if you want to learn more.

It seems like the Indian team used keyword search, Boolean query expansion (building on keywords like concept search), and document clustering to help build the seed set and supplement the training received from the documents marked as relevant by the Topic Authorities. Apparently these techniques and the mentioned open source software allowed them to do very well on one of the three topics (401).
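For readers who have not met the Rocchio algorithm before, here is a minimal sketch of the textbook version of it, which nudges a query toward documents judged relevant and away from documents judged nonrelevant. It is not the Indian team's code and does not use Terrier; the function names, the weights (alpha, beta, gamma) and the toy texts are all assumptions made up for illustration.

```python
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Average a list of term vectors; an empty list yields an empty vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    n = len(vectors)
    return {term: count / n for term, count in total.items()} if n else {}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Textbook Rocchio update (weights here are illustrative assumptions):

    new_query = alpha*query + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    Terms whose weight falls to zero or below are dropped.
    """
    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    updated = {}
    for term in set(query) | set(rel_c) | set(non_c):
        weight = (alpha * query.get(term, 0)
                  + beta * rel_c.get(term, 0)
                  - gamma * non_c.get(term, 0))
        if weight > 0:
            updated[term] = weight
    return updated

# Toy data, invented for this sketch (not taken from the Enron collection).
query = term_vector("pipeline contract")
relevant = [term_vector("pipeline contract renewal terms"),
            term_vector("pipeline capacity agreement")]
nonrelevant = [term_vector("lunch contract cafeteria")]

new_query = rocchio(query, relevant, nonrelevant)
for term, weight in sorted(new_query.items(), key=lambda p: (-p[1], p[0])):
    print(f"{term}: {weight:.3f}")
```

The updated query picks up new terms such as "renewal" and "capacity" from the relevant documents, which is the query expansion effect described above, while a term found only in the nonrelevant document ("lunch") drops out.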

OpenText Corporation

The OpenText report is entitled Learning Task Experiments in the TREC 2011 Legal Track and was written by Stephen Tomlinson of Ontario, Canada. They used their own software called OpenText Search Server®, eDOCS Edition. Their report points out that they have participated in every TREC Legal Track since it started in 2006, for which they are to be congratulated.

Like most of the other participants they seemed to rely heavily on keyword Boolean searches in the initial training. Their relevancy ranking was based on an adjusted keyword counting system, kind of like number of keywords per page. I have seen from experience how poor this kind of ranking can be in commercial, pre-predictive coding type review software.
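To make the "keywords per page" idea concrete, here is a minimal sketch of ranking by keyword density, meaning keyword hits divided by document length. This is my own simplified stand-in, not OpenText's actual ranking formula, and the function names and sample documents are hypothetical.

```python
def keyword_density(text, keywords):
    """Fraction of the words in `text` that are in the lowercase `keywords` set."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in keywords)
    return hits / len(words)

def rank_by_density(docs, keywords):
    """Return document ids ordered from highest to lowest keyword density."""
    keywords = {k.lower() for k in keywords}
    scored = [(keyword_density(text, keywords), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Invented sample documents for illustration only.
docs = {
    "memo": "energy trading report energy",
    "menu": "energy drink specials at the cafeteria today",
    "notes": "meeting notes",
}
ranking = rank_by_density(docs, ["energy", "trading"])
print(ranking)  # ['memo', 'menu', 'notes']
```

Note that the cafeteria menu outranks the meeting notes only because it happens to contain the word "energy", which is the kind of weakness that counting keywords invites.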

Most of their report was in incomprehensible shorthand tech-speak, so I am not sure exactly what method they used or how well it worked. Apparently they were trying to compare experimental feedback-based, topic-based and Boolean-based techniques. They summarized their results in regular language by saying:

Generally speaking, approaches based on relevance feedback were found to outperform the other approaches.

I think this means that once again keyword Boolean search, no matter how beefed up and expanded, was found to be the worst approach.

Technology Concepts & Design, Inc.

The Technology Concepts & Design, Inc. (“TCDI”) report, called Auto-Relevancy and Responsiveness Baseline II, was written by Cody Bennett. The subtitle of the report says it all (smile): Improving Concept Search to Establish a Subset with Maximized Recall for Automated First Pass and Early Assessment Using Latent Semantic Indexing [LSI], Bigrams and WordNet 3.0 Seeding. I had never heard of WordNet, so I consulted Wikipedia, which explained:

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.

I think this means they used enhanced keyword searches with concept search type expansion of keywords for each topic. Seems similar to the other participants’ description, but they used different software to do it. As the Legal Track Results page shows, TCDI used the automatic (Borg) approach in all of its test runs, and not the TechAssist (Hybrid) approach. They relied upon mathematics, more than Man, including a couple of my favorites, the Golden Ratio and prime numbers. See, e.g., Good, Better, Best: a Tale of Three Proportionality Cases – Part One and Bottom Line Driven Proportional Review.

Several things in the TCDI report abstract caught my interest and led to my admittedly limited insights into the TCDI approach:

We experiment with manipulating the features at build time by indexing bigrams created from EDRM data and seeding the LSI index with thesaurus-like WordNet 3.0 strata. From experimentation, this produces fewer false positives and a smaller, more focused relevant set. The method allows concept searching using bigrams and WordNet senses in addition to singular terms increasing polysemous value and precision; steps towards a unification of Semantic and Statistical. …

The result of the normalized cosine distance score for each document in each topic is then shifted based on the foundation of primes, golden standard, and golden ratio. This results in ‘best cutoff’ using naturally occurring patterns in probability of expected relevancy with limit approaching. …

Overall the influence of humans involved (TAs) was very minimal, as their assessments were not allowed to modify any rank or probability of documents. However, the identification of relevant documents by TAs at low LSI thresholds provided a feedback loop to affect the natural cutoff.

This all seems very pro-Borg to me. I can just imagine the scientists’ thoughts: Pesky humans! Do not let them modify the evaluation of documents. They will just muck things up with their supposed legal judgments and such. I have talked with coder math types having attitudes like this before.

The report does step into English from time to time and even includes legal argument, which, naturally, I disagree with. Indeed the following assertion is made without any authority that I can see, either legal or factual:

But, there is always one more document which may be relevant and nowhere near similar due to semantic ambiguity. The most important documents to a case arguably may be those which are in this outlier area, and more expensive to obtain.

Really? The most important documents are the ones that you did not find? Damn the expense, keep looking, because we are sure the outliers are key to the case! So much for proportionality. But then, proportionality has always been an argument for clients, not vendors. Still, I do not mean to be too critical. TCDI does end their report with a conciliatory statement that I totally endorse:

The application of a hybrid feature approach / complex concepts to Latent Semantic Indexing using very simple automated parsing and query construction appears promising in generating a high Recall set based solely on initial topic modeling (Request for Production). …

This automated study is not about replacing the human intelligence required to successfully complete an end-to-end review. It is one part of a display of how automated and human assisted workflows can in tandem guide a historically expensive process into a realm of data proportionality and expectation.

So it appears our disagreements are relatively minor, perhaps even just latent semantic and attitude based. The important thing is we agree in principle to the hybrid approach and to proportionality. Hey, they even used my word hybrid, so I have got to like this company and report author.

University of South Florida

The University of South Florida report is entitled Modeling Concept and Context to Improve Performance in eDiscovery. It was written by H. S. Hyman and Warren Fridy III. The abstract of the report starts with an interesting sentence that puts legal search in perspective with other kinds of search:

One condition of eDiscovery making it unique from other, more routine forms of IR is that all documents retrieved are settled by human inspection.

I guess this means that other areas of search do not have the gold/lead standard fuzzy lens issues we have. The paper abstract goes on to make two more good points:

Automated IR tools are used to reduce the size of a corpus search space to produce smaller sets of documents to be reviewed. However, a limitation associated with automated tools is they mainly employ statistical use of search terms that can result in poor performance when measured by recall and precision. One reason for this limitation is that relevance – the quality of matching a document to user criteria – is dynamic and fluid, whereas a query – representing the translation of a user’s IR goal – is fixed.

In other words, to put it plainly, keyword search sucks in legal search of chaotic data sets like email and loose file collections. The relevance determinations are a moving target – too fluid and dynamic for keyword search alone to work. Keywords have to be used very carefully.
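Since recall and precision come up throughout these reports, here is a minimal sketch of how the two measures are computed, and of how the same search result scores differently when the relevance judgments shift. The document ids and both sets of judgments are invented for illustration; nothing here comes from the TREC data.

```python
def precision_recall(retrieved, relevant):
    """Precision = share of retrieved documents that are relevant.
    Recall = share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# One fixed search result, judged against two hypothetical reviewers
# who disagree about which documents are relevant.
retrieved = {1, 2, 3, 4}
reviewer_a = {2, 4, 5, 6, 7}
reviewer_b = {2, 3, 4, 5}

scores_a = precision_recall(retrieved, reviewer_a)
scores_b = precision_recall(retrieved, reviewer_b)
print(scores_a)  # (0.5, 0.4)
print(scores_b)  # (0.75, 0.75)
```

The query and its results never changed, yet the measured performance did, because the standard used to measure it moved. That is the fuzzy lens problem in miniature.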

The University of South Florida researchers have a good handle on the problem and were testing one possible solution that combines concept and context modeling to enhance search term performance. They used a hybrid multimodal approach with the following basic strategy to solve the unique problems of e-discovery:

In answering this question we propose an approach to model three constructs: (1) Concepts underlying the fixed search terms and queries, (2) Context of the domain and the corpus, and (3) Elimination terms used as counter-measures for reduction of nonrelevant documents.

This is one of the better written papers with frequent use of only 19th grade English.

Ursinus College

The Ursinus College report is titled Latent Semantic Indexing with Selective Query Expansion. It was written by Andy Garron and April Kontostathis. They are one of the participants that tried out an automatic (Borg) approach. Their one sentence description of the task was concise and accurate:

The E-Discovery simulation includes an opportunity for machine learning based on relevance feedback – i.e. training systems to improve search results over multiple iterations after consulting with a Topic Authority (TA).

Here is the description they provide of their multi-dimensional latent semantic indexing approach:

The system we implemented for both 2010 and 2011 is based on Latent Semantic Indexing (LSI), a search method that attempts to draw out the meaning of terms. In particular, we implemented Essential Dimensions of LSI (EDLSI), which combines standard Vector Space retrieval with LSI in a “best of both worlds” approach. In 2011, teams are allowed multiple submissions for each query (“runs”), after each run they receive relevance judgments for a number of documents. This procedure lends itself intuitively to selective query expansion. In selective query expansion, we modify the query using information from documents that are known to be relevant in order to train the system to produce better retrieval results. We implemented selective query expansion as a machine learning feature in our system.

So it appears they are trying to make their machines more intuitive in query expansion, meaning, I think, the selection of new keywords to add to training. I know that Data on Star Trek never really attained the human hunch capacity, but maybe the robots from Ursinus will do better.

University of Melbourne

The University of Melbourne report, Melbourne at the TREC 2011 Legal Track, was written by William Webber, a frequent contributor to this blog, and Phil Farrelly. They tried both TechAssist (Hybrid) and Automatic (Borg) approaches. It looks like they used keyword term occurrences with binary weights as part of the seed set generation. This was a very short report and, unfortunately, I did not understand most of it.

I asked William about the report and he admitted the report was rushed. He also admitted that the experiment they tried this year did not work out too well for various reasons. William then sent me his informal explanation of what his team did for publication in this blog. This time he used language that I could understand. Here is a slightly edited version of what he sent:

What the Melbourne team did at TREC 2011 was fairly mainstream predictive coding (“text classification” in technical jargon). The Support Vector Machine (SVM) is a standard text classification algorithm, that I imagine is widely used in predictive coding, including by vendors in the U.S. “Active learning” refers to the way we selected documents for coding to improve the classifier: instead of picking documents at random, we chose those documents that the classifier was “most unsure about;” these are the documents that the classifier might give a 50% probability of relevance to, as you were encountering in the Kroll OnTrack system. [William is referring to my descriptions in the seven-part search narrative using Inview.]

The initial set of documents for coding were selected by simple keyword queries. All the above is fairly standard predictive coding. As an experiment, we tried two different sources for responsiveness coding. One was to ask the official TREC topic authority for assessments; the other was to ask an assessor internal to the team (who had e-discovery experience, though he was not a lawyer) for annotations. We wanted to see how well you could do if your annotations were made by someone other than the person who was defining the conception of relevance.

In the event, we did better with the internal than with the official annotations. However, our scores were uniformly poor, so little can be concluded from this finding. Whether our poor scores were due to a bug in our system, or to not getting enough annotations from the official topic authority (we found the turnaround to be very slow, for whatever reason), or why, I’m not sure.
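The active learning selection step William describes, picking the documents the classifier is "most unsure about," can be sketched in a few lines. This is only an illustration of the general idea, not the Melbourne team's code: it assumes some classifier has already produced a probability of relevance for each unreviewed document, and the function name, document ids and scores are all hypothetical.

```python
def most_uncertain(probabilities, batch_size):
    """Pick the documents whose estimated probability of relevance is
    closest to 0.5, i.e. the ones the classifier is least sure about.

    probabilities: dict mapping document id -> probability in [0, 1]
    """
    ranked = sorted(probabilities,
                    key=lambda doc_id: (abs(probabilities[doc_id] - 0.5), doc_id))
    return ranked[:batch_size]

# Hypothetical classifier output for five unreviewed documents.
scores = {"d1": 0.97, "d2": 0.52, "d3": 0.10, "d4": 0.45, "d5": 0.70}

next_batch = most_uncertain(scores, batch_size=2)
print(next_batch)  # ['d2', 'd4']
```

The documents scored 0.97 and 0.10 are skipped because the classifier is already fairly sure about them. A human judgment on the borderline documents is where the next round of training should gain the most.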

University of Waterloo

The University of Waterloo report, University of Waterloo at TREC 2011: A Social Networking Approach to the Legal Learning Track, was written by Robert Warren of the David R. Cheriton School of Computer Science. This team used a truly unique approach that I can summarize as the Borg go Facebook. Here is their more sciency explanation:

The goal of the experiments this year was the exploration of whether social network analysis could be applied to the problem of e-discovery and legal document retrieval. We also opted for a fully automatic approach in that only responsiveness judgments from the topic authorities were used in the learning component of the system.

To perform this social media experiment they used, I kid you not, the Wumpus Search Engine. They claim to have counted 255,964 individuals that sent or received documents within the Enron dataset. Although most of their paper is not really intelligible to me, despite efforts to explain by example involving phone calls to pizza stores, I gather they were looking for patterns of who talked to whom as a way to find relevant evidence for seed sets. Apparently it did not work out too well, but they want to try again by adding the fourth dimension, i.e., time.
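The most basic building block of a "who talked to whom" analysis is simply counting messages between each sender and recipient. Here is a minimal sketch of that step. It is not the Waterloo team's method and has nothing to do with the Wumpus Search Engine; the message format, names and function are assumptions invented for illustration.

```python
from collections import Counter

def count_pairs(messages):
    """Count messages for each (sender, recipient) pair.

    messages: iterable of (sender, [recipients]) tuples.
    Messages that people send to themselves are ignored.
    """
    pairs = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                pairs[(sender, recipient)] += 1
    return pairs

# Hypothetical email headers, not drawn from the Enron dataset.
messages = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("bob", ["alice"]),
    ("carol", ["carol", "dave"]),
]

pairs = count_pairs(messages)
people = {person for pair in pairs for person in pair}
print(len(people), "individuals")
print(pairs[("alice", "bob")], "messages from alice to bob")
```

A real analysis would build on counts like these, for example by looking for unusually heavy traffic between particular people, and adding timestamps to each message would supply the time dimension mentioned above.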


Reading these original reports should inoculate you from the nonsense that you may read elsewhere about TREC Legal Track. The participants’ reports also provide important information on the science involved that you will not find in the Overview. For instance, the reports provide more perspective on the fuzzy lens and gold standard problems, and what we can do to improve quality controls for review. The reports also provide critical insight into the kinds of tests and experiments the participants were running.

Study of these materials will also prepare you to read the post-hoc scientific analysis of the 2011 results that will surely follow in the next few years. I look forward to these future studies and commentaries from information scientists. They will provide, as they have for prior conferences, detailed analysis, interpretation and critiques of the 2011 conference. Sometimes these materials can be difficult to understand. They are often written for fellow academics and scientists, not lawyers, but I encourage you to make the effort.

I also encourage the scientists who write these reports about Legal Track to try to constrain their propensity to the over-use of technical jargon and inside language and abbreviations. We lawyers know a lot about obscure technical talk too. We practically invented it. We can slip into technical lawyerese any time we want and I guarantee that scientists will not understand it. But, unless we are writing arcane type articles intended only for other specialists, our goal is communication to a wider audience, to all members of our e-discovery team. Lawyers train themselves to process and understand very complex issues and facts so that they can explain things in 12th grade English. I am not pleading with you to go that far, but how about 19th grade English, and not so much techno-shorthand? How about writing for the whole team of professionals passionate about legal search, not just other information retrieval scientists? I know you need some techno-speak and math, but remember that these studies will be read by lawyers and technologists too, so try to include segments for us.

Like it or not, scientists are important members of a modern-day interdisciplinary e-discovery team, especially when it comes to advanced search techniques. Along with technologists, information scientists and the insights they offer are key to efficient and effective legal search. Still, at the end of the day, an e-discovery team must be led by an attorney. Legal search concerns lawsuits and the gathering and production of evidence to be used in courts of law. Rule 26(g) of the Federal Rules of Civil Procedure places the full responsibility of compliant legal search on legal counsel, not consultants.

Justice is not an academic pursuit or government study. It is a real life imperative, and often a messy one at that. Lawyers have a legal and ethical duty to lead and supervise their e-discovery teams in all kinds of conditions. Lawyers can meet these challenges if they have a good team and can understand what the technologists and scientists are saying. If lawyers will take the time to study these TREC reports, and understand the language of technology and science, they will be better able to discharge their duties as team leaders.
