Seventh Class: Concept and Similarity Searches
In this class we will cover our insights on two of the remaining four basic search methods: Concept Searches (Passive, Unsupervised Learning) and Similarity Searches (Families and near Duplication). In the next class we will cover Keyword Searches (tested, Boolean, parametric) and Focused Linear Search (key dates & people). The five search types are all in the familiar Search Pyramid shown below.
The e-discovery search software company Engenium was one of the first to use Passive Machine Learning techniques. Shortly after the turn of the century, in the early 2000s, Engenium began to market what later became known as Concept Searches. They were supposed to be a major improvement over the then-dominant Keyword Search. Kroll Ontrack bought Engenium in 2006 and acquired its patent rights to concept search. These software enhancements were then taken out of the e-discovery market and removed from all competitor software, leaving them available only in Kroll Ontrack’s own products.
The same thing happened in 2014 when Microsoft bought Equivio. See e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million – Part One and Part Two. We have yet to see what Microsoft will do with it. All we know for sure is that Equivio’s active machine learning add-on for Relativity is no longer available. (Relativity’s RAR offering is not active machine learning, but rather an advanced and enhanced version of passive machine learning.)
Several other vendors are stepping in to fill the old Equivio shoes and provide a software module add-on to Relativity that provides active machine learning capabilities. Ironically, the new KrolLDiscovery is one of these companies, but there are several others that offer a module to all Relativity vendors. I have seen a few of them and they look pretty good.
Back to the history lesson. David Chaplin, who founded Engenium in 1998 and sold it in 2006, was Kroll Ontrack’s V.P. of Advanced Search Technologies from 2006 to 2009. He is now the CEO of two Digital Marketing Service and Technology (SEO) companies, Atruik and SearchDex. Other vendors then emerged in the 2006-2009 time period to stay competitive with the search capabilities of Kroll Ontrack’s document review platform. They included Clearwell, Cataphora, Autonomy, Equivio, Recommind, Ringtail, Catalyst, and Content Analyst. Most of these companies have since gone the way of Equivio and are now ghosts, gone from the e-discovery market as stand-alones. There are a few notable exceptions, including Catalyst, who participated in TREC with us in 2015 and 2016.
Passive Machine Learning
The so-called Concept Searches used by the legal search specialists in the 2006-2009 era relied on passive machine learning. This kind of learning does not depend on training or active instruction by humans (learning that does depend on such training is called supervised learning). It is all done automatically by computer study and analysis of the data alone, including semantic analysis of the language contained in documents. That meant you did not have to rely on keywords alone, but could state your searches in conceptual terms. Below is a screenshot of one example of a concept search interface, using Kroll’s EDR software.
For a good description of these admittedly powerful, albeit no longer state-of-the-art, search tools, see the article by D4’s Tom Groom, The Three Groups of Discovery Analytics and When to Apply Them. The article refers to Concept Search as Conceptual Analytics, which it describes as follows:
Conceptual analytics takes a semantic approach to explore the conceptual content, or meaning of the content within the data. Approaches such as Clustering, Categorization, Conceptual Search, Keyword Expansion, Themes & Ideas, Intelligent Folders, etc. are dependent on technology that builds and then applies a conceptual index of the data for analysis.
Search experts and information scientists know that active machine learning, also called supervised machine learning, was the next big step in search after concept searches. Again, these older concept search techniques are known in the data science and engineering world as passive or unsupervised machine learning. The instructional chart below by Hackbright Academy sets forth the key differences between supervised learning (predictive coding) and unsupervised or passive learning (analytics, aka concept search).
Tips on Use of Passive Machine Learning
It is usually worthwhile to spend some time using concept search to speed up the search and review of electronic documents. We have found it to be of only modest value in simple search projects, with greater value added in more complex projects, especially where the data is very complex. Still, in all projects, simple or complex, Concept Search features such as document Clustering, Categorization, Keyword Expansion, and Themes & Ideas are at least somewhat helpful. They are especially helpful in finding new keywords to try out, including wild-card stemming searches with instant results and data groupings.
In simple projects you may not need to spend much time with these kinds of searches. We find that an expenditure of at least thirty minutes at the beginning of a search is cost-effective in all projects, even simple ones. In more complex projects it may be necessary to spend much more time on these kinds of features.
Passive, unsupervised machine learning is a good way to be introduced to the type of data you are dealing with, especially if you have not worked with the client data before. In TREC Total Recall 2015 and 2016, where we were working with the same data-sets, our use of these searches diminished as our familiarity with the data-sets grew. These searches can also help in projects where the search target is not well-defined. There the data itself helps focus the target. Passive learning is helpful in this kind of sloppy, “I’ll know it when I see it” type of approach. That approach usually indicates a failure of both target identification and SME guidance. Even with simple data you will want to use passive machine learning in those circumstances.
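To make the keyword expansion idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how any vendor's Concept Search is built: it suggests new keywords by counting which terms appear in the same documents as a seed keyword. The function names, the sample documents, and the stopword list are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def expand_keyword(seed, documents, top_n=3, stopwords=frozenset()):
    """Suggest terms that most often share a document with `seed`.

    Illustrative assumption: co-occurrence in the same document is a rough
    stand-in for conceptual relatedness. Commercial tools use richer models.
    """
    counts = Counter()
    for doc in documents:
        terms = set(tokenize(doc))
        if seed in terms:
            counts.update(terms - {seed} - set(stopwords))
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]

# Hypothetical sample data, for illustration only.
sample_docs = [
    "The rebate agreement was signed in March",
    "Please shred the rebate schedule and the kickback memo",
    "Kickback and rebate totals attached",
    "Lunch on Friday?",
]
sample_stopwords = {"the", "and", "was", "in", "on", "please"}
suggestions = expand_keyword("rebate", sample_docs, top_n=3,
                             stopwords=sample_stopwords)
```

In this toy collection the term that shares the most documents with "rebate" is "kickback", which is exactly the kind of new keyword a reviewer would then want to test.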
Similarity Searches – Families and Near Duplication
Another kind of search that is indispensable for anyone’s multimodal toolbox is similarity search. We consider such searches to be based on types of near-duplication file analysis. In Tom Groom’s article, The Three Groups of Discovery Analytics and When to Apply Them, he refers to Similarity Searches as Structured Analytics, which he explains as follows:
Structured analytics deals with textual similarity and is based on syntactic approaches that utilize character organization in the data as the foundation for the analysis. The goal is to provide better group identification and sorting. One primary example of structured analytics for eDiscovery is Email Thread detection where analytics organizes the various email messages between multiple people into one conversation. Another primary example is Near Duplicate detection where analytics identifies documents with like text that can be then used for various useful workflows.
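One common way to measure the kind of textual similarity described above is to compare overlapping word sequences (shingles) with a Jaccard score. The Python sketch below illustrates that general approach under stated assumptions. It is not the algorithm of any particular review platform, and the three-word shingle size, the 0.5 threshold, and all names are hypothetical choices for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Compare every pair of documents; keep pairs at or above the threshold.

    `docs` maps a document id to its text. The threshold is an assumed value
    that a real project would tune.
    """
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id], k) for doc_id in ids}
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(sets[left], sets[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 2)))
    return pairs

# Hypothetical sample data: A and B differ only in the last word.
sample = {
    "A": "Please review the attached draft contract before Friday",
    "B": "Please review the attached draft contract before Monday",
    "C": "Lunch menu for the holiday party is attached",
}
pairs = near_duplicate_pairs(sample)
```

Comparing every pair is fine for a small sample but grows quadratically with collection size, so this sketch should be read as an explanation of the scoring idea rather than a scalable design.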
These methods can always improve the efficiency of a human reviewer’s efforts. They make it easier and faster for human reviewers to put documents in context. They also help a reviewer minimize repeat readings of the same language or same document. The near duplicate clustering of documents can significantly speed up review. In some corporate email collections the use of Email Thread detection can also be very useful. The idea is to read the last email first, or read in chronological order from the bottom of the email chain to the top. The ability to instantly see on demand the parents and children of email collections can also speed up review and improve context comprehension.
All of these Similarity Searches are less powerful than Concept Search, but tend to be of even more value than Concept Search in cases of simple to intermediate complexity. In most projects of simple or medium complexity, one to three hours are typically spent with these kinds of software features. Also, for this type of search the volume of documents is important. The larger the data set, especially the larger the number of relevant documents located, the greater the value of these searches.
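The email threading idea can be sketched in a few lines of Python. This is a deliberately simplified illustration that groups messages by subject line after stripping reply and forward prefixes, then sorts each conversation chronologically. Production threading tools typically rely on more than the subject line, and the field names ("id", "subject", "sent") and sample messages here are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches one or more leading "RE:", "FW:" or "FWD:" prefixes, any case.
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply/forward prefixes so related messages share one key."""
    return PREFIX.sub("", subject).strip().lower()

def build_threads(emails):
    """Group emails by normalized subject, oldest message first."""
    threads = defaultdict(list)
    for email in emails:
        threads[normalize_subject(email["subject"])].append(email)
    for messages in threads.values():
        messages.sort(key=lambda message: message["sent"])
    return dict(threads)

# Hypothetical sample data, deliberately out of order.
sample_emails = [
    {"id": 3, "subject": "RE: RE: Q3 pricing", "sent": datetime(2017, 3, 3, 9, 0)},
    {"id": 1, "subject": "Q3 pricing", "sent": datetime(2017, 3, 1, 9, 0)},
    {"id": 2, "subject": "Fwd: Q3 Pricing", "sent": datetime(2017, 3, 2, 9, 0)},
    {"id": 4, "subject": "Lunch", "sent": datetime(2017, 3, 2, 12, 0)},
]
threads = build_threads(sample_emails)
latest_first = threads["q3 pricing"][-1]   # the last email in the chain
```

A reviewer following the "read the last email first" approach would start with `latest_first`, while iterating over `threads["q3 pricing"]` in order gives the chronological reading from the bottom of the chain to the top.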
We conclude this class with a quick tip by Losey on his favorite use of a Similarity Search.
Or pause to do this suggested “homework” assignment for further study and analysis.
SUPPLEMENTAL READING: If you have not already done so, click on the links in this class and review the referenced web pages. Be sure to carefully study Tom Groom’s article, The Three Groups of Discovery Analytics and When to Apply Them.
EXERCISES: Search for articles in the general field of artificial intelligence that discuss the differences between active machine learning and passive machine learning. Read a few of these articles. Also look for uses of the term “concept search,” which is generally only used in the field of legal search, and read a couple of these too. Also try researching the phrase “latent semantic indexing.”
Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!
e-Discovery Team LLC COPYRIGHT 2017
ALL RIGHTS RESERVED