I have spent some time over the last couple of weeks mulling over Jason Baron’s guest post on coming up with best practices or standards in eDiscovery. See Jason Baron’s In Search of Quality: Is it Time for E-Discovery Search Process Quality Standards? I want to thank Ralph for letting me post my thoughts on eDiscovery standards, more specifically as they relate to search and retrieval technologies.
The Right Question
In assessing Jason’s proposal/discussion of whether it is time for eDiscovery search process quality standards, I find myself thinking that there isn’t just one right question to ask when it comes to search and retrieval processes or search methods. Unfortunately, there is no easy button here – there isn’t one magic way to search data and get “the” answer. In his guest post, Jason suggests these two questions are:
[t]he right questions: how does one go about designing an optimal process that produces a quality result. And are there ways to regularize or standardize that process so as to “certify” the result in a way that is defensible?
The issue with treating these two questions as the “right” questions is this: the search process could be perfect, and your implementation of that search process could be perfect, but if, in implementing the “perfect search process,” you chose an ineffective search method for that data type and content, the perfect process won’t matter. Your search results will stink and you won’t get the relevant information you need or want. The search process is too highly variable to standardize. One of the best things we can do to help the eDiscovery industry move beyond the “Wild Wild West” is to concentrate on the fundamental reason why you are using search in the first place.
I think the right question to ask for search process quality standards is this: Is your search process using a search method that is appropriate for the data type and content you have? Or put another way: Are your search results valid given your search method? If you validate your chosen search method(s) will that obviate the need for any other standards or certification program?
Search Process v. Search Method
First, I want to briefly explain the small but crucial difference between search process and search method.
Search process is the set of steps followed in analyzing a data set and retrieving data from it. A search process may involve an initial analysis of the type and amount of data; the choice of a search method or methods to use on the data; the setting of acceptable precision and recall levels; an iterative process of searching the data and analyzing the results against those precision and recall levels; and a quality control review of the entire process and results. Many other steps could be part of a search process – it is a highly individual set of steps. One search process may resemble another, but an identical process is highly unlikely because so much depends on the data itself (more on that point below). For more information on what to consider in your search process, see The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process (May 2009).
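The precision and recall levels mentioned above are simple ratios, typically measured against a sample of documents that reviewers have judged. A minimal sketch in Python (the document IDs here are hypothetical illustrations, not a real review tool):

```python
# Precision/recall for one search run, judged against a reviewed sample.
# All document IDs below are hypothetical.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["doc1", "doc2", "doc3", "doc4"]   # what the search returned
relevant  = ["doc2", "doc4", "doc5"]           # what reviewers judged relevant

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

In an iterative process, you would refine the search method and re-measure these numbers each round until they reach the acceptable levels set at the start.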
Search Method is the actual type of search and retrieval technology (or technologies) used to find data in a particular data set – often a combination of methods is used, not just one singular search method. There are many types of search methods and it is important for anyone setting out to employ a search method to understand the differences. For a longer explanation of different search methods, please see the paper I wrote for the DESI III workshop, Are Lawyers Being Replaced by Artificial Intelligence? Moving Beyond Keyword Search: An Introduction to Advanced Search & Retrieval Technologies.
The critical difference between search process and search method is often misunderstood in the eDiscovery industry. For example, keyword search is a search method. Agreeing to use keyword search on a data set to find relevant data for eDiscovery is part of the search process. The search method used for any particular data set will depend on a variety of factors: the size of the data set and the type of data (text, documents, numbers, database entries, calendar items, amount and quality of OCR’d data, etc.). This list is just the tip of the iceberg when it comes to the factors to consider in choosing an appropriate search method. Jason correctly points out two of the main problems with eDiscovery data sets:
Two fundamental obstacles to achieving “the perfect search”: the intractability of language, and the problem of expanding ESI volume… and [I] made an initial stab at suggesting coping strategies for the profession, including the use of alternative search methods, sampling, and cooperation, including staged negotiations amongst parties (i.e., multiple meet and confers) — all of which I am happy to report have come to the fore over the last few years.
The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (August 2007) is extremely helpful to anyone in eDiscovery because it discusses in detail the two problems Jason highlighted — the intractability of language and the ever expanding volume of ESI — as well as different search methods that can be used to find relevant data.
To get you started, I’ve included a cursory explanation of a few common search methods.
- Keyword – keyword search is a linguistic method that requires a word to be in the data set to retrieve data containing that word.
- Keyword plus – keyword plus is just a shorthand way to refer to the use of keyword search along with other search techniques such as Boolean operators (AND, OR, NOT), stemming (hous*), or proximity searching (house w/5 of mate).
- Clustering – clustering methods are statistically based and group (or cluster) similar data together based upon common factors. This is sometimes referred to as concept searching. For example, airplane, aircraft, and plane might all appear together in one cluster.
- Ontologies – ontological methods are linguistically based and group things together based upon a query expanded using thesauri and other techniques. Ontologies are especially effective on foreign language, cryptic language use, and other code-based data sets.
- Predictive coding – this is a search method where software is used to predict what data is relevant.
- Determinative coding – this is a search method where the software extrapolates the relevancy decisions made on data samples by reviewers to an entire data set (addressing the volume problem).
- Pattern analysis – pattern analysis methods look at patterns in the data – these patterns may be word based (linguistic) or similarity/likeness based (statistical). Social network analysis is an example of a statistical based pattern analysis – how often do you communicate with Person A or Person B.
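To make the difference between plain keyword search and keyword-plus concrete, here is a minimal sketch in Python. The sample sentence and the five-word proximity window are hypothetical illustrations, not how any particular eDiscovery tool works:

```python
import re

def keyword_hit(text, term):
    """Plain keyword search: the exact term must appear in the text.
    Naive whitespace splitting -- punctuation can hide a match."""
    return term.lower() in text.lower().split()

def proximity_hit(text, a, b, window=5):
    """Keyword-plus proximity search: a and b within `window` words
    of each other (a rough analogue of 'a w/5 b')."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

doc = "The house sale fell through; my mate was furious."
print(keyword_hit(doc, "house"))            # True
print(proximity_hit(doc, "house", "mate"))  # True: 5 words apart
```

Even this toy version shows why keyword-plus exists: proximity constraints narrow results in a way a bare keyword match cannot.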
One or more search methods can be employed in any search process. Other search techniques include a machine learning approach, which is necessarily iterative, using a feedback loop to improve the search results. Fuzzy search models can be used when the “intractability of language” problem creeps up and the exact word you think was used, wasn’t. Probabilistic (or Bayesian) models use language to draw inferences in the data. Again, these are just a few examples to stimulate your thinking on what search methods are and what is available. For more detailed information on all of these methods and more, please refer to two papers: the one I previously mentioned that I wrote for a DESI workshop in 2009, and the Sedona Conference’s Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (August 2007).
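As a hedged illustration of the fuzzy search idea, the sketch below uses Python’s standard-library difflib to find vocabulary words close to a query even when the exact spelling is absent. The vocabulary and similarity cutoff are hypothetical; real fuzzy search engines use more sophisticated models:

```python
import difflib

def fuzzy_hits(query, vocabulary, cutoff=0.8):
    """Return vocabulary words similar enough to the query to count
    as hits -- useful when a misspelling (or dirty OCR) means the
    exact word you expected simply is not in the data."""
    return difflib.get_close_matches(query, vocabulary, n=5, cutoff=cutoff)

# Hypothetical index vocabulary, including a misspelling.
vocab = ["settlement", "setlement", "statement", "agreement"]
print(fuzzy_hits("settlement", vocab))
```

A plain keyword search for “settlement” would miss the document containing the misspelled “setlement”; the fuzzy match catches it.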
All Search is Not Created Equal
All search is not created equal. Let me repeat that, because it is important: ALL SEARCH IS NOT CREATED EQUAL. One search method (take keyword search) may work great on one set of data (say memos and other documents) and may not work at all on another set of data (short text messages or IMs, for example). As Jason points out:
Context is, indeed, everything. As language is infinitely malleable, none of us mere mortals can reliably account for all possible word-choices that exist in deep repositories of ESI that make documents relevant (or that render documents seemingly relevant when they are not), without necessarily relying on more powerful stratagems in the form of concept search, predictive analytics, artificial intelligence, and the like.
Employing the “right” search method not only requires an understanding of the various search methods and what is available, but also an understanding of how those methods work on various types of data and their effectiveness on THAT data set: context matters, the type of data matters, and the content of the data matters.
Effective search can be likened to eating trail mix – if you are trying to eat only the chocolate chips out of it, you may want to use ontologies to pinpoint only the chocolate chips in the data; if you like to eat all of the peanuts first, use clustering to cluster all of the peanuts together so you can eat them first. Effective search can also depend on the scale of the data set – for example, you may want to use ontologies to define a particular set of documents of interest, then use clustering or other search methods to measure the similarity among them – but again it completely depends on the data set you are working with.
Here are a few more detailed explanations of where search methods can be employed effectively and ineffectively on different types of data sets.
- Appointments are often very similar in overall form, but contain specific pieces of information that are often critical: particular dates, particular people, etc. Clustering technologies may be very ineffective here because the appointments will be deemed similar enough to put in the same cluster. When minute details matter, clustering may not be effective.
- Jurisdictional issues can be tricky when the geographic difference matters. Suppose the Santa Clara Water Management District is relevant, but the San Mateo Water Management District is not, yet they use the exact same forms and discuss the same topics. Distinguishing among jurisdictions often requires the use of ontologies; clustering won’t work, because the documents would all end up in one cluster.
- Pharmaceutical/chemical examples – the type of search will matter if your search shows all isotopes in one cluster but you only wanted one isotope as responsive, and its sibling isotope as not responsive. This can be an issue if all of the isotopes are used in the same tests, etc.
- Dirty OCR’d data, involving many misspellings, can be problematic. Ontologies won’t work well unless they are supported by a system of misspelling extraction or misspelling generation. Clustering will work very effectively on this type of data, possibly in a combination or suite of search methods.
- Data with a lot of near duplicates can be difficult to deal with. Clustering technology can be very effective at grouping the similar documents (the near duplicates) together for review.
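The near-duplicate grouping in the last example can be approximated with a very simple similarity measure. The sketch below uses Jaccard word-set overlap and a hypothetical threshold; production clustering engines rely on far more sophisticated statistics, so treat this purely as an illustration of the idea:

```python
def jaccard(a, b):
    """Word-set overlap between two documents, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def group_near_duplicates(docs, threshold=0.7):
    """Greedily group documents whose similarity to a group's first
    member exceeds the threshold -- a toy stand-in for clustering."""
    groups = []
    for doc in docs:
        for group in groups:
            if jaccard(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups

docs = [
    "please review the attached settlement draft today",
    "please review the attached settlement draft tomorrow",
    "board meeting moved to friday",
]
print(len(group_near_duplicates(docs)))  # 2: the two drafts group together
```

The two near-identical drafts land in one group for a single review pass, while the unrelated message stands alone.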
Using an appropriate search method to distinguish between these bits of information is important for search to be effective on a particular set of data. These are just a few examples where one search method can be utterly ineffective and another search method can be very effective. Although these few examples concentrate on the difference between ontology and clustering technologies in particular, they are intended to make the point that the blind use of a search method can cripple the effectiveness of your search results.
At the end of the day, we need to be able to ask the producing party: “Was the search that you used validated to show that you achieved acceptable results?” Whether it is third-party validation of the search method or self-validation remains up for discussion. (Although self-validation of the chosen search method seems a bit like the fox guarding the henhouse, much like the problem with self-collection of data.) I would advocate that the DESI workshop in June include discussion of validation of the chosen search method as well as the search process.
The single biggest factor in whether a search method is going to be successful or not is knowing what you want to know about the data or get out of the data. Usually the response to this is: “I want to produce all responsive data.” That is a simplistic answer and is not at all getting to the fundamental reason why you are searching through the data in the first place. The point of the search is to find data that will make or break your case, not just find responsive data.
The “successful” search process will involve picking a search method that will find the particular kind of data you want, and that will depend on your fundamental reason for searching through the data. Performing quality control in the search process will involve an iterative process to determine if your search method is working the way you thought it would. Is it generating the results or the type of information you were looking for? Is it returning what you asked for, but not what you actually need? Is it returning the wrong thing entirely? All of these questions about your search method need to be part of whatever search process you put together. As Jason points out:
More recently, what has emerged out of the research is that we can in fact do a much better job finding relevant documents if we employ iterative processes with human-in-the-loop experts serving as topic authorities — in other words, a form of hybrid approach that relies neither on brute force manual search nor fancy computer algorithms alone. In an article to appear in the above-mentioned Spring 2011 e-discovery issue of the Richmond Journal of Law and Technology, Maura Grossman and Gordon Cormack present tantalizing findings derived from the 2009 running of the Track’s “interactive task,” in which participating teams could use any combination of search methods including keyword searches, machine learning, and/or human review. The article supports the use of technology-assisted review and places one more nail in the coffin where the “myth of manual review being the gold standard for the legal profession” resides (or should).
This simple illustration tells me that it is still the Wild Wild West out there, with the definite possibility that we as lawyers — who start off generally not being all that very well informed about the quality of the search algorithm being employed by a legal service provider — have a lot of questions to ask. Outside a particular legal setting, the larger academic question of interest devolves to: what kind of “process” has been employed so as to achieve optimum results in a given legal context.
Iteration is an integral part of any search process, but more importantly, it is an integral part of analyzing the effectiveness of the search method on the given data set. The effectiveness questions about the search method should be asked every time, because you can iterate all day long, but iteration won’t do any good if the search method you have chosen (say, keyword search) cannot work on the data set you have (the keyword simply is not in the data). The data must be present to support the effectiveness of your chosen search method(s). Knowing that data and what you want from it is key.
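That failure mode is easy to check for up front. A minimal sketch in Python (the corpus and search terms are hypothetical) that flags search terms appearing nowhere in the data, so you know before iterating that a keyword approach cannot find them:

```python
def missing_terms(terms, corpus):
    """Report search terms that appear nowhere in the corpus --
    no amount of iteration will make keyword search find these."""
    vocabulary = set()
    for doc in corpus:
        vocabulary.update(doc.lower().split())
    return [t for t in terms if t.lower() not in vocabulary]

corpus = [
    "the raptor entity absorbed the losses",
    "wire the funds by friday",
]
print(missing_terms(["raptor", "fraud", "funds"], corpus))  # ['fraud']
```

If a term on your agreed keyword list comes back missing, that is a signal to reconsider the term or switch to a different search method, not to keep iterating.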
Other Forms of Search/Connectedness
In analyzing data for networks and connectedness among people, keyword search would be very ineffective; pattern analysis would probably be much more effective. For example, take the Bear Stearns case, where one of the fund managers would send derogatory remarks about the fund under management to the spouse of the other manager at a non-corporate email address. In this case, it may even be important to distinguish between the professional networks and the family and friends networks. You wouldn’t want to consider all of the friends and family network data to be non-responsive in this case. You would want to highlight the suspicious communication channel. Keyword search would not find this inter-connectedness. Again, choose the appropriate search method to fit the data set and what you want to know from the data.
If you are looking at the data and you encounter a topical search issue, you will need to choose a search method or combination of search methods to help find information related to that particular topic. For example, “Board Meeting” might be used in a family and friends context to refer to the meeting of the board of the local country club, not the corporation. If you have used keyword search or clustering to take a look at the data that mentions “Board Meeting”, you may need to use another search method to disambiguate between the two types of board meetings even though the same words (Board Meeting) are used.
Simply comparing the most common words in the English language to your data index will be illuminating – this simple comparison would have quickly led anyone looking through the Enron data to identify Raptor, and many other code words, as important (or at the least, out of the ordinary enough to research further). Just using keyword search as your search method would not have returned the truly desired data results – i.e. keyword search would have missed the good stuff.
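As a hedged illustration of that comparison: the sketch below checks recurring corpus words against a tiny, hypothetical list of common English words; a real implementation would use a full frequency list for the language, but the mechanics are the same:

```python
# A tiny stand-in for a real list of the most common English words.
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "it", "you", "that",
    "he", "was", "for", "on", "are", "with", "as", "his", "they", "at",
}

def unusual_words(corpus, common=COMMON_WORDS, min_count=2):
    """Words that recur in the corpus but are not everyday English --
    candidates for code words worth a closer look."""
    counts = {}
    for doc in corpus:
        for word in doc.lower().split():
            if word.isalpha() and word not in common:
                counts[word] = counts.get(word, 0) + 1
    return sorted(w for w, c in counts.items() if c >= min_count)

# Hypothetical snippets echoing the Enron "Raptor" example.
corpus = [
    "move the losses to raptor before the quarter closes",
    "raptor can absorb it for now",
    "see you at the game",
]
print(unusual_words(corpus))  # ['raptor']
```

A code word like “Raptor” surfaces precisely because it recurs without being ordinary English – something a keyword search could never reveal, since no one would have known to search for it.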
Now, all of these examples are provided not to bash any one type of search method, but to illustrate that all search is not created equal. It is meant to simply illustrate that your search method or combination of search methods matters and that you may need to use different search methods and combinations of search methods to find your relevant data.
Pre-work is Required
The bad news in all of this discussion about choosing an appropriate search method is that it requires pre-work. It requires knowing something about the data set – what type of data is it? Is it mostly email, documents, calendar data, database entries, numbers, medical test data? It also requires knowing a little bit about what you want from the data. Do you want exculpatory data, do you want to illustrate a pattern of behavior (e.g., discrimination), or do you want to show the context of a bad email? The possibilities are endless. Choosing an appropriate search method also requires knowing the good, the bad, and the ugly about the available search methods, and the ability to match an appropriate search method to your data needs.
All of this pre-work can be time consuming and it costs money up front. I believe it is cost-effective in the long run, but it requires clients to spend money sooner than they have in the past. In the past, litigation spend looked like a typical logarithmic J-curve. The current spend pattern is morphing into two hills (like two bell curves next to each other). The net result of the two hills should be that the overall spend is less than the one mountain.
Search is NOT One Size Fits All
Search of information is not one size fits all. Search is an inexact science that requires iteration, knowledge about how search works, and knowledge about the data. Search methods are not one size fits all – they aren’t even one size fits most. They are highly specific to the data set and the desired search results. Unfortunately, our industry seems to want perfect search results (whether from a search process or a search method). The industry behaves as if one size fits all – I see law firms buying a search solution license and then applying it to all clients’ data no matter what is in the data or what the desired results are. This kind of reckless application of search can only yield ineffective search results and cost more money.
Neither approach is workable – with the expectation of perfect search, you will find that perfect results are impossible to achieve and “near perfect” results will be very, very (read prohibitively) expensive to achieve. With the one size fits all approach (using one search method on every data type) your search results will be wholly unsatisfying and expensive to sort through. Choosing an appropriate search method for the current data set and the desired data subset can be much more cost effective in the long run and can certainly be cost effective for repeat litigation.
None of this is about building standards into the eDiscovery process – it is all about the validation of the chosen search method as it relates to the data set and what is desired from the data set. Validating a search method would mean determining that the search method (or combination of search methods) is appropriate for the data set AND for the desired results. If the chosen search method were validated as appropriate, would other standards still be needed?
Fundamentally, finding and using the appropriate solution depends on what you are looking for. The appropriate search solution is different for different types and sizes of data sets and it mainly depends on what you want to find in the data set. That is the biggest factor dictating how you will go about finding it and choosing an appropriate search method.
Basically, “successful search” comes down to quality control of the search process and its implementation, which is what Jason will be discussing in the DESI workshop in June. More importantly, “successful search” comes down to the validation of the chosen search method itself, which I am suggesting should be central to any discussion about eDiscovery search process quality standards or certification programs.