License to Cull: Updated description of the two-filter document culling method

July 5, 2015

A PDF version of this article is available for download. You may freely distribute it for non-profit purposes.

Every attorney has a license to cull irrelevant data before beginning expensive linear review. It is part of their duty to protect their clients and country from waste and abuse. This article describes the two-filter culling method I’ve devised over the years to identify and bulk-code irrelevant documents. The method is designed for use before commencing a detailed attorney review. The efficacy of any large-scale document review project can be enhanced by this double-cull method. In my experience, it not only helps to reduce costs, it also maximizes recall, allowing an attorney to find all of the documents needed for a case quickly and efficiently.

I briefly introduced this method, and the diagram shown at right illustrating it, at the conclusion of a lengthy article on document review quality control: Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part Three (e-DiscoveryTeam, 2015). I use the two-filter method in most large projects as part of my multimodal, bottom-line-driven, AI-Enhanced (i.e., predictive coding) method of review. I have described segments of this method, including especially predictive coding, in prior articles on document review. They are listed at the bottom of the Legal Search Science website. I also described this process as part of the Electronic Discovery Best Practices website, which outlines my views on the best practices for lawyers doing e-discovery. (Please note that all views expressed here, and in my other writings, are my own personal opinions, and not necessarily those of my law firm or clients.)

The two-filter culling method includes the well-known technology processes of deduplication and DeNISTing in the first filter. (Note: I always do full horizontal deduplication across all custodians.) DeNISTing removes known system and program files by matching their hash values against the NIST National Software Reference Library. Deduplication and DeNISTing are, however, just technical engineering filters, not based on legal analysis or judgment. They are well-established industry standards, so I will not discuss them further in this article.
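For the technically curious, the basic mechanic is easy to illustrate: exact duplicates are identified by matching hash values across the whole collection. Below is a minimal sketch of that idea in Python. It is an illustration only, not any vendor's implementation; the folder path is hypothetical, MD5 is just one common hash choice, and real email deduplication usually hashes normalized metadata and body fields rather than whole files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hash of a file, read in chunks so large files fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(collection_root: str) -> dict:
    """Group every file under collection_root by hash, across all custodian folders.
    The first file in each group is kept as the master; the rest are suppressed
    as duplicates (horizontal deduplication)."""
    groups = defaultdict(list)
    for path in Path(collection_root).rglob("*"):
        if path.is_file():
            groups[file_hash(path)].append(path)
    return groups

if __name__ == "__main__":
    for digest, paths in deduplicate("./collection").items():  # hypothetical folder
        if len(paths) > 1:
            print(f"{digest}: keeping {paths[0]}, suppressing {len(paths) - 1} duplicate(s)")
```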

Many e-discovery beginners think that DeNISTing and deduplication are the end-all of ESI culling, but that is far from true. They are just the beginning. The other methods described here all require legal judgment, so you cannot just hire an e-discovery vendor to do them, as you can with deduplication and DeNISTing. Legal judgment is critical to all effective document review, including the culling of irrelevant documents before lawyers spend their valuable time in linear review. In my opinion, all legal review teams should employ some type of two-filter culling component.

My thirty-five plus years of experience as a practicing lawyer have shown me that the most reliable way for the magic of justice to happen is by finding the key documents. You find the truth, the whole truth, and nothing but the truth, when you find the key documents needed to complete the picture of what happened and keep witnesses honest. In today’s information-flooded world, that can only happen if we use technology in a strategic manner to find relevant evidence quickly and inexpensively. The two-filter method makes it easier to do that. This almost 10,000-word article provides an explanation of how to do it that is accessible to beginners and eLeet alike.

I have been working to refine this irrelevant culling method since 2006. At that time I limited my practice to e-discovery and put aside my commercial and employment litigation practice. For more background on my personal views and opinions on e-discovery, and for a description of other document review methods that I have developed, not just two-filter culling, see the Pages above, and especially the About Page. This is one of the few (perhaps only) e-discovery blogs in existence that has always been independent of any law firm or vendor. There are no ads and no sponsors. It is all free, and, as I said before, these are my own independent views and opinions. That is the way I like it. A free exercise of First Amendment rights. I have written over a million words on e-Discovery in this manner, including five books. (Despite all of these words, I have still not attained my secret goal to appear on The Simpsons, although as you can see, I am ready anytime.)

This article contains a lengthy description of document culling, but still is not complete. My methods vary to adapt to the data and changing technologies. I share these methods to try to help all attorneys control the costs of document review and find the information needed to do justice. All too often these costs spiral out of control, or the review is done so poorly that key documents are not found. Both scenarios are obviously bad for our system of justice. We need cases to be decided on the merits, on the facts.

Hopefully my writings can help make that happen in some small way. Hopefully a more tech-savvy Bar can stem the tide of over-settlement that we have seen in the profession since the explosion of data began in the nineties. All too often cases are now decided on the basis of settlement value, not merits. As it now stands, way too many frivolous cases are filed hoping there will be some kind of payout. These cases tend to drown out the few with merit. Judges are overwhelmed and often do not have the time needed to get down to the nitty-gritty details of the truth.

Most of the time judges and juries are never given the chance to do their job. The cases all settle out instead. As a result, only one percent of federal civil cases actually go to trial. This is a big loss for society, and for the “trial lawyers” in our profession, a group I once prided myself on being a part of. Now I just focus on getting the facts from big data, to help keep the witnesses honest and get cases decided on the true facts, the evidence. Then I turn it over to the trial lawyers in my firm. They are then armed with the truth, the key documents, good or bad. The trial lawyers then put the best face possible on these facts, which hopefully is handsome to begin with. They argue how the law applies to these facts to seek a fair and just result for our clients. The disputed issues of fact are also argued, but based on the evaluation of the meaning of the key documents and the witness testimony.

Clarence Darrow and William Jennings Bryan

That is, in my opinion, how our system of justice is supposed to operate. It is certainly the way our legal system functioned when I learned to practice law and had my first trials back in 1980. Back then we only had a few thousand files to cull through to find the key documents, perhaps tens of thousands in a big case. Now we have hundreds of thousands of documents to cull through, millions in a big case. Still, even though the data volumes are far greater today, with the two-filter method described here, the few key documents needed to decide a case can be found.

Big Data today presents an opportunity for lawyers. There are electronic writings everywhere, and they can be hard to destroy. The large amount of ESI floating in cyberspace means that the truth is almost always out there. You just have to find it.


There is so much data that it is much more likely for key documents to exist than ever before. The digital trails that people leave today are much bigger than the paper trails of old. 

The fact that more truth is out there than ever before gives tech-savvy lawyers a great advantage. They have a much better chance than lawyers in the past ever did to find the documents needed to keep witnesses honest, or put more politely, to help refresh their memory. The flood of information can in this way improve the quality of justice. It all depends on our ability to find the truth from the massive quantities of irrelevant information available.

The more advanced culling methods described here, primarily the ones in the second filter that use predictive coding – AI-enhanced document ranking methods – are especially effective in culling the chaff from the wheat. They are especially effective in Big Data cases. I expect this kind of predictive analytics software to keep on improving. For that reason I am confident that we will continue to be able to find the core kernels of truth needed to do justice, no matter how much data we generate and save.

Some Software is Far Better than Others

One word of warning: although this method is software agnostic, in order to implement the two-filter method your document review software must have certain basic capabilities. That includes effective, and easy, bulk coding features for the first filter. This is the multimodal broad-based culling. Some of the multiple methods do not require software features, just attorney judgment, such as excluding custodians, but others do require software features, like domain searches or similarity searches. If your software does not have the features that will be discussed here for the first filter, then you probably should switch right away, but, for most, that will not be a problem. The multimodal culling methods used in the first filter are, for the most part, pretty basic.

Some of the software features needed to implement the second filter are, however, more advanced. The second filter works best when using predictive coding and probability ranking. You review the various strata of the ranked documents. The Second Filter can still be used with other, less advanced multimodal methods, e.g., keywords. Moreover, even when you use bona fide active machine learning software features, you continue to use a smattering of other multimodal search methods in the Second Filter. But now you do so not to cull, but to help find relevant and highly relevant documents to improve training. I do not rely on probability searches alone, although sometimes in the Second Filter I rely almost entirely on predictive-coding-based searches to continue the training.

If you are using software without AI-enhanced active learning features, then you are forced to use only other multimodal methods in the second filter, such as keywords. Warning: true active learning features are absent from most review software, or are very weak. That is true even with software that claims to have predictive coding features, but really just has dressed-up passive learning, i.e., concept searches with latent semantic indexing. You handicap yourself, and your client, by continuing to use such less expensive programs. Good software, like everything else, does not come cheap, but should pay for itself many times over if used correctly. The same comment goes for lawyers too.

First Filter – Keyword Collection Culling

Some first stage filtering takes place as part of the ESI collection process. Documents that do not pass the collection filter are preserved, but not collected or ingested into the review database. The most popular collection filter as of 2015 is still keyword filtering, even though this is very risky in some cases and inappropriate in many. Typically such keyword filtering is driven by vendor costs, to avoid processing and hosting charges.

Some types of collection filtering are appropriate and necessary, for instance, in the case of custodian filters, where you broadly preserve the ESI of many custodians, just in case, but only collect and review a few of them. It is, however, often inappropriate to use keywords to filter out the collection of ESI from admittedly key custodians. This is a situation where an attorney determines that a custodian’s data needs to be reviewed for relevant evidence, but does not want to incur the expense to have all of their ESI ingested into the review database. For that reason they decide to only review data that contains certain keywords.

I am not a fan of keyword filtered collections. The obvious danger of keyword filtering is that important documents may not have the keywords. Since they will not even be placed in the review platform, you will never know that the relevant ESI was missed. You have no chance of finding them.

See, e.g., William Webber’s analysis of the Biomet case, where this kind of keyword filtering was used before predictive coding began. What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13). Webber shows that in Biomet this method filtered out over 40% of the relevant documents in the First Filter. This doomed the Second Filter predictive coding review to a maximum possible recall of 60%, even if it was otherwise perfect, meaning it would otherwise have attained 100% recall, which never happens. The Biomet case very clearly shows the dangers of over-reliance on keyword filtering.
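The arithmetic behind that recall ceiling is simple, and worth keeping in mind whenever a lossy collection filter is proposed. A quick sketch using the rounded figures discussed above:

```python
# If a keyword collection filter discards 40% of the truly relevant documents,
# even a perfect second-filter review of what remains cannot exceed 60% recall.
share_of_relevant_lost_in_collection = 0.40
max_possible_recall = 1.0 - share_of_relevant_lost_in_collection
print(f"Maximum achievable recall after the lossy first filter: {max_possible_recall:.0%}")  # 60%
```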

Nevertheless, sometimes keyword collection may work, and may be appropriate. In some simple disputes, and with some data collections, obvious keywords may work just fine to unlock the truth. For instance, sometimes the use of names is an effective method to identify all, or almost all, documents that may be relevant. This is especially true in smaller and simpler cases. This method can often work in employment cases, especially where unusual names are involved. It becomes an even more effective method when the keywords have been tested. I just love it, for instance, when the plaintiff’s name is something like the famous Mister Mxyzptlk.

In some cases keyword collections may be as risky as in the complex Biomet case, but may still be necessary because of the proportionality constraints of the case. The law does not require unreasonably excessive search and review, and what is reasonable in a particular case depends on the facts of the case, including its value. See my many writings on proportionality, including my law review article Predictive Coding and Proportionality: A Marriage Made In Heaven, 26 Regent U. Law Review 1 (2013-2014). Sometimes you have to try for rough justice with the facts that you can afford to find given the budgetary constraints of the case.

The danger of missing evidence is magnified when the keywords are selected on the basis of educated guesses or just limited research. This technique, if you can call it that, is, sadly, still the dominant method used by lawyers today to come up with keywords. I have long thought it is equivalent to a child’s game of Go Fish. If keywords are dreamed up like that, as mere educated guesses, then keyword filtering is a high-risk method of culling out irrelevant data. There is a significant danger that it will exclude many important documents that do not happen to contain the selected keywords. No matter how good your predictive coding may be after that, you will never find these key documents.

If the keywords are not based on mere guessing, but are instead tested, then keyword filtering becomes a real technique that is less risky for culling. But how do you test possible keywords without first collecting and ingesting all of the documents to determine which keywords are effective? It is the old cart-before-the-horse problem.

One partial answer is that you could ask the witnesses, and do some partial reviews before collection. Testing and witness interviews are required by Judge Andrew Peck’s famous wake-up call case. William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co., 256 F.R.D. 134, 134, 136 (S.D.N.Y. 2009). I recommend that opinion often, as many attorneys still need to wake up about how to do e-discovery. They need to add ESI use, storage, and keyword questions to their usual new case witness interviews.

Interviews do help, but there is nothing better than actual hands-on reading and testing of the documents. This is what I like to call getting your hands dirty in the digital mud of the actual ESI collected. Only then will you know for sure the best way to mass-filter out documents. For that reason my strong preference in all significant size cases is to collect in bulk, and not filter out by keywords. Once you have documents in the database, you can then effectively screen them out by using parametric Boolean keyword techniques. See your particular vendor for the various ways to do that.

By the way, parametric is just a reference to the various parameters of a computer file that all good software allows you to search. You could search the text and all metadata fields, the entire document. Or you could limit your search to various metadata fields, such as date, prepared by, or the to and from in an email. Everyone knows what Boolean means, but you may not know all of the many variations that your particular software offers to create highly customized searches. While predictive coding is beyond the grasp of most vendors and case managers, the intricacies of keyword search are not. They can be a good source of information on keyword methods.
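To make “parametric Boolean” concrete, here is a toy sketch of the underlying idea: a Boolean combination of conditions applied only to selected parameters (fields) of each document. The field names and sample records are invented for illustration; your review platform's own query syntax will differ.

```python
from datetime import date

# Invented document records with typical metadata parameters.
docs = [
    {"id": 1, "from": "cfo@acme.com", "to": "ceo@acme.com",
     "sent": date(2012, 3, 5), "text": "Q1 revenue forecast and audit memo"},
    {"id": 2, "from": "newsletter@vendor.com", "to": "all@acme.com",
     "sent": date(2012, 4, 1), "text": "Monthly product newsletter"},
    {"id": 3, "from": "controller@acme.com", "to": "cfo@acme.com",
     "sent": date(2013, 1, 9), "text": "Revised audit schedule"},
]

def parametric_search(docs, *, text_any=(), from_contains=None, sent_after=None):
    """AND together the supplied parameters; text_any is an OR list of keywords."""
    hits = []
    for d in docs:
        if text_any and not any(k.lower() in d["text"].lower() for k in text_any):
            continue
        if from_contains and from_contains.lower() not in d["from"].lower():
            continue
        if sent_after and d["sent"] <= sent_after:
            continue
        hits.append(d["id"])
    return hits

# (audit OR forecast) AND from contains acme.com AND sent after 2012-01-01
print(parametric_search(docs, text_any=("audit", "forecast"),
                        from_contains="acme.com", sent_after=date(2012, 1, 1)))
# -> [1, 3]
```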

First Filter – Date Range and Custodian Culling

Even when you collect in bulk, and do not keyword filter before you put custodian ESI in the review database, in most cases you should filter for date range and custodian. It is often possible for an attorney to know, for instance, that no emails before or after a certain date could possibly be relevant. That is often not a highly speculative guessing game. It is reasonable to filter on this timeline basis before the ESI goes into the database. Whenever possible, try to get agreement on date range screening from the requesting party. You may have to widen it a little, but it is worth the effort to establish a line of communication and begin a cooperative dialogue.

The second thing to talk about is which custodians you are going to include in the database. You may put 50 custodians on hold, and actually collect the ESI of 25, but that does not mean you have to load all 25 into the database for review. Here your interviews and knowledge of the case should allow you to know who the key, key custodians are. You rank them by your evaluation of the likely importance of the data they hold to the facts disputed in the case. Maybe, for instance, in your evaluation you only need to review the mailboxes of 10 of the 25 collected.
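Here is a minimal sketch of what those two first-filter cuts look like in practice, assuming a simple listing of documents with custodian and date fields. The field names, date range, and custodian list are invented examples, not recommendations.

```python
from datetime import date

# Invented example: agreed date range and the custodians selected for review.
DATE_FROM, DATE_TO = date(2010, 1, 1), date(2014, 12, 31)
REVIEW_CUSTODIANS = {"Smith", "Jones", "Garcia"}   # e.g., 10 of 25 collected, abbreviated here

def first_filter(doc) -> bool:
    """Return True if the document survives the date-range and custodian cull."""
    in_range = DATE_FROM <= doc["date"] <= DATE_TO
    key_custodian = doc["custodian"] in REVIEW_CUSTODIANS
    return in_range and key_custodian

docs = [
    {"id": "A-1", "custodian": "Smith", "date": date(2011, 6, 2)},
    {"id": "A-2", "custodian": "Lee",   "date": date(2011, 6, 2)},   # culled: custodian not in scope
    {"id": "A-3", "custodian": "Jones", "date": date(2009, 2, 14)},  # culled: before the date range
]

survivors = [d["id"] for d in docs if first_filter(d)]
print(survivors)  # ['A-1']
```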

Again, disclose and try to work that out. The requesting party can reserve rights to ask for more; that is fine. They rarely do after production has been made, especially if you were careful and picked the right 10 to start with, and if you were careful during review to drop and add custodians based on what you see. If you are using predictive coding in the second filter stage, the addition or deletion of data mid-course is still possible with most software. It should be robust enough to handle such mid-course corrections. It may just slow down the ranking for a few iterations, that’s all.

First Filter – Other Multimodal Culling

There are many other bulk coding techniques that can be used in the first filter stage. This is not intended to be an exhaustive list. Like all complex tasks in the law, simple black letter rules are for amateurs. The law, which mirrors the real world, does not work like that. The same holds true for legal search. There may be many Gilbert’s-type books and articles on search, but they are just for 1L types. For true legal search professionals they are mere starting points. Use my culling advice here in the same manner. Use your own judgment to mix and match the right kind of culling tools for the particular case and data encountered. Every project is slightly different, even in the world of repeat litigation, like the employment law disputes where I currently spend much of my time.

Legal search is at its core a heuristic activity, but one that should be informed by science and technology. The knowledge triangle is a key concept for today’s effective e-Discovery Team. Although e-Discovery Teams should be led by attorneys skilled in evidence discovery, they should include scientists and engineers in some way. Effective team leaders should be able to understand and communicate with technology experts and information scientists. That does not mean all e-discovery lawyers need to become engineers and scientists too. That effort would likely diminish your legal skills, given the time demands involved. It just means you should know enough to work with these experts. That includes the ability to see through the vendor sales propaganda, and to incorporate the knowledge of the bona fide experts into your legal work.

One culling method that many overlook is file size. Some collections have thousands of very small files, just a few bytes, that are nothing but backgrounds, tiny images, or just plain empty space. They are too small to have any relevant information. Still, you need to be cautious and look out for very small emails, for instance, ones that just say “yes.” Depending on context, such a message could be relevant and important. But for most other types of very small files, there is little risk. You can go ahead and bulk code them irrelevant and filter them out.

Even more subtle is filtering out files based on their being very large. Sort your files by size, and then look at both ends, small and big. They may reveal certain files and file types that could not possibly be relevant. There is one more characteristic of big files that you should consider. Many of them have millions of lines of text. Big files are confusing to machine learning when, as is typical, only a few lines of the text are relevant, and the rest are just noise. That is another reason to filter them out, perhaps not entirely, but for special treatment and review outside of predictive coding. In other projects, where you have many large files like that and you need the help of AI ranking, you may want to hold them in reserve. You may only want to throw them into the ranking mix after your AI algorithms have acquired a pretty good idea of what you are looking for. A maturely trained system is better able to handle big noisy files.
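One way to act on this advice is to sort the collection by size and flag both extremes for judgmental sampling before any bulk coding, as in the sketch below. The size thresholds are arbitrary illustrations; pick your own after looking at the data, and the folder path is hypothetical.

```python
from pathlib import Path

TINY = 1_000           # bytes; likely empty shells, backgrounds, tiny images
HUGE = 25_000_000      # bytes; noisy giants that can confuse machine training

def size_buckets(root: str):
    """Sort files by size and return the candidates at both extremes."""
    files = sorted((p for p in Path(root).rglob("*") if p.is_file()),
                   key=lambda p: p.stat().st_size)
    tiny = [p for p in files if p.stat().st_size < TINY]
    huge = [p for p in files if p.stat().st_size > HUGE]
    return tiny, huge

if __name__ == "__main__":
    tiny, huge = size_buckets("./collection")  # hypothetical folder
    # Skim a few files from each bucket by eye before bulk coding or diverting them.
    print(f"{len(tiny)} very small files and {len(huge)} very large files flagged for sampling")
```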

File type is a well-known and often highly effective method to exclude large numbers of files of the same type after only looking at a few of them. For instance, there may be automatically generated database files, all of the same type. You look at a few to verify these databases could not possibly be relevant to your case, and then you bulk code them all irrelevant. There are many types of files like that in some data sets. The First Filter is all about being a smart gatekeeper.

File type is also used to eliminate, or at least divert, non-text files, such as audio files or most graphics. Since most Second Filter culling is going to be based on text analytics of some kind, there is no point in sending anything other than files with text into that filter. In some cases, and with some datasets, this may mean bulk coding them all irrelevant. This might happen, for instance, where you know that no music or other audio files, including voice messages, could possibly be relevant. We also see this commonly where we know that photographs and other images could not possibly be relevant. Exclude them from the review database.
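A sketch of that kind of file-type screening follows: text-bearing types are routed to the Second Filter, known non-text types are diverted (or bulk coded out where counsel has judged them categorically irrelevant), and anything unfamiliar is set aside for inspection. The extension lists are examples, not a complete taxonomy, and the folder path is hypothetical.

```python
from pathlib import Path

TEXT_TYPES = {".msg", ".eml", ".doc", ".docx", ".xls", ".xlsx",
              ".ppt", ".pptx", ".pdf", ".txt", ".htm", ".html"}
NON_TEXT_TYPES = {".mp3", ".wav", ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".mov", ".mp4"}

def route(path: Path) -> str:
    """Decide which track a file goes to: the text-analytics Second Filter,
    a diversion track for non-text review, or a pile for closer inspection."""
    ext = path.suffix.lower()
    if ext in TEXT_TYPES:
        return "second_filter"
    if ext in NON_TEXT_TYPES:
        return "diverted_non_text"   # review separately, or bulk code if clearly irrelevant
    return "inspect"                 # unknown type: look before you leap

if __name__ == "__main__":
    for p in Path("./collection").rglob("*"):  # hypothetical folder
        if p.is_file():
            print(p.name, "->", route(p))
```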

You must, however, be careful with all such gatekeeper activities, and never do bulk coding without some judgmental sampling first. Large unknown data collections can always contain a few unexpected surprises, no matter how many document reviews you have done before. Be cautious. Look before you leap. Skim a few of the ESI file types you are about to bulk code as irrelevant.

This directive applies to all First Filter activities. Never do it blind, on logic or principle alone. Get your hands in the digital mud. Do not over-delegate all of the dirty work to others. Do not rely too much on your contract review lawyers and vendors, especially when it comes to search. Look at the documents yourself and do not just rely on high level summaries. Every real trial lawyer knows the importance of that. The devil is always in the details. This is especially true when you are doing judgmental search. The client wants your judgment, not that of a less qualified associate, paralegal, or minimum wage contract review lawyer. Good lawyers remain hands-on, to some extent. They know the details, but are also comfortable with appropriate delegation to trained team members.

There is a constant danger of too much delegation in big data review. The lawyer signing the Rule 26(g) statement has a legal and ethical duty to closely supervise document review done in response to a request for production. That means you cannot just hire a vendor to do that, although you can hire outside counsel with special expertise in the field.

Some non-text file types will need to be diverted for different treatment than the rest of your text-based dataset. For instance, some of the best review software allows you to keyword search audio files. It is based on phonetics and waveforms. At least one company I know has had that feature since 2007. In some cases you will have to carefully review the image files, or at least certain kinds of them. Sorting based on file size and custodian can often speed up that exercise.

Remember, the goal is always efficiency, with caution, but not over-caution. The more experienced you get, the better you become at evaluating risks and knowing where you can safely take chances to bulk code, and where you cannot. Another thing to remember is that many image files have text in them too, such as in the metadata, or in ASCII transmissions. They are usually not important and do not provide good training for second-stage predictive coding.

Text can also be hidden in dead TIFF files, if they have not been OCRed. Scanned document TIFFs, for instance, may very well be relevant and deserve special treatment, including full manual review, but they may not show up in your review tool as text, because they have never been put through OCR text recognition.

Concept searches have only rarely been of great value to me, but they should still be tried out. Some software has better capabilities with concepts and latent semantic indexing than others. You may find it to be a helpful way to find groupings of obviously irrelevant, or relevant, documents. If nothing else, you can always learn something about your dataset from these kinds of searches.

Similarity searches of all kinds are among my favorites. If you find some file groups that cannot be relevant, find more like them. They are probably bulk irrelevant (or relevant) too. A similarity search, such as find every document that is 80% or more the same as this one, is often a good way to enlarge your carve-outs and thus safely improve your efficiency.
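For readers who want to see the idea in code, here is a crude version of a similarity search: a character-level ratio from Python's standard library, used to expand one confirmed example into its near-matches at an 80% threshold. Commercial near-duplicate detection is far more sophisticated; the sample documents are invented.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough 0.0-1.0 similarity between two document texts."""
    return SequenceMatcher(None, a, b).ratio()

def find_similar(seed_text: str, pool: dict, threshold: float = 0.80) -> list:
    """Return ids of pool documents at least `threshold` similar to the seed."""
    return [doc_id for doc_id, text in pool.items()
            if similarity(seed_text, text) >= threshold]

# Invented example: one confirmed-irrelevant IT notice, used to sweep in its near twins.
seed = "Reminder: the email server will be down for maintenance Saturday night."
pool = {
    "doc-101": "Reminder: the email server will be down for maintenance Sunday night.",
    "doc-102": "Q3 pricing strategy for the Acme contract negotiation.",
}
print(find_similar(seed, pool))  # ['doc-101']
```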

Another favorite of mine is domain culling of email. It is kind of like a spam filter. That is a great way to catch the junk mail, newsletters, and other purveyors of general mail that cannot possibly be relevant to your case. I have never seen a mail collection that did not have dozens of domains that could be eliminated. You can sometimes cull out as much as 10% of your collection that way, sometimes more when you start diving down into senders with otherwise safe domains. A good example of this is the IT department with their constant mass mailings, reminders and warnings. Many departments are guilty of this, and after examining a few, it is usually safe to bulk code them all irrelevant.
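A minimal sketch of domain culling: tally the sender domains, eyeball the most frequent ones, then bulk tag the obvious junk domains after sampling a few messages from each. The addresses and the junk-domain list are invented; only the counting-and-tagging pattern matters.

```python
from collections import Counter
from email.utils import parseaddr

# Invented sender addresses pulled from an email collection.
senders = [
    "newsletter@constantcontact.com", "alerts@linkedin.com",
    "it-notices@corp.example.com", "jsmith@corp.example.com",
    "newsletter@constantcontact.com",
]

def domain(address: str) -> str:
    return parseaddr(address)[1].rpartition("@")[2].lower()

# Step 1: rank domains by volume and review the top of the list by eye.
counts = Counter(domain(s) for s in senders)
print(counts.most_common())

# Step 2: after sampling a few messages from each, bulk code the junk domains irrelevant.
JUNK_DOMAINS = {"constantcontact.com", "linkedin.com"}   # a judgment call, case by case
culled = [s for s in senders if domain(s) in JUNK_DOMAINS]
print(f"Bulk coded irrelevant: {len(culled)} of {len(senders)} messages")
```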

Second Filter – Predictive Culling and Coding

The second filter begins where the first leaves off. The ESI has already been purged of unwanted custodians, date ranges, spam, and other obviously irrelevant files and file types. Think of the First Filter as a rough, coarse filter, and the Second Filter as fine-grained. The Second Filter requires a much deeper dive into file contents to cull out irrelevance. The most effective way to do that is to use predictive coding, by which I mean active machine learning, supplemented somewhat by a variety of methods to find good training documents. That is what I call a multimodal approach, one that places primary reliance on the Artificial Intelligence at the top of the search pyramid. If you do not have an active machine learning type of predictive coding with ranking abilities, you can still do fine-grained Second Filter culling, but it will be harder, and probably less effective and more expensive.

Pyramid Search diagram
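To make the probability ranking at the heart of the Second Filter concrete, here is a bare-bones stand-in using a TF-IDF text model and logistic regression from scikit-learn. Real predictive coding platforms are far more sophisticated than this; the sketch only shows the basic mechanic of training on attorney-coded examples and ranking the rest of the pool by probability of relevance. All documents and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented training examples coded by the attorney (1 = relevant, 0 = irrelevant).
train_texts = [
    "side agreement on rebate terms with the distributor",        # relevant
    "draft indemnification clause for the distributor contract",  # relevant
    "office holiday party signup sheet",                          # irrelevant
    "IT notice: password reset required by Friday",               # irrelevant
]
train_labels = [1, 1, 0, 0]

# The unreviewed pool left after First Filter culling (also invented).
pool = [
    "revised rebate schedule attached for your review",
    "cafeteria menu for next week",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# Rank the pool by predicted probability of relevance, highest first.
probs = model.predict_proba(vectorizer.transform(pool))[:, 1]
for text, p in sorted(zip(pool, probs), key=lambda t: -t[1]):
    print(f"{p:.2f}  {text}")
```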

All kinds of Second Filter search methods should be used to find highly relevant and relevant documents for AI training. Stay away from any process that uses just one search method, even if that one method is predictive ranking. Stay far away if the one method is rolling dice. Reliance on random chance alone has been proven to be an inefficient and ineffective way to select training documents. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Parts One, Two, Three and Four. No one should be surprised by that.

The first round of training begins with the documents reviewed and coded relevant incidental to the First Filter coding. You could also defer the first round until you have done more active searches for relevant and highly relevant documents from the pool remaining after First Filter culling. In that case you also include irrelevant documents in the first training round, which is also important. Note that even though the first round of training is the only round of training that has a special name – seed set – there is nothing all that important or special about it. All rounds of training are important.

There is so much misunderstanding about that, and about seed sets, that I no longer like to even use the term. The only thing special in my mind about the first round of training is that it is sometimes a very large training set. That happens when the First Filter turns up a large number of relevant files, or when they are otherwise known and coded before the Second Filter training begins. The sheer volume of training documents in many first rounds thus makes them special, not the fact that the training came first.

No good predictive coding software is going to give special significance to a training document just because it came first in time. (It might if it uses a control set, but that is a different story, which I will probably explain in next month’s blog.) The software I use has no trouble at all disregarding any early training if it later finds that it is inconsistent with the total training input. It is, admittedly, somewhat aggravating to have a machine tell you that your earlier coding was wrong. But I would rather have an emotionless machine tell me that, than another gloating attorney (or judge), especially when the computer is correct, which is often (not always) the case.

That is, after all, the whole point of using good software with artificial intelligence. You do that to enhance your own abilities. There is no way I could attain the level of recall I have been able to manage lately in large document review projects by reliance on my own, limited intelligence alone. That is another one of my search and review secrets. Get help from a higher intelligence, even if you have to create it yourself by following proper training protocols.

Privacy Issues

Maybe someday the AI will come prepackaged, and not require training, or at least very little training. I know it can be done, especially if other data analytics techniques are used. I am working on this project now. In addition to the technical issues, there are serious ethical concerns as well, including especially employee privacy concerns. See: Should Lawyers Be Big Data Cops? The implications for the law of predictive misconduct are tremendous. I am now focusing my time and resources accordingly.

Information governance in general is something that concerns me, and is another reason I hold back on Presuit. Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part One and Part Two. Also see: e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million – Part Two. I do not want my information governed by a Big Brother state, even assuming that’s possible. I want it secured, protected, and findable, but only by me, unless I give my express written assent (no contracts of adhesion permitted). By the way, even though I am cautious, I see no problem in requiring that consent as a condition of employment, so long as it is reasonable in scope and limited to only business communications.

I am wary of Big Brother emerging from Big Data. You should be too. I want AIs under our own individual control, where they each have a real big off switch. That is the way it is now with legal search, and I want it to stay that way. I want the AIs to remain under my control, not vice versa. Not only that, like all Europeans, I want a right to be forgotten by AIs and humans alike.

But wait, there’s still more to my vision of a free future, one where the ideals of freedom and liberty triumph. I want AIs smart enough to protect individuals from governments, all governments, including the Obama administration. His DOJ has continued the disgraceful acts of the Bush Administration in ignoring the Constitutional prohibition against General Warrants. See: Fourth Amendment to the U.S. Constitution. Now that Judge Facciola has retired, who on the federal D.C. bench is brave enough to protect us? See: Judge John Facciola Exposes Justice Department’s Unconstitutional Search and Seizure of Personal Email.

Perhaps quantum entanglement encryption is the ultimate solution? See, e.g., Entangled Photons on Silicon Chip: Secure Communications & Ultrafast Computers, The Hacker News, 1/27/15. Truth is far stranger than fiction. Quantum physics may seem irrational, but it has repeatedly been proven true. The fact that it may seem irrational for two entangled particles to be correlated instantly over any distance just means that our sense of reason is not keeping up. There may soon be spooky ways for private communications to be forever private.


At the same time that I want unentangled freedom and privacy, I want a government that can protect us from crooks, crazies, foreign governments, and black hats. I just do not want to give up my Constitutional rights to receive that protection. We should not have to trade privacy for security. That is a false choice. Once we lay down our Constitutional rights in the name of security, the terrorists have already won.

Getting back to legal search, and how to find out what you need to know by using the latest AI-enhanced search methods, there are three kinds of probability-ranked search engines now in use for predictive coding.

Three Kinds of Second Filter Probability-Based Search Engines

After the first round of training (really after the first document is coded in software with continuous active training), you can begin to harness the AI features in your software. You can begin to use its probability ranking to find relevant documents. There are currently three kinds of ranking search and review strategies in use: uncertainty, high probability, and random. The uncertainty search, sometimes called SAL for Simple Active Learning, looks at middle-ranking documents where the software is unsure of relevance, typically the 40%-60% range. The high probability search looks at documents where the AI is confident of its prediction, one way or the other, relevant or irrelevant. You can also use some random searches, if you want, both simple and judgmental, just be careful not to rely too much on chance.
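Once every document carries a probability score, the three selection strategies are easy to express, as in the sketch below. The scores are invented, and the 40%-60% uncertainty band and batch size are illustrative settings of the kind mentioned above, not fixed rules.

```python
import random

# Invented (doc_id, probability_of_relevance) pairs from the current ranking.
ranked = [("d1", 0.99), ("d2", 0.97), ("d3", 0.55), ("d4", 0.48),
          ("d5", 0.42), ("d6", 0.12), ("d7", 0.03)]
BATCH = 2

# High probability (CAL-style): the top-ranked probable relevant documents.
high_probability = sorted(ranked, key=lambda r: -r[1])[:BATCH]

# Uncertainty (SAL-style): documents in the middle band where the machine is unsure.
uncertain = [r for r in ranked if 0.40 <= r[1] <= 0.60][:BATCH]

# Random (use sparingly, if at all): simple chance selection.
random.seed(42)
random_pick = random.sample(ranked, BATCH)

print("high probability:", high_probability)
print("uncertainty:", uncertain)
print("random:", random_pick)
```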

The 2014 Cormack Grossman comparative study of various methods has shown that the high probability search, which they called CAL, for Continuous Active Learning using high ranking documents, is very effective. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Also see: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two.

My own experience also confirms their experiments. High probability searches usually involve SME training and review of the upper strata, the documents with a 90% or higher probability of relevance. The exact percentage depends on the number of documents involved. I may also check out the low strata, but will not spend very much time on that end. I like to use both uncertainty and high probability searches, but typically with a strong emphasis on the high probability searches. And again, I supplement these ranking searches with other multimodal methods, especially when I encounter strong, new, or highly relevant type documents.

Sometimes I will even use a little random sampling, but the mentioned Cormack Grossman study shows that it is not effective, especially on its own. They call such chance-based search Simple Passive Learning, or SPL. Ever since reading the Cormack Grossman study I have cut back on my reliance on any random searches. You should too. It was small before, it is even smaller now. This does not mean sampling does not still have a place in document review. It does, but in quality control, not in selection of training documents. See, e.g., Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search.

Irrelevant Training Documents Are Important Too

In the Second Filter you are on a search for the gold, the highly relevant, and, to a lesser extent, the strong and merely relevant. As part of this Second Filter search you will naturally come upon many irrelevant documents too. Some of these documents should also be added to the training. In fact, it is not uncommon to have more irrelevant documents in training than relevant, especially with low prevalence collections. If you judge a document, then go ahead and code it and let the computer know your judgment. That is how it learns. There are some documents that you judge that you may not want to train on – such as the very large, or very odd – but they are few and far between.

Of course, if you have culled out a document altogether in the First Filter, you do not need to code it, because these documents will not be part of the documents included in the Second Filter. In other words, they will not be among the documents ranked in predictive coding. They will either be excluded from possible production altogether as irrelevant, or will be diverted to a non-predictive coding track for final determinations. The latter is the case for non-text file types like graphics and audio in cases where they might have relevant information.

How To Do Second Filter Culling Without Predictive Ranking

When you have software with active machine learning features that allow you to do predictive ranking, then you find documents for training, and from that point forward you incorporate ranking searches into your review. If you do not have such features, you still sort out documents in the Second Filter for manual review; you just do not use ranking with SAL and CAL to do so. Instead, you rely on keyword selections, enhanced with concept searches and similarity searches.

When you find an effective parametric Boolean keyword combination, which is done by a process of party negotiation, testing, educated guessing, trial and error, and judgmental sampling, you submit the documents containing proven hits to full manual review. Ranking by keywords can also be tried for document batching, but be careful of large files racking up many keyword hits just on the basis of file size, not relevance. Some software compensates for that, but most does not. So ranking by keywords can be a risky process.
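If you do try ranking by keyword hits, one simple safeguard against that file-size bias is to normalize hit counts by document length, for example hits per thousand words rather than raw hit counts. A sketch, with invented documents and keywords:

```python
def hits_per_thousand_words(text: str, keywords: set) -> float:
    """Keyword hits normalized by document length, so big files do not win by bulk alone."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip('.,;:()"') in keywords)
    return 1000 * hits / len(words)

keywords = {"rebate", "indemnification"}
docs = {
    "short-memo": "Rebate terms and indemnification language attached.",
    "huge-report": " ".join(["boilerplate"] * 5000 + ["rebate"] * 3),  # big file, few real hits
}

for doc_id, text in docs.items():
    print(doc_id, round(hits_per_thousand_words(text, keywords), 1))
# The short memo scores far higher per word, even though the big file has more raw hits.
```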

I am not going to go into detail on the old-fashioned ways of batching out documents for manual review. Most e-discovery lawyers already have a good idea of how to do that. So too do most vendors. Just one word of advice: when you start the manual review based on keyword or other non-predictive coding processes, check in daily on the contract reviewers’ work and calculate what kind of precision the various keyword and other assignment folders are creating. If it is terrible, which I would say is less than 50% precision, then I suggest you try to improve the selection matrix. Change the Boolean, or keywords, or something. Do not just keep plodding ahead and wasting client money.
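Tracking that daily precision number is simple arithmetic: of the documents reviewed in each assignment folder, what fraction did the reviewers code relevant? A sketch with invented folder tallies:

```python
# Invented daily tallies per assignment folder: (documents reviewed, coded relevant).
folders = {
    "keyword-set-A": (500, 9),     # ~2% precision: fix the selection matrix
    "keyword-set-B": (400, 180),   # 45% precision: marginal
    "predictive-top": (300, 250),  # ~83% precision: keep going
}

for name, (reviewed, relevant) in folders.items():
    precision = relevant / reviewed
    flag = "REWORK" if precision < 0.50 else "ok"
    print(f"{name}: {precision:.0%} precision ({relevant}/{reviewed}) {flag}")
```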

I once took over a review project that was using negotiated, then tested and modified keywords. After two days of manual review we realized that only 2% of the documents selected for review by this method were relevant. After I came in and spent three days with training to add predictive ranking we were able to increase that to 80% precision. If you use these multimodal methods, you can expect similar results.

Review of Basic Idea of Two Filter Search and Review

Whether you use predictive ranking or not, the basic idea behind the two-filter method is to start with a very large pool of documents, reduce the size by a coarse First Filter, then reduce it again by a much finer Second Filter. The result should be a much, much smaller pool that is human reviewed, and an even smaller pool that is actually produced or logged. Of course, some of the documents subject to the final human review may be overturned, that is, found to be irrelevant, False Positives. That means they will not make it into the very bottom production pool shown in the diagram at right after manual review.

In multimodal projects where predictive coding is used, the precision rates can often be very high. Lately I have been seeing that the second pool of documents, the one subject to manual review, has precision rates of at least 80%, sometimes even as high as 95% near the end of a CAL project. That means the final pool of documents produced is almost as large as the pool remaining after the Second Filter.

Please remember that almost every document that is manually reviewed and coded after the Second Filter gets recycled back into the machine training process. This is known as Continuous Active Learning, or CAL, and in my version of it at least, it is multimodal and not limited to only high probability ranking searches. See: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two. In some projects you may just train for multiple iterations and then stop training and transition to pure manual review, but in most you will want to continue training as you do manual review. Thus you set up a CAL constant feedback loop until you are done, or nearly done, with manual review.


As mentioned, active machine learning trains on both relevance and irrelevance, although, in my opinion, the Highly Relevant documents, the hot documents, are the most important of all for training purposes. The idea is to use predictive coding to segregate your data into two separate camps, relevant and irrelevant. You not only separate them, but you also rank them according to probable relevance. The software I normally use, Kroll Ontrack’s EDR, has a percentage system from .01% to 99.9% probable relevant, and vice versa. A very good segregation-ranking project should end up looking like an upside down champagne glass.


A near-perfect segregation-ranking project will end up looking like an upside down T, with even fewer documents in the unsure middle section. If you turn the graphic so that the lowest probability relevant ranked documents are on the left, and the highest probable relevant on the right, a near-perfect project ranking looks like this standard bar graph:


The above is a screenshot from a recent project I did after training was complete. This project had about a 4% prevalence of relevant documents, so it made sense for the relevant half to be far smaller. But what is striking about the data stratification is how polarized the groupings are. This means the ranking distribution separation, relevant and irrelevant, is very well formed. There are an extremely small number of documents where the AI is unsure of classification. The slow curving shape of irrelevant probability on the left (or the bottom of my upside down champagne glass) is gone.

The visualization shows a much clearer and complete ranking at work. The AI is much more certain about what documents are irrelevant. To the right is a screenshot of the table form display of this same project in 5% increments. It shows the exact numerics of the probability distribution in place when the machine training was completed. This is the most pronounced polar separation I have ever seen, which shows that my training on relevancy has been well understood by the machine.

After you have segregated the document collection into two groups, and gone as far as you can, or as far as your budget allows, then you cull out the probable irrelevant. The most logical place for the Second Filter cut-off point in most projects is at the 49.9% and less probable relevant mark. Those are the documents that are more likely than not to be irrelevant. But do not take the 50% plus dividing line as an absolute rule in every case. There are no hard and fast rules to predictive culling. In some cases you may have to cut off at 90% probable relevant. Much depends on the overall distribution of the rankings and the proportionality constraints of the case. Like I said before, if you are looking for Gilbert’s black-letter law solutions to legal search, you are in the wrong type of law.
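The bookkeeping behind that decision is straightforward: bucket the ranked documents into probability bands, like the 5% increment table described above, and count what falls above and below whatever cut-off the case can justify. The scores in this sketch are invented, and the 50% cut-off is just the common starting point discussed here.

```python
from collections import Counter

# Invented probability-of-relevance scores for a ranked collection.
scores = [0.001, 0.02, 0.03, 0.18, 0.44, 0.51, 0.72, 0.93, 0.97, 0.999]
CUTOFF = 0.50

# Bucket into 5% bands, e.g. '95-100%' or '0-5%'.
bands = Counter(f"{int(s * 100) // 5 * 5}-{int(s * 100) // 5 * 5 + 5}%" for s in scores)
for band, n in sorted(bands.items(), key=lambda b: int(b[0].split('-')[0])):
    print(band, n)

above = sum(s >= CUTOFF for s in scores)
print(f"{above} documents at or above the {CUTOFF:.0%} cut-off, {len(scores) - above} culled below it")
```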

Almost all of the documents in the production set (the red top half of the diagram) will be reviewed by a lawyer or paralegal. Of course, there are shortcuts to that too, like duplicate and near-duplicate syncing. Some of the irrelevant, low-ranked documents will have been reviewed too. That is all part of the CAL process, where both relevant and irrelevant documents are used in training. If all goes well, however, only a few of the very low percentage probable relevant documents will be reviewed.

Limiting Final Manual Review

In some cases you can, with client permission (often insistence), dispense with attorney review of all or nearly all of the documents in the upper half. You might, for instance, stop after the manual review has attained a well-defined and stable ranking structure. You might have reviewed only 10% of the probable relevant documents (the top half of the diagram), but decide to produce the other 90% of the probable relevant documents without attorney eyes ever looking at them. There are, of course, obvious problems with privilege and confidentiality in such a strategy. Still, in some cases, where appropriate clawback and other confidentiality orders are in place, the client may want to risk disclosure of secrets to save the costs of final manual review.

In such productions there are also dangers of imprecision, where a significant percentage of irrelevant documents are included. This in turn raises concerns that an adversary’s view of those other documents could engender other suits, even if there is some agreement for the return of irrelevant documents. Once the bell has been rung, privileged or hot, it cannot be un-rung.

Case Example of Production With No Final Manual Review

In spite of the dangers of the unringable bell, the allure of extreme cost savings can be strong to some clients in some cases. For instance, I did one experiment using multimodal CAL with no final review at all, where I still attained fairly high recall, and the cost per document was only seven cents. I did all of the review myself acting as the sole SME. The visualization of this project would look like the below figure.


Note that if the SME review pool were drawn to scale according to number of documents read, then, in most cases, it would be much smaller than shown. In the review where I brought the cost down to $0.07 per document I started with a document pool of about 1.7 Million, and ended with a production of about 400,000. The SME review pool in the middle was only 3,400 documents.

As far as legal search projects go, it was an unusually high prevalence collection, and thus the production of 400,000 documents was very large. See: Army of One: Multimodal Single-SME Approach To Machine Learning. Four hundred thousand was the number of documents ranked with a 50% or higher probable relevance when I stopped the training. I only personally reviewed about 3,400 documents during the SME review. I then went on to review another 1,745 documents after I decided to stop training, but did so only for quality assurance purposes and using a random sample. To be clear, I worked alone, and no one other than me reviewed any documents. This was an Army of One type project.

Although I only personally reviewed 3,400 documents for training, I actually instructed the machine to train on many more documents than that. I just selected them for training without actually reviewing them first. I did so on the basis of ranking and judgmental sampling of the ranked categories. It was somewhat risky, but it did speed up the process considerably, and in the end it worked out very well. I later found out that other information scientists often use this technique as well. See, e.g., Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 9.

My goal in this project was recall, not precision, nor even F1, and I was careful not to over-train on irrelevance. The requesting party was much more concerned with recall than precision, especially since the relevancy standard here was so loose. (Precision was still important, and was attained too. Indeed, there were no complaints about that.) In situations like that the slight over-inclusion of relevant training documents is not terribly risky, especially if you check out your decisions with careful judgmental sampling, and quasi-random sampling.

I accomplished this review in two weeks, spending 65 hours on the project. Interestingly, my time broke down into 46 hours of actual document review time, plus another 19 hours of analysis. Yes, about one hour of thinking and measuring for every two and a half hours of review. If you want the secret of my success, that is it.

I stopped after 65 hours, and two weeks of calendar time, primarily because I ran out of time. I had a deadline to meet and I met it. I am not sure how much longer I would have had to continue the training before the training fully stabilized in the traditional sense. I doubt it would have been more than another two or three rounds; four or five more rounds at most.

Typically I have the luxury to keep training in a large project like this until I no longer find any significant new relevant document types, and do not see any significant changes in document rankings. I did not think at the time that my culling out of irrelevant documents had been ideal, but I was confident it was good, and certainly reasonable. (I had not yet uncovered my ideal upside down champagne glass shape visualization.) I saw a slow down in probability shifts, and thought I was close to the end.

I had completed a total of sixteen rounds of training by that time. I think I could have improved the recall somewhat had I done a few more rounds of training, and spent more time looking at the mid-ranked documents (40%-60% probable relevant). The precision would have improved somewhat too, but I did not have the time. I am also sure I could have improved the identification of privileged documents, as I had only trained for that in the last three rounds. (It would have been a partial waste of time to do that training from the beginning.)

The sampling I did after the decision to stop suggested that I had exceeded my recall goals, but still, the project was much more rushed than I would have liked. I was also comforted by the fact that the elusion sample test at the end passed my accept-on-zero-error quality assurance test. I did not find any hot documents. For those reasons (plus great weariness with the whole project), I decided not to pull some all-nighters to run a few more rounds of training. Instead, I went ahead and completed my report, added graphics and more analysis, and made my production with a few hours to spare.

A scientist hired after the production did some post hoc testing that confirmed, at an approximate 95% confidence level, a recall achievement of between 83% and 94%. My work also withstood all subsequent challenges. I am not at liberty to disclose further details.

In post hoc analysis I found that the probability distribution was close to the ideal shape that I now know to look for. The below diagram represents an approximate depiction of the ranking distribution of the 1.7 million documents at the end of the project. The 400,000 documents produced (obviously I am rounding off all these numbers) were ranked 50% plus, and the 1,300,000 not produced were ranked less than 50%. Of the 1,300,000 Negatives, 480,000 documents were ranked with only 1% or less probable relevance. On the other end, the high side, 245,000 documents had a probable relevance ranking of 99% or more. There were another 155,000 documents with a ranking between 99% and 50% probable relevant. Finally, there were 820,000 documents ranked between 1% and 49% probable relevant.
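The rounded figures reported above hang together, as a quick check shows:

```python
# Figures from the project description above (all rounded).
produced_99_plus   = 245_000   # 99% or higher probable relevant
produced_50_to_99  = 155_000   # between 50% and 99% probable relevant
negative_1_to_49   = 820_000   # between 1% and 49% probable relevant
negative_1_or_less = 480_000   # 1% or less probable relevant

produced = produced_99_plus + produced_50_to_99
negatives = negative_1_to_49 + negative_1_or_less
print(produced)              # 400,000 documents produced
print(negatives)             # 1,300,000 documents not produced
print(produced + negatives)  # 1,700,000 documents in the ranked pool
```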


The file review speed realized here, about 35,000 files per hour, and the extremely low cost of about $0.07 per document, would not have been possible without the client’s agreement to forgo full document review of the 400,000 documents produced. A group of contract lawyers could have been brought in for second pass review, but that would have greatly increased the cost, even assuming a billing rate for them of only $50 per hour, which was 1/10th my rate at the time (it is now much higher).

The client here was comfortable with reliance on confidentiality agreements for reasons that I cannot disclose. In most cases litigants are not, and insist on eyes on review of every document produced. I well understand this, and in today’s harsh world of hard ball litigation it is usually prudent to do so, clawback or no.

Another reason the review was so cheap and fast in this project is that there were very few opposing counsel transactional costs involved, and everyone was hands off. I just did my thing, on my own, and with no interference. I did not have to talk to anybody; I just read a few guidance memorandums. My task was to find the relevant documents, make the production, and prepare a detailed report – 41 pages, including diagrams – that described my review. Someone else prepared a privilege log for the 2,500 documents withheld on the basis of privilege.

I am proud of what I was able to accomplish with the two-filter multimodal methods, especially as the work was subject to the mentioned post-review analysis and recall validation. But, as mentioned, I would not want to do it again. Working alone like that was very challenging and demanding. Further, it was only possible at all because I happened to be a subject matter expert on the type of legal dispute involved. There are only a few fields where I am competent to act alone as an SME. Moreover, virtually no legal SMEs are also experienced ESI searchers and software power users. In fact, most legal SMEs are technophobes. I have even had to print out key documents to paper to work with some of them.

Even if I have adequate SME abilities on a legal dispute, I now prefer a small team approach, rather than a solo approach. I now prefer to have one or two attorneys assisting me with the document reading, and a couple more assisting me as SMEs. In fact, I can act as the conductor of a predictive coding project where I have very little or no subject matter expertise at all. That is not uncommon. I just work as the software and methodology expert; the Experienced Searcher.

Recently I worked on a project where I did not even speak the language used in most of the documents. I could not read most of them, even if I tried. I just worked on procedure and numbers alone. Others on the team got their hands in the digital mud and reported to me and the SMEs. This works fine if you have good bilingual SMEs and contract reviewers doing most of the hands-on work.


There is much more to efficient, effective review than just using software with predictive coding features. The methodology of how you do the review is critical. The two-filter method described here has been used for years to cull away irrelevant documents before manual review, but it has typically just been used with keywords. I have shown in this article how this method can be employed in a multimodal manner that includes predictive coding in the Second Filter.

Keywords can be an effective method to both cull out presumptively irrelevant files, and cull in presumptively relevant, but keywords are only one method, among many. In most projects it is not even the most effective method. AI-enhanced review with predictive coding is usually a much more powerful method to cull out the irrelevant and cull in the relevant and highly relevant.

If you are using a one-filter method, where you just do a rough cut and filter out by keywords, date, and custodians, and then manually review the rest, you are reviewing too much. It is especially ineffective when you collect based on keywords. As shown in Biomet, that can doom you to low recall, no matter how good your later predictive coding may be.

If you are using a two-filter method, but are not using predictive coding in the second filter, you are still reviewing too much. The two-filter method is far more effective when you use relevance probability ranking to cull out documents from final manual review.
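To make the contrast concrete, here is a minimal sketch, in Python, of what the two culling passes described above might look like in the abstract. Everything in it is hypothetical: the field names, the junk keyword list, the date range, and the 0.5 probability cutoff are placeholders of my own, and a real project would rely on the review platform’s own ranking tools and attorney-validated thresholds, not a script like this.

```python
# Minimal illustrative sketch only (not any vendor's API). Assumes each
# document is a dict with hypothetical "text", "date", and machine-assigned
# "relevance_prob" fields; keyword list, date range, and threshold are
# placeholders an attorney would set and validate in a real project.

from datetime import date
from typing import Dict, List, Tuple

IRRELEVANT_KEYWORDS = {"fantasy football", "viagra"}   # illustrative cull-out terms
DATE_RANGE = (date(2010, 1, 1), date(2014, 12, 31))    # illustrative relevant period

def first_filter(docs: List[Dict]) -> List[Dict]:
    """Rough cut: drop documents outside the date range or hitting
    presumptively irrelevant keywords (after deduplication/deNisting)."""
    kept = []
    for d in docs:
        in_range = DATE_RANGE[0] <= d["date"] <= DATE_RANGE[1]
        hits_junk = any(k in d["text"].lower() for k in IRRELEVANT_KEYWORDS)
        if in_range and not hits_junk:
            kept.append(d)
    return kept

def second_filter(docs: List[Dict], threshold: float = 0.5) -> Tuple[List[Dict], List[Dict]]:
    """Probability-ranking cull: documents at or above the threshold go to
    manual review; the rest are bulk-coded irrelevant (and sampled for QC)."""
    review = [d for d in docs if d["relevance_prob"] >= threshold]
    culled = [d for d in docs if d["relevance_prob"] < threshold]
    return review, culled

# Example run on three toy documents
docs = [
    {"id": 1, "date": date(2012, 3, 5), "text": "merger pricing discussion", "relevance_prob": 0.92},
    {"id": 2, "date": date(2012, 3, 6), "text": "fantasy football picks", "relevance_prob": 0.10},
    {"id": 3, "date": date(2009, 1, 1), "text": "old newsletter", "relevance_prob": 0.40},
]
review_set, culled_set = second_filter(first_filter(docs))
print(len(review_set), "for manual review;", len(culled_set), "bulk-coded irrelevant")
```

The point of the sketch is simply the order of operations: engineering and keyword culling first, probability-ranked culling second, with only the surviving set going on to eyes-on review.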


e-Disco News, Knowledge and Humor: What’s Happening Today and Likely to Happen Tomorrow

June 7, 2015

My monthly blogs seem to be getting too heavy, even for me, so this month I am going to try to change and resort to e-discovery gossip and cheap laughs. I’m hoping that even Spock himself would smile.

But first, a little introspective musing. In February of this year, after nine years of writing a weekly blog, I switched to monthly. Since then my blogs have not only been long, complex and difficult, which I warned you would happen, but also a tad serious and intellectual. That was never my intent, but it just turned out that way. For instance, my first monthly blog in March started harmlessly enough with a fantasy about time travel and a hack of the NSA, then morphed into a detailed outline and slide show on how to do a predictive coding project. Heavy, some might even say boring; well, at least the second half. My next blog was my all-time deepest writing ever, where I explained my new intellectual paradigm, Information → Knowledge → Wisdom. I really do hope as many people as possible will read it. It is intended to be insightful, not necessarily entertaining, and certainly not light reading. It went beyond e-discovery and law and ventured into general social commentary.

In last month’s blog I shared a moment of ZEN, but the moment was filled with math and metrics, not bliss. That’s because in my bizarro world ZEN now means Zero Error Numerics and is designed for quality control in legal search and document review, not Enlightenment. The focus in that blog was on seventeen skills that must be learned to master the ZEN of document review, including concentration. As if it were not bad enough to share deep knowledge instead of fun facts, I even included links to wisdom words with quotes of Zen Masters, old and new. I also mentioned the new trend in corporate America, especially Silicon Valley, of meditation and mindfulness. That was a heavy blog indeed; even the name was way too long: Introducing a New Website, a New Legal Service, and a New Way of Life / Work; Plus a Postscript on Software Visualization and Thanks to Kroll Ontrack.

The response from most of you, my dear readers, to last month’s blog reminded me of the sound of one hand clapping, or, as I will explain later, the pauses after Craig Ball’s jokes at his keynote in London last month. Still, last month’s blog did at least provoke an enthusiastic response from all Krollites. I have to concede, however, that this could be a result of my mention of, and sincere thanks to, Kroll Ontrack in the Postscript on Data Visualization at the end of the blog, rather than any great fascination on Kroll’s part with ZEN. Still, I may go with KO next year to teach predictive coding in Tokyo, and even visit Kyoto, so their interest in stages beyond mere information may well be sincere. See: Information → Knowledge → Wisdom: Progression of Society in the Age of Computers.

This month, with my goal to amuse and make even Spock smile, my blog will focus on information, name-dropping and insider references. Some knowledge will be thrown in too, of course, because, after all, that is the whole point of information. Information is never an end in itself, or at least it should not be. A dash of wisdom may also be thrown in, but, I promise, I will wrap it in humor and sneak it by with vague allusions. No more Zen Master quotes, not even Steve Jobs. Hopefully you will not even notice the wise-guy comments, and may even suspect, falsely of course, that you are none the wiser for reading all this bull.

LTN Finalist for Innovative CIO of the Year

I will start this newsy blog off on a personal note about my surprise nomination for an honest-to-God award. No, it has nothing to do with ZEN or document review competitions (ahem – never did get an award for that). It has to do with innovation. Me and new ideas. Imagine that. Unlike former government guru and award-laden Jason R. Baron, now IG champion of the World after his recent trashing of me in London, I have never won an award (I don’t count my third-grade spelling bee) (imagine very small violins playing now). I still have not won an award, mind you, but I have at least now been nominated and qualified as one of three finalists in the Legaltech News Innovation Award 2015. For losers like me, just getting a third-place mention is a big deal. Sad, huh? The award is supposed to recognize “outstanding achievement by legal professionals in their use of technology.”


Thank you, dear readers, for nominating and voting for me to receive this award. The award category I am in is a bit odd (for me at least), Chief Information Officer, but apparently that is the only one that someone like me could be crammed into. The three finalists in each Innovation category are determined by open voting by LTN magazine subscribers and through LTN’s website. So again, thanks to all of you who voted, especially my family and paid voters in Eastern Europe (they work cheap). The final winner among the finalists in each category is, according to LTN, chosen by “a panel of judges comprised of members of Legaltech News’ editorial staff.” Uh-oh.

Congratulations to all who made it as finalists, and good luck to one and all. There were many vendor categories too, aside from the law firm ones listed in the chart. I list all the vendor categories and finalists below. I have heard of most of them, and know a few very well. But to be honest, I had never heard of many of these vendors, which, no doubt, is what most law firm CIOs are now saying about me. This is an informative list, so I suggest you take time to read it. Again, congrats to all finalists.

Vendor Finalists/Winners

New Product of the Year
  - Avvo Inc., Advisor
  - Catalyst Repository Systems, Predict
  - Diligence Engine
  - Lex Machina
Best Marketing Services Providers
  - JD Supra
Best Knowledge Management Software
  - Motivation Group’s Easy Data Maps
  - MDLegalApps’ Not Guilty App
  - ZL Technologies
Best Mobile Device Tool or Service
  - Abacus Data Systems
  - Logik Systems’ Logikcull
Best Trial Support Software
  - Indata Corp.’s TrialDirector
  - LexisNexis CaseMap
  - Thomson Reuters’ Case Notebook
Best Case/Matter Management System
  - Bridgeway Software
  - Mitratech Holdings’ TeamConnect 4
Best Records Management Software
  - Hewlett-Packard Co., HP Records Manager (formerly TRIM)
  - IBM Records Manager
  - ZL Technologies
Best Risk Management Software
  - Compliance Science Inc.
  - IBM OpenPages Operational Risk Management
Best Time and Billing Software
  - Abacus Data Systems
  - Tabs3 Software
  - Tikit North America
Best Collaboration Tool
  - Accellion kiteworks
  - Litera IDS
  - Mitratech’s Lawtrac Self-Service Portal
  - Opus2 Magnum
Best Document Automation/Management
  - HotDocs Market
  - Leaflet Corp.
Best E-Discovery Managed Service Provider
  - Clutch Group
  - FTI Consulting, FTI Technology
  - Iris Data Services’ Arc
Best E-Discovery Processing
  - Exterro Inc.
  - iPro Tech
  - ZL Technologies
Best E-Discovery Review Platform
  - FTI Technology’s Ringtail software
  - iConect Development
  - kCura Corp.’s Relativity
  - Recommind Inc.’s Axcelerate 5
Best E-Discovery Legal Hold
  - Exterro Legal Hold
  - Legal Hold Pro
Best E-Discovery Hosting Provider
  - Iris Data Services
  - Nextpoint Inc.
Best E-Discovery OEM Technology Partner
  - Content Analyst
Best Research Product
  - Docket Alarm Inc.
  - Handshake Corp.
Best Research Platform
  - Bloomberg Law
  - LexisNexis’ Lexis Advance
  - Thomson Reuters’ Westlaw Next
Best Practice Management Software
  - LexisNexis Legal & Professional’s Firm Manager
  - Thomson Reuters’ Firm Central

The other two finalists in “my” category, CIO (if hell freezes over and I win, you know I’ll add that title to my card), are Dan Nottke and Harry Shipley. Again, good luck to them and please excuse my pathetic attempts at humor. I Googled them both and will share what I know about them and then make a prediction as to how I will do in this event (hint – it’s not good).

Dan Nottke is currently the Chief Information Officer of Kirkland & Ellis LLP, a law firm that always seems to begin descriptions of itself by saying it is 100 years old. (In just a few more years I may be able to say that too.) Most of us know Kirkland not as old, but as one of the largest, most powerful law firms, with over 1,800 lawyers in key cities around the world. The IT challenges of a firm like that must be huge. Dan is obviously a serious player in the law firm CIO world.

I have never met Dan, but a quick Google shows he is on LTN’s Law Firm Chief Information & Technology Officers Board. With the exception of Monica Bay, who has now left LTN and this CIO Board, I have never met, or even heard of, any of the LTN CIO board members. They all appear to be great people, we just do not travel in the same circles. They are, after all, real life law firm CIOs. They are engineers, not lawyers, but for lawyers. Googling Dan shows that he is usually described with the following defining accomplishment:

Since joining the Firm in 2008, Dan has led the transformation of the Information Technology department from a decentralized team to a fully centralized, Information Technology Infrastructure Library (ITIL)-based, high-performing organization.

Since his expertise is so different from mine (to be honest I had to look up ITIL on Wikipedia, since I had never heard of it), it is no surprise that we have never met. About the only thing we have in common is high performing law firms, although his firm is more than twice the size of mine. The same goes for the other finalist, Harry Shipley. He has yet another completely different skill set and list of accomplishments.

Harry Shipley is the Assistant Executive Director and CFO of the Iowa State Bar Association. According to his LinkedIn profile he is a graduate of Grand View College and his top listed skill is legal research, but otherwise he does not disclose much. Further research shows that he is an expert in document automation, an area he has been working in for over 15 years. I also see that Harry received the Patriot Award from the Iowa State Bar Association in 2014, in recognition of his support of Iowa Bar employees serving in the National Guard and Reserve. He seems like a very nice guy, but I could not find out much more about him. Obviously he has many LTN fans or he would not have made the finals.

If I were the LTN editors and had to pick a winner from these three for the most Innovative CIO of the year, I would pick Dan Nottke (no offense, Harry). After all, Dan is the only real-life CIO, and taking a “decentralized team to a fully centralized Information Technology Infrastructure Library” seems pretty innovative to me. But what do I know? I’m just a lawyer, which appears to be the only advantage I have at this point over Dan and Harry. So, congratulations in advance to Mr. Nottke. In the very unlikely event that Harry or I win instead, Dan can at least console himself in knowing that the Innovative CIO award this year did not go to a competitor CIO; it went to a Bar guy or a hacker lawyer instead.

The Legaltech News announcement of the award finalists said: “The winners will be recognized at a special event on July 14 at the close of Legaltech West at the City Club of San Francisco.” Well, I never go to Legaltech West, just East. So, even if I did not already have another very important conflict, flying from Orlando to San Francisco is a tad too far to travel for a maybe-award dinner. So, Dan, please do not be insulted if I do not show up to applaud your acceptance. I admit I did ask Legaltech about any possible advance notice, and they said no way, come to the dinner and find out just like everyone else (I respect that, but had to try). Apparently, this is the first time LTN has ever had an awards dinner for this, with all the super-secrecy stuff. (I understand they used to just make an announcement in LTN and mail you something.) But now they have a dinner and are looking for a good turnout. I don’t blame them for that, having put on a few events myself during my nearly 100 years.

Anyway, LTN told me that I had to be there, at the awards dinner, in order to physically receive the award (not sure exactly what that means). So, even though I cannot come due to the long distance, and an expected very big and important conflict, namely my playing the new role of Grandfather at about that time, my firm, Jackson Lewis, does have a nice office in San Francisco. So, I am hoping to persuade one or two of my e-discovery attorneys in that office to show up at the dinner for me, to clap a million times where appropriate as awards are doled out, and, in case lightning strikes, to accept the award for me in absentia. In fact, I hope to make them Ralph face-masks to wear, just in case, so, if I win, they can make a convincing showing and quickly grab the hardware before any of the LTN editors figure out that I’m not really there, much less not a real CIO.

e-Discovery in Switzerland

We used to think of e-discovery as a uniquely U.S. legal obsession, but that is not true anymore. Our little preoccupation with following evidence down the rabbit hole of technology is now a worldwide phenomenon. This was very evident at a couple of events I attended last month in Zurich and London. I’ll start off with Zurich, which has got to be one of the most beautiful cities in the world. The city seemed like a kind of Disney World: super clean, nice and expensive, but without the annoying characters or tourists, and incredibly quiet. Zurich is all about pristine water, the Swiss Alps, and environmentally conscious, healthy, smart people.

I knew all that coming in, but what I did not know until I got there was how sharp and interested the Swiss Bar would be about e-discovery, especially an advanced topic like predictive coding. I now know why half the world’s money is stashed in Switzerland. They are a very secure bunch, and all carry Swiss army knives and ride around on bikes. Their only vice in Zurich appears to be chocolate, which they eat constantly, and even drink. The only negative thing I can say about Zurich is that it shuts down at 9:00, and it is thereafter impossible to find a good restaurant.

I was invited to Zurich by the Swiss Re e-discovery department to be on a panel that followed the Swiss premiere of Joe Looby’s documentary, The Decade of Discovery. Our primary host was Taylor Hoffman, SVP, Head of eDiscovery at Swiss Re. What a dream job Taylor has. He primarily works in New York, but spends a lot of time in Zurich. Jason Baron, his wife Robin, and I had a seven-course lunch at the private dining facility at Swiss Re’s headquarters overlooking Lake Zurich. We were joined by other members of Swiss Re’s legal department, plus some e-discovery lawyers who came in from Germany and elsewhere to meet and greet. We discussed e-discovery between various wine pairings and ever-changing dishes.

The focus of e-discovery in the EU is on government investigations, a fact later confirmed by my discussions in London. They also focus on privacy and cross-border issues, and seem to think we are barbarians when it comes to privacy. Since I do not really disagree with them on their privacy criticisms (see: Losey, Are We the Barbarians at the Gate? (e-Discovery Team, Sept. 1, 2008)), a position that seemed to surprise them even more than my being a blogger in a suit, I was able to dodge the daggers very politely thrown at Jason and me.

Instead, being the accomplished diplomat that I am (I even have my own email server, rather than blind-copying the Chinese on everything), and being used to arguing with lawyers everywhere, just as a matter of professional courtesy, it did not take long (one glass) for me to bring up the whole pesky notion of truth and justice. Namely, how can you have justice when both sides in litigation are permitted to hide any documents they want? They explained to me, an obviously naive and hopelessly idealistic American, that in civilized society, namely Europe, all you are required to disclose are the documents, the ESI, that happen to support your case. In civil litigation you only produce the documents that support your side of the story of what happened.

They have virtually no conception of a duty in private litigation to disclose to opposing parties the documents you have found that show your witnesses are “misremembering” the facts, i.e., lying. You can imagine how diplomatic I was, and how squirmy and quiet Jason soon became, but it did all end well. We agreed that no one should lie to a judge. Apparently judges everywhere get tired of all the contradicting allegations and may force both sides to disclose the truth, the whole truth, and nothing but. Apparently, however, that is rare in non-criminal litigation. The primary focus of the kind of disclosure we know, involving both good and bad documents, is in criminal cases, government investigations, and private, internal investigations.

I asked the non-Swiss Re attorneys attending the lunch how much of their time they spend doing e-discovery work, as opposed to other types of legal services. The answer was, of course, that it depends, but upon close cross-examination (yeah, I was popular), I learned that the percentage ranged from 10% to 25%. Remember, these are the outside counsel with special expertise in e-discovery. To me that made it all the more impressive to see how quickly the Zurich attorneys who attended The Decade of Discovery movie got it. They paid attention, and most importantly, they laughed in all the right places and seemed to understand. Their questions were good too. They were an unusually astute group, considering that no one outside of Swiss Re and the sponsoring vendor, Consilio, actually does much of this work.

Consilio sponsored Joe Looby’s movie showing in Zurich, and then again in London. Consilio’s Managing Directors also presented at both panels following the show: Michael Becker (shown here) in Zurich and Drew Macaulay in London. My thanks for Consilio’s gracious sponsorship and well-run events. Also presenting at these events were Joe Looby, Jason Baron, and Taylor Hoffman.

The main draw was not the panel discussions, as interesting as I think they were, but rather the movie itself, Looby’s Decade of Discovery. Everyone in Zurich assumed I had seen the documentary many times, but in truth that was only the second time I had seen it. The first was the major showing in DC, where everyone who is anyone attended, and most of us wondered how we ended up on the cutting room floor. Still, in DC it got a standing ovation, and it was very emotional, as the star of the movie, Richard Braman, had recently passed away. This movie is a fitting tribute to his work.


Notice how the movie poster says “Justice … is the right to learn the truth from your adversary.” Who knew that is not a popular sentiment in Europe and the UK? We need to learn about privacy from them and they need to learn from us about the importance of full disclosure.

The Decade of Discovery movie prominently features the award-winning Mr. Baron, as the journeyman to Sedona. It makes for a good story, and in the process explains predictive coding pretty well.

I made a movie with Jason myself many years ago, Did You Know: e-Discovery? Apparently our short little slide-show type video is now hard to find, so, even though it is not in the same league as Looby’s real movie, I reproduce it here again so all can easily find it. I can brag that all of our predictions have, so far, come true and the exponential increase in data continues. Feel free to share it by using the share button in the upper left. I would reproduce the Decade of Discovery movie instead, but it is not available online.

Unlike my little slide show video with Jason, Joe Looby’s Decade of Discovery is a real movie. Now that I have seen it twice, I appreciate it much more. I urge you to take time to see it if it ever comes to your town. Check out Joe’s Facebook page for his movie company 10th Mountain Films.

One of the surprise treats from my European trip was learning what a great guy Joe Looby is. I did not really know Joe before. What a pleasure to learn there is no b.s. in Joe, and no big ego either. I did not make any money, nor get any new clients, from this trip, but I did make a new friend in Joe Looby. Skeptics may think I’m just kissing up in the hope of getting a part in an e-discovery sequel, but that’s not true. Joe’s next documentary will concern how emergency decisions are made in the Oval Office; think Cuban missile crisis. I for one cannot wait to see it. Joe is a true scholar and artist and is evolving beyond his roots in law. Unlike Jason and me, he will surely go on to bigger and better movies. It would not surprise me to see him at the Oscars some day.

e-Disclosure in London

We showed the movie in London and had a panel, where, surprisingly, the lawyers in attendance did not seem as engaged as the Swiss. We even served popcorn at this event, so go figure. Maybe it was because it was raining (but isn’t it always in London), or maybe it was that whole truth-for-justice approach that us Yanks have. Anyway, Jason, Joe and I had a good time. By the way, they do not call it e-discovery in the UK; they call it e-disclosure. Also, and this amazes me, they do not take depositions over there, or at least it is very rare. They just serve prepared statements on each other. That, and produce the documents that they want you to see, and hide the rest. The Barristers must be very skilled at cross-examination to earn their wigs.

The day after the London movie showing, Jason and I gave a keynote at the IQPC event at the Waldorf Astoria in London. We were billed as the great debate on Information Governance. Jason was pro, of course, and I was sort of against, as per my old blog post, Information Governance v Search: The Battle Lines Are Redrawn. Our keynote was entitled: Let’s Have A Debate About Information Governance — Are We at the End or At the Beginning?

The event was the IQPC 10th Anniversary of Information Governance and eDiscovery. Everyone there was either already an IG specialist or hoped to be one. In other words, I was there to argue to the audience that they were all wrong, that IG was dead. Needless to say, my presentation did not go over that well, and Jason soundly won. Even though the deck was stacked against me going in, Jason pulled out all the stops to make sure he won decisively. I found out why he is banned by his family from playing Monopoly: he is over-competitive, a story he tells whenever he talks about cooperation. His beautiful wife Robin, shown right with Jason in Zurich, confirmed that story for me later. And much more, but I am sworn to secrecy.

So anyway, just to be sure that he beat me at the great debate, Jason changed the rules at the last moment to use some strange formal debate structure that I’d never heard of, involving stop-watch timing, which he controlled. Then, at the closing, he surprised me with a carefully scripted speech that he must have stayed up all night writing. He evoked Winston Churchill’s War Rooms, just a few blocks away, and then finished with a rousing quote from the end of Churchill’s most famous speech, We Shall Fight on the Beaches. The only thing missing was a Union Jack draped around his shoulders. The crowd went crazy with patriotic fervor and go-team IG enthusiasm. They will never surrender! It was the only time I saw London lawyers express any emotion. They were real quiet after I followed Churchill, I mean Baron, with my closing statement. Since Baron was Churchill urging all good British citizens to fight on for Information Governance, it was not hard to figure out who they thought I was. I was lucky to be able to goose-step out of there alive.

Well, at least I made some friends by my attack on the London IG establishment, including Alison North, another presenter at the event and an IG expert herself. She was very nice, protected me from the flying umbrellas that came my way, and politely said she agreed with me. It was more of a whisper, really. We sat together for most of the event after that. I was glad to meet such an obviously sophisticated, anti-establishment thinker. We even tried to build a structure out of toothpicks together to hold a marshmallow up in the air as high as possible. That is apparently what lawyers in London do for team building at CLEs. We were at the table with Craig Ball, who was very keen on winning this event. We spent a good fifteen minutes arguing with Craig about the ethics of his interpretation of the contest rules. Even though I won that debate, I got called away, so as a team-builder, it was another loss for me.

Craig Ball gave the keynote presentation to kick off the event first thing in the morning. That is a difficult time slot, and I thought he did a good job. As you can see from the photo I took, they had crazy disco-type lighting. On stage it was hard for a speaker to see the audience over the bright lights. Craig made many attempts to humor and entertain the London IG professionals. I smiled and laughed a few times, but I was alone. Most of Craig’s witty remarks did not even draw a smile, much less a laugh. Only when he made an off-color reference to Fifty Shades of Grey (who better than Craig to do that) did he get a laugh.

I learned a lesson from his start and did not even try for humor. Apparently it does not translate well into whatever language it is they speak over there. In fact, the only speaker who was able to get the audience riled up was the Baron of IG himself with his Churchill impression. You know that when Craig speaks there again he will surely quote Churchill at length.

Other presenters at the event included U.S. Magistrate Judge Elizabeth Laporte (shown right), whom I always enjoy hearing. She did very well with the British judges on her panel, pointing out that if you are in her court, you have to follow the U.S. rules requiring mutual full disclosure, like it or not. The rules of UK and other foreign courts are not what govern. Also presenting and moderating many of the panels was the reporter, blogger, and retired solicitor, Chris Dale, whom at the time I thought was a colleague and friend.

Also keynoting at the IQPC event were Jeffrey Ritter, Professor of Law, Georgetown University; Jamie Brown, Global eDiscovery Counsel, UBS; Karen Watson, Digital Forensic Investigations, Betfair; Greg O’Connor, Global Head of Corporate, Policy and Regulation, Man Group; Anwar Mirza, Financial Systems Director, TNT Express; and Jan-Johan Balkema, Global Master Data Manager, Akzo Nobel.

In addition to debating Baron on IG, I presented with a reformed black-hat hacker, Balazs Bucsay, who now works for Vodafone, and Judge Michael Hopmeier of Kingston-upon-Thames Crown Court. We had a very short 35-minute panel presentation on cybersecurity. Hacker Bucsay, who is one scary guy, gave a demonstration where a volunteer came on stage and had his password hacked. Impressive. Judge Hopmeier – who was a great guy, by the way: tech savvy, frank and outspoken – told everyone how many cybersecurity crimes he sees, and shared a story of a brilliant teenage hacker charged with serious crimes, even though no money was taken. The kid did it for fun, much like Bucsay used to do. But often it is done by hardened criminals or terrorists. Judge Hopmeier well understands the problem. I hope he is invited to speak in the U.S. soon. We need to hear from him.

I emphasized Judge Hopmeier’s points on the enormity of the problem, and the billions of dollars now lost each year to cyber crime. The average cost of a data breach last year was $3.5 million. Then I closed with twelve pointers on what lawyers can do about cyber crime to try to protect their legal practice and their clients’ data:

  1. Invest in your company or law firm’s Cybersecurity.
  2. Think like a Hacker and allocate resources accordingly.
  3. Most Law Firms should Outsource primary targets.
  4. Keep Virus Protection in place and updated.
  5. Harden your IT Systems and Websites; $$ and people.
  6. Intrusion Response Focus (Hackers will get in).
  7. Penetration Testing and Vulnerability Scans.
  8. Train and Test Employees on Phishing and Social Engineering; Reward/Discipline to prove you are serious.
  9. Be Careful with Cloud Providers and their Agreements.
  10. Buy as much Insurance as possible (insurer guessing game).
  11. Change Laws to make Software Cos Accountable for Errors.
  12. Update Anti-Hacking Laws.

It was the only panel on cybersecurity at the IG CLE, which, as far as I am concerned, is a huge mistake. It was late in the day and not well attended. The IG crowd does not seem to grasp the importance of the problem. The Chinese Army applauds their apathy. Let me be very clear, using a recent event as an example: they hacked the U.S. government employee database and email. If you are one of the four million past and present federal government employees impacted, the Chinese military not only knows where you live, and has your social security number, user names and passwords; they also know pretty much everything about your personal and professional life. Experts Say China Is Hacking Federal Employees’ Info to Create a Database of Government Workers.

If you are a federal employee who has been a bad boy or girl, say you had an affair, or took a bribe, or maybe you are paying bribes to former high school students you molested years ago like Dennis Hastert, they probably know about that too. They read your emails, texts, and Facebook posts. If you have any kind of security clearance, they will have a couple of paid hackers monitoring your every move on the Net. If you were bad, or otherwise have something to hide, they will try to extort you. That is what spies do. The FBI is taking this seriously. The four million plus federal employees whose email was hacked should too.

Dinner at the Savoy

I do not usually mention CLE speaker dinners, but the one hosted by Recommind at the IQPC event deserves an exception. It was held in a private dining room at Gordon Ramsay’s Savoy Grill, in The Savoy hotel. I stayed at the Savoy in Zurich and wish I had in London too. But do not waste your time eating at the other famous restaurant at the Savoy, Simpson’s-in-the-Strand. The atmosphere at Simpson’s was good, but not the food. Ramsay’s Savoy Grill, on the other hand, was so good that we went back there the next night. It was by far the best food we had in London, even though some of the waiters spoke with a fake French accent that sounded just like Steve Martin’s Inspector Clouseau. No, hamburger was not on the menu.

What made the Recommind dinner special was the group of people they brought together as guests. This was primarily a group of young UK attorneys, the ones who specialize in e-disclosure. Many of them were not able to attend the IQPC event, but they did accept an invitation from Recommind for dinner at the Savoy. Aside from the famous Chris Dale, there were only a couple of other speakers there. Most of the dinner guests were true London lawyers, with a couple of American lawyers thrown in, those lucky enough to be transferred to London. It was a sophisticated group of very smart creatives, all with lovely accents. I felt right at home with all of them and found we had much in common, including my London favorite, Sherlock Holmes.

This was not my first trip to London as a speaker. Last year I spoke about predictive coding at the famous Lincoln’s Inn, and also had dinner with a small group of specialists and judges. That was sponsored by Kroll. I look forward to an opportunity to speak in London again. It is very important to both of our countries that we maintain a close relationship. Next time, however, I just want to speak about predictive coding and cybersecurity. I will leave IG to Jason. You know, old man, it is not really my cup of tea.

