My Hack of the NSA and Discovery of a Heretofore Unknown Plan to Use Teams of AI-Enhanced Lawyers and Search Experts to Find Critical Evidence

March 1, 2015

Now that my blog has changed from weekly to monthly I have more time for my hobbies, like trying to hack into NSA computers. I made a breakthrough with that recently, thanks primarily to exuberant disclosures by Snowden after the Oscars. I was able to get into one of the NSA’s top-secret systems. Not only that, my hack led to discovery of a covert operation that will blow your mind. (Hey, if the NSA can brag about their exploits, then so can I.) And if that were not enough, I was able to get away with downloading two documents from their system. I will share what I borrowed with you here (and, of course, on Wikileaks). The documents are:

  • A previously unknown Plan to use sophisticated e-Discovery Teams with AI enhancements to find evidence for use in investigations and courtrooms around the world.
  • A slide show in movie and PDF form that tells you how these teams operate.

I can disclose my findings and stolen documents here without fear of becoming Citizen Five because what I found out is so incredible that the NSA will disavow all knowledge. They will be forced to claim that I made up the whole story. Besides, I am not going to explain how I hacked the NSA. Moreover, unlike some weasels, I will never knowingly give aid and comfort to foreign governments. This is something many Hollywood types and script kiddies fail to grasp. All I will say is that I discovered a critical zero-day type error in two lines of code, out of billions, in a software program used by the NSA. In accord with standard white hat protocol, if the NSA admits my story here is true, I will tell them the error. Otherwise, I am keeping this code mistake secret.

The hack allowed me to access a Top Secret project code-named Gibson. It is a Cyberspace Time Machine. This heretofore secret device allows you to travel in time, but, here’s the catch, only on the Internet. Since it is an Internet-based device the NSA has to keep it plugged in. That is why I was not faced with the nearly insoluble air gap defense protecting the NSA’s other computer systems.

From what I have been able to figure out, the time travel takes place on a subatomic cyber-level and requires access to the Hadron Collider. The Gibson somehow uses entangled electrons, Higgs bosons, and quantum flux probability. The new technology is based on Hawking’s latest theories, quantum computers, and, can you believe it, imaginary numbers, you know, the square roots of negative numbers. After you read the NSA executive summary, it all seems so obvious that other groups with Hadron Collider access and quantum computers are likely to come up with the same invention soon. But for now the NSA has a huge advantage and head start. Maybe someday they will even share some of that info with POTUS.


The NSA Internet Time Machine allows you to peer into the past content of the Internet, which, I know, is not all that new or exciting. But here is the really cool part that makes this invention truly disruptive: you can also look into the future. With the Gibson and special web browsers you can travel to and capture future webpages and content that have not been created yet, at least not in our time. You can Google the future! Just think of the possibilities. No wonder the NSA never has any funding problems.

This kind of breakthrough invention is so huge, and so incredible, that the NSA must deny all knowledge. If people discover this is even possible, other groups will race to catch up and build their own Internet Time Machines. That is probably why Apple is hoarding so much cash. Will there be a secret collider built off the books under their new headquarters? It kind of looks like it. Google is probably working on this too. The government cannot risk anyone else knowing about this discovery. That would encourage a dangerous time machine race that would make the nuclear race look like child’s play. Can you imagine what Iran would do with information from the future? The government simply cannot allow that to happen.

For that reason alone my hack and disclosures are untouchable. The NSA cannot admit this is true, or even might be true. Besides, having seen the future, I already know that I will not be prosecuted for these intrusions. In fact, no one but a few hard-core e-Discovery Team players will even believe this story. I can also share the information I have stolen from the future without fear of CFAA prosecution. Technically speaking, my unauthorized access of web pages in the future has not happened yet. Despite my PreCrime-like proposals in PreSuit.com, you cannot (yet) be prosecuted for future crimes. You can probably be fired for what you may do, but that is another story.

Still, the hack itself is not really what is important here, not even the existence of the NSA’s Time Machine, as great as that is. The two documents that I brought back from the future are what really matters. That is the real point of this blog, just in case you were wondering. I have been able to locate and download from the future Internet a detailed outline of a Plan for AI-Enhanced search and review.

The Plan is apparently in common use by future lawyers. I am not sure of the document’s exact date, but it looks like circa 2025. It is obviously from the future, as nobody has any plans like this now. I also found a video and PDF of a PowerPoint of some kind. It shows how lawyers and other investigators in the future use artificial intelligence to enhance all kinds of ESI search projects, including overt litigation and covert investigations. It appears to be a detailed presentation of how to use what is still called Predictive Coding. (Well, at least they do not call it TAR anymore.) Nobody in our time has seen this presentation yet. I am sure of that. You will have the first glimpse now.

The Plan for AI-Enhanced search and review is in the form of a detailed 1,500-word outline. It looks like this Plan is commonly used in the future to obtain client and insurer approval of e-discovery review projects. I think that this review Plan of the future is part of a standardized approval process that is eventually set up for client protection. Obviously we have nothing like that now. The plan might even be shared with opposing counsel and the courts, but I cannot be sure of that. I had to make a quick exit from the NSA system before my intrusion was detected.

I include a full copy of this Plan below, and the PowerPoint slides in video form. See if these documents are comprehensible to you. If my blog is brought down by denial of service attacks, you can also find it on Wikileaks servers around the world. The Plan can also be found here as a standalone document, and the PDF of the slides can be found here. I hope that this disclosure is not too disruptive to existing time lines, but, from what I have seen of the future of law, temporal paradox be damned, some disruption is needed!

Although I had to make a quick exit, I did leave a back door. I can seize root of the NSA Gibson Cyberspace Time Machine anytime I want. I may share more of what I find in upcoming monthly blogs. It is futuristic, but as part of the remaining elite who still follow this blog, I’m sure you will be able to understand. I may even start incorporating this information into my legal practice, consults, and training. You’ll read about it in the future. I know. I’ve been there.

If you have any suggestions on this hacking endeavor, or the below Plan, send me an encrypted email. But please only use this secure email address: HackerLaw@HushMail.com. Otherwise the NSA is likely to read it, and you may not enjoy the same level of journalistic sci-fi protection that I do.

_______________

Outline of 12-Step Plan for Predictive Coding Review

1. Basic Metrics of the Project

a. Number and type of documents to be reviewed

b. Time to complete review

c. Software to be used for review

(1) Active Machine Learning features

(A) General description

(B) Document ranking system (e.g., Kroll ranks documents by percentage probability, 0.01%–99.9%)

(2) Vendor expert assistance to be provided

d. Budget Range (supported by separate document with detailed estimates and projections)

2. Basic Goals of the Project, including analysis of impact of Proportionality Doctrine and Document Ranking. Here are some possible examples:

a. High recall and production of responsive documents within budget proportionality constraints and time limits

b. Top 25% probable relevant and all probable (50%+) highly relevant is proportional, and thus reasonable, in this particular case for this kind of ESI

c. All probable relevant and highly relevant, with extreme care given to confidentiality protection or privilege

d. Evaluation of large production received by client

e. Rush preparation for specific hearings, mediation, depositions, or 3rd party subpoenas

f. Compliance with government requests and civil and criminal investigations

3. General Cooperation Strategy

a. Disclosures planned

(1) Transparent

(2) Translucent

(3) Brick Wall

b. Treatment of Irrelevant Documents

c. Relevancy Discussions

d. Sedona Principle Six

4. Team Members for Project

a. Predictive Coding Chief. Experienced searcher in charge of the Predictive Coding aspects of the document review

(1) Experienced ESI Searcher

(2) Same person in charge of non-PC aspects; if not, explain

(3) Authority and Responsibilities

(4) List qualifications and experience

b. Subject Matter Experts (SME)

(1) Senior SME

A. Final Decision Maker – usually partner in charge of case

B. Determines what is relevant or responsive

(i) Based on experience with the type of case at issue

(ii) Predicts how judge will rule on relevance and production issues

C. Formulates specific rules when faced with particular document types

D. Controls communications with requesting party’s senior counsel (usually)

E. List qualifications and experience

(2) Junior SME(s)

A. Lead Document Review expert(s)

B. Usually Sr. Associate working directly with partner in charge

C. Seeks input from final decision maker on grey area documents (Undetermined Category)

D. Responsible for Relevancy Rule articulations and communications

E. List qualifications and experience

(3) Amount of estimated time in budget for the work by Sr and Jr SMEs.

A. Assurances of adequate time commitments, availability

B. Reference time estimates in budget

C. Time should exclude training

(4) Response time guaranties for questions and requests from the Predictive Coding Chief

c. Vendor Personnel

(1) Anticipated roles

(2) List qualifications and experience

d. Power Users of particular software and predictive coding features to be used

(1) Law Firm and Vendor

(2) List qualifications and experience

e. Outside Consultants or other experts

(1) Anticipated roles

(2) List qualifications and experience

f. Contract Lawyers

(1) Price list for reviewers and reviewer management

A. $500-$750 per hour is typical (Editor’s Note: Is this widespread inflation, or new respect?)

B. Competing bids requested? Why or why not.

(2) Conflict check procedures

(3) Licensed attorneys only or paralegals also

(4) Size of team planned

A. Rationale for more than 5 contract reviewers

B. “Less is More” plan

(5) Contract Reviewer Selection criteria

g. Plan to properly train and supervise contract lawyers

5. One or Two-Pass Review

a. Two-pass review is standard: the first pass selects for relevance and privilege using Predictive Coding; the second pass is an eyes-on review by reviewers to confirm the relevance predictions, code for confidentiality, and create the privilege log.

b. If a one-pass review is proposed (aka Quick Peek), has the client approved the risks of inadvertent disclosure after written notice of those risks?

6. Clawback and Confidentiality agreements and orders

a. Rule 502(d) Order

b. Confidentiality Agreement: Confidential, AEO, Redactions

c. Privilege and Logging

(1) Contract lawyers

(2) Automated prep

7. Categories for Review Coding and Training

a. Irrelevant – this should be a training category

b. Relevant – this should be a training category

(1) Relevance Manual for contract lawyers (see form)

(2) Email family relevance rules

A. Parents automatically relevant if child (attachment) is relevant?

B. Attachments automatically relevant if email is?

C. All attachments automatically relevant if one attachment is?

c. Highly Relevant – this should be a training category

d. Undetermined – temporary until final adjudication

e. No or Very Few Sub-Issues of Relevant, usually just Highly Relevant

f. Privilege – this should be a training category

g. Confidential

(1) AEO

(2) Redaction Required

(3) Redaction Completed

h. Second Pass Completed

8. Search Methods to find documents for training and production

a. ID persons responsible and qualifications

b. Methods to cull out documents before Predictive Coding training begins, to avoid selection of inappropriate documents for training and to improve efficiency

(1) E.g., any non-text documents; overly long documents

(2) Plan to review by alternate methods

(3) ID general methods for this first stage culling; both legal and technical

c. ID general methods for Predictive Coding, i.e., machine-selected only, or multimodal

d. Describe machine selection methods.

(1) Random – should be used sparingly, and never as sole method

(2) Uncertainty – documents whose ranking the machine is currently unsure of, usually in the 40%–60% range

(3) High Probability – documents as yet uncoded that the machine considers likely relevant

(4) All or some of the above in combination

e. Describe other human-based multimodal methods

(1) Expert manual

(2) Parametric Boolean Keyword

(3) Similarity and Near Duplication

(4) Concept Search (passive machine learning, such as latent semantic indexing)

(5) Various Ranking methods based on probability strata selected by expert in charge

f. Describe whether a Continuous Active Learning (CAL) process for review will be used, or a two-stage process (train, then review), and if the latter, the rationale

9. Describe Quality Control procedures, including, where applicable, any features built into the software, to accomplish the following QC goals

a. Three areas of focus to maximize the quality of predictive coding

(1) Quality of the AI trainers’ work to select documents for instruction in the active machine learning process

(2) Quality of the SME work to properly classify documents, especially Highly Relevant and grey area documents, in accord with true probative value and court opinions

(3) Quality of the software algorithms that apply the training input to create a mathematical model that accurately separates the document cloud into probability polar groupings

b. Supervise all reviewers, including contract reviewers who usually do the bulk of the document review work.

(1) ID persons responsible

(2) ID general methods

c. Avoid incorrect conceptions and understandings of relevance and responsiveness, i.e., what are you searching for and what will you produce

(1) Target matches legal obligations

(2) Relevance scope dialogues with requesting party

(3) 26(f) conferences and 16(b) hearings

(4) Motion practice with Court for early resolution of disputes

(5) ID persons responsible

d. Minimize human errors in document coding

(1) Mistakes in relevance rule applications to particular documents

(2) Physical mistakes in clicking wrong code buttons

(3) Inconsistencies in coding of same or similar documents

(4) Inconsistencies in coding of same or similar document types

(5) ID persons responsible

e. Facilitate horizontal and vertical communications in team

(1) ID persons responsible

(2) ID general methods

f. Corrections for Concept Drift inherent in any large review project where understanding of relevance changes over time

(1) ID persons responsible

(2) ID general methods

g. Detection of inconsistencies between predictive document ranking and coding

(1) ID persons responsible

(2) ID general methods

h. Avoid incomplete, inadequate selection of documents for training

(1) ID persons responsible

(2) ID general methods

i. Avoid premature termination of training

(1) ID persons responsible

(2) ID general methods

j. Avoid omission of any Highly Relevant documents, or new types of strong relevant documents

(1) ID persons responsible

(2) ID general methods

k. Avoid inadvertent production of privileged documents

(1) List of attorneys names and email domains

(2) Active multimodal search supplement to predictive coding

(3) Dual pass review

(4) ID persons responsible

(5) ID general methods

l. Avoid inadvertent production of confidential documents without proper labeling and redactions

(1) ID persons responsible

(2) ID general methods

m. Avoid incomplete, inaccurate privilege logs

(1) ID persons responsible

(2) ID general methods

n. Avoid errors in final media production to requesting party

(1) ID persons responsible

(2) ID general methods

10. Decision to Stop Training for Predictive Coding

a. ID persons responsible

b. Criteria to make the decision

(1) Probability distribution

(2) Separation of documents into two poles

(3) Ideal of upside down champagne glass visualization

(4) Few new relevant documents found in last rounds of training

(5) Few new strong relevant types found

(6) No new Highly Relevant documents found

11. Quality Assurance Procedures to Validate Reasonability of Decision to Stop

a. Random Sample Tests to validate the decision

(1) ei-Recall method used; if not, describe the alternative method

(2) Accept on zero error for any Highly Relevant document found in the elusion test, or any new strong relevant type

(3) Recall and Precision goals

b. Judgmental sampling

12. Procedures to Document the Work Performed and Reasonability of Efforts

a. Clear identification of efforts on the review platform itself with screen shots before project closure

b. Memorandums to file or opposing counsel

(1) Basic metrics for possible disclosure

(2) Detail for internal use only and possible testimony

c. Availability of expert testimony if court challenges arise

________________


What follows is another file I stole from the NSA, a video of PowerPoint slides (no voiceover) for a future presentation called:

Predictive Coding: An Introduction and Real World Example.

The PDF of the slides can be found here.


____________




Two-Filter Document Culling – Part Two

February 1, 2015

Please read Part One of this article first.

Second Filter – Predictive Culling and Coding

The second filter begins where the first leaves off. The ESI has already been purged of unwanted custodians, date ranges, spam, and other obviously irrelevant files and file types. Think of the First Filter as a rough, coarse filter, and the Second Filter as fine grained. The Second Filter requires a much deeper dive into file contents to cull out irrelevance. The most effective way to do that is to use predictive coding, by which I mean active machine learning, supplemented somewhat by a variety of methods to find good training documents. That is what I call a multimodal approach, one that places primary reliance on the Artificial Intelligence at the top of the search pyramid. If you do not have an active machine learning type of predictive coding with ranking abilities, you can still do fine-grained Second Level filtering, but it will be harder, and probably less effective and more expensive.

Multimodal Search Pyramid

All kinds of Second Filter search methods should be used to find highly relevant and relevant documents for AI training. Stay away from any process that uses just one search method, even if the one method is predictive ranking. Stay far away if the one method is rolling dice. Relying on random chance alone has been proven to be an inefficient and ineffective way to select training documents. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Parts One, Two, Three and Four. No one should be surprised by that.

The first round of training begins with the documents reviewed and coded relevant incidental to the First Filter coding. You may also want to defer the first round until you have done more active searches for relevant and highly relevant documents from the pool remaining after First Filter culling. In that case you also include irrelevant documents in the first training round, which is also important. Note that even though the first round of training is the only round of training that has a special name – seed set – there is nothing all that important or special about it. All rounds of training are important.

There is so much misunderstanding about that, and about seed sets, that I no longer like to even use the term. The only thing special in my mind about the first round of training is that it is often a very large training set. That happens when the First Filter turns up a large number of relevant files, or they are otherwise known and coded before the Second Filter training begins. The sheer volume of training documents in many first rounds thus makes it special, not the fact that it came first.

No good predictive coding software is going to give special significance to a training document just because it came first in time. The software I use has no trouble at all disregarding any early training if it later finds that it is inconsistent with the total training input. It is, admittedly, somewhat aggravating to have a machine tell you that your earlier coding was wrong. But I would rather have an emotionless machine tell me that, than another gloating attorney (or judge), especially when the computer is correct, which is often (not always) the case.

That is, after all, the whole point of using good software with artificial intelligence. You do that to enhance your own abilities. There is no way I could attain the level of recall I have been able to manage lately in large document review projects by reliance on my own, limited intelligence alone. That is another one of my search and review secrets. Get help from a higher intelligence, even if you have to create it yourself by following proper training protocols.

Maybe someday the AI will come prepackaged, and not require training, as I imagine in PreSuit. I know it can be done. I can do it with existing commercial software. But apparently, from the lack of demand I have seen in reaction to my offer of PreSuit as a legal service, the world is not ready to go there yet. I for one do not intend to push for PreSuit, at least not until the privacy aspects of information governance are worked out. Should Lawyers Be Big Data Cops?

Information governance in general is something that concerns me, and is another reason I hold back on PreSuit. Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part One and Part Two. Also see: e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million – Part Two. I do not want my information governed, even assuming that’s possible. I want it secured, protected, and findable, but only by me, unless I give my express written assent (no contracts of adhesion permitted). By the way, even though I am cautious, I see no problem in requiring that consent as a condition of employment, so long as it is reasonable in scope and limited to only business communications.

I am wary of Big Brother emerging from Big Data. You should be too. I want AIs under our own individual control, where they each have a real big off switch. That is the way it is now with legal search and I want it to stay that way. I want the AIs to remain under my control, not vice versa. Not only that, like all Europeans, I want a right to be forgotten by AIs and humans alike.

But wait, there’s still more to my vision of a free future, one where the ideals of America triumph. I want AIs smart enough to protect individuals from out-of-control governments, for instance, from any government, including the Obama administration, that ignores the Constitutional prohibition against General Warrants. See: Fourth Amendment to the U.S. Constitution. Now that Judge Facciola has retired, who on the DC bench is brave enough to protect us? See: Judge John Facciola Exposes Justice Department’s Unconstitutional Search and Seizure of Personal Email.

Perhaps quantum entanglement encryption is the ultimate solution? See, e.g.: Entangled Photons on Silicon Chip: Secure Communications & Ultrafast Computers, The Hacker News, 1/27/15. Truth is far stranger than fiction. Quantum Physics may seem irrational, but it has been repeatedly proven true. The fact that it may seem irrational for two electrons to interact instantly over any distance just means that our sense of reason is not keeping up. There may soon be spooky ways for private communications to be forever private.


At the same time that I want unentangled freedom and privacy, I want a government that can protect us from crooks, crazies, foreign governments, and black hats. I just do not want to give up my Constitutional rights to receive that protection. We should not have to trade privacy for security. Once we lay down our Constitutional rights in the name of security, the terrorists have already won. Why do we not have people in the Justice Department clear-headed enough to see that?

Getting back to legal search, and how to find out what you need to know inside the law by using the latest AI-enhanced search methods, there are three kinds of probability ranked search engines now in use for predictive coding.

Three Kinds of Second Filter Probability Based Search Engines

After the first round of training, you can begin to harness the AI features in your software. You can begin to use its probability ranking to find relevant documents. There are currently three kinds of ranking search and review strategies in use: uncertainty, high probability, and random. The uncertainty search, sometimes called SAL for Simple Active Learning, looks at middle-ranking documents where the software is unsure of relevance, typically the 40%-60% range. The high probability search looks at the documents that the AI thinks it already knows are relevant or irrelevant. You can also use some random searches, if you want, both simple and judgmental, just be careful not to rely too much on chance.
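To make those three strategies concrete, here is a minimal sketch, in Python, of how you might slice a ranked collection into the three pools once your review platform exports a probability-of-relevance score for each document. The field names, thresholds, and data layout are my own illustrative assumptions, not any particular vendor’s API.

```python
import random

# "docs" is assumed to be an export from the review platform: one record per
# document, with the tool's probability-of-relevance score (0.0 to 1.0).
docs = [
    {"id": "DOC-000001", "p_relevant": 0.97},
    {"id": "DOC-000002", "p_relevant": 0.52},
    {"id": "DOC-000003", "p_relevant": 0.08},
    # ... the rest of the ranked collection
]

def uncertainty_pool(docs, low=0.40, high=0.60):
    """SAL-style selection: documents the software is least sure about."""
    return [d for d in docs if low <= d["p_relevant"] <= high]

def high_probability_pool(docs, cutoff=0.90):
    """High probability selection: the upper strata the AI already ranks as likely relevant."""
    return [d for d in docs if d["p_relevant"] >= cutoff]

def random_pool(docs, n=10, seed=1):
    """Simple random selection; use sparingly, and never as the sole method."""
    return random.Random(seed).sample(docs, min(n, len(docs)))

# A multimodal training batch might mix all three, weighted heavily
# toward the high probability pool.
next_batch = (high_probability_pool(docs)
              + uncertainty_pool(docs)
              + random_pool(docs, n=1))
```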

The 2014 Cormack Grossman comparative study of various methods has shown that the high probability search, which they called CAL, for Continuous Active Learning using high-ranking documents, is very effective. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Also see: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two.

My own experience also confirms their experiments. High probability searches usually involve SME training and review of the upper strata, the documents with a 90% or higher probability of relevance. I will, however, also check out the low strata, but will not spend as much time on that end. I like to use both uncertainty and high probability searches, but typically with a strong emphasis on the high probability searches. And again, I supplement these ranking searches with other multimodal methods, especially when I encounter strong, new, or highly relevant type documents.

Sometimes I will even use a little random sampling, but the mentioned Cormack Grossman study shows that it is not effective, especially on its own. They call such chance-based search Simple Passive Learning, or SPL. Ever since reading the Cormack Grossman study I have cut back on my reliance on random searches. You should too. My reliance was small before; it is even smaller now.

Irrelevant Training Documents Are Important Too

In the Second Filter you are on a search for the gold, the highly relevant, and, to a lesser extent, the strong and merely relevant. As part of this Second Filter search you will naturally come upon many irrelevant documents too. Some of these documents should also be added to the training. In fact, it is not uncommon to have more irrelevant documents in training than relevant, especially with low prevalence collections. If you judge a document, then go ahead and code it and let the computer know your judgment. That is how it learns. There are some documents that you judge that you may not want to train on – such as the very large, or the very odd – but they are few and far between.

Of course, if you have culled out a document altogether in the First Filter, you do not need to code it, because these documents will not be part of the documents included in the Second Filter. In other words, they will not be among the documents ranked in predictive coding. They will either be excluded from possible production altogether as irrelevant, or will be diverted to a non-predictive coding track for final determinations. The latter is the case for non-text file types like graphics and audio in cases where they might have relevant information.

How To Do Second Filter Culling Without Predictive Ranking

When you have software with active machine learning features that allow you to do predictive ranking, you find documents for training, and from that point forward you incorporate ranking searches into your review. If you do not have such features, you still sort out documents in the Second Filter for manual review, you just do not use ranking with SAL and CAL to do so. Instead, you rely on keyword selections, enhanced with concept searches and similarity searches.

When you find an effective parametric Boolean keyword combination, which is done by a process of party negotiation, testing, educated guessing, trial and error, and judgmental sampling, you submit the documents containing proven hits to full manual review. Ranking by keywords can also be tried for document batching, but be careful of large files having many keyword hits just on the basis of file size, not relevance. Some software compensates for that, but most does not. So ranking by keywords can be a risky process.
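If your software does not compensate for document length, one rough way to keep big files from dominating a keyword-ranked batch is to normalize hit counts by length, say hits per thousand words. A minimal sketch, with made-up numbers, just to illustrate the idea:

```python
def keyword_density(hit_count, word_count):
    """Keyword hits per 1,000 words, so that long documents do not float
    to the top of the batch on raw hit counts alone."""
    if word_count == 0:
        return 0.0
    return 1000.0 * hit_count / word_count

# A 200,000-word attachment with 40 hits now ranks far below
# a 2,000-word memo with 12 hits.
print(keyword_density(40, 200_000))   # 0.2
print(keyword_density(12, 2_000))     # 6.0
```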

I am not going to go into detail on the old-fashioned ways of batching out documents for manual review. Most e-discovery lawyers already have a good idea of how to do that. So too do most vendors. Just one word of advice. When you start the manual review based on keyword or other non-predictive coding processes, check in daily with the contract reviewer work and calculate what kind of precision the various keyword and other assignment folders are creating. If it is terrible, which I would say is less than 50% precision, then I suggest you try to improve the selection matrix. Change the Boolean, or the keywords, or something. Do not just keep plodding ahead and wasting client money.
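Here is one way that daily precision check might look in practice, as a short sketch that tallies the contract reviewers’ coding by assignment folder and flags any folder running below 50% precision. The data layout is an assumption; pull the equivalent fields from whatever review platform you are using.

```python
from collections import defaultdict

# Assumed export of the day's contract-reviewer decisions:
# (assignment_folder, coded_relevant) pairs.
decisions = [
    ("keyword_set_A", True),
    ("keyword_set_A", False),
    ("keyword_set_B", False),
    ("keyword_set_B", False),
    # ... the rest of the day's coding
]

totals = defaultdict(lambda: {"reviewed": 0, "relevant": 0})
for folder, coded_relevant in decisions:
    totals[folder]["reviewed"] += 1
    if coded_relevant:
        totals[folder]["relevant"] += 1

for folder, t in sorted(totals.items()):
    precision = t["relevant"] / t["reviewed"]
    flag = "  <-- improve the selection matrix" if precision < 0.50 else ""
    print(f"{folder}: {t['reviewed']} reviewed, precision {precision:.0%}{flag}")
```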

I once took over a review project that was using negotiated, then tested and modified keywords. After two days of manual review we realized that only 2% of the documents selected for review by this method were relevant. After I came in and spent three days with training to add predictive ranking we were able to increase that to 80% precision. If you use these multimodal methods, you can expect similar results.

Basic Idea of Two Filter Search and Review

Whether you use predictive ranking or not, the basic idea behind the two-filter method is to start with a very large pool of documents, reduce the size with a coarse First Filter, then reduce it again with a much finer Second Filter. The result should be a much, much smaller pool that is human reviewed, and an even smaller pool that is actually produced or logged. Of course, some of the documents subject to the final human review may be overturned, that is, found to be irrelevant, False Positives. That means that, after manual review, they will not make it to the very bottom production pool shown in the diagram.

In multimodal projects where predictive coding is used, the precision rates can often be very high. Lately I have been seeing that the second pool of documents, the one subject to manual review, has precision rates of at least 80%, sometimes even as high as 95% near the end of a CAL project. That means the final pool of documents produced is almost as large as the pool after the Second Filter.

Please remember that almost every document that is manually reviewed and coded after the Second Filter gets recycled back into the machine training process. This is known as Continuous Active Learning or CAL, and in my version of it, at least, it is multimodal and not limited to only high probability ranking searches. See: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Two. In some projects you may just train for multiple iterations and then stop training and transition to pure manual review, but in most you will want to continue training as you do manual review. Thus you set up a constant CAL feedback loop until you are done, or nearly done, with manual review.
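Schematically, that CAL feedback loop might be organized like the sketch below. Every function argument is a placeholder standing in for your platform’s own training, ranking, batching, and reviewer workflow; the point is only the loop structure, where each round of human coding is fed back into the machine before the next batch is pulled.

```python
def cal_review_loop(collection, coded_seed, train_and_rank, select_batch,
                    human_review, stopping_criteria_met, batch_size=500):
    """Continuous Active Learning sketch: keep retraining on every newly
    reviewed batch until the decision is made to stop training.

    All of the callables passed in are assumptions, stand-ins for the
    review platform's training, ranking, and batching features and for
    the human reviewers' work.
    """
    training_set = list(coded_seed)                         # first-round training documents
    while True:
        ranking = train_and_rank(collection, training_set)  # retrain and re-rank everything
        if stopping_criteria_met(ranking, training_set):
            return ranking                                  # final ranking, used for culling
        # Multimodal batch: mostly high-ranking documents, plus uncertainty
        # and other searches chosen by the human in charge.
        batch = select_batch(ranking, size=batch_size)
        coded_batch = human_review(batch)                   # eyes-on coding by reviewers
        training_set.extend(coded_batch)                    # recycle the coding into training
```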


As mentioned, active machine learning trains on both relevance and irrelevance, although, in my opinion, the Highly Relevant documents found, the hot documents, are the most important of all for training purposes. The idea is to use predictive coding to segregate your data into two separate camps, relevant and irrelevant. You not only separate them, but you also rank them according to probable relevance. The software I use has a percentage system from .01% to 99.9% probable relevant, and vice versa. A near perfect segregation-ranking project should end up looking like an upside down champagne glass.

After you have segregated the document collection into two groups, and gone as far as you can, or as far as your budget allows, you then cull out the probable irrelevant. The most logical place for the Second Filter cut-off point in most projects is at 49.9% and less probable relevant. Those are the documents that are more likely than not to be irrelevant. But do not take the 50% plus dividing line as an absolute rule in every case. There are no hard and fast rules to predictive culling. In some cases you may have to cut off at 90% probable relevant. Much depends on the overall distribution of the rankings and the proportionality constraints of the case. Like I said before, if you are looking for Gilbert’s black-letter law solutions to legal search, you are in the wrong type of law.
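As a minimal sketch of that cut-off step, assuming the same kind of exported probability scores as in the earlier snippet: bucket the final rankings to eyeball the shape of the distribution, then split the collection at the chosen threshold. The 50% default and the decile buckets are illustrative only; as just noted, the right cut-off depends on the case.

```python
from collections import Counter

def ranking_histogram(probabilities, buckets=10):
    """Count documents per probability decile; good separation looks
    bottom- and top-heavy, like an upside down champagne glass."""
    counts = Counter(min(int(p * buckets), buckets - 1) for p in probabilities)
    return {f"{b * 10}%-{(b + 1) * 10}%": counts.get(b, 0) for b in range(buckets)}

def second_filter_cull(docs, cutoff=0.50):
    """Split the ranked collection into the probable relevant pool headed
    for manual review and the probable irrelevant pool to be culled out."""
    review_pool = [d for d in docs if d["p_relevant"] >= cutoff]
    culled = [d for d in docs if d["p_relevant"] < cutoff]
    return review_pool, culled
```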

Upside-down champagne glass diagram, shown in two halves

Almost all of the documents in the production set (the red top half of the diagram) will be reviewed by a lawyer or paralegal. Of course, there are shortcuts to that too, like duplicate and near-duplicate syncing. Some of the low-ranked, probable irrelevant documents will have been reviewed too. That is all part of the CAL process where both relevant and irrelevant documents are used in training. But only a very low percentage of the probable irrelevant documents need to be reviewed.

Limiting Final Manual Review

In some cases you can, with client permission (often insistence), dispense with attorney review of all or nearly all of the documents in the upper half. You might, for instance, stop after the manual review has attained a well-defined and stable ranking structure. You might have reviewed only 10% of the probable relevant documents (the top half of the diagram), but decide to produce the other 90% of the probable relevant documents without attorney eyes ever looking at them. There are, of course, obvious problems with privilege and confidentiality in such a strategy. Still, in some cases, where appropriate clawback and other confidentiality orders are in place, the client may want to risk disclosure of secrets to save the costs of final manual review.

In such productions there are also dangers of imprecision, where a significant percentage of irrelevant documents are included. This in turn raises concerns that an adversarial view of the other documents could engender other suits, even if there is some agreement for the return of irrelevant documents. Once the bell has been rung, privileged or hot, it cannot be un-rung.

Case Example of Production With No Final Manual Review

In spite of the dangers of the unringable bell, the allure of extreme cost savings can be strong to some clients in some cases. For instance, I did one experiment using multimodal CAL with no final review at all, where I still attained fairly high recall, and the cost per document was only seven cents. I did all of the review myself acting as the sole SME. The visualization of this project would look like the below figure.

Culling filters with SME-only review

Note that if the SME review pool were drawn to scale according to number of documents read, then, in most cases, it would be much smaller than shown. In the review where I brought the cost down to $0.07 per document I started with a document pool of about 1.7 Million, and ended with a production of about 400,000. The SME review pool in the middle was only 3,400 documents.

Culling filters, SME review example

As far as legal search projects go, it was an unusually high prevalence collection, and thus the production of 400,000 documents was very large. Four hundred thousand was the number of documents ranked with a 50% or higher probable relevance when I stopped the training. I only personally reviewed about 3,400 documents during the SME review, plus another 1,745 after I decided to stop training, in a quality assurance sample. To be clear, I worked alone, and no one other than me reviewed any documents. This was an Army of One type project.

Although I only personally reviewed 3,400 documents for training, I actually instructed the machine to train on many more documents than that. I just selected them for training without actually reviewing them first. I did so on the basis of ranking and judgmental sampling of the ranked categories. It was somewhat risky, but it did speed up the process considerably, and in the end it worked out very well. I later found out that information scientists often use this technique as well.

My goal in this project was recall, not precision, nor even F1, and I was careful not to overtrain on irrelevance. The requesting party was much more concerned with recall than precision, especially since the relevancy standard here was so loose. (Precision was still important, and was attained too. Indeed, there were no complaints about that.) In situations like that the slight over-inclusion of relevant training documents is not terribly risky, especially if you check out your decisions with careful judgmental sampling, and quasi-random sampling.

I accomplished this review in two weeks, spending 65 hours on the project. Interestingly, my time broke down into 46 hours of actual document review time, plus another 19 hours of analysis. Yes, about one hour of thinking and measuring for every two and a half hours of review. If you want the secret of my success, that is it.

I stopped after 65 hours, and two weeks of calendar time, primarily because I ran out of time. I had a deadline to meet and I met it. I am not sure how much longer I would have had to continue the training before the training fully stabilized in the traditional sense. I doubt it would have been more than another two or three rounds; four or five more rounds at most.

Typically I have the luxury to keep training in a large project like this until I no longer find any significant new relevant document types, and do not see any significant changes in document rankings. I did not think at the time that my culling out of irrelevant documents had been ideal, but I was confident it was good, and certainly reasonable. (I had not yet uncovered my ideal upside down champagne glass shape visualization.) I saw a slow down in probability shifts, and thought I was close to the end.

I had completed a total of sixteen rounds of training by that time. I think I could have improved the recall somewhat had I done a few more rounds of training, and spent more time looking at the mid-ranked documents (40%-60% probable relevant). The precision would have improved somewhat too, but I did not have the time. I am also sure I could have improved the identification of privileged documents, as I had only trained for that in the last three rounds. (It would have been a partial waste of time to do that training from the beginning.)

The sampling I did after the decision to stop suggested that I had exceeded my recall goals, but still, the project was much more rushed than I would have liked. I was also comforted by the fact that the elusion sample test at the end passed my accept-on-zero-error quality assurance test. I did not find any hot documents. For those reasons (plus great weariness with the whole project), I decided not to pull some all-nighters to run a few more rounds of training. Instead, I went ahead and completed my report, added graphics and more analysis, and made my production with a few hours to spare.

A scientist hired after the production did some post-hoc testing that confirmed, at an approximate 95% confidence level, a recall achievement of between 83% and 94%. My work also withstood all subsequent challenges. I am not at liberty to disclose further details.
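For readers curious how a recall range like that can be derived, below is a rough sketch of an elusion-sample recall estimate in the spirit of the ei-Recall method referenced in the Plan above. It projects the false negatives found in a random sample of the null set onto the whole null set using a Clopper-Pearson binomial interval, then computes recall at both ends of that range. The numbers in the example are invented for illustration and are not the figures from this project.

```python
from scipy.stats import beta

def elusion_recall_interval(true_positives, null_set_size,
                            sample_size, sample_false_negatives,
                            confidence=0.95):
    """Estimate a recall range from an elusion sample of the null set
    (the documents withheld as probably irrelevant).

    Sketch in the spirit of ei-Recall: the false-negative rate observed in
    the sample is projected onto the whole null set with a Clopper-Pearson
    binomial interval, and recall = TP / (TP + FN) is computed at both ends.
    """
    alpha = 1.0 - confidence
    k, n = sample_false_negatives, sample_size
    rate_low = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    rate_high = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    fn_low, fn_high = rate_low * null_set_size, rate_high * null_set_size
    recall_low = true_positives / (true_positives + fn_high)
    recall_high = true_positives / (true_positives + fn_low)
    return recall_low, recall_high

# Invented example: 10,000 relevant documents found and produced, a null set
# of 90,000, and 5 relevant documents turned up in an elusion sample of 1,500.
low, high = elusion_recall_interval(10_000, 90_000, 1_500, 5)
print(f"Estimated recall between {low:.0%} and {high:.0%}")
```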

In post hoc analysis I found that the probability distribution was close to the ideal shape that I now know to look for. The below diagram represents an approximate depiction of the ranking distribution of the 1.7 Million documents at the end of the project. The 400,000 documents produced (obviously I am rounding off all these numbers) were 50% plus, and the 1,300,000 not produced were less than 50%. Of the 1,300,000 Negatives, 480,000 documents were ranked with only 1% or less probable relevance. On the other end, the high side, 245,000 documents had a probable relevance ranking of 99% or more. There were another 155,000 documents with a ranking between 99% and 50% probable relevant. Finally, there were 820,000 documents ranked between 49% and 1% probable relevant.

Probability distribution of the document rankings

The file review speed realized here, about 35,000 files per hour, and the extremely low cost of about $0.07 per document, would not have been possible without the client’s agreement to forgo full document review of the 400,000 documents produced. A group of contract lawyers could have been brought in for a second pass review, but that would have greatly increased the cost, even assuming a billing rate for them of only $50 per hour, which was 1/10th my rate at the time (it is now much higher).

The client here was comfortable with reliance on confidentiality agreements for reasons that I cannot disclose. In most cases litigants are not, and insist on eyes on review of every document produced. I well understand this, and in today’s harsh world of hard ball litigation it is usually prudent to do so, clawback or no.

Another reason the review was so cheap and fast in this project is that there were very few opposing counsel transactional costs involved, and everyone was hands off. I just did my thing, on my own, and with no interference. I did not have to talk to anybody; I just read a few guidance memorandums. My task was to find the relevant documents, make the production, and prepare a detailed report – 41 pages, including diagrams – that described my review. Someone else prepared a privilege log for the 2,500 documents withheld on the basis of privilege.

I am proud of what I was able to accomplish with the two-filter multimodal methods, especially as it was subject to the mentioned post-review analysis and recall validation. But, as mentioned, I would not want to do it again. Working alone like that was very challenging and demanding. Further, it was only possible at all because I happened to be a subject matter expert of the type of legal dispute involved. There are only a few fields where I am competent to act alone as an SME. Moreover, virtually no legal SMEs are also experienced ESI searchers and software power users. In fact, most legal SMEs are technophobes. I have even had to print out key documents to paper to work with some of them.

Even if I have adequate SME abilities on a legal dispute, I now prefer a small team approach, rather than a solo approach. I now prefer to have one or two attorneys assisting me on the document reading, and a couple more assisting me as SMEs. In fact, I can act as the conductor of a predictive coding project where I have very little or no subject matter expertise at all. That is not uncommon. I just work as the software and methodology expert; the Experienced Searcher.

Right now I am working on a project where I do not even speak the language used in most of the documents. I could not read most of them, even if I tried. I just work on procedure and numbers alone, where others get their hands in the digital mud and report to me and the SMEs. I am confident this will work fine. I have good bilingual SMEs and contract reviewers doing most of the hands-on work.

Conclusion

There is much more to efficient, effective review than just using software with predictive coding features. The methodology of how you do the review is critical. The two-filter method described here has been used for years to cull away irrelevant documents before manual review, but it has typically just been used with keywords. I have tried to show here how this method can be employed in a multimodal way that includes predictive coding in the Second Filter.

Keywords can be an effective method to both cull out presumptively irrelevant files, and cull in presumptively relevant, but keywords are only one method, among many. In most projects it is not even the most effective method. AI-enhanced review with predictive coding is usually a much more powerful method to cull out the irrelevant and cull in the relevant and highly relevant.

If you are using a one-filter method, where you just do a rough cut and filter out by keywords, date, and custodians, and then manually review the rest, you are reviewing too much. It is especially ineffective when you collect based on keywords. As shown in Biomet, that can doom you to low recall, no matter how good your later predictive coding may be.

If you are using a two-filter method, but are not using predictive coding in the Second Filter, you are still reviewing too much. The two-filter method is far more effective when you use relevance probability ranking to cull out documents from final manual review.

Try the two filter method described here in your next review. Drop me a line to let me know how it works out.


