Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One

July 6, 2014

There is a well-known joke found in most cultures of the world about a fool looking for something. This anecdote has been told for thousands of years because it illustrates a basic trait of human psychology, now commonly named after the joke itself: the Streetlight Effect. This is a type of observational bias where people search for something only where it is easiest to look. This human frailty, when pointed out in the right way, can be funny. One of the oldest known forms of pedagogic humor illustrating the Streetlight Effect comes from the famous stories of Nasrudin, aka Nasreddin, an archetypal wise fool from 13th-century Sufi traditions. Here is one version of this joke attributed to Nasreddin:

One late evening Nasreddin found himself walking home. It was only a very short way and upon arrival he can be seen to be upset about something. Alas, just then a young man comes along and sees the Mullah’s distress.

“Mullah, pray tell me: what is wrong?”

“Ah, my friend, I seem to have lost my keys. Would you help me search them? I know I had them when I left the tea house.”

So, he helps Nasreddin with the search for the keys. For quite a while the man is searching here and there but no keys are to be found. He looks over to Nasreddin and finds him searching only a small area around a street lamp.

“Mullah, why are you only searching there?”

“Why would I search where there is no light?”

Using Only Random Selection to Find Predictive Coding Training Documents Is Easy, But Foolish

The easiest way to select training documents for predictive coding is simply to use random samples. It may be easy, but, as far as I am concerned, it also defies common sense. In fact, like the Nasrudin story, it is so stupid as to be funny. You know you dropped your keys near your front door, but you do not look there because it is dark and hard to search there. You take the easy way out. You search by the street lamp.

The morals here are many. The easy way is not necessarily the right way. This is true in search, as it is in many other things. The search for truth is often hard and difficult. You need to follow your own knowledge, what you know, and what you do not. What do you know about where you lost your keys? Think about that and use your analysis to guide your search. You must avoid the easy way, the lazy way. You must not be tempted to only look under the lamp post. To do so is to ignore your own knowledge. It is foolish to the extreme. It is laughable, as this 1942 Mutt and Jeff comic strip shows:

mutt-jeff_key_search

Random search for predictive coding training documents is laughable too. It may be easy to simply pick training documents at random, but it is ineffective. It ignores an attorney’s knowledge of the case and the documents. It is equivalent to just rolling dice to decide where to look for something, instead of using your own judgment, your own skills and insights. It purports to replace the legal expertise of an attorney with a roll of dice. It would have you ignore an attorney’s knowledge of relevance and evidence, their skills, expertise, and long experience with search.

If you know you left your keys near the front door, why let random chance tell you where to search? You should instead let your knowledge guide your search. It defies common sense to ignore what you know. Yet, this is exactly what some methods of predictive coding tell you to do. These random-only methods are tied to particular software vendors: the ones whose software is designed to run only on random training.

These vendors tell you to rely entirely on random selection of documents to use in training. They do so because that requires no thought, as if lawyers were not capable of thought, as if lawyers have not long been the masters of discovery of legal evidence. It is insulting to the intelligence of any lawyer, and yet several software vendors actually prescribe this as the only way to do predictive coding search. This has already been criticized as predictive coding junk science by search expert and attorney Bill Speros, who used the same classic streetlight analogy. Predictive Coding’s Erroneous Zones Are Emerging Junk Science (Pulling a random sample of documents to train the initial seed set … is erroneous because it looks for relevance in all the wrong places. It turns a blind eye to what is staring you in the eye.) Still, the practice continues.

The continuing success of a few vendors still using this approach is, I suspect, one reason that the new study by Gordon Cormack and Maura R. Grossman is designed to answer the question:

Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning? 

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014 (quote from the Abstract).

Although the answer seems commonsensical, in a deep archetypal way, and obvious, sometimes common sense and history can be wrong. The only way to know for sure is by scientific experiment. That is exactly what Cormack and Grossman have done.

Since several influential vendors answer this question in favor of random selection, and tell their customers that they should only look under the lamp post and use one-light-only random search software, Grossman and Cormack had to give this seemingly funny assertion serious attention. They put the joke to the test. To no one’s surprise, except a few vendors, the experiments they performed showed that it was more effective to select training documents using non-random methods and active learning (a process that I call multimodal search). I will discuss their ingenious experiments and report in some detail in Part Two of this blog.

Some Vendors Add Insult to Injury to Try to Justify their Random-Only Approach

To add insult to injury, some vendors try to justify their method by arguing that random selection avoids the prejudice of lawyer bias. It keeps the whole search process open. They seem to think lawyers know nothing, that they dropped their keys and have absolutely no idea where. If the lawyers think they know, they are just biased and should be ignored. They are not to be trusted.

This is not only insulting, but ignores the obvious reality that lawyers are always making the final call on relevance, not computers, not software engineers. Lawyers say what is relevant and what is not, even with random selection.

Some engineers who design random-only training software for predictive coding justify the limitation on the basis of assumed lawyer dishonesty. They think that if lawyers are allowed to pick samples for training, rather than having them selected at random, lawyers may rig the system and hide the truth by intentionally poor selections. This is the way a lot of computer experts think when it comes to law and lawyers. I know this from over thirty years of experience.

If a lawyer is really so dishonest that they will deliberately mis-train a predictive coding system to try to hide the truth, then that lawyer can easily find other, more effective ways to hide the ball than that. Hiding evidence is unethical. It is dishonest. It is not what we are paid to do. Argue what the facts mean? Yes, most definitely. Change the facts? No. Despite what you may think is true about law and lawyers, this is not the kind of thing that 98% of lawyers do. It will not be tolerated by courts. Such lawyer misconduct could not only lead to loss of a case, but also loss of a license to practice law. Can you say that about engineering?

My message to software vendors is simple: leave it to us, to attorneys and the Bar, to police legal search. Do not attempt to do so by software design. That is way beyond your purview. It is also foolish because the people you are insulting with this kind of mistrust are your customers!

I have talked to some of the engineers who believe in random reliance as a way to protect their code from lawyer manipulation. I know perfectly well that this is what some (not all) of them are trying to do. Frankly, the arrogant engineers who think like that do not know what they are talking about. It is just typical engineer lawyer bias, plain and simple. Get over it and stop trying to sell us tools designed for dishonest children. We need full functionality. The latest Grossman Cormack study proves this.

Protect Us from Bias by Better Code, Not Random Selection

Some software designers with whom I have debated this topic will, at this point, try to placate me with statements about unintentional bias. They will point out that even though a lawyer may be acting in good faith, they may still have an unconscious, subjective bias. They will argue that without even knowing it, without realizing it, a lawyer may pick documents that only favor their clients. Oh please. The broad application of this so-called insight into subjectivity to justify randomness is insulting to the intelligence of all lawyers. We understand better than most professions the inherent limitations of reason. Scientific Proof of Law’s Overreliance On Reason: The “Reasonable Man” is Dead, Long Live the Whole Man, Part Two. Also see The Psychology of Law and Discovery. We are really not so dimwitted as to be unable to do legal search without our finger on the scale, and, this is important, neither is the best predictive coding software.

Precautions can be taken against inherent, subjective bias. The solution is not to throw the baby out with the bath water, which is exactly what random-only search amounts to. The solution to bias is better search algorithms, plus quality controls. Code can be made to work so that it is not so sensitive to, and dependent on, lawyer-selected documents. It can tolerate and correct errors. It can reach out and broaden initial search parameters. It is not constrained by the lawyer-selected documents.

Dear software designers: do not try to fix lawyers. We do not need the help of engineers for that. We will fix ourselves, thank you! Fix your code instead. Get real with your methods. Overcome your anti-lawyer bias and read the science.

Compete With Better Code, Not False Doctrine

Many software companies have already fixed their code. They have succeeded in addressing the inherent limitations in all active machine learning, driven as it must be by inconsistent humans. In their software the lawyer trainers are not the only ones selecting documents for training. The computer selects documents too. Smart computer selection is far different, and far better, than stupid random selection.
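To make the difference between the two kinds of selection concrete, here is a minimal sketch. It is not any vendor's actual algorithm, and it is not the protocol tested by Grossman and Cormack. It simply assumes a classifier has already assigned each document an estimated probability of relevance between 0 and 1, and shows three ways the next training batch could be chosen. The function names, document names, scores, and batch size are all hypothetical.

```python
import random

def select_random(doc_ids, k, seed=0):
    """Random-only selection: ignores everything known about the documents."""
    rng = random.Random(seed)
    return rng.sample(sorted(doc_ids), k)

def select_top_ranked(scores, k):
    """Relevance-ranked selection: the k documents scored most likely relevant."""
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:k]

def select_most_uncertain(scores, k):
    """Uncertainty selection: the k documents whose scores are closest to 0.5."""
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k]

# Hypothetical relevance scores, invented purely for illustration.
scores = {
    "doc_a": 0.97,
    "doc_b": 0.52,
    "doc_c": 0.08,
    "doc_d": 0.45,
    "doc_e": 0.81,
}

top_batch = select_top_ranked(scores, 2)            # what the machine thinks is relevant
uncertain_batch = select_most_uncertain(scores, 2)  # what the machine is least sure about
random_batch = select_random(scores.keys(), 2)      # a roll of the dice
```

The first two functions use what the system has learned so far. The third does not, which is the whole objection to random-only training. A lawyer's own judgmental picks, such as keyword or concept search hits, would be a further source of training documents alongside these.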

I know that the software I use, Kroll Ontrack’s EDR (eDiscovery Review), is frequently correcting my errors, broadening my initial conception of relevance. It is helping me to find new documents that are relevant, documents that I would never have thought of or found on my own. The computer selects as many documents as I decide are appropriate to enhance the training. Random has only a small place at the beginning to calculate prevalence. Concept searches, similarity searches, keyword, even linear, are far, far better than random alone. When they are all put together in a multimodal predictive coding package, the results can be extremely good.
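The one job that random sampling does well, estimating prevalence, is simple arithmetic. The sketch below is an illustration only, not the method used by any particular review platform. It assumes a simple random sample and uses the textbook normal-approximation interval at 95% confidence. The sample size, the count of relevant documents, and the collection size are hypothetical numbers.

```python
import math

def estimate_prevalence(sample_size, relevant_found, z=1.96):
    """Return (point estimate, lower bound, upper bound) for prevalence.

    Uses the normal approximation to the binomial; z=1.96 corresponds
    to roughly 95% confidence.
    """
    if sample_size <= 0 or not 0 <= relevant_found <= sample_size:
        raise ValueError("need sample_size > 0 and 0 <= relevant_found <= sample_size")
    p = relevant_found / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical review: 1,500 randomly sampled documents, 30 coded relevant.
p, low, high = estimate_prevalence(1500, 30)

# Projected number of relevant documents in a hypothetical 1,000,000 document collection.
collection_size = 1_000_000
projected_relevant = round(p * collection_size)
```

With these made-up numbers the estimate is about 2 percent, give or take roughly 0.7 percentage points. That tells you how much there is to find. It tells you nothing about where to look, which is why random belongs at the start for measurement and not as the sole method of training.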

The notion that you should just turn search over to chance means you should search everywhere and anywhere. That is the essence of random. It means you have no idea of where the relevant documents might be located, and what they might say. That is again completely contrary to what happens in legal discovery. No lawyer is that dimwitted. There is always at least some knowledge as to the type or kind of documents that might be relevant. There is always some knowledge as to who is most likely to have them, and when, and what they might say, what names would be used, what metadata, etc.

A Joke at the Expense of Our System of Justice is Not Funny

I would be laughing at all of this random-only search propaganda like a Nasreddin joke, but for the fact that many lawyers do not get the joke. They are buying software and methods that rely exclusively on random search for training documents. Many are falling for the streetlight effect gimmicks and marketing. It is not funny because we are talking about truth and justice here, not just a fool’s house keys. I care about these pursuits and best practices for predictive coding. The future of legal search is harmed by this naive foolishness. That is why I have reacted before to vendor propaganda promoting random search. That is why I spent over fifty hours doing a predictive coding experiment based in part on random search, an approach I call the Random Borg approach. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (Part One). I have also written several articles on this subject to try to debunk this method, and yet this method lives on. See, e.g., The Many Types of Legal Search Software in the CAR Market Today; Three-Cylinder Multimodal Approach To Predictive Coding.

So too have others. See, e.g., Speros, W., Predictive Coding’s Erroneous Zones Are Emerging Junk Science (e-Discovery Team Blog (Guest Entry), 28th April 2013). As Bill Speros puts it:

Some attorneys employ random samples to populate seed sets apparently because they:

    • Don’t know how to form the seed set in a better way, or
    • Want to delegate responsibility to the computer “which said ‘so’,” or
    • Are emboldened by a statistical rationale premised on the claim that no one knows anything so random is as good a place to start as anywhere.

In spite of the many criticisms, on my blog at least, the random seed set approach continues, and even seems to be increasing in popularity.

Fortunately, Gordon Cormack and Maura R. Grossman have now entered this arena. They have done scientific research on the random only training method. Not surprisingly, they concluded, as Speros and I did, that random selection of training documents is not nearly as effective as multimodal, judgmental selection. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia, ACM 978-1-4503-2257-7/14/07.

To be continued . . . . where I will review the new Grossman Cormack Study and conclude with my recommendations to vendors who still use random-only training. I will offer a kind of olive branch to the Borg where I respectfully invite them to join the federation of search, a search universe where all capacities are used, not just random. They have a good start with their existing predictive coding software. All they need do is break with the false doctrine and add new search capacities.


Return of the Robots!

June 29, 2014

Tired of all of the words thrown at you by the e-Discovery Team blog? Just want to relax and enjoy the summer, but still keep up? Maybe learn something interesting and potentially useful? We understand. We have just the thing for you: a nostalgic look back at our robot movies. They are not extinct yet, and although some sequels stink, these are pretty good. Our robots cover transforming topics that are still cutting edge. They explain the use of storytelling and gamification in predictive coding. They also cover the ethics of viruses and bad robots, and then end with our robots getting ready to testify before Judge Waxse on random sampling in predictive coding. I dare say few people can follow their talk on sampling in just one viewing.

Love words like we do? Not satisfied with robot reruns? We understand that too. Our summer reading is mainly full of cool cybersecurity books found at eDiscovery Security, especially the Cyberthriller novels. Check them out. I’m reading Trojan Horse right now. It has to do with a virus that allows documents to be altered en route after they are sent by email. Talk about an evidence authentication nightmare!

Remember, for full enjoyment of these videos press the HD button on the upper right corner, and then expand in the lower right for full size screen. Maybe someday we will do 3D and IMAX too!

eDiscovery Robots Explain How STORYTELLING Will Be Used in Predictive Coding in the Not Too Distant Future

______

eDiscovery Robots Explain How GAMIFICATION Will Be Used in Predictive Coding in the Not Too Distant Future

______

 

eDiscovery Robots Explain ETHICS and Predictive Coding in the Not Too Distant Future


_____

eDiscovery Robots Explain How RANDOM SAMPLING is Used in Predictive Coding

 

____________

_______

___

Goodbye Lexie! We luv ya!
It was a great run while it lasted.
Who knows? Maybe you’ll return someday too?

 


Hacking Flash Trading on Wall Street: From Fiction to Fact in Just Three Weeks

June 22, 2014

I read Mark Russinovich’s new novel, Rogue Code, a few weeks ago when it was first released. The book is about flash trading and criminal hackers attacking Wall Street hedge funds. Then, just this week, I read a news flash on CNBC of a real-life hack attack on a Wall Street hedge fund. Cybersecurity firm says large hedge fund attacked (CNBC 6/19/14). Again, it involved the controversial practice of flash trading. The facts of the news report were eerily close to Russinovich’s fiction. The news report seemed to come right off the pages of Rogue Code. Unless this is an elaborate hoax to promote the book, Mark Russinovich has taken predictive coding to a new level.

Remarkable Parallels

In both the book and the news report, a sophisticated, highly organized team of skilled hackers penetrated what was thought to be a totally secure stock trading computer system. They then planted a very complex piece of software code, malware, that hid in the system. It operated undetected for months, taking a million here, a million there. The hidden program was remotely controlled to surreptitiously interfere with flash trading in order to direct profits to the hackers from intercepted trades. Millions of dollars were stolen over several months’ time.

In the novel and real world some suspicious circumstances caused the brokers to hire an outside cybersecurity firm to investigate their computer systems. The cybersecurity white hats finally discovered the malware. In the book the hero catches the bad guys. In real life no one seems to even have a clue as to who they are. They are at large, enjoying the rich life of the billionaires they stole from.

In the novel the penetration went beyond just one hedge fund into the very trading platform of the New York Stock Exchange. The whole world financial system was threatened. No one is saying if that has also happened in real life.

The cybersecurity company that broke the story, BAE Systems Applied Intelligence, made a point of saying that this kind of hack into stock trading systems, especially high-speed flash systems, has never been seen before. It may not have been seen, but Mark Russinovich certainly imagined it. The BAE spokesman, Paul Henninger, says that this hack represents a new level of attack involving both very advanced computer technical skills and advanced trading skills. Henninger says there are only a few experts in the world with the necessary skills to pull it off. Yet, this was all described in detail in Mark Russinovich’s novel. Kind of makes you wonder where Mark gets his material?

Cyber Thrillers

Russinovich is one of the best writers in the new fiction genre that I like, cyber thrillers. For a complete list of the most popular of these books that have a cybersecurity focus see my Must Read Books on Cybersecurity page, which is a part of eDiscoverySecurity.com. Rogue Code is Russinovich’s third in a series that started with Zero Day in 2010 and continued with Trojan Horse in 2012. All three books in this series star Jeff Aiken, a cybersecurity expert who saves the world as a White Hat hacker. Jeff Aiken battles Black Hat bad guys and bureaucratic bumblers at the same time. Jeff Aiken is kind of a nerdy version of James Bond and serves as his own Q. He’s got some cool hacking tools that would even make the JΞSTΞR jealous.

I can really relate to Jeff Aiken’s constant frustration with small-minded government types that get in his way. They usually suspect him of being the bad guy. The real bad guys, the black hatters, usually come across as more sympathetic characters, which is one of the charms of the Jeff Aiken series. But the real attraction of his novels for me is how much you learn about cybersecurity while reading them.

Mark Russinovich and the Texas Instruments 99/4A

I figured Russinovich’s books were good, and accurate, and provided real insights, just based on the background of the author himself. Mark Russinovich is the real deal. He is now a Technical Fellow in the Cloud and Enterprise Division at Microsoft. I personally like him because at age 15, he bought himself his first computer, a TI99/4A. That was also my first personal computer and the first one I wrote programs for.

My kids still fondly remember my Make a Face program for the 99/4A. My daughter claims that was the world’s first avatar creation program, although at the time, to be honest, I thought of it as a high-tech Mr. Potato Head. You could make thousands of different-looking faces, and no matter what face you made, Mr. Computerhead was always happy with your design and said, with lips moving, “I sure look good now!” It was one of those games where you could not lose. I offered it for sale in the TI99/4A user group newsletter. I wonder if Mark was ever tempted to buy it? I say tempted, because I know for sure he did not buy it. Sadly, I never sold any, despite my one $25 ad, and so I concentrated instead on the life of a techno-trial lawyer and computer hobbyist.

Anyway, Mark Russinovich went on to become a real computer expert while I plugged along as a lawyer. Mark earned a B.S. in computer engineering from Carnegie Mellon University, a leading university for elite white hats. Then he received an M.S. in computer engineering from Rensselaer Polytechnic Institute. Then after some work in the real world, he returned to Carnegie Mellon for a Ph.D. in computer engineering in 1994. Yeah, Mark knows his stuff. Insofar as Microsoft products are concerned, he is one of the top experts in the world. He has personally discovered, and we assume quickly disclosed and fixed, many software errors and vulnerabilities that hackers could otherwise have exploited for fun and profit. Indeed, Mark now has a suspiciously large body of knowledge on how to hack into business systems of all kinds, especially those based on Microsoft operating systems.

Is Truth Stranger Than Fiction?

I had no idea how good his knowledge really was, and how close he was to the pulse of the elite hacking world, until reading the news story this week. It seemed to come right off the pages of his new book. I fully expect Jeff Aiken to be on the case right now tracking down the rogue coders who penetrated the hedge fund. I wonder if they are in Brazil watching the World Cup? In fact, come to think of it, the events Mark was writing about in Rogue Code were, we now know, taking place on the real Wall Street at the very same time he was writing about them. Hmm. What a coincidence. I wonder if well-known SEC investigator and attorney, Robert Ashton, will look into that? Too bad Patrick Oot has moved on. I’m sure he could e-discover the truth, that is, unless the Brazilian Mafia, the NL, got to him first.

For more about Rogue Code, check out this video trailer. I think this book would make a great movie.

Of course, the facts in Rogue Code and the BAE Systems report are somewhat different. You would not want to be too obvious, would you? Still, to a careful reader of both stories, both fact and fiction, the similarities dominate. Both involved teams of experts working together to interfere with hedge fund flash traders to directly profit from the trades. Both involved long-term penetrations that lasted for months and resulted in the diversions (a polite word for theft) of millions of dollars. That’s right. This is big time cyber fraud, involving Big Data and Big Money and victims who usually will not want to complain. It makes for the perfect crime, especially if you like stealing from billionaires in a way that will likely go undetected.

Will the True Story of Wall Street Hacking Ever Be Known?

The full story of the real attack on the Wall Street flash trading hedge fund is still unknown. Indeed, the odds of our ever knowing the full truth of the real attack are slim to none. The as yet unnamed hedge fund has every incentive to keep it secret and keep their name out of the press. Think how their customers would react if they knew their money had been stolen in a hack attack. How would their customers, billionaires all, react if they found out that their brokers had been outsmarted by hackers? No. That would not work out too well. So, as we learn in Rogue Code the novel, these things are usually hushed up and the bad guys get away with millions.

Going back to real life, here is how CNBC reported the comments of BAE’s Paul Henninger:

“It’s pretty amazing,” Henninger said in an interview Wednesday from London. “The level of business sophistication involved as opposed to technical sophistication involved was something we had not seen before.” . . .

Henninger said such business-savvy financial attacks can represent “the perfect crime,” because they are extremely difficult to trace to obscure locations around the globe, and because companies can be reluctant to go to law enforcement. “It often takes a while for firms to get comfortable with the idea of exposing what is in effect their dirty laundry to a law enforcement investigation,” Henninger said. “You can imagine the impact potentially on investor confidence.”

He said he does not know if the hedge fund reported the details of the attack—which he estimated cost the firm millions of dollars over just a few months’ time—to the SEC or the FBI.

Officials from the SEC and FBI declined to comment on this specific case.  . . .

Henninger said the malware represented a multimillion dollar problem for the hedge fund. “This was not something that was a minor issue for them,” he said. “This was something that was getting reviewed at the board level of this hedge fund precisely because it was having a material impact on performance across the portfolio.”

Public disclosure of illicit trading based on hacked information is exceedingly rare.

Eamon Javers, Cybersecurity firm says large hedge fund attacked (CNBC 6/19/14).

Conclusion

The introduction to Rogue Code was written by Haim Bodek, Managing Partner of Decimus Capital Markets, LLC. He is an expert on flash trading who is now sounding the alarm on the abuses that flash trading is causing on Wall Street. Even without cyber intrusions and theft by hackers, Bodek thinks the stock exchanges could fall because of the dishonesty and inherent unfairness of flash trading. I do not know about that, but I do know this micro-second trading gives an unfair advantage to some. We need a level playing field and a stock market that provides equal opportunities to all, including small investors. I hope that the alarm sounded by Haim Bodek about flash trading is overstated, but fear it is not. Rogue Code, and now the report by BAE, suggest that his concerns are well founded.

I am not delusional enough to think that the alarm sounded by Mark Russinovich on hacking Wall Street is a false alarm. That is a separate issue. I have no doubt in my mind that this is a clear and present danger. Although Rogue Code is a work of fiction, the hacking of Wall Street is not. The SEC must start taking cybersecurity more seriously. Indeed, all of us need to do that. Hackers are now getting organized and profit driven. This is not just an Anonymous group of kids anymore; these are criminal gangs. Hack attacks should be reported to the FBI. The days of secretive cover-ups must come to an end.

