Genius Bar at Georgetown

November 23, 2014

I interrupt my current series of blogs on predictive coding visualization to report on a recent experience with a Genius Bar event. I am not talking about the computer hipster type geniuses that work at the Apple Genius Bar, although there were a few of them at the CLE too. The Apple Genius Bar types can be smart, but, as we all know, they are not really geniuses, even if that is their title. True genius is rare, especially in the Legal Bar. Wikipedia says that a genius is a person who displays exceptional intellectual ability, creativity, or originality, typically to a degree that is associated with the achievement of new advances in a domain of knowledge.


All of us who attended the Georgetown Advanced e-Discovery Institute this week saw a true genius in action. He did not wear the tee-shirt uniform of the Apple genius employees. He wore a bow tie. His name is John M. Facciola. His speech at Georgetown was his last public event before he retires next week as a U.S. Magistrate Judge.


Judge Facciola’s one hour talk displayed exceptional intellectual ability, creativity and originality, just as the definition of genius requires. What else can you call a talk that features a judge channeling Socrates? An oration that uses Plato’s Apology to criticize and enlighten Twenty-First Century lawyers? …. sophists all. The intensity of John’s talk, to me at least, and I’m sure to most of the six hundred or so other lawyers in the room, also indicated a new advance in the making in the domain of knowledge of Law. Still, true genius requires that an advance in knowledge actually be achieved, not just talked about. It requires that the world itself be moved. It requires, as another genius of our day, Steve Jobs, liked to say, that a dent be made in the Universe.

Geniuses not only have intellectual ability, creativity and originality, they have it to such a degree that they are able to change the world. In the legal world, indeed any world, that is rare. Richard Braman was one such man. His Sedona Conference did make a dent in the legal universe. So did the Principles, and so did his crowning achievement, the Cooperation Proclamation. John Facciola is another such man, or may be, who is trying to take Cooperation to the next level, to expand it to platonic heights. To be honest, the jury is still out on whether his ingenious ideas and proposals will in fact be adopted by the Bar, will in fact lead to the achievement of new advances in a domain of knowledge. That is the true test of a real genius.

Thus, whether future generations will see John Facciola as a genius depends in no small part on all of us, as well as on what John Facciola does next. For unlike the genius of Jobs and Braman, Facciola may be retired as a judge, but he is still very much alive. His legacy is still in the making. For that we should be very grateful. I for one cannot wait to see what he does next and will continue to support his genius in the making.

All of the other judges at Georgetown made it clear where they stand on the ideas of virtue and justice that Facciola promotes. In the final judges panel each wore a funny bow tie in his honor, and all were introduced by panel leader Maura Grossman with Facciola as their last name. It was a very touching and funny moment, all at the same time. I am really glad I was there.

Facciola’s last speech as a judge reflected his own life, his own genius. It was a very personal talk, a deep talk, where, to use his words, he shared his own strong religious and spiritual convictions. In this context he shared his critique of the law as we currently know it, and of legal ethics. It was damning and based on long experience. It was real. Some might say harsh. But he balanced this with his inspirational vision of what the law could and should be in the future. A law where morality, not profit, is the rule. Where the Golden Rule trumps all others. A profession where lawyers are not sophists who will say or do anything for their clients. He laments that in federal court today most of the litigants are big corporations, as only they can afford federal court.

Judge Facciola calls for a profession where lawyers are citizens who care, who try to do the right thing, the moral thing, not just the expedient or profitable thing for their clients. He calls for lawyers to cooperate. He calls for a complete rewrite of our codes of ethics to make them more humanistic, and at the same time, more spiritual, more Platonic, in the ancient philosophic sense of Truth and Goodness. This is the genius we saw shine at Georgetown.

It reminds me of some quotes from Plato’s Apology, a few excerpts of which Facciola also read during his last talk as a judge. Take a moment and remember with me the most famous closing argument of all times:

Men of Athens, I honor and love you; but I shall obey God rather than you, and while I have life and strength I shall never cease from the practice and teaching of philosophy, exhorting anyone whom I meet after my manner, and convincing him, saying: O my friend, why do you who are a citizen of the great and mighty and wise city of Athens, care so much about laying up the greatest amount of money and honor and reputation, and so little about wisdom and truth and the greatest improvement of the soul, which you never regard or heed at all? Are you not ashamed of this? And if the person with whom I am arguing says: Yes, but I do care; I do not depart or let him go at once; I interrogate and examine and cross-examine him, and if I think that he has no virtue, but only says that he has, I reproach him with undervaluing the greater, and overvaluing the less. And this I should say to everyone whom I meet, young and old, citizen and alien, but especially to the citizens, inasmuch as they are my brethren. For this is the command of God, as I would have you know; and I believe that to this day no greater good has ever happened in the state than my service to the God. For I do nothing but go about persuading you all, old and young alike, not to take thought for your persons and your properties, but first and chiefly to care about the greatest improvement of the soul. I tell you that virtue is not given by money, but that from virtue come money and every other good of man, public as well as private. This is my teaching, and if this is the doctrine which corrupts the youth, my influence is ruinous indeed. But if anyone says that this is not my teaching, he is speaking an untruth. Wherefore, O men of Athens, I say to you, do as Anytus bids or not as Anytus bids, and either acquit me or not; but whatever you do, know that I shall never alter my ways, not even if I have to die many times.


For the truth is that I have no regular disciples: but if anyone likes to come and hear me while I am pursuing my mission, whether he be young or old, he may freely come. Nor do I converse with those who pay only, and not with those who do not pay; but anyone, whether he be rich or poor, may ask and answer me and listen to my words; and whether he turns out to be a bad man or a good one, that cannot be justly laid to my charge, as I never taught him anything. And if anyone says that he has ever learned or heard anything from me in private which all the world has not heard, I should like you to know that he is speaking an untruth.

If Facciola’s positive, Socratic inspired, moral vision for the Law is realized, and I for one think it is possible, then it would be a great new advance in the field of Law. The legal universe would be dented again. It would cement Facciola’s own place as a great Twenty-First Century genius, right up there with Jobs and Braman.

I am sure that Judge Facciola will continue his educational efforts in the field of law after the judge title becomes honorific. I hope he will give more specific form to his reform proposals. I cannot hope that his educational efforts will increase, because they are already incredibly prodigious, but I can hope they will now focus on his legacy, on his particular genius for legal ethics.

Many of our judges and attorneys work hard on e-discovery education. Many have great intellectual ability. But not many are capable of displaying the kind of genius we saw in Facciola’s swan-song as a judge at Georgetown. It is his alma mater, and the students at the Institute, or the audience, as we have taken to calling them these days, included many of John’s friends and admirers. It brought out the best in Fatch.

There were over 600 students, or fans, or audience members, whatever you want to call them, who attended the Georgetown event held at the Ritz Carlton in Tysons Corner. That is a lot of people, nearly all of them lawyers. To be honest, that was several hundred lawyers too many for any CLE event. Big may be better in data, but not in education.

I liked the Institute better in its early days when there were just a few dozen attendees. I was there near the beginning as a teacher, and considered my sessions to be classes. The people who paid to attend were considered students. That is the language we used then. Now that has all changed. Now I attend as a presenter, and the people who pay to attend are called an audience. It seems like a transition that Socrates would condemn.

The big crowd and entertainment aspects of this year’s Georgetown Institute reminded me of a big event in Canada last month where I was honored to make the keynote on the first day. I talked about Technology and the Future of the Law, and, as usual, had my razzle dazzle Keynote slides. (I don’t use PowerPoint.) On the second day they had a second keynote. I was surprised to learn he was a professional motivational speaker. Not even a lawyer. My honor faded quickly. The keynote was all salesman rah rah, with no mention of the law at all. That’s not right in my book. It also made me wonder why I was really asked to give the first day’s keynote. Oh well, it was otherwise a great event. But I am now starting to tone down my slides. If I could tone down my enthusiasm, I would too, but I’ve tried, and that’s not possible.

The task of putting on a show for a large, 600-plus audience was too great a challenge for almost all of the presenters at Georgetown. Do not get me wrong, all of the attorneys tagged to present knew their stuff, but being an expert and being an educator are very different things. Being an expert and an entertainer are almost night and day. Very, very few experts have Facciola’s skill at doing that, and he, by the way, used no slides at all. (I cannot, however, help but think how it might have been improved by the projection of a large holographic image of Socrates.)

Most of the sessions I attended at Georgetown were like any other CLE, fairly boring. We presenters (at least we were not called performers) were all told to engage our audience, to get them talking, but that almost never happened. The shows were no doubt educational, at least to those who had not seen them before. But entertaining? Even slightly amusing? No, not really. Oh, a few of the panels had their moments, and some were very interesting at times, even to me. A couple even made me laugh a few times. But only one was pure genius. The solo performance of Judge John Facciola.

I found especially compelling his role-playing as Socrates, along with his quotes of Plato, where he read from the Greek original of his high school book from long ago. Judge Facciola presented with a light and witty hand both his dark condemnations of our profession’s failings, and his hope for a different, more virtuous future. His sense of humor about the human predicament made it all work. Humor is a quality possessed by most geniuses, and near geniuses. John radiates with it, and makes you smile, even if you cannot hear or understand all of his words. And even if many of his words anger you. I have no doubt some who heard this talk did not like his bluntness, nor his call for spirituality and for a complete rewrite, with non-lawyer participation, of our professional code of ethics. Well, they did not like Socrates either. It comes with the turf of know-nothing truth-tellers. That is what happens when you speak truth to power.

I thought of trying to share the contents of John’s Apology by consulting my notes and memory. But that could never do it justice. I am no Plato. And really, truth be told, I know Nothing. You have to see the full video of John’s talk for yourself. And you can. Yes! Unlike Socrates’ last talk, Georgetown filmed John’s talk. Not only that, they filmed the whole CLE event. I suspect Georgetown will profit handsomely from all of this. John, of course, was paid nothing, and he would have it no other way.

Dear Georgetown advisors, and Dean Center, good citizens and friends all, please make a special exception regarding payment for the video of John Facciola’s talk. In the spirit of Socrates and your mission as educators, I respectfully request that you publish it online, in full, free of charge. Not the whole event, mind you, but John’s talk, all of his talk. Everyone should see this, not just the bubble people, not just Georgetown graduates and insiders. Let anyone, whether they be rich or poor, listen to these words. Put it on YouTube. Circulate it as widely as you can. Let me know and I will help you to get the word out. Give it away. No charge. You know that is what Socrates would demand.

In the meantime for all of my dear readers not lucky enough to have been there, here is a short fair use video that I made of Judge Facciola’s concluding words. Here he makes a humorous reference to the final passage he had previously quoted in full from Plato’s Apology. This is at the very end where Socrates asks his friends to punish his sons, the way he has tormented them, should they fall from the way of virtue. Having a son myself, I will finish this blog with the full quote from Plato and make the same request of you all. And I do not mean the humorous reference to long hair in Facciola’s concluding joke; I mean the real Socratic reference to virtue over money and a puffed-up sense of self-importance. It is a reference that we should all take to heart, not just Adam.

Do to my sons as I have done to you.

Still I have a favour to ask of them. When my sons are grown up, I would ask you, O my friends, to punish them; and I would have you trouble them, as I have troubled you, if they seem to care about riches, or anything, more than about virtue; or if they pretend to be something when they are really nothing,—then reprove them, as I have reproved you, for not caring about that for which they ought to care, and thinking that they are something when they are really nothing. And if you do this, both I and my sons will have received justice at your hands.

The hour of departure has arrived, and we go our ways—I to die, and you to live. Which is better God only knows.


Visualizing Data in a Predictive Coding Project – Part Two

November 16, 2014

This is part two of my presentation of an idea for visualization of data in a predictive coding project. Please read part one first.

As most of you already know, the ranking of all documents according to their probable relevance, or other criteria, is the purpose of predictive coding. The ranking allows accurate predictions to be made as to how the documents should be coded. In part one I shared the idea by providing a series of images of a typical document ranking process. I only included a few brief verbal descriptions. This week I will spell it out and further develop the idea. Next week I hope to end on a high note with random sampling and math.

Vertical and Horizontal Axes of the Images

[Figure: Raw_Data]

The visualizations here presented all represent a collection of documents. It is supposed to be a pointillist image, with one point for each document. At the beginning of a document review project, before any predictive coding training has been applied to the collection, the documents are all unranked. They are relatively unknown. This is shown by the fuzzy round cloud of unknown data.

Once the machine training begins all documents start to be ranked. In the most simplistic visualizations shown here the ranking is limited to predicted relevance or irrelevance. Of course, the predictions could be more complex, and include highly relevant and privilege, which is what I usually do. It could also include various other issue classifications, but I usually avoid this for a variety of reasons that would take us too far astray to explain.

Once the training and ranking begin, the probability grid comes into play. This grid creates both a vertical and horizontal axis. (In the future, we could add a third dimension too, but let’s start simple.) The one public comment received so far stated that the vertical axis on the images showing percentages adjacent to the words “Probable Relevant” might give people the impression that it is the probability of a document being relevant. Well, I hope so, because that is exactly what I was trying to do!

The vertical axis shows how the documents are ranked. The horizontal axis shows the number of documents, roughly, at each ranking level. Remember, each point is supposed to represent a specific, individual document. (In the future we could add family overlays, but again, let’s start simple.) A single dot in the middle would represent one document. An empty space would represent zero documents. A wide expanse of horizontal dots would represent hundreds or thousands of documents, depending on the scale.

The diagram below visualizes a common situation where ranking has just begun and the computer is uncertain as to how to classify the documents. It classifies most in the 37.5% to 67.5% range of probable relevance. It is all about fifty-fifty at this point. This is the kind of spread you would expect to see if training began with only random sampling input. The diagram indicates that the computer does not really know much yet about the data. It does not yet have any real idea as to which documents are relevant, and which are not.

[Figure: Vertical_ranking_overlay]

The vertical axis of the visualization is the key. It is intended to show a running grid from 99.9% probable relevant down to 0.1% probable relevant. Note that 0.1% probable relevant is another way of saying 99.9% probable irrelevant, but remember, I am trying to keep this simple. More complex overlays may be more to the liking of some software users. Also note that the particular numbers I show on these diagrams are arbitrary: 0.1%, 12.5%, 25%, 37.5%, 50%, 67.5%, 75%, 87.5%, 99.9%. I would prefer to see more detail here, and perhaps add a grid showing a faint horizontal line at every 10% interval. Still, the fewer lines shown here do have a nice aesthetic appeal, plus they were easier for me to create on the fly for this blog.

Again, let me repeat to be very clear. The vertical grid on these diagrams represents the probable ranking from least likely to be relevant on the bottom, to most likely on the top. The horizontal grid shows the number of documents. It is really that simple.
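
The blog does not tie this idea to any particular review platform, so purely as a rough illustration, here is a minimal Python sketch (using matplotlib and made-up probability scores, not any vendor’s actual output) of the kind of display described above: one point per document, probable relevance on the vertical axis, and horizontal spread proportional to how many documents sit at each ranking level.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up probable-relevance scores, one per document (0.0 to 1.0). In practice
# these would be exported from the predictive coding software after a training round.
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.beta(8, 2, 3_000),    # documents the model currently ranks as likely relevant
    rng.beta(2, 8, 12_000),   # documents the model currently ranks as likely irrelevant
])

# Bin the scores into ranking strata (every 12.5%, like the grid described above),
# then spread each document horizontally in proportion to how crowded its stratum is,
# so that wide rows mean many documents at that ranking level.
edges = np.arange(0.0, 1.0 + 0.125, 0.125)
stratum = np.digitize(scores, edges) - 1
counts = np.bincount(stratum, minlength=len(edges) - 1)
spread = rng.uniform(-1, 1, len(scores)) * counts[stratum] / counts.max()

plt.figure(figsize=(4, 6))
plt.scatter(spread, scores, s=2, alpha=0.3,
            c=np.where(scores >= 0.5, "red", "blue"))
plt.yticks(edges, [f"{e:.1%}" for e in edges])
plt.ylabel("Probable relevance")
plt.xticks([])  # the horizontal axis is only a width, not a numeric scale
plt.title("Document ranking distribution")
plt.tight_layout()
plt.show()
```

Run after each training round, a plot like this would show the hourglass shape forming as documents migrate toward the top and bottom of the ranking.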

Why Data Visualization Is Important

This kind of display of documents according to a vertical grid of probable relevance is very helpful because it allows you to see exactly how your documents are ranked at any one point in time. Just as important, it helps you to see how the alignment changes over time. This empowers you to see how your machine training impacts the distribution.

This kind of direct, immediate feedback greatly facilitates human computer interaction (what I call, in my approximately 50 articles on predictive coding, the hybrid approach). It makes it easier for the natural human intelligence to connect with the artificial intelligence. It makes it easier for the human SMEs involved to train the computer. The humans, typically attorneys or their surrogates, are the ones with the expertise on the legal issues in the case. This visualization allows them to see immediately what impact particular training documents have upon the ranking of the whole collection. This helps them to select effective training documents. It helps them to attain the goal of separating relevant from irrelevant documents. Ideally the documents would end up clustered at both the bottom and the top of the vertical axis.

For this process to work it is important for the feedback to be grounded in actual document review, and not be a mere intellectual exercise. Samples of documents in the various ranking strata must be inspected to verify, or not, whether the ranking is accurate. That can vary from stratum to stratum. Moreover, as everyone quickly finds out, each project is different, although certain patterns do tend to emerge. The diagrams used as an example in this blog represent one such typical pattern, although greatly compressed in time. In reality the changes shown here from one diagram to another would be more gradual and have a few unexpected bumps and bulges.

Visualizations like this will speed up the ranking and the review process. Ultimately the graphics will all be fully interactive. By clicking on any point in the graphic you will be taken to the particular document or documents that it represents. You click and drag, and you are taken to the whole set of documents selected. For instance, you may want to see all documents between 45% and 55%, so you would select that range in the graphic. Or you may want to see all documents in the top 5% probable relevance ranking, so you select that top edge of the graphic. These documents will instantly be shown in the review database. Most good software already has document visualizations with similar linking capacities. So we are not reinventing the wheel here, just applying these existing software capacities to new patterns, namely to document rankings.
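
As a rough sketch of that click-and-drag idea, assuming only that the review platform can export a simple list of document IDs with their probable relevance scores (a hypothetical export format, not any particular product’s), selecting a ranking band amounts to a filter:

```python
# Hypothetical export from the review platform: (document id, probable relevance).
ranked_docs = [
    ("DOC-000001", 0.97),
    ("DOC-000002", 0.52),
    ("DOC-000003", 0.48),
    ("DOC-000004", 0.03),
    # ... one row per document in the collection
]

def docs_in_band(docs, low, high):
    """Return the documents whose probable relevance falls within [low, high]."""
    return [(doc_id, score) for doc_id, score in docs if low <= score <= high]

# The uncertain middle band mentioned above (45% to 55% probable relevance):
uncertain_middle = docs_in_band(ranked_docs, 0.45, 0.55)

# The top edge of the ranking (95% probable relevance and above):
top_edge = docs_in_band(ranked_docs, 0.95, 1.00)

print(len(uncertain_middle), "documents in the 45-55% band;",
      len(top_edge), "documents in the top 5% band")
```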

These graphic features will allow you to easily search the ranking locations. This will in turn allow you to verify, or correct, the machine’s learning. Where you find that the documents clicked have a correct prediction of relevance, you verify by coding as relevant, or highly relevant. Where the documents clicked have an incorrect prediction, you correct by coding the document properly. That is how the computer learns. You tell it yes when it gets it right, and no when it gets it wrong.

At the beginning of a project many predictions of relevance and irrelevance will be incorrect. These errors will diminish as the training progresses, as the correct predictions are verified, and erroneous predictions are corrected. Fewer mistakes will be made as the machine starts to pick up the human intelligence. To me it seems like a mind to computer transference. More of the predictions will be verified, and the document distributions will start to gather at both ends of the vertical relevance axis. Since the volume of documents is represented by the horizontal axis, more documents will start to bunch together at both the top and bottom of the vertical axis. Since document collections in legal search usually contain many more irrelevant documents than relevant, you will typically see most documents on the bottom.

Visualizations of an Exemplar Predictive Coding Project

In the sample considered here we see unnaturally rapid training. It would normally take many more rounds of machine training than are shown in these four diagrams. In fact, with a continuous active training process, there could be hundreds of rounds per day. In that case the visualization would look more like an animation than a series of static images. But again, I have limited the process here for simplicity’s sake.

[Figure: 1000000_docs]

As explained previously, the first thing that happens to the fuzzy round cloud of unknown data before any training begins is that the data is processed, deduplicated, deNisted, and non-text and other documents unsuitable for analytics are removed. In addition, other documents necessarily irrelevant to this particular project are bulk-culled out. For example, ESI such as music files, some types of photos, and many email domains, like, for instance, emails from publications such as the NY Times. By good fortune in this example exactly One Million documents remain for predictive coding.

[Figure: Random]

We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. (They are the yellow dots.) Assuming a 95% confidence level, do you know what confidence interval this creates? I asked this question before and repeat it again, as the answer will not come until the final math installment next week.

Next we assume that an SME, and/or his or her surrogates, reviewed the 1,534 sample documents and found that 384 were relevant and 1,150 were irrelevant. Do you know what prevalence rate this creates? Do you know the projected range of relevant documents within the confidence interval limits of this sample? That is the most important question of all.

Next we do the first round of machine training proper. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. Again, for simplicity’s sake, we assume that the analytics is directed towards relevance alone. In fact, most projects would also include high-relevance and privilege.

[Figure: data-visual_Round_2]

In this project the data ball changed to the following distribution. Note the lighter colors represent less density of documents. Red documents represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different and the distributions shown here are just one among near countless possibilities. Here there are already more documents trained on irrelevance than relevance. This is in spite of the fact that the active search was to find relevant documents, not irrelevant documents. This is typical in most review projects, where you have many more irrelevant than relevant documents overall, and where it is easier to spot and find irrelevant documents than relevant ones.

[Figure: data-visual_Round_3]

Next we see the data after the second round of training. The division of the collection of documents into relevant and irrelevant is beginning to form. The largest collection of documents is the blue points at the bottom. They are the documents that the computer predicts are irrelevant based on the training to date. There is also a large collection of points shown in red at the top. They are the ones where the computer now thinks there is a high probability of relevance. Still, the computer is uncertain about the vast majority of the documents: the red in the third stratum from the top, the blue in the third stratum from the bottom, and the many in the grey, the 37.5% to 67.5% probable relevance range. Again we see an overall bottom heavy distribution. This is a typical pattern because it is usually easier to train on irrelevance than relevance.

As noted before, the training could be continuous. Many software programs offer that feature. But I want to keep the visualizations here simple, and not make an animation, and so I do not assume here literally continuous active learning. Personally, although I do like to keep the training continuous throughout the review, I like the actual computer training to come in discrete stages that I control. That gives me a better understanding of the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. That is the kind of feedback that these visualizations enhance.

[Figure: data-visual_Round_4]

Next we see the data after the third round of training. Again, in reality it would typically take more than three rounds of training to reach this relatively mature state, but I am trying to keep this example simple. If a project did progress this fast, it would probably be because a large number of documents were used in the prior rounds. The set of documents about which the computer is now uncertain — the grey area, and the middle two brackets — is now much smaller.

The computer now has a high probability ranking for most of the probable relevant and probable irrelevant documents. The largest number of documents is in the blue bottom, where the computer predicts they have a near zero chance of being classified relevant. Again, most of the probable predictions, those in the top and bottom three brackets, are located in the bottom three brackets. Those are the documents predicted to have less than a 37.5% chance of being relevant. Again, this kind of distribution is typical, but there can be many variances from project to project. We here see a top loading where most of the probable relevant documents are in the top 12.5% ranking. In other words, they have an 87.5% probable relevant ranking, or higher.

[Figure: data-visual_Round_5]

Next we see the data after the fourth round of training. It is an excellent distribution at this point. There are relatively few documents in the middle. This means there are relatively few documents about which the computer is uncertain as to their probable classification. This pattern is one factor among several to consider in deciding whether further training and document review are required to complete your production.

Another important metric to consider is the total number of documents found to be probable relevant, and how that compares with the random sample prediction. Here is where the math comes in, and an understanding of what random sampling can and cannot tell you about the success of a project. You consider the spot projection of total relevance based on your initial prevalence calculation, but, much more important, you consider the actual range of documents under the confidence interval. That is what really counts when dealing with prevalence projections and random sampling. That is where the plus or minus confidence interval comes into play, as I will explain in detail in the third and final installment of this blog.
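
To make the spot-projection-versus-range distinction concrete, here is a minimal sketch of the arithmetic, using the example’s numbers and a simple normal-approximation confidence interval on the sample proportion. This is only one common way to compute the interval (Wilson or exact binomial intervals give somewhat different ranges), and the author’s own worked answers are promised for part three.

```python
from math import sqrt

# The example's numbers (hypothetical, for illustration only).
corpus_size = 1_000_000    # documents remaining after culling
sample_size = 1_534        # randomly sampled documents reviewed by the SME
sample_relevant = 384      # sample documents coded relevant

# Spot projection: the sample prevalence applied to the whole corpus.
prevalence = sample_relevant / sample_size
spot_projection = prevalence * corpus_size

# A simple normal-approximation 95% confidence interval on the sample proportion.
z = 1.96  # two-sided 95% confidence level
margin = z * sqrt(prevalence * (1 - prevalence) / sample_size)
low, high = prevalence - margin, prevalence + margin

print(f"Sample prevalence (spot estimate): {prevalence:.2%}")
print(f"95% confidence interval:           {low:.2%} to {high:.2%}")
print(f"Projected relevant documents:      {low * corpus_size:,.0f} to "
      f"{high * corpus_size:,.0f} (spot projection {spot_projection:,.0f})")
```

The output illustrates the blog’s point: the defensible answer is the low-to-high range of documents, not the single spot number.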

[Figure: Prevalence]

In the meantime, here is the document count of the distribution roughly pictured in the final diagram above, which to me looks like an upside down, fragile champagne glass. We see that exactly 250,000 documents have a 50% or higher probable relevance ranking, and 750,000 documents have a 49.9% or less probable relevance ranking. Of the probable relevant documents, there are 15,000 documents that fall in the 50% to 67.5% range. There are another 10,000 documents that fall in the 37.5% to 49.9% probable relevance range. Again, this is fairly common, as we often see fewer documents on the barely irrelevant side than we do on the barely relevant side. As a general rule I review with humans all documents that are 50% or higher probable relevance, and do not review the rest. I do, however, sample and test the rest, the documents with less than a 50% probable relevance ranking. Also, in some projects I review far less than the top 50%. That all depends on proportionality constraints, and on the document ranking distribution, the kind of distribution that these visualizations will show.
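
For completeness, a small sketch (again with invented scores standing in for an export) of how the stratum counts and the 50% review cutoff described above might be tallied:

```python
import numpy as np

# Invented probable-relevance scores, one per document, standing in for an export.
rng = np.random.default_rng(7)
scores = np.concatenate([rng.beta(9, 2, 250_000), rng.beta(2, 12, 750_000)])

# Tally the documents in each ranking stratum, bottom to top.
edges = np.arange(0.0, 1.0 + 0.125, 0.125)
counts, _ = np.histogram(scores, bins=edges)
for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:.1%} to {high:.1%}: {n:,} documents")

# The review rule described above: humans review everything ranked 50% or higher.
print(f"Documents routed to human review: {(scores >= 0.5).sum():,}")
```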

In addition to this metrics analysis, another important factor to consider in deciding whether our search and review efforts are now complete is how much change in ranking there has been from one training round to the next. Sometimes there may be no change at all. Sometimes there may only be very slight changes. If the changes from the last round are large, that is an indication that more training should still be tried, even if the distribution already looks optimal, as we see here.

Another even more important quality control factor is how correct the computer has been in the last few rounds of its predictions. Ideally, you want to see the rate of error decreasing to the point where you see no errors in your judgmental samples. You want your testing of the computer’s predictions to show that it has attained a high degree of precision. That means there are few documents predicted relevant that actual review by human SMEs shows are in fact irrelevant. This kind of error is known as a False Positive. Much more important to quality evaluation is the discovery of documents predicted irrelevant that are actually relevant. This kind of error is known as a False Negative. The False Negatives are your real concern in most projects because legal search is usually focused on recall, not precision, at least within reason.
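
A short sketch of the underlying definitions, using hypothetical counts from quality control samples. Note that turning sample counts like these into a defensible corpus-wide recall estimate takes more care with sampling design, which is part of what the promised math installment addresses.

```python
# Hypothetical counts from quality control samples of the predicted strata.
true_positives = 470    # predicted relevant, confirmed relevant by the SME
false_positives = 30    # predicted relevant, actually irrelevant
false_negatives = 12    # predicted irrelevant, actually relevant (the real worry)
true_negatives = 988    # predicted irrelevant, confirmed irrelevant

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision:.1%} (how clean the predicted-relevant set is)")
print(f"Recall:    {recall:.1%} (how much of the relevant material was found)")
```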

The final distinction to note in quality control is one that might seem subtle, but really is not. You must also factor in relevance weight. You never want a False Negative to be a highly relevant document. If that happens to me, I always commence at least one more round of training. Even missing a document that is not highly relevant, not hot, but is a strongly relevant document, and one of a type not seen before, is typically a cause for further training. This is, however, not an automatic rule, as it is with the discovery of a missed hot document. It depends on a variety of factors having to do with relevance analysis of the particular case and document collection.

In our example we are going to assume that all of the quality control indicators are positive, and a decision has been made to stop training and move on to a final random sample test.

A second random sample comes next. That test and visualization will be provided next week, along with the promised math and sampling analysis.

Math Quiz

In part one, and again here, I asked some basic math questions on random sampling, prevalence, and recall. So far no one has attempted to answer the questions posed. Apparently, most readers here do not want to be tested. I do not blame them. This is also what I find in my online training program, e-DiscoveryTeamTraining.com, where only a small percentage of the students who take the program elect to be tested. That is fine with me as it means one less paper to grade, and most everyone passes anyway. I do not encourage testing. You know if you get it or not. Testing is not really necessary.

The same applies to answering math questions in a public blog. I understand the hesitancy. Still, I hope many privately tried, or will try, to solve the questions and come up with the correct answers. In part three of this blog I will provide the answers, and you will know for sure if you got it right. There is still plenty of time to try to figure it out on your own. The truly bold can post their answers online in the comments below. Of course, this is all pretty basic stuff to true experts of this craft. So, to my fellow experts out there, you have yet another week to take some time and strut your stuff by sharing the obvious answers. Surely I am not the only one in the e-discovery world bold enough to put their reputation on the line by sharing their opinions and analysis in public for all to see (and criticize). Come on. I do it every week.

Math and sampling are important tools for quality control, but as Professor Gordon Cormack, a true wizard in the area of search, math, and sampling, likes to point out, sampling alone has many inherent limitations. Gordon insists, and I agree, that sampling should only be part of a total quality control program. You should never rely on random sampling alone, especially in low prevalence collections. Still, when sampling, prevalence, and recall are included as part of an overall QC effort, the net effect is very reassuring. Unless I know that I have an expert like Gordon on the other side, and so far that has never happened, I want to see the math. I want to know about all of the quality control and quality assurance steps taken to try to find the information requested. If you are going to protect your client, you need to learn this too, or have someone at hand who already knows it.

This kind of math, sampling, and other process disclosures should convince even the most skeptical adversary or judge. That is why it is important for all attorneys involved with legal search to have a clear mathematical understanding of the basics. Visualizations alone are inadequate, but, for me at least, visualizations do help a lot. All kinds of data visualizations, not just the ones here presented, provide important tools to help lawyers understand how a search project is progressing.

Challenge to Software Vendors

The simplicity of the design of the idea presented here is a key part of the power and strength of the visualization. It should not be too difficult to write code to implement this visualization. We need this. It will help users to better understand the process. It will not cost too much to implement, and what it does cost should be recouped soon in higher sales. Come on vendors, show me you are listening. Show me you get it. If you have a software demo that includes this feature, then I want to see it. Otherwise not.

All good predictive coding software already ranks the probable relevance of documents, so we are not talking about an enormous coding project. This feature would simply add a visual display to calculations already being made. I could hand make these calculations myself using an Excel spreadsheet, but that is time consuming and laborious. This kind of visualization lends itself to computer generation.

I have many other ideas for predictive coding features, including other visualizations, that are much more complex and challenging to implement. This simple grid explained here is an easy one to implement, and will show me, and the rest of our e-discovery community, who the real leaders are in software development.

Conclusion

The primary goal of the e-Discovery Team blog is educational, to help lawyers and other e-discovery professionals. In addition, I am trying to influence what services and products are provided in e-discovery, both legal and technical. In this blog I am offering an idea to improve the visualizations that most predictive coding software already provides. I hope that all vendors will include this feature in future releases of their software. I have a host of additional ideas to improve legal search and review software, especially the kind that employs active machine learning. They include other, much more elaborate visualization schemes, some of which have been alluded to here.

Someday I may have time to consult on all of the other, more complex ideas, but, in the meantime, I offer this basic idea for any vendor to try out. Until vendors start to implement even this basic idea, anyone can at least use their imagination, as I now do, to follow along. These kinds of visualizations can help you to understand the impact of document ranking on your predictive coding review projects. All it takes is some idea as to the number of documents in the various probable relevance ranking strata. Try it on your next predictive coding project, even if it is just rough images from your own imagination (or Excel spreadsheet). I am sure you will see for yourself how helpful this can be to monitor and understand the progress of your work.


Visualizing Data in a Predictive Coding Project

November 9, 2014

[Figure: data-visual_Round_5]

This blog will share a new way to visualize data in a predictive coding project. I only include a brief description this week. Next week I will add a full description of this project. Advanced students should be able to predict the full text from the images alone. Study the images and short descriptions and try to figure out the details of what is going on.

Soon all good predictive coding software will include visualizations like this to help searchers to understand the data. The images can be automatically created by computer to accurately visualize exactly how the data is being analyzed and ranked. Experienced searchers can use this kind of visual information to better understand what they should do next to efficiently meet their search and review goals.

For a game, try to figure out the high and low number of relevant documents that you must find in this review project to claim that you have a 95% confidence level of having found all relevant documents, the mythical total recall. This high-low range will be wrong one time out of twenty (that is what the 95% confidence level means), but still, this knowledge is helpful. The correct answer to questions of recall and prevalence is always a high-low range of documents, never just one number, and never a percentage. Also, there are always confidence level caveats. Still, with these limitations in mind, for extra points, state what the spot projection is for prevalence. These illustrations and short descriptions provide all of the information you need to calculate these answers.

The project begins with a collection of documents here visualized by the fuzzy ball of unknown data.

[Figure: Raw_Data]

Next the data is processed, deduplicated, deNisted, and non-text and other documents unsuitable for analytics are removed. By good fortune exactly One Million documents remain.

[Figure: 1000000_docs]

We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. Assuming a 95% confidence level, what confidence interval does this create?

[Figure: Random]

Assume that an SME reviewed the 1,534 sample documents and found that 384 were relevant and 1,150 were irrelevant.

Training Begins

Next we do the first round of machine training. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. To keep it simple we only show the relevance ranking, and not also the irrelevance metrics display. The top represents 99.9% probable relevance. The bottom represents the inverse, 0.1% probable relevance. Put another way, the bottom would represent 99.9% probable irrelevance. For simplicity’s sake we also assume that the analytics is directed towards relevance alone, whereas most projects would also include high-relevance and privilege. In this project the data ball changed to the following distribution. Note the lighter colors represent less density of documents. Red documents represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different and the distributions shown here are just one among near countless possibilities.

[Figure: data-visual_Round_2]

Next we see the data after the second round of training. Note that with most software the training could be continuous. But I like to control when the training happens in order to better understand the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. The human SME understands how the machine is learning. The SME learns where the machine needs the most help to tune into their conception of relevance. This kind of cross-communication makes it easier for the artificial intelligence to properly boost the human intelligence.

[Figure: data-visual_Round_3]

Next we see the data after the third round of training. The machine is learning very quickly. In most projects it takes longer than this to attain this kind of ranking distribution. What does this tell us about the number of documents between rounds of training?

[Figure: data-visual_Round_4]

Now we see the data after the fourth round of training. It is an excellent distribution and so we decide to stop and test.

[Figure: data-visual_Round_5]

The second random sample comes next. That visualization, and a full description of the project, will be provided next week. In the meantime, leave your answers to the questions in the comments below. This is a chance to strut your stuff. If you prefer, send me your answers, and questions, by private email.


Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two

November 2, 2014

This is the second part of a two-part blog; please read part one first.

AI-Enhanced Big Data Search Will Greatly Simplify Information Governance

Information Governance is, or should be, all about finding the information you need, when you need it, and doing so in a cheap and efficient manner. Information needs are determined by both law and personal preferences, including business operation needs. In order to find information, you must first have it. Not only that, you must keep it until you need it. To do that, you need to preserve the information. If you have already destroyed information, really destroyed it I mean, not just deleted it, then obviously you will not be able to find it. You cannot find what does not exist, as all Unicorn chasers eventually find out.

This creates a basic problem for Information Governance because the whole system is based on a notion that the best way to find valuable information is to destroy worthless information. Much of Information Governance is devoted to trying to determine what information is a valuable needle, and what is worthless chaff. This is because everyone knows that the more information you have, the harder it is for you to find the information you need. The idea is that too much information will cut you off. These maxims were true in the pre-AI-Enhanced Search days, but are, IMO, no longer true today, or, at least, will not be true in the next five to ten years, maybe sooner.

In order to meet the basic goal of finding information, Information Governance focuses its efforts on the proper classification of information. Again, the idea was to make it simpler to find information by preserving some of it, the information you might need to access, and destroying the rest. That is where records classification comes in.

The question of what information you need has a time element to it. The time requirements are again based on personal and business operations needs, and on thousands of federal, state and local laws. Information governance thus became a very complicated legal analysis problem. There are literally thousands of laws requiring certain types of information to be preserved for various lengths of time. Of course, you could comply with most of these laws by simply saving everything forever, but, in the past, that was not a realistic solution. There were severe limits on the ability to save information, and the ability to find it. Also, it was presumed that the older information was, the less value it had. Almost all information was thus treated like news.

These ideas were all firmly entrenched before the advent of Big Data and AI-enhanced data mining. In fact, in today’s world there is good reason for Google to save every search, ever done, forever. Some patterns and knowledge only emerge in time and history. New information is sometimes better information, but not necessarily so. In the world of Big Data all information has value, not just the latest.

These records life-cycle ideas all made perfect sense in the world of paper information. It cost a lot of money to save and store paper records. Everyone with a monthly Iron Mountain paper records storage bill knows that. Even after the computer age began, it still cost a fair amount of money to save and store ESI. The computers needed to store and maintain digital information used to be very expensive. Finding the ESI you needed quickly on a computer was still very difficult and unreliable. All we had at first was keyword search, and that was very ineffective.

Due to the costs of storage, and the limitations of search, tremendous efforts were made by record managers to try to figure out what information was important, or needed, either from a legal perspective or a business necessity perspective, and to save that information, and only that information. The idea behind Information Management was to destroy the ESI you did not need or were not required by law to preserve. This destruction saved you money, and it also made possible the whole point of Information Governance: to find the information you wanted, when you wanted it.

Back in the pre-AI search days, the more information you had, the harder it was to find the information you needed. That still seems like common sense. Useless information was destroyed so that you could find valuable information. In reality, with the new and better algorithms we now have for AI-enhanced search, it is just the reverse. The more information you have, the easier it becomes to find what you want. You now have more information to draw upon.

That is the new reality of Big Data. It is a hard intellectual paradigm shift to make, and seems counter-intuitive. It took me a long time to get it. The new ability to save and search everything cheaply and efficiently is what is driving the explosion of Big Data services and products. As the save everything, find anything way of thinking takes over, the classification and deletion aspects of Information Governance will naturally dissipate. The records lifecycle will transform into virtual immortality. There is no reason to classify and delete if you can save everything and find anything at low cost. The issues simplify; they change to how to save and search, although new collateral issues of security and privacy grow in importance.

Save and Search v. Classify and Delete

The current clash in basic ideas concerning Big Data and Information Governance is confusing to many business executives. According to Gregory Bufithis, who attended a recent event in Washington D.C. on Big Data sponsored by EMC, one senior presenter explained:

The C Suite is bedeviled by IG and regulatory complexity. … 

The solution is not to eliminate Information Governance entirely. The reports of its complete demise, here or elsewhere, are exaggerated. The solution is to simplify IG. To pare it down to save and search. Even this will take some time, like I said, from five to ten years, although there is some chance this transformation of IG will go even faster than that. This move away from complex regulatory classification schemes, to simpler save and search everything, is already being adopted by many in the high-tech world. To quote Greg again from the private EMC event in D.C. in October, 2014:

Why data lakes? Because regulatory complexity and the changes can kill you. And are unpredictable in relationship to information governance. …

So what’s better? Data lakes coupled with archiving. Yes, archiving seems emblematic of “old” IT. But archiving and data lifecycle management (DLM) have evolved from a storage focus, to a focus on business value and data loss prevention. DLM recognizes that as data gets older, its value diminishes, but it never becomes worthless. And nobody is throwing out anything and yes, there are negative impacts (unnecessary storage costs, litigation, regulatory sanctions) if not retained or deleted when it should be.

But … companies want to mine their data for operational and competitive advantage. So data lakes and archiving their data allows for ingesting and retain all information types, structured or unstructured. And that’s better.

Because then all you need is a good search platform or search system … like Hadoop which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from OLAP (online analytical processing). And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

Because it is all about search.

Recent Breakthroughs in Artificial Intelligence
Make Possible Save Everything, Find Anything

The New York Times in an opinion editorial this week discussed recent breakthroughs in Artificial Intelligence and speculated on alternative futures this could create. Our Machine Masters, NY Times Op-Ed, by David Brooks (October 31, 2014). The Times article quoted extensively another article in the current issue of Wired by technology blogger Kevin Kelly: The Three Breakthroughs That Have Finally Unleashed AI on the World. Kelly argues, as do I, that artificial intelligence has now reached a breakthrough level. This artificial intelligence breakthrough, Kevin Kelly argues, and David Brooks agrees, is driven by three things: cheap parallel computation technologies, big data collection, and better algorithms. The upshot is clear in the opinion of both Wired and the New York Times: “The business plans of the next 10,000 start-ups are easy to forecast: Take X and add A.I. This is a big deal, and now it’s here.”

These three new technology advances change everything. The Wired article goes into the technology and financial aspects of the new AI; it is where the big money is going and will be made in the next few decades. If Wired is right, then this means that in our world of e-discovery, companies and law firms will succeed if, and only if, they add AI to their products and services. The firms and vendors who add AI to document review, and project management, will grow fast. The non-AI-enhanced vendors, with their non-AI-enhanced software, will go out of business. The law firms that do not use AI tools will shrink and die.

The Times article by David Brooks goes into the sociological and philosophical aspects of the recent breakthroughs in Artificial Intelligence:

Two big implications flow from this. The first is sociological. If knowledge is power, we’re about to see an even greater concentration of power.  … [E]ngineers at a few gigantic companies will have vast-though-hidden power to shape how data are collected and framed, to harvest huge amounts of information, to build the frameworks through which the rest of us make decisions and to steer our choices. If you think this power will be used for entirely benign ends, then you have not read enough history.

The second implication is philosophical. A.I. will redefine what it means to be human. Our identity as humans is shaped by what machines and other animals can’t do. For the last few centuries, reason was seen as the ultimate human faculty. But now machines are better at many of the tasks we associate with thinking — like playing chess, winning at Jeopardy, and doing math. [RCL – and, you might add, better at finding relevant evidence.]

On the other hand, machines cannot beat us at the things we do without conscious thinking: developing tastes and affections, mimicking each other and building emotional attachments, experiencing imaginative breakthroughs, forming moral sentiments. [RCL – and, you might add, better at equitable notions of justice and at legal imagination.]

In this future, there is increasing emphasis on personal and moral faculties: being likable, industrious, trustworthy and affectionate. People are evaluated more on these traits, which supplement machine thinking, and not the rote ones that duplicate it.

In the cold, utilitarian future, on the other hand, people become less idiosyncratic. If the choice architecture behind many decisions is based on big data from vast crowds, everybody follows the prompts and chooses to be like each other. The machine prompts us to consume what is popular, the things that are easy and mentally undemanding.

I’m happy Pandora can help me find what I like. I’m a little nervous if it so pervasively shapes my listening that it ends up determining what I like. [RCL – and, you might add, determining what is relevant, what is fair.]

I think we all want to master these machines, not have them master us.

Although I share the concerns of the NY Times about mastering machines and alternative future scenarios, my analysis of the impact of the new AI is focused on and limited to the Law. Lawyers must master the AI-search for evidence processes. We must master and use the better algorithms, the better AI-enhanced software, not vice versa. The software does not, nor should it, run itself. Easy buttons in legal search are a trap for the unwary, a first step down a slippery slope to legal dystopia. Human lawyers must never over-delegate our uniquely human insights and abilities. We must train the machines. We must stay in charge and assert our human insights on law, relevance, equity, fairness and justice, and our human abilities to imagine and create new realities of justice for all. I want lawyers and judges to use AI-enhanced machines, but I never want to be judged by a machine alone, nor have a computer alone as a lawyer.

The three big new advances that are allowing better and better AI are nowhere near to threatening the jobs of human judges or lawyers, although they will likely reduce their numbers, and certainly will change their jobs. We are already seeing these changes in Legal Search and Information Governance. Thanks to cheap parallel computation, we now have Big Data Lakes stored in thousands of inexpensive cloud computers that are operating together. This is where open-source software like Hadoop comes in. It makes the big clusters of computers possible. Better algorithms are where better AI-enhanced software comes in. This makes it possible to use predictive coding effectively and inexpensively to find the information needed to resolve law suits. The days of vast numbers of document reviewer attorneys doing linear review are numbered. Instead, we will see a few SMEs working with small teams of reviewers, search experts, and software experts.
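
The vendors’ actual algorithms are not spelled out here, but as a rough, generic illustration of what active machine learning with a human trainer looks like, here is a minimal sketch using scikit-learn (an assumed general-purpose library, not any e-discovery product), with tiny made-up documents standing in for a real collection:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set coded by the SME (1 = relevant, 0 = irrelevant),
# plus a few unreviewed documents. Real collections have far more of both.
seed_texts = ["quarterly revenue forecast memo", "fantasy football picks",
              "merger due diligence checklist", "lunch menu for friday"]
seed_labels = [1, 0, 1, 0]
unreviewed = ["draft merger agreement", "football scores and standings",
              "budget forecast email"]

# Turn text into features and fit a simple classifier on the seed set.
vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(seed_texts), seed_labels)

# Rank every unreviewed document by probable relevance.
probs = model.predict_proba(vectorizer.transform(unreviewed))[:, 1]

# One common active learning tactic: route the documents the model is least
# certain about (scores closest to 50%) back to the SME for coding, then retrain.
for i in np.argsort(np.abs(probs - 0.5)):
    print(f"{probs[i]:.0%} probable relevant -> {unreviewed[i]!r}")
```

In a real project the SME would code the documents the model is least certain about, the model would be retrained, and the loop would repeat until the quality control measures discussed in the earlier posts are satisfied.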

The role of Information Managers will also change drastically. Because of Big Data, cheap parallel computing, and better algorithms, it is now possible to save everything, forever, at a small cost, and to quickly search and find what you need. The new reality of Save Everything, Find Anything undercuts most of the rationale of Information Governance. It is all about search now.

Conclusion

Now that storage costs are negligible, and search far more efficient, the twin motivators of Information Science to classify and destroy are gone, or soon will be. The key remaining tasks of Information Governance are now preservation and search, plus the relatively new ones of security and privacy. I recognize that the demise of the importance of destruction of ESI could change if more governments enact laws that require the destruction of ESI, as the EU has done with Facebook posts and the so-called “right to be forgotten” law. But for now, most laws are about saving data for various times, and do not require that data be destroyed. Note that the new Delaware law on data destruction still keeps it discretionary whether to destroy personal data or not. House Bill No. 295 – The Safe Destruction of Documents Containing Personal Identifying Information. It only places legal burdens and liability for failures to properly destroy data. This liability for mistakes in destruction serves to discourage data destruction, not encourage it.

Preservation is not too difficult when you can economically save everything forever, so the challenging task remaining is really just one of search. That is why I say that Information Governance will become a sub-set of search. The save everything forever model will, however, create new legal work for lawyers. The cybersecurity protection and privacy aspects of Big Data Lakes are already creating many new legal challenges and issues. More legal issues are sure to arise with the expansion of AI.

Automation, including this latest Second Machine Age of mental process automation, does not eliminate the need for human labor. It just makes our work more interesting and opens up more time for leisure. Automation has always created new jobs as fast as it has eliminated old ones. The challenge for existing workers like ourselves is to learn the new skills necessary to do the new jobs. For us e-discovery lawyers and techs, this means, among other things, acquiring new skills to use AI-enhanced tools. One such skill, the ability for HCIR, human computer information retrieval, is mentioned in most of my articles on predictive coding. It involves new skill sets in active machine learning to train a computer to find the evidence you want from large collections of data sets, typically emails. When I was a law student in the late 1970s, I could never have dreamed that this would be part of my job as a lawyer in 2014.

The new jobs do not rely on physical or mental drudgery and repetition. Instead, they put a premium on what makes us distinctly human: our deep knowledge, understanding, wisdom, and intuition; our empathy, caring, love and compassion; our morality, honesty, and trustworthiness; our sense of justice and fairness; our ability to change and adapt quickly to new conditions; our likability, good will, and friendliness; our imagination, art, wisdom, and creativity. Yes, even our individual eccentricities, and our all-important sense of humor. No matter how far we progress, let us never lose that! Please be governed accordingly.


