As a kind of strange lawyer-mid-life-crisis, I wrote my first law review article last year: HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). Following tradition, I tried to make the opening sentences as clever as possible:
For over one hundred years, complex litigation has relied upon the ubiquitous Bates stamp to try and maintain order and clarity in paper evidence by placing sequential numbers on documents. In today’s world of vast quantities of electronic documents, the days of the Bates stamp are numbered. Instead, the future belongs to a new technology, a computer-based mathematical process known as “hash.” (emphasis added)
Ok, maybe not so clever, but still, I was delighted to see an article this week entitled Bates Stamps’ Days May Be Numbered by Tom O’Connor in Law.com’s Legal Technology section. No big surprise here as I met Tom a few weeks ago, and we talked about hash. (I tend to do that, a lot.) I liked how Tom saw the conversion from Bates stamping to hash as symbolic of a paradigm shift, not only in e-discovery, but in the world at large. Tom and a few others, such as Craig Ball, see a significance in the move to hash beyond what I understood when I wrote the article. They also have a better grasp of how this fits with other e-discovery technologies and procedures to facilitate what Tom claims are huge savings in time and money. I gave Tom a copy of my article, as he had heard about it from Craig but not yet read it. (Yes, I usually keep an extra copy in my briefcase.)
I mentioned Tom’s ideas in a prior blog, e-Discovery at the Harvard Club in New York City, based on his presentation at the CLE. The article Tom has since written, Bates Stamps’ Days May Be Numbered, puts more meat on the bones, which I will attempt to summarize here and place into proper hash context. For still more information, listen to Monica Bay’s recent interview of Tom on Legal Talk Network.
For those not really clear on what hash is, and what it could possibly have to do with the 19th Century Bates stamp shown above, I suggest you read my law review article. But if the thought of reading a 44-page academic paper with 174 footnotes leaves you cold, I suggest you try my Hash Page summary instead, or my earlier blog on Hash. They will give you a pretty good idea of how hash is the mathematical foundation of e-discovery, not a corned beef dish, and why this math should render sequential numbering obsolete. There are also many interesting comments left on these blogs by experts in the field, including an esoteric argument I had with a few vendors concerning the legal efficacy of hash in ESI authentication. These short articles do not go into law review depth, but they do lay a helpful predicate for understanding what Tom is talking about.
Tom’s article begins by noting that most people doing e-discovery today still rely on Bates stamping. They scan and sequentially number ESI as if it were a piece of paper. Then he observes, as I did in my introduction, that this system will not work “in today’s world of vast quantities of electronic documents.”
But that process is simply not effective when dealing with terabytes of data. To address the sheer volume, many vendors are advocating a new way of working with electronic documents that can reduce costs as much as 65 percent by eliminating the need for text extraction and imaging in the processing phase. Beyond immediate cost savings, this approach also provides cheaper native file production, reducing imaging costs for production sets and saving up to 90 percent of the time needed to process documents. How? By not using Bates numbers on every page.
Later Tom explains that the alternative to Bates numbers is hash values. But first, he details how and why this conversion can save so much time and money:
Currently, to provide Bates numbering, many vendors generate TIFF images from native files and then Bates number those images. But this process complicates native file review and — at anywhere from eight to 20 cents per TIFF — adds considerable cost to the process. Typically, during processing, data is culled, de-duplicated; metadata and text are extracted; and then a TIFF file is created. An unavoidable consequence is that the relationship of the pages to other pages, or attachments, is broken — and then must be re-created for the review process. Page-oriented programs handle this by using a load file to tie everything together from the key of a page number. But most new software use a relational database that stores the data about a document in multiple tables. To load single page TIFFs into a relational database involves a substantial amount of additional and duplicative work in the data load process.
These steps are avoided by changing to an identification system based on hash values of entire ESI files (which Tom here calls “documents”) that eliminates the need for tracking of individual pages. Here is how Tom explains it, using a lot of e-discovery oriented tech-talk, which, if he is speaking, is usually tempered by a few laughs and war stories:
A document-based data model, rather than a page-based approach, eliminates the text extraction and image creation steps from the processing stage and cuts the cost of that process in half. Documents become available in the review platform much faster — as imaging often accounts for as much as 90 percent of the time to process. This enables early case assessment without any processing, by simply dragging and dropping a native file or a PST straight into the application — which cannot be achieved with the page-based batch process. Relational databases allow for one-to-many and many-to-many relationships and support advanced features and functions — as well as compatibility with external engines for tasks such as de-duping and concept searching. Applications that support these functions — such as software from Equivio, Recommind and Vivisimo Inc. — are all document-based and will not perform in the old page environment. Programs that use the document model can eliminate batch transfer. This process (See Diagram 1 below) increases data storage due to the need for data replication in the transfer process and is also prone to a high rate of human error. And elimination of the time that inventory (in this case, electronic data) is stationary will eliminate overall cost as well as reduce production time
Tom’s diagram above shows the Bates stamp workflow for the traditional TIFF image e-discovery process and review. This procedure treats ESI as if it were paper, and uses sequential numbering, instead of hash, to identify information. According to Tom, this traditional procedure requires a number of time-consuming and expensive batch transfer processes. He says these steps are unnecessary and can be eliminated in pure native review that relies on hash. The simpler “Bates-free” process is shown by Tom’s diagram below. In his words, this is “an easier, faster and more cost-effective e-discovery process.”
Tom concludes that:
A modern litigation support program must be able to review native documents that are not just paper equivalents, and directly enable review of any file that is in common use in business today. The future belongs to these new technologies, where native files are processed without the need to convert to TIFF and are identified by their unique hash algorithm. Attorneys and clients who focus on a document-based system will save time and money and can conduct native file review. In today’s world of vast quantities of electronic documents, the days of the Bates stamp are numbered.
I could not agree more, especially since, unlike the title, Tom’s conclusion says the “days are numbered” and not “may be numbered.” I have no doubt about it, even though it may still take many years to get there. Old habits die hard, especially in the legal profession. Still, some day, Bates stamping will seem as quaint and antique as the original Bates numbering machine itself. The original shown above was invented in 1893. The first section of my law review article explains the history of this invention, and how Thomas Edison (shown right) purchased the patent from Edwin G. Bates. Then I go into the theory of hash and native ESI. I explain that hash is the digital fingerprint that identifies every electronic file, and reveals any change in the file. I also explain how hash is used in various e-discovery processes, and examine just about every legal decision ever written that mentions hash algorithms.
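For the technically curious, the "digital fingerprint" idea can be sketched in a few lines of Python. This is only an illustration, not the procedure of any particular e-discovery tool: the helper name sha1_hex and the two sample texts are made up for the example, and the only real library call is hashlib.sha1 from the Python standard library. The sketch hashes two byte strings that differ by a single character and shows that their hash values no longer match.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 hash of the given bytes as 40 uppercase hex characters."""
    return hashlib.sha1(data).hexdigest().upper()


# Two hypothetical file contents that differ by one character ($10,000 vs. $70,000).
original = b"The parties agree to a payment of $10,000."
altered = b"The parties agree to a payment of $70,000."

original_hash = sha1_hex(original)
altered_hash = sha1_hex(altered)

print(original_hash)
print(altered_hash)

# Identical content always yields the identical hash; any edit changes it.
if original_hash == altered_hash:
    print("hash values match: content is identical")
else:
    print("hash values differ: content was changed")
```

In practice the bytes would be read from the native file itself (for example with open(path, "rb")), so the fingerprint covers the whole file rather than any single page.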
In case you have never seen a hash value before, here is an example: 4C37FC6257556E954E90755DEE5DB8CDA8D76710. There are many different types of hash algorithms, but all produce lengthy alphanumeric hash values such as this. The two most popular are the SHA-1 hash algorithm, which creates a 40-place hash value (shown above), and the MD5 hash algorithm, which produces a 32-place value. Both are too long for a practical naming convention to replace a Bates stamp. So I propose that the value be truncated and only the first three and last three places be used. Thus the above hash would be shortened to 4C3.710. I also propose that the # symbol stand for hash. (The # symbol is already commonly known as the hash mark in most of the world, but in many English-speaking cultures, including the U.S., it is also called the number sign or the pound sign.) So I propose to abbreviate the above SHA-1 hash as #4C3.710. Some of the technical details of this naming protocol are addressed in the law review article. Others will have to be worked out with time and experience, and the adoption of more standards in the e-discovery industry.
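The truncation itself is simple enough to sketch in Python. What follows is a rough illustration of the naming idea only, under a few assumptions of my own: the function name hash_mark and its optional prefix argument are hypothetical, and the finer points of the protocol are left to the law review article. The calls to hashlib.sha1 and hashlib.md5 are from the Python standard library and are used here only to confirm the 40-place and 32-place lengths.

```python
import hashlib


def hash_mark(full_hash: str, prefix: str = "") -> str:
    """Shorten a hex hash to '#' + first three places + '.' + last three places.

    The optional prefix (hypothetical parameter) is placed before the '#'.
    """
    value = full_hash.strip().upper()
    if len(value) < 6:
        raise ValueError("hash value is too short to truncate")
    return f"{prefix}#{value[:3]}.{value[-3:]}"


# The SHA-1 example value quoted in the text.
example = "4C37FC6257556E954E90755DEE5DB8CDA8D76710"

print(hash_mark(example))               # prints "#4C3.710"
print(hash_mark(example, "Dr. Smith"))  # prints "Dr. Smith#4C3.710"

# Length check on arbitrary sample bytes: SHA-1 gives 40 places, MD5 gives 32.
sample = b"sample"
sha1_value = hashlib.sha1(sample).hexdigest()
md5_value = hashlib.md5(sample).hexdigest()
print(len(sha1_value), len(md5_value))  # prints "40 32"
```

Note that six hexadecimal places can take only 16,777,216 distinct values, so a sketch like this treats the short mark as a convenient label and keeps the full hash value on hand for actual verification.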
I conclude my article by imagining what a courtroom of the future might be like without the Bates stamp:
In countless courtrooms today, a mantra something like this is heard often: “I am handing the witness a document pre-marked as ‘Trial Exhibit 75’ and Bates stamped as ‘Dr. Smith 0573.’” In the future, the author expects something like this will be heard instead: “I am putting on screen for the witness to view an ESI file pre-marked as ‘Trial Exhibit 75’ and hash marked as ‘Dr. Smith Hash 4F7.C3B (Dr. Smith#4F7.C3B).’” The ESI file may still sometimes be converted to paper, in which case it could be handed to a witness, instead of put on a screen, but the same naming protocol would apply and it would bear a “hash mark” somewhere on the bottom: “Dr. Smith#4F7.C3B.”
Sorry, Mr. Bates, your one hundred-year-plus reign is over.