The Days of the Bates Stamp Are Numbered

As a kind of strange lawyer-mid-life-crisis, I wrote my first law review article last year: HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). Following tradition, I tried to make the opening sentences as clever as possible:

For over one hundred years, complex litigation has relied upon the ubiquitous Bates stamp to try and maintain order and clarity in paper evidence by placing sequential numbers on documents. In today’s world of vast quantities of electronic documents, the days of the Bates stamp are numbered. Instead, the future belongs to a new technology, a computer-based mathematical process known as “hash.” (emphasis added)

Ok, maybe not so clever, but still, I was delighted to see an article this week entitled Bates Stamps’ Days May Be Numbered by Tom O’Connor in Law.com’s Legal Technology section. No big surprise here as I met Tom a few weeks ago, and we talked about hash. (I tend to do that, a lot.) I liked how Tom saw the conversion from Bates stamping to hash as symbolic of a paradigm shift, not only in e-discovery, but in the world at large. Tom and a few others, such as Craig Ball, see a significance in the move to hash beyond what I understood when I wrote the article. They also have a better grasp of how this fits with other e-discovery technologies and procedures to facilitate what Tom claims are huge savings in time and money. I gave Tom a copy of my article, as he had heard about it from Craig but not yet read it. (Yes, I usually keep an extra copy in my briefcase.)

I mentioned Tom’s ideas in a prior blog, e-Discovery at the Harvard Club in New York City, based on his presentation at the CLE. The article Tom has since written, Bates Stamps’ Days May Be Numbered, provides more meat for the bones, which I will attempt to summarize here and place into proper hash context. For still more information listen to Monica Bay’s recent interview of Tom on Legal Talk Network.

For those not real clear on what hash is, and what it could possibly have to do with the 19th Century Bates stamp shown above, I suggest you read my law review article. But if the thought of reading a 44 page academic paper with 174 footnotes leaves you cold, I suggest you try my Hash Page summary instead, or my earlier blog on Hash. They will give you a pretty good idea of how hash is the mathematical foundation of e-discovery, not a corned beef dish, and why this math should render sequential numbering obsolete. There are also many interesting comments left on these blogs by experts in the field, including an esoteric argument I had with a few vendors concerning the legal efficacy of hash in ESI authentication. These short articles do not go into law-review-depth, but do lay a helpful predicate to understand what Tom is talking about.

Tom’s article begins by noting that most people doing e-discovery today still rely on Bates stamping. They scan and sequentially number ESI as if it were a piece of paper. Then he observes, as I did in my introduction, that this system will not work “in today’s world of vast quantities of electronic documents.”

But that process is simply not effective when dealing with terabytes of data. To address the sheer volume, many vendors are advocating a new way of working with electronic documents that can reduce costs as much as 65 percent by eliminating the need for text extraction and imaging in the processing phase. Beyond immediate cost savings, this approach also provides cheaper native file production, reducing imaging costs for production sets and saving up to 90 percent of the time needed to process documents. How? By not using Bates numbers on every page.

Later Tom explains that the alternative to Bates numbers is hash values. But first, he details how and why this conversion can save so much time and money:

Currently, to provide Bates numbering, many vendors generate TIFF images from native files and then Bates number those images. But this process complicates native file review and — at anywhere from eight to 20 cents per TIFF — adds considerable cost to the process. Typically, during processing, data is culled, de-duplicated; metadata and text are extracted; and then a TIFF file is created. An unavoidable consequence is that the relationship of the pages to other pages, or attachments, is broken — and then must be re-created for the review process. Page-oriented programs handle this by using a load file to tie everything together from the key of a page number. But most new software use a relational database that stores the data about a document in multiple tables. To load single page TIFFs into a relational database involves a substantial amount of additional and duplicative work in the data load process.

These steps are avoided by changing to an identification system based on hash values of entire ESI files (which Tom here calls “documents”) that eliminates the need for tracking of individual pages. Here is how Tom explains it, using a lot of e-discovery oriented tech-talk, which, if he is speaking, is usually tempered by a few laughs and war stories:

A document-based data model, rather than a page-based approach, eliminates the text extraction and image creation steps from the processing stage and cuts the cost of that process in half. Documents become available in the review platform much faster — as imaging often accounts for as much as 90 percent of the time to process. This enables early case assessment without any processing, by simply dragging and dropping a native file or a PST straight into the application — which cannot be achieved with the page-based batch process. Relational databases allow for one-to-many and many-to-many relationships and support advanced features and functions — as well as compatibility with external engines for tasks such as de-duping and concept searching. Applications that support these functions — such as software from Equivio, Recommind and Vivisimo Inc. — are all document-based and will not perform in the old page environment. Programs that use the document model can eliminate batch transfer. This process (See Diagram 1 below) increases data storage due to the need for data replication in the transfer process and is also prone to a high rate of human error. And elimination of the time that inventory (in this case, electronic data) is stationary will eliminate overall cost as well as reduce production time.

The Bates stamp ESI method
Tom’s diagram above shows the Bates stamp workflow model for the traditional TIFF image e-discovery process and review. This procedure treats ESI as if it were paper, and uses sequential numbering, instead of hash, to identify information. According to Tom, this traditional procedure requires a number of time-consuming and expensive batch transfer processes. He says these steps are unnecessary and can be eliminated in pure native review that relies on hash. The simpler “Bates-free” process is shown by Tom’s diagram below. In his words, this is “an easier, faster and more cost-effective e-discovery process.”

The new Hash based model

Tom concludes that:

A modern litigation support program must be able to review native documents that are not just paper equivalents, and directly enable review of any file that is in common use in business today. The future belongs to these new technologies, where native files are processed without the need to convert to TIFF and are identified by their unique hash algorithm. Attorneys and clients who focus on a document-based system will save time and money and can conduct native file review. In today’s world of vast quantities of electronic documents, the days of the Bates stamp are numbered.

I could not agree more, especially since, unlike the title, Tom now says the “days are numbered” and not “may be numbered.” I have no doubt about it, even though it may still take many years to get there. Old habits die hard, especially in the legal profession. Still, some day, Bates stamping will seem as quaint and antique as the original Bates numbering machine itself. The original shown above was invented in 1893. The first section of my law review article explains the history of this invention, and how Thomas Edison (shown right) purchased the patent from Edwin G. Bates. Then I go into the theory of hash and native ESI. I explain that hash is the digital fingerprint that identifies every electronic file, and reveals any change in the file. I also explain how hash is used in various e-discovery processes, and examine just about every legal decision ever written which mentions hash algorithms.

In case you have never seen a hash value before, here is an example: 4C37FC6257556E954E90755DEE5DB8CDA8D76710. There are many different types of hash formulas, but all produce lengthy alphanumeric hash values such as this. The two most popular are the SHA-1 hash algorithm, which creates a 40 place hash value (shown above), and the MD5 hash, which produces a 32 place value. Both are too long for a practical naming convention to replace a Bates stamp. So I propose that the value be truncated and only the first and last three places be used. Thus the above hash would be shortened to 4C3.710. I also propose that the # symbol stand for hash. (The # symbol is already commonly known as the hash mark in most of the world, but in many English speaking cultures, including the U.S., it is also called the number sign or the pound sign.) So I propose to abbreviate the above SHA-1 hash with #4C3.710. Some of the technical details of this naming protocol are addressed in the law review article. Others will have to be worked out with time and experience, and the adoption of more standards in the e-discovery industry.
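For readers who want to see the truncation in action, here is a minimal Python sketch of the idea. It uses the standard hashlib module; the function names sha1_hex and hash_mark are illustrative assumptions made for this example and do not come from the law review article or any e-discovery product.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the 40 place SHA-1 hash of the given bytes, in upper case."""
    return hashlib.sha1(data).hexdigest().upper()


def hash_mark(full_hash: str, places: int = 3) -> str:
    """Shorten a full hash to '#' + first places + '.' + last places.

    This is only a sketch of the naming proposal described above
    (hypothetical helper, not a standard).
    """
    return f"#{full_hash[:places]}.{full_hash[-places:]}"


# The example SHA-1 value quoted in the text.
example = "4C37FC6257556E954E90755DEE5DB8CDA8D76710"
print(hash_mark(example))  # prints #4C3.710

# Changing even one byte of a file produces a different fingerprint.
print(sha1_hex(b"contract draft v1") == sha1_hex(b"contract draft v2"))  # prints False
```

In practice the bytes would be read from the native ESI file itself (for example with open(path, "rb")), so the hash covers the whole file and not a page image.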

I conclude my article by imagining what a courtroom of the future might be like without the Bates stamp:

In countless courtrooms today, a mantra something like this is heard often: “I am handing the witness a document pre-marked as ‘Trial Exhibit 75’ and Bates stamped as ‘Dr. Smith 0573.’” In the future, the author expects something like this will be heard instead: “I am putting on screen for the witness to view an ESI file pre-marked as ‘Trial Exhibit 75’ and hash marked as ‘Dr. Smith Hash 4F7.C3B (Dr. Smith#4F7.C3B).’” The ESI file may still sometimes be converted to paper, in which case it could be handed to a witness, instead of put on a screen, but the same naming protocol would apply and it would bear a “hash mark” somewhere on the bottom: “Dr. Smith#4F7.C3B.”

Sorry, Mr. Bates, your one hundred-year-plus reign is over.

10 Responses to The Days of the Bates Stamp Are Numbered

  1. […] Losey, “The Days of the Bates Stamp Are Numbered.”  This picks up on an great law review article Mr. Losey wrote last fall and some recent […]

  2. […] To shed a little more light on the topic, the editor of Law Technology News, Monica Bay, interviewed Tom on her Law Technology Now podcast (playable online here from the Legal Talk Network). Tom explains that attorneys are “very wed” to the traditional Bates numbers that have managed documents for years in the legal world. The equivalent of a Bates number for electronic files is a hash function which is a unique string of numbers that can be applied to each individual, all-inclusive file. For more information on this concept, visit Ralph Losey’s blog post “The Days of the Bates Stamp are Numbered.” […]

  3. Troy says:

    Several years ago I worked out a simple way to Bates number electronic files for eDiscovery (which I set out in my chapter on litigation support in The Handbook of Computer Crime Investigation). Dan Mares, one of the greats of digital forensics, wrote the first Bates numbering tool based on my idea. Later, Christopher Brown of Technology Pathways implemented my Bates numbering technique in his ProDiscovery forensics tool.

    So, Bates Numbering doesn’t have to die in the digital age, but you are quite correct that the old paper-based thinking does.

    Thanks.

  4. […] Technology Law & Policy 1 (June 2007). A few months ago I wrote a blog on the article called The Days of the Bates Stamp Are Numbered, talking about some of the more recent developments in this area of the law, especially the shift […]

  5. Paul says:

    Excellent article. However, there are two more brief periods in the history of “Bates Stamping” that the author did not mention, perhaps because they were so brief and mainly confined to New York City. The first was in the late 1980s. Xerox bought a patent for a device that plugged into their 1090 copiers and added titles and sequential numbers to the copy sets. This was very popular in NYC legal copy shops, since the inventor of the idea was a New Yorker. The operator would first have to physically count the pages to find out what the last number of the stack was going to be, and punch that into the device (because that is how the copier fed the originals). The key operator always had to keep adding to the last number in increments of 100 or so (as much as the feeder would hold). Although it sounds tedious, millions of pages were produced to law firms using this method.
    Then in the 1990s a few (four, I believe) of the large NYC legal copy vendors used inkjet mailing machines to add titles and numbers to the copy set. These machines were typically used for high speed address mailing, or by Pepsi and Coke to spray the expiration date on bottles. By turning the spray head face up and using a high speed conveyor belt, the copies would pass by the sensor and each copy would be “sprayed” with both titles and incrementing numbers in either red or black ink. The ink would dry immediately. It was common to “Bates” number 50, 60, even 70 banker boxes of documents this way in a 24 hour period.
    Gotta love the ingenuity of New Yorkers!

  6. EOM Curator says:

    To see where Bates numbering machines fit into the early history of office technology and equipment, see the Early Office Museum’s online exhibit on Antique Date, Time, Number & Name Stamps at http://www.earlyofficemuseum.com/stamps.htm

  7. Todd says:

    This looks like good advice. Note, however, that there is no reason to prefer any particular digit of an MD5 or SHA1 hash. The hashes are cryptographic strength and there is no bias favoring one digit over another.
    Wikipedia has good descriptions of both of these hash functions for those interested. Taking digits from the front of the hash allows for variable length Bates numbering. The number of digits needed depends on the number of documents and the desired probability of avoiding a “collision” where two documents have the same number. 6 digits (which equals 24 bits in this case, since the digits are hex digits) results in only a 3% chance that there will be a collision in 1,000 documents. However, if 3,000 documents are being considered, there would be a 24% chance that two would have the same 6 digit hash. To keep the odds of two documents having the same hash low, say under 1%, we can use these guidelines:

    1,000 documents use 7 digits from MD5 hash (0.2%)
    10,000 documents use 8 digits from MD5 hash (1.2%)
    100,000 documents use 10 digits from MD5 hash (0.4%)
    500,000 documents use 11 digits from MD5 hash (0.7%)
    1,000,000 documents use 12 digits from MD5 hash (0.2%)
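The percentages in this comment can be checked with the standard “birthday problem” approximation. The sketch below is only an illustration in Python: it assumes the truncated digits are uniformly distributed hex digits, and the function names collision_probability and min_digits are hypothetical, chosen for this example.

```python
import math


def collision_probability(num_docs: int, hex_digits: int) -> float:
    """Approximate chance that at least two of num_docs files share the
    same truncated hash, using the birthday approximation
    p = 1 - exp(-n(n-1) / (2N)), where N = 16 ** hex_digits."""
    space = 16 ** hex_digits
    exponent = num_docs * (num_docs - 1) / (2 * space)
    return -math.expm1(-exponent)  # equals 1 - exp(-exponent), more precise for small values


def min_digits(num_docs: int, max_probability: float) -> int:
    """Smallest number of hex digits keeping the collision chance below max_probability."""
    digits = 1
    while collision_probability(num_docs, digits) >= max_probability:
        digits += 1
    return digits


for docs, digits in [(1_000, 6), (3_000, 6), (1_000, 7), (10_000, 8),
                     (100_000, 10), (500_000, 11), (1_000_000, 12)]:
    print(f"{docs:>9,} documents, {digits:>2} digits: "
          f"{collision_probability(docs, digits):.1%}")
```

Under these assumptions the results line up with the figures quoted above. Note that 8 digits for 10,000 documents comes out slightly above 1%, consistent with the 1.2% shown in the list, so a strict under-1% target would call for 9 digits at that volume.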

  8. This was a great article. We have tried to improve on the age old physical stamp by providing a very robust Bates engine. You have made me rethink what our next step will be in this arena.

  9. Melissa Smith says:

    I have a Bates Machine Co Model 49 numbering machine. I’ve been trying to research it but I am having little to no luck finding information on this particular model. If you have any information on it or know where I could turn, I would greatly appreciate the help. Thank you.

  10. […] did this happen?  Certainly the technology has been eclipsed.  But did it have to happen this way?  Couldn’t the Bates Manufacturing Company and its […]
