Hash

 

51FEC3B6FCB1E7D5465575BED5DCDC1B8897AE5A

Computer hashing

Hashing is a mathematical technique that forms the foundation of e-discovery. Although hash functions come from the field of cryptography, hashing is not encryption: a hash is a one-way calculation that cannot be reversed to recover the original data. Hashing generates a practically unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive. As an example, the hash of the animated GIF file is shown above. The alphanumeric value of a computer file is called its “hash value.” A hash value is also known in mathematical parlance as the “condensed representation” or “message digest” of the original message. It is more popularly known today as a “digital fingerprint.” Hash is the bedrock of e-discovery because the digital fingerprint lets you verify the authenticity of data and detect any alteration, whether negligent or intentional. Hash also allows for the identification of particular files, and the easy filtration of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.

Hash is my favorite e-discovery technology. I became fascinated by its great potential as a safeguard for electronic evidence in the future, and ended up reading about and experimenting with hash algorithms in depth. Ultimately I wrote a forty-four page law review article on the subject. HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). There I discuss hash at length and review just about every case that mentions it. The article has 174 footnotes to provide reference to almost everything on the subject that might be of interest to a lawyer or others in the e-discovery field. As the title suggests, I make a specific proposal in the article for the adoption of an e-discovery file naming protocol based on hash to replace the paper-oriented Bates stamp. For more background on this law review article see my prior blog post about it. For information on how the use of hash, instead of Bates stamps, is much more efficient and saves money in e-discovery processing, see my other blog post, The Days of the Bates Stamp Are Numbered.

Technically, hashing is based on the substitution and transposition of data by various mathematical formulas. Thus the process is called “hashing,” in the linguistic sense of “to chop and mix.” The hash value is commonly represented as a short string of random-looking letters and numbers, which is actually binary data written in hexadecimal notation: an MD5 value is 128 bits, written as 32 hexadecimal characters, and a SHA-1 value is 160 bits, written as 40. Hash is commonly called a file’s “fingerprint” because, for all practical purposes, it identifies that file’s exact contents and no other.
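For readers who want to see the hexadecimal point for themselves, here is a minimal sketch in Python using its standard hashlib module. The choice of Python and the sample text are my assumptions for illustration only; any hashing tool should show the same relationship between the raw bytes and the hexadecimal string.

```python
import hashlib

# Sample input chosen only for illustration.
data = b"The quick brown fox jumps over the lazy dog"

# digest() returns the raw binary hash; hexdigest() returns the same
# bytes written out as hexadecimal text (two characters per byte).
sha1_raw = hashlib.sha1(data).digest()
sha1_hex = hashlib.sha1(data).hexdigest()
md5_raw = hashlib.md5(data).digest()
md5_hex = hashlib.md5(data).hexdigest()

print("SHA-1:", len(sha1_raw), "bytes ->", len(sha1_hex), "hex characters")
print("MD5:  ", len(md5_raw), "bytes ->", len(md5_hex), "hex characters")
print(sha1_hex.upper())
```

The hexadecimal string carries no information beyond the raw bytes. It is simply a form that people can read, copy, and compare.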

If two computer files are identical, then they will have the same hash value. Even if the files have different names, if their contents are exactly the same, they will have the same hash. This allows for easy identification and elimination of redundant documents, the deduplication process mentioned above. But if you so much as change a single comma in a thousand-page text, it will have a completely different hash value than the original. Similar files do not produce similar hash values. For all practical purposes, each value is unique. That is how the math in this kind of hashing works.
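Deduplication by hash can be sketched in a few lines of Python. This is an illustration under my own assumptions, not how any particular e-discovery product works: the file names and contents are hypothetical, and real tools read files from disk rather than from a dictionary.

```python
import hashlib
from collections import defaultdict

def sha1_hex(content: bytes) -> str:
    """Return the SHA-1 hash value of the given content as hex text."""
    return hashlib.sha1(content).hexdigest()

# Hypothetical collection: file name -> file contents.
documents = {
    "contract_final.doc": b"The parties agree to the terms below.",
    "Copy of contract.doc": b"The parties agree to the terms below.",
    "contract_draft.doc": b"The parties agree, to the terms below.",
}

# Group file names by the hash of their contents. Names are ignored,
# so identical contents land in the same group.
groups = defaultdict(list)
for name, content in documents.items():
    groups[sha1_hex(content)].append(name)

# Keep the first file in each group; the rest are exact duplicates.
unique = [names[0] for names in groups.values()]
duplicates = [n for names in groups.values() for n in names[1:]]

print("Keep:   ", unique)
print("Discard:", duplicates)
```

The two files with identical contents share one hash value despite their different names, while the draft with one extra comma gets a value of its own.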

Many kinds of effective hash formulas have been invented, but two are in wide use today: the SHA-1 and MD5 algorithms. Both were designed so that it is “computationally infeasible” for two different files to produce the same hash value. Researchers have shown that collisions can be deliberately engineered, particularly for MD5, but an accidental match between two different files remains vanishingly unlikely. That is why hashing is commonly employed in data transmissions to verify that the integrity of a file has been maintained in transmission. If you hash the file received, and it does not produce the same hash value, then it has been corrupted, and at least one bit is not the same as the original. It is a highly reliable way of verifying the integrity of an electronic file.
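The transmission check described above might look like the following Python sketch. The function names hash_file and is_intact are hypothetical, as are the temporary files, which I create here only to simulate a file that was damaged in transit.

```python
import hashlib
import os
import tempfile

def hash_file(path, algorithm="sha1", chunk_size=65536):
    """Hash a file in chunks so that large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_intact(path, expected_hex, algorithm="sha1"):
    """Compare a file's hash with the value supplied by the sender."""
    return hash_file(path, algorithm) == expected_hex.strip().lower()

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "sent.bin")
    received = os.path.join(tmp, "received.bin")

    with open(original, "wb") as f:
        f.write(b"exhibit data" * 1000)
    published = hash_file(original)  # the value the sender would publish

    # Simulate corruption: the last byte differs from the original.
    with open(received, "wb") as f:
        f.write(b"exhibit data" * 999 + b"exhibit dat4")

    intact_original = is_intact(original, published)
    intact_received = is_intact(received, published)

print("Original intact:", intact_original)
print("Received intact:", intact_received)
```

Reading in chunks is a design choice that lets the same function handle anything from a short memo to a full disk image.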

Software to run both the SHA-1 and MD5 hash analysis of files is widely available, easy to use and free. I use the HashTab shell extension for Windows, available for free at http://www.beeblebrox.org/software.php. The hash value of any file can be determined almost instantly, regardless of the type of electronic file, including graphics. For instance, the hash values of a Word document I am working on now are:
MD5: 588BCBD1845342C10D9BBD1C23294459
SHA-1: C24AE3125BFDBCE01A27FDDA21B3A7E83FAFF69E
If I only change one comma in this multipage document, all else remaining the same, the hash values are now:
MD5: 5F0266C4C326B9A1EF9E39CB78C352DC
SHA-1: 4C37FC6257556E954E90755DEE5DB8CDA8D76710
Although the two files have only this trivial difference, the two sets of hash values bear no meaningful resemblance to each other, showing that hashing will detect even the slightest file alteration.
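You can repeat the one-comma experiment without HashTab. The Python sketch below is only an illustration: the sample sentence is mine, and the helper names digests and matching_positions are hypothetical. It removes a single comma and then counts how many character positions happen to agree between the old and new values.

```python
import hashlib

# Sample text chosen for illustration; the altered copy drops one comma.
original = b"Hash is the bedrock of e-discovery, and it is easy to compute."
altered = original.replace(b",", b"", 1)

def digests(data: bytes) -> dict:
    """Return the MD5 and SHA-1 values of data as uppercase hex text."""
    return {
        "MD5": hashlib.md5(data).hexdigest().upper(),
        "SHA-1": hashlib.sha1(data).hexdigest().upper(),
    }

def matching_positions(a: str, b: str) -> int:
    """Count the positions at which two hex strings show the same character."""
    return sum(x == y for x, y in zip(a, b))

before = digests(original)
after = digests(altered)

for name in before:
    print(name, "before:", before[name])
    print(name, "after: ", after[name])
    print(name, "positions in common:", matching_positions(before[name], after[name]))
```

A few positions will usually agree by pure chance, since there are only sixteen hexadecimal characters to choose from. In my own values above, for example, both MD5 strings happen to begin with 5. Those chance matches tell you nothing about how similar the underlying files are.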

Hashing can also be used to determine when fields or segments within files are identical, even though the files as a whole might be quite different. This requires special software that can hash only portions of a file, which is commonly available from many e-discovery vendors, for a price. Thus, for instance, you can hash only the body of an email, the actual message, to determine whether it is identical to that of another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
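To make the idea concrete, here is a rough Python sketch of hashing only the body of an email. It is not how any vendor's product works. The sample messages and the body_hash helper are hypothetical, it handles only simple single-part messages, and the whitespace normalization step is my own assumption about what a tool might reasonably do.

```python
import hashlib
from email import message_from_string

def body_hash(raw_email: str) -> str:
    """Hash only the message body, ignoring all header fields."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()  # a string for simple, single-part messages
    # Assumption: collapse runs of whitespace so that differences in line
    # wrapping do not change the result.
    normalized = " ".join(body.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

email_a = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon on Friday.\n"
)
email_b = (
    "From: alice@example.com\n"
    "To: carol@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "See you at noon on Friday.\n"
)

whole_a = hashlib.sha1(email_a.encode("utf-8")).hexdigest()
whole_b = hashlib.sha1(email_b.encode("utf-8")).hexdigest()

print("Whole messages match:", whole_a == whole_b)
print("Bodies match:        ", body_hash(email_a) == body_hash(email_b))
```

The two messages differ only in the “to” field, so their whole-file hashes differ while their body hashes agree.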

25 Responses to Hash

  1. Jr says:

    Nice BLOG!

  2. AS says:

    Has anyone found lots of issues trying to decide which e-mail message properties to be used to create MD5 hash? Not to mention the timezone issues, where communication happened internationally…

  3. Jeff says:

    Why are they so trusting of MD5/SHA1 both are circumventable. There are now ways to make the hashes match even if there are TOTALLY different files. REF:

    http://www.schneier.com/blog/archives/2005/06/more_md5_collis.html

    There are also several tools now to modify MD5/SHA1.

    http://www.stachliu.com/collisions.html

  4. Ralph Losey says:

    Thanks for the comment. The articles you reference are interesting, but the false collisions engineered by experts which are discussed at these sites, and elsewhere, are not a cause for any real concern in the e-discovery arena. Here we primarily use hash to verify that ESI has not been altered, and to determine if two files stored on systems are identical. e-Discovery is not using hash for encryption purposes. These studies do, however, explain why the spy agencies are moving to new hash formulas.

  5. […] I totally agree that we can do away with the Bates system for identifying unique documents in litigation and move towards hashing them […]

  6. Jonathan Jaffe says:

    First, the proposal of using hashing algorithms to identify unique documents, to preserve their integrity, and to reduce duplicative document production is worthy of great praise. Thank you, Ralph, for ardently promoting this wonderful idea.

    To strengthen the argument I’d like to point out something that needs correction. I don’t intend to be argumentative, but as attorneys, we know that using terminology correctly helps avoid serious misunderstandings, particularly between two groups entrenched in their own jargon.

    ‘Hash’ is not a proper noun, and to use it as such increases the disconnect between lawyers and IT professionals. The distinction warrants attention lest IT folk continue to look at lawyers sideways as lawyers continue to mis-abuse technological terms.

    In cryptography, Hash, as a proper noun, is non-existent. It might work to call all vacuum cleaners Hoovers, but where hash algorithms must change frequently to keep up with processing power, no algorithm will ever be the Google of search engines.

    There are hash functions and there are hash algorithms, and ‘Hash’ by itself may taste good in the south, or it may get you sky-high in Amsterdam, but it is not a proper noun in technology. Properly used, hash is either employed as a verb, as in ‘to hash a file’, or as a noun, as in, that hash is the *result* of a hash function. Actually, it would be semantically correct to say, “that hash value”, but IT folk know that you’re talking about the value when you say “the resulting hash”.

    What hash is not, however, is a proper noun. To use it as such makes an IT person think, “Huh? Can you please tell me *which* algorithm you’re talking about?”

    Separately, as has been pointed out in other comments, hashing a file does *not* guarantee you a unique value. The questions to be answered by the courts are, what is the probability of collision, and is that uncertainty acceptable? Using two different algorithms on the same file to generate two hash values, however, significantly moves the probability of a collision towards zero.

    All-in-all, the idea is excellent, and the article above was a pleasure to read. Thank you again for forwarding a much-needed idea.

  7. Michael Dodson says:

    HashTab for the Mac is here:
    http://beeblebrox.org/hashtabmac/

    And the link in the blog above for HashTab 3.0 (Windows) did not work for me. The home page does:
    http://beeblebrox.org/

  8. Eli Nelson says:

    My apologies for the late response to an old article, but your hypothesis about using hash values as “the new Bates stamp” always really bothered me, but I didn’t want to air out my opinion until I had something truly constructive to contribute (next paragraph – but first, the rant). Hash values are like reading bar codes manually and just about as prone to transcription errors. Furthermore, they are non-sequential (by their very nature), and utterly useless for anything other than virtually-unique file identification. These problems make them highly problematic as practical substitutes for Bates numbers.

    If we do want to consider using hash values in lieu of Bates numbers, or asking hash values to do *anything* except identify files as unique, maybe we could discuss the merits of using perceptual, rather than cryptographic, hashing algorithms? I do not know if perceptual hashing is as good at avoiding the potential false duplicate problem as using MD5 or SHA (which is of course the primary reason to compute hash values), but its benefit is that files that are pretty similar will have pretty similar hash values computed. This means that we will be able to look at two relatively similar hash values and know there is some similarity between the underlying documents.

    Maybe this is superfluous, since there is a lot of reasonable near duplicate technology out there, but it at least puts something on the table in terms of asking hash values to perform double duty as something besides a “unique” file identifier. If there’s a way to incorporate a sequencing prefix as well, to account for the chronological order in which documents were sent/modified/created, or at least processed, then we would really start moving toward a legitimate replacement for the tried and true Bates stamp.

    Just my $.02

  9. Leslee Ellenson says:

    I’ve been a paralegal for 30+ years. It has been critical to be able to identify the source of a document (i.e., materials from client, materials from EEOC, documents produced by a named party, etc.) and to be able to number the documents sequentially. “Hashing,” whether a misappropriated noun, or a modern high-tech verb/gerund, causes me concern. I need to be able to identify a document with just a glance at the stamp it bears. In my humble opinion, bates stamping will remain viable until “hashing” becomes more user friendly, or until the entire legal industry settles on an overall protocol for identifying, producing, and numbering ESI, whether in native format or other formats.

  10. Doug Jarvis says:

    Having reviewed depositions over the years and seen how Bates numbered pages are constantly referred to for identification, I can’t possibly see how a 40-character alphanumeric value is going to gently roll off of any attorney’s tongue. There are many practical, real-world situations that make Hash values unrealistic.

  11. […] line about ‘acceptable losses’ was not in the safety memo when I created it”?  This is where hash value becomes a wonderful thing.  Computing the hash of an electronic file, or computing a hexadecimal […]

  12. […] or “fingerprint” of a file.  (For a thorough discussion of hash values in eDiscovery, see Ralph Losey’s excellent e-Discovery Team blog entry and Ralph’s related Law Review article).   The hash of a file is normally based upon its […]
