A computer hash is a one-way mathematical algorithm, related to but distinct from encryption, that forms the mathematical foundation of e-discovery. Hashing generates a practically unique alphanumeric value to identify a particular computer file, group of files, or even an entire hard drive. As an example, the hash of the animated GIF file is shown above. The alphanumeric value of a computer file is called its “hash value.” Hash is also known in mathematical parlance as the “condensed representation” or “message digest” of the original message. It is more popularly known today as a “digital fingerprint.” Hash is the bedrock of e-discovery because the digital fingerprint lets you verify the authenticity of data and detect any alteration, whether negligent or intentional. Hash also allows for the identification of particular files, and the easy filtration of duplicate documents, a process called “deduplication” that is essential to all e-discovery document processing.
Hash is my favorite e-discovery technology. I became fascinated by its great potential as a safeguard for electronic evidence, and ended up reading about and experimenting with this algorithm in depth. Ultimately I wrote a forty-four-page law review article on the subject. HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). There I discuss hash at length and review just about every case that mentions it. The article has 174 footnotes to provide reference to almost everything on the subject that might be of interest to a lawyer or others in the e-discovery field. As the title suggests, I make a specific proposal in the article for the adoption of an e-discovery file naming protocol based on hash to replace the paper-oriented Bates stamp. For more background on this law review article see my prior blog about it. For information on how the use of hash, instead of Bates stamps, is much more efficient and saves money in e-discovery processing, see my other blog The Days of the Bates Stamp Are Numbered.
Technically, hashing is based on the substitution and transposition of data by various mathematical formulas. Thus the process is called “hashing,” in the linguistic sense of “to chop and mix.” The hash value is commonly represented as a short string of random-looking letters and numbers, which are actually binary data written in hexadecimal notation. Hash is commonly called a file’s “fingerprint” because, for all practical purposes, it identifies that file uniquely.
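For readers who want to see the hexadecimal point for themselves, here is a minimal sketch in Python using its standard hashlib module. The sample text is my own illustration and is not taken from any file discussed here. It shows that an MD5 value is 16 bytes of binary data that display as 32 hexadecimal characters, and a SHA-1 value is 20 bytes that display as 40.

```python
import hashlib

# Illustrative sample data (an assumption, not a file from this article).
data = b"The quick brown fox jumps over the lazy dog"

md5_raw = hashlib.md5(data).digest()        # raw binary hash value: 16 bytes
md5_hex = hashlib.md5(data).hexdigest()     # the same bytes written as hex text
sha1_raw = hashlib.sha1(data).digest()      # raw binary hash value: 20 bytes
sha1_hex = hashlib.sha1(data).hexdigest()   # the same bytes written as hex text

print(len(md5_raw), "bytes ->", md5_hex)
print(len(sha1_raw), "bytes ->", sha1_hex)
```

Each byte becomes two hexadecimal characters, which is why the printed strings are exactly twice as long as the underlying binary values.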
If two computer files are identical, then they will have the same hash value. Even if the files have different names, if their contents are exactly the same, they will have the same hash. This allows for easy identification and elimination of redundant documents, the deduplication process mentioned above. But if you so much as change a single comma in a thousand-page text, it will have a completely different hash value than the original. Similar files do not produce similar hash values; the hash of one tells you nothing about the hash of the other. That is how the math in this kind of hashing works.
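The deduplication idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the file names and contents are hypothetical, the "files" are held in memory rather than read from disk, and `group_by_hash` is a helper name I made up, not a function from any e-discovery product.

```python
import hashlib
from collections import defaultdict

def group_by_hash(files):
    """Map each SHA-1 hex value to the list of file names whose contents share it."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.sha1(content).hexdigest()].append(name)
    return dict(groups)

# Hypothetical documents: name -> contents as bytes.
files = {
    "contract.doc": b"This agreement is made on the first day of May.",
    "copy of contract.doc": b"This agreement is made on the first day of May.",
    "contract_v2.doc": b"This agreement is made on the first day of June.",
}

groups = group_by_hash(files)
# Any hash value shared by more than one name marks a set of duplicates.
duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
```

The first two documents have different names but identical contents, so they land in the same group. The third differs by a single word and gets a hash value of its own.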
Many kinds of effective hash formulas have been invented, but two are in wide use today: the SHA-1 and MD5 algorithms. Both were designed so that it would be “computationally infeasible” for two different files to produce the same hash value, and for files that differ by accident or by ordinary editing that remains true as a practical matter. (Researchers have since shown how to construct colliding files deliberately, for MD5 in 2004 and for SHA-1 in 2017, so newer algorithms such as SHA-256 are preferred where deliberate forgery is a concern.) That is why hashing is commonly employed in data transmissions to verify that the integrity of a file has been maintained in transmission. If you hash the file received, and it does not produce the same hash value, then it has been corrupted, and at least one byte is not the same as the original. It is an extremely reliable way of verifying the integrity of an electronic file.
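The transmission check described above might look like this in Python. It is a sketch, not a definitive implementation: `file_sha1` and `is_intact` are hypothetical helper names of my own, and the 64 KB chunk size is an arbitrary choice so that large files do not have to be loaded into memory all at once.

```python
import hashlib

def file_sha1(path, chunk_size=65536):
    """Return the SHA-1 hex value of the file at `path`, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() keeps calling f.read() until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_intact(path, expected_hex):
    """True if the received file's hash matches the value the sender published."""
    return file_sha1(path) == expected_hex.strip().lower()
```

The sender publishes the hash value alongside the file. The recipient runs `is_intact` on the received copy with that value, and a result of False means the copy is not byte-for-byte the same as the original.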
Software to run both the SHA-1 and MD5 hash analysis of files is widely available, easy to use and free. I use the HashTab shell extension for Windows, available for free at http://www.beeblebrox.org/software.php. The hash value of any file can be instantly determined, regardless of the type of electronic file, including graphics. For instance, the hash values of a Word document I am working on now are:
If I only change one comma in this multipage document, all else remaining the same, the hash values are now:
Although the two files have only this trivial difference, there are no similarities in these hash values, proving that hashing will detect even the slightest file alteration.
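You can reproduce this kind of experiment without a Word document. The sketch below, in Python, uses two sample sentences of my own (not the document described above) that differ by a single comma, and a small hypothetical helper, `differing_bits`, that counts how many of the 128 bits in the two MD5 values are different.

```python
import hashlib

# Two illustrative texts that differ only by one comma.
original = b"Hash is my favorite e-discovery technology, and it is free."
altered = b"Hash is my favorite e-discovery technology and it is free."

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(altered).hexdigest()

def differing_bits(hex_a, hex_b):
    """Count the bit positions at which two hex-encoded hash values differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

print(h1)
print(h2)
print("bits that differ:", differing_bits(h1, h2), "of 128")
```

With a well-behaved hash algorithm you would expect roughly half of the bits to change, which is why the two values look unrelated to the eye.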
Hashing can also be used to determine when fields or segments within files are identical, even though the files as a whole might be quite different. This requires special software, which is commonly available from many e-discovery vendors, for a price. This software allows you to hash only portions of a file. Thus, for instance, you can hash only the body of an email, the actual message, to determine whether it is identical to another email, even when the “reference” or the “to” and “from” fields are different. This allows for an important filtering process called “near de-duplication.”
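To make the idea of hashing only part of a file concrete, here is a rough sketch in Python using the standard email module. It is not how any particular vendor's software works. The two messages are invented, `body_hash` is a hypothetical helper of my own, the messages are assumed to be simple single-part plain text, and collapsing whitespace before hashing is my own assumption about what a reasonable normalization might be.

```python
import hashlib
from email import message_from_string

def body_hash(raw_email):
    """Hash only the message body, ignoring all header fields."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()  # a str for simple single-part messages
    # Assumption: collapse runs of whitespace so trivial line-wrapping
    # differences do not change the hash value.
    normalized = " ".join(body.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Two invented emails: same body, different "To" field.
email_a = "From: alice@example.com\nTo: bob@example.com\nSubject: Lunch\n\nSee you at noon.\n"
email_b = "From: alice@example.com\nTo: carol@example.com\nSubject: Lunch\n\nSee you at noon.\n"

same_body = body_hash(email_a) == body_hash(email_b)
same_file = (hashlib.sha1(email_a.encode("utf-8")).hexdigest()
             == hashlib.sha1(email_b.encode("utf-8")).hexdigest())
print("bodies match:", same_body, "| whole messages match:", same_file)
```

Hashed as whole files the two messages are different, but hashed by body alone they are identical, which is exactly the distinction near de-duplication relies on.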