Part of my discipline as an e-discovery specialist is to try to read (or at least skim) every published opinion on the subject. Lots of attorneys specializing in this area do that. But there is one other type of case I also read: every opinion that uses the word “hash.” No, I do not need help from Narcotics or Overeaters Anonymous. The kind of hash I am addicted to is purely algorithmic. This hash comes in many flavors, but the best known, and the ones usually employed in e-discovery, are called MD5 hash, SHA-1 hash, or the latest and greatest, SHA-2 hash.
As I explain in my blog Hash page, hash is the mathematical foundation of e-discovery and the most powerful tool of any forensic investigator. It reveals the unique mathematical fingerprint of every computer file that allows for perfect identification and authentication of electronic evidence. I became fascinated with the powers of hash a few years ago, and ended up writing a lengthy law review article on the subject. HASH: The New Bates Stamp, 12 Journal of Technology Law & Policy 1 (June 2007). A few months ago I wrote a blog on the article called The Days of the Bates Stamp Are Numbered, talking about some of the more recent developments in this area of the law, especially the shift from Tiffing and linear flat file Bates stamping to native file hash marking.
In the process of researching the original law review article, I am pretty sure I read every legal opinion and legal article ever written that mentions hash. I also read a few scientific and cryptological articles, most of which I did not really understand. Having put that much time and effort into the subject, I try to keep up by reading every new legal opinion or article mentioning hash. That is why I have a standing search for all cases using the term, and automatically receive a copy of them by email as soon as they are published. I can be in the middle of dinner and my BlackBerry will buzz, alerting me to a new hash case. Lest you think that’s a tad weird, I am willing to bet that there are a few other hash enthusiasts out there (Craig Ball comes to mind) who do the same thing. (See Craig Ball’s excellent article “In Praise of Hash” at pg. 52.)
Hash and Child Pornography
Most of the new hash cases I see have nothing to do with e-discovery per se. Instead, they are usually criminal law cases, typically cases involving one of the most disgusting of crimes, child pornography. Police have been using hash to catch perps in this area for years. Hash is an effective tool for this because it allows police to know if certain child pornography is located on a computer, usually videos or still photos, by looking to see if the hash values for these files are present. That is a bit of an over-simplification, but suffice it to say that there are lists of hash values that are known to be associated with computer files which are unquestionably child pornography. New York Attorney General Andrew Cuomo explained the process in a press release in June 2008 announcing a deal with major Internet providers to block major sources of child pornography:
As part of the undercover investigation, the Attorney General’s office developed a new system for identifying online content that contains child pornography. Every online picture has a unique “Hash Value” that, once identified and collected, can be used to digitally match the same image anywhere else it is distributed. By building a library of the Hash Values for images identified as being child pornography, the Attorney General’s investigators were able to filter through tens of thousands of online files at a time, speedily identifying which Internet Service Providers were providing access to child pornography images.
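The matching process described in the press release can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sha1_hex` and `find_known_files` are hypothetical, the "files" are stand-in byte strings, and the real investigative tools are certainly more elaborate than this.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 hash of some bytes as 40 uppercase hexadecimal characters."""
    return hashlib.sha1(data).hexdigest().upper()


def find_known_files(files: dict[str, bytes], known_hashes: set[str]) -> list[str]:
    """Return the names of files whose SHA-1 value appears in a library of known values.

    `files` maps a file name to its contents; `known_hashes` is the library
    of hash values (hypothetical stand-ins here) built up in advance.
    """
    return sorted(name for name, data in files.items() if sha1_hex(data) in known_hashes)


# Stand-in data: the library holds the hash of one previously identified file.
library = {sha1_hex(b"previously identified file")}
candidates = {
    "renamed_copy.bin": b"previously identified file",  # same bytes, new name
    "unrelated.bin": b"something else entirely",
}
print(find_known_files(candidates, library))
```

The point of the sketch is that the file name is irrelevant. Only the contents determine the hash value, so a renamed copy still matches the library.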
U.S. v. Warren
I recently received a new hash case alert from a district court in Missouri. U.S. v. Warren, 2008 WL 3010156 (E.D.Mo. July 24, 2008). A quick review showed it was yet another child porn case, so I did not think much about it. I just added it to my reading list for more careful study later, just in case there might be something special about it. When I got around to reading Warren yesterday, I was very pleasantly surprised, as this was indeed a special case.
Warren is a case considering and rejecting a motion to suppress evidence, namely computer video files of underage teens having sex. The motion to suppress was based on a series of hyper-technical challenges to the affidavit which the St. Louis police submitted to the judge to receive a search warrant of defendant’s computer. The affidavit explained how the police had searched the Internet for files “whose digital SHA-1 value was identical to that of a file known to contain child pornography.” They found a computer with an Internet Protocol address of 70 … 167 offering to share one such known file, and then subpoenaed AT&T to get the physical address of the subscriber with that IP address. The computer was located in Affton, Missouri.
The police detective’s affidavit explained how the hash values and offer to upload established “that a computer in Missouri was ‘offering to participate in the distribution of known child pornography.’” Based on this affidavit, the judge found probable cause to issue the search warrant of the computers located in Warren’s home. The police then went to his home, found no one there, forced entry, and seized his computer. Warren himself later came along, and, foolishly enough, voluntarily came to the police station, waived his right to counsel several times, and spoke at length to the police. The opinion includes extensive excerpts of the taped interview, which Warren later argued was made in violation of his right to legal counsel.
The defendant’s technical search warrant objections forced the court to delve into many of the characteristics and evidentiary properties of hash. For that reason alone, the case is useful to any practitioner trying to better understand the subject. But what is really special about the case, at least for me, is the system of hash file identification used by the court to identify the offending video at issue in this case. That video computer file was the key piece of evidence, the “smoking gun.”
Six-Place Hash Truncation Naming Protocol
The opinion by Magistrate Judge David D. Noce in Warren is unusual and special because it is the first case to use the truncated hash value labeling system I proposed in HASH: The New Bates Stamp. My article was not mentioned, and apparently Judge Noce was not aware of it. He used the six-place hash truncation system I proposed in my article because it was, in his words, “convenient” to do so, and because the detectives had used that system in their affidavits and testimony. I doubt the police detectives had read my law review article either, which makes their use of the abbreviation system all the more important. It shows that it is a natural and reasonable thing to do, although this is the first time it has been utilized or mentioned in a legal opinion.
So what is the six-place hash truncation system which I proposed that these Missouri officials are now in fact using? Before I can answer that, I have to go into a little more depth about hash and Bates stamps. HASH: The New Bates Stamp not only explains hash and its importance to e-discovery, it also argues for the legal profession and e-discovery industry to adopt a new type of electronic document naming protocol that uses hash values, instead of sequential numbering, to identify electronic evidence. I argue that the time has come for the legal profession to abandon Nineteenth Century Bates stamp paper mentality, and adopt Twenty-First Century ESI hash mentality. I proposed that sequential Bates stamps be replaced by non-linear, intrinsic hash values.
The hash values would not only identify ESI, they would authenticate it too, something the lowly Bates stamp could never do. But the problem with using hash values to identify ESI, instead of Bates stamps, is that hash values are too long and awkward for the human mind. Here is what a typical forty place hexadecimal SHA-1 hash value looks like: 2B37BC6257556E954F90755DDE5DB8CDA8D76619.
Police detectives, lawyers and judges cannot go around describing computer files used as evidence with such long alphanumerics. It is too cumbersome a name to replace the Bates stamp. So my common sense proposal, which Judge Noce in Warren calls “convenient,” is to only use the first and last three places of the hash value, instead of all forty. So the hash value above becomes the much more manageable 2B3 … 619. That truncated hash value becomes a pretty good document name, and, in my opinion and that of many others, should replace the arbitrary Bates stamp.
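The truncation just described is simple enough to express as a short Python sketch. The function name `truncate_hash` and its parameters are hypothetical, chosen only for illustration. The separator follows the " … " style used in this post.

```python
def truncate_hash(value: str, places: int = 3, separator: str = " … ") -> str:
    """Shorten a hash value to its first and last `places` characters."""
    value = value.strip()
    if len(value) < 2 * places:
        raise ValueError("hash value is too short to truncate")
    return value[:places] + separator + value[-places:]


full_value = "2B37BC6257556E954F90755DDE5DB8CDA8D76619"
print(truncate_hash(full_value))  # prints: 2B3 … 619
```

Because the function only takes characters from the two ends of the string, it works the same way for any hash display format, whatever its length.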
Turns out that the detectives in Missouri were already following this six-place truncation protocol at the time my article was published in June 2007. Perhaps they and other law enforcement agencies have been using this system for years. I do not know for sure, although I doubt it has been a widespread practice. I have talked to many e-discovery forensic experts about the hash naming proposal over the past two years. Many of these experts did police work before going into e-discovery, and none ever mentioned having done this before. Also, it certainly does not appear in the legal literature on the subject, that is, until U.S. v. Warren.
Hexadecimal Values v. Base32 Number System
At first, I was disappointed to see that Judge Noce’s introduction of the truncated hash value naming protocol was flawed with two obvious technical errors. See if you can catch them:
The search turned up a list of files, including one with a 32-character alpha-numeric SHA1 designation of “H4V … UTI.” Fn4
FN4 – For convenience, in this opinion the SHA1 value set out in full in the search warrant affidavit will be referred to as “H4V … UTI.” The affidavit defined the term “SHA1” (also known as “SHA-1”) as being a mathematical algorithm that uses the Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST), along with the National Security Agency (NSA) . . . Basically the SHA1 is an algorithm for computing a condensed representation of a message or data file like a fingerprint.
Warren at *1.
First of all, the SHA-1 hash generates a 40-character hexadecimal string, not 32-character. The other kind of hash, MD5 hash, is the one that uses a 32 character string, not SHA-1. For this reason, my first reaction was that the Judge, or police, mixed up the two different types of hash, and meant to say 40 characters, not 32.
But then there seemed to be yet another, even bigger mistake. The letters H, V, U, T and I should not have been in the hash value name. The values generated in e-discovery work to represent SHA-1 and MD5 hash are always hexadecimal. That is a numerical system with a base of 16. This is typically represented by the numbers 0–9 for the first ten values, and A, B, C, D, E, and F to represent the last six, for a total of sixteen. In other words, a hexadecimal value does not employ any letters after F. Yet, the so-called SHA-1 alphanumeric stated in the Warren opinion uses the letters H, V, U, T and I: “H4V … UTI.”
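The hexadecimal objection can be checked mechanically. Here is a minimal Python sketch; the helper name `non_hex_characters` is hypothetical. It lists any characters in a value that fall outside the sixteen hexadecimal digits.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def non_hex_characters(value: str) -> list[str]:
    """Return, in sorted order, the characters of `value` that are not hexadecimal digits."""
    return sorted(set(value.upper()) - HEX_DIGITS)


# The six characters quoted in the Warren opinion, without the ellipsis.
print(non_hex_characters("H4VUTI"))  # prints: ['H', 'I', 'T', 'U', 'V']

# An ordinary 40-place hexadecimal SHA-1 value has no such characters.
print(non_hex_characters("2B37BC6257556E954F90755DDE5DB8CDA8D76619"))  # prints: []
```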
I thought the police or Judge Noce must have messed things up, but I also seemed to remember reading somewhere that there were other ways to express hash values, and anyway, I am always very careful before I tell a judge that he or she is wrong. So doing a little online research, I learned that there are indeed other ways to display hash values using different binary based number systems, typically the Base32 or Base64 number systems. Base32 is defined in IETF RFC 3548 as using the characters A-Z and 2-7, while Base64 is defined in IETF PEM RFC 1421 as using the characters A-Z, a-z, 0-9, / and +.
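To see how the same SHA-1 value looks in each number system, here is a small Python sketch using only the standard library. The function name `sha1_displays` and the sample input are hypothetical. A SHA-1 value is always 160 bits (20 bytes); only the way it is written out changes.

```python
import base64
import hashlib


def sha1_displays(data: bytes) -> dict[str, str]:
    """Return one SHA-1 value written three ways: hexadecimal, Base32 and Base64."""
    digest = hashlib.sha1(data).digest()  # the raw 20-byte (160-bit) value
    return {
        "hex": digest.hex().upper(),                          # 4 bits per character
        "base32": base64.b32encode(digest).decode("ascii"),   # 5 bits per character
        "base64": base64.b64encode(digest).decode("ascii"),   # 6 bits per character
    }


for name, text in sha1_displays(b"sample file contents").items():
    print(f"{name:>6}: {len(text)} characters: {text}")
```

The arithmetic explains the lengths: 160 bits at 4 bits per character gives 40 hexadecimal places, and at 5 bits per character gives exactly 32 Base32 places. The Base64 form comes to 28 characters, the last of which is an "=" padding character.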
My Online Investigation of Base32 Hash Math Led to a Shocking Discovery
Coming back to the Warren opinion, the hash values “H4V … UTI” are not hexadecimal, but they could be either Base32 or Base64. At this point, I did a little more online research about Base32 hash, and quickly found that there are many websites where you can locate music and videos to download based on their hash values. Almost right away, by simply using Google, I located a site where you can find media to download based upon their SHA-1 Base32 value. It then took less than a minute to find the web page where the Base32 SHA-1 hash values were listed that began with “H4V.” That is how all of the media on the site was listed, in order based upon the first three characters of their Base32 hash values.
There were 83 entries on the webpage whose hash values began with H4V. The site included listings of music and videos ranging from Beethoven’s Symphony No. 9 to a video of Lee Trevano’s Golf Instruction. One video listing which was 11.1 MB in size had a disturbing title that suggested it could contain the kind of porn referenced in Warren. It was dated May 29, 2003. I clicked on its hash value button and saw that the full SHA-1 hash value for this video was H4VIBLSKAZ477WRTKH7IURE6NXEDCUTI.
When I saw that hash value, it shook me up. The first and last three values exactly matched the hash described in Warren: H4V … UTI. My academic investigation of the mathematical properties of hash had led me right to the smoking gun in Warren! I knew from my article, and the research of Bill Speros described in footnote 168, that this match of the first and last three values meant there was a 98.6% probability that this was the exact same file referenced in Warren. Mr. Warren was charged with a felony for distributing this same video. I think it is a crime to even have it on your computer.
I do not know for sure if it is the same file, since the Warren opinion nowhere states the full hash value, but in view of the description of this video, it is just too much of a coincidence for it not to be. It was astonishing on many levels to see just how quickly you can find a file like this on the Internet, simply by knowing the first three hash numbers.
It is probably not possible to actually download or view the file from this website. I do not really know for sure, since that would involve clicking on this file, which I was not about to do. But when I clicked on the link for Beethoven’s Symphony No. 9, a piece of media which I do not find morally reprehensible, it took me to another web page. This page had links to other computers where you may in fact have been able to download Beethoven’s music. (I did not try, recognizing that might be a copyright violation.) At that point, the referring website included a statement that it “ONLY HAS INFO ABOUT FILES, AND DOES NOT OFFER ANY FILES FOR DOWNLOAD.” Still, if any law enforcement agency wants to contact me for the full website address, including Cuomo’s group, I would be happy to provide it. It is really very easy to find, and so I assume the proper authorities are already well aware of this site and its hash values, or lack thereof. I am certainly no police officer, and even if I were, I would not have the stomach for this kind of investigative work. Reading the email of parties in civil suits is about as horrid as I can handle.
Judge Noce Was Right
This little investigation proved to me that Judge Noce and the St. Louis police were correct. A SHA-1 hash value can be written with 32 places, not 40, and in that form it can use the whole alphabet, not just A-F.
The hash value H4V … UTI is indeed a correct truncation of a full SHA-1 hash value to its first and last three places. But it is a SHA-1 hash that is expressed in Base32, not hexadecimal. Although the hash values used in e-discovery are almost always hexadecimal, the hash values used in “Peer-to-Peer” websites include a variety of different numerical systems, frequently including the Base32 system.
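Since the two displays describe the same underlying 160-bit value, one can be converted into the other without access to the file itself. Here is a minimal Python sketch, again with hypothetical function names, using the ordinary hexadecimal example value from earlier in this post rather than any value from the case.

```python
import base64


def hex_to_base32(hex_value: str) -> str:
    """Rewrite a hexadecimal hash value in Base32 (characters A-Z and 2-7)."""
    return base64.b32encode(bytes.fromhex(hex_value)).decode("ascii")


def base32_to_hex(base32_value: str) -> str:
    """Rewrite a Base32 hash value as uppercase hexadecimal."""
    return base64.b32decode(base32_value).hex().upper()


hex_value = "2B37BC6257556E954F90755DDE5DB8CDA8D76619"
base32_value = hex_to_base32(hex_value)

print(len(hex_value), "hexadecimal places")
print(len(base32_value), "Base32 places")
print(base32_to_hex(base32_value) == hex_value)  # the round trip recovers the original
```

One practical consequence is that a six-place truncation taken from the Base32 display will not look anything like a truncation taken from the hexadecimal display of the same file, so anyone adopting the naming protocol would need to state which display format is in use.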
In addition, in my brief investigation of the P2P webs, I learned that countless P2P type websites now commonly use the first three places of hash values as a convenient shorthand naming system. For all I know, the “perps” may also. As Judge Noce says, it is the convenient thing to do. So when will the e-discovery vendors start doing so too?