The hash - a computer file’s digital fingerprint

One of the many uses of hash functions is the identification and verification of computer files.

Everyone has experienced having to frantically search for an important document that was saved “somewhere in the computer” but is needed immediately. In such situations, we usually resort to the search function built into the operating system; but in the end, we just have to browse through the files, one at a time, until we find (or not) the right one. But what if we need to find a file among hundreds of thousands of others?

The curse of data abundance

The average user seldom has such problems, but they can be an issue, for example, for cloud services providers. When a file is uploaded “into the cloud”, the cloud services provider may want to check whether it already has an identical file in storage. If the file is already in the cloud (e.g. several users have saved the same music file), then “deduplication” may be used – meaning that the file is flagged, thus making it available to a further user, without it having to be uploaded a second time and saved on a server. This reduces the cost of transferring and storing files. Also, social networking sites must quickly verify multiple files, ensuring that their users are not trying to post banned content, such as encouraging terrorism, or violence, or portraying them. Finally, there may be occasions in court proceedings where evidence may be required that given digital storage (such as a laptop’s hard drive) also contained, among many others, certain files that are relevant to the resolution of a case.

It would be difficult to imagine checking so many files manually - it would require a tremendous amount of time and resources. Fortunately, this process may be automated using the hash function.

What are the hash and the hash function?

The hash function is a type of mathematical function, which, when applied to a digital file (record), assigns it a specific value called a hash (or “hash value” or “hash code”). A hash is a sequence of letters and numbers of set length that may be termed the “digital fingerprint” of a computer file. Examples of hashes are given below (hashes were generated from the text strings using the MD5 hash function).

 

Ala has a cat named Mruczek.

8fca969b64f34edc160a205cb3aa5c86

Ala has a cat named Mruczek..

fe634af96c21b486e189224b70018189

Ala has a cat named mruczek.

e478d8300e1e1cf6e1abde3d23948e43

To find a file that we are looking for from among many others, we must first have its hash value. It is not necessary to compute it every single time – instead you can use a hash that was saved earlier. A hash may be generated for any type of file, such as text files, images, sounds, or videos. Then the same hash function should be used to calculate the hash for the searched files. When we find a hash that is identical to the one generated, this means that we just found our file – they are identical (it is possible to say that there is a match in the “digital fingerprint” of those files). There is no need at all to check the contents of the searched files, or even to have a copy of the file you are looking for.

The hash function’s properties

The above example illustrates the basic property of a hash function: the same input value (file) always generates the same output (hash) value. Another important feature of this feature is its one-way nature - it is easy to generate a hash from a file, whereas it is very difficult, if not nigh impossible, to recreate a file from a hash, (requiring great computing power).

This feature limits the vulnerability of the hash function to attacks involving, say, generating more than one file that corresponds to a hash (nonetheless, it is worth noting that there have been successful attacks on certain hash functions that produced “hash collisions”.) The hash function’s nature means that it is theoretically possible for several files to be assigned to a single hash - the set of all possible files is potentially infinite, as opposed to the set of hashes which is finite (due to their fixed length)[1].

In practice, this risk is limited due to another feature of the hash function, described rather graphically as the avalanche effect. Just like every avalanche starts with the slip of a small pebble, in a hash function, even the slightest change in the input value (file checked) causes significant changes in the output (hash) value. This effect is visible in the above examples - slight amendments or errors that do not affect how the content is understood by a human result in completely different hashes.

Solutions based on hash functions

Due to the avalanche effect, file detection may be easily avoided when using a simple hash function by making insignificant amendments to the file, such as modifying the formatting of the text, or cropping a part of a photo that is not visible to the eye. In certain situations, where a file’s content is more important than its sameness, this may be undesirable (for example, filtering prohibited content on social networking sites). To prevent this, more sophisticated solutions are being developed for identifying files based on hash functions.

One of them is PhotoDNA technology, created by Microsoft and refined by Hany Farid, a computer science lecturer at Dartmouth College in USA. As its name implies, it serves to identify image files that are assigned a unique identifier, which is, as it were, equivalent to a DNA code for the given file. It is distinguished by the use of “robust hashing”: to generate a hash, the algorithm divides the image into several smaller fragments, thus making it harder to be deceived by insignificant changes to the file.

Microsoft handed over the rights to PhotoDNA to the International Center for Missing & Exploited Children, an American organisation, which has used the algorithm to create a hash database of files identified as containing child pornography. The database is used, not only by law enforcement agencies, but also by many well-known internet companies, which prevent such content appearing on their own proprietary services and report such cases to appropriate authorities. Some time ago, industry giants announced plans to implement similar filtering mechanisms for materials that encourage terrorism (more information may be found here and here).

Content ID, used by YouTube, operates in a similar way, detecting films and music which  infringe copyrights. This system checks to see if a user-added video clip contains any songs that are also in a database, which contains copyrighted material. Already in 2010, Google reported that the algorithm was verifying “hundreds of hours of videos uploaded to YouTube every minute”, something no human could ever handle.

If an added film clip contains a third party’s copyrighted content, the owner of the copyright is notified and may decide, for example, whether to have the clip deleted or to appropriate the revenue from it. In practice, the system is often criticised by the owners of content (in their view, if often fails to detect content that infringes their rights), as well as by users of the site (who believe that it sometimes blocks lawful content). The latter problem is, to some extent, due to the very principles on which the system operates – it merely checks whether a clip contains any content that is also in a database, omitting the context in which it has appeared.

Hashing algorithms also form the basis of the popular Shazam application, which can identify a song from a short snippet. Its authors have created a database that contains the digital signatures of millions of songs, which can be quickly compared to a signature generated from a song to which an app “listens”.

The hash function’s other uses in brief

Mindful of the usefulness of hash functions, it is worth remembering their downsides, such as the need to prepare a database of hashes based on files reviewed by a human, or the ease of “deceiving” the function by simply changing a file. This is why other solutions for identifying files are also being developed. A particularly interesting example is computer-based image recognition and processing systems – in essence, a type of artificial intelligence that learns to “view” images independently and is able to describe their content without relying on an external database.

However, even if hash functions do get replaced by such computer “vision”, they will certainly still find uses in many other areas - identifying computer files is only one of many. Hash functions are also being used, for example, in electronic signatures (it is not the file that is signed - just its hash), in cryptography, for storing passwords on web sites, and also constitute an important element of blockchain technology.

Rafał Kuchta

 

Previous post
First jurisdiction on the blockchain
Next post
Tokens, blockchain and the law