Is MD5 still good enough to uniquely identify files?

Tags: hash, md5 | Author: llg1212 | Date: 2010-10-11

Is MD5 hashing a file still considered a good enough method to uniquely identify it, given all the breaking of the MD5 algorithm, the security issues, and so on? Security is not my primary concern here, but uniquely identifying each file is.

Any thoughts? Thanks a lot in advance :)

Best Answer

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.


Could you elaborate on what is broken from a security perspective? What can no longer be achieved with MD5 hashing, and why is it not achievable?
@none: For your first question, see here. I'm afraid I don't understand the other questions.
@0xA3: Neither you nor I have any idea what files the OP is referring to, or how much damage a compromise would cause. It could be their kid's baby photo collection for all we know. My goal is to provide the facts; what someone else does with them is their business. Also consider that Bruce Schneier recommends writing down your password; not everything needs to be stored at Fort Knox. Some things will keep just fine under the flower pot.
Add to all that the fact that MD5 hashes work much better than, say, SHA1 hashes as database keys since they fit neatly into a UUID column.
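For instance, in Python the 16-byte MD5 digest maps directly onto a `uuid.UUID`, so it can be stored in a native UUID column (a sketch; the sample data is illustrative):

```python
import hashlib
import uuid

# An MD5 digest is exactly 16 bytes, the same size as a UUID,
# so it drops straight into a database UUID column.
data = b"example file contents"
digest = hashlib.md5(data).digest()  # 16 raw bytes
key = uuid.UUID(bytes=digest)        # usable as a UUID-typed key

print(key)
```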
@Marcelo Cantos, I think what is lacking here is a differentiation or unpacking of the term 'security'. Obviously people are assuming 'security' for any use of checksum work, but the nomenclature Marcelo likely means is 'in a laboratory'.

Other Answer1

For practical purposes, the hash created might be suitably random, but theoretically there is always a probability of a collision, due to the Pigeonhole principle. Having different hashes certainly means that the files are different, but getting the same hash doesn't necessarily mean that the files are identical.

Using a hash function for that purpose - no matter whether security is a concern or not - should therefore always only be the first step of a check, especially if the hash algorithm is known to easily create collisions. To reliably find out if two files with the same hash are different you would have to compare those files byte-by-byte.
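A sketch of that two-step check in Python (the function names are mine, not a standard API; `filecmp.cmp` with `shallow=False` performs the byte-by-byte pass):

```python
import filecmp
import hashlib

def probably_identical(path_a, path_b):
    """Cheap first pass: equal MD5 digests are necessary but not sufficient."""
    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            # Read in chunks so large files fit in constant memory.
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.digest()
    return md5_of(path_a) == md5_of(path_b)

def definitely_identical(path_a, path_b):
    """Authoritative second pass: full byte-by-byte comparison."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```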


@Ranhiru. No. The hash gives you a 'summary' value which (for MD5) is only 16 bytes long. To guarantee the files are identical you would need to make a byte by byte check. This is true no matter what hash algorithm you choose, there is always the possibility of a collision.
@Ranhiru. Reread this answer; it's IMHO the most comprehensive one here. Hashing can be used as a first step, which gets you to 99.99...% certainty that the files are identical, but if you want to be absolutely 100% certain, then you'll need to make a byte-by-byte check. This is true whether you use MD5, SHA or any other algorithm.
This answer is wrong. Prevention of tampering and verifying uniqueness are the same thing. Also, while hashing doesn't guarantee uniqueness, neither does actual comparison. In fact, the likelihood of a hash accidentally colliding is actually lower than the probability of the comparison failing due to glitches in the CPU generated by normal solar gamma ray emissions. And don't forget that often the only source of the file is sitting on the other side of the world inside a web server, and the only independent piece of information you have for comparison purposes is the hash.
@Marcelo. It doesn't stand to logical reasoning that accidental collision is less likely than accidental bit flips (whilst making a byte by byte comparison). You still have the same chance of bit flips when building the hash (and arguably more since more processing time is involved). @Thomas raised the point originally to suggest that there is no guaranteed way of identifying uniqueness, though the impact of bit flips is highly debatable. The most pessimistic estimate is 1 flip per GB/hour, and ECC RAM would remove even that.
@PaulG: That's not the point. The probability of an accidental collision at the mathematical level is much lower than an error due to a random bit flip (and ECC can't prevent bit-flips in the bus circuitry or the CPU core, btw). Thus, a byte-for-byte comparison would have almost no impact on the chances of getting it right. Besides, the answer is wrong even in principle, since the main purpose of a hash is to confirm the identity of a file when there is no trusted copy to check against, or it is intractable to do so (e.g., comparing a 100 GB file with a copy on the other side of the world).

Other Answer2

MD5 will be good enough if you have no adversary. However, someone can (purposely) create two distinct files which hash to the same value (that's called a collision), and this may or may not be a problem, depending on your exact situation.

Since knowing whether known MD5 weaknesses apply to a given context is a subtle matter, it is recommended not to use MD5. Using a collision-resistant hash function (SHA-256 or SHA-512) is the safe answer. Also, using MD5 is bad public relations (if you use MD5, be prepared to have to justify yourselves; whereas nobody will question your using SHA-256).
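A minimal SHA-256 file hash in Python, using only the standard library (the chunk size is arbitrary):

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Stream the file through the hash so arbitrarily large files
    are processed in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```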


This answer might be a bit misleading if the reader isn't too familiar with hashing. There is nothing magical about SHA that prevents hash collisions; it is just more resistant to hash collision attacks. If you wanted to be more than 99.999...% certain that the files are identical, you would still need a byte-by-byte check.
Actually a byte-to-byte comparison may fail due to a cosmic ray flipping a bit (e.g. transforming a return 0; into a return 1;). This is highly unlikely, but the risk of a collision with SHA-256 is even smaller than that. Mathematically, you cannot be sure that two files which hash to the same value are identical, but you cannot be sure of that either by comparing the files themselves, as long as you use a computer for the comparison. What I mean is that it is meaningless to go beyond some 99.999....9% certainty, and SHA-256 already provides more than that.
What, you don't use ECC memory? ;). Good comment, very interesting thoughts.
Don't forget the tin foil hat! More seriously, how do you know these factoids about collisions, and have you verified them in some way?

Other Answer3

It depends on what you are trying to achieve. It is extremely unlikely that any two non-identical files will have the same MD5 hash, but keep in mind that two files with the same content will have the same MD5 hash. In fact, an MD5 hash is commonly used to verify the integrity of files, since almost any change to a file will cause its MD5 hash to change as well.
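For instance, checking a file against a published checksum might look like this (a sketch; `verify_md5` and the paths are illustrative names, not a standard API):

```python
import hashlib
import hmac

def verify_md5(path, expected_hex):
    """Return True if the file's MD5 matches a published hex checksum."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    # compare_digest avoids timing side channels; overkill for integrity
    # checks, but harmless.
    return hmac.compare_digest(h.hexdigest(), expected_hex.lower())
```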


Then again, the same applies to any other hashing algorithm.

Other Answer4

An MD5 hash can produce collisions. Theoretically, although it is highly unlikely, a million files in a row could produce the same hash. Don't test your luck: check for MD5 collisions before storing the value.

I personally like to create MD5 hashes of random strings, which avoids the overhead of hashing large files. When a collision is found, I iterate and re-hash with the loop counter appended.
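One way to read that scheme in Python (the `seen` set stands in for whatever store you check collisions against; all names are illustrative):

```python
import hashlib
import secrets

seen = set()  # digests already handed out (in-memory stand-in for a DB lookup)

def unique_id():
    """MD5 of a short random string; if the digest is already in use,
    append an incrementing counter and re-hash until it is fresh."""
    base = secrets.token_hex(16)
    digest = hashlib.md5(base.encode()).hexdigest()
    counter = 0
    while digest in seen:
        counter += 1
        digest = hashlib.md5(f"{base}{counter}".encode()).hexdigest()
    seen.add(digest)
    return digest
```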

You may read up on the pigeonhole principle.

Other Answer5

Personally, I think people use raw checksums (pick your method) of other objects as unique identifiers far too often, when what they really want is a unique identifier. Fingerprinting an object wasn't intended for this use, and it is likely to require more thought than using a UUID or a similar integrity mechanism.
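If a unique identifier, rather than a content fingerprint, is what's actually wanted, the standard library generates one directly (a minimal sketch):

```python
import uuid

# A random (version 4) UUID: unique by construction, with no need to
# read or hash the file's contents at all.
file_id = uuid.uuid4()
print(file_id)
```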

Other Answer6

I wouldn't recommend it. If the application will run on a multi-user system, there might be a user who has two files with the same MD5 hash (he might be an engineer who plays with such files, or just curious; they are easily downloadable from , and I myself downloaded two samples while writing this answer). Another thing is that some applications might store such duplicates for whatever reason (I'm not sure whether any such applications exist, but the possibility does).

If you are uniquely identifying files generated by your own program, I would say it is OK to use MD5. Otherwise, I would recommend any other hash function for which no collisions are known yet.

Other Answer7

MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).

Other Answer8

Yes, for sure: MD5 produces a unique hash for every different input you provide, and it is also used very often to calculate the hash of a file to keep it safe from having malware or virus code injected into it.

SHA1 might be a good option for this purpose also.


MD5 doesn't guarantee a unique hash. There is a non-zero probability that two different files will have the same key. The question is whether MD5 is still safe enough for non-security use. As for protecting against malware injection, it is now useless for that purpose because its security properties have been thoroughly demolished. AFAIK, a hacker on a laptop can generate a virus-infected version of a program with the same MD5 sum as the original in a few minutes.
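That non-zero probability can be estimated with the birthday approximation; a sketch for a 128-bit hash like MD5 (the file counts are illustrative):

```python
import math

def collision_probability(n_items, hash_bits=128):
    """Birthday approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 preserves precision when the probability is tiny.
    return -math.expm1(exponent)

# Even with a billion files, an accidental 128-bit collision is
# astronomically unlikely.
print(collision_probability(10**9))
```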
The only hashing algorithm off the top of my head that would produce a unique hash for every input is one that produces infinitely-long hashes.
No, this answer is absolutely wrong. No hash function that maps infinite inputs to a fixed number of possible outputs can guarantee uniqueness. By definition.
@BoltClock - Perhaps you mean no hash at all, i.e. the "hash" is the original file itself, which then isn't actually a hash, is it?