MD5 To Be Considered Harmful Someday
Effugas writes "I've completed an applied security analysis (pdf) of MD5 given Xiaoyun Wang et al's collision attack (covered here and here). From an applied perspective, the attack itself is pretty limited -- essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note."
Re:damn (Score:2, Informative)
That's the nature of the game we play.
The only hashing system without collisions is sending the original file itself.
md5 vs sha1 vs ? (Score:5, Informative)
here is a very good link about the algo...
The "Detailed Summary" (Score:5, Informative)
I've been doing some analysis on the MD5 collision announced by Wang et al. Short version: Yes, Virginia, there is no such thing as a safe hash collision -- at least in a function that's specified to be cryptographically secure. The full details may be acquired at the following link:
http://www.doxpara.com/md5_someday.pdf
A tool, Stripwire, has been assembled to demonstrate some of the attacks described in the paper. It may be acquired at the following address:
http://www.doxpara.com/stripwire-1.1.tar.gz
Incidentally, the expectations management is by no means accidental -- the paper's titled "MD5 To Be Considered Harmful Someday" for a reason. Some people have said there are no applied implications to Joux and Wang's research. They're wrong; arbitrary payloads can be successfully integrated into a hash collision. But the attacks are not wildly practical, and in most cases exposure remains thankfully limited, for now. Still, the risks are real enough that responsible engineers should take note: this is not merely an academic threat. Systems designed with MD5 now need to take far more care than they would if they were employing an unbroken hashing algorithm, and the problems are only going to get worse.
Some highlights from the paper:
* The attack itself is pretty limited -- essentially, we can create "doppelganger" blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum.
* MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. This leads to...
* Attacks are possible using only the proof of concept test vectors released by Wang -- the actual attack is not necessary.
* Stripwire emits two binary packages. They both contain an arbitrary payload, but the payload is encrypted with AES. Only one of the packages ("Fire") is decryptable and thus dangerous; the other ("Ice") shields its data behind AES. Both files share the same MD5 hash.
* Digital Signature systems are vulnerable, as they almost always sign a hashed representation of data rather than the data itself.
* This is an excellent vector for malicious developers to get unsafe code past a group of auditors, perhaps to acquire a required third party signature. Alternatively, build tools themselves could be compromised to embed safe versions of dangerous payloads in each build. At some later point, the embedded payload could be safely "activated", without the MD5 changing. This has implications for Tripwire, DRM, and several package management architectures.
* HMAC's invulnerability has been slightly overstated. It's definitely possible, given the key, to create two datasets with the same HMAC. Attacker possession of the key violates MAC presumptions, so the impact of this is particularly questionable.
* Very interesting possibilities open up once the full attack is made available -- among other things, we can create self-decrypting executables (fire.exe and ice.exe) that exhibit differential behavior based on their internal colliding payloads. They'll still have the same MD5 hash.
* Several doppelgangers may (relatively quickly, as per Joux) be computed within a single multicollision-friendly block. As such, the particular selection of doppelganger sets within a file can itself be made to represent data. It's relatively straightforward to embed a 128 bit signature inside an arbitrary file, in such a way that no matter the value of the signature, a constant MD5 hash is maintained. This is curiously steganographic.
* Many popular P2P networks (and innumerable distributed content databases) use MD5 hashes as both a reliable search handle and a mechanism to ensure file integrity. This makes them blind to any sign
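The doppelganger-selection idea from the list above can be sketched with a toy hash. This is not MD5 and not Wang's attack: `toy_hash` is a hypothetical cascade (Merkle-Damgard) construction with a 24-bit state, built from truncated MD5 purely for convenience, so that colliding blocks can be found by brute-force birthday search. All names and sizes here are assumptions for illustration. The point it shows is structural: once each slot has two interchangeable blocks, the choice of block per slot carries one bit, and the final hash does not change.

```python
import hashlib
from itertools import count

BLOCK = 8   # toy block size in bytes (assumption; MD5 uses 64)
STATE = 3   # toy chaining state in bytes (assumption; MD5 uses 16)
IV = b"\x00" * STATE

def compress(state, block):
    # Toy compression function: truncated MD5 of state || block.
    return hashlib.md5(state + block).digest()[:STATE]

def toy_hash(data):
    # Cascade construction: pad with 0x80, zeros and a 4-byte length, then chain.
    pad = b"\x80" + b"\x00" * (-(len(data) + 5) % BLOCK)
    padded = data + pad + len(data).to_bytes(4, "big")
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state

def find_pair(state):
    # Birthday search for two distinct blocks that collide starting from `state`.
    seen = {}
    for i in count():
        block = i.to_bytes(BLOCK, "big")
        out = compress(state, block)
        if out in seen:
            return seen[out], block, out
        seen[out] = block

def build_pairs(n):
    # Chain n collisions: each pair starts from the state the previous one left.
    pairs, state = [], IV
    for _ in range(n):
        a, b, state = find_pair(state)
        pairs.append((a, b))
    return pairs

def embed(bits, pairs, payload):
    # Bit k selects which doppelganger block fills slot k.
    return b"".join(pair[bit] for pair, bit in zip(pairs, bits)) + payload

def extract(data, pairs):
    # Recover the bits by checking which block of each pair is present.
    blocks = [data[i * BLOCK:(i + 1) * BLOCK] for i in range(len(pairs))]
    return [pair.index(blk) for pair, blk in zip(pairs, blocks)]

pairs = build_pairs(8)
payload = b"same trailing payload"
m1 = embed([0, 1, 1, 0, 1, 0, 0, 1], pairs, payload)
m2 = embed([1, 1, 0, 0, 0, 0, 1, 0], pairs, payload)
print(m1 != m2, toy_hash(m1) == toy_hash(m2), extract(m1, pairs))
```

With 8 slots there are 2^8 distinct files sharing one toy hash; the 128-bit signature mentioned above would need 128 slots.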
Re:damn (Score:2, Informative)
That's not hashing:
Producing hash values for accessing data or for security. A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
Solution: Use more than one hash algorithm (Score:3, Informative)
Using MD5 with SHA1, or even the older MD2 or MD4, will reduce the probability of creating a compatible binary with the same checksum to virtually zero.
If only one checksum is required then just XOR the resulting checksums from each algorithm.
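A minimal sketch of the two-algorithm check, using Python's `hashlib`. The function names `dual_digest` and `matches` are made up for illustration. The sketch keeps the two digests side by side instead of XORing them: MD5 output is 16 bytes and SHA-1 output is 20, so an XOR would need a truncation or padding rule that the post does not specify.

```python
import hashlib
import hmac

def dual_digest(data):
    # Compute both digests over the same bytes and keep them as a pair.
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()

def matches(data, expected):
    # A file passes only if BOTH digests agree with the stored pair.
    md5_hex, sha1_hex = dual_digest(data)
    return (hmac.compare_digest(md5_hex, expected[0])
            and hmac.compare_digest(sha1_hex, expected[1]))

stored = dual_digest(b"release-1.0 contents")
print(matches(b"release-1.0 contents", stored))
print(matches(b"release-1.0 contentz", stored))
```

Whether this is meaningfully stronger than the better of the two hashes is exactly what the replies argue about; the sketch only shows the mechanics.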
Re:damn (Score:3, Informative)
I have to disagree with you here.
If I have algo A and algo B:
I hash with algo A and get a value which I store.
I hash with algo B and get a value which I store.
While the security does not add up to A^B it does amount to > A+B, which is still better than A or B only. (I really wish I had my Crypto reference books handy)
Other posters mentioned it was more work and equiv to one secure algo, both those statements are true; as I pointed out this was an alternative to writing a new SHA-1 algo.
-nB
Re:In english (Score:5, Informative)
Short version: A common technology for verifying that a file you've downloaded is legitimate and untampered-with, known as MD5, isn't as secure as people thought.
Slightly longer version: MD5 is a way of generating a checksum -- a single, comparable value -- from a file. Ideally it is supposed to give you different numbers for different files, so if a web site advertises the checksum a file should have, you can compare that with one generated from the file you actually got to see whether the file you've downloaded has been modified, potentially maliciously.
The research shows that it is possible for someone to construct a drop-in replacement for the file you thought you had that generates the same MD5 checksum as the original, so anyone attempting to validate the file this way would think they had the real thing. If it turns out that you can construct a damaging replacement for a common file -- perhaps an installer for a popular application like Firefox or OpenOffice that's usually downloaded from a public server -- then this could open a loophole for viruses, worms, etc. that would slip through the security net often used by cautious people when downloading such programs.
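For readers who have never done the comparison by hand, here is a rough sketch of the check described above, assuming the site publishes the checksum as a hex string. `file_md5` and `is_untampered` are hypothetical helper names, not part of any download tool.

```python
import hashlib
import hmac

def file_md5(path, chunk_size=1 << 20):
    # Hash the file in chunks so large downloads need not fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_untampered(path, advertised_hex):
    # Compare the locally computed checksum with the one the site advertised.
    return hmac.compare_digest(file_md5(path), advertised_hex.strip().lower())
```

A `True` result only says the file matches the advertised MD5 value; the research discussed here is about the cases where that is no longer enough.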
Re:Already Happening (Score:4, Informative)
I'll give you $50 if you can back that claim up. I want to see two video files. They must start out the same, but have a difference about half-way through. And they have to have the same md5 sum. Just post where I can download the two files, and your paypal address.
The way I see it, you've got a 1/2^64 chance of being right. So I'm risking $50/18446744073709551616, which isn't a whole lot.
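The arithmetic behind the bet can be checked exactly. This takes the poster's 1/2^64 estimate at face value and only works out the expected cost; the variable names are illustrative.

```python
from fractions import Fraction

odds = Fraction(1, 2**64)      # the poster's estimated chance of having to pay
stake = 50                     # dollars offered
expected_loss = stake * odds   # exact expected payout, in dollars

print(2**64)                   # the denominator written out in full
print(float(expected_loss))    # expected payout as a float
```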
Re:Cash Money? (Score:3, Informative)
Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu
"Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD"
OT yet informative: Spock and Try (Score:1, Informative)
-- Dr. Spock, stardate 2822.3."
I think you mean Yoda.
Besides, Dr. Spock is the baby doctor.
Re:Exploit? (Score:2, Informative)
As the length of the file is sent in addition to the MD5, in the vast majority of cases it's going to be impossible to find a file which gives you the same length and MD5. I guess as the size of media files increases this gets more and more likely, but if it ever starts affecting more than 0.0000000001% of files you can just increase the length of the hash.
Re:MD5 is obviously less secure (Score:3, Informative)
Sure, there are an infinite number of possible passwords, but they're all impossibly huge. I can come up with a password which is one trillion characters long and which hashes to the right result, but that's not practical.
Re:Exploit? (Score:4, Informative)
file1: xxxxccccccc....
file2: yyyyccccccc....
%file1 = %file2
Which is the example given in the article.
However, Wang said she could get to a collision from any intermediate hash code within the hour (according to the article). That would mean:
file1: ccccxxxxccccccccc......
file2: ccccyyyyccccccccc......
%file1 = %file2
Where xxxx and yyyy are (pre?)calculated and cccc.... is the payload.
If _I_ am not mistaken.
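The second layout (shared prefix, colliding blocks, shared payload) can be sketched on a toy hash. Big caveat: this is not MD5 and not Wang's method. `toy_hash` is a made-up cascade construction with a 24-bit state, built from truncated MD5, chosen so that a collision from an arbitrary intermediate state can be brute-forced in a moment. All names and sizes are assumptions.

```python
import hashlib
from itertools import count

BLOCK, STATE = 8, 3          # toy sizes (assumptions); MD5 uses 64 and 16 bytes
IV = b"\x00" * STATE

def compress(state, block):
    # Toy compression function: truncated MD5 of state || block.
    return hashlib.md5(state + block).digest()[:STATE]

def absorb(state, data):
    # Feed whole blocks into the chain and return the resulting state.
    assert len(data) % BLOCK == 0
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state

def toy_hash(data):
    # Pad with 0x80, zeros and a 4-byte length, then run the chain from the IV.
    pad = b"\x80" + b"\x00" * (-(len(data) + 5) % BLOCK)
    return absorb(IV, data + pad + len(data).to_bytes(4, "big"))

def collide_from(state):
    # Birthday search for two distinct blocks that collide from `state`.
    seen = {}
    for i in count():
        block = i.to_bytes(BLOCK, "big")
        out = compress(state, block)
        if out in seen:
            return seen[out], block
        seen[out] = block

prefix = b"cccccccc" * 2                 # shared header, whole blocks only
mid = absorb(IV, prefix)                 # intermediate state after the prefix
x, y = collide_from(mid)                 # the xxxx / yyyy blocks
payload = b"cccc arbitrary shared payload"
file1 = prefix + x + payload
file2 = prefix + y + payload
print(file1 != file2, toy_hash(file1) == toy_hash(file2))
```

Because the chain state is identical after the colliding blocks, everything appended afterwards (including the length padding) keeps the two hashes equal.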
Re:Not just MD5 (Score:3, Informative)
Still... I would switch out MD5 if you have a target that's worth pretty much anything at all. After a break like this, I'd expect MD5 to become basically useless pretty fast. Of course, I don't work in hash collisions, I work mostly in protocols...
Lea
You're wrong. (Score:5, Informative)
He can create a file that MD5sums to the same result as a legitimate file, but does not have full control over the content or size of the result (making this a mostly useless avenue of exploitation except for people who want to spread trash on P2P networks -- i.e. it shouldn't particularly bother anyone except people who already don't care about security).
Suppose you're storing passwords as encrypted hashes, so that intercepting the hashes doesn't tell you what the password is. But if you can generate a password to match that MD5...
You know that GPG keys are identified and signed by their MD5 hashes? Suppose that I can generate a GPG key that would be identified as yours, and distributed it.
Or he can create two files that MD5sum to the same result. But he has to have control over both files, which offers effectively no advantage to someone who is trying to spread malware or tamper with existing archives that have been MD5summed.
There's a coin-flipping protocol that goes as follows. Suppose that Alice and Bob want to flip a coin (over the Internet), but they don't trust each other.
Now, suppose that Alice generated multiple files in step 1. When Bob makes his guess, she tries to pick a file that will make her win. If she generated only two files, completely randomly, this would let Alice win 75% of the time.
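The 75% figure can be checked by enumeration. The protocol steps are not reproduced above, so this sketch assumes a commit-and-reveal shape: Alice commits to a hash, Bob guesses a bit, Alice reveals a file whose bit decides the flip, and Alice wins when Bob's guess is wrong. It also assumes her colliding files carry independent random bits. `alice_win_probability` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

def alice_win_probability(num_files):
    # Enumerate every combination of random file bits and Bob's guess.
    wins = total = 0
    for *file_bits, guess in product((0, 1), repeat=num_files + 1):
        total += 1
        # Alice may reveal any colliding file; one that disagrees with Bob wins.
        if any(bit != guess for bit in file_bits):
            wins += 1
    return Fraction(wins, total)

print(alice_win_probability(1))   # honest case: a single committed file
print(alice_win_probability(2))   # two files sharing one hash
```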
These are just the first ideas I thought of. If I were looking for other problems, I'd think about undeniable signatures [rsasecurity.com], keysigning (which GPG and X.509 SSL are heavily based on) and other specialized signature systems. In particular, I expect that the first type of crack could cause issues with SSH keys, both user keys (used for authentication) and host keys (to prevent man-in-the-middle attacks).
Digital signatures are used for much more than just testing for file tampering.
Re:Correct me if I'm wrong, but... (Score:4, Informative)
Re:Correct me if I'm wrong, but... (Score:3, Informative)
Re:Very true (Score:3, Informative)
The ISO approval also carries some weight in industry, although after some rather disastrous specifications (such as ISO 9000), they have lost some of their image. However, there are plenty of organizations that would consider an ISO standard an absolute must.
I don't know of anyone using Whirlpool for highly secure systems. It certainly wouldn't be ok in the US, as it's not a FIPS standard. France or Germany would be better bets.
Re:What does this mean (Score:1, Informative)
Just ask yourself: how many values can an MD5 hash (checksum) have, and how many (different) files are there in this world of ours?
If the number of files exceeds (is more than) the number of checksum values, logic dictates that there must be several files that generate the same MD5 hash (checksum).
But having multiple files with the same MD5 hash is not the problem. The problem is that someone could choose which MD5 hash his (program) file should generate. And that would mean he could replace any file he wants with another one.
And that's just what the MD5 hash should make impossible.
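The counting argument is easy to see at a smaller scale. In this sketch `tiny_hash` is a made-up one-byte checksum (the first byte of MD5), so it has only 256 possible values; any 257 distinct inputs are then guaranteed to contain a collision. MD5 has 2^128 values, so the same logic applies but nobody stumbles on such a pair by accident.

```python
import hashlib

def tiny_hash(data):
    # One-byte toy checksum: only 256 possible values.
    return hashlib.md5(data).digest()[0]

def first_collision(n_inputs=257):
    # Hash distinct inputs until two of them land on the same value.
    seen = {}
    for i in range(n_inputs):
        msg = f"file-{i}".encode()
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg
        seen[h] = msg
    return None  # cannot happen with 257 inputs and 256 values

a, b = first_collision()
print(a, b, tiny_hash(a) == tiny_hash(b))
```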
Re:Correct me if I'm wrong, but... (Score:3, Informative)
What I'm getting at is that you'll probably always have to either know the password + the salt (if there is one), use brute force, or use a database of MD5's for all possible passwords in order to decrypt an MD5'd password. But since there are already MD5 databases, we're kind of past this part anyway.
As for access tokens, MD5's are chosen simply because they're large and seemingly random which makes them "unguessable". Since they're just temporary anyway, guessability is all you're trying to prevent - you'll get a new one next time and the attacker will have to start over.
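A small sketch of why the existing MD5 databases matter for unsalted passwords and why a salt blunts them. The wordlist and names are invented for illustration, and this is not a suggestion to store passwords with MD5.

```python
import hashlib
import os

def md5_hex(data):
    return hashlib.md5(data).hexdigest()

# Precomputed lookup table over a (hypothetical, tiny) wordlist.
WORDLIST = ["password", "letmein", "hunter2"]
TABLE = {md5_hex(word.encode()): word for word in WORDLIST}

def lookup(stored_hex):
    # Return the password if its bare MD5 is already in the table.
    return TABLE.get(stored_hex)

unsalted = md5_hex(b"hunter2")
salt = os.urandom(8)                  # random per-user salt
salted = md5_hex(salt + b"hunter2")

print(lookup(unsalted))   # found: the table was built from bare hashes
print(lookup(salted))     # not found: the table knows nothing about this salt
```

Neither lookup involves a collision attack; both are the brute-force or database route the comment describes.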
Re:The "Detailed Summary" (Score:3, Informative)
SHA-1 has a much stronger design. It's starting to show cracks, though, so I don't recommend anything. Something based on AES will come, though -- maybe AES-OMAC, maybe Whirlpool. At the core of almost every hashing algorithm is just a block cipher anyway...
--Dan
Re:Two files with the same md5 hash? (Score:5, Informative)
Re: Exploit? (Score:4, Informative)
This would worry me, except that BT uses SHA1, not MD5, so this is irrelevant. MD5 has seemed suspect for years, & Bram's the sort to pay attention to that sort of thing.
I checked; Edonkey is based on MD4. Gnutella variants might use MD5.
Re:Exploit? (Score:3, Informative)
Re:You're wrong. (Score:3, Informative)
They're not. The old PGP 2.6 keys are, but GnuPG generates OpenPGP keys that use SHA-1.
GnuPG will use an already generated PGP 2.6 key, but will not make more of them.
Re:Exploit? (Score:2, Informative)
Re:I have a novel solution (Score:3, Informative)
So you are safe downloading Linux for now via BitTorrent. Besides, the chances of MD5 collisions happening from sheer luck/unluck are very slim. (After all, we've been using it for ages with no reports.)
The most dangerous factor in the continued use of MD5 is malicious individuals.