MD5 To Be Considered Harmful Someday
Effugas writes "I've completed an applied security analysis (pdf) of MD5 given Xiaoyun Wang et al.'s collision attack (covered here and here). From an applied perspective, the attack itself is pretty limited -- essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-unequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be appended to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note."
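The append property is easy to see for yourself in a few lines of Python, assuming you have Wang's two published colliding files on disk (the filenames here are placeholders):

import hashlib

# read two block-aligned files that already collide under MD5
a = open("wang1.bin", "rb").read()
b = open("wang2.bin", "rb").read()
assert hashlib.md5(a).digest() == hashlib.md5(b).digest()

# because MD5 is an iterated (cascade) construction, appending the
# same payload to both files preserves the collision
payload = b"arbitrary payload"
assert hashlib.md5(a + payload).digest() == hashlib.md5(b + payload).digest()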
damn (Score:2, Interesting)
Exploit? (Score:5, Interesting)
Not just MD5 (Score:3, Interesting)
SHA-256 is a good replacement. SHA-1 should be fine, but if you are going to the trouble of an upgrade, why not make it sufficiently future-proof?
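For code that uses Python's hashlib, the swap is a one-line change (a hedged illustration, nothing more):

import hashlib

data = b"some file contents"
print(hashlib.md5(data).hexdigest())     # 128-bit digest, now suspect
print(hashlib.sha256(data).hexdigest())  # 256-bit digest, more future-proof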
Re:damn (Score:5, Interesting)
Yes, but a good hash makes it *extremely* difficult to find them. MD5 is looking pretty mediocre right now.
break it down (Score:0, Interesting)
a. MD5-sum the entire file
b. MD5-sum each 10K block in the file (overlap blocks by ZZZ bytes; see the sketch below)
c. break the file into different blocks, such as every 40th byte in the file
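A rough sketch of (b), assuming 10K blocks with a 1K overlap (the ZZZ value above is unspecified, so that figure is made up):

import hashlib

def block_hashes(path, block=10 * 1024, overlap=1024):
    # hash each 10K block, starting each new block 'overlap' bytes
    # before the previous one ended
    data = open(path, "rb").read()
    step = block - overlap
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), step)]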
It's actually easy to see this (Score:2, Interesting)
I had to sort 15,000 jpg image files to remove duplicates. Since the names varied, and it is not uncommon to have many images with the same length, it seemed like a good idea to use md5 hashes. So I coded it up in Python to do the md5 hash, and then stuffed each file name into a table keyed by the md5 hash. Big surprise when multiple different files landed in the same hash. Some property of jpgs probably reduces the randomness of the files and increases the probability of hash collisions.
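For what it's worth, the approach described would look something like this (a minimal sketch; the directory name is made up):

import hashlib, os

seen = {}  # table keyed by md5 hash, as described above
for name in os.listdir("images"):
    path = os.path.join("images", name)
    digest = hashlib.md5(open(path, "rb").read()).hexdigest()
    # files landing under the same key are duplicates
    # (or, vanishingly rarely, a genuine collision)
    seen.setdefault(digest, []).append(name)

dupes = {h: names for h, names in seen.items() if len(names) > 1}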
Is a two-pass just as vulnerable? (Score:3, Interesting)
Is doing a 2nd pass helpful?
In other words, if
ABCfireDEF
and
ABCiceeDEF
hash the same, does
ABCfireDEFABCfireDEF
necessarily, or even frequently, hash to
ABCiceeDEFABCiceeDEF?
Even if it were to provide protection, in practical terms,
1) other hashing algorithms are likely faster than "MD5 times two"
2) there may be some - hopefully far fewer - places in a long file where
ABCfireDEFABCfireDEF
can be replaced with
ABCiceeDEFABCmeltDEF
I'm beginning to like the "pick two algorithms and call me in the morning" approach.
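The "two algorithms" idea is cheap to sketch with hashlib (a hedged illustration, not a vetted construction):

import hashlib

def check_both(data: bytes, md5_hex: str, sha1_hex: str) -> bool:
    # an attacker now has to collide MD5 and SHA-1 simultaneously
    return (hashlib.md5(data).hexdigest() == md5_hex and
            hashlib.sha1(data).hexdigest() == sha1_hex)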
Re:Correct me if I'm wrong, but... (Score:3, Interesting)
If I had decided to use MD5 for the hash in the digital signature, my scheme is now vulnerable. It's not too far-fetched to imagine that somebody could come up with a way to embed an exploit in one of the files while staying within the limitations imposed by this collision technique. Then if he can accomplish a man-in-the-middle attack, he's successfully used my automatic downloader to compromise somebody's machine. Not fun.
This may not be completely feasible currently, but the technique shows that it may be possible in the future. If you're currently designing a system that you plan to have function for several years, you should not assume that MD5 is cryptographically secure.
Re:Let's face it (Score:3, Interesting)
True, you'd spend longer validating and decrypting the unholy mess generated than you'd spend downloading it, but I think you'd be fairly safe in assuming that the file was what it claimed to be.
In other words, it isn't impossible to be so close to 100% secure as to make no odds. It is merely a question of whether it's worth it. Is the expense of protecting something greater than the value of what you're protecting? If so, nobody is going to bother, no matter how simple the task of protecting is.
Re:Correct me if I'm wrong, but... (Score:3, Interesting)
See the long list of cryptographic hash functions [wikipedia.org] on Wikipedia. (Bottom of page.)
If you used three or four different hash functions to produce a long combined message digest (the several digests concatenated), it would seem much more difficult to produce files that collide. An attack resulting in a collision on one algorithm is not likely to work for another algorithm. Finding two files that simultaneously fool two or more algorithms seems a much harder task than fooling a single algorithm like MD5.
This is of utmost importance. We can't have the RIAA infiltrating garbage files ("what the fiddlesticks do you think you're doing?") into our p2p networks.
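As a rough sketch of the concatenated-digest idea above (the choice of algorithms here is arbitrary):

import hashlib

def combined_digest(data: bytes) -> str:
    # concatenate several independent digests into one long fingerprint
    return "".join(h(data).hexdigest()
                   for h in (hashlib.md5, hashlib.sha1, hashlib.sha256))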
Re:You're wrong. (Score:1, Interesting)
"In particular, I expect that the first type of crack could cause issues with SSH keys, both user keys (used for authentication) and host keys (to prevent man-in-the-middle attacks)."
The server has a store of acceptable public keys; third-party-observable random data from both the client and server is combined and hashed using SHA-1, and that hash is then signed by the client's private key. Being able to come up with substitute MD5 or SHA-1 data will not affect the outcome of the signed-data algorithm.
For server authentication a certificate chain could be at risk, although most SSH client implementations simply cache the public key, so they are vulnerable to first-time active attacks anyway.
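A toy model of the client-auth flow described above; HMAC stands in here for the public-key signature the real protocol uses, and all keys and nonces are made up:

import hashlib, hmac, os

client_nonce = os.urandom(16)   # random data contributed by the client
server_nonce = os.urandom(16)   # random data contributed by the server
session_hash = hashlib.sha1(client_nonce + server_nonce).digest()
client_key = os.urandom(32)     # stand-in for the client's private key
signature = hmac.new(client_key, session_hash, hashlib.sha1).digest()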
An algorithm like the following would be helpful for certificate chains. See original zero score post here (http://developers.slashdot.org/comments.pl?sid=1
H = H1(m + 0) + H2(m + 0)
for counter = 1 to 999:
    H = H1(m + counter + H) + H2(m + counter + H)
return H
H1 and H2 are crypto hash algorithms. This approach is very strong, since the two hashing algorithms are interleaved. If one hashing algorithm is found to be weak and the other is strong, attacks will fail. Additionally, even if both algorithms are found to be weak, the interleaving and the number of rounds make it much more computationally difficult to attack. Compared to DSA or RSA signature verification, the additional work for generating this hash is very light on the CPU.
Just finishing up the patent application. Done.
Hey Moderator, this has to be worth more than a zero. Sheesh.
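For the curious, a minimal Python rendering of the scheme above, assuming H1=MD5, H2=SHA-1, '+' means byte concatenation, and the counter is encoded as decimal ASCII (none of which the pseudocode pins down):

import hashlib

def interleaved_hash(m: bytes, rounds: int = 1000) -> bytes:
    h = hashlib.md5(m + b"0").digest() + hashlib.sha1(m + b"0").digest()
    for counter in range(1, rounds):
        c = str(counter).encode()
        # feed the message, counter, and previous combined digest
        # into both algorithms each round
        h = (hashlib.md5(m + c + h).digest() +
             hashlib.sha1(m + c + h).digest())
    return h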
Almost forgot (Score:4, Interesting)
Where time isn't critical (e.g., creating and validating checksums for files), I'd say use both. The overhead isn't great, and you'd get much more security.
Where time is critical AND you don't have to be concerned with computers not under your control, use Whirlpool. Whirlpool is built around a Rijndael-style block cipher, and Rijndael is fast; SHA-1 is slow. Whirlpool also offers a longer hash than SHA-1.
In any other situation, use SHA-1. Whirlpool might turn out to be the greatest algorithm out there, but that doesn't help if you're trying to talk to a remote computer that doesn't support it.
Re:damn (Score:2, Interesting)
Agreed that folding A with B after A has been compromised is a nonstarter.
However, isn't there merit to the general idea of combining two 'dissimilar' hash functions X and Y (dissimilar in that the theoretical bases for the 'security' of each share as few attributes as feasible), where both X and Y are currently thought to be 'secure'?
In the very common cases where the hash is protecting something with decreasing time value (such as software binaries, which become obsolete in a few years), or where the original source is secure and only alterations of copies are a concern (perhaps historical legal documents), this buys time if X or Y (but not both) is compromised, or seems likely to be compromised, as MD5 is now.
For example, if X=MD5, the concerns about MD5 would trigger an effort to discontinue its use, replace it with a new hash function Z, and start using Y+Z. While the community switches to the new combination, everything is still pretty safe (as safe as Y, in this case) during the transition period.
It seems that in the private sector (spooks aside), it's quite unlikely that X and Y would both be effectively compromised simultaneously and with little warning, unless the notion of dissimilarity is itself flawed. Fortunately, the skills needed to compromise such functions seem to be concentrated among the guys with the white hats, so they tend to publish their preliminary results.
Disclaimer: I'm not a cryptomaniac so my observations are worth substantially less than you paid for them.
Re:damn (Score:3, Interesting)
Depends on the hash function. If the input length is equal to or smaller than the output length, reconstructing the message might be possible, e.g.:
$ echo "hello" | md5sum
b1946ac92492d2347c6235b4d2611184 -
Should be easily reversible with a dictionary attack. However, in the more common case a hash maps a large input domain to a much smaller output domain, so yep, hashes are not reversible unless the input is somehow smaller than the hash itself.
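A toy version of that dictionary attack (the word list is made up; note that echo appends a trailing newline, so that's what gets hashed):

import hashlib

target = "b1946ac92492d2347c6235b4d2611184"
for word in ["hi", "world", "hello"]:
    # echo appends "\n", so hash word + "\n" to match md5sum's input
    if hashlib.md5((word + "\n").encode()).hexdigest() == target:
        print("preimage found:", word)
        break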
Re:damn (Score:1, Interesting)
300+0 records in
300+0 records out
314572800 bytes transferred in 63.270583 seconds (4971865 bytes/sec)
$ time md5sum yoink
a4a846995ccf65be3bfcd9e69325e69a yoink
real 0m2.482s
user 0m2.342s
sys 0m0.131s
This is about 121MB/s.
Northwood P4 at 2.8GHz, 800MHz FSB, DDR400
Linux 2.6.8
Same test, SHA1: 77.3MB/s
cksum: 246MB/s
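Roughly the same measurement in Python, for anyone who wants to reproduce it (assumes the 300MB 'yoink' file from above exists; figures will obviously vary by machine):

import hashlib, time

data = open("yoink", "rb").read()
start = time.time()
hashlib.md5(data).hexdigest()
elapsed = time.time() - start
print("%.1f MB/s" % (len(data) / elapsed / 2**20))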
Stop-gap solution: hash offset files (Score:2, Interesting)