Forgot your password?
typodupeerror
Encryption Security

MD5 To Be Considered Harmful Someday 401

Posted by michael
from the house-built-on-sand dept.
Effugas writes "I've completed an applied security analysis (pdf) of MD5 given Xiaoyun Wang et al's collision attack (covered here and here). From an applied perspective, the attack itself is pretty limited -- essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note."
This discussion has been archived. No new comments can be posted.

MD5 To Be Considered Harmful Someday

Comments Filter:
  • damn (Score:2, Interesting)

    by YetAnotherDave (159442) on Tuesday December 07, 2004 @05:08PM (#11023624)
    I guess I know what I'll be coding now - SHA-1 just got a lot more important in my priority list... :(
  • Exploit? (Score:5, Interesting)

    by Limburgher (523006) on Tuesday December 07, 2004 @05:11PM (#11023669) Homepage Journal
    So does this mean that it's possible to find a useful MD5-equivalent file for any file? Just because someone alters a file does not mean they have done anything destructive. Would one be able to take a binary, make a change of some sort, and then run a tool to determine the block of data to add to the binary to both allow the change to take effect and cancel out the MD5 change? How complex would it be to construct this tool?
  • Not just MD5 (Score:3, Interesting)

    by PureFiction (10256) on Tuesday December 07, 2004 @05:20PM (#11023806)
    Other weaknesses were reported in various other secure digests, including MD4, RIPEMD, HAVAL-128, SHA-0.

    SHA-256 is a good replacement. SHA-1 should be fine but if you are going through the trouble of an upgrade, why not make it sufficiently future proof?
  • Re:damn (Score:5, Interesting)

    by WolfWithoutAClause (162946) on Tuesday December 07, 2004 @05:22PM (#11023849) Homepage
    There will ALWAYS be collisions with any kind of hashing algorythm.

    Yes, but a good hash makes it *extremely* difficult to find them. MD5 is looking pretty mediocre right now.

  • break it down (Score:0, Interesting)

    by Anonymous Coward on Tuesday December 07, 2004 @05:27PM (#11023914)
    Slicing and dicing the file into many different overlaping blocks and then computing a md5 sum for each block would be much much stronger.

    a. MD5 sum entire file
    b. MD5 sum each 10K block in the file (overlap blocks by ZZZ bytes)
    c. break file into different blocks such as every 40th byte in the file ....
  • by Anonymous Coward on Tuesday December 07, 2004 @05:32PM (#11023983)
    I was a bit suprised recently when I tried to
    sort 15,000 jpg image files to remove duplicates.
    Since the names varied, and it is not uncommon
    to have many images with the same length, it
    seemed like a good idea to use md5 hashes.
    So I coded it up in python to do the md5 hash,
    and then stuffed the file name into a table
    keyed by the md5 hash. Big suprise when multiple
    different files landed in the same hash.
    Some property of jpgs probably reduces the
    randomness of the files and increases the
    probability of hash collisions.
  • by davidwr (791652) on Tuesday December 07, 2004 @05:48PM (#11024217) Homepage Journal
    I haven't had time to think this out, so I'm throwing it out for you guys:

    Is doing a 2nd pass helpful?

    In other words, if
    ABCfireDEF
    and
    ABCiceeDEF
    hash the same, does
    ABCfireDEFABCfireDEF
    necessarily, or even frequently, hash to
    ABCiceeABCiceeDEF?

    Even if it were to provide protection, in practical terms,
    1) other hashing althorithms are likely faster than "MD5 times two"
    2) there may be some - hopefully a lot fewer - places in a long file where where
    ABCfireDEFABCfireDEF
    can be replaced with
    ABCiceeDEFABCmeltDEF

    I'm beginning to like the "pick two algorithms and call me in the morning" approach.
  • by HeghmoH (13204) on Tuesday December 07, 2004 @05:54PM (#11024311) Homepage Journal
    Let's say I have a system that downloads files over the internet. To prevent man-in-the-middle attacks, I digitally sign the files. This prevents me from having to vet all of the code that deals with local files for buffer overflows. I implement the digital signatures by simply encrypting a hash of the file with an RSA private key when I create the file, and decrypting and verifying the hash on the receiving side.

    If I had decided to use MD5 for the hash in the digital signature, my scheme is now vulnerable. It's not too far-fetched to imagine that somebody could come up with a way to embed an exploit in one of the files while staying within the limitations imposed by this collision technique. Then if he can accomplish a man-in-the-middle attack, he's successfully used my automatic downloader to compromise somebody's machine. Not fun.

    This may not be completely feasible currently, but the technique shows that it may be possible in the future. If you're currently designing a system that you plan to have function for several years, you should not assume that MD5 is cryptographically secure.
  • Re:Let's face it (Score:3, Interesting)

    by jd (1658) <imipak@yaCOLAhoo.com minus caffeine> on Tuesday December 07, 2004 @06:04PM (#11024489) Homepage Journal
    Let's say you break a file into blocks, encrypt those blocks with Rijndael or Serpent using a chaining method that authenticates the prior block, digitally sign the result using (seperately generated) RSA, DSA and ECC signatures in turn, and generate SHA-1 and Whirlpool checksums of both the encrypted and unencrypted file.


    True, you'd spend longer validating and decrypting the unholy mess generated than you'd spend downloading it, but I think you'd be fairly safe in assuming that the file was what it claimed to be.


    In other words, it isn't impossible to be so close to 100% secure as to make no odds. It is merely a question of whether it's worth it. Is the expense of protecting something greater than the value of what you're protecting? If so, nobody is going to bother, no matter how simple the task of protecting is.

  • by ReelOddeeo (115880) on Tuesday December 07, 2004 @06:06PM (#11024516)
    I was thinking the same thing.

    See the long list of cryptographic hash functions [wikipedia.org] on Wikipedia. (Bottom of page.)

    If you used three or four different hash functions, to produce a long combined message digest (concatenated from the several message digests), it would seem to be much more difficult to produce files that collide. An attack resulting in a collision on one algorithm, is not likely to work for another algorithm. Finding two files that simultaneously fool two or more algorithms seems a much more difficult task than just fooling a single algorithm like MD5.

    This is of utmost importance. We can't have the RIAA infiltrating garbage files ("what the fiddlesticks do you think you're doing?") into our p2p networks.
  • Re:You're wrong. (Score:1, Interesting)

    by Anonymous Coward on Tuesday December 07, 2004 @06:41PM (#11025085)
    You are correct that digital signatures w/ certificates is where the real risk lies for a flaw in MD5. Specifically, certificate chains are at risk, since you cannot truly validate a chain. This will have no affect on:

    "In particular, I expect that the first type of crack could cause issues with SSH keys, both user keys (used for authentication) and host keys (to prevent man-in-the-middle attacks)."

    The server has a store of acceptable public keys, third paty observable random data from both the client and server are combined and hashed using SHA-1, that data is then signed by the client's private key. Being able to come up with substitute md5 or sha-1 data will not affect the outcome of the signed data algorithm.

    For server authentication a certificate chain could be at risk. Although most SSH client implementations simply cache the public key, so they are vulnerable to first time active attacks.

    An algorithm like the following would be helpful for certificate chains. See original zero score post here (http://developers.slashdot.org/comments.pl?sid=13 1994&cid=11024443)

    H = H1(m + 0) + H2(m + 0)

    for counter = 1 to 999
    H = H1(m + counter + H) + H2(m + counter + H)

    return H;

    H1 and H2 are crypto hash algorithms. This approach is very strong, since both hashing algorithms are interleaved together. If one hashing algorithm is found to be weak, and other is strong, attacks will fail. Additionally, even if both algorithms are found to be weak interlacing and the number of rounds make it much more computationally difficult to attack. In comparison to DSA or RSA signature verification algorithms this additional work for generating the hash is very CPU light weight.

    Just finishing up the patent application. Done.

    Hey Moderator, this has to be worth more than a zero. Sheesh.

  • Almost forgot (Score:4, Interesting)

    by jd (1658) <imipak@yaCOLAhoo.com minus caffeine> on Tuesday December 07, 2004 @06:42PM (#11025091) Homepage Journal
    Whirlpool is a 256-bit hashing algorithm, derived from the Rijndael encryption algorithm. Rijndael is known to be strong, and has been approved by NIST, but the conversion to a hash function has not been sufficiently tested.


    Where time isn't critical (eg: creating and validating checksums for files), I'd say use both. The overhead isn't great, and you'd get much more security.


    Where time is critical AND you don't have to be concerned with computers not under your control, use Whirlpool. Rijndael is fast, SHA-1 is slow. Whirlpool also offers a longer hash string than SHA-1.


    In any other situation, use SHA-1. Whirlpool might turn out to be the greatest algorithm out there, but that doesn't help if you're trying to talk to a remote computer that doesn't support it.

  • Re:damn (Score:2, Interesting)

    by uncqual (836337) on Tuesday December 07, 2004 @07:34PM (#11025872)

    Agreed that folding A with B after A has been compromised is a nonstarter.

    However, isn't there merit to the general idea of combining two 'dissimilar' (i.e., the theoretical basis for the 'security' of each share as few attributes as feasible) hash functions X and Y where both X and Y are currently though to be 'secure'?

    In those very common cases where the hash is protecting something with decreasing time value (such as software binaries which become obsolete in a few years) or where the original source is secure and it is only alterations of copies that are a concern (perhaps historical legal documents), this buys time if X or Y (but not both) is compromised (or seem more likely to be compromised - as MD5 is now)?

    For example, if X=MD5, the concerns about MD5 would result in an effort to discontinue its use and replace it with a new hash function Z and start using Y+Z. While the community switches to a new combination, everything is still pretty safe (as safe as Y in this case) during this transition period.

    It seems that in the private sector (spooks aside), it's quite unlikely that X and Y would both be effectively compromised simultaneously and with little warning unless the notion of dissimilarity is also compromised. Fortunately, the skills needed to compromise such functions seem to be concentrated in the guys with the white hats so they tend to publish their preliminary results.

    Disclaimer: I'm not a cryptomaniac so my observations are worth substantially less than you paid for them.

  • Re:damn (Score:3, Interesting)

    by grumbel (592662) <grumbel@gmx.de> on Tuesday December 07, 2004 @11:01PM (#11027985) Homepage
    ### If it's a hash, the message CANNOT be reconstructable from it

    Depends on the hash function, if the input length is equal or smaller then output length reconstructing the message might be possible, ie:

    $ echo "hello" | md5sum
    b1946ac92492d2347c6235b4d2611184 -

    Should be easily reversible with a dictonary attack. However in the more common case a hash maps a large input domain, to a much smaller output domain, so yep, hashes are not reversible unless input is somehow smaller then the hash itself.
  • Re:damn (Score:1, Interesting)

    by Anonymous Coward on Wednesday December 08, 2004 @04:08AM (#11029955)
    $ dd if=/dev/urandom bs=1M count=300 > yoink
    300+0 records in
    300+0 records out
    314572800 bytes transferred in 63.270583 seconds (4971865 bytes/sec)
    $ time md5sum yoink
    a4a846995ccf65be3bfcd9e69325e69a yoink

    real 0m2.482s
    user 0m2.342s
    sys 0m0.131s

    This is about 121MB/s.
    Northwood P4 at 2.8 800FSB, DDR400
    linux 2.6.8

    Same test, SHA1: 77.3MB/s
    cksum: 246MB/s
  • by AsciiNaut (630729) on Wednesday December 08, 2004 @09:55AM (#11031308)
    Here's a potential stop-gap solution: provide two md5sums per file, one of the whole file as normal, and one of the file offset by one byte. Let's look at the two hash-equivalent files [tcs.hut.fi] cited by parent:
    $ cmp file1.dat file2.dat
    file1.dat file2.dat differ: char 20, line 1

    $ md5sum file1.dat file2.dat
    a4c0d35c95a63a805915367dcfe6b751 file1.dat
    a4c0d35c95a63a805915367dcfe6b751 file2.dat
    Whoops! Now examine hashes of the same files, omitting the first (identical) bytes:
    $ xxd -s 1 -ps file1.dat | xxd -r -p | md5sum
    84e6e0a21e2c4c9ef53f3762fdc90bc8 -
    $ xxd -s 1 -ps file2.dat | xxd -r -p | md5sum
    a63151008e5f8fc116ba947fd8af8c5a -
    Clearly this would not work with all collisions (nothing useful would), but it might hugely limit their frequency. And it would be relatively easy to tag on tables of 1-byte offset md5sums to existing md5sum tables out there.

This is a good time to punt work.

Working...