# It'd Be Faster Than Torrenting...April 6, 2010 10:57 AM   Subscribe

Why can't I reverse-engineer MD5 checksums to the file that they're from?
posted by 47triple2 to Technology (28 answers total) 2 users marked this as a favorite

Here is the number 5.

posted by adipocere at 11:01 AM on April 6, 2010 [16 favorites]

Response by poster: If we know that we are always using whole numbers, then I can say that it's:

0,5; 1,4; 2,3; 3,2; 4,1; 5,0.

So, I can choose between five possible options. (So, if I know that what they are giving me is a movie, I would pick the option that results in a movie.)

We know how MD5 works because we made it. So, if it only uses whole numbers, we can use that to figure it out.
posted by 47triple2 at 11:07 AM on April 6, 2010

There are infinitely many files that would generate the same checksum. How do you decide which one was used to generate the checksum in front of you when all you have is the checksum?
posted by mr_roboto at 11:09 AM on April 6, 2010

But we don't know that. And they can be negative numbers, too.
posted by dmd at 11:10 AM on April 6, 2010

Okay, then it's more like:

Here is the number 5.

Please tell me what numbers I added together and then modded by 10 to get it.
posted by suedehead at 11:11 AM on April 6, 2010

MD5 works by throwing information away. That's how you get a constant 128bit value no matter how much data you throw at it. Of course, because you throw information away, you can never take just the checksum and reconstruct the original data. Once data is lost, it's lost!

That's what adipocere was trying to point out.
posted by sbutler at 11:12 AM on April 6, 2010

Best answer: Ah, but you could be using negative numbers, too. You could be using the reals. You could be using complex numbers. The possibilities expand and expand and expand.

Now, look at the amount of information in an MD5 hash. And remember that it cycles. Here is a better fictional idea. Please never use this seriously:

Take your string of characters. Break it into five character long chunks. Convert each character to ASCII. Add those numbers together, then do a modulo 100 on them. Repeat for the next block of five, and add the previous sum to your new sum. At the end of it, you would have a hash.

Information is destroyed in making these hashes. It's burned up, just like addition.
posted by adipocere at 11:12 AM on April 6, 2010 [1 favorite]

n'thing it's that it's an irreversible transformation. Adding that there are MD5 collisions such that even if you could reverse it, there are multiple source sets that could satisfy a single hash.
posted by lych at 11:15 AM on April 6, 2010

Best answer: I just took all the letters and numbers in my password for my swiss bank account (it has billions in it, so it's in your interest to reverse this) and added them up. (A-Z = 1-26). Then I added the digits in that number, and so on, until I was left with one number.

That number is also 5, by coincidence.

I've told you exactly how I did it, so I expect to be a poor man soon.
posted by dmd at 11:15 AM on April 6, 2010 [6 favorites]

blatantly copied from wikipedia:

the ideal cryptographic hash function has four main or significant properties:

* it is easy to compute the hash value for any given message,
* it is infeasible to find a message that has a given hash,
* it is infeasible to modify a message without changing its hash,
* it is infeasible to find two different messages with the same hash.

if it were possible to easily determine the message, in your example a movie, the md5 would not be a useful cryptographic hash. a hash is not designed to compress data.
posted by phil at 11:17 AM on April 6, 2010

If you know the input into the MD5, then you can create a dictionary of input and the resulting MD5 hash. This doesn't work if MD5 is being used for security reasons where it's been salted (since you'll not know the complete input).
posted by forforf at 11:18 AM on April 6, 2010

An md5 checksum does not contain all the information needed to reconstruct the original file. It's not really guaranteed to be unique for every file. But what it aims to do is that it's "infeasible" or "hard" to find a file which has that checksum. Checksum collisions are expected, the desire is that it's just not easy to find them. And there certainly isn't enough information in the checksum to generate the original file.

for more reading: md5 and other hash functions are cryptographic hash function which is a form of a one way function.
posted by escher at 11:23 AM on April 6, 2010

Think of it this way; Here's a hash of a book:

Third letter on the 10th page is "e".
57th letter of the 2nd paragraph on page 12 is "q".
Page 48 is blank.
Assuming A=1,B=2,C=3, etc, the total value of the 5th, 12th, 30th, and 65th letters on each page have a remainder of 13 when divided by 26.

Now, please write the contents of the book based on this information.

Can't do it, can you? On the other hand, if you had a book, you could easily use this information to check if the book you have is the one referenced by the hash. It's good for checking for validity.

Of course, this is not the actual MD5 method. But it's the same ultimate idea. A "hash" is just a key number that is a fingerprint of the data. Not the data itself. A reversible encoding of an entire movie would be about as large as the movie itself. In fact a "ZIP" file is a good example of such an encoding.
posted by Vorteks at 11:24 AM on April 6, 2010 [1 favorite]

it would probably be worthwhile to read up on the basic concept of a cryptographic hash function.

If this stuff really interests you, you might also like Applied Cryptography by Bruce Schneier, which is pretty much the canonical text on crypto.
(rumor has it that Bruce knows Alice and Bob's shared secret.)
posted by namewithoutwords at 11:26 AM on April 6, 2010 [1 favorite]

Even leaving aside the problem that infinitely many strings have the same MD5 hash (i.e.: information is lost in the hashing process), we still don't have a great way of computing preimages for MD5. According to Wikipedia, the best known preimage attacks have complexity 2123.4, slightly better than pure brute force but still outside the reach of mere mortals. So, in your example ("So, I can choose between five possible options.") there are at least two problems: 1) there are not five options, but infinitely many, and 2) the best known methods take roughly 2123.4 operations to find even one option.

Now, why is this? Well, it's not necessarily because no better method exists. It's because no better method is currently known. I don't think there are any mathematically provable lower bounds attached to MD5. Could someday someone find an algorithm that computes MD5 preimages in a reasonable time-frame? Possibly. But as far as anyone knows, we can't do it now.
posted by mhum at 11:28 AM on April 6, 2010

Response by poster: Thanks everybody for explaining it! :)
posted by 47triple2 at 11:30 AM on April 6, 2010

There are infinitely many files that would generate the same checksum.

If you know the ballpark size of the file that generated the checksum, there shouldn't be infinitely many, but certainly enough to make the process impractical. MD5 encrypted passwords, because they're much smaller than a typical movie file, for instance, can be brute forced. You still end up with collisions, but depending on what your expectation of password length is, possibly not many.

In the magical scenario where you have infinitely fast computing for the purposes of brute forcing, you could could get yourself a set of collisions which you could further winnow down because 1) not all of them would be valid movie files & 2) a large proportion of those that were valid movie files could be eliminated because the content of the video could probably be algorithmically determined to be noise.

But yeah, that's brute force, it's not using the knowledge of the algorithm to reverse anything.
posted by juv3nal at 11:40 AM on April 6, 2010

Given this hamburger, why can't you recreate the cow?
posted by chairface at 11:41 AM on April 6, 2010 [2 favorites]

Let's tackle this from another angle. Your question title ("It's Be Faster Than Torrenting...") suggests you're fundamental thought problem is one of compression. Let's say we have a 2GB H.264 movie with AAC compression in an m4v format, and that it has an MD5 hash of X. So we tell our friend here are the parameters:

- file size: 2GB
- video codec: H.264
- audio codec: AAC
- container: m4v
- MD5 hash: X

Now, give all of those constraints, there is probably only one file that matches. Sure, there are an infinite number of files that have an MD5 hash of X. But also are 2GB in size? And valid m4v? And with valid H.264? And with valid AAC? No... pretty sure you won't find another one.

So how does our friend go about reconstructing the original file? It's been mathematically proven that you cannot have a lossless compression scheme that takes an arbitrary input and produces a fixed size output (wikipedia). And that's what our hash is... a fixed sized output.

What our friend must do is guess what the 2GB file looks like, hash it, and see if the MD5's match. Then guess again. And again. And again. He'll be guessing till the cows come home. Eventually he might stumble upon the correct input to meet our constraints, but by that time he might as well have just transfered the file over bittorrent.

The fact is that for any non-trival example, it is much much easier to just transfer the file than guess and test it. Even for passwords (6-12 characters) the problem is not easy. There are MD5 rainbow tables but the beauty of those is that someone already went to all the trouble to generate them. If you had to generate them yourself it might take too long*.

* MD5 is a bad hash to choose for this example because current research has discovered weaknesses that makes it easier to guess a password that hashes to the same value. But my point stands in general.
posted by sbutler at 11:42 AM on April 6, 2010

It has been explained, but I think it is cool so let me answer. MD5 loses information, so you can't reverse it. The best you can do is guess what the original document was, then MD5 your guess to see if it matches. Then you may or may not have the original, since other things might make the same hash, but at least you can try it and see.

To understand how big the search space is, there are 2128 possible hashes. If you do the conversion, this works out to about 1038. As a point of reference, the diameter of the visible universe is approximately 1026meters. So there are about as many hash possibilities to try as there are if you lined about 1 trillion observable universes up in a row and then measured them with a measuring stick.

Of course, on average you only have to try half of the possibilities before you find a match, so you have that going for you.
posted by procrastination at 11:44 AM on April 6, 2010 [1 favorite]

What our friend must do is guess what the 2GB file looks like, hash it, and see if the MD5's match. Then guess again. And again. And again. He'll be guessing till the cows come home. Eventually he might stumble upon the correct input to meet our constraints, but by that time he might as well have just transfered the file over bittorrent.

Yeah. In case it's not utterly clear what brute force guessing entails, it means generating every sequence of bits that could possibly fit the size requirements. If you have exhausted the problem space in doing so (and you must because having found one match doesn't mean you've found all collisions), you would have generated every movie file of that file size that will ever be made including movies featuring stars that have not been born yet etc.
posted by juv3nal at 11:51 AM on April 6, 2010

to save you a question next week, this idea won't work either: instead of transmitting a long string of binary digits, just find that same string somewhere in pi, and send the start and end position. everyone could just keep a copy of pi out to 1 trillion digits or so and have a copy of almost everything, and never need to actualy download more than start and end position of where the requested string exists in pi.

the problem exists in actually finding the string you want in pi, and having someone else pull out the same numbers. it'd probably be faster to torrent it.
posted by jrishel at 12:44 PM on April 6, 2010

It has been explained, but I think it is cool so let me answer. MD5 loses information, so you can't reverse it.

No one seems to have mentioned why. Claude Shannon proved that getting that information back was the same as violating the Second Law of Thermodynamics.

In other words, doing what you want to do is exactly the same as inventing perpetual motion. It is physically impossible.
posted by Chocolate Pickle at 1:38 PM on April 6, 2010

Shannon's entropy is not the same as thermodynamical entropy. There are quantum-mechanical implications to information, but it's not thermodynamics.
posted by phliar at 4:51 PM on April 6, 2010

No one seems to have mentioned why. Claude Shannon proved that getting that information back was the same as violating the Second Law of Thermodynamics.
I think you mean it's because MD5 checksums are just applications of a hash function that map an infinite set to a set with only 2128 elements. It's onto but it's also obvious that there will be multiple elements in the domain that map to the same element in the range. You'll need more information to guess which of the multiple elements is the original (and I think that's where some of Shannon's research comes in).
posted by tksh at 5:13 PM on April 6, 2010

I think a lot of the answers here have missed something important. Yes, it's true that the MD5 key-space is finite, so there's no 1:1 mapping between MD5 hashes and all possible files. But 2128 is a very big number, and in practice we expect never to find two files that hash to the same MD5 in a practical application. For the universe of English texts, or all 32 character passwords, or all images on the Internet MD5 might as well be 1:1.

A far more important property of MD5's security is that it's a cryptographic hash that mixes the input bits so well that knowing the MD5 hash doesn't tell you much about the input source. There's lots of good links about cryptographic hashes up there that will explain the concept in more detail. The analogy of grinding a cow into hamburger isn't bad, actually.

Also know that MD5 is actually pretty weak as cryptography hashes go, there are better mixing functions out there now. In particular, MD5 weaknesses combined with rainbow tables may lead to practical attacks on real MD5 hashed data.
posted by Nelson at 6:49 PM on April 6, 2010

Sure, there are an infinite number of files that have an MD5 hash of X. But also are 2GB in size? And valid m4v? And with valid H.264? And with valid AAC?

Well, a 2GB file has 16 billion bits in it. an MD5 has 128. This means that (assuming even distribution,) there are 2**(16,000,000,000 - 128) possible files, given a single MD5. Assuming AAC uses a 128 bit hash for its copy protection, then only one out of 2**128 of those will be valid. So, you'll have 2**(16,000,000,000 - 256) possible files. In fact, in general, if there are 'n' bits of the file that have to be "just so" for it to be valid H.264, m4v, or AAC, or whatever, you'll have 2**(16,000,000,000 - 128 - n) possibilities. Since those n bits have to be in a particular configuration, they're not really conveying any information, so if the compression scheme is reasonably good, then n should be small. In fact, in an ideal compression scheme, all strings of bits would be valid (minus a few for error checking.) So, if we go ahead and assume that about 1/16th of a m4v/H.264 file is this kind of overhead-- that is, for every 15 bits encoded, there is a single bit which has to be a particular way-- we are left with 2**(15,000,000 - 128) possible valid 2GB m4v/H.264/AAC movies with a particular MD5 hash. That's a lot of movies! One of them is probably a documentary on how to cure cancer.
posted by blenderfish at 10:21 PM on April 6, 2010

sorry. 2**(15,000,000,000 - 128) files all told.
posted by blenderfish at 10:29 PM on April 6, 2010

« Older Who needs architectural postcards and books?   |   Why don't sites like DealExtreme sell their goods... Newer »