Comments on: It'd Be Faster Than Torrenting...
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting/
Comments on Ask MetaFilter post It'd Be Faster Than Torrenting...Tue, 06 Apr 2010 11:01:54 -0800Tue, 06 Apr 2010 11:01:54 -0800en-ushttp://blogs.law.harvard.edu/tech/rss60Question: It'd Be Faster Than Torrenting...
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting
Why can't I reverse-engineer MD5 checksums to the file that they're from?post:ask.metafilter.com,2010:site.150449Tue, 06 Apr 2010 10:57:43 -080047triple2MD5ChecksumsecurityresolvedBy: adipocere
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155092
Here is the number 5.<br>
<br>
Please tell me what numbers I added together to get it.comment:ask.metafilter.com,2010:site.150449-2155092Tue, 06 Apr 2010 11:01:54 -0800adipocereBy: 47triple2
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155095
If we know that we are always using whole numbers, then I can say that it's:<br>
<br>
0,5; 1,4; 2,3; 3,2; 4,1; 5,0.<br>
<br>
So, I can choose between five possible options. (So, if I know that what they are giving me is a movie, I would pick the option that results in a movie.)<br>
<br>
We know how MD5 works because we made it. So, if it only uses whole numbers, we can use that to figure it out.comment:ask.metafilter.com,2010:site.150449-2155095Tue, 06 Apr 2010 11:07:59 -080047triple2By: mr_roboto
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155097
There are infinitely many files that would generate the same checksum. How do you decide which one was used to generate the checksum in front of you when all you have is the checksum?comment:ask.metafilter.com,2010:site.150449-2155097Tue, 06 Apr 2010 11:09:28 -0800mr_robotoBy: dmd
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155098
But we don't know that. And they can be negative numbers, too.comment:ask.metafilter.com,2010:site.150449-2155098Tue, 06 Apr 2010 11:10:23 -0800dmdBy: suedehead
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155099
Okay, then it's more like:<br>
<br>
Here is the number 5.<br>
<br>
Please tell me what numbers I added together and then <a href="http://en.wikipedia.org/wiki/Modulo_operation">modded</a> by 10 to get it.comment:ask.metafilter.com,2010:site.150449-2155099Tue, 06 Apr 2010 11:11:15 -0800suedeheadBy: sbutler
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155100
MD5 works by throwing information away. That's how you get a constant 128bit value no matter how much data you throw at it. Of course, because you throw information away, you can never take just the checksum and reconstruct the original data. Once data is lost, it's lost!<br>
<br>
That's what adipocere was trying to point out.comment:ask.metafilter.com,2010:site.150449-2155100Tue, 06 Apr 2010 11:12:00 -0800sbutlerBy: adipocere
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155102
Ah, but you could be using negative numbers, too. You could be using the reals. You could be using complex numbers. The possibilities expand and expand and expand.<br>
<br>
Now, look at the amount of information in an MD5 hash. And remember that it cycles. Here is a better fictional idea. Please never use this seriously:<br>
<br>
Take your string of characters. Break it into five character long chunks. Convert each character to ASCII. Add those numbers together, then do a modulo 100 on them. Repeat for the next block of five, and add the previous sum to your new sum. At the end of it, you would have a hash.<br>
<br>
Information is destroyed in making these hashes. It's burned up, just like addition.comment:ask.metafilter.com,2010:site.150449-2155102Tue, 06 Apr 2010 11:12:43 -0800adipocereBy: lych
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155104
n'thing it's that it's an irreversible transformation. Adding that there are MD5 collisions such that even if you could reverse it, there are multiple source sets that could satisfy a single hash.comment:ask.metafilter.com,2010:site.150449-2155104Tue, 06 Apr 2010 11:15:00 -0800lychBy: dmd
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155105
I just took all the letters and numbers in my password for my swiss bank account (it has billions in it, so it's in your interest to reverse this) and added them up. (A-Z = 1-26). Then I added the digits in that number, and so on, until I was left with one number.<br>
<br>
That number is also 5, by coincidence.<br>
<br>
I've told you exactly how I did it, so I expect to be a poor man soon.comment:ask.metafilter.com,2010:site.150449-2155105Tue, 06 Apr 2010 11:15:11 -0800dmdBy: phil
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155109
blatantly copied from wikipedia:<br>
<br>
the ideal cryptographic hash function has four main or significant properties:<br>
<br>
* it is easy to compute the hash value for any given message,<br>
* <b>it is infeasible to find a message that has a given hash</b>,<br>
* it is infeasible to modify a message without changing its hash,<br>
* it is infeasible to find two different messages with the same hash.<br>
<br>
if it were possible to easily determine the message, in your example a movie, the md5 would not be a useful cryptographic hash. a hash is not designed to compress data.comment:ask.metafilter.com,2010:site.150449-2155109Tue, 06 Apr 2010 11:17:24 -0800philBy: forforf
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155113
If you know the input into the MD5, then you can create a dictionary of input and the resulting MD5 hash. This doesn't work if MD5 is being used for security reasons where it's been salted (since you'll not know the complete input).comment:ask.metafilter.com,2010:site.150449-2155113Tue, 06 Apr 2010 11:18:52 -0800forforfBy: escher
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155117
An md5 checksum does not contain all the information needed to reconstruct the original file. It's not really guaranteed to be unique for every file. But what it aims to do is that it's "infeasible" or "hard" to find a file which has that checksum. Checksum collisions are expected, the desire is that it's just not easy to find them. And there certainly isn't enough information in the checksum to generate the original file.<br>
<br>
for more reading: md5 and other hash functions are <a href="http://en.wikipedia.org/wiki/Cryptographic_hash_function">cryptographic hash function</a> which is a form of a <a href="http://en.wikipedia.org/wiki/One-way_function">one way function</a>.comment:ask.metafilter.com,2010:site.150449-2155117Tue, 06 Apr 2010 11:23:11 -0800escherBy: Vorteks
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155119
Think of it this way; Here's a hash of a book:<br>
<br>
Third letter on the 10th page is "e".<br>
57th letter of the 2nd paragraph on page 12 is "q".<br>
Page 48 is blank.<br>
Assuming A=1,B=2,C=3, etc, the total value of the 5th, 12th, 30th, and 65th letters on each page have a remainder of 13 when divided by 26.<br>
<br>
Now, please write the contents of the book based on this information.<br>
<br>
Can't do it, can you? On the other hand, if you had a book, you could easily use this information to check if the book you have is the one referenced by the hash. It's good for checking for validity.<br>
<br>
Of course, this is not the actual MD5 method. But it's the same ultimate idea. A "hash" is just a key number that is a fingerprint of the data. Not the data itself. A reversible encoding of an entire movie would be about as large as the movie itself. In fact a "ZIP" file is a good example of such an encoding.comment:ask.metafilter.com,2010:site.150449-2155119Tue, 06 Apr 2010 11:24:19 -0800VorteksBy: namewithoutwords
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155123
it would probably be worthwhile to read up on the basic concept of a <a href="http://en.wikipedia.org/wiki/Cryptographic_hash_function">cryptographic hash function</a>. <br>
<br>
If this stuff really interests you, you might also like <a href="http://www.amazon.com/exec/obidos/ASIN/0471117099/metafilter-20/ref=nosim/">Applied Cryptography</a> by Bruce Schneier, which is pretty much the canonical text on crypto. <br>
<small>(<a href="http://www.schneierfacts.com/fact/18">rumor has it</a> that Bruce knows Alice and Bob's shared secret.)</small>comment:ask.metafilter.com,2010:site.150449-2155123Tue, 06 Apr 2010 11:26:39 -0800namewithoutwordsBy: mhum
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155125
Even leaving aside the problem that infinitely many strings have the same MD5 hash (i.e.: information is lost in the hashing process), we still don't have a great way of computing <a href="http://en.wikipedia.org/wiki/Preimage_attack">preimages</a> for MD5. According to <a href="http://en.wikipedia.org/wiki/MD5#Preimage_vulnerability">Wikipedia</a>, the best known preimage attacks have complexity 2<sup>123.4</sup>, slightly better than pure brute force but still outside the reach of mere mortals. So, in your example ("<i>So, I can choose between five possible options.</i>") there are at least two problems: 1) there are not five options, but infinitely many, and 2) the best known methods take roughly 2<sup>123.4</sup> operations to find even one option.<br>
<br>
Now, why is this? Well, it's not necessarily because no better method exists. It's because no better method <i>is currently known</i>. I don't think there are any mathematically provable lower bounds attached to MD5. Could someday someone find an algorithm that computes MD5 preimages in a reasonable time-frame? Possibly. But as far as anyone knows, we can't do it now.comment:ask.metafilter.com,2010:site.150449-2155125Tue, 06 Apr 2010 11:28:54 -0800mhumBy: 47triple2
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155126
Thanks everybody for explaining it! :)comment:ask.metafilter.com,2010:site.150449-2155126Tue, 06 Apr 2010 11:30:31 -080047triple2By: juv3nal
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155132
<em>There are infinitely many files that would generate the same checksum.</em><br>
<br>
If you know the ballpark size of the file that generated the checksum, there shouldn't be <em>infinitely</em> many, but certainly enough to make the process impractical. MD5 encrypted passwords, because they're much smaller than a typical movie file, for instance, can be brute forced. You still end up with collisions, but depending on what your expectation of password length is, possibly not many.<br>
<br>
In the magical scenario where you have infinitely fast computing for the purposes of brute forcing, you could could get yourself a set of collisions which you could further winnow down because 1) not all of them would be valid movie files & 2) a large proportion of those that were valid movie files could be eliminated because the content of the video could probably be algorithmically determined to be noise.<br>
<br>
But yeah, that's brute force, it's not using the knowledge of the algorithm to reverse anything.comment:ask.metafilter.com,2010:site.150449-2155132Tue, 06 Apr 2010 11:40:15 -0800juv3nalBy: chairface
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155135
Given this hamburger, why can't you recreate the cow?comment:ask.metafilter.com,2010:site.150449-2155135Tue, 06 Apr 2010 11:41:57 -0800chairfaceBy: sbutler
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155136
Let's tackle this from another angle. Your question title ("It's Be Faster Than Torrenting...") suggests you're fundamental thought problem is one of compression. Let's say we have a 2GB H.264 movie with AAC compression in an m4v format, and that it has an MD5 hash of X. So we tell our friend here are the parameters:<br>
<br>
- file size: 2GB<br>
- video codec: H.264<br>
- audio codec: AAC<br>
- container: m4v<br>
- MD5 hash: X<br>
<br>
Now, give <em>all</em> of those constraints, there is probably only one file that matches. Sure, there are an infinite number of files that have an MD5 hash of X. But also are 2GB in size? And valid m4v? And with valid H.264? And with valid AAC? No... pretty sure you won't find another one.<br>
<br>
So how does our friend go about reconstructing the original file? It's been mathematically proven that you cannot have a lossless compression scheme that takes an arbitrary input and produces a fixed size output (<a href="http://en.wikipedia.org/wiki/Lossless_compression#Limitations">wikipedia</a>). And that's what our hash is... a fixed sized output.<br>
<br>
What our friend must do is guess what the 2GB file looks like, hash it, and see if the MD5's match. Then guess again. And again. And again. He'll be guessing till the cows come home. Eventually he might stumble upon the correct input to meet our constraints, but by that time he might as well have just transfered the file over bittorrent.<br>
<br>
The fact is that for any non-trival example, it is much much easier to just transfer the file than guess and test it. Even for passwords (6-12 characters) the problem is not easy. There are MD5 <a href="http://en.wikipedia.org/wiki/Rainbow_tables">rainbow tables</a> but the beauty of those is that someone already went to all the trouble to generate them. If you had to generate them yourself it might take too long*.<br>
<br>
<small>* MD5 is a bad hash to choose for this example because current research has discovered weaknesses that makes it easier to guess a password that hashes to the same value. But my point stands in general.</small>comment:ask.metafilter.com,2010:site.150449-2155136Tue, 06 Apr 2010 11:42:23 -0800sbutlerBy: procrastination
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155138
It has been explained, but I think it is cool so let me answer. MD5 loses information, so you can't reverse it. The best you can do is guess what the original document was, then MD5 your guess to see if it matches. Then you may or may not have the original, since other things might make the same hash, but at least you can try it and see.<br>
<br>
To understand how big the search space is, there are 2<sup>128</sup> possible hashes. If you do the conversion, this works out to about 10<sup>38</sup>. As a point of reference, the diameter of the visible universe is approximately 10<sup>26</sup>meters. So there are about as many hash possibilities to try as there are if you lined about 1 trillion observable universes up in a row and then measured them with a measuring stick.<br>
<br>
Of course, on average you only have to try half of the possibilities before you find a match, so you have that going for you.comment:ask.metafilter.com,2010:site.150449-2155138Tue, 06 Apr 2010 11:44:48 -0800procrastinationBy: juv3nal
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155151
<em>What our friend must do is guess what the 2GB file looks like, hash it, and see if the MD5's match. Then guess again. And again. And again. He'll be guessing till the cows come home. Eventually he might stumble upon the correct input to meet our constraints, but by that time he might as well have just transfered the file over bittorrent.</em><br>
<br>
Yeah. In case it's not utterly clear what brute force guessing entails, it means generating every sequence of bits that could possibly fit the size requirements. If you have exhausted the problem space in doing so (and you must because having found one match doesn't mean you've found all collisions), <em>you would have generated every movie file of that file size that will ever be made including movies featuring stars that have not been born yet</em> etc.comment:ask.metafilter.com,2010:site.150449-2155151Tue, 06 Apr 2010 11:51:34 -0800juv3nalBy: jrishel
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155224
to save you a question next week,<a href="http://www.dublish.com/articles/31.html"> this idea won't work either</a>: instead of transmitting a long string of binary digits, just find that same string somewhere in pi, and send the start and end position. everyone could just keep a copy of pi out to 1 trillion digits or so and have a copy of almost everything, and never need to actualy download more than start and end position of where the requested string exists in pi.<br>
<br>
the problem exists in actually finding the string you want in pi, and having someone else pull out the same numbers. it'd probably be faster to torrent it.comment:ask.metafilter.com,2010:site.150449-2155224Tue, 06 Apr 2010 12:44:22 -0800jrishelBy: Chocolate Pickle
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155278
<i>It has been explained, but I think it is cool so let me answer. MD5 loses information, so you can't reverse it.</i><br>
<br>
No one seems to have mentioned why. <a href="http://en.wikipedia.org/wiki/Information_theory">Claude Shannon</a> proved that getting that information back was the same as violating the Second Law of Thermodynamics.<br>
<br>
In other words, doing what you want to do is exactly the same as inventing perpetual motion. It is physically impossible.comment:ask.metafilter.com,2010:site.150449-2155278Tue, 06 Apr 2010 13:38:29 -0800Chocolate PickleBy: phliar
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155468
Shannon's entropy is not the same as thermodynamical entropy. There are quantum-mechanical implications to information, but it's not thermodynamics.comment:ask.metafilter.com,2010:site.150449-2155468Tue, 06 Apr 2010 16:51:16 -0800phliarBy: tksh
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155495
<blockquote><em>No one seems to have mentioned why. Claude Shannon proved that getting that information back was the same as violating the Second Law of Thermodynamics.</em></blockquote><br>
<br>
I think you mean it's because MD5 checksums are just applications of a hash function that map an infinite set to a set with only 2<sup>128</sup> elements. It's onto but it's also obvious that there will be multiple elements in the domain that map to the same element in the range. You'll need more information to guess which of the multiple elements is the original (and I think that's where some of Shannon's research comes in).comment:ask.metafilter.com,2010:site.150449-2155495Tue, 06 Apr 2010 17:13:28 -0800tkshBy: Nelson
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155614
I think a lot of the answers here have missed something important. Yes, it's true that the MD5 key-space is finite, so there's no 1:1 mapping between MD5 hashes and all possible files. But 2<sup>128</sup> is a very big number, and in practice we expect never to find two files that hash to the same MD5 in a practical application. For the universe of English texts, or all 32 character passwords, or all images on the Internet MD5 might as well be 1:1.<br>
<br>
A far more important property of MD5's security is that it's a cryptographic hash that mixes the input bits so well that knowing the MD5 hash doesn't tell you much about the input source. There's lots of good links about cryptographic hashes up there that will explain the concept in more detail. The analogy of grinding a cow into hamburger isn't bad, actually.<br>
<br>
Also know that MD5 is actually pretty weak as cryptography hashes go, there are better mixing functions out there now. In particular, MD5 weaknesses combined with <a href="http://en.wikipedia.org/wiki/Rainbow_table">rainbow tables</a> may lead to practical attacks on real MD5 hashed data.comment:ask.metafilter.com,2010:site.150449-2155614Tue, 06 Apr 2010 18:49:08 -0800NelsonBy: blenderfish
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155854
<i>Sure, there are an infinite number of files that have an MD5 hash of X. But also are 2GB in size? And valid m4v? And with valid H.264? And with valid AAC?</i><br>
<br>
Well, a 2GB file has 16 billion bits in it. an MD5 has 128. This means that (assuming even distribution,) there are 2**(16,000,000,000 - 128) possible files, given a single MD5. Assuming AAC uses a 128 bit hash for its copy protection, then only one out of 2**128 of those will be valid. So, you'll have 2**(16,000,000,000 - 256) possible files. In fact, in general, if there are '<i>n</i>' bits of the file that have to be "just so" for it to be valid H.264, m4v, or AAC, or whatever, you'll have 2**(16,000,000,000 - 128 - n) possibilities. Since those <i>n</i> bits have to be in a particular configuration, they're not really conveying any information, so if the compression scheme is reasonably good, then n should be small. In fact, in an ideal compression scheme, <i>all</i> strings of bits would be valid (minus a few for error checking.) So, if we go ahead and assume that about 1/16th of a m4v/H.264 file is this kind of overhead-- that is, for every 15 bits encoded, there is a single bit which has to be a particular way-- we are left with 2**(15,000,000 - 128) possible valid 2GB m4v/H.264/AAC movies with a particular MD5 hash. That's a lot of movies! One of them is probably a documentary on how to cure cancer.comment:ask.metafilter.com,2010:site.150449-2155854Tue, 06 Apr 2010 22:21:35 -0800blenderfishBy: blenderfish
http://ask.metafilter.com/150449/Itd-Be-Faster-Than-Torrenting#2155860
sorry. 2**(15,000,000<b>,000</b> - 128) files all told.comment:ask.metafilter.com,2010:site.150449-2155860Tue, 06 Apr 2010 22:29:02 -0800blenderfish