MD5 and the probability of collisions...
November 7, 2006 1:28 AM   Subscribe

I have a database in which the key is a string of variable length (basically user agent strings of anything between 20 and 200+ characters). I really don't like this idea.

I was wondering about changing it so that it stores the md5 hash of this string, but I'm worried that I'm going to end up with collisions.

What is the chance that two records from my 10,000 record strong database will have the same md5?

Is there a better way of doing this?
posted by twine42 to Computers & Internet (17 answers total)
 
Slim. An md5 hash is 128 bits long, so assuming that all hashes have an equal chance of occurring, the odds of any two random strings hashing to the same value are 1 in 2^128.

However, the correct way to do this is simply add a new integer column to the table, and use that as the primary key.
posted by Leon at 2:01 AM on November 7, 2006


Best answer: Assuming MD5 ideally distributes its results along the 0..2^128 space (which it doesn't, quite), you can calculate the chance of two keys in a collection of size n colliding. This is often called the 'birthday problem' or 'birthday paradox'. I won't calculate the probability of collision for a population of 10,000 here, but it is a very, very small number, I assure you.

However, of course, MD5 is not going to be perfectly ideal in its distribution of keys. But MD5 is designed to be fairly close. For a long time, finding two sequences of characters which result in the same MD5 hash was a worthwhile cryptographic pursuit. Finding a collision using a 'brute force' approach, which it could be argued is akin to what you'd (unwittingly) be doing here, would take a very, very long time. Approaches which locate a collision in a short amount of time exploit known holes in the hashing algorithm.

In fact, MD5 may be overkill relative to, say, CRC-32.

You'll have to decide how unacceptable the chance of collision is, though. I accept no liability. :) Don't use it if you're writing life-critical applications, nuclear plant control software, or mind-control laser firmware.
posted by blenderfish at 2:10 AM on November 7, 2006 [2 favorites]
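For the curious, the birthday-bound figure blenderfish alludes to is easy to approximate. A minimal sketch in Python, assuming an ideal, uniformly distributed 128-bit hash (the function name is my own invention):

```python
import math

def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation: probability that at least two
    of n uniformly random `bits`-bit hashes collide."""
    # p = 1 - exp(-n(n-1) / 2^(bits+1)); expm1 preserves precision
    # when the probability is astronomically small.
    return -math.expm1(-n * (n - 1) / (2.0 ** (bits + 1)))

print(collision_probability(10_000))   # roughly 1.5e-31
print(collision_probability(2 ** 64))  # roughly 0.39 -- near the ~2^64 50% mark
```

So with 10,000 records the chance of any two colliding is on the order of 10^-31, which for practical purposes is zero.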


Yeah, I second Leon's suggestion: add a table which maps a (probably auto-incrementing) integer to a string, and use that integer everywhere as a key. This also lets you do a reverse lookup from key back to string, which of course you can't do with MD5.
posted by blenderfish at 2:13 AM on November 7, 2006


Response by poster: The problem with Leon's correction is that the string I'm being given comes from an external system, so if 'bert' and 'ernie' both led to the same md5 string I'd need to know which one of them it was - a numeric key isn't going to help me with that one.

blenderfish - no mind control lasers you say? Damn. What about an md5().sha1()? ;)
posted by twine42 at 2:20 AM on November 7, 2006


Response by poster: That wasn't well explained - I left out that the main attraction of this is to replace the potentially any-length text strings with a fixed- or known-length string. Moving the string and a key out into a separate table just moves my problem elsewhere.
posted by twine42 at 2:22 AM on November 7, 2006


Best answer: Your problem is an example of the birthday paradox. In your case, since MD5 is a 128-bit hash, the probability of a collision is less than 2^-100. You'd need about 2^64 records before the probability of a collision rose to 50%. However, MD5 is no longer considered to be cryptographically secure (MD5 was broken in 2005). Are there adverse consequences for your application if an opponent could force a collision (perhaps by supplying two different user agent strings that are known to hash to the same value using MD5)? If the answer is yes, don't use MD5; use a hash function that is cryptographically secure (I recommend SHA-256).
posted by RichardP at 2:23 AM on November 7, 2006
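RichardP's SHA-256 suggestion can be sketched with Python's standard hashlib module. The helper name `ua_key` is made up; the point is that the hex digest is always 64 characters, no matter how long the user-agent string is:

```python
import hashlib

def ua_key(user_agent: str) -> str:
    """Fixed-length key for a user-agent string: SHA-256 hex digest."""
    return hashlib.sha256(user_agent.encode("utf-8")).hexdigest()

short_key = ua_key("Mozilla/4.0")
long_key = ua_key("Mozilla/5.0 (compatible; " + "x" * 200 + ")")
print(len(short_key), len(long_key))  # 64 64 -- same width either way
```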


So 10k strings of up to, let's say, a limit of 255 characters -- is the 2.5MB really an issue?

Keep in mind that if you hash the value before you store it, you won't have a good way to go backwards to the original string if you need to (i.e., the user-agent string which hashes to "0xDEAFBEEF" always crashes my web process; I wish I knew what that string was.)
posted by blenderfish at 2:33 AM on November 7, 2006


Previously, except I was just asking about collisions on the first 16 digits...
posted by disillusioned at 2:40 AM on November 7, 2006


I'm a bit confused here. You don't need additional tables, or same-length text strings as keys.

Adding an additional column and making it PK doesn't stop you using the existing text column in any way you want.

Using MySQL as an example, and given the following table:

CREATE TABLE tUseragent (
sUseragent VARCHAR( 255 ) NOT NULL ,
PRIMARY KEY ( sUseragent )
);

just add a primary key column on the front of the table:

ALTER TABLE tUseragent DROP PRIMARY KEY;
ALTER TABLE tUseragent ADD iUseragent INT( 11 ) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

Index the sUseragent column to taste.

Or am I missing something?
posted by Leon at 3:19 AM on November 7, 2006


"I have a database in which the key is a string of variable length (basically user agent strings of anything between 20 and 200+ characters). I really don't like this idea."

On preview, what Leon said. IDs should be synthetic auto-increment.

Then just put a unique constraint or unique index on the useragent column. (The database will typically use an index, such as a B-tree or hash, to enforce this constraint, but that's an implementation detail you don't have to deal with. If the RDBMS is any good, it'll do this right.)
posted by orthogonality at 3:29 AM on November 7, 2006


More information on generating collisions in MD5 (in under 1 minute), in case this is a security application.
posted by about_time at 5:12 AM on November 7, 2006


Dan Kaminsky wrote a nice paper on this, "MD5 To Be Considered Harmful Someday." It might be worth a read-through if you want to understand the implications of MD5 being "broken."
posted by mock at 5:17 AM on November 7, 2006


I definitely have to recommend using an auto_increment for the primary key. Simple and easy.
posted by cellphone at 6:08 AM on November 7, 2006


Response by poster: I don't understand why people are apparently misunderstanding my question.

I know all about auto_incrementing primary keys. I love them muchly and tell them so on a regular basis.

What I wanted to know was how likely two strings of indeterminate length were to collide if I md5ed (or sha1ed, or whatever) them, because I wanted a fixed-length string rather than one that could be any length from 20 characters upwards. It's this string that is the only column ever searched for in the table. Introducing a numeric key gives me a unique identifier for that row, but doesn't answer my question about shortening the search string.

Leon - my apologies. I read blenderfish's reference to your comment and attributed it to you.
posted by twine42 at 8:33 AM on November 7, 2006
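On the fixed-length point specifically: an MD5 digest is always the same width regardless of input length, so it can back a fixed-width column (CHAR(32) for hex, or BINARY(16) for raw bytes). A quick illustration in Python; the agent string here is invented:

```python
import hashlib

# A made-up mobile-style user-agent string, just for illustration.
ua = "NokiaN70-1/5.0609.2.0.1 Series60/2.8 Profile/MIDP-2.0 Configuration/CLDC-1.1"
h = hashlib.md5(ua.encode("utf-8"))

print(len(h.hexdigest()))  # 32 -- fits a CHAR(32) column
print(len(h.digest()))     # 16 -- or a BINARY(16) column, half the size
```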


I'd almost guarantee you could use any common hashing algorithm and not run into a collision. If you continued to include the plaintext column (but not index it), you could easily work around this problem. (WHERE hash=whatever AND plaintext=whatever). The index on the hash column would be used to narrow things down to a handful of rows, so the comparison against plaintext would only happen on those rows that are probably correct anyway.

That is, if you keep the plaintext column. It wouldn't make a lot of sense to drop it entirely, since you'd effectively lose all that data (hashing is one-way).

Keep in mind though that you'd be limiting yourself to exact-match searches only if you dropped the index on your plaintext column. Given that these are user-agent strings, you might want to match, for example, all versions of IE. You can't do this with a hash, obviously.

Yeah, indexes on variable-length text fields suck, but sometimes that's exactly what you need (the ability to search for matches against variable-length text fields).

You could make a new table called "ua_strings" that maps each user agent to a numerical ID. The idea is that each unique string would only be in this mapping table once. The index would be smaller, depending on how many repetitions of the same string you have. Then, in your other table, you just refer to the numerical ID of that particular string.

This is worth the hassle if the original table has hundreds of times the number of unique strings, but it's been my experience that with all of the crap stuffed into UA strings these days, they don't repeat enough for this to help much.

Still another option is to parse the UA string into pieces before inserting. Say a column called "browser" which is an enum of (Mozilla, Opera, IE, Firefox, other), and a float called browser_ver that you could stuff the major and minor versions into with a period separating. This is useful if your concern is the major browsers, but wouldn't help if you're trying to find all those obscure browsers, or if you care about the extra data that works its way into UA strings.
posted by hutta at 9:22 AM on November 7, 2006
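hutta's indexed-hash-plus-plaintext-check scheme can be sketched with Python's standard sqlite3 module (table and column names are invented, and the thread's database is MySQL rather than SQLite, but the idea carries over):

```python
import hashlib
import sqlite3

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ua_strings ("
    " ua_hash CHAR(32) NOT NULL,"   # fixed-length md5 hex digest, indexed
    " ua_text TEXT NOT NULL)"       # original string, kept but unindexed
)
conn.execute("CREATE INDEX idx_ua_hash ON ua_strings (ua_hash)")

for ua in ("NokiaN70/2.0 (3.0546.2.3)", "SonyEricssonK750i/R1A"):
    conn.execute("INSERT INTO ua_strings VALUES (?, ?)", (md5_hex(ua), ua))

def lookup(ua):
    # The indexed hash narrows the search to (almost certainly) one row;
    # comparing the plaintext as well makes a hash collision harmless.
    row = conn.execute(
        "SELECT ua_text FROM ua_strings WHERE ua_hash = ? AND ua_text = ?",
        (md5_hex(ua), ua),
    ).fetchone()
    return row[0] if row else None

print(lookup("NokiaN70/2.0 (3.0546.2.3)"))  # the original string
print(lookup("UnknownAgent/0.1"))           # None
```

The design choice here is that correctness never depends on the hash being collision-free; the hash only exists to keep the index small and fixed-width.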


Response by poster: hutta - I fully agree... thankfully/unfortunately (depending on your POV) the agent strings are all unique, and have very tightly fixed specs, because I'm working with a system based on WURFL to identify mobile phone browsers.
posted by twine42 at 1:06 PM on November 7, 2006


The likelihood of not having a collision (assuming md5 distributes its keys evenly) is about 99.999999999999999999999999999985% (twenty-eight nines), i.e. a collision probability of roughly 1.5 x 10^-31 by the birthday calculation.
posted by jewzilla at 10:11 AM on November 12, 2006


This thread is closed to new comments.