$is_unique = (substr(md5($customer_id . $last_name),0,16) != $other_code ? 1 : 0);
June 29, 2006 9:10 PM   Subscribe

MD5()Filter: Are the first 16 characters enough to use without worrying about any repeats, ostensibly?

I need to create a confirmation code. I want to run an MD5 hash on the (unique) customer id concatted with their last name, to keep the seed plenty random.

I think 32 characters is a bit long, and was wondering if I use only the first 16 characters for the confirmation code, if I'll find any repeats any time soon. The math is hurting my head, but I'm thinking that the likeliness is incredibly unlikely. On the other hand, we're talking volumes from 300,000 to 1 million customers a year, so keeping confirmation codes linked to *just* a single customer is important.

How likely is it that I'll run into any troubles with this?
posted by disillusioned to Computers & Internet (22 answers total) 5 users marked this as a favorite
 
I don't know the answer to your question off the bat, but why bother with the math?

Make a script that tests the theory by running maybe 500,000 samples and checks for duplicates... Sure, you'll tie up your computer for a few minutes, but you can skip the math and get a real world sample.
posted by twiggy at 9:15 PM on June 29, 2006


on second thought, you're better off skipping the md5 hash and just choosing a random character (number or upper/lowercase letter) 16 times to generate a key.

The math for this is easy... (26+26+10)^16

This probably isn't much less efficient than an md5 hash, hell it might be more efficient.
posted by twiggy at 9:18 PM on June 29, 2006


16 characters means anything from 0000000000000000 to ffffffffffffffff hex, or 0 to 1.844674407371E+019 decimal. Collisions will be plenty unlikely.
posted by moift at 9:19 PM on June 29, 2006


Well, md5 is hexidecimal, so there are 16 possible values at each of 16 possible places. 16^16=18446744073709551616. This is a considerably higher number than your 1 million customers, so I'd say 16 is safe, unless the birthday paradox comes into account. (someone with better maths than me should be able to work it out).
posted by scodger at 9:20 PM on June 29, 2006


Also, the ternary condition in your page title is unnecessary.
posted by moift at 9:22 PM on June 29, 2006


It's the most obfuscated I cared to make it on one line. :-)
posted by disillusioned at 9:25 PM on June 29, 2006


moift and scodger: you're right about what characters result from an md5 hash, but the concern disillusioned has is that an md5 hash run upon 2 different strings might be less than purely random, as far as the first 16 characters matching. it's a valid concern, since md5 is seeded just once.

doing 16 characters independently would mean 16 different seeds, and bring it much closer to purely random.
posted by twiggy at 9:25 PM on June 29, 2006


Why not just use a monotonic increasing confirmation code? No chance of repeats at all.

If codes need to be "secret" so that other users can't use them, require the user supply the conformation code plus his user name and password.

If the codes need to be hard to guess, use montonic increasing but increment randomly:

mutex m( last_conf_code ); last_conf_code = last_conf_code + random() * 877 ; return last_conf_code() ;
posted by orthogonality at 9:26 PM on June 29, 2006


What exactly do you mean by confirmation code? Is this something that, should you get it wrong, will endanger the safety of your data or application? Or is this more along the lines of "Click this link to confirm you've received this e-mail," in which case, of course, should something go wrong, the person can simply request a new e-mail.
posted by odinsdream at 9:26 PM on June 29, 2006


Assuming you don't need cryptographic strength (MD5 was broken in 2005), MD5 is still a fine hash function. MD5 is a 128-bit hash function. The 128-bit (16 octet) result of MD5 is often encoded with 32 hexadecimal digits, which is what I assume you mean when you say that your MD5 function produces a result that is 32 characters long. If the result is encoded as 32 hexadecimal characters, truncating the result to 16 characters produces a result that contains 64 pseudo-random bits. How much confidence in the uniqueness of your confirmation code does your application require? At about 2^32 IDs (~4 billion) the chance of a duplicate is about 50% (see the birthday paradox.

Are there any consequences to an adversary being able to forge a confirmation code? If so, you need to use a message authentication code (MAC), not a hash function. Fortunately you can build a MAC from a secure cryptographic hash function (but not MD5, it's broken...) using the keyed-hash message authentication code (HMAC). With a pIf you need a secure cryptographic hash function, I recommend SHA-2.

Also, do you really intend to issue one confirmation code per customer or do you need one per transaction? If the latter, you really should be using the transaction ID, not the customer ID, when generating your confirmation code.
posted by RichardP at 9:39 PM on June 29, 2006 [2 favorites]


The confirmation code is for something a user submits. (Say, a rebate.) To check the status of the rebate, without needing *any* other information, we're going to require a confirmation code. They won't be able to edit the properties of their rebate (for that, they'll need their normal account information) but I don't want them just going and decrementing the confirmation code by one and voila, their neighbor's rebate information. Privacy concerns at all.

We already have a lookup by address system, but people are dumb and like confirmation codes.

I'm thinking more and more that we're going to do the md5() method, for reasons I may make clear shortly.

Ortho, your method seems to lock out a significant portion of potential codes relatively quickly, which, although not a *true* issue, is a bit unpleasant.

In the end, we just don't want something like this to occur.
posted by disillusioned at 9:39 PM on June 29, 2006


In the end, we just don't want something like this to occur.

Strictly speaking, the problem there is not potential collisions, but putting sensitive, user-editable information into a GET request.
posted by Mr. Six at 9:48 PM on June 29, 2006


If the only thing you're hashing is the customer's unique ID with their firstname, it doesn't sound particularly secure. That means someone could run a script that could just loop through various md5 hashes of common names and integers (like "smith1", "chen5", etc) in an attempt to extract data. Furthermore, if you accidentally - for any reason - expose customer ids somewhere, you've unnecessarily made it easier for someone to extract the other two variables.

I'd say a much better solution is to

1) create a new column in your database titled 'token',

2) write a function that creates a random string using a cryptographically secure random number generator, and

3) assign a random token to each user.

That way, the token is completely independent from the user data.

Even if you do go with the md5 hash, you should probably concatenate a salt and/or secret word to each hash string.
posted by helios at 9:49 PM on June 29, 2006


Helios, correct, I was just using the customer id/last name as an example. We have enough other data of theirs (including an md5-hashed password field) to generate an appropriate key. No where on the site is their customer id visible in the url.

Of course, if anyone took the time to try the common names plus integers loop idea, *and* realized that was how we generated the confirmation code to begin with, I'd be incredibly impressed.
posted by disillusioned at 9:52 PM on June 29, 2006


If you use numbers plus 22 letters of the alphabet you can store 5 bits in one character (instead of 4 for hex). That shortens the code to 26 characters.

The other option is to store the code in the database instead of making it a hash of the customer id. Then generate them randomly and if you hit a duplicate you generate another one.
posted by cillit bang at 9:53 PM on June 29, 2006


you should probably concatenate a salt and/or secret word to each hash string.
Do do use this method for anything important! The method described (either the secret prefix method represented as H(K || message) or the secret suffix method represented as H(message || K)) is not a secure mechanism for generating a message authentication code - it can be broken even if H is a cryptographically secure hash function. Only use a MAC that has been proven to be secure (e.g. HMAC).
posted by RichardP at 9:58 PM on June 29, 2006


Argh, "Do do use this" should have "Do not use this."
posted by RichardP at 9:59 PM on June 29, 2006


I believe that MD5 is only 'broken' in a few senses. Finding a hash collision may be useful in the case of passwords and the like, but for document confirmation, it won't matter much.

Why? Because for any given document that produces a hash value, any other documents that produce the same value are _extremely_ unlikely to make any sense whatsoever. They'll be random gibberish.

I don't think there's a method of taking a desired plaintext and massaging it in any way, shape, or form to get a hash collision with a different, known document.
posted by Malor at 10:34 PM on June 29, 2006


We're going offtopic here, but Malor: unfortunately it's not that simple. Document types that can contain executable code are susceptible, e.g. anything with macros. Postscripts certainly are as there is a proof of concept. Given two plaintexts you can construct two postscripts with the same md5 hash.

Getting two literally plain plaintexts (like two .txt) to collide is still hard I think, and this does not impact the question, but you cannot rely on md5 as a hash to authenticate the content of many types of document.
posted by edd at 12:22 AM on June 30, 2006


As twiggy said, you shouldn't really use a hash for this, but use a Base32 random number field to the database.

Base32, a derivation of Base64, is a notation for expressing large numbers in a form that can be conveniently and accurately transmitted between humans and computer systems. It uses an alphabet of A–Z, followed by 2–7 (thus "2" actually has a numerical value of 26). 0 and 1 are skipped due to their similarity with the letters O and I.

It needs to be in the database anyway, because you're going to query the database for the specific record, and otherwise you're going to have to scan the entire database and calcluate the hashes until you find the correct record.
posted by Sharcho at 1:15 AM on June 30, 2006


"I think 32 characters is a bit long, and was wondering if I use only the first 16 characters for the confirmation code"

That's only if you stick with the popular hex encoding (base 16). A quick way to shorten the result losslessly is to base64 it instead:

base64_encode(pack("H*", md5($str))

Now it's 22 characters plus padding.

If that's not good enough, it's fairly simple to convert to arbitary bases and use arbitary sets of characters to increase usability in some circumstances; e.g. using 0-9a-Z but skipping O, l, I etc to reduce confusion if a user ever has to read it over the phone to someone, or using punctuation but skipping metacharacters in whatever environment you're developing in.
posted by Freaky at 1:16 AM on June 30, 2006


Couldn't you just add a duplicate check on the generated string? In other words, the script would check to see if it has ever issued a order or customer id that matches the one just generated. If it has, it would regenerate the code.
posted by richardhay at 3:50 AM on June 30, 2006


« Older cel-ray quest!!   |   Why lock half of a double door? Newer »
This thread is closed to new comments.