How to create a new Unique Identifer Number without Clashes
April 22, 2010 12:44 AM   Subscribe

Help Creating A New Unique Identifier

I have two files with multiple sales data for housing parcels that possess unique identifiers UIN (14 digit strings, but again, they aren't unique because some houses have been sold multiple times).

I have dates for each of these sales (Month, Year, and Day), and I wanted to mix the dates and the parcel data to create a new actually unique identifier based on the last date sold; so I can join the two separate datasets.

I've tried mixing the Parcel Numbers and the Sale Date in multiple ways (multiplying, adding, squaring etc) to create a new UIN but I keep creating new conflicting duplicates. I can see this is a mathematical result based simply on the number of data points that I am creating.

How do I do this?
posted by stratastar to Technology (13 answers total) 1 user marked this as a favorite
 
Look into making a SHA1 or other hash of your identifiers.
posted by Blazecock Pileon at 12:50 AM on April 22, 2010


For example, if you're on a Mac OS X or Linux workstation, you can open up a terminal and use OpenSSL's SHA1 hash function.

Here's what you get for an example ID:
$ cat > test.txt
ParcelNumber0001_SaleDate042110
^D
$ cat test.txt | openssl dgst -sha1 
9ee72b1d2a4076d3c64132ef49b79be35b89eb39
Here's the second example ID:
$ cat > test2.txt
ParcelNumber0002_SaleDate042110
^D
$ cat test2.txt | openssl dgst -sha1 
73770cfc5f0f7c0ab8b03cd459d3ec22a0858b1d
That ^D character is a Control-D keypress.
posted by Blazecock Pileon at 12:56 AM on April 22, 2010


As has been pointed out, you want a hash.

I'm not sure what you're using these values for, but FYI - it's generally advisable to make unique identifiers completely independent of the data. You never know if your requirements might change and it becomes inconvenient to have your identifier tied to what it's identifying. I can't think of an example now, but if programmers could be that clairvoyant we wouldn't have any problems.
posted by wonnage at 1:01 AM on April 22, 2010


Response by poster: Ack, my Ubuntu VM just got corrupted. I'll try out the hash library in R.
posted by stratastar at 1:25 AM on April 22, 2010


Strictly speaking a hash isn't guaranteed to be unique but will be almost all of the time. A Universal Unique IDentifier (UUID) on the other hand is. As a bonus you don't need the original data to make the UUID, so it's probably computationally more efficient to generate. So in OS X or Ubuntu from the terminal you can issue the command uuidgen which will give you a uuid:
$ uuidgen
9FD86CE2-B48E-4C07-99C8-31E6DD887122
$ uuidgen
079C0FB7-AD15-42C3-B04A-C4081543E9A5
And so on. If you're unfortunate enough to be stuck in windows land there's some info here that may be useful, but I'm damned if I understand it.
posted by singingfish at 2:07 AM on April 22, 2010 [1 favorite]


Best answer: Do you need the new number to be a 14 digit integer?

If no, just concatenate (old) UIN and date-month-time and be done with it.

If yes, then a hash will have a vanishingly small number of collisions compared to anything you are likely to come up yourself.
posted by zippy at 2:32 AM on April 22, 2010


Response by poster: zippy, the concatenate idea came to me after I posted, and I did it in excel.

wonnage, I / we use the parcel number as the ID because it is matched in a spatial GIS data base. I needed a new ID to join data from two different time points of that parcel ID.

Thanks guys for the help.
posted by stratastar at 2:39 AM on April 22, 2010


Yeah, hashes can collide. Very unlikely.

I was about to suggest concatenation. If a hash works, so will concatenation, it will just be longer.
posted by delmoi at 5:28 AM on April 22, 2010


Best answer: Here's a complicating question: do you need to be able to work backwards to the original parcel number from the new unique ID? If so, secure hashes are exactly the wrong solution, since the whole point of a secure hash is to obscure the original data. Then again, I suppose this might be desirable, depending on your needs.

If it were up to me and I didn't have a need to obscure the underlying data, I'd just concatenate them them as parcelnumber_YYYYMMDD.
posted by adamrice at 7:00 AM on April 22, 2010


Best answer: +1 to adamrice's advice. We really need to know what you're planning to do with this before offering you any further advice.

Hashes are indeed the wrong solution for creating unique identifiers. There can be overlaps, however unlikely they might be. If you legitimately have have two identical rows (which could hypothetically happen if a house changes hands twice in one day), you're essentially up a creek, as I don't think you have enough data to differentiate between legitimate duplicates, and duplicates that are occurring as a result of overlap between the two files. Hashing/concatenation won't help you here.

I'm also a bit confused about the difference between these two data files. Do they contain the same fields?

It sounds like you should really be using a proper database to manage this data (Access is fine; Excel is not). If you're using SQL, you can query two separate tables, and grab a list of unique items using the UNION keyword (this is especially helpful if you have other fields in your data file that can help establish "uniqueness").

Also, once you have your data in a database, you can create a field in your table to be used as an automatically-assigned unique identifier.
posted by schmod at 7:41 AM on April 22, 2010


Response by poster: The two data files do not contain the same fields which is why I need a join based on a unique identifier.

One file has foreclosure data (foreclosed sale price, foreclosed date etc); the other has a slew of housing value assessment indicators, sale date, sale price (where each row contains an independent sale; the same sale date and price should match the value in the foreclosure dataset, but this one gives no indication of whether or not the sale was a foreclosure or not).
posted by stratastar at 8:43 AM on April 22, 2010


Best answer: Simply merging the fields you want to compare will give you a trivial "perfect hash", with no possibility of collision. But if you need to reduce the string length, a hash is probably best. The possibility of collision is non-zero but still pretty low.

Using uuidgen is guaranteed to create a unique identifier, but since it gives you different results each time you use it, I'm not certain how you would join two records with distinct UUIDs.

One alternative is to use a proper SQL database. SQLite or MySQL are free options with graphical tools. You can import text files into these databases, once you have the schema set up. It'd be a bit of setup work, but it could be done. As a bonus, you get weak enforcement of data checking when inputting data. If you try to insert a string into a date field, for example, the database complains.

If you want to do this kind of SQL operation on text files directly, you can also use the GNU sort, awk, and join commands to sort on the fields-of-interest, merge those fields together into an identifier string (awk), and then search for similarities between files (join) based on that identifier string. This is doable with about 10-15 minutes of work.
posted by Blazecock Pileon at 11:05 AM on April 22, 2010


Best answer: Simply merging the fields you want to compare will give you a trivial "perfect hash", with no possibility of collision.

If the dates aren't checked for uniqueness, then you can get collisions here. Cases I've seen in other databases are where data entry people used dates like 0/0/00 or 9/9/99 to indicate "no date available."

You could see collisions if cases like these are possible.
posted by zippy at 2:16 PM on April 22, 2010


« Older Should I get involved with my friend?   |   Liechtenstein Newer »
This thread is closed to new comments.