Is it possible to securely anonymize identifers while having the capacity to reverse the anonymization?
February 3, 2012 1:18 PM   Subscribe

Anonymizing identifiers question. We have datasets with unique client identifiers. We get many requests from researchers for anonymized datasets for analyses. I'd like to find an operation that could securely create anonymous identifiers but that would also allow us to de-anonymize the records if needed.

Before releasing datasets for research, we strip them of identifiers and do various analyses to ensure that no combination of fields could be used to identify clients. However, sometimes we need to go back into the original dataset to clarify something about the record. We generally assign some random autoincremented idenfier to each record and then keep a crosswalk table. But that can be kind of a hassle, especially when a single client stretches over multiple rows. It would be awesome to have a procedure that we could apply in a database query that would create a random unique identifier by performing some set of operations on the ID.

An analogy would be multiplying each identifier by some number. The new ID would be different than the old one and each client's uniqueness would be preserved. However, simple multiplication is too easily reversable, of course. The main thing is that the operations need to be such that we could reverse the operations to get back to the original identifier AND that no one else could do so, as these are sensitive data. Bonus points for having the operations be something that could be executed as part of a SQL query.
posted by jasper411 to Computers & Internet (14 answers total) 2 users marked this as a favorite
So this is one of the things my Largely Mythological Husband does for a living (though for added difficulty points, with unstructured databases).

He recommends the work of the Cornell Data Privacy Group as a starting point if you're looking to roll your own, and the IBM Anonymous Resolution System if you're looking to buy someone else's solution.
posted by Sidhedevil at 1:40 PM on February 3, 2012 [1 favorite]

You are asking for a two-way hash. The private key stays on your side (maybe it lives with your Certificate of Confidentiality). Encrypt the original id with the key --> new (much longer, ugh!) key.

(Editorial: On the whole though, I would really caution against even making this possible. Your process for making these clean data sets should be fully deterministic and shouldn't lead to 'having to check' particular records. If there is *any* field that can uniquely go back to the originals, it's a risk.)
posted by gregglind at 1:40 PM on February 3, 2012

(Readers and posters, note the OP is looking for / asking about *REVERSIBLE* hashes!)
posted by gregglind at 1:41 PM on February 3, 2012 [1 favorite]

the operation doesn't have to be reversible, just repeatable. to find the original row with anon_id = '647385' you can just

SELECT * FROM foo WHERE anonymize(id) = '647385';

the operation should *not* be a plain hash, since that allows anyone with anonymised data to confirm id values from the original data set.

if this data is sensitive, don't roll your own - anonymization is hard, and even the big players get it wrong.
posted by russm at 1:54 PM on February 3, 2012

Response by poster: The issue about eliminating combinations of fields that *could* identify a record is based on my reading of k-Anonymity. You have to check manually to determine that there are always more than a minimum number of people who have a set of identifiers. For example, if there's only one person in the dataset who is a Korean Male aged 43 years, you need to mask some of those identifiers or resolve them up to a higher level (e.g., Asian Male 40-50). I don't know of anyway to do that other than manually.

I was wondering about reversible hashes, but I don't really understand how to execute them. I'll be checking through these links! And feel free to help contribute to my understanding of implementing this as if I was a dummy!
posted by jasper411 at 1:57 PM on February 3, 2012

off the top of my head, i'd:

1. randomly order the cases

2. randomly assign new IDs

3. copy old IDs and new IDs to a separate, secure, encrypted, dataset

4. delete the old keys

5. write a script to automatically cross tabulate variables of interest (maybe all of them in the data?) to see if any cells fall below the threshold. there is probably some way to scrape then report which cells are too small

6. manually recode problem cells (since the computer doesn't know what is a reasonable grouping of values or how to label them)
posted by cupcake1337 at 2:32 PM on February 3, 2012

Reversible hashing is called encryption :)

Using a secret key on your side, encrypt the ID, then publish it.

When someone wants details, decrypt the ID and you're done. Even better, include a version number in the published dataset, then you can use a different secret key for each dataset (where secret key maps to the version number). This reduces the blast radius if your secret is compromised.

You can choose to use a new secret for every dataset published, or per client, or just rotate monthly/quarterly.

The bigger problem is what you mention about making sure the other attributes in the dataset are not identifiable, that's much more difficult...
posted by jpeacock at 2:34 PM on February 3, 2012

Response by poster: Is there a reliable SQL function or a VBA module that could perform encryption as part of a query?
posted by jasper411 at 3:45 PM on February 3, 2012

How big of a dataset are you talking about? If it's not too big, and you're not going to be frequently de-anonymizing data, russm's solution is pretty good.
posted by losvedir at 4:21 PM on February 3, 2012

Versions of SQLite abound which support encrypted databases (example). You could keep a second, encrypted database of identifiers that match a random string ID (from the unencrypted database) with a true ID (from the encrypted database). You keep the key that decrypts the true-ID map somewhere safe.
posted by Blazecock Pileon at 4:53 PM on February 3, 2012

Response by poster: The databases are often thousands of records. Depending on what's being provided they can go up to several hundred thousands.

The reasons we might need to go back is, like someone may decide they want an additional piece of data, or there might be some question about the data which requires us to go back in and examine particular records. In that case I just need to be able to know what the original ID is from the anonymized random ID.
posted by jasper411 at 5:00 PM on February 3, 2012

even with hundreds of thousands of records, I'd lean towards using a one-way function and then just regenerating the mapping if required. to remove the ability of anyone to guess the id -> anon_id mapping I'd use something like an hmac with a different (secret) key for each recipient of anonymised data. you doin't say what database you use but Postgres has hmac in the pgcrypto contrib module.

this way, if you need to supply additional data that can be joined to the originally supplied anon dataset you can just do another export with the same hmac key, and if you need to further examine some specific records in the non-anonymised data the load of doing

SELECT id, hmac(id, 's00persekr1t', 'sha256') AS anon_id FROM some_table WHERE anon_id = '9834157983479';

will still be minimal unless you've got many millions of rows.
posted by russm at 9:46 PM on February 3, 2012

the load of doing

SELECT id, hmac(id, 's00persekr1t', 'sha256') AS anon_id FROM some_table WHERE anon_id = '9834157983479';

will still be minimal unless you've got many millions of rows.

specifically, on my hokey old workstation, selecting a million "id, hmac(id::text, 's00persekr1t', 'sha256') AS anon_id" rows takes 15 seconds of CPU... and using an md5 hmac takes <9 seconds of CPU...
posted by russm at 2:15 PM on February 5, 2012

your problem sounds complex enough that i'd suggest hiring a freelance database guy to come in for a day, or however long it takes, to set up a workflow to do what you want. there are tones of competent database people who would love a gig like this, right?
posted by cupcake1337 at 1:29 PM on February 7, 2012

« Older I been in the right place, but it must've been the...   |   What's the difference between a Camry and a Lexus? Newer »
This thread is closed to new comments.