Join 3,512 readers in helping fund MetaFilter (Hide)


help anonymize data
September 13, 2011 12:37 PM   Subscribe

Is there a way to Anonymize various excel columns to protect privacy during a presentation?

I work with sensitive data, and would like to anonymize the sensitive columns (ID Number, phone numbers, emails, etc) but have them remain consistent so that if two people have the same email, it'll show that when anonymized...

I suppose i could create a hash function for each column, but it'd be nice to have the phone numbers look like phone number and the ID#'s look like ID#'s...

After the data is processed, i'm using Social Network charts to show where people have the same number, address, etc, to show clusters of people. So it's important to have the data processed accurately.

If there was a small amount of data, i'd do it manually, but i am dealing with thousands of rows...

any ideas? also, if i am using the wrong word for this, please help, I have anonymize in my head, and might be thinking of something else.
posted by fozzie33 to Computers & Internet (5 answers total)
 
Hashing is the correct process. You'll need to give up on making it look "the same" as the original data, though.
posted by odinsdream at 12:55 PM on September 13, 2011


You could perform a simple rotation operation or addition on each of the phone number columns (create a new column, make it equal to old column + randbetween(0,20) or something like that to just gibberish them up. Same with the ids. For email, not sure the best strategy there... add numbers to each before the @? How closely will people be looking at them?
posted by disillusioned at 12:56 PM on September 13, 2011


Assuming you're not looking to create something that's crack-proof, just substitute a few common numbers and letters. For example, switch "1" and "0" (by replacing 1s with %%%s, then 0s with 1s, then %%%s with 0s), then 2-3, and 4-5, and 6-7, and 8-9. Then do the same with vowels -- maybe As become Es become Is, and so on.
posted by Doofus Magoo at 1:04 PM on September 13, 2011


How about this:

Acquire a list of random addresses, and alphabetize. Alphabetize your data, and transform each to the 'fake' address of the same rank. Thus, if my actual data is ['1 Acacia Avenue', '1 Acacia Avenue', '1 Curtain Close'], they would be replaced by addresses 1,1 and 3 on the list of fakes. Do the same for phone numbers and SS IDs.

That seems like it'd work to me, though I may be misinterpreting the problem.
posted by piato at 2:09 PM on September 13, 2011


I would make a separate sheet (say, 'Sheet2'), and build some lookup tables. Put a distinct list of your ID numbers, in column A. In B1 put:

=TEXT(RANDBETWEEN(0,99999),"00000")

(I'm assuming a five digit ID number for no reason at all). Back on your presentation page you want (assuming the real ID's are in column A, and you've inserted a blank column as B) B1 to be:

=VLOOKUP(A1,Sheet2!A:B,2,0)

When you're done, paste these things as values, or the random functions will recalc when you change anything. Something similar would work for phone numbers, just changing the TEXT function above which gives the number its formatting.

For addresses, have one lookup table with 1....n in one column, and a column of dictionary words, or text from the web or whatever (let's name that range 'streets'). Now have another lookup like the IDs, matching your real address to something like:

=randbetween(1,50)&" "&vlookup(randbetween(1,10000),streets,2,0))&" Street"

So each address maps to something consistent and address-like, but anonymous. Email addresses could be built similarly.
posted by pompomtom at 2:41 PM on September 13, 2011


« Older Stupid Word Problem: Shift key...   |  Effective steps for the averag... Newer »
This thread is closed to new comments.