I see a tedious task in my future...
September 30, 2010 11:07 AM

Another "what's the best software for this task" question: what's the easiest way to identify and relabel observations with the wrong unique ID?

So I've got a dataset of repeated measures, taken daily for participants over a period of months. Participants are identified by an ID code. The problem is that people sometimes messed up their ID code when entering it. I'm now trying to figure out who they are so I can correctly label their data. The only other unique variable in the dataset is IP address. However, it's not perfect, because (1) people can have multiple IP addresses, and (2) some participants are in the same household, so a given IP address could belong to more than one person. (The ID codes are similar to one another in a particular pattern. Depending on how someone messed theirs up, it may be either easy or difficult to tell just by looking what they should have typed.)

What's the best way to go about this? I have Excel and SAS at my disposal, and I could probably get access to another stats package if it would be particularly helpful.

The only method I've figured out so far still involves some hand calculation, and there are enough of these errors that it will take a lot of time. Can anyone think of an easy way? Thank you!
posted by unannihilated to Computers & Internet (3 answers total)
 
Is the space of IDs dense enough that a messed-up ID might be someone else's valid ID? That is, do you merely need to find the correct IDs for each of a set of known-bad IDs and a set of known-good IDs, or do you also need to identify the garbled IDs in the first place?

If the first, the usual way is to model the kinds of mistakes people are making so you can compute the "distance" to each possible correct ID. Stats packages may have routines for computing edit distances or Levenshtein distances which would be helpful, although you'd ideally want a slightly different edit-distance measure than Levenshtein (in my experience, data-entry errors are most often transpositions or substitutions of nearby keys, rather than arbitrary inserts and deletes).
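If you end up scripting the matching step yourself, here's a minimal sketch of the idea. It's in Python rather than Excel or SAS (though I believe SAS has COMPLEV and COMPGED functions for roughly this), and the IDs in it are made up:

def edit_distance(a, b):
    # Plain Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

valid_ids = ["ABC001", "ABC002", "XYZ014"]   # stand-ins for the real roster
bad_ids = ["ABCOO1", "XZ014"]                # stand-ins for the garbled entries

for bad in bad_ids:
    ranked = sorted(valid_ids, key=lambda v: edit_distance(bad, v))
    print(bad, "-> closest matches:", ranked[:3])

You'd want to sanity-check the top match by eye, and possibly weight substitutions of adjacent keys more cheaply than arbitrary edits, as noted above.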

If I understand you correctly, you have another axis for consistency-checking, which is that each ID should appear exactly once per day. Presumably there are some legitimate missed or doubled observations, but if you flag any (ID, day) pairs that appear 0 or >1 times, and any (ID, IP address) pairs that appear abnormally few times, it seems like you'd find most of the errors.
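A rough sketch of that check, again in Python and assuming you can dump the data to a CSV (the file and column names here are placeholders, so adjust to whatever you actually have):

import csv
from collections import Counter, defaultdict

day_counts = Counter()
ip_counts = defaultdict(Counter)

with open("observations.csv", newline="") as f:
    for row in csv.DictReader(f):
        day_counts[(row["id"], row["date"])] += 1
        ip_counts[row["id"]][row["ip"]] += 1

# (ID, day) pairs entered more than once are suspect. Catching pairs that
# appear zero times would also need the roster of IDs crossed with the
# expected date range, which this sketch skips.
for (pid, day), n in day_counts.items():
    if n > 1:
        print("duplicate entry:", pid, day, n)

# An ID seen only once or twice from a given IP address is also worth a look.
for pid, ips in ip_counts.items():
    for ip, n in ips.items():
        if n <= 2:
            print("rare ID/IP pairing:", pid, ip, n)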
posted by hattifattener at 12:33 PM on September 30, 2010


Response by poster: Well, just from looking over the data generally, most of the mistakes seem to be of the type where people are hitting nearby keys (the letter O instead of the number zero, for instance) or leaving off part of the ID entirely (sometimes, unfortunately, the part that identifies them). So, no, generally the bad IDs do not end up being someone else's real ID.

The ID consists of letters that identify certain groups and characteristics, followed by a number that counts up from 1 within each letter combination to identify people within the group. People may have the same letter combination or the same number assignment, but never both.
posted by unannihilated at 12:53 PM on September 30, 2010


How common is it in your dataset for two people to share the same IP address? If you could generate a table that maps unshared IP addresses to correct IDs, would that take care of a lot of the errors? In the remaining cases, presumably a single address is going to be shared by only 2 or 3 participants, so your problem is reduced to "here's the erroneous ID, I need to figure out which of these 2 or 3 actual IDs it should be". In some of these cases the letter group or the number may be correct, which indicates which ID is the correct one. After that, you might have to resort to a more complicated solution, but you'll hopefully have substantially reduced the number of remaining uncorrected errors.
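Something along these lines might do it for the unshared addresses. This is just a sketch in Python with made-up file names, column names, and IDs, not a polished solution:

import csv
from collections import defaultdict

valid_ids = {"ABC001", "ABC002", "XYZ014"}   # stand-in for the real roster

with open("observations.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Which valid IDs has each IP address ever been seen with?
ids_by_ip = defaultdict(set)
for row in rows:
    if row["id"] in valid_ids:
        ids_by_ip[row["ip"]].add(row["id"])

# An address that only ever shows one valid ID is probably a one-person address.
unshared = {ip: next(iter(ids)) for ip, ids in ids_by_ip.items() if len(ids) == 1}

# Propose a correction for any row whose ID isn't on the roster.
for row in rows:
    if row["id"] not in valid_ids and row["ip"] in unshared:
        print(row["id"], "from", row["ip"], "is probably", unshared[row["ip"]])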

Could you give an indication of how many errors you're starting with? That will of course affect what kind of solutions make sense here. If you have a couple of hundred errors, then a mostly-automated process that needs a little hand-holding and a quick eyeball-check at the end could be feasible. But if there are millions of entries to correct, I guess you need a method that's basically guaranteed to work correctly without intervention or checking!
posted by logopetria at 4:09 PM on October 3, 2010


This thread is closed to new comments.