Correlating names between two spreadsheets
January 18, 2013 8:35 AM   Subscribe

I need to correlate two spreadsheets which relate to the same group of people. The name data is messy, however, and I'm having a hard time matching the names.

Hello! I've got a bit of a puzzle here and I need some expert help.

I'm trying to correlate data from two different spreadsheets. The context of this data is that it was collected as part of a very complex international project. Both reports are showing student results from different online courses.

There are about 3,000 students.

I need to correlate the data in order to make comparisons of performance for each student in the different courses.

However, the data collection was done separately, so names were written down in many different ways. It's a classic problem, I'm sure.

Here are some examples of the data issues:

An example:
Spreadsheet 1 Spreadsheet 2
1. Mary Williams mary williams
2. John Paul Gomez John Gomez
3. Marisa Lpez Marisa López
4. Sara Lpez Sara Lopez
5. Xin Wei Wei Xin

Problem 1: Case. I can make everything lower case. So I think I've figured that out.
Problem 2: Middle names. I can drop out middle names.
Problem 3: Characters not displaying correctly (I copied these into HTML file, opened with browser, and then copy-pasted. If you know a more elegant way to do this, let me know!)
Problem 4: Special characters were often not used at all in the data collection for Spreadsheet 2. So I'll need to find a way to convert all accented characters into plain characters. What's the best way to do that?
Problem 5: Surnames and first names flipped (usually with China). It's not consistent, however.

I've just listed a bunch of data issues, and would love advice on each of those. Even better, does anybody know a one-stop solution for this problem of correlating names from different databases/spreadsheets? I'm new to working with messy data like this so any general tips/resources would also be very helpful.

I don't know Visual Basic but I can use Python very well.

Thanks for your help!
posted by mammary16 to Work & Money (8 answers total) 9 users marked this as a favorite
 
Best answer: A tool built for this task is OpenRefine (formerly Google Refine). I can't say I know it well enough to explicitly tell you how to handle this problem, but I expect it would be a one-stop package to find a solution.
posted by nowoutside at 8:45 AM on January 18, 2013 [1 favorite]


I started to prepare some answers, but I realized I was just checking StackOverflow for everything. You can do that. For example, go there and search for "python regex to remove accents" to answer your Problem 4. Similar searches will probably handle the rest.

http://stackoverflow.com/
posted by paper chromatographologist at 8:58 AM on January 18, 2013


Best answer: What you are looking for is something called fuzzy matching. Microsoft research has a Fuzzy Matching Excel add-in. I have never tried it but that is where I would start if I had the same problem.
posted by crazycanuck at 9:19 AM on January 18, 2013


Best answer: I have had to do similar things and I don't think there'll be a one-stop solution. Instead I think you'll need something that confidence ranks A => B and then you'll have to manually check the matches.

Why I say this is because duplicate names is incredibly common. Take the English language surname 'Smith' - You can almost guarantee there'll be multiple Smiths. I'm sure other languages have common combinations of first name and surname.

I would look for any other data that might allow more chance of matching up someone. Address? Date of Birth? Otherwise I think you are out of luck.
posted by Wysawyg at 10:10 AM on January 18, 2013


I've been using Google Refine for just this type of data manipulation. It can do all of what you want and more. Easy to use, and lots of helpful videos out there.
posted by iamkimiam at 10:20 AM on January 18, 2013


Thirding Openrefine. Fantastic tool, and just up its alley.
posted by rockindata at 11:22 AM on January 18, 2013


Best answer: I've spent the last week working on a similar problem (with inventory, not people) and what I came up with works but it will take some time:

Sort your spreadsheets so your data on both is sorting the Target Column in the same order. Open a new spreadsheet, copy Data 1, add a column then copy Data 2. (I found color coding one set made it easier to remember which set was which)

Hide all the columns you won't be needing but leave the beginning and ending columns exposed (if Data 1 runs Column A to Column R and you're searching Column Column D, leave A, D and R exposed.) Do the same with Data 2. (let's say this data runs T to AX and you're searching in W, leave T, W and AX exposed)**

In the empty column, type "=exact(D1,W1)". You'll get a TRUE if the data matches and FALSE if it does not. If the same person is in the same row in both sets of data but the names are slightly off, you'll be able to spot it quickly and make them match.

I started off pasting the formula down the whole data set but then had to keep ran into: "Bag, 1lb, Peanut" and "1lb Peanut Bag" which are the same thing but sorted differently so one would need to be moved then I'd have to insert on one data set etc. This will be especially help with Wei Xin Xin Wei because you're going to be fixing one of them and moving it up or down your list.

**if you need to add, move or delete rows, having the first & last column exposed will help you avoid not moving the whole set of data as you work.
posted by jaimystery at 2:17 PM on January 19, 2013


For cleaning names, this tool is pretty useful.
posted by brentajones at 7:04 PM on January 20, 2013


« Older Sriracha Oatmeal Cookies? OK...   |   Crap Newer »
This thread is closed to new comments.