Correlating names between two spreadsheets
January 18, 2013 8:35 AM Subscribe
I need to correlate two spreadsheets which relate to the same group of people. The name data is messy, however, and I'm having a hard time matching the names.
Hello! I've got a bit of a puzzle here and I need some expert help.
I'm trying to correlate data from two different spreadsheets. The context of this data is that it was collected as part of a very complex international project. Both reports are showing student results from different online courses.
There are about 3,000 students.
I need to correlate the data in order to make comparisons of performance for each student in the different courses.
However, the data collection was done separately, so names were written down in many different ways. It's a classic problem, I'm sure.
Here are some examples of the data issues:
Spreadsheet 1 Spreadsheet 2
1. Mary Williams mary williams
2. John Paul Gomez John Gomez
3. Marisa Lpez Marisa López
4. Sara Lpez Sara Lopez
5. Xin Wei Wei Xin
Problem 1: Case. I can make everything lower case. So I think I've figured that out.
Problem 2: Middle names. I can drop out middle names.
Problem 3: Characters not displaying correctly (I copied these into HTML file, opened with browser, and then copy-pasted. If you know a more elegant way to do this, let me know!)
Problem 4: Special characters were often not used at all in the data collection for Spreadsheet 2. So I'll need to find a way to convert all accented characters into plain characters. What's the best way to do that?
Problem 5: Surnames and first names flipped (usually with China). It's not consistent, however.
I've just listed a bunch of data issues, and would love advice on each of those. Even better, does anybody know a one-stop solution for this problem of correlating names from different databases/spreadsheets? I'm new to working with messy data like this so any general tips/resources would also be very helpful.
I don't know Visual Basic but I can use Python very well.
Thanks for your help!