How can I compare the text in two excel cells for common terms?
December 6, 2010 5:39 PM   Subscribe

How can I compare the text in two excel cells for common terms?

I am doing genetic genealogical research, and I have long lists of surnames from possible relatives, which I am comparing manually in Excel, in order to find names that people share. Though this approach does work, it is time-consuming and error-prone. I am looking for a way to select two cells filled with surnames in Excel, and compare them to find out which names show up in both cells. For example, one cell includes the text "Smith, Jones, Wilson..(plus another 100 names)" and the other cell includes the text "Baxter, Lewis, Wilson..(plus another 100 names)". I need a tool that I can activate which will spit out "Wilson" as the term that both cells have in common. I've seen code-oriented tools that compare every character difference between two files, which is not what I'm looking for. I've seen a bunch of tools that compare text files, but since I have thousands of entries in an excel file I'd really like to do the comparison in Excel; or at least just paste the text from excel into another tool. This Compare Suite website is the closest thing I've found to what I'm looking for: http://www.comparesuite.com/online/try_for_free.php. Does anyone know if something like this exists in Excel, or in some other form (Mac or online)? Thanks!
posted by pantufla to Computers & Internet (5 answers total) 1 user marked this as a favorite
 
Is there a reason one cell contains many surnames?
If they are comma seperated, you can select the text, paste it into a .txt file, then re-open with excel (do a file|open, filetype .txt), and deliniate on the commas. Then you have the information in seperate cells.
If you have two columns of names (from the two original cells), then you can easily see which names in the second column show up in the first, by creating a third column, which checks to see if stuff in the second column has a match in the first



A B C
1 2 ..formula..
3 5 ..formula..
4 6 ..formula..
5 7 ..formula..

where ..formula.. is
=IF(ISNA(MATCH(B1,$A$1:$A$4,0)),"no match","match").


Then column C will tell you which of the ones in B have a match in A. Spelling must be identical, and I think capitalization too.
posted by defcom1 at 6:13 PM on December 6, 2010 [2 favorites]


You don't need to do defcom's extra step of exporting to a .txt file and then back to excel -- you can just do text to columns within excel (do a help search on how to do this, since you don't say what version of Excel you're using) and delineate on whatever separates the names (commas, spaces, what have you)

You might then have to do a transpose to turn rows into columns, depending on how it's set up?

But other than that, he's spot-on -- get these things out of one cell.
posted by brainmouse at 6:19 PM on December 6, 2010 [1 favorite]


Response by poster: Hi Guys, thanks for your tips! I got defcom1's formula to work, and I was able to run text to columns and transpose as well. This does help me automate comparison of surnames, but I'm not sure how to implement this having each surname in its own cell approach with the file I've been using. The issue is that I'm comparing attributes of various people: segments of DNA that they share with me and each other as well as their surnames. This is information gathered from 23andme testing.

Simplifying a bit, the excel file I've created looks like this:
columns: Person's name, DNA match: chromosome # and location start & end, person's surnames. Each row is devoted to a person. Pardon my table formatting here, it seems to have collapsed a bit.

For example:

Name chromosome Start End Surnames

Joe Smith 1 4,000,000 7,000,000 Hayes, Miller, Smith, Williams
Sally Jones 1 5,000,000 9,000,000 Hall, Jackson, Jones, Martin, Miller, Smith
Paul Douglas 1 15,000,000 22,000,000 Brown, Douglas, Martin
Dave Hayes 5 21,000,000 26,000,000 Allen, Brown, Hayes, Jackson, Miller

As I mentioned, I've got all of each person's surnames in a cell.
Now, compared to the simplified example above, my file has about 1,000 people/rows, and an average of about 30 surnames per person (and over 100 surnames/person in several cases). Having the surnames for each person lumped into a cell has worked for me so far, but does not scale well. :( I set the spreadsheet up that way because I don't know much about Excel. So, I'm not sure how to rework my name-packed cells with a cell for each name. Perhaps I could place each person's names in their own cell, moving to the right of the spreadsheet. Applying that to the example above would create something like this:

[A Name] [B chr.] [C Start] [D End] [E] [F] [G] [H] [I] [J]
Joe Smith 1 4,000,000 7,000,000 Hayes Miller Smith Williams
Sally Jones 1 5,000,000 9,000,000 Hall Jackson Jones Martin Miller Smith


I suppose that once I broke each name block into individual cells, I could set use the ISNA equation to do the person to person comparisons. However, I'd like to do more than compare 2 sets of names.

Ideally, I want to:

1. Compare each person's surnames with all the other folks' entered names, to get an idea of which names appear more than once for everyone in the file (e.g., in the first example, I see that Miller appears three times, Hayes twice, but Williams appears only once. This may help me to distinguish meaningful names from outliers.)
2. Compare each person's surnames with all the other people who have DNA matches for the same chromosome. Similar rationale to point 1. For example, I see that Martin, Miller and Smith each appear twice on chromosome 1, but not at all on chromosome 5.
3. Compare each person's surnames with people who share the same segment of DNA. In the example above, Joe Smith and Sally Jones share DNA from 5,000,000-7,000,000 on chr. 1. I also see that they share the names Smith and Miller. This suggests that Smith and Miller would be likely clues to a common ancestor.

Also, how hard is it to have Excel count how many times a certain name appears in my file? I'd love to see a summary, again, to get an idea of which names are appearing the most.

I hope this makes sense. Any input would be greatly appreciated. Thanks!

BTW I'm using Excel 2008 for Mac.
posted by pantufla at 10:44 PM on December 6, 2010


You should put each name in it's own cell. It will make life much easier. This can be done with the text to columns function brainmouse referenced.

To find out the frequency of each name in the list, I would copy all the names into a one single column (column A) of a new worksheet, then copy that list again into column B (or add a few spacer columns if you like). Highlight the second column and select "Remove Duplicates" under the Data tab. (Note: Don't expand the selection when prompted.) This should generate a list of all unique names in your table. Then in column C, you could put a formula like:

=Countif($A:$A,[First Name in Column B])

Then drag fill this to the bottom of the list. This will show how many time that name shows up in the list. I would then highlight both of these columns and convert it into a table (Under the Insert table) so that I could easily sort them.

If you have a ton of lines and aren't really into condensing them by hand, you can copy the entire block of cells into Word and replace the paragraph symbol with a tab. I would also take advantage of this time to scrub any extraneous spaces out of the data.

Also, you can have Excel highlight duplicate values. Highlight a series of date (it can be in multiple rows and columns, then on the Home Tab, select Conditional Formatting, highlight cells rules duplicate values.)

I would need to think some more on how to do the rest of your frequency analysis. If you need more help with how to the other steps I mentioned, please free free to PM me.
posted by gagoumot at 10:23 AM on December 7, 2010


Response by poster: Thanks for the suggestions, gagoumot. I will try them out.
posted by pantufla at 10:19 AM on December 10, 2010


« Older I'm worried about talking too much/too little with...   |   Winterproofing Blunnies Newer »
This thread is closed to new comments.