How can I find common words in two lists?
December 13, 2010 10:00 PM Subscribe
How can I find common words in two (or more) lists?
I have several lists of surnames, and I want to compare them to find common names. For example, list one is "Adams, Jones, Smith" and list two is "Allen, Hughes, Smith". I want a tool that can identify "Smith" as the common term. (Of course, the lists I'm dealing with are much longer than these :) ) I've been looking for Mac or online tools that would help me do so, but most of what I've found is geared towards coding and looking for character-by-character differences. Are there any tools or techniques you could recommend for me to compare these texts? The closest thing to what I'm looking for is Compare Suite, which is Windows only (http://comparesuite.com/online.htm). I'd prefer to simply copy and paste name lists into a comparison tool, instead of keeping them in text files, if possible. This seems like such a basic task and I'm surprised that I haven't come across this type of tool yet. Thanks!
Since they're comma-separated, you could also import them into Google Spreadsheets. At the end of each row, paste the formula
=countif(A1:C2, "smith")
Example:
Adams | Jones | Smith | 2
Allen | Hughes | Smith | 1
posted by christopherious at 10:31 PM on December 13, 2010
Oops, sorry, I misread a key part of your question there, please disregard my advice. I still think an Excel/GSS formula might still be worth looking into, but the above would not yield the answer.
posted by christopherious at 10:41 PM on December 13, 2010
Supposing you have two files, 1.txt and 2.txt, containing one surname per line:
cat 1.txt 2.txt | sort | uniq -d
should do it.
posted by Monday, stony Monday at 10:51 PM on December 13, 2010
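[As a quick check, the pipeline above can be exercised with the sample names from the question. This is a sketch with hypothetical files; it assumes neither list repeats a name internally, since `uniq -d` would also report a name duplicated within a single file.]

```shell
# Hypothetical sample files with one surname per line (from the question).
printf 'Adams\nJones\nSmith\n' > 1.txt
printf 'Allen\nHughes\nSmith\n' > 2.txt

# Concatenate both lists, sort so identical names become adjacent,
# then print only the lines that appear more than once.
cat 1.txt 2.txt | sort | uniq -d
# Prints: Smith
```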
This is ugly, but if you know that no name will be repeated in either list, I'd do something like this in Terminal.app:
perl -ne 'chomp; foreach $name (split(/, /)) {$seen{$name}++}; END {foreach $name (sort keys %seen) {print "$name\n" if $seen{$name} > 1}}' names1.txt names2.txt
This splits on ", ", counts the number of times each name is seen, and then prints any that appear more than once.
comm can also do the job if you can reformat the files to have one name per line instead of having them comma-separated. In that case you probably want "comm -12 names1.txt names2.txt" to only print the matching items.
posted by russm at 10:56 PM on December 13, 2010
Response by poster: Thanks for the feedback, guys! Wayland: I tried your suggestion, but when I run (comm -1 -2 file1.txt file2.txt), nothing results. (comm -1 file1.txt file2.txt) shows me file 1's contents, (comm -2 file1.txt file2.txt) shows me file 2's contents, and (comm -3 file1.txt file2.txt) shows me both file 1's and file 2's contents. I can't seem to get the common terms for files 1 and 2 to show up! Christopherious, thanks for the info. Monday: I tried your suggestion, and nothing happens when I run it. Pardon my lack of terminal knowledge. When I just run cat 1.txt 2.txt, I get the contents of both files. I must be doing something wrong.
posted by pantufla at 11:05 PM on December 13, 2010
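[The empty result above is most likely because comm requires both input files to be sorted; with unsorted input it silently finds nothing in common. A minimal sketch of the fix, assuming hypothetical files file1.txt and file2.txt with one name per line:]

```shell
# Hypothetical, deliberately unsorted sample files.
printf 'Adams\nJones\nSmith\n' > file1.txt
printf 'Smith\nAllen\nHughes\n' > file2.txt

# comm requires sorted input, so sort each file first,
# then print only column 3 (lines common to both files).
sort file1.txt > file1.sorted
sort file2.txt > file2.sorted
comm -12 file1.sorted file2.sorted
# Prints: Smith
```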
Response by poster: Russm - didn't see your posting. Thanks, that code did work!
posted by pantufla at 11:14 PM on December 13, 2010
Do you mean each line is name(s) separated by commas or do you mean each list has one name per line?
The second would be most common. comm and a lot of other text tools require the files to be sorted in order to work. Your examples work if they are indeed one name per line:
$ comm -1 -2 l1.txt l2.txt
Smith
But comm only works on 2 files. A more generic solution can be had many ways, but the general idea is the same: combine the files, sort them, count the number of duplicate lines, and print those.
$ sort l1.txt l2.txt | uniq -c | sort -n
1 Adams
1 Allen
1 Hughes
1 Jones
2 Smith
This sorts the 2 lists together (could be any number of lists) and then counts the number of times each line is seen (then sorts the results of that numerically to make it easier to notice answers).
Since 'Smith' is in both files it has a count of 2. You can take this a bit further like so:
$ sort l1.txt l2.txt | uniq -c | awk '$1==2{print $2}'
Smith
Which just prints the second ($2, the name) column where the first column ($1, the count) is 2. You could do this with N files and match a count of N (every file), or even use $1>2 to list names that occur in more than 2 files.
If your files really have each line like "Name1, Name2, Name3, ..., NameN", then I would split them up first, because it's way easier to work with one-name-per-line files.
$ perl -lne 'print join("\n",split(/,\s+/))' list_N_per_line.txt > list_1_per_line.txt
will split up the lines in the file to one name per line. Oh, and don't feel bad about comm, it's a real PITA to use sometimes.
# oblig Perl golf
$ perl -lane '$c{$_}++;BEGIN{$c=@ARGV}END{do{print if$c{$_}==$c}for%c}' l1.txt l2.txt
Smith
posted by zengargoyle at 11:18 PM on December 13, 2010
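[Since the original lists are comma-separated and the poster would rather paste text than manage files, here is a hedged sketch that converts pasted comma-separated lists to one name per line with tr before the sort/uniq step. The sample names come from the question; like the other pipelines, it assumes no list repeats a name internally.]

```shell
# Feed the two comma-separated lists on stdin (stand-ins for pasted text),
# break them into one name per line, strip leading spaces,
# then report names appearing in more than one list.
{ echo 'Adams, Jones, Smith'; echo 'Allen, Hughes, Smith'; } \
  | tr ',' '\n' | sed 's/^ *//' | sort | uniq -d
# Prints: Smith
```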
no worries... as others have said there are nicer ways of doing it if the data is newline separated rather than comma separated - that's generally The Unix Way(tm)...
posted by russm at 11:47 PM on December 13, 2010
This thread is closed to new comments.
posted by wayland at 10:25 PM on December 13, 2010