How can I find common words in two lists?
December 13, 2010 10:00 PM

How can I find common words in two (or more) lists?

I have several lists of surnames, and I want to compare them to find common names. For example, list one is "Adams, Jones, Smith" and list two is "Allen, Hughes, Smith". I want a tool that can identify "Smith" as the common term. (Of course, the lists I'm dealing with are much longer than these :) )

I've been looking for Mac or online tools that would help me do so, but most of what I've found is geared towards coding and looking for character-by-character differences. Are there any tools or techniques you could recommend for comparing these texts? The closest thing to what I'm looking for is Compare Suite, which is Windows only. I'd prefer to simply copy and paste name lists into a comparison tool, instead of keeping them in text files, if possible. This seems like such a basic task, and I'm surprised that I haven't come across this type of tool yet. Thanks!
posted by pantufla to Technology (9 answers total)
"comm file1 file2" in the terminal should do it. You can drag your two text files into the terminal window to get their paths.
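(One caveat: comm only works on sorted input. A minimal sketch, assuming hypothetical files list1.txt and list2.txt with one surname per line:)

```shell
# Hypothetical input files; comm requires both to be sorted.
printf 'Adams\nJones\nSmith\n' | sort > list1.txt
printf 'Allen\nHughes\nSmith\n' | sort > list2.txt

# -1 suppresses lines unique to the first file, -2 lines unique to
# the second, leaving only the lines common to both.
comm -12 list1.txt list2.txt
# → Smith
```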
posted by wayland at 10:25 PM on December 13, 2010

Since they're comma-separated, you could also import them into Google Spreadsheets. At the end of each row, paste the formula =countif(A1:C2, "smith")

Adams | Jones | Smith | 2
Allen | Hughes | Smith | 1

posted by christopherious at 10:31 PM on December 13, 2010

Oops, sorry, I misread a key part of your question there, please disregard my advice. I still think an Excel/GSS formula might be worth looking into, but the above would not yield the answer.
posted by christopherious at 10:41 PM on December 13, 2010

Supposing you have two files, 1.txt and 2.txt, containing one surname per line:

cat 1.txt 2.txt | sort | uniq -d

should do it.
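(With the question's sample names dropped into 1.txt and 2.txt, one per line, as assumed above, that pipeline looks like this:)

```shell
printf 'Adams\nJones\nSmith\n' > 1.txt
printf 'Allen\nHughes\nSmith\n' > 2.txt

# Concatenate both lists, sort so duplicates become adjacent,
# then print only the lines that occur more than once.
cat 1.txt 2.txt | sort | uniq -d
# → Smith
```

Note that this also flags a name that happens to appear twice within a single list.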
posted by Monday, stony Monday at 10:51 PM on December 13, 2010

This is ugly, but if you know that no name will be repeated in either list, I'd do something like this in Perl:

perl -ne 'chomp; foreach $name (split(/, /)) {$seen{$name}++}; END {foreach $name (sort keys %seen) {print "$name\n" if $seen{$name} > 1}}' names1.txt names2.txt

This splits on ", ", counts the number of times each name is seen, and then prints any that appear more than once.

comm can also do the job if you can reformat the files to have one name per line instead of having them comma-separated (comm also needs both files sorted). In that case you probably want "comm -12 names1.txt names2.txt" to print only the matching items.
posted by russm at 10:56 PM on December 13, 2010

Response by poster: Thanks for the feedback, guys! Wayland: I tried your suggestion, but when I run (comm -1 -2 file1.txt file2.txt), nothing results. (comm -1 file1.txt file2.txt) shows me file 1's contents, (comm -2 file1.txt file2.txt) shows me file 2's contents, (comm -3 file1.txt file2.txt) shows me both file 1 and file 2's contents. I can't seem to get the common terms for files 1 and 2 to show up! Christopherious, thanks for the info. Monday: I tried your suggestion, and nothing happens when I run it. Pardon my lack of terminal knowledge. When I just run cat 1.txt 2.txt, I get the contents of both files. I must be doing something wrong.
posted by pantufla at 11:05 PM on December 13, 2010

Response by poster: Russm - didn't see your posting. Thanks, that code did work!
posted by pantufla at 11:14 PM on December 13, 2010

Do you mean each line is a list of names separated by commas, or does each list have one name per line?

The second would be most common. comm and a lot of other text tools require the files to be sorted in order to work. The examples below assume the files are indeed one name per line.

$ comm -1 -2 l1.txt l2.txt

But comm only works on 2 files. A more generic solution can be had in many ways, but the general idea is the same: combine the files, sort them, count the number of duplicate lines, and print those.

$ sort l1.txt l2.txt | uniq -c | sort -n
1 Adams
1 Allen
1 Hughes
1 Jones
2 Smith

This sorts the 2 lists together (could be any number of lists) and then counts the number of times each line is seen (then sorts the results of that numerically to make it easier to notice answers).

Since 'Smith' is in both files it has a count of 2. You can take this a bit further like so:

$ sort l1.txt l2.txt | uniq -c | awk '$1==2{print $2}'

Which just prints the second column ($2, the name) where the first column ($1, the count) is 2. You could do this with N files and match a count of N (every file), or even $1>2 to list names that occur in more than 2 files.
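(For instance, a three-file version of the same pipeline, with hypothetical file names:)

```shell
# Three hypothetical lists, one name per line.
printf 'Adams\nSmith\n' > l1.txt
printf 'Allen\nSmith\n' > l2.txt
printf 'Jones\nSmith\n'  > l3.txt

# With 3 input files, a count of 3 means the name is in every file
# (assuming no name repeats within a single file).
sort l1.txt l2.txt l3.txt | uniq -c | awk '$1==3{print $2}'
# → Smith
```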

If your files really have each line like "Name1, Name2, Name3, ..., NameN" then I would split them up first, because it's way easier to work with one-name-per-line files.

$ perl -lne 'print join("\n",split(/,\s+/))' list_N_per_line.txt > list_1_per_line.txt

Will split up the lines in the file to one name per line. Oh, and don't feel bad about comm, it's a real PITA to use sometimes.

# oblig Perl golf
$ perl -lane '$c{$_}++;BEGIN{$c=@ARGV}END{do{print if$c{$_}==$c}for%c}' l1.txt l2.txt

posted by zengargoyle at 11:18 PM on December 13, 2010

no worries... as others have said there are nicer ways of doing it if the data is newline separated rather than comma separated - that's generally The Unix Way(tm)...
posted by russm at 11:47 PM on December 13, 2010
