How do I compare these lists?
September 28, 2009 6:30 PM   Subscribe

I have two lists of numbers. I would like assistance creating some kind of program or method to output a list of all numbers that are present in one of these lists but not the other.

This is not a homework question. No, really. I promise. So I have two very long lists of numbers. Currently they are sitting on my computer as Excel spreadsheets, but I can export them to whatever format is necessary. Both lists are made up of integers, ascending, though never actually consecutive numbers, if it matters.

List A has approximately 4000 unique entries, and is the list I'm starting from.

List B is a list of about 8000 entries I made up by squishing together a bunch of smaller lists I had made from List A. There are a lot of duplicate entries in List B, which I haven't removed (partially because I'm not really sure how to). So there are many numbers that are present multiple times in List B.

I am absolutely positive that there are numbers in List A that are nowhere in List B, and I want to know what they are. It would be extra awesome if they were printed out in such a way that I or it could easily add some HTML links surrounding them, blahblah.com/printnumberhere.html -- the numbers are actually parts of URLs I want to access.

I was thinking perl would be a great way to solve this task, and I googled and got several solutions involving grep and/or comparing arrays. My perl skills, alas, have atrophied to the point where I can no longer tell how to make these given solutions mesh with the really basic parts of coding that I have forgotten. For example, I wouldn't know how to get this data in in a usable form. Comma-separated, perhaps? And then I somehow stuff it in an array? I don't even remember how to read from a file. My perl skills fail me. And with the amount of data I have, I'd hate to do it wrong and, say, end up with an incomplete list of the unique numbers.

Or if there's an easier solution that totally hasn't occurred to me (some kind of Excel macro? I have Office '04), that would be excellent too. I'm running OS X 10.3.9, which will probably limit my nifty programming language options somewhat. Thanks for any help you can provide!
posted by sineala to Computers & Internet (16 answers total) 1 user marked this as a favorite
 
If I was gonna do this in python I would probably paste into a txt file import using readlines as lists and then just run a loop:

for each a in Lista :
if a is in Listb:
listc.append(a)
PS in no way is this properly formatted.
posted by Rubbstone at 6:41 PM on September 28, 2009


If you have command-line cat sort grep and uniq:
With one number per line in listA and listB:

sort listBsorted
cat listA listBsorted | sort | uniq -u >listU
cat listU listA | sort | uniq -d >inAnotB

posted by hexatron at 6:46 PM on September 28, 2009


You can totally do this with command-line text utilities. I'm in a bit of a hurry now, so I'll come back in a few hours, but in the meantime, looking up the man pages for sort, uniq, and join will get you started, at least. join, in particular, is the key tool for you.

(And yes, exporting to CSV is the way to turn these Excel numbers into a thing Unix text utilities can use. Either that or copying and pasting the thing into a text editor.)
posted by tellumo at 6:47 PM on September 28, 2009


Best answer: Or legibly:

If you have command-line cat sort grep and uniq:
With one number per line in listA and listB:

sort <listB | uniq >listBsorted
cat listA listBsorted | sort | uniq -u >listU
cat listU listA | sort | uniq -d >inAnotB
posted by hexatron at 6:47 PM on September 28, 2009


No need for programming. Excel has a built in function to dedupe lists:http://office.microsoft.com/en-us/excel/HP100739431033.aspx
posted by dfriedman at 7:18 PM on September 28, 2009


From the command-line, the comm utility is designed to do exactly this. Still need to get it into sorted, one-item-per-line format though.
posted by hattifattener at 7:19 PM on September 28, 2009


If this is just two one-column lists of numbers, it's actually very simple to do this just using Excel. I'd copy and paste List A next to List B, and then in a third column, type the formula "=COUNTIF(B:B,A2)" and copy it all the way down the column. (This assumes that List A is in Column A and List B is in Column B with column titles in Row 1.)

The output of this formula is the number of times the number in Column A appears in Column B. If 0, then you need to add it to your List B.
posted by emumimic at 7:20 PM on September 28, 2009


hexatron has it. On OS X you have all these tools. The trick is to get your lists out of Excel and into a text file. If your list is a column of all numbers, then saving as a csv file should do that.

However, you can also totally do this in Excel. My antiquated copy had this in the help by just searching for "unique". Here are the instructions:
  1. Select the column or click a cell in the range or list you want to filter.
  2. On the Data menu, point to Filter, and then click Advanced Filter.
  3. Do one of the following.
  4. To filter the range or list in place, similar to using AutoFilter, click Filter the list, in-place.
  5. To copy the results of the filter to another location, click Copy to another location. Then, in the Copy To box, enter a cell reference.
  6. To select a cell, click Collapse Dialog to temporarily hide the dialog box. Select the cell on the worksheet, and then press Expand Dialog .
  7. Select the Unique records only check box.
So here's what you do: select all from list A, copy, paste into new sheet c. Select all from list b, copy, paste to the end of sheet c. Select all in sheet c, follow the instructions above. Done.
posted by plinth at 7:20 PM on September 28, 2009


in Python, save the first list as a "a.csv" and the second as "b.csv". Create a file dif.py in the same directory that contains the following (the "_" in the last line should be spaces):

a = set(open("a.csv"))
b = set(open("b.csv"))

c = a-b

f = open("c.csv","w")
for number in c:
____f.writeline(number+"\n")
posted by gus at 7:30 PM on September 28, 2009


Response by poster: Thanks, guys! The command-line solution worked for me! And thanks for the other solutions that I didn't actually try!

(Now I'm probably going to try figure out how to linkify my list... sigh. Or I'll probably just copy and paste into my URL template a lot. That works too.)
posted by sineala at 7:37 PM on September 28, 2009


List[AB] are one entry per line:

Print all items that are in ListA but not ListB:

fgrep -vf ListB ListA
posted by thewalledcity at 8:10 PM on September 28, 2009


Best answer: perl -ne 'chomp; print "http://blahblah.com/$_.html\n"' inAnotB >out.html
posted by Rhomboid at 8:12 PM on September 28, 2009


(or rather <a href=" et cetera, you can format the html how you wish.)
posted by Rhomboid at 8:13 PM on September 28, 2009


hexatron's bash script can be shortened to one line:

cat listA listB listB | sort | uniq -u

considering the numbers in listB twice means that the only way a number can be unique is if it's only in listA.

This adds extra numbers to sort, but unless you're talking million-item lists, the effect is negligible.
posted by chrisamiller at 9:35 PM on September 28, 2009 [2 favorites]


In R, where listA and listB are vectors.

sort(listA[listA%in%listB==FALSE])

... returns a sorted list of entries that are in listA, but not in listB.
posted by jonesor at 3:12 AM on September 29, 2009


On returning after a bit too much time . . . yeah, I defer to hexatron who knows much more about this sort of thing than I do. :-)
posted by tellumo at 11:35 PM on September 29, 2009


« Older Plays Well With Others?   |   Low threshold for noise Newer »
This thread is closed to new comments.