Identify certain repeats in table
May 8, 2015 5:10 PM   Subscribe

Please help me eliminate duplicates in a table based on specific criteria.

I've got a table consisting of a bunch of e-mail addresses and a bunch of scores, which are integers 1-3. There may be duplicate e-mail addresses with different scores, and there may be duplicates with the same score.

If there are duplicates, I want to keep only the lowest-scoring one and blank out the rest. And blank out duplicates with the same score as well.

I'm guessing there's an Excel formula (or maybe GREP pattern) that can accomplish this, but I haven't figured it out.

There are about 600 rows in the table, so I'd prefer not to do it by hand.
posted by adamrice to Computers & Internet (4 answers total) 1 user marked this as a favorite
This is incredibly easy in Excel. First, sort your data by email address and then score (such that if a person has 3 scores, the one you want is on top). Then select your whole table, and find "remove duplicates" in the version of Excel you're using (it's not a formula you put in a cell, it's a feature in the program). When it opens a dialog box, only check the email address box, not the score box. Click OK. Done.
posted by brainmouse at 5:46 PM on May 8, 2015

Not exactly a formula, but an Excel pivot table is the way to do this. Here's Microsoft's very high level overview (tutorial for older versions of Excel here) of how to create a pivot table.

What you want to do is set the email address as the row label (as described in section 6 of Create a PivotTable from worksheet data on that second link. It's in bold as Add To Row Label) and set the score with Add To Values.

That will give you the sum of the scores for that email address. Where the magic comes in is that you then right click on one of those summed up scores and select Summarize Values By and select Min. Which will give you the minimum score for each email address.
posted by ambrosen at 5:49 PM on May 8, 2015 [1 favorite]

If you can dump it into SQL, it's

SELECT email, min(score) FROM foo GROUP BY email.

You could also do this same thing in R with dplyr or Python with pandas. There's probably a one line command line solution, but I'd be done with R before figuring it out.
posted by hoyland at 6:19 PM on May 8, 2015

It looks like example 7 here is the awk solution (once you reverse the inequality, of course).
posted by hoyland at 7:31 PM on May 8, 2015 [1 favorite]

« Older Help a hair novice cover her greys   |   What to do with baby bird with swollen belly that... Newer »
This thread is closed to new comments.