How to cross-ref two lists in Excel and remove duplicates?
November 14, 2008 8:10 AM Subscribe
Excel-filter: I've got two sheets of data (Firstname, e-mail, lastname, customer #). I'd like to look for and remove any duplicate entries. More inside.
I'm updating our customer e-mail newsletter mailing list. No one has done this in about a year. In that year our list of 20,000 names has dwindled to about 15,000. I have that list of 15,000 with bounces and unsubscribes removed. There are a few operations I want to perform to bring the list up to speed:
-People can opt-in to our newsletter when they buy something. This flags their customer entry in the shipping/inventory software but does NOT add them to the mailing list. I pulled a list of everyone who wants to get the e-mail from the last 6 months. I just want to cross-ref this list with the existing list AND with unsubscribes to make sure they didn't get in another way or have opted out since.
There are a couple of operations that happen in there. All of them come back to the same basic principal- To look at a sheet, or two sheets, and cross ref them looking for duplicates. If there's a duplicate, delete one of them.
This seems like it should be easy but I'm having a hell of a time. A Google search brings up lots of expensive add-ons but no real advice. Excel-ers of the world: HALP!
I'm updating our customer e-mail newsletter mailing list. No one has done this in about a year. In that year our list of 20,000 names has dwindled to about 15,000. I have that list of 15,000 with bounces and unsubscribes removed. There are a few operations I want to perform to bring the list up to speed:
-People can opt-in to our newsletter when they buy something. This flags their customer entry in the shipping/inventory software but does NOT add them to the mailing list. I pulled a list of everyone who wants to get the e-mail from the last 6 months. I just want to cross-ref this list with the existing list AND with unsubscribes to make sure they didn't get in another way or have opted out since.
There are a couple of operations that happen in there. All of them come back to the same basic principal- To look at a sheet, or two sheets, and cross ref them looking for duplicates. If there's a duplicate, delete one of them.
This seems like it should be easy but I'm having a hell of a time. A Google search brings up lots of expensive add-ons but no real advice. Excel-ers of the world: HALP!
The easiest way to do it in my opinion is to create a new sheet and combine both of the sheets on to it. You will want to make sure that your unique key is in column A (i.e. if you have a customer ID number, etc... If you use the name as the unique key you run the risk of deleting customers who have the same name and/or having duplicate entries due to the lack of integrity of the data entry.) Once you have done that, select Data > Filter > Advanced Filter and then click the check box next to 'Unique records only'. Once you have done that select OK and the remaining information will be unique records only. Select all and then copy and paste the information back in to your original spreadsheet and you should be good to go! Hope this helps!
If you need more information on using the Advanced Filter option I suggest you read up on it here.
posted by lrkuperman at 8:41 AM on November 14, 2008
If you need more information on using the Advanced Filter option I suggest you read up on it here.
posted by lrkuperman at 8:41 AM on November 14, 2008
The way you find duplicates is to sort all your names. Then duplicates are right next to each other. So you scan down the sorted list, and whenever next item is the same as current item, you delete next (or emit only current), and increment current.
This is simple to do with a database (select distinct whatever) or with the linux sort command:
sort -u < inputfile > outputfile
If you insist on using excel to do this, the instructions are here: http://support.microsoft.com/kb/262277
If you expect to be frequently working with customer lists, taking the time to get them into a proper database will payoff despite the greater initial learning curve.
posted by orthogonality at 8:41 AM on November 14, 2008
This is simple to do with a database (select distinct whatever) or with the linux sort command:
sort -u < inputfile > outputfile
If you insist on using excel to do this, the instructions are here: http://support.microsoft.com/kb/262277
If you expect to be frequently working with customer lists, taking the time to get them into a proper database will payoff despite the greater initial learning curve.
posted by orthogonality at 8:41 AM on November 14, 2008
The easiest way would probably be using Access.
posted by mealy-mouthed at 8:43 AM on November 14, 2008
posted by mealy-mouthed at 8:43 AM on November 14, 2008
If I understand the issue, you can solve this problem with a series of if statements and sorts.
If you want to keep the original files separate, then use a column to assign them all a unique number so that the email list members are 1-n, the existing list members are (n+1)-y, and the unsubscribes are (y+1) -z.
Now make a new file that combines the email and existing lists. Sort by whichever variable is most likely to be unique... usually e-mail address. Now create another column and write an if statement that puts a one in the cell if the variable is identical to the one above it. If you are confident that there are no data entry foibles likely to return false positives (e.g., McHenry vs Mc Henry) you could just sort out and then delete the records with the ones, but it's usually a good idea to do this a few more times with other variables (or clusters of variables, e.g., city, state and zip). If you accumulate many different columns of "ones" you can add them up, sort on the sum, and eliminate only those with "perfect" scores, looking at the others individually.
Once this file is clean, run it against the unsubscribe list in the same way. If you have dates, this time use an "and" statement to assign the "one" only to whichever record is older (so that the more current the opt-in or opt-out is reflected). As your last trick, sort on the numbers you assigned in the first step and eliminate all records with numbers above (y+1). Hope this helps.
posted by carmicha at 8:52 AM on November 14, 2008
If you want to keep the original files separate, then use a column to assign them all a unique number so that the email list members are 1-n, the existing list members are (n+1)-y, and the unsubscribes are (y+1) -z.
Now make a new file that combines the email and existing lists. Sort by whichever variable is most likely to be unique... usually e-mail address. Now create another column and write an if statement that puts a one in the cell if the variable is identical to the one above it. If you are confident that there are no data entry foibles likely to return false positives (e.g., McHenry vs Mc Henry) you could just sort out and then delete the records with the ones, but it's usually a good idea to do this a few more times with other variables (or clusters of variables, e.g., city, state and zip). If you accumulate many different columns of "ones" you can add them up, sort on the sum, and eliminate only those with "perfect" scores, looking at the others individually.
Once this file is clean, run it against the unsubscribe list in the same way. If you have dates, this time use an "and" statement to assign the "one" only to whichever record is older (so that the more current the opt-in or opt-out is reflected). As your last trick, sort on the numbers you assigned in the first step and eliminate all records with numbers above (y+1). Hope this helps.
posted by carmicha at 8:52 AM on November 14, 2008
Oh, if you do more than one sort/match routine, you'll have to copy the "if statement" results so they stay consistent through future sorts.
posted by carmicha at 8:54 AM on November 14, 2008
posted by carmicha at 8:54 AM on November 14, 2008
Response by poster: Combining these sheets is totally an option. I just needed the process to be somewhat automatic because we're talking about 20,000 or so entries.
posted by GilloD at 9:14 AM on November 14, 2008
posted by GilloD at 9:14 AM on November 14, 2008
There's a filter for unique records under data/filter/advanced filter. I usually use that, then the copy visible cells option to paste the filtered records somewhere useful.
posted by A Terrible Llama at 9:54 AM on November 14, 2008
posted by A Terrible Llama at 9:54 AM on November 14, 2008
Sorry, what three other people already said. I mean.
posted by A Terrible Llama at 9:55 AM on November 14, 2008
posted by A Terrible Llama at 9:55 AM on November 14, 2008
WOW! I am suprised that no one has mentioned VLOOKUP. In each file is the customer ID a unique number? So that Joe Smith has the same customer ID in each file?
If the answer is yes you can do the following. But first let me make sure I understand:
1. File A is an excel sheet with the names of everyone who ordered in the last 6 months that wants emails.
2. File B is the existing list of emails
3. File C is the list of people that asked to be unsubscribed.
So we are going to do all our work in File B. Lets start by making sure that the unsubscribe names are not still in File B
in a new column in file B we are going to use the vlookup formula. It will look something like this:
=vlookup(cell where the customerID lives in file B,range of cells in file C where the other Customer IDs are,1,false)"
Use the excel formula builder. It will help you get the correct pieces together.
You need to do this twice. Once between File B and file C. Find the matches in file B and delete them. These are the people that opted out of your emails but were not removed before.
Then compare file A to File B but vlookup in file A to file B. These will be the people in file A that are already in file B. Delete them from file A.
Then dump the entire remaining content of file A into file B and you will have one file with no dupes ready for mailing.
Note: The best way to delete records in bulk is to use the autofilter in excel. Read up on the autofilter and the vlookup formula and you should be golden.
Now copy this formula
posted by remthewanderer at 11:07 AM on November 14, 2008
If the answer is yes you can do the following. But first let me make sure I understand:
1. File A is an excel sheet with the names of everyone who ordered in the last 6 months that wants emails.
2. File B is the existing list of emails
3. File C is the list of people that asked to be unsubscribed.
So we are going to do all our work in File B. Lets start by making sure that the unsubscribe names are not still in File B
in a new column in file B we are going to use the vlookup formula. It will look something like this:
=vlookup(cell where the customerID lives in file B,range of cells in file C where the other Customer IDs are,1,false)"
Use the excel formula builder. It will help you get the correct pieces together.
You need to do this twice. Once between File B and file C. Find the matches in file B and delete them. These are the people that opted out of your emails but were not removed before.
Then compare file A to File B but vlookup in file A to file B. These will be the people in file A that are already in file B. Delete them from file A.
Then dump the entire remaining content of file A into file B and you will have one file with no dupes ready for mailing.
Note: The best way to delete records in bulk is to use the autofilter in excel. Read up on the autofilter and the vlookup formula and you should be golden.
Now copy this formula
posted by remthewanderer at 11:07 AM on November 14, 2008
Response by poster: Woah- I'll check that out. And, yep, a customer ID is the same across all entries. Some clever person pre-me was nice enough to make sure those got added to the e-mail DB.
posted by GilloD at 11:41 AM on November 14, 2008
posted by GilloD at 11:41 AM on November 14, 2008
I would throw them all into one worksheet with identical columns. Make sure you have a header row
Choose the column you want to run duplicates against. In the next free column, run the COUNTIF function against the column with duplicates - it's either =COUNTIF(A:A,A1) or =COUNTIF(A1,A:A) - (substitute A for your duplicates column obviously), and copy it the whole way down. Hit F9, select the new column & Paste Special > Values.
Then select the whole range, and sort firstly by the new column DESC, then by the column the duplicates are in - you'll end up with all the duplicates grouped together in descending order based on the number of duplicates each record has.
In another new column, you want to number the duplicate records - if there are three duplicates, label them A, B, C - if there are two, label them A, B - for all the records with only one entry, label them A (You can copy & paste the values down for all other duplicates with the same number).
Select the range & sort by the new column. Delete all other rows than those with the value A
That will do the trick.
Now, one other thing. If you are, for example, removing duplicate email addresses, have a think about what you'll do if the same email address is used by a number of different customer records and you need to send an email to each customer record. Easy solution to that is to concatenate the customer record & email together - I separate them with a "~" character. If you just want to remove duplicate customer numbers, don't worry about this paragraph.
Yes, IAA-Comms Database Geek
posted by MatJ at 3:29 PM on November 14, 2008
Choose the column you want to run duplicates against. In the next free column, run the COUNTIF function against the column with duplicates - it's either =COUNTIF(A:A,A1) or =COUNTIF(A1,A:A) - (substitute A for your duplicates column obviously), and copy it the whole way down. Hit F9, select the new column & Paste Special > Values.
Then select the whole range, and sort firstly by the new column DESC, then by the column the duplicates are in - you'll end up with all the duplicates grouped together in descending order based on the number of duplicates each record has.
In another new column, you want to number the duplicate records - if there are three duplicates, label them A, B, C - if there are two, label them A, B - for all the records with only one entry, label them A (You can copy & paste the values down for all other duplicates with the same number).
Select the range & sort by the new column. Delete all other rows than those with the value A
That will do the trick.
Now, one other thing. If you are, for example, removing duplicate email addresses, have a think about what you'll do if the same email address is used by a number of different customer records and you need to send an email to each customer record. Easy solution to that is to concatenate the customer record & email together - I separate them with a "~" character. If you just want to remove duplicate customer numbers, don't worry about this paragraph.
Yes, IAA-Comms Database Geek
posted by MatJ at 3:29 PM on November 14, 2008
ASAP Utilities is a free Excel add-in that does stuff like this fairly effortlessly once you have your data set up on the spreadsheet correctly (i.e., data combined under identical columns and in consistent format, such as text or dates). There are several options on what to do with the duplicates such as delete them, turn their cells or font a different color, etc.). It is a very handy add-in, especially given that it's free.
posted by fuse theorem at 5:57 PM on November 14, 2008
posted by fuse theorem at 5:57 PM on November 14, 2008
This thread is closed to new comments.
Then you could hit the Advanced Filter dialog and use the "Unique Records Only" checkbox to winnow out the dupes.
posted by bcwinters at 8:35 AM on November 14, 2008