Help sorting different kinds of duplicates in Excel?
March 1, 2023 12:44 PM Subscribe
I have an Excel file that is a subset of a larger one -- it is a merged file of two mailing lists; this file contains just records that exist in duplicate. Trouble is there are 3 kinds of duplicates, and I need to isolate and remove just one subset of the duplicates. This is about 2900 records, so I'm looking for suggestions on how to automate this or do it in a bulk way.
A few weeks ago I posted looking for help earlier on in this process, and thanks to MeFi suggestions I was able to get myself to this point. I have an existing up-to-date Mailchimp list (MC) of about 13K names, and a separate set of names from an old Outlook contact list (OL), about 5K names. With no unique identifier to go on, I joined the first name and last name fields to get a "Full Name" field, and used conditional formatting to get rid of all the exact duplicates on a merged OL+MC list. (Records in the merged list are marked as to which list they came from.)
So now I'm down to dealing with a subset of names on that merged list -- the same name listed multiple times but with different email addresses. I'm not concerned if the name is listed twice on the MC list (they're currently receiving mailings at both addresses and either they're not actually the same person in real life, or they can already unsubscribe if they want to). I'm also not worried about names listed twice on the OL list. I want those names/addresses to get the reconfirmation email that this list will ultimately get, as part of the process to merge *everyone* into the MC list eventually.
The problematic sub-subset is the list of names that appears on the MC list with one email address, and on the OL list with a different address. I need to identify and delete the OL version of those pairs before I import the OL list into mailchimp, because I do not want the (unreliable, low confidence, very possibly outdated) OL email address to overwrite the (up-to-date and presumably working) MC email address.
This subset still accounts for about 3K records, however, so I'm very much hoping to find a way to let the spreadsheet do the work here. How can I separate out that last set of duplicates so I can delete just those names?
A few weeks ago I posted looking for help earlier on in this process, and thanks to MeFi suggestions I was able to get myself to this point. I have an existing up-to-date Mailchimp list (MC) of about 13K names, and a separate set of names from an old Outlook contact list (OL), about 5K names. With no unique identifier to go on, I joined the first name and last name fields to get a "Full Name" field, and used conditional formatting to get rid of all the exact duplicates on a merged OL+MC list. (Records in the merged list are marked as to which list they came from.)
So now I'm down to dealing with a subset of names on that merged list -- the same name listed multiple times but with different email addresses. I'm not concerned if the name is listed twice on the MC list (they're currently receiving mailings at both addresses and either they're not actually the same person in real life, or they can already unsubscribe if they want to). I'm also not worried about names listed twice on the OL list. I want those names/addresses to get the reconfirmation email that this list will ultimately get, as part of the process to merge *everyone* into the MC list eventually.
The problematic sub-subset is the list of names that appears on the MC list with one email address, and on the OL list with a different address. I need to identify and delete the OL version of those pairs before I import the OL list into mailchimp, because I do not want the (unreliable, low confidence, very possibly outdated) OL email address to overwrite the (up-to-date and presumably working) MC email address.
This subset still accounts for about 3K records, however, so I'm very much hoping to find a way to let the spreadsheet do the work here. How can I separate out that last set of duplicates so I can delete just those names?
Best answer: There are probably a few ways to do this, but I think a quick vlookup would work just fine.
First separate your MC list and OL list into two different tabs (this is not strictly required, but it will make it a little easier). In the OL tab, add a new column: "Full name present in MC list?". In that column, write a vlookup formula that searches the MC list for matching names. The variables you choose will be as follows:
- Lookup value = the OL Full Name cell in that row
- Range of cells where the vlookup function is searching for a match = the entire column listing names in the MC tab
- Column number that the vlookup will return = 1 (aka, the name column)
- Exact vs. approximate match = FALSE
That will yield a formula that looks something like this. Note that I used $ symbols to make sure the MC range is an absolute reference. Drag this formula into the rest of the cells in this column.
=VLOOKUP(A2,'MC List'!$A$2:$A$2500,1,FALSE)
For all the cells where there is no match for the name in the MC list, the result will be "#N/A". For the cells where there is a match in the MC list, the result will be the name. You can now sort this list A-Z and delete all the rows where a match was found. Reassemble the data from the two tabs for a complete, cleaned list.
Let me know if this doesn't make sense -- happy to explain further.
posted by ourobouros at 1:34 PM on March 1, 2023 [3 favorites]
First separate your MC list and OL list into two different tabs (this is not strictly required, but it will make it a little easier). In the OL tab, add a new column: "Full name present in MC list?". In that column, write a vlookup formula that searches the MC list for matching names. The variables you choose will be as follows:
- Lookup value = the OL Full Name cell in that row
- Range of cells where the vlookup function is searching for a match = the entire column listing names in the MC tab
- Column number that the vlookup will return = 1 (aka, the name column)
- Exact vs. approximate match = FALSE
That will yield a formula that looks something like this. Note that I used $ symbols to make sure the MC range is an absolute reference. Drag this formula into the rest of the cells in this column.
=VLOOKUP(A2,'MC List'!$A$2:$A$2500,1,FALSE)
For all the cells where there is no match for the name in the MC list, the result will be "#N/A". For the cells where there is a match in the MC list, the result will be the name. You can now sort this list A-Z and delete all the rows where a match was found. Reassemble the data from the two tabs for a complete, cleaned list.
Let me know if this doesn't make sense -- happy to explain further.
posted by ourobouros at 1:34 PM on March 1, 2023 [3 favorites]
I would organize your lists to be side-by-side instead of merged, with the columns as such:
Column A: OL Full Name
Column B: OL Email
Column C: (blank for now, see below)
Column D: MC Full Name
Column E: MC Email
In Column C will be the results of an XLOOKUP to see if any OL addresses have a match in the MC list. So for row 1, the formula will look something like this: =XLOOKUP(A1, D:D, E:E, "No match"). (This is all off-the-cuff and from memory, so forgive any errors).
On edit, ourobouro beat me to it. Good luck with either solution.
posted by mezzanayne at 1:40 PM on March 1, 2023 [2 favorites]
Column A: OL Full Name
Column B: OL Email
Column C: (blank for now, see below)
Column D: MC Full Name
Column E: MC Email
In Column C will be the results of an XLOOKUP to see if any OL addresses have a match in the MC list. So for row 1, the formula will look something like this: =XLOOKUP(A1, D:D, E:E, "No match"). (This is all off-the-cuff and from memory, so forgive any errors).
On edit, ourobouro beat me to it. Good luck with either solution.
posted by mezzanayne at 1:40 PM on March 1, 2023 [2 favorites]
Response by poster: ourobouro (and mezzanayne) -- I've never actually used VLOOKUP (or XLOOKUP) before, but this does sound like it will do exactly what I need. I will give it a go tomorrow and circle back if I get confused along the way. :)
posted by leticia at 2:05 PM on March 1, 2023 [1 favorite]
posted by leticia at 2:05 PM on March 1, 2023 [1 favorite]
A tangential suggestion...it's probably a good idea to back up your data and export the current version of your contacts. You may also want to add may want to put a "current" tag on all contacts currently in the system.
That being said, it looks like mailchimp only considers duplicate entries at the email address field. Therefore it seems to me that you can upload (the deduped) OC list, and if you don't check the "update existing contact" box, you won't overwrite the current contact. Therefore mailchimp's logic is
1) does this email address exist?
_a) if no, then add to contacts.
_b) if yes, then either
__i. update existing contact (if boxed checked)
__ii. don't update existing contact (if box left unchecked)
posted by oceano at 2:13 PM on March 1, 2023
That being said, it looks like mailchimp only considers duplicate entries at the email address field. Therefore it seems to me that you can upload (the deduped) OC list, and if you don't check the "update existing contact" box, you won't overwrite the current contact. Therefore mailchimp's logic is
1) does this email address exist?
_a) if no, then add to contacts.
_b) if yes, then either
__i. update existing contact (if boxed checked)
__ii. don't update existing contact (if box left unchecked)
posted by oceano at 2:13 PM on March 1, 2023
Email all of them and remove the bounces. I don't know if MC dings you for bounces, which they very well may, but that's the quick and dirty.
posted by rhizome at 3:59 PM on March 1, 2023 [1 favorite]
posted by rhizome at 3:59 PM on March 1, 2023 [1 favorite]
I’m confused. You say you don’t mind if the MC list has the same name but two different email addresses because they’re either receiving at two addresses or they’re not the same person.
But then you are concerned about having an entry in MC and an entry in OL who have the same name but different email addresses, because you don’t want OL to overwrite MC.
Why are you not treating them as if they could be two distinct people and just add those distinct OL e-mail addresses to MC?
If you’re concerned that the OL list having exactly the same name will overwrite the entry in MC just change the OL name by adding a middle initial O, or a period at the end or change the space in the middle to something else for example.
posted by cali59 at 5:54 PM on March 1, 2023 [2 favorites]
But then you are concerned about having an entry in MC and an entry in OL who have the same name but different email addresses, because you don’t want OL to overwrite MC.
Why are you not treating them as if they could be two distinct people and just add those distinct OL e-mail addresses to MC?
If you’re concerned that the OL list having exactly the same name will overwrite the entry in MC just change the OL name by adding a middle initial O, or a period at the end or change the space in the middle to something else for example.
posted by cali59 at 5:54 PM on March 1, 2023 [2 favorites]
I have heard but not confirmed myself that ChatGPT is good at Excel formulas, so you could try that as well (though the humans here also seem very competent and happy to help you!)
posted by lookoutbelow at 6:35 PM on March 1, 2023
posted by lookoutbelow at 6:35 PM on March 1, 2023
Best answer: ouroboros's plan will delete every OL row that has a matching name in the MC list.
If you want to only flag for deletion the OL rows where there's a matching name in the MC list *and* a nonmatching email address, you'd want to change that a little bit. All this is easier using the vlookup wizard in the formulas tab/tool.
(1) You will need to make sure that the email address column in the MC list is to the right of the fullname column.
(2) Pick column in OL list to put the results and start wizard with insert function and then selecting vlookup
(3) Lookup_value -- select fullname in first data row in OL list
(4) Table_array -- in MC list, select from first data row in fullname column to last data row in email address column
(5) Col_index_num -- count columns in MC list from fullname to email address starting at 1 (ie, the fullname column itself is 1, not zero)
(6) Range_lookup -- false
This should output the following into your selected column:
-- if there's no matching name, N/A
-- if there is a matching name, the email address
So now you can just compare the MC address to the OL address and flag the rows where they don't match
posted by GCU Sweet and Full of Grace at 4:35 AM on March 2, 2023 [1 favorite]
If you want to only flag for deletion the OL rows where there's a matching name in the MC list *and* a nonmatching email address, you'd want to change that a little bit. All this is easier using the vlookup wizard in the formulas tab/tool.
(1) You will need to make sure that the email address column in the MC list is to the right of the fullname column.
(2) Pick column in OL list to put the results and start wizard with insert function and then selecting vlookup
(3) Lookup_value -- select fullname in first data row in OL list
(4) Table_array -- in MC list, select from first data row in fullname column to last data row in email address column
(5) Col_index_num -- count columns in MC list from fullname to email address starting at 1 (ie, the fullname column itself is 1, not zero)
(6) Range_lookup -- false
This should output the following into your selected column:
-- if there's no matching name, N/A
-- if there is a matching name, the email address
So now you can just compare the MC address to the OL address and flag the rows where they don't match
posted by GCU Sweet and Full of Grace at 4:35 AM on March 2, 2023 [1 favorite]
I'm surprised at myself for recommending this, as I dislike the current AI hype train, but this might be a case to see what you get from ChatGPT. Relevant video: 10X Your Excel Skills with ChatGPT
I only rarely use Excel for complicated queries like yours, so every time it takes me a lot of googling or asking for help, so next time I'll probably give this a try too.
posted by snarfois at 6:28 AM on March 2, 2023 [1 favorite]
I only rarely use Excel for complicated queries like yours, so every time it takes me a lot of googling or asking for help, so next time I'll probably give this a try too.
posted by snarfois at 6:28 AM on March 2, 2023 [1 favorite]
You can wrap a vlookup in an iferror so you don't have to look at those n/a (or even have it look in a second vlookup if you're trying to compare against multiple tables.
For example =iferror(vlookup(arg1,arg2,arg3,False),"") would return a blank if it didn't find it.
posted by Apoch at 11:02 AM on March 2, 2023 [1 favorite]
For example =iferror(vlookup(arg1,arg2,arg3,False),"") would return a blank if it didn't find it.
posted by Apoch at 11:02 AM on March 2, 2023 [1 favorite]
Response by poster: Thank you for the additional suggestions.
GCU Sweet and Full of Grace -- It took me a minute to grasp it, but I see what you're getting at. As it happened, I had already cleaned the list of any entries where both the FullName and Email matched, so by default in this case if it identified a matching name, the email would be a mismatch.
At first, when I tried your solution, I got #N/A results all down the column. So I tried ourobouros' solution and it worked, then I went back and retried yours and realized I had misinterpreted the table array instructions, and included the entire rows that had the relevant data, not just the columns. With that fix it worked.
Also absolutely key from ouroborous' solution (which I learned the hard way) was to include the "$"s in the formula -- otherwise when you fill the formula down the column, the cell references shift with each repetition of the formula.
But the upshot is now I've got a properly sorted list that is showing the repeated name in one column, and the divergent email address in the next column.
Huge help here, folks!! (Am also intrigued by the idea of putting ChatGPT to work on Excel formulas -- thanks for the suggestions, lookoutbelow and snarfois! Will definitely try that next time I'm stumped -- but still may reality-check it here. :) )
posted by leticia at 12:40 PM on March 2, 2023 [1 favorite]
GCU Sweet and Full of Grace -- It took me a minute to grasp it, but I see what you're getting at. As it happened, I had already cleaned the list of any entries where both the FullName and Email matched, so by default in this case if it identified a matching name, the email would be a mismatch.
At first, when I tried your solution, I got #N/A results all down the column. So I tried ourobouros' solution and it worked, then I went back and retried yours and realized I had misinterpreted the table array instructions, and included the entire rows that had the relevant data, not just the columns. With that fix it worked.
Also absolutely key from ouroborous' solution (which I learned the hard way) was to include the "$"s in the formula -- otherwise when you fill the formula down the column, the cell references shift with each repetition of the formula.
But the upshot is now I've got a properly sorted list that is showing the repeated name in one column, and the divergent email address in the next column.
Huge help here, folks!! (Am also intrigued by the idea of putting ChatGPT to work on Excel formulas -- thanks for the suggestions, lookoutbelow and snarfois! Will definitely try that next time I'm stumped -- but still may reality-check it here. :) )
posted by leticia at 12:40 PM on March 2, 2023 [1 favorite]
« Older rental edition textbook - to ebay or not to ebay? | Best All-Inclusive Resort in Cuba for an Adult and... Newer »
This thread is closed to new comments.
Sort by Name and add some conditional formatting to highlight matches between consecutive identical names.
You might find splitting the names back into two columns and sorting by lastname then firstname is easier.
posted by protorp at 1:22 PM on March 1, 2023