Join 3,512 readers in helping fund MetaFilter (Hide)


Merging Duplicate rows in Excel or Access
June 19, 2014 3:12 PM   Subscribe

I have a table with duplicate values in one column. Some of the other columns have a value, some don't. I want to have one row for each record, with all of the colums values in one row (NOT CONCATENATED). I can use Excel or Access for this.

So, let's say I took several surveys of the same people. In survey 1, John said he liked pizza. In survey 2, he said he liked apples, and so forth.

My table is structured as such:

Name | Survey1 | Survey2

John | Pizza | (null)
John | (null) | Apples
Pat | Bananas | (null)
Pat | (null) | Cake

------------------------------

I want to end up with

Name | Survey1 | Survey2

John | Pizza | Apples
Pat | Bananas | Cake

-----------------------------

I've searched for this and I find a bunch of ways to concatenate or sum the fields but that is not what I want, they must remain separate columns. The output must be in Excel but I'm familiar enough with Access that if that's a better tool I will use it.
posted by desjardins to Computers & Internet (18 answers total) 2 users marked this as a favorite
 
So not elegant, but you can just abuse the sort and vlookup function to get this done.

Turn on the filters for convenient sorting. Sort by the first column, the last column, and then each column through the second column.

It should create results like:

Name|Surv1|Surv2|Surv3
John|Pizza|Null|Null
Pat| Bananas | Null | Null
John | Null | Apples| Null
Pat | Null| Cake | Null
John | Null | Null | Peanuts
Pat | Null | Null | Cheese

In C2, I would type vlookup=(a2, a$4:d$5, 3, FALSE). In D2, I would type vlookup=(a2, a$6:d$7, 4, FALSE). Drag c2 and d2 down to all names in the first survey.

Copy all, paste values, delete rows 4-7.
posted by politikitty at 3:38 PM on June 19 [1 favorite]


Is it possible that this might happen? So that there might be more than one result in B or C?

John | (null) | cake
John | orange | pie
John | apple | cheese
posted by Lyn Never at 3:42 PM on June 19


If and only if there are exactly two rows for each Name like this and one of the two surveys is always NULL, then my quick and dirty solution would be to
  1. Make a copy of the data file. Just in case.
  2. Import it into an Access table.
  3. Run the following SQL statement [SQLFiddle Example] against the table and save the result as a new Excel file
    SELECT * FROM (SELECT Name, Survey1 FROM Table1 WHERE Survey1 IS NOT NULL ) S1 LEFT JOIN (SELECT Name, Survey2 FROM Table1 WHERE Survey2 IS NOT NULL ) S2 ON S2.Name = S1.Name ;
  4. Apologies for the formatting. The <code> tag strips leading spaces.

posted by ob1quixote at 3:50 PM on June 19


Thanks you guys, I'm sorry I didn't add more detail in the first place.

- I can't really use a manual solution because there are about 20,000 records.
- Lyn Never: Yes, there may be values in both B and C (or neither)
- ob1quixote: sadly, there may be 100 Johns and 1 Bob and 4 Kathys.
posted by desjardins at 3:55 PM on June 19


If there may be 100 Johns, how would you want the solution to handle conflicts like:

John | Cheese | Pizza| Null
John | Null | Caramel | Toffee
posted by politikitty at 3:58 PM on June 19


desjardins: “ob1quixote: sadly, there may be 100 Johns and 1 Bob and 4 Kathys”
Is that because there are that many surveys? How many columns are we talking about?

Looking at this example [SQLFiddle], what would your desired result set look like?
posted by ob1quixote at 4:12 PM on June 19


I guess I should not have used "survey" as an example because that implies some value in every row. So... sometimes John (the same person) didn't answer any surveys. There might be a bunch of "John" values in the Name column but nothing in Surveys 1 through n. Also, there would never be any conflicts within the same column (i.e. John could not answer Apples AND Pretzels in Survey 1).
posted by desjardins at 4:26 PM on June 19 [1 favorite]


Six "survey" columns, plus the name column.
posted by desjardins at 4:26 PM on June 19 [1 favorite]


If there are multiple Johns, how do you decide which of the John survey1 answers goes with which survey2 answer? I'm going to assume that what you want is as few nulls as you want, so I'd go with the following solution, using one helper column to the left of your table of Names, and 2 columns to the right for your new values, plus one helper to say if there's new values.

Name Count, Name, Survey1, Survey2, Survey1New, Survey2New, Hasdata

I've converted this into a table (i.e. pressing Ctrl+L), so I'm using column names rather than cell references, but the formulae in the columns are:
Name Count:
=COUNTIF(Table1[[#Headers],[Name]]:[@Name],[@Name])

Survey1New (note this is an array formula: press Ctrl+Shift+Enter to enter it:
=IFERROR(INDEX([Survey1],
        SMALL(IF(([Name]=[@Name])*
              ([Survey1]<>""),
              ROW([Name])-ROW(Table1[[#Headers],[Survey1New]])),[@Count]))
        ,"")

and the same for Survey2New, but with the 1s replaced with 2s
Finally, HasData is the straightforward:=IF(AND([@Survey1New]="",[@Survey2New]=""),"","Hasdata")

Or if you don't want to type all that in and want to experiment, I've uploaded it here.
posted by ambrosen at 4:27 PM on June 19 [1 favorite]


If there aren't any conflicts within columns, you could use Group By and Max in Access.
posted by yarrow at 4:28 PM on June 19


I think the solution I've posted should give the right answers, reading your followups.
posted by ambrosen at 4:28 PM on June 19


desjardins: “Six "survey" columns, plus the name column.”
Okay, how about something like this [Yet Another Fiddle]?
SELECT t.Name,
       s1.Survey1,
       s2.Survey2,
       s3.Survey3,
       s4.Survey4,
       s5.Survey5,
       s6.Survey6
  FROM (SELECT DISTINCT Name
          FROM Table1
       ) T
  LEFT JOIN (SELECT Name, Survey1 
               FROM Table1 
              WHERE Survey1 IS NOT NULL 
            ) S1 ON S1.Name = T.name
  LEFT JOIN (SELECT Name, Survey2 
               FROM Table1 
              WHERE Survey2 IS NOT NULL ) S2 ON S2.Name = T.Name
  LEFT JOIN (SELECT Name, Survey3 
               FROM Table1 
              WHERE Survey3 IS NOT NULL ) S3 ON S3.Name = T.Name
  LEFT JOIN (SELECT Name, Survey4 
               FROM Table1 
              WHERE Survey4 IS NOT NULL ) S4 ON S4.Name = T.Name
  LEFT JOIN (SELECT Name, Survey5 
               FROM Table1 
              WHERE Survey5 IS NOT NULL ) S5 ON S5.Name = T.Name
  LEFT JOIN (SELECT Name, Survey6 
               FROM Table1 
              WHERE Survey6 IS NOT NULL ) S6 ON S6.Name = T.Name
;
Again, not really elegant, but quick and dirty gets the job done sometimes. Especially for one-offs.
posted by ob1quixote at 4:47 PM on June 19


Use a Pivot Table in a Report Layout-Tabular form.

Then put the Name in the Row Label, then each Survey Header under that, they'll display side-by-side in the table. You'll want to right click and deselect the Subtotal. This way each survey will have just the response info in that cell.

You can do some REALLY slick stuff if you download PowerPivot. Check out You Tube for tutorials. This shit changed my life!
posted by Ruthless Bunny at 4:50 PM on June 19 [2 favorites]


I think ambrosen has it, but I'm doing some more tests. Is there a way to reduce the processing power of these formulas on 20,000 rows? That first "count" column takes forever, and it seems to keep recalculating.
posted by desjardins at 4:57 PM on June 19


There is a "find duplicates query" wizard (sic) in Access, which you might try
posted by thelonius at 5:00 PM on June 19


I can't think of any way to speed it up, desjardins, except for the count, which would be quicker if you sorted the table by name, and then you could use (assuming the count is column A and the name column B)
=IF(B2=B1,A1+1, 1)
to do a cumulative count. Obviously it all goes pear-shaped if it's not ordered by name. I think that might be the slow column because the range it's calculating on changes every time, but I doubt the survey rows are exactly super responsive. There's just Aziz lot of comparisons to do.
posted by ambrosen at 5:30 PM on June 19


Given the constraint of 1 survey answer per question per name, ob1quixote's solution seems nicer.
posted by ambrosen at 5:34 PM on June 19


I just did something very similiar.
In Access, I created a separate query filtered on each survey result (survey 1, survey2)
then I created a query that grouped the master list by the unique identifier. In your case that is the name.

So we have a unique list of names, and separate lists for each survey.

I then created a query with the grouped query linked via open join to the name on each survey query.

Column 1 - Name from grouped list. Column 2 - survey1 answer. Column 3 survey 2 answer, etc.

it worked like charm for me.
posted by exparrot at 8:00 AM on June 20


« Older There must be a simple way thr...   |  Or is it easy enough to learn ... Newer »

You are not logged in, either login or create an account to post comments