Why don't they teach us how to prep the numbers as well as how to crunch them?
March 26, 2012 12:08 PM   Subscribe

ExcelFilter: How to combine the data from three spreadsheets into one spreadsheet? (But wait, there's more!)

- Data is in three different spreadsheets, reflecting three different instances of data collection
- Each new spreadsheet should append its column contents (maintaining its original column headings) to a “master” spreadsheet
- There is a column that we can use as a unique ID that could function as a primary key (I think?)
- Not all unique IDs are guaranteed to show up in all three spreadsheets (subjects may have missed one or two data collection days).

This is for a relatively small set (N=~150) but I’ll need to do the same thing with a much, much larger dataset in the coming month (N=15k+), so a brute force copy-paste won’t be feasible. I’m sure there’s got to be a smarter way of doing this. I have no database experience, but I can get my hands on Access and teach myself if you can point me to what kinds of topics I should be looking at.

I’ve tried asking around locally without success (still waiting to hear back from the research help folks) and my Google-Fu is failing. Thank you!
posted by smirkette to Computers & Internet (10 answers total) 4 users marked this as a favorite
Can you explain why you don't think copy & paste won't work?

I routinely copy tens of thousands of rows of data in Excel with no problem.

Depending on your willingness to climb a steep learning curve, Access can help you, but it may take you more time to accomplish than simply copying and pasting in Excel.
posted by dfriedman at 12:20 PM on March 26, 2012 [1 favorite]

Is the unique ID the same for different rounds of data collection? If so, and you don't have a field for round number, you'll have to add one. If you move to a database, the ID+Round could be the primary key.

It should be pretty straight-forward to import the spreadsheets into one table in Access.
posted by amarynth at 12:25 PM on March 26, 2012

If you can create a column of the unique IDs in your master spreadsheet (either through copy-pasting those, or otherwise generating the complete set) , and if the unique ID is the first column in your data spreadsheets (or they can be rearranged that way), then you could use vlookup formulas in the master spreadsheet to pull in the data from the data spreadsheets.

Something like:

=vlookup(unique id cell, data range from file n, col m, FALSE)

The false ensures you'll only pull in data if there's an exact match on your unique id, and return an error in that cell otherwise. If you don't want to see the error value in invalid cells you could wrap the vlookup function in an conditional and an iserror() function to display a custom error message or nothing.
posted by Reverend John at 12:29 PM on March 26, 2012

VLOOKUP is a great formula to use and I the linked page is a good intro to it. It lets you mimic some things you'd have to do in Access without all the overhead.
posted by soelo at 12:29 PM on March 26, 2012

dfriedman, my hope is that there's a way for Excel to look at the existing IDs, append the columns from the new spreadsheet for a pre-existing ID OR create a new row (with imported column values) for IDs not already in the spreadsheet. (Sorry if I'm not explaining this well!)

Reverend John & soelo, I'll start investigating VLOOKUP--looks very promising. Thank you! Keep the great suggestions coming!
posted by smirkette at 12:32 PM on March 26, 2012

Oh, sorry, ignore me above -- I had a reading comprehension fail and thought you were appending rows, not columns.

But, yes, VLOOKUP is a wonderful function, and when you use it, other people will think you're a wizard.
posted by amarynth at 12:35 PM on March 26, 2012

You will need to make sure you have all of the possible IDs in one sheet first. You can use =COUNTIF to compare one column of a sheet to another. =COUNTIF(A:A,B3) will tell you how many times the value in B3 is in column A, so if it is zero, you'll want to add B3 to column A. Do all of that before you start with the Vlookups.
posted by soelo at 12:36 PM on March 26, 2012

Another note to throw in about VLOOKUP....that lookup is only good for the first return instance of a value. If you UID column has two records that have the same ID, only the first record is noted.

If in fact that is the case (duplicate IDs in the UID column) you would do best to combine all the sheets in a raw fashion, create a new temporary UID each record in the new combined set, then run a process to ferret out the duplicates. At that point you could decide what UID scheme you want to work with and assign new IDs to those records in question.

Once you have the UID cleaned, the temp UID column can be deleted.
posted by lampshade at 12:43 PM on March 26, 2012

Vlookup should do this just fine. If possible, bring all the sheets into one workbook. That way, when you open it up, it won't be looking through closed files for data (it "re-calculates" when you re-open the file).

If your data in the source sheets won't be updating, copy the cells it has pulled in and paste special over it them with "values". Turns it into a static sheet rather than having it look up the same data over and over.
posted by dripdripdrop at 12:45 PM on March 26, 2012

As a note on VLOOKUP, it is an unindexed, so if you are brute forcing two sets of 500K rows, make sure your data is sorted, duplicates are removed (or summed), and then do a [copy -> paste special -> paste as values] once the data is in place, otherwise every time the sheet recalculates it will recalculate the entire vlookup.

If you do this a lot, I would honestly reccommend learning to do this in Access or SQL (or SAS, etc...): There is a reason an Excel sheet is called a workbook, and Access works with tables in a database. Workbooks are not Databases, and workboooks are not designed to efficiently process this stuff.

If your dataset is small, and you need to aggregate data and you are working in Excel, also look at the SUMIF, SUMIFS, COUNTIF, COUNTIFS functions.
posted by Nanukthedog at 1:40 PM on March 26, 2012

« Older Best laid plan gets wrinkled   |   Ping Pong League Newer »
This thread is closed to new comments.