Extract Data from Excel Files, Biologists/Data Scientists Please Help
May 2, 2019 1:13 PM   Subscribe

I'm trying to analyze data from a project which was started long ago. The data from each survey date is in separate excel files. The good news is that each file has the same format. The bad news is that in each file there are multiple tables stacked vertically, with the information about what site each table pertains to above the table. There are also multiple sheets in each file.

What is the best way to automate the process of populating a new table created from all of these files, with all of the data arranged vertically? Seems like it shouldn't be that hard to write a script telling the computer to take a subset of rows from each file and make new rows with added columns for the date, site, etc. extracted from the header of each table. I know I could probably do this with excel macros, but I'm wondering if there isn't a better, more transparent solution. I have minimal programming experience, but I am willing to bite the bullet if doing this in R or Python is really the way to go.

Basically my goal is to get this done in a way that will teach me useful "industry standard" skills, but also not make things too much more complicated than necessary. Thanks!
posted by 12%juicepulp to Computers & Internet (6 answers total) 5 users marked this as a favorite
If you're willing to jump into R, the tidyverse is well suited to this kind of task! R for Data Science should provide you an approach to this project -- for instance, the section on tidy data includes examples of some of the transformations you may have to perform.

Following the tidyverse approach in this case will give you an excellent framework for importing data, tidying it, going through the transform -> visualize -> model steps to get it ready for analysis, and performing the analysis all in one ecosystem.

(Re: using this as a learning opportunity -- I have made the switch from Python-first to R-first because I appreciate the clarity of thinking and process in tidy approaches, but there's probably a good case to be made for learning Python to solve this too! So I don't want to discount that route, or doing it in Excel directly. But yes, I think whichever approach you take will be full of learning. Enjoy! Take chances, get messy, make mistakes!)
posted by elephantsvanish at 1:24 PM on May 2, 2019 [3 favorites]

Datawrangling is really so dependent on the exact flavor of messy data you've got, but I echo the suggestion for R and the tidyverse. The Python equivalent for working specifically with tabular data like what you have in your Excel files is pandas. Both are in very active development, but to my mind pandas is behind where the tidyverse is right now. Plus, Excel users tend to take to the Rstudio IDE (a standalone program for viewing code/datatables/plots/etc) easier than the various options for Python.
posted by deludingmyself at 1:28 PM on May 2, 2019 [3 favorites]

I agree with R: there's a bunch of ways to do this in principle but the relevant "industry standard" for modern Biology is R.

Python has a lot of overlap in functionality and perhaps more general functionality. For that matter you could probably be directly operating on .csv with regex and sed/awk/grep, etc. Personal choice and preference does matter, and I've seen lots of tool-chains. In some sense what seems the most fun and intuitive at a glance is best.

TLDR: If you're working with data in research biology, you should probably know one data-handling tool, and it should probably be R, because that's what has the marketshare and people will generally be expecting.
posted by SaltySalticid at 1:32 PM on May 2, 2019 [1 favorite]

Do you have access to Tableau? Or even a free trial of Tableau? If so, it will do this in about 10 seconds. If you have Tableau and want help executing, I can help.
posted by crapples at 2:12 PM on May 2, 2019 [1 favorite]

I'm going to go against the grain a bit: Python and then R. It strikes me that the bulk of your problem here is getting the Excel workbooks into a usable format and it's going to be a heck of a lot easier to do that from Python than it is from R. Working with data is a lot nicer in R (because of the tidyverse, like everyone has mentioned), but I don't find R easy to program (i.e. perform arbitrary tasks) in and it sounds like you're going to need to parse the Excel files first to break up the different tables into separate CSVs or whatever (and extract meaningful information from the headers!). openpyxl is likely the package you want, but I definitely don't envy you this task. Once you have things in a usable format, move to R.
posted by hoyland at 5:49 PM on May 2, 2019 [2 favorites]

I get why folks are suggesting R but I do this kind of thing right in Excel. I find Visual Basic for Applications (VBA) easy to learn and understand. If you have Excel, it’s right there to do this natively. I learned VBA from excel-easy.com.

MeMail me and I can share an Excel file with a script to open each file in a directory and grab data from different tabs and fields to reformat into a single compiled table. It won’t match your data, but if the formats are consistent in your files, it should give you a framework to build from.
posted by putzface_dickman at 1:32 PM on May 3, 2019

« Older A poison so sweet   |   Pay up to see this apartment Newer »
This thread is closed to new comments.