Because "I just know" is not a sufficient organizing strategy
December 8, 2010 2:14 PM   Subscribe

What's the best way to organize data when you're making lots of changes to it?

This question is so basic it's embarrassing, but I've never learned a good system to do this.

When I've got a large dataset in a non-coding program (say Excel or SPSS as opposed to R), and I'm making a bunch of small changes, how do I organize and keep track of what I've done? When I say small changes, I mean like: correcting data input errors, transforming variables from categorical to numeric, creating new variables from others, merging with other data, organizing in different ways, etc.

I already keep an original copy of the data. That I know to do! I also try to keep original variables.

It always sounds so straightforward, but then stuff happens in the middle of your work, like, someone else wants the data but with different stuff taken out or put in, you need to calculate a different way, you made a mistake, etc.

Thing is, I'm not a naturally organized person. I tend to have stuff in piles and just "know" where things are, just remember what things I did, etc. But I know this is not a good system for something like this!

Anonymous because my work questions are making me identifiable.
posted by anonymous to Science & Nature (7 answers total) 9 users marked this as a favorite
 
I use version control for non-programming stuff as well. It's not as efficient as for source files, because the diffs won't mean much for an excel table, but to organize different versions, it's perfect. Look into git, bazaar or whatever you deem interesting and suitable.
posted by Triton at 2:17 PM on December 8, 2010


This is probably a horrible practice, but my go-to solution is to just add another column for "updated/not-updated," and put ones and zeroes in it as i go through my spreadsheet... does that work for your issue?
posted by Pickman's Next Top Model at 2:26 PM on December 8, 2010


I find a formatting aspect I'm not already using in the document, and format the data I've completed. Most often, I use bold, and bold all the fields I've checked, but if bold already means something in my spreadsheet, I'll use something else -- setting the background color of rows, perhaps.

I don't do it every time I fix a cell since that would be tedious, I'll fix a bunch and then set them all at once, but I try to set it every five minutes or so, so that I don't accidentally wander away without doing it and then forget.

When I'm done with the whole job, I can just select all and de-bold everything, or reset the background to white.
posted by jacquilynne at 2:28 PM on December 8, 2010


Track Changes in Excel will work if you want fine granularity. Also, you can keep a worksheet that is just a functional log of changes made to the raw data. Combined with keeping your docs in some sort of change control system (SVN for example) it should allow you to retrace your steps. If the program doesn't support any kind of logging or change tracking internally you can also do screenshot snapping as a memory aid (e.g. Timesnapper) which is also useful for retroactive billable hours management.
posted by benzenedream at 2:30 PM on December 8, 2010 [1 favorite]


I'd store it in CSV and use SVN or GIT for version control.
posted by signal at 2:41 PM on December 8, 2010


Version control, as stated above. You can also, at least in SPSS, practice some of the same practices you use in R. Never recode observations without creating a new variable/vector. You can write code in SPSS just like you can in R (well, you know what I mean) and have your do-file (or whatever they're called in SPSS) as a record of what you did. As for Excel, you can use track changes, or you can keep a research notebook/textfile that indicates the changes you've made.
posted by proj at 2:47 PM on December 8, 2010


Don't use a non-programming interface! SPSS has a built in scripting language; google "SPSS code" or "SPSS script" for samples; it should also be discussed in your SPSS documentation. In Excel you can get the same place with VBA.

When working on ANY real data, instead of directly altering the data create a program that alters the data and includes comments as to why you're doing what you did. Even if you use CSV + version control it will get very dicey later. It is immeasurably better to be able to supply to somebody else "Here is the original data, and here is an annotated script that 'cleans' it." It also means that you can go back and modify your cleaning / transforming decisions pretty effortlessly to see how things would have been different.
posted by a robot made out of meat at 5:56 AM on December 9, 2010 [1 favorite]


« Older How to force page breaks when printing email?   |   Help me get started audio programming Newer »
This thread is closed to new comments.