Can you help me (a designer) learn to wrangle data?
November 9, 2012 6:38 AM

Data sets, Aggs and Smushes, oh my! Can you help me (a designer) learn to wrangle data?

I have a UX design background and often work with engineers to develop interfaces for handling content sourced from databases. It would greatly help my work, and interest me besides, if I could learn some basic skills for working with data. To get started, I plan to practice on some publicly available data sets, but where should I begin?

I want to be able to:
• Better understand what a database actually is; terms I want to investigate include "database" vs. "data set", "aggregation", what is technically happening during "data smushing", etc.
• Learn how to query
• Learn how to pull parts of these data sets both into open source visualization software and into interfaces I create

I am looking for:
• technology/software recommendations
• a few interesting public data sources that are simple enough to start working with
• learning resources that assume no previous engineering knowledge (other than basic front end web dev)
posted by halseyaa to Technology (5 answers total) 8 users marked this as a favorite
 
You might want to look into the data analysis competitions and posted solutions at Kaggle for data sources and some guidance.
I was also recently introduced to Data.gov, which offers public data sets.
posted by hot soup at 7:21 AM on November 9, 2012


The amount of data you plan on working with matters a great deal. Tools and techniques that work fine for small amounts of data (e.g. megabytes or gigabytes) usually fall short when you need to scale up to larger amounts (e.g. terabytes or petabytes).

Although I have experience with large-scale data processing (I work for a company that specializes in it), I'll give you a few technology/software recommendations for small-scale work, because that's the best starting point.

I have a bias towards open source tools, but that's because they work, are easily available to everyone, and are also what people are really using in the field.

I strongly recommend running Linux (e.g. CentOS), though Mac OS X offers much of what you need. If you're on Windows, run Linux in a virtual machine. The software I mention below is readily available for Linux and can typically be installed with a single command.
  • General-purpose programming language: Python
  • Programming language if you need to do statistical analysis: R
  • Database: MySQL (which uses the common SQL language for querying data; see the small query sketch just below)
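
Here's a minimal sketch of what querying looks like, using Python's built-in sqlite3 module rather than MySQL (the SELECT syntax is essentially identical for queries this simple, and the table and column names here are made up for illustration):

    import sqlite3

    # Toy in-memory database; a real project would connect to MySQL instead.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Create and populate a tiny hypothetical "tweets" table so the
    # query below has something to run against.
    cur.execute("CREATE TABLE tweets (user TEXT, text TEXT, lat REAL, lon REAL)")
    cur.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?, ?)",
        [("alice", "coffee is great", 40.71, -74.00),
         ("bob", "tea beats coffee", 51.51, -0.13)],
    )

    # The basic shape of a query: which columns, from which table,
    # matching which condition.
    for row in cur.execute(
            "SELECT user, lat, lon FROM tweets WHERE text LIKE ?",
            ("%coffee%",)):
        print(row)

    conn.close()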
Also, you can do a lot with the Linux (or UNIX) command-line tools if you learn to use them. For example, the command below extracts the third column from a tab-delimited file and counts how often each unique value appears:
   $ cut -f3 myfile.txt | sort | uniq -c
I'm a programmer, but doing things like what I've shown above is usually my first step in working with data. There's an awesome book called UNIX Power Tools that gives hundreds of examples like this, along with good explanations.
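
If you'd rather stay in Python, here's the same counting step as a minimal sketch (myfile.txt is the same hypothetical tab-delimited file as above):

    from collections import Counter

    # Count how often each value appears in the third tab-delimited
    # column, mirroring: cut -f3 myfile.txt | sort | uniq -c
    counts = Counter()
    with open("myfile.txt") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1

    for value, n in counts.most_common():
        print(n, value)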

A few other resources that might be helpful are: Data Wrangling Handbook and Learn Python the Hard Way.
posted by tomwheeler at 8:24 AM on November 9, 2012 [1 favorite]


I'm assuming you don't necessarily want to learn programming, just to get a handle on the concepts and how they are used.

In that case, you might just want to read up on relational databases. Schools like Stanford sometimes offer free online courses on introductory topics like this.
posted by deathpanels at 9:43 AM on November 9, 2012


Response by poster: Thanks for all of the responses so far. Many of the links are pointing me in some helpful and interesting directions, but I feel like I may be an additional step removed from being able to utilize them. I guess a follow up to my question would be, "what is your process for taking a data set and turning it into something useful?"

My best attempt at mapping out my process based on my limited understanding would be something like:

1. Identify the question/mashup I want to address (e.g., plotting tweets containing a certain keyword on a map - I know tools exist to do this directly but I'm just using it as an example)

2. Download the data into a CSV or similar format (this may require using an API?)

3. Use an extraction tool to get the columns I want (geo) filtered by my keyword

4. Plot those rows using an open source tool or Processing to generate my map

Is that even close to the way anyone here would approach this? Am I missing any critical steps? (I'm sure I am as I'm totally taking a guess.)
posted by halseyaa at 12:54 PM on November 9, 2012


Best answer: What you've described is pretty much an overview of what you'd need to do.
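
For instance, steps 2 through 4 might look roughly like this in Python. The URL and field names here are entirely made up (Twitter's real API requires authentication and returns a different structure), so treat this as a sketch of the shape of the work, not working code against a real service:

    import csv
    import json
    import urllib.request

    # Step 2: download data from a hypothetical JSON API endpoint.
    url = "https://example.com/api/tweets?q=coffee"
    with urllib.request.urlopen(url) as resp:
        tweets = json.load(resp)

    # Step 3: keep only the columns we want (geo), filtered by keyword.
    rows = [(t["lat"], t["lon"])
            for t in tweets
            if "coffee" in t.get("text", "")]

    # Step 4 handoff: write a CSV that a plotting tool can import.
    with open("coffee_tweets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["lat", "lon"])
        writer.writerows(rows)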

Unfortunately, it gets a bit harder from there for a few reasons:
  • Data is dirty. It will contain unexpected values and errors that you need to fix. If you're a designer, you might be familiar with HTML. How many people do you know who follow the spec 100% correctly? You'll find similar problems with any format you work with, given a sufficient quantity of data.
  • Formats are more complex than they seem. Twitter's API returns data in JSON format (JavaScript Object Notation, which looks a bit like the CSS used to style Web pages). Even if you can get data as CSV, it's deceptively complicated. Sometimes files have header rows, sometimes not. Sometimes values are quoted even when they don't need to be, sometimes they aren't. Missing values are handled any number of ways. The sketch just after this list shows one way to cope with those quirks.
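Here's a minimal sketch of coping with CSV quirks using Python's standard csv module; tweets.csv is a hypothetical file, and csv.Sniffer is a best-effort guesser, not magic:

    import csv

    # Hypothetical input file. csv.Sniffer inspects a sample of the file
    # and guesses the dialect (delimiter, quoting) and whether there is
    # a header row, which papers over many of the quirks above.
    with open("tweets.csv", newline="") as f:
        sample = f.read(4096)
        f.seek(0)
        dialect = csv.Sniffer().sniff(sample)
        has_header = csv.Sniffer().has_header(sample)
        reader = csv.reader(f, dialect)
        if has_header:
            next(reader)  # skip the header row
        for row in reader:
            print(row)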
I don't want to discourage you at all, but I think you'll eventually need to learn a bit of programming (whether you intended to or not) to get clean data into the format you need.

For simple examples, you could use a spreadsheet like Excel or OpenOffice to do almost all of this. Either one can import CSV, let you manipulate the values easily, and then plot them in various ways. If you know JavaScript (or are interested in learning it), you can do a lot of cool visualization with d3.js.
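
And if you end up in Python rather than a spreadsheet, a library like matplotlib (not mentioned above, but a common open source choice) can handle the plotting step; the coordinates below are toy data standing in for your filtered rows:

    import matplotlib.pyplot as plt

    # Toy lat/lon pairs standing in for the filtered rows from step 3.
    lats = [40.71, 51.51, 35.68]
    lons = [-74.00, -0.13, 139.69]

    plt.scatter(lons, lats)
    plt.xlabel("longitude")
    plt.ylabel("latitude")
    plt.title("Tweets mentioning a keyword (toy data)")
    plt.show()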
posted by tomwheeler at 9:55 PM on November 9, 2012

