How do I use data from data.gov?
July 8, 2009 11:50 AM

I'm a journalist with not much database programming experience. However, I have been fascinated by the government and other agencies making their data available for developers to play with, and I think a newspaper could do wonders with this data. How do I get in on the fun?

Transit agencies are making their times available, and the gov't just released records at data.gov.

What languages should I learn to manipulate these datasets to make some good applications and visualizations? Examples I have in mind are the Guardian's treatment of MP expenses and EveryBlock.

Should I learn Django, Ruby, Python? All? None?
posted by Blandanomics to Technology (9 answers total) 10 users marked this as a favorite
 
Excel should be fine for now. If you are a member of Investigative Reporters and Editors, their Web site should have a lot of advice.

They sponsor computer-assisted reporting workshops.
posted by jgirl at 12:46 PM on July 8, 2009


Understand first that computer programming (Python, Ruby, or Django, which is a web framework for Python) is a little different than "database programming". Databases have their own language, SQL, which is standardized to an extent. You can use it to talk to most databases if you're given access.

However, I seriously doubt the government is going to open up this interface. To repeat jgirl, Excel will be fine for the purposes of displaying data and trivially manipulating it. More complicated merges across data sets could be tricky. The sheer size of the data might not fit in Excel (up to ~65k records).

All of this leads to a real database solution. Given your probable experience with MS Office, I would recommend Access in this case. For a non-programmer who won't use these skills anywhere else, I can't really say learning a language will be useful or quick.

I'd consider even hiring a summer intern for peanuts. This will be fairly basic for most beginner programmers.
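For a feel of what the SQL mentioned above looks like, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for whatever database you end up with, and the `routes` table and its numbers are made up for illustration, not taken from data.gov.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of transit routes (made-up data).
conn.execute("CREATE TABLE routes (name TEXT, agency TEXT, daily_riders INTEGER)")
conn.executemany(
    "INSERT INTO routes VALUES (?, ?, ?)",
    [
        ("Red Line", "Metro", 12000),
        ("Blue Line", "Metro", 8000),
        ("Route 9", "CityBus", 3000),
    ],
)

# The SQL question: total riders per agency, biggest first.
query = """
    SELECT agency, SUM(daily_riders) AS total
    FROM routes
    GROUP BY agency
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
for agency, total in totals:
    print(agency, total)
conn.close()
```

The SELECT statement is the part that carries over to MySQL, PostgreSQL or Access with little or no change; the Python around it is just plumbing.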
posted by teabag at 1:07 PM on July 8, 2009


If you were hell-bent on learning db "programming", MySQL is free and available on most cheap web hosts. You could even install it on your own PC, using WAMP. You could export the data, load it into MySQL, and use phpMyAdmin to futz around and learn the basics of SQL.

If you really want to learn a programming language, most web scripting ones have nice APIs for accessing MySQL databases. PHP is MySQL's cousin in that regard, but it's by no means your only choice.

Or Access would work as well, if you're living in a MS world. Both databases have import functions, and it looks like data.gov will give you their data in CSV.
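As a rough sketch of that import step done by hand, the snippet below reads CSV text and loads it into a database table. It assumes Python with the built-in sqlite3 module in place of MySQL or Access, and the `recalls` table, its columns and its values are invented for the example.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded CSV file; columns and values are made up.
SAMPLE_CSV = """state,year,num_recalls
OH,2008,14
OH,2009,9
TX,2008,21
"""

def load_csv(conn, csv_file):
    """Create a `recalls` table and fill it from an open CSV file."""
    conn.execute(
        "CREATE TABLE recalls (state TEXT, year INTEGER, num_recalls INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["state"], int(r["year"]), int(r["num_recalls"])) for r in reader]
    conn.executemany("INSERT INTO recalls VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, io.StringIO(SAMPLE_CSV))

# Once loaded, the data can be queried with ordinary SQL.
ohio_total = conn.execute(
    "SELECT SUM(num_recalls) FROM recalls WHERE state = ?", ("OH",)
).fetchone()[0]
print(loaded, "rows loaded; Ohio total:", ohio_total)
```

With a real download you would pass `open("yourfile.csv", newline="")` instead of the `io.StringIO` sample, and adjust the column names to match the file.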

That being said, Excel is probably sufficient for whatever you're trying to see in the data, if you can constrain the problem within its limits.
posted by cgg at 1:33 PM on July 8, 2009 [1 favorite]


Start with Excel. You could move on to R once you're doing really advanced stuff, but honestly, there is a lot you could see just with Excel.
posted by grouse at 1:50 PM on July 8, 2009


Keep in mind there are a few different disciplines here:
* finding interesting patterns/analyzing the data
* automating the retrieval, joining and subsetting of data
* providing a good UI or visualization of the data

As many others have said, Excel and other stats packages will help you get started, and then you can learn a good language like Ruby or Python along with a database to do the next item. Rails or Django can help with the third piece, but this is also the domain of GIS systems, which can draw pretty color-coded maps and whatnot.
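To make the second item concrete, here is a small sketch of joining and subsetting in plain Python. Both datasets, the `county` key and the threshold of 15 are made up for illustration; a real project would read downloaded files and would likely lean on a database instead.

```python
import csv
import io

# Two tiny made-up datasets sharing a `county` column.
POPULATION_CSV = """county,population
Adams,10000
Baker,50000
Clark,20000
"""
SPENDING_CSV = """county,spending
Adams,200000
Baker,500000
Clark,1000000
"""

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def join_on(left, right, key):
    """Inner-join two lists of dicts on a shared key column."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

joined = join_on(read_rows(POPULATION_CSV), read_rows(SPENDING_CSV), "county")

# Derive spending per resident, then subset to counties above a threshold.
for row in joined:
    row["per_capita"] = int(row["spending"]) / int(row["population"])
high_spenders = [row["county"] for row in joined if row["per_capita"] > 15]
print(high_spenders)
```

The point of scripting this rather than doing it by hand in a spreadsheet is that the same steps can be rerun unchanged when the agency publishes updated files.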

Start with finding interesting things and then work with others to learn how to automate and display the data in a way that helps people understand what you found.
posted by mmascolino at 2:20 PM on July 8, 2009


If you're a journalist, Django is at least marketed at you. Ruby appears to be about building Applications that happen to run on The Web. What Django does is twofold: it lets you build whole applications, and it lets you avoid directly writing database queries (SQL).

If you choose Django, I'd ignore the advice about MySQL, and go straight for PostgreSQL. It's compatible with Django, and generally more scalable. I see questions in your history demonstrating experience with Ubuntu, so I'll point out that Postgres is easily installed via package management and well supported.
posted by pwnguin at 3:05 PM on July 8, 2009


Are you trying to win the AppsForAmerica2 challenge?
http://www.sunlightlabs.com/contests/appsforamerica2/

For tool selection, you might follow their lead.

Are you familiar with Hans Rosling's stuff? (20 min TED talk):
http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

Some good examples at the NYT Data Visualization Lab:
http://open.blogs.nytimes.com/2008/10/27/the-new-york-times-data-visualization-lab/
posted by at at 5:26 PM on July 8, 2009


The sheer size of the data might not fit in Excel (up to ~65k records).

Just noting that this is only true in Excel 2003 or earlier. Excel 2007 supports about a million rows. That said, huge spreadsheets can cause 2007 to start acting very wobbly.

And if you don't have 2007, you can (believe it or not) download a complete, free 60-day trial version directly from MSFT with minimal hassle.
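If you want to know up front whether a download will fit, a quick row count is enough. The sketch below assumes Python and a CSV file; the worksheet limits are 65,536 rows for Excel 2003 and earlier and 1,048,576 for Excel 2007, and the helper names and sample data are just for illustration.

```python
import csv
import io

# Worksheet row limits: 65,536 in Excel 2003 and earlier, 1,048,576 in Excel 2007.
EXCEL_2003_MAX_ROWS = 65_536
EXCEL_2007_MAX_ROWS = 1_048_576

def count_rows(csv_file):
    """Count CSV records, header included, one row at a time."""
    return sum(1 for _ in csv.reader(csv_file))

def excel_fit(row_count):
    """Say which Excel versions can hold a sheet with this many rows."""
    if row_count <= EXCEL_2003_MAX_ROWS:
        return "fits in any Excel version"
    if row_count <= EXCEL_2007_MAX_ROWS:
        return "needs Excel 2007 or later"
    return "too big for Excel; use a database"

# Tiny made-up sample standing in for a real download.
sample = io.StringIO("id,value\n1,a\n2,b\n")
rows = count_rows(sample)
print(rows, "rows:", excel_fit(rows))
```

For a real file, pass `open("yourfile.csv", newline="")` to `count_rows` instead of the sample.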
posted by Conrad Cornelius o'Donald o'Dell at 10:19 PM on July 8, 2009


You might want to try starting with Processing or NodeBox. Both are oriented towards beginning programmers, have great examples and docs and are capable of producing very pretty graphics.
posted by mr.ersatz at 12:01 AM on July 9, 2009 [1 favorite]

