Data-getting programming for noobs
February 5, 2013 4:10 PM   Subscribe

Can you help me get an idea of what this personal programming project of designing a "data-getting" program (details inside) would involve, given I would be starting from noob level? Alternatives?

I have a pipe dream about writing a program that would create automatically updating Excel files and graphs from free and publically available economic data on the internet for my own use. Specifically, from the US Bureau of Labour Statistics Website but also from the United Kingdom Office of National Statistics website. I would not want to download everything- just certain time series that I'm interested in. The thing is, I am not a real programmer. The only “programming” I have done is writing scripts for my own use in the command based analysis packages, Stata, a little R and Matlab. I have a vague notion that the DIY part might teach me something interesting and useful, but then again it might just put me in over my head.

There are a few things I would like to know:
(i) Is there a commonly understood generic name for this kind of automatic data-downloading and updating program (for specific data series) I have in mind? (ii) Suppose I did set myself this as a sort of personal project. Presumably I would have a lot to learn. What would be on my “syllabus” - in terms of learning the relevant programming and/or other information about accessing the structure in which the data is stored and the automatic updating parts? What could I read and where can I find the relevant information? I would love it if my graphs would update themselves when the statistical offices update and revise their data. I wouldn’t object to using a sort of “template” code but I would want to understand what it was all doing. (iii) And suppose I put in, say, an hour a day. Given my initial noob level of knowledge, any idea of the timescale for such a project? (iv) Suppose I was to abandon the self-study part of my pipe dream. Are there other means of acquiring an “automatic data-getter”, and at what cost? (Other than the mouse /keyboard intensive copy and paste method I use already). (v) Do you think the self-study route for such a narrow primary goal + vaguely defined knowledge satisfaction is definitely foolish or worth a go?
(If it helps my laptop runs on Windows 7.)
posted by mister_kaupungister to Computers & Internet (12 answers total) 34 users marked this as a favorite
Does the website allow you to get data? (i.e. is there a way for you to get the data programmatically other than downloading the website?) If not, it's called "scraping" and you're trying to write a "scraper".

It will be much easier for you to write a program that will output to a comma-separated format (CSV) than an actual .xls excel file. But then you won't be able to have multiple tabs and/or pretty graphs. But you can set those up in an excel file and just import the CSV for each tab that you want. The graphs should update themselves after importing.

You want to look at a simple scripting language (I prefer Python). There are packages that will (a) help you download the website and (b) help you parse it. Knowing regex is useful.

Best of luck! And feel free to MeMail me if that's what you want and you want more details of what I'm talking about.
posted by ethidda at 4:17 PM on February 5, 2013 [1 favorite]

Terms often used include "spidering" or "scraping". Googling for "scraping web sites" brings up all sorts of "how to get started".

My tool of choice is Perl, but I've no truck against any of the other awesome languages and, as important, tools used with those languages, to do this.

And you might want to look at d3js as a "what to do once you get the data", because several of their data visualization demos blew my mind.
posted by straw at 4:33 PM on February 5, 2013 [1 favorite]

I use python for scraping all the time. Basically, there's a couple of things you need to do:

1. Find the URL of the page to scrape.
2. Download that page.
3. Work out what information in that page to save
5. Save that information in some file format
6. Determine any links in that page to pages that you want to scrape too.
7. Goto 2.

1) and 2) are easy in languages like Python. The "urllib2" module is pretty good.

3) can be done in a few different ways.

Regular expressions (regex) run very fast but can break easily (if a page's format changes) and don't do "real" parsing. They're also kind of a pain to learn. They're basically a string like "Hello, [A-Za-z]" which will match on strings like "Hello, World" but not "Hello, 1world!". You can use these for quick and dirty data scraping.

Alternatively, you throw the website through a module like BeautifulSoup (recommended!). It's slooow but it will turn a page (a bunch of HTML) into a nicely navigable tree, so you can ask for things like "Every row in every table with the CSS class "bigTable"" and it will give you that. For scraping a bunch of webpages I've found it to be very slow (seconds per page; scraping a few hundred pages takes way too long!) but it's very solid and less of a pain than regex.

For 4) I almost always use the csv module; it does magic csv writing which Excel can read easily.
posted by BungaDunga at 4:42 PM on February 5, 2013 [4 favorites]

An extremely similar desire to scrape quantitative data from a government website is what lead me from being a complete novice to an "advanced beginner" level developer - I have actually built useful projects and have moved on to web app development. I used Python packages Beautiful Soup (as BungaDunga suggested) and mechanize which is similar to urllib2. So, based on my experience of pretty much doing what you're worried about in (v) - I say give it a go!

One way to kick things off is to understand what the webpage is doing using the "developer tools" feature in Chrome or Firebug in Firefox. Navigate to the website and use these tools to see how it is displayed. Also, definitely check if the sites have an API (Application Programming Interface). An API will frequently allow you to download data in a standardized output format with only a few lines of code.

The two beginners tutorials I used - Code Academy and Learn Python the Hard Way.
posted by nowoutside at 5:05 PM on February 5, 2013 [3 favorites]

Both the Department of Labor and Office for National Statistics have APIs for some data. That's easier than scraping.

If you decide to use Python, check out Screen Scraping with BeautifulSoup and lxml (lxml is much faster than BeautifulSoup but not as robust) and Requests (much friendlier than urllib2).
posted by djb at 5:16 PM on February 5, 2013 [1 favorite]

I think you might be making this out to be scarier than it is. As previous posters mention, there are lots of data parsing / visualization packages out there that do the hard work for you. I hesitate to give timelines without knowing more about your background, or what kind of graphs you want to produce, but if you're investing an hour a day, and you have enough of a math / engineering background that you've used R and Matlab, I'd be surprised if it took you more than a month. And many of the professional programmers I know still claim, well into their careers, that they're not "real programmers" either, because they never formally studied computer science...and yet they're shipping code every day. If you are motivated to learn it, you almost certainly can.

As far as whether the self-study route is a worthwhile investment of your time, of course it depends on your priorities. I can say that if you deal with data frequently, having data viz. tools at your disposal is super handy, and I personally find it fun.

As far as platforms, let me second d3.js, if you're willing to forgo Excel (d3 is a javascript package for creating charts on a web page): although it's known for its pretty visualizations, d3 also has lots of helpful functions for loading files from a URL. The advantage of a web page is that you'd presumably write it to load its data on launch (unless the data's huge), which means the page would always be fresh (and visible to anyone with an Internet connection).

There's a gentle d3 introduction here. You might try reading the first page or two, see if it seems accessible.
posted by molybdenum at 6:26 PM on February 5, 2013

If you're really tied to Excel, I might recommend taking a look at the PowerPivot extension to it. Combined with a scraper that can dump data into CSV, you can use the PowerPivot plugin to connect the CSV file to a dynamic Excel spreadsheet (just create a connection to a text file, or perhaps to a CGI script that provides the data as CSV).

While I think that there are many more beautiful ways to visualize data than Excel provides, sometimes getting it in there and playing with it can yield some really great results.
posted by montag2k at 8:48 PM on February 5, 2013

I'll back up the suggestion to learn Python for this, if you really want to DIY. It's a nicely designed language that makes it easy to "plug in" modules to do what you want. It's a reasonably popular general-purpose language too, so any skills you pick up would be applicable beyond the field of data analysis.

Poking around the Bureau of Labor Statistics website for just a moment, it seems you won't need to do any website "scraping" from them at least, since they seem to offer all the data as tab-separated files. Parsing them would be a piece of cake for a Python script (once you figure out how).

I'm not so sure about your goal to make the end result be Excel files. This sounds like an unnecessary complication to me, although once you generate your derivative data it would be a trivial matter to export it as CSV or similar, which could be imported into Excel.
posted by neckro23 at 10:09 PM on February 5, 2013

I think this is a great home project. Since you already have and use Excel, I recommend you try using its built in features first, then where you hit limitations, look for improved solutions.

> automatically updating Excel files and graphs from free and publically available economic data on the internet for my own use.

Excel has a built in feature, import from web, that does this. That gets a little dense but it's quite simple in practice.

> then again it might just put me in over my head

Learning curves come with all things, but this is a very good way to ease into programming and data manipulation. You start with the built-in macro recorder. This records your clicks and data entries, but under the hood is (not very efficient) code that you generated. You can access that code with a built-in programming environment, visual basic for applications, and get a feel for the syntax and usage of the language. VBA itself is considered easy to learn (training wheels) but is a subset of, and thus very similar in syntax and usage to Visual Basic, a 'real' programming language. Microsoft has a VBA tutorial to get you started.

Use that base knowledge as a starting point for your project goals. Programmaticly updating your pivot tables. Automate file downloads. Whatever. Excel is well supported as a development platform and has lots of examples you can examine and modify for your project goals.
posted by anti social order at 10:19 PM on February 5, 2013 [2 favorites]

Two things I know about Excel:
1. It can import tab-separated files, so you may not need to convert anythiing to CSV (depending on what data you need and what is stored in those tab files).
2. Excel can have a macro that runs immediately when you open the file. I worked at a place that used this method for delivering Excel reports to end users. They gave the user a link to an Excel file on a server, and when you open it, the macro would import the data file, then format all the columns to be the right width, date vs. dollar vs. text, bold and underline the header rows, put totals on various columns, etc. It even counted the records and when it got past X rows it created a new tab(s) for the remaining data and formatted those tabs too. But originally the Excel file was empty except for this auto-run macro. This enabled the IT team to write one macro for each type of report, then just import the next monthly file and format it exactly the same way every month.

Also, if you find yourself wanting to google for more information about scraping, you should be aware that LOTS of people misspell it with 2 p's. But there are a lot of links in the answers above, so you probably don't need to start from scratch in google.
posted by CathyG at 10:25 AM on February 6, 2013

Response by poster: So: I could start by looking into Python for the scraping or (where available) the API-based approach, importing data to d3js if that's easier, or perhaps focus on working with VBA at the Excel end or with other Excel capabilities.

Very helpful, thanks Mefites! When my schedule allows I'll be giving this a try.
posted by mister_kaupungister at 4:29 PM on February 6, 2013

possibly useful blog entry on web scraping.
posted by rmd1023 at 12:14 PM on February 10, 2013

« Older Fiction: Fatal encounters between therapist and...   |   Hotels with jacuzzi driving distance from Long... Newer »
This thread is closed to new comments.