Python for screen-scraping?
May 14, 2012 1:33 PM

I want to code a really basic screen-scraper to make a tedious work task slightly less tedious. I'll probably be doing it in Python. Some assistance required!

This is the NYS Corporation database search site. You type in a name, it gives you a few matches, you pick one and that leads to this, for example.

Here's what I need a program to do:
  • Query the site
  • Allow me to pick the correct entry
  • Scrape the info in a bunch of those fields
  • Convert the formatting (title case instead of caps, "St." instead of "Street," etc.)
  • Spit the data into pre-determined places in a boilerplate text document.
  • Save each instance of this search as an individual file.
  • OR
  • Email the text to a specific address
Right now I'm doing this by copying and pasting into a Word file and manually emailing each text block. I'm pretty fast, but I know my time is better spent figuring out a way to get the computer to do it. Plus, it'll impress the hell out of my boss, as he has to do this from time to time and finds it equally annoying.

My friend suggested I do this in Python using mechanize and BeautifulSoup. Fair enough. Anyone have any opinions or counter-opinions on that? Getting this done painlessly is top priority.

Speaking of which, if I am to use Python, is there a guide out there that will hold my hand through getting Python actually running on a Windows Vista box? There's all sorts of different versions, different packages, different implementations and just a world of stuff I have no interest in picking apart. I just want an idiot-proof guide that lands me in front of a reasonably smart IDE with keyword highlighting, bracket checking, indenting and so on.

Background: I have a few years of programming classes behind me (from a good ten years ago). I coded something similar to what I need now for my high school Visual Basic project, so I'm pretty sure I'm capable of it now. I also mess around with Arduino at home, so it would be pretty sweet to get practice for a hobby while at work.
posted by griphus to Computers & Internet (9 answers total) 13 users marked this as a favorite
 
Response by poster: (If I wasn't clear: I'm not using this as a launching point for a career in development. I just want to invest some time into this now and get it back in time shaved off this boring-ass task.)
posted by griphus at 1:36 PM on May 14, 2012


Best answer: The standard Python for Windows is fine. You may want to stick with the 2.x.x versions as there is much more support.

There is no setup other than just installing it. Notepad++ is a good editor.
posted by wongcorgi at 1:44 PM on May 14, 2012


Agreed that Python for Windows is all you need.

I learned to program for the sole purpose of scraping data from websites (and to impress colleagues). I've used both Perl and Python for this task, and now I use Python exclusively; mechanize and BeautifulSoup. I used to use Notepad++ but I use Sublime Text 2 now; I think it will meet all the needs you described in this post. It's a gorgeous and elegant text editor.

If you know your regular expressions (which I'm sure you do), most of what you described should be pretty simple. As far as emailing goes, maybe you'd use the Google Mail API? (this module might be relevant)
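For the formatting step, here is a minimal sketch of the regex approach, using only the standard library. The abbreviation table and the function names are made up for illustration, so adjust them to whatever your boilerplate actually needs. It uses a regex rather than str.title() because str.title() turns "JOE'S" into "Joe'S".

```python
import re

# Hypothetical abbreviation table; extend it to match your own house style.
ABBREVIATIONS = {
    "Street": "St.",
    "Avenue": "Ave.",
    "Boulevard": "Blvd.",
}


def title_case(text):
    """Capitalize the first letter of each word and lowercase the rest.

    A "word" here is a run of letters, optionally joined by apostrophes,
    so "JOE'S" becomes "Joe's".
    """
    return re.sub(r"[A-Za-z]+(?:'[A-Za-z]+)*",
                  lambda m: m.group(0).capitalize(),
                  text)


def abbreviate(text):
    """Replace whole words listed in ABBREVIATIONS with their short forms."""
    pattern = r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
    return re.sub(pattern, lambda m: ABBREVIATIONS[m.group(1)], text)


def clean_address(raw):
    """Title-case an all-caps string, then apply the abbreviations."""
    return abbreviate(title_case(raw))


print(clean_address("123 MAIN STREET, NEW YORK"))
```

Note that this lowercases everything after the first letter, so names like "McDonald" or "LLC" would need their own special cases.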

Another way of going about this (if your only interest is saving time and resources) would be to set up a project on vworker (or a similar site). It might take an experienced coder just a few hours to set up the sort of program you described.
posted by mammary16 at 2:04 PM on May 14, 2012


Response by poster: Time and resources is definitely the top priority, but I'm also set on doing it myself. Thanks for the answers so far!
posted by griphus at 2:09 PM on May 14, 2012


Best answer: My friend suggested I do this in Python using mechanize and BeautifulSoup. Fair enough. Anyone have any opinions or counter-opinions on that?

I agree with your friend; that is exactly what I do when I write stuff like this. You probably don't need Mechanize for this particular project, because it doesn't sound like you really need that level of browser emulation (i.e. you're not dealing with logging in and cookies and whatnot).

* Query the site
* Allow me to pick the correct entry
* Scrape the info in a bunch of those fields
* Convert the formatting (title case instead of caps, "St." instead of "Street," etc.)
* Spit the data into pre-determined places in a boilerplate text document.
* Save each instance of this search as an individual file.
* OR Email the text to a specific address


All of that is doable. To query the site, figure out what POST data the site's search form sends, and use urllib to make a normal HTTP request (or use Mechanize instead). Then load the response into BeautifulSoup, which you'll use to find the fields you want to scrape. It helps if they have nice unique id names, but in the worst case you can just look for, say, the third row of the second table on the page. Everything else is pretty much standard Python that you can get done with the built-in libraries. StackOverflow is a good resource for figuring out how to do things in a language you're not used to: if you want to do something common like send an email or do title casing, most likely someone has already asked about it there.
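To make the querying step concrete, here is a rough sketch of building the POST request with the standard library. The URL and the form field names are placeholders I made up, not the real site's. You'd find the real ones by viewing the source of the search form. The imports are written so it runs on either 2.7 or 3.x.

```python
try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7
    from urllib import urlencode
    from urllib2 import Request, urlopen

# Hypothetical URL; replace with the form's real "action" address.
SEARCH_URL = "https://example.com/corp_search"


def build_search_request(entity_name):
    """Build (but do not send) a POST request for a name search."""
    # Hypothetical field names; copy the real ones from the form's <input> tags.
    form_fields = {"entity_name": entity_name, "search_type": "BEGINS"}
    body = urlencode(form_fields).encode("ascii")
    # Supplying a data argument is what makes this a POST instead of a GET.
    return Request(SEARCH_URL, data=body)


def fetch_results(entity_name):
    """Send the request and return the raw HTML of the results page."""
    response = urlopen(build_search_request(entity_name))
    try:
        return response.read()
    finally:
        response.close()
```

Whatever fetch_results() returns is what you'd hand to BeautifulSoup for the scraping step.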

There's all sorts of different versions, different packages, different implementations and just a world of stuff I have no interest in picking apart

3.x is the newest version but it's not very widely adopted yet, so you should probably go with 2.7. Download the 2.7 installer and install it, then download similar installers for third-party libraries like Mechanize and BeautifulSoup.

I just want an idiot-proof guide that lands me in front of a reasonably smart IDE with keyword highlighting, bracket checking, indenting and so on.

There are not a lot of great Python IDEs as far as I know, partially because it's not really an IDE-friendly language in terms of making it easy to do simple analysis of the code (such as showing types of variables or auto-finding declarations). You can get simple syntax highlighting and indenting with any decent text editor, though; I personally use Notepad2 on Windows. The IDLE IDE that is included in the Windows installer for Python is okay.
posted by burnmp3s at 2:10 PM on May 14, 2012 [2 favorites]


I recommend trying out selenium. It has a python library/hook and I have used python to perform a similar task that you described.
posted by a womble is an active kind of sloth at 2:48 PM on May 14, 2012 [1 favorite]


Best answer: There are not a lot of great Python IDEs as far as I know, partially because it's not really an IDE-friendly language in terms of making it easy to do simple analysis of the code (such as showing types of variables or auto-finding declarations).

Python's dynamic nature makes this impossible, since a variable can have different types at different times. Eclipse, with a Python extension, does a good job anyway, but has the drawbacks of being very slow to start up and shut down, being a bit complicated to set up, and being written in Java. I've been using pyscripter lately myself; it's not bad.

You want to be using Python 2.7.3 unless you are using a package that explicitly requires an earlier version. You'll probably be doing at least a little command line work, so fair warning there. I don't know anything about mechanize and Beautiful Soup, but I've been doing a fair bit of Python programming lately. Feel free to IM if you need help.
posted by JHarris at 3:39 PM on May 14, 2012


Best answer: I mocked up a prototype in Python. This is written assuming Python 2.x and BeautifulSoup 3.x. You may have to adjust a few things if you're using Py3k and BS4. The output looks like this.

The template is included inline, and output goes to stdout. You can change those both to be files without too much hassle. The fields in the template are derived from the table header elements (<TH>) with spaces removed. The algorithm used here is "for each TH, find the next nearest TD and associate the two as key/value." That works for the main fields that have only one value per table row, but it doesn't work for multi-column rows like the "Name History".
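For anyone reading along without the linked code, here is a separate, stripped-down sketch of the same TH-to-TD pairing idea. It is not the prototype itself: it swaps BeautifulSoup for the standard library's HTMLParser so it runs with nothing extra installed, the sample HTML and template fields are invented, and stripping the trailing colon from the header text is my own assumption so the keys work as template names.

```python
try:  # Python 3
    from html.parser import HTMLParser
except ImportError:  # Python 2.7
    from HTMLParser import HTMLParser
from string import Template


class HeaderCellPairer(HTMLParser):
    """Pair each <th> with the next <td> that follows it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}
        self._current_tag = None   # "th", "td", or None when outside a cell
        self._text = []            # text pieces collected for the open cell
        self._pending_key = None   # header still waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._current_tag = tag
            self._text = []

    def handle_data(self, data):
        if self._current_tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        text = " ".join("".join(self._text).split())  # collapse whitespace
        if tag == "th":
            # Key is the header text with spaces removed; dropping the
            # trailing colon is an assumption, not part of the original.
            self._pending_key = text.replace(" ", "").rstrip(":")
        elif self._pending_key:
            self.fields[self._pending_key] = text
            self._pending_key = None
        self._current_tag = None


# Invented template and sample page, for illustration only.
TEMPLATE = Template("Entity: $CurrentEntityName\nStatus: $CurrentEntityStatus\n")

SAMPLE = """
<table>
  <tr><TH>Current Entity Name:</TH><TD>ACME WIDGETS, INC.</TD></tr>
  <tr><TH>Current Entity Status:</TH><TD>ACTIVE</TD></tr>
</table>
"""


def render(html):
    """Scrape the TH/TD pairs from html and fill them into TEMPLATE."""
    parser = HeaderCellPairer()
    parser.feed(html)
    # safe_substitute leaves any missing field as a literal $Name.
    return TEMPLATE.safe_substitute(parser.fields)


print(render(SAMPLE))
```

As with the real thing, a row with several THs followed by several TDs only pairs the last header with the first cell, so multi-column rows need separate handling.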
posted by Rhomboid at 5:19 PM on May 14, 2012 [5 favorites]


Response by poster: Okay, awesome, I have Python working w/ BeautifulSoup and using pyscripter which is exactly what I wanted. And thanks for the code, Rhomboid! I've been spending this morning picking it apart.
posted by griphus at 9:13 AM on May 15, 2012

