What's the best way to harvest information from a website?
January 15, 2006 3:28 PM

What is the best way to harvest, automatically, information from a website?

I'm an attorney. As part of my law practice, I spend many hours a week looking up court cases on a court website so that I can contact by mail people who may need my services. The process is time-consuming and tedious as hell. I know there must be a way to write a program that would automate the process.

My current process works like this. I go to the court website, select (using a pull-down menu) the type of search I want to do ("docket search"), enter a date in a text box in a different frame of the page, and get a results page with the names of all individuals who have court cases in a certain court on the date I entered, along with a docket number assigned to each individual. The list of individuals produced by this search is presented in an HTML table.

I copy that list, drop it into Excel (so that each person, with his or her corresponding docket number, occupies a row in the spreadsheet), and then I have to search for each case, one by one, using a different kind of search ("case search by docket number") to get the full information listing for that particular person, which is presented in an HTML table. I then copy-and-paste the information I need (the person's home address) for each individual onto that person's line in my Excel spreadsheet. The end result is that I print out a bunch of mailing labels in MS Word, using the mail merge function.

It seems it would have been easy for the programmer of this court system to let you just click on a docket number in the original results list, pulling up all the data for that person and cutting out the need for a separate search by docket number, but it does not work that way.

The ideal application I am seeking would have a simple interface that would let me enter the date of the docket and the court division, press a button, and have it automatically harvest the information for each person who has business on that date's docket, outputting the information (docket number, name, home address, etc.) in a plain-text format that could be dropped into Excel. Even better would be a program where you could enter a date range, and it would harvest all the information for all dates and all court divisions in that range.

Is it possible to do this? What programming language or technology would be most appropriate for this purpose?
posted by anonymous to Computers & Internet (14 answers total)
 
Exactly the kind of task you describe is often done using a scripting language with a good website automation library. It would be easy to do in Perl or Python.
posted by RichardP at 3:40 PM on January 15, 2006


curl/wget + perl would be great for this. You could output the info in a .csv to be used with mail merge. Depending on how the labels are done, you might even be able to do it with LaTeX. Contact info in the profile.
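For example, the .csv-writing step is only a few lines of Perl with Text::CSV. This is just a sketch of that step; the field names and sample record are made up -- use whatever the court site actually gives you:

```perl
#!/usr/bin/perl
# Sketch of the CSV-output step: write records to a file that Excel or
# Word's mail merge can open directly. Field names are hypothetical.
use strict;
use warnings;
use Text::CSV;

# In a real script these records would come from the scraping step.
my @records = (
    { docket => '06-CV-0123', name => 'Jane Doe', address => '1 Main St, Anytown' },
);

my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
open my $fh, '>', 'docket.csv' or die "Can't write docket.csv: $!";
$csv->print($fh, [qw(docket name address)]);                       # header row
$csv->print($fh, [ @{$_}{qw(docket name address)} ]) for @records;
close $fh;
```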
posted by devilsbrigade at 3:46 PM on January 15, 2006


oops, I just noticed that the end of my comment was lost...

When using Perl to do that kind of task I often write my own parser for added flexibility and robustness, but it is very easy to whip up something quick-and-dirty using the Perl CPAN module WWW::Mechanize.
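For instance, a quick-and-dirty sketch of the "docket search" step might look like this. The URL and form field names are guesses -- you'd pull the real ones out of the court site's HTML:

```perl
#!/usr/bin/perl
# Quick-and-dirty WWW::Mechanize sketch of the "docket search" step.
# The URL and field names are placeholders; if the search form lives in a
# frame, get() the frame's own URL directly.
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://courts.example.gov/search');     # hypothetical URL

$mech->submit_form(
    form_number => 1,
    fields      => {
        search_type => 'docket',       # the pull-down menu
        date        => '01/15/2006',   # the date text box
    },
);

print $mech->content;    # results page HTML, ready to be parsed
```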
posted by RichardP at 3:46 PM on January 15, 2006


Excel has some built-in ability to automatically download and parse web pages. I'm not much of an Excel jockey myself, but I know this feature is used in some spreadsheets at my office. I think the feature is called "Get External Data."
posted by mbrubeck at 4:02 PM on January 15, 2006


What you are looking to do is often called screen scraping or spidering. I'd recommend the book Spidering Hacks, which provides lots of code and tips -- including the importance of spidering etiquette (e.g., don't overburden the server with tons of requests without a delay between them).
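As a rough illustration of that etiquette point, the per-docket lookups described in the question could be done with a loop like this. The URL pattern and docket numbers are invented for illustration -- the part that matters is the sleep between requests:

```perl
#!/usr/bin/perl
# Polite-spidering sketch: fetch each case page in turn, pausing between
# requests so the court's server isn't flooded. The URL pattern is made up.
use strict;
use warnings;
use WWW::Mechanize;

my @docket_numbers = ('06-CV-0123', '06-CV-0124');   # from the first search
my $mech = WWW::Mechanize->new( autocheck => 1 );

for my $docket (@docket_numbers) {
    $mech->get("http://courts.example.gov/case?docket=$docket");
    # ... pull the person's home address out of $mech->content here ...
    sleep 2;    # courtesy delay between requests
}
```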

I read the book a couple of years ago and applied the findings to some code I was writing in Visual Basic -- which you really shouldn't use for this if you can use Perl, Python, PHP, or Ruby, which are much better suited for the task.
posted by i love cheese at 4:21 PM on January 15, 2006


As you don't have access to their database directly, you are limited to reading the HTML (called "screen scraping" or "web scraping") and converting that into a useful format.

Typically it's done by downloading the page (wget), cleaning the page so that other software can more easily use it (htmltidy), and then extracting and converting the parts you want (xslt). You can automate this conversion process in almost any language.
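Driven from Perl, that pipeline might look roughly like this. The URL and the stylesheet name are placeholders, and it assumes wget, tidy, and xsltproc are installed:

```perl
#!/usr/bin/perl
# Sketch of the download / clean / convert pipeline, shelling out to the
# external tools. The URL and stylesheet name are placeholders.
use strict;
use warnings;

my $url = 'http://courts.example.gov/results?date=2006-01-15';   # hypothetical

system('wget', '-q', '-O', 'results.html', $url) == 0
    or die "wget failed: $?";

# tidy returns a non-zero exit status for mere warnings, so don't die on it
system('tidy', '-q', '-asxhtml', '-o', 'results.xhtml', 'results.html');

# extract.xsl would hold the rules for pulling out names, dockets, addresses
system('xsltproc', '-o', 'results.csv', 'extract.xsl', 'results.xhtml') == 0
    or die "xsltproc failed: $?";
```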

That's what you can do now with just the web pages. Alternatively, you could ask them to open up their database and provide daily dumps of their data to a server. Getting the data this way is preferable and much easier to work with.
posted by holloway at 4:49 PM on January 15, 2006


For what it's worth, you're right -- it would be nice if the courts and all the other government agencies started publishing their information in easily consumable formats. It's entirely possible that they already have this capability, so you might want to drop a quick email to the administrator of the site which you're going to spider to find out what they've got under the hood.

You never know unless you ask...
posted by ph00dz at 4:50 PM on January 15, 2006


Definitely use Perl and WWW::Mechanize. It rocks for this kind of thing. Once you've got to the requisite page, there are also Perl modules that specialise in the analysis of HTML tables, like HTML::TableContentParser and HTML::TableExtract.
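For example, HTML::TableExtract can pull the results table out by its column headings. This is only a sketch -- the headings here are guesses at what the court site actually uses:

```perl
#!/usr/bin/perl
# Sketch of pulling rows out of the results table with HTML::TableExtract.
# The column headings are assumptions about the court site's table.
use strict;
use warnings;
use HTML::TableExtract;

my $html = do { local $/; <STDIN> };    # e.g. a saved results page

my $te = HTML::TableExtract->new( headers => [ 'Name', 'Docket Number' ] );
$te->parse($html);

for my $table ($te->tables) {
    for my $row ($table->rows) {
        print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```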

By the way:

I go to the court website, select (using a pull-down menu) the type of search I want to do ("docket search"), enter a date in a text box in a different frame of the page, and get a results page

This part could easily be automated for you if your results page has a URL like domain.com/script.cgi?search=docket&date=1/1/05 -- you could set up bookmarks, or a JavaScript bookmarklet that automatically goes to today's date, for instance.
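Or in Perl, something along these lines -- the URL pattern is the hypothetical one above, and the date format is an assumption:

```perl
#!/usr/bin/perl
# Sketch: build the results URL for today's date and fetch it.
# The URL pattern and date format are assumptions.
use strict;
use warnings;
use POSIX qw(strftime);
use URI;
use WWW::Mechanize;

my $today = strftime('%m/%d/%y', localtime);     # e.g. 01/15/06

my $uri = URI->new('http://domain.com/script.cgi');
$uri->query_form( search => 'docket', date => $today );

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get($uri);
print $mech->content;
```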
posted by AmbroseChapel at 5:02 PM on January 15, 2006


I've done this in Perl (scraping eMusic.com) and in Firefox using JavaScript (scraping a "best of the web" community web site); each program had several hundred satisfied users.

In Perl, as RichardP notes, WWW::Mechanize provided an easy start. Doing it in Firefox is easier (but requires, obviously, Firefox).

Now, I don't usually do my own legal work: I hire a lawyer because he can do it better and faster than I can. (This is just a trivial application of Ricardo's Law.) I'd suggest that it's ultimately cheaper for you not to distract yourself from making money as a lawyer: hire a coder to do this for you.
posted by orthogonality at 5:08 PM on January 15, 2006


If you want to go with Python (probably easier to learn than Perl), go with BeautifulSoup or mechanize (yes, named after the Perl module). But only do this if you think programming is a fun hobby anyway; if not, I agree with orthogonality: just hire somebody to code this for you.
posted by davar at 4:10 AM on January 16, 2006


Hey Ortho -- what is it that you like about doing it in javascript? The easy dom access?
posted by ph00dz at 7:53 AM on January 16, 2006


ph00dz writes "Hey Ortho -- what is it that you like about doing it in javascript? The easy dom access?"

The fact that the HTML's already been parsed and can be treated as a tree rather than as text, yeah. (Yes, there are Perl classes that'll parse; no, I didn't use them in the Perl app.)


jayder writes "If I were to hire a programmer to do this (I sure can't do it myself), what would be a fair price?"

It depends on how flexible you want this. If it only needs to work on one site, it's cheaper. The less graphical interface it needs, the cheaper. If it's work for hire, it's more expensive; if the coder retains copyright, it's cheaper.

You can get it done offshore probably for $100, but with some question about maintainability and some (possibly major) problems communicating requirements. Me, I'd want a couple to several hundred bucks, depending on how extensive your exact requirements were.

Figure it's like you doing a will for a client: there's a fixed cost to you to do anything at all, even if it's pretty pro-forma.
posted by orthogonality at 10:43 AM on January 16, 2006


If I were to hire a programmer to do this (I sure can't do it myself), what would be a fair price?

I think orthogonality is in the correct ballpark with regard to price. In the past I've charged several hundred dollars for small projects similar to the one you describe.
posted by RichardP at 3:50 PM on January 16, 2006


jayder, if you haven't got someone to do this yet, email me? I've done a couple of projects like this before and would be interested in the work.
posted by AmbroseChapel at 6:35 PM on January 16, 2006


This thread is closed to new comments.