What's the best way to harvest information from a website?
January 15, 2006 3:28 PM
What is the best way to automatically harvest information from a website?
I'm an attorney. As part of my law practice, I spend many hours a week looking up court cases on a court website so that I can contact by mail people who may need my services. The process is time-consuming and tedious as hell. I know there must be a way to write a program that would automate the process.
My current process works like this. I go to the court website, select (using a pull-down menu) the type of search I want to do ("docket search"), enter a date in a text box in a different frame of the page, and get a results page with the names of all individuals who have court cases in a certain court on the date I entered, along with a docket number assigned to each individual. The list of individuals produced by this search is presented in an HTML table.
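To make the first step concrete, here is a rough sketch in Python of what "fill in the form and submit it" might look like to a program. Everything specific in it is an assumption: the URL and the field names `search_type` and `date` are placeholders, and the real ones would have to be read from the HTML source of the frame that contains the form. The sketch only builds the request; it does not send anything.

```python
from datetime import date
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical address of the search form's target; the real one would come
# from the <form action="..."> attribute in the frame that holds the form.
SEARCH_URL = "https://courts.example.invalid/search"


def build_docket_search(court_date):
    """Build (but do not send) a POST request for a docket search on one date."""
    form = {
        "search_type": "docket",                  # assumed name for the pull-down menu
        "date": court_date.strftime("%m/%d/%Y"),  # assumed name and date format
    }
    body = urlencode(form).encode("ascii")
    return Request(SEARCH_URL, data=body, method="POST")


request = build_docket_search(date(2006, 1, 15))
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would submit it and return the results page, assuming the site accepts a plain form post without cookies or a login.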
I copy that list, drop it into Excel (so that each person, with his or her corresponding docket number, occupies a row in the spreadsheet), and then I have to search for each case, one by one, using a different kind of search ("case search by docket number") to get the full information listing for that particular person, which is presented in an HTML table. I then copy-and-paste the information I need (the person's home address) for each individual onto that person's line in my Excel spreadsheet. The end result is that I print out a bunch of mailing labels in MS Word, using the mail merge function.
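The copy-and-paste step amounts to turning an HTML table into rows of cells. A minimal sketch of that, using only Python's standard library, is below. The sample table is made up for illustration; I am assuming the real results page uses ordinary `<tr>` and `<td>` tags without nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            # Join the pieces and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def parse_table(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample standing in for a docket search results page.
SAMPLE = """
<table>
  <tr><th>Docket No.</th><th>Name</th></tr>
  <tr><td>06-0001</td><td>DOE, JANE</td></tr>
  <tr><td>06-0002</td><td>ROE, RICHARD</td></tr>
</table>
"""

rows = parse_table(SAMPLE)
for row in rows:
    print(row)
```

Each docket number in the first column could then be fed into the second search, and the address pulled out of that page's table the same way.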
It seems that it would have been so easy for the programmer of this court system to allow you to just click on the docket number on the results list produced in the original search, pulling up all the data for each person, and cutting out the need for a separate search using each person's docket number, but it does not work that way.
The ideal application would have a simple interface that lets me enter the date of the docket and the court division, then press a button to automatically harvest the information for each person who has business on that date's docket and output it (docket number, name, home address, etc.) in a plain-text format that could be dropped into Excel. Even better would be a program where you could enter a date range and it would harvest the information for all dates, and all court divisions, in that range.
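As a sketch of the outer loop such a program would need, the Python below walks a date range and a list of divisions and writes tab-delimited text, which Excel opens directly. The `lookup` function is hypothetical: it stands in for the two court searches, and `fake_lookup` just returns invented records so the example runs without touching any website. The field names are my own placeholders too.

```python
import csv
import io
from datetime import date, timedelta


def dates_in_range(start, end):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def harvest(start, end, divisions, lookup):
    """Call lookup(day, division) for every combination and collect the records."""
    records = []
    for day in dates_in_range(start, end):
        for division in divisions:
            records.extend(lookup(day, division))
    return records


def to_tab_delimited(records, fields):
    """Render records (a list of dicts) as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def fake_lookup(day, division):
    """Hypothetical stand-in for the real searches; returns invented data."""
    return [{"docket": f"{division}-{day:%m%d}",
             "name": "DOE, JANE",
             "address": "1 Main St"}]


records = harvest(date(2006, 1, 13), date(2006, 1, 15), ["A", "B"], fake_lookup)
output = to_tab_delimited(records, ["docket", "name", "address"])
print(output)
```

Saving `output` to a `.txt` or `.tsv` file would give something that could go straight into the mail merge.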
Is it possible to do this? What programming language or technology would be most appropriate for this purpose?
posted by jayder to computers & internet (15 comments total)
posted by RichardP at 3:40 PM on January 15, 2006