Databasing static data
July 24, 2006 9:27 AM

How can I harvest data from static HTML pages?

I've taken on a project for a messageboard administrator who has asked me to create a database containing the contents of the individual posts of the messageboard. Unfortunately, each post is a static HTML page generated by a Perl CGI script. The URL of each post encodes its posted date and thread hierarchy. For example, a top-level post submitted on 24 Jul 2006 would look something like this:

(domain/date/threadID/postID)

domain.com/2006-Jul-24/86398/86398.html

The threadID and postID (in this case, the '86398' and '86398.html') will always contain the same number if the post is a top-level post (the first in a thread). A reply to the above-listed example would have a URL similar to this:

domain.com/2006-Jul-24/86398/86403.html

Notice that the threadID (86398) is the same as the top-level post but the postID (86403) is just the next consecutive postID for the messageboard. In this example 4 posts have been submitted elsewhere on the messageboard in between the original parent post (postID 86398) and this example reply (postID 86403).

Additional tiers use a similar threadID/postID convention but I do not currently know how the system determines the hierarchy level of the post. Right now I am going to treat all non-parent posts the same until I can get more information from the admin.
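
For concreteness, here's a minimal Perl sketch of how a URL like the ones above could be split apart. The regex is illustrative only and would need checking against the board's actual paths:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # domain/date/threadID/postID -- a post is top-level when threadID == postID
    my $url = 'domain.com/2006-Jul-24/86398/86403.html';
    if ( my ( $date, $thread_id, $post_id ) = $url =~ m{([^/]+)/(\d+)/(\d+)\.html$} ) {
        print "date=$date thread=$thread_id post=$post_id ",
              ( $thread_id == $post_id ? "(top-level)\n" : "(reply)\n" );
    }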

A real-live example of this messageboard engine can be found here. Feel free to post if you need to; it's just a test board. No login required.

How can I glean the post elements (subject, submitter, and message) from these HTML pages? I do have FTP access to all HTML pages and scripts. The website runs on an Apache server. Any leads would be helpful as to how to even begin.
posted by mezzanayne to Computers & Internet (13 answers total)
 
The googleable term you're looking for is screen scraping.
posted by jellicle at 9:41 AM on July 24, 2006


Sounds like you want to do some "screen scraping" on those HTML pages, assuming the HTML is indeed the only source you can get your data from.

Google seems to yield a few promising-looking resources.


On preview, damn you jellicle! *shakes fist*
posted by utsutsu at 9:44 AM on July 24, 2006


Another term you need to be familiar with is Regular Expressions. You'll use these regexes with a scripting language like Perl, Python, or Ruby, and you should be able to get those HTML files into a database structure.
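
A minimal sketch of that pipeline in Perl, assuming a hypothetical SQLite table named posts and a placeholder regex (you'd swap in the real markup and fields):

    use strict;
    use warnings;
    use DBI;    # assumes DBD::SQLite is installed

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=board.db', '', '', { RaiseError => 1 } );

    # Slurp one saved page and pull a field out with a regex.
    my $html = do { local $/; open my $fh, '<', 'post.html' or die $!; <$fh> };
    my ($subject) = $html =~ m{<title>(.*?)</title>}s;    # placeholder pattern

    $dbh->do( 'INSERT INTO posts (subject) VALUES (?)', undef, $subject );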
posted by mmascolino at 9:44 AM on July 24, 2006


Do you have access to the underlying message-board database? (You say the message-board admin wants this, so can't he get at that?) If so, that's by far your easiest approach.

If you have no option but to screen scrape, I'd recommend developing the screen-scraper in Firefox's JavaScript shell and in Greasemonkey, then hacking GM to allow it to save files to disk. This will work best if the HTML uses divs and spans appropriately to demarcate semantic elements (subject, submitter, etc.) on the page. There are Greasemonkey scripts a-plenty (some for MetaFilter, even) that will give you the flavor of this.

If you don't like JavaScript or DOM programming, there's a Perl screen-scraping package (WWW::Mechanize) that I've successfully used to do similar grabbing of hierarchical information (in that case, four levels and four pages deep) from a webpage. WWW::Mechanize (via its companion WWW::Mechanize::Shell) will also allow you to interactively access the page and page elements, then output that session as a Perl script, which you can then modify.
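
For flavor, a minimal WWW::Mechanize sketch; the URL follows the asker's placeholder pattern, not a real address:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get('http://domain.com/2006-Jul-24/86398/86398.html');

    # Collect links to the other posts in the thread for later fetching.
    for my $link ( $mech->find_all_links( url_regex => qr/\d+\.html$/ ) ) {
        print $link->url_abs, "\n";
    }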

But again, your best and easiest solution is to get the board's underlying database, which already has all the information you want, organized as you want it.
posted by orthogonality at 9:44 AM on July 24, 2006


Seconding what others have said about Regular Expressions.

I just wrote a screen scraper the other day using PHP and regular expressions, and I can send it to you. To figure out whether or not a message is first-tier, one approach may be to use regular expressions or while loops to figure out how deeply nested in tags each post is. Hopefully that's somewhat helpful.

posted by creeront at 9:51 AM on July 24, 2006


Response by poster: Thank you all for the 'screen scrape' Google term. That sounds like what I need to get started.

Orthogonality, there isn't currently an underlying database for the messageboard. The static HTML page is created upon the submission of a post but the information isn't stored anywhere else outside of the page.

Creeront, your screen scrape/RegEx example would be most appreciated. I've updated my AskMe profile to include my email address at the bottom.

Thanks again.
posted by mezzanayne at 10:03 AM on July 24, 2006


But again, your best and easiest solution is to get the board's underlying database, which already has all the information you want, organized as you want it.

It looks like this bulletin board system doesn't have an underlying database; the messages exist only in the static HTML files.

Since you have access to the original files, you can read the files directly and skip the URL-fetching steps of the screen-scraping process.

Luckily, the format of the files seems pretty simple. The post contents can be found between the startMessage and endMessage comments. The hierarchy is defined by the hidden "parentPost" field. Subject, poster handle, and date are in other hidden fields.
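
Assuming the markup really does look that way (the comment spelling and attribute order below are guesses to verify against a real file), a sketch that reads one saved page directly:

    use strict;
    use warnings;

    my $file = shift @ARGV or die "usage: $0 post.html\n";
    my $html = do { local $/; open my $fh, '<', $file or die "$file: $!"; <$fh> };

    # Post body sits between the startMessage and endMessage comments.
    my ($message) = $html =~ m{<!--\s*startMessage\s*-->(.*?)<!--\s*endMessage\s*-->}s;

    # Hidden fields (parentPost, subject, poster, date) -- attribute order assumed.
    my %field;
    while ( $html =~ m{<input[^>]*type="hidden"[^>]*name="(\w+)"[^>]*value="([^"]*)"}g ) {
        $field{$1} = $2;
    }

    print "parentPost: $field{parentPost}\n" if defined $field{parentPost};
    print "message: $message\n" if defined $message;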
posted by justkevin at 10:04 AM on July 24, 2006


Another possible solution is to write a short REBOL program -- read the Web page using http, parse it, and write it to a file which can be imported. REBOL/View is free for personal, commercial, and educational use.
posted by davcoo at 12:06 PM on July 24, 2006


Is it completely outside the potential scope of the project to replace this forum system with one that's already *based* on a DBMS?

I know that isn't specified in the OP, but my experience with paying clients has taught me always to ask.
posted by baylink at 1:38 PM on July 24, 2006


Response by poster: Baylink, I would love to eventually replace the messageboard engine with something database-driven. The data that can be screen-scraped will be used to archive existing (static) data as well as provide the data foundation for a rebuilt site.

The admin of the messageboard is a friend of mine, so no money is changing hands. I've taken on the project as a personal learning experience.
posted by mezzanayne at 2:12 PM on July 24, 2006


I'd send you my Perl screen-scraping code, but for the best of reasons (to prevent users from removing a forced pause and hammering the site) I obfuscated it, and now can't find the original un-obfuscated copy. Sorry.
posted by orthogonality at 2:45 PM on July 24, 2006


I do this a fair bit for money. Sometimes very good money. But my business skills suck so I don't market myself well/widely.

Core tools that I use:

perl

WWW::Mechanize, plus WWW::Mechanize::Shell to get a basic script up and running fast (very fast). Generate a basic script and tweak it for multiple pages from there.

If there are horrible things going on with the way the web page is structured (some kinds of ASP pages are particularly bad for this), I fall back to LWP::UserAgent and do things much more manually.

After that I use regular expressions and a database ORM (I use DBIx::Class these days because I can basically dump a hash of fields into the db really easily, in about three lines of code).
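
That DBIx::Class step looks roughly like this (the schema class and column names here are hypothetical):

    use strict;
    use warnings;
    use MyBoard::Schema;    # hypothetical DBIx::Class schema class

    my $schema = MyBoard::Schema->connect('dbi:SQLite:dbname=board.db');

    # %dbdata holds the fields captured by regexes like the ones below.
    my %dbdata = ( subject => 'Example subject', submitter => 'someone' );
    $schema->resultset('Post')->create( \%dbdata );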

The key thing you need to know about how to do the regex is:

($dbdata{column_name}) = $data =~ /<[key bit of html]>(.*?)<[another key bit of html or whatever]>/ms; # the ms bit at the end may vary.

Here's a real example:

my ($name, $sector) = $data =~ /<span class="name">(.*?)<\/span>.*?<span class="ltext">(.*?)<\/span>/sm;

lynx -dump -stdin -nolist can be a great time saver as well.

To run this from Perl code, here's an example:

my $cmd = "lynx -dump -stdin -nolist <<eof\n$data\neof";
my $raw_info = `$cmd`; # note those are backticks, which assign the output of $cmd to the string $raw_info

If you want more info, or some tech support, you can contact me via my profile email, but I'm afraid that I'd have to bill you for it (sorry). However my rates are very reasonable.

posted by singingfish at 5:22 PM on July 24, 2006


@mezz: got it.

When you *do* get to that point past where you are now, check out (obligatory plug) WebGUI, which has all the forum stuff already built, as well as a crapload of other stuff.
posted by baylink at 7:38 PM on July 24, 2006

