Databasing static data
July 24, 2006 9:27 AM
How can I harvest data from static HTML pages?
I've taken on a project for a messageboard administrator who has asked me to create a database containing the contents of the individual posts of the messageboard. Unfortunately each post is a static HTML page generated by a Perl CGI script. The URL of the post encodes its posted date and its place in the thread hierarchy. For example, a top-level post submitted on 24 Jul 2006 would look something like this:
(domain/date/threadID/postID)
domain.com/2006-Jul-24/86398/86398.html
The threadID and postID (in this case, the '86398' and '86398.html') will always contain the same number if the post is a top-level post (the first in a thread). A reply to the above-listed example would have a URL similar to this:
domain.com/2006-Jul-24/86398/86403.html
Notice that the threadID (86398) is the same as the top-level post but the postID (86403) is just the next consecutive postID for the messageboard. In this example 4 posts have been submitted elsewhere on the messageboard in between the original parent post (postID 86398) and this example reply (postID 86403).
Additional tiers use a similar threadID/postID convention but I do not currently know how the system determines the hierarchy level of the post. Right now I am going to treat all non-parent posts the same until I can get more information from the admin.
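The URL convention described above can be turned into database fields before any HTML is read. The following is a minimal sketch in Python, not part of the board's own scripts; the function name parse_post_path and the returned field names are made up for illustration, and it assumes every post path ends in date/threadID/postID.html exactly as in the examples.

```python
import re
from datetime import date

# Month abbreviations as they appear in the example URLs (e.g. "2006-Jul-24").
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches the tail of a post URL: date/threadID/postID.html
POST_PATH = re.compile(
    r"(\d{4})-([A-Za-z]{3})-(\d{1,2})/(\d+)/(\d+)\.html$")


def parse_post_path(path):
    """Return a dict of fields taken from a post URL or file path.

    Returns None when the path does not follow the assumed convention.
    The name and the returned keys are hypothetical.
    """
    match = POST_PATH.search(path)
    if match is None:
        return None
    year, month_name, day, thread_id, post_id = match.groups()
    month = MONTHS.get(month_name.capitalize())
    if month is None:
        return None
    return {
        "posted": date(int(year), month, int(day)),
        "thread_id": int(thread_id),
        "post_id": int(post_id),
        # A top-level post has the same number for threadID and postID.
        "is_top_level": thread_id == post_id,
    }
```

Every post where is_top_level is false is treated the same here, matching the plan to handle all non-parent posts alike until the deeper hierarchy is understood.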
A live example of this messageboard engine can be found here. Feel free to post if you need to; it's just a test board. No login required.
How can I glean the post elements (subject, submitter, and message) from these HTML pages? I do have FTP access to all HTML pages and scripts. The website runs on an Apache server. Any leads on how to even begin would be helpful.
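Because every page is produced by the same script, the subject, submitter, and message should sit at the same place in the markup of each page, so one possible starting point is a small set of patterns applied to each file after downloading the pages over FTP. The sketch below is in Python and rests on guesses: the real markup is not shown in this post, so all three patterns in FIELD_PATTERNS are hypothetical placeholders to be replaced after viewing the source of an actual post page, and the function name extract_post is made up.

```python
import html
import re

# HYPOTHETICAL patterns: the board's real markup is unknown. Replace each
# one with whatever surrounds the field in an actual post page.
FIELD_PATTERNS = {
    "subject": re.compile(r"<title>(.*?)</title>", re.I | re.S),
    "submitter": re.compile(r"Posted by\s*<b>(.*?)</b>", re.I | re.S),
    "message": re.compile(
        r"<!--\s*message\s*-->(.*?)<!--\s*/message\s*-->", re.I | re.S),
}

# Used to drop any tags left inside a captured field.
TAG = re.compile(r"<[^>]+>")


def extract_post(page_html):
    """Return subject, submitter, and message as plain text.

    A field is None when its pattern does not match the page.
    """
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(page_html)
        if match is None:
            fields[name] = None
            continue
        # Replace tags with spaces, decode entities, collapse whitespace.
        text = TAG.sub(" ", match.group(1))
        fields[name] = " ".join(html.unescape(text).split())
    return fields
```

The returned dict, together with the date, threadID, and postID taken from the URL, would make up one database row per post. If the markup turns out to be irregular, a real HTML parser is likely a safer choice than patterns like these.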
posted by mezzanayne to computers & internet (13 comments total)
posted by jellicle at 9:41 AM on July 24, 2006