How to extract data from HTML
April 22, 2009 3:17 PM

How to extract data automatically from a group of HTML pages to an Excel file or to a database?

I have a large group of old HTML files (around 800) that are more or less organised the same way. I need to extract certain items (the title, the h2 and h3 headers, etc.) and store them in a database. The long-term goal is to replace the static pages with dynamic ones, but first I need to know what's in there without looking at each file, and I'll have to correct, reassign and rewrite some of the content anyway, so I need to have everything in an easy-to-browse format. I can write a VBA script (for Excel or Access) or a PHP script (for MySQL), but I was wondering if there is a simple, free tool for this (for Windows). If the tool could take the tag type (h2) and the file directory and spit out a CSV file with "filename, tag content", that would be enough for me.
posted by elgilito to Computers & Internet (5 answers total)
 
I don't believe there are any tools this specific, or at least none that will batch process all your files and export to the format that you want.

Since you already know PHP, look into regular expressions with the preg_match/preg_match_all functions.
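
For illustration, here is a rough sketch of the regular-expression approach. It is written in Python with `re.findall` rather than PHP's preg_match_all, so treat it as an analogue of the suggestion, not the PHP code itself. The helper name `extract_tag_contents` and the sample string are made up for this example, and a regex like this only copes with fairly regular markup.

```python
import re


# Hypothetical helper: return the text inside every <tag>...</tag> pair.
def extract_tag_contents(html, tag):
    # Non-greedy match between the opening and closing tag. DOTALL lets the
    # content span several lines; IGNORECASE accepts <H2> as well as <h2>.
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    results = []
    for inner in pattern.findall(html):
        # Drop nested markup such as <b> or <a>, then collapse whitespace.
        text = re.sub(r"<[^>]+>", "", inner)
        results.append(" ".join(text.split()))
    return results


# Made-up sample input, not taken from the poster's files.
sample = "<h2>First <b>heading</b></h2><p>text</p><H2 class='x'>Second\nheading</H2>"
print(extract_tag_contents(sample, "h2"))
```

This is likely to break on badly nested or unclosed tags, which is the usual argument for a real HTML parser.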
posted by wongcorgi at 3:33 PM on April 22, 2009


Best answer: The Python library Beautiful Soup is made for this kind of thing.

My Python skills are poor, but I'm learnin', and I managed to find it, get it installed, write a script and churn through 4000 HTML pages in a single afternoon. I wanted tab-delimited output, which is pretty close to what you want.
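
A minimal sketch of what such a script might look like, not the script described above. It assumes the current `beautifulsoup4` package (imported as `bs4`), that the files end in `.html` and sit in one folder, and that they are readable as UTF-8. The function name `extract_tags` and its parameters are hypothetical.

```python
import csv
import glob
import os

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_tags(directory, tags, out_path):
    """Write one 'filename, tag, content' row per matching tag; return the row count."""
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "tag", "content"])
        # Assumption: every page is a *.html file directly inside `directory`.
        for path in sorted(glob.glob(os.path.join(directory, "*.html"))):
            # errors="replace" keeps going when an old file contains odd bytes.
            with open(path, encoding="utf-8", errors="replace") as fh:
                soup = BeautifulSoup(fh, "html.parser")
            # find_all accepts a list of tag names and returns matches in document order.
            for element in soup.find_all(tags):
                text = element.get_text(" ", strip=True)
                writer.writerow([os.path.basename(path), element.name, text])
                rows += 1
    return rows
```

Calling `extract_tags("pages", ["title", "h2", "h3"], "headers.csv")` (both paths are placeholders) should produce a CSV that opens directly in Excel. For tab-delimited output, passing `delimiter="\t"` to `csv.writer` should be enough.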
posted by rokusan at 5:22 PM on April 22, 2009


Yeah, Beautiful Soup is made for this.
posted by signal at 9:19 PM on April 22, 2009


Response by poster: Thanks for the tip about Beautiful Soup. I had trouble figuring out how to install it but I'm running a script now and it's working well!
(I've used regular expressions in PHP, but I only do this very occasionally, so every time I have to spend time figuring them out again.)
posted by elgilito at 4:47 AM on April 23, 2009


You can also use curl + sed/awk shell scripts, or curl with Perl, or curl with Python.

That Beautiful Soup thing is interesting, never seen it before. Gonna have to try it out.
posted by teabag at 5:38 AM on April 23, 2009

