How to extract data from HTML
April 22, 2009 3:17 PM Subscribe
How to extract data automatically from a group of HTML pages to an Excel file or to a database?
I have a large group of old HTML files (around 800) that are more or less organised the same way. I need to extract certain items (the title, the h2 and h3 headers etc.) to store them in a database. The long-term goal is to replace the static pages with dynamic ones, but first I need to know what's in there without looking at each file, and I'll have to correct, reassign and rewrite some of the content anyway, so I need to have everything in an easy-to-browse format. I can write a VBA script (for Excel or Access) or a PHP script (for MySQL), but I was wondering if there is a simple, free tool for this (for Windows). If the tool could take the tag type (h2) and the file directory and spit out a CSV file with "filename, tag content", that would be enough for me.
Best answer: The Python library Beautiful Soup is made for this kind of thing.
My Python skills are poor, but I'm learnin', and I managed to find it, get it installed, write a script and churn through 4000 HTML pages in a single afternoon. I wanted tab-delimited output, which is pretty close to what you want.
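For anyone landing here later, a script along these lines might look roughly like the sketch below. It is not the script described above, just a minimal illustration under some assumptions: it uses the current `bs4` package (`pip install beautifulsoup4`) with Python's built-in `html.parser`, writes CSV rather than tab-delimited output, and the names `extract_rows`, `write_csv` and `extract_tags.py` are made up for the example.

```python
import csv
import sys
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

TAGS = ["title", "h2", "h3"]  # the tags named in the question


def extract_rows(html, filename, tags=TAGS):
    """Return (filename, tag, text) tuples for each matching tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for element in soup.find_all(tags):
        # Join nested text pieces with a space and trim surrounding whitespace.
        text = element.get_text(" ", strip=True)
        rows.append((filename, element.name, text))
    return rows


def write_csv(directory, out_path):
    """Walk the directory for .htm/.html files and write one CSV row per tag found."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "tag", "content"])
        for path in sorted(Path(directory).rglob("*.htm*")):
            # Pass raw bytes so Beautiful Soup can guess each old file's encoding.
            writer.writerows(extract_rows(path.read_bytes(), path.name))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python extract_tags.py C:\old_site headers.csv
    write_csv(sys.argv[1], sys.argv[2])
```

The resulting file opens directly in Excel; to get tab-delimited output instead, `csv.writer(handle, delimiter="\t")` should do it.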
posted by rokusan at 5:22 PM on April 22, 2009
Response by poster: Thanks for the tip about Beautiful Soup. I had trouble figuring out how to install it but I'm running a script now and it's working well!
(I've used regular expressions in PHP, but I do this so rarely that every time I have to spend time figuring them out again.)
posted by elgilito at 4:47 AM on April 23, 2009
You can also use curl with sed/awk shell scripts, or curl with Perl or Python.
That Beautiful Soup thing is interesting, never seen it before. Gonna have to try it out.
posted by teabag at 5:38 AM on April 23, 2009
Since you already know PHP, look into the regular expressions with the preg_match/preg_match_all functions.
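To show the general shape of the regex approach, here is a rough sketch written in Python rather than PHP, where `re.findall` plays roughly the role of `preg_match_all`. The pattern and the `find_h2` helper are illustrative assumptions, and a regex like this will only cope with reasonably regular markup (no nested h2 tags, for instance).

```python
import re

# Matches <h2 ...>...</h2>, ignoring case and allowing line breaks inside the tag body.
H2_PATTERN = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def find_h2(html):
    """Return the text of each h2, with inner tags removed and whitespace collapsed."""
    results = []
    for inner in H2_PATTERN.findall(html):
        text = re.sub(r"<[^>]+>", "", inner)   # drop nested tags such as <b> or <a>
        results.append(" ".join(text.split()))  # collapse runs of whitespace
    return results
```

The same pattern should carry over to `preg_match_all` with the `i` and `s` modifiers.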
posted by wongcorgi at 3:33 PM on April 22, 2009