How to extract data from HTML
April 22, 2009 3:17 PM
How to extract data automatically from a group of HTML pages to an Excel file or to a database?
I have a large group of old HTML files (around 800) that are more or less organised the same way. I need to extract certain items (the title, the h2 and h3 headers, etc.) and store them in a database. The long-term goal is to replace the static pages with dynamic ones, but first I need to know what's in there without looking at each file. I'll have to correct, reassign and rewrite some of the content anyway, so I need to have everything in an easy-to-browse format. I can write a VBA script (for Excel or Access) or a PHP script (for MySQL), but I was wondering whether there is a simple, free tool for this (for Windows). If the tool could take the tag type (h2) and the file directory and spit out a CSV file with "filename, tag content", that would be enough for me.
posted by elgilito to computers & internet (5 answers total) 6 users marked this as a favorite
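Since you already know PHP, look into regular expressions with the preg_match and preg_match_all functions. preg_match stops at the first match, while preg_match_all returns every match, which is what you want for tags that repeat within a file, such as h2 and h3.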
posted by wongcorgi at 3:33 PM on April 22, 2009
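For illustration, here is a minimal sketch of the regular-expression approach suggested above, producing the "filename, tag content" CSV the question asks for. It is written in Python rather than PHP, with re.findall playing the role of preg_match_all. The function names extract_tags and write_csv, the "*.htm*" file pattern, and the UTF-8 encoding are assumptions made for the example, not anything taken from the thread. Regular expressions only cope with reasonably consistent markup, so badly nested or unclosed tags may need a real HTML parser instead.

```python
import csv
import html
import re
from pathlib import Path


def extract_tags(markup, tag):
    """Return the text content of every <tag>...</tag> element in markup."""
    # Hypothetical helper: case-insensitive, tolerates attributes and newlines.
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    results = []
    for inner in pattern.findall(markup):
        text = re.sub(r"<[^>]+>", "", inner)    # drop nested tags such as <a> or <b>
        text = html.unescape(text)              # decode entities such as &amp;
        results.append(" ".join(text.split()))  # collapse runs of whitespace
    return results


def write_csv(directory, tags, out_path):
    """Write one 'filename, tag, content' row per matching element."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "tag", "content"])
        # Assumes the pages end in .htm or .html and are readable as UTF-8.
        for path in sorted(Path(directory).glob("*.htm*")):
            markup = path.read_text(encoding="utf-8", errors="replace")
            for tag in tags:
                for content in extract_tags(markup, tag):
                    writer.writerow([path.name, tag, content])
```

Calling write_csv("pages", ["title", "h2", "h3"], "headers.csv") would, under those assumptions, produce a file that opens directly in Excel or imports into Access or MySQL. The directory and output names here are placeholders.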