<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: How to extract data from HTML</title>
	<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML/</link>
	<description>Comments on Ask MetaFilter post How to extract data from HTML</description>
	<pubDate>Wed, 22 Apr 2009 15:33:41 -0800</pubDate>
	<lastBuildDate>Wed, 22 Apr 2009 15:33:41 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: How to extract data from HTML</title>
		<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML</link>	
		<description>How to extract data automatically from a group of HTML pages to an Excel file or to a database? &lt;br /&gt;&lt;br /&gt; I have a large group of old HTML files (around 800) that are more or less organised the same way. I need to extract certain items (the title, the h2 and h3 headers etc.) to store them in a database. The long-term goal is to replace the static pages with dynamic ones, but first I need to know what&apos;s in there without looking at each file, and I&apos;ll have to correct, reassign and rewrite some of the content anyway, so I need to have everything in an easy-to-browse format. I can write a VBA script (for Excel or Access) or a PHP script (for MySQL) but I was wondering if there is a simple, free tool for this (for Windows). If the tool could take the tag type (h2) and the file directory and spit out a CSV file with &quot;filename, tag content&quot; that would be enough for me.</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2009:site.120241</guid>
		<pubDate>Wed, 22 Apr 2009 15:17:35 -0800</pubDate>
		<dc:creator>elgilito</dc:creator>
		
			<category>dataextraction</category>
		
			<category>html</category>
		
			<category>excel</category>
		
			<category>access</category>
		
			<category>database</category>
		
			<category>resolved</category>
		
	</item> <item>
		<title>By: wongcorgi</title>
		<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML#1720766</link>	
		<description>I don&apos;t believe there are any tools this specific, or at least none that will batch process all your files and export to the format that you want.&lt;br&gt;
&lt;br&gt;
Since you already know PHP, look into regular expressions with the &lt;a href=&quot;http://www.php.net/manual/en/function.preg-match.php&quot;&gt;preg_match&lt;/a&gt;/&lt;a href=&quot;http://www.php.net/manual/en/function.preg-match-all.php&quot;&gt;preg_match_all&lt;/a&gt; functions.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2009:site.120241-1720766</guid>
		<pubDate>Wed, 22 Apr 2009 15:33:41 -0800</pubDate>
		<dc:creator>wongcorgi</dc:creator>
	</item><item>
		<title>By: rokusan</title>
		<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML#1720905</link>	
		<description>The Python library &lt;a href=&quot;http://www.crummy.com/software/BeautifulSoup/&quot;&gt;Beautiful Soup&lt;/a&gt; is made for this kind of thing.&lt;br&gt;
&lt;br&gt;
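A minimal sketch of the kind of script this involves, assuming Python 3. To keep it runnable without installing anything, it uses the standard library&apos;s html.parser in place of Beautiful Soup; the names HeadingCollector, extract_headings and write_csv are made up for this example, and the UTF-8 encoding of the input files is an assumption:&lt;br&gt;

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class HeadingCollector(HTMLParser):
    """Collect the text content of every tag whose name is in `wanted`."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.current = None  # name of the tag being captured, if any
        self.parts = []      # text fragments seen inside that tag
        self.found = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Start capturing only when no other wanted tag is already open.
        if self.current is None and tag in self.wanted:
            self.current = tag
            self.parts = []

    def handle_data(self, data):
        if self.current is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self.parts).split())
            self.found.append((tag, text))
            self.current = None


def extract_headings(html, wanted=("title", "h2", "h3")):
    """Return a list of (tag, text) pairs for the wanted tags."""
    parser = HeadingCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


def write_csv(directory, out, wanted=("title", "h2", "h3")):
    """Write one 'filename, tag, content' row per match to `out`."""
    writer = csv.writer(out)
    writer.writerow(["filename", "tag", "content"])
    for path in sorted(Path(directory).glob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        html = path.read_text(encoding="utf-8", errors="replace")
        for tag, text in extract_headings(html, wanted):
            writer.writerow([path.name, tag, text])
```

Calling write_csv with the directory of pages and a file opened with open("headings.csv", "w", newline="", encoding="utf-8") should give one row per title or heading; switching the delimiter to a tab is a one-argument change to csv.writer.&lt;br&gt;
&lt;br&gt;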
My Python skills are poor, but I&apos;m learnin&apos;, and I managed to find it, get it installed, write a script and churn through 4000 HTML pages in a single afternoon. I wanted tab-delimited output, which is pretty close to what you want.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2009:site.120241-1720905</guid>
		<pubDate>Wed, 22 Apr 2009 17:22:11 -0800</pubDate>
		<dc:creator>rokusan</dc:creator>
	</item><item>
		<title>By: signal</title>
		<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML#1721136</link>	
		<description>Yeah, Beautiful Soup is made for this.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2009:site.120241-1721136</guid>
		<pubDate>Wed, 22 Apr 2009 21:19:49 -0800</pubDate>
		<dc:creator>signal</dc:creator>
	</item><item>
		<title>By: elgilito</title>
		<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML#1721329</link>	
		<description>Thanks for the tip about Beautiful Soup. I had trouble figuring out how to install it but I&apos;m running a script now and it&apos;s working well!&lt;br&gt;
(I&apos;ve used regular expressions in PHP, but I do this so rarely that every time I have to spend time figuring them out again).</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2009:site.120241-1721329</guid>
		<pubDate>Thu, 23 Apr 2009 04:47:08 -0800</pubDate>
		<dc:creator>elgilito</dc:creator>
	</item><item>
		<title>By: teabag</title>
		<link>http://ask.metafilter.com/120241/How-to-extract-data-from-HTML#1721354</link>	
		<description>You can also use curl + sed/awk shell scripts, or curl with Perl, or curl with Python.&lt;br&gt;
&lt;br&gt;
That Beautiful Soup thing is interesting, never seen it before. Gonna have to try it out.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2009:site.120241-1721354</guid>
		<pubDate>Thu, 23 Apr 2009 05:38:33 -0800</pubDate>
		<dc:creator>teabag</dc:creator>
	</item>
	</channel>
</rss>
