<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: How can I grab the text (not code) off of a bunch of .htm files?</title>
	<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files/</link>
	<description>Comments on Ask MetaFilter post How can I grab the text (not code) off of a bunch of .htm files?</description>
	<pubDate>Tue, 13 Feb 2007 18:37:07 -0800</pubDate>
	<lastBuildDate>Tue, 13 Feb 2007 18:37:07 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: How can I grab the text (not code) off of a bunch of .htm files?</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files</link>	
		<description>How can I automatically grab the text (not code) off of a bunch of .htm files? &lt;br /&gt;&lt;br /&gt; I have a bunch of .htm files which are based on the same template, and I am looking for a way to grab all the text from these pages and collect it in a text file for a voice actor to read. I could copy each page&apos;s text through a browser but I thought there had to be an easier way, as I need to grab the text from over 100 pages. Any advice appreciated!</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2007:site.56962</guid>
		<pubDate>Tue, 13 Feb 2007 18:25:36 -0800</pubDate>
		<dc:creator>pantufla</dc:creator>
		
			<category>html</category>
		
			<category>text</category>
		
			<category>script</category>
		
			<category>web</category>
		
			<category>batch</category>
		
	</item> <item>
		<title>By: saraswati</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856282</link>	
		<description>&lt;a href=&quot;http://www.4guysfromrolla.com/webtech/042501-1.shtml&quot;&gt;Here&lt;/a&gt; is a tutorial on stripping HTML tags using regular expressions. It includes VBScript examples. If you google around you&apos;ll find some simple VBScript code that will load in the files and all you need to do from there is enclose it in a loop to do all of the files.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856282</guid>
		<pubDate>Tue, 13 Feb 2007 18:37:07 -0800</pubDate>
		<dc:creator>saraswati</dc:creator>
	</item><item>
		<title>By: pantufla</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856284</link>	
		<description>Thanks qvtqht. Is there any Perl script I could use that I wouldn&apos;t need much knowledge of Perl to run?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856284</guid>
		<pubDate>Tue, 13 Feb 2007 18:37:32 -0800</pubDate>
		<dc:creator>pantufla</dc:creator>
	</item><item>
		<title>By: ReiToei</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856285</link>	
		<description>&lt;em&gt;&quot;Perl.&quot;&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
... real helpful.&lt;br&gt;
&lt;br&gt;
http://www.webscrape.com/</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856285</guid>
		<pubDate>Tue, 13 Feb 2007 18:38:20 -0800</pubDate>
		<dc:creator>ReiToei</dc:creator>
	</item><item>
		<title>By: qvtqht</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856303</link>	
		<description>pantufla: Apologies about the terse response.&lt;br&gt;
&lt;br&gt;
An easy way a non-programmer could accomplish this is by using a multi-document text editor such as &lt;a href=&quot;http://www.editplus.com/&quot;&gt;EditPlus&lt;/a&gt;. Open all of the HTML files, and then do a global search-and-replace for the common elements you want to remove. You can use regular expressions to strip out all HTML tags by searching for &quot;&amp;lt;[^&amp;gt;]+&amp;gt;&quot; (without the quotes) and replacing with nothing.&lt;br&gt;
&lt;br&gt;
Other options include:&lt;br&gt;
&lt;br&gt;
&lt;a href=&quot;http://www.velocityscape.com/&quot;&gt;http://www.velocityscape.com/&lt;/a&gt;&lt;br&gt;
&lt;a href=&quot;http://www.iopus.com/imacros/web-testing.htm&quot;&gt;http://www.iopus.com/imacros/web-testing.htm&lt;/a&gt;&lt;br&gt;
&lt;a href=&quot;http://www.theeasybee.com/&quot;&gt;http://www.theeasybee.com/&lt;/a&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856303</guid>
		<pubDate>Tue, 13 Feb 2007 18:52:01 -0800</pubDate>
		<dc:creator>qvtqht</dc:creator>
	</item><item>
		<title>By: xiojason</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856320</link>	
		<description>&lt;code&gt;lynx -dump &lt;i&gt;http://ask.metafilter.com&lt;/i&gt;&lt;/code&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856320</guid>
		<pubDate>Tue, 13 Feb 2007 19:08:32 -0800</pubDate>
		<dc:creator>xiojason</dc:creator>
	</item><item>
		<title>By: xiojason</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856322</link>	
		<description>or, if you made a textfile with the URL of each page on a separate line and have access to a bash shell:&lt;br&gt;
&lt;br&gt;
&lt;code&gt;cat &lt;i&gt;urls.txt&lt;/i&gt; | while read url; do lynx -dump &quot;$url&quot; &amp;gt;&amp;gt; &lt;i&gt;pages.txt&lt;/i&gt;; done&lt;/code&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856322</guid>
		<pubDate>Tue, 13 Feb 2007 19:12:11 -0800</pubDate>
		<dc:creator>xiojason</dc:creator>
	</item><item>
		<title>By: 31d1</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856330</link>	
		<description>Yeah, something like:&lt;br&gt;
&lt;code&gt;lynx -dump -nolist -width=NUMBER [url]&lt;/code&gt; &lt;br&gt;
&lt;br&gt;
The &lt;code&gt;-nolist&lt;/code&gt; flag disables the list of links it prints at the bottom, and &lt;code&gt;-width=NUMBER&lt;/code&gt; lets you wrap on something other than 80 characters.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856330</guid>
		<pubDate>Tue, 13 Feb 2007 19:16:45 -0800</pubDate>
		<dc:creator>31d1</dc:creator>
	</item><item>
		<title>By: Rhomboid</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856366</link>	
		<description>Yeah, you don&apos;t want to try to remove tags yourself, you want to actually render the page.  Use lynx.&lt;br&gt;
&lt;br&gt;
&lt;small&gt;Pet peeve: &quot;cat file |&quot; is always unnecessary.  You can replace it with redirection and save having to actually invoke /bin/cat: &quot;while read url ; do ... ; done &amp;lt;urls.txt&quot;&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856366</guid>
		<pubDate>Tue, 13 Feb 2007 19:55:01 -0800</pubDate>
		<dc:creator>Rhomboid</dc:creator>
	</item><item>
		<title>By: xiojason</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856381</link>	
		<description>&lt;small&gt;Interesting. Is there some significant downside to using cat? I&apos;ve used cat and avoided the redirection in an effort to increase readability, linking the read to its input in an easy-to-see left-to-right manner. Otherwise the data source can be so far away from the reader that it seems out of place.&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856381</guid>
		<pubDate>Tue, 13 Feb 2007 20:12:35 -0800</pubDate>
		<dc:creator>xiojason</dc:creator>
	</item><item>
		<title>By: xiojason</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856394</link>	
		<description>&lt;small&gt;Ah, I see from some further reading that, were I using zsh, it would be truly unnecessary, since zsh supports input redirection before the while. Another reason I should probably start messing about with zsh. Sorry for the continued derail.&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856394</guid>
		<pubDate>Tue, 13 Feb 2007 20:26:46 -0800</pubDate>
		<dc:creator>xiojason</dc:creator>
	</item><item>
		<title>By: lunchbox</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856444</link>	
		<description>I&apos;ve done very similar things before, and html2text has always worked for me. It&apos;s available as a Linux package; I&apos;m sure there are versions for Windows and Mac too. If you have Python you can use the &lt;a href=&quot;http://www.aaronsw.com/2002/html2text/&quot;&gt;Python version&lt;/a&gt;.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856444</guid>
		<pubDate>Tue, 13 Feb 2007 21:24:17 -0800</pubDate>
		<dc:creator>lunchbox</dc:creator>
	</item><item>
		<title>By: Rhomboid</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856475</link>	
		<description>&lt;small&gt;xiojason - You&apos;re right, the actual wastefulness of invoking cat is minimal and I suppose it does improve readability if you are used to seeing that idiom.  But it&apos;s just one of those pet peeves that grate on me a little every time.  I regret derailing the thread for such trivial *nix minutiae now.&lt;/small&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856475</guid>
		<pubDate>Tue, 13 Feb 2007 21:58:54 -0800</pubDate>
		<dc:creator>Rhomboid</dc:creator>
	</item><item>
		<title>By: metaswell</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856487</link>	
		<description>HTMSTRIP: Processes and removes embedded HTML commands from Web pages downloaded from the Web.&lt;br&gt;
&lt;br&gt;
http://www.erols.com/waynesof/bruce.htm&lt;br&gt;
&lt;br&gt;
It&apos;s a command-line tool, and I use it routinely when I want to save just the text portions of study material from the web, which is quite often!&lt;br&gt;
&lt;br&gt;
VERY flexible, and wildcards are allowed. It will go through an entire directory/folder really fast, and the results are good.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856487</guid>
		<pubDate>Tue, 13 Feb 2007 22:21:24 -0800</pubDate>
		<dc:creator>metaswell</dc:creator>
	</item><item>
		<title>By: ejoey</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856494</link>	
		<description>If you are on a Mac you can just print to a PDF, or do any of the good unix-y tips above as well.  If you are on a Windows machine, you can use &lt;a href=&quot;http://sourceforge.net/projects/pdfcreator/&quot;&gt;PDF Creator&lt;/a&gt; to make a PDF as well.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856494</guid>
		<pubDate>Tue, 13 Feb 2007 22:28:44 -0800</pubDate>
		<dc:creator>ejoey</dc:creator>
	</item><item>
		<title>By: raildr</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856495</link>	
		<description>Just load it in a browser, as Rhomboid says, select all, copy, and voilà!</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856495</guid>
		<pubDate>Tue, 13 Feb 2007 22:28:49 -0800</pubDate>
		<dc:creator>raildr</dc:creator>
	</item><item>
		<title>By: pantufla</title>
		<link>http://ask.metafilter.com/56962/How-can-I-grab-the-text-not-code-off-of-a-bunch-of-htm-files#856556</link>	
		<description>thanks for all the great info!</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.56962-856556</guid>
		<pubDate>Tue, 13 Feb 2007 23:49:59 -0800</pubDate>
		<dc:creator>pantufla</dc:creator>
	</item>
	</channel>
</rss>
