How can I grab the text (not code) off of a bunch of .htm files?
February 13, 2007 6:25 PM

How can I automatically grab the text (not code) off of a bunch of .htm files?

I have a bunch of .htm files which are based on the same template, and I am looking for a way to grab all the text from these pages and collect it in a text file for a voice actor to read. I could copy each page's text through a browser but I thought there had to be an easier way, as I need to grab the text from over 100 pages. Any advice appreciated!
posted by pantufla to Computers & Internet (16 answers total) 3 users marked this as a favorite
 
Here is a tutorial on stripping HTML tags using regular expressions; it includes VBScript examples. If you Google around you'll find some simple VBScript code that will load the files, and from there all you need to do is wrap it in a loop to process every file.
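If you'd rather skip VBScript and you have access to a Unix-style command line, the same idea can be roughed out with sed. This is only a sketch: it assumes all the .htm files sit in one folder, it won't catch tags that wrap across lines, and it leaves entities like &amp; behind.

for f in *.htm; do
    # crude strip: delete everything from a < to the next >
    sed -e 's/<[^>]*>//g' "$f" >> alltext.txt
done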
posted by saraswati at 6:37 PM on February 13, 2007 [1 favorite]


Response by poster: Thanks, qvtqht. Is there any Perl script I could use without needing much knowledge of Perl to run it?
posted by pantufla at 6:37 PM on February 13, 2007


"Perl."

... real helpful.

http://www.webscrape.com/
posted by ReiToei at 6:38 PM on February 13, 2007


pantufla: Apologies about the terse response.

An easy way for a non-programmer to accomplish this is to use a multi-document text editor such as EditPlus. Open all of the HTML files, then do a global search-and-replace for the common elements you want to remove. To strip out the HTML tags, search with the regular expression <[^>]*> (a <, then anything that isn't a >, then the closing >) and replace each match with nothing.
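If you want to sanity-check that pattern before running it across a hundred files, here is a quick test from any shell that has sed (the sample line is made up):

echo '<p>Hello, <b>world</b>!</p>' | sed -e 's/<[^>]*>//g'
# prints: Hello, world!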

Other options include:

http://www.velocityscape.com/
http://www.iopus.com/imacros/web-testing.htm
http://www.theeasybee.com/
posted by qvtqht at 6:52 PM on February 13, 2007


lynx -dump http://ask.metafilter.com
posted by xiojason at 7:08 PM on February 13, 2007


or, if you make a text file with the URL of each page on a separate line and have access to a bash shell:

cat urls.txt | while read url; do lynx -dump "$url" >> pages.txt; done
posted by xiojason at 7:12 PM on February 13, 2007


Yeah, something like:
lynx -dump -nolist -width=NUMBER [url]

The -nolist flag disables the list of links printed at the bottom, and -width=NUMBER lets you wrap at something other than 80 characters.
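For a local file it might look something like this (the filename, width, and output file here are just placeholders):

lynx -dump -nolist -width=72 page1.htm >> script.txt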
posted by 31d1 at 7:16 PM on February 13, 2007


Yeah, you don't want to try to remove the tags yourself; you want to actually render the page. Use lynx.

Pet peeve: "cat file |" is always unnecessary. You can replace it with a redirection and save invoking /bin/cat at all: while read url; do ... ; done < urls.txt
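Applied to the loop above, that looks like:

while read url; do lynx -dump "$url" >> pages.txt; done < urls.txt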
posted by Rhomboid at 7:55 PM on February 13, 2007


Interesting. Is there some significant downside to using cat? I've used cat and avoided the redirection in an effort to increase readability, linking the read to its input in an easy-to-see left-to-right manner. Otherwise the data source can be so far away from the reader that it seems out of place.
posted by xiojason at 8:12 PM on February 13, 2007


Ah, I see from some further reading that, were I using zsh, it would be truly unnecessary, since zsh supports input redirection before the while. Another reason I should probably start messing about with zsh. Sorry for the continued derail.
posted by xiojason at 8:26 PM on February 13, 2007


I've done very similar things before, and html2text has always worked for me. It's available as a Linux package; I'm sure there are versions for Windows and Mac too. If you have Python you can use the Python version.
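Usage is roughly along these lines (a sketch, assuming the command-line html2text is installed and the .htm files are all in one directory):

for f in *.htm; do
    html2text "$f" >> alltext.txt
done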
posted by lunchbox at 9:24 PM on February 13, 2007


xiojason - You're right, the overhead of invoking cat is minimal, and I suppose it does improve readability if you are used to seeing that idiom. But it's just one of those pet peeves that grates on me a bit every time. I regret derailing the thread over such trivial *nix minutiae.
posted by Rhomboid at 9:58 PM on February 13, 2007


HTMSTRIP: Processes and removes embedded HTML commands from Web pages downloaded from the Web.

http://www.erols.com/waynesof/bruce.htm

It's a command-line tool, and I use it routinely when I want to save just the text portion of study material from the web, which is quite often!

VERY flexible, and wildcards are allowed. It will go through an entire directory/folder really fast, and the results are good.
posted by metaswell at 10:21 PM on February 13, 2007


If you are on a Mac you can just print to a PDF, or use any of the good unix-y tips above. If you are on a Windows machine, you can use PDF Creator to make a PDF.
posted by ejoey at 10:28 PM on February 13, 2007


just load it in a browser, as Rhomboid says, select all, copy, and voilà!
posted by raildr at 10:28 PM on February 13, 2007


Response by poster: thanks for all the great info!
posted by pantufla at 11:49 PM on February 13, 2007

