How can I grab the text (not code) off of a bunch of .htm files?
February 13, 2007 6:25 PM
I have a bunch of .htm files which are based on the same template, and I am looking for a way to grab all the text from these pages and collect it in a text file for a voice actor to read. I could copy each page's text through a browser but I thought there had to be an easier way, as I need to grab the text from over 100 pages. Any advice appreciated!
Response by poster: Thanks qvtqht. Is there any Perl script I could use that I wouldn't need much knowledge of Perl to run?
posted by pantufla at 6:37 PM on February 13, 2007
"Perl."
... real helpful.
http://www.webscrape.com/
posted by ReiToei at 6:38 PM on February 13, 2007
pantufla: Apologies about the terse response.
An easy way for a non-programmer to accomplish this is with a multi-document text editor such as EditPlus. Open all of the HTML files, then do a global search-and-replace for the common elements you want to remove. You can use a regular expression to strip out all HTML tags by searching for "<[^>]+>" (without the quotes) and replacing it with nothing.
Other options include:
http://www.velocityscape.com/
http://www.iopus.com/imacros/web-testing.htm
http://www.theeasybee.com/
posted by qvtqht at 6:52 PM on February 13, 2007
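A rough command-line equivalent of that search-and-replace, for anyone with sed handy (a sketch only; the file names are placeholders, and a simple tag-stripping pattern like this won't cope with tags split across lines, scripts, or comments):
sed 's/<[^>]*>//g' page.htm > page.txt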
Or, if you made a text file with the URL of each page on a separate line and have access to a bash shell:
cat urls.txt | while read url; do lynx -dump "$url" >> pages.txt; done
posted by xiojason at 7:12 PM on February 13, 2007
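Since the pages in question are local .htm files rather than URLs, a similar loop over the directory should also work (a sketch; lynx renders local files the same way it renders fetched pages, and pages.txt is just an example output name):
for f in *.htm; do lynx -dump "$f" >> pages.txt; done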
Yeah, something like:
lynx -dump -nolist -width=NUMBER [url]
The -nolist flag disables the list of links it prints at the bottom, and -width=NUMBER lets you wrap on something other than 80 characters.
posted by 31d1 at 7:16 PM on February 13, 2007
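Combining those flags with the loop above might look like this (a sketch; the width of 72 is just an example value):
for f in *.htm; do lynx -dump -nolist -width=72 "$f" >> pages.txt; done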
Yeah, you don't want to try to remove tags yourself, you want to actually render the page. Use lynx.
Pet peeve: "cat file |" is always unnecessary. You can replace it with redirection and save having to actually invoke /bin/cat: while read url ; do ... ; done < urls.txt
posted by Rhomboid at 7:55 PM on February 13, 2007
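Spelled out with the lynx command from earlier, the redirected version would be something like:
while read url; do lynx -dump "$url" >> pages.txt; done < urls.txt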
Interesting. Is there some significant downside to using cat? I've used cat and avoided the redirection in an effort to increase readability, linking the read to its input in an easy-to-see left-to-right manner. Otherwise the data source can be so far away from the reader that it seems out of place.
posted by xiojason at 8:12 PM on February 13, 2007
Ah, I see from some further reading that, were I using zsh, it would be truly unnecessary, since zsh supports input redirection before the while. Another reason I should probably start messing about with zsh. Sorry for the continued derail.
posted by xiojason at 8:26 PM on February 13, 2007
I've done very similar things before, and html2text has always worked for me. It's available as a Linux package; I'm sure there are versions for Windows and Mac too. If you have Python you can use the Python version.
posted by lunchbox at 9:24 PM on February 13, 2007
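A minimal sketch of running it over a whole folder of files (assuming the html2text command is installed and on the PATH; option names vary between versions, so none are used here, and alltext.txt is just an example output name):
for f in *.htm; do html2text "$f" >> alltext.txt; done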
xiojason - You're right, the actual wastefulness of invoking cat is very minimal, and I suppose it does improve readability if you are used to seeing that idiom. But it's just one of those pet peeves that grates on me a little every time. I regret derailing the thread for such trivial *nix minutiae now.
posted by Rhomboid at 9:58 PM on February 13, 2007
HTMSTRIP: Processes Web pages downloaded from the Web and removes the embedded HTML commands.
http://www.erols.com/waynesof/bruce.htm
It's a command-line tool, and I use it routinely when I want to save just the text portions of study material from the web, which is quite often!
VERY flexible, and wildcards are allowed. It will go through an entire directory/folder really fast, and the results are good.
posted by metaswell at 10:21 PM on February 13, 2007
If you are on a Mac you can just print to a PDF, or use any of the good unix-y tips above. If you are on a Windows machine, you can use PDF Creator to make a PDF.
posted by ejoey at 10:28 PM on February 13, 2007
Just load it in a browser, as Rhomboid says, select all, copy, and voilà!
posted by raildr at 10:28 PM on February 13, 2007
Response by poster: thanks for all the great info!
posted by pantufla at 11:49 PM on February 13, 2007
This thread is closed to new comments.