Tips as to UNIX Shell Script to Programmatically Save a Webpage as a Text File
January 30, 2009 4:53 PM

I'm writing a shell script to take a webpage and convert it into a text file. I'd appreciate tips on how to handle link URLs, how to save the file with the page's title as its name, and general advice on improving the script and/or achieving the end result in a better way. Specific questions inside.

I'd like to write a shell script which converts a webpage into a text file. After a lot of tinkering with various note-taking applications, Firefox extensions, and so on, I've found that the best tool for me is just plain good old-fashioned text files. However, I'd love it if I could automate the process a little bit more, and so I'm writing a shell script to get a webpage into a text-form equivalent.

Right now, I've got:

links -dump -width 512 "$1" | cut -c 4- > /tmp/temp.file
lynx -listonly -dump "$1" | sed '1,3d' | cut -c 7- >> /tmp/temp.file
edit -b /tmp/temp.file


In this example, $1 is a Web address.

What this does is:
  1. Uses links to save the text of the page. I use this instead of lynx because "-width 512" lets it render the page without inappropriate line breaks, and links seems to handle punctuation spacing better than lynx. (The "cut" removes the extra lefthand margin.)
  2. Uses lynx to generate the list of links on that page, removing the "References" header, margin, and numbering; links doesn't seem to have any way of recording the URLs when generating a text copy. This list is appended to the work in progress.
  3. Sends this to TextWrangler to open up in the background.
I'm seeking the community's advice on this on four points:
  1. The way I've got it working now is okay, but, ideally, I'd like to handle URLs the way Mefi's print stylesheet handles them: the URL appearing right after the link text. So in a webpage converted into a text file, instead of it being "Google", it'd be "Google [http://www.google.com]". I'm aware that lynx lets you do footnotes ("[1]Google" and later "1. http://www.google.com"), but lynx's handling of line breaks and spacing isn't great.
  2. I'd then ideally like to have this script save the results automatically to a text file on my Desktop, with the page's TITLE as the name of the file. (I've sketched roughly what I'm picturing for this part below, after this list.)
  3. I'm wondering if, given the format, any odd punctuation in the URL could screw up the process.
  4. Also, I imagine this might be an enjoyable script for others. If so, any other modifications to the script that would improve the overall process and/or end goal, and/or any utilities that do this better than what I'm hacking up, would be appreciated.
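
To sketch roughly what I'm picturing for point 2 (untested; the title extraction and filename cleanup are surely naive):

url="$1"

# Pull the page title out of the raw HTML to use as the filename.
# (Assumes the <title> tag sits on a single line with no attributes.)
title=$(curl -s "$url" | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' | head -n 1)

# Keep only characters that are safe in a filename.
safe_title=$(printf '%s' "$title" | tr -cd '[:alnum:] ._-')

outfile="$HOME/Desktop/${safe_title:-untitled}.txt"

links -dump -width 512 "$url" | cut -c 4- > "$outfile"
lynx -listonly -dump "$url" | sed '1,3d' | cut -c 7- >> "$outfile"
edit -b "$outfile"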
posted by WCityMike to Computers & Internet (17 answers total) 3 users marked this as a favorite
 
3. Yes. Use Perl.
posted by devbrain at 5:38 PM on January 30, 2009


Response by poster: That first line is now:

links -dump -width 512 "$1" | cut -c 4- | sed 's/^[ \t]*//;s/[ \t]*$//' > /tmp/temp.file

The added sed strips any leading or trailing whitespace.
posted by WCityMike at 5:42 PM on January 30, 2009


Response by poster: I don't know Perl. I just splice junk I find on the Web together.
posted by WCityMike at 5:43 PM on January 30, 2009


Sounds like you're writing a screen-scraper.

My suggestion would be to use Beautiful Soup and write something in Python. Existing Python libraries can easily handle 1 and 2 above.
posted by needled at 5:49 PM on January 30, 2009


Response by poster: I'm afraid I don't know Python either.
posted by WCityMike at 5:51 PM on January 30, 2009


I'm a complete spaz in programming and I wrote a scraper in python that creates email newsletters of a page. Mostly with your 'splice junk from the web' strategy. So, ya know, if I can do it you can too. And unlike a shell script, you can transport it between different OSes. For an amateur coder, the sort of text manipulation you're doing with pipes (#1 + #2) would be easier with some string or regex functions, and stored variables. Separating them out would also improve the script's readability and expandability (#4).

If you might consider learning python, I could email you my scraper. I commented it up pretty well, so it might help you gauge what a beginner can do. I realize it's an annoying suggestion to "learn this other thing and redo what you've done" but man does that shell stuff look hacky.
posted by cowbellemoo at 7:44 PM on January 30, 2009


Shell scripting is great for doing some things quickly, but as you're finding out, once you reach a certain level of complexity, it gets quite messy.

Pick any modern scripting language (perl, python, ruby, etc.) and it'll have great libraries for handling this sort of thing. I'll go ahead and recommend using Ruby and hpricot. The example on the first page of the hpricot wiki gives you an idea of how powerful and easy this can be. There are great tutorials both on that wiki and elsewhere on the web that will have you up and running in no time.
posted by chrisamiller at 9:32 PM on January 30, 2009


I can't argue with chrisamiller's recommendation of hpricot (but I'll assume that any reasonable language and XPath library would make this easy.)

This will fetch a URL, insert links' URLs into the text following the link (doesn't fully qualify relative URLs, though) and print it out.

require 'open-uri'
require 'hpricot'

# Fetch the page and parse it with Hpricot.
doc = open("http://metafilter.com/") { |f| Hpricot(f) }

# Grab the page title (handy if you want to name the output file after it).
title = (doc/"head/title")[0].inner_html
puts title

# Insert each link's href, surrounded by spaces, right after the link text.
(doc/"a").each do |a|
  a.after(' ' + a.get_attribute('href') + ' ') if a.has_attribute?('href')
end

# Print the modified document.
puts doc


Then run that through html2text and you're close to done. (That could be done within the ruby script, of course, and you could write the output to a file named from the string stored in title, above, but you'd want to sanitize it to make sure it's a reasonable filename.)

Ah, how can you keep them programming in Java after they've seen Ruby?
posted by Zed at 10:44 PM on January 30, 2009 [1 favorite]


Go look at THE ASCIINATOR (aka html2text) - example.

Links lists & numbers links with the option "-html-numbered-links 1" passed to it, at least on Ubuntu.
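
Combining that with the dump flags from the question, something like this might do it (untested, and the flag may only exist in the links2 build):

links -dump -width 512 -html-numbered-links 1 "$1"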
posted by Pronoiac at 10:46 PM on January 30, 2009


Note that Pronoiac and I are talking about unrelated html2text programs. The one he brought up is much closer to doing what you want out of the box than anyone else's suggestions.
posted by Zed at 11:04 PM on January 30, 2009


What Zed said. I should have mentioned that, especially after realizing there are multiple variants of links around - elinks & links2, & the original, at least.
posted by Pronoiac at 11:23 PM on January 30, 2009


Haven't seen links -dump before.

Anyway, a couple of options. For getting the urls to show up in the text, see if there's a way to use your own CSS in links. If so, you can specify a stylesheet that uses 'a:after { content: " (" attr(href) ") "; }' to add the url text after each url (although I'm not sure if links does :after or content - you'll have to look into that).

I'd also suggest using curl first. curl downloads a url; instead of the rendered text you'll get the raw HTML. You can use that to match the text between title tags to extract the title. If links doesn't do custom stylesheets, you could insert your stylesheet with sed. Hell, you could strip out <a href= and </a> to get links to show as text. Anyway, download the file with curl, get the title, strip links/insert the stylesheet, dump it into /tmp/$title.txt, and then call links -dump /tmp/$title.txt > $HOME/Desktop/$title.txt
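
In case it helps, a very rough sketch of the stylesheet-injection half (untested, and whether links honors :after/content at all is the open question):

url="$1"
# Fetch the HTML, splice a stylesheet into <head> that prints each link's
# href after its text, then let links render the result.
# (Assumes a lowercase </head> is present in the page.)
curl -s "$url" | sed 's|</head>|<style>a:after { content: " (" attr(href) ") "; }</style></head>|' > /tmp/page.$$.html
links -dump -width 512 /tmp/page.$$.html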

Or something to that effect. Sorry I can't be more precise, I'm drunk.
posted by valadil at 11:40 PM on January 30, 2009


Response by poster: I appreciate the advice thus far. But the ASCIINator doesn't appear to be doing anything that lynx -dump doesn't do ... it doesn't take care of line breaks like links -dump -width 512 does.

As for learning a language in order to program this ... I just don't want to invest that much time into it.
posted by WCityMike at 12:15 AM on January 31, 2009


Response by poster: I actually got part of the way on #2 via an AppleScript, but "class pTit" is broken in Firefox 3 (a regression from FF2, where it worked).
posted by WCityMike at 12:39 AM on January 31, 2009


My version of lynx, at least, has a -width flag, which might allow you to use only one invocation instead of two. I don't know of a very good way to get the URLs in-line, though.
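
For instance, a single call like this (untested; some lynx builds spell the flag -width=512) dumps both the body text and the numbered References list in one go:

lynx -dump -width=512 "$1" > /tmp/temp.file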
posted by hattifattener at 12:41 AM on January 31, 2009


Are you judging by the example, or by looking at the output? If you're looking at the output, you should configure it first - change LINKS_EACH_PARAGRAPH = 0 to 1 & BODY_WIDTH = 78 to 0.
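
If you'd rather script those edits than make them by hand, a sed one-liner along these lines should do it (assuming the script is named html2text.py and the defaults haven't changed):

sed -i.bak -e 's/^LINKS_EACH_PARAGRAPH = 0/LINKS_EACH_PARAGRAPH = 1/' -e 's/^BODY_WIDTH = 78/BODY_WIDTH = 0/' html2text.py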
posted by Pronoiac at 12:55 AM on January 31, 2009


Response by poster: Pronoiac: "Are you judging by the example, or by looking at the output? If you're looking at the output, you should configure it first - change LINKS_EACH_PARAGRAPH = 0 to 1 & BODY_WIDTH = 78 to 0."

Thanks; I hadn't thought to look at the code for configuration. I think I may very well use the ASCIINator then. Thanks.
posted by WCityMike at 12:37 PM on February 1, 2009


This thread is closed to new comments.