Tips as to UNIX Shell Script to Programmatically Save a Webpage as a Text File
January 30, 2009 4:53 PM
I'm writing a shell script to take a webpage and convert it into a text file. I'd appreciate tips as to how to store URLs, save the file with the URL's title as its name, and also just general tips as to how to improve the script and/or achieve the process better. Specific questions inside.
I'd like to write a shell script which converts a webpage into a text file. After a lot of tinkering with various note-taking applications, Firefox extensions, and so on, I've found that the best tool for me is just plain good old-fashioned text files. However, I'd love it if I could automate the process a little bit more, and so I'm writing a shell script to get a webpage into a text-form equivalent.
Right now, I've got:
links -dump -width 512 "$1" | cut -c 4- > /tmp/temp.file
lynx -listonly -dump "$1" | sed '1,3d' | cut -c 7- >> /tmp/temp.file
edit -b /tmp/temp.file
In this example, $1 is a Web address.
What this does is:
- Uses links to save the text of the page. I use this instead of lynx because the "-width 512" lets it handle the page without inappropriate line breaks, and links seems to handle punctuation spacing better than lynx. (The "cut" removes the extra lefthand margin.)
- Uses lynx to generate the list of links that are on that page, removing the "References" header, margin, and numbering. Links doesn't seem to have any way of recording the URLs when generating a text copy. It appends that to the work in progress.
- Sends this to TextWrangler to open up in the background.
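The steps above can be sketched as two small filters plus a wrapper. This is only an illustration under stated assumptions: the function names (strip_margin, clean_link_list, page_to_text) are made up here, mktemp stands in for the fixed /tmp/temp.file so that two runs cannot overwrite each other, and the filters assume the output layout described above (a three-character left margin from links, a three-line "References" header and six-character numbering from lynx).

```shell
#!/bin/sh
# Sketch only: these function names are hypothetical, not part of links or lynx.

# Drop the three-character left margin that "links -dump" adds.
strip_margin() {
    cut -c 4-
}

# Drop the 3-line "References" header and the 6-character "   1. " numbering
# from "lynx -listonly -dump" output (assumed layout; it may vary by version).
clean_link_list() {
    sed '1,3d' | cut -c 7-
}

# Convert the page at URL $1 to text and print the path of the result.
page_to_text() {
    url=$1
    out=$(mktemp "${TMPDIR:-/tmp}/page.XXXXXX") || return 1
    links -dump -width 512 "$url" | strip_margin > "$out"
    lynx -listonly -dump "$url" | clean_link_list >> "$out"
    printf '%s\n' "$out"
}
```

Called as edit -b "$(page_to_text "$1")", this should behave like the original three lines. The double quotes around "$url" keep characters such as ? and & intact inside the script, but the URL still has to be quoted on the command line, because otherwise the shell treats & as a background operator before the script ever sees it.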
I'm seeking the community's advice on this on a few points:
- The way I've got it working now is okay, but, ideally, I'd like to handle URLs in the way that Mefi's print stylesheet handles it — the URL appearing right after the link text. So in a webpage converted into a text file, instead of it being "Google", it'd be "Google [http://www.google.com]". I'm aware that lynx lets you do footnotes ("[1]Google" and later "1. http://www.google.com"), but lynx's handling of line breaks and spacing isn't great.
- I'd then ideally like to have this script save the results automatically to a text file on my Desktop with the page's title (the contents of its TITLE element) as the name of the file.
- I'm wondering if, given the format, any odd punctuation in the URL could screw up the process.
- Also, I imagine this might be an enjoyable script for others, so I'd appreciate any other modifications that would improve the overall process or end result, as well as pointers to any utilities that do this better than what I'm hacking up.
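For the first two points, one possible approach is to fetch the HTML once, rewrite it so that each link's URL follows its text, and only then hand it to links for rendering. What follows is a minimal sketch under several assumptions: curl is available, href values are double-quoted, link text contains no nested tags, links accepts a local file path, and the helper names (inline_link_urls, extract_title, sanitize_title, save_page) are invented for illustration. Regular expressions are not a real HTML parser, so unusual markup will slip through.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Append " [URL]" after every simple <a href="URL">text</a> element.
# Relative hrefs are copied as-is; they are not resolved against the page URL.
inline_link_urls() {
    sed -E 's#<[Aa] [^>]*[Hh][Rr][Ee][Ff]="([^"]*)"[^>]*>[^<]*</[Aa]>#& [\1]#g'
}

# Print the text of the <title> element from HTML on stdin.
# Newlines are flattened first so a title split across lines still matches.
# If the page has several <title> elements, the greedy match picks the last.
extract_title() {
    tr '\n\r' '  ' |
        sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee][^>]*>\([^<]*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p'
}

# Turn a title into a usable file name: collapse whitespace, trim the ends,
# and replace "/" and ":" (which cause trouble in Mac file names) with "-".
# HTML entities such as &amp; are left undecoded.
sanitize_title() {
    printf '%s' "$1" |
        tr '\n\r\t' '   ' |
        tr -s ' ' |
        sed -e 's/^ *//' -e 's/ *$//' -e 's|[/:]|-|g'
}

# Fetch URL $1 and save it as ~/Desktop/<page title>.txt with inline URLs.
save_page() {
    url=$1
    work=$(mktemp -d "${TMPDIR:-/tmp}/page.XXXXXX") || return 1
    curl -sL "$url" > "$work/raw.html" || return 1
    inline_link_urls < "$work/raw.html" > "$work/page.html"
    name=$(sanitize_title "$(extract_title < "$work/raw.html")")
    [ -n "$name" ] || name=untitled
    # Assumes links renders a local .html file the same way it renders a URL.
    links -dump -width 512 "$work/page.html" | cut -c 4- > "$HOME/Desktop/$name.txt"
    rm -rf "$work"
}
```

With this in a script, save_page "$1" would replace the two-tool pipeline, and a link such as Google would come out as "Google [http://www.google.com]" in the saved text. Because the URL is only ever passed as a quoted argument and never spliced into the sed program, punctuation in it should not affect the rewriting step.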
posted by WCityMike to computers & internet (17 comments total)
posted by devbrain at 5:38 PM on January 30