Automatically downloading text off web pages
January 8, 2007 8:16 PM

There's an ancient website that's in the Wayback Machine, and I need to download a few dozen pages off of it before it disappears forever. I need the material for research purposes and it just doesn't exist anywhere else. I could just print everything out, but I could really use the electronic versions so I can index them. If I had the time (which I will make if I can't find a way to do this programmatically), I could just cut and paste into a text file. But I'd really prefer not to have to resort to that for 60+ pages' worth of stuff. There used to be utilities that would do this (back in Ye Golden Age of Ye Internets) but I have no idea what to use today. I don't need the actual page, just the text on the pages. And again, to make it clear, this is not to steal or bootleg or appropriate, but to use data that will be lost when these pages disappear forever. Thanks for any reasonable suggestions.
posted by micawber to computers & internet (16 comments total)
Both wget and cURL are handy utilities that do this sort of operation; they come in many flavours with shiny GUI wrappers.
posted by docgonzo at 8:20 PM on January 8, 2007


HTTrack (if yer on Windoze) might be the easiest solution. (I've not used it; I'm on OS X and use wget or cURL.)
posted by docgonzo at 8:23 PM on January 8, 2007


I think you can do this with DevonThink if you're on a Mac.
posted by dobbs at 8:28 PM on January 8, 2007


I don't know if you want to do this with 60+ pages, but I think you can archive each web page with Furl.
posted by rikhei at 8:29 PM on January 8, 2007


I use wget for this stuff.
posted by caddis at 8:29 PM on January 8, 2007


Adobe Acrobat Pro will create PDFs from webpages. It does a pretty good job but it's not cheap. If you have any designer friends who owe you a favor this might be a good way for them to work it off.
posted by lekvar at 9:05 PM on January 8, 2007


wget works great and is totally free. To flesh it out a little bit, you'd want "--mirror", which turns on timestamping and recursive retrieval, and "--no-parent", which limits it to retrieving everything at the given URL and below but nothing above that (by default it won't retrieve anything on different hosts either). If you want a browsable copy (i.e. you can just point your browser at the files on your HD), then add "--page-requisites" to retrieve images and stylesheets and "--convert-links" to rewrite links so that links to downloaded files become relative links instead of the original http:// URLs. This will create a mirror structure in the current directory, starting with a directory named after the hostname. If you want just the directory structure without the top-level hostname directory, add "--no-host-directories".

So, using the short form of parameters, "wget -m -np -p -k -nH http://your/URL/here/".
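
Since the question asks for just the text, here is a minimal Python sketch (not part of the original answer) of one way to reduce the mirrored HTML files to plain text using only the standard library. It assumes the saved pages end in .html or .htm and can be read as UTF-8; the names TextExtractor, html_to_text and convert_tree are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_tree(root):
    """Write a .txt file next to every .html/.htm file found under root."""
    written = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            # Assumption: UTF-8; undecodable bytes are replaced, not fatal.
            html = path.read_text(encoding="utf-8", errors="replace")
            out = path.with_suffix(".txt")
            out.write_text(html_to_text(html), encoding="utf-8")
            written.append(out)
    return written
```

Calling convert_tree on the directory that wget created would leave a .txt file beside each saved page. Text inside inline tags ends up on its own line, which is usually fine for indexing but may need cleanup for reading.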

If you want something more polished that isn't a command line program there are tons and tons of shareware/commercial apps that do this. One I've used before that was pretty good was Teleport Pro.
posted by Rhomboid at 10:42 PM on January 8, 2007


Try snissa, or try Google notes.
posted by londongeezer at 11:48 PM on January 8, 2007


If you're using Firefox, I suggest installing the ScrapBook add-on and grabbing the pages with it. You then have a browsable copy of the page(s) that you can open anytime in Firefox, print, etc.
posted by Glow Bucket at 6:09 AM on January 9, 2007


Seconding ScrapBook. I still use wget for various things, but ScrapBook keeps nice archives right in my browser (and I have many sites spidered in ScrapBook from the Wayback Machine that don't otherwise exist anymore).
posted by Cat Pie Hurts at 6:34 AM on January 9, 2007


If it's in the Wayback machine then why would it disappear? I thought that was the whole point of the wbm.
posted by zeoslap at 8:40 AM on January 9, 2007


Stuff can still disappear from the Wayback Machine. I remember reading an article several years ago about the following phenomenon:
  1. an author places a page online at example.com
  2. the Wayback Machine spiders the page, placing a copy online in its archives
  3. some time later, the original author fails to renew the registration on example.com (or decides not to) and it expires
  4. an unrelated third party registers example.com again and puts a completely new and different site online, but this time with a robots.txt that indicates that all or some of the site is not to be indexed
  5. the Wayback Machine again spiders the site on the new domain, sees the new robots.txt, and applies its exclusions to the original content that was archived on the domain under the prior owner, effectively removing the original site completely from the net
The problem is that archive.org needs some way of allowing authors to specify that they do not want their content archived, and that is currently robots.txt. But since domain ownership can change, there is no real way to indicate that "the robots.txt here now does/does not apply to content that was online X years ago." If they only considered the state of robots.txt that was in place at the time of spidering then there would be no way for a webmaster to retroactively remove their own site from the archive, which I think is an important ability, without which they would take a lot more shit.

Anyway, I don't know how prevalent this scenario is, or whether it was happening accidentally or maliciously, but it is one example of how something can disappear from the Wayback Machine.
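
To make step 5 concrete, here is a small Python sketch of how the same URL can go from allowed to blocked once a new owner's robots.txt is applied. It only illustrates the standard robots.txt check using the standard library's urllib.robotparser; it is not archive.org's actual code, and the agent name "example-archiver", the URL, and both robots.txt bodies are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical values, not taken from any real site or crawler.
AGENT = "example-archiver"
URL = "http://example.com/research/page1.html"

# Under the original owner: an empty Disallow line permits everything.
OLD_ROBOTS = "User-agent: *\nDisallow:\n"

# Under the new owner: every path is excluded for every crawler.
NEW_ROBOTS = "User-agent: *\nDisallow: /\n"

if allowed(OLD_ROBOTS, AGENT, URL) and not allowed(NEW_ROBOTS, AGENT, URL):
    print("Same URL, now excluded by the new owner's robots.txt")
```

The check has no notion of who owned the domain when a page was captured, which is why applying the current file to old captures hides the earlier site as well.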
posted by Rhomboid at 9:33 AM on January 9, 2007


Internet Explorer will do this for you if you add the URLs as offline content and then synchronize it.
posted by Four Flavors at 10:31 AM on January 9, 2007


How about (in IE):
File | Print | check the Print to File box
Options | check the Print All Linked Documents box

Domain owners can get their material removed from the Wayback Machine as well as request that their sites not be scanned, so there's no guarantee that the WB Machine will have what you want.

If the material is political, you might try Memory Hole.
posted by KRS at 11:04 AM on January 9, 2007


Catching up on an AskMe backlog...

The free software Warrick will reconstruct a website (creating a local version on your computer) from the Wayback Machine archives.
posted by blag at 9:43 AM on January 11, 2007


(thanks Rhomboid, never knew that)
posted by zeoslap at 12:53 PM on January 12, 2007

