How can I download and archive a blog or website in plain text?
August 11, 2014 9:15 AM

There are some blogs and websites I would like to read in plain text or even HTML. I just don't want to click through hundreds of pages, and I definitely don't want to download them one by one.

Some of them also disappear forever after a while, and I am not going to save a hundred screenshots to Evernote. I've found a couple of tools that will convert the first few pages of a website into a Kindle-friendly format, but no more than that. This seems like it'd be straightforward, but I can't find anything that does it. Paid software or services are fine.
posted by ziggly to Computers & Internet (4 answers total)
 
I believe you can use lynx to do this. The -dump option writes the formatted output of a web page to standard output, which you can redirect to a file. Specifically, check out the -crawl and -traversal options. The man page for -traversal says:
traverse all http links derived from startfile. When used with -crawl, each link that begins with the same string as startfile is output to a file, intended for indexing. See CRAWL.announce for more information.
Note that a lot of web pages these days serve only skeleton HTML and then load the rest with JavaScript/AJAX-ish calls, and lynx doesn't run JavaScript.
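
If lynx isn't available, the same idea (follow only links that begin with the same string as the start URL, and save each page as plain text) can be roughly sketched with the Python standard library. This is not how lynx does it, just an illustration. The names TextAndLinks, parse_page and crawl are made up for this example, and like lynx it does not run JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class TextAndLinks(HTMLParser):
    """Collect visible text and href targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text, self.links, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1          # ignore text inside these tags
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def parse_page(html, base_url, prefix):
    """Return (plain_text, links that start with prefix) for one page."""
    parser = TextAndLinks()
    parser.feed(html)
    links = []
    for href in parser.links:
        url = urldefrag(urljoin(base_url, href))[0]  # absolute URL, no #fragment
        if url.startswith(prefix) and url not in links:
            links.append(url)
    return "\n".join(parser.text), links


def crawl(start_url, max_pages=50):
    """Breadth-first crawl of pages under start_url; returns {url: text}."""
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        with urlopen(url) as resp:   # assumption: pages are UTF-8 HTML
            html = resp.read().decode("utf-8", errors="replace")
        text, links = parse_page(html, url, start_url)
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl() with the blog's front-page URL would return a dictionary of page text that you could then write out to files. A real run would also want error handling, a delay between requests, and a check of the site's robots.txt, none of which this sketch includes.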
posted by straw at 9:23 AM on August 11, 2014 [1 favorite]


Archive-It (built by the same people who maintain the Wayback Machine, I believe) does exactly what you described:

Archive-It enables you to capture, manage and search collections of digital content without any technical expertise or hosting facilities. Visit Archive-It to build and browse the collections.
posted by rada at 9:33 AM on August 11, 2014 [2 favorites]


This page could be helpful: Go from knowing nothing to scraping Web pages.
posted by soelo at 9:49 AM on August 11, 2014 [2 favorites]


For low-traffic blogs that publish an RSS feed and that I know I want to read, I create an IFTTT recipe in the Instapaper channel to save everything there. I believe you can also do keyword matching with that recipe to constrain what gets saved.
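
To give a rough idea of what keyword matching against a feed amounts to, here is a small local sketch in Python. It does not use IFTTT or Instapaper at all. It only parses an RSS 2.0 document you have already downloaded, and the function name matching_items is made up for this example. Atom feeds use different element names and would need separate handling.

```python
import xml.etree.ElementTree as ET


def matching_items(rss_xml, keywords=()):
    """Return (title, link) pairs from an RSS 2.0 document.

    With keywords given, keep only items whose title or description
    contains at least one of them (case-insensitive).
    """
    root = ET.fromstring(rss_xml)
    wanted = [k.lower() for k in keywords]
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        haystack = (title + " " + description).lower()
        if not wanted or any(k in haystack for k in wanted):
            results.append((title, link))
    return results
```

The returned links are the pages you would then save, by whatever means you prefer.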
posted by These Premises Are Alarmed at 10:36 AM on August 11, 2014 [1 favorite]

