How can I download and archive a blog or website in plain text?
August 11, 2014 9:15 AM
There are some blogs and websites I would like to read in plain text or even HTML; I just don't want to click through hundreds of pages, and I definitely don't want to download them one by one. Some of them also disappear forever after a while, and I am not exactly going to Evernote a hundred screenshots.
I can't seem to find any tools that straightforwardly do what I want. I've found a couple that will convert the first few pages of a website into a Kindle-friendly format, but no more than that. Paid software or services are fine.
Archive-It (built by the same people who maintain the Wayback Machine, I believe) does exactly what you described:
Archive-It enables you to capture, manage and search collections of digital content without any technical expertise or hosting facilities. Visit Archive-It to build and browse the collections.
posted by rada at 9:33 AM on August 11, 2014 [2 favorites]
This page could be helpful: Go from knowing nothing to scraping Web pages.
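At its most basic, that kind of scraping can be a plain shell loop. A rough sketch, assuming a blog with predictable pagination (the URL pattern and page count here are invented; adjust for the actual site):

    # Fetch each listing page and save it as HTML
    for i in $(seq 1 100); do
        curl -s "http://example.com/blog/page/$i" -o "page-$i.html"
    done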
posted by soelo at 9:49 AM on August 11, 2014 [2 favorites]
For low-traffic blogs that I know I want to read and that publish an RSS feed, I create an IFTTT recipe in the Instapaper channel to save everything there. I believe you can also do keyword matching with that recipe to constrain what gets saved.
posted by These Premises Are Alarmed at 10:36 AM on August 11, 2014 [1 favorite]
lynx's -dump option dumps the formatted output from a web page to a file; specifically, check out the -crawl and -traversal options (the man page covers -traversal in more detail).
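A rough sketch of how that might be used (the URL is a placeholder, and the crawl output naming can vary between lynx versions):

    # Dump one page as formatted plain text
    lynx -dump "http://example.com/blog/" > page.txt

    # Follow the links reachable from the start URL, writing each
    # page's formatted text to a numbered lnk*.dat file
    lynx -crawl -traversal "http://example.com/blog/"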
Note that a lot of web pages these days are just loading skeleton HTML and then loading the rest with JavaScript/AJAX-ish calls, and lynx doesn't do JavaScript.
posted by straw at 9:23 AM on August 11, 2014 [1 favorite]

This thread is closed to new comments.