Recovering the contents of a stolen Blogger blog
July 20, 2011 3:10 PM

A Blogger blog was hacked. The perpetrator declared he'll delete all content pretty soon. It's six years of work I want to save.

The blog doesn't belong to me; I was a contributor. I understand that Google won't help us recover ownership (it's in their TOS that it's the admin's responsibility).

What I want to do is save as much content as possible and help the original author relaunch the blog. Unfortunately, the blog is being modified and some images aren't appearing anymore. It's a race against time.

I've already salvaged the Atom feed and some of the images it references. The rest of the archive looks inaccessible.

I'm a proficient programmer and I can write web scrapers. I need the hive mind's help with pointers on where and how to recover the contents.
posted by jgwong to Computers & Internet (19 answers total) 8 users marked this as a favorite
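
For anyone in the same spot with a saved Atom feed, here is a minimal sketch of pulling the image URLs out of it so they can be downloaded before they disappear. It assumes the feed stores each post body as escaped HTML inside `<content>`, as Blogger feeds usually do; the function and class names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ATOM_NS = "{http://www.w3.org/2005/Atom}"


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.urls.append(value)


def image_urls_from_atom(atom_xml):
    """Return image URLs referenced by the feed's entries, in order, without duplicates."""
    root = ET.fromstring(atom_xml)
    seen, ordered = set(), []
    for entry in root.iter(ATOM_NS + "entry"):
        content = entry.find(ATOM_NS + "content")
        if content is None or not content.text:
            continue  # entry has no inline HTML body
        collector = ImgCollector()
        collector.feed(content.text)
        for url in collector.urls:
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

The returned list can be written to a file, one URL per line, and handed to whatever downloader is at hand.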
 
[removed link, put it in your profile if you want people to see it, thanks]
posted by jessamyn at 3:12 PM on July 20, 2011


Start by checking Google cache. The wayback machine might have some useful stuff.
posted by Foci for Analysis at 3:16 PM on July 20, 2011


The Wayback Machine maybe?
posted by L'Estrange Fruit at 3:18 PM on July 20, 2011


It won't have the pictures, but if you subscribe to the blog in Google Reader, it may have archived posts that are no longer available in the Atom feed. A friend of mine had something very similar happen to his blog, and he was able to reconstitute a lot of it from Reader's cache.
posted by Zozo at 3:21 PM on July 20, 2011 [1 favorite]


Foci for Analysis, L'Estrange Fruit: For the Google Cache and the Wayback Machine I'm setting up Warrick (http://warrick.cs.odu.edu/warrick.html) to recover as much as possible. Thanks!

Zozo: Thanks for the Google Reader tip, that's a clever one.
posted by jgwong at 3:27 PM on July 20, 2011


Don't assume Blogger won't help. Those terms about what they're legally liable for don't mean they won't give you any service. Give it a shot.
posted by John Cohen at 3:36 PM on July 20, 2011


Another approach using some url hacking and DownThemAll.
posted by Foci for Analysis at 3:41 PM on July 20, 2011
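
The linked approach is not reproduced here. As a rough illustration of one common kind of URL hacking for Blogger, the sketch below builds paged requests against the blog's post feed using the `start-index` and `max-results` query parameters. The blog address, the page size of 500, and the function name are assumptions for illustration only.

```python
from urllib.parse import urlencode


def feed_page_urls(blog_base, total_posts, page_size=500):
    """Build feed URLs that together cover posts 1..total_posts.

    page_size is an assumed value; the server may cap it lower.
    """
    base = blog_base.rstrip("/")
    urls = []
    for start in range(1, total_posts + 1, page_size):
        query = urlencode({"start-index": start, "max-results": page_size})
        urls.append(f"{base}/feeds/posts/default?{query}")
    return urls


if __name__ == "__main__":
    # Hypothetical blog address; print the list for a download manager.
    for url in feed_page_urls("http://example.blogspot.com", 1000):
        print(url)
```

The printed list can be pasted into a download manager such as DownThemAll, or saved to a file for a command-line downloader.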


Here's an application that does backups of blogger blogs:
The Blogger Backup utility is intended to be a simple utility to backup to local disk your Blogger posts.

Using the GData C# Library, the utility will walk backward in time, from your latest post to your last, saving each post to a local Atom/XML file.

If you want to, mefi mail me the blog url and I'll look at what I can save at my end. Please write any special instructions wrt specific data you want to save.
posted by Foci for Analysis at 3:45 PM on July 20, 2011 [1 favorite]


Scrapbook is a Firefox extension that will save entire sites and/or folders.
posted by SuperSquirrel at 3:47 PM on July 20, 2011


Backupify is a service that will automatically back up your blogger content at regular intervals (as well as Twitter, Flickr, Google Docs, Gmail, etc.). It's nice to have if you have data you care about in any of those services, for exactly this sort of reason.
posted by mbrubeck at 4:02 PM on July 20, 2011


Can you grab all the pages with wget?
posted by COD at 4:08 PM on July 20, 2011 [1 favorite]


Thanks everyone for your quick responses.

I've tried the URL hacking to get the latest 1,000 posts, but the perpetrator has already deleted most of the content. Six years of posts gone just like that. So sad. Recovering from the blog itself is no longer an option.

I'll have to go the Google Cache/Wayback Machine way now. Warrick didn't work too well (it only recovered 5 pages and currently doesn't work with the Wayback Machine). Any help on automating this will be appreciated.
posted by jgwong at 4:12 PM on July 20, 2011
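
One possible way to automate the Wayback Machine route asked about above, as a minimal sketch. It assumes the Wayback Machine's CDX search endpoint is available and returns JSON whose first row is a header naming the columns; the helper names and the example blog address are hypothetical. The code lists every archived URL under the blog's address and builds a link to the latest capture of each. Note that the latest capture may postdate the hack, so the timestamps are worth checking.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(site_prefix):
    """Build a CDX query listing successful captures under site_prefix."""
    params = {
        "url": site_prefix,
        "matchType": "prefix",       # everything below this address
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Turn CDX JSON rows into one raw-snapshot URL per original URL (latest capture)."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts_col = header.index("timestamp")
    url_col = header.index("original")
    latest = {}
    for row in rows:
        url, ts = row[url_col], row[ts_col]
        # Timestamps are fixed-width digit strings, so string comparison orders them.
        if url not in latest or ts > latest[url]:
            latest[url] = ts
    # The "id_" suffix asks for the page as archived, without the Wayback toolbar.
    return [f"http://web.archive.org/web/{ts}id_/{url}" for url, ts in latest.items()]


def fetch_snapshot_list(site_prefix):
    """Query the CDX endpoint over the network and return snapshot URLs."""
    with urlopen(cdx_query_url(site_prefix)) as resp:
        return snapshot_urls(json.load(resp))


if __name__ == "__main__":
    for url in fetch_snapshot_list("example.blogspot.com/"):  # hypothetical blog
        print(url)
```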


httrack can copy websites.
posted by Zed at 4:12 PM on July 20, 2011


Nthing asking for help even if Google isn't legally required to do so. They may have a backup, and depending on how the site was hacked they might want to help.

Admittedly, it's a long shot, but what have you got to lose?
posted by amtho at 4:52 PM on July 20, 2011


amtho (and previously John Cohen): Yes, I've asked the original owner to talk to Google. Thanks!
posted by jgwong at 4:56 PM on July 20, 2011


"What I want to do is save as much content as possible and help the original author relaunch the blog."

WinHTTrack will recursively grab pages. It'll also grab from a text file of URLs, recursively if desired, so if the page URLs are either known or predictable you can generate a text file with all the pages for either Blogger or Google Cache. And it'll follow links to a specified depth off of the base URL to grab images that may be hosted on another site.
posted by Mitheral at 6:17 PM on July 20, 2011
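
A minimal sketch of generating such a text file when the page URLs are predictable. It assumes the blog uses Blogger's classic monthly archive pattern (`YYYY_MM_01_archive.html`); the blog in question may use a different scheme, and the blog address, date range, and function names are hypothetical.

```python
def monthly_archive_urls(blog_base, first, last):
    """List monthly archive URLs from first to last, both (year, month) and inclusive.

    The YYYY_MM_01_archive.html pattern is an assumption about the blog's layout.
    """
    base = blog_base.rstrip("/")
    urls = []
    year, month = first
    while (year, month) <= last:
        urls.append(f"{base}/{year}_{month:02d}_01_archive.html")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


def write_url_list(urls, path):
    """Write one URL per line, the format URL-list downloaders expect."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")


if __name__ == "__main__":
    # Hypothetical blog address and date range covering about six years.
    urls = monthly_archive_urls("http://example.blogspot.com", (2005, 7), (2011, 7))
    write_url_list(urls, "urls.txt")
```

The resulting `urls.txt` can be loaded as a URL list in WinHTTrack as described above, or passed to wget with `-i urls.txt`.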


nthing that wget is your best automated option.
posted by turkeyphant at 7:19 PM on July 20, 2011


jgwong: Warrick didn't work too well (it only recovered 5 pages and currently doesn't work with the Wayback Machine).

Whoa, really? In my experience, the author was pretty responsive to hearing, say, that retrievals from the Yahoo cache weren't working. ... It might be about the new Wayback Machine interface.

Could you mail me your site so I can take a look?
posted by Pronoiac at 4:58 PM on July 22, 2011


Hi everyone, sorry for not answering before.

Google answered and the blog was recovered! My hopes weren't high on this, but they did the right thing, which is awesome. Thanks everyone for your suggestions and help offers.
posted by jgwong at 12:43 PM on July 30, 2011


This thread is closed to new comments.