Archive.org partially archived forums, best way to find all the pages
February 13, 2016 12:51 AM

An old bulletin board system disappeared, and some of the information was captured in archive.org. Most posts were not saved, but every once in a while I will get lucky and select a day and a thread that is actually archived. Is there any way I can figure out which ones are archived, without having to manually click all the threads on the forums across multiple dates?

The site in question is syngnathid.org from archive.org. When I click on a topic, much of it is just straight up missing. However, every once in a while one does work.

I've found a few like this, but it usually takes picking different dates and different threads. It was a big enough forum that I can't see myself selecting every date in archive.org and then every post manually. Is there some way I could automate this? Is there some other function of archive.org that might give me as complete a record of this site as possible? Any other approaches that would get me as much as possible with the least amount of manual work?

Another working example. The forum, one of the working threads. Another.

I'm keen to do a similar thing on another forum, which also has only a small number of threads that "work".

Could someone suggest where I can start looking to find all the threads that are archived, without manually searching and clicking each thread?

Thanks for the help and suggestions.
posted by [insert clever name here] to Computers & Internet (6 answers total) 3 users marked this as a favorite
 
Our local Linux user group recovered an old community wiki by gently scraping the different versions, resolving the internal links programmatically, then using best judgment to pick out the most complete version. This was quite a bit of work, and we didn't get everything, but the hard part was the editorial decisions, not the downloading.
posted by scruss at 5:30 AM on February 13, 2016


The Wayback Machine supports wildcard searches with an asterisk, for example. It seems like they don't paginate their results though, so loading all 20,000-odd pages it has archived kinda freaked out my browser.

They also have an API which would allow you to automate the retrieval.
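
As a rough illustration of what automating a single lookup could look like, here is a minimal Python sketch against the Wayback Machine availability endpoint, which reports the closest snapshot for one URL. The function names (`availability_query`, `closest_snapshot`, `check_page`) are made up for this example, and the response shape is assumed to be the documented `archived_snapshots` / `closest` JSON, so treat it as a starting point rather than a finished tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD-style string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the snapshot URL from a decoded response, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def check_page(page_url, timestamp=None):
    """Hypothetical helper: ask the service about one page (needs network access)."""
    with urlopen(availability_query(page_url, timestamp)) as response:
        return closest_snapshot(json.load(response))
```

Calling `check_page` once per thread URL would tell you whether that thread has any snapshot at all. It still means one request per page, so it only helps if you already have a list of thread URLs to test.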
posted by mustardayonnaise at 7:31 AM on February 13, 2016 [1 favorite]


The Wayback Machine has APIs. Three different ones actually, see the links towards the bottom. If I understand the CDX API right you can use it to query "show me all the pages you've archived for this hostname", as well as other filters. I've never used these myself, but if you can program it might be a place to start. It would take some work. If you can't program, it might be worth looking if someone has made an app / website that uses these APIs to do what you need.
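
For anyone who wants to try the CDX route, a minimal Python sketch of the "show me everything archived for this hostname" query might look like the following. The helper names are invented for this example. The query parameters (`url`, `matchType`, `output`, `fl`, `filter`, `collapse`, `limit`) are ones the CDX server documents, but the particular combination is just one reasonable guess at what the poster needs, not a tested recipe for this site.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(host, limit=None):
    """Build a CDX query listing successful captures under a host."""
    params = [
        ("url", host),
        ("matchType", "domain"),        # include subdomains and all paths
        ("output", "json"),
        ("fl", "timestamp,original,statuscode"),
        ("filter", "statuscode:200"),   # skip redirects and error pages
        ("collapse", "urlkey"),         # roughly one row per distinct URL
    ]
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """With output=json the first row is the header; turn the rest into dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


def snapshot_url(capture):
    """Turn one capture into a browsable Wayback Machine address."""
    return "http://web.archive.org/web/{timestamp}/{original}".format(**capture)


def list_captures(host, limit=None):
    """Hypothetical helper: fetch and decode the capture list (needs network access)."""
    with urlopen(build_cdx_query(host, limit)) as response:
        return parse_cdx_rows(json.load(response))
```

Something like `[snapshot_url(c) for c in list_captures("syngnathid.org", limit=50)]` should give a first batch of links to inspect. Dropping the `collapse` parameter would list every capture date for each URL instead, which may matter here since different dates seem to hold different threads.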
posted by Nelson at 7:33 AM on February 13, 2016


Unfortunately, it's primarily a manual process (you could use scripts, but then the scripts are still doing it "manually"). Wayback lives in an odd twilight world between backup and copy of the web, so these situations are always difficult when you want a specific thing. Sometimes there are highly focused crawls, but that's not the case here.
posted by jscott at 9:57 AM on February 13, 2016


Response by poster: Is there anyplace outside of archive.org that I can look for backups of sites, or is archive.org it? I assume the latter, but don't know.

I know I won't be able to create a script to do this. If I want to create a freelance project, any suggestions of what I should be asking for? Should I be asking for it to be converted to a new database type, as a crawler converter does? How do I make sure that as much as possible of what exists gets captured?

I've got some development experience, I just don't have enough to do this myself, and need some ideas of what I should be asking for and how I should describe and manage a project like this.
posted by [insert clever name here] at 6:38 PM on February 13, 2016


Google has cached some versions of some pages. I'm not sure how easy it would be to put them together with archive.org versions.

Oh! While Googling "Google Cache" I found this: http://cachedview.com/
posted by getawaysticks at 6:34 AM on February 14, 2016

