Reconstituting a wiki database from HTML?
August 6, 2008 8:47 PM

I'd like to reconstitute a years-defunct wiki I used to collaborate on. I've contacted the principals, & our searches for the database backup have come up empty so far. Without having the original database, the simplest path appears to be taking the HTML & transforming it into, say, an SQL dump. So - how do I do that? Are there any MediaWiki, database, or Perl trails to follow?
posted by Pronoiac to Computers & Internet (10 answers total)
 
Not necessarily simple, but relevant.

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
posted by H. Roark at 9:46 PM on August 6, 2008


Not only would you have to generate SQL code, but you'd also have to write scripts to parse the HTML into wiki markup and re-establish the intrawiki links.
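
To give a rough idea of what that parsing involves, here's a bare-bones Python sketch. The tag patterns and the /wiki/ link prefix are guesses about how the old skin rendered pages, and real pages would need many more cases than this:

    import re

    def html_to_wikitext(html):
        """Turn a few common HTML patterns back into wiki markup (rough sketch)."""
        text = html
        # internal links: <a href="/wiki/Page_name">label</a> -> [[Page name|label]]
        text = re.sub(r'<a href="/wiki/([^"]+)"[^>]*>([^<]*)</a>',
                      lambda m: '[[%s|%s]]' % (m.group(1).replace('_', ' '), m.group(2)),
                      text)
        # bold and italics
        text = re.sub(r'</?b>', "'''", text)
        text = re.sub(r'</?i>', "''", text)
        # headings: <h2>Title</h2> -> == Title ==
        text = re.sub(r'<h2>\s*(.*?)\s*</h2>', r'== \1 ==', text)
        # strip whatever tags are left
        text = re.sub(r'<[^>]+>', '', text)
        return text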

In my experience doing this sort of thing for converting flat HTML sites for content management system projects, unless there are thousands of pages the best course is to just bite the bullet and do it all manually. It's possible to write scripts to parse HTML the way you're thinking of, but it takes lots of time, effort, and development cycles. And the end result doesn't work perfectly, so you usually have to do lots of hand-massaging of the resulting content anyway.

I think an advantage you'll have over doing the equivalent process with a commercial CMS is that you have all of the bots, scripts, and tools that are available for speeding up manual editing in MediaWiki. I like the Eclipse plug-in, myself.
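
For instance, once the pages exist, a bot framework like pywikibot can batch up the kind of cleanup you'd otherwise do by hand. A sketch, using the current pywikibot API and assuming the target wiki is set up in its user-config.py (the page titles and the string being cleaned up are just placeholders):

    import pywikibot

    site = pywikibot.Site()   # the wiki configured in pywikibot's user-config.py
    for title in ['Front Page', 'Another Page']:   # placeholder titles
        page = pywikibot.Page(site, title)
        if 'OldWikiName:' in page.text:
            # example cleanup: strip a leftover interwiki prefix
            page.text = page.text.replace('OldWikiName:', '')
            page.save(summary='Cleaning up links left over from the old wiki')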
posted by XMLicious at 10:11 PM on August 6, 2008


Hey pro, are you talking about the Quicksilver Wiki? I can get you the db dump.
posted by zippy at 11:42 PM on August 6, 2008


Response by poster: zippy: Uh, yup.

I'd gotten the "edit" pages, so I wouldn't have to do the HTML-to-wiki translation.
posted by Pronoiac at 7:41 AM on August 7, 2008


Response by poster: Ha! Whoops, hit "post" early. I'm not caffeinated yet, so I'm sort of "surprisingly lifelike."

Maybe I should have said "I attempted contacting the principals," because I used the email addresses I had - old addresses.
posted by Pronoiac at 7:57 AM on August 7, 2008


OK, it's on its way to you through the æther.
posted by zippy at 11:01 PM on August 8, 2008


Response by poster: Update: This is still an open question; zippy found & sent a blank copy, & he's still looking for a more substantial copy.
posted by Pronoiac at 2:28 PM on September 4, 2008


If you have each of the edit pages as a separate file, and if each file has a name that can be parsed into the correct page name, maybe you could import it all into a blank installation of MediaWiki via some simple but clever wget command lines, similar to what rjt proposed in this recent post.
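
Or, sketched in Python rather than wget: if those saved files really are the raw action=edit pages, the wikitext should still be sitting in the textarea MediaWiki names wpTextbox1, and something like this could pull it out into plain text files ready to paste or feed into the new wiki (the directory names and the filename-as-title assumption are placeholders):

    import glob, html, os, re

    os.makedirs('wikitext', exist_ok=True)
    for path in glob.glob('saved_edit_pages/*.html'):
        page = open(path, encoding='utf-8').read()
        # the edit form keeps the page source in a textarea named wpTextbox1
        m = re.search(r'<textarea[^>]*name="wpTextbox1"[^>]*>(.*?)</textarea>',
                      page, re.DOTALL)
        if not m:
            continue
        wikitext = html.unescape(m.group(1))   # undo &amp; / &lt; / &gt; escaping
        title = os.path.splitext(os.path.basename(path))[0]
        with open(os.path.join('wikitext', title + '.txt'), 'w', encoding='utf-8') as out:
            out.write(wikitext)

From there the text files could go back in by hand, or through whatever scripted posting to the edit form works against that version of MediaWiki.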
posted by XMLicious at 10:22 PM on September 4, 2008


Response by poster: Wow. That's a new, utterly strange line of attack. It would lose lots of metadata, but it would be presentable, & it would get far enough along to start the spam blacklist, the next obstacle.
posted by Pronoiac at 12:07 PM on September 5, 2008


Yeah, you wouldn't have any history, et cetera, but at least you'd have the site up and going and editable, and you would have upgraded to the latest version of MediaWiki in the bargain.
posted by XMLicious at 6:47 PM on September 5, 2008

