Comments on: Reconstituting a wiki database from html?

Question: Reconstituting a wiki database from html?

Pronoiac — Wed, 06 Aug 2008 20:47:07 -0800

I'd like to reconstitute a years-defunct wiki I used to collaborate on. I've contacted the principals, & our searches for the database backup have come up empty so far. Without the original database, the simplest path appears to be taking the HTML & transforming it into, say, an SQL dump. So - how do I do that? Are there any MediaWiki, database, or Perl trails to follow?

By: H. Roark

H. Roark — Wed, 06 Aug 2008 21:46:17 -0800

Not necessarily simple, but relevant.

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

By: XMLicious

XMLicious — Wed, 06 Aug 2008 22:11:56 -0800

Not only would you have to generate SQL code, but you'd also have to write scripts to parse the HTML into wiki markup and re-establish the intrawiki links.
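To give a feel for what "parse the HTML into wiki markup" involves, here is a minimal sketch using only Python's standard library. It assumes intrawiki links look like /wiki/Page_Name (the real site may well have used index.php?title=... instead), and it only handles links, bold, and italics. The names WIKI_PREFIX, WikiLinkConverter, and html_to_wikitext are made up for illustration; a real conversion would also need headings, lists, tables, and templates.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Assumption: article links in the saved HTML look like /wiki/Page_Name.
WIKI_PREFIX = "/wiki/"


class WikiLinkConverter(HTMLParser):
    """Turn a small subset of HTML back into MediaWiki markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []             # finished pieces of wikitext
        self._link_target = None  # page title of the link being read, if any
        self._link_text = []      # visible text collected inside that link

    def _emit(self, text):
        # Text inside an intrawiki link is buffered until the link closes.
        if self._link_target is not None:
            self._link_text.append(text)
        else:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(WIKI_PREFIX):
                title = unquote(href[len(WIKI_PREFIX):]).replace("_", " ")
                self._link_target = title
                self._link_text = []
        elif tag in ("b", "strong"):
            self._emit("'''")
        elif tag in ("i", "em"):
            self._emit("''")

    def handle_endtag(self, tag):
        if tag == "a" and self._link_target is not None:
            text = "".join(self._link_text)
            target = self._link_target
            self._link_target = None
            if text == target:
                self.out.append(f"[[{target}]]")
            else:
                self.out.append(f"[[{target}|{text}]]")
        elif tag in ("b", "strong"):
            self._emit("'''")
        elif tag in ("i", "em"):
            self._emit("''")

    def handle_data(self, data):
        self._emit(data)


def html_to_wikitext(html):
    parser = WikiLinkConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(html_to_wikitext(
        'See <a href="/wiki/Main_Page">the main page</a> and <b>this</b>.'
    ))
```

External links are left as plain text here, which is one example of the hand massaging that would still be needed afterwards.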

In my experience doing this sort of thing for converting flat HTML sites for content management system projects, unless there are thousands of pages, the best course is to just bite the bullet and do it all manually. It's possible to write scripts to parse HTML the way you're thinking of, but it takes lots of time, effort, and development cycles. And the end result doesn't work perfectly, so you usually have to do lots of hand massaging of the resulting content anyways.

I think an advantage you'll have over doing the equivalent process with a commercial CMS is that you have all of the bots, scripts, and tools that are available for speeding up manual editing in MediaWiki. I like the Eclipse plug-in, myself.

By: zippy

zippy — Wed, 06 Aug 2008 23:42:36 -0800

Hey pro, are you talking about the Quicksilver Wiki? I can get you the db dump.

By: Pronoiac

Pronoiac — Thu, 07 Aug 2008 07:41:39 -0800

zippy: Uh, yup.

I'd gotten the "edit" pages, so I wouldn't have to do the HTML-to-wiki translation.
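For anyone following along, a rough sketch of why the "edit" pages help: MediaWiki's edit form carries the raw wikitext, HTML-escaped, inside a textarea named wpTextbox1, so it can be lifted out without any markup conversion. This is only an illustration under that assumption about the saved pages; EditBoxExtractor and extract_wikitext are hypothetical names, and the code uses nothing beyond Python's standard library.

```python
from html.parser import HTMLParser


class EditBoxExtractor(HTMLParser):
    """Collect the text inside <textarea name="wpTextbox1">."""

    def __init__(self):
        # convert_charrefs=True turns &lt; and &amp; back into < and &.
        super().__init__(convert_charrefs=True)
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "textarea" and dict(attrs).get("name") == "wpTextbox1":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._inside = False

    def handle_data(self, data):
        # The parser may deliver the text in several pieces; keep them all.
        if self._inside:
            self.chunks.append(data)


def extract_wikitext(edit_page_html):
    parser = EditBoxExtractor()
    parser.feed(edit_page_html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        '<form><textarea name="wpTextbox1" rows="25">'
        "== Intro ==\n[[Main Page]] &amp; ''more'' &lt;br&gt;"
        "</textarea></form>"
    )
    print(extract_wikitext(sample))
```

This recovers only the current text of each page; revision history and other metadata are not in the edit form.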

By: Pronoiac

Pronoiac — Thu, 07 Aug 2008 07:57:21 -0800

Ha! Whoops, hit "post" early. I'm not caffeinated yet, so I'm sort of "surprisingly lifelike."

Maybe I should have said "I attempted contacting the principals," because I used the email addresses I had - old addresses.

By: zippy

zippy — Fri, 08 Aug 2008 23:01:49 -0800

OK, it's on its way to you through the æther.

By: Pronoiac

Pronoiac — Thu, 04 Sep 2008 14:28:01 -0800

Update: This is still an open question; zippy found & sent a blank copy, & he's still looking for a more substantial copy.

By: XMLicious

XMLicious — Thu, 04 Sep 2008 22:22:29 -0800

If you have each of the edit pages as a separate file, and if each file has a name that can be parsed into the correct page name, maybe you could import it all into a blank installation of MediaWiki via some simple but clever wget command lines, similar to what rjt proposed in this recent post.
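The "name that can be parsed into the correct page name" part might look something like the sketch below. It assumes the files were saved with names like index.php?title=Main_Page&action=edit, which is a guess at how the mirror was made; title_from_filename is a hypothetical helper, and the actual import step (wget or otherwise) is left out.

```python
from urllib.parse import parse_qs


def title_from_filename(filename):
    """Return the wiki page title encoded in a saved edit-page filename.

    Assumes names like 'index.php?title=Main_Page&action=edit'.
    Returns None when no title parameter is present.
    """
    _, _, query = filename.partition("?")
    params = parse_qs(query)  # also decodes %XX escapes
    if "title" not in params:
        return None
    # MediaWiki URLs use underscores where titles have spaces.
    return params["title"][0].replace("_", " ")


if __name__ == "__main__":
    for name in (
        "index.php?title=Main_Page&action=edit",
        "index.php?title=Help%3AEditing_Tips&action=edit",
        "style.css",
    ):
        print(name, "->", title_from_filename(name))
```

Files that don't yield a title (stylesheets, images, and so on) come back as None and can be skipped before importing.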

By: Pronoiac

Pronoiac — Fri, 05 Sep 2008 12:07:45 -0800

Wow. That's a new, utterly strange line of attack. It would lose lots of metadata, but it would be presentable, & it would get far enough along to start the spam blacklist, the next obstacle.

By: XMLicious

XMLicious — Fri, 05 Sep 2008 18:47:21 -0800

Yeah, you wouldn't have any history, et cetera, but at least you'd have the site up and going and editable, and you would have upgraded to the latest version of MediaWiki in the bargain.