Reconstituting a wiki database from html?
August 6, 2008 8:47 PM Subscribe
I'd like to reconstitute a years-defunct wiki I used to collaborate on. I've contacted the principals, & our searches for the database backup have come up empty so far. Without having the original database, the simplest path appears to be taking the html & transforming it into, say, an sql dump. So - how do I do that? Are there any MediaWiki, database, or Perl trails to follow?
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
posted by H. Roark at 9:46 PM on August 6, 2008
Not only would you have to generate the SQL code, but you'd also have to write scripts to parse the HTML into wiki markup and re-establish the intrawiki links.
In my experience doing this sort of thing for converting flat HTML sites for content management system projects, unless there are thousands of pages the best course is to just bite the bullet and do it all manually. It's possible to write scripts to parse HTML the way you're thinking of but it takes lots of time, effort, and development cycles. And the end result doesn't work perfectly so you usually have to do lots of hand massaging of the resulting content anyways.
I think an advantage you'll have over doing the equivalent process with a commercial CMS is that you have all of the bots, scripts, and tools that are available for speeding up manual editing in MediaWiki. I like the Eclipse plug-in, myself.
posted by XMLicious at 10:11 PM on August 6, 2008
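The scale of that hand massaging is easier to see with a concrete sketch. Below is a minimal, hypothetical Python pass (using BeautifulSoup rather than the Perl the question mentions) over a saved page's HTML; the div#bodyContent selector and the handful of tag handlers are assumptions about what the old pages look like, not a drop-in converter.

```python
# Rough sketch only: turn saved wiki HTML pages back into approximate wikitext.
# Assumes BeautifulSoup 4; handles just the common constructs, so real pages
# will still need plenty of manual cleanup (tables, images, templates, etc.).
from bs4 import BeautifulSoup

def html_to_wikitext(page_html):
    soup = BeautifulSoup(page_html, "html.parser")
    # Assumption: a MonoBook-era skin that puts the article in div#bodyContent.
    body = soup.find("div", id="bodyContent") or soup

    # Headings: <h2>Title</h2> -> == Title ==
    for level in range(2, 7):
        for h in body.find_all("h%d" % level):
            h.replace_with("\n" + "=" * level + " " + h.get_text(strip=True)
                           + " " + "=" * level + "\n")

    # Bold and italics
    for b in body.find_all(["b", "strong"]):
        b.replace_with("'''" + b.get_text() + "'''")
    for i in body.find_all(["i", "em"]):
        i.replace_with("''" + i.get_text() + "''")

    # Links: crude guess at which ones were intrawiki links
    for a in body.find_all("a"):
        href, text = a.get("href", ""), a.get_text()
        if "/wiki/" in href:
            target = href.split("/wiki/")[-1].replace("_", " ")
            a.replace_with("[[%s|%s]]" % (target, text))
        else:
            a.replace_with("[%s %s]" % (href, text))

    # Keep list items and paragraph breaks from collapsing into one line
    for li in body.find_all("li"):
        li.insert(0, "* ")
        li.append("\n")
    for p in body.find_all("p"):
        p.append("\n\n")

    return body.get_text()
```

Even this much only gets you rough wikitext with dead intrawiki links to fix by hand, which is the point of the advice above.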
Hey pro, are you talking about the Quicksilver Wiki? I can get you the db dump.
posted by zippy at 11:42 PM on August 6, 2008
Response by poster: zippy: Uh, yup.
I'd gotten the "edit" pages, so I wouldn't have to do the html to wiki translation.
posted by Pronoiac at 7:41 AM on August 7, 2008
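Since the saved copies are the "edit" pages, the wikitext is already sitting in each form's edit box, so recovery is mostly a matter of pulling it back out. A small sketch, assuming ordinary saved MediaWiki edit forms (which keep the page source in a textarea named wpTextbox1); the directory and file-naming scheme below are placeholders.

```python
# Sketch: recover raw wikitext from saved MediaWiki "edit this page" HTML.
# MediaWiki's edit form holds the page source in <textarea name="wpTextbox1">.
# The saved_edit_pages/ and wikitext/ directories here are hypothetical.
import pathlib
from bs4 import BeautifulSoup

def extract_wikitext(edit_page_html):
    soup = BeautifulSoup(edit_page_html, "html.parser")
    box = soup.find("textarea", attrs={"name": "wpTextbox1"})
    return box.get_text() if box is not None else None

if __name__ == "__main__":
    src = pathlib.Path("saved_edit_pages")   # hypothetical input directory
    dst = pathlib.Path("wikitext")
    dst.mkdir(exist_ok=True)
    for f in sorted(src.glob("*.html")):
        text = extract_wikitext(f.read_text(encoding="utf-8", errors="replace"))
        if text is not None:
            (dst / (f.stem + ".wiki")).write_text(text, encoding="utf-8")
```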
Response by poster: Ha! Whoops, hit "post" early. I'm not caffeinated yet, so I'm sort of "surprisingly lifelike."
Maybe I should have said "I attempted contacting the principals," because I used the email addresses I had - old addresses.
posted by Pronoiac at 7:57 AM on August 7, 2008
Response by poster: Update: This is still an open question; zippy found & sent a blank copy, & he's still looking for a more substantial copy.
posted by Pronoiac at 2:28 PM on September 4, 2008
If you have each of the edit pages as a separate file, and if each file has a name that can be parsed into the correct page name, maybe you could import it all into a blank installation of MediaWiki via some simple but clever wget command lines, similar to what rjt proposed in this recent post.
posted by XMLicious at 10:22 PM on September 4, 2008
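The suggestion above is plain wget one-liners against the edit form; the same bulk import can also be sketched against MediaWiki's api.php, which handles the login and edit tokens more cleanly. A hypothetical Python version, assuming a reasonably current blank wiki with the write API enabled and an account allowed to edit; the API URL, credentials, and wikitext/ directory are placeholders, not anything from the thread.

```python
# Sketch: push recovered wikitext into a fresh MediaWiki via its API.
# Assumes the write API is enabled and the account below may edit;
# API_URL, the credentials, and the wikitext/ directory are placeholders.
import pathlib
import requests

API_URL = "http://example.org/w/api.php"        # hypothetical
USER, PASSWORD = "ImportBot", "botpassword"     # hypothetical

session = requests.Session()

# 1. Log in: fetch a login token, then action=login.
r = session.get(API_URL, params={"action": "query", "meta": "tokens",
                                 "type": "login", "format": "json"})
login_token = r.json()["query"]["tokens"]["logintoken"]
session.post(API_URL, data={"action": "login", "lgname": USER,
                            "lgpassword": PASSWORD, "lgtoken": login_token,
                            "format": "json"})

# 2. Fetch a CSRF token for editing.
r = session.get(API_URL, params={"action": "query", "meta": "tokens",
                                 "format": "json"})
csrf_token = r.json()["query"]["tokens"]["csrftoken"]

# 3. One edit per recovered page; the filename stands in for the page title.
for f in sorted(pathlib.Path("wikitext").glob("*.wiki")):
    title = f.stem.replace("_", " ")
    resp = session.post(API_URL, data={
        "action": "edit",
        "title": title,
        "text": f.read_text(encoding="utf-8"),
        "summary": "Restoring page from saved copy",
        "token": csrf_token,
        "format": "json",
    })
    print(title, resp.json().get("edit", resp.json()))
```

wget with --post-data could make the same requests, as suggested, but you would have to scrape the tokens and carry the session cookies along yourself. Either way each page comes back as a single fresh revision.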
Response by poster: Wow. That's a new, utterly strange line of attack. It would lose lots of metadata, but it would be presentable, & it would get far enough along to start the spam blacklist, the next obstacle.
posted by Pronoiac at 12:07 PM on September 5, 2008
Yeah, you wouldn't have any history, et cetera, but at least you'd have the site up and going and editable, and you would have upgraded to the latest version of MediaWiki in the bargain.
posted by XMLicious at 6:47 PM on September 5, 2008
This thread is closed to new comments.