

Creating an ebook from a wiki
May 28, 2011 3:31 PM

How would one go about exporting a whole MediaWiki installation as an ebook?

What I have:
  • an XML dump of the Wikia wiki I want to read (~18 MB file, 5000+ articles, CC-BY-SA licensed)
  • a local MediaWiki 1.16.5 installation with the above dump imported
  • access to all the Unixy tools you'd expect

What I want:
  • a way to read the wiki offline on my Kindle, presumably as a proper .mobi eBook; EPUB or anything else easily convertible to Mobipocket (i.e. not PDF) is fine too
  • internal links should work as expected; together with search, they will be the main means of navigating the ebook
  • inline images would be fantastic, but a) they aren't critical to this particular wiki, b) they aren't included in the dump and would need to be fetched separately, and c) they would further increase the size of an already hefty file
  • as little MediaWiki-specific content as possible (such as Edit links and page meta info), but this is also a secondary priority
  • ideally, the process should be simple, repeatable on different wikis and require as little human interaction as possible

What I tried:
  • straight up converting the XML dump to an HTML file
    • the main problem here is that MediaWiki markup is a pain to parse and none of the libraries I looked at provided satisfactory results:
      • wikicloth (Ruby) is pretty fast and seems to provide all the hooks I need, but the git HEAD hasn't been updated in months and is failing several test cases; in particular, list tags are not properly closed, which causes all kinds of horrible nesting issues (a rough sketch of this route follows the list below)
      • marker (Ruby) provides better output, but is super slow and has a less flexible API, which means I'd need to do additional post-processing on the HTML to get links and stuff to work
      • the language doesn't matter much, but it seems the situation with Perl and Python libraries isn't much better

    • another issue is templates, which would require a lot of extra work to expand properly


  • using one of the available MediaWiki extensions/tools to export the content of my local install in a more convenient format:
    • ePubExport chokes and times out when provided with the full list of pages, even with the corresponding time/memory limits raised
    • mw-render seems to operate only on single pages
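
For reference, the dump-to-HTML route boils down to something like the sketch below. It assumes the standard MediaWiki export schema (<page>/<title> and <revision>/<text> elements) and that I have the wikicloth API right; it does nothing about templates, and the list-closing bug mentioned above still applies:

    # rough sketch: render every page of a MediaWiki XML dump into one HTML file
    require 'nokogiri'
    require 'wikicloth'

    doc = Nokogiri::XML(File.read('dump.xml'))
    doc.remove_namespaces!

    File.open('wiki.html', 'w') do |out|
      out.puts '<html><body>'
      doc.xpath('//page').each do |page|
        title = page.at_xpath('title').text
        text  = page.at_xpath('revision/text').text
        # one anchor per article, so internal links can later point at #Page_Title
        anchor = title.tr(' ', '_')
        out.puts %(<h1 id="#{anchor}">#{title}</h1>)
        out.puts WikiCloth::Parser.new(:data => text).to_html
      end
      out.puts '</body></html>'
    end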


Right now I'm leaning toward using the dumpHTML extension (if it's still compatible with the latest MediaWiki - reports vary) to get 5000+ static HTML files, then writing a script that extracts the content section of each page, rewrites all headings and internal links to use anchors, glues the output together, and runs it through pandoc or similar (rough sketch below).
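
The extraction/glue step would look roughly like this; the "bodyContent" div, the "firstHeading" class and the shape of the relative links are guesses about dumpHTML's output that I'd still have to verify, and the final conversion is just pandoc to EPUB followed by kindlegen:

    # rough sketch: pull the content of each dumped page, rewrite internal
    # links as in-document anchors, concatenate, then convert to EPUB/MOBI
    require 'nokogiri'

    File.open('book.html', 'w') do |out|
      out.puts '<html><body>'
      Dir.glob('html_dump/**/*.html').sort.each do |path|
        page    = Nokogiri::HTML(File.read(path))
        heading = page.at_css('h1.firstHeading')
        title   = heading ? heading.text.strip : File.basename(path, '.html')
        body    = page.at_css('#bodyContent')
        next unless body
        # turn relative links to other dumped pages into #Page_Title anchors
        body.css('a[href]').each do |a|
          next if a['href'] =~ %r{^https?://}
          a['href'] = '#' + File.basename(a['href'], '.html').tr(' ', '_')
        end
        out.puts %(<h1 id="#{title.tr(' ', '_')}">#{title}</h1>)
        out.puts body.inner_html
      end
      out.puts '</body></html>'
    end

    system('pandoc book.html -o book.epub')
    system('kindlegen book.epub -o book.mobi')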

Is there a better way?
posted by dmit to Computers & Internet (1 answer total) 1 user marked this as a favorite
 
Could you email the folks behind your favorite potential resource and ask them how updates/fixes are coming along?

The first one I checked, wikicloth, appears to have a Google-able author.

If someone emailed me asking for an update to something I had made, I'd probably be pleased that they cared enough to ask for it: "hey! this thing that feels obscure to me wasn't useless after all!"
posted by aniola at 3:33 AM on June 1, 2011

