Wikipedia online/offline diff tool
December 12, 2024 12:11 AM Subscribe
Is there an easy way to compare a downloaded copy of wikipedia with the online version?
Suppose one had downloaded a copy of Wikipedia (e.g., via Kiwix) but wanted to periodically compare it against the online version. I assume, of course, that this is only practical on a page-by-page basis, as diffing the entire site would take many hours and put an unreasonable load on their servers.
The challenge is, Kiwix stores the archive in a single file. Is there an established way to accomplish this, perhaps using a different downloaded format/viewing app?
And if not, would someone skilled at the craft consider making one?
I anticipate a takeover and massive scrubbing of Wikipedia in the next four years and would like to monitor the changes.
Best answer: This is a broad question. As long as Wikipedia's page history remains active and usable, you can just look at any article's changes there (https://en.wikipedia.org/wiki/Help:Page_history). However, in the spirit of your question, assuming you can no longer trust that, you can take a different approach.
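For example, here is a minimal sketch in Python (assuming the standard MediaWiki API and the requests library; the article title is just an example) that lists the most recent revisions of one article, which is the same information the page-history screen shows:

```python
# List the most recent revisions of one article via the standard MediaWiki API.
import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_revisions(title, limit=10):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        for rev in page.get("revisions", []):
            print(rev["timestamp"], rev["user"], rev.get("comment", ""))

recent_revisions("Reliability of Wikipedia")
```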
Download and keep snapshots of Wikipedia (https://en.wikipedia.org/wiki/Wikipedia:Database_download). As long as you can get an up-to-date archive file, you can do the comparison offline between two unpacked archives saved on different dates.
Once you have lots of loose files in folders, you can use something like WinMerge (free on Windows), or BBEdit or Beyond Compare (on Mac). The comparison will take a long time and produce a TON of data, so you will probably want to narrow it to whatever you are trying to monitor.
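If you'd rather script it than click through a GUI, a rough sketch using Python's standard-library filecmp module (the snapshot folder names below are just placeholders) would be:

```python
# Recursively compare two snapshot folders and report changed/added/removed files.
import filecmp
import os

def report(cmp):
    for name in cmp.diff_files:        # present in both, contents differ
        print("changed:", os.path.join(cmp.left, name))
    for name in cmp.left_only:         # only in the older snapshot
        print("removed:", os.path.join(cmp.left, name))
    for name in cmp.right_only:        # only in the newer snapshot
        print("added:", os.path.join(cmp.right, name))
    for sub in cmp.subdirs.values():   # recurse into shared subfolders
        report(sub)

report(filecmp.dircmp("snapshot-2024-11", "snapshot-2024-12"))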
If you are really into this, I would recommend learning tools that make it easier and becoming more technical over time (or finding a project partner who is more technical), since decent programming or IT skills will help a lot. For example, if Wikipedia stops offering the archive downloads, it helps to know how to make your own page checker.
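A page checker doesn't have to be fancy. As a rough sketch (again assuming the MediaWiki API stays available; the local file path is hypothetical), something like this fetches the current wikitext of an article and diffs it against a copy you saved earlier:

```python
# Fetch the current wikitext of an article and diff it against a local copy.
import difflib
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvslots": "main",
        "rvprop": "content",
        "format": "json",
        "formatversion": "2",
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

def check_page(title, local_path):
    online = fetch_wikitext(title).splitlines()
    with open(local_path, encoding="utf-8") as f:
        local = f.read().splitlines()
    for line in difflib.unified_diff(local, online, "local", "online", lineterm=""):
        print(line)

check_page("Reliability of Wikipedia", "snapshots/Reliability_of_Wikipedia.txt")
```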
https://en.wikipedia.org/wiki/Reliability_of_Wikipedia is an interesting read.
If what you worry about were to start happening, a number of alarm bells would start ringing. In the USA, the EFF and the Internet Archive are two good orgs to keep up with.
posted by alicebob at 5:10 AM on December 12
Best answer: As HearHere mentions, you can download all of Wikipedia (for various definitions of "all") pretty easily, and that's where Kiwix gets its information from.
As it says, "pages-articles-multistream.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is over 19 GB compressed (expands to over 86 GB when decompressed)."
On local, reasonably fast storage, diffing two versions of that will be not just practical but a lot faster than you'd think, so the place to start is asking what question you are trying to answer, specifically. If you just want to track which articles change over time, that seems pretty easy.
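One way to do that without any special tooling (a sketch, not a tuned tool; the dump file names below are just examples) is to stream each dump with Python's standard library, hash every article's wikitext, and compare the two indexes:

```python
# Build a {title: hash} index from a pages-articles dump so two dumps from
# different dates can be compared. Tags are matched by local name because the
# export XML namespace varies between dump versions.
import bz2
import hashlib
import xml.etree.ElementTree as ET

def page_hashes(dump_path):
    hashes = {}
    title, text = None, ""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]   # strip the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                hashes[title] = hashlib.sha1(text.encode("utf-8")).hexdigest()
                elem.clear()                    # keep memory use roughly flat
    return hashes

old = page_hashes("enwiki-20241101-pages-articles-multistream.xml.bz2")
new = page_hashes("enwiki-20241201-pages-articles-multistream.xml.bz2")
changed = sorted(t for t in old if t in new and old[t] != new[t])
print(len(changed), "articles changed between the two dumps")
```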
If you are interested in checking the integrity of Kiwix, as in verifying that what Kiwix provides was actually the content of a specific Wikipedia page, that looks a bit more difficult. It looks like the Kiwix project grabs a Wikipedia download and recompresses it into a "ZIM" file, which I think can be decompressed with widely available zstd tools, though you'll probably need to mine the whole multi-terabyte history of Wikipedia if you want to convincingly demonstrate some sort of malfeasance there.
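For spot checks, one possible approach (assuming the openZIM python-libzim bindings and the Wikimedia REST HTML endpoint; the ZIM file name and entry path are guesses, since paths inside a ZIM vary by archive) is to pull a single article out of the Kiwix file and diff it against the live page:

```python
# Compare one article as stored in a Kiwix ZIM with the live Wikipedia page.
import difflib
import requests
from libzim.reader import Archive  # pip install libzim

def zim_html(zim_path, entry_path):
    entry = Archive(zim_path).get_entry_by_path(entry_path)
    return bytes(entry.get_item().content).decode("utf-8")

def live_html(title):
    resp = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/html/{title}",
                        timeout=30)
    resp.raise_for_status()
    return resp.text

offline = zim_html("wikipedia_en_all_nopic.zim", "A/Reliability_of_Wikipedia")
online = live_html("Reliability_of_Wikipedia")
print("\n".join(difflib.unified_diff(offline.splitlines(), online.splitlines(),
                                     "kiwix", "online", lineterm="")))
```

Expect noise: Kiwix rewrites links and styling when it packages pages, so even unchanged articles won't diff cleanly at the HTML level, and comparing extracted text or wikitext is more meaningful.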
Kiwix itself is on GitHub, and it isn't really a "Wikipedia viewer": it's a ZIM file viewer, and that project offers a lot of other web content repackaged in that format for offline viewing.
posted by mhoye at 5:39 AM on December 12