Version control
December 15, 2006 2:41 PM

How does a wiki store revisions?

Say I get on a wiki and post a painstakingly researched 4,000-word article on the Dreyfus Affair. Then, ten minutes later, some snob who calls himself FrancoPhile67 comes along and changes a single word (my plainly brilliant use of "avaricious" to a plebeian "covetous"). Clearly, he deserves to be pummeled, but that's not my question.

What I want to know is, how does the wiki archive his revision? I posit that there must be two main approaches:

1) Store each revision in its entirety. In this case, the wiki claims ignorance of the content and preserves the whole of FrancoPhile67's 4,000 words, a full 3,999 of which are replicated in the wiki's record of my original opus. The database is thus, in our crude example, 8,000 words long.

2) Store only the changes that have been made. This more challenging method requires the wiki to parse the texts, find where FrancoPhile67 and I differ--where don't we?--and record only those disagreements, à la diff.

Method #1 saves processing power, while method #2 saves space--considerable amounts of it, considering the number of minor revisions made to many wikis' pages.
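(To make the contrast concrete, here's a toy sketch in Python; purely illustrative, with a one-sentence "article" standing in for my 4,000 words:)

    import difflib

    original = "Dreyfus was the victim of avaricious schemers.\n"
    revision = "Dreyfus was the victim of covetous schemers.\n"

    # Method 1: keep every revision whole -- trivial to implement, but near-duplicates pile up.
    history_full = [original, revision]

    # Method 2: keep the first version plus a diff -- smaller, but someone has to parse and patch.
    delta = list(difflib.ndiff(original.splitlines(keepends=True),
                               revision.splitlines(keepends=True)))
    history_diff = [original, delta]

    # Show just the changed lines from the diff:
    print("".join(line for line in delta if line.startswith(("-", "+"))))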

While I recognize that there are many wiki platforms, which method is more common? If method #1 is used, does that make wiki databases unfathomably huge? If method #2 is employed, what specific algorithms work best? This topic interests me greatly.
posted by deadfather to Computers & Internet (7 answers total) 2 users marked this as a favorite
 
To answer your question outright: it depends.

The most common method (and easiest) is to store each revision in its entirety. Storing text in a database requires very little overhead and most databases are damn good at it.
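(To give a feel for how little is involved, here's a minimal sketch in Python with sqlite; the table and column names are just made up for illustration:)

    import sqlite3

    db = sqlite3.connect("wiki.db")
    db.execute("""CREATE TABLE IF NOT EXISTS revisions (
                      id       INTEGER PRIMARY KEY,
                      page     TEXT,
                      body     TEXT,
                      saved_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

    def save_revision(page, body):
        # Every edit just inserts the full text; no parsing, no patching.
        db.execute("INSERT INTO revisions (page, body) VALUES (?, ?)", (page, body))
        db.commit()

    def latest(page):
        row = db.execute("SELECT body FROM revisions WHERE page = ? "
                         "ORDER BY id DESC LIMIT 1", (page,)).fetchone()
        return row[0] if row else None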

There are a few wiki packages that use a backend like Subversion to store changes to wiki pages. However, I don't believe it's very common, both because it adds "yet another requirement" to installation and because, well, it's not the easiest way to do it.

I only know this because I myself have written wiki software before and studied quite a few open source packages. If you have any more questions, please do feel free to start up an e-mail exchange. :-)
posted by jdgdotnet at 2:58 PM on December 15, 2006


The technology to store diffs is well developed. If I were to implement a LARGE wiki (lots of data), I'd be sure to use the diff method. Using CVS or Subversion as the backend will give you the feature practically for free.
posted by cschneid at 3:01 PM on December 15, 2006


I dare not speak for all wikis in general, but wikis did evolve out of the Unix culture, where diff was available. I'd assume that the original (and its successors) that use file-based stores go ahead and just use the diff program that's already there.

For wikis implemented in higher-level or more structured environments, or where diff might not be natively around, I would posit that they use diff libraries written for that language. Anecdotally, I've seen the diff approach far more often than the store-the-whole-revision approach.

As to algorithms, I'm sure there are more than a couple, but I bet most implementations crib a lot from each other, since most wikis are open source.

As for the big daddy, MediaWiki, which powers Wikipedia: it appears to use the Perl library Algorithm::Diff. If you're curious, that's where I would start.
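(If you want to poke at the idea without Perl, Python's difflib plays a similar role to Algorithm::Diff; a quick, purely illustrative sketch:)

    import difflib

    old = "The avaricious officers forged the bordereau.\n".splitlines(keepends=True)
    new = "The covetous officers forged the bordereau.\n".splitlines(keepends=True)

    # A compact, human-readable delta, similar in spirit to what diff libraries produce.
    for line in difflib.unified_diff(old, new, fromfile="rev1", tofile="rev2"):
        print(line, end="")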
posted by mmascolino at 3:11 PM on December 15, 2006


I wrote my own wiki (hasn't everyone?), and it uses a combination approach. It stores diffs, but if a diff is above a certain size threshold (say, 50% of the total page length) it stores the entire page. And rather than comparing revisions to the previous revision, I compare them to the last full-page revision. That way, reconstructing any particular revision takes at most one patch. My method is O(1) in the number of revisions, whereas CVS, for example, is O(n), n being the number of revisions since the initial one.
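(In case it's useful, here's roughly the shape of that scheme in Python, with difflib standing in for my own diffing routine; consider it a sketch rather than my actual code:)

    import difflib

    revisions = []   # each entry is ("full", lines) or ("delta", base_index, delta)

    def _make_delta(base, new):
        # Keep only "copy this range of the base" instructions plus the genuinely new lines.
        sm = difflib.SequenceMatcher(a=base, b=new, autojunk=False)
        delta, new_material = [], 0
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                delta.append(("copy", i1, i2))
            else:
                delta.append(("insert", new[j1:j2]))
                new_material += j2 - j1
        return delta, new_material

    def save(text):
        new = text.splitlines(keepends=True)
        full_ids = [i for i, rev in enumerate(revisions) if rev[0] == "full"]
        if not full_ids:
            revisions.append(("full", new))
            return
        base_index = full_ids[-1]               # always diff against the last full-page revision
        delta, new_material = _make_delta(revisions[base_index][1], new)
        if new_material * 2 > len(new):         # diff above ~50% of the page: store it whole
            revisions.append(("full", new))
        else:
            revisions.append(("delta", base_index, delta))

    def load(i):
        rev = revisions[i]
        if rev[0] == "full":
            return "".join(rev[1])
        base = revisions[rev[1]][1]
        out = []                                # reconstructing takes at most one "patch"
        for op in rev[2]:
            out.extend(base[op[1]:op[2]] if op[0] == "copy" else op[1])
        return "".join(out)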

Anyways, that's just the way I did it. It was mostly for fun, so I even implemented the diffing procedure myself, based on some academic paper, rather than calling diff, which would have been 10x easier.
posted by Khalad at 3:47 PM on December 15, 2006 [1 favorite]


I also wrote a wiki engine in C# (tinkering) and used a diff approach, but: the current version of the page was stored in its entirety (to display quickly with little CPU overhead), and the diffs "to get there" were "behind" it in another table. So the DB, in your example, would be storing 8,001 words (the original full text, the current full text, and the diff history). However, if I came along and replaced the plebeian "covetous" with the downright boring "greedy", the database would only increase by one word.
posted by Merdryn at 6:25 PM on December 15, 2006


In classic revision control systems that don't simply store every revision, there are two main choices. First, the RCS approach: store one version in full plus a series of deltas (in the case of RCS, lists of instructions to remove, add, or replace lines that transform one version into another).

Second, the SCCS approach, which is called "weave". I am not as familiar with SCCS, but the "weave" seems to involve interspersing the lines of all the revisions along with instructions on which lines to take for a particular revision.

In either of these systems, it may be necessary to do work proportional to the total length of all revisions of the file in order to obtain a particular old revision. (With RCS, the least work is done for the newest version of the file; with SCCS, I think about the same amount of work is done for any version.)
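(Just to make the "list of instructions" idea concrete, a toy sketch in Python; the real RCS delta format is terser and does its line-number bookkeeping differently, so treat this purely as an illustration:)

    def apply_delta(lines, instructions):
        # Each instruction addresses 1-based line numbers in the *original* text:
        #   ("delete", start, count)     remove `count` lines beginning at line `start`
        #   ("add", after, new_lines)    insert new_lines after original line `after`
        # Applying from the bottom up keeps the original line numbers valid as we edit.
        out = list(lines)
        for op in sorted(instructions, key=lambda ins: ins[1], reverse=True):
            if op[0] == "delete":
                _, start, count = op
                del out[start - 1:start - 1 + count]
            else:
                _, after, new_lines = op
                out[after:after] = new_lines
        return out

    rev1 = ["Dreyfus was framed by the real traitor.\n",
            "His accusers were driven by avaricious motives.\n"]
    # Replace line 2 by deleting it and adding a replacement after line 1.
    rev2 = apply_delta(rev1, [("delete", 2, 1),
                              ("add", 1, ["His accusers were driven by covetous motives.\n"])])
    print("".join(rev2))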

In recent systems, such as the "GIT" version control system by the guy who created Linux, each revision is stored on the filesystem, though possibly in a compressed format. GIT has some way of efficiently storing multiple revisions of multiple files in one "pack", but again I don't know the details. Storing each revision of each file on the disk is not a bad choice when disks are over 100 gigabytes, and a year's work on the project might only require an additional gigabyte of storage.

One difference between modern (git, svn) and classic (rcs, cvs, sccs) systems is that there is a greater emphasis on storing revisions of "the tree" (all the files in a project) than on storing revisions of single files. To use the wiki example, you might want a wiki to include a page and all its images as a single snapshot; you might want to include a number of pages; or you might want to include the entire wiki. The revision control system would allow you to find the answers to questions like: Was this page renamed from another page (a title change)? Was it a copy? What was the history of the page, including under its old title?

One difference between the revision control systems on wikis and the ones used by programmers is the concept of "branches". For instance, a software project may have a branch for its "stable version" and a branch for its "development version". In this way, work can proceed on new features, while making it easy to put a few bugfixes in the stable version, and keep track of the whole thing. (If your revision control system is fancy, it will even help you "cherry pick" changes from one branch to put in another.) I'm not aware of a wiki that uses branches to store different but equal revisions of the same article.
posted by jepler at 8:01 PM on December 15, 2006


If your database is sensibly compressed, you wouldn't expect this to be an issue. Even fairly simple compression algorithms will squeeze out most of that duplicated text, as long as the revisions end up compressed together.
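(A quick way to see this with Python's zlib, assuming the revisions actually do get compressed together in one stream, rather than row by row, and that they fit inside the compressor's window:)

    import zlib

    # ~2,000 distinct "words", small enough that two revisions fit in zlib's 32 KB window.
    rev1 = " ".join("word%04d" % i for i in range(2000))
    rev2 = rev1.replace("word1000", "word9999")   # a one-word edit

    alone    = len(zlib.compress(rev1.encode()))
    together = len(zlib.compress((rev1 + rev2).encode()))

    # `together` comes out only slightly larger than `alone`: the second,
    # nearly identical revision costs almost nothing once it's compressed
    # alongside the first.
    print(alone, together)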
posted by reklaw at 3:22 AM on December 16, 2006

