Bazaar is too Bizarre
March 22, 2010 5:27 AM

How can I viscerally understand distributed revision control?

I've been using bazaar for about a year now. It's pretty great, but I really only have my toe stuck in.

A coworker who has been doing more distributed development than I keeps suggesting usages/features/methodologies that make no sense to me because I don't understand distributed revision control.

I understand what you might call the "mission statement" of it: You work locally on your own thing and merges happen magically. But without understanding at least some of that magic, I'm just not grokking how anything more advanced than ci/diff/log works.

I need some kind of visualization[1] of the process. Better yet, a mechanical explanation because that's how I understand almost all processes: as tiny machines. Or maybe what I need is some laboriously-worked-out examples showing the before and after? I'm so confused I'm not even sure what I don't understand.

[1] I already know about bzr viz and it hasn't helped. If anything, bzr viz leaves me more confused than before I used it. "Wait...what does it mean that the red dot connects to the blue line?"
posted by DU to Computers & Internet (12 answers total) 8 users marked this as a favorite
 
Have you read The Git Parable?
posted by aparrish at 5:53 AM on March 22, 2010


Joel Spolsky's Mercurial tutorial at hginit.com is really good, despite the fact that Joel Spolsky is really annoying.
posted by Combustible Edison Lighthouse at 6:15 AM on March 22, 2010 [3 favorites]


Merges don't happen magically.

Stop thinking of things as a sequence of revisions, and instead think of them as a set of patches. The system's job is to order and apply patches. I add a patch to mine; I can give you my patches; I can take and apply your patches; we collect patches over months, and eventually call some set of patches "a release".

You can do the same thing e-mailing diffs around and fastidiously keeping records of what came from whom and when and why and in what order you run "patch" to make a tree.
posted by cmiller at 7:20 AM on March 22, 2010


Oh, and come hang out on Freenode #bzr to ask questions. I'm CardinalFang.
posted by cmiller at 7:21 AM on March 22, 2010


Distributed version control systems, at heart, are conceptually identical to centralized version control systems. Each developer has a bunch of working directories where everything is in a state of flux, and then there is The Repository that keeps track of old copies of things so that anybody can get at stuff the way it used to be.

The main thing that distinguishes a distributed VCS from a centralized one is that the distributed VCS doesn't impose a requirement of continuous access to the central repository. This is achieved by caching, on each developer's local machine, those portions of The Repository that have been (a) explicitly copied from it or (b) committed locally by the developer.

To make this work, the VCS adopts a global convention for naming things. It does this to guarantee that you never end up with the same name referring to two different things just because those things happen to be cached on different developer machines. That global naming mechanism is where the magic hides. Dig into that, for whichever DVCS you're using, and you will be enlightened.

Once you're convinced that the DVCS can reliably name each object it knows about, so that identical names guarantee identical content, you'll readily see that it doesn't matter where the objects so named are stored or how many times they're duplicated, as long as there's at least one copy available to anybody who needs it. The only difference between having the item you need stored on your own box or the central server or a co-worker's box is how many hoops you need to jump through for access.

So in fact you could, if you wanted to, run an entire project development in the complete absence of a central machine that holds a fully up-to-date repository. Given access to all the partial-repo caches on all the developer boxes, you could mechanically construct that centralized repository simply by grabbing the first instance of each uniquely-named object you find as you walk the developer-box caches.

See? The global, central repo still exists, but the pieces that make it up no longer need to live on the same box to assure consistency. Consistency relies instead on the robustness of an object naming convention.
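Here is a minimal sketch of that idea, not how any particular DVCS stores things. It assumes objects are named by a SHA-1 hash of their content, which is the Git-style approach; Bazaar generates its revision ids differently, but the point that a name never refers to two different things is the same. The names object_name, store and rebuild_global_repo, and the two developer caches, are made up for illustration.

```python
import hashlib


def object_name(content: bytes) -> str:
    """Name an object by hashing its content (assumption: Git-style naming)."""
    return hashlib.sha1(content).hexdigest()


def store(cache: dict, content: bytes) -> str:
    """Put content into one developer's local cache and return its global name."""
    name = object_name(content)
    cache[name] = content
    return name


def rebuild_global_repo(caches):
    """Walk every developer cache, keeping the first copy of each named object."""
    repo = {}
    for cache in caches:
        for name, content in cache.items():
            # Identical names mean identical content, so later copies add nothing.
            repo.setdefault(name, content)
    return repo


# Two hypothetical developer boxes, each with a partial cache.
alice, bob = {}, {}
shared = store(alice, b"int main(void) { return 0; }\n")
store(bob, b"int main(void) { return 0; }\n")      # same content, same name
mine = store(alice, b"int main(void) { return 1; }\n")
yours = store(bob, b"int main(void) { return 2; }\n")

repo = rebuild_global_repo([alice, bob])
print(len(alice), len(bob), len(repo))
```

Each box holds two objects, but the reconstructed repo holds three, because the shared object got the same name on both boxes without anybody coordinating.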

Now, the developers themselves typically do not need to pay much attention to the VCS's internal names for things. A developer can make a change to foobar.c, and check it in, and make another change, and check it in, and revert a change and make a different one, and check it in, and it still looks like foobar.c. But as far as the DVCS itself is concerned, every one of those check-ins has created a new object for it to track, and it will generate a new and globally unique name to do that with. And that's why when I'm checking in changes to foobar.c on my box, and you're independently checking in changes to foobar.c on your box, we're not breaking the integrity of the (possibly ghostly) global repo. Rather, your sequence of changes and mine are handled exactly as a centralized VCS would if we had explicitly created two different repository branches to work in.

That is: a DVCS does not attempt the I-did-it you-did-it I-did-it you-did-it sequence of merges that would result from you and me working in parallel on foobar.c in the same branch of a CVCS. Instead, it automatically makes (and tracks) a repository branch for each of us. We can of course re-unify those branches any time we like, provided we each have (possibly indirect) access to each other's repo caches, and the file content merges involved in doing so are no more or less automated than they would be with a CVCS.

Because a DVCS does this kind of branch creation inherently and automatically, while a CVCS only creates branches on explicit command, DVCS projects tend to have more branches. That means DVCS users typically do fewer file content merges, with more changes per merge, than CVCS users. There's nothing inherent in a DVCS that makes that process any easier than it would be with a CVCS; in practice, though, the DVCS file-merging tools tend to be very good.

Does that help?
posted by flabdablet at 7:58 AM on March 22, 2010


Joel Spolsky wrote a good introduction to Mercurial called HgInit. It's got pretty pictures.
posted by blue_beetle at 8:22 AM on March 22, 2010


Response by poster: The naming thing does help a little, although I still don't get how the merges happen or why the result is the log I see. Revision numbers changing out from under me seems like a bad feature...
posted by DU at 8:32 AM on March 22, 2010


Merges are just application of patches and a record that those patches are applied after the others.

Revision IDs are the constant name of a change in Bazaar. Only use revno for second-to-second reference that you can type. To use a web analogy, revids are the permalinks, and revno is a session-dependent URL.
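A toy model of the difference, assuming a branch is nothing more than an ordered list of revids along its mainline and a revno is just a position in that list. The revid strings are invented, and real Bazaar gives merged revisions dotted numbers, which this sketch does not model.

```python
def revno(mainline, revid):
    """1-based position of revid on this branch's mainline, or None if it is not there."""
    return mainline.index(revid) + 1 if revid in mainline else None


# Hypothetical revids; both branches start from the same initial revision.
du_branch = ["rev-init", "rev-du-edit"]
your_branch = ["rev-init", "rev-you-edit"]

# Same revno, different changes: "2" only means something relative to a branch.
same_number = revno(du_branch, "rev-du-edit") == revno(your_branch, "rev-you-edit")

# You merge from DU. The merge commit joins your mainline; DU's edit is
# recorded as merged in, but it is not a mainline revision of your branch.
your_branch.append("rev-you-merge-du")

# DU pulls from you, so DU's branch now mirrors yours.
before = revno(du_branch, "rev-du-edit")
du_branch = list(your_branch)
after = revno(du_branch, "rev-du-edit")

print(same_number, before, after)
```

The revid "rev-du-edit" names the same change the whole time; only its number on a given branch moved.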

Also, you can alias all your commands to use revids, and more. From my ~/.bazaar/bazaar.conf :

[ALIASES]

commit = commit --show-diff

tags = tags --show-ids

log = log --show-ids --show-diff
posted by cmiller at 9:10 AM on March 22, 2010


I still don't get how the merges happen

In talking about VCS there are two different contexts for using the word "merge", and these often cause confusion. There's merging multiple versions of a file to produce a new version that incorporates changes made from different starting points, and there's merging a branch which unifies forked code bases. Merging a branch will involve merging a bunch of files, and may also involve dealing with conflicts over particular files having been deleted or renamed or moved to a different subdirectory.

First-generation centralized version control systems put an exclusive lock on any file that a developer checked out of the repository, so that no other developer could make changes as well until that file had been checked back in. That's a pretty annoying limit to work under, but version control's overall benefits meant that lots of people got used to working under it and built it deep into their conceptual models of how a VCS ought to work.

One of the reasons that CVS became so popular so quickly is that it got rid of that limitation. So if you want to mess with foobar.c when I've already checked it out, you don't need to wait for me to check my changes back in. Instead, you and I can both check out foobar.c and modify it independently, and then you could check your modified version back in; CVS would then stop me from checking mine back in until I'd merged your changes into it as well.

For most source files in most languages, constructing a merged version that incorporates both your changes and my changes can be totally automated provided that you and I have been working on different parts of the file. Only when our changes are found to overlap does human discretion need to get involved.
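To make that concrete, here is a deliberately tiny three-way merge. It assumes all three versions have the same number of lines, so it only handles in-place line edits; real merge tools align changed regions so that insertions and deletions work too. The function name merge3 and the sample lines are made up.

```python
def merge3(base, mine, yours):
    """Toy line-by-line three-way merge.

    Returns (merged_lines, conflict_indexes). A conflicting line is left as None.
    Assumes base, mine and yours all have the same number of lines.
    """
    if not (len(base) == len(mine) == len(yours)):
        raise ValueError("toy merge only handles in-place line edits")
    merged, conflicts = [], []
    for i, (b, m, y) in enumerate(zip(base, mine, yours)):
        if m == y:            # both sides agree (changed identically or not at all)
            merged.append(m)
        elif m == b:          # only you changed this line
            merged.append(y)
        elif y == b:          # only I changed this line
            merged.append(m)
        else:                 # we both changed it, differently: a human decides
            merged.append(None)
            conflicts.append(i)
    return merged, conflicts


base = ["int a = 1;", "int b = 2;", "int c = 3;"]
mine = ["int a = 10;", "int b = 2;", "int c = 3;"]    # I touched the first line
yours = ["int a = 1;", "int b = 2;", "int c = 30;"]   # you touched the last line

clean, clean_conflicts = merge3(base, mine, yours)

clash = ["int a = 99;", "int b = 2;", "int c = 3;"]   # somebody else also touched the first line
messy, messy_conflicts = merge3(base, mine, clash)

print(clean, clean_conflicts)
print(messy, messy_conflicts)
```

The key ingredient is the common ancestor: without base there is no way to tell which side changed a line and which side left it alone.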

In practice this works really well, but on first exposure it sounds insanely unsafe, and a lot of project managers looked at it and thought NO! I am not going to allow that kind of free-love hippie chaos within my project! and either configured CVS to require traditional check-out locking, or required their devs to use something comfortingly commercial instead.

As a result, there are lots of experienced and competent developers who still view the whole idea of merging with some degree of suspicion. Is that you, or are you comfortable with the idea of merging but just don't quite get how and when it happens when you're working with bzr?

Revision numbers changing out from under me seems like a bad feature...

Then just ignore them and use the IDs instead, at least until you've internalized the idea that IDs are all that matters to the (possibly ghostly) global repository.

Also realize that when you bzr pull from another dev's repository, you're not merging that dev's changes. All you're doing is increasing the number of the (distributed) global repository's branches that are cached on your own box.

Actual file merging activity doesn't happen until you bzr merge two branches (which could both be yours, or could be one of yours and one of somebody else's) to form a single new one. Because all that merge activity is happening on your box, it doesn't affect anybody else unless and until they merge your new branch with one of theirs, to make a single new branch for them on their box.

If everybody just keeps doing this kind of thing, what a DVCS gives you instead of the "trunk" development line you'd get with a CVCS looks more like a braid. It doesn't split into a huge mess of forked code precisely because devs do merge branches with each other on a fairly regular basis.
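A rough sketch of why the back-and-forth settles down rather than ping-ponging forever. It assumes history is a graph in which each revision lists its parents, and that a branch is just a pointer to a tip revision. The function names and revision names are hypothetical, and Bazaar's own logic is more involved than this.

```python
def ancestors(graph, tip):
    """All revisions reachable from tip through parent links, tip included."""
    seen, stack = set(), [tip]
    while stack:
        rev = stack.pop()
        if rev not in seen:
            seen.add(rev)
            stack.extend(graph[rev])
    return seen


def sync(graph, my_tip, other_tip, new_rev):
    """Bring my branch up to date with the other one. Returns (action, my_new_tip)."""
    if my_tip == other_tip or other_tip in ancestors(graph, my_tip):
        return "nothing", my_tip            # I already have everything they have
    if my_tip in ancestors(graph, other_tip):
        return "pull", other_tip            # their history contains mine: just move my tip
    graph[new_rev] = [my_tip, other_tip]    # histories diverged: record a two-parent merge
    return "merge", new_rev


# Both of us committed a second revision on top of the same initial one.
graph = {"init": [], "du-2": ["init"], "you-2": ["init"]}

action1, you_tip = sync(graph, "you-2", "du-2", "you-3")   # you merge from DU
action2, du_tip = sync(graph, "du-2", you_tip, "du-3")     # DU gets your work
action3, you_tip = sync(graph, you_tip, du_tip, "you-4")   # you check back with DU

print(action1, action2, action3)
```

Only the first step creates a merge revision. After that DU's history is a strict subset of yours, so DU can simply pull, and the third step finds nothing left to do. A new merge revision is only needed again once both sides have committed fresh work.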

Depending on your project's release model, it might or might not make sense to set up an "authoritative" box somewhere whose job is to consolidate the latest working versions of everything, and arrange for its repo cache to be as close as possible to containing the entire global repo at all times. That could be a senior developer's workstation, or it could be an automated build/test/release server.
posted by flabdablet at 5:13 PM on March 22, 2010




Response by poster: Is that you, or are you comfortable with the idea of merging but just don't quite get how and when it happens when you're working with bzr?

No, that's not me at all. If anything, I find the "locked out until the other guy checks in" model extremely weird and limiting. I used CVS for years. (Which is not a demand to explain bzr in terms of CVS--there were a lot of things I didn't get about CVS either.)

it might or might not make sense to set up an "authoritative" box somewhere

Yeah, we have this.

OK, here's the thing. My problem is a lack of global understanding and so I've been asking for Grand Explanations. But the solution may be to work on cases and try to build up global understanding.

I edit foobar.c. I check it in as revno 2. Meanwhile you also edited foobar.c and checked it in as revno 2 on your branch. Now you want to get my changes, so you merge from me. ....I was going to ask for an explanation of some of the things I'd see, but actually I'm not even sure what they are. Let me go create a project to test this out with.

Oh jebus, I understand this less the more I try it. calmblueoceancalmblueocean

You merge from me. The diff between our branches is applied to your branch. You check that in with a comment that says "merged from DU". Now your bzr log shows 3 revisions. The initial one, the one where you edited foobar.c and the one where you merged from me.

In my real project, the bzr log for your revision 3 would contain the comment "merged from DU" plus the comments on all the checkins I'd made since you last merged from me. That's a big part of what I don't get. But in my test just now, all it says is "merged from DU". WTF.

Another thing I don't get is why we have to keep merging back and forth so much. You have to merge from me to get my stuff. Then I have to merge from you to get your stuff PLUS the fact that you merged from me. Then you have to pull back from me to get the fact that I merged from you. Unless you made a change in the meantime, which means you'd actually have to MERGE back from me, which means I'd have to merge THAT back from you.

Meanwhile the log is looking crazier and crazier and is filled with messages about merging.
posted by DU at 4:58 AM on March 23, 2010


Response by poster: bzr log -n0 gives me the merged comments. I also spent over an hour trying scenarios and explaining them with a knowledgeable coworker. I'm beginning to get a vague glimmering of the gut feeling I want of how branches really work.
posted by DU at 10:52 AM on March 23, 2010

