by Mark Stosberg
David Roundy explains how CVS, Subversion and Arch were all considerations in the design of darcs, a source control alternative growing in popularity. Having just passed its "1.0" release milestone, David provides insight into the history and future of the project, as well as explaining the unique "theory of patches" that forms darc's foundation.
Stosberg: What were the options for source control like when you decided to start on darcs? What motivated you then, and what motivates you now, considering how related projects have developed?
David Roundy: It all started with looking for a revision control system for aBridge, an online bridge game I was working on at the time. I was using SourceForge and for a while, I used the CVS repository they provide, but wasn't very happy with it, so I started looking for alternatives.
At the time, Subversion and arch were the only free software alternatives. There was Aegis, but it was even more centralized than CVS, requiring all developers to have login accounts on the same machine. Since I liked the idea of a distributed system, I went with arch.
I wasn't really satisfied with arch either, though, and for a while started getting involved with a discussion of creating a new patch format for arch. This got me thinking and discussing, and having neat ideas, which ended up becoming the theory behind darcs. I realized that there was no way my ideas would ever get into arch, and since I found them interesting, decided I'd play with them a bit, and thus darcs was born. I didn't really think it would become popular, but as long as I was enjoying myself, I figured it wasn't a waste of time, although it did essentially kill off aBridge development.
I should say, one of the motivating factors was the theory of patches, but there was also the feeling that things could be a lot simpler. Physics is all about symmetry, and most of the problems I saw in arch, both in its complexity, and in its behavior, are due to artificially broken symmetries. Developing a more "elegant" system was certainly in my mind.
My motivation in working on darcs has changed quite a bit as it's become more popular. I never would have guessed how much time I'd end up spending on email, but I've found that helping people use darcs is very satisfying. Darcs feels a lot more like work now than it used to, and at times it starts feeling old, but then I get an email from someone who's just discovered darcs, and it's all worthwhile.
Stosberg: The "theory of patches" that darcs is based on really seems to be unique. Could you summarize briefly the key elements that sets this design apart from Arch and other alternatives?
David Roundy: At its most basic level, the theory of patches is about the commutation, or reordering, of changes in such a way that their meaning doesn't change. The rules of commutation tell us when, for example, one patch requires another, since dependent patches cannot be commuted. Once the commutation primitives have been worked out, one can do all sorts of interesting (and useful) operations, such as merging. And such operations can be shown to be independent of order, i.e. it doesn't matter whether you merge patch A or patch B first, you'll get the same result.
Arch really is the only other revision control system I am aware of which is really change-oriented. Other systems may be change/set/ oriented, but fundamentally they keep track of versions, rather than changes.
Arch is based on the idea of "inexact application" of patches, which is basically what patch does, only extended to also handle file renames, etc. Because of this inexact patching, two branches which have the same set of changesets may not be identical, if the changesets were applied in a different order. Another result of the design of arch based on inexact patching is that you can, for example, cherrypick a change that modifies a file without first getting the change that *creates* the file, which probably isn't what you want.
Stosberg: There seems to be a rhythm to your contributions to darcs. Your patches and communication seem to flow almost exclusively during a one or two hour period early in the morning. Is it true you have a structured schedule to work on darcs? What are the benefits of this arrangement?
David Roundy: I'm a morning person, so I prefer working in the morning. Basically, I work on darcs from the time I wake up, which varies seasonally according to the time the sun rises, until I have to eat breakfast, and get ready for work. I try very hard not to work on darcs while I'm at work, since working on darcs in the office really could really eat into the time spent on research, and in the evening I try not to touch a computer at all.
When I first started working on darcs, I would also work in the evening at times, but after injuring my wrist with too much computer work, I developed the habit of *not* working in the evenings, which really is much more sustainable. I also used to work on darcs all day on saturdays, but now I try to stop by noon, and usually spend the rest of the day off the computer.
Stosberg: Darcs is written in a Haskell, a functional language that is relatively unknown compared to C or Perl. Why Haskell? What impact do you think the language choice has had the project? Any unexpected twists?
David Roundy: It is a little-known fact that the first implementation of darcs was actually in C++. However, after working on it for a while, I had an essentially solid mass of bugs, which was very hard to track down. In all honesty, the biggest problem was that when I started, I didn't really understand the theory of patches, so the algorithms themselves were all buggy. However, after spending six months or so writing a code that was impossible to debug, I didn't feel like rewriting it in the same way.
At the time, I had been looking into Haskell, since functional languages are quite interesting. I had written a toy code or two, just to see if I would be annoyed by the syntax, and thought it would be worth trying Haskell for the next
attempt at darcs. I had recently had a traumatic experience with python, and really wanted to learn a strictly typed language that was a bit more friendly than C or C++.
Haskell is just a great language in which to program. It is purely functional, and lazy, both of which allow you to do really cool tricks. For example, by using lazy IO I can cleanly separate the file and directory reading, from the patch-applying (which is pure functional code), from the file or directory writing. Haskell also is a really good match for implementing the primitive patch operations, with its pattern-matching syntax and higher order functions.
The main downside of using Haskell has definitely been the issues of speed and memory cost. I've done a lot more optimization tricks than I ever expected, which is an interesting challenge in Haskell. In some ways, the laziness of the language is helpful, as it allows somewhat "sloppy" programming to still be quite efficient, but it also makes it hard to track down why memory isn't being garbage collected.
On the plus side, the pure functional nature of Haskell often makes it easier to experiment with new ways of representing data that may be more optimal. For example, darcs has a configure option --enable-antimemoize, which makes certain data structures garbage-collectible, and if they do get thrown away, they get recomputed when needed. In addition, data structures created from such evanescent data structures can have the same disposable property. This would require a lot more restructuring in a non-functional language.
One danger that is often mentioned is that of a lack of contributors due to the rarity of Haskell programmers. I've been surprised by the number and quality of contributors darcs has had. There seem to be quite a few people out there just looking for somewhere to use Haskell! :) And in fact, there have also been developers who learned Haskell expressly for the purpose of contributing to darcs. It's such a pleasant language to work with that I think it's more of a draw to developers than a put-off.
Stosberg: Looking into the future of darcs, what issues do you see as critical to address soon? What others are longer term goals you see for the project?
David Roundy: Right now there are a couple of important issues that need to be addressed. One is general scalability issues with large repositories. On large repositories, darcs is a serious memory hog, and for certain operations runs significantly slower than it needs to. To a certain extent, this is straightforward to address. On the other hand, there is a limit as to how far we can go, since by design darcs needs to be able to hold patches in memory, so if your patch is 500MB, you need at minimum 500MB of virtual memory.
The other major issue is that of merge conflicts. While the current algorithms are "correct", there are scenarios where the performance is just unacceptable. Basically, this happens when you have a long series of changes that conflict. My main post-darcs-1.0 goal is to address this issue. I have some ideas how to address it, but with getting darcs 1.0 out the door, and then dealing with the rush of bug reports resulting from that release, I just haven't had time. If things don't slow down, I may need to take a sabbatical from the darcs-users mailing list so I'll have time to work on this.
Other short to medium term changes will be interface improvements, and better behavior on the windows platform. These are things that are mostly being addressed by other contributors, so I don't see myself spending much time on them, but of course they're still important. In this category is the issue of improving support for different character sets, more flexible handling of line endings in windows.
In the longer term, I'd like to see some interesting new primitive patch types, which should allow better merge behavior by better expressing the intention of the user. The simple line-based patching patching works pretty well for source code, but for other sorts of files, you'd rather use a different scheme. For example, there is a PhD student in Germany who has been working on diffing algorithms for XML documents, who has been working on implementing his algorithms in darcs. This was originally one of the things about darcs that interested me most, but for about a year I've been distracted by working to stabilize darcs and improve its interface.
Mark Stosberg is the Principal Developer at Summersault. He helps maintain several Perl modules including Data::FormValidator, CGI::Application and WWW::Mechanize. Lately he's become an avid recumbent commuter.