|
|
| <prev next> |
Choosing A Webhost: |
The new Codeville merge algorithm: msg#00000version-control.revctrl
I've been discussing the Codeville merge algorithm with a few people on #revctrl recently, most notably Nathaniel Smith from Monotone and Ross, who does most of the real work on Codeville. We have a new version of Codeville merge mostly planned out now. I'm going to explain how Codeville merge currently works, what the problems are, the reasoning behind the new merge algorithm, and how the new merge algorithm works. Codeville merges by doing a two-way merge between files, then using history information to decide which side of each differing section of code wins. Basically it annotates both sides, and checks to see if the revisions those lines came from have already been applied to the other side, and if they have then the side which already has the revisions wins. If neither side wins, it's a conflict and gets escalated to the user, using cvs-style conflict annotations in the file. Usually this works quite well, but it can screw up pretty badly. Since it's just doing a two-way merge lines can erroneously line up with other lines with a completely different lineage. It also can lead to ambiguous clean merges - situations in which *both* sides should win. It turns out these do happen, and they've been causing some problems. Codeville currently has a fix for ambiguous clean merges which is 'good enough for engineering', but we'd like something better. The other case related to ambiguous clean merges is coincidental matches, where two similar-looking lines with different lineages are matched up to each other. That one also causes problems. Our goals for the new merge algorithm are - (1) get rid of ambiguous clean merges (2) get rid of coincidental matches (3) add the property that a series of clean merges will always have the same result regardless of what order the files were merged in (4) make merge generally more mathematical, so that things can be proven about it There are upcoming feataures which we hope are easier to build from the new algorithm, but the immediate goal here is to clean up what we've got, because that leads to some well-defined criteria and we have a general feeling that a well-behaver merge algorithm will make everything easier. One thing which is absolutely NOT a goal is to have a pathway to supporting reordering lines within a file. We've been going over lots of wacky edge cases, and have generally come to the conclusion that the frequency with which reordering lines within a file breaks things more than obliterates the benefits it gets by avoiding unnecessary conflicts, even from a strictly behavioral standpoint, ignoring implementation difficulties. A VCS which knows more about what its tracking than just that it's a series of lines might be able to do moves in a reasonable way, but that's a research topic we aren't even gonna touch for the foreseeable future. Now, for the new merge algorithm. We will give each line an identifier, and only allow it to be matched up with itself, regardless of how much the file has shifted around and how similar it may look to other lines with a different lineage. This approach is inspired by the very nice line alignment which Darcs has, in fact we've been referring to the new merge algorithm as the 'precise-alignment merge'. It may be that as long as you only ever hit clean merges the new Codeville merge is isomorphic to Darcs merge. A full list of lines which have been deleted will be kept in each revision, and when two revisions are merged together lines which have been deleted from either side must be deleted in the result. Perhaps in the future there will be a concept of un-deleting a line, but determining those implicitly is very prone to error, so for now once a line is dead, it stays dead. Two lines will never be allowed to appear in different orders in different revisions (that would nearly guarantee an ambiguous clean merge). This leads to an interesting case. If there's an ancestral version which is ab (we tend to use strings where each character represents a line when giving examples) and one person makes aXb and another makes aYb, is aXYb possible? Obviously we can't have both aXYb and aYXb allowed, but one or the other doesn't cause problems. In ther interests of overall simplicity, we've decided to make a full ordering of all lines. The full ordering, as I'll explain in a second, makes merging a lot easier, and that's the motivation for doing it. Allowing aXYb is bonus. Before going on about the wonderful new merge technique which a full ordering enables, I'll give an example of a full ordering algorithm which works, since it isn't all that obvious that there is one. The criteria for a full ordering are (1) if two lines appear in any file at the same time, they must appear in that order in the full ordering, and (2) For each line, give it an ordering identifier, which is a tuple used to do 'tiebreak' between lines. To compare two ordering identifiers, first the first element is compared, and if those are the same then the second one is compared, and if those are the same the third ones are compared. The tuple elements are, first, the longest number of steps between the new revision and the history root, second the new revision identifier, and third the line number within that revision. When new text is inserted, it's put between the lines before and after it in the full ordering. When comparing merging two versions together, we combine their full orderings first by separating out lines based on what flanks them, so that if line X appears on both sides then a section which appears before X doesn't have to be compared to a section after X. For sections which are flanked the same way on either side we first determine their tiebreak ordering, and then mix. Mixing is done by successively taking the smallest number which appears anywhere in either section, then pulling everything up until that line to the beginning, and iterating on the result. For example if we're mixing 513 and 42, then first the 51 gets pulled out, then the 42, then the 3, so the full ordering is 51423. We aren't especially fond of this full ordering technique, but it does work. We'd really like one which 'clumps' better - that is, if you have two different branches, and merge them together, all lines from one side should come after all lines from the other in the full ordering. Finding a better ordering technique is an interesting problem, but we have one which is perfictly serviceable for now. Once we have a full ordering, merge becomes surprisingly easy. Simply combine line states, and pull out all lines from the list. Line states are: not born, living (inserted), and dead (deleted). If a line is dead on either side, it's dead in the result, and if it's living on either side but dead on neither it's alive, it's alive in the result. If it's not born on both sides, it's not born in the result. To determine if there's a conflict, we look over each section between lines which are alive on both sides, and see if there's indicators that both sides must win in that section. Winners are determined by the following table. The letters correspond to unborn, living, and dead, '<' means the left must win, and '>' means the right must win. uu ul > ud lu < ll ld > du dl < dd Note that if someone adds a diagnostic line and then deletes it then that doesn't cause a merge conflict later, which is kind of a neat feature. Since we have a full ordering, we might as well store all lines in that ordering. And we've now reinvented the weave format (I literally went through this process, not knowing about weaves until the whole merge algorithm was done). Weaves make it really quick to retrieve any revision of history - just go through a graph of history which gives newly deleted and inserted lines only, make the list of what's still left, then run through the list of all lines ever and pull out the ones in this revision. Quite simple, speedy, and space-efficient. The other thing we need an algorithm for is resolution. Given the old versions of a file and the new version, determine what lines map where. We aren't sure of the best way to do this yet, but probably the best approach is to find lines which only occur exactly once (comparing by line content, not identifier) in both the child and the ancestors, then do a longest common substring on those, then extend the selected lines out as continuing matches in either direction. I just recently rewrote the current two-way match code to be based on this approach, and it resulted in a big improvement in both speed and behavior. -Bram
|
|
| <Prev in Thread] | Current Thread | [Next in Thread> |
|---|---|---|
| Next by Date: | Re: The new Codeville merge algorithm, Bram Cohen |
|---|---|
| Next by Thread: | Re: The new Codeville merge algorithm, Bram Cohen |
| Indexes: | [Date] [Thread] [Top] [All Lists] |
Free MagazinesCisco NewsReceive a free quarterly e-newsletter with exclusive articles on how Cisco IT uses its own products and solutions to enable the business. subscribe Systems Management News, the newspaper for IT systems administration and data center managers! Each issue of Systems Management News is chock-full of news and analysis to help you understand what's happening in your field. subscribe The Enterprise Newsweekly eWeek is the essential technology information source for builders of e-business. subscribe Oracle Magazine Oracle Magazine contains technology strategy articles, sample code, tips, Oracle and partner news, how to articles for developers and DBAs, and more. Oracle (NASDAQ: ORCL) is the world's largest enterprise software company. subscribe Total Telecom Total Telecom is "The Economist of the communications industry". subscribe |