Re: metarepositories and disk space: msg#00319

Subject: Re: metarepositories and disk space
On Sat August 21 2004 06:27 am, David Roundy wrote:
> It seems that there are roughly three reasons people would like to do this,
> and I think that these three reasons can be best addressed separately.

While this is a nice summary of the reasons that have been discussed so far, 
it seems to me that a larger point, which I haven't expressed very well if at 
all, is being missed.  Let me take another shot at it.

In general, projects have multiple lines of development, and multiple 
configurations within those lines.  Much development is fairly linear in 
nature, but some isn't: branches are sometimes made, and changes are 
sometimes copied between branches.

The darcs model supports a single line of development per repo, with a tagging 
feature available to record specific configurations along the line.

Thus there are several kinds of information that are of interest:

() What configuration a given repo is in.  It is suggested that the repo name 
feature suffices to track this.

() What patches form the difference between two configurations.  Darcs might 
be able to compute this for two configurations in the same line, but can't do 
it between branches, since it has no representation for multiple branches.

() What configurations contain a given patch.  Again, darcs can't compute this 
across branches.

() What public configurations (including branches) have been created (by 
anyone) for a given project.  The suggestion is that if all users keep their 
public configurations in a single directory, an external tool could 
synchronize the sets of configurations.  As I think someone pointed out, 
there are space and download time issues here.  If one user creates a new 
repo for a public branch on their own machine, they can get the benefit of 
the hardlinks.  But if one user has a repo A, which another user gets to make 
A', and then the first user does a `darcs get' on A to create B, and the 
second user then does a get on B to make B', there won't be any sharing 
between A' and B'.  This is one reason I think this functionality should be 
integrated into darcs rather than outboard.  Still, an external solution is 
arguably workable as long as the numbers of public branches remain small.
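
As a rough illustration of the second and third kinds of information, here is 
a minimal Python sketch.  It assumes a purely hypothetical model in which a 
configuration is a named set of patch identifiers; the names `configs`, 
`patch_difference` and `configs_containing` are invented for the example and 
are not part of darcs.  It also ignores patch ordering and dependencies.

```python
# Hypothetical model: each configuration (tag, branch head, etc.) is a
# named set of patch identifiers.  Nothing here is a darcs API.
configs = {
    "stable-1.0": {"p1", "p2", "p3"},
    "devel": {"p1", "p2", "p4", "p5"},
}


def patch_difference(configs, a, b):
    """Return (patches only in a, patches only in b)."""
    return configs[a] - configs[b], configs[b] - configs[a]


def configs_containing(configs, patch):
    """Return the sorted names of all configurations that include patch."""
    return sorted(name for name, patches in configs.items() if patch in patches)
```

With both branches recorded in one place, each question reduces to a set 
operation, whichever line of development the configurations belong to.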

We've mostly been discussing the third point, but I think the first two are at 
least as important.  It's common, for example, when a product is about to be 
released, to make a branch for the release.  Subsequent minor releases are 
created by fixing bugs in the release branch; the development branch 
typically undergoes more disruptive revisions in preparation for a future 
release.  The release branch needs to be isolated from those, but the bug 
fixes need to be copied (usually) from the release branch to the development 
branch.  I gather it's easier in darcs to do this than in most SCM systems -- 
just push the patches from the release repo to the development repo.  But it 
still would be nice to be able to ask darcs whether this had been done for a 
given patch.

> Reason 1: disk space

This is certainly one of the reasons.  As you point out, bandwidth can be an 
issue too.

> Reason 2: easily synchronizing or transfering multiple projects/branches
>
> As I said, I think this is best done with an external tool.  Such a tool
> could be bundled with darcs, but I like the fact that darcs itself deals
> with just a single branch, which simplifies both its interface and its
> code.

I'm not sure it simplifies the interface.  See below.

> Reason 3: keeping track of branches in a sane manner
>
> I'm not clear as to how things would be different in this respect if there
> was some link between various branches (a metarepository, or whatever).  It
> seems like all that would change would be the namespace that is being used,
> and the only major effect would be that you'd lose flexibility.  Whatever
> you gained in ability to deal with two branches on your computer (in a
> single metarepository) you wouldn't gain when dealing with one branch on
> your computer and one branch on someone else's computer--i.e. you'd be
> losing the distributed nature of darcs.

I don't agree.  All you would have to do is to synchronize the two repos, and 
then each user would be able to perform all operations that either could 
perform.

In fact I think it enhances the distributed nature of darcs, because it lets 
me know about all the public configurations you have created (including 
experimental ones), and after I synchronize with your repo, I can then enter 
any of those configurations efficiently without reconnecting.

Let me elaborate on this a bit.  In the system I envision, the most common 
operation between two repos would probably be complete synchronization, in 
which all public configuration definitions and the patches they contain are 
transferred between the two repos.  ("Public" means "those whose developer 
has decided they're ready to be published" -- the distinction simply provides 
a way to keep work private while it's in progress.)  This operation per se 
doesn't alter the working copy of either repo.  After it's done, the user of 
each repo can browse the new configurations and decide which, if any, they 
wish to move their working copy to -- or they can create some new 
configuration, for instance, one which combines some of the patches they just 
received with one of their private configurations.
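
The synchronization described above might be sketched roughly as follows.  
This is only an illustration under stated assumptions: the `Repo` class, its 
fields, and the functions `publish_to` and `synchronize` are all hypothetical, 
configuration names are assumed to be unique across repos, and patch contents 
are treated as opaque values.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Repo:
    # All fields are hypothetical; this is not how darcs stores anything.
    patches: dict = field(default_factory=dict)  # patch id -> patch contents
    configs: dict = field(default_factory=dict)  # config name -> set of patch ids
    public: set = field(default_factory=set)     # names of published configs
    working: Optional[str] = None                # config the working copy is in


def publish_to(src, dst):
    """Copy src's public configs and their patches into dst.

    Returns the number of patches actually transferred; patches dst
    already holds are not sent again.
    """
    sent = 0
    for name in src.public:
        ids = src.configs[name]
        for pid in ids:
            if pid not in dst.patches:
                dst.patches[pid] = src.patches[pid]
                sent += 1
        dst.configs[name] = set(ids)
        dst.public.add(name)
    return sent


def synchronize(a, b):
    """Exchange public configs both ways; working copies are left untouched."""
    return publish_to(a, b), publish_to(b, a)
```

In this sketch, private configurations and the patches unique to them never 
leave their repo, and repeating the synchronization transfers nothing new.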

By default, as a matter of convenience, one would probably want darcs to put 
the local repo into the latest configuration in some line of development; 
this reflects the way darcs works now and the most common usage scenario.  
But this isn't necessary, and one can easily do something different.

To me this actually simplifies the darcs user interface, because no decisions 
have to be made at repo synchronization time.  I don't have to fiddle with 
patches whose names match a particular regexp, or anything like that -- I 
just get everything.  I don't have to worry much about the bandwidth or disk 
space involved, because no information is transmitted more than once or 
copied more than once locally (I can choose to have multiple local copies for 
my own reasons, of course, but darcs won't impose this on me).  I don't even 
have to decide in advance which configurations I might be interested in -- 
the configuration definitions and associated patch sets are likely to be 
small enough (since programmers can create patches only at a certain rate) 
that I don't have to worry about them.  And once I have everything, I can 
(but don't have to) look at all of it in as much detail as I wish before 
deciding what to use.

Is this picture getting any clearer?

-- Scott

