Asger Kunuk Ottar Alstrup wrote:
Yes, that is of course perceivable, but that would defeat the advantage
of a truly distributed system, where a user does not have to be online
to work. I'd prefer another approach. Maybe we will set up separate
databases for different purposes - we can divide the work into sealed
compartments.
yes, I'd recommend the approach -- if possible -- of creating multiple
"collections" of files. I'll briefly describe what I have in mind with
the hashtrees:
each database will have a set of "collections", each of which has a
hashtree associated with it. collections can share members -- they're
working on the same underlying store of blocks and keys and whatnot --
but the hashtree picks out a subset and names it. each collection will
also map to a set of peers which the database syncs that collection with
normally (eg. when you just type "monotone sync <collection>"). one of
the states in a hashtree is called "tombstone", which marks a block you
do not have and do not want, but hashes to the same value as if you had
the block. this makes it possible to expire blocks from your copy of a
collection. of course if you un-expire the block (set from tombstone ->
empty) you will fetch it on the next sync.
in answer to the previous question about distribution costs: obviously
you have to send, at least once, every block you expect to exist at the
other end of a connection. the hashtree is an auxiliary structure used
to discover which blocks are missing from either end of a connection. as
of 2 nights ago, I have a prototype working which synchronizes block
collections in a pair of sqlite databases, over a TCP connection. it
shows pretty good promise; the interactive protocol isn't *quite*
pipelined enough, plus the encodings could use some tuning; it's at
least twice as bulky as it needs to be. even now it finds and sends the
50 missing random 512-byte blocks amongst a collection of 10,000 such,
with only 16k written and 130k received (including the 34.7k worth of
base64-encoded data blocks). it will add a little overhead in the case
of small collections, but the cost curve is a very flat logarithm of the
collection size.
I'm not certain a 256-ary tree is the best fan-out though; I only chose
it because it's easy to prototype in ASCII. a 16-ary tree is equally
easy, so maybe I'll run through that too.. it is a subtle issue: larger
nodes mean fewer round trips and less protocol chatter, but also more
retransmission of hash values for unchanged portions of the tree and a
bit more asymmetry in the transmit/receive load (as in this instance).
I'll see if it's sufficiently easy to make the whole protocol and
hashtree calculator depend on a couple template constants, and try to
write it that way so we can wiggle it around and find a sweet spot.
OK. I tried with VS.NET, but did not have time to complete it. It seems
there are a couple of places where the code uses some non-standard C++
features not supported by VS.NET. Also, it uses a bunch of Unix-only
#includes.
But AFAIK it's nothing that can not be handled with a few days of work -
it should also be possible to use VS6.
I think you might find some of the fancy-pants stuff spirit does with
templates will make VS6 upset. I think that compiler series really only
approximates the ISO standard around version 7 (around the ".net" edition)
-graydon
|
|