|
Choosing A Webhost:
A web hosting service is a type of Internet hosting service that allows individuals and organizations to provide their own website accessible via the World Wide Web. Web hosts are companies that provide space on a server they own for use by their clients as well as providing Internet connectivity, typically in a data center. Web hosts can also provide data center space and connectivity to the Internet for servers they do not own to be located in their data center, called colocation. more...
|
Re: Support for binary files, scalability and Windows port: msg#00055
|
Subject: |
Re: Support for binary files, scalability and Windows port |
Nathaniel Smith wrote:
On Sat, Jan 17, 2004 at 01:11:14AM -0500, graydon hoare wrote:
for a heavily edited file, it'll slowly get worse, but maybe you could
have a "defragment" routine which builds some fresh blocks (especially
if a bunch of blocks appear with refcount=1; might as well toss them)
How does this interact with hash tree synchronization? I.e., what
happens if I do
$ monotone sync
$ monotone db defrag
$ monotone sync
? Will this cause problems, since efficient hash tree synchronization
seems to depend on both sides using the same blocks?
no, shouldn't cause any problem. I'm forseeing doing a sync on about 5
different collections: blocks, files, manifests, certs, and keys. files
would be addressed by SHA1 of the file, blocks by SHA1 of the block,
etc. defragmenting a file would be a concern of the storage manager, but
wouldn't change the file's SHA1. it might, in some cases, add a new
block full of coalesced constants. but that's a *good* thing, because
then they can be reused by other files storage representations :)
as background (another email asked "how this stuff works"): the idea
with a hashtree (a.k.a. merkle tree) is that you and I both store our
entire set of "objects" (say, blocks or files or whatever) in an N-ary
tree where each leaf is positioned in the tree at some unique spot (in
our case the tree represents the entire space of SHA1, each branch
peeling off some number of bits, and the leaf's position at the bottom
of the tree is determined by content hash). each interior node's slots
are the hashes of the subnodes beneath it. the hash algorithm used for
interior nodes isn't necessarily related to the one pointing out to
leaves, but we might as well use SHA1 again.
anyways, when you and I want to sync, I send you my root node, you can
see all the slots in it which are non-equal to *your* values for those
slots: those are 1/N subtrees which have "a difference" in them,
somewhere. so for each of those which is different, you send me *your*
subtree node, and then I look at them and pick out the 1/(N*2) subtrees
which contain the differences. we bounce back and forth at worst
log_N(X) times where X is the size of the space we're spanning (I'm
thinking a 256-ary tree over SHA1: 20 levels, so at worst 10 round
trips) and exchange at worst O(D*K) bytes overhead locating the K
different objects. then we just transmit the objects missing from each
party's collection, having isolated them. there are some cheap hacks
used to make the average case costs collapse to the load of the tree
rather than its depth (so, more likely say 3 or 4 round trips), but even
in the worst case it's very efficient, and it's easy to pipeline.
note that this is a very general algorithm for synchronizing two
collections of arbitrary things. the "tree" here has nothing to do with
a filesystem tree or the xdelta/suffix-tree storage representation for a
file. those are orthogonal.
merkle trees over sets with not much in common are basically useless;
they add overhead (the tree structure, round trips, and hashing) and you
will wind up exchanging a full list of objects anyways. might as well
just send a full list. where they are useful is when two parties have
*nearly identical* large collections of objects, with just a few
differences since the last synchronization. then you can narrow it down
to the set of differences in much less traffic than a complete listing.
-graydon
|
| |