Subject: Re: Re: Shrinking ext3 directories

On Jun 20, 2002 18:31 +0200, Daniel Phillips wrote:
> Here's some very skeletal pseudocode for the coalesce algorithm we've
> been talking about. It should expand into about 200 lines without the
> recursive index coalesce, which we can be lazy about and implement
> later. I didn't handle the corner cases where the upper or lower leaves
> don't exist. I also didn't say anything about what to do when they're
> not in the same index block - we might as well treat that the same as
> the 'doesn't exist' case for now.

One things of note - it is currently impossible to have a directory with
an i_size larger than 2GB, because i_size_high is really i_dir_acl
overloaded. Even though we don't need i_dir_acl anymore with EA's, all
of the older kernels would break if they mounted such a filesystem.
Whether they would be broken anyways trying to read a 2GB+ directory
is another question entirely.

In any case, my point was that we actually can mask (32 - blocksize_bits + 1)
bits off the top of the block number for our use. We _could_ then
encode the exact fullness of each directory block into this space, and
still have one bit to spare. I don't think we need this much accuracy,
since we are only lazily updating this field anyways, so this would give
us some bits for future usage if needed. Given that the minimum
dir_entry size is 12 bytes, we can shave at least 3 bits off the hint
with no real loss, but I think we could just count multiples of 128
bytes (so we have blocksize_bits - 7 bits of hint), leaving us 8 bits
for future use.

I'm not set on this idea, I just thought I would bring it up.

> All the coalesce action is in delete:
> <we just removed an entry>
> fullness = <added up usage below entry before + add up usage above>
> fullness_shift = blocksize_bits - 8;
> threshold = <depends on blocksize>;

Actually, if fullness_shift depends on blocksize_bits (i.e. we don't use
the idea I mentioned above), then the threshold will be a constant, as
the fullness of a block will be a fraction (1/256th) of the size of the
block. We would probably want to set the threshold to leave some
fraction of the block empty, so we don't have to split it again right
away if a new entry is added (e.g. mv foo foo.bak; touch foo).

With my idea above, the threshold would be a constant number of bytes
(multiple of 128 bytes) to remain free after coalesce - 256 or so would
bytes would be enough for about a dozen normal-sized or 1 maximal-length
dir_entry to be added.

> while (1) {

I'm not convinced that this would need to loop here. If we are checking
for coalesce on each delete, two adjacent blocks that sum to < threshold
would have already been coalesced. By doing a coalesce we can only be
increasing the fullness of a block, so it wouldn't improve our chance to
coalesce over the last time an entry was deleted from either of the
other blocks.

> upper = (dx_get_fullness(upper_entry()) << fullness_shift);
> lower = (dx_get_fullness(lower_entry()) << fullness_shirt);
> if (max(upper, lower) + fullness > threshold)
> break;
> merge_entry = upper < lower? lower_entry: upper_entry;
> if (fullness + compute_fullness(merge_entry) < blocksize) {

Why would this be "< blocksize" instead of "< threshold"? Wouldn't we
want to keep some free space for new entries here? If we are only ever
deleting entries, we will still discard all of the leaf blocks when they
and their neighbours shrink a bit more.

> <from bottom to top, compact entries of destination>

Agree with this. It keeps telldir/seekdir sane if you return the hash
value of the last dirent as the "offset" in the directory.

> <recursive coalesce of index blocks goes here>

I'm not sure this would even be worth the code. The number of index
blocks will be tiny, and if we do leaf compaction the "spread" of each
index block will remain relatively constant over prolonged usage.

Cheers, Andreas
Andreas Dilger

Sponsored by:
ThinkGeek at