Hello,
I've been trying to improve unlink speed under ext3, because we are
getting huge slowdowns at the end of scientific compute jobs on Lustre,
when a thousand nodes each delete a bunch of files 10-50MB in size.
I noticed the blocks_for_truncate() fix from 2.5 has just made it into
2.4.21-pre5 (I was on the verge of sending the identical patch just now).
This won't fix our unlink speed problem though, because these inodes
would need the maximum transaction size for truncates anyways. At most
we would do a bit of extra reservation at the end of the truncate.
For the unlink speed issue I'm creating a thread which runs in the
background that gets handed inodes to delete/truncate (after the name
has been removed, and the inode added to the orphan list) and inodes
are deleted asynchronously. This lets the applications proceed without
waiting for 500GB of data to be unlinked/journaled, yet is safe from any
crash issues because the inode has already been linked into the orphan list
(it would make for a long recovery on crash, but that is no different
than before). Most of the scientific apps are bursty writers, so deferring
the unlinks to a background thread is an overall win.
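
As a rough illustration of the hand-off described above, here is a minimal user-space sketch using POSIX threads. It is not the actual ext3/Lustre patch: the names del_queue, dq_init(), dq_submit(), dq_worker() and dq_stop() are hypothetical, and the slow truncate is only a placeholder comment. The caller is assumed to have already removed the name and put the inode on the orphan list before calling dq_submit(), which is what keeps the scheme crash-safe.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* One pending deletion (hypothetical structure, for illustration only). */
struct del_req {
    uint32_t ino;
    struct del_req *next;
};

struct del_queue {
    pthread_mutex_t lock;
    pthread_cond_t wake;
    struct del_req *head, *tail;
    int stopping;      /* set once no more requests will be submitted */
    uint32_t done;     /* number of inodes processed by the worker */
};

static void dq_init(struct del_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->wake, NULL);
    q->head = q->tail = NULL;
    q->stopping = 0;
    q->done = 0;
}

/* Called from the unlink path: queue the inode and return immediately. */
static int dq_submit(struct del_queue *q, uint32_t ino)
{
    struct del_req *r = malloc(sizeof(*r));

    if (!r)
        return -1;
    r->ino = ino;
    r->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = r;
    else
        q->head = r;
    q->tail = r;
    pthread_cond_signal(&q->wake);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Background thread: drain the queue, exit once stopped and empty. */
static void *dq_worker(void *arg)
{
    struct del_queue *q = arg;
    struct del_req *r;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (!q->head && !q->stopping)
            pthread_cond_wait(&q->wake, &q->lock);
        if (!q->head)
            break;                      /* stopping and fully drained */
        r = q->head;
        q->head = r->next;
        if (!q->head)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        /* Placeholder: the slow truncate of r->ino and its removal
         * from the orphan list would happen here, outside the lock. */
        free(r);
        pthread_mutex_lock(&q->lock);
        q->done++;
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Ask the worker to finish the remaining requests, then wait for it. */
static void dq_stop(struct del_queue *q, pthread_t worker)
{
    pthread_mutex_lock(&q->lock);
    q->stopping = 1;
    pthread_cond_signal(&q->wake);
    pthread_mutex_unlock(&q->lock);
    pthread_join(worker, NULL);
}
```

The submitter only pays for a list append under a mutex, so the application-visible unlink latency no longer depends on file size. The real work is still done in full, just later.
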
What I would _really_ like to do is reduce the amount of data that
truncate needs to write, instead of (or in addition to) only pushing
the truncate latency into the background. The issue with truncate is
that it can possibly need huge transaction sizes, and we need to be
able to restart the transaction in the middle if we can't get a single
transaction large enough to do the entire truncate.
However, it is also possible to do the minimum amount of updates required
to keep a consistent filesystem in a single transaction and write out a
lot less data if we _can_ get a single transaction large enough to hold
the data, and we know how large that transaction needs to be. Basically,
we would need to count the number of bitmaps that were being modified,
the group descriptor blocks, add an inode table block, and the superblock,
and get a transaction of that size and we can do a consistent truncate.
We don't need to zero all of the {d,t}indirect blocks that we are dropping
entirely, and that is by far the largest amount of data to be written
(1/1024 of the size of the file, vs ~1/32768 for the bitmaps).
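
To make the counting above concrete, here is a small sketch of the arithmetic. It is not kernel code: struct blk_run, indirect_blocks() and min_truncate_credits() are hypothetical names, and it assumes 4kB blocks, 4-byte block pointers, 12 direct blocks per inode, 32-byte group descriptors, and a first data block of 0. The first function counts the {,d,t}indirect blocks a file of a given size carries (the ~1/1024 figure). The second counts the distinct bitmap and group descriptor blocks touched by a list of block runs, plus one inode table block and the superblock.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLKSZ            4096u
#define PTRS_PER_BLOCK   (BLKSZ / 4u)   /* 1024 block pointers per indirect block */
#define BLOCKS_PER_GROUP (BLKSZ * 8u)   /* one bitmap block covers 32768 blocks */
#define DESC_PER_BLOCK   (BLKSZ / 32u)  /* 128 group descriptors per block */
#define NDIRECT          12u            /* direct block pointers in the inode */

/* A run of physically contiguous blocks being freed (hypothetical). */
struct blk_run {
    uint32_t start;
    uint32_t len;
};

/* Number of indirect, dindirect and tindirect blocks for nblocks data blocks. */
static uint64_t indirect_blocks(uint64_t nblocks)
{
    const uint64_t p = PTRS_PER_BLOCK;
    uint64_t total, n, in_d;

    if (nblocks <= NDIRECT)
        return 0;
    n = nblocks - NDIRECT;
    total = 1;                          /* the single indirect block */
    if (n <= p)
        return total;
    n -= p;
    in_d = n < p * p ? n : p * p;       /* data blocks under the dindirect */
    total += 1 + (in_d + p - 1) / p;    /* dindirect + its indirect blocks */
    n -= in_d;
    if (n == 0)
        return total;
    /* tindirect block + its dindirect blocks + their indirect blocks */
    total += 1 + (n + p * p - 1) / (p * p) + (n + p - 1) / p;
    return total;
}

/*
 * Blocks in the minimal consistent truncate transaction: one per distinct
 * block bitmap, one per distinct group descriptor block, plus one inode
 * table block and the superblock. Returns 0 on error.
 */
static uint32_t min_truncate_credits(const struct blk_run *runs, size_t nruns,
                                     uint32_t ngroups)
{
    uint32_t ndesc = (ngroups + DESC_PER_BLOCK - 1) / DESC_PER_BLOCK;
    unsigned char *grp_seen, *desc_seen;
    uint32_t credits = 2;               /* inode table block + superblock */
    uint64_t first, last, g;
    size_t i;

    if (ngroups == 0)
        return 0;
    grp_seen = calloc(ngroups, 1);
    desc_seen = calloc(ndesc, 1);
    if (!grp_seen || !desc_seen) {
        free(grp_seen);
        free(desc_seen);
        return 0;
    }
    for (i = 0; i < nruns; i++) {
        if (runs[i].len == 0)
            continue;
        first = runs[i].start / BLOCKS_PER_GROUP;
        last = ((uint64_t)runs[i].start + runs[i].len - 1) / BLOCKS_PER_GROUP;
        for (g = first; g <= last && g < ngroups; g++) {
            if (grp_seen[g])
                continue;
            grp_seen[g] = 1;
            credits++;                  /* this group's block bitmap */
            if (!desc_seen[g / DESC_PER_BLOCK]) {
                desc_seen[g / DESC_PER_BLOCK] = 1;
                credits++;              /* descriptor block holding this group */
            }
        }
    }
    free(grp_seen);
    free(desc_seen);
    return credits;
}
```

Under these assumptions, a 50MB file (12800 blocks) allocated inside one group has 14 indirect blocks but needs only 4 blocks in the minimal transaction. For much larger files the indirect count grows as roughly 1/1024 of the data blocks while the bitmap count grows as roughly 1/32768, which is where the 1/32 ratio comes from. In practice the runs would include the indirect blocks themselves, since those are freed too.
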
Depending on the journal size, there is obviously an upper limit to
the size of a file (and/or number of allocated blocks) for which this
is practical (after which we would need to go to the slow method),
but I think it can make a huge difference in performance. For Lustre,
we are of course willing to have huge journals in order to encompass
the truncate of large files in a single operation if it means we need
to journal 1/32 as much data to disk. The fringe side benefit would be
that we can leave the allocated block lists in the inode (like ext2 does),
and we could get undelete for ext3 working again.
My only worry is the number of headaches I'll get looking at
extN_free_branches() and worrying about how to get partial truncate right
(which can be optimized similarly, with the addition of at most a single
indirect, dindirect, and tindirect block to the transaction size). I
think the performance hit of reading all of the block allocation lists
to count bitmaps/descriptors will be significantly less than the cost of
writing them all out again. We will need to read all of these allocation
lists anyways, so it is mostly just a pre-read for the real truncate.
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
-------------------------------------------------------
This SF.net email is sponsored by:Crypto Challenge is now open!
Get cracking and register here for some mind boggling fun and
the chance of winning an Apple iPod:
http://ads.sourceforge.net/cgi-bin/redirect.pl?thaw0031en
Thread at a glance:
Previous Message by Date:
Re: [PATCH] concurrent block allocation for ext2 against 2.5.64
> First of all, thanks for this work, Alex. It's been a long time in coming.
>
> One thing I would wonder about is whether we should be implementing this in
> ext2, or in ext3 only. One of the decisions we made in the past is that we
> shouldn't necessarily implement everything in ext2 (especially features that
> complicated the code, and are only useful on high-end systems).
>
> There was a desire to keep ext2 small and simple, and ext3 would get the
> fancy high-end features that make sense if you have a large filesystem
> that you would likely be using in conjunction with ext3 anyways.
Errrm ... if you want to start advocating that sort of thing, I suggest
you make ext3 usable on high end systems first. At the moment, that makes
no sense whatsoever. Ext3 still doesn't scale to big systems.
M.
Next Message by Date:
RE: Problems with using -1 as the (directory) EOF cookie
I apologize if you get this mail twice. I am having some problems
sending this email out.
> -----Original Message-----
> From: Theodore Ts'o [mailto:tytso@xxxxxxx]
> Sent: Monday, March 10, 2003 10:57 PM
> To: Trond Myklebust
> Cc: Jeremy Fitzhardinge; Ext2 devel
> Subject: Re: [Ext2-devel] Problems with using -1 as the (directory) EOF cookie
>
>
> The problem is that if we don't do this, we're going to have to
> artificially restrict our hash space so that 0x7ffffff is no longer a
> valid hash, and use that as the EOF marker. I need to do some more
> thinking about it, but I *think* I can do this without needing to
> define new hash types, by simply folding hashes that used to be
> 0x7ffffff into the hash value 0x7fffffe.
>
> There is a slight backwards compatibility problem after we make this
> change, but it only applies if there are sufficient hash collisions in
> the 31-bit major hash number space such that the hash collisions fill
> an entire directory block. That's rare enough, and the number of
> people who have deployed htree are small enough (and it's unlikely
> that people who have deployed a 2.5/2.6 kernel would be likely to install
> an older kernel, which is where the compatibility problem would
> arise), that we might be able to get away with it.
Is it possible to make the hash 0x7ffffff to 0x7fffffe folding hack
local to the readdir related functions? If that is OK, we don't need
to change the disk layout. The rest of the htree code doesn't care
about this limit at all.
That will need to change ext3_htree_next_block(), when start_hash
is 0x7fffffe, we need to look for 0x7ffffff as if it is a continuation
hash. And we fold the hash there.
Any reason we can't do it that way?
Chris
Previous Message by Thread:
Re: Re: [Bug 417] New: htree much slower than regular ext3
Hi,
On Thu, 2003-02-27 at 21:00, Andreas Dilger wrote:
> I've got a patch which should help here, although it was originally written
> to speed up the "create" case instead of the "lookup" case. In the lookup
> case, it will do a pre-read of a number of inode table blocks, since the cost
> of doing a 64kB read and doing a 4kB read is basically the same - the cost
> of the seek.
No it's not --- you're evicting 16 times as much other
potentially-useful data from the cache for each lookup. You'll improve
the "du" or "ls -l" case by prefetching, but you may well slow down the
overall system performance when you're just doing random accesses (eg.
managing large spools.)
It would be interesting to think about how we can spot the cases where
the prefetch is likely to be beneficial, for example by observing
"stat"s coming in in strict hash order.
--Stephen
Next Message by Thread:
Re: Speeding up truncate
Hi,
On Fri, 2003-03-14 at 18:40, Andreas Dilger wrote:
> I've been trying to improve unlink speed under ext3, because we are
> getting huge slowdowns at the end of scientific compute jobs on Lustre,
> when a thousand nodes each delete a bunch of files 10-50MB in size.
...
> What I would _really_ like to do is reduce the amount of data that
> truncate needs to write, instead of (or in addition to) only pushing
> the truncate latency into the background. The issue with truncate is
> that it can possibly need huge transaction sizes, and we need to be
> able to restart the transaction in the middle if we can't get a single
> transaction large enough to do the entire truncate.
Indeed. One thing on my todo list is to go through the buffer-forget
code in the truncate paths and see if we can safely be any more
aggressive about unpinning buffers from a transaction while we're
truncating. Even when we can't do that, we should be able to be sure
that writeback doesn't happen, although the buffers may make it into the
journal.
The _real_ win is extent maps, though. That eliminates all the extra
indirect tree IOs for truncate, including the initial reads, which you
can't get away from if you're just doing buffer-forget optimisations.
> My only worry is the number of headaches I'll get looking at
> extN_free_branches() and worrying about how to get partial truncate right
> (which can be optimized similarly, with the addition of at most a single
> indirect, dindirect, and tindirect block to the transaction size).
Yep, partial truncates are really nasty in general. That's one of the
reasons I'd like to see if we can simply optimise the buffer-forget
logic: by the time we get there, we know that the rest of the code has
already dealt with partial truncates and/or allocation collisions.
Cheers,
Stephen