Subject: Re: Questions answered by Neil Brown - msg#00236
List: linux.raid
On Wednesday February 26, ptb@xxxxxxxxxx wrote:
>
> What is also puzzling me is that despite the horrible potential for
> what might happen from doing the original users end_io early, I
> can't see any consequences in actual tests!
>
> I am writing stuff to a file on a raid1 mirror mount, then dismantling
> the mount, the device, and rmmoding the driver. Then I put it all back
> again. On remount the data in the file is all perfect. Yet it was built
> with plenty of async writes! Surely the buffers in some of those writes
> should have been pointing nowhere and been full of rubbish?
>
I suspect that mostly you are writing from a cache, and the data will
probably stay around.
To be able to demonstrate a problem you probably need very high memory
pressure so things don't stay in cache long, lots of metadata updates,
and probably some form of journalling filesystem like I mentioned
previously. Having very long latencies for the delayed write would also
make the problem more likely.
Even if you cannot demonstrate a problem, I'm sure one would be
noticed sooner or later if you released this sort of code into the
wild and people used it.
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Previous Message by Date:
Re: Questions answered by Neil Brown
On Wednesday February 26, Paul.Clements@xxxxxxxxxxxx wrote:
> "Peter T. Breuer" wrote:
> >
> > "A month of sundays ago Neil Brown wrote:"
> > > On Monday February 24, Paul.Clements@xxxxxxxxxxxx wrote:
>
> > So it might be enough to chain all the mirror bh's through
> > bh->b_this_page.
This sort of thing is very much a layering violation and should not be
done.
It is similar in some ways to the design decisions in the 2.2 raid
patches which meant they were un-safe for swap or ext3 (at least while
rebuilding).
The current interface does *not* guarantee that the data will be
around after b_end_io is called, and does not provide any way to ask
for the data to be left around.
You might be able to fake it in some cases, but not all.
A particular example that springs to mind is the descriptor block used
when writing out an ext3 journal entry. Once the block is written,
the descriptor block is of no interest to the kernel and will very
likely be re-used. You cannot stop this.
You *have* to copy the data.
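
A minimal user-space sketch of that point, to make the ordering concrete.
It is not raid1 code: the struct and function names are hypothetical, and
malloc() merely stands in for whatever pre-allocated pool a real driver
would have to use. The only thing it shows is that the private copy is
taken before the caller is told the write has completed, so the caller
re-using its buffer afterwards cannot affect the delayed write.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the caller's request: the buffer belongs to
 * the caller and may be re-used as soon as completion is signalled. */
struct write_req {
    const void *data;
    size_t      len;
};

/* Hypothetical private copy owned by the mirroring layer. */
struct mirror_copy {
    void  *data;
    size_t len;
};

/* Take a private copy of the payload. This must happen BEFORE the
 * caller's completion callback runs. Returns 0 on success, -1 if no
 * memory is available (a real driver could not simply fail here). */
static int mirror_copy_take(struct mirror_copy *c, const struct write_req *req)
{
    c->data = malloc(req->len);
    if (c->data == NULL) {
        c->len = 0;
        return -1;
    }
    memcpy(c->data, req->data, req->len);
    c->len = req->len;
    return 0;
}

/* Release the copy once the delayed write to the mirror has finished. */
static void mirror_copy_release(struct mirror_copy *c)
{
    free(c->data);
    c->data = NULL;
    c->len = 0;
}
```

The delayed write would then be issued from the copy, never from the
caller's buffer, and mirror_copy_release() called from its completion.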
>
> > I believe that currently this field is just set to "1" in
> > raid1_make_request().
>
> Yeah, I sure wish I knew who did that and why. I wonder if someone had a
> clever plan to use that field at some point, but never got around to it.
> Setting that field to something besides a real address sure does seem
> odd...and I can't see that it's ever used anywhere.
The bh which gets b_this_page set to 1 is a bh that is internal to
raid1. It is allocated by raid1 and never used by any filesystem or
buffer cache or anything. Thus raid1 can do whatever it pleases with
this field. No-one else should ever look at it.
I suspect it was set to one so that, if some coding error meant that
the buffer cache saw this buffer, then it would oops pretty quickly.
NeilBrown
>
> --
> Paul
Next Message by Date:
Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
Some observations on the idea of an intent-logging bitmap.
1/ You don't want or need very fine granularity. The value of this is
to speed up resync time. Currently it is limited by drive
bandwidth. If you have lots of little updates due to fine
granularity, you will be limited by seek time.
One possibly reasonable approach would be to pick a granularity
such that it takes about as long to read or write a chunk as it
would to seek to the next one. This probably means a few hundred
kilobytes. i.e. one bit in the bitmap for every hundred K.
This would require a 125K bitmap for a 100Gig drive.
Another possibility would be a fixed size bitmap - say 4K or 8K.
An 8K bitmap used to map a 100Gig drive would be 1.5Meg per bit.
This may seem biggish, but if your bitmap were sparse, resync would
still be much much faster, and if it were dense, having a finer
grain in the bitmap isn't going to speed things up much.
2/ You cannot allocate the bitmap on demand.
Demand happens where you are writing data out, and when writing
data out due to high memory pressure, kmalloc *will* fail.
Relying on kmalloc in the write path is BAD. That is why we have
mempools which pre-allocate.
For the bitmap, you simply need to pre-allocate everything.
3/ Internally, you need to store a counter for each 'chunk' (need a
better word, this is different from the raid chunksize, this is the
amount of space that each bit refers to).
The counter is needed so you know when the bit can be cleared.
This too must be pre-allocated and so further limits the size of
your bitmap.
16 bit counters would use less ram and would allow 33553920 bytes
(65535 sectors) per 'chunk' which, with an 8K bitmap, puts an upper
limit of 2 terabytes per device, which I think is adequate. (that's
per physical device, not per raid array).
Or you could just use 32 bit counters.
4/ I would use device plugging to help reduce the number of times you
have to write the intent bitmap.
When a write comes in, you set the bit in the bitmap, queue the
write on a list of 'plugged' requests, and mark the device as
'plugged'. The device will eventually be unplugged, at which point
you write out the bitmap, then release all the requests to the
lower devices.
You could optimise this a bit, and not bother plugging the device
if it wasn't already plugged, and the request only affected bits
that were already set.
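
Points 2/ to 4/ can be sketched in user-space C roughly as follows. This
is only an illustration under stated assumptions, not md code: all names
are hypothetical, calloc() stands in for a one-off allocation at array
start-up, and the counter is assumed to count writes in flight per
'chunk'. The return value of intent_write_start() says whether a bit was
newly set, i.e. whether the bitmap would have to be written out (and the
device plugged) before the data write is released.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not the md driver's API. */
struct intent_map {
    uint64_t  chunk_sectors; /* sectors covered by one bitmap bit */
    size_t    nchunks;       /* number of bits in use */
    uint8_t  *bits;          /* image that would be written to disk */
    uint16_t *pending;       /* in-memory count of writes in flight */
};

/* Allocate bitmap and counters once, up front, never in the write path.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
static int intent_map_init(struct intent_map *m, uint64_t dev_sectors,
                           uint64_t chunk_sectors)
{
    if (chunk_sectors == 0 || dev_sectors == 0)
        return -1;
    m->chunk_sectors = chunk_sectors;
    m->nchunks = (size_t)((dev_sectors + chunk_sectors - 1) / chunk_sectors);
    m->bits = calloc((m->nchunks + 7) / 8, 1);
    m->pending = calloc(m->nchunks, sizeof(*m->pending));
    if (m->bits == NULL || m->pending == NULL) {
        free(m->bits);
        free(m->pending);
        m->bits = NULL;
        m->pending = NULL;
        return -1;
    }
    return 0;
}

static void intent_map_free(struct intent_map *m)
{
    free(m->bits);
    free(m->pending);
    m->bits = NULL;
    m->pending = NULL;
}

/* Note a write starting at 'sector'.
 * Returns 1 if the bit was newly set (bitmap must reach disk first),
 * 0 if the bit was already set (no bitmap write needed),
 * -1 if the sector is out of range or the 16 bit counter is full. */
static int intent_write_start(struct intent_map *m, uint64_t sector)
{
    uint64_t c = sector / m->chunk_sectors;
    uint8_t mask = (uint8_t)(1u << (c % 8));
    int newly_set;

    if (c >= m->nchunks || m->pending[c] == UINT16_MAX)
        return -1;
    newly_set = (m->bits[c / 8] & mask) == 0;
    m->bits[c / 8] |= mask;
    m->pending[c]++;
    return newly_set;
}

/* Note a write at 'sector' completing on all mirrors.
 * Returns 1 if this was the last write in flight and the bit was
 * cleared, 0 if other writes remain, -1 on a mismatched call. */
static int intent_write_done(struct intent_map *m, uint64_t sector)
{
    uint64_t c = sector / m->chunk_sectors;

    if (c >= m->nchunks || m->pending[c] == 0)
        return -1;
    if (--m->pending[c] == 0) {
        m->bits[c / 8] &= (uint8_t)~(1u << (c % 8));
        return 1;
    }
    return 0;
}
```

A caller would queue the request and plug the device only when
intent_write_start() returns 1. When it returns 0 the bit is already set,
so the request need not wait for a bitmap write.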
NeilBrown
Previous Message by Thread:
Re: raid1 bitmap code [Was: Re: Questions answered by Neil Brown]
"Paul Clements wrote:"
> That sounds good. I'll have to merge up to 2.6...ugh...now I really need
> to put this all in CVS...I'm making some progress with the mmap stuff
> and I've got the code done for duplicating the buffers for the async
> writes to the backup devices.
> I'll need a pool of memory for that too, since I can't just fail in the
> middle of I/O...
Well, you can. Failing to complete will leave the map marked and
we could fault the device offline too. That will stop any more memory
being wasted because nothing will go to it! We could then make periodic
attempts to bring it online again using the "hotrepair" trick and it'll
resync itself in the background.
Just an idea. Maybe not a good one, but an idea.
Peter