Subject: Re: sbp2: sbp2util_node_write_no_wait failed - msg#00012
List: linux.kernel.firewire.user
On Saturday 05 November 2005 01:36, Michael Brade wrote:
> > > It doesn't seem justified to lock up for 30 seconds
> > > since a new label could be available much earlier. But that's just my
> > > guess.
> >
> > These pauses aren't spent locked-up in sbp2. It is the period that the
> > SCSI subsystem waits for completion of a task.
>
> I know... do you think I would put something at risk if I'd lower the
> timeout to, say, 5 seconds? 30 seconds is *really* annoying.
Well, I just did it now, I put
#define SD_TIMEOUT (7 * HZ)
in drivers/scsi/sd.c and I can tell you, finally it's fun again to work with
this hd :-) I'll see if it does any harm...
Cheers,
--
Michael Brade; KDE Developer, Student of Computer Science
|-mail: echo brade !#|tr -d "c oh"|s\e\d 's/e/\@/2;s/$/.org/;s/bra/k/2'
°--web: http://www.kde.org/people/michaelb.html
KDE 3: The Next Generation in Desktop Experience
Previous Message by Date:
Re: sbp2: sbp2util_node_write_no_wait failed
On Friday 04 November 2005 23:13, Stefan Richter wrote:
> Michael Brade wrote:
> > Heh, good news, debugging finished! :-)
>
> Let's say, _nearly_ finished.
Ok, fair enough, however, you didn't tell me yet what to do next...
> There are 64 tlabels, therefore what you have seen is that sbp2 added
> ORBs and rung the doorbell 64 times while the target did not finish
> these 64 small transactions to the doorbell register with a response.
Yep, but what I don't quite understand yet is when exactly it happens. It
seems that I can copy one huge file almost without errors, maybe one or two,
rarely more. But when I write a lot of small files and try to do some reading
in between, the error happens every 10 seconds or worse, with the odd exception
to the rule.
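
[Editor's sketch] The exhaustion Stefan describes can be modelled in a few
lines of plain C. This is only an illustration under stated assumptions, not
the ieee1394 or sbp2 code: struct tlabel_pool, tlabel_get_nowait() and
tlabel_put() are hypothetical names, and the only fact taken from the thread
is that there are 64 tlabels. A caller that must not sleep can only try once,
so the 65th outstanding write fails instead of waiting.

```c
#include <stdint.h>

#define TLABEL_COUNT 64   /* number of transaction labels, as stated in the thread */

/* Hypothetical pool: bit i set means tlabel i belongs to a write that the
 * target has not yet answered. */
struct tlabel_pool {
    uint64_t in_use;
};

/* Non-blocking allocation, the only kind possible in atomic context.
 * Returns a label in 0..63, or -1 when all 64 labels are outstanding. */
static int tlabel_get_nowait(struct tlabel_pool *pool)
{
    for (int i = 0; i < TLABEL_COUNT; i++) {
        uint64_t bit = (uint64_t)1 << i;

        if (!(pool->in_use & bit)) {
            pool->in_use |= bit;
            return i;
        }
    }
    return -1;   /* pool exhausted: the caller has to give up */
}

/* Release a label once the response to its transaction has arrived. */
static void tlabel_put(struct tlabel_pool *pool, int label)
{
    if (label >= 0 && label < TLABEL_COUNT)
        pool->in_use &= ~((uint64_t)1 << label);
}
```

With a zero-initialised pool, 64 calls to tlabel_get_nowait() succeed and the
next one returns -1 until some earlier label is handed back with tlabel_put().
Many small requests issued back to back fill such a pool faster than one long
transfer would, which may be related to the pattern described above.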
> > Any idea what to fix?
>
> I am unsure.
>
> Idea 1:
>
> Maybe we could move the initiator's part of the protocol (in particular,
> writes to command block agent register and writes to doorbell register)
> out of atomic context, e.g. into an additional kthread or into a
> workqueue. That would let sbp2 wait for availability of a tlabel.
That sounds about good to me ;-)
> Moreover, as I said in an earlier reply, there are many reports about
> mysterious "aborting sbp2 command" mishaps but only few reports about
> "sbp2util_node_write_no_wait failed" along with command abortions.
Hm, I have no idea how many "aborting sbp2 command" reports there are,
but I found quite a few reports on Google about "sbp2util_node_write_no_wait
failed", none of them with a good answer though.
> Therefore, IMO, we should implement changes of such kind only after we
> understood the more common problem better and have an idea how such
> changes may affect it, ideally cure it.
Ok, is there anything I can do to help there? I mean, I have a system where the
error happens with 100% certainty every time, so as a test bed it's quite
perfect, eh?
> Idea 2:
>
> [...] It won't improve
> performance relative to what you are seeing now, but it would lower the
> risk of data loss.
Well, as far as I can see I haven't lost any data yet. Do you reckon that
could happen or is even likely to happen? Cause then I'd do some backups
pretty soon.
> > It doesn't seem justified to lock up for 30 seconds
> > since a new label could be available much earlier. But that's just my
> > guess.
>
> These pauses aren't spent locked-up in sbp2. It is the period that the
> SCSI subsystem waits for completion of a task.
I know... do you think I would put something at risk if I'd lower the timeout
to, say, 5 seconds? 30 seconds is *really* annoying.
Cheers,
--
Michael Brade; KDE Developer, Student of Computer Science
|-mail: echo brade !#|tr -d "c oh"|s\e\d 's/e/\@/2;s/$/.org/;s/bra/k/2'
°--web: http://www.kde.org/people/michaelb.html
KDE 3: The Next Generation in Desktop Experience
Next Message by Date:
Re: sbp2: sbp2util_node_write_no_wait failed
Michael Brade wrote:
> On Saturday 05 November 2005 01:36, Michael Brade wrote:
> > > > It doesn't seem justified to lock up for 30 seconds
> > > > since a new label could be available much earlier. But that's just my
> > > > guess.
> > >
> > > These pauses aren't spent locked-up in sbp2. It is the period that the
> > > SCSI subsystem waits for completion of a task.
> >
> > I know... do you think I would put something at risk if I'd lower the
> > timeout to, say, 5 seconds? 30 seconds is *really* annoying.
> Well, I just did it now, I put
> #define SD_TIMEOUT (7 * HZ)
> in drivers/scsi/sd.c and I can tell you, finally it's fun again to work with
> this hd :-) I'll see if it does any bad...
Perhaps you need to define this time-out specifically for normal I/O
(that which bothers you most; I think that would be sd_probe, perhaps
sd_prepare_flush too) and keep the standard time-out for the rest, like
spin-up, read capacity, cache sync on device removal, and so on.
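
[Editor's sketch] One way to picture the split suggested above is a second
constant next to SD_TIMEOUT and a small selector. This is a standalone
illustration, not a patch against drivers/scsi/sd.c: SD_IO_TIMEOUT, enum
sd_op and sd_timeout_for() are hypothetical names, HZ is given a stand-in
value because the kernel normally provides it, and the 30-second default is
the figure quoted in this thread.

```c
#ifndef HZ
#define HZ 1000                    /* assumption: timer ticks per second; the kernel supplies the real value */
#endif

#define SD_TIMEOUT    (30 * HZ)    /* standard time-out, the 30 seconds discussed in the thread */
#define SD_IO_TIMEOUT (7 * HZ)     /* hypothetical shorter time-out for normal I/O only */

/* Hypothetical classification of the commands named in the message. */
enum sd_op {
    SD_OP_READ_WRITE,              /* normal I/O */
    SD_OP_SPIN_UP,
    SD_OP_READ_CAPACITY,
    SD_OP_CACHE_SYNC               /* e.g. cache sync on device removal */
};

/* Return the time-out in ticks: shortened for normal I/O, standard otherwise. */
static long sd_timeout_for(enum sd_op op)
{
    return op == SD_OP_READ_WRITE ? SD_IO_TIMEOUT : SD_TIMEOUT;
}
```

Here sd_timeout_for(SD_OP_READ_WRITE) yields 7 * HZ while spin-up, read
capacity and cache sync keep 30 * HZ, so slow housekeeping commands are not
aborted early.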
Note to other readers: Don't do this at home. A shorter SCSI timeout is
only a hack, not a fix for sbp2's problems. It will just cause the
command abortions to happen at higher frequency.
--
Stefan Richter
-=====-=-=-= =-== --=-=
http://arcgraph.de/sr/