Subject: RE: DRBD8:Panics in lc_find due to null lc after ScsiMEDIUM_ERROR I.O error. - msg#00051
List: linux.kernel.drbd.devel
Phil,
Thank you very much.That was fast!!..This should be quick to verify!!
Just have to run it
On my machine with the broken disk. Sweet!! I'll let you know.
EM--
-----Original Message-----
From: drbd-dev-bounces-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx
[
mailto:drbd-dev-bounces-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx]
On Behalf Of Philipp Reisner
Sent: Wednesday, January 31, 2007 12:26 PM
To: drbd-dev-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8:Panics in lc_find due to null lc after
ScsiMEDIUM_ERROR I.O error.
Am Mittwoch, 31. Januar 2007 14:58 schrieb Montrose, Ernest:
Just for the records, this triggered this commit:
http://lists.linbit.com/pipermail/drbd-cvs/2007-January/001458.html
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria
http://www.linbit.com :
_______________________________________________
drbd-dev mailing list
drbd-dev-cunTk1MwBs8qoQakbn7OcQ@xxxxxxxxxxxxxxxx
http://lists.linbit.com/mailman/listinfo/drbd-dev
Was this page helpful?
Thread at a glance:
Previous Message by Date:
click to view message preview
Re: any chance for a drbdmeta backport to 0.7?
Am Mittwoch, 31. Januar 2007 14:35 schrieb Philipp Hug:
> Hello everyone,
>
> I got a bug report/feature request to implement a feature to delete
> existing metadata on a disk.
> This feature seems to be implemented in 0.8. Any chance this will be
> backported?
>
> See here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=409161
>
Actually the drbdmeta, that is included in drbd-8.0.0 can deal
with drbd-0.7's metadata as well. It can also delete (actually
re-initialize) it.
I do not feel like including it into the 0.7 tarball...
This is left as exercise for someone else.
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
Previous Message by Thread:
click to view message preview
DRBD8:Panics in lc_find due to null lc after Scsi MEDIUM_ERROR I.O error.
This one is
automatic. During a read an IO errors occured. The specific error is
ATA stat/err 0x51/40 translated into SCSI error 0x3/11/04 or Medium error/Read
error/auto-reallocation failed. This error appears to be a ATA
Uncorectable ECC error...So though the read may have
completed,
the data is not
validated. This is just for completeness in helping to understand this.
The main issue is that drbd ends up with a Null lc. Here
is
the stack
trace. Also see Simon Graham very good analysis after:
Jan 30 11:05:46 bo kernel: ------------[ cut here
]------------Jan 30 11:05:46 bo kernel: kernel BUG at
/sandbox/emontros/devel/trunk_drbd8/platform/drbd/src/drbd/lru_cache.c:120!Jan
30 11:05:46 bo kernel: invalid opcode: 0000 [#1]Jan 30 11:05:46 bo kernel:
SMPJan 30 11:05:46 bo kernel: Modules linked in: drbd cn bridge ipv6
ipmi_devintf ipmi_si ipmi_msghandler binfmt_misc dm_mirror video thermal
processor fan container button battery ac hw_random i2c_i801 i2c_core shpchp
pci_hotplug e1000 piix ide_cd cdrom raid1 dm_mod ide_disk ata_piix libata sd_mod
scsi_modJan 30 11:05:46 bo kernel: CPU: 0Jan 30 11:05:46 bo
kernel: EIP: 0061:[<ee3ad1c4>] Tainted: GF
VLIJan 30 11:05:46 bo kernel: EFLAGS: 00010046 (2.6.16.29-xen
#1)Jan 30 11:05:46 bo kernel: EIP is at lc_find+0x44/0x50 [drbd]Jan 30
11:05:46 bo kernel: eax: 00000000 ebx: 00000000 ecx: ec8e13b0
edx: 00000058Jan 30 11:05:46 bo kernel: esi: 00000058 edi:
ec8e13b0 ebp: c59b9f08 esp: c59b9f00Jan 30 11:05:46 bo kernel:
ds: 007b es: 007b ss: 0069Jan 30 11:05:46 bo kernel: Process
drbd1_worker (pid: 6253, threadinfo=c59b8000 task=c586e570)Jan 30 11:05:46
bo kernel: Stack: <0>00000058 ec8e1000 c59b9f44 ee3ac5bd c59b9f44 00000000
c586e570 c0137100Jan 30 11:05:46 bo kernel:
c59b9f20 00000058 00000000 00000000 002c0000 00000000 eb121e74 ec8e1000Jan
30 11:05:46 bo kernel: ec8e1000 c59b9f74 ee39cd16
c59b9f5c c59b9f74 00000005 ed6e6820 ec8e102cJan 30 11:05:46 bo kernel: Call
Trace:Jan 30 11:05:46 bo kernel: [<c0105431>]
show_stack_log_lvl+0xa1/0xe0Jan 30 11:05:46 bo kernel:
[<c0105621>] show_registers+0x181/0x200Jan 30 11:05:46 bo
kernel: [<c0105840>] die+0x100/0x1a0Jan 30 11:05:46 bo
kernel: [<c0105961>] do_trap+0x81/0xc0Jan 30 11:05:46 bo
kernel: [<c0105c45>] do_invalid_op+0xa5/0xb0Jan 30 11:05:46 bo
kernel: [<c0105097>] error_code+0x2b/0x30Jan 30 11:05:46 bo
kernel: [<ee3ac5bd>] drbd_rs_complete_io+0x5d/0x130 [drbd]Jan 30
11:05:46 bo kernel: [<ee39cd16>] w_e_end_rsdata_req+0x26/0x390
[drbd]Jan 30 11:05:46 bo kernel: [<ee39dcae>]
drbd_worker+0x2de/0x4b5 [drbd]Jan 30 11:05:46 bo kernel:
[<ee3b010c>] drbd_thread_setup+0x8c/0x100 [drbd]Jan 30 11:05:46 bo
kernel: [<c0102ec5>] kernel_thread_helper+0x5/0x10Jan 30
11:05:46 bo kernel: Code: c3 ff ff ff 8b 44 83 4c eb 0d 8b 10 0f 18 02 90 39 70
14 74 08 89 d0 85 c0 75 ef 31 c0 5b 5e 5d c3 0f 0b 79 00 30 f2 3b ee eb d0
<0f> 0b 78 00 30 f2 3b eeeb bf 89 f6 55 31 d2 89 e5 53 39 00 74Jan
30 11:05:46 bo kernel: <0>Fatal exception: panic in 5
seconds
Here is Simon
original analysis that may help track this:
looks like another instance of the same bug we
fixed in the data path – you can’t look at mdev->resync or mdev->act_log
without first getting a local reference on the mdev… the way they fixed it
previously was to make sure this is done before calling drbd_al_complete_io in
all cases – just need to do the same with drbd_rs_complete_io as well I
think.There are three places where I think this is not
done:1. got_NegRSDReply in
drbd_receiver.c2. w_make_resync_request in
drbd_worker.c3. w_e_end_rsdata_req - this is the one
we actually crashed on here.Taking each one in turn:1.
got_NegRSDReply -- possibly add inc_local_if_state()/dec_local()? Make sure you
still call dec_rs_pending() though -- only drbd_rs_complete_io() and
drbd_rs_failed_io() should be protected with the
inc_local_if_state(Failed).2. w_make_resync_request -- this one I think we
need to defer to Linbit; this is a complex routine and it's possible
it wont ever be called when the local disk has been detached (the
detach should stop the resync!) - However; I am concerned there is a
race condition where we could be in this routine when the disk gets
detached and we have no local ref on the mdev.3. w_e_end_rsdata_req -- this
is run as a worker item at the end of processing a resync data request
-- the bio completion routine is drbd_endio_read_sec and this is
decrementing the local count on the mdev before the work item runs --
I think the right fix here is to move the dec_local() from this
routine into w_e_end_rsdata_req after dec_unacked() is called (note
that there are TWO exits paths from this routine that do this - fix
both!)Note that we should report this asap to Linbit -- there may be
some other places where mdev->resync is accessed without the proper
protection...
_______________________________________________
drbd-dev mailing list
drbd-dev-cunTk1MwBs8qoQakbn7OcQ@xxxxxxxxxxxxxxxx
http://lists.linbit.com/mailman/listinfo/drbd-dev