osdir.com
mailing list archive

Subject: RE: DRBD8:Panics in lc_find due to null lc after ScsiMEDIUM_ERROR I.O error. - msg#00051

List: linux.kernel.drbd.devel

Phil,
Thank you very much. That was fast! This should be quick to verify:
I just have to run it on my machine with the broken disk. Sweet!
I'll let you know.

EM--

-----Original Message-----
From: drbd-dev-bounces-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx
[mailto:drbd-dev-bounces-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx]
On Behalf Of Philipp Reisner
Sent: Wednesday, January 31, 2007 12:26 PM
To: drbd-dev-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8:Panics in lc_find due to null lc after
ScsiMEDIUM_ERROR I.O error.

On Wednesday, 31 January 2007 14:58, Montrose, Ernest wrote:

Just for the record, this triggered the following commit:
http://lists.linbit.com/pipermail/drbd-cvs/2007-January/001458.html

-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
_______________________________________________
drbd-dev mailing list
drbd-dev-cunTk1MwBs8qoQakbn7OcQ@xxxxxxxxxxxxxxxx
http://lists.linbit.com/mailman/listinfo/drbd-dev


Thread at a glance:

Previous Message by Date:

Re: any chance for a drbdmeta backport to 0.7?

On Wednesday, 31 January 2007 14:35, Philipp Hug wrote:
> Hello everyone,
>
> I got a bug report/feature request to implement a feature to delete
> existing metadata on a disk.
> This feature seems to be implemented in 0.8. Any chance this will be
> backported?
>
> See here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=409161
>

Actually, the drbdmeta that is included in drbd-8.0.0 can deal with
drbd-0.7's metadata as well. It can also delete (actually re-initialize) it.

I do not feel like including it in the 0.7 tarball... This is left as an
exercise for someone else.

-Phil

Previous Message by Thread:

DRBD8:Panics in lc_find due to null lc after Scsi MEDIUM_ERROR I.O error.

This one is automatic. During a read an I/O error occurred. The specific error is ATA stat/err 0x51/40 translated into SCSI error 0x3/11/04 or Medium error/Read error/auto-reallocation failed. This error appears to be an ATA uncorrectable ECC error, so though the read may have completed, the data is not validated. This is just for completeness in helping to understand this. The main issue is that drbd ends up with a NULL lc. Here is the stack trace. Also see Simon Graham's very good analysis after:

Jan 30 11:05:46 bo kernel: ------------[ cut here ]------------
Jan 30 11:05:46 bo kernel: kernel BUG at /sandbox/emontros/devel/trunk_drbd8/platform/drbd/src/drbd/lru_cache.c:120!
Jan 30 11:05:46 bo kernel: invalid opcode: 0000 [#1]
Jan 30 11:05:46 bo kernel: SMP
Jan 30 11:05:46 bo kernel: Modules linked in: drbd cn bridge ipv6 ipmi_devintf ipmi_si ipmi_msghandler binfmt_misc dm_mirror video thermal processor fan container button battery ac hw_random i2c_i801 i2c_core shpchp pci_hotplug e1000 piix ide_cd cdrom raid1 dm_mod ide_disk ata_piix libata sd_mod scsi_mod
Jan 30 11:05:46 bo kernel: CPU:    0
Jan 30 11:05:46 bo kernel: EIP:    0061:[<ee3ad1c4>]    Tainted: GF    VLI
Jan 30 11:05:46 bo kernel: EFLAGS: 00010046  (2.6.16.29-xen #1)
Jan 30 11:05:46 bo kernel: EIP is at lc_find+0x44/0x50 [drbd]
Jan 30 11:05:46 bo kernel: eax: 00000000  ebx: 00000000  ecx: ec8e13b0  edx: 00000058
Jan 30 11:05:46 bo kernel: esi: 00000058  edi: ec8e13b0  ebp: c59b9f08  esp: c59b9f00
Jan 30 11:05:46 bo kernel: ds: 007b  es: 007b  ss: 0069
Jan 30 11:05:46 bo kernel: Process drbd1_worker (pid: 6253, threadinfo=c59b8000 task=c586e570)
Jan 30 11:05:46 bo kernel: Stack: <0>00000058 ec8e1000 c59b9f44 ee3ac5bd c59b9f44 00000000 c586e570 c0137100
Jan 30 11:05:46 bo kernel:        c59b9f20 00000058 00000000 00000000 002c0000 00000000 eb121e74 ec8e1000
Jan 30 11:05:46 bo kernel:        ec8e1000 c59b9f74 ee39cd16 c59b9f5c c59b9f74 00000005 ed6e6820 ec8e102c
Jan 30 11:05:46 bo kernel: Call Trace:
Jan 30 11:05:46 bo kernel:  [<c0105431>] show_stack_log_lvl+0xa1/0xe0
Jan 30 11:05:46 bo kernel:  [<c0105621>] show_registers+0x181/0x200
Jan 30 11:05:46 bo kernel:  [<c0105840>] die+0x100/0x1a0
Jan 30 11:05:46 bo kernel:  [<c0105961>] do_trap+0x81/0xc0
Jan 30 11:05:46 bo kernel:  [<c0105c45>] do_invalid_op+0xa5/0xb0
Jan 30 11:05:46 bo kernel:  [<c0105097>] error_code+0x2b/0x30
Jan 30 11:05:46 bo kernel:  [<ee3ac5bd>] drbd_rs_complete_io+0x5d/0x130 [drbd]
Jan 30 11:05:46 bo kernel:  [<ee39cd16>] w_e_end_rsdata_req+0x26/0x390 [drbd]
Jan 30 11:05:46 bo kernel:  [<ee39dcae>] drbd_worker+0x2de/0x4b5 [drbd]
Jan 30 11:05:46 bo kernel:  [<ee3b010c>] drbd_thread_setup+0x8c/0x100 [drbd]
Jan 30 11:05:46 bo kernel:  [<c0102ec5>] kernel_thread_helper+0x5/0x10
Jan 30 11:05:46 bo kernel: Code: c3 ff ff ff 8b 44 83 4c eb 0d 8b 10 0f 18 02 90 39 70 14 74 08 89 d0 85 c0 75 ef 31 c0 5b 5e 5d c3 0f 0b 79 00 30 f2 3b ee eb d0 <0f> 0b 78 00 30 f2 3b ee eb bf 89 f6 55 31 d2 89 e5 53 39 00 74
Jan 30 11:05:46 bo kernel:  <0>Fatal exception: panic in 5 seconds

Here is Simon's original analysis that may help track this:

Looks like another instance of the same bug we fixed in the data path – you can’t look at mdev->resync or mdev->act_log without first getting a local reference on the mdev… the way they fixed it previously was to make sure this is done before calling drbd_al_complete_io in all cases – just need to do the same with drbd_rs_complete_io as well, I think.

There are three places where I think this is not done:

1. got_NegRSDReply in drbd_receiver.c
2. w_make_resync_request in drbd_worker.c
3. w_e_end_rsdata_req - this is the one we actually crashed on here.

Taking each one in turn:

1. got_NegRSDReply -- possibly add inc_local_if_state()/dec_local()? Make sure you still call dec_rs_pending() though -- only drbd_rs_complete_io() and drbd_rs_failed_io() should be protected with the inc_local_if_state(Failed).

2. w_make_resync_request -- this one I think we need to defer to Linbit; this is a complex routine and it's possible it won't ever be called when the local disk has been detached (the detach should stop the resync!) - However, I am concerned there is a race condition where we could be in this routine when the disk gets detached and we have no local ref on the mdev.

3. w_e_end_rsdata_req -- this is run as a worker item at the end of processing a resync data request -- the bio completion routine is drbd_endio_read_sec and this is decrementing the local count on the mdev before the work item runs -- I think the right fix here is to move the dec_local() from this routine into w_e_end_rsdata_req after dec_unacked() is called (note that there are TWO exit paths from this routine that do this - fix both!)

Note that we should report this asap to Linbit -- there may be some other places where mdev->resync is accessed without the proper protection...
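
For readers following the analysis above, here is a minimal userspace sketch of the "take a local reference before touching the cache" pattern it describes. It is not DRBD code and not the fix that was committed: every name in it (mdev_model, lru_model, get_local, put_local, rs_complete_io, end_rsdata_req, detach_disk) is a hypothetical stand-in invented for illustration, and it is single-threaded, so it does not model the race mentioned for w_make_resync_request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real DRBD structures or API. */
struct lru_model { int unused; };

struct mdev_model {
    int local_cnt;             /* references currently held on the local disk */
    bool disk_attached;        /* false once the disk has failed or detached */
    struct lru_model *resync;  /* set to NULL when the disk is detached */
    int completed;             /* number of completions that reached the cache */
};

/* Take a reference only while the disk is still usable. */
static bool get_local(struct mdev_model *m)
{
    if (!m->disk_attached)
        return false;
    m->local_cnt++;
    return true;
}

/* Drop a reference taken with get_local(). */
static void put_local(struct mdev_model *m)
{
    assert(m->local_cnt > 0);
    m->local_cnt--;
}

/* Stands in for the cache lookup that hit the BUG: it must never see NULL. */
static void rs_complete_io(struct mdev_model *m)
{
    assert(m->resync != NULL);
    m->completed++;
}

/* Guarded caller: the cache is only touched while a reference is held. */
static bool end_rsdata_req(struct mdev_model *m)
{
    if (!get_local(m))
        return false;          /* disk is gone, skip the cache entirely */
    rs_complete_io(m);
    put_local(m);              /* released only after the cache access */
    return true;
}

/* Models the disk going away after an I/O error. */
static void detach_disk(struct mdev_model *m)
{
    m->disk_attached = false;
    m->resync = NULL;
}
```

In this model, calling end_rsdata_req() after detach_disk() returns false instead of reaching the cache with a NULL pointer. A real implementation would also have to make the detach path wait until the reference count drops to zero before freeing anything, which this sketch leaves out.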