osdir.com
mailing list archive

Subject: RE: DRBD8: Split-brain false positive on Primary/primary potential patch - msg#00010

List: linux.kernel.drbd.devel

Not sure I agree that the current behavior is protecting users from themselves
-- it only causes the split-brain if you lose the network during 'normal'
operation, and there is nothing that protects against mounting a 1-node fs on
both nodes of a primary-primary DRBD cluster.

Running primary-secondary doesn't work if you are in a situation where it is
not possible to switch primaryness when failing over; a good example of that is
if you want to run a Xen virtual machine on top of a DRBD partition and support
live migration of the VM (the problem is that Xen doesn't provide the means to
execute a script to change primaryness at the required point in the migration).
Of course you could argue that this is a Xen bug _but_ pragmatically, the
proposed patch to delay updating the UUID until an actual write occurs
preserves (I believe) correctness in DRBD and works without introducing new
features into Xen.

Recovering from split-brain automatically is of course something that is
incredibly valuable but I think it can be treated orthogonally to the proposed
fix.

Simon

-----Original Message-----
From: Philipp Reisner
[mailto:philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx]
Sent: Thursday, November 16, 2006 4:10 AM
To: drbd-dev-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx
Cc: Montrose, Ernest; Graham, Simon
Subject: Re: [Drbd-dev] DRBD8: Split-brain false positive on Primary/primary
potential patch

On Tuesday, 7 November 2006 00:47, Montrose, Ernest wrote:
> When running Primary/Primary, if the Heartbeat connection goes down, then
> when we recover we always split brain. Simon had an idea which I have
> implemented. He is on vacation so this may not reflect his exact idea.
>
> Essentially, with this change, we do not create a new current UUID on the
> node unless I/O is seen. This prevents a split-brain from being declared
> when both nodes are primary but only one node is originating I/O and never
> the other. The other node is only a stand-by in that case.
>
> Take a look and let me know.
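
The idea in the quoted patch can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not DRBD code: the `Node` class and the `on_connection_loss`, `write`, `is_split_brain` and `simulate` names are hypothetical, and each node is reduced to a single "current UUID" value.

```python
import uuid


class Node:
    """Toy model of one node; every name here is hypothetical."""

    def __init__(self, current_uuid, defer_until_write):
        self.current_uuid = current_uuid
        self.defer_until_write = defer_until_write
        self.connected = True
        self.uuid_bumped = False

    def _bump_uuid(self):
        # Create a new current UUID at most once per disconnected period.
        if not self.uuid_bumped:
            self.current_uuid = uuid.uuid4().hex
            self.uuid_bumped = True

    def on_connection_loss(self):
        self.connected = False
        self.uuid_bumped = False
        if not self.defer_until_write:
            # Behaviour described as current: new UUID as soon as the link is lost.
            self._bump_uuid()

    def write(self):
        if not self.connected and self.defer_until_write:
            # Proposed behaviour: new UUID only once I/O is actually seen.
            self._bump_uuid()


def is_split_brain(a, b, shared_uuid):
    """Both nodes moved away from the UUID they shared before the outage."""
    return a.current_uuid != shared_uuid and b.current_uuid != shared_uuid


def simulate(defer_until_write):
    """Two primaries lose the link; only the active one writes."""
    shared = uuid.uuid4().hex
    active = Node(shared, defer_until_write)
    standby = Node(shared, defer_until_write)
    active.on_connection_loss()
    standby.on_connection_loss()
    active.write()
    return is_split_brain(active, standby, shared)
```

In this model `simulate(False)` reports a split brain even though the stand-by node never wrote anything, while `simulate(True)` does not, which is the false positive the patch is aimed at.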

Hi Ernest,

I understand your reasoning, I see the patch, which I guess does
what you expect of it.

I do not want to do it that way for the following reasons:

* It is only applicable in case you are using a 1-node filesystem
on a primary-primary DRBD cluster.

* I do not want users to do this, because with this setup it is
easily possible to mount the FS on both nodes concurrently.
I want to protect them from themselves ;)

* Users using a 1-node filesystem should use DRBD with the
primary and secondary roles.

* I would rather fix DRBD's split brain recovery methods to deal
with a cluster crash of a primary-primary cluster (actually this
is item 41 in the ROADMAP file)

I have a few hours today, so I will work on this...

-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :


Thread at a glance:

Previous Message by Date:

Re: larger than 4tb volume status

On Thursday, 16 November 2006 12:18, Sylvain Coutant wrote:
> > I hope you do not use a proportional font to read your email:
>
> Nice. Thanks.
>

I took this as cause to write a real DRBD fact sheet.

http://www.linbit.com/fileadmin/linbit/DRBD-Faktblatt.pdf
http://www.linbit.com/fileadmin/linbit/DRBD-Factsheet.pdf

linked in from the DRBD+ feature page.

-Phil

Next Message by Date:

RE: DRBD8: Fencing and outdate-peer handler getting called multiple times

Phil,

My outdate-peer handler simply says

"echo /deve/drbd0: Running handler for outdate-peer >>/tmp/drbdio.log"

and that log file is created and populated, so I can only assume that it did
indicate success; in this case, success being an exit status of 0.

Thanks,
EM--

-----Original Message-----
From: Philipp Reisner
[mailto:philipp.reisner-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx]
Sent: Thursday, November 16, 2006 3:54 AM
To: drbd-dev-63ez5xqkn6DQT0dZR+AlfA@xxxxxxxxxxxxxxxx
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: Fencing and outdate-peer handler getting called
multiple times

On Tuesday, 7 November 2006 00:16, Montrose, Ernest wrote:
> Hi all,
> I have submitted this issue before, sorry for the resubmit. Essentially, on
> the primary node, if I do an ifdown on the heartbeat interface and I have
> fencing set to "resource-only", then on the primary node the
> outdate-peer script gets called twice: once for state Disconnecting,
> and once for state Networkfailure.

Maybe the return code of the outdate-peer handler did not indicate success.

>
> I also notice that if, on a node that is primary, I issue a "drbdadm
> secondary r0", the outdate-peer script gets called again, from
> drbd_set_role() this time.

Maybe the return code of the outdate-peer handler did not indicate success.

> What is the exact policy for the outdate-peer script?

This is the section of the ROADMAP file that describes how it *should* work:

7 Handle split brain situations; Support IO fencing;
  New commands:
    drbdadm outdate r0

  When the device is configured this works via an ioctl() call.
  In the other case it modifies the meta data directly by calling
  drbdmeta.

  remove option: on-disconnect

  New meta-data flag: "Outdated"

  introduce:
    disk {
      fencing [ dont-care | resource-only | resource-and-stonith ];
    }

    handlers {
      outdate-peer "some script";
    }

  If the disk state of the peer is unknown, drbd calls this
  handler (yes, a call to userspace from kernel space).
  The handler's return codes are:

  3 -> peer is inconsistent
  4 -> peer is outdated (this handler outdated it) [ resource fencing ]
  5 -> peer was down / unreachable
  6 -> peer is primary
  7 -> peer got stonithed [ node fencing ]

  Let us assume that we have two boxes (N1 and N2) and that these
  two boxes are connected by two networks (net and cnet [ clients'-net ]).

  Net is used by DRBD, while heartbeat uses both net and cnet.

  I know that you are talking about fencing by STONITH, but DRBD is
  not limited to that. Here comes my understanding of how resource
  fencing should work with DRBDv8:

   N1  net  N2
   P/S ---  S/P   everything up and running.
   P/? - -  S/?   network breaks; N1 freezes IO
   P/? - -  S/?   N1 fences N2:
                  In the STONITH case: turn off N2.
                  In the resource fencing case:
                  N1 asks N2 to fence itself from the storage via cnet.
                  HB calls "drbdadm outdate r0" on N2.
                  N2 replies to N1 that fencing is done via cnet.
                  The outdate-peer script on N1 returns success to DRBD.
   P/D - -  S/?   N1 thaws IO

  N2 got the "Outdated" flag set in its meta-data by the outdate
  command. Setting fencing to resource-only enables this behaviour.
  In the resource-only case the outdate-peer handler should have a
  return value of 3, 4, 5 or 6, but should not return 7.

  In case "fencing" is set to "resource-and-stonith", all IO operations
  get frozen immediately upon loss of connection (even the currently
  outstanding IO operations will not finish). Then the "outdate-peer"
  handler is started. In this configuration the outdate-peer handler
  might return any of the documented return values. When the
  outdate-peer handler returns, IO is resumed.

  Notes:
  * Why do we need to freeze IO in the "resource-and-stonith" case?
    Stonith protects you when all communication paths fail. In that
    case both (isolated) nodes try to stonith each other. If the
    current primary continued to allow IO, it could accept
    transactions but then get stonithed by the current secondary node.
    -> Therefore others could see committed transactions that would be
    gone after the successful stonith operation.

  * The outdate-peer handler also gets called if an unconnected
    secondary wants to become primary. In other words, it may only
    become primary when it knows that the peer is outdated/inconsistent.

  * We need to store the fact that the peer is outdated/inconsistent
    in the meta-data, to allow a standalone primary to be rebooted.

Does this answer your question?

-Phil
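
The return-code rules quoted from the ROADMAP can be written down as a small check. This is a sketch only: the `OutdatePeerResult` enum, the `ALLOWED_RESULTS` table and `check_handler_exit` are hypothetical names, not part of DRBD, and the "dont-care" setting is left out because the text above does not say how the handler is treated there.

```python
from enum import IntEnum


class OutdatePeerResult(IntEnum):
    """Return codes listed in the ROADMAP excerpt (names are hypothetical)."""
    PEER_INCONSISTENT = 3
    PEER_OUTDATED = 4      # resource fencing
    PEER_UNREACHABLE = 5
    PEER_PRIMARY = 6
    PEER_STONITHED = 7     # node fencing


# Which codes the excerpt allows for each fencing setting.
ALLOWED_RESULTS = {
    "resource-only": {3, 4, 5, 6},
    "resource-and-stonith": {3, 4, 5, 6, 7},
}


def check_handler_exit(fencing, exit_code):
    """Decode an outdate-peer exit code, or raise ValueError if the
    excerpt does not allow that code for the given fencing setting."""
    if fencing not in ALLOWED_RESULTS:
        raise ValueError(f"fencing setting not covered by this sketch: {fencing}")
    if exit_code not in ALLOWED_RESULTS[fencing]:
        raise ValueError(
            f"exit code {exit_code} is not a documented result for {fencing}"
        )
    return OutdatePeerResult(exit_code)
```

Under this reading, a handler that exits with status 0, like the echo-only script described at the top of this message, does not return any of the documented codes.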

Previous Message by Thread:

Re: DRBD8: Split-brain false positive on Primary/primary potential patch

On Tuesday, 7 November 2006 00:47, Montrose, Ernest wrote:
> When running Primary/Primary, if the Heartbeat connection goes down, then
> when we recover we always split brain. Simon had an idea which I have
> implemented. He is on vacation so this may not reflect his exact idea.
>
> Essentially, with this change, we do not create a new current UUID on the
> node unless I/O is seen. This prevents a split-brain from being declared
> when both nodes are primary but only one node is originating I/O and never
> the other. The other node is only a stand-by in that case.
>
> Take a look and let me know.
>

Hi Ernest and Simon,

I found a good example of why I do not like this approach:

N1/P   --- N2/P/M   both primary, FS mounted on N2 and completely idle.
N1/P   - - N2/P/M   network breaks (still unchanged UUIDs on both sides)
N1/P/M - - N2/P/M   user mounts FS on N1 (and modifies data, new UUID on N1)
N1/P   - - N2/P/M   user umounts FS on N1.
N1/P   ->- N2/P/M   Network gets repaired. Sync from N1 to N2.

With the patch you sent, we would get a resync from N1 to N2,
instantly corrupting all the cached information that the FS on N2
might have from the data!

I understand your test scenario, therefore I introduced this solution
to your problem:

  Implemented a new after-sb-0pri policy: "discard-zero-changes"

    Auto sync from the node that modified blocks during the split brain
    situation, but only if the target node did not touch a single block.
    If both nodes touched their data, this policy falls back to
    disconnect.

  And a new after-sb-1pri & 2pri policy: "violently-as0p"

    Always take the decision of the "after-sb-0pri" algorithm, even if
    that causes an erratic change of the primary's view of the data.
    This is only useful if you use a 1-node FS (i.e. not OCFS2 or GFS)
    with the allow-two-primaries flag, _AND_ you really know what you
    are doing. This is DANGEROUS and MAY CRASH YOUR MACHINE if you have
    a FS mounted on the primary node.

Now you need to configure it like this:

  after-sb-0pri discard-zero-changes;
  after-sb-1pri violently-as0p;
  after-sb-2pri violently-as0p;

And you can do the tests with the behaviour you expect, but other
users are free to select another behaviour.

-Phil
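
The "discard-zero-changes" rule as described in this message can be sketched as a small decision function. This is an illustration based only on the wording above, not DRBD's implementation: the function name, its arguments and the returned strings are hypothetical, and the case where neither node changed anything is not described in the message, so the sketch returns `None` for it.

```python
def discard_zero_changes(changed_blocks_n1, changed_blocks_n2):
    """Pick a resync direction from the number of blocks each node
    modified during the split brain (hypothetical helper)."""
    if changed_blocks_n1 > 0 and changed_blocks_n2 == 0:
        # N2 did not touch a single block, so it can safely be the target.
        return "sync N1 -> N2"
    if changed_blocks_n2 > 0 and changed_blocks_n1 == 0:
        return "sync N2 -> N1"
    if changed_blocks_n1 > 0 and changed_blocks_n2 > 0:
        # Both nodes touched their data: fall back to disconnect.
        return "disconnect"
    # Neither node changed anything: not covered by the description above.
    return None
```

For the stand-by test scenario, where only one node ever writes, `discard_zero_changes(120, 0)` selects a sync from N1 to N2 without manual intervention.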

Next Message by Thread:

Re: DRBD8: Split-brain false positive on Primary/primary potential patch

/ 2006-11-16 07:52:14 -0500
\ Graham, Simon:
> Not sure I agree that the current behavior is protecting users from
> themselves -- it only causes the split-brain if you lose the n/w and
> during 'normal' operation and there is nothing that protects against
> mounting a 1-node fs on both nodes of a primary-primary DRBD cluster.
>
> Running primary-secondary doesn't work if you are in a situation where
> it is not possible to switch primaryness when failing over; a good
> example of that is if you want to run a Xen virtual machine on top of
> a DRBD partition and support live migration of the VM (the problem is
> that Xen doesn't provide the means to execute a script to change
> primaryness at the required point in the migration). Of course you
> could argue that this is a Xen bug _but_ pragmatically, the proposed
> patch to delay updating the UUID until an actual write occurs
> preserves (I believe) correctness in DRBD and works without
> introducing new features into Xen.
>
> Recovering from split-brain automatically is of course something that
> is incredibly valuable but I think it can be treated orthogonally to
> the proposed fix.

I agree here.
But see below why I still think Philipp is "right", too :)

But I think the provided patch (doing it only in al_begin_io) is wrong.
actually it needs to be done as soon as the bitmap is touched, so it
needs to be done in "set_out_of_sync", which may be called in the cleanup
code after connection loss, too, and will be, typically, on the actually
active node.

when there is a journalled file system mounted, even if it had been
idle, there are periodic updates for the journal/superblock, so it would
be deferred only a few seconds on the actually active node.

on the "Primary" but inactive node, it would indeed defer this uuid
update, thus preventing the "split brain"...

one alternative would be to update the uuids where it is done now, but
only if we have been opened RW (we have that information anyways
somewhere), and do it again (unless already done) as soon as we are
opened rw. that would be correct, I think, and easy.

Why we could leave the code as is, anyways:

we can leave it like it is right now, because the "after split brain
recovery" strategy "discard least changes" would do the same thing:
your assumption was that the "inactive" node does no changes.
zero is less than anything else...

--
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
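
The "opened RW" alternative mentioned in this message could look roughly like the following toy model. It is a sketch under assumptions, not DRBD code: the `Device` class and its `open`, `on_connection_loss` and `_bump_once` methods are hypothetical names, and the device is reduced to one current UUID plus a count of read-write openers.

```python
import uuid


class Device:
    """Toy model of the 'only if opened RW' idea; names are hypothetical."""

    def __init__(self):
        self.current_uuid = uuid.uuid4().hex
        self.connected = True
        self.rw_openers = 0
        self.uuid_bumped = False

    def _bump_once(self):
        # "do it again (unless already done)": at most one new UUID
        # per disconnected period.
        if not self.uuid_bumped:
            self.current_uuid = uuid.uuid4().hex
            self.uuid_bumped = True

    def on_connection_loss(self):
        self.connected = False
        self.uuid_bumped = False
        if self.rw_openers > 0:
            # Update where it is done now, but only if opened read-write.
            self._bump_once()

    def open(self, read_write):
        if read_write:
            self.rw_openers += 1
            if not self.connected:
                # Deferred update: first read-write open while disconnected.
                self._bump_once()
```

In this model a node that is Primary but has nobody holding the device open read-write keeps its UUID across the outage, so reconnecting does not look like a split brain.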