I'm looking for information on how OTP behaves when the network between nodes fails and later reconnects (the nodes themselves stay up the whole time).
** Question 1 **
In particular, I'd like to better understand how "global", the "distributed application controller" and Ulf's "locker" (contrib page) behave in network partition/reconnect scenarios.
I've found references to work by Thomas Arts et al. [1,2] and Ulf Wiger [3], plus snippets here and there, but it would be most helpful to me if an OTP wizard could illuminate this topic comprehensively.
For "global", one has to expect "name conflict" errors when the network comes back together. By extension, I guess the same applies to the application controller (via its use of global). I'm not sure about Ulf's locker. Using the example from Ulf's release handling tutorial, I can generate a naming conflict and observe what happens: start n1, then n2 (the owner); suspend the erl process that runs n2, so dist fails over to n1; then resume the erl process that runs n2 and ping n1. This produces a naming conflict that kills dist_server on n2; the supervisor restarts n2, which takes over from n1. The takeover handshake is not logged. Does it happen?
=INFO REPORT==== 29-Apr-2003::12:59:39 ===
global: Name conflict terminating {dist_server,<1930.59.0>}
** Question 2 **
Is there any risk of losing messages that were buffered by the dist_server instance just before it got killed? I'm worried that while calls such as global:register_name are atomic across nodes [docs and 2], a potential client (a client of dist_server, I mean here) is not part of the atomic conflict resolution/re-registering process.
I noticed the "relay" function in Ulf's release handling tutorial [3], but am not sure it kicks in when global detects the naming conflict upon reconnect - I guess not, correct?
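For context on where the losing process could be brought into the conflict handling: global:register_name/3 takes a resolve function that global calls when it detects a name clash on reconnect (the two-argument form uses global:random_exit_name/3, which kills one of the two processes). Below is a minimal sketch, not a tested solution. The module name conflict_demo and the message shape {name_conflict_lost, Name, Winner} are made up for illustration, and whether the loser can really hand over its buffered messages in time is exactly the open question.

```erlang
-module(conflict_demo).
-export([register_with_resolver/2, resolve/3]).

%% Register Pid under Name with an explicit resolve function instead of
%% the default global:random_exit_name/3.
register_with_resolver(Name, Pid) ->
    global:register_name(Name, Pid, fun ?MODULE:resolve/3).

%% Called by global when two pids hold the same name after a reconnect.
%% Keeps Pid1 and notifies Pid2 rather than killing it, so that Pid2
%% gets a chance to forward whatever it has buffered to the winner.
%% The message shape is hypothetical, not something global defines.
resolve(Name, Pid1, Pid2) ->
    Pid2 ! {name_conflict_lost, Name, Pid1},
    Pid1.
```

The losing dist_server would then have to handle {name_conflict_lost, Name, Winner} itself, for example in handle_info/2, and decide what to do with pending requests before it terminates.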
** Question 3 ** - somewhat related to the above:
Is there any library support for "majority voting" and/or "lease management" in OTP that I've not discovered yet? In particular, I'm interested in rejecting a global:register_name/2 call if the calling process is not running on a node that belongs to the majority set.
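To make the idea concrete, the kind of wrapper I have in mind could look roughly like the sketch below. It assumes the full cluster membership is known up front and passed in as a list (Expected); the module name majority_reg and the {error, no_majority} return value are made up for illustration. The only OTP calls used are node/0, nodes/0 and global:register_name/2.

```erlang
-module(majority_reg).
-export([register_if_majority/3, has_majority/2]).

%% True if strictly more than half of the Expected nodes appear in Visible.
has_majority(Visible, Expected) ->
    Present = [N || N <- Expected, lists:member(N, Visible)],
    2 * length(Present) > length(Expected).

%% Register Name for Pid only if this node currently sees a majority
%% of the (assumed, statically configured) Expected node list.
register_if_majority(Name, Pid, Expected) ->
    case has_majority([node() | nodes()], Expected) of
        true  -> global:register_name(Name, Pid);   % returns yes | no
        false -> {error, no_majority}
    end.
```

Note that the check and the registration are two separate steps, so the node can drop out of the majority between them. A real solution would need the check to be part of the registration itself, which is why I'm asking about library support.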
Thanks,
- Reto
References:
Thomas Arts et al [1,2], Ulf Wiger [3]
[1] http://www.ericsson.com/cslab/~thomas/publ2.shtml (resource locker case study)
[2] http://www.erlang.org/ml-archive/erlang-questions/200107/msg00031.html (christian paper)
[3] (OTP release handling tutorial by Ulf) - was on the newsgroup, cannot find ref right now
______________________
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
C.A.R. Hoare
1980 Turing Award Lecture