Re: Network partition and OTP: msg#00419

Subject: Re: Network partition and OTP
In the AXD301 product this problem is handled by setting
the kernel flag 'dist_auto_connect' to 'once'.

Why, you may ask?

Because the ugliest (perverted?) thing that can happen is when
one uses the automatic node reconnect feature and has flipping
communication (down->up->down and so on). This can really screw up the
distributed applications (global, dist_ac).

Before we changed to 'dist_auto_connect' == 'once' we could see some
systems (at customer sites) that were totally screwed up in dist_ac.
We could only pray that this disastrous situation would escalate to a
node restart so everything would clear up.

What happens, then, if a connection can only be set up once?
Well, we have implemented a simple resolve protocol that is activated
between the two nodes that lose the connection. (UDP ports are always
ready to receive messages, one on each Erlang node.)
Both involved nodes decide which node is more important
and select the least important one. After some minor handshaking, one of
the nodes (the one with the lowest priority) is restarted.
When it comes up again it will reconnect.
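
To make the decision step concrete, here is a minimal sketch of how both
sides could arrive at the same answer without further negotiation. It is
written in Python purely for illustration and is not the AXD301 code. The
Node type, the numeric priority (higher means more important) and the
tie-break on node name are all assumptions of mine.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str      # hypothetical node name, e.g. "cp1@host-a"
    priority: int  # assumed convention: higher value = more important


def node_to_restart(a: Node, b: Node) -> Node:
    """Pick the less important of two nodes that lost contact.

    Ties are broken on the node name so that both nodes, running the
    same function independently, select the same victim.
    """
    return min(a, b, key=lambda n: (n.priority, n.name))
```

The point of the deterministic tie-break is that the result does not depend
on which node evaluates it or in which order the arguments are given.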

This solution has worked quite well and has been enhanced as we found
more ugly cases. We even try to discover which of the two nodes is
the "guilty" party. For example, if one node loses its connection to more
than one node it "must" be guilty. Such a case can happen, for instance,
if some huge garbage collection takes up all execution time. In that
case only the "guilty" node is restarted and the other involved nodes
are unharmed.
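
The "guilty" rule can be sketched as a simple count of lost connections
per node. Again this is illustrative Python, not the real implementation.
The function name and the input format (pairs of node names whose
connection dropped) are assumptions.

```python
from collections import Counter


def guilty_nodes(lost_links):
    """Return the nodes that lost contact with more than one peer.

    lost_links is an iterable of (node_a, node_b) name pairs, one per
    dropped connection. Duplicate reports of the same link, in either
    direction, are counted once.
    """
    unique_links = {frozenset(link) for link in lost_links}
    counts = Counter()
    for link in unique_links:
        for node in link:
            counts[node] += 1
    return sorted(node for node, lost in counts.items() if lost > 1)
```

If the list comes back empty (only a single link was lost), no node stands
out as guilty and the choice has to be made on importance instead.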

I feel that the automatic node reconnect feature might be nice
for small systems with very few applications. But it will still
be a lot of work to handle the reconnect case correctly. I'm not sure, but
I think that very few people have thought about handling this error case.
I haven't seen anything in the OTP documentation about this, but then
I seldom read all the documentation that carefully.

Asko Husso                              E-mail:etxhua@xxxxxxxxxxxxxxx
Ericsson AB                             Phone: +46 8 7192324
Varuvägen 9B                               
S-126 25 Stockholm-Älvsjö, Sweden




