java.javagroups.general

Subject: Re: UNICAST problem when a node is disconnected/reconnected to network (issue for 2.4)



David Forget wrote:
> Hi Bela,
> We found the following issue while testing the JGroups 2.2.9 release
> (also reproduced in 2.4).
>
> Stack Used:
>
> UDP(bind_addr=10.4.3.110;enable_diagnostics=false;mcast_addr=228.8.8.8;mcast_port=52976;port_range=100;ip_mcast=true;ucast_recv_buf_size=1048576;ucast_send_buf_size=65535;mcast_recv_buf_size=1048576;mcast_send_buf_size=65535;loopback=false):
> PING(timeout=2000;num_initial_members=10;num_ping_requests=1):
> MERGEFAST():FD(timeout=1500;max_tries=2;shun=true):
> VERIFY_SUSPECT(timeout=3000):pbcast.NAKACK():UNICAST():
> pbcast.STABLE(stability_delay=30000;desired_avg_gossip=60000;max_bytes=10000):
> FRAG2(frag_size=8192):VIEW_SYNC(avg_send_interval=60000;down_thread=false;up_thread=false):
> pbcast.GMS(view_ack_collection_timeout=2000;join_timeout=2000;join_retry_timeout=1000;shun=true;print_local_addr=true):
> pbcast.STATE_TRANSFER()
>
> Problem:
> In a group of ~26 nodes {A, B … Z} we disconnect one node (by removing
> its network cable) for about 7 seconds. After the network is
> reconnected, some nodes are not able to handle (UNICAST) messages from
> this node.
> Issue:
> When we disconnect the network cable, the disconnected node, e.g. {M},
> quickly identifies {N} as not replying to FD::HEARTBEAT, and
> removeConnection is called in UNICAST on {M} for {N}. But {N} may not
> have detected that {M} was disconnected. When {N} then receives a
> UNICAST message from {M}, it does not accept the message because the
> seqno is smaller than what it was expecting.
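
The seqno mismatch described above can be illustrated with a small toy model. This is only a sketch of the reported behaviour, not JGroups code: the class and method names (UnicastSeqnoSketch, Sender, Receiver, nextSeqnoFor, receive) are hypothetical, the starting seqno of 1 is an assumption, and a real UNICAST receiver also buffers out-of-order messages and sends acks, which is left out here.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (NOT JGroups code): per-peer sequence numbers on both sides.
public class UnicastSeqnoSketch {

    /** Sender side: one outgoing seqno counter per destination. */
    static class Sender {
        private final Map<String, Long> nextSeqno = new HashMap<>();

        /** Returns the seqno for the next message to dest (assumed to start at 1). */
        long nextSeqnoFor(String dest) {
            long seqno = nextSeqno.getOrDefault(dest, 1L);
            nextSeqno.put(dest, seqno + 1);
            return seqno;
        }

        /** Models removeConnection: forgets all state for dest, so numbering restarts. */
        void removeConnection(String dest) {
            nextSeqno.remove(dest);
        }
    }

    /** Receiver side: delivers a message only if it carries the next expected seqno. */
    static class Receiver {
        private final Map<String, Long> expected = new HashMap<>();

        /** Returns true if the message was delivered, false if it was not. */
        boolean receive(String sender, long seqno) {
            long next = expected.getOrDefault(sender, 1L);
            if (seqno != next) {
                return false; // lower seqno looks like an old duplicate and is dropped
            }
            expected.put(sender, next + 1);
            return true;
        }
    }

    public static void main(String[] args) {
        Sender m = new Sender();     // node {M}, the one that gets unplugged
        Receiver n = new Receiver(); // node {N}, which never noticed the outage

        // Normal operation: {M} sends five messages to {N}.
        for (int i = 0; i < 5; i++) {
            long seqno = m.nextSeqnoFor("N");
            System.out.println("before disconnect: seqno=" + seqno
                    + " delivered=" + n.receive("M", seqno));
        }

        // {M} suspects {N} and drops its connection state; {N} keeps its own.
        m.removeConnection("N");

        // After reconnect {M} starts over, but {N} still expects seqno 6.
        long seqno = m.nextSeqnoFor("N");
        System.out.println("after reconnect: seqno=" + seqno
                + " delivered=" + n.receive("M", seqno));
    }
}
```

In this model the last line reports seqno=1 and delivered=false: only one side reset its counters, which matches the symptom in the report.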

I tested this with both UDP and TCP a few days ago, and it worked every
single time! I pulled the plug on the boxes, and also on the switch.
Can you do a few things and re-test?

* Set loopback=true
* Use MERGE2 instead of MERGEFAST (largely untested)
* Set stability_delay in STABLE to a small number (e.g. 1000)
* Set shun=false in FD and GMS. See
http://wiki.jboss.org/wiki/Wiki.jsp?page=Shunning for a discussion
* Why do you have port_range in UDP? Since you don't define a start
port, it is unnecessary
* Optionally include both FD_SOCK and FD in your config
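
For reference, the stack from the original report with the changes above applied might look roughly like the following. This is an untested sketch: the protocol order is kept from the original report, placing FD_SOCK directly before FD is an assumption, MERGE2 and FD_SOCK are left at their defaults, and the line breaks are only for readability (the properties would need to be joined into a single string).

```
UDP(bind_addr=10.4.3.110;enable_diagnostics=false;mcast_addr=228.8.8.8;mcast_port=52976;ip_mcast=true;ucast_recv_buf_size=1048576;ucast_send_buf_size=65535;mcast_recv_buf_size=1048576;mcast_send_buf_size=65535;loopback=true):
PING(timeout=2000;num_initial_members=10;num_ping_requests=1):
MERGE2():
FD_SOCK():
FD(timeout=1500;max_tries=2;shun=false):
VERIFY_SUSPECT(timeout=3000):
pbcast.NAKACK():
UNICAST():
pbcast.STABLE(stability_delay=1000;desired_avg_gossip=60000;max_bytes=10000):
FRAG2(frag_size=8192):
VIEW_SYNC(avg_send_interval=60000;down_thread=false;up_thread=false):
pbcast.GMS(view_ack_collection_timeout=2000;join_timeout=2000;join_retry_timeout=1000;shun=false;print_local_addr=true):
pbcast.STATE_TRANSFER()
```

Compared with the original: loopback is true, port_range is gone, MERGEFAST is replaced by MERGE2, FD_SOCK is added, stability_delay is 1000, and shun is false in both FD and GMS.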


Did you test with the latest JGroups?
http://jira.jboss.com/jira/browse/JGRP-244 and
http://jira.jboss.com/jira/browse/JGRP-217 fixed 2 bugs with unicast
connections, where members crashed and were then restarted.


> Feature Request:
> Is there any stack available in JGroups to help us detect an isolation
> (all members lost) quickly? FD with an aggressive timeout is still
> pretty slow for a cluster with 52 nodes, as we are running in
> production: it took 51*8 seconds = 408 seconds (~7 minutes) to detect
> a complete isolation, which is too long.

What's an isolation? Every node becomes a singleton cluster? Or is it
about detection of a crashed node? In the latter case, FD_SOCK would
certainly help.


--
Bela Ban
Lead JGroups / Manager JBoss Clustering Group
JBoss - a division of Red Hat
