Re: UNICAST problem when a node is disconnected/reconnected to network (iss
msg#00000, java.javagroups.general
David Forget wrote:
> Hi Bela,
> We found the following issue while testing the JGroups 2.2.9 release
> (also reproduced in 2.4).
>
> Stack used:
>
> UDP(bind_addr=10.4.3.110;enable_diagnostics=false;mcast_addr=228.8.8.8;mcast_port=52976;port_range=100;ip_mcast=true;ucast_recv_buf_size=1048576;ucast_send_buf_size=65535;mcast_recv_buf_size=1048576;mcast_send_buf_size=65535;loopback=false)
> PING(timeout=2000;num_initial_members=10;num_ping_requests=1)
> MERGEFAST():FD(timeout=1500;max_tries=2;shun=true)
> VERIFY_SUSPECT(timeout=3000):pbcast.NAKACK():UNICAST()
> pbcast.STABLE(stability_delay=30000;desired_avg_gossip=60000;max_bytes=10000)
> FRAG2(frag_size=8192):VIEW_SYNC(avg_send_interval=60000;down_thread=false;up_thread=false)
> pbcast.GMS(view_ack_collection_timeout=2000;join_timeout=2000;join_retry_timeout=1000;shun=true;print_local_addr=true)\
> pbcast.STATE_TRANSFER()
>
> Problem:
> In a group of ~26 nodes {A, B … Z}, we disconnect one node (pull its
> network cable) for about 7 seconds. After we reconnect it, some nodes
> are unable to handle UNICAST messages from this node.
>
> Issue:
> When we pull the network cable, the disconnected node, e.g. {M},
> quickly identifies {N} as not replying to FD::HEARTBEAT, and
> removeConnection is called in UNICAST on {M} for {N}. But {N} may not
> have detected that {M} was disconnected. When {N} then receives a
> UNICAST message from {M}, it does not accept it because the seqno
> is smaller than what it was expecting.

I tested this with both UDP and TCP a few days ago, and it worked every
single time! I pulled the plug on the boxes, and also on the switch.

Can you do a few things and re-test?

* Set loopback=true
* Use MERGE2 instead of MERGEFAST (which is largely untested)
* Set stability_delay in STABLE to a small number (e.g. 1000)
* Set shun=false in FD and GMS. See
  http://wiki.jboss.org/wiki/Wiki.jsp?page=Shunning for a discussion
* Why do you have port_range in UDP? Since you don't define a start
  port, this is unnecessary
* Optionally include both FD_SOCK and FD in your config

Did you test with the latest JGroups?
http://jira.jboss.com/jira/browse/JGRP-244 and
http://jira.jboss.com/jira/browse/JGRP-217 fixed 2 bugs with unicast
connections, where members crashed and then were restarted.

> Feature Request:
> Is there any stack available in JGroups to help us detect an isolation
> (loss of all members) quickly? FD with an aggressive timeout is still
> pretty slow for a cluster with 52 nodes, as we are running in
> production: it took 51*8 seconds = ~7 minutes to detect a complete
> isolation, which is too long.

What's an isolation? Every node becomes a singleton cluster? Or is it
about detection of a crashed node? In the latter case, FD_SOCK would
certainly help.

-- 
Bela Ban
Lead JGroups / Manager JBoss Clustering Group
JBoss - a division of Red Hat
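
The seqno mismatch described in the report can be illustrated with a toy model. This is a minimal sketch, not JGroups code: the class and method names are hypothetical, the starting seqno of 1 is an assumption, and the receiver is simplified to accept only the exact seqno it expects next (no buffering or retransmission). It shows why messages are dropped when only {M} discards its connection state, and why they go through again when both sides start over.

```java
/** Toy model of per-peer unicast sequencing. Hypothetical names; not JGroups code. */
public class UnicastSeqnoSketch {

    /** Sender-side connection state: numbers outgoing messages 1, 2, 3, ... */
    static final class SenderConnection {
        private long nextSeqno = 1; // assumed initial seqno

        long nextMessage() {
            return nextSeqno++;
        }
    }

    /** Receiver-side connection state: accepts only the seqno it expects next. */
    static final class ReceiverConnection {
        private long expected = 1; // assumed initial seqno

        boolean receive(long seqno) {
            if (seqno != expected) {
                // Lower than expected: looks like an old duplicate and is dropped.
                // Higher than expected: a gap, which this toy model does not buffer.
                return false;
            }
            expected++;
            return true;
        }
    }

    /** Returns how many of 3 post-reconnect messages from M are accepted by N. */
    static int acceptedAfterReconnect(boolean receiverAlsoResets) {
        SenderConnection mToN = new SenderConnection();
        ReceiverConnection nFromM = new ReceiverConnection();

        // Normal traffic before the cable is pulled: N now expects seqno 6.
        for (int i = 0; i < 5; i++) {
            nFromM.receive(mToN.nextMessage());
        }

        // M suspects N and throws away its connection state, so it restarts at seqno 1.
        mToN = new SenderConnection();
        if (receiverAlsoResets) {
            // N also noticed the disconnect and discarded its state for M.
            nFromM = new ReceiverConnection();
        }

        int accepted = 0;
        for (int i = 0; i < 3; i++) {
            if (nFromM.receive(mToN.nextMessage())) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("accepted when only M resets: " + acceptedAfterReconnect(false));
        System.out.println("accepted when both reset:    " + acceptedAfterReconnect(true));
    }
}
```

In this model the failure depends entirely on the two sides disagreeing about whether the connection was reset, which matches the reported case where {N} never noticed that {M} had gone away.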