Network problems during repair make it hang on "Wait for validation to complete"


Hello!

Using Cassandra 2.2.11, I observe behaviour that is very similar to https://issues.apache.org/jira/browse/CASSANDRA-12860

Steps to reproduce:
1. Set up a cluster: ccm create five -v 2.2.11 && ccm populate -n 5 --vnodes && ccm start
2. Import some keyspace into it (approx. 50 MB of data)
3. Start repair on one node: ccm node2 nodetool repair KEYSPACE
4. While repair is still running, disconnect node3: sudo iptables -I INPUT -p tcp -d 127.0.0.3 -j DROP
5. This repair hangs.
6. Restore network connectivity
7. Repair is still hanging.
8. Subsequent repairs also hang.

In tpstats I see tasks that make no progress (the columns after the pool name are Active, Pending, Completed, Blocked, All time blocked):

$ for i in {1..5}; do echo node$i; ccm node$i nodetool tpstats | grep "Repair#"; done
node1
Repair#1                          1      2255              1         0                 0
node2
Repair#1                          1      2335             26         0                 0
node3
node4
Repair#3                          1       147           2175         0                 0
node5
Repair#1                          1      2335             17         0                 0
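To confirm that a Repair pool is really stuck rather than just slow, two tpstats snapshots taken some time apart can be compared. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the column order Active, Pending, Completed, Blocked, All time blocked after the pool name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Cassandra or ccm.
final class TpstatsProgressSketch {

    // Parses lines such as "Repair#1  1  2255  1  0  0" into pool name -> Completed count.
    // Assumes six whitespace-separated fields: name, Active, Pending, Completed,
    // Blocked, All time blocked. Other lines (e.g. "node1") are ignored.
    static Map<String, Long> completedByPool(String tpstatsOutput) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String line : tpstatsOutput.split("\n")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 6 && fields[0].startsWith("Repair#")) {
                result.put(fields[0], Long.parseLong(fields[3]));
            }
        }
        return result;
    }

    // Treats a pool as stalled if it appears in both snapshots
    // with the same Completed count.
    static boolean stalled(String earlier, String later, String pool) {
        Long before = completedByPool(earlier).get(pool);
        Long after = completedByPool(later).get(pool);
        return before != null && before.equals(after);
    }

    public static void main(String[] args) {
        String first = "Repair#1   1   2335   26   0   0";
        String second = "Repair#1   1   2335   26   0   0";
        System.out.println(stalled(first, second, "Repair#1"));
    }
}
```

The check looks only at the Completed column; a stricter version might also require Active or Pending to be non-zero.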

In jconsole I see that Repair threads are blocked here:
Name: Repair#1:1
State: WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@73c5ab7e
Total blocked: 0  Total waited: 242

Stack trace: 
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1371)
org.apache.cassandra.repair.RepairJob.run(RepairJob.java:167)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

According to the source code, they are waiting for validations to complete:
# ./apache-cassandra-2.2.8-src/src/java/org/apache/cassandra/repair/RepairJob.java
 74     public void run()
 75     {
...
166         // Wait for validation to complete
167         Futures.getUnchecked(validations);
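The wait at line 167 has no timeout, so if a validation response never arrives, the future is never completed and the thread stays parked, which matches the stack trace above. The sketch below illustrates only that behaviour. It is not Cassandra code: it uses java.util.concurrent.CompletableFuture as a stand-in for the Guava future, and the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not Cassandra code.
final class ValidationWaitSketch {

    // Waits for a result, but gives up after timeoutMillis instead of parking forever.
    static String waitWithTimeout(CompletableFuture<String> validations, long timeoutMillis) {
        try {
            return validations.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a validation whose response was dropped: nobody ever completes it.
        CompletableFuture<String> lost = new CompletableFuture<>();
        // An untimed lost.get() here would block indefinitely.
        System.out.println(waitWithTimeout(lost, 200)); // prints "timed out"

        CompletableFuture<String> ok = CompletableFuture.completedFuture("trees received");
        System.out.println(waitWithTimeout(ok, 200)); // prints "trees received"
    }
}
```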

https://issues.apache.org/jira/browse/CASSANDRA-11824 says that the problem was fixed in 2.2.7, but I am using 2.2.11.

Restarting all Cassandra nodes that have hanging tasks (one by one) makes these tasks disappear from tpstats. After that, repairs work well (until the next network problem).

I also suspect that long GC pauses on one node during repair may lead to the same problem, just as network issues do.

Is this a known issue?

--
Best Regards,
Dmitry Simonov