


Taskmanager times out continuously for registration with Jobmanager


Hi,


We are facing an issue in standalone HA mode in Flink 1.4.0: after a TaskManager restarts, it is unable to register with the JobManager. It times out waiting for an AcknowledgeRegistration or AlreadyRegistered message from the JobManager actor and keeps resending the RegisterTaskManager message. The JobManager logs show nothing about a registration request or failure. They don’t show log.debug(s"RegisterTaskManager: $msg") (from JobManager.scala) either. The network connection between the TaskManager and the JobManager seems fine: tcpdump shows the message being sent to the JobManager and a TCP ACK coming back from it. Note that the communication is between Docker containers.


The following are the logs from the TaskManager:



{"timeMillis":1539189572438,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1400, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}

{"timeMillis":1539189580229,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}

{"timeMillis":1539189600247,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}

{"timeMillis":1539189602458,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1401, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}

{"timeMillis":1539189620251,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}

{"timeMillis":1539189632478,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1402, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}