Re: Taskmanager times out continuously for registration with Jobmanager


Hi Abdul,

Have you checked whether this problem also occurs with a newer Flink version (1.5.4 or 1.6.1)?

Cheers,
Till

On Thu, Oct 11, 2018 at 9:24 AM Dawid Wysakowicz <dwysakowicz@xxxxxxxxxx> wrote:

Hi Abdul,

I've added Till and Gary to cc, who might be able to help you.

Best,

Dawid


On 11/10/18 03:05, Abdul Qadeer wrote:

Hi,


We are facing an issue in standalone HA mode on Flink 1.4.0: after a TaskManager restarts, it is unable to register with the JobManager. It times out waiting for the AcknowledgeRegistration/AlreadyRegistered message from the JobManager actor and keeps resending the RegisterTaskManager message. The JobManager logs show nothing about a registration request or failure; the JobManager does not even print log.debug(s"RegisterTaskManager: $msg") (from JobManager.scala). The network connection between the TaskManager and the JobManager seems fine: tcpdump shows the message being sent to the JobManager and a TCP ACK coming back from it. Note that the communication happens between Docker containers.
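As a complement to the tcpdump check described above, here is a minimal Python sketch of a TCP reachability probe that could be run from inside the TaskManager container. The helper name `can_connect` and the 3-second timeout are illustrative assumptions, not part of Flink. A successful connect only confirms that the transport path is open; it says nothing about whether the JobManager's actor system accepts or delivers the message.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration: it checks transport-level
    reachability only, not whether any actor receives a message.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Example call, using the JobManager address that appears in the logs:
# can_connect("192.168.83.51", 6123)
```

If this returns True while registration still times out, the problem is more likely above the TCP layer than in the container network itself.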


The following are the logs from the TaskManager:



{"timeMillis":1539189572438,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1400, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}

{"timeMillis":1539189580229,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}

{"timeMillis":1539189600247,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}

{"timeMillis":1539189602458,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1401, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}

{"timeMillis":1539189620251,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got ping response for sessionid: 0x10000260ea5002d after 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5}

{"timeMillis":1539189632478,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@192.168.83.51:6123/user/jobmanager (attempt 1402, timeout: 30000 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5}
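To make the retry pattern in these logs easier to see, here is a small Python sketch that extracts the registration attempts from JSON log lines like the ones above and computes the time between consecutive attempts. The function names and the regular expression are assumptions made for illustration, and the sample records are trimmed to the two fields the script reads (`timeMillis` and `message`).

```python
import json
import re

# Matches the "(attempt N, timeout: M milliseconds)" suffix of the log message.
ATTEMPT_RE = re.compile(r"attempt (\d+), timeout: (\d+) milliseconds")


def registration_attempts(lines):
    """Yield (attempt, timeout_ms, time_millis) for each registration log line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        match = ATTEMPT_RE.search(record.get("message", ""))
        if match:
            yield int(match.group(1)), int(match.group(2)), record["timeMillis"]


def gaps_between_attempts(attempts):
    """Return [(attempt, ms_since_previous_attempt), ...]."""
    attempts = list(attempts)
    return [
        (later[0], later[2] - earlier[2])
        for earlier, later in zip(attempts, attempts[1:])
    ]


# Records trimmed from the TaskManager log above (other fields omitted).
JM = "akka.tcp://flink@192.168.83.51:6123/user/jobmanager"
SAMPLE = [
    json.dumps({"timeMillis": 1539189572438,
                "message": f"Trying to register at JobManager {JM} (attempt 1400, timeout: 30000 milliseconds)"}),
    json.dumps({"timeMillis": 1539189580229,
                "message": "Got ping response for sessionid: 0x10000260ea5002d after 0ms"}),
    json.dumps({"timeMillis": 1539189602458,
                "message": f"Trying to register at JobManager {JM} (attempt 1401, timeout: 30000 milliseconds)"}),
    json.dumps({"timeMillis": 1539189632478,
                "message": f"Trying to register at JobManager {JM} (attempt 1402, timeout: 30000 milliseconds)"}),
]

print(gaps_between_attempts(registration_attempts(SAMPLE)))
# prints [(1401, 30020), (1402, 30020)]
```

Each gap is about 30 seconds, which matches the 30000 millisecond timeout reported in the messages: every attempt runs to its full timeout before the next one is sent, while the ZooKeeper pings in between keep succeeding.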