


Re: 1.5.1


I also have jobs failing on a daily basis with the error "Heartbeat of TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?

I already set these in flink-conf.yaml, but I'm still getting failures:
heartbeat.interval: 10000
heartbeat.timeout: 100000

Thanks.

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santoshi@xxxxxxxxx> wrote:
According to the UI, it seems that
"org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed."
was the cause of a pipe restart.

As for the TM, I think it is an artifact of the new job allocation regime, which exhausts all slots on one TM rather than distributing them equitably. Some TMs are therefore under more stress than they would be in a pure round-robin (RR) distribution. We may have to lower the number of slots on each TM to define a good upper bound. You are correct, 50s is a pretty generous value.
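
For reference, the slot count per TM mentioned above is set in flink-conf.yaml. A minimal sketch, assuming the standard taskmanager.numberOfTaskSlots key; the value 4 is purely illustrative and not taken from this thread:

```yaml
# flink-conf.yaml (sketch; the value below is a hypothetical example)
# Fewer slots per TaskManager caps how many tasks can pile up on a single TM.
taskmanager.numberOfTaskSlots: 4
```

Lowering this value reduces the total number of slots in the cluster unless more TMs are started, so the job's parallelism has to fit the new total.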

On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@xxxxxxxxxxxxxxxxx> wrote:
Hi,

The first exception should only be logged at info level. It is expected to see
this exception when a TaskManager unregisters from the ResourceManager.

Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1].
The default timeout is 50s, which should be a generous value. It is probably a
good idea to find out why the heartbeats cannot be answered by the TM.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
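
As an illustration of the two settings named above, here is a minimal flink-conf.yaml sketch. Both values are in milliseconds; the interval shown is Flink's default, and the timeout is the raised value quoted elsewhere in this thread, not a recommendation:

```yaml
# flink-conf.yaml (sketch; values are in milliseconds)
# How often heartbeat requests are sent (Flink's default is 10000).
heartbeat.interval: 10000
# How long to wait for a heartbeat before declaring the peer dead
# (default is 50000, i.e. the 50s mentioned above; 100000 is the
# raised value another poster in this thread tried).
heartbeat.timeout: 100000
```

The timeout should stay well above the interval so that several heartbeats can be missed before a TM is considered lost. Raising it only hides the symptom if the TM is genuinely unresponsive.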


On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santoshi@xxxxxxxxx> wrote:
We are seeing 2 issues on 1.5.1 on a streaming pipeline:

org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.

and

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.

Not sure about the first, but how do we increase the heartbeat interval of a TM?

Thanks much 

Vishal