According to the UI, it seems that "org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed." was the cause of a pipeline restart.

As to the TM, it is an artifact of the new job allocation regime, which will exhaust all slots on a TM rather than distributing them equitably. I think some TMs are under more stress than they would be in a pure round-robin (RR) distribution. We may have to lower the number of slots on each TM to define a good upper bound. You are correct, 50s is a pretty generous value.

On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@xxxxxxxxxxxxxxxxx> wrote:

Hi,

The first exception should be logged only at info level. It's expected to see
this exception when a TaskManager unregisters from the ResourceManager.
Heartbeats can be configured via heartbeat.interval and heartbeat.timeout.
The default timeout is 50s, which should be a generous value. It is probably a
good idea to find out why the heartbeats cannot be answered by the TM.
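
As a rough sketch of what that configuration might look like in flink-conf.yaml: both keys take values in milliseconds, and the timeout shown here is an illustrative assumption rather than a recommendation. The documented defaults are 10000 for the interval and 50000 for the timeout.

```yaml
# flink-conf.yaml (sketch; values are in milliseconds)

# How often heartbeat requests are sent. 10000 is the documented default.
heartbeat.interval: 10000

# How long to wait for a heartbeat before the peer is considered dead.
# The default is 50000; 120000 is an illustrative value only.
heartbeat.timeout: 120000
```

Raising the timeout only hides the symptom if the TM is stalling for some other reason, so it is probably best treated as a stopgap while the cause is investigated.
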
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager

On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santoshi@xxxxxxxxx> wrote:

We are seeing 2 issues on 1.5.1 on a streaming pipeline:

org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.

and

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.

Not sure about the first, but how do we increase the heartbeat interval of a TM?

Thanks much,
Vishal