Thanks Gary.

What could be blocking the RPC threads? Slow checkpointing?

In production we're still using a self-built Flink package 1.5-SNAPSHOT, flink commit 8395508b0401353ed07375e22882e7581d46ac0e, and the jobs are stable.

Now with 1.5.2 the same jobs are failing due to heartbeat timeouts every day. What changed between commit 8395508b0401353ed07375e22882e7581d46ac0e & release 1.5.2?

Also, I just tried to run a slightly heavier job. It eventually had some heartbeat timeouts, and then this:

2018-08-15 01:49:58,156 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Kafka (topic1, topic2) -> Filter -> AppIdFilter([topic1, topic2]) -> XFilter -> EventMapFilter(AppFilters) (4/8) (da6e2ba425fb91316dd05e72e6518b24) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1534167926397_0001_01_000002_1 was removed.

After that the job tried to restart according to Flink restart strategy but that kept failing with this error:

2018-08-15 02:00:22,000 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job X (19bd504d2480ccb2b44d84fb1ef8af68) switched from state RUNNING to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 36, slots allocated: 12

This was repeated until all restart attempts had been used (we've set it to 50), and then the job finally failed.

I would like to know also how to prevent Flink from going into such bad state. At least it should exit immediately instead of retrying in such a situation. And why was "slot container removed"?

On Tue, Aug 14, 2018 at 11:24 PM Gary Yao <gary@xxxxxxxxxxxxxxxxx> wrote:

Hi Juho,
It seems in your case the JobMaster did not receive a heartbeat from the
TaskManager in time. Heartbeat requests and answers are sent over the RPC
framework, and RPCs of one component (e.g., TaskManager, JobMaster, etc.) are
dispatched by a single thread. Therefore, the reasons for heartbeat timeouts
include:
1. The RPC threads of the TM or JM are blocked. In this case heartbeat requests or answers cannot be dispatched.
2. The scheduled task for sending the heartbeat requests died.
3. The network is flaky.
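Reason 1 in the list above can be illustrated with a small, self-contained sketch. It is not Flink code: a single-worker thread pool stands in for the one dispatch thread of an RPC endpoint, and the function name, the blocking call, and the durations are all hypothetical. It only shows that a heartbeat answer queued behind a long-running call on the same thread can miss its timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def heartbeat_answered_in_time(blocking_seconds: float, timeout_seconds: float) -> bool:
    """Return True if a heartbeat is answered within timeout_seconds.

    A single worker thread stands in for the main thread of an RPC endpoint:
    every call, including the heartbeat answer, waits for whatever was
    submitted before it. (Illustrative model only, not Flink's implementation.)
    """
    with ThreadPoolExecutor(max_workers=1) as rpc_main_thread:
        # Hypothetical long-running call that occupies the only dispatch thread.
        rpc_main_thread.submit(time.sleep, blocking_seconds)
        # The heartbeat answer is queued behind the blocking call.
        heartbeat = rpc_main_thread.submit(lambda: "heartbeat-ack")
        try:
            heartbeat.result(timeout=timeout_seconds)
            return True
        except TimeoutError:
            return False


if __name__ == "__main__":
    print(heartbeat_answered_in_time(blocking_seconds=0.0, timeout_seconds=1.0))
    print(heartbeat_answered_in_time(blocking_seconds=0.5, timeout_seconds=0.1))
```

In this model, raising the timeout only helps if the blocking call is shorter than the new timeout, which is why finding out what blocks the thread matters more than tuning the values.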
If you are confident that the network is not the culprit, I would suggest
setting the logging level to DEBUG and looking for periodic log messages (JM
and TM logs) that are related to heartbeating. If the periodic log messages are
overdue, it is a hint that the main thread of the RPC endpoint is blocked.
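As a rough sketch of the DEBUG suggestion above, assuming the cluster uses the log4j 1.x conf/log4j.properties file that Flink 1.5 distributions ship by default, one extra logger line could raise the level for the runtime classes only. The choice of the org.apache.flink.runtime package as the scope is an assumption made here to limit log volume; the thread itself only says to set the level to DEBUG.

```properties
# Assumption: log4j 1.x syntax, added to conf/log4j.properties on JM and TM hosts.
# Raises only Flink runtime classes (JobMaster, TaskExecutor, heartbeat) to DEBUG.
log4j.logger.org.apache.flink.runtime=DEBUG
```

The processes need to be restarted to pick up the change, and DEBUG output can be large on a busy cluster.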
k/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1611
k/blob/913b0413882939c30da4ad4df0cabc84dfe69ea0/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatManagerSenderImpl.java#L64

On Mon, Aug 13, 2018 at 9:52 AM, Juho Autio <juho.autio@xxxxxxxxx> wrote:

I also have jobs failing on a daily basis with the error "Heartbeat of TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?

I already set these in flink-conf.yaml, but I'm still getting failures:

heartbeat.interval: 10000
heartbeat.timeout: 100000

Thanks.

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santoshi@xxxxxxxxx> wrote:

According to the UI it seems that "org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed." was the cause of a pipe restart.

As to the TM, it is an artifact of the new job allocation regime, which will exhaust all slots on a TM rather than distributing them equitably. TMs selectively are under more stress than in a pure RR distribution, I think. We may have to lower the slots on each TM to define a good upper bound. You are correct, 50s is a pretty generous value.

On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@xxxxxxxxxxxxxxxxx> wrote:

Hi,
The first exception should be only logged on info level. It's expected to see
this exception when a TaskManager unregisters from the ResourceManager.
Heartbeats can be configured via heartbeat.interval and heartbeat.timeout.
The default timeout is 50s, which should be a generous value. It is probably a
good idea to find out why the heartbeats cannot be answered by the TM.
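For reference, a minimal flink-conf.yaml fragment for the two options named above might look like the following. Both values are in milliseconds; the interval shown is the default, and the timeout value is purely illustrative, not a recommendation.

```yaml
# Time between heartbeat requests, in milliseconds (10000 is the default).
heartbeat.interval: 10000
# Time without a heartbeat before the peer is considered lost, in milliseconds.
# Default is 50000; the value below is only an illustrative increase.
heartbeat.timeout: 120000
```

A larger timeout delays failure detection by the same amount, so it trades faster recovery for tolerance of short stalls.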
/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager

On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santoshi@xxxxxxxxx> wrote:

2 issues we are seeing on 1.5.1 on a streaming pipeline:

org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.

and

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.

Not sure about the first, but how do we increase the heartbeat interval of a TM?

Thanks much,
Vishal