It seems that in your case the JobMaster did not receive a heartbeat from the
TaskManager in time. Heartbeat requests and answers are sent over the RPC
framework, and the RPCs of one component (e.g., TaskManager, JobMaster, etc.)
are dispatched by a single thread. Therefore, the possible reasons for
heartbeat timeouts include:
1. The RPC threads of the TM or JM are blocked. In this case heartbeat requests or answers cannot be dispatched.
2. The scheduled task for sending the heartbeat requests died.
3. The network is flaky.
If you are confident that the network is not the culprit, I would suggest
setting the logging level to DEBUG and looking for periodic log messages (in
the JM and TM logs) that are related to heartbeating. If the periodic log
messages are overdue, it is a hint that the main thread of the RPC endpoint is
blocked.
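
To make the first point more concrete, here is a minimal sketch of how one blocking call on a single dispatcher thread delays everything queued behind it, including a heartbeat answer. This is not Flink code: the class name BlockedMainThreadDemo, the method heartbeatDelayMillis, and the use of a plain single-thread executor as a stand-in for the RPC main thread are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an RPC endpoint whose calls are all
// dispatched by one thread. Not part of Flink.
class BlockedMainThreadDemo {

    /**
     * Queues a call that blocks for blockMillis, then queues a "heartbeat"
     * behind it, and returns how long the heartbeat waited before the
     * single thread got around to handling it.
     */
    static long heartbeatDelayMillis(long blockMillis) throws Exception {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            // A slow call occupies the only dispatcher thread.
            mainThread.submit(() -> {
                try {
                    Thread.sleep(blockMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // The heartbeat is enqueued immediately, but it can only run
            // once the slow call above has finished.
            long enqueuedAt = System.nanoTime();
            Callable<Long> heartbeat = System::nanoTime;
            Future<Long> handledAt = mainThread.submit(heartbeat);

            return TimeUnit.NANOSECONDS.toMillis(handledAt.get() - enqueuedAt);
        } finally {
            mainThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("idle main thread:    " + heartbeatDelayMillis(0) + " ms");
        System.out.println("blocked main thread: " + heartbeatDelayMillis(500) + " ms");
    }
}
```

If the blocking time in such a setup exceeds the heartbeat timeout, the other side would declare the component lost even though the process and the network are fine, which is why overdue periodic log messages point at a blocked main thread.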