Re: runtime.resourcemanager


Hello Piotrek,

thank you for your answer. I installed Flink on a local cluster and used the web UI to monitor the task managers. It seems the program does not start at all. The whole time, only the job manager is struggling... For very small toy examples, after a long time (during which I see the job manager logs I mentioned before), the job starts and then finishes within 2 seconds.

Best,

Alieh


On 12/07/2018 10:43 AM, Piotr Nowojski wrote:
Hi,

Please investigate the logs and standard output/error of the task manager that failed (the logs you showed are from the job manager). There is probably an obvious error or exception explaining why it failed. The most common reasons are:
- out of memory
- long GC pause
- segmentation fault or another error in a native library
- task manager killed, for example via SIGKILL
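
A minimal sketch of how such a check could be automated, assuming the task manager log is a plain-text file. The `scan_log` helper and the `INDICATORS` table are hypothetical names introduced only for this illustration, and the patterns are assumptions about what typically appears in JVM output; adjust them to your own logs. A process killed with SIGKILL cannot log anything itself, and long GC pauses usually show up only when GC logging is enabled, so neither is covered here.

```python
import re
import sys

# Hypothetical indicator patterns (assumptions, not an exhaustive list).
INDICATORS = {
    "out of memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "native crash": re.compile(
        r"SIGSEGV|A fatal error has been detected by the Java Runtime Environment"
    ),
    "termination signal": re.compile(r"RECEIVED SIGNAL \d+"),
}


def scan_log(lines):
    """Return (line_number, label, line) for every line matching an indicator."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in INDICATORS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py <path-to-task-manager-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, label, line in scan_log(handle):
            print(f"{number}: [{label}] {line}")
```

If the scan finds nothing and the log simply ends abruptly, that would be consistent with the process having been killed from outside.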

Piotrek

On 6 Dec 2018, at 17:34, Alieh <saeedi@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Hello all,

I have an algorithm x() that contains several joins and uses Gelly's ConnectedComponents three times. The problem is that if I call x() more than three times inside a script, I get the messages listed below in the log and the program more or less stops. This happens even when I run it on a toy graph with fewer than 10 vertices. Do you have any idea what the problem is?

Cheers,

Alieh


129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
129150 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat request from e80ec35f3d0a04a68000ecbdc555f98b.
129150 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received new slot report from TaskManager 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle timeout.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.