Re: runtime.resourcemanager


Hey,

Is that the whole Task Manager log? Have you checked for memory issues on both the Task Managers and the Job Manager, such as out-of-memory errors or long GC pauses, as I suggested in my first email?

After you rule out memory issues, you could capture a couple of thread dumps (`kill -3 JVM_PID` or `jstack JVM_PID`) and check whether any thread is stuck in your code.
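
As a rough aid for comparing such thread dumps, here is a minimal Python sketch. It assumes the usual HotSpot `jstack` text layout (a quoted thread name, then a `java.lang.Thread.State:` line, then `at ...` frames). The function names and the package prefix passed in are hypothetical, not part of Flink or the JDK.

```python
import re

# Thread header in a HotSpot jstack dump starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>.+?)"')
STATE_LINE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>\w+)')
FRAME_LINE = re.compile(r'^\s+at (?P<frame>.+)$')


def parse_thread_dump(text):
    """Split a jstack-style dump into dicts with name, state and frames."""
    threads = []
    current = None
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = {"name": header.group("name"), "state": None, "frames": []}
            threads.append(current)
            continue
        if current is None:
            continue  # preamble before the first thread
        state = STATE_LINE.match(line)
        if state:
            current["state"] = state.group("state")
            continue
        frame = FRAME_LINE.match(line)
        if frame:
            current["frames"].append(frame.group("frame"))
    return threads


def threads_in_package(threads, package_prefix):
    """Threads with at least one stack frame inside the given package."""
    return [
        t for t in threads
        if any(f.startswith(package_prefix) for f in t["frames"])
    ]


def stuck_candidates(dumps, package_prefix):
    """Names of threads that are inside the package in every dump."""
    per_dump = [
        {t["name"] for t in threads_in_package(parse_thread_dump(d), package_prefix)}
        for d in dumps
    ]
    return set.intersection(*per_dump) if per_dump else set()
```

One way to use it: save the output of `jstack JVM_PID` to a few files some seconds apart, read them into strings, and pass them to `stuck_candidates` with your own package prefix. A thread that shows up in every dump is only a candidate. It may simply be busy, so the frames still need to be read by hand.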

Another potential issue: are you sure that you have a healthy network between the nodes? No packet loss, low ping, etc.?
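
A quick first check of reachability could look like the sketch below. It only times TCP connects, so it is not a substitute for `ping` or a real packet-loss measurement. The helper names are hypothetical, and the host and port you pass are whatever `jobmanager.rpc.address` and `jobmanager.rpc.port` are set to in your configuration.

```python
import socket
import time


def tcp_connect_times(host, port, attempts=5, timeout=2.0):
    """Return connect times in milliseconds; None marks a failed attempt."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # The connection is opened and closed again immediately.
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.monotonic() - start) * 1000.0)
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            results.append(None)
    return results


def summarize(results):
    """Count failures and report the slowest successful connect."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failed": len(results) - len(ok),
        "max_ms": max(ok) if ok else None,
    }
```

Running `summarize(tcp_connect_times(host, port))` from each Task Manager node towards the Job Manager gives a rough picture. Any failed attempts or unusually slow connects would be a reason to look at the network more closely.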

Piotrek

On 10 Dec 2018, at 17:44, Alieh <saeedi@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Hello,

This is the task manager log, but it does not change after I run the program. I think the Flink planner has a problem with my program; it cannot even start the job.

Best,

Alieh


2018-12-10 12:20:20,386 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - --------------------------------------------------------------------------------
2018-12-10 12:20:20,387 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Starting TaskManager (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
2018-12-10 12:20:20,387 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  OS current user: alieh
2018-12-10 12:20:20,609 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-10 12:20:20,768 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Current Hadoop/Kerberos user: alieh
2018-12-10 12:20:20,769 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b12
2018-12-10 12:20:20,769 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Maximum heap size: 922 MiBytes
2018-12-10 12:20:20,769 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  JAVA_HOME: /usr/lib/jvm/java-8-oracle
2018-12-10 12:20:20,774 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Hadoop version: 2.4.1
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  JVM Options:
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -XX:+UseG1GC
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Xms922M
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Xmx922M
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -XX:MaxDirectMemorySize=8388607T
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Dlog.file=/home/alieh/flink-1.6.0/log/flink-alieh-taskexecutor-0-alieh-P67A-D3-B3.log
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Dlog4j.configuration=file:/home/alieh/flink-1.6.0/conf/log4j.properties
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     -Dlogback.configurationFile=file:/home/alieh/flink-1.6.0/conf/logback.xml
2018-12-10 12:20:20,775 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Program Arguments:
2018-12-10 12:20:20,776 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     --configDir
2018-12-10 12:20:20,776 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -     /home/alieh/flink-1.6.0/conf
2018-12-10 12:20:20,776 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       -  Classpath: /home/alieh/flink-1.6.0/lib/flink-python_2.11-1.6.0.jar:/home/alieh/flink-1.6.0/lib/flink-shaded-hadoop2-uber-1.6.0.jar:/home/alieh/flink-1.6.0/lib/log4j-1.2.17.jar:/home/alieh/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar:/home/alieh/flink-1.6.0/lib/flink-dist_2.11-1.6.0.jar:::
2018-12-10 12:20:20,776 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - --------------------------------------------------------------------------------
2018-12-10 12:20:20,777 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-12-10 12:20:20,785 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Maximum number of open file descriptors is 1048576.
2018-12-10 12:20:20,803 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, localhost
2018-12-10 12:20:20,803 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-12-10 12:20:20,803 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 1024m
2018-12-10 12:20:20,803 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.size, 1024m
2018-12-10 12:20:20,803 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-12-10 12:20:20,803 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 1
2018-12-10 12:20:20,804 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: rest.port, 8081
2018-12-10 12:20:20,912 INFO  org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user set to alieh (auth:SIMPLE)
2018-12-10 12:20:21,131 WARN  org.apache.flink.configuration.Configuration                  - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-12-10 12:20:21,135 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-12-10 12:20:21,136 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-12-10 12:20:21,145 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved new target address localhost/127.0.0.1:6123.
2018-12-10 12:20:21,204 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - TaskManager will use hostname/address 'alieh-P67A-D3-B3' (127.0.1.1) for communication.
2018-12-10 12:20:21,208 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils         - Starting AkkaRpcService at alieh-p67a-d3-b3:0.
2018-12-10 12:20:21,805 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-12-10 12:20:21,898 INFO  akka.remote.Remoting                                          - Starting remoting
2018-12-10 12:20:22,091 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@alieh-p67a-d3-b3:44267]
2018-12-10 12:20:22,117 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-12-10 12:20:22,124 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Created BLOB cache storage directory /tmp/blobStore-32ec7a05-737e-4b46-b716-3a0831683c47
2018-12-10 12:20:22,127 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory /tmp/blobStore-4b33c843-b7d3-45dc-814f-850e8c6be21a
2018-12-10 12:20:22,136 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: alieh-P67A-D3-B3/127.0.1.1, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2018-12-10 12:20:22,166 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerServices     - Temporary file directory '/tmp': total 450 GB, usable 91 GB (20.22% usable)
2018-12-10 12:20:22,211 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
2018-12-10 12:20:22,256 INFO  org.apache.flink.runtime.query.QueryableStateUtils            - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2018-12-10 12:20:22,256 INFO  org.apache.flink.runtime.query.QueryableStateUtils            - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2018-12-10 12:20:22,257 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment        - Starting the network environment and its components.
2018-12-10 12:20:22,289 INFO  org.apache.flink.runtime.io.network.netty.NettyClient         - Successful initialization (took 31 ms).
2018-12-10 12:20:22,325 INFO  org.apache.flink.runtime.io.network.netty.NettyServer         - Successful initialization (took 35 ms). Listening on SocketAddress /127.0.1.1:46127.
2018-12-10 12:20:22,326 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerServices     - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2018-12-10 12:20:22,329 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-4f10dc60-3805-4c50-85a1-497c99dfb20c for spill files.
2018-12-10 12:20:22,387 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2018-12-10 12:20:22,394 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2018-12-10 12:20:22,406 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2018-12-10 12:20:22,407 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@localhost:6123/user/resourcemanager(00000000000000000000000000000000).
2018-12-10 12:20:22,409 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-058052c5-36cc-432f-88eb-8acf7dc5f1f1
2018-12-10 12:20:22,743 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved ResourceManager address, beginning registration
2018-12-10 12:20:22,743 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-10 12:20:22,814 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful registration at resource manager akka.tcp://flink@localhost:6123/user/resourcemanager under registration id ba9dd638db7ebccde63a3e0df420a990.

On 12/10/2018 12:14 PM, Piotr Nowojski wrote:
Hi,

Have you checked the task manager logs?

Piotrek

On 8 Dec 2018, at 12:23, Alieh <saeedi@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Hello Piotrek,

Thank you for your answer. I installed Flink on a local cluster and used the GUI to monitor the task managers. It seems the program does not start at all. The whole time, only the job manager is struggling... For very small toy examples, after a long time (during which I see the job manager logs I mentioned before), the job is started and executes in 2 seconds.

Best,

Alieh


On 12/07/2018 10:43 AM, Piotr Nowojski wrote:
Hi,

Please investigate the logs and standard output/error from the task manager that has failed (the logs that you showed are from the job manager). There is probably some obvious error or exception explaining why it failed. The most common reasons:
- out of memory
- long GC pause
- seg fault or other error from some native library
- task manager killed, for example via SIGKILL
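
To make a first pass over a task manager log for the reasons listed above, a small Python sketch like the following may help. It assumes the log4j line layout seen in Flink's default logs (timestamp, level, logger, ` - `, message) and looks for two textual signatures: `java.lang.OutOfMemoryError` and the HotSpot crash banner. The `scan_log` name and the signature table are hypothetical. Long GC pauses and a SIGKILL usually leave no such line in the log, so they are not covered here.

```python
import re

# Assumed layout: "2018-12-10 12:20:20,609 WARN  some.logger.Name  - message"
LOG_LINE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+'
    r'(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+- (?P<msg>.*)$'
)

# Substrings that point at two of the failure causes listed above.
SIGNATURES = {
    "out of memory": "java.lang.OutOfMemoryError",
    "native crash": "A fatal error has been detected by the Java Runtime Environment",
}


def scan_log(lines):
    """Return (line_number, label, text) for suspicious lines."""
    findings = []
    for number, line in enumerate(lines, start=1):
        # Known failure signatures can appear on any line, including stack traces.
        for label, needle in SIGNATURES.items():
            if needle in line:
                findings.append((number, label, line.rstrip()))
        # Also report every WARN or ERROR entry in the assumed layout.
        match = LOG_LINE.match(line)
        if match and match.group("level") in ("WARN", "ERROR"):
            findings.append((number, match.group("level"), match.group("msg")))
    return findings
```

Passing an open log file to it, for example `scan_log(open(path))` with `path` pointing at the task manager log, lists the candidate lines with their line numbers. An empty result does not prove the process was healthy; it only means none of these patterns appeared in that file.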

Piotrek

On 6 Dec 2018, at 17:34, Alieh <saeedi@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Hello all,

I have an algorithm x() which contains several joins and uses Gelly ConnectedComponents three times. The problem is that if I call x() inside a script more than three times, I receive the messages listed below in the log and the program somehow stops. It happens even if I run it with a toy example of a graph with fewer than 10 vertices. Do you have any clue what the problem is?

Cheers,

Alieh


129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
129149 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Trigger heartbeat request.
129150 [flink-akka.actor.default-dispatcher-20] DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor  - Received heartbeat request from e80ec35f3d0a04a68000ecbdc555f98b.
129150 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received heartbeat from 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Received new slot report from TaskManager 78cdd7a4-0c00-4912-992f-a2990a5d46db.
129151 [flink-akka.actor.default-dispatcher-22] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Received slot report from instance 4c3e3654c11b09fbbf8e993a08a4c2da.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Release TaskExecutor 4c3e3654c11b09fbbf8e993a08a4c2da because it exceeded the idle timeout.
129200 [flink-akka.actor.default-dispatcher-15] DEBUG org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Worker 78cdd7a4-0c00-4912-992f-a2990a5d46db could not be stopped.