osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Flink cluster crashing going from 1.4.0 -> 1.5.3


Hello,

I am trying to get my Beam application (run on newer version of Flink
(1.5.3) but having trouble with that. When I submit application, everything
works fine but after a few mins (as soon as 2 minutes after job start)
cluster just goes bad. Logs are full of timeouts for heartbeats, JobManager
lost leadership, TaskExecutor timed out etc.

At that time, also WebUI is not usable. Looking into job manager, I did
notice that all of "flink-akka.actor.default-dispatcher" threads are busy
or blocked. Most blocks are on metrics:

=======================================
java.lang.Thread.State: BLOCKED (on object monitor)
        at
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
        - waiting to lock <0x000000053df75510> (a
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
        at
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
        at
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown
Source)
        at
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        ...
=======================================

I tried to increase memory, as MetricStore seems to hold quite a lot stuff,
but it is not helping. On 1.4.0 job manager was running with 4GB heap, now,
this behaviour also occur with 10G.

Any suggestions?

Best,
Jozef

P.S.: Executed Beam app has problem in setup with 100 parallelism, 100 task
slots, 2100 running task, streaming mode. Smaller job runs without problem