Re: Flink cluster crashing going from 1.4.0 -> 1.5.3


Hi,

So with Flink 1.5.3 but a smaller parallelism the job works fine?

Best,
Aljoscha

> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
> 
> Hello,
> 
> I am trying to run my Beam application on a newer version of Flink
> (1.5.3), but I am having trouble with it. When I submit the application,
> everything works fine at first, but after a few minutes (as soon as 2
> minutes after job start) the cluster becomes unstable. The logs are full
> of heartbeat timeouts, JobManager lost leadership, TaskExecutor timed out,
> etc.
> 
> At that point the web UI is also unusable. Looking at the JobManager, I
> noticed that all of the "flink-akka.actor.default-dispatcher" threads are
> busy or blocked. Most of them are blocked on metrics:
> 
> =======================================
> java.lang.Thread.State: BLOCKED (on object monitor)
>        at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
>        - waiting to lock <0x000000053df75510> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
>        at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
>        at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown Source)
>        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>        ...
> =======================================
> 
> I tried increasing the heap, as MetricStore seems to hold quite a lot of
> data, but that does not help. On 1.4.0 the JobManager ran with a 4 GB
> heap; now this behaviour occurs even with 10 GB.
> 
> Any suggestions?
> 
> Best,
> Jozef
> 
> P.S.: The Beam application that shows the problem runs with parallelism
> 100, 100 task slots, and 2100 running tasks, in streaming mode. A smaller
> job runs without problems.