Re: Flink cluster crashing going from 1.4.0 -> 1.5.3


Yes, with less data, and therefore fewer resources and lower parallelism,
exactly the same job runs fine.

On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljoscha@xxxxxxxxxx> wrote:

> Hi,
>
> So with Flink 1.5.3 but a smaller parallelism the job works fine?
>
> Best,
> Aljoscha
>
> > On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
> >
> > Hello,
> >
> > I am trying to get my Beam application to run on a newer version of Flink
> > (1.5.3) but am having trouble with that. When I submit the application,
> > everything works fine, but after a few minutes (as soon as 2 minutes after
> > job start) the cluster just goes bad. The logs are full of heartbeat
> > timeouts, JobManager lost leadership, TaskExecutor timed out, etc.
> >
> > At that point the WebUI is also unusable. Looking into the job manager, I
> > noticed that all of the "flink-akka.actor.default-dispatcher" threads are
> > busy or blocked. Most are blocked on metrics:
> >
> > =======================================
> > java.lang.Thread.State: BLOCKED (on object monitor)
> >        at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
> >        - waiting to lock <0x000000053df75510> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
> >        at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
> >        at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown Source)
> >        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> >        ...
> > =======================================
> >
> > I tried increasing memory, as MetricStore seems to hold quite a lot of
> > stuff, but it did not help. On 1.4.0 the job manager was running with a
> > 4 GB heap; now this behaviour also occurs with 10 GB.
> >
> > Any suggestions?
> >
> > Best,
> > Jozef
> >
> > P.S.: The executed Beam app has the problem in a setup with parallelism 100,
> > 100 task slots, and 2100 running tasks, in streaming mode. A smaller job
> > runs without problems.
>
>