osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3


parallelism is 100.  I tried clusters with 1 and 2 slots per TM yielding
100 or 50 TMs in cluster.

I did notice that URL  http://jobmanager:port/jobs/job_id/metrics  in 1.5.x
returns huge list of "latency.source_id. ...." IDs. Heap dump shows that
hash map takes 1.6GB for me. I am guessing that is the one dispatcher
threads keep updating. Not sure what are those. In 1.4.0 that URL returns
something else, very short list.

On Thu, Aug 23, 2018 at 6:44 PM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx>
wrote:

> Hi,
>
> How many task slots do you have in the cluster and per machine, and what
> parallelism are you using?
>
> Piotrek
>
> > On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
> >
> > Yes, on smaller data and therefore smaller resources and parallelism
> > exactly same job runs fine
> >
> > On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljoscha@xxxxxxxxxx>
> wrote:
> >
> >> Hi,
> >>
> >> So with Flink 1.5.3 but a smaller parallelism the job works fine?
> >>
> >> Best,
> >> Aljoscha
> >>
> >>> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I am trying to get my Beam application (run on newer version of Flink
> >>> (1.5.3) but having trouble with that. When I submit application,
> >> everything
> >>> works fine but after a few mins (as soon as 2 minutes after job start)
> >>> cluster just goes bad. Logs are full of timeouts for heartbeats,
> >> JobManager
> >>> lost leadership, TaskExecutor timed out etc.
> >>>
> >>> At that time, also WebUI is not usable. Looking into job manager, I did
> >>> notice that all of "flink-akka.actor.default-dispatcher" threads are
> busy
> >>> or blocked. Most blocks are on metrics:
> >>>
> >>> =======================================
> >>> java.lang.Thread.State: BLOCKED (on object monitor)
> >>>       at
> >>>
> >>
> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
> >>>       - waiting to lock <0x000000053df75510> (a
> >>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
> >>>       at
> >>>
> >>
> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
> >>>       at
> >>>
> >>
> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown
> >>> Source)
> >>>       at
> >>>
> >>
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> >>>       ...
> >>> =======================================
> >>>
> >>> I tried to increase memory, as MetricStore seems to hold quite a lot
> >> stuff,
> >>> but it is not helping. On 1.4.0 job manager was running with 4GB heap,
> >> now,
> >>> this behaviour also occur with 10G.
> >>>
> >>> Any suggestions?
> >>>
> >>> Best,
> >>> Jozef
> >>>
> >>> P.S.: Executed Beam app has problem in setup with 100 parallelism, 100
> >> task
> >>> slots, 2100 running task, streaming mode. Smaller job runs without
> >> problem
> >>
> >>
>
>