osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3


Hi,

How many task slots do you have in the cluster and per machine, and what parallelism are you using?

Piotrek

> On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
> 
> Yes, on smaller data and therefore smaller resources and parallelism
> exactly same job runs fine
> 
> On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljoscha@xxxxxxxxxx> wrote:
> 
>> Hi,
>> 
>> So with Flink 1.5.3 but a smaller parallelism the job works fine?
>> 
>> Best,
>> Aljoscha
>> 
>>> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vilcek@xxxxxxxxx> wrote:
>>> 
>>> Hello,
>>> 
>>> I am trying to get my Beam application (run on newer version of Flink
>>> (1.5.3) but having trouble with that. When I submit application,
>> everything
>>> works fine but after a few mins (as soon as 2 minutes after job start)
>>> cluster just goes bad. Logs are full of timeouts for heartbeats,
>> JobManager
>>> lost leadership, TaskExecutor timed out etc.
>>> 
>>> At that time, also WebUI is not usable. Looking into job manager, I did
>>> notice that all of "flink-akka.actor.default-dispatcher" threads are busy
>>> or blocked. Most blocks are on metrics:
>>> 
>>> =======================================
>>> java.lang.Thread.State: BLOCKED (on object monitor)
>>>       at
>>> 
>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
>>>       - waiting to lock <0x000000053df75510> (a
>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
>>>       at
>>> 
>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
>>>       at
>>> 
>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown
>>> Source)
>>>       at
>>> 
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>       ...
>>> =======================================
>>> 
>>> I tried to increase memory, as MetricStore seems to hold quite a lot
>> stuff,
>>> but it is not helping. On 1.4.0 job manager was running with 4GB heap,
>> now,
>>> this behaviour also occur with 10G.
>>> 
>>> Any suggestions?
>>> 
>>> Best,
>>> Jozef
>>> 
>>> P.S.: Executed Beam app has problem in setup with 100 parallelism, 100
>> task
>>> slots, 2100 running task, streaming mode. Smaller job runs without
>> problem
>> 
>>