Re: PartitionNotFoundException after deployment


Ufuk: I don’t know why.

+1 for your other suggestions.

Piotrek

> On 4 May 2018, at 14:52, Ufuk Celebi <ufuk@xxxxxxxxxxxxxxxxx> wrote:
> 
> Hey Gyula!
> 
> I'm including Piotr and Nico (cc'd) who have worked on the network
> stack in the last releases.
> 
> Registering the network structures including the intermediate results
> actually happens **before** any state is restored. I'm not sure why
> this reproducibly happens when you restore state. @Nico, Piotr: any
> ideas here?
> 
> In general I think what happens here is the following:
> - a task requests the result of a local upstream producer, but that
> one has not registered its intermediate result yet
> - this should result in a retry of the request with some backoff
> (controlled via the config params you mention
> taskmanager.network.request-backoff.max,
> taskmanager.network.request-backoff.initial)
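
To make the retry behaviour described above concrete, here is a minimal sketch of a doubling backoff capped at a maximum. This is not Flink's actual code: the class and method names are hypothetical, and the values 100 ms and 10000 ms are only illustrative stand-ins for taskmanager.network.request-backoff.initial and taskmanager.network.request-backoff.max.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a capped, doubling retry backoff.
// Not taken from Flink; names and values are assumptions.
class BackoffSketch {

    /** Returns the delays (in ms) waited before each retry, until the cap is reached. */
    static List<Long> backoffSchedule(long initialMs, long maxMs) {
        List<Long> delays = new ArrayList<>();
        if (initialMs <= 0 || initialMs > maxMs) {
            return delays; // no retries with a non-positive or out-of-range initial delay
        }
        long current = initialMs;
        while (true) {
            delays.add(current);
            if (current >= maxMs) {
                break; // cap reached: this was the last retry
            }
            // Double the delay, clamping to the cap (written to avoid overflow).
            current = (current > maxMs / 2) ? maxMs : current * 2;
        }
        return delays;
    }

    /** Sums the delays, i.e. how long a producer has to register before the request fails. */
    static long totalWait(List<Long> delays) {
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> delays = backoffSchedule(100, 10000);
        System.out.println(delays + " total=" + totalWait(delays) + " ms");
    }
}
```

Under these assumed values the sketch retries eight times and waits 22700 ms in total, so in a scheme like this a producer that registers its result later than that would still be reported as not found.
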
> 
> As a first step I would set logging to DEBUG and check the TM logs for
> messages like "Retriggering partition request {}:{}."
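
A sketch of what that DEBUG setting could look like, assuming the TaskManagers use a log4j 1.x properties file (for example conf/log4j.properties). The logger name is the package of LocalInputChannel from the stack trace below, which as far as I know also contains SingleInputGate; adjust it if your logging setup differs.

```properties
# Assumption: log4j 1.x properties syntax in the TaskManager logging config.
# Enables DEBUG only for the input channel / input gate package.
log4j.logger.org.apache.flink.runtime.io.network.partition.consumer=DEBUG
```
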
> 
> You can also check the SingleInputGate code which has the logic for
> retriggering requests.
> 
> – Ufuk
> 
> 
> On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.fora@xxxxxxxxx> wrote:
>> Hi Ufuk,
>> 
>> Do you have any quick idea what could cause this problem in Flink 1.4.2?
>> Seems like one operator takes too long to deploy and downstream tasks error
>> out on partition not found. This only seems to happen when the job is
>> restored from state and in fact that operator has some keyed and operator
>> state as well.
>> 
>> Deploying the same job from empty state works well. We tried increasing the
>> taskmanager.network.request-backoff.max, but that didn't help.
>> 
>> It would be great if you have some pointers on where to look further; I
>> haven't seen this happen before.
>> 
>> Thank you!
>> Gyula
>> 
>> The error:
>> org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
>>    at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>>    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>>    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>>    at java.util.TimerThread.mainLoop(Timer.java:555)
>>    at java.util.TimerThread.run(Timer.java:505)