osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: PartitionNotFoundException after deployment


Hi Gyula,
as a follow-up, you may be interested in
https://issues.apache.org/jira/browse/FLINK-9413


Nico

On 04/05/18 15:36, Gyula Fóra wrote:
> Looks pretty clear that one operator takes too long to start (even on
> the UI it shows it in the created state for far too long). Any idea what
> might cause this delay? It actually often crashes on Akka ask timeout
> during scheduling the node.
> 
> Gyula
> 
> Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx
> <mailto:piotr@xxxxxxxxxxxxxxxxx>> ezt írta (időpont: 2018. máj. 4., P,
> 15:33):
> 
>     Ufuk: I don’t know why.
> 
>     +1 for your other suggestions.
> 
>     Piotrek
> 
>     > On 4 May 2018, at 14:52, Ufuk Celebi <ufuk@xxxxxxxxxxxxxxxxx
>     <mailto:ufuk@xxxxxxxxxxxxxxxxx>> wrote:
>     >
>     > Hey Gyula!
>     >
>     > I'm including Piotr and Nico (cc'd) who have worked on the network
>     > stack in the last releases.
>     >
>     > Registering the network structures including the intermediate results
>     > actually happens **before** any state is restored. I'm not sure why
>     > this reproducibly happens when you restore state. @Nico, Piotr: any
>     > ideas here?
>     >
>     > In general I think what happens here is the following:
>     > - a task requests the result of a local upstream producer, but that
>     > one has not registered its intermediate result yet
>     > - this should result in a retry of the request with some backoff
>     > (controlled via the config params you mention
>     > taskmanager.network.request-backoff.max,
>     > taskmanager.network.request-backoff.initial)
>     >
>     > As a first step I would set logging to DEBUG and check the TM logs for
>     > messages like "Retriggering partition request {}:{}."
>     >
>     > You can also check the SingleInputGate code which has the logic for
>     > retriggering requests.
>     >
>     > – Ufuk
>     >
>     >
>     > On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.fora@xxxxxxxxx
>     <mailto:gyula.fora@xxxxxxxxx>> wrote:
>     >> Hi Ufuk,
>     >>
>     >> Do you have any quick idea what could cause this problems in
>     flink 1.4.2?
>     >> Seems like one operator takes too long to deploy and downstream
>     tasks error
>     >> out on partition not found. This only seems to happen when the job is
>     >> restored from state and in fact that operator has some keyed and
>     operator
>     >> state as well.
>     >>
>     >> Deploying the same job from empty state works well. We tried
>     increasing the
>     >> taskmanager.network.request-backoff.max that didnt help.
>     >>
>     >> It would be great if you have some pointers where to look
>     further, I havent
>     >> seen this happening before.
>     >>
>     >> Thank you!
>     >> Gyula
>     >>
>     >> The errror:
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.: Partition
>     >> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd
>     not found.
>     >>    at
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>     >>    at
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>     >>    at
>     >> org.apache.flink.runtime.io
>     <http://org.apache.flink.runtime.io>.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>     >>    at java.util.TimerThread.mainLoop(Timer.java:555)
>     >>    at java.util.TimerThread.run(Timer.java:505)
>     >
>     >
>     >
>     > --
>     > Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
>     >
>     > info@xxxxxxxxxxxxxxxxx <mailto:info@xxxxxxxxxxxxxxxxx>
>     > +49-30-43208879 <tel:+49%2030%2043208879>
>     >
>     > Registered at Amtsgericht Charlottenburg - HRB 158244 B
>     > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
> 

-- 
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Attachment: signature.asc
Description: OpenPGP digital signature