OSDir


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Batch job stuck in Canceled state in Flink 1.5


Also, please have a look at the other TaskManagers' logs, in particular
the one that is running the operator that was mentioned in the
exception. You should look out for the ID 98f5976716234236dc69fb0e82a0cc34.


Nico


PS: Flink logs files should compress quite nicely if they grow too big :)

On 03/05/18 14:07, Stephan Ewen wrote:
> Google Drive would be great.
> 
> Thanks!
> 
> On Thu, May 3, 2018 at 1:33 PM, Amit Jain <aj2011it@xxxxxxxxx
> <mailto:aj2011it@xxxxxxxxx>> wrote:
> 
>     Hi Stephan,
> 
>     Size of JM log file is 122 MB. Could you provide me other media to
>     post the same? We can use Google Drive if that's fine with you.
> 
>     --
>     Thanks,
>     Amit
> 
>     On Thu, May 3, 2018 at 12:58 PM, Stephan Ewen <sewen@xxxxxxxxxx
>     <mailto:sewen@xxxxxxxxxx>> wrote:
>     > Hi Amit!
>     >
>     > Thanks for sharing this, this looks like a regression with the
>     network stack
>     > changes.
>     >
>     > The log you shared from the TaskManager gives some hint, but that
>     exception
>     > alone should not be a problem. That exception can occur under a
>     race between
>     > deployment of some tasks while the whole job is entering a
>     recovery phase
>     > (maybe we should not print it so prominently to not confuse
>     users). There
>     > must be something else happening on the JobManager. Can you share
>     the JM
>     > logs as well?
>     >
>     > Thanks a lot,
>     > Stephan
>     >
>     >
>     > On Wed, May 2, 2018 at 12:21 PM, Amit Jain <aj2011it@xxxxxxxxx
>     <mailto:aj2011it@xxxxxxxxx>> wrote:
>     >>
>     >> Thanks! Fabian
>     >>
>     >> I will try using the current release-1.5 branch and update this
>     thread.
>     >>
>     >> --
>     >> Thanks,
>     >> Amit
>     >>
>     >> On Wed, May 2, 2018 at 3:42 PM, Fabian Hueske <fhueske@xxxxxxxxx
>     <mailto:fhueske@xxxxxxxxx>> wrote:
>     >> > Hi Amit,
>     >> >
>     >> > We recently fixed a bug in the network stack that affected
>     batch jobs
>     >> > (FLINK-9144).
>     >> > The fix was added after your commit.
>     >> >
>     >> > Do you have a chance to build the current release-1.5 branch
>     and check
>     >> > if
>     >> > the fix also resolves your problem?
>     >> >
>     >> > Otherwise it would be great if you could open a blocker issue
>     for the
>     >> > 1.5
>     >> > release to ensure that this is fixed.
>     >> >
>     >> > Thanks,
>     >> > Fabian
>     >> >
>     >> > 2018-04-29 18:30 GMT+02:00 Amit Jain <aj2011it@xxxxxxxxx
>     <mailto:aj2011it@xxxxxxxxx>>:
>     >> >>
>     >> >> Cluster is running on commit 2af481a
>     >> >>
>     >> >> On Sun, Apr 29, 2018 at 9:59 PM, Amit Jain <aj2011it@xxxxxxxxx
>     <mailto:aj2011it@xxxxxxxxx>> wrote:
>     >> >> > Hi,
>     >> >> >
>     >> >> > We are running numbers of batch jobs in Flink 1.5 cluster
>     and few of
>     >> >> > those
>     >> >> > are getting stuck at random. These jobs having the following
>     failure
>     >> >> > after
>     >> >> > which operator status changes to CANCELED and stuck to same.
>     >> >> >
>     >> >> > Please find complete TM's log at
>     >> >> >
>     https://gist.github.com/imamitjain/066d0e99990ee24f2da1ddc83eba2012
>     <https://gist.github.com/imamitjain/066d0e99990ee24f2da1ddc83eba2012>
>     >> >> >
>     >> >> >
>     >> >> > 2018-04-29 14:57:24,437 INFO
>     >> >> > org.apache.flink.runtime.taskmanager.Task
>     >> >> > - Producer 98f5976716234236dc69fb0e82a0cc34 of partition
>     >> >> > 5b8dd5ac2df14266080a8e049defbe38 disposed. Cancelling execution.
>     >> >> >
>     >> >> >
>     org.apache.flink.runtime.jobmanager.PartitionProducerDisposedException:
>     >> >> > Execution 98f5976716234236dc69fb0e82a0cc34 producing partition
>     >> >> > 5b8dd5ac2df14266080a8e049defbe38 has already been disposed.
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     org.apache.flink.runtime.jobmaster.JobMaster.requestPartitionState(JobMaster.java:610)
>     >> >> > at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     >> >> > at java.lang.reflect.Method.invoke(Method.java:498)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:210)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:69)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
>     >> >> > at
>     >> >> >
>     >> >> >
>     akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
>     >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     >> >> > at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     >> >> > at
>     >> >> >
>     scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     >> >> > at
>     >> >> >
>     >> >> >
>     scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     >> >> > at
>     >> >> >
>     >> >> >
>     >> >> >
>     scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>     >> >> >
>     >> >> >
>     >> >> > Thanks
>     >> >> > Amit
>     >> >
>     >> >
>     >
>     >
> 
> 

-- 
Nico Kruber | Software Engineer
data Artisans

Follow us @dataArtisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Stresemannstr. 121A,10963 Berlin, Germany
data Artisans, Inc. | 1161 Mission Street, San Francisco, CA-94103, USA
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Attachment: signature.asc
Description: OpenPGP digital signature