[jira] [Created] (FLINK-10850) Job may hang on FAILING state if taskmanager updateTaskExecutionState failed


ouyangzhe created FLINK-10850:
---------------------------------

             Summary: Job may hang on FAILING state if taskmanager updateTaskExecutionState failed
                 Key: FLINK-10850
                 URL: https://issues.apache.org/jira/browse/FLINK-10850
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.5.5
            Reporter: ouyangzhe
             Fix For: 1.8.0


I encountered a job that ran out of memory (OOM) but then hung in the FAILING state. Three slots were left unreleased, and the corresponding tasks remained in the CANCELING state.

I found the following log in the TaskManager. It seems that the TaskManager tried to report the transition from CANCELING to CANCELED through updateTaskExecutionState, but hit an OutOfMemoryError.
{panel}


2018-11-08 18:01:23,250 INFO  org.apache.flink.runtime.taskmanager.Task                     - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from RUNNING to CANCELING.
2018-11-08 18:01:23,257 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
2018-11-08 18:01:44,081 INFO  org.apache.flink.runtime.taskmanager.Task                     - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from CANCELING to CANCELED.
2018-11-08 18:01:44,081 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
2018-11-08 18:02:03,097 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.shaded.guava18.com.google.common.collect.Maps$EntryFunction$1.apply(Maps.java:86)
org.apache.flink.shaded.guava18.com.google.common.collect.Iterators$8.transform(Iterators.java:799)
org.apache.flink.shaded.guava18.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
java.util.AbstractCollection.toArray(AbstractCollection.java:141)
org.apache.flink.shaded.guava18.com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartitionsProducedBy(ResultPartitionManager.java:100)
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:275)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:833)
java.lang.Thread.run(Thread.java:745)

2018-11-08 18:02:05,665 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Discarding the results produced by task execution e9141e20871e530dee904ddce11adca0.
2018-11-08 18:02:22,536 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Discarding the results produced by task execution 7fac76a5d76247d803e1f1c47a6b385f.
2018-11-08 18:03:47,210 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.runtime.memory.MemoryManager.releaseAll(MemoryManager.java:497)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:837)
java.lang.Thread.run(Thread.java:745)

2018-11-08 18:03:47,213 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) [CANCELED]
2018-11-08 18:03:47,215 WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline  - An exception was thrown by a user handler while handling an exception event ([id: 0x397132f7, /11.10.199.197:33286 => /11.9.137.228:40859] EXCEPTION: java.lang.OutOfMemoryError: GC overhead limit exceeded)
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
        at org.apache.flink.shaded.akka.org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
        at org.apache.flink.shaded.akka.org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
        at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
        at org.apache.flink.shaded.akka.org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
        at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.extractFrame(FrameDecoder.java:566)
        at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:391)
        at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
        at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.apache.flink.shaded.akka.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.apache.flink.shaded.akka.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{panel}
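
To make the suspected failure mode concrete, here is a minimal toy model in Python. It is not Flink code: ToyJobManager, report_canceled and simulate are hypothetical names, and the rule that the job only leaves FAILING once every task has reported CANCELED is an assumption drawn from the behavior described above. The sketch only shows that a single lost CANCELED report is enough to leave the job stuck.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    CANCELING = "CANCELING"
    CANCELED = "CANCELED"


class ToyJobManager:
    """Hypothetical stand-in for the job-side bookkeeping; not Flink's classes."""

    def __init__(self, task_ids):
        self.job_state = "RUNNING"
        self.tasks = {task_id: TaskState.RUNNING for task_id in task_ids}

    def fail_job(self):
        # Failing the job asks every task to cancel.
        self.job_state = "FAILING"
        for task_id in self.tasks:
            self.tasks[task_id] = TaskState.CANCELING

    def update_task_execution_state(self, task_id, new_state):
        # Called when a task manager reports a state change.
        self.tasks[task_id] = new_state
        # Assumption: the job becomes FAILED only when all tasks are CANCELED.
        if all(state is TaskState.CANCELED for state in self.tasks.values()):
            self.job_state = "FAILED"


def report_canceled(job_manager, task_id, delivery_ok):
    """Simulate a task manager reporting CANCELED; the report may be lost."""
    if not delivery_ok:
        # For example, the sender hit an OutOfMemoryError before the call completed.
        return False
    job_manager.update_task_execution_state(task_id, TaskState.CANCELED)
    return True


def simulate(lost_task_ids):
    """Fail a three-task job and drop the CANCELED report for the given tasks."""
    job_manager = ToyJobManager(["t1", "t2", "t3"])
    job_manager.fail_job()
    for task_id in list(job_manager.tasks):
        report_canceled(job_manager, task_id, delivery_ok=task_id not in lost_task_ids)
    stuck = [t for t, s in job_manager.tasks.items() if s is TaskState.CANCELING]
    return job_manager.job_state, stuck
```

Calling simulate(set()) models the healthy case where every report arrives, while simulate({"t2"}) models one lost report: the job stays in FAILING with t2 still recorded as CANCELING. In this toy model nothing ever re-sends the report or times out the wait, which is the gap this issue is about.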


