osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey


Hi Alex,

Thanks for your help! I'm quite used to debugging concurrent/distributed problems. But this one is quite tricky, especially with regards to GRPC threads. I try to provide more information in the following.

There are two observations:

1) The problem is specifically related to how the cleanup is performed for the EmbeddedEnvironmentFactory. The environment is shutdown when the SDK Harness exists but the GRPC threads continue to linger for some time and may stall state processing on the next test.

If you do _not_ close DefaultJobBundleFactory, which happens during close() or dispose() in the FlinkExecutableStageFunction or ExecutableStageDoFnOperator respectively, the tests run just fine. I ran 1000 test runs without a single failure.

The EmbeddedEnvironment uses direct channels which are marked experimental in GRPC. We may have to convert them to regular socket communication.

2) Try setting a conditional breakpoint in GrpcStateService which will never break, e.g. "false". Set it here: https://github.com/apache/beam/blob/6da9aa5594f96c0201d497f6dce4797c4984a2fd/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/GrpcStateService.java#L134

The tests will never fail. The SDK harness is always shutdown correctly at the end of the test.

Thanks,
Max

On 26.11.18 19:15, Alex Amato wrote:
Thanks Maximilian, let me know if you need any help. Usually I debug this sort of thing by pausing the IntelliJ debugger to see all the different threads which are waiting on various conditions. If you find any insights from that, please post them here and we can try to figure out the source of the stuckness. Perhaps it may be some concurrency issue leading to deadlock?

On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels <mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>> wrote:

    I couldn't fix it thus far. The issue does not seem to be in the Flink
    Runner but in the way the tests utilizes the EMBEDDED environment to
    run
    multiple portable jobs in a row.

    When it gets stuck it is in RemoteBundle#close and it is independent of
    the test type (batch and streaming have different implementations).

    Will give it another look tomorrow.

    Thanks,
    Max

    On 22.11.18 13:07, Maximilian Michels wrote:
     > Hi Alex,
     >
     > The test seems to have gotten flaky after we merged support for
    portable
     > timers in Flink's batch mode.
     >
     > Looking into this now.
     >
     > Thanks,
     > Max
     >
     > On 21.11.18 23:56, Alex Amato wrote:
     >> Hello, I have noticed
     >> that org.apache.beam.runners.flink.PortableTimersExecutionTest
    is very
     >> flakey, and repro'd this test timeout on the master branch in
    40/50 runs.
     >>
     >> I filed a JIRA issue: BEAM-6111
     >> <https://issues.apache.org/jira/browse/BEAM-6111>. I was just
     >> wondering if anyone knew why this may be occurring, and to check if
     >> anyone else has been experiencing this.
     >>
     >> Thanks,
     >> Alex