[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: org.apache.beam.runners.flink.PortableTimersExecutionTest is very flakey

Thank you for looking into the test. For me the flakiness was solely caused by the non thread-safe GrpcStateService. I have since closed the JIRA issue because I didn't see another failure since the fix.

Your fixes are valid, but they won't fix flakiness (if present) in the current testing pipeline. Why? The results are only ever written by 1 worker because the testing source uses Impulse which generates a signal only received by a single worker. So the shared results list is not a problem for this test.

Let me quickly comment on the changes you mentioned:

1) Yes, if we had a parallel source, the List should be a concurrent or synchronized list.

2) Using a static list should be fine for testing purposes. There are no other tests accessing this map. Tests are not run parallel on a class level. Besides, there is only one test in this class.

3) If you make the results object transient, then it won't be serialized, so it will be 'null' after being deserialized.

Thank you for spending the time to look into the test. Do you think it would make sense to address changes to the test separately of your PR?

I believe the test can be further improved, e.g. to run more parallel. Also, if you see any flakiness after the merged fix, please post the build log in the JIRA issue.


On 05.12.18 03:18, Alex Amato wrote:
Well, here is my hacky solution.
You can see the changes I make to PortableTimersExecutionTest

I don't really understand why the pipeline never starts running when I make the results object transient in PortableTiemrsExecutionTest.

So I instead continue to access a static object, but key it with the test parameter, to prevent tests from interfering with each other.

I am not too sure how to proceed. I don't really want to check in this hacky solution. But I am not too sure of what else to do with solved the problems. Please let me know if you have any suggestions.

On Tue, Dec 4, 2018 at 5:26 PM Alex Amato <ajamato@xxxxxxxxxx <mailto:ajamato@xxxxxxxxxx>> wrote:

    Thanks for letting me know Maximillian,

    Btw, I've been looking a this test the last few days as well. I have found a
    few other concurrency issues. That I hope to send a PR out for.

      * The PortableTimersExecutionTest result variable is using a static
        ArrayList, but can be writen to concurrently (by multiple thread AND
        multiple parameterized test instnace) which causing flakeyness.
      * But just using a ConcurrentLinkedQueue and a non static variable isn't
        sufficient as that will cause a copy of the results object to be copied
        during doFn serialization. So that makes all the assertions fail, since
        nothing get written to the same result object the test is using/
          o So it should be made private transient final. However, after trying
            this I am seeing the test timeout, and I am not sure why. Continuing
            to debug this.

    I think that my PR was increasing flakeyness, which is why I saw more of
    these issues.
    Just wanted to point these out in the meantime, hopefull it helps with
    debugging for you too.

    On Fri, Nov 30, 2018 at 7:49 AM Maximilian Michels <mxm@xxxxxxxxxx
    <mailto:mxm@xxxxxxxxxx>> wrote:

        This turned out to be a tricky bug. Robert and me had a joined debugging
        session and managed to find the culprit.

        PR pending: https://github.com/apache/beam/pull/7171

        On 27.11.18 19:35, Kenneth Knowles wrote:
         > I actually didn't look at this one. I filed a bunch more adjacent flake
         > bugs. I didn't find your bug but I do see that test flaking at the same
         > time as the others. FWIW here is the list of flakes and sickbayed tests:
         > https://issues.apache.org/jira/issues/?filter=12343195
         > Kenn
         > On Tue, Nov 27, 2018 at 10:25 AM Alex Amato <ajamato@xxxxxxxxxx
         > <mailto:ajamato@xxxxxxxxxx <mailto:ajamato@xxxxxxxxxx>>> wrote:
         >     +Ken,
         >     Did you happen to look into this test? I heard that you may have
         >     been looking into this.
         >     On Mon, Nov 26, 2018 at 3:36 PM Maximilian Michels
        <mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>
         >     <mailto:mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>>> wrote:
         >         Hi Alex,
         >         Thanks for your help! I'm quite used to debugging
         >         concurrent/distributed
         >         problems. But this one is quite tricky, especially with regards
         >         to GRPC
         >         threads. I try to provide more information in the following.
         >         There are two observations:
         >         1) The problem is specifically related to how the cleanup is
         >         performed
         >         for the EmbeddedEnvironmentFactory. The environment is shutdown
         >         when the
         >         SDK Harness exists but the GRPC threads continue to linger for
         >         some time
         >         and may stall state processing on the next test.
         >         If you do _not_ close DefaultJobBundleFactory, which happens
         >         close() or dispose() in the FlinkExecutableStageFunction or
         >         ExecutableStageDoFnOperator respectively, the tests run just
         >         fine. I ran
         >         1000 test runs without a single failure.
         >         The EmbeddedEnvironment uses direct channels which are marked
         >         experimental in GRPC. We may have to convert them to regular
         >         communication.
         >         2) Try setting a conditional breakpoint in GrpcStateService
         >         which will
         >         never break, e.g. "false". Set it here:
         >         The tests will never fail. The SDK harness is always shutdown
         >         correctly
         >         at the end of the test.
         >         Thanks,
         >         Max
         >         On 26.11.18 19:15, Alex Amato wrote:
         >          > Thanks Maximilian, let me know if you need any help. Usually
         >         I debug
         >          > this sort of thing by pausing the IntelliJ debugger to see
         >         all the
         >          > different threads which are waiting on various conditions. If
         >         you find
         >          > any insights from that, please post them here and we can try
         >         to figure
         >          > out the source of the stuckness. Perhaps it may be some
         >         concurrency
         >          > issue leading to deadlock?
         >          >
         >          > On Thu, Nov 22, 2018 at 12:57 PM Maximilian Michels
         >         <mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>
        <mailto:mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>>
         >          > <mailto:mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>
        <mailto:mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>>>> wrote:
         >          >
         >          >     I couldn't fix it thus far. The issue does not seem to be
         >         in the Flink
         >          >     Runner but in the way the tests utilizes the EMBEDDED
         >         environment to
         >          >     run
         >          >     multiple portable jobs in a row.
         >          >
         >          >     When it gets stuck it is in RemoteBundle#close and it is
         >         independent of
         >          >     the test type (batch and streaming have different
         >         implementations).
         >          >
         >          >     Will give it another look tomorrow.
         >          >
         >          >     Thanks,
         >          >     Max
         >          >
         >          >     On 22.11.18 13:07, Maximilian Michels wrote:
         >          >      > Hi Alex,
         >          >      >
         >          >      > The test seems to have gotten flaky after we merged
         >         support for
         >          >     portable
         >          >      > timers in Flink's batch mode.
         >          >      >
         >          >      > Looking into this now.
         >          >      >
         >          >      > Thanks,
         >          >      > Max
         >          >      >
         >          >      > On 21.11.18 23:56, Alex Amato wrote:
         >          >      >> Hello, I have noticed
         >          >      >>
         >         that org.apache.beam.runners.flink.PortableTimersExecutionTest
         >          >     is very
         >          >      >> flakey, and repro'd this test timeout on the master
         >         branch in
         >          >     40/50 runs.
         >          >      >>
         >          >      >> I filed a JIRA issue: BEAM-6111
         >          >      >> <https://issues.apache.org/jira/browse/BEAM-6111>. I
         >         was just
         >          >      >> wondering if anyone knew why this may be occurring,
         >         and to check if
         >          >      >> anyone else has been experiencing this.
         >          >      >>
         >          >      >> Thanks,
         >          >      >> Alex
         >          >