Re: [DISCUSS] Unsustainable situation with ptests


We should do it in a separate thread so that people can see it with the
[VOTE] subject.  Some people use that as a filter in their email to know
when to pay attention to things.

Alan.

On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
pjayachandran@xxxxxxxxxxxxxxx> wrote:

> Will there be a separate voting thread? Or is the voting on this thread
> sufficient for the lock down?
>
> Thanks
> Prasanth
>
> > On May 14, 2018, at 2:34 PM, Alan Gates <alanfgates@xxxxxxxxx> wrote:
> >
> > I see there's support for this, but people are still pouring in commits.
> > I propose we have a quick vote on this to lock down the commits until we
> > get to green.  That way everyone knows we have drawn the line at a
> > specific point.  Any commits after that point would be reverted.  There
> > isn't a category in the bylaws that fits this kind of vote, but I suggest
> > lazy majority as the most appropriate one (at least 3 votes, more +1s
> > than -1s).
> >
> > Alan.
> >
> > On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <
> > vihang@xxxxxxxxxxxx> wrote:
> >
> >> I worked on a few quick-fix optimizations in Ptest infrastructure over
> >> the weekend which reduced the execution run from ~90 min to ~70 min
> >> per run. I had to restart Ptest multiple times. I was resubmitting the
> >> patches which were in the queue manually, but I may have missed a few.
> >> In case you have a patch which is pending pre-commit and you don't see
> >> it in the queue, please submit it manually or let me know if you don't
> >> have access to the Jenkins job. I will continue to work on the
> >> sub-tasks in HIVE-19425 and will do some maintenance next weekend as
> >> well.
> >>
> >> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
> >> jcamacho@xxxxxxxxxx> wrote:
> >>
> >>> Vineet has already been working on disabling those tests that were
> >>> timing out. I am working on disabling those that have been generating
> >>> different q files consistently for the last n ptest runs. I am
> >>> keeping track of all these tests in
> >>> https://issues.apache.org/jira/browse/HIVE-19509.
> >>>
> >>> -Jesús
> >>>
> >>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
> >>> pjayachandran@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>>    +1 on freezing commits until we get repetitive green tests. We
> >>> should probably disable (and remember in a JIRA to re-enable them at
> >>> a later point) tests that are flaky, to get repetitive green test
> >>> runs.
> >>>
> >>>    Thanks
> >>>    Prasanth
> >>>
> >>>
> >>>
> >>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <
> >>> lirui.fudan@xxxxxxxxx> wrote:
> >>>
> >>>
> >>>    +1 to freezing commits until we stabilize
> >>>
> >>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar wrote:
> >>>
> >>>> In order to understand the end-to-end precommit flow I would like
> >>>> to get access to the PreCommit-HIVE-Build Jenkins script. Does
> >>>> anyone know how I can get that?
> >>>>
> >>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
> >>>> jcamacho@xxxxxxxxxx> wrote:
> >>>>
> >>>>> Bq. For the short term green runs, I think we should @Ignore the
> >>>>> tests which are known to have been failing for many runs. They are
> >>>>> not being addressed anyway. If people think they are important to
> >>>>> run, we should fix them and only then re-enable them.
> >>>>>
> >>>>> I think that is a good idea, as we would minimize the time that we
> >>>>> halt development. We can create a JIRA where we list all the tests
> >>>>> that were failing and that we have disabled to get the clean run.
> >>>>> From that moment on, we will have zero tolerance towards committing
> >>>>> with failing tests. And we need to pick up those tests that should
> >>>>> not be ignored and bring them back, but passing. If there is no
> >>>>> disagreement, I can start working on that.
> >>>>>
> >>>>> Once I am done, I can try to help with infra tickets too.
> >>>>>
> >>>>> -Jesús
> >>>>>
> >>>>>
> >>>>> On 5/11/18, 1:57 PM, "Vineet Garg" wrote:
> >>>>>
> >>>>>    +1. I strongly vote for freezing commits and getting our test
> >>>>> coverage into an acceptable state.  We have been struggling to
> >>>>> stabilize branch-3 due to test failures, and releasing Hive 3.0 in
> >>>>> its current state would be unacceptable.
> >>>>>
> >>>>>    Currently there are quite a few test suites which are not even
> >>>>> running and are timing out. We have been committing patches (to
> >>>>> both branch-3 and master) without test coverage for these tests.
> >>>>>    We should immediately figure out what’s going on before we
> >>>>> proceed with commits.
> >>>>>
> >>>>>    For reference, the following test suites are timing out on
> >>>>> master (https://issues.apache.org/jira/browse/HIVE-19506):
> >>>>>
> >>>>>
> >>>>>    TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
> >>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
> >>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
> >>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
> >>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
> >>>>>    TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
> >>>>>
> >>>>>
> >>>>>    Vineet
> >>>>>
> >>>>>
> >>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
> >>>>> vihang@xxxxxxxxxxxx> wrote:
> >>>>>
> >>>>>    +1 There are many problems with the test infrastructure and in
> >>>>> my opinion it has now become the number one bottleneck for the
> >>>>> project. I was looking at the infrastructure yesterday and I think
> >>>>> the current infrastructure (even with its own set of problems) is
> >>>>> still under-utilized. I am planning to increase the number of
> >>>>> threads to process the parallel test batches, to start with.
> >>>>> It needs a restart on the server side. I can do it now, if folks
> >>>>> are okay with it. Else I can do it over the weekend when the queue
> >>>>> is small.
> >>>>>
> >>>>>    I listed the improvements which I thought would be useful under
> >>>>> https://issues.apache.org/jira/browse/HIVE-19425, but frankly
> >>>>> speaking I am not able to devote as much time to it as I would
> >>>>> like. I would appreciate it if folks who have some more time can
> >>>>> help out.
> >>>>>
> >>>>>    I think that, to start with,
> >>>>> https://issues.apache.org/jira/browse/HIVE-19429 will help a lot.
> >>>>> We need to pack more test runs in parallel, and containers provide
> >>>>> good isolation.
> >>>>>
> >>>>>    For the short term green runs, I think we should @Ignore the
> >>>>> tests which are known to have been failing for many runs. They are
> >>>>> not being addressed anyway. If people think they are important to
> >>>>> run, we should fix them and only then re-enable them.
> >>>>>
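> >>>>>    As a minimal sketch of what disabling looks like (the test
> >>>>> class and method names below are made up for illustration), the
> >>>>> JUnit @Ignore annotation can carry the reason and point at the
> >>>>> tracking JIRA so we remember to re-enable:
> >>>>>
> >>>>>    import org.junit.Ignore;
> >>>>>    import org.junit.Test;
> >>>>>
> >>>>>    public class TestSomeCliDriver {
> >>>>>      // Fails consistently across ptest runs; disabled so we can
> >>>>>      // get to a green run. Listed in the tracking JIRA for
> >>>>>      // re-enabling once fixed.
> >>>>>      @Ignore("Consistently failing; see tracking JIRA")
> >>>>>      @Test
> >>>>>      public void testVectorizedQuery() {
> >>>>>        // test body unchanged
> >>>>>      }
> >>>>>    }
> >>>>>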
> >>>>>    Also, I feel we need a light-weight test run which we can run
> >>>>> locally before submitting a patch for the full suite. That way
> >>>>> minor issues with the patch can be handled locally. Maybe create a
> >>>>> profile which runs a subset of important tests which are
> >>>>> consistent. We can apply some label indicating that the
> >>>>> pre-checkin local tests ran successfully, and only then submit for
> >>>>> the full suite.
> >>>>>
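> >>>>>    A minimal way to sketch such a profile (all class names here
> >>>>> are hypothetical, not existing Hive classes) is a JUnit 4 category
> >>>>> marking the consistent tests, plus a suite that only pulls those in:
> >>>>>
> >>>>>    // StableTest.java - marker interface for consistent tests.
> >>>>>    public interface StableTest {}
> >>>>>
> >>>>>    // PreCheckinSuite.java - runs only suite members whose tests
> >>>>>    // are tagged @Category(StableTest.class).
> >>>>>    import org.junit.experimental.categories.Categories;
> >>>>>    import org.junit.experimental.categories.Categories.IncludeCategory;
> >>>>>    import org.junit.runner.RunWith;
> >>>>>    import org.junit.runners.Suite.SuiteClasses;
> >>>>>
> >>>>>    @RunWith(Categories.class)
> >>>>>    @IncludeCategory(StableTest.class)
> >>>>>    @SuiteClasses({ TestParseDriver.class, TestSchemaTool.class })
> >>>>>    public class PreCheckinSuite {}
> >>>>>
> >>>>>    Running "mvn test -Dtest=PreCheckinSuite" locally before
> >>>>> submitting could then be the pre-checkin gate.
> >>>>>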
> >>>>>    More thoughts are welcome. Thanks for starting this
> >>> conversation.
> >>>>>
> >>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
> >>>>>    jcamacho@xxxxxxxxxx> wrote:
> >>>>>
> >>>>>    I believe we have reached a state (maybe we did reach it a
> >>>>> while ago) that is not sustainable anymore, as there are so many
> >>>>> tests failing / timing out that it is not possible to verify
> >>>>> whether a patch is breaking some critical parts of the system or
> >>>>> not. It also seems to me that due to the timeouts (maybe due to
> >>>>> infra, maybe not), ptest runs are taking even longer than usual,
> >>>>> which in turn creates an even longer queue of patches.
> >>>>>
> >>>>>    There is an ongoing effort to improve ptests usability
> >>>>> (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
> >>>>> that, we need to make an effort to stabilize existing tests and
> >>>>> bring that failure count to zero.
> >>>>>
> >>>>>    Hence, I am suggesting *we stop committing any patch before we
> >>>>> get a green run*. If someone thinks this proposal is too radical,
> >>>>> please come up with an alternative, because I do not think it is OK
> >>>>> to have the ptest runs in their current state. Other projects of a
> >>>>> certain size (e.g., Hadoop, Spark) are always green; we should be
> >>>>> able to do the same.
> >>>>>
> >>>>>    Finally, once we get to zero failures, I suggest we be less
> >>>>> tolerant of committing without getting a clean ptest run. If there
> >>>>> is a failure, we need to fix it or revert the patch that caused it,
> >>>>> and then we continue developing.
> >>>>>
> >>>>>    Please, let’s all work together as a community to fix this
> >>>>> issue; that is the only way to get to zero quickly.
> >>>>>
> >>>>>    Thanks,
> >>>>>    Jesús
> >>>>>
> >>>>>    PS. I assume the flaky tests will come into the discussion.
> >>>>> Let’s see first how many of those we have, then we can work to
> >>>>> find a fix.
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>>    --
> >>>    Best regards!
> >>>    Rui Li
> >>>
> >>
>
>