Re: [DISCUSS] Unsustainable situation with ptests


In order to understand the end-to-end precommit flow, I would like to get
access to the PreCommit-HIVE-Build Jenkins script. Does anyone know how I
can get that?

On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
jcamacho@xxxxxxxxxx> wrote:

> Bq. For the short term green runs, I think we should @Ignore the tests
> which are known to be failing for many runs. They are not being addressed
> anyway. If people think they are important to run, we should fix them and
> only then re-enable them.
>
> I think that is a good idea, as it would minimize the time that we halt
> development. We can create a JIRA listing all the tests that were failing
> and that we have disabled to get the clean run. From that moment on, we
> will have zero tolerance for committing with failing tests. Then we need
> to pick out the tests that should not stay ignored, fix them, and
> re-enable them. If there is no disagreement, I can start working on that.
>
> Once I am done, I can try to help with infra tickets too.
>
> -Jesús
>
>
> On 5/11/18, 1:57 PM, "Vineet Garg" <vgarg@xxxxxxxxxxxxxxx> wrote:
>
>     +1. I strongly vote for freezing commits and getting our test
>     coverage into an acceptable state. We have been struggling to
>     stabilize branch-3 due to test failures, and releasing Hive 3.0 in
>     its current state would be unacceptable.
>
>     Currently there are quite a few test suites which are not even
>     running; they are timing out. We have been committing patches (to
>     both branch-3 and master) without coverage from these suites.
>     We should immediately figure out what's going on before we proceed
>     with commits.
>
>     For reference, the following test suites are timing out on master
>     (https://issues.apache.org/jira/browse/HIVE-19506):
>
>
>     TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
>     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
>     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
>     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
>     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
>     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
>
>
>     Vineet
>
>
>     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar
>     <vihang@xxxxxxxxxxxx> wrote:
>
>     +1. There are many problems with the test infrastructure, and in my
>     opinion it has now become the number one bottleneck for the project.
>     I was looking at the infrastructure yesterday, and I think the
>     current infrastructure (even with its own set of problems) is still
>     under-utilized. To start with, I am planning to increase the number
>     of threads that process the parallel test batches. That needs a
>     restart on the server side. I can do it now, if folks are okay with
>     it. Otherwise I can do it over the weekend when the queue is small.
>
>     I listed the improvements which I thought would be useful under
>     https://issues.apache.org/jira/browse/HIVE-19425, but frankly
>     speaking I am not able to devote as much time to it as I would like.
>     I would appreciate it if folks who have some more time could help
>     out.
>
>     I think that, to start with, https://issues.apache.org/jira/browse/HIVE-19429
>     will help a lot. We need to pack in more test runs in parallel, and
>     containers provide good isolation.
>
>     For the short term green runs, I think we should @Ignore the tests
>     which are known to be failing for many runs. They are not being
>     addressed anyway. If people think they are important to run, we
>     should fix them and only then re-enable them.
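>
>     As a concrete sketch (a minimal example assuming JUnit 4; the test
>     class and JIRA id below are hypothetical), disabling a known-bad
>     test while keeping a pointer to its tracking issue could look like
>     this:
>
>         import org.junit.Ignore;
>         import org.junit.Test;
>
>         public class TestExampleFlaky {
>
>           // Disabled until the tracking JIRA (hypothetical id below) is
>           // resolved; the reason string shows up in the test report, so
>           // the disabled test is not silently lost.
>           @Ignore("Failing for many runs; see HIVE-NNNNN")
>           @Test
>           public void testSomethingFlaky() {
>             // original test body stays unchanged
>           }
>         }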
>
>     Also, I feel we need a light-weight test run which we can run
>     locally before submitting a patch for the full suite. That way,
>     minor issues with the patch can be handled locally. Maybe we could
>     create a profile which runs a subset of important tests that are
>     consistent. We could apply a label indicating that the pre-checkin
>     local tests ran successfully, and only then submit for the full
>     suite.
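>
>     A minimal sketch of how such a subset could be tagged (assuming
>     JUnit 4 categories; the marker interface and test class are
>     hypothetical):
>
>         import org.junit.Test;
>         import org.junit.experimental.categories.Category;
>
>         // Hypothetical marker interface for the fast pre-checkin subset.
>         interface PreCheckinTests {}
>
>         @Category(PreCheckinTests.class)
>         public class TestQuickSanity {
>
>           @Test
>           public void basicSanity() {
>             // a fast, deterministic check that catches obvious breakage
>           }
>         }
>
>     Surefire can then select just that category for the local run (for
>     example via its groups parameter), so the quick subset would not
>     need a separate build.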
>
>     More thoughts are welcome. Thanks for starting this conversation.
>
>     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
>     jcamacho@xxxxxxxxxx> wrote:
>
>     I believe we have reached a state (maybe we reached it a while ago)
>     that is no longer sustainable, as there are so many tests failing /
>     timing out that it is not possible to verify whether a patch is
>     breaking some critical parts of the system or not. It also seems to
>     me that, due to the timeouts (maybe due to infra, maybe not), ptest
>     runs are taking even longer than usual, which in turn creates an
>     even longer queue of patches.
>
>     There is an ongoing effort to improve ptest usability
>     (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
>     that, we need to make an effort to stabilize the existing tests and
>     bring the failure count to zero.
>
>     Hence, I am suggesting *we stop committing any patch before we get
>     a green run*. If someone thinks this proposal is too radical, please
>     come up with an alternative, because I do not think it is OK to have
>     the ptest runs in their current state. Other projects of a certain
>     size (e.g., Hadoop, Spark) are always green; we should be able to do
>     the same.
>
>     Finally, once we get to zero failures, I suggest we be less tolerant
>     of committing without a clean ptest run. If there is a failure, we
>     need to fix it or revert the patch that caused it, and then we
>     continue developing.
>
>     Please, let's all work together as a community to fix this issue;
>     that is the only way to get to zero quickly.
>
>     Thanks,
>     Jesús
>
>     PS. I assume the flaky tests will come into the discussion. Let's
>     first see how many of those we have; then we can work to find a fix.
>