
Long, Slow Zuul Queues and Why They Happen

On 9/13/2019 2:03 PM, Clark Boylan wrote:
> We've been fielding a fair number of questions and suggestions around Zuul's long change (and job) queues over the last week or so. As a result I tried to put together a quick FAQ-type document [0] on how we schedule jobs, why we schedule that way, and how we can improve the long queues.
> Hoping that gives us all a better understanding of why we are in the current situation and ideas on how we can help to improve things.
> [0] https://docs.openstack.org/infra/manual/testing.html#why-are-jobs-for-changes-queued-for-a-long-time

Thanks for writing this up Clark.

As for the current status of the gate, several nova devs have been
closely monitoring it, since 3 fairly lengthy series of feature changes
have been approved since yesterday. We're trying to shepherd those
through, but we're seeing failures and trying to react to them.

Two issues of note this week:

1. http://status.openstack.org/elastic-recheck/index.html#1843615

I had pushed a fix for that one earlier in the week but there was a bug 
in my fix which Takashi has fixed:


That was promoted to the gate earlier today but failed on...

2. http://status.openstack.org/elastic-recheck/index.html#1813147

We have a couple of patches up for that now, which might get promoted
once we are reasonably sure they are going to pass check. (Promoting to
the gate means skipping check, which is risky: if the change fails in
the gate, we have to re-queue the gate, as the doc above explains.)
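
To make the cost of a gate failure concrete, here is a rough sketch of a
dependent gate queue. It is a toy model, not Zuul's actual scheduler:
the function name simulate_gate and the idea of counting one "job run"
per change per round are assumptions made for illustration only.

```python
def simulate_gate(queue, failing):
    """Toy model of a dependent gate queue (not Zuul's real code).

    queue:   change names in gate order, head of the queue first.
    failing: set of change names whose jobs fail.
    Returns (merged, rejected, job_runs), where job_runs counts how many
    times a change had its job set started.
    """
    pending = list(queue)
    merged, rejected = [], []
    job_runs = 0
    while pending:
        # Every pending change is tested speculatively, each one on top
        # of the changes ahead of it in the queue.
        job_runs += len(pending)
        # Find the first change in queue order that fails.
        bad = next((c for c in pending if c in failing), None)
        if bad is None:
            merged.extend(pending)
            pending = []
        else:
            idx = pending.index(bad)
            merged.extend(pending[:idx])  # changes ahead of the failure merge
            rejected.append(bad)          # the failing change is kicked out
            pending = pending[idx + 1:]   # everything behind it starts over
    return merged, rejected, job_runs


# A promoted change "P" that fails at the head of the gate forces the
# three changes behind it to be tested a second time.
print(simulate_gate(["P", "A", "B", "C"], {"P"}))
```

In this model a clean queue of three changes costs three job runs, while
the same three changes queued behind a failing promoted change cost
seven, which is why we want to be reasonably sure a change passes before
promoting it.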

As far as overall failure classifications we're pretty good there in 


Meaning for the most part we know what's failing, we just need to fix 
the bugs.

One that continues to dog us (and by "us" I mean OpenStack, not just 
nova) is this one:


The QA team's work to split apart the big tempest full jobs into
service-oriented jobs like tempest-integrated-compute should have helped
here, but we're still seeing lots of jobs time out. That likely means
some really slow tests are running in too many jobs, and those require
investigation. It could also be devstack setup that is taking a long
time, like Clark identified with OSC usage a while back:


If you have questions about how elastic-recheck works or how to help
investigate some of these failures, for example by using
logstash.openstack.org, please reach out to me (mriedem), clarkb and/or
gmann in #openstack-qa.