Re: [Horizon][Designate][Tacker][Tobiko][OSH][Ironic] Flaky jobs and retries


On Mon, Dec 16, 2019, at 10:20 AM, Clark Boylan wrote:
> Hello,
> 
> Zuulv3 has a feature where it will rerun jobs that it thinks failed due 
> to infrastructure problems up to some configured limit. For us the 
> limit is 3 job attempts. The circumstances where Zuul thinks it has 
> infrastructure issues are when pre-run playbooks fail (because these 
> are expected to be stable), and when Ansible returns with an exit code 
> signifying network connectivity issues.
> 
> Recently I updated Zuul to report the job attempt in job inventories 
> and am using that to report this data into logstash and elasticsearch.  
> Now I can use queries like: 'zuul_attempts:"3" AND 
> filename:"job-output.txt" AND message:"Job console starting"' to see 
> which jobs are flaky and reattempting. I'm writing this email to report 
> on some of what I've found in the hopes that the problems we have can 
> be fixed and hopefully result in better jobs across the board.
> 
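
To make the retry rule quoted above concrete, here is a minimal sketch of the decision as described in the text. It is not Zuul's code: the AttemptOutcome class, its field names, the next_action function, and the "RETRY" label are all hypothetical and exist only for illustration. The limit of 3 attempts and the RETRY_LIMIT result are the only details taken from this thread.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # the attempt limit quoted above for this deployment


@dataclass
class AttemptOutcome:
    """Hypothetical summary of one job attempt (not a Zuul data structure)."""
    pre_run_failed: bool = False    # a pre-run playbook failed
    network_failure: bool = False   # Ansible signalled a connectivity problem
    run_failed: bool = False        # the job's own run playbook failed


def next_action(attempt: int, outcome: AttemptOutcome) -> str:
    """Return what happens after attempt number `attempt` (1-based)."""
    # Pre-run failures and connectivity errors are treated as infrastructure
    # problems, so the job is rerun until the attempt limit is reached.
    infra_problem = outcome.pre_run_failed or outcome.network_failure
    if infra_problem:
        return "RETRY" if attempt < MAX_ATTEMPTS else "RETRY_LIMIT"
    # Anything else is reported as an ordinary result.
    return "FAILURE" if outcome.run_failed else "SUCCESS"
```

Under these assumptions, a job whose pre-run playbook fails on every attempt is rerun twice and then reported as RETRY_LIMIT after the third attempt, which matches the pattern described for the Tobiko job.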

snip

> Tobiko
> 
> tobiko-devstack-faults-centos-7 is failing because this job runs on 
> CentOS 7 using python2 but Nova needs python3.6 now. This fails in 
> pre-run, 
> https://zuul.opendev.org/t/openstack/build/8071780e7ba748169b447f1d42e069fc/log/controller/logs/devstacklog.txt.gz#11799, and forces the job to be rerun. I'm not sure what the best fix is here. Maybe run your jobs on Ubuntu until CentOS 8 is supported by devstack?
> 

Looks like Tobiko was updated to build python3 on CentOS 7 to address this issue. Unfortunately, glance imports sqlite and the python that was built did not include the sqlite bindings. As a result g-api fails to start and the job still fails with RETRY_LIMIT, https://zuul.opendev.org/t/openstack/build/880ded9fe11c41da9c7622d85746df23/log/controller/logs/screen-g-api.txt.gz#101. Note that this job ran against the proposed fix, so the fix was tested pre-merge. You don't need to wait for the change to merge to observe these results.

Instead, Zuul allows you to push the fix up, confirm it works, and then merge the change. The fix for the sqlite issue is in progress here, https://review.opendev.org/#/c/699945/2, but I wanted to point out that Zuul's pre-merge testing helps us avoid these problems in the first place.
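
One way to catch a python build that lacks the sqlite bindings before devstack starts is to try importing the standard library module with the freshly built interpreter. The following is a minimal sketch, not part of the Tobiko change; the helper name missing_modules is hypothetical, and it assumes that an interpreter built without sqlite support raises ImportError on "import sqlite3".

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, e.g. when a C extension
            # such as the sqlite bindings was not compiled in.
            missing.append(name)
    return missing
```

Calling missing_modules(["sqlite3"]) with the built python returns an empty list when the bindings are present, so a pre-run step could fail early with a clear message if the list is non-empty.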

Finally, CentOS 7 does provide python3.6 packages in the main repository now. I'm not sure if that is sufficient for Tobiko's needs, but it may help simplify things here.

Clark