[Horizon][Designate][Tacker][Tobiko][OSH][Ironic] Flaky jobs and retries
Zuulv3 has a feature where it will rerun jobs that it thinks failed due to infrastructure problems up to some configured limit. For us the limit is 3 job attempts. The circumstances where Zuul thinks it has infrastructure issues are when pre-run playbooks fail (because these are expected to be stable), and when Ansible returns with an exit code signifying network connectivity issues.
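To make the retry rule above concrete, here is a minimal sketch of the decision as described in this email. It is not Zuul's actual implementation; the names AttemptResult, should_retry and MAX_ATTEMPTS are made up for illustration, and the only inputs modelled are the two conditions mentioned above.

```python
from dataclasses import dataclass

# The limit quoted above: at most 3 attempts per job.
MAX_ATTEMPTS = 3


@dataclass
class AttemptResult:
    """Hypothetical summary of one job attempt (not a Zuul data structure)."""
    pre_run_failed: bool    # a pre-run playbook failed
    network_failure: bool   # Ansible's exit code signalled connectivity problems


def should_retry(result: AttemptResult, attempt: int,
                 max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Return True if this attempt looks like an infra failure and attempts remain."""
    looks_like_infra = result.pre_run_failed or result.network_failure
    return looks_like_infra and attempt < max_attempts
```

Note that a pre-run failure that happens every time still burns all three attempts before the job finally reports a failure, which is why the cases below are worth fixing.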
Recently I updated Zuul to report the job attempt in job inventories and am using that to report this data into logstash and elasticsearch. Now I can use queries like: 'zuul_attempts:"3" AND filename:"job-output.txt" AND message:"Job console starting"' to see which jobs are flaky and reattempting. I'm writing this email to report on some of what I've found in the hopes that the problems we have can be fixed and hopefully result in better jobs across the board.
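If you want to run similar searches yourself, the query string above can be assembled and the results tallied with something like the following sketch. The three query fields come straight from the query above; the "build_name" field used for grouping and the shape of the hits (a list of dicts) are assumptions for illustration, so adjust them to whatever your search results actually contain.

```python
from collections import Counter


def flaky_job_query(attempt: int = 3) -> str:
    """Build the query string used to find jobs on a given attempt."""
    clauses = [
        f'zuul_attempts:"{attempt}"',
        'filename:"job-output.txt"',
        'message:"Job console starting"',
    ]
    return " AND ".join(clauses)


def tally_jobs(hits, field="build_name"):
    """Count hits per job name.

    `hits` is assumed to be a list of dicts, and `field` is an assumed
    name for the key holding the job name.
    """
    return Counter(hit[field] for hit in hits if field in hit)
```

Sorting the tally with most_common() gives a quick list of the jobs that reach their final attempt most often.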
Designate Horizon Dashboard
The nodejs jobs for stable/rocky expect to run against nodejs 4 but run on Bionic, which has no nodejs 4 packaging. This causes the pre-run to fail as it cannot install the requested package, https://zuul.opendev.org/t/openstack/build/498ebb052e2b4a3393b0939820ee8927/log/job-output.txt#391. A fix to pin these jobs to Xenial is in progress here, https://review.opendev.org/#/c/699241/1. Thank you!
Horizon projects and dashboards may want to double check they don't have similar mismatches between the nodejs version a job expects and what its test node provides.
Tacker
tacker-functional-devstack-multinode-python3 (and possibly other tacker devstack jobs) attempt to download an openwrt image during devstack runs and the server isn't responding. This fails in pre-run, http://zuul.openstack.org/build/27ff514c77724250968d60469923f613/log/controller/logs/devstacklog.txt.gz#45426, and results in the job being retried. I am not aware of a fix for this, but you'll likely need to find a new image host. Let the infra team know if we can help.
Tobiko
tobiko-devstack-faults-centos-7 is failing because this job runs on CentOS 7 using python2 but Nova needs python3.6 now. This fails in pre-run, https://zuul.opendev.org/t/openstack/build/8071780e7ba748169b447f1d42e069fc/log/controller/logs/devstacklog.txt.gz#11799, and forces the job to be rerun. I'm not sure what the best fix is here. Maybe run your jobs on Ubuntu until CentOS 8 is supported by devstack?
OpenStack Helm
Ansible has decided that become_user is not valid on include_role tasks. https://opendev.org/openstack/openstack-helm-infra/src/branch/master/roles/deploy-docker/tasks/deploy-ansible-docker-support.yaml#L26-L49 seems to be the issue. This causes Airship's deckhand functional docker jobs to fail in pre-run, https://zuul.opendev.org/t/openstack/build/7493038ee1744465b2387b44e067d029/log/job-output.txt#396, and the jobs are retried. It is possible this is fallout from Zuul's default Ansible version being bumped from 2.7 to 2.8.
This also seems to affect OSH's kubernetes keystone auth job, https://zuul.opendev.org/t/openstack/build/68d937b3e5e3449cb6fe2e6947bbf0db/log/job-output.txt#397
Ironic
The Ironic example is a bit different and runs into some fun Ansible behaviors. Ironic's IPA multinode jobs (possibly others too) are filling the root disk of the test nodes on some clouds, http://paste.openstack.org/show/787641/. Ansible uses /tmp to bootstrap its ssh connectivity, and when /tmp is on / and / is full, Ansible returns a network failure error code. This then causes Zuul to rerun the job. Unfortunately, because Ansible thinks networking is broken, we have to do special things in a Zuul cleanup playbook to double check disk usage, and that info is only available on the Zuul executors. But I've double checked for you and can confirm this is what is happening.
The fix here is to use the ephemeral disk which is mounted on /opt and contains much more disk space. Another option would be to reduce the size of Ironic's fake baremetal images. In any case they are aware of this and we should expect a fix soon.
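For anyone who wants to watch for this from inside a job before the disk fills, a small check along the following lines could be run on the test node. This is only a sketch using Python's standard library; it is not the Zuul cleanup playbook mentioned above, and the 90% threshold and the function names are arbitrary choices for illustration.

```python
import shutil


def usage_report(paths=("/", "/tmp", "/opt")):
    """Return {path: percent used} for each path that exists."""
    report = {}
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            # Path is missing or unreadable on this node; skip it.
            continue
        if usage.total > 0:
            report[path] = round(100 * usage.used / usage.total, 1)
    return report


def nearly_full(report, threshold=90.0):
    """List the paths whose usage is at or above the threshold (in percent)."""
    return [path for path, pct in report.items() if pct >= threshold]


if __name__ == "__main__":
    report = usage_report()
    print(report)
    print("nearly full:", nearly_full(report))
```

If / shows up as nearly full while /opt has plenty of room, that matches the situation described above and moving the large files to /opt should help.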
There are likely more cases I haven't found yet, but I wanted to point out this Zuul behavior to people. The intent is that it handle exceptional cases and not paper over failures that happen often or even 100% of the time. We should do our best to fix these problems when they pop up. Thank you to everyone that has already stepped up to help fix some of these!