[all][qa] Tempest jobs are swapping
---- On Wed, 03 Jul 2019 07:37:34 +0900 Clark Boylan <cboylan at sapwetik.org> wrote ----
> I've been working to bring up a new cloud as part of our nodepool resource set and one of the things we do to sanity check that is run a default tempest full job. The first time I ran tempest it failed because I hadn't configured swap on the test node and we ran out of memory. I added swap, reran things and tempest passed just fine.
> Our base jobs configure swap as a last-ditch effort to avoid failing jobs unnecessarily, but the ideal is to avoid swap entirely. In the past, 8GB of memory has been plenty to run the tempest test suite, so I think something has changed here and I think we should be able to get back to running under 8GB of memory again.
> I bring this up because in recent weeks we've seen different groups attempt to reduce their resource footprint (which is good), but many of the approaches seem to ignore that making our jobs as quick and reliable as possible (e.g. don't use swap) will have a major impact. This is due to the way gating works: a failure requires that we discard all results for subsequent changes in the gate, remove the change that failed, then re-enqueue jobs for the changes after the failed change. On top of that, the quicker our jobs run, the quicker we return resources to the pool.
> How do we debug this? Devstack jobs actually do capture dstat data as well as memory-specific information that can be used to identify resource hogs. Taking a recent tempest-full job's dstat log, we can see that cinder-backup is using 785MB of memory all on its own (scroll to the bottom). Devstack also captures memory usage of a larger set of processes in its peakmem_tracker log. This includes RSS specifically, which doesn't match up with dstat's number, making me think dstat's number may be virtual memory and not resident memory. This peakmem_tracker log identifies other processes which we might look at for improving this situation.
> It would be great if the QA team and various projects could take a look at this to help improve the reliability and throughput of our testing. Thank you.
Thanks, Clark, for pointing this out. We have faced this memory issue in the past as well, when some of the swift services were disabled. The cinder-backup service is no doubt taking a lot of memory. Regarding the patch Matt mentioned that disables the c-bak service in tempest-full, we still need some job that runs the c-bak tests on the cinder side as well as on the tempest side, but not on other projects.
I expect there will be some improvement from splitting the integrated tests according to the services they actually depend on. I need some time to prepare those templates, and then I will propose them and see whether the situation improves.
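
On the question of whether a reported number is virtual or resident memory, one quick check on a running Linux test node is to compare the two values for the same process directly. The snippet below is only a rough sketch, not something devstack or dstat ships: it assumes the node exposes /proc/<pid>/status with the usual "VmSize:" and "VmRSS:" lines in kB, and the function names parse_status_kb and read_status_kb are made up for illustration.

```python
# Sketch: compare virtual size (VmSize) and resident set size (VmRSS)
# for one process on Linux.
# Assumption: /proc/<pid>/status contains "VmSize:" and "VmRSS:" lines in kB.
# The helper names below are hypothetical, not part of devstack or dstat.


def parse_status_kb(status_text):
    """Return the VmSize/VmRSS values (in kB) found in /proc/<pid>/status text."""
    wanted = ("VmSize", "VmRSS")
    result = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            # The value looks like "  123456 kB"; keep only the integer part.
            result[key] = int(rest.split()[0])
    return result


def read_status_kb(pid="self"):
    """Read and parse /proc/<pid>/status; pid may be a number or 'self'."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_status_kb(status_file.read())
```

Calling read_status_kb() with the pid of the cinder-backup process and comparing the two values against the dstat and peakmem_tracker numbers should show which of them each log is closer to. Kernel threads have no VmSize or VmRSS lines, so the returned dict can be empty.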
>  http://logs.openstack.org/81/665281/3/check/tempest-full/cf5e17e/controller/logs/screen-dstat.txt.gz
>  http://logs.openstack.org/81/665281/3/check/tempest-full/cf5e17e/controller/logs/screen-peakmem_tracker.txt.gz