[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

creating instances, haproxy eats CPU, glance eats RAM

Hi everyone,

We have a Queens/Rocky environment with haproxy in front of most services. Recently we've found a problem when creating multiple instances (2 VCPUs, 6GB RAM) from large images. The behaviour is the same whether we use Horizon or Terraform, so I've continued on with Terraform since it's easier to repeat attempts.

WORKS: 340MB image, create 80x instances, 40 at a time
FAILS: 20GB image, create 40x instances, 20 at a time
Changed haproxy config, added "balance roundrobin" to glance, cinder, nova, neutron stanzas (there was no 'balance' config before, not sure what it would have been doing)
WORKS sometimes[1]: 20GB image, create 40x instances, 20 at a time
FAILS: 20GB image, create 80x instances, 40 at a time

The failure condition:
* The active haproxy server has a single core go to 100% usage (the rest are idle)
* One glance server's RAM usage grows rapidly and continuously
* Some instances that are building complete
* Creating new instances fails (BUILD state forever)
* Horizon becomes unresponsive
* Ceph and Cinder don't appear to be overloaded (ceph -s, logs, system state)
* These states do not recover until we take the following actions...

To recover:
* Kill any remaining (Terraform) attempts to launch instances
* Stop haproxy on the active server
* Wait a few seconds
* Start haproxy again

[1] When we create enough to not quite overload it, haproxy server goes to 100% on one core but recovers once the instances are (slooowly) created.

The cause of the problem is not clear (e.g. from haproxy and glance logs, system state), and I'm looking for pointers on where to look or what to try next. Can you help?

Thank you,