[ops][neutron]After an upgrade to openstack queens, Neutron is unable to communicate properly with rabbitmq
On Wed, Jun 5, 2019 at 1:01 PM Jean-Philippe Méthot
<jp.methot at planethoster.info> wrote:
> We had a Pike openstack setup that we updated to Queens earlier this week. It's a 30-compute-node infrastructure with 2 controller nodes and 2 network nodes, using openvswitch for networking. Since we upgraded to queens, neutron-server on the controller nodes has been unable to contact the openvswitch-agents through rabbitmq. The rabbitmq is clustered on both controller nodes and has been giving us the following error when neutron-server connections fail:
> =ERROR REPORT==== 5-Jun-2019::18:50:08 ===
> closing AMQP connection <0.23859.0> (10.30.0.11:53198 -> 10.30.0.11:5672 - neutron-server:1170:ccf11f31-2b3b-414e-ab19-5ee2cf5dd15d):
> missed heartbeats from client, timeout: 60s
> The neutron-server logs show this error:
> 2019-06-05 18:50:33.132 1169 ERROR oslo.messaging._drivers.impl_rabbit [req-17167988-c6f2-475e-8b6a-90b92777e03a - - - - -] [b7684919-c98b-402e-90c3-59a0b5eccd1f] AMQP server on controller1:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
> 2019-06-05 18:50:33.217 1169 ERROR oslo.messaging._drivers.impl_rabbit [-] [bd6900e0-ab7b-4139-920c-a456d7df023b] AMQP server on controller1:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <RecoverableConnectionError: unknown error>
> The relevant service version numbers are as follows:
> Rabbitmq does not show any alert. It also has plenty of memory and a high enough file limit. The login user and credentials are fine as they are used in other openstack services which can contact rabbitmq without issues.
> I've tried optimizing rabbitmq, upgrading, downgrading, increasing timeouts in neutron services, etc., to no avail. I find myself at a loss and would appreciate it if anyone has any idea as to where to go from here.
We had a very similar issue after upgrading to Neutron Queens. In
fact, all Neutron agents were "down" according to status API and
messages weren't getting through. IIRC, this only happened in regions
which had more load than the others.
We applied a number of fixes, which I suspect are only band-aids.
Here are the changes we made:
* Split neutron-api from neutron-server. Create a whole new controller
running neutron-api with mod_wsgi.
* Increase [database]/max_overflow = 200
* Disable RabbitMQ heartbeat in oslo.messaging:
[oslo_messaging_rabbit]/heartbeat_timeout_threshold = 0
* Increase [agent]/report_interval = 120
* Increase [DEFAULT]/agent_down_time = 600
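
For reference, the options above can be written as a config fragment and sanity-checked with a short script. This is only a sketch: it assumes everything sits in one neutron.conf-style file, whereas in a real deployment the [agent] section is read on the agent hosts and the rest on the neutron-server hosts. The helper name check_agent_timing is made up for illustration. The check itself follows the help text of agent_down_time, which says it should be at least twice report_interval.

```python
import configparser

# The settings from the list above as an INI fragment.
# Assumption: shown as a single file for brevity; the [agent] section
# normally lives in the agents' config, the rest on neutron-server hosts.
FRAGMENT = """
[DEFAULT]
agent_down_time = 600

[database]
max_overflow = 200

[oslo_messaging_rabbit]
# 0 disables the oslo.messaging RabbitMQ heartbeat
heartbeat_timeout_threshold = 0

[agent]
report_interval = 120
"""


def check_agent_timing(text):
    """Return (report_interval, agent_down_time, ok) for an INI fragment.

    ok is True when agent_down_time is at least twice report_interval.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    report_interval = parser.getfloat("agent", "report_interval")
    agent_down_time = parser.getint("DEFAULT", "agent_down_time")
    ok = agent_down_time >= 2 * report_interval
    return report_interval, agent_down_time, ok


if __name__ == "__main__":
    print(check_agent_timing(FRAGMENT))
```

With the values above, agents report every 120 seconds and are only marked down after 600 seconds without a report, so an agent can miss several reports before the status API shows it as down.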
We also have these sysctl settings because a firewall was dropping sessions,
but they have been on the servers forever:
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5
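
To put those three values in perspective, here is a rough back-of-the-envelope sketch of how long it takes the kernel to give up on a dead peer. The function name is made up for illustration, and the result only applies to sockets that actually enable SO_KEEPALIVE.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Approximate idle time before the kernel drops a dead connection.

    The first probe is sent after time_s seconds of idleness, then one
    probe every intvl_s seconds; the connection is dropped once `probes`
    probes have gone unanswered.
    """
    return time_s + intvl_s * probes


if __name__ == "__main__":
    # tcp_keepalive_time=30, tcp_keepalive_intvl=1, tcp_keepalive_probes=5
    print(keepalive_detection_seconds(30, 1, 5))  # 35
```

So an idle connection sees a probe after 30 seconds, which should keep a stateful firewall from expiring the session, and a dead peer is detected after roughly 35 seconds.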
We never figured out why a service that was working before the upgrade
no longer is.
This is kind of frustrating, as it caused us all sorts of intermittent
issues and stress during our upgrade.
Hope this helps.