[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI


On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz at redhat.com> wrote:

> On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack at nemebean.com> wrote:
> >
> >
> >
> > On 4/22/19 12:53 PM, Alex Schultz wrote:
> > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack at nemebean.com> wrote:
> > >>
> > >>
> > >>
> > >> On 4/20/19 1:38 AM, Michele Baldessari wrote:
> > >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell at oracle.com wrote:
> > >>>>
> > >>>> Today I discovered that this problem appears to be caused by eventlet
> > >>>> monkey-patching. I've created a bug for it:
> > >>>>
> > >>>> https://bugs.launchpad.net/nova/+bug/1825584
> > >>>
> > >>> Hi,
> > >>>
> > >>> just for completeness we see this very same issue also with
> > >>> mistral (actually it was the first service where we noticed the missed
> > >>> heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well,
> > >>> although I have not personally observed it there yet.
> > >>
> > >> Is Mistral also mixing eventlet monkeypatching and WSGI?
> > >>
> > >
> > > Looks like there is monkey patching, however we noticed it with the
> > > engine/executor. So it's likely not just wsgi.  I think I also saw it
> > > in the ironic-conductor, though I'd have to try it out again.  I'll
> > > spin up an undercloud today and see if I can get a more complete list
> > > of affected services. It was pretty easy to reproduce.
> >
> > Okay, I asked because if there's no WSGI/Eventlet combination then this
> > may be different from the Nova issue that prompted this thread. It
> > sounds like that was being caused by a bad interaction between WSGI and
> > some Eventlet timers. If there's no WSGI involved then I wouldn't expect
> > that to happen.
> >
> > I guess we'll see what further investigation turns up, but based on the
> > preliminary information there may be two bugs here.
> >
>
> So I wasn't able to reproduce the ironic issues yet. But it's the
> mistral executor and nova-api which exhibit the issue on the
> undercloud.
>
> mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR
> oslo.messaging._drivers.impl_rabbit [-]
> [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on
> undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104]
> Connection reset by peer. Trying again in 1 seconds.:
> ConnectionResetError: [Errno 104] Connection reset by peer
>
>
> nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR
> oslo.messaging._drivers.impl_rabbit
> [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe
> 9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe -
> default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on
> undercloud-0.ctlplane.localdomain:5672 is unreachable: Server
> unexpectedly closed connection. Trying again in 1 seconds.: OSError:
> Server unexpectedly closed connection
>
> The errors being thrown are different, so perhaps these are two different problems.
>
>
Correct, I think our original issue with erratic AMQP heartbeats and mod_wsgi
was due to a change in how we run healthchecks in Stein in TripleO-deployed
environments, so it seems to be different from what Iain originally experienced.

For the record, up to Rocky, we used to run healthcheck scripts
every 30 seconds, which guaranteed that eventlet would wake up and
send an AMQP heartbeat packet if a service had had no AMQP traffic in the
last 15s. It also guaranteed that any incoming AMQP heartbeat packet from
rabbitmq would be processed within at most 30s.

In Stein, our healthchecks are now triggered via systemd timers, and the
current timer interval is too long to guarantee that mod_wsgi will always
wake up in time to send/receive AMQP heartbeats to/from rabbitmq
when there's no traffic.
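
To make the timing argument concrete, here is a rough sketch (a toy model,
not TripleO or oslo.messaging code). It assumes an otherwise idle service
only gets a chance to send a heartbeat when something external, such as a
healthcheck, wakes it up. The 30s wake-up interval and 15s idle threshold
come from the description above; the 60s broker timeout and the 90s timer
interval are assumptions made purely for illustration.

```python
import math


def worst_case_heartbeat_gap(wakeup_interval, idle_threshold):
    """Longest possible gap (seconds) between two outgoing heartbeats.

    Toy model: the idle process only runs when woken every
    `wakeup_interval` seconds, and on each wake-up it sends a heartbeat
    if at least `idle_threshold` seconds passed since the last traffic.
    """
    # Number of wake-ups needed before the idle threshold is reached.
    wakeups_needed = max(1, math.ceil(idle_threshold / wakeup_interval))
    return wakeups_needed * wakeup_interval


def heartbeats_safe(wakeup_interval, idle_threshold=15.0, broker_timeout=60.0):
    """True if the worst-case gap stays below the broker timeout.

    broker_timeout=60.0 is an assumed value, not taken from this thread.
    """
    return worst_case_heartbeat_gap(wakeup_interval, idle_threshold) < broker_timeout


# Rocky-style healthchecks: woken every 30 seconds.
print(heartbeats_safe(30))
# Hypothetical longer timer interval of 90 seconds.
print(heartbeats_safe(90))
```

Under these assumptions the 30s wake-up keeps the gap between heartbeats
well below the timeout, while a wake-up interval longer than the timeout
cannot, which is the kind of situation described above.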

The fix is being tracked in https://bugs.launchpad.net/tripleo/+bug/1826281


> Thanks,
> -Alex
>
>

-- 
Damien