[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI



On 5/4/19 4:14 PM, Damien Ciabrini wrote:
> 
> 
> On Fri, May 3, 2019 at 7:59 PM Michele Baldessari <michele at acksyn.org> wrote:
> 
>     On Mon, Apr 22, 2019 at 01:21:03PM -0500, Ben Nemec wrote:
>      >
>      >
>      > On 4/22/19 12:53 PM, Alex Schultz wrote:
>      > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack at nemebean.com> wrote:
>      > > >
>      > > >
>      > > >
>      > > > On 4/20/19 1:38 AM, Michele Baldessari wrote:
>      > > > > On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell at oracle.com wrote:
>      > > > > >
>      > > > > > Today I discovered that this problem appears to be caused by eventlet
>      > > > > > monkey-patching. I've created a bug for it:
>      > > > > >
>      > > > > > https://bugs.launchpad.net/nova/+bug/1825584
>      > > > >
>      > > > > Hi,
>      > > > >
>      > > > > just for completeness we see this very same issue also with
>      > > > > mistral (actually it was the first service where we noticed the missed
>      > > > > heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well,
>      > > > > although I have not personally observed it there yet.
>      > > >
>      > > > Is Mistral also mixing eventlet monkeypatching and WSGI?
>      > > >
>      > >
>      > > Looks like there is monkey patching, however we noticed it with the
>      > > engine/executor. So it's likely not just wsgi.  I think I also saw it
>      > > in the ironic-conductor, though I'd have to try it out again.  I'll
>      > > spin up an undercloud today and see if I can get a more complete list
>      > > of affected services. It was pretty easy to reproduce.
>      >
>      > Okay, I asked because if there's no WSGI/Eventlet combination then this may
>      > be different from the Nova issue that prompted this thread. It sounds like
>      > that was being caused by a bad interaction between WSGI and some Eventlet
>      > timers. If there's no WSGI involved then I wouldn't expect that to happen.
>      >
>      > I guess we'll see what further investigation turns up, but based on the
>      > preliminary information there may be two bugs here.
> 
>     So just to get some closure on this error that we have seen around
>     mistral executor and tripleo with python3: this was due to the ansible
>     action that called subprocess which has a different implementation in
>     python3 and so the monkeypatching needs to be adapted.
> 
>     Review which fixes it for us is here:
>     https://review.opendev.org/#/c/656901/
> 
>     Damien and I think the nova_api/eventlet/mod_wsgi issue has a separate
>     root cause (although we have not spent much time on that one yet).
> 
> 
> Right, after further investigation, it appears that the problem we saw
> under mod_wsgi was due to monkey patching, as Iain originally
> reported. It has nothing to do with our work on healthchecks.
> 
> It turns out that running the AMQP heartbeat thread under mod_wsgi
> doesn't work when the threading library is monkey-patched, because the
> thread waits on a data structure [1] that has been monkey-patched [2],
> which makes it yield its execution instead of sleeping for 15s.
> 
> Because mod_wsgi stops the execution of its embedded interpreter, the
> AMQP heartbeat thread can't be resumed until there's a message to be
> processed in the mod_wsgi queue, which would resume the python
> interpreter and make eventlet resume the thread.
> 
> Disabling monkey-patching in nova_api makes the scheduling issue go
> away.
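
For anyone following along, here is a rough sketch of the behaviour the heartbeat thread relies on when nothing is monkey-patched. It is not the oslo.messaging code; `run_heartbeat`, `demo` and the short intervals are made up for illustration, and it only uses the standard library. With the unpatched `threading.Event`, `wait(timeout=...)` blocks in the OS and returns on its own when the timeout expires, so the beats keep coming even though the main thread is idle. As described above, the eventlet-patched equivalent yields to the hub instead, and under mod_wsgi nothing runs the hub until the next request arrives.

```python
import threading
import time


def run_heartbeat(interval, stop_event, beats):
    """Record a timestamp every `interval` seconds until `stop_event` is set."""
    while not stop_event.is_set():
        # Stand-in for sending an AMQP heartbeat (hypothetical, not the real driver).
        beats.append(time.monotonic())
        # Unpatched threading: blocks for up to `interval` seconds and wakes by
        # itself when the timeout expires, whatever the other threads are doing.
        stop_event.wait(timeout=interval)


def demo(interval=0.05, duration=0.3):
    """Run the heartbeat loop in a thread while the main thread sits idle."""
    stop_event = threading.Event()
    beats = []
    worker = threading.Thread(
        target=run_heartbeat, args=(interval, stop_event, beats), daemon=True
    )
    worker.start()
    time.sleep(duration)  # idle main thread, like a WSGI worker with no requests
    stop_event.set()
    worker.join()
    return beats


if __name__ == "__main__":
    print(len(demo()), "heartbeats recorded while the main thread was idle")
```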

This sounds like the right long-term solution, but it seems unlikely to 
be backportable to the existing releases. As I understand it some 
nova-api functionality has an actual dependency on monkey-patching. Is 
there a workaround? Maybe periodically poking the API to wake up the 
wsgi interpreter?
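
To make the "poking" idea concrete, something like the sketch below is what I have in mind. It is untested against a real nova-api deployment, so treat it as an assumption that this is enough to keep the heartbeat thread scheduled. The `poke` and `poke_forever` helpers, the URL you pass in and the 10 second interval are all placeholders. Also note that with several mod_wsgi worker processes a single request only reaches one of them, so this may not wake every interpreter.

```python
import time
import urllib.error
import urllib.request


def poke(url, timeout=5.0):
    """Send one GET to `url`; return the HTTP status, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Any HTTP response, even an error, means a worker handled the request.
        return exc.code
    except OSError:
        # Connection refused, timeout, DNS failure and similar.
        return None


def poke_forever(url, interval=10.0):
    """Poke `url` every `interval` seconds until the process is stopped."""
    while True:
        poke(url)
        time.sleep(interval)
```

Running `poke_forever()` with the API's base URL from cron or a small systemd unit would be the obvious way to deploy it, if it turns out to help at all.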

> 
> Note: other services like heat-api do not use monkey patching and
> aren't affected, so this seems to confirm that monkey-patching
> shouldn't happen in nova_api running under mod_wsgi in the first
> place.
> 
> [1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
> [2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182