[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI


On Tue, 7 May 2019 15:22:36 -0700, Iain Macdonnell 
<iain.macdonnell at oracle.com> wrote:
> 
> 
> On 5/7/19 2:45 PM, Ben Nemec wrote:
>>
>>
>> On 5/4/19 4:14 PM, Damien Ciabrini wrote:
>>>
>>>
>>> On Fri, May 3, 2019 at 7:59 PM Michele Baldessari <michele at acksyn.org> wrote:
>>>
>>>      On Mon, Apr 22, 2019 at 01:21:03PM -0500, Ben Nemec wrote:
>>>       >
>>>       >
>>>       > On 4/22/19 12:53 PM, Alex Schultz wrote:
>>>       > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack at nemebean.com> wrote:
>>>       > > >
>>>       > > >
>>>       > > >
>>>       > > > On 4/20/19 1:38 AM, Michele Baldessari wrote:
>>>       > > > > On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell at oracle.com wrote:
>>>       > > > > >
>>>       > > > > > Today I discovered that this problem appears to be caused by eventlet
>>>       > > > > > monkey-patching. I've created a bug for it:
>>>       > > > > >
>>>       > > > > >
>>> https://bugs.launchpad.net/nova/+bug/1825584
>>>
>>>       > > > >
>>>       > > > > Hi,
>>>       > > > >
>>>       > > > > just for completeness we see this very same issue also with
>>>       > > > > mistral (actually it was the first service where we noticed the missed
>>>       > > > > heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well,
>>>       > > > > although I have not personally observed it there yet.
>>>       > > >
>>>       > > > Is Mistral also mixing eventlet monkeypatching and WSGI?
>>>       > > >
>>>       > >
>>>       > > Looks like there is monkey patching, however we noticed it with the
>>>       > > engine/executor. So it's likely not just wsgi.  I think I also saw it
>>>       > > in the ironic-conductor, though I'd have to try it out again.  I'll
>>>       > > spin up an undercloud today and see if I can get a more complete list
>>>       > > of affected services. It was pretty easy to reproduce.
>>>       >
>>>       > Okay, I asked because if there's no WSGI/Eventlet combination then this may
>>>       > be different from the Nova issue that prompted this thread. It sounds like
>>>       > that was being caused by a bad interaction between WSGI and some Eventlet
>>>       > timers. If there's no WSGI involved then I wouldn't expect that to happen.
>>>       >
>>>       > I guess we'll see what further investigation turns up, but based on the
>>>       > preliminary information there may be two bugs here.
>>>
>>>      So just to get some closure on this error that we have seen around
>>>      mistral executor and tripleo with python3: this was due to the ansible
>>>      action that called subprocess which has a different implementation in
>>>      python3 and so the monkeypatching needs to be adapted.
>>>
>>>      Review which fixes it for us is here:
>>>      
>>> https://review.opendev.org/#/c/656901/
>>>
>>>
>>>      Damien and I think the nova_api/eventlet/mod_wsgi issue has a separate
>>>      root cause (although we have not spent all too much time on that one yet).
>>>
>>>
>>> Right, after further investigation, it appears that the problem we saw
>>> under mod_wsgi was due to monkey patching, as Iain originally
>>> reported. It has nothing to do with our work on healthchecks.
>>>
>>> It turns out that running the AMQP heartbeat thread under mod_wsgi
>>> doesn't work when the threading library is monkey_patched, because the
>>> thread waits on a data structure [1] that has been monkey patched [2],
>>> which makes it yield its execution instead of sleeping for 15s.
>>>
>>> Because mod_wsgi stops the execution of its embedded interpreter, the
>>> AMQP heartbeat thread can't be resumed until there's a message to be
>>> processed in the mod_wsgi queue, which would resume the python
>>> interpreter and make eventlet resume the thread.
>>>
>>> Disabling monkey-patching in nova_api makes the scheduling issue go
>>> away.
>>
>> This sounds like the right long-term solution, but it seems unlikely to
>> be backportable to the existing releases. As I understand it some
>> nova-api functionality has an actual dependency on monkey-patching. Is
>> there a workaround? Maybe periodically poking the API to wake up the
>> wsgi interpreter?
> 
> I've been pondering things like that ... but if I have multiple WSGI
> processes, can I be sure that an API-poke will hit the one(s) that need it?
> 
> This is a road-block for me upgrading to Stein. I really don't want to
> have to go back to running nova-api standalone, but that's increasingly
> looking like the only "safe" option :/

FWIW, I have a patch series that aims to re-eliminate the eventlet 
dependency in nova-api:

https://review.opendev.org/657750 (top patch)

if you might be able to give it a try. If it helps, then maybe we could 
backport to Stein if folks are in support.

-melanie

> 
> 
>>> Note: other services like heat-api do not use monkey patching and
>>> aren't affected, so this seems to confirm that monkey-patching
>>> shouldn't happen in nova_api running under mod_wsgi in the first
>>> place.
>>>
>>> [1]
>>> https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
>>>
>>> [2]
>>> https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182
>>>
>>
>
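
For anyone following along, the scheduling problem Damien describes in the
quoted text can be illustrated with a small toy model. This is only a sketch
under stated assumptions: it does not use eventlet, oslo.messaging, or
mod_wsgi, and the names ToyHub, green_heartbeats, and native_heartbeats are
made up for illustration. It uses a simulated clock and assumes that a
monkey-patched wait turns into a timer that only fires when the interpreter
is actually running (here, when a request arrives), whereas an unpatched
thread wakes up every 15 seconds on its own.

```python
import heapq

HEARTBEAT_INTERVAL = 15  # seconds; the wait interval mentioned in the thread


class ToyHub:
    """Hypothetical stand-in for a cooperative (green thread) scheduler.

    Timers fire only when run_pending() is called. In this model that
    happens only while the WSGI host is executing Python for a request.
    """

    def __init__(self):
        self._timers = []  # heap of (due_time, sequence, callback)
        self._seq = 0

    def call_later(self, delay, callback, now):
        heapq.heappush(self._timers, (now + delay, self._seq, callback))
        self._seq += 1

    def run_pending(self, now):
        # Run every timer that is already due at simulated time `now`.
        while self._timers and self._timers[0][0] <= now:
            _, _, callback = heapq.heappop(self._timers)
            callback(now)


def green_heartbeats(request_times, interval=HEARTBEAT_INTERVAL):
    """Times at which a 'green' heartbeat is sent.

    The hub is driven only at the given request arrival times, modelling
    an embedded interpreter that is suspended between requests.
    """
    hub = ToyHub()
    sent = []

    def beat(now):
        sent.append(now)
        hub.call_later(interval, beat, now)  # schedule the next heartbeat

    hub.call_later(interval, beat, now=0)
    for t in request_times:
        hub.run_pending(t)
    return sent


def native_heartbeats(until, interval=HEARTBEAT_INTERVAL):
    """Times at which an unpatched OS thread would send heartbeats."""
    return list(range(interval, until + 1, interval))


if __name__ == "__main__":
    # One request at t=0, then the API sits idle until t=100.
    print("green :", green_heartbeats([0, 100]))
    print("native:", native_heartbeats(100))
```

In this model the green heartbeat is sent only once, at t=100, when the next
request wakes the interpreter, while the native thread sends one every 15
seconds throughout the idle period. Whether the broker drops the connection
in the meantime depends on its heartbeat timeout, which this sketch does not
model.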