[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI



On 5/7/19 2:45 PM, Ben Nemec wrote:
> 
> 
> On 5/4/19 4:14 PM, Damien Ciabrini wrote:
>>
>>
>> On Fri, May 3, 2019 at 7:59 PM Michele Baldessari <michele at acksyn.org> wrote:
>>
>>     On Mon, Apr 22, 2019 at 01:21:03PM -0500, Ben Nemec wrote:
>>      >
>>      >
>>      > On 4/22/19 12:53 PM, Alex Schultz wrote:
>>      > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack at nemebean.com> wrote:
>>      > > >
>>      > > >
>>      > > >
>>      > > > On 4/20/19 1:38 AM, Michele Baldessari wrote:
>>      > > > > On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell at oracle.com wrote:
>>      > > > > >
>>      > > > > > Today I discovered that this problem appears to be caused by eventlet
>>      > > > > > monkey-patching. I've created a bug for it:
>>      > > > > >
>>      > > > > > https://bugs.launchpad.net/nova/+bug/1825584
>>
>>      > > > >
>>      > > > > Hi,
>>      > > > >
>>      > > > > just for completeness we see this very same issue also with
>>      > > > > mistral (actually it was the first service where we noticed the missed
>>      > > > > heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well,
>>      > > > > although I have not personally observed it there yet.
>>      > > >
>>      > > > Is Mistral also mixing eventlet monkeypatching and WSGI?
>>      > > >
>>      > >
>>      > > Looks like there is monkey patching, however we noticed it with the
>>      > > engine/executor. So it's likely not just wsgi.  I think I also saw it
>>      > > in the ironic-conductor, though I'd have to try it out again.  I'll
>>      > > spin up an undercloud today and see if I can get a more complete list
>>      > > of affected services. It was pretty easy to reproduce.
>>      >
>>      > Okay, I asked because if there's no WSGI/Eventlet combination then this may
>>      > be different from the Nova issue that prompted this thread. It sounds like
>>      > that was being caused by a bad interaction between WSGI and some Eventlet
>>      > timers. If there's no WSGI involved then I wouldn't expect that to happen.
>>      >
>>      > I guess we'll see what further investigation turns up, but based on the
>>      > preliminary information there may be two bugs here.
>>
>>     So just to get some closure on this error that we have seen around
>>     mistral executor and tripleo with python3: this was due to the ansible
>>     action that called subprocess which has a different implementation in
>>     python3 and so the monkeypatching needs to be adapted.
>>
>>     Review which fixes it for us is here:
>>     https://review.opendev.org/#/c/656901/
>>
>>
>>     Damien and I think the nova_api/eventlet/mod_wsgi issue has a separate
>>     root cause (although we have not spent all too much time on that one yet)
>>
>>
>> Right, after further investigation, it appears that the problem we saw
>> under mod_wsgi was due to monkey patching, as Iain originally
>> reported. It has nothing to do with our work on healthchecks.
>>
>> It turns out that running the AMQP heartbeat thread under mod_wsgi
>> doesn't work when the threading library is monkey_patched, because the
>> thread waits on a data structure [1] that has been monkey patched [2],
>> which makes it yield its execution instead of sleeping for 15s.
>>
>> Because mod_wsgi stops the execution of its embedded interpreter, the
>> AMQP heartbeat thread can't be resumed until there's a message to be
>> processed in the mod_wsgi queue, which would resume the python
>> interpreter and make eventlet resume the thread.
>>
>> Disabling monkey-patching in nova_api makes the scheduling issue go
>> away.
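
For readers following the explanation above, here is a minimal sketch of the wait pattern being described, using only the standard library. `HeartbeatLoop`, `send_heartbeat` and `demo` are hypothetical names chosen for illustration and are not oslo.messaging code; the short interval stands in for the 15s mentioned above. With the unpatched `threading` module the thread really sleeps inside `Event.wait()` until the timeout expires. Per the analysis above, once the event is monkey patched the wait yields to eventlet instead, so resuming depends on the interpreter running again.

```python
import threading
import time


class HeartbeatLoop:
    """Hypothetical stand-in for an AMQP heartbeat thread (illustration only)."""

    def __init__(self, interval, send_heartbeat):
        self._interval = interval
        self._send_heartbeat = send_heartbeat
        self._exit = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Wake the waiting thread immediately and wait for it to finish.
        self._exit.set()
        self._thread.join()

    def _run(self):
        # Event.wait() returns False when the timeout expires and True once
        # stop() has set the event, so the loop sends one heartbeat per
        # interval until it is asked to exit.
        while not self._exit.wait(timeout=self._interval):
            self._send_heartbeat()


def demo(interval=0.05, duration=0.3):
    """Run the loop briefly and return the timestamps of sent heartbeats."""
    sent = []
    loop = HeartbeatLoop(interval, lambda: sent.append(time.monotonic()))
    loop.start()
    time.sleep(duration)
    loop.stop()
    return sent
```

Calling `demo()` returns timestamps spaced roughly one interval apart, which is the behaviour the heartbeat relies on and which is lost when the wait no longer blocks in the OS.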
> 
> This sounds like the right long-term solution, but it seems unlikely to 
> be backportable to the existing releases. As I understand it some 
> nova-api functionality has an actual dependency on monkey-patching. Is 
> there a workaround? Maybe periodically poking the API to wake up the 
> wsgi interpreter?

I've been pondering things like that ... but if I have multiple WSGI 
processes, can I be sure that an API-poke will hit the one(s) that need it?
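
To put a rough number on that worry, here is a small sketch that assumes each poke lands on one worker process chosen uniformly and independently at random. That is an assumption made for illustration only: real dispatch under mod_wsgi or uWSGI depends on which process accepts the connection and is generally not uniform. `prob_all_workers_hit` is a hypothetical helper, not part of any of the projects discussed here.

```python
from math import comb


def prob_all_workers_hit(workers, pokes):
    """Probability that every worker receives at least one of `pokes` requests.

    Assumes uniform, independent dispatch (an illustrative assumption).
    Uses inclusion-exclusion over the set of workers that are missed.
    """
    return sum(
        (-1) ** missed * comb(workers, missed) * (1 - missed / workers) ** pokes
        for missed in range(workers + 1)
    )
```

Under that assumption the probability only approaches 1 as the number of pokes grows and never reaches it for more than one worker, so a periodic poke cannot strictly guarantee that every process is woken.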

This is a road-block for me upgrading to Stein. I really don't want to 
have to go back to running nova-api standalone, but that's increasingly 
looking like the only "safe" option :/

     ~iain


>> Note: other services like heat-api do not use monkey patching and
>> aren't affected, so this seems to confirm that monkey-patching
>> shouldn't happen in nova_api running under mod_wsgi in the first
>> place.
>>
>> [1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L904
>>
>> [2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils.py#L182
>>
>