osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI



On 4/25/19 11:19 AM, Ken Giusti wrote:
> 
> 
> On Thu, Apr 25, 2019 at 11:19 AM <iain.macdonnell at oracle.com 
> <mailto:iain.macdonnell at oracle.com>> wrote:
> 
> 
> 
>     On 4/25/19 1:05 AM, Damien Ciabrini wrote:
>      >
>      >
>      > On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz at redhat.com
>     <mailto:aschultz at redhat.com>
>      > <mailto:aschultz at redhat.com <mailto:aschultz at redhat.com>>> wrote:
>      >
>      >     On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec
>     <openstack at nemebean.com <mailto:openstack at nemebean.com>
>      >     <mailto:openstack at nemebean.com
>     <mailto:openstack at nemebean.com>>> wrote:
>      >      >
>      >      >
>      >      >
>      >      > On 4/22/19 12:53 PM, Alex Schultz wrote:
>      >      > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec
>      >     <openstack at nemebean.com <mailto:openstack at nemebean.com>
>     <mailto:openstack at nemebean.com <mailto:openstack at nemebean.com>>> wrote:
>      >      > >>
>      >      > >>
>      >      > >>
>      >      > >> On 4/20/19 1:38 AM, Michele Baldessari wrote:
>      >      > >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700,
>      > iain.macdonnell at oracle.com <mailto:iain.macdonnell at oracle.com>
>     <mailto:iain.macdonnell at oracle.com
>     <mailto:iain.macdonnell at oracle.com>> wrote:
>      >      > >>>>
>      >      > >>>> Today I discovered that this problem appears to be
>     caused by
>      >     eventlet
>      >      > >>>> monkey-patching. I've created a bug for it:
>      >      > >>>>
>      >      > >>>> https://bugs.launchpad.net/nova/+bug/1825584
>     <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_nova_-2Bbug_1825584&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=yuWVlqvT-TMIF0OQxtByfrZjswKo56fkyrbKrvy9oHs&s=mJYdHr4mPQVjex9HrhekR_IvRsktOkvDNGQBnuB_ehI&e=>
>      >   
>       <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_nova_-2Bbug_1825584&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=fpephoqJ2hzc-jFPr6Rtupc2U02HBjBRd-_Lq66zQBk&s=Vb1Yw7ZkrglH0AMBZSHNfuilS5gBwh9yF2o2trkXYyM&e=>
>      >      > >>>
>      >      > >>> Hi,
>      >      > >>>
>      >      > >>> just for completeness we see this very same issue also
>     with
>      >      > >>> mistral (actually it was the first service where we
>     noticed
>      >     the missed
>      >      > >>> heartbeats). iirc Alex Schultz mentioned seeing it in
>     ironic
>      >     as well,
>      >      > >>> although I have not personally observed it there yet.
>      >      > >>
>      >      > >> Is Mistral also mixing eventlet monkeypatching and WSGI?
>      >      > >>
>      >      > >
>      >      > > Looks like there is monkey patching, however we noticed
>     it with the
>      >      > > engine/executor. So it's likely not just wsgi.  I think
>     I also
>      >     saw it
>      >      > > in the ironic-conductor, though I'd have to try it out
>     again.  I'll
>      >      > > spin up an undercloud today and see if I can get a more
>      >     complete list
>      >      > > of affected services. It was pretty easy to reproduce.
>      >      >
>      >      > Okay, I asked because if there's no WSGI/Eventlet combination
>      >     then this
>      >      > may be different from the Nova issue that prompted this
>     thread. It
>      >      > sounds like that was being caused by a bad interaction between
>      >     WSGI and
>      >      > some Eventlet timers. If there's no WSGI involved then I
>     wouldn't
>      >     expect
>      >      > that to happen.
>      >      >
>      >      > I guess we'll see what further investigation turns up, but
>     based
>      >     on the
>      >      > preliminary information there may be two bugs here.
>      >      >
>      >
>      >     So I wasn't able to reproduce the ironic issues yet. But it's the
>      >     mistral executor and nova-api which exhibit the issue on the
>      >     undercloud.
>      >
>      >     mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR
>      >     oslo.messaging._drivers.impl_rabbit [-]
>      >     [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on
>      >     undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno
>     104]
>      >     Connection reset by peer. Trying again in 1 seconds.:
>      >     ConnectionResetError: [Errno 104] Connection reset by peer
>      >
>      >
>      >     nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR
>      >     oslo.messaging._drivers.impl_rabbit
>      >     [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe
>      >     9ac89090d2d24949b9a1e01b1afb14cc
>     7becac88cbae4b3b962ecccaf536effe -
>      >     default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP
>     server on
>      >     undercloud-0.ctlplane.localdomain:5672 is unreachable: Server
>      >     unexpectedly closed connection. Trying again in 1 seconds.:
>     OSError:
>      >     Server unexpectedly closed connection
>      >
>      >     The errors being thrown are different perhaps it is two different
>      >     problems.
>      >
>      >
>      > Correct, I think our original issue with erratic AMQP hearbeats
>     and mod_wsgi
>      > was due to a change in how we run healthcheck in Stein in
>     TripleO-deployed
>      > environments, so different to what Iain originally experienced it
>     seems...
>      >
>      > For the record, up to Rocky, we used to run healthcheck scripts
>      > every 30 seconds, which guarantees that eventlet will wake up and
>      > send an AMQP heartbeat packet if a service had no AMQP traffic in
>     the last
>      > 15s. It also guarantees that any incoming AMQP heartbeat packet from
>      > rabbitmq will be processed in at most 30s.
>      >
>      > In Stein, our healthchecks are now triggered via systemd timers,
>     and the
>      > current time setting is too high to guarantee that mod_wsgi will
>     always
>      > wake up on time to send/receive AMQP heartbeats to/from rabbitmq
>      > when there's no traffic.
> 
>     If I'm reading this right, it sounds like the periodic healthcheck is
>     working around the underlying issue of the heartbeats not happening by
>     themselves (due to eventlet monkey-patching somehow interfering with
>     threading). The whole point of the heartbeats is to continually
>     maintain
>     the connection while there's no real traffic.
> 
>           ~iain
> 
> 
> 
> I would agree with Iain here - it sounds as if something in mod_wsgi (or 
> elsewhere)  is blocking in the os outside of the purview of eventlet.
> 
> FWIW here's the oslo.messaging thread responsible for sending heartbeats 
> from the RPC client:
> 
> https://opendev.org/openstack/oslo.messaging/src/branch/stable/stein/oslo_messaging/_drivers/impl_rabbit.py#L897 
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__opendev.org_openstack_oslo.messaging_src_branch_stable_stein_oslo-5Fmessaging_-5Fdrivers_impl-5Frabbit.py-23L897&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=yuWVlqvT-TMIF0OQxtByfrZjswKo56fkyrbKrvy9oHs&s=dIFtanzYyLQXGnr8R9RMZLiIq-HaEn0BokQxsytq7ok&e=>
> 
> Can you verify that the event.wait() call at the bottom of the loop is 
> not waking up as per the passed in _heartbeat_wait_timeout?
> thanks

I added extra debug logging before and after the call to wait(). The 
"before" line gets logged, but the "after" one does not (when 
monkey-patching is in place).

Here's what it looks like with monkey-patching disabled:

http://paste.openstack.org/show/kUMhrSN8aiyleDgQ99EC/

and with it enabled:

http://paste.openstack.org/show/DlwG9iBjLvR7lX6NHlie/

In both cases, I had started the WSGI app, made an API call that 
required use of RPC, then waited for several minutes.

     ~iain