[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[qa][openstackclient] Debugging devstack slowness

On 8/7/19 9:37 AM, Sean Mooney wrote:
> On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
>> On 8/6/19 11:34 AM, Ben Nemec wrote:
>>> On 8/6/19 10:49 AM, Clark Boylan wrote:
>>>> On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
>>>>> Just a reminder that there is also
>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
>>>>> which was intended to address this same issue.
>>>>> I toyed around with it a bit for TripleO installs back then and it did
>>>>> seem to speed things up, but at the time there was a bug in our client
>>>>> plugin where it was triggering a prompt for input that was problematic
>>>>> with the server running in the background. I never really got back to it
>>>>> once that was fixed. :-/
>>>> I'm not tied to any particular implementation. Mostly I wanted to show
>>>> that we can take this ~5 minute portion of devstack and turn it into a
>>>> 15 second portion of devstack by improving our use of the service APIs
>>>> (and possibly even further if we apply it to all of the api
>>>> interaction). Any idea how difficult it would be to get your client as
>>>> a service stuff running in devstack again?
>>> I wish I could take credit, but this is actually Dan Berrange's work. :-)
>>>> I do not think we should make a one off change like I've done in my
>>>> POC. That will just end up being harder to understand and debug in the
>>>> future since it will be different than all of the other API
>>>> interaction. I like the idea of a manifest or feeding a longer lived
>>>> process api update commands as we can then avoid requesting new tokens
>>>> as well as pkg_resource startup time. Such a system could be used by
>>>> all of devstack as well (avoiding the "this bit is special" problem).
>>>> Is there any interest from the QA team in committing to an approach
>>>> and working to do a conversion? I don't want to commit any more time
>>>> to this myself unless there is strong interest in getting changes
>>>> merged (as I expect it will be a slow process weeding out places where
>>>> we've made bad assumptions particularly around plugins).
>>>> One of the things I found was that using names with osc results in
>>>> name to id lookups as well. We can avoid these entirely if we remember
>>>> name to id mappings instead (which my POC does). Any idea if your osc
>>>> as a service tool does or can do that? Probably have to be more
>>>> careful for scoping things in a tool like that as it may be reused by
>>>> people with name collisions across projects/users/groups/domains.
>>> I don't believe this would handle name to id mapping. It's a very thin
>>> wrapper around the regular client code that just makes it persistent so
>>> we don't pay the startup costs every call. On the plus side that means
>>> it basically works like the vanilla client, on the minus side that means
>>> it may not provide as much improvement as a more targeted solution.
>>> IIRC it's pretty easy to use, so I can try it out again and make sure it
>>> still works and still provides a performance benefit.
>> It still works and it still helps. Using the osc service cut about 3
>> minutes off my 21 minute devstack run. Subjectively I would say that
>> most of the time was being spent cloning and installing services and
>> their deps.
>> I guess the downside is that working around the OSC slowness in CI will
>> reduce developer motivation to fix the problem, which affects all users
>> too. Then again, this has been a problem for years and no one has fixed
>> it, so apparently that isn't a big enough lever to get things moving
>> anyway. :-/
> using osc diretly i dont think the slowness is really perceptable from a human
> stand point but it adds up in a ci run. there are large problems to kill with gate
> slowness then fixing osc will solve be every little helps. i do agree however
> that the gage is not a big enough motivater for people to fix osc slowness as
> we can wait hours in some cases for jobs to start so 3 minutes is not really a consern
> form a latency perspective but if we saved 3 mins on every run that might
> in aggreaget reduce the latency problems we have.

I find the slowness very noticeable in interactive use. It adds 
something like 2 seconds to a basic call like image list that returns 
almost instantly in the OSC interactive shell where there is no startup 
overhead. From my performance days, any latency over 1 second was 
considered unacceptable for an interactive call. The interactive shell 
does help with that if I'm doing a bunch of calls in a row though.

That said, you're right that 3 minutes multiplied by the number of jobs 
we run per day is significant. Picking 1000 as a round number (and I'm 
pretty sure we run a _lot_ more than that per day), a 3 minute decrease 
in runtime per job would save about 50 hours of CI time in total. Small 
things add up at scale. :-)

>>>>> On 7/26/19 6:53 PM, Clark Boylan wrote:
>>>>>> Today I have been digging into devstack runtime costs to help Donny
>>>>>> Davis understand why tempest jobs sometimes timeout on the
>>>>>> FortNebula cloud. One thing I discovered was that the keystone user,
>>>>>> group, project, role, and domain setup [0] can take many minutes
>>>>>> [1][2] (in the examples here almost 5).
>>>>>> I've rewritten create_keystone_accounts to be a python tool [3] and
>>>>>> get the runtime for that subset of setup from ~100s to ~9s [4].  I
>>>>>> imagine that if we applied this to the other create_X_accounts
>>>>>> functions we would see similar results.
>>>>>> I think this is so much faster because we avoid repeated costs in
>>>>>> openstack client including: python process startup, pkg_resource
>>>>>> disk scanning to find entrypoints, and needing to convert names to
>>>>>> IDs via the API every time osc is run. Given my change shows this
>>>>>> can be so much quicker is there any interest in modifying devstack
>>>>>> to be faster here? And if so what do we think an appropriate
>>>>>> approach would be?
>>>>>> [0]
> https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605cac2a37444097d8/stack.sh#L1146-L1161
>>>>>> [1]
>>>>>> http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.txt.gz#_2019-07-26_12_31_04_488228
>>>>>> [2]
>>>>>> http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.txt.gz#_2019-07-26_12_35_53_445059
>>>>>> [3] https://review.opendev.org/#/c/673108/
>>>>>> [4]
> http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-output.txt.gz#_2019-07-26_23_18_37_211013
>>>>>> Note the jobs compared above all ran on rax-dfw.
>>>>>> Clark