
[qa][openstackclient] Debugging devstack slowness


On Wed, 2019-08-07 at 10:11 -0500, Ben Nemec wrote:
> 
> On 8/7/19 9:37 AM, Sean Mooney wrote:
> > On Wed, 2019-08-07 at 08:33 -0500, Ben Nemec wrote:
> > > 
> > > On 8/6/19 11:34 AM, Ben Nemec wrote:
> > > > 
> > > > 
> > > > On 8/6/19 10:49 AM, Clark Boylan wrote:
> > > > > On Tue, Aug 6, 2019, at 8:26 AM, Ben Nemec wrote:
> > > > > > Just a reminder that there is also
> > > > > > http://lists.openstack.org/pipermail/openstack-dev/2016-April/092546.html
> > > > > > 
> > > > > > which was intended to address this same issue.
> > > > > > 
> > > > > > I toyed around with it a bit for TripleO installs back then and it did
> > > > > > seem to speed things up, but at the time there was a bug in our client
> > > > > > plugin where it was triggering a prompt for input that was problematic
> > > > > > with the server running in the background. I never really got back to it
> > > > > > once that was fixed. :-/
> > > > > 
> > > > > I'm not tied to any particular implementation. Mostly I wanted to show
> > > > > that we can take this ~5 minute portion of devstack and turn it into a
> > > > > 15 second portion of devstack by improving our use of the service APIs
> > > > > (and possibly even further if we apply it to all of the api
> > > > > interaction). Any idea how difficult it would be to get your client as
> > > > > a service stuff running in devstack again?
> > > > 
> > > > I wish I could take credit, but this is actually Dan Berrange's work. :-)
> > > > 
> > > > > 
> > > > > I do not think we should make a one off change like I've done in my
> > > > > POC. That will just end up being harder to understand and debug in the
> > > > > future since it will be different than all of the other API
> > > > > interaction. I like the idea of a manifest or feeding a longer lived
> > > > > process api update commands as we can then avoid requesting new tokens
> > > > > as well as pkg_resource startup time. Such a system could be used by
> > > > > all of devstack as well (avoiding the "this bit is special" problem).
> > > > > 
> > > > > Is there any interest from the QA team in committing to an approach
> > > > > and working to do a conversion? I don't want to commit any more time
> > > > > to this myself unless there is strong interest in getting changes
> > > > > merged (as I expect it will be a slow process weeding out places where
> > > > > we've made bad assumptions particularly around plugins).
> > > > > 
> > > > > One of the things I found was that using names with osc results in
> > > > > name to id lookups as well. We can avoid these entirely if we remember
> > > > > name to id mappings instead (which my POC does). Any idea if your osc
> > > > > as a service tool does or can do that? Probably have to be more
> > > > > careful for scoping things in a tool like that as it may be reused by
> > > > > people with name collisions across projects/users/groups/domains.
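[Editor's note: the name-to-id caching Clark describes can be sketched roughly as below. This is an illustrative stand-in, not the code from his POC; the lookup callable simulates the API round trip osc performs when you pass a name instead of an ID.]

```python
class NameIdCache:
    """Remember name -> id resolutions so repeated lookups hit the API
    only once per (resource type, name) pair."""

    def __init__(self, lookup):
        self._lookup = lookup   # stand-in for a real API name-to-id call
        self._cache = {}
        self.api_calls = 0      # count of real lookups actually performed

    def resolve(self, resource_type, name):
        key = (resource_type, name)
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._lookup(resource_type, name)
        return self._cache[key]
```

As Clark notes, a shared long-lived tool would need more careful scoping than this: the cache key here ignores domain and project scope, so name collisions across projects/users/groups/domains would need the key extended accordingly.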
> > > > 
> > > > I don't believe this would handle name to id mapping. It's a very thin
> > > > wrapper around the regular client code that just makes it persistent so
> > > > we don't pay the startup costs every call. On the plus side that means
> > > > it basically works like the vanilla client, on the minus side that means
> > > > it may not provide as much improvement as a more targeted solution.
> > > > 
> > > > IIRC it's pretty easy to use, so I can try it out again and make sure it
> > > > still works and still provides a performance benefit.
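[Editor's note: the "client as a service" idea being discussed amounts to paying interpreter startup and plugin scanning once, then dispatching many commands inside the same process. A minimal hypothetical sketch of that dispatch loop, with plain callables standing in for real openstackclient commands:]

```python
import shlex

def make_dispatcher(commands):
    """Build a command dispatcher for a long-lived client process.

    `commands` maps a command name to a callable taking the remaining
    arguments. In a real tool these would be openstackclient commands
    loaded once at startup, so each dispatched line skips the usual
    per-invocation startup cost.
    """
    def dispatch(line):
        parts = shlex.split(line)
        if not parts:
            return ""
        name, args = parts[0], parts[1:]
        return commands[name](args)
    return dispatch
```

A wrapper like this behaves like the vanilla client per command, which matches Ben's description: same semantics, just without the repeated startup overhead.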
> > > 
> > > It still works and it still helps. Using the osc service cut about 3
> > > minutes off my 21 minute devstack run. Subjectively I would say that
> > > most of the time was being spent cloning and installing services and
> > > their deps.
> > > 
> > > I guess the downside is that working around the OSC slowness in CI will
> > > reduce developer motivation to fix the problem, which affects all users
> > > too. Then again, this has been a problem for years and no one has fixed
> > > it, so apparently that isn't a big enough lever to get things moving
> > > anyway. :-/
> > 
> > using osc directly, I don't think the slowness is really perceptible from a
> > human standpoint, but it adds up in a CI run. there are bigger problems behind
> > gate slowness than fixing osc will solve, but every little helps. I do agree,
> > however, that the gate is not a big enough motivator for people to fix osc
> > slowness: we can wait hours in some cases for jobs to start, so 3 minutes is
> > not really a concern from a latency perspective. but if we saved 3 minutes on
> > every run, that might in aggregate reduce the latency problems we have.
> 
> I find the slowness very noticeable in interactive use. It adds 
> something like 2 seconds to a basic call like image list that returns 
> almost instantly in the OSC interactive shell where there is no startup 
> overhead. From my performance days, any latency over 1 second was 
> considered unacceptable for an interactive call. The interactive shell 
> does help with that if I'm doing a bunch of calls in a row though.
well, that was kind of my point: when we write scripts we invoke it over and over again.
if I need to use osc to run lots of commands for some reason, I generally enter the
interactive mode. the interactive mode already masks the pain, so any time it has
bothered me in the past I have just ended up using it instead.

it's been a long time since I looked at this, but I think there were two reasons it is
slow on startup: one is the need to get a token for each invocation, and the other was
related to the way we scan for plugins. I honestly don't know whether either has
improved, but the interactive shell eliminates both as issues.
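[Editor's note: the plugin-scan cost Sean mentions can be observed directly. osc historically discovered its command plugins via pkg_resources entry points; the modern `importlib.metadata` equivalent below performs the analogous scan over the metadata of every installed distribution, which is the work a persistent process or interactive shell only pays once.]

```python
import time
from importlib.metadata import entry_points

# Time a full entry-point enumeration, roughly the discovery work a
# plugin-based CLI does on every cold start.
start = time.perf_counter()
eps = entry_points()   # walks metadata for every installed distribution
elapsed = time.perf_counter() - start
```

On an environment with many packages installed (as on a devstack node), this scan is a measurable slice of each osc invocation.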
> 
> That said, you're right that 3 minutes multiplied by the number of jobs 
> we run per day is significant. Picking 1000 as a round number (and I'm 
> pretty sure we run a _lot_ more than that per day), a 3 minute decrease 
> in runtime per job would save about 50 hours of CI time in total. Small 
> things add up at scale. :-)
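[Editor's note: Ben's back-of-the-envelope figure checks out; using the round numbers from the thread:]

```python
jobs_per_day = 1000          # round number from the thread (likely low)
minutes_saved_per_job = 3    # observed devstack speedup with the osc service
hours_saved = jobs_per_day * minutes_saved_per_job / 60
```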
yep, it definitely does.
> 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > On 7/26/19 6:53 PM, Clark Boylan wrote:
> > > > > > > Today I have been digging into devstack runtime costs to help Donny
> > > > > > > Davis understand why tempest jobs sometimes timeout on the
> > > > > > > FortNebula cloud. One thing I discovered was that the keystone user,
> > > > > > > group, project, role, and domain setup [0] can take many minutes
> > > > > > > [1][2] (in the examples here almost 5).
> > > > > > > 
> > > > > > > I've rewritten create_keystone_accounts to be a python tool [3] and
> > > > > > > get the runtime for that subset of setup from ~100s to ~9s [4].  I
> > > > > > > imagine that if we applied this to the other create_X_accounts
> > > > > > > functions we would see similar results.
> > > > > > > 
> > > > > > > I think this is so much faster because we avoid repeated costs in
> > > > > > > openstack client including: python process startup, pkg_resource
> > > > > > > disk scanning to find entrypoints, and needing to convert names to
> > > > > > > IDs via the API every time osc is run. Given my change shows this
> > > > > > > can be so much quicker is there any interest in modifying devstack
> > > > > > > to be faster here? And if so what do we think an appropriate
> > > > > > > approach would be?
> > > > > > > 
> > > > > > > [0]
> > > > > > > 
> > 
> > https://opendev.org/openstack/devstack/src/commit/6aeaceb0c4ef078d028fb6605cac2a37444097d8/stack.sh#L1146-L1161
> > > > > > >   
> > > > > > > 
> > > > > > > [1]
> > > > > > > 
http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.txt.gz#_2019-07-26_12_31_04_488228
> > > > > > >   
> > > > > > > 
> > > > > > > [2]
> > > > > > > 
http://logs.openstack.org/05/672805/4/check/tempest-full/14f3211/job-output.txt.gz#_2019-07-26_12_35_53_445059
> > > > > > >   
> > > > > > > 
> > > > > > > [3] https://review.opendev.org/#/c/673108/
> > > > > > > [4]
> > > > > > > 
> > 
> > http://logs.openstack.org/08/673108/6/check/devstack-xenial/a4107d0/job-output.txt.gz#_2019-07-26_23_18_37_211013
> > > > > > >   
> > > > > > > 
> > > > > > > 
> > > > > > > Note the jobs compared above all ran on rax-dfw.
> > > > > > > 
> > > > > > > Clark
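[Editor's note: one reason Clark's python rewrite is so much faster is token reuse: a single process can authenticate once and reuse the token for every subsequent API call, where each separate osc invocation must authenticate again. An illustrative sketch of that behavior, not the code from the review; a real session (e.g. keystoneauth's) would also handle token expiry and re-authentication:]

```python
class TokenReusingSession:
    """Authenticate once, then hand back the cached token for every
    subsequent request, modeling one long-lived session versus one
    authentication per CLI invocation."""

    def __init__(self, authenticate):
        self._authenticate = authenticate  # stand-in for a keystone auth call
        self._token = None
        self.auth_requests = 0             # real authentications performed

    def token(self):
        if self._token is None:
            self.auth_requests += 1
            self._token = self._authenticate()
        return self._token
```

Combined with skipping process startup and plugin scanning, this accounts for the ~100s to ~9s drop Clark measured for the keystone account setup.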
> > > > > > > 