Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?


Hi Rohit,

I checked that. Thanks for the details!

-Suresh

On Wed, May 16, 2018 at 4:55 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
wrote:

> Hi Suresh,
>
>
> As explained earlier, I advised looking at the code on the PR; perhaps you
> did not get time, so have a look here:
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L488
>
>
> The reconnect() method historically sets the link to null. Therefore, any
> answers from pending tasks end up failing here:
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L868
>
> and,
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L893
>
>
> Do note that reconnect() only cancels watch tasks but does not
> cancel/shutdown any running task. Also, in case of network error, the mgmt
> server will fail at the thread/context where it has done an agent.send() and
> is expecting an answer.
>
>
> You can also perform a small test by adding a while loop or a sleep around
> this code to see how getLink().send() behaves when the agent does reconnect.
> When it does not reconnect, i.e. the agent is blocked waiting for pending
> tasks to complete, such tasks always fail.
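
A minimal sketch of the failure mode described above, assuming a much simplified agent. `Link` and `AgentSketch` are hypothetical stand-ins and not the real CloudStack classes; the real code path raises an exception, while this sketch only reports the lost answer through a boolean.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the agent <-> mgmt server connection.
interface Link {
    void send(String answer);
}

// Hypothetical, simplified agent; NOT the real com.cloud.agent.Agent.
class AgentSketch {
    // Current connection to the mgmt server, or null after reconnect() dropped it.
    private final AtomicReference<Link> link = new AtomicReference<>();

    void connect(Link newLink) {
        link.set(newLink);
    }

    // Mirrors the behaviour described in the thread: the old link is set to null.
    void reconnect() {
        link.set(null);
    }

    // Called by a task once it has finished. Returns true only if the
    // answer could be handed to a valid link.
    boolean sendAnswer(String answer) {
        Link current = link.get();
        if (current == null) {
            // No valid link any more: the answer cannot be delivered.
            return false;
        }
        current.send(answer);
        return true;
    }
}
```

A task that completes after reconnect() has been called finds no link, so its answer never reaches the mgmt server even though the task itself succeeded.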
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Suresh Kumar Anaparti <sureshkumar.anaparti@xxxxxxxxx>
> Sent: Wednesday, May 16, 2018 4:27:36 PM
> To: dev@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> server) disconnection?
>
> Hi Rohit,
>
> When Management Server and Agent are up and running and there is a network
> failure, I think it is better to wait for some time for the pending tasks
> to complete, instead of failing them and trying to reconnect. If the network
> delay is minimal, there can be a valid thread/context in the management
> server to handle the answers.
>
> It would be great if there are no major side-effects with this PR's changes.
>
> Thanks,
> Suresh
>
> On Wed, May 16, 2018 at 3:40 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> wrote:
>
> > All,
> >
> >
> > Based on testing against KVM, XenServer and VMware and this discussion,
> > I'll merge the PR based on code reviews and tests. I investigated both
> > code-wise and against a live environment for possible side-effects of
> > letting the agent connect without being blocked on pending tasks and I
> > found no new fault behaviour.
> >
> >
> > If there are any objections or bugs, please share them, in which case we'll
> > revert the change and continue the legacy/historic behaviour. Thanks.
> >
> >
> > - Rohit
> >
> > <https://cloudstack.apache.org>
> >
> >
> >
> > ________________________________
> > From: Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > Sent: Tuesday, May 15, 2018 2:37:58 PM
> > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> > server) disconnection?
> >
> > Hi Suresh,
> >
> >
> > I've replied to your comment on the PR. In addition, when (i) the
> > management server is restarted, any pending operation on the KVM/SSVM agent
> > side will fail to be communicated back in the correct thread/context, and
> > it depends on the specific feature whether it supports a sync or cleanup
> > mechanism; in most cases, the async/job timeout may kick in or cause the
> > queue/concurrent failure seen in logs. When (ii) the agent is reconnected,
> > it reconnects only after any pending job finishes; therefore such jobs
> > finish but fail to be communicated back to the mgmt server (the answer
> > instance fails to be sent on the link, as the link is no longer valid, and
> > this causes an exception).
> >
> >
> > - Rohit
> >
> > <https://cloudstack.apache.org>
> >
> >
> >
> > ________________________________
> > From: Suresh Kumar Anaparti <sureshkumar.anaparti@xxxxxxxxx>
> > Sent: Tuesday, May 15, 2018 12:06:14 AM
> > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> > server) disconnection?
> >
> > Hi,
> >
> > @rhtyd, I checked the PR changes. Good that the agent no longer waits for
> > the pending jobs before retrying the connection to the management server.
> > This might have an impact on ssvm and kvm agent tasks, not much on cpvm. Is
> > there any sync or cleanup mechanism for Volumes/VMs to address the
> > failed/pending agent jobs after (i) the management server has restarted and
> > (ii) the agent has connected?
> >
> > -Suresh
> >
> > On Mon, May 14, 2018 at 8:05 PM, Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
> > wrote:
> >
> > > Correct about the thread context, so if the answer is coming into a
> > > management server that doesn't have the context and drops it, it should
> > > be fine then. The PR is then already a good improvement to let the agent
> > > reconnect even when it's processing a long-running request, so it can
> > > keep on completing other jobs too.
> > >
> > > Regarding the restart/shutdown operation, yes, I now have to push the
> > > changes to be able to stop some processing tasks (mainly fetching new
> > > async jobs) on a management server to ensure a cleaner shutdown. My
> > > solution, as said, is based on the content of a file that is compatible
> > > with HAProxy, thus not the LB mechanism added recently in CS. It could be
> > > changed to an API call to put a management server into, or move it out
> > > of, maintenance. The listManagementServers API call has been merged and
> > > it was a requirement for that.
> > >
> > > About Zookeeper, we are not using it for the rolling shutdown/restart for
> > > now. We are using it as an efficient and true lock mechanism between
> > > multiple management servers. We are slowly moving the locks code towards
> > > ZK and added one during the allocation phase to ensure no host would be
> > > over-allocated. I will take this discussion to another email thread since
> > > I have a few questions regarding ZK and also wish to talk about the
> > > connection between the agent & management servers.
> > >
> > > On Mon, May 14, 2018 at 2:39 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > > wrote:
> > >
> > > > Thanks Marc and Rafael for replying.
> > > >
> > > >
> > > > In my experimentation, when the agent disconnects it will wait for the
> > > > pending jobs/tasks to complete, and on completion it creates an Answer
> > > > instance and tries to send it using a `link` which no longer exists,
> > > > and fails. This is the current behaviour; on the mgmt server side the
> > > > resource/task will be left hanging and may not be automatically marked
> > > > failed right away (maybe only after the configured timeout). My best
> > > > guess is that applying the change should not have any side-effects,
> > > > other than the exceptions/faults we already observe.
> > > >
> > > >
> > > > In my test, the failed async job did not get retried and I hit the
> > > > famous 'concurrency limit 1' issue. At this point, I had to manually
> > > > clean up the snapshot row and the rows from sync_queue, sync_queue_item
> > > > and async_job. In the current implementation, the mgmt server sends a
> > > > cmd and the agent returns an answer after processing it, but we don't
> > > > have the same in the other direction, where an agent sends a cmd's
> > > > answer and the mgmt server processes it irrespective of the context.
> > > > Therefore, unless the mgmt server receiving the answer is in the right
> > > > thread/context/state, those answers are dropped.
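
The context-dependent handling described above can be sketched as follows, assuming that answers are matched to waiting callers by a sequence number. `AnswerRouterSketch` and its methods are hypothetical names for illustration and are not part of CloudStack.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mgmt-server-side router; not a CloudStack class.
class AnswerRouterSketch {
    // Callers currently waiting for an answer, keyed by command sequence number.
    private final Map<Long, CompletableFuture<String>> waiting = new ConcurrentHashMap<>();

    // Registers a waiting context just before a command is sent to the agent.
    CompletableFuture<String> expectAnswer(long seq) {
        CompletableFuture<String> pending = new CompletableFuture<>();
        waiting.put(seq, pending);
        return pending;
    }

    // Handles an incoming answer. Returns false when no context is waiting
    // for it, in which case the answer is simply dropped.
    boolean onAnswer(long seq, String answer) {
        CompletableFuture<String> pending = waiting.remove(seq);
        if (pending == null) {
            return false;
        }
        pending.complete(answer);
        return true;
    }
}
```

In this sketch the waiting contexts live only in memory, so a restarted mgmt server, or a different mgmt server in the cluster, starts with an empty map and drops any late answer.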
> > > >
> > > >
> > > > I think we need to solve for (1) claim and ownership management of a
> > > > resource (how to manage when the owner/mgmt server shuts down or dies),
> > > > (2) task handover: handing executing (in-flight) tasks to another mgmt
> > > > server when a mgmt server is shut down, (3) a central locking service
> > > > for this and other uses. The bigger change ties in with the other
> > > > things we've seen in the discussion around mgmt server
> > > > restart/shutdown. Until we get to solving the bigger issue, perhaps we
> > > > can provide some API/visual/UI ways to show the root admin the async
> > > > jobs in flight for a management server or alert him, perhaps an API to
> > > > do a cleaner mgmt server shutdown that waits for all pending async jobs
> > > > on a mgmt server to complete and does not take any new async/job API
> > > > requests (say, like Jenkins does with jobs)?
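
A rough sketch of the cleaner shutdown idea above, assuming async jobs run on a standard `java.util.concurrent.ExecutorService`. The class `QuietShutdownSketch` and its method names are hypothetical; CloudStack's actual async job framework is not modelled here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical job runner illustrating "drain, then stop"; not CloudStack code.
class QuietShutdownSketch {
    private final ExecutorService jobs = Executors.newFixedThreadPool(2);

    // Accepts a new job unless shutdown has started. Returns false if rejected.
    boolean submit(Runnable job) {
        try {
            jobs.execute(job);
            return true;
        } catch (RejectedExecutionException e) {
            // The pool no longer takes new work once drain() has been called.
            return false;
        }
    }

    // Stops accepting new jobs and waits for the pending ones to finish.
    // Returns true if everything completed within the timeout.
    boolean drain(long timeout, TimeUnit unit) {
        jobs.shutdown(); // already-submitted jobs keep running
        try {
            return jobs.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Once drain() has been called, submit() returns false for any new job while the jobs already in flight are allowed to finish.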
> > > >
> > > >
> > > > Marc - weren't you working on a zookeeper based rolling
> > > > shutdown/restart? Did that handle some of the failure cases?
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > <https://cloudstack.apache.org>
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
> > > > Sent: Monday, May 14, 2018 4:06:56 PM
> > > > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > > > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on
> > > > (mgmt server) disconnection?
> > > >
> > > > Hi,
> > > >
> > > > I'm also for a bigger change but this PR already moves forward to a
> > > > better agent <-> management connection handling.
> > > >
> > > > @rhtyd did you test your PR manually by, for example, requesting a long
> > > > snapshot operation and disconnecting the agent?
> > > >
> > > > I have one concern here: when an async job is taken from the DB by a
> > > > management server (in a cluster configuration), the mgmt ID is put in
> > > > the row to tell which mgmt is managing the job. On disconnection from
> > > > an agent, the event is propagated and the job is marked as failed in
> > > > the database, and an error is returned in the API for that command.
> > > > Here we are only addressing the need to let the agent reconnect
> > > > quickly, but I'm unsure of what will happen in the mgmt when the job
> > > > response is received by a mgmt (which might be another one than the one
> > > > registered in the job db row). I know this is where it becomes
> > > > complicated, because one async job might be only one part of a bigger
> > > > scenario for a command (like a live migration). I just want to ensure
> > > > it won't propagate further inconsistency.
> > > >
> > > > Marco
> > > >
> > > > On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <
> > > > rafaelweingartner@xxxxxxxxx> wrote:
> > > >
> > > > > Would prefer “A bigger design fix would be to make management server
> > > > > asynchronous of agent side answer/response handling”. However, I
> > > > > understand the volume of changes that requires.
> > > > >
> > > > > I looked at the PR, and I think that everything is ok there. Of
> > > > > course, I think we might need some more time to review and think
> > > > > about the possible outcomes of such changes.
> > > > >
> > > > > On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > > > > wrote:
> > > > >
> > > > > > All,
> > > > > >
> > > > > >
> > > > > > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from
> > > > > > the management server (say due to mgmt server restart etc), the
> > > > > > reconnection logic waits for any pending tasks/commands to complete
> > > > > > before reconnection attempts are made. I tried to search git
> > > > > > history but could not find a reason, can anyone share why we may
> > > > > > need this?
> > > > > >
> > > > > >
> > > > > > Based on the reported issue:
> > > > > >
> > > > > > https://github.com/apache/cloudstack/issues/2633
> > > > > >
> > > > > >
> > > > > > I've a working patch which removes this limitation:
> > > > > >
> > > > > > https://github.com/apache/cloudstack/pull/2638
> > > > > >
> > > > > >
> > > > > > From testing with various combinations of tasks, I found that when
> > > > > > that happens, even if the pending task succeeds it fails to send an
> > > > > > Answer to the mgmt server, therefore from the control plane's
> > > > > > perspective that task is still pending/on-going.
> > > > > >
> > > > > >
> > > > > > When the mgmt server comes back online, and the agent finally
> > > > > > reconnects (depending on how long the pending task took), the
> > > > > > executed operation is still pending in the mgmt server's view and
> > > > > > may sometimes require manual cleanups in the database. By removing
> > > > > > the limitation in the above PR, at least the agent reconnects
> > > > > > faster while the failure/fault behaviours remain the same. A bigger
> > > > > > design fix would be to make management server asynchronous of agent
> > > > > > side answer/response handling.
> > > > > >
> > > > > >
> > > > > > - Rohit
> > > > > >
> > > > > > <https://cloudstack.apache.org>
> > > > > >
> > > > > >
> > > > > >
> > > > > > rohit.yadav@xxxxxxxxxxxxx
> > > > > > www.shapeblue.com
> > > > > > 53 Chandos Place, Covent Garden, London  WC2N 4HS UK
> > > > > > @shapeblue
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Rafael Weingärtner
> > > > >
> > > >
> > >
> >
>