Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?


Hi Suresh,

As long as the TCP link isn't closed, you can have network hiccups without
any issue. If the link is closed, the event is propagated on both the
management server and the agent side, and there isn't much that can be done
to address this easily with the current code base.
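
To make the failure mode discussed in this thread concrete, here is a minimal
Java sketch of an agent whose link is cleared on disconnect, so an answer
produced afterwards has nowhere to go. The names used (ReconnectSketch, Link,
sendAnswer) are hypothetical stand-ins for illustration, not CloudStack's
actual Agent code, and the single-reference design is an assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-ins for illustration only; not the real CloudStack classes.
class ReconnectSketch {
    /** Minimal stand-in for the connection to the management server. */
    static class Link {
        final List<String> sent = new ArrayList<>();

        void send(String answer) {
            sent.add(answer);
        }
    }

    // Assumption: the current link is held in one reference that is cleared
    // when the connection is lost.
    private final AtomicReference<Link> link = new AtomicReference<>();

    /** Establishes a fresh link and returns it. */
    Link connect() {
        Link fresh = new Link();
        link.set(fresh);
        return fresh;
    }

    /** Drops the link, as happens when the connection is closed. */
    void disconnect() {
        link.set(null);
    }

    /** Returns true if the answer was handed to a link, false if there was none. */
    boolean sendAnswer(String answer) {
        Link current = link.get();
        if (current == null) {
            return false; // the answer of the pending task cannot be delivered
        }
        current.send(answer);
        return true;
    }

    public static void main(String[] args) {
        ReconnectSketch agent = new ReconnectSketch();
        agent.connect();
        System.out.println(agent.sendAnswer("answer-1")); // link is up
        agent.disconnect();
        System.out.println(agent.sendAnswer("answer-2")); // link is gone
    }
}
```

In this sketch the first send succeeds and the second returns false, which is
the situation a pending task ends up in when it completes after the link has
been dropped.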

Marc-Aurèle

On Wed, May 16, 2018 at 1:25 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
wrote:

> Hi Suresh,
>
>
> As explained earlier, I advised looking at the code on the PR. Perhaps you
> did not get time, so have a look here:
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L488
>
>
> The reconnect() historically sets the link to null. Therefore, any answers
> from pending tasks end up failing here:
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L868
>
> and,
>
> https://github.com/apache/cloudstack/blob/4.11/agent/src/com/cloud/agent/Agent.java#L893
>
>
> Do note that reconnect() only cancels watch tasks but does not
> cancel/shutdown any running task. Also, in case of a network error, the
> mgmt server will fail at the thread/context where it has done an
> agent.send() and is expecting an answer.
>
>
> You can also perform a small test by adding a while loop or sleep around
> this code to see how getLink().send() behaves when the agent reconnects.
> When it does not reconnect, i.e. the agent is blocked waiting for pending
> tasks to complete, such tasks always fail.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Suresh Kumar Anaparti <sureshkumar.anaparti@xxxxxxxxx>
> Sent: Wednesday, May 16, 2018 4:27:36 PM
> To: dev@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> server) disconnection?
>
> Hi Rohit,
>
> When Management Server and Agent are up and running and there is a network
> failure, I think it is better to wait for some time for the pending tasks
> to complete, instead of failing them and trying to reconnect. If the
> network delay is minimal, there can be a valid thread/context in the
> management server to handle the answers.
>
> It would be great if there are no major side-effects with the changes in
> this PR.
>
> Thanks,
> Suresh
>
> On Wed, May 16, 2018 at 3:40 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> wrote:
>
> > All,
> >
> >
> > Based on testing against KVM, XenServer and VMware and this discussion,
> > I'll merge the PR based on code reviews and tests. I investigated both
> > code-wise and against a live environment for possible side-effects of
> > letting the agent connect without being blocked on pending tasks and I
> > found no new fault behaviour.
> >
> >
> > If there are any objections or bugs, please share them, in which case
> > we'll revert the change and keep the legacy/historic behaviour. Thanks.
> >
> >
> > - Rohit
> >
> > ________________________________
> > From: Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > Sent: Tuesday, May 15, 2018 2:37:58 PM
> > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> > server) disconnection?
> >
> > Hi Suresh,
> >
> >
> > I've replied to your comment on the PR. In addition, when (i) the
> > management server is restarted, any pending operation on the KVM/SSVM
> > agent side will fail to be communicated back in the correct
> > thread/context, and it depends on the specific feature whether it
> > supports a sync or cleanup mechanism; in most cases, the async/job
> > timeout may kick in or cause the queue/concurrency failure seen in logs.
> > When (ii) the agent is reconnected, it reconnects only after any pending
> > job finishes, therefore such jobs finish but fail to be communicated back
> > to the mgmt server (the answer instance fails to be sent on the link, as
> > the link is no longer valid, which causes an exception).
> >
> >
> > - Rohit
> >
> > ________________________________
> > From: Suresh Kumar Anaparti <sureshkumar.anaparti@xxxxxxxxx>
> > Sent: Tuesday, May 15, 2018 12:06:14 AM
> > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt
> > server) disconnection?
> >
> > Hi,
> >
> > @rhtyd, I checked the PR changes. Good that the agent no longer waits for
> > the pending jobs before retrying the connection to the management server.
> > This might have an impact on ssvm and kvm agent tasks, not much on cpvm.
> > Is there any sync or cleanup mechanism for Volumes/VMs to address the
> > failed/pending agent jobs after (i) management server restart and (ii)
> > agent reconnection?
> >
> > -Suresh
> >
> > On Mon, May 14, 2018 at 8:05 PM, Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
> > wrote:
> >
> > > Correct about the thread context, so if the answer comes into a
> > > management server that doesn't have the context and drops it, it should
> > > be fine then. The PR is then already a good improvement, letting the
> > > agent reconnect even when it's processing a long request, so it can keep
> > > on completing other jobs too.
> > >
> > > Regarding the restart/shutdown operation, yes, I now have to push the
> > > changes to be able to stop some processing tasks (mainly fetching new
> > > async jobs) on a management server to ensure a cleaner shutdown. My
> > > solution, as said, is based on the content of a file that is compatible
> > > with HAProxy, thus not the LB mechanism added recently in CS. It could be
> > > changed to an API call to put a management server into, or move it out
> > > of, maintenance. The listManagementServers API call has been merged and
> > > it was a requirement for that.
> > >
> > > About Zookeeper, it's not used for the rolling shutdown/restart for now.
> > > We are using it as an efficient and true lock mechanism between multiple
> > > management servers. We are slowly moving the locks code towards ZK and
> > > added one during the allocation phase to ensure no host would be
> > > over-allocated. I will take this discussion to another email thread since
> > > I have a few questions regarding ZK and also wish to talk about the
> > > connection between the agent & management servers.
> > >
> > > On Mon, May 14, 2018 at 2:39 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > > wrote:
> > >
> > > > Thanks Marc and Rafael for replying.
> > > >
> > > >
> > > > In my experimentation, when the agent disconnects it will wait for the
> > > > pending jobs/tasks to complete, and on completion it creates an Answer
> > > > instance and tries to send it using a `link` which no longer exists,
> > > > and fails. This is the current behaviour; on the mgmt server side the
> > > > resource/task will be left hanging and may not be automatically marked
> > > > failed right away (maybe after the configured timeout). My best guess
> > > > is that applying the change should not have any side-effects, other
> > > > than the exceptions/faults we already observe.
> > > >
> > > >
> > > > In my test, the failed async job did not get retried and I hit the
> > > > famous 'concurrency limit 1' issue. At this point, I had to manually
> > > > clean up the snapshot row and the rows from sync_queue, sync_queue_item
> > > > and async_job. On the agent side, the mgmt server sends a cmd and the
> > > > agent returns an answer after processing it; we don't have the
> > > > equivalent on the mgmt server side, where an agent sends a cmd's answer
> > > > and the mgmt server processes it irrespective of the context.
> > > > Therefore, unless the mgmt server receiving the answer is in the right
> > > > thread/context/state, those answers are dropped.
> > > >
> > > >
> > > > I think we need to solve for (1) claim and ownership management of a
> > > > resource (how to manage when the owner/mgmt server shuts down or dies),
> > > > (2) task handover: handing executing (in-flight) tasks to another mgmt
> > > > server when a mgmt server is shut down, (3) a central locking service
> > > > for this and other uses. The bigger change ties in with the other
> > > > things we've seen in the discussion around mgmt server
> > > > restart/shutdown. Until we get to solving the bigger issue, perhaps we
> > > > can provide some API/visual/UI ways to show the root admin the async
> > > > jobs in flight for a management server or alert them, perhaps an API to
> > > > do a cleaner mgmt server shutdown that waits for all pending async jobs
> > > > on a mgmt server to complete and does not take any new async/job API
> > > > requests (say, like Jenkins does with jobs)?
> > > >
> > > >
> > > > Marc - weren't you working on a zookeeper based rolling
> > > > shutdown/restart? Did that handle some of the failure cases?
> > > >
> > > >
> > > > - Rohit
> > > >
> > > > ________________________________
> > > > From: Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
> > > > Sent: Monday, May 14, 2018 4:06:56 PM
> > > > To: dev@xxxxxxxxxxxxxxxxxxxxx
> > > > Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on
> > > > (mgmt server) disconnection?
> > > >
> > > > Hi,
> > > >
> > > > I'm also for a bigger change but this PR already moves forward to
> > > > better agent <-> management connection handling.
> > > >
> > > > @rhtyd did you test your PR manually by, for example, requesting a
> > > > long snapshot operation and disconnecting the agent?
> > > >
> > > > I have one concern here: when an async job is taken from the DB by a
> > > > management server (in a cluster configuration), the mgmt ID is put in
> > > > the row to tell which mgmt is managing the job. On disconnection from
> > > > an agent, the event is propagated, the job is marked as failed in the
> > > > database, and an error is returned in the API for that command. Here we
> > > > are only making the agent reconnect quickly, but I'm unsure what will
> > > > happen in the mgmt when the job response is received by a mgmt (which
> > > > might be a different one from the one registered in the job DB row). I
> > > > know this is where it becomes complicated, because one async job might
> > > > be only one part of a bigger scenario for a command (like a live
> > > > migration). I just want to ensure it won't propagate further
> > > > inconsistency.
> > > >
> > > > Marco
> > > >
> > > > On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <
> > > > rafaelweingartner@xxxxxxxxx> wrote:
> > > >
> > > > > I would prefer “A bigger design fix would be to make management
> > > > > server asynchronous of agent side answer/response handling”. However,
> > > > > I understand the volume of changes that requires.
> > > > >
> > > > > I looked at the PR, and I think that everything is ok there. Of
> > > > > course, I think we might need some more time to review and think
> > > > > about the possible outcomes of such changes.
> > > > >
> > > > > On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav
> > > > > <rohit.yadav@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > All,
> > > > > >
> > > > > >
> > > > > > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from
> > > > > > the management server (say due to mgmt server restart etc), the
> > > > > > reconnection logic waits for any pending tasks/commands to complete
> > > > > > before reconnection attempts are made. I tried to search git
> > > > > > history but could not find a reason; can anyone share why we may
> > > > > > need this?
> > > > > >
> > > > > >
> > > > > > Based on the reported issue:
> > > > > >
> > > > > > https://github.com/apache/cloudstack/issues/2633
> > > > > >
> > > > > >
> > > > > > I've a working patch which removes this limitation:
> > > > > >
> > > > > > https://github.com/apache/cloudstack/pull/2638
> > > > > >
> > > > > >
> > > > > > From testing with various combinations of tasks, I found that when
> > > > > > that happens, even if the pending task succeeds it fails to send an
> > > > > > Answer to the mgmt server; therefore, from the control plane's
> > > > > > perspective, that task is still pending/on-going.
> > > > > >
> > > > > >
> > > > > > When the mgmt server comes back online, and the agent finally
> > > > > > reconnects (depending on how long the pending task took), the
> > > > > > executed operation is still pending in the mgmt server's view and
> > > > > > may sometimes require manual cleanups in the database. By removing
> > > > > > the limitation in the above PR, at least the agent reconnects
> > > > > > faster while the failure/fault behaviours remain the same. A bigger
> > > > > > design fix would be to make management server asynchronous of agent
> > > > > > side answer/response handling.
> > > > > >
> > > > > >
> > > > > > - Rohit
> > > > > >
> > > > > > rohit.yadav@xxxxxxxxxxxxx
> > > > > > www.shapeblue.com<http://www.shapeblue.com>
> > > > > > 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> > > > > > @shapeblue
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Rafael Weingärtner
> > > > >
> > > >
> > >
> >
>