Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
Thanks Marc and Rafael for replying.
In my experimentation, when the agent disconnects it will wait for the pending jobs/tasks to complete, and on completion it creates an Answer instance and tries to send it using a `link` which no longer exists, so the send fails. This is the current behaviour; on the mgmt server side the resource/task will be left hanging and may not be automatically marked failed right away (maybe only after the configured timeout). My best guess is that applying the change should not have any side-effects, other than the exceptions/faults we already observe.
In my test, the failed async job did not get retried and I hit the famous 'concurrency limit 1' issue. At this point, I had to manually clean up the snapshot row and the rows from sync_queue, sync_queue_item and async_job. On the agent side we have an implementation where the mgmt server sends a cmd and the agent returns an answer after processing it. We don't have the equivalent on the mgmt server side, where an agent would send a cmd's answer and the mgmt server would process it irrespective of the context. Therefore, unless the mgmt server receiving the answer is in the right thread/context/state, those answers are dropped.
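To illustrate the dropped-answer behaviour described above, here is a minimal Python sketch. It is not CloudStack code: `Answer`, `AnswerRouter`, `expect`, `forget_all` and `on_answer` are hypothetical names, and the model assumes answers are matched to waiting callers by a sequence number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Answer:
    seq: int           # sequence number of the command being answered (assumed)
    result: bool
    details: str = ""


class AnswerRouter:
    """Hypothetical stand-in for mgmt server side answer handling."""

    def __init__(self) -> None:
        self._waiting: Dict[int, Callable[[Answer], None]] = {}
        self.dropped: List[Answer] = []

    def expect(self, seq: int, callback: Callable[[Answer], None]) -> None:
        # A caller that sent a command registers interest in its answer.
        self._waiting[seq] = callback

    def forget_all(self) -> None:
        # Models a mgmt server restart or agent disconnect: waiting context is lost.
        self._waiting.clear()

    def on_answer(self, answer: Answer) -> bool:
        # Deliver only if someone is still waiting; otherwise the answer is dropped.
        callback = self._waiting.pop(answer.seq, None)
        if callback is None:
            self.dropped.append(answer)
            return False
        callback(answer)
        return True


if __name__ == "__main__":
    router = AnswerRouter()
    router.expect(7, lambda a: print("delivered", a.seq))
    router.forget_all()                      # disconnect before the task finishes
    delivered = router.on_answer(Answer(seq=7, result=True))
    print("delivered" if delivered else "dropped", len(router.dropped))
```

In this model a successful task still ends up in `dropped` once the waiting context is gone, which corresponds to the job staying pending on the mgmt server side.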
I think we need to solve for (1) claim and ownership management of a resource (how to manage it when the owner/mgmt server shuts down or dies), (2) task handover: handing in-flight tasks over to another mgmt server when their mgmt server is shut down, (3) a central locking service for this and other uses. The bigger change ties in with the other things we've seen in the discussion around mgmt server restart/shutdown. Until we get to solving the bigger issue, perhaps we can provide some API/UI ways to show the root admin the async jobs in flight for a management server, or alert them. Perhaps also an API to do a cleaner mgmt server shutdown that waits for all pending async jobs on a mgmt server to complete and does not take any new async job API requests (say, like Jenkins does with jobs)?
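The cleaner-shutdown idea could look roughly like the Python sketch below. This is only an illustration under assumptions, not an existing CloudStack API: `DrainingJobManager`, `submit`, `complete`, `in_flight` and `prepare_for_shutdown` are hypothetical names, and jobs are tracked in memory rather than in the database.

```python
import threading
from typing import List, Optional, Set


class DrainingJobManager:
    """Hypothetical job tracker with a drain mode for shutdown."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight: Set[str] = set()
        self._accepting = True

    def submit(self, job_id: str) -> bool:
        # Refuse new async jobs once shutdown preparation has started.
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight.add(job_id)
            return True

    def complete(self, job_id: str) -> None:
        # Mark a job finished and wake up anyone waiting for the drain.
        with self._cond:
            self._in_flight.discard(job_id)
            self._cond.notify_all()

    def in_flight(self) -> List[str]:
        # What an admin-facing listing could show for this mgmt server.
        with self._cond:
            return sorted(self._in_flight)

    def prepare_for_shutdown(self, timeout: Optional[float] = None) -> bool:
        # Stop accepting jobs, then wait until no job is in flight.
        # Returns False if the timeout expires with jobs still pending.
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: not self._in_flight, timeout=timeout)
```

A caller would invoke `prepare_for_shutdown()` before stopping the service and only proceed when it returns `True`; `in_flight()` gives the list that could be surfaced to the root admin in the meantime.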
Marc - weren't you working on a ZooKeeper based rolling shutdown/restart? Did that handle some of the failure cases?
From: Marc-Aurèle Brothier <marco@xxxxxxxxxxx>
Sent: Monday, May 14, 2018 4:06:56 PM
Subject: Re: [DISCUSS][ASK] Should agent wait for pending tasks on (mgmt server) disconnection?
I'm also for a bigger change but this PR already moves forward to a better
agent <-> management connection handling.
@rhtyd did you test your PR manually by, for example, requesting a long
snapshot operation and disconnecting the agent?
I have one concern here: when an async job is taken from the DB by a
management server (in a cluster configuration), the mgmt ID is put in the
row to tell which mgmt is managing the job. On disconnection from an agent,
the event is propagated and the job is marked as failed in the database, and
an error is returned in the API for that command. Here we are only letting
the agent reconnect quickly, but I'm unsure of what will
happen in the mgmt when the job response is received by a mgmt (which might
be another one than the one registered in the job db row). I know this is
where it becomes complicated, because one async job might be only one part of a
bigger scenario for a command (like a live migration). I just want to
ensure it won't propagate further inconsistency.
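The concern above can be made concrete with a small Python sketch of the decision a mgmt server would face when a late answer arrives. This is an assumption-laden illustration, not the current CloudStack behaviour: `AsyncJobRow`, `JobStatus`, `owner_msid` and `classify_late_answer` are hypothetical names, and the three outcomes are just one possible policy.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    IN_PROGRESS = 0
    SUCCEEDED = 1
    FAILED = 2


@dataclass
class AsyncJobRow:
    job_id: int
    owner_msid: int      # id of the mgmt server recorded in the job row (assumed field)
    status: JobStatus


def classify_late_answer(job: AsyncJobRow, receiving_msid: int) -> str:
    """Decide what to do with an answer that arrives after a reconnect."""
    if job.status is not JobStatus.IN_PROGRESS:
        # The job was already marked failed (or finished) on disconnect.
        return "ignore"
    if job.owner_msid != receiving_msid:
        # Another mgmt server is registered as the owner of this job.
        return "forward"
    return "apply"


if __name__ == "__main__":
    failed = AsyncJobRow(job_id=1, owner_msid=100, status=JobStatus.FAILED)
    print(classify_late_answer(failed, receiving_msid=200))
```

Whatever policy is chosen, the point is that the receiving mgmt server has to check both the job state and the recorded owner before acting on the answer.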
On Sat, May 12, 2018 at 7:26 PM, Rafael Weingärtner <
> Would prefer “A bigger design fix would be to make management server
> asynchronous of agent side answer/response handling”. However, I understand
> the volume of changes that requires.
> I looked at the PR, and I think that everything is ok there. Of course, I
> think we might need some more time to review and think about the possible
> outcomes of such changes.
> On Fri, May 11, 2018 at 7:55 AM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx>
> > All,
> > Historically, when the agent (kvm, ssvm, cpvm) is disconnected from the
> > management server (say due to mgmt server restart etc), the reconnection
> > logic waits for any pending tasks/commands to complete before
> > reconnection attempts are made. I tried to search git history but could not find a
> > reason, can anyone share why we may need this?
> > Based on the reported issue:
> > https://github.com/apache/cloudstack/issues/2633
> > I've a working patch which removes this limitation:
> > https://github.com/apache/cloudstack/pull/2638
> > From testing with various combinations of tasks, I found that when that
> > happens even if the pending task succeeds it fails to send an Answer to
> > mgmt server, therefore from the control plane's perspective that task is
> > still pending/on-going.
> > When the mgmt server comes back online, and the agent finally reconnects
> > (depending on how long the pending task took) the executed operation is
> > pending in mgmt server's view and may sometimes require manual cleanups
> > in the database. By removing the limitation in above PR, at least the agent
> > reconnects faster while the rest of the failure/fault behaviours remain the same.
> > A bigger design fix would be to make management server asynchronous of
> > agent side answer/response handling.
> > - Rohit
> > <https://cloudstack.apache.org>
> > rohit.yadav@xxxxxxxxxxxxx
> > www.shapeblue.com<http://www.shapeblue.com>
> > 53 Chandos Place, Covent Garden, London WC2N 4HSUK
> > @shapeblue
> Rafael Weingärtner