Re: [DISCUSS] CloudStack graceful shutdown


One comment here (I had to shut down a whole DC for a few hours recently...):
please at least consider the snapshotting process as a special case. A
snapshot can take a few hours to complete (the copy process from Primary to
Secondary Storage).

In my recent unfortunate DC shutdown I did actually stop the MS (we also
have a script to identify running async jobs, so we stop it once it is safe),
but any running qemu-img processes (we use KVM) need to be killed manually
(via Ansible) after the MS is stopped, and so on.

I assume most jobs take a reasonable amount of time to complete, but
snapshots are probably the biggest exception, as they can take an extremely
long time to complete.
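
For what it's worth, the wait-and-roll-back part of the workflow quoted below could be sketched roughly like this in Python. This is only a minimal sketch under assumptions, not the script mentioned above: the table and column names are the ones given in the quoted proposal, while `wait_for_jobs`, `graceful_shutdown` and the `block_ports`/`unblock_ports`/`stop_service`/`count_pending` callables are hypothetical names, and the `%s` placeholder assumes a DB-API driver that uses the "format" parameter style.

```python
import time

# Built from the three criteria in the quoted proposal. How the query is run
# (driver, connection) is left to the caller; %s is an assumed placeholder style.
PENDING_JOBS_SQL = (
    "SELECT COUNT(*) FROM async_job "
    "WHERE job_status = 0 "
    "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
    "AND job_init_msid = %s"
)


def wait_for_jobs(count_pending, timeout_s=3600.0, poll_s=30.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll count_pending() until it returns 0 or timeout_s has elapsed.

    Returns True if the job queue drained, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        pending = count_pending()
        if pending == 0:
            return True
        if clock() >= deadline:
            return False
        # Log what we are still waiting for, as suggested in the thread.
        print(f"{pending} async job(s) still running; waiting")
        sleep(poll_s)


def graceful_shutdown(block_ports, unblock_ports, count_pending, stop_service,
                      **wait_kwargs):
    """Block new requests, wait for running jobs, then stop the service.

    If the wait times out or anything raises, the ports are unblocked again
    (the Python counterpart of the shell trap in the proposal).
    """
    block_ports()
    done = False
    try:
        if wait_for_jobs(count_pending, **wait_kwargs):
            stop_service()
            done = True
    finally:
        if not done:
            unblock_ports()
    return done
```

Here `count_pending` would run `PENDING_JOBS_SQL` with the node's MSID and return the count. A snapshot-aware variant could use a longer `timeout_s` when a snapshot job is among the pending ones.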

Cheers

On 4 April 2018 at 22:46, Tutkowski, Mike <Mike.Tutkowski@xxxxxxxxxx> wrote:

> I may be remembering this incorrectly, but from what I recall, if a
> resource is owned by one MS and a request related to that resource comes in
> to another MS, the MS that received the request passes it on to the other
> MS.
>
> > On Apr 4, 2018, at 2:36 PM, Rafael Weingärtner <
> rafaelweingartner@xxxxxxxxx> wrote:
> >
> > Big +1 for this feature; I only have a few doubts.
> >
> > * Regarding the tasks/jobs that management servers (MSs) execute: do these
> > tasks originate from requests that come to the MS, or is it possible for
> > a request received by one management server to be executed by another? I
> > mean, if I execute a request against MS1, will this request always be
> > executed/treated by MS1, or is it possible that this request is executed
> > by another MS (e.g. MS2)?
> >
> > * I would suggest that after we block traffic coming from 8080/8443/8250
> > (we will need to block this as well, right?), we log the execution of
> > tasks. I mean, something saying: there are XXX tasks (enumerate tasks)
> > still being executed; we will wait for them to finish before shutting down.
> >
> > * The timeout (60 minutes suggested) could be a global setting that we
> > load before executing the graceful shutdown.
> >
> > On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <
> ilya.mailing.lists@xxxxxxxxx>
> > wrote:
> >
> >> Use case:
> >> In any environment, from time to time, an administrator needs to perform
> >> maintenance. The current stop sequence of the CloudStack management server
> >> ignores the fact that there may be long-running async jobs and terminates
> >> the process. This in turn can create a poor user experience and occasional
> >> inconsistency in the CloudStack DB.
> >>
> >> This is especially painful in large environments where the user has
> >> thousands of nodes and continuous patching happens around the clock,
> >> which requires migrating workloads from one node to another.
> >>
> >> With that said, I've created a script that monitors the async job queue
> >> for a given MS and waits for it to complete all jobs. More details are
> >> posted below.
> >>
> >> I'd like to introduce "graceful-shutdown" into the systemctl/service of
> >> cloudstack-management service.
> >>
> >> The details of how it will work are below:
> >>
> >> Workflow for graceful shutdown:
> >>  Using iptables/firewalld, block any connection attempts on 8080/8443
> >> (we can identify the ports dynamically)
> >>  Identify the MSID for the node; using the proper MSID, query the
> >> async_job table for
> >> 1) any jobs that are still running (i.e. job_status="0")
> >> 2) job_dispatcher not like "pseudoJobDispatcher"
> >> 3) job_init_msid=$my_ms_id
> >>
> >> Monitor the async_job table for up to 60 minutes, until all async jobs
> >> for the MSID are done, then proceed with shutdown
> >>    If it fails for any reason or is terminated, catch the exit via the
> >> trap command and unblock 8080/8443
> >>
> >> Comments are welcome
> >>
> >> Regards,
> >> ilya
> >>
> >
> >
> >
> > --
> > Rafael Weingärtner
>



-- 

Andrija Panić