Re: [DISCUSS] CloudStack graceful shutdown


I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
it will be the longest

      "category": "Advanced",
      "description": "Time (in minutes) for async-jobs to be forcely cancelled if it has been in process for long",
      "name": "job.cancel.threshold.minutes",
      "value": "60"
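
As a rough illustration of how that value could bound the wait, here is a minimal Python sketch. The function name wait_for_jobs, the count_pending callback and the poll interval are all hypothetical and not part of CloudStack; the only thing taken from the setting above is the 60-minute default.

```python
import time
from typing import Callable


def wait_for_jobs(count_pending: Callable[[], int],
                  timeout_minutes: float = 60,
                  poll_seconds: float = 10,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no jobs are pending or the timeout expires.

    count_pending is a hypothetical callback returning the number of
    async jobs still running on this management server.
    timeout_minutes would be read from job.cancel.threshold.minutes.
    Returns True if the queue drained, False if the timeout was hit.
    """
    deadline = clock() + timeout_minutes * 60
    while True:
        if count_pending() == 0:
            return True          # safe to proceed with shutdown
        if clock() >= deadline:
            return False         # caller decides: abort or force the stop
        sleep(poll_seconds)
```

The clock and sleep parameters are only there so the loop can be exercised without real waiting; a caller would normally leave them at their defaults.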




On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <
rafaelweingartner@xxxxxxxxx> wrote:

> Big +1 for this feature; I only have a few doubts.
>
> * Regarding the tasks/jobs that management servers (MSs) execute: do these
> tasks originate from requests that come to that MS, or is it possible for
> requests received by one management server to be executed by another? I mean,
> if I execute a request against MS1, will this request always be
> executed/treated by MS1, or is it possible that this request is executed
> by another MS (e.g. MS2)?
>
> * I would suggest that after we block traffic on 8080/8443/8250 (we
> will need to block this as well, right?), we log the execution of tasks.
> I mean, something saying: there are XXX tasks (enumerate tasks) still being
> executed; we will wait for them to finish before shutting down.
>
> * The timeout (60 minutes suggested) could be a global setting that we
> load before executing the graceful shutdown.
>
> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
> wrote:
>
> > Use case:
> > In any environment, an administrator needs to perform maintenance from
> > time to time. The current stop sequence of the cloudstack management server
> > ignores the fact that there may be long-running async jobs and terminates
> > the process. This in turn can create a poor user experience and occasional
> > inconsistency in the cloudstack db.
> >
> > This is especially painful in large environments where the user has
> > thousands of nodes and continuous patching happens around the clock,
> > which requires migration of workload from one node to another.
> >
> > With that said - I've created a script that monitors the async job queue
> > for a given MS and waits for it to complete all jobs. More details are
> > posted below.
> >
> > I'd like to introduce "graceful-shutdown" into the systemctl/service of
> > cloudstack-management service.
> >
> > The details of how it will work are below:
> >
> > Workflow for graceful shutdown:
> >   Using iptables/firewalld - block any connection attempts on 8080/8443
> > (we can identify the ports dynamically)
> >   Identify the MSID for the node; using the proper MSID, query the
> > async_job table for
> > 1) any jobs that are still running (i.e. job_status="0")
> > 2) job_dispatcher not like "pseudoJobDispatcher"
> > 3) job_init_msid=$my_ms_id
> >
> > Monitor this async_job table for up to 60 minutes - until all async jobs
> > for the MSID are done, then proceed with shutdown
> >     If failed for any reason or terminated, catch the exit via the trap
> > command and unblock 8080/8443
> >
> > Comments are welcome
> >
> > Regards,
> > ilya
> >
>
>
>
> --
> Rafael Weingärtner
>
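
For reference, the three query conditions in the workflow quoted above could be sketched roughly as follows in Python. The table and column names (async_job, job_status, job_dispatcher, job_init_msid) come from the thread; the function name count_pending_jobs is hypothetical, and sqlite3 is used here only so the sketch is self-contained. CloudStack itself runs on MySQL, where a driver would typically use %s rather than ? as the parameter placeholder.

```python
import sqlite3

# Jobs still running on one management server, per the three conditions
# listed in the thread. The "?" placeholder is sqlite3's style (assumption:
# a MySQL driver would use "%s" instead).
PENDING_JOBS_SQL = """
SELECT COUNT(*) FROM async_job
WHERE job_status = 0
  AND job_dispatcher NOT LIKE 'pseudoJobDispatcher'
  AND job_init_msid = ?
"""


def count_pending_jobs(conn, msid: int) -> int:
    """Return how many async jobs initiated by this MSID are still running."""
    cur = conn.cursor()
    cur.execute(PENDING_JOBS_SQL, (msid,))
    (count,) = cur.fetchone()
    return count


if __name__ == "__main__":
    # Illustrative sample data only; not a real CloudStack schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE async_job "
                 "(job_status INTEGER, job_dispatcher TEXT, job_init_msid INTEGER)")
    conn.executemany("INSERT INTO async_job VALUES (?, ?, ?)", [
        (0, "someDispatcher", 1),        # running on MS 1: counted
        (1, "someDispatcher", 1),        # already finished: ignored
        (0, "pseudoJobDispatcher", 1),   # pseudo job: ignored
        (0, "someDispatcher", 2),        # belongs to another MS: ignored
    ])
    print(count_pending_jobs(conn, 1))
```

A shutdown script would call something like this in a loop and only proceed once the count reaches zero or the timeout expires.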