Re: [DISCUSS] CloudStack graceful shutdown


This is not simple, e.g. for VMware. Each management server also acts as an agent proxy, so tasks against a particular ESX host will always be forwarded. The right answer would be native support for a “maintenance mode” in the management server. On entering that mode, the management server should release all agents including save, block or redirect API calls and login requests, and finish all async jobs it originated.

Sent from my iPhone

> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <rafaelweingartner@xxxxxxxxx> wrote:
> 
> Ilya, still regarding the management server that is being shut down:
> if other MSs or maybe system VMs (I am not sure whether they are able to
> do such tasks) can direct/redirect/send new jobs to this management server
> (the one being shut down), the process might never end because new tasks
> are always being created for the management server that we want to shut
> down. Is this scenario possible?
> 
> That is why I mentioned blocking port 8250 for the “graceful-shutdown”.
> 
> If this scenario is not possible, then everything is fine.
> 
> 
> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
> wrote:
> 
>> I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
>> it will be the longest
>> 
>>      "category": "Advanced",
>> 
>>      "description": "Time (in minutes) for async-jobs to be forcely
>> cancelled if it has been in process for long",
>> 
>>      "name": "job.cancel.threshold.minutes",
>> 
>>      "value": "60"
>> 
>> 
>> 
>> 
>> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <
>> rafaelweingartner@xxxxxxxxx> wrote:
>> 
>>> Big +1 for this feature; I only have a few doubts.
>>> 
>>> * Regarding the tasks/jobs that management servers (MSs) execute: do these
>>> tasks originate from requests that come to the MS, or is it possible for
>>> requests received by one management server to be executed by another? I mean,
>>> if I execute a request against MS1, will this request always be
>>> executed/treated by MS1, or is it possible that this request is executed
>>> by another MS (e.g. MS2)?
>>> 
>>> * I would suggest that after we block traffic coming from 8080/8443/8250 (we
>>> will need to block this as well, right?), we log the execution of tasks.
>>> I mean, something saying: there are XXX tasks (enumerate tasks) still being
>>> executed, we will wait for them to finish before shutting down.
>>> 
>>> * The timeout (60 minutes suggested) could be a global setting that we
>>> load before executing the graceful-shutdown.
>>> 
>>> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
>>> wrote:
>>> 
>>>> Use case:
>>>> In any environment, from time to time, an administrator needs to perform
>>>> maintenance. The current stop sequence of the CloudStack management server
>>>> ignores the fact that there may be long-running async jobs and terminates
>>>> the process. This in turn can create a poor user experience and occasional
>>>> inconsistency in the CloudStack DB.
>>>> 
>>>> This is especially painful in large environments where the user has
>>>> thousands of nodes and continuous patching happens around the clock,
>>>> which requires migration of workload from one node to another.
>>>> 
>>>> With that said, I've created a script that monitors the async job queue
>>>> for a given MS and waits for it to complete all jobs. More details are
>>>> posted below.
>>>> 
>>>> I'd like to introduce "graceful-shutdown" into the systemctl/service of
>>>> cloudstack-management service.
>>>> 
>>>> The details of how it will work are below:
>>>> 
>>>> Workflow for graceful shutdown:
>>>>  Using iptables/firewalld - block any connection attempts on 8080/8443
>>>> (we can identify the ports dynamically)
>>>>  Identify the MSID for the node, using the proper msid - query the
>>>> async_job table for
>>>> 1) any jobs that are still running (or job_status="0")
>>>> 2) job_dispatcher not like "pseudoJobDispatcher"
>>>> 3) job_init_msid=$my_ms_id
>>>> 
>>>> Monitor this async_job table for 60 minutes - until all async jobs for
>>>> the MSID are done, then proceed with shutdown
>>>>    If failed for any reason or terminated, catch the exit via trap
>>>> command and unblock the 8080/8443
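
For readers of the archive, the workflow quoted above (block the ports, poll async_job for this MSID, unblock on failure) could be sketched roughly as follows. This is not ilya's script, which is not included in the thread. The function and parameter names are hypothetical, the firewall and database steps are passed in as callables rather than implemented, the `%s` placeholder assumes a MySQL-style Python driver, and treating a timeout as a failure that unblocks the ports is an assumption. A `try`/`finally` stands in for the shell `trap`.

```python
import time
from typing import Callable

# Query mirroring the three conditions listed in the workflow.
# Assumption: table and column names are exactly as written in the thread.
PENDING_JOBS_SQL = (
    "SELECT COUNT(*) FROM async_job "
    "WHERE job_status = 0 "
    "AND job_dispatcher NOT LIKE 'pseudoJobDispatcher' "
    "AND job_init_msid = %s"
)


def graceful_shutdown(
    count_pending: Callable[[], int],    # hypothetical: runs PENDING_JOBS_SQL
    block_ports: Callable[[], None],     # hypothetical: iptables/firewalld block
    unblock_ports: Callable[[], None],   # hypothetical: reverts the block
    timeout_s: float = 60 * 60,          # 60 minutes, as suggested in the thread
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once no async jobs remain and the service may be stopped."""
    block_ports()
    drained = False
    try:
        deadline = clock() + timeout_s
        while True:
            pending = count_pending()
            if pending == 0:
                drained = True
                return True
            if clock() >= deadline:
                # Assumption: a timeout counts as a failed graceful shutdown.
                return False
            print(f"{pending} async job(s) still running; waiting before shutdown")
            sleep(poll_s)
    finally:
        # Plays the role of the shell trap: on timeout, error, or interrupt,
        # restore access to the ports.
        if not drained:
            unblock_ports()
```

A caller would stop the cloudstack-management service only when this returns True. The ports stay blocked in that case, since the server is about to go down anyway.
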
>>>> 
>>>> Comments are welcome
>>>> 
>>>> Regards,
>>>> ilya
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Rafael Weingärtner
>>> 
>> 
> 
> 
> 
> -- 
> Rafael Weingärtner