Re: [DISCUSS] CloudStack graceful shutdown


After much useful input from many of you - I realize my approach is
somewhat incomplete and possibly very optimistic.

Speaking to Marcus, here is what we propose as an alternate solution. I was
hoping to stay outside of the "core" - but it looks like there is no other
way around it.

Proposed functionality: a Management Server function to prepare for
maintenance
* I'm thinking this should be applicable to multinode setups only
* drain all connections on 8250 for KVM and other agents - by issuing a
  reconnect command on agents
* while 8250 is still listening, a new attempt to connect will be blocked and
  the agent will be asked to reconnect (if you have an LB - it will route it to
  another node and eventually reconnect all agents to other nodes - this
  might be an area where Marc's HAProxy solution would plug in). In 4.11 -
  there is a new framework for managing agent connectivity without needing
  a Load Balancer; need to investigate how this will work.
* allow the existing running async tasks to complete - as per the
  "job.cancel.threshold.minutes" max value
* queue the new tasks and process them on the next management server
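
The "allow running async tasks to complete" step could look roughly like the sketch below. It is only an illustration, not CloudStack code: the filter on async_job (job_status, job_dispatcher, job_init_msid) is taken from the earlier workflow quoted further down in this thread, while the function names, the 30-second poll interval, the demo table layout and the use of an in-memory SQLite database in place of the real CloudStack database are all assumptions made for the example.

```python
import sqlite3
import time

# Filter from the workflow quoted later in this thread. The table and column
# names come from that message; everything else here is assumed.
PENDING_JOBS_SQL = """
    SELECT COUNT(*) FROM async_job
    WHERE job_status = 0
      AND job_dispatcher NOT LIKE 'pseudoJobDispatcher'
      AND job_init_msid = ?
"""


def count_pending_jobs(conn, msid):
    """Count async jobs started by this management server that are still running."""
    return conn.execute(PENDING_JOBS_SQL, (msid,)).fetchone()[0]


def wait_for_drain(count_pending, threshold_minutes=60, poll_seconds=30,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until no jobs remain or the threshold expires.

    Returns True if the queue drained, False if the deadline passed first.
    The clock and sleep parameters exist only so the loop can be tested
    without really waiting.
    """
    deadline = clock() + threshold_minutes * 60
    while True:
        remaining = count_pending()
        if remaining == 0:
            return True
        if clock() >= deadline:
            return False
        print(f"{remaining} async job(s) still running, waiting...")
        sleep(poll_seconds)


def make_demo_db():
    """Build a tiny stand-in for the async_job table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE async_job ("
        "id INTEGER PRIMARY KEY, job_status INTEGER, "
        "job_dispatcher TEXT, job_init_msid INTEGER)"
    )
    conn.executemany(
        "INSERT INTO async_job (job_status, job_dispatcher, job_init_msid) "
        "VALUES (?, ?, ?)",
        [
            (0, "exampleDispatcher", 1),    # running, started by MS 1
            (0, "pseudoJobDispatcher", 1),  # excluded by the dispatcher filter
            (1, "exampleDispatcher", 1),    # already finished
            (0, "exampleDispatcher", 2),    # running, but started by MS 2
        ],
    )
    return conn
```

A caller would pass something like `lambda: count_pending_jobs(conn, my_msid)` as `count_pending` and use the value of "job.cancel.threshold.minutes" as `threshold_minutes`; a False result means the threshold expired with jobs still running, and what to do then is left open here.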

Still don't know what will happen to Xen or VMware in this case - perhaps
the ShapeBlue team can help answer or fill in the blanks for us.

Regards,
ilya

On Thu, Apr 5, 2018 at 2:48 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
wrote:

> Hi Sergey
>
> Glad to see you are doing well,
>
> I was gonna say drop "enterprise virtualization company" and save a
> $fortune$ - but it's not for everyone :)
>
> I'll post another proposed solution to the bottom of this thread.
>
> Regards
> ilya
>
>
> On Wed, Apr 4, 2018 at 5:22 PM, Sergey Levitskiy <serg38l@xxxxxxxxxxx>
> wrote:
>
>> Now without spellchecking :)
>>
>> This is not simple, e.g. for VMware. Each management server also acts as
>> an agent proxy, so tasks against a particular ESX host will always be
>> forwarded. The right answer will be to support a native “maintenance mode”
>> for the management server. When it enters such a mode, the management
>> server should release all agents including SSVM, block/redirect API calls
>> and login requests, and finish all async jobs it originated.
>>
>>
>>
>> On Apr 4, 2018, at 5:15 PM, Sergey Levitskiy <serg38l@xxxxxxxxxxx> wrote:
>>
>> This is not simple e.g. for VMware. Each management server also acts as
>> an agent proxy so tasks against a particular ESX host will be always
>> forwarded. That right answer will be to a native support for “maintenance
>> mode” for management server. When entered to such mode the management
>> server should release all agents including save, block/redirect API calls
>> and login request and finish all a sync job it originated.
>>
>> Sent from my iPhone
>>
>> On Apr 4, 2018, at 3:31 PM, Rafael Weingärtner <rafaelweingartner@xxxxxxxxx>
>> wrote:
>>
>> Ilya, still regarding the management server that is being shut down:
>> if other MSs or maybe system VMs (I am not sure whether they are able to
>> do such tasks) can direct/redirect/send new jobs to this management server
>> (the one being shut down), the process might never end because new tasks
>> are always being created for the management server that we want to shut
>> down. Is this scenario possible?
>>
>> That is why I mentioned blocking the port 8250 for the
>> “graceful-shutdown”.
>>
>> If this scenario is not possible, then everything is fine.
>>
>>
>> On Wed, Apr 4, 2018 at 7:14 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
>> wrote:
>>
>> I'm thinking of using a configuration from "job.cancel.threshold.minutes" -
>> it will be the longest
>>
>>     "category": "Advanced",
>>
>>     "description": "Time (in minutes) for async-jobs to be forcely
>> cancelled if it has been in process for long",
>>
>>     "name": "job.cancel.threshold.minutes",
>>
>>     "value": "60"
>>
>>
>>
>>
>> On Wed, Apr 4, 2018 at 1:36 PM, Rafael Weingärtner <rafaelweingartner@xxxxxxxxx>
>> wrote:
>>
>> Big +1 for this feature; I only have a few doubts.
>>
>> * Regarding the tasks/jobs that management servers (MSs) execute: do these
>> tasks originate from requests that come to the MS, or is it possible for
>> requests received by one management server to be executed by another? I
>> mean, if I execute a request against MS1, will this request always be
>> executed/treated by MS1, or is it possible that this request is executed
>> by another MS (e.g. MS2)?
>>
>> * I would suggest that after we block traffic coming from 8080/8443/8250
>> (we will need to block this as well, right?), we log the execution of
>> tasks. I mean, something saying: there are XXX tasks (enumerate tasks)
>> still being executed, we will wait for them to finish before shutting down.
>>
>> * The timeout (60 minutes suggested) could be a global setting that we can
>> load before executing the graceful-shutdown.
>>
>> On Wed, Apr 4, 2018 at 5:15 PM, ilya musayev <ilya.mailing.lists@xxxxxxxxx>
>> wrote:
>>
>> Use case:
>> In any environment, from time to time, an administrator needs to perform
>> maintenance. The current stop sequence of the cloudstack management server
>> will ignore the fact that there may be long-running async jobs and
>> terminate the process. This in turn can create a poor user experience and
>> occasional inconsistency in the cloudstack db.
>>
>> This is especially painful in large environments where the user has
>> thousands of nodes and there is continuous patching that happens around
>> the clock, which requires migration of workload from one node to another.
>>
>> With that said - I've created a script that monitors the async job queue
>> for a given MS and waits for it to complete all jobs. More details are
>> posted below.
>>
>> I'd like to introduce "graceful-shutdown" into the systemctl/service of
>> cloudstack-management service.
>>
>> The details of how it will work are below:
>>
>> Workflow for graceful shutdown:
>> Using iptables/firewalld - block any connection attempts on 8080/8443
>> (we can identify the ports dynamically)
>> Identify the MSID for the node; using the proper msid, query the
>> async_job table for
>> 1) any jobs that are still running (or job_status="0")
>> 2) job_dispatcher not like "pseudoJobDispatcher"
>> 3) job_init_msid=$my_ms_id
>>
>> Monitor this async_job table for 60 minutes - until all async jobs for
>> the MSID are done, then proceed with shutdown
>>   If failed for any reason or terminated, catch the exit via the trap
>> command and unblock 8080/8443
>>
>> Comments are welcome
>>
>> Regards,
>> ilya
>>
>>
>>
>>
>> --
>> Rafael Weingärtner
>>
>
>