Re: [DISCUSS] VR upgrade downtime reduction
Yes, nice work!
From: Daan Hoogland <daan.hoogland@xxxxxxxxx>
Sent: Tuesday, May 1, 2018 5:28 AM
Subject: Re: [DISCUSS] VR upgrade downtime reduction
good work Rohit,
I'll review 2508 https://github.com/apache/cloudstack/pull/2508
On Tue, May 1, 2018 at 12:08 PM, Rohit Yadav <rohit.yadav@xxxxxxxxxxxxx> wrote:
> A short-term solution to VR upgrade or network restart (with cleanup=true)
> has been implemented:
> - The strategy for redundant VRs builds on top of Wei's original patch
> where backup routers are removed and replaced on a rolling basis. The
> downtime I saw was usually 0-2 seconds, and the theoretical downtime ranges
> from 0 to 3*advertisement interval + skew seconds, i.e. roughly 0-10 seconds
> (with CloudStack's default 1s advertisement interval).
> - For non-redundant routers, I've implemented a strategy where first a new
> VR is deployed, then the old VR is powered off/destroyed, and the new VR is
> re-programmed once more. With this strategy, two identical VRs may be up for
> a brief moment (a few seconds) in which both can serve traffic; however, the
> new VR performs an arp-ping on its interfaces to update neighbours. After
> the old VR is removed, the new VR is re-programmed, which among other things
> performs another arp-ping. The theoretical downtime is therefore limited by
> the arp-cache refresh, which can take up to 30 seconds. In my experiments
> against various VMware, KVM and XenServer versions, I found that the
> downtime was indeed less than 30s, usually between 5-20 seconds. Compared to
> older ACS versions, especially in cases where VR deployment requires a full
> volume copy (as in VMware), a 10x-12x improvement was seen.
> Please review and test the following PRs, which have test details,
> benchmarks, and some screenshots:
> Future work can be driven towards making all VRs redundant enabled by
> default that can allow for a firewall+connections state transfer
> (conntrackd + VRRP2/3 based) during rolling reboots.
> - Rohit
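The two downtime bounds Rohit quotes above can be sketched numerically. The redundant-VR bound follows the VRRP master-down interval from RFC 3768 (3 x advertisement interval + skew time, where skew = (256 - priority)/256 seconds); the priority value below is an assumed illustrative number, not CloudStack's actual setting.

```python
# Worst-case VRRP failover estimate, per RFC 3768:
#   master_down_interval = 3 * advertisement_interval + skew_time
#   skew_time = (256 - priority) / 256 seconds

def vrrp_master_down_interval(adv_interval_s: float, priority: int) -> float:
    """Seconds a backup waits before taking over after the master goes silent."""
    skew = (256 - priority) / 256.0
    return 3 * adv_interval_s + skew

# CloudStack's default advertisement interval is 1s; priority 100 is an
# assumed illustrative value.
bound = vrrp_master_down_interval(1.0, 100)
print(f"redundant VR worst-case failover: {bound:.2f}s")

# For the non-redundant strategy, downtime is instead bounded by the
# neighbours' arp-cache refresh (up to ~30s), consistent with the observed
# 5-20 seconds reported above.
ARP_CACHE_LIMIT_S = 30
```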
> From: Daan Hoogland <daan.hoogland@xxxxxxxxx>
> Sent: Thursday, February 8, 2018 3:11:51 PM
> To: dev
> Subject: Re: [DISCUSS] VR upgrade downtime reduction
> to stop the vote and continue the discussion. I personally want unification
> of all router vms: VR, 'shared network', rVR, VPC, rVPC, and eventually the
> one we want to create for 'enterprise topology hand-off points'. And I
> think we have some level of consensus on that but the path there is a
> concern for Wido and for some of my colleagues as well, and rightly so. One
> issue is upgrades from older versions.
> I see the common scenario as follows:
> + redundancy is deprecated and only the number of instances remains.
> + an old VR is replicated in memory by a redundancy-enabled version, which
> will be in a state of running but inactive.
> - the old one will be destroyed while a ping is running
> - as soon as the ping fails more than three times in a row (this might have
> to have a hypervisor-specific implementation or require a helper vm)
> + the new one is activated
> after this upgrade Wei's and/or Remi's code will do the work for any
> following upgrade.
> flames, please
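The cut-over step Daan sketches above (destroy the old VR while a ping runs, activate the standby after three consecutive ping failures) could look roughly like this; `ping_once` and `activate_new_vr` are hypothetical stand-ins, since, as noted, the real check may be hypervisor-specific or need a helper VM.

```python
# Sketch of the cut-over step: watch a ping towards the old VR and activate
# the inactive replica as soon as three pings in a row fail.
# `ping_once` and `activate_new_vr` are hypothetical callables supplied by
# the caller; nothing here is CloudStack's actual API.

def monitor_and_cutover(ping_once, activate_new_vr, max_failures=3):
    failures = 0
    while failures < max_failures:
        # Reset the streak on any successful reply; count misses otherwise.
        failures = failures + 1 if not ping_once() else 0
    activate_new_vr()  # old VR is unreachable; bring the replica up
    return failures

# Example with a scripted reply sequence (True = reply, False = timeout):
replies = iter([True, False, True, False, False, False])
events = []
monitor_and_cutover(lambda: next(replies), lambda: events.append("activated"))
```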
> On Wed, Feb 7, 2018 at 12:17 PM, Nux! <nux@xxxxxxxxx> wrote:
> > +1 too
> > --
> > Sent from the Delta quadrant using Borg technology!
> > Nux!
> > www.nux.ro
> 53 Chandos Place, Covent Garden, London WC2N 4HS, UK
> ----- Original Message -----
> > > From: "Rene Moser" <mail@xxxxxxxxxxxxx>
> > > To: "dev" <dev@xxxxxxxxxxxxxxxxxxxxx>
> > > Sent: Wednesday, 7 February, 2018 10:11:45
> > > Subject: Re: [DISCUSS] VR upgrade downtime reduction
> > > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> > >> Hi Daan,
> > >>
> > >> In my opinion the biggest issue is the fact that there are a lot of
> > >> different code paths: VPC versus non-VPC, VPC versus redundant-VPC,
> > >> etc. That's why you cannot simply switch from a single VPC to a
> > >> redundant VPC, for example.
> > >>
> > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> > >> VPC with a single tier and made sure all features are supported. Next
> > >> we merged the single and redundant VPC code paths. The idea here is
> > >> that redundancy or not should only be a difference in the number of
> > >> routers. Code should be the same. A single router is also "master",
> > >> but there just is no "backup".
> > >>
> > >> That simplifies things A LOT, as keepalived is now the master of the
> > >> whole thing. No more assigning IP addresses in Python; leave that to
> > >> keepalived instead. Lots of code deleted. Easier to maintain, way more
> > >> stable. We just released Cosmic 6, which has this feature, and are now
> > >> rolling it out in production. Looking good so far. This change unlocks
> > >> a lot of possibilities, like live upgrading from a single VPC to a
> > >> redundant one (and back). In the end, if the redundant VPC is rock
> > >> solid, you most likely don't even want single VPCs any more. But that
> > >> will come.
> > >>
> > >> As I said, we're rolling this out as we speak. In a few weeks, when
> > >> everything is upgraded, I can share what we learned and how well it
> > >> works. CloudStack could use a similar approach.
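The unified model Remi describes (redundancy is only a difference in the number of routers, with keepalived owning the IP addresses) can be illustrated with a minimal keepalived configuration; the instance name, interface, VRID, priority and address below are illustrative assumptions, not Cosmic's actual config.

```
# Minimal keepalived sketch of the unified model: redundancy is only a
# matter of how many routers run this config. A lone router simply wins
# the VRRP election and holds the VIP; adding a second router with a lower
# priority yields the redundant case with no code change.
vrrp_instance inside_net {
    state BACKUP            # let the election decide; no hard-coded master
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.1.1.1/24 dev eth0
    }
}
```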
> > >
> > > +1 Pretty much this.
> > >
> > > René