[nova][ops] The need for healing instance info cache to base itself on neutron for its port list

On Tue, 2020-10-06 at 14:35 -0400, Jean-Philippe Méthot wrote:
> Hi,
> This is related to bug https://bugs.launchpad.net/nova/+bug/1751923 . I
> don't see if this was fixed in more recent versions as we are running Rocky, but according to the different code
> reviews linked to the bug report, this was never committed into Openstack master. I apologize in advance if this was
> already fixed elsewhere (it's marked as fixed in Stein, but the reviews say the code was never committed?).
This was committed in https://review.opendev.org/#/c/591607/ and was first released in Stein.
It was not backported upstream because https://review.opendev.org/#/c/614167/20 has a bug.
However, we backported just https://review.opendev.org/#/c/591607/ downstream in Red Hat OSP, all the way back to
Newton, and it works fine. So for Red Hat OSP at least this is fixed: we backported only the force refresh, not the
online DB migration in https://review.opendev.org/#/c/614167/20, which tries to populate the virtual interface table.
> Essentially, we're running into a production issue where sometimes, after being shutdown for a while, our VMs ports
> just straight up disappear from Nova. Obviously, since this is production, we have to scramble to link back the port
> to the VM to bring the VM back up. As a result, we have not identified yet the exact source of our issue. However, we
> do have tested Mohammed Naser's patch linked to this issue and it has at the very least offered us a band-aid since
> the VMs appear to be keeping their ports now.
> Would it be possible to review and commit this patch or Matt Riedeman's patch to master and backport it?
We did not backport it because of the DB migration bug, but it is fixed upstream from Stein onward.
Given that we have had no issues backporting https://review.opendev.org/#/c/591607/ without
https://review.opendev.org/#/c/614167/20 downstream, I think it would be reasonable to do the same upstream.
> Couldn't it just have a configuration option to enable it? While I'm not convinced it can fix the root cause of our
> problem, it could at least contribute to the stability of our and other people's Openstack cluster.
So this is a subtle thing. It's not really a nova bug. It's an issue where invalid data is returned by neutron, and
that corrupts the nova database. The force refresh will heal nova if and only if the neutron issue that caused the
problem in the first place is resolved. If the neutron issue is not fixed, the force refresh will continue to
overwrite the nova networking info cache with incomplete data.

So if you never have a neutron issue that returns invalid data, you will never need this patch.
If you do, say because you broke the neutron policy file, then this backport will fix the nova database only
once the policy issue is corrected. We have had several large customers who had issues with neutron, due either to
misconfiguring the policy file or to a third-party SDN controller that maintained port information in an external DB
separate from neutron. In the case of the policy file customer, this self-healing worked once they corrected the issue.
In the case of the SDN controller customer, it did not until the SDN vendor fixed the SDN controller's DB. Once it
returned correct data again, the periodic task healed nova.
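
To make the "if and only if" point concrete, here is a toy model of the behaviour described above. It is only a
sketch, not nova's implementation: force_refresh, broken_neutron, healthy_neutron and the port names are all
hypothetical and stand in for the periodic task and for whatever neutron returns.

```python
# Illustrative model only: every name here is hypothetical, not a nova or neutron API.

def force_refresh(info_cache, instance_id, list_ports):
    """Rebuild one instance's cached port list from whatever neutron reports."""
    ports = list_ports(instance_id)
    info_cache[instance_id] = sorted(ports)
    return info_cache[instance_id]


def broken_neutron(instance_id):
    # Stand-in for neutron returning incomplete data,
    # e.g. because a misconfigured policy file hides the ports.
    return []


def healthy_neutron(instance_id):
    # Stand-in for neutron returning the correct port list again.
    return ["port-a", "port-b"]


info_cache = {"vm-1": ["port-a", "port-b"]}

# While neutron returns incomplete data, each refresh rewrites the cache with it.
during_outage = force_refresh(info_cache, "vm-1", broken_neutron)

# Once the neutron side is fixed, the next periodic run heals the cache.
after_fix = force_refresh(info_cache, "vm-1", healthy_neutron)

print(during_outage, after_fix)
```

In this model the refresh is only ever as good as the data it is given, which is why the patch heals nova only
after the neutron-side problem has been corrected.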

> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> 4414-4416 Louis B Mayer
> Laval, QC, H7P 0G1, Canada
> TEL : +1.514.802.1644 - Poste : 2644
> FAX : +1.514.612.0678
> CA/US : 1.855.774.4678
> FR : 01 76 60 41 43
> UK : 0808 189 0423