[ops] [nova] [placement] Mismatch between allocations and instances


On 7/5/2019 1:45 AM, Massimo Sgaravatto wrote:
> I tried to check the allocations on each compute node of an Ocata cloud, 
> using the command:
> 
> curl -s ${PLACEMENT_ENDPOINT}/resource_providers/${UUID}/allocations -H 
> "x-auth-token: $TOKEN"  | python -m json.tool
>

Just FYI you can use osc-placement (openstack client plugin) for command 
line:

https://docs.openstack.org/osc-placement/latest/index.html

> I found that, on a few compute nodes, there are some instances for which 
> there is not a corresponding allocation.

The heal_allocations command [1] might be able to find and fix these up 
for you. The bad news for you is that heal_allocations wasn't added 
until Rocky and you're on Ocata. The good news is you should be able to 
take the current version of the code from master (or stein) and run that 
in a container or virtual environment against your Ocata cloud (this 
would be particularly useful if you want to use the --dry-run or 
--instance options added in Train). You could also potentially backport 
those changes to your internal branch, or we could start a discussion 
upstream about backporting that tooling to stable branches - though 
going to Ocata might be a bit much at this point given Ocata and Pike 
are in extended maintenance mode [2].

As for *why* the instances on those nodes are missing allocations, it's 
hard to say without debugging things. The allocation and resource 
tracking code has changed quite a bit since Ocata (in Pike the scheduler 
started creating the allocations but the resource tracker in the compute 
service could still overwrite those allocations if you had older nodes 
during a rolling upgrade). My guess would be a migration failed or there 
was just a bug in Ocata where we didn't clean up or allocate properly. 
Again, heal_allocations should add the missing allocation for you if you 
can set up the environment to run that command.
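
In the meantime, a rough way to see which instances on a node lack an 
allocation is to diff the consumers in the placement response against 
the instance UUIDs nova reports for that host. The sketch below is only 
an illustration: the function name and the sample IDs are made up, and 
it assumes the response body has the usual top-level "allocations" dict 
keyed by consumer UUID. How you collect the instance UUIDs per host is 
left to you.

```python
import json


def diff_allocations(placement_body, instance_uuids):
    """Compare allocation consumers on one provider with a host's instances.

    placement_body: JSON text returned by
        GET /resource_providers/{uuid}/allocations
    instance_uuids: UUIDs of the instances nova reports on that host
        (gathered separately; not done here).
    """
    consumers = set(json.loads(placement_body).get("allocations", {}))
    instances = set(instance_uuids)
    return {
        # Instances with no allocation against this provider.
        "missing_allocation": sorted(instances - consumers),
        # Consumers that are not instances on this host; these may be
        # migration records or leaked allocations and need a closer look.
        "unknown_consumer": sorted(consumers - instances),
    }


# Made-up sample data, shaped like a placement response.
sample_body = json.dumps({
    "allocations": {
        "inst-a": {"resources": {"VCPU": 2, "MEMORY_MB": 4096}},
        "inst-z": {"resources": {"VCPU": 1, "MEMORY_MB": 2048}},
    },
    "resource_provider_generation": 7,
})

report = diff_allocations(sample_body, ["inst-a", "inst-b"])
```

Treat the result as a list of things to investigate, not as a list of 
things to delete or recreate.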

> 
> On another Rocky cloud, we had the opposite problem: there were 
> allocations also for some instances that didn't exist anymore.
> And this caused problems since we were not able to use all the resources 
> of the relevant compute nodes: we had to manually remove the "wrong" 
> allocations to fix the problem ...

Yup, this could happen for different reasons, usually all due to known 
bugs for which you don't have the fix yet, e.g. [3][4], or something is 
failing during a migration and we aren't cleaning up properly (an 
unreported/not-yet-fixed bug).

> 
> 
> I wonder why/how this problem can happen ...

I mentioned some possibilities above - but I'm sure there are other bugs 
that have been fixed which I've omitted here, or things that aren't 
fixed yet, especially in failure scenarios (rollback/cleanup handling is 
hard).

Note that your Ocata and Rocky cases could be different. Since Queens 
(once all compute nodes are >=Queens), the migration record in nova 
holds the source node allocations during resize, cold migration and live 
migration. That means the actual *consumer* of the allocations for a 
provider in placement might not be an instance (server) record but a 
migration. So if you look up an allocation consumer by ID in nova with 
something like "openstack server show $consumer_id", it might return 
NotFound because the consumer is a migration record whose allocation 
was leaked.
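
To make that concrete, here is a small sketch of how you might sort 
consumer IDs before deciding anything is orphaned. It is not nova code: 
the function and the sample IDs are hypothetical, and it assumes you 
have already gathered the instance UUIDs and migration record UUIDs 
from nova by whatever means you prefer.

```python
def classify_consumers(consumer_ids, instance_ids, migration_ids):
    """Label each allocation consumer as instance, migration or unknown."""
    instance_ids = set(instance_ids)
    migration_ids = set(migration_ids)
    kinds = {}
    for cid in consumer_ids:
        if cid in instance_ids:
            kinds[cid] = "instance"
        elif cid in migration_ids:
            # Held by a migration record; check whether that migration
            # is still in progress before treating it as leaked.
            kinds[cid] = "migration"
        else:
            # Matches neither list; a candidate for a leaked allocation.
            kinds[cid] = "unknown"
    return kinds


# Made-up IDs for illustration only.
kinds = classify_consumers(
    ["inst-a", "mig-1", "gone-9"],
    instance_ids=["inst-a", "inst-b"],
    migration_ids=["mig-1"],
)
```

A consumer that lands in "migration" is expected while the migration is 
running and only looks suspicious once the migration has finished or 
failed.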

> 
> And how can we fix the issue ? Should we manually add the missing 
> allocations / manually remove the wrong ones ?

Coincidentally a thread related to this [5] re-surfaced a couple of 
weeks ago. I am not sure what Sylvain's progress is on that audit tool, 
but the linked bug in that email has some other operator scripts you 
could try for the case that there are leaked/orphaned allocations on 
compute nodes that no longer have instances.

> 
> Thanks, Massimo
> 
> 

[1] https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement
[2] https://docs.openstack.org/project-team-guide/stable-branches.html
[3] https://bugs.launchpad.net/nova/+bug/1825537
[4] https://bugs.launchpad.net/nova/+bug/1821594
[5] http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html

-- 

Thanks,

Matt