More upgrade issues with PCPUs - input wanted

>>     The steps I'm thinking of are:
>>     1. Upgrade the control plane with PCPU requests disabled, so
>>     requests still ask for VCPU.
>>     2. Rolling upgrade of the compute nodes. Compute nodes begin to
>>     report both PCPU and VCPU, but requests still go to VCPU.
>>     3. Enable PCPU requests, so new requests ask for PCPU.
>>            At this point, some instances are using VCPU and some
>>     instances are using PCPU on the same node, and the amount of
>>     VCPU + PCPU will be double the available CPU resources. The
>>     NUMATopology filter is responsible for preventing over-consumption
>>     of the total number of CPUs.
>>     4. Rolling update of the compute nodes' configuration to use
>>     cpu_dedicated_set, which triggers a reshape of the existing VCPU
>>     consumption to PCPU consumption.
>>          New requests have been going to PCPU since step 3, so there
>>     are no more VCPU requests at this point. Roll through the nodes
>>     to get rid of the existing VCPU consumption.
>>     5. Done.
>     This had been my initial plan. The issue is that by reporting both
>     PCPU and VCPU in (2), our compute node's resource provider will now
>     have PCPU inventory available (though it won't be used). This is
>     problematic since "does this resource provider have PCPU inventory"
>     is one of the questions I need to ask to determine if I should do a
>     reshape. If I can't rely on this heuristic, I need to start querying
>     for allocation information (so I can ask "does this resource
>     provider have PCPU *allocations*") every time I start a compute
>     node. I'm guessing this is expensive, since we don't do it by default.

We already do it as part of update_available_resource via
_remove_deleted_instances_allocations (there we're only checking the
compute node RP, but in the future we'll have to do it for the whole
tree anyway).

We restricted it to the reshape path in _update_to_placement because
it's not free and it was possible to make the flow work in the general
case without it.

We can still avoid it in the general case by only doing it when startup
is True.
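
A minimal sketch of that gating, to make the cost argument concrete. Every
name here is hypothetical (this is not nova's actual code), and the
allocation dict shape (consumer -> {"resources": {...}}) is an assumption
made for illustration. The point is only that the non-free lookup runs on
startup and is skipped otherwise.

```python
def pcpu_allocated_on_startup(startup, fetch_allocations):
    """Return True/False on startup, or None when the check is skipped.

    fetch_allocations is a hypothetical callable standing in for the
    (not free) allocation lookup; it is only invoked when startup is True.
    """
    if not startup:
        # General case: avoid the extra query entirely.
        return None
    allocations = fetch_allocations()
    return any(
        alloc.get("resources", {}).get("PCPU", 0) > 0
        for alloc in allocations.values()
    )


# Stand-in for the real lookup, counting how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return {
        "instance-a": {"resources": {"VCPU": 2}},
        "instance-b": {"resources": {"PCPU": 4}},
    }


periodic = pcpu_allocated_on_startup(False, fake_fetch)
at_startup = pcpu_allocated_on_startup(True, fake_fetch)
```

Here the periodic call never touches fake_fetch, and the startup call
queries it exactly once.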

So if you can solve the problem (which I'm still wrapping my brain
around) by looking at the allocations, let's do that. Because...

> I'm not quite sure I understand the problem. How about the question
> you ask being "Is the current total of VCPU and PCPU double the
> actual available CPU resources?" If the answer is yes, then do a
> reshape.

Alex's suggestion makes sense to me, but it's a bit of a hack, and the
math might break down if you e.g. stop compute, twiddle your cpu_*_set
options, and restart.
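
To illustrate, here is a rough sketch of the "double" check and one way it
could miss. The function name and arguments are hypothetical, and it assumes
the inventory totals are raw CPU counts (no allocation ratio applied) and
that host_cpus is whatever the compute service derives from its current
config.

```python
def looks_double_reported(inventory, host_cpus):
    """Hypothetical heuristic: is VCPU + PCPU twice the usable CPU count?"""
    total = inventory.get("VCPU", 0) + inventory.get("PCPU", 0)
    return total == 2 * host_cpus


# Inventory as reported after step 2 on a host with 8 usable CPUs.
reported = {"VCPU": 8, "PCPU": 8}

# Unchanged config: the heuristic fires.
before = looks_double_reported(reported, host_cpus=8)

# Assumed scenario: the operator stops compute and narrows the cpu sets to
# 6 CPUs before restarting. The stored inventory still says 8 + 8, so the
# comparison against 2 * 6 no longer matches.
after = looks_double_reported(reported, host_cpus=6)
```

In the second case the node is still double-reported, but the check no
longer says so.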