
[nova][scheduler] scheduler spawns to the same compute node only


Hi Matt,


Thanks for your answers. Please find mine below.

On 15.04.19 19:04, Matt Riedemann wrote:
> On 4/15/2019 10:36 AM, Nicolas Ghirlanda wrote:
>> New VMs  are just currently always scheduled to the same compute 
>> node, even though a manual live-migration is working fine to other 
>> compute nodes.
>
> How are you doing the live migration? If you're using the openstack 
> command line and defaulting to the 2.1 compute API microversion, 
> you're forcing the server to another host by bypassing the scheduler 
> which is maybe why live migration is "working" but server create is 
> not ever using the other computes.
>
Sounds reasonable, and yes, I used nova live-migration and specified the 
target machine.
When I used "openstack server migrate --live", it seemed that all VMs 
were transferred to one specific other compute node (but I need to 
confirm that).


>>
>>
>> We're not sure, what the issue is, but perhaps someone may spot it 
>> from our config:
>>
>>
>> # nova.conf  scheduler config
>>
>> default_availability_zone = az1
>
> How many computes are in az1? All 8?

Yes, all 8, in 2 hostgroups.


>
>>
>> ...
>>
>> [filter_scheduler]
>> available_filters = nova.scheduler.filters.all_filters
>> enabled_filters = RetryFilter, AvailabilityZoneFilter, 
>> ComputeCapabilitiesFilter, ImagePropertiesFilter, 
>> ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, 
>> AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation, 
>> DifferentHostFilter, RamFilter, SameHostFilter, NUMATopologyFilter
>>
>
> Not really related to this probably but you can remove RamFilter since 
> placement does the MEMORY_MB filtering and the RamFilter was 
> deprecated in Stein as a result.
>
> It looks like you're getting the default host_subset_size value:
>
> https://docs.openstack.org/nova/queens/configuration/config.html#filter_scheduler.host_subset_size 
>
>
> Which means your scheduler is "packing" by default. If you have 
> multiple computes and you want to spread instances across them, you 
> can adjust the host_subset_size value.


Thanks, I will try.
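
To make the host_subset_size behaviour concrete, here is a minimal 
Python sketch of the selection step as the linked documentation 
describes it: the scheduler picks randomly among the N best-weighed 
hosts, where N is host_subset_size. This is not nova's actual code; 
select_host and the (host, weight) pairs are hypothetical names used 
only for illustration.

```python
import random


def select_host(weighed_hosts, host_subset_size=1, rng=random):
    """Pick one host from the top `host_subset_size` weighed hosts.

    `weighed_hosts` is a list of (host_name, weight) pairs. This helper
    is hypothetical and only mirrors the documented option behaviour.
    """
    if not weighed_hosts:
        raise ValueError("no hosts passed the filters")
    # Best weight first.
    ranked = sorted(weighed_hosts, key=lambda hw: hw[1], reverse=True)
    # Clamp the subset size to the range 1..len(ranked).
    size = max(1, min(host_subset_size, len(ranked)))
    # Random choice among the top `size` candidates.
    return rng.choice(ranked[:size])[0]


# Made-up weights for three of the compute nodes.
hosts = [("compute1", 9.0), ("compute2", 8.5), ("compute3", 8.0)]

print(select_host(hosts))                      # always compute1
print(select_host(hosts, host_subset_size=3))  # any of the three
```

With a subset size of 1 the best-weighed host always wins, so 
identical requests keep landing on the same node for as long as it 
stays on top of the ranking. A larger value spreads them over the top 
candidates.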

>
>>
>>
>> Database is an external Percona XtraDB Cluster (Version 5.7.24) with 
>> haproxy for read-write-splitting (currently only one write node).
>>
>> We do see mysql errors in the nova-scheduler.log on the write DB node 
>> when an instance is created.
>>
>>
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db [-] 
>> Unexpected error while reporting service status: OperationalError: 
>> (pymysql.err.OperationalError) (1213, u'WSREP detected 
>> deadlock/conflict and aborted the transaction. Try restarting the 
>> transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> Traceback (most recent call last):
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", 
>> line 91, in _report_state
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> service.service_ref.save()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", 
>> line 226, in wrapper
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
>> fn(self, *args, **kwargs)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/objects/service.py", 
>> line 397, in save
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> db_service = db.service_update(self._context, self.id, updates)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/api.py", 
>> line 183, in service_update
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
>> IMPL.service_update(context, service_id, values)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", 
>> line 154, in wrapper
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> ectxt.value = e.inner_exc
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
>> line 220, in __exit__
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.force_reraise()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
>> line 196, in force_reraise
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> six.reraise(self.type_, self.value, self.tb)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", 
>> line 142, in wrapper
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
>> f(*args, **kwargs)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", 
>> line 227, in wrapped
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
>> f(context, *args, **kwargs)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.gen.next()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
>> line 1043, in _transaction_scope
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db yield 
>> resource
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.gen.next()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
>> line 653, in _session
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.session.rollback()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
>> line 220, in __exit__
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.force_reraise()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
>> line 196, in force_reraise
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> six.reraise(self.type_, self.value, self.tb)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
>> line 650, in _session
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self._end_session_transaction(self.session)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
>> line 678, in _end_session_transaction
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> session.commit()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", 
>> line 943, in commit
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.transaction.commit()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", 
>> line 471, in commit
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> t[1].commit()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
>> line 1643, in commit
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self._do_commit()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
>> line 1674, in _do_commit
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.connection._commit_impl()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
>> line 726, in _commit_impl
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self._handle_dbapi_exception(e, None, None, None, None)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
>> line 1409, in _handle_dbapi_exception
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> util.raise_from_cause(newraise, exc_info)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", 
>> line 265, in raise_from_cause
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> reraise(type(exception), exception, tb=exc_tb, cause=cause)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
>> line 724, in _commit_impl
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self.engine.dialect.do_commit(self.connection)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", 
>> line 1765, in do_commit
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> dbapi_connection.commit()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", 
>> line 422, in commit
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> self._read_ok_packet()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", 
>> line 396, in _read_ok_packet
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db pkt = 
>> self._read_packet()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", 
>> line 683, in _read_packet
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> packet.check_error()
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/protocol.py", 
>> line 220, in check_error
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> err.raise_mysql_exception(self._data)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
>> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/err.py", 
>> line 109, in raise_mysql_exception
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db raise 
>> errorclass(errno, errval)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
>> OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP 
>> detected deadlock/conflict and aborted the transaction. Try 
>> restarting the transaction') (Background on this error at: 
>> http://sqlalche.me/e/e3q8)
>> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db
>> 2019-04-15 16:52:20.020 24 INFO nova.servicegroup.drivers.db [-] 
>> Recovered from being unable to report status.
>
> This is a service update operation which could indicate that the other 
> computes are reported as 'down' and that's why nothing is getting 
> scheduled to them. Have you checked the "openstack compute service 
> list" output to make sure those computes are all reporting as "up"?


Yes, all compute nodes are reported as "up" in the "openstack compute 
service list" output.

>
> https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/compute-service.html#compute-service-list 
>
>
> There is a retry_on_deadlock decorator on that service_update DB API 
> though so I'm kind of surprised to still see the deadlock errors, 
> unless those just get logged while retrying?
>
> https://github.com/openstack/nova/blob/stable/queens/nova/db/sqlalchemy/api.py#L566 
>

Yep, it's pretty unclear why this is happening. Our cloud is not used 
that much, so it was very likely the only instance spawned in that 
timeframe, and as we have a single writer node in the Percona cluster, I 
can't imagine why there should be any deadlock situation occurring.
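
For context, a retry-on-deadlock decorator generally looks like the 
sketch below. It is a simplified, hypothetical version and not the 
oslo.db implementation that nova uses: DeadlockError, max_retries and 
delay are illustrative names, and the fake service_update only 
simulates two failed commits before succeeding.

```python
import functools
import time


class DeadlockError(Exception):
    """Stand-in for the database deadlock exception (hypothetical)."""


def retry_on_deadlock(max_retries=5, delay=0.0):
    """Retry the wrapped function when it raises DeadlockError."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except DeadlockError:
                    # Give up and re-raise once the retries are used up.
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on_deadlock(max_retries=3)
def service_update():
    """Fake DB call: fails twice with a deadlock, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise DeadlockError("WSREP detected deadlock/conflict")
    return "updated"


result = service_update()
print(result, "after", calls["count"], "attempts")
```

In a wrapper like this the caller only sees the exception once the 
retries are exhausted, so whether intermediate attempts appear in the 
log depends on whether the wrapper itself logs them.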


>
>>
>>
>> The deadlock message is quite strange, as we have haproxy configured 
>> so all write requests are handled by one node.
>>
>>
>> There are NO errors in the mysqld.log WHILE creating an instance, but 
>> we see from time to time aborted connections from nova.
>>
>> 2019-04-15T14:22:36.232108Z 30616972 [Note] Aborted connection 
>> 30616972 to db: 'nova' user: 'nova' host: '10.x.y.z' (Got an error 
>> reading communication packets)
>>
>>
>>
>> As I said, all instances are allocated to the same compute node. 
>> nova-compute.log doesn't show an error while creating the instance.
>>
>>
>> Beside that, we also see messages from nova.scheduler.host_manager on 
>> all other nodes like (but those messages are _not_ triggered, when an 
>> instance is spawned.!)
>>
>>
>> 2019-04-15 16:28:47.771 22 INFO nova.scheduler.host_manager 
>> [req-f92e340e-a88a-44a0-8cad-588390c25bc2 - - - - -] The instance 
>> sync for host 'xxx' did not match. Re-created its InstanceList.
>
> Are there any instances on these other hosts? My guess is you're 
> seeing that after the live migration to another host.

That may be true, as I manually moved lots of VMs around at about that 
time. Thanks for the explanation.


>
>>
>>
>>
>> Don't know if that may be relevant, but somehow our (currently 
>> single) AZ is listed several times.
>>
>>
>> # openstack availability zone list
>> +------------+-------------+
>> | Zone Name  | Zone Status |
>> +------------+-------------+
>> | internal   | available   |
>> | az1        | available   |
>> | az1        | available   |
>> | az1        | available   |
>> | az1        | available   |
>> +------------+-------------+
>>
>> May that be related somehow?
>
> I believe those are the AZs for other services as well 
> (cinder/neutron). Specify the --compute option to filter that.

You're right, when I specify --compute there is only one AZ shown. Again, 
thanks for the clarification! :-)

>
> -- 
>
> Another thing to check is placement - are there 8 compute node 
> resource providers reporting into placement? You can check using the CLI:
>
> https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-list 
>
>
> In Queens, there should be one resource provider per working compute 
> node in the cell database's compute_nodes table (the UUIDs should 
> match as well).

I do not have the "openstack resource provider" command, but in the 
"openstack hypervisor list" output I can see all compute nodes with 
state "up".
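
The "openstack resource provider" commands come from the osc-placement 
plugin linked above. Once both UUID lists are available, the comparison 
Matt describes could be done with a small Python sketch like the one 
below. The function name and the UUIDs are made up for illustration; 
how the two lists are collected is left open.

```python
def compare_uuids(compute_node_uuids, resource_provider_uuids):
    """Report UUIDs present on only one side (hypothetical helper)."""
    compute = set(compute_node_uuids)
    providers = set(resource_provider_uuids)
    return {
        # Compute nodes that never registered in placement.
        "missing_in_placement": sorted(compute - providers),
        # Resource providers without a matching compute node record.
        "unknown_to_nova": sorted(providers - compute),
    }


# Made-up example data, not taken from the deployment in this thread.
compute_nodes = ["uuid-a", "uuid-b", "uuid-c"]
resource_providers = ["uuid-a", "uuid-c", "uuid-x"]

report = compare_uuids(compute_nodes, resource_providers)
print(report)
```

Two empty lists in the result would mean every compute node has a 
matching resource provider.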


-- 


EveryWare AG
Nicolas Ghirlanda
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich

T  +41 44 466 60 00
F  +41 44 466 60 10

nicolas.ghirlanda at everyware.ch
www.everyware.ch