
[nova][scheduler] scheduler spawns to the same compute node only


On 4/15/2019 10:36 AM, Nicolas Ghirlanda wrote:
> New VMs are currently always scheduled to the same compute node, even 
> though a manual live migration to other compute nodes works fine.

How are you doing the live migration? If you're using the openstack 
command line and defaulting to the 2.1 compute API microversion, you're 
forcing the server to another host and bypassing the scheduler, which 
may be why live migration is "working" but server create never uses the 
other computes.
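For reference, the difference shows up directly in the CLI (illustrative 
command forms, assuming Queens-era openstackclient behavior; 
<target-host> and <server> are placeholders):

```shell
# With the default 2.1 microversion, naming a host *forces* it,
# bypassing the scheduler (and its filters) entirely:
openstack server migrate --live <target-host> <server>

# From microversion 2.30 on, a requested host is validated by the
# scheduler instead of being forced:
openstack --os-compute-api-version 2.30 server migrate --live <target-host> <server>
```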

> 
> 
> We're not sure, what the issue is, but perhaps someone may spot it from 
> our config:
> 
> 
> # nova.conf  scheduler config
> 
> default_availability_zone = az1

How many computes are in az1? All 8?

> 
> ...
> 
> [filter_scheduler]
> available_filters = nova.scheduler.filters.all_filters
> enabled_filters = RetryFilter, AvailabilityZoneFilter, 
> ComputeCapabilitiesFilter, ImagePropertiesFilter, 
> ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, 
> AggregateInstanceExtraSpecsFilter, AggregateMultiTenancyIsolation, 
> DifferentHostFilter, RamFilter, SameHostFilter, NUMATopologyFilter
> 

Probably not related to this, but you can remove RamFilter: placement 
does the MEMORY_MB filtering now, and the RamFilter was deprecated in 
Stein as a result.

It looks like you're getting the default host_subset_size value:

https://docs.openstack.org/nova/queens/configuration/config.html#filter_scheduler.host_subset_size

That means your scheduler is "packing" by default. If you have multiple 
computes and you want to spread instances across them, increase the 
host_subset_size value.
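Something like this in nova.conf on the scheduler nodes (the value 3 is 
just an example) makes the scheduler pick randomly among the best N 
hosts rather than always taking the top-weighed one:

```ini
[filter_scheduler]
# Consider the 3 best weighed hosts and choose one at random,
# which spreads new instances instead of packing them onto one node.
host_subset_size = 3
```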

> 
> 
> Database is an external Percona XtraDB Cluster (Version 5.7.24) with 
> haproxy for read-write-splitting (currently only one write node).
> 
> We do see mysql errors in the nova-scheduler.log on the write DB node 
> when an instance is created.
> 
> 
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db [-] 
> Unexpected error while reporting service status: OperationalError: 
> (pymysql.err.OperationalError) (1213, u'WSREP detected deadlock/conflict 
> and aborted the transaction. Try restarting the transaction') 
> (Background on this error at: http://sqlalche.me/e/e3q8)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db Traceback 
> (most recent call last):
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", 
> line 91, in _report_state
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> service.service_ref.save()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", 
> line 226, in wrapper
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
> fn(self, *args, **kwargs)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/objects/service.py", 
> line 397, in save
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db db_service 
> = db.service_update(self._context, self.id, updates)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/api.py", 
> line 183, in service_update
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
> IMPL.service_update(context, service_id, values)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", 
> line 154, in wrapper
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> ectxt.value = e.inner_exc
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
> line 220, in __exit__
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.force_reraise()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
> line 196, in force_reraise
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> six.reraise(self.type_, self.value, self.tb)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/api.py", 
> line 142, in wrapper
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
> f(*args, **kwargs)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", 
> line 227, in wrapped
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db return 
> f(context, *args, **kwargs)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.gen.next()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
> line 1043, in _transaction_scope
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db yield 
> resource
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.gen.next()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
> line 653, in _session
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.session.rollback()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
> line 220, in __exit__
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.force_reraise()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_utils/excutils.py", 
> line 196, in force_reraise
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> six.reraise(self.type_, self.value, self.tb)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
> line 650, in _session
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self._end_session_transaction(self.session)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", 
> line 678, in _end_session_transaction
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> session.commit()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", 
> line 943, in commit
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.transaction.commit()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", 
> line 471, in commit
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db t[1].commit()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
> line 1643, in commit
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self._do_commit()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
> line 1674, in _do_commit
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.connection._commit_impl()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
> line 726, in _commit_impl
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self._handle_dbapi_exception(e, None, None, None, None)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
> line 1409, in _handle_dbapi_exception
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> util.raise_from_cause(newraise, exc_info)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", 
> line 265, in raise_from_cause
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> reraise(type(exception), exception, tb=exc_tb, cause=cause)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", 
> line 724, in _commit_impl
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self.engine.dialect.do_commit(self.connection)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", 
> line 1765, in do_commit
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> dbapi_connection.commit()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", 
> line 422, in commit
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> self._read_ok_packet()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", 
> line 396, in _read_ok_packet
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db pkt = 
> self._read_packet()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/connections.py", 
> line 683, in _read_packet
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> packet.check_error()
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/protocol.py", 
> line 220, in check_error
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> err.raise_mysql_exception(self._data)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db File 
> "/var/lib/kolla/venv/local/lib/python2.7/site-packages/pymysql/err.py", 
> line 109, in raise_mysql_exception
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db raise 
> errorclass(errno, errval)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db 
> OperationalError: (pymysql.err.OperationalError) (1213, u'WSREP detected 
> deadlock/conflict and aborted the transaction. Try restarting the 
> transaction') (Background on this error at: http://sqlalche.me/e/e3q8)
> 2019-04-15 16:52:10.016 24 ERROR nova.servicegroup.drivers.db
> 2019-04-15 16:52:20.020 24 INFO nova.servicegroup.drivers.db [-] 
> Recovered from being unable to report status.

This is a service update operation which could indicate that the other 
computes are reported as 'down' and that's why nothing is getting 
scheduled to them. Have you checked the "openstack compute service list" 
output to make sure those computes are all reporting as "up"?

https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/compute-service.html#compute-service-list
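For example (the nova-compute rows are what matter here):

```shell
# Every nova-compute service should report State "up" and Status
# "enabled"; hosts reported "down" are never selected by the scheduler.
openstack compute service list --service nova-compute
```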

There is a retry_on_deadlock decorator on that service_update DB API 
method, though, so I'm somewhat surprised to still see the deadlock 
errors; perhaps they just get logged while retrying?

https://github.com/openstack/nova/blob/stable/queens/nova/db/sqlalchemy/api.py#L566

> 
> 
> The deadlock message is quite strange, as we have haproxy configured so 
> all write requests are handled by one node.
> 
> 
> There are NO errors in the mysqld.log WHILE creating an instance, but we 
> see from time to time aborted connections from nova.
> 
> 2019-04-15T14:22:36.232108Z 30616972 [Note] Aborted connection 30616972 
> to db: 'nova' user: 'nova' host: '10.x.y.z' (Got an error reading 
> communication packets)
> 
> 
> 
> As I said, all instances are allocated to the same compute node. 
> nova-compute.log doesn't show an error while creating the instance.
> 
> 
> Besides that, we also see messages from nova.scheduler.host_manager on 
> all other nodes like the following (but those messages are _not_ 
> triggered when an instance is spawned!)
> 
> 
> 2019-04-15 16:28:47.771 22 INFO nova.scheduler.host_manager 
> [req-f92e340e-a88a-44a0-8cad-588390c25bc2 - - - - -] The instance sync 
> for host 'xxx' did not match. Re-created its InstanceList.

Are there any instances on these other hosts? My guess is you're seeing 
that after the live migration to another host.

> 
> 
> 
> Don't know if that may be relevant, but somehow our (currently single) 
> AZ is listed several times.
> 
> 
> # openstack availability zone list
> +-----------+-------------+
> | Zone Name | Zone Status |
> +-----------+-------------+
> | internal  | available   |
> | az1       | available   |
> | az1       | available   |
> | az1       | available   |
> | az1       | available   |
> +-----------+-------------+
> 
> May that be related somehow?

I believe those are the AZs for other services as well (cinder/neutron). 
Specify the --compute option to filter that.
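i.e. something like:

```shell
# Show only the compute AZs; without --compute you also get the
# zones from the other services, which is why az1 appears repeatedly.
openstack availability zone list --compute
```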

--

Another thing to check is placement - are there 8 compute node resource 
providers reporting into placement? You can check using the CLI:

https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-list

In Queens, there should be one resource provider per working compute 
node in the cell database's compute_nodes table (the UUIDs should match 
as well).
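Roughly (the mysql invocation is a sketch that assumes direct access to 
the cell database with the default schema and table names):

```shell
# Needs the osc-placement plugin installed; expect one resource
# provider per working compute node.
openstack resource provider list

# Cross-check the UUIDs against the cell DB's compute_nodes table:
mysql -e "SELECT uuid, hypervisor_hostname FROM nova.compute_nodes WHERE deleted = 0;"
```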

-- 

Thanks,

Matt