Status does not update to "Shutoff" when instance shuts down itself
- Series: queens
- Bug #1753550
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Matt Riedemann |
Queens | Fix Committed | Medium | Matt Riedemann |
Bug Description
For instances that shut themselves down, the status remains ACTIVE even though the power state updates to Shutdown. If, however, the shutdown signal is sent via the openstack CLI, the status does change to SHUTOFF.
After creating an instance, the states are:
+------------------------+---------+
| Field | Value |
+------------------------+---------+
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| status | ACTIVE |
+------------------------+---------+
Using openstack server stop results in:
+------------------------+----------+
| Field | Value |
+------------------------+----------+
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | stopped |
| status | SHUTOFF |
+------------------------+----------+
However logging into the instance and using the poweroff command results in:
+------------------------+----------+
| Field | Value |
+------------------------+----------+
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| status | ACTIVE |
+------------------------+----------+
This means the openstack server start command cannot be used on the instance; it fails and returns:
# openstack server start test_shutdown
Cannot 'start' instance 8881bebb-efbd-45e6-a052-7d23b9b63222 while it is in vm_state active (HTTP 409) (Request-ID: req-e473415d-d6e0-4342-b1f6-267efa934dc0)
despite the virtual machine being powered off. You can work around this by running openstack server stop and then openstack server start.
This is also an issue for external applications that check the status (and how I noticed it to begin with).
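If it helps, the workaround expressed as a small openstacksdk script (a sketch with assumed connection parameters from clouds.yaml; the server name is the one from this report):

```python
# Hypothetical workaround script using openstacksdk; cloud name and
# server name are assumptions for illustration.
import openstack

conn = openstack.connect(cloud='devstack')
server = conn.compute.find_server('test_shutdown')

# The first "stop" syncs nova's vm_state to STOPPED even though the
# guest is already powered off; after that, "start" works again.
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status='SHUTOFF')
conn.compute.start_server(server)
```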
This is on devstack with commit id: 5d2add74534719c5670b29152964a60e8f23b42b
Not sure if the following is useful, but
- Using hypervisor libvirt+qemu/kvm:
# virsh version
Compiled against library: libvirt 3.2.0
Using library: libvirt 3.2.0
Using API: QEMU 3.2.0
Running hypervisor: QEMU 2.9.0
- Using ephemeral storage with ext4
- Neutron with Open vSwitch
tags: added: compute
melanie witt (melwitt) wrote on 2018-03-20: #1
There is a periodic task in nova-compute controlled by the config option [DEFAULT]sync_power_state_interval which defaults to 600 seconds (10 minutes). The periodic task will update the power_state in the database to reflect the power state returned by the virt driver. That's the power state update that you see.
Now, looking at that same periodic task, it's also supposed to be calling the compute "stop" API on the instance if the power state from the virt driver is found to be "shutdown". So once that occurs, you should see the instance with a vm_state of "stopped" and a status of SHUTOFF. Do you ever see that after the [DEFAULT]sync_power_state_interval of 10 minutes?
There should be a DEBUG log message in nova-compute like this when it finds a powered off instance:
"Instance shutdown by itself. Calling the stop API"
and
"Stopping instance; current vm_state: ..."
Do you see those log messages or do you find any ERROR log messages in nova-compute around them?
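For context, here is a condensed sketch of that periodic's per-instance logic (a simplification of what nova/compute/manager.py does, not the exact code; see the code links in the next comment):

```python
# Condensed sketch of the power-state sync periodic task in
# nova/compute/manager.py, for a single instance.
from nova.compute import power_state
from nova.compute import vm_states


def sync_one_instance(self, context, instance, driver_power_state):
    """Reconcile nova's records with the power state the driver reports."""
    # Persist the power state the virt driver reported, so the API
    # shows the real hypervisor state (the update seen in this report).
    if instance.power_state != driver_power_state:
        instance.power_state = driver_power_state
        instance.save()

    # If the guest shut itself down while nova still considers it
    # ACTIVE, call the compute "stop" API so vm_state/status follow.
    if (instance.vm_state == vm_states.ACTIVE
            and driver_power_state in (power_state.SHUTDOWN,
                                       power_state.CRASHED)):
        # nova logs "Instance shutdown by itself. Calling the stop API"
        # at this point (the message quoted above).
        self.compute_api.stop(context, instance)
```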
Changed in nova:
status: New → Incomplete
melanie witt (melwitt) wrote on 2018-03-20: #2
Here's where the code should be updating the instance power state to "shutdown" during the periodic task:
https://github.com/openstack/nova/blob/658d468/nova/compute/manager.py#L7127
Here's where it should be calling the "stop" API:
https://github.com/openstack/nova/blob/658d468/nova/compute/manager.py#L7153
and here's where it should be updated to vm_state STOPPED upon the "stop" API being called:
https://github.com/openstack/nova/blob/658d468/nova/compute/manager.py#L2606
Raoul Hidalgo Charman (raoulhc) wrote on 2018-03-20: #3
I do get both those messages, followed by an exception:
EndpointNotFound: Could not find requested endpoint for any of the following interfaces: ['internal', 'public']
With this traceback from oslo.messaging.server:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/stack/nova/nova/exception_wrapper.py", line 76, in wrapped
    function_name, call_dict, binary)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/exception_wrapper.py", line 67, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/manager.py", line 186, in decorated_function
    "Error: %s", e, instance=instance)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 156, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/utils.py", line 985, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 214, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2616, in stop_instance
    do_stop_instance()
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2598, in do_stop_instance
    "power_off.start")
  File "/opt/stack/nova/nova/compute/manager.py", line 1647, in _notify_about_instance_usage
    extra_usage_info=extra_usage_info, fault=fault)
  File "/opt/stack/nova/nova/compute/utils.py", line 325, in notify_about_instance_usage
    network_info, system_metadata, **extra_usage_info)
  File "/opt/stack/nova/nova/notifications/base.py", line 398, in info_from_instance
    context)
  File "/opt/stack/nova/nova/image/api.py", line 65, in generate_image_url
    return "%s/images/%s" % (next(glance.get_api_servers(con...
Raoul Hidalgo Charman (raoulhc) wrote on 2018-03-20: #4
Also, no, it doesn't get updated eventually; I've left a server for a while and it remains like that. Looking through the log files, it gets the same error and traceback every time the task runs.
melanie witt (melwitt) wrote on 2018-03-20: #5
Thank you for adding that trace -- it shows why the intended behavior to set the instance to STOPPED fails:
"File "/opt/stack/nova/nova/compute/manager.py", line 2598, in do_stop_instance
"power_off.start")"
It's failing to send the notification for beginning the "stop" routine, so it's not reaching any of the code that actually stops the instance and sets the status to STOPPED.
I'm not familiar with the error:
"EndpointNotFound: Could not find requested endpoint for any of the following interfaces: ['internal', 'public']"
but I'll find out and update this.
Matt Riedemann (mriedem) wrote on 2018-03-20: #6
What's your [glance] config in nova.conf?
See the 'valid_interfaces' option here:
https://docs.openstack.org/nova/latest/configuration/config.html#glance
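If it helps, a minimal example of the two modes of that section (the endpoint value is a placeholder, not from this environment; valid_interfaces shows its documented default):

```
[glance]
# Option 1: pin explicit image API endpoints. When this is set,
# nova never consults the keystone service catalog.
api_servers = http://10.0.0.10:9292

# Option 2 (the default when api_servers is unset): discover the
# endpoint from the service catalog, trying these interfaces in order.
# valid_interfaces = internal,public
```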
Matt Riedemann (mriedem) wrote on 2018-03-20: #7
Also, see https://github.com/openstack/nova/commit/62ef6cfcf01d84813f71d1e8252b86c170ee39f0 - do you have [service_user] configured in nova.conf? Do you have that fix? ^
melanie witt (melwitt) wrote on 2018-03-20: #8
Matt, in comment 0 they mentioned they're using devstack on commit id: 5d2add74534719c5670b29152964a60e8f23b42b so they should have the fix you mentioned.
Raoul, are you running glance in your devstack environment where you've reproduced this problem?
We looked deeper into the code that is failing and it's a call to keystone looking for the glance service in the service catalog during the attempt to send a notification (it falls back on this behavior if [glance]api_servers is not set in nova.conf).
Raoul Hidalgo Charman (raoulhc) wrote on 2018-03-20: #9
I don't have a [glance] config section, so it's all default options.
I've got that commit, and I don't have service_user configured.
Raoul Hidalgo Charman (raoulhc) wrote on 2018-03-20: #10
And yep, it's a devstack all-in-one server setup.
melanie witt (melwitt) wrote on 2018-03-20: #11
Okay, I was able to reproduce the problem locally with devstack by:
1. Create an instance
2. virsh shutdown <domain>
3. Wait for nova-compute to try to "stop" the instance; it fails to do so
I found that the "EndpointNotFound: Could not find requested endpoint for any of the following interfaces: ['internal', 'public']" error originates from an "exceptions.EmptyCatalog('The service catalog is empty.')" raised from keystoneauth1/access/service_catalog.py(362)endpoint_data_for(). That happens because RequestContext.service_catalog is an empty list ([]) when we make the call.
The "sync power states" periodic task is the routine responsible for updating the instance power state and status to reflect the state reported by the virt driver, and our periodic tasks all run using an anonymous RequestContext we get by calling nova.context.get_admin_context(). This type of context is not going to have service_catalog populated, so we will fail if we try to do anything with the service_catalog on such a RequestContext.
Currently, it looks like the first thing we need to do is determine whether we really need to be trying to do anything with image urls or glance or the service catalog when we're sending notifications. gibi said he will start investigating that tomorrow.
For the moment, other than removing the dependency on glance image urls in notifications, it's not clear how we can fix the periodic task.
Changed in nova:
status: Incomplete → Triaged
importance: Undecided → Medium
melanie witt (melwitt) wrote on 2018-03-20: #12
Setting this to High because any periodic task that results in notifications being sent will hit this problem if CONF.glance.api_servers is not set.
Changed in nova:
importance: Medium → High
status: Triaged → Confirmed
melanie witt (melwitt) wrote on 2018-03-20: #13
Midair collision. I was thinking between High and Medium, so Medium it is.
The current idea is to handle EndpointNotFound in the notifications code and set the image_ref_url to the image id.
Changed in nova:
importance: High → Medium
status: Confirmed → Triaged
Matt Riedemann (mriedem) wrote on 2018-03-20: #14
The problem is that the periodic doesn't have a token, and [glance]/api_servers isn't configured, so when the notification code is trying to build a payload, it's trying to get the image API endpoint URL to build a link to the image for the notification. Getting the endpoint URL without a token fails.
This is the code that matters: if [glance]/api_servers is configured, then we don't go through keystoneauth1 to get the service catalog from keystone.
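In other words, the lookup behaves roughly like this sketch (an assumed shape for illustration, not the actual nova function; the real logic lives in nova/image/glance.py and keystoneauth1):

```python
# Condensed sketch of the image endpoint lookup described above.
def get_image_endpoint(api_servers, ksa_adapter):
    """api_servers: the [glance]api_servers list from nova.conf, or None.
    ksa_adapter: a keystoneauth1.adapter.Adapter bound to the request
    context's auth (and hence its service catalog).
    """
    if api_servers:
        # Static endpoints configured: keystone is never consulted.
        return api_servers[0]
    # No static endpoints: resolve from the service catalog. With the
    # anonymous context used by periodic tasks the catalog is empty,
    # and keystoneauth surfaces this as EndpointNotFound.
    return ksa_adapter.get_endpoint()
```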
So options seem to be:
1. Add the ability to specify admin or service user credentials in nova.conf's [glance] section, which we could use during periodics. This is a bit heavyweight when we've never needed it before.
2. Handle the EndpointNotFound error when building the notification payload and just set image_ref_url to the image id (a sketch of this follows the list). Yes, this is technically a change in the payload, but it's better than breaking everything. Long-term, the versioned notification payload could drop image_ref_url and just pass the image id, which is probably more useful anyway since the consumer might not even be able to reach the URL that's in the payload. We can't change the legacy notification payload, but we can eventually deprecate it in favor of the versioned notifications.
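For illustration, option 2 would look roughly like this inside the payload-building code (image_api, instance, and context come from the surrounding nova code; exact names may differ from the merged change below):

```python
from keystoneauth1 import exceptions as ks_exc

try:
    image_ref_url = image_api.generate_image_url(instance.image_ref,
                                                 context)
except ks_exc.EndpointNotFound:
    # Periodic tasks run with an anonymous context: no token, no
    # catalog, no endpoint. Fall back to the bare image id (an empty
    # string for volume-backed instances) rather than failing the
    # whole notification.
    image_ref_url = instance.image_ref
```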
OpenStack Infra (hudson-openstack) wrote on 2018-03-20: Fix proposed to nova (master) #15
Fix proposed to branch: master
Review: https://review.openstack.org/554703
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote on 2018-03-21: Fix merged to nova (master) #16
Reviewed: https://review.openstack.org/554703
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6482165bb1f44f5c98d9361153d737c22c92112d
Submitter: Zuul
Branch: master
commit 6482165bb1f44f5c98d9361153d737c22c92112d
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 20 17:15:42 2018 -0400
Handle EndpointNotFound when building image_ref_url in notifications
Change I4e755b9c66ec8bc3af0393e81cffd91c56064717 made the
[glance]/api_servers option optional. If not set, we attempt
to get the image service endpoint via keystoneauth adapter and
the service catalog via the request context.
Periodic tasks run without an actual token so there is no way
to get the service catalog and the KSA adapter code to get the
endpoint raises EndpointNotFound when trying to build the
"image_ref_url" entry in the legacy instance notification payload.
This change simply handles the EndpointNotFound and sets the
image_ref_url to the instance.image_ref, which for non-volume-backed
instances is the image id (for volume-backed instances it's an empty
string).
This doesn't affect versioned notifications since those don't use the
"image_ref_url" entry in the payload that is created, they just have
an "image_uuid" entry in the versioned notification payload which is
populated via instance.image_ref.
An upgrade impact release note is added in the case that some consuming
service is actually relying on that legacy notification field being
a URL and parsing it as such. The thinking here, however, is this is
better than not sending the field at all, or setting it to None.
Long-term this code all gets replaced with versioned notifications.
Change-Id: Ie23a9c922776b028da0720c939846cba233ac472
Closes-Bug: #1753550
Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote on 2018-03-21: Fix proposed to nova (stable/queens) #17
Fix proposed to branch: stable/queens
Review: https://review.openstack.org/554963
OpenStack Infra (hudson-openstack) wrote on 2018-03-29: Fix merged to nova (stable/queens) #18
Reviewed: https://review.openstack.org/554963
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e55a5f412bf6bc343eb8bb11348afe57c2012066
Submitter: Zuul
Branch: stable/queens
commit e55a5f412bf6bc343eb8bb11348afe57c2012066
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 20 17:15:42 2018 -0400
Handle EndpointNotFound when building image_ref_url in notifications
Change I4e755b9c66ec8bc3af0393e81cffd91c56064717 made the
[glance]/api_servers option optional. If not set, we attempt
to get the image service endpoint via keystoneauth adapter and
the service catalog via the request context.
Periodic tasks run without an actual token so there is no way
to get the service catalog and the KSA adapter code to get the
endpoint raises EndpointNotFound when trying to build the
"image_ref_url" entry in the legacy instance notification payload.
This change simply handles the EndpointNotFound and sets the
image_ref_url to the instance.image_ref, which for non-volume-backed
instances is the image id (for volume-backed instances it's an empty
string).
This doesn't affect versioned notifications since those don't use the
"image_ref_url" entry in the payload that is created, they just have
an "image_uuid" entry in the versioned notification payload which is
populated via instance.image_ref.
An upgrade impact release note is added in the case that some consuming
service is actually relying on that legacy notification field being
a URL and parsing it as such. The thinking here, however, is this is
better than not sending the field at all, or setting it to None.
Long-term this code all gets replaced with versioned notifications.
Change-Id: Ie23a9c922776b028da0720c939846cba233ac472
Closes-Bug: #1753550
(cherry picked from commit 6482165bb1f44f5c98d9361153d737c22c92112d)
OpenStack Infra (hudson-openstack) wrote on 2018-04-02: Fix included in openstack/nova 17.0.2 #19
This issue was fixed in the openstack/nova 17.0.2 release.
OpenStack Infra (hudson-openstack) wrote on 2018-04-20: Fix included in openstack/nova 18.0.0.0b1 #20
This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.