Deleting multiple nova instances

We spawn large numbers of instances for testing and benchmarking purposes. This happens a lot in HPC and, as long as they aren't orchestrated by Heat, it's fine to batch-delete them. But the OpenStack CLI doesn't appear to provide a way to do this intelligently and carefully. You could log into Horizon, but that takes a while (improvements are apparently coming in that area in Mitaka) and the screen only loads a maximum of 25 or 50 instances at a time.

You don't want to use this lightly (run everything up to the xargs first and check the grep matches), but here it is, more for my reference than anything else.

nova list --all-tenants | grep -i UNIQUE_COMMON_KEYWORD_HERE | awk '{print $2}' | grep -v '^ID' | grep -v '^$' | xargs -n1 nova delete

Then replace UNIQUE_COMMON_KEYWORD_HERE with mytestinstance, testinstance, cbtest or whatever your naming pattern is.
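To see what the filter stages do without touching the API, you can dry-run them against canned output first. A sketch using fake `nova list` output (the table and UUIDs below are invented for illustration):

```shell
# Simulated `nova list --all-tenants` output; the real command prints a
# similar ASCII table with the instance ID in the second column.
fake_nova_list() {
cat <<'EOF'
+--------------------------------------+-----------+
| ID                                   | Name      |
+--------------------------------------+-----------+
| 11111111-2222-3333-4444-555555555555 | cbtest-01 |
| aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee | prod-db   |
+--------------------------------------+-----------+
EOF
}

# Dry run: print the IDs that WOULD be deleted, without calling nova delete.
fake_nova_list | grep -i cbtest | awk '{print $2}' | grep -v '^ID' | grep -v '^$'
# → 11111111-2222-3333-4444-555555555555
```

Only once the dry run prints exactly the IDs you expect should you append the `| xargs -n1 nova delete` stage against the real `nova list` output.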



Red Hat OSP 7 Director custom hostnames

By default, nodes deployed with Director have a standard FQDN in the format:




Which is fine, but it's nice to customize this a bit, no?

To do so, edit the following undercloud files:


and create/change the following parameter:

dhcp_domain = iaas.local (or whatever you prefer)

You also need to edit the deployment parameter:

default: 'iaas.local'

in overcloud-without-mergepy.yaml
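For reference, the undercloud side of the change looks something like this (a sketch only – the post doesn't name the file, and iaas.local is just the example domain from above):

```ini
# Assumed: one of the undercloud config files referred to above.
# dhcp_domain sets the domain handed to nodes at provisioning time,
# giving FQDNs like overcloud-controller-0.iaas.local.
dhcp_domain = iaas.local
```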

NB: it looks like this only applies to OSP 7; OSP 8 has some changes that make this easier.

OpenStack and Clustered Data ONTAP

NetApp Fabric-Attached Storage (FAS) devices are pretty great. They offer a dual-controller configuration for HA setups, CIFS and NFS (including pNFS) as well as iSCSI and FCoE, and you can populate them with drives for up to 1PB of storage with a flash pool as a nice fast cache. They also have the added benefit of being backed by NetApp, who are a big contributor to OpenStack storage architectures, so it's a generally safe assumption that they will play nicely with Cinder, Glance, Swift and now Manila.

Because I mostly blog about problems, it's worth noting this one:

The Clustered Data ONTAP GUI doesn't appear to allow you to set owner and group on volumes when you create them – you have to do this through the CLI. Normally I expect to do most things at the command line, but the NetApp docs are quite explicit about doing everything in the GUI, as the commands to create things like volumes are complex, e.g.:

volume create -vserver vs0 -volume user_jdoe -aggregate aggr1 -state online -policy default_expolicy -user 165 -group 165 -junction-path /user/jdoe -size 250g -space-guarantee volume -percent-snapshot-space 20 -foreground false

Note the -user and -group parameters. These allow you to set ownership on the volume, which means that when OpenStack mounts it you can lock it down to either the cinder or glance user.

When is a link actually, y’know, UP?

TL;DR Use device names such as eno1, enp7s0 rather than nic1, nic2

I’ve been chasing an issue with a TripleO-based installation whereby the nodes were provisioning but failing to configure networking correctly.

Debugging TripleO deployments is fiendishly hard and this was made more complex by being unable to connect to the failed nodes. Deployed TripleO nodes only allow key-based ssh authentication. It’s great to see security being so good even the sysadmin can’t access the node I guess.

If you want to login to a node at the console, you basically have to roll your own deployment image. I was on the verge of heading down this route when I considered the following:

TripleO deployments have two methods of specifying usable, active network interfaces. The first (and, unfortunately, the default) is to number them nic1, nic2, nic3 and so on in the config. This leaves it to logic inside os-net-config to determine which links are actually connected to switches.

This would be fine on most machines, but sadly the hardware I'm working on has a built-in Ethernet-over-USB device for out-of-band access. This reports itself as having a link (for reasons unknown – maybe a link to the USB interface?) and therefore fulfils the following criteria:

  1. Not the local loopback
  2. Has an address
  3. Reports carrier signal as active in /sys/class/net/<device>/carrier
  4. Has a subdirectory of device information in /sys/class/net/<device>/device/

Despite the ip command reporting the link state as UNKNOWN, this meant the logic in os-net-config concluded that this was the management NIC and attempted to configure it as such, obviously to no avail.
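The four criteria above can be sketched as a shell check. This is an illustration of the heuristic as described, not os-net-config's actual implementation:

```shell
#!/bin/bash
# Return success if a device passes the four "looks connected" criteria
# listed above (illustrative sketch, not os-net-config's code).
is_active_nic() {
    local dev="$1"
    [ "$dev" != "lo" ] || return 1                              # 1. not loopback
    [ -s "/sys/class/net/$dev/address" ] || return 1            # 2. has an address
    [ "$(cat "/sys/class/net/$dev/carrier" 2>/dev/null)" = "1" ] || return 1  # 3. carrier up
    [ -d "/sys/class/net/$dev/device" ] || return 1             # 4. device info present
    return 0
}

# The loopback always fails criterion 1:
is_active_nic lo || echo "lo rejected"
# → lo rejected
```

As the Ethernet-over-USB device shows, a NIC can pass all four checks and still not be cabled to a switch, which is why naming devices explicitly is the safer option.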

Happily this has resulted in my first OpenStack patch:

which may even get accepted.
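In nic-config terms, the TL;DR fix means referring to interfaces by their device names. A sketch of such an os-net-config entry (the device name and addressing here are assumptions; substitute your own):

```yaml
network_config:
  - type: interface
    name: enp7s0            # explicit device name instead of nic1
    use_dhcp: false
    addresses:
      - ip_netmask: 192.0.2.10/24
```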

Red Hat Director continued…

It's been an interesting start to the year. Upcoming deployments are calling for Liberty-based installations, so I have been looking at using Red Hat's Director to deploy them. Because it's basically a glorified Heat template (albeit an extremely complicated one), it's ideal for customizing for individual environments, because no cloud is the same, right?

But that involves actually getting yourself to a working system in the first place when you’re needing to orchestrate storage, networking, HA/pacemaker and so on.

The undercloud installation, as referenced previously, was pretty much a doddle. The snag came when I attempted to deploy the "overcloud", the main production cloud. The nodes consistently failed to deploy to the hardware, but in utterly random fashion. For example, in a 3-controller and 3-compute setup, one deployment run would result in 2 controllers and 1 compute node successfully configured; a second run would then result in 1 controller and 3 compute nodes deployed.

So naturally you start by simplifying the setup as much as possible, removing complex storage and networking options. This didn’t improve matters so I tested with a standard OS deployment which worked fine. Changes to partitioning, UEFI configs, firmware amongst others all drew a blank.

Finally I decided to hack the boot config of the deployment images for the nodes (stored in /httpboot) to output debug info to console:

find /httpboot/ -name config -print0 | xargs -0 sed -i 's/troubleshoot=0/troubleshoot=1/g'
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/console=ttyS0 //g'

which took a few runs but then eventually spotted the blink-and-you’ll-miss-it error:


Specifically, “failed to mount root partition”. Which then led me to the following bug:

This explained why it had been working fine in my RDO environment, of course. It's a pity this was missed in Red Hat's testing, but it was a useful learning experience, going round the houses with heat, IPMI, iPXE, UEFI, kickstart partitioning and so on. The workaround has been to revert to Director's 7.1 deployment image, as I guess this is a fairly weird and wonderful problem. Future releases will feature a different deployment image, so apparently this will no longer be a problem.

All told, this took up about two weeks of engineering time, which goes to show that the random now-you-see-it-now-you-don't problems are the hardest to fix. Onwards and upwards.

Red Hat Director, The Undercloud

This week I've been starting to get to grips with Red Hat's Director cloud deployment tool. It leverages Ironic to provision baremetal machines and introduces the concept of an undercloud and an overcloud – there are essentially two clouds: the undercloud is a basic OpenStack environment with just the tools needed to get the main job done, and the overcloud is the cloud your users interact with and run their whatever on.

It's clear some serious engineering time has gone into making this as easy as possible, and the good news is that so far it seems to be working well. Installation of the undercloud was as simple as defining a few variables like network range, interface etc. I had tried RDO Manager (Director's upstream project) and had a fairly torrid time. They were going through some major infrastructure changes at the time, however, so perhaps that was part of the problem. Meh.

Anyway, pretty picture time.

[Screenshot from 2015-12-17]

So now various roles are ready to be deployed: the usual compute and control, as well as Ceph (no surprise, as this is a Red Hat product), Cinder and Swift.

Initial overcloud deployments haven't completed yet – this is more due to me being a tool and not following the instructions than to any particular bug in the software.


The biggest issue so far has been hardware-related. I've been using an ancient Nortel (yes, remember them!) switch which was taking an eon to bring up network links, I think due to a buggy STP (Spanning Tree Protocol) implementation. Director uses iPXE rather than PXELINUX for some reason (UEFI maybe?) and although it downloaded the NBP file fine, when it came to getting a DHCP lease it completely timed out. It was only when I attached the second interface to try and boot from that that it became apparent the link was taking a long time to come up. So I guess STP was at fault here, but I never debugged it – I just replaced the switch (with a good old HP ProCurve) and it worked fine.

In the course of the above issue I've learnt plenty of things about PXE booting, like how when the logs say "error 8 User aborted the transfer" you can actually ignore it, because it's normal, even if journalctl flags it in big red letters. Apparently this is an initial check to see what protocols the client supports before initiating the download proper.

Other problems included discovering that there needs to be a default flavor called "baremetal" so unassigned nodes know where to live, and that SELinux must be enabled – RDO Manager/Director won't actually install if it is disabled.

Standing up clouds

One thing that is interesting for me is the sheer number of ways of getting your OpenStack cloud to an end product, and the way in which no one system has prevailed.

For working in development environments you have devstack and packstack. My favourite t-shirt slogan from Tokyo was “But it worked in devstack…”.

Moving to production, you have Ansible, Salt, Puppet and now TripleO-based installers, each with their own offshoots. I am looking forward to working with RDO Manager and Director more, although RDO Manager (Director's upstream) seems to be highly fluid. I don't even understand how it all fits together. What is a "DeLorean instance", for example?

Red Hat’s lineage is a case in point. They started out with Foreman/Staypuft/Astapor (which I’m currently in the process of helping to bring up to OSP 7 standards), then moved to Director. The recent purchase of Ansible means that it is not implausible to consider that future installers will be based on that tool. Ansible seemed to be the most commonly used tool at the summit but that is based purely on anecdotal evidence.

Then Ubuntu has MAAS and its "juju charms" – it is supposed to "Just Work", but during brief in-house trials it didn't. Probably my fault.

It would be nice if development efforts could be consolidated a bit. I guess that will come with product maturity but it really does need to come sooner rather than later.

More on Satellite 6.1 provisioning

Satellite 6.1 supposedly supports bonded network interfaces. If it does, we’ve yet to get it working.

To be fair, this is a slightly more complex setup, with two interfaces on separate cards heading to separate switches, using LACP for resilience rather than load balancing. There are then a number of virtual interfaces hanging off this on separate VLANs.

However even without the VLAN interfaces in play, a basic bond fails and we are then presented with a failure cascade.

  • Foreman (the part of Satellite that controls provisioning) decides to change the IP of the provisioning node, not sure why, support case currently open. This happens whatever you do. Update – see below
  • You then delete the node which runs fine but leaves a dangling entry in DHCP because the MAC address has changed
  • You then have to edit the lease file to remove the entry and restart the daemon
  • Leave this too long and Satellite accepts fact uploads from the deleted node, adding it back into the Foreman database and preventing re-provisioning. You then have to run the db removal commands alluded to in my previous post. Update – see below
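The lease-file surgery in the steps above can be scripted rather than hand-edited. A sketch (the MAC address is invented, and the lease file path is an assumption – on RHEL it is typically /var/lib/dhcpd/dhcpd.leases):

```shell
# Drop every lease block that mentions the given MAC from a dhcpd.leases
# stream on stdin; everything else passes through untouched.
strip_lease() {
    awk -v mac="$1" '
        /^lease /  { buf = $0; inblock = 1; drop = 0; next }
        inblock    { buf = buf "\n" $0
                     if (index($0, mac) > 0) drop = 1
                     if ($0 ~ /^}/) { if (!drop) print buf; inblock = 0 }
                     next }
                   { print }'
}

# Usage sketch (stop dhcpd, rewrite the file, restart dhcpd):
#   strip_lease 52:54:00:aa:bb:cc < dhcpd.leases > dhcpd.leases.new
```

Writing to a new file and swapping it in while the daemon is stopped avoids racing dhcpd, which rewrites the lease file itself.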

I have back-ported this fix:

which seems to prevent duplicate bonds being created (another problem), but so far it hasn't been an easy ride.

If you're looking at using Satellite 6.1 to provision bonded networks (these are RHEL 7.2 machines) then I would avoid it and use snippets instead, which is what we have done here, and we appear to be through the worst of the above.

Alternatively consider later versions of foreman with better bonding support?

Update: I have discovered that setting the following:


resolved the MAC address changing issue



resolves hosts getting re-added to the database after deletion.

Satellite 6 and Foreman re-provisioning

We are using Satellite 6 (specifically 6.1.3) to implement an OSP 7 solution at work. I have used Foreman previously and realised how sensitive it can be. It works with lots of components like DHCP, DNS and Postgres, but in the Red Hat Satellite product it pulls in even more, like Candlepin and Katello. For the full list see:

Anyway, we hit a particular bug where, after deleting a node, Sat 6 wouldn't re-provision a node with the same name. This was a problem, as we test re-provisioning on a frequent basis, so we started to run out of nodes on the cluster to test with.

I hunted in the usual places, like DHCP lease files and BIND configuration databases, but there wasn't any evidence there, so it rapidly became obvious that some nasty database hackery would be required.

Sure enough, after I filed a bug with Red Hat GSS, they came back and asked me to run the following from the foreman-rake console (whatever that is):

Host.where("name LIKE ?", "%compute%").collect(&:name)

which output all the nodes with "compute" in the name. Sure enough, the deleted nodes were listed, despite the delete task having run successfully in Foreman/Satellite. So the next instruction was to run:


which ran a bunch of stuff and cleared the legacy node data from the DB. The node then successfully re-provisioned. Because this doesn't appear to be documented, I'm listing it here so you can go and break your own cluster at will.


Update: Clarified that it doesn’t clear the entire DB…

Removing orphaned instances when all else fails…

Working on OpenStack is complex, and working on older versions of OpenStack is even more so. If your instance is spawning but the shared storage hosting the ephemeral disk or block storage oopses/offlines/panics, then you can be left with orphaned instances that exist in the database but nowhere else. You try to delete them with nova delete, but this doesn't work because OpenStack can't locate the files it wants to delete, and you get into a real mess.

Some articles indicate that all you need to do is run some variation on:

mysql -D nova -e "delete from instances where instances.uuid = '$uuid'"

but this is bad because it leaves all sorts of information relating to the VM in existence. This appears to have been fixed in later versions of OpenStack – Kilo hasn't exhibited the problem yet – so what follows is Icehouse-specific, for those people still running that release.

Most of the database info I have stolen from the URL in the comments; I have just added input handling so you don't need to drop to a mysql prompt. Feed it your VM UUID and you're done. If you have reached this page then you're probably not in a great place, so the usual warnings about random bits of bash script on the internet apply. And remember that reset-state is your friend, and you should have tried lots of other stuff first.

#! /bin/bash
# This is an Icehouse-specific script
# to remove an instance that is not consuming ANY resources
# ie. It only exists in the database. You need to be VERY
# sure of this fact before using so as not to leave disks
# orphaned or instances running. Use as a last resort after
# deletion and reset-state nova options have failed. Use nova show to
# inspect libvirt xml prior to using.
# Source for db schema:

read -p "Please enter the UUID of the VM you need to clear from the database: " uuid
mysql -D nova -e "select display_name from instances where instances.uuid = '$uuid'"
read -p "Are you sure this is the instance you are looking for? y/n: " response
if [ "$response" == "y" ]; then
mysql -D nova -e "delete from instance_faults where instance_faults.instance_uuid = '$uuid'"
mysql -D nova -e "delete from instance_id_mappings where instance_id_mappings.uuid = '$uuid'"
mysql -D nova -e "delete from instance_info_caches where instance_info_caches.instance_uuid = '$uuid'"
mysql -D nova -e "delete from instance_system_metadata where instance_system_metadata.instance_uuid = '$uuid'"
mysql -D nova -e "delete from security_group_instance_association where security_group_instance_association.instance_uuid = '$uuid'"
mysql -D nova -e "delete from block_device_mapping where block_device_mapping.instance_uuid = '$uuid'"
mysql -D nova -e "delete from fixed_ips where fixed_ips.instance_uuid = '$uuid'"
mysql -D nova -e "delete from instance_actions_events where instance_actions_events.action_id in (select id from instance_actions where instance_actions.instance_uuid = '$uuid')"
mysql -D nova -e "delete from instance_actions where instance_actions.instance_uuid = '$uuid'"
mysql -D nova -e "delete from virtual_interfaces where virtual_interfaces.instance_uuid = '$uuid'"
mysql -D nova -e "delete from instances where instances.uuid = '$uuid'"
echo "Ok, done"
else
echo "Quitting, no changes made"
fi