Scaling issues with the TripleO undercloud node

With large deployments, it’s important to make good architectural choices about the nodes in your environment. During deployment, none is more critical than the provisioning node itself.

This is because in TripleO the undercloud node doesn’t just push out images to nodes but also orchestrates configuration of the entire cluster. In order to do this, it needs to run a bunch of OpenStack services like nova, neutron, ironic, heat, keystone, glance etc. It also runs two databases, a message bus and a web server.

All of this means that the provisioning node needs to be quite a powerful piece of kit, as the provisioning process involves lots of disk and network I/O amongst other things. So it’s important to specify a fast disk, plenty of memory, a goodly number of cores and a quick NIC.

But sometimes, even with all of the above, you need to tweak things because with the best will in the world, the undercloud installation and configuration applies “best guess” values when it comes to things like threads, processes, timeouts and retries.

Each service has a tonne of configurable options, and it’s important to understand the implications of each one in order to get the best performance out of the node. It’s also important to understand which changes will help in response to any particular bottleneck.

Specifically, we found that tuning the process and thread count for WSGI processes and increasing haproxy maxconn values caused the node to handle load with greater efficiency. A patch has been merged to address this[1]. Red Hat produce a guide on tuning the undercloud (Director in their commercial parlance)[2].



So you need a management network quick-smart?

TripleO clouds can be deployed with an optional management VLAN. You can use this to run Ansible playbooks, monitoring systems and manage your cloud, hence the name.

However, this requires configuration during deployment. So what happens if you have a cloud that doesn’t have a management VLAN? You can use the provisioning network. The problem is that this only has dynamic addresses, not fixed ones. However, these rarely change, so to perform a quick playbook run or a cluster-wide config change with pdsh, for example, you can use OpenStack’s CLI to create a hosts file as follows:

openstack server list -f value --column Networks --column Name | sed 's/ ctlplane=/ /g' | awk '{ print $2 " " $1}'

This converts the server list for your deployed nodes into a format you can cat into a hosts file.
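To see the transformation in isolation, here is a minimal sketch of the same conversion wrapped in a shell function. The function name to_hosts and the sample names and addresses are made up for illustration. It assumes each input line looks like "name ctlplane=address", which is what the one-liner relies on.

```shell
# Hypothetical helper: turn "name ctlplane=address" lines into
# "address name" lines suitable for a hosts file.
to_hosts() {
    sed 's/ ctlplane=/ /g' | awk '{ print $2 " " $1 }'
}

# Sample input standing in for real "openstack server list" output.
printf '%s\n' \
  'compute-0 ctlplane=192.0.2.10' \
  'controller-0 ctlplane=192.0.2.11' \
  | to_hosts
```

Against a real undercloud, the openstack server list command would be piped into to_hosts in place of the printf.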

This avoids having to add your management node to another network (e.g. storage) because it reuses an existing one.

It’s not big, it’s not clever but it does work.

Exporting Amazon EC2 instances into OpenStack

I had a requirement to get some workloads running on EC2 (which I’m a huge fan of, I just hate the vendor lock-in) imported into OpenStack.

Tools to help you get anything out of AWS are almost non-existent. I did try ec2-create-instance-export-task from AWS API tools but this has so many hurdles to jump through that it became slightly farcical. In the end it wouldn’t let me export the image because it wasn’t an imported image in the first place. Hmmm.

Despite what the general consensus online says, this turns out to be fairly straightforward. The problem appears to come if you’ve used Amazon Linux AMIs with their custom kernel. Thankfully, these were Ubuntu 16.04 images.

Step 1. Boot an instance from your AMI. Use SSD and a decent instance size if you’re feeling flush and in a hurry.

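Step 2. Snapshot the instance’s volume, create a volume from that snapshot and attach it to the running instance.

Step 3. From your OpenStack environment, dd the attached disk on the instance, gzip it and pipe it back over SSH because, y’know, Amazon egress charges. Substitute your own login and instance address for the placeholder. E.g.:

ssh -i chris.pem <user>@<ec2-instance> "sudo dd if=/dev/xvdf | gzip -1 -" | dd of=image.gz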

Step 4. Unzip the image, upload it to OpenStack and boot it.

Step 5 (For those with Amazon kernels). Fudge around replacing the Amazon kernel with something close to the same version. YMMV.
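Steps 3 and 4 boil down to compressing on one side and decompressing on the other. Below is a rough local sketch of that round trip, using a scratch file in place of the real /dev/xvdf device and leaving out the SSH hop. The function names, file names and the image name in the trailing comment are illustrative only.

```shell
# Hypothetical helpers mirroring the dd | gzip pipeline, minus the SSH hop.
pack_disk() {
    # $1: source device or file; the compressed stream goes to stdout
    dd if="$1" bs=65536 2>/dev/null | gzip -1 -
}

unpack_image() {
    # $1: compressed image, $2: destination for the raw image
    gunzip -c "$1" > "$2"
}

# Round trip on a scratch file standing in for the attached volume.
workdir=$(mktemp -d)
dd if=/dev/urandom of="$workdir/disk.raw" bs=1024 count=64 2>/dev/null
pack_disk "$workdir/disk.raw" > "$workdir/image.gz"
unpack_image "$workdir/image.gz" "$workdir/image.raw"
cmp "$workdir/disk.raw" "$workdir/image.raw" && echo "round trip OK"

# After unpacking the real image, something along these lines uploads it
# (the image name is a placeholder):
#   openstack image create --disk-format raw --container-format bare \
#       --file image.raw ec2-import
```

Comparing the two files with cmp is a cheap way to confirm nothing was mangled in transit before spending time on the upload.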

OpenStack Release Notes with Reno

I’m currently trying to get a patch submitted to the Puppet Keystone project which implements the ability to turn “chase referrals” on or off for deployments that use Active Directory.

One comment came back from the initial patch:

please add release note

Ok. So of course this being OpenStack it turns out to be complicated. You need to use “Reno”, a tool that has been used since Liberty (I think) to document changes to OpenStack. The HUGE irony is that the documentation for OpenStack’s documentation tool is sparse and pretty hopeless. It recommends running:

tox -e venv -- reno new slug-goes-here

which gives the error: ERROR: unknown environment ‘venv’

Of course. Thankfully some kind soul in the Manila documentation project has added the missing clue for the clueless:

If reno is not installed globally on your system, you can use it from venv of your manila’s tox. Run:

source .tox/py27/bin/activate

py27 needed replacing with “releasenotes” for some obscure reason in the puppet-keystone directory but then it worked and I could finally run:

reno new implement-chase-referrals

and the release note was created.
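For anyone who hasn’t seen one, a reno note is just a small YAML file under releasenotes/notes/. The sketch below writes one by hand (in a scratch directory) to show the rough shape. It is not what reno generates verbatim: the hex suffix in the file name and the note wording are invented for illustration, and the template that reno new creates contains more sections than features.

```shell
# Hand-written stand-in for the file that "reno new" would create.
# The suffix 0123456789abcdef is made up; reno generates its own.
workdir=$(mktemp -d)
note="$workdir/releasenotes/notes/implement-chase-referrals-0123456789abcdef.yaml"
mkdir -p "$(dirname "$note")"
cat > "$note" <<'EOF'
---
features:
  - |
    Add the ability to turn "chase referrals" on or off for
    deployments that use Active Directory.
EOF
cat "$note"
```

In practice you edit the generated file, delete the sections you don’t need and commit it along with the patch.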

Manually re-setting failed deployments with Ironic

OpenStack commands have some odd naming conventions sometimes – just take a look at the whole evacuate/host-evacuate debacle in nova for example – and ironic is no exception.

I’m currently using TripleO to deploy various environments, which sometimes results in failed deployments. If you take into account all the vagaries of various IPMI implementations, I think it does a pretty good job. Sometimes though, when a stack gets deleted, I’m left with something like the following:

[stack@undercloud ~]$ nova list
| ID | Name | Status | Task State | Power State | Networks |

[stack@undercloud ~]$ ironic node-list
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |

| 447ffea5-ae3f-4796-bfba-ce44dd8a84b7 | compute4 | 26843ce8-e562-4945-ad32-b60504a5bca3 | power on | deploy failed | False |

So an instance is still associated with the baremetal node.

In this case, it isn’t obvious but after some digging:

ironic node-set-provision-state compute4 deleted

should result in the node being set back to available. I’m still not clear if this re-runs the clean steps but it gives me what I want to re-run deployment.
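When several nodes are stuck like this, it can help to pull the failed ones out of the listing first. The function below is a small sketch (its name is made up) that reads ironic node-list table output on stdin and prints the UUID of every node whose Provisioning State is "deploy failed". It assumes the six-column layout shown above, and the reset command is only echoed, not run.

```shell
# Hypothetical helper: print UUIDs of nodes in the "deploy failed" state.
# Expects "ironic node-list" table rows on stdin, columns as shown above.
failed_nodes() {
    awk -F'|' '
        {
            uuid = $2; state = $6
            gsub(/^ +| +$/, "", uuid)    # trim padding around the cell
            gsub(/^ +| +$/, "", state)
            if (state == "deploy failed") print uuid
        }'
}

# Sample row taken from the listing above.
printf '%s\n' \
  '| 447ffea5-ae3f-4796-bfba-ce44dd8a84b7 | compute4 | 26843ce8-e562-4945-ad32-b60504a5bca3 | power on | deploy failed | False |' \
  | failed_nodes \
  | while read -r uuid; do
        # Dry run: show the command rather than executing it.
        echo "ironic node-set-provision-state $uuid deleted"
    done
```

Dropping the echo turns the dry run into the real thing, once you are happy with the list.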

OpenStack Tempest on RDO Mitaka

There are two main tools for testing a deployed cloud: Rally and Tempest.

I have been looking into verifying functionality in a private cloud once it has been created (using TripleO) and the documentation is, as usual, abysmal. It’s the usual rabbit warren of developer docs, stuff relating to releases from 3 years back, blueprints which mention “the upcoming Havana release” etc etc.

So for reference (mine mostly), here are the steps to get OpenStack Tempest working on the RDO Mitaka stable release:

  1. Ensure you have a neutron network called “nova”
    $ neutron net-create nova --router:external --provider:network_type flat --provider:physical_network datacentre
    $ neutron subnet-create --name nova --enable_dhcp=False --allocation-pool=start=,end= --gateway= nova
  2. Check that you have a role called “heat_stack_owner”. If not, create one:
    $ openstack role create heat_stack_owner
  3. Create your tempest directory and change into it
    $ mkdir ~/tempest && cd ~/tempest
  4. Initialize the directory by running
    $ /usr/share/openstack-tempest-10.0.0/tools/configure-tempest-directory
  5. Configure tempest
    $ tools/ --deployer-input ~/tempest-deployer-input.conf \
    --create identity.uri $OS_AUTH_URL identity.admin_password $OS_PASSWORD
  6. Run tempest (NOT with tools/
    $ ./
  7. Answer yes to the prompt to initialise your virtual environment. This will download required libraries etc.
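Step 2 can be made repeatable with a small guard. This is a sketch only: ensure_role is a made-up name, and it assumes that openstack role show exits non-zero when the role does not exist.

```shell
# Hypothetical helper: create a Keystone role only if it is missing.
# Assumes "openstack role show" fails when the role does not exist.
ensure_role() {
    openstack role show "$1" >/dev/null 2>&1 || openstack role create "$1"
}
```

With your credentials sourced, running ensure_role heat_stack_owner then does nothing if the role is already there.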

Depending on your environment, the tests will take about an hour to run. So go make a brew and get ready to debug the failures. 🙂


Shellinabox and serial consoles

TripleO is in fairly dire need of something similar to conserver/wcons/rcons in xCAT, just so you can see what the heck the node’s console is doing instead of having to fire up your out-of-band web interface, log in and launch the web console, and that is *if* you have the license for it.

CLI console access in Ironic is currently under development after I filed an RFE:

but in the meantime I decided to try and get serial console access through shellinabox working.

It’s not too hard and the following is a good start:

The key thing to understand is the terminal_port value, which varies according to the IPMI driver.

Once configured this gives a nice view with a decent amount of scroll-back.

It’s a pity all this is manual. I guess it would be fairly easy to script enabling serial consoles as part of an undercloud install, but it’s enough of a security risk that discouraging it by not making it so easy is perhaps the point!

Deleting multiple nova instances

We spawn large numbers of instances for testing and benchmark purposes. This happens a lot in HPC and as long as it isn’t orchestrated by heat, it’s fine to batch-delete these. But the OpenStack CLI doesn’t appear to provide a way to do this intelligently and carefully. You could log into Horizon but this takes a while (improvements coming in that area in Mitaka apparently) and the screen only loads a maximum of 25 or 50 or something.

You don’t want to use this lightly (run just the list and grep part first to confirm the match) but here it is, more for my reference than anything else.

nova list --all-tenants | grep -i UNIQUE_COMMON_KEYWORD_HERE | awk '{print $2}' | grep -v ^ID | grep -v ^$ | xargs -n1 nova delete

Then obviously replace UNIQUE_COMMON_KEYWORD_HERE with mytestinstance or testinstance or cbtest or so on.
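A slightly more cautious variant is to separate the matching from the deleting, so the list can be reviewed first. The sketch below is a rewrite under stated assumptions, not the one-liner above: matching_ids is a made-up name, it matches the keyword (as a case-insensitive regex) against the Name column only rather than the whole row, and it assumes the usual nova list table layout with ID and Name as the first two columns.

```shell
# Hypothetical helper: print IDs of instances whose Name matches $1.
# Expects "nova list" table output on stdin.
matching_ids() {
    awk -F'|' -v kw="$1" '
        NF > 2 {
            id = $2; name = $3
            gsub(/^ +| +$/, "", id)      # trim cell padding
            gsub(/^ +| +$/, "", name)
            if (id != "ID" && tolower(name) ~ tolower(kw)) print id
        }'
}

# Sample table standing in for real "nova list --all-tenants" output.
printf '%s\n' \
  '+---------+----------+--------+' \
  '| ID      | Name     | Status |' \
  '+---------+----------+--------+' \
  '| aaa-111 | cbtest-1 | ACTIVE |' \
  '| bbb-222 | prod-db  | ACTIVE |' \
  '+---------+----------+--------+' \
  | matching_ids cbtest
```

Once the printed IDs look right, the same pipeline can be fed into xargs -n1 nova delete. Matching on the Name column alone avoids accidentally catching a keyword that happens to appear in a network name or status field.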


Red Hat OSP 7 Director custom hostnames

By default, nodes deployed with Director have a standard FQDN in the format:




Which is fine, but it’s nice to customize this a bit, no?

To do so, edit the following undercloud files:


and create/change the following parameter:

dhcp_domain = iaas.local (or whatever you prefer)

You also need to edit the deployment parameter:

default: ‘iaas.local’

in overcloud-without-mergepy.yaml

NB: It looks like this only applies to OSP 7; OSP 8 has some changes to make this easier.

OpenStack and Clustered Data ONTAP

NetApp Fabric-Attached Storage (FAS) devices are pretty great. They offer dual controller config for HA setups, CIFS and NFS (including pNFS) as well as iSCSI and FCoE, and you can populate them with drives for up to 1PB of storage with a flash pool for a nice fast cache. They also have the added benefit of being backed by NetApp, who are a big contributor to OpenStack storage architectures, so it’s a generally safe assumption that they will play nicely with Cinder, Glance, Swift and now Manila.

Because I mostly blog about problems, it’s worth noting this one:

The Clustered Data ONTAP GUI doesn’t appear to allow you to set owner and group on volumes when you create them; you have to do this through the CLI. Normally I expect to do most things at the command line, but NetApp docs are quite explicit about doing everything at the GUI as commands to create things like volumes are complex, e.g.:

volume create -vserver vs0 -volume user_jdoe -aggregate aggr1 -state online -policy default_expolicy -user 165 -group 165 -junction-path /user/jdoe -size 250g -space-guarantee volume -percent-snapshot-space 20 -foreground false

Note the -user and -group parameters. These allow you to set ownership on the volume, so that when OpenStack mounts it you can lock it down to either the cinder or glance user.