OpenStack, Heat and HAProxy

I’ve had an interesting experience debugging (or failing to debug) an issue in RHEL OSP 10.

Stacks were failing to complete:

status_reason: |
Error: resources.repo_definition_repovol_attach: Failed to attach volume x to server y - Unknown Error (HTTP 504)

From the logs it was obvious that nova was booting the instance fine and cinder was attaching the volume ok, even after the stack create failed.

I’ve mostly done OpenStack deployments so debugging operations is new to me. Thankfully Red Hat has some really good people who are used to chasing errors through the system.

One of the support chaps chased this through the system and determined it was due to an HAProxy timeout (I had no idea HAProxy affected Heat in this way), so we bumped the settings:

timeout http-request 20s
timeout queue 2m
timeout connect 20s
timeout client 10m
timeout server 10m
timeout check 20s
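As a rough sketch of the change (the "before" values below are illustrative, not what OSP ships), the edit is just a few substitutions in the defaults block of haproxy.cfg. On a TripleO overcloud the real file is /etc/haproxy/haproxy.cfg on each controller:

```shell
# Sketch only: operate on a throwaway copy of the config.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
defaults
  timeout http-request 10s
  timeout connect 10s
  timeout client 1m
  timeout server 1m
EOF

# Bump the timeouts that matter for long-running API calls (GNU sed).
sed -i \
  -e 's/timeout client .*/timeout client 10m/' \
  -e 's/timeout server .*/timeout server 10m/' \
  -e 's/timeout connect .*/timeout connect 20s/' "$cfg"

grep timeout "$cfg"
# HAProxy is pacemaker-managed on OSP controllers, so restart it with
# something like: pcs resource restart haproxy-clone
```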

After applying this and restarting haproxy, the stack create completed. It turned out some instances were taking ~7 minutes to finish booting. My gut instinct is that there are storage or instance issues at play here. Seven minutes for a single instance to boot and attach its volumes is too long. But then I’m used to GPFS and Ceph…

UPDATE: Yeah, so both Ceph and GPFS/Spectrum Scale support copy-on-write, hence stuff happens VERY quickly. I just wasn’t used to traditional storage.


So you need a management network quick-smart?

TripleO deployments can include an optional management VLAN. You can use this to run Ansible playbooks, hook up monitoring systems and generally manage your cloud, hence the name.

However, this requires configuration at deployment time, so what happens if your cloud was deployed without a management VLAN? You can use the provisioning network instead. The catch is that its addresses are assigned dynamically rather than fixed. In practice they rarely change, so for a quick playbook run or a cluster-wide command with pdsh, you can use the OpenStack CLI to generate a hosts file as follows:

openstack server list -f value --column Networks --column Name | sed 's/ ctlplane=/ /g' | awk '{ print $2 " " $1}'

This converts the list of your Ironic-provisioned nodes into a format you can cat into a hosts file.

This lets you reuse an existing network and avoids having to add your management node to yet another one (e.g. storage).
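To illustrate with a made-up node (the name and ctlplane address here are hypothetical), the pipeline turns one line of server list output into a hosts-file entry:

```shell
# Simulated output of:
#   openstack server list -f value --column Networks --column Name
# The real command queries your overcloud; this just feeds one fake line in.
printf 'overcloud-controller-0 ctlplane=192.168.24.10\n' \
  | sed 's/ ctlplane=/ /g' \
  | awk '{ print $2 " " $1 }'
# → 192.168.24.10 overcloud-controller-0
```

Append that output to /etc/hosts on your management node and pdsh or Ansible can address the nodes by name.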

It's not big, it's not clever, but it does work.

When is a link actually, y’know, UP?

TL;DR: Use device names such as eno1 and enp7s0 rather than nic1, nic2, etc.
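For context, a hypothetical os-net-config fragment (the address and device name are made up for illustration) showing the explicit form, naming the kernel device directly instead of relying on nicN resolution:

```yaml
# /etc/os-net-config/config.yaml (illustrative values only)
network_config:
  - type: interface
    name: eno1          # explicit device name, not nic1
    use_dhcp: false
    addresses:
      - ip_netmask: 192.168.24.10/24
```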

I’ve been chasing an issue with a TripleO-based installation whereby the nodes were provisioning but failing to configure networking correctly.

Debugging TripleO deployments is fiendishly hard, and this was made more complex by being unable to connect to the failed nodes: deployed TripleO nodes only allow key-based ssh authentication. It’s great to see security being so good even the sysadmin can’t access the node, I guess.

If you want to login to a node at the console, you basically have to roll your own deployment image. I was on the verge of heading down this route when I considered the following:

TripleO deployments have two methods of specifying usable, active network interfaces. The first (and, unfortunately, the default) is to number them nic1, nic2, nic3, etc. in the config. This relies on logic in os-net-config to determine which links are actually connected to switches.

This would be fine on most machines, but sadly the hardware I’m working on has a built-in ethernet-over-USB device for out-of-band access. This reports itself as having a link (for reasons unknown, maybe a link to the USB interface?) and therefore fulfils the following criteria:

  1. Not the local loopback
  2. Has an address
  3. Reports carrier signal as active in /sys/class/net/<device>/carrier
  4. Has a subdirectory of device information in /sys/class/net/<device>/device/

Despite the ip command reporting the link state as UNKNOWN, the logic in os-net-config concluded that this was the management NIC and attempted to configure it as such, obviously to no avail.
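The heuristic can be sketched roughly like this (my own approximation, not os-net-config's actual code). Note how a fake sysfs tree containing an ethernet-over-USB device that reports carrier sails through all four checks:

```shell
# Rough approximation of the "is this NIC usable?" checks listed above.
# Takes a sysfs net directory (/sys/class/net on a real host).
list_active_nics() {
  sysfs=$1
  for path in "$sysfs"/*; do
    dev=$(basename "$path")
    [ "$dev" = "lo" ] && continue                               # 1. skip loopback
    [ -f "$path/address" ] || continue                          # 2. has an address
    [ "$(cat "$path/carrier" 2>/dev/null)" = "1" ] || continue  # 3. carrier active
    [ -d "$path/device" ] || continue                           # 4. has device info
    echo "$dev"
  done
}

# Fake sysfs tree: one real NIC plus an ethernet-over-USB management
# device that (unhelpfully) also reports carrier.
fake=$(mktemp -d)
mkdir -p "$fake/lo" "$fake/eno1/device" "$fake/usb0/device"
echo 'aa:bb:cc:dd:ee:01' > "$fake/eno1/address"; echo 1 > "$fake/eno1/carrier"
echo 'aa:bb:cc:dd:ee:02' > "$fake/usb0/address"; echo 1 > "$fake/usb0/carrier"

list_active_nics "$fake"
# prints eno1 and usb0: the USB device is indistinguishable by these checks
```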

Happily this has resulted in my first OpenStack patch, which may even get accepted.