When is a link actually, y’know, UP?

TL;DR Use device names such as eno1, enp7s0 rather than nic1, nic2

I’ve been chasing an issue with a TripleO-based installation whereby the nodes were provisioning but failing to configure networking correctly.

Debugging TripleO deployments is fiendishly hard and this was made more complex by being unable to connect to the failed nodes. Deployed TripleO nodes only allow key-based ssh authentication. It’s great to see security being so good even the sysadmin can’t access the node I guess.

If you want to log in to a node at the console, you basically have to roll your own deployment image. I was on the verge of heading down this route when I considered the following:

TripleO deployments have two methods of specifying usable, active network interfaces. The first (and unfortunately, the default) is to number them nic1, nic2, nic3 and so on in the config. This means os-net-config has to apply some logic of its own to determine which links are actually connected to switches. The second is to use the real device names.

This would be fine on most machines but sadly the hardware I’m working on has a built-in ethernet-over-usb device for out of band access. This reports itself as having a link (for reasons unknown, maybe a link to the usb interface?) and therefore fulfills the following criteria:

  1. Not the local loopback
  2. Has an address
  3. Reports carrier signal as active in /sys/class/net/<device>/carrier
  4. Has a subdirectory of device information in /sys/class/net/<device>/device/

Despite reporting the link state as UNKNOWN in the ip command, this meant that the logic of os-net-config concluded that this was the management nic and attempted to configure it as such, obviously to no avail.
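The four criteria above can be approximated in a few lines of shell. This is only a sketch of the checks as described here, not os-net-config's actual code; the function names are made up for illustration, and it assumes "has an address" refers to the hardware address file under /sys/class/net/<device>/address. The second argument lets you point it at a different sysfs root for testing.

```shell
# Sketch only: approximates the four checks listed above.
# looks_active NAME [SYSFS_ROOT]
# Returns 0 if the interface passes all four checks, 1 otherwise.
looks_active() {
    local name=$1
    local root=${2:-/sys/class/net}
    local dev="$root/$name"

    # 1. Not the local loopback
    [ "$name" != "lo" ] || return 1
    # 2. Has a non-empty hardware address (assumed meaning of "has an address")
    [ -n "$(cat "$dev/address" 2>/dev/null)" ] || return 1
    # 3. Carrier is reported as active; a failed read counts as no carrier
    [ "$(cat "$dev/carrier" 2>/dev/null)" = "1" ] || return 1
    # 4. Has a device/ subdirectory
    [ -d "$dev/device" ] || return 1
    return 0
}

# link_operstate NAME [SYSFS_ROOT]
# Prints the kernel's operstate for the interface ("up", "down", "unknown", ...).
link_operstate() {
    cat "${2:-/sys/class/net}/$1/operstate" 2>/dev/null
}
```

Running looks_active against each entry in /sys/class/net on the affected hardware would show the USB device passing all four checks, while link_operstate would still print "unknown" for it, which is the mismatch described above.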

Happily this has resulted in my first OpenStack patch:


which may even get accepted.


Red Hat Director continued…

It’s been an interesting start to the year. Upcoming projects are calling for Liberty-based deployments so I have been looking at using Red Hat’s Director to deploy these. Because it’s basically a glorified Heat template (albeit an extremely complicated one), it’s ideal for customization for individual environments because no cloud is the same, right?

But that involves actually getting yourself to a working system in the first place when you’re needing to orchestrate storage, networking, HA/pacemaker and so on.

The undercloud installation, as referenced previously, was pretty much a doddle. The snag came when I attempted to deploy the “overcloud” or main production cloud. The nodes failed to deploy consistently to the hardware, and did so in utterly random fashion. For example, in a 3 controller and 3 compute setup, one deployment run would result in 2 controllers and 1 compute node successfully configured. A second run would then result in 1 controller and 3 compute nodes deployed.

So naturally you start by simplifying the setup as much as possible, removing complex storage and networking options. This didn’t improve matters so I tested with a standard OS deployment which worked fine. Changes to partitioning, UEFI configs, firmware amongst others all drew a blank.

Finally I decided to hack the boot config of the deployment images for the nodes (stored in /httpboot) to output debug info to console:

find /httpboot/ -name config -print0 | xargs -0 sed -i 's/troubleshoot=0/troubleshoot=1/g'
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/console=ttyS0 //g'

It took a few runs but I eventually spotted the blink-and-you’ll-miss-it error:


Specifically, “failed to mount root partition”. Which then led me to the following bug:


This explained why it had been working fine in my RDO environment of course. It’s a pity this was missed in Red Hat’s testing but it was a useful learning experience, going round the houses with heat, ipmi, ipxe, uefi, kickstart partitioning and so on. The workaround has been to revert to Director’s 7.1 deployment image as I guess this is a fairly weird and wonderful problem. Future releases will feature a different deployment image so apparently this will no longer be a problem.

All told this took up about two weeks of engineering time which goes to show that the random now-you-see-it-now-you-don’t problems are the hardest to fix. Onwards and upwards.

Red Hat Director, The Undercloud

This week I’ve been starting to get to grips with Red Hat’s Director cloud deployment tool. It leverages Ironic to provision baremetal machines and introduces the concept of an undercloud and overcloud – there are essentially two clouds, the undercloud is a basic OpenStack environment with just the tools needed to get the main job done. The Overcloud is the cloud your users interact with and run their whatever on.

It’s clear some serious engineering time has gone into trying to make this as easy as possible and the good news is that so far, it seems to be working well. Installation of the undercloud was as simple as defining a few variables like network range, interface etc. I had tried RDO Manager (Director’s upstream product) and had a fairly torrid time. They were going through some major infrastructure changes at the time however so perhaps that was part of the problem. Meh.

Anyway, pretty picture time.

Screenshot from 2015-12-17 12-59-29

So now there are various roles ready to be deployed. There are the usual compute and control as well as Ceph (no surprise as this is a Red Hat product), Cinder and Swift.

Initial Overcloud deployments haven’t completed yet – this is more due to me being a tool and not following the instructions rather than any particular bug in the software.


The biggest issue so far has been hardware related. The ancient Nortel (yes, remember them!) switch I’ve been using was taking an eon to bring up network links, I think due to a buggy STP (Spanning Tree Protocol) implementation. Director uses iPXE rather than PXELINUX for some reason (UEFI maybe?) and although it downloaded the NBP file fine, when it came to getting a DHCP lease, it completely timed out. It was only when I attached the second interface to try and boot from that instead that it became apparent that the link was taking a long time to come up. So I guess STP was at fault here, but I never debugged it; I just replaced the switch (with a good old HP ProCurve) and it worked fine.

In the course of the above issue I’ve learnt plenty of things about PXE booting, like how when the logs say “error 8 User aborted the transfer” you can actually ignore it because it’s normal, even if journalctl flags it in big red letters. Apparently this is an initial check to see what protocols the client supports before initiating the download proper.

Other problems included discovering that there needs to be a default flavor called “baremetal” so unassigned nodes know where to live and enabling SELinux – RDO Manager/Director won’t actually install if it is disabled.

Standing up clouds

One thing that is interesting for me is the sheer number of ways of getting your OpenStack cloud to an end product and the way that no one system has prevailed.

For working in development environments you have devstack and packstack. My favourite t-shirt slogan from Tokyo was “But it worked in devstack…”.

Moving to production you have Ansible, Salt, Puppet and now TripleO-based installers, each with their own offshoots. I am looking forward to working with RDO-Manager and Director more, although RDO Manager (Director’s upstream) seems to be highly fluid. I don’t even understand how it all fits together. What is a “DeLorean instance” for example?

Red Hat’s lineage is a case in point. They started out with Foreman/Staypuft/Astapor (which I’m currently in the process of helping to bring up to OSP 7 standards), then moved to Director. The recent purchase of Ansible means that it is not implausible to consider that future installers will be based on that tool. Ansible seemed to be the most commonly used tool at the summit but that is based purely on anecdotal evidence.

Then Ubuntu has MAAS and its “juju charms” – it is supposed to “Just Work” but during brief in-house trials it didn’t. Probably my fault.

It would be nice if development efforts could be consolidated a bit. I guess that will come with product maturity but it really does need to come sooner rather than later.

More on Satellite 6.1 provisioning

Satellite 6.1 supposedly supports bonded network interfaces. If it does, we’ve yet to get it working.

To be fair, this is a slightly more complex setup: two interfaces on separate cards heading to separate switches, using LACP for resilience, which is more complex than simple balancing. There are then a number of virtual interfaces hanging off this on separate VLANs.

However even without the VLAN interfaces in play, a basic bond fails and we are then presented with a failure cascade.

  • Foreman (the part of Satellite that controls provisioning) decides to change the IP of the provisioning node, not sure why, support case currently open. This happens whatever you do. Update – see below
  • You then delete the node which runs fine but leaves a dangling entry in DHCP because the MAC address has changed
  • You then have to edit the lease file to remove the entry and restart the daemon
  • Leave this too long and Satellite accepts fact uploads from the deleted node, adding it back into the Foreman database and preventing re-provisioning. You then have to run the DB removal commands alluded to in my previous post. Update – see below
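
The lease file edit in the third step can be scripted rather than done by hand. The following is a rough sketch, not anything Satellite ships: the function name is made up, and it assumes an ISC dhcpd lease file in which entries are "lease ... {" or "host ... {" blocks that close with a "}" at the start of a line. It prints the file with every block mentioning the given string (a MAC address or a hostname) removed.

```shell
# strip_dhcp_entries NEEDLE LEASE_FILE
# Prints LEASE_FILE to stdout, minus any lease/host block containing NEEDLE
# (matched case-insensitively). Lines outside blocks are passed through.
strip_dhcp_entries() {
    awk -v needle="$1" '
        /^(lease|host) / { inblock = 1; buf = ""; drop = 0 }
        inblock {
            buf = buf $0 "\n"
            if (index(tolower($0), tolower(needle))) drop = 1
            if ($0 ~ /^}/) {
                if (!drop) printf "%s", buf
                inblock = 0
            }
            next
        }
        { print }
    ' "$2"
}
```

In practice you would stop the DHCP daemon, write the filtered output to a temporary file, compare it with the original, move it into place and start the daemon again. The lease file location (likely /var/lib/dhcpd/dhcpd.leases on RHEL) is an assumption to check on your own system.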

I have back-ported this fix:


which seems to prevent duplicate bonds getting created (another problem) but so far it hasn’t been an easy ride.

If you’re looking at using Satellite 6.1 to provision bonded networks (these are RHEL 7.2 machines) then I would avoid its built-in bonding support and use snippets instead, which is what we have done here; we appear to be through the worst of the above.

Alternatively consider later versions of foreman with better bonding support?

Update: I have discovered that setting the following:


resolved the MAC address changing issue



resolves hosts getting re-added to the database after deletion.

Satellite 6 and Foreman re-provisioning

We are using Satellite 6 (specifically 6.1.3) to implement an OSP 7 solution at work. I have used Foreman previously and realised how sensitive it can be. It works with lots of components like DHCP, DNS and Postgres but in the Red Hat Satellite product it pulls in even more like Candlepin and Katello. For the full list see:


Anyway, we hit a particular bug where after deleting a node, Sat 6 wouldn’t re-provision a node with the same name. This was a problem as we are testing re-provisioning on a frequent basis so we started to run out of nodes on the cluster to test with.

I hunted in the usual places like dhcp lease files and bind configuration databases but there wasn’t any evidence there so it rapidly became obvious that some nasty database hackery would be required.

Sure enough, after filing a bug with Red Hat GSS, they came back and asked me to run the following from the foreman-rake console (whatever that is) :

Host.where("name LIKE ?", "%compute%").collect(&:name)

which outputted all the nodes with the name compute in the title. Sure enough, the deleted nodes were listed, despite the delete task having run successfully in foreman/Satellite. So the next instruction was to run:


which ran a bunch of stuff and cleared the DB of the legacy node data. The node then successfully re-provisioned. So because this doesn’t appear to be documented I’m listing this here so you can go and break your own cluster at will.


Update: Clarified that it doesn’t clear the entire DB…

Configuring OpenStack to use jumbo frames (MTU 9000)

Controller nodes

Disable puppet :

# systemctl stop puppet
# systemctl disable puppet

Place a given controller into standby mode :

# pcs cluster standby $(hostname)

Update the MTU for all physical NICs being used by either provider or tenant networks :

# echo MTU=9000 >> /etc/sysconfig/network-scripts/ifcfg-eth0
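
Note that appending with >> adds a second MTU= line if the file already has one. A small helper can make the change safe to repeat. This is a sketch rather than part of the Red Hat procedure; the function name is made up, and it assumes GNU sed (for -i) and an ifcfg file that ends with a newline.

```shell
# set_ifcfg_mtu FILE MTU
# Replaces an existing MTU= line in FILE, or appends one if none exists,
# so running it twice leaves exactly one MTU= line.
set_ifcfg_mtu() {
    local file=$1
    local mtu=$2
    if grep -q '^MTU=' "$file"; then
        sed -i "s/^MTU=.*/MTU=$mtu/" "$file"
    else
        echo "MTU=$mtu" >> "$file"
    fi
}
```

For example, as root: set_ifcfg_mtu /etc/sysconfig/network-scripts/ifcfg-eth0 9000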

Update the various Neutron related configuration files :

Note that if tenant networks are being used then we need to allow for the overhead of VXLAN and GRE.

# echo "dhcp-option-force=26,8900" > /etc/neutron/dnsmasq-neutron.conf
# openstack-config --set /etc/neutron/dhcp_agent.ini DEFAULT dnsmasq_config_file /etc/neutron/dnsmasq-neutron.conf
# openstack-config --set /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini agent veth_mtu 8900
# openstack-config --set /etc/neutron/l3_agent.ini DEFAULT network_device_mtu 9000
# openstack-config --set /etc/nova/nova.conf DEFAULT network_device_mtu 9000
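
The 8900 used for the instance-facing MTU leaves 100 bytes of headroom below the physical 9000. As a rough illustration of where the overhead comes from, the sketch below subtracts the commonly cited IPv4 encapsulation overheads (50 bytes for VXLAN, 42 for GRE). Those two figures and the names used are assumptions for illustration and do not come from the Red Hat article; 8900 is simply a round value that is safely below either result.

```shell
# Commonly cited per-packet overheads for tunnels over IPv4 (assumed values).
VXLAN_OVERHEAD=50   # outer IP 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
GRE_OVERHEAD=42     # outer IP 20 + GRE with key 8 + inner Ethernet 14

# tenant_mtu PHYSICAL_MTU OVERHEAD
# Prints the largest MTU an instance can use without fragmentation.
tenant_mtu() {
    echo $(( $1 - $2 ))
}
```

With a 9000-byte physical MTU, tenant_mtu 9000 "$VXLAN_OVERHEAD" prints 8950 and tenant_mtu 9000 "$GRE_OVERHEAD" prints 8958.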

Reboot to ensure everything persists.

# reboot

Unstandby the node and repeat on the remaining controllers :

# pcs cluster unstandby $(hostname)

Compute nodes

Disable puppet :

# systemctl stop puppet
# systemctl disable puppet

Update the MTU for all physical NICs being used by either provider or tenant networks :

# echo MTU=9000 >> /etc/sysconfig/network-scripts/ifcfg-eth0

Update the OVS plugin configuration file :

# openstack-config --set /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini agent veth_mtu 8900
# openstack-config --set /etc/nova/nova.conf DEFAULT network_device_mtu 9000

Reboot to ensure everything persists.

# reboot
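
Once the nodes are back up, one way to confirm that jumbo frames really pass end to end is a ping with the don't-fragment bit set. This check is not part of the Red Hat procedure; it is a sketch assuming IPv4 and the iputils ping found on RHEL (the -M do option), and the function names are made up for illustration.

```shell
# icmp_payload_for_mtu MTU
# Prints the largest ICMP payload that fits in one packet of size MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# check_path_mtu HOST MTU
# Sends three pings of exactly MTU bytes with fragmentation prohibited.
# Fails if any hop on the path cannot carry the full-size packet.
check_path_mtu() {
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$2")" "$1"
}
```

For a 9000-byte MTU the payload works out to 8972 bytes, so check_path_mtu <peer address> 9000 runs ping -M do -c 3 -s 8972 against the peer. The switch ports in between also need jumbo frames enabled for this to succeed.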

Source: https://access.redhat.com/solutions/1417133

mtftar and data recovery where ntbackup.exe fails…

I have a client who has a huge (160GB) binary blob of a backup file ending in .bkf which wouldn’t be terribly interesting except that their old hardware supplier managed to nuke about 100GB of data at some point and this is the only possible backup.

Problem: Windows ntbackup.exe won’t open it, reporting it as corrupt.

Possible answer number 1: Pay USD $89 for some weird proprietary commercial app (google for bkf file recovery) which seems to be able to read the file, as it shows the relevant data

Correct answer number 2: cp the file across to a Linux partition, install mtftar (it needs compiling unfortunately, as no distros seem to package it) then recover the required files. Cashback!

The open source, open standard shipping container…

Meet Malcolm McLean. An early open source pioneer. Watched an interesting documentary on how this man’s invention revolutionised the world. Most notable is the way he gave the patents he had been awarded to the industry. FTA:

Believing that standardization was also the path to overall industry growth, McLean chose to make his patents available by issuing a royalty-free lease to the International Organization for Standardization.

This meant that all ports the world over work to the same standard and can handle any container from any port.

Which is nice.

Symantec and train-wreck of a website

Symantec. You took over Veritas just under six years ago. That is enough time to merge websites and fix broken links. Oh, no, maybe it isn’t.

Symantec – Just try finding service packs for Backup Exec 10d. Go on.

I’m currently using http://www.currybeast.com to get the bits instead. Ugh.