When is a link actually, y’know, UP?

TL;DR: Use device names such as eno1 and enp7s0 rather than nic1, nic2.

I’ve been chasing an issue with a TripleO-based installation whereby the nodes were provisioning but failing to configure networking correctly.

Debugging TripleO deployments is fiendishly hard, and this was made more complex by not being able to connect to the failed nodes. Deployed TripleO nodes only allow key-based SSH authentication. It's great to see security so good that even the sysadmin can't access the node, I guess.

If you want to log in to a node at the console, you basically have to roll your own deployment image. I was on the verge of heading down this route when I considered the following:

TripleO deployments have two methods of specifying usable, active network interfaces. The first (and, unfortunately, the default) is to number them nic1, nic2, nic3 etc. in the config. This relies on some logic in os-net-config to determine which links are actually connected to switches.

This would be fine on most machines, but sadly the hardware I'm working on has a built-in ethernet-over-usb device for out-of-band access. This reports itself as having a link (for reasons unknown, maybe a link to the USB interface?) and therefore fulfills the following criteria:

  1. Not the local loopback
  2. Has an address
  3. Reports carrier signal as active in /sys/class/net/<device>/carrier
  4. Has a subdirectory of device information in /sys/class/net/<device>/device/

Even though the ip command reported the link state as UNKNOWN, the logic in os-net-config concluded that this was the management NIC and attempted to configure it as such, obviously to no avail.
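
For the curious, the selection heuristic boils down to something like the following. This is not the actual os-net-config code (which is Python); it's just the same four checks expressed as a rough bash sketch:

#!/bin/bash
# Rough approximation of the link-selection checks listed above.
for dev in /sys/class/net/*; do
    name=$(basename "$dev")
    [ "$name" = "lo" ] && continue                             # 1. skip the loopback
    [ -s "$dev/address" ] || continue                          # 2. must have a MAC address
    [ "$(cat "$dev/carrier" 2>/dev/null)" = "1" ] || continue  # 3. carrier reported as up
    [ -d "$dev/device" ] || continue                           # 4. backed by real device info
    echo "$name would be treated as an active, usable interface"
done

The ethernet-over-usb management device sails through all four checks, which is exactly the problem.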

Happily this has resulted in my first OpenStack patch:

https://review.openstack.org/#/c/291243

which may even get accepted.

Red Hat Director continued…

It's been an interesting start to the year. Upcoming work calls for Liberty-based deployments, so I have been looking at using Red Hat's Director to deploy them. Because it's basically a glorified Heat template (albeit an extremely complicated one), it's ideal for customizing to individual environments, because no cloud is the same, right?

But that involves actually getting yourself to a working system in the first place when you’re needing to orchestrate storage, networking, HA/pacemaker and so on.

The undercloud installation, as referenced previously, was pretty much a doddle. The snag came when I attempted to deploy the “overcloud”, or main production cloud. The nodes consistently failed to deploy to the hardware, but in an utterly random fashion. For example, in a 3-controller and 3-compute setup, one deployment run would result in 2 controllers and 1 compute node successfully configured. A second run would then result in 1 controller and 3 compute nodes deployed.

So naturally you start by simplifying the setup as much as possible, removing complex storage and networking options. This didn't improve matters, so I tested with a standard OS deployment, which worked fine. Changes to partitioning, UEFI configs and firmware, amongst others, all drew a blank.

Finally I decided to hack the boot config of the deployment images for the nodes (stored in /httpboot) to output debug info to console:

find /httpboot/ -name config -print0 | xargs -0 sed -i 's/troubleshoot=0/troubleshoot=1/g'
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/console=ttyS0 //g'

This took a few runs, but eventually I spotted the blink-and-you'll-miss-it error:

[Screenshot: dracut error on the console during deployment]

Specifically, “failed to mount root partition”. Which then led me to the following bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1296330

This explained why it had been working fine in my RDO environment, of course. It's a pity this was missed in Red Hat's testing, but it was a useful learning experience, going round the houses with Heat, IPMI, iPXE, UEFI, kickstart partitioning and so on. The workaround has been to revert to Director's 7.1 deployment image, as I guess this is a fairly weird and wonderful problem. Future releases will feature a different deployment image, so apparently this will no longer be an issue.

All told this took up about two weeks of engineering time, which goes to show that the random now-you-see-it-now-you-don't problems are the hardest to fix. Onwards and upwards.

Red Hat Director, The Undercloud

This week I've been starting to get to grips with Red Hat's Director cloud deployment tool. It leverages Ironic to provision bare-metal machines and introduces the concept of an undercloud and an overcloud. There are essentially two clouds: the undercloud is a basic OpenStack environment with just the tools needed to get the main job done, and the overcloud is the cloud your users interact with and run their whatever on.

It's clear some serious engineering time has gone into trying to make this as easy as possible, and the good news is that so far it seems to be working well. Installation of the undercloud was as simple as defining a few variables like the network range, interface etc. I had previously tried RDO Manager (Director's upstream project) and had a fairly torrid time. They were going through some major infrastructure changes at the time, however, so perhaps that was part of the problem. Meh.
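
The variables in question live in undercloud.conf before you run the installer. Purely as an illustration (the values are invented and the exact key names have shifted a little between releases, so check the docs for your version), it looks something like this:

cat > ~/undercloud.conf <<'EOF'
[DEFAULT]
local_interface = eth1
local_ip = 192.0.2.1/24
network_cidr = 192.0.2.0/24
network_gateway = 192.0.2.1
dhcp_start = 192.0.2.5
dhcp_end = 192.0.2.24
masquerade_network = 192.0.2.0/24
EOF
openstack undercloud install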

Anyway, pretty picture time.

[Screenshot from 2015-12-17]

So now the various roles are ready to be deployed. There are the usual compute and control, as well as Ceph (no surprise as this is a Red Hat product), Cinder and Swift.
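
Deploying those roles then comes down to a single, rather flag-heavy command. Something along these lines is where I'm heading; the scales and options shown are purely illustrative and have changed slightly between releases:

openstack overcloud deploy --templates \
  --control-scale 3 --compute-scale 3 --ceph-storage-scale 3 \
  --ntp-server pool.ntp.org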

Initial overcloud deployments haven't completed yet – this is more down to me being a tool and not following the instructions than to any particular bug in the software.

Gotchas:

The biggest issue so far has been hardware related. I've been using an ancient Nortel (yes, remember them!) switch that was taking an eon to bring up network links, I think due to a buggy STP (Spanning Tree Protocol) implementation. Director uses iPXE rather than PXELINUX for some reason (UEFI maybe?), and although it downloaded the NBP file fine, when it came to getting a DHCP lease it completely timed out. It was only when I attached the second interface to try and boot from that that it became apparent that the link was taking a long time to come up. So I guess STP was at fault here, but I never debugged it, just replaced the switch (with a good old HP ProCurve) and it worked fine.

In the course of the above issue I've learnt plenty about PXE booting, like how when the logs say “error 8 User aborted the transfer” you can actually ignore it because it's normal, even if journalctl flags it in big red letters. Apparently this is an initial check to see what protocols the client supports before initiating the download proper.

Other problems included discovering that there needs to be a default flavor called “baremetal” so unassigned nodes know where to live, and that SELinux must be enabled – RDO Manager/Director won't actually install if it is disabled.
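
Creating that flavor is quick. The sizing here is arbitrary and the capabilities property is the one the docs of this era suggest for local boot, so treat it as a sketch:

openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 baremetal
openstack flavor set --property "capabilities:boot_option"="local" baremetal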

Standing up clouds

One thing that is interesting to me is the sheer number of ways of getting your OpenStack cloud to an end product, and the fact that no one system has prevailed.

For working in development environments you have devstack and packstack. My favourite t-shirt slogan from Tokyo was “But it worked in devstack…”.

Moving to production you have Ansible, Salt, Puppet and now TripleO-based installers, each with their own offshoots. I am looking forward to working with RDO Manager and Director more, although RDO Manager (Director's upstream) seems to be highly fluid. I don't even understand how it all fits together. What is a “DeLorean instance”, for example?

Red Hat’s lineage is a case in point. They started out with Foreman/Staypuft/Astapor (which I’m currently in the process of helping to bring up to OSP 7 standards), then moved to Director. The recent purchase of Ansible means that it is not implausible to consider that future installers will be based on that tool. Ansible seemed to be the most commonly used tool at the summit but that is based purely on anecdotal evidence.

Then Ubuntu has MAAS and its “Juju charms” – they are supposed to “Just Work”, but during brief in-house trials they didn't. Probably my fault.

It would be nice if development efforts could be consolidated a bit. I guess that will come with product maturity but it really does need to come sooner rather than later.

More on Satellite 6.1 provisioning

Satellite 6.1 supposedly supports bonded network interfaces. If it does, we’ve yet to get it working.

To be fair, this is a slightly more complex setup, with two interfaces on separate cards heading to separate switches and using LACP for resilience, which is more involved than simple balancing. There are then a number of virtual interfaces hanging off this on separate VLANs.

However, even without the VLAN interfaces in play, a basic bond fails and we are then presented with a failure cascade:

  • Foreman (the part of Satellite that controls provisioning) decides to change the IP of the provisioning node (not sure why; a support case is currently open). This happens whatever you do. Update – see below
  • You then delete the node, which works fine but leaves a dangling entry in DHCP because the MAC address has changed
  • You then have to edit the lease file to remove the entry and restart the daemon (see the sketch just after this list)
  • Leave this too long and Satellite accepts fact uploads from the deleted node, adding it back into the Foreman database and preventing re-provisioning. You then have to run the DB removal commands alluded to in my previous post. Update – see below
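
Cleaning up the dangling lease looks something like this on the Satellite (the paths are those of a default install; back the file up first and delete whichever stale entry you find for the old MAC):

cp /var/lib/dhcpd/dhcpd.leases /root/dhcpd.leases.bak
vi /var/lib/dhcpd/dhcpd.leases     # remove the lease/host block for the old MAC address
systemctl restart dhcpd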

I have back-ported this fix:

http://projects.theforeman.org/issues/10607

which seems to prevent duplicate bonds getting created (another problem) but so far it hasn’t been an easy ride.

If you're looking at using Satellite 6.1 to provision bonded networks (these are RHEL 7.2 machines) then I would avoid it and use kickstart snippets instead, which is what we have done here, and we appear to be through the worst of the above.
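
For illustration only (the device names, bond options, VLAN ID and addresses are all invented), the snippet essentially writes out ifcfg files along these lines:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100"
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eno1 (and the same for the second slave)
DEVICE=eno1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-bond0.100 (one of these per VLAN)
DEVICE=bond0.100
VLAN=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
PREFIX=24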

Alternatively, consider later versions of Foreman with better bonding support?

Update: I have discovered that setting the following:

ignore_puppet_facts_for_provisioning=true

resolves the MAC address changing issue,

and

create_new_host_when_report_is_uploaded=false
create_new_host_when_facts_are_uploaded=false

resolves hosts getting re-added to the database after deletion.

Satellite 6 and Foreman re-provisioning

We are using Satellite 6 (specifically 6.1.3) to implement an OSP 7 solution at work. I have used Foreman previously and realised how sensitive it can be. It works with lots of components like DHCP, DNS and Postgres, but in the Red Hat Satellite product it pulls in even more, like Candlepin and Katello. For the full list see:

https://access.redhat.com/articles/1343683

Anyway, we hit a particular bug where, after deleting a node, Sat 6 wouldn't re-provision a node with the same name. This was a problem as we are testing re-provisioning on a frequent basis, so we started to run out of nodes on the cluster to test with.

I hunted in the usual places like DHCP lease files and BIND configuration, but there wasn't any evidence there, so it rapidly became obvious that some nasty database hackery would be required.

Sure enough, after filing a bug with Red Hat GSS, they came back and asked me to run the following from the foreman-rake console (which, it turns out, is just a Rails console with the Foreman models loaded):
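
You launch it on the Satellite box itself and paste the Ruby below at the resulting prompt (this bit is from memory, so treat it as a sketch):

foreman-rake console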

Host.where("name LIKE ?", "%compute%").collect(&:name)

which output all the nodes with “compute” in the name. Sure enough, the deleted nodes were listed, despite the delete task having run successfully in Foreman/Satellite. So the next instruction was to run:

Host.find_by_name("<insertnodenamehere>").destroy

which ran a bunch of stuff and cleared the legacy node data from the DB. The node then successfully re-provisioned. Because this doesn't appear to be documented, I'm listing it here so you can go and break your own cluster at will.

Enjoy.

Update: Clarified that it doesn’t clear the entire DB…

Removing orphaned instances when all else fails…

Working on OpenStack is complex, and working on older versions of OpenStack is even more complex. If your instance is spawning but the shared storage hosting the ephemeral disk or block storage oopses/offlines/panics, then you can be left with orphaned instances that exist in the database but nowhere else. You try to delete them using nova delete, but this doesn't work because OpenStack can't locate the files it wants to delete, and you get into a real mess.

Some articles indicate that all you need to do is run some variation on:

mysql -D nova -e "delete from instances where instances.uuid = '$uuid'"

but this is bad because it leaves all sorts of information relating to the VM in existence. It appears to have been fixed in later versions of OpenStack – Kilo hasn’t exhibited this problem yet – so what follows is Icehouse-specific, for those people still running this release.

Most of the database info I have stolen from the URL in the comments; I have just added input handling so you don't need to drop to a mysql prompt. Feed it your VM UUID and you're done. If you have reached this page then you're probably not in a great place, so the usual warnings about random bits of bash script on the internet apply. And remember that reset-state is your friend and you should have tried lots of other stuff first.

#!/bin/bash
# IMPORTANT - READ ME
# This is an Icehouse-specific script to remove an instance that is
# not consuming ANY resources, i.e. it only exists in the database.
# You need to be VERY sure of this fact before using it, so as not to
# leave disks orphaned or instances running. Use as a last resort after
# the nova delete and reset-state options have failed. Use nova show to
# inspect the libvirt XML prior to using.
# Source for db schema: https://raymii.org/s/articles/

read -p "Please enter the UUID of the VM you need to clear from the database: " uuid
mysql -D nova -e "select display_name from instances where instances.uuid = '$uuid'"
read -p "Are you sure this is the instance you are looking for? y/n: " response
if [ "$response" == "y" ]; then
    mysql -D nova -e "delete from instance_faults where instance_faults.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from instance_id_mappings where instance_id_mappings.uuid = '$uuid'"
    mysql -D nova -e "delete from instance_info_caches where instance_info_caches.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from instance_system_metadata where instance_system_metadata.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from security_group_instance_association where security_group_instance_association.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from block_device_mapping where block_device_mapping.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from fixed_ips where fixed_ips.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from instance_actions_events where instance_actions_events.action_id in (select id from instance_actions where instance_actions.instance_uuid = '$uuid')"
    mysql -D nova -e "delete from instance_actions where instance_actions.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from virtual_interfaces where virtual_interfaces.instance_uuid = '$uuid'"
    mysql -D nova -e "delete from instances where instances.uuid = '$uuid'"
    echo "Ok, done"
else
    echo "Quitting, no changes made"
fi

The future is back.

Yesterday I went to visit one of the most advanced robots in the world. On the way there, I was thinking about Back To The Future and how there had been several articles about how we had sort of, kind of succeeded or failed at that future, depending on each article's viewpoint.

As I travelled out to meet Asimo, the robot Honda are iteratively updating to create a home assistant, I thought about the card I had tapped onto a reader which automatically opened a gate to allow me to enter the station. It would do the same thing when I got off and charge me a defined amount for the privilege of transporting me across a city at the same time.

I looked out the front because I had sat at the front. This was because there was no driver as the entire system was automated. A display above the exit door told me how many minutes until I arrived at my destination, how many stops until I got there and which way to turn to exit the station when I got off. It told me a bullet train had hit an animal somewhere on the outskirts of Tokyo and there was a delay.

As the train left the next station, the sky went dark so I looked up and realised that the road was now above my head. I looked down and saw the ocean. I looked up again and saw a plane arc overhead and because the turn was so sharp, it seemed to hang in the sky.

I took a video of the train crossing over the sea on my phone and sped it up 8 times. I had used the same phone to have a video conversation with my wife whilst she sat at home in Sheffield, about 6000 miles away. As I got off the train it told me I had to walk 5 minutes and turn left and then I’d be where I wanted to be. This was due to satellites hovering miles out into space, amongst other things. It also told me I had taken 7006 steps that day so far. It told me a thousand other things but they were irrelevant at the time.

Because 5 minutes later I had bought a ticket with the same card I had used to get on the train, and a few minutes later a robot walked past me, entirely on its own, waved, and said hello.

Understanding salt errors

For a project I’m currently working on we use salt to manage configuration across the cluster. This is something I’ve had to learn quickly but thankfully it is reliable and robust … until something goes wrong.

Last week I hit the following error when trying to replicate a client’s setup in-house and running a manual salt-call on one of the nodes.

Rendering SLS 'base:service.keepalived.cluster' failed: Jinja variable list object has no element 0

I’m not a programmer. I once made a very average pass at Java but that was about a decade ago. So the stuff about list objects not having an element 0 wasn’t helpful and isn’t really very good error output. This is a fairly old version of salt (2014.7) so perhaps this has been addressed since.

You can turn on debugging when running salt calls with -l and a log level, e.g. debug, info, all etc.
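
In my case that meant something along these lines, with state.highstate standing in for whatever you are actually applying:

salt-call -l debug state.highstate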

The debug output indicated that it was running a DNS lookup just before the failure:

dig +short db-cluster-1.test.cluster A

The difference between the working setup and my failing salt run was that on mine the lookup returned only the hostname, whereas the working config returned the host's IP address as well.

Once the missing record was added to the BIND zone and the daemon restarted, the errors stopped.
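
For completeness, the fix amounted to adding the missing A record on the DNS server and reloading it. The IP address and zone file path below are invented for illustration:

vi /var/named/test.cluster.db      # add: db-cluster-1  IN  A  192.0.2.50  (and bump the serial)
rndc reload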

Configuring OpenStack to use jumbo frames (MTU 9000)

Controller nodes

Disable puppet:

# systemctl stop puppet
# systemctl disable puppet

Place a given controller into standby mode:

# pcs cluster standby $(hostname)

Update the MTU for all physical NICs being used by either provider or tenant networks:

# echo MTU=9000 >> /etc/sysconfig/network-scripts/ifcfg-eth0

Update the various Neutron-related configuration files:

Note that if tenant networks are being used then we need to allow for the encapsulation overhead of VXLAN and GRE, which is why the guest-facing values below are set to 8900 rather than 9000.

# echo "dhcp-option-force=26,8900" > /etc/neutron/dnsmasq-neutron.conf
# openstack-config --set /etc/neutron/dhcp_agent.ini DEFAULT dnsmasq_config_file /etc/neutron/dnsmasq-neutron.conf
# openstack-config --set /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini agent veth_mtu 8900
# openstack-config --set /etc/neutron/l3_agent.ini DEFAULT network_device_mtu 9000
# openstack-config --set /etc/nova/nova.conf DEFAULT network_device_mtu 9000

Reboot to ensure everything persists.

# reboot

Unstandby the node and repeat on the remaining controllers:

# pcs cluster unstandby $(hostname)

Compute nodes

Disable puppet:

# systemctl stop puppet
# systemctl disable puppet

Update the MTU for all physical NICs being used by either provider or tenant networks:

# echo MTU=9000 >> /etc/sysconfig/network-scripts/ifcfg-eth0

Update the OVS plugin configuration file:

# openstack-config --set /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini agent veth_mtu 8900
# openstack-config --set /etc/nova/nova.conf DEFAULT network_device_mtu 9000

Reboot to ensure everything persists.

# reboot
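
Once a node is back up it's worth confirming that the MTU actually took and that jumbo frames survive the trip end to end. The interface name and gateway IP here are just examples; 8972 is 9000 minus the 28 bytes of IP and ICMP headers:

# ip link show eth0 | grep mtu
# ping -M do -s 8972 -c 3 192.0.2.1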

Source: https://access.redhat.com/solutions/1417133