So you need a management network quick-smart?

TripleO clouds can be deployed with an optional Management VLAN. You can use this to run Ansible playbooks, connect monitoring systems and generally manage your cloud, hence the name.

However, this has to be configured at deployment time. So what happens if you have a cloud that wasn’t deployed with a Management VLAN? You can use the provisioning network instead. The catch is that it doesn’t have fixed addresses, only dynamic ones. In practice these rarely change, so to perform a quick playbook run or a cluster-wide command with pdsh, for example, you can use OpenStack’s CLI to create a hosts file as follows:

openstack server list -f value --column Networks --column Name | sed 's/ ctlplane=/ /g' | awk '{ print $2 " " $1}'

This converts the output of your ironic nodes to a format you can cat into a hosts file.
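
If you then want to run something across every node, a rough sketch looks like the following – assuming pdsh is installed on the undercloud, the default heat-admin user, and illustrative hostnames:

openstack server list -f value --column Networks --column Name \
  | sed 's/ ctlplane=/ /g' | awk '{ print $2 " " $1}' | sudo tee -a /etc/hosts
# run an ad-hoc command on every overcloud node over the provisioning network
pdsh -l heat-admin -w overcloud-controller-[0-2],overcloud-novacompute-[0-2] uptime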

This lets you use an existing network rather than having to add your management node to another one (e.g. storage).

It’s not big, it’s not clever, but it does work.


Manually re-setting failed deployments with Ironic

OpenStack commands have some odd naming conventions sometimes – just take a look at the whole evacuate/host-evacuate debacle in nova for example – and ironic is no exception.

I’m currently using tripleo to deploy various environments which sometimes results in failed deployments. If you take into account all the vagaries of various ipmi implementations I think it does a pretty good job. Sometimes though, when a stack gets deleted, I’m left with something like the following:

[stack@undercloud ~]$ nova list
+----+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+----+------+--------+------------+-------------+----------+
+----+------+--------+------------+-------------+----------+

[stack@undercloud ~]$ ironic node-list
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+
| 447ffea5-ae3f-4796-bfba-ce44dd8a84b7 | compute4 | 26843ce8-e562-4945-ad32-b60504a5bca3 | power on    | deploy failed      | False       |
+--------------------------------------+----------+--------------------------------------+-------------+--------------------+-------------+

So an instance is still associated with the baremetal node.

In this case, it isn’t obvious but after some digging:

ironic node-set-provision-state compute4 deleted

should result in the node being set back to available. I’m still not clear whether this re-runs the cleaning steps, but it gives me what I want: a node I can re-deploy to.
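
To double-check before kicking off another run, something like this (compute4 being the node from the example above) should show the state flip back to available:

# confirm the node has returned to "available" before re-deploying
ironic node-show compute4 | grep -i provision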

Shellinabox and serial consoles

TripleO is in fairly dire need of something similar to conserver/wcons/rcons in xCAT, just so you can see what the heck the node’s console is doing without having to fire up your out-of-band web interface, log in and launch the web console, and that’s *if* you have the license for it.

CLI console access in Ironic is currently under development after I filed an RFE:

https://bugs.launchpad.net/ironic/+bug/1536572

but in the meantime I decided to try and get serial console access through shellinabox working.

It’s not too hard and the following is a good start:

http://docs.openstack.org/developer/ironic/deploy/install-guide.html#configure-node-web-console

The key thing to understand is the terminal_port value, which varies according to the IPMI driver.
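
For the ipmitool-based drivers, the rough sequence with the old ironic CLI looks like this; the node name and port number are just examples, and the port needs to be unique per node:

ironic node-update compute4 add driver_info/ipmi_terminal_port=8023
ironic node-set-console-mode compute4 true
# prints the shellinabox URL to point a browser at
ironic node-get-console compute4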

Once configured this gives a nice view with a decent amount of scroll-back.

It’s a pity all this is manual. I guess it would be fairly easy to script as part of an undercloud install to enable serial consoles, but perhaps it’s enough of a security risk that it shouldn’t be made quite so easy!

Red Hat OSP 7 Director custom hostnames

By default, nodes deployed with Director have a standard FQDN in the format:

overcloud-$nodetype-$nodenumber.localdomain

E.g.

overcloud-controller-2.localdomain

Which is fine, but it’s nice to customize this a bit, no?

To do so, edit the following undercloud files:

/etc/neutron/dhcp_agent.ini
/etc/nova/nova.conf

and create/change the following parameter:

dhcp_domain = iaas.local (or whatever you prefer)
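
If you prefer not to edit the files by hand, the same change can be sketched with crudini (assuming it is installed on the undercloud); restart the relevant neutron and nova services afterwards for it to take effect:

sudo crudini --set /etc/neutron/dhcp_agent.ini DEFAULT dhcp_domain iaas.local
sudo crudini --set /etc/nova/nova.conf DEFAULT dhcp_domain iaas.local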

You also need to edit the deployment parameter:

CloudDomain:
default: 'iaas.local'

in overcloud-without-mergepy.yaml

NB: It looks like this only applies to OSP 7; OSP 8 has some changes to make this easier.

When is a link actually, y’know, UP?

TL;DR Use device names such as eno1, enp7s0 rather than nic1, nic2

I’ve been chasing an issue with a TripleO-based installation whereby the nodes were provisioning but failing to configure networking correctly.

Debugging TripleO deployments is fiendishly hard, and this was made more complex by being unable to connect to the failed nodes. Deployed TripleO nodes only allow key-based ssh authentication. It’s great to see security being so good that even the sysadmin can’t access the node, I guess.

If you want to login to a node at the console, you basically have to roll your own deployment image. I was on the verge of heading down this route when I considered the following:

TripleO deployments have two methods of specifying usable, active network interfaces. The first (and, unfortunately, the default) is to number them nic1, nic2, nic3 etc. in the config. This relies on some logic in os-net-config to determine which links are actually connected to switches.

This would be fine on most machines but sadly the hardware I’m working on has a built-in ethernet-over-usb device for out of band access. This reports itself as having a link (for reasons unknown, maybe a link to the usb interface?) and therefore fulfills the following criteria:

  1. Not the local loopback
  2. Has an address
  3. Reports carrier signal as active in /sys/class/net/<device>/carrier
  4. Has a subdirectory of device information in /sys/class/net/<device>/device/

Even though the ip command reports its link state as UNKNOWN, os-net-config’s logic concluded that this was the management NIC and attempted to configure it as such, obviously to no avail.
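
For the curious, a rough way to see which interfaces would pass those tests on your own hardware (device names will obviously differ) is something like:

for dev in /sys/class/net/*; do
  echo "$(basename $dev): carrier=$(cat $dev/carrier 2>/dev/null) device-dir=$([ -d $dev/device ] && echo yes || echo no)"
done
# the ethernet-over-usb device happily reports carrier=1 even though ip shows its state as UNKNOWN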

Happily this has resulted in my first OpenStack patch:

https://review.openstack.org/#/c/291243

which may even get accepted.

Red Hat Director continued…

It’s been an interesting start to the year. Upcoming work is calling for Liberty-based deployments, so I have been looking at using Red Hat’s Director to deploy these. Because it’s basically a glorified Heat template (albeit an extremely complicated one), it’s ideal for customization for individual environments, because no cloud is the same, right?

But that involves actually getting yourself to a working system in the first place when you’re needing to orchestrate storage, networking, HA/pacemaker and so on.

The undercloud installation, as referenced previously, was pretty much a doddle. The snag came when I attempted to deploy the “overcloud”, or main production cloud. Nodes would fail to deploy to the hardware, but in utterly random fashion: for example, in a 3-controller and 3-compute setup, one deployment run would result in 2 controllers and 1 compute node successfully configured, while a second run would result in 1 controller and 3 compute nodes deployed.

So naturally you start by simplifying the setup as much as possible, removing complex storage and networking options. This didn’t improve matters so I tested with a standard OS deployment which worked fine. Changes to partitioning, UEFI configs, firmware amongst others all drew a blank.

Finally I decided to hack the boot config of the deployment images for the nodes (stored in /httpboot) to output debug info to console:

find /httpboot/ -name config -print0 | xargs -0 sed -i 's/troubleshoot=0/troubleshoot=1/g'
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/console=ttyS0 //g'

which took a few runs, but I eventually spotted the blink-and-you’ll-miss-it error:

[Screenshot: dracut error on the console during OSP 7.2 deployment]

Specifically, “failed to mount root partition”. Which then led me to the following bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1296330

This explained why it had been working fine in my RDO environment, of course. It’s a pity this was missed in Red Hat’s testing, but it was a useful learning experience, going round the houses with heat, ipmi, ipxe, uefi, kickstart partitioning and so on. The workaround has been to revert to Director’s 7.1 deployment image, as I guess this is a fairly weird and wonderful problem. Future releases will feature a different deployment image, so apparently this will no longer be a problem.

All told this took up about two weeks engineering time which goes to show that the random now-you-see-it-now-you-don’t problems are the hardest to fix. Onwards and upwards.

Red Hat Director, The Undercloud

This week I’ve been starting to get to grips with Red Hat’s Director cloud deployment tool. It leverages Ironic to provision baremetal machines and introduces the concept of an undercloud and an overcloud. There are essentially two clouds: the undercloud is a basic OpenStack environment with just the tools needed to get the main job done, while the overcloud is the cloud your users interact with and run their whatever on.

It’s clear some serious engineering time has gone into trying to make this as easy as possible, and the good news is that so far, it seems to be working well. Installation of the undercloud was as simple as defining a few variables like network range, interface, etc. I had tried RDO Manager (Director’s upstream product) and had a fairly torrid time. They were going through some major infrastructure changes at the time, however, so perhaps that was part of the problem. Meh.

Anyway, pretty picture time.

[Screenshot from 2015-12-17: the freshly installed undercloud]

So now the various roles are ready to be deployed. There are the usual compute and control roles as well as Ceph (no surprise as this is a Red Hat product), Cinder and Swift.
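
Kicking off a deployment of those roles then boils down to something like the following; the scale flags are from the OSP 7 era and the counts are illustrative:

openstack overcloud deploy --templates \
  --control-scale 3 --compute-scale 3 --ceph-storage-scale 3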

Initial Overcloud deployments haven’t completed yet – this is more due to me being a tool and not following the instructions rather than any particular bug in the software.

Gotchas:

The biggest issue so far has been hardware related. I’ve been using an ancient Nortel (yes, remember them!) switch which was taking an eon to bring up network links, I think due to a buggy STP (Spanning Tree Protocol) implementation. Director uses iPXE rather than PXELINUX for some reason (UEFI maybe?) and although it downloaded the NBP file fine, when it came to getting a DHCP lease it completely timed out. It was only when I attached the second interface to try and boot from that that it became apparent the link was taking a long time to come up. So I guess STP was at fault here, but I never debugged it; I just replaced the switch (with a good old HP ProCurve) and it worked fine.

In the course of the above issue I’ve learnt plenty of things about PXE booting, like how when the logs say "error 8 User aborted the transfer" you can actually ignore it because it’s normal, even if journalctl flags it in big red letters. Apparently this is an initial check to see what protocols the client supports before initiating the download proper.

Other problems included discovering that there needs to be a default flavor called "baremetal" so that unassigned nodes know where to live, and that SELinux must be enabled – RDO Manager/Director won’t actually install if it is disabled.
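
For reference, creating that flavor is a one-liner; the sizes here are placeholders and should match your smallest node:

openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 baremetal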