It’s been an interesting start to the year. Upcoming work calls for Liberty-based deployments, so I have been looking at using Red Hat’s Director to deploy them. Because it’s basically a glorified Heat template (albeit an extremely complicated one), it’s ideal for customizing to individual environments because no cloud is the same, right?
But that means first getting to a working system, which requires orchestrating storage, networking, HA/Pacemaker and so on.
The undercloud installation, as referenced previously, was pretty much a doddle. The snag came when I attempted to deploy the “overcloud”, the main production cloud. Nodes kept failing to deploy to the hardware, and which ones failed was utterly random. For example, in a 3 controller and 3 compute setup, one deployment run would result in 2 controllers and 1 compute node successfully configured. A second run would then result in 1 controller and 3 compute nodes deployed.
So naturally you start by simplifying the setup as much as possible, removing complex storage and networking options. This didn’t improve matters, so I tested with a standard OS deployment, which worked fine. Changes to partitioning, UEFI configs and firmware, amongst others, all drew a blank.
Finally I decided to hack the boot config of the deployment images for the nodes (stored in /httpboot) to output debug info to console:
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/troubleshoot=0/troubleshoot=1/g'
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/console=ttyS0 //g'
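For repeat runs, the two substitutions can be wrapped in a small shell function. This is only a sketch: the name enable_debug_console is made up for illustration, the directory argument defaults to /httpboot as above, and it assumes GNU find, xargs and sed. Unlike the one-liners, it also keeps a .bak copy of each file it touches.

```shell
#!/bin/sh
# enable_debug_console [DIR]
# Hypothetical helper, not part of Director or Ironic.
# Rewrites every file named "config" under DIR (default: /httpboot) to
# turn on troubleshoot output and drop the serial console argument.
enable_debug_console() {
    dir="${1:-/httpboot}"
    # -print0 / -0 cope with odd characters in paths; -r (GNU xargs)
    # skips running sed when find matches nothing.
    # -i.bak edits in place and leaves the original as config.bak.
    find "$dir" -type f -name config -print0 |
        xargs -0 -r sed -i.bak \
            -e 's/troubleshoot=0/troubleshoot=1/g' \
            -e 's/console=ttyS0 //g'
}
```

Running enable_debug_console with no argument makes the same edits as the two commands above in a single pass, and the .bak files make it easy to put things back afterwards.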
It took a few runs, but I eventually spotted the blink-and-you’ll-miss-it error:
Specifically, “failed to mount root partition”, which then led me to the following bug:
This explained why it had been working fine in my RDO environment, of course. It’s a pity this was missed in Red Hat’s testing, but it was a useful learning experience, going round the houses with Heat, IPMI, iPXE, UEFI, kickstart partitioning and so on. I guess this is a fairly weird and wonderful problem; the workaround has been to revert to Director’s 7.1 deployment image. Future releases will apparently feature a different deployment image, so this should no longer be a problem.
All told, this took up about two weeks of engineering time, which goes to show that the random now-you-see-it-now-you-don’t problems are the hardest to fix. Onwards and upwards.