Red Hat Director continued…

It’s been an interesting start to the year. Upcoming projects call for Liberty-based deployments, so I have been looking at using Red Hat’s Director to deploy them. Since it’s basically a glorified Heat template (albeit an extremely complicated one), it’s ideal for customization to individual environments. After all, no two clouds are the same, right?

But that involves actually getting to a working system in the first place, when you need to orchestrate storage, networking, HA/Pacemaker and so on.

The undercloud installation, as referenced previously, was pretty much a doddle. The snag came when I attempted to deploy the “overcloud”, or main production cloud. The nodes would not deploy consistently to the hardware, and the failures were utterly random. For example, in a 3 controller and 3 compute setup, one deployment run would result in 2 controllers and 1 compute node successfully configured. A second run would then result in 1 controller and 3 compute nodes deployed.

So naturally you start by simplifying the setup as much as possible, removing complex storage and networking options. This didn’t improve matters, so I tested with a standard OS deployment, which worked fine. Changes to partitioning, UEFI configs and firmware, amongst others, all drew a blank.

Finally I decided to hack the boot config of the deployment images for the nodes (stored in /httpboot) to output debug info to console:

find /httpboot/ -name config -print0 | xargs -0 sed -i 's/troubleshoot=0/troubleshoot=1/g'
find /httpboot/ -name config -print0 | xargs -0 sed -i 's/console=ttyS0 //g'
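For anyone repeating this, the two edits above can be wrapped in a small helper that keeps a backup of each file before changing it. This is only a sketch: the function name enable_pxe_debug and the .orig backup suffix are my own inventions, not anything Director provides, and it assumes GNU sed (for -i) and paths without embedded newlines.

```shell
#!/bin/sh
# Sketch: apply the same two substitutions to every "config" file under a
# root directory, saving a .orig copy of each file the first time it is touched.
# enable_pxe_debug and the .orig suffix are hypothetical names for illustration.
enable_pxe_debug() {
    root="${1:-/httpboot}"
    find "$root" -name config -type f | while IFS= read -r cfg; do
        # Keep the pristine file once so the change can be reverted later.
        [ -e "$cfg.orig" ] || cp -p "$cfg" "$cfg.orig"
        # Turn on troubleshooting output and drop the serial console argument.
        sed -i -e 's/troubleshoot=0/troubleshoot=1/g' \
               -e 's/console=ttyS0 //g' "$cfg"
    done
}
```

Calling enable_pxe_debug /httpboot makes the same changes as the two commands above, and copying each config.orig back over its config undoes them.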

Spotting the problem took a few deployment runs, but eventually I caught the blink-and-you’ll-miss-it error:

[Screenshot: osp_72_dracut_error]

Specifically, “failed to mount root partition”. Which then led me to the following bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1296330

This explained why it had been working fine in my RDO environment, of course. It’s a pity this was missed in Red Hat’s testing, but it was a useful learning experience, going round the houses with Heat, IPMI, iPXE, UEFI, kickstart partitioning and so on. The workaround has been to revert to Director’s 7.1 deployment image, as I guess this is a fairly weird and wonderful problem. Future releases will apparently feature a different deployment image, so this should no longer be a problem.

All told, this took up about two weeks of engineering time, which goes to show that the random now-you-see-it-now-you-don’t problems are the hardest to fix. Onwards and upwards.
