I’ve had an interesting experience debugging (or failing to debug) an issue in RHEL OSP 10.
Stacks were failing to complete:
Error: resources.repo_definition_repovol_attach: Failed to attach volume x to server y - Unknown Error (HTTP 504)
From the logs it was obvious that nova was booting the instance fine and cinder was attaching the volume ok, even after the stack create failed.
I’ve mostly done OpenStack deployments so debugging operations is new to me. Thankfully Red Hat has some really good people who are used to chasing errors through the system.
One of the support chaps chased this through the system and determined it was due to a HAProxy timeout (I had no idea HAProxy affected Heat in this way) so we bumped the settings:
timeout http-request 20s
timeout queue 2m
timeout connect 20s
timeout client 10m
timeout server 10m
timeout check 20s
After applying this and restarting haproxy, the stack create completed. It turned out some instances were taking ~7 minutes to complete. My gut instinct is that there are storage or instance issues at play here. 7 minutes to wait for a single instance to boot and attach it’s volumes is too long. But then I’m used to GPFS and Ceph…
UPDATE: Yeah, so both Ceph and GPFS/Spectrum Scale support Copy-On-Write hence stuff happens VERY quickly. I just wasn’t used to traditional storage.