vCAC 5.2 - Accidental Deletion of a non-vCAC VM
It was tempting to call this article “vCAC Ate My VM” but it's not a useful description of what it's actually about.
I was onsite with a customer recently when an odd bit of behaviour occurred whilst testing out some code in the BuildingMachine stub. I've since reproduced what happened in my home lab and, while it's a bit worrying and probably a bug, I'd hesitate to ring the alarm bells too loudly.
A bit of scene setting is required to explain this first.
- The customer wanted to use user-specified machine names. The blueprints in use have been configured to request a machine name from the person requesting a VM.
- This name is also used for the VM's guest OS hostname during the customization of the VM. Understandably this has to be unique within the DNS zone / network being used.
- The vCenter being used as a vCAC endpoint is the same one that “owns” the vCAC infrastructure and many other production VMs. However, vCAC has its own cluster to consume resources from.
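Because the requested name also becomes the guest hostname, a simple forward-lookup is one way to catch DNS clashes early, before provisioning starts. This is just a sketch of that idea using the plain .NET resolver (no vCAC or PowerCLI dependency); “testvm” is a placeholder and the real name would come from the request's machine properties:

```powershell
# Hypothetical pre-flight check: does the requested name already
# resolve in DNS? GetHostEntry throws a SocketException on a failed
# lookup, which here means the name is free.
$requestedName = "testvm"

try {
    $entry = [System.Net.Dns]::GetHostEntry($requestedName)
    Write-Output "'$requestedName' already resolves to $($entry.AddressList[0]) - reject the request."
} catch [System.Net.Sockets.SocketException] {
    # Lookup failed, so the name is (probably) unused in this zone.
    Write-Output "'$requestedName' is not in DNS."
}
```

Note that a clean DNS lookup isn't a guarantee on its own — a powered-off VM with the same name won't be in DNS — which is why the vCenter-side check below was also needed.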
The customer wanted to ensure that users couldn't request a VM name that was already in use. vCAC does its own checking to ensure that the same name is not reused within vCAC itself. However, it does not check for existing VMs in vSphere. This is why I was adding some code to the WFStubBuildingMachine workflow.
The solution that I had was a simple piece of PowerCLI that connected to the vCenter server, checked to see if the requested VM name was in use in any of the other clusters and failed the request if it was. Fairly simple and it worked. What I saw however was that the existing VM was destroyed by vCAC. Luckily it was a test one and not a production one. However, given that the vCenter server also managed non-vCAC VMs, this was a bit worrying and why I have been investigating it in my lab.
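The check itself was nothing exotic. A minimal PowerCLI sketch along these lines illustrates the idea — the server name, credentials and requested name are placeholders here, and in the real stub the name came from the vCAC machine properties:

```powershell
# Hypothetical sketch of the name-uniqueness check run from the
# BuildingMachine stub. Server and credential values are placeholders.
Connect-VIServer -Server "vcenter.lab.local" -User "svc_vcac" -Password "********"

$requestedName = "testvm"

# Look for any existing VM with the requested name, anywhere in this vCenter.
$existing = Get-VM -Name $requestedName -ErrorAction SilentlyContinue

if ($existing) {
    # Fail the request rather than let provisioning continue
    # with a duplicate name.
    Write-Error "A VM named '$requestedName' already exists in vCenter."
    exit 1
}

Disconnect-VIServer -Confirm:$false
```

A non-zero exit is what drove the workflow to its Failed state — which, as it turned out, is exactly where the trouble started.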
To reproduce the issue, I needed two clusters in my homelab (which I already had):
One for management VMs and one resource cluster for vCAC to provision into.
I created a simple VM from a vSphere template called “testvm” in my MGMT cluster that would be my guinea pig. I then built a quick vCAC 5.2 server and configured my vCenter server's “RES” cluster as a Compute Resource. With a reservation in place and a simple blueprint I was ready to test.
Having verified that I could create VMs via vCAC with custom names successfully, I then went about customising the WFStubBuildingMachine workflow so that it would exit in a “Failed” state. Adam Bohle has a posting that explains how to accomplish this; I simplified it a bit as I didn't need all of the logic in place, just a failure.
Using the vCAC Designer, I simply added a step to return a Failed state from WFStubBuildingMachine and sent the change back to the Model Manager.
After another quick test, I could see that as soon as any request hit the “Building Machine” stage, it failed and vCAC would dispose of the VM. The important thing to realise is that in the lifecycle of a vCAC machine, “Building Machine” means that nothing has been created yet outside of vCAC. No cloning in vSphere has taken place. So disposing of a failed request at this stage should not really involve vCenter at all.
Now the real test…
This time I made a vCAC request for a VM called “testvm” (remember that it's in my MGMT cluster and vCAC is set to use only my RES cluster for VMs).
As expected, the request fails at the “Building Machine” stage and vCAC disposes of the VM.
Back in vCenter “testvm” is still there and running ok. This is good. As I'd hoped, vCAC doesn't touch something that's in another cluster.
If the “testvm” machine is moved to the RES cluster though, what then? Boom! vCAC jumps into a Disposing stage as expected but deletes the non-vCAC VM from vCenter that has the same name!
Whilst this probably shouldn't happen, what I was doing here wasn't good practice anyway. The cluster that vCAC provisions into should only be used by vCAC; there should be no other VMs in there at all.