Troubleshooting: Missing file after vMotion attempt

I apologise in advance if this doesn't make much sense to you. It took me a while to unravel what was wrong and I still don't know why it happened.
Update 5 was being applied to an ESX 3.5 cluster and one of the hosts was being placed into maintenance mode. Most of the VMs migrated to other hosts, but one failed part way through and was powered off. At first I thought that one of the two hosts had had a grumpy moment, but the VM then refused to power on again and showed an error message instead.

Oops. I had a good rummage in the hostd.log file on the host that attempted to power on the VM and found the following messages:

[text][2010-01-27 12:33:09.101 'BaseLibs' 133225392 info] DISKLIB-VMFS : "/vmfs/volumes/4a081291-4fb12f12-bef0-001e0bcdc996/myvm/mydisk_1-000001-delta.vmdk" : open successful (21) size = 16106127360, hd = 0. Type 8
[2010-01-27 12:33:09.103 'BaseLibs' 133225392 info] DISKLIB-VMFS : "/vmfs/volumes/4a081291-4fb12f12-bef0-001e0bcdc996/myvm/mydisk_1-000001-delta.vmdk" : closed.
[2010-01-27 12:33:09.151 'BaseLibs' 133225392 info] SNAPSHOT: Unable to find all files for '/vmfs/volumes/4992b455-063c9aec-5e36-001e0bcdc996/mytemplate/mydisk_1.vmdk'
[2010-01-27 12:33:56.219 'vm:/vmfs/volumes/4a081291-4fb12f12-bef0-001e0bcdc996/myvm/myvm.vmx' 20868016 info] Question info: VMware ESX Server cannot find the virtual disk "/vmfs/volumes/4992b455-063c9aec-5e36-001e0bcdc996/mytemplate/mydisk_1.vmdk". Please verify the path is valid and try again.
Cannot open the disk '/vmfs/volumes/4a081291-4fb12f12-bef0-001e0bcdc996/myvm/mydisk_1-000001.vmdk' or one of the snapshot disks it depends on.
[2010-01-27 12:33:56.240 'ha-eventmgr' 20868016 info] Event 81 : Message on myvm on myhost.local in ha-datacenter: VMware ESX Server cannot find the virtual disk "/vmfs/volumes/4992b455-063c9aec-5e36-001e0bcdc996/mytemplate/mydisk_1.vmdk". Please verify the path is valid and try again.
Cannot open the disk '/vmfs/volumes/4a081291-4fb12f12-bef0-001e0bcdc996/myvm/mydisk_1-000001.vmdk' or one of the snapshot disks it depends on.[/text]

(I've sanitised this log file snippet so the names aren't accurate but they are consistent with the issue that I discovered.)
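
If the log is long, a few lines of Python can pull out the disk paths that the host says it cannot find. This is only a rough sketch: the function name `missing_disks` is my own, the sample string is an abbreviated copy of one of the sanitised lines above, and the pattern assumes the message wording is exactly as shown in this post.

```python
import re

# Matches the "cannot find the virtual disk" message and captures the quoted path.
# Assumption: the wording is exactly as it appears in the log excerpt above.
MISSING_DISK = re.compile(r'cannot find the virtual disk "([^"]+)"')

def missing_disks(log_text):
    """Return the distinct disk paths reported missing, in first-seen order."""
    seen = []
    for path in MISSING_DISK.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen

# Abbreviated copy of one sanitised log line from this post.
sample = (
    "[2010-01-27 12:33:56.219 info] Question info: VMware ESX Server cannot find "
    'the virtual disk "/vmfs/volumes/4992b455-063c9aec-5e36-001e0bcdc996/mytemplate/mydisk_1.vmdk". '
    "Please verify the path is valid and try again.\n"
)

for path in missing_disks(sample):
    print(path)
```

Run against the full excerpt it would report the template's mydisk_1.vmdk once, even though the message is logged twice.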

Firstly, the log shows a delta file, which means that the VM is running from a snapshot. Nothing had indicated this beforehand, and Snapshot Manager did not list one. Most likely VCB (or the backup software using it) didn't clean up after itself. Browsing the datastore where the VM resides showed that the snapshot was nearly two weeks old.
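
Since Snapshot Manager gave no hint here, one way to spot a forgotten snapshot is to look for delta files in the VM's directory directly. The sketch below is an illustration only: `delta_files` is a made-up helper name, it assumes the directory is reachable as an ordinary path from wherever Python runs, and the file modification time is at best a rough hint of how old the snapshot is.

```python
import time
from pathlib import Path

def delta_files(vm_dir, now=None):
    """Return (file name, days since last modification) for each *-delta.vmdk file."""
    now = time.time() if now is None else now
    found = []
    for path in sorted(Path(vm_dir).glob("*-delta.vmdk")):
        age_days = (now - path.stat().st_mtime) / 86400  # seconds per day
        found.append((path.name, age_days))
    return found

# Hypothetical usage; the path is the sanitised one from this post.
# for name, age in delta_files("/vmfs/volumes/4a081291-4fb12f12-bef0-001e0bcdc996/myvm"):
#     print(f"{name}: last modified {age:.1f} days ago")
```

Any result at all means the VM has a snapshot disk on that datastore, whatever Snapshot Manager says.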

Secondly, the real problem shows up from the third log line onwards. At first glance it looks like the base disk file has gone missing. Reading more closely, though, the base disk is on a different datastore and actually belongs to a different VM! For some reason, when this VM was deployed from a template it kept one of the template's disks as its own. Looking in that datastore I could see the mydisk_1-flat.vmdk file but no mydisk_1.vmdk file. (Just to explain: the former is the actual disk file, 15 GB in size and containing the VM's data. The latter is a small text file containing configuration data; I'll call it the disk descriptor file.) So the issue was a missing disk descriptor file. I did a quick Google search and didn't find anything immediately helpful, so I ran through the following steps:

  1. Copied the mydisk_1-flat.vmdk file from the template VM's datastore to the broken VM's datastore.
  2. Knowing that the disk was supposed to be 15 GB in size, I created a quick VM with a single 15 GB disk and copied its disk descriptor file to the broken VM's datastore.
  3. Next I made a note of the parentCID in mydisk_1-000001.vmdk, the snapshot delta's disk descriptor file. This value is the ID (CID) that the parent disk is expected to have.

    [text]# Disk DescriptorFile
    version=1
    CID=de54d5dd
    parentCID=1bb73626
    createType="vmfsSparse"
    parentFileNameHint="/vmfs/volumes/4992b455-063c9aec-5e36-001e0bcdc996/mytemplate/mydisk_1.vmdk"
    # Extent description
    RW 31457280 VMFSSPARSE "mydisk_1-000001-delta.vmdk"

    # The Disk Data Base
    #DDB

    ddb.toolsVersion = "7302"[/text]

  4. I also corrected the parentFileNameHint value in the file above so that it referred to the local datastore. It became:

    [text]parentFileNameHint="mydisk_1.vmdk"[/text]

  5. I edited the newly created 15 GB disk's descriptor file so that its CID matched the parentCID noted in step 3, and made sure that the extent description pointed at mydisk_1-flat.vmdk with the correct size.

    [text]# Disk DescriptorFile
    version=1
    CID=1bb73626
    parentCID=ffffffff
    createType="vmfs"

    # Extent description
    RW 31457280 VMFS "mydisk_1-flat.vmdk"

    # The Disk Data Base
    #DDB

    ddb.virtualHWVersion = "4"
    ddb.uuid = "60 00 C2 91 07 97 77 cb-87 9e 5d 9f 95 95 2c 46"
    ddb.geometry.cylinders = "1958"
    ddb.geometry.heads = "255"
    ddb.geometry.sectors = "63"
    ddb.adapterType = "lsilogic"[/text]

  6. I saved the file as mydisk_1.vmdk
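
Before powering on, the hand-edited descriptors can be sanity-checked with a short script. What follows is a minimal sketch rather than a proper descriptor parser: the function names are my own, the two sample strings are trimmed copies of the descriptors above, and it assumes the extent line counts 512-byte sectors (which agrees with the 16106127360-byte size in the log and the 31457280 in the extent lines).

```python
import re

SECTOR_BYTES = 512  # assumption: extent lines count 512-byte sectors

def parse_descriptor(text):
    """Collect key=value pairs (CID, parentCID, ...) from descriptor text."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and the extent line
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"')
    return fields

def extent_sectors(text):
    """Return the sector count from the first 'RW <sectors> <type> "<file>"' line."""
    match = re.search(r'^\s*RW\s+(\d+)\s+\S+\s+"', text, re.MULTILINE)
    return int(match.group(1)) if match else None

def chain_ok(child_text, parent_text):
    """True when the child's parentCID equals the parent's CID."""
    child = parse_descriptor(child_text)
    parent = parse_descriptor(parent_text)
    return child.get("parentCID") == parent.get("CID")

# Trimmed copies of the two descriptors from the steps above.
SNAPSHOT = '''# Disk DescriptorFile
version=1
CID=de54d5dd
parentCID=1bb73626
createType="vmfsSparse"
parentFileNameHint="mydisk_1.vmdk"
# Extent description
RW 31457280 VMFSSPARSE "mydisk_1-000001-delta.vmdk"
'''

BASE = '''# Disk DescriptorFile
version=1
CID=1bb73626
parentCID=ffffffff
createType="vmfs"

# Extent description
RW 31457280 VMFS "mydisk_1-flat.vmdk"
'''

FLAT_BYTES = 16106127360  # disk size reported in the hostd.log excerpt

print(chain_ok(SNAPSHOT, BASE))                        # True
print(extent_sectors(BASE) * SECTOR_BYTES == FLAT_BYTES)  # True
print(extent_sectors(BASE) // (255 * 63))              # 1958
```

The last line shows that the cylinders value in the borrowed descriptor is consistent with the extent size: 31457280 sectors divided by 255 heads times 63 sectors per track, rounded down, gives 1958.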

The VM then powered on successfully. Once it had booted I checked the disks and they were all present.

Now all that remains is to sort out the snapshot. It still doesn't show up in Snapshot Manager.

This has been a bit of a hack but it worked. And before anyone comments, I just modified my Google search terms and found the answer in a VMware KB article, first hit: "Recreating a missing virtual disk (VMDK) header/descriptor file".