RCA 24: VM Failover after ESXi Host Crash


Host on which the VM was running before the crash: abcde01203.svr.emea.xyze.net

Host on which the VM was running after the crash: abcde01188.svr.emea.xyze.net

Time of VM reboot: approximately 12:26 UTC

 

 

VM Name: abcd00008134
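To keep the sequence straight while reading the log excerpts below, here is a minimal Python sketch that takes the key timestamps quoted in this RCA and prints them as offsets from the first event. It is an illustration only, not part of any tooling used during the investigation:

from datetime import datetime, timezone

# Key timestamps pulled from the log excerpts in this RCA (all UTC).
events = {
    "FDM master: LiveCheck timeout for slave host-15862": "2019-10-07T12:25:50",
    "FDM on abcde01188: VmFailover action queued":        "2019-10-07T12:26:11",
    "VM abcd00008134 powered on, on abcde01188":          "2019-10-07T12:26:29",
    "Crashed host abcde01203 logs 'Host has booted'":     "2019-10-07T12:32:15",
}

parsed = {name: datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
          for name, ts in events.items()}
start = min(parsed.values())

for name, ts in sorted(parsed.items(), key=lambda kv: kv[1]):
    print(f"+{(ts - start).total_seconds():5.0f}s  {name}")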

 

  • From the VM's own logs we are not able to see much beyond the power-on on the new host (a small filtering sketch follows the excerpt):

 

2019-10-07T12:26:28.421Z| vmx| I120: Hostname=abcde01188.emea.ad.xyze.com
2019-10-07T12:26:28.478Z| vmx| I120: /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx: Setup symlink /var/run/vmware/268c3b0ccf27a6da293d986a40960bea -> /var/run/vmware/root_0/1570451188421971_82029353
2019-10-07T12:26:28.486Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2019-10-07T12:26:28.486Z| vmx| I120: Transitioned vmx/execState/val to poweredOff
2019-10-07T12:26:28.486Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2019-10-07T12:26:28.486Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=2, newAppState=1877, success=1 additionalError=0
2019-10-07T12:26:28.486Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=3, newAppState=1881, success=1 additionalError=0
2019-10-07T12:26:28.517Z| vmx| W110: ObtainHardwareID unexpected failure: 22.
2019-10-07T12:26:28.517Z| vmx| W110: Hostinfo_MachineID ObtainHardwareID failure (Invalid argument); providing default.
2019-10-07T12:26:28.698Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2019-10-07T12:26:28.698Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=2, newAppState=1878, success=1 additionalError=0
2019-10-07T12:26:28.925Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1871, success=1 additionalError=0
2019-10-07T12:26:28.929Z| vmx| I120: DICT   toolScripts.afterResume = TRUE
2019-10-07T12:26:28.929Z| vmx| I120: DICT                 uuid.bios = 42 29 72 58 fc 94 ae 4d-46 2f a2 2e a9 73 53 15
2019-10-07T12:26:28.929Z| vmx| I120: DICT                   vc.uuid = 50 29 86 64 c9 59 96 7d-7c 2f 0f b5 6f e6 46 4a
2019-10-07T12:26:28.930Z| vmx| I120: DICT             uuid.location = 56 4d 5e e0 b9 40 34 08-43 ab 08 65 17 f8 41 22
2019-10-07T12:26:28.936Z| vmx| I120: Vix: [82029353 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2019-10-07T12:26:28.966Z| vmx| I120:   External-interrupt exiting     {0,1}
2019-10-07T12:26:28.966Z| vmx| I120:   Process posted interrupts      {0,1}
2019-10-07T12:26:28.966Z| vmx| I120:   Interrupt-window exiting       {0,1}
2019-10-07T12:26:28.966Z| vmx| I120:   Virtual-interrupt delivery     {0,1}
2019-10-07T12:26:28.966Z| vmx| I120:   Acknowledge interrupt on exit  {0,1}
2019-10-07T12:26:28.970Z| vmx| I120: MONITOR MODE: guestOS preferred modes: HWMMU HV BT32
2019-10-07T12:26:29.098Z| vmx| I120: UUID: Writing uuid.location value: '56 4d 7a 98 25 a0 56 8c-ed 21 7e 75 63 25 a2 b9'
2019-10-07T12:26:29.098Z| vmx| A115: ConfigDB: Setting uuid.location = "56 4d 7a 98 25 a0 56 8c-ed 21 7e 75 63 25 a2 b9"
2019-10-07T12:26:29.220Z| Worker#1| I120: DISKLIB-LINK  : Opened '/vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134_1.vmdk' (0xa): vmfs, 2013265920 sectors / 960 GB.
2019-10-07T12:26:29.222Z| Worker#0| I120: DISKLIB-LINK  : Opened '/vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmdk' (0xa): vmfs, 167772160 sectors / 80 GB.
2019-10-07T12:26:29.256Z| vmx| I120: DumpDiskInfo: scsi0:0 createType=11, capacity = 167772160, numLinks = 1, allocationType = 2
2019-10-07T12:26:29.262Z| vmx| I120: DumpDiskInfo: scsi1:0 createType=11, capacity = 2013265920, numLinks = 1, allocationType = 2
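If you only want the power-state and host-identity lines out of a vmware.log like the one above, a short Python filter is enough. The patterns below are assumptions based on this excerpt, so extend them as needed:

import re
import sys

# Patterns of interest in vmware.log, based on the excerpt above:
# the host the VM started on, and the VMX power-state transitions.
PATTERNS = re.compile(
    r"Hostname=|Transitioned vmx/execState|VMAutomation_ReportPowerOpFinished"
)

def filter_vmware_log(path):
    with open(path, errors="replace") as fh:
        for line in fh:
            if PATTERNS.search(line):
                print(line.rstrip())

if __name__ == "__main__":
    # e.g. python filter_vmware_log.py vmware.log
    filter_vmware_log(sys.argv[1])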

 

 

 

Hostname: abcde01203.emea.ad.xyze.com

 

VMware ESXi 5.5.0 build-6480324 (ESXi 5.5 Express Patch 11)

 

VMK Summary Logs:

 

2019-10-07T12:32:15Z bootstop: Host has booted

 

VOBD Logs:

 

  • Reviewed the VOBD logs; the vmnic link-state changes are most likely a consequence of the host reboot (a small parsing sketch follows the excerpt).

 

2019-10-07T12:30:57.321Z: [netCorrelator] 56984901us: [vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down
2019-10-07T12:30:57.927Z: [netCorrelator] 57590171us: [vob.net.vmnic.linkstate.down] vmnic vmnic3 linkstate down
2019-10-07T12:30:58.002Z: [netCorrelator] 57665718us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic2 linkstate is down
2019-10-07T12:30:58.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2019-10-07T12:30:58.002Z: [netCorrelator] 57665944us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic3 linkstate is down
2019-10-07T12:30:58.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2019-10-07T12:32:15.324Z: [UserLevelCorrelator] 134987297us: [vob.user.host.boot] Host has booted.
2019-10-07T12:32:15.324Z: [GenericCorrelator] 134987297us: [vob.user.host.boot] Host has booted.
2019-10-07T12:32:15.324Z: [UserLevelCorrelator] 134987720us: [esx.audit.host.boot] Host has booted.
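For reference, the vobd.log lines above have a regular shape and are easy to turn into (timestamp, VOB id, message) tuples. A minimal parsing sketch, using two of the lines above as sample input:

import re

# vobd.log line shape (as seen in the excerpt above):
#   <ISO timestamp>: [<correlator>] <uptime>us: [<vob id>] <message>
VOB_LINE = re.compile(
    r"^(?P<ts>\S+): \[(?P<correlator>[^\]]+)\] (?P<uptime>\d+)us: "
    r"\[(?P<vob>[^\]]+)\] (?P<msg>.*)$"
)

def parse_vobd(lines):
    """Yield (timestamp, vob id, message); non-matching lines are skipped."""
    for line in lines:
        m = VOB_LINE.match(line.strip())
        if m:
            yield m.group("ts"), m.group("vob"), m.group("msg")

sample = [
    "2019-10-07T12:30:57.321Z: [netCorrelator] 56984901us: "
    "[vob.net.vmnic.linkstate.down] vmnic vmnic2 linkstate down",
    "2019-10-07T12:32:15.324Z: [UserLevelCorrelator] 134987297us: "
    "[vob.user.host.boot] Host has booted.",
]
for ts, vob, msg in parse_vobd(sample):
    print(ts, vob, "-", msg)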

 

VMKernel Logs:

 

  • From the vmkernel log as well, we are not able to isolate anything specific other than the boot events:

 

2019-10-07T11:08:23.531Z cpu36:33806)NMP: nmp_ResetDeviceLogThrottling:3311: Error status H:0x0 D:0x2 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.600508b1001c1584d73e9e8d8b8953a7" occurred 1 times(of 0 commands)
2019-10-07T11:38:23.614Z cpu36:33806)NMP: nmp_ResetDeviceLogThrottling:3311: Error status H:0x0 D:0x2 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.600508b1001c1584d73e9e8d8b8953a7" occurred 1 times(of 0 commands)
2019-10-07T12:08:23.697Z cpu64:33806)NMP: nmp_ResetDeviceLogThrottling:3311: Error status H:0x0 D:0x2 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.600508b1001c1584d73e9e8d8b8953a7" occurred 1 times(of 0 commands)

TSC: 186852 cpu0:1)BootConfig: 88: mcaEnableAllErrorSources = FALSE
0:00:00:00.000 cpu0:1)WARNING: Serial: 646: Invalid serial port config: mem-mapped to addr 0x0.

0:00:00:00.736 cpu0:1)Starting VMKernel initialization …
0:00:00:00.749 cpu0:1)Initializing memory …
0:00:00:17.577 cpu0:1)WARNING: MemMap: 2803: Node 3 has no free pages
0:00:00:17.598 cpu0:1)IRQ: 540: 0xfe <error> exclusive, flags 0x0
0:00:00:17.601 cpu0:1)WARNING: Vmkperf: 3929: 2 perf counters in use by BIOS
0:00:00:23.663 cpu0:32768)Initializing storage stack …
0:00:00:23.789 cpu0:32768)PCI: 59: Failed to initialize _OSC for 0000:7f

0:00:00:24.267 cpu0:32768)Init: 879: Vmkernel initialization done.
0:00:00:24.267 cpu0:32768)VMKernel loaded successfully.
2019-10-07T12:30:24.822Z cpu0:32768)Loading module user …
2019-10-07T12:30:24.836Z cpu0:32768)user loaded successfully.

2019-10-07T12:32:15.563Z cpu69:35974)Boot Successful

 

2019-10-08T02:59:48.064Z cpu36:193558)Migrate: vm 193561: 3286: Setting VMOTION info: Dest ts = 1570503582214591, src ip = <169.78.11.7> dest ip = <169.78.11.2> Dest wid = 0 using SHARED swap
2019-10-08T02:59:48.085Z cpu36:193558)Hbr: 3340: Migration start received (worldID=193561) (migrateType=1) (event=0) (isSource=0) (sharedConfig=1)

 

Hostd Logs:

 

 

  • The hostd logs show some events associated with the fds (file-descriptor) limit in the in-memory log section; see the KB article below and the small rlimit illustration after the excerpt:

https://ikb.vmware.com/s/article/50111942?lang=en_US

 

2019-10-07T12:25:10.858Z [3EBC1B70 verbose 'SoapAdapter'] Responded to service state request
2019-10-07T12:25:31.952Z [3EDC1B70 verbose 'SoapAdapter'] Responded to service state request
Section for VMware ESX, pid=34510, version=5.5.0, build=6480324, option=Release
------ In-memory logs start --------
mem> 2019-10-07T12:31:48.068Z [FF9589A0 verbose 'Default'] No update forwarding configured
mem> 2019-10-07T12:31:48.068Z [FF9589A0 info 'Default'] Supported VMs 512
mem> 2019-10-07T12:31:48.068Z [FF9589A0 warning 'Default'] Estimated fds limit 4096 > 4096 max supported by setrlimit. Setting fds limit to 4096
mem> 2019-10-07T12:31:48.068Z [FF9589A0 info 'Handle checker'] Setting system limit of 4096
mem> 2019-10-07T12:31:48.068Z [FF9589A0 info 'Handle checker'] Set system limit to 4096
mem> 2019-10-07T12:31:48.068Z [FF9589A0 info 'Default'] Setting malloc mmap threshold to 32 k
mem> 2019-10-07T12:31:48.068Z [FF9589A0 info 'Default'] getrlimit(RLIMIT_NPROC): curr=64 max=128, return code = Success
mem> 2019-10-07T12:31:48.068Z [FF9589A0 info 'Default'] setrlimit(RLIMIT_NPROC): curr=128 max=128, return code = Success

------ In-memory logs end   --------
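The "fds limit" and rlimit lines above are hostd querying and clamping its own resource limits at startup, which appears to be what the KB linked above refers to. For readers less familiar with rlimits, here is a minimal Python illustration of the same getrlimit/setrlimit pattern; this is an analogy only, not hostd's actual code:

import resource

# Mirror the pattern in the hostd lines above: an estimated fd requirement
# is clamped to the hard maximum that setrlimit() will accept.
estimated_fds = 4096

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current RLIMIT_NOFILE: soft={soft} hard={hard}")

target = estimated_fds
if hard != resource.RLIM_INFINITY:
    target = min(estimated_fds, hard)   # "Setting fds limit to 4096"

resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(f"soft limit set to {target} (hard cap {hard})")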

 

 

 

Hostname: abcde01188.emea.ad.xyze.com

 

VMware ESXi 5.5.0 build-6480324 (ESXi 5.5 Update 3f)

 

FDM Logs:

 

 

2019-10-07T12:26:11.739Z [60CC2B70 verbose 'Execution'] [ActionScheduler::AddPendingActionInt] Added pending action opId = host-15886:0-0, cfgPath = /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx, type = VmFailover, priority = 32
2019-10-07T12:26:11.739Z [FFC7AB70 verbose 'Execution' opID=host-15886:0-0] [FailoverAction::StartAsync] Failing over vm /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx (isRegistered=false)
2019-10-07T12:26:11.739Z [FFC7AB70 verbose 'Execution' opID=host-15886:0-0] [FailoverAction::StartAsync] Registering vm
2019-10-07T12:26:21.750Z [60CC2B70 warning 'commonvpxLro'] [VpxLRO] No taskinfo property updates for haTask-ha-folder-vm-vim.Folder.registerVm-274763869 in 10000 ms
2019-10-07T12:26:21.753Z [60CC2B70 warning 'commonvpxLro'] [VpxLRO] Taskinfo polling sees no change
2019-10-07T12:26:28.228Z [60CC2B70 verbose 'Execution' opID=host-15886:0-0] [FailoverAction::RegisterCompletionCallback] Registering vm done (vmid=/vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx, hostdVmId=2)
2019-10-07T12:26:28.231Z [60CC2B70 verbose 'Execution' opID=host-15886:0-0] [FailoverAction::InitiateReconfigure] Not reconfiguring vm
2019-10-07T12:26:28.231Z [60CC2B70 verbose 'Execution' opID=host-15886:0-0] [FailoverAction::ReconfigureCompletionCallback] Powering on vm
2019-10-07T12:26:29.094Z [60540B70 verbose 'Execution' opID=SWI-2a65dcf0] [ExecutionManagerImpl::HandleVmQuestion] Question while failing over vm /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx, Label = msg.uuid.altered
2019-10-07T12:26:29.760Z [60C40B70 verbose 'Execution' opID=host-15886:0-0] [FailoverAction::PowerOnCompletionCallback] Power on vm /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx done
2019-10-07T12:26:48.077Z [FFBD5B90 verbose 'Invt' opID=SWI-42409122] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx changed toolsStatus=toolsOk
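One thing worth noting in the excerpt above: FDM tags every line of a single HA action with the same opID (opID=host-15886:0-0 here), which makes it easy to follow one failover end to end in a busy fdm.log. A small sketch of that grouping (the log path is the usual ESXi location, but verify it on your build):

import re
from collections import defaultdict

# Group fdm.log lines by their opID so one HA action can be read end to end.
OPID = re.compile(r"opID=(\S+?)\]")

def group_by_opid(lines):
    actions = defaultdict(list)
    for line in lines:
        m = OPID.search(line)
        if m:
            actions[m.group(1)].append(line.rstrip())
    return actions

if __name__ == "__main__":
    # Typical location on ESXi; adjust if your logs are redirected elsewhere.
    with open("/var/log/fdm.log", errors="replace") as fh:
        for opid, lines in group_by_opid(fh).items():
            print(f"{opid}: {len(lines)} lines, first at {lines[0][:24]}")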

 

 

 

VM Kernel Logs:

 

  • The VM worlds are started on the ESXi host as the VM is powered on by the HA failover:

 

2019-10-07T12:26:29.002Z cpu15:82029353)World: vm 82029356: 1466: Starting world vmm0:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029358: 1466: Starting world vmm1:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029359: 1466: Starting world vmm2:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029360: 1466: Starting world vmm3:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029361: 1466: Starting world vmm4:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029362: 1466: Starting world vmm5:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029363: 1466: Starting world vmm6:abcd00008134 of type 8
2019-10-07T12:26:29.019Z cpu15:82029353)World: vm 82029364: 1466: Starting world vmm7:abcd00008134 of type 8

 

Conclusion:

 

  • From the logs we are not able to conclude the exact reason for the host reboot; the only related event we can find is the one below:

 

2019-10-07T12:32:03.664Z [FF9589A0 verbose 'Solo.VmwareCLI'] Command reboot (reboot)

  • In order to investigate further, we need the support bundles from:
    • ESXi host: abcde01194.svr.emea.xyze.net, as this is the FDM master.
    • vCenter: 169.94.153.58

 

 

 

 

______________________________________________________________________________

 

Hostname: esx-abcde01194.emea.ad.xyze.com

 

FDM Logs:

 

  • From the master's FDM logs we can see that it lost communication with the slave host-15862, after which a cluster failover was initiated:

 

2019-10-07T12:25:50.695Z [FFB3DB70 error 'Cluster' opID=SWI-56f32f43] [ClusterSlave::LiveCheck] Timeout for slave @ host-15862
2019-10-07T12:26:11.731Z [21E81B70 verbose 'Placement'] [PlacementManagerImpl::IssuePlacementStartCompleteEventLocked] Issue failover start event
2019-10-07T12:26:11.732Z [21A40B70 verbose 'FDM' opID=SWI-3ceb7cde] [FdmService] New event: EventEx=com.vmware.vc.HA.ClusterFailoverActionInitiatedEvent vm= host= tag=host-15886:-390685961:1
2019-10-07T12:26:11.735Z [21EC2B70 verbose 'Placement' opID=SWI-7d923e5b] [DrmPE::GenerateFailoverRecommendation] Snapshot is valid
2019-10-07T12:26:11.735Z [21EC2B70 verbose 'Placement' opID=SWI-7d923e5b] [DrmPE::GenerateFailoverRecommendation] 1 Vms are to be powered on
2019-10-07T12:26:11.736Z [21EC2B70 verbose 'Placement' opID=SWI-7d923e5b] [DrmPE::GenerateFailoverRecommendation] 1 vms added to domain config
2019-10-07T12:26:11.736Z [21EC2B70 verbose 'Placement' opID=SWI-7d923e5b] [DrmPE::InvokeDrsMultiplePasses] Pass2: respect host preference but not failover hosts

 

  • Once the failover task was initiated, the failed VM abcd00008134 was registered and powered on:

 

2019-10-07T12:26:28.225Z [22281B70 verbose 'Invt' opID=SWI-2ce496be] [VmStateChange::SavePowerChange] Vm /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx uncleanly powered off; alreadyCreated=true
2019-10-07T12:26:29.761Z [21E81B70 verbose 'Policy' opID=SWI-5e8a2842] [VmOperationsManager::RequestCancelPlacement] VM ID /vmfs/volumes/5aead89d-4a42a019-c968-d06726cf4f12/abcd00008134/abcd00008134.vmx: Transitioned to 'Cancel' state for 'VmFailover'
2019-10-07T12:26:48.287Z [22240B70 verbose 'Placement'] [PlacementManagerImpl::IssuePlacementStartCompleteEventLocked] Issue failover complete event
2019-10-07T12:26:48.288Z [21EC2B70 verbose 'FDM' opID=SWI-19a5a584] [FdmService] New event: EventEx=com.vmware.vc.HA.ClusterFailoverActionCompletedEvent vm= host= tag=host-15886:-390685961:2
2019-10-07T12:27:46.730Z [FFBBFB70 error 'Election' opID=SWI-6058ed8] [ClusterElection::SendAll] sendto 10.189.183.152 failed: Host is down
2019-10-07T12:27:47.731Z [FFBBFB70 error 'Election' opID=SWI-6058ed8] [ClusterElection::SendAll] sendto 10.189.183.152 failed: Host is down
2019-10-07T12:27:48.733Z [FFBBFB70 error 'Election' opID=SWI-6058ed8] [ClusterElection::SendAll] sendto 10.189.183.152 failed: Host is down
2019-10-07T12:27:49.734Z [FFBBFB70 error 'Election' opID=SWI-6058ed8] [ClusterElection::SendAll] sendto 10.189.183.152 failed: Host is down
2019-10-07T12:27:50.225Z [21EC2B70 warning 'Cluster'] [HostPing::Ping] sendto[ipv4] 10.189.183.152: Host is down
2019-10-07T12:27:50.736Z [FFBBFB70 error 'Election' opID=SWI-6058ed8] [ClusterElection::SendAll] sendto 10.189.183.152 failed: Host is down
2019-10-07T12:32:25.912Z [FFB3DB70 error 'Cluster' opID=SWI-56f32f43] Slave host-15862 is already registered
2019-09-07T14:25:33.155Z [21EC2B70 error 'commonvpxVmomi'] [VpxVmomi] IO exception while sending result
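The repeated "sendto ... Host is down" election errors above are the master failing to reach the crashed host's HA agent. If an fdm.log is noisy, a quick way to confirm which peer is unreachable is to count those errors per destination address; a minimal sketch (the "Host is down" wording is taken from the excerpt above):

import re
import sys
from collections import Counter

# Count "sendto <ip> ... Host is down" errors per destination address,
# to confirm which HA peer the master could not reach.
SENDTO = re.compile(r"sendto(?:\[ipv4\])?\s+(\d+\.\d+\.\d+\.\d+).*Host is down")

def unreachable_peers(lines):
    return Counter(m.group(1) for line in lines if (m := SENDTO.search(line)))

if __name__ == "__main__":
    # e.g. python unreachable_peers.py fdm.log
    with open(sys.argv[1], errors="replace") as fh:
        for ip, count in unreachable_peers(fh).most_common():
            print(f"{ip}: {count} failed sends")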

 

vCenter Logs:

 

  • vCenter logs are not available for the period around 07/10/2019, when the issue occurred.

 

Hostname: abcde01203.emea.ad.xyze.com

 

VMK Summary Logs:

 

  • From the VMK summary log we can only see the event indicating that the host has booted:

 

2019-10-07T12:32:15Z bootstop: Host has booted

 

 

  • From the IPMI logs we can see a few events associated with a processor IERR and an ACPI soft-off, but the exact problem is not clear (a small sorting sketch follows the records):

 

Record Id: 85 When: 2019-10-07T12:33:40 Message: Assert + Processor IERR
Record Id: 86 When: 2019-10-07T12:34:03 Message: Assert + System ACPI Power State S0/G0: working
Record Id: 87 When: 2019-10-07T12:34:30 Message: Assert + System ACPI Power State S4/S5: soft-off
Record Id: 88 When: 2019-10-07T12:34:33 Message: Assert + System ACPI Power State S0/G0: working
Record Id: 89 When: 2019-10-07T12:34:49 Message: Assert + System ACPI Power State S0/G0: working
Record Id: 90 When: 2019-10-07T12:30:02 Message: Deassert + Drive Slot Drive Present
Record Id: 91 When: 2019-10-07T12:30:02 Message: Assert + Drive Slot In Failed Array
Record Id: 92 When: 2019-10-07T12:30:02 Message: Deassert + Drive Slot Drive Present
Record Id: 93 When: 2019-10-07T12:30:02 Message: Assert + Drive Slot In Failed Array
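Note that the SEL records above are not listed in chronological order: records 90-93 carry earlier timestamps than 85-89. A minimal sketch that parses the three fields and sorts by the "When" timestamp, using a couple of the records above as sample input:

import re
from datetime import datetime

# SEL record shape, as exported above:
#   Record Id: <n> When: <timestamp> Message: <text>
SEL = re.compile(r"Record Id: (\d+) When: (\S+) Message: (.*)")

def sorted_sel(lines):
    records = []
    for line in lines:
        m = SEL.match(line.strip())
        if m:
            rec_id, when, msg = m.groups()
            records.append((datetime.fromisoformat(when), int(rec_id), msg))
    return sorted(records)

sample = [
    "Record Id: 85 When: 2019-10-07T12:33:40 Message: Assert + Processor IERR",
    "Record Id: 90 When: 2019-10-07T12:30:02 Message: Deassert + Drive Slot Drive Present",
]
for when, rec_id, msg in sorted_sel(sample):
    print(when.isoformat(), f"(record {rec_id})", msg)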

 

Conclusion:

 

  • As per KB article https://kb.vmware.com/s/article/2036544, we are not able to see any PSOD events or any user-initiated reboot task.
  • Please check the hardware logs on the host for any related events and tasks.
  • Please have extensive hardware diagnostics run on the ESXi host in consultation with the hardware vendor.
  • Since the vCenter bundle does not contain logs for the time when the issue happened, we are not able to find any vCenter-related tasks or information.

 

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
