RCA 15: Host Hardware Failures caused issues with VM

 

Hostname: abcdcc2v026.naeast.ad.abcde.com

 

ESXi Version: VMware ESXi 5.5.0 build-9919047

 

Scratch location: /tmp/scratch :: This location is not ideal: logs stored here are not preserved across a host reboot. See KB 1033696.

 

CIM Logs:

 

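  • The CIM logs show a problem with the host hardware itself: the IPMI sensors report a drive fault on both disk bays (HDD0 and HDD1), and each drive is flagged as being in a critical and a failed array.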

 

[Wed Jun  5 03:25:38 UTC 2019] Dumping instances of OMC_RawIpmiSensor

                   Description = Disk or Disk Bay 1 HDD1_INFO: Drive Fault
                   Description = Disk or Disk Bay 1 HDD1_INFO: In Critical Array
                   Description = Disk or Disk Bay 1 HDD1_INFO: In Failed Array
                   Description = Disk or Disk Bay 0 HDD0_INFO: Drive Fault
                   Description = Disk or Disk Bay 0 HDD0_INFO: In Critical Array
                   Description = Disk or Disk Bay 0 HDD0_INFO: In Failed Array
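When a CIM dump is long, it can help to group the sensor states by disk bay. The following is a minimal sketch, not a VMware tool: the regular expression is an assumption based only on the six lines shown above, and the function name faults_by_bay is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Sample lines copied from the CIM dump above.
CIM_DUMP = """\
Description = Disk or Disk Bay 1 HDD1_INFO: Drive Fault
Description = Disk or Disk Bay 1 HDD1_INFO: In Critical Array
Description = Disk or Disk Bay 1 HDD1_INFO: In Failed Array
Description = Disk or Disk Bay 0 HDD0_INFO: Drive Fault
Description = Disk or Disk Bay 0 HDD0_INFO: In Critical Array
Description = Disk or Disk Bay 0 HDD0_INFO: In Failed Array
"""

# Assumed line shape, inferred from the sample above rather than from
# any documented format: "Description = Disk or Disk Bay <n> <sensor>: <state>"
SENSOR_RE = re.compile(r"Description = Disk or Disk Bay (\d+) (\w+): (.+)")


def faults_by_bay(dump: str) -> Dict[int, List[str]]:
    """Group the reported sensor states by disk bay number."""
    bays: Dict[int, List[str]] = defaultdict(list)
    for line in dump.splitlines():
        match = SENSOR_RE.search(line)
        if match:
            bay, _sensor, state = match.groups()
            bays[int(bay)].append(state.strip())
    return dict(bays)


if __name__ == "__main__":
    for bay, states in sorted(faults_by_bay(CIM_DUMP).items()):
        print(f"Bay {bay}: {', '.join(states)}")
```

For this host, both bays report the same three states, which is why the fault points at the local disks or their controller rather than at a single drive.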

 

VMK Summary:

 

  • The VMkernel summary log shows that the host rebooted. Because no persistent log location is configured, the logs from before the reboot were lost, so we cannot determine the cause of the failure.

 

2019-06-05T01:34:07Z bootstop: Host has booted
2019-06-05T02:00:01Z heartbeat: up 0d0h29m53s, 0 VMs; [[36076 fdm 13184kB] [35320 vpxa-worker 21900kB] [34944 hostd-worker 44492kB]] [[36800 sfcb-vmware_raw 6%max] [36215 sfcb-vmware_bas 14%max] [36209 sfcb-pycim 17%max]]
2019-06-05T03:00:01Z heartbeat: up 0d1h29m53s, 0 VMs; [[36076 fdm 13184kB] [35320 vpxa-worker 23092kB] [34944 hostd-worker 46104kB]] [[36800 sfcb-vmware_raw 6%max] [36215 sfcb-vmware_bas 14%max] [36209 sfcb-pycim 17%max]]
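The heartbeat lines also let us estimate when the host came back up: subtracting the reported uptime from the heartbeat timestamp gives the approximate boot start. The sketch below does that arithmetic for the two heartbeat lines above (truncated to the part that matters). The regular expression and the function name estimated_boot_time are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# The two heartbeat lines from the log above, truncated after the VM count.
HEARTBEATS = [
    "2019-06-05T02:00:01Z heartbeat: up 0d0h29m53s, 0 VMs;",
    "2019-06-05T03:00:01Z heartbeat: up 0d1h29m53s, 0 VMs;",
]

# Assumed line shape, inferred from the sample lines only.
HEARTBEAT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z heartbeat: "
    r"up (\d+)d(\d+)h(\d+)m(\d+)s"
)


def estimated_boot_time(line: str) -> datetime:
    """Return heartbeat timestamp minus reported uptime (UTC, naive)."""
    match = HEARTBEAT_RE.match(line)
    if match is None:
        raise ValueError(f"not a heartbeat line: {line!r}")
    stamp, days, hours, minutes, seconds = match.groups()
    logged_at = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    uptime = timedelta(
        days=int(days),
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds),
    )
    return logged_at - uptime


if __name__ == "__main__":
    for line in HEARTBEATS:
        print(estimated_boot_time(line).isoformat())
```

Both heartbeats give the same result, 2019-06-05T01:30:08, so the host started booting at about 01:30 UTC and finished about four minutes later, at the bootstop entry (01:34:07).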

 

Hostd:

 

  • The hostd log shows that the services started around the same time:

 

2019-06-05T01:32:33.285Z [FFBDB9A0 info 'Default'] BEGIN SERVICES

 

  • Because no scratch partition is configured, hostd logs the following warnings:

 

2019-06-05T01:33:33.463Z [272C2B70 warning 'Hostsvc.VmkVprobSource'] Argument '1' for vprob 'esx.problem.scratch.partition.unconfigured' not found
2019-06-05T01:33:33.463Z [27C81B70 info 'Libs' opID=hostd-93af] CPU[11]: MSR       0xce =      0xc0064011600
2019-06-05T01:33:33.463Z [272C2B70 warning 'Hostsvc.VmkVprobSource'] Wrong argument count for vprob 'esx.problem.scratch.partition.unconfigured', expected: 0, got: 1

 

 

VOBD:

 

  • The VOBD logs show Power-on Reset events on four storage paths:

2019-06-05T01:31:02.044Z: [scsiCorrelator] 63403606us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba4:C0:T1:L17
2019-06-05T01:31:02.060Z: [scsiCorrelator] 63419374us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba3:C0:T1:L17
2019-06-05T01:31:02.077Z: [scsiCorrelator] 63436596us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba2:C0:T0:L17
2019-06-05T01:31:02.136Z: [scsiCorrelator] 63494942us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba1:C0:T0:L17

 

  • According to https://kb.vmware.com/s/article/1020702, these events can appear when the SAN is heavily congested and I/O requests take a long time to complete. Here, however, they may simply be a side effect of the ESXi host reboot: all four were logged at 01:31:02, before the host finished booting at 01:34:07.
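To check whether the Power-on Reset events fall inside the boot window, we can parse the VOBD lines and compare them with the bootstop time from the VMK Summary section. This is a rough sketch under two assumptions: the regular expression is inferred from the four lines above, and the number before "us" is treated as microseconds since boot, which matches how these lines read but is not confirmed in this write-up. The function name parse_por is hypothetical.

```python
import re
from datetime import datetime

# The four VOBD lines from the log above.
VOBD_LINES = [
    "2019-06-05T01:31:02.044Z: [scsiCorrelator] 63403606us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba4:C0:T1:L17",
    "2019-06-05T01:31:02.060Z: [scsiCorrelator] 63419374us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba3:C0:T1:L17",
    "2019-06-05T01:31:02.077Z: [scsiCorrelator] 63436596us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba2:C0:T0:L17",
    "2019-06-05T01:31:02.136Z: [scsiCorrelator] 63494942us: [vob.scsi.scsipath.por] Power-on Reset occurred on vmhba1:C0:T0:L17",
]

# Taken from the "bootstop: Host has booted" entry (UTC).
BOOTSTOP = datetime(2019, 6, 5, 1, 34, 7)

# Assumed line shape, inferred from the sample lines only.
POR_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z: \[scsiCorrelator\] "
    r"(\d+)us: \[vob\.scsi\.scsipath\.por\] "
    r"Power-on Reset occurred on (\S+)$"
)


def parse_por(line: str):
    """Parse one Power-on Reset line; return None if it does not match."""
    match = POR_RE.match(line)
    if match is None:
        return None
    stamp, uptime_us, path = match.groups()
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
    return {
        "path": path,
        "time": when,
        # Assumption: this field is the time since boot in microseconds.
        "uptime_s": int(uptime_us) / 1_000_000,
        "before_bootstop": when < BOOTSTOP,
    }


if __name__ == "__main__":
    for line in VOBD_LINES:
        event = parse_por(line)
        if event:
            print(event["path"], f"{event['uptime_s']:.1f}s after boot,",
                  "before bootstop" if event["before_bootstop"] else "after bootstop")
```

If the uptime assumption holds, all four resets happened roughly a minute into the boot, which supports reading them as a consequence of the reboot rather than as evidence of SAN congestion.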

 

Conclusion:

 

  • Based on the information available on the host, I recommend checking the hardware event log for errors associated with the hardware and running extensive hardware diagnostics.

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
