RCA 17: Unexpected ESXi Host Crash

 

Hostname: c977abc.int.abcde.com | uptime: 0.0 days | uptime: 54 minutes

 

Build Version: VMware ESXi 6.5.0 build-10719125

 

Time of Issue:

 

2019-06-09T22:43:19.236Z: [GenericCorrelator] 82943240us: [vob.user.host.boot] Host has booted.

2019-06-09T22:43:13.631Z info hostd[B31CB80] [Originator@6876 sub=Default] BEGIN SERVICES

 

 

IPMI log

 

From the IPMI Logs I can see that there is a warning regarding the Voltage State asserted.

 

Record Id: 20 When: 2019-06-09T17:37:29 Message: Assert + Voltage State Asserted

 

Record:20:

   Record Id: 20

   When: 2019-06-09T17:37:29

   Event Type: 3 (Degraded/Warning)

   SEL Type: 2 (System Event)

   Message: Assert + Voltage State Asserted

   Sensor Number: 37

   Raw:

   Formatted-Raw: 14 00 02 d9 43 fd 5c 20 00 04 02 25 03 01 ff ff

 

 

VMKernel:

 

  • From the VMKernel Logs we can see that there are some SCSI Errors but its not associated with our issue and its coming multiple times in a day:

 

2019-06-09T13:34:07.351Z cpu35:66359)ScsiDeviceIO: 2954: Cmd(0x439542663980) 0x1a, CmdSN 0x2f3d4b from world 0 to dev “naa.644a842030e8b5002244da3817e6abcd” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2019-06-09T22:34:34.662Z cpu1:66358)ScsiDeviceIO: 2954: Cmd(0x439562854c00) 0x1a, CmdSN 0x2f5f9a from world 0 to dev “naa.644a842030e8b5002244da3817e6abcd” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

 

 

  • From here we can see that the Host was rebooted around: 2019-06-09T22:34:34.662Z

VMB: 112: mbMagic: 2badb002, mbInfo 0x1016a8
VMB: 56: flags a6d

 

 

0:00:00:00.000 cpu0:1)Init: 252: Time spent in vmkBoot: 906558 us
0:00:00:00.042 cpu0:1)Starting VMKernel initialization …
0:00:00:00.050 cpu0:1)Initializing memory …
0:00:00:10.315 cpu0:65536)Initializing storage stack …

0:00:00:10.490 cpu0:65536)VMKernel loaded successfully.
2019-06-09T22:42:06.414Z cpu0:65536)Loading module user …
2019-06-09T22:42:06.430Z cpu0:65536)user loaded successfully.

 

 

VOBD:

 

 

  • From the VOBD logs we are not able to find much information regarding reason for crash:

 

2019-06-09T22:42:33.486Z: [netCorrelator] 37801385us: [vob.net.vmnic.linkstate.down] vmnic vmnic1 linkstate down
2019-06-09T22:42:33.492Z: [netCorrelator] 37807621us: [vob.net.firewall.port.hooked] Port vmk0 is now protected by Firewall.
2019-06-09T22:42:33.492Z: [netCorrelator] 37807734us: [esx.audit.net.firewall.port.hooked] Port vmk0 is now protected by Firewall.
2019-06-09T22:42:33.492Z: An event (esx.audit.net.firewall.port.hooked) could not be sent immediately to hostd; queueing for retry.
2019-06-09T22:42:33.498Z: [netCorrelator] 37813506us: [vob.net.firewall.port.hooked] Port vmk1 is now protected by Firewall.
2019-06-09T22:42:33.498Z: [netCorrelator] 37813594us: [esx.audit.net.firewall.port.hooked] Port vmk1 is now protected by Firewall.
2019-06-09T22:42:33.498Z: An event (esx.audit.net.firewall.port.hooked) could not be sent immediately to hostd; queueing for retry.
2019-06-09T22:42:33.988Z: [netCorrelator] 38303461us: [vob.net.vmnic.linkstate.up] vmnic vmnic1 linkstate up
2019-06-09T22:42:34.002Z: [netCorrelator] 38317296us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic1 linkstate is down
2019-06-09T22:42:34.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2019-06-09T22:42:34.002Z: [netCorrelator] 38317340us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic1 linkstate is up
2019-06-09T22:42:34.002Z: An event (esx.clear.net.vmnic.linkstate.up) could not be sent immediately to hostd; queueing for retry.
2019-06-09T22:42:34.685Z: [netCorrelator] 39000197us: [vob.net.lacp.lag.transition.down] LACP warning: Link aggregation group lag1 on VDS DvsPortset-0 is down.
2019-06-09T22:42:34.685Z: [netCorrelator] 39000591us: [esx.problem.net.lacp.lag.transition.down] LAG lag1 on VDS DvsPortset-0 is down.
2019-06-09T22:42:34.685Z: An event (esx.problem.net.lacp.lag.transition.down) could not be sent immediately to hostd; queueing for retry.

 

2019-06-09T22:42:41.907Z: [GenericCorrelator] 46222209us: [vob.user.coredump.configured2] At least one coredump target is enabled.
2019-06-09T22:42:41.907Z: [UserLevelCorrelator] 46222209us: [vob.user.coredump.configured2] At least one coredump target is enabled.
2019-06-09T22:42:42.445Z: An event (esx.audit.vmfs.volume.mounted) could not be sent immediately to hostd; queueing for retry.
2019-06-09T22:42:42.490Z: [vmfsCorrelator] 46805453us: [vob.vmfs.nfs.server.restored] Restored connection to the server abcdlnt-h0141.int.abcde.com mount point /cb0780_infra_virtual_ich0141_vol01_snap07, mounted as 9d54a394-b36f2c04-0000-000000000000 (“cb0780_infra_virtual_ich0141_vol01_snap07”)
2019-06-09T22:42:42.490Z: [vmfsCorrelator] 46805555us: [esx.problem.vmfs.nfs.server.restored] abcdlnt-h0141.int.abcde.com /cb0780_infra_virtual_ich0141_vol01_snap07 9d54a394-b36f2c04-0000-000000000000 cb0780_infra_virtual_ich0141_vol01_snap07
2019-06-09T22:42:42.490Z: An event (esx.problem.vmfs.nfs.server.restored) could not be sent immediately to hostd; queueing for retry.
2019-06-09T22:42:42.542Z: [vmfsCorrelator] 46858041us: [vob.vmfs.nfs.server.restored] Restored connection to the server abcdlnt-h0027.int.abcde.com mount point /cb0780_infra_virtual_abc0027_vol04_snap07, mounted as fc1d0c2f-fbfeb095-0000-000000000000 (“cb0780_infra_virtual_abc0027_vol04_snap07″)
2019-06-09T22:42:42.542Z: [vmfsCorrelator] 46858140us: [esx.problem.vmfs.nfs.server.restored] abcdlnt-h0027.int.abcde.com /cb0780_infra_virtual_abc0027_vol04_snap07 fc1d0c2f-fbfeb095-0000-000000000000 cb0780_infra_virtual_abc0027_vol04_snap07

 

 

Hostd Logs:

 

 

—— Early init logs start ——–
2019-06-09T22:43:03.273Z verbose -[B31CB80] [Originator@6876 sub=Default] No update forwarding configured
2019-06-09T22:43:03.273Z info -[B31CB80] [Originator@6876 sub=Default] Supported VMs 768
2019-06-09T22:43:03.273Z warning -[B31CB80] [Originator@6876 sub=Default] Estimated fds limit 4608 > 4096 max supported by setrlimit. Setting fds limit to 4096
2019-06-09T22:43:03.273Z info -[B31CB80] [Originator@6876 sub=Handle checker] Setting system limit of 4096

 

2019-06-09T22:43:03.505Z error hostd[B31CB80] [Originator@6876 sub=Hostsvc] Error reading boot device type: Hardware Query Error: Unable to get  boot order from BIOS or operation not  supported on this hardwareNot supported
2019-06-09T22:43:03.960Z warning hostd[B31CB80] [Originator@6876 sub=Solo.EnvironmentManager] Failed to enumerate parallel devices
2019-06-09T22:43:05.566Z info hostd[B31CB80] [Originator@6876 sub=Solo.VmwareCLI] (vim.EsxCLI.nvme.device.log.error) ha-cli-handler-nvme-device-log-error created

 

 

Conclusion:

 

  • Based on the logs we were not able to find any reason for the host to fail. Now this could be due to two reasons
  • ASR: Automated System Recovery: Please check and verify if the Automatic System Recovery is enabled, This will prevent the Host from creating a dump file.
  • From the IPMI Logs I can see that there is a warning regarding the Voltage State asserted. However this is coming few hours earlier to our issue but it’s worth Investigating. Check on your Hardware events if we are able to see anything. Also engage the Hardware vendor, Incase if you find anything.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply