RCA 30: Host Unexpected Reboot

Hostname: abcdef38xsdi004.sdi.corp.abcdef.com

ESXi Version: VMware ESXi 6.5.0 build-15256549

 

Time of Issue: 8/4/2020  between 7:50 IST – 8:10 IST

Time in GMT: 2:20 AM – 2:40 AM GMT 8/4/2020

 

 

VMKsummary:

 

  • From the logs we can see the Time when the Host was booted:

 

2020-08-04T02:39:43Z : storage-path-claim-completed
mark2020-08-04T02:40:06Z bootstop: Host has booted
2020-08-04T03:00:01Z heartbeat: up 0d0h29m22s, 0 VMs; [[69431 hostd-worker 37008kB] [66115 vmsyslog 39136kB] [68612 python 67180kB]] [[69431 hostd-worker 0%max] [66115 vmsyslogd 0%max] [68612 python 0%max]]
2020-08-04T04:00:00Z heartbeat: up 0d1h29m22s, 3 VMs; [[83149 vmx 66706740kB] [82334 vmx 66859160kB] [82835 vmx 66937816kB]] [[83149 vmx 0%max] [82334 vmx 0%max] [82835 vmx 0%max]]

 

 

VMKernel Logs:

 

  • In the Vmkernel Logs can see few warnings regarding the Admission Failure and Storage Disk, However was not able to find any trace of them crashing the Server.

 

2020-08-04T02:08:57.790Z cpu12:66115)WARNING: LinuxThread: 381: vmsyslogd: Error cloning thread: -28 (bad0081)
2020-08-04T02:09:41.291Z cpu44:67722)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk t10.NVMe____Dell_Express_Flash_NVMe_P4800X_375GB_SFF0001A82265E4D25C:2
2020-08-04T02:11:48.740Z cpu18:66115)MemSched: 14635: Admission failure in path: logging/vmsyslogd.28514731/uw.28514731
2020-08-04T02:11:48.740Z cpu18:66115)MemSched: 14635: Admission failure in path: logging/vmsyslogd.28514731/uw.28514731
2020-08-04T02:11:48.742Z cpu18:66115)WARNING: LinuxThread: 381: vmsyslogd: Error cloning thread: -28 (bad0081)
2020-08-04T02:14:41.243Z cpu0:67722)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk t10.NVMe____Dell_Express_Flash_NVMe_P4800X_375GB_SFF0001A82265E4D25C:2
2020-08-04T02:19:29.559Z cpu26:65651)ScsiDeviceIO: 3001: Cmd(0x43a114af1fc0) 0x85, CmdSN 0x16746 from world 69961 to dev “t10.ATA_____DELLBOSS_VD_____________________________059912691e93001000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0$
2020-08-04T02:19:41.247Z cpu38:67722)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk t10.NVMe____Dell_Express_Flash_NVMe_P4800X_375GB_SFF0001A82265E4D25C:2
2020-08-04T02:24:41.263Z cpu46:67722)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:786: Throttled: Attempt to get Virsto stats on unsupported disk t10.NVMe____Dell_Express_Flash_NVMe_P4800X_375GB_SFF0001A82265E4D25C:2

VMB: 112: mbMagic: 2badb002, mbInfo 0x101910
0:00:00:00.000 cpu0:1)Init: 252: Time spent in vmkBoot: 264217 us
0:00:00:00.036 cpu0:1)Starting VMKernel initialization …
0:00:00:00.044 cpu0:1)Initializing memory …
0:00:00:11.807 cpu0:65536)Initializing storage stack …
0:00:00:11.837 cpu0:65536)PCI: 206: enablePCIErrors: 0, enableValidPCIDevices: 1, pciSetBusMaster: 1, disableACSCheck: 0, disableACSCheckForRP: 1, enablePCIEHotplug: 1, pciBarAllocPolicy: 0

 

2020-08-04T02:31:47.160Z cpu2:67202)WARNING: RDT: RDTModInit:989: Kernel is not configured for IPv6
2020-08-04T02:31:57.550Z cpu4:67267)WARNING: CMMDS: MasterSendHeartbeatLogMsg:1832: Send heartbeat to all agents: Failure. Duration of previous failed sends: 0 secs
2020-08-04T02:40:06.392Z cpu8:71052)Boot Successful
2020-08-04T02:40:15.984Z cpu10:71400)WARNING: kbdmode_set:519: invalid keyboard mode 4: Not supported

 

VOBD Logs:

 

  • From the VOBD logs as well we cannot see any events that could trigger the crash.

 

2020-08-04T02:31:25.002Z: [netCorrelator] 55808697us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic1 linkstate is down
2020-08-04T02:31:25.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-08-04T02:31:31.170Z: [netCorrelator] 61976621us: [vob.net.vmnic.linkstate.down] vmnic vmnic0 linkstate down
2020-08-04T02:31:32.002Z: [netCorrelator] 62808664us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic0 linkstate is down
2020-08-04T02:31:32.002Z: An event (esx.problem.net.vmnic.linkstate.down) could not be sent immediately to hostd; queueing for retry.
2020-08-04T02:31:38.001Z: [netCorrelator] 68807684us: [vob.net.lacp.lag.transition.down] LACP warning: Link aggregation group lag1 on VDS DvsPortset-0 is down.
2020-08-04T02:31:38.001Z: [netCorrelator] 68808151us: [esx.problem.net.lacp.lag.transition.down] LAG lag1 on VDS DvsPortset-0 is down.
2020-08-04T02:31:38.001Z: An event (esx.problem.net.lacp.lag.transition.down) could not be sent immediately to hostd; queueing for retry.
2020-08-04T02:31:38.001Z: [netCorrelator] 68808049us: [vob.net.lacp.uplink.disconnected] LACP warning: Uplink vmnic1 on VDS DvsPortset-0 was disconnected.
2020-08-04T02:39:43.599Z: [GenericCorrelator] 554406338us: [vob.user.ssh.enabled] SSH access has been enabled
2020-08-04T02:39:43.599Z: [UserLevelCorrelator] 554406338us: [vob.user.ssh.enabled] SSH access has been enabled
2020-08-04T02:39:43.599Z: [UserLevelCorrelator] 554406584us: [esx.audit.ssh.enabled] SSH access has been enabled.
2020-08-04T02:40:06.186Z: [GenericCorrelator] 576526913us: [vob.user.host.boot] Host has booted.
2020-08-04T02:40:06.186Z: [UserLevelCorrelator] 576526913us: [vob.user.host.boot] Host has booted.
2020-08-04T02:40:06.186Z: [UserLevelCorrelator] 576527446us: [esx.audit.host.boot] Host has booted.

 

VPXA Logs:

 

2020-08-04T02:39:54.701Z – time the service was last started, Section for VMware ESX, pid=70076, version=6.5.0, build=15256549, option=Release
2020-08-04T02:39:54.702Z error vpxa[5A116B0] [Originator@6876 sub=Default] Failed to set the current working directory: /var/log/vmware/vpx, N7Vmacore15SystemExceptionE(No such file or directory)
2020-08-04T02:39:54.703Z info vpxa[5A116B0] [Originator@6876 sub=Default] Initializing SSL
2020-08-04T02:39:54.757Z warning vpxa[5A32B70] [Originator@6876 sub=SysCommandPosix] Status of pid 70084 is discarded
2020-08-04T02:39:55.022Z error vpxa[5A116B0] [Originator@6876 sub=Default] [VimXml] Error fetching /sdk/vimServiceVersions.xml: 503 (Service Unavailable)

 

 

Conclusion:

 

 

  • Determine if the ESXi host was deliberately restarted. We need to Review the below Errors in Hostd Logs:
    • Hostd: [12:51:54.284 27D13B90 info ‘TaskManager’] Task Created : haTask-ha-host-vim.HostSystem.reboot-50
      or
    • DCUI: reboot

 

No such information is logged in the Hostd Logs at the time of issue.

 

 

  • Determine if the ESXi host was deliberately shut down. If an ESXi server was shut down deliberately, it contains an event similar to:

 

  • Hostd: [<YYYY-MM-DD> <time>.550 2FEDEB90 info ‘TaskManager’] Task Created : haTask-ha-host-vim.HostSystem.shutdown-78</time>
    or
     
  • DCUI: poweroff

 

No such information is logged in the Hostd Logs at the time of issue.

 

 

  • Verify whether the virtual machine or ESXi host has generated a core dump
  • When the host is rebooted after a crash and if the core dump was successful, the /var/log/vmksummary.log shows that a core dump is found.

 

For example:

<YYYY-MM-DD>T<time>Z bootstop: Host has booted

<YYYY-MM-DD>T<time>Z bootstop: file core dump found</time></time>

 

No Core Dump is Generated.

 

Action Plan:

 

  • I will recommend you to get in touch with the Hardware Vendor and get an Extensive Hardware Diagnostics done as this could be an issue where the Hardware/Power might have failed and the ESXi Host didn’t got change to log it into the logs.

 

 

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply