RCA 32: ESXi Host in Not Responding State

Hostname: node003-bbabc.cc.ap-au-1.abc.xyz

ESXi Version: ESXi 6.5 Update 3

Time of Issue: 8:30 AM IST ,30th Sept

Time in GMT: 3:00 AM GMT, 30th Sept

Storage:

vmhba	driver	driver version	VID	DID	SVID	SDID	model
vmhba0	lsi_mr3	7.702.13.00-3vmw.650.2.50.8294253	1000	005f	1137	129	Avago (LSI) MegaRAID SAS Fury Controller
vmhba1	fnic	1.6.0.50	1137	45	1137	137	Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba2	fnic	1.6.0.50	1137	45	1137	137	Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba3	lsi_mr3	7.702.13.00-3vmw.650.2.50.8294253	1000	005f	1137	129	Avago (LSI) MegaRAID SAS Fury Controller

VMK Summary:

From the Summary we can see that the ESXi Host rebooted at: 3:32 AM.

2020-09-30T03:32:12Z bootstop: Host has booted

VOBD Logs:

From VOBD Logs we can see that the Hostd Service seems to be unresponsive.

2020-09-30T03:29:57.688Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.688Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.688Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.694Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.701Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.707Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.713Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.

2020-09-30T03:31:11.477Z: An event (esx.problem.scratch.partition.size.small) could not be sent immediately to hostd; queueing for retry.

2020-09-30T03:31:49.903Z: An event (esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:31:49.903Z: An event (esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:31:59.627Z: An event (esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd; queueing for retry.

Post which we saw that the ESXi Host has been rebooted:

2020-09-30T03:32:12.617Z: [GenericCorrelator] 209649596us: [vob.user.host.boot] Host has booted.
2020-09-30T03:32:12.617Z: [UserLevelCorrelator] 209649596us: [vob.user.host.boot] Host has booted.

2020-09-30T03:34:22.659Z: [GenericCorrelator] 339691615us: [vob.user.maintenancemode.entered] The host has entered maintenance mode
2020-09-30T03:34:22.659Z: [UserLevelCorrelator] 339691615us: [vob.user.maintenancemode.entered] The host has entered maintenance mode
2020-09-30T03:34:22.659Z: [UserLevelCorrelator] 339691755us: [esx.audit.maintenancemode.entered] The host has entered maintenance mode.

Vmkernel Logs:

From the Vmkernel Logs we can see a lot of events related to the “State in Doubt” Errors and they are not limited to the time when we have noticed the issue.

2020-09-30T01:49:30.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e5a2b4b55396a794c” state in doubt; requested fast path state update…
2020-09-30T01:49:30.956Z cpu72:66828)ScsiDeviceIO: 2980: Cmd(0x43975ecdf9c0) 0x2a, CmdSN 0x92d14 from world 65719 to dev “naa.600a09803830436e5a2b4b55396a794c” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:30.956Z cpu72:66828)VVol: 2205: [vvolDev: naa.600a09803830454b79244e52784f354e] WRITE command (ioFlags 0x11, cmd 0) failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:30.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e6a244b553954396a” state in doubt; requested fast path state update…
2020-09-30T01:49:30.956Z cpu72:66828)ScsiDeviceIO: 2980: Cmd(0x4397412fb2c0) 0x2a, CmdSN 0x98406 from world 65719 to dev “naa.600a09803830436e6a244b553954396a” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:30.956Z cpu72:66828)VVol: 2205: [vvolDev: naa.600a09803830454c575d4d632f783673] WRITE command (ioFlags 0x11, cmd 0) failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:33.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e5a2b4b55396a794c” state in doubt; requested fast path state update…
2020-09-30T01:49:33.956Z cpu72:66828)ScsiDeviceIO: 2980: Cmd(0x4397413dffc0) 0x2a, CmdSN 0x92d16 from world 65719 to dev “naa.600a09803830436e5a2b4b55396a794c” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:33.956Z cpu72:66828)VVol: 2205: [vvolDev: naa.600a09803830454b79244e52784f354e] WRITE command (ioFlags 0x11, cmd 0) failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:33.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e6a244b553954396a” state in doubt; requested fast path state update…

At the tie of issue as well we are only able to see the State in Doubt Errors:

2020-09-30T02:20:52.040Z cpu35:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a0980383042306e5d4e4c354c702f” state in doubt; requested fast path state update…
2020-09-30T02:20:52.040Z cpu35:66828)ScsiDeviceIO: 2980: Cmd(0x43975ed05740) 0x2a, CmdSN 0x54fe81 from world 65719 to dev “naa.600a0980383042306e5d4e4c354c702f” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T02:20:52.041Z cpu141:7668681)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:52.041Z cpu66:7668676)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:52.041Z cpu134:7669444)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:52.227Z cpu63:66600)NMP: nmp_ResetDeviceLogThrottling:3458: last error status from device naa.600a0980383047574c3f4c796a526442 repeated 3 times
2020-09-30T02:20:52.227Z cpu63:66600)NMP: nmp_ResetDeviceLogThrottling:3458: last error status from device naa.600a098038304757485d4c796a4e1111 repeated 3 times
2020-09-30T02:20:55.041Z cpu16:7668920)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:55.041Z cpu35:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a0980383042306e5d4e4c354c702f” state in doubt; requested fast path state update…
2020-09-30T02:20:55.041Z cpu35:66828)ScsiDeviceIO: 2980: Cmd(0x43975ecc2bc0) 0x2a, CmdSN 0x54fe83 from world 65719 to dev “naa.600a0980383042306e5d4e4c354c702f” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T02:20:55.041Z cpu121:7668925)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:55.041Z cpu168:7668679)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:55.042Z cpu147:7668923)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:21:13.041Z cpu35:66828)NMP: nmp_ThrottleLogForDevice:3616: Cmd 0x2a (0x43974126a940, 65719) to dev “naa.600a098038304757485d4c796a4e1111” on path “vmhba64:C3:T0:L100” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe. Act:EVAL
2020-09-30T02:21:13.041Z cpu141:7668681)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention

Hostd Logs:

Unable to see any events which could be the cause of the issue.

Conclusion:

From the logs we are not able to conclude the reason for the Host going in Not Responding state. However this could be the outcome of the Hostd Service going unresponsive.

We can see a lot of “State in Doubt” Errors in the Logs:

As per the Article: https://kb.vmware.com/s/article/1022026

These messages appear as the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can take longer than the timeout period because of several reasons including:

Array backup operations (LUN backup, replication, etc)
General overload on the array
Read/Write Cache on the array (misconfiguration, lack of cache, etc)
Incorrect tiered storage used (SATA over SCSI)
Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)

Action Plan:

From the logs I can see that VMBHA is running with driver version 1.6.0.50, but I am not able to see the firmware version.
https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37359&deviceCategory=io&details=1&VID=1137&DID=0045&SVID=1137&SSID=0137&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc
Please check with the Hardware Vendor to confirm if you are running on the Compatible Driver/Firmware Version.

I will also recommend you to investigate the “State in Doubt” Errors with the Storage or Hardware vendor to investigate.

In order to have deeper understanding about the issue, we would need more information from the ESXi Host at the time when it’s in Not Responding State.

Please make sure that SSH is enable on all the ESXi Host as we would need to run the below commands at the time when we are experiencing the issue.

Capture the Live ESXi Kernel Dump:

Log in to the ESXi host command line via SSH or physical console
Instruct the system to perform a live dump via localcli:

localcli –plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform

The core dump will then be gathered. This process can take some time, as it does during a PSOD. When the process is completed, you will be returned to the command prompt.
Once the core dump has been collected and the process is finished, gather a vm-support bundle to collect the logging, system state and livecore.