RCA 32: ESXi Host in Not Responding State

Hostname: node003-bbabc.cc.ap-au-1.abc.xyz

ESXi Version: ESXi 6.5 Update 3

 

Time of Issue: 8:30 AM IST ,30th Sept

Time in GMT: 3:00 AM GMT, 30th Sept

 

Storage:

 

vmhba

driver

driver version

VID

DID

SVID

SDID

model

vmhba0

lsi_mr3

7.702.13.00-3vmw.650.2.50.8294253

1000

005f

1137

129

Avago (LSI) MegaRAID SAS Fury Controller

vmhba1

fnic

1.6.0.50

1137

45

1137

137

Cisco Systems Inc Cisco VIC FCoE HBA Driver

vmhba2

fnic

1.6.0.50

1137

45

1137

137

Cisco Systems Inc Cisco VIC FCoE HBA Driver

vmhba3

lsi_mr3

7.702.13.00-3vmw.650.2.50.8294253

1000

005f

1137

129

Avago (LSI) MegaRAID SAS Fury Controller

 

 

VMK Summary:

 

  • From the Summary we can see that the ESXi Host rebooted at: 3:32 AM.

 

2020-09-30T03:32:12Z bootstop: Host has booted

 

 

VOBD Logs:

 

  • From VOBD Logs we can see that the Hostd Service seems to be unresponsive.

 

2020-09-30T03:29:57.688Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.688Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.688Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.694Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.701Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.707Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:29:57.713Z: An event (esx.audit.net.firewall.config.changed) could not be sent immediately to hostd; queueing for retry.

2020-09-30T03:31:11.477Z: An event (esx.problem.scratch.partition.size.small) could not be sent immediately to hostd; queueing for retry.

2020-09-30T03:31:49.903Z: An event (esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:31:49.903Z: An event (esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd; queueing for retry.
2020-09-30T03:31:59.627Z: An event (esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd; queueing for retry.

 

 

  • Post which we saw that the ESXi Host has been rebooted:

 

2020-09-30T03:32:12.617Z: [GenericCorrelator] 209649596us: [vob.user.host.boot] Host has booted.
2020-09-30T03:32:12.617Z: [UserLevelCorrelator] 209649596us: [vob.user.host.boot] Host has booted.

2020-09-30T03:34:22.659Z: [GenericCorrelator] 339691615us: [vob.user.maintenancemode.entered] The host has entered maintenance mode
2020-09-30T03:34:22.659Z: [UserLevelCorrelator] 339691615us: [vob.user.maintenancemode.entered] The host has entered maintenance mode
2020-09-30T03:34:22.659Z: [UserLevelCorrelator] 339691755us: [esx.audit.maintenancemode.entered] The host has entered maintenance mode.

 

 

Vmkernel Logs:

 

  • From the Vmkernel Logs we can see a lot of events related to the “State in Doubt” Errors and they are not limited to the time when we have noticed the issue.

 

2020-09-30T01:49:30.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e5a2b4b55396a794c” state in doubt; requested fast path state update…
2020-09-30T01:49:30.956Z cpu72:66828)ScsiDeviceIO: 2980: Cmd(0x43975ecdf9c0) 0x2a, CmdSN 0x92d14 from world 65719 to dev “naa.600a09803830436e5a2b4b55396a794c” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:30.956Z cpu72:66828)VVol: 2205: [vvolDev: naa.600a09803830454b79244e52784f354e] WRITE command (ioFlags 0x11, cmd 0) failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:30.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e6a244b553954396a” state in doubt; requested fast path state update…

2020-09-30T01:49:30.956Z cpu72:66828)ScsiDeviceIO: 2980: Cmd(0x4397412fb2c0) 0x2a, CmdSN 0x98406 from world 65719 to dev “naa.600a09803830436e6a244b553954396a” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:30.956Z cpu72:66828)VVol: 2205: [vvolDev: naa.600a09803830454c575d4d632f783673] WRITE command (ioFlags 0x11, cmd 0) failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:33.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e5a2b4b55396a794c” state in doubt; requested fast path state update…

2020-09-30T01:49:33.956Z cpu72:66828)ScsiDeviceIO: 2980: Cmd(0x4397413dffc0) 0x2a, CmdSN 0x92d16 from world 65719 to dev “naa.600a09803830436e5a2b4b55396a794c” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:33.956Z cpu72:66828)VVol: 2205: [vvolDev: naa.600a09803830454b79244e52784f354e] WRITE command (ioFlags 0x11, cmd 0) failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:33.956Z cpu72:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a09803830436e6a244b553954396a” state in doubt; requested fast path state update…

 

 

  • At the tie of issue as well we are only able to see the State in Doubt Errors:

 

2020-09-30T02:20:52.040Z cpu35:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a0980383042306e5d4e4c354c702f” state in doubt; requested fast path state update…
2020-09-30T02:20:52.040Z cpu35:66828)ScsiDeviceIO: 2980: Cmd(0x43975ed05740) 0x2a, CmdSN 0x54fe81 from world 65719 to dev “naa.600a0980383042306e5d4e4c354c702f” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T02:20:52.041Z cpu141:7668681)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention

2020-09-30T02:20:52.041Z cpu66:7668676)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:52.041Z cpu134:7669444)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:52.227Z cpu63:66600)NMP: nmp_ResetDeviceLogThrottling:3458: last error status from device naa.600a0980383047574c3f4c796a526442 repeated 3 times
2020-09-30T02:20:52.227Z cpu63:66600)NMP: nmp_ResetDeviceLogThrottling:3458: last error status from device naa.600a098038304757485d4c796a4e1111 repeated 3 times
2020-09-30T02:20:55.041Z cpu16:7668920)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:55.041Z cpu35:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a0980383042306e5d4e4c354c702f” state in doubt; requested fast path state update…
2020-09-30T02:20:55.041Z cpu35:66828)ScsiDeviceIO: 2980: Cmd(0x43975ecc2bc0) 0x2a, CmdSN 0x54fe83 from world 65719 to dev “naa.600a0980383042306e5d4e4c354c702f” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe.
2020-09-30T02:20:55.041Z cpu121:7668925)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention

2020-09-30T02:20:55.041Z cpu168:7668679)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:20:55.042Z cpu147:7668923)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention
2020-09-30T02:21:13.041Z cpu35:66828)NMP: nmp_ThrottleLogForDevice:3616: Cmd 0x2a (0x43974126a940, 65719) to dev “naa.600a098038304757485d4c796a4e1111” on path “vmhba64:C3:T0:L100” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe. Act:EVAL
2020-09-30T02:21:13.041Z cpu141:7668681)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit attention

 

Hostd Logs:

 

  • Unable to see any events which could be the cause of the issue.

 

 

 

Conclusion:

 

  • From the logs we are not able to conclude the reason for the Host going in Not Responding state. However this could be the outcome of the Hostd Service going unresponsive.

 

  • We can see a lot of “State in Doubt” Errors in the Logs:

 

These messages appear as the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can take longer than the timeout period because of several reasons including:

 

  • Array backup operations (LUN backup, replication, etc)
  • General overload on the array
  • Read/Write Cache on the array (misconfiguration, lack of cache, etc)
  • Incorrect tiered storage used (SATA over SCSI)
  • Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)

 

 

Action Plan:

 

 

  • I will also recommend you to investigate the “State in Doubt” Errors with the Storage or Hardware vendor to investigate.

 

  • In order to have deeper understanding about the issue, we would need more information from the ESXi Host at the time when it’s in Not Responding State.

 

  • Please make sure that SSH is enable on all the ESXi Host as we would need to run the below commands at the time when we are experiencing the issue.

 

  • Capture the Live ESXi Kernel Dump:

 

  1. Log in to the ESXi host command line via SSH or physical console
  2. Instruct the system to perform a live dump via localcli:

    localcli –plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform

 

  1. The core dump will then be gathered. This process can take some time, as it does during a PSOD. When the process is completed, you will be returned to the command prompt.
  2. Once the core dump has been collected and the process is finished, gather a vm-support bundle to collect the logging, system state and livecore.

 

  • Capture the Hostd Dump to investigate the reason for Hostd Failure.

 

vmkbacktrace -n hostd -c -w

 

  • One this is done please capture a ESXi Host support Bundle from the SSH using vm-support command

For more information please refer to: https://kb.vmware.com/s/article/1010705

 

  • Reboot the ESXi Host.

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply