Hostname: node003-bbabc.cc.ap-au-1.abc.xyz
ESXi Version: ESXi 6.5 Update 3
Time of Issue: 8:30 AM IST ,30th Sept
Time in GMT: 3:00 AM GMT, 30th Sept
Storage:
vmhba |
driver |
driver version |
VID |
DID |
SVID |
SDID |
model |
vmhba0 |
lsi_mr3 |
7.702.13.00-3vmw.650.2.50.8294253 |
1000 |
005f |
1137 |
129 |
Avago (LSI) MegaRAID SAS Fury Controller |
vmhba1 |
fnic |
1.6.0.50 |
1137 |
45 |
1137 |
137 |
Cisco Systems Inc Cisco VIC FCoE HBA Driver |
vmhba2 |
fnic |
1.6.0.50 |
1137 |
45 |
1137 |
137 |
Cisco Systems Inc Cisco VIC FCoE HBA Driver |
vmhba3 |
lsi_mr3 |
7.702.13.00-3vmw.650.2.50.8294253 |
1000 |
005f |
1137 |
129 |
Avago (LSI) MegaRAID SAS Fury Controller |
VMK Summary:
- From the
Summary we can see that the ESXi Host rebooted at: 3:32 AM.
2020-09-30T03:32:12Z bootstop: Host has booted
VOBD Logs:
- From VOBD
Logs we can see that the Hostd Service seems to be unresponsive.
2020-09-30T03:29:57.688Z: An event
(esx.audit.net.firewall.config.changed) could not be sent immediately to hostd;
queueing for retry.
2020-09-30T03:29:57.688Z:
An event (esx.audit.net.firewall.config.changed) could not be sent immediately
to hostd; queueing for retry.
2020-09-30T03:29:57.688Z:
An event (esx.audit.net.firewall.config.changed) could not be sent immediately
to hostd; queueing for retry.
2020-09-30T03:29:57.694Z:
An event (esx.audit.net.firewall.config.changed) could not be sent immediately
to hostd; queueing for retry.
2020-09-30T03:29:57.701Z:
An event (esx.audit.net.firewall.config.changed) could not be sent immediately
to hostd; queueing for retry.
2020-09-30T03:29:57.707Z:
An event (esx.audit.net.firewall.config.changed) could not be sent immediately
to hostd; queueing for retry.
2020-09-30T03:29:57.713Z:
An event (esx.audit.net.firewall.config.changed) could not be sent immediately
to hostd; queueing for retry.
2020-09-30T03:31:11.477Z: An event
(esx.problem.scratch.partition.size.small) could not be sent immediately to
hostd; queueing for retry.
2020-09-30T03:31:49.903Z: An event
(esx.problem.vmsyslogd.remote.failure) could not be sent immediately to hostd;
queueing for retry.
2020-09-30T03:31:49.903Z: An event (esx.problem.vmsyslogd.remote.failure) could
not be sent immediately to hostd; queueing for retry.
2020-09-30T03:31:59.627Z: An event (esx.problem.vmsyslogd.remote.failure) could
not be sent immediately to hostd; queueing for retry.
- Post
which we saw that the ESXi Host has been rebooted:
2020-09-30T03:32:12.617Z: [GenericCorrelator] 209649596us:
[vob.user.host.boot] Host has booted.
2020-09-30T03:32:12.617Z:
[UserLevelCorrelator] 209649596us: [vob.user.host.boot] Host has booted.
2020-09-30T03:34:22.659Z: [GenericCorrelator]
339691615us: [vob.user.maintenancemode.entered] The host has entered
maintenance mode
2020-09-30T03:34:22.659Z: [UserLevelCorrelator] 339691615us:
[vob.user.maintenancemode.entered] The host has entered maintenance mode
2020-09-30T03:34:22.659Z: [UserLevelCorrelator] 339691755us:
[esx.audit.maintenancemode.entered] The host has entered maintenance mode.
Vmkernel Logs:
- From the
Vmkernel Logs we can see a lot of events related to the “State in
Doubt” Errors and they are not limited to the time when we have
noticed the issue.
2020-09-30T01:49:30.956Z cpu72:66828)WARNING: NMP:
nmp_DeviceRequestFastDeviceProbe:237: NMP device
“naa.600a09803830436e5a2b4b55396a794c” state in doubt; requested fast
path state update…
2020-09-30T01:49:30.956Z cpu72:66828)ScsiDeviceIO: 2980:
Cmd(0x43975ecdf9c0) 0x2a, CmdSN 0x92d14 from world 65719 to dev
“naa.600a09803830436e5a2b4b55396a794c” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:30.956Z cpu72:66828)VVol: 2205: [vvolDev:
naa.600a09803830454b79244e52784f354e] WRITE command (ioFlags 0x11, cmd 0)
failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:30.956Z cpu72:66828)WARNING: NMP:
nmp_DeviceRequestFastDeviceProbe:237: NMP device
“naa.600a09803830436e6a244b553954396a” state in doubt; requested fast
path state update…
2020-09-30T01:49:30.956Z cpu72:66828)ScsiDeviceIO: 2980:
Cmd(0x4397412fb2c0) 0x2a, CmdSN 0x98406 from world 65719 to dev
“naa.600a09803830436e6a244b553954396a” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:30.956Z cpu72:66828)VVol: 2205: [vvolDev:
naa.600a09803830454c575d4d632f783673] WRITE command (ioFlags 0x11, cmd 0)
failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:33.956Z cpu72:66828)WARNING: NMP:
nmp_DeviceRequestFastDeviceProbe:237: NMP device
“naa.600a09803830436e5a2b4b55396a794c” state in doubt; requested fast
path state update…
2020-09-30T01:49:33.956Z cpu72:66828)ScsiDeviceIO: 2980:
Cmd(0x4397413dffc0) 0x2a, CmdSN 0x92d16 from world 65719 to dev
“naa.600a09803830436e5a2b4b55396a794c” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x6 0x3f 0xe.
2020-09-30T01:49:33.956Z cpu72:66828)VVol: 2205: [vvolDev:
naa.600a09803830454b79244e52784f354e] WRITE command (ioFlags 0x11, cmd 0)
failed: Transient storage condition, suggest retry (0xbad00be)
2020-09-30T01:49:33.956Z cpu72:66828)WARNING: NMP:
nmp_DeviceRequestFastDeviceProbe:237: NMP device
“naa.600a09803830436e6a244b553954396a” state in doubt; requested fast
path state update…
- At the
tie of issue as well we are only able to see the State in Doubt Errors:
2020-09-30T02:20:52.040Z cpu35:66828)WARNING: NMP:
nmp_DeviceRequestFastDeviceProbe:237: NMP device
“naa.600a0980383042306e5d4e4c354c702f” state in doubt; requested fast
path state update…
2020-09-30T02:20:52.040Z cpu35:66828)ScsiDeviceIO: 2980: Cmd(0x43975ed05740)
0x2a, CmdSN 0x54fe81 from world 65719 to dev
“naa.600a0980383042306e5d4e4c354c702f” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x6 0x3f 0xe.
2020-09-30T02:20:52.041Z cpu141:7668681)ScsiDevice: 3977: Handle REPORTED LUNS
CHANGED DATA unit attention
2020-09-30T02:20:52.041Z
cpu66:7668676)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit
attention
2020-09-30T02:20:52.041Z
cpu134:7669444)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit
attention
2020-09-30T02:20:52.227Z cpu63:66600)NMP:
nmp_ResetDeviceLogThrottling:3458: last error status from device
naa.600a0980383047574c3f4c796a526442 repeated 3 times
2020-09-30T02:20:52.227Z cpu63:66600)NMP:
nmp_ResetDeviceLogThrottling:3458: last error status from device naa.600a098038304757485d4c796a4e1111
repeated 3 times
2020-09-30T02:20:55.041Z
cpu16:7668920)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit
attention
2020-09-30T02:20:55.041Z
cpu35:66828)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device
“naa.600a0980383042306e5d4e4c354c702f” state in doubt; requested fast
path state update…
2020-09-30T02:20:55.041Z cpu35:66828)ScsiDeviceIO: 2980: Cmd(0x43975ecc2bc0)
0x2a, CmdSN 0x54fe83 from world 65719 to dev
“naa.600a0980383042306e5d4e4c354c702f” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x6 0x3f 0xe.
2020-09-30T02:20:55.041Z cpu121:7668925)ScsiDevice: 3977: Handle REPORTED LUNS
CHANGED DATA unit attention
2020-09-30T02:20:55.041Z
cpu168:7668679)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit
attention
2020-09-30T02:20:55.042Z
cpu147:7668923)ScsiDevice: 3977: Handle REPORTED LUNS CHANGED DATA unit
attention
2020-09-30T02:21:13.041Z
cpu35:66828)NMP: nmp_ThrottleLogForDevice:3616: Cmd 0x2a (0x43974126a940,
65719) to dev “naa.600a098038304757485d4c796a4e1111” on path “vmhba64:C3:T0:L100”
Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x3f 0xe. Act:EVAL
2020-09-30T02:21:13.041Z cpu141:7668681)ScsiDevice: 3977: Handle REPORTED LUNS
CHANGED DATA unit attention
Hostd Logs:
- Unable to
see any events which could be the cause of the issue.
Conclusion:
- From the
logs we are not able to conclude the reason for the Host going in Not
Responding state. However this could be the outcome of the Hostd Service
going unresponsive.
- We can
see a lot of “State in Doubt” Errors in the Logs:
- As per
the Article: https://kb.vmware.com/s/article/1022026
These messages appear as the Host Bus Adapter
(HBA) driver aborts a command because the command took longer than the timeout
period of 5 seconds to complete. An operation can take longer than the timeout
period because of several reasons including:
- Array backup operations (LUN backup, replication, etc)
- General overload on the array
- Read/Write Cache on the array (misconfiguration, lack of cache,
etc)
- Incorrect tiered storage used (SATA over SCSI)
- Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
Action Plan:
- From the
logs I can see that VMBHA is running with driver version 1.6.0.50, but I
am not able to see the firmware version.
- https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37359&deviceCategory=io&details=1&VID=1137&DID=0045&SVID=1137&SSID=0137&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc
- Please
check with the Hardware Vendor to confirm if you are running on the
Compatible Driver/Firmware Version.
- I will
also recommend you to investigate the “State in Doubt” Errors
with the Storage or Hardware vendor to investigate.
- In order
to have deeper understanding about the issue, we would need more
information from the ESXi Host at the time when it’s in Not Responding
State.
- Please
make sure that SSH is enable on all the ESXi Host as we would need to run
the below commands at the time when we are experiencing the issue.
- Capture
the Live ESXi Kernel Dump:
- Log in to
the ESXi host command line via SSH or physical console
- Instruct
the system to perform a live dump via localcli:
localcli –plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform
- The core
dump will then be gathered. This process can take some time, as it does
during a PSOD. When the process is completed, you will be returned to the
command prompt.
- Once the
core dump has been collected and the process is finished, gather a
vm-support bundle to collect the logging, system state and livecore.
- Capture
the Hostd Dump to investigate the reason for Hostd Failure.
vmkbacktrace -n hostd -c -w
- One this
is done please capture a ESXi Host support Bundle from the SSH using
vm-support command
For more information please
refer to: https://kb.vmware.com/s/article/1010705
- Reboot
the ESXi Host.