A 6.5 host in our environment is reporting state in doubt errors. Need to find out why.
Hostname: fr-abc-15k11-03.infra.abcde.com
Esxi Version: VMware ESXi 6.5.0 build-9298722 ESXi 6.5 U2C
As per the Article https://kb.vmware.com/s/article/1022026
These messages appear as the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete.
An operation can take longer than the timeout period because of several reasons including:
- Array backup operations (LUN backup, replication, etc)
- General overload on the array
- Read/Write Cache on the array (misconfiguration, lack of cache, etc)
- Incorrect tiered storage used (SATA over SCSI)
- Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
vmhba | driver | driver version | VID | DID | SVID | SDID | model |
vmhba0 | lsi_mr3 | 6.910.18.00-1vmw.650.0.0.4564106 | 1000 | 005d | 1028 | 1f42 | Avago (LSI) Dell PERC H730P Adapter |
vmhba2 | qlnativefc | 2.1.81.0-1OEM.600.0.0.2768847 | 1077 | 2031 | 1077 | 257 | QLogic Corp 2600 Series 16Gb Fibre Channel to PCI Express HBA |
vmhba3 | qlnativefc | 2.1.81.0-1OEM.600.0.0.2768847 | 1077 | 2031 | 1077 | 257 | QLogic Corp 2600 Series 16Gb Fibre Channel to PCI Express HBA |
QLogic QLE2662 Dual Port FC16 HBA for QLE2662:
FC Firmware Version: 8.08.203 (d0d5), Driver version 2.1.81.0
- As per the Article: VMwareHCL we can see that the Driver and Firmware are running at the latest Version:
VMKernel Logs:
- From the VMKernel Logs we can see issues with the HBA State in Doubt:
2019-10-11T07:07:20.484Z cpu17:66579)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x28 (0x439701104640, 72323) to dev “naa.600a098038304342783f4a70534aaaa” on path “vmhba3:C0:T7:L12” Failed: H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2019-10-11T07:07:20.484Z cpu17:66579)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a098038304342783f4a70534aaaa” state in doubt; requested fast path state update…
2019-10-11T07:07:20.484Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439701104640) 0x28, CmdSN 0x39b from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2019-10-11T07:07:21.084Z cpu17:66579)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a098038304342783f4a70534aaaa” state in doubt; requested fast path state update…
2019-10-11T07:07:21.084Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x4397011e5b40) 0x28, CmdSN 0x39b from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2019-10-11T07:07:21.384Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x4397011ec7c0) 0x28, CmdSN 0x39b from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:21.684Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439714f08540) 0x28, CmdSN 0x364 from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:21.984Z cpu17:72344)WARNING: iodm: vmk_IodmEvent:194: vmhba3: FRAME DROP event has been observed 6 times in the last one minute. This suggests a problem with Fibre Channel link/switch!.
2019-10-11T07:07:21.984Z cpu17:66579)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a098038304342783f4a70534aaaa” state in doubt; requested fast path state update…
2019-10-11T07:07:21.984Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439714f35d40) 0x28, CmdSN 0x3d0 from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:22.384Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439714f7e3c0) 0x28, CmdSN 0x3db from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:22.684Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439701053340) 0x28, CmdSN 0x34a from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:33.484Z cpu10:72344)WARNING: iodm: vmk_IodmEvent:194: vmhba3: FRAME DROP event has been observed 35 times in the last one minute. This suggests a problem with Fibre Channel link/switch!.
- We can see there are multiple events for the FRAME Drop, which could an issue with the Fibre Channel/Link.
- From the SCSI Code we can see issues with the Path Failures.
Type | Code | Name | Description |
Host Status | [0x2] | BUS_BUSY | This status is returned when the HBA driver is unable to issue a command to the device. This status can occur due to dropped FCP frames in the environment. |
Device Status | [0x0] | GOOD | This status is returned when there is no error from the device or target side. This is when you will see if there is a status for Host or Plugin. |
Plugin Status | [0x0] | GOOD | No error. (ESXi 5.x / 6.x only) |
Sense Key | [0x80] | Sense Key unknown. |
|
Additional Sense Data | 41/00 | DATA PATH FAILURE (SHOULD USE 40 NN) |
|
VOBD Logs:
- From the VOBD Logs we can see issues with the Performance Latency for Few Datastores.
- Datastore Name: naa.600a098038304342783f4a70534aaaa, naa.600a098038304342783f4a7053434111, naa.600a098038304342783f4a705343cccc and naa.600a098038304342783f4a705343qqqq.
Date | Time | NAA ID | Latency |
01-10-2019 | 06:08:07.848Z: | naa.600a098038304342783f4a70534aaaa | 133806 |
01-10-2019 | 06:08:07.848Z: | naa.600a098038304342783f4a70534aaaa | 133806 |
01-10-2019 | 06:08:09.792Z: | naa.600a098038304342783f4a70534aaaa | 287017 |
01-10-2019 | 06:08:09.792Z: | naa.600a098038304342783f4a70534aaaa | 287017 |
01-10-2019 | 06:08:09.842Z: | naa.600a098038304342783f4a70534aaaa | 582535 |
01-10-2019 | 14:33:46.756Z: | naa.600a098038304342783f4a705343cccc | 14630 |
01-10-2019 | 14:33:46.756Z: | naa.600a098038304342783f4a705343cccc | 14630 |
01-10-2019 | 14:33:49.324Z: | naa.600a098038304342783f4a705343cccc | 14630 |
02-10-2019 | 07:52:27.332Z: | naa.600a098038304342783f4a705343qqqq | 25667 |
02-10-2019 | 07:52:27.332Z: | naa.600a098038304342783f4a705343qqqq | 25667 |
02-10-2019 | 07:57:55.821Z: | naa.600a098038304342783f4a705343qqqq | 30743 |
Conclusion:
- From the logs we can see that the issue is happening because of the HBA Driver Aborting Commands.
- We can also see a huge performance latency on some of the datastores which are associated with the Esxi Host.
Action Plan:
- Please check with your Hardware Vendor to check if there are any issues with the HBA/Switch which could be the reason.
- Please share the Article https://kb.vmware.com/s/article/1022026 with the Vendor and investigate this further.