RCA 25: ESXi Host showing State In Doubt Error

A 6.5 host in our environment is reporting state in doubt errors. Need to find out why.

 

Hostname: fr-abc-15k11-03.infra.abcde.com

 

Esxi Version: VMware ESXi 6.5.0 build-9298722 ESXi 6.5 U2C

 

 

As per the Article https://kb.vmware.com/s/article/1022026

These messages appear as the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete.

An operation can take longer than the timeout period because of several reasons including:

 

  • Array backup operations (LUN backup, replication, etc)
  • General overload on the array
  • Read/Write Cache on the array (misconfiguration, lack of cache, etc)
  • Incorrect tiered storage used (SATA over SCSI)
  • Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)

 

 

vmhba

driver

driver version

VID

DID

SVID

SDID

model

vmhba0

lsi_mr3

6.910.18.00-1vmw.650.0.0.4564106

1000

005d

1028

1f42

Avago (LSI) Dell PERC H730P Adapter

vmhba2

qlnativefc

2.1.81.0-1OEM.600.0.0.2768847

1077

2031

1077

257

QLogic Corp 2600 Series 16Gb Fibre Channel to PCI Express HBA

vmhba3

qlnativefc

2.1.81.0-1OEM.600.0.0.2768847

1077

2031

1077

257

QLogic Corp 2600 Series 16Gb Fibre Channel to PCI Express HBA

QLogic QLE2662 Dual Port FC16 HBA for QLE2662:

FC Firmware Version: 8.08.203 (d0d5), Driver version 2.1.81.0

 

  • As per the Article: VMwareHCL we can see that the Driver and Firmware are running at the latest Version:

 

VMKernel Logs:

 

  • From the VMKernel Logs we can see issues with the HBA State in Doubt:

 

2019-10-11T07:07:20.484Z cpu17:66579)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x28 (0x439701104640, 72323) to dev “naa.600a098038304342783f4a70534aaaa” on path “vmhba3:C0:T7:L12” Failed: H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2019-10-11T07:07:20.484Z cpu17:66579)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a098038304342783f4a70534aaaa” state in doubt; requested fast path state update…
2019-10-11T07:07:20.484Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439701104640) 0x28, CmdSN 0x39b from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2019-10-11T07:07:21.084Z cpu17:66579)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a098038304342783f4a70534aaaa” state in doubt; requested fast path state update…
2019-10-11T07:07:21.084Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x4397011e5b40) 0x28, CmdSN 0x39b from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2019-10-11T07:07:21.384Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x4397011ec7c0) 0x28, CmdSN 0x39b from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:21.684Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439714f08540) 0x28, CmdSN 0x364 from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:21.984Z cpu17:72344)WARNING: iodm: vmk_IodmEvent:194: vmhba3: FRAME DROP event has been observed 6 times in the last one minute. This suggests a problem with Fibre Channel link/switch!.
2019-10-11T07:07:21.984Z cpu17:66579)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.600a098038304342783f4a70534aaaa” state in doubt; requested fast path state update…
2019-10-11T07:07:21.984Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439714f35d40) 0x28, CmdSN 0x3d0 from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:22.384Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439714f7e3c0) 0x28, CmdSN 0x3db from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-10-11T07:07:22.684Z cpu17:66579)ScsiDeviceIO: 2933: Cmd(0x439701053340) 0x28, CmdSN 0x34a from world 72323 to dev “naa.600a098038304342783f4a70534aaaa” failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

 

2019-10-11T07:07:33.484Z cpu10:72344)WARNING: iodm: vmk_IodmEvent:194: vmhba3: FRAME DROP event has been observed 35 times in the last one minute. This suggests a problem with Fibre Channel link/switch!.

 

  • We can see there are multiple events for the FRAME Drop, which could an issue with the Fibre Channel/Link.
  • From the SCSI Code we can see issues with the Path Failures.

 

Type

Code

Name

Description

Host Status

[0x2]

BUS_BUSY

This status is returned when the HBA driver is unable to issue a command to the device. This status can occur due to dropped FCP frames in the environment.

Device Status

[0x0]

GOOD

This status is returned when there is no error from the device or target side. This is when you will see if there is a status for Host or Plugin.

Plugin Status

[0x0]

GOOD

No error. (ESXi 5.x / 6.x only)

Sense Key

[0x80]

Sense Key unknown.

 

Additional Sense Data

41/00

DATA PATH FAILURE (SHOULD USE 40 NN)

 

VOBD Logs:

 

  • From the VOBD Logs we can see issues with the Performance Latency for Few Datastores.
  • Datastore Name: naa.600a098038304342783f4a70534aaaa, naa.600a098038304342783f4a7053434111,  naa.600a098038304342783f4a705343cccc and  naa.600a098038304342783f4a705343qqqq.

 

 

Date

Time

NAA ID

Latency

01-10-2019

06:08:07.848Z:

 naa.600a098038304342783f4a70534aaaa

133806

01-10-2019

06:08:07.848Z:

 naa.600a098038304342783f4a70534aaaa

133806

01-10-2019

06:08:09.792Z:

 naa.600a098038304342783f4a70534aaaa

287017

01-10-2019

06:08:09.792Z:

 naa.600a098038304342783f4a70534aaaa

287017

01-10-2019

06:08:09.842Z:

 naa.600a098038304342783f4a70534aaaa

582535

01-10-2019

14:33:46.756Z:

 naa.600a098038304342783f4a705343cccc

14630

01-10-2019

14:33:46.756Z:

 naa.600a098038304342783f4a705343cccc

14630

01-10-2019

14:33:49.324Z:

 naa.600a098038304342783f4a705343cccc

14630

02-10-2019

07:52:27.332Z:

 naa.600a098038304342783f4a705343qqqq

25667

02-10-2019

07:52:27.332Z:

 naa.600a098038304342783f4a705343qqqq

25667

02-10-2019

07:57:55.821Z:

 naa.600a098038304342783f4a705343qqqq

30743

Conclusion:

 

  • From the logs we can see that the issue is happening because of the HBA Driver Aborting Commands.
  • We can also see a huge performance latency on some of the datastores which are associated with the Esxi Host.

 

Action Plan:

 

  • Please check with your Hardware Vendor to check if there are any issues with the HBA/Switch which could be the reason.
  • Please share the Article https://kb.vmware.com/s/article/1022026 with the Vendor and investigate this further.

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply