RCA 19: Storage Related issues causing VM to Fail

Time: 25th June 2019 8AM MST to 10:30AM MST.

Time in GMT: 25th June 2019 2:00 PM to 4:30 PM

Host Name for Dead Path: esx-auswsyd-esx14.au.abcd.com

VMware ESXi 6.5.0 build-7388607

Cisco Systems Inc UCSC-C240-M4SX

Storage HBA Information:

vmhba	driver	driver version	VID	DID	SVID	SDID	model
vmhba0	vmw_ahci	1.0.0-39vmw.650.1.26.5969303	8086	8d02	1137	67	Intel Corporation Wellsburg AHCI Controller
vmhba1	lsi_mr3	6.910.18.00-1vmw.650.0.0.4564106	1000	005d	1137	00db	Avago (LSI) MegaRAID SAS Invader Controller
vmhba2	lpfc	11.1.0.6-1vmw.650.0.0.4564106	10df	f100	10df	f100	Emulex Corporation Emulex LPe12000 8Gb PCIe Fibre Channel Adapter

Storage Topology:

Topology	Number of paths
FC (Fibre Channel)	12
FCoE (Fibre Channel over Ethernet)	0
iSCSI	0
NFS (mounted targets)	0
Local (mpx devices)	2
vSAN (datastores)	0
Total (mpath, no NFS or vSAN)	14
Dead	0

Vendor	Model	Revision
Cisco	UCS-C240-M4	0a28
Cisco	UCSC-MRAID12G	4.62
DGC	VRAID	4203

VOBD Logs:

From the VOBD logs we can see that the Path State changed to Dead somewhere around: 14:18:47.

2019-06-25T14:18:47.923Z: [scsiCorrelator] 20977718305959us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L0 changed state from on
2019-06-25T14:18:47.925Z: [scsiCorrelator] 20977997297565us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf45a6d92 degraded. Path vmhba64:C0:T3:L0 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN01”.
2019-06-25T14:18:47.927Z: [scsiCorrelator] 20977718309789us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L1 changed state from on
2019-06-25T14:18:47.928Z: [scsiCorrelator] 20977997300432us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 degraded. Path vmhba64:C0:T2:L1 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN02”.
2019-06-25T14:18:47.928Z: [scsiCorrelator] 20977718309793us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L1 changed state from on
2019-06-25T14:18:47.929Z: [scsiCorrelator] 20977997301536us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 degraded. Path vmhba64:C0:T3:L1 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN02”.
2019-06-25T14:18:47.929Z: [scsiCorrelator] 20977718310185us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L2 changed state from on
2019-06-25T14:18:47.930Z: [scsiCorrelator] 20977997302580us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000bb9615cda52abcd degraded. Path vmhba64:C0:T2:L2 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN03”.
2019-06-25T14:18:47.930Z: [scsiCorrelator] 20977718310188us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L2 changed state from on
2019-06-25T14:18:47.931Z: [scsiCorrelator] 20977997303608us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000bb9615cda52abcd degraded. Path vmhba64:C0:T3:L2 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN03”.
2019-06-25T14:18:49.929Z: [scsiCorrelator] 20977720311952us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L0 changed state from on
2019-06-25T14:18:49.931Z: [scsiCorrelator] 20977999302845us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf45a6d92 degraded. Path vmhba64:C0:T2:L0 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN01”.
2019-06-25T17:51:50.725Z: [scsiCorrelator] 20990500955210us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L0 changed state from dead

2019-06-25T17:51:50.726Z: [scsiCorrelator] 20990780098211us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6006016030a03a000ab9615cf45a6d92 (Datastores: “AMX_SYD_EUC_VPT_Share_LUN01”) restored. Path vmhba64:C0:T2:L0 is active again.
2019-06-25T17:51:50.726Z: [scsiCorrelator] 20990500955216us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L1 changed state from dead
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990780099265us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 (Datastores: “AMX_SYD_EUC_VPT_Share_LUN02”) restored. Path vmhba64:C0:T2:L1 is active again.
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990500955838us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T3:L0 changed state from dead
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990500956170us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T3:L1 changed state from dead
2019-06-25T17:51:50.729Z: [scsiCorrelator] 20990500959496us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L2 changed state from dead
2019-06-25T17:51:50.730Z: [scsiCorrelator] 20990780102162us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6006016030a03a000bb9615cda52abcd (Datastores: “AMX_SYD_EUC_VPT_Share_LUN03”) restored. Path vmhba64:C0:T2:L2 is active again.

Vmkernel Logs:

The ESXi hosts first logs the failover by reporting the RSCN that would have been triggered when the storage controller TPG’s were disabled.

For example, RSCNs logged by a lpfc driver:

2019-06-25T14:17:04.697Z cpu12:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 1, path is dead, IO errors: busy 0, retry 3682, no_connect 145139, fcperr 33659, tmo 0
2019-06-25T14:17:04.697Z cpu12:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 0, path is dead, IO errors: busy 0, retry 0, no_connect 144665, fcperr 33651, tmo 0
2019-06-25T14:18:37.921Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5700: 1:(0):0214 RSCN received Data: x220 x0 x4 x1
2019-06-25T14:18:37.921Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5710: 1:(0):5973 RSCN received event x0 : Address format x00 : DID x020001
2019-06-25T14:18:37.924Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5700: 1:(0):0214 RSCN received Data: x268 x1 x4 x1
2019-06-25T14:18:37.924Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5710: 1:(0):5973 RSCN received event x0 : Address format x00 : DID x020021

We see PLOGI failure messages from the driver, looks like login to the device is not happening and is failing to establish a connection with the device. Therefore, the Paths fail to come back after reboot.
“Devloss Timeout” is a mechanism in lpfc driver. From start observation of a target lost, to confirm the target is lost. lpfc driver will wait about 10 seconds. So the following log means the lpfc confirms the target is lost

2019-06-25T14:18:39.628Z cpu30:66359)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0xf1 (0x4395904373c0, 65591) to dev “naa.6006016030a03a000bb9615cda52abcd” on path “vmhba64:C0:T3:L2” Failed: H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:NONE
2019-06-25T14:18:39.628Z cpu30:66359)ScsiDeviceIO: 2927: Cmd(0x43958dfcfa80) 0xfe, CmdSN 0x8a0514 from world 65591 to dev “naa.6006016030a03a000bb9615cda52abcd” failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

2019-06-25T14:18:47.922Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:247: 1:(0):0203 Devloss timeout on WWPN 50:06:01:6a:48:60:0c:23 NPort x020001 Data: x8 x8 x0
2019-06-25T14:18:47.922Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:288: 1:(0):3298 ScsiNotifyPathStateChangeAsyncSAdapter Num x3 TID x3, DID x020001.
2019-06-25T14:18:47.926Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:247: 1:(0):0203 Devloss timeout on WWPN 50:06:01:62:48:60:0c:23 NPort x020021 Data: x8 x8 x0
2019-06-25T14:18:47.926Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:288: 1:(0):3298 ScsiNotifyPathStateChangeAsyncSAdapter Num x3 TID x2, DID x020021.

After the ‘devloss’ timer expires the WWPNs are removed from available use:[…] and the RSCN got Timed out.

2019-06-25T14:19:10.922Z cpu21:66188)WARNING: lpfc: lpfc_disc_timeout_handler:5654: 1:(0):0231 RSCN timeout Data: x0 x3
2019-06-25T14:19:37.403Z cpu8:66215)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.6006016030a03a000bb9615cda52abcd repeated 137 times

2019-06-25T14:20:07.403Z cpu5:3680684)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1101: Could not select path for device “naa.600601600db03c00e277704e7dfeabcd“.
2019-06-25T14:20:07.403Z cpu10:3693432)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1101: Could not select path for device “naa.600601600db03c00d5e3b3ed77abcd11“.
2019-06-25T14:21:33.566Z cpu27:66357)ScsiDeviceIO: 2948: Cmd(0x439d89f4a580) 0x1a, CmdSN 0x2d0b06 from world 0 to dev “naa.6006bf1d58d632202095c60507cfc809” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 1, path is dead, IO errors: busy 0, retry 3682, no_connect 145143, fcperr 33659, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 0, path is dead, IO errors: busy 0, retry 0, no_connect 144669, fcperr 33651, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 3, path is dead, IO errors: busy 0, retry 288, no_connect 14, fcperr 11, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 2, path is dead, IO errors: busy 0, retry 312, no_connect 14, fcperr 4, tmo 0

Here we can see that the command TEST_UNIT_READY is initiated instantly due to a failed command:

2019-06-26T17:44:28.678Z cpu31:1637010)WARNING: lsi_mr3: map_cmd_status:1640: mfi: I/O returned error, cmd = 0x439d87c91180[0x4d] status = 0x2d, ext_status = 0x8.
2019-06-26T17:44:28.678Z cpu31:66357)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x4d (0x439d87c91180, 0) to dev “naa.5002cc8d9f1aabcd” on path “vmhba1:C0:T8:L0” Failed: H:0x7 D:0x8 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2019-06-26T17:44:28.678Z cpu31:66357)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.5002cc8d9f1aabcd” state in doubt; requested fast path state update…
2019-06-26T17:44:28.678Z cpu31:66357)ScsiDeviceIO: 2927: Cmd(0x439d87c91180) 0x4d, CmdSN 0x2d3d90 from world 0 to dev “naa.5002cc8d9f1aabcd” failed H:0x7 D:0x8 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2019-06-26T17:44:28.695Z cpu31:66357)WARNING: lsi_mr3: map_cmd_status:1640: mfi: I/O returned error, cmd = 0x439d87c91180[0x12] status = 0x2d, ext_status = 0x8.

Host Name for Active Path: esx-HPAPVEABCD01.au.abcd.com

VMware ESXi 6.0.0 build-7967664

Host Bus Adapter (HBA, hardware only, lspci)

vmhba	driver	driver version	VID	DID	SVID	SDID	model
vmhba2	qlnativefc	2.1.57.0-1OEM.600.0.0.2768847	1077	2261	1077	02a8	QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba3	qlnativefc	2.1.57.0-1OEM.600.0.0.2768847	1077	2261	1077	02a8	QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba6	qlnativefc	2.1.57.0-1OEM.600.0.0.2768847	1077	2261	1077	02a8	QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba7	qlnativefc	2.1.57.0-1OEM.600.0.0.2768847	1077	2261	1077	02a8	QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter

HBA commands statistics

Adapter	Driver	Good CMDS	Failed CMDS	<– %	MB read	Failed MB read	<– %	MB written	Failed MB written	<– %	Res. ops	Failed res. ops	Res. conflicts
vmhba2	qlnativefc	55539365	18	0.00%	1189230.604	0	0.00%	878067.813	0	0.00%	0	0	0
vmhba3	qlnativefc	0	0	–	0	0	–	0	0	–	0	0	0
vmhba5	lsi_mr3	216167266	60496	0.00%	5139163.045	0	0.00%	1989061.516	0	0.00%	2502370	0	0
vmhba6	qlnativefc	55577208	4	0.00%	1189488.888	0	0.00%	878717.496	0	0.00%	0	0	0
vmhba7	qlnativefc	0	0	–	0	0	–	0	0	–	0	0	0

Datastore Name	naa / t10 / eui ID	Volume UUID	Commands	Blocks Read	Blocks Written	LUN ID	Medium	Number Path	Vendor
Unity_VDG-AUS-UCE_LUN00	naa.6006016030a03a007181485c5238abcd	5c49ed89-ec633db0-0ae4-e4434b04c344	109925180	487163874200.00%	3597897593	LUN:0	fc	4	DGC Fibre Channel
zDoNotUse_HPAPVEABCD01_Local_Storage	naa.6d0946606f185500232c124b8676abcd	5b9a60e5-734a83de-7028-e4434b04c344	181284200	1020851568000.00%	4069593297	LUN:0	sas	1	DELL Serial Attached

VOBD Logs:

VOBD Logs are missing from the bundle, so we are not able to find the device state.

Vmkernel Logs:

From the vmkernel logs we can see SCSI Device IO Errors:

2019-06-25T14:03:39.548Z cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d2f from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-06-25T14:03:39.593Z cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d32 from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-06-25T14:03:39.615Z cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d35 from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-06-25T14:08:39.557Z cpu47:33547)ScsiDeviceIO: 2652: Cmd(0x43bdd3286b80) 0x35, CmdSN 0x6e3d85 from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2019-06-25T14:18:38.582Z cpu26:33549)qlnativefc: vmhba6(87:0.0): Created work queue qlnativefc256t10wq
2019-06-25T14:18:38.586Z cpu26:33549)qlnativefc: vmhba6(87:0.0): fcport 5006016a48600c23 (targetId = 0) ONLINE
2019-06-25T14:18:38.587Z cpu26:33549)qlnativefc: vmhba6(87:0.0): fcport 5006016248600c23 (targetId = 1) ONLINE

We can also see the FC Link initiated a Rescan for the HBA which could probably bring the storage online.

2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba2(3b:0.0): Issuing a Rescan FC Link
2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba3(3b:0.1): Issuing a Rescan FC Link
2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba6(87:0.0): Issuing a Rescan FC Link
2019-06-26T17:30:10.084Z cpu54:15071195)qlnativefc: vmhba7(87:0.1): Issuing a Rescan FC Link

Conclusion:

Based on the Analysis we were not able to find any Vmkernel associated issues, however we can see PLOGI failure messages from the driver, looks like login to the device is not happening and is failing to establish a connection with the device.
Therefore, the Paths fail to come back after reboot.

“Devloss Timeout” is a mechanism in lpfc driver.

As per the Warning Message: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.5002cc8d9f1aabcd” state in doubt; requested fast path state update…

These messages appear as the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can take longer than the timeout period because of several reasons including:

Array backup operations (LUN backup, replication, etc)
General overload on the array
Read/Write Cache on the array (misconfiguration, lack of cache, etc)
Incorrect tiered storage used (SATA over SCSI)
Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)

For more information please refer: https://kb.vmware.com/s/article/1022026

Action Plan

Driver Version which we are using is lpfc version 11.1.0.6-1vmw which as per the Article: Vmware HCL we can see that this driver version is VMware Inbox, native. I will recommend you to check with you hardware vendor and get to the latest version of HBA.