RCA 19: Storage-Related Issues Causing VMs to Fail

Time (MST): 25 June 2019, 8:00 AM to 10:30 AM

Time (GMT): 25 June 2019, 2:00 PM to 4:30 PM

 

Host Name for Dead Path: esx-auswsyd-esx14.au.abcd.com

VMware ESXi 6.5.0 build-7388607

Cisco Systems Inc UCSC-C240-M4SX

 

  • Storage HBA Information:

 

vmhba  | driver   | driver version                   | VID  | DID  | SVID | SDID | model
vmhba0 | vmw_ahci | 1.0.0-39vmw.650.1.26.5969303     | 8086 | 8d02 | 1137 | 67   | Intel Corporation Wellsburg AHCI Controller
vmhba1 | lsi_mr3  | 6.910.18.00-1vmw.650.0.0.4564106 | 1000 | 005d | 1137 | 00db | Avago (LSI) MegaRAID SAS Invader Controller
vmhba2 | lpfc     | 11.1.0.6-1vmw.650.0.0.4564106    | 10df | f100 | 10df | f100 | Emulex Corporation Emulex LPe12000 8Gb PCIe Fibre Channel Adapter
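
This inventory can be reproduced on the host itself; a minimal sketch using standard ESXi commands (the grep width is approximate and the field layout can vary by build):

# List HBAs with their driver and description
esxcli storage core adapter list

# PCI IDs (VID/DID/SVID/SDID) for a VMware HCL lookup
esxcli hardware pci list | grep -A 14 vmhba2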

 

  • Storage Topology:

 

Topology                           | Number of paths
FC (Fibre Channel)                 | 12
FCoE (Fibre Channel over Ethernet) | 0
iSCSI                              | 0
NFS (mounted targets)              | 0
Local (mpx devices)                | 2
vSAN (datastores)                  | 0
Total (mpath, no NFS or vSAN)      | 14
Dead                               | 0
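
The counts above can be cross-checked live on the host; a quick sketch:

# Full path inventory (adapter, channel, target, LUN, state)
esxcli storage core path list

# Summarise paths per state (active/standby/dead)
esxcli storage core path list | grep "State:" | sort | uniq -c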

 

 

Vendor | Model         | Revision
Cisco  | UCS-C240-M4   | 0a28
Cisco  | UCSC-MRAID12G | 4.62
DGC    | VRAID         | 4203

 

VOBD Logs:

 

  • From the VOBD logs we can see that the path state changed to Dead at around 14:18:47 (the log timestamps below are UTC):

 

2019-06-25T14:18:47.923Z: [scsiCorrelator] 20977718305959us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L0 changed state from on
2019-06-25T14:18:47.925Z: [scsiCorrelator] 20977997297565us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf45a6d92 degraded. Path vmhba64:C0:T3:L0 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN01”.
2019-06-25T14:18:47.927Z: [scsiCorrelator] 20977718309789us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L1 changed state from on
2019-06-25T14:18:47.928Z: [scsiCorrelator] 20977997300432us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 degraded. Path vmhba64:C0:T2:L1 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN02”.
2019-06-25T14:18:47.928Z: [scsiCorrelator] 20977718309793us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L1 changed state from on
2019-06-25T14:18:47.929Z: [scsiCorrelator] 20977997301536us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 degraded. Path vmhba64:C0:T3:L1 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN02”.
2019-06-25T14:18:47.929Z: [scsiCorrelator] 20977718310185us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L2 changed state from on
2019-06-25T14:18:47.930Z: [scsiCorrelator] 20977997302580us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000bb9615cda52abcd degraded. Path vmhba64:C0:T2:L2 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN03”.
2019-06-25T14:18:47.930Z: [scsiCorrelator] 20977718310188us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L2 changed state from on
2019-06-25T14:18:47.931Z: [scsiCorrelator] 20977997303608us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000bb9615cda52abcd degraded. Path vmhba64:C0:T3:L2 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN03”.
2019-06-25T14:18:49.929Z: [scsiCorrelator] 20977720311952us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L0 changed state from on
2019-06-25T14:18:49.931Z: [scsiCorrelator] 20977999302845us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6006016030a03a000ab9615cf45a6d92 degraded. Path vmhba64:C0:T2:L0 is down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN01”.
2019-06-25T17:51:50.725Z: [scsiCorrelator] 20990500955210us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L0 changed state from dead

2019-06-25T17:51:50.726Z: [scsiCorrelator] 20990780098211us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6006016030a03a000ab9615cf45a6d92 (Datastores: “AMX_SYD_EUC_VPT_Share_LUN01”) restored. Path vmhba64:C0:T2:L0 is active again.
2019-06-25T17:51:50.726Z: [scsiCorrelator] 20990500955216us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L1 changed state from dead
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990780099265us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 (Datastores: “AMX_SYD_EUC_VPT_Share_LUN02”) restored. Path vmhba64:C0:T2:L1 is active again.
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990500955838us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T3:L0 changed state from dead
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990500956170us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T3:L1 changed state from dead
2019-06-25T17:51:50.729Z: [scsiCorrelator] 20990500959496us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L2 changed state from dead
2019-06-25T17:51:50.730Z: [scsiCorrelator] 20990780102162us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6006016030a03a000bb9615cda52abcd (Datastores: “AMX_SYD_EUC_VPT_Share_LUN03”) restored. Path vmhba64:C0:T2:L2 is active again.
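
These events come from /var/log/vobd.log (the default location on ESXi 6.x); a sketch for pulling the full path-state history:

# Path state transitions and redundancy events
grep -E "pathstate|redundancy" /var/log/vobd.log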

 

 

VMkernel Logs:

 

  • The ESXi host first logs the failover by reporting the RSCN that would have been triggered when the storage controller TPGs (target port groups) were disabled.

 

For example, RSCNs logged by the lpfc driver:

 

2019-06-25T14:17:04.697Z cpu12:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 1, path is dead, IO errors: busy 0, retry 3682, no_connect 145139, fcperr 33659, tmo 0
2019-06-25T14:17:04.697Z cpu12:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 0, path is dead, IO errors: busy 0, retry 0, no_connect 144665, fcperr 33651, tmo 0
2019-06-25T14:18:37.921Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5700: 1:(0):0214 RSCN received Data: x220 x0 x4 x1
2019-06-25T14:18:37.921Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5710: 1:(0):5973 RSCN received event x0 : Address format x00 : DID x020001
2019-06-25T14:18:37.924Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5700: 1:(0):0214 RSCN received Data: x268 x1 x4 x1
2019-06-25T14:18:37.924Z cpu9:66188)lpfc: lpfc_els_rcv_rscn:5710: 1:(0):5973 RSCN received event x0 : Address format x00 : DID x020021
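
On ESXi 6.x the FC event history and current port state can also be queried directly; a sketch:

# FC event history (RSCN, link up/down) for the host's FC HBAs
esxcli storage san fc events get

# Current port state, WWNN/WWPN and link speed per adapter
esxcli storage san fc list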

 

  • We see PLOGI failure messages from the driver: the login to the device is not completing, so the host fails to establish a connection with the target. As a result, the paths fail to come back after the reboot.
  • "Devloss timeout" is a mechanism in the lpfc driver: from the first observation that a target may be lost, the driver waits about 10 seconds before confirming the loss. The following log entries therefore mean that lpfc has confirmed the target is lost.

 

 

2019-06-25T14:18:39.628Z cpu30:66359)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0xf1 (0x4395904373c0, 65591) to dev “naa.6006016030a03a000bb9615cda52abcd” on path “vmhba64:C0:T3:L2” Failed: H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:NONE
2019-06-25T14:18:39.628Z cpu30:66359)ScsiDeviceIO: 2927: Cmd(0x43958dfcfa80) 0xfe, CmdSN 0x8a0514 from world 65591 to dev “naa.6006016030a03a000bb9615cda52abcd” failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

2019-06-25T14:18:47.922Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:247: 1:(0):0203 Devloss timeout on WWPN 50:06:01:6a:48:60:0c:23 NPort x020001 Data: x8 x8 x0
2019-06-25T14:18:47.922Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:288: 1:(0):3298 ScsiNotifyPathStateChangeAsyncSAdapter Num x3 TID x3, DID x020001.
2019-06-25T14:18:47.926Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:247: 1:(0):0203 Devloss timeout on WWPN 50:06:01:62:48:60:0c:23 NPort x020021 Data: x8 x8 x0
2019-06-25T14:18:47.926Z cpu12:66188)WARNING: lpfc: lpfc_dev_loss_tmo_handler:288: 1:(0):3298 ScsiNotifyPathStateChangeAsyncSAdapter Num x3 TID x2, DID x020021.
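
The roughly 10-second window is governed by an lpfc module parameter (lpfc_devloss_tmo on the lpfc builds I am aware of; confirm the exact name for this driver version); a sketch for inspecting it:

# Show configured lpfc module parameters
esxcli system module parameters list -m lpfc | grep -i devloss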

 

  • After the 'devloss' timer expires, the WWPNs are removed from available use […] and the RSCN times out.

 

2019-06-25T14:19:10.922Z cpu21:66188)WARNING: lpfc: lpfc_disc_timeout_handler:5654: 1:(0):0231 RSCN timeout Data: x0 x3
2019-06-25T14:19:37.403Z cpu8:66215)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.6006016030a03a000bb9615cda52abcd repeated 137 times

 

2019-06-25T14:20:07.403Z cpu5:3680684)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1101: Could not select path for device “naa.600601600db03c00e277704e7dfeabcd“.
2019-06-25T14:20:07.403Z cpu10:3693432)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1101: Could not select path for device “naa.600601600db03c00d5e3b3ed77abcd11“.

2019-06-25T14:21:33.566Z cpu27:66357)ScsiDeviceIO: 2948: Cmd(0x439d89f4a580) 0x1a, CmdSN 0x2d0b06 from world 0 to dev “naa.6006bf1d58d632202095c60507cfc809” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 1, path is dead, IO errors: busy 0, retry 3682, no_connect 145143, fcperr 33659, tmo 0

2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 0, path is dead, IO errors: busy 0, retry 0, no_connect 144669, fcperr 33651, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 3, path is dead, IO errors: busy 0, retry 288, no_connect 14, fcperr 11, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0) Compression log for fcp target 2, path is dead, IO errors: busy 0, retry 312, no_connect 14, fcperr 4, tmo 0
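
The psp_rrSelectPathToActivate warnings above mean the Round Robin path selection policy had no working path left to schedule I/O on. The PSP and working-path set for a device can be checked as follows (a sketch; the device ID is taken from the warnings above):

esxcli storage nmp device list -d naa.600601600db03c00e277704e7dfeabcd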

 

 

 

  • Here we can see that, immediately after a failed command, NMP requests a fast device probe (a TEST_UNIT_READY-based path state check) on the local device:

 

2019-06-26T17:44:28.678Z cpu31:1637010)WARNING: lsi_mr3: map_cmd_status:1640: mfi: I/O returned error, cmd = 0x439d87c91180[0x4d] status = 0x2d, ext_status = 0x8.
2019-06-26T17:44:28.678Z cpu31:66357)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x4d (0x439d87c91180, 0) to dev “naa.5002cc8d9f1aabcd” on path “vmhba1:C0:T8:L0” Failed: H:0x7 D:0x8 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2019-06-26T17:44:28.678Z cpu31:66357)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.5002cc8d9f1aabcd” state in doubt; requested fast path state update…
2019-06-26T17:44:28.678Z cpu31:66357)ScsiDeviceIO: 2927: Cmd(0x439d87c91180) 0x4d, CmdSN 0x2d3d90 from world 0 to dev “naa.5002cc8d9f1aabcd” failed H:0x7 D:0x8 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2019-06-26T17:44:28.695Z cpu31:66357)WARNING: lsi_mr3: map_cmd_status:1640: mfi: I/O returned error, cmd = 0x439d87c91180[0x12] status = 0x2d, ext_status = 0x8.
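
In these entries H:0x7 is a host-side internal error and D:0x8 is a SCSI BUSY device status (see VMware KB 1029039 for the code tables); note that this device sits behind the local lsi_mr3 RAID controller (vmhba1), not the FC fabric. Its state can be checked with a sketch like:

# Device state (on/dead/off), queueing and flags
esxcli storage core device list -d naa.5002cc8d9f1aabcd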

 

 

 

Host Name for Active Path: esx-HPAPVEABCD01.au.abcd.com

VMware ESXi 6.0.0 build-7967664

 

Host Bus Adapter (HBA, hardware only, lspci)

 

vmhba  | driver     | driver version                | VID  | DID  | SVID | SDID | model
vmhba2 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba3 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba6 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba7 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
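
For the qlnativefc native driver, detailed per-adapter state and statistics (the kind of counters shown in the next table) can be dumped with the vmkmgmt_keyval utility shipped with ESXi; a sketch:

# Dump all driver key-value pairs, including qlnativefc adapter and port statistics
/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a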

 

 

HBA command statistics

 

 

Adapter | Driver     | Good CMDS | Failed CMDS | Failed % | MB read     | Failed MB read | Failed % | MB written  | Failed MB written | Failed % | Res. ops | Failed res. ops | Res. conflicts
vmhba2  | qlnativefc | 55539365  | 18          | 0.00%    | 1189230.604 | 0              | 0.00%    | 878067.813  | 0                 | 0.00%    | 0        | 0               | 0
vmhba3  | qlnativefc | 0         | 0           | -        | 0           | 0              | -        | 0           | 0                 | -        | 0        | 0               | 0
vmhba5  | lsi_mr3    | 216167266 | 60496       | 0.00%    | 5139163.045 | 0              | 0.00%    | 1989061.516 | 0                 | 0.00%    | 2502370  | 0               | 0
vmhba6  | qlnativefc | 55577208  | 4           | 0.00%    | 1189488.888 | 0              | 0.00%    | 878717.496  | 0                 | 0.00%    | 0        | 0               | 0
vmhba7  | qlnativefc | 0         | 0           | -        | 0           | 0              | -        | 0           | 0                 | -        | 0        | 0               | 0

 

Datastore Name                       | naa / t10 / eui ID                   | Volume UUID                         | Commands  | Blocks Read       | Blocks Written | LUN ID | Medium | Number of Paths | Vendor
Unity_VDG-AUS-UCE_LUN00              | naa.6006016030a03a007181485c5238abcd | 5c49ed89-ec633db0-0ae4-e4434b04c344 | 109925180 | 487163874200.00%  | 3597897593     | LUN:0  | fc     | 4               | DGC Fibre Channel
zDoNotUse_HPAPVEABCD01_Local_Storage | naa.6d0946606f185500232c124b8676abcd | 5b9a60e5-734a83de-7028-e4434b04c344 | 181284200 | 1020851568000.00% | 4069593297     | LUN:0  | sas    | 1               | DELL Serial Attached

 

VOBD Logs:

 

  • The VOBD logs are missing from this log bundle, so we are unable to determine the device state from them.

 

VMkernel Logs:

 

  • From the vmkernel logs we can see SCSI device I/O errors:

 

2019-06-25T14:03:39.548Z cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d2f from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-06-25T14:03:39.593Z cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d32 from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-06-25T14:03:39.615Z cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d35 from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2019-06-25T14:08:39.557Z cpu47:33547)ScsiDeviceIO: 2652: Cmd(0x43bdd3286b80) 0x35, CmdSN 0x6e3d85 from world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
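
Sense data 0x5 0x20 0x0 decodes to ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE (command 0x35 is SYNCHRONIZE CACHE), i.e. the device rejects the opcode rather than failing on the fabric. A sketch for identifying what eui.0050430000000000 actually is:

esxcli storage core device list -d eui.0050430000000000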

  • At 14:18:38, just before the devloss timeouts were logged on the dead-path host, the qlnativefc driver on this host reported the same two array target ports (WWPNs 5006016a48600c23 and 5006016248600c23) coming ONLINE, which is consistent with a port event on the storage controller rather than a problem local to a single host:

2019-06-25T14:18:38.582Z cpu26:33549)qlnativefc: vmhba6(87:0.0): Created work queue qlnativefc256t10wq
2019-06-25T14:18:38.586Z cpu26:33549)qlnativefc: vmhba6(87:0.0): fcport 5006016a48600c23 (targetId = 0) ONLINE
2019-06-25T14:18:38.587Z cpu26:33549)qlnativefc: vmhba6(87:0.0): fcport 5006016248600c23 (targetId = 1) ONLINE

 

  • We can also see that a rescan of the FC link was issued for each HBA, which would re-probe the paths and could bring the storage back online.

 

2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba2(3b:0.0): Issuing a Rescan FC Link
2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba3(3b:0.1): Issuing a Rescan FC Link
2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba6(87:0.0): Issuing a Rescan FC Link
2019-06-26T17:30:10.084Z cpu54:15071195)qlnativefc: vmhba7(87:0.1): Issuing a Rescan FC Link
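
The equivalent rescan can also be issued manually when paths need to be re-probed; a sketch:

# Rescan all storage adapters (use -A vmhbaN for a single adapter)
esxcli storage core adapter rescan --all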

 

 

Conclusion:

 

  • Based on the analysis, we were not able to find any VMkernel-side issues. However, we can see PLOGI failure messages from the driver: the login to the device is not completing, so the host fails to establish a connection with the target.
  • As a result, the paths fail to come back after the reboot.
  • The "devloss timeout" mechanism in the lpfc driver then confirmed the target loss after its roughly 10-second window.
  • We also see the warning: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.5002cc8d9f1aabcd" state in doubt; requested fast path state update…

These messages appear when the Host Bus Adapter (HBA) driver aborts a command because it took longer than the timeout period of 5 seconds to complete. An operation can exceed the timeout for several reasons, including:

  • Array backup operations (LUN backup, replication, etc.)
  • General overload on the array
  • Read/write cache on the array (misconfiguration, lack of cache, etc.)
  • Incorrect storage tier used (e.g., SATA where SAS/SCSI is needed)
  • Fabric issues (bad ISL, outdated firmware, bad fabric cable/GBIC)

For more information, please refer to: https://kb.vmware.com/s/article/1022026
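
Assuming the default log location, the frequency of these aborts on a host can be gauged with a quick sketch like:

# How often the host declared a device state in doubt
grep -c "state in doubt" /var/log/vmkernel.log

# Most recent throttled command failures
grep nmp_ThrottleLogForDevice /var/log/vmkernel.log | tail -20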

 

Action Plan

 

  • The driver version in use is lpfc 11.1.0.6-1vmw which, per the VMware HCL, is the VMware inbox (native) driver. We recommend checking with your hardware vendor and moving to the latest supported version of the HBA driver and firmware; the commands after this list can be used to confirm the installed version.

 

  • Check whether there are any issues or events on the intermediary devices (fabric switches, ISLs) between the storage array and the ESXi host.
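
A sketch for confirming the installed lpfc driver version before and after the upgrade (VIB names can vary between inbox and vendor async packages):

# Installed driver VIB and version
esxcli software vib list | grep -i lpfc

# Version of the currently loaded module
esxcli system module get -m lpfc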

 

