Time: 25th June 2019, 8:00 AM MST to 10:30 AM MST.
Time in GMT: 25th June 2019, 2:00 PM to 4:30 PM.
Host Name for Dead Path: esx-auswsyd-esx14.au.abcd.com
VMware ESXi 6.5.0 build-7388607
Cisco Systems Inc
UCSC-C240-M4SX
- Storage HBA Information:
vmhba  | driver   | driver version                   | VID  | DID  | SVID | SDID | model
vmhba0 | vmw_ahci | 1.0.0-39vmw.650.1.26.5969303     | 8086 | 8d02 | 1137 | 67   | Intel Corporation Wellsburg AHCI Controller
vmhba1 | lsi_mr3  | 6.910.18.00-1vmw.650.0.0.4564106 | 1000 | 005d | 1137 | 00db | Avago (LSI) MegaRAID SAS Invader Controller
vmhba2 | lpfc     | 11.1.0.6-1vmw.650.0.0.4564106    | 10df | f100 | 10df | f100 | Emulex Corporation Emulex LPe12000 8Gb PCIe Fibre Channel Adapter
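For reference, this adapter inventory can be reproduced from the ESXi shell; a minimal sketch using standard commands (output formatting may differ by build):

# List HBAs with their driver, link state and description
esxcli storage core adapter list
# Alternative listing that includes the driver and UID for each adapter
esxcfg-scsidevs -a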
- Storage Topology:
Topology                           | Number of paths
FC (Fibre Channel)                 | 12
FCoE (Fibre Channel over Ethernet) | 0
iSCSI                              | 0
NFS (mounted targets)              | 0
Local (mpx devices)                | 2
vSAN (datastores)                  | 0
Total (mpath, no NFS or vSAN)      | 14
Dead                               | 0
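The per-protocol path counts above can be cross-checked live on the host; a minimal sketch:

# List every storage path with its transport, target and state
esxcli storage core path list
# Summarize path states (active/dead) across all devices
esxcli storage core path list | grep "State:" | sort | uniq -c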
Vendor | Model         | Revision
Cisco  | UCS-C240-M4   | 0a28
Cisco  | UCSC-MRAID12G | 4.62
DGC    | VRAID         | 4203
VOBD Logs:
- From the VOBD logs we can see that the path state changed to Dead at around 14:18:47 (a command to extract these events from vobd.log is shown after the excerpt):
2019-06-25T14:18:47.923Z: [scsiCorrelator] 20977718305959us:
[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L0 changed state from
on
2019-06-25T14:18:47.925Z: [scsiCorrelator]
20977997297565us: [esx.problem.storage.redundancy.degraded] Path redundancy to
storage device naa.6006016030a03a000ab9615cf45a6d92 degraded. Path
vmhba64:C0:T3:L0 is down. Affected datastores:
“AMX_SYD_EUC_VPT_Share_LUN01”.
2019-06-25T14:18:47.927Z: [scsiCorrelator] 20977718309789us:
[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L1 changed state from
on
2019-06-25T14:18:47.928Z: [scsiCorrelator] 20977997300432us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.6006016030a03a000ab9615cf61f8965 degraded. Path vmhba64:C0:T2:L1 is down.
Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN02”.
2019-06-25T14:18:47.928Z: [scsiCorrelator] 20977718309793us:
[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L1 changed state from
on
2019-06-25T14:18:47.929Z: [scsiCorrelator] 20977997301536us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.6006016030a03a000ab9615cf61f8965 degraded. Path vmhba64:C0:T3:L1 is down.
Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN02”.
2019-06-25T14:18:47.929Z: [scsiCorrelator] 20977718310185us:
[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L2 changed state from
on
2019-06-25T14:18:47.930Z: [scsiCorrelator] 20977997302580us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.6006016030a03a000bb9615cda52abcd degraded. Path vmhba64:C0:T2:L2 is
down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN03”.
2019-06-25T14:18:47.930Z: [scsiCorrelator] 20977718310188us:
[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T3:L2 changed state from
on
2019-06-25T14:18:47.931Z: [scsiCorrelator] 20977997303608us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.6006016030a03a000bb9615cda52abcd degraded. Path vmhba64:C0:T3:L2 is
down. Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN03”.
2019-06-25T14:18:49.929Z: [scsiCorrelator] 20977720311952us:
[vob.scsi.scsipath.pathstate.dead] scsiPath vmhba64:C0:T2:L0 changed state from
on
2019-06-25T14:18:49.931Z: [scsiCorrelator] 20977999302845us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.6006016030a03a000ab9615cf45a6d92 degraded. Path vmhba64:C0:T2:L0 is down.
Affected datastores: “AMX_SYD_EUC_VPT_Share_LUN01”.
2019-06-25T17:51:50.725Z: [scsiCorrelator]
20990500955210us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L0
changed state from dead
2019-06-25T17:51:50.726Z: [scsiCorrelator] 20990780098211us:
[esx.clear.storage.redundancy.restored] Path redundancy to storage device
naa.6006016030a03a000ab9615cf45a6d92 (Datastores:
“AMX_SYD_EUC_VPT_Share_LUN01”) restored. Path vmhba64:C0:T2:L0 is
active again.
2019-06-25T17:51:50.726Z: [scsiCorrelator] 20990500955216us:
[vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L1 changed state from
dead
2019-06-25T17:51:50.727Z:
[scsiCorrelator] 20990780099265us: [esx.clear.storage.redundancy.restored] Path
redundancy to storage device naa.6006016030a03a000ab9615cf61f8965 (Datastores:
“AMX_SYD_EUC_VPT_Share_LUN02”) restored. Path vmhba64:C0:T2:L1 is
active again.
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990500955838us:
[vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T3:L0 changed state from
dead
2019-06-25T17:51:50.727Z: [scsiCorrelator] 20990500956170us:
[vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T3:L1 changed state from
dead
2019-06-25T17:51:50.729Z: [scsiCorrelator] 20990500959496us:
[vob.scsi.scsipath.pathstate.on] scsiPath vmhba64:C0:T2:L2 changed state from
dead
2019-06-25T17:51:50.730Z:
[scsiCorrelator] 20990780102162us: [esx.clear.storage.redundancy.restored] Path
redundancy to storage device naa.6006016030a03a000bb9615cda52abcd
(Datastores: “AMX_SYD_EUC_VPT_Share_LUN03”) restored. Path
vmhba64:C0:T2:L2 is active again.
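The dead/restored transitions quoted above can be pulled straight from the VOBD log; a minimal sketch, assuming the default log location:

# Extract all SCSI path state change events (dead and on) from vobd.log
grep "scsipath.pathstate" /var/log/vobd.log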
VMkernel Logs:
- The ESXi host first logs the failover by reporting the RSCN that would have been triggered when the storage controller TPGs (target port groups) were disabled. For example, RSCNs logged by the lpfc driver:
2019-06-25T14:17:04.697Z cpu12:66184)lpfc: lpfc_rportStats:4087: 1:(0)
Compression log for fcp target 1, path is dead, IO errors: busy 0, retry 3682,
no_connect 145139, fcperr 33659, tmo 0
2019-06-25T14:17:04.697Z cpu12:66184)lpfc:
lpfc_rportStats:4087: 1:(0) Compression log for fcp target 0, path is dead, IO
errors: busy 0, retry 0, no_connect 144665, fcperr 33651, tmo 0
2019-06-25T14:18:37.921Z
cpu9:66188)lpfc: lpfc_els_rcv_rscn:5700: 1:(0):0214 RSCN received Data: x220 x0
x4 x1
2019-06-25T14:18:37.921Z
cpu9:66188)lpfc: lpfc_els_rcv_rscn:5710: 1:(0):5973 RSCN received event x0 :
Address format x00 : DID x020001
2019-06-25T14:18:37.924Z
cpu9:66188)lpfc: lpfc_els_rcv_rscn:5700: 1:(0):0214 RSCN received Data: x268 x1
x4 x1
2019-06-25T14:18:37.924Z
cpu9:66188)lpfc: lpfc_els_rcv_rscn:5710: 1:(0):5973 RSCN received event x0 :
Address format x00 : DID x020021
- We see PLOGI failure messages from the driver: the fabric login to the device is not completing, so the host fails to establish a connection with the target. As a result, the paths do not come back after the reboot; these messages can be located as shown below.
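A quick way to locate those PLOGI failures on the host is a simple search of the vmkernel log; a minimal sketch, assuming the default log location:

# Search the current vmkernel log for lpfc PLOGI errors
grep -i "plogi" /var/log/vmkernel.log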
- “Devloss Timeout” is a mechanism in the lpfc driver: from the moment a target is first observed as lost, the driver waits about 10 seconds before confirming the target is lost. The following log entries therefore show the lpfc driver confirming the target loss (a command to verify the configured timeout follows the log excerpt):
2019-06-25T14:18:39.628Z cpu30:66359)NMP: nmp_ThrottleLogForDevice:3647:
Cmd 0xf1 (0x4395904373c0, 65591) to dev “naa.6006016030a03a000bb9615cda52abcd”
on path “vmhba64:C0:T3:L2” Failed: H:0xc D:0x0 P:0x0 Invalid sense
data: 0x0 0x0 0x0. Act:NONE
2019-06-25T14:18:39.628Z cpu30:66359)ScsiDeviceIO: 2927: Cmd(0x43958dfcfa80)
0xfe, CmdSN 0x8a0514 from world 65591 to dev “naa.6006016030a03a000bb9615cda52abcd”
failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2019-06-25T14:18:47.922Z cpu12:66188)WARNING: lpfc:
lpfc_dev_loss_tmo_handler:247: 1:(0):0203 Devloss timeout on WWPN
50:06:01:6a:48:60:0c:23 NPort x020001 Data: x8 x8 x0
2019-06-25T14:18:47.922Z cpu12:66188)WARNING: lpfc:
lpfc_dev_loss_tmo_handler:288: 1:(0):3298
ScsiNotifyPathStateChangeAsyncSAdapter Num x3 TID x3, DID x020001.
2019-06-25T14:18:47.926Z cpu12:66188)WARNING: lpfc:
lpfc_dev_loss_tmo_handler:247: 1:(0):0203 Devloss timeout on WWPN
50:06:01:62:48:60:0c:23 NPort x020021 Data: x8 x8 x0
2019-06-25T14:18:47.926Z cpu12:66188)WARNING:
lpfc: lpfc_dev_loss_tmo_handler:288: 1:(0):3298
ScsiNotifyPathStateChangeAsyncSAdapter Num x3 TID x2, DID x020021.
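The length of the devloss window can be verified on the host; a minimal sketch, assuming the lpfc driver exposes the usual lpfc_devloss_tmo module parameter (default is around 10 seconds):

# Show the lpfc module parameters and filter for the devloss timeout
esxcli system module parameters list -m lpfc | grep -i devloss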
- After the ‘devloss’ timer expires, the WWPNs are removed from available use […], and the RSCN subsequently timed out:
2019-06-25T14:19:10.922Z cpu21:66188)WARNING: lpfc:
lpfc_disc_timeout_handler:5654: 1:(0):0231 RSCN timeout Data: x0 x3
2019-06-25T14:19:37.403Z
cpu8:66215)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from
device naa.6006016030a03a000bb9615cda52abcd repeated 137 times
2019-06-25T14:20:07.403Z cpu5:3680684)WARNING: vmw_psp_rr:
psp_rrSelectPathToActivate:1101: Could not select path for device “naa.600601600db03c00e277704e7dfeabcd“.
2019-06-25T14:20:07.403Z cpu10:3693432)WARNING: vmw_psp_rr:
psp_rrSelectPathToActivate:1101: Could not select path for device “naa.600601600db03c00d5e3b3ed77abcd11“.
2019-06-25T14:21:33.566Z cpu27:66357)ScsiDeviceIO: 2948: Cmd(0x439d89f4a580)
0x1a, CmdSN 0x2d0b06 from world 0 to dev
“naa.6006bf1d58d632202095c60507cfc809” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x5 0x24 0x0.
2019-06-25T14:22:09.780Z cpu3:66184)lpfc: lpfc_rportStats:4087: 1:(0)
Compression log for fcp target 1, path is dead, IO errors: busy 0, retry 3682,
no_connect 145143, fcperr 33659, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc:
lpfc_rportStats:4087: 1:(0) Compression log for fcp target 0, path is dead, IO
errors: busy 0, retry 0, no_connect 144669, fcperr 33651, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc:
lpfc_rportStats:4087: 1:(0) Compression log for fcp target 3, path is dead, IO
errors: busy 0, retry 288, no_connect 14, fcperr 11, tmo 0
2019-06-25T14:22:09.780Z cpu3:66184)lpfc:
lpfc_rportStats:4087: 1:(0) Compression log for fcp target 2, path is dead, IO
errors: busy 0, retry 312, no_connect 14, fcperr 4, tmo 0
- Here we can see that a TEST_UNIT_READY probe is initiated immediately in response to a failed command, as NMP requests a fast device probe to re-evaluate the path state:
2019-06-26T17:44:28.678Z cpu31:1637010)WARNING: lsi_mr3:
map_cmd_status:1640: mfi: I/O returned error, cmd = 0x439d87c91180[0x4d] status
= 0x2d, ext_status = 0x8.
2019-06-26T17:44:28.678Z cpu31:66357)NMP:
nmp_ThrottleLogForDevice:3647: Cmd 0x4d (0x439d87c91180, 0) to dev “naa.5002cc8d9f1aabcd”
on path “vmhba1:C0:T8:L0” Failed: H:0x7 D:0x8 P:0x0 Invalid sense
data: 0x0 0x0 0x0. Act:EVAL
2019-06-26T17:44:28.678Z cpu31:66357)WARNING:
NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.5002cc8d9f1aabcd”
state in doubt; requested fast path state update…
2019-06-26T17:44:28.678Z
cpu31:66357)ScsiDeviceIO: 2927: Cmd(0x439d87c91180) 0x4d, CmdSN 0x2d3d90 from
world 0 to dev “naa.5002cc8d9f1aabcd” failed H:0x7 D:0x8 P:0x0
Invalid sense data: 0x0 0x0 0x0.
2019-06-26T17:44:28.695Z
cpu31:66357)WARNING: lsi_mr3: map_cmd_status:1640: mfi: I/O returned error, cmd
= 0x439d87c91180[0x12] status = 0x2d, ext_status = 0x8.
Host Name for Active Path: esx-HPAPVEABCD01.au.abcd.com
VMware ESXi 6.0.0 build-7967664
Host Bus Adapter (HBA, hardware only, lspci):
vmhba  | driver     | driver version                | VID  | DID  | SVID | SDID | model
vmhba2 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba3 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba6 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
vmhba7 | qlnativefc | 2.1.57.0-1OEM.600.0.0.2768847 | 1077 | 2261 | 1077 | 02a8 | QLogic Corp QLE2692 Dual Port 16Gb Fibre Channel to PCIe Adapter
HBA command statistics:
Adapter | Driver     | Good CMDS | Failed CMDS | Failed % | MB read     | Failed MB read | Failed % | MB written  | Failed MB written | Failed % | Res. ops | Failed res. ops | Res. conflicts
vmhba2  | qlnativefc | 55539365  | 18          | 0.00%    | 1189230.604 | 0              | 0.00%    | 878067.813  | 0                 | 0.00%    | 0        | 0               | 0
vmhba3  | qlnativefc | 0         | 0           | –        | 0           | 0              | –        | 0           | 0                 | –        | 0        | 0               | 0
vmhba5  | lsi_mr3    | 216167266 | 60496       | 0.00%    | 5139163.045 | 0              | 0.00%    | 1989061.516 | 0                 | 0.00%    | 2502370  | 0               | 0
vmhba6  | qlnativefc | 55577208  | 4           | 0.00%    | 1189488.888 | 0              | 0.00%    | 878717.496  | 0                 | 0.00%    | 0        | 0               | 0
vmhba7  | qlnativefc | 0         | 0           | –        | 0           | 0              | –        | 0           | 0                 | –        | 0        | 0               | 0
Datastore Name                       | naa / t10 / eui ID                   | Volume UUID                         | Commands  | Blocks Read       | Blocks Written | LUN ID | Medium | Number Path | Vendor
Unity_VDG-AUS-UCE_LUN00              | naa.6006016030a03a007181485c5238abcd | 5c49ed89-ec633db0-0ae4-e4434b04c344 | 109925180 | 487163874200.00%  | 3597897593     | LUN:0  | fc     | 4           | DGC Fibre Channel
zDoNotUse_HPAPVEABCD01_Local_Storage | naa.6d0946606f185500232c124b8676abcd | 5b9a60e5-734a83de-7028-e4434b04c344 | 181284200 | 1020851568000.00% | 4069593297     | LUN:0  | sas    | 1           | DELL Serial Attached
VOBD Logs:
- VOBD logs are missing from this bundle, so we are not able to determine the device state history on this host.
VMkernel Logs:
- From the vmkernel logs we can see SCSI Device I/O errors:
2019-06-25T14:03:39.548Z cpu51:32898)ScsiDeviceIO: 2652:
Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d2f from world 0 to dev
“eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5
0x20 0x0.
2019-06-25T14:03:39.593Z
cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d32 from
world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x5 0x20 0x0.
2019-06-25T14:03:39.615Z
cpu51:32898)ScsiDeviceIO: 2652: Cmd(0x43bdd30cb500) 0x35, CmdSN 0x6e3d35 from
world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x5 0x20 0x0.
2019-06-25T14:08:39.557Z
cpu47:33547)ScsiDeviceIO: 2652: Cmd(0x43bdd3286b80) 0x35, CmdSN 0x6e3d85 from
world 0 to dev “eui.0050430000000000” failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x5 0x20 0x0.
2019-06-25T14:18:38.582Z cpu26:33549)qlnativefc: vmhba6(87:0.0): Created
work queue qlnativefc256t10wq
2019-06-25T14:18:38.586Z cpu26:33549)qlnativefc: vmhba6(87:0.0): fcport
5006016a48600c23 (targetId = 0) ONLINE
2019-06-25T14:18:38.587Z cpu26:33549)qlnativefc: vmhba6(87:0.0): fcport
5006016248600c23 (targetId = 1) ONLINE
- We can also see that a Rescan FC Link was issued for each HBA, which is likely what brought the storage paths back online (a manual rescan command is shown after the log excerpt):
2019-06-26T17:30:10.083Z cpu54:15071195)qlnativefc: vmhba2(3b:0.0):
Issuing a Rescan FC Link
2019-06-26T17:30:10.083Z
cpu54:15071195)qlnativefc: vmhba3(3b:0.1): Issuing a Rescan FC Link
2019-06-26T17:30:10.083Z
cpu54:15071195)qlnativefc: vmhba6(87:0.0): Issuing a Rescan FC Link
2019-06-26T17:30:10.084Z
cpu54:15071195)qlnativefc: vmhba7(87:0.1): Issuing a Rescan FC Link
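For reference, the same rescan can be triggered manually from the ESXi shell; a minimal sketch (the adapter name below is just an example from this host):

# Rescan a single FC adapter for device/path changes
esxcli storage core adapter rescan --adapter vmhba2
# Or rescan all adapters at once
esxcli storage core adapter rescan --all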
Conclusion:
- Based on the analysis, we were not able to find any VMkernel-side issues; however, we can see PLOGI failure messages from the lpfc driver, which indicate that the fabric login to the device is not completing and the host fails to establish a connection with the target.
- Therefore, the paths fail to come back after the reboot.
- The “Devloss Timeout” mechanism in the lpfc driver then confirmed the target loss once its timeout window expired.
- As per the warning message: WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.5002cc8d9f1aabcd” state in doubt; requested fast path state update…
These messages appear when the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can exceed the timeout period for several reasons, including:
- Array backup operations (LUN backup, replication, etc.)
- General overload on the array
- Read/Write cache on the array (misconfiguration, lack of cache, etc.)
- Incorrect tiered storage used (SATA over SCSI)
- Fabric issues (bad ISL, outdated firmware, bad fabric cable/GBIC)
For more information, please refer to: https://kb.vmware.com/s/article/1022026
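A quick, non-invasive way to gauge how widespread these aborts were on the affected host; a minimal sketch, assuming the default log location:

# Count 'state in doubt' events and list devloss timeouts in the vmkernel log
grep -c "state in doubt" /var/log/vmkernel.log
grep -i "devloss timeout" /var/log/vmkernel.log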
Action Plan
- The driver in use is lpfc version 11.1.0.6-1vmw, which, as per the VMware HCL, is the VMware inbox (native) driver. We recommend checking with your hardware vendor and moving to the latest supported HBA driver and firmware version.
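As a sketch, the currently installed lpfc driver version can be confirmed from the ESXi shell before and after the upgrade:

# Show the installed lpfc driver VIB and its version
esxcli software vib list | grep -i lpfc
# Show the version reported by the loaded module
vmkload_mod -s lpfc | grep -i version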
- Check whether any issues or events can be found on the intermediary devices (e.g., fabric switches) between the storage array and the ESXi host.