RCA 20: Virtual Machine Networking issues on ESXi Host

Hostname: abcesxpap41

VMware ESXi 6.0.0 build-9313334 ESXi 6.0 EP 15

Time: August 27, 2019, at 08:06 am to 08:09 am HKT

Time in GMT: 12:06 AM – 12:09 AM GMT

 

 

vmhba

driver

driver version

VID

DID

SVID

SDID

model

vmhba0

hpsa

6.0.0.124-1OEM

103c

323b

103c

3355

Hewlett Packard Enterprise Smart Array P220i

vmhba1

qlnativefc

2.1.81.0-1OEM.600.0.0.2768847

1077

2532

103c

338e

QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA

vmhba2

qlnativefc

2.1.81.0-1OEM.600.0.0.2768847

1077

2532

103c

338e

QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA

vmhba3

qlnativefc

2.1.81.0-1OEM.600.0.0.2768847

1077

2532

103c

338e

QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA

vmhba4

qlnativefc

2.1.81.0-1OEM.600.0.0.2768847

1077

2532

103c

338e

QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA

 

 

Adapter

Driver

Good CMDS

Failed CMDS

<– %

MB read

Failed MB read

<– %

MB written

Failed MB written

<– %

Res. ops

Failed res. ops

Res. conflicts

vmhba0

hpsa

13408315

14665

0.10%

169284.5

0

0.00%

44278.84

0

0.00%

48580

0

0

vmhba1

qlnativefc

90100562

1910108

2.10%

753961.6

0

0.00%

199789.5

0

0.00%

0

0

0

vmhba2

qlnativefc

90051489

8285333

8.40%

753292.2

0

0.00%

200414.9

0

0.00%

0

0

0

vmhba3

qlnativefc

90040104

27330240

23.30%

753592.8

0

0.00%

199704.4

0

0.00%

0

0

0

vmhba4

qlnativefc

90051269

20862765

18.80%

753207.9

0

0.00%

200076.3

0

0.00%

0

0

0

vmhba32

bnx2i

0

0

0

0

0

0

0

0

0

vmhba33

bnx2i

0

0

0

0

0

0

0

0

0

 

 

the

Rx

Tx

Rx

Rx

Tx

Tx

Rx

Tx

Rx

Tx

 

 

Rx

Rx

Rx

Rx

Rx

Rx

Tx

Tx

Rx

vmnic

Pkts

Pkts

Bytes

GB

Bytes

GB

Err

Err

Drop

Drop

M-cast

Coll

Len Err

Ovr Err

CRC Err

Frm Err

Fifo Err

Miss Err

Abt Err

Carr Err

Total Err

vmnic0

3460168368

1050184023

2.7997E+12

260740.00%

4.87E+12

4532

1026000.00%

0

0

0.00%

930551239

0

1161

0

0

0

9099

0

0

0

0.00%

vmnic1

6749656176

9275562324

6.72775E+11

62660.00%

6.12E+12

5697.3

929600.00%

0

0

0.00%

906003972

0

1161

0

0

0

8135

0

0

0

0.00%

 

VMKernel Logs:

 

  • From the vmkernel logs we can see that the Heartbeat of the Datastore Timed out.

2019-08-27T00:06:06.100Z cpu25:33495)IntrCookie: 3429: Interrupt received on invalid vector (cpu 25, vector 76); ignoring it.
2019-08-27T00:06:17.041Z cpu29:108949)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_1’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:06:17.041Z cpu29:108949)  [HB state abcdef02 offset 3670016 gen 33 stampUS 1696332028433 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 1656800> drv 14.61 lockImpl 4]
2019-08-27T00:06:17.041Z cpu28:108954)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_2’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:06:17.041Z cpu28:108954)  [HB state abcdef02 offset 3670016 gen 3 stampUS 1696332028488 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 832400> drv 14.61 lockImpl 4]
2019-08-27T00:06:25.909Z cpu2:1715672)ScsiCore: 62: Starting taskmgmt handler world 1715672/3
2019-08-27T00:06:27.907Z cpu17:1715673)ScsiCore: 62: Starting taskmgmt handler world 1715673/4
2019-08-27T00:06:40.086Z cpu24:32995)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_1’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:06:40.086Z cpu24:32995)  [HB state abcdef02 offset 3670016 gen 33 stampUS 1696332028433 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 1656800> drv 14.61 lockImpl 4]

  • QLogic HBA throwing some warning which used to come when the HBA goes Unresponsive:

 

2019-08-27T00:06:45.715Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Timeout: Max mbx wait reached.
2019-08-27T00:06:45.715Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Mailbox command 0x54 timeout occurred. Issuing ISP abort.
2019-08-27T00:06:45.715Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Firmware has been previously dumped (0x4308f69a5000) — ignoring request…
2019-08-27T00:06:45.715Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): Inside qlnativefcAbortIsp

 

 

  • Here we can see a reset being initiated on the HBA:

 

 

[SCSI Code]: Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

ASC/ASCQ:

00h / 00h

ASC Details:

NO ADDITIONAL SENSE INFORMATION

Status:

00h – GOOD

Description: This status code indicates that the device has completed the task successfully.

Hostebyte:

0x08 – DID_RESET – Reset by somebody.

Sensekey:

0h

Sensekey MSG:

NO SENSE

Sensekey Notes:

Indicates that there is no specific sense key information to be reported. This may occur for a successful command or for a command that receives CHECK CONDITION status because one of the FILEMARK. EOM. or ILI bits is set to one.

Plugin notes:

VMK_SCSI_PLUGIN_GOOD – No error.

 

 

2019-08-27T00:06:45.715Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): Performing ISP error recovery – ha= 0x4308fe061080.
2019-08-27T00:06:45.715Z cpu7:33494)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0x2a (0x43b61a818800, 197862) to dev “naa.60000970000297500084533030391111” on path “vmhba2:C0:T0:L20” Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-08-27T00:06:45.715Z cpu7:33494)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.60000970000297500084533030391111” state in doubt; requested fast path state update…
2019-08-27T00:06:45.715Z cpu7:33494)ScsiDeviceIO: 2595: Cmd(0x43b61a818800) 0x2a, CmdSN 0x800e002d from world 197862 to dev “naa.60000970000297500084533030391111” failed H:0x8 D:0x0 P:0x0
2019-08-27T00:06:45.715Z cpu25:33495)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0x28 (0x43be0f423a00, 1715649) to dev “naa.60000970000297500084533030422222” on path “vmhba2:C0:T0:L2” Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-08-27T00:06:45.715Z cpu25:33495)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.60000970000297500084533030422222” state in doubt; requested fast path state update…
2019-08-27T00:06:45.715Z cpu7:33494)ScsiDeviceIO: 2595: Cmd(0x43b6198012c0) 0x2a, CmdSN 0x800e007c from world 197862 to dev “naa.60000970000297500084533030391111” failed H:0x8 D:0x0 P:0x0
2019-08-27T00:06:45.715Z cpu25:33495)ScsiDeviceIO: 2613: Cmd(0x43be0f423a00) 0x28, CmdSN 0x15a5c from world 1715649 to dev “naa.60000970000297500084533030422222” failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2019-08-27T00:06:45.715Z cpu7:33494)ScsiDeviceIO: 2595: Cmd(0x43b600fdc3c0) 0x2a, CmdSN 0x800e0079 from world 197862 to dev “naa.60000970000297500084533030391111” failed H:0x8 D:0x0 P:0x0
2019-08-27T00:06:45.715Z cpu7:33494)NMP: nmp_ThrottleLogForDevice:3235: last error status from device naa.60000970000297500084533030391111 repeated 10 times
2019-08-27T00:06:45.715Z cpu7:33494)ScsiDeviceIO: 2613: Cmd(0x43b61a81d180) 0x28, CmdSN 0x15a5b from world 1715654 to dev “naa.60000970000297500084533030422222” failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2019-08-27T00:06:45.715Z cpu7:33494)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0x2a (0x43b600ee4e40, 32806) to dev “naa.60000970000297500084533030393333” on path “vmhba2:C0:T0:L15” Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-08-27T00:06:45.715Z cpu7:33494)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.60000970000297500084533030393333” state in doubt; requested fast path state update…
2019-08-27T00:06:45.724Z cpu14:33494)NMP: nmp_ThrottleLogForDevice:3235: last error status from device naa.60000970000297500084533030422222 repeated 20 times
2019-08-27T00:06:45.732Z cpu25:33495)NMP: nmp_ThrottleLogForDevice:3235: last error status from device naa.60000970000297500084533030422222 repeated 40 times
2019-08-27T00:06:45.746Z cpu14:33494)NMP: nmp_ThrottleLogForDevice:3235: last error status from device naa.60000970000297500084533030422222 repeated 80 times

 

  • Here we can see that the Timeout has finally reached where the command started to failed:

 

2019-08-27T00:06:55.744Z cpu16:34146)Fil3: 15978: Retry 10 for caller Fil3_FileIO (status ‘IO was aborted by VMFS via a virt-reset on the device’)
2019-08-27T00:06:55.909Z cpu19:32996)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_2’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:06:55.909Z cpu19:32996)  [HB state abcdef02 offset 3670016 gen 3 stampUS 1696332028488 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 832400> drv 14.61 lockImpl 4]
2019-08-27T00:07:03.105Z cpu18:937761)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_1’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:07:03.105Z cpu18:937761)  [HB state abcdef02 offset 3670016 gen 33 stampUS 1696332028433 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 1656800> drv 14.61 lockImpl 4]
2019-08-27T00:07:07.043Z cpu29:108949)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_1’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:07:07.043Z cpu29:108949)  [HB state abcdef02 offset 3670016 gen 33 stampUS 1696332028433 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 1656800> drv 14.61 lockImpl 4]
2019-08-27T00:07:07.043Z cpu28:108954)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_2’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:07:07.043Z cpu28:108954)  [HB state abcdef02 offset 3670016 gen 3 stampUS 1696332028488 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 832400> drv 14.61 lockImpl 4]
2019-08-27T00:07:15.716Z cpu30:1714943)WARNING: qlnativefc: vmhba2(3:0.1): Timeout: Max mbx wait reached.
2019-08-27T00:07:15.716Z cpu30:1714943)qlnativefc: vmhba2(3:0.1): qlnativefcMailboxCommand(1): **** FAILED
 mb[0]=54 mb[2]=aaa72019-08-27T00:07:15.716Z cpu30:1714943) cmd=54 ****
2019-08-27T00:07:15.716Z cpu30:1714943)WARNING: ScsiPath: 7174: Set retry timeout for failed TaskMgmt abort for CmdSN  0x1a7fe6, status Failure, path vmhba2:C0:T0:L15
2019-08-27T00:07:15.716Z cpu9:35647)qlnativefc: vmhba2(3:0.1): qlnativefcMailboxCommand(1): **** FAILED
 mb[0]=4001 mb[1]=4953 mb[2]=5020 mb[3]=25322019-08-27T00:07:15.716Z cpu9:35647) cmd=69 ****
2019-08-27T00:07:15.761Z cpu16:34146)Fil3: 15969: Max retries (20) exceeded for caller Fil3_FileIO (status ‘IO was aborted by VMFS via a virt-reset on the device’)
2019-08-27T00:07:15.761Z cpu16:34146)BC: 2440: Failed to write (uncached) object ‘.iormstats.sf’: Maximum kernel-level retries exceeded
2019-08-27T00:07:15.811Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Firmware dump previously allocated.
2019-08-27T00:07:15.847Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): Enabling PUREX.
2019-08-27T00:07:15.859Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): DPORT feature : disabled.
2019-08-27T00:07:15.859Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): FAWWN feature : disabled.
2019-08-27T00:07:16.433Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): LOOP UP detected (8 Gbps).
2019-08-27T00:07:16.817Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): fcport 50000975900150c6 (targetId = 0) ONLINE
2019-08-27T00:07:16.829Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): qlnativefcMailboxCommand(1): **** FAILED
 mb[0]=54 mb[2]=aaa72019-08-27T00:07:16.829Z cpu20:1714945) cmd=54 ****
2019-08-27T00:07:16.829Z cpu20:1714945)WARNING: ScsiPath: 7174: Set retry timeout for failed TaskMgmt abort for CmdSN  0x285068, status Failure, path vmhba2:C0:T0:L20
2019-08-27T00:07:17.717Z cpu20:1714945)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.60000970000297500084533030393333” state in doubt; requested fast path state update…
2019-08-27T00:07:17.717Z cpu20:1714945)ScsiDeviceIO: 2595: Cmd(0x43be0e5b0880) 0xfe, CmdSN 0x1a7fe6 from world 32806 to dev “naa.60000970000297500084533030393333” failed H:0x8 D:0x0 P:0x0
2019-08-27T00:07:17.764Z cpu12:32859)HBX: 283: ‘DS_Postpaid_Cluster_APPS_1’: HB at offset 3670016 – Reclaimed heartbeat [Timeout]:
2019-08-27T00:07:17.764Z cpu12:32859)  [HB state abcdef02 offset 3670016 gen 33 stampUS 1696402079932 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 1656800> drv 14.61 lockImpl 4]
2019-08-27T00:07:17.788Z cpu18:937761)HBX: 2851: ‘DS_Postpaid_Cluster_APPS_2’: HB at offset 3670016 – Waiting for timed out HB:
2019-08-27T00:07:17.788Z cpu18:937761)  [HB state abcdef02 offset 3670016 gen 3 stampUS 1696332028488 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 832400> drv 14.61 lockImpl 4]
2019-08-27T00:07:18.831Z cpu20:1714945)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.60000970000297500084533030391111” state in doubt; requested fast path state update…
2019-08-27T00:07:18.831Z cpu20:1714945)ScsiDeviceIO: 2595: Cmd(0x43be00ea8980) 0xfe, CmdSN 0x285068 from world 32806 to dev “naa.60000970000297500084533030391111” failed H:0x8 D:0x0 P:0x0
2019-08-27T00:07:18.870Z cpu15:32857)HBX: 283: ‘DS_Postpaid_Cluster_APPS_2’: HB at offset 3670016 – Reclaimed heartbeat [Timeout]:
2019-08-27T00:07:18.870Z cpu15:32857)  [HB state abcdef02 offset 3670016 gen 3 stampUS 1696403185720 uuid 5d4a91f0-5d376611-e106-3863bb4a1158 jrnl <FB 832400> drv 14.61 lockImpl 4]
2019-08-27T00:07:18.878Z cpu13:33491)NMP: nmp_ThrottleLogForDevice:3252: last error status from device naa.60000970000297500084533030391111 repeated 2 times
2019-08-27T00:07:18.878Z cpu13:33491)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0xf1 (0x43b613c78e40, 1715712) to dev “naa.60000970000297500084533030391111” on path “vmhba1:C0:T0:L20” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
2019-08-27T00:07:18.878Z cpu13:33491)ScsiDeviceIO: 2613: Cmd(0x43b6190e93c0) 0xfe, CmdSN 0x28508b from world 1715712 to dev “naa.60000970000297500084533030391111” failed H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0.

 

Hostd Logs:

 

  • On the Hostd logs we can see that the Path went down when there is a reset which is being initiated by the HBA, where we can see almost all the datastore were affected:

 

2019-08-27T00:06:54.909Z info hostd[74F9FB70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9713 : Path redundancy to storage device naa.60000970000297500084533030422222 degraded. Path vmhba2:C0:T0:L2 is down. Affected datastores: DS_Prepaid_Cluster6_APPS_1.
2019-08-27T00:06:54.910Z info hostd[74F9FB70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9714 : Path redundancy to storage device naa.60000970000297500084533030391111 degraded. Path vmhba2:C0:T0:L20 is down. Affected datastores: DS_Postpaid_Cluster_APPS_2.
2019-08-27T00:06:54.910Z info hostd[74F9FB70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9715 : Path redundancy to storage device naa.60000970000297500084533030393333 degraded. Path vmhba2:C0:T0:L15 is down. Affected datastores: DS_Postpaid_Cluster_APPS_1.
2019-08-27T00:06:54.911Z info hostd[746CBB70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9716 : Path redundancy to storage device naa.60000970000297500084533030423842 degraded. Path vmhba2:C0:T0:L10 is down. Affected datastores: DS_Prepaid_Cluster6_APPS_3.
2019-08-27T00:06:54.917Z info hostd[72EB1B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9726 : Path redundancy to storage device naa.60000970000297500084533030423832 degraded. Path vmhba2:C0:T0:L1 is down. Affected datastores: DS_Prepaid_Cluster6

 

VOBD Logs:

  • From the VOBD Logs we can see the same path state changing probably came when the HBA Initiated a Reset:

 

2019-08-27T00:06:54.907Z: [scsiCorrelator] 1696379222751us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L2 changed state from on
2019-08-27T00:06:54.908Z: [scsiCorrelator] 1696369608926us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030422222 degraded. Path vmhba2:C0:T0:L2 is down. Affected datastores: “DS_Prepaid_Cluster6_APPS_1”.
2019-08-27T00:06:54.908Z: [scsiCorrelator] 1696379222755us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L20 changed state from on
2019-08-27T00:06:54.909Z: [scsiCorrelator] 1696369609470us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030391111 degraded. Path vmhba2:C0:T0:L20 is down. Affected datastores: “DS_Postpaid_Cluster_APPS_2”.
2019-08-27T00:06:54.909Z: [scsiCorrelator] 1696379222760us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L15 changed state from on
2019-08-27T00:06:54.909Z: [scsiCorrelator] 1696369609960us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030393333 degraded. Path vmhba2:C0:T0:L15 is down. Affected datastores: “DS_Postpaid_Cluster_APPS_1”.
2019-08-27T00:06:54.909Z: [scsiCorrelator] 1696379223358us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L10 changed state from on
2019-08-27T00:06:54.910Z: [scsiCorrelator] 1696369610482us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030423842 degraded. Path vmhba2:C0:T0:L10 is down. Affected datastores: “DS_Prepaid_Cluster6_APPS_3”.
2019-08-27T00:06:54.910Z: [scsiCorrelator] 1696379224049us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L11 changed state from on
2019-08-27T00:06:54.910Z: [scsiCorrelator] 1696369610980us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030423843 degraded. Path vmhba2:C0:T0:L11 is down. Affected datastores: “DS_Prepaid_Cluster6_APPS_3”.
2019-08-27T00:06:54.924Z: [scsiCorrelator] 1696379239388us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L22 changed state from on
2019-08-27T00:06:54.924Z: [scsiCorrelator] 1696369624918us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030394143 degraded. Path vmhba2:C0:T0:L22 is down. Affected datastores: “DS_Postpaid_Cluster_APPS_2”.
2019-08-27T00:06:54.924Z: [scsiCorrelator] 1696379240069us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L23 changed state from on
2019-08-27T00:06:54.925Z: [scsiCorrelator] 1696369625585us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030394144 degraded. Path vmhba2:C0:T0:L23 is down. Affected datastores: “DS_Postpaid_Cluster_APPS_2”.
2019-08-27T00:06:54.925Z: [scsiCorrelator] 1696379240767us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T0:L24 changed state from on
2019-08-27T00:06:54.926Z: [scsiCorrelator] 1696369626301us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60000970000297500084533030394145 degraded. Path vmhba2:C0:T0:L24 is down. Affected datastores: “DS_Postpaid_Cluster_APPS_2”.

 

  • After the reset operation got comepleted we can see that the Path came online and everything went fine.

 

2019-08-27T00:07:16.939Z: [scsiCorrelator] 1696401255008us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L10 changed state from dead
2019-08-27T00:07:16.940Z: [scsiCorrelator] 1696391640751us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030423842 (Datastores: “DS_Prepaid_Cluster6_APPS_3”) restored. Path vmhba2:C0:T0:L10 is active again.
2019-08-27T00:07:16.942Z: [scsiCorrelator] 1696401257471us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L11 changed state from dead
2019-08-27T00:07:16.942Z: [scsiCorrelator] 1696391642927us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030423843 (Datastores: “DS_Prepaid_Cluster6_APPS_3”) restored. Path vmhba2:C0:T0:L11 is active again.
2019-08-27T00:07:16.944Z: [scsiCorrelator] 1696401259885us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L12 changed state from dead
2019-08-27T00:07:16.945Z: [scsiCorrelator] 1696391645322us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030423844 (Datastores: “DS_Prepaid_Cluster6_APPS_3”) restored. Path vmhba2:C0:T0:L12 is active again.
2019-08-27T00:07:16.947Z: [scsiCorrelator] 1696401262305us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L13 changed state from dead
2019-08-27T00:07:16.947Z: [scsiCorrelator] 1696391647749us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030423845 (Datastores: “DS_Prepaid_Cluster6_APPS_3”) restored. Path vmhba2:C0:T0:L13 is active again.
2019-08-27T00:07:16.949Z: [scsiCorrelator] 1696401264786us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L14 changed state from dead
2019-08-27T00:07:16.950Z: [scsiCorrelator] 1696391650270us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030423846 (Datastores: “DS_Prepaid_Cluster6_APPS_3”) restored. Path vmhba2:C0:T0:L14 is active again.
2019-08-27T00:07:16.952Z: [scsiCorrelator] 1696401267478us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L15 changed state from dead
2019-08-27T00:07:16.952Z: [scsiCorrelator] 1696391652879us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030393333 (Datastores: “DS_Postpaid_Cluster_APPS_1”) restored. Path vmhba2:C0:T0:L15 is active again.
2019-08-27T00:07:16.954Z: [scsiCorrelator] 1696401269813us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T0:L16 changed state from dead
2019-08-27T00:07:16.955Z: [scsiCorrelator] 1696391655216us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.60000970000297500084533030394136 (Datastores: “DS_Postpaid_Cluster_APPS_1”) restored. Path vmhba2:C0:T0:L16 is active again.

2019-08-27T00:07:17.765Z: [vmfsCorrelator] 1696392465399us: [esx.problem.vmfs.heartbeat.recovered] 5d4cd6c0-2d6003b1-afa6-3863bb4a1158 DS_Postpaid_Cluster_APPS_1
2019-08-27T00:07:18.871Z: [vmfsCorrelator] 1696393571426us: [esx.problem.vmfs.heartbeat.recovered] 5d4cd7af-327f5b20-6bbe-3863bb4a1158 DS_Postpaid_Cluster_APPS_2

 

 

 

Conclusion:

  • As per the below Events:

2019-08-27T00:06:06.100Z cpu25:33495)IntrCookie: 3429: Interrupt received on invalid vector (cpu 25, vector 76); ignoring it.

2019-08-27T00:06:45.715Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Timeout: Max mbx wait reached.
2019-08-27T00:06:45.715Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Mailbox command 0x54 timeout occurred. Issuing ISP abort.
2019-08-27T00:06:45.715Z cpu20:1714945)WARNING: qlnativefc: vmhba2(3:0.1): Firmware has been previously dumped (0x4308f69a5000) — ignoring request…
2019-08-27T00:06:45.715Z cpu20:1714945)qlnativefc: vmhba2(3:0.1): Inside qlnativefcAbortIsp

  • This is the reason why we have noticed the issue in the environment.
  • As per my Research this is a Known issue with QLogic 8Gb FC HBA and HPE Gen8 platform

The vmkernel alert shown indicates that ESXi has received an interrupt on an unexpected processor interrupt vector and has ignored the interrupt because ESXi does not know the source of the interrupt or which software handler to invoke.

Reception of such an interrupt is not normal and may cause loss of I/O connectivity and/or ESXi PSODs.

In order to distribute the interrupt load evenly across processors, ESXi performs interrupt re-balancing. As part of this process, ESXi may modify the target processor and interrupt vector assigned to an I/O device. This process is known as I/O device interrupt migration.

When the VT-d interrupt remapper is not present in a platform or its use is disabled in ESXi (boot option iovDisableIR=TRUE), ESXi performs I/O device interrupt migration by modifying the PCI MSI/MSI-X registers in the I/O device(s).

VMware has recently received reports that some I/O devices do not react well when ESXi performs interrupt migration by programming the device’s MSI/MSI-X registers. Specifically, some I/O devices may interrupt in a processor vector which ESXi no longer uses and considers invalid, causing the generation of the alert message and its subsequent effects. This issue is only present on 8G FC connection.

 

  • To Understand more about this issue I will recommend that you check with HP if there is any fix released for the issue mentioned above because as per Vmware Engineering team this is a Third Party issue and They will have to find a fix while iovDisableIR is enabled.

 

Workaround:

 For systems that support, enable Intel VT-d Interrupt Remapper. 

See the VMware KB article, ESXi IO connectivity issues or PSOD with VT-d interrupt remapper disabled (2149592) .

  1. Connect to the ESXi host with an SSH session and root credentials.
  1. Run this command:
    esxcli system settings kernel set –setting=iovDisableIR -v FALSE
  1. Reboot the ESXi host.
  1. Ensure the iovDisableIR setting is set to FALSE with this command:
    esxcli system settings kernel list -o iovDisableIR

 

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply