RCA-10: VM experienced an IO error

The VM experienced an IO error on 01/04/2019 at 09:12:51 AM.

ESX Host: 1650128slesm02

The ESXi host was rebooted on 08/03.

VMware ESXi 6.0.0 build-4600944

 

 

VM Info:

OS Name: Microsoft Windows Server 2012 R2 Standard

Version: 6.3.9600 Build 9600

Other OS Description: Not Available

OS Manufacturer: Microsoft Corporation

System Name: ABC-XYZ-04

Time Zone: E. Australia Standard Time

 

NIC Drivers:

 

  • The NIC driver (enic) is at version 2.3.0.7, the FC driver (fnic) is at version 1.6.0.25, and the firmware version is 4.0(8d).

 

vmnic   PCI bus address  link  speed  duplex  MTU   driver  driver version  firmware version  MAC address        VID   DID   SVID  SDID  name
------  ---------------  ----  -----  ------  ----  ------  --------------  ----------------  -----------------  ----  ----  ----  ----  ----
vmnic0  0000:06:00.0     Up    20000  Full    1500  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:60  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic1  0000:07:00.0     Up    20000  Full    1500  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:61  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  0000:0d:00.0     Up    20000  Full    1600  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:62  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic3  0000:0e:00.0     Up    20000  Full    1600  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:63  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC

 

vmhba   driver  driver version  VID   DID   SVID  SDID  model
------  ------  --------------  ----  ----  ----  ----  -----
vmhba0  fnic    1.6.0.25        1137  0045  1137  012c  Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba1  fnic    1.6.0.25        1137  0045  1137  012c  Cisco Systems Inc Cisco VIC FCoE HBA Driver
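The driver/firmware pairing in the tables above can be checked programmatically. A minimal sketch, assuming an inventory already parsed into dictionaries; the expected-firmware table below only encodes the fnic pairing documented in this article, and the helper names are illustrative, not VMware tooling:

```python
# Map (driver, driver version) -> firmware expected per the VMware
# Compatibility Guide. Only the fnic pairing from this article is encoded;
# extend this table from the HCL entries for your own hardware.
EXPECTED_FIRMWARE = {
    ("fnic", "1.6.0.25"): "4.1(1d)",
}

def check_firmware(devices):
    """Return (name, installed, expected) for each device whose installed
    firmware differs from the HCL-expected version for its driver."""
    mismatches = []
    for dev in devices:
        expected = EXPECTED_FIRMWARE.get((dev["driver"], dev["driver_version"]))
        if expected and dev["firmware"] != expected:
            mismatches.append((dev["name"], dev["firmware"], expected))
    return mismatches

# Data taken from the adapter tables above
inventory = [
    {"name": "vmnic0", "driver": "enic", "driver_version": "2.3.0.7", "firmware": "4.0(8d)"},
    {"name": "vmhba0", "driver": "fnic", "driver_version": "1.6.0.25", "firmware": "4.0(8d)"},
]

for name, installed, expected in check_firmware(inventory):
    print(f"{name}: installed {installed}, HCL expects {expected}")
# vmhba0: installed 4.0(8d), HCL expects 4.1(1d)
```

In practice the inventory would come from `esxcli network nic list` and `esxcli storage core adapter list` output.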

 

As per the VMware Compatibility Guide entries:

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37360&deviceCategory=io&details=1&VID=1137&DID=0043&SVID=1137&SSID=012c&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37358&deviceCategory=io&details=1&VID=1137&DID=0045&SVID=1137&SSID=012c&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

 

The firmware associated with this driver version should be 4.1(1d); however, the installed version is 4.0(8d).

 

 

System Events:

 

  • From the System logs we can see that a reset was sent to the device on 1st April.

 

TimeGenerated    EntryType  MachineName                           EventID  Source   Message
01-04-2019 9.12  Warning    ABC-XYZ-04.callcentre.abc.qld.abc.au  153      disk     The IO operation at logical block address 0x6411a8 for Disk 1 (PDO name: \Device\00000036) was retried.
01-04-2019 9.12  Warning    ABC-XYZ-04.callcentre.abc.qld.abc.au  129      LSI_SAS  Reset to device, \Device\RaidPort1, was issued.

 

 

TimeGenerated    EntryType    MachineName                           EventID  Source                     Message
01-04-2019 9.09  0            ABC-XYZ-04.callcentre.abc.qld.abc.au  258      vmStatsProvider            The "vmGuestLibrary" is successfully initialized for this Virtual Machine.
01-04-2019 9.11  Information  ABC-XYZ-04.callcentre.abc.qld.abc.au  1        pwcallmgr[syd_gir_mcp_02]  WARN Logging queue's maximum limit has been reached - waiting for queue size to drop.
01-04-2019 9.12  Information  ABC-XYZ-04.callcentre.abc.qld.abc.au  1        pwcallmgr[syd_gir_mcp_02]  NOTE EMS Logging queue has returned below limit.

 

  • 01-04-2019 9:12 AM AEST = 31-03-2019 11:12 PM GMT.
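The timezone arithmetic above can be verified with Python's `zoneinfo` (Australia/Brisbane is E. Australia Standard Time, UTC+10, with no daylight saving):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Event time from the Windows System log, in E. Australia Standard Time (UTC+10)
local = datetime(2019, 4, 1, 9, 12, tzinfo=ZoneInfo("Australia/Brisbane"))
gmt = local.astimezone(ZoneInfo("UTC"))
print(gmt)  # 2019-03-31 23:12:00+00:00 -> matches the vmkernel timestamps
```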

 

 

VMKernel:

 

  • From the vmkernel logs we can see a similar reset being issued to the device with NAA ID naa.6000144000000010e005dc884882abcd:

 

2019-03-31T23:12:51.846Z cpu4:33091)<7>fnic : 0 :: Returning from abort cmd type 2 SUCCESS
2019-03-31T23:12:51.846Z cpu2:33767)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x8a (0x439e01224e80, 751789) to dev "naa.6000144000000010e005dc884882abcd" on path "vmhba0:C0:T3:L12" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-03-31T23:12:51.846Z cpu2:33767)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.6000144000000010e005dc884882abcd" state in doubt; requested fast path state update...
2019-03-31T23:12:51.846Z cpu2:33767)ScsiDeviceIO: 2595: Cmd(0x439e01224e80) 0x8a, CmdSN 0x80000023 from world 751789 to dev "naa.6000144000000010e005dc884882abcd" failed H:0x8 D:0x0 P:0x0
2019-03-31T23:12:51.846Z cpu4:33091)VSCSI: 2661: handle 8193(vscsi1:0):Completing reset (0 outstanding commands)
2019-03-31T23:13:20.148Z cpu37:1649929)ScsiScan: 836: Path vmhba1:C0:T2:L0 supports REPORT LUNS 0x11

 

These messages appear when the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can exceed the timeout for several reasons, including:

 

  • Array backup operations (LUN backup, replication, etc.)
  • General overload on the array
  • Read/write cache on the array (misconfiguration, lack of cache, etc.)
  • Incorrect tiered storage used (SATA instead of SCSI)
  • Fabric issues (bad ISL, outdated firmware, bad fabric cable/GBIC)

 

  • However, we have also seen many events reporting degraded performance on the LUN:

 

2019-03-24T13:05:34.614Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 41338 microseconds. # --> this is 0.041338 seconds
2019-03-25T13:04:50.521Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 36007 microseconds. # --> this is 0.036007 seconds
2019-03-28T13:14:28.990Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 37602 microseconds. # --> this is 0.037602 seconds

 

This event is logged when the ratio of the current latency to the average at the last log update reaches 30, or when that ratio has doubled since the last log entry. Device latency may increase due to one of these reasons:

  • Changes made on the target
  • Disk or media failures
  • Overload conditions on the device
  • Failover
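The logging rule described above can be sketched in a few lines. This is my reading of the stated behavior, not VMkernel source; the function name and signature are illustrative:

```python
def should_log(current_us, avg_us, last_ratio):
    """Sketch of the stated rule: log when the latency-to-average ratio
    reaches 30, or when that ratio has doubled since the last log entry.
    Returns (log_now, ratio)."""
    ratio = current_us / avg_us
    return (ratio >= 30 or ratio >= 2 * last_ratio), ratio

# Figures from the VOBD log: average 1794 us, spike to 36007 us.
# The ratio (~20x) is below 30, but has more than doubled, so it is logged.
flag, ratio = should_log(36007, 1794, 1.0)
print(flag, round(ratio, 2))  # True 20.07
```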

 

 

VOBD Logs:

 

  • While reviewing the VOBD logs at the time of the issue we did not find anything specific, but we can see the same performance-deterioration events being logged:

 

2019-03-25T13:04:50.521Z: [scsiCorrelator] 1532803651330us: [vob.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1794 microseconds to 36007 microseconds.
2019-03-25T13:04:50.521Z: [scsiCorrelator] 1532840761656us: [esx.problem.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1794 microseconds to 36007 microseconds.
2019-03-25T13:05:23.082Z: [scsiCorrelator] 1532836211217us: [vob.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 36007 microseconds to 7178 microseconds.
2019-03-25T13:05:23.082Z: [scsiCorrelator] 1532873322338us: [esx.clear.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 36007 microseconds to 7178 microseconds.
2019-03-25T13:05:42.503Z: [scsiCorrelator] 1532855632032us: [vob.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 7178 microseconds to 3538 microseconds.
2019-03-25T13:05:42.503Z: [scsiCorrelator] 1532892743443us: [esx.clear.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 7178 microseconds to 3538 microseconds.
2019-03-28T13:14:28.991Z: [scsiCorrelator] 1792575834676us: [vob.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1827 microseconds to 37602 microseconds.

 

  • From the SCSI sense data we can see that a reset was issued by the HBA:

vmhba0:C0:T3:L12  H:0x8 D:0x0 P:0x0  Possible sense data: 0x0 0x0 0x0  --> Act:EVAL  --> DID RESET | GOOD | VMK SCSI PLUGIN GOOD | NO SENSE | NO ADDITIONAL SENSE INFORMATION
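The H/D/P status fields in that line can be decoded mechanically. A small illustrative decoder; the host-status names follow the common SCSI midlayer codes and the table is deliberately abbreviated:

```python
# Abbreviated host-status (H:) codes as used by the SCSI midlayer;
# 0x8 (DID_RESET) is the value seen in the vmkernel line above.
HOST_STATUS = {
    0x0: "DID_OK",
    0x1: "DID_NO_CONNECT",
    0x2: "DID_BUS_BUSY",
    0x3: "DID_TIME_OUT",
    0x5: "DID_ABORT",
    0x7: "DID_ERROR",
    0x8: "DID_RESET",
}

def decode_status(h, d, p):
    """Return (host, device, plugin) status; 0x0 for device/plugin means GOOD."""
    return (
        HOST_STATUS.get(h, f"unknown(0x{h:x})"),
        "GOOD" if d == 0 else f"0x{d:x}",
        "GOOD" if p == 0 else f"0x{p:x}",
    )

print(decode_status(0x8, 0x0, 0x0))  # ('DID_RESET', 'GOOD', 'GOOD')
```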

 

 

Recommendation:

 

  • First, I would recommend moving a few VMs off this datastore so that, if the device is being overloaded, the load is reduced.
  • Based on the logs and the Compatibility Guide entries above, I would recommend checking with the hardware vendor regarding the issue we are seeing here.
  • Check with the vendor regarding the firmware version: as per the Compatibility Guide entries above, the firmware paired with driver version 1.6.0.25 should be 4.1(1d); however, the installed version is 4.0(8d).

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
