VM experienced an I/O error on 01/04/2019 at 09:12:51 AM.
ESX Host: 1650128slesm02
VMware ESXi 6.0.0 build-4600944
The ESX host was last rebooted on 08/03.
VM Info:
OS Name: Microsoft Windows Server 2012 R2 Standard
Version: 6.3.9600 Build 9600
Other OS Description: Not Available
OS Manufacturer: Microsoft Corporation
System Name: ABC-XYZ-04
Time Zone: E. Australia Standard Time
NIC Drivers:
- Per the driver information below, the NIC (enic) driver version is 2.3.0.7, the FC (fnic) driver version is 1.6.0.25, and the adapter firmware version is 4.0(8d).
vmnic   PCI bus address  link  speed  duplex  MTU   driver  driver version  firmware version  MAC address        VID   DID   SVID  SDID  name
------  ---------------  ----  -----  ------  ----  ------  --------------  ----------------  -----------------  ----  ----  ----  ----  ----------------------------------------
vmnic0  0000:06:00.0     Up    20000  Full    1500  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:60  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic1  0000:07:00.0     Up    20000  Full    1500  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:61  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  0000:0d:00.0     Up    20000  Full    1600  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:62  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic3  0000:0e:00.0     Up    20000  Full    1600  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:63  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmhba   driver  driver version  VID   DID   SVID  SDID  model
------  ------  --------------  ----  ----  ----  ----  -------------------------------------------
vmhba0  fnic    1.6.0.25        1137  0045  1137  012c  Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba1  fnic    1.6.0.25        1137  0045  1137  012c  Cisco Systems Inc Cisco VIC FCoE HBA Driver
As per the Article:
The firmware associated with this driver version should be 4.1(1d); however, the installed firmware is 4.0(8d).
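To confirm the mismatch on the host itself, a minimal sketch follows, assuming it is run in an ESXi shell where both esxcli and Python are available; the EXPECTED map simply restates the article's figure quoted above and is not a live compatibility lookup:

```python
# Sketch: compare the installed enic firmware against the version expected for
# this driver. The EXPECTED mapping restates the article's figure; it is not
# pulled from a live HCL query.
import subprocess

EXPECTED = {"2.3.0.7": "4.1(1d)"}  # enic driver version -> expected firmware

def nic_info(vmnic):
    """Parse 'esxcli network nic get' output into a key/value dict."""
    out = subprocess.check_output(["esxcli", "network", "nic", "get", "-n", vmnic])
    fields = {}
    for line in out.decode().splitlines():
        key, sep, val = line.partition(":")
        if sep:
            fields[key.strip()] = val.strip()
    return fields

for vmnic in ("vmnic0", "vmnic1", "vmnic2", "vmnic3"):
    info = nic_info(vmnic)
    drv, fw = info.get("Version"), info.get("Firmware Version")
    want = EXPECTED.get(drv)
    if want is None:
        print("%s: driver %s not in EXPECTED map" % (vmnic, drv))
        continue
    flag = "OK" if fw == want else "MISMATCH (expected %s)" % want
    print("%s: driver %s, firmware %s -> %s" % (vmnic, drv, fw, flag))
```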
System Events:
- The System event logs show that a reset was issued to the device on 1st April: a retried I/O (Event ID 153, source disk) followed by a reset to \Device\RaidPort1 (Event ID 129, source LSI_SAS); a sketch for extracting these events from the guest follows the tables below.
TimeGenerated | EntryType | MachineName | EventID | Source | Message |
01-04-2019 9.12 | Warning | ABC-XYZ-04.callcentre.abc.qld.abc.au | 153 | disk | The IO operation at logical block address 0x6411a8 for Disk 1 (PDO name: \Device\00000036) was retried. |
01-04-2019 9.12 | Warning | ABC-XYZ-04.callcentre.abc.qld.abc.au | 129 | LSI_SAS | Reset to device, \Device\RaidPort1, was issued. |
TimeGenerated | EntryType | MachineName | EventID | Source | Message |
01-04-2019 9.09 | 0 | ABC-XYZ-04.callcentre.abc.qld.abc.au | 258 | vmStatsProvider | The "vmGuestLibrary" is successfully initialized for this Virtual Machine. |
01-04-2019 9.11 | Information | ABC-XYZ-04.callcentre.abc.qld.abc.au | 1 | pwcallmgr[syd_gir_mcp_02] | WARN Logging queue's maximum limit has been reached - waiting for queue size to drop. |
01-04-2019 9.12 | Information | ABC-XYZ-04.callcentre.abc.qld.abc.au | 1 | pwcallmgr[syd_gir_mcp_02] | NOTE EMS Logging queue has returned below limit. |
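For correlation from the guest side, the same two event IDs can be pulled with the stock wevtutil tool; a minimal sketch, run inside the Windows guest from an elevated prompt:

```python
# Sketch: dump the most recent disk-retry (153) and reset (129) events from the
# Windows System log, newest first, using the built-in wevtutil utility.
import subprocess

query = "*[System[(EventID=129 or EventID=153)]]"
out = subprocess.check_output(
    ["wevtutil", "qe", "System", "/q:" + query, "/f:text", "/c:50", "/rd:true"]
)
print(out.decode("utf-8", errors="replace"))
```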
- 01-04-2019 09:12 AEST corresponds to 23:12 GMT on 31-03-2019, which lines up with the vmkernel timestamps below; the conversion can be checked as follows.
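A quick check of that conversion (requires Python 3.9+ for zoneinfo; "Australia/Brisbane" is the IANA zone matching E. Australia Standard Time, UTC+10 with no DST):

```python
# Sketch: confirm the guest-to-vmkernel timestamp alignment for this case.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

guest = datetime(2019, 4, 1, 9, 12, 51, tzinfo=ZoneInfo("Australia/Brisbane"))
print(guest.astimezone(timezone.utc))  # 2019-03-31 23:12:51+00:00
```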
VMKernel:
- The vmkernel logs show the corresponding abort and reset against the device with NAA ID naa.6000144000000010e005dc884882abcd (a filtering sketch follows the excerpt):
2019-03-31T23:12:51.846Z cpu4:33091)<7>fnic : 0 :: Returning from abort cmd type 2 SUCCESS
2019-03-31T23:12:51.846Z cpu2:33767)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x8a (0x439e01224e80, 751789) to dev “naa.6000144000000010e005dc884882abcd” on path “vmhba0:C0:T3:L12” Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-03-31T23:12:51.846Z cpu2:33767)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.6000144000000010e005dc884882abcd” state in doubt; requested fast path state update…
2019-03-31T23:12:51.846Z cpu2:33767)ScsiDeviceIO: 2595: Cmd(0x439e01224e80) 0x8a, CmdSN 0x80000023 from world 751789 to dev “naa.6000144000000010e005dc884882abcd” failed H:0x8 D:0x0 P:0x0
2019-03-31T23:12:51.846Z cpu4:33091)VSCSI: 2661: handle 8193(vscsi1:0):Completing reset (0 outstanding commands)
2019-03-31T23:13:20.148Z cpu37:1649929)ScsiScan: 836: Path vmhba1:C0:T2:L0 supports REPORT LUNS 0x11
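To triage quickly across a large bundle, a short filter can pull the relevant lines out of vmkernel.log; a sketch, where the marker strings come from the excerpts above and the path should point at the log inside the extracted support bundle:

```python
# Sketch: filter the reset-related lines for this case out of vmkernel.log.
import re

MARKERS = re.compile(r"state in doubt|H:0x8|Returning from abort|Completing reset")

with open("vmkernel.log") as log:
    for line in log:
        if MARKERS.search(line):
            print(line.rstrip())
```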
- As per the Article: "state in doubt; requested fast path state update" error in ESXi (1022026)
These messages appear when the Host Bus Adapter (HBA) driver aborts a command that took longer than the 5-second timeout period to complete. An operation can exceed the timeout for several reasons, including:
- Array backup operations (LUN backup, replication, etc)
- General overload on the array
- Read/Write Cache on the array (misconfiguration, lack of cache, etc)
- Incorrect tiered storage used (SATA over SCSI)
- Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
- However, we have seen numerous events reporting degraded performance on this LUN:
2019-03-24T13:05:34.614Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 41338 microseconds. # --> this is 0.041338 seconds
2019-03-25T13:04:50.521Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 36007 microseconds. # --> this is 0.036007 seconds
2019-03-28T13:14:28.990Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 37602 microseconds. # --> this is 0.037602 seconds
- As per the Article: "performance has deteriorated" message in ESXi (2007236)
This occurs when either the latency ratio relative to the last log update reaches 30, or the ratio has doubled since the last log (an illustrative sketch of this rule follows the list below). The device latency may increase due to one of these reasons:
- Changes made on the target
- Disk or media failures
- Overload conditions on the device
- Failover
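An illustrative reading of that logging rule, as described in the article; this is a sketch for reasoning about the thresholds, not the actual vmkernel implementation, and the 17000 value in the usage example is hypothetical:

```python
# Sketch of the KB 2007236 rule: fire when latency reaches 30x the baseline
# recorded at the last log update, or has doubled since the last reported value.

def should_log(current_us, baseline_us, last_reported_us=None):
    """Return True if a 'performance has deteriorated' message would fire."""
    if baseline_us > 0 and current_us >= 30 * baseline_us:
        return True
    if last_reported_us is not None and current_us >= 2 * last_reported_us:
        return True
    return False

print(should_log(current_us=60000, baseline_us=1794))  # True: ratio exceeds 30x
# 17000 is a hypothetical previously reported value, for illustration only:
print(should_log(current_us=36007, baseline_us=1794, last_reported_us=17000))  # True: doubled
```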
VOBD Logs:
- While reviewing the VOBD logs at the time of the issue we did not find anything specific, but the same performance-deterioration events are being logged (a sketch for correlating these events follows the excerpt):
2019-03-25T13:04:50.521Z: [scsiCorrelator] 1532803651330us: [vob.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1794 microseconds to 36007 microseconds.
2019-03-25T13:04:50.521Z: [scsiCorrelator] 1532840761656us: [esx.problem.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1794 microseconds to 36007 microseconds.
2019-03-25T13:05:23.082Z: [scsiCorrelator] 1532836211217us: [vob.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 36007 microseconds to 7178 microseconds.
2019-03-25T13:05:23.082Z: [scsiCorrelator] 1532873322338us: [esx.clear.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 36007 microseconds to 7178 microseconds.
2019-03-25T13:05:42.503Z: [scsiCorrelator] 1532855632032us: [vob.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 7178 microseconds to 3538 microseconds.
2019-03-25T13:05:42.503Z: [scsiCorrelator] 1532892743443us: [esx.clear.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 7178 microseconds to 3538 microseconds.
2019-03-28T13:14:28.991Z: [scsiCorrelator] 1792575834676us: [vob.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1827 microseconds to 37602 microseconds.
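For a view of how long each spike lasted, the matching high/improved pairs can be correlated; a sketch assuming a vobd.log extracted from the support bundle, with the regex following the line format quoted above:

```python
# Sketch: pair each vob.scsi.device.io.latency.high event with the following
# .improved event for the same device and report the spike duration.
import re
from datetime import datetime

EVENT = re.compile(
    r"^(?P<ts>\S+): .*\[vob\.scsi\.device\.io\.latency\.(?P<kind>high|improved)\]"
    r".*Device (?P<dev>naa\.\w+)"
)

spike_start = {}
with open("vobd.log") as log:
    for line in log:
        m = EVENT.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
        dev, kind = m.group("dev"), m.group("kind")
        if kind == "high":
            spike_start[dev] = ts
        elif dev in spike_start:
            start = spike_start.pop(dev)
            print("%s: spike at %s cleared after %s" % (dev, start, ts - start))
```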
- Decoding the SCSI status in the failed command confirms a reset issued from the HBA:
vmhba0:C0:T3:L12 H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0 --> Act:EVAL --> DID_RESET | GOOD | VMK SCSI PLUGIN GOOD | NO SENSE | NO ADDITIONAL SENSE INFORMATION
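For reference, the status triplet can be decoded programmatically; a sketch where the host-status table covers only common DID_* codes (including the DID_RESET seen here) and is not exhaustive:

```python
# Sketch: decode the H:/D:/P: triplet from NMP/ScsiDeviceIO log lines.
HOST_STATUS = {
    0x0: "DID_OK",
    0x1: "DID_NO_CONNECT",
    0x2: "DID_BUS_BUSY",
    0x3: "DID_TIME_OUT",
    0x5: "DID_ABORT",
    0x7: "DID_ERROR",
    0x8: "DID_RESET",
}

def decode(h, d, p):
    host = HOST_STATUS.get(h, "UNKNOWN(0x%x)" % h)
    device = "GOOD" if d == 0x0 else "0x%x" % d
    plugin = "GOOD" if p == 0x0 else "0x%x" % p
    return host, device, plugin

print(decode(0x8, 0x0, 0x0))  # ('DID_RESET', 'GOOD', 'GOOD') -- matches this case
```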
Recommendation:
- First, migrate a few VMs off the datastore; if the device is being overloaded, this reduces the load and lets us confirm whether the resets stop.
- Based on the logs and the articles referenced above, engage the hardware vendor to investigate the resets and latency spikes seen here.
- Check with the vendor regarding the firmware version in use: as per the article mentioned above, the firmware matching fnic driver version 1.6.0.25 should be 4.1(1d); however, the installed version is 4.0(8d).