RCA-10: VM experienced an IO error

The VM experienced an IO error on 01/04/2019 at 09:12:51 AM.

ESX Host: 1650128slesm02

The ESXi host was rebooted on 08/03.

VMware ESXi 6.0.0 build-4600944

 

 

VM Info:

OS Name: Microsoft Windows Server 2012 R2 Standard

Version: 6.3.9600 Build 9600

Other OS Description: Not Available

OS Manufacturer: Microsoft Corporation

System Name: ABC-XYZ-04

Time Zone: E. Australia Standard Time

 

NIC Drivers:

 

  • The NIC driver (enic) is at version 2.3.0.7, the FC driver (fnic) is at version 1.6.0.25, and the firmware version is 4.0(8d).

 

vmnic   PCI bus address  link  speed  duplex  MTU   driver  driver version  firmware version  MAC address        VID   DID   SVID  SDID  name
------  ---------------  ----  -----  ------  ----  ------  --------------  ----------------  -----------------  ----  ----  ----  ----  ----
vmnic0  0000:06:00.0     Up    20000  Full    1500  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:60  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic1  0000:07:00.0     Up    20000  Full    1500  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:61  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  0000:0d:00.0     Up    20000  Full    1600  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:62  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic3  0000:0e:00.0     Up    20000  Full    1600  enic    2.3.0.7         4.0(8d)           00:25:b5:f1:18:63  1137  0043  1137  012c  Cisco Systems Inc Cisco VIC Ethernet NIC

 

vmhba   driver  driver version  VID   DID   SVID  SDID  model
------  ------  --------------  ----  ----  ----  ----  -----
vmhba0  fnic    1.6.0.25        1137  0045  1137  012c  Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba1  fnic    1.6.0.25        1137  0045  1137  012c  Cisco Systems Inc Cisco VIC FCoE HBA Driver
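The driver/firmware pairing in the tables above can be checked programmatically. A minimal sketch, assuming an inventory already parsed into dictionaries; the expected-firmware table below only encodes the fnic pairing documented in this article, and the helper names are illustrative, not VMware tooling:

```python
# Map (driver, driver version) -> firmware expected per the VMware
# Compatibility Guide. Only the fnic pairing from this article is encoded;
# extend this table from the HCL entries for your own hardware.
EXPECTED_FIRMWARE = {
    ("fnic", "1.6.0.25"): "4.1(1d)",
}

def check_firmware(devices):
    """Return (name, installed, expected) for each device whose installed
    firmware differs from the HCL-expected version for its driver."""
    mismatches = []
    for dev in devices:
        expected = EXPECTED_FIRMWARE.get((dev["driver"], dev["driver_version"]))
        if expected and dev["firmware"] != expected:
            mismatches.append((dev["name"], dev["firmware"], expected))
    return mismatches

# Data taken from the adapter tables above
inventory = [
    {"name": "vmnic0", "driver": "enic", "driver_version": "2.3.0.7", "firmware": "4.0(8d)"},
    {"name": "vmhba0", "driver": "fnic", "driver_version": "1.6.0.25", "firmware": "4.0(8d)"},
]

for name, installed, expected in check_firmware(inventory):
    print(f"{name}: installed {installed}, HCL expects {expected}")
# vmhba0: installed 4.0(8d), HCL expects 4.1(1d)
```

In practice the inventory would come from `esxcli network nic list` and `esxcli storage core adapter list` output.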

 

As per the VMware Compatibility Guide entries:

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37360&deviceCategory=io&details=1&VID=1137&DID=0043&SVID=1137&SSID=012c&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37358&deviceCategory=io&details=1&VID=1137&DID=0045&SVID=1137&SSID=012c&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

 

The firmware associated with this driver version should be 4.1(1d); however, the installed version is 4.0(8d).

 

 

System Events:

 

  • From the System logs we can see that a reset was sent to the device on 1st April.

 

TimeGenerated    EntryType  MachineName                           EventID  Source   Message
01-04-2019 9.12  Warning    ABC-XYZ-04.callcentre.abc.qld.abc.au  153      disk     The IO operation at logical block address 0x6411a8 for Disk 1 (PDO name: \Device\00000036) was retried.
01-04-2019 9.12  Warning    ABC-XYZ-04.callcentre.abc.qld.abc.au  129      LSI_SAS  Reset to device, \Device\RaidPort1, was issued.

 

 

TimeGenerated    EntryType    MachineName                           EventID  Source                     Message
01-04-2019 9.09  0            ABC-XYZ-04.callcentre.abc.qld.abc.au  258      vmStatsProvider            The "vmGuestLibrary" is successfully initialized for this Virtual Machine.
01-04-2019 9.11  Information  ABC-XYZ-04.callcentre.abc.qld.abc.au  1        pwcallmgr[syd_gir_mcp_02]  WARN Logging queue's maximum limit has been reached - waiting for queue size to drop.
01-04-2019 9.12  Information  ABC-XYZ-04.callcentre.abc.qld.abc.au  1        pwcallmgr[syd_gir_mcp_02]  NOTE EMS Logging queue has returned below limit.

 

  • 01-04-2019 9:12 AM AEST = 31-03-2019 11:12 PM GMT.
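The timezone arithmetic above can be verified with Python's `zoneinfo` (Australia/Brisbane is E. Australia Standard Time, UTC+10, with no daylight saving):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Event time from the Windows System log, in E. Australia Standard Time (UTC+10)
local = datetime(2019, 4, 1, 9, 12, tzinfo=ZoneInfo("Australia/Brisbane"))
gmt = local.astimezone(ZoneInfo("UTC"))
print(gmt)  # 2019-03-31 23:12:00+00:00 -> matches the vmkernel timestamps
```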

 

 

VMKernel:

 

  • From the vmkernel logs we can see a similar reset being issued to the device with NAA ID naa.6000144000000010e005dc884882abcd:

 

2019-03-31T23:12:51.846Z cpu4:33091)<7>fnic : 0 :: Returning from abort cmd type 2 SUCCESS
2019-03-31T23:12:51.846Z cpu2:33767)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x8a (0x439e01224e80, 751789) to dev "naa.6000144000000010e005dc884882abcd" on path "vmhba0:C0:T3:L12" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-03-31T23:12:51.846Z cpu2:33767)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.6000144000000010e005dc884882abcd" state in doubt; requested fast path state update...
2019-03-31T23:12:51.846Z cpu2:33767)ScsiDeviceIO: 2595: Cmd(0x439e01224e80) 0x8a, CmdSN 0x80000023 from world 751789 to dev "naa.6000144000000010e005dc884882abcd" failed H:0x8 D:0x0 P:0x0
2019-03-31T23:12:51.846Z cpu4:33091)VSCSI: 2661: handle 8193(vscsi1:0):Completing reset (0 outstanding commands)
2019-03-31T23:13:20.148Z cpu37:1649929)ScsiScan: 836: Path vmhba1:C0:T2:L0 supports REPORT LUNS 0x11

 

These messages appear when the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can exceed the timeout for several reasons, including:

 

  • Array backup operations (LUN backup, replication, etc.)
  • General overload on the array
  • Read/write cache on the array (misconfiguration, lack of cache, etc.)
  • Incorrect tiered storage used (SATA instead of SCSI)
  • Fabric issues (bad ISL, outdated firmware, bad fabric cable/GBIC)

 

  • However, we have also seen many events reporting degraded performance on the LUN:

 

2019-03-24T13:05:34.614Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 41338 microseconds. # --> this is 0.041338 seconds
2019-03-25T13:04:50.521Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 36007 microseconds. # --> this is 0.036007 seconds
2019-03-28T13:14:28.990Z Device naa.6000144000000010e005dc884882abcd performance has deteriorated to 37602 microseconds. # --> this is 0.037602 seconds

 

This event is logged when the ratio of the current latency to the average at the last log update reaches 30, or when that ratio has doubled since the last log entry. Device latency may increase due to one of these reasons:

  • Changes made on the target
  • Disk or media failures
  • Overload conditions on the device
  • Failover
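The logging rule described above can be sketched in a few lines. This is my reading of the stated behavior, not VMkernel source; the function name and signature are illustrative:

```python
def should_log(current_us, avg_us, last_ratio):
    """Sketch of the stated rule: log when the latency-to-average ratio
    reaches 30, or when that ratio has doubled since the last log entry.
    Returns (log_now, ratio)."""
    ratio = current_us / avg_us
    return (ratio >= 30 or ratio >= 2 * last_ratio), ratio

# Figures from the VOBD log: average 1794 us, spike to 36007 us.
# The ratio (~20x) is below 30, but has more than doubled, so it is logged.
flag, ratio = should_log(36007, 1794, 1.0)
print(flag, round(ratio, 2))  # True 20.07
```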

 

 

VOBD Logs:

 

  • While reviewing the VOBD logs at the time of the issue we did not find anything specific, but we can see the same performance-deterioration events being logged:

 

2019-03-25T13:04:50.521Z: [scsiCorrelator] 1532803651330us: [vob.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1794 microseconds to 36007 microseconds.
2019-03-25T13:04:50.521Z: [scsiCorrelator] 1532840761656us: [esx.problem.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1794 microseconds to 36007 microseconds.
2019-03-25T13:05:23.082Z: [scsiCorrelator] 1532836211217us: [vob.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 36007 microseconds to 7178 microseconds.
2019-03-25T13:05:23.082Z: [scsiCorrelator] 1532873322338us: [esx.clear.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 36007 microseconds to 7178 microseconds.
2019-03-25T13:05:42.503Z: [scsiCorrelator] 1532855632032us: [vob.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 7178 microseconds to 3538 microseconds.
2019-03-25T13:05:42.503Z: [scsiCorrelator] 1532892743443us: [esx.clear.scsi.device.io.latency.improved] Device naa.6000144000000010e005dc884882abcd performance has improved. I/O latency reduced from 7178 microseconds to 3538 microseconds.
2019-03-28T13:14:28.991Z: [scsiCorrelator] 1792575834676us: [vob.scsi.device.io.latency.high] Device naa.6000144000000010e005dc884882abcd performance has deteriorated. I/O latency increased from average value of 1827 microseconds to 37602 microseconds.

 

  • From the SCSI sense data we can see that a reset was issued by the HBA:

vmhba0:C0:T3:L12  H:0x8 D:0x0 P:0x0  Possible sense data: 0x0 0x0 0x0  --> Act:EVAL  --> DID RESET | GOOD | VMK SCSI PLUGIN GOOD | NO SENSE | NO ADDITIONAL SENSE INFORMATION
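The H/D/P status fields in that line can be decoded mechanically. A small illustrative decoder; the host-status names follow the common SCSI midlayer codes and the table is deliberately abbreviated:

```python
# Abbreviated host-status (H:) codes as used by the SCSI midlayer;
# 0x8 (DID_RESET) is the value seen in the vmkernel line above.
HOST_STATUS = {
    0x0: "DID_OK",
    0x1: "DID_NO_CONNECT",
    0x2: "DID_BUS_BUSY",
    0x3: "DID_TIME_OUT",
    0x5: "DID_ABORT",
    0x7: "DID_ERROR",
    0x8: "DID_RESET",
}

def decode_status(h, d, p):
    """Return (host, device, plugin) status; 0x0 for device/plugin means GOOD."""
    return (
        HOST_STATUS.get(h, f"unknown(0x{h:x})"),
        "GOOD" if d == 0 else f"0x{d:x}",
        "GOOD" if p == 0 else f"0x{p:x}",
    )

print(decode_status(0x8, 0x0, 0x0))  # ('DID_RESET', 'GOOD', 'GOOD')
```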

 

 

Recommendation:

 

  • First, I would recommend moving a few VMs off this datastore so that, if the device is being overloaded, the load is reduced.
  • Based on the logs and the Compatibility Guide entries above, I would recommend checking with the hardware vendor regarding the issue we are seeing here.
  • Check with the vendor regarding the firmware version: as per the Compatibility Guide entries above, the firmware paired with driver version 1.6.0.25 should be 4.1(1d); however, the installed version is 4.0(8d).

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
