RCA-5: vMotion Logging and Storage Latency on vCenter Infrastructure

Host Name: HQI*****EFG

 

Reported Time of the Issue: January 19, 2019 at 16:34

 

VMs on which the issue was reported: abcdefghij04 & abcd*****j09. (This analysis is based on VM abcd*****j09.)

 

 

Hostd:

 

  • From the hostd logs we can see that the issue initially started with low disk space for the VM abcd*****j09 on host HQI*****EFG:

 

2019-01-19T21:02:20.912Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Answered question 48975048

2019-01-19T21:02:21.717Z info hostd[12307B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Upgrade is required for virtual machine, version: 10

2019-01-19T21:02:46.046Z warning hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Failed to find activation record, event user unknown.

2019-01-19T21:02:46.046Z info hostd[120C6B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 7463 : Message on abcd*****j09 on HQI*****EFG.phx.aexp.com in ha-datacenter: There is no more space for virtual disk 'abcd*****j09-000001.vmdk'. You might be able to continue this session by freeing disk space on the relevant volume, and clicking _Retry. Click Cancel to terminate this session.
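
When this condition is hit, the first thing to verify is free capacity on the affected volume. A minimal check from the ESXi shell (the volume UUID is taken from the log lines above):

# List all mounted filesystems with their size and free space
esxcli storage filesystem list

# Or check the specific VMFS volume directly by its UUID
df -h /vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee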

 

  • After this, the VM underwent a storage migration, which can be seen in the events below (migrateType = 2).
  • The migration was initiated by the user ADS\kp***2.

 

  • From the logs below we can see that the storage migration moved the VM from /vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee to /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee, where the VM currently resides:

 

2019-01-19T23:02:16.060Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-a0-f79b user=vpxuser:ADS\kp***2] Completed scheduling PrepareSourceEx [2493027353303613693].

2019-01-19T23:02:16.068Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] VigorMigrateNotifyCb:: hostlog state changed from none to emigrating

2019-01-19T23:02:16.076Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] VMotionStatusCb [2493027353303613693] : Prepare task completed successfully

2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] PrepareDestinationEx [2493027353303613693]

2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotionDst (2493027353303613693) opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] VMotionEntry: migrateType = 2

2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] Completed PrepareDestinationEx [2493027353303613693]

2019-01-19T23:02:16.108Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] InitiateDestination [2493027353303613693], VM = '/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx'

 

2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] State Transition (VM_STATE_INITIALIZING -> VM_STATE_OFF)

 

2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] Initialized virtual machine.

2019-01-19T23:02:16.148Z info hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Skip a duplicate transition to: VM_STATE_OFF
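
As a side note, all of these hostd entries carry the same opID, so the entire migration flow can be traced by filtering on it. A simple sketch, assuming the default ESXi log location:

# Trace the whole migration operation across hostd.log using its opID
grep "68dd81e8-fb51-408c-a2e8-4d05f94f5a46" /var/log/hostd.log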

 

 

 

vmware.log:

 

  • From the vmware.log we can see that the migration was initiated for the virtual machine and completed successfully about three and a half hours later (23:02 UTC on January 19 to 02:36 UTC on January 20):

 

2019-01-19T23:02:16.375Z| vmx| I125: Received migrate 'from' request for mid id 2493027353303613693, src ip <127.0.0.1>.

2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetInfo: state=8 srcIp=<127.0.0.1> dstIp=<127.0.0.1> mid=2493027353303613693 uuid=0000002d-0000-0000-0000-54494e53004d priority=low

2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 8.

 

2019-01-19T23:02:16.513Z| vmx| I125: MigrateSetState: Transitioning from state 8 to 9.

2019-01-19T23:02:16.555Z| vmx| I125: MigrateSetState: Transitioning from state 9 to 10.

2019-01-20T02:36:40.724Z| vcpu-0| I125: MigrateSetState: Transitioning from state 12 to 0.
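
The same state machine can also be followed end to end from the VM's own vmware.log in its working directory on the destination datastore (path built from the destination .vmx path above):

# Follow the migration state transitions recorded by the VMX process
grep "MigrateSetState" /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/vmware.log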

 

 

  • To understand the underlying storage, we next identified the NAA IDs associated with the VMFS datastores (see the table and the command sketch below):

 

Volume Name                        VMFS UUID                            Device Name                           Partition
A**_VS-***-EXT_02F8-F9P01_LUN15    5bd0de64-6c843b76-54eb-0025b552a0ee  naa.6000097000019780035353303****638  1
A**_VS-***-EXT_02F8-F9P01_LUN15    5bd0de64-6c843b76-54eb-0025b552a0ee  naa.600009700001978003535****0324639  1
A**_VS-***-EXT_00393P02_LUN23      5bd0e096-786144d4-b713-0025b552a0ee  naa.60000970000197800352533****33933  1
A**_VS-***-EXT_00393P02_LUN23      5bd0e096-786144d4-b713-0025b552a0ee  naa.600009700001978003****3030333031  1
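
This mapping can be reproduced on the host with the extent list command; note that each datastore above spans two extents, hence two NAA devices per VMFS UUID:

# Map each VMFS volume to its backing device(s); multi-extent volumes show one row per extent
esxcli storage vmfs extent list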

 

 

  • From the device statistics below we can see a large number of failed blocks written on both devices backing the destination datastore:

 

 

   Device: naa.60000970000197800352533****33933

   Successful Commands: 199708637

   Blocks Read: 7373684931

   Blocks Written: 7434662607

   Read Operations: 98014681

   Write Operations: 100199226

   Reserve Operations: 0

   Reservation Conflicts: 0

   Failed Commands: 6618

   Failed Blocks Read: 133

   Failed Blocks Written: 817351

   Failed Read Operations: 5

   Failed Write Operations: 2156

   Failed Reserve Operations: 0

 

 

   Device: naa.600009700001978003****3030333031

   Successful Commands: 90638039

   Blocks Read: 2026385100

   Blocks Written: 5615288608

   Read Operations: 24366245

   Write Operations: 65636791

   Reserve Operations: 0

   Reservation Conflicts: 0

   Failed Commands: 3724

   Failed Blocks Read: 138

   Failed Blocks Written: 1465710

   Failed Read Operations: 4

   Failed Write Operations: 3072

   Failed Reserve Operations: 0
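
These counters can be pulled per device from the ESXi shell. A sketch for the two devices above (device IDs masked as in the rest of this post):

# Per-device I/O statistics, including the failed command/block counters quoted above
esxcli storage core device stats get -d naa.60000970000197800352533****33933
esxcli storage core device stats get -d naa.600009700001978003****3030333031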

 

VOBD Logs:

 

  • From the VOBD logs we can see that there is a latency issue with the device naa.60000970000197800352533****33933, on which the VM currently resides. Note that these warnings date back to November 2018, well before the reported issue, so the latency on this device was long-standing:

 

2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642112672980us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.

2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642174744579us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.

2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642132785017us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.

2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642194857079us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.

2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642152808128us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.

2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642214880632us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.
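
These events are recorded in /var/log/vobd.log on the host. A quick way to pull every latency event for the affected device:

# Extract all high-latency events for the device from the VOB daemon log
grep "io.latency.high" /var/log/vobd.log | grep "naa.60000970000197800352533****33933"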

 

 

As per the VMware KB article https://kb.vmware.com/kb/2007236:

 

  • The device latency may increase due to one of these reasons:
    • Changes made on the target
    • Disk or media failures
    • Overload conditions on the device
    • Failover

 

  • While reviewing the vmkernel logs with reference to this NAA ID, we can see the events below:

 

2019-01-20T21:02:31.979Z cpu61:65868)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439d8ac03600, 3601728) to dev "naa.60000970000197800352533****33933" on path "vmhba1:C0:T1:L14" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE

2019-01-20T21:02:31.979Z cpu61:65868)ScsiDeviceIO: 2927: Cmd(0x439d8ac03600) 0x8a, CmdSN 0x27 from world 3601728 to dev "naa.60000970000197800352533****33933" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.

2019-01-20T21:02:54.126Z cpu37:66119)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439980cf5ec0, 3947495) to dev "naa.60000970000197800353533****24433" on path "vmhba1:C0:T3:L5" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE

2019-01-20T21:02:54.126Z cpu37:66119)ScsiDeviceIO: 2927: Cmd(0x439980cf5ec0) 0x8a, CmdSN 0x5d from world 3947495 to dev "naa.60000970000197800353533****24433" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.

2019-01-20T21:03:25.503Z cpu67:65874)ScsiDeviceIO: 2927: Cmd(0x439d81515b40) 0x2a, CmdSN 0xe6 from world 4010505 to dev "naa.60000970000197800352533****33933" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
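
To gauge how widespread these aborts are, the vmkernel log can be filtered on the sense data (paths assume the ESXi defaults; on a live case also check the rotated vmkernel logs):

# List every command failing with ABORTED COMMAND / INTERNAL TARGET FAILURE sense data
grep "0xb 0x44 0x0" /var/log/vmkernel.log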

 

 

This SCSI status code translates to:

 

Host Status = H:0x0 = OK

Device Status = D:0x2 = CHECK CONDITION

Plugin Status = P:0x0 = OK

 

Valid sense data: 0xb 0x44 0x0.

 

Sense Key = 0xb = ABORTED COMMAND (the abort is reported in the sense key, i.e. it comes from the storage hardware)

Additional Sense Code/ASC Qualifier = 0x44/0x0 = INTERNAL TARGET FAILURE

 

 

  • We are also getting the events below:

 

2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.6000097000019780035253****34132 repeated 1 times

2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.60000970000197800352533****33933 repeated 1 times

2019-01-19T19:39:39.448Z cpu72:2192705)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60000970000197800353533****24433" state in doubt; requested fast path state update...

 

These messages appear when the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can take longer than the timeout period for several reasons, including the following (a quick adapter inventory sketch follows the list):

  • Array backup operations (LUN backup, replication, etc)
  • General overload on the array
  • Read/Write Cache on the array (misconfiguration, lack of cache, etc)
  • Incorrect tiered storage used (SATA over SCSI)
  • Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
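
If fabric or HBA issues are suspected, an inventory of the host's adapters and drivers is a useful starting point when engaging the vendor:

# List HBAs with their drivers (check driver/firmware currency against the vendor's compatibility list)
esxcli storage core adapter list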

 

 

  • This might not be directly associated with our issue, but we are also getting warnings about external (non-vSphere) I/O activity on the destination datastore A**_VS-***-EXT_00393P02_LUN23, which suggests the backing LUN is shared with workloads outside this environment:

 

2019-01-19T18:10:16.833Z: [UserLevelCorrelator] 7664302537404us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore A**_VS-***-EXT_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.

2019-01-20T04:42:42.020Z: [UserLevelCorrelator] 7702247724518us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore A**_VS-***-EXT_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.

 

For more information please refer to: https://kb.vmware.com/s/article/1020651

 

 

 

Summary:

 

  • From the above logs we can conclude that the issue started after the VM was moved to another datastore. This could happen either because of a hardware issue or because that specific datastore is not able to handle the I/O load generated by the VM.

 

What Next:

 

  • I would recommend checking with your storage/hardware vendor regarding this issue.

 

Questions for the Vendor:

 

  • Why is the datastore not able to handle the I/O load generated by the VM?
  • Why is the HBA aborting commands?
  • Why is the Failed Blocks Written value so high?

 

Ashutosh Dixit

