RCA-5: vMotion Logging and Storage Latency on vCenter Infrastructure

Host Name: HQI*****EFG

 

Reported Time of the Issue: January 19, 2019 at 16:34

 

VMs on which the issue was reported: abcdefghij04 & abcd*****j09. (This analysis is based on VM abcd*****j09.)

 

 

Hostd:

 

  • From the hostd logs we can see that the issue initially started with low disk space for the VM abcd*****j09 on host HQI*****EFG:

 

2019-01-19T21:02:20.912Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Answered question 48975048

2019-01-19T21:02:21.717Z info hostd[12307B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Upgrade is required for virtual machine, version: 10

2019-01-19T21:02:46.046Z warning hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Failed to find activation record, event user unknown.

2019-01-19T21:02:46.046Z info hostd[120C6B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 7463 : Message on abcd*****j09 on HQI*****EFG.phx.aexp.com in ha-datacenter: There is no more space for virtual disk 'abcd*****j09-000001.vmdk'. You might be able to continue this session by freeing disk space on the relevant volume, and clicking _Retry. Click Cancel to terminate this session.
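
When this condition is hit, the first thing to verify is free capacity on the affected volume. A minimal check from the ESXi shell (the volume UUID is taken from the log lines above):

# List all mounted filesystems with their size and free space
esxcli storage filesystem list

# Or check the specific VMFS volume directly by its UUID
df -h /vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee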

 

  • After this, the VM underwent a storage migration, which can be seen in the events below (migrateType = 2).
  • The migration was initiated by the user ADS\kp***2.

 

  • From the logs below we can see that the storage migration moved the VM from /vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee to /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee, where the VM currently resides:

 

2019-01-19T23:02:16.060Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-a0-f79b user=vpxuser:ADS\kp***2] Completed scheduling PrepareSourceEx [2493027353303613693].

2019-01-19T23:02:16.068Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] VigorMigrateNotifyCb:: hostlog state changed from none to emigrating

2019-01-19T23:02:16.076Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] VMotionStatusCb [2493027353303613693] : Prepare task completed successfully

2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] PrepareDestinationEx [2493027353303613693]

2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotionDst (2493027353303613693) opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] VMotionEntry: migrateType = 2

2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] Completed PrepareDestinationEx [2493027353303613693]

2019-01-19T23:02:16.108Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] InitiateDestination [2493027353303613693], VM = '/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx'

 

2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] State Transition (VM_STATE_INITIALIZING -> VM_STATE_OFF)

 

2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] Initialized virtual machine.

2019-01-19T23:02:16.148Z info hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Skip a duplicate transition to: VM_STATE_OFF
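
As a side note, all of these hostd entries carry the same opID, so the entire migration flow can be traced by filtering on it. A simple sketch, assuming the default ESXi log location:

# Trace the whole migration operation across hostd.log using its opID
grep "68dd81e8-fb51-408c-a2e8-4d05f94f5a46" /var/log/hostd.log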

 

 

 

vmware.log:

 

  • From the vmware.log we can see that the migration was initiated for the virtual machine and completed successfully about three and a half hours later (23:02 UTC on January 19 to 02:36 UTC on January 20):

 

2019-01-19T23:02:16.375Z| vmx| I125: Received migrate 'from' request for mid id 2493027353303613693, src ip <127.0.0.1>.

2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetInfo: state=8 srcIp=<127.0.0.1> dstIp=<127.0.0.1> mid=2493027353303613693 uuid=0000002d-0000-0000-0000-54494e53004d priority=low

2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 8.

 

2019-01-19T23:02:16.513Z| vmx| I125: MigrateSetState: Transitioning from state 8 to 9.

2019-01-19T23:02:16.555Z| vmx| I125: MigrateSetState: Transitioning from state 9 to 10.

2019-01-20T02:36:40.724Z| vcpu-0| I125: MigrateSetState: Transitioning from state 12 to 0.
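
The same state machine can also be followed end to end from the VM's own vmware.log in its working directory on the destination datastore (path built from the destination .vmx path above):

# Follow the migration state transitions recorded by the VMX process
grep "MigrateSetState" /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/vmware.log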

 

 

  • To understand the underlying storage, we next identified the NAA IDs associated with the VMFS datastores (see the table and the command sketch below):

 

Volume Name                        VMFS UUID                            Device Name                           Partition
A**_VS-***-EXT_02F8-F9P01_LUN15    5bd0de64-6c843b76-54eb-0025b552a0ee  naa.6000097000019780035353303****638  1
A**_VS-***-EXT_02F8-F9P01_LUN15    5bd0de64-6c843b76-54eb-0025b552a0ee  naa.600009700001978003535****0324639  1
A**_VS-***-EXT_00393P02_LUN23      5bd0e096-786144d4-b713-0025b552a0ee  naa.60000970000197800352533****33933  1
A**_VS-***-EXT_00393P02_LUN23      5bd0e096-786144d4-b713-0025b552a0ee  naa.600009700001978003****3030333031  1
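
This mapping can be reproduced on the host with the extent list command; note that each datastore above spans two extents, hence two NAA devices per VMFS UUID:

# Map each VMFS volume to its backing device(s); multi-extent volumes show one row per extent
esxcli storage vmfs extent list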

 

 

  • From the device statistics below we can see a large number of failed blocks written on both devices backing the destination datastore:

 

 

   Device: naa.60000970000197800352533****33933

   Successful Commands: 199708637

   Blocks Read: 7373684931

   Blocks Written: 7434662607

   Read Operations: 98014681

   Write Operations: 100199226

   Reserve Operations: 0

   Reservation Conflicts: 0

   Failed Commands: 6618

   Failed Blocks Read: 133

   Failed Blocks Written: 817351

   Failed Read Operations: 5

   Failed Write Operations: 2156

   Failed Reserve Operations: 0

 

 

   Device: naa.600009700001978003****3030333031

   Successful Commands: 90638039

   Blocks Read: 2026385100

   Blocks Written: 5615288608

   Read Operations: 24366245

   Write Operations: 65636791

   Reserve Operations: 0

   Reservation Conflicts: 0

   Failed Commands: 3724

   Failed Blocks Read: 138

   Failed Blocks Written: 1465710

   Failed Read Operations: 4

   Failed Write Operations: 3072

   Failed Reserve Operations: 0
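
These counters can be pulled per device from the ESXi shell. A sketch for the two devices above (device IDs masked as in the rest of this post):

# Per-device I/O statistics, including the failed command/block counters quoted above
esxcli storage core device stats get -d naa.60000970000197800352533****33933
esxcli storage core device stats get -d naa.600009700001978003****3030333031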

 

VOBD Logs:

 

  • From the VOBD logs we can see that there is a latency issue with the device naa.60000970000197800352533****33933, on which the VM currently resides. Note that these warnings date back to November 2018, well before the reported issue, so the latency on this device was long-standing:

 

2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642112672980us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.

2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642174744579us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.

2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642132785017us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.

2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642194857079us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.

2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642152808128us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.

2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642214880632us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.
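
These events are recorded in /var/log/vobd.log on the host. A quick way to pull every latency event for the affected device:

# Extract all high-latency events for the device from the VOB daemon log
grep "io.latency.high" /var/log/vobd.log | grep "naa.60000970000197800352533****33933"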

 

 

As per the VMware KB article https://kb.vmware.com/kb/2007236:

 

  • The device latency may increase due to one of these reasons:
    • Changes made on the target
    • Disk or media failures
    • Overload conditions on the device
    • Failover

 

  • While reviewing the vmkernel logs with reference to this NAA ID, we can see the events below:

 

2019-01-20T21:02:31.979Z cpu61:65868)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439d8ac03600, 3601728) to dev "naa.60000970000197800352533****33933" on path "vmhba1:C0:T1:L14" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE

2019-01-20T21:02:31.979Z cpu61:65868)ScsiDeviceIO: 2927: Cmd(0x439d8ac03600) 0x8a, CmdSN 0x27 from world 3601728 to dev "naa.60000970000197800352533****33933" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.

2019-01-20T21:02:54.126Z cpu37:66119)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439980cf5ec0, 3947495) to dev "naa.60000970000197800353533****24433" on path "vmhba1:C0:T3:L5" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE

2019-01-20T21:02:54.126Z cpu37:66119)ScsiDeviceIO: 2927: Cmd(0x439980cf5ec0) 0x8a, CmdSN 0x5d from world 3947495 to dev "naa.60000970000197800353533****24433" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.

2019-01-20T21:03:25.503Z cpu67:65874)ScsiDeviceIO: 2927: Cmd(0x439d81515b40) 0x2a, CmdSN 0xe6 from world 4010505 to dev "naa.60000970000197800352533****33933" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
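
To gauge how widespread these aborts are, the vmkernel log can be filtered on the sense data (paths assume the ESXi defaults; on a live case also check the rotated vmkernel logs):

# List every command failing with ABORTED COMMAND / INTERNAL TARGET FAILURE sense data
grep "0xb 0x44 0x0" /var/log/vmkernel.log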

 

 

This SCSI status code translates to:

 

Host Status = H:0x0 = OK

Device Status = D:0x2 = CHECK CONDITION

Plugin Status = P:0x0 = OK

 

Valid sense data: 0xb 0x44 0x0.

 

Sense Key = 0xb = ABORTED COMMAND (the abort is reported in the sense key, i.e. it comes from the storage hardware)

Additional Sense Code/ASC Qualifier = 0x44/0x0 = INTERNAL TARGET FAILURE

 

 

  • We are also getting the events below:

 

2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.6000097000019780035253****34132 repeated 1 times

2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.60000970000197800352533****33933 repeated 1 times

2019-01-19T19:39:39.448Z cpu72:2192705)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60000970000197800353533****24433" state in doubt; requested fast path state update...

 

These messages appear when the Host Bus Adapter (HBA) driver aborts a command because the command took longer than the timeout period of 5 seconds to complete. An operation can take longer than the timeout period for several reasons, including the following (a quick adapter inventory sketch follows the list):

  • Array backup operations (LUN backup, replication, etc)
  • General overload on the array
  • Read/Write Cache on the array (misconfiguration, lack of cache, etc)
  • Incorrect tiered storage used (SATA over SCSI)
  • Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
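
If fabric or HBA issues are suspected, an inventory of the host's adapters and drivers is a useful starting point when engaging the vendor:

# List HBAs with their drivers (check driver/firmware currency against the vendor's compatibility list)
esxcli storage core adapter list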

 

 

  • This might not be directly associated with our issue, but we are also getting warnings about external (non-vSphere) I/O activity on the destination datastore A**_VS-***-EXT_00393P02_LUN23, which suggests the backing LUN is shared with workloads outside this environment:

 

2019-01-19T18:10:16.833Z: [UserLevelCorrelator] 7664302537404us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore A**_VS-***-EXT_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.

2019-01-20T04:42:42.020Z: [UserLevelCorrelator] 7702247724518us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore A**_VS-***-EXT_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.

 

For more information please refer to: https://kb.vmware.com/s/article/1020651

 

 

 

Summary:

 

  • From the above logs we can conclude that the issue started after the VM was moved to another datastore. This could happen either because of a hardware issue or because that specific datastore is not able to handle the I/O load generated by the VM.

 

What Next:

 

  • I would recommend checking with your storage/hardware vendor regarding this issue.

 

Questions for the Vendor:

 

  • Why is the datastore not able to handle the I/O load generated by the VM?
  • Why is the HBA aborting commands?
  • Why is the Failed Blocks Written value so high?

 

Ashutosh Dixit

