RCA-2: Space Issue Triggering Storage vMotion

Host Name: HQIDABABCDE

 

Reported Time of the Issue: January 19, 2019 at 16:34

 

VMs on which the issue was reported: abcdefghij04 and abcdefghij09. (This analysis is based on VM abcdefghij09, as the log bundle VMware-vCenter-support-2019-01-21@03-01-45.zip is corrupted.)

 

 

Hostd Logs:

 

  • From the hostd logs we can see that the issue initially started with low disk space for the VM abcdefghij09 on host HQIDABABCDE:

 

2019-01-19T21:02:20.912Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcdefghij09/abcdefghij09.vmx] Answered question 48975048
2019-01-19T21:02:21.717Z info hostd[12307B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcdefghij09/abcdefghij09.vmx] Upgrade is required for virtual machine, version: 10
2019-01-19T21:02:46.046Z warning hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcdefghij09/abcdefghij09.vmx] Failed to find activation record, event user unknown.
2019-01-19T21:02:46.046Z info hostd[120C6B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 7463 : Message on abcdefghij09 on HQIDABABCDE.phx.aexp.com in ha-datacenter: There is no more space for virtual disk 'abcdefghij09-000001.vmdk'. You might be able to continue this session by freeing disk space on the relevant volume, and clicking _Retry. Click Cancel to terminate this session.
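
The out-of-space question raised to the VM can be located in the hostd log on the host itself (a minimal sketch, assuming the default ESXi log path):

grep -i "no more space" /var/log/hostd.log
# The surrounding lines identify the affected VMDK and the Retry/Cancel question shown above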

  • After this, the VM underwent a storage migration, which can be seen in the events below with migrateType = 2.
  • This migration was initiated by the user ABC\kpabc2.

 

  • From the logs below we can see that the storage migration was initiated from /vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee to /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee, where the VM is currently residing:

 

2019-01-19T23:02:16.060Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-a0-f79b user=vpxuser:ABC\kpabc2] Completed scheduling PrepareSourceEx [2493027353303613693].
2019-01-19T23:02:16.068Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcdefghij09/abcdefghij09.vmx] VigorMigrateNotifyCb:: hostlog state changed from none to emigrating
2019-01-19T23:02:16.076Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcdefghij09/abcdefghij09.vmx] VMotionStatusCb [2493027353303613693] : Prepare task completed successfully
2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ABC\kpabc2] PrepareDestinationEx [2493027353303613693]
2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotionDst (2493027353303613693) opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ABC\kpabc2] VMotionEntry: migrateType = 2
2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ABC\kpabc2] Completed PrepareDestinationEx [2493027353303613693]
2019-01-19T23:02:16.108Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] InitiateDestination [2493027353303613693], VM = '/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcdefghij09/abcdefghij09.vmx'
2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcdefghij09/abcdefghij09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] State Transition (VM_STATE_INITIALIZING -> VM_STATE_OFF)
2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcdefghij09/abcdefghij09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] Initialized virtual machine.
2019-01-19T23:02:16.148Z info hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcdefghij09/abcdefghij09.vmx] Skip a duplicate transition to: VM_STATE_OFF
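
All entries for this migration can be pulled from the hostd log using the operation ID visible above (a sketch; the opID and log path are taken from these entries):

grep "68dd81e8-fb51-408c-a2e8-4d05f94f5a46" /var/log/hostd.log
# The VMotionEntry line with migrateType = 2 is the Storage vMotion indicator discussed above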

Vmware.log:

 

  • From the vmware.log we can see that the migration was initiated for the virtual machine and completed successfully.

 

2019-01-19T23:02:16.375Z| vmx| I125: Received migrate 'from' request for mid id 2493027353303613693, src ip <127.0.0.1>.
2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetInfo: state=8 srcIp=<127.0.0.1> dstIp=<127.0.0.1> mid=2493027353303613693 uuid=0000002d-0000-0000-0000-54494e53004d priority=low
2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 8.

2019-01-19T23:02:16.513Z| vmx| I125: MigrateSetState: Transitioning from state 8 to 9.
2019-01-19T23:02:16.555Z| vmx| I125: MigrateSetState: Transitioning from state 9 to 10.
2019-01-20T02:36:40.724Z| vcpu-0| I125: MigrateSetState: Transitioning from state 12 to 0.
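
The migration's state machine can be traced end to end from the VM's own log (a sketch; vmware.log sits in the VM's working directory on the destination datastore):

grep "MigrateSetState" /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcdefghij09/vmware.log
# The final transition back to state 0 (at 02:36:40 above) marks the completion of the migration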

  • To understand the underlying storage, we next identified the NAA IDs associated with the VMFS datastores, as shown in the table below.
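
This mapping can be produced on the host with the VMFS extent listing (a sketch of the lookup; the table below shows the resulting mapping for the two datastores involved):

esxcli storage vmfs extent list
# Maps each VMFS volume name and UUID to its backing naa.* device and partition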

 

Volume Name                          VMFS UUID                             Device Name                           Partition
ABC_VS-Dev-ABC_02F8-F9P01_LUN15      5bd0de64-6c843b76-54eb-0025b552a0ee   naa.60000970000197800353533030324638  1
ABC_VS-Dev-ABC_02F8-F9P01_LUN15      5bd0de64-6c843b76-54eb-0025b552a0ee   naa.60000970000197800353533030324639  1
ABC_VS-Dev-ABC_00393P02_LUN23        5bd0e096-786144d4-b713-0025b552a0ee   naa.60000970000197800352533030333933  1
ABC_VS-Dev-ABC_00393P02_LUN23        5bd0e096-786144d4-b713-0025b552a0ee   naa.60000970000197800353533030333031  1

 

 

  • From the device statistics below we can see that there is a large number of failed blocks written:

 

 

   Device: naa.60000970000197800352533030333933
   Successful Commands: 199708637
   Blocks Read: 7373684931
   Blocks Written: 7434662607
   Read Operations: 98014681
   Write Operations: 100199226
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 6618
   Failed Blocks Read: 133
   Failed Blocks Written: 817351
   Failed Read Operations: 5
   Failed Write Operations: 2156
   Failed Reserve Operations: 0

   Device: naa.60000970000197800353533030333031
   Successful Commands: 90638039
   Blocks Read: 2026385100
   Blocks Written: 5615288608
   Read Operations: 24366245
   Write Operations: 65636791
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 3724
   Failed Blocks Read: 138
   Failed Blocks Written: 1465710
   Failed Read Operations: 4
   Failed Write Operations: 3072
   Failed Reserve Operations: 0
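
These counters can be pulled live for each device on the host (a sketch; the same statistics are also captured in the vm-support bundle):

esxcli storage core device stats get -d naa.60000970000197800352533030333933
# Compare Failed Blocks Written against total Blocks Written to gauge the write-failure rate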

 

VOBD Logs:

 

  • From the VOBD logs we can see that there is an issue with the device naa.60000970000197800352533030333933, on which the VM is currently residing:

 

2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642112672980us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533030333933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.
2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642174744579us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533030333933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.
2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642132785017us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533030333933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.
2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642194857079us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533030333933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.
2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642152808128us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533030333933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.
2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642214880632us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533030333933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.
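
The latency warnings can be isolated for this device directly on the host (a sketch, assuming the default VOBD log path):

grep "naa.60000970000197800352533030333933" /var/log/vobd.log | grep -i latency
# Each pair of vob./esx.problem. entries records a jump from the average latency baseline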

 

As per the KB article https://kb.vmware.com/kb/2007236:

 

  • The device latency may increase due to one of these reasons:
    • Changes made on the target
    • Disk or media failures
    • Overload conditions on the device
    • Failover

 

  • While reviewing the vmkernel logs with reference to this NAA ID, we can see the events below:

 

2019-01-20T21:02:31.979Z cpu61:65868)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439d8ac03600, 3601728) to dev "naa.60000970000197800352533030333933" on path "vmhba1:C0:T1:L14" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE
2019-01-20T21:02:31.979Z cpu61:65868)ScsiDeviceIO: 2927: Cmd(0x439d8ac03600) 0x8a, CmdSN 0x27 from world 3601728 to dev "naa.60000970000197800352533030333933" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
2019-01-20T21:02:54.126Z cpu37:66119)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439980cf5ec0, 3947495) to dev "naa.60000970000197800353533030324433" on path "vmhba1:C0:T3:L5" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE
2019-01-20T21:02:54.126Z cpu37:66119)ScsiDeviceIO: 2927: Cmd(0x439980cf5ec0) 0x8a, CmdSN 0x5d from world 3947495 to dev "naa.60000970000197800353533030324433" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
2019-01-20T21:03:25.503Z cpu67:65874)ScsiDeviceIO: 2927: Cmd(0x439d81515b40) 0x2a, CmdSN 0xe6 from world 4010505 to dev "naa.60000970000197800352533030333933" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.

 

 

These SCSI codes translate to:

 

Host Status = H:0x0 = OK

Device Status = D:0x2 = CHECK CONDITION

Plugin Status = P:0x0 = OK

 

Valid sense data: 0xb 0x44 0x0.

 

Sense Key = 0xb = ABORTED COMMAND (the abort in the sense key is coming from the hardware)

Additional Sense Code/ASC Qualifier = 0x44/0x0 = INTERNAL TARGET FAILURE
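
All such errors for the affected device, along with its current path state, can be gathered on the host (a sketch; the log path and device name are taken from the entries above):

grep "naa.60000970000197800352533030333933" /var/log/vmkernel.log
esxcli storage nmp device list -d naa.60000970000197800352533030333933
# The NMP listing shows the owning plugin, path selection policy, and working paths for the device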

 

 

  • We are also seeing the events below:
2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.60000970000197800352533030334132 repeated 1 times
2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.60000970000197800352533030333933 repeated 1 times
2019-01-19T19:39:39.448Z cpu72:2192705)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60000970000197800353533030324433" state in doubt; requested fast path state update...


These messages appear when the Host Bus Adapter (HBA) driver aborts a command because it took longer than the timeout period of 5 seconds to complete. An operation can exceed the timeout period for several reasons, including:

  • Array backup operations (LUN backup, replication, etc)
  • General overload on the array
  • Read/Write Cache on the array (misconfiguration, lack of cache, etc)
  • Incorrect tiered storage used (SATA over SCSI)
  • Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)

 

 

  • This might not be associated with our issue, but we are also getting warnings regarding external I/O activity on the destination datastore:

 

2019-01-19T18:10:16.833Z: [UserLevelCorrelator] 7664302537404us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore ABC_VS-Dev-ABC_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.
2019-01-20T04:42:42.020Z: [UserLevelCorrelator] 7702247724518us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore ABC_VS-Dev-ABC_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.


For more information please refer to: https://kb.vmware.com/s/article/1020651

 

 

Summary:

 

  • From the above logs we can conclude that the issue started after the VM was moved to another datastore. This could happen either because of a hardware issue or because that specific datastore is not able to handle the I/O load being generated by the VM.

 

What Next:

 

  • I would recommend checking with your hardware vendor regarding this issue.

 

Ask from the Vendor:

 

  • Why is the datastore not able to handle the I/O load of the VM?
  • Why is the HBA aborting the commands?
  • Why is the Failed Blocks Written value so high?

