Host Name: HQI*****EFG
Reported Time of the Issue: January 19, 2019 at 16:34
VMs on which the issue was reported: abcdefghij04 & abcd*****j09 (this analysis is based on VM abcd*****j09).
Hostd:
- From the hostd logs we can see that the issue initially started with low disk space for the VM abcd*****j09 on host HQI*****EFG (a quick free-space check is noted after the excerpt):
2019-01-19T21:02:20.912Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Answered question 48975048
2019-01-19T21:02:21.717Z info hostd[12307B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Upgrade is required for virtual machine, version: 10
2019-01-19T21:02:46.046Z warning hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Failed to find activation record, event user unknown.
2019-01-19T21:02:46.046Z info hostd[120C6B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 7463 : Message on abcd*****j09 on HQI*****EFG.phx.aexp.com in ha-datacenter: There is no more space for virtual disk ‘abcd*****j09-000001.vmdk’. You might be able to continue this session by freeing disk space on the relevant volume, and clicking _Retry. Click Cancel to terminate this session.
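- A quick way to confirm the remaining free space on the affected datastore from the ESXi shell is the command below (illustrative, not taken from the case data; it lists size and free space for every mounted volume, including the affected datastore):
esxcli storage filesystem list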
- After this, the VM underwent a storage migration, which can be seen in the events below with migrateType = 2.
- This migration was initiated by the user ADS\kp***2.
- From the logs below we can see that the storage migration moved the VM from /vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee to /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee, where it currently resides (a filter to isolate these entries in hostd.log is noted after the excerpt):
2019-01-19T23:02:16.060Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-a0-f79b user=vpxuser:ADS\kp***2] Completed scheduling PrepareSourceEx [2493027353303613693].
2019-01-19T23:02:16.068Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] VigorMigrateNotifyCb:: hostlog state changed from none to emigrating
2019-01-19T23:02:16.076Z info hostd[11580B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0de64-6c843b76-54eb-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] VMotionStatusCb [2493027353303613693] : Prepare task completed successfully
2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] PrepareDestinationEx [2493027353303613693]
2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotionDst (2493027353303613693) opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] VMotionEntry: migrateType = 2
2019-01-19T23:02:16.092Z info hostd[118C2B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-3-f7a3 user=vpxuser:ADS\kp***2] Completed PrepareDestinationEx [2493027353303613693]
2019-01-19T23:02:16.108Z info hostd[11F40B70] [Originator@6876 sub=Vcsvc.VMotion opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] InitiateDestination [2493027353303613693], VM = ‘/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx’
2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] State Transition (VM_STATE_INITIALIZING -> VM_STATE_OFF)
2019-01-19T23:02:16.148Z info hostd[11C40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx opID=68dd81e8-fb51-408c-a2e8-4d05f94f5a46-4872723-h5c:70104055-70-02-8b-f7a8 user=vpxuser:vpxuser] Initialized virtual machine.
2019-01-19T23:02:16.148Z info hostd[120C6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/abcd*****j09.vmx] Skip a duplicate transition to: VM_STATE_OFF
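- Note: all of the vMotion entries above share the migration ID 2493027353303613693, so they can be isolated with a simple filter on the host (illustrative; the default hostd log path is assumed and rotated logs such as hostd.*.gz may also need to be checked):
grep 2493027353303613693 /var/run/log/hostd.log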
Vmware.log:
- From the vmware.log we can see that the migration was initiated for the virtual machine and completed successfully (a filter to extract these state transitions is noted after the excerpt):
2019-01-19T23:02:16.375Z| vmx| I125: Received migrate ‘from’ request for mid id 2493027353303613693, src ip <127.0.0.1>.
2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetInfo: state=8 srcIp=<127.0.0.1> dstIp=<127.0.0.1> mid=2493027353303613693 uuid=0000002d-0000-0000-0000-54494e53004d priority=low
2019-01-19T23:02:16.376Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 8.
2019-01-19T23:02:16.513Z| vmx| I125: MigrateSetState: Transitioning from state 8 to 9.
2019-01-19T23:02:16.555Z| vmx| I125: MigrateSetState: Transitioning from state 9 to 10.
2019-01-20T02:36:40.724Z| vcpu-0| I125: MigrateSetState: Transitioning from state 12 to 0.
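- The MigrateSetState transitions above come from the VM's own log; an illustrative way to extract them, assuming the VM is still at the destination path seen in the hostd entries:
grep MigrateSetState /vmfs/volumes/5bd0e096-786144d4-b713-0025b552a0ee/abcd*****j09/vmware.log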
- To understand the underlying storage, we mapped the VMFS datastores involved to their associated NAA device IDs (a command to reproduce this mapping is noted below the table):
Volume Name | VMFS UUID | Device Name | Partition |
A**_VS-***-EXT_02F8-F9P01_LUN15 | 5bd0de64-6c843b76-54eb-0025b552a0ee | naa.6000097000019780035353303****638 | 1 |
A**_VS-***-EXT_02F8-F9P01_LUN15 | 5bd0de64-6c843b76-54eb-0025b552a0ee | naa.600009700001978003535****0324639 | 1 |
A**_VS-***-EXT_00393P02_LUN23 | 5bd0e096-786144d4-b713-0025b552a0ee | naa.60000970000197800352533****33933 | 1 |
A**_VS-***-EXT_00393P02_LUN23 | 5bd0e096-786144d4-b713-0025b552a0ee | naa.600009700001978003****3030333031 | 1 |
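- The datastore-to-device mapping above is the type of output produced by the extent listing command below (shown for reference; it can be re-run on the host to confirm the mapping):
esxcli storage vmfs extent list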
- From the device statistics below we can see a large number of failed blocks written (a command to collect these statistics is noted after the output):
Device: naa.60000970000197800352533****33933
Successful Commands: 199708637
Blocks Read: 7373684931
Blocks Written: 7434662607
Read Operations: 98014681
Write Operations: 100199226
Reserve Operations: 0
Reservation Conflicts: 0
Failed Commands: 6618
Failed Blocks Read: 133
Failed Blocks Written: 817351
Failed Read Operations: 5
Failed Write Operations: 2156
Failed Reserve Operations: 0
Device: naa.600009700001978003****3030333031
Successful Commands: 90638039
Blocks Read: 2026385100
Blocks Written: 5615288608
Read Operations: 24366245
Write Operations: 65636791
Reserve Operations: 0
Reservation Conflicts: 0
Failed Commands: 3724
Failed Blocks Read: 138
Failed Blocks Written: 1465710
Failed Read Operations: 4
Failed Write Operations: 3072
Failed Reserve Operations: 0
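- The per-device statistics above can be re-collected with the command below (illustrative; the device ID is masked in this report, so the full NAA ID must be substituted):
esxcli storage core device stats get -d naa.60000970000197800352533****33933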
VOBD Logs:
- From the vobd logs we can see that there is a latency issue with the device naa.60000970000197800352533****33933, on which the VM is currently residing (a filter for these warnings is noted after the excerpt):
2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642112672980us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.
2018-11-22T15:08:09.040Z: [scsiCorrelator] 2642174744579us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1491 microseconds to 134320 microseconds.
2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642132785017us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.
2018-11-22T15:08:29.152Z: [scsiCorrelator] 2642194857079us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1506 microseconds to 283556 microseconds.
2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642152808128us: [vob.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.
2018-11-22T15:08:49.176Z: [scsiCorrelator] 2642214880632us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000197800352533****33933 performance has deteriorated. I/O latency increased from average value of 1507 microseconds to 584132 microseconds.
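- These latency warnings can be pulled out of the vobd log with a simple filter (illustrative; the default log path is assumed):
grep -i scsi.device.io.latency /var/run/log/vobd.log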
As per the Article: https://kb.vmware.com/kb/2007236
- The device latency may increase due to one of these reasons:
- Changes made on the target
- Disk or media failures
- Overload conditions on the device
- Failover
- While reviewing the vmkernel logs with reference to the NAA IDs, we can see the events below (a filter to isolate these errors is noted after the sense-code breakdown):
2019-01-20T21:02:31.979Z cpu61:65868)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439d8ac03600, 3601728) to dev “naa.60000970000197800352533****33933” on path “vmhba1:C0:T1:L14” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE
2019-01-20T21:02:31.979Z cpu61:65868)ScsiDeviceIO: 2927: Cmd(0x439d8ac03600) 0x8a, CmdSN 0x27 from world 3601728 to dev “naa.60000970000197800352533****33933” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
2019-01-20T21:02:54.126Z cpu37:66119)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x8a (0x439980cf5ec0, 3947495) to dev “naa.60000970000197800353533****24433” on path “vmhba1:C0:T3:L5” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0. Act: NONE
2019-01-20T21:02:54.126Z cpu37:66119)ScsiDeviceIO: 2927: Cmd(0x439980cf5ec0) 0x8a, CmdSN 0x5d from world 3947495 to dev “naa.60000970000197800353533****24433” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
2019-01-20T21:03:25.503Z cpu67:65874)ScsiDeviceIO: 2927: Cmd(0x439d81515b40) 0x2a, CmdSN 0xe6 from world 4010505 to dev “naa.60000970000197800352533****33933” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xb 0x44 0x0.
This SCSI Code translates to:
Host Status = H:0x0 = OK
Device Status = D:0x2 = CHECK CONDITION
Plugin Status = P:0x0 = OK
Valid sense data: 0xb 0x44 0x0.
Sense Key = 0xb = ABORTED COMMAND (the abort is being reported in the sense data coming back from the storage hardware)
Additional Sense Code/ASC Qualifier = 0x44/0x0 = INTERNAL TARGET FAILURE
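- To isolate further failed commands for the affected device, the vmkernel log can be filtered as below (illustrative; substitute the full NAA ID for the masked value and include rotated vmkernel.*.gz files if needed):
grep "naa.60000970000197800352533****33933" /var/run/log/vmkernel.log | grep -i failed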
- We are also getting the events below; commands to review the path state of the affected device are noted after the excerpt:
2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.6000097000019780035253****34132 repeated 1 times
2019-01-20T21:03:25.708Z cpu25:66370)NMP: nmp_ResetDeviceLogThrottling:3348: last error status from device naa.60000970000197800352533****33933 repeated 1 times
2019-01-19T19:39:39.448Z cpu72:2192705)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device “naa.60000970000197800353533****24433” state in doubt; requested fast path state update…
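- The "state in doubt" message relates to path behaviour, so the multipathing and path state of that device can be reviewed with commands like the ones below (illustrative; substitute the full NAA ID):
esxcli storage nmp device list -d naa.60000970000197800353533****24433
esxcli storage core path list -d naa.60000970000197800353533****24433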
- As per the Article: KB1022026
These messages appear when the Host Bus Adapter (HBA) driver aborts a command because it took longer than the timeout period of 5 seconds to complete. An operation can exceed the timeout period for several reasons, including:
- Array backup operations (LUN backup, replication, etc)
- General overload on the array
- Read/Write Cache on the array (misconfiguration, lack of cache, etc)
- Incorrect tiered storage used (SATA over SCSI)
- Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
- This might not be associated with our issue, but we are also getting warnings regarding external I/O activity on the destination datastore (a filter for these warnings is noted after the KB reference):
2019-01-19T18:10:16.833Z: [UserLevelCorrelator] 7664302537404us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore A**_VS-***-EXT_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.
2019-01-20T04:42:42.020Z: [UserLevelCorrelator] 7702247724518us: [esx.problem.iorm.nonviworkload] An external IO activity is detected on datastore A**_VS-***-EXT_00393P02_LUN23, this is an unsupported configuration. Please consult IO Resource Management Documentation or follow the Ask VMware link for more information.
For more information please refer to: https://kb.vmware.com/s/article/1020651
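- How frequently this external-workload warning is raised can be checked with the filter below (illustrative; the default log path is assumed), and the Storage I/O Control configuration of the datastore should also be reviewed in vCenter:
grep -i nonviworkload /var/run/log/vobd.log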
Summary:
- From the above logs we can conclude that the issue started occurring after the VM was moved to another datastore. This could happen either because of a hardware issue or because that specific datastore is not able to handle the I/O load being generated by the VM.
What Next:
- I would recommend checking with your hardware vendor regarding this issue.
Ask from the Vendor:
- Why is the datastore not able to handle the I/O load of the VM?
- Why is the HBA aborting the commands?
- Why is the Failed Blocks Written value so high?