RCA-9:  Guest OS hung due to loss of tools heartbeat timeout

Issue Description:  Guest OS hung due to loss of tools heartbeat timeout.

 

Note: The Guest VM name which created impacted was abcapp216. Host details abcevbkvhp1815 and 1816 (vmotioned) 

Time of the issue is between 12:20 PM AEST to 12:50 AEST Date 2019-04-03.

 

Host Name: VMware ESXi 6.0.0 build-5572656

 

 

Hostd Logs:

 

  • From the Host logs I can see that the vMotion has initiated for the VM abcapp216.

 

2019-04-03T01:33:34.646Z info hostd[61BBBB70] [Originator@6876 sub=Vcsvc.VMotion opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] PrepareSourceEx [1554255214254855], VM = ‘153’
2019-04-03T01:33:34.646Z info hostd[61BBBB70] [Originator@6876 sub=Vcsvc.VMotionSrc (1554255214254855) opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] VMotionEntry: migrateType = 1
2019-04-03T01:33:34.646Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)

 

2019-04-03T01:33:34.648Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] VMotionPrepare: srcMgmtIp=10.139.152.153
2019-04-03T01:33:34.648Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser]
VMotionPrepare: dstMgmtIp=10.139.152.154
2019-04-03T01:33:34.648Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] VMotionPrepare: dstVmDirPath=/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/
2019-04-03T01:33:34.648Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser]
VMotionPrepare: dstVmFileName=abcapp216.vmx
2019-04-03T01:33:34.648Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] SetVirtualDeviceChange: deviceChange # 0
2019-04-03T01:33:34.648Z info hostd[61BBBB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 user=vpxuser] VMotionPrepare (1554255214254855): Sending ‘to’ srcIp=10.139.153.153 dstIp=10.139.153.154, type=1, encrypted=false, remoteThumbprint=6C:A7:AB:5D:0A:E4:E1:20:86:7A:1E:57:64:21:EF:10:3A:47:E4:E2

2019-04-03T01:33:34.679Z info hostd[61BFCB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx] VigorMigrateNotifyCb:: hostlog state changed from none to emigrating
2019-04-03T01:33:34.679Z info hostd[61BFCB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx] VMotionStatusCb [1554255214254855] : Prepare task completed successfully
2019-04-03T01:33:35.165Z info hostd[60F90B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-11-fb2e user=vpxuser] GetWid: got 522646
2019-04-03T01:33:35.166Z info hostd[60F90B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx opID=task-internal-1-47deb2d2-8a-17-3-3b-11-fb2e user=vpxuser] VMotionInitiateSrc (1554255214254855): wid=8039130

  • However the Ongoing vMotion Failed to Complete as it was not able to find any VMtools process running for the VM and a Cleanup operation is initiated.

 

2019-04-03T01:37:26.729Z info hostd[61BBBB70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 3339 : The dvPort 27956 link was down in the vSphere Distributed Switch  in ha-datacenter
2019-04-03T01:37:26.729Z info hostd[61BBBB70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 3340 : The dvPort 27956 was not in passthrough mode in the vSphere Distributed Switch  in ha-datacenter.

2019-04-03T01:37:36.940Z info hostd[5FEC2B70] [Originator@6876 sub=Libs] CnxConnectAuthd: Returning false because CnxAuthdProtoConnect failed
2019-04-03T01:37:36.940Z info hostd[5FEC2B70] [Originator@6876 sub=Libs] Cnx_Connect: Returning false because CnxConnectAuthd failed
2019-04-03T01:37:36.940Z info hostd[5FEC2B70] [Originator@6876 sub=vm:Cnx_Connect: Error message: There is no VMware process running for config file /vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx]
2019-04-03T01:37:36.940Z warning hostd[5FEC2B70] [Originator@6876 sub=vm:VigorTransportClientManageConnection: Failed to re-connect to VM /vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx]

2019-04-03T01:37:36.942Z info hostd[601F2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx] State Transition (VM_STATE_EMIGRATING -> VM_STATE_OFF)
2019-04-03T01:37:36.943Z info hostd[601F2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx] State Transition (VM_STATE_OFF -> VM_STATE_UNREGISTERING)
2019-04-03T01:37:36.944Z warning hostd[601F2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx] Failed to find activation record, event user unknown.

2019-04-03T01:37:36.944Z info hostd[601F2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx] VMotion cleanup and unregistration completed

 

  • VMKernel doesn’t have much information as referenced to OpID:

 

2019-04-03T01:33:34.648Z cpu9:60828 opID=cf2b3d44)World: 15544: VC opID task-internal-1-47deb2d2-8a-17-3-3b-c8-fb21 maps to vmkernel opID cf2b3d44
2019-04-03T01:33:34.648Z cpu9:60828 opID=cf2b3d44)Config: 681: “SIOControlFlag2” = 1, Old Value: 0, (Status: 0x0)

 

Vmware Logs:

 

  • While reviewing the Vmware logs we can see that the Migration is initiated and during state 4 to 5 VM got powered off.

 

2019-04-03T01:33:34.648Z| vmx| I125: MigrateSetInfo: state=1 srcIp=<10.139.153.153> dstIp=<10.139.153.154> mid=1554255214254855 uuid=b5250000-0200-0000-0000-0000000000ae priority=low
2019-04-03T01:33:34.648Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.

 

 

2019-04-03T01:33:35.168Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 2.
2019-04-03T01:33:35.171Z| vmx| I125: MigratePlatformInitMigration:  DiskOp file set to /vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216-diskOp.tmp
2019-04-03T01:33:35.175Z| vmx| I125: DISKLIB-LIB_MISC   : DiskLib_SetOpenFlags: Failed to set open flags: Operation not supported 11.
2019-04-03T01:33:35.188Z| vmx| I125: MigrateSetState: Transitioning from state 2 to 3.
2019-04-03T01:37:26.726Z| vcpu-0| I125: MigrateSetState: Transitioning from state 3 to 4.

2019-04-03T01:37:26.727Z| vcpu-0| I125: Migrate: VM starting stun, waiting 100 seconds for go/no-go message.
2019-04-03T01:37:26.727Z| vcpu-0| I125: Destroying virtual dev for scsi0:0 vscsi=9119
2019-04-03T01:37:26.727Z| vcpu-0| I125: VMMon_VSCSIStopVports: No such target on adapter

2019-04-03T01:37:26.742Z| vcpu-0| I125: Progress 3% (none)
2019-04-03T01:37:26.742Z| vcpu-0| I125: Progress 6% (none)
2019-04-03T01:37:26.743Z| vcpu-0| I125: Progress 9% (none)
2019-04-03T01:37:26.751Z| vcpu-0| I125: Progress 96% (none)
2019-04-03T01:37:26.752Z| vcpu-0| I125: Progress 100% (none)
2019-04-03T01:37:26.754Z| vcpu-0| I125: MigrateSetState: Transitioning from state 4 to 5.
2019-04-03T01:37:26.754Z| vcpu-0| I125: Closing all the disks of the VM.
2019-04-03T01:37:27.252Z| vmx| I125: MigrateSetState: Transitioning from state 5 to 6.

2019-04-03T01:37:27.253Z| vmx| I125: /vmfs/volumes/5ac5d8b8-72b58bf8-8bce-0025b504a039/abcapp216/abcapp216.vmx: lost race to rename /var/run/vmware/98034ceb786a89cce35eea8dca9957c8.1 to /var/run/vmware/98034ceb786a89cce35eea8dca9957c8
2019-04-03T01:37:27.253Z| vmx| I125: Migrate: Powering off
2019-04-03T01:37:27.253Z| vmx| I125: VigorTransport_ServerSendResponse opID=task-internal-1-47deb2d2-8a-17-3-3b-11-fb2e seq=18031: Completed Migrate request.
2019-04-03T01:37:27.253Z| vmx| I125: Stopping VCPU threads…

2019-04-03T01:37:27.261Z| vmx| I125: MigrateSetState: Transitioning from state 6 to 0.
2019-04-03T01:37:27.261Z| vmx| I125: Migrate: Final status reported through VMDB.
2019-04-03T01:37:27.261Z| vmx| I125: SVMotion_PowerOff: Not running Storage vMotion. Nothing to do
2019-04-03T01:37:27.261Z| vmx| I125: Vix: [522645 mainDispatch.c:1188]: VMAutomationPowerOff: Powering off.
2019-04-03T01:37:27.262Z| vmx| I125: WORKER: asyncOps=12 maxActiveOps=3 maxPending=2 maxCompleted=0
2019-04-03T01:37:36.874Z| vmx| I125: Vix: [522645 mainDispatch.c:4292]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2019-04-03T01:37:36.874Z| vmx| I125: Vix: [522645 mainDispatch.c:4311]: VMAutomation: Ignoring ReportPowerOpFinished because the VMX is shutting down.
2019-04-03T01:37:36.874Z| vmx| I125: Vix: [522645 mainDispatch.c:4292]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0

2019-04-03T01:37:36.885Z| vmx| W115: VMX has left the building: 0.

 

However the Migration was successful and the VM moved to Host abcevbkvhp1816

 

Conclusion:

 

  • From the logs we can see that the vMotion was successful however the VM was probably in hung state as the logs inside the VM was also not conclusive or have events missing.
  • From the Host end we were not able to find any issues as the vMotion seems to be completed and the there is no other errors associated with the VM.

 

Plan:

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply