RCA 44: Virtual Machine Hangs while taking Backup

Hostname: ABC-P-F06-SY02.xyz.abc.com

ESXi Version: ESXi 6.5 Update 3

 

Time of Issue: 11:59 , 27th July 2021

Time in GMT: 5:59 PM, 27th July 2021

 

VM Name: ABC-p-qv-mq1

 

  • While Looking at the Vmware logs for the VM we can see that the Snapshot Request came with Lazy=1

 

2021-07-27T17:59:57.301Z| vmx| I125: VigorTransportProcessClientPayload: opID=kq6q6uj9-6inz2-h5:70166868-20-ee-d30c seq=5261055: Receiving Snapshot.Take request.
2021-07-27T17:59:57.301Z| vmx| I125: SnapshotVMX_TakeSnapshot start: ‘VM Snapshot 7%252f27%252f2021, 11:59:45 PM’, deviceState=1, lazy=1, quiesced=0, forceNative=0, tryNative=1, saveAllocMaps=0 cb=C1B6B98A10, cbData=C1B84B3B00
2021-07-27T17:59:57.313Z| vmx| I125: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).

2021-07-27T17:59:57.314Z| vcpu-0| I125: Destroying virtual dev for scsi0:0 vscsi=8816
2021-07-27T17:59:57.314Z| vcpu-0| I125: Destroying virtual dev for scsi0:1 vscsi=8817
2021-07-27T17:59:57.314Z| vcpu-0| I125: Destroying virtual dev for scsi0:2 vscsi=8818
2021-07-27T17:59:57.443Z| vcpu-0| I125: Progress -1% (msg.checkpoint.saveStatusWithName)
2021-07-27T17:59:57.443Z| vcpu-0| I125: Checkpointed in VMware ESX, 6.5.0, build-14320405, Linux Host
2021-07-27T17:59:57.474Z| vcpu-0| I125: Progress 0% (none)
2021-07-27T17:59:57.571Z| vcpu-0| I125: Progress 101% (none)
2021-07-27T17:59:57.571Z| vcpu-0| I125: Closing all the disks of the VM.
2021-07-27T17:59:57.571Z| vcpu-0| I125: Closing disk ‘scsi0:2’
2021-07-27T17:59:57.572Z| vcpu-0| I125: Closing disk ‘scsi0:1’
2021-07-27T17:59:57.572Z| vcpu-0| I125: Closing disk ‘scsi0:0’

 

  • Here we can see that the VM was stun only for 0.44 Seconds:

 

2021-07-27T17:59:57.757Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 443003 us

 

  • However since we have selected the Snapshot with Memory there is a Lazy IO Task which is initiated and took around 5 Seconds.

 

.2021-07-27T17:59:58.420Z| Worker#2| I125: MainMem: Begin lazy IO (16777216 pages, 0 done, 1 threads, bio = 0).

.

.

2021-07-27T18:04:37.460Z| Worker#2| I125: MainMem: End lazy IO (16777216 done, sync = 0, error = 0).

 

2021-07-27T18:04:37.465Z| vmx| I125: VigorTransport_ServerSendResponse opID=kq6q6uj9-6inz2-h5:70166868-20-ee-d30c seq=5261055: Completed Snapshot request.

 

 

 

 

VM Name: ABC-p-qv-mq2

 

  • Another Operation triggered on 18:05.

 

2021-07-27T18:05:10.815Z| vmx| I125: VigorTransportProcessClientPayload: opID=kq6q6uj9-6ir7b-h5:70166955-98-2d-d5da seq=5236325: Receiving Snapshot.Take request.
2021-07-27T18:05:10.815Z| vmx| I125: SnapshotVMX_TakeSnapshot start: ‘VM Snapshot 7%252f28%252f2021, 12:05:01 AM’, deviceState=1, lazy=1, quiesced=0, forceNative=0, tryNative=1, saveAllocMaps=0 cb=38A4790A10, cbData=38A5E8BA00
2021-07-27T18:05:10.827Z| vmx| I125: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).
2021-07-27T18:05:10.828Z| vcpu-0| I125: Destroying virtual dev for scsi0:0 vscsi=8198
2021-07-27T18:05:10.828Z| vcpu-0| I125: Destroying virtual dev for scsi0:1 vscsi=8199
2021-07-27T18:05:10.828Z| vcpu-0| I125: Destroying virtual dev for scsi0:2 vscsi=8200
2021-07-27T18:05:10.959Z| vcpu-0| I125: Progress -1% (msg.checkpoint.saveStatusWithName)
2021-07-27T18:05:10.982Z| vcpu-0| I125: Progress 0% (none)
2021-07-27T18:05:11.079Z| vcpu-0| I125: Progress 101% (none)
2021-07-27T18:05:11.079Z| vcpu-0| I125: Closing all the disks of the VM.
2021-07-27T18:05:11.079Z| vcpu-0| I125: Closing disk ‘scsi0:2’
2021-07-27T18:05:11.081Z| vcpu-0| I125: Closing disk ‘scsi0:1’
2021-07-27T18:05:11.081Z| vcpu-0| I125: Closing disk ‘scsi0:0’

 

  • This operation stun the VM for same 0.44 seconds:

 

2021-07-27T18:05:11.274Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 445986 us

 

  • However since the Snapshot with Memory was selected we can see the Begin and End Lazy IO Operation took around 4 Seconds.

 

2021-07-27T18:05:11.982Z| Worker#1| I125: MainMem: Begin lazy IO (16777216 pages, 0 done, 1 threads, bio = 0).
2021-07-27T18:09:24.836Z| Worker#1| I125: MainMem: End lazy IO (16777216 done, sync = 0, error = 0).
2021-07-27T18:09:24.840Z| vmx| I125: VigorTransport_ServerSendResponse opID=kq6q6uj9-6ir7b-h5:70166955-98-2d-d5da seq=5236325: Completed Snapshot request.

 

 

 

 

VM Name: ABC-p-qv-mq3

 

  • Sane kind of scenario happened with Mq3 as well, the Stun Time took around 0.44 Seconds and the Memory copy took around 4 Seconds.

 

2021-07-27T18:11:09.234Z| vmx| I125: VigorTransportProcessClientPayload: opID=kq6q6uj9-6iuc5-h5:70166996-67-53-d8ab seq=5206468: Receiving Snapshot.Take request.
2021-07-27T18:11:09.235Z| vmx| I125: SnapshotVMX_TakeSnapshot start: ‘VM Snapshot 7%252f28%252f2021, 12:10:59 AM’, deviceState=1, lazy=1, quiesced=0, forceNative=0, tryNative=1, saveAllocMaps=0 cb=B6870B7A10, cbData=B68865F9A0
2021-07-27T18:11:09.245Z| vmx| I125: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).
2021-07-27T18:11:09.246Z| vcpu-0| I125: Destroying virtual dev for scsi0:0 vscsi=8207
2021-07-27T18:11:09.246Z| vcpu-0| I125: Destroying virtual dev for scsi0:1 vscsi=8208
2021-07-27T18:11:09.246Z| vcpu-0| I125: Destroying virtual dev for scsi0:2 vscsi=8209
2021-07-27T18:11:09.378Z| vcpu-0| I125: Progress -1% (msg.checkpoint.saveStatusWithName)
2021-07-27T18:11:09.406Z| vcpu-0| I125: Progress 0% (none)
2021-07-27T18:11:09.503Z| vcpu-0| I125: Progress 101% (none)
2021-07-27T18:11:09.503Z| vcpu-0| I125: Closing all the disks of the VM.
2021-07-27T18:11:09.503Z| vcpu-0| I125: Closing disk ‘scsi0:2’
2021-07-27T18:11:09.504Z| vcpu-0| I125: Closing disk ‘scsi0:1’
2021-07-27T18:11:09.505Z| vcpu-0| I125: Closing disk ‘scsi0:0’
2021-07-27T18:11:09.702Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 456295 us
2021-07-27T18:11:10.430Z| Worker#0| I125: MainMem: Begin lazy IO (16777216 pages, 0 done, 1 threads, bio = 0).
2021-07-27T18:15:17.642Z| Worker#0| I125: MainMem: End lazy IO (16777216 done, sync = 0, error = 0).
2021-07-27T18:15:17.647Z| vmx| I125: VigorTransport_ServerSendResponse opID=kq6q6uj9-6iuc5-h5:70166996-67-53-d8ab seq=5206468: Completed Snapshot request.

 

 

Conclusion:

 

  • Generally, while performing a Change its always recommended to have healthy backup or a Snapshot of the Virtual machine so that Incase if there are any issue with the process we can always revert back to the Healthy state.

 

  • Based on the logs we can see that the Snapshot Operation stun the Virtual Machine for 0.44 Seconds which should not be much of a concern. However, the Since we have selected Memory Copy, it took nearly 4 Seconds of its own which can cause issues with a IO Sensitive Virtual Machine.

 

Action Plan:

 

  • I will recommend you check if it’s possible to take Offline Snapshots for this IO Sensitive Virtual Machines as it will release all the Pending IO from Memory and make the Snapshot operation more consistent.
  • Incase if it’s not possible then, I will recommend you to take a snapshot without Memory as these snapshot required the Memory content to be copied which might affect a IO Sensitive workload.

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply