RCA 47: Virtual Machine Crashed Unrepectedly on ESXi Host

Hostname: abcdmesxi843

ESXi Version: VMware ESXi 7.0.2 build-17867351

 

VM Name: xyz761

Time of Issue: 10.48 AM SGT

Time in GMT: 2:48 am GMT

Datastore Details: naa.60060e8012456900504045690000025d || G800-RCG-INT-APP-NPRD-RH-01-N-007-25D

 

 

Vmware Logs:

 

  • From the Vmware Logs we can see the VM was running on Host: abcdmesxi843

 

2021-11-17T01:13:25.001Z| vmx| | I005: Hostname=abcdmesxi843

2021-11-17T02:40:01.321Z| vcpu-1| | I005: Msg_Question:

 

  • Here we can see that the VM Lost the Access to the VMDK: xyz761_1.vmdk and we got the below warning for the same:

 

2021-11-17T02:40:01.321Z| vcpu-1| | I005: [msg.hbacommon.locklost] The lock protecting ‘/vmfs/volumes/6174c00b-20f7956e-08e3-0031b51502cd/xyz761/xyz761_1.vmdk’ has been lost, possibly due to underlying storage issues.

2021-11-17T02:40:01.321Z| vcpu-1| | I005+ If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK.

2021-11-17T02:40:01.321Z| vcpu-1| | I005: —————————————-

2021-11-17T02:40:22.422Z| vcpu-0| | I005: Tools: Tools heartbeat timeout.

2021-11-17T02:44:01.878Z| vcpu-1| | I005: Timing out dialog 247634252021-11-17T02:44:01.879Z| vcpu-1| | I005: MsgQuestion: msg.hbacommon.locklost reply=0

2021-11-17T02:44:01.879Z| vcpu-1| | E001: PANIC: Exiting because of failed disk operation.

2021-11-17T02:44:02.496Z| vcpu-1| | W003: A core file is available in “/vmfs/volumes/61659acc-89cee08e-924f-0031b515054f/xyz761_1636708105855/vmx-zdump.003″

 

  • Post this we can see that the VM shutdown:

 

2021-11-17T02:44:02.496Z| mks| | W003: Panic in progress… ungrabbing2021-11-17T02:44:02.496Z| mks| | I005: MKS: Release starting (Panic)

2021-11-17T02:44:02.496Z| mks| | I005: MKS: Release finished (Panic)

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace:

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace[0] 00000029efbe2d80 rip=00000029a7209779 rbx=00000029efbe2da0 rbp=00000029efbe3280 r12=00000029a8422ac0 r13=0000000000000001 r14=00000029a8196440 r15=00000029edc612e0

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace[1] 00000029efbe3290 rip=00000029a736ff61 rbx=00000029a8c71678 rbp=00000029efbe32b0 r12=00000029a8bbce70 r13=0000000000000002 r14=00000029a8196440 r15=00000029edc612e0

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace[2] 00000029efbe32c0 rip=00000029a7365e7a rbx=00000029a8c6ae00 rbp=00000029efbe3470 r12=00000029a8422ac0 r13=000000000000014f r14=00000029a8196440 r15=00000029edc612e0

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace[3] 00000029efbe3480 rip=00000029a767e767 rbx=0000000000000230 rbp=00000029efbe34c0 r12=00000029a86cc598 r13=000000000000014f r14=00000029a8196440 r15=00000029edc612e0

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace[4] 00000029efbe34d0 rip=00000029a768769d rbx=00000029edc62368 rbp=00000029efbe3510 r12=00000029edc612d0 r13=000000000000012d r14=00000029edc612e0 r15=0000000000000001

2021-11-17T02:44:02.496Z| vcpu-1| | I005: Backtrace[5] 00000029efbe3520 rip=00000029a767e922 rbx=00000029a86930e0 rbp=00000029efbe3530 r12=0000000000000001 r13=00000029efbe79c0 r14=00000029a88f3040 r15=0000000000000003

2021-11-17T02:44:02.497Z| vcpu-1| | I005: Backtrace[6] 00000029efbe3540 rip=00000029a77b4037 rbx=00000029efbe3540 rbp=00000029efbe3660 r12=00000029a8ca6c40 r13=00000029efbe79c0 r14=00000029a88f3040 r15=0000000000000003

2021-11-17T02:44:02.497Z| vcpu-1| | I005: Backtrace[7] 00000029efbe3670 rip=00000029ea428d3b rbx=0000000000000000 rbp=0000000000000000 r12=00000029ef9e2360 r13=00000029efbe79c0 r14=00000029a88f3040 r15=0000000000000003

2021-11-17T02:44:02.497Z| vcpu-1| | I005: Backtrace[8] 00000029efbe3780 rip=00000029ea7288ad rbx=0000000000000000 rbp=0000000000000000 r12=00000029ef9e2360 r13=00000029efbe79c0 r14=00000029a88f3040 r15=0000000000000003

2021-11-17T02:44:02.497Z| vcpu-1| | I005: Backtrace[9] 00000029efbe3788 rip=0000000000000000 rbx=0000000000000000 rbp=0000000000000000 r12=00000029ef9e2360 r13=00000029efbe79c0 r14=00000029a88f3040 r15=0000000000000003

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[0] 00000029efbe2d80 rip=00000029a7209779 in function Panic in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[1] 00000029efbe3290 rip=00000029a736ff61 in function (null) in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[2] 00000029efbe32c0 rip=00000029a7365e7a in function (null) in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[3] 00000029efbe3480 rip=00000029a767e767 in function (null) in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[4] 00000029efbe34d0 rip=00000029a768769d in function (null) in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[5] 00000029efbe3520 rip=00000029a767e922 in function (null) in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[6] 00000029efbe3540 rip=00000029a77b4037 in function (null) in object /bin/vmx loaded at 00000029a6faf000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[7] 00000029efbe3670 rip=00000029ea428d3b in function (null) in object /lib64/libpthread.so.0 loaded at 00000029ea421000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[8] 00000029efbe3780 rip=00000029ea7288ad in function clone in object /lib64/libc.so.6 loaded at 00000029ea63e000

2021-11-17T02:44:02.497Z| vcpu-1| | I005: SymBacktrace[9] 00000029efbe3788 rip=0000000000000000

 

 

VOBD Logs:

 

  • From the VOBD Logs we can see issues with the Datastore G800-RCG-INT-APP-NPRD-RH-01-R-003-603.

 

2021-11-17T01:38:46.070Z: [vmfsCorrelator] 2477863646095us: [vob.vmfs.heartbeat.unrecoverable] This host lost connectivity to volume 6174c00b-20f7956e-08e3-0031b51502cd (“G800-RCG-INT-APP-NPRD-RH-01-R-003-603”) and subsequent recovery attempts have failed

2021-11-17T01:38:46.070Z: [vmfsCorrelator] 2477863645774us: [esx.problem.vmfs.heartbeat.unrecoverable] 6174c00b-20f7956e-08e3-0031b51502cd G800-RCG-INT-APP-NPRD-RH-01-R-003-603

2021-11-17T02:44:02.494Z: [UserWorldCorrelator] 2481780070364us: [vob.uw.core.dumped] /bin/vmx(2389436) /vmfs/volumes/61659acc-89cee08e-924f-0031b515054f/xyz761_1636708105855/vmx-zdump.003

2021-11-17T02:44:02.494Z: [UserWorldCorrelator] 2481780070004us: [esx.problem.application.core.dumped] An application (/bin/vmx) running on ESXi host has crashed (8 time(s) so far). A core file may have been created at /vmfs/volumes/61659acc-89cee08e-924f-0031b515054f/xyz761_1636708105855/vmx-zdump.003.

 

VMKernel Logs:

 

  • From the Vmkernel Logs we can see the same set of events for the Datastore: 60060e80124569005040456900000603.

 

2021-11-17T01:38:46.069Z cpu2:2097202)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0xc0 (0x45b920dac300, 2097193) to dev “naa.60060e80124569005040456900000603” on path “vmhba2:C0:T3:L8” Failed:

2021-11-17T01:38:46.069Z cpu2:2097202)NMP: nmp_ThrottleLogForDevice:3869: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE. cmdId.initiator=0x4306a623b140 CmdSN 0x2a6

2021-11-17T01:38:46.069Z cpu2:2097202)ScsiDeviceIO: 4277: Cmd(0x45b920d07d00) 0xfe, CmdSN 0x2a6 from world 2097193 to dev “naa.60060e80124569005040456900000603” failed H:0x0 D:0x2 P:0x5

2021-11-17T01:38:46.069Z cpu2:2097202)HBX: 1947: ATS Miscompare detected between test and set HB images at offset 3153920 on vol ‘6174c00b-20f7956e-08e3-0031b51502cd’.

2021-11-17T01:38:46.069Z cpu2:2097202)HBX: 1948: ‘G800-RCG-INT-APP-NPRD-RH-01-R-003-603’: HB at offset 3153920 – Test version:2021-11-17T01:38:46.070Z cpu4:2097837)HBX: 5822: ‘G800-RCG-INT-APP-NPRD-RH-01-R-003-603’: HB at offset 3153920 – Cancelling all threads waiting for reclaim of HB:

2021-11-17T01:38:46.070Z cpu4:2097837)WARNING: HBX: 318: This host lost connectivity to volume 6174c00b-20f7956e-08e3-0031b51502cd (“G800-RCG-INT-APP-NPRD-RH-01-R-003-603”) and subsequent recovery attempts have failed

2021-11-17T01:38:46.070Z cpu4:2097837)HBX: 6106: ‘G800-RCG-INT-APP-NPRD-RH-01-R-003-603’: HB at offset 3153920 – Lost heartbeat. On-disk:

2021-11-17T02:44:05.843Z cpu22:2097244)WARNING: IOSpaceVMK: 264: IOSpaceVMK: Couldn’t find port VMCI resultHigh.

2021-11-17T02:44:05.843Z cpu22:2097244)WARNING: IOSpaceVMK: 264: IOSpaceVMK: Couldn’t find port VMCI resultLow.

2021-11-17T02:44:10.957Z cpu19:2097400)WARNING: Vol3: 2953: ‘G800-RCG-INT-APP-NPRD-RH-01-R-003-603’: Failed to clear journal address in on-disk HB. This could result in leak of journal block at <type 1 addr 1128496>

2021-11-17T06:55:47.913Z cpu9:2392011)World: vm 2392012: 6892: Starting world vmm0:xyz761 of type 8

 

 

Hostname: abcdmesxi842

ESXi Version: VMware ESXi 7.0.2 build-17867351

 

 

VM Name: xyz760

Time of Issue: 5:22 AM SGT

Time in GMT: 9:22 pm GMT

 

  • From the Logs of the other VM we can see the same set of Issue being Raised for the VM:

 

2021-11-16T02:09:28.195Z| vmx| | I005: Hostname=abcdmesxi842

 

2021-11-16T21:13:37.644Z| vcpu-1| | I005: Msg_Question:

2021-11-16T21:13:37.644Z| vcpu-1| | I005: [msg.hbacommon.locklost] The lock protecting ‘/vmfs/volumes/6174c00b-20f7956e-08e3-0031b51502cd/xyz804/xyz804.vmdk’ has been lost, possibly due to underlying storage issues.

2021-11-16T21:13:37.644Z| vcpu-1| | I005+ If this virtual machine is configured to be highly available, ensure that the virtual machine is running on some other host before clicking OK.

2021-11-16T21:13:37.644Z| vcpu-1| | I005: —————————————-

2021-11-16T21:17:37.712Z| vcpu-1| | I005: Timing out dialog 239331462021-11-16T21:17:37.713Z| vcpu-1| | I005: MsgQuestion: msg.hbacommon.locklost reply=0

2021-11-16T21:17:37.713Z| vcpu-1| | E001: PANIC: Exiting because of failed disk operation.

2021-11-16T21:17:38.377Z| vcpu-1| | W003: A core file is available in “/vmfs/volumes/61659acc-89cee08e-924f-0031b515054f/xyz760_1636708106786/vmx-zdump.002″

2021-11-16T21:17:38.377Z| mks| | W003: Panic in progress… ungrabbing2021-11-16T21:17:38.377Z| mks| | I005: MKS: Release starting (Panic)2021-11-16T21:17:38.377Z| mks| | I005: MKS: Release finished (Panic)

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace:

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[0] 0000000644838d80 rip=00000005fbddf779 rbx=0000000644838da0 rbp=0000000644839280 r12=00000005fcff8ac0 r13=0000000000000001 r14=00000005fcd6c440 r15=00000006428372e0

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[1] 0000000644839290 rip=00000005fbf45f61 rbx=00000005fd847238 rbp=00000006448392b0 r12=00000005fd793e20 r13=0000000000000002 r14=00000005fcd6c440 r15=00000006428372e0

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[2] 00000006448392c0 rip=00000005fbf3be7a rbx=00000005fd8409c0 rbp=0000000644839470 r12=00000005fcff8ac0 r13=000000000000014f r14=00000005fcd6c440 r15=00000006428372e0

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[3] 0000000644839480 rip=00000005fc254767 rbx=0000000000000230 rbp=00000006448394c0 r12=00000005fd2a2598 r13=000000000000014f r14=00000005fcd6c440 r15=00000006428372e0

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[4] 00000006448394d0 rip=00000005fc25d69d rbx=0000000642838368 rbp=0000000644839510 r12=00000006428372d0 r13=000000000000012d r14=00000006428372e0 r15=0000000000000001

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[5] 0000000644839520 rip=00000005fc254922 rbx=00000005fd2690e0 rbp=0000000644839530 r12=0000000000000001 r13=000000064483d9c0 r14=00000005fd4c9040 r15=0000000000000003

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[6] 0000000644839540 rip=00000005fc38a037 rbx=0000000644839540 rbp=0000000644839660 r12=00000005fd87ca80 r13=000000064483d9c0 r14=00000005fd4c9040 r15=0000000000000003

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[7] 0000000644839670 rip=000000063effed3b rbx=0000000000000000 rbp=0000000000000000 r12=0000000644638360 r13=000000064483d9c0 r14=00000005fd4c9040 r15=0000000000000003

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[8] 0000000644839780 rip=000000063f2fe8ad rbx=0000000000000000 rbp=0000000000000000 r12=0000000644638360 r13=000000064483d9c0 r14=00000005fd4c9040 r15=0000000000000003

2021-11-16T21:17:38.377Z| vcpu-1| | I005: Backtrace[9] 0000000644839788 rip=0000000000000000 rbx=0000000000000000 rbp=0000000000000000 r12=0000000644638360 r13=000000064483d9c0 r14=00000005fd4c9040 r15=0000000000000003

2021-11-16T21:17:38.377Z| vcpu-1| | I005: SymBacktrace[0] 0000000644838d80 rip=00000005fbddf779 in function Panic in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.377Z| vcpu-1| | I005: SymBacktrace[1] 0000000644839290 rip=00000005fbf45f61 in function (null) in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.377Z| vcpu-1| | I005: SymBacktrace[2] 00000006448392c0 rip=00000005fbf3be7a in function (null) in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.377Z| vcpu-1| | I005: SymBacktrace[3] 0000000644839480 rip=00000005fc254767 in function (null) in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.378Z| vcpu-1| | I005: SymBacktrace[4] 00000006448394d0 rip=00000005fc25d69d in function (null) in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.378Z| vcpu-1| | I005: SymBacktrace[5] 0000000644839520 rip=00000005fc254922 in function (null) in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.378Z| vcpu-1| | I005: SymBacktrace[6] 0000000644839540 rip=00000005fc38a037 in function (null) in object /bin/vmx loaded at 00000005fbb85000

2021-11-16T21:17:38.378Z| vcpu-1| | I005: SymBacktrace[7] 0000000644839670 rip=000000063effed3b in function (null) in object /lib64/libpthread.so.0 loaded at 000000063eff7000

2021-11-16T21:17:38.378Z| vcpu-1| | I005: SymBacktrace[8] 0000000644839780 rip=000000063f2fe8ad in function clone in object /lib64/libc.so.6 loaded at 000000063f214000

2021-11-16T21:17:38.378Z| vcpu-1| | I005: SymBacktrace[9] 0000000644839788 rip=0000000000000000

 

 

Conclusion:

 

If an ESX/ESXi host loses access to a datastore, I/O from running virtual machines on the datastore will time out and fail. The virtual machine pauses and an event message appears stating that the virtual machine lost access to its disk.

 

  • This issue seems to be triggered due to the disconnection of the Datastore: G800-RCG-INT-APP-NPRD-RH-01-N-007-25D

 

Action Plan:

 

  • Please check with your Storage Vendor to understand the Reason for the Storage Disconnect.

 

  • As per the Details I can see that the Driver which you are using for the HBA is nfnic version 4.0.0.65 and the recommended firmware version with the driver is 4.2(3), However we can see that the firmware version running on the Host is 4.4(3). Please check with your Hardware Vendor and get try to get to Same Driver/Firmware Version.

 

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37358&deviceCategory=io&details=1&VID=1137&DID=0045&SVID=1137&SSID=012c&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply