RCA 42: Datastore Inaccessible Due to LAG on the vCenter

Hostname: abc01xsdi001.sdi.corp.abc.com

ESXi Version: ESXi 6.7 P04

 

Time of Issue: 5/30/2021, 9:38:31 AM IST

Time in GMT: 5/30/2021, 4:08:31 AM GMT
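The IST to GMT conversion above can be double-checked with a few lines of Python. This is a minimal sketch: the fixed +05:30 offset for IST is standard, while the function name `ist_to_utc` is only illustrative.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed UTC+05:30 offset (no daylight saving time).
IST = timezone(timedelta(hours=5, minutes=30), name="IST")

def ist_to_utc(text: str) -> datetime:
    """Parse a timestamp such as '5/30/2021, 9:38:31 AM' as IST and return it in UTC."""
    local = datetime.strptime(text, "%m/%d/%Y, %I:%M:%S %p").replace(tzinfo=IST)
    return local.astimezone(timezone.utc)

print(ist_to_utc("5/30/2021, 9:38:31 AM").isoformat())
# 2021-05-30T04:08:31+00:00
```

All log timestamps in the sections below end in `Z`, so they are already in UTC and can be compared directly with the converted value.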

 

vmnic   PCI bus address  link  speed  duplex  MTU   driver      driver version  firmware version  MAC address        VID   DID   SVID  SDID  name
vmnic0  0000:04:00.0     Up    10000  Full    9000  nmlx5_core  4.17.70.1       14.27.4000        9c:dc:71:49:20:00  15b3  1015  1590  00d3  Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
vmnic1  0000:04:00.1     Up    10000  Full    9000  nmlx5_core  4.17.70.1       14.27.4000        9c:dc:71:49:20:01  15b3  1015  1590  00d3  Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

 

 

Hostd Logs:

 

  • The hostd logs show that the issue started when one of the uplinks, vmnic0, went down and was moved out of the link aggregation group (LAG):

 

2021-05-30T04:05:17.210Z info hostd[2103014] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277722 : LACP warning: uplink vmnic0 on VDS DvsPortset-0 is moved out of link aggregation group.
2021-05-30T04:05:18.000Z info hostd[2129525] [Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event: (vim.event.EventEx) {
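When reviewing a full hostd.log, the LACP event can be located programmatically instead of by eye. The following is a minimal sketch; the regular expression is an assumption derived from the single sample line above, not a documented log format, and `find_lacp_events` is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample hostd line in this RCA (assumption).
LACP_RE = re.compile(
    r"^(?P<ts>\S+) \w+ hostd\[\d+\].*LACP warning: uplink (?P<uplink>\S+) "
    r"on VDS (?P<vds>\S+) is moved out of link aggregation group"
)

def find_lacp_events(lines):
    """Return (timestamp, uplink, vds) for every 'moved out of LAG' event."""
    events = []
    for line in lines:
        match = LACP_RE.search(line)
        if match:
            events.append((match["ts"], match["uplink"], match["vds"]))
    return events

sample = [
    "2021-05-30T04:05:17.210Z info hostd[2103014] [Originator@6876 "
    "sub=Vimsvc.ha-eventmgr] Event 277722 : LACP warning: uplink vmnic0 "
    "on VDS DvsPortset-0 is moved out of link aggregation group.",
]
print(find_lacp_events(sample))
```

In practice the `lines` argument would be an open file handle on the host's hostd.log rather than an in-memory list.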

 

  • About three minutes later, the ESXi host starts reporting datastore connectivity issues. Hostd logs 13 "Lost access to volume" events (277732 to 277744) within the same second:

 

2021-05-30T04:08:35.649Z warning hostd[2102988] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore 'b7638e5a-87e6-d995-1d6f-9cdc7149f0d0'.
2021-05-30T04:08:35.651Z info hostd[2102988] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277732 : Lost access to volume 5a8e63b7-6fe5bf3f-b4c2-9cdc7149f0d0 (b7638e5a-87e6-d995-1d6f-9cdc7149f0d0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.653Z warning hostd[2103148] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '920e755f-8406-5114-aead-9cdc7149d7a0'.
2021-05-30T04:08:35.658Z info hostd[2103148] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277733 : Lost access to volume 5f750e93-a6a6743c-3e41-9cdc7149d7a0 (920e755f-8406-5114-aead-9cdc7149d7a0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.658Z warning hostd[2713025] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '10001a5f-bf31-9a55-4ab1-9cdc7149e750'.
2021-05-30T04:08:35.662Z info hostd[2713025] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277734 : Lost access to volume 5f1a0010-f8e7c778-4835-9cdc7149e750 (10001a5f-bf31-9a55-4ab1-9cdc7149e750) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.662Z warning hostd[2103147] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '961da25a-f2c3-a5bb-acd4-9cdc715e41e0'.
2021-05-30T04:08:35.664Z info hostd[2103147] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277735 : Lost access to volume 5aa21d96-19f04dfb-cdc2-9cdc715e41e0 (961da25a-f2c3-a5bb-acd4-9cdc715e41e0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.664Z warning hostd[2110436] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore 'b400765f-d707-f5da-fab9-9cdc715e41e0'.
2021-05-30T04:08:35.665Z info hostd[2110436] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277736 : Lost access to volume 5f7600b4-416f35fd-4480-9cdc715e41e0 (b400765f-d707-f5da-fab9-9cdc715e41e0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.665Z warning hostd[2103014] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore 'c6638e5a-6a3c-9e2d-8ca3-9cdc7149f0d0'.
2021-05-30T04:08:35.666Z info hostd[2103014] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277737 : Lost access to volume 5a8e63c6-0899ddc1-1443-9cdc7149f0d0 (c6638e5a-6a3c-9e2d-8ca3-9cdc7149f0d0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.667Z warning hostd[2107908] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '6cb49c5a-b426-1183-4906-e0071b770f00'.
2021-05-30T04:08:35.669Z info hostd[2107908] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277738 : Lost access to volume 5a9cb46c-596f050b-cb5e-e0071b770f00 (6cb49c5a-b426-1183-4906-e0071b770f00) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.670Z warning hostd[2129527] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '8c08dc5a-8f1c-73e2-102e-9cdc714a6310'.
2021-05-30T04:08:35.672Z info hostd[2129527] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277739 : Lost access to volume 5adc088c-83a78d30-daea-9cdc714a6310 (8c08dc5a-8f1c-73e2-102e-9cdc714a6310) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.677Z warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '94bc155f-1f20-9dbb-cf87-e0071b8303f0'.
2021-05-30T04:08:35.684Z info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277740 : Lost access to volume 5f15bc94-b0928f22-ae54-e0071b8303f0 (94bc155f-1f20-9dbb-cf87-e0071b8303f0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.709Z warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '11001a5f-fcdd-0c84-52d8-9cdc71492080'.
2021-05-30T04:08:35.713Z info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277741 : Lost access to volume 5f1a0011-b54f076e-8471-9cdc71492080 (11001a5f-fcdd-0c84-52d8-9cdc71492080) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.713Z warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '107e195f-b354-a0df-1845-9cdc7149d7a0'.
2021-05-30T04:08:35.716Z info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277742 : Lost access to volume 5f197e10-74e0e03f-09b4-9cdc7149d7a0 (107e195f-b354-a0df-1845-9cdc7149d7a0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.719Z warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore 'c5f64e5e-9349-eb6c-593c-9cdc7149f0c0'.
2021-05-30T04:08:35.721Z info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277743 : Lost access to volume 5e4ef6c5-3365b617-eeb7-9cdc7149f0c0 (c5f64e5e-9349-eb6c-593c-9cdc7149f0c0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.721Z warning hostd[2713024] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore 'ad40185f-73fa-9a18-e97c-9cdc7149f070'.
2021-05-30T04:08:35.722Z info hostd[2713024] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277744 : Lost access to volume 5f1840ad-81f34605-eb01-9cdc7149f070 (ad40185f-73fa-9a18-e97c-9cdc7149f070) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.723Z warning hostd[2713024] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can't find datastore '0909765f-c307-75ff-6613-e0071b7bbef0'.
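To see how many volumes were affected and how tightly the events cluster, the "Lost access to volume" entries can be extracted from hostd.log. This is a minimal sketch: the pattern is inferred from the lines quoted above, and `lost_volumes` is a hypothetical helper, not part of any VMware tooling.

```python
import re

# Pattern inferred from the hostd excerpts in this RCA (assumption).
LOST_RE = re.compile(
    r"^(?P<ts>\S+) .*Event (?P<event>\d+) : Lost access to volume "
    r"(?P<uuid>\S+) \((?P<label>[^)]+)\)"
)

def lost_volumes(lines):
    """Return (timestamp, event id, volume UUID, label) per lost-access event."""
    found = []
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            found.append(
                (match["ts"], int(match["event"]), match["uuid"], match["label"])
            )
    return found

sample = [
    "2021-05-30T04:08:35.651Z info hostd[2102988] [Originator@6876 sub=Vimsvc.ha-eventmgr] "
    "Event 277732 : Lost access to volume 5a8e63b7-6fe5bf3f-b4c2-9cdc7149f0d0 "
    "(b7638e5a-87e6-d995-1d6f-9cdc7149f0d0) due to connectivity issues.",
    "2021-05-30T04:08:35.658Z info hostd[2103148] [Originator@6876 sub=Vimsvc.ha-eventmgr] "
    "Event 277733 : Lost access to volume 5f750e93-a6a6743c-3e41-9cdc7149d7a0 "
    "(920e755f-8406-5114-aead-9cdc7149d7a0) due to connectivity issues.",
]

events = lost_volumes(sample)
print(len(events), "volumes lost between", events[0][0], "and", events[-1][0])
```

Many volumes dropping within the same second points toward a shared cause, such as the network path, rather than a fault on any individual datastore.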

 

 

VOBD Logs:

 

  • The VOBD logs show VMFS heartbeat timeout errors for the datastores:

 

2021-05-30T04:08:35.663Z: [vmfsCorrelator] 6461638394563us: [esx.problem.vmfs.heartbeat.timedout] 5a742056-7970936b-04b8-9cdc71492000 5620745a-09ad-03ff-8b9a-9cdc71492000
2021-05-30T04:08:35.663Z: [vmfsCorrelator] 6461703690483us: [vob.vmfs.heartbeat.timedout] 5e5ec93b-92524db8-fd3d-e0071b83e250 3bc95e5e-bf65-9aec-4035-e0071b83e250
2021-05-30T04:08:35.663Z: [vmfsCorrelator] 6461638394799us: [esx.problem.vmfs.heartbeat.timedout] 5e5ec93b-92524db8-fd3d-e0071b83e250 3bc95e5e-bf65-9aec-4035-e0071b83e250
2021-05-30T04:08:35.663Z: [vmfsCorrelator] 6461703690487us: [vob.vmfs.heartbeat.timedout] 5f16e861-d6abf240-1262-e0071b777f90 61e8165f-39bb-7160-16d6-e0071b777f90
2021-05-30T04:08:35.663Z: [vmfsCorrelator] 6461638395013us: [esx.problem.vmfs.heartbeat.timedout] 5f16e861-d6abf240-1262-e0071b777f90 61e8165f-39bb-7160-16d6-e0071b777f90

  • Because the datastore became inaccessible, the virtual machine could not fault in a page from its swap file and was terminated:

 

2021-05-30T04:09:27.217Z: [VMCorrelator] 6461755259413us: [vob.vm.kill.unexpected.fault.failure] The virtual machine using the configuration file /vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/lva20bmciias01v.vmx could not fault in a page from the swap file at /vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/lva20bmciias01v-3ca8d9fd.vswp. The virtual machine has been powered off.
2021-05-30T04:09:27.310Z: [VMCorrelator] 6461690042246us: [esx.problem.vm.kill.unexpected.fault.failure.2] /vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/lva20bmciias01v.vmx could not fault in a guest physical page from the hypervisor level swap file on vsan:5242821db07941cc-e8cc95162cb58c8c. The VM is terminated as further progress is impossible.
2021-05-30T04:09:27.311Z: No correlator for vob.vm.kill.panic

2021-05-30T04:09:44.395Z: [UserWorldCorrelator] 6461772437877us: [vob.uw.core.dumpFailed] /bin/vmx(2109526) /vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/vmx-zdump.000 dump failed
2021-05-30T04:09:44.395Z: [UserWorldCorrelator] 6461707127260us: [esx.problem.application.core.dumpFailed] An application (/bin/vmx) running on ESXi host has crashed (3 time(s) so far), but core dump creation failed.

 

2021-05-30T04:32:43.429Z: [netCorrelator] 6463151484003us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic1 is down. Affected dvPort: 3199/50 0c 59 99 82 ac a7 b4-60 e5 52 4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128
2021-05-30T04:32:43.429Z: [netCorrelator] 6463151484015us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic1 is down. Affected dvPort: 2048/50 0c 59 99 82 ac a7 b4-60 e5 52 4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128
2021-05-30T04:32:43.429Z: [netCorrelator] 6463151484019us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic1 is down. Affected dvPort: 3200/50 0c 59 99 82 ac a7 b4-60 e5 52 4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128
2021-05-30T04:32:43.429Z: [netCorrelator] 6463151484023us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic1 is down. Affected dvPort: 797/50 0c 59 99 82 ac a7 b4-60 e5 52 4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128

 

 

VMKernel Logs:

 

2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS: MasterCheckNode:7921: Lost contact with backup
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS: CMMDSHeartbeatCheckHBLogWork:733: Check node returned Failure for node 00000000-0000-0000-0000-e0071b770f00 count 5
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS: CMMDSStateDestroyNode:689: Destroying node 00000000-0000-0000-0000-e0071b770f00: Heartbeat timeout
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS: MasterLostBackup:426: Master Failover: MUUID bd0fb360-7ddf-115a-2770-9cdc71492000 old 66c36d60-69f7-e8b1-c98a-9cdc71492000
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS: MasterRemoveNodeFromMembership:6771: Removing node 00000000-0000-0000-0000-e0071b770f00 from the cluster membership

2021-05-30T04:08:33.667Z cpu4:2099558)DOM: DOMLeafSubscribeSSDHealth:3270: Failed to retrieve/unmarshal disk entry `523ae4b2-acb4-289a-b36a-7824e15fe1a0` for leaf object `0f2a215e-4cf7-78aa-7b9b-e0071b83e250`: Not found (0xbad0003)
2021-05-30T04:08:33.727Z cpu66:2099556)DOM: DOMLeafSubscribeSSDHealth:3270: Failed to retrieve/unmarshal disk entry `528a0494-6a67-4815-5f18-d33919ac2917` for leaf object `ba86415e-e5f5-ea26-40fe-9cdc7149f0c0`: Not found (0xbad0003)

2021-05-30T04:08:35.681Z cpu20:2099517)CMMDS: CMMDSHeartbeatCheckHBLogWork:733: Check node returned Failure for node 00000000-0000-0000-0000-e0071b7bbed0 count 11
2021-05-30T04:08:35.776Z cpu20:2099517)CMMDS: CMMDSHeartbeatCheckHBLogWork:733: Check node returned Failure for node 00000000-0000-0000-0000-9cdc7149f0d0 count 11
2021-05-30T04:08:35.776Z cpu11:2099560)DOM: DOMLeafSubscribeSSDHealth:3270: Failed to retrieve/unmarshal disk entry `5276c5d3-f670-002d-b5a4-b5ba31cd5256` for leaf object `bd4cab5e-91a8-9a1d-c91a-e0071b77cfb0`: Not found (0xbad0003)
2021-05-30T04:08:35.868Z cpu20:2099517)CMMDS: CMMDSHeartbeatCheckHBLogWork:733: Check node returned Failure for node 00000000-0000-0000-0000-e0071b77fed0 count 11
2021-05-30T04:08:39.906Z cpu15:2103576)HBX: 3041: '30e21a5f-23a7-422a-e39c-e0071b77fef0': HB at offset 3698688 - Waiting for timed out HB:
2021-05-30T04:08:41.023Z cpu6:2713024)HBX: 3041: 'b7638e5a-87e6-d995-1d6f-9cdc7149f0d0': HB at offset 3698688 - Waiting for timed out HB:
2021-05-30T04:08:41.065Z cpu42:2102986)HBX: 3041: '338ac95a-9328-2bad-d340-9cdc7149f0e0': HB at offset 3698688 - Waiting for timed out HB:
2021-05-30T04:08:41.078Z cpu53:2110436)HBX: 3041: 'c6638e5a-6a3c-9e2d-8ca3-9cdc7149f0d0': HB at offset 3698688 - Waiting for timed out HB:
2021-05-30T04:08:44.316Z cpu12:2102981)HBX: 3041: '10001a5f-bf31-9a55-4ab1-9cdc7149e750': HB at offset 3698688 - Waiting for timed out HB:

 

VM Name: lva20bmciias01v

 

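  • The VM logs do not contain much detail around the time of the issue. The entries jump from March 2021 to 04:11:10Z on 5/30/2021, where the hostname changes from abc01xsdi001 to abc01xsdi013, which suggests the VM was restarted on a different host: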

 

2021-03-16T09:31:49.596Z| vmx| I125: Hostname=abc01xsdi001.sdi.corp.abc.com
2021-03-25T18:45:28.020Z| vmx| I125: VigorTransportProcessClientPayload: opID=HB-SpecSync-host-46@591342-13468d0e-8-fb83 seq=920003: Receiving Sched.SetResourceGroup request.
2021-03-25T18:45:28.020Z| vmx| I125: VigorTransport_ServerSendResponse opID=HB-SpecSync-host-46@591342-13468d0e-8-fb83 seq=920003: Completed Sched request.
2021-03-25T18:45:37.918Z| vmx| I125: VigorTransportProcessClientPayload: opID=HB-SpecSync-host-46@591344-7b1c8008-9a-fc3e seq=920013: Receiving Sched.SetResourceGroup request.
2021-03-25T18:45:37.918Z| vmx| I125: VigorTransport_ServerSendResponse opID=HB-SpecSync-host-46@591344-7b1c8008-9a-fc3e seq=920013: Completed Sched request.

 

2021-05-30T04:11:10.333Z| vmx| I125: Hostname=abc01xsdi013.sdi.corp.abc.com
2021-05-30T04:11:10.333Z| vmx| I125: System uptime 5230886026632 us

 

Conclusion:

 

  • Based on the logs, the issue appears to start when uplink vmnic0 on VDS DvsPortset-0 is moved out of the link aggregation group at 04:05:17Z. About three minutes later, the datastores become inaccessible.
  • Because the datastores were inaccessible, the virtual machine was terminated.
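The sequence behind this conclusion is easier to follow as a timeline with offsets from the first event. The sketch below uses only timestamps copied from the log excerpts in this RCA; the event labels are my own summaries, and the helper names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this RCA (all UTC).
EVENTS = [
    ("vmnic0 moved out of LAG (hostd)", "2021-05-30T04:05:17.210Z"),
    ("CMMDS lost contact with backup node (vmkernel)", "2021-05-30T04:08:29.347Z"),
    ("first 'Lost access to volume' event (hostd)", "2021-05-30T04:08:35.651Z"),
    ("VM powered off after swap fault (vobd)", "2021-05-30T04:09:27.217Z"),
    ("vmnic1 reported down (vobd)", "2021-05-30T04:32:43.429Z"),
]

def parse(ts):
    """Parse an ISO-style log timestamp with millisecond precision."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def timeline(events):
    """Sort events by time and return (label, seconds since first event)."""
    ordered = sorted(events, key=lambda event: parse(event[1]))
    start = parse(ordered[0][1])
    return [(label, (parse(ts) - start).total_seconds()) for label, ts in ordered]

for label, offset in timeline(EVENTS):
    print(f"+{offset:9.3f}s  {label}")
```

The roughly 192 seconds between the LACP event and the first vSAN heartbeat failure is worth raising with the vSAN team, since the logs shown here do not explain what happened in that interval.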

 

Action Plan:

 

  • You have already raised a case with the vSAN team, who are currently investigating why a single NIC failure led to this outage. I recommend continuing with that case for more details.

 

The supported firmware version for driver version 4.17.70.1 is 14.27.1016. The host is currently running firmware 14.27.4000.

 

  • I recommend checking with the hardware vendor to confirm whether the host is running a supported firmware version; if it is not, update the firmware.
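The firmware comparison can be made explicit with a small version check. This is a minimal sketch that assumes the dotted version strings compare numerically field by field; it only reports the difference and does not decide whether a newer firmware is supported, which the hardware vendor must confirm.

```python
def parse_version(text):
    """Split a dotted version such as '14.27.4000' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def compare_firmware(installed, listed):
    """Describe how the installed firmware relates to the listed version."""
    a, b = parse_version(installed), parse_version(listed)
    if a == b:
        return "match"
    return "installed is newer" if a > b else "installed is older"

# Values taken from the NIC table and the action plan in this RCA.
print(compare_firmware("14.27.4000", "14.27.1016"))
# installed is newer
```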

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
