Hostname: abc01xsdi001.sdi.corp.abc.com
ESXi Version: ESXi 6.7 P04
Time of Issue: 5/30/2021, 9:38:31 AM IST
Time in GMT: 5/30/2021, 4:08 AM GMT
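As a quick cross-check of the two timestamps above, the IST time can be converted to UTC so that it lines up with the `Z` timestamps in the logs. This is a minimal sketch; the helper name `ist_to_utc` is made up for illustration, and IST is assumed to be a fixed UTC+05:30 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: IST is a fixed UTC+05:30 offset (no daylight saving time).
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def ist_to_utc(stamp: str) -> datetime:
    """Parse 'M/D/YYYY, H:MM:SS AM/PM' as IST and return the UTC equivalent."""
    local = datetime.strptime(stamp, "%m/%d/%Y, %I:%M:%S %p").replace(tzinfo=IST)
    return local.astimezone(timezone.utc)


issue_utc = ist_to_utc("5/30/2021, 9:38:31 AM")
# Format like the ESXi log timestamps so the value can be searched for directly.
print(issue_utc.strftime("%Y-%m-%dT%H:%M:%SZ"))
```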
| vmnic | PCI bus address | link | speed (Mbps) | duplex | MTU | driver | driver version | firmware version | MAC address | VID | DID | SVID | SDID | name |
| vmnic0 | 0000:04:00.0 | Up | 10000 | Full | 9000 | nmlx5_core | 4.17.70.1 | 14.27.4000 | 9c:dc:71:49:20:00 | 15b3 | 1015 | 1590 | 00d3 | Mellanox Technologies MT27710 Family [ConnectX-4 Lx] |
| vmnic1 | 0000:04:00.1 | Up | 10000 | Full | 9000 | nmlx5_core | 4.17.70.1 | 14.27.4000 | 9c:dc:71:49:20:01 | 15b3 | 1015 | 1590 | 00d3 | Mellanox Technologies MT27710 Family [ConnectX-4 Lx] |
Hostd Logs:
- Reviewed the Hostd logs. The issue started when one of the uplinks, vmnic0,
went down and was moved out of the link aggregation group:
2021-05-30T04:05:17.210Z info
hostd[2103014] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277722 : LACP
warning: uplink vmnic0 on VDS DvsPortset-0 is moved out of link aggregation
group.
2021-05-30T04:05:18.000Z
info hostd[2129525] [Originator@6876 sub=Hostsvc.VmkVprobSource]
VmkVprobSource::Post event: (vim.event.EventEx) {
- After this, datastore connectivity issues begin to appear on the ESXi host:
2021-05-30T04:08:35.649Z warning
hostd[2102988] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘b7638e5a-87e6-d995-1d6f-9cdc7149f0d0’.
2021-05-30T04:08:35.651Z
info hostd[2102988] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277732 :
Lost access to volume 5a8e63b7-6fe5bf3f-b4c2-9cdc7149f0d0
(b7638e5a-87e6-d995-1d6f-9cdc7149f0d0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.653Z
warning hostd[2103148] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘920e755f-8406-5114-aead-9cdc7149d7a0’.
2021-05-30T04:08:35.658Z
info hostd[2103148] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277733 :
Lost access to volume 5f750e93-a6a6743c-3e41-9cdc7149d7a0
(920e755f-8406-5114-aead-9cdc7149d7a0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.658Z
warning hostd[2713025] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘10001a5f-bf31-9a55-4ab1-9cdc7149e750’.
2021-05-30T04:08:35.662Z
info hostd[2713025] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277734 :
Lost access to volume 5f1a0010-f8e7c778-4835-9cdc7149e750
(10001a5f-bf31-9a55-4ab1-9cdc7149e750) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.662Z
warning hostd[2103147] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘961da25a-f2c3-a5bb-acd4-9cdc715e41e0’.
2021-05-30T04:08:35.664Z
info hostd[2103147] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277735 :
Lost access to volume 5aa21d96-19f04dfb-cdc2-9cdc715e41e0
(961da25a-f2c3-a5bb-acd4-9cdc715e41e0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.664Z
warning hostd[2110436] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘b400765f-d707-f5da-fab9-9cdc715e41e0’.
2021-05-30T04:08:35.665Z
info hostd[2110436] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277736 :
Lost access to volume 5f7600b4-416f35fd-4480-9cdc715e41e0
(b400765f-d707-f5da-fab9-9cdc715e41e0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.665Z
warning hostd[2103014] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘c6638e5a-6a3c-9e2d-8ca3-9cdc7149f0d0’.
2021-05-30T04:08:35.666Z
info hostd[2103014] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277737 :
Lost access to volume 5a8e63c6-0899ddc1-1443-9cdc7149f0d0
(c6638e5a-6a3c-9e2d-8ca3-9cdc7149f0d0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.667Z
warning hostd[2107908] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘6cb49c5a-b426-1183-4906-e0071b770f00’.
2021-05-30T04:08:35.669Z
info hostd[2107908] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277738 :
Lost access to volume 5a9cb46c-596f050b-cb5e-e0071b770f00
(6cb49c5a-b426-1183-4906-e0071b770f00) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.670Z
warning hostd[2129527] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘8c08dc5a-8f1c-73e2-102e-9cdc714a6310’.
2021-05-30T04:08:35.672Z
info hostd[2129527] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277739 :
Lost access to volume 5adc088c-83a78d30-daea-9cdc714a6310
(8c08dc5a-8f1c-73e2-102e-9cdc714a6310) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.677Z
warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘94bc155f-1f20-9dbb-cf87-e0071b8303f0’.
2021-05-30T04:08:35.684Z
info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277740 :
Lost access to volume 5f15bc94-b0928f22-ae54-e0071b8303f0
(94bc155f-1f20-9dbb-cf87-e0071b8303f0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.709Z
warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘11001a5f-fcdd-0c84-52d8-9cdc71492080’.
2021-05-30T04:08:35.713Z
info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277741 :
Lost access to volume 5f1a0011-b54f076e-8471-9cdc71492080
(11001a5f-fcdd-0c84-52d8-9cdc71492080) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.713Z
warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘107e195f-b354-a0df-1845-9cdc7149d7a0’.
2021-05-30T04:08:35.716Z
info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277742 :
Lost access to volume 5f197e10-74e0e03f-09b4-9cdc7149d7a0
(107e195f-b354-a0df-1845-9cdc7149d7a0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.719Z
warning hostd[2103576] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘c5f64e5e-9349-eb6c-593c-9cdc7149f0c0’.
2021-05-30T04:08:35.721Z
info hostd[2103576] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277743 :
Lost access to volume 5e4ef6c5-3365b617-eeb7-9cdc7149f0c0
(c5f64e5e-9349-eb6c-593c-9cdc7149f0c0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.721Z
warning hostd[2713024] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘ad40185f-73fa-9a18-e97c-9cdc7149f070’.
2021-05-30T04:08:35.722Z
info hostd[2713024] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277744 :
Lost access to volume 5f1840ad-81f34605-eb01-9cdc7149f070
(ad40185f-73fa-9a18-e97c-9cdc7149f070) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.723Z
warning hostd[2713024] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘0909765f-c307-75ff-6613-e0071b7bbef0’.
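To tally the affected datastores without reading every entry, the "Lost access to volume" events can be pulled out with a short script. This is a sketch only: the sample text is copied from the excerpt above, the function name `lost_access_events` is hypothetical, and the pattern assumes the wrapped line layout used in this report rather than the raw hostd.log format.

```python
import re

# Sample copied from the hostd excerpt in this report (lines are wrapped here).
SAMPLE = """
2021-05-30T04:08:35.649Z warning
hostd[2102988] [Originator@6876 sub=Hostsvc.VmkVprobSource] Can’t find
datastore ‘b7638e5a-87e6-d995-1d6f-9cdc7149f0d0’.
2021-05-30T04:08:35.651Z
info hostd[2102988] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277732 :
Lost access to volume 5a8e63b7-6fe5bf3f-b4c2-9cdc7149f0d0
(b7638e5a-87e6-d995-1d6f-9cdc7149f0d0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
2021-05-30T04:08:35.658Z
info hostd[2103148] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 277733 :
Lost access to volume 5f750e93-a6a6743c-3e41-9cdc7149d7a0
(920e755f-8406-5114-aead-9cdc7149d7a0) due to connectivity issues. Recovery
attempt is in progress and outcome will be reported shortly.
"""

TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"
EVENT = re.compile(
    rf"^(?P<ts>{TS}) info hostd\[\d+\] .*?Event (?P<event>\d+) : "
    r"Lost access to volume (?P<volume>\S+) \((?P<datastore>[^)]+)\)"
)


def lost_access_events(text):
    """Return one dict per 'Lost access to volume' event found in the text."""
    # Undo the line wrapping, then cut the text into one record per timestamp.
    flat = " ".join(text.split())
    records = re.split(rf"(?={TS})", flat)
    return [m.groupdict() for r in records if (m := EVENT.match(r))]


for ev in lost_access_events(SAMPLE):
    print(ev["ts"], ev["event"], ev["volume"], ev["datastore"])
```

Splitting on timestamps first keeps each match inside a single log record, so a warning line cannot lend its timestamp to the event that follows it.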
VOBD Logs:
- The VOBD logs show VMFS heartbeat timeout errors for the datastores:
2021-05-30T04:08:35.663Z: [vmfsCorrelator]
6461638394563us: [esx.problem.vmfs.heartbeat.timedout]
5a742056-7970936b-04b8-9cdc71492000 5620745a-09ad-03ff-8b9a-9cdc71492000
2021-05-30T04:08:35.663Z:
[vmfsCorrelator] 6461703690483us: [vob.vmfs.heartbeat.timedout]
5e5ec93b-92524db8-fd3d-e0071b83e250 3bc95e5e-bf65-9aec-4035-e0071b83e250
2021-05-30T04:08:35.663Z:
[vmfsCorrelator] 6461638394799us: [esx.problem.vmfs.heartbeat.timedout]
5e5ec93b-92524db8-fd3d-e0071b83e250 3bc95e5e-bf65-9aec-4035-e0071b83e250
2021-05-30T04:08:35.663Z:
[vmfsCorrelator] 6461703690487us: [vob.vmfs.heartbeat.timedout]
5f16e861-d6abf240-1262-e0071b777f90 61e8165f-39bb-7160-16d6-e0071b777f90
2021-05-30T04:08:35.663Z:
[vmfsCorrelator] 6461638395013us: [esx.problem.vmfs.heartbeat.timedout]
5f16e861-d6abf240-1262-e0071b777f90 61e8165f-39bb-7160-16d6-e0071b777f90
- Because the datastore became inaccessible, the virtual machine was
terminated:
2021-05-30T04:09:27.217Z: [VMCorrelator]
6461755259413us: [vob.vm.kill.unexpected.fault.failure] The virtual machine
using the configuration file
/vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/lva20bmciias01v.vmx
could not fault in a page from the swap file at
/vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/lva20bmciias01v-3ca8d9fd.vswp.
The virtual machine has been powered off.
2021-05-30T04:09:27.310Z: [VMCorrelator] 6461690042246us: [esx.problem.vm.kill.unexpected.fault.failure.2]
/vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/lva20bmciias01v.vmx
could not fault in a guest physical page from the hypervisor level swap file on
vsan:5242821db07941cc-e8cc95162cb58c8c. The VM is terminated as further
progress is impossible.
2021-05-30T04:09:27.311Z:
No correlator for vob.vm.kill.panic
2021-05-30T04:09:44.395Z: [UserWorldCorrelator]
6461772437877us: [vob.uw.core.dumpFailed] /bin/vmx(2109526)
/vmfs/volumes/vsan:5242821db07941cc-e8cc95162cb58c8c/680b3c5e-be68-e38f-6688-e0071b77cfb0/vmx-zdump.000
dump failed
2021-05-30T04:09:44.395Z: [UserWorldCorrelator] 6461707127260us:
[esx.problem.application.core.dumpFailed] An application (/bin/vmx) running on
ESXi host has crashed (3 time(s) so far), but core dump creation failed.
2021-05-30T04:32:43.429Z: [netCorrelator]
6463151484003us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic1 is
down. Affected dvPort: 3199/50 0c 59 99 82 ac a7 b4-60 e5 52 4a cd c8 6b eb. 1
uplinks up. Failed criteria: 128
2021-05-30T04:32:43.429Z:
[netCorrelator] 6463151484015us: [vob.net.dvport.uplink.transition.down]
Uplink: vmnic1 is down. Affected dvPort: 2048/50 0c 59 99 82 ac a7 b4-60 e5 52
4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128
2021-05-30T04:32:43.429Z:
[netCorrelator] 6463151484019us: [vob.net.dvport.uplink.transition.down]
Uplink: vmnic1 is down. Affected dvPort: 3200/50 0c 59 99 82 ac a7 b4-60 e5 52
4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128
2021-05-30T04:32:43.429Z:
[netCorrelator] 6463151484023us: [vob.net.dvport.uplink.transition.down]
Uplink: vmnic1 is down. Affected dvPort: 797/50 0c 59 99 82 ac a7 b4-60 e5 52
4a cd c8 6b eb. 1 uplinks up. Failed criteria: 128
VMKernel Logs:
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS:
MasterCheckNode:7921: Lost contact with backup
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS:
CMMDSHeartbeatCheckHBLogWork:733: Check node returned Failure for node
00000000-0000-0000-0000-e0071b770f00 count 5
2021-05-30T04:08:29.347Z
cpu21:2099517)CMMDS: CMMDSStateDestroyNode:689: Destroying node 00000000-0000-0000-0000-e0071b770f00:
Heartbeat timeout
2021-05-30T04:08:29.347Z cpu21:2099517)CMMDS: MasterLostBackup:426:
Master Failover: MUUID bd0fb360-7ddf-115a-2770-9cdc71492000 old
66c36d60-69f7-e8b1-c98a-9cdc71492000
2021-05-30T04:08:29.347Z
cpu21:2099517)CMMDS: MasterRemoveNodeFromMembership:6771: Removing node
00000000-0000-0000-0000-e0071b770f00 from the cluster membership
2021-05-30T04:08:33.667Z cpu4:2099558)DOM:
DOMLeafSubscribeSSDHealth:3270: Failed to retrieve/unmarshal disk entry
`523ae4b2-acb4-289a-b36a-7824e15fe1a0` for leaf object
`0f2a215e-4cf7-78aa-7b9b-e0071b83e250`: Not found (0xbad0003)
2021-05-30T04:08:33.727Z
cpu66:2099556)DOM: DOMLeafSubscribeSSDHealth:3270: Failed to retrieve/unmarshal
disk entry `528a0494-6a67-4815-5f18-d33919ac2917` for leaf object
`ba86415e-e5f5-ea26-40fe-9cdc7149f0c0`: Not found (0xbad0003)
2021-05-30T04:08:35.681Z cpu20:2099517)CMMDS:
CMMDSHeartbeatCheckHBLogWork:733: Check node returned Failure for node
00000000-0000-0000-0000-e0071b7bbed0 count 11
2021-05-30T04:08:35.776Z
cpu20:2099517)CMMDS: CMMDSHeartbeatCheckHBLogWork:733: Check node returned
Failure for node 00000000-0000-0000-0000-9cdc7149f0d0 count 11
2021-05-30T04:08:35.776Z
cpu11:2099560)DOM: DOMLeafSubscribeSSDHealth:3270: Failed to retrieve/unmarshal
disk entry `5276c5d3-f670-002d-b5a4-b5ba31cd5256` for leaf object
`bd4cab5e-91a8-9a1d-c91a-e0071b77cfb0`: Not found (0xbad0003)
2021-05-30T04:08:35.868Z
cpu20:2099517)CMMDS: CMMDSHeartbeatCheckHBLogWork:733: Check node returned
Failure for node 00000000-0000-0000-0000-e0071b77fed0 count 11
2021-05-30T04:08:39.906Z
cpu15:2103576)HBX: 3041: ‘30e21a5f-23a7-422a-e39c-e0071b77fef0’: HB at offset
3698688 – Waiting for timed out HB:
2021-05-30T04:08:41.023Z
cpu6:2713024)HBX: 3041: ‘b7638e5a-87e6-d995-1d6f-9cdc7149f0d0’: HB at offset
3698688 – Waiting for timed out HB:
2021-05-30T04:08:41.065Z
cpu42:2102986)HBX: 3041: ‘338ac95a-9328-2bad-d340-9cdc7149f0e0’: HB at offset
3698688 – Waiting for timed out HB:
2021-05-30T04:08:41.078Z
cpu53:2110436)HBX: 3041: ‘c6638e5a-6a3c-9e2d-8ca3-9cdc7149f0d0’: HB at offset
3698688 – Waiting for timed out HB:
2021-05-30T04:08:44.316Z
cpu12:2102981)HBX: 3041: ‘10001a5f-bf31-9a55-4ab1-9cdc7149e750’: HB at offset
3698688 – Waiting for timed out HB:
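VM Name: lva20bmciias01v
- The VM logs do not contain much detail from the time of the issue. The
entries jump from 2021-03-25 to 2021-05-30T04:11:10Z, where the logged
hostname is abc01xsdi013 rather than abc01xsdi001.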
2021-03-16T09:31:49.596Z| vmx| I125: Hostname=abc01xsdi001.sdi.corp.abc.com
2021-03-25T18:45:28.020Z| vmx| I125: VigorTransportProcessClientPayload:
opID=HB-SpecSync-host-46@591342-13468d0e-8-fb83 seq=920003: Receiving
Sched.SetResourceGroup request.
2021-03-25T18:45:28.020Z| vmx| I125: VigorTransport_ServerSendResponse
opID=HB-SpecSync-host-46@591342-13468d0e-8-fb83 seq=920003: Completed Sched
request.
2021-03-25T18:45:37.918Z| vmx| I125: VigorTransportProcessClientPayload:
opID=HB-SpecSync-host-46@591344-7b1c8008-9a-fc3e seq=920013: Receiving
Sched.SetResourceGroup request.
2021-03-25T18:45:37.918Z| vmx| I125: VigorTransport_ServerSendResponse
opID=HB-SpecSync-host-46@591344-7b1c8008-9a-fc3e seq=920013: Completed Sched
request.
2021-05-30T04:11:10.333Z| vmx| I125: Hostname=abc01xsdi013.sdi.corp.abc.com
2021-05-30T04:11:10.333Z| vmx| I125: System uptime 5230886026632 us
Conclusion:
- Based on the logs, the issue appears to have started when uplink vmnic0 on
VDS DvsPortset-0 was moved out of the link aggregation group (04:05:17Z),
after which the datastores became inaccessible (from 04:08:35Z).
- Because the datastores were inaccessible, the virtual machine was terminated
(04:09:27Z).
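The sequence behind this conclusion can be laid out as offsets from the first event. The sketch below is illustrative only: the timestamps are copied from the log excerpts in this report, while the labels and the helper names `parse` and `offsets` are my own.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this report (all UTC).
EVENTS = [
    ("hostd: vmnic0 moved out of link aggregation group", "2021-05-30T04:05:17.210Z"),
    ("vmkernel: CMMDS lost contact with backup", "2021-05-30T04:08:29.347Z"),
    ("hostd: first 'Can't find datastore' warning", "2021-05-30T04:08:35.649Z"),
    ("vobd: VM powered off after swap page fault failure", "2021-05-30T04:09:27.217Z"),
    ("vobd: uplink vmnic1 reported down", "2021-05-30T04:32:43.429Z"),
]


def parse(ts):
    """Parse an ESXi-style UTC timestamp with millisecond precision."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def offsets(events):
    """Return (label, seconds since the first event) for each event."""
    start = parse(events[0][1])
    return [(label, (parse(ts) - start).total_seconds()) for label, ts in events]


for label, seconds in offsets(EVENTS):
    print(f"+{seconds:9.3f}s  {label}")
```

Laid out this way, the datastore warnings begin a little over three minutes after the LACP event, and the vmnic1 down event follows roughly 27 minutes later.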
Action Plan:
- I can see that you have already raised a case with the vSAN team, who are
currently investigating why a single NIC failure led to this outage.
I recommend continuing with that case for more details.
- On the ESXi host, the network adapters are currently running driver version
4.17.70.1 and firmware version 14.27.4000.
However, as per the VMware Compatibility Matrix: https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=42311&deviceCategory=io&details=1&VID=15b3&DID=1015&SVID=1590&SSID=00d3&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc
the supported firmware version for driver version 4.17.70.1 is 14.27.1016.
- I recommend checking with the hardware vendor to confirm whether you are
running a supported firmware version; if not, update the firmware accordingly.
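The gap between the installed and listed firmware can be expressed as a simple dotted-version comparison. This is a rough sketch with hypothetical helper names (`parse_version`, `firmware_status`); it only compares the numbers and does not decide supportability, which the compatibility matrix and the hardware vendor determine.

```python
def parse_version(text):
    """Turn a dotted version such as '14.27.4000' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def firmware_status(installed, listed):
    """Compare the installed firmware with the version listed in the matrix."""
    a, b = parse_version(installed), parse_version(listed)
    if a == b:
        return "matches listed version"
    return "newer than listed version" if a > b else "older than listed version"


# Values taken from this report: installed firmware vs. the version listed
# for driver 4.17.70.1.
print(firmware_status("14.27.4000", "14.27.1016"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "14.9" as greater than "14.27".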