RCA 38: ESXi Host Disconnected from vCenter Server

 

Hostname: abcesx103.abc.com

 

ESXi Version: ESXi 6.0 EP 15

As per KB article https://kb.vmware.com/s/article/66977, vSphere 6.0 is already out of General Support.

 

 

| vmnic | PCI bus address | link | speed | duplex | MTU | driver | driver version | firmware version | MAC address | VID | DID | SVID | SDID | name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vmnic0 | 0000:04:00.0 | Up | 1000 | Full | 1500 | nx_nic | 0.25744213 | 0.173564815 | 80:c1:6e:6f:e3:14 | 4040 | 100 | 103c | 705a | NetXen HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter |
| vmnic1 | 0000:04:00.1 | Up | 1000 | Full | 1500 | nx_nic | 0.25744213 | 0.173564815 | 80:c1:6e:6f:e3:15 | 4040 | 100 | 103c | 705a | NetXen HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter |
| vmnic2 | 0000:04:00.2 | Up | 1000 | Full | 1500 | nx_nic | 0.25744213 | 0.173564815 | 80:c1:6e:6f:e3:16 | 4040 | 100 | 103c | 705a | NetXen HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter |
| vmnic3 | 0000:04:00.3 | Up | 1000 | Full | 1500 | nx_nic | 0.25744213 | 0.173564815 | 80:c1:6e:6f:e3:17 | 4040 | 100 | 103c | 705a | NetXen HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter |
| vmnic4 | 0000:10:00.0 | Up | 1000 | Full | 1500 | e1000e | 3.2.2.1-NAPI | 5.12-2 | 00:26:55:db:a8:d1 | 8086 | 10bc | 103c | 704b | Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) |
| vmnic5 | 0000:10:00.1 | Up | 1000 | Full | 1500 | e1000e | 3.2.2.1-NAPI | 5.12-2 | 00:26:55:db:a8:d0 | 8086 | 10bc | 103c | 704b | Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) |
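
For reference, the NIC inventory above can be regenerated from the ESXi Shell (assuming SSH/local shell access to the host; the output layout varies slightly between ESXi builds):

# Summary view: driver, link state, speed, duplex, MTU and MAC address for every uplink
esxcli network nic list

# Detailed view for a single uplink, including the driver and firmware versions
esxcli network nic get -n vmnic0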

 

  • As per the VMware HCL:

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=9278&deviceCategory=io&details=1&VID=4040&DID=0100&SVID=103c&SSID=705a&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

 

  • Driver version 6.0.643 is compatible with firmware 4.0.595; however, we are running firmware version 4.0.596 (the device IDs used for the HCL lookup can be confirmed on the host as shown below).
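
The device IDs (VID/DID/SVID/SSID) used in the HCL search above can be confirmed directly on the host; a minimal check, assuming shell access:

# Prints PCI VID:DID SVID:SSID for each uplink, e.g. "0000:04:00.0 4040:0100 103c:705a vmkernel vmnic0"
vmkchdev -l | grep vmnic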

 

 

VMKsummary:

 

  • Last Boot:

 

2021-03-01T16:16:22Z bootstop: Host has booted
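
The boot marker above comes from /var/log/vmksummary.log; the same check can be repeated on the host (assuming the default log location):

# Boot/shutdown markers and the hourly heartbeat summaries
grep -i boot /var/log/vmksummary.log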

 

VOBD Logs:

 

  • Reviewed the VOBD logs; no issues were found around the time of the problem:

 

2021-03-01T16:14:38.601Z: [netCorrelator] 60982188us: [vob.net.vmnic.linkstate.up] vmnic vmnic4 linkstate up
2021-03-01T16:14:39.002Z: [netCorrelator] 61383500us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic4 linkstate is up
2021-03-01T16:14:39.002Z: An event (esx.clear.net.vmnic.linkstate.up) could not be sent immediately to hostd; queueing for retry.
2021-03-01T16:14:39.104Z: [netCorrelator] 61484667us: [vob.net.vmnic.linkstate.up] vmnic vmnic5 linkstate up
2021-03-01T16:14:40.002Z: [netCorrelator] 62383635us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic5 linkstate is up
2021-03-01T16:14:40.002Z: An event (esx.clear.net.vmnic.linkstate.up) could not be sent immediately to hostd; queueing for retry.
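
A simple filter to pull the relevant observations out of the VOBD log (illustrative only, assuming the default log location):

# NIC link-state observations raised by the netCorrelator
grep -iE "vmnic|linkstate" /var/log/vobd.log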

 

 

Hostd Logs:

 

  • Reviewed the hostd logs; we can see a refresh task running before the time of the issue, but there is no specific error that could have crashed the host.

 

2021-03-01T16:15:46.161Z verbose hostd[FF94DB20] [Originator@6876 sub=Hostsvc.FSVolumeProvider] RefreshVMFSVolumes: refreshed volume, id 5ef0a76e-519f1a44-aaf1-002655dbbf1d, name abcvmax-ESXi_R5, version 5
2021-03-01T16:15:47.504Z verbose hostd[FF94DB20] [Originator@6876 sub=Hostsvc.FSVolumeProvider] RefreshVMFSVolumes: refreshed volume, id 5f047095-98f8bec1-3532-0017a4774010, name abcvmax_VM-001, version 5
2021-03-01T16:15:49.232Z verbose hostd[FF94DB20] [Originator@6876 sub=Hostsvc.FSVolumeProvider] RefreshVMFSVolumes: refreshed volume, id 5f0470dd-62ad6da6-0758-0017a4774010, name abcvmax_VM-003, version 5
2021-03-01T16:15:50.938Z verbose hostd[FF94DB20] [Originator@6876 sub=Hostsvc.FSVolumeProvider] RefreshVMFSVolumes: refreshed volume, id 5f0470b8-1910c456-b149-0017a4774010, name abcvmax_VM-002, version 5
2021-03-01T16:15:51.625Z verbose hostd[FF94DB20] [Originator@6876 sub=Hostsvc.FSVolumeProvider] RefreshVMFSVolumes: refreshed volume, id 51f6a550-460c0e78-f563-ac162dbd8980, name abcvmax-netscout001, version 5
2021-03-01T16:15:53.465Z verbose hostd[FF94DB20] [Originator@6876 sub=Hostsvc.FSVolumeProvider] RefreshVMFSVolumes: refreshed volume, id 5f047110-e35fb5f2-8741-0017a4774010, name abcvmax_VM-004, version 5
2021-03-01T16:15:53.755Z info hostd[FF94DB20] [Originator@6876 sub=Hostsvc.Datastore] Refresh: VMFS Datastore name was changed from '5ef0a76e-519f1a44-aaf1-002655dbbf1d' to 'abcvmax-ESXi_R5'.
2021-03-01T16:15:54.473Z info hostd[FF94DB20] [Originator@6876 sub=Hostsvc.Datastore] Refresh: VMFS Datastore name was changed from '5f047095-98f8bec1-3532-0017a4774010' to 'abcvmax_VM-001'.
2021-03-01T16:15:55.264Z info hostd[FF94DB20] [Originator@6876 sub=Hostsvc.Datastore] Refresh: VMFS Datastore name was changed from '5f0470dd-62ad6da6-0758-0017a4774010' to 'abcvmax_VM-003'.

  • Here we can see that the Server Restarted:

 

-->    fullName = "VMware ESX build-9313334",
-->    version = "6.0.0",
-->    build = "9313334",
2021-03-01T16:16:04.776Z info hostd[FF94DB20] [Originator@6876 sub=Hostsvc.HaHost] Initialized StoragerRM/Vmsvc powering-on interceptor
2021-03-01T16:16:15.643Z info hostd[FF94DB20] [Originator@6876 sub=Default] BEGIN SERVICES
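
A hostd restart (and, as in this case, a full host restart) can be spotted by searching for the service start banner; a rough example against the default hostd log path:

# The build banner and "BEGIN SERVICES" mark hostd (re)initialization
grep -E "BEGIN SERVICES|build-9313334" /var/log/hostd.log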

 

 

VMKernel Logs:

 

  • In the VMkernel logs we are not able to find many details:

 

2021-03-01T16:01:21.839Z cpu2:34405)FSS: 6264: Failed to open file 'slotsfile'; Requested flags 0x8, world: 34405 [storageRM], (Existing flags 0x0, world: 0 [unknown]): Busy
2021-03-01T16:02:13.728Z cpu38:33187)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.60000970000192605719533030363733".
2021-03-01T16:02:13.740Z cpu48:33186)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.60000970000192605719533030363546".
2021-03-01T16:02:13.745Z cpu0:33190)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.60000970000192605719533030363233".
2021-03-01T16:02:13.747Z cpu33:33189)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.60000970000192605719533030363442".
2021-03-01T16:02:13.748Z cpu33:33189)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:1099: Could not select path for device "naa.60000970000192605719533030363337".
2021-03-01T16:06:32.871Z cpu63:34405)DLX: 3876: vol 'abcvmax_VM-003', lock at 115658752: [Req mode 1] Checking liveness:
VMB: 49: mbMagic: 2badb002, mbInfo 0x1014b8
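
The warnings above can be isolated from the VMkernel log with a filter such as the following (illustrative only; narrow it to the outage window as needed):

# Storage path-selection and lock-liveness messages
grep -iE "WARNING|psp_rr|Checking liveness" /var/log/vmkernel.log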

 

 

 

FDM Logs:

 

  • Here we can see that the master for the cluster is host-633, whose logs could have more details about this issue:

 

2021-03-01T16:02:20.702Z info fdm[34E81B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterManagerImpl::LogState] hostId=host-602 state=Slave master=host-633 isolated=false host-list-version=144 config-version=270 vm-metadata-version=1187 slv-mst-tdiff-sec=0
2021-03-01T16:16:21.027Z verbose fdm[FFA707C0] [Originator@6876 sub=Default] Dumping early logs:
------ Early init logs start --------
2021-03-01T16:16:21.026Z info -[FFA707C0] [Originator@6876 sub=Default] Glibc malloc guards disabled.
2021-03-01T16:16:21.026Z info -[FFA707C0] [Originator@6876 sub=Default] Initialized SystemFactory
------ Early init logs end   --------
2021-03-01T16:16:21.027Z info fdm[FFA707C0] [Originator@6876 sub=Default] Logging uses fast path: false
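
The HA master can be identified from the host's own FDM log; a quick check, assuming the default log location:

# The ClusterManagerImpl::LogState entries record this host's role and the current master
grep -i "master=host" /var/log/fdm.log | tail -5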

 

IPMI Logs:

 

  • The IPMI logs do not appear to have been updated since 2018:

 

Record:12:
   Record Id: 12
   When: 2018-01-06T12:28:58
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Power Supply Failure status
   Sensor Number: 6
   Raw:
   Formatted-Raw: 0c 00 02 0a c1 50 5a 41 00 04 08 06 6f 01 ff ff
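
The SEL entries above were taken from the support bundle. As a side note, newer ESXi releases (6.5 and later) expose the SEL through esxcli; this is mentioned only for reference, since on 6.0 the hardware vendor's tool (for example HPE iLO) remains the authoritative source:

# Not available on ESXi 6.0; listed here for reference on newer releases
esxcli hardware ipmi sel list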

 

 

 

FDM Master:

 

  • From the master's logs we can see that the ESXi host lost heartbeats with the master node and was considered Dead:

 

2021-03-01T16:07:50.796Z verbose fdm[54550B70] [Originator@6876 sub=Placement opID=SWI-ff522be] [RR::ResetAwaitingChangeVms] 0 Vms have been updated
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Slave host-602 heartbeating again => Dead to Partitioned

2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Stopping datastore heartbeat checks for slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] host-602 /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterManagerImpl::StopHBDatastoreChecking] datastore /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010 slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StopHBDatastoreChecking] slave host-602 mac 00:26:55:db:a8:d1
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] host-602 /vmfs/volumes/5f047110-e35fb5f2-8741-0017a4774010
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterManagerImpl::StopHBDatastoreChecking] datastore /vmfs/volumes/5f047110-e35fb5f2-8741-0017a4774010 slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StopHBDatastoreChecking] slave host-602 mac 00:26:55:db:a8:d1
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StartHBDatastoreChecking] path /vmfs/volumes/5f047110-e35fb5f2-8741-0017a4774010 slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StartHBDatastoreChecking] slave host-602 uuid.mac 00:26:55:db:a8:d1
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StartHBDatastoreChecking] Forcing heartbeat check on datastore /vmfs/volumes/5f047110-e35fb5f2-8741-0017a4774010 for slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StartHBDatastoreChecking] path /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010 slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StartHBDatastoreChecking] slave host-602 uuid.mac 00:26:55:db:a8:d1
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::StartHBDatastoreChecking] Forcing heartbeat check on datastore /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010 for slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Starting datastore heartbeat checking for slave host-602
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::CheckSlaveHeartbeats] Checking heartbeat datastore /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010 on localhost
2021-03-01T16:07:51.180Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] [ClusterDatastore::CheckSlaveHeartbeats] Checking heartbeat datastore /vmfs/volumes/5f047110-e35fb5f2-8741-0017a4774010 on localhost
2021-03-01T16:07:51.181Z info fdm[539C2B70] [Originator@6876 sub=Invt opID=SWI-26bbf021] [HostStateChange::SaveToInventory] host host-602 changed state: Partitioned
2021-03-01T16:07:51.181Z verbose fdm[539C2B70] [Originator@6876 sub=Vmcp opID=SWI-26bbf021] Canceling VM reservation for non-live host host-602
2021-03-01T16:07:53.834Z verbose fdm[543CAB70] [Originator@6876 sub=FDM] [EventManagerImpl] Received event from host-617

 

  • We can see that the ESXi Host state changed from Partitioned to Dead:

 

2021-03-01T16:08:06.198Z verbose fdm[54181B70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] Waited 15 seconds for disk heartbeat for host host-602 - declaring dead
2021-03-01T16:08:06.201Z info fdm[5440BB70] [Originator@6876 sub=Invt opID=SWI-1dd14802] [HostStateChange::SaveToInventory] host host-602 changed state: Dead
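
To trace these transitions, the fdm.log from the master's log bundle can be filtered for the affected host; a minimal sketch:

# Every inventory state change recorded for the affected slave
grep "host-602" /var/log/fdm.log | grep -i "changed state"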

 

  • Slave states at the time of the issue:

 

--> Slave states (17):
-->     host-720: Live
-->     host-801: Live
-->     host-602: Dead
-->     host-736: Live
-->     host-777: Live
-->     host-750: Live
-->     host-704: Live
-->     host-693: Live
-->     host-648: Live
-->     host-593: Live
-->     host-640: Live
-->     host-679: Live
-->     host-764: Live
-->     host-617: Live
-->     host-791: Live
-->     host-579: Live
-->     host-660: Live

 

  • After the reboot, the state of the ESXi host came back to Live:

 

2021-03-01T16:09:41.086Z verbose fdm[545D2B70] [Originator@6876 sub=Invt opID=SWI-13f56d05] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010/abcaseemdb001/abcaseemdb001.vmx changed  guestHB=red
2021-03-01T16:09:52.454Z verbose fdm[531C2B70] [Originator@6876 sub=Invt opID=SWI-5fc992a1] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5f047110-e35fb5f2-8741-0017a4774010/abcintlqueuing003/abcintlqueuing003.vmx changed  guestHB=red
2021-03-01T16:13:48.957Z verbose fdm[543CAB70] [Originator@6876 sub=Invt opID=SWI-71d88f31] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5f0470dd-62ad6da6-0758-0017a4774010/abcsysdb001/abcsysdb001.vmx changed  guestHB=red
2021-03-01T16:15:06.243Z verbose fdm[53940B70] [Originator@6876 sub=Invt opID=SWI-49db1f66] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5f047095-98f8bec1-3532-0017a4774010/abcevent109/abcevent109.vmx changed toolsStatus=toolsNotRunning
2021-03-01T16:15:06.614Z info fdm[53181B70] [Originator@6876 sub=Invt opID=SWI-5560ab4d] [HostStateChange::SaveToInventory] host host-602 changed state: FDMUnreachable
2021-03-01T16:16:23.545Z info fdm[54654B70] [Originator@6876 sub=Invt opID=SWI-6b9c10a4] [HostStateChange::SaveToInventory] host host-602 changed state: Live

 

 

 

Conclusion:

 

  • Based on the logs, we can see that the ESXi host lost communication with the vSphere HA master node and was marked as Dead. However, we are not able to find anything conclusive about the reason for the restart of the ESXi host.

 

As per KB article https://kb.vmware.com/s/article/1019238:

 

  • We do not see any BlueScreenTimeout.
  • No power-button task initiated the boot.
  • There is no haTask-ha-host-vim.HostSystem.reboot task in the hostd logs, which would have been a sign that the reboot was triggered by vCenter (the corresponding checks are sketched below).
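
For reference, these checks from KB 1019238 translate roughly into the following commands (illustrative; paths assume the default log locations):

# A reboot initiated through vCenter or the host client leaves a task entry in hostd.log (none was found here)
grep -i "haTask-ha-host-vim.HostSystem.reboot" /var/log/hostd.log

# A non-zero blueScreenTimeout would allow an automatic restart after a PSOD
esxcli system settings kernel list -o blueScreenTimeout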

 

 

Action Plan:

 

  • We recommend engaging your hardware vendor to run extensive hardware diagnostics on the ESXi host.
  • Check whether ASR (Automatic Server Recovery) is enabled for the ESXi host; ASR can restart the host without an OS-level crash being logged.
  • Please check with your hardware vendor to confirm whether the currently running firmware version is supported with the driver as per the VMware HCL; the host is currently running firmware 4.0.596 while the certified version is 4.0.595.

 

Ashutosh Dixit

