RCA-3: Cluster Node Eviction on VMware Platform

From the Windows events (I am not sure which timezone the timestamps are in), the issue happened during the night: the machine was first unable to communicate with the DNS servers and was later removed from the failover cluster membership.

However, in the vCenter logs we were not able to find much information at this point, as most of the logs were missing for that time frame. Please go through the logs and let me know if you have any questions.

 

Host Information:

HostId       Hostname                   ipAddress1     HostdPort    version    build
host-1009    abcd-tand-abc18.abc.com    10.123.4.75    443          6.0.0      7504637
host-1053    abcd-tand-abc19.abc.com    10.123.4.76    443          6.0.0      7504637

 

Virtual Machine Name: HWBK-ABCD-DEV01

System Logs:

  • From the system logs we can see that the machine has been logging Event ID 1135 since 24th Jan. This indicates a loss of network communication, after which the cluster node was removed (evicted) from the failover cluster membership, as also seen in the Failover Cluster Manager console.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          25-01-2019 1.33.04 AM
Event ID:      1135
User:          SYSTEM
Computer:      HWBK-ABCD-DEV01.abc.com
Description: Cluster node 'BWBK-ABCD-DEV01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. 
  • Around the same time, the DNS servers were not responding to client requests:

 

Log Name:      System
Source:        Microsoft-Windows-DNS-Client
Date:          25-01-2019 12.00.10 AM
Event ID:      1014
Level:         Warning
Computer:      HWBK-ABCD-DEV01.abc.com
Description: Name resolution for the name osce11.icrc.trendmicro.com timed out after none of the configured DNS servers responded.
 
Log Name:      System
Source:        Microsoft-Windows-Time-Service
Date:          25-01-2019 12.01.27 AM
Event ID:      134
Level:         Warning
Computer:      HWBK-ABCD-DEV01.abc.com
Description: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on 'brast-ntppp01.abc.com,0x8'. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)
 
Log Name:      System
Source:        Microsoft-Windows-DNS-Client
Date:          25-01-2019 12.01.29 AM
Event ID:      8015
Level:         Warning
Computer:      HWBK-ABCD-DEV01.abc.com
Description:
The system failed to register host (A or AAAA) resource records (RRs) for network adapter
with settings:

Adapter Name : {BD3F2C50-AF88-4D3D-AFAB-FA57337ADC8A}
Host Name : HWBK-ABCD-DEV01
Primary Domain Suffix : abc.com
DNS server list : 172.18.60.159, 172.18.60.160, 172.18.80.82
IP Address(es) :10.108.1.50
The reason the system could not register these RRs was because the update request it sent to the DNS server timed out. The most likely cause of this is that the DNS server authoritative for the name it was attempting to register or update is not running at this time. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator.

Log Name:      System
Source:        Service Control Manager
Date:          25-01-2019 12.05.49 AM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      HWBK-ABCD-DEV01.abc.com
Description:
The Cluster Service service terminated with the following service-specific error: The wait operation timed out.
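
Because the timezone of the Windows event timestamps is not known, it helps to normalise them to UTC before comparing them with the VMware logs, which are stamped in UTC (the trailing Z). The following is a minimal Python sketch, assuming the export format shown above (DD-MM-YYYY h.mm.ss AM/PM) and a guest UTC offset that you supply yourself. The function name and the offset used in the example are hypothetical and are not taken from this environment.

```python
from datetime import datetime, timedelta, timezone


def event_time_to_utc(stamp: str, utc_offset_hours: float) -> datetime:
    """Convert a Windows event 'Date:' value to an aware UTC datetime.

    stamp            -- e.g. '25-01-2019 1.33.04 AM' (format seen in the export above)
    utc_offset_hours -- ASSUMPTION: the guest's offset from UTC. It has to be
                        confirmed on the guest itself (for example with 'tzutil /g').
    """
    # %I is the 12-hour clock and %p the AM/PM marker (English locale assumed).
    local = datetime.strptime(stamp, "%d-%m-%Y %I.%M.%S %p")
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


# Purely illustrative offset of 0, i.e. pretending the guest clock is already UTC.
print(event_time_to_utc("25-01-2019 12.00.10 AM", 0).isoformat())
```

Running the conversion with each candidate offset for the site shows which one, if any, lines the DNS warnings up with the events in the VMware logs.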

 

VMware Logs:

 

  • The VMware logs show issues on a different timeline. Around 22:44 UTC (10:44 PM GMT) on the 24th, the virtual machine started exhibiting performance issues.

 

FDM Logs:

 

  • From the FDM logs of the cluster nodes we can see that the VMware Tools status changed to "not running". This probably points to the VM being unresponsive, but we cannot be sure as there is not much information in the other logs:

 

2019-01-24T22:44:47.084Z verbose fdm[73B4BB70] [Originator@6876 sub=Invt opID=SWI-11902791] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-ABCD-DEV01/HWBK-ABCD-DEV01.vmx changed  guestHB=red
2019-01-24T22:52:42.224Z warning fdm[729C2B70] [Originator@6876 sub=Invt] [HalVmMonitor::GetIsCptFtFromFtInfo] Missing ftInfo. ftInfo is null for /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-ABCD-DEV01/HWBK-ABCD-DEV01.vmx
2019-01-24T22:52:54.325Z verbose fdm[73B0AB70] [Originator@6876 sub=Invt opID=SWI-1efbfe6d] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-ABCD-DEV01/HWBK-ABCD-DEV01.vmx changed toolsStatus=toolsNotRunning
2019-01-24T22:52:54.741Z verbose fdm[73640B70] [Originator@6876 sub=Invt opID=SWI-34de5f95] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-ABCD-DEV01/HWBK-ABCD-DEV01.vmx changed toolsStatus=toolsOk
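
When the FDM log is large, the heartbeat and Tools state changes for a VM can be pulled out with a small script instead of reading the file by hand. This is a minimal Python sketch; the regular expression is inferred only from the sample lines above and is an assumption, not a documented FDM log format, and the function name is hypothetical.

```python
import re

# Pattern inferred from the VmHeartbeatStateChange lines quoted above (assumption).
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)\s.*"
    r"\[VmHeartbeatStateChange::SaveToInventory\] vm (?P<vmx>\S+) changed\s+"
    r"(?P<key>\w+)=(?P<value>\w+)"
)


def state_changes(lines):
    """Yield (timestamp, vmx_path, key, value) for each heartbeat/tools change."""
    for line in lines:
        m = STATE_CHANGE.match(line)
        if m:
            yield m.group("ts", "vmx", "key", "value")


# Two of the lines quoted above, copied verbatim.
sample = [
    "2019-01-24T22:52:54.325Z verbose fdm[73B0AB70] [Originator@6876 sub=Invt opID=SWI-1efbfe6d] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-ABCD-DEV01/HWBK-ABCD-DEV01.vmx changed toolsStatus=toolsNotRunning",
    "2019-01-24T22:52:54.741Z verbose fdm[73640B70] [Originator@6876 sub=Invt opID=SWI-34de5f95] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-ABCD-DEV01/HWBK-ABCD-DEV01.vmx changed toolsStatus=toolsOk",
]

for ts, _vmx, key, value in state_changes(sample):
    print(ts, key, value)
```

Lines that do not describe a state change, such as the "Missing ftInfo" warning, are skipped.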

 

 

 

VMware Logs:

 

  • During this time, the VMware logs show that the VM was migrated from cluster node abcd-tand-abc18.abc.com to abcd-tand-abc19.abc.com.

 

2019-01-24T22:44:44.637Z| vcpu-3| I125: CDROM ide1:0: CMD 0x5a (MODE SENSE(10)) FAILED (key 0x5 asc 0x24 ascq 0)
2019-01-24T22:44:47.219Z| vcpu-1| I125: CDROM ide1:0: CMD 0xa4 (*UNKNOWN (0xa4)*) FAILED (key 0x5 asc 0x20 ascq 0)
2019-01-24T22:44:54.208Z| mks| W115: VNCENCODE 3147657 failed to allocate VNCBlitDetect
2019-01-24T22:46:37.017Z| vcpu-2| I125: TOOLS call to Resolution_Set failed.
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetInfo: state=1 srcIp=<10.105.3.75> dstIp=<10.105.3.76> mid=46814189629211740 uuid=33343438-3235-4753-4837-3330594c4143 priority=high
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.
2019-01-24T22:52:42.495Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 2.

2019-01-24T22:52:54.324Z| vmx| W115: VMX has left the building: 0.
2019-01-24T22:52:42.283Z| vmx| I125: Hostname=abcd-tand-abc19.abc.com

2019-01-25T00:58:58.709Z| vmx| I125: GuestRpc: Got error for connection 3605:Remote connection failure
2019-01-25T01:03:59.129Z| vmx| I125: GuestRpc: Got error for connection 3905:Remote connection failure


2019-01-25T01:04:38.874Z| mks| I125: SSL: syscall error 104: Connection reset by peer
2019-01-25T01:04:38.874Z| mks| I125: SOCKET 9 (153) recv error 104: Connection reset by peer
2019-01-25T01:13:37.816Z| vmx| I125: GuestRpc: Got error for connection 4005:Remote connection failure
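
To find every migration recorded in a vmware.log file, the MigrateSetInfo entries can be extracted together with their source and destination IPs. This is a minimal Python sketch; the pattern is inferred from the single MigrateSetInfo line quoted above, so treat it as an assumption that may need adjusting for other builds. The function name is hypothetical.

```python
import re

# Pattern inferred from the MigrateSetInfo line quoted above (assumption).
MIGRATE_INFO = re.compile(
    r"^(?P<ts>\S+?)\|\s*vmx\|\s*\w+:\s*MigrateSetInfo:.*?"
    r"srcIp=<(?P<src>[^>]+)>\s+dstIp=<(?P<dst>[^>]+)>"
)


def find_migrations(lines):
    """Return a list of (timestamp, source_ip, destination_ip) tuples."""
    found = []
    for line in lines:
        m = MIGRATE_INFO.match(line)
        if m:
            found.append((m.group("ts"), m.group("src"), m.group("dst")))
    return found


# Two lines copied verbatim from the excerpt above.
sample = [
    "2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetInfo: state=1 srcIp=<10.105.3.75> dstIp=<10.105.3.76> mid=46814189629211740 uuid=33343438-3235-4753-4837-3330594c4143 priority=high",
    "2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.",
]

for ts, src, dst in find_migrations(sample):
    print(f"{ts}: migration {src} -> {dst}")
```

The resulting timestamps can then be compared against the guest-side events once both are expressed in UTC.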

 

  • Next, reviewing the host logs on node abcd-tand-abc19.abc.com:
  • The VOBD logs do not have any information regarding the issue.
  • The Hostd and VMkernel logs do not have entries captured at the time of the issue.
    • The last logs captured are from: 2019-02-03T16:22:00.468
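
Whether a given log file covers the incident at all can be checked quickly before reading it in detail. This is a minimal Python sketch, assuming each relevant line starts with an ISO-8601 UTC timestamp as in the excerpts in this post. The function names are hypothetical, and the incident window is an assumption taken from the first and last vmware.log entries quoted above.

```python
import re
from datetime import datetime, timezone

# ISO-8601 UTC timestamp at the start of a line (fractional seconds optional).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def log_span(lines):
    """Return (earliest, latest) timestamps found at line starts, or None."""
    stamps = []
    for line in lines:
        m = TS.match(line)
        if m:
            parsed = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            stamps.append(parsed.replace(tzinfo=timezone.utc))
    return (min(stamps), max(stamps)) if stamps else None


def overlaps(span, start, end):
    """True if the log span overlaps the window [start, end]."""
    return span is not None and span[0] <= end and span[1] >= start


# ASSUMPTION: window bounded by the first and last vmware.log entries quoted above.
WINDOW_START = datetime(2019, 1, 24, 22, 44, 44, tzinfo=timezone.utc)
WINDOW_END = datetime(2019, 1, 25, 1, 13, 37, tzinfo=timezone.utc)

# Two lines copied from the vmware.log excerpt, standing in for a whole file.
sample = [
    "2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.",
    "2019-01-25T01:03:59.129Z| vmx| I125: GuestRpc: Got error for connection 3905:Remote connection failure",
]
print(overlaps(log_span(sample), WINDOW_START, WINDOW_END))
```

For a real file, pass the open file object to log_span instead of the sample list.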

 

Conclusion:

  • From the logs we are not able to determine the root cause. However, we can see that the virtual machine was not able to communicate with the DNS servers and that the cluster node was removed from the failover cluster membership.
  • At this point, if you have a syslog server configured, please share the logs for both ESXi hosts for 24th to 25th January; otherwise we will have to wait for the issue to reoccur.
  • I also strongly recommend engaging us for live troubleshooting while the issue is occurring.

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
