RCA-6: VM Communication issue

Host Information:

HostId	Hostname	ipAddress1	HostdPort	version	build
host-1009	hvit-****-prp18.abc.xyz.com	10.***.4.75	443	6.0.0	7504637
host-1053	hvit-****-prp19.abc.xyz.com	10.***.4.76	443	6.0.0	7504637

Virtual Machine Name: HWBK-****-DEV01

System Logs:

- The system logs show that the machine has been logging Event ID 1135 since 24 January. This event indicates a network loss, after which the cluster node was evicted from the cluster (as seen in the FCM console).

Log Name: System

Source: Microsoft-Windows-FailoverClustering

Date: 25-01-2019 1.33.04 AM

Event ID: 1135

User: SYSTEM

Computer: HWBK-****-DEV01.abc.xyz.com

Description: Cluster node ‘BWBK-****-DEV01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

- Around the same time, the DNS servers were not responding to client requests, as the following events show.

Log Name: System

Source: Microsoft-Windows-DNS-Client

Date: 25-01-2019 12.00.10 AM

Event ID: 1014

Level: Warning

Computer: HWBK-****-DEV01.abc.xyz.com

Description: Name resolution for the name o***11.icrc.trendmicro.com timed out after none of the configured DNS servers responded.

Log Name: System

Source: Microsoft-Windows-Time-Service

Date: 25-01-2019 12.01.27 AM

Event ID: 134

Level: Warning

Computer: HWBK-****-DEV01.abc.xyz.com

Description: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on ‘b***-ntppp01.abc.xyz.com,0x8’. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)

Log Name: System

Source: Microsoft-Windows-DNS-Client

Date: 25-01-2019 12.01.29 AM

Event ID: 8015

Level: Warning

Computer: HWBK-****-DEV01.abc.xyz.com

Description:

The system failed to register host (A or AAAA) resource records (RRs) for network adapter

with settings:

Adapter Name : {BD3F2C50-AF88-4D3D-AFAB-FA57337ADC8A}

Host Name : HWBK-****-DEV01

Primary Domain Suffix : abc.xyz.com

DNS server list : 172.**.60.159, 172.**.60.160, 172.**.80.82

IP Address(es) :10.***.1.50

The reason the system could not register these RRs was because the update request it sent to the DNS server timed out. The most likely cause of this is that the DNS server authoritative for the name it was attempting to register or update is not running at this time. You can manually retry DNS registration of the network adapter and its settings by typing ‘ipconfig /registerdns’ at the command prompt. If problems still persist, contact your DNS server or network systems administrator.

Log Name: System

Source: Service Control Manager

Date: 25-01-2019 12.05.49 AM

Event ID: 7024

Task Category: None

Level: Error

Keywords: Classic

User: N/A

Computer: HWBK-****-DEV01.abc.xyz.com

Description:

The Cluster Service service terminated with the following service-specific error: The wait operation timed out.
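
The events above are listed out of chronological order (Event ID 1135 appears first but was logged last). A minimal sketch for putting them on one timeline is shown here. It assumes the dates use the DD-MM-YYYY H.MM.SS AM/PM format seen in the export and treats them as naive local times, because the export does not state a time zone. The names EVENTS, parse_event_time and build_timeline are illustrative only.

```python
from datetime import datetime

# (date as printed in the event export, event ID, source), copied from the events above
EVENTS = [
    ("25-01-2019 1.33.04 AM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("25-01-2019 12.00.10 AM", 1014, "Microsoft-Windows-DNS-Client"),
    ("25-01-2019 12.01.27 AM", 134, "Microsoft-Windows-Time-Service"),
    ("25-01-2019 12.01.29 AM", 8015, "Microsoft-Windows-DNS-Client"),
    ("25-01-2019 12.05.49 AM", 7024, "Service Control Manager"),
]


def parse_event_time(text):
    """Parse 'DD-MM-YYYY H.MM.SS AM/PM' into a naive datetime (time zone unknown)."""
    return datetime.strptime(text, "%d-%m-%Y %I.%M.%S %p")


def build_timeline(events):
    """Return (datetime, event_id, source) tuples sorted by time."""
    parsed = [(parse_event_time(when), event_id, source) for when, event_id, source in events]
    return sorted(parsed, key=lambda row: row[0])


timeline = build_timeline(EVENTS)
for when, event_id, source in timeline:
    print(when.isoformat(sep=" "), event_id, source)
```

Sorted this way, the DNS and time-service warnings and the Cluster Service termination come first, and the 1135 eviction event comes last.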

VMware Logs:

- The VMware logs show issues on a different timeline. Around 22:44 UTC (10:44 PM GMT) on 24 January, per the log timestamps, the virtual machine started exhibiting performance issues.

FDM Logs:

- The FDM logs of the cluster nodes show that the VMware Tools status changed to "not running". This probably points to the VM being unresponsive, but we cannot confirm it because the other logs contain little information:

2019-01-24T22:44:47.084Z verbose fdm[73B4BB70] [Originator@6876 sub=Invt opID=SWI-11902791] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed guestHB=red
2019-01-24T22:52:42.224Z warning fdm[729C2B70] [Originator@6876 sub=Invt] [HalVmMonitor::GetIsCptFtFromFtInfo] Missing ftInfo. ftInfo is null for /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx
2019-01-24T22:52:54.325Z verbose fdm[73B0AB70] [Originator@6876 sub=Invt opID=SWI-1efbfe6d] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed toolsStatus=toolsNotRunning
2019-01-24T22:52:54.741Z verbose fdm[73640B70] [Originator@6876 sub=Invt opID=SWI-34de5f95] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed toolsStatus=toolsOk
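
To review the heartbeat and Tools state changes without reading every FDM line, the entries can be filtered with a small script. This is only a sketch: the regular expression is derived from the four lines above and is an assumption about the line layout, not a documented FDM log format. STATE_CHANGE, SAMPLE and state_changes are hypothetical names.

```python
import re

# Matches only the VmHeartbeatStateChange lines shown above; anything trailing the
# milliseconds in the timestamp field (such as "Z") is skipped.
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\S*\s+.*"
    r"\[VmHeartbeatStateChange::SaveToInventory\] vm (?P<vmx>\S+) "
    r"changed (?P<key>\w+)=(?P<value>\w+)"
)

VMX = "/vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx"

# The four FDM lines quoted above
SAMPLE = [
    "2019-01-24T22:44:47.084Z verbose fdm[73B4BB70] [Originator@6876 sub=Invt opID=SWI-11902791] "
    "[VmHeartbeatStateChange::SaveToInventory] vm " + VMX + " changed guestHB=red",
    "2019-01-24T22:52:42.224Z warning fdm[729C2B70] [Originator@6876 sub=Invt] "
    "[HalVmMonitor::GetIsCptFtFromFtInfo] Missing ftInfo. ftInfo is null for " + VMX,
    "2019-01-24T22:52:54.325Z verbose fdm[73B0AB70] [Originator@6876 sub=Invt opID=SWI-1efbfe6d] "
    "[VmHeartbeatStateChange::SaveToInventory] vm " + VMX + " changed toolsStatus=toolsNotRunning",
    "2019-01-24T22:52:54.741Z verbose fdm[73640B70] [Originator@6876 sub=Invt opID=SWI-34de5f95] "
    "[VmHeartbeatStateChange::SaveToInventory] vm " + VMX + " changed toolsStatus=toolsOk",
]


def state_changes(lines):
    """Return (timestamp, key, value) for every heartbeat/Tools state change line."""
    changes = []
    for line in lines:
        match = STATE_CHANGE.match(line)
        if match:
            changes.append((match["ts"], match["key"], match["value"]))
    return changes


changes = state_changes(SAMPLE)
for ts, key, value in changes:
    print(ts, key, value)
```

The ftInfo warning is ignored because it is not a state change; only the guestHB and toolsStatus transitions are reported.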

VMware Logs:

- During this time, the VMware logs show that the VM was migrated from cluster node hvit-****-prp18.abc.xyz.com to hvit-****-prp19.abc.xyz.com.

2019-01-24T22:44:44.637Z| vcpu-3| I125: CDROM ide1:0: CMD 0x5a (MODE SENSE(10)) FAILED (key 0x5 asc 0x24 ascq 0)
2019-01-24T22:44:47.219Z| vcpu-1| I125: CDROM ide1:0: CMD 0xa4 (*UNKNOWN (0xa4)*) FAILED (key 0x5 asc 0x20 ascq 0)
2019-01-24T22:44:54.208Z| mks| W115: VNCENCODE 3147657 failed to allocate VNCBlitDetect
2019-01-24T22:46:37.017Z| vcpu-2| I125: TOOLS call to Resolution_Set failed.
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetInfo: state=1 srcIp=<10.***.3.75> dstIp=<10.***.3.76> mid=46814189629211740 uuid=33343438-3235-4753-4837-3330594c4143 priority=high
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.
2019-01-24T22:52:42.495Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 2.

2019-01-24T22:52:54.324Z| vmx| W115: VMX has left the building: 0.
2019-01-24T22:52:42.283Z| vmx| I125: Hostname=hvit-****-prp19.abc.xyz.com

2019-01-25T00:58:58.709Z| vmx| I125: GuestRpc: Got error for connection 3605:Remote connection failure
2019-01-25T01:03:59.129Z| vmx| I125: GuestRpc: Got error for connection 3905:Remote connection failure

2019-01-25T01:04:38.874Z| mks| I125: SSL: syscall error 104: Connection reset by peer
2019-01-25T01:04:38.874Z| mks| I125: SOCKET 9 (153) recv error 104: Connection reset by peer
2019-01-25T01:13:37.816Z| vmx| I125: GuestRpc: Got error for connection 4005:Remote connection failure
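
Some of the vmware.log lines above are not in time order, so it helps to parse and sort them before comparing them with the guest events. The sketch below assumes the "timestamp| thread| level: message" layout seen in these excerpts. LINE, parse_line and the other names are illustrative, and only a subset of the lines above is used as sample input.

```python
import re
from datetime import datetime, timezone

# Layout assumed from the excerpts above: "<UTC timestamp>Z| <thread>| <level>: <message>"
LINE = re.compile(r"^(?P<ts>\S+)Z\|\s*(?P<thread>[^|]+)\|\s*(?P<level>[A-Z]\d+):\s*(?P<msg>.*)$")

SAMPLE = [
    "2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.",
    "2019-01-24T22:52:42.495Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 2.",
    "2019-01-24T22:52:54.324Z| vmx| W115: VMX has left the building: 0.",
    "2019-01-24T22:52:42.283Z| vmx| I125: Hostname=hvit-****-prp19.abc.xyz.com",
    "2019-01-25T00:58:58.709Z| vmx| I125: GuestRpc: Got error for connection 3605:Remote connection failure",
    "2019-01-25T01:03:59.129Z| vmx| I125: GuestRpc: Got error for connection 3905:Remote connection failure",
    "2019-01-25T01:13:37.816Z| vmx| I125: GuestRpc: Got error for connection 4005:Remote connection failure",
]


def parse_line(line):
    """Split one vmware.log line into (utc_datetime, thread, level, message), or None."""
    match = LINE.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)
    return ts, match["thread"].strip(), match["level"], match["msg"]


# Sort by timestamp so that lines from different log files interleave correctly
entries = sorted(e for e in map(parse_line, SAMPLE) if e is not None)
guest_rpc_errors = [e for e in entries if e[3].startswith("GuestRpc: Got error")]

for ts, thread, level, msg in entries:
    print(ts.isoformat(), thread, level, msg)
```

After sorting, the Hostname line for hvit-****-prp19.abc.xyz.com falls between the two MigrateSetState transitions, and the GuestRpc errors follow a little over two hours after the migration entries.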

- Next, we reviewed the host logs on node hvit-****-prp19.abc.xyz.com.

- The VOBD logs do not contain any information regarding the issue.

- The hostd and vmkernel logs do not contain entries from the time of the issue.
  - The last logs captured are from: 2019-02-03T16:22:00.468

Conclusion:

- From the logs, we cannot determine the root cause of the issue. However, we can see that the virtual machine was unable to communicate with the DNS servers and that the cluster node was evicted from the failover cluster membership.
- If you have a syslog server configured, please share the logs from both ESXi hosts for 24 to 25 January. Otherwise, we will have to wait for the issue to recur.
- I also strongly recommend engaging us for live troubleshooting while the issue is occurring.