RCA-6: VM Communication issue

Host Information:

 

HostId     Hostname                      ipAddress1   HostdPort  version  build
host-1009  hvit-****-prp18.abc.xyz.com   10.***.4.75  443        6.0.0    7504637
host-1053  hvit-****-prp19.abc.xyz.com   10.***.4.76  443        6.0.0    7504637

 

 

Virtual Machine Name: HWBK-****-DEV01

 

 

System Logs:

 

    • The System event log on the machine shows Event ID 1135, recurring since 24th Jan. It indicates a loss of network connectivity, after which the cluster node was evicted from the cluster, as seen in the FCM console.

 

Log Name:      System

Source:        Microsoft-Windows-FailoverClustering

Date:          25-01-2019 1.33.04 AM

Event ID:      1135

User:          SYSTEM

Computer:      HWBK-****-DEV01.abc.xyz.com

Description: Cluster node ‘BWBK-****-DEV01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

 

    • Around the same time, the DNS servers were not responding to client requests, as the following events show.

 

Log Name:      System

Source:        Microsoft-Windows-DNS-Client

Date:          25-01-2019 12.00.10 AM

Event ID:      1014

Level:         Warning

Computer:      HWBK-****-DEV01.abc.xyz.com

Description: Name resolution for the name o***11.icrc.trendmicro.com timed out after none of the configured DNS servers responded.

 

Log Name:      System

Source:        Microsoft-Windows-Time-Service

Date:          25-01-2019 12.01.27 AM

Event ID:      134

Level:         Warning

Computer:      HWBK-****-DEV01.abc.xyz.com

Description: NtpClient was unable to set a manual peer to use as a time source because of DNS resolution error on ‘b***-ntppp01.abc.xyz.com,0x8’. NtpClient will try again in 15 minutes and double the reattempt interval thereafter. The error was: No such host is known. (0x80072AF9)

 

Log Name:      System

Source:        Microsoft-Windows-DNS-Client

Date:          25-01-2019 12.01.29 AM

Event ID:      8015

Level:         Warning

Computer:      HWBK-****-DEV01.abc.xyz.com

Description:

The system failed to register host (A or AAAA) resource records (RRs) for network adapter

with settings:

 

 Adapter Name : {BD3F2C50-AF88-4D3D-AFAB-FA57337ADC8A}

 Host Name : HWBK-****-DEV01

 Primary Domain Suffix : abc.xyz.com

 DNS server list : 172.**.60.159, 172.**.60.160, 172.**.80.82

 IP Address(es) :10.***.1.50

The reason the system could not register these RRs was because the update request it sent to the DNS server timed out. The most likely cause of this is that the DNS server authoritative for the name it was attempting to register or update is not running at this time. You can manually retry DNS registration of the network adapter and its settings by typing ‘ipconfig /registerdns’ at the command prompt. If problems still persist, contact your DNS server or network systems administrator.

 

Log Name:      System

Source:        Service Control Manager

Date:          25-01-2019 12.05.49 AM

Event ID:      7024

Task Category: None

Level:         Error

Keywords:      Classic

User:          N/A

Computer:      HWBK-****-DEV01.abc.xyz.com

Description:

The Cluster Service service terminated with the following service-specific error: The wait operation timed out.
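
The events above are listed out of chronological order, so it helps to sort them before reading them as a sequence. The following is a minimal Python sketch of that step. It assumes the timestamps are in day-month-year order with a 12-hour clock, exactly as they appear in the export above; the function names and the shortened source labels are illustrative only.

```python
from datetime import datetime

# (timestamp as exported from Event Viewer, shortened source name, event ID)
EVENTS = [
    ("25-01-2019 1.33.04 AM", "FailoverClustering", 1135),
    ("25-01-2019 12.00.10 AM", "DNS-Client", 1014),
    ("25-01-2019 12.01.27 AM", "Time-Service", 134),
    ("25-01-2019 12.01.29 AM", "DNS-Client", 8015),
    ("25-01-2019 12.05.49 AM", "Service Control Manager", 7024),
]


def parse_event_time(text):
    """Parse 'DD-MM-YYYY H.MM.SS AM/PM' into a naive local datetime."""
    return datetime.strptime(text, "%d-%m-%Y %I.%M.%S %p")


def build_timeline(events):
    """Return the events sorted by their parsed local timestamp."""
    return sorted((parse_event_time(t), source, event_id)
                  for t, source, event_id in events)


for when, source, event_id in build_timeline(EVENTS):
    print(when.isoformat(), source, event_id)
```

Sorted this way, the DNS and time-service warnings come first, the Cluster Service termination (7024) follows a few minutes later, and the 1135 entry shown above is the latest of the five.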

 

VMware Logs:

 

    • The VMware logs show issues on a different timeline. Around 22:44 UTC on 24th Jan (the timestamps in the logs below), the virtual machine started exhibiting performance issues.
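
Because the guest event log records local time while the ESXi logs record UTC (the trailing Z), the two timelines can only be compared once the guest's UTC offset is known. The sketch below is a hedged illustration of that conversion: the offset is not stated anywhere in the collected logs, so the values 0 and 1 are purely hypothetical, and the function names are my own.

```python
from datetime import datetime, timedelta, timezone


def guest_event_to_utc(text, utc_offset_hours):
    """Convert an Event Viewer timestamp to UTC, given an ASSUMED guest offset."""
    local = datetime.strptime(text, "%d-%m-%Y %I.%M.%S %p")
    guest_tz = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=guest_tz).astimezone(timezone.utc)


def esxi_log_to_utc(text):
    """Parse an ESXi log timestamp such as 2019-01-24T22:44:44.637Z."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# First vmware.log entry of interest and first DNS warning in the guest.
first_vmware_entry = esxi_log_to_utc("2019-01-24T22:44:44.637Z")

for offset in (0, 1):  # hypothetical guest offsets, not taken from the logs
    dns_warning = guest_event_to_utc("25-01-2019 12.00.10 AM", offset)
    print(f"UTC+{offset}: DNS warning is {dns_warning - first_vmware_entry} after the vmware.log entry")
```

The gap between the two entries changes by a full hour per hour of offset, so the guest's time zone should be confirmed before drawing conclusions from the ordering.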

 

FDM Logs:

 

    • The FDM logs of the cluster nodes show that the VMware Tools status changed to "not running". This probably points to the VM being unresponsive, but we cannot be sure because the other logs contain little information:

 

2019-01-24T22:44:47.084Z verbose fdm[73B4BB70] [Originator@6876 sub=Invt opID=SWI-11902791] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed  guestHB=red
2019-01-24T22:52:42.224Z warning fdm[729C2B70] [Originator@6876 sub=Invt] [HalVmMonitor::GetIsCptFtFromFtInfo] Missing ftInfo. ftInfo is null for /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx
2019-01-24T22:52:54.325Z verbose fdm[73B0AB70] [Originator@6876 sub=Invt opID=SWI-1efbfe6d] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed toolsStatus=toolsNotRunning
2019-01-24T22:52:54.741Z verbose fdm[73640B70] [Originator@6876 sub=Invt opID=SWI-34de5f95] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed toolsStatus=toolsOk
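
When the FDM log is large, the heartbeat and tools state changes for a VM can be pulled out with a short script. This is only a sketch: the regular expression is derived from the three sample lines above, not from any documented FDM log format, and the names `CHANGE_RE` and `state_changes` are illustrative.

```python
import re

# Matches lines shaped like the samples above:
#   <timestamp> <level> fdm[...] ... vm <path>.vmx changed <key>=<value>
CHANGE_RE = re.compile(
    r"^(?P<ts>\S+)\s+\w+\s+fdm\[.*?\bvm (?P<vmx>\S+\.vmx) changed\s+"
    r"(?P<key>\w+)=(?P<value>\w+)"
)

SAMPLE = """\
2019-01-24T22:52:42.224Z warning fdm[729C2B70] [Originator@6876 sub=Invt] [HalVmMonitor::GetIsCptFtFromFtInfo] Missing ftInfo. ftInfo is null for /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx
2019-01-24T22:52:54.325Z verbose fdm[73B0AB70] [Originator@6876 sub=Invt opID=SWI-1efbfe6d] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed toolsStatus=toolsNotRunning
2019-01-24T22:52:54.741Z verbose fdm[73640B70] [Originator@6876 sub=Invt opID=SWI-34de5f95] [VmHeartbeatStateChange::SaveToInventory] vm /vmfs/volumes/5562cb81-dd6dc3f2-910d-c4346baf8db8/HWBK-****-DEV01/HWBK-****-DEV01.vmx changed toolsStatus=toolsOk
"""


def state_changes(text):
    """Return (timestamp, key, value) for every 'changed key=value' line."""
    changes = []
    for line in text.splitlines():
        match = CHANGE_RE.match(line)
        if match:
            changes.append((match.group("ts"), match.group("key"), match.group("value")))
    return changes


for ts, key, value in state_changes(SAMPLE):
    print(ts, key, value)
```

On the sample lines, the warning about the missing ftInfo is skipped and only the two toolsStatus transitions are reported.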

 

 

 

VMware Logs:

 

    • During this time, the VMware logs show that the VM was migrated from cluster node hvit-****-prp18.abc.xyz.com to hvit-****-prp19.abc.xyz.com.

 

2019-01-24T22:44:44.637Z| vcpu-3| I125: CDROM ide1:0: CMD 0x5a (MODE SENSE(10)) FAILED (key 0x5 asc 0x24 ascq 0)
2019-01-24T22:44:47.219Z| vcpu-1| I125: CDROM ide1:0: CMD 0xa4 (*UNKNOWN (0xa4)*) FAILED (key 0x5 asc 0x20 ascq 0)
2019-01-24T22:44:54.208Z| mks| W115: VNCENCODE 3147657 failed to allocate VNCBlitDetect
2019-01-24T22:46:37.017Z| vcpu-2| I125: TOOLS call to Resolution_Set failed.
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetInfo: state=1 srcIp=<10.***.3.75> dstIp=<10.***.3.76> mid=46814189629211740 uuid=33343438-3235-4753-4837-3330594c4143 priority=high
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.
2019-01-24T22:52:42.495Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 2.

2019-01-24T22:52:54.324Z| vmx| W115: VMX has left the building: 0.
2019-01-24T22:52:42.283Z| vmx| I125: Hostname=hvit-****-prp19.abc.xyz.com

2019-01-25T00:58:58.709Z| vmx| I125: GuestRpc: Got error for connection 3605:Remote connection failure
2019-01-25T01:03:59.129Z| vmx| I125: GuestRpc: Got error for connection 3905:Remote connection failure

2019-01-25T01:04:38.874Z| mks| I125: SSL: syscall error 104: Connection reset by peer
2019-01-25T01:04:38.874Z| mks| I125: SOCKET 9 (153) recv error 104: Connection reset by peer
2019-01-25T01:13:37.816Z| vmx| I125: GuestRpc: Got error for connection 4005:Remote connection failure
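
The excerpt above is pasted from more than one place in vmware.log, so a couple of lines are out of order. The sketch below sorts the migration-related lines by timestamp and extracts the source and destination addresses from the MigrateSetInfo entry. The regular expressions are inferred from the lines shown here, not from a documented log format, and the function names are illustrative.

```python
import re
from datetime import datetime

# <timestamp>| <thread>| <level>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+Z)\|\s*(?P<thread>[^|]+)\|\s*(?P<level>\w+):\s*(?P<msg>.*)$"
)
MIGRATE_RE = re.compile(r"srcIp=<(?P<src>[^>]+)> dstIp=<(?P<dst>[^>]+)>")

SAMPLE = """\
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetInfo: state=1 srcIp=<10.***.3.75> dstIp=<10.***.3.76> mid=46814189629211740 uuid=33343438-3235-4753-4837-3330594c4143 priority=high
2019-01-24T22:52:41.499Z| vmx| I125: MigrateSetState: Transitioning from state 0 to 1.
2019-01-24T22:52:42.495Z| vmx| I125: MigrateSetState: Transitioning from state 1 to 2.
2019-01-24T22:52:54.324Z| vmx| W115: VMX has left the building: 0.
2019-01-24T22:52:42.283Z| vmx| I125: Hostname=hvit-****-prp19.abc.xyz.com
"""


def parse_line(line):
    """Split one vmware.log line into (timestamp, thread, message), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
    return ts, match.group("thread").strip(), match.group("msg")


def sorted_entries(text):
    """Parse all lines and return them in chronological order."""
    parsed = (parse_line(line) for line in text.splitlines())
    return sorted((e for e in parsed if e is not None), key=lambda e: e[0])


def migration_endpoints(entries):
    """Return (srcIp, dstIp) from the first MigrateSetInfo entry, if any."""
    for _, _, msg in entries:
        match = MIGRATE_RE.search(msg)
        if match:
            return match.group("src"), match.group("dst")
    return None


entries = sorted_entries(SAMPLE)
print(migration_endpoints(entries))
for ts, thread, msg in entries:
    print(ts.isoformat(), thread, msg)
```

After sorting, the Hostname line for prp19 falls between the two state transitions, and the "VMX has left the building" entry comes last, roughly 13 seconds after MigrateSetInfo.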

 

    • Next, we reviewed the host logs on node hvit-****-prp19.abc.xyz.com:

 

    • The VOBD logs do not contain any information about the issue.

 

    • The hostd and vmkernel logs do not contain entries from the time of the issue.
      • The last logs captured are from: 2019-02-03T16:22:00.468

 

 

Conclusion:

 

    • The logs are not sufficient to determine the root cause. However, they show that the virtual machine was not able to communicate with the DNS servers and that the cluster node was evicted from the failover cluster membership.
    • At this point, if you have a syslog server configured, please share the logs from both ESXi hosts for 24th to 25th January. Otherwise, we will have to wait for the issue to recur.
    • I also strongly recommend engaging us for live troubleshooting while the issue is occurring.

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
