RCA – 14 – Virtual Machine Failover

Issue Description:

In cluster “MC-BOS-CLUSTER01”, the VMs hosted on one Hyper-V node were moved to another Hyper-V node, “MC-BOS-VMS02”, which runs Windows Server 2012 Datacenter. The latter node then shows the error message: “Connecting to Virtual Machine Management service…”.

29th at 4:50 PM

___________________________________________________________________________

System Information: ABCD2-XYZHOST03

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCD2-XYZHOST03

System Manufacturer        HP

System Model        ProLiant DL360p Gen8

System Type        x64-based PC

System SKU        646900-421

Processor        Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz, 1796 Mhz, 4 Core(s), 4 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz, 1796 Mhz, 4 Core(s), 4 Logical Processor(s)

BIOS Version/Date        HP P71, 01/07/2015

System Events:

  • Checked the System logs on Node3 at the time of the issue and found that the node was removed from the active failover cluster membership. This generally indicates a loss of communication between this node and the other nodes in the cluster.

  • As soon as the node was removed, the CSV entered a paused state and the node lost quorum, which caused the Cluster service to terminate.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCD2-XYZHOST01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCD2-XYZHOST02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/10/2017 | 7:20:09 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) has entered a paused state because of ‘(c0000203)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
| 1/10/2017 | 7:20:10 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster. |
| 1/10/2017 | 7:20:10 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service. |
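Events 1135, 1177 and 7024 together describe a node that could no longer see a majority of the cluster. The following is a minimal sketch of the majority rule behind that, not the cluster's actual implementation. It assumes three voting nodes (the two peers named in the 1135 events plus this node), an optional witness vote, and it ignores dynamic quorum adjustments. The function name `has_quorum` is made up for illustration.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """Majority rule: a partition keeps quorum only if it holds more than half of all votes."""
    return votes_reachable > total_votes // 2


# Assumption: three voting nodes and no witness.
total_votes = 3
print("isolated node keeps quorum:", has_quorum(1, total_votes))
print("remaining two nodes keep quorum:", has_quorum(2, total_votes))

# Assumption: the same three nodes plus one witness vote.
total_votes_with_witness = 4
print("isolated node keeps quorum:", has_quorum(1, total_votes_with_witness))
print("two nodes plus witness keep quorum:", has_quorum(3, total_votes_with_witness))
```

Under either assumption, a single node that loses contact with both peers holds one vote and cannot satisfy the majority rule, which is consistent with the Cluster service shutting itself down on this node.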

PS C:\Users\adix5025.INDIA\Downloads\ERR> & '.\err(vista).exe' c0000203
# for hex 0xc0000203 / decimal -1073741309
  STATUS_USER_SESSION_DELETED                                    ntstatus.h
# The remote user session has been deleted.
# 1 matches found for "c0000203"
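The hex and decimal values printed by the lookup above are the same 32-bit NTSTATUS code read as unsigned and signed. As a small sketch (the helper name `to_signed32` is made up for illustration), the conversion and the severity bits can be checked like this:

```python
def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


status = 0xC0000203  # code reported by event 5120 for the paused CSV

signed_value = to_signed32(status)
# The top two bits of an NTSTATUS carry the severity:
# 0 = success, 1 = informational, 2 = warning, 3 = error.
severity = status >> 30

print(signed_value)  # -1073741309
print(severity)      # 3
```

A severity of 3 marks the code as an error status, in line with event 5120 being logged at Error level.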



Application Events:

 

  • Reviewed the Application logs and found that the Sophos Message Router reported errors about lost communication with its parent router. However, these errors start roughly 10 minutes after the issue was reported on the cluster.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/10/2017 | 7:30:53 AM | Warning | ABCD2-XYZHOST03.ABCD2ad.com | 8004 | Sophos Message Router | Failed to communicate with parent router ‘192.0.0.235’. For more information, see the RMS status report. To open the report, click Start, point to All Programs, point to Sophos, point to Sophos Endpoint Security and Control, and then click View Sophos Network Communications Report. |
| 1/10/2017 | 7:36:48 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
| 1/10/2017 | 7:41:18 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 8005 | Sophos Message Router | DNS lookup failure trying to resolve the following addresses: fe80::b5f3:26e7:2292:a1bf. For more information, see the RMS status report. To open the report, click Start, point to All Programs, point to Sophos, point to Sophos Endpoint Security and Control, and then click View Sophos Network Communications Report. |
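The "around 10 minutes" gap can be checked directly from the timestamps in the event logs. This is a minimal sketch that takes the quorum-loss time (7:20:09 AM) from the System events and the Sophos times from the Application events. The helper name `gap_seconds` and the timestamp format string are choices made for this example.

```python
from datetime import datetime

# Timestamp layout used in the event tables, e.g. "1/10/2017 7:20:09 AM".
FMT = "%m/%d/%Y %I:%M:%S %p"


def gap_seconds(earlier: str, later: str) -> int:
    """Return the number of whole seconds between two event timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())


quorum_lost = "1/10/2017 7:20:09 AM"       # events 1135 / 1177
sophos_warning = "1/10/2017 7:30:53 AM"    # event 8004
sophos_dns_error = "1/10/2017 7:41:18 AM"  # event 8005

print(gap_seconds(quorum_lost, sophos_warning))    # 644
print(gap_seconds(quorum_lost, sophos_dns_error))  # 1269
```

The first Sophos warning appears 644 seconds (10 minutes 44 seconds) after quorum was lost, so these events follow the cluster failure rather than precede it.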

 

Hyper-V Events:

 

  • Because the node was removed from the cluster, it could no longer keep the VMs online, as it was not able to access their configuration files.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/10/2017 | 7:20:19 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 16400 | Microsoft-Windows-Hyper-V-VMMS | ‘ABCD2-LDN-LCESVR’ cannot access the data folder of the virtual machine. The worker process (Process ID 5736) may not be functional anymore. (Virtual machine ID 15941E4D-ADE6-4A2A-BA88-4FC730896078) |
| 1/10/2017 | 7:20:19 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 16400 | Microsoft-Windows-Hyper-V-VMMS | ‘ABCD2-SFTP’ cannot access the data folder of the virtual machine. The worker process (Process ID 7572) may not be functional anymore. (Virtual machine ID 4E26F8A8-9802-4DF8-8D5F-3D2A360D67D3) |
| 1/10/2017 | 7:20:19 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 16400 | Microsoft-Windows-Hyper-V-VMMS | ‘ABCD2-LDN-PVS’ cannot access the data folder of the virtual machine. The worker process (Process ID 16696) may not be functional anymore. (Virtual machine ID 3441E58A-83EF-48D8-B37F-2FD5A1683658) |

 

Cluster Events:

 

  • Referred to the cluster logs for confirmation and found the missed heartbeats. Cluster node ABCD2-XYZHOST03 was not able to communicate reliably with the other nodes from its local endpoint 192.0.0.163 on port 3343, due to which the Cluster service terminated.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/10/2017 | 7:19:51 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
| 1/10/2017 | 7:19:53 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.161:~3343~. |
| 1/10/2017 | 7:19:59 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
| 1/10/2017 | 7:19:59 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has established a UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
| 1/10/2017 | 7:19:59 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
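To see which endpoints the heartbeat problems have in common, the five event 1650 descriptions above can be tallied by local and remote address. This is a rough sketch: the regular expression is written against the `address:~port~` layout seen in these messages, and the names `ENDPOINT` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches "local endpoint A:~P~ connected to remote endpoint B:~P~" as logged in event 1650.
ENDPOINT = re.compile(
    r"local endpoint (\S+?):~(\d+)~ connected to remote endpoint (\S+?):~(\d+)~"
)

messages = [
    "Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
    "Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.161:~3343~.",
    "Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
    "Cluster has established a UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
    "Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
]


def summarize(msgs):
    """Count how often each local and remote address appears in the messages."""
    local, remote = Counter(), Counter()
    for msg in msgs:
        match = ENDPOINT.search(msg)
        if match:
            local[match.group(1)] += 1
            remote[match.group(3)] += 1
    return local, remote


local_counts, remote_counts = summarize(messages)
print(dict(local_counts))   # {'192.0.0.163': 5}
print(dict(remote_counts))  # {'192.0.0.162': 4, '192.0.0.161': 1}
```

Every message shares the local endpoint 192.0.0.163 while the remote side varies, which fits the reading that the problem sits on this node's path to the network rather than on one particular peer.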

 

 

List of outdated drivers:

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 2/12/2010 23:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
| 2/18/2014 16:02 | (3.23:1.0) | (3.23:1.0) | Sophos Limited | SAV On-Access and HIPS for Windows Vista (AMD64) |
| 7/28/2014 15:26 | (10.3:13.0) | (3.4:9.0) | Sophos Limited | Sophos Web Intelligence |

 

___________________________________________________________________________

 

 

 

Conclusion:

 

  • After analyzing the logs, we found that node ABCD2-XYZHOST03 lost communication with the other nodes on its endpoint 192.0.0.163 (port 3343), due to which the node was removed from the active failover cluster membership.

 

  • However, based on the events, we do not see any failure reported by the network adapters themselves, which points to a possible issue with an intermediate network device.

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
    • The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
    • The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
    • The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

 

  1. Please follow this article to add the antivirus exclusions: http://support.microsoft.com/kb/309422

 

  1.  Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:

 

https://support.microsoft.com/en-us/help/2920151/recommended-hotfixes-and-updates-for-windows-server-2012-r2-based-failover-clusters

 

  1.  Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: this is the most critical step when dealing with network connectivity issues.

           Investigation of network issues:

To avoid this issue in the future, the most important task is to diagnose and investigate the persistent network connectivity issue on the cluster networks:

  • Check the network adapters, cables, and network configuration for the networks that connect the nodes.
  • Check the hubs, switches, or bridges in the networks that connect the nodes.
  • Check for switch delays and proxy ARP with the help of the in-house networking team.

 

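  1. A few cluster properties can be tuned from the command line to make heartbeat failure detection less sensitive. The values below relax the delay between heartbeats (in milliseconds) and the number of missed heartbeats that are tolerated: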
    cluster /prop SameSubnetDelay=2000:DWORD
    cluster /prop CrossSubnetDelay=4000:DWORD
    cluster /prop CrossSubnetThreshold=10:DWORD
    cluster /prop SameSubnetThreshold=10:DWORD
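As a rough worked example of what these four values mean together: the time a node may stay silent before it is removed is approximately the heartbeat delay multiplied by the threshold. The sketch below only does that arithmetic for the values listed above. The helper name `tolerance_ms` is made up for illustration, and the exact detection timing in a real cluster may differ.

```python
def tolerance_ms(delay_ms: int, threshold: int) -> int:
    """Approximate silence tolerated before a node is considered down."""
    return delay_ms * threshold


# Values from the cluster /prop commands above.
same_subnet = tolerance_ms(delay_ms=2000, threshold=10)
cross_subnet = tolerance_ms(delay_ms=4000, threshold=10)

print(same_subnet / 1000, "seconds on the same subnet")  # 20.0
print(cross_subnet / 1000, "seconds across subnets")     # 40.0
```

A longer tolerance makes the cluster less likely to remove a node during a brief network interruption, but it also delays failover when a node has really failed, so these settings complement the network investigation rather than replace it.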

 

  1. Communication between cluster nodes is critical for smooth cluster operations. Therefore, you must make sure that the networks used for cluster communication are configured optimally and follow all hardware compatibility list requirements. For the networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly.

Recommended private “Heartbeat” configuration on a cluster server 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
