RCA - 14 - Virtual Machine Failover

Issue Description:

In cluster “MC-BOS-CLUSTER01”, the VMs hosted on one Hyper-V node were moved to another Hyper-V node, “MC-BOS-VMS02”, running Windows Server 2012 Datacenter. The following message was displayed on the latter node: “Connecting to Virtual Machine Management service…”.

29th at 4:50 PM

___________________________________________________________________________

System Information: ABCD2-XYZHOST03

OS Name	Microsoft Windows Server 2012 R2 Datacenter
Version	6.3.9600 Build 9600
Other OS Description	Not Available
OS Manufacturer	Microsoft Corporation
System Name	ABCD2-XYZHOST03
System Manufacturer	HP
System Model	ProLiant DL360p Gen8
System Type	x64-based PC
System SKU	646900-421
Processor	Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz, 1796 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date	HP P71, 01/07/2015

System Events:

We checked the System log on Node 3 (ABCD2-XYZHOST03) at the time of the issue and found that the node dropped out of the active failover cluster membership: Event 1135 was logged for both ABCD2-XYZHOST01 and ABCD2-XYZHOST02. This generally indicates a loss of communication between this node and the other nodes in the cluster.

At the same moment, the Cluster Shared Volume entered a paused state (Event 5120) and this node lost quorum (Event 1177), which caused the Cluster service to terminate.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
1/10/2017	7:20:09 AM	Critical	ABCD2-XYZHOST03.ABCD2ad.com	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABCD2-XYZHOST01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
1/10/2017	7:20:09 AM	Critical	ABCD2-XYZHOST03.ABCD2ad.com	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABCD2-XYZHOST02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
1/10/2017	7:20:09 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) has entered a paused state because of ‘(c0000203)’. All I/O will temporarily be queued until a path to the volume is reestablished.
1/10/2017	7:20:09 AM	Critical	ABCD2-XYZHOST03.ABCD2ad.com	1177	Microsoft-Windows-FailoverClustering	The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
1/10/2017	7:20:09 AM	Critical	ABCD2-XYZHOST03.ABCD2ad.com	1146	Microsoft-Windows-FailoverClustering	The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

1/10/2017	7:20:10 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	7024	Service Control Manager	The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster.
1/10/2017	7:20:10 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	7031	Service Control Manager	The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
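
The quorum loss in Event 1177 can be illustrated with a simple majority check. This is only a sketch under stated assumptions: it assumes a three-node cluster (ABCD2-XYZHOST01, 02 and 03) with one vote per node and no witness vote, and it ignores dynamic quorum adjustments. The function name has_quorum is hypothetical and not part of any cluster tooling.

```python
def has_quorum(total_votes: int, reachable_votes: int) -> bool:
    """Node-majority rule: a partition needs more than half of all votes."""
    return reachable_votes > total_votes // 2


# Assumption: three voting nodes and no witness vote.
TOTAL_VOTES = 3

# HOST03 could count only its own vote after losing HOST01 and HOST02.
host03_has_quorum = has_quorum(TOTAL_VOTES, reachable_votes=1)

# HOST01 and HOST02, if they could still reach each other, hold two votes.
others_have_quorum = has_quorum(TOTAL_VOTES, reachable_votes=2)

print("HOST03 partition keeps quorum:", host03_has_quorum)
print("HOST01 + HOST02 partition keeps quorum:", others_have_quorum)
```

Under these assumptions the isolated node is the side that shuts down its Cluster service, while the remaining nodes can keep the cluster running and take over the VMs.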

PS C:\Users\adix5025.INDIA\Downloads\ERR> & '.\err(vista).exe' c0000203
# for hex 0xc0000203 / decimal -1073741309
  STATUS_USER_SESSION_DELETED                                    ntstatus.h
# The remote user session has been deleted.
# 1 matches found for "c0000203"
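
As a cross-check of the lookup above, the status code can be split into the standard NTSTATUS bit fields (severity in the top two bits, facility in bits 16 to 27, code in the low 16 bits). The snippet below is a minimal sketch; the helper name decode_ntstatus is hypothetical.

```python
def decode_ntstatus(hex_code: str) -> dict:
    """Split a 32-bit NTSTATUS value into its bit fields."""
    value = int(hex_code, 16) & 0xFFFFFFFF
    # Reinterpret the same 32 bits as a signed two's-complement integer.
    signed = value - (1 << 32) if value & 0x80000000 else value
    severity_names = ["success", "informational", "warning", "error"]
    return {
        "unsigned": value,
        "signed": signed,
        "severity": severity_names[value >> 30],
        "facility": (value >> 16) & 0x0FFF,
        "code": value & 0xFFFF,
    }


info = decode_ntstatus("c0000203")
print(info)
```

The signed value matches the decimal shown by the lookup tool, and the severity field confirms that the CSV pause was triggered by an error-class status rather than a warning.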

Application Events:

We reviewed the Application log and found that the Sophos Message Router reported a loss of communication with its parent router. However, these errors begin around 10 minutes after the issue was reported on the cluster.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
1/10/2017	7:30:53 AM	Warning	ABCD2-XYZHOST03.ABCD2ad.com	8004	Sophos Message Router	Failed to communicate with parent router ‘192.0.0.235’. For more information, see the RMS status report. To open the report, click Start, point to All Programs, point to Sophos, point to Sophos Endpoint Security and Control, and then click View Sophos Network Communications Report.
1/10/2017	7:36:48 AM	Information	ABCD2-XYZHOST03.ABCD2ad.com	8224	VSS	The VSS service is shutting down due to idle timeout.
1/10/2017	7:41:18 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	8005	Sophos Message Router	DNS lookup failure trying to resolve the following addresses: fe80::b5f3:26e7:2292:a1bf. For more information, see the RMS status report. To open the report, click Start, point to All Programs, point to Sophos, point to Sophos Endpoint Security and Control, and then click View Sophos Network Communications Report.

Hyper-V Events:

Because the node had dropped out of the active cluster membership, it could not keep the VMs online: it could no longer access the VMs' data folders and configuration files.

1/10/2017	7:20:19 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	16400	Microsoft-Windows-Hyper-V-VMMS	‘ABCD2-LDN-LCESVR’ cannot access the data folder of the virtual machine. The worker process (Process ID 5736) may not be functional anymore. (Virtual machine ID 15941E4D-ADE6-4A2A-BA88-4FC730896078)
1/10/2017	7:20:19 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	16400	Microsoft-Windows-Hyper-V-VMMS	‘ABCD2-SFTP’ cannot access the data folder of the virtual machine. The worker process (Process ID 7572) may not be functional anymore. (Virtual machine ID 4E26F8A8-9802-4DF8-8D5F-3D2A360D67D3)
1/10/2017	7:20:19 AM	Error	ABCD2-XYZHOST03.ABCD2ad.com	16400	Microsoft-Windows-Hyper-V-VMMS	‘ABCD2-LDN-PVS’ cannot access the data folder of the virtual machine. The worker process (Process ID 16696) may not be functional anymore. (Virtual machine ID 3441E58A-83EF-48D8-B37F-2FD5A1683658)

Cluster Events:

We referred to the cluster logs for confirmation and found the cause: missed heartbeats. Cluster node ABCD2-XYZHOST03 was not able to communicate reliably from local endpoint 192.0.0.163:~3343~ with the other nodes, and as a result the Cluster service terminated.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
1/10/2017	7:19:51 AM	Information	ABCD2-XYZHOST03.ABCD2ad.com	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.
1/10/2017	7:19:53 AM	Information	ABCD2-XYZHOST03.ABCD2ad.com	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.161:~3343~.
1/10/2017	7:19:59 AM	Information	ABCD2-XYZHOST03.ABCD2ad.com	1650	Microsoft-Windows-FailoverClustering	Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.
1/10/2017	7:19:59 AM	Information	ABCD2-XYZHOST03.ABCD2ad.com	1650	Microsoft-Windows-FailoverClustering	Cluster has established a UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.
1/10/2017	7:19:59 AM	Information	ABCD2-XYZHOST03.ABCD2ad.com	1650	Microsoft-Windows-FailoverClustering	Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.
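
To put the events in order, the timestamps from the tables in this report can be converted into offsets from the first missed heartbeat. This is only an illustrative sketch: the timestamps are copied from the tables, while the labels and the helper name offsets_from_first are our own.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Timestamps copied from the event tables in this report.
events = [
    ("first missed heartbeats (1650)", "1/10/2017 7:19:51 AM"),
    ("UDP connection lost (1650)", "1/10/2017 7:19:59 AM"),
    ("nodes removed from membership (1135)", "1/10/2017 7:20:09 AM"),
    ("Cluster service terminated (7024)", "1/10/2017 7:20:10 AM"),
    ("VMs lose data folder access (16400)", "1/10/2017 7:20:19 AM"),
    ("Sophos Message Router warning (8004)", "1/10/2017 7:30:53 AM"),
]


def offsets_from_first(rows):
    """Return (label, seconds elapsed since the first row) for each row."""
    start = datetime.strptime(rows[0][1], FMT)
    return [
        (label, int((datetime.strptime(stamp, FMT) - start).total_seconds()))
        for label, stamp in rows
    ]


offsets = dict(offsets_from_first(events))
for label, seconds in offsets.items():
    print(f"+{seconds:4d} s  {label}")
```

The ordering shows the heartbeat misses preceding the membership change, and the Sophos warning following the cluster failure by several minutes rather than preceding it.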

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
2/12/2010 23:33	(3.0:0.0)	(3.0:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02	(3.23:1.0)	(3.23:1.0)	Sophos Limited	SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26	(10.3:13.0)	(3.4:9.0)	Sophos Limited	Sophos Web Intelligence

___________________________________________________________________________

Conclusion:

After analyzing the logs, we found that node ABCD2-XYZHOST03 lost communication with the other nodes on endpoint 192.0.0.163:~3343~, and as a result it dropped out of the active failover cluster membership.

However, the events do not show any failure on the network adapters of this node, which suggests that the issue may lie with an intermediate network device.

The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

Please follow this article to add the antivirus exclusions: http://support.microsoft.com/kb/309422

Install the recommended hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article below and make sure all of these updates are installed on every node:

https://support.microsoft.com/en-us/help/2920151/recommended-hotfixes-and-updates-for-windows-server-2012-r2-based-failover-clusters

Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: this step is the most critical one when dealing with network connectivity issues.

Investigation of Network Issues:

The network connectivity issues need to be investigated with the help of the in-house networking team.

To avoid this issue in the future, the most critical step is to diagnose and investigate the persistent network connectivity issue on the cluster networks.

Check the network adapters, cables, and network configuration for the networks that connect the nodes.

Also check the hubs, switches, or bridges in the networks that connect the nodes.

Check for switch delays and proxy ARP with the help of the in-house networking team.

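Until the network issue is resolved, the cluster heartbeat settings can be relaxed from the command line to make the cluster “less sensitive” to brief network interruptions. The Delay values set the interval between heartbeats in milliseconds, and the Threshold values set the number of missed heartbeats tolerated before a node is considered down. Confirm the permitted ranges for the installed Windows Server version before applying the values below: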
cluster /prop SameSubnetDelay=2000:DWORD
cluster /prop CrossSubnetDelay=4000:DWORD
cluster /prop CrossSubnetThreshold=10:DWORD
cluster /prop SameSubnetThreshold=10:DWORD
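
The effect of these values can be estimated with simple arithmetic: the time the cluster tolerates without a heartbeat is roughly the delay multiplied by the threshold. The sketch below only reproduces that calculation for the values proposed above; the function name tolerance_seconds is hypothetical.

```python
def tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time without heartbeats before a node is considered down."""
    return delay_ms * threshold / 1000


# Values taken from the cluster /prop commands above.
proposed = {
    "same subnet": tolerance_seconds(delay_ms=2000, threshold=10),
    "cross subnet": tolerance_seconds(delay_ms=4000, threshold=10),
}

for scope, seconds in proposed.items():
    print(f"{scope}: about {seconds:.0f} s without heartbeats is tolerated")
```

A longer tolerance hides short network interruptions but also delays failover when a node has really failed, so these settings should be treated as a workaround and not as a fix for the underlying network issue.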

Communication between cluster nodes is critical for smooth cluster operation. The networks used for cluster communication must therefore be configured optimally and must meet all hardware compatibility list requirements. Two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly.