Issue Description:
In cluster “MC-BOS-CLUSTER01”, the VMs hosted on one Hyper-V node were moved to another Hyper-V node, “MC-BOS-VMS02”, running Windows 2012 Datacenter. The latter node shows the message: “Connecting to Virtual Machine Management service…”.
29th at 4:50 PM
___________________________________________________________________________
System Information: ABCD2-XYZHOST03
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCD2-XYZHOST03
System Manufacturer HP
System Model ProLiant DL360p Gen8
System Type x64-based PC
System SKU 646900-421
Processor Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz, 1796 Mhz, 4 Core(s), 4 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz, 1796 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date HP P71, 01/07/2015
System Events:
- Checked the logs on Node3 at the time of the issue and found that the node was evicted from the Failover Cluster, which generally indicates a loss of communication between this node and the other nodes that are part of the Cluster.
- Just as the node got evicted, the CSV went into a paused state and quorum was lost on this Node, due to which the Cluster Service was terminated.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCD2-XYZHOST01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCD2-XYZHOST02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
1/10/2017 | 7:20:09 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) has entered a paused state because of ‘(c0000203)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
1/10/2017 | 7:20:09 AM | Critical | ABCD2-XYZHOST03.ABCD2ad.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
1/10/2017 | 7:20:10 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster. |
1/10/2017 | 7:20:10 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service. |
PS C:\Users\adix5025.INDIA\Downloads\ERR> & '.\err(vista).exe' c0000203
# for hex 0xc0000203 / decimal -1073741309
STATUS_USER_SESSION_DELETED ntstatus.h
# The remote user session has been deleted.
# 1 matches found for "c0000203"
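For reference, the decimal value printed by the lookup above is the same 32-bit code read as a signed integer, and the top two bits of an NTSTATUS value carry its severity. A minimal Python sketch of that arithmetic (the helper names `to_signed32` and `severity` are ours for illustration, not part of err.exe or any Windows tool):

```python
def to_signed32(hex_code: str) -> int:
    """Interpret a 32-bit hex status code as a signed integer."""
    value = int(hex_code, 16) & 0xFFFFFFFF
    # Values with the high bit set are negative in two's complement.
    return value - 0x100000000 if value & 0x80000000 else value


def severity(hex_code: str) -> str:
    """Return the NTSTATUS severity encoded in the top two bits."""
    names = ["success", "informational", "warning", "error"]
    return names[(int(hex_code, 16) >> 30) & 0b11]


if __name__ == "__main__":
    print(to_signed32("c0000203"))  # -1073741309
    print(severity("c0000203"))     # error
```

This only decodes the number; the meaning of the code (STATUS_USER_SESSION_DELETED) still comes from ntstatus.h as shown in the lookup output.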
Application Events:
- Reviewed the Application logs and found that the Sophos Message Router reported errors about lost communication with its parent router. However, these errors start around 10 minutes after the issue was reported on the Cluster.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
1/10/2017 | 7:30:53 AM | Warning | ABCD2-XYZHOST03.ABCD2ad.com | 8004 | Sophos Message Router | Failed to communicate with parent router ‘192.0.0.235’. For more information, see the RMS status report. To open the report, click Start, point to All Programs, point to Sophos, point to Sophos Endpoint Security and Control, and then click View Sophos Network Communications Report. |
1/10/2017 | 7:36:48 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
1/10/2017 | 7:41:18 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 8005 | Sophos Message Router | DNS lookup failure trying to resolve the following addresses: fe80::b5f3:26e7:2292:a1bf. For more information, see the RMS status report. To open the report, click Start, point to All Programs, point to Sophos, point to Sophos Endpoint Security and Control, and then click View Sophos Network Communications Report. |
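The "around 10 minutes" gap between the cluster failure and the first Sophos warning can be checked directly from the event timestamps. A small Python sketch, assuming the timestamps are in month/day/year, 12-hour format as they appear in the event tables (the function name `gap_seconds` is ours):

```python
from datetime import datetime

# Assumed timestamp layout: month/day/year, 12-hour clock with AM/PM.
FMT = "%m/%d/%Y %I:%M:%S %p"


def gap_seconds(earlier: str, later: str) -> int:
    """Return the number of seconds between two event timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())


if __name__ == "__main__":
    quorum_lost = "1/10/2017 7:20:09 AM"     # Event 1177, System log
    sophos_warning = "1/10/2017 7:30:53 AM"  # Event 8004, Application log
    minutes, seconds = divmod(gap_seconds(quorum_lost, sophos_warning), 60)
    print(minutes, seconds)  # 10 44
```

Since the Sophos warning comes 10 minutes 44 seconds after quorum was lost, it reads as a consequence of the connectivity problem rather than its trigger.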
Hyper-V Events:
- Since the node was evicted, it was not able to keep the VMs online, as the VMs could no longer access their configuration files.
1/10/2017 | 7:20:19 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 16400 | Microsoft-Windows-Hyper-V-VMMS | ‘ABCD2-LDN-LCESVR’ cannot access the data folder of the virtual machine. The worker process (Process ID 5736) may not be functional anymore. (Virtual machine ID 15941E4D-ADE6-4A2A-BA88-4FC730896078) |
1/10/2017 | 7:20:19 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 16400 | Microsoft-Windows-Hyper-V-VMMS | ‘ABCD2-SFTP’ cannot access the data folder of the virtual machine. The worker process (Process ID 7572) may not be functional anymore. (Virtual machine ID 4E26F8A8-9802-4DF8-8D5F-3D2A360D67D3) |
1/10/2017 | 7:20:19 AM | Error | ABCD2-XYZHOST03.ABCD2ad.com | 16400 | Microsoft-Windows-Hyper-V-VMMS | ‘ABCD2-LDN-PVS’ cannot access the data folder of the virtual machine. The worker process (Process ID 16696) may not be functional anymore. (Virtual machine ID 3441E58A-83EF-48D8-B37F-2FD5A1683658) |
Cluster Events:
- Referred to the Cluster logs for confirmation and found the reason for the heartbeat misses. Cluster Node ABCD2-XYZHOST03 was not able to communicate properly on IP Address 192.0.0.163 (port 3343), due to which the Cluster Service terminated.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
1/10/2017 | 7:19:51 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
1/10/2017 | 7:19:53 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.161:~3343~. |
1/10/2017 | 7:19:59 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
1/10/2017 | 7:19:59 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has established a UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
1/10/2017 | 7:19:59 AM | Information | ABCD2-XYZHOST03.ABCD2ad.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~. |
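To see at a glance which peers the node was losing contact with, the event 1650 descriptions above can be tallied per remote endpoint. The following is only an illustrative Python sketch: the regular expression is written against the description text exactly as it appears in these events, and the function name `problem_counts` is ours.

```python
import re
from collections import Counter

# Matches "local endpoint <ip>:~<port>~ connected to remote endpoint <ip>:~<port>~"
ENDPOINTS = re.compile(
    r"local endpoint (\d+\.\d+\.\d+\.\d+):~\d+~ "
    r"connected to remote endpoint (\d+\.\d+\.\d+\.\d+):~\d+~"
)


def problem_counts(descriptions):
    """Count missed-heartbeat and lost-connection events per remote IP."""
    counts = Counter()
    for text in descriptions:
        match = ENDPOINTS.search(text)
        # Skip events that report a connection being established.
        if match and ("missed" in text or "lost" in text):
            counts[match.group(2)] += 1
    return counts


EVENTS = [
    "Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
    "Cluster has missed two consecutive heartbeats for the local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.161:~3343~.",
    "Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
    "Cluster has established a UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
    "Cluster has lost the UDP connection from local endpoint 192.0.0.163:~3343~ connected to remote endpoint 192.0.0.162:~3343~.",
]

if __name__ == "__main__":
    for remote, count in problem_counts(EVENTS).most_common():
        print(remote, count)
```

Both remote endpoints show problems from the same local endpoint within a few seconds, which is consistent with an issue on the path from 192.0.0.163 rather than on a single peer.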
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/12/2010 23:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
2/18/2014 16:02 | (3.23:1.0) | (3.23:1.0) | Sophos Limited | SAV On-Access and HIPS for Windows Vista (AMD64) |
7/28/2014 15:26 | (10.3:13.0) | (3.4:9.0) | Sophos Limited | Sophos Web Intelligence |
___________________________________________________________________________
Conclusion:
- After analyzing the logs, we found that Node ABCD2-XYZHOST03 lost communication with the other Nodes on IP 192.0.0.163 (port 3343), due to which the Node was evicted from the Failover Cluster.
- However, based on the events, we do not see any failure on the network adapter side, which possibly points to an issue with an intermediate network device.
- The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
- The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
- The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
- The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)
- Please follow the article http://support.microsoft.com/kb/309422 to add the antivirus exclusions.
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot will be required for the changes to take effect. Follow the article and make sure all these updates are installed on all the nodes:
- Investigate the Network timeout / latency / packet drops with the help of in house networking team.
Please Note : This step is the most critical while dealing with network connectivity issues.
Investigation of Network Issues :
We need to investigate the Network Connectivity Issues with the help of in-house networking team.
In order to avoid this issue in future the most critical part is to diagnose & investigate the consistent Network Connectivity Issue with Cluster Networks.
We need to check the network adapter, cables, and network configuration for the networks that connect the nodes.
We also need to check hubs, switches, or bridges in the networks that connect the nodes.
We need to check for Switch Delays & Proxy ARPs with the help of in-house Networking Team.
- There are a few heartbeat settings that can be tweaked via the command line. The following values make the cluster “less sensitive” to short network interruptions:
cluster /prop SameSubnetDelay=2000:DWORD
cluster /prop CrossSubnetDelay=4000:DWORD
cluster /prop CrossSubnetThreshold=10:DWORD
cluster /prop SameSubnetThreshold=10:DWORD
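The effect of these settings is easier to judge as a time window: a node is considered down after Threshold consecutive heartbeats are missed, with one heartbeat sent every Delay milliseconds. The Python sketch below just multiplies the two. The "default" figures of 1000 ms and 5 heartbeats are an assumption used for comparison only; actual defaults depend on the Windows Server version and installed roles, so check the current values on the cluster before changing anything.

```python
def tolerance_ms(delay_ms: int, threshold: int) -> int:
    """Time without heartbeats before a node is considered down."""
    return delay_ms * threshold


# Assumed defaults for comparison only (verify on the actual cluster).
ASSUMED_DEFAULT = {"delay_ms": 1000, "threshold": 5}

# Values from the cluster /prop commands above.
SAME_SUBNET = {"delay_ms": 2000, "threshold": 10}
CROSS_SUBNET = {"delay_ms": 4000, "threshold": 10}

if __name__ == "__main__":
    print(tolerance_ms(**ASSUMED_DEFAULT) / 1000)  # 5.0 seconds
    print(tolerance_ms(**SAME_SUBNET) / 1000)      # 20.0 seconds
    print(tolerance_ms(**CROSS_SUBNET) / 1000)     # 40.0 seconds
```

A longer window hides brief network glitches but also delays failover when a node is really down, so this tuning does not replace the network investigation described above.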
- Communication between Server Cluster nodes is critical for smooth cluster operations. Therefore, you must make sure that the networks that you use for cluster communication are configured optimally and follow all hardware compatibility list requirements. For networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a heartbeat network to the cluster so that it can work properly.
Recommended private “Heartbeat” configuration on a cluster server