Issue Description:
Need to find out why clustered roles running on Server W2K12R2STD did not failover from Node: “ABCDX96E” to Node: “ABCDX97E” when Node: “ABCDX96E” became unresponsive on 10th May around 05:20 AM. Cluster name is “exo-clust”.
Initial Description:
Understanding how Failover Clustering Recovers from Unresponsive Resources
When a Virtual Machine is clustered, there is a cluster “Virtual Machine” resource created which controls that VM. The “Virtual Machine” resource and its associated resource DLL communicates with the VMMS service and tells the VM when to start, when to stop, and it also does health checks to ensure the VM is ok.
Resources all run in a component of the Failover Clustering feature called the Resource Hosting Subsystem (RHS). These VM actions from the user map to entry point calls that RHS makes to resources, such as Online, Offline, IsAlive, and LooksAlive. You can find the full list of resource DLL entry-point functions.
The most interesting in most cases where resources go unresponsive and you see clustering need to recover is with the LooksAlive and IsAlive which is a health check to the resource.
- LooksAlive is a quick light lightweight check that happens every 5 seconds by default
- IsAlive is a more verbose check that happens every 60 seconds by default.
Health check calls to the resource continue constantly while resources are online. If a resource returns a failure for the lightweight LooksAlive health check, RHS will then immediately do a more comprehensive health check and call IsAlive to see if the resource is really healthy. A resource is considered failed as the result of an IsAlive failure.
Think of it like this… Every 60 seconds RHS calls IsAlive and basically is asking the resource “Are you ok?”. And the resource then responds to RHS “Yes, I am doing fine.” This periodic health check goes on and on… Until, there can be a case where something happens to the resource and it doesn’t respond. Think of it like a dropped call on your cell phone, how long are you willing to sit there going “Hello? Hello? Hello?”… before you give up and call the person back? Basically resetting the connection…
Failover Clustering has this same concept. RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action. By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it. This is configurable with the resource DeadlockTimeout common property.
To modify the DeadlockTimeout property of an individual resource, you can use the following PowerShell cmdlet command:
(Get-ClusterResource “Resource Name”).DeadlockTimeout = 300000
Or if you want to modify the DeadlockTimeout for all resources of that type you can modify it at the resource type level with the following syntax (this example will be for all virtual machine resources):
(Get-ClusterResourceType “Virtual Machine”).DeadlockTimeout = 300000
Resources are expected to respond to an IsAlive or LooksAlive within a few hundred milliseconds, so waiting 5 minutes for a resource to respond is a really long time. Something pretty bad happened if a resource which normally responds in milliseconds, suddenly takes longer than 5 minutes. So it is generally recommended to stay with the default values.
If the resource doesn’t respond in 5 minutes, RHS decides that there must be something wrong with the resource and that it should take recovery action to get it back up and running. Remember that the resource has gone silent; RHS has no idea what is wrong with it. The only way to recover and get the resource back up and running is that the RHS process is terminated, then RHS restarts, which will then restart the resource, and everything is back up and running. You may also see the associated entries in the System event log:
Event ID 1230
Cluster resource ‘Resource Name’ (resource type ‘Resource Type Name’, DLL ‘DLL Name’) did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource.
Event ID 1146
The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.
The next layer of protection is that when clustering issues a request to terminate the RHS process, it will wait four times the DeadlockTimeout value (which equates to 20 minutes by default) for the RHS process to terminate. If RHS does not terminate in 20 minutes, clustering will deem that the server has some serious health issues and will bugcheck the server to force failover and recovery. The bugcheck code will be Stop 0x0000009E ( Parameter1 , Parameter2 , 0x0000000000000005 , Parameter4 ). Note: that Parameter3 will always be a value of 0x5 if it is the result of an RHS process failing to terminate.
This is the way clustering is designed to work… it is monitoring the health of the system, it detects something is wrong, and recovers. This is a good thing!
Summary of Recovery Behavior:
- RHS calls an entry point to resource
- RHS waits DeadlockTimeout (5 minutes) for resource to respond
- If resource does not respond, Cluster Service terminates RHS process to recover from unresponsive resource
- Cluster Service waits DeadlockTimeout x 4 (20 minutes) for the RHS process to terminate
- If RHS process does not terminate, Cluster Service calls NetFT to bugcheck the node to recover from RHS termination failure
- NetFT bugchecks the node with a STOP 0x9e
_________________________________________________________________________________________________________________
System Information: ABCDX97E
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCDX97E
System Manufacturer Dell Inc.
System Model PowerEdge R630
System Type x64-based PC
System SKU SKU=NotProvided;ModelName=PowerEdge R630
Processor Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date Dell Inc. 1.2.10, 09/03/2015
System Events:
- Checked the system logs but was not able to find any event around 5:30 AM. However around 6:30 we get event id 1135 which says that the cluster node is removed from FCM.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
5/10/2016 | 6:31:14 AM | Critical | ABCDX97E.abc.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCDX96E’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
Application Events:
- Checked the application logs but was not able to find any events related to the issue.
Cluster Events:
- Checked the Cluster events and found that there are two consecutive heartbeat misses and just after which the cluster node is evicted from the failover cluster manager.
5/10/2016 | 6:31:10 AM | Information | ABCDX97E.abc.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.168.2.102:~3343~ connected to remote endpoint 192.168.2.101:~3343~. |
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
10/31/2014 10:43 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | PowerPath MultiPath Module |
10/22/2014 14:03 | (7.12:0.0) | (7.12:0.0) | QLogic Corporation | QLogic Gigabit Ethernet VBD |
10/17/2014 19:03 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Generic Purpose Extension |
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath CLARiiON LAM Module |
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Generic LAM Module |
6/5/2014 18:38 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Invista LAM Module |
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Symmetrix LAM Module |
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath XtremIO LAM Module |
12/19/2014 1:11 | (9.1:13.20) | (9.1:13.20) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
12/4/2014 6:04 | (6.3:9600.17038) | (7.12:0.0) | QLogic Corporation | iSCSI offload x64 FREE |
6/9/2014 18:47 | (6.0:0.307) | (6.0:0.307) | EMC Corporation | EMS Class Driver |
8/16/2012 22:18 | (12.1:1.2) | (12.1:1.2) | Symantec Corporation | Common Client Settings Driver |
11/1/2012 20:16 | (14.2:0.0) | (14.2:0.6) | Symantec Corporation | Symantec AutoProtect |
11/15/2011 22:30 | (12.6:0.0) | (12.6:0.16) | Symantec Corporation | Symantec AutoProtect |
8/22/2012 6:33 | (12.9:3.1) | (12.9:3.1) | Symantec Corporation | Symantec Event Library |
11/3/2012 13:53 | (12.1:2015.2015) | (12.1:2015.2015) | Symantec Corporation | Symantec CMC Firewall SysPlant |
7/21/2012 3:07 | (13.1:0.0) | (13.1:0.7) | Symantec Corporation | Network Security Driver |
11/20/2014 13:09 | (7.12:2.0) | (7.12:2.0) | QLogic Corporation | AMD64 BXND NDIS6.0 Driver |
8/13/2013 12:12 | (7.4:0.3999) | (7.4:0.3999) | Dell Inc. | Dell BASE Device Driver |
Cluster Logs:
- There is no Cluster logs generated at the time of issue.
_________________________________________________________________________________________________________________
System Information: ABCDX96E
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCDX96E
System Manufacturer Dell Inc.
System Model PowerEdge R630
System Type x64-based PC
System SKU SKU=NotProvided;ModelName=PowerEdge R630
Processor Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date Dell Inc. 1.2.10, 09/03/2015
System Events:
- Checked the events and found that the Machine Crashed unexpectedly on 06:30:43 on 10/05/2016.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
5/10/2016 | 6:34:34 AM | Error | ABCDX96E.abc.com | 6008 | EventLog | The previous system shutdown at 06:30:43 on ?10/?05/?2016 was unexpected. |
- As per the events the machine haven’t crashed with any bug check of event id 41.
- As per this scenario the possible causes are: Hardware going bad, Hardware firmware or driver is misbehaving, Mini filter drivers (i.e. antivirus and other security software) is/are misbehaving etc…
- There is No Events logged around 6:30 due to which we are not able to find the events which caused the machine crashed.
Application Events:
- Checked the application logs at the time of issue and found that there are some SQL errors around 1:02 AM and as per the Event 7 the Antivirus application was also active at the time of issue.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
5/10/2016 | 5:02:00 AM | Error | ABCDX96E.abc.com | 14421 | MSSQL$SQLCLUINST03 | The log shipping secondary database SQLCLUINST03\SQLCLUINST03.FP5_AD_Content has restore threshold of 240 minutes and is out of sync. No restore was performed for 246641 minutes. Restored latency is 0 minutes. Check agent log and logshipping monitor information. |
5/10/2016 | 5:02:47 AM | Information | ABCDX96E.abc.com | 7 | Symantec AntiVirus | New virus definition file loaded. Version: 160509024. |
- Initially there was a SQL backup job running around 3:00 AM.
5/10/2016 | 3:00:16 AM | Information | ABCDX96E.abc.com | 18265 | MSSQL$SQLCLUINST03 | Log was backed up. Database: FP5_Content_epieuropropinternational, creation date(time): 2015/12/13(12:42:15), first LSN: 3552:11104:1, last LSN: 3555:162:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘\\172.24.21.200\syncdata\Inst3\TLOG3\FP5_Content_epieuropropinternational_20160510020005.trn’}). This is an informational message only. No user action is required. |
Cluster Events:
- Checked the Events and was not able to find any event around 5:30 AM. Events started from 6:30 when the Node Join process started.
5/10/2016 | 6:35:15 AM | Information | ABCDX96E.abc.com | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package=’Kerberos/NTLM’ with Context Requirement =’0′ and Timeout =’40000′ for the target = ‘ABCDX97E’ |
Cluster Logs:
000011bc.00004894::2016/05/10-04:36:47.108 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e45953f3-78bf-47ac-8f85-c9d89ab14907:Netbios
00000aa8.00000aac::2016/05/10-05:34:53.520 INFO —————————–+ LOG BEGIN +—————————–
00000aa8.00000aac::2016/05/10-05:34:53.520 INFO [CS] Starting clussvc as a service
00000aa8.00000dc0::2016/05/10-05:35:27.489 ERR mscs::QuorumAgent::FormLeaderWorker::operator (): (5038)’ because of ‘Failed to bring quorum resource 12e4512b-aeda-4ff5-9bde-880e3ef31aca online, status 5038’
00000aa8.00000d1c::2016/05/10-05:35:27.506 DBG [GEM] Node 1: ProcessNew message from node 2
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
10/31/2014 10:43 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | PowerPath MultiPath Module |
10/17/2014 19:02 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Base Driver |
10/17/2014 19:03 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Generic Purpose Extension |
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath CLARiiON LAM Module |
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Generic LAM Module |
6/5/2014 18:38 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Invista LAM Module |
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Symmetrix LAM Module |
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath XtremIO LAM Module |
12/19/2014 1:11 | (9.1:13.20) | (9.1:13.20) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
10/31/2014 10:41 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | EMS Data Migration Manager |
7/23/2012 23:58 | (2.1:0.0) | (2.1:0.9) | Symantec Corporation | Symantec Data Store |
10/3/2012 20:00 | (4.1:0.0) | (4.1:1.9) | Symantec Corporation | Symantec Extended File Attributes |
6/9/2014 18:47 | (6.0:0.307) | (6.0:0.307) | EMC Corporation | EMS Class Driver |
8/16/2012 22:18 | (12.1:1.2) | (12.1:1.2) | Symantec Corporation | Common Client Settings Driver |
11/1/2012 20:16 | (14.2:0.0) | (14.2:0.6) | Symantec Corporation | Symantec AutoProtect |
11/15/2011 22:30 | (12.6:0.0) | (12.6:0.16) | Symantec Corporation | Symantec AutoProtect |
11/3/2012 13:53 | (12.1:2015.2015) | (12.1:2015.2015) | Symantec Corporation | Symantec CMC Firewall SysPlant |
7/21/2012 3:07 | (13.1:0.0) | (13.1:0.7) | Symantec Corporation | Network Security Driver |
8/13/2013 12:12 | (7.4:0.3999) | (7.4:0.3999) | Dell Inc. | Dell BASE Device Driver |
12/13/2012 16:54 | (8.2:2.0) | (5.0:0.34) | Tripwire, Inc. | Event Generator File System Filter Driver |
_________________________________________________________________________________________________________________\
Conclusion:
- The logs are not seems to be conclusive as we don’t have enough logs at the time of issue. However as per the event that we found the cluster didn’t failed over the resources as the Isalive and Looksalive test didn’t failed and the Cluster node was not able to understand the unresponsiveness of the SQL resource which was running on the node.
- The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
•The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning.(Applicable for Cluster 2003)
•The %Systemroot%\Cluster folder.(Applicable for Cluster 2003, 2008 & 2008 R2)
•The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning.(Applicable for Cluster 2003)
- Please follow the article to add the antivirus exclusion for SQL: https://support.microsoft.com/en-gb/kb/309422.
- Install following hotfixes on all cluster nodes one by one. Reboot will be required for the changes to take effect. Follow the article and make sure all these updates are installed on all the nodes:
https://support.microsoft.com/en-us/kb/2920151
- Kindly Update the Outdated Drivers from the List that I have shared with you and some of them are not been update since 2012.
- Communication between Server Cluster nodes is critical for smooth cluster operations. Therefore, you must configure the networks that you use for cluster communication are configured optimally and follow all hardware compatibility list requirements. For networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a heartbeat network to the cluster so that it can work properly.
Recommended private “Heartbeat” configuration on a cluster server