RCA – 3 – Cluster Resource Went Unresponsive but Didn’t Fail Over

Issue Description:

 

Need to find out why clustered roles running on Server W2K12R2STD did not fail over from node “ABCDX96E” to node “ABCDX97E” when node “ABCDX96E” became unresponsive on 10th May around 05:20 AM. The cluster name is “exo-clust”.

 

Initial Description:

 

Understanding how Failover Clustering Recovers from Unresponsive Resources

 

When a Virtual Machine is clustered, a cluster “Virtual Machine” resource is created which controls that VM.  The “Virtual Machine” resource and its associated resource DLL communicate with the VMMS service, telling the VM when to start and when to stop, and also perform health checks to ensure the VM is OK. 

Resources all run in a component of the Failover Clustering feature called the Resource Hosting Subsystem (RHS).  These VM actions from the user map to entry point calls that RHS makes to resources, such as Online, Offline, IsAlive, and LooksAlive.  The full list of resource DLL entry-point functions is documented on MSDN. 

The most interesting entry points, in most cases where resources go unresponsive and clustering needs to recover, are LooksAlive and IsAlive, the periodic health checks to the resource. 

  • LooksAlive is a quick, lightweight check that happens every 5 seconds by default.
  • IsAlive is a more thorough check that happens every 60 seconds by default.

Health check calls to the resource continue constantly while resources are online.  If a resource returns a failure for the lightweight LooksAlive health check, RHS will then immediately do a more comprehensive health check and call IsAlive to see if the resource is really healthy.  A resource is considered failed as the result of an IsAlive failure.
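
The polling intervals are exposed as common properties on both resources and resource types. A minimal sketch to inspect them (LooksAlivePollInterval and IsAlivePollInterval are standard cluster common properties, in milliseconds; “Resource Name” is a placeholder to substitute):

# Inspect the health-check intervals for a specific resource.
# A value of 4294967295 (0xFFFFFFFF) means the resource inherits the interval from its resource type.
Get-ClusterResource "Resource Name" | Format-List Name, LooksAlivePollInterval, IsAlivePollInterval
Get-ClusterResourceType "Virtual Machine" | Format-List Name, LooksAlivePollInterval, IsAlivePollInterval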

Think of it like this… every 60 seconds RHS calls IsAlive, basically asking the resource “Are you ok?”, and the resource responds “Yes, I am doing fine.”  This periodic health check goes on and on… until something happens to the resource and it doesn’t respond.  Think of it like a dropped call on your cell phone: how long are you willing to sit there going “Hello?  Hello?  Hello?” before you give up and call the person back, basically resetting the connection?

Failover Clustering has this same concept.  RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action.  By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it.  This is configurable with the resource DeadlockTimeout common property. 

To modify the DeadlockTimeout property of an individual resource, you can use the following PowerShell command:

(Get-ClusterResource "Resource Name").DeadlockTimeout = 300000

Or if you want to modify the DeadlockTimeout for all resources of that type you can modify it at the resource type level with the following syntax (this example will be for all virtual machine resources):

(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000

Resources are expected to respond to an IsAlive or LooksAlive call within a few hundred milliseconds, so waiting 5 minutes for a response is a really long time.  Something pretty bad has happened if a resource that normally responds in milliseconds suddenly takes longer than 5 minutes.  So it is generally recommended to stay with the default values.
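
Before changing anything, it can help to review the values currently in effect. A minimal sketch (run from any cluster node with the FailoverClusters PowerShell module; DeadlockTimeout is reported in milliseconds):

# List the deadlock timeout per resource, grouped by resource type.
Get-ClusterResource | Sort-Object ResourceType | Format-Table Name, ResourceType, DeadlockTimeout -AutoSize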

If the resource doesn’t respond in 5 minutes, RHS decides that there must be something wrong with the resource and that it should take recovery action to get it back up and running.  Remember that the resource has gone silent; RHS has no idea what is wrong with it.  The only way to recover is to terminate the RHS process; RHS then restarts, which restarts the resource, and everything is back up and running.  You may also see the associated entries in the System event log:

Event ID 1230

Cluster resource ‘Resource Name’ (resource type ‘Resource Type Name’, DLL ‘DLL Name’) did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource.

Event ID 1146

The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource. 
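
To locate these entries after the fact, a sketch along the following lines can pull them from the System log (Get-WinEvent and the FailoverClustering provider are standard; adjust the IDs as needed):

# Pull RHS recovery events: 1230 (resource deadlock) and 1146 (RHS process restarted).
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering'; Id = 1230, 1146 } -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated | Format-Table TimeCreated, Id, Message -Wrap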

The next layer of protection is that when clustering issues a request to terminate the RHS process, it will wait four times the DeadlockTimeout value (20 minutes by default) for the RHS process to terminate.  If RHS does not terminate in 20 minutes, clustering deems that the server has serious health issues and bugchecks the server to force failover and recovery.  The bugcheck code will be Stop 0x0000009E (Parameter1, Parameter2, 0x0000000000000005, Parameter4).  Note that Parameter3 will always be 0x5 if the bugcheck is the result of an RHS process failing to terminate.
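
If a node did bugcheck, the stop code and its four parameters are normally recorded after the reboot in the System log as Event ID 1001. A hedged sketch to retrieve it (the provider name below is the usual one on 2012 R2; verify on your build):

# Retrieve recorded bugchecks; for this scenario look for 0x0000009e with the third parameter 0x5.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-WER-SystemErrorReporting'; Id = 1001 } -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Message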

This is the way clustering is designed to work… it is monitoring the health of the system, it detects something is wrong, and recovers.  This is a good thing!

Summary of Recovery Behavior:

 

  1. RHS calls an entry point on a resource
  2. RHS waits DeadlockTimeout (5 minutes) for the resource to respond
  3. If the resource does not respond, the Cluster Service terminates the RHS process to recover from the unresponsive resource
  4. The Cluster Service waits DeadlockTimeout x 4 (20 minutes) for the RHS process to terminate
  5. If the RHS process does not terminate, the Cluster Service calls NetFT to bugcheck the node to recover from the RHS termination failure
  6. NetFT bugchecks the node with a STOP 0x9E

 

From <https://blogs.msdn.microsoft.com/clustering/2013/01/24/understanding-how-failover-clustering-recovers-from-unresponsive-resources/>

 

 

 

_________________________________________________________________________________________________________________ 

System Information: ABCDX97E

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCDX97E

System Manufacturer        Dell Inc.

System Model        PowerEdge R630

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R630

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

BIOS Version/Date        Dell Inc. 1.2.10, 09/03/2015

 

System Events:

 

  • Checked the system logs but was not able to find any event around 5:30 AM. However, at around 6:31 AM there is Event ID 1135, which says that the cluster node was removed from the active failover cluster membership.

 

Date/Time:      5/10/2016 6:31:14 AM
Type/Level:     Critical
Computer Name:  ABCDX97E.abc.com
Event Code:     1135
Source:         Microsoft-Windows-FailoverClustering
Description:    Cluster node ‘ABCDX96E’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

Application Events:

 

  • Checked the application logs but was not able to find any events related to the issue.

 

Cluster Events:

 

  • Checked the cluster events and found two consecutive missed heartbeats, immediately after which the cluster node was removed from the failover cluster membership.

 

Date/Time:      5/10/2016 6:31:10 AM
Type/Level:     Information
Computer Name:  ABCDX97E.abc.com
Event Code:     1650
Source:         Microsoft-Windows-FailoverClustering
Description:    Cluster has missed two consecutive heartbeats for the local endpoint 192.168.2.102:~3343~ connected to remote endpoint 192.168.2.101:~3343~.
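
Heartbeat sensitivity is governed by cluster common properties: a heartbeat is sent every SameSubnetDelay milliseconds, and a node is declared down after SameSubnetThreshold consecutive misses. A minimal sketch to review the values in effect (these are standard cluster common properties):

# Review heartbeat timing for same-subnet and cross-subnet node pairs.
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold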

 

 

 

List of outdated drivers:

 

Time/Date String | Product Version | File Version | Company Name | File Description
10/31/2014 10:43 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | PowerPath MultiPath Module
10/22/2014 14:03 | (7.12:0.0) | (7.12:0.0) | QLogic Corporation | QLogic Gigabit Ethernet VBD
10/17/2014 19:03 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Generic Purpose Extension
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath CLARiiON LAM Module
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Generic LAM Module
6/5/2014 18:38 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Invista LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Symmetrix LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath XtremIO LAM Module
12/19/2014 1:11 | (9.1:13.20) | (9.1:13.20) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver
12/4/2014 6:04 | (6.3:9600.17038) | (7.12:0.0) | QLogic Corporation | iSCSI offload x64 FREE
6/9/2014 18:47 | (6.0:0.307) | (6.0:0.307) | EMC Corporation | EMS Class Driver
8/16/2012 22:18 | (12.1:1.2) | (12.1:1.2) | Symantec Corporation | Common Client Settings Driver
11/1/2012 20:16 | (14.2:0.0) | (14.2:0.6) | Symantec Corporation | Symantec AutoProtect
11/15/2011 22:30 | (12.6:0.0) | (12.6:0.16) | Symantec Corporation | Symantec AutoProtect
8/22/2012 6:33 | (12.9:3.1) | (12.9:3.1) | Symantec Corporation | Symantec Event Library
11/3/2012 13:53 | (12.1:2015.2015) | (12.1:2015.2015) | Symantec Corporation | Symantec CMC Firewall SysPlant
7/21/2012 3:07 | (13.1:0.0) | (13.1:0.7) | Symantec Corporation | Network Security Driver
11/20/2014 13:09 | (7.12:2.0) | (7.12:2.0) | QLogic Corporation | AMD64 BXND NDIS6.0 Driver
8/13/2013 12:12 | (7.4:0.3999) | (7.4:0.3999) | Dell Inc. | Dell BASE Device Driver

 

 

Cluster Logs:

 

  • No cluster logs were generated at the time of the issue.
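
For future occurrences, the cluster text log can be regenerated on demand from the cluster’s ETW trace before it wraps. A minimal sketch (Get-ClusterLog is a standard FailoverClusters cmdlet; -TimeSpan is in minutes and C:\Temp is a placeholder destination):

# Dump the last 4 hours of the cluster trace from every node to C:\Temp, in local time.
Get-ClusterLog -Destination "C:\Temp" -UseLocalTime -TimeSpan 240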

_________________________________________________________________________________________________________________

System Information: ABCDX96E

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCDX96E

System Manufacturer        Dell Inc.

System Model        PowerEdge R630

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R630

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

BIOS Version/Date        Dell Inc. 1.2.10, 09/03/2015

 

 

System Events:

 

  • Checked the events and found that the machine crashed unexpectedly at 06:30:43 on 10/05/2016.

Date/Time:      5/10/2016 6:34:34 AM
Type/Level:     Error
Computer Name:  ABCDX96E.abc.com
Event Code:     6008
Source:         EventLog
Description:    The previous system shutdown at 06:30:43 on 10/05/2016 was unexpected.

 

  • No events were logged around 6:30, so we are not able to identify what caused the machine to crash.

 

Application Events:

 

  • Checked the application logs around the time of the issue and found SQL errors around 5:02 AM; per Event ID 7, the antivirus application was also active at the time of the issue.

 

Date/Time:      5/10/2016 5:02:00 AM
Type/Level:     Error
Computer Name:  ABCDX96E.abc.com
Event Code:     14421
Source:         MSSQL$SQLCLUINST03
Description:    The log shipping secondary database SQLCLUINST03\SQLCLUINST03.FP5_AD_Content has restore threshold of 240 minutes and is out of sync. No restore was performed for 246641 minutes. Restored latency is 0 minutes. Check agent log and logshipping monitor information.

Date/Time:      5/10/2016 5:02:47 AM
Type/Level:     Information
Computer Name:  ABCDX96E.abc.com
Event Code:     7
Source:         Symantec AntiVirus
Description:    New virus definition file loaded. Version: 160509024.

 

  • Earlier, a SQL log backup job was running around 3:00 AM.

Date/Time:      5/10/2016 3:00:16 AM
Type/Level:     Information
Computer Name:  ABCDX96E.abc.com
Event Code:     18265
Source:         MSSQL$SQLCLUINST03
Description:    Log was backed up. Database: FP5_Content_epieuropropinternational, creation date(time): 2015/12/13(12:42:15), first LSN: 3552:11104:1, last LSN: 3555:162:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘\\172.24.21.200\syncdata\Inst3\TLOG3\FP5_Content_epieuropropinternational_20160510020005.trn’}). This is an informational message only. No user action is required.

 

 

Cluster Events:

 

  • Checked the cluster events and was not able to find any event around 5:30 AM. Events start from around 6:35 AM, when the node join process began.

 

Date/Time:      5/10/2016 6:35:15 AM
Type/Level:     Information
Computer Name:  ABCDX96E.abc.com
Event Code:     1281
Source:         Microsoft-Windows-FailoverClustering
Description:    Joiner tried to Create Security Context using Package='Kerberos/NTLM' with Context Requirement ='0' and Timeout ='40000' for the target = 'ABCDX97E'

 

Cluster Logs:

 

000011bc.00004894::2016/05/10-04:36:47.108 INFO  [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e45953f3-78bf-47ac-8f85-c9d89ab14907:Netbios

00000aa8.00000aac::2016/05/10-05:34:53.520 INFO  -----------------------------+ LOG BEGIN +-----------------------------

00000aa8.00000aac::2016/05/10-05:34:53.520 INFO  [CS] Starting clussvc as a service

00000aa8.00000dc0::2016/05/10-05:35:27.489 ERR   mscs::QuorumAgent::FormLeaderWorker::operator (): (5038)' because of 'Failed to bring quorum resource 12e4512b-aeda-4ff5-9bde-880e3ef31aca online, status 5038'

00000aa8.00000d1c::2016/05/10-05:35:27.506 DBG   [GEM] Node 1: ProcessNew message from node 2

 

 

 

List of outdated drivers:

 

 

Time/Date String | Product Version | File Version | Company Name | File Description
10/31/2014 10:43 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | PowerPath MultiPath Module
10/17/2014 19:02 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Base Driver
10/17/2014 19:03 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Generic Purpose Extension
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath CLARiiON LAM Module
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Generic LAM Module
6/5/2014 18:38 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Invista LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Symmetrix LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath XtremIO LAM Module
12/19/2014 1:11 | (9.1:13.20) | (9.1:13.20) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver
10/31/2014 10:41 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | EMS Data Migration Manager
7/23/2012 23:58 | (2.1:0.0) | (2.1:0.9) | Symantec Corporation | Symantec Data Store
10/3/2012 20:00 | (4.1:0.0) | (4.1:1.9) | Symantec Corporation | Symantec Extended File Attributes
6/9/2014 18:47 | (6.0:0.307) | (6.0:0.307) | EMC Corporation | EMS Class Driver
8/16/2012 22:18 | (12.1:1.2) | (12.1:1.2) | Symantec Corporation | Common Client Settings Driver
11/1/2012 20:16 | (14.2:0.0) | (14.2:0.6) | Symantec Corporation | Symantec AutoProtect
11/15/2011 22:30 | (12.6:0.0) | (12.6:0.16) | Symantec Corporation | Symantec AutoProtect
11/3/2012 13:53 | (12.1:2015.2015) | (12.1:2015.2015) | Symantec Corporation | Symantec CMC Firewall SysPlant
7/21/2012 3:07 | (13.1:0.0) | (13.1:0.7) | Symantec Corporation | Network Security Driver
8/13/2013 12:12 | (7.4:0.3999) | (7.4:0.3999) | Dell Inc. | Dell BASE Device Driver
12/13/2012 16:54 | (8.2:2.0) | (5.0:0.34) | Tripwire, Inc. | Event Generator File System Filter Driver

 

_________________________________________________________________________________________________________________ 

 

 

Conclusion:

 

  • The logs do not seem to be conclusive, as we do not have enough logging from the time of the issue. However, per the events we did find, the cluster did not fail over the resources because the IsAlive and LooksAlive checks did not fail, and so the cluster node was not able to detect the unresponsiveness of the SQL resource that was running on it.

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

      • The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable to Cluster 2003)

      • The %Systemroot%\Cluster folder. (Applicable to Cluster 2003, 2008 & 2008 R2)

      • The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable to Cluster 2003)

 

  1. Please follow this article to add the antivirus exclusions for SQL Server: https://support.microsoft.com/en-gb/kb/309422.

 

  2. Install the following hotfixes on all cluster nodes one by one. A reboot will be required for the changes to take effect. Follow the article and make sure all of these updates are installed on all the nodes (a verification sketch follows the link):

 

https://support.microsoft.com/en-us/kb/2920151
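
As a quick verification sketch (Get-HotFix is standard; KB3000850 is only an example entry from that article, substitute the KBs you deploy):

# Check whether a given update is installed on both nodes.
Get-HotFix -Id KB3000850 -ComputerName ABCDX96E, ABCDX97E -ErrorAction SilentlyContinue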

 

  3. Kindly update the outdated drivers from the list shared above; some of them have not been updated since 2012. A sketch for re-checking driver versions follows.
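
A hedged sketch to re-inventory driver versions and dates for comparison against the lists above (Win32_PnPSignedDriver is a standard WMI class; the output formatting is illustrative):

# List the oldest signed drivers first, with version and date.
Get-WmiObject Win32_PnPSignedDriver |
    Where-Object { $_.DriverDate } |
    Sort-Object DriverDate |
    Select-Object -First 25 DeviceName, Manufacturer, DriverVersion, DriverDate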

  

  4. Communication between cluster nodes is critical for smooth cluster operations. Therefore, you must ensure that the networks used for cluster communication are configured optimally and follow all hardware compatibility list requirements. For networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly; the sketch below shows how to review the current configuration.
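
A minimal sketch to review the networks the cluster currently sees and their roles (Get-ClusterNetwork is standard; Role 1 = cluster/heartbeat only, 3 = cluster and client, 0 = excluded):

# List cluster networks with their role and state.
Get-ClusterNetwork | Format-Table Name, Role, Address, State -AutoSize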

 

Recommended private “Heartbeat” configuration on a cluster server 

Ashutosh Dixit
