RCA – 3 – Cluster Resource Went Unresponsive but Didn’t Fail Over

Issue Description:

 

Need to find out why clustered roles running on Server W2K12R2STD did not fail over from node “ABCDX96E” to node “ABCDX97E” when node “ABCDX96E” became unresponsive on 10th May around 05:20 AM. The cluster name is “exo-clust”.

 

Initial Description:

 

Understanding how Failover Clustering Recovers from Unresponsive Resources

 

When a Virtual Machine is clustered, a cluster “Virtual Machine” resource is created which controls that VM.  The “Virtual Machine” resource and its associated resource DLL communicate with the VMMS service, telling the VM when to start and when to stop, and also perform health checks to ensure the VM is OK. 

Resources all run in a component of the Failover Clustering feature called the Resource Hosting Subsystem (RHS).  These VM actions from the user map to entry point calls that RHS makes to resources, such as Online, Offline, IsAlive, and LooksAlive.  The full list of resource DLL entry-point functions is documented on MSDN. 

The most interesting entry points, in most cases where resources go unresponsive and clustering needs to recover, are LooksAlive and IsAlive, the periodic health checks to the resource. 

  • LooksAlive is a quick, lightweight check that happens every 5 seconds by default.
  • IsAlive is a more thorough check that happens every 60 seconds by default.

Health check calls to the resource continue constantly while resources are online.  If a resource returns a failure for the lightweight LooksAlive health check, RHS will then immediately do a more comprehensive health check and call IsAlive to see if the resource is really healthy.  A resource is considered failed as the result of an IsAlive failure.
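
The polling intervals are exposed as common properties on both resources and resource types. A minimal sketch to inspect them (LooksAlivePollInterval and IsAlivePollInterval are standard cluster common properties, in milliseconds; “Resource Name” is a placeholder to substitute):

# Inspect the health-check intervals for a specific resource.
# A value of 4294967295 (0xFFFFFFFF) means the resource inherits the interval from its resource type.
Get-ClusterResource "Resource Name" | Format-List Name, LooksAlivePollInterval, IsAlivePollInterval
Get-ClusterResourceType "Virtual Machine" | Format-List Name, LooksAlivePollInterval, IsAlivePollInterval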

Think of it like this… every 60 seconds RHS calls IsAlive, basically asking the resource “Are you ok?”, and the resource responds “Yes, I am doing fine.”  This periodic health check goes on and on… until something happens to the resource and it doesn’t respond.  Think of it like a dropped call on your cell phone: how long are you willing to sit there going “Hello?  Hello?  Hello?” before you give up and call the person back, basically resetting the connection?

Failover Clustering has this same concept.  RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action.  By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it.  This is configurable with the resource DeadlockTimeout common property. 

To modify the DeadlockTimeout property of an individual resource, you can use the following PowerShell command:

(Get-ClusterResource "Resource Name").DeadlockTimeout = 300000

Or if you want to modify the DeadlockTimeout for all resources of that type you can modify it at the resource type level with the following syntax (this example will be for all virtual machine resources):

(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000

Resources are expected to respond to an IsAlive or LooksAlive call within a few hundred milliseconds, so waiting 5 minutes for a response is a really long time.  Something pretty bad has happened if a resource that normally responds in milliseconds suddenly takes longer than 5 minutes.  So it is generally recommended to stay with the default values.
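
Before changing anything, it can help to review the values currently in effect. A minimal sketch (run from any cluster node with the FailoverClusters PowerShell module; DeadlockTimeout is reported in milliseconds):

# List the deadlock timeout per resource, grouped by resource type.
Get-ClusterResource | Sort-Object ResourceType | Format-Table Name, ResourceType, DeadlockTimeout -AutoSize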

If the resource doesn’t respond in 5 minutes, RHS decides that there must be something wrong with the resource and that it should take recovery action to get it back up and running.  Remember that the resource has gone silent; RHS has no idea what is wrong with it.  The only way to recover is to terminate the RHS process; RHS then restarts, which restarts the resource, and everything is back up and running.  You may also see the associated entries in the System event log:

Event ID 1230

Cluster resource ‘Resource Name’ (resource type ‘Resource Type Name’, DLL ‘DLL Name’) did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource.

Event ID 1146

The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource. 
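
To locate these entries after the fact, a sketch along the following lines can pull them from the System log (Get-WinEvent and the FailoverClustering provider are standard; adjust the IDs as needed):

# Pull RHS recovery events: 1230 (resource deadlock) and 1146 (RHS process restarted).
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering'; Id = 1230, 1146 } -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated | Format-Table TimeCreated, Id, Message -Wrap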

The next layer of protection is that when clustering issues a request to terminate the RHS process, it will wait four times the DeadlockTimeout value (20 minutes by default) for the RHS process to terminate.  If RHS does not terminate in 20 minutes, clustering deems that the server has serious health issues and bugchecks the server to force failover and recovery.  The bugcheck code will be Stop 0x0000009E (Parameter1, Parameter2, 0x0000000000000005, Parameter4).  Note that Parameter3 will always be 0x5 if the bugcheck is the result of an RHS process failing to terminate.
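
If a node did bugcheck, the stop code and its four parameters are normally recorded after the reboot in the System log as Event ID 1001. A hedged sketch to retrieve it (the provider name below is the usual one on 2012 R2; verify on your build):

# Retrieve recorded bugchecks; for this scenario look for 0x0000009e with the third parameter 0x5.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-WER-SystemErrorReporting'; Id = 1001 } -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Message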

This is the way clustering is designed to work… it is monitoring the health of the system, it detects something is wrong, and recovers.  This is a good thing!

Summary of Recovery Behavior:

 

  1. RHS calls an entry point on a resource
  2. RHS waits DeadlockTimeout (5 minutes) for the resource to respond
  3. If the resource does not respond, the Cluster Service terminates the RHS process to recover from the unresponsive resource
  4. The Cluster Service waits DeadlockTimeout x 4 (20 minutes) for the RHS process to terminate
  5. If the RHS process does not terminate, the Cluster Service calls NetFT to bugcheck the node to recover from the RHS termination failure
  6. NetFT bugchecks the node with a STOP 0x9E

 

From <https://blogs.msdn.microsoft.com/clustering/2013/01/24/understanding-how-failover-clustering-recovers-from-unresponsive-resources/>

 

 

 

_________________________________________________________________________________________________________________ 

System Information: ABCDX97E

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCDX97E

System Manufacturer        Dell Inc.

System Model        PowerEdge R630

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R630

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

BIOS Version/Date        Dell Inc. 1.2.10, 09/03/2015

 

System Events:

 

  • Checked the system logs but was not able to find any event around 5:30 AM. However, at around 6:31 AM there is Event ID 1135, which says that the cluster node was removed from the active failover cluster membership.

 

Date/Time:      5/10/2016 6:31:14 AM
Type/Level:     Critical
Computer Name:  ABCDX97E.abc.com
Event Code:     1135
Source:         Microsoft-Windows-FailoverClustering
Description:    Cluster node ‘ABCDX96E’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

Application Events:

 

  • Checked the application logs but was not able to find any events related to the issue.

 

Cluster Events:

 

  • Checked the cluster events and found two consecutive missed heartbeats, immediately after which the cluster node was removed from the failover cluster membership.

 

Date/Time:      5/10/2016 6:31:10 AM
Type/Level:     Information
Computer Name:  ABCDX97E.abc.com
Event Code:     1650
Source:         Microsoft-Windows-FailoverClustering
Description:    Cluster has missed two consecutive heartbeats for the local endpoint 192.168.2.102:~3343~ connected to remote endpoint 192.168.2.101:~3343~.
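
Heartbeat sensitivity is governed by cluster common properties: a heartbeat is sent every SameSubnetDelay milliseconds, and a node is declared down after SameSubnetThreshold consecutive misses. A minimal sketch to review the values in effect (these are standard cluster common properties):

# Review heartbeat timing for same-subnet and cross-subnet node pairs.
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold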

 

 

 

List of outdated drivers:

 

Time/Date String | Product Version | File Version | Company Name | File Description
10/31/2014 10:43 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | PowerPath MultiPath Module
10/22/2014 14:03 | (7.12:0.0) | (7.12:0.0) | QLogic Corporation | QLogic Gigabit Ethernet VBD
10/17/2014 19:03 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Generic Purpose Extension
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath CLARiiON LAM Module
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Generic LAM Module
6/5/2014 18:38 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Invista LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Symmetrix LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath XtremIO LAM Module
12/19/2014 1:11 | (9.1:13.20) | (9.1:13.20) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver
12/4/2014 6:04 | (6.3:9600.17038) | (7.12:0.0) | QLogic Corporation | iSCSI offload x64 FREE
6/9/2014 18:47 | (6.0:0.307) | (6.0:0.307) | EMC Corporation | EMS Class Driver
8/16/2012 22:18 | (12.1:1.2) | (12.1:1.2) | Symantec Corporation | Common Client Settings Driver
11/1/2012 20:16 | (14.2:0.0) | (14.2:0.6) | Symantec Corporation | Symantec AutoProtect
11/15/2011 22:30 | (12.6:0.0) | (12.6:0.16) | Symantec Corporation | Symantec AutoProtect
8/22/2012 6:33 | (12.9:3.1) | (12.9:3.1) | Symantec Corporation | Symantec Event Library
11/3/2012 13:53 | (12.1:2015.2015) | (12.1:2015.2015) | Symantec Corporation | Symantec CMC Firewall SysPlant
7/21/2012 3:07 | (13.1:0.0) | (13.1:0.7) | Symantec Corporation | Network Security Driver
11/20/2014 13:09 | (7.12:2.0) | (7.12:2.0) | QLogic Corporation | AMD64 BXND NDIS6.0 Driver
8/13/2013 12:12 | (7.4:0.3999) | (7.4:0.3999) | Dell Inc. | Dell BASE Device Driver

 

 

Cluster Logs:

 

  • No cluster logs were generated at the time of the issue.
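
For future occurrences, the cluster text log can be regenerated on demand from the cluster’s ETW trace before it wraps. A minimal sketch (Get-ClusterLog is a standard FailoverClusters cmdlet; -TimeSpan is in minutes and C:\Temp is a placeholder destination):

# Dump the last 4 hours of the cluster trace from every node to C:\Temp, in local time.
Get-ClusterLog -Destination "C:\Temp" -UseLocalTime -TimeSpan 240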

_________________________________________________________________________________________________________________

System Information: ABCDX96E

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCDX96E

System Manufacturer        Dell Inc.

System Model        PowerEdge R630

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R630

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz, 3500 Mhz, 4 Core(s), 8 Logical Processor(s)

BIOS Version/Date        Dell Inc. 1.2.10, 09/03/2015

 

 

System Events:

 

  • Checked the events and found that the machine crashed unexpectedly at 06:30:43 on 10/05/2016.

Date/Time:      5/10/2016 6:34:34 AM
Type/Level:     Error
Computer Name:  ABCDX96E.abc.com
Event Code:     6008
Source:         EventLog
Description:    The previous system shutdown at 06:30:43 on 10/05/2016 was unexpected.

 

  • No events were logged around 6:30, so we are not able to identify what caused the machine to crash.

 

Application Events:

 

  • Checked the application logs around the time of the issue and found SQL errors around 5:02 AM; per Event ID 7, the antivirus application was also active at the time of the issue.

 

Date/Time:      5/10/2016 5:02:00 AM
Type/Level:     Error
Computer Name:  ABCDX96E.abc.com
Event Code:     14421
Source:         MSSQL$SQLCLUINST03
Description:    The log shipping secondary database SQLCLUINST03\SQLCLUINST03.FP5_AD_Content has restore threshold of 240 minutes and is out of sync. No restore was performed for 246641 minutes. Restored latency is 0 minutes. Check agent log and logshipping monitor information.

Date/Time:      5/10/2016 5:02:47 AM
Type/Level:     Information
Computer Name:  ABCDX96E.abc.com
Event Code:     7
Source:         Symantec AntiVirus
Description:    New virus definition file loaded. Version: 160509024.

 

  • Earlier, a SQL log backup job was running around 3:00 AM.

Date/Time:      5/10/2016 3:00:16 AM
Type/Level:     Information
Computer Name:  ABCDX96E.abc.com
Event Code:     18265
Source:         MSSQL$SQLCLUINST03
Description:    Log was backed up. Database: FP5_Content_epieuropropinternational, creation date(time): 2015/12/13(12:42:15), first LSN: 3552:11104:1, last LSN: 3555:162:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘\\172.24.21.200\syncdata\Inst3\TLOG3\FP5_Content_epieuropropinternational_20160510020005.trn’}). This is an informational message only. No user action is required.

 

 

Cluster Events:

 

  • Checked the cluster events and was not able to find any event around 5:30 AM. Events start from around 6:35 AM, when the node join process began.

 

Date/Time:      5/10/2016 6:35:15 AM
Type/Level:     Information
Computer Name:  ABCDX96E.abc.com
Event Code:     1281
Source:         Microsoft-Windows-FailoverClustering
Description:    Joiner tried to Create Security Context using Package='Kerberos/NTLM' with Context Requirement ='0' and Timeout ='40000' for the target = 'ABCDX97E'

 

Cluster Logs:

 

000011bc.00004894::2016/05/10-04:36:47.108 INFO  [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e45953f3-78bf-47ac-8f85-c9d89ab14907:Netbios

00000aa8.00000aac::2016/05/10-05:34:53.520 INFO  -----------------------------+ LOG BEGIN +-----------------------------

00000aa8.00000aac::2016/05/10-05:34:53.520 INFO  [CS] Starting clussvc as a service

00000aa8.00000dc0::2016/05/10-05:35:27.489 ERR   mscs::QuorumAgent::FormLeaderWorker::operator (): (5038)' because of 'Failed to bring quorum resource 12e4512b-aeda-4ff5-9bde-880e3ef31aca online, status 5038'

00000aa8.00000d1c::2016/05/10-05:35:27.506 DBG   [GEM] Node 1: ProcessNew message from node 2

 

 

 

List of outdated drivers:

 

 

Time/Date String | Product Version | File Version | Company Name | File Description
10/31/2014 10:43 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | PowerPath MultiPath Module
10/17/2014 19:02 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Base Driver
10/17/2014 19:03 | (6.0:0.382) | (6.0:0.382) | EMC Corporation | EMS Generic Purpose Extension
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath CLARiiON LAM Module
6/5/2014 18:37 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Generic LAM Module
6/5/2014 18:38 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Invista LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath Symmetrix LAM Module
6/5/2014 18:39 | (6.0:0.303) | (6.0:0.303) | EMC Corporation | PowerPath XtremIO LAM Module
12/19/2014 1:11 | (9.1:13.20) | (9.1:13.20) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver
10/31/2014 10:41 | (6.0:0.389) | (6.0:0.389) | EMC Corporation | EMS Data Migration Manager
7/23/2012 23:58 | (2.1:0.0) | (2.1:0.9) | Symantec Corporation | Symantec Data Store
10/3/2012 20:00 | (4.1:0.0) | (4.1:1.9) | Symantec Corporation | Symantec Extended File Attributes
6/9/2014 18:47 | (6.0:0.307) | (6.0:0.307) | EMC Corporation | EMS Class Driver
8/16/2012 22:18 | (12.1:1.2) | (12.1:1.2) | Symantec Corporation | Common Client Settings Driver
11/1/2012 20:16 | (14.2:0.0) | (14.2:0.6) | Symantec Corporation | Symantec AutoProtect
11/15/2011 22:30 | (12.6:0.0) | (12.6:0.16) | Symantec Corporation | Symantec AutoProtect
11/3/2012 13:53 | (12.1:2015.2015) | (12.1:2015.2015) | Symantec Corporation | Symantec CMC Firewall SysPlant
7/21/2012 3:07 | (13.1:0.0) | (13.1:0.7) | Symantec Corporation | Network Security Driver
8/13/2013 12:12 | (7.4:0.3999) | (7.4:0.3999) | Dell Inc. | Dell BASE Device Driver
12/13/2012 16:54 | (8.2:2.0) | (5.0:0.34) | Tripwire, Inc. | Event Generator File System Filter Driver

 

_________________________________________________________________________________________________________________ 

 

 

Conclusion:

 

  • The logs do not seem to be conclusive, as we do not have enough logging from the time of the issue. However, per the events we did find, the cluster did not fail over the resources because the IsAlive and LooksAlive checks did not fail, and so the cluster node was not able to detect the unresponsiveness of the SQL resource that was running on it.

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

      • The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable to Cluster 2003)

      • The %Systemroot%\Cluster folder. (Applicable to Cluster 2003, 2008 & 2008 R2)

      • The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable to Cluster 2003)

 

  1. Please follow this article to add the antivirus exclusions for SQL Server: https://support.microsoft.com/en-gb/kb/309422.

 

  2. Install the following hotfixes on all cluster nodes one by one. A reboot will be required for the changes to take effect. Follow the article and make sure all of these updates are installed on all the nodes (a verification sketch follows the link):

 

https://support.microsoft.com/en-us/kb/2920151
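
As a quick verification sketch (Get-HotFix is standard; KB3000850 is only an example entry from that article, substitute the KBs you deploy):

# Check whether a given update is installed on both nodes.
Get-HotFix -Id KB3000850 -ComputerName ABCDX96E, ABCDX97E -ErrorAction SilentlyContinue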

 

  3. Kindly update the outdated drivers from the list shared above; some of them have not been updated since 2012. A sketch for re-checking driver versions follows.
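
A hedged sketch to re-inventory driver versions and dates for comparison against the lists above (Win32_PnPSignedDriver is a standard WMI class; the output formatting is illustrative):

# List the oldest signed drivers first, with version and date.
Get-WmiObject Win32_PnPSignedDriver |
    Where-Object { $_.DriverDate } |
    Sort-Object DriverDate |
    Select-Object -First 25 DeviceName, Manufacturer, DriverVersion, DriverDate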

  

  4. Communication between cluster nodes is critical for smooth cluster operations. Therefore, you must ensure that the networks used for cluster communication are configured optimally and follow all hardware compatibility list requirements. For networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly; the sketch below shows how to review the current configuration.
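
A minimal sketch to review the networks the cluster currently sees and their roles (Get-ClusterNetwork is standard; Role 1 = cluster/heartbeat only, 3 = cluster and client, 0 = excluded):

# List cluster networks with their role and state.
Get-ClusterNetwork | Format-Table Name, Role, Address, State -AutoSize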

 

Recommended private “Heartbeat” configuration on a cluster server 

Ashutosh Dixit
