RCA - 12 - Failover of SQL Resources

Issue Description:

Determine the probable cause of the unplanned failover on the Windows Server 2012 R2 Standard cluster ‘SQLABCFC’.

__________________________________________________________________________________________

System Information: DR-SQLABCDB1

OS Name Microsoft Windows Server 2012 R2 Standard

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name DR-SQLABCDB1

System Manufacturer Microsoft Corporation

System Model Virtual Machine

System Type x64-based PC

System SKU None

Processor Intel(R) Xeon(R) CPU L5640 @ 2.27GHz, 2261 Mhz, 2 Core(s), 2 Logical Processor(s)

BIOS Version/Date Microsoft Corporation Hyper-V UEFI Release v1.0, 26/11/2012

System Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:41:53 AM	Critical	DR-SQLABCDB1.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Application Events:

Checked the events and found a few issues with SQL communication as well.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:40:04 AM	Information	DR-SQLABCDB1.abc.local	9012	MSSQLSERVER	There have been 23136768 misaligned log IOs which required falling back to synchronous IO. The current IO is on file L:\Data\pgcs_sit1_5.ldf.
10/22/2016	11:41:18 AM	Information	DR-SQLABCDB1.abc.local	9012	MSSQLSERVER	There have been 23137024 misaligned log IOs which required falling back to synchronous IO. The current IO is on file L:\Data\pgcs_sit1_5.ldf.
10/22/2016	11:41:53 AM	Information	DR-SQLABCDB1.abc.local	41093	MSSQLSERVER	AlwaysOn: The local replica of availability group ‘SQLABCAG’ is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required.
10/22/2016	11:41:53 AM	Information	DR-SQLABCDB1.abc.local	19406	MSSQLSERVER	The state of the local availability replica in availability group ‘SQLABCAG’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
10/22/2016	11:41:53 AM	Information	DR-SQLABCDB1.abc.local	35267	MSSQLSERVER	AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘JDEPostings’ on the availability replica ‘SQLABCDB1’ with Replica ID: {e2b895a3-8b8e-4f85-9176-42677ba2fd9a}. This is an informational message only. No user action is required.

Cluster Events:

Checked the cluster events and found the same failure reflected here.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:41:49 AM	Information	DR-SQLABCDB1.abc.local	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 172.0.0.160:~3343~ connected to remote endpoint 172.12.0.10:~3343~.
10/22/2016	11:41:52 AM	Information	DR-SQLABCDB1.abc.local	1650	Microsoft-Windows-FailoverClustering	Cluster has lost the UDP connection from local endpoint 172.0.0.160:~3343~ connected to remote endpoint 172.12.0.10:~3343~.
10/22/2016	11:41:53 AM	Information	DR-SQLABCDB1.abc.local	1641	Microsoft-Windows-FailoverClustering	Clustered role ‘SQLABCAG’ is moving to cluster node ‘SQLABCDB2’.

List of outdated drivers:

No third-party software is running.

____________________________________________________________________________________________

System Information: SQLABCDB1

OS Name Microsoft Windows Server 2012 R2 Standard

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name SQLABCDB1

System Manufacturer Dell Inc.

System Model PowerEdge R720

System Type x64-based PC

System SKU SKU=NotProvided;ModelName=PowerEdge R720

Processor Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)

BIOS Version/Date Dell Inc. 2.4.3, 09/07/2014

System Events:

Checked the events and found that SQLABCDB1 lost contact with both other nodes, lost quorum, and its Cluster service stopped.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:41:53 AM	Critical	SQLABCDB1.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘SQLABCDB2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016	11:41:53 AM	Critical	SQLABCDB1.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘DR-SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016	11:41:53 AM	Critical	SQLABCDB1.abc.local	1177	Microsoft-Windows-FailoverClustering	The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016	11:41:53 AM	Information	SQLABCDB1.abc.local	7036	Service Control Manager	The Cluster Service service entered the stopped state.
10/22/2016	11:41:53 AM	Error	SQLABCDB1.abc.local	7024	Service Control Manager	The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster.
10/22/2016	11:41:53 AM	Error	SQLABCDB1.abc.local	7031	Service Control Manager	The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
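
Event 1177 above reports that quorum was lost on SQLABCDB1. The following is a minimal sketch of the majority arithmetic behind that, assuming three voting nodes and no witness. The report does not state the actual quorum configuration, so the vote count and the helper name `has_quorum` are illustrative assumptions only.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True when a partition holds a strict majority of all votes."""
    return reachable_votes > total_votes // 2


# Assumption: one vote each for SQLABCDB1, SQLABCDB2 and DR-SQLABCDB1, no witness.
TOTAL_VOTES = 3

# SQLABCDB1 logged event 1135 for both peers, so it could only count itself.
isolated_node_has_quorum = has_quorum(reachable_votes=1, total_votes=TOTAL_VOTES)

# SQLABCDB2 and DR-SQLABCDB1 each logged event 1135 only for SQLABCDB1,
# so they still counted each other.
surviving_pair_has_quorum = has_quorum(reachable_votes=2, total_votes=TOTAL_VOTES)

print("SQLABCDB1 alone keeps quorum:", isolated_node_has_quorum)
print("SQLABCDB2 + DR-SQLABCDB1 keep quorum:", surviving_pair_has_quorum)
```

Under these assumptions the isolated node falls below the majority and stops its Cluster service, while the remaining pair keeps the cluster running and takes over the role.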

Cluster Events:

Checked the events and found communication issues with the node.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:42:53 AM	Information	SQLABCDB1.abc.local	1281	Microsoft-Windows-FailoverClustering	Joiner tried to Create Security Context using Package=‘Kerberos/NTLM’ with Context Requirement =‘0’ and Timeout =‘40000’ for the target = ‘DR-SQLABCDB1’
10/22/2016	11:42:53 AM	Information	SQLABCDB1.abc.local	1281	Microsoft-Windows-FailoverClustering	Joiner tried to Create Security Context using Package=‘Kerberos/NTLM’ with Context Requirement =‘0’ and Timeout =‘40000’ for the target = ‘SQLABCDB2’
10/22/2016	11:42:53 AM	Information	SQLABCDB1.abc.local	1281	Microsoft-Windows-FailoverClustering	Joiner tried to Create Security Context using Package=‘Kerberos/NTLM’ with Context Requirement =‘1’ and Timeout =‘40000’ for the target = ‘DR-SQLABCDB1’
10/22/2016	11:42:56 AM	Information	SQLABCDB1.abc.local	1062	Microsoft-Windows-FailoverClustering	This node has successfully joined the failover cluster ‘SQLABCFC’.

Application Events:

Checked the events but was not able to find anything specific.

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
3/31/2016 4:06	(6.0:0.0)	(6.0:3792.276)	Arcserve	Arcserve Unified Data Protection
12/19/2013 5:36	(16.4:0.2)	(16.4:0.2)	Broadcom Corporation	Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
3/21/2016 15:40	(6.0:0.0)	(6.0:3792.269)	Arcserve	Arcserve Unified Data Protection

_________________________________________________________________________________________________

System Information: SQLABCDB2

OS Name Microsoft Windows Server 2012 R2 Standard

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name SQLABCDB2

System Manufacturer Dell Inc.

System Model PowerEdge R720

System Type x64-based PC

System SKU SKU=NotProvided;ModelName=PowerEdge R720

Processor Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)

BIOS Version/Date Dell Inc. 2.4.3, 09/07/2014

System Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:41:53 AM	Critical	SQLABCDB2.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016	11:41:53 AM	Warning	SQLABCDB2.abc.local	1045	Microsoft-Windows-FailoverClustering	No matching network interface found for resource ‘SQLABCAG_172.0.0.171’ IP address ‘172.0.0.171’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal.
10/22/2016	11:41:53 AM	Error	SQLABCDB2.abc.local	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘SQLABCAG_172.0.0.171’ of type ‘IP Address’ in clustered role ‘SQLABCAG’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Cluster Events:

Checked the cluster logs and found an event for missed heartbeats on the cluster.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
10/22/2016	11:41:49 AM	Information	SQLABCDB2.abc.local	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 172.12.0.11:~3343~ connected to remote endpoint 172.12.0.10:~3343~.
10/22/2016	11:41:53 AM	Information	SQLABCDB2.abc.local	1641	Microsoft-Windows-FailoverClustering	Clustered role ‘SQLABCAG’ is moving to cluster node ‘SQLABCDB2’.
10/22/2016	11:41:53 AM	Information	SQLABCDB2.abc.local	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘SQLABCAG_172.12.0.14’ in clustered role ‘SQLABCAG’ has transitioned from state Offline to state OnlineCallIssued.

Application Events:

Checked the events but was not able to find anything specific.

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
3/31/2016 4:06	(6.0:0.0)	(6.0:3792.276)	Arcserve	Arcserve Unified Data Protection
12/19/2013 5:36	(16.4:0.2)	(16.4:0.2)	Broadcom Corporation	Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
3/21/2016 15:40	(6.0:0.0)	(6.0:3792.269)	Arcserve	Arcserve Unified Data Protection

_______________________________________________________________________________________________________________

Conclusion:

After analyzing the logs, we found that the issue was most likely caused by the cluster losing its heartbeat. The loss is occurring between the production nodes and could have several causes. If no dedicated heartbeat network is configured, cluster traffic is routed over the management network, which is already heavily used. As a result, once enough heartbeats are missed, the Cluster service on the affected node is terminated and the resources are moved to another node.
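
The sequence behind this conclusion can be laid out by merging a few of the events from the three nodes into one timeline. The sketch below is illustrative only: the timestamps, node names, and event IDs are copied from the event tables in this report, while the short summaries and variable names are our own.

```python
from datetime import datetime

# (timestamp, reporting node, event ID, short summary) taken from the tables above.
EVENTS = [
    ("10/22/2016 11:42:56 AM", "SQLABCDB1", 1062, "rejoined cluster SQLABCFC"),
    ("10/22/2016 11:41:53 AM", "SQLABCDB1", 1177, "Cluster service shutting down, quorum lost"),
    ("10/22/2016 11:41:49 AM", "DR-SQLABCDB1", 1650, "missed two heartbeats to 172.12.0.10"),
    ("10/22/2016 11:41:52 AM", "DR-SQLABCDB1", 1650, "lost UDP connection to 172.12.0.10"),
    ("10/22/2016 11:41:49 AM", "SQLABCDB2", 1650, "missed two heartbeats to 172.12.0.10"),
    ("10/22/2016 11:41:53 AM", "SQLABCDB2", 1641, "role SQLABCAG moving to SQLABCDB2"),
]
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse(timestamp: str) -> datetime:
    """Convert an event-log timestamp string into a datetime."""
    return datetime.strptime(timestamp, TIME_FORMAT)


# Order all events chronologically, regardless of which node logged them.
timeline = sorted(EVENTS, key=lambda event: parse(event[0]))

first_miss = parse(timeline[0][0])
quorum_lost = next(parse(e[0]) for e in timeline if e[2] == 1177)
rejoined = next(parse(e[0]) for e in timeline if e[2] == 1062)

detection_seconds = (quorum_lost - first_miss).total_seconds()  # 4.0
outage_seconds = (rejoined - quorum_lost).total_seconds()       # 63.0

for stamp, node, event_id, summary in timeline:
    print(f"{stamp}  {node:<13} {event_id}  {summary}")
print(f"First missed heartbeat to quorum loss: {detection_seconds:.0f} s")
print(f"Quorum loss to rejoin: {outage_seconds:.0f} s")
```

The 63-second gap between quorum loss and rejoin is consistent with the 60000-millisecond service restart reported in event 7031, plus a few seconds for the node to rejoin.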

Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical one when dealing with network connectivity issues.

Investigation of Network Issues:

We need to investigate the network connectivity issues with the help of the in-house networking team.

To avoid this issue in the future, the most critical step is to diagnose and investigate the recurring network connectivity issue on the cluster networks.

We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.

We also need to check hubs, switches, or bridges in the networks that connect the nodes.

We need to check for switch delays and proxy ARP with the help of the in-house networking team.

A few cluster heartbeat settings can be tuned from the command line. The values below make failure detection “less sensitive”; confirm the supported ranges for your Windows Server version before applying them:
cluster /prop SameSubnetDelay=2000:DWORD
cluster /prop CrossSubnetDelay=4000:DWORD
cluster /prop CrossSubnetThreshold=10:DWORD
cluster /prop SameSubnetThreshold=10:DWORD
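
To see what these settings change, note that a node is declared down after roughly delay × threshold without a heartbeat. The sketch below works through that arithmetic. The proposed values come from the commands above; the default values (1000 ms delay, threshold of 5) are an assumption for Windows Server 2012 R2 and are not taken from this report, and the function name `tolerance_ms` is hypothetical.

```python
def tolerance_ms(delay_ms: int, threshold: int) -> int:
    """Approximate time without heartbeats before a node is declared down."""
    return delay_ms * threshold


# Assumed defaults (not stated in this report): (delay in ms, missed-heartbeat threshold).
ASSUMED_DEFAULTS = {"SameSubnet": (1000, 5), "CrossSubnet": (1000, 5)}

# Values from the cluster /prop commands above.
PROPOSED = {"SameSubnet": (2000, 10), "CrossSubnet": (4000, 10)}

for scope in ("SameSubnet", "CrossSubnet"):
    before = tolerance_ms(*ASSUMED_DEFAULTS[scope])
    after = tolerance_ms(*PROPOSED[scope])
    print(f"{scope}: {before / 1000:.0f} s -> {after / 1000:.0f} s")
```

With the proposed values the cluster would tolerate about 20 seconds of missed heartbeats within a subnet and about 40 seconds across subnets. A longer window reduces false failovers on a congested network but also delays detection of a node that is really down, so it does not replace fixing the underlying network issue.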

Communication between server cluster nodes is critical for smooth cluster operations. Therefore, you must ensure that the networks used for cluster communication are configured optimally and follow all hardware compatibility list requirements. For the networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly.