Issue Description:
Determine the probable cause of an unplanned failover on the Windows Server 2012 R2 Standard cluster "SQLABCFC".
__________________________________________________________________________________________
System Information: DR-SQLABCDB1
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name DR-SQLABCDB1
System Manufacturer Microsoft Corporation
System Model Virtual Machine
System Type x64-based PC
System SKU None
Processor Intel(R) Xeon(R) CPU L5640 @ 2.27GHz, 2261 Mhz, 2 Core(s), 2 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU L5640 @ 2.27GHz, 2261 Mhz, 2 Core(s), 2 Logical Processor(s)
BIOS Version/Date Microsoft Corporation Hyper-V UEFI Release v1.0, 26/11/2012
System Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:41:53 AM | Critical | DR-SQLABCDB1.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Application Events:
- Checked the events and found a few issues with SQL communication as well.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:40:04 AM | Information | DR-SQLABCDB1.abc.local | 9012 | MSSQLSERVER | There have been 23136768 misaligned log IOs which required falling back to synchronous IO. The current IO is on file L:\Data\pgcs_sit1_5.ldf.
10/22/2016 | 11:41:18 AM | Information | DR-SQLABCDB1.abc.local | 9012 | MSSQLSERVER | There have been 23137024 misaligned log IOs which required falling back to synchronous IO. The current IO is on file L:\Data\pgcs_sit1_5.ldf.
10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 41093 | MSSQLSERVER | AlwaysOn: The local replica of availability group ‘SQLABCAG’ is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required.
10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 19406 | MSSQLSERVER | The state of the local availability replica in availability group ‘SQLABCAG’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 35267 | MSSQLSERVER | AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘JDEPostings’ on the availability replica ‘SQLABCDB1’ with Replica ID: {e2b895a3-8b8e-4f85-9176-42677ba2fd9a}. This is an informational message only. No user action is required.
Cluster Events:
- Checked and found the same events generated here.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:41:49 AM | Information | DR-SQLABCDB1.abc.local | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.0.0.160:~3343~ connected to remote endpoint 172.12.0.10:~3343~.
10/22/2016 | 11:41:52 AM | Information | DR-SQLABCDB1.abc.local | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 172.0.0.160:~3343~ connected to remote endpoint 172.12.0.10:~3343~.
10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 1641 | Microsoft-Windows-FailoverClustering | Clustered role ‘SQLABCAG’ is moving to cluster node ‘SQLABCDB2’.
List of outdated drivers:
- No third-party software is running.
____________________________________________________________________________________________
System Information: SQLABCDB1
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name SQLABCDB1
System Manufacturer Dell Inc.
System Model PowerEdge R720
System Type x64-based PC
System SKU SKU=NotProvided;ModelName=PowerEdge R720
Processor Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date Dell Inc. 2.4.3, 09/07/2014
System Events:
- Checked the events and found that SQLABCDB1 lost contact with both other nodes, lost quorum, and its Cluster service shut down.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:41:53 AM | Critical | SQLABCDB1.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘SQLABCDB2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016 | 11:41:53 AM | Critical | SQLABCDB1.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘DR-SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016 | 11:41:53 AM | Critical | SQLABCDB1.abc.local | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016 | 11:41:53 AM | Information | SQLABCDB1.abc.local | 7036 | Service Control Manager | The Cluster Service service entered the stopped state.
10/22/2016 | 11:41:53 AM | Error | SQLABCDB1.abc.local | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster.
10/22/2016 | 11:41:53 AM | Error | SQLABCDB1.abc.local | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
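The quorum loss reported in event 1177 can be illustrated with a small majority calculation. This is only a sketch under assumptions the report does not confirm: each of the three nodes holds one vote and no witness is configured. The function names are hypothetical and exist only for illustration.

```python
# Sketch of node-majority quorum for the three nodes named in this report.
# ASSUMPTION: one vote per node and no witness; the report does not state
# the actual quorum configuration of cluster SQLABCFC.

def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(partition: set, all_nodes: set) -> bool:
    """True if the nodes that can still see each other hold a majority."""
    return len(partition) >= votes_needed(len(all_nodes))


nodes = {"SQLABCDB1", "SQLABCDB2", "DR-SQLABCDB1"}
isolated = {"SQLABCDB1"}        # lost contact with both other nodes
remaining = nodes - isolated    # SQLABCDB2 and DR-SQLABCDB1 still see each other

print("votes needed:", votes_needed(len(nodes)))
print("SQLABCDB1 alone has quorum:", has_quorum(isolated, nodes))
print("other two nodes have quorum:", has_quorum(remaining, nodes))
```

Under these assumptions, SQLABCDB1 on its own cannot reach a majority, which would be consistent with its Cluster service stopping while the other two nodes kept the cluster running.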
Cluster Events:
- Checked the events and found the node reconnecting to the other nodes after the communication issue; it rejoined the cluster at 11:42:56 AM.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:42:53 AM | Information | SQLABCDB1.abc.local | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package='Kerberos/NTLM' with Context Requirement ='0' and Timeout ='40000' for the target = 'DR-SQLABCDB1'
10/22/2016 | 11:42:53 AM | Information | SQLABCDB1.abc.local | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package='Kerberos/NTLM' with Context Requirement ='0' and Timeout ='40000' for the target = 'SQLABCDB2'
10/22/2016 | 11:42:53 AM | Information | SQLABCDB1.abc.local | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package='Kerberos/NTLM' with Context Requirement ='1' and Timeout ='40000' for the target = 'DR-SQLABCDB1'
10/22/2016 | 11:42:56 AM | Information | SQLABCDB1.abc.local | 1062 | Microsoft-Windows-FailoverClustering | This node has successfully joined the failover cluster ‘SQLABCFC’.
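As a quick cross-check of the timeline on SQLABCDB1, the 60000 ms restart delay from event 7031 can be added to the time the Cluster service stopped and compared with the first Joiner event. The sketch below uses only timestamps taken from the tables in this report; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps copied from the SQLABCDB1 event tables in this report.
service_stopped = datetime(2016, 10, 22, 11, 41, 53)     # events 1177 / 7031
first_joiner_event = datetime(2016, 10, 22, 11, 42, 53)  # first event 1281
joined_cluster = datetime(2016, 10, 22, 11, 42, 56)      # event 1062

# Event 7031: "corrective action will be taken in 60000 milliseconds".
restart_delay = timedelta(milliseconds=60000)
expected_restart = service_stopped + restart_delay

print("expected restart:", expected_restart.time())
print("matches first Joiner event:", expected_restart == first_joiner_event)
print("seconds out of the cluster:", (joined_cluster - service_stopped).total_seconds())
```

The match suggests the node came back through the automatic service restart rather than manual intervention, and that it was out of the cluster for roughly one minute.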
Application Events:
- Checked the events but was not able to find anything specific.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
3/31/2016 4:06 | (6.0:0.0) | (6.0:3792.276) | Arcserve | Arcserve Unified Data Protection
12/19/2013 5:36 | (16.4:0.2) | (16.4:0.2) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
3/21/2016 15:40 | (6.0:0.0) | (6.0:3792.269) | Arcserve | Arcserve Unified Data Protection
_________________________________________________________________________________________________
System Information: SQLABCDB2
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name SQLABCDB2
System Manufacturer Dell Inc.
System Model PowerEdge R720
System Type x64-based PC
System SKU SKU=NotProvided;ModelName=PowerEdge R720
Processor Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date Dell Inc. 2.4.3, 09/07/2014
System Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:41:53 AM | Critical | SQLABCDB2.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
10/22/2016 | 11:41:53 AM | Warning | SQLABCDB2.abc.local | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘SQLABCAG_172.0.0.171’ IP address ‘172.0.0.171’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal.
10/22/2016 | 11:41:53 AM | Error | SQLABCDB2.abc.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SQLABCAG_172.0.0.171’ of type ‘IP Address’ in clustered role ‘SQLABCAG’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Cluster Events:
- Checked the cluster logs and found a missed-heartbeat event for the connection to SQLABCDB1's endpoint (172.12.0.10).
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
10/22/2016 | 11:41:49 AM | Information | SQLABCDB2.abc.local | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.12.0.11:~3343~ connected to remote endpoint 172.12.0.10:~3343~.
10/22/2016 | 11:41:53 AM | Information | SQLABCDB2.abc.local | 1641 | Microsoft-Windows-FailoverClustering | Clustered role ‘SQLABCAG’ is moving to cluster node ‘SQLABCDB2’.
10/22/2016 | 11:41:53 AM | Information | SQLABCDB2.abc.local | 1637 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SQLABCAG_172.12.0.14’ in clustered role ‘SQLABCAG’ has transitioned from state Offline to state OnlineCallIssued.
Application Events:
- Checked the events but was not able to find anything specific.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
3/31/2016 4:06 | (6.0:0.0) | (6.0:3792.276) | Arcserve | Arcserve Unified Data Protection
12/19/2013 5:36 | (16.4:0.2) | (16.4:0.2) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
3/21/2016 15:40 | (6.0:0.0) | (6.0:3792.269) | Arcserve | Arcserve Unified Data Protection
_______________________________________________________________________________________________________________
Conclusion:
- After analyzing the logs, we found that the issue was most likely caused by the cluster losing heartbeats between the production nodes. There are a few possible reasons for this. If no dedicated heartbeat network is configured, cluster traffic is routed over the management network, which is already busy. Once enough heartbeats are missed, the Cluster service on the affected node is terminated and its resources are moved to another node.
- Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: this is the most critical step when dealing with network connectivity issues.
Investigation of Network Issues :
We need to investigate the Network Connectivity Issues with the help of in-house networking team.
To avoid this issue in the future, the most critical step is to diagnose the recurring network connectivity issue on the cluster networks.
We need to check the network adapter, cables, and network configuration for the networks that connect the nodes.
We also need to check hubs, switches, or bridges in the networks that connect the nodes.
We need to check for Switch Delays & Proxy ARPs with the help of in-house Networking Team.
- There are a few cluster properties to tweak via the command line; the following values make heartbeat detection "less sensitive":
cluster /prop SameSubnetDelay=2000:DWORD
cluster /prop CrossSubnetDelay=4000:DWORD
cluster /prop CrossSubnetThreshold=10:DWORD
cluster /prop SameSubnetThreshold=10:DWORD
- Communication between server cluster nodes is critical for smooth cluster operations. Therefore, you must make sure that the networks used for cluster communication are configured optimally and follow all hardware compatibility list requirements. For networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly.
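To see what the cluster /prop values above change in practice, the time a node may stay silent before it is removed can be estimated as delay multiplied by threshold. This is a rough sketch: the "current" values assume the commonly documented defaults of 1000 ms and 5 missed heartbeats, which the report does not confirm for this cluster, so the live values should be read from the cluster before drawing conclusions.

```python
# Rough estimate: a node is considered down after `threshold` consecutive
# heartbeats, sent every `delay_ms`, have been missed.

def detection_window_ms(delay_ms: int, threshold: int) -> int:
    """Approximate silence tolerated before a node is removed."""
    return delay_ms * threshold


# ASSUMPTION: default values; not taken from this cluster's configuration.
ASSUMED_CURRENT = {"same_subnet": (1000, 5), "cross_subnet": (1000, 5)}
# Values from the cluster /prop commands listed in this report.
PROPOSED = {"same_subnet": (2000, 10), "cross_subnet": (4000, 10)}

for scope in ("same_subnet", "cross_subnet"):
    before = detection_window_ms(*ASSUMED_CURRENT[scope])
    after = detection_window_ms(*PROPOSED[scope])
    print(f"{scope}: {before / 1000:.0f} s -> {after / 1000:.0f} s")
```

A longer window hides short network glitches but also delays failover when a node has really failed, so these values are a workaround and not a replacement for fixing the underlying network problem.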
Recommended private “Heartbeat” configuration on a cluster server