RCA – 12 – Failover of SQL Resources

Issue Description:

Determine the probable cause of the unplanned failover on the Windows Server 2012 R2 Standard cluster “SQLABCFC”.

__________________________________________________________________________________________

System Information:  DR-SQLABCDB1

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        DR-SQLABCDB1

System Manufacturer        Microsoft Corporation

System Model        Virtual Machine

System Type        x64-based PC

System SKU        None

Processor        Intel(R) Xeon(R) CPU           L5640  @ 2.27GHz, 2261 Mhz, 2 Core(s), 2 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU           L5640  @ 2.27GHz, 2261 Mhz, 2 Core(s), 2 Logical Processor(s)

BIOS Version/Date        Microsoft Corporation Hyper-V UEFI Release v1.0, 26/11/2012

System Events:

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:41:53 AM | Critical | DR-SQLABCDB1.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

Application Events:

  • Checked the events and found a few SQL Server communication issues as well.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:40:04 AM | Information | DR-SQLABCDB1.abc.local | 9012 | MSSQLSERVER | There have been 23136768 misaligned log IOs which required falling back to synchronous IO. The current IO is on file L:\Data\pgcs_sit1_5.ldf. |
| 10/22/2016 | 11:41:18 AM | Information | DR-SQLABCDB1.abc.local | 9012 | MSSQLSERVER | There have been 23137024 misaligned log IOs which required falling back to synchronous IO. The current IO is on file L:\Data\pgcs_sit1_5.ldf. |
| 10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 41093 | MSSQLSERVER | AlwaysOn: The local replica of availability group ‘SQLABCAG’ is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required. |
| 10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 19406 | MSSQLSERVER | The state of the local availability replica in availability group ‘SQLABCAG’ has changed from ‘SECONDARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the availability group state has changed in Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log. |
| 10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 35267 | MSSQLSERVER | AlwaysOn Availability Groups connection with primary database terminated for secondary database ‘JDEPostings’ on the availability replica ‘SQLABCDB1’ with Replica ID: {e2b895a3-8b8e-4f85-9176-42677ba2fd9a}. This is an informational message only. No user action is required. |

Cluster Events:

  • Checked and found the same events generated here.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:41:49 AM | Information | DR-SQLABCDB1.abc.local | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.0.0.160:~3343~ connected to remote endpoint 172.12.0.10:~3343~. |
| 10/22/2016 | 11:41:52 AM | Information | DR-SQLABCDB1.abc.local | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 172.0.0.160:~3343~ connected to remote endpoint 172.12.0.10:~3343~. |
| 10/22/2016 | 11:41:53 AM | Information | DR-SQLABCDB1.abc.local | 1641 | Microsoft-Windows-FailoverClustering | Clustered role ‘SQLABCAG’ is moving to cluster node ‘SQLABCDB2’. |

List of outdated drivers:

  • No third-party software is running on this node, so there are no outdated third-party drivers to list.

____________________________________________________________________________________________

System Information: SQLABCDB1

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        SQLABCDB1

System Manufacturer        Dell Inc.

System Model        PowerEdge R720

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R720

Processor        Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)

BIOS Version/Date        Dell Inc. 2.4.3, 09/07/2014

System Events:

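  • Checked the events and found that SQLABCDB1 lost contact with both other nodes (SQLABCDB2 and DR-SQLABCDB1), lost quorum, and stopped its Cluster service, which was then scheduled to restart.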

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:41:53 AM | Critical | SQLABCDB1.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘SQLABCDB2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 10/22/2016 | 11:41:53 AM | Critical | SQLABCDB1.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘DR-SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 10/22/2016 | 11:41:53 AM | Critical | SQLABCDB1.abc.local | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 10/22/2016 | 11:41:53 AM | Information | SQLABCDB1.abc.local | 7036 | Service Control Manager | The Cluster Service service entered the stopped state. |
| 10/22/2016 | 11:41:53 AM | Error | SQLABCDB1.abc.local | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: A quorum of cluster nodes was not present to form a cluster. |
| 10/22/2016 | 11:41:53 AM | Error | SQLABCDB1.abc.local | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service. |
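
The quorum loss in event 1177 can be illustrated with a small majority-vote calculation. This is only a sketch: it assumes a simple node-majority model in which each of the three nodes holds one vote and no witness is configured, which this report does not confirm. The function name `has_quorum` is hypothetical.

```python
def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """Majority rule: strictly more than half of all votes must be reachable."""
    return votes_visible > total_votes // 2


# Assumption: SQLABCDB1, SQLABCDB2 and DR-SQLABCDB1 hold one vote each, no witness.
TOTAL_VOTES = 3

# SQLABCDB1 logged event 1135 for both peers, so it could only count its own vote.
isolated_side = has_quorum(votes_visible=1, total_votes=TOTAL_VOTES)

# SQLABCDB2 and DR-SQLABCDB1 only reported losing SQLABCDB1, so they still saw each other.
surviving_side = has_quorum(votes_visible=2, total_votes=TOTAL_VOTES)

print(f"SQLABCDB1 alone keeps quorum: {isolated_side}")
print(f"SQLABCDB2 + DR-SQLABCDB1 keep quorum: {surviving_side}")
```

Under these assumptions the isolated node cannot hold a majority, which matches the Cluster service on SQLABCDB1 shutting itself down while the other two nodes kept the cluster running.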

Cluster Events:

  • Checked the events and found communication issues with this node; the events below show it rejoining the cluster about a minute later.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:42:53 AM | Information | SQLABCDB1.abc.local | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package=‘Kerberos/NTLM’ with Context Requirement =‘0’ and Timeout =‘40000’ for the target = ‘DR-SQLABCDB1’ |
| 10/22/2016 | 11:42:53 AM | Information | SQLABCDB1.abc.local | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package=‘Kerberos/NTLM’ with Context Requirement =‘0’ and Timeout =‘40000’ for the target = ‘SQLABCDB2’ |
| 10/22/2016 | 11:42:53 AM | Information | SQLABCDB1.abc.local | 1281 | Microsoft-Windows-FailoverClustering | Joiner tried to Create Security Context using Package=‘Kerberos/NTLM’ with Context Requirement =‘1’ and Timeout =‘40000’ for the target = ‘DR-SQLABCDB1’ |
| 10/22/2016 | 11:42:56 AM | Information | SQLABCDB1.abc.local | 1062 | Microsoft-Windows-FailoverClustering | This node has successfully joined the failover cluster ‘SQLABCFC’. |

Application Events:

  • Checked the events but was not able to find anything specific.

List of outdated drivers:

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 3/31/2016 4:06 | (6.0:0.0) | (6.0:3792.276) | Arcserve | Arcserve Unified Data Protection |
| 12/19/2013 5:36 | (16.4:0.2) | (16.4:0.2) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 3/21/2016 15:40 | (6.0:0.0) | (6.0:3792.269) | Arcserve | Arcserve Unified Data Protection |

_________________________________________________________________________________________________

System Information: SQLABCDB2

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        SQLABCDB2

System Manufacturer        Dell Inc.

System Model        PowerEdge R720

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R720

Processor        Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)

BIOS Version/Date        Dell Inc. 2.4.3, 09/07/2014

 

 

System Events:

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:41:53 AM | Critical | SQLABCDB2.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘SQLABCDB1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 10/22/2016 | 11:41:53 AM | Warning | SQLABCDB2.abc.local | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘SQLABCAG_172.0.0.171’ IP address ‘172.0.0.171’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal. |
| 10/22/2016 | 11:41:53 AM | Error | SQLABCDB2.abc.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SQLABCAG_172.0.0.171’ of type ‘IP Address’ in clustered role ‘SQLABCAG’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
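
Events 1045 and 1069 look alarming, but they are expected when an availability group has one IP address resource per subnet. The following sketch checks which of the two SQLABCAG addresses seen in the logs can come online on each node. The /24 prefix length is an assumption, since the report does not state the subnet masks, and `usable_ips` is a hypothetical helper name.

```python
import ipaddress

# Assumption: both subnets use a /24 mask; the logs do not show the real masks.
PREFIX = 24

# Local heartbeat endpoints taken from the event 1650 entries in this report.
NODE_IPS = {"SQLABCDB2": "172.12.0.11", "DR-SQLABCDB1": "172.0.0.160"}

# IP address resources of the clustered role SQLABCAG seen in events 1045/1069/1637.
AG_IPS = ["172.0.0.171", "172.12.0.14"]


def usable_ips(node_ip: str, candidates: list[str], prefix: int = PREFIX) -> list[str]:
    """Return the candidate addresses that fall inside the node's own subnet."""
    network = ipaddress.ip_interface(f"{node_ip}/{prefix}").network
    return [ip for ip in candidates if ipaddress.ip_address(ip) in network]


for node, ip in NODE_IPS.items():
    print(f"{node}: can host {usable_ips(ip, AG_IPS)}")
```

Under the /24 assumption, SQLABCDB2 can only host 172.12.0.14, so the failure of ‘SQLABCAG_172.0.0.171’ on that node is a side effect of the move rather than a cause of the failover.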

 

 

Cluster Events:

  • Checked the cluster logs and found a missed-heartbeat event for the connection to 172.12.0.10.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 10/22/2016 | 11:41:49 AM | Information | SQLABCDB2.abc.local | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.12.0.11:~3343~ connected to remote endpoint 172.12.0.10:~3343~. |
| 10/22/2016 | 11:41:53 AM | Information | SQLABCDB2.abc.local | 1641 | Microsoft-Windows-FailoverClustering | Clustered role ‘SQLABCAG’ is moving to cluster node ‘SQLABCDB2’. |
| 10/22/2016 | 11:41:53 AM | Information | SQLABCDB2.abc.local | 1637 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SQLABCAG_172.12.0.14’ in clustered role ‘SQLABCAG’ has transitioned from state Offline to state OnlineCallIssued. |

Application Events:

  • Checked the events but was not able to find anything specific.

List of outdated drivers:

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 3/31/2016 4:06 | (6.0:0.0) | (6.0:3792.276) | Arcserve | Arcserve Unified Data Protection |
| 12/19/2013 5:36 | (16.4:0.2) | (16.4:0.2) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 3/21/2016 15:40 | (6.0:0.0) | (6.0:3792.269) | Arcserve | Arcserve Unified Data Protection |

_______________________________________________________________________________________________________________

 

Conclusion:

 

  • After analyzing the logs, we found that the issue was most likely caused by the cluster losing heartbeats. This is happening between the production nodes, and there are a few possible reasons for it. If no dedicated heartbeat network is configured, cluster traffic is routed over the management network, which is already busy. Once enough heartbeats are missed, the Cluster service on the affected node is terminated and the resources are moved to another node.
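
To make the sequence behind this conclusion easier to follow, the key events from all three nodes can be merged into one timeline. The sketch below uses only times and event IDs copied from the tables in this report; the one-line summaries are paraphrased, and reading 172.12.0.10 as SQLABCDB1's address is an inference from the logs, not something they state directly.

```python
from datetime import datetime

# (time, reporting node, event ID, paraphrased summary) from the tables in this report.
# All events are AM on 10/22/2016, so the times are written in 24-hour form.
EVENTS = [
    ("11:41:53", "SQLABCDB1", 1177, "Cluster service shutting down, quorum lost"),
    ("11:41:49", "DR-SQLABCDB1", 1650, "missed two heartbeats to 172.12.0.10"),
    ("11:42:56", "SQLABCDB1", 1062, "rejoined cluster SQLABCFC"),
    ("11:41:49", "SQLABCDB2", 1650, "missed two heartbeats to 172.12.0.10"),
    ("11:41:52", "DR-SQLABCDB1", 1650, "lost UDP connection to 172.12.0.10"),
    ("11:41:53", "SQLABCDB2", 1135, "SQLABCDB1 removed from cluster membership"),
    ("11:41:53", "SQLABCDB2", 1641, "role SQLABCAG moving to SQLABCDB2"),
    ("11:42:53", "SQLABCDB1", 1281, "joiner creating security context"),
]


def parse(clock: str) -> datetime:
    """Parse an HH:MM:SS string so events can be ordered and subtracted."""
    return datetime.strptime(clock, "%H:%M:%S")


# sorted() is stable, so events with the same timestamp keep their listed order.
timeline = sorted(EVENTS, key=lambda event: parse(event[0]))
for clock, node, event_id, summary in timeline:
    print(f"{clock}  {node:<13} {event_id:>5}  {summary}")

# Time between the Cluster service stopping on SQLABCDB1 and the node rejoining.
downtime_seconds = int((parse("11:42:56") - parse("11:41:53")).total_seconds())
print(f"SQLABCDB1 was out of the cluster for about {downtime_seconds} s")
```

The gap between the service stopping and the node rejoining is consistent with the 60000 millisecond restart delay reported in event 7031.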

 

  • Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical when dealing with network connectivity issues. To avoid this issue in the future, the persistent connectivity problem on the cluster networks has to be diagnosed first.

Investigation of network issues:

We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.

We also need to check the hubs, switches, or bridges in the networks that connect the nodes.

We need to check for switch delays and proxy ARP with the help of the in-house networking team.

 

 

  • There are a few cluster properties to tweak via the command line. The delays are in milliseconds and the thresholds are counts of missed heartbeats. Here are relaxed values (the maximum delays, with the thresholds raised to 10) that make heartbeat detection “less sensitive”:
    cluster /prop SameSubnetDelay=2000:DWORD
    cluster /prop CrossSubnetDelay=4000:DWORD
    cluster /prop CrossSubnetThreshold=10:DWORD
    cluster /prop SameSubnetThreshold=10:DWORD
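
The effect of these four properties is easier to judge as a time window: a node is treated as down after roughly delay × threshold without a heartbeat. The sketch below compares the proposed values with what are assumed here to be the Windows Server 2012 R2 defaults (1000 ms delay, threshold of 5); verify the actual values on your cluster before relying on these numbers. The function name `tolerance_seconds` is hypothetical.

```python
def tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time without heartbeats before a node is considered down."""
    return delay_ms * threshold / 1000


# Assumption: default values for Windows Server 2012 R2; confirm on the real cluster.
DEFAULTS = {"SameSubnet": (1000, 5), "CrossSubnet": (1000, 5)}

# Values from the cluster /prop commands above.
PROPOSED = {"SameSubnet": (2000, 10), "CrossSubnet": (4000, 10)}

for scope in DEFAULTS:
    before = tolerance_seconds(*DEFAULTS[scope])
    after = tolerance_seconds(*PROPOSED[scope])
    print(f"{scope}: {before:.0f} s -> {after:.0f} s")
```

A longer window hides short network interruptions, but it also delays failover when a node has really failed, so these settings are a workaround and not a substitute for fixing the network.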

 

  • Communication between cluster nodes is critical for smooth cluster operations. Therefore, the networks used for cluster communication must be configured optimally and must follow all hardware compatibility list requirements. Two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can work properly.

Recommended private “Heartbeat” configuration on a cluster server 

Ashutosh Dixit
