RCA – 18 – Quorum Lost in a Cluster

System Information: 2ABCRV038

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        2ABCRV038

System Manufacturer        Dell Inc.

System Model        PowerEdge R730xd

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R730xd

Processor        Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2200 Mhz, 12 Core(s), 24 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2200 Mhz, 12 Core(s), 24 Logical Processor(s)

BIOS Version/Date        Dell Inc. 2.3.4, 08/11/2016

System Events:

  • The issue started when the File Share Witness failed its periodic health check. The cluster service performs this check, so verify that the cluster can communicate with the file share.
  • The File Share Witness resource then went into a failed state.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 3/20/2017 | 12:27:07 AM | Warning | 2ABCRV038.abc.uk | 1562 | Microsoft-Windows-FailoverClustering | File share witness resource ‘File Share Witness’ failed a periodic health check on file share ‘\\2ABCRV094\SQLWitness’. Please ensure that file share ‘\\2ABCRV094\SQLWitness’ exists and is accessible by the cluster. |
| 3/20/2017 | 12:27:07 AM | Error | 2ABCRV038.abc.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘File Share Witness’ of type ‘File Share Witness’ in clustered role ‘Cluster Group’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
| 3/20/2017 | 12:27:28 AM | Critical | 2ABCRV038.abc.uk | 1564 | Microsoft-Windows-FailoverClustering | File share witness resource ‘File Share Witness’ failed to arbitrate for the file share ‘\\2ABCRV094\SQLWitness’. Please ensure that file share ‘\\2ABCRV094\SQLWitness’ exists and is accessible by the cluster. |
| 3/20/2017 | 12:27:28 AM | Error | 2ABCRV038.abc.uk | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered role ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role. |

  • After this, the cluster Network Name resource also went into a failed state.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 3/20/2017 | 9:29:16 AM | Error | 2ABCRV038.abc.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘2ABCSQL_2ABCSQL’ of type ‘Network Name’ in clustered role ‘2ABCSQL’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |

Application Events:

058952 000019e0.0000270c::2017/03/11-00:28:26.694 ERR   [RES] File Share Witness <File Share Witness>: Failed to create or open directory \\2ABCRV094\SQLWitness\e8c5e3e9-c9cf-4f79-8367-a0265b3627cf, error 53.

058953 000019e0.0000270c::2017/03/11-00:28:26.694 ERR   [RES] File Share Witness <File Share Witness>: Failed to validate an access to the active share \\2ABCRV094\SQLWitness\e8c5e3e9-c9cf-4f79-8367-a0265b3627cf with 53.

058954 000019e0.0000270c::2017/03/11-00:28:26.694 ERR   [RES] File Share Witness <File Share Witness>: Failed to create or open directory \\2ABCRV094\SQLWitness\e8c5e3e9-c9cf-4f79-8367-a0265b3627cf, error 53.

058955 000019e0.0000270c::2017/03/11-00:28:26.694 ERR   [RES] File Share Witness <File Share Witness>: Failed to validate an access to the active share \\2ABCRV094\SQLWitness\e8c5e3e9-c9cf-4f79-8367-a0265b3627cf with 53.
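The log lines above share a regular layout, so the trailing error code can be extracted mechanically when many such lines need to be scanned. The following is a minimal Python sketch, assuming the layout seen in the four quoted lines (an optional leading line number, process and thread IDs, timestamp, level, component tag, message). The regular expressions and the function name `parse_cluster_log_line` are illustrative and are not part of any Microsoft tool.

```python
import re
from datetime import datetime

# Layout assumed from the lines quoted above:
# [line no.] <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> [<component>] <message>
LINE_RE = re.compile(
    r"^(?:\d+\s+)?"
    r"(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)
# The quoted messages end in "error 53." or "with 53."
CODE_RE = re.compile(r"(?:error|with)\s+(\d+)\.?\s*$")


def parse_cluster_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    code = CODE_RE.search(entry["message"])
    entry["code"] = int(code.group(1)) if code else None
    return entry


sample = (
    r"058952 000019e0.0000270c::2017/03/11-00:28:26.694 ERR   [RES] "
    r"File Share Witness <File Share Witness>: Failed to create or open "
    r"directory \\2ABCRV094\SQLWitness\e8c5e3e9-c9cf-4f79-8367-a0265b3627cf, error 53."
)
entry = parse_cluster_log_line(sample)
print(entry["level"], entry["component"], entry["code"])
```

Lines that do not fit the assumed layout return `None` rather than raising, so the function can be applied to a whole log file and the misses filtered out.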

Error 53

The most common symptom of a problem in NetBIOS name resolution is when the Ping utility returns an Error 53 message. The Error 53 message is generally returned when name resolution fails for a particular computer name. Error 53 can also occur when there is a problem establishing a NetBIOS session. To distinguish between these two cases, use the following procedure:

To determine the cause of an Error 53 message

  1. From the Start menu, open a command prompt.
  2. At the command prompt, type:
    net view \\<hostname>
    where <hostname> is a network resource you know is active.
    If this works, your name resolution is probably not the source of the problem. To confirm this, ping the host name, as name resolution can sometimes function properly and yet net use returns an Error 53 (such as when a DNS or WINS server has a bad entry). If Ping also shows that name resolution fails (by returning the “Unknown host” message), check the status of your NetBIOS session.

To check the status of your NetBIOS session

  1. From the Start menu, open a command prompt.
  2. At the command prompt, type:
    net view \\<IP address>
    where <IP address> is the same network resource you used in the above procedure. If this also fails, the problem is in establishing a session.

  • If the command succeeds, you can make a connection to that share. However, if you get the message “System error 53 has occurred. The network path was not found,” this points to a TCP/IP configuration problem with the network card.
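The two procedures above come down to one decision: does the name resolve, and if it does, can a session be opened to the host? The Python sketch below mirrors that decision under some assumptions. It uses a plain TCP connection to port 445 as a stand-in for the session check, which is an approximation of what `net view` does, and the function names are hypothetical.

```python
import socket

SMB_PORT = 445  # assumption: direct-hosted SMB; classic NetBIOS sessions use port 139


def resolve(host):
    """Return the first IPv4 address for host, or None if the name does not resolve."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def can_connect(address, port=SMB_PORT, timeout=3.0):
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage(host, resolver=resolve, connector=can_connect):
    """Classify a failure as a name-resolution problem or a session problem."""
    address = resolver(host)
    if address is None:
        return "name resolution failed"
    if not connector(address):
        return "name resolves, but a session cannot be established"
    return "name resolves and the host accepts connections"


# Example call (performs real network I/O, so it is left commented out):
# print(triage("2ABCRV094"))
```

The resolver and connector are passed in as parameters so the decision logic can be exercised without touching the network.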
 
 ____________________________________________________________________________________________________

 

System Information: 2ABCRV039

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        2ABCRV039

System Manufacturer        Dell Inc.

System Model        PowerEdge R730xd

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R730xd

Processor        Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2200 Mhz, 12 Core(s), 24 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2200 Mhz, 12 Core(s), 24 Logical Processor(s)

BIOS Version/Date        Dell Inc. 2.3.4, 08/11/2016

 

 

System Events:

 

  • Analysis of the logs indicates that a network problem caused the SQL Server Availability Group to go into a failed state.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 3/20/2017 | 9:25:26 AM | Error | 2ABCRV039.abc.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘2ABCSQL’ of type ‘SQL Server Availability Group’ in clustered role ‘2ABCSQL’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
| 3/20/2017 | 9:25:27 AM | Error | 2ABCRV039.abc.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘2ABCSQL’ of type ‘SQL Server Availability Group’ in clustered role ‘2ABCSQL’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
| 3/20/2017 | 9:25:27 AM | Error | 2ABCRV039.abc.uk | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered role ‘2ABCSQL’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role. |
| 3/20/2017 | 9:25:27 AM | Error | 2ABCRV039.abc.uk | 1254 | Microsoft-Windows-FailoverClustering | Clustered role ‘2ABCSQL’ has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period. |
| 3/20/2017 | 9:26:58 AM | Error | 2ABCRV039.abc.uk | 4 | Microsoft-Windows-Security-Kerberos | The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server 2ABCRV039$. The target name used was HTTP/2GSQLCLUSTER.abc.uk. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (abc.uk) is different from the client domain (abc.uk), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server. |
| 3/20/2017 | 9:27:31 AM | Error | 2ABCRV039.abc.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘2ABCSQL’ of type ‘SQL Server Availability Group’ in clustered role ‘2ABCSQL’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
| 3/20/2017 | 9:27:31 AM | Error | 2ABCRV039.abc.uk | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered role ‘2ABCSQL’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role. |

 

 

  • At this point the cluster cannot communicate on network 10.175.00.154 because the network is reported as unavailable.

 

‘.\err(vista).exe’ 5035

# for decimal 5035 / hex 0x13ab

  ERROR_NETWORK_NOT_AVAILABLE                                    winerror.h

# A cluster network is not available for this operation.
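The lookup above converts the decimal return code to hex and maps it to its symbolic name. The same conversion can be sketched in a few lines of Python. The table below holds only the two codes that appear in this report (53 and 5035) and is illustrative, not a substitute for the full winerror.h listing. The function name `describe` is hypothetical.

```python
# Codes taken from this report only; not a complete winerror.h table.
KNOWN_CODES = {
    53: ("ERROR_BAD_NETPATH", "The network path was not found."),
    5035: (
        "ERROR_NETWORK_NOT_AVAILABLE",
        "A cluster network is not available for this operation.",
    ),
}


def describe(code):
    """Format a Win32 error code as decimal, hex, symbolic name and text."""
    name, text = KNOWN_CODES.get(code, ("UNKNOWN", "not in the local table"))
    return f"decimal {code} / hex {code:#x}: {name} - {text}"


print(describe(5035))
print(describe(53))
```

On a Windows host, Python's `ctypes.FormatError(code)` returns the system message text for a code, which avoids maintaining a local table.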

 

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 3/20/2017 | 9:33:30 AM | Warning | 2ABCRV039.abc.uk | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘2ABCSQL_10.175.00.154’ IP address ‘10.175.00.154’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal. |
| 3/20/2017 | 9:33:30 AM | Error | 2ABCRV039.abc.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘2ABCSQL_10.175.00.154’ of type ‘IP Address’ in clustered role ‘2ABCSQL’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
| 3/20/2017 | 9:33:34 AM | Warning | 2ABCRV039.abc.uk | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘2ABCSQL_10.175.00.154’ IP address ‘10.175.00.154’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal. |
| 3/20/2017 | 9:41:34 AM | Warning | 2ABCRV039.abc.uk | 4 | b57nd60a | Broadcom NetXtreme Gigabit Ethernet #6: The network link is down. Check to make sure the network cable is properly connected. |

 

 

 _________________________________________________________________________________________________

 

Conclusion:

 

  • Based on the logs, the issue started with the cluster network. Node 2ABCRV038 lost communication with the external environment, which caused the File Share Witness to fail and, later, the entire node.
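Placing the events from both nodes on a single clock makes the sequence easier to follow. The Python sketch below does this for a subset of the events, transcribed by hand from the event tables in this report. The one-line summaries are paraphrases of the event descriptions, and the function name `build_timeline` is hypothetical.

```python
from datetime import datetime

# (date, time, node, event ID, summary) transcribed from the event tables in this report
EVENTS = [
    # Node 2ABCRV038
    ("3/20/2017", "12:27:07 AM", "2ABCRV038", 1562, "File Share Witness failed periodic health check"),
    ("3/20/2017", "12:27:28 AM", "2ABCRV038", 1564, "File Share Witness failed to arbitrate for the share"),
    ("3/20/2017", "9:29:16 AM", "2ABCRV038", 1069, "Network Name resource '2ABCSQL_2ABCSQL' failed"),
    # Node 2ABCRV039
    ("3/20/2017", "9:25:26 AM", "2ABCRV039", 1069, "SQL Server Availability Group resource failed"),
    ("3/20/2017", "9:25:27 AM", "2ABCRV039", 1254, "Role '2ABCSQL' exceeded its failover threshold"),
    ("3/20/2017", "9:33:30 AM", "2ABCRV039", 1045, "No matching network interface for 10.175.00.154 (code 5035)"),
    ("3/20/2017", "9:41:34 AM", "2ABCRV039", 4, "Network link down (b57nd60a)"),
]


def build_timeline(events):
    """Return the events sorted by their combined date and 12-hour time."""
    def when(event):
        return datetime.strptime(f"{event[0]} {event[1]}", "%m/%d/%Y %I:%M:%S %p")
    return sorted(events, key=when)


for date, clock, node, event_id, summary in build_timeline(EVENTS):
    print(f"{date} {clock:>11}  {node}  {event_id:>4}  {summary}")
```

Sorting on the parsed timestamp rather than the raw string matters here, because "12:27:07 AM" would otherwise sort after "9:25:26 AM".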

 

Plan:

 

 

 

  • Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical when dealing with network connectivity issues.

Investigation of Network Issues:

To avoid this issue in the future, the most important step is to diagnose the recurring connectivity problem on the cluster networks with the in-house networking team:

  • Check the network adapters, cables, and network configuration for the networks that connect the nodes.
  • Check the hubs, switches, or bridges in the networks that connect the nodes.
  • Check for switch delays and proxy ARP.
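As a first pass before involving the networking team, timeouts and latency between a node and a peer can be sampled from the node itself. The Python sketch below is a rough approach under stated assumptions: it times repeated TCP connections, which is only a proxy for packet loss and does not replace switch-level diagnostics. The host, port, probe count, and function names are all hypothetical.

```python
import socket
import statistics
import time


def probe_once(host, port, timeout=1.0):
    """Return the TCP connect time in milliseconds, or None on timeout or failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def summarize(samples):
    """Summarize a list of probe results, where None marks a lost probe."""
    ok = [s for s in samples if s is not None]
    lost = len(samples) - len(ok)
    return {
        "sent": len(samples),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


def probe(host, port, count=20, interval=0.5, probe_fn=probe_once):
    """Run count probes, interval seconds apart, and return the summary."""
    samples = []
    for _ in range(count):
        samples.append(probe_fn(host, port))
        time.sleep(interval)
    return summarize(samples)


# Example call (performs real network I/O, so it is left commented out):
# print(probe("2ABCRV094", 445))
```

A non-zero loss percentage or a maximum far above the median over a long run would be a reason to hand the timestamps to the networking team for correlation with switch logs.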

Ashutosh Dixit

