RCA – 24 – RCA for SQL Instance Failed Over

Issue Description:

 

Need to know why the SQL Instances Failed over from one node to another on Cluster Name: gl-dac1-sqlent01 running a copy of Microsoft Windows Server 2016 Standard Version 10.0.14393 Build 14393.

 

___________________________________________________________________________

 

System Information: ABCD01-V-MSQL07

 

System Events:

 

  • Reviewed the logs and found that the Issue started after we started getting events regarding the Connection between the ISCSI Target lost from the initiator.

 

Time Generated

Entry Type

Machine Name

Event ID

Source

Message

30/10/2017 20:11:02

Information

ABCD01-V-MSQL07.abc.com

34

iScsiPrt

A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name.

30/10/2017 20:11:02

Information

ABCD01-V-MSQL07.abc.com

32

iScsiPrt

Initiator received an asynchronous logout message. The Target name is given in the dump data.

 

  • While going through the logs we can see that the Disk 15 is getting failed with ReleaseDiskPRFailure. Which is probably due to the Persistent reservation issue.

 

30/10/2017 21:33:42

Error

ABCD01-V-MSQL07.abc.com

5377

Microsoft-Windows-FailoverClustering

An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

30/10/2017 21:33:43

Error

ABCD01-V-MSQL07.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinstr-mountpoints

Device Number: 15

Device Guid: {79c6f6e1-e0cb-f8ee-9d8e-6dd2a07513d0}

Error Code: 1726

Additional reason: ReleaseDiskPRFailure

 

  • Cluster service eventually failed after waiting for the disk.

 

30/10/2017 21:33:43

Error

ABCD01-V-MSQL07.abc.com

5377

Microsoft-Windows-FailoverClustering

An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

30/10/2017 21:33:43

0

ABCD01-V-MSQL07.abc.com

1146

Microsoft-Windows-FailoverClustering

The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

 

 

30/10/2017 21:33:44

Error

ABCD01-V-MSQL07.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinstt-bak1

Device Number: 28

Device Guid: {020f75e5-1107-bfef-457c-9bb97594bf77}

Error Code: 1168

Additional reason: ReleaseDiskPRFailure

30/10/2017 21:33:45

Error

ABCD01-V-MSQL07.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinstt-data1

Device Number: 29

Device Guid: {1cc28a38-b15f-8baf-875b-02b013aed710}

Error Code: 1168

Additional reason: ReleaseDiskPRFailure

30/10/2017 21:33:45

Error

ABCD01-V-MSQL07.abc.com

7034

Service Control Manager

The SQL Server (INSTANCER) service terminated unexpectedly.  It has done this 2 time(s).

 

 

 

Application Events:

 

  • SQL Application logs are not giving much information.

 

TimeGenerated

EntryType

MachineName

EventID

Source

Message

30/10/2017 21:36:51

Information

ABCD01-V-MSQL07.abc.com

3406

MSSQL$INSTANCEW

539 transactions rolled forward in database ‘RLS_CBCOTTONWD’ (196:0). This is an informational message only. No user action is required.

30/10/2017 21:36:51

FailureAudit

ABCD01-V-MSQL07.abc.com

18456

MSSQL$INSTANCEW

Login failed for user ‘usr_abclking’. Reason: Failed to open the explicitly specified database ‘RLS_abclking’. [CLIENT: 192.168.120.55]

30/10/2017 21:36:51

FailureAudit

ABCD01-V-MSQL07.abc.com

18456

MSSQL$INSTANCES

Login failed for user ‘usr_ahbarn’. Reason: Failed to open the explicitly specified database ‘RLS_AHBARN’. [CLIENT: 192.168.120.57]

 

 

Failover Clustering:

 

TimeGenerated

EntryType

MachineName

EventID

Source

Message

30/10/2017 21:34:53

ABCD01-V-MSQL07.abc.com

1641

microsoft-windows-failoverclustering/operational

Information

Clustered role ‘Cluster Group’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD02-V-MSQL06’.

30/10/2017 21:34:59

ABCD01-V-MSQL07.abc.com

1641

microsoft-windows-failoverclustering/operational

Information

Clustered role ‘ABCD02-sql (INSTANCEG)’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD02-V-MSQL06’.

30/10/2017 21:34:59

ABCD01-V-MSQL07.abc.com

1641

microsoft-windows-failoverclustering/operational

Information

Clustered role ‘ABCD-sql-AGp04’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD01-V-MSQL10’.

30/10/2017 21:34:59

ABCD01-V-MSQL07.abc.com

1637

microsoft-windows-failoverclustering/operational

Information

Cluster resource ‘SQL Server (INSTANCEU)’ in clustered role ‘ABCD01-sql (INSTANCEU)’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘SQL Server (INSTANCEU)’ is waiting on the following resources: SQL Network Name (ABCD01-sql-u) SQL IP Address 2 (ABCD01-sql-u) ABCD01-sqlinstu-data1 ABCD01-sqlinstu-bak1 ABCD01-sqlinstu-mountpoints ABCD01-sqlinstu-log1 ABCD01-sqlinstu-temp1.

30/10/2017 21:34:59

ABCD01-V-MSQL07.abc.com

1637

microsoft-windows-failoverclustering/operational

Information

Cluster resource ‘ABCD01-sqlinstr-mountpoints’ in clustered role ‘ABCD01-sql (INSTANCER)’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘ABCD01-sqlinstr-mountpoints’ is waiting on the following resources: .

 

 

 

 

__________________________________________________________________________

System Information: ABCD01-V-MSQL08

 

System Events:

 

  • Checked and found that we are getting the same set of Events for different disk.

 

Time Generated

Entry Type

Machine Name

Event ID

Source

Message

30/10/2017 21:31:10

Information

ABCD01-V-MSQL08.abc.com

32

iScsiPrt

Initiator received an asynchronous logout message. The Target name is given in the dump data.

30/10/2017 21:31:10

Information

ABCD01-V-MSQL08.abc.com

34

iScsiPrt

A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name.

 

Time Generated

Entry Type

Machine Name

Event ID

Source

Message

30/10/2017 21:33:42

Error

ABCD01-V-MSQL08.abc.com

5377

Microsoft-Windows-FailoverClustering

An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

30/10/2017 21:33:42

Error

ABCD01-V-MSQL08.abc.com

5377

Microsoft-Windows-FailoverClustering

An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

30/10/2017 21:33:43

Error

ABCD01-V-MSQL08.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinsts-mountpoints

Device Number: 17

Device Guid: {ec6d6b9a-d2a3-7301-c4b3-052d6f3faf3d}

Error Code: 1168

Additional reason: ReleaseDiskPRFailure

 

Time Generated

Entry Type

Machine Name

Event ID

Source

Message

30/10/2017 21:42:55

Error

ABCD01-V-MSQL08.abc.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘ABCD-sql-AGp02_192.168.0.79’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp02’ failed due to an attempt to block a required state change in that cluster resource.

30/10/2017 21:42:55

Error

ABCD01-V-MSQL08.abc.com

10016

DCOM

The description for Event ID ‘10016’ in Source ‘DCOM’ cannot be found.  The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them.  The following information is part of the event:’application-specific’, ‘Local’, ‘Activation’, ‘{D63B10C5-BB46-4990-A94F-E40B9D520160}’, ‘{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}’, ‘DAC1’, ‘servABCD01SQLS-Agent’, ‘S-1-5-21-711275897-3083864546-3660386873-3110’, ‘LocalHost (Using LRPC)’, ‘Unavailable’, ‘Unavailable’

 

Time Generated

Entry Type

Machine Name

Event ID

Source

Message

30/10/2017 23:54:04

Information

ABCD01-V-MSQL08.abc.com

2

mpio

Added device to \Device\MPIODisk1. DumpData contains the current number of paths.

30/10/2017 23:54:08

Information

ABCD01-V-MSQL08.abc.com

2

mpio

Added device to \Device\MPIODisk30. DumpData contains the current number of paths.

30/10/2017 23:54:11

Information

ABCD01-V-MSQL08.abc.com

46

mpio

Path 77020009 was removed from \Device\MPIODisk26 due to a PnP event. The dump data contains the current number of paths.

30/10/2017 23:54:12

Information

ABCD01-V-MSQL08.abc.com

2

mpio

Added device to \Device\MPIODisk26. DumpData contains the current number of paths.

 

 

 

________________________________________________________________________

 

System Information: ABCD01-V-MSQL10

 

System Events:

 

  • Checked and found that we are getting the same set of events.

 

30/10/2017 21:33:42

Error

ABCD01-V-MSQL10.abc.com

5377

Microsoft-Windows-FailoverClustering

An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

30/10/2017 21:33:43

0

ABCD01-V-MSQL10.abc.com

1146

Microsoft-Windows-FailoverClustering

The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

 

30/10/2017 21:34:59

Error

ABCD01-V-MSQL10.abc.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource.

30/10/2017 21:34:59

Warning

ABCD01-V-MSQL10.abc.com

1045

Microsoft-Windows-FailoverClustering

No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’).  If your cluster nodes span different subnets, this may be normal.

30/10/2017 21:35:09

Warning

ABCD01-V-MSQL10.abc.com

1045

Microsoft-Windows-FailoverClustering

No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’).  If your cluster nodes span different subnets, this may be normal.

30/10/2017 21:35:09

Error

ABCD01-V-MSQL10.abc.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource.

30/10/2017 21:35:09

Error

ABCD01-V-MSQL10.abc.com

1205

Microsoft-Windows-FailoverClustering

The Cluster service failed to bring clustered role ‘ABCD-sql-AGp04’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

 

 

30/10/2017 21:42:52

Error

ABCD01-V-MSQL10.abc.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource.

30/10/2017 21:42:52

Warning

ABCD01-V-MSQL10.abc.com

1045

Microsoft-Windows-FailoverClustering

No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’).  If your cluster nodes span different subnets, this may be normal.

 

 

Application Events:

 

 

 

30/10/2017 21:34:04

Warning

ABCD01-V-MSQL10.abc.com

4350

MSDTC Client 2

The description for Event ID ‘1073746174’ in Source ‘MSDTC Client 2′ cannot be found.  The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them.  The following information is part of the event:’0x800706D9’, ‘OpenClusterEx’, ‘ ‘

30/10/2017 21:34:23

Error

ABCD01-V-MSQL10.abc.com

24583

SQLWRITER

Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80004005.

 

SQLSTATE: 08S01, Native Error: 121

Source: Microsoft SQL Server Native Client 11.0

Error message: Communication link failure

SQLSTATE: 08S01, Native Error: 121

Error state: 1, Severity: 16

Source: Microsoft SQL Server Native Client 11.0

Error message: TCP Provider: The semaphore timeout period has expired.

 

 

30/10/2017 21:34:25

Information

ABCD01-V-MSQL10.abc.com

0

EHCMService

Bad service request

30/10/2017 21:34:31

Information

ABCD01-V-MSQL10.abc.com

9112

DFSR

The description for Event ID ‘1073750936’ in Source ‘DFSR’ cannot be found.  The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them.  The following information is part of the event:

 

 

30/10/2017 21:42:42

Error

ABCD01-V-MSQL10.abc.com

4025

EqualLogic

iSCSI login error 0x490 when connecting to vss-control volume for group 192.168.125.20.

 

 

 

_________________________________________________________________________

 

Conclusion:

 

  • As per the Events we can conclude that the issue is happening due to the Issues in the Persistent Reservation of the Physical Disk due to which the cluster service is timing out and during the same time there is a Backup operation Is also running which could be the trigger to this Event.
  • Unpresent and Represent the LUN on the Cluster.
  • Update the Storage Components like SAN Firmware, Net Adaptor Firmware.
  • Install all latest windows updates on the cluster as there are a lot of windows updates which have resolved a lot of known issues regarding the Storage, Server connections.
  • Since this issue is happening every day, please try to disable the Backup Schedule for a day and check the statue
  • Disable Antivirus Application on the Cluster and its scan might be overlapping with the Backup Application.
  •  Incase if you are using Equalogic MPIO, please update that as well.

 

 

This is an issue which is clearly coming from the Storage End, however we are still not sure if this is being triggered by the Backup Application or not. This is best to narrow down the issue using the Divide and conquer.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply