RCA - 24 - RCA for SQL Instance Failed Over

Issue Description:

Need to know why the SQL Instances Failed over from one node to another on Cluster Name: gl-dac1-sqlent01 running a copy of Microsoft Windows Server 2016 Standard Version 10.0.14393 Build 14393.

___________________________________________________________________________

System Information: ABCD01-V-MSQL07

System Events:

Reviewed the logs and found that the Issue started after we started getting events regarding the Connection between the ISCSI Target lost from the initiator.

Time Generated	Entry Type	Machine Name	Event ID	Source	Message
30/10/2017 20:11:02	Information	ABCD01-V-MSQL07.abc.com	34	iScsiPrt	A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name.
30/10/2017 20:11:02	Information	ABCD01-V-MSQL07.abc.com	32	iScsiPrt	Initiator received an asynchronous logout message. The Target name is given in the dump data.

While going through the logs we can see that the Disk 15 is getting failed with ReleaseDiskPRFailure. Which is probably due to the Persistent reservation issue.

30/10/2017 21:33:42

Error

ABCD01-V-MSQL07.abc.com

5377

Microsoft-Windows-FailoverClustering

An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

30/10/2017 21:33:43

Error

ABCD01-V-MSQL07.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinstr-mountpoints

Device Number: 15

Device Guid: {79c6f6e1-e0cb-f8ee-9d8e-6dd2a07513d0}

Error Code: 1726

Additional reason: ReleaseDiskPRFailure

Cluster service eventually failed after waiting for the disk.

30/10/2017 21:33:43	Error	ABCD01-V-MSQL07.abc.com	5377	Microsoft-Windows-FailoverClustering	An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
30/10/2017 21:33:43	0	ABCD01-V-MSQL07.abc.com	1146	Microsoft-Windows-FailoverClustering	The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

30/10/2017 21:33:44

Error

ABCD01-V-MSQL07.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinstt-bak1

Device Number: 28

Device Guid: {020f75e5-1107-bfef-457c-9bb97594bf77}

Error Code: 1168

Additional reason: ReleaseDiskPRFailure

30/10/2017 21:33:45

Error

ABCD01-V-MSQL07.abc.com

1795

Microsoft-Windows-FailoverClustering

Cluster physical disk resource terminate encountered an error.

Physical Disk resource name: ABCD01-sqlinstt-data1

Device Number: 29

Device Guid: {1cc28a38-b15f-8baf-875b-02b013aed710}

Error Code: 1168

Additional reason: ReleaseDiskPRFailure

30/10/2017 21:33:45

Error

ABCD01-V-MSQL07.abc.com

7034

Service Control Manager

The SQL Server (INSTANCER) service terminated unexpectedly. It has done this 2 time(s).

Application Events:

SQL Application logs are not giving much information.

TimeGenerated	EntryType	MachineName	EventID	Source	Message
30/10/2017 21:36:51	Information	ABCD01-V-MSQL07.abc.com	3406	MSSQL$INSTANCEW	539 transactions rolled forward in database ‘RLS_CBCOTTONWD’ (196:0). This is an informational message only. No user action is required.
30/10/2017 21:36:51	FailureAudit	ABCD01-V-MSQL07.abc.com	18456	MSSQL$INSTANCEW	Login failed for user ‘usr_abclking’. Reason: Failed to open the explicitly specified database ‘RLS_abclking’. [CLIENT: 192.168.120.55]
30/10/2017 21:36:51	FailureAudit	ABCD01-V-MSQL07.abc.com	18456	MSSQL$INSTANCES	Login failed for user ‘usr_ahbarn’. Reason: Failed to open the explicitly specified database ‘RLS_AHBARN’. [CLIENT: 192.168.120.57]

Failover Clustering:

TimeGenerated	EntryType	MachineName	EventID	Source	Message
30/10/2017 21:34:53	ABCD01-V-MSQL07.abc.com	1641	microsoft-windows-failoverclustering/operational	Information	Clustered role ‘Cluster Group’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD02-V-MSQL06’.
30/10/2017 21:34:59	ABCD01-V-MSQL07.abc.com	1641	microsoft-windows-failoverclustering/operational	Information	Clustered role ‘ABCD02-sql (INSTANCEG)’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD02-V-MSQL06’.
30/10/2017 21:34:59	ABCD01-V-MSQL07.abc.com	1641	microsoft-windows-failoverclustering/operational	Information	Clustered role ‘ABCD-sql-AGp04’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD01-V-MSQL10’.
30/10/2017 21:34:59	ABCD01-V-MSQL07.abc.com	1637	microsoft-windows-failoverclustering/operational	Information	Cluster resource ‘SQL Server (INSTANCEU)’ in clustered role ‘ABCD01-sql (INSTANCEU)’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘SQL Server (INSTANCEU)’ is waiting on the following resources: SQL Network Name (ABCD01-sql-u) SQL IP Address 2 (ABCD01-sql-u) ABCD01-sqlinstu-data1 ABCD01-sqlinstu-bak1 ABCD01-sqlinstu-mountpoints ABCD01-sqlinstu-log1 ABCD01-sqlinstu-temp1.
30/10/2017 21:34:59	ABCD01-V-MSQL07.abc.com	1637	microsoft-windows-failoverclustering/operational	Information	Cluster resource ‘ABCD01-sqlinstr-mountpoints’ in clustered role ‘ABCD01-sql (INSTANCER)’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘ABCD01-sqlinstr-mountpoints’ is waiting on the following resources: .

__________________________________________________________________________

System Information: ABCD01-V-MSQL08

System Events:

Checked and found that we are getting the same set of Events for different disk.

Time Generated	Entry Type	Machine Name	Event ID	Source	Message
30/10/2017 21:31:10	Information	ABCD01-V-MSQL08.abc.com	32	iScsiPrt	Initiator received an asynchronous logout message. The Target name is given in the dump data.
30/10/2017 21:31:10	Information	ABCD01-V-MSQL08.abc.com	34	iScsiPrt	A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name.

Time Generated	Entry Type	Machine Name	Event ID	Source	Message
30/10/2017 21:33:42	Error	ABCD01-V-MSQL08.abc.com	5377	Microsoft-Windows-FailoverClustering	An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
30/10/2017 21:33:42	Error	ABCD01-V-MSQL08.abc.com	5377	Microsoft-Windows-FailoverClustering	An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
30/10/2017 21:33:43	Error	ABCD01-V-MSQL08.abc.com	1795	Microsoft-Windows-FailoverClustering	Cluster physical disk resource terminate encountered an error. Physical Disk resource name: ABCD01-sqlinsts-mountpoints Device Number: 17 Device Guid: {ec6d6b9a-d2a3-7301-c4b3-052d6f3faf3d} Error Code: 1168 Additional reason: ReleaseDiskPRFailure

Time Generated	Entry Type	Machine Name	Event ID	Source	Message
30/10/2017 21:42:55	Error	ABCD01-V-MSQL08.abc.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘ABCD-sql-AGp02_192.168.0.79’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp02’ failed due to an attempt to block a required state change in that cluster resource.
30/10/2017 21:42:55	Error	ABCD01-V-MSQL08.abc.com	10016	DCOM	The description for Event ID ‘10016’ in Source ‘DCOM’ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event:’application-specific’, ‘Local’, ‘Activation’, ‘{D63B10C5-BB46-4990-A94F-E40B9D520160}’, ‘{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}’, ‘DAC1’, ‘servABCD01SQLS-Agent’, ‘S-1-5-21-711275897-3083864546-3660386873-3110’, ‘LocalHost (Using LRPC)’, ‘Unavailable’, ‘Unavailable’

Time Generated	Entry Type	Machine Name	Event ID	Source	Message
30/10/2017 23:54:04	Information	ABCD01-V-MSQL08.abc.com	2	mpio	Added device to \Device\MPIODisk1. DumpData contains the current number of paths.
30/10/2017 23:54:08	Information	ABCD01-V-MSQL08.abc.com	2	mpio	Added device to \Device\MPIODisk30. DumpData contains the current number of paths.
30/10/2017 23:54:11	Information	ABCD01-V-MSQL08.abc.com	46	mpio	Path 77020009 was removed from \Device\MPIODisk26 due to a PnP event. The dump data contains the current number of paths.
30/10/2017 23:54:12	Information	ABCD01-V-MSQL08.abc.com	2	mpio	Added device to \Device\MPIODisk26. DumpData contains the current number of paths.

________________________________________________________________________

System Information: ABCD01-V-MSQL10

System Events:

Checked and found that we are getting the same set of events.

30/10/2017 21:33:42	Error	ABCD01-V-MSQL10.abc.com	5377	Microsoft-Windows-FailoverClustering	An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
30/10/2017 21:33:43	0	ABCD01-V-MSQL10.abc.com	1146	Microsoft-Windows-FailoverClustering	The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

30/10/2017 21:34:59	Error	ABCD01-V-MSQL10.abc.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource.
30/10/2017 21:34:59	Warning	ABCD01-V-MSQL10.abc.com	1045	Microsoft-Windows-FailoverClustering	No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal.
30/10/2017 21:35:09	Warning	ABCD01-V-MSQL10.abc.com	1045	Microsoft-Windows-FailoverClustering	No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal.
30/10/2017 21:35:09	Error	ABCD01-V-MSQL10.abc.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource.
30/10/2017 21:35:09	Error	ABCD01-V-MSQL10.abc.com	1205	Microsoft-Windows-FailoverClustering	The Cluster service failed to bring clustered role ‘ABCD-sql-AGp04’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

30/10/2017 21:42:52	Error	ABCD01-V-MSQL10.abc.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource.
30/10/2017 21:42:52	Warning	ABCD01-V-MSQL10.abc.com	1045	Microsoft-Windows-FailoverClustering	No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal.

Application Events:

30/10/2017 21:34:04	Warning	ABCD01-V-MSQL10.abc.com	4350	MSDTC Client 2	The description for Event ID ‘1073746174’ in Source ‘MSDTC Client 2′ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event:’0x800706D9’, ‘OpenClusterEx’, ‘ ‘
30/10/2017 21:34:23	Error	ABCD01-V-MSQL10.abc.com	24583	SQLWRITER	Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80004005. SQLSTATE: 08S01, Native Error: 121 Source: Microsoft SQL Server Native Client 11.0 Error message: Communication link failure SQLSTATE: 08S01, Native Error: 121 Error state: 1, Severity: 16 Source: Microsoft SQL Server Native Client 11.0 Error message: TCP Provider: The semaphore timeout period has expired.
30/10/2017 21:34:25	Information	ABCD01-V-MSQL10.abc.com	0	EHCMService	Bad service request
30/10/2017 21:34:31	Information	ABCD01-V-MSQL10.abc.com	9112	DFSR	The description for Event ID ‘1073750936’ in Source ‘DFSR’ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event:

30/10/2017 21:42:42

Error

ABCD01-V-MSQL10.abc.com

4025

EqualLogic

iSCSI login error 0x490 when connecting to vss-control volume for group 192.168.125.20.

_________________________________________________________________________

Conclusion:

As per the Events we can conclude that the issue is happening due to the Issues in the Persistent Reservation of the Physical Disk due to which the cluster service is timing out and during the same time there is a Backup operation Is also running which could be the trigger to this Event.
Unpresent and Represent the LUN on the Cluster.
Update the Storage Components like SAN Firmware, Net Adaptor Firmware.
Install all latest windows updates on the cluster as there are a lot of windows updates which have resolved a lot of known issues regarding the Storage, Server connections.
Since this issue is happening every day, please try to disable the Backup Schedule for a day and check the statue
Disable Antivirus Application on the Cluster and its scan might be overlapping with the Backup Application.
Incase if you are using Equalogic MPIO, please update that as well.

This is an issue which is clearly coming from the Storage End, however we are still not sure if this is being triggered by the Backup Application or not. This is best to narrow down the issue using the Divide and conquer.