Issue Description:
Need to know why the SQL Instances Failed over from one node to another on Cluster Name: gl-dac1-sqlent01 running a copy of Microsoft Windows Server 2016 Standard Version 10.0.14393 Build 14393.
___________________________________________________________________________
System Information: ABCD01-V-MSQL07
System Events:
- Reviewed the logs and found that the Issue started after we started getting events regarding the Connection between the ISCSI Target lost from the initiator.
Time Generated | Entry Type | Machine Name | Event ID | Source | Message |
30/10/2017 20:11:02 | Information | ABCD01-V-MSQL07.abc.com | 34 | iScsiPrt | A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name. |
30/10/2017 20:11:02 | Information | ABCD01-V-MSQL07.abc.com | 32 | iScsiPrt | Initiator received an asynchronous logout message. The Target name is given in the dump data. |
- While going through the logs we can see that the Disk 15 is getting failed with ReleaseDiskPRFailure. Which is probably due to the Persistent reservation issue.
30/10/2017 21:33:42 | Error | ABCD01-V-MSQL07.abc.com | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster. |
30/10/2017 21:33:43 | Error | ABCD01-V-MSQL07.abc.com | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource terminate encountered an error. Physical Disk resource name: ABCD01-sqlinstr-mountpoints Device Number: 15 Device Guid: {79c6f6e1-e0cb-f8ee-9d8e-6dd2a07513d0} Error Code: 1726 Additional reason: ReleaseDiskPRFailure |
- Cluster service eventually failed after waiting for the disk.
30/10/2017 21:33:43 | Error | ABCD01-V-MSQL07.abc.com | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster. |
30/10/2017 21:33:43 | 0 | ABCD01-V-MSQL07.abc.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
30/10/2017 21:33:44 | Error | ABCD01-V-MSQL07.abc.com | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource terminate encountered an error. Physical Disk resource name: ABCD01-sqlinstt-bak1 Device Number: 28 Device Guid: {020f75e5-1107-bfef-457c-9bb97594bf77} Error Code: 1168 Additional reason: ReleaseDiskPRFailure |
30/10/2017 21:33:45 | Error | ABCD01-V-MSQL07.abc.com | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource terminate encountered an error. Physical Disk resource name: ABCD01-sqlinstt-data1 Device Number: 29 Device Guid: {1cc28a38-b15f-8baf-875b-02b013aed710} Error Code: 1168 Additional reason: ReleaseDiskPRFailure |
30/10/2017 21:33:45 | Error | ABCD01-V-MSQL07.abc.com | 7034 | Service Control Manager | The SQL Server (INSTANCER) service terminated unexpectedly. It has done this 2 time(s). |
Application Events:
- SQL Application logs are not giving much information.
TimeGenerated | EntryType | MachineName | EventID | Source | Message |
30/10/2017 21:36:51 | Information | ABCD01-V-MSQL07.abc.com | 3406 | MSSQL$INSTANCEW | 539 transactions rolled forward in database ‘RLS_CBCOTTONWD’ (196:0). This is an informational message only. No user action is required. |
30/10/2017 21:36:51 | FailureAudit | ABCD01-V-MSQL07.abc.com | 18456 | MSSQL$INSTANCEW | Login failed for user ‘usr_abclking’. Reason: Failed to open the explicitly specified database ‘RLS_abclking’. [CLIENT: 192.168.120.55] |
30/10/2017 21:36:51 | FailureAudit | ABCD01-V-MSQL07.abc.com | 18456 | MSSQL$INSTANCES | Login failed for user ‘usr_ahbarn’. Reason: Failed to open the explicitly specified database ‘RLS_AHBARN’. [CLIENT: 192.168.120.57] |
Failover Clustering:
TimeGenerated | EntryType | MachineName | EventID | Source | Message |
30/10/2017 21:34:53 | ABCD01-V-MSQL07.abc.com | 1641 | microsoft-windows-failoverclustering/operational | Information | Clustered role ‘Cluster Group’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD02-V-MSQL06’. |
30/10/2017 21:34:59 | ABCD01-V-MSQL07.abc.com | 1641 | microsoft-windows-failoverclustering/operational | Information | Clustered role ‘ABCD02-sql (INSTANCEG)’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD02-V-MSQL06’. |
30/10/2017 21:34:59 | ABCD01-V-MSQL07.abc.com | 1641 | microsoft-windows-failoverclustering/operational | Information | Clustered role ‘ABCD-sql-AGp04’ is moving from cluster node ‘ABCD01-V-MSQL12’ to cluster node ‘ABCD01-V-MSQL10’. |
30/10/2017 21:34:59 | ABCD01-V-MSQL07.abc.com | 1637 | microsoft-windows-failoverclustering/operational | Information | Cluster resource ‘SQL Server (INSTANCEU)’ in clustered role ‘ABCD01-sql (INSTANCEU)’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘SQL Server (INSTANCEU)’ is waiting on the following resources: SQL Network Name (ABCD01-sql-u) SQL IP Address 2 (ABCD01-sql-u) ABCD01-sqlinstu-data1 ABCD01-sqlinstu-bak1 ABCD01-sqlinstu-mountpoints ABCD01-sqlinstu-log1 ABCD01-sqlinstu-temp1. |
30/10/2017 21:34:59 | ABCD01-V-MSQL07.abc.com | 1637 | microsoft-windows-failoverclustering/operational | Information | Cluster resource ‘ABCD01-sqlinstr-mountpoints’ in clustered role ‘ABCD01-sql (INSTANCER)’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘ABCD01-sqlinstr-mountpoints’ is waiting on the following resources: . |
__________________________________________________________________________
System Information: ABCD01-V-MSQL08
System Events:
- Checked and found that we are getting the same set of Events for different disk.
Time Generated | Entry Type | Machine Name | Event ID | Source | Message |
30/10/2017 21:31:10 | Information | ABCD01-V-MSQL08.abc.com | 32 | iScsiPrt | Initiator received an asynchronous logout message. The Target name is given in the dump data. |
30/10/2017 21:31:10 | Information | ABCD01-V-MSQL08.abc.com | 34 | iScsiPrt | A connection to the target was lost, but Initiator successfully reconnected to the target. Dump data contains the target name. |
Time Generated | Entry Type | Machine Name | Event ID | Source | Message |
30/10/2017 21:33:42 | Error | ABCD01-V-MSQL08.abc.com | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster. |
30/10/2017 21:33:42 | Error | ABCD01-V-MSQL08.abc.com | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster. |
30/10/2017 21:33:43 | Error | ABCD01-V-MSQL08.abc.com | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource terminate encountered an error. Physical Disk resource name: ABCD01-sqlinsts-mountpoints Device Number: 17 Device Guid: {ec6d6b9a-d2a3-7301-c4b3-052d6f3faf3d} Error Code: 1168 Additional reason: ReleaseDiskPRFailure |
Time Generated | Entry Type | Machine Name | Event ID | Source | Message |
30/10/2017 21:42:55 | Error | ABCD01-V-MSQL08.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘ABCD-sql-AGp02_192.168.0.79’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp02’ failed due to an attempt to block a required state change in that cluster resource. |
30/10/2017 21:42:55 | Error | ABCD01-V-MSQL08.abc.com | 10016 | DCOM | The description for Event ID ‘10016’ in Source ‘DCOM’ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event:’application-specific’, ‘Local’, ‘Activation’, ‘{D63B10C5-BB46-4990-A94F-E40B9D520160}’, ‘{9CA88EE3-ACB7-47C8-AFC4-AB702511C276}’, ‘DAC1’, ‘servABCD01SQLS-Agent’, ‘S-1-5-21-711275897-3083864546-3660386873-3110’, ‘LocalHost (Using LRPC)’, ‘Unavailable’, ‘Unavailable’ |
Time Generated | Entry Type | Machine Name | Event ID | Source | Message |
30/10/2017 23:54:04 | Information | ABCD01-V-MSQL08.abc.com | 2 | mpio | Added device to \Device\MPIODisk1. DumpData contains the current number of paths. |
30/10/2017 23:54:08 | Information | ABCD01-V-MSQL08.abc.com | 2 | mpio | Added device to \Device\MPIODisk30. DumpData contains the current number of paths. |
30/10/2017 23:54:11 | Information | ABCD01-V-MSQL08.abc.com | 46 | mpio | Path 77020009 was removed from \Device\MPIODisk26 due to a PnP event. The dump data contains the current number of paths. |
30/10/2017 23:54:12 | Information | ABCD01-V-MSQL08.abc.com | 2 | mpio | Added device to \Device\MPIODisk26. DumpData contains the current number of paths. |
________________________________________________________________________
System Information: ABCD01-V-MSQL10
System Events:
- Checked and found that we are getting the same set of events.
30/10/2017 21:33:42 | Error | ABCD01-V-MSQL10.abc.com | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of ‘150’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster. |
30/10/2017 21:33:43 | 0 | ABCD01-V-MSQL10.abc.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
30/10/2017 21:34:59 | Error | ABCD01-V-MSQL10.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource. |
30/10/2017 21:34:59 | Warning | ABCD01-V-MSQL10.abc.com | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal. |
30/10/2017 21:35:09 | Warning | ABCD01-V-MSQL10.abc.com | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal. |
30/10/2017 21:35:09 | Error | ABCD01-V-MSQL10.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource. |
30/10/2017 21:35:09 | Error | ABCD01-V-MSQL10.abc.com | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered role ‘ABCD-sql-AGp04’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role. |
30/10/2017 21:42:52 | Error | ABCD01-V-MSQL10.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘ABCD-sql-AGp04_192.168.0.83’ of type ‘IP Address’ in clustered role ‘ABCD-sql-AGp04’ failed due to an attempt to block a required state change in that cluster resource. |
30/10/2017 21:42:52 | Warning | ABCD01-V-MSQL10.abc.com | 1045 | Microsoft-Windows-FailoverClustering | No matching network interface found for resource ‘ABCD-sql-AGp04_192.168.0.83’ IP address ‘192.168.0.83’ (return code was ‘5035’). If your cluster nodes span different subnets, this may be normal. |
Application Events:
30/10/2017 21:34:04 | Warning | ABCD01-V-MSQL10.abc.com | 4350 | MSDTC Client 2 | The description for Event ID ‘1073746174’ in Source ‘MSDTC Client 2′ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event:’0x800706D9’, ‘OpenClusterEx’, ‘ ‘ |
30/10/2017 21:34:23 | Error | ABCD01-V-MSQL10.abc.com | 24583 | SQLWRITER | Sqllib error: OLEDB Error encountered calling ICommandText::Execute. hr = 0x80004005.
SQLSTATE: 08S01, Native Error: 121 Source: Microsoft SQL Server Native Client 11.0 Error message: Communication link failure SQLSTATE: 08S01, Native Error: 121 Error state: 1, Severity: 16 Source: Microsoft SQL Server Native Client 11.0 Error message: TCP Provider: The semaphore timeout period has expired. |
30/10/2017 21:34:25 | Information | ABCD01-V-MSQL10.abc.com | 0 | EHCMService | Bad service request |
30/10/2017 21:34:31 | Information | ABCD01-V-MSQL10.abc.com | 9112 | DFSR | The description for Event ID ‘1073750936’ in Source ‘DFSR’ cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event: |
30/10/2017 21:42:42 | Error | ABCD01-V-MSQL10.abc.com | 4025 | EqualLogic | iSCSI login error 0x490 when connecting to vss-control volume for group 192.168.125.20. |
_________________________________________________________________________
Conclusion:
- As per the Events we can conclude that the issue is happening due to the Issues in the Persistent Reservation of the Physical Disk due to which the cluster service is timing out and during the same time there is a Backup operation Is also running which could be the trigger to this Event.
- Unpresent and Represent the LUN on the Cluster.
- Update the Storage Components like SAN Firmware, Net Adaptor Firmware.
- Install all latest windows updates on the cluster as there are a lot of windows updates which have resolved a lot of known issues regarding the Storage, Server connections.
- Since this issue is happening every day, please try to disable the Backup Schedule for a day and check the statue
- Disable Antivirus Application on the Cluster and its scan might be overlapping with the Backup Application.
- Incase if you are using Equalogic MPIO, please update that as well.
This is an issue which is clearly coming from the Storage End, however we are still not sure if this is being triggered by the Backup Application or not. This is best to narrow down the issue using the Divide and conquer.