RCA – 11 – SQL Resource didn’t come online post Unexpected Failover

Issue Description:

 

At 2:03pm EST we had our SQL cluster for our phone system fail.  The second node was up but didn’t assume control.  We shut the secondary node down and all services resumed on the primary node.  We restarted this server to bring it back clean and are holding for the results of this ticket before troubleshooting further.  It looks like the virtual disk service failed. 

 

Initial Description:

 

In this case the resources failed over from one node to the other. That generally happens when the node that owns a resource can no longer run it, for example because it cannot reach its storage or has lost network connectivity. A node can also be removed from active failover cluster membership (Event ID 1135), which forces its resources to fail over to another node.

 

Why is Event ID 1135 logged?

This event is logged on every node in the Cluster except the node that was removed. It is logged because one of the nodes marked that node as down and then notified all of the other nodes. Once notified, the other nodes tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to Allow cluster network communication on this network. The nodes will send out heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes and then a response is sent back. Each node in the Cluster has its own heartbeats that it is going to monitor to ensure the network is up and the other nodes are up. The example below should help clarify this:

 

If any one of these packets is not returned, that specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get a response, that is a lost heartbeat and W2K8-R2-NODE1 keeps track of it. A missed response can make W2K8-R2-NODE1 show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
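The route and membership rules described above can be sketched in a few lines of Python. This is only an illustration, not the cluster service's actual logic: it assumes the five failures are consecutive heartbeats sent once per second, and the class, function, and constant names are invented for the example. "Cluster Network 2" is a hypothetical second route.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed default from the text: a route is marked down after 5 missed heartbeats.
MISSED_LIMIT = 5


@dataclass
class Route:
    """One network path from the local node to a peer node."""
    name: str
    missed: int = 0
    up: bool = True

    def record(self, reply_received: bool) -> None:
        """Update the route state after one heartbeat interval."""
        if reply_received:
            # A reply resets the counter and brings the route back up.
            self.missed = 0
            self.up = True
        else:
            self.missed += 1
            if self.missed >= MISSED_LIMIT:
                self.up = False


def peer_is_member(routes: Sequence[Route]) -> bool:
    """The peer stays in active membership while at least one route is up."""
    return any(route.up for route in routes)


crossover = Route("Cluster Network 1")
public = Route("Cluster Network 2")  # hypothetical second network

# Five seconds in which only the crossover network loses heartbeats.
for _ in range(5):
    crossover.record(False)
    public.record(True)
print(crossover.up, peer_is_member([crossover, public]))  # False True

# Five more seconds in which the second route fails as well.
for _ in range(5):
    public.record(False)
print(peer_is_member([crossover, public]))  # False
```

With a single route down the peer remains a member; only when every route is down would the peer be removed and Event 1135 logged.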

Reference :

                Having a problem with nodes being removed from active Failover Cluster membership?

                http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

 

Issue happened on 8/9/2016:

 

____________________________________________________________________________

 

System Information: ABC8SQL1

 

OS Name        Microsoft Windows Server 2008 R2 Enterprise

Version        6.1.7601 Service Pack 1 Build 7601

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABC8SQL1

System Manufacturer        Dell Inc.

System Model        PowerEdge R710

System Type        x64-based PC

Processor        Intel(R) Xeon(R) CPU           X5667  @ 3.07GHz, 3059 Mhz, 4 Core(s), 8 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU           X5667  @ 3.07GHz, 3059 Mhz, 4 Core(s), 8 Logical Processor(s)

BIOS Version/Date        Dell Inc. 3.0.0, 1/31/2011

 

System Events:

 

  • Checked the events at the time of the issue: Cluster Network 1 went down, the network then went to a Partitioned state, and cluster node ABC8SQL1 then dropped out of active cluster membership.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:03:19 PM | Warning | ABC8SQL1.abc.com | 1126 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABC8SQL2 – CLUSTER_XOVER’ for cluster node ‘ABC8SQL2’ on network ‘Cluster Network 1’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 8/9/2016 | 2:03:19 PM | Warning | ABC8SQL1.abc.com | 1126 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABC8SQL1 – CLUSTER_XOVER’ for cluster node ‘ABC8SQL1’ on network ‘Cluster Network 1’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 8/9/2016 | 2:03:19 PM | Error | ABC8SQL1.abc.com | 1129 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 1’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 8/9/2016 | 2:03:19 PM | Error | ABC8SQL1.abc.com | 1129 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 1’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

  • Event ID 1135 (node removed from active membership) was logged at around 2:08 PM.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:08:02 PM | Critical | ABC8SQL1.abc.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABC8SQL2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

  • As soon as the node was removed from the Cluster, the entire cluster went down because quorum was lost.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:08:02 PM | Error | ABC8SQL1.abc.com | 1557 | Microsoft-Windows-FailoverClustering | Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible. |
| 8/9/2016 | 2:08:02 PM | Warning | ABC8SQL1.abc.com | 1558 | Microsoft-Windows-FailoverClustering | The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data. |
| 8/9/2016 | 2:08:02 PM | Error | ABC8SQL1.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 1’ in clustered service or application ‘Cluster Group’ failed. |
| 8/9/2016 | 2:08:23 PM | Critical | ABC8SQL1.abc.com | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
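The quorum loss reported in Event 1177 can be illustrated with simple vote counting. The sketch below assumes a two-node cluster with a disk witness (three votes in total), which is what the witness events 1557 and 1558 suggest but is not stated explicitly in the logs. The function name and vote numbers are illustrative only.

```python
def has_quorum(votes_held: int, total_votes: int) -> bool:
    """Quorum requires a strict majority of all configured votes."""
    return votes_held > total_votes // 2


# Assumed configuration: two nodes plus one disk witness, one vote each.
TOTAL_VOTES = 3

# Healthy: both nodes and the witness are reachable.
print(has_quorum(3, TOTAL_VOTES))  # True
# One node lost, witness still owned by the surviving node.
print(has_quorum(2, TOTAL_VOTES))  # True
# One node lost and the witness disk cannot be brought online.
print(has_quorum(1, TOTAL_VOTES))  # False
```

Under this assumption, a node that loses its partner and also cannot arbitrate for the witness disk holds one vote out of three, so the Cluster service shuts down.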

 

 

Application Events:

 

  • Checked the Application events and found that a Backup job was running in the background when the issue happened.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57476 | Backup Exec | The operating system returned an unusual error while backing up the following file: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\VS SCC\1033\msempui.dll It is possible that this file is incomplete and therefore should not be restored. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57484 | Backup Exec | Error querying extended attributes (EAs) for the following file: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\VS SCC\1033\msempui.dll For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57481 | Backup Exec | An unusual error (21) was encountered while enumerating the contents of the directory: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\VS SCC\1033. It is possible that files or subdirectories have not been backed up. Please examine your job log or catalogs to ensure this directory tree was backed up in its entirety. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57481 | Backup Exec | An unusual error (55) was encountered while enumerating the contents of the directory: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\Xml. It is possible that files or subdirectories have not been backed up. Please examine your job log or catalogs to ensure this directory tree was backed up in its entirety. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57481 | Backup Exec | An unusual error (55) was encountered while enumerating the contents of the directory: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\Packages. It is possible that files or subdirectories have not been backed up. Please examine your job log or catalogs to ensure this directory tree was backed up in its entirety. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 10 | MSSQLServerOLAPService | An error occurred while closing the trace output file, \\?\I:\OLAP\Log\FlightRecorderCurrent.trc. |

 

List of outdated drivers:

 

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 4/26/2009 7:14 | (10.100:4.0) | (10.100:4.0) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 12/23/2010 21:00 | (6.1:7600.16385) | (6.2:1.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
| 8/6/2006 21:51 | (1.0:1.1) | (1.0:1.6) | Brother Industries Ltd. | Brotehr Serial I/F Driver (WDM) |
| 12/16/2010 14:09 | (6.2:3.0) | (6.2:3.0) | Broadcom Corporation | Broadcom NetXtreme II Diagnostic Driver |
| 2/4/2011 19:58 | (6.2:9.0) | (6.2:9.0) | Broadcom Corporation | AMD64 BXND NDIS6.0 Driver |
| 12/10/2010 16:06 | (6.1:7600.16385) | (6.2:7.0) | Broadcom Corporation | iSCSI offload x64 FREE |
| 1/6/2011 13:55 | (6.2:8.0) | (6.2:8.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
| 2/1/2011 5:15 | (6.2:16.0) | (6.2:16.0) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
| 9/10/2010 15:10 | (1.3:306.409) | (1.3:306.409) | Dell, Inc. | Dell MD Series Device Specific Module for Multi-Path X64-bit |
| 9/10/2010 15:11 | (1.3:306.409) | (1.3:306.409) | Dell, Inc. | Dell MD Series UTM Disk Driver for X64-bit |
| 3/30/2012 12:21 | (6.605:12.328) | (6.605:12.328) | Symantec Corporation | PureDisk Virtual File System Driver |
| 1/22/2009 18:05 | (9.1:8.6) | (9.1:8.6) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
| 5/18/2009 21:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
| 2/17/2009 18:03 | (5.0:1.1) | (5.0:1.1) | Promise Technology | Promise SuperTrak EX Series Driver for Windows |
| 10/25/2011 11:23 | (2.0:82.0) | (2.0:82.0) | Symantec Corporation | Allows granular display of back ups. |

 

 

Cluster Logs:

 

 

00000abc.00001480::2016/09/08-05:07:57.557 INFO  [NM] Received request from client address 192.10.2.246.

00000abc.0000184c::2016/09/08-07:03:09.213 WARN  [API] s_ApiOpenResourceEx: Resource Backup Exec Job Engine not found, status = 5007

00000abc.00000f00::2016/09/08-07:04:58.618 WARN  [API] s_ApiOpenResourceEx: Resource Backup Exec Job Engine not found, status = 5007

00000abc.00000b44::2016/09/08-07:06:47.040 WARN  [API] s_ApiOpenResourceEx: Resource Backup Exec Job Engine not found, status = 5007

 

00000abc.000017e4::2016/09/08-07:08:35.478 WARN  [API] s_ApiOpenResourceEx: Resource Backup Exec Job Engine not found, status = 5007

 

________________________________________________________________________

 

 System Information: ABC8SQL2

 

OS Name        Microsoft Windows Server 2008 R2 Enterprise

Version        6.1.7601 Service Pack 1 Build 7601

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABC8SQL2

System Manufacturer        Dell Inc.

System Model        PowerEdge R710

System Type        x64-based PC

Processor        Intel(R) Xeon(R) CPU           X5667  @ 3.07GHz, 3059 Mhz, 4 Core(s), 8 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU           X5667  @ 3.07GHz, 3059 Mhz, 4 Core(s), 8 Logical Processor(s)

BIOS Version/Date        Dell Inc. 3.0.0, 1/31/2011

 

System Events:

 

 

  • The same events appear on this node: the cluster network interfaces on Cluster Network 1 became unreachable, after which the Cluster failed.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:03:19 PM | Warning | ABC8SQL2.abc.com | 1126 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABC8SQL2 – CLUSTER_XOVER’ for cluster node ‘ABC8SQL2’ on network ‘Cluster Network 1’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 8/9/2016 | 2:03:19 PM | Warning | ABC8SQL2.abc.com | 1126 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABC8SQL1 – CLUSTER_XOVER’ for cluster node ‘ABC8SQL1’ on network ‘Cluster Network 1’ is unreachable by at least one other cluster node attached to the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 8/9/2016 | 2:03:19 PM | Error | ABC8SQL2.abc.com | 1129 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 1’ is partitioned. Some attached failover cluster nodes cannot communicate with each other over the network. The failover cluster was not able to determine the location of the failure. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

  • Event ID 1135 was logged at 2:07 PM, reporting that ABC8SQL1 was removed from active cluster membership.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:07:46 PM | Critical | ABC8SQL2.abc.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABC8SQL1’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 8/9/2016 | 2:07:53 PM | Error | ABC8SQL2.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 1’ in clustered service or application ‘Cluster Group’ failed. |
| 8/9/2016 | 2:08:01 PM | Error | ABC8SQL2.abc.com | 4199 | Tcpip | The system detected an address conflict for IP address 192.10.2.248 with the system having network hardware address 78-2B-CB-19-51-92. Network operations on this system may be disrupted as a result. |
| 8/9/2016 | 2:08:06 PM | Error | ABC8SQL2.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 3’ in clustered service or application ‘SQL Server (MSSQLSERVER)’ failed. |
| 8/9/2016 | 2:08:06 PM | Error | ABC8SQL2.abc.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 2’ in clustered service or application ‘SQL Server (MSSQLSERVER)’ failed. |
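Because each node logs its own view of the incident, it can help to interleave the key events from both nodes into one timeline. The sketch below does that for a hand-picked subset of the System events listed above for both nodes; the short summaries are paraphrases and the function name is illustrative.

```python
from datetime import datetime

# (node, timestamp, event ID, short summary) taken from the System events of both nodes.
EVENTS = [
    ("ABC8SQL1", "8/9/2016 2:03:19 PM", 1129, "Cluster Network 1 partitioned"),
    ("ABC8SQL1", "8/9/2016 2:08:02 PM", 1135, "ABC8SQL2 removed from membership"),
    ("ABC8SQL1", "8/9/2016 2:08:23 PM", 1177, "Cluster service shutting down, quorum lost"),
    ("ABC8SQL2", "8/9/2016 2:03:19 PM", 1129, "Cluster Network 1 partitioned"),
    ("ABC8SQL2", "8/9/2016 2:07:46 PM", 1135, "ABC8SQL1 removed from membership"),
    ("ABC8SQL2", "8/9/2016 2:07:53 PM", 1069, "Cluster Disk 1 failed"),
    ("ABC8SQL2", "8/9/2016 2:08:06 PM", 1069, "Cluster Disk 2 and 3 failed"),
]


def merged_timeline(events):
    """Parse the event-log timestamps and return all events in time order."""
    parsed = [
        (datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), node, event_id, text)
        for node, stamp, event_id, text in events
    ]
    # sorted() is stable, so events with the same timestamp keep their input order.
    return sorted(parsed, key=lambda entry: entry[0])


for when, node, event_id, text in merged_timeline(EVENTS):
    print(when.strftime("%H:%M:%S"), node, event_id, text)
```

Read in order, the merged view shows roughly four and a half minutes between the network partition at 2:03:19 PM and the first membership change at 2:07:46 PM, followed by the disk failures on ABC8SQL2.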

 

Application Events:

 

  • Checked the Application events and found that there was a Backup operation running in the Background.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57476 | Backup Exec | The operating system returned an unusual error while backing up the following file: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\VS SCC\1033\msempui.dll It is possible that this file is incomplete and therefore should not be restored. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57484 | Backup Exec | Error querying extended attributes (EAs) for the following file: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\VS SCC\1033\msempui.dll For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57481 | Backup Exec | An unusual error (21) was encountered while enumerating the contents of the directory: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\VS SCC\1033. It is possible that files or subdirectories have not been backed up. Please examine your job log or catalogs to ensure this directory tree was backed up in its entirety. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57481 | Backup Exec | An unusual error (55) was encountered while enumerating the contents of the directory: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\Xml. It is possible that files or subdirectories have not been backed up. Please examine your job log or catalogs to ensure this directory tree was backed up in its entirety. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 57481 | Backup Exec | An unusual error (55) was encountered while enumerating the contents of the directory: \\ABC8SQL1.abc.com\C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\Packages. It is possible that files or subdirectories have not been backed up. Please examine your job log or catalogs to ensure this directory tree was backed up in its entirety. For more information, click the following link: http://eventlookup.veritas.com/eventlookup/EventLookup.jhtml |
| 8/9/2016 | 2:08:29 PM | Error | ABC8SQL1.abc.com | 10 | MSSQLServerOLAPService | An error occurred while closing the trace output file, \\?\I:\OLAP\Log\FlightRecorderCurrent.trc. |

 

 

List of outdated drivers:

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 4/26/2009 7:14 | (10.100:4.0) | (10.100:4.0) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 12/23/2010 21:00 | (6.1:7600.16385) | (6.2:1.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
| 8/6/2006 21:51 | (1.0:1.1) | (1.0:1.6) | Brother Industries Ltd. | Brotehr Serial I/F Driver (WDM) |
| 8/6/2006 21:51 | (1.0:0.20) | (1.0:0.20) | Brother Industries Ltd. | Brother Serial driver (WDM version) |
| 8/6/2006 21:51 | (6.0:5479.0) | (1.0:0.12) | Brother Industries Ltd. | Brother USB MDM Driver |
| 8/9/2006 8:11 | (6.0:5479.0) | (1.0:1.3) | Brother Industries Ltd. | Brother USB Serial Driver |
| 12/16/2010 14:09 | (6.2:3.0) | (6.2:3.0) | Broadcom Corporation | Broadcom NetXtreme II Diagnostic Driver |
| 2/4/2011 19:58 | (6.2:9.0) | (6.2:9.0) | Broadcom Corporation | AMD64 BXND NDIS6.0 Driver |
| 12/10/2010 16:06 | (6.1:7600.16385) | (6.2:7.0) | Broadcom Corporation | iSCSI offload x64 FREE |
| 1/6/2011 13:55 | (6.2:8.0) | (6.2:8.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
| 2/1/2011 5:15 | (6.2:16.0) | (6.2:16.0) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
| 3/30/2012 12:21 | (6.605:12.328) | (6.605:12.328) | Symantec Corporation | PureDisk Virtual File System Driver |
| 8/9/2010 14:21 | (4.31:1.64) | (4.31:1.64) | LSI Corporation | MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64 |
| 1/22/2009 18:05 | (9.1:8.6) | (9.1:8.6) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
| 5/18/2009 21:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
| 7/13/2009 20:00 | (6.1:7600.16385) | (6.1:7600.16385) | Brother Industries Ltd. | Brotehr Serial I/F Driver (WDM) |
| 2/17/2009 18:03 | (5.0:1.1) | (5.0:1.1) | Promise Technology | Promise SuperTrak EX Series Driver for Windows |
| 8/3/2015 8:44 | (1.0:0.0) | (1.1:0.15) | SolarWinds, Inc. | SolarWinds Mini Filter Driver |
| 1/15/2015 21:53 | (12.9:6.12) | (12.9:6.12) | Symantec Corporation | Symantec Event Library |
| 10/25/2011 11:23 | (2.0:82.0) | (2.0:82.0) | Symantec Corporation | Allows granular display of back ups. |
| 11/4/2010 14:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver |

 

 

Cluster Logs:

 

00000444.00000860::2016/08/09-18:07:45.987 INFO  [NODE] Node 2: n1 node object is closing its connections

00000444.00000860::2016/08/09-18:07:45.987 INFO  [NODE] Node 2: closing n1 node object channels

00000444.00000860::2016/08/09-18:07:45.987 INFO  [CORE] Node 2: Clearing cookie 8a98b30b-9b4d-489c-9ce4-15d506d5365a

00000444.00000958::2016/08/09-18:07:45.987 INFO  [CHANNEL 169.254.1.182:~60190~] graceful close, status (of previous failure, may not indicate problem) ERROR_IO_PENDING(997)

00000444.00000958::2016/08/09-18:07:45.987 WARN  [PULLER ABC8SQL1] ReadObject failed with GracefulClose(1226)’ because of ‘channel to remote endpoint 169.254.1.182:~60190~ is closed’

00000444.00000958::2016/08/09-18:07:45.987 ERR   [NODE] Node 2: Connection to Node 1 is broken. Reason GracefulClose(1226)’ because of ‘channel to remote endpoint 169.254.1.182:~60190~ is closed’

00000444.00000860::2016/08/09-18:07:45.987 INFO  [NETFT] Route <struct mscs::FaultTolerantRoute>

00000444.00000860::2016/08/09-18:07:45.987 INFO    <realLocal>172.00..30.2:~3343~</realLocal>

00000444.00000860::2016/08/09-18:07:45.987 INFO    <realRemote>172.00..30.1:~3343~</realRemote>

00000444.00000860::2016/08/09-18:07:45.987 INFO    <virtualLocal>169.254.2.92:~0~</virtualLocal>

 

 

00000444.00000860::2016/08/09-18:07:45.987 INFO  </class mscs::detail::ConsensusMessage>

00000444.00000934::2016/08/09-18:07:45.987 INFO  [RGP] Node 2: Timer Tick Started

00000444.00000934::2016/08/09-18:07:45.987 WARN  [RGP] Node 2: only local suspects are missing (1). moving to the next stage (shortcut compensation time 05.000)

00000444.00000934::2016/08/09-18:07:45.987 INFO  [RGP] Node 2: Opening`1 => Closing`2

00000444.00000934::2016/08/09-18:07:45.987 INFO  <class mscs::detail::ConsensusMessage>

 

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [VER] Version check passed: node and cluster highest supported versions match.

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [SV] Negotiating message security level.

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [SV] Already protecting connection with message security level ‘Sign’.

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [FTI] Got new raw TCP/IP connection.

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [FTI][Follower] This node (2) is not the initiator

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [CHANNEL 172.00..30.1:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)

00000444.000011dc::2016/08/09-18:07:46.767 INFO  [CORE] Node 2: Clearing cookie 8a98b30b-9b4d-489c-9ce4-15d506d5365a

00000444.000011dc::2016/08/09-18:07:46.767 WARN  cxl::ConnectWorker::operator (): GracefulClose(1226)’ because of ‘channel to remote endpoint 172.00..30.1:~3343~ is closed’

00000444.000011dc::2016/08/09-18:07:46.907 DBG   [NETFTAPI] received NsiParameterNotification  for 169.254.2.92 (IpDadStateDeprecated )

00000444.000011dc::2016/08/09-18:07:46.907 DBG   [NETFTAPI] Signaled NetftLocalDisconnect  event for 169.254.2.92

00000444.00000888::2016/08/09-18:07:46.907 INFO  [IM] got event: Local endpoint 169.254.2.92:~0~ disconnected

00000444.00000934::2016/08/09-18:07:46.923 INFO  [RGP] Node 2: Timer Tick Started

00000444.00000934::2016/08/09-18:07:46.923 INFO  [RGP] Node 2: Cleanup`6 => Stable_`0

00000444.00000934::2016/08/09-18:07:46.923 INFO  [CORE] Node 2: New View is <ViewChanged joiners=() downers=(1) newView=2202(2) oldView=2102(1 2) joiner=false form=false/> (Start Dispatch)

00000444.00000934::2016/08/09-18:07:46.923 INFO  [MRR] Node 2: Process view 2202(2)

00000444.00000934::2016/08/09-18:07:46.923 INFO  [GUM] Node 2: one of the active nodes went down, epoch change is required

00000444.00000934::2016/08/09-18:07:46.923 INFO  [CORE] NeedStateView is adjusted to (). SetGumView ((2))

00000444.00000934::2016/08/09-18:07:46.923 INFO  [GUM] Node 2 some of the active nodes went down. Laucnhing dummy update

00000444.000011dc::2016/08/09-18:07:46.923 INFO  [RCM] Moving groups from downed nodes (1)

00000444.00000934::2016/08/09-18:07:46.923 INFO  [QUORUM] Node 2: quorum resource owner 1 died

00000444.00000934::2016/08/09-18:07:46.923 WARN  [QUORUM] Node 2: One off quorum (2)

 

00000444.00000958::2016/08/09-18:07:46.923 WARN  [RCM] Moving orphaned group Cluster Group from downed node ABC8SQL1 to node ABC8SQL2.

 

00000b4c.00000bb0::2016/08/09-18:07:46.923 INFO  [RES] Physical Disk: Enter EnumerateDevices: EnumDevice 0

00000b4c.00000bb0::2016/08/09-18:07:46.923 INFO  [RES] Physical Disk: Exit EnumerateDevices: status 0

00000b4c.00000bb0::2016/08/09-18:07:47.219 WARN  [RES] Physical Disk <Cluster Disk 1>: PR reserve failed, status 170

00000b4c.00000bb0::2016/08/09-18:07:47.219 INFO  [RES] Physical Disk: ValidateReservations: size of reservations 16

00000b4c.00000bb0::2016/08/09-18:07:47.219 INFO  [RES] Physical Disk: Key: 1df27466734d, type 5 scope 0

00000b4c.00000bb0::2016/08/09-18:07:47.219 INFO  [RES] Physical Disk: Sleeping for 6 secs

00000444.000011a0::2016/08/09-18:07:50.636 INFO  [CONNECT] 192.10.2.246:~3343~: Established connection to remote endpoint 192.10.2.246:~3343~.

00000444.000011a0::2016/08/09-18:07:50.636 INFO  [SV] Securing route from (192.10.2.247:~54347~) to remote ABC8SQL1 (192.10.2.246:~3343~).

00000444.000011a0::2016/08/09-18:07:50.636 INFO  [SV] Got a new outgoing stream to ABC8SQL1 at 192.10.2.246:~3343~

00000444.000011a0::2016/08/09-18:07:52.040 INFO  [SV] Authentication and authorization were successful

00000b4c.00000bb0::2016/08/09-18:07:53.256 ERR   [RES] Physical Disk <Cluster Disk 1>: Failed to preempt reservation, status 170

00000444.00000958::2016/08/09-18:07:53.288 ERR   [RCM] Arbitrating resource ‘Cluster Disk 1’ returned error 170

00000444.00000958::2016/08/09-18:07:53.288 INFO  [RCM] TransitionToState(Cluster Disk 1) OnlineCallIssued–>ArbitrationFailure.

00000444.00000958::2016/08/09-18:07:53.288 ERR   [RCM] rcm::RcmResource::HandleFailure: (Cluster Disk 1)

00000444.00000958::2016/08/09-18:07:53.288 INFO  [QUORUM] Node 2: PostRelease for e2d07397-08b2-4f59-af3e-e5473753f120

00000444.00000958::2016/08/09-18:07:53.288 INFO  [DM] Node 2: DetachWitness

00000444.00000958::2016/08/09-18:07:53.288 INFO  [QUORUM] Node 2: quorum is not owned by anyone

00000444.00000958::2016/08/09-18:07:53.288 WARN  [QUORUM] Node 2: One off quorum (2)

00000444.00000958::2016/08/09-18:07:53.288 INFO  [QUORUM] Node 2: death timer is already running, 13 seconds left.
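The cluster log timestamps above run four hours ahead of the event log times (18:07 in the log versus 2:07 PM in the events), which is consistent with the log being written in UTC while the servers were on US Eastern daylight time. The sketch below parses lines in the format shown above and converts them to local time so they can be lined up with the System events. The fixed UTC-4 offset is an assumption for this incident, and the regular expression covers only the line shape seen in this excerpt.

```python
import re
from datetime import datetime, timedelta, timezone

# processId.threadId::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR|DBG)\s+(?P<msg>.*)$"
)

# Assumption: log times are UTC and local time was UTC-4 on the day of the incident.
LOCAL_TZ = timezone(timedelta(hours=-4))


def parse_line(line):
    """Return (local_time, level, message), or None for lines that do not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    utc_time = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(LOCAL_TZ), match["level"], match["msg"]


sample = [
    "00000b4c.00000bb0::2016/08/09-18:07:47.219 WARN  [RES] Physical Disk <Cluster Disk 1>: PR reserve failed, status 170",
    "00000b4c.00000bb0::2016/08/09-18:07:53.256 ERR   [RES] Physical Disk <Cluster Disk 1>: Failed to preempt reservation, status 170",
]

for line in sample:
    parsed = parse_line(line)
    if parsed and parsed[1] in ("WARN", "ERR"):
        when, level, message = parsed
        print(when.strftime("%I:%M:%S %p"), level, message)
```

Filtering on WARN and ERR in this way narrows the excerpt to the reservation failures on Cluster Disk 1 that precede the arbitration failure.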

 

 

Looking up status 170 with the error lookup tool gives:

ERROR_BUSY (winerror.h): The requested resource is in use.
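The other numeric status codes in the cluster log excerpts can be decoded the same way. The table below is a small hand-written lookup of the Win32 codes that appear in the logs above, using their winerror.h names. It is a convenience sketch rather than a complete list, and the function name is illustrative.

```python
# Win32 status codes seen in the cluster log excerpts, with their winerror.h names.
WIN32_CODES = {
    0: ("ERROR_SUCCESS", "The operation completed successfully."),
    170: ("ERROR_BUSY", "The requested resource is in use."),
    997: ("ERROR_IO_PENDING", "Overlapped I/O operation is in progress."),
    1226: ("ERROR_GRACEFUL_DISCONNECT", "The network connection was gracefully closed."),
    5007: ("ERROR_RESOURCE_NOT_FOUND", "The cluster resource could not be found."),
}


def describe(code: int) -> str:
    """Return 'NAME: message' for a known code, or a placeholder for unknown ones."""
    if code in WIN32_CODES:
        name, message = WIN32_CODES[code]
        return f"{name}: {message}"
    return f"Unknown status {code}"


print(describe(170))  # ERROR_BUSY: The requested resource is in use.
```

On a Windows machine, `ctypes.FormatError(170)` from the Python standard library should return the same message text without a hand-maintained table.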

 

 

___________________________________________________________________________

 

 

Conclusion:

 

  • After analyzing the logs, we can conclude that the issue started when Cluster Network 1 went down. Communication between the nodes was lost, ABC8SQL1 dropped out of the Cluster, and its resources failed over to ABC8SQL2. They could not come online on ABC8SQL2 because the persistent reservation (PR) on the disks failed there with status 170 (ERROR_BUSY). A backup was running at the time of the issue and most likely held an exclusive handle on the disks.

  • When you restarted node ABC8SQL2, everything came online because the restart cleared the reservation on the disk.

 

 

  1. Update the network adapter drivers and check for ongoing issues with any intermediate device connected between the two servers.

  2. Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the articles below and make sure all of these updates are installed on every node:

Updates for Cluster Binaries for 2008 R2

http://support.microsoft.com/kb/2914680                Ntfs.sys

https://support.microsoft.com/kb/2779069              Clussvc.exe

http://support.microsoft.com/kb/2786667                Clusdisk.sys

http://support.microsoft.com/kb/2795696                clusres.dll

http://support.microsoft.com/kb/2907244                RHS.exe

  3. Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: this step is the most critical one when dealing with network connectivity issues. To avoid this issue in future, the connectivity problem on the cluster networks has to be diagnosed:

     • Check the network adapters, cables, and network configuration for the networks that connect the nodes.

     • Check the hubs, switches, or bridges in the networks that connect the nodes.

     • Check for switch delays and proxy ARP with the help of the in-house networking team.

  4. Update the Symantec backup utility on the server.

 

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
