RCA – 10 – RCA on Cluster Resource Failover

Issue Description:

 

We need to determine the reason for the unexpected cluster failover on cluster abcdCluster01, running Microsoft Windows Server 2012 Standard, Version 6.2.9200 Build 9200, on August 09 at 8:53 PM.

 

Initial Description:

 

In this case, the resources failed over from one node to another. This generally happens when the node on which the resource was running is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from failover cluster membership (Event ID 1135), which causes the resources to fail over to another node.

 

Why is Event ID 1135 logged?

This event is logged on all nodes in the cluster except the node that was removed. It is logged because one of the nodes in the cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

 

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. Each packet is supposed to be received by the other node, which then sends a response back. Each node in the cluster monitors its own heartbeats to ensure that the network and the other nodes are up. The example below should help clarify this:

 

If any one of these packets is not returned, that specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get a response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.

Reference:

Having a problem with nodes being removed from active Failover Cluster membership?

http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx
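The heartbeat logic described above can be illustrated with a small Python sketch. This is only a simplified model and not how the Cluster service is implemented: the class name `Route`, the function `peer_removed`, and the network names are hypothetical, and the "5 failures in 5 seconds" rule is approximated as 5 consecutive missed replies at an assumed rate of one heartbeat per second.

```python
from dataclasses import dataclass

# Assumption: one heartbeat per second, so "5 failures in 5 seconds"
# is modelled here as 5 consecutive missed replies.
MISS_THRESHOLD = 5


@dataclass
class Route:
    """One network path from this node to a peer node (hypothetical model)."""
    name: str
    consecutive_misses: int = 0

    def record(self, reply_received: bool) -> None:
        # A reply resets the counter; a missed reply increments it.
        if reply_received:
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1

    @property
    def is_down(self) -> bool:
        return self.consecutive_misses >= MISS_THRESHOLD


def peer_removed(routes: list[Route]) -> bool:
    """The peer is removed from membership only when every route is down."""
    return bool(routes) and all(route.is_down for route in routes)


if __name__ == "__main__":
    routes = [Route("cluster-net-1"), Route("cluster-net-2")]
    # First network loses 5 heartbeats while the second keeps answering.
    for _ in range(5):
        routes[0].record(False)
        routes[1].record(True)
    print(peer_removed(routes))  # False: one route is still up
    # Now the second network also loses 5 heartbeats.
    for _ in range(5):
        routes[1].record(False)
    print(peer_removed(routes))  # True: all routes are down
```

In this model, losing a single network does not remove the peer; only when every route has crossed the threshold would the equivalent of Event ID 1135 be logged.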

 

Issue happened on August 09 @ 8:53pm.

 

_________________________________________

 

System Information: ABCC1DOCS02

 

OS Name                 Microsoft Windows Server 2012 Standard
Version                 6.2.9200 Build 9200
Other OS Description    Not Available
OS Manufacturer         Microsoft Corporation
System Name             ABCC1DOCS02
System Manufacturer     VMware, Inc.
System Model            VMware Virtual Platform
System Type             x64-based PC
System SKU
Processor               AMD Opteron(tm) Processor 6380, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date       Phoenix Technologies LTD 6.00, 9/17/2015

 

 

System Events:

 

  • Checked the events and found that Event ID 1135 (a cluster node removed from active failover cluster membership) was logged on ABCC1DOCS02 about nine seconds after the iSCSI disconnect errors.

  • Checked the events and found Event ID 20 and Event ID 7 just before the issue. Both events indicate that the server (initiator) lost its connection to the target.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:45:21 PM | Error | ABCC1DOCS02.abc.local | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to retry the connection. |
| 8/9/2016 | 8:45:21 PM | Error | ABCC1DOCS02.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given in the dump data. |
| 8/9/2016 | 8:45:30 PM | Critical | ABCC1DOCS02.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCC1DOCS01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

  • Checked other events at the time of the issue and found that a shadow copy operation was running.
  • We also see Event ID 46, logged because an MPIO path was removed.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:46:06 PM | Error | ABCC1DOCS02.abc.local | 29 | volsnap | The shadow copies of volume T: were aborted during detection. |
| 8/9/2016 | 8:46:06 PM | Information | ABCC1DOCS02.abc.local | 46 | mpio | Path 77040004 was removed from \Device\MPIODisk1 due to a PnP event. The dump data contains the current number of paths. |
| 8/9/2016 | 8:46:06 PM | Error | ABCC1DOCS02.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume3. Status is in the data. |
| 8/9/2016 | 8:46:06 PM | Warning | ABCC1DOCS02.abc.local | 9293 | AAFsFlt | Volume \Device\HarddiskVolume3 has been disabled because of a PnP surprise removal event. |
| 8/9/2016 | 8:46:06 PM | Warning | ABCC1DOCS02.abc.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: Q:, DeviceName: \Device\HarddiskVolume3. (A device which does not exist was specified.) |

 

  • We also see events related to AppAssure, which was running at the time of the issue.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume9. Status is in the data. |
| 8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume9. Status is in the data. |
| 8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 9289 | AAFsFlt | Log file could not be opened or re-opened for device \Device\HarddiskVolume9. The volume has been disabled. The failure status code is the last word of the data. |
| 8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 27 | volsnap | The shadow copies of volume T: were aborted during detection because a critical control file could not be opened. |

 

  • This event indicates that the Kaspersky service was also active at the time of the issue.

 

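| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:57:39 PM | Error | ABCC1DOCS02.abc.local | 5 | KLIF | Volume ID of the volume ‘\Device\HarddiskVolume13’ is already in use. iSwift and Section Verdict Cache will not apply to this volume. |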

 

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 10:03:18 PM | Error | ABCC1DOCS02.abc.local | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to retry the connection. |
| 8/9/2016 | 10:03:18 PM | Error | ABCC1DOCS02.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given in the dump data. |

 

 Application Events:

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:42:46 PM | Error | ABCC1DOCS02.abc.local | 8194 | VSS | Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process. Operation: Gathering Writer Data Context: Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220} Writer Name: System Writer Writer Instance ID: {99d281ed-b864-41b3-a3bd-eefa7f51623d} |

 

 

List of outdated drivers:

 

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 2/27/2007 19:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
| 6/5/2012 13:59 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver |
| 6/5/2012 14:02 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver |
| 2/23/2012 20:21 | (6.2:8220.0) | (7.0:51.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
| 11/3/2011 15:06 | (6.2:8128.0) | (7.0:50.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64) |
| 11/10/2011 1:04 | (6.2:8128.0) | (7.0:16.50) | Broadcom Corporation | FCoE offload x64 FREE |
| 2/23/2012 20:17 | (6.2:8220.0) | (7.0:13.50) | Broadcom Corporation | iSCSI offload x64 FREE |
| 7/23/2012 19:30 | (7.0:1.36) | (7.0:1.36) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
| 9/21/2012 1:07 | (4.5:0.6409) | (4.5:0.6409) | Dell | Dell EqualLogic Device Specific Module |
| 7/24/2012 8:22 | (7.0:35.95) | (7.0:35.95) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
| 6/6/2006 17:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
| 5/31/2012 13:11 | (9.1:9.205) | (9.1:9.205) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
| 12/7/2011 19:48 | (2.1:5.10) | (2.1:5.10) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |

 

______________________________________________________________________________

 

System Information: ABCC1DOCS01

 

OS Name                 Microsoft Windows Server 2012 Standard
Version                 6.2.9200 Build 9200
Other OS Description    Not Available
OS Manufacturer         Microsoft Corporation
System Name             ABCC1DOCS01
System Manufacturer     VMware, Inc.
System Model            VMware Virtual Platform
System Type             x64-based PC
System SKU
Processor               AMD Opteron(tm) Processor 6380, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date       Phoenix Technologies LTD 6.00, 9/17/2015

 

System Events:

 

  • Checked the events and found Event ID 1135.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:45:28 PM | Critical | ABCC1DOCS01.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCC1DOCS02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

  • The same iSCSI errors appear on this node, which confirms that there was a storage disconnect.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:45:32 PM | Error | ABCC1DOCS01.abc.local | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to retry the connection. |
| 8/9/2016 | 8:45:32 PM | Error | ABCC1DOCS01.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given in the dump data. |

 

  • Because the disks were disconnected, NTFS logged events indicating that it could not flush the transaction log.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:45:58 PM | Warning | ABCC1DOCS01.abc.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: Q:, DeviceName: \Device\HarddiskVolume10. ({Write Protect Error} The disk cannot be written to because it is write protected. Please remove the write protection from the volume %hs in drive %hs.) |

 

  • Found events related to the backup job that was running.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:45:58 PM | Error | ABCC1DOCS01.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume10. Status is in the data. |
| 8/9/2016 | 8:45:59 PM | Error | ABCC1DOCS01.abc.local | 1038 | Microsoft-Windows-FailoverClustering | Ownership of cluster disk ‘Cluster Data (T:)’ has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration. |
| 8/9/2016 | 8:46:02 PM | Error | ABCC1DOCS01.abc.local | 14 | volsnap | The shadow copies of volume T: were aborted because of an IO failure on volume T:. |

 

  • Because the disk was unexpectedly removed from the node, corruption occurred on volume Q:.

 

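| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 8:58:33 PM | Error | ABCC1DOCS01.abc.local | 55 | Ntfs | A corruption was discovered in the file system structure on volume Q:. The exact nature of the corruption is unknown. The file system structures need to be scanned and fixed offline. |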

 

  • The last related event occurred at 9:08:09 PM on 8/9/2016, after which the issue was resolved.

 

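| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 8/9/2016 | 9:08:09 PM | Error | ABCC1DOCS01.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given in the dump data. |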
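To see the order of events across both nodes, the key System events from the two logs can be merged into one timeline. The following Python sketch does this with timestamps, event IDs, and sources copied from the event tables in this report. The function names are hypothetical, and the sketch assumes that the clocks on both nodes are synchronized, which the logs themselves do not confirm.

```python
from datetime import datetime

# (node, timestamp, event ID, source), copied from the System event tables.
EVENTS = [
    ("ABCC1DOCS02", "8/9/2016 8:45:21 PM", 20, "iScsiPrt"),
    ("ABCC1DOCS02", "8/9/2016 8:45:21 PM", 7, "iScsiPrt"),
    ("ABCC1DOCS02", "8/9/2016 8:45:30 PM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("ABCC1DOCS01", "8/9/2016 8:45:28 PM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("ABCC1DOCS01", "8/9/2016 8:45:32 PM", 20, "iScsiPrt"),
    ("ABCC1DOCS01", "8/9/2016 8:45:32 PM", 7, "iScsiPrt"),
    ("ABCC1DOCS01", "8/9/2016 8:45:59 PM", 1038, "Microsoft-Windows-FailoverClustering"),
]


def parse_timestamp(text: str) -> datetime:
    """Parse the 'M/D/YYYY H:MM:SS AM/PM' format used in the event tables."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


def merged_timeline(events):
    """Return (time, node, event_id, source) tuples sorted by time."""
    parsed = [
        (parse_timestamp(ts), node, event_id, source)
        for node, ts, event_id, source in events
    ]
    # Sort on the timestamp only; the sort is stable, so events with the
    # same timestamp keep the order in which they were listed.
    return sorted(parsed, key=lambda entry: entry[0])


if __name__ == "__main__":
    for when, node, event_id, source in merged_timeline(EVENTS):
        print(f"{when:%H:%M:%S}  {node}  {event_id:>4}  {source}")
```

Under the clock-synchronization assumption, the merged view places the first iSCSI errors on ABCC1DOCS02 ahead of the first Event ID 1135 on either node.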

 

 

Application Events:

 

  • Checked the events but was not able to find anything specific related to the issue.

 

 

List of outdated drivers:

 

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 2/27/2007 19:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
| 6/5/2012 13:59 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver |
| 6/5/2012 14:02 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver |
| 2/23/2012 20:21 | (6.2:8220.0) | (7.0:51.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
| 11/3/2011 15:06 | (6.2:8128.0) | (7.0:50.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64) |
| 11/10/2011 1:04 | (6.2:8128.0) | (7.0:16.50) | Broadcom Corporation | FCoE offload x64 FREE |
| 2/23/2012 20:17 | (6.2:8220.0) | (7.0:13.50) | Broadcom Corporation | iSCSI offload x64 FREE |
| 7/23/2012 19:30 | (7.0:1.36) | (7.0:1.36) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
| 9/21/2012 1:07 | (4.5:0.6409) | (4.5:0.6409) | Dell | Dell EqualLogic Device Specific Module |
| 7/24/2012 8:22 | (7.0:35.95) | (7.0:35.95) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
| 6/6/2006 17:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
| 5/31/2012 13:11 | (9.1:9.205) | (9.1:9.205) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
| 12/7/2011 19:48 | (2.1:5.10) | (2.1:5.10) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |

 

_________________________________________________________________________________

 

 

 

Conclusion:

 

  • After analyzing the logs, we found that the issue started after the nodes lost their connection to the storage. This can happen when any device used to connect the nodes to the SAN fails, for example a switch or router. Since the issue occurred on both nodes, the likely cause is a device that is common to both VMs.

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)

• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)

• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

 

 

  1. Please check internally whether any device between the servers and the SAN malfunctioned, because the issue started after the storage disconnect.
  2. Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article below and make sure all of these updates are installed on every node:

 

https://support.microsoft.com/en-us/kb/2784261

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
