RCA - 10 - RCA on Cluster Resource Failover

Issue Description:

Determine the reason for the unexpected failover on cluster abcdCluster01, running Microsoft Windows Server 2012 Standard (Version 6.2.9200 Build 9200), on August 09 at 8:53 PM.

Initial Description:

In this case the resources failed over from one node to another. This generally happens when the node on which the resource was running is no longer capable of running that resource, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from the active failover cluster membership (Event ID 1135), which causes the resources to fail over to another node.

Why is Event ID 1135 Logged?

This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes, which then send a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network is up and the other nodes are up. The example below should help clarify this:

If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
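The threshold logic described above can be sketched as follows. This is a simplified illustration in Python, not the Cluster service's actual implementation: it assumes one heartbeat per second on each route and counts consecutive missed replies. The `Route` and `Peer` classes, the `simulate` helper, and the network names are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field

# Default described in the text: 5 missed heartbeats (about 5 seconds).
MISS_THRESHOLD = 5


@dataclass
class Route:
    """One network path from the local node to a peer node (hypothetical model)."""
    name: str
    consecutive_misses: int = 0

    @property
    def is_down(self) -> bool:
        return self.consecutive_misses >= MISS_THRESHOLD

    def record(self, reply_received: bool) -> None:
        # Assumption: a reply resets the counter, a missing reply increments it.
        if reply_received:
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1


@dataclass
class Peer:
    """A remote cluster node as seen from the local node."""
    name: str
    routes: list = field(default_factory=list)

    def should_be_removed(self) -> bool:
        # The peer leaves active membership only when every route is down.
        return bool(self.routes) and all(r.is_down for r in self.routes)


def simulate(seconds: int, second_route_fails_from: int) -> list:
    """Return, for each second, whether the peer would be removed.

    Route 1 misses every heartbeat; route 2 starts missing heartbeats
    at the given second.
    """
    peer = Peer("NODE2", [Route("network-1"), Route("network-2")])
    removed = []
    for second in range(1, seconds + 1):
        peer.routes[0].record(reply_received=False)
        peer.routes[1].record(reply_received=second < second_route_fails_from)
        removed.append(peer.should_be_removed())
    return removed


# Route 1 is down after second 5, but the peer stays a member until
# route 2 has also missed 5 heartbeats (seconds 3 to 7).
print(simulate(8, 3).index(True) + 1)  # prints 7
```

In this model a single failed network does not remove the node. Removal (and Event ID 1135 on the remaining nodes) happens only once the last remaining route also crosses the threshold.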

Reference:

Having a problem with nodes being removed from active Failover Cluster membership?

http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

Issue happened on August 09 @ 8:53pm.

_________________________________________

System Information: ABCC1DOCS02

OS Name	Microsoft Windows Server 2012 Standard
Version	6.2.9200 Build 9200
Other OS Description	Not Available
OS Manufacturer	Microsoft Corporation
System Name	ABCC1DOCS02
System Manufacturer	VMware, Inc.
System Model	VMware Virtual Platform
System Type	x64-based PC
System SKU	
Processor	AMD Opteron(tm) Processor 6380, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date	Phoenix Technologies LTD 6.00, 9/17/2015

System Events:

Checked the events and found that the node ABCC1DOCS02 was removed from the cluster membership after the iSCSI disconnect errors were logged.

Event IDs 20 and 7 were logged just before the issue. Both events indicate that the server (initiator) lost its connection to the target.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
8/9/2016	8:45:21 PM	Error	ABCC1DOCS02.abc.local	20	iScsiPrt	Connection to the target was lost. The initiator will attempt to retry the connection.
8/9/2016	8:45:21 PM	Error	ABCC1DOCS02.abc.local	7	iScsiPrt	The initiator could not send an iSCSI PDU. Error status is given in the dump data.
8/9/2016	8:45:30 PM	Critical	ABCC1DOCS02.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABCC1DOCS01’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Checked other events at the time of the issue and found that a shadow copy operation was running.
Event ID 46 was also logged because an MPIO path was removed.

8/9/2016	8:46:06 PM	Error	ABCC1DOCS02.abc.local	29	volsnap	The shadow copies of volume T: were aborted during detection.
8/9/2016	8:46:06 PM	Information	ABCC1DOCS02.abc.local	46	mpio	Path 77040004 was removed from \Device\MPIODisk1 due to a PnP event. The dump data contains the current number of paths.
8/9/2016	8:46:06 PM	Error	ABCC1DOCS02.abc.local	9295	AAFsFlt	Metadata could not be written to volume \Device\HarddiskVolume3. Status is in the data.
8/9/2016	8:46:06 PM	Warning	ABCC1DOCS02.abc.local	9293	AAFsFlt	Volume \Device\HarddiskVolume3 has been disabled because of a PnP surprise removal event.
8/9/2016	8:46:06 PM	Warning	ABCC1DOCS02.abc.local	140	Microsoft-Windows-Ntfs	The system failed to flush data to the transaction log. Corruption may occur in VolumeId: Q:, DeviceName: \Device\HarddiskVolume3. (A device which does not exist was specified.)

The following events relate to AppAssure, which was running at the time of the issue.

8/9/2016	8:54:28 PM	Error	ABCC1DOCS02.abc.local	9295	AAFsFlt	Metadata could not be written to volume \Device\HarddiskVolume9. Status is in the data.
8/9/2016	8:54:28 PM	Error	ABCC1DOCS02.abc.local	9295	AAFsFlt	Metadata could not be written to volume \Device\HarddiskVolume9. Status is in the data.
8/9/2016	8:54:28 PM	Error	ABCC1DOCS02.abc.local	9289	AAFsFlt	Log file could not be opened or re-opened for device \Device\HarddiskVolume9. The volume has been disabled. The failure status code is the last word of the data.
8/9/2016	8:54:28 PM	Error	ABCC1DOCS02.abc.local	27	volsnap	The shadow copies of volume T: were aborted during detection because a critical control file could not be opened.

The following event indicates that the Kaspersky service was also active at the time of the issue.

8/9/2016	8:57:39 PM	Error	ABCC1DOCS02.abc.local		KLIF	Volume ID of the volume ‘\Device\HarddiskVolume13’ is already in use. iSwift and Section Verdict Cache will not apply to this volume.

8/9/2016	10:03:18 PM	Error	ABCC1DOCS02.abc.local	20	iScsiPrt	Connection to the target was lost. The initiator will attempt to retry the connection.
8/9/2016	10:03:18 PM	Error	ABCC1DOCS02.abc.local	7	iScsiPrt	The initiator could not send an iSCSI PDU. Error status is given in the dump data.

Application Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
8/9/2016	8:42:46 PM	Error	ABCC1DOCS02.abc.local	8194	VSS	Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. . This is often caused by incorrect security settings in either the writer or requestor process. Operation: Gathering Writer Data Context: Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220} Writer Name: System Writer Writer Instance ID: {99d281ed-b864-41b3-a3bd-eefa7f51623d}

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
2/27/2007 19:04	(6.0:6001.16459)	(7.2:0.0)	Adaptec, Inc.	Adaptec StorPort Ultra320 SCSI Driver (X64)
6/5/2012 13:59	(6.2:8415.0)	(6.2:8415.0)	Brocade Communications Systems, Inc.	Brocade FC/FCoE HBA Stor Miniport Driver
6/5/2012 14:02	(6.2:8415.0)	(6.2:8415.0)	Brocade Communications Systems, Inc.	Brocade FC/FCoE HBA Stor Miniport Driver
2/23/2012 20:21	(6.2:8220.0)	(7.0:51.0)	Broadcom Corporation	Broadcom NetXtreme Unified Crash Dump (x64)
11/3/2011 15:06	(6.2:8128.0)	(7.0:50.0)	Broadcom Corporation	Broadcom NetXtreme FCoE Crash Dump (x64)
11/10/2011 1:04	(6.2:8128.0)	(7.0:16.50)	Broadcom Corporation	FCoE offload x64 FREE
2/23/2012 20:17	(6.2:8220.0)	(7.0:13.50)	Broadcom Corporation	iSCSI offload x64 FREE
7/23/2012 19:30	(7.0:1.36)	(7.0:1.36)	Broadcom Corporation	Broadcom NetXtreme II GigE VBD
9/21/2012 1:07	(4.5:0.6409)	(4.5:0.6409)	Dell	Dell EqualLogic Device Specific Module
7/24/2012 8:22	(7.0:35.95)	(7.0:35.95)	Broadcom Corporation	Broadcom NetXtreme II 10 GigE VBD
6/6/2006 17:11	(7.10:0.0)	(7.10:0.0)	IBM Corporation	IBM ServeRAID Controller Driver
5/31/2012 13:11	(9.1:9.205)	(9.1:9.205)	QLogic Corporation	QLogic Fibre Channel Stor Miniport Driver
12/7/2011 19:48	(2.1:5.10)	(2.1:5.10)	QLogic Corporation	QLogic iSCSI Storport Miniport Driver

______________________________________________________________________________

System Information: ABCC1DOCS01

OS Name	Microsoft Windows Server 2012 Standard
Version	6.2.9200 Build 9200
Other OS Description	Not Available
OS Manufacturer	Microsoft Corporation
System Name	ABCC1DOCS01
System Manufacturer	VMware, Inc.
System Model	VMware Virtual Platform
System Type	x64-based PC
System SKU	
Processor	AMD Opteron(tm) Processor 6380, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date	Phoenix Technologies LTD 6.00, 9/17/2015

System Events:

Checked the events and found Event ID 1135.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
8/9/2016	8:45:28 PM	Critical	ABCC1DOCS01.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABCC1DOCS02’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

The same iSCSI errors were logged on this node, which indicates a storage disconnect.

8/9/2016	8:45:32 PM	Error	ABCC1DOCS01.abc.local	20	iScsiPrt	Connection to the target was lost. The initiator will attempt to retry the connection.
8/9/2016	8:45:32 PM	Error	ABCC1DOCS01.abc.local	7	iScsiPrt	The initiator could not send an iSCSI PDU. Error status is given in the dump data.

Because the disks were disconnected, NTFS logged a warning as it was unable to flush the transaction log.

8/9/2016	8:45:58 PM	Warning	ABCC1DOCS01.abc.local	140	Microsoft-Windows-Ntfs	The system failed to flush data to the transaction log. Corruption may occur in VolumeId: Q:, DeviceName: \Device\HarddiskVolume10. ({Write Protect Error} The disk cannot be written to because it is write protected. Please remove the write protection from the volume %hs in drive %hs.)

Found events related to the backup job that was running.

8/9/2016	8:45:58 PM	Error	ABCC1DOCS01.abc.local	9295	AAFsFlt	Metadata could not be written to volume \Device\HarddiskVolume10. Status is in the data.
8/9/2016	8:45:59 PM	Error	ABCC1DOCS01.abc.local	1038	Microsoft-Windows-FailoverClustering	Ownership of cluster disk ‘Cluster Data (T:)’ has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
8/9/2016	8:46:02 PM	Error	ABCC1DOCS01.abc.local	14	volsnap	The shadow copies of volume T: were aborted because of an IO failure on volume T:.

Because the disk was unexpectedly removed from the node, corruption occurred on volume Q:.

8/9/2016	8:58:33 PM	Error	ABCC1DOCS01.abc.local		Ntfs	A corruption was discovered in the file system structure on volume Q:. The exact nature of the corruption is unknown. The file system structures need to be scanned and fixed offline.

The last event occurred at 9:08:09 PM on 8/9/2016, after which the issue was resolved.

8/9/2016	9:08:09 PM	Error	ABCC1DOCS01.abc.local		iScsiPrt	The initiator could not send an iSCSI PDU. Error status is given in the dump data.

Application Events:

Checked the events but was not able to find anything specific related to the issue.

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
2/27/2007 19:04	(6.0:6001.16459)	(7.2:0.0)	Adaptec, Inc.	Adaptec StorPort Ultra320 SCSI Driver (X64)
6/5/2012 13:59	(6.2:8415.0)	(6.2:8415.0)	Brocade Communications Systems, Inc.	Brocade FC/FCoE HBA Stor Miniport Driver
6/5/2012 14:02	(6.2:8415.0)	(6.2:8415.0)	Brocade Communications Systems, Inc.	Brocade FC/FCoE HBA Stor Miniport Driver
2/23/2012 20:21	(6.2:8220.0)	(7.0:51.0)	Broadcom Corporation	Broadcom NetXtreme Unified Crash Dump (x64)
11/3/2011 15:06	(6.2:8128.0)	(7.0:50.0)	Broadcom Corporation	Broadcom NetXtreme FCoE Crash Dump (x64)
11/10/2011 1:04	(6.2:8128.0)	(7.0:16.50)	Broadcom Corporation	FCoE offload x64 FREE
2/23/2012 20:17	(6.2:8220.0)	(7.0:13.50)	Broadcom Corporation	iSCSI offload x64 FREE
7/23/2012 19:30	(7.0:1.36)	(7.0:1.36)	Broadcom Corporation	Broadcom NetXtreme II GigE VBD
9/21/2012 1:07	(4.5:0.6409)	(4.5:0.6409)	Dell	Dell EqualLogic Device Specific Module
7/24/2012 8:22	(7.0:35.95)	(7.0:35.95)	Broadcom Corporation	Broadcom NetXtreme II 10 GigE VBD
6/6/2006 17:11	(7.10:0.0)	(7.10:0.0)	IBM Corporation	IBM ServeRAID Controller Driver
5/31/2012 13:11	(9.1:9.205)	(9.1:9.205)	QLogic Corporation	QLogic Fibre Channel Stor Miniport Driver
12/7/2011 19:48	(2.1:5.10)	(2.1:5.10)	QLogic Corporation	QLogic iSCSI Storport Miniport Driver

_________________________________________________________________________________

Conclusion:

After analyzing the logs, we found that the issue started when the nodes lost their connection to the storage. This can happen when any device used to establish the connection between the nodes and the SAN fails, for example a switch or router. Since the issue occurred on both nodes, the cause is likely a device that is common to these two VMs.
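The reasoning that both nodes lost storage at nearly the same time can be checked by merging the two nodes' events into one timeline. The sketch below is only an illustration in Python: it uses four rows copied from the System Events tables above (first six columns only), and the helper names `parse_events` and `first_iscsi_loss` are hypothetical. It assumes the export uses the same tab-separated layout and US date format shown in this report.

```python
from datetime import datetime

# Rows copied from the System Events tables of both nodes (description column omitted).
RAW_EVENTS = """\
8/9/2016\t8:45:21 PM\tError\tABCC1DOCS02.abc.local\t20\tiScsiPrt
8/9/2016\t8:45:28 PM\tCritical\tABCC1DOCS01.abc.local\t1135\tMicrosoft-Windows-FailoverClustering
8/9/2016\t8:45:30 PM\tCritical\tABCC1DOCS02.abc.local\t1135\tMicrosoft-Windows-FailoverClustering
8/9/2016\t8:45:32 PM\tError\tABCC1DOCS01.abc.local\t20\tiScsiPrt
"""


def parse_events(raw: str) -> list:
    """Parse tab-separated event rows and return them sorted by time."""
    events = []
    for line in raw.strip().splitlines():
        date, time, level, computer, code, source = line.split("\t")[:6]
        events.append({
            "when": datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p"),
            "level": level,
            "node": computer.split(".")[0],  # short host name
            "code": int(code),
            "source": source,
        })
    return sorted(events, key=lambda e: e["when"])


def first_iscsi_loss(events: list) -> dict:
    """Return {node: time of its first iScsiPrt Event ID 20}."""
    first = {}
    for e in events:
        if e["source"] == "iScsiPrt" and e["code"] == 20:
            first.setdefault(e["node"], e["when"])
    return first


events = parse_events(RAW_EVENTS)
losses = first_iscsi_loss(events)
gap = abs(losses["ABCC1DOCS01"] - losses["ABCC1DOCS02"])

for e in events:
    print(e["when"].time(), e["node"], e["source"], e["code"])
print("Seconds between iSCSI loss on the two nodes:", gap.total_seconds())  # 11.0
```

With these rows, the two nodes report the lost iSCSI connection 11 seconds apart, which is consistent with a fault in a component shared by both VMs rather than a problem local to one node.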

Because antivirus software (Kaspersky) was active on the node at the time of the issue, note that the following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)

• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)

• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

Kindly check internally whether any device between the servers and the SAN malfunctioned, because the issue started after the storage disconnect.
Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article below and make sure all of these updates are installed on every node:

https://support.microsoft.com/en-us/kb/2784261