RCA - 9 - Cluster Nodes evicted with event id 1135 on Cluster

Issue Description:

Nodes were evicted from the cluster aaborhv-clstr-2 with event ID 1135 on 20th July. The cluster nodes run Microsoft Windows Server 2012 R2 Standard, Version 6.3.9600 Build 9600.

Initial Description:

In this case, resources failed over from one node to another. This generally happens when the node that was running the resource is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node that was running the resource is removed from the failover cluster membership (event ID 1135), which causes its resources to fail over to another node.

Why is Event ID 1135 Logged?

This event is logged on all nodes in the Cluster except the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes, which then send a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network and the other nodes are up. The example below should help clarify this:

If any one of these packets is not returned, the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request to W2K8-R2-NODE1 and receives a response, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get a response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
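The threshold logic described above can be sketched as a small simulation. This is an illustrative model only, not the Cluster service's actual implementation: the names `Route` and `node_is_member`, the one-heartbeat-per-second interval, and the treatment of the limit as 5 consecutive misses are all assumptions made for the example. Only the limit of 5 is taken from the text.

```python
from dataclasses import dataclass
from typing import List

MISS_LIMIT = 5  # from the text: 5 failures in 5 seconds (one heartbeat per second assumed)


@dataclass
class Route:
    """One network path from the local node to a peer node (hypothetical model)."""
    name: str
    consecutive_misses: int = 0

    def record(self, reply_received: bool) -> None:
        # A reply resets the counter; a missed reply increments it.
        if reply_received:
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1

    @property
    def is_up(self) -> bool:
        return self.consecutive_misses < MISS_LIMIT


def node_is_member(routes: List[Route]) -> bool:
    """The peer stays an active member while at least one route is still up."""
    return any(route.is_up for route in routes)


# Two routes to W2K8-R2-NODE2; replies stop on both of them.
routes = [Route("cluster-network-1"), Route("cluster-network-2")]
for _ in range(4):
    for route in routes:
        route.record(reply_received=False)
print(node_is_member(routes))  # four misses per route: still a member (True)

for route in routes:
    route.record(reply_received=False)
print(node_is_member(routes))  # fifth miss on every route: removed, Event 1135 (False)
```

If only one of the two routes reaches the limit, `node_is_member` still returns True, which matches the statement that the node remains an active member while other routes are up.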

Reference :

Having a problem with nodes being removed from active Failover Cluster membership?

http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

Scenario 1

The issue happened on 25th July.

__________________________________________________________________________________________

System Information: ABCOUHV01

OS Name Microsoft Windows Server 2012 R2 Standard

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ABCOUHV01

System Manufacturer HP

System Model ProLiant BL460c Gen8

System Type x64-based PC

System SKU 641016-B21

Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date HP I31, 6/1/2015

System Events:

Checked the machine and found event ID 1135, which reports that the cluster node ABCOUHV03 was removed from the cluster.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/25/2016	10:52:21 PM	Critical	ABCOUHV01.abcdgrp.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABCOUHV03’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
7/25/2016	10:52:22 PM	Error	ABCOUHV01.abcdgrp.local	21502	Microsoft-Windows-Hyper-V-High-Availability	‘SCVMM ABCOUWIN801 Configuration’ failed to register the virtual machine with the virtual machine management service.
7/25/2016	10:52:22 PM	Error	ABCOUHV01.abcdgrp.local	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘SCVMM ABCOUWIN801 Configuration’ of type ‘Virtual Machine Configuration’ in clustered role ‘SCVMM ABCOUWIN801 Resources’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.‘). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.

Based on the error, the VM failed because its configuration file could not be found (error 0x2: The system cannot find the file specified).
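When many event rows have to be checked, the removed node can be extracted from the Event 1135 descriptions programmatically. The following is a minimal sketch, assuming the events are exported as tab-separated rows with the same seven columns as the table above. The function names and the column keys are hypothetical, and the second sample row is shortened for brevity.

```python
import re

# Matches the node name in an Event 1135 description, with straight or curly quotes.
REMOVED_NODE = re.compile(
    r"Cluster node [\u2018\u2019']([^\u2018\u2019']+)[\u2018\u2019'] was removed"
)

COLUMNS = ["date", "time", "level", "computer", "event_id", "source", "description"]


def parse_event_row(line):
    """Split one tab-separated event row into named fields."""
    fields = line.rstrip("\n").split("\t", len(COLUMNS) - 1)
    return dict(zip(COLUMNS, fields))


def removed_nodes(lines):
    """Return (reporting computer, removed node) pairs for every Event 1135 row."""
    pairs = []
    for line in lines:
        event = parse_event_row(line)
        if event.get("event_id") == "1135":
            match = REMOVED_NODE.search(event.get("description", ""))
            if match:
                pairs.append((event["computer"], match.group(1)))
    return pairs


rows = [
    "7/25/2016\t10:52:21 PM\tCritical\tABCOUHV01.abcdgrp.local\t1135\t"
    "Microsoft-Windows-FailoverClustering\t"
    "Cluster node \u2018ABCOUHV03\u2019 was removed from the active failover cluster membership.",
    "7/25/2016\t10:52:22 PM\tError\tABCOUHV01.abcdgrp.local\t1069\t"
    "Microsoft-Windows-FailoverClustering\t"
    "Cluster resource \u2018SCVMM ABCOUWIN801 Configuration\u2019 (description shortened)",
]
print(removed_nodes(rows))
```

Only the 1135 row yields a pair, so for the sample rows the result names ABCOUHV01 as the reporting node and ABCOUHV03 as the removed node.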

Application Events:

Checked the application events but was not able to find anything specific related to the issue.

List of outdated drivers:

Date/Time	Version	Version	Manufacturer	Description
12/20/2014 8:47	(10.4:215.0)	(10.4:215.0)	Emulex	Emulex Plus Driver
10/26/2015 16:58	(10.4:364.0)	(10.4:364.0)	Emulex	Emulex FCoE Storport Miniport Driver
12/26/2015 17:03	(9.0:0.902)	(9.0:0.902)	Veeam Software AG	CTK file system minifilter
12/30/2015 12:31	(6.3:9600.16384)	(10.7:206.0)	Emulex	Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)
8/6/2013 16:00	(9.15:1.102)	(9.15:1.102)	Matrox Graphics Inc.	MxG2hDO64.sys

Cluster Logs:

00005548.000051a8::2016/07/25-03:48:53.654 INFO [RES] Physical Disk: Supplied device path Q: is a disk path, status 2

00005548.000051a8::2016/07/25-03:48:53.654 INFO [RES] Physical Disk: Failed to open device Q:, status 3

00005548.000051a8::2016/07/25-03:48:53.654 WARN [RHS] Error 3 from resource type control for restype Physical Disk.

00003328.00003618::2016/07/25-03:48:53.785 INFO [GEM] Node 2: Deleting [3:54298 , 3:54310] (both included) as it has been ack’d by every node

00003328.00004648::2016/07/25-03:53:27.359 ERR [REMP] HandleHwprvPostCommitSnapshot: unknown snapshot set 7488383a-f77f-49c6-abd5-3f30519b790c

00003328.00001338::2016/07/25-03:53:27.359 INFO [WRTA] OnEndLocalSnapshotSet: snapshot set 97a00019-cb80-40ad-b304-5d5f70da06e1, initiator SnapshotInitiatorAgentHwProvider

00003328.00004648::2016/07/25-03:53:27.381 INFO [DCM] ClusterSnapshotSetHandler(exit): 97a00019-cb80-40ad-b304-5d5f70da06e1, isLocal false, event HwPostCommitSnapshot, HrError(0x00000000)
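To filter a long cluster log for warnings and errors, each line can be split into its parts. The sketch below assumes the layout seen in the excerpt above (process ID, thread ID, timestamp, level, component in brackets, message). The regular expression and the function name are illustrative, not an official log schema. The timestamp is left timezone-naive; whether it is UTC or local time depends on how the log was generated and should be confirmed before comparing it with the event log times.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: PID.TID::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [COMPONENT] message
LOG_LINE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[^\]]+)\]\s*(?P<message>.*)$"
)


def parse_cluster_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


sample = [
    "00005548.000051a8::2016/07/25-03:48:53.654 INFO [RES] Physical Disk: "
    "Failed to open device Q:, status 3",
    "00005548.000051a8::2016/07/25-03:48:53.654 WARN [RHS] Error 3 from "
    "resource type control for restype Physical Disk.",
    "00003328.00004648::2016/07/25-03:53:27.359 ERR [REMP] "
    "HandleHwprvPostCommitSnapshot: unknown snapshot set "
    "7488383a-f77f-49c6-abd5-3f30519b790c",
]

# Keep only the entries that are not informational.
for line in sample:
    entry = parse_cluster_log_line(line)
    if entry and entry["level"] != "INFO":
        print(entry["ts"], entry["level"], entry["component"], entry["message"])
```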

____________________________________________________________________________________________

System Information: ABCOUHV07

OS Name Microsoft Windows Server 2012 R2 Standard

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ABCOUHV07

System Manufacturer HP

System Model ProLiant BL460c Gen8

System Type x64-based PC

System SKU 641016-B21

Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date HP I31, 6/1/2015

System Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/20/2016	6:44:18 AM	Critical	ABCOUHV07	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABCOUHV06’ was removed from the active failover cluster membership.

Application Events:

Checked the application events but was not able to find anything specific related to the issue.

List of outdated drivers:

Date/Time	Version	Version	Manufacturer	Description
12/20/2014 8:47	(10.4:215.0)	(10.4:215.0)	Emulex	Emulex Plus Driver
1/17/2015 21:57	(10.4:246.0)	(10.4:246.0)	Emulex	Emulex FCoE Storport Miniport Driver
3/2/2015 12:01	(6.3:9600.17246)	(63.10:0.64)	PMC-Sierra, Inc.	Smart Array SAS/SATA Controller Storport Driver
12/26/2015 17:03	(9.0:0.902)	(9.0:0.902)	Veeam Software AG	CTK file system minifilter
9/3/2015 14:01	(6.3:9600.16384)	(10.6:236.0)	Emulex	Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)

__________________________________________________________________________________________

System Information: ABCOUHV06

OS Name Microsoft Windows Server 2012 R2 Standard

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ABCOUHV06

System Manufacturer HP

System Model ProLiant BL460c Gen8

System Type x64-based PC

System SKU 641016-B21

Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date HP I31, 6/1/2015

System Events:

Checked the system events and found that the MPIO paths were removed before the issue.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/20/2016	6:23:47 AM	Information	ABCOUHV06.abcdgrp.local	46	mpio	Path 77010000 was removed from \Device\MPIODisk523 due to a PnP event. The dump data contains the current number of paths.
7/20/2016	6:23:47 AM	Information	ABCOUHV06.abcdgrp.local	46	mpio	Path 77010000 was removed from \Device\MPIODisk524 due to a PnP event. The dump data contains the current number of paths.

Checked the node and found that the Cluster Shared Volume entered a paused state. The status code c000020c in the event corresponds to STATUS_CONNECTION_DISCONNECTED. This generally happens when a backup job is running.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/20/2016	6:44:06 AM	Error	ABCOUHV06.abcdgrp.local	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
7/20/2016	6:44:06 AM	Error	ABCOUHV06.abcdgrp.local	7024	Service Control Manager	The Cluster Service service terminated with the following service-specific error: The semaphore timeout period has expired.
7/20/2016	6:44:06 AM	Error	ABCOUHV06.abcdgrp.local	7031	Service Control Manager	The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.

Application Events:

In the Application logs, the 3PAR VSS provider first reported that the target LUN is not a 3PAR Virtual Volume (event 5065). After that, the Cluster Shared Volume entered a paused state and the Cluster service terminated.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/20/2016	5:44:02 AM	Error	ABCOUHV06.abcdgrp.local	5065	3PARVSSProvider	3PARVSS5065: ERROR: Target LUN HP is not a 3PAR Virtual Volume.
7/20/2016	5:46:39 AM	Error	ABCOUHV06.abcdgrp.local	12305	VSS	Volume Shadow Copy Service error: Volume/disk not connected or not found. Error context: DeviceIoControl(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy864 - 0000000000000180,0x00560038,0000000000000000,0,0000008854909D70,4096,[0]). Operation: Removing auto-release shadow copies Loading provider Context: Volume Name: \\?\Volume{d45ca367-4e59-11e6-80c9-0017a4770050}\ Execution Context: System Provider

As per the HP documentation at http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04533976-2.pdf:

Error:

3PARVSS5065: ERROR: Target LUN <Lun Id> is not a 3PAR Virtual Volume.

Details:

The designated target LUN is not an HP 3PAR virtual volume.

Resolution:

• Verify that all database and log destinations belong to the HP 3PAR StoreServ Storage System.
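Putting the events from ABCOUHV06 and ABCOUHV07 on one timeline makes the order of the failures on 20th July easier to see. The sketch below uses only the timestamps and event IDs from the tables above; the summaries are shortened by hand, the function name is hypothetical, and it assumes that the clocks of both nodes are synchronized.

```python
from datetime import datetime

# (timestamp, reporting node, event ID, shortened summary) copied from the event tables
EVENTS = [
    ("7/20/2016 6:44:18 AM", "ABCOUHV07", 1135, "ABCOUHV06 removed from cluster membership"),
    ("7/20/2016 6:23:47 AM", "ABCOUHV06", 46, "MPIO path removed (MPIODisk523, MPIODisk524)"),
    ("7/20/2016 6:44:06 AM", "ABCOUHV06", 5120, "CSV Volume2 paused (c000020c)"),
    ("7/20/2016 6:44:06 AM", "ABCOUHV06", 7024, "Cluster Service terminated (semaphore timeout)"),
    ("7/20/2016 5:44:02 AM", "ABCOUHV06", 5065, "3PAR VSS provider: target LUN is not a 3PAR Virtual Volume"),
    ("7/20/2016 5:46:39 AM", "ABCOUHV06", 12305, "VSS: volume/disk not connected or not found"),
]


def build_timeline(events):
    """Parse the timestamps and return the events sorted by time (stable for ties)."""
    parsed = [
        (datetime.strptime(ts, "%m/%d/%Y %I:%M:%S %p"), node, event_id, text)
        for ts, node, event_id, text in events
    ]
    return sorted(parsed, key=lambda item: item[0])


timeline = build_timeline(EVENTS)
start = timeline[0][0]
for when, node, event_id, text in timeline:
    offset_min = (when - start).total_seconds() / 60
    print(f"+{offset_min:5.1f} min  {node}  {event_id:>5}  {text}")
```

Sorted this way, the 3PAR VSS provider error comes first, the MPIO path removal follows about 40 minutes later, and the CSV pause, the Cluster service termination and the Event 1135 on ABCOUHV07 all fall within about an hour of the first error.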

List of outdated drivers:

Date/Time	Version	Version	Manufacturer	Description
12/20/2014 8:47	(10.4:215.0)	(10.4:215.0)	Emulex	Emulex Plus Driver
1/17/2015 21:57	(10.4:246.0)	(10.4:246.0)	Emulex	Emulex FCoE Storport Miniport Driver
3/2/2015 12:01	(6.3:9600.17246)	(63.10:0.64)	PMC-Sierra, Inc.	Smart Array SAS/SATA Controller Storport Driver
12/26/2015 17:03	(9.0:0.902)	(9.0:0.902)	Veeam Software AG	CTK file system minifilter
9/3/2015 14:01	(6.3:9600.16384)	(10.6:236.0)	Emulex	Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)
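The three driver lists can be compared to see which drivers differ between nodes. The following is a rough sketch that uses the second version column of each table and only the drivers that appear on all three nodes. The dictionary layout and the function name are assumptions made for the example.

```python
from collections import defaultdict

PLUS = "Emulex Plus Driver"
FCOE = "Emulex FCoE Storport Miniport Driver"
CTK = "CTK file system minifilter"
NDIS = "Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)"

# Second version column from each node's driver table in this report.
DRIVERS = {
    "ABCOUHV01": {PLUS: "10.4:215.0", FCOE: "10.4:364.0", CTK: "9.0:0.902", NDIS: "10.7:206.0"},
    "ABCOUHV06": {PLUS: "10.4:215.0", FCOE: "10.4:246.0", CTK: "9.0:0.902", NDIS: "10.6:236.0"},
    "ABCOUHV07": {PLUS: "10.4:215.0", FCOE: "10.4:246.0", CTK: "9.0:0.902", NDIS: "10.6:236.0"},
}


def version_mismatches(drivers_by_node):
    """Return {driver: {node: version}} for drivers whose version differs across nodes."""
    versions = defaultdict(dict)
    for node, drivers in drivers_by_node.items():
        for name, version in drivers.items():
            versions[name][node] = version
    return {
        name: per_node
        for name, per_node in versions.items()
        if len(set(per_node.values())) > 1
    }


for name, per_node in version_mismatches(DRIVERS).items():
    print(name)
    for node, version in sorted(per_node.items()):
        print(f"  {node}: {version}")
```

With the data above, the Emulex FCoE Storport and NDIS miniport drivers are reported as different on ABCOUHV01 compared with ABCOUHV06 and ABCOUHV07, which is worth keeping in mind when the drivers are updated.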

__________________________________________________________________________________________

Conclusion:

We analyzed the logs for the 20th and the 25th. In both cases the issue started with the MPIO paths going into a failed state, which happens only while a backup is being taken. We can also see errors from the third-party VSS provider (3PARVSSProvider) at that time. At this point we can proceed with the following action plan:

Please install the following hotfixes on all the nodes of the Cluster:
- https://support.microsoft.com/en-us/kb/3156418 (for the issue in which the CSV goes into a paused state)
- https://support.microsoft.com/en-us/kb/2920151 (recommended hotfixes for all the nodes)

The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)

• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)

• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

Please update the backup software first, as the issue occurs while the backup is running.

Since the MPIO paths are failing, please update the components related to storage and the SAN.
1. Please update the Fibre Channel drivers:

Manufacturer: Emulex Corporation

Serial Number: H3533664NP

Model: 554FLB

Driver version: 10.4.364.0

firmware version: 10.5.155.0

Please check with the backup team or the HP team to find the root cause, because we have seen issues with SAN-related software while a backup is run from a third-party application. The cause could be any of the intermediate components.

Not all nodes of the Cluster are in the same OU. Kindly move ABCOUHV01 to Computers so that it matches the other nodes.

Fqdn	Domain	Domain Role	Site Name	Organizational Unit
ABCOUHV01.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	OU=Hyper-V Hosts
ABCOUHV02.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV03.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV04.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV05.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV06.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV07.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV08.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV09.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
ABCOUHV10.abcdgrp.local	abcdgrp.local	Member Server	TRG-Headquarters	Computers
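The OU check above can be repeated for any node list with a few lines of code. This is a small sketch that takes the node names and OU values from the table; the function name is hypothetical, and the OU that most nodes share is treated as the expected one.

```python
from collections import Counter

# Organizational Unit per node, as listed in the table above.
NODE_OUS = {"ABCOUHV01": "OU=Hyper-V Hosts"}
NODE_OUS.update({f"ABCOUHV{i:02d}": "Computers" for i in range(2, 11)})


def nodes_outside_majority_ou(node_ous):
    """Return the most common OU and the sorted list of nodes that are not in it."""
    majority_ou, _count = Counter(node_ous.values()).most_common(1)[0]
    outliers = sorted(node for node, ou in node_ous.items() if ou != majority_ou)
    return majority_ou, outliers


majority, outliers = nodes_outside_majority_ou(NODE_OUS)
print(majority, outliers)
```

For the ten nodes listed, the majority OU is Computers and the only outlier is ABCOUHV01.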