RCA – 9 – Cluster Nodes evicted with event id 1135 on Cluster

Issue Description:

 

Cluster nodes were evicted with Event ID 1135 on cluster aaborhv-clstr-2, running Microsoft Windows Server 2012 R2 Standard (Version 6.3.9600, Build 9600), on 20th July.

 

Initial Description:

 

In this case the resources fail over from one node to another. This generally happens when the node on which the resource was running is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from the failover cluster membership (Event ID 1135), which causes the resources to fail over to another node.

 

Why is Event ID 1135 Logged?

This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network" (the referenced article was written for these versions; the cluster in this case runs Windows Server 2012 R2). The nodes send heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes, which then send a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network is up and the other nodes are up. The example below should help clarify this:

 

If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get a response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
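The threshold logic described above can be sketched in a few lines of Python. This is an illustrative model only, not the Cluster service's actual implementation: the `Route` class, the `node_is_evicted` function and the network names are hypothetical, the threshold of 5 is the default quoted above, and the sketch assumes that misses are counted consecutively and that one heartbeat is sent per second.

```python
from dataclasses import dataclass

THRESHOLD = 5  # default from the text: 5 missed heartbeats before a route is marked down


@dataclass
class Route:
    """One network route from the local node to a peer node (hypothetical model)."""
    name: str
    missed: int = 0  # consecutive heartbeats without a response (assumption)

    def record(self, got_response: bool) -> None:
        # A response resets the counter; a miss increments it.
        self.missed = 0 if got_response else self.missed + 1

    @property
    def is_down(self) -> bool:
        return self.missed >= THRESHOLD


def node_is_evicted(routes: list[Route]) -> bool:
    """A peer is removed from membership only when every route to it is down."""
    return all(route.is_down for route in routes)


# Two routes to the same peer, one heartbeat per second for five seconds.
routes = [Route("cluster-net-1"), Route("cluster-net-2")]
for _ in range(5):
    routes[0].record(got_response=False)  # first network drops every heartbeat
    routes[1].record(got_response=True)   # second network keeps responding
print(node_is_evicted(routes))  # one route is still up, so the peer stays a member

for _ in range(5):
    routes[1].record(got_response=False)  # now the second network fails too
print(node_is_evicted(routes))  # all routes down: this is when Event 1135 would be logged
```

The point of the model is that a single failing network does not evict a node; eviction needs every route to cross the threshold.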

Reference:

Having a problem with nodes being removed from active Failover Cluster membership?

http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

 

Scenario 1

 

Issue happened on 25th July.

__________________________________________________________________________________________

 

System Information: ABCOUHV01

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCOUHV01

System Manufacturer        HP

System Model        ProLiant BL460c Gen8

System Type        x64-based PC

System SKU        641016-B21

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP I31, 6/1/2015

 

System Events:

 

  • Checked the machine and found Event ID 1135, which says that cluster node ABCOUHV03 was removed from the cluster.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/25/2016 | 10:52:21 PM | Critical | ABCOUHV01.abcdgrp.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCOUHV03’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
7/25/2016 | 10:52:22 PM | Error | ABCOUHV01.abcdgrp.local | 21502 | Microsoft-Windows-Hyper-V-High-Availability | ‘SCVMM ABCOUWIN801 Configuration’ failed to register the virtual machine with the virtual machine management service.
7/25/2016 | 10:52:22 PM | Error | ABCOUHV01.abcdgrp.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SCVMM ABCOUWIN801 Configuration’ of type ‘Virtual Machine Configuration’ in clustered role ‘SCVMM ABCOUWIN801 Resources’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.

 

  • Based on the error, the VM failed because it was not able to find its configuration file: "The system cannot find the file specified."

 

Application Events:

 

  • Checked the application events but was not able to find anything specific related to the issue.

 

List of outdated drivers:

 

 

Date/Time | File Version | Product Version | Source | Description
12/20/2014 8:47 | (10.4:215.0) | (10.4:215.0) | Emulex | Emulex Plus Driver
10/26/2015 16:58 | (10.4:364.0) | (10.4:364.0) | Emulex | Emulex FCoE Storport Miniport Driver
12/26/2015 17:03 | (9.0:0.902) | (9.0:0.902) | Veeam Software AG | CTK file system minifilter
12/30/2015 12:31 | (6.3:9600.16384) | (10.7:206.0) | Emulex | Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)
8/6/2013 16:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys

 

Cluster Logs:

 

 

00005548.000051a8::2016/07/25-03:48:53.654 INFO  [RES] Physical Disk: Supplied device path Q: is a disk path, status 2

00005548.000051a8::2016/07/25-03:48:53.654 INFO  [RES] Physical Disk: Failed to open device Q:, status 3

00005548.000051a8::2016/07/25-03:48:53.654 WARN  [RHS] Error 3 from resource type control for restype Physical Disk.

00003328.00003618::2016/07/25-03:48:53.785 INFO  [GEM] Node 2: Deleting [3:54298 , 3:54310] (both included) as it has been ack’d by every node

 

 

00003328.00004648::2016/07/25-03:53:27.359 ERR   [REMP] HandleHwprvPostCommitSnapshot: unknown snapshot set 7488383a-f77f-49c6-abd5-3f30519b790c

00003328.00001338::2016/07/25-03:53:27.359 INFO  [WRTA] OnEndLocalSnapshotSet: snapshot set 97a00019-cb80-40ad-b304-5d5f70da06e1, initiator SnapshotInitiatorAgentHwProvider

00003328.00004648::2016/07/25-03:53:27.381 INFO  [DCM] ClusterSnapshotSetHandler(exit): 97a00019-cb80-40ad-b304-5d5f70da06e1, isLocal false, event HwPostCommitSnapshot, HrError(0x00000000)

 

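When a cluster log is long, it can help to split each line into its fields and keep only the warnings and errors. The sketch below assumes the line layout seen in the excerpts above (process ID, thread ID, timestamp, level, component in square brackets, message); the regular expression and the `parse_line` helper are illustrative and are not part of any Microsoft tool.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts: pid.tid::timestamp LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[A-Za-z]+)\]\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one cluster log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


# Three lines copied from the log excerpt above.
SAMPLE = [
    "00005548.000051a8::2016/07/25-03:48:53.654 INFO  [RES] Physical Disk: Failed to open device Q:, status 3",
    "00005548.000051a8::2016/07/25-03:48:53.654 WARN  [RHS] Error 3 from resource type control for restype Physical Disk.",
    "00003328.00004648::2016/07/25-03:53:27.359 ERR   [REMP] HandleHwprvPostCommitSnapshot: unknown snapshot set 7488383a-f77f-49c6-abd5-3f30519b790c",
]

# Keep only entries at WARN or ERR level.
problems = [e for e in map(parse_line, SAMPLE) if e and e["level"] in {"WARN", "ERR"}]
for e in problems:
    print(e["ts"], e["level"], e["component"], e["message"])
```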

____________________________________________________________________________________________

 

System Information: ABCOUHV07

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCOUHV07

System Manufacturer        HP

System Model        ProLiant BL460c Gen8

System Type        x64-based PC

System SKU        641016-B21

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP I31, 6/1/2015

 

System Events:

 

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/20/2016 | 6:44:18 AM | Critical | ABCOUHV07 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCOUHV06’ was removed from the active failover cluster membership.

 

Application Events:

  • Checked the application events but was not able to find anything specific related to the issue.

 

 

List of outdated drivers:

 

 

 

Date/Time | File Version | Product Version | Source | Description
12/20/2014 8:47 | (10.4:215.0) | (10.4:215.0) | Emulex | Emulex Plus Driver
1/17/2015 21:57 | (10.4:246.0) | (10.4:246.0) | Emulex | Emulex FCoE Storport Miniport Driver
3/2/2015 12:01 | (6.3:9600.17246) | (63.10:0.64) | PMC-Sierra, Inc. | Smart Array SAS/SATA Controller Storport Driver
12/26/2015 17:03 | (9.0:0.902) | (9.0:0.902) | Veeam Software AG | CTK file system minifilter
9/3/2015 14:01 | (6.3:9600.16384) | (10.6:236.0) | Emulex | Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)

__________________________________________________________________________________________

 

System Information: ABCOUHV06

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCOUHV06

System Manufacturer        HP

System Model        ProLiant BL460c Gen8

System Type        x64-based PC

System SKU        641016-B21

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP I31, 6/1/2015

 

System Events:

 

  • Checked the system events and found that the MPIO paths were removed before the issue.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/20/2016 | 6:23:47 AM | Information | ABCOUHV06.abcdgrp.local | 46 | mpio | Path 77010000 was removed from \Device\MPIODisk523 due to a PnP event. The dump data contains the current number of paths.
7/20/2016 | 6:23:47 AM | Information | ABCOUHV06.abcdgrp.local | 46 | mpio | Path 77010000 was removed from \Device\MPIODisk524 due to a PnP event. The dump data contains the current number of paths.

 

  • Checked the node and found that the Cluster Shared Volume went into a paused state with status c000020c (STATUS_CONNECTION_DISCONNECTED). This generally happens when a backup job is running.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error:  The semaphore timeout period has expired.
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

 

Application Events:

 

  • In the Application log we first see the 3PAR VSS provider error stating that the target LUN is not a 3PAR Virtual Volume, after which the Cluster Shared Volume went into a paused state and the Cluster service terminated.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/20/2016 | 5:44:02 AM | Error | ABCOUHV06.abcdgrp.local | 5065 | 3PARVSSProvider | 3PARVSS5065: ERROR: Target LUN HP is not a 3PAR Virtual Volume.
7/20/2016 | 5:46:39 AM | Error | ABCOUHV06.abcdgrp.local | 12305 | VSS | Volume Shadow Copy Service error: Volume/disk not connected or not found. Error context: DeviceIoControl(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy864 - 0000000000000180,0x00560038,0000000000000000,0,0000008854909D70,4096,[0]).  Operation:    Removing auto-release shadow copies    Loading provider Context:    Volume Name: \\?\Volume{d45ca367-4e59-11e6-80c9-0017a4770050}\    Execution Context: System Provider
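To see the order of events across nodes on 20th July, the entries from the ABCOUHV06 and ABCOUHV07 tables above can be merged into a single timeline. The sketch below is only an illustration: the tuples are transcribed by hand from those tables, and the `parse` helper and the column layout of the output are arbitrary choices.

```python
from datetime import datetime

# (timestamp, node, event ID, source), transcribed from the event tables above.
EVENTS = [
    ("7/20/2016 6:44:18 AM", "ABCOUHV07", 1135, "Microsoft-Windows-FailoverClustering"),
    ("7/20/2016 6:23:47 AM", "ABCOUHV06", 46, "mpio"),
    ("7/20/2016 6:44:06 AM", "ABCOUHV06", 5120, "Microsoft-Windows-FailoverClustering"),
    ("7/20/2016 6:44:06 AM", "ABCOUHV06", 7024, "Service Control Manager"),
    ("7/20/2016 6:44:06 AM", "ABCOUHV06", 7031, "Service Control Manager"),
    ("7/20/2016 5:44:02 AM", "ABCOUHV06", 5065, "3PARVSSProvider"),
    ("7/20/2016 5:46:39 AM", "ABCOUHV06", 12305, "VSS"),
]


def parse(timestamp):
    """Parse the 'M/D/YYYY H:MM:SS AM/PM' format used in the event tables."""
    return datetime.strptime(timestamp, "%m/%d/%Y %I:%M:%S %p")


# sorted() is stable, so events logged in the same second keep their listed order.
timeline = sorted(EVENTS, key=lambda event: parse(event[0]))
start = parse(timeline[0][0])

for timestamp, node, event_id, source in timeline:
    offset = parse(timestamp) - start  # time elapsed since the first event
    print(f"+{offset}  {node}  {event_id:>5}  {source}")
```

Read this way, the 3PAR VSS provider error comes first, the MPIO path removals follow, and the CSV pause, the Cluster service termination and the 1135 eviction arrive last.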

 

The HP documentation at http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04533976-2.pdf describes this error as follows:

 

Error:

3PARVSS5065: ERROR: Target LUN <Lun Id> is not a 3PAR Virtual Volume.

 

Details:

The designated target LUN is not an HP 3PAR virtual volume.

 

Resolution:

• Verify that all database and log destinations belong to the HP 3PAR StoreServ Storage System.

 

 

List of outdated drivers:

 

 

Date/Time | File Version | Product Version | Source | Description
12/20/2014 8:47 | (10.4:215.0) | (10.4:215.0) | Emulex | Emulex Plus Driver
1/17/2015 21:57 | (10.4:246.0) | (10.4:246.0) | Emulex | Emulex FCoE Storport Miniport Driver
3/2/2015 12:01 | (6.3:9600.17246) | (63.10:0.64) | PMC-Sierra, Inc. | Smart Array SAS/SATA Controller Storport Driver
12/26/2015 17:03 | (9.0:0.902) | (9.0:0.902) | Veeam Software AG | CTK file system minifilter
9/3/2015 14:01 | (6.3:9600.16384) | (10.6:236.0) | Emulex | Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)
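The driver lists for the three nodes are not identical, and a small comparison makes the differences easier to spot. The sketch below is illustrative only: it takes the second version value from each driver row above, transcribed by hand, and the helper names `parse_version` and `find_mismatches` are hypothetical.

```python
import re


def parse_version(text):
    """Turn a string such as '(10.4:364.0)' into the comparable tuple (10, 4, 364, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


# Second version value per node, copied from the driver lists above.
DRIVERS = {
    "Emulex Plus Driver": {
        "ABCOUHV01": "(10.4:215.0)",
        "ABCOUHV06": "(10.4:215.0)",
        "ABCOUHV07": "(10.4:215.0)",
    },
    "Emulex FCoE Storport Miniport Driver": {
        "ABCOUHV01": "(10.4:364.0)",
        "ABCOUHV06": "(10.4:246.0)",
        "ABCOUHV07": "(10.4:246.0)",
    },
    "Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64)": {
        "ABCOUHV01": "(10.7:206.0)",
        "ABCOUHV06": "(10.6:236.0)",
        "ABCOUHV07": "(10.6:236.0)",
    },
}


def find_mismatches(drivers):
    """Map each driver whose version differs across nodes to the nodes that are behind."""
    mismatches = {}
    for name, per_node in drivers.items():
        versions = {node: parse_version(value) for node, value in per_node.items()}
        if len(set(versions.values())) > 1:
            newest = max(versions.values())
            mismatches[name] = sorted(n for n, v in versions.items() if v < newest)
    return mismatches


for name, behind in find_mismatches(DRIVERS).items():
    print(f"{name}: older version on {', '.join(behind)}")
```

With the values above, the Emulex FCoE and NDIS drivers on ABCOUHV06 and ABCOUHV07 are older than those on ABCOUHV01, while the Emulex Plus Driver matches on all three nodes.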

 

__________________________________________________________________________________________

 

Conclusion:

 

  • We analyzed the logs for the 20th and the 25th. In both cases we found that the issue started with the MPIO paths going into a failed state, which happens only while a backup is being taken. We can also see errors from the third-party provider at that point. We can therefore proceed with the following action plan:

 

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)

• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)

• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

 

  1. Please update the backup software first, as the issue occurs while a backup is running.

  2. Since the MPIO paths are failing, please update the components related to storage and SAN.
    1. Please update the Fibre Channel drivers:

   Manufacturer: Emulex Corporation

   Serial Number: H3533664NP

   Model: 554FLB

   Driver version: 10.4.364.0

   Firmware version: 10.5.155.0

  3. Please check with the backup team or the HP team to find the reason for the issue, because we have seen problems with SAN-related software while a backup is run from a third-party application. This could be an issue with any of the intermediate components.

  4. Not all nodes of the cluster are in the same OU. Kindly move ABCOUHV01 to Computers.

Fqdn | Domain | Domain Role | Site Name | Organizational Unit
ABCOUHV01.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | OU=Hyper-V Hosts
ABCOUHV02.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV03.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV04.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV05.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV06.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV07.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV08.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV09.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers
ABCOUHV10.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers

 

Note: This might not be related to the issue you are facing, but it is a best practice that is not being followed on this machine.
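The OU check above can also be done programmatically once the node-to-OU mapping has been exported. The sketch below is a minimal illustration that hard-codes the values from the table; the `find_ou_outliers` helper is hypothetical, and treating the most common OU as the expected one is an assumption of this sketch.

```python
from collections import Counter

# Node-to-OU mapping taken from the table above: every node is in Computers except ABCOUHV01.
NODE_OUS = {f"ABCOUHV{i:02d}": "Computers" for i in range(1, 11)}
NODE_OUS["ABCOUHV01"] = "OU=Hyper-V Hosts"


def find_ou_outliers(node_ous):
    """Return the most common OU and the nodes that are not in it."""
    majority_ou, _count = Counter(node_ous.values()).most_common(1)[0]
    outliers = sorted(node for node, ou in node_ous.items() if ou != majority_ou)
    return majority_ou, outliers


majority, outliers = find_ou_outliers(NODE_OUS)
print(f"Expected OU: {majority}")
print(f"Nodes to move: {', '.join(outliers)}")
```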

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
