RCA – 5 – Getting Event ID 1135, Cluster Node evicted from Failover Membership

Issue Description:

The cluster running on nodes ABSVS3 and ABSVS4 was unable to communicate with the domain controller (DC). Both nodes run Microsoft Windows Server 2012 R2 Datacenter, Version 6.3.9600 Build 9600.

Initial Description:

In this case the resources failed over from one node to another. This generally happens when the node on which the resource was running is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from the active failover cluster membership (Event ID 1135), which forces the resources to fail over to another node.

Why is Event ID 1135 logged?

This event is logged on all nodes in the cluster except for the node that was removed. It is logged because one of the nodes in the cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows Server 2008 or 2008 R2 Failover Cluster talk to each other over the networks that have the "Allow cluster network communication on this network" setting enabled. The nodes send heartbeat packets across these networks to all of the other nodes. Each packet is supposed to be received by the other node, which then sends a response back. Each node in the cluster monitors its own heartbeats to ensure that the network and the other nodes are up. The example below should help clarify this:

If any one of these packets is not returned, that specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get a response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
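The threshold logic described above can be sketched in a few lines of Python. This is only an illustration, not the Cluster service's actual implementation: the `Route` class and `peer_is_member` function are hypothetical names, the network name "Cluster Network 1" is made up, and the sketch assumes that misses are counted consecutively and that a received response resets the counter. The threshold itself is configurable (the `SameSubnetThreshold` cluster property), so the value on a given cluster may differ from the default of 5 quoted above.

```python
from dataclasses import dataclass

MISS_THRESHOLD = 5  # default quoted above: 5 missed heartbeats, one per second


@dataclass
class Route:
    """One network path from the local node to a peer node (hypothetical model)."""
    name: str
    consecutive_misses: int = 0
    up: bool = True

    def record(self, response_received: bool) -> None:
        # Assumption: a received response resets the miss counter and
        # brings the route back up; a miss increments the counter.
        if response_received:
            self.consecutive_misses = 0
            self.up = True
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= MISS_THRESHOLD:
                self.up = False


def peer_is_member(routes: list[Route]) -> bool:
    """The peer stays in active membership while at least one route is up."""
    return any(route.up for route in routes)


# Two routes to the peer: one keeps answering, one goes silent.
routes = [Route("Cluster Network 1"), Route("Cluster Network 2")]
for _ in range(5):
    routes[0].record(True)
    routes[1].record(False)
print(peer_is_member(routes))  # True: one route is still up

# Now the first route goes silent as well.
for _ in range(5):
    routes[0].record(False)
print(peer_is_member(routes))  # False: all routes down, Event ID 1135 would be logged
```

The point of modelling each route separately is that a single failed network does not remove the peer; only the loss of every route does.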

Reference:

Having a problem with nodes being removed from active Failover Cluster membership?
http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

_____________________________________________________________________________________

  • Checked the logs of the VM that was crashing and found the following unexpected-shutdown event:

_________________________________________________________________________

Log Name: System
Source: EventLog
Date: 6/8/2016 1:39:07 AM
Event ID: 6008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: SPSQUEEN.abc.local
Description:
The previous system shutdown at 21:06:45 on 07/06/2016 was unexpected.

_________________________________________________________________________

  • As per this log, the issue occurred at 21:06:45 on 07/06/2016 (7 June 2016).
  • Checked the events on the cluster at the time of the issue.
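Note that the Event 6008 description writes the date day-first (07/06/2016), while the cluster event tables in this RCA write it month-first (6/7/2016). The following Python sketch parses both forms and checks that they refer to the same evening. The two format strings are assumptions inferred from the logs, and the clocks on the VM and the host may not be exactly synchronized, so the difference in seconds is only indicative.

```python
from datetime import datetime

# Assumption: Event 6008 reports the shutdown time day-first (dd/mm/yyyy),
# while the cluster event tables are month-first (m/d/yyyy, 12-hour clock).
vm_shutdown = datetime.strptime("07/06/2016 21:06:45", "%d/%m/%Y %H:%M:%S")
node_removed = datetime.strptime("6/7/2016 9:06:53 PM", "%m/%d/%Y %I:%M:%S %p")

same_day = vm_shutdown.date() == node_removed.date()
gap_seconds = (node_removed - vm_shutdown).total_seconds()

print(same_day)     # True: both timestamps fall on 7 June 2016
print(gap_seconds)  # 8.0
```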

System Information: ABSVS3

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABSVS3

System Manufacturer        HP

System Model        ProLiant DL380 Gen9

System Type        x64-based PC

System SKU        K8P38A

Processor        Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP P89, 27/12/2015

System Events:

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:06:51 PM | Error | ABSVS3.abc.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.

  • Checked the events and found that the network started going down after the backup job started on the server.

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:06:53 PM | Critical | ABSVS3.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABSVS4’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:06:55 PM | Warning | ABSVS3.abc.local | 9 | bxfcoe | The SAN link is down for port WWN 20:00:2C:44:FD:99:F5:B9. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:55 PM | Warning | ABSVS3.abc.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume8. (A device which does not exist was specified.)
6/7/2016 | 9:06:56 PM | Warning | ABSVS3.abc.local | 4 | l2nd | HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Warning | ABSVS3.abc.local | 22 | Microsoft-Windows-Hyper-V-VmSwitch | Media disconnected on NIC /DEVICE/{406F2556-68B8-466C-A934-13988D1727B9} (Friendly Name: HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199).
6/7/2016 | 9:06:56 PM | Error | ABSVS3.abc.local | 1127 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABSVS3 – Embedded LOM 1 Port 1’ for cluster node ‘ABSVS3’ on network ‘Cluster Network 2’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:06:56 PM | Error | ABSVS3.abc.local | 1130 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 2’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:08:48 PM | Error | ABSVS3.abc.local | 1291 | NIC Agents | NIC Agent: Connectivity has been lost for the NIC in slot 0, port 1. [SNMP TRAP: 18012 in CPQNIC.MIB]
6/7/2016 | 9:08:49 PM | Warning | ABSVS3.abc.local | 1014 | Microsoft-Windows-DNS-Client | Name resolution for the name _kerberos._tcp.Default-First-Site-Name._sites.dc._msdcs.abc.local. timed out after none of the configured DNS servers responded.

Application Events:

  • Checked the event logs and found that the backup job started at 9:00:02 PM and failed at 9:09:23 PM.

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:00:02 PM | Information | ABSVS3.abc.local | 5632 | BackupAssist | Starting Job ‘DailyDataBackup’ for scheduled time: 07/06/2016 21:00 Job Method: File Replication Destination: Network location Job Execution ID: 5.454 Tag:FtpY1Ml3VmPO+DP9lqNlwkenKQK/EMTsA/1IVBnw6fw=
6/7/2016 | 9:09:23 PM | Error | ABSVS3.abc.local | 5634 | BackupAssist | Backup job DailyDataBackup failed with errors. Information: Could not copy directory attributes Ultra critical error: The network path was not found Destination: Network location Bytes: 123781013031 Files: 108217 Start time: 07/06/2016 21:00:07 End time: 07/06/2016 21:09:14 Duration: 00:09:07.4098342 Job Execution ID: 5.454
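To make the ordering of the ABSVS3 events easier to see, the timestamps can be placed on a single timeline relative to the start of the backup job. The Python sketch below uses a handful of events copied from the System and Application logs above; the short descriptions are paraphrased for brevity. It shows only the sequence and the elapsed time, which by itself does not establish the cause.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"  # month-first, 12-hour clock, as in the tables above

# (timestamp, event ID, paraphrased description) taken from the ABSVS3 logs.
events = [
    ("6/7/2016 9:06:53 PM", 1135, "ABSVS4 removed from cluster membership"),
    ("6/7/2016 9:00:02 PM", 5632, "BackupAssist job DailyDataBackup started"),
    ("6/7/2016 9:09:23 PM", 5634, "BackupAssist job DailyDataBackup failed"),
    ("6/7/2016 9:06:51 PM", 5120, "CSV Volume2 entered a paused state"),
    ("6/7/2016 9:06:56 PM", 1130, "Cluster Network 2 is down"),
]

# Sort chronologically and express each event as an offset from the first one.
timeline = sorted((datetime.strptime(ts, FMT), eid, text) for ts, eid, text in events)
start = timeline[0][0]
offsets = [int((when - start).total_seconds()) for when, _, _ in timeline]

for offset, (_, event_id, text) in zip(offsets, timeline):
    print(f"+{offset:>4}s  {event_id:>5}  {text}")
```

On this data the first cluster error (Event ID 5120) appears 409 seconds, about 6 minutes 49 seconds, after the backup job started, and the backup job itself fails 561 seconds after it started.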

List of outdated drivers:

Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 23:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02 | (3.23:1.0) | (3.23:1.0) | Sophos Limited | SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26 | (10.3:13.0) | (3.4:9.0) | Sophos Limited | Sophos Web Intelligence
5/22/2013 22:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
11/24/2013 2:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
2/26/2014 13:04 | (1.1:303.0) | (1.1:303.0) | Certit PTY LTD | VHD Virtual Disk Driver
3/1/2013 1:31 | (4.1:0.2980) | (4.1:0.2980) | Riverbed Technology, Inc. | npf.sys (NT5/6 AMD64) Kernel Driver

_________________________________________________________________________________________

 

 

System Information: ABSVS4

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABSVS4

System Manufacturer        HP

System Model        ProLiant DL380 Gen9

System Type        x64-based PC

System SKU        K8P38A

Processor        Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP P89, 27/12/2015

 

 

System Events:

 

  • Checked the logs and found that node ABSVS3 was removed from the active failover cluster membership at the time of the issue.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:06:53 PM | Critical | ABSVS4.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABSVS3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

  • After this, we found that the network started going down.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
6/7/2016 | 9:06:55 PM | Warning | ABSVS4.abc.local | 9 | bxfcoe | The SAN link is down for port WWN 20:00:2C:44:FD:99:D2:89. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Warning | ABSVS4.abc.local | 4 | q57nd60a | HP Ethernet 1Gb 4-port 331i Adapter: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Error | ABSVS4.abc.local | 1127 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABSVS4 – Embedded LOM 1 Port 1’ for cluster node ‘ABSVS4’ on network ‘Cluster Network 2’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:06:56 PM | Error | ABSVS4.abc.local | 1130 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 2’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016 | 9:06:56 PM | Warning | ABSVS4.abc.local | 4 | l2nd | HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016 | 9:06:56 PM | Warning | ABSVS4.abc.local | 22 | Microsoft-Windows-Hyper-V-VmSwitch | Media disconnected on NIC /DEVICE/{118796E5-45DA-489C-B23F-C321AA44E99D} (Friendly Name: HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199).
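Comparing the System events of the two nodes shows which failures were seen on both hosts at the same second. The Python sketch below does this with a simple set intersection over (time, event ID, source) tuples copied from the ABSVS3 and ABSVS4 tables in this RCA. Events that both nodes log at the same moment may point to a component shared by both hosts rather than to one adapter, but that reading is an interpretation and should be confirmed with the networking team.

```python
# (time, event ID, source) tuples copied from the System event tables of each node.
absvs3 = {
    ("9:06:51 PM", 5120, "Microsoft-Windows-FailoverClustering"),
    ("9:06:53 PM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("9:06:55 PM", 9, "bxfcoe"),
    ("9:06:55 PM", 140, "Microsoft-Windows-Ntfs"),
    ("9:06:56 PM", 4, "l2nd"),
    ("9:06:56 PM", 22, "Microsoft-Windows-Hyper-V-VmSwitch"),
    ("9:06:56 PM", 1127, "Microsoft-Windows-FailoverClustering"),
    ("9:06:56 PM", 1130, "Microsoft-Windows-FailoverClustering"),
    ("9:08:48 PM", 1291, "NIC Agents"),
    ("9:08:49 PM", 1014, "Microsoft-Windows-DNS-Client"),
}
absvs4 = {
    ("9:06:53 PM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("9:06:55 PM", 9, "bxfcoe"),
    ("9:06:56 PM", 4, "q57nd60a"),
    ("9:06:56 PM", 1127, "Microsoft-Windows-FailoverClustering"),
    ("9:06:56 PM", 1130, "Microsoft-Windows-FailoverClustering"),
    ("9:06:56 PM", 4, "l2nd"),
    ("9:06:56 PM", 22, "Microsoft-Windows-Hyper-V-VmSwitch"),
}

common = sorted(absvs3 & absvs4)       # logged on both nodes at the same second
only_absvs3 = sorted(absvs3 - absvs4)  # logged on ABSVS3 only
only_absvs4 = sorted(absvs4 - absvs3)  # logged on ABSVS4 only

for label, rows in (("both", common), ("ABSVS3 only", only_absvs3), ("ABSVS4 only", only_absvs4)):
    for time, event_id, source in rows:
        print(f"{label:<12} {time}  {event_id:>5}  {source}")
```

On this data, six events (1135, 9, 4 from l2nd, 22, 1127 and 1130) appear on both nodes with identical timestamps.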

 

 

Application Events:

 

  • Checked the Application logs but was not able to find any events at the time of issue.

 

List of outdated drivers:

 

 

Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 23:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02 | (3.23:1.0) | (3.23:1.0) | Sophos Limited | SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26 | (10.3:13.0) | (3.4:9.0) | Sophos Limited | Sophos Web Intelligence
5/22/2013 22:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
11/24/2013 2:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
2/26/2014 13:04 | (1.1:303.0) | (1.1:303.0) | Certit PTY LTD | VHD Virtual Disk Driver

 

_________________________________________________________________________

 

 

 

Conclusion:

 

  • After analyzing the logs, we can see various network failures on the cluster, due to which the node was removed from the cluster membership, which in the end crashed the virtual machines on the cluster. The logs also show that the issue started after the backup job was initiated. Kindly uninstall the backup utility and monitor the machine.

 

 

  1. For monitoring purposes, kindly uninstall the antivirus from the cluster.

  2. Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical when dealing with network connectivity issues.

Investigation of network issues:

To avoid this issue in future, the most critical part is to diagnose and investigate the consistent network connectivity issue on the cluster networks with the help of the in-house networking team:

  • Check the network adapters, cables, and network configuration for the networks that connect the nodes.
  • Check the hubs, switches, or bridges in the networks that connect the nodes.
  • Check for switch delays and proxy ARPs.
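When the networking team collects latency and packet-loss measurements between the nodes, the results can be summarized against the heartbeat threshold described earlier. The Python sketch below is a hypothetical helper, not part of any Microsoft or HP tool: it assumes one probe per second, with `None` standing for a probe that received no reply, and it flags any run of consecutive losses that reaches the default threshold of 5. The sample data is made up for illustration and is not a measurement from ABSVS3 or ABSVS4.

```python
from typing import Optional, Sequence

HEARTBEAT_THRESHOLD = 5  # consecutive misses, the default quoted earlier in this RCA


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one-per-second probe results; None means no reply was received."""
    lost = sum(1 for rtt in rtts_ms if rtt is None)
    answered = [rtt for rtt in rtts_ms if rtt is not None]

    # Track the longest run of consecutive lost probes.
    longest_gap = current_gap = 0
    for rtt in rtts_ms:
        current_gap = current_gap + 1 if rtt is None else 0
        longest_gap = max(longest_gap, current_gap)

    return {
        "loss_percent": 100.0 * lost / len(rtts_ms) if rtts_ms else 0.0,
        "max_rtt_ms": max(answered) if answered else None,
        "longest_gap": longest_gap,
        "would_drop_route": longest_gap >= HEARTBEAT_THRESHOLD,
    }


# Made-up sample: five consecutive lost probes in a ten-second window.
samples = [0.4, 0.5, None, None, None, None, None, 0.6, 12.0, 0.5]
summary = summarize_probes(samples)
print(summary)
```

The longest run of consecutive losses is tracked separately from the overall loss percentage because a short burst of loss can mark a route as down even when the average loss over a longer window looks low.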

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
