RCA - 5 - Event ID 1135: Cluster Node Evicted from Failover Cluster Membership

Issue Description:

The cluster running on nodes ABSVS3 and ABSVS4 was unable to communicate with the domain controller (DC). Both nodes run Microsoft Windows Server 2012 R2 Datacenter, Version 6.3.9600 Build 9600.

Initial Description:

In this case the resources failed over from one node to another. This generally happens when the node on which a resource was running is no longer capable of running that resource, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is evicted from the failover cluster membership (Event ID 1135), which causes the resources to fail over to another node.

Why is Event ID 1135 logged?

This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows Server 2008 or 2008 R2 Failover Cluster (the versions covered by the article referenced below) talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. Each packet is supposed to be received by the other node, which then sends a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network is up and the other nodes are up. The following example, using the node names W2K8-R2-NODE1 and W2K8-R2-NODE2, should help clarify this:

If any one of these packets is not returned, then that specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get a response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and Event 1135 is logged on the remaining nodes. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
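
The rule described above can be sketched as a small simulation. This is a simplified model for illustration only, not the Cluster service's actual implementation: the `Route` class, the `node_is_member` helper, and the route names are hypothetical, and the model assumes one heartbeat per second so that 5 consecutive misses correspond to the "5 failures in 5 seconds" default quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Route:
    """One network path to a peer node (hypothetical model, not a cluster API)."""
    name: str
    threshold: int = 5   # assumed default: 5 missed heartbeats mark the route down
    missed: int = 0      # consecutive heartbeats that got no response

    def record(self, got_response: bool) -> None:
        # A response resets the counter; a miss increments it.
        self.missed = 0 if got_response else self.missed + 1

    @property
    def is_up(self) -> bool:
        return self.missed < self.threshold


def node_is_member(routes: List[Route]) -> bool:
    """A peer stays an active member while at least one route to it is up."""
    return any(route.is_up for route in routes)


# Two routes to the same peer node (names are placeholders).
routes = [Route("route-A"), Route("route-B")]

# Five heartbeats in which only route-B loses every response.
for _ in range(5):
    routes[0].record(True)
    routes[1].record(False)
membership_after_one_route_down = node_is_member(routes)    # peer is still a member

# Five more heartbeats in which both routes lose every response.
for _ in range(5):
    routes[0].record(False)
    routes[1].record(False)
membership_after_all_routes_down = node_is_member(routes)   # peer is removed (Event 1135)

print(membership_after_one_route_down, membership_after_all_routes_down)
```

In this model the peer survives the loss of a single route and is removed only when every route has reached the miss threshold, which matches the behaviour described above.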

Reference:

Having a problem with nodes being removed from active Failover Cluster membership?

http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

_____________________________________________________________________________________

Checked the logs of the VM that was crashing and found the following unexpected-shutdown event:

_________________________________________________________________________

Log Name: System
Source: EventLog
Date: 6/8/2016 1:39:07 AM
Event ID: 6008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: SPSQUEEN.abc.local
Description:
The previous system shutdown at 21:06:45 on 07/06/2016 was unexpected.

_________________________________________________________________________

As per this log, the issue occurred at 21:06:45 on 07/06/2016 (7 June 2016, written day/month/year).
Checked the events on the cluster nodes at the time of the issue.
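
The logs in this report mix two date styles: the Event 6008 description gives the shutdown time as 07/06/2016 21:06:45, while the cluster event tables use the 6/7/2016 9:06:53 PM style. A minimal sketch of how the two can be normalized before comparing them is shown here. The two format strings are assumptions about how each timestamp was recorded (day/month/year with a 24-hour clock, and month/day/year with a 12-hour clock), and the sketch assumes the clocks of the VM and the cluster node were in sync.

```python
from datetime import datetime

# Shutdown time from the Event 6008 description (assumed day/month/year, 24-hour).
shutdown = datetime.strptime("07/06/2016 21:06:45", "%d/%m/%Y %H:%M:%S")

# Event 1135 time from the cluster System log (assumed month/day/year, 12-hour).
eviction = datetime.strptime("6/7/2016 9:06:53 PM", "%m/%d/%Y %I:%M:%S %p")

# Gap between the unexpected VM shutdown and the node eviction, in seconds.
gap_seconds = (eviction - shutdown).total_seconds()

print(shutdown.isoformat(), eviction.isoformat(), gap_seconds)
```

Under these assumptions both timestamps fall on the same evening, a few seconds apart.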

System Information: ABSVS3

OS Name Microsoft Windows Server 2012 R2 Datacenter

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ABSVS3

System Manufacturer HP

System Model ProLiant DL380 Gen9

System Type x64-based PC

System SKU K8P38A

Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date HP P89, 27/12/2015

System Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
6/7/2016	9:06:51 PM	Error	ABSVS3.abc.local	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.

The events show that the network links started going down after the backup job had started on the server.

6/7/2016	9:06:53 PM	Critical	ABSVS3.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABSVS4’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016	9:06:55 PM	Warning	ABSVS3.abc.local	9	bxfcoe	The SAN link is down for port WWN 20:00:2C:44:FD:99:F5:B9. Check to make sure the network cable is properly connected.
6/7/2016	9:06:55 PM	Warning	ABSVS3.abc.local	140	Microsoft-Windows-Ntfs	The system failed to flush data to the transaction log. Corruption may occur in VolumeId: G:, DeviceName: \Device\HarddiskVolume8. (A device which does not exist was specified.)
6/7/2016	9:06:56 PM	Warning	ABSVS3.abc.local	4	l2nd	HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016	9:06:56 PM	Warning	ABSVS3.abc.local	22	Microsoft-Windows-Hyper-V-VmSwitch	Media disconnected on NIC /DEVICE/{406F2556-68B8-466C-A934-13988D1727B9} (Friendly Name: HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199).

6/7/2016	9:06:56 PM	Error	ABSVS3.abc.local	1127	Microsoft-Windows-FailoverClustering	Cluster network interface ‘ABSVS3 – Embedded LOM 1 Port 1’ for cluster node ‘ABSVS3’ on network ‘Cluster Network 2’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016	9:06:56 PM	Error	ABSVS3.abc.local	1130	Microsoft-Windows-FailoverClustering	Cluster network ‘Cluster Network 2’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

6/7/2016	9:08:48 PM	Error	ABSVS3.abc.local	1291	NIC Agents	NIC Agent: Connectivity has been lost for the NIC in slot 0, port 1. [SNMP TRAP: 18012 in CPQNIC.MIB]
6/7/2016	9:08:49 PM	Warning	ABSVS3.abc.local	1014	Microsoft-Windows-DNS-Client	Name resolution for the name _kerberos._tcp.Default-First-Site-Name._sites.dc._msdcs.abc.local. timed out after none of the configured DNS servers responded.

Application Events:

Checked the event logs and found that the backup job started at 9:00:02 PM and failed at 9:09:23 PM.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
6/7/2016	9:00:02 PM	Information	ABSVS3.abc.local	5632	BackupAssist	Starting Job ‘DailyDataBackup’ for scheduled time: 07/06/2016 21:00 Job Method: File Replication Destination: Network location Job Execution ID: 5.454 Tag:FtpY1Ml3VmPO+DP9lqNlwkenKQK/EMTsA/1IVBnw6fw=
6/7/2016	9:09:23 PM	Error	ABSVS3.abc.local	5634	BackupAssist	Backup job DailyDataBackup failed with errors. Information: Could not copy directory attributes Ultra critical error: The network path was not found Destination: Network location Bytes: 123781013031 Files: 108217 Start time: 07/06/2016 21:00:07 End time: 07/06/2016 21:09:14 Duration: 00:09:07.4098342 Job Execution ID: 5.454
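
To make the ordering of the ABSVS3 events easier to see, the timestamps can be expressed as offsets from the start of the backup job. The sketch below does this for a subset of the rows listed in the System and Application tables above. The format string is an assumption about how the tables were exported, and the `offset_seconds` helper is a hypothetical name.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"   # assumed timestamp format of the exported tables

# (timestamp, event ID, source) rows copied from the ABSVS3 tables.
events = [
    ("6/7/2016 9:00:02 PM", 5632, "BackupAssist"),
    ("6/7/2016 9:06:51 PM", 5120, "Microsoft-Windows-FailoverClustering"),
    ("6/7/2016 9:06:53 PM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("6/7/2016 9:06:55 PM", 9, "bxfcoe"),
    ("6/7/2016 9:06:56 PM", 1127, "Microsoft-Windows-FailoverClustering"),
    ("6/7/2016 9:06:56 PM", 1130, "Microsoft-Windows-FailoverClustering"),
    ("6/7/2016 9:08:48 PM", 1291, "NIC Agents"),
    ("6/7/2016 9:09:23 PM", 5634, "BackupAssist"),
]

# The first row is the backup job start, used as time zero.
backup_start = datetime.strptime(events[0][0], FMT)


def offset_seconds(stamp: str) -> int:
    """Seconds elapsed between the backup job start and the given timestamp."""
    return int((datetime.strptime(stamp, FMT) - backup_start).total_seconds())


timeline = [(offset_seconds(stamp), event_id, source)
            for stamp, event_id, source in events]

for offset, event_id, source in timeline:
    print(f"+{offset:>4}s  {event_id:>5}  {source}")
```

On these rows the first cluster error (Event 5120) appears 409 seconds after the backup job starts, and the job itself fails at 561 seconds.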

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
2/12/2010 23:33	(3.0:0.0)	(3.0:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02	(3.23:1.0)	(3.23:1.0)	Sophos Limited	SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26	(10.3:13.0)	(3.4:9.0)	Sophos Limited	Sophos Web Intelligence
5/22/2013 22:41	(3.9:0.0)	(3.9:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3/4 Management Controller Core Driver
11/24/2013 2:26	(3.10:0.0)	(3.10:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3/4 Channel Interface Driver
2/26/2014 13:04	(1.1:303.0)	(1.1:303.0)	Certit PTY LTD	VHD Virtual Disk Driver
3/1/2013 1:31	(4.1:0.2980)	(4.1:0.2980)	Riverbed Technology, Inc.	npf.sys (NT5/6 AMD64) Kernel Driver

_________________________________________________________________________________________

System Information: ABSVS4

OS Name Microsoft Windows Server 2012 R2 Datacenter

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ABSVS4

System Manufacturer HP

System Model ProLiant DL380 Gen9

System Type x64-based PC

System SKU K8P38A

Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date HP P89, 27/12/2015

System Events:

Checked the logs and found that node ABSVS3 was evicted from the failover cluster membership at the time of the issue.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
6/7/2016	9:06:53 PM	Critical	ABSVS4.abc.local	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘ABSVS3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

After this, the network links started going down.

6/7/2016	9:06:55 PM	Warning	ABSVS4.abc.local	9	bxfcoe	The SAN link is down for port WWN 20:00:2C:44:FD:99:D2:89. Check to make sure the network cable is properly connected.
6/7/2016	9:06:56 PM	Warning	ABSVS4.abc.local	4	q57nd60a	HP Ethernet 1Gb 4-port 331i Adapter: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016	9:06:56 PM	Error	ABSVS4.abc.local	1127	Microsoft-Windows-FailoverClustering	Cluster network interface ‘ABSVS4 – Embedded LOM 1 Port 1’ for cluster node ‘ABSVS4’ on network ‘Cluster Network 2’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016	9:06:56 PM	Error	ABSVS4.abc.local	1130	Microsoft-Windows-FailoverClustering	Cluster network ‘Cluster Network 2’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
6/7/2016	9:06:56 PM	Warning	ABSVS4.abc.local	4	l2nd	HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199: The network link is down. Check to make sure the network cable is properly connected.
6/7/2016	9:06:56 PM	Warning	ABSVS4.abc.local	22	Microsoft-Windows-Hyper-V-VmSwitch	Media disconnected on NIC /DEVICE/{118796E5-45DA-489C-B23F-C321AA44E99D} (Friendly Name: HP FlexFabric 10Gb 2-port 533FLR-T Adapter #199).

Application Events:

Checked the Application logs but did not find any relevant events at the time of the issue.

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
2/12/2010 23:33	(3.0:0.0)	(3.0:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3 PSHED Plugin Driver
2/18/2014 16:02	(3.23:1.0)	(3.23:1.0)	Sophos Limited	SAV On-Access and HIPS for Windows Vista (AMD64)
7/28/2014 15:26	(10.3:13.0)	(3.4:9.0)	Sophos Limited	Sophos Web Intelligence
5/22/2013 22:41	(3.9:0.0)	(3.9:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3/4 Management Controller Core Driver
11/24/2013 2:26	(3.10:0.0)	(3.10:0.0)	Hewlett-Packard Company	HP ProLiant iLO 3/4 Channel Interface Driver
2/26/2014 13:04	(1.1:303.0)	(1.1:303.0)	Certit PTY LTD	VHD Virtual Disk Driver

_________________________________________________________________________

Conclusion:

After analyzing the logs, we can see multiple network failures on the cluster, which caused the node to be evicted and ultimately crashed the virtual machines on the cluster. The failures began after the backup job was initiated (job start at 9:00:02 PM, first cluster error at 9:06:51 PM). Kindly uninstall the backup utility and monitor the machine.

For monitoring purposes, kindly also uninstall the antivirus from the cluster nodes.

Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical when dealing with network connectivity issues.

Investigation of Network Issues:

To avoid this issue in future, the most critical part is to diagnose and investigate the recurring network connectivity issue on the cluster networks, with the help of the in-house networking team.

We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.

We also need to check the hubs, switches, or bridges in the networks that connect the nodes.

We need to check for switch delays and proxy ARP with the help of the in-house networking team.
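
One way to gather evidence for the checks above is to probe a peer node repeatedly and record connect latency and loss over time. The following is a minimal sketch only. It assumes that a TCP connection to some open port on the peer is an acceptable stand-in for the path under test; it does not exercise the cluster heartbeat traffic itself and does not replace switch-side diagnostics. The function names `probe_tcp`, `summarize`, and `run_probes` are hypothetical.

```python
import socket
import time
from typing import List, Optional


def probe_tcp(host: str, port: int, timeout: float = 1.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the attempt failed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return None


def summarize(samples: List[Optional[float]]) -> dict:
    """Summarize probe results: loss percentage and latency of successful probes."""
    ok = [s for s in samples if s is not None]
    lost = len(samples) - len(ok)
    return {
        "sent": len(samples),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        "max_latency": max(ok) if ok else None,
        "avg_latency": sum(ok) / len(ok) if ok else None,
    }


def run_probes(host: str, port: int, count: int = 10, interval: float = 1.0) -> dict:
    """Probe `count` times, `interval` seconds apart, and summarize the results."""
    samples = []
    for _ in range(count):
        samples.append(probe_tcp(host, port))
        time.sleep(interval)
    return summarize(samples)
```

For example, `run_probes("ABSVS4", 445, count=60)` run from ABSVS3 during the backup window would report how many of 60 connection attempts failed; the port number here is an assumption and should be replaced with one known to be open on the peer.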