Issue Description:
Cluster nodes were evicted with Event ID 1135 on cluster aaborhv-clstr-2, running Microsoft Windows Server 2012 R2 Standard, Version 6.3.9600 Build 9600, on 20th July.
Initial Description:
In this case, resources failed over from one node to another. This generally happens when the node on which a resource was running is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from failover cluster membership (Event ID 1135), which causes its resources to fail over to another node.
Why is Event ID 1135 Logged?
This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.
What caused the node to be marked down?
All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. Each packet is supposed to be received by the other node, which then sends a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network is up and the other nodes are up.
If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a request and receives a response from W2K8-R2-NODE1 to a heartbeat packet, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and W2K8-R2-NODE1 does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can have W2K8-R2-NODE1 show the network as down until another heartbeat request is received.
By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.
If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
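The rule described above can be sketched as a small simulation. This is only an illustration, not the Cluster service's actual implementation: the class names, the two network names, and the choice to reset the counter on every reply are assumptions made for the example, and the limit of 5 is the default quoted in the text (the real threshold is configurable per cluster).

```python
from dataclasses import dataclass, field

MISSED_LIMIT = 5  # default quoted in the text: 5 missed heartbeats mark a route down


@dataclass
class Route:
    """One network path from the local node to a peer node (hypothetical model)."""
    name: str
    missed: int = 0

    def record(self, reply_received: bool) -> None:
        # Assumption: a reply resets the count, a missing reply increments it.
        self.missed = 0 if reply_received else self.missed + 1

    @property
    def is_down(self) -> bool:
        return self.missed >= MISSED_LIMIT


@dataclass
class Peer:
    """A remote cluster node reachable over one or more routes."""
    name: str
    routes: list = field(default_factory=list)

    @property
    def is_member(self) -> bool:
        # The peer stays in active membership while at least one route is up.
        return any(not route.is_down for route in self.routes)


# Network names are made up for the example.
node2 = Peer("W2K8-R2-NODE2", [Route("cluster-net"), Route("mgmt-net")])

# Five consecutive misses on one route: that route goes down, the node stays in.
for _ in range(5):
    node2.routes[0].record(reply_received=False)
one_route_down = (node2.routes[0].is_down, node2.is_member)

# Five consecutive misses on the remaining route: all routes are down,
# which is the condition under which Event 1135 would be logged.
for _ in range(5):
    node2.routes[1].record(reply_received=False)
all_routes_down = node2.is_member
```

In this model a single flaky network never removes a node on its own; removal needs every route to reach the limit, which matches the description above.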
Reference:
Having a problem with nodes being removed from active Failover Cluster membership?
Scenario 1
Issue happened on 25th July.
__________________________________________________________________________________________
System Information: ABCOUHV01
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCOUHV01
System Manufacturer HP
System Model ProLiant BL460c Gen8
System Type x64-based PC
System SKU 641016-B21
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP I31, 6/1/2015
System Events:
- Checked the machine and found Event ID 1135, which reports that cluster node ABCOUHV03 was removed from the Cluster.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/25/2016 | 10:52:21 PM | Critical | ABCOUHV01.abcdgrp.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCOUHV03’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
7/25/2016 | 10:52:22 PM | Error | ABCOUHV01.abcdgrp.local | 21502 | Microsoft-Windows-Hyper-V-High-Availability | ‘SCVMM ABCOUWIN801 Configuration’ failed to register the virtual machine with the virtual machine management service. |
7/25/2016 | 10:52:22 PM | Error | ABCOUHV01.abcdgrp.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SCVMM ABCOUWIN801 Configuration’ of type ‘Virtual Machine Configuration’ in clustered role ‘SCVMM ABCOUWIN801 Resources’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.‘). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. |
- Based on the error, the VM configuration resource failed because its configuration file could not be found: "The system cannot find the file specified."
Application Events:
- Checked the application events but was not able to find anything specific related to the issue.
List of outdated drivers:
Driver Date/Time | File Version | Product Version | Manufacturer | Description |
12/20/2014 8:47 | (10.4:215.0) | (10.4:215.0) | Emulex | Emulex Plus Driver |
10/26/2015 16:58 | (10.4:364.0) | (10.4:364.0) | Emulex | Emulex FCoE Storport Miniport Driver |
12/26/2015 17:03 | (9.0:0.902) | (9.0:0.902) | Veeam Software AG | CTK file system minifilter |
12/30/2015 12:31 | (6.3:9600.16384) | (10.7:206.0) | Emulex | Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64) |
8/6/2013 16:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys |
Cluster Logs:
00005548.000051a8::2016/07/25-03:48:53.654 INFO [RES] Physical Disk: Supplied device path Q: is a disk path, status 2
00005548.000051a8::2016/07/25-03:48:53.654 INFO [RES] Physical Disk: Failed to open device Q:, status 3
00005548.000051a8::2016/07/25-03:48:53.654 WARN [RHS] Error 3 from resource type control for restype Physical Disk.
00003328.00003618::2016/07/25-03:48:53.785 INFO [GEM] Node 2: Deleting [3:54298 , 3:54310] (both included) as it has been ack’d by every node
00003328.00004648::2016/07/25-03:53:27.359 ERR [REMP] HandleHwprvPostCommitSnapshot: unknown snapshot set 7488383a-f77f-49c6-abd5-3f30519b790c
00003328.00001338::2016/07/25-03:53:27.359 INFO [WRTA] OnEndLocalSnapshotSet: snapshot set 97a00019-cb80-40ad-b304-5d5f70da06e1, initiator SnapshotInitiatorAgentHwProvider
00003328.00004648::2016/07/25-03:53:27.381 INFO [DCM] ClusterSnapshotSetHandler(exit): 97a00019-cb80-40ad-b304-5d5f70da06e1, isLocal false, event HwPostCommitSnapshot, HrError(0x00000000)
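When a cluster log is long, it can help to filter it down to warnings and errors. The following is a minimal sketch that assumes every line follows the layout seen in the excerpt above (hex process ID, hex thread ID, timestamp, level, bracketed component, message); the function and variable names are invented for the example.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for one cluster log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


# Three lines copied from the excerpt above.
sample = [
    "00005548.000051a8::2016/07/25-03:48:53.654 INFO [RES] Physical Disk: Failed to open device Q:, status 3",
    "00005548.000051a8::2016/07/25-03:48:53.654 WARN [RHS] Error 3 from resource type control for restype Physical Disk.",
    "00003328.00004648::2016/07/25-03:53:27.359 ERR [REMP] HandleHwprvPostCommitSnapshot: unknown snapshot set 7488383a-f77f-49c6-abd5-3f30519b790c",
]

entries = [entry for entry in map(parse_line, sample) if entry is not None]
problems = [entry for entry in entries if entry["level"] in ("WARN", "ERR")]

for entry in problems:
    print(entry["ts"], entry["level"], entry["component"], entry["message"])
```

Lines that do not match the assumed layout are skipped rather than raising, so a partially wrapped log can still be scanned.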
____________________________________________________________________________________________
System Information: ABCOUHV07
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCOUHV07
System Manufacturer HP
System Model ProLiant BL460c Gen8
System Type x64-based PC
System SKU 641016-B21
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP I31, 6/1/2015
System Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/20/2016 | 6:44:18 AM | Critical | ABCOUHV07 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCOUHV06’ was removed from the active failover |
Application Events:
- Checked the application events but was not able to find anything specific related to the issue.
List of outdated drivers:
Driver Date/Time | File Version | Product Version | Manufacturer | Description |
12/20/2014 8:47 | (10.4:215.0) | (10.4:215.0) | Emulex | Emulex Plus Driver |
1/17/2015 21:57 | (10.4:246.0) | (10.4:246.0) | Emulex | Emulex FCoE Storport Miniport Driver |
3/2/2015 12:01 | (6.3:9600.17246) | (63.10:0.64) | PMC-Sierra, Inc. | Smart Array SAS/SATA Controller Storport Driver |
12/26/2015 17:03 | (9.0:0.902) | (9.0:0.902) | Veeam Software AG | CTK file system minifilter |
9/3/2015 14:01 | (6.3:9600.16384) | (10.6:236.0) | Emulex | Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64) |
__________________________________________________________________________________________
System Information: ABCOUHV06
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCOUHV06
System Manufacturer HP
System Model ProLiant BL460c Gen8
System Type x64-based PC
System SKU 641016-B21
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP I31, 6/1/2015
System Events:
- Checked the system events and found that the MPIO paths were removed before the issue.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/20/2016 | 6:23:47 AM | Information | ABCOUHV06.abcdgrp.local | 46 | mpio | Path 77010000 was removed from \Device\MPIODisk523 due to a PnP event. The dump data contains the current number of paths. |
7/20/2016 | 6:23:47 AM | Information | ABCOUHV06.abcdgrp.local | 46 | mpio | Path 77010000 was removed from \Device\MPIODisk524 due to a PnP event. The dump data contains the current number of paths. |
- Checked the node and found that the Cluster Shared Volume went into a paused state. This generally happens when a backup job is running.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: The semaphore timeout period has expired. |
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service. |
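The code in parentheses in the Event 5120 description is an NTSTATUS value, and decoding it narrows down why the CSV paused. The sketch below is a hypothetical helper, not part of any Microsoft tool; the lookup table holds only three well-known NTSTATUS names and would need extending for other codes.

```python
import re

# Partial lookup of NTSTATUS values; extend as needed.
NTSTATUS_NAMES = {
    0xC000020C: "STATUS_CONNECTION_DISCONNECTED",
    0xC00000B5: "STATUS_IO_TIMEOUT",
    0xC000026E: "STATUS_VOLUME_DISMOUNTED",
}


def csv_pause_status(description):
    """Return (code, name) for the 8-digit hex status in a 5120 description, or None."""
    match = re.search(r"\(([0-9a-fA-F]{8})\)", description)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, NTSTATUS_NAMES.get(code, "unknown")


# Description text from the Event 5120 row above (quotes normalized to ASCII).
description = (
    "Cluster Shared Volume 'Volume2' ('Cluster Disk 3') has entered a paused "
    "state because of '(c000020c)'. All I/O will temporarily be queued until "
    "a path to the volume is reestablished."
)

code, name = csv_pause_status(description)
print(f"0x{code:08X} {name}")
```

For this event the code resolves to STATUS_CONNECTION_DISCONNECTED, which fits a node losing its connection to the volume rather than the disk itself failing.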
Application Events:
- In the Application log, the 3PAR VSS provider first reported that the target LUN is not a 3PAR Virtual Volume, after which the Cluster Shared Volume went into a paused state and the Cluster service terminated.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/20/2016 | 5:44:02 AM | Error | ABCOUHV06.abcdgrp.local | 5065 | 3PARVSSProvider | 3PARVSS5065: ERROR: Target LUN HP is not a 3PAR Virtual Volume. |
7/20/2016 | 5:46:39 AM | Error | ABCOUHV06.abcdgrp.local | 12305 | VSS | Volume Shadow Copy Service error: Volume/disk not connected or not found. Error context: DeviceIoControl(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy864 - 0000000000000180,0x00560038,0000000000000000,0,0000008854909D70,4096,[0]). Operation: Removing auto-release shadow copies Loading provider Context: Volume Name: \\?\Volume{d45ca367-4e59-11e6-80c9-0017a4770050}\ Execution Context: System Provider |
As per the link: http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c04533976-2.pdf
Error:
3PARVSS5065: ERROR: Target LUN <Lun Id> is not a 3PAR Virtual Volume.
Details:
The designated target LUN is not an HP 3PAR virtual volume.
Resolution:
• Verify that all database and log destinations belong to the HP 3PAR StoreServ Storage System.
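To see the order of events on 20th July, the rows quoted in the System and Application event tables can be merged into one timeline. This is a rough sketch: the rows are copied from the tables above with the descriptions left out, the function name is invented, and it assumes the clocks on ABCOUHV06 and ABCOUHV07 are in sync and in the same time zone.

```python
from datetime import datetime

# Rows copied from the event tables above (descriptions omitted for brevity).
ROWS = """\
7/20/2016 | 6:44:18 AM | Critical | ABCOUHV07 | 1135 | Microsoft-Windows-FailoverClustering
7/20/2016 | 6:23:47 AM | Information | ABCOUHV06.abcdgrp.local | 46 | mpio
7/20/2016 | 6:44:06 AM | Error | ABCOUHV06.abcdgrp.local | 5120 | Microsoft-Windows-FailoverClustering
7/20/2016 | 5:44:02 AM | Error | ABCOUHV06.abcdgrp.local | 5065 | 3PARVSSProvider
7/20/2016 | 5:46:39 AM | Error | ABCOUHV06.abcdgrp.local | 12305 | VSS
"""


def parse_row(row):
    """Turn one pipe-delimited row into a dict with a real datetime."""
    date, time, level, computer, event_id, source = [c.strip() for c in row.split("|")]
    when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")
    return {"when": when, "level": level, "computer": computer,
            "event_id": int(event_id), "source": source}


timeline = sorted((parse_row(r) for r in ROWS.splitlines() if r.strip()),
                  key=lambda event: event["when"])
order = [event["event_id"] for event in timeline]

# Time between the MPIO path removal and the CSV entering the paused state.
first_seen = {}
for event in timeline:
    first_seen.setdefault(event["event_id"], event["when"])
gap = first_seen[5120] - first_seen[46]

for event in timeline:
    print(event["when"].strftime("%H:%M:%S"), event["event_id"], event["source"])
print("MPIO path removal to CSV pause:", gap)
```

Sorted this way, the VSS provider errors come first, the MPIO path removal follows, and the CSV pause and node removal come last.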
List of outdated drivers:
Driver Date/Time | File Version | Product Version | Manufacturer | Description |
12/20/2014 8:47 | (10.4:215.0) | (10.4:215.0) | Emulex | Emulex Plus Driver |
1/17/2015 21:57 | (10.4:246.0) | (10.4:246.0) | Emulex | Emulex FCoE Storport Miniport Driver |
3/2/2015 12:01 | (6.3:9600.17246) | (63.10:0.64) | PMC-Sierra, Inc. | Smart Array SAS/SATA Controller Storport Driver |
12/26/2015 17:03 | (9.0:0.902) | (9.0:0.902) | Veeam Software AG | CTK file system minifilter |
9/3/2015 14:01 | (6.3:9600.16384) | (10.6:236.0) | Emulex | Emulex Win8.1/Win2012R2 NDIS 6.40 miniport (x64) |
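The driver lists for the three nodes are not identical, and a quick comparison makes the differences visible. The sketch below is illustrative only: the values are copied by hand from the second version column of the driver tables above, the driver names are shortened, and the function name is invented.

```python
from collections import defaultdict

# Second version column from each node's driver table, copied by hand.
drivers = {
    "ABCOUHV01": {
        "Emulex Plus Driver": "10.4:215.0",
        "Emulex FCoE Storport Miniport Driver": "10.4:364.0",
        "Emulex NDIS 6.40 miniport": "10.7:206.0",
    },
    "ABCOUHV06": {
        "Emulex Plus Driver": "10.4:215.0",
        "Emulex FCoE Storport Miniport Driver": "10.4:246.0",
        "Emulex NDIS 6.40 miniport": "10.6:236.0",
    },
    "ABCOUHV07": {
        "Emulex Plus Driver": "10.4:215.0",
        "Emulex FCoE Storport Miniport Driver": "10.4:246.0",
        "Emulex NDIS 6.40 miniport": "10.6:236.0",
    },
}


def version_drift(drivers_by_node):
    """Return {driver: {version: [nodes]}} for drivers with more than one version."""
    by_driver = defaultdict(lambda: defaultdict(list))
    for node, installed in drivers_by_node.items():
        for name, version in installed.items():
            by_driver[name][version].append(node)
    return {name: dict(versions)
            for name, versions in by_driver.items() if len(versions) > 1}


drift = version_drift(drivers)
for name, versions in drift.items():
    print(name, versions)
```

With the values above, the FCoE Storport and NDIS miniport drivers on ABCOUHV01 are at a different level from those on ABCOUHV06 and ABCOUHV07, so the nodes are not at a uniform driver level.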
__________________________________________________________________________________________
Conclusion:
- We analyzed the logs for the 20th and the 25th. In both cases the issue started with the MPIO paths going into a failed state, which happens only while a backup is being taken. We also see errors from the third-party provider at the same time. At this point we can proceed with the following action plan:
- Please install the following hotfixes on all nodes of the cluster.
- https://support.microsoft.com/en-us/kb/3156418 (for the issue in which the CSV goes into a paused state)
- Please install the recommended hotfixes on all nodes: https://support.microsoft.com/en-us/kb/2920151
- The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)
- Please update the backup software first, as the issue occurs while the backup is running.
- Since the MPIO paths are failing, please update the components related to storage and SAN.
- Please update the Fibre Channel drivers:
Manufacturer: Emulex Corporation
Serial Number: H3533664NP
Model: 554FLB
Driver version: 10.4.364.0
firmware version: 10.5.155.0
- Please check with the backup team or the HP team to find the cause of the issue, because we have seen problems with SAN-related software while a backup is running from a third-party application. The cause could lie in any intermediate component.
- Not all nodes of the cluster are in the same OU. Kindly move ABCOUHV01 to Computers.
Fqdn | Domain | Domain Role | Site Name | Organizational Unit |
ABCOUHV01.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | OU=Hyper-V Hosts |
ABCOUHV02.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV03.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV04.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV05.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV06.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV07.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV08.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV09.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
ABCOUHV10.abcdgrp.local | abcdgrp.local | Member Server | TRG-Headquarters | Computers |
Note: This might not be related to the issue that you are facing, but it is a best practice that this machine does not follow.
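The OU check in the table above can also be done programmatically once the node-to-OU mapping has been exported. The following is a small sketch with the mapping typed in by hand from the table; the function name is invented, and how the mapping is obtained from Active Directory is left out.

```python
from collections import Counter

# Node-to-OU mapping copied from the table above.
node_ou = {f"ABCOUHV{i:02d}": "Computers" for i in range(2, 11)}
node_ou["ABCOUHV01"] = "OU=Hyper-V Hosts"


def ou_outliers(mapping):
    """Return the most common OU and the nodes that are not in it."""
    majority, _count = Counter(mapping.values()).most_common(1)[0]
    outliers = sorted(node for node, ou in mapping.items() if ou != majority)
    return majority, outliers


majority_ou, outliers = ou_outliers(node_ou)
print(f"Most nodes are in {majority_ou!r}; nodes elsewhere: {outliers}")
```

Treating the most common OU as the expected one is a design shortcut; if the intended OU is known, compare against that value directly instead.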