Issue Description:
We need to determine the reason for the unexpected failover on cluster abcdCluster01, running Microsoft Windows Server 2012 Standard, Version 6.2.9200 Build 9200, on August 09 @ 8:53pm.
Initial Description:
In this case the resources failed over from one node to another. This generally happens when the node on which a resource was running is no longer capable of running that resource, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from active failover cluster membership (Event ID 1135), which causes the resources to fail over to another node.
Why is Event ID 1135 logged?
This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down. That node then notifies all of the other nodes of the event. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.
What caused the node to be marked down?
All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to "Allow cluster network communication on this network". The nodes send heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes, which then send a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network is up and the other nodes are up.
If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.
By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.
If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
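The heartbeat rule described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not the Cluster service's actual algorithm: the class and function names (RouteMonitor, node_removed) are hypothetical, and the threshold of 5 misses within 5 seconds is taken from the default quoted in the text.

```python
from collections import deque

# Assumed defaults, taken from the text above: 5 missed heartbeats in 5 seconds.
MISS_THRESHOLD = 5
WINDOW_SECONDS = 5.0


class RouteMonitor:
    """Hypothetical tracker for missed heartbeats on one route to a peer node."""

    def __init__(self, threshold=MISS_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.misses = deque()  # timestamps of recent missed heartbeats
        self.down = False

    def record(self, timestamp, got_reply):
        if got_reply:
            # A reply shows the route works again, so reset the miss history.
            self.misses.clear()
            self.down = False
            return
        self.misses.append(timestamp)
        # Drop misses that have aged out of the sliding window.
        while self.misses and timestamp - self.misses[0] >= self.window:
            self.misses.popleft()
        if len(self.misses) >= self.threshold:
            self.down = True


def node_removed(routes):
    """A peer is removed from membership only when every route to it is down."""
    return all(route.down for route in routes)


route_a = RouteMonitor()
route_b = RouteMonitor()
for second in range(5):
    route_a.record(second, got_reply=False)  # route A misses every heartbeat
    route_b.record(second, got_reply=True)   # route B still answers
print(route_a.down, node_removed([route_a, route_b]))  # True False

for second in range(5, 10):
    route_b.record(second, got_reply=False)  # now route B fails as well
print(node_removed([route_a, route_b]))  # True
```

In this model a single failed route does not remove the node. Only when the last remaining route also crosses the threshold does the removal condition (the point at which Event 1135 would be logged) become true.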
Reference:
Having a problem with nodes being removed from active Failover Cluster membership?
Issue happened on August 09 @ 8:53pm.
_________________________________________
System Information: ABCC1DOCS02
OS Name: Microsoft Windows Server 2012 Standard
Version: 6.2.9200 Build 9200
Other OS Description: Not Available
OS Manufacturer: Microsoft Corporation
System Name: ABCC1DOCS02
System Manufacturer: VMware, Inc.
System Model: VMware Virtual Platform
System Type: x64-based PC
System SKU:
Processor: AMD Opteron(tm) Processor 6380, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date: Phoenix Technologies LTD 6.00, 9/17/2015
System Events:
- Checked the events and found that node ABCC1DOCS02 was evicted from the cluster after the iSCSI disconnect error was logged.
- Event ID 20 and Event ID 7 were logged before the issue. Both events indicate that the server (initiator) lost its connection to the target.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
8/9/2016 | 8:45:21 PM | Error | ABCC1DOCS02.abc.local | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to |
8/9/2016 | 8:45:21 PM | Error | ABCC1DOCS02.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given |
8/9/2016 | 8:45:30 PM | Critical | ABCC1DOCS02.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCC1DOCS01’ was removed from the active failover |
- Checked other events at the time of the issue and found that a shadow copy operation was running.
- Event ID 46 was also logged because an MPIO path was removed.
8/9/2016 | 8:46:06 PM | Error | ABCC1DOCS02.abc.local | 29 | volsnap | The shadow copies of volume T: were aborted during detection. |
8/9/2016 | 8:46:06 PM | Information | ABCC1DOCS02.abc.local | 46 | mpio | Path 77040004 was removed from \Device\MPIODisk1 due to a PnP |
8/9/2016 | 8:46:06 PM | Error | ABCC1DOCS02.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume3. |
8/9/2016 | 8:46:06 PM | Warning | ABCC1DOCS02.abc.local | 9293 | AAFsFlt | Volume \Device\HarddiskVolume3 has been disabled because of a PnP |
8/9/2016 | 8:46:06 PM | Warning | ABCC1DOCS02.abc.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption |
- Events related to AppAssure, which was running at the time of the issue, were also logged.
8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume9. |
8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume9. |
8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 9289 | AAFsFlt | Log file could not be opened or re-opened for device |
8/9/2016 | 8:54:28 PM | Error | ABCC1DOCS02.abc.local | 27 | volsnap | The shadow copies of volume T: were aborted during detection |
- These events indicate that the Kaspersky service was also active at the time of the issue. The iSCSI connection errors were logged again at 10:03:18 PM.
8/9/2016 | 8:57:39 PM | Error | ABCC1DOCS02.abc.local | 5 | KLIF | Volume ID of the volume ‘\Device\HarddiskVolume13’ is already in |
8/9/2016 | 10:03:18 PM | Error | ABCC1DOCS02.abc.local | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to |
8/9/2016 | 10:03:18 PM | Error | ABCC1DOCS02.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given |
Application Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
8/9/2016 | 8:42:46 PM | Error | ABCC1DOCS02.abc.local | 8194 | VSS | Volume Shadow Copy Service error: Unexpected error querying for |
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/27/2007 19:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
6/5/2012 13:59 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver |
6/5/2012 14:02 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, | Brocade FC/FCoE HBA Stor Miniport |
2/23/2012 20:21 | (6.2:8220.0) | (7.0:51.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
11/3/2011 15:06 | (6.2:8128.0) | (7.0:50.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64) |
11/10/2011 1:04 | (6.2:8128.0) | (7.0:16.50) | Broadcom Corporation | FCoE offload x64 FREE |
2/23/2012 20:17 | (6.2:8220.0) | (7.0:13.50) | Broadcom Corporation | iSCSI offload x64 FREE |
7/23/2012 19:30 | (7.0:1.36) | (7.0:1.36) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
9/21/2012 1:07 | (4.5:0.6409) | (4.5:0.6409) | Dell | Dell EqualLogic Device Specific Module |
7/24/2012 8:22 | (7.0:35.95) | (7.0:35.95) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
6/6/2006 17:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
5/31/2012 13:11 | (9.1:9.205) | (9.1:9.205) | QLogic Corporation | QLogic Fibre Channel Stor Miniport |
12/7/2011 19:48 | (2.1:5.10) | (2.1:5.10) | QLogic Corporation | QLogic iSCSI Storport Miniport |
______________________________________________________________________________
System Information: ABCC1DOCS01
OS Name: Microsoft Windows Server 2012 Standard
Version: 6.2.9200 Build 9200
Other OS Description: Not Available
OS Manufacturer: Microsoft Corporation
System Name: ABCC1DOCS01
System Manufacturer: VMware, Inc.
System Model: VMware Virtual Platform
System Type: x64-based PC
System SKU:
Processor: AMD Opteron(tm) Processor 6380, 2500 Mhz, 4 Core(s), 4 Logical Processor(s)
BIOS Version/Date: Phoenix Technologies LTD 6.00, 9/17/2015
System Events:
- Checked the events and found Event ID 1135.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
8/9/2016 | 8:45:28 PM | Critical | ABCC1DOCS01.abc.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCC1DOCS02’ was removed from the active failover |
- The same iSCSI errors were logged on this node, which indicates a storage disconnect.
8/9/2016 | 8:45:32 PM | Error | ABCC1DOCS01.abc.local | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to |
8/9/2016 | 8:45:32 PM | Error | ABCC1DOCS01.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given |
- Because the disks were disconnected, NTFS logged events as it was unable to flush the transaction log.
8/9/2016 | 8:45:58 PM | Warning | ABCC1DOCS01.abc.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption |
- Events related to the running backup job were also logged.
8/9/2016 | 8:45:58 PM | Error | ABCC1DOCS01.abc.local | 9295 | AAFsFlt | Metadata could not be written to volume \Device\HarddiskVolume10. |
8/9/2016 | 8:45:59 PM | Error | ABCC1DOCS01.abc.local | 1038 | Microsoft-Windows-FailoverClustering | Ownership of cluster disk ‘Cluster Data (T:)’ has been |
8/9/2016 | 8:46:02 PM | Error | ABCC1DOCS01.abc.local | 14 | volsnap | The shadow copies of volume T: were aborted because of an IO |
- Because the disk was unexpectedly removed from the node, corruption occurred on volume Q.
8/9/2016 | 8:58:33 PM | Error | ABCC1DOCS01.abc.local | 55 | Ntfs | A corruption was discovered in the file system structure on volume |
- The last event occurred at 9:08:09 PM on 8/9/2016, after which the issue was resolved.
8/9/2016 | 9:08:09 PM | Error | ABCC1DOCS01.abc.local | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given |
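To compare the order of events across both nodes, the rows from the two System Events tables can be merged into one timeline. The following is a minimal sketch, assuming the pipe-delimited layout used in this report and assuming that both nodes' clocks are synchronized. The helper names (parse_row, build_timeline) are hypothetical, and only six rows are copied here, with the description column left out.

```python
from datetime import datetime

# Rows copied from the event tables in this report (description column omitted).
ROWS = """\
8/9/2016 | 8:45:21 PM | Error | ABCC1DOCS02.abc.local | 20 | iScsiPrt
8/9/2016 | 8:45:21 PM | Error | ABCC1DOCS02.abc.local | 7 | iScsiPrt
8/9/2016 | 8:45:30 PM | Critical | ABCC1DOCS02.abc.local | 1135 | Microsoft-Windows-FailoverClustering
8/9/2016 | 8:45:28 PM | Critical | ABCC1DOCS01.abc.local | 1135 | Microsoft-Windows-FailoverClustering
8/9/2016 | 8:45:32 PM | Error | ABCC1DOCS01.abc.local | 20 | iScsiPrt
8/9/2016 | 8:45:32 PM | Error | ABCC1DOCS01.abc.local | 7 | iScsiPrt
"""


def parse_row(line):
    """Split one pipe-delimited row into (timestamp, host, event_id, source)."""
    fields = [field.strip() for field in line.split("|")[:6]]
    date, time, _level, host, event_id, source = fields
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")
    return stamp, host.split(".")[0], int(event_id), source


def build_timeline(text):
    """Parse every non-empty row and sort the events chronologically."""
    events = [parse_row(line) for line in text.splitlines() if line.strip()]
    # sorted() is stable, so rows with the same timestamp keep their input order.
    return sorted(events, key=lambda event: event[0])


timeline = build_timeline(ROWS)
for stamp, host, event_id, source in timeline:
    print(f"{stamp:%H:%M:%S}  {host:<12} {event_id:>5}  {source}")
```

With these six rows, the earliest entries are the iScsiPrt errors on ABCC1DOCS02 at 8:45:21 PM, ahead of the 1135 events on either node. The comparison is only as reliable as the time synchronization between the two VMs.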
Application Events:
- Checked the events but was not able to find anything specific related to the issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/27/2007 19:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
6/5/2012 13:59 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver |
6/5/2012 14:02 | (6.2:8415.0) | (6.2:8415.0) | Brocade Communications Systems, | Brocade FC/FCoE HBA Stor Miniport |
2/23/2012 20:21 | (6.2:8220.0) | (7.0:51.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
11/3/2011 15:06 | (6.2:8128.0) | (7.0:50.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64) |
11/10/2011 1:04 | (6.2:8128.0) | (7.0:16.50) | Broadcom Corporation | FCoE offload x64 FREE |
2/23/2012 20:17 | (6.2:8220.0) | (7.0:13.50) | Broadcom Corporation | iSCSI offload x64 FREE |
7/23/2012 19:30 | (7.0:1.36) | (7.0:1.36) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
9/21/2012 1:07 | (4.5:0.6409) | (4.5:0.6409) | Dell | Dell EqualLogic Device Specific Module |
7/24/2012 8:22 | (7.0:35.95) | (7.0:35.95) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
6/6/2006 17:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
5/31/2012 13:11 | (9.1:9.205) | (9.1:9.205) | QLogic Corporation | QLogic Fibre Channel Stor Miniport |
12/7/2011 19:48 | (2.1:5.10) | (2.1:5.10) | QLogic Corporation | QLogic iSCSI Storport Miniport |
_________________________________________________________________________________
Conclusion:
- After analyzing the logs, we found that the issue started after the nodes lost their connection to the storage. This can happen when any device used to connect the nodes to the SAN fails, for example a switch or router. Since the issue occurred on both nodes, the cause could be a device that is common to these two VMs.
- The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)
- Kindly check internally whether any device connected between the servers and the SAN malfunctioned, because the issue started after the storage disconnect.
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on all the nodes: