Issue Description:
Event ID 1230 (“Cluster resource ‘FileServer-(Condor)’ (resource type ‘’, DLL ‘clusres.dll’) either crashed or deadlocked.”) is being logged on cluster EKNCL04, which runs Microsoft Windows Server 2008 R2 Enterprise, Version 6.1.7601 Service Pack 1 Build 7601.
Initial Description:
In this case the resources failed over from one node to another. This generally happens when the node on which a resource was running is no longer capable of running it, for example because it cannot access storage or has lost network connectivity. Sometimes the node on which the resource was running is removed from active failover cluster membership (Event ID 1135), which causes its resources to fail over to another node.
Why is Event ID 1135 Logged?
This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.
What caused the node to be marked down?
All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to Allow cluster network communication on this network. The nodes will send out heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes and then a response is sent back. Each node in the Cluster has its own heartbeats that it is going to monitor to ensure the network is up and the other nodes are up. The example below should help clarify this:
If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a heartbeat request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it. This missed response can have W2K8-R2-NODE1 show the network as down until another heartbeat request is received.
By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down. If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.
If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
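The heartbeat logic described above can be sketched as a small model. This is only an illustration, not the Cluster service's actual implementation: it assumes one heartbeat per second, that misses are counted consecutively, and that a reply resets the counter. The names `Route` and `peer_is_removed` and the route labels are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Default described above: 5 failures in 5 seconds before a route is marked down.
MISS_THRESHOLD = 5


@dataclass
class Route:
    """One network path from the local node to a peer node (hypothetical model)."""
    name: str
    consecutive_misses: int = 0

    def record(self, reply_received: bool) -> None:
        # Assumption: a reply resets the counter, a missing reply increments it.
        self.consecutive_misses = 0 if reply_received else self.consecutive_misses + 1

    @property
    def is_down(self) -> bool:
        return self.consecutive_misses >= MISS_THRESHOLD


def peer_is_removed(routes: list[Route]) -> bool:
    """The peer is removed from membership only when every route to it is down."""
    return all(route.is_down for route in routes)


if __name__ == "__main__":
    routes = [Route("route-a"), Route("route-b")]
    for _ in range(MISS_THRESHOLD):
        routes[0].record(False)   # route-a loses five heartbeats in a row
        routes[1].record(True)    # route-b keeps answering
    print(peer_is_removed(routes))  # one route is still up
    for _ in range(MISS_THRESHOLD):
        routes[1].record(False)   # now route-b fails as well
    print(peer_is_removed(routes))  # all routes down, Event 1135 would follow
```

With a single failed route the peer stays an active member; only when the last remaining route crosses the threshold does the model report removal.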
Reference :
Having a problem with nodes being removed from active Failover Cluster membership?
________________________________________________________________________
System Information: CLSTRFILE04
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name CLSTRFILE04
System Manufacturer VMware, Inc.
System Model VMware Virtual Platform
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5649 @ 2.53GHz, 2533 Mhz, 2 Core(s), 2 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5649 @ 2.53GHz, 2533 Mhz, 2 Core(s), 2 Logical Processor(s)
BIOS Version/Date Phoenix Technologies LTD 6.00, 30/07/2013
System Events:
- Checked the events and found that cluster node ABCFILE08 was removed from active failover cluster membership (FCM) at around 4:29:55 PM.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
6/1/2016 | 4:29:55 PM | Critical | CLSTRFILE04.ABC.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCFILE08’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
Application Events:
- Checked the application logs but was not able to find any event related to the issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/28/2007 0:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
3/20/2009 18:36 | (3.6:1540.127) | (3.6:1540.127) | AMD Technologies Inc. | AMD Technology AHCI Compatible Controller Driver for Windows – AMD64 platform |
1/14/2009 19:27 | (5.2:0.16119) | (5.2:0.16119) | Adaptec, Inc. | Adaptec SAS RAID WS03 Driver |
4/26/2009 12:14 | (10.100:4.0) | (10.100:4.0) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
8/7/2006 2:51 | (1.0:1.1) | (1.0:1.6) | Brother Industries Ltd. | Brotehr Serial I/F Driver (WDM) |
8/7/2006 2:51 | (6.0:5479.0) | (1.0:0.12) | Brother Industries Ltd. | Brother USB MDM Driver |
2/13/2009 22:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
5/29/2008 0:14 | (6.0:6001.18000) | (8.4:1.0) | Intel Corporation | Intel(R) PRO/1000 Adapter NDIS 6 deserialized driver |
12/31/2008 16:29 | (4.8:13.0) | (4.8:13.0) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
12/13/2005 21:47 | (0.4:22.0) | (5.4:22.0) | Intel Corp./ICP vortex GmbH | Intel/ICP Raid Storport Driver |
4/16/2009 23:13 | (6.1:7083.0) | (1.28:3.67) | LSI Corporation | LSI Fusion-MPT SCSI Driver (StorPort) |
5/19/2009 2:09 | (4.5:1.64) | (4.5:1.64) | LSI Corporation | MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64 |
5/19/2009 2:25 | 13.05.0409.2009 | (13.5:409.2009) | LSI Corporation, Inc. | LSI MegaRAID Software RAID Driver |
6/6/2006 22:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
8/10/2007 0:47 | (1.2:78.3) | (1.2:78.3) | Intel Corporation | Intel(R) 5000 Series Chipsets Integrated Device – 1A38 |
1/22/2009 23:05 | (9.1:8.6) | (9.1:8.6) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
5/19/2009 2:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
9/13/2006 14:18 | (4.3:86.0) | (4.3:86.0) | Macrovision Corporation, Macrovision Europe Limited, and Macrovision Japan and Asia K.K. | Macrovision SECURITY Driver |
7/14/2009 0:19 | (6.0:6000.170) | (6.0:6000.170) | VIA Technologies, Inc. | VIA Generic PCI IDE Bus Driver |
1/31/2009 1:18 | (6.0:6000.6210) | (6.0:6000.6210) | VIA Technologies Inc.,Ltd | VIA RAID DRIVER FOR AMD-X86-64 |
Cluster Events:
- Checked the events and found that the cluster networks are coming online.
6/1/2016 | 4:29:55 PM | Information | CLSTRFILE04.ABC.com | 1204 | Microsoft-Windows-FailoverClustering | The Cluster service successfully brought the clustered service or application ‘Available Storage’ offline. |
6/1/2016 | 4:29:55 PM | Information | CLSTRFILE04.ABC.com | 1125 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘CLSTRFILE04 – Service LAN’ for cluster node ‘CLSTRFILE04’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network. |
______________________________________________________________________________
System Information: ABCFILE07
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCFILE07
System Manufacturer HP
System Model ProLiant DL360p Gen8
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P71, 08/09/2013
System Events:
- Event 1085, related to Folder Redirection, is logged.
- At 4:29:55 PM cluster node ABCFILE08 was removed from active failover cluster membership (FCM).
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
6/1/2016 | 4:25:49 PM | Warning | ABCFILE07.ABC.com | 1085 | Microsoft-Windows-GroupPolicy | Windows failed to apply the Folder Redirection settings. Folder Redirection settings might have its own log file. Please click on the ‘More information’ link. |
6/1/2016 | 4:29:55 PM | Critical | ABCFILE07.ABC.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ABCFILE08’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
Application Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
6/1/2016 | 4:25:48 PM | Error | ABCFILE07.ABC.com | 502 | Microsoft-Windows-Folder Redirection | Failed to apply policy and redirect folder ‘Documents’ to ‘\\abcfs01\abc$\dave.farmer\My Documents’. Redirection options=0x9231. The following error occurred: ‘Can not create folder ‘\\abcfs01\abc$\dave.farmer\My Documents”. Error details: ‘This security ID may not be assigned as the owner of this object. ‘. |
Cluster Events:
6/1/2016 | 4:29:55 PM | Information | ABCFILE07.ABC.com | 1125 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘CLSTRFILE04 – Service LAN’ for cluster node ‘CLSTRFILE04’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network. |
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
12/5/2008 23:54 | (6.1:3790.0) | (1.6:6.4) | Adaptec, Inc. | Adaptec Windows SAS/SATA Storport Driver |
5/1/2007 18:30 | (6.0:3790.16512) | (1.6:6.1) | Adaptec, Inc. | Adaptec Windows SATA Storport Driver |
2/28/2007 0:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
3/19/2010 16:18 | (1.1:2.5) | (1.1:2.5) | Advanced Micro Devices | Storage Filter Driver |
2/13/2009 22:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
2/3/2009 22:52 | (7.2:10.211) | (7.2:10.211) | Emulex | Storport Miniport Driver for LightPulse HBAs |
12/31/2008 16:29 | (4.8:13.0) | (4.8:13.0) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
4/24/2003 19:03 | (6.0:1.0) | (6.0:1.0) | Broadcom Corporation | Frame Access Driver |
6/11/2010 1:46 | (8.6:2.1014) | (8.6:2.1014) | Intel Corporation | Intel Matrix Storage Manager driver – x64 |
12/13/2005 21:47 | (0.4:22.0) | (5.4:22.0) | Intel Corp./ICP vortex GmbH | Intel/ICP Raid Storport Driver |
12/2/2009 21:36 | (5.2:3790.1830) | (1.3:0.4) | Intel Corporation | Intel(R) Network Adapter Diagnostic Driver |
5/19/2009 2:09 | (4.5:1.64) | (4.5:1.64) | LSI Corporation | MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64 |
5/19/2009 2:25 | 13.05.0409.2009 | (13.5:409.2009) | LSI Corporation, Inc. | LSI MegaRAID Software RAID Driver |
6/6/2006 22:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
8/10/2007 0:47 | (1.2:78.3) | (1.2:78.3) | Intel Corporation | Intel(R) 5000 Series Chipsets Integrated Device – 1A38 |
1/22/2009 23:05 | (9.1:8.6) | (9.1:8.6) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
5/19/2009 2:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver |
2/17/2009 23:03 | (5.0:1.1) | (5.0:1.1) | Promise Technology | Promise SuperTrak EX Series Driver for Windows |
__________________________________________________________________________________
System Information: ABCFILE08
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCFILE08
System Manufacturer HP
System Model ProLiant DL360p Gen8
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P71, 9/8/2013
System Events:
- Schannel event 36888 is logged with internal error state 1203.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
6/1/2016 | 3:50:52 PM | Error | ABCFILE08.ABC.com | 36888 | Schannel | The following fatal alert was generated: 10. The internal error state is 1203. |
6/1/2016 | 3:50:52 PM | Error | ABCFILE08.ABC.com | 36888 | Schannel | The following fatal alert was generated: 10. The internal error state is 1203. |
6/1/2016 | 4:19:05 PM | Error | ABCFILE08.ABC.com | 1230 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(Condor)’ (resource type ”, DLL ‘clusres.dll’) either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor. |
- Checked and found that clusres.dll deadlocked.
- Immediately after the Event ID 2012 entries, all the resources started to fail, which generally points to a problem on the networking side.
As per the Article: https://support.microsoft.com/en-us/kb/2885205
In Words
0000: 00040000 002C0001 00000000 800007DC
0010: 00000000 C0000184 00000000 00000000
0020: 00000000 00000000 0000058F
C0000184 = STATUS_INVALID_DEVICE_STATE: the device is not in a valid state to perform this request.
- This is essentially an error that the network driver returns to SRV on send IRPs. It usually indicates that a send was issued on a connection that is no longer in a valid state for sending. For example, a send on a connection that has not reached the connected state returns STATUS_INVALID_DEVICE_STATE. If a disconnect has been initiated, further sends return the same error.
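The status code can be picked out of the Words data programmatically. The sketch below is a minimal, assumption-laden helper (the function names and the one-entry lookup table are hypothetical): it splits the dump into 32-bit words and keeps those whose top two bits are set, which is the NTSTATUS error severity.

```python
from __future__ import annotations

# Event data exactly as shown above (formatted as Words).
DUMP = """
0000: 00040000 002C0001 00000000 800007DC
0010: 00000000 C0000184 00000000 00000000
0020: 00000000 00000000 0000058F
"""

# Only the code discussed in this report; extend the table as needed.
KNOWN_NTSTATUS = {0xC0000184: "STATUS_INVALID_DEVICE_STATE"}


def parse_words(dump: str) -> list[int]:
    """Drop the 'offset:' prefix of each line and parse the hex words."""
    words: list[int] = []
    for line in dump.strip().splitlines():
        _, _, payload = line.partition(":")
        words.extend(int(token, 16) for token in payload.split())
    return words


def error_statuses(words: list[int]) -> list[int]:
    """Keep words whose two most significant bits are 11 (NTSTATUS error severity)."""
    return [word for word in words if (word >> 30) == 0b11]


if __name__ == "__main__":
    for status in error_statuses(parse_words(DUMP)):
        name = KNOWN_NTSTATUS.get(status, "unknown status")
        print(f"0x{status:08X} = {name}")
```

As a side check, the low 16 bits of the fourth word (800007DC) are 0x07DC, which is 2012 in decimal and matches the event ID being decoded.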
As per the Article: https://blogs.technet.microsoft.com/yongrhee/2015/05/16/event-id-2012-while-transmitting-or-receiving-data-the-server-encountered-a-network-error/
- Cause:
=======
1. Antivirus Filter driver interfering with the network stack
2. An outdated or bad network card driver
3. A bad NIC
4. Network Teaming software
5. WAN Optimization devices
6. Mismatched Speed and Duplex settings between the NIC and switch
7. A spotty connection to a switch port
- Resolution:
==========
- Make sure that the firmware for the network switches, WAN accelerators, and routers is up to date.
- Update the NIC firmware and driver.
- Update the NIC teaming software/driver.
- Update the antivirus software or completely uninstall it (for relief, then follow up with the AV vendor).
- Manually set the speed/duplex.
- Replace the network cable(s).
- Try a different switch port.
- For WAN optimizers, to keep the packets from being modified, try encapsulating them using IPsec.
6/1/2016 | 4:26:30 PM | Warning | ABCFILE08.ABC.com | 2012 | srv | While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration. The error status code is contained within the returned data (formatted as Words) and may point you towards the problem. |
6/1/2016 | 4:26:30 PM | Critical | ABCFILE08.ABC.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster resource host subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually due to a problem in a resource DLL. Please determine which resource DLL is causing the issue and report the problem to the resource vendor. |
6/1/2016 | 4:26:30 PM | Warning | ABCFILE08.ABC.com | 2012 | srv | While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration. The error status code is contained within the returned data (formatted as Words) and may point you towards the problem. |
6/1/2016 | 4:26:30 PM | Error | ABCFILE08.ABC.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(Condor)’ in clustered service or application ‘Condor’ failed. |
- Cluster disk started to fail with Ntfs Errors.
6/1/2016 | 4:26:37 PM | Error | ABCFILE08.ABC.com | 137 | Ntfs | The default transaction resource manager on volume T: encountered a non-retryable error and could not start. The data contains the error code. |
6/1/2016 | 4:26:59 PM | Error | ABCFILE08.ABC.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Disk N:\’ in clustered service or application ‘Condor’ failed. |
- After the machine was restarted, events related to the NIC team were logged.
6/1/2016 | 4:32:53 PM | Warning | ABCFILE08.ABC.com | 461 | CPQTeamMP | Team ID: 0 Aggregation ID: 0 Team Member ID: 0 PROBLEM: 802.3ad link aggregation (LACP) has failed. ACTION: Ensure all ports are connected to LACP-aware devices. |
6/1/2016 | 4:33:02 PM | Warning | ABCFILE08.ABC.com | 434 | CPQTeamMP | HP Network Team #1: PROBLEM: A non-Primary Network Link is not receiving. Receive-path validation has been enabled for this Team by selecting the Enable receive-path validation Heartbeat Setting. ACTION: Please check your cabling to the link partner. Check the switch port status, including verifying that the switch port is not configured as a Switch-assist Channel. Generate Broadcast traffic on the network to test whether these are being received. Also make sure all teamed NICs are on the same broadcast domain. Run diagnostics to test card. Drop the NIC from the team, determine whether it is receiving broadcast traffic in that configuration. |
6/1/2016 | 4:35:03 PM | Error | ABCFILE08.ABC.com | 103 | MSiSCSI | Timeout waiting for iSCSI persistently bound volumes. If there are any services or applications that use information stored on these volumes then they may not start or may report errors. |
Application Events:
- Checked the application logs and found that the issue is with the connections between the Server and the SAN.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
6/1/2016 | 4:35:10 PM | Error | ABCFILE08.ABC.com | 2004 | Microsoft-Windows-PerfNet | Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code. |
6/1/2016 | 4:35:32 PM | Warning | ABCFILE08.ABC.com | 281 | SnapDrive | Failed to get data for an iSCSI HBA. HBA WMI class instance name: Root\ISCSIPRT\0000_0 Error code = 0x8004100c Error description = WDM specific return code: 4200 |
6/1/2016 | 4:35:37 PM | Warning | ABCFILE08.ABC.com | 317 | SnapDrive | Failed to enumerate LUN. Device path: ‘\\?\mpio#disk&ven_netapp&prod_lun&rev_811a#1&7f6ac24&0&3630413938303033373asdas32232135413330373835363730#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}‘ Storage path: ‘/vol/vol_ISCSI_EKNCL04_QUORUM/qtree_ISCSI_EKNCL04_QUORUM/lun_ISCSI_EKNCL04_QUORUM’ SCSI address: (3,0,0,0) Error code: 0xc00402fa Error description: A LUN with device path \\?\mpio#disk&ven_netapp&prod_lun&rev_811a#1&7f6ac24&0&36304139383030333735343333344637313544333335413330373835363730#{53f56307-b6bf-11sa21312130a0c91efb8b} and SCSI address (3, 0, 0, 0) is exposed through an unsupported initiator. |
Cluster Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
6/1/2016 | 4:35:30 PM | Information | ABCFILE08.ABC.com | 1062 | Microsoft-Windows-FailoverClustering | This node has successfully joined the failover cluster ‘EKNCL04’. |
Cluster Logs:
00000c88.00002938::2016/06/01-15:06:02.872 ERR mscs::TopologyPersister::TryGetNetworkPrivateProperties: ERROR_FILE_NOT_FOUND(2)’ because of ‘OpenSubKey failed.’
00000c88.00002938::2016/06/01-15:06:02.872 INFO [NM] Received request from client address ABCFILE08.
000015c4.000055dc::2016/06/01-15:06:04.447 WARN [RES] File Server <FileServer-(Condor)>: Failed in NetShareGetInfo(Condor, PST Exports from old server), status 2310. Tolerating…
000015c4.000055dc::2016/06/01-15:06:04.463 WARN [RES] File Server <FileServer-(Condor)>: Failed in NetShareGetInfo(Condor, sp4$), status 2310. Tolerating…
000015c4.000015d4::2016/06/01-15:19:05.014 ERR [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource ‘FileServer-(Condor)’.
000015c4.000015d4::2016/06/01-15:19:05.014 INFO [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.
000015c4.000015d4::2016/06/01-15:19:05.014 ERR [RHS] Resource FileServer-(Condor) handling deadlock. Cleaning current operation and terminating RHS process.
000015c4.000015d4::2016/06/01-15:19:05.014 ERR [RHS] About to send WER report.
00000c88.0000369c::2016/06/01-15:19:05.014 WARN [RCM] HandleMonitorReply: FAILURENOTIFICATION for ‘FileServer-(Condor)’, gen(0) result 4.
00000c88.0000369c::2016/06/01-15:19:05.014 INFO [RCM] rcm::RcmResource::HandleMonitorReply: Resource ‘FileServer-(Condor)’ consecutive failure count 1.
00000c88.00007224::2016/06/01-15:25:29.369 ERR [RCM] rcm::RcmResControl::DoResourceControl: ERROR_RESOURCE_CALL_TIMED_OUT(5910)’ because of ‘Control(STORAGE_GET_DISK_INFO) to resource ‘Disk L:\’ timed out.’
00000c88.00007224::2016/06/01-15:25:29.369 WARN [RCM] ResourceControl(STORAGE_GET_DISK_INFO) to Disk L:\ returned 5910.
000015c4.00007b6c::2016/06/01-15:26:29.960 WARN [RES] File Server <FileServer-(Condor)>: Failed in NetShareGetInfo(Condor, sp4$), status 2310. Tolerating…
00000c88.00006594::2016/06/01-15:26:30.927 INFO [RCM] rcm::RcmResource::ReattachToMonitorProcess: (IP Address 193.27.213.16, Offline)
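The cluster log timestamps are one hour behind the event log times for what appear to be the same incidents (for example, the ISALIVE timeout at 15:19:05 and Event 1230 at 4:19:05 PM). That is consistent with the cluster log being written in UTC while the event logs show local time at UTC+1, but the offset is an assumption inferred from these entries, not something stated in the logs. Under that assumption, a minimal sketch for lining the two sources up could look like this (`parse_line` and `ASSUMED_LOCAL_OFFSET` are hypothetical names):

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta

# Layout seen above: <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)

# Assumption: cluster log is UTC and the servers' local time is UTC+1.
ASSUMED_LOCAL_OFFSET = timedelta(hours=1)


def parse_line(line: str) -> dict | None:
    """Parse one cluster log line; return None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    logged = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return {
        "logged": logged,                        # timestamp as written in the log
        "local": logged + ASSUMED_LOCAL_OFFSET,  # shifted to event log time
        "level": match["level"],
        "message": match["msg"],
    }


if __name__ == "__main__":
    sample = (
        "000015c4.000015d4::2016/06/01-15:19:05.014 ERR [RHS] "
        "RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource "
        "'FileServer-(Condor)'."
    )
    entry = parse_line(sample)
    print(entry["local"].strftime("%I:%M:%S %p"), entry["level"], entry["message"])
```

Sorting the parsed cluster log entries by `local` together with the event log rows gives a single timeline for the incident.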
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
12/5/2008 23:54 | (6.1:3790.0) | (1.6:6.4) | Adaptec, Inc. | Adaptec Windows SAS/SATA Storport Driver |
5/1/2007 18:30 | (6.0:3790.16512) | (1.6:6.1) | Adaptec, Inc. | Adaptec Windows SATA Storport Driver |
2/28/2007 0:04 | (6.0:6001.16459) | (7.2:0.0) | Adaptec, Inc. | Adaptec StorPort Ultra320 SCSI Driver (X64) |
3/19/2010 16:18 | (1.1:2.5) | (1.1:2.5) | Advanced Micro Devices | Storage Filter Driver |
2/13/2009 22:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
2/3/2009 22:52 | (7.2:10.211) | (7.2:10.211) | Emulex | Storport Miniport Driver for LightPulse HBAs |
12/31/2008 16:29 | (4.8:13.0) | (4.8:13.0) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD |
4/24/2003 19:03 | (6.0:1.0) | (6.0:1.0) | Broadcom Corporation | Frame Access Driver |
6/11/2010 1:46 | (8.6:2.1014) | (8.6:2.1014) | Intel Corporation | Intel Matrix Storage Manager driver – x64 |
12/13/2005 21:47 | (0.4:22.0) | (5.4:22.0) | Intel Corp./ICP vortex GmbH | Intel/ICP Raid Storport Driver |
12/2/2009 21:36 | (5.2:3790.1830) | (1.3:0.4) | Intel Corporation | Intel(R) Network Adapter Diagnostic Driver |
5/19/2009 2:09 | (4.5:1.64) | (4.5:1.64) | LSI Corporation | MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64 |
5/19/2009 2:25 | 13.05.0409.2009 | (13.5:409.2009) | LSI Corporation, Inc. | LSI MegaRAID Software RAID Driver |
6/6/2006 22:11 | (7.10:0.0) | (7.10:0.0) | IBM Corporation | IBM ServeRAID Controller Driver |
8/10/2007 0:47 | (1.2:78.3) | (1.2:78.3) | Intel Corporation | Intel(R) 5000 Series Chipsets Integrated Device – 1A38 |
1/22/2009 23:05 | (9.1:8.6) | (9.1:8.6) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Driver |
5/19/2009 2:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver |
2/17/2009 23:03 | (5.0:1.1) | (5.0:1.1) | Promise Technology | Promise SuperTrak EX Series Driver for Windows |
_________________________________________________________________
Conclusion:
- The logs indicate that the issue started on the networking side: network connectivity was lost on node ABCFILE08, which led to Event ID 1135 and the node being removed from active cluster membership. The node rejoined the cluster at 4:35 PM, after the machine was restarted and the network came back up. The logs also show Event ID 2012, which usually indicates that a send was issued on a connection that is no longer in a valid state for sending.
- Make sure that the firmware for the network switches, WAN accelerators, and routers is up to date.
- Update the NIC firmware and driver.
- Update the NIC teaming software/driver.
- Update the antivirus software or completely uninstall it (for relief, then follow up with the AV vendor).
- Manually set the speed/duplex.
- Replace the network cable(s).
- Try a different switch port.
- For WAN optimizers, to keep the packets from being modified, try encapsulating them using IPsec.
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:
Updates for Cluster Binaries for 2008 R2 : https://support.microsoft.com/en-us/kb/2545685
- Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: this step is the most critical when dealing with network connectivity issues.
Investigation of Network Issues:
We need to investigate the network connectivity issues with the help of the in-house networking team.
To avoid this issue in future, the most critical part is to diagnose and investigate the persistent network connectivity issue on the cluster networks.
We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.
We also need to check hubs, switches, or bridges in the networks that connect the nodes.
We need to check for switch delays and proxy ARP with the help of the in-house networking team.