RCA- 4 – Cluster Resource Crashed or Deadlocked

Issue Description:

 

Event ID 1230, “Cluster resource ‘FileServer-(Condor)’ (resource type ‘’, DLL ‘clusres.dll’) either crashed or deadlocked.”, was logged on cluster EKNCL04, which runs Microsoft Windows Server 2008 R2 Enterprise, Version 6.1.7601 Service Pack 1 Build 7601.

 

Initial Description:

 

In cases like this, resources fail over from one node to another. This generally happens when the node that was running the resource can no longer run it, for example because it cannot access storage or has lost network connectivity. Sometimes the node that was running the resource is removed from the active failover cluster membership (Event ID 1135), which forces the resources to fail over to another node.

 

Why is Event ID 1135 logged?

This event is logged on all nodes in the Cluster except for the node that was removed. It is logged because one of the nodes in the Cluster marked that node as down and then notified all of the other nodes. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to “Allow cluster network communication on this network”. The nodes send heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes, which then send a response back. Each node in the Cluster monitors its own heartbeats to ensure that the network is up and the other nodes are up. The example below should help clarify this:

[Image: heartbeat connections between the cluster nodes]

If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a request and receives a response from W2K8-R2-NODE1 to a heartbeat packet, so it determines that the network and the node are up.  If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it.  This missed response can have W2K8-R2-NODE1 show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.
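The heartbeat rule described above can be sketched in a few lines of Python. This is only an illustrative model, not how the Cluster service is implemented: the class `RouteMonitor`, the function `node_is_member`, and the route names are hypothetical, and the sketch assumes one heartbeat per second and that only consecutive misses count toward the default threshold of 5 quoted above.

```python
from dataclasses import dataclass

# Default from the text above: 5 missed heartbeats (one per second) mark a route down.
MISSED_HEARTBEAT_THRESHOLD = 5


@dataclass
class RouteMonitor:
    """Tracks heartbeat replies on one network route to a peer node (hypothetical model)."""
    name: str
    missed: int = 0

    def record(self, reply_received: bool) -> None:
        # Assumption: a reply resets the counter, so only consecutive misses count.
        self.missed = 0 if reply_received else self.missed + 1

    @property
    def is_down(self) -> bool:
        return self.missed >= MISSED_HEARTBEAT_THRESHOLD


def node_is_member(routes: list[RouteMonitor]) -> bool:
    """A peer stays in active membership while at least one route to it is up."""
    return any(not route.is_down for route in routes)


# Two routes from NODE1 to NODE2; the names are made up for the example.
routes = [RouteMonitor("Cluster Network 1"), RouteMonitor("Cluster Network 2")]

# Seconds 1-5: the first route loses every heartbeat, the second one still answers.
for _ in range(5):
    routes[0].record(False)
    routes[1].record(True)
print(node_is_member(routes))  # True: one route is still up

# Next 5 seconds: the second route fails as well, so every route is down.
for _ in range(5):
    routes[0].record(False)
    routes[1].record(False)
print(node_is_member(routes))  # False: the point at which Event 1135 would be logged
```

The model keeps one counter per route because, as described above, a node is only removed from membership once every route to it has been marked down.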

Reference :

                Having a problem with nodes being removed from active Failover Cluster membership?

                http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

 

________________________________________________________________________

 

System Information:  CLSTRFILE04

 

OS Name        Microsoft Windows Server 2008 R2 Enterprise

Version        6.1.7601 Service Pack 1 Build 7601

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        CLSTRFILE04

System Manufacturer        VMware, Inc.

System Model        VMware Virtual Platform

System Type        x64-based PC

Processor        Intel(R) Xeon(R) CPU           E5649  @ 2.53GHz, 2533 Mhz, 2 Core(s), 2 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU           E5649  @ 2.53GHz, 2533 Mhz, 2 Core(s), 2 Logical Processor(s)

BIOS Version/Date        Phoenix Technologies LTD 6.00, 30/07/2013

 

System Events:

 

  • The System log shows that cluster node ABCFILE08 was removed from the active failover cluster membership (FCM) at about 4:29:55 PM.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

6/1/2016        4:29:55 PM        Critical        CLSTRFILE04.ABC.com        1135        Microsoft-Windows-FailoverClustering        Cluster node ‘ABCFILE08’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

Application Events:

 

  • Checked the application logs but was not able to find any event related to the issue.

 

List of outdated drivers:

 

Time/Date String        Product Version        File Version        Company Name        File Description

2/28/2007 0:04        (6.0:6001.16459)        (7.2:0.0)        Adaptec, Inc.        Adaptec StorPort Ultra320 SCSI Driver (X64)

3/20/2009 18:36        (3.6:1540.127)        (3.6:1540.127)        AMD Technologies Inc.        AMD Technology AHCI Compatible Controller Driver for Windows – AMD64 platform

1/14/2009 19:27        (5.2:0.16119)        (5.2:0.16119)        Adaptec, Inc.        Adaptec SAS RAID WS03 Driver

4/26/2009 12:14        (10.100:4.0)        (10.100:4.0)        Broadcom Corporation        Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.

8/7/2006 2:51        (1.0:1.1)        (1.0:1.6)        Brother Industries Ltd.        Brother Serial I/F Driver (WDM)

8/7/2006 2:51        (6.0:5479.0)        (1.0:0.12)        Brother Industries Ltd.        Brother USB MDM Driver

2/13/2009 22:18        (4.8:2.0)        (4.8:2.0)        Broadcom Corporation        Broadcom NetXtreme II GigE VBD

5/29/2008 0:14        (6.0:6001.18000)        (8.4:1.0)        Intel Corporation        Intel(R) PRO/1000 Adapter NDIS 6 deserialized driver

12/31/2008 16:29        (4.8:13.0)        (4.8:13.0)        Broadcom Corporation        Broadcom NetXtreme II 10 GigE VBD

12/13/2005 21:47        (0.4:22.0)        (5.4:22.0)        Intel Corp./ICP vortex GmbH        Intel/ICP Raid Storport Driver

4/16/2009 23:13        (6.1:7083.0)        (1.28:3.67)        LSI Corporation        LSI Fusion-MPT SCSI Driver (StorPort)

5/19/2009 2:09        (4.5:1.64)        (4.5:1.64)        LSI Corporation        MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64

5/19/2009 2:25        13.05.0409.2009        (13.5:409.2009)        LSI Corporation, Inc.        LSI MegaRAID Software RAID Driver

6/6/2006 22:11        (7.10:0.0)        (7.10:0.0)        IBM Corporation        IBM ServeRAID Controller Driver

8/10/2007 0:47        (1.2:78.3)        (1.2:78.3)        Intel Corporation        Intel(R) 5000 Series Chipsets Integrated Device – 1A38

1/22/2009 23:05        (9.1:8.6)        (9.1:8.6)        QLogic Corporation        QLogic Fibre Channel Stor Miniport Driver

5/19/2009 2:18        (2.1:3.20)        (2.1:3.20)        QLogic Corporation        QLogic iSCSI Storport Miniport Driver

9/13/2006 14:18        (4.3:86.0)        (4.3:86.0)        Macrovision Corporation, Macrovision Europe Limited, and Macrovision Japan and Asia K.K.        Macrovision SECURITY Driver

7/14/2009 0:19        (6.0:6000.170)        (6.0:6000.170)        VIA Technologies, Inc.        VIA Generic PCI IDE Bus Driver

1/31/2009 1:18        (6.0:6000.6210)        (6.0:6000.6210)        VIA Technologies Inc.,Ltd        VIA RAID DRIVER FOR AMD-X86-64

 

 

Cluster Events:

 

  • Checked the events and found that the cluster networks are coming online.

 

6/1/2016        4:29:55 PM        Information        CLSTRFILE04.ABC.com        1204        Microsoft-Windows-FailoverClustering        The Cluster service successfully brought the clustered service or application ‘Available Storage’ offline.

6/1/2016        4:29:55 PM        Information        CLSTRFILE04.ABC.com        1125        Microsoft-Windows-FailoverClustering        Cluster network interface ‘CLSTRFILE04 – Service LAN’ for cluster node ‘CLSTRFILE04’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network.

______________________________________________________________________________

 

 

 

System Information:  ABCFILE07

 

OS Name        Microsoft Windows Server 2008 R2 Enterprise

Version        6.1.7601 Service Pack 1 Build 7601

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCFILE07

System Manufacturer        HP

System Model        ProLiant DL360p Gen8

System Type        x64-based PC

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP P71, 08/09/2013

 

 

System Events:

 

 

  • Event 1085, related to Folder Redirection, is logged.

  • At 4:29:55 PM, cluster node ABCFILE08 was removed from the FCM.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

6/1/2016        4:25:49 PM        Warning        ABCFILE07.ABC.com        1085        Microsoft-Windows-GroupPolicy        Windows failed to apply the Folder Redirection settings. Folder Redirection settings might have its own log file. Please click on the ‘More information’ link.

6/1/2016        4:29:55 PM        Critical        ABCFILE07.ABC.com        1135        Microsoft-Windows-FailoverClustering        Cluster node ‘ABCFILE08’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

 

Application Events:

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

6/1/2016        4:25:48 PM        Error        ABCFILE07.ABC.com        502        Microsoft-Windows-Folder Redirection        Failed to apply policy and redirect folder ‘Documents’ to ‘\\abcfs01\abc$\dave.farmer\My Documents’.  Redirection options=0x9231.  The following error occurred: ‘Can not create folder ‘\\abcfs01\abc$\dave.farmer\My Documents’’.  Error details: ‘This security ID may not be assigned as the owner of this object.’.

 

Cluster Events:

 

6/1/2016        4:29:55 PM        Information        ABCFILE07.ABC.com        1125        Microsoft-Windows-FailoverClustering        Cluster network interface ‘CLSTRFILE04 – Service LAN’ for cluster node ‘CLSTRFILE04’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network.

 

 

List of outdated drivers:

 

 

Time/Date String        Product Version        File Version        Company Name        File Description

12/5/2008 23:54        (6.1:3790.0)        (1.6:6.4)        Adaptec, Inc.        Adaptec Windows SAS/SATA Storport Driver

5/1/2007 18:30        (6.0:3790.16512)        (1.6:6.1)        Adaptec, Inc.        Adaptec Windows SATA Storport Driver

2/28/2007 0:04        (6.0:6001.16459)        (7.2:0.0)        Adaptec, Inc.        Adaptec StorPort Ultra320 SCSI Driver (X64)

3/19/2010 16:18        (1.1:2.5)        (1.1:2.5)        Advanced Micro Devices        Storage Filter Driver

2/13/2009 22:18        (4.8:2.0)        (4.8:2.0)        Broadcom Corporation        Broadcom NetXtreme II GigE VBD

2/3/2009 22:52        (7.2:10.211)        (7.2:10.211)        Emulex        Storport Miniport Driver for LightPulse HBAs

12/31/2008 16:29        (4.8:13.0)        (4.8:13.0)        Broadcom Corporation        Broadcom NetXtreme II 10 GigE VBD

4/24/2003 19:03        (6.0:1.0)        (6.0:1.0)        Broadcom Corporation        Frame Access Driver

6/11/2010 1:46        (8.6:2.1014)        (8.6:2.1014)        Intel Corporation        Intel Matrix Storage Manager driver – x64

12/13/2005 21:47        (0.4:22.0)        (5.4:22.0)        Intel Corp./ICP vortex GmbH        Intel/ICP Raid Storport Driver

12/2/2009 21:36        (5.2:3790.1830)        (1.3:0.4)        Intel Corporation        Intel(R) Network Adapter Diagnostic Driver

5/19/2009 2:09        (4.5:1.64)        (4.5:1.64)        LSI Corporation        MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64

5/19/2009 2:25        13.05.0409.2009        (13.5:409.2009)        LSI Corporation, Inc.        LSI MegaRAID Software RAID Driver

6/6/2006 22:11        (7.10:0.0)        (7.10:0.0)        IBM Corporation        IBM ServeRAID Controller Driver

8/10/2007 0:47        (1.2:78.3)        (1.2:78.3)        Intel Corporation        Intel(R) 5000 Series Chipsets Integrated Device – 1A38

1/22/2009 23:05        (9.1:8.6)        (9.1:8.6)        QLogic Corporation        QLogic Fibre Channel Stor Miniport Driver

5/19/2009 2:18        (2.1:3.20)        (2.1:3.20)        QLogic Corporation        QLogic iSCSI Storport Miniport Driver

9/24/2008 19:28        (5.1:1039.2600)        (5.1:1039.2600)        Silicon Integrated Systems Corp.        SiS RAID Stor Miniport Driver

2/17/2009 23:03        (5.0:1.1)        (5.0:1.1)        Promise Technology        Promise SuperTrak EX Series Driver for Windows

 

 

 

__________________________________________________________________________________

 

System Information:  ABCFILE08

 

OS Name        Microsoft Windows Server 2008 R2 Enterprise

Version        6.1.7601 Service Pack 1 Build 7601

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCFILE08

System Manufacturer        HP

System Model        ProLiant DL360p Gen8

System Type        x64-based PC

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2500 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP P71, 9/8/2013

 

 

System Events:

 

  • Schannel errors (Event 36888) with internal error state 1203 are logged.

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

6/1/2016        3:50:52 PM        Error        ABCFILE08.ABC.com        36888        Schannel        The following fatal alert was generated: 10. The internal error state is 1203.

6/1/2016        3:50:52 PM        Error        ABCFILE08.ABC.com        36888        Schannel        The following fatal alert was generated: 10. The internal error state is 1203.

6/1/2016        4:19:05 PM        Error        ABCFILE08.ABC.com        1230        Microsoft-Windows-FailoverClustering        Cluster resource ‘FileServer-(Condor)’ (resource type ‘’, DLL ‘clusres.dll’) either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.

 

  • The resource DLL clusres.dll deadlocked.

  • Right after the Event 2012 entries, all the resources started to fail, which generally points to a problem on the network side.

 

As per the Article: https://support.microsoft.com/en-us/kb/2885205

 

In Words

0000: 00040000 002C0001 00000000 800007DC

0010: 00000000 C0000184 00000000 00000000

0020: 00000000 00000000 0000058F

 

C0000184 = STATUS_INVALID_DEVICE_STATE , The device is not in a valid state to perform this request.

 

  • This is an error that the network driver returns to SRV on the send IRPs. It usually indicates that a send was issued on a connection that is no longer in a valid state for sending. For example, a send on a connection that has not yet reached the connected state returns STATUS_INVALID_DEVICE_STATE. If a disconnect has been initiated, further sends return the same error.
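To show how the status code is picked out of the event data, here is a small Python sketch that parses the “In Words” dump above and looks up each 32-bit word in a table of NTSTATUS names. The function names and the lookup table are illustrative only; the table holds just the one code discussed in this article, so any other word is simply not matched.

```python
# Event 2012 data, copied from the "In Words" view above.
EVENT_DATA = """\
0000: 00040000 002C0001 00000000 800007DC
0010: 00000000 C0000184 00000000 00000000
0020: 00000000 00000000 0000058F
"""

# Deliberately tiny lookup table: only the status code discussed in this article.
KNOWN_NTSTATUS = {
    0xC0000184: "STATUS_INVALID_DEVICE_STATE",
}


def parse_words(dump: str) -> list[int]:
    """Return the 32-bit words of a hex dump, dropping the 'offset:' column."""
    words = []
    for line in dump.splitlines():
        if not line.strip():
            continue
        _offset, _, payload = line.partition(":")
        words.extend(int(token, 16) for token in payload.split())
    return words


def find_status_codes(words: list[int]) -> list[tuple[int, str]]:
    """Return (byte offset, name) for every word found in KNOWN_NTSTATUS."""
    return [
        (index * 4, KNOWN_NTSTATUS[word])
        for index, word in enumerate(words)
        if word in KNOWN_NTSTATUS
    ]


words = parse_words(EVENT_DATA)
for offset, name in find_status_codes(words):
    print(f"offset 0x{offset:04X}: {name}")
```

As a side observation on the same dump, the low 16 bits of the fourth word (0x07DC) equal 2012, the event ID.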

As per the Article: https://blogs.technet.microsoft.com/yongrhee/2015/05/16/event-id-2012-while-transmitting-or-receiving-data-the-server-encountered-a-network-error/

 

  • Cause:

=======

1. Antivirus Filter driver interfering with the network stack

2. An outdated or bad network card driver

3. A bad NIC

4. Network Teaming software

5. WAN Optimization devices

6. Mismatched Speed and Duplex settings between the NIC and switch

7. A spotty connection to a switch port

 

  • Resolution:

==========

  • Make sure that the firmware for the network switches, WAN accelerators, and routers is up to date.
  • Update the NIC firmware and driver.
  • Update the NIC teaming software/driver.
  • Update the antivirus software, or uninstall it completely (for relief, then follow up with the AV vendor).
  • Manually set the speed/duplex.
  • Replace the network cable(s).
  • Try a different switch port.
  • For WAN optimizers, to keep the packets from being modified, try encapsulating them with IPsec.

 

 

 

 

6/1/2016        4:26:30 PM        Warning        ABCFILE08.ABC.com        2012        srv        While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration.  The error status code is contained within the returned data (formatted as Words) and may point you towards the problem.

6/1/2016        4:26:30 PM        Critical        ABCFILE08.ABC.com        1146        Microsoft-Windows-FailoverClustering        The cluster resource host subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually due to a problem in a resource DLL. Please determine which resource DLL is causing the issue and report the problem to the resource vendor.

6/1/2016        4:26:30 PM        Warning        ABCFILE08.ABC.com        2012        srv        While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration.  The error status code is contained within the returned data (formatted as Words) and may point you towards the problem.

6/1/2016        4:26:30 PM        Error        ABCFILE08.ABC.com        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘FileServer-(Condor)’ in clustered service or application ‘Condor’ failed.

 

  • Cluster disk started to fail with Ntfs Errors.

 

6/1/2016        4:26:37 PM        Error        ABCFILE08.ABC.com        137        Ntfs        The default transaction resource manager on volume T: encountered a non-retryable error and could not start.  The data contains the error code.

6/1/2016        4:26:59 PM        Error        ABCFILE08.ABC.com        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Disk N:\’ in clustered service or application ‘Condor’ failed.

 

  • After we restarted the machine, events related to the NIC team were logged.

 

6/1/2016        4:32:53 PM        Warning        ABCFILE08.ABC.com        461        CPQTeamMP        Team ID: 0 Aggregation ID: 0 Team Member ID: 0  PROBLEM: 802.3ad link aggregation (LACP) has failed. ACTION: Ensure all ports are connected to LACP-aware devices.

6/1/2016        4:33:02 PM        Warning        ABCFILE08.ABC.com        434        CPQTeamMP        HP Network Team #1: PROBLEM: A non-Primary Network Link is not receiving. Receive-path validation has been enabled for this Team by selecting the Enable receive-path validation Heartbeat Setting.  ACTION: Please check your cabling to the link partner. Check the switch port status, including verifying that the switch port is not configured as a Switch-assist Channel. Generate Broadcast traffic on the network to test whether these are being received. Also make sure all teamed NICs are on the same broadcast domain. Run diagnostics to test card. Drop the NIC from the team, determine whether it is receiving broadcast traffic in that configuration.

6/1/2016        4:35:03 PM        Error        ABCFILE08.ABC.com        103        MSiSCSI        Timeout waiting for iSCSI persistently bound volumes. If there are any services or applications that use information stored on these volumes then they may not start or may report errors.

 

 

Application Events:

 

  • The application logs point to a problem with the connections between the server and the SAN.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

6/1/2016        4:35:10 PM        Error        ABCFILE08.ABC.com        2004        Microsoft-Windows-PerfNet        Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code.

6/1/2016        4:35:32 PM        Warning        ABCFILE08.ABC.com        281        SnapDrive        Failed to get data for an iSCSI HBA. HBA WMI class instance name: Root\ISCSIPRT\0000_0 Error code = 0x8004100c Error description = WDM specific return code: 4200

6/1/2016        4:35:37 PM        Warning        ABCFILE08.ABC.com        317        SnapDrive        Failed to enumerate LUN.  Device path: ‘\\?\mpio#disk&ven_netapp&prod_lun&rev_811a#1&7f6ac24&0&3630413938303033373asdas32232135413330373835363730#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}’  Storage path: ‘/vol/vol_ISCSI_EKNCL04_QUORUM/qtree_ISCSI_EKNCL04_QUORUM/lun_ISCSI_EKNCL04_QUORUM’  SCSI address: (3,0,0,0)  Error code: 0xc00402fa  Error description: A LUN with device path \\?\mpio#disk&ven_netapp&prod_lun&rev_811a#1&7f6ac24&0&36304139383030333735343333344637313544333335413330373835363730#{53f56307-b6bf-11sa21312130a0c91efb8b} and SCSI address (3, 0, 0, 0) is exposed through an unsupported initiator.

 

 

Cluster Events:

 

6/1/2016        4:35:30 PM        Information        ABCFILE08.ABC.com        1062        Microsoft-Windows-FailoverClustering        This node has successfully joined the failover cluster ‘EKNCL04’.

 

Cluster Logs:

 

00000c88.00002938::2016/06/01-15:06:02.872 ERR   mscs::TopologyPersister::TryGetNetworkPrivateProperties: ERROR_FILE_NOT_FOUND(2)’ because of ‘OpenSubKey failed.’

00000c88.00002938::2016/06/01-15:06:02.872 INFO  [NM] Received request from client address ABCFILE08.

000015c4.000055dc::2016/06/01-15:06:04.447 WARN  [RES] File Server <FileServer-(Condor)>: Failed in NetShareGetInfo(Condor, PST Exports from old server), status 2310. Tolerating…

000015c4.000055dc::2016/06/01-15:06:04.463 WARN  [RES] File Server <FileServer-(Condor)>: Failed in NetShareGetInfo(Condor, sp4$), status 2310. Tolerating…

 

000015c4.000015d4::2016/06/01-15:19:05.014 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource ‘FileServer-(Condor)’.

000015c4.000015d4::2016/06/01-15:19:05.014 INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.

000015c4.000015d4::2016/06/01-15:19:05.014 ERR   [RHS] Resource FileServer-(Condor) handling deadlock. Cleaning current operation and terminating RHS process.

000015c4.000015d4::2016/06/01-15:19:05.014 ERR   [RHS] About to send WER report.

00000c88.0000369c::2016/06/01-15:19:05.014 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for ‘FileServer-(Condor)’, gen(0) result 4.

00000c88.0000369c::2016/06/01-15:19:05.014 INFO  [RCM] rcm::RcmResource::HandleMonitorReply: Resource ‘FileServer-(Condor)’ consecutive failure count 1.

00000c88.00007224::2016/06/01-15:25:29.369 ERR   [RCM] rcm::RcmResControl::DoResourceControl: ERROR_RESOURCE_CALL_TIMED_OUT(5910)’ because of ‘Control(STORAGE_GET_DISK_INFO) to resource ‘Disk L:\’ timed out.’

00000c88.00007224::2016/06/01-15:25:29.369 WARN  [RCM] ResourceControl(STORAGE_GET_DISK_INFO) to Disk L:\ returned 5910.

000015c4.00007b6c::2016/06/01-15:26:29.960 WARN  [RES] File Server <FileServer-(Condor)>: Failed in NetShareGetInfo(Condor, sp4$), status 2310. Tolerating…

00000c88.00006594::2016/06/01-15:26:30.927 INFO  [RCM] rcm::RcmResource::ReattachToMonitorProcess: (IP Address 193.27.213.16, Offline)
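When reading the excerpt above next to the event logs, two details help. Each line follows a `process.thread::timestamp LEVEL message` layout, and the timestamps are one hour behind the event-log times: the RHS deadlock at 15:19:05 in the cluster log corresponds to Event 1230 at 4:19:05 PM. The Python sketch below parses that layout and shifts the timestamp by one hour. The regular expression, the `ClusterLogLine` type, the reading of the two leading fields as hexadecimal IDs, and the one-hour offset are all assumptions derived from this excerpt only; this is not a general cluster log parser.

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple, Optional

# Assumption: offset inferred by matching the RHS deadlock line (15:19:05)
# with Event 1230 (4:19:05 PM) in the System log of ABCFILE08.
EVENT_LOG_OFFSET = timedelta(hours=1)

LINE_PATTERN = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>ERR|WARN|INFO)\s+(?P<message>.*)$"
)


class ClusterLogLine(NamedTuple):
    process_id: int
    thread_id: int
    event_log_time: datetime
    level: str
    message: str


def parse_line(line: str) -> Optional[ClusterLogLine]:
    """Parse one cluster log line; return None if it does not match the layout."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    logged = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return ClusterLogLine(
        process_id=int(match["pid"], 16),
        thread_id=int(match["tid"], 16),
        event_log_time=logged + EVENT_LOG_OFFSET,
        level=match["level"],
        message=match["message"],
    )


# One line from the excerpt above (typographic quotes replaced with plain ones).
sample = (
    "000015c4.000015d4::2016/06/01-15:19:05.014 ERR   [RHS] "
    "RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'FileServer-(Condor)'."
)
entry = parse_line(sample)
print(entry.event_log_time.strftime("%H:%M:%S"), entry.level, entry.message)
```

Filtering the parsed lines on `level == "ERR"` and on the shifted time makes it easier to place the `[RHS]` and `[RCM]` entries next to the System events listed earlier.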

 

List of outdated drivers:

 

 

Time/Date String        Product Version        File Version        Company Name        File Description

12/5/2008 23:54        (6.1:3790.0)        (1.6:6.4)        Adaptec, Inc.        Adaptec Windows SAS/SATA Storport Driver

5/1/2007 18:30        (6.0:3790.16512)        (1.6:6.1)        Adaptec, Inc.        Adaptec Windows SATA Storport Driver

2/28/2007 0:04        (6.0:6001.16459)        (7.2:0.0)        Adaptec, Inc.        Adaptec StorPort Ultra320 SCSI Driver (X64)

3/19/2010 16:18        (1.1:2.5)        (1.1:2.5)        Advanced Micro Devices        Storage Filter Driver

2/13/2009 22:18        (4.8:2.0)        (4.8:2.0)        Broadcom Corporation        Broadcom NetXtreme II GigE VBD

2/3/2009 22:52        (7.2:10.211)        (7.2:10.211)        Emulex        Storport Miniport Driver for LightPulse HBAs

12/31/2008 16:29        (4.8:13.0)        (4.8:13.0)        Broadcom Corporation        Broadcom NetXtreme II 10 GigE VBD

4/24/2003 19:03        (6.0:1.0)        (6.0:1.0)        Broadcom Corporation        Frame Access Driver

6/11/2010 1:46        (8.6:2.1014)        (8.6:2.1014)        Intel Corporation        Intel Matrix Storage Manager driver – x64

12/13/2005 21:47        (0.4:22.0)        (5.4:22.0)        Intel Corp./ICP vortex GmbH        Intel/ICP Raid Storport Driver

12/2/2009 21:36        (5.2:3790.1830)        (1.3:0.4)        Intel Corporation        Intel(R) Network Adapter Diagnostic Driver

5/19/2009 2:09        (4.5:1.64)        (4.5:1.64)        LSI Corporation        MEGASAS RAID Controller Driver for Windows 7\Server 2008 R2 for x64

5/19/2009 2:25        13.05.0409.2009        (13.5:409.2009)        LSI Corporation, Inc.        LSI MegaRAID Software RAID Driver

6/6/2006 22:11        (7.10:0.0)        (7.10:0.0)        IBM Corporation        IBM ServeRAID Controller Driver

8/10/2007 0:47        (1.2:78.3)        (1.2:78.3)        Intel Corporation        Intel(R) 5000 Series Chipsets Integrated Device – 1A38

1/22/2009 23:05        (9.1:8.6)        (9.1:8.6)        QLogic Corporation        QLogic Fibre Channel Stor Miniport Driver

5/19/2009 2:18        (2.1:3.20)        (2.1:3.20)        QLogic Corporation        QLogic iSCSI Storport Miniport Driver

9/24/2008 19:28        (5.1:1039.2600)        (5.1:1039.2600)        Silicon Integrated Systems Corp.        SiS RAID Stor Miniport Driver

2/17/2009 23:03        (5.0:1.1)        (5.0:1.1)        Promise Technology        Promise SuperTrak EX Series Driver for Windows

 

_________________________________________________________________

 

Conclusion:

 

  • The logs show that the issue started on the network side: networking went offline on node ABCFILE08, which led to Event ID 1135 and the removal of the node from the active cluster membership. The node rejoined the cluster at 4:35 PM, after we restarted the machine. Event ID 2012 was also logged, which usually indicates that a send was issued on a connection that is no longer in a valid state for sending.
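To line up the events quoted in this RCA, the following Python sketch sorts them by time and prints the elapsed time since the first one. The event list was copied by hand from the tables above (it is not the output of any tool), the source names are shortened, and the variable and function names are illustrative.

```python
from datetime import datetime

# (time, node that logged the event, event ID, source), copied from the tables above.
EVENTS = [
    ("4:29:55 PM", "ABCFILE07", 1135, "FailoverClustering"),
    ("3:50:52 PM", "ABCFILE08", 36888, "Schannel"),
    ("4:19:05 PM", "ABCFILE08", 1230, "FailoverClustering"),
    ("4:26:30 PM", "ABCFILE08", 2012, "srv"),
    ("4:26:30 PM", "ABCFILE08", 1146, "FailoverClustering"),
    ("4:26:30 PM", "ABCFILE08", 1069, "FailoverClustering"),
    ("4:26:37 PM", "ABCFILE08", 137, "Ntfs"),
    ("4:26:59 PM", "ABCFILE08", 1069, "FailoverClustering"),
    ("4:32:53 PM", "ABCFILE08", 461, "CPQTeamMP"),
    ("4:35:03 PM", "ABCFILE08", 103, "MSiSCSI"),
    ("4:35:30 PM", "ABCFILE08", 1062, "FailoverClustering"),
]


def to_datetime(clock: str) -> datetime:
    """All events in this RCA were logged on 6/1/2016."""
    return datetime.strptime(f"6/1/2016 {clock}", "%m/%d/%Y %I:%M:%S %p")


# sorted() is stable, so events with the same timestamp keep their listed order.
timeline = sorted(EVENTS, key=lambda event: to_datetime(event[0]))
start = to_datetime(timeline[0][0])
for clock, node, event_id, source in timeline:
    elapsed = to_datetime(clock) - start
    print(f"{clock:>10}  +{elapsed}  {node:<10}  {event_id:>5}  {source}")
```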

 

 

  1. Make sure that the firmware for the network switches, WAN accelerators, and routers is up to date.
  2. Update the NIC firmware and driver.
  3. Update the NIC teaming software/driver.
  4. Update the antivirus software, or uninstall it completely (for relief, then follow up with the AV vendor).
  5. Manually set the speed/duplex.
  6. Replace the network cable(s).
  7. Try a different switch port.
  8. For WAN optimizers, to keep the packets from being modified, try encapsulating them with IPsec.

 

  9.  Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure that all of these updates are installed on every node:

 

Updates for Cluster Binaries for 2008 R2 : https://support.microsoft.com/en-us/kb/2545685

 

  10.  Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical one when dealing with network connectivity issues.

           Investigation of network issues:

To avoid this issue in the future, the most important task is to diagnose the persistent network connectivity problem on the cluster networks, with the help of the in-house networking team:

We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.

We also need to check the hubs, switches, or bridges in the networks that connect the nodes.

We need to check for switch delays and proxy ARP.
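When the networking team collects reachability samples between the nodes, a summary like the one below can help relate them to the heartbeat rule described at the start of this article. This is a minimal Python sketch under stated assumptions: the function `summarize`, the one-sample-per-second interval, and the sample values are all made up for illustration, and the threshold of 5 is the default number of missed heartbeats quoted earlier.

```python
from typing import Optional, Sequence

# Default quoted earlier in this article: 5 missed heartbeats mark a route down.
HEARTBEAT_THRESHOLD = 5


def summarize(samples: Sequence[Optional[float]]) -> dict:
    """Summarize round-trip times in ms, one sample per second; None means no reply."""
    lost = sum(1 for sample in samples if sample is None)
    longest_run = current_run = 0
    for sample in samples:
        current_run = current_run + 1 if sample is None else 0
        longest_run = max(longest_run, current_run)
    replies = [sample for sample in samples if sample is not None]
    return {
        "loss_percent": 100.0 * lost / len(samples) if samples else 0.0,
        "max_rtt_ms": max(replies) if replies else None,
        "longest_loss_run": longest_run,
        "would_mark_route_down": longest_run >= HEARTBEAT_THRESHOLD,
    }


# Made-up samples: a short two-second drop, then five seconds without any reply.
samples = [0.4, 0.5, None, None, 0.6, None, None, None, None, None, 0.5, 0.4]
report = summarize(samples)
print(report)
```

The longest run of consecutive losses is reported separately from the overall loss percentage because a short burst of total loss is what takes a route down, even when the average loss over a longer period looks small.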

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
