RCA – 8 – RCA for Cluster Going Down

Issue Description:

We need to determine the possible cause of the cluster going offline at 12:11 PM Central Time on 7/19/2016. Cluster name: ab1-abcxntclust, running Microsoft Windows Server 2008 R2 Enterprise Service Pack 1 (64-bit).

_________________________________________________________________________________

System Information:

OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AB1-C1-BLADE09
System Manufacturer HP
System Model ProLiant BL460c G6
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5530 @ 2.40GHz, 2400 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP I24, 8/16/2015

 

System Events:

  • We checked the event logs and found that the issue started with the failure of the file share resources. No related events were logged before the file server failure.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:11:06 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(dc1-contentfs)(ContentABC_FC)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node. |
| 7/19/2016 | 12:11:06 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(dc1-contentfs)(ContentABC_FC)’ in clustered service or application ‘dc1-contentfs’ failed. |
| 7/19/2016 | 12:19:07 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered service or application ‘dc1-contentfs’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application. |

  • Cluster resources are showing as degraded.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:19:25 PM | Warning | AB1-C1-BLADE09.inxpo.dmz | 1167 | Foundation Agents | Cluster Agent: The cluster resource ContentABC_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
| 7/19/2016 | 12:19:25 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1168 | Foundation Agents | Cluster Agent: The cluster resource FileServer-(dc1-contentfs)(ContentABC_FC) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB] |

  • At 12:34 PM, the link on the HP network adapter (NIC) went down.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:34:26 PM | Warning | AB1-C1-BLADE09.inxpo.dmz | 4 | q57nd60a | HP NC326m PCIe Dual Port Adapter: The network link is down. Check to make sure the network cable is properly connected. |
| 7/19/2016 | 12:34:33 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 2 | HP Ethernet | If the Network Interface is an Ethernet Port, the Ethernet Port has transitioned from OK to Error. If the Network Interface is an Ethernet Team, the Ethernet Team has transitioned from Fully Redundant, Degraded Redundancy or Redundancy Lost to Overall Failure, due to a failed team member. |

  • Because the network was down, cluster node AB1-C2-BLADE09 was removed from the active failover cluster membership.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:35:46 PM | Critical | AB1-C1-BLADE09.inxpo.dmz | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB1-C2-BLADE09’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 7/19/2016 | 12:35:46 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1127 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘AB1-C1-BLADE09 – Cluster Heartbeat’ for cluster node ‘AB1-C1-BLADE09’ on network ‘Cluster Heartbeat’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 7/19/2016 | 12:35:46 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1130 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Heartbeat’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
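The ordering of the System events on AB1-C1-BLADE09 matters for the root cause, so it can help to compute the offsets between them. The following is a minimal Python sketch that uses only the times and event IDs from the System event entries for this node; the short summaries and the names `EVENTS` and `to_datetime` are illustrative and not part of the original logs.

```python
from datetime import datetime

# (time, event ID, short summary) copied from the AB1-C1-BLADE09 System events.
# The summaries are paraphrased for readability.
EVENTS = [
    ("12:11:06 PM", 1587, "File server resource failed a health check"),
    ("12:11:06 PM", 1069, "Cluster resource failed"),
    ("12:19:07 PM", 1205, "Group not brought completely online or offline"),
    ("12:34:26 PM", 4, "HP NC326m network link is down"),
    ("12:34:33 PM", 2, "Ethernet port or team transitioned to failure"),
    ("12:35:46 PM", 1135, "Node AB1-C2-BLADE09 removed from membership"),
    ("12:35:46 PM", 1130, "Cluster Heartbeat network is down"),
]


def to_datetime(clock: str) -> datetime:
    """Parse a 12-hour clock string on the incident date (7/19/2016)."""
    return datetime.strptime(f"7/19/2016 {clock}", "%m/%d/%Y %I:%M:%S %p")


first_failure = to_datetime(EVENTS[0][0])

# Print each event with its offset from the first file server failure.
for clock, event_id, summary in EVENTS:
    offset = to_datetime(clock) - first_failure
    print(f"{clock}  +{offset}  Event {event_id}: {summary}")

# Gap between the first file server failure and the NIC link-down warning.
link_down_gap = to_datetime("12:34:26 PM") - first_failure
```

With these timestamps, the link-down warning is logged 23 minutes 20 seconds after the first file server failure, which suggests that the NIC failure followed the share failure rather than triggering it.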

Application Events:

  • We checked the Application events at the time of the issue and found only that one of the VTS job service tasks was running in the background.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 3 | VTS Job Server | N/A |
| 7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 3 | VTS File Upload Conversion | N/A |

Cluster Events:

  • We checked the cluster events and found the same sequence of events as in the System events.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:11:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1201 | Microsoft-Windows-FailoverClustering | The Cluster service successfully brought the clustered service or application ‘dc1-contentfs’ online. |
| 7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1153 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to fail over the clustered service or application ‘dc1-contentfs’ from node ‘AB1-C1-BLADE09’ to node ‘AB1-C2-BLADE09’. |
| 7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1203 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to bring the clustered service or application ‘dc1-contentfs’ offline. |
| 7/19/2016 | 12:35:46 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1125 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘AB1-C1-BLADE09 – DMZ Network’ for cluster node ‘AB1-C1-BLADE09’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network. |

Cluster Logs:

00000e78.000046f4::2016/07/19-17:07:06.847 WARN  [RES] File Server <FileServer-(dc1-contentfs)(ContentStorage_FC)>: Failed in NetShareGetInfo(dc1-contentfs, CachedAttachmentFiles), status 53. Tolerating…

000008f0.00003494::2016/07/19-17:08:50.074 INFO  [GUM] Node 1: Processing RequestLock 1:33616

000008f0.00000b28::2016/07/19-17:08:50.386 INFO  [GUM] Node 1: Processing GrantLock to 1 (sent by 2 gumid: 233031)

000008f0.00003494::2016/07/19-17:09:24.878 INFO  [API] s_ApiGetQuorumResource final status 0.

000008f0.00000b28::2016/07/19-17:09:53.395 INFO  [GUM] Node 1: Processing RequestLock 2:3871

000008f0.00000b28::2016/07/19-17:09:53.395 INFO  [GUM] Node 1: Processing GrantLock to 2 (sent by 1 gumid: 233043)

00000e78.0000306c::2016/07/19-17:11:06.872 WARN  [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: Failed in NetShareGetInfo(dc1-contentfs, FTPRoot), status 53. Tolerating…

00000e78.000057c8::2016/07/19-17:11:06.872 WARN  [RES] File Server <FileServer-(dc1-contentfs)(ContentStorage_FC)>: Failed in NetShareGetInfo(dc1-contentfs, CachedAttachmentFiles), status 53. Tolerating…

00000e78.0000306c::2016/07/19-17:11:06.872 ERR   [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: Not a single share among 1 configured shares is online

00000e78.0000306c::2016/07/19-17:11:06.872 ERR   [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: File system check failed, number of shares verified: 1, last share status: 259.

00000e78.0000306c::2016/07/19-17:11:06.872 WARN  [RHS] Resource FileServer-(dc1-contentfs)(ContentABC_FC) IsAlive has indicated failure.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for ‘FileServer-(dc1-contentfs)(ContentABC_FC)’, gen(0) result 1.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO  [RCM] TransitionToState(FileServer-(dc1-contentfs)(ContentABC_FC)) Online-->ProcessingFailure.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (dc1-contentfs, Online --> Failed)

000008f0.0000222c::2016/07/19-17:11:06.872 ERR   [RCM] rcm::RcmResource::HandleFailure: (FileServer-(dc1-contentfs)(ContentABC_FC))

000008f0.0000222c::2016/07/19-17:11:06.872 INFO  [RCM] resource FileServer-(dc1-contentfs)(ContentABC_FC): failure count: 1, restartAction: 2.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO  [RCM] Will restart resource in 500 milliseconds.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO  [RCM] TransitionToState(FileServer-(dc1-contentfs)(ContentABC_FC)) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].

00000e78.000050cc::2016/07/19-17:19:08.029 INFO  [RES] Generic Service <VTS Job Server>: Service died or not active any more; status = 1062.

00000e78.0000607c::2016/07/19-17:19:08.029 INFO  [RES] Generic Service <VTS File Upload Conversion>: Service died or not active any more; status = 1062.

00000e78.00005fd8::2016/07/19-17:19:08.029 INFO  [RES] Generic Service <VTS Background Report Mailer>: Service died or not active any more; status = 1062.
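The cluster log timestamps appear to be in UTC (17:11:06 in the log lines up with the 12:11:06 PM Central Time events), so correlating the log with the event tables needs a conversion. The following is a minimal Python sketch that splits a log line into its fields and converts the timestamp. It assumes the line layout seen in the excerpt above and a fixed UTC-5 offset for Central Daylight Time on the incident date; `LINE_RE`, `CDT`, and `parse_line` are illustrative names.

```python
import re
from datetime import datetime, timedelta, timezone

# Layout assumed from the excerpt: <pid>.<tid>::<timestamp> <LEVEL> [<COMPONENT>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[A-Z]+)\]\s+(?P<message>.*)$"
)

# Assumption: Central Daylight Time (UTC-5) applies on 7/19/2016.
CDT = timezone(timedelta(hours=-5), "CDT")


def parse_line(line: str):
    """Return the fields of one cluster log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    utc_time = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    return {
        "pid": match["pid"],
        "tid": match["tid"],
        "utc": utc_time,
        "local": utc_time.astimezone(CDT),
        "level": match["level"],
        "component": match["component"],
        "message": match["message"],
    }


sample = (
    "00000e78.0000306c::2016/07/19-17:11:06.872 WARN  [RHS] "
    "Resource FileServer-(dc1-contentfs)(ContentABC_FC) IsAlive has indicated failure."
)
entry = parse_line(sample)
print(entry["local"].isoformat(), entry["level"], entry["component"])
```

Filtering the parsed entries on `level` values of WARN or ERR is one way to narrow a long log down to the lines around the failure.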

  • As part of the IsAlive check for a file server resource, the cluster service verifies that the directory path associated with each share is still valid. It does this by accessing the UNC path (\\VCO name) from the local host. If the share paths are accessible, the resource is considered healthy; if not, the resource is marked as failed. The cluster log reports that the share failed this health check and that the resource transitioned to the failed state.

  • Status 53 at “Failed in NetShareGetInfo” translates to ERROR_BAD_NETPATH: The network path was not found.
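The other status values in the cluster log excerpt can be decoded the same way. The following is a minimal Python sketch of a lookup table for the three codes that appear in the log (53, 259, and 1062), using the standard Win32 system error names and messages; the `WIN32_ERRORS` table and the `describe` helper are illustrative and cover only these codes.

```python
# Win32 system error codes that appear in the cluster log excerpt.
WIN32_ERRORS = {
    53: ("ERROR_BAD_NETPATH", "The network path was not found."),
    259: ("ERROR_NO_MORE_ITEMS", "No more data is available."),
    1062: ("ERROR_SERVICE_NOT_ACTIVE", "The service has not been started."),
}


def describe(code: int) -> str:
    """Return a readable description for a status code seen in the log."""
    name, message = WIN32_ERRORS.get(code, ("UNKNOWN", "Code not in this table."))
    return f"{code} = {name}: {message}"


for status in (53, 259, 1062):
    print(describe(status))
```

On a Windows host, `net helpmsg 53` returns the same message text for a given code.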

 

List of outdated drivers:

| Time/Date String | Product Version | File Version | Company Name | File Description |
| --- | --- | --- | --- | --- |
| 10/15/2013 21:22 | (16.4:0.1) | (16.4:0.1) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 7/28/2011 2:54 | (6.1:7600.16385) | (7.0:2.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
| 8/29/2013 11:30 | (7.8:51.0) | (7.8:51.0) | Broadcom Corporation | Broadcom NetXtreme II Diagnostic Driver |
| 2/13/2009 16:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
| 10/23/2013 5:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
| 4/20/2010 6:03 | (6.1:7600.16385) | (4.1:5.150) | Hewlett-Packard | HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays |
| 2/17/2011 17:16 | (1.14:0.0) | (1.14:0.0) | Hewlett-Packard Company | HP ProLiant iLO 2 Management Controller Driver |
| 5/18/2009 20:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
| 7/26/2010 20:39 | (10.10:0.0) | (10.10:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.10.00.03, NDIS 6.0, x64, (free) on Win2K8 |
| 10/16/2011 19:02 | (10.45:0.0) | (10.45:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.45.00.01, NDIS 6.0, x64, (free) on Win2K8 |

Fibre Channel Information:

Description: QLogic QMH2462 Fibre Channel Adapter

Driver Version: 9.1.16.21

Firmware version: 8.01.02

Driver name: ql2300.sys

___________________________________________________________________________________

System Information:

 OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AB1-C2-BLADE09
System Manufacturer HP
System Model ProLiant BL460c G6
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5530 @ 2.40GHz, 2400 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP I24, 8/16/2015

System Events:

 

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 12:19:44 PM | Warning | AB1-C2-BLADE09.inxpo.dmz | 1167 | Foundation Agents | Cluster Agent: The cluster resource ContentABC_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
| 7/19/2016 | 12:19:44 PM | Error | AB1-C2-BLADE09.inxpo.dmz | 1168 | Foundation Agents | Cluster Agent: The cluster resource FileServer-(dc1-contentfs)(ContentABC_FC) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB] |
| 7/19/2016 | 12:21:45 PM | Warning | AB1-C2-BLADE09.inxpo.dmz | 1167 | Foundation Agents | Cluster Agent: The cluster resource ContentStorage_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |

 

 

Application Events:

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
| --- | --- | --- | --- | --- | --- | --- |
| 7/19/2016 | 11:48:38 AM | Information | AB1-C2-BLADE09.inxpo.dmz | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
| 7/19/2016 | 12:47:34 PM | Error | AB1-C2-BLADE09.inxpo.dmz | 1000 | Application Error | Faulting application name: VTSJobServer.exe, version: 0.0.0.0, time stamp: 0x57604532 Faulting module name: MSVCR100.dll, version: 10.0.40219.325, time stamp: 0x4df2bcac Exception code: 0xc0000005 Fault offset: 0x000000000003c225 Faulting process id: 0x1874 Faulting application start time: 0x01d1e1e59fd6546f Faulting application path: H:\Services\VTSJobServer\VTSJobServer.exe Faulting module path: C:\Windows\system32\MSVCR100.dll Report Id: df57bdf1-4dd8-11e6-9949-00265521b094 |

 

 

List of outdated drivers:

 

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
| --- | --- | --- | --- | --- |
| 10/15/2013 21:22 | (16.4:0.1) | (16.4:0.1) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 7/28/2011 2:54 | (6.1:7600.16385) | (7.0:2.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
| 2/13/2009 16:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
| 10/23/2013 5:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
| 4/20/2010 6:03 | (6.1:7600.16385) | (4.1:5.150) | Hewlett-Packard | HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays |
| 2/17/2011 17:16 | (1.14:0.0) | (1.14:0.0) | Hewlett-Packard Company | HP ProLiant iLO 2 Management Controller Driver |
| 5/18/2009 20:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
| 7/26/2010 20:39 | (10.10:0.0) | (10.10:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.10.00.03, NDIS 6.0, x64, (free) on Win2K8 |
| 10/16/2011 19:02 | (10.45:0.0) | (10.45:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.45.00.01, NDIS 6.0, x64, (free) on Win2K8 |

 

_________________________________________________________________________________________

 

 

Conclusion:

 


  • After analyzing the logs, we found that the issue occurred because the File Server resource failed the IsAlive health check on the cluster.

 

  • This issue can occur if the system runs out of TCP ports (port exhaustion), if Windows Server 2008 R2 is missing current hotfixes for the kernel, networking, and cluster components, or if the NIC drivers are outdated. Port exhaustion is unlikely in this case because Event ID 4227 was not logged.

 

  • Check whether the TCP/IP NetBIOS Helper service is disabled. If it is, enable it; by default, the service is set to start automatically.

 

  • Update the network adapter drivers on both nodes of the cluster.

 

  • Update the HP MPIO DSM after discussing it with the HP team.

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

    • The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable to Cluster 2003)

    • The %Systemroot%\Cluster folder. (Applicable to Cluster 2003, 2008, and 2008 R2)

    • The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable to Cluster 2003)

 

  • Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:

 

Updates for Cluster Binaries for 2008 R2

https://support.microsoft.com/en-us/kb/2545685 

 

  • Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical when dealing with network connectivity issues.

Investigation of Network Issues:

To avoid this issue in the future, the most important step is to diagnose and investigate the persistent network connectivity issue on the cluster networks with the help of the in-house networking team.

We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.

We also need to check hubs, switches, or bridges in the networks that connect the nodes.

We need to check for switch delays and proxy ARP with the help of the in-house networking team.
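As a rough aid to the latency investigation, connection times between the nodes can be sampled from a script. The following is a minimal Python sketch, not a replacement for the networking team's tools. It assumes that TCP port 445 (SMB) is reachable between the nodes; the function name `tcp_connect_latency` and its defaults are illustrative.

```python
import socket
import time
from typing import Optional


def tcp_connect_latency(host: str, port: int = 445, timeout: float = 2.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure.

    Port 445 (SMB) is an assumed default; any port that the target node
    listens on can be used instead.
    """
    start = time.perf_counter()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return None
```

Calling `tcp_connect_latency("AB1-C2-BLADE09")` from the other node at regular intervals and recording the results is one way to catch intermittent timeouts; a `None` result or a sudden jump in latency would be worth raising with the networking team.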

 

  • Communication between server cluster nodes is critical for smooth cluster operations. Therefore, you must make sure that the networks used for cluster communication are configured optimally and meet all hardware compatibility list requirements. For the networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a dedicated heartbeat network to the cluster so that it can operate reliably.

Recommended private “Heartbeat” configuration on a cluster server 

Ashutosh Dixit

