RCA - 8 - RCA for Cluster Going Down

Issue Description:

Determine the possible cause of the cluster going offline at 12:11 PM Central Time on 7/19/2016. Cluster name: ab1-abcxntclust, running Microsoft Windows Server 2008 R2 Enterprise Service Pack 1, 64-bit.

_________________________________________________________________________________

System Information:

OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AB1-C1-BLADE09
System Manufacturer HP
System Model ProLiant BL460c G6
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5530 @ 2.40GHz, 2400 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP I24, 8/16/2015

System Events:

The event logs show that the issue started with the failure of the file server resource. No related events were logged before the file server failure.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/19/2016	12:11:06 PM	Error	AB1-C1-BLADE09.inxpo.dmz	1587	Microsoft-Windows-FailoverClustering	Cluster file server resource ‘FileServer-(dc1-contentfs)(ContentABC_FC)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node.
7/19/2016	12:11:06 PM	Error	AB1-C1-BLADE09.inxpo.dmz	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘FileServer-(dc1-contentfs)(ContentABC_FC)’ in clustered service or application ‘dc1-contentfs’ failed.
7/19/2016	12:19:07 PM	Error	AB1-C1-BLADE09.inxpo.dmz	1205	Microsoft-Windows-FailoverClustering	The Cluster service failed to bring clustered service or application ‘dc1-contentfs’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

The HP cluster agent then reported the cluster resources as degraded and failed.

7/19/2016	12:19:25 PM	Warning	AB1-C1-BLADE09.inxpo.dmz	1167	Foundation Agents	Cluster Agent: The cluster resource ContentABC_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
7/19/2016	12:19:25 PM	Error	AB1-C1-BLADE09.inxpo.dmz	1168	Foundation Agents	Cluster Agent: The cluster resource FileServer-(dc1-contentfs)(ContentABC_FC) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]

At 12:34 PM the link on the HP network adapter (NIC) went down.

7/19/2016	12:34:26 PM	Warning	AB1-C1-BLADE09.inxpo.dmz	4	q57nd60a	HP NC326m PCIe Dual Port Adapter: The network link is down. Check to make sure the network cable is properly connected.
7/19/2016	12:34:33 PM	Error	AB1-C1-BLADE09.inxpo.dmz	2	HP Ethernet	If the Network Interface is an Ethernet Port, the Ethernet Port has transitioned from OK to Error. If the Network Interface is an Ethernet Team, the Ethernet Team has transitioned from Fully Redundant, Degraded Redundancy or Redundancy Lost to Overall Failure, due to a failed team member.

Because the network was down, cluster node AB1-C2-BLADE09 was removed from the active failover cluster membership.

7/19/2016	12:35:46 PM	Critical	AB1-C1-BLADE09.inxpo.dmz	1135	Microsoft-Windows-FailoverClustering	Cluster node ‘AB1-C2-BLADE09’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
7/19/2016	12:35:46 PM	Error	AB1-C1-BLADE09.inxpo.dmz	1127	Microsoft-Windows-FailoverClustering	Cluster network interface ‘AB1-C1-BLADE09 – Cluster Heartbeat’ for cluster node ‘AB1-C1-BLADE09’ on network ‘Cluster Heartbeat’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
7/19/2016	12:35:46 PM	Error	AB1-C1-BLADE09.inxpo.dmz	1130	Microsoft-Windows-FailoverClustering	Cluster network ‘Cluster Heartbeat’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
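
The order of the system events above can be made explicit by computing the time elapsed since the first file server failure. The following is a minimal Python sketch rather than part of the original analysis. The timestamps and event codes are taken from the tables above, the descriptions are abbreviated summaries, and the function names are hypothetical.

```python
from datetime import datetime

# Timestamps and event codes copied from the system event tables above.
# The descriptions are abbreviated summaries, not the full event text.
EVENTS = [
    ("7/19/2016 12:34:26 PM", 4, "NIC network link is down"),
    ("7/19/2016 12:11:06 PM", 1587, "File server resource failed a health check"),
    ("7/19/2016 12:35:46 PM", 1135, "Node AB1-C2-BLADE09 removed from membership"),
    ("7/19/2016 12:19:07 PM", 1205, "Clustered service not completely online or offline"),
]


def parse_stamp(stamp):
    """Parse an event log timestamp such as '7/19/2016 12:11:06 PM'."""
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")


def elapsed_since_first(events):
    """Sort events by time and return (elapsed, code, text) relative to the earliest one."""
    ordered = sorted(events, key=lambda event: parse_stamp(event[0]))
    start = parse_stamp(ordered[0][0])
    return [(parse_stamp(stamp) - start, code, text) for stamp, code, text in ordered]


timeline = elapsed_since_first(EVENTS)
for elapsed, code, text in timeline:
    print(f"+{elapsed}  event {code}: {text}")
```

Sorting by parsed time rather than by string keeps the order correct even if the rows are pasted out of sequence.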

Application Events:

The Application log for the time of the issue shows only that the VTS job service tasks were running in the background.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/19/2016	12:19:07 PM	Information	AB1-C1-BLADE09.inxpo.dmz	3	VTS Job Server	N/A
7/19/2016	12:19:07 PM	Information	AB1-C1-BLADE09.inxpo.dmz	3	VTS File Upload Conversion	N/A

Cluster Events:

The cluster events show the same sequence of events as the system events.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/19/2016	12:11:07 PM	Information	AB1-C1-BLADE09.inxpo.dmz	1201	Microsoft-Windows-FailoverClustering	The Cluster service successfully brought the clustered service or application ‘dc1-contentfs’ online.
7/19/2016	12:19:07 PM	Information	AB1-C1-BLADE09.inxpo.dmz	1153	Microsoft-Windows-FailoverClustering	The Cluster service is attempting to fail over the clustered service or application ‘dc1-contentfs’ from node ‘AB1-C1-BLADE09’ to node ‘AB1-C2-BLADE09’.
7/19/2016	12:19:07 PM	Information	AB1-C1-BLADE09.inxpo.dmz	1203	Microsoft-Windows-FailoverClustering	The Cluster service is attempting to bring the clustered service or application ‘dc1-contentfs’ offline.
7/19/2016	12:35:46 PM	Information	AB1-C1-BLADE09.inxpo.dmz	1125	Microsoft-Windows-FailoverClustering	Cluster network interface ‘AB1-C1-BLADE09 – DMZ Network’ for cluster node ‘AB1-C1-BLADE09’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network.

Cluster Logs:

00000e78.000046f4::2016/07/19-17:07:06.847 WARN [RES] File Server <FileServer-(dc1-contentfs)(ContentStorage_FC)>: Failed in NetShareGetInfo(dc1-contentfs, CachedAttachmentFiles), status 53. Tolerating…

000008f0.00003494::2016/07/19-17:08:50.074 INFO [GUM] Node 1: Processing RequestLock 1:33616

000008f0.00000b28::2016/07/19-17:08:50.386 INFO [GUM] Node 1: Processing GrantLock to 1 (sent by 2 gumid: 233031)

000008f0.00003494::2016/07/19-17:09:24.878 INFO [API] s_ApiGetQuorumResource final status 0.

000008f0.00000b28::2016/07/19-17:09:53.395 INFO [GUM] Node 1: Processing RequestLock 2:3871

000008f0.00000b28::2016/07/19-17:09:53.395 INFO [GUM] Node 1: Processing GrantLock to 2 (sent by 1 gumid: 233043)

00000e78.0000306c::2016/07/19-17:11:06.872 WARN [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: Failed in NetShareGetInfo(dc1-contentfs, FTPRoot), status 53. Tolerating…

00000e78.000057c8::2016/07/19-17:11:06.872 WARN [RES] File Server <FileServer-(dc1-contentfs)(ContentStorage_FC)>: Failed in NetShareGetInfo(dc1-contentfs, CachedAttachmentFiles), status 53. Tolerating…

00000e78.0000306c::2016/07/19-17:11:06.872 ERR [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: Not a single share among 1 configured shares is online

00000e78.0000306c::2016/07/19-17:11:06.872 ERR [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: File system check failed, number of shares verified: 1, last share status: 259.

00000e78.0000306c::2016/07/19-17:11:06.872 WARN [RHS] Resource FileServer-(dc1-contentfs)(ContentABC_FC) IsAlive has indicated failure.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for ‘FileServer-(dc1-contentfs)(ContentABC_FC)’, gen(0) result 1.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] TransitionToState(FileServer-(dc1-contentfs)(ContentABC_FC)) Online-->ProcessingFailure.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (dc1-contentfs, Online --> Failed)

000008f0.0000222c::2016/07/19-17:11:06.872 ERR [RCM] rcm::RcmResource::HandleFailure: (FileServer-(dc1-contentfs)(ContentABC_FC))

000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] resource FileServer-(dc1-contentfs)(ContentABC_FC): failure count: 1, restartAction: 2.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] Will restart resource in 500 milliseconds.

000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] TransitionToState(FileServer-(dc1-contentfs)(ContentABC_FC)) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].

00000e78.000050cc::2016/07/19-17:19:08.029 INFO [RES] Generic Service <VTS Job Server>: Service died or not active any more; status = 1062.

00000e78.0000607c::2016/07/19-17:19:08.029 INFO [RES] Generic Service <VTS File Upload Conversion>: Service died or not active any more; status = 1062.

00000e78.00005fd8::2016/07/19-17:19:08.029 INFO [RES] Generic Service <VTS Background Report Mailer>: Service died or not active any more; status = 1062.
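
The cluster log timestamps above appear to be in UTC, while the event log entries are in Central Time, so 17:11:06 in the cluster log corresponds to 12:11:06 PM in the event logs. The following is a minimal Python sketch of that conversion. It assumes a fixed UTC-5 offset (Central Daylight Time in July) instead of a full time zone database, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Central Daylight Time (UTC-5) was in effect on 7/19/2016.
CDT = timezone(timedelta(hours=-5), "CDT")


def cluster_log_time_to_central(stamp):
    """Convert a cluster log timestamp (assumed UTC) such as
    '2016/07/19-17:11:06.872' to Central Daylight Time."""
    utc_time = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return utc_time.replace(tzinfo=timezone.utc).astimezone(CDT)


local_time = cluster_log_time_to_central("2016/07/19-17:11:06.872")
print(local_time.strftime("%m/%d/%Y %I:%M:%S %p %Z"))
```

With this offset, the file server IsAlive failure in the cluster log matches the time of event 1587 in the System log.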

As part of the IsAlive check for a file server resource, the cluster service verifies whether the directory path associated with each share is still valid. It does this by accessing the UNC path (\\VCO name) from the local host. If the share paths are accessible, the resource stays online; if not, the resource is marked as failed. The cluster log reports that the share failed this health check and that the resource transitioned to the Failed state.

Status 53 in “Failed in NetShareGetInfo” translates to ERROR_BAD_NETPATH: The network path was not found.
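
The other numeric status values in the cluster log excerpts can be translated the same way. The following is a minimal Python sketch, assuming the values are standard Win32 system error codes. The lookup table covers only the three codes seen in this report, and the helper name is hypothetical.

```python
import re

# Win32 system error codes seen in the cluster log excerpts of this report.
# Assumption: the logged status values are standard Win32 error codes.
WIN32_ERRORS = {
    53: ("ERROR_BAD_NETPATH", "The network path was not found."),
    259: ("ERROR_NO_MORE_ITEMS", "No more data is available."),
    1062: ("ERROR_SERVICE_NOT_ACTIVE", "The service has not been started."),
}

# Matches "status 53", "status = 1062" and "status: 259".
STATUS_PATTERN = re.compile(r"status\s*[:=]?\s*(\d+)")


def explain_status(log_line):
    """Return (code, name, message) for the first status code in a log line, or None."""
    match = STATUS_PATTERN.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    name, message = WIN32_ERRORS.get(code, ("UNKNOWN", "Code not in the lookup table."))
    return code, name, message


line = "Failed in NetShareGetInfo(dc1-contentfs, FTPRoot), status 53. Tolerating"
print(explain_status(line))
```

On this reading, status 1062 on the VTS generic service resources means only that those services were no longer running when the cluster checked them.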

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
10/15/2013 21:22	(16.4:0.1)	(16.4:0.1)	Broadcom Corporation	Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
7/28/2011 2:54	(6.1:7600.16385)	(7.0:2.0)	Broadcom Corporation	Broadcom NetXtreme Unified Crash Dump (x64)
8/29/2013 11:30	(7.8:51.0)	(7.8:51.0)	Broadcom Corporation	Broadcom NetXtreme II Diagnostic Driver
2/13/2009 16:18	(4.8:2.0)	(4.8:2.0)	Broadcom Corporation	Broadcom NetXtreme II GigE VBD
10/23/2013 5:30	(10.90:0.0)	(10.90:0.0)	Hewlett-Packard Company	Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8
4/20/2010 6:03	(6.1:7600.16385)	(4.1:5.150)	Hewlett-Packard	HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays
2/17/2011 17:16	(1.14:0.0)	(1.14:0.0)	Hewlett-Packard Company	HP ProLiant iLO 2 Management Controller Driver
5/18/2009 20:18	(2.1:3.20)	(2.1:3.20)	QLogic Corporation	QLogic iSCSI Storport Miniport Driver
7/26/2010 20:39	(10.10:0.0)	(10.10:0.0)	Hewlett-Packard Company	Network Teaming Intermediate Driver (NTID), 10.10.00.03, NDIS 6.0, x64, (free) on Win2K8
10/16/2011 19:02	(10.45:0.0)	(10.45:0.0)	Hewlett-Packard Company	Network Teaming Intermediate Driver (NTID), 10.45.00.01, NDIS 6.0, x64, (free) on Win2K8

Fibre Channel Information:

Description: QLogic QMH2462 Fibre Channel Adapter

Driver Version: 9.1.16.21

Firmware version: 8.01.02

Driver name: ql2300.sys

___________________________________________________________________________________

System Information:

OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AB1-C2-BLADE09
System Manufacturer HP
System Model ProLiant BL460c G6
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5530 @ 2.40GHz, 2400 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP I24, 8/16/2015

System Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/19/2016	12:19:44 PM	Warning	AB1-C2-BLADE09.inxpo.dmz	1167	Foundation Agents	Cluster Agent: The cluster resource ContentABC_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
7/19/2016	12:19:44 PM	Error	AB1-C2-BLADE09.inxpo.dmz	1168	Foundation Agents	Cluster Agent: The cluster resource FileServer-(dc1-contentfs)(ContentABC_FC) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]
7/19/2016	12:21:45 PM	Warning	AB1-C2-BLADE09.inxpo.dmz	1167	Foundation Agents	Cluster Agent: The cluster resource ContentStorage_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]

Application Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
7/19/2016	11:48:38 AM	Information	AB1-C2-BLADE09.inxpo.dmz	8224	VSS	The VSS service is shutting down due to idle timeout.
7/19/2016	12:47:34 PM	Error	AB1-C2-BLADE09.inxpo.dmz	1000	Application Error	Faulting application name: VTSJobServer.exe, version: 0.0.0.0, time stamp: 0x57604532 Faulting module name: MSVCR100.dll, version: 10.0.40219.325, time stamp: 0x4df2bcac Exception code: 0xc0000005 Fault offset: 0x000000000003c225 Faulting process id: 0x1874 Faulting application start time: 0x01d1e1e59fd6546f Faulting application path: H:\Services\VTSJobServer\VTSJobServer.exe Faulting module path: C:\Windows\system32\MSVCR100.dll Report Id: df57bdf1-4dd8-11e6-9949-00265521b094

List of outdated drivers:

Time/Date String	Product Version	File Version	Company Name	File Description
10/15/2013 21:22	(16.4:0.1)	(16.4:0.1)	Broadcom Corporation	Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
7/28/2011 2:54	(6.1:7600.16385)	(7.0:2.0)	Broadcom Corporation	Broadcom NetXtreme Unified Crash Dump (x64)
2/13/2009 16:18	(4.8:2.0)	(4.8:2.0)	Broadcom Corporation	Broadcom NetXtreme II GigE VBD
10/23/2013 5:30	(10.90:0.0)	(10.90:0.0)	Hewlett-Packard Company	Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8
4/20/2010 6:03	(6.1:7600.16385)	(4.1:5.150)	Hewlett-Packard	HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays
2/17/2011 17:16	(1.14:0.0)	(1.14:0.0)	Hewlett-Packard Company	HP ProLiant iLO 2 Management Controller Driver
5/18/2009 20:18	(2.1:3.20)	(2.1:3.20)	QLogic Corporation	QLogic iSCSI Storport Miniport Driver
7/26/2010 20:39	(10.10:0.0)	(10.10:0.0)	Hewlett-Packard Company	Network Teaming Intermediate Driver (NTID), 10.10.00.03, NDIS 6.0, x64, (free) on Win2K8
10/16/2011 19:02	(10.45:0.0)	(10.45:0.0)	Hewlett-Packard Company	Network Teaming Intermediate Driver (NTID), 10.45.00.01, NDIS 6.0, x64, (free) on Win2K8

_________________________________________________________________________________________

Conclusion:

After analyzing the logs, we found that the issue was caused by the file server resource failing the IsAlive check on the cluster.

This issue can occur if the system runs out of TCP ports (port exhaustion), if Windows Server 2008 R2 is missing up-to-date hotfixes for the kernel, networking, and cluster modules, or if the NIC drivers are outdated. TCP port exhaustion is unlikely in this case because event ID 4227 was not logged.
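
The check for event ID 4227 can be repeated against an exported event list. The following is a minimal Python sketch, assuming the events are exported as tab-separated rows with the same seven columns as the tables in this report. The column names, function name, and abbreviated sample descriptions are illustrative assumptions.

```python
import csv
import io

# Assumed column layout, matching the event tables in this report.
COLUMNS = ["Date", "Time", "Level", "Computer", "EventCode", "Source", "Description"]


def find_events(tsv_text, event_code):
    """Return the rows of a tab-separated event export that carry the given event code."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]
    return [row for row in rows if row["EventCode"] == str(event_code)]


# Two rows from this report, with abbreviated descriptions.
SAMPLE = "\n".join([
    "\t".join(["7/19/2016", "12:11:06 PM", "Error", "AB1-C1-BLADE09.inxpo.dmz", "1587",
               "Microsoft-Windows-FailoverClustering", "File server resource failed a health check"]),
    "\t".join(["7/19/2016", "12:11:06 PM", "Error", "AB1-C1-BLADE09.inxpo.dmz", "1069",
               "Microsoft-Windows-FailoverClustering", "Cluster resource failed"]),
])

port_exhaustion_events = find_events(SAMPLE, 4227)
print("Event 4227 found" if port_exhaustion_events else "No event 4227 in this export")
```

An empty result supports ruling out port exhaustion only if the export covers the System log for the whole incident window.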

Check whether the TCP/IP NetBIOS Helper service is disabled. If it is, enable it; by default the service is set to start automatically.

Update the network adapter drivers on both nodes of the cluster.

Kindly update the HP MPIO DSM after discussion with the HP Team.

The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:

• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)

• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)

• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:

Updates for Cluster Binaries for 2008 R2

https://support.microsoft.com/en-us/kb/2545685

Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: This step is the most critical when dealing with network connectivity issues.

Investigation of Network Issues :

We need to investigate the Network Connectivity Issues with the help of in-house networking team.

To avoid this issue in the future, the most critical step is to diagnose and investigate the persistent network connectivity issue on the cluster networks.

We need to check the network adapter, cables, and network configuration for the networks that connect the nodes.

We also need to check hubs, switches, or bridges in the networks that connect the nodes.

We need to check for Switch Delays & Proxy ARPs with the help of in-house Networking Team.

Communication between server cluster nodes is critical for smooth cluster operations. Therefore, make sure that the networks used for cluster communication are configured optimally and follow all hardware compatibility list requirements. For networking configuration, two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a heartbeat network to the cluster so that it can work properly.