Issue Description:
Determine the possible cause of the cluster going offline at 12:11 PM Central Time on 7/19/2016. Cluster name: ab1-abcxntclust, running Microsoft Windows Server 2008 R2 Enterprise Service Pack 1, 64-bit.
_________________________________________________________________________________
System Information:
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AB1-C1-BLADE09
System Manufacturer HP
System Model ProLiant BL460c G6
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5530 @ 2.40GHz, 2400 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP I24, 8/16/2015
System Events:
- Checked the event logs and found that the issue started with the failure of the file share resources. No related events were logged before the file server failure.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/19/2016 | 12:11:06 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(dc1-contentfs)(ContentABC_FC)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node. |
7/19/2016 | 12:11:06 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(dc1-contentfs)(ContentABC_FC)’ in clustered service or application ‘dc1-contentfs’ failed. |
7/19/2016 | 12:19:07 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered service or application ‘dc1-contentfs’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application. |
- Cluster resources are showing as degraded.
7/19/2016 | 12:19:25 PM | Warning | AB1-C1-BLADE09.inxpo.dmz | 1167 | Foundation Agents | Cluster Agent: The cluster resource ContentABC_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
7/19/2016 | 12:19:25 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1168 | Foundation Agents | Cluster Agent: The cluster resource FileServer-(dc1-contentfs)(ContentABC_FC) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB] |
- At 12:34 PM the link on the HP network adapter (NIC) went down.
7/19/2016 | 12:34:26 PM | Warning | AB1-C1-BLADE09.inxpo.dmz | 4 | q57nd60a | HP NC326m PCIe Dual Port Adapter: The network link is down. Check to make sure the network cable is properly connected. |
7/19/2016 | 12:34:33 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 2 | HP Ethernet | If the Network Interface is an Ethernet Port, the Ethernet Port has transitioned from OK to Error. If the Network Interface is an Ethernet Team, the Ethernet Team has transitioned from Fully Redundant, Degraded Redundancy or Redundancy Lost to Overall Failure, due to a failed team member. |
- Because the network was down, cluster node AB1-C2-BLADE09 was removed from the active failover cluster membership.
7/19/2016 | 12:35:46 PM | Critical | AB1-C1-BLADE09.inxpo.dmz | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB1-C2-BLADE09’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
7/19/2016 | 12:35:46 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1127 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘AB1-C1-BLADE09 – Cluster Heartbeat’ for cluster node ‘AB1-C1-BLADE09’ on network ‘Cluster Heartbeat’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
7/19/2016 | 12:35:46 PM | Error | AB1-C1-BLADE09.inxpo.dmz | 1130 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Heartbeat’ is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
Application Events:
- Checked the Application events for the time of the issue and found only that VTS job service tasks were running in the background.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 3 | VTS Job Server | N/A |
7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 3 | VTS File Upload Conversion | N/A |
Cluster Events:
- Checked the cluster events and found the same sequence of events as in the System events.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/19/2016 | 12:11:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1201 | Microsoft-Windows-FailoverClustering | The Cluster service successfully brought the clustered service or application ‘dc1-contentfs’ online. |
7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1153 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to fail over the clustered service or application ‘dc1-contentfs’ from node ‘AB1-C1-BLADE09’ to node ‘AB1-C2-BLADE09’. |
7/19/2016 | 12:19:07 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1203 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to bring the clustered service or application ‘dc1-contentfs’ offline. |
7/19/2016 | 12:35:46 PM | Information | AB1-C1-BLADE09.inxpo.dmz | 1125 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘AB1-C1-BLADE09 – DMZ Network’ for cluster node ‘AB1-C1-BLADE09’ on network ‘Cluster Network 3’ is operational (up). The node can communicate with all other available failover cluster nodes on the network. |
Cluster Logs:
00000e78.000046f4::2016/07/19-17:07:06.847 WARN [RES] File Server <FileServer-(dc1-contentfs)(ContentStorage_FC)>: Failed in NetShareGetInfo(dc1-contentfs, CachedAttachmentFiles), status 53. Tolerating…
000008f0.00003494::2016/07/19-17:08:50.074 INFO [GUM] Node 1: Processing RequestLock 1:33616
000008f0.00000b28::2016/07/19-17:08:50.386 INFO [GUM] Node 1: Processing GrantLock to 1 (sent by 2 gumid: 233031)
000008f0.00003494::2016/07/19-17:09:24.878 INFO [API] s_ApiGetQuorumResource final status 0.
000008f0.00000b28::2016/07/19-17:09:53.395 INFO [GUM] Node 1: Processing RequestLock 2:3871
000008f0.00000b28::2016/07/19-17:09:53.395 INFO [GUM] Node 1: Processing GrantLock to 2 (sent by 1 gumid: 233043)
00000e78.0000306c::2016/07/19-17:11:06.872 WARN [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: Failed in NetShareGetInfo(dc1-contentfs, FTPRoot), status 53. Tolerating…
00000e78.000057c8::2016/07/19-17:11:06.872 WARN [RES] File Server <FileServer-(dc1-contentfs)(ContentStorage_FC)>: Failed in NetShareGetInfo(dc1-contentfs, CachedAttachmentFiles), status 53. Tolerating…
00000e78.0000306c::2016/07/19-17:11:06.872 ERR [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: Not a single share among 1 configured shares is online
00000e78.0000306c::2016/07/19-17:11:06.872 ERR [RES] File Server <FileServer-(dc1-contentfs)(ContentABC_FC)>: File system check failed, number of shares verified: 1, last share status: 259.
00000e78.0000306c::2016/07/19-17:11:06.872 WARN [RHS] Resource FileServer-(dc1-contentfs)(ContentABC_FC) IsAlive has indicated failure.
000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'FileServer-(dc1-contentfs)(ContentABC_FC)', gen(0) result 1.
000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] TransitionToState(FileServer-(dc1-contentfs)(ContentABC_FC)) Online-->ProcessingFailure.
000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (dc1-contentfs, Online --> Failed)
000008f0.0000222c::2016/07/19-17:11:06.872 ERR [RCM] rcm::RcmResource::HandleFailure: (FileServer-(dc1-contentfs)(ContentABC_FC))
000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] resource FileServer-(dc1-contentfs)(ContentABC_FC): failure count: 1, restartAction: 2.
000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] Will restart resource in 500 milliseconds.
000008f0.0000222c::2016/07/19-17:11:06.872 INFO [RCM] TransitionToState(FileServer-(dc1-contentfs)(ContentABC_FC)) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].
00000e78.000050cc::2016/07/19-17:19:08.029 INFO [RES] Generic Service <VTS Job Server>: Service died or not active any more; status = 1062.
00000e78.0000607c::2016/07/19-17:19:08.029 INFO [RES] Generic Service <VTS File Upload Conversion>: Service died or not active any more; status = 1062.
00000e78.00005fd8::2016/07/19-17:19:08.029 INFO [RES] Generic Service <VTS Background Report Mailer>: Service died or not active any more; status = 1062.
- As part of the IsAlive check of the file server resource, the cluster service verifies that the directory path associated with each share is still valid. It does this by accessing the UNC path (\\VCO name) from the local host. If the share paths are accessible, the resource stays online; if not, the resource is marked as failed. The cluster log shows that the share failed the health check and the resource transitioned to the Failed state. Cluster log timestamps are in UTC, so 17:11:06 in the log corresponds to 12:11:06 PM Central Time in the event logs.
- Status 53 in “Failed in NetShareGetInfo” translates to ERROR_BAD_NETPATH: The network path was not found.
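To line up the cluster log with the event log entries, each cluster log line can be parsed, its UTC timestamp shifted to Central Time, and any numeric status looked up. The following is a minimal sketch only: the function name parse_line, the regular expressions, and the fixed UTC-5 offset (Central Daylight Time on the date of the incident) are assumptions made for illustration, and the lookup table covers only the three Win32 codes that appear in the excerpt above.

```python
import re
from datetime import datetime, timedelta, timezone

# Win32 error codes that appear in the cluster log excerpt.
WIN32_STATUS = {
    53: "ERROR_BAD_NETPATH",
    259: "ERROR_NO_MORE_ITEMS",
    1062: "ERROR_SERVICE_NOT_ACTIVE",
}

# Assumption: Central Daylight Time (UTC-5) was in effect on 7/19/2016.
CDT = timezone(timedelta(hours=-5), "CDT")

# processid.threadid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>\w+)\]\s+(?P<message>.*)$"
)
# Matches "status 53", "status: 259" and "status = 1062".
STATUS_RE = re.compile(r"status(?:\s*=\s*|:\s*|\s+)(\d+)")


def parse_line(line):
    """Parse one cluster log line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    utc = datetime.strptime(match["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    utc = utc.replace(tzinfo=timezone.utc)
    status = STATUS_RE.search(match["message"])
    code = int(status.group(1)) if status else None
    return {
        "local_time": utc.astimezone(CDT),
        "level": match["level"],
        "component": match["component"],
        "status": code,
        "status_name": WIN32_STATUS.get(code),
        "message": match["message"],
    }


if __name__ == "__main__":
    sample = (
        "00000e78.0000306c::2016/07/19-17:11:06.872 WARN [RES] File Server "
        "<FileServer-(dc1-contentfs)(ContentABC_FC)>: Failed in "
        "NetShareGetInfo(dc1-contentfs, FTPRoot), status 53. Tolerating..."
    )
    record = parse_line(sample)
    print(record["local_time"].strftime("%I:%M:%S %p"), record["status_name"])
```

Running every line of the excerpt through such a parser gives a single Central Time timeline that can be compared directly with the System and Cluster event tables.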
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
10/15/2013 21:22 | (16.4:0.1) | (16.4:0.1) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
7/28/2011 2:54 | (6.1:7600.16385) | (7.0:2.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
8/29/2013 11:30 | (7.8:51.0) | (7.8:51.0) | Broadcom Corporation | Broadcom NetXtreme II Diagnostic Driver |
2/13/2009 16:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
10/23/2013 5:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
4/20/2010 6:03 | (6.1:7600.16385) | (4.1:5.150) | Hewlett-Packard | HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays |
2/17/2011 17:16 | (1.14:0.0) | (1.14:0.0) | Hewlett-Packard Company | HP ProLiant iLO 2 Management Controller Driver |
5/18/2009 20:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
7/26/2010 20:39 | (10.10:0.0) | (10.10:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.10.00.03, NDIS 6.0, x64, (free) on Win2K8 |
10/16/2011 19:02 | (10.45:0.0) | (10.45:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.45.00.01, NDIS 6.0, x64, (free) on Win2K8 |
Fibre Channel Information:
Description: QLogic QMH2462 Fibre Channel Adapter
Driver Version: 9.1.16.21
Firmware version: 8.01.02
Driver name: ql2300.sys
___________________________________________________________________________________
System Information:
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AB1-C2-BLADE09
System Manufacturer HP
System Model ProLiant BL460c G6
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU E5530 @ 2.40GHz, 2400 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP I24, 8/16/2015
System Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/19/2016 | 12:19:44 PM | Warning | AB1-C2-BLADE09.inxpo.dmz | 1167 | Foundation Agents | Cluster Agent: The cluster resource ContentABC_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
7/19/2016 | 12:19:44 PM | Error | AB1-C2-BLADE09.inxpo.dmz | 1168 | Foundation Agents | Cluster Agent: The cluster resource FileServer-(dc1-contentfs)(ContentABC_FC) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB] |
7/19/2016 | 12:21:45 PM | Warning | AB1-C2-BLADE09.inxpo.dmz | 1167 | Foundation Agents | Cluster Agent: The cluster resource ContentStorage_FC has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
Application Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/19/2016 | 11:48:38 AM | Information | AB1-C2-BLADE09.inxpo.dmz | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
7/19/2016 | 12:47:34 PM | Error | AB1-C2-BLADE09.inxpo.dmz | 1000 | Application Error | Faulting application name: VTSJobServer.exe, version: 0.0.0.0, time stamp: 0x57604532 Faulting module name: MSVCR100.dll, version: 10.0.40219.325, time stamp: 0x4df2bcac Exception code: 0xc0000005 Fault offset: 0x000000000003c225 Faulting process id: 0x1874 Faulting application start time: 0x01d1e1e59fd6546f Faulting application path: H:\Services\VTSJobServer\VTSJobServer.exe Faulting module path: C:\Windows\system32\MSVCR100.dll Report Id: df57bdf1-4dd8-11e6-9949-00265521b094 |
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
10/15/2013 21:22 | (16.4:0.1) | (16.4:0.1) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
7/28/2011 2:54 | (6.1:7600.16385) | (7.0:2.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64) |
2/13/2009 16:18 | (4.8:2.0) | (4.8:2.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD |
10/23/2013 5:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
4/20/2010 6:03 | (6.1:7600.16385) | (4.1:5.150) | Hewlett-Packard | HP MPIO DSM for EVA4x00/6x00/8x00 family of Disk Arrays |
2/17/2011 17:16 | (1.14:0.0) | (1.14:0.0) | Hewlett-Packard Company | HP ProLiant iLO 2 Management Controller Driver |
5/18/2009 20:18 | (2.1:3.20) | (2.1:3.20) | QLogic Corporation | QLogic iSCSI Storport Miniport Driver |
7/26/2010 20:39 | (10.10:0.0) | (10.10:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.10.00.03, NDIS 6.0, x64, (free) on Win2K8 |
10/16/2011 19:02 | (10.45:0.0) | (10.45:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.45.00.01, NDIS 6.0, x64, (free) on Win2K8 |
_________________________________________________________________________________________
Conclusion:
- After analyzing the logs, we found that the issue occurred because the File Server resource failed the IsAlive check on the cluster.
- This issue can occur if the system runs out of TCP ports (port exhaustion), if Windows Server 2008 R2 is missing up-to-date hotfixes for the kernel, networking, and cluster modules, or if the NIC drivers are outdated. Port exhaustion is unlikely here because event ID 4227 was not logged.
- Check whether the TCP/IP NetBIOS Helper service is disabled. If it is, enable it; by default the service is set to start automatically.
- Update the network adapter drivers on both nodes of the cluster.
- Update the HP MPIO DSM after discussing it with the HP team.
- The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
• The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:
Updates for Cluster Binaries for 2008 R2
https://support.microsoft.com/en-us/kb/2545685
- Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: this step is the most critical when dealing with network connectivity issues.
Investigation of Network Issues:
To avoid this issue in the future, the most critical step is to diagnose and investigate the recurring network connectivity issue on the cluster networks with the help of the in-house networking team.
Check the network adapters, cables, and network configuration for the networks that connect the nodes.
Also check the hubs, switches, or bridges in the networks that connect the nodes.
Check for switch delays and proxy ARP with the help of the in-house networking team.
- Communication between server cluster nodes is critical for smooth cluster operations. Therefore, the networks used for cluster communication must be configured optimally and must follow all hardware compatibility list requirements. Two or more independent networks must connect the nodes of a cluster to avoid a single point of failure. Please add a heartbeat network to the cluster so that it can work properly.
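The conclusion above rules out TCP port exhaustion because event ID 4227 was not logged. As one possible additional check, the sketch below counts how many distinct local ports in the default dynamic range (49152 to 65535 on Windows Server 2008 and later) are in use, given the text output of netstat -ano saved from a node. The function name count_dynamic_ports and the sample addresses are hypothetical and for illustration only; a customized dynamic port range on the node would need a different range value.

```python
# Default dynamic (ephemeral) TCP port range on Windows Server 2008 and later.
DYNAMIC_PORT_RANGE = range(49152, 65536)


def count_dynamic_ports(netstat_output):
    """Count distinct local TCP ports in the dynamic range.

    netstat_output is the text produced by `netstat -ano`.
    """
    used_ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and non-TCP rows.
        if len(fields) < 2 or fields[0].upper() != "TCP":
            continue
        # The local address is the second column, e.g. 10.0.0.5:49200
        # or [::1]:49300; the port follows the last colon.
        port_text = fields[1].rsplit(":", 1)[-1]
        if port_text.isdigit() and int(port_text) in DYNAMIC_PORT_RANGE:
            used_ports.add(int(port_text))
    return len(used_ports)


if __name__ == "__main__":
    # Hypothetical sample output; the addresses are made up.
    sample = """
      Proto  Local Address          Foreign Address        State           PID
      TCP    10.0.0.5:445           10.0.0.9:51000         ESTABLISHED     4
      TCP    10.0.0.5:49200         10.0.0.9:445           ESTABLISHED     1234
      TCP    10.0.0.5:49201         10.0.0.9:445           TIME_WAIT       0
      TCP    [::1]:49300            [::1]:135              ESTABLISHED     900
      UDP    0.0.0.0:50000          *:*                                    700
    """
    used = count_dynamic_ports(sample)
    print(f"{used} of {len(DYNAMIC_PORT_RANGE)} dynamic ports in use")
```

A count that approaches the size of the range would point back toward port exhaustion; a low count is consistent with the absence of event ID 4227.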
Recommended private “Heartbeat” configuration on a cluster server