Issue Description:
Cluster Service Terminating and Resources failing on Server 2008 R2 Cluster Name “MSCSDPO02.xyz.local” in “xyz.local” domain
________________________________________________________________________________________________
System Information:
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name XYZ223409
System Manufacturer HP
System Model ProLiant DL580 G7
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date HP P65, 2013-10-01
System Events:
- Checked the system events and found that the file share services went offline around 4:02 PM. We did not see any failures before the time of the issue.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/24/2016 | 4:02:07 PM | Error | XYZ223409.xyz.local | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node. |
7/24/2016 | 4:02:07 PM | Error | XYZ223409.xyz.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5)’ in clustered service or application ‘MSCSXYZ03-RES3’ failed. |
7/24/2016 | 4:02:08 PM | Error | XYZ223409.xyz.local | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered service or application ‘MSCSXYZ03-RES3’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application. |
- Around 4:03 PM we saw event 4227, which indicates that TCP/IP failed to establish an outgoing connection.
7/24/2016 | 4:03:23 PM | Warning | XYZ223409.xyz.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource ShadowCopyVolume{0BAD05C7-69DC-4134-B944-6010CD655668} has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
7/24/2016 | 4:03:24 PM | Warning | XYZ223409.xyz.local | 4227 | Tcpip | TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint. |
7/24/2016 | 4:04:06 PM | Error | XYZ223409.xyz.local | 60 | volsnap | The shadow copies of volume J: were aborted because volume U:, which contains shadow copy storage for this shadow copy, has been taken offline. |
7/24/2016 | 4:04:07 PM | Error | XYZ223409.xyz.local | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node. |
7/24/2016 | 4:04:07 PM | Error | XYZ223409.xyz.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)’ in clustered service or application ‘MSCSXYZ03-RES7’ failed. |
Application Events:
- Checked the application logs and was not able to find any relevant information for the backup.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
10/1/2014 15:26 | (5.3:30.1001) | (5.3:30.1001) | QLogic Corporation | QLogic FlexLOM(TM) NDIS Miniport Driver |
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |
10/23/2013 6:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
10/16/2012 18:42 | (5.8:56.0) | (5.8:56.0) | Quest Software Corporation | Quest ChangeAuditor for Windows File Servers Driver |
10/16/2012 19:06 | (5.8:56.0) | (5.8:56.0) | Quest Software, Inc. | Quest ChangeAuditor Agent Support Driver |
Cluster Logs:
000023b4.00001054::2016/07/25-04:44:35.594 WARN [RHS] Resource FileServer-(MSCSXYZ03-RES8)(Disque du cluster 1) IsAlive has indicated failure.
000023b4.00000f9c::2016/07/25-04:20:31.234 INFO [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Successfully added Share Directions3 with Path k:\Directions3 on server MSCSXYZ03-RES7
000023b4.00002124::2016/07/25-04:20:31.234 WARN [RES] File Server <FileServer-(MSCSXYZ03-RES6)(Disque du cluster 7)>: Failed in NetShareGetInfo(MSCSXYZ03-RES6, Cabinet), status 53. Tolerating…
000023b4.00000f9c::2016/07/25-04:20:31.234 WARN [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Failed in NetShareGetInfo(MSCSXYZ03-RES7, Directions3), status 53. Tolerating…
000023b4.00000f9c::2016/07/25-04:20:31.234 ERR [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Not a single share among 1 configured shares is online
000023b4.00000f9c::2016/07/25-04:20:31.234 ERR [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: File system check failed, number of shares verified: 1, last share status: 53.
000023b4.00000f9c::2016/07/25-04:20:31.234 ERR [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Fileshares failed health check during online, status 53.
As part of the IsAlive check of a file server resource, the cluster service verifies that the directory path associated with each share is still valid. It does this by accessing the UNC path (\\VCO name) on the local host. If the share paths are accessible, the resource comes online; if not, the resource is marked as failed. The cluster log reports that the shares failed the health check and that the resources transitioned to the failed state.
Status 53 at “Failed in NetShareGetInfo” translates to: ERROR_BAD_NETPATH
The network path was not found.
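The status codes in cluster log lines like the ones above can be extracted and named programmatically. The following is a minimal Python sketch and not part of the original analysis: the function name `summarize_share_failures` and the small `WIN32_ERRORS` table are illustrative assumptions, and the regular expression is fitted only to the `NetShareGetInfo` lines quoted in this report.

```python
import re
from collections import Counter

# Hand-picked subset of Win32 error codes (illustrative, not exhaustive).
WIN32_ERRORS = {
    5: "ERROR_ACCESS_DENIED",
    53: "ERROR_BAD_NETPATH",
    67: "ERROR_BAD_NET_NAME",
}

# Matches e.g. "Failed in NetShareGetInfo(MSCSXYZ03-RES7, Directions3), status 53."
PATTERN = re.compile(
    r"Failed in NetShareGetInfo\(([^,]+),\s*([^)]+)\), status (\d+)"
)


def summarize_share_failures(log_lines):
    """Count failed NetShareGetInfo calls per (server, share, error name)."""
    failures = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a NetShareGetInfo failure line
        server = match.group(1).strip()
        share = match.group(2).strip()
        status = int(match.group(3))
        name = WIN32_ERRORS.get(status, f"UNKNOWN({status})")
        failures[(server, share, name)] += 1
    return failures


if __name__ == "__main__":
    # Two lines copied from the cluster log excerpt in this report.
    sample = [
        "000023b4.00002124::2016/07/25-04:20:31.234 WARN [RES] File Server "
        "<FileServer-(MSCSXYZ03-RES6)(Disque du cluster 7)>: Failed in "
        "NetShareGetInfo(MSCSXYZ03-RES6, Cabinet), status 53. Tolerating…",
        "000023b4.00000f9c::2016/07/25-04:20:31.234 WARN [RES] File Server "
        "<FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Failed in "
        "NetShareGetInfo(MSCSXYZ03-RES7, Directions3), status 53. Tolerating…",
    ]
    for key, count in summarize_share_failures(sample).items():
        print(key, count)
```

Running this over a full cluster log would show whether status 53 is confined to a few shares or affects every file server resource on the node.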
_________________________________________________________________________________________
System Information:
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name XYZ223410
System Manufacturer HP
System Model ProLiant DL580 G7
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date HP P65, 2013-10-01
System Events:
- Checked the events and found that the cluster resources started to fail around 4:02 PM with event ID 1587.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/24/2016 | 4:02:14 PM | Warning | XYZ223410.xyz.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource GxClusPlugIn (MSCSXYZ03-res3) (Instance001) has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
7/24/2016 | 4:02:14 PM | Error | XYZ223410.xyz.local | 1168 | Foundation Agents | Cluster Agent: The cluster resource FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB] |
7/24/2016 | 4:02:08 PM | Error | XYZ223409.xyz.local | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node. |
7/24/2016 | 4:04:19 PM | Error | XYZ223410.xyz.local | 1207 | Microsoft-Windows-FailoverClustering | Cluster network name resource ‘MSCSXYZ03-RES6’ cannot be brought online. The computer object associated with the resource could not be updated in domain ‘xyz.local’ for the following reason: Unable to log on to the domain using the cluster virtual identity account. The text for the associated error code is: There are currently no logon servers available to service the logon request. The cluster identity ‘MSCSDPO02$’ may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain. |
7/24/2016 | 4:04:19 PM | Error | XYZ223410.xyz.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘MSCSXYZ03-RES6’ in clustered service or application ‘MSCSXYZ03-RES6’ failed. |
Application Events:
- Checked the application logs and was not able to find any relevant information for the backup.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
10/1/2014 15:26 | (5.3:30.1001) | (5.3:30.1001) | QLogic Corporation | QLogic FlexLOM(TM) NDIS Miniport Driver |
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |
10/23/2013 6:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
10/16/2012 18:42 | (5.8:56.0) | (5.8:56.0) | Quest Software Corporation | Quest ChangeAuditor for Windows File Servers Driver |
10/16/2012 19:06 | (5.8:56.0) | (5.8:56.0) | Quest Software, Inc. | Quest ChangeAuditor Agent Support Driver |
Cluster Logs:
__________________________________________________________________________________
Conclusion:
- After analyzing the logs, we found that the issue occurred because the File Server resources failed the IsAlive health check on the cluster.
- This can happen if the system runs out of TCP ports (port exhaustion), if Windows Server 2008 R2 is missing up-to-date hotfixes for the kernel, networking, and cluster modules, or if the NIC drivers are outdated. Event ID 4227 in the System log confirms the port exhaustion.
Plan:
a) Registry changes
- Increase the number of TCP ephemeral ports (MaxUserPort) and decrease the TCP TIME_WAIT delay (TcpTimedWaitDelay) by editing the following registry key:
- HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
- Create a DWORD TcpTimedWaitDelay
- Value: 30 (Dec)
- Create a DWORD MaxUserPort
- Value: 65534 (Dec)
The above registry changes require a reboot of the server to take effect.
Reducing TcpTimedWaitDelay allows TCP to release closed connections faster, providing more resources for new connections.
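To illustrate why both settings matter: each closed outgoing connection keeps its local port reserved for the TIME_WAIT period, so the sustainable rate of new connections to one remote endpoint is roughly (number of ephemeral ports) / (TIME_WAIT seconds). The Python sketch below works through this arithmetic. The function name is hypothetical, and the baseline of 16384 ports with a 120-second TIME_WAIT is an assumption for illustration, not a value measured on these nodes.

```python
def max_sustained_connections_per_second(ephemeral_ports, time_wait_seconds):
    """Rough upper bound on new outgoing connections per second to one remote endpoint.

    Each closed connection holds its local port in TIME_WAIT for
    time_wait_seconds, so at most ephemeral_ports connections can be
    opened within any window of that length.
    """
    if ephemeral_ports <= 0 or time_wait_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return ephemeral_ports / time_wait_seconds


if __name__ == "__main__":
    # Assumed baseline (illustrative): 16384 dynamic ports, 120 s TIME_WAIT.
    before = max_sustained_connections_per_second(16384, 120)
    # Proposed: 32768 dynamic ports and TcpTimedWaitDelay = 30.
    after = max_sustained_connections_per_second(32768, 30)
    print(f"before: about {before:.0f} connections/s")
    print(f"after:  about {after:.0f} connections/s")
```

This is only an upper bound under a simple model; real workloads spread connections across several remote endpoints and rarely close every connection immediately.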
For more information about modifying the ephemeral port range (MaxUserPort) in Windows Server 2008 or later operating systems, see the following informational KB article (this is the preferred method, as opposed to editing MaxUserPort in the registry):
https://support.microsoft.com/en-us/kb/953230
Example:
netsh int ipv4 set dynamicport tcp start=32768 num=32768
(The default dynamic port range is 16384 ports, starting at port 49152.)
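The start/num pair translates into a first and last port as sketched below in Python. This is an illustrative helper, not a Microsoft tool: the function name is hypothetical, and the limits it checks (start of at least 1025, at least 255 ports, last port no higher than 65535) are assumed from Microsoft's published guidance for the dynamic port range.

```python
def dynamic_port_range(start, num):
    """Return (first_port, last_port) for a netsh-style start/num pair.

    Assumed limits: start >= 1025, num >= 255, last port <= 65535.
    """
    last = start + num - 1
    if start < 1025:
        raise ValueError("start port must be at least 1025")
    if num < 255:
        raise ValueError("range must contain at least 255 ports")
    if last > 65535:
        raise ValueError("range must not extend past port 65535")
    return start, last


if __name__ == "__main__":
    # The range from the netsh example, then the default range.
    print(dynamic_port_range(32768, 32768))
    print(dynamic_port_range(49152, 16384))
```

Both the example range and the default range end at port 65535; the example simply starts lower and so doubles the number of available ports.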
Note: If the warnings/errors are seen again after some time, all ephemeral TCP ports on the cluster node may be exhausted, leading to application errors such as WSAENOBUFS (10055) or ‘Error 53’ with cluster resource failover (the cluster health check fails in NetShareGetInfo with status 53). Look out for outdated TDI kernel filter drivers. As an example, the following Trend Micro driver can lead to TCP port exhaustion:
Name | Company | Version | Date | Description |
TMTDI.SYS | Trend Micro In | 5.82:0.1024 | Nov 08 2010 | Trend Micro TDI Driver (amd64-fre) |
b) Relevant hotfixes
Please make sure that the following hotfixes are installed on the machine.
Network Stack:
https://support.microsoft.com/en-us/kb/3156417
https://support.microsoft.com/en-us/kb/3080140
https://support.microsoft.com/en-us/kb/2831013
https://support.microsoft.com/en-us/kb/3021169
Updates for Cluster Binaries for 2008 R2 : https://support.microsoft.com/en-us/kb/980054