RCA – 7 – Cluster Service Terminating and Resource Failing

Issue Description:

 

The Cluster service terminates and resources fail on the Windows Server 2008 R2 cluster “MSCSDPO02.xyz.local” in the “xyz.local” domain.

________________________________________________________________________________________________

 

System Information: 

 

OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name XYZ223409
System Manufacturer HP
System Model ProLiant DL580 G7
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date HP P65, 2013-10-01

 

 

System Events:

 

  • Checked the system events and found that the file share services went offline around 4:02 PM. No failures were logged before that time.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/24/2016 | 4:02:07 PM | Error | XYZ223409.xyz.local | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node.
7/24/2016 | 4:02:07 PM | Error | XYZ223409.xyz.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5)’ in clustered service or application ‘MSCSXYZ03-RES3’ failed.
7/24/2016 | 4:02:08 PM | Error | XYZ223409.xyz.local | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered service or application ‘MSCSXYZ03-RES3’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

 

  • Around 4:03 PM, event 4227 was logged, indicating that TCP/IP failed to establish an outgoing connection.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/24/2016 | 4:03:23 PM | Warning | XYZ223409.xyz.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource ShadowCopyVolume{0BAD05C7-69DC-4134-B944-6010CD655668} has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
7/24/2016 | 4:03:24 PM | Warning | XYZ223409.xyz.local | 4227 | Tcpip | TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.
7/24/2016 | 4:04:06 PM | Error | XYZ223409.xyz.local | 60 | volsnap | The shadow copies of volume J: were aborted because volume U:, which contains shadow copy storage for this shadow copy, has been taken offline.
7/24/2016 | 4:04:07 PM | Error | XYZ223409.xyz.local | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node.
7/24/2016 | 4:04:07 PM | Error | XYZ223409.xyz.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)’ in clustered service or application ‘MSCSXYZ03-RES7’ failed.

 

 

Application Events:

 

 

  • Checked the application logs and was not able to find any relevant information for the backup.

 

List of outdated drivers:

 

Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
10/1/2014 15:26 | (5.3:30.1001) | (5.3:30.1001) | QLogic Corporation | QLogic FlexLOM(TM) NDIS Miniport Driver
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
10/23/2013 6:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8
10/16/2012 18:42 | (5.8:56.0) | (5.8:56.0) | Quest Software Corporation | Quest ChangeAuditor for Windows File Servers Driver
10/16/2012 19:06 | (5.8:56.0) | (5.8:56.0) | Quest Software, Inc. | Quest ChangeAuditor Agent Support Driver

 

Cluster Logs:

 

000023b4.00001054::2016/07/25-04:44:35.594 WARN  [RHS] Resource FileServer-(MSCSXYZ03-RES8)(Disque du cluster 1) IsAlive has indicated failure.

000023b4.00000f9c::2016/07/25-04:20:31.234 INFO  [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Successfully added Share Directions3 with Path k:\Directions3 on server MSCSXYZ03-RES7

000023b4.00002124::2016/07/25-04:20:31.234 WARN  [RES] File Server <FileServer-(MSCSXYZ03-RES6)(Disque du cluster 7)>: Failed in NetShareGetInfo(MSCSXYZ03-RES6, Cabinet), status 53. Tolerating…

000023b4.00000f9c::2016/07/25-04:20:31.234 WARN  [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Failed in NetShareGetInfo(MSCSXYZ03-RES7, Directions3), status 53. Tolerating…

000023b4.00000f9c::2016/07/25-04:20:31.234 ERR   [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Not a single share among 1 configured shares is online

000023b4.00000f9c::2016/07/25-04:20:31.234 ERR   [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: File system check failed, number of shares verified: 1, last share status: 53.

000023b4.00000f9c::2016/07/25-04:20:31.234 ERR   [RES] File Server <FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: Fileshares failed health check during online, status 53.

 

As part of the IsAlive check of a file server resource, the Cluster service verifies that the directory path associated with each share is still valid. It does so by accessing the UNC path (\\VCO name) from the local host. If the share paths are accessible, the resource comes online; if not, the resource is marked as failed. The cluster logs report that the shares failed the health check and that the resource transitioned to the failed state.

 

Status 53 at “Failed in NetShareGetInfo” translates to: ERROR_BAD_NETPATH

The network path was not found.
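To pull status codes out of a larger cluster log, a small parser can help. The following is a minimal sketch in Python, assuming the line layout seen in the excerpt above (process.thread IDs, timestamp, level, channel, message). The names `parse_line` and `WIN32_ERRORS` are illustrative and not part of any Microsoft tooling, and the lookup table only contains the one code discussed here.

```python
import re

# Hand-picked subset of Win32 error codes; extend as needed (illustrative only).
WIN32_ERRORS = {
    53: "ERROR_BAD_NETPATH",  # The network path was not found.
}

# Assumed layout, based on the excerpt above:
# <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> [<CHANNEL>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+"
    r"\[(?P<channel>\w+)\]\s+(?P<message>.*)$"
)

# Matches both "status 53" and "status: 53".
STATUS_RE = re.compile(r"status:? (\d+)")


def parse_line(line):
    """Return a dict describing one cluster log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    status = STATUS_RE.search(entry["message"])
    entry["status"] = int(status.group(1)) if status else None
    entry["status_name"] = WIN32_ERRORS.get(entry["status"])
    return entry


if __name__ == "__main__":
    sample = (
        "000023b4.00000f9c::2016/07/25-04:20:31.234 ERR   [RES] File Server "
        "<FileServer-(MSCSXYZ03-RES7)(Disque du cluster 8)>: "
        "Fileshares failed health check during online, status 53."
    )
    entry = parse_line(sample)
    print(entry["ts"], entry["level"], entry["status"], entry["status_name"])
```

Filtering the parsed entries on `level == "ERR"` and a non-empty `status` gives a quick list of the failing resources and the error each one reported.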

 

_________________________________________________________________________________________

 

 

System Information: 

 

OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name XYZ223410
System Manufacturer HP
System Model ProLiant DL580 G7
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 Mhz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date HP P65, 2013-10-01

 

 

System Events:

 

  • Checked the events and found that the cluster resources started to fail around 4:02 PM with event ID 1587.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description
7/24/2016 | 4:02:14 PM | Warning | XYZ223410.xyz.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource GxClusPlugIn (MSCSXYZ03-res3) (Instance001) has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]
7/24/2016 | 4:02:14 PM | Error | XYZ223410.xyz.local | 1168 | Foundation Agents | Cluster Agent: The cluster resource FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5) has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]
7/24/2016 | 4:02:08 PM | Error | XYZ223409.xyz.local | 1587 | Microsoft-Windows-FailoverClustering | Cluster file server resource ‘FileServer-(MSCSXYZ03-RES3)(Disque du cluster 5)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node.
7/24/2016 | 4:04:19 PM | Error | XYZ223410.xyz.local | 1207 | Microsoft-Windows-FailoverClustering | Cluster network name resource ‘MSCSXYZ03-RES6’ cannot be brought online. The computer object associated with the resource could not be updated in domain ‘xyz.local’ for the following reason: Unable to connect to the domain using the cluster virtual identity account. The text for the associated error code is: There are currently no logon servers available to service the logon request. The cluster identity ‘MSCSDPO02$’ may lack the permissions required to update the object. Contact your domain administrator to ensure that the cluster identity can update computer objects in the domain.
7/24/2016 | 4:04:19 PM | Error | XYZ223410.xyz.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘MSCSXYZ03-RES6’ in clustered service or application ‘MSCSXYZ03-RES6’ failed.

 

 

 

Application Events:

 

  • Checked the application logs and was not able to find any relevant information for the backup.

 

List of outdated drivers:

 

Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
10/1/2014 15:26 | (5.3:30.1001) | (5.3:30.1001) | QLogic Corporation | QLogic FlexLOM(TM) NDIS Miniport Driver
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
10/23/2013 6:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8
10/16/2012 18:42 | (5.8:56.0) | (5.8:56.0) | Quest Software Corporation | Quest ChangeAuditor for Windows File Servers Driver
10/16/2012 19:06 | (5.8:56.0) | (5.8:56.0) | Quest Software, Inc. | Quest ChangeAuditor Agent Support Driver

 

Cluster Logs:

 

__________________________________________________________________________________

 

 

Conclusion:

 

  • After analyzing the logs, we found that the issue was caused by the File Server resource failing the IsAlive health check on the cluster.

 

  • This issue can occur if the system runs out of TCP ports (port exhaustion), if Windows Server 2008 R2 is missing current hotfixes for the kernel, networking, and cluster modules, or if the NIC drivers are outdated. Event ID 4227 in the system log points to port exhaustion in this case.

 

Plan:

 

a) Registry changes

 

  • Increase the number of TCP ephemeral ports and decrease the TCP TIME_WAIT delay (TcpTimedWaitDelay) by editing the following registry settings.

 

  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
    • Create a DWORD TcpTimedWaitDelay
    • Value: 30 (Dec)
    • Create a DWORD MaxUserPort
    • Value: 65534 (Dec), the maximum this setting accepts

 

The registry changes above require a server reboot to take effect.

Reducing TcpTimedWaitDelay allows TCP to release closed connections faster, so their ports become available for new connections sooner.
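The effect of both settings can be estimated with simple arithmetic. The sketch below is a rough model, not a measurement: it assumes every closed outbound connection to one remote endpoint keeps its local port in TIME_WAIT for the full delay, so the sustainable connection rate is bounded by the port count divided by the delay. The function name `max_sustained_rate` is hypothetical, and the 120-second delay in the comparison is an illustrative assumption rather than a value taken from these servers.

```python
def max_sustained_rate(port_count, time_wait_seconds):
    """Rough upper bound on new outbound connections per second to one
    remote endpoint before local ports have to be reused too early."""
    if port_count <= 0 or time_wait_seconds <= 0:
        raise ValueError("port_count and time_wait_seconds must be positive")
    return port_count / time_wait_seconds


if __name__ == "__main__":
    scenarios = [
        ("16384 ports, 120 s delay (assumed baseline)", 16384, 120),
        ("16384 ports, 30 s delay", 16384, 30),
        ("32768 ports, 30 s delay", 32768, 30),
    ]
    for label, ports, delay in scenarios:
        rate = max_sustained_rate(ports, delay)
        print(f"{label}: about {rate:.0f} connections/s")
```

In this simplified model, shortening the delay and widening the port range each raise the rate at which a node can open and close connections before event 4227 would be expected.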

 

 

For more information about modifying the ephemeral port range (MaxUserPort) in Windows Server 2008 or later operating systems, see the following KB article. The method it describes is preferred over editing MaxUserPort in the registry.

 

https://support.microsoft.com/en-us/kb/953230

 

Example:

 

netsh int ipv4 set dynamicport tcp start=32768 num=32768

(Default dynamic port range: start=49152, num=16384)
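Before running the command on a node, it can be worth checking that the chosen range is valid. The sketch below builds the netsh command string in Python and rejects out-of-range values. The limits it enforces (starting port of at least 1025, at least 255 ports, end port no higher than 65535) reflect my understanding of the documented constraints for the dynamic port range and should be confirmed against the Microsoft documentation. The helper name `dynamic_port_command` is hypothetical, and the script only prints the command; it does not execute it.

```python
MIN_START_PORT = 1025   # assumed lowest allowed starting port
MIN_PORT_COUNT = 255    # assumed smallest allowed range size
MAX_END_PORT = 65535    # highest valid TCP/UDP port


def dynamic_port_command(start, num, protocol="tcp", family="ipv4"):
    """Return the netsh command that sets the dynamic port range,
    raising ValueError if the range is not valid."""
    if protocol not in ("tcp", "udp"):
        raise ValueError("protocol must be 'tcp' or 'udp'")
    if family not in ("ipv4", "ipv6"):
        raise ValueError("family must be 'ipv4' or 'ipv6'")
    if start < MIN_START_PORT:
        raise ValueError(f"start must be at least {MIN_START_PORT}")
    if num < MIN_PORT_COUNT:
        raise ValueError(f"num must be at least {MIN_PORT_COUNT}")
    end = start + num - 1
    if end > MAX_END_PORT:
        raise ValueError(f"range ends at {end}, above {MAX_END_PORT}")
    return f"netsh int {family} set dynamicport {protocol} start={start} num={num}"


if __name__ == "__main__":
    # The range from the example above: ports 32768 through 65535.
    print(dynamic_port_command(32768, 32768))
```

The current range on a node can be displayed with `netsh int ipv4 show dynamicport tcp`.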

 

Note: If the warnings/errors are seen again after some time, all ephemeral TCP ports on the cluster node may be exhausted, leading to application errors such as WSAENOBUFS (10055) or ‘Error 53’ with cluster resource failover (the cluster health check fails in NetShareGetInfo with status 53). Look out for outdated TDI kernel filter drivers. For example, the following Trend Micro driver can lead to TCP port exhaustion:

 

Name | Company | Version | Date | Description
TMTDI.SYS | Trend Micro I n | 5.82:0.1024 | Nov 08 2010 | Trend Micro TDI Driver (amd64-fre)

 

b) Relevant hotfixes

 

Please make sure that the following hotfixes are installed on the machine.

 

Network Stack:

 

https://support.microsoft.com/en-us/kb/3156417

 

https://support.microsoft.com/en-us/kb/3080140

 

https://support.microsoft.com/en-us/kb/2831013

 

https://support.microsoft.com/en-us/kb/3021169

 

 

Updates for Cluster Binaries for 2008 R2 : https://support.microsoft.com/en-us/kb/980054 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
