RCA – 6 – Cluster Service Terminating and Resource Failing

Issue Description:

 

The Cluster service terminates and resources fail on the Windows Server 2008 R2 cluster “MSCSDPO02.abc.local” in the “abc.local” domain.

____________________________________________________________________________________

 

System Information: 

 

OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCWFIC09
System Manufacturer HP
System Model ProLiant DL580 G7
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date HP P65, 2013-10-01

 

System Events:

 

  • Checked the System event log and found that the NIC reported an incompatibility between its driver and firmware just before the issue.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description

7/8/2016 | 7:58:35 AM | Error | ABCWFIC09.abc.local | 282 | QLNDNIC | DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter PROBLEM: Incompatibility detected between driver and firmware on flash. ACTION: Please upgrade firmware on flash immediately.

 

  • Checked the events logged just after the NIC driver event and found that the cluster resources went into a degraded state.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description

7/8/2016 | 8:00:19 AM | Error | ABCWFIC09.abc.local | 10016 | Microsoft-Windows-DistributedCOM | The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID {24FF4FDC-1D9F-4195-8C79-0DA39248FF48} and APPID {B292921D-AF50-400C-9B75-0C57A7F29BA1} to the user NT AUTHORITY\SYSTEM SID (S-1-5-18) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool.

7/8/2016 | 2:18:34 PM | Error | ABCWFIC09.abc.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource “DFSR h:\PubliqueDT” in clustered service or application “MSCSFIC03-RES3” failed.

7/8/2016 | 7:58:41 AM | Warning | ABCWFIC09.abc.local | 461 | CPQTeamMP | Team ID: 0 Aggregation ID: 0 Team Member ID: 0 PROBLEM: 802.3ad link aggregation (LACP) has failed. ACTION: Ensure all ports are connected to LACP-aware devices.

7/8/2016 | 8:00:19 AM | Error | ABCWFIC09.abc.local | 10016 | Microsoft-Windows-DistributedCOM | The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID {24FF4FDC-1D9F-4195-8C79-0DA39248FF48} and APPID {B292921D-AF50-400C-9B75-0C57A7F29BA1} to the user NT AUTHORITY\SYSTEM SID (S-1-5-18) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool.

7/8/2016 | 8:01:43 AM | Warning | ABCWFIC09.abc.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource DFSR j:\Cabinet has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]

 

  • Checked the events of 20 June to confirm whether the NIC failure is the cause, and found the same sequence of events.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description

6/20/2016 | 5:14:20 AM | Error | ABCWFIC09.abc.local | 282 | QLNDNIC | DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter PROBLEM: Incompatibility detected between driver and firmware on flash. ACTION: Please upgrade firmware on flash immediately.

6/20/2016 | 5:14:25 AM | Warning | ABCWFIC09.abc.local | 461 | CPQTeamMP | Team ID: 0 Aggregation ID: 0 Team Member ID: 0 PROBLEM: 802.3ad link aggregation (LACP) has failed. ACTION: Ensure all ports are connected to LACP-aware devices.

 

  • Found that the cluster resources failed shortly after the NIC and teaming events.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description

6/20/2016 | 5:15:56 AM | Warning | ABCWFIC09.abc.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource DFSR f:\Publique has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB]

6/20/2016 | 5:15:56 AM | Error | ABCWFIC09.abc.local | 1168 | Foundation Agents | Cluster Agent: The cluster resource MSCSFIC03-RES1 has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB]
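The correlation described above (NIC event 282 first, teaming and cluster events shortly after) can be checked mechanically. The following is a minimal sketch in Python, not part of the original analysis. The event list is transcribed by hand from the System event tables, and the 5-minute window and the helper names `parse` and `followers` are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Events transcribed from the System event tables: (timestamp, event ID, source).
EVENTS = [
    ("6/20/2016 5:14:20 AM", 282, "QLNDNIC"),
    ("6/20/2016 5:14:25 AM", 461, "CPQTeamMP"),
    ("6/20/2016 5:15:56 AM", 1167, "Foundation Agents"),
    ("6/20/2016 5:15:56 AM", 1168, "Foundation Agents"),
    ("7/8/2016 7:58:35 AM", 282, "QLNDNIC"),
    ("7/8/2016 7:58:41 AM", 461, "CPQTeamMP"),
    ("7/8/2016 8:01:43 AM", 1167, "Foundation Agents"),
    ("7/8/2016 2:18:34 PM", 1069, "Microsoft-Windows-FailoverClustering"),
]

TRIGGER_ID = 282                 # QLNDNIC driver/firmware incompatibility
WINDOW = timedelta(minutes=5)    # assumed correlation window, not from the report


def parse(ts):
    """Parse the US-style timestamp used in the tables."""
    return datetime.strptime(ts, "%m/%d/%Y %I:%M:%S %p")


def followers(events, trigger_id, window):
    """Map each trigger event time to the events logged within `window` after it."""
    parsed = sorted((parse(ts), eid, src) for ts, eid, src in events)
    result = {}
    for t0, eid, _ in parsed:
        if eid != trigger_id:
            continue
        result[t0] = [
            (e, src, int((t - t0).total_seconds()))
            for t, e, src in parsed
            if e != trigger_id and t0 < t <= t0 + window
        ]
    return result


for start, later in followers(EVENTS, TRIGGER_ID, WINDOW).items():
    print(start, "->", later)
```

With this window, both occurrences of event 282 are followed by the LACP failure (461) and by a cluster resource warning (1167), which matches the pattern noted in the bullets. The 1069 event at 2:18 PM falls outside the window and would need a separate look.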

 

Application Events:

 

  • Checked the Application log and found events reporting service failures.

 

Date | Time | Type/Level | Computer Name | Event Code | Source | Description

7/8/2016 | 6:40:56 AM | Error | ABCWFIC09.abc.local | 5 | CimNotify | Component: HP Event Notification service. Error: Error sending notification to: ‘Foundation Agents: Cluster Resource Failed’. Cause: The SMTP host is currently offline or the recipient’s e-mail address is invalid. Action: Verify that the SMTP host is online and the recipient’s e-mail address is valid.

7/8/2016 | 6:41:09 AM | Warning | ABCWFIC09.abc.local | 12317 | SRMSVC | File Server Resource Manager failed to enumerate share paths or DFS paths. Mappings from local file paths to share and DFS paths may be incomplete or temporarily unavailable. File Server Resource Manager will retry this operation later. Error details: Error: NetShareEnum, 0x80070035, The network path was not found.

 

List of outdated drivers:

 

Time/Date String | Product Version | File Version | Company Name | File Description

2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver

10/1/2014 15:26 | (5.3:30.1001) | (5.3:30.1001) | QLogic Corporation | QLogic FlexLOM(TM) NDIS Miniport Driver

5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver

11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver

10/23/2013 6:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8

10/16/2012 18:42 | (5.8:56.0) | (5.8:56.0) | Quest Software Corporation | Quest ChangeAuditor for Windows File Servers Driver

10/16/2012 19:06 | (5.8:56.0) | (5.8:56.0) | Quest Software, Inc. | Quest ChangeAuditor Agent Support Driver
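The report does not state the cutoff used to call a driver outdated. As a hedged illustration only, the sketch below ranks the listed drivers by age relative to the 8 July 2016 incident so the oldest ones can be reviewed first. The list is transcribed from the driver table above; the reference date and the helper name `by_age` are assumptions.

```python
from datetime import datetime

# (Time/Date String, File Description) transcribed from the driver list.
DRIVERS = [
    ("2/12/2010 18:33", "HP ProLiant iLO 3 PSHED Plugin Driver"),
    ("10/1/2014 15:26", "QLogic FlexLOM(TM) NDIS Miniport Driver"),
    ("5/22/2013 17:41", "HP ProLiant iLO 3/4 Management Controller Core Driver"),
    ("11/23/2013 21:26", "HP ProLiant iLO 3/4 Channel Interface Driver"),
    ("10/23/2013 6:30", "Network Teaming Intermediate Driver (NTID)"),
    ("10/16/2012 18:42", "Quest ChangeAuditor for Windows File Servers Driver"),
    ("10/16/2012 19:06", "Quest ChangeAuditor Agent Support Driver"),
]

# Assumed reference point: the date of the July incident.
INCIDENT = datetime(2016, 7, 8)


def by_age(drivers, reference):
    """Return (age in days, description) pairs, oldest driver first."""
    rows = [
        ((reference - datetime.strptime(ts, "%m/%d/%Y %H:%M")).days, name)
        for ts, name in drivers
    ]
    return sorted(rows, key=lambda row: row[0], reverse=True)


for age_days, name in by_age(DRIVERS, INCIDENT):
    print(f"{age_days:5d} days  {name}")
```

Age alone does not prove a driver is faulty. In this case the QLogic NDIS driver is the newest file in the list, and event 282 points to a mismatch between that driver and the older firmware on flash.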

 

Cluster Logs:

 

00001168.0000296c::2016/07/11-18:46:41.567 ERR   [RHS] Error 161 from ResourceControl for resource 1_Directions.

00000f7c.00002a54::2016/07/11-18:46:41.567 WARN  [RCM] ResourceControl(STORAGE_IS_PATH_VALID) to 1_Directions returned 161.

00001168.000008c0::2016/07/11-18:46:41.567 INFO  [RES] Physical Disk: Path \\.\M:\System Volume Information\DFSR\Config\Volume_2A2CDD32-6111-43CE-A7C5-F709EF0C0F63.XML is not on the disk

00001168.000008c0::2016/07/11-18:46:41.567 ERR   [RHS] Error 161 from ResourceControl for resource 7_VSS.

00000f7c.00002a54::2016/07/11-18:46:41.582 ERR   [RCM] rcm::RcmResControl::DoResourceControl: ERROR_ALREADY_EXISTS(183)’ because of ‘Key System\CurrentControlSet\Services\DFSR\Parameters\Volumes\2A2CDD32-6111-43CE-A7C5-F709EF0C0F63 is already being checkpointed for resource 1_Archivage. Key name c72a93aa-e627-45f6-a090-f34dbda598ed.’

00000f7c.00002a54::2016/07/11-18:46:41.582 WARN  [RCM] ResourceControl(ADD_REGISTRY_CHECKPOINT) to 1_Archivage returned 183.

00000f7c.00002b14::2016/07/11-18:48:38.431 INFO  [API] s_ApiGetQuorumResource final status 0.
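Cluster log lines such as the ones above follow a regular layout: process ID, thread ID, timestamp, level, component, and message. The sketch below is one way to pull out the ERR and WARN entries and attach the Win32 error name to the numeric code. It assumes only the line layout visible in this excerpt; the regular expressions and the helper name `parse_line` are illustrative, and the lookup table covers only the two codes seen here (161 is ERROR_BAD_PATHNAME, 183 is ERROR_ALREADY_EXISTS).

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>\w+)\]\s+(?P<msg>.*)$"
)
CODE_RE = re.compile(r"(?:Error|returned) (\d+)")

# Win32 system error codes that appear in this excerpt.
WIN32 = {161: "ERROR_BAD_PATHNAME", 183: "ERROR_ALREADY_EXISTS"}


def parse_line(line):
    """Parse one cluster log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rec = match.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    code = CODE_RE.search(rec["msg"])
    rec["code"] = int(code.group(1)) if code else None
    rec["code_name"] = WIN32.get(rec["code"])
    return rec


SAMPLE = [
    "00001168.0000296c::2016/07/11-18:46:41.567 ERR   [RHS] Error 161 from ResourceControl for resource 1_Directions.",
    "00000f7c.00002a54::2016/07/11-18:46:41.582 WARN  [RCM] ResourceControl(ADD_REGISTRY_CHECKPOINT) to 1_Archivage returned 183.",
    "00000f7c.00002b14::2016/07/11-18:48:38.431 INFO  [API] s_ApiGetQuorumResource final status 0.",
]

for raw in SAMPLE:
    rec = parse_line(raw)
    if rec and rec["level"] in ("ERR", "WARN"):
        print(rec["ts"], rec["component"], rec["code"], rec["code_name"])
```

Read this way, the excerpt shows a path validation failure (161) on the DFSR configuration path and a duplicate registry checkpoint (183), both of which are consequences on the resource side and not network events in themselves.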

 

___________________________________________________________________________________

 

 

 

Conclusion:

 

  • After analyzing the logs, we found that the issue starts with QLNDNIC event 282, which reports an incompatibility between the HP NC375i driver and its firmware. After this event, the cluster resources go into a failed state.

 

  • Update the HP NC375i firmware and driver to the latest version. When the network is not available on the node, the resources fail and eventually move to a different node.

 

  • Uninstall the McAfee antivirus if possible. Otherwise, add the following system locations to the antivirus exclusion list:

• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable to Cluster 2003)

• The %Systemroot%\Cluster folder. (Applicable to Cluster 2003, 2008, and 2008 R2)

• The temp folder for the Cluster service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable to Cluster 2003)

 

  • Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all the listed updates are installed on every node:

 

https://support.microsoft.com/en-us/kb/980054 : Updates for Cluster Binaries for 2008 R2

 

 

  • Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: this step is the most critical one when dealing with network connectivity issues.

           Investigation of network issues:

To avoid this issue in future, the recurring connectivity problem on the cluster networks must be diagnosed with the in-house networking team:

• Check the network adapters, cables, and network configuration for the networks that connect the nodes.

• Check the hubs, switches, or bridges in the networks that connect the nodes.

• Check for switch delays and proxy ARP.
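As a starting point for the latency and packet-drop investigation, a simple probe run from one node against another can give the networking team numbers to work with. The sketch below is an assumption-laden illustration and no replacement for switch-side diagnostics: it times TCP connections instead of cluster heartbeats, and the host name and port in the usage comment are hypothetical placeholders.

```python
import socket
import statistics
import time


def probe(host, port, timeout=1.0):
    """Return the TCP connect time in milliseconds, or None on failure or timeout."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections, and name resolution errors.
        return None


def summarize(samples):
    """Summarize probe results: loss percentage plus average and worst latency."""
    ok = [s for s in samples if s is not None]
    lost = len(samples) - len(ok)
    return {
        "sent": len(samples),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        "avg_ms": statistics.mean(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


# Example usage (hypothetical peer node name; port 445/SMB is an assumption):
#   samples = [probe("othernode.abc.local", 445) for _ in range(20)]
#   print(summarize(samples))
```

Running the probe over each cluster network at regular intervals, and comparing the loss and worst-case latency with the times of events 282 and 461, would show whether the outages are limited to the teamed NC375i ports.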

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
