Issue Description:
The Cluster service is terminating and resources are failing on the Windows Server 2008 R2 cluster “MSCSDPO02.abc.local” in the “abc.local” domain.
____________________________________________________________________________________
System Information:
OS Name Microsoft Windows Server 2008 R2 Enterprise
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCWFIC09
System Manufacturer HP
System Model ProLiant DL580 G7
System Type x64-based PC
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X7542 @ 2.67GHz, 2666 MHz, 6 Core(s), 6 Logical Processor(s)
BIOS Version/Date HP P65, 2013-10-01
System Events:
- Checked the System event log and found that the NIC reported a driver/firmware incompatibility just before the issue occurred.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/8/2016 | 7:58:35 AM | Error | ABCWFIC09.abc.local | 282 | QLNDNIC | DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter PROBLEM: Incompatibility detected between driver and firmware on flash. ACTION: Please upgrade firmware on flash immediately. |
- Checked the events logged just after the NIC driver event and found that the cluster resources went into a degraded state.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/8/2016 | 8:00:19 AM | Error | ABCWFIC09.abc.local | 10016 | Microsoft-Windows-DistributedCOM | The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID {24FF4FDC-1D9F-4195-8C79-0DA39248FF48} and APPID {B292921D-AF50-400C-9B75-0C57A7F29BA1} to the user NT AUTHORITY\SYSTEM SID (S-1-5-18) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool. |
7/8/2016 | 2:18:34 PM | Error | ABCWFIC09.abc.local | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource “DFSR h:\PubliqueDT” in clustered service or application “MSCSFIC03-RES3” failed. |
7/8/2016 | 7:58:41 AM | Warning | ABCWFIC09.abc.local | 461 | CPQTeamMP | Team ID: 0 Aggregation ID: 0 Team Member ID: 0 PROBLEM: 802.3ad link aggregation (LACP) has failed. ACTION: Ensure all ports are connected to LACP-aware devices. |
7/8/2016 | 8:00:19 AM | Error | ABCWFIC09.abc.local | 10016 | Microsoft-Windows-DistributedCOM | The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID {24FF4FDC-1D9F-4195-8C79-0DA39248FF48} and APPID {B292921D-AF50-400C-9B75-0C57A7F29BA1} to the user NT AUTHORITY\SYSTEM SID (S-1-5-18) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool. |
7/8/2016 | 8:01:43 AM | Warning | ABCWFIC09.abc.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource DFSR j:\Cabinet has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
- Checked the events of June 20, 2016 to confirm whether the issue is tied to the NIC failure and found the same pattern.
6/20/2016 | 5:14:20 AM | Error | ABCWFIC09.abc.local | 282 | QLNDNIC | DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter PROBLEM: Incompatibility detected between driver and firmware on flash. ACTION: Please upgrade firmware on flash immediately. |
6/20/2016 | 5:14:25 AM | Warning | ABCWFIC09.abc.local | 461 | CPQTeamMP | Team ID: 0 Aggregation ID: 0 Team Member ID: 0 PROBLEM: 802.3ad link aggregation (LACP) has failed. ACTION: Ensure all ports are connected to LACP-aware devices. |
- Found that the cluster resources failed shortly after the NIC and teaming events.
6/20/2016 | 5:15:56 AM | Warning | ABCWFIC09.abc.local | 1167 | Foundation Agents | Cluster Agent: The cluster resource DFSR f:\Publique has become degraded. [SNMP TRAP: 15005 in CPQCLUS.MIB] |
6/20/2016 | 5:15:56 AM | Error | ABCWFIC09.abc.local | 1168 | Foundation Agents | Cluster Agent: The cluster resource MSCSFIC03-RES1 has failed. [SNMP TRAP: 15006 in CPQCLUS.MIB] |
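The pattern described above (a NIC or teaming event followed within minutes by a cluster resource event) can be checked mechanically. The following is a minimal sketch, assuming the events are exported as pipe-delimited rows in the same column order as the tables above. The function names, the grouping of sources into "NIC" and "cluster" sets, and the five-minute window are illustrative assumptions, and the sample descriptions are abbreviated from the June 20 rows.

```python
from datetime import datetime, timedelta

# Event sources taken from the rows above; splitting them into these two
# groups is an assumption made for this sketch.
NIC_SOURCES = {"QLNDNIC", "CPQTeamMP"}
CLUSTER_SOURCES = {"Foundation Agents", "Microsoft-Windows-FailoverClustering"}


def parse_row(line):
    """Parse one 'Date | Time | Level | Computer | Code | Source | Description |' row."""
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    date, time_, level, computer, code, source = fields[:6]
    stamp = datetime.strptime(f"{date} {time_}", "%m/%d/%Y %I:%M:%S %p")
    return {
        "time": stamp,
        "level": level,
        "computer": computer,
        "code": int(code),
        "source": source,
        "description": " | ".join(fields[6:]),
    }


def cluster_events_after_nic(rows, window=timedelta(minutes=5)):
    """Pair each NIC event with the cluster events logged within `window` after it."""
    events = sorted((parse_row(r) for r in rows), key=lambda e: e["time"])
    pairs = []
    for nic in (e for e in events if e["source"] in NIC_SOURCES):
        for other in events:
            delay = other["time"] - nic["time"]
            if other["source"] in CLUSTER_SOURCES and timedelta(0) <= delay <= window:
                pairs.append((nic, other))
    return pairs


# Rows from June 20, 2016, with descriptions shortened for readability.
SAMPLE_ROWS = [
    "6/20/2016 | 5:14:20 AM | Error | ABCWFIC09.abc.local | 282 | QLNDNIC | Incompatibility detected between driver and firmware on flash. |",
    "6/20/2016 | 5:14:25 AM | Warning | ABCWFIC09.abc.local | 461 | CPQTeamMP | 802.3ad link aggregation (LACP) has failed. |",
    "6/20/2016 | 5:15:56 AM | Error | ABCWFIC09.abc.local | 1168 | Foundation Agents | Cluster Agent: The cluster resource MSCSFIC03-RES1 has failed. |",
]

if __name__ == "__main__":
    for nic, other in cluster_events_after_nic(SAMPLE_ROWS):
        delay = (other["time"] - nic["time"]).total_seconds()
        print(f"{nic['source']} {nic['code']} -> {other['source']} {other['code']} after {delay:.0f}s")
```

Running the same check over the July 8 rows would show whether every cluster degradation is preceded by event 282 or 461. A close time correlation supports the NIC hypothesis but does not prove it on its own.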
Application Events:
- Checked the Application event log and found events reporting service failures.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/8/2016 | 6:40:56 AM | Error | ABCWFIC09.abc.local | 5 | CimNotify | Component: HP Event Notification service. Error: Error sending notification to: ‘Foundation Agents: Cluster Resource Failed’. Cause: The SMTP host is currently offline or the recipient’s e-mail address is invalid. Action: Verify that the SMTP host is online and the recipient’s e-mail address is valid. |
7/8/2016 | 6:41:09 AM | Warning | ABCWFIC09.abc.local | 12317 | SRMSVC | File Server Resource Manager failed to enumerate share paths or DFS paths. Mappings from local file paths to share and DFS paths may be incomplete or temporarily unavailable. File Server Resource Manager will retry this operation later. Error details: Error: NetShareEnum, 0x80070035, The network path was not found. |
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description |
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
10/1/2014 15:26 | (5.3:30.1001) | (5.3:30.1001) | QLogic Corporation | QLogic FlexLOM(TM) NDIS Miniport Driver |
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |
10/23/2013 6:30 | (10.90:0.0) | (10.90:0.0) | Hewlett-Packard Company | Network Teaming Intermediate Driver (NTID), 10.90.00.10, NDIS 6.0, x64, (free) on Win2K8 |
10/16/2012 18:42 | (5.8:56.0) | (5.8:56.0) | Quest Software Corporation | Quest ChangeAuditor for Windows File Servers Driver |
10/16/2012 19:06 | (5.8:56.0) | (5.8:56.0) | Quest Software, Inc. | Quest ChangeAuditor Agent Support Driver |
Cluster Logs:
00001168.0000296c::2016/07/11-18:46:41.567 ERR [RHS] Error 161 from ResourceControl for resource 1_Directions.
00000f7c.00002a54::2016/07/11-18:46:41.567 WARN [RCM] ResourceControl(STORAGE_IS_PATH_VALID) to 1_Directions returned 161.
00001168.000008c0::2016/07/11-18:46:41.567 INFO [RES] Physical Disk: Path \\.\M:\System Volume Information\DFSR\Config\Volume_2A2CDD32-6111-43CE-A7C5-F709EF0C0F63.XML is not on the disk
00001168.000008c0::2016/07/11-18:46:41.567 ERR [RHS] Error 161 from ResourceControl for resource 7_VSS.
00000f7c.00002a54::2016/07/11-18:46:41.582 ERR [RCM] rcm::RcmResControl::DoResourceControl: ERROR_ALREADY_EXISTS(183)’ because of ‘Key System\CurrentControlSet\Services\DFSR\Parameters\Volumes\2A2CDD32-6111-43CE-A7C5-F709EF0C0F63 is already being checkpointed for resource 1_Archivage. Key name c72a93aa-e627-45f6-a090-f34dbda598ed.’
00000f7c.00002a54::2016/07/11-18:46:41.582 WARN [RCM] ResourceControl(ADD_REGISTRY_CHECKPOINT) to 1_Archivage returned 183.
00000f7c.00002b14::2016/07/11-18:48:38.431 INFO [API] s_ApiGetQuorumResource final status 0.
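To triage a longer cluster log, the lines can be split into fields and the ERR and WARN entries counted per component. The following is a minimal sketch, assuming every line follows the layout visible in the excerpt above (process ID, thread ID, timestamp, level, bracketed component, message). The regular expression and the function names are illustrative and not part of any Microsoft tool.

```python
import re
from collections import Counter
from datetime import datetime

# Layout inferred from the excerpt: PID.TID::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [COMPONENT] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_cluster_log_line(line):
    """Return the fields of one cluster log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


def summarize(lines, levels=("ERR", "WARN")):
    """Count entries per (level, component) for the requested levels."""
    counts = Counter()
    for line in lines:
        entry = parse_cluster_log_line(line)
        if entry is not None and entry["level"] in levels:
            counts[(entry["level"], entry["component"])] += 1
    return counts


# Three lines copied from the excerpt above.
SAMPLE = [
    "00001168.0000296c::2016/07/11-18:46:41.567 ERR [RHS] Error 161 from ResourceControl for resource 1_Directions.",
    "00000f7c.00002a54::2016/07/11-18:46:41.567 WARN [RCM] ResourceControl(STORAGE_IS_PATH_VALID) to 1_Directions returned 161.",
    "00000f7c.00002b14::2016/07/11-18:48:38.431 INFO [API] s_ApiGetQuorumResource final status 0.",
]

if __name__ == "__main__":
    for (level, component), count in sorted(summarize(SAMPLE).items()):
        print(f"{level} [{component}]: {count}")
```

Grouping by component makes it easier to see whether the errors cluster around the resource hosting subsystem (RHS) and resource control manager (RCM), as they do in the excerpt, or around networking components.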
___________________________________________________________________________________
Conclusion:
- After analyzing the logs, we found that the issue starts when the HP NC375i adapter logs the driver/firmware incompatibility event (event 282, source QLNDNIC). Shortly afterward, the cluster resources go into a failed state.
- We need to update the HP NC375i firmware and driver to the latest versions. If the network is not available on the system, the resources fail and eventually move to a different node.
- Uninstall the McAfee antivirus if possible. Otherwise, add the following system locations to the antivirus exclusion list:
• The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable to Cluster 2003)
• The %Systemroot%\Cluster folder. (Applicable to Cluster 2003, 2008 & 2008 R2)
• The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable to Cluster 2003)
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:
https://support.microsoft.com/en-us/kb/980054 : Updates for Cluster Binaries for 2008 R2
- Investigate the network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: This step is the most critical when dealing with network connectivity issues.
Investigation of Network Issues:
We need to investigate the network connectivity issues with the help of the in-house networking team.
To avoid this issue in the future, the most critical part is to diagnose and investigate the recurring network connectivity issue on the cluster networks.
We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.
We also need to check the hubs, switches, or bridges in the networks that connect the nodes.
We need to check for switch delays and proxy ARP with the help of the in-house networking team.
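As a starting point for the latency and packet-drop investigation, a simple probe can be left running between the nodes to record connection times and failures. The following is a minimal sketch and not a replacement for the networking team's tools. The function names are illustrative, and the target port 3343 (commonly used by the Cluster service) is an assumption that should be confirmed for this environment.

```python
import socket
import time


def probe_tcp(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None if the attempt fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def summarize_probes(samples):
    """Summarize a list of probe results, where None marks a failed attempt."""
    ok = [s for s in samples if s is not None]
    return {
        "sent": len(samples),
        "failed": len(samples) - len(ok),
        "max_ms": max(ok) if ok else None,
        "avg_ms": sum(ok) / len(ok) if ok else None,
    }


def probe_node(host, port, count=10, interval=1.0):
    """Probe one node `count` times, waiting `interval` seconds between attempts."""
    samples = []
    for _ in range(count):
        samples.append(probe_tcp(host, port))
        time.sleep(interval)
    return summarize_probes(samples)


if __name__ == "__main__":
    # Host name taken from the report; the port is an assumption (see above).
    print(probe_node("ABCWFIC09.abc.local", 3343, count=5))
```

Running the probe from each node toward the others around the time the NIC events occur would show whether failed or slow connections coincide with events 282 and 461.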