Issue Description:
Cluster node terminated with bugcheck 9E while replacing the bad controller on the SAN.
Initial Description:
Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters
A Windows Server 2008 Failover Cluster is capable of providing high availability services using a variety of resources, some of which are included as part of the Failover Cluster feature and others as part of ‘cluster-aware’ applications like SQL and Exchange. Resources are designed to work together and are typically organized in Resource Groups (Figure 1). For example, a group of resources supporting a highly available File Server may consist of one or more of the following types of resources: Client Access Point (IP Address(es) + Network Name resource), Physical Disk (Storage), and a File Server. A highly available SQL instance could contain the following resources: Client Access Point (IP Address + Network Name resource), Physical Disk (Storage), SQL Server, and SQL Server Agent. Cluster resources are supported by special ‘plugins’ or resource Dynamic-Link Libraries (DLLs) that include code to allow them to properly integrate and interoperate with the cluster service.
A Windows Server 2008 Failover Cluster is capable of hosting an unlimited number of resources. The management of these resources is the responsibility of the Resource Control Manager (RCM) and the Resource Hosting Subsystem (RHS), which provide this functionality as part of the Cluster Service itself (Figure 2).
The Resource Control Manager (RCM) is part of the overall cluster architecture and is responsible for implementing failover mechanisms and policies for the cluster service as well as establishing and maintaining the dependency tree (Figure 3) for each resource (e.g. a File Server resource requires a dependency on a Client Access Point and a Storage resource).
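As a rough illustration of the dependency tree idea, the sketch below orders the File Server example so that a resource is listed only after the resources it depends on. The dependency map and the `online_order` helper are hypothetical names used for illustration; they are not part of any cluster API, and this is not a description of how RCM is actually implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map based on the File Server example in the text:
# each key depends on every resource in its value set.
DEPENDENCIES = {
    "File Server": {"Client Access Point", "Physical Disk"},
    "Client Access Point": set(),
    "Physical Disk": set(),
}


def online_order(dependencies):
    """Return an order in which every resource follows all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = online_order(DEPENDENCIES)
print(order)
```

In this sketch the File Server always comes last, because both of its dependencies have to be listed before it.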
The Resource Control Manager maintains the state for individual resources (Online, Offline, Failed, Online Pending, and Offline Pending) as well as for Resource Groups (Online, Offline, Partial Online, and Failed).
For more information, please refer to: https://blogs.technet.microsoft.com/askcore/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters/
System Information: ASDCLOUD-H1
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ASDCLOUD-H1
System Manufacturer HP
System Model ProLiant DL380p Gen8
System Type x64-based PC
System SKU 697494-S01
Processor Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P70, 8/20/2012
System Events:
- Checked the logs and found that the server lost its connection to the target. This might be due to the troubleshooting that was in progress on the San4_1 CSV.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:08:03 PM | Error | ASDCLOUD-H1 | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to
5/5/2016 | 5:08:03 PM | Error | ASDCLOUD-H1 | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given
- Checked the logs and found that VM KAF-CHD1 went into the Not Responding state, after which the RHS process crashed.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:16:44 PM | Error | ASDCLOUD-H1 | 9 | iScsiPrt | Target did not respond in time for a SCSI request. The CDB is
5/5/2016 | 5:17:06 PM | Error | ASDCLOUD-H1 | 1230 | Microsoft-Windows-FailoverClustering | A component on the server did not respond in a timely fashion.
5/5/2016 | 5:17:08 PM | Critical | ASDCLOUD-H1 | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was
5/5/2016 | 5:17:08 PM | Error | ASDCLOUD-H1 | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Hyper-V Replica Broker ASD_Cloud_Rep’ of type
Application Events:
- Checked the Application logs but was not able to find any event at the time of issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
4/5/2013 14:34 | (6.2:9200.16384) | (12.7:28.0) | Intel Corporation | Intel(R) Gigabit Adapter NDIS 6.x driver
6/26/2012 13:55 | (3.7:0.0) | (3.7:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
6/29/2012 14:26 | (9.15:1.45) | (9.15:1.45) | Matrox Graphics Inc. | MxG2hDO64.sys
__________________________________________________________________________________________________________________
System Information: ASDCLOUD-H2
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ASDCLOUD-H2
System Manufacturer HP
System Model ProLiant DL380p Gen8
System Type x64-based PC
System SKU 697494-S01
Processor Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P70, 3/1/2013
System Events:
- Checked the System logs and found that the cluster node ASDCLOUD-H3 was removed from the active failover cluster membership.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:34:45 PM | Error | ASDCLOUD-H2 | 39 | iScsiPrt | Initiator sent a task management command to reset the target. The
5/5/2016 | 5:34:45 PM | Error | ASDCLOUD-H2 | 9 | iScsiPrt | Target did not respond in time for a SCSI request. The CDB is
5/5/2016 | 5:35:05 PM | Critical | ASDCLOUD-H2 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ASDCLOUD-H3’ was removed from the active failover
Application Events:
- Checked and found that the VSS service was also running at the time of the issue, which indicates that a backup job was running.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 6:05:24 PM | Information | ASDCLOUD-H2 | 5605 | Microsoft-Windows-WMI | The root\mscluster namespace is marked with the RequiresEncryption
5/5/2016 | 6:08:21 PM | Information | ASDCLOUD-H2 | 8224 | VSS | The VSS service is shutting down due to idle timeout.
Failover Cluster Events:
- Checked and found that the VM ALIST-XENDC was offline and came back online around 6:17 PM. Because the quorum witness had been added as a dependency on the clustered VM ALIST-XENDC, which was residing on ASDCLOUD-H3, the witness must also have gone offline.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 6:17:54 PM | Information | ASDCLOUD-H2 | 1637 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SCVMM ALIST-XENDC Configuration’ in clustered
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
6/26/2012 13:55 | (3.7:0.0) | (3.7:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
6/29/2012 14:26 | (9.15:1.45) | (9.15:1.45) | Matrox Graphics Inc. | MxG2hDO64.sys
6/26/2012 13:55 | (3.7:0.0) | (3.7:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
__________________________________________________________________________________________________________________
System Information: ASDCLOUD-H3
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ASDCLOUD-H3
System Manufacturer HP
System Model ProLiant DL380p Gen8
System Type x64-based PC
System SKU 670854-S01
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2494 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2494 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P70, 8/2/2014
System Events:
- Checked and found that the node lost its connection to San4_1.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:09:54 PM | Error | ASDCLOUD-H3 | 1038 | Microsoft-Windows-FailoverClustering | Ownership of cluster disk ‘San4_1’ has been unexpectedly lost by
5/5/2016 | 5:09:54 PM | Error | ASDCLOUD-H3 | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘San4_1’ of type ‘Physical Disk’ in clustered
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:39:19 PM | Error | ASDCLOUD-H3 | 1001 | Microsoft-Windows-WER-SystemErrorReporting | The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e
5/5/2016 | 5:38:52 PM | Critical | ASDCLOUD-H3 | 41 | Microsoft-Windows-Kernel-Power | The system has rebooted without cleanly shutting down first. This
5/5/2016 | 5:39:16 PM | Error | ASDCLOUD-H3 | 6008 | EventLog | The previous system shutdown at 5:34:43 PM on 5/5/2016 was
Application Events:
- Checked the Application logs but was not able to find any event at the time of issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
10/28/2013 11:03 | (6.2:9200.16384) | (62.28:0.64) | Hewlett-Packard Company | Smart Array SAS/SATA Controller Storport Driver
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
8/6/2013 17:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
__________________________________________________________________________________________________________________
System Information: ASDCLOUD-H4
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ASDCLOUD-H4
System Manufacturer HP
System Model ProLiant DL380 Gen9
System Type x64-based PC
System SKU 777337-S01
Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P89, 7/20/2015
System Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:34:53 PM | Error | ASDCLOUD-H4 | 39 | iScsiPrt | Initiator sent a task management command to reset the target. The
5/5/2016 | 5:34:53 PM | Error | ASDCLOUD-H4 | 9 | iScsiPrt | Target did not respond in time for a SCSI request. The CDB is
5/5/2016 | 5:35:05 PM | Critical | ASDCLOUD-H4 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ASDCLOUD-H3’ was removed from the active failover
5/5/2016 | 5:35:10 PM | Information | ASDCLOUD-H4 | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Backups01’ (‘Backups01’) is no longer
Application Events:
- Checked the Application logs but was not able to find any event at the time of issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
9/12/2014 0:25 | (16.8:0.4) | (16.8:0.4) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
8/6/2013 17:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
__________________________________________________________________________________________________________________
System Information: KST-HSR-HOST1
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name KST-HSR-HOST1
System Manufacturer HP
System Model ProLiant DL380 G7
System Type x64-based PC
System SKU 605876-005
Processor Intel(R) Xeon(R) CPU E5640 @ 2.67GHz, 2666 Mhz, 4 Core(s), 8 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5640 @ 2.67GHz, 2666 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date HP P67, 5/5/2011
System Events:
- Checked the System logs and found a lot of errors related to the Foundation Agents.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
5/5/2016 | 5:35:01 PM | Warning | KST-HSR-HOST1 | 3072 | Foundation Agents | Component: Software Version Agent
5/5/2016 | 5:35:05 PM | Critical | KST-HSR-HOST1 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ASDCLOUD-H3’ was removed from the active failover
Application Events:
- Checked the Application logs but was not able to find any event at the time of issue.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver
9/12/2014 0:25 | (16.8:0.4) | (16.8:0.4) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver.
5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver
8/6/2013 17:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys
11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver
__________________________________________________________________________________________________________________
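Because the events above are listed per node, it can help to merge them into a single chronological view before drawing conclusions. The following is a minimal sketch of that step. The tuples are a subset of the events transcribed from the tables above, and the `parse_time` and `build_timeline` helpers are hypothetical names introduced only for this illustration.

```python
from datetime import datetime

# (node, date and time, event ID, source), a subset transcribed from the
# per-node tables in this report, kept in report order (grouped by node).
EVENTS = [
    ("ASDCLOUD-H1", "5/5/2016 5:08:03 PM", 20, "iScsiPrt"),
    ("ASDCLOUD-H1", "5/5/2016 5:17:08 PM", 1146, "Microsoft-Windows-FailoverClustering"),
    ("ASDCLOUD-H2", "5/5/2016 5:34:45 PM", 9, "iScsiPrt"),
    ("ASDCLOUD-H2", "5/5/2016 5:35:05 PM", 1135, "Microsoft-Windows-FailoverClustering"),
    ("ASDCLOUD-H2", "5/5/2016 6:17:54 PM", 1637, "Microsoft-Windows-FailoverClustering"),
    ("ASDCLOUD-H3", "5/5/2016 5:09:54 PM", 1038, "Microsoft-Windows-FailoverClustering"),
    ("ASDCLOUD-H3", "5/5/2016 5:39:19 PM", 1001, "Microsoft-Windows-WER-SystemErrorReporting"),
    ("ASDCLOUD-H3", "5/5/2016 5:39:16 PM", 6008, "EventLog"),
    ("ASDCLOUD-H4", "5/5/2016 5:35:10 PM", 5121, "Microsoft-Windows-FailoverClustering"),
]


def parse_time(stamp):
    """Parse the 'M/D/YYYY H:MM:SS AM/PM' timestamps used in the tables."""
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")


def build_timeline(events):
    """Return the events sorted by time; equal timestamps keep input order."""
    return sorted(events, key=lambda event: parse_time(event[1]))


timeline = build_timeline(EVENTS)
for node, stamp, event_id, source in timeline:
    print(f"{stamp:<20} {node:<14} {event_id:>5} {source}")
```

Read in this order, the iSCSI connection loss on ASDCLOUD-H1 comes first, followed by the loss of San4_1 ownership on ASDCLOUD-H3, and the removal of ASDCLOUD-H3 from the cluster comes later.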
Conclusion:
- After analyzing the logs, we can see that the issue started when the bad controller on the SAN was being replaced. The cluster nodes began logging iScsiPrt events because they were not able to communicate with the SAN, after which, around 5:09 PM, the Cluster Shared Volume San4_1 went offline and the cluster was not able to access it.
- This CSV was hosted on ASDCLOUD-H3, which some time later failed with bugcheck 9E (USER_MODE_HEALTH_MONITOR): the Cluster Service determined that it was not able to host the CSV, and the node was eventually crashed so that the resources could be failed over to another node.
- This is normal cluster behavior. However, as per the cluster configuration at the time of the issue, the cluster could sustain the failure of only one node and the witness.
====================================================================================================================================
This quorum model will be able to sustain failures of 2 node(s) with the disk witness online and 1 node(s) when the disk witness goes offline or fails.
This quorum configuration can be changed using the Configure Cluster Quorum wizard. This wizard can be started from the Failover Cluster Manager console by selecting the cluster name in the left hand pane, then in the right “actions” pane selecting “More Actions…” and then selecting “Configure Cluster Quorum Settings…”.
====================================================================================================================================
- During the remote session, I found that the cluster witness was not configured as per the recommendations. The cluster quorum resource had been added as a dependency on the VM “ALIST-XENDC”, which was initially residing on ASDCLOUD-H3. As a result, the witness was also residing on the same node that crashed.
- When the cluster node crashed, the quorum witness also went into the failed state. The cluster was not able to sustain two simultaneous failures and went down.
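The figures in the quorum summary quoted in this conclusion follow from simple majority arithmetic, which can be sketched as below. This is only an illustration under stated assumptions: the `tolerated_node_failures` helper is hypothetical, the count of four voting nodes is chosen solely because it reproduces the quoted numbers, and the sketch ignores the dynamic quorum adjustments that Windows Server 2012 R2 makes at runtime. It is not a model of what the Cluster Service computed during the incident.

```python
def tolerated_node_failures(node_votes, witness_online):
    """Number of nodes that can fail while a strict majority of votes remains.

    Assumes a static majority quorum in which every node has one vote and
    the witness, when online, contributes one vote and stays online.
    """
    total_votes = node_votes + (1 if witness_online else 0)
    majority = total_votes // 2 + 1  # strict majority of all configured votes
    return total_votes - majority


# Assumption: four voting nodes, chosen because it matches the quoted summary.
with_witness = tolerated_node_failures(4, witness_online=True)
without_witness = tolerated_node_failures(4, witness_online=False)
print(with_witness, without_witness)
```

Under these assumptions the function returns 2 with the witness online and 1 without it, which matches the wording of the quoted quorum summary.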
Recommendations:
- Take the Cluster Shared Volume offline before making any changes on the SAN side, so that the Cluster Service does not try to bring it online.
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot will be required for the changes to take effect. Follow the article and make sure all these updates are installed on all the nodes:
https://support.microsoft.com/en-us/kb/2920151