RCA – 2 – RHS Terminated with Bugcheck 9E.

Issue Description:

 

A cluster node terminated with Bugcheck 9E while a bad controller on the SAN was being replaced.

 

Initial Description:

Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters

A Windows Server 2008 Failover Cluster is capable of providing high availability services using a variety of resources, some of which are included as part of the Failover Cluster feature and others of which are provided by ‘cluster-aware’ applications like SQL and Exchange. Resources are designed to work together and are typically organized in Resource Groups (Figure 1).  For example, a group of resources supporting a highly available File Server may consist of one or more of the following types of resources: Client Access Point (IP Address(es) + Network Name resource), Physical Disk (Storage), and a File Server.  A highly available SQL Instance could contain the following resources: Client Access Point (IP Address + Network Name resource), Physical Disk (Storage), SQL Server and SQL Server Agent.  Cluster resources are supported by special ‘plugins’, or resource Dynamic Link Libraries (DLLs), that include coding to allow them to properly integrate and interoperate with the cluster service.

A Windows Server 2008 Failover Cluster is capable of hosting an unlimited number of resources.  The management of these resources is the responsibility of the Resource Control Manager (RCM) and the Resource Hosting Subsystem (RHS), which provide this functionality as part of the Cluster Service itself (Figure 2).

The Resource Control Manager (RCM) is part of the overall cluster architecture and is responsible for implementing failover mechanisms and policies for the cluster service as well as establishing and maintaining the dependency tree (Figure 3) for each resource (e.g. a File Server resource requires a dependency on a Client Access Point and a Storage resource).

The Resource Control Manager maintains the state for individual resources (Online, Offline, Failed, Online Pending, and Offline Pending) as well as for Resource Groups (Online, Offline, Partial Online, and Failed). 

 

For more information, please refer to: https://blogs.technet.microsoft.com/askcore/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters/

 

System Information: ASDCLOUD-H1

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        ASDCLOUD-H1
System Manufacturer        HP
System Model        ProLiant DL380p Gen8
System Type        x64-based PC
System SKU        697494-S01
Processor        Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor        Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP P70, 8/20/2012
 

 

System Events:

 

  • Checked the logs and found that the server lost its connection to the iSCSI target. This might be due to the troubleshooting that was happening on the San4_1 CSV.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:08:03 PM | Error | ASDCLOUD-H1 | 20 | iScsiPrt | Connection to the target was lost. The initiator will attempt to retry the connection. |
| 5/5/2016 | 5:08:03 PM | Error | ASDCLOUD-H1 | 7 | iScsiPrt | The initiator could not send an iSCSI PDU. Error status is given in the dump data. |

 

  • Checked the logs and found that VM KAF-CHD1 went into the Not Responding state, after which the RHS process was terminated.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:16:44 PM | Error | ASDCLOUD-H1 | 9 | iScsiPrt | Target did not respond in time for a SCSI request. The CDB is given in the dump data. |
| 5/5/2016 | 5:17:06 PM | Error | ASDCLOUD-H1 | 1230 | Microsoft-Windows-FailoverClustering | A component on the server did not respond in a timely fashion. This caused the cluster resource ‘Virtual Machine KAF-CHD1’ (resource type ‘Virtual Machine’, DLL ‘vmclusres.dll’) to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly. |
| 5/5/2016 | 5:17:08 PM | Critical | ASDCLOUD-H1 | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
| 5/5/2016 | 5:17:08 PM | Error | ASDCLOUD-H1 | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Hyper-V Replica Broker ASD_Cloud_Rep’ of type ‘Virtual Machine Replication Broker’ in clustered role ‘ASD_Cloud_Rep’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |

 

Application Events:

 

  •  Checked the Application logs but was not able to find any event at the time of issue.

 

 

List of outdated drivers:

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 4/5/2013 14:34 | (6.2:9200.16384) | (12.7:28.0) | Intel Corporation | Intel(R) Gigabit Adapter NDIS 6.x driver |
| 6/26/2012 13:55 | (3.7:0.0) | (3.7:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
| 6/29/2012 14:26 | (9.15:1.45) | (9.15:1.45) | Matrox Graphics Inc. | MxG2hDO64.sys |

 

__________________________________________________________________________________________________________________

 

 

 

System Information: ASDCLOUD-H2

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        ASDCLOUD-H2
System Manufacturer        HP
System Model        ProLiant DL380p Gen8
System Type        x64-based PC
System SKU        697494-S01
Processor        Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor        Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz, 1995 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP P70, 3/1/2013
 

 

 

 

System Events:

 

  • Checked the System logs and found that the cluster node ASDCLOUD-H3 was removed from the active failover cluster membership.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:34:45 PM | Error | ASDCLOUD-H2 | 39 | iScsiPrt | Initiator sent a task management command to reset the target. The target name is given in the dump data. |
| 5/5/2016 | 5:34:45 PM | Error | ASDCLOUD-H2 | 9 | iScsiPrt | Target did not respond in time for a SCSI request. The CDB is given in the dump data. |
| 5/5/2016 | 5:35:05 PM | Critical | ASDCLOUD-H2 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ASDCLOUD-H3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

Application Events:

 

  • Checked and found that the VSS service was also running around the time of the issue, which suggests that a backup job was in progress.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 6:05:24 PM | Information | ASDCLOUD-H2 | 5605 | Microsoft-Windows-WMI | The root\mscluster namespace is marked with the RequiresEncryption flag. Access to this namespace might be denied if the script or application does not have the appropriate authentication level. Change the authentication level to Pkt_Privacy and run the script or application again. |
| 5/5/2016 | 6:08:21 PM | Information | ASDCLOUD-H2 | 8224 | VSS | The VSS service is shutting down due to idle timeout. |

 

Failover Cluster Events:

 

  • Checked and found that the VM ALIST-XENDC was offline and came back online around 6:17 PM. Because the quorum witness had been configured as a dependency of the clustered VM ALIST-XENDC, which was residing on ASDCLOUD-H3, the witness must also have gone offline when that node went down.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 6:17:54 PM | Information | ASDCLOUD-H2 | 1637 | Microsoft-Windows-FailoverClustering | Cluster resource ‘SCVMM ALIST-XENDC Configuration’ in clustered role ‘ALIST-XENDC’ has transitioned from state Offline to state OnlineCallIssued. |

 

 

List of outdated drivers:

 

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
| 6/26/2012 13:55 | (3.7:0.0) | (3.7:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
| 6/29/2012 14:26 | (9.15:1.45) | (9.15:1.45) | Matrox Graphics Inc. | MxG2hDO64.sys |
| 6/26/2012 13:55 | (3.7:0.0) | (3.7:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |

 

__________________________________________________________________________________________________________________ 

 

System Information: ASDCLOUD-H3

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        ASDCLOUD-H3
System Manufacturer        HP
System Model        ProLiant DL380p Gen8
System Type        x64-based PC
System SKU        670854-S01
Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2494 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz, 2494 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP P70, 8/2/2014
 

 

 

System Events:

 

  • Checked and found that the node lost its connection to San4_1.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:09:54 PM | Error | ASDCLOUD-H3 | 1038 | Microsoft-Windows-FailoverClustering | Ownership of cluster disk ‘San4_1’ has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration. |
| 5/5/2016 | 5:09:54 PM | Error | ASDCLOUD-H3 | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘San4_1’ of type ‘Physical Disk’ in clustered role ‘504b4f34-1b1f-4186-9aa7-1174c2e311da’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |

 

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:39:19 PM | Error | ASDCLOUD-H3 | 1001 | Microsoft-Windows-WER-SystemErrorReporting | The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e (0xffffe000e0fe0080, 0x00000000000004b0, 0x0000000000000005, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 050516-27062-01. |
| 5/5/2016 | 5:38:52 PM | Critical | ASDCLOUD-H3 | 41 | Microsoft-Windows-Kernel-Power | The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly. |
| 5/5/2016 | 5:39:16 PM | Error | ASDCLOUD-H3 | 6008 | EventLog | The previous system shutdown at 5:34:43 PM on 5/5/2016 was unexpected. |
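
The bugcheck parameters in event 1001 above can be decoded with a few lines of Python. This is only a sketch, not part of the original analysis. As I understand the public documentation for bug check 0x9E (USER_MODE_HEALTH_MONITOR), parameter 1 is the address of the process that failed the health check, parameter 2 is the health monitoring timeout in seconds, and parameter 3 is the watchdog source. The one-entry `WATCHDOG_SOURCES` mapping and the function name `decode_bugcheck_9e` are my own assumptions and should be verified against the Microsoft reference before relying on them.

```python
import re

# Text of event 1001 as logged on ASDCLOUD-H3 (copied from the table above).
EVENT_TEXT = (
    "The computer has rebooted from a bugcheck.  The bugcheck was: 0x0000009e "
    "(0xffffe000e0fe0080, 0x00000000000004b0, 0x0000000000000005, "
    "0x0000000000000000)."
)

# ASSUMPTION: partial mapping of parameter 3 (watchdog source), based on my
# reading of the public bug check 0x9E reference. Verify before relying on it.
WATCHDOG_SOURCES = {5: "WatchdogSourceRhsResourceDeadlock"}


def decode_bugcheck_9e(text):
    """Return the decoded fields of a 0x9E bugcheck line, or None if absent."""
    match = re.search(r"bugcheck was: (0x[0-9a-fA-F]+)\s*\(([^)]*)\)", text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    if code != 0x9E or len(params) != 4:
        return None
    return {
        "process_address": hex(params[0]),
        "timeout_seconds": params[1],
        "watchdog_source": WATCHDOG_SOURCES.get(
            params[2], f"unknown ({params[2]})"
        ),
    }


decoded = decode_bugcheck_9e(EVENT_TEXT)
print(decoded)
```

Under these assumptions, the second parameter 0x4b0 decodes to 1200 seconds, that is, a 20 minute health monitoring timeout.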

 

 

Application Events:

 

  •  Checked the Application logs but was not able to find any event at the time of issue. 

 

 

 

List of outdated drivers:

 

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 10/28/2013 11:03 | (6.2:9200.16384) | (62.28:0.64) | Hewlett-Packard Company | Smart Array SAS/SATA Controller Storport Driver |
| 2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
| 5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
| 8/6/2013 17:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys |
| 11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |

 

__________________________________________________________________________________________________________________

 

 

System Information: ASDCLOUD-H4

 

OS Name        Microsoft Windows Server 2012 R2 Standard
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        ASDCLOUD-H4
System Manufacturer        HP
System Model        ProLiant DL380 Gen9
System Type        x64-based PC
System SKU        777337-S01
Processor        Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor        Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2397 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP P89, 7/20/2015
 

 

 

 

System Events:

 

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:34:53 PM | Error | ASDCLOUD-H4 | 39 | iScsiPrt | Initiator sent a task management command to reset the target. The target name is given in the dump data. |
| 5/5/2016 | 5:34:53 PM | Error | ASDCLOUD-H4 | 9 | iScsiPrt | Target did not respond in time for a SCSI request. The CDB is given in the dump data. |
| 5/5/2016 | 5:35:05 PM | Critical | ASDCLOUD-H4 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ASDCLOUD-H3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 5/5/2016 | 5:35:10 PM | Information | ASDCLOUD-H4 | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Backups01’ (‘Backups01’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished. |

 

 

Application Events:

 

  • Checked the Application logs but was not able to find any event at the time of issue. 

 

List of outdated drivers:

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
| 9/12/2014 0:25 | (16.8:0.4) | (16.8:0.4) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
| 8/6/2013 17:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys |
| 11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |

 

__________________________________________________________________________________________________________________

 

 

System Information: KST-HSR-HOST1

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        KST-HSR-HOST1
System Manufacturer        HP
System Model        ProLiant DL380 G7
System Type        x64-based PC
System SKU        605876-005
Processor        Intel(R) Xeon(R) CPU           E5640  @ 2.67GHz, 2666 Mhz, 4 Core(s), 8 Logical Processor(s)
Processor        Intel(R) Xeon(R) CPU           E5640  @ 2.67GHz, 2666 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date        HP P67, 5/5/2011
 

 

System Events:

 

  • Checked the System logs and found a large number of errors related to the Foundation Agents.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 5/5/2016 | 5:35:01 PM | Warning | KST-HSR-HOST1 | 3072 | Foundation Agents | Component: Software Version Agent. Error: Could not read from the registry sub-key. Cause: This error can be caused by a corrupt registry or a low memory condition. Rebooting the server may correct this error. |
| 5/5/2016 | 5:35:05 PM | Critical | KST-HSR-HOST1 | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘ASDCLOUD-H3’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

 

Application Events:

 

  • Checked the Application logs but was not able to find any event at the time of issue. 

 

 

List of outdated drivers:

 

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 2/12/2010 18:33 | (3.0:0.0) | (3.0:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3 PSHED Plugin Driver |
| 9/12/2014 0:25 | (16.8:0.4) | (16.8:0.4) | Broadcom Corporation | Broadcom NetXtreme Gigabit Ethernet NDIS6.x Unified Driver. |
| 5/22/2013 17:41 | (3.9:0.0) | (3.9:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Management Controller Core Driver |
| 8/6/2013 17:00 | (9.15:1.102) | (9.15:1.102) | Matrox Graphics Inc. | MxG2hDO64.sys |
| 11/23/2013 21:26 | (3.10:0.0) | (3.10:0.0) | Hewlett-Packard Company | HP ProLiant iLO 3/4 Channel Interface Driver |

 

 

__________________________________________________________________________________________________________________
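
Because the evidence above is spread across five nodes, it can help to merge the key events into a single timeline before drawing conclusions. The following is a minimal sketch of that idea. The event list is a hand-picked subset copied from the tables above, and the one-line summaries and the helper name `build_timeline` are my own, not something produced by a Windows tool.

```python
from datetime import datetime

# (node, time on 5/5/2016, event ID, short summary) copied from the event
# tables above. The summaries are paraphrased, and the list is deliberately
# left unsorted to show the merge step.
EVENTS = [
    ("ASDCLOUD-H3", "5:39:19 PM", 1001, "Rebooted from bugcheck 0x9E"),
    ("ASDCLOUD-H1", "5:17:08 PM", 1146, "RHS process terminated"),
    ("ASDCLOUD-H2", "5:35:05 PM", 1135, "ASDCLOUD-H3 removed from membership"),
    ("ASDCLOUD-H1", "5:08:03 PM", 20, "iSCSI connection to target lost"),
    ("ASDCLOUD-H4", "5:34:53 PM", 39, "iSCSI target reset sent"),
    ("ASDCLOUD-H3", "5:09:54 PM", 1038, "Ownership of San4_1 lost"),
    ("ASDCLOUD-H2", "6:17:54 PM", 1637, "ALIST-XENDC configuration coming online"),
    ("ASDCLOUD-H1", "5:17:06 PM", 1230, "VM KAF-CHD1 exceeded time-out"),
    ("ASDCLOUD-H2", "5:34:45 PM", 39, "iSCSI target reset sent"),
    ("ASDCLOUD-H1", "5:16:44 PM", 9, "iSCSI target did not respond in time"),
]


def build_timeline(events, day="5/5/2016"):
    """Return the events from all nodes sorted by their parsed timestamp."""
    def stamp(event):
        return datetime.strptime(f"{day} {event[1]}", "%m/%d/%Y %I:%M:%S %p")
    return sorted(events, key=stamp)


timeline = build_timeline(EVENTS)
for node, time, event_id, summary in timeline:
    print(f"{time:>10}  {node:<12} {event_id:>5}  {summary}")
```

Sorted this way, the iScsiPrt errors on ASDCLOUD-H1 and the loss of San4_1 on ASDCLOUD-H3 come first, followed by the RHS termination, the removal of ASDCLOUD-H3 from the membership, and finally the bugcheck reboot record.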

 

 

 

 

Conclusion:

 

  • After analyzing the logs, we can see that the issue started when we tried to replace a bad controller on the SAN. The cluster nodes began logging iScsiPrt events because they could not communicate with the SAN, and around 5:09 PM the Cluster Shared Volume San4_1 went offline and the cluster was no longer able to access it.

 

  • This CSV was hosted on ASDCLOUD-H3, which some time later failed with Bugcheck 9E: the Cluster service determined that the node could no longer host the CSV and crashed the node so that its resources could be failed over to another node.

 

  • This is normal behavior for the cluster. However, with the cluster configuration in place at the time of the issue, the cluster could sustain only the failure of one node plus the witness.

 

====================================================================================================================================

This quorum model will be able to sustain failures of 2 node(s) with the disk witness online and 1 node(s) when the disk witness goes offline or fails. 

This quorum configuration can be changed using the Configure Cluster Quorum wizard. This wizard can be started from the Failover Cluster Manager console by selecting the cluster name in the left hand pane, then in the right “actions” pane selecting “More Actions…” and then selecting “Configure Cluster Quorum Settings…”.

====================================================================================================================================

 

  • When I took the remote session, I found that the cluster witness was not configured as per the recommendations. The cluster quorum resource had been added as a dependency of the VM “ALIST-XENDC”, which was initially residing on ASDCLOUD-H3. As a result, the witness was residing on the same node that crashed.
  • When the cluster node crashed, the quorum resource also went into the Failed state. The cluster was not able to sustain the two simultaneous failures, and the cluster went down.
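
The figures in the quorum report quoted above can be reproduced with simple majority arithmetic. The sketch below is only illustrative and rests on assumptions that are mine, not the report's: a static Node and Disk Majority model in which each of the five nodes listed in this document holds one vote and the disk witness holds one vote. It ignores the dynamic quorum and dynamic witness behavior of Windows Server 2012 R2, and `tolerated_node_failures` is a hypothetical helper name.

```python
def tolerated_node_failures(node_votes, witness_configured, witness_online):
    """Node failures a static majority quorum can survive.

    The majority is computed over all configured votes; the votes still
    available are the nodes plus the witness only while it is online.
    """
    total_votes = node_votes + (1 if witness_configured else 0)
    majority = total_votes // 2 + 1
    available = node_votes + (1 if witness_configured and witness_online else 0)
    return max(0, available - majority)


# ASSUMPTION: five voting nodes (ASDCLOUD-H1..H4 and KST-HSR-HOST1) plus one
# disk witness.
with_witness = tolerated_node_failures(5, True, True)
without_witness = tolerated_node_failures(5, True, False)
print(f"Witness online:  survives {with_witness} node failure(s)")
print(f"Witness offline: survives {without_witness} node failure(s)")
```

Under these assumptions the model tolerates 2 node failures with the witness online and 1 with the witness offline, which matches the wording of the quorum report.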

 

Recommendations:

 

 

  1. Bring the Cluster Shared Volume offline before making any changes on the SAN side, so that the Cluster service does not try to bring it online while the work is in progress.

 

  2. Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:

 

 

https://support.microsoft.com/en-us/kb/2920151  

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
