RCA – 20 – RCA on Cluster Failure

Issue Description:

 

The 6-node mixed cluster “HV2012_Clust1”, running on Windows Server 2012 R2 Datacenter, requires log analysis for the cluster failure that happened on 18th of July at around 4:30 PM.

 

Date: 18th July 2017

Time: Around 04:30 PM

ABC-HYPERV04.abc.net – 2016 node

 

Initial Description:

 

Verified the make and model of the SAN and found that the SAN is not supported.

https://www.windowsservercatalog.com/item.aspx?idItem=8e532271-6003-9c86-cba3-79dfa56a8e46&bCatID=1282

 

______________________________________________________________________________________

 

System Information:  ABC-HYPERV04

 

OS Name        Microsoft Windows Server 2016 Datacenter

Version        10.0.14393 Build 14393

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABC-HYPERV04

System Manufacturer        HP

System Model        ProLiant DL380 Gen9

System Type        x64-based PC

System SKU        859083-S01

Processor        Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 1998 Mhz, 14 Core(s), 28 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 1998 Mhz, 14 Core(s), 28 Logical Processor(s)

BIOS Version/Date        HP P89, 2/17/2017

 

 

System Events:

 

  • Checked the events and found that, just before the issue began, HP Ethernet port #4 went down.

 

7/18/2017        4:07:59 PM        Warning        ABC-HYPERV04.abc.net        4        q57nd60a        HP Ethernet 1Gb 4-port 331i Adapter #4: The network link is down.  Check to make sure the network cable is properly connected.

 

  • Checked the event logs from node ABC-HYPERV04 and found that the issue started around 4:08 PM, when Cluster Disk 5 was no longer directly accessible from node 4.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

7/18/2017        4:08:30 PM        Information        ABC-HYPERV04.abc.net        5121        Microsoft-Windows-FailoverClustering        Cluster Shared Volume ‘Volume4’ (‘Cluster Disk 5’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

7/18/2017        4:08:30 PM        Information        ABC-HYPERV04.abc.net        5121        Microsoft-Windows-FailoverClustering        Cluster Shared Volume ‘Volume3’ (‘Cluster Disk 4’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

7/18/2017        4:08:32 PM        Information        ABC-HYPERV04.abc.net        5121        Microsoft-Windows-FailoverClustering        Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

7/18/2017        4:08:32 PM        Information        ABC-HYPERV04.abc.net        5121        Microsoft-Windows-FailoverClustering        Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

7/18/2017        4:08:32 PM        Error        ABC-HYPERV04.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Cluster Disk 2’ of type ‘Physical Disk’ in clustered role ‘Cluster Group’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

7/18/2017        4:08:32 PM        Error        ABC-HYPERV04.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Cluster Disk 2’ of type ‘Physical Disk’ in clustered role ‘Cluster Group’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

7/18/2017        4:08:32 PM        Error        ABC-HYPERV04.abc.net        1205        Microsoft-Windows-FailoverClustering        The Cluster service failed to bring clustered role ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

 

  • As soon as the Cluster Group failed, the Cluster Shared Volume also went into a failed state with error code 2147943568, which indicates that the underlying storage could not be found.

 

PS C:\Users\adix5025\Downloads\ERR> & '.\err(vista).exe' 2147943568

# for decimal -2147023728 / hex 0x80070490

# as an HRESULT: Severity: FAILURE (1), FACILITY_WIN32 (0x7), Code 0x490

# for decimal 1168 / hex 0x490

ERROR_NOT_FOUND
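The same decoding can be reproduced without the err tool. Below is a minimal Python sketch, assuming the standard HRESULT bit layout (failure flag in bit 31, an 11-bit facility starting at bit 16, and a 16-bit code). The function name `decode_hresult` is only illustrative and is not part of any tool used in this analysis.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its fields (illustrative helper)."""
    value &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit number
    # Two's-complement view, as printed by err.exe ("for decimal -2147023728")
    signed = value - 0x100000000 if value & 0x80000000 else value
    return {
        "hex": f"0x{value:08X}",
        "signed_decimal": signed,
        "is_failure": bool(value & 0x80000000),  # bit 31 set means FAILURE
        "facility": (value >> 16) & 0x7FF,       # 7 is FACILITY_WIN32
        "code": value & 0xFFFF,                  # 1168 (0x490) is ERROR_NOT_FOUND
    }


result = decode_hresult(2147943568)
print(result)
```

The facility and code values match the err output above: facility 7 (FACILITY_WIN32) and code 1168 (0x490, ERROR_NOT_FOUND).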

 

 

7/18/2017        4:10:29 PM        Error        ABC-HYPERV04.abc.net        1793        Microsoft-Windows-FailoverClustering        Cluster physical disk resource online failed. Physical Disk resource name: Cluster Disk 4 Device Number: 4294967295 Device Guid: {00000000-0000-0000-0000-000000000000} Error Code: 2147943568 Additional reason: ArbitrateFailure

7/18/2017        4:10:29 PM        Error        ABC-HYPERV04.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Cluster Disk 4’ of type ‘Physical Disk’ in clustered role ‘693703b6-b1c0-4125-8d8e-ac0254c0b97e’ failed. The error code was ‘0x80070490’ (‘Element not found.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

 

  • Found events showing the member NICs getting disconnected.

 

7/18/2017        4:14:25 PM        Warning        ABC-HYPERV04.abc.net        16949        Microsoft-Windows-MsLbfoSysEvtProvider        Member Nic {74a1e61c-bf4e-4e9c-9d86-bb73e96c86a6} Disconnected.

7/18/2017        4:14:25 PM        Warning        ABC-HYPERV04.abc.net        16949        Microsoft-Windows-MsLbfoSysEvtProvider        Member Nic {51552172-5b80-4fce-9052-509895031f63} Disconnected.

7/18/2017        4:14:26 PM        Warning        ABC-HYPERV04.abc.net        16949        Microsoft-Windows-MsLbfoSysEvtProvider        Member Nic {35dc5db6-fa97-4ba9-bcae-b5b080e316a8} Disconnected.

7/18/2017        4:14:26 PM        Warning        ABC-HYPERV04.abc.net        16949        Microsoft-Windows-MsLbfoSysEvtProvider        Member Nic {40b5acbd-e90c-4a37-b7db-927f9c52b990} Disconnected.

 

 

  • Checked and found that the CSV went into a paused state, after which the virtual machine went into a failed state.

 

7/18/2017        4:26:29 PM        Warning        ABC-HYPERV04.abc.net        5120        Microsoft-Windows-FailoverClustering        Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘STATUS_BAD_NETWORK_NAME(c00000cc)’. All I/O will temporarily be queued until a path to the volume is reestablished.

7/18/2017        4:26:29 PM        Information        ABC-HYPERV04.abc.net        5121        Microsoft-Windows-FailoverClustering        Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
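The status value quoted in event 5120, c00000cc, is an NTSTATUS rather than an HRESULT, so it splits differently. Below is a small Python sketch, assuming the documented NTSTATUS layout (2-bit severity in bits 30-31, 12-bit facility in bits 16-27, 16-bit code). The helper name `decode_ntstatus` is hypothetical.

```python
# Severity values as defined for NTSTATUS (top two bits of the value)
SEVERITY_NAMES = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(hex_text: str) -> dict:
    """Split an NTSTATUS given as hex text into severity, facility and code."""
    value = int(hex_text, 16) & 0xFFFFFFFF
    return {
        "severity": SEVERITY_NAMES[value >> 30],
        "facility": (value >> 16) & 0xFFF,
        "code": value & 0xFFFF,
    }


# Value taken from the event text: STATUS_BAD_NETWORK_NAME(c00000cc)
status = decode_ntstatus("c00000cc")
print(status)
```

The top two bits are both set, so the value is an error-severity status, consistent with the CSV being paused while the path to the volume was unavailable.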

 

 

7/18/2017        4:28:16 PM        Error        ABC-HYPERV04.abc.net        21502        Microsoft-Windows-Hyper-V-High-Availability        ‘Virtual Machine Configuration ABC-DIRSYNC’ failed to register the virtual machine with the virtual machine management service. The Virtual Machine Management Service failed to register the configuration for the virtual machine ‘451CB158-068D-45A1-BEEC-A27CA9F04BE3’ at ‘C:\ClusterStorage\volume3\hyper-v virtual machine files\ABC-o365-5’: The system cannot find the file specified. (0x80070002). If the virtual machine is managed by a failover cluster, ensure that the file is located at a path that is accessible to other nodes of the cluster.

7/18/2017        4:28:16 PM        Error        ABC-HYPERV04.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Virtual Machine Configuration ABC-DIRSYNC’ of type ‘Virtual Machine Configuration’ in clustered role ‘ABC-DIRSYNC’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

 

 

 

Failover Cluster Events:

 

  • The cluster events also point towards the physical disks being disconnected from the node.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

7/18/2017        4:04:34 PM        Information        ABC-HYPERV04.abc.net        1132        Microsoft-Windows-FailoverClustering        Cluster network interface ‘ABC-HYPERV04 – vEthernet (Production Network 1)’ for node ‘ABC-HYPERV04’ on network ‘Cluster Network 3’ was removed.

7/18/2017        4:04:34 PM        Information        ABC-HYPERV04.abc.net        1134        Microsoft-Windows-FailoverClustering        Cluster network ‘Cluster Network 3’ was removed from the failover cluster.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘84fffabf-34ee-46ee-b035-70e6f13f7176’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘b91f823e-f6ea-408c-a86a-5eee3db75b13’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘ef4ee8ce-b31f-4062-8cb7-ec4b66ae3b16’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘1fdd054e-4413-4e5a-bbb6-c258172acc43’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘46163d63-2d5a-46bd-bde4-ae7dc65cf3a6’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘84fffabf-34ee-46ee-b035-70e6f13f7176’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘b91f823e-f6ea-408c-a86a-5eee3db75b13’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘ef4ee8ce-b31f-4062-8cb7-ec4b66ae3b16’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘1fdd054e-4413-4e5a-bbb6-c258172acc43’ has been disconnected from this node.

7/18/2017        4:04:54 PM        Information        ABC-HYPERV04.abc.net        5264        Microsoft-Windows-FailoverClustering        Physical Disk resource ‘46163d63-2d5a-46bd-bde4-ae7dc65cf3a6’ has been disconnected from this node.

 

7/18/2017        4:08:29 PM        Information        ABC-HYPERV04.abc.net        1154        Microsoft-Windows-FailoverClustering        The Cluster service is attempting to fail back the clustered role ‘Cluster Group’ from node ‘ABC-HYPERV01’ to node ‘ABC-HYPERV04’.

 

7/18/2017        4:08:42 PM        Information        ABC-HYPERV04.abc.net        1674        Microsoft-Windows-FailoverClustering        Group ‘Cluster Group’ has transitioned from state ‘Pending’ to state ‘Failed’.

7/18/2017        4:08:42 PM        Information        ABC-HYPERV04.abc.net        1153        Microsoft-Windows-FailoverClustering        The Cluster service is attempting to fail over the clustered role ‘Cluster Group’ from node ‘ABC-HYPERV04’ to node ‘ABC-HYPERV01’.

 

7/18/2017        4:24:00 PM        Information        ABC-HYPERV04.abc.net        1154        Microsoft-Windows-FailoverClustering        The Cluster service is attempting to fail back the clustered role ‘Cluster Group’ from node ‘ABC-HYPERV01’ to node ‘ABC-HYPERV04’.
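The ten 5264 events above repeat the same resource GUIDs. The short Python sketch below counts them, showing that five distinct Physical Disk resources were disconnected and that each was logged twice. The GUIDs are copied from the events; the variable names are only illustrative.

```python
from collections import Counter

# Resource GUIDs from the ten event 5264 entries, in the order they were logged
disconnected = [
    "84fffabf-34ee-46ee-b035-70e6f13f7176",
    "b91f823e-f6ea-408c-a86a-5eee3db75b13",
    "ef4ee8ce-b31f-4062-8cb7-ec4b66ae3b16",
    "1fdd054e-4413-4e5a-bbb6-c258172acc43",
    "46163d63-2d5a-46bd-bde4-ae7dc65cf3a6",
] * 2  # the same five GUIDs appear a second time with the same timestamp

counts = Counter(disconnected)
unique_disks = len(counts)

print(f"{unique_disks} distinct Physical Disk resources disconnected")
for guid, seen in counts.items():
    print(f"  {guid}: logged {seen} times")
```

Five distinct resources matches the number of cluster disks named elsewhere in the logs (Cluster Disk 1 through Cluster Disk 5), although the events themselves do not map each GUID to a disk name.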

 

 

 

 

_________________________________________________________________________________________

 

System Information: ABC-HYPERV01

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABC-HYPERV01

System Manufacturer        HP

System Model        ProLiant DL380 G7

System Type        x64-based PC

System SKU        583914-B21

Processor        Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz, 2799 Mhz, 6 Core(s), 12 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz, 2799 Mhz, 6 Core(s), 12 Logical Processor(s)

BIOS Version/Date        HP P67, 8/16/2015

 

System Events:

 

  • Checked the corresponding events on the other nodes as well and found that cluster node 4 was evicted from the failover cluster after the Cluster Group went into a failed state.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

7/18/2017        4:20:23 PM        Warning        ABC-HYPERV01.abc.net        1011        Microsoft-Windows-FailoverClustering        Cluster node ABC-HYPERV04 has been evicted from the failover cluster.

7/18/2017        4:23:57 PM        Warning        ABC-HYPERV01.abc.net        1548        Microsoft-Windows-FailoverClustering        Node ‘ABC-HYPERV01’ established a communication session with node ‘ABC-HYPERV04’ and detected that it is running a different but compatible version of the cluster service software. It is recommended that the same version of the cluster service software be installed on all nodes in the cluster.
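Event 1548 reflects the mixed OS versions listed in the System Information blocks of this report. As a rough illustration, the Python sketch below groups the three analysed nodes by OS version and flags the cluster as mixed. The version strings are copied from the report; the helper and variable names are assumptions made for the example.

```python
# OS versions from the System Information blocks in this report
node_versions = {
    "ABC-HYPERV01": "6.3.9600",    # Windows Server 2012 R2
    "ABC-HYPERV02": "6.3.9600",    # Windows Server 2012 R2
    "ABC-HYPERV04": "10.0.14393",  # Windows Server 2016
}


def parse_version(text: str) -> tuple:
    """Turn '10.0.14393' into (10, 0, 14393) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


by_version = {}
for node, version in node_versions.items():
    by_version.setdefault(parse_version(version), []).append(node)

is_mixed = len(by_version) > 1
newest = max(by_version)

print("Mixed-version cluster:", is_mixed)
print("Nodes on the newest version:", by_version[newest])
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would rank "10.0.14393" below "6.3.9600".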

 

  • The virtual machines went into a failed state because the storage was not accessible.

 

7/18/2017        4:28:18 PM        Error        ABC-HYPERV01.abc.net        21502        Microsoft-Windows-Hyper-V-High-Availability        ‘Virtual Machine Configuration ABC-DIRSYNC’ failed to register the virtual machine with the virtual machine management service.

7/18/2017        4:28:18 PM        Error        ABC-HYPERV01.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Virtual Machine Configuration ABC-DIRSYNC’ of type ‘Virtual Machine Configuration’ in clustered role ‘ABC-DIRSYNC’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

 

  • This caused live migration to fail.

 

7/18/2017        4:42:07 PM        Error        ABC-HYPERV01.abc.net        21502        Microsoft-Windows-Hyper-V-High-Availability        Live migration of ‘Virtual Machine ABC-OKTA2’ failed.

7/18/2017        4:42:07 PM        Warning        ABC-HYPERV01.abc.net        1155        Microsoft-Windows-FailoverClustering        The pending move for the role ‘ABC-OKTA2’ did not complete.

 

7/18/2017        5:12:21 PM        Error        ABC-HYPERV01.abc.net        6008        EventLog        The previous system shutdown at 5:05:43 PM on 7/18/2017 was unexpected.

 

 

 

_________________________________________________________________________________

 

System Information: ABC-HYPERV02

 

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABC-HYPERV02

System Manufacturer        HP

System Model        ProLiant DL380p Gen8

System Type        x64-based PC

System SKU        734792-S01

Processor        Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz, 2195 Mhz, 10 Core(s), 20 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz, 2195 Mhz, 10 Core(s), 20 Logical Processor(s)

BIOS Version/Date        HP P70, 7/1/2015

 

System Events:

 

  • The same set of events can be seen on cluster node 2.

 

Date        Time        Type/Level        Computer Name        Event Code        Source        Description

7/18/2017        4:20:23 PM        Warning        ABC-HYPERV02.abc.net        1011        Microsoft-Windows-FailoverClustering        Cluster node ABC-HYPERV04 has been evicted from the failover cluster.

7/18/2017        4:23:56 PM        Warning        ABC-HYPERV02.abc.net        1548        Microsoft-Windows-FailoverClustering        Node ‘ABC-HYPERV02’ established a communication session with node ‘ABC-HYPERV04’ and detected that it is running a different but compatible version of the cluster service software. It is recommended that the same version of the cluster service software be installed on all nodes in the cluster.

 

7/18/2017        4:28:17 PM        Error        ABC-HYPERV02.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Virtual Machine ABC-AZ-RDS02’ of type ‘Virtual Machine’ in clustered role ‘ABC-AZ-RDS02’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

7/18/2017        4:28:17 PM        Error        ABC-HYPERV02.abc.net        21502        Microsoft-Windows-Hyper-V-High-Availability        ‘Virtual Machine Configuration ABC-DIRSYNC’ failed to register the virtual machine with the virtual machine management service.

7/18/2017        4:28:17 PM        Error        ABC-HYPERV02.abc.net        1069        Microsoft-Windows-FailoverClustering        Cluster resource ‘Virtual Machine Configuration ABC-DIRSYNC’ of type ‘Virtual Machine Configuration’ in clustered role ‘ABC-DIRSYNC’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
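Because the events are quoted per node and per log, their overall order is easy to lose. The Python sketch below merges a selection of the events quoted in this report into a single timeline sorted by timestamp. The short notes are paraphrased from the event descriptions, and the selection of events is an assumption made only for illustration.

```python
from datetime import datetime

# (node, timestamp, event ID, paraphrased note), in the order the report quotes them
events = [
    ("ABC-HYPERV04", "7/18/2017 4:07:59 PM", 4, "HP Ethernet adapter #4 link down"),
    ("ABC-HYPERV04", "7/18/2017 4:08:30 PM", 5121, "CSV no longer directly accessible"),
    ("ABC-HYPERV04", "7/18/2017 4:10:29 PM", 1793, "Cluster Disk 4 online failed"),
    ("ABC-HYPERV04", "7/18/2017 4:14:25 PM", 16949, "Member NIC disconnected"),
    ("ABC-HYPERV04", "7/18/2017 4:26:29 PM", 5120, "CSV Volume2 paused"),
    ("ABC-HYPERV04", "7/18/2017 4:04:34 PM", 1134, "Cluster Network 3 removed"),
    ("ABC-HYPERV04", "7/18/2017 4:04:54 PM", 5264, "Physical Disk resources disconnected"),
    ("ABC-HYPERV01", "7/18/2017 4:20:23 PM", 1011, "ABC-HYPERV04 evicted"),
    ("ABC-HYPERV01", "7/18/2017 4:42:07 PM", 21502, "Live migration of ABC-OKTA2 failed"),
    ("ABC-HYPERV01", "7/18/2017 5:12:21 PM", 6008, "Unexpected shutdown reported"),
    ("ABC-HYPERV02", "7/18/2017 4:28:17 PM", 1069, "VM ABC-AZ-RDS02 resource failed"),
]


def parse_time(text: str) -> datetime:
    """Parse the 'M/D/YYYY h:mm:ss AM/PM' format used in the event tables."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


timeline = sorted(events, key=lambda event: parse_time(event[1]))

for node, stamp, event_id, note in timeline:
    print(f"{stamp}  {node}  {event_id:>5}  {note}")
```

Sorting on the parsed timestamp rather than the raw text keeps 4:42 PM ahead of 5:12 PM and handles times that are not zero-padded.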

 

 

Conclusion:

 

 

 

  • I recommend verifying that the storage is currently running the latest firmware that is supported for Windows Server 2012 R2. I also recommend keeping the environment on 2012 R2 until you get new hardware that is supported for Windows Server 2016.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
