Issue Description:
The 6-node mixed cluster “HV2012_Clust1”, running on Windows Server 2012 R2 Datacenter, requires log analysis for the cluster failure that occurred on 18 July at around 4:30 PM.
Date: 18th July 2017
Time: Around 04:30 PM
ABC-HYPERV04.abc.net – 2016 node
Initial Description:
Verified the make and model of the SAN and found that the SAN is not supported.
______________________________________________________________________________________
System Information: ABC-HYPERV04
OS Name Microsoft Windows Server 2016 Datacenter
Version 10.0.14393 Build 14393
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABC-HYPERV04
System Manufacturer HP
System Model ProLiant DL380 Gen9
System Type x64-based PC
System SKU 859083-S01
Processor Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 1998 Mhz, 14 Core(s), 28 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 1998 Mhz, 14 Core(s), 28 Logical Processor(s)
BIOS Version/Date HP P89, 2/17/2017
System Events:
- Checked the events and found that just before the beginning of the issue HP Ethernet port #4 went down.
7/18/2017 | 4:07:59 PM | Warning | ABC-HYPERV04.abc.net | 4 | q57nd60a | HP Ethernet 1Gb 4-port 331i Adapter #4: The network link is down. Check to make sure the network cable is properly connected. |
- Checked the event logs from node ABC-HYPERV04 and found that the issue started around 4:08 PM, when the Cluster Shared Volumes (Cluster Disks 1, 3, 4, and 5) were no longer directly accessible from node 4.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/18/2017 | 4:08:30 PM | Information | ABC-HYPERV04.abc.net | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume4’ (‘Cluster Disk 5’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished. |
7/18/2017 | 4:08:30 PM | Information | ABC-HYPERV04.abc.net | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume3’ (‘Cluster Disk 4’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished. |
7/18/2017 | 4:08:32 PM | Information | ABC-HYPERV04.abc.net | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished. |
7/18/2017 | 4:08:32 PM | Information | ABC-HYPERV04.abc.net | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished. |
7/18/2017 | 4:08:32 PM | Error | ABC-HYPERV04.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 2’ of type ‘Physical Disk’ in clustered role ‘Cluster Group’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
7/18/2017 | 4:08:32 PM | Error | ABC-HYPERV04.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 2’ of type ‘Physical Disk’ in clustered role ‘Cluster Group’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
7/18/2017 | 4:08:32 PM | Error | ABC-HYPERV04.abc.net | 1205 | Microsoft-Windows-FailoverClustering | The Cluster service failed to bring clustered role ‘Cluster Group’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role. |
- As soon as the cluster group failed, the Cluster Shared Volume disk also went to a failed state with error code 2147943568 (0x80070490, ERROR_NOT_FOUND), which indicates that the node could not find the underlying storage.
PS C:\Users\adix5025\Downloads\ERR> & '.\err(vista).exe' 2147943568
# for decimal -2147023728 / hex 0x80070490
# as an HRESULT: Severity: FAILURE (1), FACILITY_WIN32 (0x7), Code 0x490
# for decimal 1168 / hex 0x490
ERROR_NOT_FOUND
7/18/2017 | 4:10:29 PM | Error | ABC-HYPERV04.abc.net | 1793 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource online failed. Physical Disk resource name: Cluster Disk 4 Device Number: 4294967295 Device Guid: {00000000-0000-0000-0000-000000000000} Error Code: 2147943568 Additional reason: ArbitrateFailure |
7/18/2017 | 4:10:29 PM | Error | ABC-HYPERV04.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Disk 4’ of type ‘Physical Disk’ in clustered role ‘693703b6-b1c0-4125-8d8e-ac0254c0b97e’ failed. The error code was ‘0x80070490’ (‘Element not found.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
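The decoding that err.exe performs above can be reproduced by hand. The following is a minimal sketch, assuming the standard 32-bit HRESULT layout (1 severity bit, an 11-bit facility, a 16-bit code); the function name decode_hresult is hypothetical and is not part of any Microsoft tool.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility, and code fields."""
    value &= 0xFFFFFFFF  # normalize signed input such as -2147023728
    return {
        "hex": f"0x{value:08X}",
        # The same bits read as a signed 32-bit integer.
        "signed": value - (1 << 32) if value & 0x80000000 else value,
        "failure": bool(value >> 31),        # severity bit: 1 means FAILURE
        "facility": (value >> 16) & 0x7FF,   # 0x7 is FACILITY_WIN32
        "code": value & 0xFFFF,              # Win32 error code when facility is 0x7
    }


# Error code reported by event 1793 for Cluster Disk 4.
info = decode_hresult(2147943568)
print(info)
```

When the facility is FACILITY_WIN32, the low 16 bits are the plain Win32 error number, which is why the decimal value in event 1793 and the 0x80070490 in event 1069 refer to the same "Element not found" condition.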
- Found events showing the NIC team members being disconnected.
7/18/2017 | 4:14:25 PM | Warning | ABC-HYPERV04.abc.net | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider | Member Nic {74a1e61c-bf4e-4e9c-9d86-bb73e96c86a6} Disconnected. |
7/18/2017 | 4:14:25 PM | Warning | ABC-HYPERV04.abc.net | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider | Member Nic {51552172-5b80-4fce-9052-509895031f63} Disconnected. |
7/18/2017 | 4:14:26 PM | Warning | ABC-HYPERV04.abc.net | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider | Member Nic {35dc5db6-fa97-4ba9-bcae-b5b080e316a8} Disconnected. |
7/18/2017 | 4:14:26 PM | Warning | ABC-HYPERV04.abc.net | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider | Member Nic {40b5acbd-e90c-4a37-b7db-927f9c52b990} Disconnected. |
- Checked and found that the CSV went to a paused state, after which the virtual machine went to a failed state.
7/18/2017 | 4:26:29 PM | Warning | ABC-HYPERV04.abc.net | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) has entered a paused state because of ‘STATUS_BAD_NETWORK_NAME(c00000cc)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
7/18/2017 | 4:26:29 PM | Information | ABC-HYPERV04.abc.net | 5121 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘Cluster Disk 3’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished. |
7/18/2017 | 4:28:16 PM | Error | ABC-HYPERV04.abc.net | 21502 | Microsoft-Windows-Hyper-V-High-Availability | ‘Virtual Machine Configuration ABC-DIRSYNC’ failed to register the virtual machine with the virtual machine management service. The Virtual Machine Management Service failed to register the configuration for the virtual machine ‘451CB158-068D-45A1-BEEC-A27CA9F04BE3’ at ‘C:\ClusterStorage\volume3\hyper-v virtual machine files\ABC-o365-5’: The system cannot find the file specified. (0x80070002). If the virtual machine is managed by a failover cluster, ensure that the file is located at a path that is accessible to other nodes of the cluster. |
7/18/2017 | 4:28:16 PM | Error | ABC-HYPERV04.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Virtual Machine Configuration ABC-DIRSYNC’ of type ‘Virtual Machine Configuration’ in clustered role ‘ABC-DIRSYNC’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
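Because the events above were collected in several passes, they do not appear in strict time order. A small sketch of how the pipe-delimited rows could be parsed and sorted into one timeline is shown here; the Event type and parse_row helper are hypothetical names, and the sample descriptions are shortened from the rows above.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    level: str
    computer: str
    event_id: int
    source: str
    description: str


def parse_row(row: str) -> Event:
    """Parse one 'Date | Time | Level | Computer | Event Code | Source | Description |' row."""
    # Split on the first six separators only, so a "|" inside the description survives.
    parts = [p.strip() for p in row.strip().rstrip("|").split("|", 6)]
    date, time, level, computer, event_id, source, description = parts
    when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")
    return Event(when, level, computer, int(event_id), source, description)


# Sample rows taken from this report; descriptions are abbreviated for the example.
rows = [
    "7/18/2017 | 4:26:29 PM | Warning | ABC-HYPERV04.abc.net | 5120 | Microsoft-Windows-FailoverClustering | CSV Volume2 entered a paused state |",
    "7/18/2017 | 4:08:30 PM | Information | ABC-HYPERV04.abc.net | 5121 | Microsoft-Windows-FailoverClustering | CSV Volume4 no longer directly accessible |",
    "7/18/2017 | 4:10:29 PM | Error | ABC-HYPERV04.abc.net | 1793 | Microsoft-Windows-FailoverClustering | Cluster Disk 4 online failed |",
]

events = sorted((parse_row(r) for r in rows), key=lambda e: e.when)
for e in events:
    print(e.when.strftime("%H:%M:%S"), e.event_id, e.level, e.description)
```

Sorting on the parsed timestamp rather than on the text avoids mistakes such as "4:10" sorting before "4:08" or AM/PM confusion when rows from several nodes are merged.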
Failover Cluster Events:
- The cluster events also point to the physical disk disconnection as the source of the issue.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/18/2017 | 4:04:34 PM | Information | ABC-HYPERV04.abc.net | 1132 | Microsoft-Windows-FailoverClustering | Cluster network interface ‘ABC-HYPERV04 – vEthernet (Production Network 1)’ for node ‘ABC-HYPERV04’ on network ‘Cluster Network 3’ was removed. |
7/18/2017 | 4:04:34 PM | Information | ABC-HYPERV04.abc.net | 1134 | Microsoft-Windows-FailoverClustering | Cluster network ‘Cluster Network 3’ was removed from the failover cluster. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘84fffabf-34ee-46ee-b035-70e6f13f7176’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘b91f823e-f6ea-408c-a86a-5eee3db75b13’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘ef4ee8ce-b31f-4062-8cb7-ec4b66ae3b16’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘1fdd054e-4413-4e5a-bbb6-c258172acc43’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘46163d63-2d5a-46bd-bde4-ae7dc65cf3a6’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘84fffabf-34ee-46ee-b035-70e6f13f7176’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘b91f823e-f6ea-408c-a86a-5eee3db75b13’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘ef4ee8ce-b31f-4062-8cb7-ec4b66ae3b16’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘1fdd054e-4413-4e5a-bbb6-c258172acc43’ has been disconnected from this node. |
7/18/2017 | 4:04:54 PM | Information | ABC-HYPERV04.abc.net | 5264 | Microsoft-Windows-FailoverClustering | Physical Disk resource ‘46163d63-2d5a-46bd-bde4-ae7dc65cf3a6’ has been disconnected from this node. |
7/18/2017 | 4:08:29 PM | Information | ABC-HYPERV04.abc.net | 1154 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to fail back the clustered role ‘Cluster Group’ from node ‘ABC-HYPERV01’ to node ‘ABC-HYPERV04’. |
7/18/2017 | 4:08:42 PM | Information | ABC-HYPERV04.abc.net | 1674 | Microsoft-Windows-FailoverClustering | Group ‘Cluster Group’ has transitioned from state ‘Pending’ to state ‘Failed’. |
7/18/2017 | 4:08:42 PM | Information | ABC-HYPERV04.abc.net | 1153 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to fail over the clustered role ‘Cluster Group’ from node ‘ABC-HYPERV04’ to node ‘ABC-HYPERV01’. |
7/18/2017 | 4:24:00 PM | Information | ABC-HYPERV04.abc.net | 1154 | Microsoft-Windows-FailoverClustering | The Cluster service is attempting to fail back the clustered role ‘Cluster Group’ from node ‘ABC-HYPERV01’ to node ‘ABC-HYPERV04’. |
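The event 5264 rows above repeat, so it helps to reduce them to the distinct disk resources involved. The sketch below counts disconnect events per resource GUID; the GUIDs are copied from the rows above, while the count_disconnects helper is a hypothetical name used only for illustration.

```python
import re
from collections import Counter

# Matches a GUID such as 84fffabf-34ee-46ee-b035-70e6f13f7176.
GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def count_disconnects(descriptions):
    """Count event 5264 descriptions per Physical Disk resource GUID."""
    counts = Counter()
    for text in descriptions:
        match = GUID.search(text)
        if match:
            counts[match.group(0).lower()] += 1
    return counts


# The five resource GUIDs reported at 4:04:54 PM, each logged twice.
guids = [
    "84fffabf-34ee-46ee-b035-70e6f13f7176",
    "b91f823e-f6ea-408c-a86a-5eee3db75b13",
    "ef4ee8ce-b31f-4062-8cb7-ec4b66ae3b16",
    "1fdd054e-4413-4e5a-bbb6-c258172acc43",
    "46163d63-2d5a-46bd-bde4-ae7dc65cf3a6",
]
descriptions = [
    f"Physical Disk resource '{g}' has been disconnected from this node."
    for g in guids * 2
]

counts = count_disconnects(descriptions)
for guid, n in counts.items():
    print(guid, n)
```

Five distinct resources disconnecting within the same second suggests that the node lost its path to the storage as a whole, rather than a fault on one LUN.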
_________________________________________________________________________________________
System Information: ABC-HYPERV01
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABC-HYPERV01
System Manufacturer HP
System Model ProLiant DL380 G7
System Type x64-based PC
System SKU 583914-B21
Processor Intel(R) Xeon(R) CPU X5660 @ 2.80GHz, 2799 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU X5660 @ 2.80GHz, 2799 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP P67, 8/16/2015
System Events:
- Checked the corresponding events on the other nodes as well and found that cluster node 4 was evicted from the failover cluster after the cluster group went to a failed state.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/18/2017 | 4:20:23 PM | Warning | ABC-HYPERV01.abc.net | 1011 | Microsoft-Windows-FailoverClustering | Cluster node ABC-HYPERV04 has been evicted from the failover cluster. |
7/18/2017 | 4:23:57 PM | Warning | ABC-HYPERV01.abc.net | 1548 | Microsoft-Windows-FailoverClustering | Node ‘ABC-HYPERV01’ established a communication session with node ‘ABC-HYPERV04’ and detected that it is running a different but compatible version of the cluster service software. It is recommended that the same version of the cluster service software be installed on all nodes in the cluster. |
- Virtual machines went to a failed state because the storage was not accessible.
7/18/2017 | 4:28:18 PM | Error | ABC-HYPERV01.abc.net | 21502 | Microsoft-Windows-Hyper-V-High-Availability | ‘Virtual Machine Configuration ABC-DIRSYNC’ failed to register the virtual machine with the virtual machine management service. |
7/18/2017 | 4:28:18 PM | Error | ABC-HYPERV01.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Virtual Machine Configuration ABC-DIRSYNC’ of type ‘Virtual Machine Configuration’ in clustered role ‘ABC-DIRSYNC’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
- This caused live migration to fail.
7/18/2017 | 4:42:07 PM | Error | ABC-HYPERV01.abc.net | 21502 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of ‘Virtual Machine ABC-OKTA2’ failed. |
7/18/2017 | 4:42:07 PM | Warning | ABC-HYPERV01.abc.net | 1155 | Microsoft-Windows-FailoverClustering | The pending move for the role ‘ABC-OKTA2’ did not complete. |
7/18/2017 | 5:12:21 PM | Error | ABC-HYPERV01.abc.net | 6008 | EventLog | The previous system shutdown at 5:05:43 PM on 7/18/2017 was unexpected. |
_________________________________________________________________________________
System Information: ABC-HYPERV02
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABC-HYPERV02
System Manufacturer HP
System Model ProLiant DL380p Gen8
System Type x64-based PC
System SKU 734792-S01
Processor Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz, 2195 Mhz, 10 Core(s), 20 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz, 2195 Mhz, 10 Core(s), 20 Logical Processor(s)
BIOS Version/Date HP P70, 7/1/2015
System Events:
- The same set of events can be seen on cluster node 2.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
7/18/2017 | 4:20:23 PM | Warning | ABC-HYPERV02.abc.net | 1011 | Microsoft-Windows-FailoverClustering | Cluster node ABC-HYPERV04 has been evicted from the failover cluster. |
7/18/2017 | 4:23:56 PM | Warning | ABC-HYPERV02.abc.net | 1548 | Microsoft-Windows-FailoverClustering | Node ‘ABC-HYPERV02’ established a communication session with node ‘ABC-HYPERV04’ and detected that it is running a different but compatible version of the cluster service software. It is recommended that the same version of the cluster service software be installed on all nodes in the cluster. |
7/18/2017 | 4:28:17 PM | Error | ABC-HYPERV02.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Virtual Machine ABC-AZ-RDS02’ of type ‘Virtual Machine’ in clustered role ‘ABC-AZ-RDS02’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
7/18/2017 | 4:28:17 PM | Error | ABC-HYPERV02.abc.net | 21502 | Microsoft-Windows-Hyper-V-High-Availability | ‘Virtual Machine Configuration ABC-DIRSYNC’ failed to register the virtual machine with the virtual machine management service. |
7/18/2017 | 4:28:17 PM | Error | ABC-HYPERV02.abc.net | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Virtual Machine Configuration ABC-DIRSYNC’ of type ‘Virtual Machine Configuration’ in clustered role ‘ABC-DIRSYNC’ failed. The error code was ‘0x2’ (‘The system cannot find the file specified.’). Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
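Event 1548 on nodes 1 and 2 reports that node 4 runs a different version of the cluster service. A rough way to confirm this from the System Information blocks is to group the nodes by OS build number. The sketch below uses only the three nodes documented in this report; the group_by_build helper is a hypothetical name, and the build-to-name mapping covers only the two builds seen here.

```python
from collections import defaultdict

# Build numbers as shown in the System Information blocks of this report.
BUILD_NAMES = {9600: "Windows Server 2012 R2", 14393: "Windows Server 2016"}

node_builds = {
    "ABC-HYPERV01": 9600,
    "ABC-HYPERV02": 9600,
    "ABC-HYPERV04": 14393,
}


def group_by_build(builds):
    """Return {build number: [node names]} so odd-one-out nodes stand out."""
    groups = defaultdict(list)
    for node, build in sorted(builds.items()):
        groups[build].append(node)
    return dict(groups)


groups = group_by_build(node_builds)
is_mixed = len(groups) > 1  # more than one build means a mixed-version cluster
for build, members in groups.items():
    print(BUILD_NAMES.get(build, f"build {build}"), members)
```

Here ABC-HYPERV04 is the only node on build 14393, which matches the node where the storage disconnect was first logged.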
Conclusion:
- Based on the events found, we can conclude that the issue started when the storage disconnected from cluster node 04, which is a Windows Server 2016 node. This could be due to an issue with the HBA or with the SAN, which is not supported for the current version of the OS. For more information, please refer to: https://www.windowsservercatalog.com/item.aspx?idItem=8e532271-6003-9c86-cba3-79dfa56a8e46&bCatID=1282
- I recommend verifying that the storage is running the latest firmware supported for Windows Server 2012 R2. I also recommend keeping the environment on 2012 R2 until new hardware that is supported for 2016 is available.