Issue Description:
RCA of Getting Event ID 5120 on a Cluster Shared Volume.
Initial Description:
As per the article: https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/
Auto Pause due to STATUS_IO_TIMEOUT
When conducting backups of a Windows Server 2012 or later Failover Cluster using Cluster Shared Volumes (CSV), you may encounter the following event in the System event log:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 5120
Task Category: Cluster Shared Volume
Level: Error
Description: Cluster Shared Volume 'VolumeName' ('ResourceName') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Having an Event ID 5120 logged may or may not be the sign of a problem with the cluster, based on the error code logged. Having an Event 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR or the error code c0130021 may be expected and can be safely ignored in most situations.
An Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR is logged on the node which owns the cluster Physical Disk resource when there was a VSS Software Snapshot which clustering knew of, but the software snapshot was deleted. When a snapshot is deleted which Failover Clustering had knowledge of, clustering must resynchronize its state of the view of the snapshots.
One scenario where an Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR may be logged is when using System Center Data Protection Manager (DPM), and DPM may delete a software snapshot once a backup has completed. When DPM requests deletion of a software snapshot, volsnap will mark the software snapshot for deletion. However volsnap conducts deletion in an asynchronous fashion which occurs at a later point in time. Even though the snapshot has been marked for deletion, Clustering will detect that the software snapshot still exists and needs to handle it appropriately. Eventually volsnap will perform the actual deletion operation of the software snapshot. When clustering then notices that a software snapshot it knew of was deleted, it must resynchronize its view of the snapshots.
Think of it as clustering getting surprised by an un-notified software snapshot deletion, and the cluster service telling the various internal components of the cluster service that they need to resynchronize their views of the snapshots.
There are also a few other expected scenarios where volsnap will delete snapshots, and as a result clustering will need to resynchronize its snapshot view. Such as if a copy on write fails due to lack of space or an IO error. In these conditions volsnap will log an event in the system event log associated with those failures. So review the system event logs for other events accompanying the event 5120, this could be logged on any node in the cluster.
_____________________________________________________________________________________________________________________________________________
System Information: QVH6
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH6
System Manufacturer HP
System Model ProLiant BL465c G7
System Type x64-based PC
System SKU 518857-B21
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A19, 02/02/2014
Application Events:
• Tried to find the events at the time of issue but was not able to find anything specific.
System Events:
• Checked the event logs on the QVH6 and found that the Physical Server is going low on Physical RAM due to which the Live migrated VMs were not able to come online.
- Checked the event logs on the QVH6 and found that the Physical Server is going low on Physical RAM due to which the Live migrated VMs were not able to come online.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 12:57:59 PM | Error | QVH6 | 21502 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of ‘SCVMM QERPSQL01’ failed. Virtual machine |
2/22/2017 | 12:57:59 PM | Warning | QVH6 | 1155 | Microsoft-Windows-FailoverClustering | The pending move for the role ‘SCVMM QERPSQL01 Resources’ did not |
- Also found that the Disk 9 has failed due to Hardware issue.
2/22/2017 | 2:15:57 PM | Error | QVH6 | 154 | disk | The IO operation at logical block address 0x1e for Disk 9 (PDO name: \Device\MPIODisk8) failed due to a hardware error. |
- When a disk returns an error indicating that the drive is failing, Windows logs an entry in the Event Viewer. Depending on the type of error, the disk might be identified by either a GUID or its physical disk number.
Event 154, shown below, includes additional SCSI sense information that can be used for diagnostic purposes:
Provider: disk
Event ID: 154
Level: Error
Text: “The IO operation at logical block address <LBA> for Disk <Disk Number> (PDO name: <PDO Name>) failed due to a hardware error.”
Logged when: A fatal hardware error is reported by the device.
Sense Key: Byte 0x2D in the binary data
Add’l Sense Code: Byte 0x2E in the binary data
Add’l Sense Code Qualifier: Byte 0x2F in the binary data
From https://technet.microsoft.com/en-us/library/dn782852(v=ws.11).aspx
Hyper-V
Events:
- Found the
same set of events in the Hyper-V Logs.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 12:57:07 PM | Warning | QVH6. | 19548 | Microsoft-Windows-Hyper-V-VMMS | The dynamic memory balancer could not add memory to the virtual |
2/22/2017 | 12:57:59 PM | Error | QVH6 | 21026 | Microsoft-Windows-Hyper-V-VMMS | Virtual machine migration operation for ‘QERPSQL01’ failed at |
2/22/2017 | 12:58:01 PM | Warning | QVH6 | 19548 | Microsoft-Windows-Hyper-V-VMMS | The dynamic memory balancer could not add memory to the virtual |
List
of outdated drivers:
- Tried to
find any outdated driver but was not able to find anything specific.
Cluster
Logs:
- Found the
Events mentioning that the cluster was not able to communicate on IP
192.168.101.6, 10.10.1.26 and 192.168.0.6.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 12:18:48 PM | Information | QVH6 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 12:18:48 PM | Information | QVH6 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 12:18:48 PM | Information | QVH6 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
System Information: QVH9
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH9
System Manufacturer HP
System Model ProLiant BL465c Gen8
System Type x64-based PC
System SKU 708931-B21
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A26, 02/11/2014
Application
Events:
- Tried to
find the events at the time of issue but was not able to find anything
specific.
System
Events:
- We are
getting DCOM events which seems to be started with the Network Disconnect.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 1:20:05 PM | Error | QVH9 | 10028 | Microsoft-Windows-DistributedCOM | DCOM was unable to communicate with the computer qvh4 using any of |
2/22/2017 | 1:20:31 PM | Error | QVH9 | 10028 | Microsoft-Windows-DistributedCOM | DCOM was unable to communicate with the computer qvh4 using any of |
2/22/2017 | 1:20:57 PM | Error | QVH9 | 10028 | Microsoft-Windows-DistributedCOM | DCOM was unable to communicate with the computer QVH3 using any of |
2/22/2017 | 2:17:06 PM | Error | QVH9 | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume2’ (‘P20300’) has entered a paused |
2/22/2017 | 2:17:06 PM | Error | QVH9 | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘Volume11’ (‘R2CS3V8’) has entered a paused |
- After this
we started getting event ID 5120. where the CSV went to Paused State
err(vista).exe' 0xc0000203
# for hex 0xc0000203 / decimal -1073741309
STATUS_USER_SESSION_DELETED ntstatus.h
# The remote user session has been deleted.
# 1 matches found for "0xc0000203"
PS C:\Users\adix5025.INDIA\Downloads\ERR>
0xc0000203 = The remote user session has been deleted.
2/22/2017 | 2:18:08 PM | Error | QVH9 | 7009 | Service Control Manager | A timeout was reached (30000 milliseconds) while waiting for the |
2/22/2017 | 2:18:08 PM | Error | QVH9. | 7000 | Service Control Manager | The Hyper-V Virtual Machine Management service failed to start due |
It seem like we timed out “The remote user session has been deleted” while we were attempting to change the state of CSV from direct IO to draining. The DCM is closing the session
Hyper-V Events:
- Events are
occurring after we have seen the issue, So these can be Ignored.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 2:11:23 PM | Information | QVH9 | 20413 | Microsoft-Windows-Hyper-V-VMMS | The Virtual Machine Management service initiated the live |
2/22/2017 | 2:11:23 PM | Warning | QVH9 | 19548 | Microsoft-Windows-Hyper-V-VMMS | The dynamic memory balancer could not add memory to the virtual |
2/22/2017 | 2:11:28 PM | Information | QVH9 | 20413 | Microsoft-Windows-Hyper-V-VMMS | The Virtual Machine Management service initiated the live |
2/22/2017 | 2:11:29 PM | Warning | QVH9 | 19548 | Microsoft-Windows-Hyper-V-VMMS | The dynamic memory balancer could not add memory to the virtual |
Cluster Logs:
- Issue
Started with the Heartbeat Miss On IP Address: 192.168.0.9, 10.0.0.123
and 192.168.101.9 after which we
started to see cluster events.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 1:18:18 PM | Information | QVH9 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 1:18:18 PM | Information | QVH9 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 1:18:18 PM | Information | QVH9 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 1:18:26 PM | Information | QVH9 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint |
2/22/2017 | 1:18:26 PM | Information | QVH9 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint |
2/22/2017 | 1:18:26 PM | Information | QVH9 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint |
System Information:
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH15
System Manufacturer HP
System Model ProLiant BL465c Gen8
System Type x64-based PC
System SKU 708931-B21
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A26, 02/11/2014
Application Events:
- Checked the
events and found some events for the VSS timing out, However we were not
able to find anything specific to the issue.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 12:58:21 PM | Information | QVH15 | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
2/22/2017 | 1:13:21 PM | Information | QVH15 | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
System Events:
- Tried to
find the events at the time of issue but was not able to find anything
specific.
Hyper-V Events:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 4:00:20 PM | Warning | QVH15 | 14100 | Microsoft-Windows-Hyper-V-VMMS | Shut down physical computer. Stopping/saving all virtual |
2/22/2017 | 6:08:12 PM | Warning | QVH15 | 10170 | Microsoft-Windows-Hyper-V-VMMS | Requester reported unsuccessful backup for the virtual machine |
2/22/2017 | 6:08:12 PM | Error | QVH15 | 16010 | Microsoft-Windows-Hyper-V-VMMS | The operation failed. |
Cluster Logs:
- Found the
Same events for the heartbeat Misses on the Cluster.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 1:18:17 PM | Information | QVH15 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 1:18:18 PM | Information | QVH15 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
2/22/2017 | 1:18:18 PM | Information | QVH15 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local |
System Information: QVH1
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH1
System Manufacturer HP
System Model ProLiant BL465c G7
System Type x64-based PC
System SKU 518857-B21
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A19, 02/02/2014
System Events:
- Verified the
events and found events for the ISCSI portal failure.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 12:28:12 PM | Error | QVH1. | 5 | iScsiPrt | Failed to setup initiator portal. Error status is given in the |
2/22/2017 | 12:28:13 PM | Information | QVH1 | 16 | Microsoft-Windows-Kernel-General | The access history in hive \SystemRoot\System32\Config\SOFTWARE |
2/22/2017 | 12:28:15 PM | Error | QVH1 | 5 | iScsiPrt | Failed to setup initiator portal. Error status is given in the |
2/22/2017 | 12:28:17 PM | Information | QVH1 | 16 | Microsoft-Windows-Kernel-General | The access history in hive \SystemRoot\System32\Config\DEFAULT was |
2/22/2017 | 12:28:18 PM | Error | QVH1 | 5 | iScsiPrt | Failed to setup initiator portal. Error status is given in the |
This issue occurs for one or more of the following reasons.
Cause 1
An iSCSI target device that is listed on the
Favourite
Targets tab of the iSCSI
Initiator is no longer accessible. This may occur if the iSCSI target device
name is changed, if the IP address is changed, or if the iSCSI target device is
removed from the network.
Cause 2
The iSCSI target device has reached its
limit for the number of incoming connections.
Cause 3
A transient network issue temporarily prevented connectivity to the
iSCSI target device.
Cause 4
The network stack is not fully prepared when the initiator was
trying to connect to persistent targets. The initiator will try to connect.
When the network stack is fully prepared, the connection should be successfully
established.
2/22/2017 | 12:28:47 PM | Information | QVH1 | 67 | be2net | HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #2 : |
2/22/2017 | 12:28:47 PM | Information | QVH1 | 67 | be2net | HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #3 : |
2/22/2017 | 12:28:47 PM | Information | QVH1 | 67 | be2net | HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #4 : |
Cluster Logs:
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
2/22/2017 | 12:32:07 PM | Information | QVH1 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has established a UDP connection from local endpoint |
2/22/2017 | 12:32:07 PM | Information | QVH1 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has established a UDP connection from local endpoint |
2/22/2017 | 12:32:07 PM | Information | QVH1 | 1650 | Microsoft-Windows-FailoverClustering | Cluster has established a UDP connection from local endpoint |
__________________________________________________________________________________________________________________
Conclusion:
- After Analyzing the logs we can see couple of issues which are pointing in different directions. As per the Understanding the Cluster Shared Volume is going in Paused state as it was not able to change the State of the CSV. While this is happening we can see cluster heartbeat misses on nearly all the nodes which could point issues towards the Network Adaptor Driver not able to handle the Load. However while analyzing we can also See events related to the VMs not able to initialize to RAM issues. This could also cause problems with the Cluster and the Network Drivers.
Plan:
- Please install the Latest Cluster Hotfixes by following the Article: https://support.microsoft.com/en-us/help/2920151/recommended-hotfixes-and-updates-for-windows-server-2012-r2-based-failover-clusters
- Kindly monitor the RAM usage on the Cluster Node and also add some more RAM to the Physical Server so that they can handle the Load.