RCA - 1 - RCA of Getting Event ID 5120 on a Cluster Shared Volume

Issue Description:

RCA of Getting Event ID 5120 on a Cluster Shared Volume.

Initial Description:

As per the article: https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/

Auto Pause due to STATUS_IO_TIMEOUT

When conducting backups of a Windows Server 2012 or later Failover Cluster using Cluster Shared Volumes (CSV), you may encounter the following event in the System event log:

Log Name:  System
Source:  Microsoft-Windows-FailoverClustering
Event ID:  5120
Task Category: Cluster Shared Volume
Level:  Error
Description:  Cluster Shared Volume 'VolumeName' ('ResourceName') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Having an Event ID 5120 logged may or may not be the sign of a problem with the cluster, based on the error code logged. Having an Event 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR or the error code c0130021 may be expected and can be safely ignored in most situations.

An Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR is logged on the node which owns the cluster Physical Disk resource when there was a VSS Software Snapshot which clustering knew of, but the software snapshot was deleted. When a snapshot is deleted which Failover Clustering had knowledge of, clustering must resynchronize its state of the view of the snapshots.

One scenario where an Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR may be logged is when using System Center Data Protection Manager (DPM), and DPM may delete a software snapshot once a backup has completed. When DPM requests deletion of a software snapshot, volsnap will mark the software snapshot for deletion. However volsnap conducts deletion in an asynchronous fashion which occurs at a later point in time. Even though the snapshot has been marked for deletion, Clustering will detect that the software snapshot still exists and needs to handle it appropriately. Eventually volsnap will perform the actual deletion operation of the software snapshot. When clustering then notices that a software snapshot it knew of was deleted, it must resynchronize its view of the snapshots.

Think of it as clustering getting surprised by an un-notified software snapshot deletion, and the cluster service telling the various internal components of the cluster service that they need to resynchronize their views of the snapshots.

There are also a few other expected scenarios where volsnap will delete snapshots, and as a result clustering will need to resynchronize its snapshot view. Such as if a copy on write fails due to lack of space or an IO error. In these conditions volsnap will log an event in the system event log associated with those failures. So review the system event logs for other events accompanying the event 5120, this could be logged on any node in the cluster.

_____________________________________________________________________________________________________________________________________________

System Information: QVH6
OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        QVH6
System Manufacturer        HP
System Model        ProLiant BL465c G7
System Type        x64-based PC
System SKU        518857-B21
Processor        AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
Processor        AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP A19, 02/02/2014

Application Events:

• Tried to find the events at the time of issue but was not able to find anything specific.

System Events:

• Checked the event logs on the QVH6 and found that the Physical Server is going low on Physical RAM due to which the Live migrated VMs were not able to come online.

Checked the event logs on the QVH6 and found that the Physical Server is going low on Physical RAM due to which the Live migrated VMs were not able to come online.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	12:57:59 PM	Error	QVH6	21502	Microsoft-Windows-Hyper-V-High-Availability	Live migration of ‘SCVMM QERPSQL01’ failed. Virtual machine migration operation for ‘QERPSQL01’ failed at migration destination ‘QVH12’. (Virtual machine ID 66E9A08A-4C57-456B-9CCA-62D26ECA286D) ‘QERPSQL01’ could not initialize. (Virtual machine ID 66E9A08A-4C57-456B-9CCA-62D26ECA286D) Not enough memory in the system to start the virtual machine QERPSQL01 with ram size 32768 megabytes. (Virtual machine ID 66E9A08A-4C57-456B-9CCA-62D26ECA286D)
2/22/2017	12:57:59 PM	Warning	QVH6	1155	Microsoft-Windows-FailoverClustering	The pending move for the role ‘SCVMM QERPSQL01 Resources’ did not complete.

Also found that the Disk 9 has failed due to Hardware issue.

2/22/2017

2:15:57 PM

Error

QVH6

154

disk

The IO operation at logical block address 0x1e for Disk 9 (PDO name: \Device\MPIODisk8) failed due to a hardware error.

When a disk returns an error indicating that the drive is failing, Windows logs an entry in the Event Viewer. Depending on the type of error, the disk might be identified by either a GUID or its physical disk number.
Event 154, shown below, includes additional SCSI sense information that can be used for diagnostic purposes:

Provider: disk
Event ID: 154
Level: Error
Text: “The IO operation at logical block address <LBA> for Disk <Disk Number> (PDO name: <PDO Name>) failed due to a hardware error.”
Logged when: A fatal hardware error is reported by the device.
Sense Key: Byte 0x2D in the binary data
Add’l Sense Code: Byte 0x2E in the binary data
Add’l Sense Code Qualifier: Byte 0x2F in the binary data

From https://technet.microsoft.com/en-us/library/dn782852(v=ws.11).aspx

Hyper-V
Events:

Found the
same set of events in the Hyper-V Logs.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	12:57:07 PM	Warning	QVH6.	19548	Microsoft-Windows-Hyper-V-VMMS	The dynamic memory balancer could not add memory to the virtual machine ‘QRAISERWEB’ because its configured maximum has been reached (Virtual machine ID 2C23BE25-5E1B-40AD-A043-911967BE0A5B).
2/22/2017	12:57:59 PM	Error	QVH6	21026	Microsoft-Windows-Hyper-V-VMMS	Virtual machine migration operation for ‘QERPSQL01’ failed at migration destination ‘QVH12’. (Virtual machine ID 66E9A08A-4C57-456B-9CCA-62D26ECA286D)
2/22/2017	12:58:01 PM	Warning	QVH6	19548	Microsoft-Windows-Hyper-V-VMMS	The dynamic memory balancer could not add memory to the virtual machine ‘QNETMANAGER’ because its configured maximum has been reached (Virtual machine ID 61C6D360-8733-4CC8-A366-9C94FF88ABB7).

List
of outdated drivers:

Tried to
find any outdated driver but was not able to find anything specific.

Cluster
Logs:

Found the
Events mentioning that the cluster was not able to communicate on IP
192.168.101.6, 10.10.1.26 and 192.168.0.6.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	12:18:48 PM	Information	QVH6	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.168.101.6:~3343~ connected to remote endpoint 192.168.101.2:~3343~.
2/22/2017	12:18:48 PM	Information	QVH6	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 10.10.1.26:~3343~ connected to remote endpoint 10.10.1.22:~3343~.
2/22/2017	12:18:48 PM	Information	QVH6	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.168.0.6:~3343~ connected to remote endpoint 192.168.0.2:~3343~.

System Information: QVH9

OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        QVH9
System Manufacturer        HP
System Model        ProLiant BL465c Gen8
System Type        x64-based PC
System SKU        708931-B21
Processor        AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor        AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP A26, 02/11/2014

Application
Events:

Tried to
find the events at the time of issue but was not able to find anything
specific.

System
Events:

We are
getting DCOM events which seems to be started with the Network Disconnect.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	1:20:05 PM	Error	QVH9	10028	Microsoft-Windows-DistributedCOM	DCOM was unable to communicate with the computer qvh4 using any of the configured protocols; requested by PID 41e4 (C:\Windows\system32\mmc.exe).
2/22/2017	1:20:31 PM	Error	QVH9	10028	Microsoft-Windows-DistributedCOM	DCOM was unable to communicate with the computer qvh4 using any of the configured protocols; requested by PID 41e4 (C:\Windows\system32\mmc.exe).
2/22/2017	1:20:57 PM	Error	QVH9	10028	Microsoft-Windows-DistributedCOM	DCOM was unable to communicate with the computer QVH3 using any of the configured protocols; requested by PID 41e4 (C:\Windows\system32\mmc.exe).

2/22/2017	2:17:06 PM	Error	QVH9	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘Volume2’ (‘P20300’) has entered a paused state because of ‘(c0000203)’. All I/O will temporarily be queued until a path to the volume is reestablished.
2/22/2017	2:17:06 PM	Error	QVH9	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘Volume11’ (‘R2CS3V8’) has entered a paused state because of ‘(c0000203)’. All I/O will temporarily be queued until a path to the volume is reestablished.

After this
we started getting event ID 5120. where the CSV went to Paused State

err(vista).exe' 0xc0000203
# for hex 0xc0000203 / decimal -1073741309
  STATUS_USER_SESSION_DELETED                                    ntstatus.h
# The remote user session has been deleted.
# 1 matches found for "0xc0000203"
PS C:\Users\adix5025.INDIA\Downloads\ERR>
0xc0000203 = The remote user session has been deleted.

2/22/2017	2:18:08 PM	Error	QVH9	7009	Service Control Manager	A timeout was reached (30000 milliseconds) while waiting for the Hyper-V Virtual Machine Management service to connect.
2/22/2017	2:18:08 PM	Error	QVH9.	7000	Service Control Manager	The Hyper-V Virtual Machine Management service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

It seem like we timed out “The remote user session has been deleted” while we were attempting to change the state of CSV from direct IO to draining. The DCM is closing the session

Hyper-V Events:

Events are
occurring after we have seen the issue, So these can be Ignored.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	2:11:23 PM	Information	QVH9	20413	Microsoft-Windows-Hyper-V-VMMS	The Virtual Machine Management service initiated the live migration of virtual machine ‘QSFTP’ to destination host ‘QVH7’ (VMID 237054C2-DBAE-42DE-BA2E-8B59A741A091).
2/22/2017	2:11:23 PM	Warning	QVH9	19548	Microsoft-Windows-Hyper-V-VMMS	The dynamic memory balancer could not add memory to the virtual machine ‘QSFTP’ because its configured maximum has been reached (Virtual machine ID 237054C2-DBAE-42DE-BA2E-8B59A741A091).
2/22/2017	2:11:28 PM	Information	QVH9	20413	Microsoft-Windows-Hyper-V-VMMS	The Virtual Machine Management service initiated the live migration of virtual machine ‘QDSUAT’ to destination host ‘QVH6’ (VMID 5FF63E8F-AF67-41B6-8AAB-DFB9598E6189).
2/22/2017	2:11:29 PM	Warning	QVH9	19548	Microsoft-Windows-Hyper-V-VMMS	The dynamic memory balancer could not add memory to the virtual machine ‘QDSUAT’ because its configured maximum has been reached (Virtual machine ID 5FF63E8F-AF67-41B6-8AAB-DFB9598E6189).

Cluster Logs:

Issue
Started with the Heartbeat Miss On IP Address: 192.168.0.9, 10.0.0.123
and 192.168.101.9 after which we
started to see cluster events.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	1:18:18 PM	Information	QVH9	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.168.0.9:~3343~ connected to remote endpoint 192.168.0.3:~3343~.
2/22/2017	1:18:18 PM	Information	QVH9	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 10.0.0.123:~3343~ connected to remote endpoint 10.10.1.23:~3343~.
2/22/2017	1:18:18 PM	Information	QVH9	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.168.101.9:~3343~ connected to remote endpoint 192.168.101.3:~3343~.
2/22/2017	1:18:26 PM	Information	QVH9	1650	Microsoft-Windows-FailoverClustering	Cluster has lost the UDP connection from local endpoint 192.168.0.9:~3343~ connected to remote endpoint 192.168.0.3:~3343~.
2/22/2017	1:18:26 PM	Information	QVH9	1650	Microsoft-Windows-FailoverClustering	Cluster has lost the UDP connection from local endpoint 10.0.0.123:~3343~ connected to remote endpoint 10.10.1.23:~3343~.
2/22/2017	1:18:26 PM	Information	QVH9	1650	Microsoft-Windows-FailoverClustering	Cluster has lost the UDP connection from local endpoint 192.168.101.9:~3343~ connected to remote endpoint 192.168.101.3:~3343~.

System Information:

OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH15
System Manufacturer HP
System Model ProLiant BL465c Gen8
System Type x64-based PC
System SKU 708931-B21
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A26, 02/11/2014

Application Events:

Checked the
events and found some events for the VSS timing out, However we were not
able to find anything specific to the issue.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	12:58:21 PM	Information	QVH15	8224	VSS	The VSS service is shutting down due to idle timeout.
2/22/2017	1:13:21 PM	Information	QVH15	8224	VSS	The VSS service is shutting down due to idle timeout.

System Events:

Tried to
find the events at the time of issue but was not able to find anything
specific.

Hyper-V Events:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	4:00:20 PM	Warning	QVH15	14100	Microsoft-Windows-Hyper-V-VMMS	Shut down physical computer. Stopping/saving all virtual machines…
2/22/2017	6:08:12 PM	Warning	QVH15	10170	Microsoft-Windows-Hyper-V-VMMS	Requester reported unsuccessful backup for the virtual machine ‘Test-DC-1’. (Virtual machine ID BA85458B-D40C-4257-B92F-6A1487AF86A6)
2/22/2017	6:08:12 PM	Error	QVH15	16010	Microsoft-Windows-Hyper-V-VMMS	The operation failed.

Cluster Logs:

Found the
Same events for the heartbeat Misses on the Cluster.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	1:18:17 PM	Information	QVH15	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.168.101.15:~3343~ connected to remote endpoint 192.168.101.3:~3343~.
2/22/2017	1:18:18 PM	Information	QVH15	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 10.0.0.56:~3343~ connected to remote endpoint 10.10.1.23:~3343~.
2/22/2017	1:18:18 PM	Information	QVH15	1650	Microsoft-Windows-FailoverClustering	Cluster has missed two consecutive heartbeats for the local endpoint 192.168.0.15:~3343~ connected to remote endpoint 192.168.0.3:~3343~.

System Information: QVH1

OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH1
System Manufacturer HP
System Model ProLiant BL465c G7
System Type x64-based PC
System SKU 518857-B21
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A19, 02/02/2014

System Events:

Verified the
events and found events for the ISCSI portal failure.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	12:28:12 PM	Error	QVH1.	5	iScsiPrt	Failed to setup initiator portal. Error status is given in the dump data.
2/22/2017	12:28:13 PM	Information	QVH1	16	Microsoft-Windows-Kernel-General	The access history in hive \SystemRoot\System32\Config\SOFTWARE was cleared updating 62395 keys and creating 9567 modified pages.
2/22/2017	12:28:15 PM	Error	QVH1	5	iScsiPrt	Failed to setup initiator portal. Error status is given in the dump data.
2/22/2017	12:28:17 PM	Information	QVH1	16	Microsoft-Windows-Kernel-General	The access history in hive \SystemRoot\System32\Config\DEFAULT was cleared updating 237 keys and creating 40 modified pages.
2/22/2017	12:28:18 PM	Error	QVH1	5	iScsiPrt	Failed to setup initiator portal. Error status is given in the dump data.

This issue occurs for one or more of the following reasons.

Cause 1

An iSCSI target device that is listed on the

Favourite
Targets tab of the iSCSI
Initiator is no longer accessible. This may occur if the iSCSI target device
name is changed, if the IP address is changed, or if the iSCSI target device is
removed from the network.

Cause 2

The iSCSI target device has reached its
limit for the number of incoming connections.

Cause 3

A transient network issue temporarily prevented connectivity to the
iSCSI target device.

Cause 4

The network stack is not fully prepared when the initiator was
trying to connect to persistent targets. The initiator will try to connect.
When the network stack is fully prepared, the connection should be successfully
established.

From <https://support.microsoft.com/en-in/help/976072/event-id-1,-event-id-5,-event-id-70,-or-event-id-113-may-be-logged-in-the-system-log-on-a-computer-that-uses-the-iscsi-software-initiator-to-connect-to-an-iscsi-target-device>

2/22/2017	12:28:47 PM	Information	QVH1	67	be2net	HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #2 : IOCTL to firmware has failed. The last 5 words in payload indicate opcode, status, add.status, domain & subsys
2/22/2017	12:28:47 PM	Information	QVH1	67	be2net	HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #3 : IOCTL to firmware has failed. The last 5 words in payload indicate opcode, status, add.status, domain & subsys
2/22/2017	12:28:47 PM	Information	QVH1	67	be2net	HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #4 : IOCTL to firmware has failed. The last 5 words in payload indicate opcode, status, add.status, domain & subsys

Cluster Logs:

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/22/2017	12:32:07 PM	Information	QVH1	1650	Microsoft-Windows-FailoverClustering	Cluster has established a UDP connection from local endpoint 192.168.0.1:~3343~ connected to remote endpoint 192.168.0.6:~3343~.
2/22/2017	12:32:07 PM	Information	QVH1	1650	Microsoft-Windows-FailoverClustering	Cluster has established a UDP connection from local endpoint 10.10.1.21:~3343~ connected to remote endpoint 10.10.1.22:~3343~.
2/22/2017	12:32:07 PM	Information	QVH1	1650	Microsoft-Windows-FailoverClustering	Cluster has established a UDP connection from local endpoint 192.168.0.1:~3343~ connected to remote endpoint 192.168.0.2:~3343~.

__________________________________________________________________________________________________________________

Conclusion:

After Analyzing the logs we can see couple of issues which are pointing in different directions. As per the Understanding the Cluster Shared Volume is going in Paused state as it was not able to change the State of the CSV. While this is happening we can see cluster heartbeat misses on nearly all the nodes which could point issues towards the Network Adaptor Driver not able to handle the Load. However while analyzing we can also See events related to the VMs not able to initialize to RAM issues. This could also cause problems with the Cluster and the Network Drivers.