RCA – 1 – RCA of Getting Event ID 5120 on a Cluster Shared Volume

Issue Description:

RCA of Getting Event ID 5120 on a Cluster Shared Volume.

 

Initial Description:

As per the article: https://blogs.msdn.microsoft.com/clustering/2014/02/26/event-id-5120-in-system-event-log/

Auto Pause due to STATUS_IO_TIMEOUT

When conducting backups of a Windows Server 2012 or later Failover Cluster using Cluster Shared Volumes (CSV), you may encounter the following event in the System event log:

Log Name:  System
Source:  Microsoft-Windows-FailoverClustering

Event ID:  5120
Task Category: Cluster Shared Volume
Level:  Error
Description:  Cluster Shared Volume 'VolumeName' ('ResourceName') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Having an Event ID 5120 logged may or may not be the sign of a problem with the cluster, based on the error code logged.  Having an Event 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR or the error code c0130021 may be expected and can be safely ignored in most situations.

An Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR is logged on the node which owns the cluster Physical Disk resource when there was a VSS Software Snapshot which clustering knew of, but the software snapshot was deleted.  When a snapshot is deleted which Failover Clustering had knowledge of, clustering must resynchronize its state of the view of the snapshots.

One scenario where an Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR may be logged is when using System Center Data Protection Manager (DPM), and DPM may delete a software snapshot once a backup has completed.  When DPM requests deletion of a software snapshot, volsnap will mark the software snapshot for deletion.  However volsnap conducts deletion in an asynchronous fashion which occurs at a later point in time.  Even though the snapshot has been marked for deletion, Clustering will detect that the software snapshot still exists and needs to handle it appropriately.  Eventually volsnap will perform the actual deletion operation of the software snapshot.  When clustering then notices that a software snapshot it knew of was deleted, it must resynchronize its view of the snapshots.

Think of it as clustering getting surprised by an un-notified software snapshot deletion, and the cluster service telling the various internal components of the cluster service that they need to resynchronize their views of the snapshots.

There are also a few other expected scenarios where volsnap will delete snapshots, and as a result clustering will need to resynchronize its snapshot view.  Such as if a copy on write fails due to lack of space or an IO error. In these conditions volsnap will log an event in the system event log associated with those failures.  So review the system event logs for other events accompanying the event 5120, this could be logged on any node in the cluster.

_____________________________________________________________________________________________________________________________________________

System Information: QVH6
OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version        6.3.9600 Build 9600
Other OS Description         Not Available
OS Manufacturer        Microsoft Corporation
System Name        QVH6
System Manufacturer        HP
System Model        ProLiant BL465c G7
System Type        x64-based PC
System SKU        518857-B21
Processor        AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
Processor        AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
BIOS Version/Date        HP A19, 02/02/2014

Application Events:

• Tried to find the events at the time of issue but was not able to find anything specific.

System Events:

• Checked the event logs on the QVH6 and found that the Physical Server is going low on Physical RAM due to which the Live migrated VMs were not able to come online.

  • Checked the event logs on the QVH6 and found that the Physical Server is going low on Physical RAM due to which the Live migrated VMs were not able to come online.

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

12:57:59 PM

Error

QVH6

21502

Microsoft-Windows-Hyper-V-High-Availability

Live migration of ‘SCVMM QERPSQL01’ failed. Virtual machine
migration operation for ‘QERPSQL01’ failed at migration destination ‘QVH12’.
(Virtual machine ID 66E9A08A-4C57-456B-9CCA-62D26ECA286D) ‘QERPSQL01’ could
not initialize. (Virtual machine ID 66E9A08A-4C57-456B-9CCA-62D26ECA286D) Not
enough memory in the system to start the virtual machine QERPSQL01 with ram
size 32768 megabytes. (Virtual machine ID
66E9A08A-4C57-456B-9CCA-62D26ECA286D)

2/22/2017

12:57:59 PM

Warning

QVH6

1155

Microsoft-Windows-FailoverClustering

The pending move for the role ‘SCVMM QERPSQL01 Resources’ did not
complete.

  • Also found that the Disk 9 has failed due to Hardware issue.

2/22/2017

2:15:57 PM

Error

QVH6

154

disk

The IO operation at logical block address 0x1e for Disk 9 (PDO name: \Device\MPIODisk8) failed due to a hardware error.

 

  • When a disk returns an error indicating that the drive is failing, Windows logs an entry in the Event Viewer. Depending on the type of error, the disk might be identified by either a GUID or its physical disk number.
    Event 154, shown below, includes additional SCSI sense information that can be used for diagnostic purposes:
 
Provider: disk
Event ID: 154
Level: Error
Text: “The IO operation at logical block address <LBA> for Disk <Disk Number> (PDO name: <PDO Name>) failed due to a hardware error.”
Logged when: A fatal hardware error is reported by the device.
Sense Key: Byte 0x2D in the binary data
Add’l Sense Code: Byte 0x2E in the binary data
Add’l Sense Code Qualifier: Byte 0x2F in the binary data

 

From https://technet.microsoft.com/en-us/library/dn782852(v=ws.11).aspx

Hyper-V
Events:

 

  • Found the
    same set of events in the Hyper-V Logs.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

12:57:07 PM

Warning

QVH6.

19548

Microsoft-Windows-Hyper-V-VMMS

The dynamic memory balancer could not add memory to the virtual
machine ‘QRAISERWEB’ because its configured maximum has been reached (Virtual
machine ID 2C23BE25-5E1B-40AD-A043-911967BE0A5B).

2/22/2017

12:57:59 PM

Error

QVH6

21026

Microsoft-Windows-Hyper-V-VMMS

Virtual machine migration operation for ‘QERPSQL01’ failed at
migration destination ‘QVH12’. (Virtual machine ID
66E9A08A-4C57-456B-9CCA-62D26ECA286D)

2/22/2017

12:58:01 PM

Warning

QVH6

19548

Microsoft-Windows-Hyper-V-VMMS

The dynamic memory balancer could not add memory to the virtual
machine ‘QNETMANAGER’ because its configured maximum has been reached
(Virtual machine ID 61C6D360-8733-4CC8-A366-9C94FF88ABB7).

 

 

List
of outdated drivers:

 

  • Tried to
    find any outdated driver but was not able to find anything specific.

 

Cluster
Logs:

 

  • Found the
    Events mentioning that the cluster was not able to communicate on IP
    192.168.101.6, 10.10.1.26 and 192.168.0.6.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

12:18:48 PM

Information

QVH6

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 192.168.101.6:~3343~ connected to remote endpoint
192.168.101.2:~3343~.

2/22/2017

12:18:48 PM

Information

QVH6

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 10.10.1.26:~3343~ connected to remote endpoint 10.10.1.22:~3343~.

2/22/2017

12:18:48 PM

Information

QVH6

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 192.168.0.6:~3343~ connected to remote endpoint 192.168.0.2:~3343~.

 

System Information: QVH9

OS Name        Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH9
System Manufacturer HP
System Model ProLiant BL465c Gen8
System Type x64-based PC
System SKU 708931-B21
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A26, 02/11/2014

Application
Events:

 

  • Tried to
    find the events at the time of issue but was not able to find anything
    specific.

 

 

System
Events:

 

  • We are
    getting DCOM events which seems to be started with the Network Disconnect.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

1:20:05 PM

Error

QVH9

10028

Microsoft-Windows-DistributedCOM

DCOM was unable to communicate with the computer qvh4 using any of
the configured protocols; requested by PID    
41e4 (C:\Windows\system32\mmc.exe).

2/22/2017

1:20:31 PM

Error

QVH9

10028

Microsoft-Windows-DistributedCOM

DCOM was unable to communicate with the computer qvh4 using any of
the configured protocols; requested by PID    
41e4 (C:\Windows\system32\mmc.exe).

2/22/2017

1:20:57 PM

Error

QVH9

10028

Microsoft-Windows-DistributedCOM

DCOM was unable to communicate with the computer QVH3 using any of
the configured protocols; requested by PID    
41e4 (C:\Windows\system32\mmc.exe).

 

 

2/22/2017

2:17:06 PM

Error

QVH9

5120

Microsoft-Windows-FailoverClustering

Cluster Shared Volume ‘Volume2’ (‘P20300’) has entered a paused
state because of ‘(c0000203)’. All I/O will temporarily be queued until a
path to the volume is reestablished.

2/22/2017

2:17:06 PM

Error

QVH9

5120

Microsoft-Windows-FailoverClustering

Cluster Shared Volume ‘Volume11’ (‘R2CS3V8’) has entered a paused
state because of ‘(c0000203)’. All I/O will temporarily be queued until a
path to the volume is reestablished.

 

 

  • After this
    we started getting event ID 5120. where the CSV went to Paused State
err(vista).exe' 0xc0000203
# for hex 0xc0000203 / decimal -1073741309
  STATUS_USER_SESSION_DELETED                                    ntstatus.h
# The remote user session has been deleted.
# 1 matches found for "0xc0000203"
PS C:\Users\adix5025.INDIA\Downloads\ERR>
0xc0000203 = The remote user session has been deleted.

2/22/2017

2:18:08 PM

Error

QVH9

7009

Service Control Manager

A timeout was reached (30000 milliseconds) while waiting for the
Hyper-V Virtual Machine Management service to connect.

2/22/2017

2:18:08 PM

Error

QVH9.

7000

Service Control Manager

The Hyper-V Virtual Machine Management service failed to start due
to the following error:  The service
did not respond to the start or control request in a timely fashion.

It seem like we timed out “The remote user session has been deleted” while we were attempting to change the state of CSV from direct IO to draining. The DCM is closing the session 

 

 

 

Hyper-V Events:

 

  • Events are
    occurring after we have seen the issue, So these can be Ignored.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

2:11:23 PM

Information

QVH9

20413

Microsoft-Windows-Hyper-V-VMMS

The Virtual Machine Management service initiated the live
migration of virtual machine  ‘QSFTP’
to destination host ‘QVH7’ (VMID 237054C2-DBAE-42DE-BA2E-8B59A741A091).

2/22/2017

2:11:23 PM

Warning

QVH9

19548

Microsoft-Windows-Hyper-V-VMMS

The dynamic memory balancer could not add memory to the virtual
machine ‘QSFTP’ because its configured maximum has been reached (Virtual
machine ID 237054C2-DBAE-42DE-BA2E-8B59A741A091).

2/22/2017

2:11:28 PM

Information

QVH9

20413

Microsoft-Windows-Hyper-V-VMMS

The Virtual Machine Management service initiated the live
migration of virtual machine  ‘QDSUAT’
to destination host ‘QVH6’ (VMID 5FF63E8F-AF67-41B6-8AAB-DFB9598E6189).

2/22/2017

2:11:29 PM

Warning

QVH9

19548

Microsoft-Windows-Hyper-V-VMMS

The dynamic memory balancer could not add memory to the virtual
machine ‘QDSUAT’ because its configured maximum has been reached (Virtual
machine ID 5FF63E8F-AF67-41B6-8AAB-DFB9598E6189).

 

 

 

Cluster Logs:

 

  • Issue
    Started with the Heartbeat Miss On IP Address: 192.168.0.9, 10.0.0.123
    and  192.168.101.9 after which we
    started to see cluster events.

 

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

1:18:18 PM

Information

QVH9

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 192.168.0.9:~3343~ connected to remote endpoint 192.168.0.3:~3343~.

2/22/2017

1:18:18 PM

Information

QVH9

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 10.0.0.123:~3343~ connected to remote endpoint 10.10.1.23:~3343~.

2/22/2017

1:18:18 PM

Information

QVH9

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 192.168.101.9:~3343~ connected to remote endpoint
192.168.101.3:~3343~.

2/22/2017

1:18:26 PM

Information

QVH9

1650

Microsoft-Windows-FailoverClustering

Cluster has lost the UDP connection from local endpoint
192.168.0.9:~3343~ connected to remote endpoint 192.168.0.3:~3343~.

2/22/2017

1:18:26 PM

Information

QVH9

1650

Microsoft-Windows-FailoverClustering

Cluster has lost the UDP connection from local endpoint
10.0.0.123:~3343~ connected to remote endpoint 10.10.1.23:~3343~.

2/22/2017

1:18:26 PM

Information

QVH9

1650

Microsoft-Windows-FailoverClustering

Cluster has lost the UDP connection from local endpoint
192.168.101.9:~3343~ connected to remote endpoint 192.168.101.3:~3343~.

 

System Information:

OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH15
System Manufacturer HP
System Model ProLiant BL465c Gen8
System Type x64-based PC
System SKU 708931-B21
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6344, 2600 Mhz, 6 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A26, 02/11/2014

Application Events:

 

  • Checked the
    events and found some events for the VSS timing out, However we were not
    able to find anything specific to the issue.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

12:58:21 PM

Information

QVH15

8224

VSS

The VSS service is shutting down due to idle timeout. 

2/22/2017

1:13:21 PM

Information

QVH15

8224

VSS

The VSS service is shutting down due to idle timeout. 

 

System Events:

 

  • Tried to
    find the events at the time of issue but was not able to find anything
    specific.

 

Hyper-V Events:

 

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

4:00:20 PM

Warning

QVH15

14100

Microsoft-Windows-Hyper-V-VMMS

Shut down physical computer. Stopping/saving all virtual
machines…

2/22/2017

6:08:12 PM

Warning

QVH15

10170

Microsoft-Windows-Hyper-V-VMMS

Requester reported unsuccessful backup for the virtual machine
‘Test-DC-1’. (Virtual machine ID BA85458B-D40C-4257-B92F-6A1487AF86A6)

2/22/2017

6:08:12 PM

Error

QVH15

16010

Microsoft-Windows-Hyper-V-VMMS

The operation failed.

 

 

Cluster Logs:

 

  • Found the
    Same events for the heartbeat Misses on the Cluster.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

1:18:17 PM

Information

QVH15

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 192.168.101.15:~3343~ connected to remote endpoint
192.168.101.3:~3343~.

2/22/2017

1:18:18 PM

Information

QVH15

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 10.0.0.56:~3343~ connected to remote endpoint 10.10.1.23:~3343~.

2/22/2017

1:18:18 PM

Information

QVH15

1650

Microsoft-Windows-FailoverClustering

Cluster has missed two consecutive heartbeats for the local
endpoint 192.168.0.15:~3343~ connected to remote endpoint 192.168.0.3:~3343~.

 

System Information: QVH1

 

OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name QVH1
System Manufacturer HP
System Model ProLiant BL465c G7
System Type x64-based PC
System SKU 518857-B21
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
Processor AMD Opteron(tm) Processor 6174, 2200 Mhz, 12 Core(s), 12 Logical Processor(s)
BIOS Version/Date HP A19, 02/02/2014

 

System Events:

 

  • Verified the
    events and found events for the ISCSI portal failure.

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

12:28:12 PM

Error

QVH1.

5

iScsiPrt

Failed to setup initiator portal. Error status is given in the
dump data.

2/22/2017

12:28:13 PM

Information

QVH1

16

Microsoft-Windows-Kernel-General

The access history in hive \SystemRoot\System32\Config\SOFTWARE
was cleared updating 62395 keys and creating 9567 modified pages.

2/22/2017

12:28:15 PM

Error

QVH1

5

iScsiPrt

Failed to setup initiator portal. Error status is given in the
dump data.

2/22/2017

12:28:17 PM

Information

QVH1

16

Microsoft-Windows-Kernel-General

The access history in hive \SystemRoot\System32\Config\DEFAULT was
cleared updating 237 keys and creating 40 modified pages.

2/22/2017

12:28:18 PM

Error

QVH1

5

iScsiPrt

Failed to setup initiator portal. Error status is given in the
dump data.

 

This issue occurs for one or more of the following reasons.

 

Cause 1

 

An iSCSI target device that is listed on the

Favourite
Targets
tab of the iSCSI
Initiator is no longer accessible. This may occur if the iSCSI target device
name is changed, if the IP address is changed, or if the iSCSI target device is
removed from the network.

 

Cause 2

 

The iSCSI target device has reached its
limit for the number of incoming connections.

 

Cause 3

 

A transient network issue temporarily prevented connectivity to the
iSCSI target device.

 

Cause 4

 

The network stack is not fully prepared when the initiator was
trying to connect to persistent targets. The initiator will try to connect.
When the network stack is fully prepared, the connection should be successfully
established.

 

From <https://support.microsoft.com/en-in/help/976072/event-id-1,-event-id-5,-event-id-70,-or-event-id-113-may-be-logged-in-the-system-log-on-a-computer-that-uses-the-iscsi-software-initiator-to-connect-to-an-iscsi-target-device>

 

 

 

2/22/2017

12:28:47 PM

Information

QVH1

67

be2net

HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #2 :
IOCTL to firmware has failed. The last 5 words in payload indicate opcode,
status, add.status, domain & subsys

2/22/2017

12:28:47 PM

Information

QVH1

67

be2net

HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #3 :
IOCTL to firmware has failed. The last 5 words in payload indicate opcode,
status, add.status, domain & subsys

2/22/2017

12:28:47 PM

Information

QVH1

67

be2net

HP NC551i Dual Port FlexFabric 10Gb Converged Network Adapter #4 :
IOCTL to firmware has failed. The last 5 words in payload indicate opcode,
status, add.status, domain & subsys

 

 

Cluster Logs:

 

 

Date

Time

Type/Level

Computer Name

Event Code

Source

Description

2/22/2017

12:32:07 PM

Information

QVH1

1650

Microsoft-Windows-FailoverClustering

Cluster has established a UDP connection from local endpoint
192.168.0.1:~3343~ connected to remote endpoint 192.168.0.6:~3343~.

2/22/2017

12:32:07 PM

Information

QVH1

1650

Microsoft-Windows-FailoverClustering

Cluster has established a UDP connection from local endpoint
10.10.1.21:~3343~ connected to remote endpoint 10.10.1.22:~3343~.

2/22/2017

12:32:07 PM

Information

QVH1

1650

Microsoft-Windows-FailoverClustering

Cluster has established a UDP connection from local endpoint
192.168.0.1:~3343~ connected to remote endpoint 192.168.0.2:~3343~.

 

 

 

__________________________________________________________________________________________________________________

 

 

Conclusion:

 

  • After Analyzing the logs we can see couple of issues which are pointing in different directions. As per the Understanding the Cluster Shared Volume is going in Paused state as it was not  able to change the State of the CSV. While this is happening we can see cluster heartbeat misses on nearly all the nodes which could point issues towards the Network Adaptor Driver not able to handle the Load. However while analyzing we can also See events related to the VMs not able to initialize to RAM issues. This could also cause problems with the Cluster and the Network Drivers.

 

 

 

         

        Plan:

 

 

  • Kindly monitor the RAM usage on the Cluster Node and also add some more RAM to the Physical Server so that they can handle the Load.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply