RCA - 17 - CSV in Paused State During Backup

Issue Description:

Event IDs 5120 and 5142 are being logged on cluster “ORL-HVCLUSTER-PR01”, which is running “Microsoft Windows Server 2012 R2 Datacenter”.

_________________________________________________________________________

System Information: ORL-220-VS-02

OS Name Microsoft Windows Server 2012 R2 Datacenter

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ORL-220-VS-02

System Manufacturer Cisco Systems Inc

System Model UCSB-B200-M3

System Type x64-based PC

System SKU

Processor Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz, 2400 Mhz, 8 Core(s), 16 Logical Processor(s)

BIOS Version/Date Cisco Systems, Inc. B200M3.2.2.1a.0.111220131105, 11/12/2013

System Events:

Analyzed the logs of node ORL-220-VS-02 and found that the Volume Shadow Copy (VSS) service entered the running state, which generally indicates that a VSS operation was running in the background.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/10/2017	11:46:40 PM	Information	ORL-220-VS-02.ntm.org	7036	Service Control Manager	The Volume Shadow Copy service entered the running state.
2/10/2017	11:47:08 PM	Error	ORL-220-VS-02.ntm.org	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Virtual Machine ST-NETSCALER-01’ of type ‘Virtual Machine’ in clustered role ‘ST-NETSCALER-01’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.

Checked the logs around 11:54 PM and found that the VMs went into a failed state. This is probably because the CSV became inaccessible on node ORL-220-VS-03 around 11:36:49 PM.

2/10/2017	11:54:00 PM	Error	ORL-220-VS-02.ntm.org	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Virtual Machine ST-NETSCALER-01’ of type ‘Virtual Machine’ in clustered role ‘ST-NETSCALER-01’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.
2/10/2017	11:54:00 PM	Error	ORL-220-VS-02.ntm.org	1205	Microsoft-Windows-FailoverClustering	The Cluster service failed to bring clustered role ‘ST-NETSCALER-01’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Cluster Events:

Found the Cluster task running around 11:39:32 PM.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/10/2017	11:39:32 PM	Information	ORL-220-VS-02.ntm.org	1641	Microsoft-Windows-FailoverClustering	Clustered role ‘SCVMM ST-ADMIN-01 Resources’ is moving to cluster node ‘ORL-220-VS-02’.
2/10/2017	11:39:32 PM	Information	ORL-220-VS-02.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘SCVMM ST-ADMIN-01 Configuration’ in clustered role ‘SCVMM ST-ADMIN-01 Resources’ has transitioned from state Offline to state OnlineCallIssued.
2/10/2017	11:39:32 PM	Information	ORL-220-VS-02.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘SCVMM ST-ADMIN-01’ in clustered role ‘SCVMM ST-ADMIN-01 Resources’ has transitioned from state Offline to state WaitingToComeOnline. Cluster resource ‘SCVMM ST-ADMIN-01’ is waiting on the following resources: SCVMM ST-ADMIN-01 Configuration.
2/10/2017	11:39:32 PM	Information	ORL-220-VS-02.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘SCVMM ST-ADMIN-01 Configuration’ in clustered role ‘SCVMM ST-ADMIN-01 Resources’ has transitioned from state OnlineCallIssued to state OnlinePending.
2/10/2017	11:39:32 PM	Information	ORL-220-VS-02.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘SCVMM ST-ADMIN-01 Configuration’ in clustered role ‘SCVMM ST-ADMIN-01 Resources’ has transitioned from state OnlinePending to state Online.

_____________________________________________________________________________________

System Information: ORL-220-VS-03

OS Name Microsoft Windows Server 2012 R2 Datacenter

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ORL-220-VS-03

System Manufacturer Cisco Systems Inc

System Model UCSB-B200-M3

System Type x64-based PC

System SKU

Processor Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz, 2400 Mhz, 8 Core(s), 16 Logical Processor(s)

BIOS Version/Date Cisco Systems, Inc. B200M3.2.2.1a.0.111220131105, 11/12/2013

Application Events:

Analyzed the logs from node ORL-220-VS-03 starting around 11:36 PM and found events showing the VSS service in operation.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/10/2017	11:06:29 PM	Information	ORL-220-VS-03.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.
2/10/2017	11:10:55 PM	Information	ORL-220-VS-03.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.
2/10/2017	11:17:06 PM	Information	ORL-220-VS-03.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.
2/11/2017	3:03:28 AM	Error	ORL-220-VS-03.ntm.org	257	Microsoft-Windows-Defrag	The volume ST-General-VM01 (C:\ClusterStorage\ST-General-VM01) was not optimized because an error was encountered: The process cannot access the file because it is being used by another process. (0x80070020)

Event ID 257 shows that the Cluster Shared Volume was in use by another process. This generally indicates that another application (the backup) was accessing the volume, so the defrag operation could not continue.

2/10/2017	11:36:01 PM	Information	ORL-220-VS-03.ntm.org	7036	Service Control Manager	The Volume Shadow Copy service entered the running state.
2/10/2017	11:36:49 PM	Error	ORL-220-VS-03.ntm.org	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘ST-General-VM01’ (‘ST-General-VM01’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
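The timing relationship between the VSS start and the CSV pause can be checked directly from the exported rows. The following is a minimal Python sketch, assuming the log is exported as tab-separated text in the same column order as the tables in this report. The function names are hypothetical, and the 5120 description is shortened for readability.

```python
from datetime import datetime

# Two rows from the ORL-220-VS-03 table above (tab-separated export).
# The 5120 description is abbreviated here.
ROWS = """\
2/10/2017\t11:36:01 PM\tInformation\tORL-220-VS-03.ntm.org\t7036\tService Control Manager\tThe Volume Shadow Copy service entered the running state.
2/10/2017\t11:36:49 PM\tError\tORL-220-VS-03.ntm.org\t5120\tMicrosoft-Windows-FailoverClustering\tCluster Shared Volume 'ST-General-VM01' has entered a paused state because of '(c000020c)'.
"""


def parse_row(line):
    """Split one exported row into (timestamp, event_id, description)."""
    date, time, _level, _computer, event_id, _source, description = line.split("\t", 6)
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")
    return stamp, int(event_id), description


def seconds_between(rows, first_id, second_id):
    """Seconds from the first event with first_id to the next event with second_id."""
    events = sorted(parse_row(line) for line in rows.splitlines() if line.strip())
    start = next(stamp for stamp, eid, _ in events if eid == first_id)
    end = next(stamp for stamp, eid, _ in events if eid == second_id and stamp >= start)
    return (end - start).total_seconds()


gap = seconds_between(ROWS, 7036, 5120)
print(gap)  # 48.0 seconds between the VSS start and the CSV pause
```

Event ID 7036 is logged for any service state change, so a real script would also need to filter on the description text. A short gap suggests, but does not prove, that the VSS operation and the CSV pause are related.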

After this, the Cluster Shared Volume became inaccessible (paused) and later went into a failed state.

2/10/2017	11:36:49 PM	Error	ORL-220-VS-03.ntm.org	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘ST-General-VM01’ (‘ST-General-VM01’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
2/10/2017	11:42:40 PM	Error	ORL-220-VS-03.ntm.org	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘USHQ-FS-02-H’ (‘USHQ-FS-02-H’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
2/10/2017	11:43:20 PM	Error	ORL-220-VS-03.ntm.org	5120	Microsoft-Windows-FailoverClustering	Cluster Shared Volume ‘USHQ-FS-02-H’ (‘USHQ-FS-02-H’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished.
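The status value ‘(c000020c)’ in the 5120 events above is an NTSTATUS code. As a rough illustration, the Python sketch below splits it into the standard NTSTATUS bit fields (2-bit severity, customer bit, 12-bit facility, 16-bit code). The function name and the one-entry lookup table are hypothetical helpers, not part of any Windows tool.

```python
def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its bit fields."""
    severity_names = {0: "SUCCESS", 1: "INFORMATIONAL", 2: "WARNING", 3: "ERROR"}
    return {
        "severity": severity_names[(value >> 30) & 0x3],  # top two bits
        "customer": bool((value >> 29) & 0x1),            # customer-defined flag
        "facility": (value >> 16) & 0xFFF,                # 12-bit facility
        "code": value & 0xFFFF,                           # low 16 bits
    }


# Lookup for the single value seen in this report; not a complete table.
KNOWN_NAMES = {0xC000020C: "STATUS_CONNECTION_DISCONNECTED"}

status = 0xC000020C
fields = decode_ntstatus(status)
print(fields["severity"], hex(fields["code"]), KNOWN_NAMES.get(status, "unknown"))
```

STATUS_CONNECTION_DISCONNECTED means the node lost its connection to the volume. This matches the event text, which says I/O is queued until a path to the volume is reestablished.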

System Events:

Analyzed the logs but was not able to find anything specific related to the issue.

Cluster Events:

Verified the cluster logs and found that the movement of clustered virtual machines started around 12:16 PM.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/10/2017	11:39:32 PM	Information	ORL-220-VS-03.ntm.org	1641	Microsoft-Windows-FailoverClustering	Clustered role ‘SCVMM ST-ADMIN-01 Resources’ is moving to cluster node ‘ORL-220-VS-02’.
2/10/2017	11:39:59 PM	Information	ORL-220-VS-03.ntm.org	1641	Microsoft-Windows-FailoverClustering	Clustered role ‘ST-ADMIN-02’ is moving to cluster node ‘ORL-220-VS-04’.
2/10/2017	11:42:26 PM	Information	ORL-220-VS-03.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘ST-General-VM03’ in clustered role ‘28a8dba1-091b-4b80-b8a7-6c88fd2ad9bd’ has transitioned from state Online to state ProcessingFailure.
2/10/2017	11:42:26 PM	Information	ORL-220-VS-03.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘ST-General-VM03’ in clustered role ‘28a8dba1-091b-4b80-b8a7-6c88fd2ad9bd’ has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource ‘ST-General-VM03’ is waiting on the following resources: .

____________________________________________________________________________________________

System Information: ORL-220-VS-04

OS Name Microsoft Windows Server 2012 R2 Datacenter

Version 6.3.9600 Build 9600

Other OS Description Not Available

OS Manufacturer Microsoft Corporation

System Name ORL-220-VS-04

System Manufacturer Cisco Systems Inc

System Model UCSB-B200-M3

System Type x64-based PC

System SKU

Processor Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz, 2400 Mhz, 8 Core(s), 16 Logical Processor(s)

BIOS Version/Date Cisco Systems, Inc. B200M3.2.2.1a.0.111220131105, 11/12/2013

Application Events:

Checked the events and found the VSS service repeatedly shutting down due to idle timeout.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/10/2017	11:17:11 PM	Information	ORL-220-VS-04.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.
2/10/2017	11:39:02 PM	Information	ORL-220-VS-04.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.
2/10/2017	11:42:56 PM	Information	ORL-220-VS-04.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.
2/10/2017	11:50:02 PM	Information	ORL-220-VS-04.ntm.org	5605	Microsoft-Windows-WMI	The root\mscluster namespace is marked with the RequiresEncryption flag. Access to this namespace might be denied if the script or application does not have the appropriate authentication level. Change the authentication level to Pkt_Privacy and run the script or application again.
2/10/2017	11:50:19 PM	Information	ORL-220-VS-04.ntm.org	8224	VSS	The VSS service is shutting down due to idle timeout.

Cluster Events:

Found events for the resource movement around 12:16 PM, but did not see any errors.

Date	Time	Type/Level	Computer Name	Event Code	Source	Description
2/10/2017	11:49:56 PM	Information	ORL-220-VS-04.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘ST-General-VM03’ in clustered role ‘28a8dba1-091b-4b80-b8a7-6c88fd2ad9bd’ has transitioned from state Offline to state OnlineCallIssued.
2/10/2017	11:50:01 PM	Information	ORL-220-VS-04.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘ST-General-VM03’ in clustered role ‘28a8dba1-091b-4b80-b8a7-6c88fd2ad9bd’ has transitioned from state OnlineCallIssued to state OnlinePending.
2/10/2017	11:50:02 PM	Information	ORL-220-VS-04.ntm.org	1637	Microsoft-Windows-FailoverClustering	Cluster resource ‘ST-General-VM03’ in clustered role ‘28a8dba1-091b-4b80-b8a7-6c88fd2ad9bd’ has transitioned from state OnlinePending to state Online.
2/10/2017	11:50:02 PM	Information	ORL-220-VS-04.ntm.org	1201	Microsoft-Windows-FailoverClustering	The Cluster service successfully brought the clustered role ‘28a8dba1-091b-4b80-b8a7-6c88fd2ad9bd’ online.

______________________________________________________________________________________________

Conclusion:

As per our discussion, you mentioned that the issue started after two simultaneous backups were initiated. In the backup architecture, when a backup starts, the filter driver associated with the backup application takes an exclusive handle on the volume, which is why we generally get errors like:

2/11/2017	3:03:28 AM	Error	ORL-220-VS-03.ntm.org	257	Microsoft-Windows-Defrag	The volume ST-General-VM01 (C:\ClusterStorage\ST-General-VM01) was not optimized because an error was encountered: The process cannot access the file because it is being used by another process. (0x80070020)

The error code translates as follows:

\err(vista).exe’ 0x80070020

# for hex 0x80070020 / decimal -2147024864

STIERR_SHARING_VIOLATION stierr.h

# as an HRESULT: Severity: FAILURE (1), FACILITY_WIN32 (0x7), Code 0x20

# for hex 0x20 / decimal 32

ERROR_SHARING_VIOLATION winerror.h

# The process cannot access the file because it is being used by another process.

# 2 matches found for “0x80070020”

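The breakdown in the lookup output above can be reproduced by hand. The short Python sketch below splits the HRESULT into its failure bit, 11-bit facility and 16-bit code, and also computes the signed decimal form. The function name is a hypothetical helper for illustration only.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into failure bit, facility, and code."""
    value &= 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),       # severity bit: 1 means failure
        "facility": (value >> 16) & 0x7FF,  # 11-bit facility (7 = FACILITY_WIN32)
        "code": value & 0xFFFF,             # low 16 bits (32 = ERROR_SHARING_VIOLATION)
        # Same bits interpreted as a signed 32-bit integer.
        "signed": value - (1 << 32) if value & 0x80000000 else value,
    }


fields = decode_hresult(0x80070020)
print(fields)
# {'failure': True, 'facility': 7, 'code': 32, 'signed': -2147024864}
```

Facility 7 with code 32 is the Win32 error ERROR_SHARING_VIOLATION wrapped as an HRESULT, which agrees with the lookup output above.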

During this time, if another application's filter driver tries to access the same volume, the Cluster Shared Volume can become inaccessible, and in some cases the entire CSV goes offline. This can be fixed by restarting the node that owned the resource.

Based on our discussion, I recommend not running multiple backup applications at the same time, as two filter drivers cannot safely operate on the same CSV at the same time.
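One way to follow this recommendation is to check the backup schedule for overlapping windows on the same CSV before jobs are enabled. The Python sketch below is only an illustration: the job names, times and dictionary layout are invented, and it does not read the schedule of any real backup product.

```python
from datetime import datetime
from itertools import combinations


def overlapping_jobs(jobs):
    """Return pairs of job names whose time windows overlap on the same volume."""
    clashes = []
    for a, b in combinations(jobs, 2):
        same_volume = a["volume"] == b["volume"]
        # Two windows overlap when each one starts before the other ends.
        overlap = a["start"] < b["end"] and b["start"] < a["end"]
        if same_volume and overlap:
            clashes.append((a["name"], b["name"]))
    return clashes


# Hypothetical schedule: job names and times are invented for illustration.
jobs = [
    {"name": "backup-app-A", "volume": "ST-General-VM01",
     "start": datetime(2017, 2, 10, 23, 30), "end": datetime(2017, 2, 11, 0, 30)},
    {"name": "backup-app-B", "volume": "ST-General-VM01",
     "start": datetime(2017, 2, 10, 23, 45), "end": datetime(2017, 2, 11, 1, 0)},
    {"name": "backup-app-C", "volume": "USHQ-FS-02-H",
     "start": datetime(2017, 2, 10, 23, 45), "end": datetime(2017, 2, 11, 1, 0)},
]

print(overlapping_jobs(jobs))  # [('backup-app-A', 'backup-app-B')]
```

Jobs on different volumes are not flagged even when their windows overlap, because the conflict described in this report is specific to two drivers working on the same CSV.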