Issue Description:
Virtual machines went into a Saved state after node failures on cluster ABCHVR2CLUSTER, which runs Microsoft Windows Server 2012 R2 Datacenter, Version 6.3.9600 Build 9600.
__________________________________________________________________________________________
System Information: ABCH17
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCH17
System Manufacturer Cisco Systems Inc
System Model UCSB-B200-M4
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015
SMBIOS Version 2.8
System Events:
- Checked the events on machine ABCH17 and found that live migration of the VMs failed, due to which the VMs went into a failed state (a sketch for checking the simultaneous live-migration limit follows the events below).
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:05:45 AM | Warning | ABCH17.xyz.local | 21501 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of 'SCVMM ABCS85' failed. Virtual machine migration operation for 'ABCS85' failed at migration source 'ABCH17'. (Virtual machine ID F64344B4-091A-46F2-858A-CF4BB8408337) Failed to perform migration on virtual machine 'ABCS85' because virtual machine migration limit '5' was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID F64344B4-091A-46F2-858A-CF4BB8408337)
12/7/2016 | 9:05:45 AM | Warning | ABCH17.xyz.local | 21501 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of 'Virtual Machine ABCS39' failed. Virtual machine migration operation for 'ABCS39' failed at migration source 'ABCH17'. (Virtual machine ID 2B45E171-9143-4258-8247-D4307F64EFD4) Failed to perform migration on virtual machine 'ABCS39' because virtual machine migration limit '5' was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID 2B45E171-9143-4258-8247-D4307F64EFD4)
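The 21501 warnings above show the host refusing further migrations once the simultaneous live-migration limit of 5 was reached. As a reference point, the configured limit can be checked and, after confirming the migration network can carry the extra load, adjusted on each host. This is only a minimal sketch, assuming Python and the Hyper-V PowerShell module are available on the node; the value 8 is purely illustrative:

```python
import subprocess

# Read the current simultaneous live-migration limit on this Hyper-V host.
current = subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "(Get-VMHost).MaximumVirtualMachineMigrations"],
    capture_output=True, text=True, check=True)
print("Current live-migration limit:", current.stdout.strip())

# Optionally raise the limit. 8 is an illustrative value, not a recommendation;
# only change it once the migration network has been validated for the extra load.
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Set-VMHost -MaximumVirtualMachineMigrations 8"],
    check=True)
```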
- Checked the events for the NIC disconnect, which most likely occurred because the machine had been restarted.
12/7/2016 | 9:13:50 AM | Warning | ABCH17.xyz.local | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider | Member Nic {f9210cd5-36f5-4a2f-8f2c-0aa45563ca85} Disconnected.
Application Events:
- Checked the events but did not find anything specific related to the issue we were facing.
Failover Clustering Operational Events:
- Checked the events but did not find anything specific related to the issue we were facing.
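For completeness, the Failover Clustering and CSV events referenced in this report can be pulled directly with wevtutil instead of browsing Event Viewer. A minimal sketch, assuming Python is available on the node; the event IDs are the ones quoted in this report:

```python
import subprocess

# XPath filter for the event IDs discussed in this report:
# 1135 (node removed), 5120 (CSV paused), 5377 (internal operation timeout),
# 1146 (RHS terminated).
xpath = "*[System[(EventID=1135 or EventID=5120 or EventID=5377 or EventID=1146)]]"

# Query the System log, newest events first, plain-text output, up to 50 events.
result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{xpath}", "/f:text", "/c:50", "/rd:true"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```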
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
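The driver inventory above can be reproduced on any node with driverquery, which makes it easy to compare nodes or re-check after driver updates. A minimal sketch, assuming Python is available; column names may differ on non-English installations:

```python
import csv
import io
import subprocess

# Dump all installed drivers with verbose details as CSV.
raw = subprocess.run(["driverquery", "/v", "/fo", "csv"],
                     capture_output=True, text=True, check=True).stdout

# Print module name, link date and file path so old link dates stand out.
for row in csv.DictReader(io.StringIO(raw)):
    name = row.get("Module Name", "?")
    link_date = row.get("Link Date", "?")
    path = row.get("Path", "?")
    print(f"{name:<15} {link_date:<22} {path}")
```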
Cluster Logs:
_________________________________________________________________________________________
System Information: ABCH11
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCH11
System Manufacturer Cisco Systems Inc
System Model UCSB-B200-M4
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015
System Events:
- Checked the events on the other node of the cluster to understand what was going on from the cluster's perspective at the time of the issue.
- Found an event stating that node ABCH17 was removed from the active cluster membership. This event was generated after we restarted the host (a sketch for reviewing the cluster heartbeat thresholds follows the events below).
- While the node-removal process was in progress, we saw a deadlock in RHS, as the Cluster Shared Volume went into a paused state and was unable to come back online.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:14:24 AM | Critical | ABCH11.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH17' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of '120' seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume 'Volume1' ('Cluster Disk 1') has entered a paused state because of '(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
12/7/2016 | 9:23:53 AM | Warning | ABCH11.xyz.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV1, DeviceName: \Device\HarddiskVolume6. (A device which does not exist was specified.)
12/7/2016 | 9:23:53 AM | Critical | ABCH11.xyz.local | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
- Since there was a deadlock on the Cluster Disk, the Cluster service terminated on the node and the cluster went down.
12/7/2016 | 9:23:53 AM | Information | ABCH11.xyz.local | 7036 | Service Control Manager | The Cluster Service service entered the stopped state.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: An internal error occurred.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
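Event 1135 is typically raised when a node misses enough cluster heartbeats, so it is worth recording the heartbeat delay and threshold settings that were in effect at the time. A minimal sketch, assuming Python and the FailoverClusters PowerShell module are available on a cluster node:

```python
import subprocess

# Show the heartbeat settings that govern when a node is declared down
# (SameSubnetDelay/Threshold and CrossSubnetDelay/Threshold).
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-Cluster | Format-List *SubnetDelay*, *SubnetThreshold*"],
    check=True)
```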
Application Events:
- Checked the Application logs and found nothing specific, except that the VSS service was shutting down from time to time, which generally points to a running backup operation.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 2:48:12 AM | Error | ABCH11.xyz.local | 2006 | Microsoft-Windows-PerfNet | Unable to read Server Queue performance data from the Server service. The first four bytes (DWORD) of the Data section contains the status code, the second four bytes contains the IOSB.Status and the next four bytes contains the IOSB.Information.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
7/12/2013 22:47 | (1.0:0.254) | (1.0:0.254) | PMC-Sierra | PMC-Sierra Storport Driver For SPC8x6G SAS/SATA controller
7/9/2013 1:50 | (7.2:0.30261) | (7.2:0.30261) | PMC-Sierra, Inc. | Adaptec SAS RAID WS03 Driver
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio® T4 Chipset
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
Cluster Logs:
Checked the cluster logs at the time of the issue and found that there was a deadlock in the resource manager, after which the Cluster service terminated. The relevant excerpts are below (a sketch for regenerating and scanning the cluster logs follows the excerpt):
348278 000009bc.00003464::2016/12/07-09:05:43.356 INFO [RCM-plcmt] removing outranked 1 to 0 candidate: NodeCandidate(10) banCode: (0) lastRank:GroupCountRanker=0
348279 000009bc.00003464::2016/12/07-09:05:43.356 INFO [RCM-plcmt] removing outranked 1 to 0 candidate: NodeCandidate(6) banCode: (0) lastRank:GroupCountRanker=0
348280 000009bc.00003464::2016/12/07-09:05:43.356 INFO [RCM-plcmt] placement manager result: grp=1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Manual, node=9
348281 000009bc.00003464::2016/12/07-09:05:43.356 INFO MTimer(GetPlacementAsDirector): [AntiAffinityFilter to StmFilter : 16 ms
348282 000009bc.00003464::2016/12/07-09:05:43.356 INFO MTimer(GetPlacementAsDirector): [Total: 16 ms ( 0 s )]
348283 000009bc.00003464::2016/12/07-09:05:43.356 INFO [GUM] Node 6: Executing locally gumId: 141210, updates: 1, first action: /dm/update
348284 000009bc.00001b0c::2016/12/07-09:05:43.356 ERR [RCM] [GIM] Cant remove provisional info of group 1b6b4f65-9220-449b-93a0-6065958a5cfe from node 9
348285 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter NodeDownFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348286 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter NodeShuttingDownFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348287 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter CurrentNodeFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348288 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] removing banned candidate: NodeCandidate(8) banCode: CurrentNodeFilter (5016)
348289 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter PausedNodeFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348290 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter PossibleOwnerFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
349624 000009bc.00000c04::2016/12/07-09:05:43.528 INFO [GUM] Node 6: Executing locally gumId: 141218, updates: 1, first action: /rcm/gum/GroupMoveOperation
349625 000009bc.00000c04::2016/12/07-09:05:43.528 INFO [RCM] rcm::RcmGum::GroupMoveOperation(1)
349626 000009bc.00000ff4::2016/12/07-09:05:43.528 WARN [RCM] rcm::RcmApi::GetResourceState: retrying: 1b6b4f65-9220-449b-93a0-6065958a5cfe, 5908.
349627 000009bc.00000c04::2016/12/07-09:05:43.528 INFO [RCM] move of group 1b6b4f65-9220-449b-93a0-6065958a5cfe from ABCH17(8) to ABCH14(9) of type MoveType::Drain is about to succeed, failoverCount=1, lastFailoverTime=1601/01/01-00:00:00.000 targeted=true
349628 000009bc.00003464::2016/12/07-09:05:43.528 INFO [RCM] moved 0 tasks from staging set to task set. TaskSetSize=0
353919 000009bc.000014b0::2016/12/07-09:14:09.606 WARN [CHANNEL fe80::98ad:7066:5b87:38b0%49:~3343~] failure, status (10054)
353920 000009bc.000014b0::2016/12/07-09:14:09.606 INFO [PULLER ABCH17] Parent stream has been closed.
353921 000009bc.000014b0::2016/12/07-09:14:09.606 ERR [NODE] Node 6: Connection to Node 8 is broken. Reason Closed(1236)’ because of ‘channel to remote endpoint fe80::98ad:7066:5b87:38b0%49:~3343~ has failed with status (10054)’
353922 000009bc.000014b0::2016/12/07-09:14:09.606 WARN [NODE] Node 6: Initiating reconnect with n8.
359146 000009bc.0000223c::2016/12/07-09:23:43.739 INFO [NM] Received request from client address ABCH11.
359147 000018c8.00001bd4::2016/12/07-09:23:44.208 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e85cac7f-c9b0-4dcf-86c2-544d8581d574:Netbios
359148 000018ec.00002820::2016/12/07-09:23:44.473 INFO [RES] Physical Disk <Cluster Disk 1>: VolumeIsNtfs: Volume \\?\GLOBALROOT\Device\Harddisk1\ClusterPartition2\ has FS type NTFS
359149 000018c8.00001bd4::2016/12/07-09:23:46.318 INFO [RES] Network Name <Cluster Name>: Dns: HealthCheck: ABCHVR2CLUSTER
359150 000018c8.00001bd4::2016/12/07-09:23:46.318 INFO [RES] Network Name <Cluster Name>: Dns: End of Slow Operation, state: Initialized/Reading, prevWorkState: Reading
359151 000018c8.00001bd4::2016/12/07-09:23:49.208 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e85cac7f-c9b0-4dcf-86c2-544d8581d574:Netbios
359152 000009bc.000037c8::2016/12/07-09:23:53.068 ERR [UserModeMonitor] Possibly DEADLOCK. No proggress reported for component GetSingleState in last 120 sec. Initial thread: 2696, started: 2016/12/07-09:21:53.062. Terminating the clussvc process.
359153 000009bc.000037c8::2016/12/07-09:23:53.068 ERR UserModeMonitor timeout (status = 1359)
359154 000014e0.000014dc::2016/12/07-09:23:53.068 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
359155 00000e20.00001110::2016/12/07-09:23:53.068 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
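The deadlock entries above come from the generated cluster log. To reproduce the collection, the logs can be regenerated for a chosen time window and scanned for the watchdog signatures. A minimal sketch, assuming Python and the FailoverClusters PowerShell module are available; the destination path and 60-minute window are illustrative:

```python
import glob
import subprocess

# Generate fresh cluster logs (last 60 minutes) from every node into C:\Temp.
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-ClusterLog -Destination C:\\Temp -TimeSpan 60"],
    check=True)

# Scan the collected per-node logs for the deadlock / watchdog lines seen above.
for path in glob.glob(r"C:\Temp\*cluster.log"):
    with open(path, errors="ignore") as log:
        for line in log:
            if "DEADLOCK" in line or "UserModeMonitor" in line:
                print(f"{path}: {line.strip()}")
```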
______________________________________________________________________________
System Information: ABCH12
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCH12
System Manufacturer Cisco Systems Inc
System Model UCSB-B200-M4
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015
System Events:
- Checked the logs of another node just to confirm.
- Based on the logs the issue appears to be the same: first, node ABCH17 was removed after the restart, and just after that Cluster Shared Volume 1 went into a paused state.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:14:24 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH17' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
12/7/2016 | 9:17:50 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH17' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
12/7/2016 | 9:23:58 AM | Error | ABCH12.xyz.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume 'Volume1' ('Cluster Disk 1') has entered a paused state because of '(c000026e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
12/7/2016 | 9:24:05 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH11' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
- We found that node ABCH11 also dropped out of the cluster membership just after Volume 1 went into a paused state.
12/7/2016 | 9:27:00 AM | Error | ABCH12.xyz.local | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of '120' seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
12/7/2016 | 9:27:00 AM | Error | ABCH12.xyz.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume 'Volume5' ('Cluster Disk 5') has entered a paused state because of '(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
12/7/2016 | 9:27:00 AM | Warning | ABCH12.xyz.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV5, DeviceName: \Device\HarddiskVolume4. (A device which does not exist was specified.)
12/7/2016 | 9:27:00 AM | Critical | ABCH12.xyz.local | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Application Events:
- Found the same event for the VSS service shutting down, which again points to a probable ongoing backup operation.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:35:18 AM | Information | ABCH12.xyz.local | 8224 | VSS | The VSS service is shutting down due to idle timeout.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
7/12/2013 22:47 | (1.0:0.254) | (1.0:0.254) | PMC-Sierra | PMC-Sierra Storport Driver For SPC8x6G SAS/SATA controller
7/9/2013 1:50 | (7.2:0.30261) | (7.2:0.30261) | PMC-Sierra, Inc. | Adaptec SAS RAID WS03 Driver
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio® T4 Chipset
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
_______________________________________________________________________________
Conclusion:
- After analyzing the logs we found that there was a deadlock in the Cluster service, due to which the Cluster service terminated on the nodes that are part of the cluster, and the virtual machines that were running went into a failed state. Based on this trend the issue appears to be a network misconfiguration: when a live migration task is initiated it consumes a large amount of bandwidth, and since the cluster network and the iSCSI network were both allowed to carry cluster traffic, the Cluster Shared Volume went into a failed state. (A sketch for reviewing the cluster network roles follows this point.)
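To verify the suspected misconfiguration, the role assigned to each cluster network should be reviewed; if the iSCSI network is currently allowed to carry cluster traffic, it can be excluded. A minimal sketch, assuming Python and the FailoverClusters PowerShell module are available; the network name 'iSCSI' is illustrative and must be replaced with the actual name shown by the first command:

```python
import subprocess

# List every cluster network with its role:
# 0 = not used by the cluster, 1 = cluster traffic only, 3 = cluster and client.
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-ClusterNetwork | Format-Table Name, Role, Address -AutoSize"],
    check=True)

# Exclude the iSCSI network from cluster communication (illustrative name).
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "(Get-ClusterNetwork -Name 'iSCSI').Role = 0"],
    check=True)
```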
- There are no events that indicate a network or hardware failure. However, updating the network components is recommended.
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot will be required for the changes to take effect. Follow the article and make sure all these updates are installed on all the nodes (a sketch for verifying installed hotfixes follows the link):
Updates for Cluster Binaries for 2012 R2
https://support.microsoft.com/en-us/kb/2920151
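Before and after applying the update package, the installed hotfixes on each node can be compared against the list from the article. A minimal sketch, assuming Python is available; the KB number below is a placeholder and must be replaced with the actual IDs from KB2920151:

```python
import subprocess

# Replace with the actual KB IDs listed in the linked article (KB2920151).
required_kbs = ["KB0000000"]

# Get-HotFix reports the updates installed on the local node.
installed = set(subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-HotFix | Select-Object -ExpandProperty HotFixID"],
    capture_output=True, text=True, check=True).stdout.split())

for kb in required_kbs:
    print(f"{kb}: {'installed' if kb in installed else 'MISSING'}")
```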
- Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: this step is the most critical when dealing with network connectivity issues.
Investigation of Network Issues:
We need to investigate the network connectivity issues with the help of the in-house networking team.
In order to avoid this issue in the future, the most critical part is to diagnose and investigate the persistent network connectivity issues on the cluster networks.
We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.
We also need to check hubs, switches, or bridges in the networks that connect the nodes.
We need to check for switch delays and proxy ARP with the help of the in-house networking team (a basic reachability and packet-loss sketch is included below).
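As a first pass before engaging the networking team, basic reachability, latency, and packet loss between the nodes can be sampled from any one node. A minimal sketch, assuming Python is available; the node names are taken from this report and the 50-ping sample is illustrative (it will not catch intermittent drops, so a longer capture is still required):

```python
import subprocess

# Cluster nodes referenced in this report.
nodes = ["ABCH11", "ABCH12", "ABCH14", "ABCH17"]

for node in nodes:
    # 50 echo requests per node; print only the loss and latency summary lines.
    result = subprocess.run(["ping", "-n", "50", node],
                            capture_output=True, text=True)
    summary = [line.strip() for line in result.stdout.splitlines()
               if "Lost" in line or "Average" in line]
    print(node)
    for line in summary:
        print("  " + line)
```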
- As already explained, the changes that have been made to the network configuration should help with the issue we are facing.