Issue Description:
Virtual machines went into a Saved state after node failures on cluster ABCHVR2CLUSTER, which runs Microsoft Windows Server 2012 R2 Datacenter, Version 6.3.9600 Build 9600.
__________________________________________________________________________________________
System Information: ABCH17
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCH17
System Manufacturer Cisco Systems Inc
System Model UCSB-B200-M4
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015
SMBIOS Version 2.8
System Events:
- Checked the events on machine ABCH17 and found that live migration of the VMs failed, due to which the VMs went into a failed state (a sketch for checking the simultaneous live-migration limit follows the events below).
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:05:45 AM | Warning | ABCH17.xyz.local | 21501 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of 'SCVMM ABCS85' failed. Virtual machine migration operation for 'ABCS85' failed at migration source 'ABCH17'. (Virtual machine ID F64344B4-091A-46F2-858A-CF4BB8408337) Failed to perform migration on virtual machine 'ABCS85' because virtual machine migration limit '5' was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID F64344B4-091A-46F2-858A-CF4BB8408337)
12/7/2016 | 9:05:45 AM | Warning | ABCH17.xyz.local | 21501 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of 'Virtual Machine ABCS39' failed. Virtual machine migration operation for 'ABCS39' failed at migration source 'ABCH17'. (Virtual machine ID 2B45E171-9143-4258-8247-D4307F64EFD4) Failed to perform migration on virtual machine 'ABCS39' because virtual machine migration limit '5' was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID 2B45E171-9143-4258-8247-D4307F64EFD4)
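The 21501 warnings above show the host refusing further migrations once the simultaneous live-migration limit of 5 was reached. As a reference point, the configured limit can be checked and, after confirming the migration network can carry the extra load, adjusted on each host. This is only a minimal sketch, assuming Python and the Hyper-V PowerShell module are available on the node; the value 8 is purely illustrative:

```python
import subprocess

# Read the current simultaneous live-migration limit on this Hyper-V host.
current = subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "(Get-VMHost).MaximumVirtualMachineMigrations"],
    capture_output=True, text=True, check=True)
print("Current live-migration limit:", current.stdout.strip())

# Optionally raise the limit. 8 is an illustrative value, not a recommendation;
# only change it once the migration network has been validated for the extra load.
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Set-VMHost -MaximumVirtualMachineMigrations 8"],
    check=True)
```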
- Checked the events for the NIC disconnect, which most likely occurred because the machine had been restarted.
12/7/2016 | 9:13:50 AM | Warning | ABCH17.xyz.local | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider | Member Nic {f9210cd5-36f5-4a2f-8f2c-0aa45563ca85} Disconnected.
Application Events:
- Checked the events but did not find anything specific related to the issue we were facing.
Failover Clustering Operational Events:
- Checked the events but did not find anything specific related to the issue we were facing.
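For completeness, the Failover Clustering and CSV events referenced in this report can be pulled directly with wevtutil instead of browsing Event Viewer. A minimal sketch, assuming Python is available on the node; the event IDs are the ones quoted in this report:

```python
import subprocess

# XPath filter for the event IDs discussed in this report:
# 1135 (node removed), 5120 (CSV paused), 5377 (internal operation timeout),
# 1146 (RHS terminated).
xpath = "*[System[(EventID=1135 or EventID=5120 or EventID=5377 or EventID=1146)]]"

# Query the System log, newest events first, plain-text output, up to 50 events.
result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{xpath}", "/f:text", "/c:50", "/rd:true"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```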
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
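The driver inventory above can be reproduced on any node with driverquery, which makes it easy to compare nodes or re-check after driver updates. A minimal sketch, assuming Python is available; column names may differ on non-English installations:

```python
import csv
import io
import subprocess

# Dump all installed drivers with verbose details as CSV.
raw = subprocess.run(["driverquery", "/v", "/fo", "csv"],
                     capture_output=True, text=True, check=True).stdout

# Print module name, link date and file path so old link dates stand out.
for row in csv.DictReader(io.StringIO(raw)):
    name = row.get("Module Name", "?")
    link_date = row.get("Link Date", "?")
    path = row.get("Path", "?")
    print(f"{name:<15} {link_date:<22} {path}")
```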
Cluster Logs:
_________________________________________________________________________________________
System Information: ABCH11
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCH11
System Manufacturer Cisco Systems Inc
System Model UCSB-B200-M4
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015
System Events:
- Checked the events on the other node of the cluster to understand what was going on from the cluster's perspective at the time of the issue.
- Found an event stating that node ABCH17 was removed from the active cluster membership. This event was generated after we restarted the host (a sketch for reviewing the cluster heartbeat thresholds follows the events below).
- While the node-removal process was in progress, we saw a deadlock in RHS, as the Cluster Shared Volume went into a paused state and was unable to come back online.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:14:24 AM | Critical | ABCH11.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH17' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of '120' seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume 'Volume1' ('Cluster Disk 1') has entered a paused state because of '(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
12/7/2016 | 9:23:53 AM | Warning | ABCH11.xyz.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV1, DeviceName: \Device\HarddiskVolume6. (A device which does not exist was specified.)
12/7/2016 | 9:23:53 AM | Critical | ABCH11.xyz.local | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
- Since there was a deadlock on the Cluster Disk, the Cluster service terminated on the node and the cluster went down.
12/7/2016 | 9:23:53 AM | Information | ABCH11.xyz.local | 7036 | Service Control Manager | The Cluster Service service entered the stopped state.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 7024 | Service Control Manager | The Cluster Service service terminated with the following service-specific error: An internal error occurred.
12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
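Event 1135 is typically raised when a node misses enough cluster heartbeats, so it is worth recording the heartbeat delay and threshold settings that were in effect at the time. A minimal sketch, assuming Python and the FailoverClusters PowerShell module are available on a cluster node:

```python
import subprocess

# Show the heartbeat settings that govern when a node is declared down
# (SameSubnetDelay/Threshold and CrossSubnetDelay/Threshold).
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-Cluster | Format-List *SubnetDelay*, *SubnetThreshold*"],
    check=True)
```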
Application Events:
- Checked the Application logs and found nothing specific, except that the VSS service was shutting down from time to time, which generally points to a running backup operation.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 2:48:12 AM | Error | ABCH11.xyz.local | 2006 | Microsoft-Windows-PerfNet | Unable to read Server Queue performance data from the Server service. The first four bytes (DWORD) of the Data section contains the status code, the second four bytes contains the IOSB.Status and the next four bytes contains the IOSB.Information.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
7/12/2013 22:47 | (1.0:0.254) | (1.0:0.254) | PMC-Sierra | PMC-Sierra Storport Driver For SPC8x6G SAS/SATA controller
7/9/2013 1:50 | (7.2:0.30261) | (7.2:0.30261) | PMC-Sierra, Inc. | Adaptec SAS RAID WS03 Driver
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio® T4 Chipset
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
Cluster Logs:
Checked the cluster logs at the time of the issue and found that there was a deadlock in the resource manager, after which the Cluster service terminated. The relevant excerpts are below (a sketch for regenerating and scanning the cluster logs follows the excerpt):
348278 000009bc.00003464::2016/12/07-09:05:43.356 INFO [RCM-plcmt] removing outranked 1 to 0 candidate: NodeCandidate(10) banCode: (0) lastRank:GroupCountRanker=0
348279 000009bc.00003464::2016/12/07-09:05:43.356 INFO [RCM-plcmt] removing outranked 1 to 0 candidate: NodeCandidate(6) banCode: (0) lastRank:GroupCountRanker=0
348280 000009bc.00003464::2016/12/07-09:05:43.356 INFO [RCM-plcmt] placement manager result: grp=1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Manual, node=9
348281 000009bc.00003464::2016/12/07-09:05:43.356 INFO MTimer(GetPlacementAsDirector): [AntiAffinityFilter to StmFilter : 16 ms
348282 000009bc.00003464::2016/12/07-09:05:43.356 INFO MTimer(GetPlacementAsDirector): [Total: 16 ms ( 0 s )]
348283 000009bc.00003464::2016/12/07-09:05:43.356 INFO [GUM] Node 6: Executing locally gumId: 141210, updates: 1, first action: /dm/update
348284 000009bc.00001b0c::2016/12/07-09:05:43.356 ERR [RCM] [GIM] Cant remove provisional info of group 1b6b4f65-9220-449b-93a0-6065958a5cfe from node 9
348285 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter NodeDownFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348286 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter NodeShuttingDownFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348287 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter CurrentNodeFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348288 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] removing banned candidate: NodeCandidate(8) banCode: CurrentNodeFilter (5016)
348289 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter PausedNodeFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
348290 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO [RCM-plcmt] applying filter PossibleOwnerFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain
349624 000009bc.00000c04::2016/12/07-09:05:43.528 INFO [GUM] Node 6: Executing locally gumId: 141218, updates: 1, first action: /rcm/gum/GroupMoveOperation
349625 000009bc.00000c04::2016/12/07-09:05:43.528 INFO [RCM] rcm::RcmGum::GroupMoveOperation(1)
349626 000009bc.00000ff4::2016/12/07-09:05:43.528 WARN [RCM] rcm::RcmApi::GetResourceState: retrying: 1b6b4f65-9220-449b-93a0-6065958a5cfe, 5908.
349627 000009bc.00000c04::2016/12/07-09:05:43.528 INFO [RCM] move of group 1b6b4f65-9220-449b-93a0-6065958a5cfe from ABCH17(8) to ABCH14(9) of type MoveType::Drain is about to succeed, failoverCount=1, lastFailoverTime=1601/01/01-00:00:00.000 targeted=true
349628 000009bc.00003464::2016/12/07-09:05:43.528 INFO [RCM] moved 0 tasks from staging set to task set. TaskSetSize=0
353919 000009bc.000014b0::2016/12/07-09:14:09.606 WARN [CHANNEL fe80::98ad:7066:5b87:38b0%49:~3343~] failure, status (10054)
353920 000009bc.000014b0::2016/12/07-09:14:09.606 INFO [PULLER ABCH17] Parent stream has been closed.
353921 000009bc.000014b0::2016/12/07-09:14:09.606 ERR [NODE] Node 6: Connection to Node 8 is broken. Reason Closed(1236)’ because of ‘channel to remote endpoint fe80::98ad:7066:5b87:38b0%49:~3343~ has failed with status (10054)’
353922 000009bc.000014b0::2016/12/07-09:14:09.606 WARN [NODE] Node 6: Initiating reconnect with n8.
359146 000009bc.0000223c::2016/12/07-09:23:43.739 INFO [NM] Received request from client address ABCH11.
359147 000018c8.00001bd4::2016/12/07-09:23:44.208 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e85cac7f-c9b0-4dcf-86c2-544d8581d574:Netbios
359148 000018ec.00002820::2016/12/07-09:23:44.473 INFO [RES] Physical Disk <Cluster Disk 1>: VolumeIsNtfs: Volume \\?\GLOBALROOT\Device\Harddisk1\ClusterPartition2\ has FS type NTFS
359149 000018c8.00001bd4::2016/12/07-09:23:46.318 INFO [RES] Network Name <Cluster Name>: Dns: HealthCheck: ABCHVR2CLUSTER
359150 000018c8.00001bd4::2016/12/07-09:23:46.318 INFO [RES] Network Name <Cluster Name>: Dns: End of Slow Operation, state: Initialized/Reading, prevWorkState: Reading
359151 000018c8.00001bd4::2016/12/07-09:23:49.208 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e85cac7f-c9b0-4dcf-86c2-544d8581d574:Netbios
359152 000009bc.000037c8::2016/12/07-09:23:53.068 ERR [UserModeMonitor] Possibly DEADLOCK. No proggress reported for component GetSingleState in last 120 sec. Initial thread: 2696, started: 2016/12/07-09:21:53.062. Terminating the clussvc process.
359153 000009bc.000037c8::2016/12/07-09:23:53.068 ERR UserModeMonitor timeout (status = 1359)
359154 000014e0.000014dc::2016/12/07-09:23:53.068 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
359155 00000e20.00001110::2016/12/07-09:23:53.068 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
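The deadlock entries above come from the generated cluster log. To reproduce the collection, the logs can be regenerated for a chosen time window and scanned for the watchdog signatures. A minimal sketch, assuming Python and the FailoverClusters PowerShell module are available; the destination path and 60-minute window are illustrative:

```python
import glob
import subprocess

# Generate fresh cluster logs (last 60 minutes) from every node into C:\Temp.
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-ClusterLog -Destination C:\\Temp -TimeSpan 60"],
    check=True)

# Scan the collected per-node logs for the deadlock / watchdog lines seen above.
for path in glob.glob(r"C:\Temp\*cluster.log"):
    with open(path, errors="ignore") as log:
        for line in log:
            if "DEADLOCK" in line or "UserModeMonitor" in line:
                print(f"{path}: {line.strip()}")
```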
______________________________________________________________________________
System Information: ABCH12
OS Name Microsoft Windows Server 2012 R2 Datacenter
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCH12
System Manufacturer Cisco Systems Inc
System Model UCSB-B200-M4
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)
BIOS Version/Date Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015
System Events:
- Checked the logs of another node just to confirm.
- Based on the logs the issue appears to be the same: first, node ABCH17 was removed after the restart, and just after that Cluster Shared Volume 1 went into a paused state.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:14:24 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH17' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
12/7/2016 | 9:17:50 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH17' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
12/7/2016 | 9:23:58 AM | Error | ABCH12.xyz.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume 'Volume1' ('Cluster Disk 1') has entered a paused state because of '(c000026e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
12/7/2016 | 9:24:05 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering | Cluster node 'ABCH11' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
- We found that node ABCH11 also dropped out of the cluster membership just after Volume 1 went into a paused state.
12/7/2016 | 9:27:00 AM | Error | ABCH12.xyz.local | 5377 | Microsoft-Windows-FailoverClustering | An internal Cluster service operation exceeded the defined threshold of '120' seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.
12/7/2016 | 9:27:00 AM | Error | ABCH12.xyz.local | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume 'Volume5' ('Cluster Disk 5') has entered a paused state because of '(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
12/7/2016 | 9:27:00 AM | Warning | ABCH12.xyz.local | 140 | Microsoft-Windows-Ntfs | The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV5, DeviceName: \Device\HarddiskVolume4. (A device which does not exist was specified.)
12/7/2016 | 9:27:00 AM | Critical | ABCH12.xyz.local | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Application Events:
- Found the same event for the VSS service shutting down, which again points to a probable ongoing backup operation.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description
12/7/2016 | 9:35:18 AM | Information | ABCH12.xyz.local | 8224 | VSS | The VSS service is shutting down due to idle timeout.
List of outdated drivers:
Time/Date String | Product Version | File Version | Company Name | File Description
7/12/2013 22:47 | (1.0:0.254) | (1.0:0.254) | PMC-Sierra | PMC-Sierra Storport Driver For SPC8x6G SAS/SATA controller
7/9/2013 1:50 | (7.2:0.30261) | (7.2:0.30261) | PMC-Sierra, Inc. | Adaptec SAS RAID WS03 Driver
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio® T4 Chipset
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
_______________________________________________________________________________
Conclusion:
- After analyzing the logs we found that there was a deadlock in the Cluster service, due to which the Cluster service terminated on the nodes that are part of the cluster, and the virtual machines that were running went into a failed state. Based on this trend the issue appears to be a network misconfiguration: when a live migration task is initiated it consumes a large amount of bandwidth, and since the cluster network and the iSCSI network were both allowed to carry cluster traffic, the Cluster Shared Volume went into a failed state. (A sketch for reviewing the cluster network roles follows this point.)
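To verify the suspected misconfiguration, the role assigned to each cluster network should be reviewed; if the iSCSI network is currently allowed to carry cluster traffic, it can be excluded. A minimal sketch, assuming Python and the FailoverClusters PowerShell module are available; the network name 'iSCSI' is illustrative and must be replaced with the actual name shown by the first command:

```python
import subprocess

# List every cluster network with its role:
# 0 = not used by the cluster, 1 = cluster traffic only, 3 = cluster and client.
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-ClusterNetwork | Format-Table Name, Role, Address -AutoSize"],
    check=True)

# Exclude the iSCSI network from cluster communication (illustrative name).
subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "(Get-ClusterNetwork -Name 'iSCSI').Role = 0"],
    check=True)
```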
- There are no events that indicate a network or hardware failure. However, updating the network components is recommended.
- Install the following hotfixes on all cluster nodes, one node at a time. A reboot will be required for the changes to take effect. Follow the article and make sure all these updates are installed on all the nodes (a sketch for verifying installed hotfixes follows the link):
Updates for Cluster Binaries for 2012 R2
https://support.microsoft.com/en-us/kb/2920151
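Before and after applying the update package, the installed hotfixes on each node can be compared against the list from the article. A minimal sketch, assuming Python is available; the KB number below is a placeholder and must be replaced with the actual IDs from KB2920151:

```python
import subprocess

# Replace with the actual KB IDs listed in the linked article (KB2920151).
required_kbs = ["KB0000000"]

# Get-HotFix reports the updates installed on the local node.
installed = set(subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     "Get-HotFix | Select-Object -ExpandProperty HotFixID"],
    capture_output=True, text=True, check=True).stdout.split())

for kb in required_kbs:
    print(f"{kb}: {'installed' if kb in installed else 'MISSING'}")
```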
- Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.
Please note: this step is the most critical when dealing with network connectivity issues.
Investigation of Network Issues:
We need to investigate the network connectivity issues with the help of the in-house networking team.
In order to avoid this issue in the future, the most critical part is to diagnose and investigate the persistent network connectivity issues on the cluster networks.
We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.
We also need to check hubs, switches, or bridges in the networks that connect the nodes.
We need to check for switch delays and proxy ARP with the help of the in-house networking team (a basic reachability and packet-loss sketch is included below).
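As a first pass before engaging the networking team, basic reachability, latency, and packet loss between the nodes can be sampled from any one node. A minimal sketch, assuming Python is available; the node names are taken from this report and the 50-ping sample is illustrative (it will not catch intermittent drops, so a longer capture is still required):

```python
import subprocess

# Cluster nodes referenced in this report.
nodes = ["ABCH11", "ABCH12", "ABCH14", "ABCH17"]

for node in nodes:
    # 50 echo requests per node; print only the loss and latency summary lines.
    result = subprocess.run(["ping", "-n", "50", node],
                            capture_output=True, text=True)
    summary = [line.strip() for line in result.stdout.splitlines()
               if "Lost" in line or "Average" in line]
    print(node)
    for line in summary:
        print("  " + line)
```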
- As already explained, the changes that have been made to the network configuration should help with the issue we are facing.