RCA-13 – VMs Went into Saved State after Node Failure

Issue Description:

Virtual machines went into a Saved state after a node failure on cluster ABCHVR2CLUSTER, which runs Microsoft Windows Server 2012 R2 Datacenter (Version 6.3.9600, Build 9600).

__________________________________________________________________________________________

System Information: ABCH17

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCH17

System Manufacturer        Cisco Systems Inc

System Model        UCSB-B200-M4

System Type        x64-based PC

System SKU       

Processor        Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)

BIOS Version/Date        Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015

SMBIOS Version        2.8

System Events:

  • Checked the events on machine ABCH17 and found that Live Migration had failed for the VMs, due to which the VMs went into a failed state.

Date | Time | Type/Level | Computer Name | Event Code | Source

12/7/2016 | 9:05:45 AM | Warning | ABCH17.xyz.local | 21501 | Microsoft-Windows-Hyper-V-High-Availability
Description: Live migration of ‘SCVMM ABCS85’ failed. Virtual machine migration operation for ‘ABCS85’ failed at migration source ‘ABCH17’. (Virtual machine ID F64344B4-091A-46F2-858A-CF4BB8408337) Failed to perform migration on virtual machine ‘ABCS85’ because virtual machine migration limit ‘5’ was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID F64344B4-091A-46F2-858A-CF4BB8408337)

12/7/2016 | 9:05:45 AM | Warning | ABCH17.xyz.local | 21501 | Microsoft-Windows-Hyper-V-High-Availability
Description: Live migration of ‘Virtual Machine ABCS39’ failed. Virtual machine migration operation for ‘ABCS39’ failed at migration source ‘ABCH17’. (Virtual machine ID 2B45E171-9143-4258-8247-D4307F64EFD4) Failed to perform migration on virtual machine ‘ABCS39’ because virtual machine migration limit ‘5’ was reached, please wait for completion of an ongoing migration operation. (Virtual machine ID 2B45E171-9143-4258-8247-D4307F64EFD4)
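The “migration limit ‘5’ was reached” text in event 21501 refers to the host’s cap on simultaneous live migrations. As a minimal sketch (assuming the Hyper-V PowerShell module is available on the node), the current cap can be read and, where the migration network has headroom, raised; the value 8 below is illustrative only:

# Show the current simultaneous live/storage migration limits on this host
Get-VMHost | Select-Object ComputerName, MaximumVirtualMachineMigrations, MaximumStorageMigrations
# Example only: raise the live migration cap after confirming network headroom
Set-VMHost -MaximumVirtualMachineMigrations 8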

  • Checked the events for the NIC disconnect, which probably occurred because the machine was restarted.

12/7/2016 | 9:13:50 AM | Warning | ABCH17.xyz.local | 16949 | Microsoft-Windows-MsLbfoSysEvtProvider
Description: Member Nic {f9210cd5-36f5-4a2f-8f2c-0aa45563ca85} Disconnected.
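Event 16949 identifies the disconnected team member only by GUID. A quick way to map it to a physical adapter and check team health is the built-in NIC teaming cmdlets (a sketch, assuming an LBFO team is configured on the host):

# Team status: Up, Degraded, or Down
Get-NetLbfoTeam | Format-Table Name, TeamingMode, Status
# Per-member state and failure reason (e.g. a physical media disconnect)
Get-NetLbfoTeamMember | Format-Table Name, Team, FailureReason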

Application Events:

  • Checked the events but was not able to find anything specific related to the issue that we were facing.

Failover Clustering Operational Events:

  • Checked the events but was not able to find anything specific related to the issue that we were facing.

List of outdated drivers:

Time/Date String | Product Version | File Version | Company Name | File Description
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver
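For reference, a driver inventory like the one above can be pulled directly from a node and compared against the vendors’ current releases. A minimal sketch using the Win32_PnPSignedDriver WMI class:

# List the 20 oldest signed drivers on this node
Get-CimInstance Win32_PnPSignedDriver |
    Where-Object DriverDate |
    Sort-Object DriverDate |
    Select-Object -First 20 DeviceName, Manufacturer, DriverVersion, DriverDate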

Cluster Logs:

_________________________________________________________________________________________

System Information: ABCH11

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCH11

System Manufacturer        Cisco Systems Inc

System Model        UCSB-B200-M4

System Type        x64-based PC

System SKU       

Processor        Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)

BIOS Version/Date        Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015

System Events:

  • Checked the events on the other node of the cluster to understand what was happening from the cluster perspective at the time of the issue.
  • Found an event indicating that node ABCH17 went out of the cluster membership. This event was generated after we restarted the host.
  • While the node-removal process was in progress, we saw a deadlock on RHS, as the Cluster Shared Volume went into a Paused state and was not able to come online.

Date | Time | Type/Level | Computer Name | Event Code | Source

12/7/2016 | 9:14:24 AM | Critical | ABCH11.xyz.local | 1135 | Microsoft-Windows-FailoverClustering
Description: Cluster node ‘ABCH17’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 5377 | Microsoft-Windows-FailoverClustering
Description: An internal Cluster service operation exceeded the defined threshold of ‘120’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 5120 | Microsoft-Windows-FailoverClustering
Description: Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) has entered a paused state because of ‘(c000000e)’. All I/O will temporarily be queued until a path to the volume is reestablished.

12/7/2016 | 9:23:53 AM | Warning | ABCH11.xyz.local | 140 | Microsoft-Windows-Ntfs
Description: The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV1, DeviceName: \Device\HarddiskVolume6. (A device which does not exist was specified.)

12/7/2016 | 9:23:53 AM | Critical | ABCH11.xyz.local | 1146 | Microsoft-Windows-FailoverClustering
Description: The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

  • Since there was a deadlock on the cluster disk, the Cluster service terminated on the nodes and the cluster went down.

12/7/2016 | 9:23:53 AM | Information | ABCH11.xyz.local | 7036 | Service Control Manager
Description: The Cluster Service service entered the stopped state.

12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 7024 | Service Control Manager
Description: The Cluster Service service terminated with the following service-specific error: An internal error occurred.

12/7/2016 | 9:23:53 AM | Error | ABCH11.xyz.local | 7031 | Service Control Manager
Description: The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
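When a CSV sits in a Paused state like this, its status and any I/O redirection can be confirmed from any surviving node. A minimal sketch, assuming the FailoverClusters PowerShell module:

# CSV resource state and current owner
Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode
# Per-node CSV state, including redirected-I/O reasons
Get-ClusterSharedVolumeState | Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason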

Application Events:

  • Checked the Application logs and found nothing specific, except that the VSS service was shutting down from time to time, which generally points to a running backup operation.

Date | Time | Type/Level | Computer Name | Event Code | Source

12/7/2016 | 2:48:12 AM | Error | ABCH11.xyz.local | 2006 | Microsoft-Windows-PerfNet
Description: Unable to read Server Queue performance data from the Server service. The first four bytes (DWORD) of the Data section contains the status code, the second four bytes contains the IOSB.Status and the next four bytes contains the IOSB.Information.
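To confirm whether a backup was in flight around the time of the issue, the VSS writer state can be checked on the node (a sketch using the built-in vssadmin tool, run from an elevated prompt):

# Writers report a waiting/in-progress state while a backup is running
vssadmin list writers
# Existing shadow copies, with their creation times
vssadmin list shadows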

List of outdated drivers:

Time/Date String | Product Version | File Version | Company Name | File Description
7/12/2013 22:47 | (1.0:0.254) | (1.0:0.254) | PMC-Sierra | PMC-Sierra Storport Driver For SPC8x6G SAS/SATA controller
7/9/2013 1:50 | (7.2:0.30261) | (7.2:0.30261) | PMC-Sierra, Inc. | Adaptec SAS RAID WS03 Driver
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio® T4 Chipset
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver

Cluster Logs:

Checked the cluster logs at the time of the issue and found that there was a deadlock on the resource manager, after which the Cluster service terminated. The relevant entries follow:

348278 000009bc.00003464::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] removing outranked 1 to 0 candidate: NodeCandidate(10) banCode: (0) lastRank:GroupCountRanker=0

348279 000009bc.00003464::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] removing outranked 1 to 0 candidate: NodeCandidate(6) banCode: (0) lastRank:GroupCountRanker=0

348280 000009bc.00003464::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] placement manager result:  grp=1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Manual, node=9

348281 000009bc.00003464::2016/12/07-09:05:43.356 INFO  MTimer(GetPlacementAsDirector): [AntiAffinityFilter to StmFilter : 16 ms

348282 000009bc.00003464::2016/12/07-09:05:43.356 INFO  MTimer(GetPlacementAsDirector): [Total: 16 ms ( 0 s )]

348283 000009bc.00003464::2016/12/07-09:05:43.356 INFO  [GUM] Node 6: Executing locally gumId: 141210, updates: 1, first action: /dm/update

348284 000009bc.00001b0c::2016/12/07-09:05:43.356 ERR   [RCM] [GIM] Cant remove provisional info of group 1b6b4f65-9220-449b-93a0-6065958a5cfe from node 9

348285 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] applying filter NodeDownFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain

348286 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] applying filter NodeShuttingDownFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain

348287 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] applying filter CurrentNodeFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain

348288 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] removing banned candidate: NodeCandidate(8) banCode: CurrentNodeFilter (5016)

348289 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] applying filter PausedNodeFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain

348290 000009bc.00001b0c::2016/12/07-09:05:43.356 INFO  [RCM-plcmt] applying filter PossibleOwnerFilter to group 1b6b4f65-9220-449b-93a0-6065958a5cfe moveType=MoveType::Drain

349624 000009bc.00000c04::2016/12/07-09:05:43.528 INFO  [GUM] Node 6: Executing locally gumId: 141218, updates: 1, first action: /rcm/gum/GroupMoveOperation

349625 000009bc.00000c04::2016/12/07-09:05:43.528 INFO  [RCM] rcm::RcmGum::GroupMoveOperation(1)

349626 000009bc.00000ff4::2016/12/07-09:05:43.528 WARN  [RCM] rcm::RcmApi::GetResourceState: retrying: 1b6b4f65-9220-449b-93a0-6065958a5cfe, 5908.

349627 000009bc.00000c04::2016/12/07-09:05:43.528 INFO  [RCM] move of group 1b6b4f65-9220-449b-93a0-6065958a5cfe from ABCH17(8) to ABCH14(9) of type MoveType::Drain is about to succeed, failoverCount=1, lastFailoverTime=1601/01/01-00:00:00.000 targeted=true

349628 000009bc.00003464::2016/12/07-09:05:43.528 INFO  [RCM] moved 0 tasks from staging set to task set.  TaskSetSize=0

353919 000009bc.000014b0::2016/12/07-09:14:09.606 WARN  [CHANNEL fe80::98ad:7066:5b87:38b0%49:~3343~] failure, status (10054)

353920 000009bc.000014b0::2016/12/07-09:14:09.606 INFO  [PULLER ABCH17] Parent stream has been closed.

353921 000009bc.000014b0::2016/12/07-09:14:09.606 ERR   [NODE] Node 6: Connection to Node 8 is broken. Reason Closed(1236)’ because of ‘channel to remote endpoint fe80::98ad:7066:5b87:38b0%49:~3343~ has failed with status (10054)’

353922 000009bc.000014b0::2016/12/07-09:14:09.606 WARN  [NODE] Node 6: Initiating reconnect with n8.

359146 000009bc.0000223c::2016/12/07-09:23:43.739 INFO  [NM] Received request from client address ABCH11.

359147 000018c8.00001bd4::2016/12/07-09:23:44.208 INFO  [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e85cac7f-c9b0-4dcf-86c2-544d8581d574:Netbios

359148 000018ec.00002820::2016/12/07-09:23:44.473 INFO  [RES] Physical Disk <Cluster Disk 1>: VolumeIsNtfs: Volume \\?\GLOBALROOT\Device\Harddisk1\ClusterPartition2\ has FS type NTFS

359149 000018c8.00001bd4::2016/12/07-09:23:46.318 INFO  [RES] Network Name <Cluster Name>: Dns: HealthCheck: ABCHVR2CLUSTER

359150 000018c8.00001bd4::2016/12/07-09:23:46.318 INFO  [RES] Network Name <Cluster Name>: Dns: End of Slow Operation, state: Initialized/Reading, prevWorkState: Reading

359151 000018c8.00001bd4::2016/12/07-09:23:49.208 INFO  [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:e85cac7f-c9b0-4dcf-86c2-544d8581d574:Netbios

359152 000009bc.000037c8::2016/12/07-09:23:53.068 ERR   [UserModeMonitor] Possibly DEADLOCK. No proggress reported for component GetSingleState in last 120 sec. Initial thread: 2696, started: 2016/12/07-09:21:53.062. Terminating the clussvc process.

359153 000009bc.000037c8::2016/12/07-09:23:53.068 ERR   UserModeMonitor timeout (status = 1359)

359154 000014e0.000014dc::2016/12/07-09:23:53.068 WARN  [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.

359155 00000e20.00001110::2016/12/07-09:23:53.068 WARN  [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
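For reference, cluster log excerpts like the ones above can be regenerated on demand; Get-ClusterLog dumps each node's log to a text file. A minimal sketch (the destination folder is an assumption):

# Generate cluster logs from every node into C:\Temp, using local time stamps
Get-ClusterLog -Destination C:\Temp -UseLocalTime
# Or capture only the window around the incident, e.g. the last 60 minutes
Get-ClusterLog -Destination C:\Temp -TimeSpan 60 -UseLocalTime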

______________________________________________________________________________

System Information: ABCH12

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCH12

System Manufacturer        Cisco Systems Inc

System Model        UCSB-B200-M4

System Type        x64-based PC

System SKU       

Processor        Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 2594 Mhz, 12 Core(s), 24 Logical Processor(s)

BIOS Version/Date        Cisco Systems, Inc. B200M4.3.1.1a.0.121720151230, 17/12/2015

System Events:

  • Checked the logs of another node (ABCH12) to confirm.
  • Based on the logs, the issue appears to be the same: first the node was removed after the restart, just after which Cluster Shared Volume 1 went into a Paused state.

Date | Time | Type/Level | Computer Name | Event Code | Source

12/7/2016 | 9:14:24 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering
Description: Cluster node ‘ABCH17’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

12/7/2016 | 9:17:50 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering
Description: Cluster node ‘ABCH17’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

12/7/2016 | 9:23:58 AM | Error | ABCH12.xyz.local | 5120 | Microsoft-Windows-FailoverClustering
Description: Cluster Shared Volume ‘Volume1’ (‘Cluster Disk 1’) has entered a paused state because of ‘(c000026e)’. All I/O will temporarily be queued until a path to the volume is reestablished.

12/7/2016 | 9:24:05 AM | Critical | ABCH12.xyz.local | 1135 | Microsoft-Windows-FailoverClustering
Description: Cluster node ‘ABCH11’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

  • We found that node ABCH11 also went out of the cluster membership just after Volume 1 went into a Paused state.

12/7/2016 | 9:27:00 AM | Error | ABCH12.xyz.local | 5377 | Microsoft-Windows-FailoverClustering
Description: An internal Cluster service operation exceeded the defined threshold of ‘120’ seconds. The Cluster service has been terminated to recover. Service Control Manager will restart the Cluster service and the node will rejoin the cluster.

12/7/2016 | 9:27:00 AM | Error | ABCH12.xyz.local | 5120 | Microsoft-Windows-FailoverClustering
Description: Cluster Shared Volume ‘Volume5’ (‘Cluster Disk 5’) has entered a paused state because of ‘(c000000e)’. All I/O will temporarily be queued until a path to the volume is reestablished.

12/7/2016 | 9:27:00 AM | Warning | ABCH12.xyz.local | 140 | Microsoft-Windows-Ntfs
Description: The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV5, DeviceName: \Device\HarddiskVolume4. (A device which does not exist was specified.)

12/7/2016 | 9:27:00 AM | Critical | ABCH12.xyz.local | 1146 | Microsoft-Windows-FailoverClustering
Description: The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

Application Events:

  • Found the same event for the VSS service shutting down, which points to a probable ongoing backup operation.

Date | Time | Type/Level | Computer Name | Event Code | Source

12/7/2016 | 9:35:18 AM | Information | ABCH12.xyz.local | 8224 | VSS
Description: The VSS service is shutting down due to idle timeout.

List of outdated drivers:


Time/Date String | Product Version | File Version | Company Name | File Description
7/12/2013 22:47 | (1.0:0.254) | (1.0:0.254) | PMC-Sierra | PMC-Sierra Storport Driver For SPC8x6G SAS/SATA controller
7/9/2013 1:50 | (7.2:0.30261) | (7.2:0.30261) | PMC-Sierra, Inc. | Adaptec SAS RAID WS03 Driver
4/8/2013 0:02 | (6.3:9374.0) | (6.3:9374.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
3/27/2013 21:08 | (6.3:9367.0) | (6.3:9367.0) | Brocade Communications Systems, Inc. | Brocade FC/FCoE HBA Stor Miniport Driver
1/31/2013 18:14 | (6.2:9304.0) | (7.4:5.0) | Broadcom Corporation | Broadcom NetXtreme Unified Crash Dump (x64)
1/31/2013 18:16 | (6.2:9304.0) | (7.4:3.0) | Broadcom Corporation | Broadcom NetXtreme FCoE Crash Dump (x64)
2/4/2013 21:38 | (6.2:9200.16384) | (7.4:6.0) | Broadcom Corporation | FCoE offload x64 FREE
2/4/2013 21:40 | (6.2:9200.16384) | (7.4:4.0) | Broadcom Corporation | iSCSI offload x64 FREE
2/4/2013 19:47 | (7.4:14.0) | (7.4:14.0) | Broadcom Corporation | Broadcom NetXtreme II GigE VBD
5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio® T4 Chipset
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
6/11/2013 21:21 | (2.74:214.4) | (2.74:214.4) | Emulex | Emulex Storport Miniport Driver
4/8/2013 15:30 | (7.4:33.1) | (7.4:33.1) | Broadcom Corporation | Broadcom NetXtreme II 10 GigE VBD
6/3/2013 22:08 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic Fibre Channel Stor Miniport Inbox Driver
3/25/2013 22:43 | (2.1:5.0) | (2.1:5.0) | QLogic Corporation | QLogic iSCSI Storport Miniport Inbox Driver
6/7/2013 20:07 | (9.1:11.3) | (9.1:11.3) | QLogic Corporation | QLogic FCoE Stor Miniport Inbox Driver
9/24/2008 19:28 | (5.1:1039.2600) | (5.1:1039.2600) | Silicon Integrated Systems Corp. | SiS RAID Stor Miniport Driver
10/1/2008 22:56 | (6.1:6918.0) | (5.1:1039.3600) | Silicon Integrated Systems | SiS AHCI Stor-Miniport Driver
11/27/2012 0:02 | (5.1:0.10) | (5.1:0.10) | Promise Technology, Inc. | Promise SuperTrak EX Series Driver for Windows x64
11/4/2010 18:33 | (4.2:0.58) | (4.0:1.58) | VMware, Inc. | VMware Virtual Storage Volume Driver

 

_______________________________________________________________________________

Conclusion:

 

  • After analyzing the logs, we found that there was a deadlock in the Cluster service, due to which the Cluster service terminated on the nodes that are part of the cluster, and the virtual machines that were running went into a failed state. Based on this pattern, the issue appears to be a network misconfiguration: a live migration task consumes a large amount of bandwidth, and because both the cluster network and the iSCSI network were allowed to carry cluster traffic, the Cluster Shared Volume went into a failed state (one way to separate that traffic is sketched after these bullets).

 

  • There are no events indicating a network or hardware failure. However, updating the network components is recommended.
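One way to keep cluster and CSV traffic off the iSCSI network is to exclude that network from cluster use. A minimal sketch, assuming the FailoverClusters PowerShell module; the network name "iSCSI" is an assumption, so check the actual name with Get-ClusterNetwork first:

# List cluster networks and the roles they currently play
Get-ClusterNetwork | Format-Table Name, Role, Address
# Role values: 0 = no cluster traffic, 1 = cluster traffic only, 3 = cluster and client traffic
(Get-ClusterNetwork -Name "iSCSI").Role = 0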

 

 

  1. Install the following hotfixes on all cluster nodes, one node at a time. A reboot will be required for the changes to take effect. Follow the article below and make sure all of these updates are installed on all nodes:

 

Updates for Cluster Binaries for 2012 R2

https://support.microsoft.com/en-us/kb/2920151 
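To verify which updates are already installed on each node, something along these lines can be used (a sketch; Get-HotFix only reports updates visible to Win32_QuickFixEngineering, and the specific KB IDs to check should be taken from KB2920151 itself):

# Recently installed updates on every cluster node
Get-ClusterNode | ForEach-Object {
    Get-HotFix -ComputerName $_.Name |
        Sort-Object InstalledOn -Descending |
        Select-Object -First 10 Source, HotFixID, InstalledOn
}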

 

  2. Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: this step is the most critical one when dealing with network connectivity issues.

Investigation of network issues:

To avoid this issue in the future, the most critical part is to diagnose and investigate the recurring network connectivity issues on the cluster networks, with the help of the in-house networking team.

We need to check the network adapters, cables, and network configuration for the networks that connect the nodes.

We also need to check any hubs, switches, or bridges in the networks that connect the nodes.

We need to check for switch delays and proxy ARP behavior with the in-house networking team. A scoped cluster validation run (see the sketch below) is a practical starting point.
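As a starting point for that investigation, the cluster's own validation suite can be scoped to the network tests, and basic round-trip latency between nodes sampled from PowerShell. A minimal sketch, using the node names from this report:

# Run only the network category of cluster validation (safe to run online)
Test-Cluster -Node ABCH11, ABCH12, ABCH17 -Include "Network"
# Sample round-trip times to a peer node over 100 echoes
Test-Connection -ComputerName ABCH17 -Count 100 |
    Measure-Object -Property ResponseTime -Average -Maximum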

 

  3. As already explained, we have made some changes to the network; these should help with the issue we are facing.

