RCA-15 – Node went down after Quorum Loss

Issue Description:

We have a node “HV11SG2” running Windows Server 2012 R2 Datacenter, which is part of cluster ABCG2. The node went down because it lost access to the quorum.

Issue happened on: 1/23/2017, 2:18:48 PM

_______________________________________________________________________________________________

System Information: AB1XY2

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        AB1XY2

System Manufacturer        Cisco Systems Inc

System Model        UCSB-B200-M3

System Type        x64-based PC

System SKU       

Processor        Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 2800 Mhz, 10 Core(s), 10 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 2800 Mhz, 10 Core(s), 10 Logical Processor(s)

BIOS Version/Date        Cisco Systems, Inc. B200M3.2.2.6d.0.062220160055, 22/06/2016

System Events:

  • Verified the logs and found that the Cluster Shared Volumes went into a paused state (Event 5120), starting at 2:18:32 PM. The node then lost communication with AB9XY2 and AB5XY2 (Event 1592). At 2:18:48 PM the Cluster service shut down because quorum was lost (Event 1177), and the node was removed from the active failover cluster membership.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:18:32 PM | Error | AB1XY2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘CSV-LegB04’ (‘ABCG2-CSV-LegB04’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:34 PM | Error | AB1XY2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘CSV-LegA-SQL01’ (‘ABCG2-CSV-SQL01’) has entered a paused state because of ‘(c00000c4)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:42 PM | Error | AB1XY2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-SQLBKP02’ (‘ABCG2-CSV-SQLBKP02’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:43 PM | Information | AB1XY2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB1XY2’ lost communication with cluster node ‘AB9XY2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:18:47 PM | Information | AB1XY2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB1XY2’ lost communication with cluster node ‘AB5XY2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:18:48 PM | Critical | AB1XY2.ad.xyz.com | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:18:48 PM | Critical | AB1XY2.ad.xyz.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
| 1/23/2017 | 2:18:48 PM | Critical | AB1XY2.ad.xyz.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. |
| 1/23/2017 | 2:18:48 PM | Error | AB1XY2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-India’ (‘ABCG2-CSV-India’) has entered a paused state because of ‘(c000000e)’. All I/O will temporarily be queued until a path to the volume is reestablished. |

  • The CSV error code c000020c points to a disconnect from the storage:

PS C:\Users\adix5025.INDIA\Downloads\ERR> & ‘.\err(vista).exe’ c000020c

# for hex 0xc000020c / decimal -1073741300

  STATUS_CONNECTION_DISCONNECTED                                 ntstatus.h

# The transport connection is now disconnected.

# 1 matches found for “c000020c”

PS C:\Users\adix5025.INDIA\Downloads\ERR>
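The same lookup can be sketched in a few lines of Python for all three status codes that appear in the Event 5120 entries. This is only an illustration: the name for c000020c comes from the err output above, while the names for c00000c4 and c000000e are taken from ntstatus.h and are not shown anywhere in the original logs. The function name `decode_ntstatus` is hypothetical.

```python
# Symbolic names from ntstatus.h for the codes seen in the Event 5120 entries.
# Only c000020c was looked up in the original report; the other two are added
# here as an assumption based on the public header.
NTSTATUS_NAMES = {
    0xC000020C: "STATUS_CONNECTION_DISCONNECTED",
    0xC00000C4: "STATUS_UNEXPECTED_NETWORK_ERROR",
    0xC000000E: "STATUS_NO_SUCH_DEVICE",
}


def decode_ntstatus(code_hex):
    """Return (signed 32-bit decimal, symbolic name) for a hex NTSTATUS string."""
    value = int(code_hex, 16) & 0xFFFFFFFF
    # NTSTATUS is a signed 32-bit value, so codes with the top bit set are negative.
    signed = value - 0x100000000 if value & 0x80000000 else value
    return signed, NTSTATUS_NAMES.get(value, "UNKNOWN")


for code in ("c000020c", "c00000c4", "c000000e"):
    decimal, name = decode_ntstatus(code)
    print(f"0x{code}  {decimal}  {name}")
```

All three codes describe a lost path or lost network connection rather than a file system fault, which fits the storage-disconnect reading above.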

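  • Because the node could not connect to the storage, it also lost access to the quorum, after which it was removed from the failover cluster. Shortly afterwards, at 2:19:06 PM, the Cisco VIC Ethernet interface logged an internal failure (Event 5005).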

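| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:19:06 PM | Error | AB1XY2.ad.xyz.com | 5005 | ENIC | Cisco VIC Ethernet Interface #2 : Has encountered an internal error and has failed. |
| 1/23/2017 | 2:19:06 PM | Error | AB1XY2.ad.xyz.com | 5005 | ENIC | Cisco VIC Ethernet Interface #2 : Has encountered an internal error and has failed. |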

Application Events:

  • Verified the logs and found nothing specific related to the issue.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:22:55 PM | Information | AB1XY2.ad.xyz.com | 8224 | VSS | The VSS service is shutting down due to idle timeout. |
| 1/23/2017 | 2:23:20 PM | Information | AB1XY2.ad.xyz.com | 5605 | Microsoft-Windows-WMI | The root\mscluster namespace is marked with the RequiresEncryption flag. Access to this namespace might be denied if the script or application does not have the appropriate authentication level. Change the authentication level to Pkt_Privacy and run the script or application again. |
| 1/23/2017 | 2:23:54 PM | Information | AB1XY2.ad.xyz.com | 5605 | Microsoft-Windows-WMI | The root\mscluster namespace is marked with the RequiresEncryption flag. Access to this namespace might be denied if the script or application does not have the appropriate authentication level. Change the authentication level to Pkt_Privacy and run the script or application again. |
| 1/23/2017 | 2:24:35 PM | Warning | AB1XY2.ad.xyz.com | 5612 | Microsoft-Windows-WMI | Windows Management Instrumentation has stopped WMIPRVSE.EXE because a quota reached a warning value. Quota: ThreadCount Value: 5143 Maximum value: 256 WMIPRVSE PID: 28644 Providers hosted in this process: %SystemRoot%\System32\wbem\cluswmi.dll, %SystemRoot%\System32\wbem\cluswmi.dll, %windir%\system32\wbem\servercompprov.dll, %SystemRoot%\System32\smbwmiv2.dll, %SystemRoot%\System32\wbem\cluswmi.dll, %SystemRoot%\System32\wbem\cluswmi.dll, %SystemRoot%\System32\wbem\cluswmi.dll, C:\Windows\System32\iscsiwmi.dll, %systemroot%\system32\wbem\cimwin32.dll, %SystemRoot%\System32\wbem\cluswmi.dll, %SystemRoot%\system32\tscfgwmi.dll |

Cluster Events:

  • Verified the cluster logs and found the same set of events, which point to an issue with the network adapters.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:18:24 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.211:~3343~ connected to remote endpoint 192.00.23.223:~3343~. |
| 1/23/2017 | 2:18:24 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.211:~3343~ connected to remote endpoint 192.00.23.212:~3343~. |
| 1/23/2017 | 2:18:24 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.211:~3343~ connected to remote endpoint 192.00.23.219:~3343~. |
| 1/23/2017 | 2:18:24 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.211:~3343~ connected to remote endpoint 172.00.123.215:~3343~. |
| 1/23/2017 | 2:18:24 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.211:~3343~ connected to remote endpoint 192.00.23.217:~3343~. |
| 1/23/2017 | 2:18:32 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 192.00.23.211:~3343~ connected to remote endpoint 192.00.23.222:~3343~. |
| 1/23/2017 | 2:18:32 PM | Information | AB1XY2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has lost the UDP connection from local endpoint 172.00.123.211:~3343~ connected to remote endpoint 172.00.123.222:~3343~. |
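Because the System and Cluster logs for AB1XY2 are listed separately, the order of events is easier to see when they are merged into one timeline. The Python sketch below does that for the first occurrence of each event type on this node. The timestamps and event IDs are transcribed from the tables in this report; the one-line summaries and the helper name `build_timeline` are illustrative and not part of the original logs.

```python
from datetime import datetime

# (time, event ID, log, summary) transcribed from the AB1XY2 tables in this report.
# The list is deliberately unordered; build_timeline() sorts it.
EVENTS = [
    ("2:18:48 PM", 1177, "System", "Cluster service shutting down, quorum lost"),
    ("2:18:32 PM", 5120, "System", "First CSV enters a paused state"),
    ("2:19:06 PM", 5005, "System", "Cisco VIC Ethernet Interface #2 failed"),
    ("2:18:24 PM", 1650, "Cluster", "First missed heartbeats"),
    ("2:18:43 PM", 1592, "System", "Lost communication with AB9XY2"),
]


def build_timeline(events):
    """Sort events by time of day and attach seconds elapsed since the first one."""
    parsed = [
        (datetime.strptime(time_text, "%I:%M:%S %p"), event_id, log, summary)
        for time_text, event_id, log, summary in events
    ]
    parsed.sort(key=lambda row: row[0])
    start = parsed[0][0]
    return [
        (int((stamp - start).total_seconds()), event_id, log, summary)
        for stamp, event_id, log, summary in parsed
    ]


for offset, event_id, log, summary in build_timeline(EVENTS):
    print(f"+{offset:>3}s  Event {event_id:<5} [{log}] {summary}")
```

Ordered this way, the missed heartbeats come first, the CSV pause follows 8 seconds later, and quorum is lost 24 seconds after the first missed heartbeat.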

List of outdated drivers:

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio ® T4 Chipset |
| 2/3/2016 21:49 | (3.5:0.13) | (3.5:0.13) | Cisco Systems, Inc. | Cisco VIC Ethernet Driver |

_______________________________________________________________________________

System Information: AB2SG2

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        AB2SG2

System Manufacturer        Cisco Systems Inc

System Model        UCSB-B200-M3

System Type        x64-based PC

System SKU       

Processor        Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 2800 Mhz, 10 Core(s), 10 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 2800 Mhz, 10 Core(s), 10 Logical Processor(s)

BIOS Version/Date        Cisco Systems, Inc. B200M3.2.2.6d.0.062220160055, 22/06/2016

System Events:

  • Verified the logs and found the same set of events that were generated on the other nodes.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:18:38 PM | Error | AB2SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘CSV-LegA03’ (‘ABCG2-CSV-LegA03’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:38 PM | Error | AB2SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-SQLBKP02’ (‘ABCG2-CSV-SQLBKP02’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:40 PM | Error | AB2SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-SQLBKP01’ (‘ABCG2-CSV-SQLBKP01’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:41 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘HV10SG2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:18:43 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘AB9XY2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:18:45 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘HV6SG2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:19:11 PM | Critical | AB2SG2.ad.xyz.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB1XY2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:19:11 PM | Critical | AB2SG2.ad.xyz.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘HV12SG2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |

  • Per Event 1592, communication was restored at around 3:34 PM.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 3:34:59 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘AB9XY2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 3:34:59 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘HV11SG2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 3:35:05 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘HV10SG2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 3:35:07 PM | Information | AB2SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB2SG2’ lost communication with cluster node ‘HV8SG2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
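For a rough outage duration, the time of the quorum loss (2:18:48 PM) can be subtracted from the first Event 1592 in the 3:34 PM group. The Python sketch below does this. It assumes, as this report does, that the 3:34:59 PM event marks the point where communication came back; the logs shown here do not prove that on their own. The helper name `outage_duration` is hypothetical.

```python
from datetime import datetime

# Date and time layout used in the event tables of this report.
FMT = "%m/%d/%Y %I:%M:%S %p"


def outage_duration(start, end):
    """Return the time between two event timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Quorum lost (Event 1177) to first "communication was reestablished" (Event 1592).
duration = outage_duration("1/23/2017 2:18:48 PM", "1/23/2017 3:34:59 PM")
print(f"Approximate outage: {duration}")
```

That gives a window of roughly 1 hour 16 minutes, which is the period the networking and Cisco teams should look at when correlating switch and fabric interconnect logs.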

Cluster Events:

  • Verified the cluster logs and found the same set of events, which point to an issue with the network adapters.

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:18:24 PM | Information | AB2SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.212:~3343~ connected to remote endpoint 192.00.23.211:~3343~. |
| 1/23/2017 | 2:18:25 PM | Information | AB2SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.212:~3343~ connected to remote endpoint 172.00.123.217:~3343~. |
| 1/23/2017 | 2:18:25 PM | Information | AB2SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.212:~3343~ connected to remote endpoint 172.00.123.219:~3343~. |
| 1/23/2017 | 2:18:25 PM | Information | AB2SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.212:~3343~ connected to remote endpoint 192.00.23.220:~3343~. |
| 1/23/2017 | 2:18:25 PM | Information | AB2SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.212:~3343~ connected to remote endpoint 192.00.23.219:~3343~. |

List of outdated drivers:

| Time/Date String | Product Version | File Version | Company Name | File Description |
|---|---|---|---|---|
| 5/14/2013 6:19 | (6.3:9391.6) | (4.4:13.0) | Chelsio Communications | Virtual Bus Driver for Chelsio ® T4 Chipset |
| 2/3/2016 21:49 | (3.5:0.13) | (3.5:0.13) | Cisco Systems, Inc. | Cisco VIC Ethernet Driver |

 ________________________________________________________________________________


 

System Information: AB3SG2

 

OS Name        Microsoft Windows Server 2012 R2 Datacenter

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        AB3SG2

System Manufacturer        Cisco Systems Inc

System Model        UCSB-B200-M3

System Type        x64-based PC

System SKU       

Processor        Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 2800 Mhz, 10 Core(s), 10 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 2800 Mhz, 10 Core(s), 10 Logical Processor(s)

BIOS Version/Date        Cisco Systems, Inc. B200M3.2.2.6d.0.062220160055, 22/06/2016

 

System Events:

 

  • Verified the logs and found the same set of events that were generated on the other nodes.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:18:39 PM | Error | AB3SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-FILE04’ (‘ABCG2-CSV-FILE04’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:40 PM | Information | AB3SG2.ad.xyz.com | 1592 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB3SG2’ lost communication with cluster node ‘HV6SG2’. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:18:41 PM | Error | AB3SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-FILE03’ (‘ABCG2-CSV-FILE03’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:18:41 PM | Error | AB3SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘CSV-LegA-SQL01’ (‘ABCG2-CSV-SQL01’) has entered a paused state because of ‘(c000000e)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:19:05 PM | Error | AB3SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-BK1’ (‘ABCG2-CSV-BK1’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:19:05 PM | Error | AB3SG2.ad.xyz.com | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCG2-CSV-India’ (‘ABCG2-CSV-India’) has entered a paused state because of ‘(c000020c)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 1/23/2017 | 2:19:11 PM | Critical | AB3SG2.ad.xyz.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘AB1XY2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:19:11 PM | Critical | AB3SG2.ad.xyz.com | 1135 | Microsoft-Windows-FailoverClustering | Cluster node ‘HV12SG2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges. |
| 1/23/2017 | 2:19:14 PM | Error | AB3SG2.ad.xyz.com | 4 | Microsoft-Windows-Security-Kerberos | The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server hv4sg2$. The target name used was MSServerClusterMgmtAPI/ABCG2.ad.xyz.com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (AD.xyz.com) is different from the client domain (AD.xyz.com), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server. |

 

 

 

Cluster Events:

 

  • Verified the cluster logs and found the same set of events, which point to an issue with the network adapters.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 1/23/2017 | 2:18:24 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.213:~3343~ connected to remote endpoint 192.00.23.222:~3343~. |
| 1/23/2017 | 2:18:24 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.213:~3343~ connected to remote endpoint 172.00.123.221:~3343~. |
| 1/23/2017 | 2:18:25 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.213:~3343~ connected to remote endpoint 172.00.123.220:~3343~. |
| 1/23/2017 | 2:18:27 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.213:~3343~ connected to remote endpoint 192.00.23.219:~3343~. |
| 1/23/2017 | 2:18:27 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.213:~3343~ connected to remote endpoint 172.00.123.211:~3343~. |
| 1/23/2017 | 2:18:28 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 172.00.123.213:~3343~ connected to remote endpoint 172.00.123.222:~3343~. |
| 1/23/2017 | 2:18:29 PM | Information | AB3SG2.ad.xyz.com | 1650 | Microsoft-Windows-FailoverClustering | Cluster has missed two consecutive heartbeats for the local endpoint 192.00.23.213:~3343~ connected to remote endpoint 192.00.23.222:~3343~. |

 

 

________________________________________________________________________________

 

Conclusion:

 

 

  • After analyzing the logs, we found that the issue started when node AB1XY2 lost communication with the other nodes of the cluster, as shown by Event ID 1650 at 2:18:24 PM. The storage then got disconnected and the CSVs went into a paused state (Event ID 5120).

  • Based on the errors logged (Event ID 5005 from ENIC), the issue appears to be caused by the Cisco VIC Ethernet interface going into a failed state. I recommend involving the Cisco team to troubleshoot this further.

 

  • The following file system locations should be excluded from virus scanning on a server that is running Cluster Services:
    • The path of the \mscs folder on the quorum hard disk. For example, exclude the Q:\mscs folder from virus scanning. (Applicable for Cluster 2003)
    • The %Systemroot%\Cluster folder. (Applicable for Cluster 2003, 2008 & 2008 R2)
    • The temp folder for the Cluster Service account. For example, exclude the \clusterserviceaccount\Local Settings\Temp folder from virus scanning. (Applicable for Cluster 2003)

  1. Follow this article to add the antivirus exclusions: http://support.microsoft.com/kb/309422

  2. Install the following hotfixes on all cluster nodes, one node at a time. A reboot is required for the changes to take effect. Follow the article and make sure all of these updates are installed on every node:

https://support.microsoft.com/en-us/help/2920151/recommended-hotfixes-and-updates-for-windows-server-2012-r2-based-failover-clusters

  3. Investigate network timeouts, latency, and packet drops with the help of the in-house networking team.

Please note: this step is the most critical when dealing with network connectivity issues.

Investigation of network issues:

To avoid this issue in future, the most important part is to diagnose and investigate the consistent network connectivity issue on the cluster networks, with the help of the in-house networking team:

  • Check the network adapters, cables, and network configuration for the networks that connect the nodes.
  • Check the hubs, switches, or bridges in the networks that connect the nodes.
  • Check for switch delays and proxy ARP.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
