RCA – 22 – Virtual Machines Crashed on Node Restart in an S2D Cluster

Issue Description:

 

You have a 5-node Storage Spaces Direct cluster “ABCDLWHVCL0” with the nodes “ABCDLWHV1”, “ABCDLWHV2”, “ABCDLWHV3”, “ABCDLWHV4” and “ABCDLWHV5”. You want to know why the virtual machines on cluster node ABCDLWHV1, running Microsoft Windows Server 2016 Datacenter Version 10.0.14393 Build 14393, crashed after node ABCDLWHV4, running the same OS version and build, was restarted.

Date & Time: 29.9.2017 ~ 1:10 PM

 

_____________________________________________________________________

 

System Information: 

 

OS Name        Microsoft Windows Server 2016 Datacenter

Version        10.0.14393 Build 14393

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCDLWHV1

System Manufacturer        Dell Inc.

System Model        PowerEdge R730xd

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R730xd

Processor        Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)

BIOS Version/Date        Dell Inc. 2.4.3, 17.01.2017

SMBIOS Version        2.8

Embedded Controller Version        255.255

BIOS Mode        UEFI

BaseBoard Manufacturer        Dell Inc.

BaseBoard Model        Not Available

BaseBoard Name        Base Board

Platform Role        Enterprise Server

Secure Boot State        Off

PCR7 Configuration        Not Available

 

System Events:

 

  • The node ABCDLWHV4 was rebooted at 1:11:20 PM. Within seconds, the device “\Device\Harddisk16\DR18” became inaccessible on ABCDLWHV1, which in turn caused the CSV to go into a paused state.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 9/29/2017 | 1:11:22 PM | Error | ABCDLWHV1.abcd.co.uk | 15 | Disk | The device, \Device\Harddisk16\DR18, is not ready for access yet. |
| 9/29/2017 | 1:11:22 PM | Warning | ABCDLWHV1.abcd.co.uk | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCDLWHV1’ (‘Cluster Virtual Disk (ABCDLWHV1)’) has entered a paused state because of ‘STATUS_DEVICE_NOT_CONNECTED(c000009d)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 9/29/2017 | 1:11:23 PM | Error | ABCDLWHV1.abcd.co.uk | 15 | Disk | The device, \Device\Harddisk16\DR18, is not ready for access yet. |
| 9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 134 | Microsoft-Windows-ReFS | The file system was unable to write metadata to the media backing volume ABCDLWHV1. A write failed with status ‘A device which does not exist was specified.’ ReFS will take the volume offline. It may be mounted again automatically. |
| 9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Virtual Disk (ABCDLWHV1)’ of type ‘Physical Disk’ in clustered role ‘1de8c8d0-8b0f-4751-852c-8556bdd39799’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
| 9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource terminate encountered an error. Physical Disk resource name: Cluster Virtual Disk (ABCDLWHV1) Device Number: 16 Device Guid: {70df1205-7a7e-4dc8-8b55-b32b1864da9d} Error Code: 1168 Additional reason: ReleaseDiskPRFailure |
| 9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 5150 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource ‘Cluster Virtual Disk (ABCDLWHV1)’ failed. The Cluster Shared Volume was put in failed state with the following error: ‘Failed to get the volume number for \\?\GLOBALROOT\Device\Harddisk16\ClusterPartition2\ (error 2)’ |

 

  • After the CSV had been in a paused state for some time, the virtual machines started to fail.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 9/29/2017 | 1:11:28 PM | Warning | ABCDLWHV1.abcd.co.uk | 157 | Disk | Disk 16 has been surprise removed. |
| 9/29/2017 | 1:15:06 PM | Warning | ABCDLWHV1.abcd.co.uk | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCDLWHV5’ (‘Cluster Virtual Disk (ABCDLWHV5)’) has entered a paused state because of ‘STATUS_NO_SUCH_DEVICE(c000000e)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
| 9/29/2017 | 1:16:25 PM | Error | ABCDLWHV1.abcd.co.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Virtual Machine OBJPLNTC’ of type ‘Virtual Machine’ in clustered role ‘OBJPLNTC’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
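
To make the sequence on ABCDLWHV1 easier to follow, the event timestamps above can be expressed as offsets from the restart of ABCDLWHV4 at 1:11:20 PM. The following is a minimal Python sketch. The event list is transcribed by hand from this report, and the short notes and the helper name `offsets_from` are illustrative only; they are not part of any Windows tooling.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Restart of ABCDLWHV4 as logged by event 1074 (value taken from this report).
REBOOT = datetime.strptime("9/29/2017 1:11:20 PM", FMT)

# (timestamp, event ID, source, short note), transcribed from the
# ABCDLWHV1 system events. The notes are abbreviations, not log text.
EVENTS = [
    ("9/29/2017 1:11:22 PM", 15, "Disk", "Harddisk16\\DR18 not ready"),
    ("9/29/2017 1:11:22 PM", 5120, "FailoverClustering", "CSV ABCDLWHV1 paused"),
    ("9/29/2017 1:11:27 PM", 134, "ReFS", "metadata write failed"),
    ("9/29/2017 1:11:27 PM", 1069, "FailoverClustering", "virtual disk resource failed"),
    ("9/29/2017 1:11:28 PM", 157, "Disk", "Disk 16 surprise removed"),
    ("9/29/2017 1:15:06 PM", 5120, "FailoverClustering", "CSV ABCDLWHV5 paused"),
    ("9/29/2017 1:16:25 PM", 1069, "FailoverClustering", "VM OBJPLNTC failed"),
]


def offsets_from(start, events):
    """Return (seconds since start, event ID, source, note) for each event."""
    rows = []
    for stamp, event_id, source, note in events:
        delta = datetime.strptime(stamp, FMT) - start
        rows.append((int(delta.total_seconds()), event_id, source, note))
    return rows


TIMELINE = offsets_from(REBOOT, EVENTS)

for seconds, event_id, source, note in TIMELINE:
    print(f"+{seconds:>3}s  {event_id:<5} {source:<20} {note}")
```

By these timestamps, the first disk error follows the restart by 2 seconds and the virtual machine resource failure follows it by 305 seconds.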

 

 

Application Events:

 

  • No relevant application events were logged.

 

List of outdated drivers:

 

| Module Path | Time/Date String | File Version | Company Name | File Description |
|---|---|---|---|---|
| C:\WINDOWS\SYSTEM32\DRIVERS\HPSAMD.SYS | 3/26/2013 22:36 | (8.0:4.0) | Hewlett-Packard Company | Smart Array SAS/SATA Controller Media Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\IBBUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | InfiniBand Fabric Bus Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\MLX4ETH63.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Mellanox ConnectX 10Gb Ethernet Adapter NDIS 6.60 driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\MLX4_BUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | MLX4 Bus Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\MVUMIS.SYS | 5/23/2014 22:39 | (1.0:5.1016) | Marvell Semiconductor, Inc. | Marvell Flash Controller Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\NDFLTR.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | NetworkDirect Support Filter Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\PROCEXP152.SYS | 12/5/2015 22:42 | (15.0:0.0) | Sysinternals – www.sysinternals.com | Process Explorer |
| C:\WINDOWS\SYSTEM32\DRIVERS\VEEAMFCT.SYS | 4/6/2017 0:45 | (9.5:0.1015) | Veeam Software AG | CTK file system minifilter |
| C:\WINDOWS\SYSTEM32\DRIVERS\WINMAD.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Kernel WinMad |
| C:\WINDOWS\SYSTEM32\DRIVERS\WINVERBS.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Kernel WinVerbs |
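
When deciding which of the drivers listed above to review first, it can help to rank them by age. The sketch below does this in Python for the ABCDLWHV1 list. The file names and dates are transcribed from this report, the incident date is used as the reference point, and the function name `by_age` is a hypothetical helper. Age alone does not prove that a driver is faulty; it only suggests an order for checking vendor releases.

```python
from datetime import datetime

# Reference point: the date of the incident described in this report.
INCIDENT = datetime(2017, 9, 29)

# (file name, Time/Date String) transcribed from the ABCDLWHV1 driver list.
DRIVERS = [
    ("HPSAMD.SYS", "3/26/2013 22:36"),
    ("IBBUS.SYS", "3/8/2017 14:54"),
    ("MLX4ETH63.SYS", "3/8/2017 14:53"),
    ("MLX4_BUS.SYS", "3/8/2017 14:54"),
    ("MVUMIS.SYS", "5/23/2014 22:39"),
    ("NDFLTR.SYS", "3/8/2017 14:53"),
    ("PROCEXP152.SYS", "12/5/2015 22:42"),
    ("VEEAMFCT.SYS", "4/6/2017 0:45"),
    ("WINMAD.SYS", "3/8/2017 14:53"),
    ("WINVERBS.SYS", "3/8/2017 14:53"),
]


def by_age(drivers, reference):
    """Return (name, ISO date, age in days) sorted oldest first."""
    rows = []
    for name, stamp in drivers:
        built = datetime.strptime(stamp, "%m/%d/%Y %H:%M")
        rows.append((name, built.date().isoformat(), (reference - built).days))
    return sorted(rows, key=lambda row: row[2], reverse=True)


RANKED = by_age(DRIVERS, INCIDENT)

for name, built, age_days in RANKED:
    print(f"{name:<16} {built}  {age_days:>5} days old")
```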

 

SMB Events:

 

Log Name:      Microsoft-Windows-SmbClient/Connectivity

Source:        Microsoft-Windows-SMBClient

Date:          9/29/2017 1:05:55 PM

Event ID:      30804

Level:         Error

Computer:      ABCDLWHV1.abcd.co.uk

Description:

A network connection was disconnected.

 

Server name: \fe80::c483:d420:cdcc:d450%18

Server address: 10.102.22.15:445

Connection type: Rdma

 

Guidance:

This indicates that the client’s connection to the server was disconnected.

Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.

 

 

Cluster Logs:

 

 

[=== Cluster Logs ===]

00000a60.00007028::2017/09/29-15:40:04.635 INFO  [CAM] Token Created, Client Handle: 8000614c

0000173c.000022b0::2017/09/29-15:40:21.232 INFO  [GUM] Node 4: Processing RequestLock 1:1698

0000173c.000022b0::2017/09/29-15:40:21.232 INFO  [GUM] Node 4: Processing GrantLock to 1 (sent by 4 gumid: 494190)

0000173c.00002c30::2017/09/29-15:40:21.233 INFO  [GUM] Node 4: Executing locally gumId: 494191, updates: 1, first action: /dm/update
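
Cluster log lines such as the ones above follow a regular layout: process ID and thread ID in hexadecimal, a timestamp, a level, a component in square brackets, and the message. The Python sketch below splits a line into those fields so that entries can be filtered by component or time window. The regular expression and the function name `parse_cluster_log_line` are assumptions based on the lines shown in this report, not an official parser. Note also that `Get-ClusterLog` writes timestamps in UTC unless `-UseLocalTime` is specified, so they need converting before they are compared with the local event log times.

```python
import re
from datetime import datetime

# Layout inferred from the sample lines in this report:
# <pid hex>.<tid hex>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL>  [<component>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_cluster_log_line(line):
    """Split one cluster log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"], 16)
    record["tid"] = int(record["tid"], 16)
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return record


SAMPLE = (
    "0000173c.000022b0::2017/09/29-15:40:21.232 INFO  "
    "[GUM] Node 4: Processing RequestLock 1:1698"
)

RECORD = parse_cluster_log_line(SAMPLE)
print(RECORD["ts"], RECORD["level"], RECORD["component"], RECORD["message"])
```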

 

 

______________________________________________

 

 

 

System Information:

 

 

OS Name        Microsoft Windows Server 2016 Datacenter

Version        10.0.14393 Build 14393

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCDLWHV4

System Manufacturer        Dell Inc.

System Model        PowerEdge R730xd

System Type        x64-based PC

System SKU        SKU=NotProvided;ModelName=PowerEdge R730xd

Processor        Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)

Processor        Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)

BIOS Version/Date        Dell Inc. 2.4.3, 17.01.2017

SMBIOS Version        2.8

Embedded Controller Version        255.255

BIOS Mode        UEFI

BaseBoard Manufacturer        Dell Inc.

BaseBoard Model        Not Available

BaseBoard Name        Base Board

Platform Role        Enterprise Server

Secure Boot State        Off

PCR7 Configuration        Not Available

 

System Events:

 

  • The system was rebooted at 1:11:20 PM.

 

| Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
|---|---|---|---|---|---|---|
| 9/29/2017 | 1:11:08 PM | Information | ABCDLWHV4.abcd.co.uk | 7036 | Service Control Manager | The Sync Host_800fc935 service entered the stopped state. |
| 9/29/2017 | 1:11:08 PM | Information | ABCDLWHV4.abcd.co.uk | 7002 | Microsoft-Windows-Winlogon | User Logoff Notification for Customer Experience Improvement Program |
| 9/29/2017 | 1:11:11 PM | Information | ABCDLWHV4.abcd.co.uk | 7036 | Service Control Manager | The CDPUserSvc_5c9884 service entered the stopped state. |
| 9/29/2017 | 1:11:11 PM | Information | ABCDLWHV4.abcd.co.uk | 7002 | Microsoft-Windows-Winlogon | User Logoff Notification for Customer Experience Improvement Program |
| 9/29/2017 | 1:11:20 PM | Information | ABCDLWHV4.abcd.co.uk | 1074 | User32 | The process Explorer.EXE has initiated the restart of computer ABCDLWHV4 on behalf of user OBJECTIVITY\ABC_admin for the following reason: Hardware: Maintenance (Planned) Reason Code: 0x84010001 Shutdown Type: restart Comment: |
| 9/29/2017 | 1:11:21 PM | Information | ABCDLWHV4.abcd.co.uk | 1074 | User32 | The process C:\Windows\Explorer.EXE (ABCDLWHV4) has initiated the restart of computer ABCDLWHV4 on behalf of user OBJECTIVITY\ABC_admin for the following reason: Hardware: Maintenance (Planned) Reason Code: 0x84010001 Shutdown Type: restart Comment: |

 

Application Events:

 

 

  • No relevant application events were logged.

 

List of outdated drivers:

 

 

| Module Path | Time/Date String | File Version | Company Name | File Description |
|---|---|---|---|---|
| C:\WINDOWS\SYSTEM32\DRIVERS\IBBUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | InfiniBand Fabric Bus Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\IQVW64E.SYS | 10/29/2015 22:44 | (1.3:1.2) | Intel Corporation | Intel(R) Network Adapter Diagnostic Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\LSI_SAS3I.SYS | 3/28/2016 20:49 | (2.51:12.80) | Avago Technologies | Avago SAS Gen3 Driver (StorPort) |
| C:\WINDOWS\SYSTEM32\DRIVERS\MEGASAS.SYS | 3/5/2015 3:36 | (6.706:6.0) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
| C:\WINDOWS\SYSTEM32\DRIVERS\MEGASAS2I.SYS | 7/22/2016 23:36 | (6.711:10.11) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
| C:\WINDOWS\SYSTEM32\DRIVERS\MLX4ETH63.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Mellanox ConnectX 10Gb Ethernet Adapter NDIS 6.60 driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\MLX4_BUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | MLX4 Bus Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\MVUMIS.SYS | 5/23/2014 22:39 | (1.0:5.1016) | Marvell Semiconductor, Inc. | Marvell Flash Controller Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\NDFLTR.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | NetworkDirect Support Filter Driver |
| C:\WINDOWS\SYSTEM32\DRIVERS\PERCSAS2I.SYS | 3/15/2016 1:50 | (6.805:3.0) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
| C:\WINDOWS\SYSTEM32\DRIVERS\PERCSAS3I.SYS | 3/4/2016 22:22 | (6.603:6.0) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
| C:\WINDOWS\SYSTEM32\DRIVERS\VEEAMFCT.SYS | 4/6/2017 0:45 | (9.5:0.1015) | Veeam Software AG | CTK file system minifilter |

 

 

 

Cluster Logs:

 

[=== Cluster Logs ===]

00000a64.00006768::2017/09/29-15:42:59.565 INFO  [CAM] Token Created, Client Handle: 8000701c

00000a64.00005850::2017/09/29-15:42:59.584 INFO  [CAM] Token Created, Client Handle: 80007004

00000a64.00006768::2017/09/29-15:42:59.602 INFO  [CAM] Token Created, Client Handle: 80007024

 

 

SMB Log:

 

Log Name:      Microsoft-Windows-SmbClient/Connectivity

Source:        Microsoft-Windows-SMBClient

Date:          9/29/2017 1:10:20 PM

Event ID:      30804

Task Category: None

Level:         Error

Keywords:      (64)

User:          N/A

Computer:      ABCDLWHV4.abcd.co.uk

Description:

A network connection was disconnected.

 

Server name: \fe80::c483:d420:cdcc:d450%10

Server address: 10.102.22.135:445

Connection type: Rdma

 

Guidance:

This indicates that the client’s connection to the server was disconnected.

Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.
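
The SMB client events in this report share a simple "Key: value" layout, so their fields can be pulled out and compared with the restart time. The following Python sketch does that for the ABCDLWHV4 event above. The event text is abridged from this report, and `parse_event_text` is a hypothetical helper written for this illustration, not part of any Windows or SMB tooling.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Restart of ABCDLWHV4 as logged by event 1074 (value taken from this report).
REBOOT = datetime.strptime("9/29/2017 1:11:20 PM", FMT)

# Abridged copy of the SMB client event logged on ABCDLWHV4.
EVENT_TEXT = """\
Log Name:      Microsoft-Windows-SmbClient/Connectivity
Source:        Microsoft-Windows-SMBClient
Date:          9/29/2017 1:10:20 PM
Event ID:      30804
Level:         Error
Computer:      ABCDLWHV4.abcd.co.uk
Description:
A network connection was disconnected.
Server address: 10.102.22.135:445
Connection type: Rdma
"""


def parse_event_text(text):
    """Collect 'Key: value' lines into a dict, skipping lines without a value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


FIELDS = parse_event_text(EVENT_TEXT)
LEAD_SECONDS = int((REBOOT - datetime.strptime(FIELDS["Date"], FMT)).total_seconds())

print(f"{FIELDS['Connection type']} disconnect from {FIELDS['Server address']} "
      f"logged {LEAD_SECONDS}s before the restart")
```

With the values from this report, the disconnect on ABCDLWHV4 precedes the restart by 60 seconds. The corresponding event on ABCDLWHV1, logged at 1:05:55 PM, precedes it by 325 seconds.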

 

 

Conclusion:

 

  • The issue appears to be caused by a misconfiguration of the network adapters. When cluster node ABCDLWHV4 restarted, the CSV hosting the VMs on ABCDLWHV1 went into a paused state, probably because of network communication problems. Because the CSV could not come back online in time, the VMs hosted on it started going down.

 

  • Review and correct the RDMA configuration on the cluster. If the adapters use RoCE, verify that Priority Flow Control (PFC) is configured on every host, switch and router on the RoCE network, as the SMB event guidance states.

  • Update the Mellanox network adapter firmware and drivers. If they are already up to date, confirm whether there are any known issues with that version.

  • Update the storage controller firmware as well, so that it is at the current level.

  • Install all the latest Windows updates on the cluster nodes.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
