Issue Description:
You have a five-node Storage Spaces Direct cluster, “ABCDLWHVCL0”, made up of the nodes “ABCDLWHV1”, “ABCDLWHV2”, “ABCDLWHV3”, “ABCDLWHV4” and “ABCDLWHV5”. You want to know why the virtual machines on cluster node ABCDLWHV1 crashed after node ABCDLWHV4 was restarted. Both nodes run Microsoft Windows Server 2016 Datacenter, Version 10.0.14393 Build 14393.
Date & Time: 29.9.2017 ~ 1:10 PM
_____________________________________________________________________
System Information:
OS Name Microsoft Windows Server 2016 Datacenter
Version 10.0.14393 Build 14393
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCDLWHV1
System Manufacturer Dell Inc.
System Model PowerEdge R730xd
System Type x64-based PC
System SKU SKU=NotProvided;ModelName=PowerEdge R730xd
Processor Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)
BIOS Version/Date Dell Inc. 2.4.3, 17.01.2017
SMBIOS Version 2.8
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer Dell Inc.
BaseBoard Model Not Available
BaseBoard Name Base Board
Platform Role Enterprise Server
Secure Boot State Off
PCR7 Configuration Not Available
System Events:
- Node ABCDLWHV4 was rebooted at 1:11:20 PM. Within seconds, \Device\Harddisk16\DR18 became inaccessible on ABCDLWHV1, which in turn caused the CSV to enter a paused state.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
9/29/2017 | 1:11:22 PM | Error | ABCDLWHV1.abcd.co.uk | 15 | Disk | The device, \Device\Harddisk16\DR18, is not ready for access yet. |
9/29/2017 | 1:11:22 PM | Warning | ABCDLWHV1.abcd.co.uk | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCDLWHV1’ (‘Cluster Virtual Disk (ABCDLWHV1)’) has entered a paused state because of ‘STATUS_DEVICE_NOT_CONNECTED(c000009d)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
9/29/2017 | 1:11:23 PM | Error | ABCDLWHV1.abcd.co.uk | 15 | Disk | The device, \Device\Harddisk16\DR18, is not ready for access yet. |
9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 134 | Microsoft-Windows-ReFS | The file system was unable to write metadata to the media backing volume ABCDLWHV1. A write failed with status ‘A device which does not exist was specified.’ ReFS will take the volume offline. It may be mounted again automatically. |
9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Cluster Virtual Disk (ABCDLWHV1)’ of type ‘Physical Disk’ in clustered role ‘1de8c8d0-8b0f-4751-852c-8556bdd39799’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource terminate encountered an error. Physical Disk resource name: Cluster Virtual Disk (ABCDLWHV1) Device Number: 16 Device Guid: {70df1205-7a7e-4dc8-8b55-b32b1864da9d} Error Code: 1168 Additional reason: ReleaseDiskPRFailure |
9/29/2017 | 1:11:27 PM | Error | ABCDLWHV1.abcd.co.uk | 5150 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource ‘Cluster Virtual Disk (ABCDLWHV1)’ failed. The Cluster Shared Volume was put in failed state with the following error: ‘Failed to get the volume number for \\?\GLOBALROOT\Device\Harddisk16\ClusterPartition2\ (error 2)’ |
9/29/2017 | 1:11:28 PM | Warning | ABCDLWHV1.abcd.co.uk | 157 | Disk | Disk 16 has been surprise removed. |
- Because the CSV remained in a paused state, the virtual machines started to fail a few minutes later (the first VM resource failure is logged at 1:16:25 PM).
9/29/2017 | 1:15:06 PM | Warning | ABCDLWHV1.abcd.co.uk | 5120 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume ‘ABCDLWHV5’ (‘Cluster Virtual Disk (ABCDLWHV5)’) has entered a paused state because of ‘STATUS_NO_SUCH_DEVICE(c000000e)’. All I/O will temporarily be queued until a path to the volume is reestablished. |
9/29/2017 | 1:16:25 PM | Error | ABCDLWHV1.abcd.co.uk | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource ‘Virtual Machine OBJPLNTC’ of type ‘Virtual Machine’ in clustered role ‘OBJPLNTC’ failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet. |
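The "within seconds" and "a few minutes later" observations can be made concrete by computing each event's offset from the restart of ABCDLWHV4. The following is a minimal Python sketch, not part of the original analysis: the timestamps and event codes are copied from the tables in this report, the short descriptions are paraphrased, and it assumes the clocks on both nodes were in sync.

```python
from datetime import datetime

# Timestamp format used in the event tables of this report (local time).
FMT = "%m/%d/%Y %I:%M:%S %p"

# Event 1074 on ABCDLWHV4: restart initiated.
REBOOT = "9/29/2017 1:11:20 PM"

# (timestamp, event code, paraphrased description) from ABCDLWHV1's System log.
EVENTS = [
    ("9/29/2017 1:11:22 PM", 15, "Harddisk16\\DR18 not ready"),
    ("9/29/2017 1:11:22 PM", 5120, "CSV 'ABCDLWHV1' paused"),
    ("9/29/2017 1:11:27 PM", 134, "ReFS takes volume offline"),
    ("9/29/2017 1:11:27 PM", 5150, "CSV 'ABCDLWHV1' failed"),
    ("9/29/2017 1:11:28 PM", 157, "Disk 16 surprise removed"),
    ("9/29/2017 1:15:06 PM", 5120, "CSV 'ABCDLWHV5' paused"),
    ("9/29/2017 1:16:25 PM", 1069, "VM OBJPLNTC failed"),
]


def seconds_after_reboot(events, reboot=REBOOT):
    """Return (offset_seconds, code, text) for each event relative to the reboot."""
    start = datetime.strptime(reboot, FMT)
    return [
        (int((datetime.strptime(ts, FMT) - start).total_seconds()), code, text)
        for ts, code, text in events
    ]


timeline = seconds_after_reboot(EVENTS)
for offset, code, text in timeline:
    print(f"+{offset:>4d}s  event {code:<5d} {text}")
```

On this data the first disk error follows the restart by 2 seconds, and the VM resource failure follows it by 305 seconds.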
Application Events:
- No relevant events were found in the Application log.
List of outdated drivers:
Module Path | Time/Date String | File Version | Company Name | File Description |
C:\WINDOWS\SYSTEM32\DRIVERS\HPSAMD.SYS | 3/26/2013 22:36 | (8.0:4.0) | Hewlett-Packard Company | Smart Array SAS/SATA Controller Media Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\IBBUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | InfiniBand Fabric Bus Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\MLX4ETH63.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Mellanox ConnectX 10Gb Ethernet Adapter NDIS 6.60 driver |
C:\WINDOWS\SYSTEM32\DRIVERS\MLX4_BUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | MLX4 Bus Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\MVUMIS.SYS | 5/23/2014 22:39 | (1.0:5.1016) | Marvell Semiconductor, Inc. | Marvell Flash Controller Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\NDFLTR.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | NetworkDirect Support Filter Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\PROCEXP152.SYS | 12/5/2015 22:42 | (15.0:0.0) | Sysinternals – www.sysinternals.com | Process Explorer |
C:\WINDOWS\SYSTEM32\DRIVERS\VEEAMFCT.SYS | 4/6/2017 0:45 | (9.5:0.1015) | Veeam Software AG | CTK file system minifilter |
C:\WINDOWS\SYSTEM32\DRIVERS\WINMAD.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Kernel WinMad |
C:\WINDOWS\SYSTEM32\DRIVERS\WINVERBS.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Kernel WinVerbs |
SMB Events:
Log Name: Microsoft-Windows-SmbClient/Connectivity
Source: Microsoft-Windows-SMBClient
Date: 9/29/2017 1:05:55 PM
Event ID: 30804
Level: Error
Computer: ABCDLWHV1.abcd.co.uk
Description:
A network connection was disconnected.
Server name: \fe80::c483:d420:cdcc:d450%18
Server address: 10.102.22.15:445
Connection type: Rdma
Guidance:
This indicates that the client’s connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.
Cluster Logs:
[=== Cluster Logs ===]
00000a60.00007028::2017/09/29-15:40:04.635 INFO [CAM] Token Created, Client Handle: 8000614c
0000173c.000022b0::2017/09/29-15:40:21.232 INFO [GUM] Node 4: Processing RequestLock 1:1698
0000173c.000022b0::2017/09/29-15:40:21.232 INFO [GUM] Node 4: Processing GrantLock to 1 (sent by 4 gumid: 494190)
0000173c.00002c30::2017/09/29-15:40:21.233 INFO [GUM] Node 4: Executing locally gumId: 494191, updates: 1, first action: /dm/update
______________________________________________
System Information:
OS Name Microsoft Windows Server 2016 Datacenter
Version 10.0.14393 Build 14393
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCDLWHV4
System Manufacturer Dell Inc.
System Model PowerEdge R730xd
System Type x64-based PC
System SKU SKU=NotProvided;ModelName=PowerEdge R730xd
Processor Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)
Processor Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz, 2100 Mhz, 16 Core(s), 32 Logical Processor(s)
BIOS Version/Date Dell Inc. 2.4.3, 17.01.2017
SMBIOS Version 2.8
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer Dell Inc.
BaseBoard Model Not Available
BaseBoard Name Base Board
Platform Role Enterprise Server
Secure Boot State Off
PCR7 Configuration Not Available
System Events:
- The system restart was initiated at 1:11:20 PM.
Date | Time | Type/Level | Computer Name | Event Code | Source | Description |
9/29/2017 | 1:11:08 PM | Information | ABCDLWHV4.abcd.co.uk | 7036 | Service Control Manager | The Sync Host_800fc935 service entered the stopped state. |
9/29/2017 | 1:11:08 PM | Information | ABCDLWHV4.abcd.co.uk | 7002 | Microsoft-Windows-Winlogon | User Logoff Notification for Customer Experience Improvement Program |
9/29/2017 | 1:11:11 PM | Information | ABCDLWHV4.abcd.co.uk | 7036 | Service Control Manager | The CDPUserSvc_5c9884 service entered the stopped state. |
9/29/2017 | 1:11:11 PM | Information | ABCDLWHV4.abcd.co.uk | 7002 | Microsoft-Windows-Winlogon | User Logoff Notification for Customer Experience Improvement Program |
9/29/2017 | 1:11:20 PM | Information | ABCDLWHV4.abcd.co.uk | 1074 | User32 | The process Explorer.EXE has initiated the restart of computer ABCDLWHV4 on behalf of user OBJECTIVITY\ABC_admin for the following reason: Hardware: Maintenance (Planned) Reason Code: 0x84010001 Shutdown Type: restart Comment: |
9/29/2017 | 1:11:21 PM | Information | ABCDLWHV4.abcd.co.uk | 1074 | User32 | The process C:\Windows\Explorer.EXE (ABCDLWHV4) has initiated the restart of computer ABCDLWHV4 on behalf of user OBJECTIVITY\ABC_admin for the following reason: Hardware: Maintenance (Planned) Reason Code: 0x84010001 Shutdown Type: restart Comment: |
Application Events:
- No relevant events were found in the Application log.
List of outdated drivers:
Module Path | Time/Date String | File Version | Company Name | File Description |
C:\WINDOWS\SYSTEM32\DRIVERS\IBBUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | InfiniBand Fabric Bus Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\IQVW64E.SYS | 10/29/2015 22:44 | (1.3:1.2) | Intel Corporation | Intel(R) Network Adapter Diagnostic Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\LSI_SAS3I.SYS | 3/28/2016 20:49 | (2.51:12.80) | Avago Technologies | Avago SAS Gen3 Driver (StorPort) |
C:\WINDOWS\SYSTEM32\DRIVERS\MEGASAS.SYS | 3/5/2015 3:36 | (6.706:6.0) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
C:\WINDOWS\SYSTEM32\DRIVERS\MEGASAS2I.SYS | 7/22/2016 23:36 | (6.711:10.11) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
C:\WINDOWS\SYSTEM32\DRIVERS\MLX4ETH63.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | Mellanox ConnectX 10Gb Ethernet Adapter NDIS 6.60 driver |
C:\WINDOWS\SYSTEM32\DRIVERS\MLX4_BUS.SYS | 3/8/2017 14:54 | (5.35:12978.0) | Mellanox | MLX4 Bus Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\MVUMIS.SYS | 5/23/2014 22:39 | (1.0:5.1016) | Marvell Semiconductor, Inc. | Marvell Flash Controller Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\NDFLTR.SYS | 3/8/2017 14:53 | (5.35:12978.0) | Mellanox | NetworkDirect Support Filter Driver |
C:\WINDOWS\SYSTEM32\DRIVERS\PERCSAS2I.SYS | 3/15/2016 1:50 | (6.805:3.0) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
C:\WINDOWS\SYSTEM32\DRIVERS\PERCSAS3I.SYS | 3/4/2016 22:22 | (6.603:6.0) | Avago Technologies | MEGASAS RAID Controller Driver for Windows |
C:\WINDOWS\SYSTEM32\DRIVERS\VEEAMFCT.SYS | 4/6/2017 0:45 | (9.5:0.1015) | Veeam Software AG | CTK file system minifilter |
Cluster Logs:
[=== Cluster Logs ===]
00000a64.00006768::2017/09/29-15:42:59.565 INFO [CAM] Token Created, Client Handle: 8000701c
00000a64.00005850::2017/09/29-15:42:59.584 INFO [CAM] Token Created, Client Handle: 80007004
00000a64.00006768::2017/09/29-15:42:59.602 INFO [CAM] Token Created, Client Handle: 80007024
SMB Log:
Log Name: Microsoft-Windows-SmbClient/Connectivity
Source: Microsoft-Windows-SMBClient
Date: 9/29/2017 1:10:20 PM
Event ID: 30804
Task Category: None
Level: Error
Keywords: (64)
User: N/A
Computer: ABCDLWHV4.abcd.co.uk
Description:
A network connection was disconnected.
Server name: \fe80::c483:d420:cdcc:d450%10
Server address: 10.102.22.135:445
Connection type: Rdma
Guidance:
This indicates that the client’s connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.
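It is worth checking whether the RDMA disconnects (Event ID 30804) were logged before or after the restart of ABCDLWHV4. The sketch below is an illustration only: it uses the two SMB client timestamps and the Event 1074 time quoted in this report, and it assumes both nodes log in the same local time zone with synchronized clocks.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Event 1074 on ABCDLWHV4: restart initiated.
reboot = datetime.strptime("9/29/2017 1:11:20 PM", FMT)

# Event ID 30804 (SMB client, connection type Rdma) as logged on each node.
smb_disconnects = {
    "ABCDLWHV1": "9/29/2017 1:05:55 PM",
    "ABCDLWHV4": "9/29/2017 1:10:20 PM",
}


def lead_seconds(disconnects, reboot_time):
    """Seconds by which each disconnect precedes the reboot (negative means after)."""
    return {
        node: int((reboot_time - datetime.strptime(ts, FMT)).total_seconds())
        for node, ts in disconnects.items()
    }


leads = lead_seconds(smb_disconnects, reboot)
for node, lead in leads.items():
    print(f"{node}: RDMA disconnect logged {lead}s before the restart")
```

Both disconnects precede the restart (by 325 and 60 seconds respectively). That is consistent with a network problem that already existed before the node was rebooted, although two events alone do not prove it.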
Conclusion:
- The issue appears to be caused by a misconfiguration of the network adapters. When cluster node ABCDLWHV4 was restarted, the CSV hosting the VMs on ABCDLWHV1 went into a paused state, probably because of network communication problems. Because the CSV could not come back online in time, the VMs hosted on it started going down.
- Review and correct the RDMA configuration on the cluster, in particular Priority Flow Control (PFC) on every host and switch on the RoCE network.
- Update the Mellanox network adapter firmware and drivers. If they are already up to date, confirm whether there are any known issues with that version.
- Update the storage controller firmware as well, to make sure it is at the current level.
- Install the latest Windows updates on all cluster nodes.