RCA 33: RHS Service Terminated on VM Clustering on vSphere Infra

ABCDB01 – W2K12 – cesxst357 – Active node

ABCDB02 – W2K12 – cesxst362 – Passive node

Hostname: CESXST357.Corp.abcd.com

ESXi Version: ESXi 6.0 P07

Time of Issue: Oct 1 around 12:41 PM HKT

Time in GMT: Oct 1 around 4:41 AM GMT

RDM Disk:

VMDK	vml.ID	naa.ID
ABCDB01_3.vmdk	vml.02000800006000097000049260063753303046323453594d4d4554	naa.60000970000492600637533030463234
ABCDB01_4.vmdk	vml.02000900006000097000049260063753303046323853594d4d4554	naa.60000970000492600637533030463238
ABCDB01_5.vmdk	vml.02001800006000097000049260063753303046464453594d4d4554	naa.60000970000492600637533030464644
ABCDB01_6.vmdk	vml.02001900006000097000049260063753303046464553594d4d4554	naa.60000970000492600637533030464645
ABCDB01_7.vmdk	vml.02000a00006000097000049260063753303046323953594d4d4554	naa.60000970000492600637533030463239
ABCDB01_8.vmdk	vml.02001700006000097000049260063753303046353253594d4d4554	naa.60000970000492600637533030463532
ABCDB01_9.vmdk	vml.02001a00006000097000049260063753303336344653594d4d4554	naa.60000970000492600637533033363446
ABCDB01_10.vmdk	vml.02001b00006000097000049260063753303336353053594d4d4554	naa.60000970000492600637533033363530
ABCDB01_11.vmdk	vml.02001c00006000097000049260063753303336353153594d4d4554	naa.60000970000492600637533033363531
ABCDB01_12.vmdk	vml.02001d00006000097000049260063753303336353453594d4d4554	naa.60000970000492600637533033363534
ABCDB01_13.vmdk	vml.02000b00006000097000049260063753303046333753594d4d4554	naa.60000970000492600637533030463337
ABCDB01_14.vmdk	vml.02000c00006000097000049260063753303046333853594d4d4554	naa.60000970000492600637533030463338
ABCDB01_15.vmdk	vml.02000d00006000097000049260063753303046333953594d4d4554	naa.60000970000492600637533030463339
ABCDB01_16.vmdk	vml.02000e00006000097000049260063753303046334153594d4d4554	naa.60000970000492600637533030463341
ABCDB01_17.vmdk	vml.02000f00006000097000049260063753303046334253594d4d4554	naa.60000970000492600637533030463342
ABCDB01_18.vmdk	vml.02001000006000097000049260063753303046334353594d4d4554	naa.60000970000492600637533030463343

HBA Firmware/Driver:

Emulex LightPulse FC SCSI 11.4.329.0
Synergy 3530C 16G HBA on PCI bus 0000:ce device 00 fn 1 port 1 Logical Link Speed: 16000 Mbps
FW Version: 12.0.346.9

Supported and Compatible.

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=42595&deviceCategory=io&details=1&VID=10df&DID=e200&SVID=103c&SSID=8001&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

System Logs:

From the system logs we can see that the issue started with the RHS being terminated and there are events for the Cluster Disk getting failed on the cluster.

TimeGenerated	EntryType	MachineName	EventID	Source	Message
01-10-2020 12.26	0	ABCDB01.corp.abcd.com	1146	Microsoft-Windows-FailoverClustering	The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Disk L:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Disk K:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Disk N:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Disk S:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	7034	Service Control Manager	The Distributed Transaction Coordinator (1994e80e-0d84-4269-b769-6aeb065c8dda) service terminated unexpectedly. It has done this 1 time(s).
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Disk F:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35	Error	ABCDB01.corp.abcd.com	1069	Microsoft-Windows-FailoverClustering	Cluster resource ‘Disk H:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

Vmware Logs:

Reviewed the Vmware logs and was not able to see any issues related to the Virtual Machine.

2020-10-01T04:37:55.541Z| vmx| I125: Guest: toolbox-dnd: Version: build-8068406
2020-10-01T04:39:46.647Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:39:46.650Z| vcpu-4| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:39:46.654Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:42:42.028Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:42:42.031Z| vcpu-0| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:42:42.035Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:43:07.532Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:43:07.536Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:43:07.541Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:47:42.615Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.618Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.621Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.643Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
2020-10-01T04:47:42.647Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
2020-10-01T04:47:42.651Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore

VOBD Logs:

Reviewed the VOBD logs and not able to see any events logs at the time of issue. The last events logged were around 12:00 GMT.

2020-10-01T00:34:15.084Z: [scsiCorrelator] 18976919149442us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 214058 microseconds to 42438 microseconds.
2020-10-01T00:36:35.403Z: [scsiCorrelator] 18977059472002us: [vob.scsi.device.io.latency.high] Device naa.60000970000492600637533030463238 performance has deteriorated. I/O latency increased from average value of 10377 microseconds to 209582 microseconds.
2020-10-01T00:36:35.404Z: [scsiCorrelator] 18977059468539us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000492600637533030463238 performance has deteriorated. I/O latency increased from average value of 10377 microseconds to 209582 microseconds.
2020-10-01T00:37:27.768Z: [scsiCorrelator] 18977111836957us: [vob.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 209582 microseconds to 41694 microseconds.
2020-10-01T00:37:27.768Z: [scsiCorrelator] 18977111833482us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 209582 microseconds to 41694 microseconds.
2020-10-01T00:39:43.911Z: [scsiCorrelator] 18977247979151us: [vob.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 41694 microseconds to 20493 microseconds.
2020-10-01T00:39:43.911Z: [scsiCorrelator] 18977247975697us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 41694 microseconds to 20493 microseconds.

VMKernel Logs:

From the VMKernel Logs we can see NAA.ID:60000970000492600637533030464645 i.e. ABCDB01_6.vmdk has SCSI Errors H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE

Type	Code	Name	Description
Host Status	[0x0]	OK	This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense Data.
Device Status	[0x2]	CHECK_CONDITION	This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
Plugin Status	[0x0]	GOOD	No error. (ESXi 5.x / 6.x only)
Sense Key	[0x6]	UNIT ATTENTION
Additional Sense Data	2A/05	REGISTRATIONS PREEMPTED
OP Code	0x84	Third-party Copy IN

These are just messages related to MSCS issued SCSI-3 Persistent Reservations commands and coming out after the issue has started in the Windows and could be the results of a Failover on the cluster.

2020-10-01T04:47:15.127Z cpu66:34237)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x84 failed <0/25> sid x0c2c08, did x025d00, oxid x1213 iotag x91a SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)

2020-10-01T04:47:15.127Z cpu66:34237)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0x84 (0x43bd4073df80, 0) to dev “naa.60000970000492600637533030464645” on path “vmhba2:C0:T0:L25” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE
2020-10-01T04:47:15.127Z cpu66:34237)ScsiDeviceIO: 2613: Cmd(0x43bd4073df80) 0x84, CmdSN 0x101bc3f from world 0 to dev “naa.60000970000492600637533030464645” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5.

2020-10-01T04:47:15.138Z cpu83:34241)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x84 failed <2/25> sid x0b2c08, did x016c00, oxid x1261 iotag x968 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.138Z cpu63:33795)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <2/23> sid x0b2c08, did x016c00, oxid x1373 iotag xa7a SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.141Z cpu50:33960)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <0/23> sid x0c2c08, did x025d00, oxid x1187 iotag x88e SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.145Z cpu40:34239)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <2/29> sid x0b2c08, did x016c00, oxid x139a iotag xaa1 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.170Z cpu50:33960)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/25> sid x0c2c08, did x022700, oxid x104b iotag x752 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.171Z cpu83:34241)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <3/10> sid x0b2c08, did x015e00, oxid x145e iotag xb65 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.172Z cpu50:34236)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/28> sid x0c2c08, did x022700, oxid xdb3 iotag x4ba SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.172Z cpu40:32865655)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <3/9> sid x0b2c08, did x015e00, oxid x11e7 iotag x8ee SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.174Z cpu40:34239)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/9> sid x0c2c08, did x022700, oxid xecc iotag x5d3 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:48:15.098Z cpu41:34142)NMP: nmp_ResetDeviceLogThrottling:3446: last error status from device naa.60000970000492600637533030464645 repeated 1 times

Conclusion:

From the logs we can conclude that the issue started around 12:35 when the RHS in the Cluster got terminated.

Looking at the VOBD logs we cannot see any performance degradation events which could cause this issue.

From the VMKernel logs we can see event related to Pre-emption of the Previously held lock around 12:47, which could be the SCSI3 Reservations command coming out of Windows during the time of Failover.

Action Plan:

I Will recommend you to check with you Windows Team to confirm if there were any issues from the Windows Layer. Since the RHS is terminated they can provide you more details on the reason of the RHS Terminated.