RCA 33: RHS Service Terminated on VM Clustering on vSphere Infra

ABCDB01 – W2K12 – cesxst357 – Active node

ABCDB02 – W2K12 – cesxst362 – Passive node

 

 

Hostname: CESXST357.Corp.abcd.com

ESXi Version: ESXi 6.0 P07

 

Time of Issue: Oct 1 around 12:41 PM HKT

Time in GMT: Oct 1 around 4:41 AM GMT

 

RDM Disk:

 

VMDK

vml.ID

naa.ID

ABCDB01_3.vmdk

vml.02000800006000097000049260063753303046323453594d4d4554

naa.60000970000492600637533030463234

ABCDB01_4.vmdk

vml.02000900006000097000049260063753303046323853594d4d4554

naa.60000970000492600637533030463238

ABCDB01_5.vmdk

vml.02001800006000097000049260063753303046464453594d4d4554

naa.60000970000492600637533030464644

ABCDB01_6.vmdk

vml.02001900006000097000049260063753303046464553594d4d4554

naa.60000970000492600637533030464645

ABCDB01_7.vmdk

vml.02000a00006000097000049260063753303046323953594d4d4554

naa.60000970000492600637533030463239

ABCDB01_8.vmdk

vml.02001700006000097000049260063753303046353253594d4d4554

naa.60000970000492600637533030463532

ABCDB01_9.vmdk

vml.02001a00006000097000049260063753303336344653594d4d4554

naa.60000970000492600637533033363446

ABCDB01_10.vmdk

vml.02001b00006000097000049260063753303336353053594d4d4554

naa.60000970000492600637533033363530

ABCDB01_11.vmdk

vml.02001c00006000097000049260063753303336353153594d4d4554

naa.60000970000492600637533033363531

ABCDB01_12.vmdk

vml.02001d00006000097000049260063753303336353453594d4d4554

naa.60000970000492600637533033363534

ABCDB01_13.vmdk

vml.02000b00006000097000049260063753303046333753594d4d4554

naa.60000970000492600637533030463337

ABCDB01_14.vmdk

vml.02000c00006000097000049260063753303046333853594d4d4554

naa.60000970000492600637533030463338

ABCDB01_15.vmdk

vml.02000d00006000097000049260063753303046333953594d4d4554

naa.60000970000492600637533030463339

ABCDB01_16.vmdk

vml.02000e00006000097000049260063753303046334153594d4d4554

naa.60000970000492600637533030463341

ABCDB01_17.vmdk

vml.02000f00006000097000049260063753303046334253594d4d4554

naa.60000970000492600637533030463342

ABCDB01_18.vmdk

vml.02001000006000097000049260063753303046334353594d4d4554

naa.60000970000492600637533030463343

 

HBA Firmware/Driver:

 

Emulex LightPulse FC SCSI 11.4.329.0
Synergy 3530C 16G HBA on PCI bus 0000:ce device 00 fn 1 port 1 Logical Link Speed: 16000 Mbps
FW Version:    12.0.346.9

 

Supported and Compatible.

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=42595&deviceCategory=io&details=1&VID=10df&DID=e200&SVID=103c&SSID=8001&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc

 

 

System Logs:

 

  • From the system logs we can see that the issue started with the RHS being terminated and there are events for the Cluster Disk getting failed on the cluster.

 

TimeGenerated

EntryType

MachineName

EventID

Source

Message

01-10-2020 12.26

0

ABCDB01.corp.abcd.com

1146

Microsoft-Windows-FailoverClustering

The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘Disk L:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘Disk K:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘Disk N:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘Disk S:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

7034

Service Control Manager

The Distributed Transaction Coordinator (1994e80e-0d84-4269-b769-6aeb065c8dda) service terminated unexpectedly.  It has done this 1 time(s).

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘Disk F:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

01-10-2020 12.35

Error

ABCDB01.corp.abcd.com

1069

Microsoft-Windows-FailoverClustering

Cluster resource ‘Disk H:’ of type ‘Physical Disk’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed due to an attempt to block a required state change in that cluster resource.

 

 

Vmware Logs:

 

  • Reviewed the Vmware logs and was not able to see any issues related to the Virtual Machine.

 

2020-10-01T04:37:55.541Z| vmx| I125: Guest: toolbox-dnd: Version: build-8068406
2020-10-01T04:39:46.647Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:39:46.650Z| vcpu-4| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:39:46.654Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:42:42.028Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:42:42.031Z| vcpu-0| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:42:42.035Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:43:07.532Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:43:07.536Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:43:07.541Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:47:42.615Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.618Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.621Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.643Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
2020-10-01T04:47:42.647Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
2020-10-01T04:47:42.651Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore

 

 

VOBD Logs:

 

  • Reviewed the VOBD logs and not able to see any events logs at the time of issue. The last events logged were around 12:00 GMT.

 

2020-10-01T00:34:15.084Z: [scsiCorrelator] 18976919149442us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 214058 microseconds to 42438 microseconds.
2020-10-01T00:36:35.403Z: [scsiCorrelator] 18977059472002us: [vob.scsi.device.io.latency.high] Device naa.60000970000492600637533030463238 performance has deteriorated. I/O latency increased from average value of 10377 microseconds to 209582 microseconds.
2020-10-01T00:36:35.404Z: [scsiCorrelator] 18977059468539us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000492600637533030463238 performance has deteriorated. I/O latency increased from average value of 10377 microseconds to 209582 microseconds.
2020-10-01T00:37:27.768Z: [scsiCorrelator] 18977111836957us: [vob.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 209582 microseconds to 41694 microseconds.
2020-10-01T00:37:27.768Z: [scsiCorrelator] 18977111833482us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 209582 microseconds to 41694 microseconds.
2020-10-01T00:39:43.911Z: [scsiCorrelator] 18977247979151us: [vob.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 41694 microseconds to 20493 microseconds.
2020-10-01T00:39:43.911Z: [scsiCorrelator] 18977247975697us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 41694 microseconds to 20493 microseconds.

 

 

VMKernel Logs:

 

 

  • From the VMKernel Logs we can see NAA.ID:60000970000492600637533030464645 i.e. ABCDB01_6.vmdk has SCSI Errors  H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE

 

Type

Code

Name

Description

Host Status

[0x0]

OK

This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense Data.

Device Status

[0x2]

CHECK_CONDITION

This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.

Plugin Status

[0x0]

GOOD

No error. (ESXi 5.x / 6.x only)

Sense Key

[0x6]

UNIT ATTENTION

 

Additional Sense Data

2A/05

REGISTRATIONS PREEMPTED

 

OP Code

0x84

Third-party Copy IN

 

 

 

 

  • These are just messages related to MSCS issued SCSI-3 Persistent Reservations commands and coming out after the issue has started in the Windows and could be the results of a Failover on the cluster.

 

2020-10-01T04:47:15.127Z cpu66:34237)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x84 failed <0/25> sid x0c2c08, did x025d00, oxid x1213 iotag x91a SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)

2020-10-01T04:47:15.127Z cpu66:34237)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0x84 (0x43bd4073df80, 0) to dev “naa.60000970000492600637533030464645” on path “vmhba2:C0:T0:L25” Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE
2020-10-01T04:47:15.127Z cpu66:34237)ScsiDeviceIO: 2613: Cmd(0x43bd4073df80) 0x84, CmdSN 0x101bc3f from world 0 to dev “naa.60000970000492600637533030464645” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5.

 

2020-10-01T04:47:15.138Z cpu83:34241)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x84 failed <2/25> sid x0b2c08, did x016c00, oxid x1261 iotag x968 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.138Z cpu63:33795)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <2/23> sid x0b2c08, did x016c00, oxid x1373 iotag xa7a SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.141Z cpu50:33960)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <0/23> sid x0c2c08, did x025d00, oxid x1187 iotag x88e SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.145Z cpu40:34239)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <2/29> sid x0b2c08, did x016c00, oxid x139a iotag xaa1 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.170Z cpu50:33960)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/25> sid x0c2c08, did x022700, oxid x104b iotag x752 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.171Z cpu83:34241)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <3/10> sid x0b2c08, did x015e00, oxid x145e iotag xb65 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.172Z cpu50:34236)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/28> sid x0c2c08, did x022700, oxid xdb3 iotag x4ba SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.172Z cpu40:32865655)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <3/9> sid x0b2c08, did x015e00, oxid x11e7 iotag x8ee SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.174Z cpu40:34239)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/9> sid x0c2c08, did x022700, oxid xecc iotag x5d3 SCSI Chk Cond – Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:48:15.098Z cpu41:34142)NMP: nmp_ResetDeviceLogThrottling:3446: last error status from device naa.60000970000492600637533030464645 repeated 1 times

 

 

Conclusion:

 

  • From the logs we can conclude that the issue started around 12:35 when the RHS in the Cluster got terminated.

 

  • Looking at the VOBD logs we cannot see any performance degradation events which could cause this issue.

 

  • From the VMKernel logs we can see event related to Pre-emption of the Previously held lock around 12:47, which could be the SCSI3 Reservations command coming out of Windows during the time of Failover.

 

Action Plan:

 

  • I Will recommend you to check with you Windows Team to confirm if there were any issues from the Windows Layer. Since the RHS is terminated they can provide you more details on the reason of the RHS Terminated.

 

  • I Will recommend you to check with your Storage Team to understand more about the Status of the RDMs at the time of issue.

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply