ABCDB01 – W2K12 – cesxst357 – Active node
ABCDB02 – W2K12 – cesxst362 – Passive node
Hostname: CESXST357.Corp.abcd.com
ESXi Version: ESXi 6.0 P07
Time of Issue: Oct 1, 2020 around 12:41 PM HKT
Time in GMT: Oct 1, 2020 around 4:41 AM GMT
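For reference when correlating the Windows event times (HKT, GMT+8) with the ESXi log timestamps (GMT/UTC), a minimal Python conversion check using the issue time above:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Time of issue as reported in the guest, Hong Kong time (GMT+8)
issue_hkt = datetime(2020, 10, 1, 12, 41, tzinfo=ZoneInfo("Asia/Hong_Kong"))

# Convert to GMT/UTC for matching against vmware.log/vobd/vmkernel timestamps
issue_gmt = issue_hkt.astimezone(timezone.utc)
print(issue_gmt.strftime("%Y-%m-%d %H:%M %Z"))  # 2020-10-01 04:41 UTC
```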
RDM Disk:
VMDK | vml.ID | naa.ID
ABCDB01_3.vmdk | vml.02000800006000097000049260063753303046323453594d4d4554 | naa.60000970000492600637533030463234
ABCDB01_4.vmdk | vml.02000900006000097000049260063753303046323853594d4d4554 | naa.60000970000492600637533030463238
ABCDB01_5.vmdk | vml.02001800006000097000049260063753303046464453594d4d4554 | naa.60000970000492600637533030464644
ABCDB01_6.vmdk | vml.02001900006000097000049260063753303046464553594d4d4554 | naa.60000970000492600637533030464645
ABCDB01_7.vmdk | vml.02000a00006000097000049260063753303046323953594d4d4554 | naa.60000970000492600637533030463239
ABCDB01_8.vmdk | vml.02001700006000097000049260063753303046353253594d4d4554 | naa.60000970000492600637533030463532
ABCDB01_9.vmdk | vml.02001a00006000097000049260063753303336344653594d4d4554 | naa.60000970000492600637533033363446
ABCDB01_10.vmdk | vml.02001b00006000097000049260063753303336353053594d4d4554 | naa.60000970000492600637533033363530
ABCDB01_11.vmdk | vml.02001c00006000097000049260063753303336353153594d4d4554 | naa.60000970000492600637533033363531
ABCDB01_12.vmdk | vml.02001d00006000097000049260063753303336353453594d4d4554 | naa.60000970000492600637533033363534
ABCDB01_13.vmdk | vml.02000b00006000097000049260063753303046333753594d4d4554 | naa.60000970000492600637533030463337
ABCDB01_14.vmdk | vml.02000c00006000097000049260063753303046333853594d4d4554 | naa.60000970000492600637533030463338
ABCDB01_15.vmdk | vml.02000d00006000097000049260063753303046333953594d4d4554 | naa.60000970000492600637533030463339
ABCDB01_16.vmdk | vml.02000e00006000097000049260063753303046334153594d4d4554 | naa.60000970000492600637533030463341
ABCDB01_17.vmdk | vml.02000f00006000097000049260063753303046334253594d4d4554 | naa.60000970000492600637533030463342
ABCDB01_18.vmdk | vml.02001000006000097000049260063753303046334353594d4d4554 | naa.60000970000492600637533030463343
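As a cross-check, for the Symmetrix RDMs listed above the naa identifier is embedded inside the vml identifier: "vml." is followed by a 10-character prefix, the 32-character NAA ID, and an ASCII model suffix (53594d4d4554 = "SYMMET"). A minimal sketch that derives the naa.ID from the vml.ID; the fixed offsets are an assumption based only on the entries in this table and may not hold for other array types:

```python
def vml_to_naa(vml_id: str) -> str:
    """Extract the embedded NAA identifier from a vml.* device ID.

    For the EMC Symmetrix RDMs in the table above, the vml string is
    'vml.' + 10 hex chars (prefix) + 32 hex chars (NAA ID)
    + ASCII model suffix ('53594d4d4554' == 'SYMMET').
    """
    body = vml_id.removeprefix("vml.")
    return "naa." + body[10:42]

# Cross-check against the first row of the RDM table
assert vml_to_naa(
    "vml.02000800006000097000049260063753303046323453594d4d4554"
) == "naa.60000970000492600637533030463234"
```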
HBA Firmware/Driver:
Emulex LightPulse FC SCSI driver, version 11.4.329.0
Synergy 3530C 16G HBA on PCI bus 0000:ce device 00 fn 1 port 1, Logical Link Speed: 16000 Mbps
FW Version: 12.0.346.9
Driver and firmware are supported and compatible.
System Logs:
- From the system logs we can see that the issue started with the Resource Hosting Subsystem (RHS) process being terminated, followed by events showing the cluster disks failing on the cluster.
TimeGenerated | EntryType | MachineName | EventID | Source | Message
01-10-2020 12.26 | 0 | ABCDB01.corp.abcd.com | 1146 | Microsoft-Windows-FailoverClustering | The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'Disk L:' of type 'Physical Disk' in clustered role 'SQL Server (MSSQLSERVER)' failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'Disk K:' of type 'Physical Disk' in clustered role 'SQL Server (MSSQLSERVER)' failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'Disk N:' of type 'Physical Disk' in clustered role 'SQL Server (MSSQLSERVER)' failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'Disk S:' of type 'Physical Disk' in clustered role 'SQL Server (MSSQLSERVER)' failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 7034 | Service Control Manager | The Distributed Transaction Coordinator (1994e80e-0d84-4269-b769-6aeb065c8dda) service terminated unexpectedly. It has done this 1 time(s).
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'Disk F:' of type 'Physical Disk' in clustered role 'SQL Server (MSSQLSERVER)' failed due to an attempt to block a required state change in that cluster resource.
01-10-2020 12.35 | Error | ABCDB01.corp.abcd.com | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'Disk H:' of type 'Physical Disk' in clustered role 'SQL Server (MSSQLSERVER)' failed due to an attempt to block a required state change in that cluster resource.
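For reference, the events above were filtered out of an export of the System event log on ABCDB01. A minimal sketch of that filtering, assuming the log has been exported to CSV with the same columns as the table above (the file name and column names are illustrative):

```python
import csv

# Cluster-related event IDs of interest: 1146 = RHS process terminated,
# 1069 = cluster resource failed, 7034 = service terminated unexpectedly
INTERESTING_IDS = {"1146", "1069", "7034"}

# Assumed CSV export of the System event log with the columns shown above
with open("ABCDB01_system_events.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        if row["EventID"] in INTERESTING_IDS:
            print(row["TimeGenerated"], row["EventID"], row["Source"],
                  row["Message"][:80])
```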
Vmware Logs:
- Reviewed the VMware logs and did not find any issues related to the virtual machine around the time of the issue; the entries below are routine informational guest audit messages.
2020-10-01T04:37:55.541Z| vmx| I125: Guest: toolbox-dnd: Version: build-8068406
2020-10-01T04:39:46.647Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:39:46.650Z| vcpu-4| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:39:46.654Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\Device Setup\Metadata Refresh) in a transaction, ignore
2020-10-01T04:42:42.028Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:42:42.031Z| vcpu-0| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:42:42.035Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-5268) in a transaction, ignore
2020-10-01T04:43:07.532Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:43:07.536Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:43:07.541Z| vcpu-5| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Optimize Start Menu Cache Files-S-1-5-21-69497794-716713911-133121052-96695) in a transaction, ignore
2020-10-01T04:47:42.615Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.618Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.621Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUScheduledInstall) in a transaction, ignore
2020-10-01T04:47:42.643Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
2020-10-01T04:47:42.647Z| vcpu-3| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
2020-10-01T04:47:42.651Z| vcpu-2| I125: Guest: vsep: AUDIT: VFileFltPostOpCreate : File (\Windows\System32\Tasks\Microsoft\Windows\WindowsUpdate\AUSessionConnect) in a transaction, ignore
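The entries above are all informational (I125) guest audit messages from the vsep driver and VMware Tools. A minimal sketch of the kind of scan used to confirm there are no warning or error level entries in vmware.log around the issue window; the assumption that warning and error lines carry a W*/E* level token (instead of I*) follows the standard vmware.log line format, and the file path is illustrative:

```python
import re

# vmware.log lines look like: "<timestamp>| <component>| <LEVEL><nnn>: <message>"
# e.g. "2020-10-01T04:37:55.541Z| vmx| I125: Guest: toolbox-dnd: ..."
LEVEL_RE = re.compile(r"\|\s*[\w-]+\|\s*([IWE])\d+:")

def non_info_lines(path="vmware.log"):
    """Yield vmware.log lines whose level token is not informational (I*)."""
    with open(path, errors="replace") as fh:
        for line in fh:
            m = LEVEL_RE.search(line)
            if m and m.group(1) != "I":
                yield line.rstrip()

for line in non_info_lines():
    print(line)
```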
VOBD Logs:
- Reviewed the VOBD logs and did not see any events logged at the time of the issue. The last events were logged around 00:40 GMT, well before the issue, as shown below.
2020-10-01T00:34:15.084Z: [scsiCorrelator] 18976919149442us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 214058 microseconds to 42438 microseconds.
2020-10-01T00:36:35.403Z: [scsiCorrelator] 18977059472002us: [vob.scsi.device.io.latency.high] Device naa.60000970000492600637533030463238 performance has deteriorated. I/O latency increased from average value of 10377 microseconds to 209582 microseconds.
2020-10-01T00:36:35.404Z: [scsiCorrelator] 18977059468539us: [esx.problem.scsi.device.io.latency.high] Device naa.60000970000492600637533030463238 performance has deteriorated. I/O latency increased from average value of 10377 microseconds to 209582 microseconds.
2020-10-01T00:37:27.768Z: [scsiCorrelator] 18977111836957us: [vob.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 209582 microseconds to 41694 microseconds.
2020-10-01T00:37:27.768Z: [scsiCorrelator] 18977111833482us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 209582 microseconds to 41694 microseconds.
2020-10-01T00:39:43.911Z: [scsiCorrelator] 18977247979151us: [vob.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 41694 microseconds to 20493 microseconds.
2020-10-01T00:39:43.911Z: [scsiCorrelator] 18977247975697us: [esx.clear.scsi.device.io.latency.improved] Device naa.60000970000492600637533030463238 performance has improved. I/O latency reduced from 41694 microseconds to 20493 microseconds.
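A minimal sketch of how the latency VOBs above can be pulled out of vobd.log and compared against the issue window; the regex is written against the exact message format quoted above, and the default /var/log/vobd.log path on the host is assumed:

```python
import re

# Matches the latency VOB lines shown above, e.g.
# "... [vob.scsi.device.io.latency.high] Device naa.xxx performance has
#  deteriorated. I/O latency increased from average value of 10377
#  microseconds to 209582 microseconds."
LAT_RE = re.compile(
    r"^(?P<ts>\S+):.*\[(?P<vob>\S*latency\S*)\]\s+Device\s+(?P<dev>naa\.\w+)"
    r".*?(?P<from>\d+)\s+microseconds to (?P<to>\d+) microseconds"
)

def latency_events(path="/var/log/vobd.log"):
    """Yield (timestamp, vob id, device, latency before, latency after)."""
    with open(path, errors="replace") as fh:
        for line in fh:
            m = LAT_RE.search(line)
            if m:
                yield (m["ts"], m["vob"], m["dev"],
                       int(m["from"]), int(m["to"]))

for ts, vob, dev, old, new in latency_events():
    print(f"{ts} {dev} {vob}: {old} us -> {new} us")
```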
VMKernel Logs:
- From the VMkernel logs we can see that naa.60000970000492600637533030464645 (ABCDB01_6.vmdk) reported SCSI errors H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE, which decode as follows:
Type | Code | Name | Description
Host Status | [0x0] | OK | This status is returned when there is no error on the host side. It is when you will see a status for a Device or Plugin, and when you will see "Valid sense data" instead of "Possible sense data".
Device Status | [0x2] | CHECK_CONDITION | This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESXi storage stack sends a SCSI command 0x3 (REQUEST SENSE) to retrieve the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order Sense Key, Additional Sense Code, ASC Qualifier.
Plugin Status | [0x0] | GOOD | No error. (ESXi 5.x/6.x only)
Sense Key | [0x6] | UNIT ATTENTION |
Additional Sense Data | 2A/05 | REGISTRATIONS PREEMPTED |
OP Code | 0x84 | Third-party Copy IN |
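A minimal helper that decodes the status string seen in the vmkernel entries further below using the mappings in the table above; only the codes seen in this case are included, so a full decode would need the complete SCSI code tables:

```python
# Code-to-name mappings taken from the table above; unknown codes fall back
# to their raw hex form.
HOST_STATUS   = {0x0: "OK"}
DEVICE_STATUS = {0x2: "CHECK_CONDITION"}
PLUGIN_STATUS = {0x0: "GOOD"}
SENSE_KEY     = {0x6: "UNIT ATTENTION"}
ASC_ASCQ      = {(0x2a, 0x05): "REGISTRATIONS PREEMPTED"}

def decode_scsi_status(h, d, p, key, asc, ascq):
    """Decode an 'H:x D:x P:x Valid sense data: key asc ascq' status."""
    return {
        "host":   HOST_STATUS.get(h, hex(h)),
        "device": DEVICE_STATUS.get(d, hex(d)),
        "plugin": PLUGIN_STATUS.get(p, hex(p)),
        "sense_key": SENSE_KEY.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:02X}/{ascq:02X}"),
    }

# H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5 (as seen in the entries below)
print(decode_scsi_status(0x0, 0x2, 0x0, 0x6, 0x2a, 0x05))
```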
- These are just messages related to the SCSI-3 Persistent Reservation commands issued by MSCS; they appear after the issue had already started inside Windows and could be the result of a failover on the cluster, as shown in the entries below.
2020-10-01T04:47:15.127Z cpu66:34237)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x84 failed <0/25> sid x0c2c08, did x025d00, oxid x1213 iotag x91a SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.127Z cpu66:34237)NMP: nmp_ThrottleLogForDevice:3302: Cmd 0x84 (0x43bd4073df80, 0) to dev "naa.60000970000492600637533030464645" on path "vmhba2:C0:T0:L25" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE
2020-10-01T04:47:15.127Z cpu66:34237)ScsiDeviceIO: 2613: Cmd(0x43bd4073df80) 0x84, CmdSN 0x101bc3f from world 0 to dev "naa.60000970000492600637533030464645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5.
2020-10-01T04:47:15.138Z cpu83:34241)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x84 failed <2/25> sid x0b2c08, did x016c00, oxid x1261 iotag x968 SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.138Z cpu63:33795)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <2/23> sid x0b2c08, did x016c00, oxid x1373 iotag xa7a SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.141Z cpu50:33960)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <0/23> sid x0c2c08, did x025d00, oxid x1187 iotag x88e SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.145Z cpu40:34239)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <2/29> sid x0b2c08, did x016c00, oxid x139a iotag xaa1 SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.170Z cpu50:33960)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/25> sid x0c2c08, did x022700, oxid x104b iotag x752 SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.171Z cpu83:34241)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <3/10> sid x0b2c08, did x015e00, oxid x145e iotag xb65 SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.172Z cpu50:34236)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/28> sid x0c2c08, did x022700, oxid xdb3 iotag x4ba SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.172Z cpu40:32865655)lpfc: lpfc_handle_status:5069: 2:(0):3271: FCP cmd x0 failed <3/9> sid x0b2c08, did x015e00, oxid x11e7 iotag x8ee SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:47:15.174Z cpu40:34239)lpfc: lpfc_handle_status:5069: 1:(0):3271: FCP cmd x0 failed <3/9> sid x0c2c08, did x022700, oxid xecc iotag x5d3 SCSI Chk Cond - Unit Attn: Data(x2:x6:x2a:x5)
2020-10-01T04:48:15.098Z cpu41:34142)NMP: nmp_ResetDeviceLogThrottling:3446: last error status from device naa.60000970000492600637533030464645 repeated 1 times
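A minimal sketch that extracts the device and the H/D/P status plus sense bytes from the nmp_ThrottleLogForDevice and ScsiDeviceIO entries above, so they can be fed into a decoder like the one sketched earlier; the pattern is written against the exact lines quoted above and the log file name is illustrative:

```python
import re

# Matches the NMP/ScsiDeviceIO entries quoted above, e.g.
#   ... to dev "naa.60000970000492600637533030464645" on path "vmhba2:C0:T0:L25"
#   Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x2a 0x5. Act:NONE
NMP_RE = re.compile(
    r'dev "(?P<dev>naa\.\w+)".*?'
    r'H:(?P<h>0x[0-9a-fA-F]+) D:(?P<d>0x[0-9a-fA-F]+) P:(?P<p>0x[0-9a-fA-F]+) '
    r'Valid sense data: (?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)

def scsi_errors(path="vmkernel.log"):  # path is illustrative
    """Yield (device, (host, device, plugin, sense key, asc, ascq)) per error line."""
    with open(path, errors="replace") as fh:
        for line in fh:
            m = NMP_RE.search(line)
            if m:
                yield m["dev"], tuple(
                    int(m[g], 16) for g in ("h", "d", "p", "key", "asc", "ascq"))

for dev, codes in scsi_errors():
    print(dev, codes)
```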
Conclusion:
- From the logs we can conclude that the issue started between 12:26 and 12:35 HKT, when the RHS process on the cluster was terminated and the cluster disk resources subsequently failed.
- Looking at the VOBD logs, we do not see any storage performance degradation events at the time of the issue that could explain it.
- From the VMkernel logs we can see events related to pre-emption of previously held registrations around 12:47 HKT (04:47 GMT), which are most likely the SCSI-3 Persistent Reservation commands issued by Windows during the cluster failover.
Action Plan:
- We recommend checking with your Windows team to confirm whether there were any issues at the Windows layer. Since the RHS process was terminated, they should be able to provide more details on the reason for the termination.
- We recommend checking with your storage team to understand the status of the RDM LUNs at the time of the issue.