Hostname: xyzsv010321.intranet.abc.com
Uptime: 7.9 days (11363 minutes)
ESXi Version: VMware ESXi 6.7.0 build-17700523
Object Health:
------------------------------------------------------------
nonavailabilityrelatedincompliancewithpolicypendingfailed    0
reduced-availability-with-no-rebuild-delay-timer             0
reducedavailabilitywithpolicypending                         0
inaccessible                                                 0
reduced-availability-with-active-rebuild                     0
nonavailability-related-incompliance                         0
reducedavailabilitywithpausedrebuild                         0
nonavailabilityrelatedincompliancewithpausedrebuild          0
nonavailability-related-reconfig                             0
reduced-availability-with-no-rebuild                         0
nonavailabilityrelatedincompliancewithpolicypending          0
reducedavailabilitywithpolicypendingfailed                   0
data-move                                                    0
healthy                                                      1674
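- The per-state object counts above can be re-checked directly on the host if needed. A minimal example, assuming shell access to this ESXi 6.7 host (command availability can vary by build; verify with esxcli vsan debug --help):
   esxcli vsan debug object health summary get    # per-state vSAN object counts
   esxcli vsan debug disk list                    # per-disk vSAN health overview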
VOBD Logs:
- From the VOBD logs we can see that a repair operation was triggered due to I/O failures, and the vSAN devices subsequently went offline.
2022-08-16T13:17:54.827Z: [vSANCorrelator] 399494474us:
[vob.vsan.lsom.devicerepair] vSAN device 52385363-7081-d832-fa4f-88200d001cb5
is being repaired due to I/O failures, and will be out of service until the
repair is complete. If the device is part of a dedup disk group, the entire
disk group will be out of service until the repair is complete.
2022-08-16T13:17:54.827Z: [vSANCorrelator] 399494779us:
[esx.problem.vob.vsan.lsom.devicerepair] Device
52385363-7081-d832-fa4f-88200d001cb5 is in offline state and is getting
repaired.
2022-08-16T13:17:58.844Z: [vSANCorrelator] 403511359us:
[vob.vsan.pdl.offline] vSAN device 528dd82a-9af8-a4ff-2982-653e28d011ce has
gone offline.
2022-08-16T13:17:58.844Z: [vSANCorrelator] 403511548us:
[esx.problem.vob.vsan.pdl.offline] vSAN device
528dd82a-9af8-a4ff-2982-653e28d011ce has gone offline.
2022-08-16T13:17:58.844Z: An event (esx.problem.vob.vsan.pdl.offline)
could not be sent immediately to hostd; queueing for retry.
2022-08-16T13:17:58.844Z: [vSANCorrelator] 403511367us:
[vob.vsan.pdl.offline] vSAN device 52824081-ba45-2e25-e41c-03338f894606 has
gone offline.
2022-08-16T13:17:58.844Z: [vSANCorrelator] 403511609us:
[esx.problem.vob.vsan.pdl.offline] vSAN device
52824081-ba45-2e25-e41c-03338f894606 has gone offline.
2022-08-16T13:17:58.844Z: An event (esx.problem.vob.vsan.pdl.offline)
could not be sent immediately to hostd; queueing for retry.
2022-08-16T13:17:58.844Z: [vSANCorrelator] 403511392us:
[vob.vsan.pdl.offline] vSAN device 52d7cbcd-30f3-b646-7174-32e28da7dbbb has
gone offline.
2022-08-16T13:18:01.786Z: [vSANCorrelator] 406453743us:
[vob.vsan.net.reconfigured] vmknic vmk2 has been reconfigured.
2022-08-16T13:18:01.786Z: [vSANCorrelator] 406453984us:
[esx.audit.vsan.net.vnic.added] vSAN vnic added
2022-08-16T13:18:01.787Z: An event
(esx.audit.vsan.net.vnic.added) could not be sent immediately to hostd;
queueing for retry.
2022-08-16T13:18:40.574Z: [GenericCorrelator] 444823152us:
[vob.user.host.boot] Host has booted.
2022-08-16T13:18:40.574Z: [UserLevelCorrelator]
444823152us: [vob.user.host.boot] Host has booted.
2022-08-16T13:18:40.574Z: [UserLevelCorrelator] 444823659us:
[esx.audit.host.boot] Host has booted.
2022-08-16T13:23:41.329Z: [vSANCorrelator] 745548641us:
[vob.vsan.lsom.devicerepair] vSAN device 52385363-7081-d832-fa4f-88200d001cb5
is being repaired due to I/O failures, and will be out of service until the
repair is complete. If the device is part of a dedup disk group, the entire
disk group will be out of service until the repair is complete.
2022-08-16T13:23:41.329Z: [vSANCorrelator] 745577903us:
[esx.problem.vob.vsan.lsom.devicerepair] Device
52385363-7081-d832-fa4f-88200d001cb5 is in offline state and is getting
repaired.
2022-08-16T13:39:12.864Z: [vSANCorrelator] 1677004102us:
[vob.vsan.lsom.devicerepair] vSAN device 52385363-7081-d832-fa4f-88200d001cb5
is being repaired due to I/O failures, and will be out of service until the
repair is complete. If the device is part of a dedup disk group, the entire
disk group will be out of service until the repair is complete.
- Here we can see the devices reporting permanent errors:
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606646322us:
[vob.vsan.lsom.diskerror] vSAN device 52d7cbcd-30f3-b646-7174-32e28da7dbbb is
under permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606835399us:
[esx.problem.vob.vsan.lsom.diskerror] vSAN device
52d7cbcd-30f3-b646-7174-32e28da7dbbb is under permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606646331us:
[vob.vsan.lsom.diskerror] vSAN device 52d7cbcd-30f3-b646-7174-32e28da7dbbb is
under permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606835490us:
[esx.problem.vob.vsan.lsom.diskerror] vSAN device
52d7cbcd-30f3-b646-7174-32e28da7dbbb is under permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606646388us:
[vob.vsan.lsom.diskpropagatedpermerror] vSAN device
52824081-ba45-2e25-e41c-03338f894606 is under propagated permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606835535us:
[esx.problem.vob.vsan.lsom.diskpropagatedpermerror] vSAN device
52824081-ba45-2e25-e41c-03338f894606 is under propagated permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606646401us:
[vob.vsan.lsom.diskpropagatedpermerror] vSAN device
528dd82a-9af8-a4ff-2982-653e28d011ce is under propagated permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606835574us:
[esx.problem.vob.vsan.lsom.diskpropagatedpermerror] vSAN device
528dd82a-9af8-a4ff-2982-653e28d011ce is under propagated permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606646415us:
[vob.vsan.lsom.diskpropagatedpermerror] vSAN device
52385363-7081-d832-fa4f-88200d001cb5 is under propagated permanent error.
2022-08-16T13:54:42.586Z: [vSANCorrelator] 2606835612us:
[esx.problem.vob.vsan.lsom.diskpropagatedpermerror] vSAN device
52385363-7081-d832-fa4f-88200d001cb5 is under propagated permanent error.
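- The repair, offline, and permanent-error events above can be pulled from the on-host log for a quick recheck. A minimal filter, assuming the default log location /var/log/vobd.log on this host:
   grep -E "devicerepair|pdl.offline|diskerror|diskpropagatedpermerror" /var/log/vobd.log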
VMkernel Logs:
2022-08-16T13:23:37.311Z cpu6:2098447)NMP:
nmp_ThrottleLogForDevice:3872: Cmd 0x28 (0x459bc1c9abc0, 0) to dev
“naa.5002538bc9916d70” on path “vmhba7:C0:T8:L0” Failed:
H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. Act:NONE
2022-08-16T13:23:37.311Z cpu6:2098447)ScsiDeviceIO: 3483:
Cmd(0x459bc1c9abc0) 0x28, CmdSN 0x2875 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:37.812Z cpu2:2098447)ScsiDeviceIO: 3483:
Cmd(0x459bc1d3f800) 0x28, CmdSN 0x357e from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:38.314Z cpu20:2098447)ScsiDeviceIO: 3483:
Cmd(0x459bc1d3e180) 0x28, CmdSN 0x40a4 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:38.816Z cpu3:2098447)ScsiDeviceIO: 3483:
Cmd(0x459f3498fa80) 0x28, CmdSN 0x44c9 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:39.317Z cpu3:2098447)ScsiDeviceIO: 3483:
Cmd(0x459f34928100) 0x28, CmdSN 0x475b from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:39.819Z cpu3:2098447)ScsiDeviceIO: 3483:
Cmd(0x459f348f4440) 0x28, CmdSN 0x4769 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:40.320Z cpu3:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a222252c00) 0x28, CmdSN 0x476a from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:40.822Z cpu3:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a2223c5c40) 0x28, CmdSN 0x476b from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:23:41.324Z cpu3:2098447)ScsiDeviceIO: 3483:
Cmd(0x459bc1c20640) 0x28, CmdSN 0x476c from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:39:08.848Z cpu43:2098448)NMP:
nmp_ThrottleLogForDevice:3872: Cmd 0x28 (0x45bbc17859c0, 0) to dev
“naa.5002538bc9916d70” on path “vmhba7:C0:T8:L0” Failed:
H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. Act:NONE
2022-08-16T13:39:08.848Z cpu43:2098448)ScsiDeviceIO: 3483:
Cmd(0x45bbc17859c0) 0x28, CmdSN 0x27b7 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:39:09.349Z cpu36:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a3d58cb500) 0x28, CmdSN 0x3418 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:39:09.850Z cpu36:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a3d596d440) 0x28, CmdSN 0x3f60 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:39:10.351Z cpu5:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a3d59adf40) 0x28, CmdSN 0x44b3 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:39:10.851Z cpu5:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a3d59f4f80) 0x28, CmdSN 0x474e from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
2022-08-16T13:39:11.354Z cpu36:2098447)ScsiDeviceIO: 3483:
Cmd(0x45a3d58cad80) 0x28, CmdSN 0x4769 from world 0 to dev
“naa.5002538bc9916d70” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3
0x11 0x0.
- Under the VMkernel logs we can see that naa.5002538bc9916d70 is returning SCSI sense key 0x3, which per the table below is a Medium Error: READ(10) commands are failing with an unrecovered read error. This tells us that the device itself is having issues.
Type                  | Code  | Name                   | Description
Host Status           | [0x0] | OK                     | Returned when there is no error on the host side. You will see this when there is a status for a Device or Plugin, and when Valid sense data is reported instead of Possible sense data.
Device Status         | [0x2] | CHECK_CONDITION        | Returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESXi storage stack sends SCSI command 0x3 (REQUEST SENSE) to retrieve the sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order Sense Key, Additional Sense Code, ASC Qualifier.
Plugin Status         | [0x0] | GOOD                   | No error. (ESXi 5.x / 6.x only)
Sense Key             | [0x3] | MEDIUM ERROR           |
Additional Sense Data | 11/00 | UNRECOVERED READ ERROR |
OP Code               | 0x28  | READ(10)               |
Conclusion:
- Based on the logs, SCSI sense key 0x3 (Medium Error) is being returned for naa.5002538bc9916d70. This is an I/O device failure due to a medium error.
Action Plan:
- Please replace the disk with NAA ID 5002538bc9916d70. Below is the information you can share with the Dell team so that they can isolate this device:
naa.5002538bc9916d70:
   Device: naa.5002538bc9916d70
   Display Name: naa.5002538bc9916d70
   Is SSD: true
   VSAN UUID: 528dd82a-9af8-a4ff-2982-653e28d011ce
   VSAN Disk Group UUID: 52385363-7081-d832-fa4f-88200d001cb5
   VSAN Disk Group Name: t10.NVMe____Dell_Express_Flash_NVMe_P4610_1.6TB_SFF_00016FEA25E4D25C
   Used by this host: true
   In CMMDS: true
   On-disk format version: 10
   Deduplication: true
   Compression: true
   Checksum: 1200740824495245392
   Checksum OK: true
   Is Capacity Tier: true
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: true
   Creation Time: Fri Jul 15 12:32:04 2022
naa.5002538bc9916d70:
   Device Display Name: Local SAMSUNG Disk (naa.5002538bc9916d70)
   Storage Array Type: VMW_SATP_LOCAL
   Storage Array Type Device Config: SATP VMW_SATP_LOCAL does not support device configuration.
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba7:C0:T8:L0;current=vmhba7:C0:T8:L0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba7:C0:T8:L0
   Is USB: false
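- The device details above correspond to the standard esxcli views and can be re-collected on the host if the Dell team needs fresh output. A minimal example, assuming shell access (the vSAN listing is shown here trimmed to the affected device):
   esxcli vsan storage list                                   # vSAN disk / disk group membership details
   esxcli storage nmp device list -d naa.5002538bc9916d70     # NMP path, SATP and PSP details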
- Please follow KB https://kb.vmware.com/s/article/2149067 to replace the storage device. Note that since deduplication is enabled, the entire disk group will be in a failed state.
- The disk group must be destroyed first with the "No data migration" option (as the disk group is effectively lost), then the failed disk replaced and the disk group re-created.
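- If the removal and re-creation are done from the host command line instead of the vSphere Client, a minimal sketch is below. It assumes the cache-tier device of disk group 52385363-7081-d832-fa4f-88200d001cb5 has been identified first; <cache-tier-naa-id> and <new-capacity-naa-id> are placeholders, and the options should be verified with esxcli vsan storage remove --help before use:
   # remove the whole disk group via its cache-tier device, with no data evacuation
   esxcli vsan storage remove -s <cache-tier-naa-id> -m noAction
   # after the failed disk is physically replaced, re-create the disk group
   esxcli vsan storage add -s <cache-tier-naa-id> -d <new-capacity-naa-id>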