Based on the logs and details, Deduplication is enabled on the vSAN
Datastore. With deduplication enabled, the failure of any single disk takes the
entire disk group offline, and that is exactly what we are seeing here.
This behavior is documented here: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.virtualsan.doc/GUID-3D2D80CC-444E-454E-9B8B-25C3F620EFED.html
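As a quick cross-check (a sketch, assuming shell access to a host in the cluster; the vSAN disk inventory reports deduplication per disk), the status can be confirmed with:
   esxcli vsan storage list | grep -i -E "Device:|Deduplication:"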
Host Details:
   Hostname: abcdpng2r192.
   Version:  VMware ESXi 7.0.3 Build: 20036589
   Release:  ESXi 7.0 Update 3f
   Uptime:   314 days, 3:40:52
- vSAN Objects are showing as healthy:
   Health Status                                     Number Of Objects
   ------------------------------------------------  -----------------
   remoteAccessible                                  0
   inaccessible                                      0
   reduced-availability-with-no-rebuild              0
   reduced-availability-with-no-rebuild-delay-timer  0
   reducedavailabilitywithpolicypending              0
   reducedavailabilitywithpolicypendingfailed        0
   reduced-availability-with-active-rebuild          0
   healthy                                           780
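For reference, this summary is the output of the vSAN debug command below (assuming vSAN 6.6 or later, where the debug namespace is available):
   esxcli vsan debug object health summary get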
VMkernel Logs:
- From the vmkernel logs we can see READ CAPACITY commands failing with I/O
errors for the device
t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
2023-09-05T15:05:40.059Z cpu33:173974292)WARNING: ScsiDeviceIO: 12155: READ CAPACITY on device "t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C" from Plugin "HPP" failed. I/O error
2023-09-05T15:05:40.059Z cpu70:173960604)NVMEPSA:1314 taskMgmt:abort cmdId.initiator=0x430accc2d3c0 CmdSN 0x56ab364f world:0 controller 259 state:5 nsid:1
2023-09-05T15:05:40.059Z cpu70:173960604)NVMEIO:3904 Ctlr 259, ns 1, tmReq 0x4317531f6230, type 1, initiator 0x430accc2d3c0, sn 0x56ab364f, world id 0.
2023-09-05T15:05:40.060Z cpu124:173974293)WARNING: ScsiDeviceIO: 12155: READ CAPACITY on device "t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C" from Plugin "HPP" failed. I/O error
- The entire Disk Group then went into a PDL (Permanent Device Loss) state:
2023-09-05T15:01:13.771Z cpu87:173966416)PLOG: PLOGCanStartDeviceRecovery:5471: SSD 5204043d-c8e4-5848-a435-cece854c3f9c is in PDL
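As a sketch (assuming the default log location on the host), the READ CAPACITY and PDL entries above can be pulled straight from the live vmkernel log with:
   grep -E "READ CAPACITY|is in PDL" /var/log/vmkernel.log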
vSAN Storage List:
- Here we can see more details regarding the failed Disk:
VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C, errno (19)
VsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C', create disk with null id
t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
Device: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Display Name: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Is SSD: false
VSAN UUID:
VSAN Disk Group UUID:
VSAN Disk Group Name:
Used by this host: false
In CMMDS: false
On-disk format version: -1
Deduplication: false
Compression: false
Checksum:
Checksum OK: false
Is Capacity Tier: false
Encryption Metadata Checksum OK: false
Encryption: false
DiskKeyLoaded: false
Is Mounted: false
Creation Time: Unknown
- Here we can see the disk-group entry for the failed device showing as
Unknown; its vSAN UUID matches the SSD reported in PDL above.
Unknown:
Device: Unknown
Display Name: Unknown
Is SSD: false
VSAN UUID: 5204043d-c8e4-5848-a435-cece854c3f9c
VSAN Disk Group UUID:
VSAN Disk Group Name:
Used by this host: false
In CMMDS: true
On-disk format version: -1
Deduplication: false
Compression: false
Checksum:
Checksum OK: false
Is Capacity Tier: false
Encryption Metadata Checksum OK: false
Encryption: false
DiskKeyLoaded: false
Is Mounted: false
Creation Time: Unknown
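Both entries above come from the host's vSAN disk inventory; on a live host the same view (a sketch, assuming shell access) is available with:
   esxcli vsan storage list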
Device Stats Get:
- Checked the Device Stats to get a better understanding of the Disk:
t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
Device: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Successful Commands: 51094514038
Blocks Read: 1339219854886
Blocks Written: 1817959134424
Read Operations: 16214215168
Write Operations: 34879073031
Reserve Operations: 2
Reservation Conflicts: 0
Failed Commands: 126
Failed Blocks Read: 0
Failed Blocks Written: 0
Failed Read Operations: 84
Failed Write Operations: 39
Failed Reserve Operations: 0
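These counters can be re-read at any time (a sketch, using the device identifier from the logs above):
   esxcli storage core device stats get -d t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C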
- Captured the path list for the failed device (from
/commands/localcli_storage-core-path-list.txt in the support bundle):
pcie.6600-pcie.0:0-t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
UID: pcie.6600-pcie.0:0-t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Runtime Name: vmhba3:C0:T0:L0
Device: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Device Display Name: Local NVMe Disk (t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C)
Adapter: vmhba3
Channel: 0
Target: 0
LUN: 0
Plugin: HPP
State: active
Transport: pcie
Adapter Identifier: pcie.6600
Target Identifier: pcie.0:0
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
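On a live host the same path details come from the command below (a sketch, again using the device name from above; localcli in the bundle is equivalent to esxcli here):
   esxcli storage core path list -d t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C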
Action Plan:
- The Disk Group must be removed first with the option "No data migration"
(the Disk Group is effectively lost, so there is no data to evacuate),
then replace the failed disk and re-create the Disk Group; see the
command sketch after this list.
- I would recommend getting in touch with your Hardware Vendor to replace
the faulty disk in the ESXi Host, based on the details I have shared
above.
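As a minimal sketch of the removal step (assuming CLI access; the UUID is the vSAN UUID of the PDL SSD from the logs above, and removing the cache-tier device removes the whole Disk Group):
   esxcli vsan storage remove --uuid=5204043d-c8e4-5848-a435-cece854c3f9c --evacuation-mode=noAction
The same can be done in the vSphere Client under Configure > vSAN > Disk Management with the "No data migration" option.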
For more details, please refer to: https://kb.vmware.com/s/article/2149067
Please let me know if you have any questions or queries regarding this.