Based on the logs and details, Deduplication is enabled on the vSAN
Datastore. With deduplication enabled, the failure of any single disk takes the
entire disk group offline, and that is exactly what we are seeing here.
This behavior is documented here: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.virtualsan.doc/GUID-3D2D80CC-444E-454E-9B8B-25C3F620EFED.html
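As a quick cross-check (a sketch, assuming shell access to a host in the cluster; the vSAN disk inventory reports deduplication per disk), the status can be confirmed with:
   esxcli vsan storage list | grep -i -E "Device:|Deduplication:"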
Host Details:
   Hostname: abcdpng2r192.
   Version:  VMware ESXi 7.0.3 Build: 20036589
   Release:  ESXi 7.0 Update 3f
   Uptime:   314 days, 3:40:52
- vSAN Objects are showing as healthy:
   Health Status                                     Number Of Objects
   ------------------------------------------------  -----------------
   remoteAccessible                                  0
   inaccessible                                      0
   reduced-availability-with-no-rebuild              0
   reduced-availability-with-no-rebuild-delay-timer  0
   reducedavailabilitywithpolicypending              0
   reducedavailabilitywithpolicypendingfailed        0
   reduced-availability-with-active-rebuild          0
   healthy                                           780
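For reference, this summary is the output of the vSAN debug command below (assuming vSAN 6.6 or later, where the debug namespace is available):
   esxcli vsan debug object health summary get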
VMkernel Logs:
- From the vmkernel logs we can see READ CAPACITY commands failing with I/O
errors for the device
t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
2023-09-05T15:05:40.059Z cpu33:173974292)WARNING: ScsiDeviceIO: 12155: READ CAPACITY on device "t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C" from Plugin "HPP" failed. I/O error
2023-09-05T15:05:40.059Z cpu70:173960604)NVMEPSA:1314 taskMgmt:abort cmdId.initiator=0x430accc2d3c0 CmdSN 0x56ab364f world:0 controller 259 state:5 nsid:1
2023-09-05T15:05:40.059Z cpu70:173960604)NVMEIO:3904 Ctlr 259, ns 1, tmReq 0x4317531f6230, type 1, initiator 0x430accc2d3c0, sn 0x56ab364f, world id 0.
2023-09-05T15:05:40.060Z cpu124:173974293)WARNING: ScsiDeviceIO: 12155: READ CAPACITY on device "t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C" from Plugin "HPP" failed. I/O error
- The entire Disk Group then went into a PDL (Permanent Device Loss) state:
2023-09-05T15:01:13.771Z cpu87:173966416)PLOG: PLOGCanStartDeviceRecovery:5471: SSD 5204043d-c8e4-5848-a435-cece854c3f9c is in PDL
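As a sketch (assuming the default log location on the host), the READ CAPACITY and PDL entries above can be pulled straight from the live vmkernel log with:
   grep -E "READ CAPACITY|is in PDL" /var/log/vmkernel.log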
vSAN Storage List:
- Here we can see more details regarding the failed Disk:
VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C, errno (19)
VsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C', create disk with null id
t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
Device: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Display Name: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Is SSD: false
VSAN UUID:
VSAN Disk Group UUID:
VSAN Disk Group Name:
Used by this host: false
In CMMDS: false
On-disk format version: -1
Deduplication: false
Compression: false
Checksum:
Checksum OK: false
Is Capacity Tier: false
Encryption Metadata Checksum OK: false
Encryption: false
DiskKeyLoaded: false
Is Mounted: false
Creation Time: Unknown
- Here we can see the disk-group entry for the failed device showing as
Unknown; its vSAN UUID matches the SSD reported in PDL above.
Unknown:
Device: Unknown
Display Name: Unknown
Is SSD: false
VSAN UUID: 5204043d-c8e4-5848-a435-cece854c3f9c
VSAN Disk Group UUID:
VSAN Disk Group Name:
Used by this host: false
In CMMDS: true
On-disk format version: -1
Deduplication: false
Compression: false
Checksum:
Checksum OK: false
Is Capacity Tier: false
Encryption Metadata Checksum OK: false
Encryption: false
DiskKeyLoaded: false
Is Mounted: false
Creation Time: Unknown
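Both entries above come from the host's vSAN disk inventory; on a live host the same view (a sketch, assuming shell access) is available with:
   esxcli vsan storage list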
Device Stats Get:
- Checked the Device Stats to get a better understanding of the Disk:
t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
Device: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Successful Commands: 51094514038
Blocks Read: 1339219854886
Blocks Written: 1817959134424
Read Operations: 16214215168
Write Operations: 34879073031
Reserve Operations: 2
Reservation Conflicts: 0
Failed Commands: 126
Failed Blocks Read: 0
Failed Blocks Written: 0
Failed Read Operations: 84
Failed Write Operations: 39
Failed Reserve Operations: 0
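These counters can be re-read at any time (a sketch, using the device identifier from the logs above):
   esxcli storage core device stats get -d t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C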
- Captured the path list for the failed device (from
/commands/localcli_storage-core-path-list.txt in the support bundle):
pcie.6600-pcie.0:0-t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C:
UID: pcie.6600-pcie.0:0-t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Runtime Name: vmhba3:C0:T0:L0
Device: t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C
Device Display Name: Local NVMe Disk (t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C)
Adapter: vmhba3
Channel: 0
Target: 0
LUN: 0
Plugin: HPP
State: active
Transport: pcie
Adapter Identifier: pcie.6600
Target Identifier: pcie.0:0
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
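On a live host the same path details come from the command below (a sketch, again using the device name from above; localcli in the bundle is equivalent to esxcli here):
   esxcli storage core path list -d t10.NVMe____INTEL_SSDPF21Q800GB_____________________00036B5526E4D25C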
Action Plan:
- The Disk Group must be removed first with the option "No data migration"
(the Disk Group is effectively lost, so there is no data to evacuate),
then replace the failed disk and re-create the Disk Group; see the
command sketch after this list.
- I would recommend getting in touch with your Hardware Vendor to replace
the faulty disk in the ESXi Host, based on the details I have shared
above.
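As a minimal sketch of the removal step (assuming CLI access; the UUID is the vSAN UUID of the PDL SSD from the logs above, and removing the cache-tier device removes the whole Disk Group):
   esxcli vsan storage remove --uuid=5204043d-c8e4-5848-a435-cece854c3f9c --evacuation-mode=noAction
The same can be done in the vSphere Client under Configure > vSAN > Disk Management with the "No data migration" option.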
For more details, please refer to: https://kb.vmware.com/s/article/2149067
Please let me know if you have any questions or queries regarding this.