Identifying Storage Corruption on a VMFS Datastore

You may run into strange behavior in your vSphere infrastructure: multiple hosts go into a Not Responding state, and your virtual machines stop answering pings. The symptoms appear on multiple ESXi hosts, not just one.

When you list the storage on the affected hosts, you might find your datastores missing.

While working through the events, you may see the following:

VOBD Logs:

2023-03-03T00:52:39.820Z: Event rate limit reached. Dropping vprob: esx.problem.vmfs.heartbeat.corruptondisk
2023-03-03T00:52:39.825Z: [vmfsCorrelator] 1801426573592us: [vob.vmfs.heartbeat.corruptondisk] Volume 6155563f-2882c26a-e58e-246e96d6abcd ("GVVOL0ABCD") may be damaged on disk. Corrupt heartbeat detected at offset 3706880: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2023-03-03T00:52:39.825Z: Event rate limit reached. Dropping vprob: esx.problem.vmfs.heartbeat.corruptondisk
2023-03-03T00:52:39.826Z: [vmfsCorrelator] 1801426574523us: [vob.vmfs.resource.corruptondisk] Volume 63208c05-f0d644f6-26cb-000e1e371234 ("GVVOL01234") might be damaged on the disk. Resource cluster metadata corruption has been detected.
2023-03-03T00:52:39.826Z: Event rate limit reached. Dropping vprob: esx.problem.vmfs.resource.corruptondisk
2023-03-03T00:52:39.827Z: [vmfsCorrelator] 1801426575164us: [vob.vmfs.heartbeat.corruptondisk] Volume 609d5cbd-3b0b3d8e-15cf-e4434b35123D ("GVVOL0123D") may be damaged on disk. Corrupt heartbeat detected at offset 3706880: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2023-03-03T00:52:39.827Z: Event rate limit reached. Dropping vprob: esx.problem.vmfs.heartbeat.corruptondisk
2023-03-03T00:52:39.839Z: [vmfsCorrelator] 1801426587046us: [vob.vmfs.heartbeat.corruptondisk] Volume 6155563f-2882c26a-e58e-246e96d6abcd ("GVVOL0ABCD") may be damaged on disk. Corrupt heartbeat detected at offset 3706880: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
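
When many volumes are affected, it helps to tally these events per datastore. The following is a minimal sketch in Python, not an official VMware tool. It assumes the VOBD lines keep the format shown above; the function name summarize_vobd is hypothetical, and you would feed it the lines of your own vobd.log.

```python
import re
from collections import Counter

# Matches VOBD lines of the form shown above:
#   ... [vob.vmfs.<kind>.corruptondisk] Volume <uuid> ("<label>") ...
# The "Event rate limit reached" lines do not contain "[vob.vmfs." and are skipped.
CORRUPT_RE = re.compile(
    r'\[vob\.vmfs\.(?P<kind>heartbeat|resource)\.corruptondisk\]\s+'
    r'Volume\s+(?P<uuid>\S+)\s+\("(?P<label>[^"]+)"\)'
)


def summarize_vobd(lines):
    """Count corruption events per (label, uuid, kind)."""
    counts = Counter()
    for line in lines:
        match = CORRUPT_RE.search(line)
        if match:
            counts[(match["label"], match["uuid"], match["kind"])] += 1
    return counts


def print_summary(lines):
    """Print one line per affected volume and event kind."""
    for (label, uuid, kind), seen in sorted(summarize_vobd(lines).items()):
        print(f"{label}  {uuid}  {kind}  events={seen}")
```

To use it, open the log file and pass the file object to print_summary. A volume that reports both heartbeat and resource events is listed once per kind.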

VMkernel Logs:

2023-03-03T00:52:11.381Z cpu48:2581195)WARNING: HBX: 751: 'GVVOL01234': HB at offset 0 - Volume 611f90fe-72dfa85e-de64-246e96d61234 may be damaged on disk. Corrupt heartbeat detected:
2023-03-03T00:52:11.381Z cpu48:2581195)WARNING: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2023-03-03T00:52:11.381Z cpu48:2581195)WARNING: FS3: 608: VMFS volume GVVOL01234/611f90fe-72dfa85e-de64-246e96d61234 on naa.6006016046203f0097fd42845101ec11:1 has been detected corrupted
2023-03-03T00:52:11.381Z cpu48:2581195)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
2023-03-03T00:52:11.381Z cpu48:2581195)FS3: 634: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.6006016046203f0097fd42845101ec11:1 -D X`
2023-03-03T00:52:11.381Z cpu48:2581195)FS3: 641: where X is the dump file name on a DIFFERENT volume

2023-03-03T00:52:11.383Z cpu46:2581125)WARNING: HBX: 751: 'GVVOL0112D': HB at offset 0 - Volume 6337398b-376c5c79-2ec4-000e1e376538 may be damaged on disk. Corrupt heartbeat detected:
2023-03-03T00:52:11.383Z cpu46:2581125)WARNING: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2023-03-03T00:52:11.383Z cpu46:2581125)WARNING: FS3: 608: VMFS volume GVVOL0112D/6337398b-376c5c79-2ec4-000e1e376538 on naa.6006016046203f00e0abfcb8dc40ed11:1 has been detected corrupted
2023-03-03T00:52:11.383Z cpu46:2581125)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
2023-03-03T00:52:11.383Z cpu46:2581125)FS3: 634: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.6006016046203f00e0abfcb8dc40ed11:1 -D X`
2023-03-03T00:52:11.383Z cpu46:2581125)FS3: 641: where X is the dump file name on a DIFFERENT volume

2023-03-03T00:52:11.385Z cpu62:2581194)WARNING: HBX: 751: 'GVVOL01342D': HB at offset 0 - Volume xxxxxx-9c91bbed-4273-000e1e376348 may be damaged on disk. Corrupt heartbeat detected:
2023-03-03T00:52:11.385Z cpu62:2581194)WARNING: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2023-03-03T00:52:11.385Z cpu62:2581194)WARNING: FS3: 608: VMFS volume GVVOL01342D/5fa68e3c-9c91bbed-4273-000e1e376348 on naa.6006016046203f008a6f93d8b520eb11:1 has been detected corrupted
2023-03-03T00:52:11.385Z cpu62:2581194)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
2023-03-03T00:52:11.385Z cpu62:2581194)FS3: 634: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.6006016046203f008a6f93d8b520eb11:1 -D X`
2023-03-03T00:52:11.385Z cpu62:2581194)FS3: 641: where X is the dump file name on a DIFFERENT volume

2023-03-03T00:52:11.387Z cpu62:2581393)WARNING: HBX: 751: 'GVVOL0173DDD': HB at offset 0 - Volume xxxxxxxx-2882c26a-e58e-246e96d61234 may be damaged on disk. Corrupt heartbeat detected:
2023-03-03T00:52:11.387Z cpu62:2581393)WARNING: [HB state 0 offset 0 gen 0 stampUS 0 uuid 00000000-00000000-0000-000000000000 jrnl <FB 0> drv 0.0]
2023-03-03T00:52:11.387Z cpu62:2581393)WARNING: FS3: 608: VMFS volume GVVOL0173DDD/6155563f-2882c26a-e58e-246e96d61234 on naa.6006016046203f009ea60af5ea1cec11:1 has been detected corrupted
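
The FS3 lines above tie each corrupted volume to its backing device, which is the path you need for any later VOMA run or hex dump. The sketch below, again in Python and only an illustration, pulls that mapping out of VMkernel log lines. It assumes the "has been detected corrupted" message format shown above; corrupted_volumes is a hypothetical helper name.

```python
import re

# Matches VMkernel lines of the form shown above:
#   ... FS3: 608: VMFS volume <label>/<uuid> on <device>:<partition> has been detected corrupted
FS3_RE = re.compile(
    r'VMFS volume (?P<label>[^/\s]+)/(?P<uuid>\S+) on '
    r'(?P<device>[\w.]+):(?P<partition>\d+) has been detected corrupted'
)


def corrupted_volumes(lines):
    """Return {label: (uuid, device_path)} for every volume flagged by FS3."""
    found = {}
    for line in lines:
        match = FS3_RE.search(line)
        if match:
            # Same path layout the log itself uses in its voma hint.
            path = f"/vmfs/devices/disks/{match['device']}:{match['partition']}"
            found[match["label"]] = (match["uuid"], path)
    return found
```

Each returned path has the same form as the one the log prints in its own voma suggestion, so it can be pasted into that command as-is.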

This is a clear case of on-disk corruption. The logs show that the heartbeat (HB) region of the datastores is missing (every field in the heartbeat record reads 0 and the UUID is all zeros), so the volumes fail. While the host tries to recover the volumes, the hostd service becomes unresponsive, and the ESXi host agents go down as a result.

Depending on the extent of the corruption, you can run the VOMA tool (vSphere On-disk Metadata Analyzer) and try to fix some of it. However, restoring from backup is always recommended, because the full extent of the corruption is unknown.

In my case, running VOMA gave the following response:

[FDNET\kabcdef#@abcd21:/dev/disks] voma -m vmfs -d /vmfs/devices/disks/naa.6006016046203f004f2e3b80ebb3eb11:1 -s /tmp/naa.6006016046203f004f2e3b80ebb3eb11-analysis.txt
Initializing LVM metadata..-
LVM magic not found at expected Offset,
It might take long time to search in rest of the disk.
VMware ESXi Question:
Do you want to continue (Y/N)?
0) _Yes
1) _No
Select a number from 0-1: 0

It failed because it could not find the LVM magic on the LUN, which suggests that the partition information has been wiped.

A hex dump of the device showed the following:

[sabcd21:/dev/disks] hexdump -C /vmfs/devices/disks/naa.6006016046203f004f2e3b80ebb3eb11
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*

Nothing but zeros. The * line is how hexdump marks repeated identical lines, so every 16-byte row after the first is zero as well.

This indicates that the volume was wiped on the storage side and nothing is left on this datastore.
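
If you have many LUNs to check, the same zero test can be scripted instead of reading each hex dump by eye. A minimal sketch in Python follows. It only reads from the device and never writes. It assumes the device node can be opened for reading from the ESXi shell; the helper name is_all_zero and the 1 MiB default are illustrative choices, not anything VMware prescribes.

```python
def is_all_zero(path, length=1024 * 1024, chunk=64 * 1024):
    """Return True if the first `length` bytes of `path` are all 0x00.

    Returns False as soon as a non-zero byte is found, and also when
    nothing at all could be read (an empty file proves nothing).
    """
    remaining = length
    seen = 0
    with open(path, "rb") as dev:
        while remaining > 0:
            data = dev.read(min(chunk, remaining))
            if not data:
                break  # reached the end of the device or file early
            if data.count(0) != len(data):
                return False
            seen += len(data)
            remaining -= len(data)
    return seen > 0
```

A True result for the start of the LUN matches the hex dump above. A False result only means that some data is present, not that the volume is healthy.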

On a working datastore, the hex dump looks like this:

hexdump -C /vmfs/devices/disks/naa.6d09466061b6d700234e228751eda800 | less
00000000 fa 31 c0 8e d8 8e d0 bc 00 7c 89 e6 06 57 8e c0 |.1.......|...W..|
00000010 fb fc bf 00 06 b9 00 01 f3 a5 ea 1f 06 00 00 52 |...............R|
00000020 52 b4 41 bb aa 55 31 c9 30 f6 f9 cd 13 72 13 81 |R.A..U1.0....r..|
00000030 fb 55 aa 75 0d d1 e9 73 09 66 c7 06 47 07 b4 42 |.U.u...s.f..G..B|
00000040 eb 13 5a b4 08 cd 13 83 e1 3f 89 e5 51 0f b6 c6 |..Z......?..Q...|
00000050 40 f7 e1 52 50 66 31 c0 66 99 40 bb 00 7c 53 e8 |@..RPf1.f.@..|S.|
00000060 d7 00 8b 4e 56 8b 46 5a 50 51 f7 e1 c1 e8 09 91 |...NV.FZPQ......|
00000070 41 66 8b 46 4e 66 8b 56 52 53 e8 bc 00 e8 b0 00 |Af.FNf.VRS......|
00000080 e2 f8 5e 59 58 51 56 83 c6 10 bf a8 07 b9 08 00 |..^YXQV.........|
00000090 f3 a7 5e 59 74 21 01 c6 e2 eb e8 e8 00 42 6f 6f |..^Yt!.......Boo|
000000a0 74 20 70 61 72 74 69 74 69 6f 6e 20 6e 6f 74 20 |t partition not |
000000b0 66 6f 75 6e 64 0d 0a 91 bf be 07 57 66 31 c0 b0 |found......Wf1..|
000000c0 80 66 ab b0 ee 66 ab 66 8b 44 20 66 8b 54 24 e8 |.f...f.f.D f.T$.|
000000d0 4e 00 66 8b 44 28 66 8b 54 30 66 2b 44 20 66 1b |N.f.D(f.T0f+D f.|
000000e0 54 24 e8 4b 00 e8 38 00 f3 a4 5e 66 8b 44 30 66 |T$.K..8...^f.D0f|
000000f0 8b 54 34 5b e8 42 00 81 7f fe 55 aa 75 0e 89 ec |.T4[.B....U.u...|
00000100 5a 5f 07 66 b8 21 47 50 54 fa ff e4 e8 76 00 4f |Z_.f.!GPT....v.O|
00000110 53 20 6e 6f 74 20 62 6f 6f 74 61 62 6c 65 0d 0a |S not bootable..|
00000120 66 50 66 21 d2 74 04 66 83 c8 ff 66 ab 66 58 c3 |fPf!.t.f...f.fX.|
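
The healthy dump starts with boot code from the first sector of the disk, while the wiped one starts with zeros. That difference can be checked in code by looking for the standard on-disk signatures. The sketch below is in Python and rests on stated assumptions: 512-byte logical sectors, the MBR boot signature 55 AA at byte offsets 510 and 511, and the GPT header signature "EFI PART" at the start of the second sector. Neither signature is visible in the truncated dump above, and classify_header is a hypothetical helper.

```python
SECTOR = 512  # assumed logical sector size


def classify_header(data):
    """Classify the first two sectors of a disk.

    Returns "zeroed", "gpt", "mbr-only", or "unknown".
    """
    if len(data) < 2 * SECTOR:
        raise ValueError("need at least the first two sectors")
    head = data[:2 * SECTOR]
    if head.count(0) == len(head):
        return "zeroed"
    has_mbr = head[510:512] == b"\x55\xaa"            # MBR boot signature
    has_gpt = head[SECTOR:SECTOR + 8] == b"EFI PART"  # GPT header signature
    if has_mbr and has_gpt:
        return "gpt"
    if has_mbr:
        return "mbr-only"
    return "unknown"
```

Read the first 1024 bytes of the device in binary mode and pass them in. A "zeroed" result corresponds to the wiped LUN shown earlier. Any other result only describes the partition table, not the state of the VMFS metadata behind it.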

When a LUN has been zeroed like this, unfortunately the only option is to restore from a backup, and then to investigate on the storage side what went wrong or who wiped the storage.