RCA 37: VM Went into Read-only Mode on VMFS Datastore

Hostname: ma-abc-1411-a8-01.infra.abc.com

ESXi Version: VMware ESXi 6.5.0 build-9298722 (ESXi 6.5 U2C)

 

Time of Issue: Feb 25, 2021, 06:50 to 07:50 UTC

 

 

vmhba | driver | driver version | VID | DID | SVID | SDID | model
vmhba0 | nhpsa | 2.0.6-3vmw.650.0.0.4564106 | 103c | 323c | 103c | 1921 | Hewlett-Packard Company Smart Array P830i

 

 

vmnic | PCI bus address | link | speed | duplex | MTU | driver | driver version | firmware version | MAC address | VID | DID | SVID | SDID | name
vmnic0 | 0000:03:00.0 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | e0:07:1b:f0:ac:48 | 14e4 | 168e | 103c | 1930 | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
vmnic1 | 0000:03:00.1 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | e0:07:1b:f0:ac:4c | 14e4 | 168e | 103c | 1930 | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
vmnic2 | 0000:84:00.0 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | 9c:dc:71:7a:b6:c8 | 14e4 | 168e | 103c | 339d | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
vmnic3 | 0000:84:00.1 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | 9c:dc:71:7a:b6:cc | 14e4 | 168e | 103c | 339d | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter

 

VMK Summary:

 

  • Boot Summary:

 

2020-10-02T20:33:22Z bootstop: Host has booted
2020-12-02T17:27:05Z bootstop: Host is rebooting
2020-12-02T17:36:02Z bootstop: Host has booted
2021-02-25T06:41:38Z bootstop: Host has booted

 

Hostd Logs:

 

 

  • I reviewed the hostd logs but found no entries indicating that the ESXi host was in a hung state.

 

  • However, the logs show several I/O errors caused by the storage being inaccessible:

 

2021-02-25T06:11:50.616Z error hostd[DCC6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/abc-s-ma-hdb-001/abc-s-ma-hdb-001.vmx] Could not perform config check (storage not accessible): vim.fault.GenericVmConfigFault
2021-02-25T06:11:50.621Z info hostd[DC85B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx opID=lro-1-4cd9b3b7-ec198-01-01-01-b1-03bf user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
2021-02-25T06:11:50.624Z info hostd[D485B70] [Originator@6876 sub=vm:DictionaryLoad: Cannot open file “/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] : Input/output error.
2021-02-25T06:11:50.625Z info hostd[D485B70] [Originator@6876 sub=vm:DictionaryLoad: Cannot open file “/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] : Input/output error.
2021-02-25T06:11:50.627Z info hostd[D485B70] [Originator@6876 sub=vm:DictionaryLoad: Cannot open file “/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] : Input/output error.
2021-02-25T06:11:50.630Z error hostd[D485B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] Could not perform config check (storage not accessible): vim.fault.GenericVmConfigFault

 

  • The logs show that the datastore MA_HANA_STG_01_L04_DS02 was inaccessible:

 

2021-02-25T06:16:53.594Z warning hostd[D340B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/abc-s-ma-hdb-001/abc-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:16:53.617Z warning hostd[D340B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/abc-s-ma-hdb-001/abc-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:16:53.894Z warning hostd[D7CAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/pqr-s-ma-hdb-001/pqr-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:16:53.916Z warning hostd[D7CAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/pqr-s-ma-hdb-001/pqr-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible

2021-02-25T06:26:54.169Z warning hostd[DDCAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:26:54.171Z warning hostd[DDCAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] FetchUpdatedLayout: VM storage inaccessible.
2021-02-25T06:26:54.173Z warning hostd[DDCAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] Failed to find activation record, event user unknown.

  • After this, the logs show the reboot operation:

 

-->    fullName = "VMware ESX build-9298722",
-->    version = "6.5.0",
-->    build = "9298722",

 

VOBD Logs:

 

  • The VOBD logs show lost connections to the NFS servers 172.16.0.5 and 172.16.0.4 for the mount points MA_HANA_STG_01_L05_DS03 and MA_HANA_STG_01_L04_DS02, along with All Paths Down (APD) events:

 

2021-02-25T06:03:46.988Z: [vmfsCorrelator] 7302728494869us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 (“MA_HANA_STG_01_L05_DS03”)
2021-02-25T06:26:22.957Z: [vmfsCorrelator] 7304084494786us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 (“MA_HANA_STG_01_L04_DS02”)

2021-02-25T05:32:58.029Z: [APDCorrelator] 7300879495079us: [vob.storage.apd.start] Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T05:32:58.029Z: [APDCorrelator] 7300714347738us: [esx.problem.storage.apd.start] Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.

 

  • Similar events appear throughout the day:

 

2021-02-25T01:46:20.318Z: [vmfsCorrelator] 7287281495957us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 (“MA_HANA_STG_01_L05_DS03”)
2021-02-25T01:47:56.316Z: [vmfsCorrelator] 7287377496007us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.3 mount point /MA_HANA_STG_01_L03_DS01, mounted as 5ed3d1ca-fc73f1bd-0000-000000000000 (“MA_HANA_STG_01_L03_DS01”)
2021-02-25T01:50:44.312Z: [vmfsCorrelator] 7287545496157us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 (“MA_HANA_STG_01_L05_DS03”)
2021-02-25T01:55:08.308Z: [vmfsCorrelator] 7287809495916us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 (“MA_HANA_STG_01_L05_DS03”)
2021-02-25T02:10:32.285Z: [vmfsCorrelator] 7288733495835us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 (“MA_HANA_STG_01_L04_DS02”)
2021-02-25T03:32:57.181Z: [vmfsCorrelator] 7293678495514us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 (“MA_HANA_STG_01_L04_DS02”)
2021-02-25T04:07:34.138Z: [vmfsCorrelator] 7295755495587us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 (“MA_HANA_STG_01_L04_DS02”)
2021-02-25T04:24:58.116Z: [vmfsCorrelator] 7296799495322us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 (“MA_HANA_STG_01_L04_DS02”)
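To see how often each datastore dropped, the disconnect events can be tallied per server and mount point. The following is a minimal Python sketch and was not part of the original analysis. It assumes the vobd.log line layout shown in the excerpts above, and the name count_disconnects is hypothetical.

```python
import re
from collections import Counter

# Pattern for vobd "nfs.server.disconnect" entries. It is inferred only from
# the line layout visible in the excerpts above (an assumption, not a spec).
DISCONNECT_RE = re.compile(
    r"\[vob\.vmfs\.nfs\.server\.disconnect\] Lost connection to the server "
    r"(?P<server>\S+) mount point (?P<mount>[^,\s]+),"
)


def count_disconnects(lines):
    """Count disconnect events per (server, mount point) pair."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[(match.group("server"), match.group("mount"))] += 1
    return counts
```

Passing an open vobd.log file object to count_disconnects and printing the result of most_common() gives a quick ranking of the mounts that dropped most often, which can help decide which NFS server or network path to examine first.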

  • After this, the logs show that the host rebooted:

 

2021-02-25T06:41:38.866Z: [GenericCorrelator] 107926801us: [vob.user.host.boot] Host has booted.
2021-02-25T06:41:38.866Z: [UserLevelCorrelator] 107926801us: [vob.user.host.boot] Host has booted.
2021-02-25T06:41:38.866Z: [UserLevelCorrelator] 107927036us: [esx.audit.host.boot] Host has booted.
2021-02-25T06:43:59.498Z: [GenericCorrelator] 248559208us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2021-02-25T06:43:59.498Z: [UserLevelCorrelator] 248559208us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2021-02-25T06:43:59.499Z: [UserLevelCorrelator] 248559382us: [esx.audit.maintenancemode.entering] The host has begun entering maintenance mode.
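Each vobd entry carries a microsecond counter just before the event name (for example, 107926801us on the first post-boot line). Assuming this counter is the time elapsed since the VMkernel started, which the small values right after boot suggest, subtracting it from the entry's timestamp gives an estimate of when the kernel came up. The sketch below illustrates that arithmetic in Python; the function name and the interpretation of the counter are assumptions, not something stated in the logs themselves.

```python
import re
from datetime import datetime, timedelta, timezone

# "<UTC timestamp>Z: [<Correlator>] <counter>us: ..." as seen in the vobd
# excerpts above. Treating <counter> as kernel uptime is an assumption.
VOBD_RE = re.compile(
    r"^(\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d\.\d{3})Z: \[\w+\] (\d+)us: "
)


def estimated_kernel_start(line):
    """Return the entry timestamp minus the counter, or None if no match."""
    match = VOBD_RE.match(line)
    if match is None:
        return None
    logged_at = datetime.strptime(
        match.group(1), "%Y-%m-%dT%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)
    return logged_at - timedelta(microseconds=int(match.group(2)))
```

Applied to the first "Host has booted" line above, this yields a kernel start of roughly 06:39:51 UTC on Feb 25, which narrows the window in which the reboot took place.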

VMkernel Logs:

 

  • The VMkernel logs show the same set of storage disconnection events, with the datastores entering the All Paths Down state:

 

2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection

 

2021-02-25T04:06:10.140Z cpu3:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T04:23:34.118Z cpu3:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T04:49:34.085Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T05:32:58.029Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.

2021-02-25T06:02:22.990Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [abff1f2a-fc86081e] has entered the All Paths Down state.
2021-02-25T06:03:46.988Z cpu78:66728)WARNING: NFS: 337: Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 (“MA_HANA_STG_01_L05_DS03”)
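The All Paths Down entries identify the datastore only by a short identifier such as [7b84abff-2a86b528], while the "Lost connection" entries also give the datastore name. A small Python sketch like the one below could join the two so that each APD event is labeled with a name. It is an illustration under the assumption that the line layouts match the excerpts above; label_apd_events is a hypothetical helper, and identifiers with no matching mount line are reported as "unknown".

```python
import re

# Both patterns are inferred from the log excerpts above, not from a spec.
# The optional quote classes accept straight or typographic quotes.
MOUNT_RE = re.compile(
    r'mounted as ([0-9a-f]{8}-[0-9a-f]{8})-\S+ \(["“]?([^"“”)]+)["”]?\)'
)
APD_RE = re.compile(
    r"identifier \[([0-9a-f]{8}-[0-9a-f]{8})\] has entered the All Paths Down state"
)


def label_apd_events(lines):
    """Return (timestamp, identifier, datastore name) for each APD entry."""
    lines = list(lines)

    # First pass: learn identifier -> datastore name from the mount messages.
    names = {}
    for line in lines:
        match = MOUNT_RE.search(line)
        if match:
            names[match.group(1)] = match.group(2)

    # Second pass: attach the name to every APD start event.
    events = []
    for line in lines:
        match = APD_RE.search(line)
        if match:
            timestamp = line.split()[0].rstrip(":")
            identifier = match.group(1)
            events.append((timestamp, identifier, names.get(identifier, "unknown")))
    return events
```

Feeding vmkernel.log and vobd.log lines together into label_apd_events gives one timeline in which every APD entry names the datastore it refers to.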

 

 

Conclusion:

 

  • Based on the logs, the issue occurred because the ESXi host lost access to its datastores, which entered the All Paths Down (APD) state. Some VMs, Linux guests in particular, might still respond for a while if what they need is already running from memory.
  • In these scenarios the ESXi host itself can sometimes become unresponsive, because all of the datastores associated with the host went into the APD state.
  • Because this appears to be a networking issue, it cannot be investigated further from the logs alone. A live troubleshooting session at the time of the issue generally isolates the cause far better.

 

Action Plan:

 

  • Because the affected storage is NFS, the issue can generally be isolated only through live troubleshooting while it is occurring.
  • I recommend checking with your networking team to confirm whether they saw any issues on the NAS side at the time of the issue.
  • If the issue recurs, please follow the steps below or engage us on a call so that we can perform live troubleshooting to isolate it better:
    • Use the following command to check connectivity between the ESXi host and the NAS server:
    • vmkping -I vmkX x.x.x.x (where vmkX is the VMkernel port used for NFS, and x.x.x.x is the NFS server IP address)
    • If the ping fails, we can also take a packet capture to examine the traffic flow.
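Because the physical NICs listed earlier report an MTU of 9000, it can also be worth repeating the vmkping check with a full-size, non-fragmentable packet, since a plain ping can succeed even when jumbo frames are dropped somewhere along the path. The Python sketch below only builds the two command lines. The interface name vmk1 in the usage note is a placeholder, and it is an assumption that the NFS VMkernel port is also configured for MTU 9000. The payload size is the MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header).

```python
# IPv4 header (20 bytes) + ICMP header (8 bytes)
ICMP_OVERHEAD = 28


def vmkping_commands(vmk, server_ip, mtu=9000):
    """Build a basic and a jumbo-frame vmkping command line.

    vmk and server_ip are supplied by the caller; mtu=9000 mirrors the
    physical NIC setting listed earlier and may differ on the VMkernel port.
    """
    payload = mtu - ICMP_OVERHEAD
    return [
        # Basic reachability over the chosen VMkernel interface
        f"vmkping -I {vmk} {server_ip}",
        # -d sets the don't-fragment bit, -s sets the ICMP payload size
        f"vmkping -I {vmk} -d -s {payload} {server_ip}",
    ]
```

For example, vmkping_commands("vmk1", "172.16.0.4") returns the basic command plus "vmkping -I vmk1 -d -s 8972 172.16.0.4". If the first command succeeds and the second fails, an MTU mismatch between the host, the switches, and the NAS is a likely area to check.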

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
