Hostname: ma-abc-1411-a8-01.infra.abc.com
ESXi Version: VMware ESXi 6.5.0 build-9298722 (ESXi 6.5 U2C)
Time of Issue: Feb 25, 2021, 06:50 to 07:50 UTC
vmhba | driver | driver version | VID | DID | SVID | SDID | model
vmhba0 | nhpsa | 2.0.6-3vmw.650.0.0.4564106 | 103c | 323c | 103c | 1921 | Hewlett-Packard Company Smart Array P830i
vmnic | PCI bus address | link | speed | duplex | MTU | driver | driver version | firmware version | MAC address | VID | DID | SVID | SDID | name
vmnic0 | 0000:03:00.0 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | e0:07:1b:f0:ac:48 | 14e4 | 168e | 103c | 1930 | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
vmnic1 | 0000:03:00.1 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | e0:07:1b:f0:ac:4c | 14e4 | 168e | 103c | 1930 | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
vmnic2 | 0000:84:00.0 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | 9c:dc:71:7a:b6:c8 | 14e4 | 168e | 103c | 339d | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
vmnic3 | 0000:84:00.1 | Up | 10000 | Full | 9000 | bnx2x | 2.713.60.v60.2 | bc 7.15.56 | 9c:dc:71:7a:b6:cc | 14e4 | 168e | 103c | 339d | Broadcom Corporation QLogic 57810 10 Gigabit Ethernet Adapter
VMK Summary:
- Boot Summary:
2020-10-02T20:33:22Z bootstop: Host has booted
2020-12-02T17:27:05Z bootstop: Host is rebooting
2020-12-02T17:36:02Z bootstop: Host has booted
2021-02-25T06:41:38Z bootstop: Host has booted
Hostd Logs:
- Reviewed the hostd logs but could not find any entries indicating that the ESXi host was in a hung state.
- However, the logs show a few I/O errors caused by the storage not being accessible:
2021-02-25T06:11:50.616Z error hostd[DCC6B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/abc-s-ma-hdb-001/abc-s-ma-hdb-001.vmx] Could not perform config check (storage not accessible): vim.fault.GenericVmConfigFault
2021-02-25T06:11:50.621Z info hostd[DC85B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx opID=lro-1-4cd9b3b7-ec198-01-01-01-b1-03bf user=vpxuser] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
2021-02-25T06:11:50.624Z info hostd[D485B70] [Originator@6876 sub=vm:DictionaryLoad: Cannot open file "/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] : Input/output error.
2021-02-25T06:11:50.625Z info hostd[D485B70] [Originator@6876 sub=vm:DictionaryLoad: Cannot open file "/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] : Input/output error.
2021-02-25T06:11:50.627Z info hostd[D485B70] [Originator@6876 sub=vm:DictionaryLoad: Cannot open file "/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] : Input/output error.
2021-02-25T06:11:50.630Z error hostd[D485B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] Could not perform config check (storage not accessible): vim.fault.GenericVmConfigFault
- The logs show that the datastore MA_HANA_STG_01_L04_DS02 was inaccessible:
2021-02-25T06:16:53.594Z warning hostd[D340B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/abc-s-ma-hdb-001/abc-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:16:53.617Z warning hostd[D340B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/abc-s-ma-hdb-001/abc-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:16:53.894Z warning hostd[D7CAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/pqr-s-ma-hdb-001/pqr-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:16:53.916Z warning hostd[D7CAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/pqr-s-ma-hdb-001/pqr-s-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:26:54.169Z warning hostd[DDCAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] UpdateStorageAccessibilityStatusInt: The datastore 172.16.0.4:/MA_HANA_STG_01_L04_DS02 is not accessible
2021-02-25T06:26:54.171Z warning hostd[DDCAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] FetchUpdatedLayout: VM storage inaccessible.
2021-02-25T06:26:54.173Z warning hostd[DDCAB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/7b84abff-2a86b528/xyz-s4-ma-hdb-001/xyz-s4-ma-hdb-001.vmx] Failed to find activation record, event user unknown.
- After this, the logs show the reboot operation:
--> fullName = "VMware ESX build-9298722",
--> version = "6.5.0",
--> build = "9298722",
VOBD Logs:
- The VOBD logs show the host losing its connection to NFS server 172.16.0.5 (mount point /MA_HANA_STG_01_L05_DS03) and NFS server 172.16.0.4 (mount point /MA_HANA_STG_01_L04_DS02), along with an All Paths Down event:
2021-02-25T06:03:46.988Z: [vmfsCorrelator] 7302728494869us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 ("MA_HANA_STG_01_L05_DS03")
2021-02-25T06:26:22.957Z: [vmfsCorrelator] 7304084494786us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 ("MA_HANA_STG_01_L04_DS02")
2021-02-25T05:32:58.029Z: [APDCorrelator] 7300879495079us: [vob.storage.apd.start] Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T05:32:58.029Z: [APDCorrelator] 7300714347738us: [esx.problem.storage.apd.start] Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
- Similar events appear throughout the day:
2021-02-25T01:46:20.318Z: [vmfsCorrelator] 7287281495957us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 ("MA_HANA_STG_01_L05_DS03")
2021-02-25T01:47:56.316Z: [vmfsCorrelator] 7287377496007us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.3 mount point /MA_HANA_STG_01_L03_DS01, mounted as 5ed3d1ca-fc73f1bd-0000-000000000000 ("MA_HANA_STG_01_L03_DS01")
2021-02-25T01:50:44.312Z: [vmfsCorrelator] 7287545496157us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 ("MA_HANA_STG_01_L05_DS03")
2021-02-25T01:55:08.308Z: [vmfsCorrelator] 7287809495916us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 ("MA_HANA_STG_01_L05_DS03")
2021-02-25T02:10:32.285Z: [vmfsCorrelator] 7288733495835us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 ("MA_HANA_STG_01_L04_DS02")
2021-02-25T03:32:57.181Z: [vmfsCorrelator] 7293678495514us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 ("MA_HANA_STG_01_L04_DS02")
2021-02-25T04:07:34.138Z: [vmfsCorrelator] 7295755495587us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 ("MA_HANA_STG_01_L04_DS02")
2021-02-25T04:24:58.116Z: [vmfsCorrelator] 7296799495322us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.4 mount point /MA_HANA_STG_01_L04_DS02, mounted as 7b84abff-2a86b528-0000-000000000000 ("MA_HANA_STG_01_L04_DS02")
- After this, the logs show that the host rebooted:
2021-02-25T06:41:38.866Z: [GenericCorrelator] 107926801us: [vob.user.host.boot] Host has booted.
2021-02-25T06:41:38.866Z: [UserLevelCorrelator] 107926801us: [vob.user.host.boot] Host has booted.
2021-02-25T06:41:38.866Z: [UserLevelCorrelator] 107927036us: [esx.audit.host.boot] Host has booted.
2021-02-25T06:43:59.498Z: [GenericCorrelator] 248559208us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2021-02-25T06:43:59.498Z: [UserLevelCorrelator] 248559208us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2021-02-25T06:43:59.499Z: [UserLevelCorrelator] 248559382us: [esx.audit.maintenancemode.entering] The host has begun entering maintenance mode.
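The NFS disconnect events in the VOBD excerpts above can be tallied per server and mount point to see which datastores dropped most often. The following is a minimal Python sketch, not a VMware tool: the function name count_nfs_disconnects and the regular expression are illustrative assumptions, and the sample lines are copied from the excerpts above.

```python
import re
from collections import Counter

# Hypothetical pattern for the vob.vmfs.nfs.server.disconnect lines shown above.
DISCONNECT_RE = re.compile(
    r"\[vob\.vmfs\.nfs\.server\.disconnect\] Lost connection to the server "
    r"(\S+) mount point (\S+),"
)


def count_nfs_disconnects(vobd_text):
    """Count disconnect events per (server, mount point) pair."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(vobd_text.split())
    return Counter(DISCONNECT_RE.findall(flat))


# Three entries taken from the VOBD excerpts, wrapped as they might be pasted.
SAMPLE = """\
2021-02-25T01:46:20.318Z: [vmfsCorrelator] 7287281495957us:
[vob.vmfs.nfs.server.disconnect] Lost connection to the server 172.16.0.5 mount
point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000
("MA_HANA_STG_01_L05_DS03")
2021-02-25T01:47:56.316Z:
[vmfsCorrelator] 7287377496007us: [vob.vmfs.nfs.server.disconnect] Lost
connection to the server 172.16.0.3 mount point /MA_HANA_STG_01_L03_DS01,
mounted as 5ed3d1ca-fc73f1bd-0000-000000000000 ("MA_HANA_STG_01_L03_DS01")
2021-02-25T01:50:44.312Z:
[vmfsCorrelator] 7287545496157us: [vob.vmfs.nfs.server.disconnect] Lost
connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03,
mounted as abff1f2a-fc86081e-0000-000000000000 ("MA_HANA_STG_01_L05_DS03")
"""

counts = count_nfs_disconnects(SAMPLE)
for (server, mount), n in counts.most_common():
    print(f"{server}:{mount} lost connection {n} time(s)")
```

Running the same function over the full vobd.log from the support bundle would give a per-datastore count for the whole day, which may help the storage and networking teams narrow down which NFS interface was affected first.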
VMKernel Logs:
- The VMkernel logs show the same set of storage disconnection events, with the datastores entering the All Paths Down state:
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.992Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2328: [Repeated 2 times] Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b386) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T03:43:24.993Z cpu58:68685)NFS: 2333: Failed to get object (0x43922269b356) 52 7b84abff 2a86b528 0 80bb751f 0 40 5f7468bd 80bb751f 4000000000 405f7468bd 0 0 :No connection
2021-02-25T04:06:10.140Z cpu3:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T04:23:34.118Z cpu3:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T04:49:34.085Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T05:32:58.029Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T06:02:22.990Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [abff1f2a-fc86081e] has entered the All Paths Down state.
2021-02-25T06:03:46.988Z cpu78:66728)WARNING: NFS: 337: Lost connection to the server 172.16.0.5 mount point /MA_HANA_STG_01_L05_DS03, mounted as abff1f2a-fc86081e-0000-000000000000 ("MA_HANA_STG_01_L05_DS03")
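To see how often each datastore entered the All Paths Down state, the StorageApdHandlerEv entries above can be extracted and the gaps between them computed. This is a rough Python sketch under stated assumptions: the helper names apd_events and gaps_seconds and the regular expression are hypothetical, and the sample contains only the five APD entries quoted above.

```python
import re
from datetime import datetime

# Hypothetical pattern for the StorageApdHandlerEv lines shown above.
APD_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z cpu\d+:\d+\)"
    r"StorageApdHandlerEv: \d+: Device or filesystem with identifier "
    r"\[([0-9a-f-]+)\] has entered the All Paths Down state"
)


def apd_events(vmkernel_text):
    """Return (timestamp, identifier) pairs for each APD-start entry."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(vmkernel_text.split())
    return [
        (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f"), ident)
        for ts, ident in APD_RE.findall(flat)
    ]


def gaps_seconds(events, ident):
    """Seconds between consecutive APD-start entries for one identifier."""
    times = sorted(t for t, i in events if i == ident)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


SAMPLE = """\
2021-02-25T04:06:10.140Z cpu3:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T04:23:34.118Z cpu3:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T04:49:34.085Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T05:32:58.029Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [7b84abff-2a86b528] has entered the All Paths Down state.
2021-02-25T06:02:22.990Z cpu0:66046)StorageApdHandlerEv: 110: Device or filesystem with identifier [abff1f2a-fc86081e] has entered the All Paths Down state.
"""

events = apd_events(SAMPLE)
gaps = gaps_seconds(events, "7b84abff-2a86b528")
print(f"{len(events)} APD-start entries; gaps for 7b84abff-2a86b528: {gaps}")
```

Repeated APD-start entries for the same identifier suggest the datastore recovered and dropped again several times before the reboot, which is consistent with an intermittent network or NAS problem rather than a single outage.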
Conclusion:
- Based on the logs, the issue occurred because the ESXi host lost access to the datastores, which entered the All Paths Down (APD) state. A few of the VMs, Linux guests in particular, might still have been able to respond if they were running from memory.
- In these scenarios the ESXi host can sometimes become unresponsive, because all the datastores associated with the host went into the All Paths Down state.
- Since this was a networking issue, we cannot investigate further from the logs alone; a live troubleshooting session at the time of the issue generally helps to isolate the cause more effectively.
Action Plan:
- Since this was NFS storage, the issue can generally be isolated through live troubleshooting while it is occurring.
- I recommend checking with your Networking Team to confirm whether they saw any issues from the NAS at the time of the issue.
- If the issue recurs, please follow the steps below or engage us on a call so that we can perform live troubleshooting to isolate it better:
  - Use the command below to check connectivity between the ESXi host and the NAS server:
    - vmkping -I vmkX x.x.x.x (where vmkX is the VMkernel port used for NFS, and x.x.x.x is the NFS server IP address)
  - If the ping fails, we can also take a packet capture to examine the traffic flow.
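Because the vmnic table above shows an MTU of 9000, a default-size vmkping may succeed even when jumbo frames are being dropped somewhere on the path. The sketch below shows how a jumbo-frame test command could be built. It assumes an IPv4 path and that the NFS VMkernel port is also configured for MTU 9000, which the logs do not confirm. vmkX and x.x.x.x are the same placeholders used above, and the helper name vmkping_command is hypothetical. The -d (do not fragment) and -s (payload size) options are standard vmkping options.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def vmkping_command(vmk, target, mtu=None):
    """Build a vmkping command line; with mtu set, test the full frame size."""
    args = ["vmkping", "-I", vmk]
    if mtu is not None:
        # Largest ICMP payload that fits in one unfragmented packet of this MTU.
        payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
        args += ["-d", "-s", str(payload)]
    args.append(target)
    return " ".join(args)


# Placeholders as in the action plan: replace vmkX and x.x.x.x before use.
basic = vmkping_command("vmkX", "x.x.x.x")
jumbo = vmkping_command("vmkX", "x.x.x.x", mtu=9000)
print(basic)
print(jumbo)
```

If the basic command succeeds but the jumbo-frame variant fails, an MTU mismatch on a switch port or on the NAS interface is a likely candidate to raise with the Networking Team.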