Time of Issue: 11th OCT 2021 7:30PM CET (Algeria Time)
Time in GMT: 11th OCT 2021 5:30 PM GMT
vCenter name: Vcenter-Xyz-6.7.dbss.abc.com
VPXD Logs:
- Checked
the VPXD Logs to jump to the Start of Events to understand how it was
started:
- From the
Logs we can see that the Host started seeing disconnect events around
12:56 UTC-1 on 11th.
2021-10-11T12:56:22.935-01:00
warning vpxd[05189] [Originator@6876 sub=QuickStats
opID=QuickStatsProcessPollResults-198b028c] Empty quick stats value for host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T12:56:26.782-01:00 info
vpxd[04200] [Originator@6876 sub=drmLogger opID=lro-2-2dfc6514-75d3] No stats
from host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
for 10 minutes. Ignoring
2021-10-11T12:57:22.939-01:00
warning vpxd[05196] [Originator@6876 sub=QuickStats
opID=QuickStatsProcessPollResults-574d62b3] Empty quick stats value for host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T12:57:26.818-01:00 info vpxd[04172]
[Originator@6876 sub=drmLogger opID=lro-2-2dfc6514-75d5] No stats from host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com] for 10
minutes. Ignoring
2021-10-11T12:58:22.944-01:00 warning vpxd[05196]
[Originator@6876 sub=QuickStats opID=QuickStatsProcessPollResults-574d62b3]
Empty quick stats value for host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T12:58:26.853-01:00 info vpxd[04120]
[Originator@6876 sub=drmLogger opID=lro-2-2dfc6514-75d7] No stats from host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com] for 10
minutes. Ignoring
2021-10-11T12:59:14.961-01:00 warning vpxd[04215]
[Originator@6876 sub=MoHost opID=3dd0a308] PMem stats error: host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com] has no
valid stats
2021-10-11T12:59:22.940-01:00
warning vpxd[05189] [Originator@6876 sub=QuickStats
opID=QuickStatsProcessPollResults-198b028c] Empty quick stats value for host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T12:59:26.887-01:00 info
vpxd[04128] [Originator@6876 sub=drmLogger opID=lro-2-2dfc6514-75d9] No stats
from host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
for 10 minutes. Ignoring
2021-10-11T13:00:22.936-01:00
warning vpxd[05196] [Originator@6876 sub=QuickStats
opID=QuickStatsProcessPollResults-574d62b3] Empty quick stats value for host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T13:00:26.922-01:00 info
vpxd[05323] [Originator@6876 sub=drmLogger opID=lro-2-2dfc6514-75db] No stats
from host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
for 10 minutes. Ignoring
2021-10-11T13:00:43.930-01:00 error vpxd[05377]
[Originator@6876 sub=MoHost] Null response from host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]; no
error
2021-10-11T13:00:43.930-01:00 error vpxd[05377]
[Originator@6876 sub=MoDVSwitch] RPC for host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com] got
status code 3
2021-10-11T13:00:44.177-01:00 error vpxd[05331]
[Originator@6876 sub=MoHost] Null response from host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]; no
error
2021-10-11T13:00:44.177-01:00 error vpxd[05331]
[Originator@6876 sub=MoDVSwitch] RPC for host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com] got
status code 3
2021-10-11T13:01:22.936-01:00 warning vpxd[05189]
[Originator@6876 sub=QuickStats opID=QuickStatsProcessPollResults-198b028c]
Empty quick stats value for host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T13:01:25.956-01:00 info vpxd[06202]
[Originator@6876 sub=drmLogger opID=lro-2-2dfc6514-75dd] No stats from host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com] for 10
minutes. Ignoring
2021-10-11T13:02:22.937-01:00 warning vpxd[04165]
[Originator@6876 sub=QuickStats opID=QuickStatsProcessPollResults-7554426d]
Empty quick stats value for host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
- Here the
vCenter started notifying that the Communication between the ESXi Host and
the vCenter is down:
2021-10-11T13:29:59.618-01:00 warning vpxd[04295]
[Originator@6876 sub=InvtHostCnx opID=HB-host-82912@768244-3621cb8] Exception
occurred during host sync; Host communication failed;
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com], e:
N5Vmomi5Fault17HostCommunication9ExceptionE(Fault cause:
vmodl.fault.HostCommunication
2021-10-11T13:29:59.624-01:00 warning vpxd[04295]
[Originator@6876 sub=MoHost opID=HB-host-82912@768244-3621cb8] host
[vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
connection state changed to NO_RESPONSE
2021-10-11T13:29:59.689-01:00 warning vpxd[04295]
[Originator@6876 sub=VpxProfiler opID=HB-host-82912@768244-3621cb8]
InvtHostSyncLRO::StartWork [HostSyncTime] took 1800079 ms
2021-10-11T13:29:59.689-01:00 warning vpxd[04295]
[Originator@6876 sub=VpxProfiler opID=HB-host-82912@768244-3621cb8]
VpxLro::LroMain [TotalTime] took 1800079 ms
2021-10-11T13:30:00.543-01:00 info vpxd[05187]
[Originator@6876 sub=InvtHostCnx opID=HeartbeatStartHandler-23bfdfe6] Missed
179 heartbeats for host [vim.HostSystem:host-82912,alg-s1-l-xyzs016.dbss.abc.com]
2021-10-11T13:30:00.543-01:00 info vpxd[05186]
[Originator@6876 sub=vpxLro opID=HB-SpecSync-host-82912@944235-389cfe8e]
[VpxLRO] — BEGIN lro-8827446 — —
SpecSyncLRO.Synchronize —
ESXi Host: ALG-S1-L-XyzS016.dbss.abc.com
Hostd ID: host-82912 10.96.27.169
VOBD Logs:
- Checked
the VOBD Logs at the start of the ESXi Host issue.
- We can
see that the issue started because of the Paths to the Datastore went into
APD State:
2021-10-11T13:39:36.618Z: [APDCorrelator]
22674691781000us: [vob.storage.apd.start] Device or filesystem with identifier
[naa.60002ac000000000000007200001e000] has entered the All Paths Down state.
2021-10-11T13:39:36.618Z: [scsiCorrelator]
22674691780883us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T5:L1
changed state from on
2021-10-11T13:39:36.618Z: [APDCorrelator]
22675146314313us: [esx.problem.storage.apd.start] Device or filesystem with
identifier [naa.60002ac000000000000007200001e000] has entered the All Paths
Down state.
- Due to
this loss in Connectivity the Datastore got Disconnected:
2021-10-11T13:39:36.622Z: [scsiCorrelator]
22675146318076us: [esx.problem.storage.connectivity.lost] Lost connectivity to
storage device naa.60002ac000000000000007200001e000. Path vmhba2:C0:T5:L1 is
down. Affected datastores: “ALG-S2-P-DB-Backup-NL01”.2021-10-11T13:39:36.622Z:
[scsiCorrelator] 22674691780893us: [vob.scsi.scsipath.pathstate.dead] scsiPath
vmhba2:C0:T4:L1 changed state from on
2021-10-11T13:39:36.625Z: [scsiCorrelator]
22675146320961us: [esx.problem.storage.connectivity.lost] Lost connectivity to
storage device naa.60002ac000000000000007200001e000. Path vmhba2:C0:T4:L1 is
down. Affected datastores: “ALG-S2-P-DB-Backup-NL01”.
2021-10-11T13:39:36.625Z: [scsiCorrelator]
22674691780897us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T2:L1
changed state from on
2021-10-11T13:39:36.628Z: [scsiCorrelator]
22675146323882us: [esx.problem.storage.connectivity.lost] Lost connectivity to
storage device naa.60002ac000000000000007200001e000. Path vmhba2:C0:T2:L1 is
down. Affected datastores: “ALG-S2-P-DB-Backup-NL01”.2021-10-11T13:39:36.628Z:
[scsiCorrelator] 22674691780900us: [vob.scsi.scsipath.pathstate.dead] scsiPath
vmhba2:C0:T3:L1 changed state from on
2021-10-11T13:39:36.631Z: [scsiCorrelator]
22675146326753us: [esx.problem.storage.connectivity.lost] Lost connectivity to
storage device naa.60002ac000000000000007200001e000. Path vmhba2:C0:T3:L1 is
down. Affected datastores:
“ALG-S2-P-DB-Backup-NL01”.2021-10-11T13:39:36.631Z: [scsiCorrelator]
22674691781615us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T5:L110
changed state from standby
2021-10-11T13:39:36.632Z: [APDCorrelator]
22674691795814us: [vob.storage.apd.start] Device or filesystem with identifier
[naa.60002ac0000000000000460c0001e000] has entered the All Paths Down state.
2021-10-11T13:39:36.632Z: [APDCorrelator]
22675146328134us: [esx.problem.storage.apd.start] Device or filesystem with
identifier [naa.60002ac0000000000000460c0001e000] has entered the All Paths
Down state.
2021-10-11T13:39:36.632Z: [APDCorrelator]
22674691795887us: [vob.storage.apd.start] Device or filesystem with identifier
[naa.2ff70002ac01e000] has entered the All Paths Down state.
2021-10-11T13:39:36.632Z: [APDCorrelator]
22675146328308us: [esx.problem.storage.apd.start] Device or filesystem with
identifier [naa.2ff70002ac01e000] has entered the All Paths Down state.
2021-10-11T13:39:36.635Z: [scsiCorrelator] 22675146330922us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.60002ac000000000000020f60001e0a3 degraded. Path vmhba2:C0:T5:L110 is down.
Affected datastores: “ALG-S1-QV-SSD-RBS-DS010”.2021-10-11T13:39:36.635Z: [scsiCorrelator]
22674691781623us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T4:L110
changed state from standby
2021-10-11T13:39:36.639Z: [scsiCorrelator] 22675146334883us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.60002ac000000000000020f60001e0a3 degraded. Path vmhba2:C0:T4:L110 is down.
Affected datastores: “ALG-S1-QV-SSD-RBS-DS010”.2021-10-11T13:39:36.639Z: [scsiCorrelator]
22674691781626us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T3:L110
changed state from standby
2021-10-11T13:39:36.643Z: [scsiCorrelator] 22675146339327us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.60002ac000000000000020f60001e0a3 degraded. Path vmhba2:C0:T3:L110 is down.
Affected datastores: “ALG-S1-QV-SSD-RBS-DS010”.2021-10-11T13:39:36.644Z: [scsiCorrelator]
22674691781628us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T2:L110
changed state from standby
2021-10-11T13:39:36.647Z: [scsiCorrelator] 22675146343395us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.60002ac000000000000020f60001e0a3 degraded. Path vmhba2:C0:T2:L110 is down.
Affected datastores: “ALG-S1-QV-SSD-RBS-DS010”.2021-10-11T13:39:36.648Z: [scsiCorrelator]
22674691781748us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T5:L201
changed state from on
2021-10-11T13:39:36.651Z: [scsiCorrelator] 22675146347060us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.60002ac0000000000000035a0001e000 degraded. Path vmhba2:C0:T5:L201 is down.
Affected datastores: “ALG-S2-QV-SSD-DS001”.2021-10-11T13:39:36.651Z: [scsiCorrelator]
22674691781755us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T4:L201
changed state from on
2021-10-11T13:39:36.655Z: [scsiCorrelator] 22675146351037us:
[esx.problem.storage.redundancy.degraded] Path redundancy to storage device
naa.60002ac0000000000000035a0001e000 degraded. Path vmhba2:C0:T4:L201 is down.
Affected datastores: “ALG-S2-QV-SSD-DS001”.
Hostd Logs:
- From the
Hostd Logs as well we can see the same set of Events for APD State.
2021-10-11T13:39:36.619Z info hostd[2100152]
[Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event:
(vim.event.EventEx) {
–> key =
233,
–>
chainId = -464386928,
–>
createdTime = “1970-01-01T00:00:00Z”,
–>
userName = “”,
–> datacenter
= (vim.event.DatacenterEventArgument) null,
–>
computeResource = (vim.event.ComputeResourceEventArgument) null,
–> host =
(vim.event.HostEventArgument) {
–>
name = “ALG-S1-L-XyzS016.dbss.abc.com”,
–>
host = ‘vim.HostSystem:ha-host’
–> },
–> vm =
(vim.event.VmEventArgument) null,
–> ds =
(vim.event.DatastoreEventArgument) null,
–> net =
(vim.event.NetworkEventArgument) null,
–> dvs =
(vim.event.DvsEventArgument) null,
–>
fullFormattedMessage = <unset>,
–>
changeTag = <unset>,
–>
eventTypeId = “esx.problem.storage.apd.start”,
–>
severity = <unset>,
–>
message = <unset>,
–>
arguments = (vmodl.KeyAnyValue) [
–>
(vmodl.KeyAnyValue) {
–>
key = “1”,
–>
value = “naa.60002ac000000000000007200001e000”
–> }
–> ],
–>
objectId = “ha-host”,
–>
objectType = “vim.HostSystem”,
–>
objectName = <unset>,
–> fault
= (vmodl.MethodFault) null
–> }
2021-10-11T13:39:36.620Z info hostd[2100152]
[Originator@6876 sub=Vimsvc.ha-eventmgr] Event 10398 : Device or filesystem
with identifier naa.60002ac000000000000007200001e000 has entered the All Paths
Down state.
2021-10-11T13:39:36.622Z warning hostd[2099912]
[Originator@6876 sub=Hostsvc.VmkVprobSource] Datastore array is marked as event
object. Check ‘extension.xml’ for event:
‘esx.problem.storage.connectivity.lost’
2021-10-11T13:39:36.623Z info hostd[2099912]
[Originator@6876 sub=Vimsvc.ha-eventmgr] Event 10399 : Lost connectivity to
storage device naa.60002ac000000000000007200001e000. Path vmhba2:C0:T5:L1 is
down. Affected datastores: ALG-S2-P-DB-Backup-NL01.
2021-10-11T13:39:36.625Z warning
hostd[2101147] [Originator@6876 sub=Hostsvc.VmkVprobSource] Datastore array is
marked as event object. Check ‘extension.xml’ for event:
‘esx.problem.storage.connectivity.lost’
2021-10-11T13:39:36.628Z warning
hostd[2101148] [Originator@6876 sub=Hostsvc.VmkVprobSource] Datastore array is
marked as event object. Check ‘extension.xml’ for event:
‘esx.problem.storage.connectivity.lost’
- Since
Hostd Service is also responsible to check the Datastore health using the
Heartbeat. We can see the Failures for the vmkprobe.
2021-10-11T13:39:36.683Z info hostd[2101124]
[Originator@6876 sub=Hostsvc.VmkVprobSource] VmkVprobSource::Post event:
(vim.event.EventEx) {
–> key =
233,
–>
chainId = 10411,
–>
createdTime = “1970-01-01T00:00:00Z”,
–>
userName = “”,
–>
datacenter = (vim.event.DatacenterEventArgument) null,
–>
computeResource = (vim.event.ComputeResourceEventArgument) null,
–> host =
(vim.event.HostEventArgument) {
–>
name = “ALG-S1-L-XyzS016.dbss.abc.com”,
–>
host = ‘vim.HostSystem:ha-host’
–> },
–> vm =
(vim.event.VmEventArgument) null,
–> ds =
(vim.event.DatastoreEventArgument) {
–>
name = “ALG-S2-QV-SSD-DS002”,
–>
datastore = ‘vim.Datastore:5b181500-79bfdfa0-5de3-1402ec975e4c’
–> },
–> net =
(vim.event.NetworkEventArgument) null,
–> dvs =
(vim.event.DvsEventArgument) null,
–>
fullFormattedMessage = <unset>,
–>
changeTag = <unset>,
–>
eventTypeId = “esx.problem.vmfs.heartbeat.timedout”,
–>
severity = <unset>,
–>
message = <unset>,
–>
arguments = (vmodl.KeyAnyValue) [
–>
(vmodl.KeyAnyValue) {
–>
key = “1”,
–>
value = “5b181500-79bfdfa0-5de3-1402ec975e4c”
–> },
–>
(vmodl.KeyAnyValue) {
–>
key = “2”,
–>
value = (vim.event.DatastoreEventArgument) {
–>
name = “ALG-S2-QV-SSD-DS002”,
–>
datastore = ‘vim.Datastore:5b181500-79bfdfa0-5de3-1402ec975e4c’
–>
}
–> }
–> ],
Conclusion:
- Based on
the logs we can see that the issue started because of the Storage issue
which happened on the ESXi Host. Due to the Datastore got disconnected the
heartbeat of the Hostd Service got lost which brought down the Hostd
Service and changed the state of the ESXI Host to Not-Responding.
Action Plan:
- As per
the Article: https://kb.vmware.com/s/article/2032940 : The APD situation needs to be resolved
at the storage array/fabric layer to restore connectivity to the host.
- Please
check with your storage Vendor to check if there is any issues on the SAN
or if there is any Activity performed on the Switch end.