Hostname: vc-abcvap00023.abcde.com
vCenter Version: vCenter Appliance 6.7 Update 3f
Uptime: 01:05:13 up 20:00 :: Seems to be the server is up by 20 Hours.
Time of Restart: 22nd July 05:05 UTC
As per the Article: https://kb.vmware.com/s/article/67818?lang=en_us
Generally the Starting point of
Investigation for a “503 Service Unavailable” starts from the VPXD and Client Virgo logs.
However while looking for the
error in the /log/Vmware I can only see the Event coming in the log file below:
vmware$ grep -ilr “503
service unavailable”
vsan-health/vsanvp.log
vmon/event-publisher.log
vsphere-ui/logs/opId.log
vsphere-ui/logs/dataservice.log
vsphere-ui/logs/vsphere_client_virgo.log
- Checked
the vSphere Client Virgo and can see that the last the this issue was
recorded was 6th June and there are not Service Unavailable Event logged.
cat vsphere-ui/logs/vsphere_client_virgo.* | grep -i “503 Service
Unavailable”
[2020-06-06T00:13:26.476Z]
[ERROR] data-service-pool-302
70000191 100003 200001
com.vmware.vsphere.client.cm.util.ServerPropertyValueExtractor HTTP response with status code 503 (enable
debug logging for details): <HTML><BODY><H1>503 Service
Unavailable (Failed to connect to endpoint:
[N7Vmacore4Http16LocalServiceSpecE:0x00007f6e340027c0] _serverNamespace =
/vmonapi action = Allow _port = 8900)</H1></BODY></HTML>
com.vmware.vapi.client.exception.TransportProtocolException: HTTP
[2020-06-06T00:14:39.874Z]
[ERROR] data-service-pool-334
70000253 100003 200001
com.vmware.vsphere.client.cm.util.ServerPropertyValueExtractor HTTP response with status code 503 (enable
debug logging for details): <HTML><BODY><H1>503 Service
Unavailable (Failed to connect to endpoint:
[N7Vmacore4Http16LocalServiceSpecE:0x00007f6e340027c0] _serverNamespace =
/vmonapi action = Allow _port = 8900)</H1></BODY></HTML>
com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with
status code 503 (enable debug logging for details):
<HTML><BODY><H1>503 Service Unavailable (Failed to connect to
endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x00007f6e340027c0]
_serverNamespace = /vmonapi action = Allow _port = 8900)
- Same is
mentioned in the vMon logs:
vmware$ cat vmon/event-publisher.log | grep -i “503 Service
Unavailable”
2020-06-06T00:35:21.483Z ERROR
event-publisher Error health status change event. EventsData
[‘rhttpproxy,HEALTHY,UNHEALTHY,0’], ErrMsg 503 Service Unavailable (Failed to
connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x0000563c92dcb140]
_serverNamespace = /lookupservice/sdk action = Allow _port = 7080), Backtrace
Service Control Logs:
- From the
Service Control Logs we can see that we tried to start the service around
4:40 but it failed:
2020-07-22T04:38:53.32Z INFO service-control
Performing start operation on service pschealth…
2020-07-22T04:40:53.19Z ERROR service-control Error executing start on
service pschealth. Details {
“componentKey”: null,
“detail”: [
{
“args”: [
“pschealth”
],
“localized”:
“An error occurred while starting service ‘pschealth'”,
“id”:
“install.ciscommon.service.failstart”,
“translatable”: “An error occurred while starting service
‘%(0)s'”
}
],
“problemId”: null,
“resolution”: null
}
2020-07-22T04:52:19.991Z ERROR service-control
Service-control failed. Error: {
“problemId”: null,
“detail”: [
{
“args”: [
“\u2013-all”
],
“localized”: “Unable
to locate service ‘\u2013-all'”,
“id”:
“install.ciscommon.service.notfound”,
“translatable”:
“Unable to locate service ‘%(0)s'”
}
],
“componentKey”: null,
“resolution”: {
“localized”: “Please
refer to component manager documentation to troubleshoot.”,
“id”:
“install.ciscommon.cmlookup.resolution”,
“translatable”: “Please
refer to component manager documentation to troubleshoot.”
}
}
- Around
5:00 AM we can see the services starting up:
2020-07-22T05:06:14.286Z INFO service-control Running:
applmgmt lwsmd pschealth vmafdd vmcad vmdird
vmdnsd vmonapi vmware-analytics vmware-certificatemanagement vmware-cis-license
vmware-cm vmware-eam vmware-imagebuilder vmware-pod vmware-postgres-archiver
vmware-rbd-watchdog vmware-rhttpproxy vmware-sca vmware-statsmonitor
vmware-sts-idmd vmware-stsd vmware-topologysvc vmware-vapi-endpoint vmware-vmon
vmware-vpostgres vmware-vpxd vmware-vpxd-svcs vsphere-ui
2020-07-22T05:06:14.287Z
INFO service-control StartPending:
vmware-content-library vmware-sps
vmware-updatemgr vmware-vsan-health vmware-vsm vsphere-client
2020-07-22T05:06:14.287Z
INFO service-control Stopped:
vmcam
vmware-mbcs vmware-netdumper vmware-perfcharts vmware-vcha vsan-dps
vMon Logs:
- From the
vMon logs we can see that the first time the service stopped at the below
time:
2020-07-21T18:01:31.019548+00:00
warning vmon <pschealth> Service
exited. Exit code 1
2020-07-21T18:01:31.020135+00:00 warning vmon <pschealth> Service exited
unexpectedly. Crash count 0. Taking configured recovery action.
2020-07-21T18:01:51.021551+00:00 warning vmon <event-pub> SysProcess exec timed out.
Force kill. Pid 36179
2020-07-21T18:01:51.026730+00:00
warning vmon Failed to publish health
status change.
2020-07-21T18:03:31.020128+00:00 warning vmon <pschealth> Service exited. Exit code 1
2020-07-21T18:03:31.044739+00:00
warning vmon <pschealth> Service failed
to recover. Try again. Fail count 1
2020-07-22T04:40:52.656028+00:00 notice vmon <pschealth-healthcmd> Constructed
command: /usr/bin/python
/usr/lib/vmware-pschealth/vmon/pschealth-vmon-apihealth.py
2020-07-22T04:40:53.018101+00:00
warning vmon <pschealth> Service
exited. Exit code 1
2020-07-22T04:40:53.018812+00:00
notice vmon <pschealth-healthcmd>
Cancelling execution of pid 19237
2020-07-22T04:40:53.019655+00:00
warning vmon A non tracked pid 19237
terminated.
2020-07-22T04:43:54.279902+00:00 notice vmon Executing service batch op API_HEALTH.
IgnoreFail=1, service count=10
2020-07-22T04:48:56.194666+00:00 notice vmon Executing service batch op API_HEALTH.
IgnoreFail=1, service count=10
2020-07-22T04:49:41.647728+00:00
notice vmon Executing service batch op
STOP. IgnoreFail=1, service count=32
vMon Syslog:
- From the
logs we can see that the First Time the Service Exited unexpectedly is on
21st July 18:01.
2020-07-21T18:01:31.019548+00:00 warning vmon <pschealth> Service exited. Exit code 1
2020-07-21T18:01:31.020135+00:00 warning vmon
<pschealth> Service exited unexpectedly. Crash count 0. Taking
configured recovery action.
2020-07-21T18:01:31.042862+00:00 notice vmon
<pschealth> Restarting service.
2020-07-21T18:03:31.020128+00:00 warning vmon <pschealth> Service exited. Exit code 1
2020-07-21T18:03:31.044270+00:00 notice vmon
<pschealth-healthcmd> Cancelling execution of pid 36902
2020-07-21T18:04:09.203067+00:00 notice vmon
<pschealth> Skip service health check. State STARTING, Curr
request 3
2020-07-21T18:05:31.022997+00:00 warning vmon
<pschealth> Service exited. Exit code 1
2020-07-21T18:05:31.023624+00:00 warning vmon
<pschealth> Service failed to recover. Try again. Fail count 2
2020-07-21T18:07:31.023126+00:00 warning vmon
<pschealth> Service exited. Exit code 1
2020-07-21T18:07:31.023891+00:00 warning vmon
<pschealth> Service failed to recover. Try again. Fail count 3
2020-07-21T18:09:10.912003+00:00 notice vmon
<pschealth> Skip service health check. State STARTING, Curr
request 3
2020-07-21T18:09:31.020513+00:00 warning vmon
<pschealth> Service exited. Exit code 1
2020-07-21T18:09:31.021118+00:00 warning vmon
<pschealth> Service failed to recover. Try again. Fail count 4
2020-07-21T18:11:31.024149+00:00 warning vmon
<pschealth> Service exited. Exit code 1
2020-07-21T18:11:31.024692+00:00 notice vmon
<pschealth-healthcmd> Cancelling execution of pid 39931
2020-07-21T18:11:31.025180+00:00 warning vmon
<pschealth> Service reached max quick failure count. Give up!!!
2020-07-21T18:11:31.025651+00:00
warning vmon A non tracked pid 39931
terminated.
2020-07-21T18:14:11.420989+00:00
notice vmon Executing service batch op
API_HEALTH. IgnoreFail=1, service count=10
2020-07-22T04:38:53.115752+00:00 notice vmon Received start request for pschealth
2020-07-22T04:38:53.718436+00:00 notice vmon
<pschealth> Skip service health check. State STARTING, Curr
request 1
2020-07-22T04:40:53.018101+00:00 warning vmon
<pschealth> Service exited. Exit code 1
2020-07-22T04:40:53.018812+00:00 notice vmon
<pschealth-healthcmd> Cancelling execution of pid 19237
2020-07-22T04:40:53.019655+00:00 warning vmon A non tracked pid 19237 terminated.
2020-07-22T04:43:55.686973+00:00
notice vmon <pschealth> Skip
service health check. State STOPPED, Curr request 0
2020-07-22T04:48:57.344173+00:00 notice vmon
<pschealth> Skip service health check. State STOPPED, Curr request
0
2020-07-22T05:00:04.429650+00:00 err vmon Missing ending % in -XX:ErrorFile=%VMWARE_LOG_DIR%/vmware/vsm/java_error%p.log
2020-07-22T05:00:04.429891+00:00 warning vmon Invalid string entry in StartCommandArgs
field. Skipping…
2020-07-22T05:00:07.646471+00:00
notice vmon <pschealth> Service
STARTED successfully.
Messages Logs:
- From the
messages logs we can see the PSC Health Check Errors along with the we can
see that there are some issues with the Time as well.
- There is
a Time Sync issue hint which is given in the Article: https://kb.vmware.com/s/article/50113811
2020-07-21T18:00:31.013914+00:00 abcvap00023
pschealthd: VMware Directory-Service, 1, 1595043431
2020-07-21T18:00:31.014766+00:00 abcvap00023 pschealthd: VMware
Certificate-Service, 1, 1595043431
2020-07-21T18:00:31.015191+00:00 abcvap00023 pschealthd: Websso, 0, 1595043463
2020-07-21T18:00:31.015570+00:00 abcvap00023 pschealthd: Lookup service, 0,
1595043461
2020-07-21T18:00:31.015937+00:00 abcvap00023 pschealthd: Security Token
Service, 0, 1595043467
2020-07-21T18:00:31.016325+00:00 abcvap00023 pschealthd: License Service, 1,
1595043426
2020-07-21T18:00:31.016692+00:00
abcvap00023 pschealthd: Detected PSC system is not healthy – Wait for atleast
one minute before failing
2020-07-21T18:00:46.013357+00:00 abcvap00023
pschealthd: VMware Directory-Service, 1, 1595043428
2020-07-21T18:00:46.019491+00:00 abcvap00023 pschealthd: VMware
Certificate-Service, 1, 1595043428
2020-07-21T18:00:46.020017+00:00 abcvap00023 pschealthd: Websso, 0, 1595043478
2020-07-21T18:00:46.020474+00:00 abcvap00023 pschealthd: Lookup service, 0,
1595043476
2020-07-21T18:00:46.021083+00:00 abcvap00023 pschealthd: Security Token
Service, 0, 1595043482
2020-07-21T18:00:46.021502+00:00 abcvap00023 pschealthd: License Service, 1,
1595043432
2020-07-21T18:00:46.021886+00:00
abcvap00023 pschealthd: Detected PSC system is not healthy – Wait for atleast
one minute before failing
2020-07-21T18:01:31.095751+00:00 abcvap00023
pschealthd: PscHealthCheckInit dwError=0
2020-07-21T18:01:31.096589+00:00
abcvap00023 pschealthd: pscHealthCheckVmdir dwError=0
2020-07-21T18:01:31.114019+00:00
abcvap00023 pschealthd: Detected vmafd is starting up wait for 1 min
2020-07-21T18:02:01.940550+00:00 abcvap00023 CROND[36360]: (root) CMD (.
/etc/profile.d/VMware-visl-integration.sh;
/usr/lib/applmgmt/backup_restore/scripts/SchedulerCron.py
>>/var/log/vmware/applmgmt/backupSchedulerCron.log 2>&1)
2020-07-21T18:02:01.940982+00:00
abcvap00023 CROND[36361]: (root) CMD ( test -x /usr/sbin/vpxd_periodic
&& /usr/sbin/vpxd_periodic >/dev/null 2>&1)
2020-07-22T04:58:09.781689+00:00 abcvap00023
ntpd[1372]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
2020-07-22T04:58:09.782017+00:00 abcvap00023
ntpd[1372]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
2020-07-22T05:01:18.883080+00:00 abcvap00023
systemd[1]: Stopped target Timers.
2020-07-22T05:04:21.815246+00:00 abcvap00023
augenrules[1163]: backlog_wait_time 15000
2020-07-22T05:04:21.815395+00:00 abcvap00023
systemd[1]: Reached target Timers.
2020-07-22T05:04:23.364731+00:00 abcvap00023
xinetd[1365]: removing daytime
2020-07-22T05:04:23.365170+00:00 abcvap00023
xinetd[1365]: removing daytime
2020-07-22T05:04:23.369915+00:00 abcvap00023
xinetd[1365]: removing time
2020-07-22T05:04:23.370659+00:00 abcvap00023
xinetd[1365]: removing time
2020-07-22T05:04:23.424584+00:00 abcvap00023
systemd[1]: Started Network Time Service.
2020-07-22T05:04:23.427687+00:00 abcvap00023
ntpd[1386]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
2020-07-22T05:04:23.428012+00:00 abcvap00023
ntpd[1386]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
2020-07-22T05:04:23.648985+00:00 abcvap00023
systemd[1]: dnsmasq.service: Service hold-off time over, scheduling restart.
2020-07-22T04:31:57.404664+00:00 abcvap00023 python:
pam_unix(passwd:auth): authentication failure; logname= uid=0 euid=0 tty=
ruser= rhost= user=root
2020-07-22T04:31:58.914639+00:00 abcvap00023 python: pam_unix(passwd:auth):
authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
2020-07-22T04:32:00.885685+00:00 abcvap00023 python: pam_unix(passwd:auth):
authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
2020-07-22T04:32:02.791840+00:00 abcvap00023 python: pam_unix(passwd:auth):
authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=root
2020-07-22T04:32:04.889578+00:00 abcvap00023 cli: vmware.vherd.linux_cli.cli
Error getting authentication cookie from applmgmt service. Verify that the
applmgmt service is running and that the method requested to obtain the cookie
requires authentication.
2020-07-22T04:32:04.976687+00:00 abcvap00023 sshd[16376]:
pam_mgmt_cli(sshd:auth): Failed to authenticate with applmgmt service, service
not responding (253)
2020-07-22T04:32:04.992843+00:00 abcvap00023 sshd[16376]: pam_unix(sshd:auth):
authentication failure; logname= uid=0 euid=0 tty=ssh ruser=
rhost=10.191.205.131 user=root
2020-07-22T04:32:07.088400+00:00 abcvap00023 sshd[16376]: Failed password for
root from 10.191.205.131 port 52001 ssh2
2020-07-22T04:38:53.145962+00:00 abcvap00023 pschealthd: starting pschealthd
2020-07-22T04:38:53.146831+00:00
abcvap00023 pschealthd: PscHealthCheckInit dwError=0
2020-07-22T04:38:53.147216+00:00
abcvap00023 pschealthd: pscHealthCheckVmdir dwError=0
2020-07-22T04:38:53.151807+00:00
abcvap00023 pschealthd: Detected vmafd is starting up wait for 1 min
2020-07-22T04:39:53.021487+00:00 abcvap00023 pschealthd: VMware
Directory-Service, 1, 1595043424
2020-07-22T04:39:53.022376+00:00 abcvap00023 pschealthd: VMware
Certificate-Service, 1, 1595043430
2020-07-22T04:39:53.023009+00:00 abcvap00023 pschealthd: Websso, 0, 1595081708
2020-07-22T04:39:53.023510+00:00 abcvap00023 pschealthd: Lookup service, 0,
1595081823
2020-07-22T04:39:53.024079+00:00 abcvap00023 pschealthd: Security Token
Service, 0, 1595081829
2020-07-22T04:39:53.024589+00:00 abcvap00023 pschealthd: License Service, 1,
1595043431
2020-07-22T04:39:53.024986+00:00
abcvap00023 pschealthd: Detected PSC system is not healthy – Wait for atleast
one minute before failing
2020-07-22T04:40:08.015568+00:00
abcvap00023 pschealthd: VMware Directory-Service, 1, 1595043430
2020-07-22T04:40:08.016532+00:00 abcvap00023 pschealthd: VMware
Certificate-Service, 1, 1595043427
2020-07-22T04:40:08.017033+00:00 abcvap00023 pschealthd: Websso, 0, 1595081723
2020-07-22T04:40:08.017615+00:00 abcvap00023 pschealthd: Lookup service, 0,
1595081838
2020-07-22T04:40:08.018064+00:00 abcvap00023 pschealthd: Security Token
Service, 0, 1595081844
2020-07-22T04:40:08.018481+00:00 abcvap00023 pschealthd: License Service, 1,
1595043428
- Here we
can see that there is a message for the Change in the Time of the System,
around the reboot post which we have not noticed the issue.
2020-07-22T05:07:38.036444+00:00 abcvap00023
systemd[1]: Time has been changed
WebSSO Logs:
- From the
WebSSO we can see the same traces of the Time Sync Issues.
[2020-07-22T05:14:41.799Z tomcat-http–10
vsphere.local
b5dec38b-8fe2-406a-94c6-ba8237c9256b ERROR
com.vmware.identity.samlservice.impl.AuthnRequestStateValidator] Validation
FAILED – Issue Instant 2020-07-22T04:25:54.114Z outside of clock tolerance
600000.
Conclusion:
- Based on
the logs we can say that the PSC Health Service failed to start because of
Time Sync Issues as after the restart when we got the Event: Time has been
changed, the issue was resolved.
- Post the
Reboot the issue is not observed.
- The same
issue is mentioned multiple times in the KB Articles where we have issues
with the PSC Health Service.
https://kb.vmware.com/s/article/52923
https://kb.vmware.com/s/article/50121332?lang=en_us
Action Plan:
- Since the
Time Sync issue is fixed after the reboot we won’t be able to explore
more, but I will recommend you to check if there was any issues with the
Time Sync between the vCenter and the ESXi Host where the VM is running as
well as the Domain controller.
- Incase if
you experience this issue next time, I will recommend you to check if
there is any Time different between the ESXi Host, vCenter and the Domain
Controller before rebooting the vCenter Server.