RCA 28: vCenter Failed with the Error: 503 Service Unavailable

Hostname: vc-abcvap00023.abcde.com

vCenter Version: vCenter Appliance 6.7 Update 3f

 

 

Uptime: 01:05:13 up 20:00 :: Seems to be the server is up by 20 Hours.

 

Time of Restart:  22nd July 05:05 UTC

 

As per the Article:  https://kb.vmware.com/s/article/67818?lang=en_us

 

Generally the Starting point of Investigation for a “503 Service Unavailable”  starts from the VPXD and Client Virgo logs.

However while looking for the error in the /log/Vmware I can only see the Event coming in the log file below:

 

vmware$ grep -ilr “503 service unavailable”

vsan-health/vsanvp.log

vmon/event-publisher.log

vsphere-ui/logs/opId.log

vsphere-ui/logs/dataservice.log

vsphere-ui/logs/vsphere_client_virgo.log

 

 

  • Checked the vSphere Client Virgo and can see that the last the this issue was recorded was 6th June and there are not Service Unavailable Event logged.

 

cat vsphere-ui/logs/vsphere_client_virgo.* | grep -i “503 Service Unavailable”

 

[2020-06-06T00:13:26.476Z] [ERROR] data-service-pool-302        70000191 100003 200001 com.vmware.vsphere.client.cm.util.ServerPropertyValueExtractor    HTTP response with status code 503 (enable debug logging for details): <HTML><BODY><H1>503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x00007f6e340027c0] _serverNamespace = /vmonapi action = Allow _port = 8900)</H1></BODY></HTML> com.vmware.vapi.client.exception.TransportProtocolException: HTTP

 

[2020-06-06T00:14:39.874Z] [ERROR] data-service-pool-334        70000253 100003 200001 com.vmware.vsphere.client.cm.util.ServerPropertyValueExtractor    HTTP response with status code 503 (enable debug logging for details): <HTML><BODY><H1>503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x00007f6e340027c0] _serverNamespace = /vmonapi action = Allow _port = 8900)</H1></BODY></HTML> com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 503 (enable debug logging for details): <HTML><BODY><H1>503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x00007f6e340027c0] _serverNamespace = /vmonapi action = Allow _port = 8900)

 

  • Same is mentioned in the vMon logs:

 

vmware$ cat vmon/event-publisher.log | grep -i “503 Service Unavailable”

2020-06-06T00:35:21.483Z ERROR event-publisher Error health status change event. EventsData [‘rhttpproxy,HEALTHY,UNHEALTHY,0’], ErrMsg 503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x0000563c92dcb140] _serverNamespace = /lookupservice/sdk action = Allow _port = 7080), Backtrace

 

 

Service Control Logs:

 

  • From the Service Control Logs we can see that we tried to start the service around 4:40 but it failed:

 

2020-07-22T04:38:53.32Z INFO service-control Performing start operation on service pschealth…
2020-07-22T04:40:53.19Z ERROR service-control Error executing start on service pschealth. Details {
    “componentKey”: null,
    “detail”: [
        {
            “args”: [
                “pschealth”
            ],
            “localized”: “An error occurred while starting service ‘pschealth'”,
            “id”: “install.ciscommon.service.failstart”,
            “translatable”: “An error occurred while starting service ‘%(0)s'”
        }
    ],
    “problemId”: null,
    “resolution”: null
}

 

 

2020-07-22T04:52:19.991Z ERROR service-control Service-control failed. Error: {
    “problemId”: null,
    “detail”: [
        {
            “args”: [
                “\u2013-all”
            ],
            “localized”: “Unable to locate service ‘\u2013-all'”,
            “id”: “install.ciscommon.service.notfound”,
            “translatable”: “Unable to locate service ‘%(0)s'”
        }
    ],
    “componentKey”: null,
    “resolution”: {
        “localized”: “Please refer to component manager documentation to troubleshoot.”,
        “id”: “install.ciscommon.cmlookup.resolution”,
        “translatable”: “Please refer to component manager documentation to troubleshoot.”
    }
}

 

  • Around 5:00 AM we can see the services starting up:

 

2020-07-22T05:06:14.286Z INFO service-control Running:
 applmgmt lwsmd pschealth vmafdd vmcad vmdird vmdnsd vmonapi vmware-analytics vmware-certificatemanagement vmware-cis-license vmware-cm vmware-eam vmware-imagebuilder vmware-pod vmware-postgres-archiver vmware-rbd-watchdog vmware-rhttpproxy vmware-sca vmware-statsmonitor vmware-sts-idmd vmware-stsd vmware-topologysvc vmware-vapi-endpoint vmware-vmon vmware-vpostgres vmware-vpxd vmware-vpxd-svcs vsphere-ui
2020-07-22T05:06:14.287Z INFO service-control StartPending:
 vmware-content-library vmware-sps vmware-updatemgr vmware-vsan-health vmware-vsm vsphere-client
2020-07-22T05:06:14.287Z INFO service-control Stopped:
vmcam vmware-mbcs vmware-netdumper vmware-perfcharts vmware-vcha vsan-dps

 

 

vMon Logs:

 

  • From the vMon logs we can see that the first time the service stopped at the below time:

 

2020-07-21T18:01:31.019548+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:01:31.020135+00:00 warning vmon  <pschealth> Service exited unexpectedly. Crash count 0. Taking configured recovery action.
2020-07-21T18:01:51.021551+00:00 warning vmon  <event-pub> SysProcess exec timed out. Force kill. Pid 36179
2020-07-21T18:01:51.026730+00:00 warning vmon  Failed to publish health status change.
2020-07-21T18:03:31.020128+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:03:31.044739+00:00 warning vmon  <pschealth> Service failed to recover. Try again. Fail count 1

 

2020-07-22T04:40:52.656028+00:00 notice vmon  <pschealth-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-pschealth/vmon/pschealth-vmon-apihealth.py
2020-07-22T04:40:53.018101+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-22T04:40:53.018812+00:00 notice vmon  <pschealth-healthcmd> Cancelling execution of pid 19237
2020-07-22T04:40:53.019655+00:00 warning vmon  A non tracked pid 19237 terminated.
2020-07-22T04:43:54.279902+00:00 notice vmon  Executing service batch op API_HEALTH. IgnoreFail=1, service count=10

2020-07-22T04:48:56.194666+00:00 notice vmon  Executing service batch op API_HEALTH. IgnoreFail=1, service count=10
2020-07-22T04:49:41.647728+00:00 notice vmon  Executing service batch op STOP. IgnoreFail=1, service count=32

 

vMon Syslog:

 

  • From the logs we can see that the First Time the Service Exited unexpectedly is on 21st July 18:01.

 

2020-07-21T18:01:31.019548+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:01:31.020135+00:00 warning vmon  <pschealth> Service exited unexpectedly. Crash count 0. Taking configured recovery action.
2020-07-21T18:01:31.042862+00:00 notice vmon  <pschealth> Restarting service.

2020-07-21T18:03:31.020128+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:03:31.044270+00:00 notice vmon  <pschealth-healthcmd> Cancelling execution of pid 36902
2020-07-21T18:04:09.203067+00:00 notice vmon  <pschealth> Skip service health check. State STARTING, Curr request 3
2020-07-21T18:05:31.022997+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:05:31.023624+00:00 warning vmon  <pschealth> Service failed to recover. Try again. Fail count 2
2020-07-21T18:07:31.023126+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:07:31.023891+00:00 warning vmon  <pschealth> Service failed to recover. Try again. Fail count 3
2020-07-21T18:09:10.912003+00:00 notice vmon  <pschealth> Skip service health check. State STARTING, Curr request 3
2020-07-21T18:09:31.020513+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:09:31.021118+00:00 warning vmon  <pschealth> Service failed to recover. Try again. Fail count 4
2020-07-21T18:11:31.024149+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-21T18:11:31.024692+00:00 notice vmon  <pschealth-healthcmd> Cancelling execution of pid 39931
2020-07-21T18:11:31.025180+00:00 warning vmon  <pschealth> Service reached max quick failure count. Give up!!!

2020-07-21T18:11:31.025651+00:00 warning vmon  A non tracked pid 39931 terminated.
2020-07-21T18:14:11.420989+00:00 notice vmon  Executing service batch op API_HEALTH. IgnoreFail=1, service count=10

2020-07-22T04:38:53.115752+00:00 notice vmon  Received start request for pschealth
2020-07-22T04:38:53.718436+00:00 notice vmon  <pschealth> Skip service health check. State STARTING, Curr request 1
2020-07-22T04:40:53.018101+00:00 warning vmon  <pschealth> Service exited. Exit code 1
2020-07-22T04:40:53.018812+00:00 notice vmon  <pschealth-healthcmd> Cancelling execution of pid 19237
2020-07-22T04:40:53.019655+00:00 warning vmon  A non tracked pid 19237 terminated.
2020-07-22T04:43:55.686973+00:00 notice vmon  <pschealth> Skip service health check. State STOPPED, Curr request 0
2020-07-22T04:48:57.344173+00:00 notice vmon  <pschealth> Skip service health check. State STOPPED, Curr request 0

 

2020-07-22T05:00:04.429650+00:00 err vmon  Missing ending % in -XX:ErrorFile=%VMWARE_LOG_DIR%/vmware/vsm/java_error%p.log
2020-07-22T05:00:04.429891+00:00 warning vmon  Invalid string entry in StartCommandArgs field. Skipping…
2020-07-22T05:00:07.646471+00:00 notice vmon  <pschealth> Service STARTED successfully.

 

 

Messages Logs:

 

  • From the messages logs we can see the PSC Health Check Errors along with the we can see that there are some issues with the Time as well.

 

 

2020-07-21T18:00:31.013914+00:00 abcvap00023 pschealthd: VMware Directory-Service, 1, 1595043431
2020-07-21T18:00:31.014766+00:00 abcvap00023 pschealthd: VMware Certificate-Service, 1, 1595043431
2020-07-21T18:00:31.015191+00:00 abcvap00023 pschealthd: Websso, 0, 1595043463
2020-07-21T18:00:31.015570+00:00 abcvap00023 pschealthd: Lookup service, 0, 1595043461
2020-07-21T18:00:31.015937+00:00 abcvap00023 pschealthd: Security Token Service, 0, 1595043467
2020-07-21T18:00:31.016325+00:00 abcvap00023 pschealthd: License Service, 1, 1595043426
2020-07-21T18:00:31.016692+00:00 abcvap00023 pschealthd: Detected PSC system is not healthy – Wait for atleast one minute before failing

2020-07-21T18:00:46.013357+00:00 abcvap00023 pschealthd: VMware Directory-Service, 1, 1595043428
2020-07-21T18:00:46.019491+00:00 abcvap00023 pschealthd: VMware Certificate-Service, 1, 1595043428
2020-07-21T18:00:46.020017+00:00 abcvap00023 pschealthd: Websso, 0, 1595043478
2020-07-21T18:00:46.020474+00:00 abcvap00023 pschealthd: Lookup service, 0, 1595043476
2020-07-21T18:00:46.021083+00:00 abcvap00023 pschealthd: Security Token Service, 0, 1595043482
2020-07-21T18:00:46.021502+00:00 abcvap00023 pschealthd: License Service, 1, 1595043432
2020-07-21T18:00:46.021886+00:00 abcvap00023 pschealthd: Detected PSC system is not healthy – Wait for atleast one minute before failing

2020-07-21T18:01:31.095751+00:00 abcvap00023 pschealthd: PscHealthCheckInit dwError=0
2020-07-21T18:01:31.096589+00:00 abcvap00023 pschealthd: pscHealthCheckVmdir dwError=0
2020-07-21T18:01:31.114019+00:00 abcvap00023 pschealthd: Detected vmafd is starting up wait for 1 min
2020-07-21T18:02:01.940550+00:00 abcvap00023 CROND[36360]: (root) CMD (. /etc/profile.d/VMware-visl-integration.sh; /usr/lib/applmgmt/backup_restore/scripts/SchedulerCron.py >>/var/log/vmware/applmgmt/backupSchedulerCron.log 2>&1)

2020-07-21T18:02:01.940982+00:00 abcvap00023 CROND[36361]: (root) CMD ( test -x /usr/sbin/vpxd_periodic && /usr/sbin/vpxd_periodic >/dev/null 2>&1)

 

2020-07-22T04:58:09.781689+00:00 abcvap00023 ntpd[1372]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

2020-07-22T04:58:09.782017+00:00 abcvap00023 ntpd[1372]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

2020-07-22T05:01:18.883080+00:00 abcvap00023 systemd[1]: Stopped target Timers.

2020-07-22T05:04:21.815246+00:00 abcvap00023 augenrules[1163]: backlog_wait_time 15000

2020-07-22T05:04:21.815395+00:00 abcvap00023 systemd[1]: Reached target Timers.

2020-07-22T05:04:23.364731+00:00 abcvap00023 xinetd[1365]: removing daytime

2020-07-22T05:04:23.365170+00:00 abcvap00023 xinetd[1365]: removing daytime

2020-07-22T05:04:23.369915+00:00 abcvap00023 xinetd[1365]: removing time

2020-07-22T05:04:23.370659+00:00 abcvap00023 xinetd[1365]: removing time

2020-07-22T05:04:23.424584+00:00 abcvap00023 systemd[1]: Started Network Time Service.

2020-07-22T05:04:23.427687+00:00 abcvap00023 ntpd[1386]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

2020-07-22T05:04:23.428012+00:00 abcvap00023 ntpd[1386]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

2020-07-22T05:04:23.648985+00:00 abcvap00023 systemd[1]: dnsmasq.service: Service hold-off time over, scheduling restart.

 

2020-07-22T04:31:57.404664+00:00 abcvap00023 python: pam_unix(passwd:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
2020-07-22T04:31:58.914639+00:00 abcvap00023 python: pam_unix(passwd:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
2020-07-22T04:32:00.885685+00:00 abcvap00023 python: pam_unix(passwd:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
2020-07-22T04:32:02.791840+00:00 abcvap00023 python: pam_unix(passwd:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost=  user=root
2020-07-22T04:32:04.889578+00:00 abcvap00023 cli: vmware.vherd.linux_cli.cli Error getting authentication cookie from applmgmt service. Verify that the applmgmt service is running and that the method requested to obtain the cookie requires authentication.
2020-07-22T04:32:04.976687+00:00 abcvap00023 sshd[16376]: pam_mgmt_cli(sshd:auth): Failed to authenticate with applmgmt service, service not responding (253)
2020-07-22T04:32:04.992843+00:00 abcvap00023 sshd[16376]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.191.205.131  user=root
2020-07-22T04:32:07.088400+00:00 abcvap00023 sshd[16376]: Failed password for root from 10.191.205.131 port 52001 ssh2
2020-07-22T04:38:53.145962+00:00 abcvap00023 pschealthd: starting pschealthd
2020-07-22T04:38:53.146831+00:00 abcvap00023 pschealthd: PscHealthCheckInit dwError=0
2020-07-22T04:38:53.147216+00:00 abcvap00023 pschealthd: pscHealthCheckVmdir dwError=0
2020-07-22T04:38:53.151807+00:00 abcvap00023 pschealthd: Detected vmafd is starting up wait for 1 min
2020-07-22T04:39:53.021487+00:00 abcvap00023 pschealthd: VMware Directory-Service, 1, 1595043424
2020-07-22T04:39:53.022376+00:00 abcvap00023 pschealthd: VMware Certificate-Service, 1, 1595043430
2020-07-22T04:39:53.023009+00:00 abcvap00023 pschealthd: Websso, 0, 1595081708
2020-07-22T04:39:53.023510+00:00 abcvap00023 pschealthd: Lookup service, 0, 1595081823
2020-07-22T04:39:53.024079+00:00 abcvap00023 pschealthd: Security Token Service, 0, 1595081829
2020-07-22T04:39:53.024589+00:00 abcvap00023 pschealthd: License Service, 1, 1595043431
2020-07-22T04:39:53.024986+00:00 abcvap00023 pschealthd: Detected PSC system is not healthy – Wait for atleast one minute before failing
2020-07-22T04:40:08.015568+00:00 abcvap00023 pschealthd: VMware Directory-Service, 1, 1595043430
2020-07-22T04:40:08.016532+00:00 abcvap00023 pschealthd: VMware Certificate-Service, 1, 1595043427
2020-07-22T04:40:08.017033+00:00 abcvap00023 pschealthd: Websso, 0, 1595081723
2020-07-22T04:40:08.017615+00:00 abcvap00023 pschealthd: Lookup service, 0, 1595081838
2020-07-22T04:40:08.018064+00:00 abcvap00023 pschealthd: Security Token Service, 0, 1595081844
2020-07-22T04:40:08.018481+00:00 abcvap00023 pschealthd: License Service, 1, 1595043428

 

  • Here we can see that there is a message for the Change in the Time of the System, around the reboot post which we have not noticed the issue.

 

2020-07-22T05:07:38.036444+00:00 abcvap00023 systemd[1]: Time has been changed

 

 

WebSSO Logs:

 

  • From the WebSSO we can see the same traces of the Time Sync Issues.

 

[2020-07-22T05:14:41.799Z tomcat-http–10 vsphere.local        b5dec38b-8fe2-406a-94c6-ba8237c9256b ERROR com.vmware.identity.samlservice.impl.AuthnRequestStateValidator] Validation FAILED – Issue Instant 2020-07-22T04:25:54.114Z outside of clock tolerance 600000.

 

Conclusion:

 

  • Based on the logs we can say that the PSC Health Service failed to start because of Time Sync Issues as after the restart when we got the Event: Time has been changed, the issue was resolved.
  • Post the Reboot the issue is not observed.
  • The same issue is mentioned multiple times in the KB Articles where we have issues with the PSC Health Service.

https://kb.vmware.com/s/article/52923

https://kb.vmware.com/s/article/50121332?lang=en_us

 

 

Action Plan:

 

  • Since the Time Sync issue is fixed after the reboot we won’t be able to explore more, but I will recommend you to check if there was any issues with the Time Sync between the vCenter and the ESXi Host where the VM is running as well as the Domain controller.

 

  • Incase if you experience this issue next time, I will recommend you to check if there is any Time different between the ESXi Host, vCenter and the Domain Controller before rebooting the vCenter Server.

 

 

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.

Leave a Reply