VDI Users unable to Connect to VDI Desktops due to High CPU on UAG

Post category: Horizon View / VMware
Post published: May 13, 2020
Post last modified: August 2, 2024

During peak hours, when many users are connected, some session requests may fail and users are unable to get their desktops. This article covers the scenario where Unified Access Gateway (UAG) appliances sit behind a load balancer.

This can happen when Unified Access Gateway virtual appliances become intermittently unresponsive, with CPU pegged near 100%.

This can be seen in the output of top on the UAG appliance:

top - 07:15:44 up 76 days, 2:10, 0 users, load average: 0.53, 0.70, 0.88
Tasks: 131 total, 1 running, 130 sleeping, 0 stopped, 0 zombie
%Cpu0  :   7.7/5.8    96[|||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu1  :   7.6/5.6    96[|||||||||||||||||||||||||||||||||||||||||||||||||||| ]
  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ S COMMAND
 2020 gateway   20   0 3432.6m 584.3m  50.0 14.8   0:54.85 S
 2021 gateway   20   0 3473.7m 339.9m  56.2  8.6  48:49.74 S
 2215 gateway   20   0  863.8m  53.5m  37.5  1.4  37:08.13 R
 2216 gateway   20   0  861.8m  51.5m  43.8  1.3  30:31.56 R
  906 gateway   20   0 3491.3m 941.1m   0.0 23.8   2582:02 S java
 1124 gateway   20   0 4473.3m 419.7m   0.0 10.6 429:13.10 S java
  905 gateway   20   0 3595.3m 402.7m  66.7 10.2 107:05.37 S java
 1161 gateway   20   0 1001.1m 162.6m   6.7  4.1   8051:06 S node
 1162 gateway   20   0  993.2m 155.4m   0.0  3.9   7993:18 S node
  907 gateway   20   0 3503.2m 149.4m   0.0  3.8 199:51.72 S java
 1126 gateway   20   0 1185.9m  54.7m   0.0  1.4  38:43.45 S node
 1160 gateway   20   0  869.9m  33.5m   0.0  0.8   4:06.54 S node
  391 root      20   0  112.8m  32.8m   0.0  0.8   1:54.70 S systemd-journal
  500 root      20   0  812.7m  17.0m   0.0  0.4   1:58.67 S rsyslogd
 7173 root      20   0   56.6m   6.8m   0.0  0.2   0:00.01 S vami_login
 7207 root      20   0  159.8m   6.8m   0.0  0.2   0:00.02 S vami-sfcbd
 1189 gateway   20   0  388.3m   6.5m  20.0  0.2  11854:38 S udpforwarder
 1107 gateway   20   0  457.4m   3.5m   0.0  0.1  20:00.36 S SecurityGateway
    1 root      20   0   45.1m   3.0m   0.0  0.1   0:36.32 S systemd
  761 root      20   0   97.9m   2.3m   0.0  0.1  19:46.93 S supervisord
 7230 gateway   20   0   13.4m   2.2m   6.7  0.1   0:00.01 R top
  981 root      20   0  137.9m   1.9m   0.0  0.0  52:04.61 S vmtoolsd
  840 root      20   0  134.8m   1.3m   0.0  0.0   0:00.00 S vami-sfcbd
  501 root      20   0   24.5m   1.1m   0.0  0.0   0:06.01 S systemd-logind
  496 message+  20   0   25.5m   1.0m   0.0  0.0   0:09.22 S dbus-daemon
  584 root      20   0   11.2m   0.8m   0.0  0.0   0:37.79 S crond
  490 systemd+  20   0  100.5m   0.8m   0.0  0.0   0:11.30 S systemd-timesyn

When a Unified Access Gateway is overloaded and using excessive CPU (>90% across all vCPUs), it signals the load balancer to stop sending new requests, because it will not be able to handle them.

It does this by returning HTTP status 503 to the load balancer. The ERR tool decodes 503 as follows:
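The overload condition can be sketched as a simple check. This is only an illustration of the ">90% across all vCPUs" rule; the function name and the way the samples are collected are assumptions, not UAG's actual internal logic.

```python
from typing import Sequence

# Percent; taken from the ">90% across all vCPUs" rule in this article.
OVERLOAD_THRESHOLD = 90.0


def is_overloaded(per_vcpu_percent: Sequence[float],
                  threshold: float = OVERLOAD_THRESHOLD) -> bool:
    """Return True only when every vCPU is above the threshold (hypothetical helper)."""
    if not per_vcpu_percent:
        return False  # no samples: do not claim overload
    return all(p > threshold for p in per_vcpu_percent)


print(is_overloaded([96.0, 96.0]))  # True: both vCPUs busy, as in the top output
print(is_overloaded([96.0, 40.0]))  # False: one vCPU still has headroom
```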

PS C:\> .\Err.exe 503
HTTP_STATUS_SERVICE_UNAVAIL                                    winhttp.h
# temporarily overloaded
# for hex 0x503 / decimal 1283
ERROR_PARAMETER_QUOTA_EXCEEDED                                 winerror.h
# Data present in one of the parameters is more than the function can operate on.

For more information on how to use the ERR tool, refer to: https://knowitlikepro.com/a-solution-to-all-error-code/
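ERR prints two results because it reads 503 both as a decimal and as a hexadecimal number. Only the first one, the HTTP status, is relevant here. A quick way to confirm the arithmetic, using only the Python standard library:

```python
from http import HTTPStatus

code = 503

# Decimal reading: the HTTP status that matters for this article.
print(HTTPStatus(code).phrase)  # Service Unavailable

# Hexadecimal reading: 0x503 is 1283 in decimal, the unrelated Win32 error.
print(int(str(code), 16))  # 1283
```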

Once the load balancer knows that this UAG will not take any further load, it starts routing traffic to another UAG, provided one is present in the environment. If there is none, user sessions will start getting terminated.

As per KB https://kb.vmware.com/s/article/78419, the primary reasons for the CPU spike are:
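The routing decision can be pictured with a minimal sketch. It assumes the load balancer remembers the last health-check status for each UAG and sends new sessions only to one that answered 200. The host names and the pick_uag helper are hypothetical, and real load balancers use their own scheduling algorithms.

```python
from typing import Dict, List, Optional


def pick_uag(pool: List[str], last_status: Dict[str, int]) -> Optional[str]:
    """Return the first UAG whose last health check returned 200, else None."""
    for uag in pool:
        if last_status.get(uag) == 200:
            return uag
    return None  # no UAG can take new sessions


# Hypothetical pool: uag1 is overloaded and answers 503, uag2 is fine.
pool = ["uag1.example.com", "uag2.example.com"]
print(pick_uag(pool, {"uag1.example.com": 503, "uag2.example.com": 200}))
print(pick_uag(pool, {"uag1.example.com": 503, "uag2.example.com": 503}))
```

The second call returns None, which corresponds to the case where no other UAG is available to take the load.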

Blocking UDP 8443 at the firewall. This forces everything to TCP only which uses significantly more CPU.
Switching Blast TCP port from the default of 8443 to 443 (by putting :443 at the end of the blastExternalUrl) which then involves double-hop TCP 443 > 8443.
UAG logging set to Debug/Trace level

Best Practices:

1. Allow UDP 8443 on the firewall. Blast UDP involves less decrypt/encrypt work. If UDP 8443 is blocked (or UDP 22443 is blocked from UAG to the desktop), the client cannot use this more efficient UDP protocol and has to fall back to TCP only. Blast TCP is a TLS WebSockets connection, which involves more decrypt/encrypt work by the Blast BSG component.
2. Use 8443 for blastExternalUrl. If 443 is used for blastExternalUrl, Blast connections arrive on UAG on port 443 and there is a double hop within UAG to forward them on TCP 8443 to BSG. A double hop is less efficient than a single hop.
   If UDP is not blocked, most clients will use UDP, so the effect of item 2 is much smaller: TCP use is much lower anyway, which makes the double hop less significant. Not blocking UDP is therefore the most significant way to improve efficiency.
3. Change UAG logging to Info/Error level. Refer to https://docs.vmware.com/en/Unified-Access-Gateway/3.9/com.vmware.uag-39-deploy-config.doc/GUID-C16913E1-7984-4072-B1E8-7EBAE385A831.html
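The port recommendation for blastExternalUrl can be checked with a small sketch. It only parses the URL string; the helper name, the messages, and the example host are assumptions for illustration, and it does not read any real UAG configuration.

```python
from urllib.parse import urlsplit


def blast_port_note(blast_external_url: str) -> str:
    """Classify the explicit port in a blastExternalUrl value (hypothetical helper)."""
    port = urlsplit(blast_external_url).port
    if port == 8443:
        return "ok: single hop on 8443"
    if port == 443:
        return "warning: 443 adds a double hop (443 > 8443) inside UAG"
    if port is None:
        return "no explicit port in URL"
    return f"unexpected port {port}"


print(blast_port_note("https://uag.example.com:8443"))
print(blast_port_note("https://uag.example.com:443"))
```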

Health Monitoring setup:

Load balancer settings for health monitoring should be set as per KB https://kb.vmware.com/s/article/56636:

Interval: 30 seconds
Timeout: 91 seconds
Monitor String: GET /favicon.ico
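To make these settings concrete, here is a minimal sketch of a health probe written with the Python standard library. It is not how any particular load balancer implements its monitor. The probe and is_available helpers are hypothetical, and treating only HTTP 200 as healthy is an assumption. A UAG with a self-signed certificate would fail TLS verification here and be reported as 0.

```python
import urllib.error
import urllib.request

INTERVAL_SECONDS = 30          # how often a monitor would call probe()
TIMEOUT_SECONDS = 91           # per-request timeout
MONITOR_PATH = "/favicon.ico"  # monitor string: GET /favicon.ico


def is_available(status: int) -> bool:
    """Assumption: only 200 counts as healthy; 503 means no new sessions."""
    return status == 200


def probe(base_url: str) -> int:
    """Send GET /favicon.ico and return the HTTP status, or 0 if no response."""
    try:
        with urllib.request.urlopen(base_url + MONITOR_PATH,
                                    timeout=TIMEOUT_SECONDS) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # for example 503 when the UAG is overloaded
    except OSError:
        return 0  # connection refused, TLS failure, or timeout


# Usage (hypothetical host), repeated every INTERVAL_SECONDS by the monitor:
#   healthy = is_available(probe("https://uag1.example.com"))
```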

If modifying the above settings does not help with CPU utilization, the following factors also need to be considered:

4. Number of connections landing on each UAG: the session count may vary depending on the connection type (TCP/UDP).
5. Number of CPUs on the UAG appliance: increase the CPUs from the default of 2 to 4.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.