VDI Users unable to Connect to VDI Desktops due to High CPU on UAG

  • Post category:Horizon View / VMware
  • Post last modified:August 2, 2024

Sometimes, typically during peak hours when many users are connected, some session requests fail and users are unable to get their desktops. This article covers the scenario where a UAG and a load balancer are in place.

 

This can happen if a Unified Access Gateway (UAG) virtual appliance becomes intermittently unresponsive, with CPU pegged near 100%.

This can be seen in the output of top on the UAG appliance:

top - 07:15:44 up 76 days,  2:10,  0 users,  load average: 0.53, 0.70, 0.88

Tasks: 131 total,   1 running, 130 sleeping,   0 stopped,   0 zombie

%Cpu0  :   7.7/5.8    96[|||||||||||||||||||||||||||||||||||||||||||||||||||| ]

%Cpu1  :   7.6/5.6    96[|||||||||||||||||||||||||||||||||||||||||||||||||||| ]

  PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
2020 gateway 20 0 3432.6m 584.3m 50.0 14.8 0:54.85 S
2021 gateway 20 0 3473.7m 339.9m 56.2 8.6 48:49.74 S
2215 gateway 20 0 863.8m 53.5m 37.5 1.4 37:08.13 R
2216 gateway 20 0 861.8m 51.5m 43.8 1.3 30:31.56 R
906 gateway 20 0 3491.3m 941.1m 0.0 23.8 2582:02 S java
1124 gateway 20 0 4473.3m 419.7m 0.0 10.6 429:13.10 S java
905 gateway 20 0 3595.3m 402.7m 66.7 10.2 107:05.37 S java
1161 gateway 20 0 1001.1m 162.6m 6.7 4.1 8051:06 S node
1162 gateway 20 0 993.2m 155.4m 0.0 3.9 7993:18 S node
907 gateway 20 0 3503.2m 149.4m 0.0 3.8 199:51.72 S java
1126 gateway 20 0 1185.9m 54.7m 0.0 1.4 38:43.45 S node
1160 gateway 20 0 869.9m 33.5m 0.0 0.8 4:06.54 S node
391 root 20 0 112.8m 32.8m 0.0 0.8 1:54.70 S systemd-journal
500 root 20 0 812.7m 17.0m 0.0 0.4 1:58.67 S rsyslogd
7173 root 20 0 56.6m 6.8m 0.0 0.2 0:00.01 S vami_login
7207 root 20 0 159.8m 6.8m 0.0 0.2 0:00.02 S vami-sfcbd
1189 gateway 20 0 388.3m 6.5m 20.0 0.2 11854:38 S udpforwarder
1107 gateway 20 0 457.4m 3.5m 0.0 0.1 20:00.36 S SecurityGateway
1 root 20 0 45.1m 3.0m 0.0 0.1 0:36.32 S systemd
761 root 20 0 97.9m 2.3m 0.0 0.1 19:46.93 S supervisord
7230 gateway 20 0 13.4m 2.2m 6.7 0.1 0:00.01 R top
981 root 20 0 137.9m 1.9m 0.0 0.0 52:04.61 S vmtoolsd
840 root 20 0 134.8m 1.3m 0.0 0.0 0:00.00 S vami-sfcbd
501 root 20 0 24.5m 1.1m 0.0 0.0 0:06.01 S systemd-logind
496 message+ 20 0 25.5m 1.0m 0.0 0.0 0:09.22 S dbus-daemon
584 root 20 0 11.2m 0.8m 0.0 0.0 0:37.79 S crond
490 systemd+ 20 0 100.5m 0.8m 0.0 0.0 0:11.30 S systemd-timesyn
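
The per-CPU lines in the output above can also be checked by a script. The following is a minimal sketch, not a VMware tool: it assumes the integer printed just before the bar (96 in the sample) is the busy percentage for that CPU, and the function names and the default threshold of 90 are illustrative choices.

```python
import re

# Matches top's per-CPU bar lines such as
#   %Cpu0  :   7.7/5.8    96[||||| ]
# capturing the CPU index and the integer printed just before the bar.
# Assumption: that integer is the busy percentage for the CPU.
CPU_LINE = re.compile(r"^%Cpu(\d+)\s*:\s*[\d.]+/[\d.]+\s+(\d+)\[")


def cpu_busy_percent(top_output: str) -> dict[int, int]:
    """Return {cpu_index: busy_percent} for every per-CPU line found."""
    busy = {}
    for line in top_output.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            busy[int(match.group(1))] = int(match.group(2))
    return busy


def all_cpus_above(busy: dict[int, int], threshold: int = 90) -> bool:
    """True only if at least one CPU was parsed and every one exceeds threshold."""
    return bool(busy) and all(pct > threshold for pct in busy.values())


# The two per-CPU lines from the sample output, with the bars shortened.
sample = """\
%Cpu0  :   7.7/5.8    96[|||||||||||| ]
%Cpu1  :   7.6/5.6    96[|||||||||||| ]
"""

busy = cpu_busy_percent(sample)
print(busy, all_cpus_above(busy))
```

Requiring every vCPU to be above the threshold, rather than the average, is a deliberate choice here: a single busy core on its own would not be flagged.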

When a Unified Access Gateway is overloaded and using excessive CPU (>90% across all vCPUs), it tells the load balancer not to send any more requests, because it will not be able to handle them.

It does this by returning HTTP status 503 to the load balancer, which the ERR tool decodes as:

PS C:\> .\Err.exe 503
HTTP_STATUS_SERVICE_UNAVAIL                                    winhttp.h
# temporarily overloaded
# for hex 0x503 / decimal 1283
ERROR_PARAMETER_QUOTA_EXCEEDED                                 winerror.h
# Data present in one of the parameters is more than the function can operate on.

For more information on how to use the ERR tool, refer to: https://knowitlikepro.com/a-solution-to-all-error-code/

Once the load balancer knows that this UAG is not going to take any further load, it starts routing traffic to another UAG, provided one is present in the environment. If there is none, user sessions will start getting terminated.
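
The routing decision described above can be sketched in a few lines. This is only an illustration of the idea, not how any particular load balancer is implemented: the UAG names are hypothetical, and treating HTTP 200 as "available" is an assumption, since this article only names the 503 case.

```python
from typing import Optional


def eligible_uags(last_status: dict[str, int]) -> list[str]:
    """Return the UAGs whose most recent health check returned HTTP 200."""
    return [name for name, status in last_status.items() if status == 200]


def pick_uag(last_status: dict[str, int]) -> Optional[str]:
    """Pick the first eligible UAG, or None when the whole pool is unavailable."""
    pool = eligible_uags(last_status)
    return pool[0] if pool else None


# Hypothetical pool: uag-01 is overloaded and answers 503.
print(pick_uag({"uag-01": 503, "uag-02": 200}))
# With a single overloaded UAG there is nowhere left to send traffic.
print(pick_uag({"uag-01": 503}))
```

The second call returns None, which corresponds to the case where no other UAG is present in the environment.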

As per KB https://kb.vmware.com/s/article/78419, the primary reasons for a CPU spike are:

  1. Blocking UDP 8443 at the firewall. This forces everything to TCP only which uses significantly more CPU.
  2. Switching Blast TCP port from the default of 8443 to 443 (by putting :443 at the end of the blastExternalUrl) which then involves double-hop TCP 443 > 8443.
  3. UAG logging set to Debug/Trace level

Best Practices:

  • Allow UDP 8443 on the firewall. Blast UDP involves less decrypt/encrypt work. If UDP 8443 is blocked (or UDP 22443 is blocked from UAG to the desktop), the client can't use the more efficient UDP protocol and has to fall back to TCP only. Blast TCP is a TLS WebSockets connection, which involves more decrypt/encrypt work by the Blast BSG component.
  • Use 8443 for blastExternalUrl. If 443 is used for blastExternalUrl, Blast connections arrive at the UAG on port 443 and take a double hop within the UAG to be forwarded on TCP 8443 to BSG. A double hop is less efficient than a single hop.
  • If UDP is not blocked, most clients will use UDP, so the effect of using port 443 for Blast TCP (reason 2 above) is much smaller: little traffic uses TCP, so the double hop matters less. Not blocking UDP is therefore the most significant way to improve efficiency.
  • Change UAG logging to Info/Error level. Refer to https://docs.vmware.com/en/Unified-Access-Gateway/3.9/com.vmware.uag-39-deploy-config.doc/GUID-C16913E1-7984-4072-B1E8-7EBAE385A831.html
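
The blastExternalUrl recommendation above can be checked with a small script. This is a hedged sketch: the hostname uag.example.com and the function names are made up for illustration, and the check only looks at a port written explicitly in the URL, without guessing what applies when no port is given.

```python
from typing import Optional
from urllib.parse import urlsplit


def blast_port(blast_external_url: str) -> Optional[int]:
    """Return the port written explicitly in the URL, or None if none is given."""
    return urlsplit(blast_external_url).port


def uses_double_hop(blast_external_url: str) -> bool:
    """True when the URL explicitly ends in :443, the double-hop case above."""
    return blast_port(blast_external_url) == 443


# Hypothetical values for blastExternalUrl.
for url in ("https://uag.example.com:8443", "https://uag.example.com:443"):
    print(url, "double hop" if uses_double_hop(url) else "single hop")
```

A check like this could be run against exported UAG settings before a change window to catch a :443 suffix early.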

Health Monitoring setup:

Load balancer settings for health monitoring should be set as per KB https://kb.vmware.com/s/article/56636:

  • Interval: 30 seconds
  • Timeout: 91 seconds
  • Monitor String: GET /favicon.ico
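
To see what such a monitor observes, the probe can be reproduced by hand. The sketch below sends the same GET /favicon.ico request with Python's standard library and classifies the reply. The function names and labels are illustrative, and mapping 200 to "healthy" is an assumption, since this article only names 503. Note that urlopen applies the timeout per socket operation, so it is not an exact equivalent of a load balancer's overall monitor timeout.

```python
import urllib.error
import urllib.request


def classify(status: int) -> str:
    """Map an HTTP status from the health check to a coarse label."""
    if status == 200:
        return "healthy"      # assumption: 200 means the UAG accepts traffic
    if status == 503:
        return "overloaded"   # the status this article says UAG returns
    return "unknown"


def probe(base_url: str, timeout: float = 91.0) -> str:
    """Send GET /favicon.ico once and classify the reply."""
    url = base_url.rstrip("/") + "/favicon.ico"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx replies, including 503.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, TLS error or timeout.
        return "unreachable"


# Example call (hypothetical hostname, not executed here):
#   probe("https://uag.example.com")
print(classify(503))
```

Running the probe every 30 seconds against each UAG during peak hours gives a rough picture of when a load balancer would take an appliance out of rotation.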

 

If modifying the above settings does not help with CPU utilization, the following factors also need to be taken into consideration:

4. Number of connections landing on each UAG: session count may vary depending on the connection type (TCP/UDP).
5. Number of CPUs on the UAG appliance: increase the vCPU count from the default of 2 to 4.

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
