During peak hours, when many users are connected, some session requests may fail and users cannot get their desktops. This article covers the scenario where Unified Access Gateway (UAG) appliances sit behind a load balancer.
This can happen when a Unified Access Gateway virtual appliance becomes intermittently unresponsive, with CPU pegged near 100%.
This can be seen in the output of top on the UAG appliance:
top – 07:15:44 up 76 days, 2:10, 0 users, load average: 0.53, 0.70, 0.88
Tasks: 131 total, 1 running, 130 sleeping, 0 stopped, 0 zombie
%Cpu0 : 7.7/5.8 96[|||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu1 : 7.6/5.6 96[|||||||||||||||||||||||||||||||||||||||||||||||||||| ]
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
2020 gateway 20 0 3432.6m 584.3m 50.0 14.8 0:54.85 S
2021 gateway 20 0 3473.7m 339.9m 56.2 8.6 48:49.74 S
2215 gateway 20 0 863.8m 53.5m 37.5 1.4 37:08.13 R
2216 gateway 20 0 861.8m 51.5m 43.8 1.3 30:31.56 R
906 gateway 20 0 3491.3m 941.1m 0.0 23.8 2582:02 S java
1124 gateway 20 0 4473.3m 419.7m 0.0 10.6 429:13.10 S java
905 gateway 20 0 3595.3m 402.7m 66.7 10.2 107:05.37 S java
1161 gateway 20 0 1001.1m 162.6m 6.7 4.1 8051:06 S node
1162 gateway 20 0 993.2m 155.4m 0.0 3.9 7993:18 S node
907 gateway 20 0 3503.2m 149.4m 0.0 3.8 199:51.72 S java
1126 gateway 20 0 1185.9m 54.7m 0.0 1.4 38:43.45 S node
1160 gateway 20 0 869.9m 33.5m 0.0 0.8 4:06.54 S node
391 root 20 0 112.8m 32.8m 0.0 0.8 1:54.70 S systemd-journal
500 root 20 0 812.7m 17.0m 0.0 0.4 1:58.67 S rsyslogd
7173 root 20 0 56.6m 6.8m 0.0 0.2 0:00.01 S vami_login
7207 root 20 0 159.8m 6.8m 0.0 0.2 0:00.02 S vami-sfcbd
1189 gateway 20 0 388.3m 6.5m 20.0 0.2 11854:38 S udpforwarder
1107 gateway 20 0 457.4m 3.5m 0.0 0.1 20:00.36 S SecurityGateway
1 root 20 0 45.1m 3.0m 0.0 0.1 0:36.32 S systemd
761 root 20 0 97.9m 2.3m 0.0 0.1 19:46.93 S supervisord
7230 gateway 20 0 13.4m 2.2m 6.7 0.1 0:00.01 R top
981 root 20 0 137.9m 1.9m 0.0 0.0 52:04.61 S vmtoolsd
840 root 20 0 134.8m 1.3m 0.0 0.0 0:00.00 S vami-sfcbd
501 root 20 0 24.5m 1.1m 0.0 0.0 0:06.01 S systemd-logind
496 message+ 20 0 25.5m 1.0m 0.0 0.0 0:09.22 S dbus-daemon
584 root 20 0 11.2m 0.8m 0.0 0.0 0:37.79 S crond
490 systemd+ 20 0 100.5m 0.8m 0.0 0.0 0:11.30 S systemd-timesyn
When a Unified Access Gateway is overloaded (CPU above 90% across all vCPUs), it tells the load balancer to stop sending new requests, because it cannot handle them.
It does this by returning HTTP status 503 to the load balancer. The ERR tool describes this code as follows:
PS C:\> .\Err.exe 503
HTTP_STATUS_SERVICE_UNAVAIL winhttp.h
# temporarily overloaded
# for hex 0x503 / decimal 1283
ERROR_PARAMETER_QUOTA_EXCEEDED winerror.h
# Data present in one of the parameters is more than the function can operate on.
For more information on how to use the ERR tool, refer to: https://knowitlikepro.com/a-solution-to-all-error-code/
Once the load balancer learns that this UAG will not take any further load, it starts routing traffic to another UAG, provided one is present in the environment. If there is none, user sessions start getting terminated.
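The routing decision described above can be sketched in a few lines of Python. This is only an illustration, not how any particular load balancer is implemented: the function names, the UAG names, and the round-robin choice are all hypothetical, and it assumes a healthy UAG answers its health check with HTTP 200.

```python
from typing import Dict, List, Optional


def eligible_uags(last_status: Dict[str, Optional[int]]) -> List[str]:
    """Return the UAGs whose most recent health check returned HTTP 200.

    A value of 503 means the UAG reported itself as overloaded;
    None means the health check got no answer at all.
    """
    return sorted(name for name, status in last_status.items() if status == 200)


def pick_uag(last_status: Dict[str, Optional[int]], request_number: int) -> Optional[str]:
    """Pick a UAG for a new session using simple round robin (an assumption).

    Returns None when no UAG is eligible, which corresponds to the case
    where session requests fail because there is nowhere to send them.
    """
    pool = eligible_uags(last_status)
    if not pool:
        return None
    return pool[request_number % len(pool)]
```

With `{"uag1": 503, "uag2": 200}` every new session lands on `uag2`; with only `{"uag1": 503}` the function returns `None`, which is the single-UAG situation where users are left without a desktop.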
As per KB https://kb.vmware.com/s/article/78419, the primary reasons for the CPU spike are:
- Blocking UDP 8443 at the firewall. This forces everything to TCP only which uses significantly more CPU.
- Switching Blast TCP port from the default of 8443 to 443 (by putting :443 at the end of the blastExternalUrl) which then involves double-hop TCP 443 > 8443.
- UAG logging set to Debug/Trace level
Best Practices:
- Allow UDP 8443 on the firewall. Blast UDP involves less decrypt/encrypt work. If UDP 8443 is blocked (or UDP 22443 is blocked from UAG to the desktop), the client cannot use the more efficient UDP protocol and has to fall back to TCP only. Blast TCP is a TLS WebSockets connection, which involves more decrypt/encrypt work by the Blast BSG component.
- Use 8443 for blastExternalUrl. If 443 is used for blastExternalUrl, Blast connections arrive on UAG on port 443 and there is a double hop within UAG to forward them on TCP 8443 to BSG. A double hop is less efficient than a single hop.
- If UDP is not blocked, most clients will use UDP. In that case the effect of using port 443 for blastExternalUrl is much smaller, because far fewer connections use TCP and the double hop matters less. Not blocking UDP is therefore the most significant way to improve efficiency.
- Change UAG logging to Info/Error level. Refer to https://docs.vmware.com/en/Unified-Access-Gateway/3.9/com.vmware.uag-39-deploy-config.doc/GUID-C16913E1-7984-4072-B1E8-7EBAE385A831.html
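As a rough aid for the blastExternalUrl recommendation above, the following Python sketch flags a URL that would cause the TCP 443 > 8443 double hop. The helper names and the example hostname are hypothetical, and treating a URL without an explicit port as 443 (the HTTPS default) is an assumption of this sketch, not something stated in the KB.

```python
from urllib.parse import urlsplit

DEFAULT_BLAST_TCP_PORT = 8443  # default Blast TCP port named in the article
HTTPS_DEFAULT_PORT = 443       # assumed when the URL carries no explicit port


def blast_tcp_port(blast_external_url: str) -> int:
    """Return the TCP port encoded in a blastExternalUrl value."""
    port = urlsplit(blast_external_url).port
    return port if port is not None else HTTPS_DEFAULT_PORT


def uses_double_hop(blast_external_url: str) -> bool:
    """True when Blast TCP arrives on 443 and must be forwarded to 8443."""
    return blast_tcp_port(blast_external_url) == HTTPS_DEFAULT_PORT
```

For example, `uses_double_hop("https://uag.example.com:443")` is true, while `uses_double_hop("https://uag.example.com:8443")` is false.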
Health Monitoring setup :
Load balancer settings for health monitoring should be set as per KB https://kb.vmware.com/s/article/56636:
- Interval: 30 seconds
- Timeout: 91 seconds
- Monitor String: GET /favicon.ico
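To see what such a monitor does, here is a minimal Python sketch of a single probe using the values above. It is not the load balancer's own implementation: the `probe` function and its `opener` parameter are hypothetical names introduced for illustration, and the sketch assumes the UAG is reachable over HTTPS with a certificate the client trusts.

```python
import urllib.error
import urllib.request

INTERVAL_SECONDS = 30        # how often the load balancer would repeat the probe
TIMEOUT_SECONDS = 91         # how long to wait for a response
MONITOR_PATH = "/favicon.ico"


def probe(host, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Send GET /favicon.ico to a UAG and return the HTTP status code.

    Returns None if the UAG cannot be reached or does not answer in time.
    The opener argument exists only so the function can be exercised
    without a real appliance.
    """
    url = f"https://{host}{MONITOR_PATH}"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; a 503 from an overloaded UAG lands here.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None
```

Calling `probe("uag.example.com")` every `INTERVAL_SECONDS` and treating 503 or `None` as "do not send new sessions here" approximates the monitoring behaviour described in this article.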
If modifying the above settings does not help with CPU utilization, also consider the following factors:
- Number of connections landing on each UAG: the session count may vary depending on the connection type (TCP/UDP).
- Number of CPUs on the UAG appliance: increase the CPU count from the default of 2 to 4.