How to troubleshoot high load in Linux web hosting servers
Even in this age of powerful servers and cloud computing, server load spikes (aka high load average) are all too common. Getting the approach right is half the work done in troubleshooting a high server load.
Here at Bobcares, our Outsourced Support Techs fix server load issues on our customers’ (web hosts’) servers in as little as 5 minutes. We do it by systematically tracing an abusive user (or program) from an affected service or over-used resource.
Yes, it sounds like a handful, but years of practice have made it quite easy for us. We’ll explain how, but let’s answer a fundamental question first:
High load average – What is it really?
A server functions with a limited set of resources. For example, an average server these days will have 8 GB of RAM, 4 processors, 75 IOPS SATA II hard disks, and Gigabit NICs.
Now, let’s assume one user decides to back up their account. If that process occupies 7.5 GB of RAM, other users and services on the system have to wait for it to finish.
The longer the backup takes, the longer the wait queue. The “length” of the queue is represented as server load.
So, a server running at a load average of 20 will have a longer wait queue than one at a load average of 10.
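You can read these queue lengths directly on any Linux server; the kernel exposes three load averages (over 1, 5 and 15 minutes):

```shell
# The first three fields are the 1-, 5- and 15-minute load averages;
# the fourth is runnable/total tasks, the fifth the most recent PID
cat /proc/loadavg

# uptime reports the same three figures at the end of its output line
uptime
```

A rough rule of thumb: a load average persistently above the number of CPU cores (see `nproc`) means tasks are queuing for resources.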
[ Running a hosting business doesn’t have to be hard, or costly. Get world class Hosting Support Specialists at $9.99/hour (bulk discounts available) ]
Why FAST troubleshooting is important
When a server is under high load, chances are that the number of processes in the “wait” queue is growing each second. Your commands take longer to execute, and soon the server could become non-responsive, leading to a reboot. So, it is important to kill the source of the server load as soon as possible.
In our Outsourced Tech Support team, we have a concept called “The Golden Minute”. It says that the best chance to recover from a load spike is in the first minute. Our engineers keep a close eye on the monitoring system 24/7, and immediately log in to the server if a load spike is detected. It is due to this quick reaction and expert mitigation that we’re able to achieve close to 100% server uptime for our customers.
Member of Executive Group, Bobcares
How to troubleshoot a load spike really fast?
It is common to try out familiar commands when faced with a high load situation, but without a sound strategy it’s just wasted time.
Bobcares support techs use a principle called “go from what you know to what you don’t”.
When we get a high load notification, there’s only one thing we know: at least one server resource (RAM, CPU, I/O, etc.) is being abused.
- So, the first step is to find out which resource is being abused.
- The next is to find out which service is using that resource. It could be the web server, database server, mail server, or some other service.
- Once you have the service, you can then find out which user of that service is actually abusing the server.
[ Use your time to build your business. We’ll take care of your customers. Hire Our Hosting Support Specialists at $9.99/hr. ]
FAST Linux server load troubleshooting by example
To show how this concept works in reality, we’ll take an example of a high load situation we recently fixed in a CentOS Linux server. Here are the steps we followed:
- Find the over-loaded resource
- Find the service hogging that resource
- Find the virtual host over-using that service
1. Find the over-loaded resource
Our support techs use different tools for different types of servers. For physical or hardware-virtualized servers, we’ve found atop to be a good fit. On an OS-virtualized server we use the top command, and on a VPS node we use vztop.
The goal here is to identify which of the resources, viz. CPU, memory, disk or network, is getting hogged. In this case, we used atop, as it was a dedicated server.
We ran the command “atop -Aac”. It shows the accumulated resource usage of each process, automatically sorted by the most used resource, along with the full command line. This gave the output below.
Here you see that the most used resource is disk and is marked as ADSK. From the highlighted summary we saw that /dev/sda was 100% busy.
It’s worth noting that the resource most vulnerable to over-use is usually disk (especially if it’s SATA), followed by memory, then CPU, and then network.
At this stage of troubleshooting, the following points are worth noting:
- Observe for at least 30 seconds before deciding which resource is being hogged. The one that stays on top the longest is your answer.
- If you’re using top, use the “i” toggle to see only active processes, and “c” to see the full command line.
- Watch the “%wa” (I/O wait) value in top to tell whether it’s a non-CPU resource, such as disk, that is being hogged.
- Use pstree to look for suspicious processes or an unusually high number of instances of a particular service. You can compare the process listing with a similarly loaded server for a quick check.
- Use netstat to look for suspicious connections, or too many connections from one particular IP (or IP range).
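A couple of the checks above can be run non-interactively as a quick triage sequence. This is a sketch using standard options; it falls back to ss where netstat is not installed, and the connection count is an assumption-laden quick check (the awk split handles IPv4 only):

```shell
# CPU summary lines; a high %wa (I/O wait) figure points at a disk bottleneck
top -bn1 | head -5

# Established connections per remote IP, highest first
(netstat -nt 2>/dev/null || ss -nt) \
  | awk 'NR>1 {split($5,a,":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head
```

A single IP (or a tight range) dominating the list is a strong hint of abuse worth checking against the access logs.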
Troubleshooting is as much an exercise in invalidating possible scenarios as it is in systematically zeroing in on one particular possibility. When you know what the output of various commands looks like on a normal, stable server, you will develop an instinct for spotting what is NOT right.
2. Find the service hogging that resource
Finding the abused resource is the easy part. Once it is located, we then move on to locate which service is hogging the resource.
For that, our support techs use specialist tools tuned to troubleshoot that resource. In our current example, we continued using atop (as it has advanced resource listing functions).
You saw above that mysql is the service automatically sorted to the top of the list. To get more detail on disk usage, we pressed “d” on the interactive screen. The output looked like this:
Here you can see how the disk operation statistics for the mysql processes jump well above normal values.
You can alternatively use iotop to analyze disk-based load. The iotop output for the same server looked like this:
From this we confirmed beyond doubt that it was mysql hogging the disk.
For checking memory you can use atop, top, or some clever use of ps.
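One such “clever use of ps” for memory is sorting by resident set size; this is a sketch using standard procps options:

```shell
# Top processes by resident memory; RSS is reported in kilobytes
ps -eo rss,pmem,pid,user,args --sort=-rss | head
```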
To check CPU usage, the best utilities are atop and top. If you are feeling a bit adventurous, try out some bash kung-fu using ps, like here:
# ps -eo pcpu,pid,user,args | sort -k 1 -r | head
%CPU   PID USER   COMMAND
 9.4 29051 mysql  /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql
 8.5 28480 mysite /usr/bin/php /home/mysite/public_html/index.php
 6.5 28493 mysite /usr/bin/php /home/mysite/public_html/index.php
 5.0 13738 root   cxswatch - scanning
 5.0 13735 root   cxswatch - sleeping
 4.9 13737 root   cxswatch - scanning
20.7 21557 root   /bin/bash /usr/local/sbin/maldet -a /home/mydom/
 2.0 28494 root   /usr/sbin/exim -Mc 1ZaWJF-0007PK-CJ
19.2 28402 mydom  /usr/bin/php /home/mydom/public_html/index.php
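One caveat about the pipeline above: plain `sort -k 1 -r` sorts lexicographically, which is why 20.7 and 19.2 land below 9.4 in the listing. Adding `-n` (or letting ps sort for you) gives a true numeric order:

```shell
# Numeric, descending sort on the %CPU column
ps -eo pcpu,pid,user,args | sort -k1 -rn | head

# or let ps do the sorting itself
ps -eo pcpu,pid,user,args --sort=-pcpu | head
```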
For network usage analysis, the best utility is nethogs. It lets you map high network usage back to a process ID. Apart from atop armed with the netatop module, we haven’t yet seen any other utility do that.
3. Find the virtual host over-using that service
At this point we know which service is causing the bottleneck. But a service doesn’t act on its own. The load spike would be linked to a user’s requests to that service.
On this server, we saw that the user “ferc” was keeping his database very busy.
A follow-up check of his access log showed that his comments page was being hammered by spam bots because his captcha was broken. He had also opted not to use our firewall, which left his site vulnerable to spam bots. We quickly rectified this by enabling mod_security protection for his site, and the load started coming down.
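A quick way to spot that kind of hammering in an access log is to count requests per URL; a bot attack on a broken captcha shows up as one path dominating the list. This is a sketch; the log path and the field position (field 7 is the request path in the common/combined log format) are assumptions to adjust for your setup:

```shell
# Requests per URL, busiest first; /var/log/httpd/access_log is an example path
awk '{print $7}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head
```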
Other services we’ve noted taking load are backup processes, server maintenance processes such as tmpwatch, update scripts, the IMAP server, Apache, and sometimes the SMTP server due to inbound spamming.
The best place to start service-specific troubleshooting is the service’s own access logs. By increasing log verbosity, we’ve often found the users taxing that particular service. If it’s not an internal maintenance process inducing the load, a very good option is to use tshark or tcpdump to log which virtual host is getting all the connection requests on that service’s port.
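For a web server, one way to do that with tcpdump is to count the Host: headers of incoming requests over a short window. This is a hedged sketch: the interface (eth0), port 80 and the 10-second window are assumptions, it only works for plain-text HTTP, and tcpdump needs root:

```shell
# Count which virtual hosts draw the HTTP traffic for ~10 seconds
timeout 10 tcpdump -i eth0 -l -A -s 0 'tcp dst port 80' 2>/dev/null \
  | grep -a '^Host:' | sort | uniq -c | sort -rn | head
```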
[ You don’t have to lose your sleep to keep your customers happy. Our Hosting Support Specialists cover your servers and support your customers 24/7 at just $9.99/hour. ]
Take-away from this post
Bobcares support techs rely on a systematic troubleshooting approach as much as we focus on the right tools for the job. Today, we’ve seen by example how we go about fixing a high load average issue. The take-aways are:
- It is important to be disciplined in your approach to troubleshooting. Follow the three step process to walk down to the specific virtual host.
- In troubleshooting, knowing what is NOT causing the issue is as important as following a thread to trace what is causing it. Having a habit of frequently checking all command outputs in a normal server will give you the power to immediately see what is wrong.
- There are specialist tools to use in different situations. Developing a curiosity to explore new and better utilities will stand you in good stead when you face an emergency.
Bobcares helps businesses of all sizes achieve world-class performance and uptime, using tried and tested website architectures. If you’d like to know how to make your website more reliable, we’d be happy to talk to you.

Bobcares provides Outsourced Web Hosting Support and Outsourced Server Management for online businesses. Our services include 24/7 server support, help desk support, live chat support and phone support.