How to troubleshoot high load in Linux web hosting servers
Even in this age of powerful servers and cloud computing, server load spikes are all too common. When troubleshooting a high server load, getting the approach right is half the work done.
At Bobcares, our server experts fix server load issues in our customers’ (web hosts’) servers in as little as 5 minutes. We do it by systematically tracing an abusive user (or program) from an affected service or over-used resource.
Yes, it sounds like a handful, but years of practice have made it quite easy for us. We’ll explain how, but let’s answer a fundamental question first:
High load average – What is it really?
A server functions with a limited set of resources. For example, an average server these days might have 8 GB of RAM, 4 processors, 75 IOPS SATA II hard disks, and a 1 Gigabit NIC.
Now, let’s assume one user decides to back up their account. If that process occupies 7.5 GB of RAM, other users or services in the system have to wait for that process to finish.
The longer the backup takes, the longer the wait queue. The “length” of this queue is what the server load represents. So, a server running at a load average of 20 will have a longer wait queue than a server at a load average of 10.
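The load average itself can be read in a couple of standard places. A minimal check, using only stock Linux tools:

```shell
# 1-, 5- and 15-minute load averages, as reported by uptime
uptime

# The same three numbers, straight from the kernel; the fourth
# field is runnable/total processes
cat /proc/loadavg

# Compare the load against the number of CPUs: a sustained load
# well above this count means processes are queuing somewhere
nproc
```

As a rough rule of thumb, a load average persistently above the CPU count is worth investigating.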
[ High server load can ruin your business! Don’t delay anymore. Our expert server specialists will keep your servers stable. ]
Why FAST troubleshooting is important
When a server is under high load, chances are that the number of processes in the “wait” queue is growing every second.
Commands take longer to execute, and soon the server could become unresponsive, forcing a reboot. So, it is important to kill the source of the server load as soon as possible.
In our Server support team, we have a concept called “The Golden Minute”. It says that the best chance to recover from a load spike is in the first minute. Our engineers keep a close eye on the monitoring system 24/7, and immediately log on to the server if a load spike is detected. It is due to this quick reaction and expert mitigation that we’re able to achieve close to 100% server uptime for our customers.
Member of Executive Group, Bobcares
How to troubleshoot a load spike really fast?
It is common for people to try out familiar commands when faced with a high load situation. But without a sound strategy, that is just wasted time.
Bobcares support techs use a principle called “go from what you know to what you don’t”.
When we get a high load notification, there’s one thing we know for sure. There’s at least one server resource (RAM, CPU, I/O, etc.) that’s being abused.
- So, the first step is to find out which resource is being abused.
- The next is to find out which service is using that resource. It could be the web server, database server, mail server, or some other service.
- Once we find out the service, we then find out which user in that service is actually abusing the server.
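The three steps above map roughly onto everyday commands. Here is a generic sketch (illustrative only; the exact tools vary with the server type, as the next section shows):

```shell
# Step 1: which resource? High %wa in the summary line points at
# disk, high %us/%sy at CPU, low free memory plus swapping at RAM.
top -b -n1 | head -5

# Step 2: which service? Sort processes by the suspect resource.
ps aux --sort=-%cpu | head   # biggest CPU consumers
ps aux --sort=-%mem | head   # biggest memory consumers

# Step 3: which user? The first column of the ps output is the
# owner of each process, so the abuser falls out of the same list.
```

The point is that each step narrows the search space, so you never have to guess across the whole server at once.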
FAST Linux server load troubleshooting
To show how this concept works in reality, we’ll take an example of a high load situation we recently fixed in a CentOS Linux server. Here are the steps we followed:
- Find the over-loaded resource
- Find the service hogging that resource
- Find the virtual host over-using that service
1. Find the over-loaded resource
Our support techs use different tools for different types of servers. For physical servers or hardware virtualized servers, we’ve found atop to be a good fit. In an OS virtualized server, we use the top command, and if it’s a VPS node we use vztop.
The goal here is to locate which of the resources (CPU, memory, disk, or network) is getting hogged. In this case, we used atop, as it was a dedicated server.
We ran the command “atop -Aac”. It showed the accumulated resource usage of each process, sorted automatically by the most used resource, along with the command details. This gave the output below.
We could see that the most used resource was the disk, marked as ADSK. From the highlighted summary, we saw that /dev/sda was 100% busy.
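The atop screenshot is not reproduced here, but the same 100%-busy reading can be cross-checked with iostat from the sysstat package (assuming it is installed; the device name sda and the column layout may differ on other systems):

```shell
# Extended per-device statistics, 3 samples at 1-second intervals;
# a %util (last) column near 100 means the device is saturated
iostat -dx 1 3

# Pull just the %util (last) column for sda from a single sample
iostat -dx 1 1 | awk '$1=="sda" {print $NF}'
```

Having a second tool confirm the reading is cheap insurance before you start killing processes.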
It’s worth noting that the resource most vulnerable to over-use is usually the disk (especially if it’s SATA), followed by memory, then CPU, and then the network.
At this stage of troubleshooting, the following points are worth noting:
- We observe the server for at least 30 seconds before deciding which resource is being hogged. The one that remains on top the most is the answer.
- While using top, we use the “i” toggle to see only the active processes, and the “c” toggle to see the full command line.
- The “%wa” value in top shows the percentage of time the CPU spends waiting on I/O, which helps us know if it’s a non-CPU resource (typically the disk) that is being hogged.
- Using pstree, we look for any suspicious processes, or an unusually high number of processes from one particular service. We then compare the process listing with that of a similarly loaded server as a quick sanity check.
- We use netstat to look for any suspicious connections, or too many connections from one particular IP (or IP range).
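The checks in the list above can be run from the command line in a few seconds. A sketch using standard tools (netstat comes from net-tools; on newer systems, ss gives the same information):

```shell
# Active processes with full command lines, one batch snapshot
# (-i hides idle tasks, -c shows the full command line)
top -b -i -c -n1 | head -20

# Process tree with PIDs; look for one service with an unusually
# large brood of child processes
pstree -p

# Established connections grouped by remote IP; one IP with
# hundreds of entries is a strong abuse signal
netstat -ntu | awk 'NR>2 {split($5,a,":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head
```

The last pipeline is worth memorizing: it turns a wall of netstat output into a ranked list of remote IPs by connection count.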
[ Don’t wait for your server to crash! Grab our Emergency server services to save your servers at affordable pricing. ]
Troubleshooting is as much an exercise in invalidating possible scenarios as it is in systematically zeroing in on one particular possibility.
When you know how various commands behave on a normal, stable server, you develop an instinct for what is NOT right.