Nobody’s been killed in a server crash
Imagine a server that crashes every other day. For most web hosts, this isn’t hard to imagine. Nearly every host has gone through a phase where they are clueless as to why their server keeps going down.
Most of the time, the blame falls on faulty hardware. Often this is a fair guess: if no recent changes were made to the software, the hardware is the most likely source of the issue. The guess isn’t always correct, though, as there are other things that could go wrong.
How to proceed
In the case of Linux servers, a quick look at the output of the command last will give you an idea of the reboot times. Looking at the output of dmesg, or at logs such as /var/log/messages, could point you to the problem if the root cause was hardware-related. If you have access to sensor data for your server, you could get an insight into a potential hardware issue, like a failing fan or abnormal voltage levels for your CPU. A quick test of the hard drives with a tool like smartctl can identify hard-disk problems as well.
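The checks above can be collected into a few helper functions. This is only a sketch: the function names and the keyword list in hardware_hints are illustrative assumptions, and the device name /dev/sda is a placeholder for whatever disks your server has.

```shell
#!/bin/bash
# First-pass hardware triage (sketch). Function names and the keyword
# pattern are assumptions for illustration, not part of any standard tool.

# Show recent reboot and shutdown records from the wtmp log.
reboot_history() {
    last -x reboot shutdown | head -n 20
}

# Keep only lines that mention common hardware trouble (reads stdin).
# The pattern is a guess at useful keywords; extend it for your hardware.
hardware_hints() {
    grep -iE 'machine check|hardware error|i/o error|temperature|edac' || true
}

# Print the SMART overall health verdict for one disk, e.g. /dev/sda.
disk_health() {
    smartctl -H "$1"
}

# Typical use, as root:
#   reboot_history
#   dmesg | hardware_hints
#   hardware_hints < /var/log/messages
#   disk_health /dev/sda
```

An empty result from hardware_hints does not prove the hardware is healthy; it only means none of the assumed keywords appeared in the log.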
Tools like iostat, which is part of the sysstat package, could shed light on the troubleshooting process. Tools like sar give you more than enough detail about the server’s state. But an in-depth analysis of the server isn’t always possible, since you would not be troubleshooting the issue when it actually happens (in real time).
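If sysstat's data collection is enabled, sar can replay the history around a crash after the fact. The sketch below assumes the daily data files live in /var/log/sa (common on Red Hat style systems) or /var/log/sysstat (common on Debian style systems) and are named saDD; the helper names sa_file and sar_window are hypothetical.

```shell
#!/bin/bash
# Review sysstat history around a crash (sketch).
# Assumption: daily data files are named saDD inside one of these directories.
SA_DIRS="${SA_DIRS:-/var/log/sa /var/log/sysstat}"

# Print the path of the data file for a day of the month (01-31).
sa_file() {
    local day=$1 dir
    for dir in $SA_DIRS; do
        if [ -f "$dir/sa$day" ]; then
            printf '%s\n' "$dir/sa$day"
            return 0
        fi
    done
    return 1
}

# Show load average (-q), CPU (-u) and memory (-r) history for a time window.
sar_window() {
    local file=$1 start=$2 end=$3
    sar -q -f "$file" -s "$start" -e "$end"
    sar -u -f "$file" -s "$start" -e "$end"
    sar -r -f "$file" -s "$start" -e "$end"
}

# Example: what happened between 02:00 and 03:00 on the 14th?
#   sar_window "$(sa_file 14)" 02:00:00 03:00:00
```

The sampling interval of sar is usually several minutes, so a short spike just before a crash may not show up in this history.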
Most often, servers crash because they get overloaded. In such cases, logging the state of the server, its processes, and its resource usage gives you a definitive edge in performing an in-depth analysis, because the data can be reviewed at a later stage. Here is a script that can help you with that.
Create a folder /var/log/cpu_mem/ and add a line to motd, so that all administrators know to look for the custom logs in this path. The logs will be created in /var/log/cpu_mem/, and the relevant log can be identified by its time-stamp. Execute the following from the shell:
mkdir -p /var/log/cpu_mem/; echo "Check logs at /var/log/cpu_mem/ for detailed log for analysis" >> /etc/motd; touch /root/loadmon.sh; chmod 755 /root/loadmon.sh
Add the following content to the file /root/loadmon.sh using any popular text editor.
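A minimal sketch of what such a script could look like is given here. It is not necessarily the script this article originally shipped: it assumes that "overloaded" means the 1-minute load average has reached the number of CPU cores, and the function names, the log file naming, and the set of commands captured are illustrative choices.

```shell
#!/bin/bash
# /root/loadmon.sh - sketch: snapshot the server state when it is overloaded.

LOG_DIR="${LOG_DIR:-/var/log/cpu_mem}"
# Threshold (assumption): the number of CPU cores.
CPU=$(nproc 2>/dev/null || echo 1)

# Succeeds when the integer part of the load average reaches the threshold.
is_overloaded() {
    local load=$1 threshold=$2
    [ "${load%.*}" -ge "$threshold" ]
}

# Write load, memory and the busiest processes to a time-stamped log file.
write_snapshot() {
    local logfile
    mkdir -p "$LOG_DIR" || return 1
    logfile="$LOG_DIR/load_$(date +%Y-%m-%d_%H-%M-%S).log"
    {
        echo "=== $(date) ==="
        uptime
        echo "--- memory (MB) ---"
        free -m
        echo "--- top processes by CPU ---"
        ps aux --sort=-%cpu | head -n 15
        echo "--- top processes by memory ---"
        ps aux --sort=-%mem | head -n 15
        echo "--- vmstat, 3 samples ---"
        vmstat 1 3
    } > "$logfile" 2>&1
}

# The 1-minute load average is the first field of /proc/loadavg (Linux only).
read -r load _ < /proc/loadavg

# To test the script, replace "$CPU" below with 0 so that it always logs.
if is_overloaded "$load" "$CPU"; then
    write_snapshot || echo "loadmon: snapshot incomplete or not written in $LOG_DIR" >&2
fi
```

Comparing only the integer part of the load keeps the check in plain bash arithmetic; a load of 3.99 on a 4-core server therefore does not trigger a snapshot, while 4.00 does.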
Create a cron job to run the script periodically. An interval of 2 or 5 minutes should be enough. Note that the script records the details only if the server is overloaded. To test the script, replace $CPU with 0, so that it logs the details even when the server load is 0. Read the script for exact details.
The script can create log files that take up a lot of space on your server, so it is important to clear old logs periodically. The following script can be set as a daily cron job to clear logs that are older than 2 days. Create a file /root/clear_old_logs.sh:
touch /root/clear_old_logs.sh; chmod 755 /root/clear_old_logs.sh
Add the following lines to the file /root/clear_old_logs.sh with any of your favorite text editors.
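One possible version of the cleanup script is sketched below, assuming the logs live in /var/log/cpu_mem/ as set up earlier. The function name clear_old_logs and the LOG_DIR override are hypothetical conveniences.

```shell
#!/bin/bash
# /root/clear_old_logs.sh - sketch: delete overload logs older than N days.

# Remove regular files under directory $1 last modified more than $2 days ago.
clear_old_logs() {
    local dir=$1 days=$2
    [ -d "$dir" ] || return 0   # nothing to clean yet
    # -mmin counts minutes, so "+2880" means older than 2 x 24 hours.
    find "$dir" -type f -mmin "+$((days * 24 * 60))" -exec rm -f {} +
}

clear_old_logs "${LOG_DIR:-/var/log/cpu_mem}" 2
```

The sketch uses -mmin rather than -mtime +2 because find discards fractional days when evaluating -mtime, so -mtime +2 would only match files that are at least three days old.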
Run the following from the shell to set the cron jobs. On some servers, such as those running Ubuntu, the cron file is at /var/spool/cron/crontabs/root; in those cases you might have to edit the following script to use that path.
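As an alternative to editing the spool file directly, the entries can be piped through the crontab command, which writes to the right location on either kind of system. The sketch below is an assumption-laden example: the 5-minute interval follows the suggestion above, the 03:00 cleanup time is an arbitrary choice, and the helper names cron_entries and merge_crontab are hypothetical.

```shell
#!/bin/bash
# Install the two cron jobs without duplicating existing entries (sketch).

# The entries to install. The 03:00 cleanup time is an arbitrary choice.
cron_entries() {
    cat <<'EOF'
*/5 * * * * /root/loadmon.sh >/dev/null 2>&1
0 3 * * * /root/clear_old_logs.sh >/dev/null 2>&1
EOF
}

# Read the current crontab on stdin and print it with any missing
# entries appended, so running the install twice changes nothing.
merge_crontab() {
    local current line
    current=$(cat)
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    cron_entries | while IFS= read -r line; do
        if ! printf '%s\n' "$current" | grep -qxF -- "$line"; then
            printf '%s\n' "$line"
        fi
    done
}

# To install for the current user (run as root):
#   crontab -l 2>/dev/null | merge_crontab | crontab -
```

Keeping the install step as a single commented pipeline makes it easy to preview the result first by running only the first two stages.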
A crashing server may not kill people, but it definitely kills business.
About the Author:
Sankar works as a Senior Software Engineer at Bobcares. He joined Bobcares back in April 2006. He loves grooming and mentoring people. During his free time, he listens to music and enjoys singing.