Nobody's been killed in a server crash

Please Note: This article is part of our historical archive. Because it was published a while ago, some of the information, links, or context may now be outdated.

Imagine a server that keeps crashing every other day. For most web hosts, this isn’t hard to imagine. Every host has gone through a phase where they are clueless as to why their server keeps going down.

Most of the time, the blame falls on faulty hardware. This is often right: if no recent changes were made to the software, circumstantial evidence points to hardware as the most likely source of the issue. It isn’t always correct, though, as there are other things that could go wrong.

How to proceed

In the case of Linux servers, a quick look at the output of the command

last

will give you an idea about the reboot times. Looking at the output of

dmesg

or logs such as /var/log/messages could point you to the problem, if the root cause of the issue is related to hardware. If you have access to your server’s sensor data, it can give you insight into a potential hardware issue, like a failing fan or abnormal voltage levels for your CPU. A quick test of the hard drives with a tool like

smartctl

can identify hard-disk problems as well.
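As a rough illustration of the log check described above, the sketch below filters kernel messages for a few phrases that commonly accompany hardware trouble. The function name hw_errors and the pattern list are assumptions chosen for this example, not an exhaustive or authoritative set, and /dev/sda in the usage comments is an assumed device name.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin and prints those that
# mention common hardware-related phrases. The pattern list is an
# assumption for illustration; extend it for your own hardware.
hw_errors() {
    grep -iE 'machine check|hardware error|i/o error|temperature above threshold'
}

# Example usage (run as root on the affected server):
#   dmesg | hw_errors
#   hw_errors < /var/log/messages
#   last -x reboot shutdown | head    # recent reboots and shutdowns
#   smartctl -H /dev/sda              # overall SMART health of an assumed device
```

An empty result does not rule out hardware problems; it only means none of the listed phrases appeared in the log.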

Tools like mpstat and iostat, which are part of the sysstat package, can shed light on the problem during troubleshooting. sar, from the same package, gives you more than enough detail about the server’s state. But an in-depth analysis of the server isn’t always possible, since you are rarely troubleshooting the issue in real time, while it is actually happening.

Most often, servers crash because they get overloaded. In such cases, the definitive edge for an in-depth analysis comes from logging the state of the server, its processes, and its resource usage, so that it can be reviewed at a later stage. Here is a script that can help you with that.

Create a folder /var/log/cpu_mem/ and add a line to motd, so that all administrators know to look for the custom logs in this path. The logs will be created in /var/log/cpu_mem/, and the relevant log can be found by the timestamp in its name. Execute the following from the shell:

mkdir -p /var/log/cpu_mem/; echo "Check logs at /var/log/cpu_mem/ for detailed log for analysis" >> /etc/motd; touch /root/loadmon.sh; chmod 755 /root/loadmon.sh

Add the following content to the file /root/loadmon.sh using any popular text editor.

#!/bin/bash
#This simple script records the status of processes, memory usage, disk usage, CPU state, MySQL process list, network connections and kernel messages. More can be added to the list easily, by adding the command that produces the desired output.
#Script written by Sankar.H

#Sets CPU to the number of processors and LOAD to the integer part of the 1-minute load average, both picked from proc
CPU=$(grep -c '^processor' /proc/cpuinfo)
LOAD=$(awk '{print int($1)}' /proc/loadavg)

#The log name is computed once, so it cannot change if the run crosses a minute boundary
LOGFILE=/var/log/cpu_mem/log$(date +%F-%H:%M)

#Replace "$CPU" in the below if statement with the load average in integer, above which you need the logging enabled. -Not recommended
if [ "$LOAD" -ge "$CPU" ]
then
{
printf "\n";date
printf "\n\n================\n\n Memory usage stats \n\n================\n\n"
printf " output of free -m. Look for memory usage and swap usage\n \n"
free -m
printf "\n Look for swap in and swap out\n \n"
vmstat
printf "\n\n================\nTOP Snapshot\n================\n\n"
top -n1 -b
printf "\n\n================\nDisk Usage\n================\n\n"
df -h
printf "\n\n================\nMySQL Process List\n================\n\n"
mysqladmin proc stat
#Comment the above line, and uncomment the line below, if the server is having Plesk installed
#mysqladmin proc stat -u admin -p`cat /etc/psa/.psa.shadow`
printf "\n\n====\n Disk I/O performance- check await and util \n===\n\n"
iostat -xdk
printf "\n\n===\n CPU usage - check for usr sys iowait idle percentages \n===\n\n"
mpstat
printf "\n\n===\nNetwork Stats - approximate no of connections. Check script for enabling more details.\n===\n\n"
netstat -plan |wc -l
#If you need more network related information, uncomment the following line
#printf "\nDetailed network logs: \n";netstat -plan;netstat -s
printf "\n\n===\nLook for errors or firewall messages in the dmesg o/p below \n===\n\n"
dmesg|tail -30
} >"$LOGFILE"
fi

Create a cron job for the periodic execution of the script. An interval of 2 to 5 minutes should be enough. Note that the script records details only when the server is overloaded, that is, when the integer part of the 1-minute load average is at least the number of CPUs. To test the script, replace $CPU with 0, so that it logs the details even when the server load is 0. Read the script for exact details.
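The overload check the script performs can be tried out in isolation. The sketch below wraps the same comparison in a function; is_overloaded is a hypothetical name, and the load-average line and CPU count are passed in as arguments (rather than read from /proc) only so the logic can be exercised with sample values.

```shell
#!/bin/sh
# Hypothetical helper mirroring the script's test: succeeds (exit 0) when
# the integer part of the 1-minute load average is >= the CPU count.
#   $1: a line in /proc/loadavg format   $2: number of CPUs
is_overloaded() {
    load=$(printf '%s\n' "$1" | awk '{print int($1)}')
    [ "$load" -ge "$2" ]
}

# Sample values, not real measurements:
if is_overloaded "4.25 3.10 2.00 2/345 6789" 4; then
    echo "would log"
else
    echo "would skip"
fi
```

With the sample line above, the integer part of the load is 4, which meets the threshold of 4 CPUs, so the check succeeds. A load of 3.99 on the same machine would be truncated to 3 and skipped.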

The script can create log files that take up a lot of space on your server, so it is important to clear old logs periodically. The following script can be set as a daily cron job to clear logs that are older than 2 days. Create the file /root/clear_old_logs.sh:

touch /root/clear_old_logs.sh ;chmod 755 /root/clear_old_logs.sh

Add the following lines to the file /root/clear_old_logs.sh with any of your favorite text editors.

#!/bin/bash

find /var/log/cpu_mem/ -type f -ctime +2 -exec /bin/rm -f {} +
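Before letting a cleanup job delete anything, it can help to preview which files it would match. The following is a minimal sketch of such a dry run; list_old_logs is a hypothetical helper, and it uses -mtime (modification time) rather than the -ctime test in the cleanup script, an assumption made here because modification time is easier to set up and verify by hand.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints regular files under directory $1
# that were last modified more than $2 days ago. Nothing is deleted.
# find counts whole 24-hour periods, so +2 matches files at least
# 3 days old.
list_old_logs() {
    find "$1" -type f -mtime +"$2" -print
}

# Example usage:
#   list_old_logs /var/log/cpu_mem 2
```

Restricting the match to regular files with -type f keeps the log directory itself out of the results.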

Run the following from the shell to set up the cron jobs. On some servers, such as those running Ubuntu, the cron file is at /var/spool/cron/crontabs/root; in those cases, edit the commands below to use that path.

echo "*/2 * * * * /bin/sh /root/loadmon.sh >/dev/null 2>&1">> /var/spool/cron/root

echo "0 4 * * * /bin/sh /root/clear_old_logs.sh >/dev/null 2>&1" >> /var/spool/cron/root
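Running the echo commands a second time would leave duplicate entries in the cron file. As one possible alternative, the sketch below builds the new crontab text and adds an entry only if an identical line is not already there; the result can then be installed through the crontab command. merge_cron_entry is a hypothetical helper name, and /tmp/root.cron in the usage comment is an assumed scratch path.

```shell
#!/bin/sh
# Hypothetical helper: prints the contents of the crontab file in $1,
# followed by the entry in $2 unless an identical line already exists.
merge_cron_entry() {
    cat "$1"
    if ! grep -qxF -e "$2" "$1"; then
        printf '%s\n' "$2"
    fi
}

# Example usage as root (the scratch path is an assumption):
#   crontab -l > /tmp/root.cron 2>/dev/null
#   merge_cron_entry /tmp/root.cron '*/2 * * * * /bin/sh /root/loadmon.sh >/dev/null 2>&1' | crontab -
```

Going through crontab also avoids having to know where the distribution keeps its spool file.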

A crashing server may not kill people, but it definitely kills business.

About the Author :

Sankar works as a Senior Software Engineer at Bobcares. He joined Bobcares in April 2006. He loves grooming and mentoring people. In his free time, he listens to music and enjoys singing.
