Bobcares

Nobody’s been killed in a server crash

Imagine a server that keeps crashing every other day. For most web hosts, this isn’t too hard to imagine. Every host has gone through a phase where they are clueless as to why their server keeps going down.

Most of the time, the blame falls on faulty hardware. This is often true: if no recent changes were made to the software, circumstantial evidence points to hardware as the most likely source of the issue. It doesn’t always have to be correct, though, as there are other things that could go wrong.

How to proceed

In the case of Linux servers, a quick look at the output of the command last will give you an idea of the reboot times. Looking at the output of dmesg, or at logs such as /var/log/messages, could point you to the problem if the root cause is hardware related. If you have access to your server’s sensor data, you could get an insight into a potential hardware issue, like a failing fan or abnormal CPU voltage levels. A quick test of the hard drives with a tool like smartctl can identify disk problems as well.
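As a rough first pass, the kernel log checks above can be narrowed to lines that usually point at hardware. The sketch below is a minimal filter, not a diagnosis: the function name hw_errors and its keyword list are assumptions chosen for illustration, and real kernel messages vary by driver and kernel version. The device /dev/sda in the usage comment is only an example.

```shell
#!/bin/sh
# hw_errors: read log lines on stdin and keep those matching a few
# hardware-related keywords. The keyword list is an assumption and is
# deliberately short; extend it for your own hardware.
hw_errors() {
    grep -iE 'i/o error|hardware error|machine check|edac|temperature above threshold'
}

# Typical use on a live server (usually needs root):
#   dmesg | hw_errors
#   hw_errors < /var/log/messages
#   smartctl -H /dev/sda    # overall SMART health verdict for one disk
```

An empty result does not prove the hardware is healthy; it only means none of the listed keywords appeared in the log.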

Tools like mpstat and iostat, which are part of the sysstat package, can shed light on the problem during troubleshooting. sar, from the same package, gives you more than enough detail about the server’s state. But an in-depth analysis of the server isn’t always possible, since you would not be troubleshooting the issue in real time, when it actually happens.
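When sysstat's periodic data collection is enabled, sar can also replay a past time window from its daily data files, which partly makes up for not being present during the crash. The sketch below only locates the data file for a given day. The helper name find_sa_file and the SA_DIRS override are hypothetical, and the two default directories are the locations commonly used on RHEL/CentOS and Debian/Ubuntu; your system may differ.

```shell
#!/bin/sh
# find_sa_file DAY: print the path of the sysstat data file for a
# two-digit day of the month (for example 05 or 15).
# SA_DIRS is a hypothetical override; the defaults are assumptions about
# common distribution layouts.
find_sa_file() {
    day=$1
    # Unquoted on purpose: the list of directories is split on spaces.
    for dir in ${SA_DIRS:-/var/log/sa /var/log/sysstat}; do
        if [ -f "$dir/sa$day" ]; then
            printf '%s\n' "$dir/sa$day"
            return 0
        fi
    done
    return 1
}

# Usage: run queue and load averages between 03:00 and 04:00 on the 15th.
#   f=$(find_sa_file 15) && sar -q -s 03:00:00 -e 04:00:00 -f "$f"
```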

More often than not, servers crash because they get overloaded. In such cases, you get a definitive edge for an in-depth analysis by logging the state of the server, its processes, and its resource usage, so that it can be reviewed at a later stage. Here is a script that can help you with that.

Create a folder /var/log/cpu_mem/ and add a line to motd, so that all administrators know to look for the custom logs in this path. The logs will be created in /var/log/cpu_mem/, and the relevant log can be identified by the timestamp in its file name. Execute the following from the shell:

mkdir -p /var/log/cpu_mem/
echo "Check logs at /var/log/cpu_mem/ for detailed log for analysis" >> /etc/motd
touch /root/loadmon.sh; chmod 755 /root/loadmon.sh

Add the following content to the file /root/loadmon.sh using any popular text editor.

#!/bin/bash
# This simple script records the status of processes, memory usage, disk usage, CPU state, the MySQL process list and so on. More can be added easily, by adding the command corresponding to the desired output.
# Script written by Sankar.H
# CPU holds the number of processors; LOAD holds the integer part of the 1-minute load average from /proc.
CPU=$(grep -c ^processor /proc/cpuinfo)
LOAD=$(awk '{print int($1)}' /proc/loadavg)
# Compute the log file name once, so that the whole snapshot lands in a single file.
LOGFILE="/var/log/cpu_mem/log$(date +%F-%H:%M)"
# To log above a fixed load average instead, replace "$CPU" in the if statement below with that integer. Not recommended.
if [ "$LOAD" -ge "$CPU" ]
then
    {
        printf "\n"; date
        printf "\n\n================\n\n Memory usage stats \n\n================\n\n"
        printf " output of free -m Look for memory usage and swap usage\n \n"
        free -m
        printf "\n Look for swap in and swap out\n \n"
        vmstat
        printf "\n\n================\nTOP Snapshot\n================\n\n"
        top -n1 -b
        printf "\n\n================\nDisk Usage\n================\n\n"
        df -h
        printf "\n\n================\nMySQL Process List\n================\n\n"
        mysqladmin proc stat
        # Comment out the line above and uncomment the line below if the server has Plesk installed
        #mysqladmin proc stat -u admin -p`cat /etc/psa/.psa.shadow`
        printf "\n\n====\n Disk I/O performance- check await and util \n===\n\n"
        iostat -xdk
        printf "\n\n===\n CPU usage - check for usr sys iowait idle percentages \n===\n\n"
        mpstat
        printf "\n\n===\nNetwork Stats - approximate no of connections. Check script for enabling more details.\n===\n\n"
        netstat -plan | wc -l
        # If you need more network related information, uncomment the following line
        #printf "\nDetailed network logs: \n"; netstat -plan; netstat -s
        printf "\n\n===\nLook for errors or firewall messages in the dmesg o/p below \n===\n\n"
        dmesg | tail -30
    } > "$LOGFILE"
fi

Create a cron job for the periodic execution of the script. Setting the interval to every 2 or 5 minutes should be enough. Note that the script records the details only if the server is overloaded, that is, when the load average reaches the number of CPUs. If you want to test the script, replace $CPU with 0, so that it logs the details even when the server load is 0. Read the comments in the script for the exact details.
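One way to avoid editing the script each time you test it is to make the threshold overridable. The following is a minimal sketch under that assumption: LOADMON_THRESHOLD and the function should_log are hypothetical names and are not part of the original script.

```shell
#!/bin/sh
# should_log LOADAVG CPUS: succeed when the integer part of the 1-minute
# load average is at or above the threshold. The threshold defaults to
# the CPU count and can be overridden with LOADMON_THRESHOLD
# (a hypothetical variable name).
should_log() {
    load_int=${1%%.*}                    # "4.25" becomes "4"
    threshold=${LOADMON_THRESHOLD:-$2}
    [ "$load_int" -ge "$threshold" ]
}

# In loadmon.sh this could replace the plain comparison:
#   CPU=$(grep -c ^processor /proc/cpuinfo)
#   LOAD=$(awk '{print $1}' /proc/loadavg)
#   if should_log "$LOAD" "$CPU"; then ... fi
#
# Test run that logs regardless of the current load:
#   LOADMON_THRESHOLD=0 /root/loadmon.sh
```

With the override unset, the behaviour matches the comparison against the CPU count described above.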

The script can create log files that take up a lot of space on your server, so it is important to clear old logs periodically. The following script can be set as a daily cron job to clear logs that are older than 2 days. Create a file /root/clear_old_logs.sh:

touch /root/clear_old_logs.sh ;chmod 755 /root/clear_old_logs.sh

Add the following lines to the file /root/clear_old_logs.sh with any of your favorite text editors.

#!/bin/bash

find /var/log/cpu_mem/ -type f -ctime +2 -exec /bin/rm -f {} +
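Before letting a cron job delete anything, it can help to list what would be removed. The sketch below is a dry-run helper under stated assumptions: list_old_logs is a hypothetical name, and it tests modification time (-mtime) rather than the -ctime used above, since for log files that are written once the two normally coincide and modification time is easier to experiment with.

```shell
#!/bin/sh
# list_old_logs DIR DAYS: print regular files in DIR that were last
# modified more than DAYS whole days ago. Nothing is deleted.
list_old_logs() {
    find "$1" -type f -mtime +"$2" -print
}

# Usage:
#   list_old_logs /var/log/cpu_mem 2
```

Note that find counts age in whole 24-hour periods, so +2 matches files that are at least three days old.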

Run the following from the shell to set up the cron jobs. On some servers, such as those running Ubuntu, the cron file is at /var/spool/cron/crontabs/root; in those cases, edit the following commands to use that path.

echo "*/2 * * * * /bin/sh /root/loadmon.sh >/dev/null 2>&1">> /var/spool/cron/root

echo "0 4 * * * /bin/sh /root/clear_old_logs.sh >/dev/null 2>&1">> /var/spool/cron/root
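Appending to the spool file works, but running the commands twice adds duplicate entries, and the path differs between distributions. A possible alternative is to go through the crontab command instead. The helper below is only a sketch of that idea; add_cron_line is a hypothetical name.

```shell
#!/bin/sh
# add_cron_line LINE: copy the crontab text on stdin to stdout and
# append LINE unless an identical line is already present.
add_cron_line() {
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$1"; then
        printf '%s\n' "$1"
    fi
}

# Usage (as root), piping the current crontab through the helper:
#   crontab -l 2>/dev/null \
#     | add_cron_line '*/2 * * * * /bin/sh /root/loadmon.sh >/dev/null 2>&1' \
#     | crontab -
```

Because the helper only transforms text, it can be tried out on a saved copy of a crontab before anything is installed.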

 

A crashing server may not kill people, but it definitely kills business.


About the Author :

Sankar works as a Senior Software Engineer at Bobcares. He joined Bobcares back in April 2006. He loves grooming and mentoring people. During his free time, he listens to music and enjoys singing.

