How to troubleshoot server down issues? Let’s discuss.
At times, network issues can take down every server in a data center at once, which leaves us in a difficult position.
As part of our Server Management Services, we assist our customers with several server queries.
Today, let us see how we can troubleshoot server down issues.
How to troubleshoot server down issues
First, we need to check whether it is a false alert.
To do so, we ping the server or telnet to any of its open ports and check whether it is really down.
The commands are really simple:
ping <server IP>
telnet <server IP> <port number>

ping 188.8.131.52
telnet 184.108.40.206 22
The above commands work on different operating systems such as Linux, Mac, or Windows.
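Since telnet is interactive, a scripted check can be easier to automate. The following is a minimal sketch, not a definitive tool: `port_open` is a hypothetical helper name, and it assumes bash (for its `/dev/tcp` pseudo-device) and the coreutils `timeout` command are available.

```shell
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds
# within the given number of seconds (default 5), non-zero otherwise.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are installed.
port_open() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example usage (address and port taken from the telnet command above):
# if port_open 184.108.40.206 22; then
#     echo "port 22 is reachable"
# else
#     echo "port 22 is not reachable"
# fi
```

The timeout keeps the check from hanging when a firewall silently drops the packets.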
If the server responds to ping without any packet loss, as shown below, then everything is fine and it is a false alert.
ping google.com -c10
PING google.com (220.127.116.11) 56(84) bytes of data.
10 packets transmitted, 10 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 326.477/326.477/326.477/0.000 ms
However, if the server does not respond to ping or shows any packet loss, then our Support Techs recommend contacting the data center (DC) or hosting provider to get the issue sorted out.
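The packet-loss check can also be scripted. The sketch below is only an illustration: `packet_loss` and `is_false_alert` are hypothetical names, and it assumes the Linux ping summary format shown earlier ("0% packet loss"). Other platforms print the summary differently.

```shell
# Hypothetical helper: read ping output on stdin and print the packet-loss
# percentage from the summary line (for example "0" for "0% packet loss").
packet_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical wrapper: succeed only when ten pings come back with no loss.
# Assumes Linux ping, where -c sets the number of packets to send.
is_false_alert() {
    loss=$(ping -c 10 "$1" 2>/dev/null | packet_loss)
    [ "$loss" = "0" ]
}

# Example usage:
# if is_false_alert 188.8.131.52; then
#     echo "false alert"
# else
#     echo "server unreachable or losing packets"
# fi
```

If ping cannot reach the host at all, no summary is parsed and the wrapper reports a failure, which is the safe default.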
Let us see what one of our customers came across.
When he received the server down alert, he couldn’t access the server. Even a reboot couldn’t bring the server back online.
So, we connected to the server via IPMI and found that it was stuck at fsck, which was the cause of the problem.
There can be many reasons for a server going down:
- High load in the server
- Faulty equipment
- High temperature in the DC room
- Partition being full
- No disk space in the server
- Faulty power cable connection
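Once we have console access, some of these causes (full partitions, high load) can be checked quickly. Below is a minimal sketch, assuming POSIX `df -P` output. `full_partitions` is a hypothetical helper name and the 90% threshold is an arbitrary example.

```shell
# Hypothetical helper: read `df -P` output on stdin and print each mount
# point whose usage is at or above the given percentage (default 90).
full_partitions() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example usage on a live server:
# df -P  | full_partitions 90    # partitions running out of space
# df -Pi | full_partitions 90    # partitions running out of inodes
# uptime                         # current load averages
```

Checking inodes as well as blocks matters because a partition can refuse new files while `df` still shows free space.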
If the server is stuck, a reboot from the DC end is usually the better option.
Once the server is up, we need to check the reason why it went down.
To find the reason, we check the following logs on the server.
/var/log/messages
dmesg | less
/var/log/boot.log
/var/log/fsck
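Reading these logs line by line is slow, so filtering for failure keywords can help. The following is a rough sketch: `scan_log` is a hypothetical helper, and the keyword list is an illustrative assumption, not an exhaustive set.

```shell
# Hypothetical helper: print the lines of a log file that mention common
# failure keywords. The keyword list is an illustrative assumption.
scan_log() {
    grep -iE 'out of memory|oom-killer|i/o error|kernel panic|ext[234]-fs error|fsck' "$1"
}

# Example usage:
# scan_log /var/log/messages | tail -n 50
# dmesg | grep -iE 'error|fail' | less
# last -x | grep -E 'reboot|shutdown' | head    # recent reboots and shutdowns
```

The timestamps on the matching lines can then be compared with the time the server down alert was raised.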
Suppose the load was high due to spam or a large number of incoming HTTP connections. In that case, we need to troubleshoot accordingly.
For example, if the server went down due to high load from incoming traffic, we check the incoming connections and block the offending ones on the server.
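To see which remote addresses hold the most connections, the socket list can be summarized per IP. This is a minimal sketch that assumes the `ss` tool from iproute2 and its default column order. `connections_per_ip` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ss -Htn` output on stdin (columns: State,
# Recv-Q, Send-Q, Local Address:Port, Peer Address:Port) and print the
# number of connections per remote address, busiest first.
connections_per_ip() {
    awk '{ peer = $5; sub(/:[0-9]+$/, "", peer); print peer }' |
        sort | uniq -c | sort -rn
}

# Example usage:
# ss -Htn | connections_per_ip | head
#
# Blocking one offending address (run as root; replace the placeholder):
# iptables -I INPUT -s <offending IP> -j DROP
```

It is worth confirming that an address is abusive before blocking it, since proxies and monitoring systems can legitimately open many connections.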
If a drive is faulty, then we need to replace it. In our case, the server was running an fsck check, and that is why it was taking so long to come up.
Several things can cause an fsck check to run:
- The hard disk not being unmounted cleanly
- Using a third-party utility to delete the extended partition
- Problems with any filesystems
- Power failure
- Incomplete shutdown
- Hardware failure
The above causes result in file system operations being incomplete.
A few sample entries from the Linux logs are shown below.
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/xvda1
/: clean, 56079/1310720 files, 1243508/2621440 blocks
[/sbin/fsck.ext3 (1) -- /var/www/virtual] fsck.ext3 -a /dev/sdf
fsck.ext3: No such file or directory while trying to open /dev/sdf
/dev/sdf:
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193
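When the log suggests an alternate superblock, the backup locations have to be found first, because they depend on how the filesystem was created. The sketch below assumes the e2fsprogs tools (`dumpe2fs`, `e2fsck`) are installed. `backup_superblocks` is a hypothetical helper, and `/dev/sdX1` and `32768` are placeholders, not values from the server above.

```shell
# Hypothetical helper: read `dumpe2fs` output on stdin and print the block
# numbers of the backup superblocks, one per line.
backup_superblocks() {
    sed -n 's/.*Backup superblock at \([0-9][0-9]*\).*/\1/p'
}

# Example usage (run as root on an UNMOUNTED filesystem):
# dumpe2fs /dev/sdX1 2>/dev/null | backup_superblocks
# e2fsck -b 32768 /dev/sdX1    # use one of the block numbers printed above
```

If `dumpe2fs` itself cannot read the filesystem, `mke2fs -n` on the same device is commonly used to list where the backups would be. The `-n` flag makes it a dry run, but the command should still be double-checked before pressing Enter.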
Once the fsck check completed, the server came back up without further errors.
In short, network issues and other faults can take servers, or even a whole data center, offline. Today, we saw how to troubleshoot server down issues.