Yes, troubleshooting is an art! The keys to mastering this art are knowing the system inside out, using the right tools and, of course, googling. Troubleshooting is not something that can be spoon-fed or taught as a set of precise steps. It has to grow out of logical thinking and thorough knowledge of the system.
The first step in debugging any problem is knowing the problem well. Once we have found the problem, it's time to fix it. But how? This is where using the right tools, and knowing how to use them effectively, becomes important. In the *nix world, system administration means working with a lot of commands and piping them together effectively. Knowing the tools is not the only requirement, though. We should also know where to point them: the source of data. And if everything else fails, google it!
I will cover each of these in a bit more detail, with examples wherever possible.
2 Understanding the problem
As I said earlier, understanding the problem is the key aspect of
troubleshooting. Suppose we run a script and it fails with an error
message. Here, the error message itself is not the problem. We need to
find out what caused the script to print it. The next section covers
some tools that help here.
2.1 grep
Data that we have:
- A script
- The error message that we get when running the script
Now our first job is to find out why the script resulted in an error.
'grep' is a very good command to start with. So we run:
$ grep 'Error Message' scriptname
If we get a result, our job is fairly simple: go through the script,
find which condition is failing and fix it. In practice, though, this
kind of grep usually returns nothing, which means the error message
does not come directly from the script. This is why knowing the problem
matters. Suppose the error message was something like 'username not
found', but on checking we see that the username does exist. In that
case a nonexistent username is not our 'problem'; it is something else
which we have yet to find.
Now for the second case: grep gave no output. Then we know that the
script is calling some other script or binary which may be producing
the error. If it is a small script, pointing out the location is easy.
But if we have a large script with a lot of branches, we have to
introduce checkpoints (put some print lines here and there to find the
exact location that gives the error).
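The checkpoint idea can be sketched as follows. This is only an illustration: the checkpoint helper, the stage layout and the failing command are all made up, and a real script would get its print lines wherever the branches are.

```shell
#!/bin/sh
# checkpoint is a hypothetical helper, not a standard command. It writes a
# marker to stderr so the script's normal output is left alone.
checkpoint() {
    echo "CHECKPOINT $1" >&2
}

# Stand-in for a long script. The stages and the failing command are
# invented for this illustration.
run_stages() {
    checkpoint 1
    true                                        # stage 1 succeeds
    checkpoint 2
    ls /no/such/dir/for/demo >/dev/null 2>&1 || return 1   # stage 2 fails
    checkpoint 3                                # never reached
}

run_stages || echo "failed somewhere after the last checkpoint shown" >&2
```

The last checkpoint printed tells us which stage to look into. Running a script with sh -x gives a similar view without editing it, since the shell then prints each command as it runs.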
After pinpointing the portion of the script giving the error, if it
comes from another script called by the original, we follow the same
steps as before. In the case of a binary, to make sure it really is the
one giving the error, we proceed with a second beautiful command:
'strings'.
$ strings binary_name | grep 'Error Message'
strings(1) prints the strings of printable characters in files.
Example: Suppose our error message was 'user joe does not exist' and
the program was /bin/su. At times we have to search for only the fixed
part of the error message, leaving out the variable part. So in the
above case, we remove 'joe' and search for 'does not exist' only.
$ strings /bin/su | grep 'does not exist'
user %s does not exist
$
This confirms that the error message comes from the binary we guessed.
To proceed from here, we need to know how to use some more powerful
tools.
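When there are several candidates (the script itself plus the programs it calls), the search for the fixed part of the message can be done in one go. The sketch below uses plain grep instead of strings, which is a substitution on my part: grep -l only reports file names, so it works on scripts and binaries alike. The function name which_has_message is made up for this example.

```shell
# which_has_message FRAGMENT FILE...
# Prints the name of every FILE (script or binary) that contains FRAGMENT.
# -F treats the fragment as a fixed string, -l prints only file names,
# and "--" protects fragments that begin with a dash.
which_has_message() {
    fragment=$1
    shift
    grep -lF -- "$fragment" "$@" 2>/dev/null
}

# Example use (the file names are placeholders):
#   which_has_message 'does not exist' scriptname /bin/su
```

Once a binary shows up in the list, strings on that one file shows the full message template, as in the /bin/su example above.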
3 Using the right tools
We now move on to a few more complex, but very helpful, commands. Of
these, strace is the one I like most, and it has helped me a lot. In my
opinion, every sysadmin should know how to use strace.
3.1 Using strace to debug a binary
Running a program under strace
strace is used to trace system calls and signals. With a bit of practice,
strace can be used effectively to find what is going wrong and where.
Usage: strace command arguments
Below is sample strace output, showing only the relevant portion. (This
is just an example of what the output looks like.)
$ strace less /etc/shadow
wait4(6711, [WIFEXITED(s) && WEXITSTATUS(s) == 127], 0, NULL) = 6711
stat64("/etc/shadow", {st_mode=S_IFREG|0400, st_size=1388, ...}) = 0
stat64("/etc/shadow", {st_mode=S_IFREG|0400, st_size=1388, ...}) = 0
open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)
write(2, "/etc/shadow: Permission denied\n", 31/etc/shadow: Permission denied
As we all know, ordinary users are not allowed to view the /etc/shadow
file, and the strace output shows exactly that. One thing to note here
is that strace must be run as the correct user. Some applications,
though they start as root, switch to another user before doing their
work. So we must run strace as that same user. If it is a user without
a shell, we can do that by first switching to that user with a shell
specified explicitly.
$ su - nobody -s /bin/bash
This puts you in as user nobody with bash as the shell. Now execute
strace from this shell. Running this way may produce errors that do not
appear when we run as root, since root is the superuser and is allowed
to do almost anything without restriction.
open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)
shows that the user does not have permission to open the file even
for reading. For a missing file or directory, strace shows something
like this:
open("path/to/file", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
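In a long trace, lines like these are easy to miss. If the trace has been saved to a file (strace -o, covered later in this section), the failed calls can be pulled out with grep. This is a rough sketch; failed_calls is a made-up name and the pattern only assumes the usual "= -1 ENAME (description)" ending.

```shell
# failed_calls LOGFILE
# Prints every line of a saved strace log whose call returned -1 together
# with an errno name such as EACCES or ENOENT.
failed_calls() {
    grep -E ' = -1 E[A-Z0-9]+ ' "$1"
}

# Demo on the two failing lines quoted above plus one successful call.
log=$(mktemp)
cat > "$log" <<'EOF'
stat64("/etc/shadow", {st_mode=S_IFREG|0400, st_size=1388, ...}) = 0
open("/etc/shadow", O_RDONLY|O_LARGEFILE) = -1 EACCES (Permission denied)
open("path/to/file", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
EOF
failed_calls "$log"    # prints the two open() lines only
rm -f "$log"
```

Not every failed call is a problem; programs routinely probe for optional files. The failures closest to the error message are usually the ones worth reading.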
Another important use of strace is to find out which configuration
file a binary is using. When we install something from source
(./configure; make; make install), the files may end up in various
locations that are not always easy to trace. strace helps here too.
In the following example, we find out the configuration file for sshd.
$ strace /usr/sbin/sshd
and in the output, we see
open("/etc/ssh/sshd_config", O_RDONLY|O_LARGEFILE) = 3
In this case it is an rpm installation and the path is the default one.
But the same approach helps with source installations.
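To list every file a program managed to open, a saved trace can be filtered for open calls that returned a file descriptor. The following is a sketch under a couple of assumptions: the trace was saved to a file, the path is the first quoted string on the line, and newer systems may show openat() instead of open(), so both are matched. The name opened_files is made up.

```shell
# opened_files LOGFILE
# Lists the path of every open()/openat() call that succeeded. A successful
# call returns a file descriptor, i.e. a non-negative number.
opened_files() {
    grep -E 'open(at)?\(.*\) = [0-9]+' "$1" |
        awk -F'"' '{ print $2 }' |
        LC_ALL=C sort -u
}

# Example use (trace.log is a placeholder for a file written by strace -o):
#   opened_files trace.log | grep -i conf
```

Piping the result through grep -i conf, as in the usage comment, narrows the list down to likely configuration files.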
Using strace on a running program ( strace -p )
strace can also be used to trace a program that is already running.
Usage: strace -p pid (where pid is the process id of the program)
Note that for programs with multiple instances (forked-off children),
we normally start with the pid of the main instance.
$ ps -ef | grep http
root 2481 1 0 Sep22 ? 00:00:00 /usr/local/apache/bin/httpd -DSSL
apache 2490 2481 0 Sep22 ? 00:01:48 /usr/local/apache/bin/httpd -DSSL
apache 2491 2481 0 Sep22 ? 00:01:36 /usr/local/apache/bin/httpd -DSSL
As you can see, the main instance of this apache is the one running
as root, with pid 2481. So, to debug that, do:
$ strace -p 2481
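Picking the main instance out of a long process list can be scripted. The sketch below assumes, as in the listing above, that the main instance is the one whose parent pid is 1; that holds for this apache but is not guaranteed for every daemon. The name main_instance is made up.

```shell
# main_instance NAME
# Reads "ps -ef" output on stdin and prints the pid (column 2) of every
# process whose line contains NAME and whose parent pid (column 3) is 1.
main_instance() {
    awk -v name="$1" '$3 == 1 && index($0, name) { print $2 }'
}

# Demo on the listing above; on a live system use: ps -ef | main_instance httpd
main_instance httpd <<'EOF'
root 2481 1 0 Sep22 ? 00:00:00 /usr/local/apache/bin/httpd -DSSL
apache 2490 2481 0 Sep22 ? 00:01:48 /usr/local/apache/bin/httpd -DSSL
apache 2491 2481 0 Sep22 ? 00:01:36 /usr/local/apache/bin/httpd -DSSL
EOF
# prints 2481
```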
Using strace to chase forked child processes ( strace -f )
In the above example, we saw how to use strace on a running program.
But in the case of apache, although there is one main instance running,
apache forks off a lot of other instances as well. The -f option of
strace deals with them. If one of the forked child processes is giving
the error, strace with -f catches it.
$ strace -f -p pid
$ strace -f /path/to/program
In the first case, we use strace on an already running binary, and
-f traces the child processes created by the fork() system call as
and when they are created.
The second case is similar, except that here strace itself starts the
binary.
strace -o output.log writes the output to a log file. This is helpful
when dealing with programs that run for a while before giving an error.
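When -f and -o are combined, each line in the log starts with the pid of the process that made the call, so the log can tell us which child is the one in trouble. Here is a small sketch that counts failed calls per pid; failures_by_pid is a made-up name, and the counting rule (any call returning -1 with an errno name) is my own choice.

```shell
# failures_by_pid LOGFILE
# Expects a log written by "strace -f -o LOGFILE ...", where each line
# starts with the pid that made the call. Prints "pid count" pairs for
# calls that returned -1 with an errno name, sorted by pid.
failures_by_pid() {
    awk '/ = -1 E/ { n[$1]++ } END { for (p in n) print p, n[p] }' "$1" |
        sort -n
}

# Example use (output.log as written by: strace -f -o output.log -p pid):
#   failures_by_pid output.log
```

A child with far more failures than its siblings is a good place to start reading the log in detail.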
3.2 Using gdb for in-depth debugging.
gdb is a very powerful debugger, mostly used for debugging the core
files produced by programs which segfault. These days, however, most
programs won't dump core, either because our shells are set up that
way (ulimit can change the behavior with core files) or because the
program is designed not to. Even then, gdb can give clues about the
problem we are trying to troubleshoot. To start a program under gdb,
do
$ gdb -e /complete/path/to/program
Note that the command line arguments are not specified here. We specify
them later.
$ gdb -e /bin/su
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are welcome to change it and/or distribute copies of it under
certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty"
for details.
This GDB was configured as "i386-redhat-linux-gnu".
(gdb)
At this prompt, we give run followed by arguments to be passed to
the binary.
(gdb) run arg1 arg2
This runs the program straight through. To make it stop and wait at a
point, we should set a breakpoint before running it. Suppose we know
the name of a function up to which the program executed normally (from
the output of strace); then we do
(gdb) break function (or break line_number)
then
(gdb) run arguments
gdb runs the program up to that function and waits there. We can then
issue commands to step through the program line by line. Please note
that when we specify break function, the function should normally be
part of the main binary; a function that lives in a shared library may
not be known to gdb until the library has been loaded.
Here is a typical situation we have to debug: chrooting to a virtual
host on an Ensim server.
[root@host fst]# pwd
/home/virtual/site3/fst
[root@host fst]# chroot .
chroot: cannot execute /bin/bash: No such file or directory
[root@host fst]# ls -l bin/bash /bin/bash
-rwxr-xr-x 1 root root 541096 Apr 12 2002 /bin/bash
-rwxr-xr-x 19 root root 541096 Jan 18 2004 bin/bash
[root@host fst]#
As you can see, both bin/bash (the site's chrooted bin) and /bin/bash
are there. Still, the chroot program gives an error saying bash is not
there. These are the lines from strace:
execve("/bin/bash", ["/bin/bash", "-i"], [/* 18 vars */]) = -1 ENOENT (No such file or directory)
write(2, "/usr/sbin/chroot: ", 18/usr/sbin/chroot: ) = 18
write(2, "cannot execute /bin/bash", 24cannot execute /bin/bash) = 24
write(2, ": No such file or directory", 27: No such file or directory) = 27
Of course, this is not of much use, because it also says /bin/bash is
not found when it is there. So we move to gdb.
One point worth noting here is this line:
write(2, "cannot execute /bin/bash", 24cannot execute /bin/bash) = 24
The part we should watch carefully is 'cannot execute /bin/bash'. Only
after that does it say 'No such file or directory'.
(gdb) run .
Starting program: /usr/sbin/chroot .
(no debugging symbols found)...Breakpoint 1 at 0x400ee570
(no debugging symbols found)...
Breakpoint 1, 0x400ee570 in chroot () from /lib/libc.so.6
(gdb) next
Single stepping until exit from function chroot,
which has no line number information.
0x08048cca in chroot ()
(gdb) next
Single stepping until exit from function chroot,
which has no line number information.
/usr/sbin/chroot: cannot execute /bin/bash: No such file or directory
Program exited with code 01.
Here also we don't have much luck, but we know it is something related
to library files. So we use the tool 'ldd' to find out more about the
libraries and their linkage with the binaries.
3.3 Using ldd for finding library dependencies of a binary.
We all know that Linux binaries are mostly dynamically linked and depend
on a lot of shared libraries to work. The ldd command lists the library
files on which a binary depends.
[root@host fst]# ldd bin/bash
libtermcap.so.2 => /lib/libtermcap.so.2 (0x40018000)
libdl.so.2 => /lib/libdl.so.2 (0x4001d000)
libc.so.6 => /lib/libc.so.6 (0x40020000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
We used ldd on bin/bash and not /bin/bash since chroot would be using
bin/bash (remember we are inside the directory to be chrooted). The
command listed all the library files on which bin/bash depends.
Now we check these library files one by one.
[root@host fst]# ls -l lib/libtermcap.so.2
ls: lib/libtermcap.so.2: No such file or directory
[root@host fst]# ls -l lib/libdl.so.2
ls: lib/libdl.so.2: No such file or directory
[root@host fst]# ls -l lib/libc.so.6
ls: lib/libc.so.6: No such file or directory
[root@host fst]# ls -l lib/ld-linux.so.2
ls: lib/ld-linux.so.2: No such file or directory
Though the needed binary is there, its shared libraries are missing!
If they were missing from the server-wide /lib or /usr/lib, we would
need to find the corresponding packages and install them. In Ensim's
case, they are all linked to their TEMPLATE copies. So we make the
required hard links now.
After making the hard links, and also the required soft links (library
files have a lot of links pointing to the latest version), we try chroot
once more.
[root@host fst]# chroot .
bash-2.05a#
Now, all is perfect!
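The one-by-one library check we did with ls can be scripted for the next time. The following is a sketch only: missing_in_chroot is a made-up name, and it assumes ldd prints the resolved path either after "=>" or as the first field on the line, as in the output shown earlier.

```shell
# missing_in_chroot ROOT
# Reads ldd output on stdin and prints every library path that does not
# exist under ROOT (the directory we intend to chroot into).
missing_in_chroot() {
    root=$1
    awk '$2 == "=>" && $3 ~ /^\// { print $3; next }
         $1 ~ /^\//               { print $1 }' |
    while read -r lib; do
        [ -e "$root$lib" ] || echo "$lib"
    done
}

# Example use, from inside the directory to be chrooted:
#   ldd bin/bash | missing_in_chroot .
```

An empty result means every library that ldd reported has a counterpart inside the chroot. A dangling symlink is reported as missing too, since the test follows links.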
Up to this point, it was all about doing it ourselves. We all know
that no one is perfect or complete. If we have done all we can and
still don't have a solution, then hope for the best: we may not be
the only person with this problem.
4 Google is your friend
The Internet is a vast collection of information, and search engines
like Google help us locate the needed information in a short time. I
personally consider Google the best search engine. Still, there are
some problems. Google is not an intelligent robot; it can't guess what
is in your head or what you are looking for. You must be able to
present your question in the 'most obvious way' without losing the
context information. That is where 'effective googling' becomes
important.
Google has a lot of keywords that can fine-tune our search. For example,
to restrict a search to a single site, we can specify site:domain_name
along with the search terms. To search for a phrase of more than one
word, we can enclose it in quotes. Google normally removes common words
like 'and', 'or', 'when' etc. from the search pattern. To forcefully
include them in the search, use a '+' just before the word
(+word_to_be_included).
This link may be helpful.
http://www.google.com/help/index.html
At times, searching with the exact error message won't help. Then we
have to work out what the 'real problem' is and search for that.
5 Conclusion
Though articles like this can give guidelines, experience does matter
here. The more experience you have working with Linux machines, the
faster you will get to a solution. The difference between experienced
people and others is the direction of thinking. Knowing more about the
system means you can more easily guess the problem areas and
concentrate your thinking there. That said, there are instances where
none of the above steps work; then we are probably the first to hit
such an error. In this case, someone who regularly deals with the
machine has to backtrack through his/her recent tasks on it. Any newly
installed software, configuration changes or anything like that could
be the root cause.
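Part of that backtracking can be helped along by asking the filesystem what changed recently. This is a minimal sketch, assuming file modification times have not been reset; recent_changes is a made-up name.

```shell
# recent_changes DIR [DAYS]
# Lists regular files under DIR that were modified within the last DAYS
# days (default 1), sorted by name.
recent_changes() {
    find "$1" -type f -mtime "-${2:-1}" 2>/dev/null | LC_ALL=C sort
}

# Example use: configuration files touched in the last two days
#   recent_changes /etc 2
```

On rpm-based systems, rpm -qa --last lists packages by install time, which answers the "any new software installed?" half of the question.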
About the author: Jemshad OK worked for close to 3 years at Bobcares.com, a tech support company for web hosts and ISPs. He now works at Yahoo.