Debugging using gdb - A walkthrough
To illustrate the debugging process, there are C and Fortran example codes at the end of the tutorial that include both a floating point error and a segmentation fault. These examples are trivial, and are simply intended to show how easy it is to use the debugger. Note that the behaviour of the debugger is the same regardless of the language one is using, so we'll just show the C example in the walk-through that follows.
first bug: an FPE
First, to illustrate what happens when the code is run as is:
gcc bugs.c
./a.out
Floating point exception
Notice the Floating point exception message, and the fact that the program exited. To debug it in gdb, first compile with debugging symbols (-g) and optimization disabled (-O0):
gcc -Wall -O0 -g bugs.c
Now start the debugger, specifying the program we want to debug:
gdb a.out
At this point, the program will be loaded, but is not running, so start it:
(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out
Program received signal SIGFPE, Arithmetic exception.
0x00000000004004db in divide (d=0, e=1) at /nar_sfs/work/snuser/bugs/bugs.c:5
5 printf("%f\n",e/d);
Note that the debugger stops at the FPE and shows which function it was in, the values of the input arguments, the line number in the source file where the problem occurred, and the line itself. In this case the output is sufficient to diagnose the problem: e/d is undefined because the denominator d is zero. One can also look at a stack trace to see which functions have been called up to this point:
(gdb) where
#0 0x00000000004004db in divide (d=0, e=1) at /nar_sfs/work/snuser/bugs/bugs.c:5
#1 0x00000000004005e7 in main (argc=1, argv=0x7fbfffeab8) at /nar_sfs/work/snuser/bugs/bugs.c:24
An important caveat concerning the stack trace is that the debugger may display a deep stack (i.e. a long list of functions that have been entered), indicating a problem triggered inside a system library. While the system library function is the last function that was executed before the program failed, it is unlikely that there is actually a bug in the system library. One should trace back through the stack to the last call from the program into the library and inspect the arguments that were given to the library function to ensure that they are sensible. Typically, errors inside system libraries occur because the library functions were called with incorrect arguments.
In addition to the stack trace, one may look at the source code file, centered around a particular line:
(gdb) l 5
1 #include <stdio.h>
2
3 void divide(float d, float e)
4 {
5 printf("%f\n",e/d);
6 }
7
8 void arrayq(float f[], int q)
9 {
10 printf("%f\n",f[q]);
One can also print the values of variables in the current frame:
(gdb) p d
$1 = 0
(gdb) p e
$2 = 1
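The up command moves one frame up the stack, into the calling function, so that the caller's variables can be inspected as well: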
(gdb) up
#1 0x00000000004005e7 in main (argc=1, argv=0x7fbfffeab8) at /nar_sfs/work/snuser/bugs/bugs.c:24
24 divide(a,b);
(gdb) p a
$3 = 0
(gdb) p b
$4 = 1
When one is finished, it's easy to exit:
(gdb) q
The program is running. Exit anyway? (y or n) y
second bug: a segmentation fault
Now, to illustrate a segfault, change the denominator to be non-zero, e.g. a=4.0. Compile the modified code and run it to see what happens:
./a.out
0.250000
Segmentation fault
Notice the Segmentation fault message, and the fact that the job exited with code 139 (128 plus 11, the signal number of SIGSEGV on Linux). To debug it in gdb:
gdb a.out
(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out
0.250000
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffe980, q=12000000) at /nar_sfs/work/snuser/bugs/bugs.c:10
10 printf("%f\n",f[q]);
(gdb) p f
$1 = (float *) 0x7fbfffe980
(gdb) p f[1]
$2 = 1
(gdb) p f[9]
$3 = 9
(gdb) p f[q]
Cannot access memory at address 0x7fc2dc5580
(gdb) p f[11]
$4 = 0
(gdb) p f[1000]
$5 = 7.03598541e+22
(gdb) p f[10000]
Cannot access memory at address 0x7fc00085c0
using core files
If a program uses a lot of memory, does not trigger an error condition in a reproducible manner, or takes a long time before it reaches the error condition, then it shouldn't be debugged interactively (at least in the first instance). In these situations one should submit the debugging-instrumented program to the cluster as a compute job so that it will produce a core file when it crashes. A core file contains the state of the program at the time it crashed. One can then load this file into the debugger to inspect the state and determine what caused the problem.
By default your Linux environment may not be configured to produce core files. To enable them when using the bash shell (the default shell), one must set the core size limit to a non-zero value. Setting it to unlimited should suffice, e.g.
ulimit -c unlimited
gcc -g bugs.c
./a.out
0.250000
Segmentation fault (core dumped)
One can then load the core file into gdb by passing it as an additional argument after the program name, e.g.
gdb a.out core.10966
#0 0x0000000000400514 in arrayq (f=0x7fbfffdfc0, q=12000000) at /home/merz/bugs/bugs.c:10
10 printf("%f\n",f[q]);
(gdb) where
#0 0x0000000000400514 in arrayq (f=0x7fbfffdfc0, q=12000000) at /home/merz/bugs/bugs.c:10
#1 0x00000000004005f3 in main (argc=1, argv=0x7fbfffe0f8) at /home/merz/bugs/bugs.c:26
(gdb) q
Note that in this case one does not need to run the program in the debugger - it will simply inspect the state of the core file and use the debugging-instrumented binary to display the type of error and where it occurs. One may then run the gdb where command to get the stack backtrace, etc., to further identify the problem.
As long as one sets the core size limit with the ulimit command before submitting a job, and submits the job with the sqsub -f permitcoredump flag, this environment setting should propagate to the job and the program should generate a core file when it crashes. Keep in mind that the setting does not persist between logins, so either put it in your shell configuration file (e.g. ~/.bash_profile) or run it each time you log in to a system if you want your programs to produce a core when they crash.
debugging interactively
If you need to view the state of the program leading up to the crash, perhaps repeatedly, then a core file won't suffice and it is suggested that one submit this as an interactive job (avoid running this on the login node!). If possible, one should try to resume the program from a checkpoint that is near to the crash to avoid waiting a long time while the program reaches the erroneous state.
One can start gdb as follows:
gdb ./a.out
(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffd740, q=12000000)
at /nar_sfs/work/snuser/bugs/bugs.c:10
10 printf("%f\n",f[q]);
(gdb) q
Examples
FORTRAN CODE: bugs.f

      program bugs
      implicit none
      real a,b
      real c(10)
      integer p
      a=0.0
      b=1.0
      do p=1,10
        c(p)=p
      enddo
      p=12000000
      call divide(a,b)
      call arrayq(c,p)
      end program

      subroutine divide(d,e)
      implicit none
      real d,e
      print *,e/d
      end subroutine

      subroutine arrayq(f,g)
      implicit none
      real f(10)
      integer g
      print *,f(g)
      end subroutine

C CODE: bugs.c

#include <stdio.h>

void divide(float d, float e)
{
    printf("%f\n",e/d);
}

void arrayq(float f[], int q)
{
    printf("%f\n",f[q]);
}

int main(int argc, char **argv)
{
    float a,b;
    float c[10];
    int p;
    a=0.0;
    b=1.0;
    for (p=0;p<10;p++) {
        c[p]=(float)p;
    }
    p=12000000;
    divide(a,b);
    arrayq(c,p);
    return(0);
}