Saturday, August 17, 2013


Debugging using gdb - A walkthrough


To illustrate the debugging process, there are C and Fortran example codes at the end of the tutorial that include both a floating point error and a segmentation fault. These examples are trivial, and are simply intended to show how easy it is to use the debugger. Note that the behaviour of the debugger is the same regardless of the language one is using, so we'll just show the C example in the walk-through that follows.

first bug: an FPE

First, to illustrate what happens when the code is run as is:

gcc bugs.c
./a.out
Floating point exception
Notice the Floating point exception message, and the fact that the program exited. To debug it in gdb, first compile with debugging information (-g) and optimization turned off (-O0):

gcc -Wall -O0 -g bugs.c


Now start the debugger, specifying the program we want to debug:

gdb a.out


At this point, the program will be loaded, but is not running, so start it:

(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out
Program received signal SIGFPE, Arithmetic exception.
0x00000000004004db in divide (d=0, e=1) at /nar_sfs/work/snuser/bugs/bugs.c:5
5 printf("%f\n",e/d);


Note that the debugger will stop at the FPE, and show which function/routine it was in, what values the input arguments had, the line number of the source file where the problem occurred, and the actual line of the file. In this case this output is sufficient to diagnose the problem: clearly e/d is undefined since the denominator is zero. One can also look at a stack trace, to see what has been called up to this point:

(gdb) where
#0 0x00000000004004db in divide (d=0, e=1) at /nar_sfs/work/snuser/bugs/bugs.c:5
#1 0x00000000004005e7 in main (argc=1, argv=0x7fbfffeab8) at /nar_sfs/work/snuser/bugs/bugs.c:24


An important caveat concerning the stack trace is that the debugger may display a deep stack (ie. a long list of functions that have been entered), indicating a problem triggered inside a system library. While the system library function is the last function that was executed before the program failed, it is unlikely that there is actually a bug in the system library. One should trace back through the stack to the last call from the program into the library and inspect the arguments that were given to the library function to ensure that they are sensible - typically errors in system libraries occur when the library functions are called with incorrect arguments.


In addition to the stack trace, one may look at the source code file, centered around a particular line:

(gdb) l 5
1 #include <stdio.h>
2
3 void divide(float d, float e)
4 {
5 printf("%f\n",e/d);
6 }
7
8 void arrayq(float f[], int q)
9 {
10 printf("%f\n",f[q]);

One can inspect the values of different variables in the current level of the stack:

(gdb) p d
$1 = 0
(gdb) p e
$2 = 1

Or one can go "up" the stack to look at values in the calling function/routine:

(gdb) up
#1 0x00000000004005e7 in main (argc=1, argv=0x7fbfffeab8) at /nar_sfs/work/snuser/bugs/bugs.c:24
24 divide(a,b);
(gdb) p a
$3 = 0
(gdb) p b
$4 = 1
When one is finished, it's easy to exit:
(gdb) q
The program is running. Exit anyway? (y or n) y
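
Once the debugger has shown that the denominator d is zero, one possible fix is to check it before dividing. The following is only a sketch: the function name divide_checked and its return convention are hypothetical and are not part of bugs.c.

```c
#include <stdio.h>

/* Hypothetical variant of divide() from bugs.c that refuses to divide by zero.
 * Returns 0 and stores e/d in *result on success, -1 if d is zero. */
int divide_checked(float d, float e, float *result)
{
    if (d == 0.0f)
    {
        fprintf(stderr, "divide_checked: denominator is zero\n");
        return -1;
    }
    *result = e / d;
    return 0;
}
```

In main one would then call divide_checked(a, b, &r) and only print r when the return value is 0.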


second bug: a segmentation fault

Now, to illustrate a segfault, change the denominator to be non-zero, eg. a=4.0. Compile the modified code, and run it to see what happens:

./a.out
0.250000
Segmentation fault



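Notice the Segmentation fault message, and the fact that the job exited with code 139 (128 plus 11, the signal number of SIGSEGV; in bash the exit code can be checked with echo $? right after the run). To debug it in gdb: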

gdb a.out
(gdb) r
Starting program: /nar_sfs/work/snuser/bugs/a.out
0.250000
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffe980, q=12000000) at /nar_sfs/work/snuser/bugs/bugs.c:10
10 printf("%f\n",f[q]);

Note that the program stops automatically when it hits the segmentation fault, and shows you which function it is in, the values of the input variables, and the line in the source. One can then try printing out the values of the array, to see why it would have a problem:

(gdb) p f
$1 = (float *) 0x7fbfffe980
(gdb) p f[1]
$2 = 1
(gdb) p f[9]
$3 = 9
(gdb) p f[q]
Cannot access memory at address 0x7fc2dc5580


So it is clear that the program is trying to access something it shouldn't be. Note that this is lucky. Consider what happens if one accidentally accesses something just outside the array bounds:

(gdb) p f[11]
$4 = 0
(gdb) p f[1000]
$5 = 7.03598541e+22
(gdb) p f[10000]
Cannot access memory at address 0x7fc00085c0

In the first two cases the access would have returned a number and the program would have carried on, but the results of the program would have been wrong. So one can't count on an out-of-bounds array access to always result in a segmentation fault. Often segmentation faults occur when there are problems with pointers, since they may point to inaccessible addresses, or when a program tries to use too much memory. Using a debugger greatly helps in identifying these sorts of problems.
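
Since an out-of-bounds access may silently return garbage, one defensive option is to pass the array length along and check the index explicitly. This is a minimal sketch only: arrayq_checked and its extra parameter n are hypothetical, and the original arrayq() in bugs.c takes no length argument.

```c
#include <stdio.h>

/* Hypothetical bounds-checked variant of arrayq() from bugs.c.
 * n is the number of elements in f (an assumption made for this sketch).
 * Returns 0 and prints f[q] if q is in range, -1 otherwise. */
int arrayq_checked(const float f[], int n, int q)
{
    if (q < 0 || q >= n)
    {
        fprintf(stderr, "arrayq_checked: index %d outside [0, %d)\n", q, n);
        return -1;
    }
    printf("%f\n", f[q]);
    return 0;
}
```

With the array from bugs.c, a call such as arrayq_checked(c, 10, p) would reject both p=11 and p=12000000, whereas the unchecked version only crashes for the latter.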

using core files


If a program uses a lot of memory, does not trigger an error condition in a reproducible manner, or takes a long time before it reaches the error condition, then it shouldn't be debugged interactively (at least in the first instance). In these situations one should submit the debugging-instrumented program to the cluster as a compute job such that it will produce a core file when it crashes. A core file contains the state of the program at the time it crashed - one can then load this file into the debugger to inspect the state and determine what caused the problem.


By default your Linux environment may not be configured to produce core files. To enable core files when using the bash shell (the default shell), one must set the core limit to be non-zero. Setting it to unlimited should suffice, eg.

ulimit -c unlimited

then when one runs a program that crashes it should indicate that it has produced (dumped) a core file, eg.

gcc -g bugs.c
./a.out
0.250000
Segmentation fault (core dumped)

The core file should appear in the present working directory with a name of the form core.PID, where PID is the process id of the program instance that crashed. Note: for anything more complex than the examples provided in this tutorial you should submit this as a job to the cluster. In that case the core file will be placed in the working directory used by the job, but one must submit the job with the -f permitcoredump option specified to sqsub.


One can then load the core file into gdb by passing it as an additional argument, eg.

gdb a.out core.10966
#0 0x0000000000400514 in arrayq (f=0x7fbfffdfc0, q=12000000) at /home/merz/bugs/bugs.c:10
10 printf("%f\n",f[q]);
(gdb) where
#0 0x0000000000400514 in arrayq (f=0x7fbfffdfc0, q=12000000) at /home/merz/bugs/bugs.c:10
#1 0x00000000004005f3 in main (argc=1, argv=0x7fbfffe0f8) at /home/merz/bugs/bugs.c:26
(gdb) q


Note that in this case one does not need to run the program in the debugger - it will simply inspect the state of the core file and use the debugging-instrumented binary to display the type of error and where it occurs. One may then run the gdb where command to get the stack backtrace, etc., to further identify the problem.

As long as one sets their core size limit with the ulimit command before submitting their job, and submits their job with the sqsub -f permitcoredump flag, then this environment setting should propagate to their job and the program should generate a core. Keep in mind that this setting will not persist between logins, so you should either put it in your shell configuration file (eg. ~/.bash_profile) or run it any time you log into a system if you want your programs to produce a core when they crash.
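
As an alternative to relying on the shell setting, a program can raise its own core limit at startup using the POSIX getrlimit/setrlimit calls. The sketch below assumes a Linux or other POSIX system; the function name enable_core_dumps is hypothetical, and whether a core file is actually written still depends on the system and scheduler configuration described above.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit for this process.
 * Returns 0 on success, -1 on failure. If the hard limit is 0, core files
 * stay disabled, since raising the hard limit requires extra privileges. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
    {
        perror("getrlimit");
        return -1;
    }
    lim.rlim_cur = lim.rlim_max;   /* the soft limit may go up to the hard limit */
    if (setrlimit(RLIMIT_CORE, &lim) != 0)
    {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Calling enable_core_dumps() as the first statement in main has roughly the effect of ulimit -c set to the hard limit, but only for that one program.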

debugging interactively



If you need to view the state of the program leading up to the crash, perhaps repeatedly, then a core file won't suffice and it is suggested that one submit this as an interactive job (avoid running this on the login node!). If possible, one should try to resume the program from a checkpoint that is near to the crash to avoid waiting a long time while the program reaches the erroneous state.


One can start gdb as follows:

gdb ./a.out

One can then proceed to debug in the usual fashion:

r
(gdb) Starting program: /nar_sfs/work/snuser/bugs/a.out
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400514 in arrayq (f=0x7fbfffd740, q=12000000)
at /nar_sfs/work/snuser/bugs/bugs.c:10
10 printf("%f\n",f[q]);

When you exit the debugger the job will automatically terminate:

(gdb) q

Note: you may not see the (gdb) prompt, or it may appear out of order (as above), but you can proceed as though it were there.


Examples


FORTRAN CODE: bugs.f

 program bugs
     implicit none
     real a,b
     real c(10)
     integer p
     a=0.0
     b=1.0
     do p=1,10
         c(p)=p
     enddo
     p=12000000
     call divide(a,b)
     call arrayq(c,p)
 end program
 
 subroutine divide(d,e)
     implicit none
     real d,e
     print *,e/d
 end subroutine
 
 subroutine arrayq(f,g)
     implicit none
     real f(10)
     integer g
     print *,f(g)
 end subroutine

C CODE: bugs.c

 #include <stdio.h>
 
 void divide(float d, float e)
 {
     printf("%f\n",e/d);
 }
 
 void arrayq(float f[], int q)
 {
     printf("%f\n",f[q]);
 }
 
 int main(int argc, char **argv)
 {
     float a,b;
     float c[10];
     int p;
     a=0.0;
     b=1.0;
     for (p=0;p<10;p++)
     {
         c[p]=(float)p;
     }
     p=12000000;
     divide(a,b);
     arrayq(c,p);
     return(0);
 }




Wednesday, August 7, 2013



Matrix Market File Format & PETSc Bin


If you have developed a CFD solver, then you have most likely generated a sparse linear system that needs to be solved one way or another. Apart from the standard Intel or other libraries, it is worth considering the PETSc library. PETSc offers a wide range of capabilities, from linear solvers, preconditioners and eigenvalue analysis to CPU and GPU implementations.

Getting to know PETSc will take some time, but it is greatly beneficial for matrix and vector operations. The only requirement is to convert the system we have into a format PETSc understands; this approach is good for initial testing only. Here are two routines developed by P. Kumar (PETSc): one converts a matrix in COO format or Matrix Market file format to the PETSc binary matrix format, and the other converts vectors.

MatrixMarket_to_PetScBin.c

VectorMarket_to_PetScBin.c

README.TXT
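
For orientation, a Matrix Market file in coordinate (COO) format consists of a banner line, optional comment lines starting with %, a size line giving the number of rows, columns and nonzeros, and then one "row column value" line per entry with one-based indices. The sketch below reads such a file into plain C arrays. It is not the routine by P. Kumar mentioned above: the coo_matrix type and read_matrix_market function are hypothetical, only the "real general" variant is handled, and writing the PETSc binary file is left to PETSc itself.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One matrix in coordinate (COO) form: entry k is val[k] at (row[k], col[k]). */
typedef struct {
    int nrows, ncols, nnz;
    int *row, *col;      /* zero-based indices */
    double *val;
} coo_matrix;

/* Read a "matrix coordinate real general" Matrix Market stream.
 * Returns 0 on success, -1 on a malformed or unsupported file.
 * On success the caller frees m->row, m->col and m->val. */
int read_matrix_market(FILE *fp, coo_matrix *m)
{
    static const char banner[] = "%%MatrixMarket matrix coordinate real general";
    char line[1024];
    int k;

    /* The first line must be the banner (exact match only in this sketch). */
    if (fgets(line, sizeof line, fp) == NULL ||
        strncmp(line, banner, strlen(banner)) != 0)
        return -1;

    /* Skip comment lines (starting with '%') until the size line. */
    do {
        if (fgets(line, sizeof line, fp) == NULL)
            return -1;
    } while (line[0] == '%');

    if (sscanf(line, "%d %d %d", &m->nrows, &m->ncols, &m->nnz) != 3 || m->nnz < 1)
        return -1;

    m->row = malloc((size_t)m->nnz * sizeof *m->row);
    m->col = malloc((size_t)m->nnz * sizeof *m->col);
    m->val = malloc((size_t)m->nnz * sizeof *m->val);
    if (m->row == NULL || m->col == NULL || m->val == NULL)
        goto fail;

    for (k = 0; k < m->nnz; k++)
    {
        int i, j;
        if (fscanf(fp, "%d %d %lf", &i, &j, &m->val[k]) != 3 ||
            i < 1 || i > m->nrows || j < 1 || j > m->ncols)
            goto fail;
        m->row[k] = i - 1;   /* the file uses one-based indices */
        m->col[k] = j - 1;
    }
    return 0;

fail:
    free(m->row);
    free(m->col);
    free(m->val);
    return -1;
}
```

The resulting row, column and value arrays are the kind of input a conversion routine would then hand to the library that writes the binary file.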
