Saturday, February 23, 2013

Memory Debugging
A Crash course

Getting rid of stack errors, buffer overruns, exceptions, segmentation faults, memory leaks, out-of-bounds pointer accesses and similar bugs in code that works with pointers, files and strings is not straightforward. These bugs are hard to avoid while programming and can cause weird behaviour that differs from machine to machine. Most of the time they can be removed cleanly using simple debugging tools. I have been using "valgrind", a simple, intuitive and powerful tool for debugging errors in memory, data and file operations. These are some simple steps that I have used and found greatly beneficial while debugging my code (a numerical solver for parallel architectures).

1. Add checks to your code wherever they are needed (or everywhere) using "assert". It is good practice to insert assert lines after pointer allocation, after opening a file, and for constraints on the input data (like x > 0, when read from a file), etc.

2. Compile your code with a debug flag "-g" to include the debug symbols required.

3. Run your code under "valgrind" as follows. Valgrind reports the possible errors and suggests further flags that will help you locate the source of each error.
  • valgrind ./a.out   [ simple run; valgrind will suggest further flags to rerun with ]
  • valgrind -v --leak-check=full --track-origins=yes  ./a.out [ full leak check (--leak-check=full), trace the origin of uninitialised values (--track-origins=yes), and verbose output (-v) ]
This will help you get started with proper memory debugging. More help can be found on valgrind's website.

Valgrind can also be used with parallel applications (for example, ones using MPI or OpenMP).
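The checks from step 1 can be sketched in C roughly as follows. This is only a minimal illustration: the helper name read_positive_values and the file layout (whitespace-separated numbers) are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read n positive values from a text file.
   The caller owns the returned array and must free() it. */
double *read_positive_values(const char *path, size_t n)
{
    assert(n > 0);

    double *x = malloc(n * sizeof *x);
    assert(x != NULL);                  /* check after pointer allocation */

    FILE *fp = fopen(path, "r");
    assert(fp != NULL);                 /* check after opening the file */

    for (size_t i = 0; i < n; ++i) {
        int got = fscanf(fp, "%lf", &x[i]);
        assert(got == 1);               /* a value was actually read */
        assert(x[i] > 0.0);             /* constraint on the input data */
    }

    fclose(fp);
    return x;
}
```

Keep in mind that assert is compiled out when NDEBUG is defined, so these checks only protect debug builds (the same builds you would compile with "-g" and run under valgrind).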

Monday, February 18, 2013

Basic Metrics of a CUDA application

After developing a CUDA application, the costly routines (in terms of runtime) need to be tuned or optimised for better performance. The primary metrics for identifying whether a kernel is memory intensive or computationally intensive are the L1 / L2 cache hit rates, the data throughput, and the computational throughput. This article discusses the data and computational throughput.

Let's understand this using a simple example:
The kernel SUM adds two vectors a and b and stores the result in c.

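__global__ void SUM(const double *a, const double *b, double *c)
{
   // global thread index (reduces to blockIdx.x with one thread per block)
   int tid = blockIdx.x * blockDim.x + threadIdx.x;

   // N is the problem size
   if (tid < N)
       c[tid] = a[tid] + b[tid];
}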

Data throughput is the total data transferred (read + write) by a CUDA kernel per second. It indicates the transfer rate the application is able to reach. One can compare it to the theoretical memory bandwidth of the CUDA device for reference.

From the example kernel SUM, each thread reads 2 doubles (a & b) and writes 1 double (c). Hence, per thread,

Read = 2 doubles * 8 bytes = 16 Bytes
Write = 1 double * 8 bytes =  8 Bytes

With N threads doing this work, the bandwidth or the data throughput can be calculated as:

Effective Bandwidth = N * (Read + Write) / (time * 10^9)  GBytes / sec

where time is the runtime of the kernel in seconds. The runtime can be obtained directly from nvprof (with the --print-gpu-trace or --print-api-trace option).
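As a worked example, the bandwidth arithmetic for SUM can be written as a small C helper. This is a sketch under stated assumptions: the function name is made up for illustration, the kernel is taken to touch one element per thread, and the runtime is whatever you measured with the profiler.

```c
#include <stddef.h>

/* Bytes moved per thread by SUM: 2 doubles read + 1 double written. */
#define SUM_BYTES_PER_THREAD (3 * sizeof(double))

/* Hypothetical helper: effective bandwidth in GBytes/sec for a SUM kernel
   that processed n elements in `seconds` seconds. */
double effective_bandwidth_gbs(size_t n, double seconds)
{
    double total_bytes = (double)n * (double)SUM_BYTES_PER_THREAD;
    return total_bytes / (seconds * 1e9);
}
```

For instance, one million elements processed in one millisecond move 24 * 10^6 bytes, which works out to about 24 GBytes / sec.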

Computational throughput is the total number of calculations (i.e. floating point operations - flop) performed by the device per second. A flop is an addition, subtraction, multiplication or division. The computational throughput indicates the flop rate the application is able to sustain. One can compare it to the theoretical peak throughput of the CUDA device for reference.

For the simple kernel SUM, we have 1 flop (1 addition) per thread, i.e. N flops in total. Hence,

Computational Throughput = Total flops / (time * 10^9)  Gflop / sec

where time is again the runtime of the kernel in seconds, obtained from nvprof as above.
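The same calculation as a small C helper, again only as a sketch: the function name and the idea of passing a per-thread flop count are assumptions for illustration, and the flop count itself has to be worked out by hand for each kernel.

```c
#include <stddef.h>

/* Hypothetical helper: computational throughput in Gflop/sec for a kernel
   in which each of n threads performs flops_per_thread operations. */
double computational_throughput_gflops(size_t n, double flops_per_thread,
                                       double seconds)
{
    double total_flops = (double)n * flops_per_thread;
    return total_flops / (seconds * 1e9);
}
```

For SUM, flops_per_thread is 1, so one million elements in one millisecond gives about 1 Gflop / sec.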

Also, it is not straightforward to count the flops of a kernel, especially for long kernels or ones that call mathematical functions like cos, sin, etc., which cost more than one flop each.

L1 / L2 hit rates are measured directly from the visual profiler. They denote the percentage of the data requests to the L1 / L2 caches that result in hits.


Tuesday, February 12, 2013

Compilers, Debuggers & Profilers

Part A : Compilers

1. GNU compilers [ gcc / g++ / gfortran ] - free
2. Intel  [ icc / icpc / ifort ]- free for linux

Part B : Debuggers

1. GNU debugger [ gdb ] - free
2. Intel debugger [ idb ] - free for linux

Part C : Profilers

1. Valgrind (command line), Valkyrie (GUI)
Purpose : memory debugging, memory leak detection, and profiling

2. nvprof (command line),  computeprof / nvvp (GUI)
Purpose : profiling
Languages : CUDA

Some available lists;

A list of memory debuggers ;

CUDA Enabled Devices - Metrics Reference
(Using computeprof / nvvp)

This section contains detailed descriptions of the metrics that can be collected by the Visual Profiler. These metrics can be collected only from within the Visual Profiler. The command-line profiler and nvprof can collect low-level events but are not capable of collecting metrics.

Capability 2.x Metrics
Metric Name | Description | Formula
sm_efficiency | The ratio of the time at least one warp is active on a multiprocessor to the total time | 100 * (active_cycles / #SM) / elapsed_clocks
achieved_occupancy | Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor | 100 * (active_warps / active_cycles) / max_warps_per_sm
ipc | Instructions executed per cycle | (inst_executed / #SM) / elapsed_clocks
branch_efficiency | Ratio of non-divergent branches to total branches | 100 * (branch - divergent_branch) / branch
warp_execution_efficiency | Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor | thread_inst_executed / (inst_executed * warp_size)
inst_replay_overhead | Percentage of instruction issues due to memory replays | 100 * (inst_issued - inst_executed) / inst_issued
shared_replay_overhead | Percentage of instruction issues due to replays for shared memory conflicts | 100 * l1_shared_bank_conflict / inst_issued
global_cache_replay_overhead | Percentage of instruction issues due to replays for global memory cache misses | 100 * global_load_miss / inst_issued
local_replay_overhead | Percentage of instruction issues due to replays for local memory cache misses | 100 * (local_load_miss + local_store_miss) / inst_issued
gld_efficiency | Ratio of requested global memory load throughput to actual global memory load throughput | 100 * gld_requested_throughput / gld_throughput
gst_efficiency | Ratio of requested global memory store throughput to actual global memory store throughput | 100 * gst_requested_throughput / gst_throughput
gld_throughput | Global memory load throughput | ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gst_throughput | Global memory store throughput | ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gld_requested_throughput | Requested global memory load throughput | (gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime
gst_requested_throughput | Requested global memory store throughput | (gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime
dram_read_throughput | DRAM read throughput | (fb_subp0_read + fb_subp1_read) * 32 / gputime
dram_write_throughput | DRAM write throughput | (fb_subp0_write + fb_subp1_write) * 32 / gputime
l1_cache_global_hit_rate | Hit rate in L1 cache for global loads | 100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss)
l1_cache_local_hit_rate | Hit rate in L1 cache for local loads and stores | 100 * (l1_local_ld_hit + l1_local_st_hit) / (l1_local_ld_hit + l1_local_ld_miss + l1_local_st_hit + l1_local_st_miss)
tex_cache_hit_rate | Texture cache hit rate | 100 * (tex0_cache_sector_queries - tex0_cache_misses) / tex0_cache_sector_queries
tex_cache_throughput | Texture cache throughput | tex_cache_sector_queries * 32 / gputime
sm_efficiency_instance | The ratio of the time at least one warp is active on a multiprocessor to the total time | 100 * active_cycles / elapsed_clocks
ipc_instance | Instructions executed per cycle | inst_executed / elapsed_clocks
l2_l1_read_hit_rate | Hit rate at L2 cache for read requests from L1 cache | 100 * (l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors) / (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries)
l2_tex_read_hit_rate | Hit rate at L2 cache for read requests from texture cache | 100 * (l2_subp0_read_tex_hit_sectors + l2_subp1_read_tex_hit_sectors) / (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries)
l2_l1_read_throughput | Memory read throughput at L2 cache for read requests from L1 cache | (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries) * 32 / gputime
l2_tex_read_throughput | Memory read throughput at L2 cache for read requests from texture cache | (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries) * 32 / gputime
local_memory_overhead | Ratio of local memory traffic to total memory traffic between L1 and L2 | 100 * (2 * l1_local_load_miss * 128) / ((l2_subp0_read_requests + l2_subp1_read_requests + l2_subp0_write_requests + l2_subp1_write_requests) * 32)
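
To make the formulas concrete, here is a minimal C sketch that evaluates three of them (sm_efficiency, achieved_occupancy and branch_efficiency) from raw counter values. The struct, its field selection and the function signatures are hypothetical and exist only for this illustration; the counter values themselves would come from the profiler.

```c
/* Hypothetical container for the raw event counters used below. */
typedef struct {
    double active_cycles;     /* summed over all multiprocessors */
    double active_warps;
    double elapsed_clocks;
    double branch;
    double divergent_branch;
} counters_t;

/* 100 * (active_cycles / #SM) / elapsed_clocks */
double sm_efficiency(const counters_t *c, int num_sm)
{
    return 100.0 * (c->active_cycles / num_sm) / c->elapsed_clocks;
}

/* 100 * (active_warps / active_cycles) / max_warps_per_sm */
double achieved_occupancy(const counters_t *c, int max_warps_per_sm)
{
    return 100.0 * (c->active_warps / c->active_cycles) / max_warps_per_sm;
}

/* 100 * (branch - divergent_branch) / branch */
double branch_efficiency(const counters_t *c)
{
    return 100.0 * (c->branch - c->divergent_branch) / c->branch;
}
```

As an example, 100 branches of which 10 diverge give a branch_efficiency of 90.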

Fortran binding in C

Many useful libraries and routines are written in FORTRAN, and if you intend to use them in your C code, here are 2 ways to do so. Warning: when working with multi-dimensional arrays, account for the row-major storage of C and the column-major storage of FORTRAN. Also, take care that the data types are compatible across the two languages.

Part A - Using .f / .f90 directly

1. Compile the .f / .f90 files to a .o object file using:

FC -c -O filename.f 
This generates the filename.o object file, whose routines can be called as C functions with little additional effort. Here FC stands for your Fortran compiler (e.g. gfortran or ifort).

2. Declare the function before your main() and pass the pointers and data by reference as suitable.
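
The row-major / column-major warning above can be handled with a small layout conversion before the call. The following is a minimal C sketch, not a complete binding: the helper to_column_major is a made-up name, and the Fortran routine "scale" mentioned in the comment is hypothetical and only shows what a declaration might look like.

```c
#include <stddef.h>

/* Copy a rows x cols matrix from C (row-major) layout into the
   column-major layout that a Fortran routine expects. */
void to_column_major(const double *row_major, double *col_major,
                     size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            col_major[j * rows + i] = row_major[i * cols + j];
}

/* Hypothetical Fortran routine, shown only as a comment because the
   symbol name depends on the compiler (gfortran lowercases the name
   and appends an underscore by default) and every argument is passed
   by reference:

       subroutine scale(a, rows, cols, factor)

       extern void scale_(double *a, int *rows, int *cols, double *factor);
*/
```

For a 2 x 3 matrix stored in C as {1, 2, 3, 4, 5, 6}, the converted buffer holds {1, 4, 2, 5, 3, 6}, which is what the Fortran side reads as the same matrix.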

Part B - Convert your .f / .f90 to .c files using f2c converter

Perhaps an easier way (if you know what you are doing!)

1. Convert the simple .f / .f90 routines to C using f2c

2. Include the f2c header file and link the f2c library when compiling the converted code.

Always check the output for consistency after compiling and running the code.

Sunday, February 10, 2013

Profiling a CUDA application 

Tools required : NVIDIA's profilers ( nvprof on the command line, computeprof / nvvp for the Visual Profiler )

Firstly, identify the algorithm's 'heavy' areas, i.e. the most time-consuming routines or kernels.

Then prepare the code for profiling:

1. Include this header
cuda_profiler_api.h (or cudaProfiler.h for the driver API)

2. Add functions to start and stop profile data collection.

cudaProfilerStart() is used to start profiling 
cudaProfilerStop() is used to stop profiling

(using the CUDA driver API, you get the same functionality with cuProfilerStart() and cuProfilerStop()).

3. When using the start and stop functions, you also need to instruct the profiling tool to disable profiling at the start of the application. For nvprof you do this with the --profile-from-start-off flag. For the Visual Profiler you use the "Start execution with profiling enabled" checkbox in the Settings View.

4. Flush Profile Data
To reduce profiling overhead, the profiling tools collect and record profile information into internal buffers. These buffers are then flushed asynchronously to disk with low priority to avoid perturbing application behavior. To avoid losing profile information that has not yet been flushed, the application being profiled should call cudaDeviceReset() before exiting. Doing so forces all buffered profile information to be flushed.

5. Select the metrics required to be displayed and analyse the application behaviour.

Friday, February 8, 2013

Sparse (& dense) matrix libraries
A general survey

Useful for BLAS- and LAPACK-type implementations (but not limited to them!) and for iterative methods with support for compressed sparse matrix formats. FORTRAN flavours can be found analogously. A few libraries also include parallel features.

Intel MKL (highly recommended ! )



C++ /

Linear solvers (not sure if sparse functionality is available)

Useful routines (F77)

A comprehensive list of Numerical libraries ......... for daily use

A library used is a hundred bugs eliminated


  • ALGLIB is an open source numerical analysis library which may be used from C++, C#, FreePascal, Delphi, VBA.
  • IMSL Numerical Libraries are libraries of numerical analysis functionality implemented in standard programming languages like C, Java, C# .NET, Fortran, and Python.
  • The NAG Library is a collection of mathematical and statistical routines for multiple programming languages (C, C++, Fortran, Visual Basic, Java and C#) and packages (MATLAB, Excel, R, LabVIEW).



  • Armadillo is a C++ linear algebra library (matrix and vector maths), aiming towards a good balance between speed and ease of use. It employs template classes, and has optional links to BLAS and LAPACK.
  • Blitz++ is a high-performance vector mathematics library written in C++.
  • Eigen is a vector mathematics library with performance comparable with Intel's Math Kernel Library.
  • Hermes Project: C++/Python library for rapid prototyping of space- and space-time adaptive hp-FEM solvers.
  • IML++ is a C++ library for solving linear systems of equations, capable of dealing with dense, sparse, and distributed matrices.
  • IT++ is a C++ library for linear algebra (matrices and vectors), signal processing and communications. Functionality similar to MATLAB and Octave.
  • LAPACK++, a C++ wrapper library for LAPACK and BLAS
  • LinBox is a C++ template library for doing exact computational linear algebra.
  • MTL4 is a generic C++ template library providing sparse and dense BLAS functionality. MTL4 establishes an intuitive interface (similar to MATLAB) and broad applicability thanks to Generic programming.
  • NTL is a C++ library for number theory.
  • GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License. The library provides a wide range of mathematical routines such as random number generators, etc. There are over 1000 functions in total with an extensive test suite.

.NET Framework languages C#, F# and VB.NET

  • ILNumerics.Net: high performance, typesafe numerical array classes and functions for general math, FFT and linear algebra; targets .NET/mono, 32 & 64 bit, with script-like syntax in C#, 2D & 3D plot controls and efficient memory management.
  • IMSL Numerical Libraries for .NET is a set of mathematical, statistical, data mining, financial and charting classes written in C#.
  • Measurement Studio is an integrated suite of UI controls and class libraries for use in developing test and measurement applications. The analysis class libraries provide various digital signal processing, signal filtering, signal generation, peak detection, and other general mathematical functionality.
  • NMath by CenterSpace Software: numerical component libraries for the .NET platform, including signal processing (FFT) classes, a linear algebra (LAPACK & BLAS) framework, and a statistics package.
  • SuanShu by Numerical Method Inc. is a large collection of numerical algorithms including linear algebra, (advanced) optimization, interpolation, Markov model, principal component analysis, time series analysis, hypothesis testing, regressions, statistics, and ordinary and partial differential equation solvers.
  • NLinear is a generic linear algebra toolkit in C# compatible with Silverlight.

