February 2013 ~ CFD and Coffee .......

Saturday, February 23, 2013

Memory debugging - Crash course

Memory Debugging

A Crash course

Getting rid of stack errors, buffer overruns, exceptions, segmentation faults, memory leaks, out-of-bounds pointers and similar bugs in code that works with pointers, files and strings is not straightforward. These bugs are hard to avoid while programming and can cause behaviour that differs from machine to machine. Most of the time they can be tracked down cleanly with simple debugging tools. I have been using "valgrind", a simple, intuitive and powerful tool for debugging errors arising from memory, data and file operations. These are some simple steps that I have used and found greatly beneficial while debugging my code (a numerical solver for parallel architectures).

1. Guard your code with checks wherever needed (or everywhere) using "assert". It is good practice to insert assert lines after pointer allocation, after file opening, and for constraints on the input data (like x > 0, when read from a file), etc.

2. Compile your code with a debug flag "-g" to include the debug symbols required.

3. Run your code under "valgrind" as follows. Valgrind reports the errors it detects and suggests further flags that help you locate the source of each error.

valgrind ./a.out [ simple run; valgrind will suggest further flags to re-run with ]
valgrind -v --leak-check=full --track-origins=yes ./a.out [ full leak check (--leak-check=full), trace uninitialised values back to their origin (--track-origins=yes), with verbose output (-v) ]

This will help you get started with proper memory debugging. More help is available on valgrind's website.

Also, valgrind can be used with parallel applications (e.g. those using MPI / OpenMP) as well.
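
The assert checks from step 1 can be sketched roughly as below. This is only an illustration: the helper names alloc_vector and read_positive, and the one-number-per-read input format, are made up for this example and are not from my solver.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate n doubles and stop immediately if the allocation failed.
   (alloc_vector is a hypothetical helper name.) */
double *alloc_vector(size_t n)
{
    double *v = malloc(n * sizeof *v);
    assert(v != NULL);          /* check right after pointer allocation */
    return v;
}

/* Read one value from an already opened file and enforce x > 0.
   (read_positive is a hypothetical helper name.) */
double read_positive(FILE *fp)
{
    double x = 0.0;
    int nread;

    assert(fp != NULL);         /* check that the file was opened */
    nread = fscanf(fp, "%lf", &x);
    assert(nread == 1);         /* a value was actually parsed */
    assert(x > 0.0);            /* constraint on the input data */
    return x;
}
```

Keep in mind that assert is compiled out when NDEBUG is defined, so these checks belong to debug builds (the "-g" build from step 2); anything that must also be caught in production runs needs explicit error handling.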

MPI tutorial

"A User's Guide to MPI"

Peter S. Pacheco

http://www.academia.edu/2602564/_A_Users_Guide_to_MPI_by_Peter_S._Pacheco

Basic Metrics of a CUDA application


After developing a CUDA application, the costly routines (in terms of runtime) need to be tuned or optimised for better performance. The primary metrics for identifying whether a kernel is memory intensive or computationally intensive are the L1 / L2 cache hit rates, the data throughput, and the computational throughput. This article discusses the data and computational throughput.

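Let's understand this using a simple example.
The kernel SUM adds two vectors a and b and stores the result in c.

__global__ void SUM( double *a, double *b, double *c)
{
    // global thread index, one vector element per thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // N is the problem size (assumed to be defined elsewhere, e.g. as a macro)
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}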

Data throughput is the total data transferred (read + write) by a CUDA kernel per unit time. It indicates the data transfer rate the application is able to reach. One can compare it to the theoretical memory bandwidth of the CUDA device for reference.

In the example kernel SUM, each thread reads 2 doubles (a & b) and writes 1 double (c). Hence, per thread,

Read = 2 doubles * 8 bytes = 16 Bytes
Write = 1 double * 8 bytes = 8 Bytes

and for all N threads the totals are N * 16 Bytes read and N * 8 Bytes written.

The bandwidth or the data throughput can be calculated as:

Effective Bandwidth = (Total Read + Total Write) / (time * 10^9) GBytes / sec

where time is the runtime of the kernel in seconds. The runtime can be obtained from nvprof: the --print-gpu-trace option lists each kernel launch with its duration, while --print-api-trace reports the host-side API calls instead.
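
As a rough sketch, the bandwidth formula for the SUM kernel can be wrapped in a small C helper. The function name sum_effective_bandwidth_gbs is made up for this example, and the kernel time has to be supplied by you from the profiler output.

```c
#include <stddef.h>

/* Effective bandwidth in GBytes / sec for the SUM kernel.
   Assumption: each of the n threads reads 2 doubles and writes 1 double.
   time_s is the measured kernel runtime in seconds. */
double sum_effective_bandwidth_gbs(size_t n, double time_s)
{
    double bytes_read    = 2.0 * sizeof(double) * (double)n;  /* a and b */
    double bytes_written = 1.0 * sizeof(double) * (double)n;  /* c */

    return (bytes_read + bytes_written) / (time_s * 1.0e9);
}
```

For instance, with N = 1,000,000 elements and an illustrative (not measured) runtime of 1 ms, the kernel moves 24 MBytes and the helper returns about 24 GBytes / sec.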

Computational throughput is the total number of calculations (i.e. floating point operations - flops) performed by the device per second. A flop is an addition, subtraction, multiplication or division. The computational throughput indicates the flop rate the application is able to sustain. One can compare it to the theoretical peak throughput of the CUDA device for reference.

For the simple kernel SUM, we have 1 flop, i.e. 1 addition, per thread, so Total flops = N. Hence,

Computational Throughput = Total flops / (time * 10^9) GFLOP / sec

where time is the runtime of the kernel in seconds, obtained from nvprof with the --print-gpu-trace option.

Note that counting the flops of a kernel is not always straightforward, especially for long kernels or for mathematical functions like cos, sin, etc., which cost more than one flop each.
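
The same calculation as a small C helper, again only as a sketch. The name gflops_per_second is hypothetical, and the flop count per thread is something you have to work out by hand for your own kernel.

```c
#include <stddef.h>

/* Computational throughput in GFLOP / sec.
   flops_per_thread : hand-counted flops of one thread (1 for SUM)
   n_threads        : number of threads doing work (N for SUM)
   time_s           : measured kernel runtime in seconds */
double gflops_per_second(double flops_per_thread, size_t n_threads, double time_s)
{
    double total_flops = flops_per_thread * (double)n_threads;

    return total_flops / (time_s * 1.0e9);
}
```

With N = 1,000,000 and an illustrative runtime of 1 ms, SUM reaches about 1 GFLOP / sec. Comparing this number and the effective bandwidth against the device's theoretical peaks shows which of the two limits the kernel is closer to.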

L1 / L2 hit rates are measured directly from the visual profiler. They denote the percentage of data hits to the total data requests to the L1 / L2 caches.

References :
1. https://developer.nvidia.com/content/how-implement-performance-metrics-cuda-cc
2. A short lecture on flops

Compilers, Debuggers & Profilers


Part A : Compilers

1. GNU compilers [ gcc / g++ / gfortran ] - free
2. Intel [ icc / icpc / ifort ]- free for linux

Part B : Debuggers

1. GNU debugger [ gdb ] - free
2. Intel debugger [ idb ] - free for linux

Part C : Profilers

1. Valgrind (command line), Valkyrie (GUI)
Purpose : memory debugging, memory leak detection, and profiling
Link : http://valgrind.org

2. nvprof (command line), computeprof / nvvp (GUI)
Purpose : profiling
Languages : CUDA

Some available lists;

A list of memory debuggers ;
http://en.wikipedia.org/wiki/Memory_debugger

CUDA Device metrics

CUDA Enabled Devices - Metrics Reference

(Using computeprof / nvvp)

This section contains detailed descriptions of the metrics that can be collected by the Visual Profiler. At the time of writing, these metrics can be collected only from within the Visual Profiler: the command-line profiler and nvprof can collect low-level events but are not capable of collecting metrics.

Capability 2.x Metrics
Metric Name	Description	Formula
sm_efficiency	The ratio of the time at least one warp is active on a multiprocessor to the total time	100 * (active_cycles / #SM) / elapsed_clocks
achieved_occupancy	Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor	100 * (active_warps / active_cycles) / max_warps_per_sm
ipc	Instructions executed per cycle	(inst_executed / #SM) / elapsed_clocks
branch_efficiency	Ratio of non-divergent branches to total branches	100 * (branch - divergent_branch) / branch
warp_execution_efficiency	Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor	thread_inst_executed / (inst_executed * warp_size)
inst_replay_overhead	Percentage of instruction issues due to memory replays	100 * (inst_issued - inst_executed) / inst_issued
shared_replay_overhead	Percentage of instruction issues due to replays for shared memory conflicts	100 * l1_shared_bank_conflict / inst_issued
global_cache_replay_overhead	Percentage of instruction issues due to replays for global memory cache misses	100 * global_load_miss / inst_issued
local_replay_overhead	Percentage of instruction issues due to replays for local memory cache misses	100 * (local_load_miss + local_store_miss) / inst_issued
gld_efficiency	Ratio of requested global memory load throughput to actual global memory load throughput	100 * gld_requested_throughput / gld_throughput
gst_efficiency	Ratio of requested global memory store throughput to actual global memory store throughput	100 * gst_requested_throughput / gst_throughput
gld_throughput	Global memory load throughput	((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gst_throughput	Global memory store throughput	((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gld_requested_throughput	Requested global memory load throughput	(gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime
gst_requested_throughput	Requested global memory store throughput	(gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime
dram_read_throughput	DRAM read throughput	(fb_subp0_read + fb_subp1_read) * 32 / gputime
dram_write_throughput	DRAM write throughput	(fb_subp0_write + fb_subp1_write) * 32 / gputime
l1_cache_global_hit_rate	Hit rate in L1 cache for global loads	100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss)
l1_cache_local_hit_rate	Hit rate in L1 cache for local loads and stores	100 * (l1_local_ld_hit + l1_local_st_hit)/(l1_local_ld_hit + l1_local_ld_miss + l1_local_st_hit + l1_local_st_miss)
tex_cache_hit_rate	Texture cache hit rate	100 * (tex0_cache_sector_queries - tex0_cache_misses) / tex0_cache_sector_queries
tex_cache_throughput	Texture cache throughput	tex_cache_sector_queries * 32 / gputime
sm_efficiency_instance	The ratio of the time at least one warp is active on a multiprocessor to the total time	100 * active_cycles / elapsed_clocks
ipc_instance	Instructions executed per cycle	inst_executed / elapsed_clocks
l2_l1_read_hit_rate	Hitrate at L2 cache for read requests from L1 cache	100 * (l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors) / (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries)
l2_tex_read_hit_rate	Hitrate at L2 cache for read requests from texture cache	100 * (l2_subp0_read_tex_hit_sectors + l2_subp1_read_tex_hit_sectors) / (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries)
l2_l1_read_throughput	Memory read throughput at L2 cache for read requests from L1 cache	(l2_subp0_read_sector_queries + l2_subp1_read_sector_queries) * 32 / gputime
l2_tex_read_throughput	Memory read throughput at L2 cache for read requests from texture cache	(l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries) * 32 / gputime
local_memory_overhead	Ratio of local memory traffic to total memory traffic between L1 and L2	100 * (2 * l1_local_load_miss * 128) / ((l2_subp0_read_requests + l2_subp1_read_requests +l2_subp0_write_requests + l2_subp1_write_requests) * 32)
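
To make the formulas in the table concrete, here is a small sketch that evaluates three of them from raw counter values. The struct counters_t and the way the counters are stored are assumptions for this example only; the field names simply mirror the event names used in the table.

```c
/* Raw event counts as read off the profiler (hypothetical container;
   field names mirror the event names in the table above). */
typedef struct {
    double active_warps;
    double active_cycles;
    double branch;
    double divergent_branch;
    double l1_global_ld_hit;
    double l1_global_ld_miss;
} counters_t;

/* 100 * (active_warps / active_cycles) / max_warps_per_sm */
double achieved_occupancy(const counters_t *c, double max_warps_per_sm)
{
    return 100.0 * (c->active_warps / c->active_cycles) / max_warps_per_sm;
}

/* 100 * (branch - divergent_branch) / branch */
double branch_efficiency(const counters_t *c)
{
    return 100.0 * (c->branch - c->divergent_branch) / c->branch;
}

/* 100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss) */
double l1_cache_global_hit_rate(const counters_t *c)
{
    return 100.0 * c->l1_global_ld_hit
                 / (c->l1_global_ld_hit + c->l1_global_ld_miss);
}
```

For a capability 2.x device, max_warps_per_sm would be 48. The functions do not guard against zero denominators, so a kernel with no branches or no global loads needs a check before calling them.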

For more information ;

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

Fortran binding in C


Many useful libraries and routines are written in FORTRAN, and if you intend to use them in your C code, here are 2 ways to do so. Warning: when working with multi-dimensional arrays, account for the row-major storage of C and the column-major storage of FORTRAN. Also, take care of data type compatibility across the two languages.
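
The row-major / column-major warning can be illustrated with a short sketch. The helper names below are made up for this example; they use zero-based (i, j) indices on the C side.

```c
#include <stddef.h>

/* Offset of element (i, j) in C row-major storage with ncols columns. */
size_t idx_row_major(size_t i, size_t j, size_t ncols)
{
    return i * ncols + j;
}

/* Offset of the same element in FORTRAN column-major storage with nrows rows. */
size_t idx_col_major(size_t i, size_t j, size_t nrows)
{
    return j * nrows + i;
}

/* Copy a row-major nrows x ncols matrix into column-major order,
   ready to be handed to a FORTRAN routine. dst must hold nrows * ncols doubles. */
void to_col_major(const double *src, double *dst, size_t nrows, size_t ncols)
{
    size_t i, j;

    for (i = 0; i < nrows; ++i)
        for (j = 0; j < ncols; ++j)
            dst[idx_col_major(i, j, nrows)] = src[idx_row_major(i, j, ncols)];
}
```

For a 2 x 3 matrix stored in C as {1, 2, 3, 4, 5, 6}, the column-major copy is {1, 4, 2, 5, 3, 6}. An alternative that avoids the copy is to index the C array in column-major order from the start.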

Part A - Using .f / .f90 directly

1. Compile the .f / .f90 files to .o object files using:

FC -c -O filename.f

where FC stands for your Fortran compiler (e.g. gfortran or ifort). This generates the filename.o object file, whose routines can then be called like C functions with little additional effort.

2. Declare the function prototype before your main() and pass pointers and data by reference as suitable. For more explanation see:

http://www.math.utah.edu/software/c-with-fortran.html
http://www.physics.utah.edu/~detar/phys6720/handouts/fortran_binding.html
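
A minimal sketch of step 2 is given below. The Fortran routine vecadd is hypothetical. The sketch assumes the common convention (used by gfortran, among others) that the compiler exports the routine in lowercase with a trailing underscore and passes every argument by reference, and that Fortran integer / double precision map to C int / double. Check the links above and your compiler's documentation, since the name mangling differs between compilers.

```c
/* Prototype, as seen from C, of a hypothetical Fortran routine
       subroutine vecadd(n, a, b, c)
   Assumptions: trailing underscore in the symbol name, all arguments
   passed by reference. */
void vecadd_(const int *n, const double *a, const double *b, double *c);

/* C stand-in for the Fortran routine so that this sketch links on its own.
   Remove it when linking against the real filename.o. */
void vecadd_(const int *n, const double *a, const double *b, double *c)
{
    int i;

    for (i = 0; i < *n; ++i)
        c[i] = a[i] + b[i];
}

/* Thin wrapper that gives C callers the usual by-value interface. */
void vecadd(int n, const double *a, const double *b, double *c)
{
    vecadd_(&n, a, b, c);   /* scalars go by address, arrays are already pointers */
}
```

With gfortran, linking would then look something like "gcc main.c filename.o -lgfortran"; other compilers need their own runtime libraries.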

Part B - Convert your .f / .f90 to .c files using f2c converter

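Perhaps an easier way (if you know what you are doing!)

1. Convert the simple .f routines to C using f2c [ http://netlib.org/f2c/ ]. Note that f2c is a Fortran 77 converter, so .f90 sources that use newer language features may not convert.

2. Include the f2c header (f2c.h) and link against the f2c library (libf2c) when compiling the converted code.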

Always check for the consistency of the output after the compile & run parts of the code.

CUDA Application profiling

Profiling a CUDA application

Tools required : NVIDIA's profilers ( nvprof on the command line, or the Visual Profiler nvvp / computeprof as GUI )

Readme : http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Firstly, identify the algorithm's 'heavy' areas, i.e. the most time-consuming routines or kernels.

Then prepare the code for profiling,

1. Include these headers

cuda_profiler_api.h (or cudaProfiler.h for the driver API)

2. Add functions to start and stop profile data collection.

cudaProfilerStart() is used to start profiling

cudaProfilerStop() is used to stop profiling

(using the CUDA driver API, you get the same functionality with cuProfilerStart() and cuProfilerStop()).

3. When using the start and stop functions, you also need to instruct the profiling tool to disable profiling at the start of the application. For nvprof you do this with the --profile-from-start-off flag. For the Visual Profiler you do this by unchecking the "Start execution with profiling enabled" checkbox in the Settings View.

4. Flush Profile Data

To reduce profiling overhead, the profiling tools collect and record profile information into internal buffers. These buffers are then flushed asynchronously to disk with low priority to avoid perturbing application behavior. To avoid losing profile information that has not yet been flushed, the application being profiled should call cudaDeviceReset() before exiting. Doing so forces all buffered profile information to be flushed.

5. Select the metrics required to be displayed and analyse the application behaviour.

Sparse Libraries

Sparse (& dense) matrix libraries

A general survey

Useful for BLAS / LAPACK type implementations (but not limited to them!) and for iterative methods with support for compressed sparse matrix formats. FORTRAN flavours can be found analogously. A few libraries also include parallel features.

Intel MKL (highly recommended ! )
http://software.intel.com/en-us/intel-mkl

C++
http://seldon.sourceforge.net/

C++ / POSIX
http://plasimo.phys.tue.nl/TBCI/

C++
http://math.nist.gov/sparselib++/

Linear solvers (not sure if sparse functionality is available)
http://aam.mathematik.uni-freiburg.de/IAM/Research/projectskr/lin_solver/

Useful routines (F77)
http://www.netlib.org/toms/

A list of useful numerical libraries

A comprehensive list of Numerical libraries ......... for daily use

A library used is a hundred bugs eliminated

Multi-language

ALGLIB is an open source numerical analysis library which may be used from C++, C#, FreePascal, Delphi, VBA.
IMSL Numerical Libraries are libraries of numerical analysis functionality implemented in standard programming languages like C, Java, C# .NET, Fortran, and Python.
The NAG Library is a collection of mathematical and statistical routines for multiple programming languages (C, C++, Fortran, Visual Basic, Java and C#) and packages (MATLAB, Excel, R, LabVIEW).

C

BLOPEX (Block Locally Optimal Preconditioned Eigenvalue Xolvers) is an open-source library for the scalable (parallel) solution of eigenvalue problems. Its object-oriented design allows easy portability.
FFTW (Fastest Fourier Transform in the West) is a software library for computing Fourier and related transforms.
GNU Scientific Library, a popular, free numerical analysis library implemented in C.
GNU Multi-Precision Library is a library for doing arbitrary precision arithmetic.
hypre (High Performance Preconditioners) is an open-source library of routines for scalable (parallel) solution of linear systems and preconditioning.
IMSL Numerical Libraries are cross-platform libraries containing a comprehensive set of mathematical and statistical functions that can be embedded in a user's application.
LabWindows/CVI is an ANSI C IDE that includes built-in libraries for analysis of raw measurement data, signal generation, windowing, filter functions, signal processing, linear algebra, array and complex operations, curve fitting and statistics.
Lis is a scalable parallel library for solving systems of linear equations and standard eigenvalue problems with real sparse matrices using iterative methods.
Portable, Extensible Toolkit for Scientific Computation (PETSc), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations.
SLEPc Scalable Library for Eigenvalue Problem Computations is a PETSc-based open-source library for the scalable (parallel) solution of eigenvalue problems.
Trilinos, an effort to develop scalable (parallel) solver algorithms and libraries within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific applications. A unique design feature of Trilinos is its focus on packages.

C++

Armadillo is a C++ linear algebra library (matrix and vector maths), aiming towards a good balance between speed and ease of use. It employs template classes, and has optional links to BLAS and LAPACK.
Blitz++ is a high-performance vector mathematics library written in C++.
Eigen is a vector mathematics library with performance comparable with Intel's Math Kernel Library.
Hermes Project: C++/Python library for rapid prototyping of space- and space-time adaptive hp-FEM solvers.
IML++ is a C++ library for solving linear systems of equations, capable of dealing with dense, sparse, and distributed matrices.
IT++ is a C++ library for linear algebra (matrices and vectors), signal processing and communications. Functionality similar to MATLAB and Octave.
LAPACK++, a C++ wrapper library for LAPACK and BLAS
LinBox is a C++ template library for doing exact computational linear algebra.
MTL4 is a generic C++ template library providing sparse and dense BLAS functionality. MTL4 establishes an intuitive interface (similar to MATLAB) and broad applicability thanks to Generic programming.
NTL is a C++ library for number theory.
GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License. The library provides a wide range of mathematical routines such as random number generators, etc. There are over 1000 functions in total with an extensive test suite.

.NET Framework languages C#, F# and VB.NET

ILNumerics.Net: high performance, typesafe numerical array classes and functions for general math, FFT and linear algebra; targets .NET/Mono, 32 & 64 bit, with script-like syntax in C#, 2D & 3D plot controls, and efficient memory management.
IMSL Numerical Libraries for .NET is a set of mathematical, statistical, data mining, financial and charting classes written in C#.
Measurement Studio is an integrated suite of UI controls and class libraries for use in developing test and measurement applications. The analysis class libraries provide various digital signal processing, signal filtering, signal generation, peak detection, and other general mathematical functionality.
NMath by CenterSpace Software: numerical component libraries for the .NET platform, including signal processing (FFT) classes, a linear algebra (LAPACK & BLAS) framework, and a statistics package.
suanshu.net by Numerical Method Inc. is a large collection of numerical algorithms including linear algebra, (advanced) optimization, interpolation, Markov model, principal component analysis, time series analysis, hypothesis testing, regressions, statistics, and ordinary and partial differential equation solvers.
NLinear is a generic linear algebra toolkit in C# compatible with Silverlight.

Fortran

BLAS (Basic Linear Algebra Subprograms) is a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication.
CERNLIB is a collection of FORTRAN 77 libraries and modules.
EISPACK is a software library for numerical computation of eigenvalues and eigenvectors of matrices, written in FORTRAN. It contains subroutines for calculating the eigenvalues of nine classes of matrices: complex general, complex Hermitian, real general, real symmetric, real symmetric banded, real symmetric tridiagonal, special real tridiagonal, generalized real, and generalized real symmetric matrices.
IMSL Numerical Libraries are cross-platform libraries containing a comprehensive set of mathematical and statistical functions that can be embedded in a user's application.
Harwell Subroutine Library is a collection of Fortran 77 and 95 codes that address core problems in numerical analysis.
LAPACK, the Linear Algebra PACKage, is a software library for numerical computing originally written in FORTRAN 77 and now written in Fortran 90.
LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Pete Stewart, and was intended for use on supercomputers in the 1970s and early 1980s. It has been largely superseded by LAPACK, which will run more efficiently on modern architectures.
Lis is a scalable parallel library for solving systems of linear equations and standard eigenvalue problems with real sparse matrices using iterative methods.
MINPACK is a library of FORTRAN subroutines for the solving of systems of nonlinear equations, or the least squares minimization of the residual of a set of linear or nonlinear equations.
NOVAS is a software library for astrometry-related numerical computations. Both Fortran and C versions are available.
Netlib is a repository of scientific computing software which contains a large number of separate programs and libraries including BLAS, EISPACK, LAPACK and others.
Portable, Extensible Toolkit for Scientific Computation (PETSc), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations.
QUADPACK is a FORTRAN 77 library for numerical integration of one-dimensional functions.
SLATEC is a FORTRAN 77 library of over 1400 general purpose mathematical and statistical routines.
SOFA is a collection of subroutines that implement official IAU algorithms for astronomical computations. Both Fortran and C versions are available.
SPARSKIT is a tool package for working with sparse matrices.
ARPACK is a collection of Fortran77 subroutines designed to solve large scale eigenvalue problems.
SHTOOLS is an archive of fortran 95 based software that can be used to perform (among others) spherical harmonic transforms and reconstructions, rotations of spherical harmonic coefficients, and multitaper spectral analyses on the sphere.

Source ::
http://en.wikipedia.org/wiki/List_of_numerical_libraries

CFD and Coffee .......

Computational Fluid Dynamics for all the coffee addicts

Saturday, February 23, 2013

Memory debugging - Crash course

MPI tutorial

Monday, February 18, 2013

Basic Metrics of a CUDA application

Tuesday, February 12, 2013

Compilers, Debuggers & Profilers

CUDA Device metrics

Fortran binding in C

Sunday, February 10, 2013

CUDA Application profiling

Friday, February 8, 2013

Sparse Libraries

A list of useful numerical libraries
