Monday, February 18, 2013




Basic Metrics of a CUDA application


After developing a CUDA application, the costly routines (in terms of runtime) need to be tuned or optimised for better performance. The primary metrics for identifying whether a kernel is memory intensive or computationally intensive are the L1 / L2 cache hit rates, the data throughput, and the computational throughput. This article discusses the data and computational throughput.

Let's understand this using a simple example;
The kernel SUM adds two vectors a and b and stores the result in c.

__global__ void SUM(const double *a, const double *b, double *c, int N)
{
    // Global thread index: one element per thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // N is the problem size; guard against out-of-range threads
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
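
The kernel alone is not runnable, so here is a minimal host-side sketch that allocates the vectors, launches SUM, and copies the result back. The problem size n = 1 << 20 and the block size of 256 threads are arbitrary choices made for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;              // problem size (arbitrary choice)
    const size_t bytes = n * sizeof(double);

    // Host buffers with some test data
    double *h_a = (double *)malloc(bytes);
    double *h_b = (double *)malloc(bytes);
    double *h_c = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0 * i; }

    // Device buffers
    double *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // One thread per element; 256 threads per block is an arbitrary choice
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    SUM<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[1] = %f\n", h_c[1]);      // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}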

Data throughput is the total data transferred (reads + writes) by a CUDA kernel per unit time. It indicates the memory bandwidth the application actually achieves; one can compare it against the device's theoretical peak bandwidth for reference.



From the example kernel SUM, each thread reads 2 doubles (a & b) and writes 1 double (c). Hence, per thread,

Read = 2 doubles * 8 bytes = 16 Bytes
Write = 1 double * 8 bytes = 8 Bytes

and the kernel as a whole moves N * (16 + 8) bytes. The bandwidth, or effective data throughput, can be calculated as

Effective Bandwidth = N * (Read + Write) / (time * 10^9)  GBytes / sec

where time is the runtime of the kernel in seconds. The runtime can be obtained directly from nvprof (with the --print-gpu-trace or --print-api-trace options).
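
The runtime can also be measured in code with the standard CUDA event API. Below is a minimal sketch that times the kernel and evaluates the bandwidth formula above, reusing d_a, d_b, d_c, n, blocks and threads from the host sketch earlier:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
SUM<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed kernel time in milliseconds
double time_s = ms / 1000.0;

// 16 bytes read + 8 bytes written per element, as computed above
double bandwidth = (double)n * (16 + 8) / (time_s * 1e9);
printf("Effective bandwidth: %.2f GB/s\n", bandwidth);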


Computational throughput is the total number of calculations (i.e. floating point operations, flops) performed by the device per second. A flop is an addition, subtraction, multiplication or division. The computational throughput indicates the flop rate the application is able to sustain; one can compare it against the device's theoretical peak throughput for reference.

For the simple kernel SUM, each thread performs 1 flop, i.e. 1 addition, so the kernel performs N flops in total. Hence,

Computational Throughput = Total flops / (time * 10^9)  Gflops / sec


where, as before, time is the kernel runtime in seconds.
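
Continuing the timing sketch above, the computational throughput of SUM follows directly from the same time_s value; note that the one-flop-per-element count holds only for this particular kernel:

// SUM performs exactly one flop (one addition) per element
double total_flops = (double)n;
double gflops = total_flops / (time_s * 1e9);
printf("Computational throughput: %.2f Gflops/s\n", gflops);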

Note, however, that counting the flops of a kernel is not always this straightforward, especially for long kernels or kernels that use mathematical functions such as cos or sin, each of which expands to many flops.


L1 / L2 hit rates are measured directly with the Visual Profiler. They denote the percentage of data requests served from the L1 / L2 caches out of the total data requests.
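
These counters can also be collected on the command line with nvprof's --metrics option. Metric names differ across GPU architectures and toolkit versions (nvprof --query-metrics lists what is available on a given setup), so the names below are examples rather than a definitive list, and ./sum_app is a hypothetical binary name:

nvprof --metrics l1_cache_global_hit_rate,l2_l1_read_hit_rate ./sum_app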

References:
1. https://developer.nvidia.com/content/how-implement-performance-metrics-cuda-cc
2. A short lecture on flops

