CUDA Device metrics ~ CFD and Coffee .......

CUDA-Enabled Devices - Metrics Reference

(Using computeprof / nvvp)

This section describes the metrics that the Visual Profiler can collect. These metrics can be collected only from within the Visual Profiler: the command-line profiler and nvprof can collect low-level events but cannot collect metrics.

Capability 2.x Metrics
Metric Name	Description	Formula
sm_efficiency	The ratio of the time at least one warp is active on a multiprocessor to the total time	100 * (active_cycles / #SM) / elapsed_clocks
achieved_occupancy	Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor	100 * (active_warps / active_cycles) / max_warps_per_sm
ipc	Instructions executed per cycle	(inst_executed / #SM) / elapsed_clocks
branch_efficiency	Ratio of non-divergent branches to total branches	100 * (branch - divergent_branch) / branch
warp_execution_efficiency	Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor	thread_inst_executed / (inst_executed * warp_size)
inst_replay_overhead	Percentage of instruction issues due to memory replays	100 * (inst_issued - inst_executed) / inst_issued
shared_replay_overhead	Percentage of instruction issues due to replays for shared memory conflicts	100 * l1_shared_bank_conflict / inst_issued
global_cache_replay_overhead	Percentage of instruction issues due to replays for global memory cache misses	100 * global_load_miss / inst_issued
local_replay_overhead	Percentage of instruction issues due to replays for local memory cache misses	100 * (local_load_miss + local_store_miss) / inst_issued
gld_efficiency	Ratio of requested global memory load throughput to actual global memory load throughput	100 * gld_requested_throughput / gld_throughput
gst_efficiency	Ratio of requested global memory store throughput to actual global memory store throughput	100 * gst_requested_throughput / gst_throughput
gld_throughput	Global memory load throughput	((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gst_throughput	Global memory store throughput	((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gld_requested_throughput	Requested global memory load throughput	(gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime
gst_requested_throughput	Requested global memory store throughput	(gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime
dram_read_throughput	DRAM read throughput	(fb_subp0_read + fb_subp1_read) * 32 / gputime
dram_write_throughput	DRAM write throughput	(fb_subp0_write + fb_subp1_write) * 32 / gputime
l1_cache_global_hit_rate	Hit rate in L1 cache for global loads	100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss)
l1_cache_local_hit_rate	Hit rate in L1 cache for local loads and stores	100 * (l1_local_ld_hit + l1_local_st_hit) / (l1_local_ld_hit + l1_local_ld_miss + l1_local_st_hit + l1_local_st_miss)
tex_cache_hit_rate	Texture cache hit rate	100 * (tex0_cache_sector_queries - tex0_cache_misses) / tex0_cache_sector_queries
tex_cache_throughput	Texture cache throughput	tex_cache_sector_queries * 32 / gputime
sm_efficiency_instance	The ratio of the time at least one warp is active on a multiprocessor to the total time	100 * active_cycles / elapsed_clocks
ipc_instance	Instructions executed per cycle	inst_executed / elapsed_clocks
l2_l1_read_hit_rate	Hit rate at L2 cache for read requests from L1 cache	100 * (l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors) / (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries)
l2_tex_read_hit_rate	Hit rate at L2 cache for read requests from texture cache	100 * (l2_subp0_read_tex_hit_sectors + l2_subp1_read_tex_hit_sectors) / (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries)
l2_l1_read_throughput	Memory read throughput at L2 cache for read requests from L1 cache	(l2_subp0_read_sector_queries + l2_subp1_read_sector_queries) * 32 / gputime
l2_tex_read_throughput	Memory read throughput at L2 cache for read requests from texture cache	(l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries) * 32 / gputime
local_memory_overhead	Ratio of local memory traffic to total memory traffic between L1 and L2	100 * (2 * l1_local_load_miss * 128) / ((l2_subp0_read_requests + l2_subp1_read_requests + l2_subp0_write_requests + l2_subp1_write_requests) * 32)
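
To make the formulas in the table concrete, here is a minimal Python sketch that evaluates a handful of them from raw event counts. It is only an illustration: the function names mirror the metric names, but the functions themselves, the device parameters (14 multiprocessors, 48 warps per multiprocessor) and every counter value are made-up assumptions, not profiler output. It also assumes the event counts are summed over all multiprocessors, which is what the division by #SM in the formulas suggests.

```python
# Illustrative only: evaluates some Capability 2.x metric formulas from the
# table above. All event counts and device parameters are hypothetical.

def sm_efficiency(active_cycles, num_sm, elapsed_clocks):
    # 100 * (active_cycles / #SM) / elapsed_clocks
    return 100.0 * (active_cycles / num_sm) / elapsed_clocks

def achieved_occupancy(active_warps, active_cycles, max_warps_per_sm):
    # 100 * (active_warps / active_cycles) / max_warps_per_sm
    return 100.0 * (active_warps / active_cycles) / max_warps_per_sm

def ipc(inst_executed, num_sm, elapsed_clocks):
    # (inst_executed / #SM) / elapsed_clocks
    return (inst_executed / num_sm) / elapsed_clocks

def branch_efficiency(branch, divergent_branch):
    # 100 * (branch - divergent_branch) / branch
    return 100.0 * (branch - divergent_branch) / branch

def inst_replay_overhead(inst_issued, inst_executed):
    # 100 * (inst_issued - inst_executed) / inst_issued
    return 100.0 * (inst_issued - inst_executed) / inst_issued

def gld_requested_throughput(gld_inst_by_bits, gputime):
    # (gld_inst_8bit + 2 * gld_inst_16bit + ... + 16 * gld_inst_128bit) / gputime
    # gld_inst_by_bits maps an access width in bits to its gld_inst_<N>bit count.
    requested_bytes = sum((bits // 8) * count
                          for bits, count in gld_inst_by_bits.items())
    return requested_bytes / gputime

def gld_efficiency(gld_requested, gld_actual):
    # 100 * gld_requested_throughput / gld_throughput
    return 100.0 * gld_requested / gld_actual

# Hypothetical device parameters and event counts (not measured data).
NUM_SM = 14
MAX_WARPS_PER_SM = 48
events = {
    "elapsed_clocks": 1_000_000,
    "active_cycles": 12_600_000,
    "active_warps": 302_400_000,
    "inst_executed": 7_000_000,
    "inst_issued": 8_000_000,
    "branch": 200_000,
    "divergent_branch": 10_000,
}

if __name__ == "__main__":
    e = events
    print("sm_efficiency        ", sm_efficiency(e["active_cycles"], NUM_SM, e["elapsed_clocks"]))
    print("achieved_occupancy   ", achieved_occupancy(e["active_warps"], e["active_cycles"], MAX_WARPS_PER_SM))
    print("ipc                  ", ipc(e["inst_executed"], NUM_SM, e["elapsed_clocks"]))
    print("branch_efficiency    ", branch_efficiency(e["branch"], e["divergent_branch"]))
    print("inst_replay_overhead ", inst_replay_overhead(e["inst_issued"], e["inst_executed"]))
    requested = gld_requested_throughput({32: 1000, 64: 500}, gputime=4.0)
    print("gld_efficiency       ", gld_efficiency(requested, gld_actual=4000.0))
```

The throughput helpers return bytes per unit of gputime, so the absolute scale depends on the unit the profiler reports gputime in. The ratio metrics such as gld_efficiency do not depend on that unit, since it cancels out.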

For more information, see:

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference


Tuesday, February 12, 2013
