Tuesday, February 12, 2013





CUDA Enabled Devices - Metrics Reference
(Using computeprof / nvvp)

This section contains detailed descriptions of the metrics that can be collected by the Visual Profiler. These metrics can be collected only from within the Visual Profiler. The command-line profiler and nvprof can collect low-level events but are not capable of collecting metrics.



Compute Capability 2.x Metrics
Metric Name | Description | Formula
--- | --- | ---
sm_efficiency | The ratio of the time at least one warp is active on a multiprocessor to the total time | 100 * (active_cycles / #SM) / elapsed_clocks
achieved_occupancy | Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor | 100 * (active_warps / active_cycles) / max_warps_per_sm
ipc | Instructions executed per cycle | (inst_executed / #SM) / elapsed_clocks
branch_efficiency | Ratio of non-divergent branches to total branches | 100 * (branch - divergent_branch) / branch
warp_execution_efficiency | Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor | thread_inst_executed / (inst_executed * warp_size)
inst_replay_overhead | Percentage of instruction issues due to memory replays | 100 * (inst_issued - inst_executed) / inst_issued
shared_replay_overhead | Percentage of instruction issues due to replays for shared memory conflicts | 100 * l1_shared_bank_conflict / inst_issued
global_cache_replay_overhead | Percentage of instruction issues due to replays for global memory cache misses | 100 * global_load_miss / inst_issued
local_replay_overhead | Percentage of instruction issues due to replays for local memory cache misses | 100 * (local_load_miss + local_store_miss) / inst_issued
gld_efficiency | Ratio of requested global memory load throughput to actual global memory load throughput | 100 * gld_requested_throughput / gld_throughput
gst_efficiency | Ratio of requested global memory store throughput to actual global memory store throughput | 100 * gst_requested_throughput / gst_throughput
gld_throughput | Global memory load throughput | ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gst_throughput | Global memory store throughput | ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
gld_requested_throughput | Requested global memory load throughput | (gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime
gst_requested_throughput | Requested global memory store throughput | (gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime
dram_read_throughput | DRAM read throughput | (fb_subp0_read + fb_subp1_read) * 32 / gputime
dram_write_throughput | DRAM write throughput | (fb_subp0_write + fb_subp1_write) * 32 / gputime
l1_cache_global_hit_rate | Hit rate in L1 cache for global loads | 100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss)
l1_cache_local_hit_rate | Hit rate in L1 cache for local loads and stores | 100 * (l1_local_ld_hit + l1_local_st_hit) / (l1_local_ld_hit + l1_local_ld_miss + l1_local_st_hit + l1_local_st_miss)
tex_cache_hit_rate | Texture cache hit rate | 100 * (tex0_cache_sector_queries - tex0_cache_misses) / tex0_cache_sector_queries
tex_cache_throughput | Texture cache throughput | tex_cache_sector_queries * 32 / gputime
sm_efficiency_instance | The ratio of the time at least one warp is active on a multiprocessor to the total time | 100 * active_cycles / elapsed_clocks
ipc_instance | Instructions executed per cycle | inst_executed / elapsed_clocks
l2_l1_read_hit_rate | Hit rate at L2 cache for read requests from L1 cache | 100 * (l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors) / (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries)
l2_tex_read_hit_rate | Hit rate at L2 cache for read requests from texture cache | 100 * (l2_subp0_read_tex_hit_sectors + l2_subp1_read_tex_hit_sectors) / (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries)
l2_l1_read_throughput | Memory read throughput at L2 cache for read requests from L1 cache | (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries) * 32 / gputime
l2_tex_read_throughput | Memory read throughput at L2 cache for read requests from texture cache | (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries) * 32 / gputime
local_memory_overhead | Ratio of local memory traffic to total memory traffic between L1 and L2 | 100 * (2 * l1_local_load_miss * 128) / ((l2_subp0_read_requests + l2_subp1_read_requests + l2_subp0_write_requests + l2_subp1_write_requests) * 32)
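
To make the formulas above more concrete, here is a minimal Python sketch that applies a few of them to a set of event counts. Everything numeric in it is an assumption made up for illustration: the event values, `elapsed_clocks`, the multiprocessor count `num_sm` (the `#SM` term in the table), and `max_warps_per_sm`. None of it comes from a real profiling run, and the helper `compute_metrics` is a hypothetical name, not part of any profiler API.

```python
# Hypothetical event counts, summed over all multiprocessors.
# These values are invented for illustration only.
events = {
    "active_cycles": 1_400_000,
    "active_warps": 33_600_000,
    "inst_executed": 2_000_000,
    "inst_issued": 2_500_000,
    "thread_inst_executed": 48_000_000,
    "branch": 50_000,
    "divergent_branch": 5_000,
}


def compute_metrics(ev, elapsed_clocks, num_sm, max_warps_per_sm, warp_size=32):
    """Apply a handful of the table's formulas to a dict of event counts."""
    return {
        "sm_efficiency": 100 * (ev["active_cycles"] / num_sm) / elapsed_clocks,
        "achieved_occupancy": 100 * (ev["active_warps"] / ev["active_cycles"])
        / max_warps_per_sm,
        "ipc": (ev["inst_executed"] / num_sm) / elapsed_clocks,
        "branch_efficiency": 100 * (ev["branch"] - ev["divergent_branch"])
        / ev["branch"],
        # As listed in the table, this formula has no factor of 100,
        # so the result is a fraction rather than a percentage.
        "warp_execution_efficiency": ev["thread_inst_executed"]
        / (ev["inst_executed"] * warp_size),
        "inst_replay_overhead": 100 * (ev["inst_issued"] - ev["inst_executed"])
        / ev["inst_issued"],
    }


# Assumed device parameters: 14 multiprocessors and 48 warps per
# multiprocessor are illustrative values, not read from a device.
metrics = compute_metrics(
    events, elapsed_clocks=120_000, num_sm=14, max_warps_per_sm=48
)

for name, value in metrics.items():
    print(f"{name:28s} {value:8.2f}")
```

The sketch does not guard against zero denominators (for example a kernel with no branches), so a real script would need to handle those cases before dividing.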

