sm_efficiency | The ratio of the time at least one warp is active on a multiprocessor to the total time | 100 * (active_cycles / #SM) / elapsed_clocks |
achieved_occupancy | Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor | 100 * (active_warps / active_cycles) / max_warps_per_sm |
ipc | Instructions executed per cycle | (inst_executed / #SM) / elapsed_clocks |
branch_efficiency | Ratio of non-divergent branches to total branches | 100 * (branch - divergent_branch) / branch |
warp_execution_efficiency | Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor | thread_inst_executed / (inst_executed * warp_size) |
inst_replay_overhead | Percentage of instruction issues due to memory replays | 100 * (inst_issued - inst_executed) / inst_issued |
shared_replay_overhead | Percentage of instruction issues due to replays for shared memory conflicts | 100 * l1_shared_bank_conflict / inst_issued |
global_cache_replay_overhead | Percentage of instruction issues due to replays for global memory cache misses | 100 * global_load_miss / inst_issued |
local_replay_overhead | Percentage of instruction issues due to replays for local memory cache misses | 100 * (local_load_miss + local_store_miss) / inst_issued |
gld_efficiency | Ratio of requested global memory load throughput to actual global memory load throughput | 100 * gld_requested_throughput / gld_throughput |
gst_efficiency | Ratio of requested global memory store throughput to actual global memory store throughput | 100 * gst_requested_throughput / gst_throughput |
gld_throughput | Global memory load throughput | ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime |
gst_throughput | Global memory store throughput | ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime |
gld_requested_throughput | Requested global memory load throughput | (gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime |
gst_requested_throughput | Requested global memory store throughput | (gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime |
dram_read_throughput | DRAM read throughput | (fb_subp0_read + fb_subp1_read) * 32 / gputime |
dram_write_throughput | DRAM write throughput | (fb_subp0_write + fb_subp1_write) * 32 / gputime |
l1_cache_global_hit_rate | Hit rate in L1 cache for global loads | 100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss) |
l1_cache_local_hit_rate | Hit rate in L1 cache for local loads and stores | 100 * (l1_local_ld_hit + l1_local_st_hit) / (l1_local_ld_hit + l1_local_ld_miss + l1_local_st_hit + l1_local_st_miss) |
tex_cache_hit_rate | Texture cache hit rate | 100 * (tex0_cache_sector_queries - tex0_cache_misses) / tex0_cache_sector_queries |
tex_cache_throughput | Texture cache throughput | tex_cache_sector_queries * 32 / gputime |
sm_efficiency_instance | The ratio of the time at least one warp is active on a multiprocessor to the total time | 100 * active_cycles / elapsed_clocks |
ipc_instance | Instructions executed per cycle | inst_executed / elapsed_clocks |
l2_l1_read_hit_rate | Hit rate at L2 cache for read requests from L1 cache | 100 * (l2_subp0_read_hit_sectors + l2_subp1_read_hit_sectors) / (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries) |
l2_tex_read_hit_rate | Hit rate at L2 cache for read requests from texture cache | 100 * (l2_subp0_read_tex_hit_sectors + l2_subp1_read_tex_hit_sectors) / (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries) |
l2_l1_read_throughput | Memory read throughput at L2 cache for read requests from L1 cache | (l2_subp0_read_sector_queries + l2_subp1_read_sector_queries) * 32 / gputime |
l2_tex_read_throughput | Memory read throughput at L2 cache for read requests from texture cache | (l2_subp0_read_tex_sector_queries + l2_subp1_read_tex_sector_queries) * 32 / gputime |
local_memory_overhead | Ratio of local memory traffic to total memory traffic between L1 and L2 | 100 * (2 * l1_local_load_miss * 128) / ((l2_subp0_read_requests + l2_subp1_read_requests + l2_subp0_write_requests + l2_subp1_write_requests) * 32) |
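
As an illustration of how the formulas in the table combine raw event counts, the following minimal Python sketch evaluates four of the metrics. The counter values, the `MAX_WARPS_PER_SM` constant, and the helper names `pct` and `derived_metrics` are all hypothetical and chosen only for this example; real values come from the profiler and depend on the device.

```python
# Hypothetical raw event counts for one kernel launch (illustrative values only;
# real counts are collected by the profiler).
events = {
    "active_warps": 1_200_000,
    "active_cycles": 50_000,
    "branch": 8_000,
    "divergent_branch": 400,
    "inst_issued": 110_000,
    "inst_executed": 100_000,
    "l1_global_ld_hit": 7_500,
    "l1_global_ld_miss": 2_500,
}

# Assumption: the maximum number of resident warps per multiprocessor.
# This is device-dependent; 48 is used here purely as an example.
MAX_WARPS_PER_SM = 48


def pct(numerator, denominator):
    """Return 100 * numerator / denominator, or 0.0 if the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def derived_metrics(ev, max_warps_per_sm):
    """Evaluate a few of the table's formulas from a dict of event counts."""
    # Average number of active warps per active cycle.
    avg_warps = ev["active_warps"] / ev["active_cycles"] if ev["active_cycles"] else 0.0
    return {
        # 100 * (active_warps / active_cycles) / max_warps_per_sm
        "achieved_occupancy": pct(avg_warps, max_warps_per_sm),
        # 100 * (branch - divergent_branch) / branch
        "branch_efficiency": pct(ev["branch"] - ev["divergent_branch"], ev["branch"]),
        # 100 * (inst_issued - inst_executed) / inst_issued
        "inst_replay_overhead": pct(ev["inst_issued"] - ev["inst_executed"], ev["inst_issued"]),
        # 100 * l1_global_ld_hit / (l1_global_ld_hit + l1_global_ld_miss)
        "l1_cache_global_hit_rate": pct(
            ev["l1_global_ld_hit"],
            ev["l1_global_ld_hit"] + ev["l1_global_ld_miss"],
        ),
    }


metrics = derived_metrics(events, MAX_WARPS_PER_SM)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The zero-denominator guard in `pct` is a design choice for this sketch, so that a kernel with no branches or no global loads yields 0 rather than raising an error; how the profiler itself reports such cases is not specified in the table.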