
1 GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal
Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

2 GPU Programming Gets Popular
Many domains are using GPUs for high performance, e.g., GPU-accelerated molecular dynamics and GPU-accelerated seismic imaging.
GPUs are available in both high-end and low-end systems: the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012], and commodity desktops/laptops are equipped with GPUs.

3 Writing Efficient GPU Programs is Challenging
Need careful management of a large number of threads. (figure: grid of thread blocks)

4 Writing Efficient GPU Programs is Challenging
Need careful management of a large number of threads and of a multi-layer memory hierarchy. (figure: Kepler GK110 memory hierarchy, with threads and thread blocks accessing shared memory, L1 cache, read-only data cache, L2 cache, and DRAM/device memory)

5 Writing Efficient GPU Programs is Challenging
Need careful management of a large number of threads and of a multi-layer memory hierarchy: the on-chip memories (shared memory, L1 cache, read-only data cache) are fast but small, while device memory (DRAM) is large but slow. (figure: Kepler GK110 memory hierarchy)

6 Writing Efficient GPU Programs is Challenging
Which data in shared memory are infrequently accessed? Which data in device memory are frequently accessed? (figure: Kepler GK110 memory hierarchy)

7 Writing Efficient GPU Programs is Challenging
Which data in shared memory are infrequently accessed? Which data in device memory are frequently accessed?
Existing tools can't help much: they are inapplicable to GPUs, too coarse-grained, impose prohibitive runtime overhead, or cannot handle irregular/indirect accesses. (figure: Kepler GK110 memory hierarchy)

8 Outline
Motivation; GMProf (Naïve Profiling Approach, Optimizations, Enhanced Algorithm); Evaluation; Conclusions

9 GMProf-basic: The Naïve Profiling Approach
Shared memory profiling: integer counters count accesses to shared memory, one counter for each shared memory element; counters are updated atomically to avoid race conditions among threads.
Device memory profiling: integer counters count accesses to device memory, one counter for each element of the user's device memory arrays, since device memory is too large to be monitored as a whole (e.g., 6 GB); counters are updated atomically.
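A minimal sketch of what this naïve instrumentation could look like in CUDA; the array sizes, counter names, and kernel shape below are illustrative assumptions, not code from the paper:

    // GMProf-basic sketch: one 32-bit counter per monitored element, updated
    // with an atomic add on every access. Assumes the kernel is launched with
    // exactly N threads in total.
    #define N 1024

    __device__ unsigned int shm_count[N];   // counters for shared memory elements (zero-initialized)
    __device__ unsigned int dm_count[N];    // counters for one monitored device memory array

    __global__ void instrumented_kernel(const float *dm_array, float *out)
    {
        __shared__ float s[N];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        atomicAdd(&dm_count[i], 1u);             // count the device memory read
        atomicAdd(&shm_count[threadIdx.x], 1u);  // count the shared memory write
        s[threadIdx.x] = dm_array[i];            // the access being profiled

        __syncthreads();
        out[i] = s[threadIdx.x];                 // further accesses would be counted the same way
    }

In this naïve form, every monitored access pays an extra atomic read-modify-write, which is exactly the overhead the optimizations below attack.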

10 Outline
Motivation; GMProf (Naïve Profiling Approach, Optimizations, Enhanced Algorithm); Evaluation; Conclusions

11 GMProf-SA: Static Analysis Optimization
Observation I: Many memory accesses can be determined statically.

    __shared__ int s[];
    …
    s[threadIdx.x] = 3;

12 GMProf-SA: Static Analysis Optimization
Observation I: Many memory accesses can be determined statically; we don't need to count such an access at runtime.

    __shared__ int s[];
    …
    s[threadIdx.x] = 3;

13 GMProf-SA: Static Analysis Optimization
Observation I: Many memory accesses can be determined statically; we don't need to count such an access at runtime.

    __shared__ int s[];
    …
    s[threadIdx.x] = 3;

How about this one?

    __shared__ float s[];
    …
    for (r = 0; …; …) {
        for (c = 0; …; …) {
            temp = s[input[c]];
        }
    }

14 GMProf-SA: Static Analysis Optimization
Observation II: Some accesses are loop-invariant; e.g., s[input[c]] is irrelevant to the outer loop iterator r.

    __shared__ float s[];
    …
    for (r = 0; …; …) {
        for (c = 0; …; …) {
            temp = s[input[c]];
        }
    }

15 GMProf-SA: Static Analysis Optimization
Observation II: Some accesses are loop-invariant; e.g., s[input[c]] is irrelevant to the outer loop iterator r, so we don't need to profile it in every r iteration.

    __shared__ float s[];
    …
    for (r = 0; …; …) {
        for (c = 0; …; …) {
            temp = s[input[c]];
        }
    }

16 GMProf-SA: Static Analysis Optimization
Observation II: Some accesses are loop-invariant; e.g., s[input[c]] is irrelevant to the outer loop iterator r, so we don't need to profile it in every r iteration.
Observation III: Some accesses are tid-invariant; e.g., s[input[c]] is irrelevant to threadIdx.

    __shared__ float s[];
    …
    for (r = 0; …; …) {
        for (c = 0; …; …) {
            temp = s[input[c]];
        }
    }

17 GMProf-SA: Static Analysis Optimization
Observation II: Some accesses are loop-invariant; e.g., s[input[c]] is irrelevant to the outer loop iterator r, so we don't need to profile it in every r iteration.
Observation III: Some accesses are tid-invariant; e.g., s[input[c]] is irrelevant to threadIdx, so we don't need to update the counter in every thread.

    __shared__ float s[];
    …
    for (r = 0; …; …) {
        for (c = 0; …; …) {
            temp = s[input[c]];
        }
    }
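A sketch of how these observations might be exploited, under assumed loop bounds R and C and an assumed counter array shm_count (the slide does not show the actual instrumentation): the counter update for s[input[c]] is issued once per c, by a single thread per block, with the r-loop repetitions folded in, instead of once per thread per iteration.

    // GMProf-SA sketch: profiling s[input[c]] with the loop-invariant
    // (Observation II) and tid-invariant (Observation III) optimizations.
    #define R     64                             // assumed outer loop bound
    #define C     256                            // assumed inner loop bound
    #define SHM_N 1024

    __device__ unsigned int shm_count[SHM_N];    // one counter per shared memory element

    __global__ void sa_optimized_kernel(const int *input, const float *in, float *out)
    {
        __shared__ float s[SHM_N];
        s[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];  // assumes blockDim.x <= SHM_N
        __syncthreads();

        float temp = 0.0f;
        for (int r = 0; r < R; r++) {
            for (int c = 0; c < C; c++) {
                // input[c] does not depend on r or on threadIdx, so the counter
                // update is issued only once (r == 0), by one thread per block,
                // with the R repetitions folded into a single increment.
                if (r == 0 && threadIdx.x == 0)
                    shm_count[input[c]] += R;    // values of input[] assumed to be < SHM_N
                temp += s[input[c]];             // original access, no per-iteration profiling
            }
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = temp;
    }

How the count is scaled (per thread, per block, or folding in repetitions) is an accounting choice; the point is that the number of counter updates drops from R×C per thread to C per block.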

18 GMProf-NA: Non-Atomic Operation Optimization
Atomic operations cost a lot: atomicAdd(&counter, 1) serializes all concurrent threads that update the same shared counter. (figure: concurrent threads become serialized threads at the counter update)
Use non-atomic operations to update the counters instead; this does not impact the overall accuracy, thanks to the other optimizations.
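In code, this optimization is just the substitution below, shown here as two hypothetical helpers that use the shm_count array from the earlier sketch:

    __device__ unsigned int shm_count[1024];

    __device__ void count_atomic(int idx)       // GMProf-basic: serializes contending threads
    {
        atomicAdd(&shm_count[idx], 1u);
    }

    __device__ void count_nonatomic(int idx)    // GMProf-NA: may lose an occasional increment,
    {                                           // but never serializes the threads
        shm_count[idx] += 1;
    }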

19 GMProf-SM: Shared Memory Counters Optimization
Make full use of shared memory: store the counters in shared memory when possible, and reduce the counter size (e.g., from 32-bit integer counters to 8-bit). (figure: memory hierarchy, with the fast but small on-chip memories highlighted)

20 GMProf-SM: Shared Memory Counters Optimization
Make full use of shared memory: store the counters in shared memory when possible, and reduce the counter size (e.g., from 32-bit integer counters to 8-bit).
GMProf-TH: Threshold Optimization
A precise count may not be necessary, e.g., when A is accessed 10 times while B is accessed more than 100 times. Stop counting once a certain threshold is reached: a tradeoff between accuracy and overhead.
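A sketch of how GMProf-SM and GMProf-TH could combine; the buffer size, threshold value, and the assumption that blockDim.x equals the buffer size are all illustrative:

    #define SHM_N     256
    #define THRESHOLD 255                         // assumed saturation value; fits in 8 bits

    __global__ void sm_th_kernel(const float *in, float *out, unsigned int *result_count)
    {
        __shared__ float         s[SHM_N];
        __shared__ unsigned char s_count[SHM_N];  // 8-bit counters kept in fast shared memory

        int tid = threadIdx.x;                    // assumes blockDim.x == SHM_N
        s_count[tid] = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + tid;

        if (s_count[tid] < THRESHOLD)             // GMProf-TH: stop counting at the threshold
            s_count[tid] += 1;                    // GMProf-NA: plain non-atomic update
        s[tid] = in[i];                           // profiled shared memory write

        __syncthreads();
        out[i] = s[tid];                          // (would be counted the same way)

        // fold this block's counts into device memory so the host can read them
        atomicAdd(&result_count[tid], (unsigned int)s_count[tid]);
    }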

21 Outline
Motivation; GMProf (Naïve Profiling Approach, Optimizations, Enhanced Algorithm); Evaluation; Conclusions

22 GMProf-Enhanced: Live Range Analysis
The number of accesses to a shared memory location may be misleading: the same shm_buf in shared memory holds data0, then data1, then data2 as they are staged from input_array in device memory to output_array in device memory. (figure: data0/data1/data2 flowing from input_array through shm_buf to output_array)
We need to count the accesses/reuse of the DATA, not of the address.

23 GMProf-Enhanced: Live Range Analysis
Track data during its live range in shared memory: use a logical clock to mark the boundary of each live range, and keep separate counters for each live range based on the logical clock.

    ...
    shm_buffer = input_array[0]    // load data0 from DM to ShM: live range of data0 begins
    ...
    output_array[0] = shm_buffer   // store data0 from ShM to DM: live range of data0 ends
    ...
    ...
    shm_buffer = input_array[1]    // load data1 from DM to ShM: live range of data1 begins
    ...
    output_array[1] = shm_buffer   // store data1 from ShM to DM: live range of data1 ends
    ...
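A sketch of the enhanced scheme with assumed names and sizes: the logical clock is advanced every time a new chunk is loaded into shm_buf, and a separate set of counters is kept for each live range, so what gets reported is reuse of the data currently in the buffer, not reuse of the buffer's address.

    #define BUF_N  256                             // elements staged per chunk (assumed)
    #define MAX_LR 16                              // max live ranges tracked per block (assumed)

    __global__ void live_range_profiled(const float *input_array, float *output_array, int chunks)
    {
        __shared__ float         shm_buf[BUF_N];
        __shared__ unsigned char lr_count[MAX_LR][BUF_N];  // one counter set per live range
        __shared__ int           lr_clock;                  // logical clock = current live range id

        int tid = threadIdx.x;                      // assumes blockDim.x == BUF_N
        for (int k = 0; k < MAX_LR; k++)
            lr_count[k][tid] = 0;
        if (tid == 0) lr_clock = -1;
        __syncthreads();

        for (int c = 0; c < chunks && c < MAX_LR; c++) {
            if (tid == 0) lr_clock++;               // new data loaded: a new live range begins
            __syncthreads();

            shm_buf[tid] = input_array[c * BUF_N + tid];    // load data_c from DM to ShM
            __syncthreads();

            lr_count[lr_clock][tid] += 1;           // count this access against live range c
            output_array[c * BUF_N + tid] = shm_buf[tid];   // store data_c from ShM to DM
            __syncthreads();
        }
        // the per-live-range counts would be written back to device memory here
    }

Reading lr_count[c][...] afterwards tells how often the data of chunk c was actually reused while it lived in shared memory.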

24 Outline
Motivation; GMProf (Naïve Profiling Approach, Optimizations, Enhanced Algorithm); Evaluation; Conclusions

25 Methodology
Platform: NVIDIA Tesla C1060 GPU with 240 cores (30×8) at 1.296 GHz, 16 KB shared memory per SM, and 4 GB device memory; 2× AMD Opteron CPUs at 2.6 GHz with 8 GB main memory; Linux kernel 2.6.32; CUDA Toolkit 3.0.
Six applications: Co-clustering, EM clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC.

26 Runtime Overhead for Profiling Shared Memory Use
(bar chart: overheads range from 2.6x up to 648x, depending on the profiling configuration and application)

27 Runtime Overhead for Profiling Device Memory Use
(bar chart: overheads range from 1.6x up to 197x, depending on the profiling configuration and application)

28 Case Study I: Put the most frequently used data into shared memory
bo_v1: a naïve implementation where all data arrays are stored in device memory. A1 ~ A4 are four data arrays; (N) is the average access # of the elements in the corresponding data array.
Profiling result for bo_v1:
  GMProf-basic:   ShM: 0   DM: A1(276) A2(276) A3(128) A4(1)
  GMProf w/o TH:  ShM: 0   DM: A1(276) A2(276) A3(128) A4(1)
  GMProf w/ TH:   ShM: 0   DM: A1(THR) A2(THR) A3(128) A4(1)

29 Case Study I: Put the most frequently used data into shared memory
bo_v2: an improved version which puts the most frequently used arrays (identified by GMProf) into shared memory.
Profiling result for bo_v2:
  GMProf-basic:   ShM: A1(174,788) A2(169,221)   DM: A3(128) A4(1)
  GMProf w/o TH:  ShM: A1(165,881) A2(160,315)   DM: A3(128) A4(1)
  GMProf w/ TH:   ShM: A1(THR) A2(THR)           DM: A3(128) A4(1)
bo_v2 outperforms bo_v1 by a factor of 39.63.
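The bo_v1/bo_v2 kernels themselves are not shown in this transcript; the sketch below only illustrates the kind of rewrite the profile suggests, with placeholder array names and sizes: the heavily accessed array is staged into shared memory before the compute loop.

    #define TILE 256                               // assumed size of the hot portion of A1

    // bo_v1 style: every read of the hot array A1 goes to device memory.
    __global__ void bo_v1_style(const float *A1, float *result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int k = 0; k < TILE; k++)
            acc += A1[k];                          // repeated device memory reads
        if (i < n) result[i] = acc;
    }

    // bo_v2 style: the hot data is first copied into shared memory,
    // and the loop reads the fast copy instead.
    __global__ void bo_v2_style(const float *A1, float *result, int n)
    {
        __shared__ float sA1[TILE];
        for (int k = threadIdx.x; k < TILE; k += blockDim.x)
            sA1[k] = A1[k];                        // cooperative staging into shared memory
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int k = 0; k < TILE; k++)
            acc += sA1[k];                         // repeated reads now hit shared memory
        if (i < n) result[i] = acc;
    }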

30 Case Study II: Identify the true reuse of data
jcb_v1: the shared memory is accessed frequently, but there is little reuse of the data.
Profiling result for jcb_v1:
  GMProf-basic:         ShM: shm_buf(5,760)   DM: in(4) out(1)
  GMProf w/o Enh. Alg.: ShM: shm_buf(5,748)   DM: in(4) out(1)
  GMProf w/ Enh. Alg.:  ShM: shm_buf(2)       DM: in(4) out(1)
Profiling result for jcb_v2:
  GMProf-basic:         ShM: shm_buf(4,757)   DM: in(1) out(1)
  GMProf w/o Enh. Alg.: ShM: shm_buf(4,741)   DM: in(1) out(1)
  GMProf w/ Enh. Alg.:  ShM: shm_buf(4)       DM: in(1) out(1)
jcb_v2 outperforms jcb_v1 by 2.59 times.

31 Outline
Motivation; GMProf (Naïve Profiling Approach, Optimizations); Evaluation; Conclusions

32 Conclusions
GMProf: a statically-assisted dynamic profiling approach with architecture-based optimizations and live range analysis to capture the real usage of data. It is low-overhead and fine-grained, and may be applied to profile other events.
Thanks!

