GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal
Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

GPU Programming Gets Popular
- Many domains use GPUs for high performance, e.g., GPU-accelerated molecular dynamics and GPU-accelerated seismic imaging.
- GPUs are available in both high-end and low-end systems: the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012], and commodity desktops/laptops come equipped with GPUs.

Writing Efficient GPU Programs is Challenging
- Requires careful management of a large number of threads, organized into thread blocks.
- Requires careful use of the multi-layer memory hierarchy: shared memory, the L1 cache, and the read-only data cache are fast but small, while device memory (DRAM) behind the L2 cache is large but slow.

[Figure: Kepler GK110 memory hierarchy. Thread blocks access Shared Memory, L1 Cache, and the Read-only Data Cache (fast but small), backed by the L2 Cache and DRAM/device memory (large but slow).]

Writing Efficient GPU Programs is Challenging
- Which data in shared memory are infrequently accessed?
- Which data in device memory are frequently accessed?
- Existing tools can't help much: they are inapplicable to GPUs, too coarse-grained, impose prohibitive runtime overhead, or cannot handle irregular/indirect accesses.

Outline
- Motivation
- GMProf
  - Naïve Profiling Approach
  - Optimizations
  - Enhanced Algorithm
- Evaluation
- Conclusions

GMProf-basic: The Naïve Profiling Approach
- Shared memory profiling:
  - Integer counters count accesses to shared memory, one counter per shared memory element.
  - Counters are updated atomically to avoid race conditions among threads.
- Device memory profiling:
  - Integer counters count accesses to device memory, one counter per element of each user device memory array, since device memory is too large to be monitored as a whole (e.g., 6 GB).
  - Counters are updated atomically.
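
To make the baseline concrete, here is a minimal sketch of what GMProf-basic's instrumentation could look like in CUDA. The counter array, sizes, and kernel below are illustrative assumptions, not GMProf's actual generated code:

    #define SHM_SIZE 256
    // Hypothetical counter array: one 32-bit counter per shared memory element.
    __device__ unsigned int shm_cnt[SHM_SIZE];

    __global__ void kernel_basic(float *out)
    {
        __shared__ float s[SHM_SIZE];

        // GMProf-basic: every shared memory access is preceded by an
        // atomic update of the corresponding counter.
        atomicAdd(&shm_cnt[threadIdx.x], 1u);
        s[threadIdx.x] = 3.0f;                 // the monitored store
        __syncthreads();

        atomicAdd(&shm_cnt[threadIdx.x], 1u);
        out[threadIdx.x] = s[threadIdx.x];     // the monitored load
    }

Launched as, e.g., kernel_basic<<<1, SHM_SIZE>>>(d_out). Device memory profiling is analogous, with one counter per element of each user array.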

Outline
- Motivation
- GMProf
  - Naïve Profiling Approach
  - Optimizations
  - Enhanced Algorithm
- Evaluation
- Conclusions

GMProf-SA: Static Analysis Optimization
Observation I: Many memory accesses can be determined statically.

    __shared__ int s[];
    ...
    s[threadIdx.x] = 3;

There is no need to count such an access at runtime.

How about this?

    __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

GMProf-SA: Static Analysis Optimization
Observation II: Some accesses are loop-invariant.
E.g., s[input[c]] is irrelevant to the outer loop iterator r, so there is no need to profile it in every r iteration.

    __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

Observation III: Some accesses are tid-invariant.
E.g., s[input[c]] is irrelevant to threadIdx, so there is no need to update the counter in every thread.
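
Putting Observations II and III together, here is a hedged sketch of what the instrumented loop could look like after hoisting. Loop bounds, sizes, and names are assumptions for illustration; since the trip count of r and the number of threads are known statically, an exact total could be recovered afterwards by scaling:

    #define SHM_SIZE 256
    __device__ unsigned int shm_cnt[SHM_SIZE];

    __global__ void kernel_sa(const int *input, float *out, int R, int C)
    {
        __shared__ float s[SHM_SIZE];   // assumed initialized earlier in a real kernel

        // s[input[c]] is loop-invariant w.r.t. r and tid-invariant, so the
        // counter updates are hoisted out of the r loop and executed by a
        // single thread instead of once per iteration per thread.
        if (threadIdx.x == 0)
            for (int c = 0; c < C; ++c)
                atomicAdd(&shm_cnt[input[c]], 1u);
        __syncthreads();

        float temp = 0.0f;
        for (int r = 0; r < R; ++r)
            for (int c = 0; c < C; ++c)
                temp += s[input[c]];    // the access itself stays uninstrumented
        out[threadIdx.x] = temp;
    }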

GMProf-NA: Non-Atomic Operation Optimization
- Atomic operations cost a lot: atomicAdd(&counter, 1) serializes all concurrent threads that update a shared counter.
- GMProf therefore uses non-atomic operations to update counters; thanks to the other optimizations, this does not impact the overall accuracy.
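
In CUDA terms, the change is a one-liner (a sketch; cnt and i are placeholder names):

    atomicAdd(&cnt[i], 1u);   // GMProf-basic: colliding threads are serialized
    cnt[i] += 1;              // GMProf-NA: plain increment; occasional lost updates
                              // are tolerable since only approximate counts are needed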

GMProf-SM: Shared Memory Counters Optimization
- Make full use of shared memory: store the counters themselves in shared memory when possible.
- Reduce counter size, e.g., from 32-bit integer counters to 8-bit counters.

GMProf-TH: Threshold Optimization
- A precise count may not be necessary; e.g., it is enough to know that A is accessed about 10 times while B is accessed more than 100 times.
- Stop counting once a counter reaches a certain threshold: a tradeoff between accuracy and overhead.
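
A hedged sketch combining the two ideas, with assumed names, sizes, and threshold: 8-bit counters kept in shared memory, saturated once they reach the threshold (indices in input[] are assumed to be < SHM_SIZE):

    #define SHM_SIZE 256
    #define THRESHOLD 255     // assumed saturation value, reported as "THR"

    __global__ void kernel_sm_th(const int *input, float *out, int C)
    {
        __shared__ float s[SHM_SIZE];
        __shared__ unsigned char cnt[SHM_SIZE];   // 8-bit counters in shared memory

        if (threadIdx.x < SHM_SIZE) {
            s[threadIdx.x]   = 0.0f;
            cnt[threadIdx.x] = 0;
        }
        __syncthreads();

        float temp = 0.0f;
        for (int c = 0; c < C; ++c) {
            int idx = input[c];
            if (cnt[idx] < THRESHOLD)
                cnt[idx] += 1;    // non-atomic (GMProf-NA); stops at the threshold
            temp += s[idx];
        }
        out[threadIdx.x] = temp;
    }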

Outline
- Motivation
- GMProf
  - Naïve Profiling Approach
  - Optimizations
  - Enhanced Algorithm
- Evaluation
- Conclusions

GMProf-Enhanced: Live Range Analysis
The number of accesses to a shared memory location may be misleading: the same shm_buf location in shared memory may successively hold different data (data0, data1, data2) staged from input_array in device memory and written back to output_array in device memory. We need to count the accesses to, and reuse of, the DATA, not the address.

[Figure: shm_buf in shared memory is reused to stage data0, data1, and data2 between input_array and output_array in device memory.]

GMProf-Enhanced: Live Range Analysis
- Track data during its live range in shared memory.
- Use a logical clock to mark the boundaries of each live range, and keep separate counters for each live range based on the logical clock.

    shm_buffer = input_array[0];    // load data0 from DM to ShM: live range of data0 begins
    output_array[0] = shm_buffer;   // store data0 from ShM to DM
    shm_buffer = input_array[1];    // load data1 from DM to ShM: live range of data1 begins
    output_array[1] = shm_buffer;   // store data1 from ShM to DM
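
A hedged sketch of how a logical clock could separate counters per live range; the names, the single-element buffer, and the report_count stub are illustrative, not GMProf's actual mechanism:

    __device__ void report_count(unsigned int clk, unsigned int cnt)
    {
        // stub: record (live range id, access count) for offline analysis
    }

    __global__ void kernel_lr(const float *input_array, float *output_array)
    {
        __shared__ float shm_buffer;
        __shared__ unsigned int clk;   // logical clock: current live range id
        __shared__ unsigned int cnt;   // access count within the current live range

        if (threadIdx.x == 0) { clk = 0; cnt = 0; }
        __syncthreads();

        for (int i = 0; i < 2; ++i) {
            if (threadIdx.x == 0) {
                report_count(clk, cnt);        // flush the previous range's count
                ++clk;                          // a DM->ShM load starts a new live range
                cnt = 0;
                shm_buffer = input_array[i];    // load data_i into shared memory
            }
            __syncthreads();
            if (threadIdx.x == 0) {
                cnt += 1;                       // count the reuse within this range
                output_array[i] = shm_buffer;   // store data_i back to device memory
            }
            __syncthreads();
        }
    }

With one use per range, each range's count stays near 1, exposing the "frequently accessed but rarely reused" pattern of the Jacobi case study below.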

Outline
- Motivation
- GMProf
  - Naïve Profiling Approach
  - Optimizations
  - Enhanced Algorithm
- Evaluation
- Conclusions

Methodology
- Platform
  - GPU: NVIDIA Tesla C1060, 240 cores (30 SMs × 8), 1.296 GHz, 16 KB shared memory per SM, 4 GB device memory
  - CPU: 2× AMD Opteron, 2.6 GHz, 8 GB main memory
  - Linux, CUDA Toolkit 3.0
- Six applications: Co-clustering, EM Clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC

Runtime Overhead for Profiling Shared Memory Use
[Bar chart: per-application profiling overheads; labeled values include 182x, 144x, 648x, 181x, 113x, 2.6x, 90x, and 648x.]

Runtime Overhead for Profiling Device Memory Use
[Bar chart: per-application profiling overheads; labeled values include 83x, 197x, 48.5x, and 1.6x.]

Case Study I: Put the most frequently used data into shared memory

bo_v1: a naïve implementation in which all data arrays are stored in device memory.
A1-A4: four data arrays; (N) denotes the average number of accesses per element of the corresponding array.

Profiling result:

            GMProf-basic        GMProf w/o TH       GMProf w/ TH
    ShM     0                   0                   0
    DM      A1(276) A2(276)     A1(276) A2(276)     A1(THR) A2(THR)
            A3(128) A4(1)       A3(128) A4(1)       A3(128) A4(1)

Case Study I: Put the most frequently used data into shared memory (cont.)

bo_v2: an improved version that places the most frequently used arrays (identified by GMProf) in shared memory.

Profiling result:

            GMProf-basic        GMProf w/o TH       GMProf w/ TH
    ShM     A1(174,788)         A1(165,881)         A1(THR)
            A2(169,221)         A2(160,315)         A2(THR)
    DM      A3(128) A4(1)       A3(128) A4(1)       A3(128) A4(1)

bo_v2 outperforms bo_v1 by a factor of 39.63.

Case Study II: Identify the true reuse of data

jcb_v1: shared memory is accessed frequently, but there is little reuse of the data.

Profiling result:

            GMProf-basic        GMProf w/o Enh. Alg.  GMProf w/ Enh. Alg.
    ShM     shm_buf(5,760)      shm_buf(5,748)        shm_buf(2)
    DM      in(4) out(1)        in(4) out(1)          in(4) out(1)

jcb_v2:

            GMProf-basic        GMProf w/o Enh. Alg.  GMProf w/ Enh. Alg.
    ShM     shm_buf(4,757)      shm_buf(4,741)        shm_buf(4)
    DM      in(1) out(1)        in(1) out(1)          in(1) out(1)

jcb_v2 outperforms jcb_v1 by a factor of 2.59.

Outline
- Motivation
- GMProf
  - Naïve Profiling Approach
  - Optimizations
- Evaluation
- Conclusions

Conclusions
GMProf
- A statically-assisted dynamic profiling approach
- Architecture-based optimizations
- Live range analysis to capture the real usage of data
- Low-overhead and fine-grained
- May be applied to profile other events

Thanks!