A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs*
Muthu Baskaran 1, Uday Bondhugula 1, Sriram Krishnamoorthy 1, J. Ramanujam 2, Atanas Rountev 1, P. Sadayappan 1
1 Department of Computer Science & Engineering, The Ohio State University
2 Department of Electrical and Computer Engineering, Louisiana State University
* Supported by NSF

Introduction
• Emergence of many-core architectures
◦ High computation power
◦ E.g., GPUs
• Development of high-performance codes on such architectures is non-trivial
• CUDA parallel programming model for NVIDIA GPUs
◦ Good abstraction of the underlying architecture
◦ Not straightforward to write high-performance CUDA code

Introduction
• Optimizations needed to address architectural challenges
◦ Memory access model
◦ Granularity and levels of parallelism in the architecture
• Solution: a compiler infrastructure to automatically generate efficient parallel programs
• PLuTo compiler framework [PLDI'08], recently developed for general-purpose multi-core targets
◦ Sequential C to OpenMP parallel tiled code
• Goal: develop a framework to automatically generate parallel CUDA code

Polyhedral Model
◦ An algebraic framework for representing affine programs – statement domains, dependences, array access functions – and affine program transformations
◦ Regular affine programs
◦ Dense arrays
◦ Loop bounds: affine functions of outer loop variables, constants, and program parameters
◦ Array access functions: affine functions of surrounding loop variables, constants, and program parameters

Polyhedral Model

for (i=1; i<=7; i++)
  for (j=2; j<=6; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

Iteration vector: $x_{S1} = \begin{pmatrix} i \\ j \end{pmatrix}$

Statement domain (from the loop bounds $i \ge 1$, $i \le 7$, $j \ge 2$, $j \le 6$):
$$I_{S1} : \begin{pmatrix} 1 & 0 & -1 \\ -1 & 0 & 7 \\ 0 & 1 & -2 \\ 0 & -1 & 6 \end{pmatrix} \begin{pmatrix} i \\ j \\ 1 \end{pmatrix} \ge 0$$

Array access functions:
$$F_{1a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;(\text{for } a[i][j]), \quad F_{2a}(x_{S1}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;(\text{for } a[j][i]), \quad F_{3a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ -1 \end{pmatrix} \;(\text{for } a[i][j-1])$$

Affine statement transformation:
$$\Phi_S(x_S) = C_S \cdot \begin{pmatrix} x_S \\ n \\ 1 \end{pmatrix}$$

[Figure: the iteration space of S1, bounded by i ≥ 1, i ≤ 7, j ≥ 2, j ≤ 6]

PLuTo Framework
◦ Available at

NVIDIA GPU Architecture
• Two levels of parallelism
◦ Threads (processor cores), grouped into SIMD warps
◦ Thread blocks (multiprocessors)
• Various memory units
◦ Different memory access models
◦ Cache and local-store hierarchy
• Partitioning (e.g., registers) and sharing (e.g., shared memory) of resources
[Figure: 16 streaming multiprocessors (SM1–SM16), each with processor cores P1–P8, shared memory, registers, a constant cache, and a texture cache, connected to off-chip memory]
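To make the two levels of parallelism concrete, the sketch below (not from the slides; the kernel name, block size, and array size are illustrative assumptions) launches a grid of thread blocks, uses per-block shared memory as a scratchpad, and synchronizes the threads of a block:

#include <cuda_runtime.h>

// Two levels of parallelism: a grid of thread blocks (mapped to
// multiprocessors) and threads within each block (mapped to the SIMD
// cores of one multiprocessor, scheduled in warps).
__global__ void scale(float *out, const float *in, float alpha, int n)
{
    __shared__ float tile[256];               // per-block shared memory (scratchpad)
    int tid = threadIdx.x;                    // thread index within the block
    int gid = blockIdx.x * blockDim.x + tid;  // global index across the grid

    if (gid < n)
        tile[tid] = in[gid];                  // stage data in shared memory
    __syncthreads();                          // threads of one block can synchronize
    if (gid < n)
        out[gid] = alpha * tile[tid];
}

int main(void)
{
    int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    dim3 block(256);                  // threads per block
    dim3 grid((n + 255) / 256);       // thread blocks in the grid
    scale<<<grid, block>>>(d_out, d_in, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Registers hold each thread's private scalars, shared memory is visible to all threads of one block, and global memory (allocated with cudaMalloc) is visible to the whole grid, mirroring the resource partitioning and sharing described above.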

Performance Characterization of NVIDIA GeForce 8800 GTX
• Get insights into the optimizations to be addressed by a compiler framework
• Characterize key features of the machine architecture and their impact on different strategies:
◦ Global memory access
◦ Shared memory (scratchpad) access
◦ Concurrency and register pressure

Global Memory Access
• Measured memory read bandwidth for
◦ Different data sizes
◦ Blocked and cyclic distribution of data accesses among the threads of a single thread block

Global Memory Access
• Cyclic access has much higher bandwidth
• Due to a hardware optimization called global memory coalescing
◦ Accesses from consecutive threads of a (half) warp to consecutive locations are coalesced
◦ The base address of the (half) warp must be aligned to 4, 8, or 16 bytes
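A hedged illustration (not from the slides; the kernel names and the per-thread chunk parameter are assumptions) of the two access patterns compared in the bandwidth measurement. In the cyclic pattern, consecutive threads of a half-warp touch consecutive words at every step, which the hardware can coalesce into a single transaction; in the blocked pattern each thread walks its own contiguous chunk, so the simultaneous accesses of a half-warp are chunk words apart and cannot be coalesced:

// Cyclic (interleaved) reads: at each step, consecutive threads of a
// half-warp access consecutive words, which the hardware coalesces.
__global__ void read_cyclic(const float *in, float *out, int chunk)
{
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x * chunk;
    float sum = 0.0f;
    for (int k = 0; k < chunk; ++k)
        sum += in[base + k * blockDim.x + tid];
    out[blockIdx.x * blockDim.x + tid] = sum;
}

// Blocked reads: each thread walks its own contiguous range, so threads
// of a half-warp are 'chunk' words apart and the accesses do not coalesce.
__global__ void read_blocked(const float *in, float *out, int chunk)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * chunk;
    float sum = 0.0f;
    for (int k = 0; k < chunk; ++k)
        sum += in[base + k];
    out[tid] = sum;
}

Both kernels read exactly the same data per thread block; only the mapping of threads to data differs, which is why the measured bandwidth gap isolates the effect of coalescing.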

Optimizing Global Memory Access
• Determine the extent of reuse of arrays
• For arrays with sufficient reuse
◦ Copy from global memory to shared memory [PPoPP'08]
• For arrays with no reuse
◦ Find affine transformations enabling global memory coalescing
◦ If no suitable affine transformation enabling coalescing exists, copy to shared memory with possible global memory coalescing

Optimizing Global Memory Access
[Figure: access patterns of array a and vectors x, y over the loop dimensions i and j]

mv kernel:
for (i = 0; i < n; i++) {
P:  x[i] = 0;
    for (j = 0; j < n; j++)
Q:    x[i] += a[i][j] * y[j];
}

tmv kernel:
for (i = 0; i < n; i++) {
S:  x[i] = 0;
    for (j = 0; j < n; j++)
T:    x[i] += a[j][i] * y[j];
}
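As a hedged sketch of the copy-to-shared-memory strategy applied to the mv kernel (not the paper's generated code; the kernel name, the tile size TILE, one thread per row, blockDim.x == TILE, and n being a multiple of TILE are all assumptions): reading a[i][j] directly with one thread per row i would make consecutive threads access words n floats apart (uncoalesced), so the thread block first stages a TILE x TILE tile of a, and the reused vector y, in shared memory using a coalesced copy pattern:

#define TILE 32   // assumed tile size; threads per block = TILE

__global__ void mv_shared(const float *a, const float *y, float *x, int n)
{
    __shared__ float a_s[TILE][TILE];
    __shared__ float y_s[TILE];

    int row = blockIdx.x * TILE + threadIdx.x;   // the row of 'a' this thread owns
    float sum = 0.0f;

    for (int j0 = 0; j0 < n; j0 += TILE) {
        // Coalesced copy of a TILE x TILE tile of 'a': for each tile row r,
        // consecutive threads read consecutive words of that row.
        for (int r = 0; r < TILE; ++r) {
            int gr = blockIdx.x * TILE + r;
            a_s[r][threadIdx.x] = a[gr * n + j0 + threadIdx.x];
        }
        y_s[threadIdx.x] = y[j0 + threadIdx.x];  // y is reused by every row
        __syncthreads();

        for (int j = 0; j < TILE; ++j)
            sum += a_s[threadIdx.x][j] * y_s[j];
        __syncthreads();
    }
    x[row] = sum;
}

Note that the compute loop still reads a_s[threadIdx.x][j] with a stride of TILE words across the threads of a half-warp, i.e., it has shared memory bank conflicts; the padding technique discussed below removes them. For the tmv kernel, by contrast, the direct access a[j][i] is already contiguous across threads and can be coalesced without staging.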

Experimental Evaluation
[Table: performance comparison (in GFLOPS) of the mv kernel for problem sizes N = 4K and larger; columns: Direct Global vs. Copied to Shared (numeric values not preserved in the transcript)]

Experimental Evaluation
[Table: performance comparison (in GFLOPS) of the mv kernel, Direct Global vs. Copied to Shared, for problem sizes N = 4K and larger]
[Table: performance comparison (in GFLOPS) of the tmv kernel, Non-optimized Global vs. Optimized Global, for problem sizes N = 4K and larger]

Shared Memory Access
• Shared memory is organized into banks
◦ 16 banks on the NVIDIA 8800 GTX
◦ Successive 32-bit words reside in successive banks
• Bank conflicts in shared memory
◦ When n threads access different addresses in the same bank, the accesses are serialized into n requests (an n-way conflict)
◦ The bandwidth of shared memory access is inversely proportional to the degree of bank conflicts
• Goal: minimize shared memory bank conflicts

Optimizing Shared Memory Access
• Strategy to minimize bank conflicts during shared memory access
◦ Pad the arrays copied into shared memory
• Degree of bank conflicts = gcd(stride of the array access across the threads of a half-warp, number of bank modules)
• Cost of accessing a word in shared memory
◦ A linear function of the degree of bank conflicts
• Find the padding factor that minimizes the cost, considering all references to the array
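A hedged sketch (the kernel, tile size, and access pattern are assumptions) showing the padding rule in action: with 16 banks, a column-wise read of an unpadded 16x16 tile has a stride of 16 words across the threads of a half-warp, and gcd(16, 16) = 16 gives a 16-way conflict; padding each row by one word changes the stride to 17, and gcd(17, 16) = 1 makes the same read conflict-free:

#define BANKS 16   // shared memory banks on the GeForce 8800 GTX
#define TILE  16   // threads per block assumed equal to TILE

// Each thread sums one row of a TILE x TILE tile staged in shared memory.
__global__ void row_sums(const float *in, float *out)
{
    // Padded by one word per row: a column-wise read (fixed c, thread t
    // reads tile[t][c]) has a stride of TILE + 1 = 17 words across threads,
    // and gcd(17, BANKS) = 1, so the half-warp hits 16 distinct banks.
    // Without the padding the stride would be 16 and gcd(16, BANKS) = 16:
    // a 16-way bank conflict that serializes the accesses.
    __shared__ float tile[TILE][TILE + 1];

    int t = threadIdx.x;

    // Cooperative load: consecutive threads write consecutive words of a row.
    for (int r = 0; r < TILE; ++r)
        tile[r][t] = in[blockIdx.x * TILE * TILE + r * TILE + t];
    __syncthreads();

    float sum = 0.0f;
    for (int c = 0; c < TILE; ++c)
        sum += tile[t][c];       // column-wise across threads at each c
    out[blockIdx.x * TILE + t] = sum;
}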

Experimental Evaluation
[Table: performance comparison (in GFLOPS) of the mv kernel, Non-optimized Shared vs. Optimized Shared, for problem sizes N = 4K and larger]

Parallelism vs. Register Pressure
• Performance-enhancing approaches
◦ Reduce the number of loads/stores
◦ Increase ILP
◦ Reduce dynamic instructions (loop overhead reduction)
• Well-known optimization: loop unrolling
• Issues
◦ Increased register pressure
◦ Might reduce the number of concurrent threads (registers are partitioned among thread blocks)
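A hedged illustration of the trade-off (the kernel and unroll factor are assumptions, not the paper's code): unrolling the inner loop of a per-thread reduction removes branch and index-update overhead and exposes ILP through independent accumulators, but every extra live accumulator occupies a register for the lifetime of the loop:

// Per-thread partial dot products with the inner loop manually unrolled by 4.
// The four accumulators expose ILP and cut loop overhead, at the cost of
// additional registers per thread.
__global__ void dot_partial_unrolled(const float *a, const float *b,
                                     float *partial, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    int k = i;
    for (; k + 3 * stride < n; k += 4 * stride) {   // unrolled by 4
        s0 += a[k             ] * b[k             ];
        s1 += a[k +     stride] * b[k +     stride];
        s2 += a[k + 2 * stride] * b[k + 2 * stride];
        s3 += a[k + 3 * stride] * b[k + 3 * stride];
    }
    for (; k < n; k += stride)                      // remainder iterations
        s0 += a[k] * b[k];

    partial[i] = s0 + s1 + s2 + s3;
}

Whether the unrolled version wins depends on how many registers the compiler actually assigns and, through that, on how many threads remain resident per multiprocessor, which is the trade-off examined next.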

Parallelism vs. Register Pressure
• Higher thread-level parallelism is needed to mask global memory access latency
◦ Threads are scheduled in an interleaved manner to hide the latency of global memory accesses
• Trade-off between
◦ the number of active concurrent threads
◦ the number of registers available to a thread in a thread block
• Problem: register allocation cannot be managed by an external compilation framework
• Solution: empirical evaluation to select an optimal choice
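A hedged worked example of this trade-off, assuming the GeForce 8800 GTX's register file of 8,192 registers per multiprocessor (a figure from NVIDIA's documentation for this GPU, not stated on the slides):

$$\text{max resident threads per SM} \;\le\; \left\lfloor \frac{\text{registers per SM}}{\text{registers per thread}} \right\rfloor, \qquad \left\lfloor \frac{8192}{16} \right\rfloor = 512, \qquad \left\lfloor \frac{8192}{32} \right\rfloor = 256.$$

Doubling the per-thread register usage, e.g., through aggressive unrolling or register tiling, can thus halve the number of threads available to hide global memory latency; since the register count ultimately chosen by the CUDA toolchain is outside the framework's control, the choice is made empirically.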

Model-driven Empirical Search
• Need for empirical search
◦ Tight coupling among
▪ Program parameters – tile sizes, unroll factors
▪ System parameters – threads, thread blocks
▪ Resources – number of registers, shared memory
◦ Lack of control over register usage and allocation
• Model to estimate the number of loads/stores
◦ Analytically, in the polyhedral model
◦ Empirically, using the PTX code
• Register usage instrumentation
◦ Empirically, using the cubin object code

Model-driven Empirical Search
◦ Perform multilevel tiling (except register-level tiling)
◦ Generate optimal copy code
◦ Prune code versions by global memory traffic
◦ For all selected loop structures
▪ do register-level tiling and explicit unrolling
▪ instrument the register usage
▪ discard versions for which increased register pressure reduces concurrency to less than 25% of the maximum possible concurrency
◦ In all selected code versions, pad the arrays in shared memory with the optimal padding factor
◦ Search empirically among the remaining candidate loop structures
▪ explicitly run them, time the execution, and select the best one

Experimental Evaluation
[Chart: performance of matrix kernels]

Experimental Evaluation
[Chart: performance of matrix kernels]

Related Work
◦ Earlier GPU work
▪ Automatic generation of pixel shader operations from a high-level data-parallel language – Tarditi et al. [ASPLOS'06]
▪ Stream processing – Brook, RapidMind, PeakStream
◦ Considerable work on developing specific optimized algorithms and libraries for GPUs
▪ E.g., CUDPP – CUDA Data Parallel Primitives
◦ Very little work on general compiler optimization strategies for GPUs
▪ Performance metrics to prune the optimization search space on a Pareto-optimality basis – Ryoo et al. [CGO'08]
▪ Optimizing data communication between CPU and co-processor – Gelado et al. [ICS'08]

Summary
• Developed compiler optimizations to address key performance-influencing factors on NVIDIA GPUs
◦ Enable global memory coalescing in the polyhedral model for regular programs
◦ Reduce shared memory bank conflicts
◦ Determine optimized program parameters (unroll factors, tile sizes) through a model-driven empirical search

Ongoing and Future Work
• Automatic thread-centric CUDA code generation in the polyhedral model
• Data layout reordering to enhance memory accesses at various levels

Thank You!

How prevalent are affine computations?
• Innermost core computations in many codes
◦ Dense linear algebra
◦ Image and signal processing
◦ Computational electromagnetics (FDTD)
◦ Explicit PDE solvers (e.g., SWIM, SWEEP3D)
• Likely to be more prevalent in the future (especially in scientific codes)
◦ Codes with direct data access are better than those with indirect data access for power and performance
◦ Structured-sparse (block-sparse) codes are better than arbitrary sparse ones (e.g., OSKI)
◦ Algorithms with a sparse outer but regular inner structure may be more attractive for many-core processors, e.g., multi-block methods