An Insightful and Quantitative Performance Optimization Chain for GPUs
Jia Haipeng
Institute of Software, Chinese Academy of Sciences

Motivation
 Modern GPU architectures
   Increasingly diversified: NVIDIA GPUs, AMD GPUs
 Optimizing GPU kernels
   A challenging task that requires detailed knowledge of the underlying hardware
   Explicit parallelization and an explicit memory hierarchy
   Performance portability of GPU programs is more and more difficult for ordinary programmers
 Programmers with limited GPU hardware knowledge struggle with
   Implementing high-performance GPU kernels directly
   Identifying performance bottlenecks
   Choosing which optimizations to adopt, and in what order

OpenCL and GPU Architecture
 OpenCL
   Open Computing Language
   An open industry standard for general-purpose parallel programming across platforms
   Provides portable and efficient access to the power of heterogeneous computing platforms
 GPU
   Two major GPU vendors: NVIDIA and AMD
   They adopt different architectures
   But they share a number of architectural similarities

NVIDIA and AMD GPU: Architectural Similarities
 Hierarchical architecture
   GPU -> compute units -> processing cores
 Hierarchical memory model
   Off-chip memory (global and constant memory)
   On-chip memory (local memory, cache, and registers)
 Programming model
   SIMT (single instruction, multiple threads)
   Multithreading scheduling unit: the warp (NVIDIA) or wavefront (AMD)
 Thread organization
   Work-item (thread) -> work-group (block) -> grid (NDRange)
 Scheduling strategy
   Zero-overhead scheduling
   Warps/wavefronts execute interleaved to tolerate intra-warp stalls
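To make the thread hierarchy concrete, here is a minimal OpenCL kernel (not from the original slides; the kernel name is illustrative) that queries each level of the organization described above:

__kernel void hierarchy_demo(__global int *out)
{
    // Position within the whole NDRange (grid)
    size_t gid = get_global_id(0);
    // Position within the work-group (block)
    size_t lid = get_local_id(0);
    // Index of the work-group itself
    size_t grp = get_group_id(0);

    // Each work-item records its group index; work-items in the same
    // work-group share local memory and are scheduled in warps/wavefronts.
    out[gid] = (int)grp + (int)lid * 0;   // lid kept only to show the query
}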

NVIDIA and AMD GPU: Architectural Differences
 Design of the processing cores and register files
   NVIDIA GPU: scalar architecture
   AMD GPU: vector architecture
   This leads to different program optimization techniques
 This work focuses on the NVIDIA C2050 GPU and the AMD HD 5850 GPU

Memory-Bound and Computation-Bound Kernels
 Computation intensity
   Definition: the number of single-precision floating-point operations performed per byte of off-chip memory traffic
   Kernel computation intensity
    - Total amount of computation divided by the total amount of data transferred from off-chip memory
   Hardware computation intensity
    - Throughput of arithmetic instructions divided by the throughput of memory-access instructions
    - For simplicity, approximated as peak performance divided by peak memory bandwidth
 GPU kernels
   Memory-bound kernel
    - Kernel computation intensity < hardware computation intensity of the target GPU
    - The most effective optimizations improve the utilization of memory bandwidth
   Computation-bound kernel
    - Kernel computation intensity > hardware computation intensity of the target GPU
    - The most effective optimizations improve the utilization of computing resources
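Plugging in the hardware numbers from the configuration slide later in the deck gives a rough feel for these thresholds (a back-of-the-envelope calculation, not from the original slides):

  NVIDIA C2050: 1030 GFlops / 144 GB/s ≈ 7.2 flops per byte
  AMD HD 5850:  2090 GFlops / 128 GB/s ≈ 16.3 flops per byte

A kernel whose computation intensity lies below these values on the respective GPU is memory-bound; above them it is computation-bound.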

Performance Optimization Chain
 Threshold chain
   Utilization of off-chip memory bandwidth
 Tradeoff chain
   Utilization of computation resources
   Data locality
 Each architecture and each kernel has a different balance requirement between these aspects
 Performance depends on how well the kernel characteristics are mapped onto the hardware characteristics of the architecture

Threshold Chain
 A set of optimization methods that improve the utilization of off-chip memory bandwidth
 Optimization space (see the sketch below)
   Eliminating channel conflicts (ECC)
    - Contiguous memory access
   Reducing memory transactions (RMT)
    - Contiguous and aligned memory access
    - Vector memory access
   Using FastPath (UFP, AMD GPU only)
    - Vector memory access
    - AMD GPUs have a CompletePath and a FastPath
 These performance aspects must be satisfied or mitigated in order to achieve good performance
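As an illustration of the contiguous/aligned/vector access patterns above, the following OpenCL sketch (not from the original slides; kernel and parameter names are illustrative, and bounds checks are omitted) contrasts a vectorized contiguous copy with a strided scalar copy:

// Contiguous, vectorized copy: each work-item moves one float4, so a
// warp/wavefront touches one aligned, contiguous block of memory and the
// access can be serviced with few memory transactions (and, on AMD,
// through the FastPath).
__kernel void copy_vec4(__global const float4 *in,
                        __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];
}

// Strided scalar copy: neighbouring work-items touch addresses 'stride'
// elements apart, which splits the access into many memory transactions
// and can concentrate traffic on a few memory channels.
__kernel void copy_strided(__global const float *in,
                           __global float *out,
                           const int stride)
{
    size_t i = get_global_id(0) * stride;
    out[i] = in[i];
}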

Threshold Chain
[Figure: comparison of off-chip memory bandwidth utilization for different vector lengths, on the NVIDIA C2050 GPU and the AMD HD5850 GPU]

Threshold Chain
[Figure: comparison of off-chip memory bandwidth utilization for various strides and offsets, on the NVIDIA C2050 GPU and the AMD HD5850 GPU]
 Resulting threshold chains
   NVIDIA C2050 GPU: contiguous -> aligned -> vector
   AMD HD5850 GPU: contiguous -> vector -> aligned

Tradeoff Chain
 A set of optimization methods that make full use of the computation resources
 It is not always clear whether a particular performance aspect should be maximized or minimized for a given application on a given architecture
 The chain therefore provides insights for performance improvement rather than exact predictions
 Optimization space (see the sketch below)
   Improving thread-level parallelism (TLP)
   Improving instruction-level parallelism (ILP)
   Reducing the dynamic instruction count per thread (RDIC)
   Instruction selection optimizations (INS)
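A minimal OpenCL sketch of the ILP idea (not from the original slides; it assumes a SAXPY-like kernel whose length is a multiple of four times the global work size, and all names are illustrative):

__kernel void saxpy_ilp4(__global const float *x,
                         __global float *y,
                         const float a)
{
    // Each work-item processes 4 consecutive elements; the 4 loads and
    // 4 multiply-adds are independent, so the hardware can overlap them
    // (ILP) instead of relying only on other warps/wavefronts (TLP).
    size_t i = get_global_id(0) * 4;

    float x0 = x[i + 0], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
    float y0 = y[i + 0], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];

    y[i + 0] = a * x0 + y0;
    y[i + 1] = a * x1 + y1;
    y[i + 2] = a * x2 + y2;
    y[i + 3] = a * x3 + y3;
}

Processing several independent elements per work-item trades fewer work-items (less TLP) for more independent instructions per work-item (more ILP) at the cost of extra registers, which is exactly the tradeoff the chain has to balance.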

Tradeoff Chain
[Figure: performance comparison for different degrees of ILP, on the NVIDIA C2050 GPU and the AMD HD5850 GPU]
 Only one work-group is run on each compute unit
 Work-group (block) sizes and ILP degrees are varied

Data Locality
 Data locality
   Computation is cheap; data movement is expensive
   Maximize locality to minimize data movement
 Computation intensity wall
   Computation intensity can constrain performance like a wall
   Improving data locality increases a kernel's computation intensity
 Optimization methods (see the sketch below)
   Storing read-only data in cache or constant memory
   Improving data reuse
   Loop reordering
   Rewriting data structures
   Data padding
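An illustrative OpenCL fragment (a sketch under assumed names and sizes, not the authors' code; it assumes the image dimensions are multiples of the tile size and that coeff holds at least two values) combining two of these methods: read-only coefficients in constant memory and a padded local-memory tile for data reuse:

#define TILE 16

__kernel void locality_demo(__constant float *coeff,
                            __global const float *in,
                            __global float *out,
                            const int width)
{
    // Padded local-memory tile: the extra column keeps work-items in the
    // same column of the tile from hitting the same local-memory bank.
    __local float tile[TILE][TILE + 1];

    int lx = get_local_id(0);
    int ly = get_local_id(1);
    int gx = get_global_id(0);
    int gy = get_global_id(1);

    // Each element is loaded from global memory once and then reused
    // from local memory by neighbouring work-items.
    tile[ly][lx] = in[gy * width + gx];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Toy reuse: weight the element and its left neighbour in the tile
    // with coefficients read through the constant cache.
    float left = (lx > 0) ? tile[ly][lx - 1] : 0.0f;
    out[gy * width + gx] = coeff[0] * tile[ly][lx] + coeff[1] * left;
}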

Insightful Optimization Chain
 The Roofline model is used to make the optimization chain insightful
[Figure: Roofline-based optimization chains for the NVIDIA C2050 GPU and the AMD HD5850 GPU]
 The computation intensity of a kernel determines its optimization region
 Each node of the optimization chain suggests the corresponding optimization method
 The order of the nodes suggests the optimization order
 The ridge point marks the minimum computation intensity required to reach peak performance
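In Roofline terms (the standard formulation, which the slide presents only graphically), the attainable performance of a kernel with computation intensity CI is

  attainable performance ≈ min(peak floating-point throughput, CI × peak memory bandwidth)

so the ridge point coincides with the hardware computation intensity estimated earlier (≈7.2 flops/byte on the C2050, ≈16.3 flops/byte on the HD 5850): kernels to the left of it should work through the threshold chain, kernels to the right through the tradeoff chain.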

Experimental Evaluation
 Configuration of the GPUs (blank cells were missing in the source)

                        NVIDIA C2050    AMD HD 5850
   Clock rate           1.15 GHz
   #PEs
   #CUs                 14              18
   Peak perf.           1030 GFlops     2090 GFlops
   Memory               3.0 GB          1.0 GB
   Peak bandwidth       144 GB/s        128 GB/s
   #Registers/CU        16K
   Local memory/CU      48K             32K
   SDK version          SDK 4.1         SDK 2.6

 Case studies
   Matrix transpose
   Laplace transform
   Image integral

1. Matrix Transpose
 Algorithm
   Input and output matrices are stored at separate memory locations
   An offset of 4 bytes is applied to test the performance impact of alignment
   The computation intensity on char data is 2 * 4 / 8 = 1
 Optimization chain (see the sketch below)
   Use char4 instead of char so the AMD GPU takes the FastPath
   Use local memory to remap threads onto tile elements
   Diagonal block reordering (eliminates channel conflicts)
   Set the offset value to 0
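A simplified OpenCL sketch of the local-memory stage of such a transpose (illustrative only: it uses float instead of char, assumes the matrix dimensions are multiples of the tile size, and omits the char4/FastPath vectorization and the diagonal block reordering from the chain above):

#define TILE 16

__kernel void transpose_tiled(__global const float *in,
                              __global float *out,
                              const int width,    /* columns of in */
                              const int height)   /* rows of in    */
{
    // Padded tile: the extra column avoids local-memory bank conflicts
    // when the tile is read back column-wise.
    __local float tile[TILE][TILE + 1];

    int lx = get_local_id(0);
    int ly = get_local_id(1);

    // Contiguous, coalesced read of one tile of the input.
    int x = get_group_id(0) * TILE + lx;
    int y = get_group_id(1) * TILE + ly;
    tile[ly][lx] = in[y * width + x];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Swap the block coordinates and write the transposed tile; adjacent
    // work-items still write adjacent addresses, so the store also stays
    // contiguous.
    int tx = get_group_id(1) * TILE + lx;
    int ty = get_group_id(0) * TILE + ly;
    out[ty * height + tx] = tile[lx][ly];
}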

1. Matrix Transpose
 The bottleneck is off-chip memory channel conflicts
 Vector memory access improves performance more on the AMD HD5850 GPU than on the NVIDIA C2050 GPU
 Alignment has an important influence on performance for the NVIDIA GPU
 Performance improved by 26.1x on the AMD GPU and 42.4x on the NVIDIA GPU

2. Laplace Transform
 The Laplace transform computes the Laplace value of the source matrix by adding up the second x and y derivatives, calculated with a Laplacian stencil
 Computation intensity is 67 / (9 * 4) = 1.8
   Calculating each element performs 67 operations
    - 8 additions + 9 multiplications + 10 * 4 address calculations + 9 iterations
   Calculating each element reads 9 operands (9 * 4 bytes) from off-chip memory
 After improving data locality, the computation intensity increases to 12.6
   Utilizing local memory
   Moving the Laplacian matrix to constant memory (see the sketch below)
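A minimal OpenCL sketch of a 3x3 Laplacian stencil with the coefficient matrix held in constant memory (local-memory tiling of the source image is omitted for brevity; all names are illustrative, not the authors' code):

__kernel void laplace3x3(__global const float *src,
                         __global float *dst,
                         __constant float *laplacian,   /* 9 coefficients */
                         const int width,
                         const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;                       // skip the border for simplicity

    // The coefficients are read through the constant cache rather than
    // global memory, raising the kernel's computation intensity.
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += laplacian[(dy + 1) * 3 + (dx + 1)] *
                   src[(y + dy) * width + (x + dx)];

    dst[y * width + x] = sum;
}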

2. Laplace Transform
 Increasing data locality improves performance significantly
 Increasing ILP has a tradeoff effect on performance because of register pressure
 Using ILP is more efficient on the AMD GPU than on the NVIDIA GPU
 Performance improved by 14.1x on the AMD GPU and 7.8x on the NVIDIA GPU

3. Image Integral
 Used for rapid feature evaluation, e.g., in face detection
 Implemented by running the scan algorithm twice, first on the matrix rows and then on the columns
 Two phases (see the sketch below)
   Up-sweep phase: traverse the tree from the leaves to the root, computing partial sums at the internal nodes
   Down-sweep phase: traverse back down the tree from the root, using the partial sums computed in the up-sweep phase to build the scan in place on the array
[Figure: up-sweep phase and down-sweep phase]
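A sketch of the work-efficient (Blelloch) scan building block, one row per work-group in local memory, patterned after the classic GPU Gems 3 formulation (not the authors' code); it assumes the row length N is a power of two equal to twice the work-group size, and it omits the row/column composition and the handling of longer rows:

#define N 256   /* elements scanned per work-group (power of two) */

__kernel void scan_row(__global const float *in,
                       __global float *out)
{
    __local float temp[N];

    int tid = get_local_id(0);          // work-group size is N/2
    int row = get_group_id(0) * N;      // one row per work-group

    // Each work-item loads two elements of its row.
    temp[2 * tid]     = in[row + 2 * tid];
    temp[2 * tid + 1] = in[row + 2 * tid + 1];

    // Up-sweep: build partial sums at the internal nodes of the tree.
    int offset = 1;
    for (int d = N >> 1; d > 0; d >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Clear the root, then down-sweep: push the partial sums back down
    // the tree to produce an exclusive scan in place.
    if (tid == 0)
        temp[N - 1] = 0.0f;
    for (int d = 1; d < N; d <<= 1) {
        offset >>= 1;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    out[row + 2 * tid]     = temp[2 * tid];
    out[row + 2 * tid + 1] = temp[2 * tid + 1];
}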

3. Image Integral
 Increasing data locality improves performance significantly
 The work-efficient algorithm improves performance by reducing algorithmic complexity
 Increasing ILP has a tradeoff effect on performance because of register pressure
 Performance improved by 8.2x on the AMD GPU and 7.1x on the NVIDIA GPU

Conclusions
 Proposed an insightful and quantitative performance optimization chain for both NVIDIA and AMD GPUs
 The optimization chain captures almost all of the primary performance factors
 With its help, ordinary programmers can write high-performance kernels directly and easily

Thank You! Q&A