Auto-tuning Dense Matrix Multiplication for GPGPU with Cache


Auto-tuning Dense Matrix Multiplication for GPGPU with Cache ICPADS 2010: International Conference on Parallel and Distributed Systems Xiang Cui, Yifeng Chen, Changyou Zhang and Hong Mei Presenter: Shih-Meng Teng

Outline Introduction; Fermi's new features for the programmer; Related work of GEMM on GPUs; Auto-tuned matrix multiplication on Fermi (CUDA GEMM code template, auto-tuning the GEMM code template); Experiment; Conclusion

Introduction GPU architecture: GT200 -> Fermi. Compared with the GT200 architecture, the new features include: improved double-precision performance, an L1/L2 cache hierarchy, a larger register file, more shared memory, ECC support, and faster atomic operations. Note on atomic operations: an atomic operation is one that cannot be interrupted; typically each atomic operation corresponds to a special CPU instruction. Among all thread-synchronization mechanisms, atomic operations are the fastest because no locking is required, which makes them the key building block of lock-free code. Besides common operations such as add, subtract, and, or, not, and assignment, there are also compound operations such as Compare-and-Swap, Test-and-Set, and Add/Sub-and-Get.

Introduction (cont.) The added cache on one hand takes advantage of data locality in runtime but on the other hand makes performance less predictable. Programmers must understand these constraints of the hardware platform well to achieve high performance on Fermi.

Introduction (cont.) Automatic performance tuning, or auto-tuning for short, is a practical technique for getting near-optimal code on complex and unpredictable computational architectures. In this work, auto-tuning is used to optimize GEMM code on Fermi.

Introduction (cont.) Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on a Tesla C2050, which is about a 1.7x and 1.6x speedup with respect to CUBLAS 3.0. CUBLAS 3.0 is the fastest GEMM implementation written in CUDA C that does not tune at the level of binary code.

Fermi's new features for the programmer Programmers must understand the following features: L1/L2 cache, register file usage, 32/64-bit device code, global memory access, bank conflicts, and concurrent execution of multiple kernels.

Fermi's new features for the programmer (cont.) L1/L2 cache: compared to the GT200 architecture, Fermi comes with an L1/L2 cache for local and global memory accesses.

Fermi's new features for the programmer (cont.) L1/L2 cache (cont.) Programmers have some control over L1 caching: the same on-chip memory is used for both L1 and shared memory, and how much of it acts as L1 versus shared memory is configurable per kernel call. Kernels that use a lot of local memory can benefit from 48 KB of L1 cache.
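As an illustration (not code from the paper), the CUDA runtime API exposes this per-kernel choice; the kernel name gemm_kernel below is a placeholder:

#include <cuda_runtime.h>

__global__ void gemm_kernel(const float *A, const float *B, float *C, int n) { /* body omitted in this sketch */ }

void configure_cache(void)
{
    // Prefer 48 KB of L1 / 16 KB of shared memory for kernels that spill to local memory ...
    cudaFuncSetCacheConfig(gemm_kernel, cudaFuncCachePreferL1);
    // ... or prefer 48 KB of shared memory / 16 KB of L1 for kernels that stage tiles in shared memory:
    // cudaFuncSetCacheConfig(gemm_kernel, cudaFuncCachePreferShared);
}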

Fermi's new features for the programmer (cont.) L1/L2 cache (cont.) In addition to the L1 cache, Fermi also features a unified L2 cache of 768 KB. Considering the unpredictability of the cache's effect, auto-tuning is a practical approach to obtaining high-performance CUDA code.

Fermi's new features for the programmer (cont.) Register file: the register file should be used as the primary on-chip storage space; if the algorithm permits, shared memory should be used less intensively in favor of the register file.

Fermi's new features for the programmer (cont.) Register file (cont.) The number of cores per multiprocessor: GT200: 8, Fermi: 32. The number of registers per multiprocessor: GT200: 16K, Fermi: 32K. This means the number of registers per core is halved: GT200: 16K / 8 = 2K, Fermi: 32K / 32 = 1K.

Fermi's new features for the programmer (cont.) 32/64-bit device code: on the Fermi architecture, if the application is built in 64-bit mode, the compiler nvcc will compile both the host code and the device code in 64-bit mode. The larger pointers in the device code incur a performance penalty due to the extra space those pointers occupy in the register file.

Fermi's new features for the programmer (cont.) 32/64-bit device code (cont.) In this implementation of GEMM on Fermi, the host code and the device code are always compiled separately to avoid this 64-bit pointer performance penalty.

Fermi's new features for the programmer (cont.) Global memory access: on GT200, global memory accesses are processed per half-warp; on Fermi, global memory accesses are processed per warp.

Fermi's new features for the programmer (cont.) Global memory access (cont.) Two-dimensional thread blocks, for example, should have their x dimension (dimBlock.x) be a multiple of the warp size, as opposed to half the warp size, so that each warp addresses a single cache line when accessing global memory.
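For illustration, a launch configuration in this spirit might look as follows; the block shape is an example rather than the paper's tuned value, and dA, dB, dC, M, N and the kernel name are assumed to be defined elsewhere:

dim3 dimBlock(32, 8);   // x dimension equals the warp size; 32 * 8 = 256 threads per block
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,   // one block per 32-column strip of C
             (M + dimBlock.y - 1) / dimBlock.y);
gemm_kernel<<<dimGrid, dimBlock>>>(dA, dB, dC, M, N);   // each warp now touches one cache line per coalesced load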

Fermi's new features for the programmer (cont.) Bank conflicts: GT200 has 16 shared-memory banks and accesses are processed by half-warps; Fermi has 32 banks and accesses are processed by warps. 64-bit accesses are specifically handled to minimize bank conflicts.
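The classic padding trick shows how bank conflicts are avoided in shared memory. The transpose kernel below is a generic sketch, not taken from the paper; it assumes n is a multiple of 32 and is launched with 32 x 32 thread blocks on an (n/32, n/32) grid:

__global__ void transpose32(const float *in, float *out, int n)
{
    // 33 columns instead of 32: the extra padding column makes the threads of a warp
    // hit 32 different banks when they read a column of the tile (Fermi has 32 banks).
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced, conflict-free store
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read hits 32 distinct banks
}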

Fermi's new features for the programmer (cont.) Concurrent execution of multiple kernels: Fermi supports concurrent kernel execution, so different kernels of the same application context can execute on the GPU at the same time. An application can therefore execute a number of small kernels to utilize the whole GPU.
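A minimal sketch of the mechanism (kernel names, launch parameters and the arguments dX, dY are illustrative): kernels launched into different non-default streams of the same context may overlap on Fermi.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
small_kernel_a<<<64, 128, 0, s0>>>(dX);   // issued to stream s0
small_kernel_b<<<64, 128, 0, s1>>>(dY);   // issued to stream s1; may run concurrently with s0
cudaDeviceSynchronize();                  // wait for both streams to finish
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);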

Related work of GEMM on GPUs CUBLAS 1.0 NVIDIA provides an example in the SDK that explains how to use shared memory to hide memory latency. The SGEMM in CUBLAS 1.0, which uses shared memory to store sub-matrices of A and B, is not fully optimized.
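The SDK-style kernel referred to here stages square tiles of A and B in shared memory. A minimal sketch of that pattern (the tile size 16 is chosen for illustration, and n is assumed to be a multiple of the tile size; this is not the CUBLAS source):

__global__ void sgemm_tiled(const float *A, const float *B, float *C, int n)
{
    const int TILE = 16;
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each block cooperatively stages one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];   // both operands come from shared memory
        __syncthreads();
    }
    C[row * n + col] = acc;
}

Note that every multiply-add here reads both operands from shared memory; reducing exactly that data movement is the point of the register-blocking approach on the next slide.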

Related work of GEMM on GPUs (cont.) Volkov and Demmel (CUBLAS 2.0) Modern GPUs should be viewed as multi-threaded vector units, and their algorithms for matrix multiplication resemble those developed earlier for vector processors.

Related work of GEMM on GPUs (cont.) Volkov and Demmel (CUBLAS 2.0) store the sub-matrix of A in shared memory but use registers to hold the sub-matrices of B and C. This modification saves one move from shared memory to a register per MAD operation (MAD: one multiply + one add). (Figure: placement of subA, subB and subC across shared memory and global memory.)

Related work of GEMM on GPUs (cont.) Author's research In [6], the author discusses his experience in improving the performance of SGEMM. The following factors contributed to its better performance: reducing the data-transfer volume, adjusting the thread-block size to achieve peak device-memory bandwidth, and reducing the total number of synchronizations.

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template Asub is m x k, Bsub is k x n, and Csub is m x n. A, B and C are partitioned into M x K, K x N and M x N grids of these sub-blocks. Computing one Csub requires fetching K Asub blocks and K Bsub blocks from A and B. The total number of elements read from device memory is therefore M*N*K*(m*k) + M*N*K*(k*n) = (M*m)*(N*n)*(K*k)*(1/m + 1/n).
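As a worked example with hypothetical tile sizes m = n = 64 (chosen only for illustration, not the paper's tuned values), for a 2048 x 2048 x 2048 multiplication with M*m = N*n = K*k = 2048:

\text{reads} = (Mm)(Nn)(Kk)\left(\frac{1}{m} + \frac{1}{n}\right) = 2048^3\left(\frac{1}{64} + \frac{1}{64}\right) = \frac{2048^3}{32}

compared with the 2 * 2048^3 element reads of an untiled kernel that reads a full row of A and a full column of B per output element, i.e., roughly a 64x reduction in device-memory traffic.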

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template In [6], we described an SGEMM kernel that achieves a peak performance of 393 GFlops on a GTX 280. In this work, we take this kernel as the overall template.

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template Since matrix B always resides in global memory, threads in one column of thread blocks need to read the data of one column of Bsub blocks, as shown in Figure 4(a). If the thread blocks can be scheduled in column-major order, better cache hit rates can be achieved when reading data of matrix B.

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template In this code template the thread-block indices are permuted by swapping the roles of the built-in blockIdx.x and blockIdx.y variables. This modification is shown in Figure 4(b).
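For illustration only (this is not the paper's exact code), the perm variant of the template can remap the built-in indices at the top of the kernel; the host launch swaps the grid dimensions to match:

__global__ void gemm_template(const float *A, const float *B, float *C,
                              int M, int N, int K, int perm)
{
    // With perm = 1 the built-in indices are swapped, so blocks that the hardware
    // schedules consecutively (consecutive blockIdx.x) share the same Csub column
    // and therefore reuse the same column of Bsub blocks through the L1/L2 cache.
    int bx = perm ? blockIdx.y : blockIdx.x;   // column index of the Csub block
    int by = perm ? blockIdx.x : blockIdx.y;   // row index of the Csub block
    // The GEMM body is unchanged; it simply uses bx and by wherever it previously
    // used blockIdx.x and blockIdx.y.
}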

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template

Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template A GEMM auto-tuner is designed to tune this code template by automatically generating and searching a space of parameters. It has two components: a code generator, which generates parameterized code according to the pre-defined code template, and an execution engine, which runs the generated code variants and finds the best one.

Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template Define parameters:
m, k, n - sub-matrix (tile) sizes
tx - X dimension of the thread block
ty - Y dimension of the thread block
perm - whether to permute the built-in blockIdx.x and blockIdx.y variables
cachePreferred - whether to prefer L1 cache or shared memory

Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template Steps: 1. The code generator checks the validity of the input parameters; to be valid, the parameters must conform to hardware constraints, e.g., the maximum number of threads per thread block requires tx * ty <= 1024. 2. The code generator takes the seven parameters as inputs and generates the kernel code; by changing the input parameters, we can generate different kernel codes.

Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template Steps (cont.): 3. Run the generated kernels and evaluate their performance in order to identify the best combination. Figure 9 shows an example of auto-tuned DGEMM code for matrix size 2048 and its calling code. A sketch of such a generate-and-search loop is shown below.
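The shape of the generator's search loop can be sketched in host code as follows; the candidate values are assumptions for illustration (the paper's actual search space is not reproduced here), and only the tx * ty <= 1024 limit comes from the slides:

#include <cstdio>

static const int tileM[]  = {16, 32, 64, 96};   // hypothetical candidate values
static const int tileN[]  = {16, 32, 64, 96};
static const int tileK[]  = {4, 8, 16};
static const int blockX[] = {16, 32, 64};
static const int blockY[] = {2, 4, 8, 16};

static bool valid(int tx, int ty)
{
    return tx * ty <= 1024;   // hardware limit: maximum threads per thread block on Fermi
}

int main()
{
    for (int m : tileM) for (int n : tileN) for (int k : tileK)
    for (int tx : blockX) for (int ty : blockY)
    for (int perm = 0; perm <= 1; ++perm)
    for (int cachePreferred = 0; cachePreferred <= 1; ++cachePreferred) {
        if (!valid(tx, ty)) continue;
        // A real auto-tuner would emit a kernel from the template for this point,
        // compile it, time it on the GPU, and keep the fastest configuration.
        std::printf("m=%d k=%d n=%d tx=%d ty=%d perm=%d cachePreferred=%d\n",
                    m, k, n, tx, ty, perm, cachePreferred);
    }
    return 0;
}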

Experiment Tip: GEMM = GEneral Matrix Multiply, SGEMM = single precision, DGEMM = double precision. (Figure: SGEMM performance, peaking at 563 GFlops.)

Experiment (cont.) (Figure: DGEMM performance, peaking at 253 GFlops.) Up to 25% of performance is lost if the code does not work in a cache-friendly way.

Experiment (cont.) The peak performance of SGEMM on Tesla C2050 is 563 GFlops, which has about 1.3x speedup with respect to GTX285 and 1.4x speedup with respect to Tesla C1060.

Experiment (cont.) The peak performance of DGEMM on Tesla C2050 is 253 GFlops, which has 3x speedup with respect to GTX285 and 3.4x speedup with respect to Tesla C1060.

Experiment (cont.) These results confirm the vendor's assertion that the Fermi architecture has been specifically designed to offer unprecedented performance in double precision.

Conclusion Our focus is to study techniques that can help ordinary programmers obtain high-performance code on the Fermi architecture. The new features of the Fermi architecture, especially the cache hierarchy, make it harder to predict the performance of a given piece of code. Auto-tuning therefore becomes a reasonable way to achieve high-performance code on the Fermi architecture.

Q & A

GDDR3 vs. GDDR5 GDDR3 is derived from DDR2; GDDR5 is derived from DDR3. GDDR5 is about 2x faster than GDDR3, consumes less power than GDDR3, and offers larger memory bandwidth than GDDR3. Differences between GDDR5 and GDDR3: 1. Speed - GDDR5 frequencies can exceed 6.4 GHz, and GDDR5 is more than twice as fast as GDDR3; the memory frequency is the most obvious difference. GDDR3 tops out at about 2200 MHz, while GDDR5 peaks at 5000 MHz or even higher (the fastest memory on retail cards is 0.4 ns GDDR5 running at 5000 MHz). GDDR5 can run at higher frequencies for two reasons: it prefetches twice as much data as GDDR3 (8 bits per request versus the 4 bits used from GDDR2 through GDDR4), so each request delivers twice the data, and it uses a parallel dual DQ bus, effectively adding a second channel where GDDR3 has only one. Taken together, GDDR5 can deliver up to 4x the data rate of GDDR3, so effective frequencies of 5000 MHz are not surprising. 2. Power - GDDR5 consumes about 20% less power than GDDR3. Although higher memory frequencies increase chip power, GDDR3 chips operate at 1.8 V while GDDR5 operates at only 1.5 V with better power-management technology; GDDR3 is mostly built on an 80 nm process whereas GDDR5 uses 55 nm, so the chips are smaller and run cooler. 3. Bandwidth - GDDR5 with a 128-bit bus can still outperform GDDR3 with a 256-bit bus. Memory bandwidth = memory frequency x bus width / 8. For example, a GDDR5-based GT240 with a 4000 MHz memory frequency has a bandwidth of 4000 MHz x 128 bit / 8 = 64 GB/s, whereas the GDDR3 version of the GT240 at 1800 MHz reaches only 1800 MHz x 128 bit / 8 = 28.8 GB/s; with 2.22x the memory frequency, the GDDR5 version gains a large bandwidth advantage and can lead even 256-bit cards. With the same GPU, the GDDR5 GT240 is about 28.9% faster than the GDDR3 version, and when the clocks are raised further (650/4000 MHz) the gap grows to 38%; at that point the high-clocked GDDR5 GT240 outperforms the 256-bit 96GT by about 24% and approaches the 98GT, whose GPU is faster. All of this stems from GDDR5's superior design.

CUresult cuFuncSetBlockShape ( CUfunction hfunc, int x, int y, int z ) Specifies the x, y, and z dimensions of the thread blocks that are created when the kernel given by hfunc is launched.

CUresult cuLaunchGrid( CUfunction f, int grid_width, int grid_height ) Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape().
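For completeness, a minimal sketch of how these legacy driver-API calls fit together; the module name, kernel name, and dimensions are placeholders, and error checking and parameter setup with cuParamSetv()/cuParamSetSize() are omitted:

#include <cuda.h>

int main(void)
{
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction f;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "gemm.cubin");             // pre-built kernel image (placeholder name)
    cuModuleGetFunction(&f, mod, "dgemm_kernel"); // placeholder kernel name

    cuFuncSetBlockShape(f, 64, 4, 1);             // 64 x 4 x 1 threads per block (illustrative)
    cuLaunchGrid(f, 32, 32);                      // 32 x 32 grid of blocks (illustrative)

    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}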