ECE 598HK Computational Thinking for Many-core Computing
Lecture 2: Many-core GPU Performance Considerations
© Wen-mei W. Hwu and David Kirk/NVIDIA, 2010

ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Seven Techniques in Many-core Programming
- Scatter-to-gather transformation
- Granularity coarsening and register tiling
- Data access tiling
- Data layout and traversal ordering
- Binning and cutoff
- Bin sorting and partitioning for non-uniform data
- Hierarchical queues and kernels for dynamic data
ACS Annual Meeting, August 22, 2010

You can do it. Computational thinking is not as hard as you may think it is. Most techniques have been explained, if at all, at the level of computer experts. The purpose of the course is to make them accessible to domain scientists and engineers. ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Tentative Schedule/Make-up Classes
Regular make-up classes TBD
Week 1:
- Tue, 8/24: Lecture 1 – Introduction
- Thu, 8/26: Lecture 2 – Review: GPU performance considerations
- Make-up class:
Week 2:
- Tue, 8/31: Lecture 3 – Parallelism Scalability Transformations
- Thu, 9/02: Lecture 4 – Thread Coarsening and Register Tiling
- MP-1: DCS – scatter vs. gather
Week 3:
- Tue, 9/07: Lecture 5 – Memory Tiling
- Thu, 9/09: Lecture 6 – Memory Tiling
- Make-up class:
- MP-2: DCS – thread coarsening and register tiling
Week 4:
- Tue, 9/14: Lecture 7 – Register Tiling (make-up class)
- Thu, 9/16: Lecture 8 – Register Tiling (make-up class)
- MP-3: 7-Point Stencil – 2D memory tiling
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Tentative Schedule/Make-up Classes
Week 5:
- Tue, 9/21: Lecture 9 – Data Layout Considerations (make-up class)
- Thu, 9/23: Lecture 10 – Input Binning
- Make-up class:
- MP-4: 7-point stencil – register tiling
Week 6:
- Tue, 9/28: Lecture 11 – Input Binning
- Thu, 9/30: Lecture 12 – Non-uniform Data (Sparse Methods)
- MP-5: Matrix multiplication – register tiling
Week 7:
- Tue, 10/05: Lecture 13 – Non-Uniform Data (Sparse Methods)
- Thu, 10/07: Lecture 14 – Non-Uniform Data (Variable Binning)
- Make-up class:
- MP-6: Lattice Boltzmann Method – data layout
Week 8:
- Tue, 10/12: Lecture 15 – Non-Uniform Data (Variable Binning)
- Thu, 10/14: Lecture 16 – Dynamic Data
- MP-7: Cut-off CP – binning
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Tentative Schedule/Make-up Classes
Week 9:
- Tue, 10/19: Lecture 17 – Dynamic Data (make-up class)
- Thu, 10/21: Lecture 18 – MapReduce
- Make-up class:
- MP-8: MRI – data sorting and partitioning
Week 10:
- Tue, 10/26: Lecture 19 – Final Project Kick-off Workshop
- Thu, 10/28: Lecture 20 – Final Project Kick-off Workshop
- MP-9: BFS – hierarchical queues and kernels
Week 11:
- Tue, 11/02: Lecture 21 – Exploratory Topics (Unstructured Mesh?)
- Thu, 11/04: Lecture 22 – Exploratory Topics (Tree-coded Data)
- Make-up class: Final Project Work
Week 12:
- Tue, 11/09: Lecture 23 – Final Project Algorithm Presentations
- Thu, 11/11: Lecture 24 – Final Project Algorithm Presentations
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Tentative Schedule/Make-up Classes
Week 13:
- Tue, 11/16: Lecture 25 – Final Project Algorithm Presentation (make-up class)
- Thu, 11/18: Lecture 26 – Final Project Algorithm Presentation
- Make-up class: Final Project Work
Week 14:
- Tue, 11/30: Lecture 27 – Final Project Algorithm Presentation
- Thu, 12/02: Lecture 28 – Final Project
Week 15:
- Tue, 12/07: Lecture 29 – Course Summary
- Thu, 12/09: Final Project Symposium (date may change; 6 hours, 15 minutes per student)
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Global Memory Bandwidth
Many-core processors have limited off-chip memory access bandwidth compared to peak compute throughput. Fermi:
- 1.5 TFLOPS SPFP peak throughput
- 0.75 TFLOPS DPFP peak throughput
- 144 GB/s peak off-chip memory access bandwidth
- 36 G SPFP operands per second
- 18 G DPFP operands per second
To achieve peak throughput, a program must perform 1,500/36 = ~42 SPFP (21 DPFP) arithmetic operations for each operand value fetched from off-chip memory.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

A Simple CUDA Kernel for Matrix Multiplication

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];
    Pd[Row*Width + Col] = Pvalue;
}

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Performance Implication on Fermi
- Two global (DRAM) accesses (8 bytes) per floating-point multiply-add: 4 B of memory bandwidth needed per FLOPS
- 4 * 1,500 GFLOPS = 6,000 GB/s needed to achieve peak SP FLOP rating
- 8 * 750 GFLOPS = 6,000 GB/s needed to achieve peak DP FLOP rating
- The available 144 GB/s limits the code to 36 SP / 18 DP GFLOPS
[Figure: CUDA memory model – a grid of thread blocks, each block with its shared memory and per-thread registers, above the global and constant memories accessible to the host]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

However
- The calculation is oversimplified: it assumes that peak memory bandwidth is achieved throughout the execution
- We need to first understand the memory architecture
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

GPU Memory Architecture, Simplified
[Figure: block diagram – Host, Input Assembler, Thread Execution Manager, Parallel Data Caches, Texture units, and Load/store paths to Global Memory]
© David Kirk/NVIDIA and Wen-mei W. Hwu Barcelona, Spain, July 5-9, 2010

GPU Memory Architecture – Less Simplified
- Channels: the main form of access parallelism (8 in Fermi)
- Ports: second-level (pipelined) access parallelism (32 per channel in Fermi)
- Bursts: bandwidth efficiency (128 B per burst in Fermi)
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Achieving Peak Bandwidth
- All words of a burst need to be used: every word transferred corresponds to one of the program's accesses
- All channels are actively used: each channel connects to a set of pins
- Many ports in each channel are activated: enough active burst requests to fully utilize the pin bandwidth
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Example: Vector Addition Kernel (Device Code)

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    // Run ceil(n/256) blocks of 256 threads each;
    // (n + 255)/256 is the integer ceiling (plain n/256 would round down)
    vecAdd<<<(n + 255)/256, 256>>>(d_A, d_B, d_C, n);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu Barcelona, Spain, July 5-9, 2010

A Good Memory Access Pattern
- Adjacent threads access adjacent locations
- Adjacent warps activate different ports
- Adjacent thread blocks activate different ports/channels
[Figure: consecutive memory locations (… 7 6 5 4 3 2 1) mapped across Thread Block 0, Thread Block 1, …, Thread Block N-1]
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010


Memory Layout of a Matrix in C
[Figure: matrix M stored in row-major order – M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3 – with the access direction in the kernel code and the elements touched by threads T1-T4 in time periods 1 and 2]
© David Kirk/NVIDIA and Wen-mei W. Hwu Barcelona, Spain, July 5-9, 2010

Memory Layout of a Matrix in C
[Figure: the same row-major layout of M, contrasting the order in which threads T1-T4 touch elements across time periods 1 and 2]
© David Kirk/NVIDIA and Wen-mei W. Hwu Barcelona, Spain, July 5-9, 2010

Memory Access Pattern (Corner Turning)
[Figure: Md and Nd, each WIDTH x WIDTH – the original access pattern vs. the tiled access pattern: copy a tile into scratchpad memory, then perform the multiplication with the scratchpad values]
© David Kirk/NVIDIA and Wen-mei W. Hwu Barcelona, Spain, July 5-9, 2010

Data Layout Transformation
Transposing a 2D matrix layout can convert a non-coalesced access pattern into a coalesced pattern.
[Figure: Md and the transposed Nd]
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Data Access Conflicts ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Atomic Operations on DRAM
- Each load-modify-store has two full memory access delays
- All atomic operations on the same variable (RAM location) are serialized
[Figure: timeline – atomic operations N and N+1 each incur internal routing, a DRAM delay, and a transfer delay, strictly back to back]
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Hardware Improvements
- Atomic operations on shared memory: very short latency, but still serialized
- Private to each thread block
- Algorithmic work for programmers (more later)
[Figure: timeline – only internal routing and a data transfer separate atomic operations N and N+1]
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Hardware Improvements (cont.)
- Atomic operations on the Fermi L2 cache: medium latency, but still serialized
- Global to all blocks
- A "free improvement" on global memory atomics
[Figure: timeline – internal routing plus two data transfers between atomic operations N and N+1]
© Wen-mei W. Hwu and David Kirk/NVIDIA, ECE598HK, 2010

Any More Questions? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010