CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming


CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2009
Jih-Kwon Peir
Computer & Information Science & Engineering, University of Florida

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Acknowledgement: slides borrowed from:
- Accelerators for Science and Engineering Applications: GPUs and Multicores, by David Kirk / NVIDIA and Wen-mei Hwu / University of Illinois, 2006-2008 (http://www.greatlakesconsortium.org/events/GPUMulticore/agenda.html)
- Course material posted on the CUDA Zone (http://www.nvidia.com/object/cuda_education.html)
- Intel Software Network (http://software.intel.com/en-us/academic/)
- The Art of Multiprocessor Programming (http://software.intel.com/en-us/academic/)
- Presentation slides from various papers

Course Goals
Learn how to program massively parallel processors and achieve:
- High performance
- Functionality and maintainability
- Scalability across future generations
Acquire the technical knowledge required to achieve the above goals:
- Principles and patterns of parallel programming
- Processor architecture features and constraints
- Programming APIs, tools, and techniques
Learn new many-core general-purpose and GPU processor architectures:
- Organization and memory systems
- Parallel programming basics: locking, synchronization, mutual exclusion, transactional memory, etc.

Course Outline
Week 1-2: Introduction, GPU architectures, CUDA programming
Week 3-6: CUDA threads, thread blocks, grids, CUDA memory, synchronization, performance
Week 7: Project selection and discussion
Week 8-9: Intel many-core architectures
Week 10-11: Parallel programming models, synchronization, mutual exclusion, conditional synchronization, locks, barriers, concurrency and correctness, sequential programs and consistency; add Fermi and Larrabee
Week 12-13: Discussion of advanced issues in multi-core architecture and programming
Week 14-16: In-depth discussion of project topics and project presentations

CUDA – GPU Programming
An integrated host + device application is a C program:
- Serial or modestly parallel parts run in host C code
- Highly parallel parts run in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
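A minimal sketch of this host + device pattern is shown below; the kernel body, data sizes, and variable names are illustrative assumptions, not code from the slides.

// Serial host code alternating with a parallel device kernel launch.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void KernelA(float* d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) d_data[i] = d_data[i] * 2.0f;         // highly parallel work
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float* h_data = (float*)malloc(bytes);           // serial host code
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    int nTid = 256;                                  // threads per block
    int nBlk = (n + nTid - 1) / nTid;                // blocks in the grid
    KernelA<<<nBlk, nTid>>>(d_data, n);              // parallel kernel (device)

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[1] = %f\n", h_data[1]);           // serial host code again
    cudaFree(d_data); free(h_data);
    return 0;
}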

CUDA Thread Blocks and Threads
Each thread uses its IDs to decide what data to work on:
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data, e.g. image processing, solving PDEs on volumes, ...
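A hedged sketch of how a thread combines its block and thread IDs to pick a pixel in a 2D image; the kernel name and image layout are assumptions for illustration.

__global__ void InvertImage(unsigned char* img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // 2D block ID + 2D thread ID
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int idx = row * width + col;                  // flatten 2D coordinates
        img[idx] = 255 - img[idx];
    }
}
// Launched with a 2D grid of 2D blocks, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   InvertImage<<<grid, block>>>(d_img, width, height);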

Matrix Multiplication: A Simple Example
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
(Figure: square matrices M, N, and P, each Width x Width; row i of M is combined with column j of N to produce element P[i][j].)
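For comparison, here is a hedged sketch of the corresponding naive device kernel, with one thread per P element. This is the usual first CUDA version of the code, not necessarily the exact kernel used later in the course.

__global__ void MatrixMulKernelNaive(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;   // each thread reads 2*Width elements from global memory
    }
}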

G80 Example: Thread Scheduling (cont.)
The SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by an SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected

Thread Scheduling (cont.)
Each thread block is assigned to one SM; each SM can take up to 8 blocks.
Each block has up to 512 threads, divided into 32-thread warps; each warp is scheduled on the 8 SPs, 4 threads per SP, and executed in SIMT mode.
The SP pipeline is ~30 stages deep; fetch, decode, gather, and write-back act on whole warps, so they have a throughput of 1 warp per slow clock.
Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs per SM), so its throughput is 1 warp per 4 fast clocks, or 1 warp per 2 slow clocks.
The fetch/decode/... stages have a higher throughput to feed both the MAD and the SFU/MUL units alternately. Hence the peak rate of 8 MAD + 8 MUL per (fast) clock cycle.
Need 6 warps (or 192 threads) per SM to hide the read-after-write latencies.
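A back-of-the-envelope check of the "6 warps" figure, assuming a register read-after-write latency of roughly 24 fast clock cycles (the value quoted in early CUDA programming guides for G80-class parts; the assumption is not from these slides):

\[
\frac{\text{RAW latency}}{\text{issue time per warp instruction}}
  \approx \frac{24\ \text{cycles}}{4\ \text{cycles/warp}} = 6\ \text{warps}
  = 6 \times 32 = 192\ \text{threads per SM.}
\]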

G80 Implementation of CUDA Memories
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
(Figure: a grid of blocks; each block has its own shared memory, each thread its own registers; global and constant memory are shared by the whole grid and accessible from the host.)
Global, constant, and texture memory spaces are persistent across kernels called by the same application.
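A hedged sketch of how these memory spaces appear in CUDA C; the variable names are illustrative assumptions, not from the slides.

__constant__ float c_coeffs[16];          // per-grid constant memory, read-only in kernels

__global__ void MemorySpacesDemo(float* g_out)   // g_out points into per-grid global memory
{
    __shared__ float s_tile[64];          // per-block shared memory (launch with <= 64 threads/block)
    int   tid = threadIdx.x;              // ordinary automatic scalars live in registers
    float local_array[8];                 // large or dynamically indexed automatics may spill to per-thread local memory

    for (int i = 0; i < 8; ++i) local_array[i] = c_coeffs[i];
    s_tile[tid] = local_array[tid % 8];
    __syncthreads();
    g_out[blockIdx.x * blockDim.x + tid] = s_tile[tid];
}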

How about performance on G80?
All threads access global memory for their input matrix elements:
- Two memory accesses (8 bytes) per floating-point multiply-add, i.e., 4 B of memory bandwidth per FLOP
- 4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
- The available 86.4 GB/s limits the code to 21.6 GFLOPS
The actual code runs at about 15 GFLOPS.
Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS.
(Figure: a grid of blocks with per-block shared memory and per-thread registers, all reading operands from global memory.)
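The bandwidth-bound estimate above, spelled out with the slide's own numbers:

\[
\text{required bandwidth} = 4\ \tfrac{\text{B}}{\text{FLOP}} \times 346.5\ \text{GFLOPS} = 1386\ \text{GB/s},
\qquad
\text{achievable} = \frac{86.4\ \text{GB/s}}{4\ \text{B/FLOP}} = 21.6\ \text{GFLOPS}.
\]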

Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}
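A hedged sketch of how this kernel would typically be launched; the TILE_WIDTH value, the wrapper function, and the assumption that Width is a multiple of TILE_WIDTH are illustrative, not from the slides.

#define TILE_WIDTH 16

void LaunchTiledMatrixMul(float* Md, float* Nd, float* Pd, int Width)
{
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                 // one thread per Pd element in a tile
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // assumes Width is a multiple of TILE_WIDTH
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    cudaDeviceSynchronize();                               // wait for the kernel to finish
}
// Each thread now reads 2*Width/TILE_WIDTH elements from global memory instead of
// 2*Width, so global memory traffic drops by a factor of TILE_WIDTH versus the naive kernel.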

Today’s Intel PC Architecture: Single-Core System
FSB connection between the processor and the Northbridge (82925X) Memory Control Hub.
The Northbridge handles the "primary" PCIe connection to the video card/GPU and the DRAM; PCIe x16 bandwidth is 8 GB/s (4 GB/s in each direction).
The Southbridge (ICH6RW) handles other peripherals.

GeForce-8 Series HW Overview
(Figure: the Streaming Processor Array consists of Texture Processor Clusters (TPCs); each TPC contains texture units (TEX) and two Streaming Multiprocessors (SMs); each SM has an instruction L1 cache, a data L1 cache, instruction fetch/dispatch logic, shared memory, 8 streaming processors (SPs), and 2 special function units (SFUs).)

SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling:
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected
4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80.
If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency.
(Figure: the SM multithreaded warp scheduler interleaving instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.)
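The "13 warps" figure, spelled out: each warp issues 4 independent instructions, at 4 cycles each, before it must wait on its memory access, so one warp covers 16 cycles of the 200-cycle latency.

\[
\left\lceil \frac{200\ \text{cycles}}{4\ \tfrac{\text{instructions}}{\text{warp}} \times 4\ \tfrac{\text{cycles}}{\text{instruction}}} \right\rceil
= \left\lceil \frac{200}{16} \right\rceil = 13\ \text{warps}.
\]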

CUDA Device Memory Space: Review
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
(Figure: the device grid with global, constant, and texture memory; each block has shared memory; each thread has registers and local memory; the host sits outside the device.)
Global, constant, and texture memory spaces are persistent across kernels called by the same application. The host can R/W global, constant, and texture memories using the copy functions.
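A hedged sketch of the host-side copies mentioned above; the buffer and symbol names are illustrative, and error checking is omitted for brevity.

#include <cuda_runtime.h>

__constant__ float c_table[256];   // per-grid constant memory

void HostSideCopies(const float* h_in, float* h_out, float* d_buf, size_t n)
{
    // Host writes and reads global memory with cudaMemcpy:
    cudaMemcpy(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Host writes constant memory with cudaMemcpyToSymbol:
    float h_table[256] = {0};
    cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table));
}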

Memory Layout of a Matrix in C
(Figure: a 4x4 matrix M laid out in row-major order in linear memory, M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 ... M3,3, with the kernel code's access direction shown across successive time periods for threads T1-T4.)
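A hedged illustration of why this layout matters for memory coalescing; the kernel and names are assumptions, and a square Width x Width matrix with tid < Width is assumed.

__global__ void ReadPatterns(const float* M, float* out, int Width)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads read consecutive addresses (same row,
    // adjacent columns), so a warp's loads combine into few transactions.
    float a = M[blockIdx.y * Width + tid];

    // Uncoalesced: consecutive threads read addresses Width apart (same
    // column, adjacent rows), so each load may become its own transaction.
    float b = M[tid * Width + blockIdx.y];

    out[tid] = a + b;
}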

Bank Addressing Examples
(Figure, left: 2-way bank conflicts; linear addressing with stride == 2 maps threads 0-15 onto only the even-numbered banks, two threads per bank. Figure, right: 8-way bank conflicts; linear addressing with stride == 8 maps 8 threads onto the same bank.)
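A hedged sketch of these access patterns in shared memory (16 banks on G80, bank = (word address) mod 16); the kernel and names are illustrative, launched with 64 threads per block.

__global__ void BankConflictDemo(float* out)
{
    __shared__ float s[512];
    int tid = threadIdx.x;
    for (int i = tid; i < 512; i += blockDim.x) s[i] = (float)i;
    __syncthreads();

    float no_conflict = s[tid];       // stride 1: one bank per thread in a half-warp
    float two_way     = s[tid * 2];   // stride 2: two threads share each bank (2-way conflict)
    float eight_way   = s[tid * 8];   // stride 8: eight threads share each bank (8-way conflict)

    out[tid] = no_conflict + two_way + eight_way;
}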

Control Flow Instructions
The main performance concern with branching is divergence: threads within a single warp take different paths. Different execution paths are serialized in G80; the control paths taken by the threads in a warp are traversed one at a time until there are no more.
A common case: avoid divergence when the branch condition is a function of the thread ID.
Example with divergence:
If (threadIdx.x > 2) { }
This creates two different control paths for threads in a block. The branch granularity is smaller than the warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.
Example without divergence:
If (threadIdx.x / WARP_SIZE > 2) { }
This also creates two different control paths for threads in a block, but the branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path.
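A hedged sketch contrasting the two branch patterns above; the kernel name and the work done in each path are assumptions.

#define WARP_SIZE 32

__global__ void BranchDemo(float* data)
{
    int tid = threadIdx.x;

    // Divergent: in the first warp, threads 0-2 take the else path while the
    // other 29 threads take the if path, so both paths execute serially.
    if (tid > 2) data[tid] *= 2.0f;
    else         data[tid] += 1.0f;

    // Non-divergent: the condition is constant across each warp (warps 0-2
    // versus warps 3 and up), so every warp follows a single path.
    if (tid / WARP_SIZE > 2) data[tid] *= 2.0f;
    else                     data[tid] += 1.0f;
}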

Vector Reduction with Branch Divergence
(Figure: a tree-style reduction over the array elements. In iteration 1, threads 0, 2, 4, 6, 8, 10 compute the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; in iteration 2 the partial sums 0..3, 4..7, 8..11 are formed; in iteration 3, 0..7 and 8..15. Because only every 2nd, 4th, 8th, ... thread is active, each warp contains both active and inactive threads and diverges.)
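A hedged sketch of the divergent reduction the figure depicts: the standard interleaved-stride kernel from the PMPP material, with array and kernel names chosen here for illustration. Launch with shared memory of blockDim.x * sizeof(float) bytes.

__global__ void ReduceDivergent(float* g_in, float* g_out)
{
    extern __shared__ float partialSum[];          // one element per thread
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0)                 // only every 2*stride-th thread stays active
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) g_out[blockIdx.x] = partialSum[0]; // one partial sum per block
}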

No Divergence until < 16 Sub-sums
(Figure: the improved reduction pattern for a 32-element example: in the first iteration thread t adds elements t and t+16 (0+16, 1+17, ..., 15+31), and the stride halves in each later iteration, so the active threads always form a contiguous block starting at thread 0.)
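A hedged sketch of the corresponding low-divergence kernel: the stride starts at half the block size and halves each step, keeping the active threads contiguous so whole warps retire together.

__global__ void ReduceContiguous(float* g_in, float* g_out)
{
    extern __shared__ float partialSum[];
    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        if (t < stride)                            // active threads are 0..stride-1
            partialSum[t] += partialSum[t + stride];
    }
    if (t == 0) g_out[blockIdx.x] = partialSum[0];
}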

Fundamentals of Parallel Computing
Parallel computing requires that:
- The problem can be decomposed into sub-problems that can be safely solved at the same time
- The programmer structures the code and data to solve these sub-problems concurrently
The goals of parallel computing are:
- To solve problems in less time, and/or
- To solve bigger problems, and/or
- To achieve better solutions
The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

Challenges of Parallel Programming
Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle (computational thinking, J. Wing).
Dependences need to be identified and managed; the order of task execution may change the answers:
- Obvious: one step feeds its result to the next steps
- Subtle: numeric accuracy may be affected by ordering steps that are logically parallel with each other
Performance can be drastically reduced by many factors:
- Overhead of parallel processing
- Load imbalance among processor elements
- Inefficient data-sharing patterns
- Saturation of critical resources such as memory bandwidth

Fermi Implements CUDA
Definitions of memory scope, grid, thread block, and thread are the same as in Tesla:
- Grid: array of thread blocks
- Thread block: up to 1024 concurrent threads (up to 1536 resident threads per SM), communicating through shared memory
The GPU has an array of SMs; each executes one or more thread blocks, and each block is grouped into warps of 32 threads.
Other resource constraints are implementation based.

Fermi – GT300 Key Features
- 32 cores per SM, 512 cores total
- Fully pipelined integer and floating-point units that implement the new IEEE 754-2008 standard, including fused multiply-add (FMA)
- Two warps from different thread blocks (even different kernels) can be issued and executed concurrently
- ECC protection from the registers to DRAM
- Linear addressing model with caching at all levels
- Large shared memory / L1 cache
- Double-precision performance 8x faster than GT200, reaching ~600 double-precision GFLOPS

Fermi – GT300 Key Features (cont.)
- Fermi supports simultaneous execution of multiple kernels from the same application, each kernel distributed to one or more SMs
- The GigaThread hardware thread scheduler manages 1,536 simultaneously active threads for each SM across 16 kernels
- Switching from one application to another is 20x faster on Fermi
- Fermi supports OpenCL, Fortran, C++, Java, Matlab, and Python
- Each SM has 32 cores, 16 LD/ST units, and 4 SFUs
- Fermi supports FMA for both single and double precision

Instruction Schedule Example
A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of load/store units. (The accompanying figure shows how instructions are issued to the four execution blocks.) It takes two cycles for the 32 instructions in each warp to execute on the cores or load/store units. A warp of 32 special-function instructions is issued in a single cycle but takes eight cycles to complete on the four SFUs.
Another major improvement in Fermi and PTX 2.0 is a new unified addressing model. All addresses in the GPU are allocated from a continuous 40-bit (one-terabyte) address space. Global, shared, and local addresses are defined as ranges within this address space and can be accessed by common load/store instructions. (The load/store instructions support 64-bit addresses to allow for future growth.)

Multi-Core Architecture: Intel Quad Core Technology of Today – Cache Structure
The L2 cache of today's quad-core processors is not one cache shared by all 4 cores. Instead, there are two 4MB L2 caches, each shared by a pair of cores.
(Figure: four cores; cores 1 and 2 share one 4MB L2 cache, cores 3 and 4 share another; both connect to the bus interface and a 1066 MHz / 1333 MHz FSB.)

What Is OpenMP*?
http://www.openmp.org; the current spec is OpenMP 2.5, 250 pages (combined C/C++ and Fortran).
(Figure: a collage of OpenMP constructs, e.g. #pragma omp critical, #pragma omp parallel for private(A, B), C$OMP PARALLEL DO ORDERED PRIVATE(A, B, C), C$OMP PARALLEL REDUCTION(+: A, B), C$OMP ATOMIC, C$OMP SECTIONS, !$OMP BARRIER, CALL OMP_SET_NUM_THREADS(10), omp_set_lock(lck), setenv OMP_SCHEDULE "dynamic", illustrating how compact the entire specification is.)
Multi-core Programming: Programming with OpenMP*
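A hedged sketch of a basic OpenMP program in C, using a few of the constructs from the collage above; it is not from the slides and would be compiled with, e.g., gcc -fopenmp.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];
    double sum = 0.0;

    omp_set_num_threads(4);                         // request 4 threads

    #pragma omp parallel for                        // split the iterations across threads
    for (int i = 0; i < n; ++i)
        a[i] = i * 0.5;

    #pragma omp parallel for reduction(+: sum)      // each thread keeps a private partial sum
    for (int i = 0; i < n; ++i)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}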

More material Intel Larrabee Architecture Herlihy’s Book Chapter 1: Introduction Chapter 2: Mutual Exclusion