Graphics Processing Unit (GPU) Architecture and Programming. TU/e 5kk73. Zhenyu Ye, Bart Mesman, Henk Corporaal. 2010-11-08.

Today's Topics: GPU architecture, GPU programming, GPU micro-architecture, performance optimization and model, trends.

System Architecture

GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)

What Can It Do? Render triangles. NVIDIA GTX480 can render 1.6 billion triangles per second!

General-Purpose Computing ref:

The Vision of NVIDIA "Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price." -- David Kirk, NVIDIA, 1998

Single-Chip GPU vs. Fastest Supercomputers ref:

Top500 Supercomputer List in June 2010

GPU Will Top the List in Nov 2010

The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure

GPU Has 10x Comp Density. Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.

Evolution of Intel Pentium: Pentium I, Pentium II, Pentium III, Pentium IV. Chip area breakdown. Q: What can you observe? Why?

Extrapolation of Single-Core CPU. If we extrapolate the trend, in a few generations the Pentium would look like this: Of course, we know it did not happen. Q: What happened instead? Why?

Evolution of Multi-core CPUs: Penryn, Bloomfield, Gulftown, Beckton. Chip area breakdown. Q: What can you observe? Why?

Let's Take a Closer Look. Less than 10% of the total chip area is used for actual execution. Q: Why?

The Memory Hierarchy. Notes on energy at 45nm: a 64-bit integer ADD takes about 1 pJ; a 64-bit FP FMA takes about 200 pJ. It seems we cannot further increase the computational density.

The Brick Wall -- UC Berkeley's View. Power Wall: power expensive, transistors free. Memory Wall: memory slow, multiplies fast. ILP Wall: diminishing returns on more ILP hardware. David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007 (link)

The Brick Wall -- UC Berkeley's View. Power Wall: power expensive, transistors free. Memory Wall: memory slow, multiplies fast. ILP Wall: diminishing returns on more ILP hardware. Power Wall + Memory Wall + ILP Wall = Brick Wall. David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007 (link)

How to Break the Brick Wall? Hint: exploit the parallelism inside the application.

Step 1: Trade Latency for Throughput. Hide the memory latency through fine-grained interleaved threading.

Interleaved Multi-threading

The granularity of interleaved multi-threading: 100 cycles: hide off-chip memory latency 10 cycles: + hide cache latency 1 cycle: + hide branch latency, instruction dependency

Interleaved Multi-threading The granularity of interleaved multi-threading: 100 cycles: hide off-chip memory latency 10 cycles: + hide cache latency 1 cycle: + hide branch latency, instruction dependency Fine-grained interleaved multi-threading: Pros: ? Cons: ?

Interleaved Multi-threading The granularity of interleaved multi-threading: 100 cycles: hide off-chip memory latency 10 cycles: + hide cache latency 1 cycle: + hide branch latency, instruction dependency Fine-grained interleaved multi-threading: Pros: remove branch predictor, OOO scheduler, large cache Cons: register pressure, etc.

Fine-Grained Interleaved Threading. Pros: reduced cache size, no branch predictor, no OOO scheduler. Cons: register pressure, thread scheduler, requires huge parallelism. (Figure: without and with fine-grained interleaved threading.)

HW Support. The register file supports zero-overhead context switches between interleaved threads.

Can We Make Further Improvements? Reducing the large cache gives 2x computational density. Q: Can we make further improvements? Hint: we have only utilized thread-level parallelism (TLP) so far.

Step 2: Single Instruction Multiple Data. SSE has 4 data lanes; GPUs have 8/16/24/... data lanes. GPUs use wide SIMD: 8/16/24/... processing elements (PEs). CPUs use short SIMD: usually a vector width of 4.

Hardware Support Supporting interleaved threading + SIMD execution

Single Instruction Multiple Thread (SIMT) Hide vector width using scalar threads.

Example of SIMT Execution Assume 32 threads are grouped into one warp.

Step 3: Simple Core. The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply-Add (FMA). SFU: Special Function Unit.

NVIDIA's Motivation of Simple Core "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA

Review: How Did We Get Here? NVIDIA Fermi, 512 Processing Elements (PEs)

Throughput-Oriented Architectures: 1. Fine-grained interleaved threading (~2x comp density); 2. SIMD/SIMT (>10x comp density); 3. Simple core (~2x comp density). Key architectural features of a throughput-oriented processor. ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM (link)

Today's Topics: GPU architecture, GPU programming, GPU micro-architecture, performance optimization and model, trends.

CUDA Programming. A massive number (>10,000) of lightweight threads.

Express Data Parallelism in Threads Compare thread program with vector program.

Vector Program
Scalar program:
  float A[4][8];
  do-all(i=0;i<4;i++){
    do-all(j=0;j<8;j++){
      A[i][j]++;
    }
  }
Vector program (vector width of 8):
  float A[4][8];
  do-all(i=0;i<4;i++){
    movups xmm0, [ &A[i][0] ]
    incps xmm0
    movups [ &A[i][0] ], xmm0
  }
The vector width is exposed to programmers.
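For readers who prefer intrinsics to raw SSE assembly, here is a minimal sketch of the same loop (an illustration, not part of the original slides). Note that incps above is pseudo-assembly, since SSE has no packed-increment instruction; the intrinsic version therefore adds a vector of ones, four floats at a time:

  #include <xmmintrin.h>   /* SSE intrinsics */

  /* Increment every element of a 4x8 float array, 4 lanes per SSE register. */
  void inc_array(float A[4][8]) {
      const __m128 ones = _mm_set1_ps(1.0f);
      for (int i = 0; i < 4; i++) {
          for (int j = 0; j < 8; j += 4) {          /* 8 floats = two 4-wide vectors per row */
              __m128 v = _mm_loadu_ps(&A[i][j]);    /* unaligned load, like movups */
              v = _mm_add_ps(v, ones);              /* add 1.0f to each lane */
              _mm_storeu_ps(&A[i][j], v);           /* store back */
          }
      }
  }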

CUDA Program
Scalar program:
  float A[4][8];
  do-all(i=0;i<4;i++){
    do-all(j=0;j<8;j++){
      A[i][j]++;
    }
  }
CUDA program:
  float A[4][8];
  kernelF<<<4, 8>>>(A);   // 4 thread blocks of 8 threads each
  __global__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
  }
The CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP). The hardware converts TLP into DLP at run time.
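A self-contained host-side version of this example is sketched below (an illustration assuming 4 blocks of 8 threads, one thread per array element; error checking omitted):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void kernelF(float *A) {
      int i = blockIdx.x;      // one block per row
      int j = threadIdx.x;     // one thread per column
      A[i * 8 + j] += 1.0f;
  }

  int main() {
      float hA[4][8] = {{0.0f}};
      float *dA;
      cudaMalloc((void **)&dA, sizeof(hA));
      cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
      kernelF<<<4, 8>>>(dA);                                // 4 blocks x 8 threads
      cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
      cudaFree(dA);
      printf("A[3][7] = %f\n", hA[3][7]);                   // prints 1.000000
      return 0;
  }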

Two Levels of Thread Hierarchy
  kernelF<<<4, 8>>>(A);   // a grid of 4 thread blocks, 8 threads per block
  __global__ kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
  }

Multi-dimensional Thread and Block ID
  kernelF<<<grid, block>>>(A);   // launch configuration elided in the transcript
  __global__ kernelF(A){
    i = gridDim.x  * blockIdx.y  + blockIdx.x;    // linear block index in a 2D grid
    j = blockDim.x * threadIdx.y + threadIdx.x;   // linear thread index in a 2D block
    A[i][j]++;
  }
Both the grid and the thread block can have a two-dimensional index.
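As a concrete illustration (the dimensions below are assumed for the example, not taken from the slides), a 2D launch covering the 4x8 array could look like this:

  dim3 grid(2, 2);    // 2x2 = 4 thread blocks
  dim3 block(4, 2);   // 4x2 = 8 threads per block
  kernelF<<<grid, block>>>(A);

  __global__ void kernelF(float *A) {
      int i = gridDim.x  * blockIdx.y  + blockIdx.x;    // 0..3  (row)
      int j = blockDim.x * threadIdx.y + threadIdx.x;   // 0..7  (column)
      A[i * 8 + j] += 1.0f;
  }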

Scheduling Thread Blocks on SM Example: Scheduling 4 thread blocks on 3 SMs.

Executing a Thread Block on the SM
Executed on a machine with a width of 4, or on a machine with a width of 8. Note: the number of Processing Elements (PEs) is transparent to the programmer.
  kernelF<<<grid, block>>>(A);
  __global__ kernelF(A){
    i = gridDim.x  * blockIdx.y  + blockIdx.x;
    j = blockDim.x * threadIdx.y + threadIdx.x;
    A[i][j]++;
  }

Multiple Levels of Memory Hierarchy

Explicit Management of Shared Mem Shared memory is frequently used to exploit locality.

Shared Memory and Synchronization
Example: average filter with a 3x3 window. (Figures: 3x3 window on the image; image data in DRAM.)
  kernelF<<<grid, block>>>(A);   // 16x16 threads per block
  __global__ kernelF(A){
    __shared__ float smem[16][16];   // allocate shared memory
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9;
  }

Shared Memory and Synchronization
Example: average filter over a 3x3 window. (Figures: 3x3 window on the image; stage data in shared memory.)
  kernelF<<<grid, block>>>(A);
  __global__ kernelF(A){
    __shared__ float smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];   // load to smem
    __syncthreads();        // threads wait at the barrier
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9;
  }

Shared Memory and Synchronization
Example: average filter over a 3x3 window. (Figures: 3x3 window on the image; all threads have finished the load.)
  kernelF<<<grid, block>>>(A);
  __global__ kernelF(A){
    __shared__ float smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();        // every thread is ready
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9;
  }

Shared Memory and Synchronization
Example: average filter over a 3x3 window. (Figures: 3x3 window on the image; start the computation.)
  kernelF<<<grid, block>>>(A);
  __global__ kernelF(A){
    __shared__ float smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9;
  }
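Pulling the four steps above together, a runnable sketch of the filter kernel could look like the following (the function name, the tile-border handling, and the row-major layout are assumptions for illustration; the slides ignore image borders for simplicity):

  #define TILE 16

  // Each 16x16 thread block filters one 16x16 tile of the image. Interior
  // threads average their 3x3 neighborhood out of shared memory; tile-border
  // threads simply keep their pixel, staying close to the slides' simplified code.
  __global__ void avgFilter(float *A, int width) {
      __shared__ float smem[TILE][TILE];
      int i = threadIdx.y, j = threadIdx.x;
      int y = blockIdx.y * TILE + i;
      int x = blockIdx.x * TILE + j;

      smem[i][j] = A[y * width + x];   // stage the tile in shared memory
      __syncthreads();                 // wait until the whole tile is loaded

      if (i > 0 && i < TILE - 1 && j > 0 && j < TILE - 1) {
          float sum = 0.0f;
          for (int di = -1; di <= 1; di++)
              for (int dj = -1; dj <= 1; dj++)
                  sum += smem[i + di][j + dj];
          A[y * width + x] = sum / 9.0f;   // 3x3 average
      }
  }

A launch such as avgFilter<<<dim3(width/TILE, height/TILE), dim3(TILE, TILE)>>>(dA, width) would cover an image whose sides are multiples of 16.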

Programmers Think in Threads. Q: Why go through this hassle?

Why Use Threads instead of Vectors?
Thread pros:
  Portability: the machine width is transparent in the ISA.
  Productivity: programmers do not need to take care of the vector width of the machine.
Thread cons:
  Manual synchronization: lock-step execution within the vector is given up.
  Scheduling of threads can be inefficient.
  Debugging: "threads considered harmful"; thread programs are notoriously hard to debug.

Features of CUDA: programmers explicitly express DLP in terms of TLP; programmers explicitly manage the memory hierarchy; etc.

Today's Topics: GPU architecture, GPU programming, GPU micro-architecture, performance optimization and model, trends.

Micro-architecture GF100 micro-architecture

HW Groups Threads Into Warps Example: 32 threads per warp
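A small sketch of how this grouping looks from software (the warp size is 32 on NVIDIA GPUs; the kernel and names here are illustrative, not from the slides):

  #include <cstdio>

  __global__ void whoAmI() {
      int tid  = blockDim.x * threadIdx.y + threadIdx.x;   // linear thread id within the block
      int warp = tid / 32;    // threads 0..31 form warp 0, threads 32..63 form warp 1, ...
      int lane = tid % 32;    // position of this thread inside its warp
      printf("thread %d -> warp %d, lane %d\n", tid, warp, lane);
  }

  // Example launch: whoAmI<<<1, dim3(16, 4)>>>();  -> 64 threads, i.e. 2 warps.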

An Example Implementation. Note: NVIDIA may use a more complicated implementation.

Example Program
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Assume warp 0 and warp 1 are scheduled for execution.

Read Src Op
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Read source operands: r1 for warp 0, r4 for warp 1.

Buffer Src Op
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Push ops to the op collector: r1 for warp 0, r4 for warp 1.

Read Src Op
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Read source operands: r2 for warp 0, r5 for warp 1.

Buffer Src Op
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Push ops to the op collector: r2 for warp 0, r5 for warp 1.

Execute
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Compute the first 16 threads in the warp.

Execute
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Compute the last 16 threads in the warp.

Write back
  Address: Inst
  0x0004:  add r0, r1, r2
  0x0008:  sub r3, r4, r5
Write back: r0 for warp 0, r3 for warp 1.

Other High-Performance GPUs: the ATI Radeon 5000 series.

ATI Radeon 5000 Series Architecture

Radeon SIMD Engine 16 Stream Cores (SC) Local Data Share

VLIW Stream Core (SC)

Local Data Share (LDS)

Today's Topics: GPU architecture, GPU programming, GPU micro-architecture, performance optimization and model, trends.

Performance Optimization
Optimizations on memory latency tolerance:
  Reduce register pressure
  Reduce shared memory pressure
Optimizations on memory bandwidth:
  Global memory coalescing
  Avoid shared memory bank conflicts
  Grouping byte accesses
  Avoid partition camping
Optimizations on computation efficiency:
  Mul/Add balancing
  Increase floating-point proportion
Optimizations on operational intensity:
  Use tiled algorithms
  Tune thread granularity

Shared Mem Contains Multiple Banks

Compute Capability Need arch info to perform optimization. ref: NVIDIA, "CUDA C Programming Guide", (link)

Shared Memory (compute capability 2.x). (Figures: access patterns without a bank conflict and with a bank conflict.)
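To make the two cases concrete, here is a small sketch (an illustration, not from the slides). On compute capability 2.x, shared memory has 32 banks of 4-byte words, and a word at byte address addr lives in bank (addr/4) mod 32:

  __shared__ float smem[32][32];
  int tid = threadIdx.x;               // assume one warp: tid = 0..31

  // No bank conflict: consecutive threads read consecutive 4-byte words,
  // which fall into 32 different banks.
  float a = smem[0][tid];

  // 32-way bank conflict: the addresses of smem[tid][0] are 32 words apart,
  // so all 32 threads hit bank 0 and the accesses are serialized.
  float b = smem[tid][0];

  // Common fix: pad each row by one word so column accesses spread over banks:
  //   __shared__ float smem[32][33];   // smem[tid][0] now maps to bank tid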

Performance Optimization
Optimizations on memory latency tolerance:
  Reduce register pressure
  Reduce shared memory pressure
Optimizations on memory bandwidth:
  Global memory alignment and coalescing
  Avoid shared memory bank conflicts
  Grouping byte accesses
  Avoid partition camping
Optimizations on computation efficiency:
  Mul/Add balancing
  Increase floating-point proportion
Optimizations on operational intensity:
  Use tiled algorithms
  Tune thread granularity

Global Memory in Off-Chip DRAM. The address space is interleaved among multiple channels.

Global Memory
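The global memory alignment and coalescing item listed above can be illustrated with a short sketch (an assumed example, not from the slides): on Fermi-class GPUs, the loads of a warp that fall into the same aligned 128-byte segment are served by a single memory transaction.

  __global__ void copyRow(const float *in, float *out, int width) {
      int row = blockIdx.x;
      int col = threadIdx.x;

      // Coalesced: adjacent threads access adjacent floats of one row, so the
      // 32 loads of a warp fall into one 128-byte segment (one transaction).
      out[row * width + col] = in[row * width + col];

      // Uncoalesced (for contrast): adjacent threads striding by 'width' touch
      // up to 32 different segments, costing up to 32 transactions per warp.
      // out[col * width + row] = in[col * width + row];
  }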

Roofline Model. Identify the performance bottleneck: computation bound vs. bandwidth bound.
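A worked example of the model (the peak figures below, roughly 1.35 TFLOP/s single precision and 177 GB/s for a GTX480-class GPU, are assumptions for illustration, not taken from the slides):

  // Attainable performance = min(peak compute, operational intensity * bandwidth).
  float roofline(float peak_gflops, float bw_gbs, float flops_per_byte) {
      float memory_bound = flops_per_byte * bw_gbs;   // GFLOP/s limit set by DRAM traffic
      return memory_bound < peak_gflops ? memory_bound : peak_gflops;
  }

  // Example: a kernel doing 0.5 flop per byte of DRAM traffic on a GTX480-class GPU:
  //   roofline(1345.0f, 177.0f, 0.5f) ~= 88 GFLOP/s  -> bandwidth bound.
  //   It would need an operational intensity above ~7.6 flop/byte to become compute bound.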

Optimization Is Key for Attainable GFLOP/s

Computation, Bandwidth, Latency. Illustrating the three bottlenecks in the Roofline model.

Today's Topics: GPU architecture, GPU programming, GPU micro-architecture, performance optimization and model, trends.

Coming architectures: Intel's Larrabee successor, Many Integrated Core (MIC); CPU/GPU fusion: Intel Sandy Bridge, AMD Llano.

Intel Many Integrated Core (MIC). The 32-core version of MIC:

Intel Sandy Bridge. Highlights: reconfigurable shared L3 cache for CPU and GPU; ring bus.

Sandy Bridge's New CPU-GPU interface ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech, (link)

AMD Llano Fusion APU (expected Q3 2011). Notes: CPU and GPU do not share a cache? Unknown interface between CPU and GPU.

GPU Research in ES Group GPU research in the Electronic Systems group.