NVIDIA Research Parallel Computing on Manycore GPUs Vinod Grover

© 2008 NVIDIA Corporation Generic Manycore Chip
Many processors, each supporting many hardware threads
On-chip memory near the processors (cache, RAM, or both)
Shared global memory space (external DRAM)

© 2008 NVIDIA Corporation GPU Evolution
High throughput computation: 933 GFLOP/s
High bandwidth memory: 102 GB/s
High availability to all: ~100 million CUDA-capable GPUs sold
[timeline of transistor counts: 1 million (NV1, 1995), 63 million (GeForce4, 2002), 130 million (GeForce FX, 2003), growing to over a billion with the GeForce GTX 280 (2008)]

© 2008 NVIDIA Corporation Accelerating Computation
146X  Interactive visualization of volumetric white matter connectivity
36X   Ionic placement for molecular dynamics simulation on GPU
19X   Transcoding HD video stream to H.264
17X   Simulation in Matlab using .mex file CUDA function
100X  Astrophysics N-body simulation
149X  Financial simulation of LIBOR model with swaptions
47X   An M-script API for linear algebra operations on GPU
20X   Ultrasound medical imaging for cancer diagnostics
24X   Highly optimized object oriented molecular dynamics
30X   Cmatch exact string matching to find similar proteins and gene sequences

© 2008 NVIDIA Corporation Lessons from Graphics Pipeline
Throughput is paramount: must paint every pixel within frame time
Create, run, & retire lots of threads very rapidly: measured 14.8 Gthread/s on increment() kernel
Use multithreading to hide latency: 1 stalled thread is OK if 100 are ready to run

© 2008 NVIDIA Corporation NVIDIA GPU Architecture
GeForce GTX 280 / Tesla T10: 240 scalar cores, on-chip memory, texture units

© 2008 NVIDIA Corporation SM Multiprocessor
8 scalar cores (SP) per SM; 16K 32-bit registers (64KB); usual ops: float, int, branch, …; block-wide barrier in 1 instruction
Shared double precision unit: IEEE 754 64-bit floating point, fused multiply-add, full-speed denormalized operands and results
Direct load/store to memory: the usual linear sequence of bytes, high bandwidth (~100 GB/sec)
Low-latency on-chip memory: 16KB available per SM, shared amongst threads of a block, supports thread communication
[diagram: SM block with 8 SPs, DP unit, SFUs, MT issue, instruction cache, constant cache, and shared memory]
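These per-SM resources can be read at run time through the CUDA runtime API; a minimal sketch (not from the slides) that prints a few of them for device 0:

// query_sm.cu: print per-SM resources using the CUDA runtime API
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                          // properties of device 0
    printf("SMs:                  %d\n", prop.multiProcessorCount);
    printf("Registers per block:  %d\n", prop.regsPerBlock);    // 16K 32-bit registers on this generation
    printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);  // 16KB on this generation
    printf("Warp size:            %d\n", prop.warpSize);
    return 0;
}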

© 2008 NVIDIA Corporation Key Architectural Ideas
Hardware multithreading: HW resource allocation & thread scheduling; HW relies on threads to hide latency
SIMT (Single Instruction Multiple Thread) execution: threads run in groups of 32 called warps; threads in a warp share an instruction unit (IU); HW automatically handles divergence
Threads have all resources needed to run: any warp not waiting for something can run; context switching is (basically) free
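As an illustration of SIMT divergence (a sketch, not from the slides), the kernel below makes the lower and upper half of every 32-thread warp take different branches; the hardware executes the two paths one after the other with the inactive lanes masked off, then reconverges:

// divergence.cu: illustrative sketch of branch divergence within a warp
__global__ void divergent(int *out)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if ((threadIdx.x & 16) == 0)    // lanes 0-15 of each warp take this path...
        out[i] = 2 * i;
    else                            // ...lanes 16-31 take this one; the warp runs both paths serially
        out[i] = 2 * i + 1;
}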

© 2008 NVIDIA Corporation Why is this different from a CPU?
Different goals produce different designs: the GPU assumes the workload is highly parallel; the CPU must be good at everything, parallel or not
CPU: minimize latency experienced by 1 thread; lots of big on-chip caches; extremely sophisticated control
GPU: maximize throughput of all threads; lots of big ALUs; multithreading can hide latency, so skip the big caches; simpler control, cost amortized over ALUs via SIMD

© 2008 NVIDIA Corporation CUDA: Scalable parallel programming
Augment C/C++ with minimalist abstractions: let programmers focus on parallel algorithms, not the mechanics of a parallel programming language
Provide straightforward mapping onto hardware: good fit to GPU architecture; maps well to multi-core CPUs too
Scale to 100's of cores & 10,000's of parallel threads: GPU threads are lightweight (create/switch is free); GPU needs 1000's of threads for full utilization

© 2008 NVIDIA Corporation Key Parallel Abstractions in CUDA
Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads

© 2008 NVIDIA Corporation Hierarchy of concurrent threads
Parallel kernels composed of many threads: all threads execute the same sequential program
Threads are grouped into thread blocks: threads in the same block can cooperate
Threads/blocks have unique IDs
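Inside a kernel these IDs are exposed through built-in variables (threadIdx, blockIdx, blockDim, gridDim). A small sketch, not from the slides, assuming a matrix of width w and height h stored in row-major order, showing how a 2D grid of 2D blocks maps to element coordinates:

// indices.cu: illustrative use of the built-in thread/block ID variables
__global__ void scale2d(float *m, int w, int h, float alpha)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index

    if (x < w && y < h)
        m[y * w + x] *= alpha;                       // each thread scales one element
}

// host-side launch: 16x16 threads per block, enough blocks to cover the matrix
// dim3 block(16, 16);
// dim3 grid((w + 15) / 16, (h + 15) / 16);
// scale2d<<<grid, block>>>(d_m, w, h, 2.0f);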

© 2008 NVIDIA Corporation Example: Vector Addition Kernel (Device Code)
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C, n);
}

© 2008 NVIDIA Corporation Example: Vector Addition Kernel (Host Code)
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C, n);
}

© 2008 NVIDIA Corporation Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C, N);
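The slide stops at the kernel launch; a minimal continuation sketch (not on the slide), assuming a host buffer h_C of N floats has been allocated for the result:

// copy the result back to the host
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);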

© 2008 NVIDIA Corporation Hierarchy of memory spaces
Per-thread local memory
Per-block shared memory
Per-device global memory
[diagram: each thread has its own local memory, each block has its own shared memory, and all kernels (Kernel 0, Kernel 1, ...) share per-device global memory]
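A small kernel sketch (not from the slides) showing where each space appears in CUDA C: automatic variables live in per-thread registers/local memory, __shared__ arrays live in per-block shared memory, and pointers passed from the host refer to per-device global memory. Assumes the kernel is launched with 256-thread blocks and that the grid exactly covers the array:

// memory_spaces.cu: illustrative sketch of the three memory spaces
#define TILE 256

__global__ void reverseTiles(float *data)              // data points into per-device global memory
{
    __shared__ float tile[TILE];                       // per-block shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;     // i lives in per-thread registers/local memory

    tile[threadIdx.x] = data[i];                       // global -> shared
    __syncthreads();                                   // wait for the whole block

    data[i] = tile[blockDim.x - 1 - threadIdx.x];      // reverse within the tile, shared -> global
}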

© 2008 NVIDIA Corporation CUDA Model of Parallelism
CUDA virtualizes the physical hardware: a thread is a virtualized scalar processor (registers, PC, state); a block is a virtualized multiprocessor (threads, shared memory)
Scheduled onto physical hardware without pre-emption: threads/blocks launch & run to completion; blocks should be independent

© 2008 NVIDIA Corporation Thread = virtualized scalar processor
Independent thread of execution: has its own PC, variables (registers), processor state, etc.; no implication about how threads are scheduled
CUDA threads might be physical threads, as on NVIDIA GPUs
CUDA threads might be virtual threads: might pick 1 block = 1 physical thread on a multicore CPU

© 2008 NVIDIA Corporation Block = virtualized multiprocessor
Provides programmer flexibility: freely choose processors to fit data; freely customize for each kernel launch
Thread block = a (data) parallel task: all blocks in a kernel have the same entry point, but may execute any code they want
Thread blocks of a kernel must be independent tasks: the program must be valid for any interleaving of block executions

© 2008 NVIDIA Corporation Blocks must be independent
Any possible interleaving of blocks should be valid: presumed to run to completion without pre-emption; can run in any order, concurrently OR sequentially
Blocks may coordinate but not synchronize: shared queue pointer: OK; shared lock: BAD, can easily deadlock (see the sketch below)
Independence requirement gives scalability
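As an illustration of the "shared queue pointer: OK" case (a sketch, not from the slides), blocks can claim work by atomically advancing a global counter; no block ever waits on another, so any interleaving of block executions stays valid. The names nextTile and drainQueue are made up for this example, and the arrays are assumed to hold numTiles * blockDim.x elements:

// work_queue.cu: illustrative sketch of blocks coordinating through an atomic queue pointer
__device__ int nextTile = 0;                         // the shared queue pointer, in global memory

__global__ void drainQueue(const float *in, float *out, int numTiles)
{
    __shared__ int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(&nextTile, 1);          // one thread claims the next tile for its block
        __syncthreads();                             // the whole block sees the claimed tile index

        if (tile >= numTiles) return;                // queue drained: every block simply finishes

        int i = tile * blockDim.x + threadIdx.x;     // each thread handles one element of the tile
        out[i] = 2.0f * in[i];
        __syncthreads();                             // don't let thread 0 reuse 'tile' too early
    }
}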

© 2008 NVIDIA Corporation Example: Parallel Reduction Summing up a sequence with 1 thread: int sum = 0; for(int i=0; i<N; ++i) sum += x[i]; Parallel reduction builds a summation tree each thread holds 1 element stepwise partial sums N threads need log N steps one possible approach: Butterfly pattern
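As a concrete illustration (not on the slide), take N = 8 values x0 … x7, one per thread. The butterfly needs log2 8 = 3 steps: in step 1 each thread i adds the value held by thread i XOR 4 (pairs 0-4, 1-5, 2-6, 3-7); in step 2 it adds the partial sum held by thread i XOR 2; in step 3, by thread i XOR 1. After the third step every one of the 8 threads holds x0 + x1 + … + x7.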

© 2008 NVIDIA Corporation Parallel Reduction for 1 Block
// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];

// One thread per element
sum[i] = x_i;
__syncthreads();

for(int bit=blocksize/2; bit>0; bit/=2)
{
    int t = sum[i] + sum[i^bit];
    __syncthreads();
    sum[i] = t;
    __syncthreads();
}
// OUTPUT: Every thread now holds sum in sum[i]
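A self-contained kernel built around this fragment (a sketch, not from the slides), assuming the input lives in a global array x, the block size is a power of two, and each block writes its total to a blockSums array:

// block_reduce.cu: illustrative wrapping of the slide's butterfly reduction
#define BLOCKSIZE 256

__global__ void blockReduce(const int *x, int *blockSums)
{
    __shared__ int sum[BLOCKSIZE];

    int i = threadIdx.x;
    sum[i] = x[blockIdx.x * BLOCKSIZE + i];          // thread i loads its element
    __syncthreads();

    for (int bit = BLOCKSIZE / 2; bit > 0; bit /= 2) {
        int t = sum[i] + sum[i ^ bit];               // butterfly: pair up with partner i^bit
        __syncthreads();
        sum[i] = t;
        __syncthreads();
    }

    if (i == 0)
        blockSums[blockIdx.x] = sum[0];              // every thread holds the total; thread 0 writes it
}

// host-side launch, assuming n is a multiple of BLOCKSIZE:
// blockReduce<<<n / BLOCKSIZE, BLOCKSIZE>>>(d_x, d_blockSums);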

© 2008 NVIDIA Corporation Final Thoughts
GPUs are throughput-oriented microprocessors: manycore architecture; massive hardware multithreading; ubiquitous commodity hardware
The CUDA programming model is simple yet powerful: traditional scalar execution model with transparent SIMD; simple extensions to an existing sequential language
Many important research opportunities, not to speak of the educational challenges

© 2008 NVIDIA Corporation Questions?