The Present and Future of Parallelism on GPUs

The Present and Future of Parallelism on GPUs
Paulius Micikevicius, NVIDIA

Terminology
We’ll define a “brick” for this talk, since all vendors use different names. GPUs are built by replicating bricks (10s of them), connected to memory via some network.
Brick = the minimal building block that contains its own:
- Control unit(s): decodes/issues instructions
- Registers
- Pipelines for instruction execution
- Local cache(s)

HW Commonalities (likely to remain in the future)
- Built by replicating 10s of “bricks”
- “Bricks” are vector processors
  - Different execution paths within vectors are supported, but degrade performance
  - Different execution paths in different vectors have no impact on performance
- Vectors access memory in cache lines
  - Consecutive threads (vector elements) should access a contiguous memory region
  - Scatter-gather is supported, but will fetch multiple lines, increasing latency and bandwidth pressure
- High GPU memory bandwidth (100+ GB/s)
- In-order instruction issue
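
A minimal CUDA sketch (not from the slides) of the memory point above: in copy_coalesced, consecutive threads of a vector read consecutive floats, so each 32-thread access is served by a minimal number of cache lines; in copy_strided, each thread jumps by 'stride' elements, so a single vector access can touch many lines, increasing latency and bandwidth pressure. Kernel names and parameters are illustrative only.

__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // contiguous region per vector of threads
}

__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    // 'in' is assumed to hold at least n * stride floats.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];        // scatter-gather style: many cache lines per vector
}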

NVIDIA GPU Brick (Streaming Multiprocessor)
- Up to 1536 threads
- Instructions are issued per 32 threads (a warp)
  - Think: a 32-way vector of threads
- Source code is for a single thread and is scalar:
  - No vector intrinsics a la SSE
  - HW handles grouping of threads into vectors, and vector control flow
- Dual-issue: instructions from different warps
- Shared memory, L1 cache
- Large register file, partitioned among threads
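
To make “scalar source code, hardware-grouped vectors” concrete, here is a small illustrative kernel (not from the deck; names are made up). The code is written for one thread; the SM issues it 32 threads (one warp) at a time, so a branch aligned to warp boundaries costs nothing extra, while a branch that splits lanes of the same warp is executed for both sides.

__global__ void branch_demo(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Warp-aligned branch: all 32 threads of a given warp take the same path,
    // so there is no divergence penalty.
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] *= 0.5f;

    // Intra-warp branch: odd and even lanes of the same warp take different
    // paths, so the warp serially executes both sides (correct, but slower).
    if (threadIdx.x % 2 == 0)
        data[i] += 1.0f;
    else
        data[i] -= 1.0f;
}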

AMD GPU Brick (SIMD Engine)
- Up to ~1500 threads
- Instructions are issued per 64 threads
- VLIW instruction issue:
  - HW designed for 5 “issue” slots (16x5 ‘cores’ per brick)
  - Combine up to 5 instructions from the same thread to maximize performance
- Source code is for a single thread and is scalar:
  - HW handles grouping threads into vectors and control flow
  - Compiler handles the VLIW combining
- Shared memory, L1 cache
- Large register file, partitioned among threads
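
A hedged sketch of what “Compiler handles the VLIW combining” means for the kernel author (written in CUDA C for consistency with the code elsewhere on this page; an OpenCL kernel for AMD hardware would look essentially the same). Giving each thread several independent operations is what lets the compiler fill the five issue slots; the kernel name and the four-element unrolling factor are illustrative choices, not anything prescribed by the talk.

__global__ void saxpy_ilp(float *y, const float *x, float a, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;   // spacing keeps each access coalesced

    // Four independent elements per thread: the multiply-adds below have no
    // dependences on one another, so they can be packed into wide VLIW words
    // (or overlapped by a dual-issue scheduler).  Tail handling is omitted.
    if (i + 3 * stride < n) {
        float x0 = x[i];
        float x1 = x[i + stride];
        float x2 = x[i + 2 * stride];
        float x3 = x[i + 3 * stride];
        y[i]              = a * x0 + y[i];
        y[i + stride]     = a * x1 + y[i + stride];
        y[i + 2 * stride] = a * x2 + y[i + 2 * stride];
        y[i + 3 * stride] = a * x3 + y[i + 3 * stride];
    }
}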

Intel Larrabee Brick (Core)
- Up to 4 threads
- Scalar and vector (512-bit SIMD) units
  - For example: 16 x fp32 vector SIMD
- Dual issue: scalar + vector, from the same thread
- Source code is for a single thread and is vector:
  - Intrinsics for SIMD operations (a la SSE)
- L1 and L2 caches
  - Intrinsics for pre-fetching and prioritization of cache lines
  - No user-managed shared memory
- Small register file (relies on caches)
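
For contrast with the scalar-per-thread kernels above, here is what explicit-vector source code looks like. Larrabee’s own (LRBni) intrinsics never shipped in mainstream compilers, so this sketch uses today’s AVX-512 intrinsics, which share the 512-bit, 16 x fp32 shape, purely as a stand-in; the point is that the 16-wide SIMD operation is written by the programmer rather than formed by the hardware.

#include <immintrin.h>

// Explicit 16-wide fp32 SIMD: the vector width is visible in the source,
// unlike CUDA/OpenCL where each thread is scalar and the HW builds vectors.
// Assumes n is a multiple of 16; AVX-512 is an illustrative stand-in only.
void saxpy_simd(float *y, const float *x, float a, int n)
{
    __m512 va = _mm512_set1_ps(a);
    for (int i = 0; i < n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));   // y = a*x + y
    }
}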

Current Differences
Hardware:
- Vector width: 16, 32, 64
- VLIW (AMD) vs dual-issue (NVIDIA, Intel)
- Dual-issue: same thread (Intel) vs different threads (NVIDIA)
- Large register file, small cache (NVIDIA, AMD) vs small register file, larger caches (Intel)
Programming model:
- Intel: vector source code (SIMD intrinsics)
- AMD, NVIDIA: scalar source code; HW aggregates threads into vectors and resolves control-flow divergence

References
NVIDIA:
- Fermi architecture: http://www.nvidia.com/object/fermi_architecture.html
- Fermi Compute Architecture Whitepaper: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
AMD:
- Unleashing the Power of Parallel Compute with Commodity ATI Radeon 5800 GPU, Siggraph Asia 2009: http://sa09.idav.ucdavis.edu/docs/SA09_AMD_IHV.pdf
Intel:
- L. Seiler et al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,” Siggraph 2008: http://software.intel.com/file/18198/
- Tom Forsyth, “The Challenge of Larrabee as GPU,” colloquium at Stanford, 2010: http://www.stanford.edu/class/ee380/Abstracts/100106.html

Simple 2D Stencil Code in CUDA (OpenCL equivalent in comments)

__global__ void stencil_2d( float *output, float *input,
                            const int dimx, const int dimy, const int row_size )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;   // ix = get_global_id(0);
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;   // iy = get_global_id(1);
    int idx = iy * row_size + ix;

    output[idx] = -4 * input[idx]
                +  input[idx + 1] + input[idx - 1]
                +  input[idx + row_size] + input[idx - row_size];
}
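
A minimal host-side sketch (not part of the original slides) showing one way to launch stencil_2d, assuming the kernel above is in the same .cu file, the interior dimensions are divisible by the block size, and the arrays carry a one-element halo so the +/-1 and +/-row_size reads stay in bounds. All sizes and the block shape are illustrative choices.

#include <cuda_runtime.h>

int main()
{
    const int dimx = 256, dimy = 256;
    const int row_size = dimx + 2;                      // one halo column on each side
    const size_t bytes = (size_t)row_size * (dimy + 2) * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);                         // stand-in for real input data

    dim3 block(32, 8);                                  // x-dimension matches the 32-thread warp
    dim3 grid(dimx / block.x, dimy / block.y);

    // Offset the base pointers past the first halo row and column so that
    // (ix, iy) = (0, 0) in the kernel maps to the first interior element.
    stencil_2d<<<grid, block>>>(d_out + row_size + 1, d_in + row_size + 1,
                                dimx, dimy, row_size);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}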