Why GPUs? Robert Strzodka

2 Overview: Computation / Bandwidth / Power – CPU–GPU Comparison – GPU Characteristics

3 Data Processing in General: data flows IN through the processor and OUT to memory. The bottlenecks: the memory wall and a lack of parallelism.

4 Old and New Wisdom in Computer Architecture
–Old: Power is free, transistors are expensive. New: "Power wall" – power is expensive, transistors are free (one can put more transistors on a chip than one can afford to turn on)
–Old: Multiplies are slow, memory access is fast. New: "Memory wall" – multiplies are fast, memory is slow (~200 clocks to DRAM, 4 clocks for an FP multiply)
–Old: Increase instruction-level parallelism via compilers and hardware innovation (out-of-order execution, speculation, VLIW, …). New: "ILP wall" – diminishing returns on more ILP hardware; explicit thread and data parallelism must be exploited
–New: Power wall + memory wall + ILP wall = brick wall
slide courtesy of Christos Kozyrakis

5 Uniprocessor Performance (SPECint). From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. The chart shows performance growth flattening (a ~3× gap versus the historical trend) – a sea change in chip design: multiple "cores" or processors per chip. slide courtesy of Christos Kozyrakis

6 Instruction-Stream-Based Processing: the processor fetches an instruction stream through a cache and moves data to and from memory.

7 Instruction- and Data-Streams. Addition of 2D arrays: C = A + B.
Instruction-stream processing:
for(y=0; y<HEIGHT; y++)
  for(x=0; x<WIDTH; x++) {
    C[y][x] = A[y][x] + B[y][x];
  }
Data streams undergoing a kernel operation:
inputStreams(A, B);
outputStream(C);
kernelProgram(OP_ADD);
processStreams();

8 Data-Stream-Based Processing: data streams from memory through a pipeline; a configuration, rather than an instruction stream, defines the operation the pipeline performs.

9 Architectures: Data – Processor Locality
–Field Programmable Gate Array (FPGA): compute by configuring Boolean functions and local memory
–Processor Array / Multi-core Processor: assemble many (simple) processors and memories on one chip
–Processor-in-Memory (PIM): insert processing elements directly into RAM chips
–Stream Processor: create data locality through a hierarchy of memories

10 Overview: Computation / Bandwidth / Power – CPU–GPU Comparison – GPU Characteristics

11 The GPU is a Fast, Parallel Array Processor
–Input arrays: 2D (typical), also 1D or 3D
–Vertex Processor (VP): kernel changes index regions of output arrays
–Rasterizer: creates data streams from index regions
–Stream of array elements, order unknown
–Fragment Processor (FP): kernel changes each datum independently, may read further input arrays
–Output arrays: 2D (typical), also 1D or 3D (slice)

12 Index Regions in Output Arrays
–Quads and triangles: fastest option
–Line segments: slower; try to pair lines into 2×h or w×2 quads
–Point clouds: slowest; try to gather points into larger forms

13 High-Level Graphics Language for the Kernels
–Float data types: half 16-bit (s10e5), float 32-bit (s23e8)
–Vectors, structs and arrays: float4, float vec[6], float3x4, float arr[5][3], struct {}
–Arithmetic and logic operators: +, -, *, /; &&, ||, !
–Trigonometric and exponential functions: sin, asin, exp, log, pow, …
–User-defined functions: max3(float a, float b, float c) { return max(a, max(b, c)); }
–Conditional statements, loops: if, for, while; dynamic branching in PS3
–Streaming and random data access

14 Input and Output Arrays
–CPU: input and output arrays may overlap
–GPU: input and output arrays must not overlap

15 Native Memory Layout – Data Locality
–CPU: 1D input, 1D output; higher dimensions via offsets
–GPU: 1D, 2D, 3D input; 2D output; other dimensions via offsets
(Diagram color-codes locality: red = near, blue = far)

16 Data-Flow: Gather and Scatter
–CPU: arbitrary gather, arbitrary scatter
–GPU: arbitrary gather, restricted scatter

17 Overview: Computation / Bandwidth / Power – CPU–GPU Comparison – GPU Characteristics

18 1) Computational Performance
GFLOPS chart (courtesy of John Owens) comparing CPUs against GPUs up to the ATI R520.
Note: sustained performance is usually much lower and depends heavily on the memory system!

19 2) Memory Performance
Chart (courtesy of Ian Buck) compares a GeForce 7800 GTX and a Pentium 4 for cached, sequential and random memory access.
–CPU: large cache, few processing elements, optimized for spatial and temporal data reuse
–GPU: small cache, many processing elements, optimized for sequential (streaming) data access

20 3) Configuration Overhead
Chart (courtesy of Ian Buck) shows the transition from configuration-limited to computation-limited execution as the amount of work per configuration grows.

21 Conclusions
–Parallelism is now indispensable for further performance increases
–Both memory-dominated and processing-element-dominated designs have pros and cons
–Mapping algorithms to the appropriate architecture allows enormous speedups
–Many of the GPU's restrictions are crucial for parallel efficiency (you cannot eat the cake and have it too)