CSE 690: GPGPU Lecture 4: Stream Processing Klaus Mueller Computer Science, Stony Brook University.

Computing Architectures
- Von Neumann: traditional CPUs
- Systolic arrays
- SIMD architectures: for example, Intel's SSE and MMX
- Vector processors
- Stream processors: GPUs are a form of these

Von Neumann
- the classic form of programmable processor
- creates the von Neumann bottleneck
  - the separation of memory and ALU creates bandwidth problems
  - today's ALUs are much faster than today's data links
  - this limits compute-intensive applications
  - cache management to overcome slow data links adds to control overhead

Early Streamers
- Systolic arrays
  - arrange computational units in a specific topology (ring, line)
  - data flows from one unit to the next
- SIMD (Single Instruction, Multiple Data)
  - a single instruction stream is executed by every processor in a collection
  - multiple data streams are presented, one to each processing unit
  - SSE and MMX are 4-way SIMD, but still require an instruction decode for each word
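The SIMD idea can be sketched in a few lines of Python (an illustration of the concept, not real hardware): one decoded instruction drives all data lanes at once.

```python
# Illustrative sketch: a single "add" instruction applied across four
# data lanes simultaneously, 4-way SIMD-style (as in SSE on 32-bit floats).

def simd_add(lanes_a, lanes_b):
    """One decoded instruction operates on all four lanes element-wise."""
    assert len(lanes_a) == len(lanes_b) == 4
    return [a + b for a, b in zip(lanes_a, lanes_b)]

result = simd_add([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
print(result)  # [11.0, 22.0, 33.0, 44.0]
```

The key property is that the per-lane work shares a single fetch/decode; the function name `simd_add` is of course made up for the sketch.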

Early Streamers
- Vector processors
  - made popular by Cray supercomputers
  - represent data as vectors
  - load a vector with a single instruction (amortizes instruction decode overhead)
  - expose data parallelism by operating on large aggregates of data
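The decode amortization can be made concrete with a toy count (a back-of-the-envelope model, not a cycle-accurate simulator; the vector length of 64 is an assumption in the style of classic Cray machines):

```python
# Toy model of instruction-decode amortization: a scalar loop decodes one
# instruction per element, while a vector unit decodes once per vector.

def scalar_decodes(n_elements):
    return n_elements                 # one decode per element-wise operation

def vector_decodes(n_elements, vector_length=64):
    # one decode covers a full vector register (Cray-style)
    return -(-n_elements // vector_length)    # ceiling division

print(scalar_decodes(1024))   # 1024 decodes
print(vector_decodes(1024))   # 16 decodes
```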

Stream Processors - Motivation
- In VLSI technology, computing is cheap
  - thousands of arithmetic logic units operating at multiple GHz can fit on a 1 cm^2 die
- But delivering instructions and data to these ALUs is expensive
- Example:
  - only 6.5% of the Itanium die is devoted to its 12 integer and 2 floating-point ALUs and their registers
  - the remainder is used for communication, control, and storage

Stream Processors - Motivation
- Thus, general-purpose use of CPUs comes at a price
- In contrast:
  - the more efficient control and communication on the Nvidia GeForce4 enables the use of many hundreds of floating-point and integer ALUs
  - for the special purpose of rendering 3D images
  - this task exposes abundant parallelism
  - and requires little global communication and storage

Stream Processors - Motivation
- Goal:
  - expose these patterns of communication, control, and parallelism to a wider class of applications
  - create a general-purpose streaming architecture without compromising these advantages
- Proposed streaming architectures
  - Imagine (Stanford)
  - CHEOPS
- Existing implementations that come close
  - Nvidia FX and ATI Radeon GPUs
  - these enable general-purpose computation on GPUs (GPGPU)

Stream Processing
- Organize an application into streams and kernels
  - exposes the inherent locality and concurrency (here, of media-processing applications)
- This creates the programming model for stream processors
  - and therefore also for GPGPU
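A minimal sketch of this programming model in Python (the names `run_kernel`, `scale`, and `clamp` are illustrative, not taken from Stream-C or any real API): kernels are functions applied element-wise to streams, and kernels chain into a pipeline.

```python
# Stream programming model sketch: a kernel is applied independently to
# every element of a stream, so no element depends on another -- this is
# what exposes the locality and data parallelism the slide refers to.

def run_kernel(kernel, stream):
    """Apply one kernel to every element of a stream."""
    return [kernel(x) for x in stream]

scale = lambda x: 2 * x          # hypothetical kernel 1
clamp = lambda x: min(x, 10)     # hypothetical kernel 2

stream = [1, 4, 7]
out = run_kernel(clamp, run_kernel(scale, stream))   # kernels chained
print(out)  # [2, 8, 10]
```

Intermediate results stay "on chip" between kernels instead of going back to global memory, which is the locality the stream model is designed to capture.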

Stream Proc. – Memory Hierarchy
- Local register files (LRFs)
  - hold operands for arithmetic operations (similar to caches on CPUs)
  - exploit fine-grain locality
- Stream register files (SRFs)
  - capture coarse-grain locality
  - efficiently transfer data to and from the LRFs
- Off-chip memory
  - stores global data
  - used only when necessary

Stream Proc. – Memory Hierarchy
- These levels form a bandwidth hierarchy as well
  - roughly an order of magnitude between adjacent levels
  - well matched to today's VLSI technology
- By exploiting the locality of media operations, the hundreds of ALUs can operate at peak rate
- While CPUs and DSPs rely on global storage and communication, stream processors get more "bang" out of a die

Stream Processing – Example 1
- a Stream-C program: an MPEG-2 video encoder

Stream Processing – Example 2
- MPEG-2 I-frame encoder
  - DCT: Discrete Cosine Transform, Q: Quantization, IQ: Inverse Quantization
- Global communication (from RAM) is needed for the reference frames (to maintain persistent information)

Stream Processing - Parallelism
- Instruction-level
  - exploit parallelism among the scalar operations within a kernel
  - for example, gen_L_blocks and gen_C_blocks can occur in parallel
- Data-level
  - operate on several data items within a stream in parallel
  - for example, different blocks can be converted simultaneously
  - note, however, that this gives up the benefits that come with sequential processing (see later)
- Task-level
  - must obey dependencies in the stream graph
  - for example, the two DCTQ kernels could run in parallel
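Data-level parallelism in this encoder can be sketched as follows (a toy example: `block_transform` is a hypothetical stand-in for per-block work such as the level shift before a DCT, and a thread pool stands in for the parallel ALUs):

```python
# Data-level parallelism sketch: independent blocks of a frame are
# transformed concurrently, since no block depends on another.
from concurrent.futures import ThreadPoolExecutor

def block_transform(block):
    return [v - 128 for v in block]   # e.g. level shift applied per block

blocks = [[128, 130], [126, 128], [140, 120]]
with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(block_transform, blocks))   # order preserved
print(transformed)  # [[0, 2], [-2, 0], [12, -8]]
```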

Stream Processors vs. GPUs
- Stream elements
  - points (vertex stage)
  - fragments, essentially pixels (fragment stage)
- Kernels
  - vertex and fragment shaders
- Memory
  - texture memory plays the role of the SRFs
  - the LRFs are not exposed, if present at all
  - bandwidth to RAM is better with PCI-Express
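The mapping can be illustrated with a toy software model (all names here, `shade`, `texture`, `framebuffer`, are invented for the sketch; real shaders are written in Cg/GLSL, not Python): the fragment shader is a pure kernel invoked once per fragment, reading from read-only texture memory.

```python
# Toy model of the GPU as a stream processor: each fragment is a stream
# element, the shader is the kernel, and the texture is SRF-like
# read-only input. No invocation sees another fragment's state.

texture = {(0, 0): 100, (1, 0): 230}   # read-only input, 8-bit values

def shade(coord, tex):
    # one kernel invocation per fragment: brighten and saturate
    return min(tex[coord] + 64, 255)

framebuffer = {c: shade(c, texture) for c in texture}
print(framebuffer)  # {(0, 0): 164, (1, 0): 255}
```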

Stream Processors vs. GPUs
- Data parallelism
  - fragments and points are processed in parallel
- Task parallelism
  - fragment and vertex shaders can work in parallel
  - data transfer from RAM can be overlapped with computation
- Instruction parallelism
  - see task parallelism

Stream Processors vs. GPUs
- Stream processors allow looping and jumping
  - not possible on GPUs (at least not in a straightforward way)
- Stream processors follow the Turing machine model
  - GPUs are more restricted (see above)
- Stream processors have much more memory
  - GPUs have 256 MB, soon 512 MB

Conclusions
- GPUs are not 100% stream processors, but they come close
  - and one can actually buy them, cheaply
- The loss of jumps and loops enforces pipeline discipline
- The lack of memory allows the use of small caches and prevents swamping the chip with data
- Data parallelism often requires task decomposition into multiple passes (see later)

References
- U. Kapasi, S. Rixner, W. Dally, et al., "Programmable stream processors," IEEE Computer, August 2003
- S. Venkatasubramanian, "The graphics card as a stream computer," SIGMOD-DIMACS Workshop, 2003