The Imagine Stream Processor: Flexibility with Performance (William J. Dally, Computer Systems Laboratory, Stanford University, March 30, 2001)

Presentation transcript:

The Imagine Stream Processor: Flexibility with Performance
March 30, 2001
William J. Dally
Computer Systems Laboratory, Stanford University

Outline
– Motivation: we need low-power, programmable TeraOps
– The problem is bandwidth
  – Growing gap between special-purpose and general-purpose hardware
  – It's easy to make ALUs, hard to keep them fed
– A stream processor gives programmable bandwidth
  – Streams expose locality and concurrency in the application
  – A bandwidth hierarchy exploits this
– Imagine is a 20 GFLOPS prototype stream processor
– Many opportunities to do better
  – Scaling up
  – Simplifying programming

Motivation
Some things I'd like to do with a few TeraOps:
– Have a realistic face-to-face meeting with someone in Boston without riding an airplane
  (4-8 cameras, extract depth, fit model, compress, render to several screens)
– High-quality rendering at video rates
  (ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s)

The good news: FLOPS are cheap, OPS are cheaper
– 32-bit FPU: 2 GFLOPS/mm², 400 GFLOPS/chip
– 16-bit add: 40 GOPS/mm², 8 TOPS/chip
[Figure: layout of a local register file and integer adder, roughly 460 µm across]
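A quick sanity check on these density figures (the die area below is an inference from the numbers on this slide, not something the slide states):

    400 GFLOPS/chip ÷ 2 GFLOPS/mm² = 200 mm² of 32-bit FPUs
    8 TOPS/chip ÷ 40 GOPS/mm² = 200 mm² of 16-bit adders

Both per-chip figures correspond to roughly 200 mm² of die area filled with nothing but arithmetic.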

The bad news: general-purpose processors can't harness this

Why do Special-Purpose Processors Perform Well?
– Lots (100s) of ALUs
– Fed by dedicated wires and memories

Care and Feeding of ALUs
[Figure: a single ALU and the structures that feed it: registers supplying data bandwidth; an instruction cache, instruction register (IR), and instruction pointer (IP) supplying instruction bandwidth]
The 'feeding' structure dwarfs the ALU.

The problem is bandwidth
Can we solve this bandwidth problem without sacrificing programmability?

Streams expose locality and concurrency
[Figure: stream graph for stereo depth extraction: Image 0 and Image 1 each pass through a convolve kernel, the convolved streams feed an SAD kernel, and the result is a Depth Map]
– Operations within a kernel operate on local data
– Streams expose data parallelism
– Kernels can be partitioned across chips to exploit control parallelism
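Expressed as code, this stream graph is just a sequence of kernel invocations over whole streams. The sketch below is a plain-C illustration of that structure only; the stream type, kernel bodies, and names are simplified stand-ins invented for this example, not Imagine's actual StreamC/KernelC code.

/* Illustrative C sketch of the stream graph above: two convolve
 * kernels feed an SAD kernel.  All types and bodies are stand-ins. */
#include <stddef.h>

typedef struct { short *data; size_t len; } stream;

/* Placeholder kernel: a real convolve applies 7x7 and 3x3 filters.
 * Assumes the output stream is at least as long as the input.      */
static void convolve(stream in, stream out) {
    for (size_t i = 0; i < in.len && i < out.len; i++)
        out.data[i] = in.data[i];
}

/* Placeholder kernel: a real SAD kernel searches 30 disparities per
 * pixel.  Assumes equal-length input streams.                      */
static void sad(stream a, stream b, stream depth) {
    for (size_t i = 0; i < depth.len; i++) {
        int d = a.data[i] - b.data[i];
        depth.data[i] = (short)(d < 0 ? -d : d);
    }
}

/* The application is a sequence of kernel invocations over streams;
 * the intermediate streams conv0 and conv1 never leave the chip.   */
void depth_extract(stream image0, stream image1,
                   stream conv0, stream conv1, stream depth_map)
{
    convolve(image0, conv0);
    convolve(image1, conv1);
    sad(conv0, conv1, depth_map);
}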

A Bandwidth Hierarchy exploits locality and concurrency
[Figure: three-level bandwidth hierarchy: SDRAM at 2 GB/s, Stream Register File at 32 GB/s, ALU clusters at 544 GB/s]
– VLIW clusters with shared control
– Many operations performed per word of memory bandwidth
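The ratio between the top and bottom of the hierarchy can be read off the figures above (the exact operations-per-word count on the original slide did not survive, so this is a rough derived number):

    544 GB/s (ALU clusters) ÷ 2 GB/s (SDRAM) = 272

so each word fetched from off-chip memory can support a few hundred words of local register and ALU traffic.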

Bandwidth Usage
[Figure: bandwidth usage at each level of the hierarchy: SDRAM (2 GB/s), Stream Register File (32 GB/s), ALU clusters (544 GB/s)]

The Imagine Stream Processor

Arithmetic Clusters

Performance
[Figure: performance results for 16-bit kernels, 16-bit applications, a floating-point kernel, and a floating-point application]

Power
[Figure: power efficiency results, GOPS/W]

A Look Inside an Application: Stereo Depth Extraction
– 320x240 8-bit grayscale images
– 30-disparity search
– 220 frames/second
– 12.7 GOPS
– 5.7 GOPS/W
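These throughput numbers imply a per-pixel work figure (a derived estimate, not stated on the slide):

    12.7 GOPS ÷ (220 frames/s × 320 × 240 pixels) ≈ 750 operations per pixel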

Stereo Depth Extractor (kernel pipeline)
Convolutions:
– Load original packed row
– Unpack (8-bit -> 16-bit)
– 7x7 convolve
– 3x3 convolve
– Store convolved row
Disparity search:
– Load convolved rows
– Calculate block SADs at different disparities
– Store best disparity values
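As a concrete illustration of the disparity-search step, here is a scalar C sketch of a per-row block-SAD search. The window width and the one-dimensional window are assumptions made for this example; the slide only specifies a 30-disparity search, and the real kernel runs data-parallel across Imagine's clusters.

#define MAX_DISP 30   /* from the slide: 30-disparity search        */
#define WIN       7   /* assumed SAD window width (illustrative)    */

/* For each pixel in a row, pick the disparity whose window SAD
 * against the other image is smallest.  Pixels too close to the
 * end of the row are skipped in this sketch.                      */
void disparity_row(const short *left, const short *right,
                   unsigned char *best_disp, int width)
{
    for (int x = 0; x + WIN + MAX_DISP <= width; x++) {
        int best_sad = 0x7fffffff;
        for (int d = 0; d < MAX_DISP; d++) {
            int sad = 0;
            for (int k = 0; k < WIN; k++) {          /* sum of absolute differences */
                int diff = left[x + k] - right[x + d + k];
                sad += diff < 0 ? -diff : diff;
            }
            if (sad < best_sad) {
                best_sad = sad;
                best_disp[x] = (unsigned char)d;     /* remember the best disparity */
            }
        }
    }
}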

7x7 Convolve Kernel
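For reference, the arithmetic in a 7x7 convolution over one output row can be sketched in plain C as below. This is only a scalar illustration with an assumed fixed-point scaling; the actual kernel is written in KernelC and scheduled as VLIW code across the eight arithmetic clusters.

/* Scalar sketch of one output row of a 7x7 convolution (illustrative). */
void convolve7x7_row(const short *in_rows[7],   /* 7 input rows held locally */
                     const short coeff[7][7],   /* filter coefficients       */
                     short *out, int width)
{
    for (int x = 0; x + 7 <= width; x++) {
        long long acc = 0;                      /* wide accumulator for 49 products */
        for (int r = 0; r < 7; r++)             /* 7 rows                            */
            for (int c = 0; c < 7; c++)         /* 7 taps per row: 49 MACs per pixel */
                acc += (long long)coeff[r][c] * in_rows[r][x + c];
        out[x] = (short)(acc >> 8);             /* assumed fixed-point scaling       */
    }
}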

Imagine gives high performance with low power and flexible programming
Matches the capabilities of communication-limited technology to the demands of signal and image processing applications.
Performance:
– Compound stream operations realize >10 GOPS on key applications
– Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power:
– Three-level register hierarchy gives 2-10 GOPS/W
Flexibility:
– Programmed in "C"
– Streaming model
– Conditional stream operations enable applications like sort
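To show the kind of data-dependent step a conditional stream operation enables, here is a plain-C sketch of a partition pass such as a stream sort needs: each input element is appended to one of two compacted output streams. The stream type and append helper are invented for this illustration and are not Imagine's actual interface.

#include <stddef.h>

typedef struct { int *data; size_t len; } istream;

/* Conditional append: only elements that satisfy the predicate land in a
 * given output stream, so the outputs stay dense (no bubbles).
 * Assumes each output has capacity for in->len elements.               */
static void append(istream *s, int value) { s->data[s->len++] = value; }

void partition_stream(const istream *in, int pivot, istream *lo, istream *hi)
{
    for (size_t i = 0; i < in->len; i++) {
        if (in->data[i] < pivot)
            append(lo, in->data[i]);   /* conditional output stream 1 */
        else
            append(hi, in->data[i]);   /* conditional output stream 2 */
    }
}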

A look forward
Next steps:
– Build some Imagine prototypes: dual-processor 40 GFLOPS systems, 64-processor TeraFLOPS systems
Longer term:
– An 'Industrial Strength' Imagine with more GFLOPS per chip: multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth
– Graphics extensions: texture cache and raster unit as SRF clients
– A streaming supercomputer: 64-bit FP, high-bandwidth global memory, MIMD extensions
– Simplified stream programming: automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, and staging of data

Take-home message
– VLSI technology enables us to put TeraOPS on a chip
– Conventional general-purpose architecture cannot exploit this: the problem is bandwidth
– Casting an application as kernels operating on streams exposes locality and concurrency
– A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth (bandwidth hierarchy, compound stream operations)
– Imagine is a prototype stream processor: one chip, 20 GFLOPS peak, 10 GFLOPS sustained, 4 W
– Systems scale to TeraFLOPS and more