Stream Architecture: Rethinking Media Processor Design

Presentation transcript:

Stream Architecture: Rethinking Media Processor Design Scott Rixner April 9, 2001 Rice University Computer Systems Laboratory

Media Processing
  Video/image compression and decompression: MPEG, JPEG, ...
  Signal processing: DSL modems, cellular base stations, ...
  Image synthesis: polygon rendering, image-based rendering, ...
  Image understanding: face recognition, depth extraction, ...
Scott Rixner Stream Architecture

Stereo Depth Extraction
  Input: left and right camera images, 640x480 @ 30 fps
  Output: depth map
  Requirements: 11 GOPS
  Imagine stream processor: 12.1 GOPS, 4.6 GOPS/W
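As a rough back-of-the-envelope check on the figures from this slide, the sketch below converts the 11 GOPS requirement into operations per pixel and derives the power implied by 12.1 GOPS at 4.6 GOPS/W. It assumes the requirement applies to one 640x480 output frame 30 times per second; the variable names are illustrative and do not come from the slides.

```python
# Back-of-the-envelope arithmetic using only the numbers on the slide.
WIDTH, HEIGHT, FPS = 640, 480, 30
REQUIRED_GOPS = 11.0        # stated requirement
IMAGINE_GOPS = 12.1         # stated Imagine performance
IMAGINE_GOPS_PER_W = 4.6    # stated Imagine power efficiency

# Assumption: one depth value is produced per pixel per frame.
pixels_per_second = WIDTH * HEIGHT * FPS
ops_per_pixel = REQUIRED_GOPS * 1e9 / pixels_per_second

# Power implied by dividing throughput by efficiency.
implied_power_w = IMAGINE_GOPS / IMAGINE_GOPS_PER_W

print(f"{pixels_per_second} pixels/s")
print(f"{ops_per_pixel:.0f} operations per pixel")
print(f"{implied_power_w:.2f} W implied power")
```

Under that assumption the workload needs on the order of a thousand operations per output pixel, which is the kind of compute intensity the rest of the talk relies on.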

Outline
  Stream Processing
  VLSI Constraints
  Register Organization
  Imagine
  Conclusions

Media Processing Characteristics
  Low-precision data
    24% 8-bit integer operations
    29% 16-bit integer operations
  Abundant data parallelism
  Little global data reuse
    Average of 1.5 references per global data word
  Numerous computations per global reference
    50-500 operations per global data reference

Stream Processing
  [Diagram: input data streams (Image 0, Image 1) pass through kernels (convolve, SAD) to produce an output data stream (Depth Map)]
  Little data reuse (pixels never revisited)
  Highly data parallel (output pixels not dependent on other output pixels)
  Compute intensive (>60 operations per memory reference)
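The stream model on this slide can be illustrated with a small sketch in which kernels are Python generators and streams are iterables of image rows. This is only an illustration of the programming model, not the Imagine toolchain: the function names, the one-row-at-a-time streaming, the edge clamping, the 3-pixel horizontal SAD window, and the disparity search range are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence

Row = Sequence[int]


def convolve(stream: Iterable[Row], taps: Sequence[int]) -> Iterator[List[int]]:
    """Kernel: filter each row of the input stream with a 1-D filter.

    Taps are applied without flipping, which is identical to convolution
    for symmetric filters such as [1, 2, 1].
    """
    half = len(taps) // 2
    for row in stream:
        out = []
        for x in range(len(row)):
            acc = 0
            for k, t in enumerate(taps):
                # Clamp at the row edges so every output pixel is defined.
                i = min(max(x + k - half, 0), len(row) - 1)
                acc += t * row[i]
            out.append(acc)
        yield out


def sad(left: Iterable[Row], right: Iterable[Row],
        max_disparity: int, radius: int = 1) -> Iterator[List[int]]:
    """Kernel: per pixel, pick the disparity whose sum of absolute
    differences over a small horizontal window is smallest."""
    for lrow, rrow in zip(left, right):
        n = len(lrow)
        depth_row = []
        for x in range(n):
            best_d, best_cost = 0, None
            for d in range(max_disparity + 1):
                cost = 0
                for w in range(-radius, radius + 1):
                    li = min(max(x + w, 0), n - 1)
                    ri = min(max(x + w - d, 0), n - 1)
                    cost += abs(lrow[li] - rrow[ri])
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            depth_row.append(best_d)
        yield depth_row


if __name__ == "__main__":
    # Toy one-row images; the left row is the right row shifted by one pixel.
    image0 = [[0, 0, 10, 20, 30, 40]]
    image1 = [[0, 10, 20, 30, 40, 50]]
    smooth = [1, 2, 1]
    depth_map = sad(convolve(image0, smooth), convolve(image1, smooth),
                    max_disparity=2)
    for row in depth_map:
        print(row)
```

Each kernel touches only the row it is handed plus its own local variables, and no output pixel depends on another output pixel, which mirrors the three properties listed on the slide.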

Locality and Concurrency
  Operations within a kernel operate on local data
  Kernels can be partitioned across chips to exploit control parallelism
  Streams expose data parallelism
  [Diagram: Image 0 and Image 1 each pass through two convolve kernels; SAD then produces the Depth Map]

Sony PlayStation2
  [Block diagram labels: Emotion Engine, FPU, MIPS Core, VPU0, VPU1, Graphics Synthesizer, Display, IPU, RDRAM, I/O, DMAC, etc.]

Special vs. General Purpose
  Special purpose: fixed function, high performance
  General purpose: programmable, insufficient performance
  [Diagram labels: Instruction Cache, IR, IP, Registers]

Register Files Dwarf ALUs

Register File Area
  Each cell requires:
    1 word line per port
    1 bit line per port
  Each cell grows as $p^2$
  $R$ registers in the file
  Area: $p^2 R \propto N^3$
  [Figure: register bit cell]

Register File Access Delay
  Signal must traverse:
    Word line to access cell
    Bit line to transfer data
  Wire capacitance dominates
  Delay: $p R^{1/2} \propto N^{3/2}$
  [Figure: register file]

Register File Power Dissipation
  100% utilization requires driving all $p R^{1/2}$ bit lines
  Wire capacitance dominates
  Power: $p^2 R \propto N^3$
  [Figure: register file]

Centralized Register Organization
  Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
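The scaling claims for a centralized register file can be made concrete with a small sketch. It assumes that both the port count p and the register count R grow linearly with the number of ALUs N; the specific per-ALU constants (3 ports, 8 registers) are hypothetical values chosen for illustration, and the returned quantities are in arbitrary units.

```python
def central_register_file(n_alus: int,
                          ports_per_alu: int = 3,   # assumed: 2 reads + 1 write
                          regs_per_alu: int = 8):   # hypothetical constant
    """Relative cost of one central register file serving n_alus ALUs."""
    p = ports_per_alu * n_alus   # ports grow with N
    r = regs_per_alu * n_alus    # registers grow with N
    return {
        "area": p * p * r,       # area  ~ p^2 * R    ~ N^3
        "delay": p * r ** 0.5,   # delay ~ p * R^(1/2) ~ N^(3/2)
        "power": p * p * r,      # power ~ p^2 * R    ~ N^3
    }


small = central_register_file(8)
large = central_register_file(16)
print("area ratio :", large["area"] / small["area"])
print("delay ratio:", large["delay"] / small["delay"])
print("power ratio:", large["power"] / small["power"])
```

Doubling the number of ALUs multiplies area and power by 2^3 and delay by 2^(3/2) regardless of the constants chosen, which is why the register file, rather than the ALUs, comes to dominate a wide centralized design.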

Partitioned Organizations
  SIMD: data-parallel axis
  Distributed Register Files (DRF): instruction-level parallel axis
  Hierarchical: memory hierarchy axis
  Stream: optimizing for streams

SIMD Register Organization
  Area, Power $\propto N^3/C^2$; Delay $\propto (N/C)^{3/2}$

Distributed Register Organization
  Area, Power $\propto N^2$; Delay $\propto N$

Combining SIMD and DRF
  [Diagram labels: Scalar, SIMD, Central, DRF]

Hierarchical Register Organization
  [Figure label: Hierarchical, T=40]
  Area, Power $\propto N^3$; Delay $\propto N^{3/2}$

Hierarchical Organizations
  [Diagram labels: Scalar, SIMD, Central, DRF]

Stream Register Organization
  Area, Power $\propto N^2/C$; Delay $\propto N/C$
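The asymptotic expressions quoted for the central, SIMD, distributed (DRF), and stream organizations can be tabulated side by side. The sketch below evaluates them for N = 48 ALUs in C = 8 clusters, the configuration the talk uses for Imagine. All constant factors are dropped, so only the trends are meaningful; the ratios it prints are not expected to match measured or modeled improvement figures.

```python
def scaling(n: int, c: int):
    """Asymptotic register-organization costs (constants dropped).

    n: total number of ALUs, c: number of SIMD clusters.
    """
    return {
        "central": {"area_power": n ** 3,          "delay": n ** 1.5},
        "simd":    {"area_power": n ** 3 / c ** 2, "delay": (n / c) ** 1.5},
        "drf":     {"area_power": n ** 2,          "delay": n},
        "stream":  {"area_power": n ** 2 / c,      "delay": n / c},
    }


costs = scaling(n=48, c=8)
central = costs["central"]
for name, cost in costs.items():
    area_gain = central["area_power"] / cost["area_power"]
    delay_gain = central["delay"] / cost["delay"]
    print(f"{name:8s} area/power gain {area_gain:7.1f}x   delay gain {delay_gain:5.1f}x")
```

The stream organization combines both partitionings: splitting across C clusters (the SIMD axis) and giving each ALU its own small register files within a cluster (the DRF axis), which is why its exponents are lower on both counts.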

Stream Organizations
  [Diagram labels: Scalar, SIMD, Central, DRF]

Comparison of Organizations
  48 ALUs (32-bit), 500 MHz
  Stream organization improves on the central organization by
    Area: 195x, Delay: 20x, Power: 430x

Performance
  16% performance drop (8% with latency constraints)
  180x improvement

Stream Architecture
  Stream Processing
    Matched to media processing
    Exposes locality and concurrency
  Stream Register Organization
    Efficiency of special-purpose hardware
    Optimized for streaming applications
  Data bandwidth
    Bandwidth hierarchy
    Memory access scheduling
    Conditional streams

The Imagine Stream Processor
  [Block diagram labels: Host Processor, Stream Controller, Network Interface, Microcontroller, Stream Register File, ALU Cluster 0 through ALU Cluster 7, Streaming Memory System, SDRAM]

Arithmetic Clusters
  [Block diagram labels: Local Register File, functional units (+ + + * * /), Communication Unit (CU), Scratch-pad Register File, Cross Point, Intercluster Network, To SRF, From SRF]

Bandwidth Hierarchy
  [Diagram: SDRAM (2 GB/s), Stream Register File (32 GB/s), ALU Clusters (544 GB/s)]
  41.2 32-bit operations per word of memory bandwidth
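The three bandwidth figures and the operations-per-word ratio on this slide can be related with a few lines of arithmetic. The sketch assumes the figures correspond, in order, to SDRAM, the stream register file, and the register files inside the ALU clusters, and that a word is 32 bits; the implied peak operation rate is derived here from the slide's numbers and is not itself stated on the slide.

```python
# Figures taken from the slide; the level names are an assumed mapping.
MEMORY_GBPS = 2.0             # SDRAM
SRF_GBPS = 32.0               # stream register file
CLUSTER_GBPS = 544.0          # register files within the ALU clusters
OPS_PER_MEMORY_WORD = 41.2    # 32-bit operations per word of memory bandwidth
WORD_BYTES = 4                # assumption: 32-bit words

# Bandwidth of each level relative to memory.
ratios = (1.0, SRF_GBPS / MEMORY_GBPS, CLUSTER_GBPS / MEMORY_GBPS)

# Words per second the memory system can deliver, in units of 10^9.
memory_gwords_per_s = MEMORY_GBPS / WORD_BYTES

# Operation rate needed to consume that memory bandwidth at the stated ratio.
implied_gops = OPS_PER_MEMORY_WORD * memory_gwords_per_s

print("memory : SRF : clusters =", ratios)
print(f"{memory_gwords_per_s} Gwords/s from memory")
print(f"{implied_gops:.1f} GOPS implied at {OPS_PER_MEMORY_WORD} ops/word")
```

Each level of the hierarchy supplies roughly an order of magnitude more bandwidth than the one below it, so an application only keeps the ALUs busy if most of its operands are served from the upper levels rather than from SDRAM.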

Stream Recirculation

Bandwidth Demands of FIR Filter

Bandwidth Utilization of FIR Filter

Performance
  [Chart labels: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]

Power
  GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3

Relative Performance and Power Efficiency
  [Chart labels: FFT; Performance, Power Efficiency]

Imagine Floorplan
  Tapeout ~Q2 '01
  21 million transistors
    6M SRF SRAM
    6M UC SRAM
    6M clusters
    3M other
  Target: 32 FO4
    300 MHz at SSSS
    500 MHz at TTSS
  TI GS30KA: 0.15 µm Ldrawn
  457 signal pins

Imagine Team
  William J. Dally, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Ben Serebrin, Brian Towles, Scott Rixner, Don Alpert (Intel), Ghazi Ben Amor, Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren

Conclusions
  Media Processing
    Little data reuse
    Highly data parallel
    Compute intensive
  VLSI
    Stream register organization
    Bandwidth hierarchy
  Imagine
    Stream architecture
    10 GOPS sustained application performance
    5 GOPS/W application power efficiency