1Hot Chips 2000Imagine IMAGINE: Signal and Image Processing Using Streams William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong,

Slides:



Advertisements
Similar presentations
Graphics on a Stream Processor
Advertisements

WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s: Computer Hardware 1 WATERLOO ELECTRICAL AND COMPUTER ENGINEERING 20s Computer Hardware Department of.
Chapter 3 General-Purpose Processors: Software
Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow
The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,
Khaled A. Al-Utaibi  Computers are Every Where  What is Computer Engineering?  Design Levels  Computer Engineering Fields  What.
Streaming Supercomputer Strawman Bill Dally, Jung-Ho Ahn, Mattan Erez, Ujval Kapasi, Tim Knight, Ben Serebrin April 15, 2002.
Evolution of Chip Design ECE 111 Spring A Brief History 1958: First integrated circuit – Flip-flop using two transistors – Built by Jack Kilby at.
 Understanding the Sources of Inefficiency in General-Purpose Chips.
Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman Exploiting DLP SIMD architectures.
Oct 2, 2001 SSS: 1 Stanford Streaming Supercomputer (SSS) Project Meeting Bill Dally, Pat Hanrahan, and Ron Fedkiw Computer Systems Laboratory Stanford.
Technische universiteit eindhoven ‘Nothing is built on stone; all is built on sand, but we must build as if the sand were stone.’ Jorge Luis Borges (Argentine.
Technische universiteit eindhoven ‘Nothing is built on stone; all is built on sand, but we must build as if the sand were stone.’ Jorge Luis Borges (Argentine.
Architecture for Network Hub in 2011 David Chinnery Ben Horowitz.
The Imagine Stream Processor Flexibility with Performance March 30, 2001 William J. Dally Computer Systems Laboratory Stanford University
VIRAM-1 Architecture Update and Status Christoforos E. Kozyrakis IRAM Retreat January 2000.
Scalable Vector Coprocessor for Media Processing Christoforos Kozyrakis ( ) IRAM Project Retreat, July 12 th, 2000.
Memory access scheduling Authers: Scott RixnerScott Rixner,William J. Dally,Ujval J. Kapasi, Peter Mattson, John D. OwensWilliam J. DallyUjval J. KapasiPeter.
CSE 690: GPGPU Lecture 4: Stream Processing Klaus Mueller Computer Science, Stony Brook University.
Motivation Mobile embedded systems are present in: –Cell phones –PDA’s –MP3 players –GPS units.
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
Computer Architecture ECE 4801 Berk Sunar Erkay Savas.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor Mark Gebhart 1,2 Stephen W. Keckler 1,2 Brucek Khailany 2 Ronny Krashinsky.
Reconfigurable Caches and their Application to Media Processing Parthasarathy (Partha) Ranganathan Dept. of Electrical and Computer Engineering Rice University.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
TRIPS – An EDGE Instruction Set Architecture Chirag Shah April 24, 2008.
Real-Time HD Harmonic Inc. Real Time, Single Chip High Definition Video Encoder! December 22, 2004.
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
Amalgam: a Reconfigurable Processor for Future Fabrication Processes Nicholas P. Carter University of Illinois at Urbana-Champaign.
Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.
Lecture 1 1 Computer Systems Architecture Lecture 1: What is Computer Architecture?
UTCS Lecture 1 1 CS 352: Computer Systems Architecture Lecture 1: What is Computer Architecture? January 22, 2007 Doug Burger Computer Architecture and.
RICE UNIVERSITY ‘Stream’-based wireless computing Sridhar Rajagopal Research group meeting December 17, 2002 The figures used in the slides are borrowed.
MPEG2 Video Encoding on Imagine November 16, 2000 Scott Rixner.
Embedded Systems Design: A Unified Hardware/Software Introduction 1 Chapter 3 General-Purpose Processors: Software.
The Imagine Stream Processor Concurrent VLSI Architecture Group Stanford University Computer Systems Laboratory Stanford, CA Scott Rixner February.
Computer Organization - 1. INPUT PROCESS OUTPUT List different input devices Compare the use of voice recognition as opposed to the entry of data via.
Computer Architecture And Organization UNIT-II General System Architecture.
Polygon Rendering on a Stream Architecture John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery Concurrent VLSI Architecture.
The TM3270 Media-Processor. Introduction Design objective – exploit the high level of parallelism available. GPPs with Multi-media extensions (Ex: Intel’s.
Ben Gaudette Michael Pfeister CSE 520 Spring 2010.
Investigating Architectural Balance using Adaptable Probes.
Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.
RICE UNIVERSITY DSPs for future wireless systems Sridhar Rajagopal.
DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,
WJD Feb 3, 19981Tomorrow's Computing Engines Tomorrow’s Computing Engines February 3, 1998 Symposium on High-Performance Computer Architecture William.
February 12, 1999 Architecture and Circuits: 1 Interconnect-Oriented Architecture and Circuits William J. Dally Computer Systems Laboratory Stanford University.
Oct 26, 2005 FEC: 1 Custom vs Commodity Processors Bill Dally October 26, 2005.
Overview von Neumann Architecture Computer component Computer function
Lecture 3: Computer Architectures
The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao.
Microprocessor Design Process
Hiba Tariq School of Engineering
A programmable communications processor for future wireless systems
A Quantitative Analysis of Stream Algorithms on Raw Fabrics
Drinking from the Firehose Decode in the Mill™ CPU Architecture
Stream Architecture: Rethinking Media Processor Design
Lecture on High Performance Processor Architecture (CS05162)
Compiler Supports and Optimizations for PAC VLIW DSP Processors
William J. Dally Computer Systems Laboratory Stanford University
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder
William J. Dally Computer Systems Laboratory Stanford University
COMS 361 Computer Organization
Computer Architecture
CSE 502: Computer Architecture
ADSP 21065L.
Chapter 4 The Von Neumann Model
Presentation transcript:

1Hot Chips 2000Imagine IMAGINE: Signal and Image Processing Using Streams William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles Concurrent VLSI Architecture Group Computer Systems Laboratory Stanford University Brucek Khailany

2Hot Chips 2000Imagine Imagine: A Programmable Signal and Image Processor Motivation –Applications poorly matched to conventional architectures Key stream architecture features –High computational bandwidth (Imagine: 48 on-chip ALUs) –Stream register organization –Data bandwidth hierarchy Performance density of a special purpose processor –0.59 cm 2 CMOS chip, 0.13  m standard cell, 500 MHz –20 GFLOPS peak performance (40 GOPS fixed point) –10 GFLOPS sustained on several apps –> 2 GFLOPS/W, > 5 GOPS/W

3Hot Chips 2000Imagine Stereo Depth Extraction Polygon Rendering MPEG Encoding/Decoding Encoded 2D Data 2D Video Stream Encode/Decode Representative Applications Render

4Hot Chips 2000Imagine Stream Processing Little data reuse (pixels never revisited) Highly data parallel (output pixels not dependent on other output pixels) Compute intensive (60 arithmetic operations per memory reference) SAD Kernel Stream Input Data Output Data Image 1 convolve Image 0 convolve Depth Map

5Hot Chips 2000Imagine Characteristics of Media Applications Poorly matched to conventional architectures –Caches –Instruction-Level Parallelism –Few arithmetic units Well-matched to modern VLSI technology –Lots (100’s ’s) of ALUs fit on a single chip –Communication bandwidth is the scarce resource

6Hot Chips 2000Imagine Special-Purpose Processors: ALUs fed by dedicated wires/memories Regs Instr. Cache IR IP General-Purpose Processors: ‘Feeding’ Structure Dwarfs ALUs Communication Bandwidth: Care and Feeding of ALUs

7Hot Chips 2000Imagine Stream Architecture Provides Data Bandwidth Hierarchy 2GB/s32GB/s SDRAM Stream Register File ALU Cluster 544GB/s ALU Cluster SIMD/VLIW Control Peak BW:

8Hot Chips 2000Imagine Application Data Bandwidth Usage 2GB/s32GB/s SDRAM Stream Register File ALU Cluster 544GB/s

9Hot Chips 2000Imagine Stream Register File: Details SRF: Single-ported 128KB SRAM (1024 x 32W) Stream buffers 32W/cycle Arbiter To/From Arithmetic Clusters

10Hot Chips 2000Imagine CU Intercluster Network + From SRF To SRF + * / Cross Point Local Register File Arithmetic Cluster: Details Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions –4-cycle pipelined FMUL,FADD,FSUB,FTOI,ITOF,FFRAC –17-cycle FDIV (pipelined for 1 FDIV every 7 cycles) + *

11Hot Chips 2000Imagine Imagine Programming Environment StereoDepthExtraction(…) { // Load Input Images... // Run Kernels convolve7x7 (RawImage,ConvImage); convolve3x3 (ConvImage,Conv2Image);... // Store Output } Convolve7x7(…) {... while(!In.empty()) {... p0 = k0 * in10; p12 = k21 * in32; p34 = k43 * in54; p56 = k65 * in76; sum = (p0 + p12) + (p34 + p56);... }

12Hot Chips 2000Imagine Performance 16-bit kernels 16-bit applications floating-point application floating-point kernel

13Hot Chips 2000Imagine Stereo Depth Extraction –320x240 8-bit grayscale –200 frames/second Polygon Rendering –4.5 Million Vertices/sec –5.1 Million Pixels/sec MPEG Encoding –720x bit color –120 frames/second Encoded 2D Data 2D Video Stream Sustained Application Performance Render Encode SPECviewperf ADVS benchmark (unlit)

14Hot Chips 2000Imagine GOPS/W: Power Estimates

15Hot Chips 2000Imagine The Imagine Stream Processor Stream Register File: 32kW SRAM Network Interface Stream Controller Imagine Stream Processor Host Processor Network ALU Cluster 0ALU Cluster 1ALU Cluster 2ALU Cluster 3ALU Cluster 4ALU Cluster 5ALU Cluster 6 ALU Cluster 7 SDRAM Streaming Memory System Microcontroller : 2K VLIW Instrs

16Hot Chips 2000Imagine Imagine Floorplan 22 million transistors 500 MHz TI GS30KA: –0.15  m L drawn –0.  m L eff –CMOS process

17Hot Chips 2000Imagine VLSI Implementation: 22M Transistors with 7 grad students Stream architecture reduces VLSI design complexity –Modularity / Replication –Long wire delays converted to explicit communications Exposed to microarchitecture, software Design methodology –Standard ASIC flow with forced placement of datapaths Bitslice Verilog Improved area, delay –Pre-placement wire length estimates Reduce design iterations

18Hot Chips 2000Imagine Status Imagine team accomplishments –Cycle-accurate simulator –Software tools –Completed synthesizable Verilog –Arithmetic units implemented in standard cells Industrial partners –Texas Instruments: Fab –Intel Future work –Circuits/Logic: expected completion 9/15/00 –Tapeout: expected Q4/2000

19Hot Chips 2000Imagine Summary Key stream architecture features –Stream register organization –Data bandwidth hierarchy Performance density of a special purpose processor –10 GFLOPS sustained on several apps –>2 GFLOPS/W, >5 GOPS/W VLSI Implementation –Validate architectural concepts –Develop experimental prototype