The Imagine Stream Processor Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany Presenter: Lu Hao

Contents
- Stream processor
- Imagine architecture
- Example: FFT application
- Experimental results
- Conclusion

Motivation for stream processors
Media-processing applications, such as 3-D polygon rendering and MPEG-2 encoding, make up an increasingly dominant portion of today's computing workloads.
Properties of media-processing applications:
- Real-time performance constraints
- High arithmetic intensity, requiring parallel solutions
- Inherently large amounts of data parallelism
Providing large numbers of ALUs to operate on data in parallel is relatively inexpensive, but current programmable solutions cannot scale to support this many ALUs:
- Both issuing instructions and transferring data at the necessary rates are problematic.
- For example, a 48-ALU single-chip processor must issue up to 48 instructions/cycle and provide up to 144 words/cycle of data bandwidth to operate at peak rate.
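The bandwidth figure follows directly from the instruction rate; a back-of-the-envelope sketch, assuming each ALU consumes two operands and produces one result per cycle (three words total):

```python
# Back-of-the-envelope check of the peak-rate numbers quoted above.
# Assumption: each ALU needs 2 input operands + 1 result per cycle.
ALUS = 48
WORDS_PER_ALU = 3  # 2 reads + 1 write

instructions_per_cycle = ALUS           # one instruction per ALU per cycle
words_per_cycle = ALUS * WORDS_PER_ALU  # data bandwidth at peak rate

print(instructions_per_cycle)  # 48
print(words_per_cycle)         # 144
```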

What is a stream processor?
- Usually SIMD
- Allows some applications to exploit a limited form of parallel processing more easily
- Uses the stream programming model to expose parallelism as well as producer-consumer locality, so that it can keep multiple computational units busy

The Imagine Processor
- Imagine is a programmable stream processor: a hardware implementation of the stream model.
- Imagine is designed to be a stream coprocessor for a general-purpose processor that acts as the host.
- The programming model organizes the computation in an application into a sequence of arithmetic kernels, and organizes the data flow into a series of data streams.
- On a variety of realistic applications, Imagine can sustain up to 50 instructions per cycle and up to 15 GOPS of arithmetic bandwidth.
- Imagine is a load/store architecture for streams: kernels read operands from, and write results to, the stream register file (SRF).
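The kernel-and-stream organization can be illustrated with a toy sketch (plain Python, not Imagine's actual StreamC/KernelC tools; the kernel names are invented): each kernel consumes whole input streams and produces whole output streams, exposing producer-consumer locality between successive kernels.

```python
# Toy sketch of the stream model: kernels are functions over whole streams.
# The kernel names below are illustrative, not Imagine's real API.

def scale_kernel(stream, factor):
    # Kernel 1: producer — scales every element of the input stream.
    return [x * factor for x in stream]

def accumulate_kernel(stream):
    # Kernel 2: consumer of kernel 1's output stream — running sum.
    total = 0
    out = []
    for x in stream:
        total += x
        out.append(total)
    return out

# Data flows as a series of streams through a sequence of kernels.
pixels = [1, 2, 3, 4]
scaled = scale_kernel(pixels, 10)  # intermediate stream (would live in the SRF)
sums = accumulate_kernel(scaled)
print(sums)  # [10, 30, 60, 100]
```

The point of the structure is that the intermediate stream (`scaled`) never needs to go back to main memory before the next kernel consumes it.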


Architecture of Imagine
- 32 KW stream register file (SRF)
- The microcontroller keeps track of the program counter as it broadcasts each VLIW instruction to all eight clusters in a SIMD manner.
- Each ALU cluster contains six ALUs and 304 registers spread across several local register files (LRFs).
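The SIMD control scheme can be sketched minimally (illustrative only): one microcontroller "broadcasts" each instruction, and all eight clusters apply it to their own local register contents in the same cycle.

```python
# Sketch: one instruction stream, eight clusters, SIMD execution.
# Each cluster holds its own slice of the stream in local registers.
NUM_CLUSTERS = 8

def broadcast(instruction, cluster_regs):
    # The microcontroller sends the same instruction to every cluster;
    # each cluster applies it to its own register contents.
    return [instruction(regs) for regs in cluster_regs]

# Eight clusters, each holding one stream element as a register pair (a, b).
clusters = [(i, i + 1) for i in range(NUM_CLUSTERS)]

# A single broadcast "add" executes on all eight clusters at once.
result = broadcast(lambda regs: regs[0] + regs[1], clusters)
print(result)  # [1, 3, 5, 7, 9, 11, 13, 15]
```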

Architecture of Imagine The SRF

Data placement
- Clusters / SRF: data that needs to be passed from kernel to kernel
- SRF / DRAM: data that is part of truly global data structures
- All stream operands originate in the SRF, and stream results are stored back to the SRF.
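The load/store discipline for streams can be sketched as two distinct layers (a simplification, with invented function names): stream loads and stores move whole streams between DRAM and the SRF, while kernels only ever touch SRF-resident streams.

```python
# Sketch of the stream load/store discipline (all names are illustrative).
dram = {"input": [3, 1, 4, 1, 5, 9, 2, 6]}
srf = {}  # stream register file: staging area for all kernel operands

def stream_load(name):
    # DRAM -> SRF: the only way data enters the SRF from memory.
    srf[name] = list(dram[name])

def stream_store(name):
    # SRF -> DRAM: the only way results leave the SRF.
    dram[name] = list(srf[name])

def run_kernel(kernel, src, dst):
    # Kernels read operands from the SRF and write results back to the
    # SRF; they never access DRAM directly.
    srf[dst] = kernel(srf[src])

stream_load("input")
run_kernel(lambda s: [x * x for x in s], "input", "squares")
stream_store("squares")
print(dram["squares"])  # [9, 1, 16, 1, 25, 81, 4, 36]
```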

Irregular stream locality converted to reuse through memory

Irregular producer-consumer locality captured at the SRF

Data distribution

Data distribution result

Architecture of Imagine The ALU cluster

256 x 32-bit register file


Example: mapping of a 1024-point radix-2 FFT to the stream model
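The mapping can be sketched in ordinary Python (not Imagine's KernelC): each butterfly stage is written as a kernel that reads one stream and writes the next, so a 1024-point radix-2 FFT becomes a bit-reversal reordering kernel followed by log2(1024) = 10 butterfly-stage kernel invocations chained through intermediate streams.

```python
import cmath

def bit_reverse_kernel(x):
    # Reordering kernel: permute the input stream into bit-reversed order.
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2)
        out[rev] = x[i]
    return out

def butterfly_kernel(x, stage):
    # One radix-2 decimation-in-time stage as a kernel:
    # read one stream, write one stream.
    n = len(x)
    m = 1 << stage  # butterfly span doubles each stage
    out = [0j] * n
    for k in range(0, n, 2 * m):
        for j in range(m):
            w = cmath.exp(-2j * cmath.pi * j / (2 * m))  # twiddle factor
            a, b = x[k + j], w * x[k + j + m]
            out[k + j] = a + b
            out[k + j + m] = a - b
    return out

def stream_fft(x):
    # The full FFT: a sequence of kernels chained through intermediate
    # streams (which would live in the SRF on Imagine).
    s = bit_reverse_kernel(x)
    for stage in range(len(x).bit_length() - 1):
        s = butterfly_kernel(s, stage)
    return s

# 1024-point FFT of a unit impulse: every output bin equals 1.
y = stream_fft([1.0] + [0.0] * 1023)
print(max(abs(v - 1) for v in y) < 1e-9)  # True
```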


Experimental Results
Speedup of 8 clusters over 1 cluster
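The metric plotted here is the standard runtime ratio; a minimal sketch (the cycle counts below are invented for illustration, not the paper's measurements):

```python
# Speedup of 8 clusters over 1 cluster = T_1cluster / T_8clusters.
# The cycle counts below are made up for illustration only.
def speedup(cycles_1_cluster, cycles_8_clusters):
    return cycles_1_cluster / cycles_8_clusters

# A data-parallel kernel scales nearly linearly across the 8 clusters.
print(round(speedup(80_000, 11_000), 1))  # 7.3
```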


Conclusion
- Stream processors are well suited to media-processing applications.
- Imagine exploits the data-level parallelism (DLP) in streams by executing a kernel on eight successive stream elements in parallel (one on each cluster).
- Key hardware structures: the SRF and the ALU clusters.
- Application example: a 1024-point FFT

Thanks! Questions?