RICE UNIVERSITY ‘Stream’-based wireless computing Sridhar Rajagopal Research group meeting December 17, 2002 The figures used in the slides are borrowed.

Presentation transcript:

RICE UNIVERSITY ‘Stream’-based wireless computing Sridhar Rajagopal Research group meeting December 17, 2002 The figures used in the slides are borrowed from papers at VT and Stanford.

Motivation
- ‘Stream’-based computing: what does it mean?
  - Not a well-defined term
  - ‘Computation’ that uses a flow of self-guided information
  - ‘Sequence of data’
- Related to the flow of data through the architecture
- Application to implementing wireless algorithms

Outline
- Stallion
  - reconfigurable computing at Virginia Tech
  - ‘stream’-based computing #1
  - Custom Configurable Machines (CCM)
- Imagine
  - media processing at Stanford
  - ‘stream’-based computing #2
  - programmable architectures

Stallion at VT
- Wormhole Run-Time Reconfiguration (RTR)
- coarse-grained structure
- reconfiguration using ‘streams’

‘Stream’ packets
[Figures: a stream packet; stream flow through the architecture]

Functional description of PE

Stream module description
4 states:
- IDLE – reconfiguration in progress
- BUSY – doing work
- PROGRAM – load reconfiguration data
- PASS – meant for the next module
Need to output one packet per cycle
- VALID – maintains synchronization
- set INVALID instead of inserting wait states
- strip information off the stack
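The four states and the one-packet-per-cycle rule can be illustrated with a small simulation. This is only a sketch under stated assumptions: the slide does not give the packet format or the transition rules, so the `Packet` fields, the `kind` tag, the `n_config_words` and `reconf_cycles` parameters, and the "add the configuration value" computation are all hypothetical. What it tries to show is the rule itself: the module emits exactly one packet for every packet it receives, and marks the output INVALID rather than stalling.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()     # reconfiguration in progress
    BUSY = auto()     # doing work
    PROGRAM = auto()  # loading reconfiguration data
    PASS = auto()     # forwarding packets meant for the next module


@dataclass(frozen=True)
class Packet:
    valid: bool
    kind: str = "data"  # hypothetical tag: "config" or "data"
    payload: int = 0


INVALID = Packet(valid=False)


class StreamModule:
    """Hypothetical stream module: one packet in, one packet out, every cycle."""

    def __init__(self, n_config_words, reconf_cycles):
        self.to_load = n_config_words       # config words addressed to this module
        self.reconf_cycles = reconf_cycles  # assumed reconfiguration latency
        self.idle_left = 0
        self.config = []
        self.buffer = deque()               # data buffered while reconfiguring
        self.state = State.PASS

    def step(self, pkt):
        if pkt.valid and pkt.kind == "config":
            if self.to_load > 0:
                # Strip our own configuration word off the front of the stream.
                self.state = State.PROGRAM
                self.config.append(pkt.payload)
                self.to_load -= 1
                if self.to_load == 0:
                    self.idle_left = self.reconf_cycles
                return INVALID  # a bubble instead of a wait state
            # Configuration for a later module: forward it untouched.
            self.state = State.PASS
            return pkt
        if pkt.valid:
            self.buffer.append(pkt)
        if self.idle_left > 0:
            # Still reconfiguring: keep buffering, keep the output slot filled.
            self.state = State.IDLE
            self.idle_left -= 1
            return INVALID
        if self.buffer:
            self.state = State.BUSY
            data = self.buffer.popleft()
            # Placeholder computation: add the loaded configuration values.
            return Packet(True, "data", data.payload + sum(self.config))
        return INVALID


module = StreamModule(n_config_words=1, reconf_cycles=2)
stream = [
    Packet(True, "config", 5),  # consumed by this module
    Packet(True, "config", 9),  # meant for the next module
    Packet(True, "data", 1),
    Packet(True, "data", 2),
    INVALID, INVALID, INVALID,
]
outputs = [module.step(p) for p in stream]
```

In this run the output stream has the same length as the input stream: the stripped configuration word and the two reconfiguration cycles appear as INVALID packets, and the buffered data follows once the module is BUSY.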

Processing layer
- Static section
  - configures the reconfigurable section
  - buffers data during reconfiguration and sends ‘IDLE’ packets
- Reconfigurable section
  - processing of the data is done here
- Higher layers convert the algorithm to data and configuration patterns

Cart before the horse: Colt before the Stallion
- Colt architecture (also at VT)
- IFU mesh – mesh of interconnected functional units

Stallion chip
[Figure labels: 16-bit data, 4-control]

IFU mesh in Stallion
- Dashed lines: skip buses
- Can send operands over one or more IFUs

IFU details
- Only the left input can do barrel shifting
- ALU based on a LUT
- Control register – stores control information for reconfiguration
- Optional delay register – provides latency to synchronize the path lengths of different pipeline streams
- Conditional unit
- Output control unit

Radio testbed at VT
[Figure label: Stallion]

Worm-hole routing
- stream = worm, architecture = holes
- multiple, independent streams can wind their way through the chip simultaneously
- parts of the system can be processing while other parts are reconfiguring
- GOAL: Layered Software Radio Architecture

‘Stream’ processing at Stanford
- Speeding up media applications
- Need lots of computations per memory reference
- Lots of data and sub-word parallelism
- Current GPP architectures do not have enough ALUs
- ‘Stream’ processors to the rescue

Special-purpose processors
- Fed by dedicated wires/memories
- Lots (100s) of ALUs

Care and feeding of ALUs
[Figure labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]
‘Feeding’ structure dwarfs ALU

Architecture implications
- Tremendous opportunities
  - media problems have lots of parallelism and locality
  - VLSI technology enables 100s of ALUs/chip (1000s soon) (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
- Challenging problems
  - locality – global structures won’t work
  - explicit parallelism – ILP won’t keep 100 ALUs busy
  - memory – streaming applications don’t cache well
- It’s time to try some new approaches

Register file organization
- Register file functions:
  - short-term storage for intermediate results
  - communication between multiple function units
- Global register files don’t scale with #ALUs
  - need more registers to hold more results (grows with #ALUs)
  - need more ports to connect all of the units (grows with #ALUs²)

Register files dwarf ALUs

Distributed register files
- Distributed register files mean that:
  - not all functional units can access all data
  - each functional unit input/output no longer has a dedicated route from/to all register files

Stream processing
[Figure: Image 0 and Image 1 (input data) each pass through convolve kernels and then a SAD kernel, producing a depth map (output data); legend: kernel, stream]
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels do not depend on other output pixels)
- Compute intensive (60 operations per memory reference)
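To make the SAD step concrete, here is a minimal sketch of a sum-of-absolute-differences disparity search along one image row. It is not the Imagine kernel: the function names, the 1-D window, and the `window` and `max_disp` parameters are assumptions chosen for illustration. It does show the properties listed on the slide: each output depends only on nearby input pixels, and several arithmetic operations are done for every pixel read.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_disparity(row0, row1, x, window, max_disp):
    """Return the shift d (0..max_disp) that best matches
    row0[x : x + window] against row1[x - d : x - d + window]."""
    ref = row0[x:x + window]
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        if x - d < 0:
            break  # candidate window would fall off the left edge
        cand = row1[x - d:x - d + window]
        cost = sad(ref, cand)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# The feature [10, 20, 30] appears two pixels further right in row0.
row1 = [0, 0, 10, 20, 30, 0, 0, 0]
row0 = [0, 0, 0, 0, 10, 20, 30, 0]
disparity = best_disparity(row0, row1, x=4, window=3, max_disp=3)
```

Every call to `best_disparity` is independent of the others, which is the data parallelism the slide refers to.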

Stream programming
- Streams: communication

    void main() {
        Stream a(256);
        Stream b(256);
        Stream c(256);
        Stream d(1024);
        ...
        example1(a, b, c);   // c is produced here ...
        example2(c, d);      // ... and consumed here
        ...
    }

- Kernels: computation

    KERNEL example1(istream a, istream b, ostream c) {
        loop_stream(a) {             // once per element of a
            int ai, bi, ci;
            a >> ai;                 // read the next element of each input
            b >> bi;
            ci = ai * 2 + bi * 3;    // local computation
            c << ci;                 // append to the output stream
        }
    }
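The `example1` kernel can be mimicked in ordinary Python to show its element-by-element behaviour. This is only an emulation, not stream-language code: a Python generator stands in for the kernel and iterables stand in for the streams.

```python
def example1(a, b):
    """Emulates the example1 kernel: consume one element from each
    input stream and emit one element of the output stream."""
    for ai, bi in zip(a, b):
        yield ai * 2 + bi * 3


a = [1, 2, 3]
b = [10, 20, 30]
c = list(example1(a, b))
```

Because the generator produces `c` lazily, a second kernel could start consuming it before `example1` has finished, which loosely mirrors how streams connect kernels in the program above.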

Stream Processor
- Instructions are Load, Store, and Operate
  - operands are streams
- Operate performs a compound stream operation
  - read elements from input streams
  - perform a local computation
  - append elements to output streams
  - repeat until the input stream is consumed
  - (e.g., triangle transform)
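The Load / Operate / Store instruction model can be sketched as a toy class. This is an illustrative model only, not the Imagine instruction set: the class name, method signatures, the dictionary used as a stream register file, and the restriction to one output element per iteration are all assumptions.

```python
class ToyStreamProcessor:
    """Toy model in which every instruction operand is a whole stream."""

    def __init__(self, memory):
        self.memory = list(memory)
        self.srf = {}  # hypothetical stream register file: name -> list

    def load(self, name, start, length):
        """Load: copy a contiguous run of memory into a named stream."""
        self.srf[name] = self.memory[start:start + length]

    def operate(self, kernel, in_names, out_name):
        """Operate: read one element from each input stream, run the
        local computation, append the result to the output stream,
        and repeat until the input streams are consumed."""
        inputs = [self.srf[n] for n in in_names]
        out = []
        for elems in zip(*inputs):
            out.append(kernel(*elems))
        self.srf[out_name] = out

    def store(self, name, start):
        """Store: write a named stream back to memory."""
        data = self.srf[name]
        self.memory[start:start + len(data)] = data


proc = ToyStreamProcessor([1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0])
proc.load("a", 0, 4)
proc.load("b", 4, 4)
proc.operate(lambda x, y: 2 * x + 3 * y, ["a", "b"], "c")
proc.store("c", 8)
```

Memory is touched only by `load` and `store`. All intermediate traffic stays in the stream register file, which is the point of making streams the instruction operands.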

Imagine

Arithmetic clusters

Bandwidth hierarchy
- VLIW clusters with shared control
- bit operations per word of memory bandwidth
[Figure: SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU clusters (544 GB/s)]
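A quick calculation shows how steeply bandwidth rises from one level of the hierarchy to the next. It assumes that the three figures on the slide are, in order, the SDRAM, stream register file, and ALU cluster bandwidths. The variable names are illustrative.

```python
# Bandwidths in GB/s, read off the slide (assignment to levels is assumed).
bandwidth_gb_s = {
    "SDRAM": 2,
    "stream register file": 32,
    "ALU clusters": 544,
}

srf_vs_memory = bandwidth_gb_s["stream register file"] / bandwidth_gb_s["SDRAM"]
cluster_vs_srf = bandwidth_gb_s["ALU clusters"] / bandwidth_gb_s["stream register file"]
cluster_vs_memory = bandwidth_gb_s["ALU clusters"] / bandwidth_gb_s["SDRAM"]

print(f"SRF / SDRAM:      {srf_vs_memory:.0f}x")
print(f"clusters / SRF:   {cluster_vs_srf:.0f}x")
print(f"clusters / SDRAM: {cluster_vs_memory:.0f}x")
```

Under that reading, each level supplies roughly 16 to 17 times the bandwidth of the level below it. An application therefore has to reuse each word fetched from memory many times inside the clusters to keep the ALUs busy.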

Conclusions
- ‘Streams’ shown to be promising for reconfigurable computing
  - wireless may need reconfigurability
- ‘Streams’ shown to be promising for media processing
  - wireless may have similar workloads
- Important to understand the pros and cons of different methodologies for good wireless architectures
- Important to have the right tools