RICE UNIVERSITY
‘Stream’-based wireless computing
Sridhar Rajagopal
Research group meeting, December 17, 2002
The figures used in the slides are borrowed from papers at VT and Stanford.
Motivation
- ‘Stream’-based computing: what does it mean?
  - Not a well-defined term
  - ‘computation’ that uses a flow of self-guided information
  - ‘sequence of data’
  - Related to the flow of data through the architecture
- Application to implementing wireless algorithms
Outline
- Stallion: reconfigurable computing at Virginia Tech
  - ‘stream’-based computing #1
  - Custom Configurable Machines (CCM)
- Imagine: media processing at Stanford
  - ‘stream’-based computing #2
  - programmable architectures
Stallion at VT
- Wormhole Run-Time Reconfiguration (RTR)
- coarse-grained structure
- reconfiguration using ‘streams’
‘Stream’ packets
[Figure: A stream packet]
[Figure: Stream flow through architecture]
Functional description of PE
Stream module description
- 4 states:
  - IDLE – reconfiguration in progress
  - BUSY – doing work
  - PROGRAM – load reconfiguration data
  - PASS – packet is meant for the next module
- Need to output one packet per cycle
- VALID bit – maintains synchronization
  - set INVALID instead of inserting wait states
  - strip information off the stack
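The four states and the one-packet-per-cycle rule can be sketched as a small state machine. This is only an illustrative model, not the Stallion hardware: the `Packet` fields (`is_config`, `target`), the `scale` configuration, and the choice to drop data that arrives before configuration are all assumptions made for the example. The point it tries to show is that a module always emits something, and emits an INVALID packet when it has consumed (stripped) a word rather than stalling.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # not configured yet / reconfiguration in progress
    BUSY = auto()      # doing work on a data word
    PROGRAM = auto()   # loading reconfiguration data
    PASS = auto()      # forwarding a packet meant for a later module


@dataclass(frozen=True)
class Packet:
    valid: bool      # VALID bit from the slide
    is_config: bool  # hypothetical flag: configuration word vs. data word
    target: int      # hypothetical address of the module this word is for
    value: int


INVALID = Packet(valid=False, is_config=False, target=-1, value=0)


class StreamModule:
    """Toy module whose 'configuration' is a single multiplier (an assumption)."""

    def __init__(self, address):
        self.address = address
        self.scale = None
        self.state = State.IDLE

    def step(self, pkt):
        """Consume one packet and emit exactly one packet (one per cycle)."""
        if not pkt.valid:
            return INVALID                      # nothing to do, keep the slot filled
        if pkt.target != self.address:
            self.state = State.PASS
            return pkt                          # meant for the next module
        if pkt.is_config:
            self.state = State.PROGRAM
            self.scale = pkt.value              # word is stripped off here
            return INVALID                      # emit INVALID instead of a wait state
        if self.scale is None:
            self.state = State.IDLE
            return INVALID                      # simplification: unconfigured data is dropped
        self.state = State.BUSY
        return Packet(valid=True, is_config=False,
                      target=pkt.target, value=pkt.value * self.scale)
```

Calling `step` once per input packet yields an output stream of the same length, which is how the model keeps downstream modules in sync.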
Processing layer
- Static section
  - configures the reconfigurable section
  - buffers data during reconfiguration and sends ‘IDLE’ packets
- Reconfigurable section
  - processing of the data is done here
- Higher layers convert the algorithm to data and configuration patterns
Cart before the horse: Colt before the Stallion
- Colt architecture (also at VT)
- IFU mesh: a mesh of interconnected functional units
Stallion chip
[Figure labels: 16-bit data, 4-control]
IFU mesh in Stallion
- Dashed lines: skip buses
- Can send operands over one or more IFUs
IFU details
- Only the left input can do barrel shifting
- ALU based on a LUT
- Control register: stores control information for reconfiguration
- Optional delay register: provides latency to synchronize path lengths of different pipeline streams
- Conditional unit
- Output control unit
Radio testbed at VT
[Figure label: Stallion]
Worm-hole routing
- stream = worm
- architecture = holes
- multiple, independent streams can wind their way through the chip simultaneously
- parts of the system can be processing while other parts are reconfiguring
- GOAL: Layered Software Radio Architecture
‘Stream’ processing at Stanford
- Speeding up media applications
- Need lots of computations per memory reference
- Lots of data and sub-word parallelism
- Current GPP architectures do not have enough ALUs
- ‘Stream’ processors to the rescue
Special-purpose processors
- Fed by dedicated wires/memories
- Lots (100s) of ALUs
Care and feeding of ALUs
[Figure labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]
- ‘Feeding’ structure dwarfs the ALU
Architecture implications
- Tremendous opportunities
  - media problems have lots of parallelism and locality
  - VLSI technology enables 100s of ALUs/chip (1000s soon) (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
- Challenging problems
  - locality: global structures won’t work
  - explicit parallelism: ILP won’t keep 100 ALUs busy
  - memory: streaming applications don’t cache well
- It’s time to try some new approaches
Register file organization
- Register file functions:
  - short-term storage for intermediate results
  - communication between multiple function units
- Global register files don’t scale with #ALUs
  - need more registers to hold more results (grows with #ALUs)
  - need more ports to connect all of the units (grows with #ALUs²)
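One way to read the two growth terms together is a back-of-the-envelope area model. The sketch below is an assumption-laden illustration, not a figure from the slides: it assumes a fixed number of registers and ports per ALU (`regs_per_alu`, `ports_per_alu` are hypothetical parameters) and assumes the area of one register cell grows with the square of its port count, since each port adds wiring in both dimensions.

```python
def central_rf_area(n_alus, regs_per_alu=16, ports_per_alu=3):
    """Relative area of one global register file, in arbitrary units.

    Assumptions (not from the slides): registers and ports both scale
    linearly with the number of ALUs, and each register cell's area
    scales with ports squared.
    """
    registers = regs_per_alu * n_alus
    ports = ports_per_alu * n_alus
    return registers * ports ** 2


def growth_factor(n_alus):
    """How much larger the global register file is than for a single ALU."""
    return central_rf_area(n_alus) / central_rf_area(1)
```

Under these assumptions `growth_factor(n)` equals `n ** 3`, so the storage serving the ALUs grows much faster than the ALUs themselves, which is the motivation for distributing the register files.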
Register files dwarf ALUs
Distributed register files
- Distributed register files means:
  - not all functional units can access all data
  - each functional unit input/output no longer has a dedicated route from/to all register files
Stream processing
[Figure: Image 0 and Image 1 each pass through a convolve kernel; a SAD kernel produces the Depth Map. Legend: kernel, stream, input data, output data.]
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels not dependent on other output pixels)
- Compute intensive (60 operations per memory reference)
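To make the SAD (sum of absolute differences) kernel concrete, here is a minimal one-row sketch. The slide does not give window sizes or the search range, so `window`, `max_disp`, and the function names are assumptions for illustration. It shows why the workload is data parallel: each output value depends only on a small neighborhood of input pixels, never on other outputs.

```python
def sad(xs, ys):
    """Sum of absolute differences between two equal-length pixel windows."""
    return sum(abs(x - y) for x, y in zip(xs, ys))


def best_disparity(left_row, right_row, x, window=3, max_disp=4):
    """Return the shift d in [0, max_disp] that minimizes SAD.

    Compares the window starting at x in left_row with the window
    starting at x - d in right_row. Parameters are hypothetical.
    """
    ref = left_row[x:x + window]
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        start = x - d
        if start < 0:
            break  # window would fall off the left edge of the row
        cost = sad(ref, right_row[start:start + window])
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

Running `best_disparity` independently for every `x` gives one row of a depth-map-like output, and each call can proceed in parallel with the others.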
Stream programming

Streams: communication

    void main() {
        Stream a(256);
        Stream b(256);
        Stream c(256);
        Stream d(1024);
        ...
        example1(a, b, c);   // c is produced here ...
        example2(c, d);      // ... and consumed here
        ...
    }

Kernels: computation

    KERNEL example1(istream a, istream b, ostream c) {
        loop_stream(a) {
            int ai, bi, ci;
            a >> ai;                 // read one element from each input stream
            b >> bi;
            ci = ai * 2 + bi * 3;    // local computation
            c << ci;                 // append the result to the output stream
        }
    }
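For readers without the stream toolchain, the `example1` kernel can be mimicked in plain Python. This is a rough analogy only: Python iterables stand in for streams and a generator stands in for the kernel, which is an assumption of this sketch and not how the stream language is implemented.

```python
def example1(a, b):
    """Generator analogue of the example1 kernel.

    Each iteration reads one element from each input stream, performs
    the local computation, and yields one element of the output stream.
    """
    for ai, bi in zip(a, b):
        yield ai * 2 + bi * 3


def run_example1(a, b):
    """Materialize the output stream c as a list."""
    return list(example1(a, b))
```

For instance, `run_example1([1, 2, 3], [10, 20, 30])` consumes both inputs element by element and returns the three results of `ai * 2 + bi * 3`.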
Stream Processor
- Instructions are Load, Store, and Operate
  - operands are streams
- Operate performs a compound stream operation (e.g., triangle transform)
  - read elements from input streams
  - perform a local computation
  - append elements to output streams
  - repeat until input stream is consumed
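The three instruction types can be sketched as a toy interpreter. The class and its dictionaries are hypothetical stand-ins (one dictionary for off-chip memory, one for on-chip stream storage), chosen only to show that every operand is a whole stream and that Operate applies a kernel element by element until the inputs run out.

```python
class ToyStreamProcessor:
    """Illustrative model only; names and storage layout are assumptions."""

    def __init__(self, memory):
        self.memory = memory  # stream name -> list, stands in for off-chip memory
        self.srf = {}         # stream name -> list, stands in for on-chip stream storage

    def load(self, name):
        """Load: copy a whole stream from memory into on-chip storage."""
        self.srf[name] = list(self.memory[name])

    def store(self, name):
        """Store: copy a whole stream from on-chip storage back to memory."""
        self.memory[name] = list(self.srf[name])

    def operate(self, kernel, inputs, output):
        """Operate: read elements, compute locally, append to the output.

        zip() stops when the shortest input stream is consumed.
        """
        streams = [self.srf[name] for name in inputs]
        self.srf[output] = [kernel(*elems) for elems in zip(*streams)]
```

A typical sequence is `load` for each input, one or more `operate` calls whose intermediate streams never leave on-chip storage, then a single `store` for the final result.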
Imagine
Arithmetic clusters
Bandwidth hierarchy
- VLIW clusters with shared control
- bit operations per word of memory bandwidth
- SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU clusters (544 GB/s)
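The three bandwidth figures on this slide can be turned into ratios to show how steep the hierarchy is. The sketch assumes the figures attach to the levels in the order SDRAM, Stream Register File, ALU clusters; the dictionary and function names are made up for the example.

```python
# Bandwidth in GB/s at each level, using the three figures from the slide.
# The level-to-figure mapping is an assumption of this sketch.
BANDWIDTH_GB_S = {
    "SDRAM": 2,
    "Stream Register File": 32,
    "ALU clusters": 544,
}


def ratios_to_memory(bandwidth):
    """Express each level's bandwidth as a multiple of SDRAM bandwidth."""
    base = bandwidth["SDRAM"]
    return {level: bw / base for level, bw in bandwidth.items()}


RATIOS = ratios_to_memory(BANDWIDTH_GB_S)
```

Under that mapping each step up the hierarchy supplies roughly an order of magnitude more bandwidth, so a kernel has to reuse data close to the ALUs to keep them busy.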
Conclusions
- ‘Streams’ shown to be promising for reconfigurable computing
  - wireless may need reconfigurability
- ‘Streams’ shown to be promising for media processing
  - wireless may have similar workloads
- Important to understand the pros and cons of different methodologies for good wireless architectures
- Important to have the right tools