Published by Jonas Ward. Modified over 9 years ago.
1
RICE UNIVERSITY
‘Stream’-based wireless computing
Sridhar Rajagopal
Research group meeting, December 17, 2002
The figures used in the slides are borrowed from papers at VT and Stanford.
2
Motivation
- ‘Stream’-based computing: what does it mean?
  - Not a well-defined term
  - ‘Computation’ that uses a flow of self-guided information
  - A ‘sequence of data’
  - Related to the flow of data through an architecture
- Application to implementing wireless algorithms
3
Outline
- Stallion: reconfigurable computing at Virginia Tech
  - ‘stream’-based computing #1
  - Custom Configurable Machines (CCM)
- Imagine: media processing at Stanford
  - ‘stream’-based computing #2
  - programmable architectures
4
Stallion at VT
- Wormhole Run-Time Reconfiguration (RTR)
- coarse-grained structure
- reconfiguration using ‘streams’
5
‘Stream’ packets
- A stream packet
- Stream flow through architecture
6
Functional description of PE
7
Stream module description
4 states:
- IDLE: reconfiguration in progress
- BUSY: doing work
- PROGRAM: load reconfiguration data
- PASS: packet meant for the next module
Need to output one packet per cycle
VALID flag maintains synchronization
- set INVALID instead of inserting wait states
- strip information off stack
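The four states and the one-packet-per-cycle rule can be sketched as a small state machine. This is only an illustration under stated assumptions: the packet layout, the `Kind` tag, the one-cycle reconfiguration time (`kReconfigCycles`), and the "add the configuration word" computation are all hypothetical, since the slides do not give the real Stallion packet format or timing.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical packet layout; not the real Stallion format.
enum class Kind { Data, Program, PassThrough };

struct Packet {
    bool valid;             // VALID flag: false marks a filler packet
    Kind kind;
    std::uint16_t payload;  // 16-bit data word
};

enum class State { IDLE, BUSY, PROGRAM, PASS };

struct StreamModule {
    static constexpr int kReconfigCycles = 1;  // assumed reconfiguration time
    State state = State::IDLE;
    std::uint16_t config = 0;    // stand-in for reconfiguration data
    int reconfig_left = 0;
    std::deque<Packet> pending;  // packets buffered while reconfiguring

    // Called once per cycle. Always returns a packet, so downstream modules
    // stay in step: an INVALID filler is emitted instead of a wait state.
    Packet step(const Packet& in) {
        const Packet filler{false, Kind::Data, 0};
        if (in.valid) pending.push_back(in);
        if (reconfig_left > 0) {  // IDLE: reconfiguration in progress
            --reconfig_left;
            state = State::IDLE;
            return filler;
        }
        if (pending.empty()) return filler;
        const Packet p = pending.front();
        pending.pop_front();
        switch (p.kind) {
        case Kind::Program:       // PROGRAM: load data, strip the packet
            state = State::PROGRAM;
            config = p.payload;
            reconfig_left = kReconfigCycles;
            return filler;
        case Kind::PassThrough:   // PASS: forward unchanged to next module
            state = State::PASS;
            return p;
        case Kind::Data:          // BUSY: do the (placeholder) work
            state = State::BUSY;
            return Packet{true, Kind::Data,
                          static_cast<std::uint16_t>(p.payload + config)};
        }
        return filler;
    }
};
```

Calling `step` once per cycle with a program packet, then a data packet, yields two INVALID fillers (PROGRAM, then IDLE) before the buffered data word comes out, which is the behaviour the slide describes as setting INVALID instead of stalling.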
8
Processing layer
Static section
- configures the reconfigurable section
- buffers data during reconfiguration and sends ‘IDLE’ packets
Reconfigurable section
- processing of the data is done here
Higher layers
- convert the algorithm to data and configuration patterns
9
Cart before the horse: Colt before the Stallion
- Colt architecture (also at VT)
- IFU mesh: mesh of interconnected functional units
10
Stallion chip
[Figure: Stallion chip diagram; labels: 16-bit data, 4-control]
11
IFU mesh in Stallion
- Dashed lines: skip buses
- Can send operands over one or more IFUs
12
IFU details
- Only the left input can do barrel shifting
- ALU based on LUT
- Control register: stores control information for reconfiguration
- Optional delay register: provides latency to synchronize path lengths of different pipeline streams
- Conditional unit
- Output control unit
13
Radio testbed at VT
[Figure label: Stallion]
14
Worm-hole routing
- stream = worm, architecture = holes
- multiple, independent streams can wind their way through the chip simultaneously
- parts of the system can be processing while other parts are reconfiguring
- GOAL: Layered Software Radio Architecture
15
‘Stream’ processing at Stanford
- Speeding up media applications
  - need lots of computations per memory reference
  - lots of data and sub-word parallelism
- Current GPP architectures do not have enough ALUs
- ‘Stream’ processors to the rescue
16
Special-purpose processors
- Fed by dedicated wires/memories
- Lots (100s) of ALUs
17
Care and feeding of ALUs
[Figure labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]
‘Feeding’ structure dwarfs ALU
18
Architecture implications
Tremendous opportunities
- media problems have lots of parallelism and locality
- VLSI technology enables 100s of ALUs/chip (1000s soon)
  (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
Challenging problems
- locality: global structures won’t work
- explicit parallelism: ILP won’t keep 100 ALUs busy
- memory: streaming applications don’t cache well
It’s time to try some new approaches
19
Register file organization
Register file functions:
- short-term storage for intermediate results
- communication between multiple function units
Global register files don’t scale with #ALUs
- need more registers to hold more results (grows with #ALUs)
- need more ports to connect all of the units (grows with #ALUs²)
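The two growth rates above can be combined into a rough area estimate. The following is a sketch under an assumption the slide does not state explicitly: that each storage cell needs one wordline and one bitline per port, so cell area grows with the square of the port count. The symbols N (number of ALUs), R (number of registers), p (number of ports) and A (area) are introduced here for illustration only.

```latex
\begin{align*}
R &\propto N                 && \text{(registers grow with the number of ALUs)} \\
p &\propto N                 && \text{(each ALU adds a fixed number of ports)} \\
A_{\text{cell}} &\propto p^{2} \propto N^{2}
                             && \text{(assumed: one wordline and one bitline per port)} \\
A_{\text{total}} &\propto R \cdot A_{\text{cell}} \propto N \cdot N^{2} = N^{3}
\end{align*}
```

Under these assumptions, doubling the number of ALUs that share one global register file multiplies its area by roughly eight, which is the motivation for the distributed organization on the following slides.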
20
Register files dwarf ALUs
21
Distributed register files
Distributed register files mean:
- not all functional units can access all data
- each functional unit input/output no longer has a dedicated route from/to all register files
22
Stream processing
[Figure labels: Kernel, Stream, Input Data, Output Data, Image 0, Image 1, convolve, SAD, Depth Map]
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels do not depend on other output pixels)
- Compute intensive (60 operations per memory reference)
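For readers unfamiliar with the SAD kernel named in the figure, SAD is the sum of absolute differences between two blocks of pixels. The sketch below is plain C++ for illustration and is not the Imagine kernel; the function name `sad` and the use of `std::vector` of 8-bit pixels are assumptions made here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum of absolute differences between two pixel blocks.
// Assumes both blocks have the same length; extra elements of the
// longer block are ignored. Returns 0 for empty blocks.
std::uint32_t sad(const std::vector<std::uint8_t>& a,
                  const std::vector<std::uint8_t>& b) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        // Widen to int before subtracting so the difference can be negative.
        const int diff = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        sum += static_cast<std::uint32_t>(std::abs(diff));
    }
    return sum;
}
```

Each call reads every input pixel once and depends on no other output, which matches the "little data reuse" and "highly data parallel" properties listed on the slide. For example, `sad({10, 20, 30}, {12, 18, 30})` is 2 + 2 + 0 = 4.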
23
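Stream programming
Streams: communication
    void main() {
        Stream a(256);
        Stream b(256);
        Stream c(256);
        Stream d(1024);
        ...
        example1(a, b, c);   // c is produced by example1 ...
        example2(c, d);      // ... and consumed by example2
        ...
    }
Kernels: computation
    KERNEL example1(istream a, istream b, ostream c) {
        loop_stream(a) {             // one iteration per element of a
            int ai, bi, ci;
            a >> ai;                 // read one element from each input stream
            b >> bi;
            ci = ai * 2 + bi * 3;    // local computation
            c << ci;                 // append the result to the output stream
        }
    }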
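The kernel on this slide is written in a stream dialect that an ordinary C++ compiler will not accept. As a rough stand-in for experimenting with the same computation, the sketch below models each stream as a `std::vector<int>`; that modelling choice, and returning the output stream by value, are assumptions made here and not part of the Imagine tools.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ stand-in for the slide's example1 kernel.
// Streams are modelled as std::vector<int> (an assumption).
std::vector<int> example1(const std::vector<int>& a,
                          const std::vector<int>& b) {
    std::vector<int> c;
    c.reserve(a.size());
    // loop_stream(a): one iteration per element of a. The extra bound on b
    // guards against streams of unequal length.
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const int ai = a[i];            // a >> ai
        const int bi = b[i];            // b >> bi
        c.push_back(ai * 2 + bi * 3);   // c << ci
    }
    return c;
}
```

With inputs `{1, 2, 3}` and `{10, 20, 30}` this produces `{32, 64, 96}`. The point of the stream form is that each iteration touches only its own elements, so iterations can be spread across ALU clusters.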
24
Stream Processor
- Instructions are Load, Store, and Operate
  - operands are streams
- Operate performs a compound stream operation
  - read elements from input streams
  - perform a local computation
  - append elements to output streams
  - repeat until input stream is consumed
  - (e.g., triangle transform)
25
Imagine
26
Arithmetic clusters
27
Bandwidth hierarchy
- VLIW clusters with shared control
- 41.2 32-bit operations per word of memory bandwidth
[Figure: SDRAM (2 GB/s), Stream Register File (32 GB/s), ALU Cluster (544 GB/s)]
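The figures on this slide can be related with a little arithmetic. The sketch below assumes that 1 GB/s means 10^9 bytes per second, that a word is 32 bits (4 bytes), and that the 41.2 operations per word are measured against the SDRAM bandwidth; none of these is stated on the slide, and the constant and function names are made up for illustration.

```cpp
// Bandwidths quoted on the slide, in GB/s.
constexpr double kSdramGBps = 2.0;
constexpr double kSrfGBps = 32.0;       // Stream Register File
constexpr double kClusterGBps = 544.0;  // ALU cluster
constexpr double kOpsPerWord = 41.2;    // 32-bit ops per word of memory bandwidth
constexpr double kBytesPerWord = 4.0;   // assumed 32-bit word

// Bandwidth of a level relative to SDRAM bandwidth.
constexpr double ratio_to_sdram(double gbps) { return gbps / kSdramGBps; }

// Operation rate (in 10^9 ops/s) implied if SDRAM bandwidth is fully used:
// words per second from memory times operations per word.
constexpr double implied_gops() {
    return kSdramGBps / kBytesPerWord * kOpsPerWord;
}
```

Under these assumptions the hierarchy widens by 16x from SDRAM to the stream register file and by 272x from SDRAM to the ALU clusters, and the implied operation rate is about 20.6 billion 32-bit operations per second.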
28
Conclusions
- ‘Streams’ shown to be promising for reconfigurable computing
  - wireless may need reconfigurability
- ‘Streams’ shown to be promising for media processing
  - wireless may have similar workloads
- Important to understand the pros and cons of different methodologies for good wireless architectures
- Important to have the right tools