1
A Programmable Coprocessor Architecture for Wireless Applications
Yuan Lin, Nadav Baron, Hyunseok Lee, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Lab
University of Michigan
Sept. 2004
2
Introduction
Growing need to support multiple wireless protocols
Software defined radio: implementing DSP algorithms in software rather than hardware
  ASIC: high performance, low flexibility
  Processor: high flexibility, low performance
Objective: achieve real-time performance with processor flexibility and programmability
3
Performance Requirements
802.11b: 11 Mbps
Hiperlan2: 36 Mbps
UWB: 200 Mbps
4
DSP Algorithm Characteristics
Streaming data
  Short variable liveness
  High data throughput
High data-level parallelism
Low control flow overhead
  Counted loops
  Few data-dependent branches
5
Proposed Coprocessor Architecture: MAPP
Streaming data
  Macro pipeline architecture
  No cache structure
High data-level parallelism
  Vector architecture
Low control flow overhead
  No branch predictors
Programmability to support multiple protocols
6
MAPP Architectural Diagram
[Figure: block diagram. Labeled blocks: ARM Core, Instruction Cache, Data Cache, VPP Controller, and a Vector Processing Pipeline (VPP) made up of PPUs.]
7
PPU Architectural Diagram
[Figure: Pipeline Processing Unit (PPU). Labeled blocks: In Queue (Data In), Vector Register File, Vector ALU, Internal Instruction Buffer, Out Queue (Data Out), with connections to the VPP Controller.]
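The macro pipeline idea can be illustrated with a small software model: each PPU reads a vector from its in queue, applies its kernel, and writes the result to its out queue, which feeds the next PPU. This is only a sketch under assumptions. The class name `PPU`, the function `run_pipeline`, and the two example kernels are hypothetical and are not taken from the slides, and the real hardware scheduling is not modeled.

```python
from collections import deque


class PPU:
    """Toy model of one pipeline processing unit (hypothetical)."""

    def __init__(self, kernel):
        self.kernel = kernel        # function applied to each input vector
        self.in_queue = deque()     # vectors waiting to be processed
        self.out_queue = deque()    # results waiting for the next stage

    def step(self):
        # Process at most one vector per step, if one is available.
        if self.in_queue:
            self.out_queue.append(self.kernel(self.in_queue.popleft()))


def run_pipeline(stages, stream):
    """Push a stream of vectors through a chain of PPUs, in order."""
    stages[0].in_queue.extend(stream)
    results = []
    while any(s.in_queue or s.out_queue for s in stages):
        for s in stages:
            s.step()
        # Forward each stage's output to the next stage's in queue.
        for up, down in zip(stages, stages[1:]):
            while up.out_queue:
                down.in_queue.append(up.out_queue.popleft())
        # Drain the final stage.
        while stages[-1].out_queue:
            results.append(stages[-1].out_queue.popleft())
    return results


# Two illustrative stages: scale every element, then add an offset.
stages = [PPU(lambda v: [2 * x for x in v]), PPU(lambda v: [x + 1 for x in v])]
output = run_pipeline(stages, [[1, 2], [3, 4]])
```

Data moves only through the queues, so no stage needs a cache or shared memory to see its neighbor's results, which is the property the slides emphasize.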
8
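Mapping DSP Algorithms: Viterbi ACS
[Figure: add-compare-select dataflow. State metric vectors s0 and s1 are added to branch metric vectors bm0 and bm1, the sums v0 and v1 are compared to produce a mask (le/g per element), and a mux selects the new state vector s'. The example element values shown in the figure are not recoverable.]
Vector instruction sequence:
  vadd v0, s0, bm0
  vadd v1, s1, bm1
  cmp v0, v1
  move{le} s', v1
  move{g} s', v2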
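The add-compare-select step can be sketched in plain Python, treating each list as one vector register. This is an illustration, not the MAPP implementation: it assumes the usual Viterbi convention that the smaller path metric survives, since the operands of the predicated moves on the slide are not fully legible. The function name `acs` and the example values are hypothetical.

```python
def acs(s0, s1, bm0, bm1):
    """Vector add-compare-select over two candidate predecessor states.

    Assumes the smaller accumulated metric survives (an assumption, not
    confirmed by the slide).
    """
    v0 = [s + b for s, b in zip(s0, bm0)]    # vadd v0, s0, bm0
    v1 = [s + b for s, b in zip(s1, bm1)]    # vadd v1, s1, bm1
    mask = [a <= b for a, b in zip(v0, v1)]  # cmp v0, v1
    # The two predicated moves together act as a per-element mux.
    survivors = [a if m else b for a, b, m in zip(v0, v1, mask)]
    return survivors, mask


# Illustrative inputs only.
new_metrics, decisions = acs(
    s0=[0, 4, 8, 2], s1=[4, 0, 2, 8],
    bm0=[0, 1, 2, 1], bm1=[2, 1, 0, 1],
)
```

Every element follows the same instruction sequence and the selection is done with a mask rather than a branch, which is why the kernel maps well onto a vector unit without branch prediction.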
9
Increase Area/Power Efficiency
Data slice architecture
  Most DSP algorithms do not need 32-bit precision
  Viterbi decoding operates on 8-bit data
  Filters may need 16-bit precision
Partial processor execution
  Statically determined code
  Turn off architecture units that are not used
  Energy saving, no area saving
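To make the precision argument concrete, the sketch below shows four independent 8-bit additions occupying the width of one 32-bit word. It is only an illustration of why narrow data slices are sufficient for 8-bit kernels. The helper names `pack`, `unpack`, and `lane_add` are hypothetical, and the slides do not describe how the slices handle overflow, so wraparound per lane is an assumption.

```python
LANES = 4                       # four slices
LANE_BITS = 8                   # each slice is 8 bits wide
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def lane_add(a, b):
    """Add lane by lane; each lane wraps modulo 256 (assumed behavior)."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])


result = unpack(lane_add(pack([1, 2, 3, 250]), pack([10, 20, 30, 10])))
```

The last lane overflows (250 + 10 = 260) and wraps to 4 without disturbing its neighbors, which is the isolation a sliced datapath gives for free.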
10
Vector Cluster Diagram (4x8 bit data slice)
[Figure: four 8-bit data slices, each with its own Register File, In Queue, Out Queue, and ALU, connected by a 4x4 Local Interconnect Network.]
11
Performance Results
12
Simplistic Power Analysis
Based on ARM9 data in a 0.13 µm process
Viterbi decoder (K=7): 0.75 W to 1 W
  64x4 8-bit ALU: ~240 mW
  12 KB memory: ~310 mW
  Clock: ~200 mW
  Others: ~250 mW
ASIC implementations: 7.65 mW to 0.7 W (at different throughputs)
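The component estimates appear to add up to the upper end of the quoted Viterbi range. A quick check of that arithmetic, using only the figures on the slide (the variable names are illustrative):

```python
# Component estimates from the slide, in milliwatts.
components_mw = {
    "64x4 8-bit ALU": 240,
    "12 KB memory": 310,
    "clock": 200,
    "others": 250,
}

total_mw = sum(components_mw.values())   # 240 + 310 + 200 + 250
total_w = total_mw / 1000                # convert mW to W

# Ratio against the highest-power ASIC figure quoted on the slide (0.7 W).
ratio_vs_high_asic = total_w / 0.7
```

The total is 1000 mW, or 1 W. Because the ASIC figures are quoted at different throughputs, the ratio is only a rough indication and not a like-for-like comparison.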
13
Conclusion & Future Work
Programmable coprocessor architecture
  Can support multiple protocols
  Meets real-time computational requirements
  Reasonable power consumption
Future work
  Realistic power model simulation
  Implement complete protocols
  Algorithm behavior studies
  Shrink processor area