Programmable processors for wireless base-stations
Sridhar Rajagopal (sridhar@rice.edu)
December 11, 2003
Wireless rates vs. clock rates
Need to process 100X more bits per clock cycle today than in 1996.
[Chart: clock frequency (MHz), W-LAN data rate (Mbps), and cellular data rate (Mbps) vs. year, 1996-2006, log scale. 1996: 200 MHz clocks, 1 Mbps W-LAN, 9.6 Kbps cellular. 2006: 4 GHz clocks, 54-100 Mbps W-LAN, 2-10 Mbps cellular.]
Base-stations need horsepower
– Sophisticated signal processing for multiple users
– Need 100-1000s of arithmetic operations to process 1 bit
– Base-stations require > 100 ALUs
[Diagram: receiver chain – RF (analog) → 'chip rate' processing (ASIC(s) and/or ASSP(s) and/or FPGA(s)) → 'symbol rate' processing (DSP(s)) → decoding (co-processor(s) and/or ASIC(s)) → 'packet rate' processing (DSP or RISC processor)]
Power efficiency and flexibility
Wireless systems are getting harder to design
– Evolving standards, compatibility issues
– More base-stations per unit area
– Operational and maintenance costs
Flexibility provides power efficiency
– Base-stations rarely operate at full capacity
– Varying users, data rates, spreading, modulation, coding
– Adapting resources to needs avoids wasting power; it does not imply low power
Wireless gets blacked out too: "Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" (August 16, 2003, 8:58 AM EDT, Paul R. La Monica, CNN/Money Senior Writer)
Thesis addresses the following problem
Design programmable processors for wireless base-stations with 100s of ALUs:
(a) map wireless algorithms onto these processors
(b) make them power-efficient (adapt resources to needs)
(c) decide #ALUs and clock frequency
How programmable? – as programmable as possible
Choice: stream processors
Single processors won't do
– ILP, subword parallelism not sufficient
– Register file explosion with increasing ALUs
Multiprocessors
– Data parallelism in wireless systems
– SIMD (vector) processors appropriate
Stream processors – media processing
– Share characteristics with wireless systems
– Shown potential to support 100-1000s of ALUs
– Cycle-accurate simulator and compiler tools available
Thesis contributions
(a) Mapping algorithms on stream processors
– designing data-parallel algorithm versions
– tradeoffs between packing, ALU utilization and memory
– reduced inter-cluster communication network
(b) Improving power efficiency in stream processors
– adapting compute resources to workload variations
– varying voltage and frequency to real-time requirements
(c) Design exploration between #ALUs and clock frequency to minimize power consumption
– fast real-time performance prediction
Outline
Background
– Wireless systems
– Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work
Wireless workloads
                              2G (1996)        3G (2004)                    4G (?)
Users                         32               32                           32
Data rates                    16 Kbps/user     128 Kbps/user                1 Mbps/user
Algorithms                    Single-user      Multi-user                   MIMO
  Estimation                  Correlator       Max. likelihood              Chip equalizer
  Detection                   Matched filter   Interference cancellation    Matched filter
  Decoding                    Viterbi          Viterbi                      LDPC
Theoretical min ALUs @ 1 GHz  > 2              > 20                         > 200
Key kernels studied for wireless
– FFT (media processing)
– QRD (media processing)
– Outer product updates
– Matrix-vector operations
– Matrix-matrix operations
– Matrix transpose
– Viterbi decoding
– LDPC decoding
Characteristics of wireless
– Compute-bound
– Finite precision
– Limited temporal data reuse (streaming data)
– Data parallelism
– Static, deterministic, regular workloads (limited control flow)
Parallelism levels in wireless systems

    #define N 1024
    int i, a[N], b[N], sum[N];     // 32-bit data
    short c[N], d[N], diff[N];     // 16-bit data (packed)
    for (i = 0; i < N; ++i) {
        sum[i]  = a[i] + b[i];     // independent ops in one iteration: ILP
        diff[i] = c[i] - d[i];     // subword ops on packed 16-bit data: MMX
    }                              // independent iterations: DP

– Instruction Level Parallelism (ILP) – DSP
– Subword parallelism (MMX) – DSP
– Data Parallelism (DP) – vector processor
– DP can decrease by increasing ILP and MMX – example: loop unrolling
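To make the last point concrete, here is a minimal sketch (mine, not from the slide) of how unrolling the loop above trades DP for ILP: after unrolling by 4, the compiler sees four independent adds per iteration to schedule as ILP, while the number of independent iterations (the remaining DP) drops from N to N/4.

    // Unrolled by 4, reusing the declarations above; assumes N is a multiple of 4.
    for (i = 0; i < N; i += 4) {
        sum[i]   = a[i]   + b[i];  // four independent adds per iteration: ILP
        sum[i+1] = a[i+1] + b[i+1];
        sum[i+2] = a[i+2] + b[i+2];
        sum[i+3] = a[i+3] + b[i+3];
    }                              // N/4 independent iterations: remaining DP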
Stream processors: multi-cluster DSPs
[Diagram: a VLIW DSP is one cluster (adders, multipliers, internal memory) exploiting ILP and MMX; a stream processor replicates that cluster under a micro-controller, with a Stream Register File (SRF) as memory, adding DP across clusters]
– Identical clusters, same operations; adapt #clusters to DP
– Power down unused FUs and clusters
Programming model
Communication (streams):

    stream a(1024);  stream b(1024);  stream sum(1024);
    stream c(512);   stream d(512);   stream diff(512);
    add(a, b, sum);
    sub(c, d, diff);

Computation (kernels):

    kernel add(istream a, istream b, ostream sum) {
        int inputA, inputB, output;
        loop_stream(a) {
            a >> inputA;
            b >> inputB;
            output = inputA + inputB;
            sum << output;
        }
    }
    kernel sub(istream c, istream d, ostream diff) {
        int inputC, inputD, output;
        loop_stream(c) {
            c >> inputC;
            d >> inputD;
            output = inputC - inputD;
            diff << output;
        }
    }

"Your new hardware won't run your old software" – Balch's law
Outline
Background
– Wireless systems
– Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work
Viterbi needs inter-cluster communication
Exploiting Viterbi DP: odd-even grouping of data
[Diagram: the 16-element state vector X(0)..X(15) is rearranged so even-indexed states X(0), X(2), ..., X(14) fill the first half of the DP vector and odd-indexed states X(1), X(3), ..., X(15) fill the second half, turning regular ACS into 'ACS in SWAPs']
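The grouping itself is simple to state in C. A minimal single-array sketch (illustration only; on the processor the same permutation runs across clusters over the inter-cluster network):

    // Odd-even grouping: [x0 x1 x2 ... x(n-1)] -> [x0 x2 ... | x1 x3 ...].
    // Assumes n is even.
    void odd_even_group(const int *in, int *out, int n) {
        for (int i = 0; i < n / 2; i++) {
            out[i]         = in[2 * i];      // even indices to the first half
            out[i + n / 2] = in[2 * i + 1];  // odd indices to the second half
        }
    }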
Performance of Viterbi decoding
[Chart: decoding performance vs. clusters, with an 'Ideal' reference curve]
A C64x DSP (without co-processor) needs ~200 MHz for real-time.
Patterns in inter-cluster communication
The inter-cluster communication network is fully connected
– Structure in access patterns can be exploited
Broadcasting
– Matrix-vector multiplication, matrix-matrix multiplication, outer product updates
Odd-even grouping
– Transpose, packing, Viterbi decoding
Odd-even grouping
Packing
– Overhead when input and output precisions are different
– Not always beneficial for performance
– Odd-even grouping required to bring data to the right cluster
Matrix transpose
– Better done in the ALUs than in memory (shown to have an order-of-magnitude better performance)
– Done in the ALUs as repeated odd-even groupings
Odd-even grouping on the fully connected network
– O(C²) wires, O(C²) interconnections, 8 cycles
– Wires span the entire chip length: limits clock frequency, limits scaling
[Diagram: 4 clusters holding data pairs 0/4, 1/5, 2/6, 3/7; grouping rearranges 0 1 2 3 4 5 6 7 into 0 2 4 6 | 1 3 5 7]
A reduced inter-cluster communication network
– Only nearest-neighbor interconnections
– O(C log C) wires, O(C) interconnections, 8 cycles
Outline
Background
– Wireless systems
– Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work
Flexibility needed in workloads
Billions of computations per second needed.
Workload varies from ~1 GOPs (4 users, constraint length 7 Viterbi) to ~23 GOPs (32 users, constraint length 9 Viterbi).
[Chart: minimum ALUs needed at 1 GHz and operation count (GOPs) for (users, constraint length) from (4,7) to (32,9), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user)]
Note: GOPs refers only to arithmetic computations.
Flexibility affects DP*
Workload (U,K)   Estimation f(U,N)   Detection   Decoding f(U,K,R)
(4,7)            32                  4           16
(4,9)            32                  4           64
(8,7)            32                  8           16
(8,9)            32                  8           64
(16,7)           32                  16          16
(16,9)           32                  16          64
(32,7)           32                  32          16
(32,9)           32                  32          64
U = users, K = constraint length, N = spreading gain, R = decoding rate
*Data parallelism is defined as the parallelism available after subword packing and loop unrolling.
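Reading down the table, each stage's DP follows a simple pattern (inferred from the numbers above, not stated on the slide): estimation DP tracks the spreading gain N (32 here), detection DP tracks the user count U, and decoding DP is 16 for K = 7 and 64 for K = 9, i.e. 2^(K-3). A hypothetical helper capturing this:

    // DP per stage for a (U, K) workload with spreading gain N.
    // The closed form 2^(K-3) matches the table (K=7 -> 16, K=9 -> 64)
    // but is an inference, not a formula from the talk.
    void workload_dp(int U, int K, int N, int *est, int *det, int *dec) {
        *est = N;             // estimation: f(U, N)
        *det = U;             // detection: grows with users
        *dec = 1 << (K - 3);  // decoding: f(U, K, R)
    }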
When DP changes (e.g., 4 → 2 clusters)
– Data is no longer in the right SRF banks
– Overhead in bringing data to the right banks: via memory, or via the inter-cluster communication network
[Diagram: SRF banks feeding 4 clusters]
Adapting #clusters to data parallelism
Adaptive multiplexer network between the SRF and the clusters
[Diagram: four configurations – no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off]
Unused clusters are turned off using voltage gating to eliminate both static and dynamic power dissipation.
Cluster utilization variation
[Chart: per-cluster utilization on a 32-cluster processor; (32, 9) = 32 users, constraint length 9 Viterbi]
Frequency variation
[Chart: required clock frequency varying with workload]
Operation
Dynamic voltage-frequency scaling when the system changes significantly
– Users, data rates, …
– Coarse time scale (every few seconds)
Turn off clusters when parallelism changes significantly
– Memory operations
– Exceeding real-time requirements
– Finer time scales (100s of microseconds)
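A minimal sketch of this two-time-scale policy (hypothetical C; every function and type name here is assumed, standing in for the processor's actual control hooks):

    typedef struct SystemState SystemState;   // opaque workload descriptor (assumed)
    extern int  system_changed(SystemState *s);
    extern int  required_frequency_mhz(SystemState *s);
    extern void set_voltage_frequency(int f_mhz);
    extern int  current_data_parallelism(SystemState *s);
    extern void set_active_clusters(int n);

    // Coarse loop (every few seconds): rescale V/f when users or rates change.
    // Fine loop (every few 100 us): voltage-gate clusters when DP changes.
    void adapt(SystemState *s) {
        if (system_changed(s))
            set_voltage_frequency(required_frequency_mhz(s));
        set_active_clusters(current_data_parallelism(s));
    }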
Power: voltage gating and scaling
Power can change from 12.38 W to 300 mW (~40x savings) depending on workload changes.
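For intuition, the savings combine two standard first-order CMOS effects (the generic textbook relation, not the exact capacitance model used in the thesis):

    P ≈ α · C · V² · f + P_static

Scaling voltage and frequency down shrinks the dynamic term rapidly, and voltage-gating an idle cluster removes its contribution to both terms entirely; together these span the wide power range above.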
Outline
Background
– Wireless systems
– Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work
Deciding #ALUs vs. clock frequency
The variables are not independent
– Clusters, ALUs, frequency, voltage (c, a, m, f)
– Trade-offs exist
How to find the right combination for lowest power?
Static design exploration
Execution time = static, predictable part (computations) + dynamic part (memory stalls, microcontroller stalls)
[Diagram: execution-time breakdown]
The static decomposition also helps in quickly predicting real-time performance.
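One way to write the decomposition down (notation assumed for illustration, not copied from the talk): for c clusters with a adders and m multipliers per cluster, running at frequency f,

    t_exec(c, a, m, f) ≈ N_compute(c, a, m) / f + t_stalls

where N_compute is the statically scheduled cycle count from the compiler and t_stalls collects memory and microcontroller stalls. Because N_compute is static and predictable, real-time feasibility of a candidate (c, a, m, f) can be estimated quickly, without full simulation.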
Sensitivity analysis important
We have a capacitance model [Khailany2003]
– The equations are not exact
– Need to see how variations affect the solutions
Design exploration methodology
3 types of parallelism: ILP, MMX, DP
For best performance (power), maximize the use of all three
Maximize ILP and MMX at the expense of DP
– Loop unrolling, packing
– Schedule on a sufficient number of adders/multipliers
If DP remains, set clusters = DP
– No other way to exploit that parallelism
Setting clusters, adders, multipliers
With sufficient DP, frequency decreases linearly with clusters
– Set clusters depending on DP and the execution-time estimate
To find adders and multipliers
– Let the compiler schedule the algorithm workloads across different numbers of adders and multipliers and report the execution times
Put all the numbers into the power equation
– Compare the capacitance added by extra ALUs and clusters against the execution-time benefit
Choose the solution that minimizes power (see the sketch below).
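A compact sketch of that search (hypothetical C; cycles() stands in for the compiler-reported schedules and power() for the capacitance-based power model):

    #include <float.h>

    typedef struct { int c, a, m; double f_hz; } Config;

    extern double cycles(int c, int a, int m);              // from the compiler (assumed)
    extern double power(int c, int a, int m, double f_hz);  // capacitance model (assumed)

    // Search (clusters, adders, multipliers) for the lowest-power
    // configuration that meets the real-time deadline t_realtime (seconds).
    Config explore(double t_realtime) {
        Config best = {0, 0, 0, 0.0};
        double best_power = DBL_MAX;
        for (int c = 1; c <= 64; c *= 2)        // cluster counts: powers of two
            for (int a = 1; a <= 5; a++)        // adders per cluster
                for (int m = 1; m <= 3; m++) {  // multipliers per cluster
                    double f = cycles(c, a, m) / t_realtime;  // frequency to meet deadline
                    double p = power(c, a, m, f);
                    if (p < best_power) {
                        best_power = p;
                        best = (Config){c, a, m, f};
                    }
                }
        return best;
    }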
Design exploration for clusters (c)
For sufficiently large #adders and #multipliers per cluster:
– Algorithm 1: 32 clusters
– Algorithm 2: 64 clusters
– Algorithm 3: 64 clusters
– Algorithm 4: 16 clusters
[Chart: execution time vs. DP for each algorithm]
Clusters: frequency and power (3G workload)
– 32 clusters at frequency = 836.692 MHz (p = 1)
– 64 clusters at frequency = 543.444 MHz (p = 2)
– 64 clusters at frequency = 543.444 MHz (p = 3)
ALU utilization with frequency
[Chart: ALU utilization vs. clock frequency, 3G workload]
Choice of adders and multipliers
            Optimal ALUs/cluster      Power
( · , p)    Adders    Multipliers     Cluster    Total
(0.01,1)    2         1               30         61
(0.01,2)    2         1               30         61
(0.01,3)    3         1               25         58
(0.1,1)     2         1               52         69
(0.1,2)     2         1               52         69
(0.1,3)     3         1               51         68
(1,1)       1         1               86         89
(1,2)       2         2               84         87
(1,3)       2         2               84         87
Exploration results
*************************
Final Design Conclusion
*************************
Clusters: 64
Multipliers/cluster: 1 (utilization: 62%)
Adders/cluster: 3 (utilization: 55%)
Real-time frequency: 568.68 MHz for 128 Kbps/user
*************************
Exploration done in seconds.
Outline
Background
– Wireless systems
– Stream processors
Mapping algorithms to stream processors
Power efficiency
Design exploration
Broad impact and future work
Broader impact
Results not specific to base-stations
– High-performance, low-power system designs
– Concepts can be extended to handsets
Mux network applicable to all SIMD processors
– Power efficiency in scientific computing
Results #2, #3 applicable to all stream applications
– Design and power efficiency
– Multimedia, MPEG, …
Future work
Don't believe the model is the reality ('proof is in the pudding')
– Fabrication needed to verify concepts
– Cycle-accurate simulator; extrapolating models for power
LDPC decoding (in progress)
– Sparse matrix requires permutations over large data
– Indexed SRF may help
3G requires 1 GHz at 128 Kbps/user
– 4G equalization at 1 Mbps breaks down (expected)
Need for new architectures, definitions and benchmarks
The road ends for conventional architectures [Agarwal2000]
Wide range of architectures – DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable+
– Difficult to compare and contrast
– Need new definitions that allow comparisons
Wireless workloads
– Typically ASIC designs
– A SPEC-like benchmark needed for programmable designs
Conclusions
Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms onto them is not easy in programmable architectures
– Data-parallel algorithm versions need to be designed and mapped
– Power efficiency needs to be provided
– Design exploration is needed to decide the #ALUs that meet real-time constraints
My thesis lays the initial foundations.
Back-up slides
Packing may not be useful
Algorithm (squaring packed 16-bit data; 16-bit inputs produce 32-bit results, so input and output precisions differ):

    short a[8];
    int y[8];
    for (int i = 0; i < 8; ++i) {
        y[i] = a[i] * a[i];
    }

[Diagram: multiplying packed a leaves the 32-bit results interleaved across output registers p and q; re-ordering steps are needed before and after the arithmetic, and that re-ordering (packing) uses odd-even grouping]
Data re-ordering in memory
Matrix transpose
– Common in wireless communication systems
– Column access to data is expensive
Re-ordering data inside the ALUs
– Faster
– Lower power
Trade-offs during memory re-ordering
[Diagram: three schedules. (a) Transpose in memory, partially overlapped with ALU work: t = t2 + t_stalls, where 0 < t_stalls < t_mem. (b) Transpose in memory, fully overlapped: t = t2. (c) Re-ordering in the ALUs: t = t2 + t_alu]
Transpose uses odd-even grouping
[Diagram: an M x N matrix with quadrants A, B, C, D is transposed by repeatedly applying odd-even grouping]

    Repeat log(M) times {
        OUT = odd_even_group(IN);
        IN = OUT;
    }
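For a square power-of-two matrix this is easy to check: odd-even grouping of the flattened array rotates each element's index bits right by one position, so applying it log2(M) times swaps the row and column bit-fields, which is exactly a transpose. A self-contained sketch (illustrative; repeats the odd_even_group helper from earlier so it stands alone):

    #include <stdio.h>

    #define M  4            // matrix dimension, a power of two
    #define SZ (M * M)

    static void odd_even_group(const int *in, int *out, int n) {
        for (int i = 0; i < n / 2; i++) {
            out[i]         = in[2 * i];
            out[i + n / 2] = in[2 * i + 1];
        }
    }

    int main(void) {
        int in[SZ], out[SZ];
        for (int i = 0; i < SZ; i++) in[i] = i;   // element (r,c) holds r*M + c
        for (int s = 0; s < 2; s++) {             // log2(M) = 2 groupings for M = 4
            odd_even_group(in, out, SZ);
            for (int i = 0; i < SZ; i++) in[i] = out[i];
        }
        // in[] now holds the transpose: element (r,c) equals c*M + r.
        for (int r = 0; r < M; r++, printf("\n"))
            for (int c = 0; c < M; c++) printf("%4d", in[r * M + c]);
        return 0;
    }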
ALU Bandwidth > Memory Bandwidth
Arithmetic clusters in stream processors
[Diagram: one cluster – adders, multipliers and dividers connected through a crosspoint to distributed register files (supports more ALUs), a scratchpad (indexed accesses), and a comm. unit on the intercluster network; data moves from/to the SRF]
Stream processor programming
Kernels (computation) and streams (communication)
[Diagram: received signal → channel estimation (correlator) → matched filter → interference cancellation → Viterbi decoding → decoded bits, with streams carrying the data between kernels]
– Use local data in clusters, providing GOPs of support
– Imagine stream processor at Stanford [Rixner'01]
Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.