Programmable processors for wireless base-stations Sridhar Rajagopal December 9, 2003.

Programmable processors for wireless base-stations Sridhar Rajagopal (sridhar@rice.edu) December 9, 2003

Fact #1: Wireless rates are growing faster than clock rates
Need to process 100X more bits per clock cycle today than in 1996
[Figure: clock frequency (MHz), W-LAN data rate (Mbps), and cellular data rate (Mbps) vs. year, 1996 to 2006, log scale. Labeled values: 200 MHz, 1 Mbps, 9.6 Kbps at the start; 4 GHz, 54-100 Mbps, 2-10 Mbps at the end]
Source: Intel, IEEE 802.11x, 3GPP

Fact #2: Base-stations need horsepower
[Figure: base-station RF RX block diagram. Blocks: LNA, ADC, DDC, frequency offset compensation, channel estimation, chip level demodulation and despreading, symbol detection, symbol decoding, packet/circuit switch control, BSC/RNC interface, power supply and control unit, power measurement and gain control (AGC). Groups: RF, baseband processing, network interface (E1/T1 or packet network)]
Sophisticated signal processing for multiple users
Need 100-1000s of arithmetic operations to process 1 bit
Source: Texas Instruments

Need on the order of 100 ALUs in base-stations
Example: 1000 arithmetic operations/bit with 1 bit/10 cycles
–100 arithmetic operations/clock cycle
Base-stations need on the order of 100 ALUs
–irrespective of the type of (clocked) architecture
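
The arithmetic on this slide can be written as a one-line calculation. The sketch below assumes, as an idealization not stated on the slide, that every ALU completes one arithmetic operation per clock cycle; the function name `alus_needed` is illustrative only.

```c
/* Minimum number of ALUs needed to sustain a workload, assuming each ALU
 * completes one arithmetic operation per clock cycle (an idealization). */
unsigned alus_needed(unsigned ops_per_bit, unsigned cycles_per_bit)
{
    /* Round up: a fractional ALU still requires a whole one. */
    return (ops_per_bit + cycles_per_bit - 1) / cycles_per_bit;
}
```

For the slide's example, `alus_needed(1000, 10)` evaluates to 100. Real utilization is below 100%, so this is a lower bound rather than a design target.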

Fact #3: Base-stations need power-efficiency*
Wireless systems are getting denser
–More base-stations per unit area
–Operational and maintenance costs
Architectures are first tested on base-stations
*Power-efficient means not wasting power; it does not imply low power
News item: "Wireless gets blacked out too. Trying to use your cell phone during the blackout was nearly impossible. What went wrong?" (Paul R. La Monica, CNN/Money Senior Writer, August 16, 2003, 8:58 AM EDT)

Fact #4: Base-stations need flexibility*
Wireless systems are continuously evolving
–New algorithms designed and evaluated
–Allow upgrading and co-existence, minimize design time, enable reuse
Flexibility is needed for power-efficiency
–Base-stations rarely operate at full capacity
–Varying users, data rates, spreading, modulation, coding
–Adapt resources to needs
*How much flexibility? As flexible as possible

Fact #5: Current base-stations are neither flexible nor power-efficient
RF (Analog)
'Chip rate' processing: ASIC(s) and/or ASSP(s) and/or FPGA(s)
'Symbol rate' processing: DSP(s)
Decoding: Co-processor(s) and/or ASIC(s)
Control and protocol: DSP or RISC processor
Change implies re-partitioning algorithms and designing new hardware
Design is done for the worst case, with no adaptation to workload
Source: [Baines2003]

Thesis addresses the following problem
Design a base-station that
(a) supports 100's of ALUs
(b) is power-efficient (adapts resources to needs)
(c) is as flexible as possible
How many ALUs, at what clock frequency?
HYPOTHESIS: Programmable* processors for wireless base-stations
*How programmable? As programmable as possible

Programmable processors
No processor optimization for a specific algorithm
–As programmable as possible
–Example: no instruction for Viterbi decoding
–FPGAs, ASICs, ASIPs etc. not considered
Use characteristics of wireless systems
–Precision, parallelism, operations, ...
–e.g., MMX extensions for multimedia

Single processors won’t do
(1) Find ways of increasing clock frequency
–C64x DSP: 600 MHz, 720 MHz, 1 GHz ... 100 GHz?
–Easiest solution, but there are physical limits to scaling f
–Not good for power, given the cubic dependence on f
(2) Increase the number of ALUs
–Limited instruction level parallelism (ILP, MMX)
–Register file area and port explosion
–Compiler issues in extracting more ILP
(3) Multiprocessors
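
The cubic dependence mentioned under (1) follows from the standard CMOS dynamic power relation. The derivation below assumes that supply voltage is scaled in proportion to clock frequency, a common first-order approximation rather than something the slides state; \alpha is the switching activity factor and C the switched capacitance.

```latex
P_{\mathrm{dyn}} = \alpha\, C\, V^{2} f,
\qquad V \propto f
\;\Longrightarrow\;
P_{\mathrm{dyn}} \propto f^{3}
```

Under this assumption, doubling the clock frequency costs roughly eight times the dynamic power, which is why adding ALUs at a lower frequency can be the more power-efficient route.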

Related work - Multiprocessors
Multi-chip: TI TMS320C40 DSP, Sundance, Cm*
Clustered VLIW: TI TMS320C6x DSP, Multiflow TRACE, Alpha 21264
Vector: CODE, Vector IRAM, Cray 1
Array: ClearSpeed™, MasPar, Illiac-IV, BSP
Stream: Imagine, Motorola RSVP™
Multi-threading (MT): Sandbridge SandBlaster DSP, Cray MTA, Sun MAJC, PowerPC RS64IV, Alpha 21464
Reconfigurable* processors: RAW, Chameleon, picoChip
Chip multiprocessor (CMP): TI TMS320C8x DSP, Hydra, IBM Power4
Figure labels: single chip; SIMD (Single Instruction Multiple Data); MIMD (Multiple Instructions Multiple Data); control parallel; data parallel
Cannot scale to support 100’s of arithmetic units
*A reconfigurable processor uses reconfiguration for execution time benefits

Challenges in proving hypothesis
Architecture choice for design exploration
–SIMD generally more programmable* than reconfigurable
–Compilers, simulators, tools and support play a major role
Benchmark workloads need to be designed
–Previously done as ASICs, so none available
–Not easy: finite precision, changing algorithms
Need detailed knowledge of wireless algorithms, architectures, mapping, compilers, design tools
*Programmable here refers to ease of use and of writing code

Architecture choice: Stream processors
State-of-the-art programmable media processors
–Can scale to 1000’s of arithmetic units [Khailany 2003]
–Wireless algorithms have similar characteristics
Cycle-accurate simulator with open-source code
Parameters such as ALUs and register files can be varied
Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead, ...
Almost anything can be changed, some changes easier than others!

Thesis contributions
Mapping algorithms on stream processors
–Designing data-parallel algorithm versions
–Tradeoffs between packing, ALU utilization and memory
–Reduced inter-cluster communication network
Improve power efficiency in stream processors
–Adapting compute resources to workload variations
–Varying voltage and frequency to real-time requirements
Design exploration between #ALUs and clock frequency to minimize power consumption
–Fast real-time performance prediction

Outline
Background
–Wireless systems
–Stream processors
Contribution #1: Mapping
Contribution #2: Power-efficiency
Contribution #3: Design exploration
Broader impact and limitations

Wireless workloads: 2G (Basic)
[Figure: 2G physical layer signal processing. Input: received signal after DDC. Blocks for each of User 1 through User K: sliding correlator, code matched filter, Viterbi decoder. Output to the MAC and network layers]
32 users, 16 Kbps/user
Single-user algorithms (other users treated as noise)
> 2 GOPs

3G Multiuser system
[Figure: 3G physical layer signal processing. Input: received signal after DDC. Blocks: multiuser channel estimation; multiuser detection (parallel interference cancellation stages); for each of User 1 through User K, code matched filter and Viterbi decoder. Output to the MAC and network layers]
32 users, 128 Kbps/user
Multi-user algorithms (cancel interference)
> 20 GOPs

4G MIMO system 32 users 1 Mbps/user Multiple antennas (higher spectral efficiency, higher data rates) > 200 GOPs

Programmable processors
    #define N 1024
    int i, a[N], b[N], sum[N];        // 32 bits
    short int c[N], d[N], diff[N];    // 16 bits, packed
    for (i = 0; i < N; ++i) {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }
Instruction Level Parallelism (ILP) - DSP
Subword Parallelism (MMX) - DSP
Data Parallelism (DP) - Vector Processor
–DP can decrease by increasing ILP and MMX
–Example: loop unrolling
[Figure: the three parallelism axes ILP, MMX, DP]
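
Subword parallelism, which the slides label MMX, can be illustrated in portable C by operating on two 16-bit lanes inside one 32-bit word. This is a minimal sketch under that assumption; `pack16` and `add16x2` are illustrative names and not part of the presentation or of any DSP instruction set.

```c
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word (hi lane, lo lane). */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Lane-wise 16-bit addition with wraparound.
 * The two lanes are summed separately so that a carry out of the low
 * lane never leaks into the high lane. */
uint32_t add16x2(uint32_t x, uint32_t y)
{
    uint32_t low  = (x + y) & 0x0000FFFFu;          /* low lane, carry dropped   */
    uint32_t high = ((x >> 16) + (y >> 16)) << 16;  /* high lane, overflow drops */
    return high | low;
}
```

A processor with subword instructions performs both lane additions in a single operation, which is what halves the number of loop iterations for the 16-bit `diff` computation in the slide's loop.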

Stream Processors: multi-cluster DSPs
[Figure: a VLIW DSP (1 cluster) with adders (+) and multipliers (*) over internal memory, exploiting ILP and MMX, next to a stream processor with several such clusters, each side driven by a microcontroller. The stream processor's memory is the Stream Register File (SRF), and the clusters add DP to ILP and MMX]
Identical clusters, same operations
Adapt clusters to DP
Power down unused FUs and clusters

Outline
Contribution #1
–Mapping algorithms to stream processors (parallel, fixed point)
–Tradeoffs between packing, ALU utilization and memory
–Reduced inter-cluster communication network

Packing
Packing was introduced around 1996 for exploiting subword parallelism
–Intel MMX
–Subword parallelism never looked back
–Integrated into all current microprocessors and DSPs
SIMD + MMX: stream processor / Vector IRAM, 2000 onwards
–Relatively new concept
Not necessarily useful in SIMD processors
–May add to inter-cluster communication

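Packing may not be useful
Algorithm:
    short a[8];
    int y[8];
    int i;
    for (i = 0; i < 8; ++i) {
        y[i] = a[i] * a[i];
    }
[Figure: packed 16-bit input a = (1 2 3 4 5 6 7 8). Multiplication yields 32-bit results in odd-even order, p = (1 3 5 7) and q = (2 4 6 8); re-ordering the data restores p = (1 2 3 4) and q = (5 6 7 8). A second panel, labeled Add, combines the partially filled words (1 3 x x), (5 7 x x), (x x 2 4), (x x 6 8) into p = (1 3 2 4) and q = (5 7 6 8) before re-ordering]
Packing uses odd-even grouping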

Data re-ordering in memory
Matrix transpose
–Common in wireless communication systems
–Column access to data is expensive
Re-ordering data inside the ALUs
–Faster
–Lower power

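Trade-offs during memory re-ordering
[Figure: three timelines showing kernels t1 and t2 with a transpose between them, performed in memory (t_mem) or in the ALUs (t_alu)]
(a) Transpose in memory, partly hidden behind ALU work: $t = t_2 + t_{stalls}$, with $0 < t_{stalls} < t_{mem}$
(b) Transpose in memory, fully hidden behind another kernel (t3): $t = t_2$
(c) Transpose in the ALUs: $t = t_2 + t_{alu}$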

Transpose uses odd-even grouping
[Figure: an N x M block, split at 0 and M/2, holding elements 1 2 3 4 A B C D, shown as IN and OUT for the grouping passes]
Repeat log(M) times { OUT = odd-even grouping of IN; IN = OUT; }
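
The repeated grouping step can be sketched in C as follows. This is an illustration under stated assumptions, not the author's implementation: the matrix is row-major, both dimensions are powers of two, and the function names are hypothetical. Each pass moves even-indexed elements to the first half and odd-indexed elements to the second half, which rotates the element index right by one bit; log2(cols) passes therefore swap the row and column index fields.

```c
#include <stddef.h>
#include <string.h>

/* One odd-even grouping pass: even-indexed elements go to the first half,
 * odd-indexed elements to the second half. len must be even. */
void odd_even_group(const int *in, int *out, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        out[(i >> 1) + (i & 1) * (len / 2)] = in[i];
}

/* Transpose a rows x cols row-major matrix by applying the grouping pass
 * log2(cols) times. Assumes rows and cols are powers of two and that tmp
 * has room for rows * cols elements. The result (cols x rows) is left in data. */
void transpose_by_grouping(int *data, int *tmp, size_t rows, size_t cols)
{
    size_t len = rows * cols;
    for (size_t c = cols; c > 1; c >>= 1) {
        odd_even_group(data, tmp, len);
        memcpy(data, tmp, len * sizeof *data);   /* IN = OUT for the next pass */
    }
}
```

For a 2 x 4 matrix holding 1 through 8, two passes leave the 4 x 2 transpose (1 5 2 6 3 7 4 8) in `data`. On a stream processor each pass would map to one round of inter-cluster communication rather than a memory copy.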

ALU Bandwidth > Memory Bandwidth

Viterbi needs odd-even grouping
Exploiting Viterbi DP in SWAPs:
–Use register exchange (RE) instead of regular traceback
–Re-order ACS, RE
[Figure: DP vector of 16 states X(0) ... X(15). Regular ACS keeps the states in natural order X(0), X(1), ..., X(15); ACS in SWAPs re-orders them as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15)]

Performance of Viterbi decoding Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

Pattern in inter-cluster comm
Broadcasting
–Matrix-vector multiplication, matrix-matrix multiplication, outer product updates
Odd-even grouping
–Transpose, packing, Viterbi decoding

Odd-even grouping
Inter-cluster communication: O(C^2) wires, O(C^2) interconnections, 8 cycles
[Figure: 4 clusters holding data 0/4, 1/5, 2/6, 3/7, with wires spanning the entire chip length]
–Limits clock frequency
–Limits scaling
Example: 0 1 2 3 4 5 6 7 becomes 0 2 4 6 1 3 5 7

A reduced inter-cluster comm network
Only nearest-neighbor interconnections
O(C log(C)) wires, O(C) interconnections, 8 cycles

Outline
Contribution #2: Power-efficiency
"High performance is low power" - Mark Horowitz

Flexibility needed in workloads
[Figure: operation count (in GOPs, 0 to 25) for workloads (Users, Constraint length) = (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user)]
Billions of computations per second needed
Workload varies from ~1 GOPs (4 users, constraint length 7 Viterbi) to ~23 GOPs (32 users, constraint length 9 Viterbi)
Note: GOPs refer only to arithmetic computations

Flexibility affects Data Parallelism*
Workload (U,K)   Estimation f(U,N)   Detection   Decoding f(U,K,R)
(4,7)            32                  4           16
(4,9)            32                  4           64
(8,7)            32                  8           16
(8,9)            32                  8           64
(16,7)           32                  16          16
(16,9)           32                  16          64
(32,7)           32                  32          16
(32,9)           32                  32          64
U - users, K - constraint length, N - spreading gain, R - decoding rate
*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling

Adapting #clusters to Data Parallelism
[Figure: adaptive multiplexer network feeding the clusters (C)]
Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation
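
One way to picture the adaptation is a policy that keeps powered only as many clusters as the current kernel has data parallelism. The sketch below is a hypothetical policy written for illustration; the slides do not specify the selection rule, and `active_clusters` is an assumed name.

```c
/* Number of clusters to keep powered for a kernel with the given data
 * parallelism (DP). Hypothetical policy: min(DP, total clusters), with at
 * least one cluster kept on for code that has no data parallelism. */
unsigned active_clusters(unsigned data_parallelism, unsigned total_clusters)
{
    if (data_parallelism == 0)
        return 1;   /* scalar code still needs one cluster */
    return data_parallelism < total_clusters ? data_parallelism
                                             : total_clusters;
}
```

On a 32-cluster processor, a kernel with DP of 4 would keep 4 clusters on and allow the other 28 to be voltage-gated, while a kernel with DP of 64 would use all 32.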

Cluster utilization variation Cluster utilization variation on a 32-cluster processor (32, 9) = 32 users, constraint length 9 Viterbi

Frequency variation

Operation
Dynamic voltage-frequency scaling
–When the system changes significantly (users, data rates, ...)
–Coarse time scale (every few seconds)
Turn off clusters
–When parallelism changes significantly
–During memory operations
–When real-time requirements are exceeded
–Finer time scales (100’s of microseconds)

Power : Voltage Gating & Scaling Power can change from 12.38 W to 300 mW depending on workload changes

Outline
Contribution #3: Design exploration
–How many adders, multipliers, clusters, and what clock frequency
–Quickly predict real-time performance

Deciding ALUs vs. clock frequency
No independent variables
–Clusters, ALUs (adders, multipliers), frequency, voltage: (c, a, m, f)
–Trade-offs exist
How to find the right combination for lowest power?

Static design exploration
Execution time = static part (computations) + dynamic part (memory stalls, microcontroller stalls)
Static design exploration also helps in quickly predicting real-time performance

Sensitivity analysis is important
We have a capacitance model [Khailany2003]
The equations are not exact
–Need to see how variations affect solutions

Design exploration methodology
3 types of parallelism: ILP, MMX, DP
For best performance (power)
–Maximize the use of all three
Maximize ILP and MMX at the expense of DP
–Loop unrolling, packing
–Schedule on a sufficient number of adders/multipliers
If DP remains, use clusters = DP
–No other way to exploit that parallelism

Setting clusters, adders, multipliers
If there is sufficient DP, frequency decreases linearly with clusters
–Set clusters depending on DP and the execution time estimate
To find adders and multipliers
–Let the compiler schedule algorithm workloads across different numbers of adders and multipliers and report the execution time
Put all numbers in the power equation
–Compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time
Choose the solution that minimizes the power
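
The selection step described above can be sketched as a small search over candidate configurations. Everything numeric here is a placeholder: the capacitance weights are hypothetical and are not the model of [Khailany2003], the cycle counts stand in for what the compiler would report, and power is taken as capacitance times f^p with an integer exponent p, following the f^p notation used later in the slides.

```c
#include <stddef.h>

/* One candidate configuration with its compiler-reported cycle count. */
struct candidate {
    unsigned clusters, adders, multipliers;
    double cycles;   /* cycles needed to process one real-time frame */
};

/* Hypothetical relative capacitance: per-cluster cost of ALUs plus a fixed
 * overhead, times the number of clusters. Weights are placeholders. */
double rel_capacitance(const struct candidate *k)
{
    const double cap_adder = 1.0, cap_multiplier = 2.0, cap_overhead = 3.0;
    return k->clusters * (k->adders * cap_adder +
                          k->multipliers * cap_multiplier + cap_overhead);
}

/* Relative power = capacitance * f^p, where f = cycles / frame_time is the
 * lowest clock frequency that still meets the real-time deadline. */
double rel_power(const struct candidate *k, double frame_time, unsigned p)
{
    double f = k->cycles / frame_time;
    double result = rel_capacitance(k);
    for (unsigned i = 0; i < p; ++i)
        result *= f;
    return result;
}

/* Index of the candidate with the lowest relative power (first one on ties). */
size_t best_candidate(const struct candidate *k, size_t n,
                      double frame_time, unsigned p)
{
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (rel_power(&k[i], frame_time, p) < rel_power(&k[best], frame_time, p))
            best = i;
    return best;
}
```

With these placeholder weights, a 64-cluster design that needs 1500 cycles loses to a 32-cluster design that needs 2000 cycles when p = 1, but wins when p = 3, which mirrors the slides' point that the best cluster count depends on how strongly power grows with frequency.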

Design exploration
For a sufficiently large number of adders and multipliers per cluster:
Explore Algorithm 1: 32 clusters (t1)
Explore Algorithm 2: 64 clusters (t2)
Explore Algorithm 3: 64 clusters (t3)
Explore Algorithm 4: 16 clusters (t4)
[Figure: axes ILP and DP]

Clusters: frequency and power (3G workload)
32 clusters at frequency = 836.692 MHz (p = 1)
64 clusters at frequency = 543.444 MHz (p = 2)
64 clusters at frequency = 543.444 MHz (p = 3)

ALU utilization with frequency 3G workload

Power variations with f and 

Choice of adders and multipliers
( , f^p)    Optimal adders   Optimal multipliers   ALU/Cluster power   Cluster/Total power
(0.01, 1)   2                1                     30                  61
(0.01, 2)   2                1                     30                  61
(0.01, 3)   3                1                     25                  58
(0.1, 1)    2                1                     52                  69
(0.1, 2)    2                1                     52                  69
(0.1, 3)    3                1                     51                  68
(1, 1)      1                1                     86                  89
(1, 2)      2                2                     84                  87
(1, 3)      2                2                     84                  87

Exploration results
Final design conclusion:
Clusters: 64
Multipliers/cluster: 1 (utilization: 62%)
Adders/cluster: 3 (utilization: 55%)
Real-time frequency: 568.68 MHz
Exploration done with plots generated in seconds

Outline Broader impact and limitations

Broader impact
Results are not specific to base-stations
–High performance, low power system designs
Concepts can be extended to handsets
Mux network is applicable to all SIMD processors
–Power efficiency in scientific computing
Results #2 and #3 are applicable to all stream applications
–Design and power efficiency
–Multimedia, MPEG, ...

Limitations
Don’t believe the model is the reality (proof is in the pudding)
Fabrication needed to verify concepts
–Cycle accurate simulator
–Extrapolating models for power
LDPC decoding (in progress)
–Sparse matrix requires permutations over large data
–Indexed SRF may help
3G requires 1 GHz at 128 Kbps/user
–4G equalization at 1 Mbps breaks down (expected)

Conclusions
Road ends for conventional architectures [Agarwal2000]
Wide range of architectures: DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable +
–Difficult to compare and contrast
–Need new definitions that allow comparisons
Wireless workloads
–SPECwireless standard needed
Utilizing 100-1000s of ALUs per clock cycle and mapping algorithms is not easy in programmable architectures
–My thesis lays the initial foundations

Time-scales [met schedules on time – year not mentioned in the proposal]

Alternate view of the CMP DSP
[Figure: streaming memory system (L2 internal memory, Bank 1 through Bank C, prefetch buffers) connected to clusters of C64x (cluster 0 through cluster C) through an inter-cluster communication network, with an instruction decoder]

Adapting clusters using (1) memory transfers
[Figure: Stream A (elements 0 to 15) in the SRF is written to memory and read back in two steps as Stream A', re-ordered so that only some clusters receive data; unused clusters are marked X]

(2) Using conditional streams
[Figure: 4 clusters reconfiguring to 2. A condition switch (1 for cluster indices 0 and 1, 0 for indices 2 and 3) routes data A, B, C, D through a conditional buffer, so that the two active clusters receive A, B on access 0 and C, D on access 1]

Arithmetic clusters in stream processors
[Figure: one arithmetic cluster. Adders (+), multipliers (*), and dividers (/) with distributed register files (supports more ALUs) connect through a cross point to a scratchpad (indexed accesses), a comm. unit on the intercluster network, and the SRF (from/to SRF)]

Programming model
    stream a(1024);
    stream b(1024);
    stream sum(1024);
    stream c(512);
    stream d(512);
    stream diff(512);

    add(a, b, sum);
    sub(c, d, diff);

    kernel add(istream a, istream b, ostream sum) {
        int inputA, inputB, output;
        loop_stream(a) {
            a >> inputA;                // read one element from each input stream
            b >> inputB;
            output = inputA + inputB;   // operate on the values read, not on the streams
            sum << output;              // write the result to the output stream
        }
    }

    kernel sub(istream c, istream d, ostream diff) {
        int inputC, inputD, output;
        loop_stream(c) {
            c >> inputC;
            d >> inputD;
            output = inputC - inputD;
            diff << output;
        }
    }
"Your new hardware won’t run your old software" - Balch’s law

Stream processor programming
[Figure: kernels connected by streams. The received signal (input data) flows through correlator channel estimation, matched filter, interference cancellation, and Viterbi decoding kernels to the decoded bits (output data)]
Kernels (computation) and streams (communication)
Use local data in clusters, providing GOPs support
Imagine stream processor at Stanford [Rixner’01]
[Rixner’01] Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.

Parallel Viterbi Decoding
[Figure: detected bits enter the ACS unit, which feeds the traceback unit, which outputs the decoded bits]
Add-Compare-Select (ACS): trellis interconnect, computations
–Parallelism depends on constraint length (#states)
Traceback: searching
–Conventional traceback is sequential (no DP) with dynamic branching
–Difficult to implement in a parallel architecture
–Use the register exchange (RE) parallel solution
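
A single trellis stage of ACS with register exchange can be sketched in C as below. This is an illustration under stated assumptions rather than the author's kernel: it uses a 4-state trellis, a state numbering in which new state ns is reached from old states (2*ns mod S) and (2*ns mod S) + 1 with the input bit equal to the top bit of ns, caller-supplied branch metrics, and 64-bit path registers. All names are hypothetical.

```c
#include <stdint.h>

#define NUM_STATES 4u   /* constraint length 3; illustrative only */

/* One trellis stage of add-compare-select with register exchange.
 * metric[] and path[] hold the survivors entering the stage and are
 * overwritten with the survivors leaving it. branch[ns][j] is the cost of
 * the edge from predecessor j (0 = even, 1 = odd) into state ns. */
void acs_register_exchange(uint32_t metric[NUM_STATES],
                           uint64_t path[NUM_STATES],
                           const uint32_t branch[NUM_STATES][2])
{
    uint32_t new_metric[NUM_STATES];
    uint64_t new_path[NUM_STATES];

    /* Every ns is independent of the others: this loop is the data parallelism. */
    for (unsigned ns = 0; ns < NUM_STATES; ++ns) {
        unsigned p0 = (2u * ns) % NUM_STATES;        /* even predecessor */
        unsigned p1 = p0 + 1u;                       /* odd predecessor  */
        uint32_t m0 = metric[p0] + branch[ns][0];    /* add              */
        uint32_t m1 = metric[p1] + branch[ns][1];
        unsigned win = (m1 < m0) ? p1 : p0;          /* compare, select  */
        uint64_t bit = (ns >= NUM_STATES / 2) ? 1u : 0u;

        new_metric[ns] = (m1 < m0) ? m1 : m0;
        /* Register exchange: copy the winner's path and append the input bit,
         * so no sequential traceback is needed afterwards. */
        new_path[ns] = (path[win] << 1) | bit;
    }
    for (unsigned s = 0; s < NUM_STATES; ++s) {
        metric[s] = new_metric[s];
        path[s] = new_path[s];
    }
}
```

Note that new states ns and ns + S/2 read the same pair of old states (2*ns, 2*ns + 1), which is the odd-even access pattern the earlier slides describe. After the last stage, the decoded bits are simply the path register of the state with the smallest metric.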
