Published by Evelyn Nash. Modified over 9 years ago.
1
Techniques for Low Power Turbo Coding in Software Radio Joe Antoon Adam Barnett
2
Software Defined Radio Single transmitter for many protocols Protocols completely specified in memory Implementation: – Microprocessors – Field programmable logic
3
Why Use Software Radio? Wireless protocols are constantly reinvented – 5 Wi-Fi protocols – 7 Bluetooth protocols – Proprietary mice and keyboard protocols – Mobile phone protocol alphabet soup Custom DSP logic for each protocol is costly
4
So Why Not Use Software Radio? Requires high performance processors Consumes more power [Figure: fork analogy comparing an inefficient general fork, an efficient application-specific fork, and an inefficient field-programmable fork]
5
Turbo Coding Channel coding technique Throughput approaches the theoretical (Shannon) limit Great for bandwidth limited applications – CDMA2000 – WiMAX – NASA's MESSENGER probe
6
Turbo Coding Considerations Presents a design trade-off Turbo coding is computationally expensive But it reduces cost in other areas – Bandwidth – Transmission power
7
Reducing Power in Turbo Decoders FPGA turbo decoders – Use dynamic reconfiguration General processor turbo decoders – Use a logarithmic number system
8
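Generic Turbo Encoder [Figure: the data stream feeds one component encoder directly and a second component encoder through an interleaver; outputs are the systematic stream s and the parity streams p1 and p2]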
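The encoder structure on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' design: the generator taps (feedback 1+D+D^2, forward 1+D^2) and the seeded pseudo-random interleaver are assumptions chosen only to make the example concrete.

```python
import random

def rsc_parity(bits):
    """Parity stream of a 2-register recursive systematic component encoder.

    Assumed generators (not given on the slides):
    feedback 1 + D + D^2, feedforward 1 + D^2.
    """
    m1 = m2 = 0
    parity = []
    for u in bits:
        a = u ^ m1 ^ m2        # feedback bit entering the register chain
        parity.append(a ^ m2)  # feedforward taps 1 + D^2
        m1, m2 = a, m1         # shift the two 1-bit registers
    return parity

def turbo_encode(bits, perm):
    """Return (s, p1, p2): systematic bits plus the two parity streams."""
    interleaved = [bits[i] for i in perm]
    return list(bits), rsc_parity(bits), rsc_parity(interleaved)

data = [1, 0, 1, 1, 0, 0, 1, 0]
perm = list(range(len(data)))
random.Random(0).shuffle(perm)  # hypothetical interleaver pattern
s, p1, p2 = turbo_encode(data, perm)
print("s :", s)
print("p1:", p1)
print("p2:", p2)
```

The second parity stream sees the same data in a different order, which is what lets the two component decoders exchange independent evidence about each bit.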
9
Generic Turbo Decoder [Figure: component decoders connected through an interleaver; inputs are the received streams r, q1, and q2]
10
Decoder Design Options Multiple algorithms used to decode Maximum A-Posteriori (MAP) – Most accurate estimate possible – Complex computations required Soft-Output Viterbi Algorithm – Less accurate – Simpler calculations Decoder
11
FPGA Design Options Goal: make an adaptive decoder [Figure: the decoder takes received data and parity and outputs the original sequence; a tunable parameter ranges from low power and accuracy to high power and accuracy]
12
Component Encoder M blocks are 1-bit registers Memory provides encoder state [Figure: two M registers feeding a generator function]
13
Encoder State [Figure: trellis of encoder states 00, 01, 10, 11 plotted against time, with transitions for input bits 0 and 1 determined by the generator function (GF)]
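To make the state diagram concrete, the sketch below enumerates every transition of a two-register encoder. The tap positions are an assumption (the same illustrative feedback 1+D+D^2 and forward 1+D^2 taps, not taken from the slides); a different generator function would give a different table with the same shape.

```python
def step(state, u):
    """One encoder step: returns (next_state, parity_bit).

    state is (m1, m2), the contents of the two 1-bit registers.
    Taps are assumed for illustration only.
    """
    m1, m2 = state
    a = u ^ m1 ^ m2          # feedback into the first register
    return (a, m1), a ^ m2   # shifted state and parity output

STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]
trellis = {(s, u): step(s, u) for s in STATES for u in (0, 1)}

for (state, u), (nxt, parity) in sorted(trellis.items()):
    print(f"state {state[0]}{state[1]} --input {u}, parity {parity}--> "
          f"state {nxt[0]}{nxt[1]}")
```

Each state has exactly two outgoing and two incoming transitions, which is why the trellis repeats the same pattern at every time step.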
14
Viterbi’s Algorithm Determine most likely output Simulate encoder state given received values [Figure: encoder states s0, s1, s2, … over time, with received pairs (r0, p0), (r1, p1), (r2, p2) and decoded bits d0, d1, d2]
15
Viterbi’s Algorithm Write: Compute branch metric (likelihood) Traceback: Compute path metric, output data Update: Compute distance between paths Rank paths by path metric and choose best For N memory: – Must calculate 2^(N-1) paths for each state
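The branch-metric and path-metric steps can be illustrated with a small hard-decision Viterbi decoder. This is a simplified sketch, not SOVA and not the hardware design in the slides: it assumes the illustrative two-register encoder used here (feedback 1+D+D^2, forward 1+D^2), uses Hamming distance as the branch metric, and keeps whole survivor paths in lists instead of a traceback memory.

```python
def step(state, u):
    """Assumed 2-register encoder step: returns (next_state, parity)."""
    m1, m2 = state
    a = u ^ m1 ^ m2
    return (a, m1), a ^ m2

def encode(bits):
    """Produce (systematic, parity) pairs starting from state 00."""
    state, out = (0, 0), []
    for u in bits:
        state, p = step(state, u)
        out.append((u, p))
    return out

def viterbi(received):
    """Return the data bits of the lowest-distance path through the trellis."""
    inf = float("inf")
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    metric = {s: (0 if s == (0, 0) else inf) for s in states}
    paths = {s: [] for s in states}
    for r, p_rx in received:
        new_metric = {s: inf for s in states}
        new_paths = {s: [] for s in states}
        for s in states:
            if metric[s] == inf:
                continue  # state not reachable yet
            for u in (0, 1):
                nxt, p = step(s, u)
                branch = (u != r) + (p != p_rx)  # branch metric (Hamming)
                candidate = metric[s] + branch   # path metric
                if candidate < new_metric[nxt]:  # compare and select
                    new_metric[nxt] = candidate
                    new_paths[nxt] = paths[s] + [u]
        metric, paths = new_metric, new_paths
    best = min(states, key=lambda s: metric[s])
    return paths[best]

data = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = encode(data)
noisy[2] = (noisy[2][0], noisy[2][1] ^ 1)  # flip one parity bit
print(viterbi(noisy) == data)
```

Here a lower metric means a more likely path; a soft-output decoder would additionally track how close the competing paths were at each step.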
16
Adaptive SOVA SOVA: Inflexible path system scales poorly Adaptive SOVA: Heuristic – Limit to M paths max – Discard if path metric below threshold T – Discard all but top M paths when too many paths
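The pruning heuristic can be sketched as a single function. The representation is an assumption made for illustration: each path is a (metric, bits) pair and a higher metric means a more likely path; the names `prune_paths`, `threshold`, and `max_paths` are hypothetical stand-ins for T and M.

```python
def prune_paths(paths, threshold, max_paths):
    """Apply the two adaptive-SOVA discard rules to a list of paths.

    paths: list of (metric, bits) tuples, higher metric = more likely
           (an assumed convention, not stated on the slides).
    """
    # Rule 1: discard paths whose metric falls below the threshold T.
    kept = [p for p in paths if p[0] >= threshold]
    # Rule 2: if too many remain, keep only the top M by metric.
    kept.sort(key=lambda p: p[0], reverse=True)
    return kept[:max_paths]

candidates = [(5, "0110"), (1, "0111"), (3, "1110"), (4, "0010")]
print(prune_paths(candidates, threshold=2, max_paths=2))
```

Lowering `max_paths` or raising `threshold` reduces the work per step at the cost of a higher chance of discarding the correct path.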
17
Implementing in Hardware Branch Metric Unit Add Compare Select Survivor memory Control q r
18
Implementing in Hardware Controller – Control memory – Select paths Branch Metric Unit – Compute likelihood – Consider all possible “next” states Add, Compare, Select – Append path metric – Discard paths Survivor Memory – Store / discard path bits
19
Implementing in Hardware Add, Compare, Select Unit Present State Path Values Next State Path Values Compute, Compare Paths Branch Values > T Path Distance Threshold
20
Dynamic Reconfiguration Bit Error Rate (BER) – Changes with signal strength – Changes with number of paths used Change hardware at runtime – Weak signal: use many paths, save accuracy – Strong signal: use few paths, save power – Sample SNR every 250k bits, reconfigure
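A reconfiguration policy of this kind might look like the sketch below. Only the 250k-bit sampling interval comes from the slide; the SNR thresholds and path counts in `SNR_TABLE` are hypothetical values picked for illustration.

```python
SAMPLE_INTERVAL_BITS = 250_000  # sampling interval stated on the slide

# Hypothetical (SNR upper bound in dB, path count) pairs: weaker signal
# gets more paths. These numbers are assumptions, not measured results.
SNR_TABLE = ((2.0, 16), (4.0, 8), (6.0, 4))
MIN_PATHS = 2

def choose_max_paths(snr_db):
    """Pick a path count for the measured SNR."""
    for upper_bound, paths in SNR_TABLE:
        if snr_db < upper_bound:
            return paths
    return MIN_PATHS  # strong signal: fewest paths, lowest power

def reconfiguration_schedule(snr_samples_db):
    """Return (bit_offset, max_paths) for each point where the
    decoder configuration would change."""
    schedule, current = [], None
    for i, snr in enumerate(snr_samples_db):
        paths = choose_max_paths(snr)
        if paths != current:
            schedule.append((i * SAMPLE_INTERVAL_BITS, paths))
            current = paths
    return schedule

print(reconfiguration_schedule([1.0, 1.5, 5.0, 7.0]))
```

Reconfiguring only when the chosen path count changes avoids paying the reconfiguration cost on every sample.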
21
Dynamic Reconfiguration
22
Experimental Results K (Number of encoder bits) proportional to average speed, power
23
Experimental Results FPGA decoding has a much higher throughput Due to parallelism
24
Experimental Results ASOVA performs worse than commercial cores However, in other metrics it is much better – Power – Memory usage – Complexity
25
Future Work Use present reconfiguration means to design – Partial reconfiguration – Dynamic voltage scaling Compare to power efficient software methods
26
Power-Efficient Implementation of a Turbo Decoder in SDR System Turbo coding systems are created by using one of three general processor types – Fixed Point (FXP) Cheapest, simplest to implement, fastest – Floating Point (FLP) More precision than fixed point – Logarithmic Numbering System (LNS) Simplifies complex operations Complicates simple add/subtract operations
27
Logarithmic Numbering System X = {s, x = log_b(|X|)} – s = sign bit, remaining bits used for number value Example – Let b = 2, – Then the decimal number 8 would be represented as log_2(8) = 3 – Numbers are stored in computer memory in 2’s complement form (3 = 01111101) (sign bit = 0)
28
Why use Logarithmic System? Greatly simplifies multiplication, division, roots, and exponents – Multiplication simplifies to addition E.g. 8 * 4 = 32, LNS => 3 + 2 = 5 (2^5 = 32) – Division simplifies to subtraction E.g. 8 / 4 = 2, LNS => 3 – 2 = 1 (2^1 = 2)
29
Why use Logarithmic System? Roots are done as right shifts – E.g. sqrt(16) = 4, LNS => 4 shifted right = 2 (2^2 = 4) Exponents are done as left shifts – E.g. 8^2 = 64, LNS => 3 shifted left = 6 (2^6 = 64)
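The four identities above can be checked with a short base-2 sketch. The function names and the (sign, exponent) tuple are illustrative assumptions; real LNS hardware stores the exponent in fixed point, where halving and doubling are one-bit shifts, while this sketch uses floating-point division and multiplication by 2 to show the same arithmetic.

```python
import math

def to_lns(value):
    """Encode a nonzero number as (sign_bit, log2 of its magnitude)."""
    if value == 0:
        raise ValueError("zero has no finite logarithm")
    return (1 if value < 0 else 0, math.log2(abs(value)))

def from_lns(x):
    """Decode (sign_bit, exponent) back to an ordinary number."""
    sign, exponent = x
    return (-1.0 if sign else 1.0) * 2.0 ** exponent

def lns_mul(x, y):
    return (x[0] ^ y[0], x[1] + y[1])   # multiply: add exponents

def lns_div(x, y):
    return (x[0] ^ y[0], x[1] - y[1])   # divide: subtract exponents

def lns_sqrt(x):
    if x[0]:
        raise ValueError("square root of a negative number")
    return (0, x[1] / 2)                # root: halve (right shift)

def lns_square(x):
    return (0, x[1] * 2)                # square: double (left shift)

eight, four, sixteen = to_lns(8), to_lns(4), to_lns(16)
print(from_lns(lns_mul(eight, four)))
print(from_lns(lns_div(eight, four)))
print(from_lns(lns_sqrt(sixteen)))
print(from_lns(lns_square(eight)))
```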
30
So why not use LNS for all processors? Unfortunately, addition and subtraction are greatly complicated in LNS. – Addition: log_b(|X| + |Y|) = x + log_b(1 + b^z) – Subtraction: log_b(|X| - |Y|) = x + log_b(1 - b^z) where x = log_b|X|, y = log_b|Y|, and z = y – x Turbo coding/decoding is computationally intense, requiring more multiplications, divisions, roots, and exponents than additions or subtractions
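The cost of LNS addition is easier to see in code. The sketch below implements both identities for base 2, working directly on the logarithms of positive magnitudes; the function names are hypothetical, and the `log2(1 ± 2^z)` term is computed directly here, whereas hardware would typically read it from a lookup table.

```python
import math

def lns_add(x, y):
    """Given x = log2|X| and y = log2|Y|, return log2(|X| + |Y|)."""
    z = y - x
    return x + math.log2(1.0 + 2.0 ** z)

def lns_sub(x, y):
    """Given x = log2|X| and y = log2|Y| with |X| > |Y|,
    return log2(|X| - |Y|)."""
    z = y - x
    if z >= 0:
        raise ValueError("requires |X| > |Y| so the result is positive")
    return x + math.log2(1.0 - 2.0 ** z)

# 8 + 4 = 12 and 8 - 4 = 4, using x = 3 and y = 2
print(2.0 ** lns_add(3.0, 2.0))
print(2.0 ** lns_sub(3.0, 2.0))
```

Each addition needs a subtraction, a nonlinear function evaluation, and another addition, compared with a single add for LNS multiplication.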
31
Turbo Decoder block diagram Each bit decision requires a subtraction, a table lookup, and an addition
32
Proposed new block diagram As the difference d between a and b grows, the correction term ln(1 + e^-d) held in the lookup table approaches zero, so the error from skipping the lookup becomes negligible. For this simulation, a threshold of d > 5 was used
33
How it works When d > 5, the new mux (on the right) ignores the SRAM input and simply adds 0 to the MAX result. The pre-decoder circuitry also disables the SRAM to conserve power.
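The bypass can be modeled in software as follows. This is a behavioral sketch only: the correction term is computed with `math.log1p` as a stand-in for the SRAM lookup table, and the function names are hypothetical. It relies on the identity ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|).

```python
import math

BYPASS_THRESHOLD = 5.0  # the d > 5 threshold from the slides

def max_star_exact(a, b):
    """Max*(a, b) = ln(e^a + e^b), written as max plus a correction."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_bypass(a, b, threshold=BYPASS_THRESHOLD):
    """Max* with the table lookup skipped when |a - b| exceeds threshold."""
    d = abs(a - b)
    if d > threshold:
        return max(a, b)  # mux adds 0; the lookup is not performed
    return max(a, b) + math.log1p(math.exp(-d))  # stands in for the table

worst_case_error = math.log1p(math.exp(-BYPASS_THRESHOLD))
print(max_star_exact(10.0, 2.0) - max_star_bypass(10.0, 2.0))
print(worst_case_error)
```

The error from the bypass is bounded by the correction term at the threshold itself, since the term only shrinks as d grows.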
34
Comparing the 3 simulations Comparisons were done between a 16-bit fixed point microcontroller, a 16-bit floating point processor, and a 20-bit LNS processor. 11 bits would be sufficient for FXP and FLP, but 16-bit processors are much more common. Similarly, 17 bits would suffice for the LNS processor, but 20-bit is a common type
35
Power Consumption
36
Latency Recall: Max*(a,b) = ln(e^a+e^b)
37
Power savings Pre-Decoder circuitry adds 11.4% power consumption compared to SRAM read. So when an SRAM read is required, we use 111.4% of the power compared to the unmodified system However, when SRAM is blocked we only use 11.4% of the power we used before.
38
Power savings The CACTI simulations for the system reported that the Max* operation accounted for 40% of all operations in the decoder The Max* operations for the modified system required 69% of the power when compared to the unmodified system. This leads to an overall power savings of 69% * 40% = 27.6%
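The power figures on these two slides can be reproduced with a small model. It rests on two assumptions: one SRAM read in the unmodified system is taken as 1.0 unit of power, and the 69% figure is read as the reduction in Max* power, which is the reading under which the 27.6% product follows. The function names are hypothetical.

```python
PREDECODER_OVERHEAD = 0.114  # pre-decoder cost relative to one SRAM read

def max_star_relative_power(bypass_fraction):
    """Average Max* power relative to the unmodified system.

    bypass_fraction: share of Max* operations with d > 5, for which the
    SRAM read is blocked. A read costs 1.0; the pre-decoder always runs.
    """
    return PREDECODER_OVERHEAD + (1.0 - bypass_fraction) * 1.0

def overall_saving(max_star_share, max_star_saving):
    """Decoder-wide saving when only the Max* share of the work improves."""
    return max_star_share * max_star_saving

print(max_star_relative_power(0.0))  # every read needed
print(max_star_relative_power(1.0))  # every read blocked
print(overall_saving(0.40, 0.69))
```

Under this model the modification only pays off when more than 11.4% of Max* operations can skip the SRAM read.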
39
Conclusion Turbo codes are computationally intense, requiring more complex operations than simple ones LNS processors simplify complex operations at the expense of making adding and subtracting more difficult
40
Conclusion Using an LNS processor with slight modifications can reduce power consumption by 27.6% Overall latency is also reduced due to the ease of complex operations in an LNS processor when compared to FXP or FLP processors.