Automating Transformations from Floating Point to Fixed Point for Implementing Digital Signal Processing Algorithms Prof. Brian L. Evans Embedded Signal.

Presentation transcript:

Automating Transformations from Floating Point to Fixed Point for Implementing Digital Signal Processing Algorithms Prof. Brian L. Evans Embedded Signal Processing Laboratory Dept. of Electrical and Computer Engineering The University of Texas at Austin July 4, 2006 Based on work by PhD student Kyungtae Han (now at Intel Research Labs)

2 Outline
- Introduction
- Background
- Optimize fixed-point wordlengths
- Reduce power consumption in arithmetic
- Automate transformations of systems
- Conclusion

3 Implementing Digital Signal Processing Algorithms
Flow: floating-point program -> code conversion -> fixed point (uniform wordlength) -> wordlength optimization -> fixed point (optimized wordlength)
Target hardware: floating-point processor, fixed-point processor, fixed-point ASIC (application-specific integrated circuit)
Figure: price, power consumption, and hardware for the three targets, each on a high (H) to low (L) scale

4 Transformations to Fixed Point
Advantages
- Lower hardware complexity
- Lower power consumption
- Faster processing
Disadvantages
- Introduces distortion due to quantization error
- Searching for optimum wordlengths by trial and error is time-consuming
Research goals
- Automate transformations to fixed point
- Control distortion vs. complexity tradeoffs
Transformation: floating-point program -> code conversion -> wordlength optimization -> fixed point (optimized wordlength)


6 Fixed-Point Data Format
Integer wordlength (IWL)
- Number of bits assigned to the integer representation
- Includes the sign bit
Fractional wordlength (FWL)
- Number of bits assigned to the fraction
Wordlength: WL = IWL + FWL
SystemC format: sign bit S followed by the remaining bits (SXXXXX), with the binary point separating the integer wordlength from the fractional wordlength
Example: π = 3.14159… (10) in floating point, represented in fixed point with [WL=9; IWL=3; FWL=6] and with [WL=16; IWL=3; FWL=13]
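The π example on this slide can be reproduced with a minimal Python sketch. The helper names `to_fixed` and `bit_string` are hypothetical, and truncation with saturation is an assumption: the transcript does not say which rounding or overflow mode the original slide used.

```python
import math

def to_fixed(x, iwl, fwl):
    """Quantize x to signed fixed point with IWL integer bits (sign bit
    included) and FWL fractional bits. Assumes truncation toward negative
    infinity and saturation on overflow."""
    wl = iwl + fwl
    lowest, highest = -(1 << (wl - 1)), (1 << (wl - 1)) - 1
    code = max(lowest, min(highest, math.floor(x * (1 << fwl))))
    return code, code / (1 << fwl)

def bit_string(code, iwl, fwl):
    """Two's-complement bits of the integer code, with the binary point shown."""
    wl = iwl + fwl
    bits = format(code & ((1 << wl) - 1), "0{}b".format(wl))
    return bits[:iwl] + "." + bits[iwl:]

for iwl, fwl in [(3, 6), (3, 13)]:
    code, value = to_fixed(math.pi, iwl, fwl)
    print("WL={} IWL={} FWL={}: {} (2) = {} (10)".format(
        iwl + fwl, iwl, fwl, bit_string(code, iwl, fwl), value))
```

With FWL=6 the result is 011.001001 (2) = 3.140625 (10) under either truncation or rounding. With FWL=13 the last bits depend on the rounding mode chosen.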

7 Distortion vs. Complexity Tradeoffs
Different wordlengths have different application distortion and implementation complexity tradeoffs
Objectives: minimize implementation cost and minimize application distortion
Figure: feasible region and optimal tradeoff curve, application distortion d(w) plotted against implementation complexity c(w)
- c(w): implementation cost function
- C_max: constant for maximum implementation cost
- d(w): application distortion function
- D_max: constant for maximum application distortion
- Vector of wordlengths, with wordlength lower bounds and wordlength upper bounds

8 Wordlength Optimization
- Multiple objective optimization
- Single objective optimization
Proposed work fixes integer wordlengths and searches for fractional wordlengths

9 Genetic Algorithm
Evolutionary algorithm
- Inspired by Holland, 1975
- Mimics processes of plant and animal evolution
- Finds the optimum of a complex function
Cycle: function evaluation -> genes with measure -> selection -> parental genes -> mating -> child genes -> mutation -> new gene pool
[Greg Rohling, Ph.D. Defense, Georgia Tech, 2004]

10 Pareto Optimality
Pareto optimality: “best that could be achieved without disadvantaging at least one group” [Schick, 1970]
Pareto optimal set is the set of nondominated solutions
- E is dominated by C because all objectives for C are less than the corresponding objectives for E
- Solutions A, B, C, D are nondominated (not dominated by any solution)
Pareto front is the boundary (tradeoff curve) that connects the solutions in the Pareto optimal set
Figure: solutions A through I plotted against Objective 1 and Objective 2, marked as nondominated or dominated, with the Pareto front
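The dominance test above can be sketched in a few lines of Python. The coordinates for A through I are made up for illustration, since the figure's values are not in the transcript. The sketch uses the common weak form of dominance (no worse in every objective, strictly better in at least one), which is slightly broader than the slide's "all objectives less" wording.

```python
def dominates(p, q):
    """True if p is no worse than q in every objective and strictly better
    in at least one (both objectives are minimized)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def pareto_set(points):
    """Names of the solutions that no other solution dominates."""
    return sorted(name for name, p in points.items()
                  if not any(dominates(q, p) for q in points.values()))

# Hypothetical (objective 1, objective 2) values, not taken from the slide.
solutions = {
    "A": (1, 9), "B": (2, 6), "C": (4, 4), "D": (7, 2), "E": (6, 6),
    "F": (3, 10), "G": (5, 7), "H": (8, 5), "I": (9, 3),
}
print(pareto_set(solutions))
```

With these made-up points, E is dominated by C, and only A, B, C, and D survive the filter.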


12 Search for Optimum Wordlength
Exhaustive search is impractical for many variables
Gradient-based search (single objective)
- Uses gradient information to determine the next candidates
- Complexity measure (CM) [Sung & Kum, 1995]
- Distortion measure (DM) [Han et al., 2001]
- Complexity-and-distortion measure (CDM) [Han & Evans, 2004]
Guided random search
- Genetic algorithm for single objective [Leban & Tasic, 2000]
- Multiple objective genetic algorithm [Han, Olson & Evans, 2006]

13 Complexity-and-Distortion Measure
Weighted combination of measures
Single objective function
Gradient-based search
- Initialization
- Iterative greedy search based on complexity and distortion gradient information
Symbols
- c(w): complexity function
- d(w): distortion function
- D_max: constant for maximum distortion
- C_max: constant for maximum complexity
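A minimal Python sketch of an iterative greedy search over a weighted objective follows. It is an illustration under stated assumptions, not the authors' algorithm: the weight `alpha_c`, the normalization by `complexity(w_max)` and `d_max`, the linear complexity model, and the `2**-w` distortion model are all hypothetical.

```python
def greedy_search(w_min, w_max, complexity, distortion, d_max, alpha_c=0.5):
    """Start at w_min and repeatedly add one bit to the variable that gives
    the lowest weighted objective, until distortion(w) <= d_max."""
    c_ref = complexity(w_max)  # assumed normalizer for complexity

    def objective(w):
        return (alpha_c * complexity(w) / c_ref
                + (1.0 - alpha_c) * distortion(w) / d_max)

    w = list(w_min)
    while distortion(w) > d_max:
        candidates = []
        for n in range(len(w)):
            if w[n] < w_max[n]:
                trial = list(w)
                trial[n] += 1
                candidates.append(trial)
        if not candidates:
            raise ValueError("upper bounds reached before meeting d_max")
        w = min(candidates, key=objective)
    return w

# Toy models (assumptions): linear area cost and 2**-w quantization error.
def complexity(w):
    return 4 * w[0] + 1 * w[1]

def distortion(w):
    return sum(2.0 ** -bits for bits in w)

for alpha_c in (0.0, 0.5, 1.0):  # distortion only, combined, complexity only
    w = greedy_search([2, 2], [16, 16], complexity, distortion, 0.1, alpha_c)
    print(alpha_c, w, complexity(w), distortion(w))
```

In this toy setup, `alpha_c = 1.0` behaves like a pure complexity measure and `alpha_c = 0.0` like a pure distortion measure. Intermediate weights trade the two off.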

14 Case Study I: Filter Design
Infinite impulse response (IIR) filter
- Complexity measure: area model of field-programmable gate array (FPGA) [Constantinides, Cheung & Luk, 2003]
- Distortion measure: root mean square (RMS) error
- Seven fixed-point variables (indicated by slashes)
Block diagram: input x[n], output y[n], a delay element, and coefficients b0, b1, -a1

15 Case Study I: Gradient-Based Search
CDM could lead to lower complexity and fewer simulations compared to DM and CM
Table columns: search method; gradient measure; number of system simulations; complexity estimate (LUT); distortion (RMS)*
Table rows: gradient search with DM, CDM, and CM; complete search**
* Maximum distortion, measured by root mean square (RMS) error, is 0.1
** 16^7 = 268,435,456 simulations (8.5 years at 1 second per simulation)
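The complete-search footnote is simple arithmetic, checked below in Python. Reading the count as 16 candidate wordlengths for each of the 7 variables is an assumption, as is the 365-day year.

```python
choices_per_variable = 16  # assumed: 16 candidate wordlengths per variable
variables = 7              # seven fixed-point variables in the IIR filter

simulations = choices_per_variable ** variables
seconds_per_year = 365 * 24 * 3600
years = simulations / seconds_per_year  # at 1 second per simulation

print(simulations, round(years, 1))
```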

16 Case Study I: Genetic Algorithm Search
Handles multiple objectives: error and area
Figures: Pareto optimal (nondominated) set and Pareto front at the 100th generation (9,000 simulations), 250th generation (22,500 simulations), and 500th generation (45,000 simulations)
* Population for one generation: 90
LUT: lookup table

17 Case Study I: Comparison
Gradient-based search (GS) results vs. genetic algorithm (GA) results
- GS methods can get stuck in a local minimum
- GS methods reduce running time (CDM: 145 simulations)
Figures: GA results at the 50th generation (4,500 simulations) and the 500th generation (45,000 simulations)
* Required maximum RMS values for gradient-based search are D_max = {0.12, 0.1, 0.08}

18 Case Study II: Communication System
Simple binary phase shift keying (BPSK) system
- Complexity measure: area model of field-programmable gate array (FPGA) [Constantinides, Cheung & Luk, 2003]
- Distortion measure: bit error rate (BER)
- Four fixed-point variables (indicated by slashes)
Block diagram: source data (1 or -1), carrier, AWGN, integration and dump, decision, BER

19 Case Study II: Gradient-Based Search
CDM could lead to lower complexity and fewer simulations compared to DM and CM
Table columns: search method; gradient measure; number of system simulations; complexity estimate (LUT); distortion (BER)*
Table rows: gradient search with DM, CDM, and CM; complete search
* Maximum distortion, measured by bit error rate (BER), is 0.1

20 Case Study II: Genetic Algorithm Search
Handles multiple objectives
Figures: Pareto optimal set and Pareto front, error (bit error rate) and LUT, at the 50th generation (4,500 simulations), 100th generation (9,000 simulations), and 200th generation (18,000 simulations), with DM, CDM, and CM results shown for comparison
* Population for one generation: 90
LUT: lookup table
Preliminary results

21 Comparison of Proposed Methods
                        Gradient-based search   Genetic algorithm
Type of solution        One point               Family of points
Tradeoff curve found    No                      Yes
Execution time          Short                   Long
Amount of computation   Low                     High
Parallelism             Low                     High


23 Lower Power Consumption in DSP
Minimize power dissipation because of limited battery power and cooling
Multipliers are often a major source of dynamic power consumption in typical DSP applications
- Multi-precision multipliers select smaller multipliers (8, 16 or 24 bits) to reduce power consumption
- Wordlength reduction can select any word size [Han, Evans & Swartzlander, 2004]
In general, what reductions in power are possible in software when hardware has fixed wordlengths?

24 Wordlength Reduction in Multiplication
Input data wordlength reduction
- Fewer bits can be enough to represent the data, e.g. π x π ≈ 9
Truncation
Signed right shift
- Moves data toward the least significant bit (LSB)
- Sign bit is extended for an arithmetic right shift
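The two reduction methods can be illustrated on 16-bit two's-complement words with a short Python sketch. The function names and the sample value are hypothetical. Python's `>>` on integers is an arithmetic shift, so the sign bit is extended as the slide describes.

```python
L = 16  # input wordlength in bits

def truncate(x, n):
    """Zero the n least significant bits; the word stays L bits wide."""
    return x & ~((1 << n) - 1)

def signed_right_shift(x, n):
    """Arithmetic right shift by n bits; the sign bit fills the upper bits."""
    return x >> n

def to_bits(x, length=L):
    """Two's-complement bit pattern of x in a word of the given length."""
    return format(x & ((1 << length) - 1), "0{}b".format(length))

x = -4660  # arbitrary sample value
print(to_bits(x))                         # original word
print(to_bits(truncate(x, 8)))            # low 8 bits cleared
print(to_bits(signed_right_shift(x, 8)))  # upper bits filled with sign bit
```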

25 Power Reduction via Wordlength Reduction
Power consumption
- Switching power consumption
- Static power consumption
Switching power consumption: $P_{sw} = \alpha \, C_L \, V_{dd}^2 \, f_{clk}$
- α: switching activity parameter
- C_L: load capacitance
- V_dd: operating voltage
- f_clk: operating frequency
- Reduce α by wordlength reduction
What is the relationship between reduced wordlength and the switching parameter α in power consumption?

26 Analytical Method
Input                       Switching expectation
Full length                 L/2
Truncate N bits             M/2
N-bit signed right shift    L/2
Figure: an L-bit word with sign bit S, split into M bits and N bits, shown with no reduction and with reduction; after a signed right shift the upper bits repeat the sign bit S
Wordlength (L) = 16

27 Dynamic Power Consumption for Wallace Multiplier (1 MHz)
- 16-bit x 16-bit multiplier (simulated on Xilinx XC3S200-5FT256 FPGA)
- Curves: truncation-first (truncate 1st argument) and truncation-second (truncate 2nd argument) (recode, nonrecode)
- Annotation: reduction (56%)
- Wallace multiplier is used in the TI 320C64 DSP

28 Dynamic Power Consumption for Radix-4 Modified Booth Multiplier (1 MHz)
- 16-bit x 16-bit multiplier (simulated on Xilinx XC3S200-5FT256 FPGA)
- Curves: truncate 1st argument, truncate 2nd argument (recode, nonrecode)
- Annotations: reduction (31%), sensitive (13%)
- Swapping could have benefit
- Radix-4 modified Booth multiplier is used in the TI 320C62 DSP

29 Comparison of Proposed Methods
- Truncation to 8 bits reduces estimated power consumption by 56% in Wallace and 31% in Booth 16-bit multipliers
- Signed right shift gives no estimated power reduction in the Wallace multiplier (for any shift) and a 25% reduction in the Booth multiplier (for an 8-bit shift)
- Operand swapping reduces power consumption for the Booth multiplier but has negligible savings for the Wallace multiplier
Power consumption in tree-based multiplier
- Highly dependent on input data
- Simulation matches analysis


31 Automating Transformations from Floating Point to Fixed Point
Existing fixed-point tools
- Support fixed-point simulation
- Convert floating-point code to raw fixed-point code
- Optimum wordlength is found manually by trial and error
Automating transformations
- Fully automate conversion and wordlength optimization
Flow: floating-point program -> code conversion -> wordlength optimization -> wordlength-optimized fixed-point program
Fixed-point tools: SNU gFix, Autoscaler; CoWare SPW HDS; Synopsys CoCentric; MATLAB Fixed-point toolbox; MATLAB Fixed-point blockset; AccelChip DSP synthesis; Catalytic RMS, MCS

32 Automatic Transformation Flow
Code generation
- Parse floating-point program
- Generate raw fixed-point program and auxiliary programs
Range estimation
- Estimate range to avoid overflow (analytical/simulation)
- Determine integer wordlength (IWL)
Wordlength optimization
- Optimize wordlength according to given input and error specification (analytical/simulation)
- Determine fractional wordlength (FWL)
Flow: code generation -> range estimation -> wordlength optimization
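The simulation-based range estimation step can be sketched in Python as follows. The function name `integer_wordlength` is hypothetical, and the rule is a conservative assumption rather than the released tool's method: it picks an IWL (sign bit included) such that the observed peak magnitude is strictly below 2^(IWL-1).

```python
import math

def integer_wordlength(samples):
    """Conservative signed IWL (sign bit included) for the observed samples:
    the peak magnitude must be strictly less than 2**(IWL - 1)."""
    peak = max(abs(s) for s in samples)
    # frexp returns (m, e) with peak == m * 2**e and 0.5 <= m < 1,
    # so peak < 2**e and IWL = e + 1 is sufficient.
    _, exponent = math.frexp(peak)
    return max(1, exponent + 1)

signal = [math.pi, -1.0, 0.25]  # stand-in for values logged during simulation
print(integer_wordlength(signal))
```

For a signal whose peak is π this gives IWL = 3, which agrees with the IWL used in the earlier π example. The estimate depends on the input data, so a different test vector can give a different IWL.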

33 Automating Transformation Environment for Wordlength Optimization
- Given a floating-point program and options, auxiliary programs are automatically generated
- Given input data, the optimum wordlength is searched
Block diagram: top program; search engine (gradient-based or genetic algorithm); evaluation program (objectives); fixed-point program; floating-point program; error estimation; complexity estimation; range estimation; input data; optimum wordlength

34 Demo of Released Software

35 Conclusion
Search for optimum wordlength
- Gradient-based search reduces execution time, but solutions can be trapped in a local optimum
- Genetic algorithm can find the distortion vs. complexity tradeoff curve, but it requires longer execution time
Reduce power consumption by wordlength reduction of multiplicands
Automate transformations from floating-point programs to fixed-point programs
Freely distributable software release is available

36 Future Work
Advanced wordlength search algorithms
- Hybrid wordlength optimization
- Prune redundant wordlength variables (e.g. delay, adder)
- Adaptive step size for gradient-based search methods
Further analysis on search algorithms
- Analysis of genetic algorithms with different settings
- Comparison with simulated annealing
Low power consumption
- System level including memory [Powell and Chau, 1991]
- Wordlength reduction for floating-point multipliers

37 Future Work (continued)
Electronic design automation software
- Enhanced code generator (e.g. rounding preferences)
- Hybrid analytical/simulation range estimation
Optimum DSP algorithms
- Rearranging subsystems in the block diagram
- Rearranging mathematical expressions in the algorithm
Developing more sophisticated hardware area models
- Avoids having to route each design through synthesis tools
- Transcendental functions

38 End

39 Backup Slides

40 Publications-I
Conference Papers
1. K. Han, A. G. Olson, and B. L. Evans, "Automatic floating-point to fixed-point transformations," Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, Nov. 2006, Pacific Grove, CA USA, invited paper.
2. K. Han, B. L. Evans, and E. E. Swartzlander, Jr., "Low-Power Multipliers with Data Wordlength Reduction," Proc. IEEE Asilomar Conf. on Signals, Systems, and Computers, Oct. 30-Nov. 2, 2005, Pacific Grove, CA USA.
3. K. Han, B. L. Evans, and E. E. Swartzlander, Jr., "Data Wordlength Reduction for Low-Power Signal Processing Software," Proc. IEEE Work. on Signal Processing Systems, Oct. 2004, Austin, TX USA.
4. K. Han and B. L. Evans, "Wordlength Optimization with Complexity-And-Distortion Measure and Its Applications to Broadband Wireless Demodulator Design," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., May 17-21, 2004, vol. 5, Montreal, Canada.
5. K. Han, I. Eo, K. Kim, and H. Cho, "Numerical Word-Length Optimization for CDMA Demodulator," Proc. IEEE Int. Sym. on Circuits and Systems, May 2001, vol. 4, Sydney, Australia.
6. K. Han, I. Eo, K. Kim, and H. Cho, "Bit Constraint Parameter Decision Method for CDMA Digital Demodulator," Proc. CDMA Int. Conf. & Exhibition, Nov. 2000, vol. 2, Seoul, Korea.
7. S. Nahm, K. Han, and W. Sung, "A CORDIC-based Digital Quadrature Mixer: Comparison with ROM-based Architecture," Proc. IEEE Int. Sym. on Circuits and Systems, Jun. 1998, vol. 4, Monterey, CA USA.

41 Publications-II
Journal Articles
- K. Han and B. L. Evans, "Optimum Wordlength Search Using A Complexity-And-Distortion Measure," EURASIP Journal on Applied Signal Processing, special issue on Design Methods for DSP Systems, vol. 2006, no. 5.
Other Publications
1. K. Han, E. Soo, H. Jung, and K. Kim, Apparatus and Method for Short-Delay Multipath Searcher in Spread Spectrum Systems, U.S. Patent pending, Nov.
2. K. Han, I. Lim, E. Soo, H. Seo, K. Kim, H. Jung, and H. Cho, Apparatus and Method for Separating Carrier of Multicarrier Wireless Communication Receiver System, U.S. Patent pending, Sep.
3. K. Han, "Carrier Synchronization Scheme Using Input Signal Interpolation for Digital Receivers," Master's Thesis, Seoul National University, Seoul, Korea, Feb.

42 Research on Transformation

43 Simulation Flow
- Set up desired specification
- Gradient-based search algorithm: search wordlength set
- Genetic search algorithm: search wordlength sets, generate Pareto front, pick one of the sets
- Generate optimized fixed-point program

44 Algorithm Design and Implementation
Algorithm design: floating-point programs -> code conversion -> uniform wordlength fixed-point programs -> wordlength optimization -> optimized fixed-point programs
Algorithm implementation: floating-point processor, fixed-point processor, fixed-point IC
Scales shown from high to low: design time, hardware complexity, power consumption

45 Wordlength Optimization Constraints
Figures: two plots of application-specific distortion d(w) and implementation complexity c(w), one with a distortion constraint D_max and one with a complexity constraint C_max

46 Gradient-Based Search
Gradient information can be used for the update direction
Gradient information is measured in design parameters such as implementation complexity, precision distortion, or power consumption
- Complexity measurement (CM) [Sung and Kum, 1995]
- Distortion measurement (DM) [Han et al., 2001]
- Complexity-and-distortion measurement (CDM) [Han and Evans, 2004] (proposed)

47 Gradient Information
Figure: objective values over wordlengths w1 and w2, with points a and b, the gradient, and the search direction
- N: number of variables
- h: iteration index
- n: variable index
- w: wordlength vector
- f(w): objective function

48 Gradient-Based Search Direction
- Wordlength update (s: step size)
- Direction
- Finite difference
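One plausible reading of these three items is written out below, using the symbols from the Gradient Information slide (N, h, n, w, f(w)) and the step size s. The unit vector e_n, the per-variable difference g_n, and the choice of a forward difference with a one-bit increment are assumptions. The slide's own equations are not in the transcript.

```latex
\begin{align}
\mathbf{w}^{(h+1)} &= \mathbf{w}^{(h)} + s\,\mathbf{d}^{(h)}
  && \text{wordlength update} \\
\mathbf{d}^{(h)} &= \mathbf{e}_{n^\ast}, \qquad
  n^\ast = \arg\max_{1 \le n \le N} g_n^{(h)}
  && \text{direction} \\
g_n^{(h)} &= f\bigl(\mathbf{w}^{(h)}\bigr) - f\bigl(\mathbf{w}^{(h)} + \mathbf{e}_n\bigr)
  && \text{finite difference}
\end{align}
```

Under this reading, each iteration adds s bits to the single variable whose one-bit increase lowers the objective f the most.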

49 Complexity and Distortion Function
Complexity function, c(w)
- Number of multiplications is counted
- Hardware complexity is estimated by assuming that complexity increases linearly with wordlength
- A given hardware model results in an accurate complexity
Distortion function, d(w)
- Difficult to derive a closed-form mathematical expression
- Estimated by computer simulation measuring output SNR or bit error rate in digital communication systems

50 Complexity Measure [Sung and Kum, 1995]
Uses complexity sensitivity information as the direction to search for the optimum wordlength
- Advantage: minimizes complexity
- Disadvantage: demands a large number of iterations
Equations shown: update direction, objective function, optimization problem

51 Distortion Measure [Han et al., 2001]
Applies the application performance information to search for the optimum wordlengths
- Advantage: fewer iterations
- Disadvantage: not guaranteed to yield the optimum wordlength for complexity
Equations shown: update direction, objective function, optimization problem

52 Feasible Solution Search [Sung and Kum, 1995]
Exhaustive search of all possible wordlengths
Advantages
- Does not miss optimum points
- Simple algorithm
Disadvantage
- Many trials (experiments)
Equations shown: distance, expected number of iterations
Example (direction of full search): minimum wordlengths {2,2}, optimum wordlengths {5,5}, d = 6, trials = 24

53 Sequential Search [K. Han et al., 2001]
Greedy search based on sensitivity information (gradient)
Example
- Minimum wordlengths {2,2}
- Direction of sequential search
- Optimum wordlengths {5,5}
- 12 iterations
Advantage: fewer trials
Disadvantage: could miss the global optimum point

54 Case Study: Receiver Design
Block diagram: data, encoder, transmitter (multicarrier modulator), wireless channel, receiver (multicarrier demodulator, channel equalizer, channel estimator), bit error rate tester
Wordlength variables
- w0: input wordlength of the multicarrier demodulator, which performs a fast Fourier transform (FFT)
- w1: input wordlength of the equalizer
- w2: input wordlength of the channel estimator
- w3: output wordlength of the channel estimator

55 Simulation Results
- CDM leads to lower complexity compared to DM
- CDM reduces the number of trials compared to CM, feasible solution search [Sung and Kum, 1995], and exhaustive search
- Fast searching
Table columns: search method; gradient measure; α_c; number of trials (simulations); wordlengths for variables; complexity estimate; distortion (BER)*
Table rows: gradient search with DM, CDM, and CM; feasible solution search; exhaustive search
Wordlengths for variables listed: {10,9,4,10}, {7,10,4,6}, {7,7,4,6}
* Required BER ≤ 1.5 x

56 Simulation Environments
Assumptions
- Internal wordlengths of blocks have been decided
- Complexity increases linearly as wordlength increases
Required application performance
- Bit error rate of 1.5 x (without error correcting codes)
Simulation tool
- LabVIEW 7.0
Complexity vector (weight per input)
- FFT: 1024
- Equalizer (right): 1
- Estimator: 128
- Equalizer (upper): 2
Complexity: C(w) = c^T · w
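The linear complexity model C(w) = c^T · w can be evaluated directly, as in the Python sketch below. The order of the weights relative to (w0, w1, w2, w3) is an assumption (FFT, equalizer right input, estimator, equalizer upper input). The three wordlength vectors are the ones that appear on the Simulation Results slide.

```python
# Assumed mapping of weights to wordlength variables (w0, w1, w2, w3).
weights = [1024, 1, 128, 2]

def complexity(w):
    """Linear complexity estimate C(w) = c^T . w."""
    return sum(c * bits for c, bits in zip(weights, w))

for w in ([10, 9, 4, 10], [7, 10, 4, 6], [7, 7, 4, 6]):
    print(w, complexity(w))
```

Under this mapping the FFT input wordlength dominates the estimate, so a 3-bit reduction in w0 outweighs any change in the other three variables.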

57 FFT Cost
- N-tap FFT cost
- 256-tap FFT cost

58 Minimum Wordlengths
Change one wordlength variable while keeping the other variables at high precision
- {1,16,16,16}, {2,16,16,16}, ...
- {16,1,16,16}, {16,2,16,16}, ...
- …
- …, {16,16,16,15}, {16,16,16,16}
Minimum wordlength vector is {5,4,4,4}
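The one-variable-at-a-time sweep can be sketched in Python as below. The `meets_spec` callback stands in for a full system simulation. The toy distortion model and its threshold are hypothetical and were chosen only so that the sweep lands on the same vector as the slide.

```python
def minimum_wordlengths(num_vars, meets_spec, high=16):
    """For each variable, find the smallest wordlength that still meets the
    specification while all other variables stay at high precision."""
    w_min = []
    for n in range(num_vars):
        for bits in range(1, high + 1):
            w = [high] * num_vars
            w[n] = bits
            if meets_spec(w):
                w_min.append(bits)
                break
        else:
            raise ValueError("variable {} never meets the spec".format(n))
    return w_min

# Toy stand-in for a simulation (assumption): weighted 2**-w error terms.
sensitivity = [2.0, 1.0, 1.0, 1.0]

def meets_spec(w, d_max=0.07):
    error = sum(s * 2.0 ** -bits for s, bits in zip(sensitivity, w))
    return error <= d_max

print(minimum_wordlengths(4, meets_spec))
```

The sweep needs at most 16 evaluations per variable, so its cost grows linearly with the number of variables rather than exponentially.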

59 Number of Trials
- Start at the {5,4,4,4} wordlength vector
- Next wordlength vectors for the complexity measure (α = 1.0): {5,4,4,4}, {5,5,4,4}, …
- Increase wordlengths one by one until the required application performance is satisfied

60 Power Consumption
Power consumption in CMOS circuits
- Significant power in CMOS circuits is dissipated when they are switching
Power reduction in hardware [Chandrakasan and Brodersen, 1995]
- Scaling down, minimizing area
- Adjusting voltage and frequency during operation
Power reduction in software [Tiwari, Malik and Wolfe, 1994] [Lee et al., 1997]
- Instruction ordering and packing
- Energy reductions varying from 26% to 73%

61 Wordlength for Low-Power Consumption
Power model of wordlength [Choi and Burleson, 1994]
- Wordlength is considered as capacitance
- Power consumption is proportional to wordlength
- Switching activity is not considered
Data wordlength reduction technique [Han, Evans, and Swartzlander, 2004] (proposed)
- Count node transitions for switching activity
- Reduce input data wordlength to decrease power consumption

62 Dynamic and Static Power
Figure: trends in dynamic and static power dissipation, showing the increasing contribution of static power [S. Thompson, P. Packan, and M. Bohr, "MOS Scaling: Transistor Challenges for the 21st Century," Intel Technology Journal, Q3 1998]

63 Power Dissipation of Multiplier Unit
Multiply unit is usually a major source of power consumption in typical DSP applications
- Multiply unit is required for digital communication and digital signal processing algorithms
- Digital filters, equalizers, FFT/IFFT, digital down/upconverter, etc.
Figure: TMS320C5x power dissipation characteristics

64 Wallace vs. Booth Multipliers
- Wallace: tree dot diagram of a 4-bit Wallace multiplier; symmetric
- Booth: radix-4 multiplier based on Booth’s recoding (X · a = P); asymmetric (one operand recoded)

65 Radix-4 Modified Booth Multiplier
- One multiplicand is recoded
- a and X are the multiplicands
- P is the product of the multiplication
- Three bits in X are recoded to z

66 Switching Activity in Multipliers
Logic delay and propagation cause glitches
Proposed analytical method
- Hard to estimate glitches in closed form
- Analyzes switching activity with respect to input data wordlength
- Does not consider multiplier architecture
Simulation method
- Counts all switching activities (transition counts in logic)
- Power estimation (Xilinx XPower)
- Considers multiplier architecture

67 Analytical Method
- Stream of data for one multiplicand
- Compare two adjacent numbers in the stream after reduction
- Expectation of bit switching, x, with probability P_x
Cases
- L-bit input data
- Truncate input data to M bits (remove N bits)
- N-bit signed right shift in L-bit input (Y is sign bit)

68 Analytical Method
- X has a binomial distribution
- Always L/2 (independent of M and N)
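The switching expectations in the analytical method can be checked by exhaustive enumeration for a small wordlength. The Python sketch below assumes uniformly distributed, independent adjacent inputs. It uses L = 8 with N = 3 (so M = 5) only to keep the enumeration short. The helper names are hypothetical.

```python
from itertools import product

L, N = 8, 3
M = L - N

def bit_flips(a, b):
    """Number of bit positions that differ between two L-bit patterns."""
    return bin(a ^ b).count("1")

def expected_switching(reduce):
    """Average bit flips between two adjacent words, over all input pairs."""
    mask = (1 << L) - 1
    inputs = range(-(1 << (L - 1)), 1 << (L - 1))
    words = [reduce(x) & mask for x in inputs]
    total = sum(bit_flips(a, b) for a, b in product(words, repeat=2))
    return total / len(words) ** 2

full_length = expected_switching(lambda x: x)
truncated = expected_switching(lambda x: x & ~((1 << N) - 1))
shifted = expected_switching(lambda x: x >> N)  # arithmetic shift

print(full_length, truncated, shifted)
```

Full-length and shifted inputs both average L/2 flips, while truncation averages M/2. After the shift, the N + 1 copies of the sign bit toggle together, which leaves the expectation unchanged.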

69 Power Reduction in TI DSP
TI TMS320VC5416 DSP Starter Kit
- Radix-4 modified Booth multiplier
- Measure average current for wordlength reduction of multiplicands
Assembly program (data_a and data_b hold random data with wordlength w)
    loop: STM data_a, AR2;
          STM data_b, AR3;
          MPY *AR2+, *AR3+, a
          ...
          MPY *AR2+, *AR3+, a
          B loop

70 Code Generation for Fixed-Point Program
Adder function in MATLAB
(a) Floating-point program for adder
    function [c] = adder(a, b)
    c = 0;
    c = a + b;
(b) Raw fixed-point program (S, WL, FWL determined by designers with trial and error)
    function [c] = adder_fx(a, b)
    c = 0;
    a = fi(a, 1,32,16);
    b = fi(b, 1,32,16);
    c = fi(c, 1,32,16);
    c(:) = a + b;
(c) Converted fixed-point program for automating optimization
    function [c] = adder_fx(a, b, numtype)
    c = 0;
    a = fi(a, numtype.a);
    b = fi(b, numtype.b);
    c = fi(c, numtype.c);
    c(:) = a + b;
fi(a, S, WL, FWL) is a constructor function for a fixed-point object in the fixed-point toolbox [S: signed, WL: wordlength, FWL: fraction length]

71 Code Generation

72 Running Transformation
- Just call the top function with input data
- Range and optimum wordlengths depend on input statistics
    > in = rand(1,1000)
    > mac_top(in)

73 Advantages/disadvantages of wordlength search algorithms