Energy Efficient Hardware Synthesis of Polynomial Expressions. 18th International Conference on VLSI Design. Anup Hosangadi and Ryan Kastner, ECE Department, UCSB; Farzan Fallah, Advanced CAD Research, Fujitsu Labs of America
Outline Introduction Related Work Problem formulation Algorithms for optimizing polynomials Experimental results Conclusions
Introduction Embedded system applications need to compute polynomial expressions – Continuous functions can be approximated by Taylor series – Adaptive (polynomial) filters – Polynomial interpolation/extrapolation in computer graphics – Encryption
Introduction Commonly occurring computations are implemented in hardware – More flexibility than a processor architecture – NPAs (hardware accelerators) in the PICO project – Custom instructions (Tensilica) – Up to 100 times improvement over a processor implementation (Kastner et al., TODAES’02) Develop techniques for reducing power consumption
Related Work (Behavioral transforms) Power consumption depends on many factors – Reducing number of operations Hardware: (Nguyen and Chatterjee, TVLSI’00) Software: (I. Hong et al., TODAES’99) – Voltage reduction after speedup transformations Retiming, pipelining, algebraic restructuring (Chandrakasan et al., TCAD’95)
Related Work Scheduling and resource allocation – Shutting down unused resources (Monteiro et al., DAC’96) – Allocation of registers, functional units and interconnects (A. Raghunathan et al., ICCD’94) Multiple Vdd scheduling – Assigning a supply voltage to each operation in the CDFG (M. Chang and M. Pedram, TVLSI’97)
Related Work Switching power is proportional to the number of operations Multiplications are expensive in embedded systems – On average 40 times more power than an addition at 5V (V. Krishna et al., VLSI Design 1999) Careful optimization of expressions is therefore necessary to save power
Reducing operations in polynomial expressions No good tool for polynomials – Designers rely on hand-optimized libraries Conventional compiler techniques (CSE and value numbering) are not suited for polynomials Horner form: most popular representation – a_n x^n + a_{n-1} x^{n-1} + … + a_1 x + a_0 = (…((a_n x + a_{n-1})x + a_{n-2})x + … + a_1)x + a_0 – Not good for multivariate polynomials – Only a single polynomial expression at a time
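The Horner form above can be sketched in a few lines. This is a minimal illustration for a single univariate polynomial; the function names `horner` and `naive` are hypothetical and not part of the presented tool.

```python
def horner(coeffs, x):
    """Evaluate a_n x^n + ... + a_1 x + a_0 using n multiplications.

    coeffs is [a_n, a_{n-1}, ..., a_1, a_0], highest degree first.
    """
    result = coeffs[0]
    for a in coeffs[1:]:
        # One multiplication and one addition per remaining coefficient.
        result = result * x + a
    return result


def naive(coeffs, x):
    """Term-by-term evaluation, used only to cross-check the Horner result."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))


# 2x^3 - 6x^2 + 2x - 1 evaluated at x = 3
print(horner([2, -6, 2, -1], 3), naive([2, -6, 2, -1], 3))
```

The nesting is tied to one variable and one polynomial, which is why the slide lists multivariate polynomials and sets of expressions as its weak points.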
Comparison with Horner form
Quartic-spline polynomial (3-D graphics): P = zu^4 + 4avu^3 + 6bu^2v^2 + 4uv^3w + qv^4
Horner form (from Maple™): P = zu^4 + (4au^3 + (6bu^2 + (4uw + qv)v)v)v (17 multiplications)
Proposed algebraic method: d_1 = v^2; d_2 = d_1*v; P = u^3(uz + ad_2) + d_1(qd_1 + u(wd_2 + 6bu)) (11 multiplications)
Related Work (Polynomial Expressions) Expression factorization (M.A. Breuer, JACM’69) – Allows only one kind of operator at a time Using symbolic algebra (M.A. Peymandoust, De Micheli) – Mapping polynomial datapaths to libraries (DAC’01) – Low power embedded software (DATE’02) – Results depend heavily on the set of library elements, e.g. (a^2 - b^2) = (a+b)(a-b) iff (a+b) or (a-b) is a library element – Manipulates only a single expression at a time F_1 = A + B + C + D; F_2 = A + P + D; => Extract (A + D)
Motivating Example Consider the set of expressions P_1 = x^3y + x^2y^2z; P_2 = 4x + 4yz - xyz; P_3 = 4xy - x^2y: 16 multiplications and 4 additions/subtractions Using CSE: 12 multiplications and 4 additions/subtractions
Motivating Example Using Horner transform: 12 multiplications and 4 additions/subtractions Using our algebraic technique: 7 multiplications and 3 additions/subtractions
Introduction to algebraic technique for redundancy elimination Algebraic techniques in multi-level logic synthesis (MLLS) – Decomposition and factoring reduce the number of literals – Distill and Condense use rectangle covering methods Polynomial Expressions (Our Technique) – Factoring and single-term common subexpressions reduce the number of multiplications – Multiple-term common subexpressions reduce the number of additions and possibly multiplications Key Differences (Generalization to handle higher orders) – Kernelling techniques – Finding single cube intersections
Introduction to our technique (Outline) Find a subset of all possible subexpressions (kernel generation) Transformation of Polynomial Expressions – Problem formulation Extract multiple term common subexpressions and factors Extract single term common factors
Introduction to our technique Terminology – Literal: a variable or a constant, e.g. a, b, 2, 3.14 – Cube: product of literals, e.g. +3a^2b, -2a^3b^2c – SOP: sum of cubes, e.g. +3a^2b - 2a^3b^2c – Cube-free expression: no literal or cube can divide all the cubes of the expression – Kernel: a cube-free sub-expression of an expression, e.g. 3 - 2abc – Co-kernel: a cube that is used to divide an expression to get a kernel, e.g. a^2b
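The terminology can be made concrete with a small sketch. It assumes a hypothetical representation in which each cube is a pair (coefficient, {variable: exponent}); numeric coefficients are carried along unchanged and not treated as literals here, which is a simplification of the slide's definition.

```python
def common_cube(terms):
    """Largest cube dividing every term: the minimum exponent of each variable."""
    variables = set()
    for _, exps in terms:
        variables.update(exps)
    cube = {v: min(exps.get(v, 0) for _, exps in terms) for v in variables}
    return {v: e for v, e in cube.items() if e > 0}


def divide(terms, cube):
    """Divide every term by `cube`, dropping variables whose exponent becomes 0."""
    result = []
    for coeff, exps in terms:
        quotient = {v: e - cube.get(v, 0) for v, e in exps.items()}
        result.append((coeff, {v: e for v, e in quotient.items() if e > 0}))
    return result


def is_cube_free(terms):
    """True when no variable divides all the cubes (coefficients are ignored)."""
    return not common_cube(terms)


# SOP from the slide: +3a^2b - 2a^3b^2c
sop = [(3, {"a": 2, "b": 1}), (-2, {"a": 3, "b": 2, "c": 1})]
co_kernel = common_cube(sop)     # a^2 b
kernel = divide(sop, co_kernel)  # 3 - 2abc
print(co_kernel, kernel)
```

Dividing the SOP by the co-kernel a^2b leaves 3 - 2abc, which no variable divides, matching the kernel and co-kernel given on the slide.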
Introduction to our Technique
Matrix Representation of Polynomial Expressions
– F = x^3y - xy^2z is represented by

  +/-   x   y   z
   +1   3   1   0
   -1   1   2   1

– Each row represents a product term
– Each column represents a variable/constant
– Each element (i,j) represents the power of variable j in term i
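A minimal sketch of this matrix representation, together with an operation count derived from it. The three polynomials P1, P2 and P3 are an assumption: they are the expressions implied by the factored forms on the later Distill slides, with the constant 4 treated as a literal. The helper names are hypothetical.

```python
# Each term is (sign, {literal: exponent}).
F = [(+1, {"x": 3, "y": 1}), (-1, {"x": 1, "y": 2, "z": 1})]

P1 = [(+1, {"x": 3, "y": 1}), (+1, {"x": 2, "y": 2, "z": 1})]
P2 = [(+1, {"4": 1, "x": 1}), (+1, {"4": 1, "y": 1, "z": 1}),
      (-1, {"x": 1, "y": 1, "z": 1})]
P3 = [(+1, {"4": 1, "x": 1, "y": 1}), (-1, {"x": 2, "y": 1})]


def to_matrix(terms, literals):
    """One row per term: [sign, exponent of each literal in the given order]."""
    return [[sign] + [exps.get(l, 0) for l in literals] for sign, exps in terms]


def count_ops(terms):
    """Multiplications and additions for a direct term-by-term evaluation.

    A cube whose exponents sum to s needs s - 1 multiplications;
    summing k terms needs k - 1 additions/subtractions.
    """
    mults = sum(max(sum(exps.values()) - 1, 0) for _, exps in terms)
    adds = len(terms) - 1
    return mults, adds


print(to_matrix(F, ["x", "y", "z"]))
total_mults = sum(count_ops(p)[0] for p in (P1, P2, P3))
total_adds = sum(count_ops(p)[1] for p in (P1, P2, P3))
print(total_mults, total_adds)
```

Under these assumptions the count for the unoptimized set comes out at 16 multiplications and 4 additions/subtractions, the figure quoted for the motivating example.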
Generation of Kernels (example) P_1 = x^3y + x^2y^2z, {L} = {x, y, z} – Divide by x: F_t = P_1/x = x^2y + xy^2z
Generation of Kernels (example) F_t = P_1/x = x^2y + xy^2z C = biggest cube dividing all cubes of F_t – C = xy
Generation of Kernels (example)
Obtain kernel: F_1 = F_t/C = (x^2y + xy^2z)/(xy) = (x + yz)
Obtain co-kernel: D_1 = x*(xy) = x^2y
– No kernels within F_1. Go back to P_1 = x^3y + x^2y^2z
– Divide now by the next variable, y: F_t = x^3 + x^2yz
– C = x^2
– But x ∈ C and x precedes y in the literal order: stop here to avoid repeating the same kernel, F_t/C = (x + yz)
– No more kernels extracted
– Record P_1 itself as a kernel with co-kernel ‘1’
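The walk-through above can be sketched as a recursive procedure. This is an illustrative reading of the slides, not the authors' implementation: the term representation, the fixed literal order in `LITERALS`, and the choice to restart the recursion at the same literal (so that higher powers can be divided out again) are all assumptions. The expression itself, recorded with co-kernel 1 on the slide, is left to the caller.

```python
LITERALS = ["x", "y", "z", "4"]  # assumed literal order; the constant 4 is a literal


def common_cube(terms):
    """Largest cube dividing every term: minimum exponent of each literal."""
    cube = {}
    for lit in LITERALS:
        e = min(exps.get(lit, 0) for _, exps in terms)
        if e > 0:
            cube[lit] = e
    return cube


def divide(terms, cube):
    """Divide every term by `cube` (assumed to divide each of them)."""
    result = []
    for sign, exps in terms:
        q = {l: e - cube.get(l, 0) for l, e in exps.items()}
        result.append((sign, {l: e for l, e in q.items() if e > 0}))
    return result


def multiply(a, b):
    """Product of two cubes given as exponent dictionaries."""
    out = dict(a)
    for l, e in b.items():
        out[l] = out.get(l, 0) + e
    return out


def kernels(terms, start=0, co_kernel=None, found=None):
    """Return a list of (co-kernel, kernel) pairs for a sum of cubes."""
    co_kernel = co_kernel or {}
    found = [] if found is None else found
    for j in range(start, len(LITERALS)):
        lit = LITERALS[j]
        with_lit = [t for t in terms if t[1].get(lit, 0) > 0]
        if len(with_lit) < 2:
            continue  # dividing by this literal cannot leave a multi-term kernel
        ft = divide(with_lit, {lit: 1})
        c = common_cube(ft)
        if any(l in c for l in LITERALS[:j]):
            continue  # same kernel was already reached through an earlier literal
        f1 = divide(ft, c)
        d1 = multiply(multiply(co_kernel, {lit: 1}), c)
        found.append((d1, f1))
        kernels(f1, j, d1, found)
    return found


P1 = [(+1, {"x": 3, "y": 1}), (+1, {"x": 2, "y": 2, "z": 1})]
for ck, k in kernels(P1):
    print(ck, k)
```

For P1 the sketch reports the single pair co-kernel x^2y, kernel x + yz; the division by y is skipped because its common cube x^2 contains the earlier literal x, as in the slide.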
Concept of kernels and co-kernels Theorem: two expressions f and g can have a multiple-term common subexpression iff there are two kernels K_f and K_g having a multiple-term intersection Detection of multiple-term common subexpressions by intersection of sets of kernels Each co-kernel : kernel pair represents a possible factorization – e.g. x^3y + x^2y^2z = [x^2y](x + yz) The set of kernels is a subset of all possible subexpressions
All Kernels and Co Kernels Which kernels to choose?
Kernel Cube Matrix (KCM)
One row for each kernel generated (labelled by its co-kernel)
One column for each distinct kernel cube
Each non-zero element represents a term (term number in parentheses)

                 Kernel cubes
Co-kernel    x      yz     4      -yz    -x
4            1 (3)  1 (4)  0      0      0
x^2y         1 (1)  1 (2)  0      0      0
x            0      0      1 (3)  1 (5)  0
xy           0      0      1 (6)  0      1 (7)
yz           0      0      1 (4)  0      1 (5)
Finding Kernel Intersections (Distill Algorithm) Each kernel intersection or factor appears as a rectangle – Rectangle: Set of rows and columns such that all elements are ‘1’ Value of a rectangle = Weighted sum of the energy savings of the different operations Goal: Maximum valued rectangular covering of KCM Greedy heuristic: covering by prime rectangles
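A minimal sketch of building a KCM and listing its rectangles. The co-kernel/kernel pairs are hard-coded as strings for the three-polynomial example, and the search is a brute-force enumeration over column subsets, used here only for illustration; it is not the greedy prime-rectangle heuristic named on the slide.

```python
from itertools import combinations

# (co-kernel, kernel cubes) pairs for the example, written as plain strings.
KERNELS = [
    ("4", ["x", "yz"]),
    ("x^2y", ["x", "yz"]),
    ("x", ["4", "-yz"]),
    ("xy", ["4", "-x"]),
    ("yz", ["4", "-x"]),
]


def build_kcm(kernels):
    """Return (column labels, 0/1 matrix) with one row per kernel."""
    columns = []
    for _, cubes in kernels:
        for cube in cubes:
            if cube not in columns:
                columns.append(cube)
    matrix = [[1 if c in cubes else 0 for c in columns] for _, cubes in kernels]
    return columns, matrix


def rectangles(kernels, min_cols=2):
    """All (rows, columns) pairs whose entries are all 1, with at least min_cols columns.

    For each column subset the row set is maximal. Exponential in the number
    of columns, so only suitable for small examples.
    """
    columns, matrix = build_kcm(kernels)
    found = []
    for size in range(min_cols, len(columns) + 1):
        for cols in combinations(range(len(columns)), size):
            rows = [r for r, row in enumerate(matrix) if all(row[c] for c in cols)]
            if rows:
                found.append(([kernels[r][0] for r in rows],
                              [columns[c] for c in cols]))
    return found


columns, kcm = build_kcm(KERNELS)
print(columns)
for rows, cols in rectangles(KERNELS):
    print(rows, cols)
```

The rectangle with rows {4, x^2y} and columns {x, yz} corresponds to the common subexpression x + yz; the one with columns {4, -x} corresponds to 4 - x.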
Modeling value function of a rectangle Formula for weighted sum of energy savings on selection of a rectangle: R = # of rows; C = # of columns; M(R_i) = # of multiplications in row (co-kernel) i; M(C_i) = # of multiplications in column (kernel cube) i; m = ratio of average energy consumption of a multiplication to that of an addition in the target library Value = m × (multiplications saved) + (additions saved)
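One closed form consistent with these definitions can be derived by counting operations before and after extraction. It is offered as an assumption rather than the authors' exact formula: it reproduces the values 201 and 80 quoted on the Distill slides for m = 40, and it assumes every co-kernel in the rectangle is a non-trivial cube (not 1). The function name is hypothetical.

```python
def rectangle_value(row_mults, col_mults, m):
    """Weighted energy saving of a KCM rectangle.

    row_mults[i] = M(R_i), multiplications inside co-kernel i
    col_mults[j] = M(C_j), multiplications inside kernel cube j
    m            = energy of a multiplication relative to an addition

    Before extraction each of the R*C terms costs M(R_i) + M(C_j) + 1
    multiplications, and each row needs C - 1 additions.
    After extraction the kernel is computed once (sum of M(C_j)
    multiplications, C - 1 additions) and each row costs M(R_i) + 1.
    """
    R, C = len(row_mults), len(col_mults)
    mults_saved = (C - 1) * (R + sum(row_mults)) + (R - 1) * sum(col_mults)
    adds_saved = (R - 1) * (C - 1)
    return m * mults_saved + adds_saved


# Rows {4, x^2y}, columns {x, yz}
print(rectangle_value([0, 2], [0, 1], 40))
# Single remaining row {xy}, columns {4, -x}
print(rectangle_value([1], [0, 0], 40))
```

The first call corresponds to saving 5 multiplications and 1 addition, the second to saving 2 multiplications.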
Distill Algorithm First rectangle selected from the KCM: rows {4, x^2y}, columns {x, yz}, covering terms (1) to (4) – 4x + 4yz = 4d_1 – d_1 = (x + yz) – x^3y + x^2y^2z = x^2yd_1 Saves 5 multiplications and 1 addition Value = 201 units (m = 40)
Distill Algorithm Remove covered terms; the next rectangle covers terms (6) and (7) – 4xy - x^2y = xyd_2 – d_2 = 4 - x Saves 2 multiplications Value = 80 units (m = 40)
Distill Algorithm
Distill algorithm exits after no more kernel intersections can be found
P_1 = x^2yd_1
P_2 = 4d_1 - xyz
P_3 = xyd_2
d_1 = x + yz
d_2 = 4 - x
Can further optimize by finding single cube intersections
Finding single cube intersections (Condense algorithm)
Form Cube Literal Matrix (CLM)
– One row for each cube
– One column for each literal
– E.g. 2 cubes F_1 = a^4b^3c and F_2 = a^2b^4c^2

        a   b   c
  F_1   4   3   1
  F_2   2   4   2
Finding single cube intersections (Condense algorithm) Each (single term) common subexpression appears as a rectangle – Rectangle: set of rows and columns where all elements are non-zero Value of a rectangle is the number of multiplications saved by selecting it – C = cube corresponding to the rectangle; ΣC[i] = sum of its exponents – Value = (Rows - 1) × (ΣC[i] - 1): each row saves ΣC[i] - 1 multiplications, and computing C once costs ΣC[i] - 1 Maximum valued rectangular covering will give the minimum number of multiplications Use greedy iterative covering by prime rectangles
Cube Literal Matrix (Condense Algorithm) CLM for our example after the Distill algorithm: one row per cube; columns: Term, +/-, x, y, z, 4, d_1, d_2 Selected rectangle: C = xy Save 2 multiplications by extracting xy
Condense Algorithm Extracting xy No more favorable cube intersections found (CLM columns: Term, +/-, x, y, z, 4, d_1, d_2)
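A sketch of the CLM for the example after Distill, rebuilt from the factored forms P_1 = x^2yd_1, P_2 = 4d_1 - xyz, P_3 = xyd_2, d_1 = x + yz, d_2 = 4 - x. The row labels and helper names are hypothetical, signs are omitted, and the saving is assumed to be (Rows - 1) × (ΣC[i] - 1), which gives the 2 multiplications reported for xy.

```python
CUBES = [
    ("P1: x^2*y*d1", {"x": 2, "y": 1, "d1": 1}),
    ("d1: x", {"x": 1}),
    ("d1: y*z", {"y": 1, "z": 1}),
    ("P2: 4*d1", {"4": 1, "d1": 1}),
    ("P2: x*y*z", {"x": 1, "y": 1, "z": 1}),
    ("d2: 4", {"4": 1}),
    ("d2: x", {"x": 1}),
    ("P3: x*y*d2", {"x": 1, "y": 1, "d2": 1}),
]


def clm(cubes, literals):
    """Cube Literal Matrix: one row per cube, one exponent column per literal."""
    return [[exps.get(l, 0) for l in literals] for _, exps in cubes]


def rows_containing(cubes, cube):
    """Labels of the cubes that the candidate cube divides."""
    return [name for name, exps in cubes
            if all(exps.get(l, 0) >= e for l, e in cube.items())]


def condense_value(cubes, cube):
    """Multiplications saved by extracting `cube` as a new literal.

    Each containing row saves sum(exponents) - 1 multiplications, and
    computing the cube once costs sum(exponents) - 1.
    """
    rows = len(rows_containing(cubes, cube))
    return max(rows - 1, 0) * (sum(cube.values()) - 1)


xy = {"x": 1, "y": 1}
print(rows_containing(CUBES, xy))
print(condense_value(CUBES, xy))

# The two-cube example a^4b^3c and a^2b^4c^2 share the cube a^2b^3c.
PAIR = [("F1", {"a": 4, "b": 3, "c": 1}), ("F2", {"a": 2, "b": 4, "c": 2})]
print(clm(PAIR, ["a", "b", "c"]))
print(condense_value(PAIR, {"a": 2, "b": 3, "c": 1}))
```

The cube xy occurs in three rows, one from each of P_1, P_2 and P_3, which is where the two saved multiplications come from.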
Final Implementation – Total 7 multiplications, 3 additions/subtractions – Savings of 5 multiplications, 1 addition/subtraction compared to CSE Impossible to obtain such results using conventional techniques
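The final operation count can be checked with a short sketch. The expressions are those implied by the Distill and Condense slides; the name d3 for the extracted cube xy and the exact ordering of the products are assumptions.

```python
def original(x, y, z):
    """The three polynomials evaluated term by term."""
    p1 = x**3 * y + x**2 * y**2 * z
    p2 = 4 * x + 4 * y * z - x * y * z
    p3 = 4 * x * y - x**2 * y
    return p1, p2, p3


def optimized(x, y, z):
    """Factored form after Distill (d1, d2) and Condense (d3)."""
    d1 = x + y * z        # 1 multiplication, 1 addition
    d2 = 4 - x            # 1 subtraction
    d3 = x * y            # 1 multiplication
    p1 = x * d3 * d1      # 2 multiplications (x^2 y d1)
    p2 = 4 * d1 - d3 * z  # 2 multiplications, 1 subtraction
    p3 = d3 * d2          # 1 multiplication
    return p1, p2, p3


# Exhaustive check on a small integer grid.
agree = all(
    original(x, y, z) == optimized(x, y, z)
    for x in range(-3, 4) for y in range(-3, 4) for z in range(-3, 4)
)
print(agree)
```

Counting the comments gives 7 multiplications and 3 additions/subtractions, matching the totals on the slide.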
Experimental setup Polynomials used in computer graphics and signal processing 1.0 µ technology library, characterized for power consumption Synthesized using Synopsys Design Compiler™ – Min hardware constraints (1 adder + 1 multiplier) – Med hardware constraints (max 4 multipliers)
Experimental setup Estimated power using Synopsys Power Compiler™ for random inputs, using an RTL simulator (VCS™) Compared energy consumption with CSE and Horner form Compared energy after voltage scaling
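Results (Comparing operations) Table: number of multiplications (M) and additions (A) for Original, CSE, Horner, and Our Technique; rows: five examples (ex) and their average (Avg)
Results (Min Hardware constraints) Table: Area, Energy, Energy-Delay, and Energy (Scaled V), each with columns C and H (C: versus CSE, H: versus Horner); rows: five examples (ex) and their average (Avg)
Results (Med Hardware constraints) Table: Area, Energy, Energy-Delay, and Energy (Scaled V), each with columns C and H (C: versus CSE, H: versus Horner); rows: five examples (ex) and their average (Avg)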
Conclusions Technique to reduce number of operations in polynomial expressions Large savings in energy consumption observed over CSE and Horner methods Need to consider scheduling and resource allocation to obtain further improvements
Conclusions Thank you! Questions?
Extra slides
Finding Kernel Intersections (Distill Algorithm) Worst case scenario for Distill algorithm Number of prime rectangles exponential in number of rows/columns – Heuristic methods to find best prime rectangle – In practice polynomial expressions are not so large