Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions
IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005
Anup Hosangadi, Ryan Kastner (ECE Department, UCSB)
Farzan Fallah (Advanced CAD Research, Fujitsu Labs of America)
Outline Introduction Related Work Polynomial transformation Common Subexpression elimination Results Conclusions
Introduction Multiplications by constants are encountered in many application areas: DSP transforms in audio, video and image processing (DFT, DCT, IDCT, etc.); filtering operations in communication (FIR, IIR filters); Multiple Input Multiple Output (MIMO) systems; polynomials in computer graphics
Introduction
Multiplication is expensive in hardware
Decompose constant multiplications into shifts and additions
13*X = (1101)_2 * X = X + X<<2 + X<<3
Signed digits can reduce the number of additions/subtractions
Canonical Signed Digits (CSD) (Knuth'74)
(57)_10 = (0111001)_2 = (100-1001)_CSD
Further reduction possible by common subexpression elimination
Up to 50% reduction (R. Hartley, TCS'96)
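The shift-and-add decomposition and the CSD recoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_csd` and `shift_add_multiply` are hypothetical, digits are stored least significant first, and the recoding rule used is the standard "look at n mod 4" method, not necessarily the one used by the authors.

```python
def to_csd(n):
    """CSD digits of a non-negative integer, least significant digit first.

    Every digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            d = 2 - (n % 4)      # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            digits.append(d)
            n -= d               # n is now divisible by 4
        n //= 2
    return digits


def shift_add_multiply(x, digits):
    """Multiply x by the constant encoded in digits using only shifts and adds."""
    return sum(d * (x << i) for i, d in enumerate(digits) if d != 0)


if __name__ == "__main__":
    csd = to_csd(57)
    print(csd[::-1])                    # most significant digit first
    print(shift_add_multiply(5, csd))   # same value as 57 * 5
```

For 57 the binary form has four non-zero digits while the CSD form has three, which is one addition/subtraction fewer.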
Introduction
Common subexpressions = common digit patterns
F1 = 7*X = (0111)*X = X + X<<1 + X<<2
F2 = 13*X = (1101)*X = X + X<<2 + X<<3
Cost: 4+, 4<<
Common pattern "0101" => D1 = X + X<<2
F1 = D1 + X<<1
F2 = D1 + X<<3
Cost: 3+, 3<<
Good for single variable: FIR filters (transposed form)
Multiple variables? (DFT, DCT, etc.)
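The saving from the shared pattern "0101" can be checked directly. The sketch below is illustrative only (the function names are hypothetical); it computes F1 = 7*X and F2 = 13*X with and without the shared term D1.

```python
def direct(x):
    """F1 and F2 computed independently: 4 additions, 4 shifts."""
    f1 = x + (x << 1) + (x << 2)
    f2 = x + (x << 2) + (x << 3)
    return f1, f2


def shared(x):
    """F1 and F2 sharing D1 = X + X<<2: 3 additions, 3 shifts."""
    d1 = x + (x << 2)
    f1 = d1 + (x << 1)
    f2 = d1 + (x << 3)
    return f1, f2


if __name__ == "__main__":
    print(direct(3), shared(3))
```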
Related Work
Simple bipartite matching (Potkonjak et al., TCAD'95)
(10101) and (01101) => common pattern = "101"
(10010) and (010010) => cannot detect pattern "1001"
Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
(10010) and (010010) => common pattern "1001"
Exhaustive enumeration of all digit patterns (Pasko et al., TCAD'99)
(1011) => "0011", "1001", "1010", "0101", "1011"
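Related Work
Extending techniques for multiple variables
\[
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} =
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\times
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}
\]
[Figure: all distinct S_ij X_j and C_ik D_k; outputs Y1, Y2, Y3]
Potkonjak et al., TCAD'95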
Related Work
Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP'04)
Polynomial transformation of linear systems
Uses rectangle covering methods
Cannot find subexpressions with reversed signs, e.g. (X1 - X2<<1) ≠ (X2<<1 - X1)
Common occurrence when signed digits are used
Rectangle covering has exponential complexity
Method to overcome these limitations?
Related Work
Algebraic methods in multi-level logic synthesis (MLLS)
Reducing literal count in a set of Boolean expressions
Factoring, decomposition: established algebraic techniques
Typically used for thousands of variables and literals
Apply these methods to optimize linear systems?
D1 = X1 + X2<<2
Y1 = D1 + D1<<3 + X1<<3
Y2 = D1 + X2<<2
Linear systems and polynomial transformation
View linear systems as a set of arithmetic expressions
Expressions consisting of +, -, << operators
Develop methodology for extracting common subexpressions
Polynomial formulation: C × X = Σ_i (±X L^i), where L^i stands for a left shift by i
(14)_10 × X = (1110)_2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
            = (100-10)_CSD × X = XL^4 - XL^1
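A minimal sketch of the polynomial formulation, assuming a term is stored as a tuple (sign, variable, exponent of L) and a constant is given as a list of signed digits, most significant first. The names `digits_to_terms` and `evaluate` are hypothetical and chosen only for this illustration.

```python
def digits_to_terms(digits, var="X"):
    """Turn signed digits (MSB first, each -1, 0 or +1) into polynomial terms.

    A digit d at position i contributes the term d * var * L**i.
    """
    n = len(digits)
    return [(d, var, n - 1 - i) for i, d in enumerate(digits) if d != 0]


def evaluate(terms, x):
    """Evaluate the terms for an integer x, reading L**i as a left shift by i."""
    return sum(sign * (x << exp) for sign, _, exp in terms)


if __name__ == "__main__":
    binary = digits_to_terms([1, 1, 1, 0])        # (1110)_2
    csd = digits_to_terms([1, 0, 0, -1, 0])       # (100-10)_CSD
    print(binary)   # X*L^3 + X*L^2 + X*L^1
    print(csd)      # X*L^4 - X*L^1
    print(evaluate(binary, 5), evaluate(csd, 5))
```

Both term lists evaluate to 14*X; the CSD form needs one operation instead of two.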
Linear Systems and polynomial transformation
H.264 Integer Transform
\[
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} =
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
\]
Decomposing constant multiplications
Y0 = X0 + X1 + X2 + X3
Y1 = X0<<1 + X1 - X2 - X3<<1
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1<<1 + X2<<1 - X3
12+, 4<<
Linear Systems and polynomial transformation
H.264 Integer Transform
\[
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} =
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
\]
Polynomial transformation
Y0 = X0 + X1 + X2 + X3
Y1 = X0 L + X1 - X2 - X3 L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1 L + X2 L - X3
12+, 4<<
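The transformation of a constant matrix into polynomial expressions, and the 12+, 4<< operation count, can be reproduced with a short sketch. It assumes plain binary digits for each constant magnitude (the slides use CSD for larger constants) and the same (sign, variable, exponent) term tuples; `matrix_to_expressions` and `count_ops` are hypothetical names.

```python
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matrix_to_expressions(matrix):
    """One list of (sign, variable, shift) terms per matrix row."""
    exprs = []
    for row in matrix:
        terms = []
        for j, c in enumerate(row):
            sign, mag = (1, c) if c >= 0 else (-1, -c)
            for e in range(mag.bit_length()):
                if (mag >> e) & 1:          # one term per non-zero binary digit
                    terms.append((sign, f"X{j}", e))
        exprs.append(terms)
    return exprs


def count_ops(exprs):
    """Additions/subtractions and shifts needed to evaluate the expressions."""
    adds = sum(len(terms) - 1 for terms in exprs if terms)
    shifts = sum(1 for terms in exprs for _, _, e in terms if e > 0)
    return adds, shifts


if __name__ == "__main__":
    exprs = matrix_to_expressions(H264)
    for i, terms in enumerate(exprs):
        print(f"Y{i} =", terms)
    print(count_ops(exprs))
```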
Fx algorithm
Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD'92)
Popularly known as the Fast-Extract (Fx) algorithm
Expression f = gh + r
g = (ab + c) => double-cube divisor
g = ab => single-cube divisor
Fx algorithm for linear systems?
Two-term divisors
Obtained from every pair of terms in each expression
Divide by the minimum exponent of L
e.g. F = X1 + X2 L + X3 L^3
{+X2 L, +X3 L^3}: divide by L => (X2 + X3 L^2)
Divisors = (X1 + X2 L), (X1 + X3 L^3), (X2 + X3 L^2)
Two divisors intersect if the terms involved are distinct
(X1 - X2 L) ∩ (X1 - X2 L) ≠ φ
(X1 - X2 L) ∩ (-X1 + X2 L) ≠ φ (reversed signs allowed!)
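Divisor generation as described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: terms are (sign, variable, exponent of L) tuples, and `two_term_divisors` and `same_up_to_sign` are hypothetical names.

```python
from itertools import combinations


def two_term_divisors(expr):
    """All two-term divisors of one expression.

    Each pair of terms is divided by the smaller of its two exponents of L.
    """
    divisors = []
    for (s1, v1, e1), (s2, v2, e2) in combinations(expr, 2):
        m = min(e1, e2)
        divisors.append(((s1, v1, e1 - m), (s2, v2, e2 - m)))
    return divisors


def same_up_to_sign(d1, d2):
    """True if d1 equals d2 or -d2, so reversed signs still match."""
    negated = tuple((-s, v, e) for s, v, e in d2)
    return set(d1) == set(d2) or set(d1) == set(negated)


if __name__ == "__main__":
    # F = X1 + X2*L + X3*L^3
    F = [(1, "X1", 0), (1, "X2", 1), (1, "X3", 3)]
    for d in two_term_divisors(F):
        print(d)
```

Checking that the matching divisors come from distinct terms is left to the caller in this sketch.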
Two-term divisors
Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors
Many divisors have intersections; which one to choose?
Use greedy selection of the divisor with the greatest number of intersections
Selecting a divisor changes the expressions
Perform concurrent decomposition of expressions
Algorithm (Step 1)
Creating set of divisors {Divisors}
{Divisors} = φ;
for each expression P_i {
    {D_new} = Divisors for P_i;
    {Divisors} = {Divisors} ∪ {D_new};
    Update frequency statistics of {Divisors};
}
Algorithm (Step 2)
Common Subexpression Elimination
{Divisors} = Set of all 2-term divisors;
while (intersections present) {
    Find Best_Divisor in {Divisors};
    {T} = Set of terms involved in intersection;
    {D} = Set of divisors involving any term in {T};
    {Divisors} = {Divisors} - {D};
    Rewrite Expressions;
    {D_new} = New Divisors involving new terms;
    {Divisors} = {Divisors} ∪ {D_new};
}
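The greedy loop can be prototyped compactly. The sketch below is a simplified reading of the two steps, not the authors' implementation: it recomputes all divisors in every iteration instead of updating the divisor set incrementally, breaks ties by first occurrence, and uses hypothetical names (`canonical`, `occurrences`, `eliminate`, `evaluate`). Terms are (sign, variable, exponent of L) tuples.

```python
from itertools import combinations


def canonical(t1, t2):
    """Return (key, sign, shift) such that t1 + t2 == sign * key * L**shift."""
    shift = min(t1[2], t2[2])
    a, b = sorted((t1, t2), key=lambda t: (t[1], t[2]))
    sign = a[0]  # flip both signs if needed so the first term is positive
    key = ((1, a[1], a[2] - shift), (sign * b[0], b[1], b[2] - shift))
    return key, sign, shift


def occurrences(exprs):
    """Map each divisor to its non-overlapping uses (name, i, j, sign, shift)."""
    found = {}
    for name, terms in exprs.items():
        for i, j in combinations(range(len(terms)), 2):
            key, sign, shift = canonical(terms[i], terms[j])
            uses = found.setdefault(key, [])
            # within one expression a term may serve only one use of a divisor
            if all(n != name or not ({i, j} & {a, b}) for n, a, b, _, _ in uses):
                uses.append((name, i, j, sign, shift))
    return found


def eliminate(exprs):
    """Greedily extract the two-term divisor with the most uses until none repeats."""
    exprs = {name: list(terms) for name, terms in exprs.items()}
    k = 0
    while True:
        found = occurrences(exprs)
        key, uses = max(found.items(), key=lambda kv: len(kv[1]), default=(None, []))
        if len(uses) < 2:
            return exprs
        d = f"D{k}"
        k += 1
        for name in {u[0] for u in uses}:
            used = [u for u in uses if u[0] == name]
            drop = {idx for u in used for idx in u[1:3]}
            kept = [t for idx, t in enumerate(exprs[name]) if idx not in drop]
            exprs[name] = kept + [(s, d, sh) for _, _, _, s, sh in used]
        exprs[d] = list(key)  # definition of the new divisor variable


def evaluate(exprs, inputs, name):
    """Evaluate a variable, expanding divisor definitions recursively."""
    if name in inputs:
        return inputs[name]
    return sum(s * (evaluate(exprs, inputs, v) << e) for s, v, e in exprs[name])


H264 = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}

if __name__ == "__main__":
    for name, terms in eliminate(H264).items():
        print(name, "=", terms)
```

Each extraction with at least two uses removes at least one addition, so the loop terminates. Because ties are broken by first occurrence, the order in which divisors are picked may differ from the order shown on the slides.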
Algorithm complexity
M×M constant matrix; N digits of precision
Y0 = X0 + X0 L + ... + X_{M-1} L^3 + X_{M-1}
Y1 = ...
...
[Figure: expressions Y0, Y1, ...; dimension labels M and N]
O(MN) terms => O(M^2 N^2) divisors
Algorithm (Step 1)
Creating set of divisors {Divisors}
{Divisors} = φ;
for each expression P_i {
    {D_new} = Divisors for P_i;
    {Divisors} = {Divisors} ∪ {D_new};
    Update frequency statistics of {Divisors};
}
Complexity: O(M^2 N^2) distinct divisors; O(M^2 N^2); O(M^3 N^2)
Algorithm (Step 2)
Common Subexpression Elimination
{Divisors} = Set of all 2-term divisors;
while (intersections present) {
    Find Best_Divisor in {Divisors};
    {T} = Set of terms involved in intersection;
    {D} = Set of divisors involving any term in {T};
    {Divisors} = {Divisors} - {D};
    Rewrite Expressions;
    {D_new} = New Divisors involving new terms;
    {Divisors} = {Divisors} ∪ {D_new};
}
Complexity: O(M^2 N^2)
Algorithm
H.264 example >> Select D0 = (X0 + X3)
Y0 = X0 + X1 + X2 + X3
Y1 = X0 L + X1 - X2 - X3 L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1 L + X2 L - X3
Algorithm
H.264 example >> Select D1 = (X1 - X2)
Y0 = D0 + X1 + X2
Y1 = X0 L + X1 - X2 - X3 L
Y2 = D0 - X1 - X2
Y3 = X0 - X1 L + X2 L - X3
Algorithm
H.264 example >> Select D2 = (X1 + X2)
Y0 = D0 + X1 + X2
Y1 = X0 L + D1 - X3 L
Y2 = D0 - X1 - X2
Y3 = X0 - D1 L - X3
Algorithm
H.264 example >> Select D3 = (X0 - X3)
Y0 = D0 + D2
Y1 = X0 L + D1 - X3 L
Y2 = D0 - D2
Y3 = X0 - D1 L - X3
Final Implementation
Extracting 4 divisors
D0 = X0 + X3        Y0 = D0 + D2
D1 = X1 - X2        Y1 = D1 + D3 L
D2 = X1 + X2        Y2 = D0 - D2
D3 = X0 - X3        Y3 = D3 - D1 L
8+, 2<<
Original: 12+, 4<<
Rectangle Covering: 10+, 3<<
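The final implementation can be checked against the original matrix product. The sketch below is a plain transcription of the equations above into Python (the function names are hypothetical); `transform_shared` uses 8 additions/subtractions and 2 shifts.

```python
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def transform_direct(x):
    """Reference result: ordinary matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in H264]


def transform_shared(x):
    """Implementation after extracting the four divisors D0..D3."""
    x0, x1, x2, x3 = x
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return [d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)]


if __name__ == "__main__":
    sample = [3, -5, 7, 11]
    print(transform_direct(sample))
    print(transform_shared(sample))
```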
Experimental Setup
Goal
Reduction in number of additions/subtractions
Effect on area/latency on synthesis
Simulate designs to estimate power consumption
Transforms
DCT, IDCT, DFT, DST, DHT
8x8 constant matrices
16 digits of precision (CSD representation)
Compare with
Potkonjak (TCAD'95)
RESANDS (Nguyen et al., TVLSI 2000)
Rectangle Covering (A. Hosangadi et al., ASAP'04)
Experimental Results
Table: number of additions/subtractions per example
Columns: Original (I), Potkonjak (II), RESANDS (III), Rectangle Covering (IV), Two-term CSE (V)
Rows: DCT, IDCT, RealDFT, ImagDFT, DST, DHT, Average
Run Time: 0.81s, 0.08s
Experimental results
Synthesis results (minimum latency constraints)
Table columns: Area (Library Units) for (III), (IV), (V); Latency (Clock cycles) for (III), (IV), (V)
Rows: DCT, IDCT, R-DFT, I-DFT, DST, DHT, Average
(III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE
Experimental results
Power consumption
Table columns: Power consumption (µWatts) for (III), (IV), (V)
Rows: DCT, IDCT, R-DFT, I-DFT, DST, DHT, Average
(III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE
Conclusions A new technique for eliminating common subexpressions in linear systems Fewer operations than known methods Much faster than rectangle covering Combine with scheduling on given resources
Thank you Questions??