Ajay K. Verma, Philip Brisk and Paolo Ienne Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA) Ecole Polytechnique Fédérale.


1 Ajay K. Verma, Philip Brisk and Paolo Ienne. Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL). Challenges in Automatic Optimization of Arithmetic Circuits

2 2 Circuit Performance Depends Heavily on the Description. [Figure: three descriptions compared: “software” multiplier; multiplier with compressor tree; multiplier with optimized compressor tree.]

3 3 Pre-Synthesis Optimization of Arithmetic Circuits. [Figure: design-flow diagram with the labels Original Circuit Description, Physical Design, Logic Synthesis, Arithmetic optimizations, Known architectures, Automatic architecture exploration.]

4 4 Automation and Computer Arithmetic. Automation. Heuristics to optimize general classes of circuits: kernel and co-kernel extraction [Brayton82]; decomposition-based approaches for general circuits [Bertacco97, Mishchenko01, Yang02]. Algorithmic approaches for a particular class of circuits: variable-group-size CLA adder [Lee91]; irregular partial-product compressors [Stelling98].

5 5 Logic Synthesis. Synthesis tools have become extremely good at optimizing circuits expressed in sum-of-products form. And when there are plenty of XOR gates? First circuit: before expansion 0.37 ns (138.2 μm²), after expansion 0.26 ns (146.9 μm²). Second circuit: before expansion 0.22 ns (58.8 μm²), after expansion 0.27 ns (221.2 μm²).

6 6 Outline Verma & Ienne; ICCAD 2004 Verma, Brisk, & Ienne; TCAD 2008 Optimizing at Word-Level Verma & Ienne; DAC 2006 Exploring Microscopic Structure Creating Macroscopic Structure Verma, Brisk, & Ienne; DAC 2007 Best Paper Award nominee Verma, Brisk, & Ienne; IWLS 2008

7 7 Outline: Optimizing at Word-Level; Creating Macroscopic Structure; Exploring Microscopic Structure. [Axes: complexity from low to high; granularity from high to low.]

8 8 Outline Optimizing at Word-Level Creating Macroscopic Structure Exploring Microscopic Structure

9 9 Clustering: Maximization of the Use of Carry-Save Representation. [Figure: two addition nodes are separated by a NOT; after the transformation, the two addition nodes are clustered.] Goal: swap the adders with other logic operations, while preserving the semantics, so that additions cluster.
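
To make the benefit of clustering concrete, here is a minimal sketch of carry-save addition on plain Python integers. The function names `carry_save_add` and `add_many` are hypothetical and not from the slides; the point is only that clustered additions can share 3:2 compression steps and pay for a single carry-propagate addition at the end.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compression: reduce three non-negative integers to a (sum, carry) pair."""
    partial_sum = a ^ b ^ c                      # bitwise sum, no carry propagation
    carry = ((a & b) | (b & c) | (a & c)) << 1   # majority bits, weighted by 2
    return partial_sum, carry


def add_many(values: list[int]) -> int:
    """Sum a list by repeated 3:2 compression and one final carry-propagate add."""
    operands = list(values)
    while len(operands) > 2:
        a, b, c = operands.pop(), operands.pop(), operands.pop()
        operands.extend(carry_save_add(a, b, c))  # three operands become two
    return sum(operands)  # the only carry-propagate addition


if __name__ == "__main__":
    print(add_many([3, 5, 7, 11]))
```

When a non-arithmetic operation sits between two adders, the intermediate result normally has to be converted back to ordinary binary form first, which is why moving such operations out of the way matters.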

10 10 Examples of Transformations. Advancing shift left over add (distributivity of multiplication over addition): (A << k) = A · 2^k. Advancing shift right over addition is more complex. Advancing SEL over add (existence of the identity element of addition): (C ? A : D) + (C ? B : 0) → C ? (A + B) : D.
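
The two rewrites on this slide can be checked exhaustively on small operands. The sketch below uses unbounded Python integers, so it ignores fixed-width overflow; the function names and the default widths are assumptions chosen for illustration.

```python
from itertools import product


def shift_over_add_holds(width: int = 4, k: int = 2) -> bool:
    """Check (a + b) << k == (a << k) + (b << k) for all operands of the given width."""
    values = range(1 << width)
    return all(((a + b) << k) == (a << k) + (b << k)
               for a, b in product(values, values))


def sel_over_add_holds(width: int = 3) -> bool:
    """Check (c ? a : d) + (c ? b : 0) == (c ? (a + b) : d) for all small operands."""
    values = range(1 << width)
    for c in (False, True):
        for a, b, d in product(values, repeat=3):
            before = (a if c else d) + (b if c else 0)
            after = (a + b) if c else d
            if before != after:
                return False
    return True


if __name__ == "__main__":
    print(shift_over_add_holds(), sel_over_add_holds())
```

Shift right does not distribute in the same way: (1 + 1) >> 1 is 1, while (1 >> 1) + (1 >> 1) is 0, which is one reason that case needs more care.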

11 11 Some Transformations Have a Cost. Advancing PP over add (distributive property of multiplication over addition). This transformation has a significant cost in terms of area!

12 12 Generation of All Pareto-Optimal Implementations. Theorem: the transformations form a persistent and confluent reduction system. Pareto-optimal: an implementation such that every other implementation is worse in area or in critical-path delay.

13 13 Example: adpcmdecode Kernel. [Figure labels: compressor tree; AND network.] Reported figures: 0.85 ns, 5678 μm²; 0.51 ns, 4901 μm².

14 14 Outline Optimizing at Word-Level Creating Macroscopic Structure Exploring Microscopic Structure Limited scope for optimizations Bit-level

15 15 Implementation of Subcircuits Corresponding to Contiguous Layers Can Be Improved. [Figure labels: ADD; LZD; Arithmetic; Logic; Leading Zero Anticipator.] A direct implementation of LZA in carry-select fashion [Gerwig99].

16 16 Input Condensation. Leader expressions: sufficient to evaluate the whole of an expression; once you evaluate them, you can discard the input bits. Compute all leader expressions in parallel. Recursively compute leader expressions again. [Figure: IN → some large circuit → OUT becomes IN → L → smaller circuit → OUT, with |L| < |IN|; labels: s, c, 8-input parallel counter, leader expressions.]

17 17 Progressive Decomposition: Algorithm Overview. Choose a subset of input bits (how many bits? many different combinations?). Find leader expressions; optimize via Boolean ring properties. Find identities and discard dependent expressions (among x, y, z, drop z if z = f(x, y)). Rewrite the circuit in terms of leader expressions. Recursively process the remaining circuit.

18 18 Example: 3-Input Adder (s₂ Output). Ripple-carry adder: X = [a₁b₁ + (a₁ + b₁)a₀b₀] ⊕ [(a₁ ⊕ b₁ ⊕ a₀b₀)c₁ + c₀(a₀ ⊕ b₀)(c₁ + (a₁ ⊕ b₁ ⊕ a₀b₀))]. Leader expressions: L(X, {a₁, b₁, c₁}) = {a₁ ⊕ b₁ ⊕ c₁, a₁b₁ ⊕ b₁c₁ ⊕ a₁c₁}, i.e. the sum and carry of a carry-save adder. [Figure: the two cascaded ripple-carry adders computing X are restructured so that a 3:2 compressor (CSA) on a₁, b₁, c₁ feeds the remaining circuit.]
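
The example can be verified by brute force over all 64 input assignments. The sketch below assumes the operators lost in the transcript are XOR and that "+" denotes OR; the function names are hypothetical. It checks that X is bit 2 of the three-operand sum and that, for fixed low-order bits, X depends on a₁, b₁, c₁ only through the two leader expressions.

```python
from itertools import product


def ripple_s2(a1: int, a0: int, b1: int, b0: int, c1: int, c0: int) -> int:
    """Bit s2 of A + B + C as produced by two cascaded ripple-carry adders."""
    t1 = a1 ^ b1 ^ (a0 & b0)                 # bit 1 of A + B
    t2 = (a1 & b1) | ((a1 | b1) & a0 & b0)   # carry out of A + B
    k0 = c0 & (a0 ^ b0)                      # carry out of bit 0 of (A + B) + C
    k1 = (t1 & c1) | (k0 & (c1 | t1))        # carry out of bit 1 of (A + B) + C
    return t2 ^ k1


def leader_expressions(a1: int, b1: int, c1: int) -> tuple[int, int]:
    """Sum and carry of a 3:2 compressor on the high-order bits."""
    return a1 ^ b1 ^ c1, (a1 & b1) ^ (b1 & c1) ^ (a1 & c1)


def matches_three_operand_sum() -> bool:
    """X equals bit 2 of the integer sum A + B + C for every input."""
    for a1, a0, b1, b0, c1, c0 in product((0, 1), repeat=6):
        total = (2 * a1 + a0) + (2 * b1 + b0) + (2 * c1 + c0)
        if ripple_s2(a1, a0, b1, b0, c1, c0) != (total >> 2) & 1:
            return False
    return True


def x_depends_only_on_leaders() -> bool:
    """Inputs with equal (a0, b0, c0, sum, carry) always give the same X."""
    seen: dict[tuple[int, ...], int] = {}
    for a1, a0, b1, b0, c1, c0 in product((0, 1), repeat=6):
        key = (a0, b0, c0, *leader_expressions(a1, b1, c1))
        x = ripple_s2(a1, a0, b1, b0, c1, c0)
        if seen.setdefault(key, x) != x:
            return False
    return True


if __name__ == "__main__":
    print(matches_three_operand_sum(), x_depends_only_on_leaders())
```

The second check is what allows the three bits a₁, b₁, c₁ to be discarded once the two leader expressions have been computed.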

19 19 A Better Division Is Used for Leader Expression Computation. X = ab(c ⊕ d ⊕ e) ⊕ cd(a ⊕ b ⊕ e) becomes X = (ab ⊕ cd)(a ⊕ b ⊕ c ⊕ d ⊕ e), based on the identity pq(p ⊕ q) = 0. Theorem: an expression of the form (PQ ⊕ RS) can be factored as (P ⊕ R)T if there exist U and V such that 1) PU = RV = 0 and 2) Q ⊕ S = U ⊕ V. The ideal membership problem can be used to determine the existence of such U and V.
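
The factorization can be confirmed over all 32 assignments of a, b, c, d, e. This is a sketch under the assumption that every operator lost in the transcript is XOR, so that the example instantiates the theorem with P = ab, Q = c ⊕ d ⊕ e, R = cd, S = a ⊕ b ⊕ e, U = a ⊕ b, V = c ⊕ d and T = a ⊕ b ⊕ c ⊕ d ⊕ e. The function names are hypothetical.

```python
from itertools import product


def original(a: int, b: int, c: int, d: int, e: int) -> int:
    """X = ab(c xor d xor e) xor cd(a xor b xor e)."""
    return ((a & b) & (c ^ d ^ e)) ^ ((c & d) & (a ^ b ^ e))


def factored(a: int, b: int, c: int, d: int, e: int) -> int:
    """X = (ab xor cd)(a xor b xor c xor d xor e)."""
    return ((a & b) ^ (c & d)) & (a ^ b ^ c ^ d ^ e)


def identity_holds() -> bool:
    """pq(p xor q) = 0 for all Boolean p, q."""
    return all(((p & q) & (p ^ q)) == 0 for p, q in product((0, 1), repeat=2))


def factorization_holds() -> bool:
    """Both forms of X agree on every assignment."""
    return all(original(*bits) == factored(*bits)
               for bits in product((0, 1), repeat=5))


if __name__ == "__main__":
    print(identity_holds(), factorization_holds())
```

The identity is what makes the step work: ab(a ⊕ b) = 0 lets a ⊕ b be XORed into the first cofactor for free, and cd(c ⊕ d) = 0 does the same for the second, after which both products share the factor T.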

20 20 Progressive Decomposition: Qualitative Analysis. Completely agnostic of the type of circuit to optimize. Automatically infers successful circuit designs from the literature: carry-lookahead adder (beyond minimal sizes); structured LZD/LOD circuit; optimized LZA circuit (no sum computation); carry-save addition; parallel counter. It also discovers some designs unknown to us: multi-input comparisons (min/max).

21 21 Multi-Input Comparator (Min/max of k n-bit Integers). Binary tree of comparators: number of comparators k − 1; critical path delay O(log n · log k); hardware area O(kn); 0.46 ns, 1755 μm². Pairwise comparison of inputs: number of comparators k(k − 1)/2; critical path delay O(log n + log k); hardware area O(k²n); 0.21 ns, 3479 μm². With our structuring algorithm (bitwidth reduction using dominators and LODs): number of LODs k · log* n; critical path delay O(log n + log k · log* n); hardware area O(kn); 0.22 ns, 1331 μm². Here log*(·) is the number of times the logarithm function must be iteratively applied before the result is ≤ 1, e.g., log*(2^65536) = 5.
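
The quantities on this slide are easy to reproduce. The sketch below assumes base-2 logarithms, which is what the example log*(2^65536) = 5 implies; the function names are hypothetical. `pairwise_min` is only a behavioural model of the pairwise scheme, in which every comparison is independent of the others and the winner is the input that is ≤ all inputs.

```python
import math


def log_star(x: float) -> int:
    """Number of times log2 must be applied before the result is <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


def tree_comparators(k: int) -> int:
    """Comparators in a binary tree over k inputs."""
    return k - 1


def pairwise_comparators(k: int) -> int:
    """Comparators when every pair of the k inputs is compared."""
    return k * (k - 1) // 2


def pairwise_min(values: list[int]) -> int:
    """Minimum of a non-empty list via all-pairs comparison (first winner on ties)."""
    wins = [all(x <= y for y in values) for x in values]
    return values[wins.index(True)]


if __name__ == "__main__":
    print(log_star(2 ** 65536), tree_comparators(8), pairwise_comparators(8))
    print(pairwise_min([9, 4, 7, 4, 12]))
```

For k = 8 the tree needs 7 comparators and the pairwise scheme 28, which matches the area gap between O(kn) and O(k²n) on the slide.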

22 22 Outline Optimizing at Word-Level Creating Macroscopic Structure Exploring Microscopic Structure Reed-Muller form can be very inefficient Efficient implementation of the leader expressions ? Exhaustive Exploration

23 23 Problem Statement. Given a set of Boolean expressions, generate all their Pareto-optimal implementations. [Figure labels: no “reuse”; total “reuse”; selective “reuse”.]
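
As an illustration of what "Pareto-optimal" selects, the following sketch filters a list of (delay, area) pairs. The helper name `pareto_front` is hypothetical, and the sample points are the three min/max comparator figures quoted on slide 21, used here only as convenient data.

```python
Point = tuple[float, float]  # (critical-path delay in ns, area in square micrometres)


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if it is no worse in both metrics and differs from q."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [q for q in points if not any(dominates(p, q) for p in points)]


if __name__ == "__main__":
    comparators = [(0.46, 1755.0), (0.21, 3479.0), (0.22, 1331.0)]
    print(pareto_front(comparators))
```

On these three points the binary-tree design (0.46 ns, 1755 μm²) drops out because the structured design is both faster and smaller, while the pairwise design survives as the fastest option.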

24 24 Enumerating Common Sub-Expressions. Root: original Reed-Muller form. At each step, either x ⊕ y or xy is replaced by a new variable. The nodes of the DAG correspond to all partial implementations of the two expressions with some sharing between them.

25 25 Pruning the Enumeration DAG. The size of the DAG can be as large as O((n + m)^(2m)), where n is the number of variables and m is the size of the Boolean expressions. Enumerating the whole DAG is computationally infeasible. Pruning criteria: recognizing node equivalence (width reduction); merging some reductions into a single one (height reduction); delaying certain reductions (branch reduction).

26 26 There Is Scope for Further Pruning… [Figure: area and delay for all 6-bit adders generated by our algorithm.] Without any pruning, it would be impossible to handle expressions with more than five variables. Number of possible implementations: >10^60. Number of explored implementations: 2687. Number of actual Pareto-optimal implementations: 4.

27 27 …but the Enumeration Algorithm Finds Interesting Non-Trivial Relations! 4x4-bit multiplier: better than our best manually-designed cell-based multiplier?! The method has been generalized for higher-bitwidth multipliers. It reduced the delay of the best cell-based 8 x 8-bit multiplier by 10%. Verma & Ienne; ASP-DAC 2007 Best Paper Award nominee.

28 28 Summary Verma & Ienne; ICCAD 2004 Verma, Brisk, & Ienne; TCAD 2008 Optimizing at Word-Level Verma & Ienne; DAC 2006 Exploring Microscopic Structure Creating Macroscopic Structure Verma, Brisk, & Ienne; DAC 2007 Best Paper Award nominee Verma, Brisk, & Ienne; IWLS 2008

29 29 Computer Arithmetic and Automation. Computer arithmetic has long been the domain of extremely ingenious, manually developed architectures. Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit. Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues. It is perhaps high time to try to change all this…

30 30 Thanks!

