Optimizing high speed arithmetic circuits using three-term extraction Anup Hosangadi Ryan Kastner Farzan Fallah ECE Department Fujitsu Laboratories University.

Presentation transcript:

Optimizing high speed arithmetic circuits using three-term extraction
Anup Hosangadi, Ryan Kastner: ECE Department, University of California, Santa Barbara
Farzan Fallah: Fujitsu Laboratories of America

2 Outline
- Carry Save Arithmetic
- Related Work
- Problem formulation
- Algebraic methods
- Delay aware optimization
- Experimental results

3 Carry Save Arithmetic
- Multi-operand addition: F = A + B + C + D + E + F
- Carry propagation is the major bottleneck
- Fast adders such as the Carry Lookahead Adder (CLA) and Carry Select Adders are not fast enough
- Solution: defer carry propagation to the final step
  - Generate sums and carries separately
  - Treat them as separate numbers
  - Keep adding until only two numbers remain
  - Add the two numbers using a fast adder (CLA)
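The reduction described on this slide can be sketched in a few lines of Python. This is an illustrative model on non-negative integers, not the authors' implementation; the function names `csa` and `multi_operand_add` are made up for the example, and the final `sum` stands in for the fast carry-propagate adder.

```python
def csa(a, b, c):
    """One carry save adder: three numbers in, a (sum, carry) pair out.

    No carry is propagated between bit positions; the carries are
    simply collected into a second number, shifted left by one.
    """
    s = a ^ b ^ c                                  # bitwise sum
    carry = ((a & b) | (a & c) | (b & c)) << 1     # bitwise majority, weight 2
    return s, carry


def multi_operand_add(operands):
    """Reduce operands with CSAs until two remain, then add them once."""
    nums = list(operands)
    while len(nums) > 2:
        a, b, c = nums[:3]
        nums = nums[3:] + list(csa(a, b, c))
    # Stand-in for the single carry-propagate addition (e.g. a CLA).
    return sum(nums)
```

For instance, `csa(5, 3, 6)` returns `(0, 14)`, and the two outputs still add up to 5 + 3 + 6 = 14.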

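4 Carry Save Arithmetic
[Figure: CSA tree reducing operands A to F to sum (S) and carry (C) pairs, followed by a final CLA producing F]
Delay = $3 + \log_2(M + 3)$, where 3 = height of the CSA tree and M = bitwidth of the operands
Tree height = $\log_{1.5}(N/2)$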
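As a rough cross-check of the tree height for six operands, the number of CSA levels can be counted directly. The sketch below assumes each level uses as many CSAs as possible (every group of three operands becomes two); `csa_tree_height` is a name chosen for this example. For six operands the closed form gives about 2.7, which rounds up to the 3 levels counted here.

```python
def csa_tree_height(n):
    """Count CSA levels needed to reduce n operands to two."""
    levels = 0
    while n > 2:
        n -= n // 3      # each CSA replaces three operands with two
        levels += 1
    return levels
```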

5 Carry Save Arithmetic
Using ripple carry adders (RCAs):
[Figure: chain of five RCAs with widths (M + 1), (M + 2), (M + 3), (M + 4), (M + 5)]
Delay = $(M + 5) + 4$
Delay through CSA network = $3 + \log_{1.5}(M + 3)$

6 Related Work
- Kim et al., “Arithmetic optimization using Carry Save Adders”, DAC’
[Figure: addition of A, B, C, D, E into F, before and after introducing a CSA]

7 Related Work
- Kim et al., “Optimal allocation of CSAs”, ICCAD’99
  - Delay aware CSA allocation
- Kim et al., “High performance, low power synthesis”, DAC’2000
- Synopsys™ Behavioral Optimization for Arithmetic (BOA)
- A. Verma and P. Ienne, “Improved use of the carry save representation for the synthesis of complex arithmetic circuits”, ICCAD’2004
- Arithmetic Optimizer?

8 Problem formulation
- No methodology for detecting redundancy in CSA computations
  - Can reduce the number of CSAs
  - Can reduce the number of wires
- Common subexpression elimination
  - Standard compiler technique
  - Applied to 2-term arithmetic operations
    - Polynomial expressions (ICCAD’04, VLSI’05)
    - Constant multiplications (ASAP’04, ASPDAC’05)
- CSA expressions (common 3-term subexpressions)

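9 Problem formulation
Original expressions:
$Y_1 = X_1 + (X_1 \ll 2) + X_2 + (X_2 \ll 1) + (X_2 \ll 2)$
$Y_2 = (X_1 \ll 2) + (X_2 \ll 2) + (X_2 \ll 3)$
After extracting the common 3-term subexpression:
$D_1 = X_1 + X_2 + (X_2 \ll 1)$
$Y_1 = (D_{1S} + D_{1C}) + (X_1 \ll 2) + (X_2 \ll 2)$
$Y_2 = (D_{1S} + D_{1C}) \ll 2$
Here $D_{1S}$ and $D_{1C}$ are the sum and carry outputs of the CSA that computes $D_1$.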
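The sharing in this example can be checked numerically. The sketch below models a CSA on non-negative Python integers and compares the direct expressions with the version that reuses one CSA for D1, taking Y2 as D1 shifted left by two (which expands to X1<<2 + X2<<2 + X2<<3). The helper names `csa`, `direct` and `shared` are illustrative only.

```python
def csa(a, b, c):
    """Carry save adder: returns (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def direct(x1, x2):
    """Y1 and Y2 exactly as written before extraction."""
    y1 = x1 + (x1 << 2) + x2 + (x2 << 1) + (x2 << 2)
    y2 = (x1 << 2) + (x2 << 2) + (x2 << 3)
    return y1, y2


def shared(x1, x2):
    """Y1 and Y2 rebuilt from the shared divisor D1 = X1 + X2 + X2<<1."""
    d1s, d1c = csa(x1, x2, x2 << 1)          # one CSA, used by both outputs
    y1 = d1s + d1c + (x1 << 2) + (x2 << 2)
    y2 = (d1s << 2) + (d1c << 2)             # shifting D1 is only wiring
    return y1, y2
```

Both functions return the same pair for any non-negative inputs, for example `direct(3, 5) == shared(3, 5) == (50, 72)`.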

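10 Algebraic methods
- Polynomial transformation: $X \ll i = XL^i$
- Detects shifted common subexpressions and also extends to multiple variables
- Constant multiplication: $C \times X = \sum_i (\pm X \times L^i)$
Example:
$(14)_{10} \times X = (1110)_2 \times X = (X \ll 3) + (X \ll 2) + (X \ll 1) = XL^3 + XL^2 + XL^1$
$(14)_{10} \times X = (1\,0\,0\,{-1}\,0)_{CSD} \times X = XL^4 - XL^1$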
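The canonical signed digit (CSD) recoding used in the example can be sketched as follows. This is a standard recoding written for illustration, not code from the presentation; `to_csd` and `shift_add_terms` are hypothetical names, and only non-negative constants are handled.

```python
def to_csd(n):
    """Return the CSD digits of n >= 0, least significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add_terms(n):
    """List the (sign, shift) pairs, i.e. the terms sign * X * L^shift."""
    return [(d, i) for i, d in enumerate(to_csd(n)) if d != 0]
```

For the constant 14 this yields the two terms `(-1, 1)` and `(1, 4)`, matching XL^4 - XL^1 on the slide.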

11 Algebraic methods
- 3-term divisors = all potential common subexpressions
- Divisor generation: one for every combination of 3 terms
Example:
$F_1 = X_1 + X_1L^2 + X_2 + X_2L + X_2L^2$
$d_1 = X_1L^2 + X_2L + X_2L^2$, MinL = $L$
Divisor $D_1 = d_1 / L = X_1L + X_2 + X_2L$
- Number of divisors for an expression with N terms = $\binom{N}{3}$
- Theorem: There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors
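Divisor generation as described here can be sketched in Python. The representation is an assumption made for the example: each term is a `(variable, shift)` pair standing for variable times L^shift, all terms are taken as positive, and `three_term_divisors` is a made-up name.

```python
from itertools import combinations


def three_term_divisors(expr):
    """Return every 3-term divisor of expr with the shift divided out.

    expr is a list of (variable, shift) pairs. Each result is a pair
    (divisor, min_shift), where the divisor is the chosen three terms
    divided by L^min_shift, stored as a sorted tuple.
    """
    divisors = []
    for combo in combinations(expr, 3):
        min_shift = min(shift for _, shift in combo)
        divisor = tuple(sorted((var, shift - min_shift) for var, shift in combo))
        divisors.append((divisor, min_shift))
    return divisors


# F1 = X1 + X1*L^2 + X2 + X2*L + X2*L^2 from the slide
F1 = [("X1", 0), ("X1", 2), ("X2", 0), ("X2", 1), ("X2", 2)]
```

With five terms, `three_term_divisors(F1)` returns ten divisors, one of which is X1L + X2 + X2L obtained by dividing out L.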

12 Algebraic methods
Greedy iterative algorithm
- Extracts the “best” 3-term divisor
- Rewrites the expressions containing it
- Terminates when there are no more common subexpressions
Example:
$F_1 = a + b + c + d + e$
$F_2 = a + b + c + d + f$
>> $D_1 = a + b + c$
$F_1 = D_{1S} + D_{1C} + d + e$
$F_2 = D_{1S} + D_{1C} + d + f$
>> $D_2 = D_{1S} + D_{1C} + d$
$F_1 = D_{2S} + D_{2C} + e$
$F_2 = D_{2S} + D_{2C} + f$

13 Algebraic methods
Algorithm details
Optimize({P_i})
{
    {P_i} = Set of expressions in polynomial form;
    {D} = Set of divisors = ∅;
    // Step 1. Creating divisors and their frequency statistics
    for each expression P_i in {P_i}
    {
        {D_new} = Divisors(P_i);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {D_new};
    }
    // Step 2. Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = divisor in {D} with most number of non-overlapping intersections;
        if (d == NULL) break;
        Rewrite affected expressions in {P_i} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {D_new} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {D_new};
    }
}
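A much simplified version of the greedy loop can be sketched in Python. The simplifications are assumptions of this example and not part of the presented algorithm: terms are plain symbols without shifts, each expression is a set (so a divisor occurs at most once per expression), divisors are recomputed from scratch on every iteration instead of being updated incrementally, and ties are broken alphabetically. The name `extract_three_term` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_three_term(exprs):
    """Greedily extract 3-term subexpressions shared by several expressions.

    exprs maps an expression name to an iterable of term symbols.
    Returns (divisors, rewritten), where each extracted divisor Dk is
    replaced in the expressions by its CSA outputs 'DkS' and 'DkC'.
    """
    exprs = {name: set(terms) for name, terms in exprs.items()}
    divisors = []
    while True:
        # Count in how many expressions each 3-term combination occurs.
        counts = Counter(
            triple
            for terms in exprs.values()
            for triple in combinations(sorted(terms), 3)
        )
        shared = [(n, t) for t, n in counts.items() if n > 1]
        if not shared:
            break                                  # nothing left to share
        best_count = max(n for n, _ in shared)
        best = min(t for n, t in shared if n == best_count)
        k = len(divisors) + 1
        divisors.append((f"D{k}", best))
        for terms in exprs.values():
            if terms.issuperset(best):
                terms.difference_update(best)      # remove the three terms
                terms.update((f"D{k}S", f"D{k}C")) # add the sum and carry
    return divisors, exprs
```

Running it on F1 = a + b + c + d + e and F2 = a + b + c + d + f first extracts a + b + c, then D1S + D1C + d, and leaves F1 = D2S + D2C + e and F2 = D2S + D2C + f.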

14 Algebraic methods
Algorithm complexity
- M expressions, each with N terms
- Divisor generation: $M \cdot \binom{N}{3} = O(MN^3)$
- Iterative algorithm, worst case:
  - N terms reduced to 2 terms = (N - 2) steps
  - M expressions = O(MN) steps

15 Delay aware optimization
- Sharing subexpressions can increase the total delay
- Traditional high level synthesis approach: reduce delay by Tree Height Reduction (THR)
- Our solution: control delay during the optimization itself
- Optimal delay CSA allocation (T. Kim, J. Um, “Timing driven synthesis”, ASPDAC’2000)
  - Use this to get the minimum possible delay
Example (signal arrival times as superscripts):
$F_1 = a^{(2)} + b^{(0)} + c^{(0)} + d^{(0)} + e^{(0)}$
$F_2 = a^{(2)} + b^{(0)} + c^{(0)} + d^{(0)} + f^{(0)}$

16 Delay aware optimization
[Figure: CSA trees for F1 and F2 under optimal allocation, compared with delay ignorant extraction]
Delay($F_1$) = Delay($F_2$) = 3 + D(Add)

17 Delay aware extraction
- Control delay during optimization
- Evaluate each candidate divisor for delay
- Only consider those divisors that do not increase the delay
$F_1 = a^{(2)} + b^{(0)} + c^{(0)} + d^{(0)} + e^{(0)}$
$F_2 = a^{(2)} + b^{(0)} + c^{(0)} + d^{(0)} + f^{(0)}$
>> $D_1^{(3)} = a^{(2)} + b^{(0)} + c^{(0)}$
$F_1 = D_{1S}^{(3)} + D_{1C}^{(3)} + d^{(0)} + e^{(0)}$
$F_2 = D_{1S}^{(3)} + D_{1C}^{(3)} + d^{(0)} + f^{(0)}$
Delay = 5 + D(Add)

18 Delay aware extraction
- Control delay during optimization
- Evaluate each candidate divisor for delay
- Only consider those divisors that do not increase the delay
$F_1 = a^{(2)} + b^{(0)} + c^{(0)} + d^{(0)} + e^{(0)}$
$F_2 = a^{(2)} + b^{(0)} + c^{(0)} + d^{(0)} + f^{(0)}$
>> $D_2^{(1)} = b^{(0)} + c^{(0)} + d^{(0)}$
$F_1 = D_{2S}^{(1)} + D_{2C}^{(1)} + e^{(0)} + a^{(2)}$
$F_2 = D_{2S}^{(1)} + D_{2C}^{(1)} + f^{(0)} + a^{(2)}$
Delay = 3 + D(Add)
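The delay figures quoted for the two candidate divisors can be reproduced with a small timing model. The sketch assumes that a CSA output is ready one full-adder delay after its latest input and that the remaining signals are combined earliest-first; this is an illustrative stand-in for the cited optimal allocation, not a reimplementation of it, and `csa_tree_delay` is a name invented here. The final adder delay D(Add) is not included.

```python
import heapq


def csa_tree_delay(arrivals):
    """Time at which the last two signals are ready for the final adder.

    arrivals is a non-empty list of signal arrival times in FA units.
    The three earliest signals are repeatedly fed to a CSA, whose sum
    and carry outputs are ready one unit after its latest input.
    """
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 2:
        latest_input = max(heapq.heappop(heap) for _ in range(3))
        out = latest_input + 1
        heapq.heappush(heap, out)   # sum output
        heapq.heappush(heap, out)   # carry output
    return max(heap)


# F1 with no sharing: a(2), b(0), c(0), d(0), e(0)
unshared = csa_tree_delay([2, 0, 0, 0, 0])
# After extracting D1 = a + b + c: D1S(3), D1C(3), d(0), e(0)
with_d1 = csa_tree_delay([3, 3, 0, 0])
# After extracting D2 = b + c + d: D2S(1), D2C(1), e(0), a(2)
with_d2 = csa_tree_delay([1, 1, 0, 2])
```

Under this model `unshared` and `with_d2` both come out as 3 while `with_d1` is 5, which is why a delay aware extraction would prefer D2 over D1.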

19 Experimental results
Comparing # of CSAs
- Average 38.4% reduction

20 Experimental results
Synthesis for standard cell designs
- Synopsys™ Design Compiler
- 0.25 micron library
- Synthesized for minimum delay
- Avg 32.7% area reduction
- Avg 3.7% increase in delay

21 Experimental results
FPGA synthesis
- Virtex II FPGAs
- Synthesized designs and performed place & route
- Avg 14.1% reduction in # Slices and avg 12.9% reduction in # LUTs
- Avg 5.7% increase in the delay

22 Experimental results
Evaluate delay aware extraction algorithm
- Consider different arrival times of the signals
- Assume delay dominated by gate delay (FA delay)
- Only consider best case delay
[Table: # of CSAs and delay (FA units), delay ignorant vs. delay aware, for the examples H, DCT, IDCT, FIR6tap, FIR20tap, FIR41tap, and their average]
- Best delay with 15.5% increase in # CSAs

23 Conclusions
- First methodology for common subexpression elimination for Carry Save Arithmetic
- Significant area/power reduction
- Delay aware optimization algorithm also developed
- Can be combined with CSA tree extraction methods for actual application improvement

24 Thank you!! Questions?