Energy Efficient Hardware Synthesis of Polynomial Expressions, 18th International Conference on VLSI Design, Anup Hosangadi, Ryan Kastner, ECE Department,

Presentation transcript:

Energy Efficient Hardware Synthesis of Polynomial Expressions
18th International Conference on VLSI Design
Anup Hosangadi, Ryan Kastner (ECE Department, UCSB)
Farzan Fallah (Advanced CAD Research, Fujitsu Labs of America)

Outline
– Introduction
– Related work
– Problem formulation
– Algorithms for optimizing polynomials
– Experimental results
– Conclusions

Introduction
Embedded system applications need to compute polynomial expressions
– Continuous functions can be approximated by Taylor series
– Adaptive (polynomial) filters
– Polynomial interpolation/extrapolation in computer graphics
– Encryption

Introduction
Commonly occurring computations are implemented in hardware
– More flexibility than processor architecture
– NPAs (hardware accelerators) in the PICO project
– Custom instructions (Tensilica)
– Up to 100 times improvement over a processor implementation (Kastner et al., TODAES'02)
Develop techniques for reducing power consumption

Related Work (Behavioral transforms)
Power consumption depends on many factors
– Reducing the number of operations
  Hardware: (Nguyen and Chatterjee, TVLSI'00)
  Software: (I. Hong et al., TODAES'99)
– Voltage reduction after speedup transformations
  Retiming, pipelining, algebraic restructuring (Chandrakasan et al., TCAD'95)

Related Work
Scheduling and resource allocation
– Shutting down unused resources (Monteiro et al., DAC'96)
– Allocation of registers, functional units and interconnects (A. Raghunathan et al., ICCD'94)
Multiple Vdd scheduling
– Assigning a supply voltage to each operation in the CDFG (M. Chang and M. Pedram, TVLSI'97)

Related Work
Switching power is proportional to the number of operations
Multiplications are expensive in embedded systems
– On average 40 times more power than an addition at 5 V (V. Krishna et al., VLSI Design 1999)
Careful optimization of expressions is therefore necessary to save power

Reducing operations in polynomial expressions
No good tool for polynomials
– Designers rely on hand-optimized libraries
Conventional compiler techniques (CSE and value numbering) are not suited to polynomials
Horner form: most popular representation
– a_n x^n + a_{n-1} x^{n-1} + ... + a_1 x + a_0 = (...((a_n x + a_{n-1})x + a_{n-2})x + ... + a_1)x + a_0
– Not good for multivariate polynomials
– Handles only a single polynomial expression at a time
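As a small illustration of the Horner form above, the sketch below evaluates a univariate polynomial with one multiplication per coefficient after the first. The function name `horner` and the coefficient ordering (highest degree first) are choices made for this example, not something the slides prescribe.

```python
def horner(coeffs, x):
    """Evaluate a_n*x^n + ... + a_1*x + a_0 in Horner form.

    coeffs is [a_n, ..., a_1, a_0] (highest degree first).
    Returns (value, number_of_multiplications).
    """
    result = 0
    mults = 0
    for i, a in enumerate(coeffs):
        if i == 0:
            result = a              # innermost term: just a_n
        else:
            result = result * x + a  # one multiplication per step
            mults += 1
    return result, mults


# 2x^3 - 3x^2 + 5 at x = 2
value, mults = horner([2, -3, 0, 5], 2)
print(value, mults)
```

A degree-n polynomial therefore needs n multiplications, which is why the form is attractive for one variable and one expression at a time.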

Comparison with Horner form
Quartic-spline polynomial (3-D graphics)
P = zu^4 + 4avu^3 + 6bu^2v^2 + 4uv^3w + qv^4
Horner form (from Maple™)
P = zu^4 + (4au^3 + (6bu^2 + (4uw + qv)v)v)v (17 multiplications)
Proposed algebraic method
d1 = v^2; d2 = 4v
P = u^3(uz + ad2) + d1(qd1 + u(wd2 + 6bu)) (11 multiplications)
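The three forms of the quartic-spline polynomial can be checked against each other numerically. The sketch below is only a sanity check under stated assumptions: the function names are made up for this example, and it takes d2 = 4v, the value for which the factored form expands back to P term by term.

```python
import random


def p_direct(u, v, w, z, a, b, q):
    # Sum-of-products form of the quartic-spline polynomial
    return z*u**4 + 4*a*v*u**3 + 6*b*u**2*v**2 + 4*u*v**3*w + q*v**4


def p_horner(u, v, w, z, a, b, q):
    # Horner form with respect to v, as quoted on the slide
    return z*u**4 + (4*a*u**3 + (6*b*u**2 + (4*u*w + q*v)*v)*v)*v


def p_factored(u, v, w, z, a, b, q):
    # Factored form; d2 = 4*v is the assumption described in the lead-in
    d1 = v * v
    d2 = 4 * v
    return u**3 * (u*z + a*d2) + d1 * (q*d1 + u*(w*d2 + 6*b*u))


def forms_agree(trials=200, seed=0):
    """Compare the three forms at random integer points (exact arithmetic)."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = [rng.randint(-50, 50) for _ in range(7)]
        if not (p_direct(*args) == p_horner(*args) == p_factored(*args)):
            return False
    return True


print(forms_agree())
```

Agreement at random integer points is strong evidence rather than a proof; expanding the factored form by hand gives the same five terms.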

Related Work (Polynomial Expressions)
Expression factorization (M.A. Breuer, JACM'69)
– Allows only one kind of operator at a time
Using symbolic algebra (M.A. Peymandoust, De Micheli)
– Mapping polynomial datapaths to libraries (DAC'01)
– Low power embedded software (DATE'02)
– Results depend heavily on the set of library elements, e.g. (a^2 - b^2) = (a+b)(a-b) iff (a+b) or (a-b) is a library element
– Manipulates only a single expression at a time, so sharing across expressions is missed, e.g. F1 = A + B + C + D; F2 = A + P + D => extract (A + D)

Motivating Example
Consider the set of expressions
P1 = x^3y + x^2y^2z
P2 = 4x + 4yz - xyz
P3 = 4xy - x^2y
Original: 16 multiplications and 4 additions/subtractions
Using CSE: 12 multiplications and 4 additions/subtractions

Motivating Example
Using Horner transform: 12 multiplications and 4 additions/subtractions
Using our algebraic technique: 7 multiplications and 3 additions/subtractions

Introduction to algebraic technique for redundancy elimination
Algebraic techniques in multi-level logic synthesis (MLLS)
– Decomposition and factoring reduce the number of literals
– Distill and Condense use rectangle covering methods
Polynomial expressions (our technique)
– Factoring and single term common subexpressions reduce the number of multiplications
– Multiple term common subexpressions reduce the number of additions and possibly multiplications
Key differences (generalization to handle higher orders)
– Kernelling techniques
– Finding single cube intersections

Introduction to our technique (Outline)
Find a subset of all possible subexpressions (kernel generation)
Transformation of polynomial expressions
– Problem formulation
Extract multiple term common subexpressions and factors
Extract single term common factors

Introduction to our technique
Terminology
– Literal: a variable or a constant, e.g. a, b, 2, 3.14
– Cube: product of literals, e.g. +3a^2b, -2a^3b^2c
– SOP: sum of cubes, e.g. +3a^2b - 2a^3b^2c
– Cube-free expression: no literal or cube can divide all the cubes of the expression
– Kernel: a cube-free sub-expression of an expression, e.g. 3 - 2abc
– Co-kernel: a cube that is used to divide an expression to get a kernel, e.g. a^2b

Introduction to our technique
Matrix representation of polynomial expressions
– F = x^3y - xy^2z is represented by
  +/-  x  y  z
  +    3  1  0
  -    1  2  1
– Each row represents a product term
– Each column represents a variable/constant
– Each element (i,j) represents the power of variable j in term i
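The matrix representation maps naturally onto a list of rows. The sketch below stores F = x^3y - xy^2z as (sign, exponent-vector) pairs and shows two things the matrix makes easy: evaluating the polynomial and counting the multiplications of a naive term-by-term implementation. The names `VARS`, `evaluate` and `count_multiplications` are invented for this example, and constant coefficients other than +1/-1 are left out for brevity.

```python
VARS = ("x", "y", "z")

# F = x^3*y - x*y^2*z: one row per product term,
# each row is (sign, (power of x, power of y, power of z))
F = [
    (+1, (3, 1, 0)),
    (-1, (1, 2, 1)),
]


def evaluate(poly, values):
    """Evaluate the polynomial at the given variable assignment."""
    total = 0
    for sign, exps in poly:
        term = sign
        for var, e in zip(VARS, exps):
            term *= values[var] ** e
        total += term
    return total


def count_multiplications(poly):
    """A term with k literal occurrences needs k - 1 multiplications."""
    return sum(max(sum(exps) - 1, 0) for _, exps in poly)


print(evaluate(F, {"x": 2, "y": 3, "z": 1}))
print(count_multiplications(F))
```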

Generation of Kernels (example)
P1 = x^3y + x^2y^2z, {L} = {x, y, z}
– Divide by x: Ft = P1/x = x^2y + xy^2z
  P1 rows (x y z): 3 1 0 and 2 2 1
  Ft rows (x y z): 2 1 0 and 1 2 1

Generation of Kernels (example)
Ft = P1/x = x^2y + xy^2z
C = biggest cube dividing all cubes of Ft (column-wise minimum of the exponents)
  Ft rows (x y z): 2 1 0 and 1 2 1
  C row (x y z): 1 1 0, so C = xy

Generation of Kernels (example)
Obtain kernel: F1 = Ft/C = (x^2y + xy^2z)/(xy) = (x + yz)
Obtain co-kernel: D1 = x*(xy) = x^2y
– No kernels within F1; go back to P1
P1 = x^3y + x^2y^2z
– Divide now by the next variable y: Ft = x^3 + x^2yz
– C = x^2
– But x ∈ C and x comes before y in {L}: stop here to avoid repeating the same kernel, Ft/C = (x + yz)
– No more kernels extracted
– Record kernel F1 = P1 with co-kernel '1'
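The kernel generation walk-through above can be sketched as a short recursive procedure. This is a minimal sketch under stated assumptions, not the authors' implementation: polynomials are dictionaries from exponent tuples to integer coefficients, only variables are treated as literals (constants such as 4 are carried along as coefficients rather than divided out), and all function names are invented for this example. The trivial pair recorded at the end of the slide (co-kernel 1) is left to the caller.

```python
def divide_by_cube(poly, cube):
    """Divide by a cube; terms the cube does not divide are dropped."""
    out = {}
    for exps, coeff in poly.items():
        if all(e >= c for e, c in zip(exps, cube)):
            out[tuple(e - c for e, c in zip(exps, cube))] = coeff
    return out


def biggest_common_cube(poly):
    """Largest cube dividing every term: column-wise minimum of exponents."""
    return tuple(min(col) for col in zip(*poly.keys()))


def kernels(poly, nvars, start=0, cokernel=None):
    """Return a list of (co-kernel, kernel) pairs."""
    if cokernel is None:
        cokernel = (0,) * nvars
    found = []
    for i in range(start, nvars):
        literal = tuple(1 if k == i else 0 for k in range(nvars))
        ft = divide_by_cube(poly, literal)
        if len(ft) < 2:
            continue  # literal i appears in fewer than two terms
        c = biggest_common_cube(ft)
        if any(c[k] > 0 for k in range(i)):
            continue  # an earlier literal is in C: kernel already found
        f1 = divide_by_cube(ft, c)
        d1 = tuple(a + b + l for a, b, l in zip(cokernel, c, literal))
        found.append((d1, f1))
        found.extend(kernels(f1, nvars, i, d1))  # look for kernels inside f1
    return found


# P1 = x^3*y + x^2*y^2*z over the literal order (x, y, z)
P1 = {(3, 1, 0): 1, (2, 2, 1): 1}
for co, k in kernels(P1, 3):
    print("co-kernel", co, "kernel", k)
```

For P1 the only pair found is co-kernel x^2y with kernel x + yz; dividing by y is skipped because x is in C, exactly as in the walk-through.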

Concept of kernels and co-kernels
Theorem: two expressions f and g can have a multiple term common subexpression iff there are two kernels K_f and K_g having a multiple term intersection
Detection of multiple term common subexpressions by intersection of sets of kernels
Each co-kernel : kernel pair represents a possible factorization
– e.g. x^3y + x^2y^2z = [x^2y](x + yz)
The set of kernels is a subset of all possible subexpressions

All Kernels and Co-Kernels
Which kernels to choose?

Kernel Cube Matrix (KCM)
One row for each kernel generated
One column for each distinct kernel cube
Each non-zero element represents a term (term number in parentheses)

Co-kernel    x      yz     4      -yz    -x
4            1 (3)  1 (4)  0      0      0
x^2y         1 (1)  1 (2)  0      0      0
x            0      0      1 (3)  1 (5)  0
xy           0      0      1 (6)  0      1 (7)
yz           0      0      1 (4)  0      1 (5)

Finding Kernel Intersections (Distill Algorithm)
Each kernel intersection or factor appears as a rectangle
– Rectangle: set of rows and columns such that all elements are '1'
Value of a rectangle = weighted sum of the energy savings of the different operations
Goal: maximum valued rectangular covering of the KCM
Greedy heuristic: covering by prime rectangles

Modeling value function of a rectangle
Formula for the weighted sum of energy savings on selection of a rectangle
– R = # of rows; C = # of columns
– M(R_i) = # of multiplications in row (co-kernel) i
– M(C_i) = # of multiplications in column (kernel cube) i
– m = ratio of the average energy consumption of a multiplication to that of an addition in the target library
Value = [formula not preserved in the transcript]

Distill Algorithm
First rectangle in the KCM: rows (co-kernels) 4 and x^2y, columns (kernel cubes) x and yz
d1 = (x + yz)
4x + 4yz = 4d1
x^3y + x^2y^2z = x^2yd1
Saves 5 multiplications and 1 addition
Value = 201 units (m = 40)

Distill Algorithm
Remove covered terms
Next rectangle in the KCM: row (co-kernel) xy, columns (kernel cubes) 4 and -x
d2 = 4 - x
4xy - x^2y = xyd2
Saves 2 multiplications
Value = 80

Distill Algorithm
The Distill algorithm exits when no more kernel intersections can be found
P1 = x^2yd1
P2 = 4d1 - xyz
P3 = xyd2
d1 = x + yz
d2 = 4 - x
Can further optimize by finding single cube intersections
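The greedy covering in the Distill walk-through can be reproduced on the Kernel Cube Matrix of the running example. The sketch below rests on several assumptions that are not in the slides: the value formula is a reconstruction chosen because it matches both quoted values (201 and 80 at m = 40), the `MULTS` table of multiplication counts per co-kernel and kernel cube is filled in by hand, and rectangles are found by exhaustive search over row subsets rather than by the prime-rectangle heuristic the slides mention. All names are hypothetical.

```python
from itertools import combinations

# KCM of the running example: co-kernel -> {kernel cube: term number}
KCM = {
    "4":    {"x": 3, "yz": 4},
    "x^2y": {"x": 1, "yz": 2},
    "x":    {"4": 3, "-yz": 5},
    "xy":   {"4": 6, "-x": 7},
    "yz":   {"4": 4, "-x": 5},
}

# Multiplications needed to form each co-kernel / kernel cube (assumed)
MULTS = {"4": 0, "x": 0, "-x": 0, "yz": 1, "-yz": 1, "xy": 1, "x^2y": 2}


def rectangle_value(rows, cols, m):
    """Assumed value: m * multiplications saved + additions saved."""
    R, C = len(rows), len(cols)
    mults_saved = ((C - 1) * (R + sum(MULTS[r] for r in rows))
                   + (R - 1) * sum(MULTS[c] for c in cols))
    adds_saved = (R - 1) * (C - 1)
    return m * mults_saved + adds_saved


def best_rectangle(kcm, covered, m):
    """Exhaustively search row subsets for the highest-valued rectangle."""
    best = None
    names = sorted(kcm)
    for size in range(1, len(names) + 1):
        for rows in combinations(names, size):
            # columns still usable in every chosen row
            live = [{c for c, t in kcm[r].items() if t not in covered}
                    for r in rows]
            cols = sorted(set.intersection(*live))
            if len(cols) < 2:
                continue  # need a multiple-term intersection
            value = rectangle_value(rows, cols, m)
            if best is None or value > best[0]:
                best = (value, rows, tuple(cols))
    return best


def distill(kcm, m=40):
    """Greedy covering: pick the best rectangle, remove its terms, repeat."""
    covered, picks = set(), []
    while True:
        best = best_rectangle(kcm, covered, m)
        if best is None or best[0] <= 0:
            return picks
        picks.append(best)
        _, rows, cols = best
        covered.update(kcm[r][c] for r in rows for c in cols)


for value, rows, cols in distill(KCM):
    print(value, rows, cols)
```

The first pick is the rectangle with co-kernels 4 and x^2y and kernel cubes x and yz (d1 = x + yz); once its terms are removed, the only remaining multiple-term rectangle is co-kernel xy with kernel cubes 4 and -x (d2 = 4 - x). Exhaustive search is affordable here only because the matrix has five rows.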

Finding single cube intersections (Condense algorithm)
Form the Cube Literal Matrix (CLM)
– One row for each cube
– One column for each literal
– E.g. for the 2 cubes F1 = a^4b^3c and F2 = a^2b^4c^2
        a  b  c
  F1    4  3  1
  F2    2  4  2

Finding single cube intersections (Condense algorithm)
Each (single term) common subexpression appears as a rectangle
– Rectangle: set of rows and columns where all elements are non-zero
Value of a rectangle is the number of multiplications saved by selecting it
– C = cube corresponding to the rectangle
– Value = (Rows - 1) * ((ΣC[i]) - 1)
A maximum valued rectangular covering gives the minimum number of multiplications
Use greedy iterative covering by prime rectangles
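The saving from extracting a common cube can also be obtained by counting multiplications directly, before and after the extraction. The sketch below does exactly that; cubes are dictionaries from literal name to power, and the function names are made up for this example. It is applied to the two-cube example a^4b^3c and a^2b^4c^2, and to the three cubes that contain xy after the Distill step.

```python
def common_cube(cubes):
    """Largest cube dividing every cube: column-wise minimum of the CLM."""
    literals = set.intersection(*(set(c) for c in cubes))
    return {lit: min(c[lit] for c in cubes) for lit in literals}


def mults(cube):
    """Multiplications needed to form a cube from its literals."""
    return max(sum(cube.values()) - 1, 0)


def multiplications_saved(cubes, common):
    """Count multiplications before and after extracting `common`."""
    before = sum(mults(c) for c in cubes)
    size = sum(common.values())
    # After extraction each cube is (new literal) * (remaining literals),
    # which costs one multiplication per remaining literal occurrence.
    after = mults(common) + sum(sum(c.values()) - size for c in cubes)
    return before - after


# Two-cube example: F1 = a^4 b^3 c, F2 = a^2 b^4 c^2
f1 = {"a": 4, "b": 3, "c": 1}
f2 = {"a": 2, "b": 4, "c": 2}
c = common_cube([f1, f2])
print(c, multiplications_saved([f1, f2], c))

# Cubes containing xy after Distill: x^2*y*d1, x*y*z, x*y*d2
cubes = [{"x": 2, "y": 1, "d1": 1}, {"x": 1, "y": 1, "z": 1},
         {"x": 1, "y": 1, "d2": 1}]
print(multiplications_saved(cubes, {"x": 1, "y": 1}))
```

Extracting xy from the three cubes saves 2 multiplications, which agrees with the figure quoted for the running example.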

Cube Literal Matrix (Condense Algorithm)
CLM for our example after the Distill algorithm (columns: Term, +/-, x, y, z, 4, d1, d2; one row per cube)
C = xy
Save 2 multiplications by extracting xy

Condense Algorithm
CLM after extracting xy (columns: Term, +/-, x, y, z, 4, d1, d2; one row per cube)
No more favorable cube intersections found

Final Implementation
– Total 7 multiplications, 3 additions/subtractions
– Savings of 5 multiplications, 1 addition/subtraction compared to CSE
Impossible to obtain such results using conventional techniques
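The operation counts can be checked by instrumenting the arithmetic. In the sketch below, `Op` is a hypothetical wrapper that counts multiplications and additions/subtractions. The original expressions are taken from the Distill slides, while the final implementation is an assumed reading of the result: the Distill output with the common cube xy extracted into d3, a name chosen for this example.

```python
class Op:
    """Integer wrapper that counts multiplications and additions."""
    mults = 0
    adds = 0

    def __init__(self, value):
        self.value = value

    @staticmethod
    def _raw(other):
        return other.value if isinstance(other, Op) else other

    def __mul__(self, other):
        Op.mults += 1
        return Op(self.value * Op._raw(other))

    __rmul__ = __mul__  # multiplication by a constant also counts

    def __add__(self, other):
        Op.adds += 1
        return Op(self.value + Op._raw(other))

    def __sub__(self, other):
        Op.adds += 1
        return Op(self.value - Op._raw(other))

    def __rsub__(self, other):
        Op.adds += 1
        return Op(Op._raw(other) - self.value)


def original(x, y, z):
    p1 = x*x*x*y + x*x*y*y*z
    p2 = 4*x + 4*y*z - x*y*z
    p3 = 4*x*y - x*x*y
    return p1, p2, p3


def final(x, y, z):
    d3 = x * y        # single cube extracted by Condense (assumed name)
    d1 = x + y * z    # kernel intersection found by Distill
    d2 = 4 - x
    p1 = x * d3 * d1
    p2 = 4 * d1 - d3 * z
    p3 = d3 * d2
    return p1, p2, p3


def count_ops(fn, x, y, z):
    """Run fn on counted operands; return (values, mults, adds)."""
    Op.mults = Op.adds = 0
    results = fn(Op(x), Op(y), Op(z))
    return [r.value for r in results], Op.mults, Op.adds


print(count_ops(original, 2, 3, 5))
print(count_ops(final, 2, 3, 5))
```

Both versions produce the same three values, with 16 multiplications and 4 additions/subtractions for the original and 7 and 3 for the factored version.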

Experimental setup
Polynomials used in computer graphics and signal processing
1.0 µ technology library, characterized for power consumption
Synthesized using Synopsys Design Compiler™
– Min hardware constraints (1 adder + 1 multiplier)
– Med hardware constraints (max 4 multipliers)

Experimental setup
Estimated power using Synopsys Power Compiler™ for random inputs, using an RTL simulator (VCS™)
Compared energy consumption with CSE and Horner form
Compared energy after voltage scaling

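Results (Comparing operations)
[Table: multiplications (M) and additions (A) for Original, CSE, Horner and Our Technique; one row per example plus an average row. Values not preserved in the transcript.]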

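Results (Min Hardware constraints)
[Table: Area, Energy, Energy-Delay and Energy (Scaled V), each with sub-columns C and H; one row per example plus an average row. Values not preserved in the transcript.]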

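Results (Med Hardware constraints)
[Table: Area, Energy, Energy-Delay and Energy (Scaled V), each with sub-columns C and H; one row per example plus an average row. Values not preserved in the transcript.]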

Conclusions
Technique to reduce the number of operations in polynomial expressions
Large savings in energy consumption observed over CSE and Horner methods
Need to consider scheduling and resource allocation to obtain further improvements

Conclusions Thank you!! Questions ???

Extra slides

Finding Kernel Intersections (Distill Algorithm)
Worst case scenario for the Distill algorithm
Number of prime rectangles is exponential in the number of rows/columns
– Heuristic methods to find the best prime rectangle
– In practice polynomial expressions are not so large