Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions IEEE/ACM Asia South Pacific Design Automation.


Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions
IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005
Anup Hosangadi, Ryan Kastner (ECE Department, UCSB)
Farzan Fallah (Advanced CAD Research, Fujitsu Labs of America)

Outline
  Introduction
  Related Work
  Polynomial transformation
  Common subexpression elimination
  Results
  Conclusions

Introduction
Multiplications by constants are encountered in many application areas
  DSP transforms in audio, video, and image processing (DFT, DCT, IDCT, etc.)
  Filtering operations in communication (FIR, IIR filters)
  Multiple Input Multiple Output (MIMO) systems
  Polynomials in computer graphics

Introduction
Multiplication is expensive in hardware
Decompose constant multiplications into shifts and additions
  13*X = (1101)_2 * X = X + X<<2 + X<<3
Signed digits can reduce the number of additions/subtractions
  Canonical Signed Digits (CSD) (Knuth’74)
  (57)_10 = (0111001)_2 = (100-1001)_CSD
Further reduction is possible by common subexpression elimination
  Up to 50% reduction (R. Hartley, TCS’96)
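The shift-and-add decomposition on this slide can be sketched in Python. This is a minimal illustration rather than the authors' code: the names `to_csd` and `shift_add_terms` are hypothetical, and the digit recurrence is the standard non-adjacent-form conversion, assumed here to match the CSD digits on the slide.

```python
def to_csd(n):
    """Return signed digits of a non-negative integer, least significant first.

    Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            # +1 when n = 1 (mod 4), -1 when n = 3 (mod 4)
            d = 2 - (n % 4)
        digits.append(d)
        n = (n - d) // 2
    return digits


def shift_add_terms(n):
    """List (sign, shift) pairs such that n*X == sum(sign * (X << shift))."""
    return [(d, i) for i, d in enumerate(to_csd(n)) if d != 0]


if __name__ == "__main__":
    print(to_csd(57))           # digits for 57, least significant first
    print(shift_add_terms(13))  # signed shifts that rebuild 13*X
```

For 57 the sketch yields three nonzero digits (64 - 8 + 1), compared with four ones in the plain binary form, so one addition is saved.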

Introduction
Common subexpressions = common digit patterns
  F_1 = 7*X = (0111)*X = X + X<<1 + X<<2
  F_2 = 13*X = (1101)*X = X + X<<2 + X<<3
  Cost: 4+, 4<<
Common pattern "0101" => X + X<<2
  D_1 = X + X<<2
  F_1 = D_1 + X<<1
  F_2 = D_1 + X<<3
  Cost: 3+, 3<<
Good for a single variable: FIR filters (transposed form)
Multiple variables (DFT, DCT, etc.)?

Related Work
Simple bipartite matching (Potkonjak et al., TCAD’95)
  (10101) and (01101) => common pattern = "101"
  (10010) and (010010) => cannot detect pattern "1001"
Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
  (10010) and (010010) => common pattern "1001"
Exhaustive enumeration of all digit patterns (Pasko et al., TCAD’99)
  (1011) => "0011", "1001", "1010", "0101", "1011"

Related Work
Extending techniques for multiple variables (Potkonjak et al., TCAD’95)
  [Y_1]   [a_11 a_12 a_13]   [X_1]
  [Y_2] = [a_21 a_22 a_23] x [X_2]
  [Y_3]   [a_31 a_32 a_33]   [X_3]
[Figure labels: "All distinct S_ij X_j and C_ik D_k"; outputs Y_1, Y_2, Y_3]

Related Work
Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP’04)
  Polynomial transformation of linear systems
  Uses rectangle covering methods
Cannot find subexpressions with reversed signs
  e.g. (X_1 - X_2<<1) ≠ (X_2<<1 - X_1)
  A common occurrence when signed digits are used
Rectangle covering has exponential complexity
Method to overcome these limitations?

Related Work
Algebraic methods in multi-level logic synthesis (MLLS)
  Reduce the literal count in a set of Boolean expressions
  Factoring, decomposition: established algebraic techniques
  Typically used for thousands of variables and literals
Apply these methods to optimize linear systems?
  D_1 = X_1 + X_2<<2
  Y_1 = D_1 + D_1<<3 + X_1<<3
  Y_2 = D_1 + X_2<<2

Linear systems and polynomial transformation
View linear systems as a set of arithmetic expressions
  Expressions consist of +, -, << operators
Develop a methodology for extracting common subexpressions
Polynomial formulation
  C × X = Σ_i (±X × L^i)
  (14)_10 × X = (1110)_2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
              = (100-10)_CSD × X = XL^4 - XL^1

Linear Systems and polynomial transformation
H.264 Integer Transform
  [Y_0]   [1  1  1  1]   [X_0]
  [Y_1] = [2  1 -1 -2] x [X_1]
  [Y_2]   [1 -1 -1  1]   [X_2]
  [Y_3]   [1 -2  2 -1]   [X_3]
Decomposing constant multiplications
  Y_0 = X_0 + X_1 + X_2 + X_3
  Y_1 = X_0<<1 + X_1 - X_2 - X_3<<1
  Y_2 = X_0 - X_1 - X_2 + X_3
  Y_3 = X_0 - X_1<<1 + X_2<<1 - X_3
Cost: 12+, 4<<

Linear Systems and polynomial transformation
H.264 Integer Transform
  [Y_0]   [1  1  1  1]   [X_0]
  [Y_1] = [2  1 -1 -2] x [X_1]
  [Y_2]   [1 -1 -1  1]   [X_2]
  [Y_3]   [1 -2  2 -1]   [X_3]
Polynomial transformation
  Y_0 = X_0 + X_1 + X_2 + X_3
  Y_1 = X_0 L + X_1 - X_2 - X_3 L
  Y_2 = X_0 - X_1 - X_2 + X_3
  Y_3 = X_0 - X_1 L + X_2 L - X_3
Cost: 12+, 4<<
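As a rough sketch of the polynomial transformation, the Python below turns each row of a constant matrix into a list of signed, shifted terms. The `(sign, name, shift)` tuple encoding and the function names are assumptions made for illustration. Plain binary expansion is used instead of CSD, which gives the same digits for the coefficients 1 and 2 that appear here. The matrix is the one implied by the four expressions on the slide.

```python
# Coefficient matrix implied by the slide's expressions for Y_0..Y_3.
H264 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def expression_terms(row, names):
    """Expand one matrix row into (sign, name, shift) terms.

    A term (s, "X1", k) stands for s * X1 * L**k, i.e. s * (X1 << k).
    Binary digits of |coefficient| are used (an illustrative simplification).
    """
    terms = []
    for coeff, name in zip(row, names):
        sign = 1 if coeff > 0 else -1
        mag, shift = abs(coeff), 0
        while mag:
            if mag & 1:
                terms.append((sign, name, shift))
            mag >>= 1
            shift += 1
    return terms


def operation_count(expressions):
    """Return (additions/subtractions, shifts) for a list of term lists."""
    adds = sum(len(terms) - 1 for terms in expressions)
    shifts = sum(1 for terms in expressions for t in terms if t[2] > 0)
    return adds, shifts


NAMES = ["X0", "X1", "X2", "X3"]
EXPRESSIONS = [expression_terms(row, NAMES) for row in H264]

if __name__ == "__main__":
    for i, terms in enumerate(EXPRESSIONS):
        print(f"Y{i} =", terms)
    print(operation_count(EXPRESSIONS))
```

Counting one operation per term after the first in each expression, and one shift per term with a nonzero exponent of L, reproduces the slide's 12 additions/subtractions and 4 shifts.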

Fx algorithm
Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD’92)
Popular as the Fast-Extract (Fx) algorithm
Expression f = gh + r
  g = (ab + c) => double-cube divisor
  g = ab => single-cube divisor
Fx algorithm for linear systems?

Two-term divisors
Obtained from every pair of terms in each expression
  Divide by the minimum exponent of L
  e.g. F = X_1 + X_2 L + X_3 L^3
    {+X_2 L, +X_3 L^3}: divide by L => (X_2 + X_3 L^2)
    Divisors = (X_1 + X_2 L), (X_1 + X_3 L^3), (X_2 + X_3 L^2)
Two divisors intersect if the terms involved are distinct
  (X_1 - X_2 L) ∩ (X_1 - X_2 L) = φ
  (X_1 - X_2 L) ∩ (-X_1 + X_2 L) = φ (reversed signs allowed!)
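The divisor generation described on this slide can be sketched as follows. This is an illustrative sketch under stated assumptions: a term is encoded as a hypothetical `(sign, name, exponent)` tuple, and `same_up_to_sign` assumes both divisors list their terms in the same order.

```python
from itertools import combinations


def two_term_divisors(terms):
    """Build one divisor per pair of terms, dividing out the minimum exponent of L."""
    divisors = []
    for (s1, v1, e1), (s2, v2, e2) in combinations(terms, 2):
        m = min(e1, e2)
        divisors.append(((s1, v1, e1 - m), (s2, v2, e2 - m)))
    return divisors


def same_up_to_sign(d1, d2):
    """True if d2 equals d1 or d1 with every sign reversed."""
    negated = tuple((-s, v, e) for s, v, e in d1)
    return d1 == d2 or negated == d2


# F = X1 + X2*L + X3*L^3, the example from the slide.
F = [(1, "X1", 0), (1, "X2", 1), (1, "X3", 3)]

if __name__ == "__main__":
    for d in two_term_divisors(F):
        print(d)
```

Treating a divisor and its negation as the same candidate is what lets the method match subexpressions with reversed signs.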

Two-term divisors
Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors
Many divisors have intersections; which one to choose?
  Greedy selection of the divisor with the most intersections
Selecting a divisor changes the expressions
  Perform concurrent decomposition of expressions

Algorithm (Step 1): creating the set of divisors
  {Divisors} = φ;
  for each expression P_i {
      {D_new} = divisors for P_i;
      {Divisors} = {Divisors} ∪ {D_new};
      update frequency statistics of {Divisors};
  }

Algorithm (Step 2): common subexpression elimination
  {Divisors} = set of all 2-term divisors;
  while (intersections present) {
      find Best_Divisor in {Divisors};
      {T} = set of terms involved in the intersection;
      {D} = set of divisors involving any term in {T};
      {Divisors} = {Divisors} - {D};
      rewrite expressions;
      {D_new} = new divisors involving new terms;
      {Divisors} = {Divisors} ∪ {D_new};
  }
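The greedy loop can be sketched in Python as below. This is a simplified sketch, not the authors' implementation: it recomputes every divisor on each pass instead of updating the divisor set incrementally, breaks ties by first occurrence, and uses hypothetical names (`canonical`, `best_divisor`, `eliminate`, `cost`) and a `(sign, name, shift)` term encoding chosen for illustration.

```python
from itertools import combinations

# A term (s, v, k) stands for s * v * L**k.


def canonical(t1, t2):
    """Return (key, sign, shift) for the divisor formed by two terms.

    The key has the minimum shift divided out and a positive leading term,
    so a divisor and its negation share one key.
    """
    a, b = sorted((t1, t2), key=lambda t: (t[1], t[2]))
    shift = min(a[2], b[2])
    sign = a[0]
    key = ((1, a[1], a[2] - shift), (sign * b[0], b[1], b[2] - shift))
    return key, sign, shift


def best_divisor(exprs):
    """Pick the divisor with the most non-overlapping occurrences."""
    occurrences = {}
    for name, terms in exprs.items():
        for i, j in combinations(range(len(terms)), 2):
            key, sign, shift = canonical(terms[i], terms[j])
            occurrences.setdefault(key, []).append((name, i, j, sign, shift))
    best_key, best_uses = None, []
    for key, occs in occurrences.items():
        used, uses = set(), []
        for name, i, j, sign, shift in occs:
            if (name, i) not in used and (name, j) not in used:
                used.update({(name, i), (name, j)})
                uses.append((name, i, j, sign, shift))
        if len(uses) > len(best_uses):
            best_key, best_uses = key, uses
    return best_key, best_uses


def eliminate(exprs):
    """Greedily extract two-term divisors; return (expressions, definitions)."""
    exprs = {name: list(terms) for name, terms in exprs.items()}
    defs = {}
    while True:
        key, uses = best_divisor(exprs)
        if len(uses) < 2:
            return exprs, defs
        d = f"D{len(defs)}"
        defs[d] = list(key)
        by_expr = {}
        for name, i, j, sign, shift in uses:
            by_expr.setdefault(name, []).append((i, j, sign, shift))
        for name, group in by_expr.items():
            drop = {k for i, j, _, _ in group for k in (i, j)}
            kept = [t for k, t in enumerate(exprs[name]) if k not in drop]
            exprs[name] = kept + [(sign, d, shift) for _, _, sign, shift in group]


def cost(exprs, defs):
    """Return (additions/subtractions, shifts) over expressions and definitions."""
    groups = list(exprs.values()) + list(defs.values())
    adds = sum(len(g) - 1 for g in groups)
    shifts = sum(1 for g in groups for t in g if t[2] > 0)
    return adds, shifts


# H.264 integer transform in polynomial form.
H264 = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}

if __name__ == "__main__":
    new_exprs, new_defs = eliminate(H264)
    print(new_defs)
    print(new_exprs)
    print(cost(H264, {}), "->", cost(new_exprs, new_defs))
```

Each extraction removes two terms per occurrence and adds one, so the total term count shrinks on every pass and the loop terminates. The divisors may be numbered in a different order than in the worked example, since tie-breaking is arbitrary.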

Algorithm complexity
M×M constant matrix; N digits of precision
  Y_0 = X_0 + X_0 L + ... + X_{M-1} L^3 + X_{M-1}
  Y_1 = ...
  ...
  Y_{M-1} = ...
O(MN) terms in each expression => O(M^2 N^2) divisors

Algorithm (Step 1): complexity
  {Divisors} = φ;
  for each expression P_i {
      {D_new} = divisors for P_i;
      {Divisors} = {Divisors} ∪ {D_new};
      update frequency statistics of {Divisors};
  }
Complexity annotations on the slide: O(M^2 N^2) distinct divisors; O(M^2 N^2); O(M^3 N^2)

Algorithm (Step 2): complexity
  {Divisors} = set of all 2-term divisors;
  while (intersections present) {
      find Best_Divisor in {Divisors};
      {T} = set of terms involved in the intersection;
      {D} = set of divisors involving any term in {T};
      {Divisors} = {Divisors} - {D};
      rewrite expressions;
      {D_new} = new divisors involving new terms;
      {Divisors} = {Divisors} ∪ {D_new};
  }
Complexity annotation on the slide: O(M^2 N^2)

Algorithm: H.264 example
  Y_0 = X_0 + X_1 + X_2 + X_3
  Y_1 = X_0 L + X_1 - X_2 - X_3 L
  Y_2 = X_0 - X_1 - X_2 + X_3
  Y_3 = X_0 - X_1 L + X_2 L - X_3
>> Select D_0 = (X_0 + X_3)

Algorithm: H.264 example
  Y_0 = D_0 + X_1 + X_2
  Y_1 = X_0 L + X_1 - X_2 - X_3 L
  Y_2 = D_0 - X_1 - X_2
  Y_3 = X_0 - X_1 L + X_2 L - X_3
>> Select D_1 = (X_1 - X_2)

Algorithm: H.264 example
  Y_0 = D_0 + X_1 + X_2
  Y_1 = X_0 L + D_1 - X_3 L
  Y_2 = D_0 - X_1 - X_2
  Y_3 = X_0 - D_1 L - X_3
>> Select D_2 = (X_1 + X_2)

Algorithm: H.264 example
  Y_0 = D_0 + D_2
  Y_1 = X_0 L + D_1 - X_3 L
  Y_2 = D_0 - D_2
  Y_3 = X_0 - D_1 L - X_3
>> Select D_3 = (X_0 - X_3)

Final Implementation
Extracting 4 divisors
  D_0 = X_0 + X_3      Y_0 = D_0 + D_2
  D_1 = X_1 - X_2      Y_1 = D_1 + D_3 L
  D_2 = X_1 + X_2      Y_2 = D_0 - D_2
  D_3 = X_0 - X_3      Y_3 = D_3 - D_1 L
Cost: 8+, 2<<
  Original: 12+, 4<<
  Rectangle covering: 10+, 3<<
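A quick way to check that the shared form computes the same transform is to evaluate both versions on sample inputs. The function names below are hypothetical, and L is modelled as a left shift by one bit on Python integers.

```python
from itertools import product


def h264_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions, 4 shifts."""
    return (
        x0 + x1 + x2 + x3,
        (x0 << 1) + x1 - x2 - (x3 << 1),
        x0 - x1 - x2 + x3,
        x0 - (x1 << 1) + (x2 << 1) - x3,
    )


def h264_shared(x0, x1, x2, x3):
    """After extracting D0..D3: 8 additions/subtractions, 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return (d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1))


def agree_on(values):
    """Compare both versions on every 4-tuple drawn from values."""
    return all(h264_direct(*x) == h264_shared(*x) for x in product(values, repeat=4))


if __name__ == "__main__":
    print(agree_on(range(-3, 4)))
```

Because both versions are linear in the inputs, agreement on the four unit vectors already implies agreement everywhere; the wider sweep is only a sanity check.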

Experimental Setup
Goal
  Reduction in the number of additions/subtractions
  Effect on area/latency after synthesis
  Simulate designs to estimate power consumption
Transforms
  DCT, IDCT, DFT, DST, DHT
  8x8 constant matrices
  16 digits of precision (CSD representation)
Compare with
  Potkonjak (TCAD’95)
  RESANDS (Nguyen et al., TVLSI 2000)
  Rectangle covering (A. Hosangadi et al., ASAP’04)

Experimental Results
[Table: number of additions/subtractions for DCT, IDCT, RealDFT, ImagDFT, DST, DHT, and Average under Original (I), Potkonjak (II), RESANDS (III), Rectangle Covering (IV), and Two-term CSE (V); the numeric entries did not survive extraction. Run time row: 0.81s, 0.08s]

Experimental results
Synthesis results (minimum latency constraints)
[Table: area (library units) and latency (clock cycles) for DCT, IDCT, R-DFT, I-DFT, DST, DHT, and Average under (III) RESANDS, (IV) Rect. Covering, and (V) 2-term CSE; the numeric entries did not survive extraction]

Experimental results
Power consumption
[Table: power consumption (µW) for DCT, IDCT, R-DFT, I-DFT, DST, DHT, and Average under (III) RESANDS, (IV) Rect. Covering, and (V) 2-term CSE; the numeric entries did not survive extraction]

Conclusions
A new technique for eliminating common subexpressions in linear systems
Fewer operations than known methods
Much faster than rectangle covering
Combine with scheduling on given resources

Thank you Questions??