Logic Restructuring for Timing Optimization


Outline
- Definitions and problem statement
- Overview of techniques (motivated by adders)
- Tree height reduction (THR)
- Generalized bypass transform (GBX)
- Generalized select transform (GST)
- Partial collapsing (?)

Timing Optimization
Factors determining delay of circuit:
- Underlying circuit technology
  - Circuit type (e.g. domino, static CMOS, etc.)
  - Gate type
  - Gate size
- Logical structure of circuit
  - Length of computation paths
  - False paths
  - Buffering
- Parasitics
  - Wire loads
  - Layout

Problem Statement
Given:
- Initial circuit function description
- Library of primitive functions
- Performance constraints (arrival/required times)
Generate an implementation of the circuit using the primitive functions, such that:
- performance constraints are met
- circuit area is minimized
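The arrival/required-time constraints in the problem statement are commonly checked through slack (required time minus arrival time). The following is a minimal sketch, assuming a fixed delay per gate; the function name `slacks` and the tiny example network are hypothetical and not taken from the slides.

```python
from graphlib import TopologicalSorter

def slacks(fanins, delay, input_arrival, output_required):
    """Return (arrival, required, slack) for a combinational DAG.

    fanins maps each gate to the nodes driving it. Primary inputs must
    appear in input_arrival; a fixed per-gate delay is an assumption.
    """
    order = list(TopologicalSorter(fanins).static_order())  # fanins first
    arrival = dict(input_arrival)
    for node in order:
        if node not in arrival:
            arrival[node] = delay[node] + max(arrival[f] for f in fanins[node])
    # Required times propagate backward from the outputs.
    required = {node: float("inf") for node in order}
    required.update(output_required)
    for node in reversed(order):
        for f in fanins.get(node, []):
            required[f] = min(required[f], required[node] - delay[node])
    slack = {node: required[node] - arrival[node] for node in order}
    return arrival, required, slack

# Hypothetical example: x = gate(a, b), y = gate(x, c), unit gate delays.
fanins = {"x": ["a", "b"], "y": ["x", "c"]}
arrival, required, slack = slacks(
    fanins,
    delay={"x": 1, "y": 1},
    input_arrival={"a": 0, "b": 0, "c": 0},
    output_required={"y": 2},
)
```

Nodes with zero slack (here a, b, x, y) lie on the critical path; input c has one unit of slack, so logic on that side can be slowed down or shared without violating the constraint.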

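Current Design Process
Design flow, from specification to layout:
- Behavioral description
- Behavior optimization (scheduling), producing logic and latches
- Partitioning (retiming), producing logic equations
- Logic synthesis: technology independent optimization, then technology mapping, using the gate library, performance constraints, and delay models; produces the gate netlist
- Timing driven place and route, producing the layout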

Technology mapping for delay
[Figure: function tree and buffer tree.]

Overview of Solutions for delay
- Circuit re-structuring
  - Rescheduling operations to reduce time of computation
- Implementation of function trees (technology mapping)
  - Selection of gates from library
    - Minimum delay (load independent model - Kukimoto)
    - Minimize delay and area (Jongeneel, DAC'00) (combines Lehman-Watanabe and Kukimoto)
- Implementation of buffer trees
  - Touati (LT-trees)
  - Singh
- Resizing
Focus here on circuit re-structuring.

Circuit re-structuring
Approaches:
- Local: mimic optimization techniques in adders
  - Carry lookahead (THR, tree height reduction)
  - Conditional sum (GST transformation)
  - Carry bypass (GBX transformation)
- Global: reduce depth of entire circuit
  - Partial collapsing
  - Boolean simplification

Re-structuring methods
Performance is measured by levels, sensitizable paths, or technology dependent delays.
- Level based optimizations:
  - Tree height reduction (Singh '88)
  - Partial collapsing and simplification (Touati '91)
  - Generalized select transform (Berman '90)
- Sensitizable paths:
  - Generalized bypass transform (McGeer '91)

Re-structuring for delay: tree-height reduction
[Figure: a network over inputs a to g with internal nodes h, i, j, k, l, m, n, annotated with arrival levels, shown before and after collapsing. Labels: critical region; collapsed critical region (node n'); duplicated logic.]

Restructuring for delay: path reduction (Singh '88)
[Figure: the same network over inputs a to g, showing the collapsed critical region (node n') and the duplicated logic, before and after path reduction. Label: new delay = 5.]
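To illustrate the idea behind tree height reduction on the simplest possible case, the sketch below rebuilds a chain of one associative 2-input operator (for example AND) as a tree, always combining the two earliest-arriving signals first. This greedy rule is a standard textbook technique for a single operator and is shown here only as an assumption-laden illustration; it is not Singh's procedure for general networks, and the names `balance` and `chain_delay` are hypothetical.

```python
import heapq
from itertools import count

def balance(arrivals, gate_delay=1):
    """Combine signals pairwise, earliest first; return (output time, tree)."""
    tie = count()  # tie-breaker so heap never compares names with tuples
    heap = [(t, next(tie), name) for name, t in arrivals.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        t1, _, e1 = heapq.heappop(heap)
        t2, _, e2 = heapq.heappop(heap)
        heapq.heappush(heap, (max(t1, t2) + gate_delay, next(tie), (e1, e2)))
    t, _, tree = heap[0]
    return t, tree

def chain_delay(arrivals, order, gate_delay=1):
    """Output time of the original chain ((x0 op x1) op x2) ..."""
    t = arrivals[order[0]]
    for name in order[1:]:
        t = max(t, arrivals[name]) + gate_delay
    return t

# Hypothetical arrival times: input a is late, the others arrive at time 0.
arrivals = {"a": 4, "b": 0, "c": 0, "d": 0, "e": 0}
before = chain_delay(arrivals, ["a", "b", "c", "d", "e"])
after, tree = balance(arrivals)
```

In this example the late input a passes through four gates in the chain but only one gate in the rebuilt tree, so the output time drops from 8 to 5.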

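Generalized bypass transform (GBX) (McGeer '91)
- Make critical path false
- Speed up the circuit
- Bypass logic of critical path(s)
[Figure: a critical path fm = f, fm+1, …, fn = g, before and after the transform. After the transform, a multiplexor controlled by the Boolean difference dg/df produces the new output g'. Label: s-a-0 redundant.]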
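The bypass idea can be checked on the carry chain of a small ripple-carry block, where the Boolean difference of the carry-out with respect to the carry-in is 1 exactly when every stage propagates. The sketch below is a hypothetical illustration (function names and the 3-bit width are assumptions): it computes the Boolean difference as g(f=1) XOR g(f=0) and uses it as the multiplexor select.

```python
from itertools import product

def ripple_carry_out(a, b, c_in):
    """Carry-out of a ripple-carry block; a and b are tuples of bits."""
    c = c_in
    for ai, bi in zip(a, b):
        c = (ai & bi) | ((ai ^ bi) & c)  # generate OR (propagate AND carry)
    return c

def boolean_difference(a, b):
    """d(c_out)/d(c_in): 1 when c_out is sensitive to c_in."""
    return ripple_carry_out(a, b, 1) ^ ripple_carry_out(a, b, 0)

def bypassed_carry_out(a, b, c_in):
    """Multiplexor: pass c_in straight through when the path is sensitized."""
    if boolean_difference(a, b):
        return c_in
    return ripple_carry_out(a, b, c_in)

def check(width=3):
    """Exhaustively compare the bypassed circuit with the original."""
    for bits in product((0, 1), repeat=2 * width + 1):
        a, b, c_in = bits[:width], bits[width:2 * width], bits[-1]
        all_propagate = int(all(ai ^ bi for ai, bi in zip(a, b)))
        if boolean_difference(a, b) != all_propagate:
            return False
        if bypassed_carry_out(a, b, c_in) != ripple_carry_out(a, b, c_in):
            return False
    return True

ok = check()
```

Because the multiplexor output always equals the original carry-out, forcing the select to 0 never changes the function, which is why the stuck-at-0 fault on the select is redundant.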

GBX and KMS transform
- GBX gives little area increase, BUT it creates an untestable fault (on the control input to the multiplexor).
- KMS transform (removes false paths without increasing delay):
  - fk is the last node on the false path that fans out.
  - Duplicate the false path: {f1, …, fk} -> {f'1, …, f'k}.
  - f'j fans out to every fanout of fj except fj+1, and fj fans out only to fj+1.
  - Set the f0 input to f1 to a controlling value and propagate the constant (possible because the path is false and does not fan out).
- KMS results:
  - The function of every node except f1, …, fk is unchanged.
  - Added k-1 nodes.
  - Area added is linear in the length of the false paths; in practice, a small area increase.

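KMS (Keutzer, Malik, Saldanha '90)
[Figure: the path fm, fm+1, …, fk, fk+1, …, fn before the transform; after the transform, a duplicated path f'm, f'm+1, …, f'k alongside the original path.]
Delay is not increased.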

End of lecture 20

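Generalized select transform (GST) (Berman '90)
- Late signal feeds multiplexor
[Figure: a circuit with late input a and inputs b, c, d, e, f, g, before and after the transform. After the transform, two copies of the logic are evaluated with a=0 and a=1, and a multiplexor selected by a produces out.]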
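The select transform amounts to a Shannon expansion on the late input: both cofactors are evaluated from the early inputs, and the late signal only drives the final multiplexor. A minimal sketch follows, assuming functions are modeled as Python callables with keyword arguments; `select_transform` and the example function `g` are hypothetical.

```python
from itertools import product

def select_transform(g, late):
    """Return a function equal to g that uses `late` only as a mux select."""
    def transformed(**inputs):
        early = {k: v for k, v in inputs.items() if k != late}
        out0 = g(**early, **{late: 0})  # cofactor with late input = 0
        out1 = g(**early, **{late: 1})  # cofactor with late input = 1
        return out1 if inputs[late] else out0  # late signal feeds the mux
    return transformed

# Hypothetical example function with late input a.
def g(a, b, c, d):
    return ((a & b) | c) ^ d

g_select = select_transform(g, "a")
equivalent = all(
    g(a=a, b=b, c=c, d=d) == g_select(a=a, b=b, c=c, d=d)
    for a, b, c, d in product((0, 1), repeat=4)
)
```

The cost is visible in the sketch: the logic is evaluated twice, which corresponds to the duplicated cofactor circuits in hardware.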

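GST vs GBX
[Figure: the GBX and GST versions of a circuit with late input a, shown side by side. GBX: a single multiplexor controlled by the Boolean difference dh/da produces the new output g'. GST: the logic is duplicated for a=0 and a=1, and a multiplexor selected by a produces out.]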

GST vs GBX
- Select transform appears to be more area efficient
- But Boolean difference generally more efficiently formed in practice
- No delay/speedup advantage for either transform
- Need one MUX per fanout in GST, only one MUX in GBX
[Figure: GST applied to a circuit with two outputs, out1 and out2; each output has its own multiplexor selected by a, fed by the a=0 and a=1 copies of the logic.]

Technology independent delay reductions
- Generally THR, GBX, GST (critical path based methods) work OK, but not great.
- Why are technology independent delay reductions hard? Lack of fast and accurate delay models:
  - # levels: fast but crude
  - # levels + correction term (fanout, wires, …): a little better, but still crude (what coefficients to use?)
  - Technology mapped: reasonable, but very slow
  - Place and route: better, but extremely slow
  - Silicon: best, but infeasibly slow (except for FPGAs)
  (Going down the list, the models get better but slower.)
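The first two delay models in the list can be sketched in a few lines. The example below is only an illustration under stated assumptions: each gate costs one level, and the correction term is a hypothetical `fanout_weight` multiplied by the gate's fanout count. The slides do not give coefficients, which is exactly the difficulty they point out.

```python
from graphlib import TopologicalSorter

def estimate_delay(fanins, fanout_weight=0.0):
    """Longest path with gate delay = 1 + fanout_weight * (fanout count).

    With fanout_weight = 0 this is simply the number of levels.
    """
    fanout = {}
    for node, ins in fanins.items():
        for f in ins:
            fanout[f] = fanout.get(f, 0) + 1
    arrival = {}
    for node in TopologicalSorter(fanins).static_order():
        ins = fanins.get(node, [])
        if not ins:
            arrival[node] = 0.0  # primary input
        else:
            gate = 1.0 + fanout_weight * fanout.get(node, 0)
            arrival[node] = max(arrival[f] for f in ins) + gate
    return max(arrival.values())

# Hypothetical network: x drives both y and z, which drive w.
fanins = {"x": ["a", "b"], "y": ["x", "c"], "z": ["x", "d"], "w": ["y", "z"]}
levels = estimate_delay(fanins)            # pure level count
corrected = estimate_delay(fanins, 0.5)    # levels + fanout correction
```

The two estimates can rank restructuring candidates differently, since a transformation that saves a level but adds fanout on the critical path may not help under the corrected model.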

Clustering/partial-collapse
- Traditional critical-path based methods require:
  - Well defined critical path
  - Good delay/slack information
- Problems:
  - Good delay information comes from mapper and layout
  - Delay estimates and models are weak
- Possible solutions:
  - Better delay modeling at technology independent level
  - Make speedup insensitive to actual critical paths and mapped delays

Clustering/partial-collapse
- Two-level circuits are fast
- Collapse circuit to 2-level, but:
  - Huge area penalty
  - Huge capacitive loading on inputs (can be much slower)
- To avoid huge area penalty:
  - Identify clusters of nodes
  - Each cluster has some fixed size
  - Perform collapse of each cluster
  - Simplify each node
- Details:
  - How to choose the clusters?
  - How to choose cluster size?
  - How to simplify each node?

Lawler's clustering algorithm
- Optimal in delay, for a given clustering size
- May duplicate nodes (hence possible area penalty)
  - Not optimal w.r.t. duplication
  - Use a heuristic
- Fast: O(m x k)
  - m = number of edges in network
  - k = maximum cluster size

Clustering algorithm - overview
Label phase (inputs to outputs; k is cluster size):
  If node u is an input, label(u) := L := 0
  Else L := max label of fanin of u
    If (# nodes in TFI(u) with label = L) >= k, label(u) := L+1 (otherwise label(u) := L)
Cluster phase (outputs to inputs):
  If node u is an output, L := infinity
  Else L := max label of fanouts of u
  If label(u) < L, create a new cluster with "root" u and with members all the nodes in TFI(u) with label = label(u)
Collapse phase (order independent):
  Collapse all nodes in a cluster into a single node
Note: a node may be in several clusters (causes area increase).
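The label and cluster phases above can be sketched directly. This is a straightforward, unoptimized illustration (it recomputes each transitive fanin, so it does not achieve the O(m x k) bound), and it rests on two assumptions not stated on the slide: TFI(u) excludes u itself, and primary inputs are not counted as cluster members. The function names and the example chain are hypothetical.

```python
from graphlib import TopologicalSorter

def transitive_fanin(fanins, node):
    """All nodes that reach `node`, excluding `node` itself."""
    seen, stack = set(), list(fanins.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(fanins.get(n, []))
    return seen

def label_and_cluster(fanins, outputs, k):
    """Return (label, clusters); clusters maps each root to its members."""
    order = list(TopologicalSorter(fanins).static_order())
    label = {}
    for u in order:  # label phase, inputs to outputs
        if u not in fanins:
            label[u] = 0  # primary input
            continue
        level = max(label[v] for v in fanins[u])
        same = [v for v in transitive_fanin(fanins, u)
                if v in fanins and label[v] == level]
        label[u] = level + 1 if len(same) >= k else level
    fanouts = {}
    for node, ins in fanins.items():
        for f in ins:
            fanouts.setdefault(f, []).append(node)
    clusters = {}
    for u in reversed(order):  # cluster phase, outputs to inputs
        if u not in fanins:
            continue
        if u in outputs:
            top = float("inf")
        else:
            top = max((label[v] for v in fanouts.get(u, [])),
                      default=float("inf"))
        if label[u] < top:
            clusters[u] = {u} | {v for v in transitive_fanin(fanins, u)
                                 if v in fanins and label[v] == label[u]}
    return label, clusters

# Hypothetical chain of six 2-input gates over primary inputs a to g.
fanins = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "d"],
          "g4": ["g3", "e"], "g5": ["g4", "f"], "g6": ["g5", "g"]}
label, clusters = label_and_cluster(fanins, outputs={"g6"}, k=3)
```

With k = 3 the six-level chain splits into two clusters of three gates each, rooted at g3 and g6, so the collapsed network has two levels.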

Example of clustering
[Figure: a network with node labels assigned for k = 3, and the resulting clustered network.]
- Result: Lawler's algorithm gives minimum depth circuit.
- Typically, we decompose the initial circuit into 2-input NANDs and inverters; the cluster size k then reflects the number of 2-input NANDs to be collapsed together.

Choosing k
- I(k): number of levels, given k
- d(k): duplication ratio (number of gates in cluster network divided by number of gates in original network)
- Determine k0 where k0/d(k0) ~ 2.0
- For every k from 2 to k0, compute d(k), I(k)
  - Use exhaustive enumeration: label and cluster (without collapse) for each k
  - Each iteration is O(|E|k)
- Choose k such that I(k) is minimized
  - Break ties using d(k): minimize d(k)
[Plot: d(k) and I(k) versus k, up to k0.]

Area recovery
- Area increase is due to node duplication; this occurs when a node is in multiple clusters.
- Two solutions:
  - Break clusters into smaller pieces off the critical path
  - After cluster and collapse, recover area

Relabeling procedure
- Attempt to increase node labels without exceeding cluster size
- In reverse topological order
- Start: assign
- Increase label(u) if
  - new-label(u) <= label(v) for each fanout v, and
  - new-label(u) = new-label(v) for each fanout v only if label(u) = label(v) before relabeling, and
  - no cluster size is violated

Relabeling example
[Figure: node labels before and after relabeling.]

Post-collapse area recovery
- Do algebraic factorization, but undo factorization if depth increases
- Full_simplify: only consider node v as a possible fanin of a node (v introduced by full_simplify using don't cares) if level of v < level of node
- Redundancy removal

Conclusions
- Variety of methods for delay optimization
- No single technique dominates (KJ Singh PhD thesis)
- When applied to a ripple-carry adder, we get:
  - Carry-lookahead adder (THR)
  - Carry-bypass adder (GBX)
  - Carry-select adder (GST)
  - ? (partial collapse)
- All techniques ignore false paths when assessing the delay and critical regions
- Can use KMS transform to eliminate false paths without increasing delay (area increase, however)