Improved Flop Tray-Based Design Implementation for Power Reduction

Presentation transcript:

Improved Flop Tray-Based Design Implementation for Power Reduction
Andrew B. Kahng, Jiajia Li and Lutong Wang
UC San Diego VLSI CAD Laboratory

Outline
- Background and Motivation
- Related Work
- Our Methodology
- Experimental Setup and Results
- Conclusion

Flop Tray Benefits (1)
- Flop tray = multi-bit flip-flop (MBFF)
- Applying flop trays significantly reduces the number of clock sinks
- Motivating "thought experiments"
  - Replacing all single-bit flops in a clock tree (N sinks) with 64-bit flop trays can reduce #clock buffers by $(N - N/64)/(N - 1) \approx 98.4\%$
  - In a clock tree with N = 100K and F = 8, replacing all single-bit flops with 64-bit flop trays can reduce #levels from 6 to 4
- Fewer clock buffers → smaller clock power
- Clock tree model with single-bit flops: N sinks, $\log_F N$ levels, each buffer has F fanouts, #buffers $\approx (N - 1)/(F - 1)$
- Clock tree model with K-bit flop trays: N/K sinks, $\log_F (N/K)$ levels, each buffer has F fanouts, #buffers $\approx (N/K - 1)/(F - 1)$
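
The arithmetic behind the two thought experiments can be checked with a short sketch. It assumes the idealized tree model from the slide (every buffer drives exactly F fanouts) and rounds buffer counts and levels up to integers; the function name `clock_tree_estimate` is only illustrative and does not come from the presentation.

```python
import math

def clock_tree_estimate(num_sinks, fanout):
    """Return (approx. #buffers, #levels) for an idealized fanout-F clock tree."""
    buffers = math.ceil((num_sinks - 1) / (fanout - 1))   # #buffers ~ (N-1)/(F-1)
    levels = math.ceil(math.log(num_sinks, fanout))       # #levels  ~ log_F(N)
    return buffers, levels

N, F, K = 100_000, 8, 64                               # values used on the slide
single = clock_tree_estimate(N, F)                     # single-bit flops: N sinks
tray = clock_tree_estimate(math.ceil(N / K), F)        # K-bit trays: N/K sinks
reduction = 1 - tray[0] / single[0]                    # fraction of buffers removed

print(f"single-bit: {single[0]} buffers, {single[1]} levels")
print(f"{K}-bit trays: {tray[0]} buffers, {tray[1]} levels")
print(f"buffer reduction: {reduction:.1%}")
```

Under this model the level count drops from 6 to 4 and the buffer reduction comes out at about 98.4%, matching the figures quoted on the slide.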

Flop Tray Benefits (2)
- Inverters for clock signals are shared within a flop tray → power and area reductions
- A recent work (Lin et al., TCAD 2015) achieves 22% flop power reduction by using 2-bit and 4-bit flop trays
- [Figure: single-bit flop vs. 2-bit flop tray, each with master latch, slave latch and clk]

Challenges of Flop Tray Generation
- Flops occupy a large portion of block area
  - In VGA, 30% of instances are flops → 51% of block area
- Flop trays can have high aspect ratio and distinct sizes
  - 4-bit flop tray = 1 row x 63 sites
  - 64-bit flop tray = 4 rows x 244 sites
- Clustering of flops imposes additional placement constraints
  - Small clusters do not fully exploit flop tray benefits
  - Large clusters may sacrifice datapath wirelength / power
- [Figure: power overhead on datapaths (flop tray w/ logical clustering vs. single-bit flop)]

Related Works
- Early-stage flop tray generation
  - [Chen10] enables flop tray generation during synthesis
  - [Hou09] splits flop trays to mitigate routing congestion
  - But these are not aware of physical layout
- Flop tray generation during/after placement
  - [Lin11] clusters flops by finding K-cliques in a merging graph
  - [Jiang12] generates flop trays using interval graphs
  - [Tsai13] guides placement of flops with bonding force
  - Hard to define feasible displacement region
  - But these ignore the shape (AR) of flop trays and timing paths
- Our work: flop tray generation considering flop displacement, timing paths and flop tray shapes

Overall Optimization Flow
- Initial placement w/ single-bit flops == "optimal" placement
- Objectives
  - Minimize displacement of flops
  - Minimize timing impact
  - Minimize #flop trays
- Two-step optimization
  - Capacitated K-means clustering
  - ILP-based selection of flop trays
- [Figure: flow chart; our optimizations are in blue, and the capacitated K-means clustering steps are in dotted red boxes]

Example of Overall Flow
- [Figure: layouts of the 4-bit only solution, 16-bit only solution, 64-bit only solution and ILP solution]
- Design: AES; Technology: 28FDSOI

Capacitated K-Means Clustering
- Given N points (flops) and a capacity of K (flop tray size), obtain (N/K) clusters
- Selection of starting points
  I. Randomly select one flop among single-bit flops
  II. For each flop (h), calculate the total Manhattan distance (d) from h to all selected flops
  III. Randomly select one new flop with probability proportional to d
  IV. Repeat Steps II and III until M flops are selected
- Min-cost flow-based clustering
- Update of cluster centers
  - Minimize $\sum_k d_k$
  - Such that $|x_i + x'_{ij} - x_k| + |y_i + y'_{ij} - y_k| = d_k$
  - Flop location: $(x_k, y_k)$; flop tray location: $(x_i, y_i)$; relative slot location: $(x'_{ij}, y'_{ij})$
- By considering distances between flops and slots, we are aware of flop tray ARs
- Notation
  - $h_k$: kth flop (point)
  - $t_i$: ith flop tray (cluster)
  - $f_{ij}$: jth slot on ith flop tray
  - $d_{k,ij}$: Manhattan distance between $h_k$ and $f_{ij}$
- [Figure: flow from initial centers to clustering to cluster center update to solution]
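
The seeding steps and the center update above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Step III is read as sampling with probability proportional to d, already-selected flops are excluded from resampling, and the center update uses the fact that a sum of absolute deviations is minimized by a median (the slides do not say how the update is actually solved, and legal site/row snapping is ignored). The min-cost flow clustering step is omitted. All function names and the sample coordinates are hypothetical.

```python
import random
from statistics import median

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def select_seeds(flops, num_seeds, rng):
    """Steps I-IV: pick num_seeds (<= len(flops)) starting flops, returned as indices."""
    selected = [rng.randrange(len(flops))]                   # Step I
    while len(selected) < num_seeds:                         # Step IV
        candidates = [i for i in range(len(flops)) if i not in selected]
        # Step II: total Manhattan distance from each flop to all selected flops
        weights = [sum(manhattan(flops[i], flops[s]) for s in selected)
                   for i in candidates]
        if sum(weights) == 0:                                # all points coincide
            selected.append(candidates[0])
        else:                                                # Step III (assumed: proportional to d)
            selected.append(rng.choices(candidates, weights=weights, k=1)[0])
    return selected

def update_tray_location(assigned):
    """assigned: list of ((x_k, y_k), (x'_ij, y'_ij)) pairs for one tray.

    Minimizes sum_k |x_i + x'_ij - x_k| + |y_i + y'_ij - y_k| over (x_i, y_i).
    The objective separates in x and y, and each part is minimized by a median.
    """
    xi = median(xk - dx for (xk, _), (dx, _) in assigned)
    yi = median(yk - dy for (_, yk), (_, dy) in assigned)
    return xi, yi

# Hypothetical flop coordinates, for illustration only.
flops = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 10), (1, 10)]
seeds = select_seeds(flops, 3, random.Random(0))

# Four flops matched to the slots of a 1-row x 4-slot tray (unit slot pitch assumed).
assigned = [((5, 2), (0, 0)), ((6, 2), (1, 0)), ((8, 3), (2, 0)), ((9, 2), (3, 0))]
tray_xy = update_tray_location(assigned)   # (5.5, 2.0)
```

Because the update measures distance to each flop's own slot rather than to the tray center, the tray's aspect ratio enters the objective through the slot offsets.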

Example on AES
- [Figure: circles are initial flop locations; red dots are flop tray locations]

Awareness of Flop Tray Shapes
- Our clustering solution more closely matches the AR of flop trays → smaller displacements
- [Figure: layout without awareness of flop tray AR, avg. displacement = 15μm]
- [Figure: layout with awareness of flop tray AR, avg. displacement = 5μm]
- Design: AES; Technology: 28FDSOI

ILP-Based Selection of Flop Tray Solutions
- Formulate an ILP to select flop tray solutions with various flop tray sizes to minimize displacement, timing impact and flop tray cost
- Minimize $\alpha \cdot W + D + \beta \cdot Z$
- Such that
  - Flop displacements:
    $\left|\sum_{i,j} (x_i + x'_{ij} - x_k) \cdot b_{k,ij}\right| + \left|\sum_{i,j} (y_i + y'_{ij} - y_k) \cdot b_{k,ij}\right| = d_k$
    $\sum_k d_k = D$
  - Relative displacements between timing-critical flop pairs:
    $\left|\sum_{i,j} (x_i + x'_{ij} - x_k) \cdot b_{k,ij} - \sum_{i',j'} (x_{i'} + x'_{i'j'} - x_{k'}) \cdot b_{k',i'j'}\right| + \left|\sum_{i,j} (y_i + y'_{ij} - y_k) \cdot b_{k,ij} - \sum_{i',j'} (y_{i'} + y'_{i'j'} - y_{k'}) \cdot b_{k',i'j'}\right| = z_{kk'}$
    $\sum_{k,k'} z_{kk'} = Z$
  - Cost of flop trays:
    $b_{k,ij} \le e_i$; $e_i \le \sum_{k,j} b_{k,ij}$
    $\sum_i (w_i \cdot e_i) = W$
  - Each flop has exactly one slot to match, and each slot can have at most one flop to match:
    $\sum_{i,j} b_{k,ij} = 1$; $\sum_k b_{k,ij} \le 1$
- Notations
  - $D$: total displacement
  - $Z$: total relative displacement of timing-critical flop pairs
  - $W$: total cost of flop trays
  - $\alpha, \beta$: weighting parameters
  - $(x_i, y_i)$: location of ith flop tray
  - $(x'_{ij}, y'_{ij})$: relative location of jth slot on ith flop tray
  - $(x_k, y_k)$: location of kth flop
  - $b_{k,ij}$: binary indicator whether kth flop is assigned to jth slot on ith flop tray
  - $e_i$: binary indicator whether ith flop tray is selected
  - $w_i$: cost of ith flop tray
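
To make the objective concrete, the sketch below evaluates $\alpha \cdot W + D + \beta \cdot Z$ on a toy instance by exhaustive enumeration. It is not an ILP solver and not the authors' code: brute force stands in for the solver, and the tray sizes, locations, costs and flop coordinates are invented for illustration. The function name `solve_toy` is likewise hypothetical.

```python
from itertools import permutations

def solve_toy(flops, trays, critical_pairs, alpha, beta):
    """Exhaustively minimize alpha*W + D + beta*Z over flop-to-slot assignments."""
    # Every candidate slot as (tray index i, absolute x, absolute y).
    slots = [(i, t["loc"][0] + dx, t["loc"][1] + dy)
             for i, t in enumerate(trays) for dx, dy in t["slots"]]
    best = None
    # permutations() gives each flop exactly one slot and each slot at most one flop.
    for choice in permutations(slots, len(flops)):
        # Displacement vector of flop k: (x_i + x'_ij - x_k, y_i + y'_ij - y_k).
        disp = [(sx - fx, sy - fy) for (_, sx, sy), (fx, fy) in zip(choice, flops)]
        D = sum(abs(dx) + abs(dy) for dx, dy in disp)
        Z = sum(abs(disp[a][0] - disp[b][0]) + abs(disp[a][1] - disp[b][1])
                for a, b in critical_pairs)
        used = {i for i, _, _ in choice}        # trays with e_i = 1
        W = sum(trays[i]["cost"] for i in used)
        cost = alpha * W + D + beta * Z
        if best is None or cost < best[0]:
            best = (cost, sorted(used))
    return best

# Hypothetical instance: one 4-slot tray and two 2-slot trays on a single row.
flops = [(0, 0), (1, 0), (6, 0), (7, 0)]
trays = [
    {"loc": (2, 0), "slots": [(0, 0), (1, 0), (2, 0), (3, 0)], "cost": 1.0},
    {"loc": (0, 0), "slots": [(0, 0), (1, 0)], "cost": 0.6},
    {"loc": (6, 0), "slots": [(0, 0), (1, 0)], "cost": 0.6},
]
critical_pairs = [(0, 1)]

small_alpha = solve_toy(flops, trays, critical_pairs, alpha=1, beta=1)
large_alpha = solve_toy(flops, trays, critical_pairs, alpha=100, beta=1)
print(small_alpha, large_alpha)
```

In this toy instance a small α selects the two 2-slot trays with zero displacement, while a large α selects the single 4-slot tray and accepts a total displacement of 8. Enumeration grows factorially with the number of flops, so it is only suitable for illustrating the cost function.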

Impact of α Value
- Choice of α determines a tradeoff between clock power reduction and datapath power penalty
- Small value of α → small-size flop trays, small displacement
- Large value of α → large-size flop trays, large displacement

Minimization of Relative Placement
- Relative displacement between timing-critical start-end flop pairs degrades timing
  - Move apart → wire↑ → delay↑
  - Move closer → routing/placement congestion
- Minimization of relative displacement reduces power penalty by 5%
- [Figure: logic cone between a start-end flop pair; moving the flops closer causes placement/routing congestion, moving them apart gives longer wire]

Experimental Setup
- Designs: AES, JPEG, MPEG, VGA (from OpenCores website)
- Technology: 28nm FDSOI, dual-VT
- Tools
  - Synthesis: Synopsys Design Compiler vH-2013.12-SP3
  - P&R: Cadence Innovus Implementation System v15.2
  - Power/timing analysis: Cadence Innovus Implementation System v15.2
- Candidate flop trays

  Tray size               4-bit    8-bit    16-bit   32-bit    64-bit
  AR (#rows x #columns)   1 x 4    2 x 4    4 x 4    4 x 8     4 x 16
  AR (#rows x #sites)     1 x 63   2 x 62   4 x 62   4 x 122   4 x 244

  Norm. area/power per bit: 0.875, 0.854, 0.844 (column alignment lost in the transcript)

Power Benefits
- Reference flows
  - ref_1b: conventional implementation flow with single-bit flops
  - ref_mb: flop tray-based implementation with logical clustering (flop tray generation during synthesis with commercial tools)
- Up to 98% sink number reduction and 90% clock power reduction compared to ref_1b
- Up to 16% more total power reduction and 40% more clock power reduction compared to ref_mb

  Design   Flow     Clock power (mW)   Total power (mW)   #Sinks
  AES      ref_1b    1.53               14.02                530
           ref_mb    0.72               13.35                227
           opt_mb    0.46               12.56                128
  JPEG     ref_1b   13.37               84.54               4512
           ref_mb    6.1                76.2                1665
           opt_mb    2.28               69.24                515
  MPEG     ref_1b   10.72               45.53               3181
           ref_mb    5.19               38.7                1316
           opt_mb    0.98               31.76                181
  VGA      ref_1b   42.19              164.84              17053
           ref_mb   20.73              138.99               7665
           opt_mb    2.04              111.32                308

Layout Examples In red are flop trays and flops, in blue are combinational cells

Optimization with Various Flop Tray Sizes
- Flop tray-based optimization with various combinations of flop tray size candidates
  - I: 1 bit
  - II: {1, 4} bit
  - III: {1, 4, 8} bit
  - IV: {1, 4, 8, 16} bit
  - V: {1, 4, 8, 16, 32, 64} bit
- Optimization with large-size (i.e., > 16-bit) flop trays achieves 11% more clock power reduction on average, especially on large designs
- [Figure: results for configurations I to V on AES, JPEG, MPEG and VGA]

Study of Useful Skew Optimization
- Comparison of useful skew benefits (= datapath leakage power reductions) across various flows
  - ref_1b: design with only single-bit flops
  - opt_mb: flop tray-based design (w/o skew-aware clustering)
  - opt_mb (skew aware): flop tray-based design (w/ skew-aware clustering)
- Skew-aware clustering achieves similar useful skew benefits as ref_1b, but with 21% less sink number reduction

  #Sinks                AES    JPEG   MPEG   VGA
  ref_1b                530    4512   3181   17053
  opt_mb                128     515    181     308
  opt_mb (skew aware)   392    1830    205    1245

Conclusion
- A novel flop tray-based optimization with a capacitated K-means algorithm
- Up to 16% total block power reduction compared to logical clustering
- Useful skew optimization in the context of flop tray-based design
- Ongoing / future work
  - Scalable optimization considering all flop tray sizes
  - Floorplan blockage awareness

Thank you! UCSD ABKGroup is grateful to Qualcomm, Samsung, NXP, the IMPACT+/C-DEN centers, Mentor Graphics and the NSF for research support. We thank IMEC and Cadence for additional research enablements and collaborations.