Methodology for High-Speed Clock Tree Implementation in Large Chips


Ravinder Rachala, Aaron Grenat, Prashanth Vallur, Christopher Ang. January 31, 2013.

Advantages of Custom Clock Distribution: Low skew. Smaller AOCV timing uncertainty compared to full CTS. Custom buffers are more tolerant of OCV, IR drop, and supply noise. The plot shows a scenario in which increased skew requires boosting the supply voltage to reach the target Fmax. In effect, skew translates into higher power (dynamic and leakage) for meeting a target frequency. [Plot labels: Low Skew, High Skew.]
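The link between skew and power can be made explicit with the standard setup-time constraint and the usual first-order expression for CMOS dynamic power. This is a textbook sketch, not something shown on the slides; the symbols t_cq, t_logic, t_setup, alpha, and C are introduced here purely for illustration.

```latex
T_{clk} = \frac{1}{F_{max}} \;\ge\; t_{cq} + t_{logic}(V_{DD}) + t_{setup} + t_{skew}
\qquad
P_{dyn} \approx \alpha \, C \, V_{DD}^{2} \, F_{max}
```

With the cycle time fixed by the target Fmax, any increase in t_skew has to be recovered by shortening t_logic, which in practice means raising V_DD. Dynamic power then grows roughly with the square of V_DD, and leakage rises as well.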

OLD METHODOLOGY – CLOCK-SPINE-FRIENDLY FLOORPLAN. [Figure labels: PLL, Clock Spine Macros.] Shown here is a typical CPU floorplan: a regular and very constrained problem. The clock trees do not cut into too many blocks, where blockages from clock buffers would cause congestion. Because the aspect ratio is the same, the same macro can be programmed with varying final buffer strengths. A regular, repetitive floorplan like this one is conducive to long, thin clock macro structures. Here we build two unique types of clock macros and stamp them, so the custom macro effort is relatively small compared with more complex floorplans.

OLD FLOW – Clock Spine Topology in a complex floorplan. In more complex floorplans like the one shown, we would end up needing too many custom clock spine macros, which are resource intensive and hard to converge in time for chip tapeout. The traditional clock spine macro style does not scale to today's complex chips.

ISSUES with OLD methodology: Very resource intensive; the increasing number of SOCs in the roadmap makes this even more challenging. The area taken by the clock trees is poorly utilized (<10%). The increasing size of the macros (on the order of ~20 mm) runs the risk of not converging through the custom macro/IP build flow. Floorplan challenges in accommodating the clock macros and minimizing the number of unique macros typically consume a lot of resources, energy, and time. Re-use of clock macros across projects is heavily restricted by even small floorplan changes between projects.

TMAC Flow: New Methodology. Clock macros are broken down into cells, called TMACs (Tiny MACros), that are instantiated flat at the IP level. Connections between the TMACs are made in the overlay (or RDL, Route Distribution Layers). [Figure labels: TMAC cells, Clock Macro, 1 mm.]

TMAC Flow: New Methodology – sample clock SPINE + MESH topology. [Figure labels: CTS; Root buffer or Clock Gater; MH (horizontal low-resistance layer); MV (vertical low-resistance layer).]

PRIOR work: example. [Figure: IP floorplan with PLL and Tile/RLM blocks; one region is marked "Bad skew".] Unique clock macros: Conduit, 1; Vtree, 1; Htree, 8 (all 8 flavors are delay-matched); total unique clock macros = 10. Driving large areas of the design from a corner (i.e., a huge capacitance on the buffer and a large current through the wire) causes EM and self-heating issues. Long distribution wires are susceptible to ringing and reflections (parasitic inductance).

CLOCK COVERAGE IS BETTER IN New Methodology. [Figure labels: Tile/RLM, IP floorplan, PLL, TMAC.] TMAC overlay → 1 clock spine. All TMAC cells are connected in the overlay. More clock coverage.

TMAC Flow: New Methodology BENEFITS. The entire distribution is contained in one clock spine. Reduces the number of circuit and layout resources needed. Frees up the area between the TMACs for RLMs/tiles. The TMAC library of cells is built once per technology node (e.g., GF 28nm) and reused across all projects in that process technology. Floorplan changes can be accommodated easily, even late in the design cycle. Provides more complete and robust clock coverage: bad-skew zones are avoided and reliability concerns are minimized. Instance swapping (sizing the clock mesh drivers for power and performance optimization) can be done easily based on the clock mesh load. Creates a full-custom-quality clock spine network with significantly less effort.

GRID CAP OPTIMIZATION, SDF for SKEW ANNOTATION. Clock grid optimization techniques reduced the clock metal capacitance by ~45%: classic clock mesh pruning methods such as on-demand grid, and pushing the via stack into the MPCTS (Multi-Point CTS) buffer. Clock arrival times at each MPCTS entry point on the mesh are provided (as an SDF file) to the full-chip timing flow. Standard MPCTS buffer cell: the auto-router builds the connection from the 'CLK' pin (M2) to the 'MH' clock grid route. New MPCTS buffer cell: the connection from the M2 pin to the MH layer is built into the cell, so the pin is elevated to the MH layer; the new cell is the same size as the standard cell. All of the capacitance of the auto-routed connection is saved, and the skew from a circuitous route is avoided.

TMAC METHODOLOGY: FLOW CHART. (1) Import the IP/SOC floorplan (DEF or GDS) into Cadence Virtuoso Layout XL. (2) Draw the full clock spine in Cadence Virtuoso XL (schematic and layout). (3) Export the entire clock spine layout to a DEF file (using an internal flow). (4) Merge the clock spine DEF with the other overlay DEF (top-layer power grid, clock mesh, etc.) in First Encounter. (5) Push the clock design (distribution plus mesh/grid) down into the floorplan views so that the RLMs/tiles can see it for CTS buffer placement, etc. (6) Extract the clock routes (StarRCXT) at the IP/SOC top level and run timing with PrimeTime.

Custom design data to DEF conversion: FLOW CHART. [Flow chart; recoverable box labels: GDSII, CDL, LVS, annotated GDSII file, cross-reference files, data processing tools, internal database, component cell list, DEF writer, DEF.]
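The internal conversion flow is not described beyond the box labels, so the following is only a minimal sketch of what a final "DEF writer" stage might do: turn a component cell list into the COMPONENTS section of a DEF file. The function name, the cell and instance names, and the 1000 database units per micron are all assumptions made for illustration, not details from the presentation.

```python
def write_def_components(design, placements, dbu_per_micron=1000):
    """Return DEF text placing each cell as a FIXED component.

    placements: list of (instance_name, cell_name, x_um, y_um) tuples
    (a hypothetical stand-in for the 'component cell list').
    """
    lines = [
        "VERSION 5.8 ;",
        f"DESIGN {design} ;",
        f"UNITS DISTANCE MICRONS {dbu_per_micron} ;",
        f"COMPONENTS {len(placements)} ;",
    ]
    for inst, cell, x_um, y_um in placements:
        # DEF coordinates are integers in database units.
        x = int(round(x_um * dbu_per_micron))
        y = int(round(y_um * dbu_per_micron))
        lines.append(f"- {inst} {cell} + FIXED ( {x} {y} ) N ;")
    lines += ["END COMPONENTS", "END DESIGN"]
    return "\n".join(lines) + "\n"


# Two hypothetical TMAC buffer cells placed 1 mm apart.
example = write_def_components(
    "clk_spine",
    [("tmac_0", "TMAC_BUF", 0.0, 0.0), ("tmac_1", "TMAC_BUF", 1000.0, 0.0)],
)
```

A real flow would also have to emit the clock routes (for example as special nets) and take orientation from the layout; this sketch fixes every instance at orientation N.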

CLOCK GRID INSERTION and SDF GENERATION: FLOW CHART. (1) Draw the clock mesh/grid routes in FE (spec from the clock circuit team: route width, spacing, shielding). (2) Push the mesh down into the tiles. (3) Run the CTS buffer placement flow; the tiles close placement, routing, and timing. (4) Export all tile DEFs for the full clock mesh extraction and SPICE simulation flow (CES). (5) A top-level script prunes the MH route completely and inserts back the shortest possible segment to connect each CTS entry buffer to the nearest MV layer route. (6) Run the CES flow; skew (clock arrival times, as an SDF file) is reported to the full-chip timing flow, and the clock routes are also checked against EM pass/fail criteria. (7) Extract the clock distribution routes at the IP/SOC level and run full-chip STA (PrimeTime).
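The pruning step, in which the MH route is removed and only the shortest segment from each CTS entry buffer to the nearest MV route is put back, can be sketched as follows. This is an illustrative model rather than the actual top-level script: it assumes MV routes are vertical lines at known x-coordinates and buffers are points (x, y), and all function names are hypothetical.

```python
from bisect import bisect_left


def nearest_track(x, mv_tracks):
    """Return the MV track x-coordinate closest to x (mv_tracks sorted, non-empty)."""
    i = bisect_left(mv_tracks, x)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = mv_tracks[max(0, i - 1): i + 1]
    return min(candidates, key=lambda t: abs(t - x))


def prune_mh(buffers, mv_tracks):
    """For each buffer (x, y), keep only the horizontal MH stub to the nearest MV track.

    Returns segments as ((x0, y), (x1, y)) with x0 <= x1.
    """
    tracks = sorted(mv_tracks)
    segments = []
    for x, y in buffers:
        t = nearest_track(x, tracks)
        segments.append(((min(x, t), y), (max(x, t), y)))
    return segments


def total_length(segments):
    """Total MH wire length of the given horizontal segments."""
    return sum(x1 - x0 for (x0, _), (x1, _) in segments)


# Three buffers on a grid with MV tracks every 100 units.
stubs = prune_mh([(30, 10), (160, 50), (100, 70)], [0, 100, 200])
pruned_length = total_length(stubs)
```

In this toy case the three stubs add up to 70 units of MH wire, compared with 600 units if each of the three rows kept a full-width MH route from x = 0 to x = 200. Less wire means less clock metal capacitance, which is the effect the grid optimization is after.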

Benefits proven in recent AMD SOCs. Fewer resources needed: the 32nm SOI APU graphics IP had 7 clocks and ~30 clock macros, and needed 4 circuit and 4 layout resources; the 28nm APU graphics IP had 9 clocks and 1 clock spine DEF, and needed 1.5 circuit and 1 layout resource. Area savings: 32nm SOI APU graphics IP area 98 mm², clock macro area 1.21 mm² → 1.23%; 28nm APU graphics IP area 131 mm², clock macro area 0.18 mm² → 0.12%. Floorplan flexibility: with the new methodology (TMAC flow), high-speed clock distribution can be designed to fit into any floorplan. For example, we were able to deliver the clock distribution design to a server SOC in ¼ of the time it takes in the old clock spine macro flow. Reuse across projects: the TMAC library (clock buffer cells, etc.) developed for a process technology is being leveraged for multiple APU projects.
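The area percentages are simply the clock macro area divided by the IP area. A quick check using the figures quoted on the slide (the function name is introduced here for illustration):

```python
def area_fraction_pct(clock_area_mm2, ip_area_mm2):
    """Clock macro area as a percentage of the total IP area."""
    return 100.0 * clock_area_mm2 / ip_area_mm2


old_pct = area_fraction_pct(1.21, 98.0)   # 32nm SOI APU graphics IP
new_pct = area_fraction_pct(0.18, 131.0)  # 28nm APU graphics IP
improvement = old_pct / new_pct           # how many times smaller the share became
```

Dividing the quoted areas reproduces the 1.23% figure for the 32nm design. For the 28nm design the same division gives about 0.14% rather than the 0.12% quoted; the source does not explain the difference. Either way, the clock macro share of the IP area shrinks by roughly an order of magnitude.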

Q & A Thank You

Trademark Attribution AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2012 Advanced Micro Devices, Inc. All rights reserved.