Design Methodology for Semi Custom Processor Cores

Presentation transcript:

Design Methodology for Semi Custom Processor Cores
Victor Zyuban, Sameh Asaad, Thomas Fox, Anne-Marie Haen, Daniel Littrell, Jaime Moreno
IBM T.J. Watson Research Center, Yorktown Heights, NY

Introduction
We describe the methodology used to implement a DSP core whose requirements did not allow a typical soft-core or hard-core approach: 500 MHz worst case (WC), 350 mW @ 1.5 V, 105 C, in a 0.13 um foundry technology.
Objectives:
- Exceed the performance and power characteristics of designs built with the standard ASIC flow (a typical ASIC runs at 300 MHz in this technology) without compromising the productivity and generality of that flow
- Enable integration of custom components
- Enable power reduction techniques not provided by the ASIC flow, such as power gating, reverse bias, and data retention
- Allow optimizations across design phases
- Quick turn-around time and reproducible results

Methodology Overview
[Flow diagram, reconstructed by stage; feedback arrows return to earlier stages]
- ISA/uA. Feedback: modify Arch/uA (change latencies, redefine resource usage)
- VHDL: define hierarchy, clock gating, and latch grouping; instantiate custom components; define assertions and synthesis directives. Feedback: re-arrange logic, re-group latches, adjust synthesis constraints
- Hiasynth/Booledozer: logic synthesis; optimization with pseudo-latches
- Scan & clock: clock splitter insertion; scan insertion; hierarchical Verilog. Feedback: rewire scan
- PD: porting the design to Cadence; pre-placement / pre-routing; place & route; extract timing / clock skew / scan order; power analysis

Overview of main design techniques
- Hierarchical VHDL and synthesis + pre-placement of components
- Grouping of latches for clock splitters in VHDL + pre-placement of latches and clock splitters
- Enforcing bit ordering in the datapath (bit stack seeding)
- Instantiation of decoupling buffers in VHDL + pre-placement of decoupling buffers
- Pre-routing of the clock grid and the power-ground grid

Hierarchical Synthesis and Pre-placement of Components: methodology
- Every unit is broken up into components (a few thousand gates each)
- Components are synthesized independently
- The layout of the unit is organized into a set of overlapping boxes; the gates constituting each component are assigned to the appropriate box, leaving sufficient flexibility for the place and route tools
[Figure: functional units FU1 to FU5 and dust, shown in the VHDL entry and in the layout]
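The box assignment described on this slide can be pictured with a small sketch. It is only an illustration, not the authors' tooling: the box coordinates, the gate names, and the reading of "dust" as leftover glue logic outside the components are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pre-placement region in arbitrary layout units."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap when their extents intersect on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


# Hypothetical floorplan: boxes overlap on purpose so the place and
# route tool keeps some freedom near component boundaries.
boxes = {
    "FU1": Box(0, 0, 12, 10),
    "FU2": Box(10, 0, 22, 10),
    "dust": Box(0, 8, 22, 14),
}

# Hypothetical gate-to-component map, as it might come out of
# hierarchical synthesis (each component synthesized on its own).
gate_component = {
    "fu1_add0": "FU1",
    "fu1_mux3": "FU1",
    "fu2_shift1": "FU2",
    "glue_buf7": "dust",
}


def placement_constraints(gate_component, boxes):
    """Give every gate the box of the component it belongs to."""
    return {gate: boxes[comp] for gate, comp in gate_component.items()}


constraints = placement_constraints(gate_component, boxes)
overlapping_pairs = sorted(
    (a, b) for a in boxes for b in boxes
    if a < b and boxes[a].overlaps(boxes[b])
)
```

Each gate ends up with a region constraint rather than an exact location, which is the "sufficient flexibility" the slide mentions.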

Hierarchical Synthesis and Pre-placement of Components: benefits
- Different components get their best power/performance/area characteristics when synthesized with different directives
- Gates inside components are sized for smaller area; only gates constituting dust use high-power books
- Most of the wires are restricted to smaller areas and are therefore short
- Most of the gates in the design use low-power books
- Both area and power are saved
[Figure: layout showing slice 0, slice 1, slice 2, slice 3, control slice, pointer update unit (3), bit reverse unit (0)]

Latch grouping: fine-grain clock gating
VHDL entry: CG-OR instances define gated latch groups; latches are single- or multi-bit behavioral latches.
Post-synthesis processing: local clock splitters replace the placeholder CG-ORs; latches are single-bit LSSD latches (L1/L2) fed from the grid clock.
- Control granularity goes down to the latch group driven by the same splitter
- Performs early (L1 and L2) gating
- Similar to the ASIC Clock-OR methodology
[Figure: VHDL entry with Gate1, Gate2, clk1, clk2, cclk and the grid clock; post-synthesis view with splitter outputs clk1_B, clk1_C, clk2_B, clk2_C driving L1/L2 latches]

Latch grouping without clock gating
VHDL entry: buffer instances with special names define clock/latch groups; latches are single- or multi-bit behavioral latches.
Post-synthesis processing: local clock splitters replace the placeholder buffers; latches are single-bit LSSD latches (L1/L2) fed from the grid clock.
- The designer controls latch grouping by inserting special placeholder buffers to form latch groups
- A post-synthesis script replaces the buffers with splitters and the behavioral latches with LSSD L1/L2 latches
[Figure: VHDL entry with clk1, clk2 and the grid clock; post-synthesis view with splitter outputs clk1_B, clk1_C, clk2_B, clk2_C driving L1/L2 latches]
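The post-synthesis replacement step can be sketched on a toy netlist. This is a hypothetical model, not the script used in the project: the `clkgrp_` prefix stands in for the unspecified "special names", and the cell kinds and the flat list representation are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical naming convention marking placeholder buffers.
PLACEHOLDER_PREFIX = "clkgrp_"


@dataclass
class Cell:
    name: str
    kind: str       # "BUF", "BEH_LATCH", "SPLITTER", or "LSSD_L1L2"
    net: str        # clock net driven (buffers) or received (latches)
    width: int = 1  # bit width; only meaningful for behavioral latches


def postprocess(cells):
    """Swap placeholder buffers for splitters and expand behavioral
    latches into single-bit LSSD L1/L2 latches."""
    out = []
    for c in cells:
        if c.kind == "BUF" and c.name.startswith(PLACEHOLDER_PREFIX):
            out.append(Cell(c.name, "SPLITTER", c.net))
        elif c.kind == "BEH_LATCH":
            for bit in range(c.width):
                out.append(Cell(f"{c.name}[{bit}]", "LSSD_L1L2", c.net))
        else:
            out.append(c)  # ordinary logic is left untouched
    return out


def latch_groups(cells):
    """Map each splitter's clock net to the latches it drives."""
    groups = {c.net: [] for c in cells if c.kind == "SPLITTER"}
    for c in cells:
        if c.kind == "LSSD_L1L2":
            groups.setdefault(c.net, []).append(c.name)
    return groups


netlist = [
    Cell("clkgrp_a", "BUF", "clk1"),
    Cell("acc", "BEH_LATCH", "clk1", width=4),
    Cell("clkgrp_b", "BUF", "clk2"),
    Cell("flag", "BEH_LATCH", "clk2"),
    Cell("data_buf", "BUF", "n7"),  # normal buffer, not a placeholder
]

processed = postprocess(netlist)
groups = latch_groups(processed)
```

Because the grouping is carried by instance names in the source, the VHDL itself stays generic and simulates as ordinary buffers and latches.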

Pre-placement of latches and clock splitters
- Clock wires are short, resulting in power reduction
- The length of clock wires is under control, resulting in small clock skew, higher frequency, and faster convergence on timing
- Bit-precise placement of dataflow latches enforces bit ordering in the datapath, resulting in improved routability and savings in power and area
[Figure: clock distribution with pre-placed clock splitters and latches, compared with clock routing without pre-placement]
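A rough way to see why pre-placement shortens clock wires is to compare Manhattan distances from a splitter to its latches in two placements. The sketch below assumes star routing from the splitter to each latch and uses made-up coordinates; the spread between the longest and shortest branch is used only as a crude stand-in for skew.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def clock_wire_stats(splitter, latches):
    """Total branch length and longest-minus-shortest branch,
    assuming one direct wire from the splitter to every latch."""
    lengths = [manhattan(splitter, p) for p in latches]
    return sum(lengths), max(lengths) - min(lengths)


# Hypothetical positions in arbitrary layout units.
splitter = (0, 4)

# Pre-placed: an 8-bit latch column right next to the splitter,
# one row per bit (this also fixes the bit ordering).
preplaced = [(1, y) for y in range(8)]

# Not pre-placed: the same 8 latches wherever the placer put them.
scattered = [(5, 1), (9, 7), (2, 12), (14, 3),
             (7, 15), (11, 10), (3, 6), (13, 13)]

pre_total, pre_spread = clock_wire_stats(splitter, preplaced)
sc_total, sc_spread = clock_wire_stats(splitter, scattered)
```

In this toy case both the total clock wire length and the branch-length spread drop sharply for the pre-placed column, which matches the direction of the power and skew claims on the slide.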

Instantiation and pre-placement of decoupling buffers: methodology
- Used when a long wire or a non-critical block of logic needs to be decoupled from the critical path
- Decoupling buffers are instantiated in VHDL and preserved through the synthesis and post-synthesis steps
- Overlapping pre-placement boxes are created in the layout, and decoupling buffers are assigned to the appropriate boxes
[Figure: latches, a decoupling buffer, and functional units FU1 and FU2, shown for VHDL entry (case 1), VHDL entry (case 2), and the layout]

Instantiation and pre-placement of decoupling buffers: benefits
- The power level of the decoupling buffers is precisely controlled, without impacting the gates constituting the components (FUs)
- The power level of most books inside the unit can be kept small, using high-power books only where they need to drive long wires or high fan-out (FO)
- Decoupling high-capacitance nodes from critical paths improves speed
[Figure: decoupling buffers on 40-bit latch output wires]
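The speed benefit of decoupling can be illustrated with a first-order lumped RC estimate. This is a simplified model with hypothetical numbers, not data from the design: the driver is treated as a single resistance, every load as a single capacitance, and wire resistance is ignored.

```python
def rc_delay_ps(r_kohm: float, c_ff: float) -> float:
    """Lumped RC delay estimate; kOhm times fF gives picoseconds."""
    return r_kohm * c_ff


# Hypothetical values chosen only to show the effect.
R_DRIVER_KOHM = 2.0   # low-power gate on the critical path
C_CRITICAL_FF = 10.0  # input of the next critical-path gate
C_LONG_WIRE_FF = 150.0  # long wire to a non-critical block
C_BUF_IN_FF = 5.0     # input of the decoupling buffer

# Without decoupling, the critical driver sees the long wire directly.
delay_direct = rc_delay_ps(R_DRIVER_KOHM, C_CRITICAL_FF + C_LONG_WIRE_FF)

# With a decoupling buffer, the critical driver sees only the buffer
# input; the buffer (a high-power book) drives the long wire off the
# critical path.
delay_decoupled = rc_delay_ps(R_DRIVER_KOHM, C_CRITICAL_FF + C_BUF_IN_FF)

speedup = delay_direct / delay_decoupled
```

The same picture explains the power argument: only the buffer needs a high drive strength, while the gate on the critical path can stay a small, low-power book.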

Core assembly and timing methodology overview
1) Generation of abstracts (unit level)
[Flow diagram: unit layout; Chipbench; extraction of global wiring (pd file); EinsTimer. Outputs: unit layout abstract and unit timing abstract]

Core assembly and timing methodology overview
2) Final step (core level)
[Flow diagram: top schematic; generate physical hierarchy, using the unit layout abstracts and unit timing abstracts; top floorplan; placement (Cadence Preview, SKILL scripts); routing (CCAR); top routed floorplan; Chipbench; extraction of global wiring; EinsTimer]

Placed and routed eLite core
[Figure: core layout. Labeled blocks: custom instruction memory (32 kB); custom data memory (32 kB); custom vector register file (256 x 16 bit, 8 read / 4 write); VPU; AU; IU; DEC; BU; BIU; CR; VMU; SD; bus control; X bus control; vector control; reduct unit; slice 0 to slice 3, each labeled 16-bit and 40-bit]

Conclusion
- Significant speed improvement compared with the standard ASIC flow (critical path reduced from 3 ns to 2 ns in some units)
- Area reduction (> 30%) due to dominant usage of low-power cells
- Power reduction (in the range of 50%)
- Careful pre-placement of clock splitters and clock gating circuitry allows more time for calculating the clock gating conditions: increased from 0.1 ns to 0.6 ns for the 500 MHz WC design with highly efficient OR-style (early) clock gating, making it possible to clock gate 90% of eligible latches
- Generic VHDL: easy to maintain, port, and simulate
- Short time from VHDL to layout and fast turn-around time to close on timing, with consistent convergence: up to 3 VHDL-to-layout iterations per unit per day by 2 to 3 designers
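The timing figures quoted in the deck can be cross-checked with simple arithmetic. The sketch below uses only numbers from the slides; treating the critical path as the whole cycle (no latch overhead or skew margin) is a simplifying assumption.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def max_freq_mhz(critical_path_ns: float) -> float:
    """Frequency whose period equals the critical path delay
    (ignores latch overhead and skew, a simplification)."""
    return 1000.0 / critical_path_ns


target_period = period_ns(500.0)   # 500 MHz WC target
asic_period = period_ns(300.0)     # typical ASIC in this technology

freq_before = max_freq_mhz(3.0)    # critical path with standard flow
freq_after = max_freq_mhz(2.0)     # critical path with this flow

# Time available to compute clock gating conditions, as a share
# of the 500 MHz cycle.
window_before = 0.1 / target_period
window_after = 0.6 / target_period

print(f"target period: {target_period:.2f} ns")
print(f"3 ns path -> {freq_before:.0f} MHz, 2 ns path -> {freq_after:.0f} MHz")
print(f"gating window: {window_before:.0%} -> {window_after:.0%} of the cycle")
```

A 2 ns critical path is exactly the 500 MHz cycle, and the gating window grows from 5% to 30% of that cycle, which is consistent with the claim that more latches become eligible for gating.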