A Case for Standard-Cell Based RAMs in Highly-Ported Superscalar Processor Structures Sungkwan Ku, Elliott Forbes, Rangeen Basu Roy Chowdhury, Eric Rotenberg.


This work is supported by NSF grant CCF-1218608. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the NSF.

Background

Superscalar core has many highly ported memories
- Ports increase with instruction fetch width and number of parallel execution lanes
- Many ports required for physical register file, reorder buffer, scheduler, and rename tables

Challenges for full-custom designs using multi-ported versions of 6T SRAM bitcell
- Design effort is high to reliably handle PVT variations and narrow noise margins at low voltages for sub-micron technology
- Limited port memory compilers are available for sub-micron technology

Microarchitectural alternatives
- Replication or banking of fewer ported memories
- Efficiency or performance drawbacks

Motivation

Study standard-cell based RAMs
- Traditionally, standard-cell based RAMs are used for small memories
- 24 transistors (4.522 um^2) per one flip-flop vs. 8 transistors (1.78 um^2) per 1r1w bitcell

Standard cell 45nm RAM (synthesis report) vs. FabMem 45nm SRAM (estimation tool report)
Area of FF:
- FF is 4.8 times larger for 1r1w memory
- FF is 1.62 times larger for 8r8w memory
- FF is 1.2 times larger for 16r8w memory

Challenges in synthesized flip-flop based RAMs
- Placing and routing a large number of standard cells using automatic place-and-route tools results in a wide variation in area utilization
- Large increase in routing congestion with more rows and ports

Contribution

Standard-cell based RAM compiler:
1. Per-row clock gating reduces both clock power and switching inside the D flip-flops.
2. A new tri-state based mux cell is added to the standard cell library. Employing it reduces total routing and further reduces switching inside the D flip-flops. The new mux cell presents the memory compiler with another choice for optimizing timing and power.
3. A modular layout strategy reduces total routing. A layout is generated for a smaller building block. A block’s layout is made efficient by careful floorplanning. The modular layout allows stacking multiple blocks in series to compose the overall memory.

Modular SRAM Compiler

Step 1. Gate-level netlist generator
- For a given size and port configuration, gate-level netlist is generated
- Can select mux type (standard cell mux/custom cell mux), clock gating enable/disable, and the number of partitions

Step 2. Build a stackable block layout
- If the number of partitions is defined, complete the stackable block layout first
- Extract a LEF from the completed block layout

Step 3. Generate a complete layout
- Stitch blocks to get final layout

[Figure: compiler flow. Panel labels: Generate a gate-level netlist for a given size and port configuration (custom mux cell is added to the standard cell library); Build a stackable block layout; Floorplan; Stitch blocks; Final Layout (DRC/LVS clean)]

Sensitivity Study

Effect of per-row clock gating and new mux cells (one partition)
- Baseline: synthesis and place-and-route
- M-1: baseline + per-row clock gating
- M-2: baseline + per-row clock gating + new mux cells
- FabMem: SRAM from FabMem

Effect of number of partitions for a fixed size
- B-2, B-4, B-8, and B-16 refer to the number of blocks: 2, 4, 8, and 16 blocks
- Per-row clock gating enabled
- New mux cell not applied

Preliminary Results

- Baseline: synthesis and place-and-route
- Modular: best configuration from our RAM compiler
- FabMem: SRAM from FabMem

Area of Modular:
- Comparable or slightly better than Baseline
- Area overhead w.r.t. FabMem decreases with more ports

Timing of Modular:
- Better than Baseline for most cases
- Better than FabMem for large size and many ports

Power of Modular:
- Better than Baseline due to clock gating, tri-state MUXes, and minimized routing (modular layout).
- Better than FabMem
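
The compiler inputs named in the Modular SRAM Compiler steps (size, port configuration, mux type, clock gating, number of partitions) can be illustrated with a small Python sketch. This is not the authors' tool: the class `RamConfig`, the helper names, and the requirement that rows divide evenly among blocks are assumptions made for illustration. Only the flip-flop area (4.522 um^2) comes from the poster, and the 128 x 64 size in the usage note is an arbitrary example.

```python
from dataclasses import dataclass

# From the poster: area of one standard-cell flip-flop in um^2.
FF_AREA_UM2 = 4.522


@dataclass(frozen=True)
class RamConfig:
    """Hypothetical bundle of the options the poster lists for the generator."""
    rows: int
    bits_per_row: int
    read_ports: int
    write_ports: int
    mux_type: str = "standard"   # "standard" or "custom" (new tri-state mux cell)
    clock_gating: bool = True    # per-row clock gating enable/disable
    partitions: int = 1          # number of stackable blocks

    def __post_init__(self):
        if self.mux_type not in ("standard", "custom"):
            raise ValueError("mux_type must be 'standard' or 'custom'")
        # Assumption: every stackable block holds the same number of rows.
        if self.partitions < 1 or self.rows % self.partitions != 0:
            raise ValueError("rows must divide evenly among partitions")


def rows_per_block(cfg: RamConfig) -> int:
    """Rows held by each stackable block."""
    return cfg.rows // cfg.partitions


def block_row_ranges(cfg: RamConfig) -> list:
    """Inclusive (first_row, last_row) range covered by each block, in stacking order."""
    n = rows_per_block(cfg)
    return [(i * n, (i + 1) * n - 1) for i in range(cfg.partitions)]


def storage_area_um2(cfg: RamConfig) -> float:
    """Raw flip-flop area only; excludes muxes, decoders, clock gates, and routing."""
    return cfg.rows * cfg.bits_per_row * FF_AREA_UM2
```

For example, `RamConfig(rows=128, bits_per_row=64, read_ports=8, write_ports=4, partitions=4)` yields four blocks of 32 rows each. The storage-area figure is a lower bound on the memory's footprint, since the poster's area comparisons come from synthesis and estimation-tool reports that also include port logic and routing.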