High Speed Cache for PICo: Board Proposal by Team XOR

Presentation transcript:

High Speed Cache for PICo. Board Proposal by Team XOR.
Note to future viewers of these slides: all yellow text boxes accompanied by arrows in their direct vicinity were added after the presentation, at Prof. Calhoun's request, to make the slides more understandable when viewed without the oral presentation.

Outline
– Architecture: how we set up our memory
– Decoding: various techniques we implemented and tested to optimize decoding
– Layout: layout techniques we implemented to minimize area
– Simulations: functionality of our SRAM cache
– Presentation of metric

Problem
– Need a high speed cache which also uses minimal area and energy
Approach
– Row decoding is the worst case path, so decode as fast as possible
– Have a compact layout to minimize total area and decrease parasitics

Block diagram of our entire memory. We discussed how our memory worked (Write then Read operation). Very briefly discussed each component shown and how our address bits came about and what we use each bit for. Most components are typical for a memory except the TXGate Control which we mentioned and said we would discuss later in the presentation.

BL/BLB/PRECH Generator

Fight between bitcell data Q and BL. [Waveform plot. Signals: WL127, WL0, WRITE, DATA, PRECH, SAE, Q2, Q2B, BL, BLB, out0. Annotated regions: reading a 0, reading a 1.]

Addition of TXGates to disconnect the BL/BLB drivers from BL/BLB, allowing BL/BLB to float during a read. [Schematic; output labeled "To Block".]

Only a small bump in Q with the addition of TXGateControl. [Waveform plot. Signals: SAE, PRECH, Q2, Q2B, Q1, Q1B, BL, BLB, out0. Annotated regions: reading a 0, reading a 1.]

Bit Cell Ratio Testing
[Plots: Vm Testing, Pull-up Ratio, Cell Ratio. Sized devices: W_PULLUP, W_PULLDOWN, W_PASSGATE.]
Want: Q as close to V_T as possible, but the sizes get too large, so we tried to find something between V_T and V_M.
Chosen sizes: W_PULLUP = 180n; W_PULLDOWN = 240n; W_PASSGATE = 200n
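As a rough check on the chosen sizes, the two ratios named on this slide can be computed directly from the widths. This is only a sketch: it assumes the common textbook definitions (cell ratio = pull-down width over pass-gate width, pull-up ratio = pull-up width over pass-gate width), assumes all three devices share the same channel length, and reads the "n" suffix as nanometres. The function names are illustrative and do not come from the slides.

```python
# Widths from the slide; the "n" suffix is interpreted as nanometres (assumption).
W_PULLUP = 180
W_PULLDOWN = 240
W_PASSGATE = 200


def cell_ratio(w_pulldown, w_passgate):
    """Pull-down width over pass-gate width (assumes equal channel lengths)."""
    return w_pulldown / w_passgate


def pullup_ratio(w_pullup, w_passgate):
    """Pull-up width over pass-gate width (assumes equal channel lengths)."""
    return w_pullup / w_passgate


cr = cell_ratio(W_PULLDOWN, W_PASSGATE)
pr = pullup_ratio(W_PULLUP, W_PASSGATE)
print(f"cell ratio = {cr:.2f}, pull-up ratio = {pr:.2f}")
```

Under these assumptions the cell ratio comes out above 1 and the pull-up ratio below 1, which is the usual direction for read stability and writability respectively.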

Thin Cell Advantages: Smallest possible area of 6T Bit Cell, Can be mirrored (saves area = can reduce distance between n-wells and p-wells)

2-by-2 Array of Thin Cell Layout. Advantages: WLs are horizontal; VDD/VSS/BL/BLB are vertical; mirrored thin cells save area and make it easy to add N/P taps; easy to cascade to other 2-by-2 arrays.

Peripheral Logic
Problem: needed to generate signals such as precharge, prechargebar, etc. from the given inputs.
We created the following signals:
– Local Write Signals
– Local Read Signals
– Precharge
– Prechargebar
– Txcontrol
– Txcontrolbar
– SenseAmp Enable (already localized)

Decoding
Need: high speed. Decoding the proper row location is on the critical path.
Considered numerous options:
– Static
– Dynamic

Decoding
Based on our architecture, we need to decode 10 bits into the proper word line:
– 8 blocks → 3 block select bits
– 128 rows/block → 7 row select bits
3-level decoder:
– Predecoders
– AND combinations of predecoded bits to generate the global word line
– Local word line generation by ANDing the global word line with block select
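The address split and the final AND stage described above can be modeled behaviorally as follows. This is a minimal sketch, not the team's circuit: the slides place the row select on bits 6 to 0, and the code assumes the three block select bits sit directly above them (bits 9 to 7). The names `split_address` and `local_word_lines` are hypothetical.

```python
BLOCK_BITS = 3  # 8 blocks
ROW_BITS = 7    # 128 rows per block


def split_address(addr):
    """Split a 10-bit word-line address into (block, row)."""
    if not 0 <= addr < (1 << (BLOCK_BITS + ROW_BITS)):
        raise ValueError("address must fit in 10 bits")
    block = addr >> ROW_BITS            # assumed position: bits 9..7
    row = addr & ((1 << ROW_BITS) - 1)  # bits 6..0
    return block, row


def local_word_lines(addr):
    """Return an 8 x 128 grid of local word lines; exactly one is asserted."""
    block, row = split_address(addr)
    block_select = [int(b == block) for b in range(1 << BLOCK_BITS)]
    global_wl = [int(r == row) for r in range(1 << ROW_BITS)]
    # Local WL = global WL AND block select, as on the slide.
    return [[bs & gwl for gwl in global_wl] for bs in block_select]


lwl = local_word_lines(0b101_0000100)  # block 5, row 4
```

Because both the block select and the global word line are one-hot, exactly one of the 1024 local word lines is asserted for any valid address.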

Decoding
The critical path requires decoding bits 6 to 0 into the proper row (0 through 127), so we chose to implement this part as a dynamic decoder.
The 3 to 8 block decode occurs in parallel and finishes much sooner (since it only needs to generate 8 signals), so we can conserve power without affecting delay by using a 1-hot static decoder.
[Figure labels: Static 3 to 8 Block Decoder; Dynamic 3 to 8 Predecoder; Dynamic 4 to 16 Predecoder.]

Decoding
Dynamic decoding of the 7 row select bits:
– 2 predecoders: 4 to 16 (upper bits) and 3 to 8 (lower bits)
– The asymmetric predecoders forced us to design the 3 to 8 predecoder to have the same delay as the 4 to 16 predecoder, to reduce glitching power
– Used DRCMOS and skewing techniques
– NOR-style predecoders (same logical effort for larger inputs) with complemented inputs

Decoding
Comparison of dynamic 4 to 16 predecoders:
– Non-skewed: [...] mW average power, [...] ps delay to global WL
– Skewed (2x bigger pmos): [...] mW average power, [...] ps delay to global WL
– Skewed (min widths): [...] mW average power, [...] ps delay to global WL

Decoding
Static vs. best dynamic decoder:
– Static (2-input NAND): 7.025 mW average power, [...] ps delay; E-D product = [...] mW·ps; E-D² product = 70,250 mW·ps²
– Dynamic (DRCMOS, skewed): 13.1 mW average power, [...] ps delay; E-D product = [...] mW·ps; E-D² product = 40,686 mW·ps²
– Thus, we reduced our metric by ~42%
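The ~42% figure can be checked from the two E-D² products quoted on this slide. The sketch below uses only those two numbers and follows the slide's own units (average power in mW times delay squared in ps²). The helper names are illustrative, and the individual delays are not used because they do not survive in the transcript.

```python
def ed2_product(power_mw, delay_ps):
    """Power times delay squared, in the slide's units of mW*ps^2."""
    return power_mw * delay_ps ** 2


def relative_reduction(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (baseline - improved) / baseline


STATIC_ED2 = 70_250   # mW*ps^2, static 2-input NAND decoder (from the slide)
DYNAMIC_ED2 = 40_686  # mW*ps^2, skewed DRCMOS decoder (from the slide)

reduction = relative_reduction(STATIC_ED2, DYNAMIC_ED2)
print(f"{reduction:.1%}")  # prints 42.1%
```

Note that the dynamic decoder draws more average power (13.1 mW versus 7.025 mW), so the improvement in this metric comes entirely from the squared delay term.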

Decoding
Combining predecoder outputs:
– Static combinations
– Each of the 16 MSB outputs is ANDed with each of the 8 LSB outputs to create 128 global word lines
[Figure labels: 4 to 16; 3 to 8; WL0, WL5, WL6, WL7.]
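The predecode-and-combine step can be sketched as a behavioral model: a 4 to 16 one-hot decode of the upper row bits, a 3 to 8 one-hot decode of the lower row bits, and an AND of every pair. The ordering of the resulting word lines (index = MSB output × 8 + LSB output) is an assumption made here so that word line i corresponds to row i; the slides do not state it. Function names are hypothetical.

```python
def one_hot(value, width):
    """Return a one-hot list of length `width` with a 1 at position `value`."""
    return [int(i == value) for i in range(width)]


def global_word_lines(row):
    """Combine 4-to-16 and 3-to-8 predecoder outputs into 128 global WLs."""
    if not 0 <= row < 128:
        raise ValueError("row must fit in 7 bits")
    msb = one_hot(row >> 3, 16)    # 4 to 16 predecoder on row bits 6..3
    lsb = one_hot(row & 0b111, 8)  # 3 to 8 predecoder on row bits 2..0
    # AND each MSB output with each LSB output: 16 * 8 = 128 lines.
    return [m & l for m in msb for l in lsb]


gwl = global_word_lines(37)
print(gwl.index(1))  # prints 37
```

Each global word line needs only a 2-input AND of predecoded signals, rather than a 7-input gate on the raw address bits.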

Decoding
Local word line generation must take into account the parasitics associated with long metal.
[Schematic labels: GWL4, LWL4, BLOCK0, BLOCK1, Block Select (0, 1). Parasitic modeling of decoder wires: C/2, R.]
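The "C/2, R" labels suggest the decode wire was modeled as a lumped pi segment (half the wire capacitance on each side of the wire resistance), though the slides do not say so explicitly. Under that assumption, a first-order Elmore estimate of the time constant seen at the far end of the wire could look like the sketch below. All component values are hypothetical placeholders, not numbers from the presentation.

```python
def elmore_pi_delay(r_driver, r_wire, c_wire, c_load):
    """Elmore time constant at the far end of a pi-modeled wire.

    The driver resistance charges the near-end C/2; the driver and wire
    resistance in series charge the far-end C/2 plus the load.
    """
    near = r_driver * (c_wire / 2)
    far = (r_driver + r_wire) * (c_wire / 2 + c_load)
    return near + far


# Hypothetical values for illustration only.
R_DRIVER = 1000.0  # ohms
R_WIRE = 200.0     # ohms
C_WIRE = 50e-15    # farads
C_LOAD = 10e-15    # farads

tau = elmore_pi_delay(R_DRIVER, R_WIRE, C_WIRE, C_LOAD)
print(f"tau = {tau * 1e12:.1f} ps")
```

A model of this kind shows why buffer placement relative to the long wire matters: the wire capacitance is multiplied by whatever resistance is driving it.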

Decoding
3-level decoder optimization requires sweeping the number of buffers on the decode path.
– Potential locations of buffers: immediately after the predecoders; before the decode wires (parasitic models); after generating the local WLs
– We buffered before the decode wires
– Without buffering: delay = [...] ps; power = [...] mW
– With buffering: delay = [...] ps; power = [...] mW

Decoding
We wanted to use “source-coupled” NAND gates to generate the local word lines, but ran into charge sharing problems.
[Figure: schematic of source-coupled NAND. Labels: Local WL, Global WL, Block Select.]

Decoding Notice: 0.25V output! This is incorrectly de-asserted!

Decoding
Problems with DRCMOS:
– Tried using schematics similar to those found in the literature, which resulted in oscillating predecode output signals. Fix: removed some stages from the literature schematic.
– Also ran into strange glitching of the inputs to the NOR-style predecoder. Fix: usually none. We could drive the inputs harder, but the power losses were found to be acceptable given the speed tradeoff.
Note: we did observe that slowing down the inputs to the predecoder could reduce these glitches.

Decoding Notice: Glitching of input signals! Notice: 3 pulse oscillation of predecode output

Decoding Nice Local WL!

Decoder

Our Architecture

TX Gates (4x min size): disconnect the BL drivers and avoid a fight between the bitcells and the BL drivers. Advantages: outputs nicely spread apart; select lines are all tied together and come in from the side; inputs from top, outputs on bottom; easily N-tapped and P-tapped from the left or right side; easily mirrored.

Buffers (first inv = min size, second inv = 4x min). Advantages: thin enough that 2 buffers fit within the width of a thin cell; easy to source VDD and VSS; inputs and outputs connect easily (top and bottom of diagram); easily N-tapped and P-tapped.

Precharge/BL/BLB Generator. Advantages: pitch matched (made as thin as possible while fitting in with the rest of the circuit); BL/BLB are on the outsides running vertically; easily P-tapped and N-tapped; thin. Signals shown: PRECH, WRITE, and Data.

Word Select (1to2 DEMUX) – Sends Data to Column Advantages: Data comes from top, Address from side (design decision), N-wells together, P-wells together (easy to add N-taps and P-taps), easily mirrored

Sense Amp. Advantages: wanted it as short as possible, but the width had to be smaller than the width of a thin cell; similar to the thin cell in using a cross-coupled inverter layout; easily mirrored; easily P-tapped and N-tapped.

1 SRAM Block
Advantages: symmetrical; pitch matched (all separate components fit in nicely); 1 VDD/VSS source for the bit cell array; can be mirrored; most inputs come in from the left and top sides.
We discussed the components laid out in this image. We should have annotated the picture to make it easier to understand when looking at just the plot without the verbal presentation.
Components shown:
– Sense amps with OUT going into a min. sized buffer
– BL/BLB/PRECH generator
– TXGates to disconnect BL/BLB from their drivers
– BL/BLB drivers
– 1:2 DEMUX to send data between columns
– 8192 bitcells
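The block dimensions implied by the numbers in this deck can be worked out with a few lines of arithmetic. The 8192 bitcells per block, 128 rows per block, and 8 blocks come from the slides. The assumption that each row holds exactly two words (suggested by the 1:2 DEMUX and the "word 0 or word 1" selection) is an inference, so the derived word width should be read as a plausible estimate rather than a stated figure.

```python
BITCELLS_PER_BLOCK = 8192  # from the slide
ROWS_PER_BLOCK = 128       # from the decoding slides
NUM_BLOCKS = 8             # from the layout diagram
WORDS_PER_ROW = 2          # assumption: inferred from the 1:2 DEMUX

columns_per_block = BITCELLS_PER_BLOCK // ROWS_PER_BLOCK
word_width = columns_per_block // WORDS_PER_ROW
total_bits = BITCELLS_PER_BLOCK * NUM_BLOCKS
total_words = total_bits // word_width

print(columns_per_block, word_width, total_bits, total_words)
```

Under these assumptions each block is 128 rows by 64 columns, and the full array holds 65,536 bits.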

Block Pair: three 2:1 MUXes. One MUX per SRAM block selects the correct column bit (i.e. word 0 or word 1) within that block. The third MUX selects the word from the correct block.
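The three-MUX arrangement can be written as a small behavioral model. This is a sketch under stated assumptions: the select signal names `word_sel` and `block_sel` are hypothetical, and a select value of 0 is assumed to pick the first input of each MUX.

```python
def mux2(sel, a, b):
    """2:1 MUX: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def block_pair_out(block0_words, block1_words, word_sel, block_sel):
    """Select one word from a pair of SRAM blocks using three 2:1 MUXes.

    block0_words and block1_words are (word0, word1) pairs read from a row.
    """
    from_block0 = mux2(word_sel, block0_words[0], block0_words[1])
    from_block1 = mux2(word_sel, block1_words[0], block1_words[1])
    return mux2(block_sel, from_block0, from_block1)


out = block_pair_out(("b0w0", "b0w1"), ("b1w0", "b1w1"), word_sel=1, block_sel=0)
print(out)  # prints b0w1
```

The same pattern repeats one level up, where another 2:1 MUX chooses between two block pairs.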

Block diagram of 3 2:1 MUXes in Block Pair on previous page

Block Pair Merger

Block Pair

How we laid out our devices

Layout Diagram (connections removed to reduce clutter). Labeled: 8 SRAM blocks; 7:128 row decoder. Blue lines are metal sending data chosen from a “block pair” to another 2:1 MUX in between 2 block pairs.

[Simulation waveform. Signals: CLK, WRITE, READ, GlobalWL, BlockSelect, PRECH, localWL, TXGC, Q, QB, Dout10, Dout17. Sequence shown: 5 reads, 1 write.] Bumps in Q are due to reading a 0. TXGates are only on during write or PRECH.

Metric Breakdown
Metric: [...] J·s²·mm²·W
– 1 bitcell area: [...] µm²
– Total area: [...] mm²
– Total energy: [...] nJ
– Read delay: [...] ns
– Write delay: [...] ns
– Total delay: [...] ns
– Idle power: 117 mW

Questions?