Case Study - SRAM & Caches

Case Study - SRAM & Caches By Nikhil Suryanarayanan

Outline: Introduction; SRAM Cell; Peripheral Circuitry; Practical Cache Designs; Technology Scaling & Power Reduction; Summary

Cache Sizes in Recent Processors: Core 2 Duo 6 MB; Atom 512 KB; Core i7 Extreme 8 MB; Core i7 Extreme Mobile; Core 2 Quad 12 MB; Itanium 2 24 MB; Xeon 16 MB

SRAM Cell Basic building block of on-chip caches. Various designs exist: 12T, 10T, 8T, 6T, 4T; the 6T cell is the most widely used by industry in contemporary microprocessors. A data bit and its complement are stored in a pair of cross-coupled inverters. Simple in design.

A 6T SRAM Cell: a pair of cross-coupled inverters plus two access transistors. The NMOS access transistors are good at passing zeros.
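
As a behavioral illustration only (not a circuit model, and not taken from the slides), here is a minimal sketch of how the cell holds a bit and its complement and only exposes them to the bitlines while the word line is asserted:

```python
# Behavioral sketch of a 6T SRAM cell (illustrative only, not a circuit model).
# The cross-coupled inverters hold q and q_bar; the access transistors connect
# them to the bitline pair only while the word line is asserted.
class SramCell6T:
    def __init__(self, bit=0):
        self.q = bit          # storage node
        self.q_bar = 1 - bit  # complementary node

    def read(self, word_line):
        """With the word line high, the cell drives the bitline pair."""
        if not word_line:
            return None, None          # cell isolated from the bitlines
        return self.q, self.q_bar      # (bitline, bitline_bar)

    def write(self, word_line, bl, bl_bar):
        """A write overpowers the inverters via the bitlines."""
        if word_line and bl != bl_bar:
            self.q, self.q_bar = bl, bl_bar

cell = SramCell6T()
cell.write(word_line=1, bl=1, bl_bar=0)
assert cell.read(word_line=1) == (1, 0)
```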

Column Circuitry: bitline conditioning, sense amplifiers, and the column multiplexer (column decoder) are the three major column-circuitry components.

Bitline Conditioning Precharge the bitlines high before reads. Equalize the bitlines to minimize any voltage difference between them. Why? So that the sense amplifier sees only the differential developed by the accessed cell. The second circuit provides the added functionality of equalizing the bitlines during precharge.

Sense Amplifier Bitlines have many cells attached, so they present enormous capacitive loads and swing slowly. Voltages are equalized during precharge; the SA detects the small swing as the bitlines are driven apart and amplifies it to a full logic level, reducing delay by not waiting for a full swing. SAs are commonly used to magnify small differential input voltages into larger output voltages. They are common in memories in which the differential bitlines have enormous load capacitances; because of the large load, the bitlines swing slowly. To reduce this delay, the bitline voltages are equalized first, so that a small swing is enough to resolve to normal logic levels.
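
To see why sensing a small differential helps, here is a toy timing sketch; the bitline capacitance, cell current, and 100 mV sensing threshold are assumed values for illustration, not figures from the slides:

```python
# Toy numbers (assumed): a heavily loaded bitline discharged by a weak cell
# current develops voltage slowly, so sensing a ~100 mV differential finishes
# far sooner than waiting for a full rail-to-rail swing.
C_bitline = 500e-15   # 500 fF of bitline + cell loading (assumed)
I_cell    = 50e-6     # 50 uA cell read current (assumed)
VDD       = 1.0       # supply voltage (assumed)

slew = I_cell / C_bitline          # dV/dt on the discharging bitline
t_full_swing = VDD / slew          # wait for a full VDD swing
t_sense      = 0.1 / slew          # sense amp resolves at ~100 mV differential

print(f"full swing: {t_full_swing*1e9:.1f} ns, sensed read: {t_sense*1e9:.1f} ns")
# -> 10.0 ns vs. 1.0 ns: the sense amplifier cuts the bitline-limited delay ~10x here
```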

Types of SA The differential pair requires no clock but always dissipates static power. A clocked sense amplifier saves power and also isolates the large bitline capacitances. Improved gain is offered by the cross-coupled amplifier. Different versions of the second sense amplifier are in current use; the basic idea of operation is the same.

Read Circuit General Diagram representing a read circuit

Write Circuit Diagram representing a write circuit

Drive strengths, Read Stability, and Writeability NMOS pull-down: strongest. Access NMOS: intermediate. PMOS pull-up: weakest. Read stability: during a read with Q_bar stored low, the word line is raised and BL_bar should be pulled down through M5 and M1. At the same time Q_bar tends to rise because of the current flowing in from M5, but it must be kept low by M1; hence M1 must be stronger than M5. Writeability: assume Q_bar is low (Q high) and we wish to flip the cell. Because of the read-stability constraint, a 1 cannot be forced onto Q_bar through M1's side, so a 0 is written onto Q through M6 instead. The pull-up M4 opposes this operation, hence M4 must be weaker than M6.
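
The same ordering is often expressed as width ratios. Below is a sketch of that check, using typical textbook-style bounds (roughly a cell ratio above ~1.2 for read stability and a pull-up ratio below ~1.8 for writeability); the exact limits are process-dependent and are not from the slides:

```python
def check_cell_sizing(w_pulldown, w_access, w_pullup,
                      min_cell_ratio=1.2, max_pullup_ratio=1.8):
    """Rough read-stability / writeability check on relative device widths."""
    cell_ratio   = w_pulldown / w_access   # pull-down vs. access: read stability
    pullup_ratio = w_pullup / w_access     # pull-up vs. access: writeability
    return {"read_stable": cell_ratio >= min_cell_ratio,
            "writeable":   pullup_ratio <= max_pullup_ratio}

# Strongest pull-down, intermediate access, weakest pull-up (example widths):
print(check_cell_sizing(w_pulldown=1.5, w_access=1.0, w_pullup=0.6))
# -> {'read_stable': True, 'writeable': True}
```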

Industry Designs and Novel Design Techniques Embedded dual 512 KB L2$ in UltraSPARC (2004). L1$, L2$ & L3$ designs in two Itanium generations (2002-2003). Effects of technology scaling. Overview of a current design in 45 nm.

Design Considerations A larger on-chip cache running at processor frequency improves performance, but latency also increases as the size grows. Memory-cell performance does not improve as rapidly as logic-transistor performance with each new generation, and higher wire delays and additional logic for soft-error protection are introduced. The design focus changes with the cache level. Hence it is important to optimize designs for minimum hit latency, since a large fraction of data and program can reside within the larger L2$. Extra design effort must be taken to meet aggressive clock cycle times and to move what used to be off-chip memory on-chip. Develop optimal solutions for architectural features, as well as a design methodology that balances complexity against performance and efficient use of silicon area. The first level is optimized for latency, the second level for bandwidth, and the third level for size.

Itanium 2 L1 cache 16 KB, 4-way I and D caches. Feature: eliminates read stalls during cache updates. Focus: circuit technique.

L1$ Circuit technique An L1$ circuit technique to eliminate stalls during cache updates without incurring the area penalty of an extra read port. The two single-ended ports are shared between reads and writes. A write needs dual-ended access to the array in the same cycle, since only a discharge of a bitline can effectively flip the memory-cell state. Here, writes take two cycles, with 0s being written in the first cycle and 1s in the subsequent cycle (control logic allows writes to start with either 0 or 1).
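
A behavioral sketch of that two-cycle write, reusing the illustrative SramCell6T class from the 6T-cell slide above (again, an illustration of the idea rather than the actual Itanium circuit):

```python
# Behavioral sketch: with single-ended ports a cell only flips when a bitline
# is discharged, so the 0 bits and the 1 bits of a word go in separate cycles
# (control logic may start with either polarity).
def two_cycle_write(cells, data):
    for cell, bit in zip(cells, data):      # cycle 1: discharge true bitlines
        if bit == 0:
            cell.write(word_line=1, bl=0, bl_bar=1)
    for cell, bit in zip(cells, data):      # cycle 2: discharge complement bitlines
        if bit == 1:
            cell.write(word_line=1, bl=1, bl_bar=0)

word = [SramCell6T() for _ in range(4)]
two_cycle_write(word, [1, 0, 1, 1])
assert [c.q for c in word] == [1, 0, 1, 1]
```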

UltraSPARC (2004) Dual 512 KB L2$ 4-way set-associative design; 64-byte line size; 128-bit data bus; 4 cycles to fill a line; 8-bit ECC per 64-bit datum. Conversion from an off-chip direct-mapped cache to an on-chip 4-way set-associative cache. Pseudo-random replacement algorithm, almost similar to a pseudo-LRU approach. Four tags per set (one per way).
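
A quick worked check of the geometry these parameters imply, assuming byte addressing:

```python
# 512 KB / (4 ways * 64 B lines) = 2048 sets, and a 64 B line over a
# 128-bit (16 B) bus takes 4 transfers, matching the 4-cycle line fill.
cache_bytes = 512 * 1024
ways, line_bytes, bus_bytes = 4, 64, 16            # 128-bit bus = 16 bytes/transfer

sets        = cache_bytes // (ways * line_bytes)   # 2048 sets
offset_bits = line_bytes.bit_length() - 1          # 6 bits of line offset
index_bits  = sets.bit_length() - 1                # 11 bits of set index
fill_beats  = line_bytes // bus_bytes              # 4 transfers to fill a line

print(sets, offset_bits, index_bits, fill_beats)   # 2048 6 11 4
```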

UltraSPARC L2$ Since the data array occupies most of the silicon area, its goal was process tolerance and area efficiency, while the tag array was designed to tight timing constraints. Conversion from an off-chip direct-mapped cache to an on-chip cache. Feature: read the tags for the next access during the current access. Focus: use of differential access times to the tag and data arrays, and layout. Design optimization requires determining which of the four ways to write during stores.

Read Access Pipeline The L2$ is pipelined and can handle multiple outstanding stores. A special case arises from the need to support the store buffer's ability to read the tags and to store other data simultaneously. Tags are re-read dynamically to determine which way to write; this requires tag lookup and store support in the same cycle, ideally needing a dual-ported array. It is achieved instead with single-cycle access and dual output latches. One cycle is allowed for tag and data signals to reach the farthest end of the L2$ if needed; they are latched on the positive edge of the subsequent cycle. Data takes two cycles, while a tag lookup completes in one cycle and the next tag lookup begins in the following cycle. The scheme helps meet the tight timing requirement by eliminating clock and flop overheads such as clock skew, setup time, and clk-to-Q delays. The smaller number of flop elements and clock buffers needed at half frequency reduced overall power consumption and area compared to a single-cycle design.

Tag array I/O circuit diagram This technique eliminates the need to duplicate the wordline decoder and I/O peripherals in the 32 KB tag arrays. A clocked, self-resetting dynamic circuit with a pseudo-current-mode sense amplifier is used to achieve faster access time and higher tag-array efficiency. The current sense amplifier demonstrates a 15% speed improvement over a conventional cross-coupled latch-type sense amplifier under the same bitline loading conditions. P3 and P4 are both tied to the bitlines. Column-select signals are generated at about 40% of the VDD level; this puts transistors P1 and P2 in saturation so that they act as low-impedance devices for current-mode sensing. At the beginning, eq_l goes high and equalizes nodes sa and sa_l; eq_l then turns off once enough differential current is established through N1 and N2. As one of the sense amplifier's internal nodes rises above the trip point of the NAND gate, the corresponding PMOS turns on and speeds up the transition to VDD.

Data organization 256 rows x 256 columns form an 8 KB unit; tiling 16 such 8 KB banks composes a 128 KB unit, which is used as the building block for the 512 KB L2$. Integrating the SRAM arrays and logic units as one subsystem achieves both short latency and efficient use of silicon area and power.
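
The arithmetic behind that tiling, assuming one bit per cell:

```python
rows, cols = 256, 256
unit_bytes = rows * cols // 8          # 256 x 256 bits = 8 KB sub-bank
bank_bytes = 16 * unit_bytes           # 16 tiled sub-banks = 128 KB building block
l2_bytes   = 4 * bank_bytes            # four building blocks = 512 KB L2$

print(unit_bytes // 1024, bank_bytes // 1024, l2_bytes // 1024)   # 8 128 512
```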

Itanium 3 MB L3$ 180 nm process, 12-way set-associative, fitted into an irregular structure. Feature: the size of the cache. Focus: regular and efficient partitioning of an irregular space. Per sub-array: 96 words with 256 bitlines x 8 groups, i.e. 96 x 256 x 8 bits = 24 KB. The memory cells in each sub-array are further divided into eight groups, each with local sense amplifiers and driver circuitry; 8 global column selects and 8 group clocks activate one of the eight groups.
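
A quick check of the quoted sub-array capacity, and of how many such sub-arrays a 3 MB array implies, assuming one bit per cell:

```python
words, bitlines, groups = 96, 256, 8
subarray_bytes   = words * bitlines * groups // 8       # 24 KB per sub-array
subarrays_in_3mb = (3 * 1024 * 1024) // subarray_bytes  # sub-arrays needed for 3 MB

print(subarray_bytes // 1024, subarrays_in_3mb)         # 24 (KB), 128 sub-arrays
```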

Itanium 3 MB L3$

Sensing Unit During a write, the write data pulls down the selected bitline by enabling a pull-down chain, while holder devices keep the complementary bitline high.

Itanium 2 6MB L3$ 24-way set-associative, 130 nm process, 64-bit EPIC architecture processor. Double the number of transistors compared to the previous generation, yet the same power dissipation, despite 1.5x the frequency, 2x the L3 cache, and 3.5x the leakage. The architecture is inherited from its 180 nm predecessor; the focus was shrinking the complex circuit topologies to the 130 nm process, pushing the frequency envelope while maintaining strict control over power.

Technology Scaling: Current vs. Previous Model The current model has a 50% higher frequency, a 2x larger L3$, and 3.5x the leakage current. In order to stay within the same power envelope, the active power share had to be reduced from 90% to 74%. This was accomplished through aggressive management of the dynamic power: reduced clock loading, lower power contention, and better L3$ power management.
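
For concreteness, here is a bookkeeping sketch of that constraint; the 90%/74% shares come from the slide, while the 100 W envelope is an assumed value used only for illustration:

```python
budget_watts = 100.0                     # assumed fixed power envelope (illustrative)
prev_active, new_active = 0.90, 0.74     # active-power shares quoted on the slide

prev_leak  = budget_watts * (1 - prev_active)   # leakage before: 10 W of 100 W
new_leak   = budget_watts * (1 - new_active)    # leakage after:  26 W of 100 W
active_cut = budget_watts * (prev_active - new_active)

print(f"leakage {prev_leak:.0f} W -> {new_leak:.0f} W; "
      f"dynamic power must give back {active_cut:.0f} W of the budget")
# Leakage's share grows ~2.6x, so dynamic power has to shrink in absolute terms
# even though the frequency target is 1.5x higher -- hence the clock-loading,
# contention, and L3 power-management work listed above.
```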

L3 Power Reduction Scheme Reduce the number of arrays activated to access a cache line.
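
A minimal sketch of the idea, with invented array counts and a hypothetical mapping (not the actual Itanium scheme): decode the address first and enable only the sub-arrays that hold the requested line, leaving the rest quiescent.

```python
NUM_SUBARRAYS   = 128   # assumed number of data sub-arrays (illustrative)
ARRAYS_PER_LINE = 2     # assumed sub-arrays spanned by one cache line

def subarrays_for_line(line_address):
    """Return the set of sub-arrays that must be activated for this line."""
    first = (line_address * ARRAYS_PER_LINE) % NUM_SUBARRAYS
    return {(first + i) % NUM_SUBARRAYS for i in range(ARRAYS_PER_LINE)}

def access(line_address):
    enabled = subarrays_for_line(line_address)
    # Only the 'enabled' sub-arrays precharge, decode, and sense; the remaining
    # arrays stay quiescent and burn no active power for this access.
    return enabled

print(access(0x1A2B))   # e.g. {86, 87}: 2 of 128 arrays active for this access
```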

Itanium

Core 2 Duo

Power Reduction techniques used in Silverthorne Control registers disable a pair of ways (out of 8) during low-performance operation; a way is entirely flushed before it is disabled. During Deep Power Down, the entire voltage plane supplying the L2 is cut off. The general cache architecture has otherwise remained the same.
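
A behavioral sketch of that way-disable sequence; the class, register, and field names here are invented for illustration and are not Silverthorne's actual control interface:

```python
class L2WayControl:
    """Hypothetical model of a way-disable control register (illustrative)."""
    def __init__(self, num_ways=8):
        self.enabled = set(range(num_ways))   # all 8 ways active by default

    def flush_way(self, way):
        # Placeholder for writing back dirty lines and invalidating the way.
        print(f"flushing way {way} before disable")

    def disable_pair(self, pair):             # pair 0 -> ways 0,1; pair 1 -> 2,3; ...
        for way in (2 * pair, 2 * pair + 1):
            if way in self.enabled:
                self.flush_way(way)           # flush entirely before disabling
                self.enabled.discard(way)

ctrl = L2WayControl()
ctrl.disable_pair(3)            # low-performance mode: drop ways 6 and 7
print(sorted(ctrl.enabled))     # [0, 1, 2, 3, 4, 5]
```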

Questions?

References CMOS VLSI Design: A Circuits and Systems Perspective, Weste & Harris, 3rd ed.; "On-Chip 3-MB Subarray-Based 3rd-Level Cache on an Itanium Microprocessor," D. Weiss, J. J. Wuu, V. Chin; "Design and Implementation of an Embedded 512-KB Level-2 Cache Subsystem," Shin, Petrick, Singh, Leon; "A 1.5-GHz 130-nm Itanium 2 Processor with 6-MB On-Die L3 Cache."

Need for Caches Off-chip memory has large access delays, and fast on-chip memory is expensive; a hierarchical memory system is the solution. Bring a limited amount of data on-chip to reduce latency and increase processor performance. There are large access delays to read and write data; the best solution would be to make everything fast, but that is not possible, since fast memory is expensive and everything cannot be brought on-chip. Instead, follow a hierarchy that provides a cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level. Memory is a known bottleneck, so improving the speed and capacity of caches directly improves processor performance.
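
A worked average-memory-access-time (AMAT) example, with assumed hit time, miss rate, and miss penalty, illustrates why the hierarchy approaches the speed of its fastest level:

```python
# Worked AMAT example with assumed numbers (not from the slides): a small fast
# cache in front of slow off-chip memory gives close-to-cache average latency.
hit_time_cycles     = 2        # assumed on-chip cache hit latency
miss_penalty_cycles = 200      # assumed off-chip access latency
miss_rate           = 0.03     # assumed 3% miss rate

amat = hit_time_cycles + miss_rate * miss_penalty_cycles
print(f"AMAT = {amat:.1f} cycles vs. {miss_penalty_cycles} cycles without a cache")
# -> 8.0 cycles: ~25x better than always going off-chip, from a cache holding
#    only a tiny fraction of the off-chip memory's capacity.
```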

Data Array read & write critical signals The high-density memory cell selected allows an area-efficient design at the expense of a higher delay to develop a bitline differential signal, due to the lower memory-cell current.

Write Circuit