CSE477 VLSI Digital Circuits, Fall 2002
Lecture 25: Peripheral Memory Circuits
Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg477
[Adapted from Rabaey's Digital Integrated Circuits, ©2002, J. Rabaey et al.]

Review: Read-Write Memories (RAMs)
Static (SRAM)
- data is stored as long as supply is applied
- large cells (6 FETs/cell), so fewer bits per chip
- fast, so used where speed is important (e.g., caches)
- differential outputs (output BL and !BL)
- uses sense amps for performance
- compatible with CMOS technology
Dynamic (DRAM)
- periodic refresh required
- small cells (1 to 3 FETs/cell), so more bits per chip
- slower, so used for main memories
- single-ended output (output BL only)
- needs sense amps for correct operation
- not typically compatible with CMOS technology

Review: 4x4 SRAM Memory
[Figure: 4x4 array of 2-bit words with a row decoder (A1, A2 driving WL[0]..WL[3]), a column decoder (A0), bit line precharge (enable), sense amplifiers, write circuitry, and clocking and control; each column has a BL/!BL bit line pair.]
- To decrease the bit line delay for reads, use low-swing bit lines: the bit lines do not swing rail-to-rail, but precharge to Vdd (as 1) and discharge only to Vdd - 10%Vdd (as 0). So for a 2.5 V Vdd, a 0 is 2.25 V. This requires sense amplifiers to restore the full swing.
- The write circuitry either receives full swing from the sense amps or writes full swing to the bit lines.

Peripheral Memory Circuitry
- Row and column decoders
- Sense amplifiers
- Read/write circuitry
- Timing and control

Row Decoders
A row decoder is a collection of 2^M complex logic gates organized in a regular, dense fashion (a behavioral sketch of both styles follows below).
(N)AND decoder:
  WL(0) = !A9!A8!A7!A6!A5!A4!A3!A2!A1!A0
  ...
  WL(511) = !A9A8A7A6A5A4A3A2A1A0
NOR decoder:
  WL(0) = !(A9+A8+A7+A6+A5+A4+A3+A2+A1+A0)
  WL(511) = !(A9+!A8+!A7+!A6+!A5+!A4+!A3+!A2+!A1+!A0)
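Not part of the original slides: a minimal Python sketch of the two decoder styles, treating each word line as a Boolean function of the address bits. The 10-bit width and the WL(511) case mirror the equations above; the function names are ours.

```python
M = 10  # address bits -> 2**M word lines

def nand_decoder_wl(i, a):
    """(N)AND style: WL(i) = AND over k of (A_k if bit k of i else !A_k)."""
    out = 1
    for k in range(M):
        out &= a[k] if (i >> k) & 1 else 1 - a[k]
    return out

def nor_decoder_wl(i, a):
    """NOR style: WL(i) = NOR over k of (!A_k if bit k of i else A_k)."""
    any_term = 0
    for k in range(M):
        any_term |= (1 - a[k]) if (i >> k) & 1 else a[k]
    return 1 - any_term

addr = 511                               # 0b0111111111, as on the slide
a = [(addr >> k) & 1 for k in range(M)]  # A0..A9
assert nand_decoder_wl(511, a) == 1 == nor_decoder_wl(511, a)
assert nand_decoder_wl(0, a) == 0 == nor_decoder_wl(0, a)
```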

Dynamic NOR Row Decoder
[Figure: dynamic 2-to-4 NOR decoder with precharge PMOS devices to Vdd on WL0-WL3 and pull-down networks gated by A0, !A0, A1, !A1.]
- Precharge all outputs high, then pull the inactive outputs to GND; the output signals are active high.
- The signal goes through at most one FET, so the propagation delay is constant (in theory); some output wires have two parallel paths to GND.
- The capacitance of the output wires grows linearly with the decoder size (linear growth in FET diffusion capacitance).
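As an illustration only (not from the lecture), a two-phase behavioral model of the dynamic 2-to-4 NOR decoder: the precharge phase drives every word line high, and the evaluate phase discharges each line whose pull-down network conducts.

```python
def dynamic_nor_decode(a1, a0):
    """Word lines WL[0..3] after evaluate for a dynamic 2-to-4 NOR decoder."""
    wl = [1, 1, 1, 1]                 # precharge phase: every output high
    pulldown = [a1 | a0,              # WL0 discharges unless A1 = A0 = 0
                a1 | (1 - a0),        # WL1 discharges unless A1 = 0, A0 = 1
                (1 - a1) | a0,        # WL2 discharges unless A1 = 1, A0 = 0
                (1 - a1) | (1 - a0)]  # WL3 discharges unless A1 = A0 = 1
    for i in range(4):
        if pulldown[i]:               # a conducting pull-down grounds the line
            wl[i] = 0
    return wl

assert dynamic_nor_decode(0, 0) == [1, 0, 0, 0]
assert dynamic_nor_decode(1, 1) == [0, 0, 0, 1]
```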

Dynamic NAND Row Decoder
[Figure: dynamic 2-to-4 NAND decoder with series pull-down stacks gated by !A0, A0, !A1, A1 and a precharge device per word line WL0-WL3.]
- Precharge all outputs high, then discharge the active output; the output signals are active low.
- Must ensure that all input signals are low during precharge; otherwise Vdd and GND are connected!
- The series stack depth grows linearly with the decoder size: a 2-to-4 decoder has two FETs in series, a 3-to-8 has three, etc., so the NAND decoder will be slower than the NOR implementation if gate capacitance dominates diffusion capacitance.

Split Row Decoder
[Figure: predecoders producing terms such as !(!A0!A1!A2) ... !(A0A1A2), combined by final gates like !(!(!A0!A1!A2) + !(!A3!A4!A5) + !A6) to drive WL0-WL127; the predecoded lines (8 + 8 + 2) are routed across the 128 final decoders for Address<6:0>.]
- Since address decoders need to interface with the core array cells, pitch matching with the core storage array can be difficult. Splitting the decoder into smaller pieces improves both pitch and speed.
- Propagation delay is reduced by roughly a factor of 4 when the number of inputs is halved, since delay grows quadratically with fan-in; the load on the vertical address lines is also halved, further reducing delay.
- The transistor count also drops: a 10-input decoder with a 4-bit predecoder needs 1024*6 (row decoder) + 5*4*4 (predecoder) = 6,224 devices, versus 11*1024 = 11,264 for the flat design, leaving about 55% of the original count, i.e., roughly a 45% reduction (checked in the sketch below).
- Address bits are predecoded; the 18 lines of predecoded values are clocked and routed across the 128-row array of final decoders. Each final decoder taps one signal from each of the three sets to perform the final level of decoding.
- Note that the final row drivers have to drive a very long and heavily loaded word line, so drive it carefully (see Lecture 24: position drivers in the center of the array, replicate circuitry at each end, strap the cells, etc.).
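A quick arithmetic check of the device counts quoted above, reading the figures exactly as the slide gives them (a sketch, not part of the lecture):

```python
# Arithmetic check of the transistor counts quoted on the slide.
rows = 1024                    # 10 address bits -> 2**10 word lines
flat = 11 * rows               # flat decoder: 11 devices per word line
split = rows * 6 + 5 * 4 * 4   # 6-device final gate per row + predecoder,
                               # predecoder count taken verbatim from the slide
print(flat, split)             # 11264 6224
print(f"reduction: {1 - split / flat:.0%}")  # about 45%
```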

Pass Transistor Based Column Decoder
[Figure: bit line pairs BL3/!BL3 ... BL0/!BL0 steered onto Data/!Data by select lines S3..S0 from a 2-input NOR decoder driven by A1, A0.]
- Essentially a 2^k-input multiplexer.
- The NOR decoder can run while the row decoder and core are working, so only one extra transistor sits in the signal path.
- Transistor count: 2*((k+1)*2^k + 2^k) devices, so for k = 10 (1024-to-1) it requires 2*12,288 transistors (12,288 = 11*1024 + 1024).
- Advantage: speed, since there is only one extra transistor in the signal path.
- Disadvantage: large transistor count.

Tree Based Column Decoder
[Figure: binary tree of pass transistors gated by !A0/A0 and !A1/A1 steering BL3/!BL3 ... BL0/!BL0 onto Data/!Data.]
- No predecoder is needed, unlike the previous design.
- The number of transistors comes down to 2*2*(2^k - 1), so for k = 10 (1024-to-1) it requires only 2*2,046 transistors (compared below).
- Advantage: the number of transistors is drastically reduced.
- Disadvantage: delay increases quadratically with the number of sections, so it is prohibitive for large decoders; fix with buffers, progressive sizing, or a combination of the tree and pass transistor approaches.
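To make the trade-off between the two column-decoder styles concrete, here is a small comparison of the device-count formulas from this slide and the previous one (an illustrative sketch; the function names are ours):

```python
def pass_transistor_count(k):
    # NOR decoder ((k+1) devices per output) plus one pass device per
    # bit line, doubled for the differential BL/!BL pair
    return 2 * ((k + 1) * 2**k + 2**k)

def tree_count(k):
    # binary tree uses 2*(2**k - 1) pass devices per bit line, doubled
    return 2 * 2 * (2**k - 1)

for k in (2, 6, 10):
    print(k, pass_transistor_count(k), tree_count(k))
# k = 10 gives 24576 vs 4092 devices, matching the 2*12,288 and 2*2,046
# figures, but the tree's delay grows quadratically with its depth.
```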

Bit Line Precharging
[Figure: static pull-up precharge versus clocked precharge with an equalization transistor between BL and !BL.]
- Equalization transistor: speeds up equalization of the two bit lines by allowing the capacitance and pull-up device of the non-discharged bit line to assist in precharging the discharged line.
- Static pull-up scheme: the advantage is that it does not require a heavily loaded precharge clock signal to be routed across the array; the disadvantage is that it is always on, so it fights the bit line discharge on bit lines that are moving low (consuming power).
- Clocked scheme: allows the designer to use much larger precharge devices so that bit line equalization happens more rapidly (the equalization transistor helps even more); the disadvantage is the power consumed by the heavily loaded precharge clock signal.

Sense Amplifiers
t_p = (C * ΔV) / I_av: make ΔV (the sensed swing) as small as possible and I_av as large as possible to keep t_p small.
Use sense amplifiers (SA) to amplify the small swing on the bit lines to the full rail-to-rail swing needed at the output (a worked example follows below).
[Figure: sense amplifier between the small-swing bit line input and the full-swing output.]
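A worked instance of the delay relation, with assumed (not lecture-supplied) values for the bit line capacitance and cell read current, showing why sensing at a small swing pays off:

```python
# t_p = (C * dV) / I_av with assumed values (not from the lecture):
C_bl = 1e-12          # bit line capacitance, 1 pF (assumed)
I_av = 100e-6         # average cell read current, 100 uA (assumed)
Vdd = 2.5             # supply from the earlier SRAM slide

t_full = C_bl * Vdd / I_av            # wait for a rail-to-rail swing: 25 ns
t_small = C_bl * (0.1 * Vdd) / I_av   # sense at a 10% swing: 2.5 ns
print(t_full, t_small)                # the sense amp buys roughly 10x
```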

Latch Based Sense Amplifier
[Figure: cross-coupled latch sense amplifier, isolated from BL and !BL; the bit lines are precharged to Vdd with an input swing of ΔV = 0.1 Vdd, and the sense amplifier outputs swing the full ΔV = Vdd.]
- Once an adequate voltage gap is created on BL and !BL, the sense amp is enabled with the "sense" line.
- Positive feedback quickly forces the outputs to a stable operating point.

Alpha Differential Amplifier/Latch
[Figure: sense amplifier fed from the column decoder mux (select lines S3..S0, precharge, mux_out/!mux_out), built from devices P1-P4 and N1-N5, with buffered outputs sense_out/!sense_out.]
- N2 and N4 sense the voltage change on the bit lines (their gates connect to the two bit lines coming from the column decoder).
- N1 enables the SA by providing current to the two detection devices. The cross-coupled inverter pair (P1/N3 and P2/N5) amplifies the detected voltage differential. Finally, a set of inverters buffers the output of the SA.
- Operation begins with sense in the low state: P3 and P4 precharge sense_out and !sense_out to Vdd, and the bit lines gating N2 and N4 are precharged to Vdd. As the read begins, one of the two input lines starts to discharge and sense is asserted (0->1). This turns on the pull-down N1, supplying current to the two legs of the SA.

Read/Write Circuitry
[Figure: Local R/W block between the sense amp/bit line pair (BL, !BL) and the D and R buses, with precharge logic on R.]
- D: data (write) bus; R: read bus; W: write signal; CS: column select (from the column decoder).
- Local write: BL = D, !BL = !D, enabled by W & CS.
- Local read: R = BL, !R = !BL, enabled by !W & CS.
- This is one of many ways to implement the read/write circuitry (a truth-level sketch follows below).
- Use precharge logic (and possibly another sense amp, not shown) to speed up the voltage change on the read bus.
- The Local R/W logic isolates the capacitance of a selected bit line pair from the data/read buses, helping to keep the latency low.
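A truth-level sketch of the Local R/W behavior described above; this is our interpretation of the slide's logic equations, not a transistor-accurate model.

```python
def local_rw(W, CS, D, BL, nBL):
    """Return updated (BL, nBL, R, nR) for one column per the slide's rules."""
    R = nR = None                  # read bus floats unless driven
    if CS and W:                   # local write: drive the bit lines from D
        BL, nBL = D, 1 - D
    elif CS and not W:             # local read: pass bit lines to read bus
        R, nR = BL, nBL
    return BL, nBL, R, nR

# Write a 1, then read it back through the same column.
BL, nBL, _, _ = local_rw(W=1, CS=1, D=1, BL=1, nBL=1)
_, _, R, nR = local_rw(W=0, CS=1, D=0, BL=BL, nBL=nBL)
assert (R, nR) == (1, 0)
```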

Approaches to Memory Timing
DRAM timing: multiplexed addressing. The address bus carries the row address (msb's) and then the column address (lsb's), sequenced by the RAS and CAS strobes (RAS-CAS timing).
SRAM timing: self-timed. The full address is presented at once, and an address transition initiates the memory operation.

SRAM Address Transition Detection

DRAM Timing

Review: A Typical Memory Hierarchy
By taking advantage of the principle of locality, we can:
- present the user with as much memory as is available in the cheapest technology, and
- provide access at the speed offered by the fastest technology.
[Figure: on-chip components (Datapath, RegFile, ITLB/DTLB, instruction and data caches, control, eDRAM), a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk).]
Speed (ns):   0.1's    1's    10's    100's   1,000's
Size (bytes): 100's    K's    10K's   M's     T's
Cost:         highest ----------------------> lowest
The memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides speed, these boxes also vary in size (smallest to biggest) and cost. What makes this arrangement work is one of the most important principles in computer design: the principle of locality. The design goal is to present the user with as much memory as is available in the cheapest technology (the disk) while, by exploiting locality, providing an average access speed close to that of the fastest technology. (This slide is covered in detail in the upcoming lectures on caches.)

Translation Lookaside Buffers (TLBs)
- Small caches used to speed up address translation in processors with virtual memory.
- All addresses have to be translated before cache access.
- The I$ can be virtually indexed/virtually tagged.

TLB Structure
[Figure: the CPU issues a virtual address split into a VA tag and page-offset bits (page size = index bits + byte select bits); the VA tag is compared against each stored tag, and on a hit the matching PA tag/data entry supplies the desired word.]
Most TLBs are small (<= 256 entries) and thus fully associative content addressable memories (CAMs).

CAM Design
[Figure: CAM array with word lines WL<0>..WL<3>, match lines match<0>..match<3> combined into Hit, match/write data on the bit lines, read/write circuitry, and a precharge/match control; match<0> also drives word line<0> of the data array.]
- Precharge all match lines high (so Hit = 0).
- If the match data matches the data stored in the cell, the match line is left high so that Hit = 1; a mismatch pulls the match line low (a behavioral sketch follows below).
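A behavioral sketch of the CAM match operation (ours, for illustration; the slide shows the transistor-level cell). The tag values below are hypothetical.

```python
def cam_lookup(stored_tags, key, width):
    """Match vector for a fully associative CAM lookup."""
    match = [1] * len(stored_tags)     # precharge: all match lines high
    for i, tag in enumerate(stored_tags):
        for b in range(width):
            if ((tag >> b) & 1) != ((key >> b) & 1):
                match[i] = 0           # any mismatching cell discharges
                break                  # the match line
    return match

tlb_tags = [0x1A, 0x2B, 0x3C, 0x4D]    # hypothetical VA tags
m = cam_lookup(tlb_tags, 0x2B, width=8)
print(m, "Hit =", int(any(m)))         # [0, 1, 0, 0] Hit = 1
```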

Reliability and Yield
- Semiconductor memories trade off noise margin for density and performance; thus they are highly sensitive to noise (cross talk, supply noise).
- High density and large die size cause yield problems:
  Yield = (# of good chips per wafer / # of chips per wafer) * 100%
  Y = [(1 - e^(-AD)) / (AD)]^2
  where A is the chip area and D is the defect density, so yield goes down as A*D goes up (evaluated below). Defect densities are now around 1 defect/cm^2.
- Increase yield using error correction and redundancy.
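Evaluating the slide's yield model for a few die areas at the quoted defect density of 1 defect/cm^2 (the area values are assumed for illustration):

```python
import math

def yield_fraction(area_cm2, defects_per_cm2=1.0):
    """Y = ((1 - exp(-A*D)) / (A*D))**2, the model quoted on the slide."""
    ad = area_cm2 * defects_per_cm2
    return ((1 - math.exp(-ad)) / ad) ** 2

for a in (0.25, 0.5, 1.0, 2.0):
    print(f"A = {a} cm^2 -> yield = {yield_fraction(a):.1%}")
# Yield falls steeply as A*D grows: ~78% at 0.25 cm^2 but ~40% at 1 cm^2.
```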

Alpha Particles

Redundancy in the Memory Structure
[Figure: memory array with a fuse bank, a redundant row, and redundant columns, addressed by the row and column addresses.]
- Replace a bad row or column with a "spare" by setting the fuse bank.
- This corrects faults that affect a large section of the memory; it is not good for scattered point errors or local errors (use error correction (ECC) logic such as parity bits for those).

Redundancy and Error Correction

Next Lecture and Reminders
Next lecture: system level interconnect
- Reading assignment: Rabaey, et al, xx
Reminders
- Project final reports due December 5th
- Final grading negotiations/corrections (except for the final exam) must be concluded by December 10th
- Final exam scheduled Monday, December 16th from 10:10 to noon in 118 and 121 Thomas