1
ECE 300 Advanced VLSI Design Fall 2006 Lecture 19: Memories
Yunsi Fei [Adapted from Jan Rabaey et al’s Digital Integrated Circuits ©2002, PSU Irwin & Vijay © 2002, and Princeton Wayne Wolf’s Modern VLSI Design © 2002 ]
2
Review: Basic Building Blocks
Datapath: execution units (adder, multiplier, divider, shifter, etc.), register file and pipeline registers, multiplexers, decoders
Control: finite state machines (PLA, ROM, random logic)
Interconnect: switches, arbiters, buses
Memory: caches (SRAMs), TLBs, DRAMs, buffers
3
A Typical Memory Hierarchy
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Figure (fastest/smallest to slowest/largest): on-chip components – datapath with register file, ITLB/DTLB, first-level instruction and data caches, eDRAM, and control; second-level cache (SRAM); main memory (DRAM); secondary memory (disk). Speed (ns) grows from the register file to the disk, size (bytes) grows from small on-chip structures through K's and M's up to T's, and cost per byte is highest for the fastest level and lowest for the slowest.
Notes: the memory system of a modern computer consists of a series of levels ranging from the fastest to the slowest. Besides variation in speed, these levels also vary in size (smallest to biggest) and cost. What makes this arrangement work is one of the most important principles in computer design: the principle of locality. The design goal is to present the user with as much memory as is available in the cheapest technology (the disk) while, by taking advantage of locality, providing an average access speed close to that offered by the fastest technology. (We will go over this slide in detail in the next lectures on caches.)
4
Semiconductor Memories
Semiconductor memory classes:
- RWM (read-write), random access: SRAM (cache, register file), DRAM
- RWM, non-random access: FIFO/LIFO, shift register, CAM
- NVRWM (non-volatile read-write): EPROM, E2PROM, FLASH
- ROM: mask-programmed, electrically-programmed (PROM)
Notes: >50% of the transistors in a microprocessor are in the memory (register file, caches, etc.). Key parameters: #bits, #bytes, #words, #bits/word, #data ports (#inputs, #outputs). RW = read-write (as opposed to RO, read-only). NV = non-volatile (holds data even when power is off) – ROM and FLASH. EPROM = erasable programmable; E2PROM = electrically erasable; write operations take substantially longer than reads.
5
Growth in DRAM Chip Capacity
6
1D Memory Architecture
Figure: n words of m bits each, one row of storage cells per word, select lines S0…Sn-1 driven either directly or from a decoder with address inputs A0…Ak-1 (k = log2 n); Input/Output at the bottom.
Only one select line is active at a time. E.g., n = 2**20 ≈ 10**6 (1 Mword) would require 1 million select signals. Adding a decoder reduces the number of inputs from 1 million select lines to 20 address lines, although the decoder must still generate the 1 million select lines internally (a very big decoder). This scheme, while reducing the number of inputs, leads to very tall, narrow memories that are very slow because of the very long bit lines, and to a very big (and slow) address decoder (it is good to pitch-match the decoder to the memory core).
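To make the input-count saving concrete, here is a minimal Python sketch (my own illustration, not from the slides) of a k-to-2**k one-hot decoder: the memory is addressed with k = log2 n input lines, and the decoder expands them into the n internal select signals.

```python
def decode(address: int, k: int) -> list[int]:
    """Expand a k-bit address into 2**k one-hot select lines."""
    n = 1 << k                      # number of words / select lines
    assert 0 <= address < n
    return [1 if i == address else 0 for i in range(n)]

# 1 Mword memory: 20 address inputs instead of 2**20 select inputs
selects = decode(0x3FF, k=20)
print(sum(selects), selects.index(1))   # exactly one line active: line 0x3FF (1023)
```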
7
2D Memory Architecture
Figure: the row address bits Aj…Ak-1 feed a row decoder that drives 2**(k-j) word lines; the column address bits A0…Aj-1 feed a column decoder that selects the appropriate word from the m·2**j bits of the activated memory row; sense amplifiers amplify the reduced bit-line swing, followed by the read/write circuits and the m-bit Input/Output.
Putting multiple words in one memory row splits the decoder into two decoders (row and column) and makes the memory core roughly square, reducing the length of the bit lines (but increasing the length of the word lines). The LSB part of the address goes to the column decoder: e.g., 6 bits, so that 64 words are assigned to one row (with 32 bits per word this gives 2**11 bit-line pairs), leaving 14 bits for the row decoder (giving 2**14 word lines), for a not-quite-square array. The RAM cell needs to be as compact and fast as possible since it is replicated thousands of times in the core array. To speed things up, the bit lines are not forced to swing rail-to-rail, so sense amplifiers are needed to restore the signal. This scheme is good only up to about 64 Kb to 256 Kb; for bigger memories it is too slow because the word and bit lines are too long.
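A small sketch (my own, with hypothetical helper names) reproducing the address-split arithmetic in the notes: a 20-bit word address with the 6 LSBs going to the column decoder and the 14 MSBs to the row decoder.

```python
def split_2d(addr: int, col_bits: int = 6, row_bits: int = 14, bits_per_word: int = 32):
    """Split a word address into (row, column) for a 2D memory array."""
    assert 0 <= addr < 1 << (col_bits + row_bits)
    col = addr & ((1 << col_bits) - 1)      # LSBs select the word within a row
    row = addr >> col_bits                  # MSBs select the word line
    bit_line_pairs = (1 << col_bits) * bits_per_word
    word_lines = 1 << row_bits
    return row, col, word_lines, bit_line_pairs

print(split_2d(0xABCDE))   # -> (10995, 30, 16384, 2048): 2**14 word lines, 2**11 bit-line pairs
```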
8
3D Memory Architecture
Figure: the address is split into a block address, a row address, and a column address; each block has its own decoders, sense amps, and m-bit Input/Output.
Example: a 1 Mword memory with 32 bits/word – a 2-bit block address, a 6-bit column address giving 2**11 bit-line pairs per block, and a 12-bit row address giving 2**12 word lines, for almost-square memory arrays.
Advantages: 1. Shorter word and/or bit lines. 2. The block address activates only one block, saving power.
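Extending the previous sketch to the blocked organization in this slide (again my own helper names, assuming the block bits are the MSBs and the column bits the LSBs): the 20-bit word address splits into 2 block bits, 12 row bits, and 6 column bits, and only the addressed block is activated.

```python
def split_3d(addr: int, block_bits: int = 2, row_bits: int = 12, col_bits: int = 6):
    """Split a word address into (block, row, column) for a blocked memory."""
    assert 0 <= addr < 1 << (block_bits + row_bits + col_bits)
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    block = addr >> (col_bits + row_bits)   # only this block is enabled -> saves power
    return block, row, col

print(split_3d(0xABCDE))   # -> (2, 2803, 30)
```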
9
Precharged MOS NOR ROM
Figure: bit lines BL(0)–BL(3) with precharge devices to Vdd at the top, word lines WL(0)–WL(3) crossing them, and pull-down transistors to GND only at the programmed cell positions.
There are lots of ways to build a ROM; this is just the structure of choice. First precharge all bit lines to 1 (making sure all word lines are inactive (0), which can be done with an enabled decoder), then activate the appropriate word line; the existing fets selectively discharge their bit lines low. So no fet means a 1 is stored, and a fet means a 0 is stored. Note that there is only one pull-down in series in the network, so it is fast.
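A behavioural sketch (my own illustration, with an invented programming pattern) of the precharge-then-discharge read: the array holds True where a pull-down fet exists (i.e. where a 0 is stored), the bit lines are precharged to 1, and asserting one word line discharges exactly the bit lines that have a fet on that row.

```python
# Programming pattern: fets[row][col] is True where a pull-down transistor exists.
fets = [
    [True,  False, True,  False],   # WL(0): stores 0 1 0 1
    [False, True,  False, True ],   # WL(1): stores 1 0 1 0
    [True,  True,  False, False],   # WL(2): stores 0 0 1 1
    [False, False, False, True ],   # WL(3): stores 1 1 1 0
]

def read_nor_rom(row: int) -> list[int]:
    bit_lines = [1] * len(fets[0])          # precharge all bit lines to 1
    for col, has_fet in enumerate(fets[row]):
        if has_fet:                         # fet present -> bit line discharged -> 0
            bit_lines[col] = 0
    return bit_lines

print(read_nor_rom(2))   # -> [0, 0, 1, 1]
```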
10
Figure: word lines WL(0)–WL(3) in polysilicon, bit lines BL(0)–BL(3) in Metal1 on top of diffusion, GND run in diffusion; basic cell 10 x 7.
Programming is the selective addition of metal-to-diffusion contacts. Note that ground is run in diffusion (not great) and is connected to a vertical diffusion run underneath the Metal1 strip. A large part of the cell is devoted to the bit-line contact and the ground connection. The pull-down fets are 4/2 transistors. Only one layer (the contact mask) is used to program the memory array, so programming of the ROM can be delayed to one of the last process steps.
11
Transient Model for NOR ROM
Figure: the word line is modeled as a distributed rc line in polysilicon (resistance rword and capacitance cword per cell); the bit line BL in Metal1 is modeled as a lumped capacitance Cbit driven through the precharge device and the cell pull-downs.
Word line parasitics (per cell) – resistance: 35 Ω; wire capacitance: … fF; gate capacitance: … fF. Bit line parasitics (per cell) – resistance: …; wire capacitance: … fF; drain capacitance: … fF. The transient response of interest is the time from word-line activation to the bit line traversing its voltage swing (typically 10% of Vdd). Note that these values are from the older 0.8 micron technology (in particular a 5V Vdd).
12
Propagation Delay of NOR ROM
Word line delay: delay of a distributed rc-line containing M cells, tword = 0.38 (rword × cword) M² = 20 nsec for M = 512. The word-line delay grows with M² – a quadratic term, since there are no buffers in the word line – so that tword = 0.38 × (35 ohms × … fF) × 512² = 20 nsec.
Bit line delay: assuming a min-size pull-down and a 3×min-size pull-up with reduced-swing bit lines (5V to 2.5V). With 2.4/1.2 pull-downs and an 8/1.2 pull-up, Cbit = 512 × (…) fF = 1.7 pF and IavHL = 0.36 mA, so tHL = tLH = (1.7 pF × 1.25V)/0.36 mA = 5.9 nsec.
Note that the word line is by far the slowest, mostly because of its larger resistance per cell; the bit-line delay is kept small by the reduced swing.
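A small Python sketch reproducing the two delay estimates above. The per-cell word-line capacitance is not given in the slide, so the value below (about 5.7 fF, wire plus gate) is an assumption chosen to be consistent with the quoted 20 ns result.

```python
# Distributed RC word-line delay and lumped-C bit-line delay for the NOR ROM example.
M       = 512          # cells per word line / bit line
r_word  = 35.0         # ohms per cell (from the slide)
c_word  = 5.7e-15      # farads per cell -- ASSUMED (wire + gate); not given in the slide
C_bit   = 1.7e-12      # total bit-line capacitance (512 cells), from the slide
I_av_HL = 0.36e-3      # average discharge current, from the slide
dV      = 1.25         # half of the 2.5 V reduced bit-line swing

t_word = 0.38 * r_word * c_word * M**2          # distributed rc line: 0.38 * r * c * M^2
t_bit  = C_bit * dV / I_av_HL                   # t = C * dV / I_av

print(f"t_word ~ {t_word*1e9:.1f} ns, t_bit ~ {t_bit*1e9:.1f} ns")
# -> roughly 20 ns and 5.9 ns; the (unbuffered) word line dominates
```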
13
Read-Write Memories (RAMs)
Static – SRAM
- data is stored as long as the supply is applied
- large cells (6 fets/cell) – so fewer bits/chip
- fast – so used where speed is important (e.g., caches)
- differential outputs (output BL and !BL)
- use sense amps for performance
- compatible with CMOS technology
Dynamic – DRAM
- periodic refresh required
- small cells (1 to 3 fets/cell) – so more bits/chip
- slower – so used for main memories
- single-ended output (output BL only)
- need sense amps for correct operation
- not typically compatible with CMOS technology
14
Memory Timing Definitions
Figure: read and write timing waveforms defining the read cycle time, the read access times, the write cycle time, the write setup and write hold times, and the data-valid window.
15
4x4 SRAM Memory
Figure: a 4x4 array storing 2-bit words; a row decoder (address bits A1, A2) drives word lines WL[0]–WL[3]; differential bit-line pairs BL and !BL with a bit-line precharge (read-precharge enable); a column decoder (A0), sense amplifiers, write circuitry, and clocking and control.
To decrease the bit-line delay for reads, low-swing bit lines are used: the bit lines don't swing rail-to-rail, but are precharged to Vdd (as 1) and discharged only to Vdd – 10%·Vdd (as 0). (So for a 2.5V Vdd, a 0 is 2.25V.) This requires sense amplifiers to restore the full swing. The write circuitry receives full swing from the sense amps and writes full swing onto the bit lines.
16
2D Memory Configuration
Figure: the row decoder sits in the middle with memory quadrants on either side and sense amps above and below. In reality, memory is designed in quadrants (hence the 4x capacity growth every generation). This configuration halves the length of the word and bit lines (for increased speed).
17
Decreasing Word Line Delay
Drive the word line from both sides: a metal word line runs in parallel with the polysilicon word line and a driver is placed at each end of WL. Use a metal bypass: a metal line periodically contacts (bypasses) the polysilicon word line. Use silicides. Driving from both sides reduces the worst-case delay of the word line by a factor of 4 (like buffer insertion to reduce RC delay), as the check below shows.
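A one-line check of the factor-of-4 claim, using the distributed rc-line delay formula from the NOR ROM slides (a sketch: driving from both ends makes the worst-case node the midpoint, which looks like a half-length line driven from one end):

```latex
t_{\text{one end}} = 0.38\, r\, c\, M^{2}
\qquad\Longrightarrow\qquad
t_{\text{both ends}} \approx 0.38\, r\, c\, \left(\tfrac{M}{2}\right)^{2} = \tfrac{1}{4}\, t_{\text{one end}}
```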
18
Decreasing Bit Line Delay (and Energy)
Reduce the bit-line voltage swing
- needs a sense amp for each column to sense/restore the signal
Isolate the memory cells from the bit lines after sensing (to prevent the cells from changing the bit-line voltage further) – pulsed word lines
- generation of the word-line pulses is very critical: too short and the sense-amp operation may fail; too long and power efficiency is degraded (because the bit-line swing depends on the duration of the word-line pulse)
- use a feedback signal from the bit lines
Isolate the sense amps from the bit lines after sensing (to prevent the bit lines from having large voltage swings) – bit-line isolation
19
Pulsed Word Line Feedback Signal
Figure: Read, Word line, Bit lines, Dummy bit lines, Complete – the dummy column is only 10% populated. The dummy column's height is set to 10% of a regular column and its cells are tied to a fixed value, so its capacitance is only 10% of a regular column's.
20
Pulsed Word Line Timing
Figure: Read, Complete, Word line, Bit line (ΔV = 0.1·Vdd), Dummy bit line (ΔV = Vdd). Because the dummy column has only 10% of the capacitance of a regular column, its bit line reaches full swing at the same time the regular bit lines reach 10% swing; the dummy bit lines therefore trigger the word-line pulse shut-off at exactly that point.
21
Bit Line Isolation
Figure: Read and isolate waveforms; the bit lines swing only ΔV = 0.1·Vdd before the isolation devices disconnect them from the sense amplifier, whose outputs then swing to the full rails.
22
6-transistor SRAM Cell
Figure: word line WL; cross-coupled inverters M1/M2 and M3/M4 storing !Q and Q; access transistors M5 and M6 connecting the cell to the bit lines !BL and BL.
Note that it is identical to the register cell from static sequential circuits – cross-coupled inverters. It consumes power only when switching; no standby power (other than leakage) is consumed. The major job of the pullups is to replenish the charge lost to leakage. Sizing of the transistors is critical.
23
SRAM Cell Analysis (Read)
Figure: WL=1, Q=1, !Q=0; access transistor M5 connects !Q to !BL (precharged to 1, capacitance Cbit), M6 connects Q to BL (precharged to 1, capacitance Cbit), and pull-down M1 holds !Q low.
First precharge both bit lines – BL and !BL – to 1. Note that the bit-line capacitance Cbit can be in the pF range for large memories (it comes from the wiring, from the diffusion capacitance of the M5's of all the cells connected to the bit line, and from the load presented by the sense amp). Then !BL is discharged through M5 and M1 (since the cell is holding a 1). The key is to ensure that the !Q node does not rise too high before Cbit is discharged, or the memory cell will change state – a read disturb. So the resistance of M5 must be larger than that of M1 (M5 gets a bigger L, so that M1 is stronger than M5). This is somewhat conservative, since BL being 1 helps keep the cell from toggling; alternatively, the bit lines can be precharged to Vdd/2 so that !Q can never reach the switching point, which has performance benefits as well. Read disturb (read upset): the allowed voltage rise on !Q must be carefully limited to a value that prevents the read-upset condition from occurring, while simultaneously maintaining acceptable circuit speed and area.
24
SRAM Cell Analysis (Read)
Figure: the same read condition – WL=1, Q=1, !Q=0, both bit lines precharged to 1 with capacitance Cbit.
M5 is in saturation, M1 in the linear mode. Cell Ratio (CR) = (WM1/LM1)/(WM5/LM5). Equating the two currents gives the voltage rise on the internal node:
V!Q = (Vdd - VTn) · [1 + CR - sqrt(CR(1 + CR))] / (1 + CR)
25
Read Voltages Ratios Vdd = 2.5V VTn = 0.5V
To avoid read disturb, the voltage on node !Q should remain below the trip point of the inverter pair for all process, noise, and operating conditions. This requires a CR of no less than 1.2 (most microprocessor fabs use a minimum cell ratio of 1.25 to 2). Simulations should be done at high Vdd and low VTn (and should consider process variations and misalignments).
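A small sketch (my own, using the long-channel expression reconstructed above with the slide's Vdd = 2.5V and VTn = 0.5V) that evaluates the voltage rise on !Q for a few cell ratios; the absolute values depend on the device model, but the trend – a larger CR gives a smaller rise – is the point.

```python
from math import sqrt

VDD, VTN = 2.5, 0.5   # slide values

def v_qbar(cr: float) -> float:
    """Voltage rise on the internal '0' node during a read, for cell ratio CR."""
    return (VDD - VTN) * (1 + cr - sqrt(cr * (1 + cr))) / (1 + cr)

for cr in (0.5, 1.0, 1.2, 2.0, 3.0):
    print(f"CR = {cr:3.1f}  ->  V(!Q) = {v_qbar(cr):.2f} V")
# The rise must stay below the inverter trip point to avoid a read disturb.
```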
26
SRAM Cell Analysis (Write)
Figure: WL=1, Q=1, !Q=0 (the state before the write takes effect: a 1 is stored and a 0 is being written); BL is driven to 0 and !BL to 1.
In order to write the cell, the pass gate M6 must be more conductive than the pullup M4, to allow node Q to be pulled to a value low enough for the inverter pair (M2/M1) to begin amplifying the new data. The maximum ratio of the pullup size to that of the pass gate that still guarantees the cell is writable is derived with M6 in the linear region and M4 in saturation:
Pullup Ratio (PR) = (WM4/LM4)/(WM6/LM6)
VQ = (Vdd - VTn) - sqrt((Vdd - VTn)² - (μp/μn)·PR·(Vdd - |VTp|)²)
27
Write Voltages Ratios μp/μn = 0.5 Vdd = 2.5V |VTp| = 0.5V
In order to write the cell, node Q must be pulled to a value low enough to trip the inverter combination – so pulling Q below VTn (0.5V) is required. The limiting case for the write operation occurs at high Vdd when the pfet is strong (μp high, |VTp| low) and the nfet is weak (μn low, VTn high). Typically, the widths of the pullup devices are sized at or near the process minimum; longer-than-minimum channel lengths may also be employed to further reduce the pullup ratio. This is necessary since the read sizing of the SRAM cell dictates that the pass-gate sizing be minimized to prevent read disturbs.
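A companion sketch to the read-margin one (again my own, using the write expression reconstructed above with the slide's μp/μn = 0.5, Vdd = 2.5V, and VTn = |VTp| = 0.5V): it evaluates VQ for a few pullup ratios and flags which ones pull Q below VTn, i.e. which cells are writable under this simple model.

```python
from math import sqrt

VDD, VTN, VTP, MU_RATIO = 2.5, 0.5, 0.5, 0.5   # slide values; MU_RATIO = mu_p / mu_n

def v_q(pr: float) -> float:
    """Voltage to which the pass gate can pull node Q during a write, for pullup ratio PR."""
    disc = (VDD - VTN) ** 2 - MU_RATIO * pr * (VDD - VTP) ** 2
    if disc < 0:
        return float("nan")        # no solution under the assumed operating regions
    return (VDD - VTN) - sqrt(disc)

for pr in (0.5, 0.875, 1.0, 1.5):
    vq = v_q(pr)
    print(f"PR = {pr:5.3f}  ->  V(Q) = {vq:.2f} V  writable: {vq < VTN}")
```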
28
Cell Sizing Keeping cell size minimized is critical for large caches
Minimum-sized pull-down fets (M1 and M3)
- requires minimum-width, longer-than-minimum-channel-length pass transistors (M5 and M6) to ensure a proper CR
- but this pass-transistor sizing increases the capacitive load on the word lines and limits the current discharged onto the bit lines, both of which can adversely affect the speed of the read cycle
Minimum width and length pass transistors
- boost the width of the pull-downs (M1 and M3)
- this reduces the loading on the word lines and increases the storage capacitance in the cell – both are good! – but the cell size may be slightly larger
29
6T-SRAM Layout VDD Q !Q GND WL BL !BL M2 M4 M1 M3 M5 M6
Area is dominated by wiring and contacts – 11.5 contacts. Speed: tp = CL·Vswing/Iav. Read is the speed-critical operation, since a large bit-line capacitance must be discharged through small transistors; write speed is determined by the propagation delay of the cross-coupled inverters. tpLH > tpHL because of the precharge. Other considerations: cell size (1092 λ² = 582 μm² in 0.7 micron design rules) and power (standby current of 10⁻¹⁵ A per cell).
30
Multiple Read/Write Port Cell
Figure: the 6T cell (M1–M6, storage nodes Q and !Q, port 1 on WL1 with bit lines BL1/!BL1) extended with a second port: access transistors M7 and M8 on WL2 driving bit lines BL2/!BL2.
This allows multiple simultaneous reads of the same cell, so the cell design must be stable for the case of multiple reads. With more than one pass gate open during a read, the voltage rise inside the cell is larger, and the size of the pulldown has to be increased (by a factor equal to the number of simultaneously open read ports) to keep the rise at an acceptably low level.
31
4x4 DRAM Memory
Figure: a 4x4 array storing 2-bit words; a row decoder (A1, A2) drives word lines WL[0]–WL[3]; single-ended bit lines BL[0]–BL[3] with a bit-line precharge (read-precharge enable); a column decoder (A0), sense amplifiers, write circuitry, and clocking, control, and refresh logic.
As in the SRAM array, low-swing bit lines are used to decrease the read delay: the bit lines are precharged to Vdd (as 1) and discharged only to Vdd – 10%·Vdd (as 0) – so for a 2.5V Vdd, a 0 is 2.25V – and sense amplifiers restore the full swing. The write circuitry receives full swing from the sense amps and writes full swing onto the bit lines.
32
3-Transistor DRAM Cell
Figure: write access transistor M1 on WWL connects BL1 to the storage node X (capacitance Cs); M2's gate is driven by X; read access transistor M3 on RWL connects M2 to BL2. The waveforms show WWL/BL1 for a write and RWL/BL2 for a read, with node X reaching only Vdd − Vt.
This was the core of the first popular MOS memories (e.g., the first 1 Kbit memory from Intel). Cs is the data storage capacitance (the internal capacitance of the wiring, the M2 gate, and the M1 diffusion). Note the threshold drop at node X, which decreases the drive (gate voltage) of M2 and slows down the read; some designs "bootstrap" the word-line voltage (raise VWWL to a value higher than Vdd) to get around the threshold drop. Write uses WWL and BL1. Read uses RWL and BL2: assume BL2 is precharged to Vdd (or Vdd − Vt); if the cell is holding a 1, BL2 goes low, so reads are inverting. Refresh: read the stored data, put its inverse on BL1, and assert WWL (this must be done every 1 to 4 msec).
No constraints on device sizes (ratioless). Reads are non-destructive. The value stored at node X when writing a "1" is VWWL − VTn.
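A behavioural sketch (my own, idealized, with an assumed Vdd and threshold) of the 3T cell's write/read/refresh behaviour described above, capturing the threshold drop on a write and the inverting read:

```python
VDD, VTN = 2.5, 0.5   # assumed supply and threshold, for illustration only

class ThreeTCell:
    def __init__(self):
        self.x = 0.0                              # storage node X

    def write(self, bl1: float, v_wwl: float = VDD):
        # Node X can only be charged up to VWWL - VTn through the access fet M1.
        self.x = min(bl1, v_wwl - VTN)

    def read(self) -> int:
        # BL2 is precharged high and discharged through M2/M3 only if X holds a 1.
        return 0 if self.x > VTN else 1           # reads are inverting

    def refresh(self):
        # Read the (inverted) data and write its inverse back via BL1 / WWL.
        self.write(VDD if self.read() == 0 else 0.0)

cell = ThreeTCell()
cell.write(VDD)             # write a 1: X ends up at VDD - VTN = 2.0 V
print(cell.x, cell.read())  # -> 2.0 0   (the read returns the inverted value)
```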
33
3T-DRAM Layout BL2 BL1 GND RWL WWL M3 M2 M1
Note that there are many fewer contacts (only 7) and wires than in the 6T SRAM cell: 576 λ² as compared to 1092 λ² for the SRAM cell.
34
1-Transistor DRAM Cell
Figure: access transistor M1 on WL connects the bit line BL (capacitance CBL) to the storage capacitor Cs at node X; the waveforms show a write "1" (X charging only to Vdd − Vt) and a read "1" with BL precharged to Vdd/2 and then sensed.
This is the most pervasive DRAM cell in commercial designs. Write: set BL and activate WL; once again the word line can be bootstrapped so that the threshold drop at X does not occur (bringing X up to Vdd) – common practice. Read: BL is precharged to Vpre (typically Vdd/2), then WL is asserted and the resulting state of BL is sensed; the change comes from charge sharing between CBL and Cs. The read is destructive (it steals charge from Cs), so it must be followed by a refresh cycle. Note that Cs << CBL (by 1 to 2 orders of magnitude), so the read voltage swing is typically very small (around 250mV for the 0.8 micron technology); charge-transfer ratios are between 1% and 10%:
ΔV = VBL – Vpre = (Vbit – Vpre) · Cs/(Cs + CBL)
This cell REQUIRES a sense amp on each bit line for correct operation (whereas before, sense amps were used to improve performance via reduced bit-line swings on reads).
Write: Cs is charged (or discharged) by asserting WL and BL. Read: charge redistribution occurs between CBL and Cs. The read is destructive, so a refresh must follow every read.
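A quick numerical sketch of the charge-sharing formula above. The capacitance values are assumptions for illustration (Cs = 30 fF, CBL about ten times larger), not numbers from the slides; they give a read swing in the couple-hundred-millivolt range mentioned in the notes.

```python
def dram_read_swing(v_bit: float, v_pre: float, c_s: float, c_bl: float) -> float:
    """Bit-line voltage change after charge sharing between Cs and CBL."""
    return (v_bit - v_pre) * c_s / (c_s + c_bl)

VDD  = 2.5
C_S  = 30e-15          # storage capacitance -- assumed
C_BL = 300e-15         # bit-line capacitance -- assumed (~10x Cs)

dv_one  = dram_read_swing(VDD, VDD / 2, C_S, C_BL)   # cell stored a full-Vdd "1"
dv_zero = dram_read_swing(0.0, VDD / 2, C_S, C_BL)   # cell stored a "0"
print(f"read '1': {dv_one*1e3:+.0f} mV, read '0': {dv_zero*1e3:+.0f} mV")
# -> roughly +114 mV and -114 mV; the sense amp must resolve this small difference
```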
35
1-T DRAM Cell: how to build the capacitor. An explicit capacitor must be "built" that is both small and good – the key design challenge in 1T DRAM cells. Note that the technology is not compatible with CMOS logic technology.
36
DRAM Cell Observations
- DRAM memory cells are single-ended (which complicates the design of the sense amp)
- The 1T cell requires a sense amp on each bit line, due to the charge-redistribution read
- The 1T cell read is destructive; a refresh must follow to restore the data
- The 1T cell requires an extra capacitor that must be explicitly included in the design
- A threshold voltage is lost when writing a 1 (which can be circumvented by bootstrapping the word lines to a value higher than Vdd)