1
CSE477 VLSI Digital Circuits Fall Lecture 23: Semiconductor Memories
Mary Jane Irwin (www.cse.psu.edu/~mji) www.cse.psu.edu/~cg477
[Adapted from Rabaey’s Digital Integrated Circuits, Second Edition, © J. Rabaey, A. Chandrakasan, B. Nikolic]
2
Review: Basic Building Blocks
Datapath – execution units (adder, multiplier, divider, shifter, etc.), register file and pipeline registers, multiplexers, decoders
Control – finite state machines (PLA, ROM, random logic)
Interconnect – switches, arbiters, buses
Memory – caches (SRAMs), TLBs, DRAMs, buffers
3
Memory Definitions
Size – Kbytes, Mbytes, Gbytes, Tbytes
Speed
Read Access – delay between read request and the data available
Write Access – delay between write request and the writing of the data into the memory
(Read or Write) Cycle – minimum time required between successive reads or writes
[Timing diagrams: a read cycle spanning the read access (request to data valid), and a write cycle spanning the write setup and write access (request to data written)]
4
A Typical Memory Hierarchy
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Hierarchy figure: on-chip control, datapath, RegFile, ITLB/DTLB, instruction and data caches, eDRAM; second level cache (SRAM); main memory (DRAM); secondary memory (disk)]
The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk). What makes this work is one of the most important principles in computer design – the principle of locality. By taking advantage of the principle of locality, we can provide the user an average access speed that is very close to the speed offered by the fastest technology.
[Speed (ns), Size (bytes), and Cost rows: moving from the register file down to disk, speed falls (into the 1,000’s of ns range), size grows (into the T’s of bytes), and cost per bit goes from highest to lowest]
5
More Memory Definitions
Function – functionality, nature of the storage mechanism: static and dynamic; volatile and nonvolatile (NV); read only (ROM)
Access pattern – random, serial, content addressable
Input-output architecture – number of data input and output ports (multiported memories)
Application – embedded, secondary, tertiary
Memory classification:
Read Write Memories (RWM), random access – SRAM (cache, register file), DRAM (main memory)
RWM, non-random access – FIFO, LIFO, shift register, CAM
NVRWM – EPROM, EEPROM, FLASH
ROM – mask-programmed ROM, electrically-programmable PROM
6
Random Access Read Write Memories (RWMs)
SRAM – Static Random Access Memory
data is stored as long as supply is applied
large cells (6 fets/cell) – so fewer bits/chip
fast – so used where speed is important (e.g., caches)
differential outputs (output BL and !BL)
use sense amps for performance
compatible with CMOS technology
DRAM – Dynamic Random Access Memory
periodic refresh required (every 1 to 4 ms) to compensate for the charge loss caused by leakage (a rough overhead estimate follows this list)
small cells (1 to 3 fets/cell) – so more bits/chip
slower – so used for main memories
single ended output (output BL only)
need sense amps for correct operation
not typically compatible with CMOS technology
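As a rough illustration of what the refresh requirement costs, the sketch below estimates the fraction of time a DRAM array spends refreshing. The row count and per-row refresh time are illustrative assumptions, not values from the slides; only the 4 ms interval comes from the bullet above.

```python
# Hedged sketch: estimate DRAM refresh overhead.
# All parameter values below are illustrative assumptions except the 4 ms interval.
rows = 4096              # number of rows that must be refreshed (assumed)
refresh_interval = 4e-3  # every row refreshed within 4 ms (upper bound from the slide)
t_refresh_row = 60e-9    # time to refresh one row (assumed)

overhead = rows * t_refresh_row / refresh_interval
print(f"Refresh overhead: {overhead:.1%} of available time")  # ~6.1% with these numbers
```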
7
Evolution in DRAM Chip Capacity
[Plot: DRAM chip capacity vs. year – 4X growth every 3 years!; process generations (0.13 µm, 0.1 µm, 0.07 µm) marked; capacities compared against a page, a book, an encyclopedia, 30 sec of HDTV, 2 hrs of CD audio, human DNA, and human memory]
8
6-transistor SRAM Storage Cell
[Schematic: 6-transistor cell – cross-coupled inverters M1–M4 storing Q and !Q, with access transistors M5 and M6 connecting the cell to BL and !BL under control of WL]
Note that it is identical to the register cell from static sequential circuits – cross-coupled inverters
Consumes power only when switching – no standby power (other than leakage) is consumed
The major job of the pullups is to replenish loss due to leakage
Sizing of the transistors is critical
Will cover how the cell works in detail in the next lecture
9
1D Memory Architecture
[Figure: N words of M bits each; on the left, one external select signal S0 … SN-1 per word; on the right, a decoder with address inputs A0 … AK-1 (K = log2 N) generating the word selects, so the decoder reduces the number of inputs]
Only one select line is active at a time
E.g., N = 2^20 ≈ 10^6 (1 Mword) means 1 million select signals
Adding a decoder reduces the number of inputs from 1 million select lines to 20 address lines; note that the decoder still has to generate 1 million select signals internally, so it is a very big decoder (see last lecture)
The scheme on the right, while reducing the number of inputs, leads to very tall, narrow memories that are very slow because of the very long bit lines; the address decoder is also very big (and slow), so it is good to try to pitch-match the decoder to the memory core
The address arithmetic is worked out in the sketch below
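A minimal sketch of the address arithmetic described above, using the slide’s 1 Mword example; the helper name is just for illustration.

```python
import math

def address_bits(n_words: int) -> int:
    """Number of address lines needed to select one of n_words words."""
    return math.ceil(math.log2(n_words))

n_words = 2**20                   # 1 Mword, as in the slide's example
print(address_bits(n_words))      # 20 address lines into the decoder
print(n_words)                    # ... but still 1,048,576 select lines out of it
```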
10
2D Memory Architecture
[Figure: a 2^(K-L) word line by M·2^L bit line array of storage (RAM) cells; the row address (AL … AK-1) drives the row decoder, the column address (A0 … AL-1, the least significant bits) drives the column decoder, which selects the appropriate word from the memory row; sense amplifiers amplify the bit line swing and feed the read/write circuits and the M-bit input/output]
Putting multiple words in one memory row splits the decoder into two decoders (row and column) and makes the memory core square, reducing the length of the bit lines (but increasing the length of the word lines)
The lsb part of the address goes into the column decoder – e.g., 6 bits, so that 64 words are assigned to one row (with 32 bits per word this gives 2^11 bit line pairs), leaving 14 bits for the row decoder (giving 2^14 word lines) for a not quite square array; the split is worked out in the sketch below
The RAM cell needs to be as compact and fast as possible since it is replicated thousands of times in the core array; designers are often willing to trade off noise margins, logic swing, input-output isolation, fan-out, or speed for area
To speed things up (and reduce power consumption), don’t force the bit lines to swing from rail to rail, and use sense amplifiers to restore the signal to full rail-to-rail amplitude
This scheme is good only for up to 64 Kb to 256 Kb; for bigger memories it is too SLOW because the word and bit lines are too long
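A minimal sketch of the row/column address split worked through in the notes above (1 Mword of 32-bit words, 6 column-address bits); the function name is just for illustration.

```python
import math

def split_2d(n_words: int, bits_per_word: int, col_addr_bits: int):
    """Return (row_addr_bits, word_lines, bit_line_pairs) for a 2D array."""
    total_addr_bits = math.ceil(math.log2(n_words))
    row_addr_bits = total_addr_bits - col_addr_bits
    word_lines = 2 ** row_addr_bits
    bit_line_pairs = (2 ** col_addr_bits) * bits_per_word
    return row_addr_bits, word_lines, bit_line_pairs

# Slide example: 2^20 words x 32 bits, 6 lsb address bits to the column decoder
print(split_2d(2**20, 32, 6))   # (14, 16384, 2048) -> 2^14 word lines, 2^11 bit line pairs
```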
11
3D (or Banked) Memory Architecture
[Figure: the address is split into a block address, a row address, and a column address; each block is a 2D array with its own row and column decoders, all sharing the M-bit input/output]
Example: a 1M-word memory with 32 bits/word – 2-bit block address; 6-bit column address giving 2^11 bit line pairs; 12-bit row address giving 2^12 word lines, for almost square memory arrays (see the sketch below)
Advantages: 1. Shorter word and bit lines, so faster access. 2. The block address activates only one block, saving power.
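Extending the 2D sketch above, a hedged illustration of the banked split used in this slide’s example (2 block bits, 6 column bits, 12 row bits); again, the function name is just for illustration.

```python
import math

def split_banked(n_words: int, bits_per_word: int, block_bits: int, col_bits: int):
    """Return (row_bits, word_lines_per_block, bit_line_pairs_per_block, blocks)."""
    total = math.ceil(math.log2(n_words))
    row_bits = total - block_bits - col_bits
    return (row_bits,
            2 ** row_bits,                       # word lines in each block
            (2 ** col_bits) * bits_per_word,     # bit line pairs in each block
            2 ** block_bits)                     # number of blocks (only one active at a time)

print(split_banked(2**20, 32, 2, 6))  # (12, 4096, 2048, 4) -> nearly square 4096 x 2048 blocks
```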
12
2D 4x4 SRAM Memory Bank
[Figure: a 4x4 SRAM array storing 2-bit words, with a row decoder driving WL[0]–WL[3], a column decoder, bit line precharge circuits with a read-precharge enable, sense amplifiers on BLi and BLi+1, write circuitry, and clocking and control]
To decrease the bit line delay for reads, use low-swing bit lines, i.e., the bit lines don’t swing rail to rail, but precharge to Vdd (as 1) and discharge to Vdd – 10%Vdd (as 0). (So for a 2.5 V Vdd, a 0 is 2.25 V.) This requires sense amplifiers to restore the signal to full swing.
The write circuitry receives full swing from the sense amps, or writes full swing onto the bit lines.
13
Quartering Gives Shorter WLs and BLs
[Figure: the array is split into four quadrants; the row decoder sits in the middle (address bits AN-1 … Ai), and each half has its own column decoder (Ai-1 … A0), read precharge circuits, sense amps, and write circuitry sharing the data lines]
In reality, memory is designed in quadrants (hence the 4X growth every generation). This configuration halves the length of the word and bit lines (for increased speed).
14
Decreasing Word Line Delay
Drive the word line from both sides (a word line driver at each end of the polysilicon word line)
Use a metal bypass (a metal word line running in parallel with the polysilicon word line)
Use silicides
Driving from both sides reduces the worst-case delay of the word line by a factor of 4 (like buffer insertion to reduce RC delay); see the delay sketch below
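A short, hedged derivation of that factor of 4, using the standard distributed RC delay expression (the same 0.38RC form used later in the NOR ROM delay slide): with drivers at both ends, the worst-case node is the middle of the line, so each driver effectively sees only half the resistance and half the capacitance.

```latex
% One-sided drive of a distributed RC word line of total resistance R and capacitance C:
%   t_wl ~ 0.38 R C
% Two-sided drive: worst case is the midpoint, each driver sees R/2 and C/2:
\[
  t_{wl} \approx 0.38\,R\,C
  \quad\longrightarrow\quad
  t_{wl}^{\,both} \approx 0.38\cdot\frac{R}{2}\cdot\frac{C}{2} \;=\; \frac{t_{wl}}{4}
\]
```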
15
Decreasing Bit Line Delay (and Energy)
Reduce the bit line voltage swing
need a sense amp for each column to sense/restore the signal
Isolate sense amps from bit lines after sensing (to prevent the sense amps from changing the bit line voltage further) – bit line isolation
Isolate memory cells from the bit lines after sensing (to prevent the memory cells from changing the bit line voltage further) – pulsed word lines
generation of word line pulses is very critical
too short – sense amp operation may fail
too long – power efficiency degraded (because bit line swing size depends on duration of the word line pulse)
use feedback signal from bit lines
16
Bit Line Isolation
[Waveforms: during a read, BL and !BL develop a swing of ΔV = 0.1 Vdd; the isolate signal then disconnects the sense amplifier from the bit lines, and the sense amplifier outputs swing to full rail]
17
Pulsed Word Line
[Figure: the WL from the row decoder drives both the regular columns (BL, !BL) and a dummy column whose cells are tied to a fixed value (0); the dummy column is only 10% populated, so its capacitance is 10% of a regular column’s, and it generates the Done signal (1 → 0)]
[Waveforms: on a read, the dummy BL, starting at V = Vdd, reaches full swing and triggers Done (→ 0) at the point where the regular BLs have reached a 10% swing (ΔV = 0.1 Vdd), ending the word line pulse]
18
Read Only Memories (ROMs)
A memory that can only be read and never altered
Programs for fixed applications that, once developed and debugged, never need to be changed, only read
Fixing the contents at manufacturing time leads to small and fast implementations.
[Figure: ROM cells – a cell that reads BL = 1 when its WL is activated, and a cell that reads BL = 0]
19
MOS OR ROM Cell Array
[Schematic: a 4x4 OR ROM array with bit lines BL(0)–BL(3), word lines WL(0)–WL(3), shared VDD connections, and predischarged bit lines]
For class handout – skip covering this one in class (would make a good basis for an exam question)
20
MOS OR ROM Cell Array
[Schematic: the same 4x4 OR ROM array annotated with an example read – one word line active (1), its fets on, and the resulting data values read out on BL(0)–BL(3)]
Notice how the overhead of the supply lines is reduced by sharing them between neighboring cells. This requires mirroring the odd cells around the horizontal axis, an approach that is extensively used in memory cores of all styles.
What are the values of the data stored? Beware of the threshold drop.
21
Precharged MOS NOR ROM
[Schematic: a 4x4 NOR ROM array with a precharge transistor per bit line (gated by precharge/enable from VDD), word lines WL(0)–WL(3) driven by a row decoder (A1, A2), shared GND rails, and bit lines BL(0)–BL(3)]
For class handout.
22
Precharged MOS NOR ROM
[Schematic: the same 4x4 NOR ROM array, annotated with a read – bit lines precharged to 1, one word line active, its fets on, and the corresponding bit lines discharged to 0]
For lecture. Lots of ways to build them; this is just the structure of choice. First precharge all bit lines to 1 (making sure all word lines are inactive (0) – which can be done with an enabled decoder), then activate the appropriate word line; the existing fets selectively discharge their bit lines low. So no fet means a 1 is stored, and a fet means a 0 is stored. Note that there is only one pull-down in series in the network, so it’s fast! What is stored in this ROM is the inverse of what is stored in the OR ROM. (A small behavioral sketch of this read sequence follows below.)
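A minimal behavioral sketch of the precharge-then-discharge read described above; the programming matrix (which cells contain a transistor) is an arbitrary example, not the one from the figure.

```python
# Behavioral sketch of a precharged MOS NOR ROM read.
# has_fet[row][col] == True means a transistor is present, i.e. a stored 0.
has_fet = [
    [True,  False, False, True ],   # arbitrary example programming
    [False, True,  False, False],
    [False, False, True,  True ],
    [True,  True,  False, False],
]

def read_word(row: int) -> list[int]:
    bit_lines = [1] * len(has_fet[0])          # 1) precharge all bit lines to 1
    for col, fet in enumerate(has_fet[row]):   # 2) activate one word line
        if fet:
            bit_lines[col] = 0                 # 3) present fets discharge their bit line
    return bit_lines

print(read_word(0))  # [0, 1, 1, 0] -> no fet stores a 1, a fet stores a 0
```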
23
MOS NOR ROM Layout 1
Memory is programmed by adding transistors where needed (ACTIVE mask – early in the fab process)
cell size of 9.5 x 7
[Layout: word lines WL(0)–WL(3), shared GND rails, metal1 bit lines on top of diffusion]
24
MOS NOR ROM Layout 2
Memory is programmed by adding contacts where needed (CONTACT mask – one of the last processing steps)
the presence of a metal contact creates a 0-cell
cell size of 11 x 7
[Layout: word lines WL(0)–WL(3), shared GND rails, metal1 bit lines with bit line contacts]
Selective addition of metal-to-diffusion contacts is used to “program” the array. Wafers can be prefabricated up to the CONTACT mask and stockpiled. Note that ground is run in diffusion (not great) and is connected to a vertical diffusion run underneath the m1 strip; this is done for the sake of space. A metal bypass with regularly-spaced straps keeps the voltage drop over the ground wire within bounds. A large part of the cell is devoted to the bit line contact and the ground connection. The pull-down fets are 4/2 transistors.
25
Transient Model for 512x512 NOR ROM
[Model: the poly word line is a distributed RC line (rword, cword per cell); the metal1 bit line is a lumped capacitance Cbit with a precharge device]
Word line parasitics (distributed RC model) – resistance/cell: …, wire capacitance/cell: … fF, gate capacitance/cell: … fF
Bit line parasitics (lumped C model) – resistance/cell: … (which is negligible), wire capacitance/cell: … fF, drain capacitance/cell: 0.8 fF
Transient response – the time from word line activation to the bit line traversing the voltage swing (typically 10% of Vdd). Most of the delay is attributable to interconnect parasitics. The WL is best modeled as a distributed RC line since it is implemented in poly with a relatively high sheet resistance (silicided poly would be advisable). The BL is implemented in metal1, so a capacitive model can be used and all capacitive loads can be lumped into a single element.
26
Propagation Delay of 512x512 NOR ROM
Word line delay
Delay of a distributed rc-line containing M cells: tword = 0.38 (rword × cword) M^2 = 0.38 × (17.5 Ω × … fF) × 512^2 = 1.4 ns
Bit line delay
Assuming a (0.5/0.25) pull-down and a (1.3125/0.25) pull-up with reduced-swing bit lines (2.5 V to 1.5 V):
Cbit = 512 × (… fF) = 0.46 pF
tHL = 0.69 × (13 kΩ/2 || 31 kΩ/5.25) × 0.46 pF = 0.98 ns
tLH = 0.69 × (31 kΩ/5.25) × 0.46 pF = 1.87 ns
The word line delay grows with the quadratic M^2 term (since there are no buffers in the word line), giving tword = 1.4 ns. The word line delay dominates due to the large resistance of the poly wire. Driving the line from both sides and using metal bypass lines (global word lines) would help speed it up, but the most effective approach is to carefully partition the memory into sub-blocks of adequate size that balance the word and bit line delays. Partitioning also helps to reduce the energy consumption attributed to driving and switching the word lines. The bit line delay can be reduced as well by further reducing the voltage swing on the BL and letting the sense amp restore the output signal to full swing; voltage swings of around 0.5 V are quite common. (The delay arithmetic is reproduced in the sketch below.)
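A small sketch reproducing the delay arithmetic above. The per-cell word line capacitance is not given on the slide, so the value below is an assumption back-solved to match the quoted 1.4 ns; the bit line side uses the quoted 0.46 pF total and equivalent resistances.

```python
# Sketch of the 512x512 NOR ROM delay estimates from the slide.
M = 512

# Word line: distributed RC line, t = 0.38 * r * c * M^2
r_word = 17.5          # ohms per cell (from the slide)
c_word = 0.8e-15       # farads per cell -- assumed, back-solved to match ~1.4 ns
t_word = 0.38 * r_word * c_word * M**2
print(f"t_word = {t_word*1e9:.2f} ns")                          # ~1.4 ns

# Bit line: lumped C model with the quoted 0.46 pF and equivalent resistances
c_bit = 0.46e-12
r_pulldown = 13e3 / 2          # ohms, (0.5/0.25) pull-down
r_pullup = 31e3 / 5.25         # ohms, (1.3125/0.25) pull-up
r_parallel = 1 / (1 / r_pulldown + 1 / r_pullup)
t_hl = 0.69 * r_parallel * c_bit
t_lh = 0.69 * r_pullup * c_bit
print(f"t_HL = {t_hl*1e9:.2f} ns, t_LH = {t_lh*1e9:.2f} ns")    # ~0.98 ns, ~1.87 ns
```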
27
Nonvolatile Read-Write Memories (NVRWM)
UV (ultraviolet light exposure), FN (Fowler-Nordheim tunneling), Hot e (avalanche hot-electron injection); VDD = 3.3 or 5 V; VPP = 12 or 12.5 V
MASK ROM – 1 transistor/cell, cell area 0.35-5, read supply VDD
EPROM – 1 transistor/cell, erase by UV, write by Hot e, programming supply VPP, ~100 program/erase cycles
EEPROM – 2 transistors/cell, cell area 3-5, FN mechanism, VPP generated internally (int)
FLASH – 1-2 transistors/cell
28
Next Lecture and Reminders
Next lecture: SRAM, DRAM, and CAM cores
Reading assignment – Rabaey, et al.
Reminders
HW#5 (optional) due December 2nd
Project final reports due December 4th
Final grading negotiations/corrections (except for the final exam) must be concluded by December 10th
Final exam scheduled Tuesday, December 16th from 10:10 to noon in 118 and 113 Thomas