ECE 4100/6100 Advanced Computer Architecture Lecture 11 DRAM


ECE 4100/6100 Advanced Computer Architecture Lecture 11 DRAM Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology With adaptations and additions by S. Yalamanchili for ECE 4100/6100 – Spring 2009

Reading: Section 5.3, plus the suggested readings.

Main Memory Storage Technologies
- DRAM: "Dynamic" Random Access Memory
  - Highest densities
  - Optimized for cost/bit -> main memory
- SRAM: "Static" Random Access Memory
  - Densities 1/4 to 1/8 of DRAM
  - Speeds 8-16x faster than DRAM
  - Cost 8-16x more per bit
  - Optimized for speed -> caches

The DRAM Cell
A 1T1C DRAM cell: one access transistor plus one storage capacitor (stacked vs. trench capacitor). The word line (control) gates the transistor; the bit line carries the stored information.
- Why DRAMs: higher density than SRAMs
- Disadvantages: longer access times; leaky, needs to be refreshed; cannot be easily integrated with CMOS logic
Source: Memory Arch Course, INSA Toulouse

SRAM Cell
A bit is stored in a latch built from 6 transistors, accessed via a word line and a pair of bit lines.
- To read: set the bit lines to 2.5 V, drive the word line; the bit lines settle to 0 V / 5 V
- To write: set the bit lines to 0 V / 5 V, drive the word line; the bit lines "overpower" the latch transistors

One DRAM Bank
[Figure: a single DRAM bank. The row decoder takes the address and selects a word line in the cell array; sense amps and I/O gating read the bit lines; the column decoder selects the data out.]

Example: 512 Mb 4-Bank DRAM (x4)
Bank select BA[1:0] picks one of four banks, each 16384 x 2048 x 4 bits. Address multiplexing: the 14-bit row address A[13:0] (16K rows) and the 11-bit column address A[10:0] share the same pins; data out is D[3:0] on this x4 DRAM chip. A DRAM page = 2K x 4 = 1 KB.

DRAM Cell Array
[Figure: cell array with word line 0 crossing bit lines 0 through 15, one cell at each intersection.]

DRAM Basics
- Address multiplexing: send the row address when RAS is asserted, the column address when CAS is asserted
- DRAM reads are self-destructive: the data must be rewritten after a read
- Memory array: all bits within an array work in unison
- Memory bank: different banks can operate independently
- DRAM rank: chips inside the same rank are accessed simultaneously

Examples of DRAM DIMM Standards
- x64 (no ECC): eight x8 chips supplying D0-D63
- x72 (ECC): nine x8 chips supplying D0-D63 plus check bits CB0-CB7

DRAM Ranks
[Figure: a memory controller drives two ranks of x8 chips; chip selects CS0 and CS1 pick rank 0 or rank 1, which share data lines D0-D7 and D8-D15.]

DRAM Ranks
[Figure: a single-rank module builds its 64-bit data bus from eight x8 chips; a dual-rank module places two such ranks on one DIMM.]

DRAM Organization
Source: Memory Systems Architecture Course, B. Jacobs, Maryland

Organization of DRAM Modules
[Figure: a memory controller connects over a channel, consisting of an address/command bus and a data bus, to multi-banked DRAM chips.]
Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland

DRAM Configuration Example Source: MICRON DDR3 DRAM

Memory Read Timing: Conventional Source: Memory Systems Architecture Course Bruce Jacobs, University of Maryland

Memory Read Timing: Fast Page Mode Source: Memory Systems Architecture Course Bruce Jacobs, University of Maryland

Memory Read Timing: Burst Source: Memory Systems Architecture Course Bruce Jacobs, University of Maryland

Memory Controller
A transaction request from the core is sent to the memory controller (MC), which converts it into DRAM commands and sends them to the DRAM. Consider all the steps a LD instruction must go through: virtual -> physical -> rank/bank. Scheduling policies are increasingly important, e.g., should the controller give preference to references in the same (open) page?

Integrated Memory Controllers *From http://chip-architect.com/news/Shanghai_Nehalem.jpg

DRAM Refresh
- Leaky storage requires periodic refresh across DRAM rows: read each row and write the same data back; a row is inaccessible while being refreshed
- Example: 4096 rows in a DRAM, 100 ns read cycle, data decays in 64 ms
  - 4096 x 100 ns = 410 µs to refresh the whole device once
  - 410 µs / 64 ms = 0.64% unavailability

DRAM Refresh Styles
- Bursty: all 4096 rows refreshed back-to-back (410 µs = 100 ns x 4096) once every 64 ms
- Distributed: one 100 ns row refresh every 64 ms / 4096 = 15.6 µs, spread across the retention interval

DRAM Refresh Policies
- RAS-Only Refresh: the memory controller drives the row address on the address bus and asserts RAS; the DRAM module refreshes that row
- CAS-Before-RAS (CBR) Refresh: the controller asserts CAS before RAS with WE# held high; no address is involved — the DRAM's internal address counter selects the row to refresh and is incremented afterward

Types of DRAM
- Asynchronous DRAM (no clock)
  - Normal: responds to RAS and CAS signals
  - Fast Page Mode (FPM): the row remains open after RAS for multiple CAS commands
  - Extended Data Out (EDO): output drivers changed to latches, so data can be held on the bus for a longer time
  - Burst Extended Data Out (BEDO): an internal counter drives the address latch, providing data in burst mode
- Synchronous DRAM
  - SDRAM: all of the above with a clock; adds predictability to DRAM operation
  - DDR, DDR2, DDR3: transfer data on both edges of the clock
  - FB-DIMM: DIMMs connected using point-to-point links instead of a bus, allowing more DIMMs in server-based systems
- RDRAM: low pin count

Main Memory Organizations
[Figure: three organizations — a single memory on a word-wide bus, a wide memory on a wide bus, and multiple memory banks on a word-wide bus.]
- The processor-memory bus may have a width of one or more memory words
- Multiple memory banks can operate in parallel
- Transfer from memory to the cache is subject to the width of the processor-memory bus
- Wide memory comes with constraints on expansion:
  - Error correcting codes require the complete "width" to be read to recompute the codes on writes
  - The minimum expansion unit size is increased

Word-Level Interleaved Memory
[Figure: timing of two interleaved accesses across banks 0-3, each bank busy for memory cycle time τ.]
- Memory is organized into multiple, concurrent banks with word-level interleaving across banks
- A single address generates multiple, concurrent accesses — well matched to cache line access patterns
- Assuming a word-wide bus, the cache miss penalty is T_address + T_mem_access + (#words x T_transfer) cycles
- Note the effect of a split-transaction vs. a locked bus

Sequential Bank Operation
[Figure: the n-m higher-order address bits select the bank (module); the m lower-order bits select the word within a module.]
Implemented using DRAMs with page-mode access.

Concurrent Bank Operation
[Figure: the m lower-order address bits select the module; each module latches its own address and data.]
- Supports arbitrary access patterns
- Needs sources of multiple, independent accesses: lockup-free caches, data speculation, write buffers, prefetching

Concurrent Bank Operation
[Figure: each bank access occupies its bank for cycle time τ while other banks accept new addresses.]
- Each bank can be addressed independently with its own sequence of addresses
- Differences from interleaved memory: flexibility in addressing, but greater address bandwidth and separate controllers and memory buses are required
- Supports non-blocking caches with multiple outstanding misses

Data Skewing for Concurrent Access
How can we guarantee that data can be accessed in parallel? Avoid bank conflicts.
- Storage scheme: a set of rules that determine, for each array element, the address of the module and the location within the module
- Design the storage scheme to ensure concurrent access
- d-ordered n-vector: the i-th element is in module (d*i + C) mod M
[Figure: a 3-ordered 8-vector with C = 2, laid out across modules as a2 a1 a0 | a3 a6 a5 a4 a7.]

Conflict-Free Access
- Access to the elements of a d-ordered N-vector is conflict-free if M >= N * gcd(M, d)
- Multi-dimensional arrays are treated as arrays of 1-D vectors
- Conflict-free access for various patterns in a matrix requires:
  - M >= N * gcd(M, δ1) for columns
  - M >= N * gcd(M, δ2) for rows
  - M >= N * gcd(M, δ1 + δ2) for forward diagonals
  - M >= N * gcd(M, δ1 - δ2) for backward diagonals

Conflict-Free Access
- Implications for M = N = an even number?
- For non-power-of-two values of M, indexing and address computation must be efficient
- Vectors that are accessed are scrambled; unscrambling them is a non-trivial performance issue
- Data dependencies can still reduce bandwidth far below O(M)

Avoiding Bank Conflicts: Compiler Techniques
Even with many banks, strided accesses can collide:

    int x[256][512];
    for (j = 0; j < 512; j = j + 1)
        for (i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];

Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses. Solutions:
- Software: loop interchange
- Software: adjust the array size to a prime number ("array padding")
- Hardware: a prime number of banks (e.g., 17)
- Data skewing

Study Guide: Glossary
- Asynchronous DRAM
- Bank and rank
- Bit line
- Burst mode access
- Conflict-free access
- Data skewing
- DRAM
- High-order and low-order interleaving
- Leaky transistors
- Memory controller
- Page mode access
- RAS and CAS
- Refresh
- RDRAM
- SRAM
- Synchronous DRAM
- Word interleaving
- Word line

Study Guide
- Differences between SRAM and DRAM in operation and performance
- Given a memory organization, determine the miss penalty in cycles
- Cache basics: mappings from main memory to locations in the cache hierarchy
- Computation of the CPI impact of miss penalties, miss rates, and hit times
- Computation of the CPI impact of update strategies
- Find a skewing scheme for concurrent accesses to a given data structure, e.g., diagonals of a matrix or sub-blocks of a matrix
- Evaluate the CPI impact of various optimizations
- Relate the mapping of data structures (such as matrices) to main memory to cache behavior and the behavior of optimizations