ECE 4100/6100 Advanced Computer Architecture Lecture 11 DRAM
Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology With adaptations and additions by S. Yalamanchili for ECE 4100/6100 – Spring 2009
Reading: Section 5.3; suggested readings.
Main Memory Storage Technologies
DRAM: "Dynamic" Random Access Memory
- Highest densities; optimized for cost/bit (used for main memory)
SRAM: "Static" Random Access Memory
- Densities 1/4 to 1/8 of DRAM
- Speeds 8-16x faster than DRAM; costs 8-16x more per bit
- Optimized for speed (used for caches)
The DRAM Cell
1T1C DRAM cell: one access transistor plus one storage capacitor
- Word line (control), bit line (information)
- Stack capacitor (vs. trench capacitor)
Source: Memory Arch Course, INSA Toulouse
Why DRAMs: higher density than SRAMs
Disadvantages:
- Longer access times
- Leaky; needs to be refreshed
- Cannot easily be integrated with CMOS logic
The SRAM Cell
Bit is stored in a cross-coupled latch using 6 transistors (one wordline, two complementary bitlines)
To read:
- Precharge both bitlines to 2.5 V
- Drive the wordline; the bitlines settle to 0 V / 5 V
To write:
- Set the bitlines to 0 V / 5 V
- Drive the wordline; the bitlines "overpower" the latch transistors
One DRAM Bank
[Figure: one DRAM bank — the row decoder, driven by the address, selects a wordline; the cell array's bitlines feed sense amps, then I/O gating and the column decoder produce the data out]
Example: 512 Mb 4-Bank DRAM (x4)
- 4 banks, each 16384 rows x 2048 columns x 4 bits; BA[1:0] selects the bank
- Address multiplexing: A[13:0] carries the row address (16K rows), A[10:0] the column address
- A DRAM page = 2K x 4 bits = 1 KB
- Data out on D[3:0]: a x4 DRAM chip
DRAM Cell Array
[Figure: cell array — Wordline0 crossing bitline0 through bitline15, with one 1T1C cell at each intersection]
DRAM Basics
Address multiplexing:
- Send the row address when RAS is asserted
- Send the column address when CAS is asserted
DRAM reads are destructive:
- The row must be rewritten after a read
Memory array: all bits within an array operate in unison
Memory bank: different banks can operate independently
DRAM rank: chips in the same rank are accessed simultaneously
Examples of DRAM DIMM Standards
- x64 (no ECC): eight x8 chips supplying D0-D63
- x72 (ECC): nine x8 chips, the extra chip supplying check bits CB0-CB7
DRAM Ranks
[Figure: memory controller driving two ranks of x8 chips; chip selects CS0/CS1 choose Rank0 or Rank1, which share the data lines D0-D7 and D8-D15]
DRAM Ranks
[Figure: single-rank vs. dual-rank DIMM organizations; eight 8-bit chips together form one 64-bit rank]
DRAM Organization
Source: Memory Systems Architecture Course, Bruce Jacob, University of Maryland
Organization of DRAM Modules
[Figure: memory controller connected over a channel (address/command bus plus data bus) to multi-banked DRAM chips]
Source: Memory Systems Architecture Course, Bruce Jacob, University of Maryland
DRAM Configuration Example
Source: Micron DDR3 DRAM datasheet
Memory Read Timing: Conventional
Source: Memory Systems Architecture Course, Bruce Jacob, University of Maryland

Memory Read Timing: Fast Page Mode
Source: Memory Systems Architecture Course, Bruce Jacob, University of Maryland

Memory Read Timing: Burst
Source: Memory Systems Architecture Course, Bruce Jacob, University of Maryland
Memory Controller
- The core sends a transaction request to the memory controller (MC)
- The MC converts it to DRAM commands and sends them to the DRAM
- Consider all the steps a LD instruction must go through: virtual address, then physical address, then rank/bank
- Scheduling policies are increasingly important, e.g., give preference to references to the same open page?
Integrated Memory Controllers
DRAM Refresh
- Storage is leaky, so rows must be refreshed periodically across the DRAM
- Refresh = read a row and write the same data back
- The array is inaccessible while refreshing
Example:
- 4K rows in a DRAM, 100 ns read cycle, data decays in 64 ms
- 4096 * 100 ns = 410 us to refresh all rows once
- 410 us / 64 ms = 0.64% unavailability
DRAM Refresh Styles
- Bursty: once every 64 ms, refresh all 4096 rows back-to-back (410 us = 100 ns * 4096), then the array is free until the next 64 ms period
- Distributed: spread refreshes evenly across the 64 ms period, one row (100 ns) every 15.6 us (64 ms / 4096)
DRAM Refresh Policies
RAS-Only Refresh:
- The memory controller places a row address on the address bus and asserts RAS; the DRAM module refreshes that row
CAS-Before-RAS (CBR) Refresh:
- The memory controller asserts CAS before RAS, with WE# held high; no address is involved
- An internal address counter in the DRAM supplies the row to refresh and is incremented afterward
Types of DRAM
Asynchronous DRAM:
- Normal: responds to RAS and CAS signals (no clock)
- Fast Page Mode (FPM): the row remains open after RAS for multiple CAS commands
- Extended Data Out (EDO): output drivers replaced by latches, so data can be held on the bus for a longer time
- Burst Extended Data Out (BEDO): an internal counter drives the address latch, providing data in burst mode
Synchronous DRAM:
- SDRAM: all of the above, plus a clock; adds predictability to DRAM operation
- DDR, DDR2, DDR3: transfer data on both edges of the clock
- FB-DIMM: DIMMs connected by point-to-point links instead of a bus, allowing more DIMMs in server systems
RDRAM:
- Low pin count
Main Memory Organizations
[Figure: three organizations — a one-word-wide bus, a wide bus, and multiple interleaved banks — each connecting registers/ALU/cache to memory]
- The processor-memory bus may be one or more memory words wide
- Multiple memory banks can operate in parallel
- Transfer from memory to the cache is limited by the width of the processor-memory bus
- Wide memory comes with constraints on expansion:
  - Error-correcting codes require the complete "width" to be read in order to recompute the codes on writes
  - The minimum expansion unit size is increased
Word-Level Interleaved Memory
[Figure: word-interleaved access — banks 0 through 3 are started one cycle apart, each taking the memory access time to produce its word, so accesses overlap]
- Memory is organized into multiple concurrent banks, with word-level interleaving across banks
- A single address generates multiple concurrent accesses
- Well matched to cache line access patterns
- Assuming a word-wide bus, the cache miss penalty is T_address + T_mem_access + #words * T_transfer cycles
- Note the effect of a split-transaction vs. a locked bus
Sequential Bank Operation
[Figure: the high-order n-m address bits select the bank/module; the low-order m bits select the word within the module]
- Consecutive addresses fall in the same module (high-order interleaving)
- Implement using DRAMs with page-mode access
Concurrent Bank Operation
[Figure: the low-order m address bits select the module; the high-order n-m bits address the word within it; data returns on a shared data bus]
- Supports arbitrary access patterns
- Needs sources of multiple independent accesses: lockup-free (non-blocking) caches, data speculation, write buffers, prefetching
Concurrent Bank Operation
[Figure: each bank takes the memory access time to complete; accesses to different banks overlap in time]
- Each bank can be addressed independently, given a sequence of addresses
- Differences from interleaved memory:
  - Flexibility in addressing, but requires greater address bandwidth
  - Separate controllers and memory buses
- Supports non-blocking caches with multiple outstanding misses
Data Skewing for Concurrent Access
[Figure: elements a0 through a7 of a 3-ordered 8-vector with C = 2, laid out across the memory modules]
- How can we guarantee that data can be accessed in parallel? Avoid bank conflicts
- Storage scheme: a set of rules that determines, for each array element, the module number and the location within that module
- Design the storage scheme to ensure concurrent access
- d-ordered n-vector: the i-th element is stored in module (d*i + C) mod M
Conflict-Free Access
- Conflict-free access to the N elements of a d-ordered vector requires M >= N * gcd(M, d)
- Multi-dimensional arrays are treated as arrays of 1-D vectors
- Conflict-free access for various patterns in an N x N matrix (with skew distances d1, d2) requires:
  - M >= N * gcd(M, d1) for columns
  - M >= N * gcd(M, d2) for rows
  - M >= N * gcd(M, d1 + d2) for forward diagonals
  - M >= N * gcd(M, d1 - d2) for backward diagonals
Conflict-Free Access
- Implications for M = N = an even number? (Then conflict-free access requires gcd(M, d) = 1, which fails for every even stride d)
- For non-power-of-two values of M, indexing and address computation must be efficient
- Vectors that are accessed come out scrambled; unscrambling them is a non-trivial performance issue
- Data dependencies can still reduce bandwidth far below O(M)
Avoiding Bank Conflicts: Compiler Techniques
Even many banks may not be enough:

int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];

- Even with 128 banks, since 512 is a multiple of 128, every access down a column hits the same bank
- Solutions:
  - Software: loop interchange
  - Software: pad the array size to a prime number ("array padding")
  - Hardware: a prime number of banks (e.g., 17)
  - Data skewing
Study Guide: Glossary
- Asynchronous DRAM
- Bank and rank
- Bit line
- Burst mode access
- Conflict-free access
- Data skewing
- DRAM
- High-order and low-order interleaving
- Leaky transistors
- Memory controller
- Page mode access
- RAS and CAS
- Refresh
- RDRAM
- SRAM
- Synchronous DRAM
- Word interleaving
- Word line
Study Guide
- Differences between SRAM and DRAM in operation and performance
- Given a memory organization, determine the miss penalty in cycles
- Cache basics:
  - Mappings from main memory to locations in the cache hierarchy
  - Computation of the CPI impact of miss penalties, miss rates, and hit times
  - Computation of the CPI impact of update strategies
- Find a skewing scheme for concurrent accesses to a given data structure, e.g., diagonals or sub-blocks of a matrix
- Evaluate the CPI impact of various optimizations
- Relate the mapping of data structures (such as matrices) to main memory to cache behavior and the behavior of optimizations