CSE 502: Computer Architecture

Memory / DRAM

SRAM vs. DRAM
SRAM = Static RAM: as long as power is present, data is retained
DRAM = Dynamic RAM: if you don't do anything, you lose the data
SRAM: 6T per bit, built with normal high-speed CMOS technology
DRAM: 1T per bit (plus a capacitor), built with a special DRAM process optimized for density
(Again, this should be review for ECE students; CS students may not have seen this material.)

Hardware Structures
[Figure: a 6T SRAM cell with a wordline and bitlines b and b', next to a 1T DRAM cell with a wordline, bitline b, and a trench capacitor]

DRAM Chip Organization (1/2)
[Figure: the row address feeds a decoder into the cell array; sense amps and the row buffer sit below the array; the column address drives a multiplexor that selects from the row buffer]
DRAM is much denser than SRAM

DRAM Chip Organization (2/2)
Low-level organization is very similar to SRAM
Cells are only single-ended
Reads are destructive: contents are erased by reading
The row buffer holds read data; data in the row buffer is called a DRAM row
Often called a "page", which is not necessarily the same as an OS page
A read brings the entire row into the buffer
Block reads are always performed out of the row buffer: the chip reads a whole row but accesses one block
Similar to reading a cache line but accessing one word

Destructive Read
[Figure: sense amp output, bitline voltage, and capacitor voltage waveforms for reading a 1 and reading a 0; the wordline is enabled, then the sense amp is enabled]
After a read of 0 or 1, the cell contents are close to ½ Vdd

DRAM Read
After a read, the contents of the DRAM cell are gone, but they are still "safe" in the row buffer
Write the bits back before doing another read
Reading into the buffer is slow, but reading from the buffer is fast
Try reading multiple lines from the buffer (row-buffer hits)
[Figure: DRAM cells feed sense amps, which feed the row buffer]
This process is called opening or closing a row
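
To make the row-buffer idea concrete, here is a minimal sketch in C, assuming a hypothetical 4 KB row size and a single bank: the controller remembers which row is open and counts row-buffer hits versus misses over a short access stream.

/* Row-buffer hit/miss sketch (hypothetical geometry: 1 bank, 4 KB rows). */
#include <stdio.h>

#define ROW_BYTES 4096   /* assumed row ("page") size */

int main(void) {
    unsigned long addrs[] = {0x0000, 0x0040, 0x0080, 0x2000, 0x2040};
    long open_row = -1;              /* -1: no row currently open */
    int hits = 0, misses = 0;

    for (int i = 0; i < 5; i++) {
        long row = addrs[i] / ROW_BYTES;     /* which DRAM row? */
        if (row == open_row) hits++;         /* row-buffer hit: fast */
        else { misses++; open_row = row; }   /* open a new row: slow */
    }
    printf("hits=%d misses=%d\n", hits, misses);  /* hits=3 misses=2 */
    return 0;
}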

DRAM Refresh (1/2)
Gradually, a DRAM cell loses its contents, even if it is not accessed
This is why it's called "dynamic"
DRAM must be regularly read and re-written
What to do if there is no read/write to a row for a long time?
[Figure: capacitor voltage decaying from Vdd over a long time due to gate leakage]
Must periodically refresh all contents

DRAM Refresh (2/2)
Burst refresh: stop the world, refresh all memory
Distributed refresh: space out refreshes, one row at a time; avoids blocking memory for a long time
Self-refresh (low-power mode): tell the DRAM to refresh itself and turn off the memory controller; takes some time to exit self-refresh
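
To see how distributed refresh spaces its work, a worked sketch with typical (but device-specific) DDR-style numbers: if every cell must be refreshed within 64 ms and a bank has 8192 rows, one row is refreshed roughly every 7.8 µs. Both numbers are assumptions here, not values fixed by the slide.

/* Distributed-refresh interval sketch (typical, device-specific numbers). */
#include <stdio.h>

int main(void) {
    double retention_ms = 64.0;   /* refresh every cell within 64 ms */
    int rows = 8192;              /* rows to refresh per bank */
    /* Space refreshes evenly: one row every 64 ms / 8192 rows. */
    printf("refresh one row every %.3f us\n",
           retention_ms * 1000.0 / rows);   /* ~7.812 us */
    return 0;
}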

Typical DRAM Access Sequence (1/5)

Typical DRAM Access Sequence (2/5)

Typical DRAM Access Sequence (3/5)

Typical DRAM Access Sequence (4/5)

Typical DRAM Access Sequence (5/5)

DRAM Read Timing
Original DRAM specified the row and column addresses on every access

DRAM Read Timing with Fast-Page Mode
FPM enables multiple reads from a page without a new RAS

SDRAM Read Timing
SDRAM uses a clock and supports bursts
Double-Data-Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock
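
The clocked, double-pumped interface makes peak bandwidth easy to compute: transfers per second times bus width. A sketch with illustrative DDR3-1600-style numbers (800 MHz bus clock, 64-bit channel); the specific figures are an assumption for illustration.

/* Peak DDR channel bandwidth sketch (illustrative DDR3-1600 numbers). */
#include <stdio.h>

int main(void) {
    double clock_mhz = 800.0;    /* bus clock */
    int transfers_per_clock = 2; /* DDR: data on both clock edges */
    int bus_bytes = 8;           /* 64-bit channel */
    double gb_per_s = clock_mhz * 1e6 * transfers_per_clock * bus_bytes / 1e9;
    printf("peak bandwidth: %.1f GB/s\n", gb_per_s);  /* 12.8 GB/s */
    return 0;
}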

Actual DRAM Signals

DRAM Signal Timing
Distance matters, even at the speed of light

Examining Memory Performance
Miss penalty for an 8-word cache block:
1 cycle to send the address
6 cycles to access each word
1 cycle to send each word back
(1 + 6 + 1) × 8 = 64 cycles (expensive)
Wider bus option: read all words in parallel
Miss penalty for an 8-word block: 1 + 6 + 1 = 8 cycles
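
The slide's miss-penalty arithmetic as a runnable sketch, using the stated 1-cycle address, 6-cycle access, and 1-cycle transfer costs:

/* Miss-penalty arithmetic from the slide (narrow vs. wide bus). */
#include <stdio.h>

int main(void) {
    int addr = 1, access = 6, xfer = 1, words = 8;
    int narrow = (addr + access + xfer) * words;  /* one word at a time */
    int wide   =  addr + access + xfer;           /* all words in parallel */
    printf("narrow bus: %d cycles, wide bus: %d cycles\n", narrow, wide);
    /* narrow bus: 64 cycles, wide bus: 8 cycles */
    return 0;
}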

Simple Interleaved Main Memory
Divide memory into n banks and "interleave" addresses across them
Access one bank while another is busy
Increases bandwidth without a wider bus
[Figure: Bank 0 holds words 0, n, 2n, ...; Bank 1 holds words 1, n+1, 2n+1, ...; the physical address splits into a bank index and a word offset]
Use parallelism in memory banks to hide latency
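
A minimal sketch of low-order interleaving, assuming 8-byte words and 4 banks (both hypothetical): the low bits of the word number select the bank, so consecutive words land in consecutive banks.

/* Low-order interleaving sketch: word i lives in bank (i mod n).
   Hypothetical parameters: 8-byte words, 4 banks. */
#include <stdio.h>

#define WORD_BYTES 8
#define NUM_BANKS  4   /* power of two */

int main(void) {
    unsigned long pa = 0x12345678;
    unsigned long word = pa / WORD_BYTES;
    unsigned long bank = word % NUM_BANKS;          /* low bits pick the bank */
    unsigned long offset_in_bank = word / NUM_BANKS;
    printf("word %lu -> bank %lu, offset %lu\n", word, bank, offset_in_bank);
    return 0;
}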

DRAM Organization
[Figure: a dual-rank x8 (2Rx8) DIMM; each rank is a group of DRAM chips, and each chip contains multiple banks]
All banks within a rank share all address and control pins
All banks are independent, but a rank can only talk to one bank at a time
x8 means each DRAM chip outputs 8 bits, so 8 chips are needed for DDRx (64-bit data bus)
Why 9 chips per rank? 64 bits of data plus 8 bits of ECC

Memory Channels
[Figure: three configurations: one controller with one 64-bit channel; one controller with two 64-bit channels; two controllers, each with its own 64-bit channel; each channel carries commands and data]
Use multiple channels for more bandwidth

Address Mapping Schemes (1/3)
Map consecutive addresses to improve performance
Multiple independent channels → maximum parallelism: map consecutive cache lines to different channels
Multiple channels/ranks/banks → OK parallelism: limited by shared address and/or data pins; map close cache lines to banks within the same rank, since reads from the same rank are faster than reads from different ranks
Accessing rows from one bank is slowest: all requests are serialized, regardless of row-buffer management policy, so rows mapped to the same bank should avoid spatial locality
Column mapping depends on row-buffer management (why?)

Address Mapping Schemes (2/3)
[... ... ... ... bank column ...]:
0x00000 0x00400 0x00800 0x00C00
0x00100 0x00500 0x00900 0x00D00
0x00200 0x00600 0x00A00 0x00E00
0x00300 0x00700 0x00B00 0x00F00
[... ... ... ... column bank ...]:
0x00000 0x00100 0x00200 0x00300
0x00400 0x00500 0x00600 0x00700
0x00800 0x00900 0x00A00 0x00B00
0x00C00 0x00D00 0x00E00 0x00F00
(Each grid column shows the addresses that fall in one bank)

Address Mapping Schemes (3/3)
Example open-page mapping scheme:
High parallelism: [row rank bank column channel offset]
Easy expandability: [channel rank row bank column offset]
Example close-page mapping scheme:
High parallelism: [row column rank bank channel offset]
Easy expandability: [channel rank row column bank offset]
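
A sketch of decomposing a physical address under the open-page, high-parallelism ordering [row rank bank column channel offset]; every field width below is a hypothetical choice for illustration, not a fixed standard.

/* Address decomposition for [row rank bank column channel offset].
   All bit widths below are hypothetical, for illustration only. */
#include <stdio.h>

#define OFFSET_BITS  6   /* 64 B cache line */
#define CHANNEL_BITS 1
#define COLUMN_BITS  7
#define BANK_BITS    3
#define RANK_BITS    1

int main(void) {
    unsigned long pa = 0x3FACB40;
    unsigned long a = pa >> OFFSET_BITS;              /* drop line offset */
    unsigned channel = a & ((1 << CHANNEL_BITS) - 1); a >>= CHANNEL_BITS;
    unsigned column  = a & ((1 << COLUMN_BITS) - 1);  a >>= COLUMN_BITS;
    unsigned bank    = a & ((1 << BANK_BITS) - 1);    a >>= BANK_BITS;
    unsigned rank    = a & ((1 << RANK_BITS) - 1);    a >>= RANK_BITS;
    printf("ch=%u col=%u bank=%u rank=%u row=%lu\n",
           channel, column, bank, rank, a);           /* what remains is the row */
    return 0;
}

With this ordering, consecutive cache lines first alternate channels and then walk through the columns of an open row, which is why it pairs well with an open-page policy.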

CPU-to-Memory Interconnect (1/3)
[Figure from ArsTechnica]
The North Bridge can be integrated onto the CPU chip to reduce latency

CPU-to-Memory Interconnect (2/3)
[Figure: the CPU connects over the FSB to the North Bridge, which drives the memory bus and connects to the South Bridge]
Discrete North and South Bridge chips

CPU-to-Memory Interconnect (3/3)
[Figure: the CPU, with an integrated North Bridge, drives the memory bus directly and connects to the South Bridge]
Integrated North Bridge

Memory Controller (1/2)
[Figure: inside the memory controller: a read queue, a write queue, and a response queue face the CPU (commands in, data to/from the CPU); a scheduler and buffer drive Channel 0 and Channel 1]

Memory Controller (2/2)
The memory controller connects the CPU and DRAM
It receives requests after cache misses in the LLC, possibly originating from multiple cores
It is a complicated piece of hardware that handles:
DRAM refresh
Row-buffer management policies
Address mapping schemes
Request scheduling

Row-Buffer Management Policies
Open-page: after an access, keep the page in the DRAM row buffer
A subsequent access to the same page → lower latency
If the access is to a different page, the old page must be closed first
Good if there is lots of locality
Close-page: after an access, immediately close the page in the DRAM row buffer
A subsequent access to a different page → lower latency, since the old page is already closed
Good if there is no locality (random access)
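
A toy comparison of the two policies on a short access trace, with hypothetical latencies: 20 cycles for a row hit, 40 cycles for an access to a precharged bank, and 60 cycles for a conflict (close the old row, then open the new one).

/* Open-page vs. close-page sketch. Hypothetical latencies (cycles):
   row hit = 20, access to a closed row = 40, row conflict = 60. */
#include <stdio.h>

#define N 6

int main(void) {
    int rows[N] = {5, 5, 5, 9, 9, 5};   /* row touched by each access */
    int open_total = 0, close_total = 0;
    int open_row = -1;

    for (int i = 0; i < N; i++) {
        /* Open-page: hit if same row, conflict if a different row is open. */
        if (rows[i] == open_row)      open_total += 20;
        else if (open_row == -1)      open_total += 40;
        else                          open_total += 60;
        open_row = rows[i];
        /* Close-page: every access finds the bank precharged. */
        close_total += 40;
    }
    printf("open-page: %d cycles, close-page: %d cycles\n",
           open_total, close_total);   /* 220 vs. 240 for this trace */
    return 0;
}

On this locality-heavy trace open-page wins; on a random trace the conflicts would flip the result, which is exactly the trade-off above.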

Request Scheduling (1/3)
Write buffering: writes can wait until reads are done
Queue DRAM commands, usually in per-bank queues
This allows easy reordering of operations meant for the same bank
Common policies:
First-Come-First-Served (FCFS)
First-Ready, First-Come-First-Served (FR-FCFS)

Request Scheduling (2/3)
First-Come-First-Served: oldest request first
First-Ready, First-Come-First-Served: prioritizes column accesses (row-buffer hits) over row changes, skipping over older conflicting requests
Find the row hits (among queued requests, even under a close-page policy)
Take the oldest of them; if it does not conflict with the in-progress request → good
Otherwise (if it conflicts), try the next oldest
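
A minimal sketch of FR-FCFS selection from one per-bank queue: prefer the oldest request that hits the open row; if none hits, fall back to plain FCFS. The request fields and queue layout are hypothetical.

/* FR-FCFS selection sketch: oldest row-buffer hit first, else oldest. */
#include <stdio.h>

struct req { int row; int arrival; };

int pick_fr_fcfs(struct req q[], int n, int open_row) {
    int best = -1;
    for (int i = 0; i < n; i++)          /* oldest request that hits the open row */
        if (q[i].row == open_row && (best == -1 || q[i].arrival < q[best].arrival))
            best = i;
    if (best != -1) return best;
    for (int i = 0; i < n; i++)          /* no hit: plain FCFS */
        if (best == -1 || q[i].arrival < q[best].arrival)
            best = i;
    return best;
}

int main(void) {
    struct req q[] = {{7, 0}, {3, 1}, {3, 2}};
    int i = pick_fr_fcfs(q, 3, 3);       /* open row is 3 */
    printf("serve request %d (row %d)\n", i, q[i].row);  /* request 1 */
    return 0;
}

In the example, the older request to row 7 is skipped in favor of the younger row-3 hit, which is exactly the "skip over older conflicting requests" behavior above.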

Request Scheduling (3/3)
Why is it hard? There are tons of timing constraints in DRAM:
tWTR: minimum cycles before a read after a write
tRC: minimum cycles between consecutive opens in a bank
...
The scheduler must simultaneously track resources to prevent conflicts: channels, banks, ranks, the data bus, the address bus, and row buffers
It must do this for many queued requests at the same time
... while not forgetting to do refresh

Memory-Level Parallelism (MLP)
What if memory latency were 10,000 cycles? Runtime would be dominated by waiting for memory
What matters is overlapping memory accesses
Memory-Level Parallelism (MLP): "the average number of outstanding memory accesses when at least one memory access is outstanding"
MLP is a metric, not a fundamental property of the workload; it depends on the microarchitecture

AMAT with MLP
If a cache hit takes 10 cycles (core to L1 and back) and a memory access takes 100 cycles (core to memory and back),
then at a 50% miss ratio, the average access time is 0.5 × 10 + 0.5 × 100 = 55 cycles
Unless MLP > 1.0:
At a 50% miss ratio and MLP = 1.5, the average access time is (0.5 × 10 + 0.5 × 100) / 1.5 ≈ 37 cycles
At a 50% miss ratio and MLP = 4.0, the average access time is (0.5 × 10 + 0.5 × 100) / 4.0 ≈ 14 cycles
In many cases, MLP dictates performance
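
The slide's arithmetic as a sketch: the serial AMAT is divided by the achieved MLP to get the effective average access time.

/* AMAT-with-MLP arithmetic from the slide. */
#include <stdio.h>

double amat(double miss_ratio, double hit, double miss, double mlp) {
    double serial = (1.0 - miss_ratio) * hit + miss_ratio * miss;
    return serial / mlp;   /* overlap divides the effective latency */
}

int main(void) {
    printf("MLP 1.0: %.0f cycles\n", amat(0.5, 10, 100, 1.0));  /* 55 */
    printf("MLP 1.5: %.0f cycles\n", amat(0.5, 10, 100, 1.5));  /* ~37 */
    printf("MLP 4.0: %.0f cycles\n", amat(0.5, 10, 100, 4.0));  /* ~14 */
    return 0;
}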

Overcoming Memory Latency
Caching: reduce average latency by avoiding DRAM altogether
Limitations: capacity (programs keep increasing in size) and compulsory misses
Memory-Level Parallelism: perform multiple concurrent accesses
Prefetching: guess what will be accessed next and put it into the cache
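
As a final sketch, the simplest prefetching guess is next-line prefetch: on a miss to line X, also fetch line X + 1. The tiny tagless cache model here is hypothetical, for illustration only.

/* Next-line prefetch sketch: on a miss to line X, also fetch X + 1. */
#include <stdio.h>
#include <string.h>

#define LINES 256   /* hypothetical tiny cache, mapped by line number */

int cached[LINES];

void fetch(long line)   { cached[line % LINES] = 1; }
int  present(long line) { return cached[line % LINES]; }

void access_line(long line, int *misses) {
    if (!present(line)) {
        (*misses)++;
        fetch(line);
        fetch(line + 1);   /* the prefetch: guess sequential access */
    }
}

int main(void) {
    int misses = 0;
    memset(cached, 0, sizeof cached);
    for (long l = 0; l < 8; l++) access_line(l, &misses);
    printf("misses: %d of 8 sequential accesses\n", misses);  /* 4 */
    return 0;
}

On a purely sequential stream the prefetch halves the miss count; a bad guess would instead waste bandwidth, which is the usual prefetching trade-off.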