18-447: Computer Architecture Lecture 19: Main Memory

Presentation transcript:

18-447: Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/2/2012

Reminder: Homeworks Homework 5 Due today Topics: Out-of-order execution, dataflow, vector processing, memory, caches

Homework 4 Grades Average 83.14 Median 83 Max 105 Min 51 Max Possible Points Total number of students 47

Reminder: Lab Assignments Implementing caches and branch prediction in a high-level timing simulator of a pipelined processor Due April 6 Extra credit: Cache exploration and high performance with optimized caches

Lab 4 Grades Average 665.3 Median 695 Max 770 Min 230 Max Possible Points (w/o EC) 700 Total number of students 46

Lab 4: Correct Designs and Extra Credit
Rank | Student | Crit. Path (ns) | Cycles | Execution Time (ns) | Relative Execution Time
1 | Eric Brunstad | 10.425 | 34568 | 360371.4 | 1.00
2 | Arthur Chang | 10.686 | 34804 | 371915.5 | 1.03
3 | Alex Crichton | 10.85 | 34636 | 375800.6 | 1.04
4 | Jason Lin | 11.312 | 34672 | 392209.7 | 1.09
5 | Anish Phophaliya | 10.593 | 37560 | 397873.0 | 1.10
6 | James Wahawisan | 9.16 | 44976 | 411980.2 | 1.14
7 | Prerak Patel | 11.315 | 37886 | 428680.1 | 1.19
8 | Greg Nazario | 12.23 | 35696 | 436562.1 | 1.21
9 | Kee Young Lee | 10.019 | | 450614.5 | 1.25
10 | Jonathan Loh | 13.731 | 33668 | 462295.3 | 1.28
11 | Vikram Rajkumar | 13.823 | 34932 | 482865.0 | 1.34
12 | Justin Wagner | 15.065 | 33728 | 508112.3 | 1.41
13 | Daniel Jacobs | 13.593 | 37782 | 513570.7 | 1.43
14 | Mike Mu | 14.055 | 36832 | 517673.8 | 1.44
15 | Qiannan Zhang | 13.484 | 38764 | 522693.8 | 1.45
16 | Andrew Tan | 16.754 | 34660 | 580693.6 | 1.61
17 | Dennis Liang | 16.722 | 37176 | 621657.1 | 1.73
18 | Dev Gurjar | 12.864 | 57332 | 737518.8 | 2.05
19 | Winnie Woo | 23.281 | 33976 | 790995.3 | 2.19

Lab 4 Extra Credit
Rank | Student | Crit. Path (ns) | Cycles | Execution Time | Relative Execution Time
1 | Eric Brunstad | 10.425 | 34568 | 360371.4 | 1.00
2 | Arthur Chang | 10.686 | 34804 | 371915.5 | 1.03
3 | Alex Crichton | 10.85 | 34636 | 375800.6 | 1.04

Reminder: Midterm II Next week April 11 Everything covered in the course can be on the exam You can bring in two cheat sheets (8.5x11’’)

Review of Last Lecture
Wrap up basic caches: Handling writes; Sectored caches; Instruction vs. data; Multi-level caching issues; Cache performance; Multiple outstanding misses; Multiple accesses per cycle
Start main memory: DRAM basics; Interleaving; Bank, rank concepts

Review: Interleaving Interleaving (banking) Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel Goal: Reduce the latency of memory array access and enable multiple accesses in parallel Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles) Each bank is smaller than the entire memory storage Accesses to different banks can be overlapped Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)

The DRAM Subsystem

DRAM Subsystem Organization Channel DIMM Rank Chip Bank Row/Column

The DRAM Bank Structure

Page Mode DRAM A DRAM bank is a 2D array of cells: rows x columns A “DRAM row” is also called a “DRAM page” “Sense amplifiers” also called “row buffer” Each address is a <row,column> pair Access to a “closed row” Activate command opens row (placed into row buffer) Read/write command reads/writes column in the row buffer Precharge command closes the row and prepares the bank for next access Access to an “open row” No need for activate command

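DRAM Bank Operation [Figure: a DRAM bank drawn as a 2D array of rows and columns. The row decoder takes the row address (0 or 1 in the example) and loads that row into the row buffer, which starts out empty. The column mux takes the column address (0, 1, and 85 in the example) and selects the data. Starting from the access address (Row 0, Column 0), the accesses in the example are marked with two row-buffer HITs and one CONFLICT.]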
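The activate / read / precharge sequence described for page mode DRAM can be sketched as a small model. This is an illustrative sketch, not code from the lecture: the `Bank` class, the command strings, and the access trace are all assumptions. The trace is assembled from the row and column addresses labeled in the figure, and the bank is assumed to keep its row open after each access.

```python
from enum import Enum


class AccessKind(Enum):
    HIT = "row hit"            # requested row is already in the row buffer
    CLOSED = "row closed"      # bank is precharged, no row in the row buffer
    CONFLICT = "row conflict"  # a different row is in the row buffer


class Bank:
    """Hypothetical model of one DRAM bank that leaves rows open."""

    def __init__(self):
        self.open_row = None   # None means the bank is precharged

    def access(self, row, col):
        """Return (kind, commands) for a read of <row, col>."""
        if self.open_row == row:
            kind, cmds = AccessKind.HIT, []
        elif self.open_row is None:
            kind, cmds = AccessKind.CLOSED, [f"ACT row {row}"]
        else:
            kind, cmds = AccessKind.CONFLICT, ["PRE", f"ACT row {row}"]
        self.open_row = row
        return kind, cmds + [f"RD col {col}"]


bank = Bank()
# Assumed trace built from the addresses that appear in the figure labels.
trace = [(0, 0), (0, 1), (0, 85), (1, 0)]
results = [bank.access(row, col) for row, col in trace]
for (row, col), (kind, cmds) in zip(trace, results):
    print(f"(row {row}, col {col}): {kind.value:12s} {' -> '.join(cmds)}")
```

The number of commands per access is what makes a row hit the cheapest case and a row conflict the most expensive one.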

The DRAM Chip Consists of multiple banks (2-16 in Synchronous DRAM) Banks share command/address/data buses The chip itself has a narrow interface (4-16 bits per read)

128M x 8-bit DRAM Chip

DRAM Rank and Module Rank: Multiple chips operated together to form a wide interface All chips comprising a rank are controlled at the same time Respond to a single command Share address and command buses, but provide different data A DRAM module consists of one or more ranks E.g., DIMM (dual inline memory module) This is what you plug into your motherboard If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

A 64-bit Wide DIMM (One Rank)

A 64-bit Wide DIMM (One Rank) Advantages: Acts like a high-capacity DRAM chip with a wide interface Flexibility: memory controller does not need to deal with individual chips Disadvantages: Granularity: Accesses cannot be smaller than the interface width

Multiple DIMMs Advantages: Enables even higher capacity Disadvantages: Interconnect complexity and energy consumption can be high

DRAM Channels 2 Independent Channels: 2 Memory Controllers (Above) 2 Dependent/Lockstep Channels: 1 Memory Controller with wide interface (Not Shown above)

Generalized Memory Structure

The DRAM Subsystem The Top Down View

DRAM Subsystem Organization Channel DIMM Rank Chip Bank Row/Column

The DRAM subsystem [Figure: the processor connects over memory channels to DIMMs (dual in-line memory modules); each such connection is a "channel".]

Breaking down a DIMM [Figure: side view of a DIMM (dual in-line memory module), showing the front and the back. Rank 0 is the collection of 8 chips on the front of the DIMM; Rank 1 is on the back.]

Rank [Figure: Rank 0 (front) and Rank 1 (back) each provide data bits <0:63> and share the memory channel: Addr/Cmd, chip select CS <0:1>, and Data <0:63>.]

Breaking down a Rank [Figure: Rank 0 consists of Chip 0, Chip 1, ..., Chip 7. Chip 0 drives data bits <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63>, which together form Data <0:63>.]

Breaking down a Chip [Figure: Chip 0 contains 8 banks (Bank 0, ...) behind the chip's <0:7> data interface.]

Breaking down a Bank [Figure: Bank 0 consists of row 0 through row 16k-1. Each column is 1B wide. A row is read into the row buffer, and 1B columns are sent from the row buffer over <0:7>.]

DRAM Subsystem Organization Channel DIMM Rank Chip Bank Row/Column

Example: Transferring a cache block
[Figure sequence: the physical memory space runs from 0x00 to 0xFFFF…F. The 64B cache block between 0x00 and 0x40 is mapped to Channel 0, DIMM 0, Rank 0. Chip 0, Chip 1, ..., Chip 7 of Rank 0 drive data bits <0:7>, <8:15>, ..., <56:63> of Data <0:63>.]
Row 0, Col 0 is read in every chip: 8B of the cache block are transferred over Data <0:63>.
Row 0, Col 1 is read in every chip: the next 8B are transferred.
...
A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.
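The arithmetic behind the 8 I/O cycles can be sketched as follows. This is only an illustration: the function name `transfer_block` is made up, and the assumption that byte k of each 8B beat comes from chip k is mine, chosen to match the <0:7>, <8:15>, ..., <56:63> lane assignment in the figure.

```python
BLOCK_BYTES = 64   # cache block size used in the example
CHIPS = 8          # chips per rank, each with an 8-bit (1B) interface


def transfer_block(block: bytes):
    """Split a cache block into beats; each beat carries one byte per chip."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("expected a 64B cache block")
    beats = []
    for col in range(BLOCK_BYTES // CHIPS):
        # Assumed lane order: chip k drives data bits <8k : 8k+7>.
        beats.append([block[col * CHIPS + chip] for chip in range(CHIPS)])
    return beats


block = bytes(range(BLOCK_BYTES))
beats = transfer_block(block)
print(len(beats), "I/O cycles, one column per cycle")
print("beat 0, bytes from chips 0..7:", beats[0])
```

Each beat corresponds to one column read in all eight chips at once, which is why the rank behaves like a single chip with a 64-bit interface.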

Latency Components: Basic DRAM Operation CPU → controller transfer time Controller latency Queuing & scheduling delay at the controller Access converted to basic commands Controller → DRAM transfer time DRAM bank latency Simple CAS if row is “open” OR RAS + CAS if array precharged OR PRE + RAS + CAS (worst case) DRAM → CPU transfer time (through controller)

Multiple Banks (Interleaving) and Channels Enable concurrent DRAM accesses Bits in address determine which bank an address resides in Multiple independent channels serve the same purpose But they are even better because they have separate data buses Increased bus bandwidth Enabling more concurrency requires reducing Bank conflicts Channel conflicts How to select/randomize bank/channel indices in address? Lower order bits have more entropy Randomizing hash functions (XOR of different address bits)

How Multiple Banks/Channels Help

Multiple Channels Advantages: Increased bandwidth; Multiple concurrent accesses (if independent channels) Disadvantages: Higher cost than a single channel; More board wires; More pins (if on-chip memory controller)

Address Mapping (Single Channel) Single-channel system with 8-byte memory bus 2GB memory, 8 banks, 16K rows & 2K columns per bank
Row interleaving: Consecutive rows of memory in consecutive banks
  Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
Cache block interleaving: Consecutive cache block addresses in consecutive banks, 64 byte cache blocks
  Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
Accesses to consecutive cache blocks can be serviced in parallel
How about random accesses? Strided accesses?
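The two bit layouts can be compared with a small decoder. The field widths come from the slide; the helper names (`split_fields`, `bank_of`) and the sample addresses are hypothetical, and the 512-byte stride is just one example chosen to show a bad case for cache block interleaving.

```python
def split_fields(addr, layout):
    """Peel fields off the low end of addr; layout lists (name, bits) from LSB to MSB."""
    fields = {}
    for name, bits in layout:
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields


# Field widths from the slide, listed from least to most significant.
ROW_INTERLEAVED = [("byte", 3), ("column", 11), ("bank", 3), ("row", 14)]
BLOCK_INTERLEAVED = [("byte", 3), ("low_col", 3), ("bank", 3),
                     ("high_col", 8), ("row", 14)]


def bank_of(addr, layout):
    return split_fields(addr, layout)["bank"]


consecutive_blocks = [i * 64 for i in range(4)]   # 64-byte cache blocks
row_banks = [bank_of(a, ROW_INTERLEAVED) for a in consecutive_blocks]
blk_banks = [bank_of(a, BLOCK_INTERLEAVED) for a in consecutive_blocks]
strided_banks = [bank_of(i * 512, BLOCK_INTERLEAVED) for i in range(4)]

print("row interleaving, consecutive blocks:  ", row_banks)
print("block interleaving, consecutive blocks:", blk_banks)
print("block interleaving, 512-byte stride:   ", strided_banks)
```

Under these layouts, consecutive cache blocks stay in one bank with row interleaving and spread across banks with cache block interleaving, while a stride that skips the bank bits lands every access in the same bank.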

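Bank Mapping Randomization DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely [Figure: two 3-bit fields of the address are combined with XOR to form the Bank index (3 bits); the Column (11 bits) and Byte in bus (3 bits) fields occupy the low-order bits.]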
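One possible XOR-based bank hash is sketched below. The slide does not say which address bits feed the XOR, so the choice here (the 3 bank bits XORed with the low 3 bits of the row, using the row-interleaved layout Row | Bank | Column | Byte in bus) is an assumption, as are the function names.

```python
BANK_SHIFT = 3 + 11          # byte-in-bus bits + column bits
ROW_SHIFT = BANK_SHIFT + 3   # row field starts above the bank bits


def bank_index_plain(addr):
    """Bank index taken directly from the 3 bank bits."""
    return (addr >> BANK_SHIFT) & 0x7


def bank_index_xor(addr):
    """Bank bits XORed with the low 3 row bits (assumed choice of bits)."""
    bank_bits = (addr >> BANK_SHIFT) & 0x7
    row_low = (addr >> ROW_SHIFT) & 0x7
    return bank_bits ^ row_low


# Accesses that walk consecutive rows while keeping the bank bits at 0.
addrs = [row << ROW_SHIFT for row in range(8)]
plain = [bank_index_plain(a) for a in addrs]
hashed = [bank_index_xor(a) for a in addrs]
print("plain bank index:", plain)
print("XOR bank index:  ", hashed)
```

For a fixed row the XOR is a one-to-one remapping of the eight bank values, so no two addresses that previously went to different banks in the same row collide after hashing.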

Address Mapping (Multiple Channels) Where are consecutive cache blocks? C marks the position of the channel bit in the address.
Row interleaving:
  Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | C | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | Bank (3 bits) | C | Column (11 bits) | Byte in bus (3 bits)
  Row (14 bits) | Bank (3 bits) | Column (11 bits) | C | Byte in bus (3 bits)
Cache block interleaving:
  C | Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
  Row (14 bits) | C | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
  Row (14 bits) | High Column (8 bits) | C | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
  Row (14 bits) | High Column (8 bits) | Bank (3 bits) | C | Low Col. (3 bits) | Byte in bus (3 bits)
  Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | C | Byte in bus (3 bits)

Interaction with Virtual→Physical Mapping Operating system influences where an address maps to in DRAM. The operating system can control which bank a virtual page is mapped to. It can randomize Page→<Bank,Channel> mappings. Application cannot know/determine which bank it is accessing.
  VA: Virtual Page number (52 bits) | Page offset (12 bits)
  PA: Physical Frame number (19 bits) | Page offset (12 bits)
  PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

DRAM Refresh (I) DRAM capacitor charge leaks over time The memory controller needs to read each row periodically to restore the charge Activate + precharge each row every N ms Typical N = 64 ms Implications on performance? -- DRAM bank unavailable while refreshed -- Long pause times: If we refresh all rows in burst, every 64ms the DRAM will be unavailable until refresh ends Burst refresh: All rows refreshed immediately after one another Distributed refresh: Each row refreshed at a different time, at regular intervals
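The difference between burst and distributed refresh can be made concrete with a rough calculation. Only the 64 ms window comes from the slide. The row count (16K, borrowed from the address mapping example) and the 50 ns per-row refresh time are assumptions picked for illustration, not measured values.

```python
REFRESH_WINDOW_MS = 64    # "Typical N = 64 ms" from the slide
ROWS = 16 * 1024          # assumed number of rows to refresh
T_ROW_NS = 50             # hypothetical activate + precharge time per row


def burst_pause_ms(rows, t_row_ns):
    """Length of the pause if all rows are refreshed back to back."""
    return rows * t_row_ns / 1_000_000


def distributed_interval_us(window_ms, rows):
    """Spacing between single-row refreshes when spread over the window."""
    return window_ms * 1000 / rows


pause = burst_pause_ms(ROWS, T_ROW_NS)
interval = distributed_interval_us(REFRESH_WINDOW_MS, ROWS)
overhead = pause / REFRESH_WINDOW_MS

print(f"burst refresh: {pause:.4f} ms pause every {REFRESH_WINDOW_MS} ms")
print(f"distributed refresh: one row every {interval:.3f} us")
print(f"fraction of time spent refreshing: {overhead:.2%}")
```

Both schemes spend the same total time refreshing under these assumptions; distributed refresh only breaks the one long pause into many short ones.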

DRAM Refresh (II) Distributed refresh eliminates long pause times How else we can reduce the effect of refresh on performance? Can we reduce the number of refreshes?

Effect of DRAM Refresh Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Retention Time of DRAM Cells Observation: DRAM cells have different data retention times Corollary: Not all rows need to be refreshed at the same frequency

Reducing DRAM Refresh Operations Idea: If we can identify the retention time of different rows, we can refresh each row at the frequency it really needs to be refreshed Implementation: Refresh controller bins the rows according to their minimum retention times and refreshes rows in each bin at the frequency specified for the bin e.g., a bin for 64-128ms, another for 128-256ms, … Observation: Only very few rows need to be refreshed very frequently (every 256ms) → Have only a few bins → low HW overhead while reducing refresh frequency for most rows by 4X Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
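The binning idea can be sketched in a few lines. This is not the RAIDR implementation: the bin periods follow the slide's 64-128ms and 128-256ms example, but the function names and the retention-time distribution are invented for illustration, and how the bins are stored in hardware is not modeled.

```python
import bisect

BIN_PERIODS_MS = [64, 128, 256]   # refresh period used for each bin


def bin_period(retention_ms):
    """Largest bin period that does not exceed the row's retention time."""
    if retention_ms < BIN_PERIODS_MS[0]:
        raise ValueError("row cannot be retained even at the base rate")
    i = bisect.bisect_right(BIN_PERIODS_MS, retention_ms) - 1
    return BIN_PERIODS_MS[i]


def refreshes_per_window(retentions_ms, window_ms=256):
    """Total row refreshes issued in one window when each row uses its bin."""
    return sum(window_ms // bin_period(r) for r in retentions_ms)


# Hypothetical retention times (ms): very few weak rows, mostly strong rows.
rows = [70] * 2 + [150] * 8 + [1000] * 990

baseline = len(rows) * (256 // 64)      # every row refreshed every 64 ms
binned = refreshes_per_window(rows)

print("refreshes per 256 ms, all rows at 64 ms:", baseline)
print("refreshes per 256 ms, binned:           ", binned)
```

With this made-up distribution almost every row falls into the 256 ms bin, so the total approaches one quarter of the baseline, which is the 4X reduction the slide refers to.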

RAIDR Mechanism Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

DRAM Controller Purpose and functions Ensure correct operation of DRAM (refresh and timing) Service DRAM requests while obeying timing constraints of DRAM chips Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays Translate requests to DRAM command sequences Buffer and schedule requests to improve performance Reordering and row-buffer management Manage power consumption and thermals in DRAM Turn on/off DRAM chips, manage power modes

DRAM Controller Issues Where to place? In chipset + More flexibility to plug different DRAM types into the system + Less power density in the CPU chip On CPU chip + Reduced latency for main memory access + Higher bandwidth between cores and controller: more information can be communicated (e.g., request’s importance in the processing core)

DRAM Controller (II)

A Modern DRAM Controller

DRAM Scheduling Policies (I) FCFS (first come first served): Oldest request first FR-FCFS (first ready, first come first served): 1. Row-hit first 2. Oldest first Goal: Maximize row buffer hit rate → maximize DRAM throughput Actually, scheduling is done at the command level: Column commands (read/write) prioritized over row commands (activate/precharge) Within each group, older commands prioritized over younger ones
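The two prioritization orders can be compared on a toy request queue. This sketch works at the request level rather than the command level the slide mentions, assumes all requests are already queued, and uses hypothetical names (`Request`, `pick_fcfs`, `pick_fr_fcfs`, `count_row_hits`).

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int   # smaller means older
    bank: int
    row: int


def pick_fcfs(queue, open_rows):
    """FCFS: oldest request first."""
    return min(queue, key=lambda r: r.arrival)


def pick_fr_fcfs(queue, open_rows):
    """FR-FCFS: row hits first, then oldest first."""
    # False sorts before True, so requests hitting the open row win.
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


def count_row_hits(requests, pick):
    """Drain the queue with the given policy and count row-buffer hits."""
    queue, open_rows, hits = list(requests), {}, 0
    while queue:
        req = pick(queue, open_rows)
        queue.remove(req)
        if open_rows.get(req.bank) == req.row:
            hits += 1
        open_rows[req.bank] = req.row   # rows stay open after an access
    return hits


# Toy queue: one bank, requests alternating between row 0 and row 1.
reqs = [Request(0, 0, 0), Request(1, 0, 1), Request(2, 0, 0), Request(3, 0, 1)]
fcfs_hits = count_row_hits(reqs, pick_fcfs)
fr_fcfs_hits = count_row_hits(reqs, pick_fr_fcfs)
print("FCFS row hits:   ", fcfs_hits)
print("FR-FCFS row hits:", fr_fcfs_hits)
```

FR-FCFS groups the accesses to each row together, which raises the row buffer hit rate at the cost of serving some younger requests before older ones.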

DRAM Scheduling Policies (II) A scheduling policy is essentially a prioritization order Prioritization can be based on Request age Row buffer hit/miss status Request type (prefetch, read, write) Requestor type (load miss or store miss) Request criticality Oldest miss in the core? How many instructions in core are dependent on it?

Row Buffer Management Policies Open row: Keep the row open after an access + Next access might need the same row → row hit -- Next access might need a different row → row conflict, wasted energy Closed row: Close the row after an access (if no other requests already in the request buffer need the same row) + Next access might need a different row → avoid a row conflict -- Next access might need the same row → extra activate latency Adaptive policies: Predict whether or not the next access to the bank will be to the same row

Open vs. Closed Row Policies
Policy | First access | Next access | Commands needed for next access
Open row | Row 0 | Row 0 (row hit) | Read
Open row | Row 0 | Row 1 (row conflict) | Precharge + Activate Row 1 + Read
Closed row | Row 0 | Row 0, access in request buffer (row hit) | Read
Closed row | Row 0 | Row 0, access not in request buffer (row closed) | Activate Row 0 + Read + Precharge
Closed row | Row 0 | Row 1 (row closed) | Activate Row 1 + Read + Precharge
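The command sequences for the two policies can be generated by one small function. This is a sketch under assumptions: the function name `commands`, its arguments, and the abbreviations PRE/ACT/RD are mine, and `same_row_pending` stands in for "another request in the request buffer needs the same row".

```python
def commands(policy, open_row, row, same_row_pending=False):
    """Return (command list, row left open afterwards) for one read.

    policy: "open" or "closed"
    open_row: row currently in the row buffer, or None if precharged
    same_row_pending: True if a queued request needs the same row
    """
    cmds = []
    if open_row is not None and open_row != row:
        cmds.append("PRE")            # close the conflicting row
    if open_row != row:
        cmds.append(f"ACT {row}")     # open the requested row
    cmds.append("RD")
    if policy == "closed" and not same_row_pending:
        cmds.append("PRE")            # closed-row policy closes the row
        return cmds, None
    return cmds, row


open_hit = commands("open", open_row=0, row=0)
open_conflict = commands("open", open_row=0, row=1)
closed_hit = commands("closed", open_row=0, row=0, same_row_pending=True)
closed_miss = commands("closed", open_row=None, row=1)

for name, (cmds, left_open) in [("open, row hit", open_hit),
                                ("open, row conflict", open_conflict),
                                ("closed, row hit", closed_hit),
                                ("closed, row closed", closed_miss)]:
    print(f"{name:20s} {' + '.join(cmds):20s} row left open: {left_open}")
```

The closed-row policy never pays for a precharge in front of an access, but it pays for an activate whenever the next access would have hit the row it just closed.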

Why are DRAM Controllers Difficult to Design? Need to obey DRAM timing constraints for correctness There are many (50+) timing constraints in DRAM tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank … Need to keep track of many resources to prevent conflicts Channels, banks, ranks, data bus, address bus, row buffers Need to handle DRAM refresh Need to optimize for performance (in the presence of constraints) Reordering is not simple Predicting the future?

Why are DRAM Controllers Difficult to Design? From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.

DRAM Power Management DRAM chips have power modes Idea: When not accessing a chip, power it down Power states: Active (highest power); All banks idle; Power-down; Self-refresh (lowest power) State transitions incur latency during which the chip cannot be accessed