15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi * Chang Joo Lee * + Onur Mutlu Yale N. Patt * * HPS Research Group The.
Advertisements

A Performance Comparison of DRAM Memory System Optimizations for SMT Processors Zhichun ZhuZhao Zhang ECE Department Univ. Illinois at ChicagoIowa State.
Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.
1 Adapted from UCB CS252 S01, Revised by Zhao Zhang in IASTATE CPRE 585, 2004 Lecture 14: Hardware Approaches for Cache Optimizations Cache performance.
4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
Some Opportunities and Obstacles in Cross-Layer and Cross-Component (Power) Management Onur Mutlu NSF CPOM Workshop, 2/10/2012.
Understanding a Problem in Multicore and How to Solve It
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.
1 Multi-Core Systems CORE 0CORE 1CORE 2CORE 3 L2 CACHE L2 CACHE L2 CACHE L2 CACHE DRAM MEMORY CONTROLLER DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 DRAM Bank.
MICRO-47, December 15, 2014 FIRM: Fair and HIgh-PerfoRmance Memory Control for Persistent Memory Systems Jishen Zhao Onur Mutlu Yuan Xie.
1 Lecture 13: DRAM Innovations Today: energy efficiency, row buffer management, scheduling.
Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian,
Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.
1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )
Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian,
1 Lecture 13: Cache Innovations Today: cache access basics and innovations, DRAM (Sections )
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
Parallel Application Memory Scheduling Eiman Ebrahimi * Rustam Miftakhutdinov *, Chris Fallin ‡ Chang Joo Lee * +, Jose Joao * Onur Mutlu ‡, Yale N. Patt.
1 Lecture 7: Caching in Row-Buffer of DRAM Adapted from “A Permutation-based Page Interleaving Scheme: To Reduce Row-buffer Conflicts and Exploit Data.
1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.
1 Lecture 1: Introduction and Memory Systems CS 7810 Course organization:  5 lectures on memory systems  5 lectures on cache coherence and consistency.
Scalable Many-Core Memory Systems Lecture 4, Topic 3: Memory Interference and QoS-Aware Memory Systems Prof. Onur Mutlu
1 Lecture 4: Memory: HMC, Scheduling Topics: BOOM, memory blades, HMC, scheduling policies.
Stall-Time Fair Memory Access Scheduling Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research.
Thread Cluster Memory Scheduling : Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter.
1 Process Scheduling in Multiprocessor and Multithreaded Systems Matt Davis CS5354/7/2003.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
Scalable Many-Core Memory Systems Topic 3: Memory Interference and QoS-Aware Memory Systems Prof. Onur Mutlu
1 Lecture 14: DRAM Main Memory Systems Today: cache/TLB wrap-up, DRAM basics (Section 2.3)
Modern DRAM Memory Architectures Sam Miller Tam Chantem Jon Lucas CprE 585 Fall 2003.
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University.
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.
1 Lecture 2: Memory Energy Topics: energy breakdowns, handling overfetch, LPDRAM, row buffer management, channel energy, refresh energy.
Computer Architecture: Memory Interference and QoS (Part I) Prof. Onur Mutlu Carnegie Mellon University.
1 Adapted from UC Berkeley CS252 S01 Lecture 18: Reducing Cache Hit Time and Main Memory Design Virtucal Cache, pipelined cache, cache summary, main memory.
1 Lecture 3: Memory Buffers and Scheduling Topics: buffers (FB-DIMM, RDIMM, LRDIMM, BoB, BOOM), memory blades, scheduling policies.
Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 1 The Blacklisting Memory.
1 Lecture: Memory Technology Innovations Topics: memory schedulers, refresh, state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile.
1 Lecture: DRAM Main Memory Topics: DRAM intro and basics (Section 2.3)
Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur.
Managing DRAM Latency Divergence in Irregular GPGPU Applications Niladrish Chatterjee Mike O’Connor Gabriel H. Loh Nuwan Jayasena Rajeev Balasubramonian.
15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.
Memory Hierarchy— Five Ways to Reduce Miss Penalty.
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1.
15-740/ Computer Architecture Lecture 25: Main Memory
Effect of Instruction Fetch and Memory Scheduling on GPU Performance Nagesh B Lakshminarayana, Hyesoon Kim.
18-740/640 Computer Architecture Lecture 15: Memory Resource Management II Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/2/2015.
1 Lecture 4: Memory Scheduling, Refresh Topics: scheduling policies, refresh basics.
Computer Architecture Lecture 22: Memory Controllers Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 3/25/2015.
1 Lecture: Memory Basics and Innovations Topics: memory organization basics, schedulers, refresh,
Reducing Memory Interference in Multicore Systems
CSE 502: Computer Architecture
Prof. Onur Mutlu Carnegie Mellon University 11/9/2012
Prof. Onur Mutlu Carnegie Mellon University
The University of Adelaide, School of Computer Science
5.2 Eleven Advanced Optimizations of Cache Performance
Computer Architecture Lecture 24: Memory Scheduling
Lecture 15: DRAM Main Memory Systems
Computer Architecture: Multithreading (I)
Lecture: DRAM Main Memory
Achieving High Performance and Fairness at Low Cost
Lecture: DRAM Main Memory
Lecture: DRAM Main Memory
Lecture: Memory Technology Innovations
Samira Khan University of Virginia Nov 14, 2018
15-740/ Computer Architecture Lecture 14: Prefetching
15-740/ Computer Architecture Lecture 19: Main Memory
Presented by Florian Ettinger
18-447: Computer Architecture Lecture 19: Main Memory
Presentation transcript:

15-740/ Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University

Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture  Memory controller  Memory buses  Banks, ranks, channels, DIMMs  Address mapping: software vs. hardware  DRAM refresh Memory scheduling policies Memory power/energy management Multi-core issues  Fairness, interference  Large DRAM capacity 2

Readings Required:  Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Memory Controllers,” IEEE Micro Top Picks  Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO Recommended:  Zhang et al., “A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality,” MICRO  Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO  Rixner et al., “Memory Access Scheduling,” ISCA

Review: Generalized Memory Structure 4

Review: DRAM Controller Purpose and functions  Ensure correct operation of DRAM (refresh)  Service DRAM requests while obeying timing constraints of DRAM chips Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays Translate requests to DRAM command sequences  Buffer and schedule requests to improve performance Reordering and row-buffer management  Manage power consumption and thermals in DRAM Turn on/off DRAM chips, manage power modes 5

DRAM Controller (II) 6

7 A Modern DRAM Controller

DRAM Scheduling Policies (I) FCFS (first come first served)  Oldest request first FR-FCFS (first ready, first come first served) 1. Row-hit first 2. Oldest first Goal: Maximize row buffer hit rate  maximize DRAM throughput  Actually, scheduling is done at the command level Column commands (read/write) prioritized over row commands (activate/precharge) Within each group, older commands prioritized over younger ones 8

DRAM Scheduling Policies (II) A scheduling policy is essentially a prioritization order Prioritization can be based on  Request age  Row buffer hit/miss status  Request type (prefetch, read, write)  Requestor type (load miss or store miss)  Request criticality Oldest miss in the core? How many instructions in core are dependent on it? 9

Row Buffer Management Policies Open row  Keep the row open after an access + Next access might need the same row  row hit -- Next access might need a different row  row conflict, wasted energy Closed row  Close the row after an access (if no other requests already in the request buffer need the same row) + Next access might need a different row  avoid a row conflict -- Next access might need the same row  extra activate latency Adaptive policies  Predict whether or not the next access to the bank will be to the same row 10

Open vs. Closed Row Policies PolicyFirst accessNext accessCommands needed for next access Open rowRow 0Row 0 (row hit)Read Open rowRow 0Row 1 (row conflict) Precharge + Activate Row 1 + Read Closed rowRow 0Row 0 – access in request buffer (row hit) Read Closed rowRow 0Row 0 – access not in request buffer (row closed) Activate Row 0 + Read + Precharge Closed rowRow 0Row 1 (row closed)Activate Row 1 + Read + Precharge 11

Why are DRAM Controllers Difficult to Design? Need to obey DRAM timing constraints for correctness  There are many (50+) timing constraints in DRAM  tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued  tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank  … Need to keep track of many resources to prevent conflicts  Channels, banks, ranks, data bus, address bus, row buffers Need to handle DRAM refresh Need to optimize for performance (in the presence of constraints)  Reordering is not simple  Predicting the future? 12

Why are DRAM Controllers Difficult to Design? From Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April

DRAM Power Management DRAM chips have power modes Idea: When not accessing a chip power it down Power states  Active (highest power)  All banks idle  Power-down  Self-refresh (lowest power) State transitions incur latency during which the chip cannot be accessed 14

Multi-Core Issues (I) Memory controllers, pins, and memory banks are shared Pin bandwidth is not increasing as fast as number of cores  Bandwidth per core reducing Different threads executing on different cores interfere with each other in the main memory system Threads delay each other by causing resource contention:  Bank, bus, row-buffer conflicts  reduced DRAM throughput Threads can also destroy each other’s DRAM bank parallelism  Otherwise parallel requests can become serialized 15

Effects of Inter-Thread Interference in DRAM Queueing/contention delays  Bank conflict, bus conflict, channel conflict, … Additional delays due to DRAM constraints  Called “protocol overhead”  Examples Row conflicts Read-to-write and write-to-read delays Loss of intra-thread parallelism 16

17 DRAM Controllers A row-conflict memory access takes significantly longer than a row-hit access Current controllers take advantage of the row buffer Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA’00] (1) Row-hit (column) first: Service row-hit memory accesses first (2) Oldest-first: Then service older accesses first This scheduling policy aims to maximize DRAM throughput  But, it is unfair when multiple threads share the DRAM system

18 Inter-Thread Interference in DRAM Multiple threads share the DRAM controller DRAM controllers are designed to maximize DRAM throughput Existing DRAM controllers are unaware of inter-thread interference in DRAM system DRAM scheduling policies are thread-unaware and unfair  Row-hit first: unfairly prioritizes threads with high row buffer locality Streaming threads Threads that keep on accessing the same row  Oldest-first: unfairly prioritizes memory-intensive threads

Consequences of Inter-Thread Interference in DRAM 19 Unfair slowdown of different threads System performance loss Vulnerability to denial of service Inability to enforce system-level thread priorities Cores make very slow progress Memory performance hogLow priority High priority DRAM is the only shared resource

20 Why the Disparity in Slowdowns? CORE 1CORE 2 L2 CACHE L2 CACHE DRAM MEMORY CONTROLLER DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 Shared DRAM Memory System Multi-Core Chip unfairness INTERCONNECT matlab gcc DRAM Bank 3

21 An Example Memory Performance Hog STREAM - Sequential memory access - Very high row buffer locality (96% hit rate) - Memory intensive

22 A Co-Scheduled Application RDARRAY - Random memory access - Very low row buffer locality (3% hit rate) - Similarly memory intensive

23 What does the MPH do? Row Buffer Row decoder Column mux Data Row 0 T0: Row 0 Row 0 T1: Row 16 T0: Row 0T1: Row 111 T0: Row 0 T1: Row 5 T0: Row 0 Request Buffer T0: STREAM T1: RANDOM Row size: 8KB, cache block size: 64B 128 (8KB/64B) requests of T0 serviced before T1

A Multi-Core DRAM Controller Should control inter-thread interference in DRAM Properties of a good multi-core DRAM controller:  provides high system performance preserves each thread’s DRAM bank parallelism efficiently utilizes the scarce memory bandwidth  provides fairness to threads sharing the DRAM system Substrate for providing performance guarantees to different cores  is controllable and configurable by system software enables different service levels for threads with different priorities 24

Stall-Time Fair Memory Access Scheduling Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007.

26 Stall-Time Fairness in Shared DRAM Systems A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system DRAM-related stall-time: The time a thread spends waiting for DRAM memory ST shared : DRAM-related stall-time when the thread runs with other threads ST alone : DRAM-related stall-time when the thread runs alone Memory-slowdown = ST shared /ST alone  Relative increase in stall-time Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance  Considers inherent DRAM performance of each thread  Aims to allow proportional progress of threads

27 STFM Scheduling Algorithm [MICRO’07] For each thread, the DRAM controller  Tracks ST shared  Estimates ST alone Each cycle, the DRAM controller  Computes Slowdown = ST shared /ST alone for threads with legal requests  Computes unfairness = MAX Slowdown / MIN Slowdown If unfairness <   Use DRAM throughput oriented scheduling policy If unfairness ≥   Use fairness-oriented scheduling policy (1) requests from thread with MAX Slowdown first (2) row-hit first, (3) oldest-first

28 How Does STFM Prevent Unfairness? Row Buffer Data Row 0 T0: Row 0 Row 0 T1: Row 16 T0: Row 0 T1: Row 111 T0: Row 0 T1: Row 5 T0: Row 0 T0 Slowdown T1 Slowdown 1.00 Unfairness  Row 16Row 111

29 STFM Implementation Tracking ST shared  Increase ST shared if the thread cannot commit instructions due to an outstanding DRAM access Estimating ST alone  Difficult to estimate directly because thread not running alone  Observation: ST alone = ST shared - ST interference  Estimate ST interference : Extra stall-time due to interference  Update ST interference when a thread incurs delay due to other threads When a row buffer hit turns into a row-buffer conflict (keep track of the row that would have been in the row buffer) When a request is delayed due to bank or bus conflict

30 Support for System Software System-level thread weights (priorities)  OS can choose thread weights to satisfy QoS requirements  Larger-weight threads should be slowed down less  OS communicates thread weights to the memory controller  Controller scales each thread’s slowdown by its weight  Controller uses weighted slowdown used for scheduling Favors threads with larger weights  : Maximum tolerable unfairness set by system software  Don’t need fairness? Set  large.  Need strict fairness? Set  close to 1.  Other values of  : trade off fairness and throughput

Parallelism-Aware Batch Scheduling Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling: …,” ISCA 2008, IEEE Micro Top Picks 2009.

Another Problem due to Interference Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests  Memory-Level Parallelism (MLP)  Out-of-order execution, non-blocking caches, runahead execution Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks Multiple threads share the DRAM controller DRAM controllers are not aware of a thread’s MLP  Can service each thread’s outstanding requests serially, not in parallel 32

Bank Parallelism of a Thread 33 Thread A: Bank 0, Row 1 Thread A: Bank 1, Row 1 Bank access latencies of the two requests overlapped Thread stalls for ~ONE bank access latency Thread A : Bank 0 Bank 1 Compute 2 DRAM Requests Bank 0 Stall Compute Bank 1 Single Thread:

Compute 2 DRAM Requests Bank Parallelism Interference in DRAM 34 Bank 0 Bank 1 Thread A: Bank 0, Row 1 Thread B: Bank 1, Row 99 Thread B: Bank 0, Row 99 Thread A: Bank 1, Row 1 A : Compute 2 DRAM Requests Bank 0 Stall Bank 1 Baseline Scheduler: B: Compute Bank 0 Stall Bank 1 Stall Bank access latencies of each thread serialized Each thread stalls for ~TWO bank access latencies

2 DRAM Requests Parallelism-Aware Scheduler 35 Bank 0 Bank 1 Thread A: Bank 0, Row 1 Thread B: Bank 1, Row 99 Thread B: Bank 0, Row 99 Thread A: Bank 1, Row 1 A : 2 DRAM Requests Parallelism-aware Scheduler: B: Compute Bank 0 Stall Compute Bank 1 Stall 2 DRAM Requests A : Compute 2 DRAM Requests Bank 0 Stall Compute Bank 1 B: Compute Bank 0 Stall Compute Bank 1 Stall Baseline Scheduler: Compute Bank 0 Stall Compute Bank 1 Saved Cycles Average stall-time: ~1.5 bank access latencies

Parallelism-Aware Batch Scheduling (PAR-BS) Principle 1: Parallelism-awareness  Schedule requests from a thread (to different banks) back to back  Preserves each thread’s bank parallelism  But, this can cause starvation… Principle 2: Request Batching  Group a fixed number of oldest requests from each thread into a “batch”  Service the batch before all other requests  Form a new batch when the current one is done  Eliminates starvation, provides fairness  Allows parallelism-awareness within a batch 36 Bank 0Bank 1 T1 T0 T2 T3 T2 Batch T0 T1

PAR-BS Components Request batching Within-batch scheduling  Parallelism aware 37

Request Batching Each memory request has a bit (marked) associated with it Batch formation:  Mark up to Marking-Cap oldest requests per bank for each thread  Marked requests constitute the batch  Form a new batch when no marked requests are left Marked requests are prioritized over unmarked ones  No reordering of requests across batches: no starvation, high fairness How to prioritize requests within a batch? 38

Within-Batch Scheduling Can use any existing DRAM scheduling policy  FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality But, we also want to preserve intra-thread bank parallelism  Service each thread’s requests back to back Scheduler computes a ranking of threads when the batch is formed  Higher-ranked threads are prioritized over lower-ranked ones  Improves the likelihood that requests from a thread are serviced in parallel by different banks Different threads prioritized in the same order across ALL banks 39 HOW?

How to Rank Threads within a Batch Ranking scheme affects system throughput and fairness Maximize system throughput  Minimize average stall-time of threads within the batch Minimize unfairness (Equalize the slowdown of threads)  Service threads with inherently low stall-time early in the batch  Insight: delaying memory non-intensive threads results in high slowdown Shortest stall-time first (shortest job first) ranking  Provides optimal system throughput [Smith, 1956]*  Controller estimates each thread’s stall-time within the batch  Ranks threads with shorter stall-time higher 40 * W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.

Maximum number of marked requests to any bank (max-bank-load)  Rank thread with lower max-bank-load higher (~ low stall-time) Total number of marked requests (total-load)  Breaks ties: rank thread with lower total-load higher Shortest Stall-Time First Ranking 41 T2T3T1 T0 Bank 0Bank 1Bank 2Bank 3 T3 T1T3 T2 T1T2 T1T0T2T0 T3T2T3 max-bank-loadtotal-load T013 T124 T226 T359 Ranking: T0 > T1 > T2 > T3

7 5 3 Example Within-Batch Scheduling Order 42 T2T3T1 T0 Bank 0Bank 1Bank 2Bank 3 T3 T1T3 T2 T1T2 T1T0T2T0 T3T2T3 Baseline Scheduling Order (Arrival order) PAR-BS Scheduling Order T2 T3 T1 T0 Bank 0Bank 1Bank 2Bank 3 T3 T1 T3 T2 T1 T2 T1 T0 T2 T0 T3 T2 T3 T0T1T2T AVG: 5 bank access latenciesAVG: 3.5 bank access latencies Stall times T0T1T2T Stall times Time Ranking: T0 > T1 > T2 > T Time

Putting It Together: PAR-BS Scheduling Policy PAR-BS Scheduling Policy (1) Marked requests first (2) Row-hit requests first (3) Higher-rank thread first (shortest stall-time first) (4) Oldest first Three properties:  Exploits row-buffer locality and intra-thread bank parallelism  Work-conserving Services unmarked requests to banks without marked requests  Marking-Cap is important Too small cap: destroys row-buffer locality Too large cap: penalizes memory non-intensive threads Trade-offs analyzed in [ISCA 2008] 43 Batching Parallelism-aware within-batch scheduling

44 Unfairness on 4-, 8-, 16-core Systems Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]

45 System Performance