Computer Architecture Lecture 22: Memory Controllers Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 3/25/2015
Lab 4 Grades Mean: 89.7 Median: 94.3 Standard Deviation: 16.2 2
Lab 4 Extra Credit Pete Ehrett (fastest) – 2% Navneet Saini (2nd fastest) – 1% 3
Announcements (I) No office hours today Hosting a seminar in this room right after this lecture Swarun Kumar, MIT, “Pushing the Limits of Wireless Networks: Interference Management and Indoor Positioning” March 25, 2:30-3:30pm, HH 1107 From talk abstract: (…) perhaps our biggest expectation from modern wireless networks is faster communication speeds. However, state-of-the-art Wi-Fi networks continue to struggle in crowded environments — airports and hotel lobbies. The core reason is interference — Wi-Fi access points today avoid transmitting at the same time on the same frequency, since they would otherwise interfere with each other. I describe OpenRF, a novel system that enables today’s Wi-Fi access points to directly combat this interference and demonstrate significantly faster data-rates for real applications. 4
Today’s Seminar on Flash Memory (4-5pm) March 25, Wednesday, CIC Panther Hollow Room, 4-5pm Yixin Luo, PhD Student, CMU Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, "Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery," Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015. Slides (pptx) (pdf). Best paper session. 5
Flash Memory (SSD) Controllers Similar to DRAM memory controllers, except: They are flash memory specific They do much more: error correction, garbage collection, page remapping, … 6 Cai+, “Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime”, ICCD 2012.
Where We Are in Lecture Schedule The memory hierarchy Caches, caches, more caches Virtualizing the memory hierarchy: Virtual Memory Main memory: DRAM Main memory control, scheduling Memory latency tolerance techniques Non-volatile memory Multiprocessors Coherence and consistency Interconnection networks Multi-core issues 7
Required Reading (for the Next Few Lectures) Onur Mutlu, Justin Meza, and Lavanya Subramanian, "The Main Memory System: Challenges and Opportunities," Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015. http://users.ece.cmu.edu/~omutlu/pub/main-memory-system_kiise15.pdf 8
Required Readings on DRAM DRAM Organization and Operation Basics: Sections 1 and 2 of Lee et al., "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013. Sections 1 and 2 of Kim et al., "A Case for Subarray-Level Parallelism (SALP) in DRAM," ISCA 2012. DRAM Refresh Basics: Sections 1 and 2 of Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012. 9
Memory Controllers
DRAM versus Other Types of Memories Long latency memories have similar characteristics that need to be controlled. The following discussion will use DRAM as an example, but many scheduling and control issues are similar in the design of controllers for other types of memories Flash memory Other emerging memory technologies Phase Change Memory Spin-Transfer Torque Magnetic Memory These other technologies can place other demands on the controller 11
DRAM Types DRAM has different types with different interfaces optimized for different purposes Commodity: DDR, DDR2, DDR3, DDR4, … Low power (for mobile): LPDDR1, …, LPDDR5, … High bandwidth (for graphics): GDDR2, …, GDDR5, … Low latency: eDRAM, RLDRAM, … 3D stacked: WIO, HBM, HMC, … …… Underlying microarchitecture is fundamentally the same A flexible memory controller can support various DRAM types This complicates the memory controller Difficult to support all types (and upgrades) 12
DRAM Types (II) 13 Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” IEEE Comp Arch Letters 2015.
DRAM Controller: Functions Ensure correct operation of DRAM (refresh and timing) Service DRAM requests while obeying timing constraints of DRAM chips Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays Translate requests to DRAM command sequences (see the sketch below) Buffer and schedule requests for high performance + QoS Reordering, row-buffer, bank, rank, bus management Manage power consumption and thermals in DRAM Turn on/off DRAM chips, manage power modes 14
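To make the "translate requests to DRAM command sequences" function concrete, here is a minimal C sketch, not from the lecture, of how a controller might map one read request onto PRECHARGE/ACTIVATE/READ commands depending on the row-buffer state of the target bank. The types, field names, and the issue_read helper are hypothetical, and timing constraints (tRCD, tRP, write-to-read delays, etc.) are deliberately omitted.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-bank state and request types, for illustration only. */
typedef struct { bool row_open; int open_row; } bank_state_t;
typedef struct { int bank; int row; int col; } request_t;

/* Emit the command sequence needed to service one read request,
   given the current row-buffer state of the target bank. */
static void issue_read(bank_state_t *bank, const request_t *req)
{
    if (bank->row_open && bank->open_row == req->row) {
        /* Row hit: only a column command is needed. */
        printf("READ      bank=%d col=%d\n", req->bank, req->col);
    } else {
        if (bank->row_open) {
            /* Row conflict: close the currently open row first. */
            printf("PRECHARGE bank=%d\n", req->bank);
        }
        /* Row closed (or just closed): open the requested row, then read. */
        printf("ACTIVATE  bank=%d row=%d\n", req->bank, req->row);
        printf("READ      bank=%d col=%d\n", req->bank, req->col);
        bank->row_open = true;
        bank->open_row = req->row;
    }
}

int main(void)
{
    bank_state_t bank = { .row_open = false };
    request_t r1 = { .bank = 0, .row = 0, .col = 0 };
    request_t r2 = { .bank = 0, .row = 0, .col = 85 };
    request_t r3 = { .bank = 0, .row = 1, .col = 0 };
    issue_read(&bank, &r1);  /* row closed: ACTIVATE + READ            */
    issue_read(&bank, &r2);  /* row hit: READ only                     */
    issue_read(&bank, &r3);  /* row conflict: PRECHARGE + ACT + READ   */
    return 0;
}

A real controller would additionally check that each command respects the DRAM timing constraints and resource conflicts listed above before issuing it.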
DRAM Controller: Where to Place In chipset + More flexibility to plug different DRAM types into the system + Less power density in the CPU chip On CPU chip + Reduced latency for main memory access + Higher bandwidth between cores and controller More information can be communicated (e.g. request’s importance in the processing core) 15
A Modern DRAM Controller (I) 16
17 A Modern DRAM Controller (II)
DRAM Scheduling Policies (I) FCFS (first come first served) Oldest request first FR-FCFS (first ready, first come first served) 1. Row-hit first 2. Oldest first Goal: Maximize row buffer hit rate → maximize DRAM throughput Actually, scheduling is done at the command level Column commands (read/write) prioritized over row commands (activate/precharge) Within each group, older commands prioritized over younger ones 18
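A minimal sketch of FR-FCFS request selection, with hypothetical data structures (this is not how a real controller is implemented, and it works at the request rather than the command level): row-hit requests are preferred, and age breaks ties.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int bank, row;
    unsigned long arrival_time;   /* smaller = older */
} request_t;

/* Returns the index of the request FR-FCFS schedules next (assumes n >= 1):
   (1) row-hit requests first, (2) oldest first. open_row[b] holds the
   currently open row of bank b, or -1 if the bank has no open row. */
size_t frfcfs_pick(const request_t *q, size_t n, const int *open_row)
{
    size_t best = 0;
    bool best_hit = (open_row[q[0].bank] == q[0].row);
    for (size_t i = 1; i < n; i++) {
        bool hit = (open_row[q[i].bank] == q[i].row);
        if ((hit && !best_hit) ||
            (hit == best_hit && q[i].arrival_time < q[best].arrival_time)) {
            best = i;
            best_hit = hit;
        }
    }
    return best;
}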
DRAM Scheduling Policies (II) A scheduling policy is a request prioritization order Prioritization can be based on Request age Row buffer hit/miss status Request type (prefetch, read, write) Requestor type (load miss or store miss) Request criticality Oldest miss in the core? How many instructions in core are dependent on it? Will it stall the processor? Interference caused to other cores …… 19
Row Buffer Management Policies Open row Keep the row open after an access + Next access might need the same row → row hit -- Next access might need a different row → row conflict, wasted energy Closed row Close the row after an access (if no other requests already in the request buffer need the same row) + Next access might need a different row → avoid a row conflict -- Next access might need the same row → extra activate latency Adaptive policies Predict whether or not the next access to the bank will be to the same row 20
Open vs. Closed Row Policies 21
Policy | First access | Next access | Commands needed for next access
Open row | Row 0 | Row 0 (row hit) | Read
Open row | Row 0 | Row 1 (row conflict) | Precharge + Activate Row 1 + Read
Closed row | Row 0 | Row 0 – access in request buffer (row hit) | Read
Closed row | Row 0 | Row 0 – access not in request buffer (row closed) | Activate Row 0 + Read + Precharge
Closed row | Row 0 | Row 1 (row closed) | Activate Row 1 + Read + Precharge
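A small sketch of the decision the table above summarizes. The helper pending_request_to_row is hypothetical (stubbed here); in a real controller it would search the request buffer. After servicing an access, an open-row policy leaves the row in the row buffer, while a closed-row policy precharges it unless another queued request needs the same row.

#include <stdbool.h>

typedef enum { OPEN_ROW, CLOSED_ROW } row_policy_t;

/* Hypothetical helper: does any request already in the request buffer
   target this bank and row? Stubbed out for the sketch. */
static bool pending_request_to_row(int bank, int row)
{
    (void)bank; (void)row;
    return false;
}

/* Decide whether to precharge (close) the row after an access completes. */
bool should_precharge_after_access(row_policy_t policy, int bank, int row)
{
    if (policy == OPEN_ROW)
        return false;               /* keep the row open, hoping for a row hit */
    /* Closed-row: close the row unless a queued request still needs it. */
    return !pending_request_to_row(bank, row);
}

An adaptive policy would replace the fixed policy argument with a predictor of whether the next access to this bank targets the same row.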
Memory Interference and Scheduling in Multi-Core Systems
23 Review: A Modern DRAM Controller
Review: DRAM Bank Operation 24 [Figure: a DRAM bank with row decoder, row buffer, and column mux. Accesses to (Row 0, Column 0), (Row 0, Column 1), and (Row 0, Column 85) hit the open Row 0; a subsequent access to (Row 1, Column 0) is a conflict and requires bringing Row 1 into the row buffer.]
Scheduling Policy for Single-Core Systems A row-conflict memory access takes significantly longer than a row-hit access Current controllers take advantage of the row buffer FR-FCFS (first ready, first come first served) scheduling policy 1. Row-hit first 2. Oldest first Goal 1: Maximize row buffer hit rate → maximize DRAM throughput Goal 2: Prioritize older requests → ensure forward progress Is this a good policy in a multi-core system? 25
Trend: Many Cores on Chip Simpler and lower power than a single large core Large scale parallelism on chip 26 IBM Cell BE 8+1 cores Intel Core i7 8 cores Tilera TILE Gx 100 cores, networked IBM POWER7 8 cores Intel SCC 48 cores, networked Nvidia Fermi 448 “cores” AMD Barcelona 4 cores Sun Niagara II 8 cores
Many Cores on Chip What we want: N times the system performance with N times the cores What do we get today? 27
(Un)expected Slowdowns in Multi-Core 28 Low priority High priority (Core 0) (Core 1) Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,” USENIX Security 2007.
29 Uncontrolled Interference: An Example [Figure: a multi-core chip in which CORE 1 (running stream) and CORE 2 (running random), each with a private L2 cache, share the interconnect, the DRAM memory controller, and DRAM Banks 0-3. The shared DRAM memory system is where the unfairness arises.]
A Memory Performance Hog 30

STREAM (streaming)
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = j*linesize;
  A[index] = B[index];
  …
}

RANDOM (random)
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = rand();
  A[index] = B[index];
  …
}

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
31 What Does the Memory Hog Do? [Figure: the DRAM bank with Row 0 open in the row buffer. The memory request buffer holds many T0 (STREAM) requests to Row 0 interleaved with T1 (RANDOM) requests to Rows 16, 111, and 5; row-hit-first scheduling keeps servicing T0's requests to the open row.] Row size: 8KB, cache block size: 64B → 128 (8KB/64B) requests of T0 serviced before T1. Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
Effect of the Memory Performance Hog [Figure: measured slowdown of each program when run together; the slowdown reaches 2.82X.] Results on Intel Pentium D running Windows XP (Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux) Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
Problems due to Uncontrolled Interference 33 Unfair slowdown of different threads Low system performance Vulnerability to denial of service Priority inversion: unable to enforce priorities/SLAs Cores make very slow progress [Figure: slowdowns of the high-priority thread vs. the low-priority memory performance hog] Main memory is the only shared resource
Problems due to Uncontrolled Interference 34 Unfair slowdown of different threads Low system performance Vulnerability to denial of service Priority inversion: unable to enforce priorities/SLAs Poor performance predictability (no performance isolation) Uncontrollable, unpredictable system
Recap: Inter-Thread Interference in Memory Memory controllers, pins, and memory banks are shared Pin bandwidth is not increasing as fast as number of cores → Bandwidth per core reducing Different threads executing on different cores interfere with each other in the main memory system Threads delay each other by causing resource contention: Bank, bus, row-buffer conflicts → reduced DRAM throughput Threads can also destroy each other’s DRAM bank parallelism Otherwise-parallel requests can become serialized 35
Effects of Inter-Thread Interference in DRAM Queueing/contention delays Bank conflict, bus conflict, channel conflict, … Additional delays due to DRAM constraints Called “protocol overhead” Examples Row conflicts Read-to-write and write-to-read delays Loss of intra-thread parallelism A thread’s concurrent requests are serviced serially instead of in parallel 36
Problem: QoS-Unaware Memory Control Existing DRAM controllers are unaware of inter-thread interference in DRAM system They simply aim to maximize DRAM throughput Thread-unaware and thread-unfair No intent to service each thread’s requests in parallel FR-FCFS policy: 1) row-hit first, 2) oldest first Unfairly prioritizes threads with high row-buffer locality Unfairly prioritizes threads that are memory intensive (many outstanding memory accesses) 37
Solution: QoS-Aware Memory Request Scheduling How to schedule requests to provide High system performance High fairness to applications Configurability to system software Memory controller needs to be aware of threads 38 Memory Controller Core Memory Resolves memory contention by scheduling requests
Stall-Time Fair Memory Scheduling Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," 40th International Symposium on Microarchitecture (MICRO), Chicago, IL, December 2007. Slides (ppt). STFM Micro 2007 Talk
The Problem: Unfairness 40 Unfair slowdown of different threads Low system performance Vulnerability to denial of service Priority inversion: unable to enforce priorities/SLAs Poor performance predictability (no performance isolation) Uncontrollable, unpredictable system
How Do We Solve the Problem? Stall-time fair memory scheduling [Mutlu+ MICRO’07] Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling Also improves overall system performance by ensuring cores make “proportional” progress Idea: Memory controller estimates each thread’s slowdown due to interference and schedules requests in a way to balance the slowdowns Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007.
42 Stall-Time Fairness in Shared DRAM Systems A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system DRAM-related stall-time: The time a thread spends waiting for DRAM memory ST_shared: DRAM-related stall-time when the thread runs with other threads ST_alone: DRAM-related stall-time when the thread runs alone Memory-slowdown = ST_shared / ST_alone (relative increase in stall-time) Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance Considers inherent DRAM performance of each thread Aims to allow proportional progress of threads
43 STFM Scheduling Algorithm [MICRO’07] For each thread, the DRAM controller Tracks ST_shared Estimates ST_alone Each cycle, the DRAM controller Computes Slowdown = ST_shared / ST_alone for threads with legal requests Computes unfairness = MAX Slowdown / MIN Slowdown If unfairness is below a threshold → Use DRAM-throughput-oriented scheduling policy If unfairness is at or above the threshold → Use fairness-oriented scheduling policy: (1) requests from thread with MAX Slowdown first, (2) row-hit first, (3) oldest-first
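A rough C sketch of the per-cycle policy switch described above. The structures, the alpha threshold parameter, and the return convention are assumptions for illustration; estimating ST_alone online, which is the hard part of STFM, is omitted.

#include <stdbool.h>

typedef struct {
    double st_shared;   /* DRAM-related stall time measured so far        */
    double st_alone;    /* estimated stall time if the thread ran alone   */
    bool   has_request; /* thread has a schedulable (legal) request       */
} thread_state_t;

/* Returns the id of the thread whose requests should be prioritized this
   cycle, or -1 to fall back to the throughput-oriented (FR-FCFS) policy.
   alpha is the unfairness threshold (a configuration parameter). */
int stfm_prioritize(const thread_state_t *t, int nthreads, double alpha)
{
    double max_sd = 0.0, min_sd = 1e30;
    int victim = -1;                      /* thread with MAX slowdown */
    for (int i = 0; i < nthreads; i++) {
        if (!t[i].has_request) continue;
        double sd = t[i].st_shared / t[i].st_alone;   /* memory slowdown */
        if (sd > max_sd) { max_sd = sd; victim = i; }
        if (sd < min_sd) { min_sd = sd; }
    }
    double unfairness = (min_sd > 0.0) ? max_sd / min_sd : 1.0;
    if (unfairness < alpha)
        return -1;    /* fair enough: keep throughput-oriented scheduling */
    return victim;    /* prioritize the most-slowed-down thread's requests */
}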
44 How Does STFM Prevent Unfairness? [Figure: STFM operating on the request buffer from the memory-hog example (T0 requests to Row 0; T1 requests to Rows 16, 111, 5). The controller tracks T0 and T1 slowdowns (initially 1.00) and the resulting unfairness, and prioritizes T1’s requests (Rows 16, 111) once unfairness grows too large.]
STFM Pros and Cons Upsides: First algorithm for fair multi-core memory scheduling Provides a mechanism to estimate memory slowdown of a thread Good at providing fairness Being fair can improve performance Downsides: Does not handle all types of interference (Somewhat) complex to implement Slowdown estimations can be incorrect 45
Parallelism-Aware Batch Scheduling Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems," 35th International Symposium on Computer Architecture (ISCA), pages 63-74, Beijing, China, June 2008. Slides (ppt).
Another Problem due to Interference Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests Memory-Level Parallelism (MLP) Out-of-order execution, non-blocking caches, runahead execution Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks Multiple threads share the DRAM controller DRAM controllers are not aware of a thread’s MLP Can service each thread’s outstanding requests serially, not in parallel 47
Bank Parallelism of a Thread 48 Single thread: Thread A issues 2 DRAM requests, to Bank 0, Row 1 and Bank 1, Row 1. The bank access latencies of the two requests are overlapped, so the thread stalls for ~ONE bank access latency. [Figure: single-thread timeline of compute, two requests serviced in parallel in Bank 0 and Bank 1, stall, compute.]
Bank Parallelism Interference in DRAM 49 Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1); Thread B issues 2 DRAM requests (Bank 1, Row 99 and Bank 0, Row 99). [Figure: with the baseline scheduler, the bank access latencies of each thread are serialized, so each thread stalls for ~TWO bank access latencies.]
Parallelism-Aware Scheduler 50 [Figure: the same two threads, each with 2 DRAM requests. Baseline scheduler: each thread’s accesses to Bank 0 and Bank 1 are serialized. Parallelism-aware scheduler: each thread’s two requests are serviced in parallel in different banks, saving cycles. Average stall-time: ~1.5 bank access latencies.]
Parallelism-Aware Batch Scheduling (PAR-BS) Principle 1: Parallelism-awareness Schedule requests from a thread (to different banks) back to back Preserves each thread’s bank parallelism But, this can cause starvation… Principle 2: Request Batching Group a fixed number of oldest requests from each thread into a “batch” Service the batch before all other requests Form a new batch when the current one is done Eliminates starvation, provides fairness Allows parallelism-awareness within a batch 51 [Figure: per-bank request queues (Bank 0, Bank 1) with the oldest requests of threads T0-T3 grouped into a batch.] Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.
PAR-BS Components Request batching Within-batch scheduling Parallelism aware 52
Request Batching Each memory request has a bit (marked) associated with it Batch formation: Mark up to Marking-Cap oldest requests per bank for each thread Marked requests constitute the batch Form a new batch when no marked requests are left Marked requests are prioritized over unmarked ones No reordering of requests across batches: no starvation, high fairness How to prioritize requests within a batch? 53
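A simplified C sketch of batch formation as described above. The structures, array sizes, and the assumption that queue entries are stored in arrival order are illustrative only: up to Marking-Cap of the oldest requests per bank, per thread, get their marked bit set, and a new batch is formed only when no marked requests remain.

#include <stdbool.h>

enum { NUM_BANKS = 8, NUM_THREADS = 4 };

typedef struct {
    int  bank, thread;
    bool valid, marked;
} request_t;

/* Mark up to marking_cap oldest requests per (bank, thread) pair.
   Assumes queue entries q[0..n-1] are stored oldest-first. */
void form_batch(request_t *q, int n, int marking_cap)
{
    int count[NUM_BANKS][NUM_THREADS] = {{0}};
    for (int i = 0; i < n; i++) {                 /* oldest first */
        if (!q[i].valid) continue;
        if (count[q[i].bank][q[i].thread] < marking_cap) {
            q[i].marked = true;
            count[q[i].bank][q[i].thread]++;
        }
    }
}

/* A new batch is formed only when no marked requests are left. */
bool batch_done(const request_t *q, int n)
{
    for (int i = 0; i < n; i++)
        if (q[i].valid && q[i].marked) return false;
    return true;
}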
Within-Batch Scheduling Can use any DRAM scheduling policy FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality But, we also want to preserve intra-thread bank parallelism Service each thread’s requests back to back Scheduler computes a ranking of threads when the batch is formed Higher-ranked threads are prioritized over lower-ranked ones Improves the likelihood that requests from a thread are serviced in parallel by different banks Different threads prioritized in the same order across ALL banks 54 HOW?
How to Rank Threads within a Batch Ranking scheme affects system throughput and fairness Maximize system throughput Minimize average stall-time of threads within the batch Minimize unfairness (Equalize the slowdown of threads) Service threads with inherently low stall-time early in the batch Insight: delaying memory non-intensive threads results in high slowdown Shortest stall-time first (shortest job first) ranking Provides optimal system throughput [Smith, 1956]* Controller estimates each thread’s stall-time within the batch Ranks threads with shorter stall-time higher 55 * W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.
Shortest Stall-Time First Ranking 56 Maximum number of marked requests to any bank (max-bank-load): Rank thread with lower max-bank-load higher (~ low stall-time) Total number of marked requests (total-load): Breaks ties: rank thread with lower total-load higher [Figure: per-bank marked request queues of threads T0-T3 across Banks 0-3.]
Thread | max-bank-load | total-load
T0 | 1 | 3
T1 | 2 | 4
T2 | 2 | 6
T3 | 5 | 9
Ranking: T0 > T1 > T2 > T3
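A sketch of how the ranking in the table above could be computed, with assumed data structures: count each thread's marked requests per bank, rank by max-bank-load, and break ties with total-load.

enum { NUM_BANKS = 4, NUM_THREADS = 4 };

typedef struct { int max_bank_load, total_load, id; } thread_load_t;

/* marked[t][b] = number of marked requests of thread t to bank b. */
void compute_loads(const int marked[NUM_THREADS][NUM_BANKS], thread_load_t *out)
{
    for (int t = 0; t < NUM_THREADS; t++) {
        out[t].id = t;
        out[t].max_bank_load = 0;
        out[t].total_load = 0;
        for (int b = 0; b < NUM_BANKS; b++) {
            out[t].total_load += marked[t][b];
            if (marked[t][b] > out[t].max_bank_load)
                out[t].max_bank_load = marked[t][b];
        }
    }
}

/* qsort-style comparator: lower max-bank-load ranks higher;
   lower total-load breaks ties. */
int rank_cmp(const void *a, const void *b)
{
    const thread_load_t *x = a, *y = b;
    if (x->max_bank_load != y->max_bank_load)
        return x->max_bank_load - y->max_bank_load;
    return x->total_load - y->total_load;
}

Sorting the thread_load_t array with this comparator (e.g., via qsort) reproduces the example ranking T0 > T1 > T2 > T3.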
Example Within-Batch Scheduling Order 57 [Figure: the same per-bank queues serviced in arrival order (baseline) vs. in PAR-BS rank order using Ranking: T0 > T1 > T2 > T3. Baseline average stall time: 5 bank access latencies. PAR-BS average stall time: 3.5 bank access latencies.]
Putting It Together: PAR-BS Scheduling Policy PAR-BS Scheduling Policy (1) Marked requests first (batching) (2) Row-hit requests first (3) Higher-rank thread first (shortest stall-time first) (4) Oldest first (rules 2-4: parallelism-aware within-batch scheduling) Three properties: Exploits row-buffer locality and intra-thread bank parallelism Work-conserving: does not waste bandwidth when it can be used Services unmarked requests to banks without marked requests Marking-Cap is important Too small cap: destroys row-buffer locality Too large cap: penalizes memory non-intensive threads Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.
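Putting the four rules together, a sketch of a comparator deciding which of two requests PAR-BS prefers. The request fields are assumed; thread_rank would come from the shortest-stall-time-first ranking above.

#include <stdbool.h>

typedef struct {
    bool marked;                 /* belongs to the current batch          */
    bool row_hit;                /* targets the currently open row        */
    int  thread_rank;            /* smaller = higher-ranked thread        */
    unsigned long arrival_time;  /* smaller = older                       */
} request_t;

/* Returns true if request a should be scheduled before request b. */
bool parbs_prefers(const request_t *a, const request_t *b)
{
    if (a->marked  != b->marked)  return a->marked;        /* (1) batch first  */
    if (a->row_hit != b->row_hit) return a->row_hit;       /* (2) row hits     */
    if (a->thread_rank != b->thread_rank)                  /* (3) thread rank  */
        return a->thread_rank < b->thread_rank;
    return a->arrival_time < b->arrival_time;              /* (4) oldest first */
}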
Hardware Cost <1.5KB storage cost for 8-core system with 128-entry memory request buffer No complex operations (e.g., divisions) Not on the critical path Scheduler makes a decision only every DRAM cycle 59
60 Unfairness on 4-, 8-, 16-core Systems Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]
61 System Performance
PAR-BS Pros and Cons Upsides: First scheduler to address bank parallelism destruction across multiple threads Simple mechanism (vs. STFM) Batching provides fairness Ranking enables parallelism awareness Downsides: Does not always prioritize the latency-sensitive applications 62
TCM: Thread Cluster Memory Scheduling Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010. Slides (pptx) (pdf). TCM Micro 2010 Talk
Throughput vs. Fairness 64 [Figure: fairness vs. system throughput of prior schedulers on 24 cores, 4 memory controllers, 96 workloads; some schedulers have a system throughput bias, others a fairness bias, and none is near the ideal.] No previous memory scheduling algorithm provides both the best fairness and system throughput.
Throughput vs. Fairness 65 Throughput-biased approach: Prioritize less memory-intensive threads (less memory intensive → higher priority). Good for throughput, but the non-prioritized (intensive) threads can starve → unfairness. Fairness-biased approach: Take turns accessing memory. Does not starve any thread, but memory-non-intensive threads are not prioritized → reduced throughput. A single policy for all threads is insufficient.
Achieving the Best of Both Worlds 66 For Throughput: Prioritize memory-non-intensive threads. For Fairness: Unfairness is caused by memory-intensive threads being prioritized over each other → shuffle thread ranking. Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically.
Thread Cluster Memory Scheduling [Kim+ MICRO’10] 1. Group threads into two clusters 2. Prioritize non-intensive cluster 3. Different policies for each cluster 67 [Figure: threads in the system are split into a memory-non-intensive cluster (prioritized → throughput) and a memory-intensive cluster (→ fairness).]
Clustering Threads 68 Step 1: Sort threads by MPKI (misses per kilo-instruction). Step 2: Memory bandwidth usage αT divides the clusters, where T = total memory bandwidth usage and α = ClusterThreshold (e.g., α < 10%): the least intensive threads whose cumulative bandwidth usage fits within αT form the non-intensive cluster; the remaining threads form the intensive cluster.
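A sketch of one way this clustering step could be implemented; the inputs and the exact threshold test are assumptions. Threads are sorted by MPKI from least to most intensive, and threads are assigned to the non-intensive cluster until their combined bandwidth usage exceeds ClusterThreshold × total bandwidth usage.

#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    int    id;
    double mpki;        /* misses per kilo-instruction                   */
    double bw_usage;    /* measured memory bandwidth usage this quantum  */
    bool   intensive;   /* output: cluster assignment                    */
} thread_info_t;

static int by_mpki(const void *a, const void *b)
{
    double d = ((const thread_info_t *)a)->mpki - ((const thread_info_t *)b)->mpki;
    return (d > 0) - (d < 0);
}

/* cluster_threshold is the fraction of total bandwidth usage (e.g., 0.10)
   that the non-intensive cluster is allowed to account for. */
void cluster_threads(thread_info_t *t, int n, double cluster_threshold)
{
    double total = 0.0, used = 0.0;
    for (int i = 0; i < n; i++) total += t[i].bw_usage;

    qsort(t, n, sizeof(t[0]), by_mpki);           /* least intensive first */
    for (int i = 0; i < n; i++) {
        used += t[i].bw_usage;
        /* A thread stays in the non-intensive cluster while the running
           bandwidth sum remains within cluster_threshold * total. */
        t[i].intensive = (used > cluster_threshold * total);
    }
}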
TCM: Quantum-Based Operation 69 During a quantum (~1M cycles): Monitor thread behavior 1. Memory intensity 2. Bank-level parallelism 3. Row-buffer locality At the beginning of each quantum: Perform clustering; Compute niceness of intensive threads. [Figure: timeline of previous and current quanta (~1M cycles each), with shuffle intervals (~1K cycles) within a quantum.]
TCM: Scheduling Algorithm 1. Highest-rank: Requests from higher-ranked threads prioritized Non-intensive cluster > Intensive cluster Non-intensive cluster: lower intensity → higher rank Intensive cluster: rank shuffling 2. Row-hit: Row-buffer hit requests are prioritized 3. Oldest: Older requests are prioritized 70
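A sketch of TCM's prioritization order, analogous to the PAR-BS comparator earlier; the request fields are assumptions. Here thread_rank encodes both the cluster and the within-cluster rank, with every non-intensive thread ranked above every intensive one, and intensive threads' ranks periodically shuffled.

#include <stdbool.h>

typedef struct {
    int  thread_rank;            /* smaller = higher rank; all non-intensive
                                    threads rank above all intensive ones  */
    bool row_hit;                /* targets the currently open row         */
    unsigned long arrival_time;  /* smaller = older                        */
} request_t;

/* Returns true if request a should be scheduled before request b. */
bool tcm_prefers(const request_t *a, const request_t *b)
{
    if (a->thread_rank != b->thread_rank)                 /* 1. highest rank */
        return a->thread_rank < b->thread_rank;
    if (a->row_hit != b->row_hit) return a->row_hit;      /* 2. row hit      */
    return a->arrival_time < b->arrival_time;             /* 3. oldest       */
}

Note that, unlike PAR-BS, thread rank here is applied before the row-hit rule, so cluster membership dominates row-buffer locality.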
TCM: Throughput and Fairness 71 [Figure: fairness vs. system throughput, 24 cores, 4 memory controllers, 96 workloads.] TCM, a heterogeneous scheduling policy, provides the best fairness and system throughput.
TCM: Fairness-Throughput Tradeoff 72 When the configuration parameter is varied… [Figure: fairness vs. system throughput curves for FRFCFS, STFM, PAR-BS, ATLAS, and TCM as each scheduler's parameter is swept.] Adjusting ClusterThreshold gives TCM a robust fairness-throughput tradeoff.
TCM Pros and Cons Upsides: Provides both high fairness and high performance Caters to the needs for different types of threads (latency vs. bandwidth sensitive) (Relatively) simple Downsides: Scalability to large buffer sizes? Robustness of clustering and shuffling algorithms? 73