Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research


2 Multi-Core Systems
[Diagram: a multi-core chip with CORE 0 through CORE 3, each with a private L2 cache, all sharing a DRAM memory controller and DRAM Banks 0 through 7. The shared DRAM memory system is where unfairness arises.]

3 DRAM Bank Operation
[Animation: a DRAM bank organized as rows and columns, with a row decoder, a column decoder, and a row buffer. An access to (Row 0, Column 0) loads Row 0 into the empty row buffer; subsequent accesses to (Row 0, Column 1) and (Row 0, Column 9) are row-buffer HITs served directly from the buffer; an access to (Row 1, Column 0) is a row CONFLICT and must first load Row 1 into the row buffer.]
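
To make the hit/conflict cost difference concrete, here is a minimal single-bank sketch in Python (not from the talk; the latency constants are the 140/280-cycle values quoted later on the evaluation slide, and a closed-row access is lumped in with conflicts for simplicity):

    # Minimal single-bank row-buffer model (illustrative only).
    ROW_HIT_CYCLES = 140       # row-hit round-trip latency (evaluation slide)
    ROW_CONFLICT_CYCLES = 280  # row-conflict round-trip latency

    class Bank:
        def __init__(self):
            self.open_row = None  # row currently held in the row buffer

        def access(self, row):
            if self.open_row == row:   # row hit: serve from the row buffer
                return ROW_HIT_CYCLES
            self.open_row = row        # conflict: open the requested row
            return ROW_CONFLICT_CYCLES

    bank = Bank()
    print(bank.access(0))  # 280: buffer starts empty, Row 0 is opened
    print(bank.access(0))  # 140: row hit
    print(bank.access(1))  # 280: row conflict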

4 DRAM Controllers
- A row-conflict memory access takes significantly longer than a row-hit access
- Current controllers take advantage of the row buffer
- Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA'00]:
  (1) Row-hit (column) first: service row-hit memory accesses first
  (2) Oldest-first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput, but it is unfair when multiple threads share the DRAM system
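
As an illustration, FR-FCFS amounts to a two-key sort over the pending requests. This is a minimal sketch, not the talk's hardware; the Request fields are assumptions:

    # FR-FCFS sketch: (1) row-hit first, (2) oldest first.
    from dataclasses import dataclass

    @dataclass
    class Request:
        thread_id: int
        row: int
        arrival_time: int

    def fr_fcfs(pending, open_row):
        # A key of (miss?, age) serves hits before misses, oldest first.
        return min(pending, key=lambda r: (r.row != open_row, r.arrival_time))

    reqs = [Request(thread_id=0, row=5, arrival_time=0),   # older, row miss
            Request(thread_id=1, row=7, arrival_time=1)]   # younger, row hit
    print(fr_fcfs(reqs, open_row=7).thread_id)  # 1: the hit bypasses the older miss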

5 Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
  - Fairness definition
  - Algorithm
  - Implementation
  - System software support
- Experimental Evaluation
- Conclusions

6 The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM throughput
- DRAM scheduling policies are thread-unaware and unfair
  - Row-hit first: unfairly prioritizes threads with high row-buffer locality (streaming threads that keep accessing the same row)
  - Oldest-first: unfairly prioritizes memory-intensive threads

7 The Problem
[Animation: the request buffer holds T0's requests, all to Row 0, interleaved with T1's requests to Rows 5, 16, and 111. T0 is a streaming thread, T1 a non-streaming thread. With an 8KB row size and a 64B cache block size, row-hit-first scheduling services 128 requests of T0 before any request of T1.]

8 Consequences of Unfairness in DRAM
- Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security'07] (possible even when DRAM is the only shared resource)
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability

10 Fairness in Shared DRAM Systems
- A thread's DRAM performance depends on its inherent
  - Row-buffer locality
  - Bank parallelism
- Interference between threads can destroy either or both
- A fair DRAM scheduler should take into account all factors affecting each thread's DRAM performance
  - Not solely bandwidth or solely request latency
- Observation: a thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads

11 Stall-Time Fairness in Shared DRAM Systems
- A DRAM system is fair if it slows down equal-priority threads equally
  - Compared to when each thread is run alone on the same system
  - Fairness notion similar to SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, Micro'06], and shared caches [Kim, PACT'04]
- T_shared: DRAM-related stall-time when the thread is running with other threads
- T_alone: DRAM-related stall-time when the thread is running alone
- Memory-slowdown = T_shared / T_alone
- The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance
  - Considers the inherent DRAM performance of each thread
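
A quick worked example of these definitions (the stall-time numbers are invented for illustration):

    # Memory-slowdown = T_shared / T_alone (slide 11);
    # unfairness = MAX slowdown / MIN slowdown (slide 13).
    def memory_slowdown(t_shared, t_alone):
        return t_shared / t_alone

    slowdown_a = memory_slowdown(500_000, 200_000)  # 2.5
    slowdown_b = memory_slowdown(150_000, 100_000)  # 1.5
    unfairness = max(slowdown_a, slowdown_b) / min(slowdown_a, slowdown_b)
    print(unfairness)  # ~1.67: thread A is slowed down much more than B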

13 STFM Scheduling Algorithm (1)
- During each time interval, for each thread, the DRAM controller
  - Tracks T_shared
  - Estimates T_alone
- At the beginning of a scheduling cycle, the DRAM controller
  - Computes Slowdown = T_shared / T_alone for each thread with an outstanding legal request
  - Computes unfairness = MAX Slowdown / MIN Slowdown
- If unfairness < α:
  - Use the DRAM-throughput-oriented baseline scheduling policy: (1) row-hit first, (2) oldest-first

14 STFM Scheduling Algorithm (2)
- If unfairness ≥ α:
  - Use the fairness-oriented scheduling policy: (1) requests from the thread with MAX Slowdown first, (2) row-hit first, (3) oldest-first
- Maximizes DRAM throughput if it cannot improve fairness
- Does NOT waste useful bandwidth to improve fairness
  - If a request does not interfere with any other, it is scheduled
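
Slides 13 and 14 combine into a single per-cycle decision, sketched below. This reuses the Request fields from the FR-FCFS sketch; alpha and the slowdown bookkeeping are assumed inputs, and the "schedule non-interfering requests anyway" refinement is omitted:

    # STFM decision sketch: fall back to FR-FCFS while unfairness < alpha,
    # otherwise prioritize the most-slowed-down thread.
    def stfm_pick(pending, open_row, slowdown, alpha):
        threads = {r.thread_id for r in pending}
        unfairness = (max(slowdown[t] for t in threads) /
                      min(slowdown[t] for t in threads))
        if unfairness < alpha:
            # Baseline: row-hit first, then oldest first.
            key = lambda r: (r.row != open_row, r.arrival_time)
        else:
            # Fairness mode: MAX-slowdown thread first, then row-hit, then age.
            victim = max(threads, key=lambda t: slowdown[t])
            key = lambda r: (r.thread_id != victim,
                             r.row != open_row,
                             r.arrival_time)
        return min(pending, key=key)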

15 How Does STFM Prevent Unfairness?
[Animation: the slide-7 example replayed under STFM. T0's and T1's slowdown estimates start at 1.00 and are updated as requests are serviced; once the unfairness between them reaches α, STFM prioritizes T1's requests to Rows 16 and 111 over T0's row hits to Row 0.]

17 Implementation
- Tracking T_shared
  - Relatively easy
  - The processor increments a counter if the thread cannot commit instructions because its oldest instruction requires DRAM access
- Estimating T_alone
  - More involved, because the thread is not running alone
  - Difficult to estimate directly
  - Observation: T_alone = T_shared - T_interference
  - So instead estimate T_interference: the extra stall-time due to interference

18 Estimating T_interference (1)
- When a DRAM request from thread C is scheduled
  - Thread C can incur extra stall time: the request's row-buffer hit status might be affected by interference
  - Estimate the row that would have been in the row buffer if the thread were running alone
  - Estimate the extra bank access latency the request incurs:

    T_interference(C) += ExtraBankAccessLatency / (# banks servicing C's requests)

  - The extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism)

19 Estimating T_interference (2)
- When a DRAM request from thread C is scheduled
  - Any other thread C' with outstanding requests incurs extra stall time
  - Interference in the DRAM data bus:

    T_interference(C') += BusTransferLatencyOfScheduledRequest

  - Interference in the DRAM bank (see paper):

    T_interference(C') += BankAccessLatencyOfScheduledRequest / (# banks needed by C' * K)
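
The bookkeeping on slides 17 through 19 can be sketched as follows (illustrative only: the latency estimates are assumed inputs the controller forms at scheduling time, and K is the scaling constant from the slide):

    # T_interference updates applied when thread c's request is scheduled.
    # T_alone is then estimated as T_shared - T_interference (slide 17).
    def on_schedule(c, banks_servicing_c, other_threads, t_interference,
                    extra_bank_latency, bus_latency, bank_latency, K):
        # Slide 18: thread c's own extra latency, amortized across the banks
        # currently servicing its requests (its memory-level parallelism).
        t_interference[c] += extra_bank_latency / banks_servicing_c
        # Slide 19: every other thread with outstanding requests stalls extra.
        for c2, banks_needed in other_threads.items():  # {thread: banks needed}
            t_interference[c2] += bus_latency           # data bus interference
            t_interference[c2] += bank_latency / (banks_needed * K)  # bank interference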

20 Hardware Cost
- <2KB storage cost for an 8-core system with a 128-entry memory request buffer
- Arithmetic operations are approximated
  - Fixed-point arithmetic
  - Divisions using lookup tables
- Not on the critical path
  - The scheduler makes a decision only every DRAM cycle
- More details in the paper
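
To illustrate the lookup-table division (the slide does not specify the table organization, so this 8-bit reciprocal table is a guess at the standard trick):

    # Approximate division via a reciprocal lookup table, as hardware might.
    RECIP_BITS = 8
    RECIP = [0] + [round((1 << RECIP_BITS) / d) for d in range(1, 256)]

    def approx_div(num, den):
        # Assumes 0 < den < 256; a real design would normalize wider values.
        return (num * RECIP[den]) >> RECIP_BITS  # fixed-point multiply + shift

    print(approx_div(300, 100))  # 3 (exact: 3.0); precision is coarse but
                                 # adequate for comparing slowdowns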

22 Support for System Software
- Supporting system-level thread weights/priorities
  - Thread weights are communicated to the memory controller
  - Larger-weight threads should be slowed down less
  - Each thread's slowdown is scaled by its weight, and the weighted slowdown is used for scheduling, which favors threads with larger weights
  - The OS can choose thread weights to satisfy QoS requirements
- α: the maximum tolerable unfairness, set by system software
  - Don't need fairness? Set α large.
  - Need strict fairness? Set α close to 1.
  - Other values of α trade off fairness and throughput
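
For instance, the weight scaling could look like the sketch below. This is a guess at the mechanism; the paper's exact scaling function may differ:

    # Hypothetical weighted slowdown: inflating a high-weight thread's
    # slowdown makes the scheduler treat it as more slowed down, so it is
    # prioritized sooner and therefore slowed down less.
    def weighted_slowdown(t_shared, t_alone, weight):
        return (t_shared / t_alone) * weight

    # Equal measured slowdowns, different weights:
    print(weighted_slowdown(300, 100, weight=2))  # 6.0 -> prioritized first
    print(weighted_slowdown(300, 100, weight=1))  # 3.0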

24 Evaluation Methodology
- 2-, 4-, 8-, and 16-core systems
  - x86 processor model based on Intel Pentium M
  - 4 GHz processor, 128-entry instruction window
  - 512 KB private L2 cache per core
- Detailed DRAM model based on Micron DDR2-800
  - 128-entry memory request buffer
  - 8 banks, 2 KB row buffer
  - Row-hit round-trip latency: 35 ns (140 cycles)
  - Row-conflict round-trip latency: 70 ns (280 cycles)
- Benchmarks
  - SPEC CPU2006 and some Windows desktop applications
  - 256, 32, and 3 benchmark combinations for the 4-, 8-, and 16-core experiments

25 Comparison with Related Work
- Baseline FR-FCFS [Rixner et al., ISCA'00]
  - Unfairly penalizes non-intensive threads with low row-buffer locality
- FCFS
  - Low DRAM throughput
  - Unfairly penalizes non-intensive threads
- FR-FCFS+Cap
  - Static cap on how many younger row-hits can bypass older accesses
  - Unfairly penalizes non-intensive threads
- Network Fair Queueing (NFQ) [Nesbit et al., Micro'06]
  - Per-thread virtual-time based scheduling: a thread's private virtual time increases when its request is scheduled, and requests from the thread with the earliest virtual time are prioritized
  - Equalizes bandwidth across equal-priority threads
  - Does not consider the inherent performance of each thread
  - Unfairly prioritizes threads with bursty access patterns (idleness problem, illustrated below)
  - Unfairly penalizes threads with unbalanced bank usage (in paper)
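
The idleness problem can be seen with a toy virtual-time scheduler (a simplification for illustration, not Nesbit et al.'s actual algorithm):

    # Toy fair queueing: always service the thread with the smallest
    # virtual time, charging one unit per serviced request.
    def nfq_pick(active, vtime):
        t = min(active, key=lambda th: vtime[th])
        vtime[t] += 1
        return t

    # Thread 1 used DRAM alone for a while, so its virtual time grew;
    # threads 2-4 were idle and kept vtime 0, and now they starve thread 1.
    vtime = {1: 100, 2: 0, 3: 0, 4: 0}
    print([nfq_pick([1, 2, 3, 4], vtime) for _ in range(6)])  # [2, 3, 4, 2, 3, 4]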

26 Idleness/Burstiness Problem in Fair Queueing
[Animation: a service timeline. Thread 1's virtual time increases even though no other thread needs DRAM. Then only Thread 2 is serviced in interval [t1,t2], only Thread 3 in [t2,t3], and only Thread 4 in [t3,t4], since each has a smaller virtual time than Thread 1. The non-bursty thread suffers a large performance loss even though it fairly utilized DRAM when no other thread needed it.]

27 Unfairness on 4-, 8-, 16-core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown
[Chart: unfairness of the compared schedulers on the three system sizes; the annotations read 1.27X, 1.81X, and 1.26X.]

28 System Performance
[Chart: system performance of the compared schedulers; the annotations read 5.8%, 4.1%, and 4.6%.]

29 Hmean-speedup (Throughput-Fairness Balance)
[Chart: harmonic-mean speedup of the compared schedulers; the annotations read 10.8%, 9.5%, and 11.2%.]

31 Conclusions
- A new definition of DRAM fairness: stall-time fairness
  - Equal-priority threads should experience equal memory-related slowdowns
  - Takes into account the inherent memory performance of threads
- A new DRAM scheduling algorithm enforces this definition
  - Flexible and configurable fairness substrate
  - Supports system-level thread priorities/weights and QoS policies
- Results across a wide range of workloads and systems show:
  - Improving DRAM fairness also improves system throughput
  - STFM provides better fairness and system performance than previously proposed DRAM schedulers

Thank you. Questions?

Backup

35 Structure of the STFM Controller

36 Comparison using NFQ QoS Metrics
- Nesbit et al. [MICRO'06] proposed the following target for quality of service:
  - A thread that is allocated 1/N-th of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/N-th of the frequency of the shared physical memory system
  - Baseline: memory bandwidth scaled down by N
- We compared different DRAM schedulers' effectiveness using this metric
  - Number of violations of the above QoS target
  - Harmonic mean of IPC normalized to the above baseline

37 Violations of the NFQ QoS Target

38 Hmean Normalized IPC using NFQ Baseline
[Chart: harmonic mean of IPC normalized to the NFQ baseline; the annotations read 10.3%, 9.1%, 7.8%, 7.3%, 5.9%, and 5.1%.]

39 Shortcomings of the NFQ QoS Target
- Low baseline (easily achievable target) for equal-priority threads
  - With N equal-priority threads, a thread only needs to do better than on a system with 1/N-th of the memory bandwidth
  - This target is usually very easy to achieve, especially when N is large
- Unachievable target in some cases
  - Consider two threads always accessing the same bank in an interleaved fashion: too much interference
- Baseline performance is very difficult to determine in a real system
  - Memory frequency cannot be scaled arbitrarily
  - Not knowing the baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)

40 A Case Study
[Charts: unfairness and per-thread memory slowdown for the case-study workload.]

41 Windows Desktop Workloads

42 Enforcing Thread Weights

43 Effect of α

44 Effect of Banks and Row Buffer Size