Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter.

Similar presentations
Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi * Chang Joo Lee * + Onur Mutlu Yale N. Patt * * HPS Research Group The.

Improving DRAM Performance by Parallelizing Refreshes with Accesses
Application-to-Core Mapping Policies to Reduce Memory System Interference Reetuparna Das * Rachata Ausavarungnirun $ Onur Mutlu $ Akhilesh Kumar § Mani.
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. Zhichun Zhu, Zhao Zhang. ECE Department, Univ. Illinois at Chicago / Iowa State.
Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.
Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Hikmet Aras
Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
Aérgia: Exploiting Packet Latency Slack in On-Chip Networks
Some Opportunities and Obstacles in Cross-Layer and Cross-Component (Power) Management Onur Mutlu NSF CPOM Workshop, 2/10/2012.
Understanding a Problem in Multicore and How to Solve It
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon*, Naveen Muralimanohar ǂ, Justin Meza*, Onur Mutlu*,
OWL: Cooperative Thread Array (CTA) Aware Scheduling Techniques for Improving GPGPU Performance Adwait Jog, Onur Kayiran, Nachiappan CN, Asit Mishra, Mahmut.
Improving Cache Performance by Exploiting Read-Write Disparity
Multi-Core Systems: CORE 0, CORE 1, CORE 2, CORE 3; L2 caches; DRAM memory controller; DRAM Bank 0, DRAM Bank 1, DRAM Bank 2, DRAM Bank.
MICRO-47, December 15, 2014 FIRM: Fair and HIgh-PerfoRmance Memory Control for Persistent Memory Systems Jishen Zhao Onur Mutlu Yuan Xie.
Lecture 13: DRAM Innovations. Today: energy efficiency, row buffer management, scheduling.
Parallel Application Memory Scheduling Eiman Ebrahimi * Rustam Miftakhutdinov *, Chris Fallin ‡ Chang Joo Lee * +, Jose Joao * Onur Mutlu ‡, Yale N. Patt.
Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks Vivek Seshadri Samihan Yedkar ∙ Hongyi Xin ∙ Onur Mutlu Phillip.
1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.
By Jaideep Moses, Ravi Iyer, Ramesh Illikkal and
Memory Systems in the Many-Core Era: Some Challenges and Solution Directions Onur Mutlu June 5, 2011 ISMM/MSPC.
Veynu Narasiman The University of Texas at Austin Michael Shebanow
Scalable Many-Core Memory Systems Lecture 4, Topic 3: Memory Interference and QoS-Aware Memory Systems Prof. Onur Mutlu
Lecture 4: Memory: HMC, Scheduling. Topics: BOOM, memory blades, HMC, scheduling policies.
1 Application Aware Prioritization Mechanisms for On-Chip Networks Reetuparna Das Onur Mutlu † Thomas Moscibroda ‡ Chita Das § Reetuparna Das § Onur Mutlu.
Stall-Time Fair Memory Access Scheduling Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research.
HIgh Performance Computing & Systems LAB Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Rachata Ausavarungnirun,
LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement Gwangsun Kim Computer Science Department Korea Advanced Institute of.
Page Overlays An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management Vivek Seshadri Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu,
Timing Channel Protection for a Shared Memory Controller Yao Wang, Andrew Ferraiuolo, G. Edward Suh Feb 17 th 2014.
Embedded System Lab. Kilmo Choi (최길모). A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore.
Moinuddin K. Qureshi, Univ of Texas at Austin, MICRO. 12.05, PAK, EUNJI.
Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.
Department of Computer Science and Engineering The Pennsylvania State University Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das Addressing.
A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.
Improving Cache Performance by Exploiting Read-Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez.
Scalable Many-Core Memory Systems Topic 3: Memory Interference and QoS-Aware Memory Systems Prof. Onur Mutlu
MIAO ZHOU, YU DU, BRUCE CHILDERS, RAMI MELHEM, DANIEL MOSSÉ UNIVERSITY OF PITTSBURGH Writeback-Aware Bandwidth Partitioning for Multi-core Systems with.
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research.
The Evicted-Address Filter
Computer Architecture: Memory Interference and QoS (Part I) Prof. Onur Mutlu Carnegie Mellon University.
Lecture 5: Scheduling and Reliability. Topics: scheduling policies, handling DRAM errors.
Lecture 3: Memory Buffers and Scheduling. Topics: buffers (FB-DIMM, RDIMM, LRDIMM, BoB, BOOM), memory blades, scheduling policies.
Achieving High Performance and Fairness at Low Cost. Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu. The Blacklisting Memory.
Sunpyo Hong, Hyesoon Kim
Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur.
HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.
15-740/ Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University.
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1.
Effect of Instruction Fetch and Memory Scheduling on GPU Performance Nagesh B Lakshminarayana, Hyesoon Kim.
18-740/640 Computer Architecture Lecture 15: Memory Resource Management II Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 11/2/2015.
Lecture 4: Memory Scheduling, Refresh. Topics: scheduling policies, refresh basics.
Priority Based Fair Scheduling: A Memory Scheduler Design for Chip-Multiprocessor Systems Tsinghua University Tsinghua National Laboratory for Information.
Computer Architecture Lecture 22: Memory Controllers Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 3/25/2015.
Reducing Memory Interference in Multicore Systems
Prof. Onur Mutlu Carnegie Mellon University 11/12/2012
A Staged Memory Resource Management Method for CMP Systems
Zhichun Zhu Zhao Zhang ECE Department ECE Department
Prof. Onur Mutlu Carnegie Mellon University
Rachata Ausavarungnirun, Kevin Chang
Application Slowdown Model
Thesis Defense Lavanya Subramanian
Achieving High Performance and Fairness at Low Cost
Prof. Onur Mutlu Carnegie Mellon University 11/14/2012
Prof. Onur Mutlu Carnegie Mellon University 9/26/2012
Presented by Florian Ettinger
Presentation transcript:

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Motivation. Memory is a shared resource, and threads' requests contend for it. This contention degrades single-thread performance and can even lead to starvation. How do we schedule memory requests to increase both system throughput and fairness?

Previous Scheduling Algorithms are Biased. No previous memory scheduling algorithm provides both the best fairness and the best system throughput: existing schedulers lean either toward a system-throughput bias or toward a fairness bias, away from the ideal point. (Plot: better fairness vs. better system throughput.)

Why do Previous Algorithms Fail? Throughput-biased approach: prioritize less memory-intensive threads. Good for throughput, but memory-intensive threads can starve, leading to unfairness. Fairness-biased approach: threads take turns accessing memory. No thread starves, but less memory-intensive threads are not prioritized, which reduces throughput. A single policy for all threads is insufficient.

Insight: Achieving the Best of Both Worlds. For throughput: prioritize memory-non-intensive threads. For fairness: unfairness is caused by memory-intensive threads being prioritized over each other, so shuffle their priorities; and since memory-intensive threads differ in their vulnerability to interference, shuffle asymmetrically.

Outline: Motivation & Insights; Overview; Algorithm; Bringing it All Together; Evaluation; Conclusion

Overview: Thread Cluster Memory Scheduling. 1. Group the threads in the system into two clusters: memory-non-intensive threads form the non-intensive cluster, memory-intensive threads form the intensive cluster. 2. Prioritize the non-intensive cluster. 3. Use a different policy for each cluster: throughput for the non-intensive cluster, fairness for the intensive cluster.

Outline: Motivation & Insights; Overview; Algorithm; Bringing it All Together; Evaluation; Conclusion

TCM Outline: 1. Clustering

Clustering Threads. Step 1: Sort threads by MPKI (misses per kilo-instruction); higher MPKI means more memory-intensive. Step 2: Let T be the total memory bandwidth usage and α the ClusterThreshold (α < 10%); the bandwidth amount αT divides the clusters: starting from the least intensive thread, the threads whose combined bandwidth usage fits within αT form the non-intensive cluster, and the remaining threads form the intensive cluster.
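A minimal sketch of this clustering step, assuming per-thread MPKI and bandwidth-usage counters from the previous quantum are available; the attribute and function names here are illustrative, not the hardware's actual structures.

```python
def cluster_threads(threads, alpha=0.05):
    """Split threads into (non_intensive, intensive) clusters.

    `threads` is a list of objects with .mpki and .bw_usage attributes
    (bandwidth used in the previous quantum). A sketch, not the exact
    hardware implementation.
    """
    total_bw = sum(t.bw_usage for t in threads)
    budget = alpha * total_bw             # alpha * T divides the clusters

    non_intensive, intensive, used = [], [], 0.0
    # Walk from least to most memory-intensive thread (lowest MPKI first).
    for t in sorted(threads, key=lambda t: t.mpki):
        if not intensive and used + t.bw_usage <= budget:
            non_intensive.append(t)       # still fits within the alpha*T budget
            used += t.bw_usage
        else:
            intensive.append(t)           # everything past the budget is intensive
    return non_intensive, intensive
```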

TCM Outline: 1. Clustering 2. Between Clusters

Prioritization Between Clusters. Prioritize the non-intensive cluster over the intensive cluster. This increases system throughput, since non-intensive threads have greater potential for making progress, and it does not degrade fairness, since non-intensive threads are "light" and rarely interfere with intensive threads.

TCM Outline: 1. Clustering 2. Between Clusters 3. Non-Intensive Cluster (Throughput)

Non-Intensive Cluster. Prioritize threads according to MPKI: the lowest-MPKI thread gets the highest priority. This increases system throughput, since the least intensive thread has the greatest potential for making progress in the processor.

TCM Outline: 1. Clustering 2. Between Clusters 3. Non-Intensive Cluster (Throughput) 4. Intensive Cluster (Fairness)

Intensive Cluster. Periodically shuffle the priority of threads, which increases fairness. But is treating all threads equally good enough? Equal turns ≠ same slowdown.

Case Study: A Tale of Two Threads. Two intensive threads contend: (1) a random-access thread and (2) a streaming thread. Which is slowed down more easily? Prioritizing the random-access thread slows the streaming thread by about 7x (the prioritized thread stays near 1x); prioritizing the streaming thread slows the random-access thread by about 11x. The random-access thread is more easily slowed down.

Why are Threads Different? The random-access thread issues its requests to different banks and rows in parallel (high bank-level parallelism); if one request gets stuck behind another thread's activated row, its parallelism is broken, so it is vulnerable to interference. The streaming thread sends all of its requests to the same row (high row-buffer locality), so its requests keep hitting the activated row and it is far less vulnerable.

TCM Outline: 1. Clustering 2. Between Clusters 3. Non-Intensive Cluster (Throughput) 4. Intensive Cluster (Fairness)

Niceness. How can we quantify the difference between threads? Bank-level parallelism reflects a thread's vulnerability to interference and raises its niceness; row-buffer locality reflects how much interference a thread causes and lowers its niceness. High bank-level parallelism and low row-buffer locality mean a nicer thread.
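A sketch of one way to turn these two signals into a single niceness score, by ranking intensive threads by bank-level parallelism (BLP) and by row-buffer locality (RBL) and taking the difference; the paper's exact formula may differ, and the .blp/.rbl attribute names are illustrative.

```python
def compute_niceness(intensive):
    """Assign each intensive-cluster thread a niceness score.

    Higher bank-level parallelism (blp) -> more vulnerable -> nicer.
    Higher row-buffer locality (rbl)   -> causes more interference -> less nice.
    Niceness here is the thread's rank by BLP minus its rank by RBL.
    """
    by_blp = sorted(intensive, key=lambda t: t.blp)
    by_rbl = sorted(intensive, key=lambda t: t.rbl)
    for t in intensive:
        t.niceness = by_blp.index(t) - by_rbl.index(t)
```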

Shuffling: Round-Robin vs. Niceness-Aware. (1) Round-Robin shuffling: every ShuffleInterval the priority order simply rotates, so each thread is most-prioritized exactly once per round. What can go wrong? Nice threads receive lots of interference during the intervals in which they are deprioritized.

(2) Niceness-Aware shuffling: each thread is still most-prioritized once per round, but the shuffle is asymmetric, so the least nice thread stays mostly deprioritized.
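A simplified sketch of an asymmetric, niceness-aware shuffle with the two properties just described (every thread gets one turn at the top; the least nice threads otherwise stay near the bottom). The paper's insertion-shuffle mechanism is more elaborate; this only illustrates the idea, and it assumes the niceness scores computed above.

```python
def niceness_aware_schedule(intensive, interval_idx):
    """Priority order (highest first) for one ShuffleInterval.

    Threads sit in niceness order (nicest first); each interval a
    different thread takes the top slot for one turn and then returns
    to its place. Every thread is most-prioritized once per round, but
    the least nice threads remain deprioritized the rest of the time.
    """
    base = sorted(intensive, key=lambda t: t.niceness, reverse=True)
    turn = base[interval_idx % len(base)]          # whose turn at the top
    return [turn] + [t for t in base if t is not turn]
```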

TCM Outline: 1. Clustering 2. Between Clusters 3. Non-Intensive Cluster (Throughput) 4. Intensive Cluster (Fairness)

Outline: Motivation & Insights; Overview; Algorithm; Bringing it All Together; Evaluation; Conclusion

Quantum-Based Operation. Time is divided into quanta of roughly 1M cycles. During a quantum, the controller monitors each thread's behavior: (1) memory intensity, (2) bank-level parallelism, (3) row-buffer locality. At the beginning of the next quantum, it performs clustering and computes the niceness of intensive threads. Within a quantum, the intensive cluster's priorities are shuffled every shuffle interval of roughly 1K cycles.
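To show how the pieces fit together in time, here is a minimal sketch of one quantum that reuses the illustrative helpers above (cluster_threads, compute_niceness, niceness_aware_schedule); serve_requests is a placeholder for the controller's request-service loop, and the cycle counts are taken from the slide.

```python
QUANTUM_CYCLES = 1_000_000    # ~1M cycles per quantum (from the slide)
SHUFFLE_CYCLES = 1_000        # ~1K cycles per shuffle interval

def serve_requests(non_intensive_rank, intensive_rank, cycles):
    """Placeholder for the memory controller's request-service loop."""
    pass

def run_quantum(threads, alpha=0.05):
    """One quantum of TCM-style operation (illustrative sketch)."""
    # Beginning of quantum: cluster threads using the previous quantum's
    # counters, then compute niceness for the intensive cluster.
    non_intensive, intensive = cluster_threads(threads, alpha)
    compute_niceness(intensive)

    # The non-intensive cluster is ranked once per quantum, lowest MPKI first.
    non_intensive_rank = sorted(non_intensive, key=lambda t: t.mpki)

    # During the quantum: re-shuffle intensive-cluster priorities every
    # shuffle interval while hardware counters keep monitoring intensity,
    # bank-level parallelism, and row-buffer locality for the next quantum.
    for i in range(QUANTUM_CYCLES // SHUFFLE_CYCLES):
        intensive_rank = niceness_aware_schedule(intensive, i)
        serve_requests(non_intensive_rank, intensive_rank, SHUFFLE_CYCLES)
```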

TCM Scheduling Algorithm. 1. Highest-rank: requests from higher-ranked threads are prioritized; the non-intensive cluster ranks above the intensive cluster; within the non-intensive cluster, lower intensity means higher rank; within the intensive cluster, ranks are shuffled. 2. Row-hit: row-buffer hit requests are prioritized. 3. Oldest: older requests are prioritized.
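A small sketch of these three rules as a sort key for picking the next request from a queue; the request field names (thread_id, is_row_hit, arrival_time) and the thread_rank mapping are illustrative, not the controller's actual structures.

```python
def request_priority_key(req, thread_rank):
    """Sort key implementing the three TCM rules; smaller sorts first.

    `thread_rank` maps thread id -> rank (0 = highest-ranked thread,
    with non-intensive threads ranked above intensive ones).
    """
    return (
        thread_rank[req.thread_id],     # 1. highest-rank first
        0 if req.is_row_hit else 1,     # 2. then row-buffer hits
        req.arrival_time,               # 3. then oldest request
    )

# Usage sketch: pick the next request to service from the queue.
# next_req = min(request_queue, key=lambda r: request_priority_key(r, rank))
```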

Implementation Costs. Required storage at the memory controller (24 cores): MPKI ~0.2 kbit, bank-level parallelism ~0.6 kbit, row-buffer locality ~2.9 kbit, total < 4 kbit. No computation is on the critical path.

Outline: Motivation & Insights; Overview; Algorithm; Bringing it All Together; Evaluation; Conclusion

Metrics & Methodology. Metrics: system throughput and unfairness. Methodology: core model with a 4 GHz processor, 128-entry instruction window, and 512 KB per-core L2 cache; DDR2 memory model; 96 multiprogrammed SPEC CPU2006 workloads.
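The slide's metric equations do not survive in the transcript; the definitions below are the standard ones used in this line of work (weighted speedup for system throughput, maximum slowdown for unfairness) and are assumed here rather than copied from the slide.

```latex
% Assumed standard definitions; not transcribed from the slide itself.
\text{System throughput} = \sum_{i} \frac{\mathrm{IPC}_i^{\mathrm{shared}}}{\mathrm{IPC}_i^{\mathrm{alone}}},
\qquad
\text{Unfairness} = \max_{i} \frac{\mathrm{IPC}_i^{\mathrm{alone}}}{\mathrm{IPC}_i^{\mathrm{shared}}}
```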

Previous Work. FRFCFS [Rixner et al., ISCA'00]: prioritizes row-buffer hits; thread-oblivious → low throughput and low fairness. STFM [Mutlu et al., MICRO'07]: equalizes thread slowdowns; non-intensive threads not prioritized → low throughput. PAR-BS [Mutlu et al., ISCA'08]: prioritizes the oldest batch of requests while preserving bank-level parallelism; non-intensive threads not always prioritized → low throughput. ATLAS [Kim et al., HPCA'10]: prioritizes threads with less memory service; the most intensive thread starves → low fairness.

Results: Fairness vs. Throughput (averaged over 96 workloads). TCM provides the best fairness and system throughput. (Plot: better fairness vs. better system throughput; annotations mark improvements of roughly 5% and 39%, and 8% and 5%, over the best previous schedulers.)

Results: Fairness-Throughput Tradeoff. When each algorithm's configuration parameter is varied (for TCM, adjusting ClusterThreshold), TCM allows a robust fairness-throughput tradeoff. (Plot: curves for STFM, PAR-BS, ATLAS, and TCM on better-fairness vs. better-system-throughput axes.)

Operating System Support. ClusterThreshold is a tunable knob: the OS can trade off between fairness and throughput. Enforcing thread weights: the OS assigns weights to threads, and TCM enforces thread weights within each cluster.

Outline: Motivation & Insights; Overview; Algorithm; Bringing it All Together; Evaluation; Conclusion

Conclusion. No previous memory scheduling algorithm provides both high system throughput and fairness; the problem is that they use a single policy for all threads. TCM groups threads into two clusters: 1. prioritize the non-intensive cluster → throughput; 2. shuffle priorities within the intensive cluster → fairness; 3. shuffling should favor nice threads → fairness. TCM provides the best system throughput and fairness.

THANK YOU

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Thread Weight Support. Even if the heaviest-weighted thread happens to be the most intensive thread, it is not prioritized over the least intensive thread.

Harmonic Speedup. (Plot: better fairness vs. better system throughput, with system throughput measured as harmonic speedup.)

Shuffling Algorithm Comparison. With Niceness-Aware shuffling, both the average and the variance of maximum slowdown are lower than with Round-Robin shuffling. (Table: E(Maximum Slowdown) and VAR(Maximum Slowdown) for the two shuffling algorithms.)

Sensitivity Results. (Tables: system throughput and maximum slowdown as ShuffleInterval (cycles) is varied; and, compared to ATLAS across core counts, system throughput 0% to 3% higher and maximum slowdown 4% to 41% lower.)