CRUISE: Cache Replacement and Utility-Aware Scheduling
Aamer Jaleel, Hashem H. Najaf-abadi, Samantika Subramaniam, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD (Aamer.Jaleel@intel.com)
Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012)
Motivation
[Figure: cache hierarchies from single core (SMT) to dual-core (ST/SMT) to quad-core, with a shared LLC below the per-core L1/L2 caches]
- Shared last-level caches (LLCs) are common and becoming more so as core counts increase.
- As more concurrent applications take advantage of the available cores, contention for the shared cache grows.
- The shared cache must be managed efficiently to achieve high performance.
Problems with LRU-Managed Shared Caches
- The conventional LRU policy allocates cache resources based on rate of demand, not on benefit.
- Applications that get no benefit from the cache can still cause destructive cache interference.
[Figure: misses per 1000 instructions under LRU for soplex and h264ref, and their occupancy of a 2MB shared LLC under LRU replacement]
Addressing Shared Cache Performance
- The conventional LRU policy allocates cache resources based on rate of demand, not on benefit.
- Applications that get no benefit from the cache can still cause destructive cache interference.
State-of-the-art solutions:
- Improve cache replacement (HW)
- Modify memory allocation (SW)
- Intelligent application scheduling (SW)
[Figure: misses per 1000 instructions under LRU for soplex and h264ref, and their occupancy of a 2MB shared LLC under LRU replacement]
HW Techniques for Improving Shared Caches
- Modify the cache replacement policy.
- Goal: allocate cache resources based on cache utility, NOT demand.
- Examples of intelligent LLC replacement: UCP, DIP, TADIP, PIPP, SDBP, DRRIP, SHiP, GPA.
[Figure: two cores sharing an LLC under LRU vs. under an intelligent replacement policy]
SW Techniques for Improving Shared Caches I
- Modify the OS memory allocation policy.
- Goal: allocate pages to different cache sets to minimize interference (see the page-coloring sketch below).
[Figure: two cores sharing an LRU-managed LLC, with an intelligent OS memory allocator steering each application's pages to disjoint cache sets]
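One concrete flavor of this idea is page coloring: the OS hands an application only physical pages whose addresses map to a restricted group of LLC sets. Below is a minimal sketch under hypothetical parameters (cache size, associativity, page size); it illustrates the general technique, not the specific allocator evaluated in any of the cited work:

```python
# Page-coloring sketch: a page's "color" is the group of LLC sets it maps to.
# All parameters here are hypothetical; real values depend on the machine.
CACHE_BYTES = 2 * 1024 * 1024        # 2MB LLC
WAYS        = 16
LINE_BYTES  = 64
PAGE_BYTES  = 4096

SETS          = CACHE_BYTES // (WAYS * LINE_BYTES)   # 2048 sets
SETS_PER_PAGE = PAGE_BYTES // LINE_BYTES             # 64 consecutive sets
NUM_COLORS    = SETS // SETS_PER_PAGE                # 32 colors

def page_color(phys_page_number: int) -> int:
    # The low bits of the physical page number select the set group.
    return phys_page_number % NUM_COLORS

def allocate_page(free_pages, allowed_colors):
    """Return a free physical page whose color is allowed for this app."""
    for pfn in free_pages:
        if page_color(pfn) in allowed_colors:
            free_pages.remove(pfn)
            return pfn
    raise MemoryError("no free page of the requested colors")

# Usage: give app0 colors 0..15 and app1 colors 16..31 so their data
# never competes for the same LLC sets.
free = list(range(1024))
app0_page = allocate_page(free, allowed_colors=set(range(16)))
app1_page = allocate_page(free, allowed_colors=set(range(16, 32)))
```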
SW Techniques for Improving Shared Caches II
- Modify the scheduling policy in the operating system (OS) or hypervisor.
- Goal: intelligently co-schedule applications to minimize contention.
- Applies to any system with more than one LLC, whether single-socket or multi-socket.
[Figure: four cores split across two LRU-managed LLCs; the scheduler decides which applications share an LLC]
SW Techniques for Improving Shared Caches
- With four applications A, B, C, D on two dual-core LLCs, there are three possible schedules: (A, B | C, D), (A, C | B, D), (A, D | B, C).
- On the baseline system (4-core CMP, 3-level hierarchy, LRU-managed LLCs), the optimal schedule can outperform the worst schedule by ~30% in throughput, and by ~9% on average (see the enumeration sketch below).
[Figure: throughput of the three schedules for an example workload (4.9, 5.5, 6.3), worst vs. optimal]
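To make the schedule space concrete, this sketch enumerates the three ways to split four applications across two LLCs and picks the best and worst under a caller-supplied throughput model. The `throughput_of_pair` argument and the toy numbers (chosen to match the slide's 4.9/5.5/6.3 example) are illustrative stand-ins for measured values:

```python
def schedules(apps):
    """Yield all ways to split 4 apps into two pairs, one pair per LLC."""
    first = apps[0]
    for partner in apps[1:]:
        pair0 = (first, partner)
        pair1 = tuple(x for x in apps if x not in pair0)
        yield pair0, pair1

def best_and_worst(apps, throughput_of_pair):
    """Rank schedules by total system throughput (higher is better)."""
    scored = [(throughput_of_pair(p0) + throughput_of_pair(p1), (p0, p1))
              for p0, p1 in schedules(apps)]
    return max(scored), min(scored)

# Usage with a toy per-pair throughput model:
toy = {frozenset("AB"): 2.4, frozenset("CD"): 2.5,   # -> 4.9 total
       frozenset("AC"): 2.7, frozenset("BD"): 2.8,   # -> 5.5 total
       frozenset("AD"): 3.1, frozenset("BC"): 3.2}   # -> 6.3 total
best, worst = best_and_worst("ABCD", lambda p: toy[frozenset(p)])
print(best, worst)   # here the optimal schedule is ~30% above the worst
```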
Interactions Between Co-Scheduling and Replacement
- Existing co-scheduling proposals were evaluated on LRU-managed LLCs.
- Question: is intelligent co-scheduling still necessary with improved cache replacement policies?
- We study DRRIP cache replacement [Jaleel et al., ISCA'10] (sketched below), which requires less hardware than LRU yet outperforms it.
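For context, DRRIP dynamically chooses between two Re-Reference Interval Prediction (RRIP) policies via set dueling. Here is a minimal sketch of the SRRIP component with 2-bit re-reference prediction values, following the published RRIP description; counter widths and the BRRIP/set-dueling machinery are simplified away:

```python
# SRRIP sketch: each line carries a 2-bit re-reference prediction value (RRPV).
# 0 = predicted near re-reference, 3 = predicted distant re-reference.
MAX_RRPV = 3

class SRRIPSet:
    def __init__(self, ways):
        self.tags = [None] * ways
        self.rrpv = [MAX_RRPV] * ways

    def access(self, tag):
        if tag in self.tags:                   # hit: predict near re-reference
            self.rrpv[self.tags.index(tag)] = 0
            return True
        way = self._victim()                   # miss: evict a "distant" line
        self.tags[way] = tag
        self.rrpv[way] = MAX_RRPV - 1          # insert with a LONG interval, so
        return False                           # a line must earn a hit to stay

    def _victim(self):
        while True:                            # age all lines until one looks
            for w, v in enumerate(self.rrpv):  # distant (RRPV == MAX_RRPV)
                if v == MAX_RRPV:
                    return w
            self.rrpv = [v + 1 for v in self.rrpv]
```

DRRIP set-duels this policy against BRRIP, which inserts most lines at RRPV 3, so thrashing access patterns keep only a trickle of lines in the cache.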
Interactions Between Optimal Co-Scheduling and Replacement
(4-core CMP, 3-level hierarchy, per-workload comparison of 1365 4-core multi-programmed workloads)
- Category I: no intelligent co-schedule needed under either LRU or DRRIP
- Category II: intelligent co-schedule needed only under LRU
- Category III: intelligent co-schedule needed only under DRRIP
- Category IV: intelligent co-schedule needed under both LRU and DRRIP
Observation: the need for intelligent co-scheduling is a function of the replacement policy.
Interactions Between Optimal Co-Scheduling and Replacement
- Category II workloads require an intelligent co-schedule only under LRU.
[Figure: a Category II workload on four cores across two LRU-managed LLCs]
Interactions Between Optimal Co-Scheduling and Replacement
- Under DRRIP-managed LLCs, no re-scheduling is necessary for Category II workloads.
[Figure: the same Category II workload on DRRIP-managed LLCs]
Opportunity for Intelligent Application Co-Scheduling
Prior art:
- Evaluated using inefficient cache policies (i.e., LRU replacement).
- Bases co-scheduling decisions on application memory intensity alone, which is not quite optimal due to inadequate application characterization.
Proposal: Cache Replacement and Utility-aware Scheduling (CRUISE):
- Understand how applications access the LLC (in isolation).
- Schedule applications based on how they can impact each other, keeping the LLC replacement policy in mind.
Memory Diversity of Applications (In Isolation)
- Core Cache Fitting (CCF), e.g. povray*: served by the core's private caches; infrequently accesses the LLC.
- LLC Fitting (LLCF), e.g. sphinx3*: needs the majority of the LLC to hit.
- LLC Friendly (LLCFR), e.g. bzip2*: benefits from whatever LLC capacity it receives.
- LLC Thrashing (LLCT), e.g. bwaves*: working set exceeds the LLC, so it misses regardless.
*Assuming a 4MB shared LLC. (A classification sketch follows.)
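A hedged sketch of how such a classification might be computed from isolated-run statistics. The thresholds (`LOW_APKI`, `LOW_MPKI`) are hypothetical placeholders, not values from the paper; the paper derives the inputs from the RICE counters described later:

```python
# Classify an application from its isolated LLC behavior.
#   apki      = LLC accesses per 1000 instructions
#   mpki_full = misses per 1000 instructions with the full LLC
#   mpki_half = misses per 1000 instructions with half the LLC
LOW_APKI = 1.0   # hypothetical threshold: rarely touches the LLC
LOW_MPKI = 1.0   # hypothetical threshold: (almost) always hits

def classify(apki, mpki_full, mpki_half):
    if apki < LOW_APKI:
        return "CCF"     # served by the core's private caches
    if mpki_full >= apki - LOW_MPKI:
        return "LLCT"    # accesses the LLC a lot but almost never hits
    if mpki_half - mpki_full > LOW_MPKI:
        return "LLCF"    # hits only when given most of the LLC
    return "LLCFR"       # benefits from whatever capacity it gets

print(classify(apki=0.2,  mpki_full=0.1,  mpki_half=0.1))   # CCF
print(classify(apki=40.0, mpki_full=39.5, mpki_half=39.8))  # LLCT
print(classify(apki=20.0, mpki_full=0.5,  mpki_half=15.0))  # LLCF
print(classify(apki=20.0, mpki_full=5.0,  mpki_half=5.5))   # LLCFR
```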
Cache Replacement and Utility-aware Scheduling (CRUISE)
Core Cache Fitting (CCF) apps:
- Infrequently access the LLC and do not rely on it for performance.
- Co-scheduling multiple CCF jobs on the same LLC "wastes" that LLC.
- Best to spread CCF applications across the available LLCs.
[Figure: two CCF apps placed on different LLCs]
Cache Replacement and Utility-aware Scheduling (CRUISE)
LLC Thrashing (LLCT) apps:
- Frequently access the LLC but do not benefit from it at all.
- Under LRU, LLCT apps degrade the performance of other applications.
- Co-schedule LLCT apps with other LLCT apps.
[Figure: two LLCT apps confined to the same LLC under LRU]
Cache Replacement and Utility-aware Scheduling (CRUISE)
LLC Thrashing (LLCT) apps:
- Under DRRIP, LLCT apps do not degrade the performance of co-scheduled apps.
- Best to spread LLCT apps across the available LLCs to utilize cache resources efficiently.
[Figure: two LLCT apps placed on different LLCs under DRRIP]
Cache Replacement and Utility-aware Scheduling (CRUISE)
LLC Fitting (LLCF) apps:
- Frequently access the LLC and require the majority of it.
- Behave like LLCT apps if they do not receive the majority of the LLC.
- Co-scheduling an LLCF app with any other app that competes for the LLC can degrade system performance significantly.
- Best to co-schedule LLCF with CCF applications (if present); if no CCF app exists, schedule with LLCF/LLCT apps.
[Figure: an LLCF app paired with a CCF app on one LLC]
Cache Replacement and Utility-aware Scheduling (CRUISE)
LLC Friendly (LLCFR) apps:
- Rely on the LLC for performance but can share it with similar apps.
- Co-scheduling multiple LLCFR jobs on the same LLC does not result in suboptimal performance.
[Figure: two LLCFR apps sharing an LLC]
CRUISE for LRU-managed Caches (CRUISE-L)
Given LLCT, LLCT, LLCF, CCF applications, co-schedule as follows:
- Co-schedule LLCT apps with LLCT apps.
- Spread CCF applications across LLCs.
- Co-schedule LLCF apps with CCF apps.
- Fill LLCFR apps onto free cores.
[Figure: the two LLCT apps share one LLC; the LLCF and CCF apps share the other]
CRUISE for DRRIP-managed Caches (CRUISE-D)
Given LLCT, LLCT, LLCFR, CCF applications, co-schedule as follows (see the scheduler sketch below):
- Spread LLCT apps across LLCs.
- Spread CCF apps across LLCs.
- Co-schedule LLCF apps with CCF/LLCT apps.
- Fill LLCFR apps onto free cores.
[Figure: each LLC receives one LLCT app; the CCF and LLCFR apps fill the remaining cores]
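A compact sketch of both heuristics for the two-LLCs, two-cores-per-LLC case. The placement priorities encode the slides' rules; the greedy `spread`/`pack` helpers and function names are illustrative, not the paper's implementation, and the sketch assumes exactly as many apps as cores:

```python
def cruise(apps, policy, num_llcs=2, cores_per_llc=2):
    """apps: list of (name, class) tuples -> one list of app names per LLC."""
    by_class = {c: [n for n, cls in apps if cls == c]
                for c in ("LLCT", "LLCF", "LLCFR", "CCF")}
    sched = [[] for _ in range(num_llcs)]

    def free():                           # LLCs that still have a free core
        return [s for s in sched if len(s) < cores_per_llc]

    def spread(cls):                      # separate the class across LLCs
        for app in by_class[cls]:
            min(free(), key=len).append(app)

    def pack(cls):                        # co-locate the class (fullest first)
        for app in by_class[cls]:
            max(free(), key=len).append(app)

    if policy == "LRU":                   # CRUISE-L
        pack("LLCT"); spread("CCF"); pack("LLCF"); pack("LLCFR")
    else:                                 # CRUISE-D (DRRIP)
        spread("LLCT"); spread("CCF"); pack("LLCF"); pack("LLCFR")
    return sched

# Usage: the slide's CRUISE-D example.
print(cruise([("t1", "LLCT"), ("t2", "LLCT"), ("fr", "LLCFR"), ("cc", "CCF")],
             policy="DRRIP"))             # -> [['t1', 'cc'], ['t2', 'fr']]
```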
Experimental Methodology
System model:
- 4-wide OoO processor (Core i7 type)
- 3-level memory hierarchy (Core i7 type)
- Application scheduler
Workloads:
- Multi-programmed combinations of SPEC CPU2006 applications
- ~1400 4-core multi-programmed workloads (2 cores/LLC)
- ~6400 8-core multi-programmed workloads (2 cores/LLC, 4 cores/LLC)
[Figure: baseline system, four applications A, B, C, D on two dual-core LLCs]
CRUISE Performance on Shared Caches
(4-core CMP, 3-level hierarchy, averaged across all 1365 multi-programmed workload mixes)
[Figure: performance relative to the worst schedule, comparing a prior ASPLOS'10 scheduling proposal, CRUISE-L, and OPTIMAL under LRU, and CRUISE-D and OPTIMAL under DRRIP]
- CRUISE provides near-optimal performance.
- The optimal co-scheduling decision is a function of the LLC replacement policy.
Classifying Application Cache Utility in Isolation
How do you know the application classification at run time? Rejected alternatives:
- Profiling: the application provides its memory intensity at run time. ✗
- HW performance counters: assume isolated cache behavior is the same as shared cache behavior. ✗
- Periodically pause adjacent cores at run time. ✗
Proposal: Runtime Isolated Cache Estimator (RICE)
- Architectural support to estimate isolated cache behavior while still sharing the LLC.
Runtime Isolated Cache Estimator (RICE)
Assume a cache shared by two applications, APP0 and APP1:
- A small group of sampled sets (32 per app) monitors APP0's isolated cache behavior: only APP0 fills into these sets; all other apps bypass them.
- Likewise, 32 sampled sets monitor APP1's isolated behavior: only APP1 fills into them.
- All remaining sets are follower sets.
- Per-app 15-bit hit/miss counters (Access, Miss) accumulate the statistics needed to compute isolated access and miss rates (apki, mpki).
Runtime Isolated Cache Estimator (RICE)
To classify LLCF applications, RICE also estimates each app's isolated behavior at half the cache capacity:
- In a second group of sampled sets, APP0 fills into only half the ways; all other apps may still use these sets.
- Each app keeps two counter pairs: (Access-F, Miss-F) for the full-capacity sample sets and (Access-H, Miss-H) for the half-capacity sample sets.
- Together these yield the isolated apki/mpki at full and at half capacity. (A sampling sketch follows.)
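A hedged sketch of the set-sampling idea: a simple function of the set index assigns each app its full-capacity and half-capacity sample sets, and small saturating counters accumulate the apki/mpki inputs. The region hash, set counts, and counter plumbing are simplified relative to the hardware:

```python
# RICE-style set-sampling sketch for a cache with NUM_SETS sets.
# Each app owns two small groups of sample sets:
#   "full": only this app fills, using all the ways  -> isolated full-cache stats
#   "half": this app fills only half the ways        -> isolated half-cache stats
NUM_SETS, SAMPLE_SETS = 2048, 32

def sample_group(set_index, app_id):
    """Return 'full', 'half', or 'follower' for this (set, app) pair."""
    region = set_index % (NUM_SETS // SAMPLE_SETS)   # 64 regions of 32 sets each
    if region == 2 * app_id:
        return "full"
    if region == 2 * app_id + 1:
        return "half"
    return "follower"

class RiceCounters:
    """15-bit saturating access/miss counters per app and sample group."""
    LIMIT = (1 << 15) - 1

    def __init__(self):
        self.c = {("full", "acc"): 0, ("full", "miss"): 0,
                  ("half", "acc"): 0, ("half", "miss"): 0}

    def record(self, group, hit):
        self.c[(group, "acc")] = min(self.c[(group, "acc")] + 1, self.LIMIT)
        if not hit:
            self.c[(group, "miss")] = min(self.c[(group, "miss")] + 1, self.LIMIT)
```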
Performance of CRUISE using RICE Classifier
[Figure: performance relative to the worst schedule, comparing the prior ASPLOS'10 proposal, CRUISE with the dynamic RICE classifier, and OPTIMAL]
- CRUISE using the dynamic RICE classifier performs within 1-2% of optimal.
Summary
- Optimal application co-scheduling is an important problem, useful for future multi-core processors and virtualization technologies.
- Co-scheduling decisions are a function of the replacement policy.
- Our proposals: Cache Replacement and Utility-aware Scheduling (CRUISE), with architecture support for estimating isolated cache behavior (RICE).
- CRUISE is scalable and performs similarly to optimal co-scheduling.
- RICE requires negligible hardware overhead.
Q&A