A New Cache Monitoring Scheme for Memory-Aware Scheduling and Partitioning
G. Edward Suh, Srinivas Devadas, and Larry Rudolph
Massachusetts Institute of Technology
HPCA-8

Problem
- Memory system performance is critical.
- Everyone thinks about their own application: tuning replacement policies, software/hardware prefetching.
- But modern computer systems execute multiple applications concurrently or simultaneously:
  - Time-shared systems: context switches cause cold misses.
  - Multiprocessor systems sharing the memory hierarchy (SMP, SMT, CMP): simultaneous applications compete for cache space.

Solutions: Cache Partitioning and Memory-Aware Scheduling
- Cache partitioning: explicitly manage the allocation of cache space among concurrent/simultaneous processes.
  - Each process gets a different benefit from additional cache space.
  - Similar to main memory partitioning in the old days (e.g., Stone 1992).
- Memory-aware scheduling: choose a set of simultaneous processes that minimizes memory/cache contention.
  - Scheduling for SMT systems (Snavely 2000): threads interact in various ways (RUU, functional units, caches, etc.); schedules are chosen by executing and profiling many candidates.
  - Admission control for gang scheduling (Batat 2000): based on the footprint of a job (its total memory usage).

BUT…
- Testing many possible schedules is not viable: the number of possible schedules increases exponentially with the number of processes.
- Instead, we need to derive a good schedule from individual process characteristics, so that complexity increases only linearly.
- Footprint-based scheduling does not provide enough information:
  - The footprint of a process is often larger than the cache.
  - Processes may not need their entire working set in the cache.
- Can we find a good schedule for cache performance? What information do we need about each process?

Information a Scheduler/Partitioner Needs
- Characterizing a process: for scheduling and partitioning, we need to know the effect of varying the cache size, i.e., performance numbers for multiple cache sizes, ignoring all effects other than cache size.
- Miss-rate curves, m(c): the cache miss-rate as a function of cache size c (in cache blocks).
  - Assume the process runs in isolation.
  - Assume the cache is FULLY-ASSOCIATIVE.
- Miss-rate curves provide the essential information for scheduling and partitioning.
[Figure: a miss-rate curve, miss-rate vs. cache space (%)]
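The talk obtains these curves with hardware counters later on, but a software model makes the definition concrete. For a fully-associative LRU cache, one pass over an address trace yields m(c) for every size c at once, via LRU stack distances (Mattson's classic one-pass technique). The sketch below is our own illustration; the trace format is an assumption.

```python
from collections import defaultdict

def miss_rate_curve(trace, max_blocks):
    """Compute m(c) for a fully-associative LRU cache of every size
    c = 1..max_blocks in a single pass over the trace, using LRU
    stack distances."""
    stack = []                      # full LRU stack: index 0 = MRU
    dist_hist = defaultdict(int)    # histogram of stack distances
    for block in trace:
        if block in stack:
            d = stack.index(block)  # a cache of size > d would hit here
            dist_hist[d] += 1
            stack.pop(d)
        stack.insert(0, block)      # block becomes most recently used
    refs = len(trace)
    # A reference with stack distance d hits in any cache of size >= d + 1.
    curve = []
    for c in range(1, max_blocks + 1):
        hits = sum(n for d, n in dist_hist.items() if d < c)
        curve.append((refs - hits) / refs)
    return curve

# Example: 3 cold misses, then 3 hits once the cache holds 3 blocks.
print(miss_rate_curve([1, 2, 3, 1, 2, 3], 4))  # [1.0, 1.0, 0.5, 0.5]
```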

Using Miss-Rate Curves for Partitioning
- What do miss-rate curves tell us about cache allocation?
- Given miss-rate curves m_A and m_B for Processes A and B, an allocation of c_A blocks to A and c_B blocks to B yields

    cache misses = m_A(c_A)·ref_A + m_B(c_B)·ref_B

  where ref_A and ref_B are the processes' reference counts.
[Figure: miss-rate curves for Process A and Process B, and a cache allocation split into c_A and c_B]
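To make the formula concrete, the sketch below (our own illustration, not from the talk) exhaustively evaluates every split of C blocks between two processes. The greedy method on the next slide reaches the same allocation far more cheaply when the curves are convex.

```python
def best_split(m_a, m_b, ref_a, ref_b):
    """Exhaustively minimize m_a[c_a]*ref_a + m_b[c_b]*ref_b over all
    splits c_a + c_b = C, where m_x[c] is process x's miss-rate with
    c cache blocks (so len(m_a) == len(m_b) == C + 1)."""
    C = len(m_a) - 1
    return min(((c, C - c) for c in range(C + 1)),
               key=lambda s: m_a[s[0]] * ref_a + m_b[s[1]] * ref_b)

# Example with hypothetical curves over C = 4 blocks:
m_a = [1.0, 0.6, 0.4, 0.35, 0.33]   # A benefits a lot from 2 blocks
m_b = [1.0, 0.9, 0.85, 0.8, 0.2]    # B needs all 4 blocks to do well
print(best_split(m_a, m_b, ref_a=1000, ref_b=1000))  # (2, 2) here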

Finding the Best Allocation
- Use the marginal gain g(c) = m(c)·ref - m(c+1)·ref: the reduction in the number of misses from giving a process one more cache block.
- Allocate cache blocks to processes greedily: starting with no blocks allocated, repeatedly compare marginal gains and give the next block to the process with the larger gain (see the sketch below).
- This is guaranteed to produce the optimal partition if the curves m(c) are convex.
[Figure: worked example comparing marginal gains step by step (e.g., 987 vs. 746); at each step a block goes to whichever process currently has the larger gain.]
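A minimal sketch of this greedy marginal-gain allocation, assuming the miss-rate curves are given as arrays indexed by block count (function and variable names are ours):

```python
import heapq

def greedy_partition(curves, refs, total_blocks):
    """Greedy marginal-gain cache partitioning.
    curves[p][c] = miss-rate of process p with c blocks (indices 0..total_blocks).
    refs[p]      = number of references made by process p.
    Returns the number of blocks allocated to each process; optimal when
    every curve is convex (marginal gains are non-increasing)."""
    n = len(curves)
    alloc = [0] * n

    def gain(p, c):
        # g(c) = m(c)*ref - m(c+1)*ref: misses saved by the (c+1)-th block
        return (curves[p][c] - curves[p][c + 1]) * refs[p]

    # Max-heap (via negation) of each process's next marginal gain.
    heap = [(-gain(p, 0), p) for p in range(n)]
    heapq.heapify(heap)
    for _ in range(total_blocks):
        neg_g, p = heapq.heappop(heap)
        alloc[p] += 1
        if alloc[p] < total_blocks:        # process can still grow
            heapq.heappush(heap, (-gain(p, alloc[p]), p))
    return alloc
```

With convex curves this reproduces the block-by-block comparisons shown on the slide; for non-convex curves the greedy result can be suboptimal, which is why the convexity condition matters.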

Partitioning Results
- Partition the L2 cache between two simultaneous processes (SPEC CPU2000 benchmarks: art and mcf).

Intuition for Memory-Aware Scheduling
- How should we schedule 4 processes on a 2-processor system using their individual miss-rate curves? Schedule A and C together, and B and D together.
- Curves tend to have a knee: the amount of cache space at which the marginal gain diminishes sharply.
- In this example, the working set is larger than the cache for every process, and all processes reach a similar miss-rate if given the entire cache.
- Group processes based on their knees (a simple knee-finding heuristic is sketched below).
[Figure: miss-rate curves for Processes A, B, C, and D, each with a visible knee]
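The talk determines knees with the partitioning technique itself (next slide). As a simpler stand-in, the heuristic below is entirely our own assumption, including the 5% threshold: take the first cache size at which the marginal gain falls below a fraction of its peak.

```python
def find_knee(curve, refs, frac=0.05):
    """Heuristic knee of a miss-rate curve (our own simplification):
    the smallest cache size c whose marginal gain g(c) drops below
    `frac` times the largest marginal gain."""
    gains = [(curve[c] - curve[c + 1]) * refs for c in range(len(curve) - 1)]
    peak = max(gains)
    for c, g in enumerate(gains):
        if g < frac * peak:
            return c
    return len(curve) - 1    # no knee found: the whole curve is steep
```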

Determining the Knee of the Curve
- Use the partitioning technique itself, but with the available cache resource doubled; each process's share of the doubled cache marks the knee of its curve.
- However, we may now need multiple time slices to schedule all the processes (two time slices in our example).
[Figure: cache allocations for the original and the doubled cache]
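Putting the pieces together, here is one hedged reading of this slide as code: run greedy_partition() (from the "Finding the Best Allocation" sketch) with the cache resource scaled by the number of time slices, read off each process's allocation as its knee, and pack processes into slices so each slice's knees fit the real cache. The first-fit-decreasing packing and all names below are our own assumptions, not the talk's algorithm; the curves must cover sizes up to cache_blocks * num_slices.

```python
def knee_schedule(curves, refs, cache_blocks, procs_per_slice=2):
    """Sketch of knee-based memory-aware scheduling: partition a scaled
    (doubled, for two slices) cache, then pack processes into slices."""
    n = len(curves)
    num_slices = (n + procs_per_slice - 1) // procs_per_slice
    alloc = greedy_partition(curves, refs, cache_blocks * num_slices)
    # First-fit decreasing: place the biggest cache consumers first,
    # each into the least-loaded slice that still has a processor free.
    slices = [[] for _ in range(num_slices)]
    load = [0] * num_slices
    for p in sorted(range(n), key=lambda q: -alloc[q]):
        open_slices = [i for i in range(num_slices)
                       if len(slices[i]) < procs_per_slice]
        s = min(open_slices, key=lambda i: load[i])
        slices[s].append(p)
        load[s] += alloc[p]
    return slices
```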

Scheduling Results
- Schedule 6 SPEC CPU benchmarks on 2 processors.

Analytical Model (ICS'01)
- Miss-rate curves (or marginal gains) alone may not be enough for optimizing time-shared systems:
  - partitioning among concurrent processes;
  - scheduling that accounts for the effects of context switches.
- Use an analytical model to predict cache-sharing effects.
[Figure: results for a 32-KB 8-way set-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu]

BUT…
- The processes to execute are only known at run-time: users decide which applications to run, so scheduling/partitioning decisions must be made at run-time.
- The behavior of a process changes over time: applications have different phases, so miss-rate curves (and marginal gains) may change over an execution.
- Cache configurations differ across systems, so miss-rate curves (and marginal gains) differ across systems as well.
- We therefore need an on-line estimate of the miss-rate curves (and marginal gains).

On-Line Estimation of Marginal Gains: Fully-Associative Caches
- Marginal gains can be counted directly from the temporal ordering of cache blocks (the LRU information).
- Use one counter per cache block (or per group of cache blocks), plus one counter for all accesses.
- A hit on the i-th most-recently-used block increments the i-th counter.
- Example: in a fully-associative cache with 4 blocks, a hit on the MRU block increments the 1st counter, and a hit on the 3rd-MRU block increments the 3rd counter.
[Figure: LRU ordering of the cache blocks, the per-position marginal-gain counters, and the access counter]
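A software model of this counter scheme (the real mechanism is a handful of hardware counters attached to the cache's LRU state; the class and method names here are ours):

```python
class MarginalGainMonitorFA:
    """On-line marginal-gain counters for a fully-associative LRU cache.
    counters[i] counts hits on the (i+1)-th most-recently-used block,
    i.e. the extra hits the (i+1)-th block provides over a cache of
    only i blocks: the marginal gain g(i)."""

    def __init__(self, num_blocks):
        self.stack = []                    # LRU stack: index 0 = MRU
        self.counters = [0] * num_blocks
        self.accesses = 0
        self.num_blocks = num_blocks

    def access(self, block):
        self.accesses += 1
        if block in self.stack:
            i = self.stack.index(block)    # LRU position of the hit
            self.counters[i] += 1
            self.stack.pop(i)
        elif len(self.stack) == self.num_blocks:
            self.stack.pop()               # evict the true LRU block
        self.stack.insert(0, block)        # block becomes MRU

    def miss_rate_curve(self):
        # m(c) = 1 - (hits a c-block cache would capture) / accesses
        hits, curve = 0, [1.0]             # m(0) = 1
        for c in range(1, self.num_blocks + 1):
            hits += self.counters[c - 1]
            curve.append(1 - hits / self.accesses)
        return curve
```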

BUT…
- Most caches are SET-ASSOCIATIVE (main memory being the notable exception), and usually at most 8-way associative.
- Set-associative caches only maintain a temporal ordering within each set; there is no global temporal ordering.
- So we cannot use block-by-block temporal ordering to obtain the marginal gains of a fully-associative cache.

Way-Counters
- Use the LRU information that already exists within each set.
- One counter per way (a D-way cache needs D counters).
- A hit on the i-th most-recently-used block of a set increments the i-th counter.
- Each way-counter then represents the gain of having S more cache blocks, where S is the number of sets.
[Figure: a 4-way set-associative cache with S sets; a hit on a set's MRU block increments the 1st counter, and a hit on a set's 2nd-MRU block increments the 2nd counter]
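The same kind of software model for way-counters (class name and the modulo set-indexing are our assumptions):

```python
class WayCounters:
    """Way-counters for a set-associative LRU cache: one counter per way.
    A hit on the i-th MRU block *within its set* increments counter i,
    so counters[i] approximates the gain of growing the cache from
    i*S to (i+1)*S blocks, where S is the number of sets."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets = [[] for _ in range(num_sets)]  # per-set LRU stacks
        self.counters = [0] * num_ways
        self.accesses = 0

    def access(self, block):
        self.accesses += 1
        s = self.sets[block % self.num_sets]       # simple set indexing
        if block in s:
            i = s.index(block)                     # LRU rank within the set
            self.counters[i] += 1
            s.pop(i)
        elif len(s) == self.num_ways:
            s.pop()                                # evict the set's LRU block
        s.insert(0, block)                         # block becomes set MRU
```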

Way+Set Counters
- Use more counters to capture more detailed information: additionally maintain LRU ordering across groups of sets.
- A hit on the i-th MRU way of the j-th MRU set group increments counter(i, j).
- Example: in a 2-way associative cache with set groups 0 through S', a hit on the MRU way of the 2nd-MRU group increments counter(0, 1).
[Figure: a 2-way associative cache, the temporal ordering of the set groups, and the (way, group) counters]
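Extending the previous sketch with a recency ordering over set groups; the counter layout follows the slide, but the group-assignment rule and names are our own reading:

```python
class WaySetCounters:
    """Way+set counters: counters[i][j] counts hits on the i-th MRU way
    of the j-th most-recently-used *group of sets*. Tracking recency
    across set groups refines the plain way-counter estimate."""

    def __init__(self, num_sets, num_ways, num_groups):
        self.num_sets, self.num_ways = num_sets, num_ways
        self.sets = [[] for _ in range(num_sets)]
        self.group_of = lambda set_idx: set_idx * num_groups // num_sets
        self.group_lru = list(range(num_groups))   # index 0 = MRU group
        self.counters = [[0] * num_groups for _ in range(num_ways)]

    def access(self, block):
        set_idx = block % self.num_sets
        s = self.sets[set_idx]
        g = self.group_of(set_idx)
        j = self.group_lru.index(g)                # group's recency rank
        if block in s:
            i = s.index(block)
            self.counters[i][j] += 1               # hit on (way i, group j)
            s.pop(i)
        elif len(s) == self.num_ways:
            s.pop()
        s.insert(0, block)
        self.group_lru.remove(g)                   # group becomes MRU
        self.group_lru.insert(0, g)
```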

Summary
- Caches should be managed more carefully, considering the effects of space- and time-sharing:
  - cache partitioning;
  - memory-aware scheduling.
- Miss-rate curves provide very relevant information for scheduling and partitioning:
  - they let us predict the effect of varying the cache space;
  - they are useful for any trade-off between performance and space (or power).
- On-line counters can estimate miss-rate curves at run-time:
  - they use the temporal ordering of blocks to predict miss-rates for smaller caches;
  - they work for both fully-associative and set-associative caches.

Partitioning Mechanism
- Modify the LRU replacement policy to enforce the partition.
- Count the number of cache blocks currently held by each process (X_A for Process A).
- Try to match X_A to the allocated cache space D_A.
- Replacement:
  - replace Process A's LRU block if X_A > D_A (A holds more than its allocation);
  - replace Process B's LRU block if X_B > D_B;
  - replace the standard LRU block if there is no over-allocated process.
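A behavioral sketch of this modified replacement policy, using a fully-associative model for brevity (the hardware applies the same rule per set; class and variable names are ours):

```python
class PartitionedLRUCache:
    """LRU cache with partition-enforcing replacement: on a miss, evict
    the LRU block of an over-allocated process (X_p > D_p); if no
    process is over-allocated, evict the global LRU block."""

    def __init__(self, num_blocks, quota):
        self.num_blocks = num_blocks
        self.quota = quota                  # D_p: allocated blocks per process
        self.stack = []                     # (process, block), index 0 = MRU
        self.owned = {p: 0 for p in quota}  # X_p: blocks currently held

    def access(self, proc, block):
        entry = (proc, block)
        if entry in self.stack:             # hit: just update recency
            self.stack.remove(entry)
        else:                               # miss: may need a replacement
            if len(self.stack) == self.num_blocks:
                victim = self._choose_victim()
                self.stack.remove(victim)
                self.owned[victim[0]] -= 1
            self.owned[proc] += 1
        self.stack.insert(0, entry)

    def _choose_victim(self):
        # Prefer the LRU block of any over-allocated process.
        for entry in reversed(self.stack):  # scan from the LRU end
            if self.owned[entry[0]] > self.quota[entry[0]]:
                return entry
        return self.stack[-1]               # standard LRU block
```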

Scheduling: L2

Way-Counter Implementation
[Figure: hardware sketch. Each cache line stores way-LRU bits alongside its valid bit, tag, and data; on a hit, the way-LRU position of the hit block selects which counter (e.g., Counter(1), Counter(2)) to increment.]

Way+Set Counter Implementation
[Figure: hardware sketch. In addition to the per-set way-LRU bits, each group of sets maintains set-LRU ordering bits; on a hit, the way-LRU position of the hit block and the set-LRU rank of its group together select the counter to increment (e.g., Counter(4)).]