A New Cache Monitoring Scheme for Memory-Aware Scheduling and Partitioning
G. Edward Suh, Srinivas Devadas, and Larry Rudolph
Massachusetts Institute of Technology
HPCA-8

Problem
- Memory system performance is critical.
- Everyone thinks about their own application: tuning replacement policies, software/hardware prefetching.
- But modern computer systems execute multiple applications concurrently or simultaneously:
  - Time-shared systems: context switches cause cold misses.
  - Multiprocessor systems sharing the memory hierarchy (SMP, SMT, CMP): simultaneous applications compete for cache space.

Solutions: Cache Partitioning and Memory-Aware Scheduling
- Cache partitioning: explicitly manage the allocation of cache space among concurrent/simultaneous processes.
  - Each process gets a different benefit from additional cache space.
  - Similar to main memory partitioning in the old days (e.g., Stone 1992).
- Memory-aware scheduling: choose a set of simultaneous processes that minimizes memory/cache contention.
  - Scheduling for SMT systems (Snavely 2000): threads interact in various ways (RUU, functional units, caches, etc.); schedules are chosen by executing and profiling many candidates.
  - Admission control for gang scheduling (Batat 2000): based on the footprint of a job (its total memory usage).

BUT…
- Testing many possible schedules is not viable: the number of possible schedules increases exponentially with the number of processes.
- Instead, we need to derive a good schedule from individual process characteristics, so that complexity increases only linearly.
- Footprint-based scheduling does not provide enough information:
  - The footprint of a process is often larger than the cache.
  - Processes may not need their entire working set in the cache.
- Can we find a good schedule for cache performance? What information do we need about each process?

Information a Scheduler/Partitioner Needs
- Characterizing a process: for scheduling and partitioning, we need to know the effect of varying the cache size, i.e., performance numbers for multiple cache sizes, ignoring all effects other than cache size.
- Miss-rate curves, m(c): the cache miss-rate as a function of cache size c (in cache blocks).
  - Assume the process runs in isolation.
  - Assume the cache is FULLY-ASSOCIATIVE.
- Miss-rate curves provide the essential information for scheduling and partitioning.
[Figure: a miss-rate curve, miss-rate vs. cache space (%)]
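The talk obtains these curves with hardware counters later on, but a software model makes the definition concrete. For a fully-associative LRU cache, one pass over an address trace yields m(c) for every size c at once, via LRU stack distances (Mattson's classic one-pass technique). The sketch below is our own illustration; the trace format is an assumption.

```python
from collections import defaultdict

def miss_rate_curve(trace, max_blocks):
    """Compute m(c) for a fully-associative LRU cache of every size
    c = 1..max_blocks in a single pass over the trace, using LRU
    stack distances."""
    stack = []                      # full LRU stack: index 0 = MRU
    dist_hist = defaultdict(int)    # histogram of stack distances
    for block in trace:
        if block in stack:
            d = stack.index(block)  # a cache of size > d would hit here
            dist_hist[d] += 1
            stack.pop(d)
        stack.insert(0, block)      # block becomes most recently used
    refs = len(trace)
    # A reference with stack distance d hits in any cache of size >= d + 1.
    curve = []
    for c in range(1, max_blocks + 1):
        hits = sum(n for d, n in dist_hist.items() if d < c)
        curve.append((refs - hits) / refs)
    return curve

# Example: 3 cold misses, then 3 hits once the cache holds 3 blocks.
print(miss_rate_curve([1, 2, 3, 1, 2, 3], 4))  # [1.0, 1.0, 0.5, 0.5]
```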

Using Miss-Rate Curves for Partitioning
- What do miss-rate curves tell us about cache allocation?
- Given miss-rate curves m_A and m_B for Processes A and B, an allocation of c_A blocks to A and c_B blocks to B yields

    cache misses = m_A(c_A)·ref_A + m_B(c_B)·ref_B

  where ref_A and ref_B are the processes' reference counts.
[Figure: miss-rate curves for Process A and Process B, and a cache allocation split into c_A and c_B]
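To make the formula concrete, the sketch below (our own illustration, not from the talk) exhaustively evaluates every split of C blocks between two processes. The greedy method on the next slide reaches the same allocation far more cheaply when the curves are convex.

```python
def best_split(m_a, m_b, ref_a, ref_b):
    """Exhaustively minimize m_a[c_a]*ref_a + m_b[c_b]*ref_b over all
    splits c_a + c_b = C, where m_x[c] is process x's miss-rate with
    c cache blocks (so len(m_a) == len(m_b) == C + 1)."""
    C = len(m_a) - 1
    return min(((c, C - c) for c in range(C + 1)),
               key=lambda s: m_a[s[0]] * ref_a + m_b[s[1]] * ref_b)

# Example with hypothetical curves over C = 4 blocks:
m_a = [1.0, 0.6, 0.4, 0.35, 0.33]   # A benefits a lot from 2 blocks
m_b = [1.0, 0.9, 0.85, 0.8, 0.2]    # B needs all 4 blocks to do well
print(best_split(m_a, m_b, ref_a=1000, ref_b=1000))  # (2, 2) here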

Finding the Best Allocation
- Use the marginal gain g(c) = m(c)·ref - m(c+1)·ref: the reduction in the number of misses from giving a process one more cache block.
- Allocate cache blocks to processes greedily: starting with no blocks allocated, repeatedly compare marginal gains and give the next block to the process with the larger gain (see the sketch below).
- This is guaranteed to produce the optimal partition if the curves m(c) are convex.
[Figure: worked example comparing marginal gains step by step (e.g., 987 vs. 746); at each step a block goes to whichever process currently has the larger gain.]
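A minimal sketch of this greedy marginal-gain allocation, assuming the miss-rate curves are given as arrays indexed by block count (function and variable names are ours):

```python
import heapq

def greedy_partition(curves, refs, total_blocks):
    """Greedy marginal-gain cache partitioning.
    curves[p][c] = miss-rate of process p with c blocks (indices 0..total_blocks).
    refs[p]      = number of references made by process p.
    Returns the number of blocks allocated to each process; optimal when
    every curve is convex (marginal gains are non-increasing)."""
    n = len(curves)
    alloc = [0] * n

    def gain(p, c):
        # g(c) = m(c)*ref - m(c+1)*ref: misses saved by the (c+1)-th block
        return (curves[p][c] - curves[p][c + 1]) * refs[p]

    # Max-heap (via negation) of each process's next marginal gain.
    heap = [(-gain(p, 0), p) for p in range(n)]
    heapq.heapify(heap)
    for _ in range(total_blocks):
        neg_g, p = heapq.heappop(heap)
        alloc[p] += 1
        if alloc[p] < total_blocks:        # process can still grow
            heapq.heappush(heap, (-gain(p, alloc[p]), p))
    return alloc
```

With convex curves this reproduces the block-by-block comparisons shown on the slide; for non-convex curves the greedy result can be suboptimal, which is why the convexity condition matters.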

Partitioning Results
- Partition the L2 cache between two simultaneous processes (SPEC CPU2000 benchmarks: art and mcf).

Intuition for Memory-Aware Scheduling
- How should we schedule 4 processes on a 2-processor system using their individual miss-rate curves? Schedule A and C together, and B and D together.
- Curves tend to have a knee: the amount of cache space at which the marginal gain diminishes sharply.
- In this example, the working set is larger than the cache for every process, and all processes reach a similar miss-rate if given the entire cache.
- Group processes based on their knees (a simple knee-finding heuristic is sketched below).
[Figure: miss-rate curves for Processes A, B, C, and D, each with a visible knee]
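The talk determines knees with the partitioning technique itself (next slide). As a simpler stand-in, the heuristic below is entirely our own assumption, including the 5% threshold: take the first cache size at which the marginal gain falls below a fraction of its peak.

```python
def find_knee(curve, refs, frac=0.05):
    """Heuristic knee of a miss-rate curve (our own simplification):
    the smallest cache size c whose marginal gain g(c) drops below
    `frac` times the largest marginal gain."""
    gains = [(curve[c] - curve[c + 1]) * refs for c in range(len(curve) - 1)]
    peak = max(gains)
    for c, g in enumerate(gains):
        if g < frac * peak:
            return c
    return len(curve) - 1    # no knee found: the whole curve is steep
```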

Determining the Knee of the Curve
- Use the partitioning technique itself, but with the available cache resource doubled; each process's share of the doubled cache marks the knee of its curve.
- However, we may now need multiple time slices to schedule all the processes (two time slices in our example).
[Figure: cache allocations for the original and the doubled cache]
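Putting the pieces together, here is one hedged reading of this slide as code: run greedy_partition() (from the "Finding the Best Allocation" sketch) with the cache resource scaled by the number of time slices, read off each process's allocation as its knee, and pack processes into slices so each slice's knees fit the real cache. The first-fit-decreasing packing and all names below are our own assumptions, not the talk's algorithm; the curves must cover sizes up to cache_blocks * num_slices.

```python
def knee_schedule(curves, refs, cache_blocks, procs_per_slice=2):
    """Sketch of knee-based memory-aware scheduling: partition a scaled
    (doubled, for two slices) cache, then pack processes into slices."""
    n = len(curves)
    num_slices = (n + procs_per_slice - 1) // procs_per_slice
    alloc = greedy_partition(curves, refs, cache_blocks * num_slices)
    # First-fit decreasing: place the biggest cache consumers first,
    # each into the least-loaded slice that still has a processor free.
    slices = [[] for _ in range(num_slices)]
    load = [0] * num_slices
    for p in sorted(range(n), key=lambda q: -alloc[q]):
        open_slices = [i for i in range(num_slices)
                       if len(slices[i]) < procs_per_slice]
        s = min(open_slices, key=lambda i: load[i])
        slices[s].append(p)
        load[s] += alloc[p]
    return slices
```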

Scheduling Results
- Schedule 6 SPEC CPU benchmarks on 2 processors.

Analytical Model (ICS'01)
- Miss-rate curves (or marginal gains) alone may not be enough for optimizing time-shared systems:
  - partitioning among concurrent processes;
  - scheduling that accounts for the effects of context switches.
- Use an analytical model to predict cache-sharing effects.
[Figure: results for a 32-KB 8-way set-associative cache running bzip2+gcc+swim+mesa+vortex+vpr+twolf+iu]

BUT…
- The processes to execute are only known at run-time: users decide which applications to run, so scheduling/partitioning decisions must be made at run-time.
- The behavior of a process changes over time: applications have different phases, so miss-rate curves (and marginal gains) may change over an execution.
- Cache configurations differ across systems, so miss-rate curves (and marginal gains) differ across systems as well.
- We therefore need an on-line estimate of the miss-rate curves (and marginal gains).

On-Line Estimation of Marginal Gains: Fully-Associative Caches
- Marginal gains can be counted directly from the temporal ordering of cache blocks (the LRU information).
- Use one counter per cache block (or per group of cache blocks), plus one counter for all accesses.
- A hit on the i-th most-recently-used block increments the i-th counter.
- Example: in a fully-associative cache with 4 blocks, a hit on the MRU block increments the 1st counter, and a hit on the 3rd-MRU block increments the 3rd counter.
[Figure: LRU ordering of the cache blocks, the per-position marginal-gain counters, and the access counter]
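A software model of this counter scheme (the real mechanism is a handful of hardware counters attached to the cache's LRU state; the class and method names here are ours):

```python
class MarginalGainMonitorFA:
    """On-line marginal-gain counters for a fully-associative LRU cache.
    counters[i] counts hits on the (i+1)-th most-recently-used block,
    i.e. the extra hits the (i+1)-th block provides over a cache of
    only i blocks: the marginal gain g(i)."""

    def __init__(self, num_blocks):
        self.stack = []                    # LRU stack: index 0 = MRU
        self.counters = [0] * num_blocks
        self.accesses = 0
        self.num_blocks = num_blocks

    def access(self, block):
        self.accesses += 1
        if block in self.stack:
            i = self.stack.index(block)    # LRU position of the hit
            self.counters[i] += 1
            self.stack.pop(i)
        elif len(self.stack) == self.num_blocks:
            self.stack.pop()               # evict the true LRU block
        self.stack.insert(0, block)        # block becomes MRU

    def miss_rate_curve(self):
        # m(c) = 1 - (hits a c-block cache would capture) / accesses
        hits, curve = 0, [1.0]             # m(0) = 1
        for c in range(1, self.num_blocks + 1):
            hits += self.counters[c - 1]
            curve.append(1 - hits / self.accesses)
        return curve
```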

BUT…
- Most caches are SET-ASSOCIATIVE (main memory being the notable exception), and usually at most 8-way associative.
- Set-associative caches only maintain a temporal ordering within each set; there is no global temporal ordering.
- So we cannot use block-by-block temporal ordering to obtain the marginal gains of a fully-associative cache.

Way-Counters
- Use the LRU information that already exists within each set.
- One counter per way (a D-way cache needs D counters).
- A hit on the i-th most-recently-used block of a set increments the i-th counter.
- Each way-counter then represents the gain of having S more cache blocks, where S is the number of sets.
[Figure: a 4-way set-associative cache with S sets; a hit on a set's MRU block increments the 1st counter, and a hit on a set's 2nd-MRU block increments the 2nd counter]
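The same kind of software model for way-counters (class name and the modulo set-indexing are our assumptions):

```python
class WayCounters:
    """Way-counters for a set-associative LRU cache: one counter per way.
    A hit on the i-th MRU block *within its set* increments counter i,
    so counters[i] approximates the gain of growing the cache from
    i*S to (i+1)*S blocks, where S is the number of sets."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets = [[] for _ in range(num_sets)]  # per-set LRU stacks
        self.counters = [0] * num_ways
        self.accesses = 0

    def access(self, block):
        self.accesses += 1
        s = self.sets[block % self.num_sets]       # simple set indexing
        if block in s:
            i = s.index(block)                     # LRU rank within the set
            self.counters[i] += 1
            s.pop(i)
        elif len(s) == self.num_ways:
            s.pop()                                # evict the set's LRU block
        s.insert(0, block)                         # block becomes set MRU
```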

Way+Set Counters
- Use more counters to capture more detailed information: additionally maintain LRU ordering across groups of sets.
- A hit on the i-th MRU way of the j-th MRU set group increments counter(i, j).
- Example: in a 2-way associative cache with set groups 0 through S', a hit on the MRU way of the 2nd-MRU group increments counter(0, 1).
[Figure: a 2-way associative cache, the temporal ordering of the set groups, and the (way, group) counters]
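Extending the previous sketch with a recency ordering over set groups; the counter layout follows the slide, but the group-assignment rule and names are our own reading:

```python
class WaySetCounters:
    """Way+set counters: counters[i][j] counts hits on the i-th MRU way
    of the j-th most-recently-used *group of sets*. Tracking recency
    across set groups refines the plain way-counter estimate."""

    def __init__(self, num_sets, num_ways, num_groups):
        self.num_sets, self.num_ways = num_sets, num_ways
        self.sets = [[] for _ in range(num_sets)]
        self.group_of = lambda set_idx: set_idx * num_groups // num_sets
        self.group_lru = list(range(num_groups))   # index 0 = MRU group
        self.counters = [[0] * num_groups for _ in range(num_ways)]

    def access(self, block):
        set_idx = block % self.num_sets
        s = self.sets[set_idx]
        g = self.group_of(set_idx)
        j = self.group_lru.index(g)                # group's recency rank
        if block in s:
            i = s.index(block)
            self.counters[i][j] += 1               # hit on (way i, group j)
            s.pop(i)
        elif len(s) == self.num_ways:
            s.pop()
        s.insert(0, block)
        self.group_lru.remove(g)                   # group becomes MRU
        self.group_lru.insert(0, g)
```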

Summary
- Caches should be managed more carefully, considering the effects of space- and time-sharing:
  - cache partitioning;
  - memory-aware scheduling.
- Miss-rate curves provide very relevant information for scheduling and partitioning:
  - they let us predict the effect of varying the cache space;
  - they are useful for any trade-off between performance and space (or power).
- On-line counters can estimate miss-rate curves at run-time:
  - they use the temporal ordering of blocks to predict miss-rates for smaller caches;
  - they work for both fully-associative and set-associative caches.

Partitioning Mechanism
- Modify the LRU replacement policy to enforce the partition.
- Count the number of cache blocks currently held by each process (X_A for Process A).
- Try to match X_A to the allocated cache space D_A.
- Replacement:
  - replace Process A's LRU block if X_A > D_A (A holds more than its allocation);
  - replace Process B's LRU block if X_B > D_B;
  - replace the standard LRU block if there is no over-allocated process.
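A behavioral sketch of this modified replacement policy, using a fully-associative model for brevity (the hardware applies the same rule per set; class and variable names are ours):

```python
class PartitionedLRUCache:
    """LRU cache with partition-enforcing replacement: on a miss, evict
    the LRU block of an over-allocated process (X_p > D_p); if no
    process is over-allocated, evict the global LRU block."""

    def __init__(self, num_blocks, quota):
        self.num_blocks = num_blocks
        self.quota = quota                  # D_p: allocated blocks per process
        self.stack = []                     # (process, block), index 0 = MRU
        self.owned = {p: 0 for p in quota}  # X_p: blocks currently held

    def access(self, proc, block):
        entry = (proc, block)
        if entry in self.stack:             # hit: just update recency
            self.stack.remove(entry)
        else:                               # miss: may need a replacement
            if len(self.stack) == self.num_blocks:
                victim = self._choose_victim()
                self.stack.remove(victim)
                self.owned[victim[0]] -= 1
            self.owned[proc] += 1
        self.stack.insert(0, entry)

    def _choose_victim(self):
        # Prefer the LRU block of any over-allocated process.
        for entry in reversed(self.stack):  # scan from the LRU end
            if self.owned[entry[0]] > self.quota[entry[0]]:
                return entry
        return self.stack[-1]               # standard LRU block
```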

Scheduling: L2

Way-Counter Implementation
[Figure: hardware sketch. Each cache line stores way-LRU bits alongside its valid bit, tag, and data; on a hit, the way-LRU position of the hit block selects which counter (e.g., Counter(1), Counter(2)) to increment.]

Way+Set Counter Implementation
[Figure: hardware sketch. In addition to the per-set way-LRU bits, each group of sets maintains set-LRU ordering bits; on a hit, the way-LRU position of the hit block and the set-LRU rank of its group together select the counter to increment (e.g., Counter(4)).]