Presentation transcript:

1 A Case for MLP-Aware Cache Replacement International Symposium on Computer Architecture (ISCA) 2006 Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, Yale N. Patt

2 Memory Level Parallelism (MLP) Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew98]. Several techniques improve MLP (out-of-order execution, runahead, etc.). MLP varies: some misses are isolated and some are parallel. How does this affect cache replacement? [Figure: timeline with misses A, B, C illustrating an isolated miss versus parallel misses]

3 Problem with Traditional Cache Replacement Traditional replacement tries to reduce miss count. Implicit assumption: reducing miss count reduces memory-related stalls. Misses with varying MLP break this assumption! Eliminating an isolated miss helps performance more than eliminating a parallel miss.

4 An Example For a fully associative cache containing 4 blocks. Misses to blocks P1, P2, P3, P4 can be parallel; misses to blocks S1, S2, and S3 are isolated. Two replacement algorithms: 1. Minimize miss count (Belady's OPT) 2. Reduce isolated misses (MLP-Aware) [Figure: access sequence interleaving parallel misses P1-P4 with isolated misses S1-S3]

5 [Figure: hit/miss timelines for the example access sequence under both policies] Belady's OPT minimizes misses (Misses = 4) yet stalls longer (Stalls = 4); MLP-Aware replacement takes more misses (Misses = 6) but stalls less (Stalls = 2, saved cycles). Fewest misses is not always best performance.

6 Motivation MLP varies: some misses are more costly than others. MLP-aware replacement can improve performance by reducing costly misses.

7 Outline
  Introduction
  MLP-Aware Cache Replacement
    Model for Computing Cost
    Repeatability of Cost
    A Cost-Sensitive Replacement Policy
  Practical Hybrid Replacement
    Tournament Selection
    Dynamic Set Sampling
    Sampling Based Adaptive Replacement
  Summary

8 Computing MLP-Based Cost Cost of a miss is the number of cycles the miss stalls the processor. Easy to compute for an isolated miss. For parallel misses, divide each stall cycle equally among all outstanding misses. [Figure: timeline of misses A, B, C between t0 and t5; each stall cycle contributes 1 to a lone miss or ½ to each of two overlapping misses]
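A small worked example with illustrative numbers (not from the slide): suppose miss A stalls the processor alone for 4 cycles, overlaps with miss B for 6 cycles, and B then stalls alone for 2 cycles. A's cost = 4 + 6/2 = 7 cycles and B's cost = 6/2 + 2 = 5 cycles, so the assigned costs (7 + 5 = 12) account for exactly the 12 total stall cycles.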

9 A First-Order Model The Miss Status Holding Register (MSHR) tracks all in-flight misses. Add a field mlp-cost to each MSHR entry. Every cycle, for each demand entry in the MSHR: mlp-cost += 1/N, where N = number of demand misses in the MSHR.
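A minimal sketch of this per-cycle update, assuming a software model of the MSHR (the struct and field names are illustrative, not from the paper):

```cpp
#include <vector>

struct MshrEntry {
    bool   valid     = false;
    bool   is_demand = false;   // only demand misses accumulate cost
    double mlp_cost  = 0.0;     // accumulated MLP-based cost in cycles
};

// Called once per processor cycle.
void update_mlp_cost(std::vector<MshrEntry>& mshr) {
    int n_demand = 0;
    for (const auto& e : mshr)
        if (e.valid && e.is_demand) ++n_demand;
    if (n_demand == 0) return;
    // Each outstanding demand miss gets an equal share of this cycle,
    // so an isolated miss accumulates 1 per cycle and parallel misses
    // split each cycle evenly among themselves.
    for (auto& e : mshr)
        if (e.valid && e.is_demand) e.mlp_cost += 1.0 / n_demand;
}
```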

10 Machine Configuration
  Processor: aggressive, out-of-order, 128-entry instruction window
  L2 Cache: 1MB, 16-way, LRU replacement, 32-entry MSHR
  Memory: 400-cycle bank access, 32 banks
  Bus: round-trip delay of 11 bus cycles (44 processor cycles)

11 Distribution of MLP-Based Cost Cost varies. Does it repeat for a given cache block? [Figure: histogram of MLP-based cost (x-axis) versus % of all L2 misses (y-axis)]

12 Repeatability of Cost An isolated miss can be a parallel miss next time. Can current cost be used to estimate future cost? Let δ = the difference in cost between successive misses to a block. Small δ: cost repeats. Large δ: cost varies significantly.
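One way to measure this repeatability in a simulator, sketched below with hypothetical bookkeeping (a map from block address to the cost of its previous miss):

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>

std::unordered_map<uint64_t, double> last_cost;  // block address -> cost of previous miss

// Returns delta = |current cost - previous cost| for this block,
// or -1.0 on the first miss to the block (no history yet).
double cost_delta(uint64_t block_addr, double cur_cost) {
    auto it = last_cost.find(block_addr);
    double delta = (it == last_cost.end()) ? -1.0
                                           : std::fabs(cur_cost - it->second);
    last_cost[block_addr] = cur_cost;
    return delta;
}
```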

13 Repeatability of Cost In general δ is small, so cost is repeatable. When δ is large (e.g. parser, mgrid), relying on the previous cost causes performance loss. [Figure: per-benchmark distribution of δ (buckets such as δ < 60)]

14 The Framework [Figure: block diagram: processor with I-cache and D-cache; MSHR with Cost Calculation Logic (CCL); L2 cache with Cost-Aware Replacement Engine (CARE); memory] Quantization of Cost: the computed MLP-based cost is quantized to a 3-bit value.
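A sketch of such a quantizer; the slide only fixes the width at 3 bits, so the 60-cycle bucket size below is an assumption for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Map a cost in cycles to the 3-bit value stored with the block.
// bucket_width is an assumed granularity, not specified on the slide.
uint8_t quantize_cost(double cost_cycles, double bucket_width = 60.0) {
    int q = static_cast<int>(cost_cycles / bucket_width);
    return static_cast<uint8_t>(std::clamp(q, 0, 7));  // 3 bits: 0..7
}
```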

15 Design of MLP-Aware Replacement Policy LRU considers only recency and no cost: Victim-LRU = min { Recency(i) }. Decisions based only on cost and no recency hurt performance, since the cache stores useless high-cost blocks. A Linear (LIN) function considers both recency and cost: Victim-LIN = min { Recency(i) + S*cost(i) }, where Recency(i) = position in the LRU stack, cost(i) = quantized cost, and S = the significance of cost.
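A minimal sketch of LIN victim selection for one set (the struct layout and names are illustrative):

```cpp
#include <cstdint>
#include <vector>

struct Way {
    uint8_t recency;  // position in LRU stack: 0 = least recently used
    uint8_t qcost;    // 3-bit quantized MLP-based cost (0..7)
};

// Returns the index of the victim way under LIN: the way minimizing
// Recency(i) + S * cost(i). With S = 0 this degenerates to plain LRU.
int lin_victim(const std::vector<Way>& set, int S) {
    int victim = 0;
    int best = set[0].recency + S * set[0].qcost;
    for (int i = 1; i < static_cast<int>(set.size()); ++i) {
        int score = set[i].recency + S * set[i].qcost;
        if (score < best) { best = score; victim = i; }
    }
    return victim;
}
```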

16 Results for the LIN policy Performance loss for parser and mgrid is due to large δ.

17 Effect of LIN policy on Cost [Figure: cost distributions under LIN versus LRU; annotated cases include Misses +4% with IPC +4%, Misses +30% with IPC -33%, and Misses -11% with IPC +22%]

18 Outline
  Introduction
  MLP-Aware Cache Replacement
    Model for Computing Cost
    Repeatability of Cost
    A Cost-Sensitive Replacement Policy
  Practical Hybrid Replacement
    Tournament Selection
    Dynamic Set Sampling
    Sampling Based Adaptive Replacement
  Summary

19 Tournament Selection (TSEL) of Replacement Policies for a Single Set Two auxiliary tag directories (ATD-LIN and ATD-LRU) both track set A and update a saturating counter (SCTR):
  ATD-LIN hit, ATD-LRU hit: SCTR unchanged
  ATD-LIN miss, ATD-LRU miss: SCTR unchanged
  ATD-LIN hit, ATD-LRU miss: SCTR += cost of miss in ATD-LRU
  ATD-LIN miss, ATD-LRU hit: SCTR -= cost of miss in ATD-LIN
If the MSB of SCTR is 1, the MTD (main tag directory) uses LIN; else the MTD uses LRU.
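A sketch of this counter update; the rules follow the table above, while the counter width is an assumption:

```cpp
#include <algorithm>

constexpr int SCTR_MAX = 1023;   // 10-bit counter width is an assumption
int sctr = SCTR_MAX / 2;

// Apply the TSEL update for one access observed by both ATDs.
void tsel_update(bool hit_lin, bool hit_lru,
                 int miss_cost_lru, int miss_cost_lin) {
    if (hit_lin == hit_lru) return;  // both hit or both miss: unchanged
    if (hit_lin)                     // LIN avoided a miss that LRU took
        sctr = std::min(sctr + miss_cost_lru, SCTR_MAX);
    else                             // LRU avoided a miss that LIN took
        sctr = std::max(sctr - miss_cost_lin, 0);
}

// MTD uses LIN when the counter's MSB is set (sctr >= 512 here).
bool use_lin() { return sctr > SCTR_MAX / 2; }
```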

20 Extending TSEL to All Sets Implementing TSEL on a per-set basis is expensive. Counter overhead can be reduced by using a single global SCTR that chooses the policy for all sets in the MTD. [Figure: ATD-LIN and ATD-LRU each duplicating sets A through H, feeding one global SCTR]

21 Dynamic Set Sampling Not all sets are required to decide the best policy. Keep ATD entries for only a few sets; sets that have ATD entries (B, E, G) are called leader sets. [Figure: ATD-LIN and ATD-LRU containing only sets B, E, G; the global SCTR chooses the policy for all sets in the MTD]

22 Dynamic Set Sampling How many sets are required to choose the best-performing policy? Bounds from an analytical model and simulation (in paper) show that DSS with 32 leader sets performs similarly to having all sets. A last-level cache typically contains 1000s of sets, so ATD entries are required for only 2%-3% of the sets. ATD overhead can be reduced further by using the MTD to always simulate one of the policies (say LIN).
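One simple way to designate leader sets, sketched under the assumption of a statically spaced mapping (the constants match the baseline 1MB, 16-way cache with 64B lines; the spacing rule itself is illustrative):

```cpp
#include <cstdint>

constexpr uint32_t NUM_SETS    = 1024;  // 1MB / (16 ways * 64B lines)
constexpr uint32_t NUM_LEADERS = 32;    // 32 leader sets suffice per the slide

// Designate every (NUM_SETS / NUM_LEADERS)-th set as a leader set.
bool is_leader_set(uint32_t set_index) {
    return set_index % (NUM_SETS / NUM_LEADERS) == 0;
}
```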

23 Sampling Based Adaptive Replacement (SBAR) The MTD itself runs LIN in the leader sets (B, E, G), which are shadowed by ATD-LRU entries; the SCTR they train decides the policy only for the follower sets. The storage overhead of SBAR is less than 2KB (0.2% of the baseline 1MB cache). [Figure: MTD with leader sets B, E, G paired with ATD-LRU copies; the remaining sets are follower sets]
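Putting the pieces together, a sketch of SBAR's per-set policy choice, reusing the hypothetical helpers from the earlier sketches:

```cpp
#include <cstdint>

enum class Policy { LRU, LIN };

bool is_leader_set(uint32_t set_index);  // from the DSS sketch above
bool use_lin();                          // MSB of SCTR, from the TSEL sketch

// Leader sets always run their assigned policy (the MTD simulates LIN
// there, halving the ATD overhead); follower sets use whichever policy
// the saturating counter currently favors.
Policy pick_policy(uint32_t set_index) {
    if (is_leader_set(set_index)) return Policy::LIN;
    return use_lin() ? Policy::LIN : Policy::LRU;
}
```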

24 Results for SBAR

25 SBAR adaptation to phases SBAR selects the best policy for each phase of ammp. [Figure: behavior of ammp over time, with phases alternating between "LIN is better" and "LRU is better"]

26 Outline
  Introduction
  MLP-Aware Cache Replacement
    Model for Computing Cost
    Repeatability of Cost
    A Cost-Sensitive Replacement Policy
  Practical Hybrid Replacement
    Tournament Selection
    Dynamic Set Sampling
    Sampling Based Adaptive Replacement
  Summary

27 Summary
  MLP varies: some misses are more costly than others.
  MLP-aware cache replacement can reduce costly misses.
  Proposed a runtime mechanism to compute MLP-based cost, and the LIN policy for MLP-aware cache replacement.
  SBAR allows dynamic selection between LIN and LRU with low hardware overhead.
  Dynamic set sampling, used in SBAR, also enables other cache-related optimizations.

28 Questions

29 Effect of number and selection of leader sets

30 Comparison with ACL ACL requires 33 times more overhead than SBAR

31 Analytical Model for DSS

32 Algorithm for computing cost

33 The Framework [Figure: block diagram as on slide 14 (processor, I-cache/D-cache, MSHR with CCL, L2 cache with CARE, memory), plus the quantization table mapping the computed cost value in cycles to the stored 3-bit value]

34 Future Work
  Extensions for MLP-Aware Replacement: large instruction window processors (runahead, CFP, etc.); interaction with prefetchers.
  Extensions for SBAR: multiple replacement policies; separate replacement for demand and prefetched lines.
  Extensions for Dynamic Set Sampling: runtime monitoring of cache behavior; tuning aggressiveness of prefetchers.