

Improving Cache Management Policies Using Dynamic Reuse Distances
Nam Duong (1), Dali Zhao (1), Taesu Kim (1), Rosario Cammarota (1), Mateo Valero (2), Alexander V. Veidenbaum (1)
(1) University of California, Irvine
(2) Universitat Politecnica de Catalunya and Barcelona Supercomputing Center

Cache Management
- Cache management has been a hot research topic
- [Diagram: taxonomy of cache management. Categories: single-core, shared-cache, replacement, bypass, partitioning, prefetch. Policies shown: LRU, NRU, EELRU, DIP, RRIP, …; SDP, …; UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, …; PDP]

Overview
- Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution
- Introduced a new concept, Protecting Distance (PD), which is shown to achieve such a balance
- Developed single- and multi-core hit rate models as a function of PD, cache configuration, and program behavior
  - The models are used to dynamically compute the best PD
- Showed that PD-based cache management policies improve performance for both single- and multi-core systems

Outline
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

Definitions
- The (line) reuse distance (RD): the number of accesses to the same cache set between two accesses to the same line
  - This metric is directly related to hit rate
- The reuse distance distribution (RDD)
  - A distribution of observed reuse distances
  - A program signature for a given cache configuration
- [Figure: RDDs of representative benchmarks. X-axis: the RD (< 256)]
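The definition of an RDD can be made concrete with a short sketch. The trace format (a list of line addresses), the modulo set mapping, and the function name are assumptions for illustration only. The sketch also assumes a counting convention in which the reuse access itself is counted, so two back-to-back accesses to the same line have RD = 1.

```python
from collections import defaultdict

def reuse_distance_distribution(accesses, num_sets, max_rd=256):
    """Build a per-line RDD from a trace of line addresses (hypothetical format).

    Returns (rdd, total) where rdd[i] is the number of accesses observed at
    reuse distance i and total is the number of accesses in the trace.
    """
    set_clock = defaultdict(int)  # accesses seen so far, per cache set
    last_seen = {}                # line -> set clock value at its last access
    rdd = [0] * max_rd
    total = 0
    for line in accesses:
        s = line % num_sets       # assumed set index: low bits of the line address
        set_clock[s] += 1
        total += 1
        if line in last_seen:
            rd = set_clock[s] - last_seen[line]
            if rd < max_rd:       # only distances below max_rd are recorded
                rdd[rd] += 1
        last_seen[line] = set_clock[s]
    return rdd, total

# Lines 0, 4, 8 share set 0; line 1 maps to set 1.
rdd, total = reuse_distance_distribution([0, 4, 8, 0, 4, 1, 1], num_sets=4)
```

In this small trace, lines 0 and 4 are each reused after three accesses to set 0, and line 1 is reused immediately in set 1.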

Future Behavior Prediction
- Cache management policies use past reference behavior to predict future accesses
  - Prediction accuracy is critical
- Prediction in some of the prior policies
  - LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity)
  - Early eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict
  - RRIP: predicts if a line will be reused in a near, long, or distant future

Balancing Reuse and Cache Pollution
- Key to good performance (high hit rate)
  - Cache lines must be reused as much as possible before eviction
  - AND must be evicted soon after the last reuse to give space to new lines
- The former can be achieved by using the reuse distance and actively preventing eviction
  - "Protecting" a line from eviction
- The latter can be achieved by evicting a line when it is not reused within this distance
- There is an optimal reuse distance balancing the two
  - It is called the Protecting Distance (PD)

Example: 436.cactusADM
- A majority of lines are reused at 64 or fewer accesses
  - There are multiple peaks at different reuse distances
- Reuse is maximized if lines are kept in the cache for 64 accesses
  - Lines may not be reused if evicted before that
  - Lines kept beyond that are likely to pollute the cache
- Assume that no lines are kept longer than a given RD

The Protecting Distance (PD)
- A distance at which a majority of lines are covered
  - A single value for all sets
  - Predicted based on the current RDD
- Questions to answer
  - Why does using the PD achieve the balance?
  - How to dynamically find the PD for an application and a cache configuration?
  - How to build the PD-based management policies?

Outline
1. The concept of Protecting Distance
2. Single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

The Single-core PDP
- A cache tag contains a line's remaining PD (RPD)
  - A line can be evicted when its RPD = 0
  - The RPD of an inserted or promoted line is set to the predicted PD
  - RPDs of other lines in a set are decremented
- Example: a 4-way cache, the predicted PD is 7
  - A line is promoted on a hit
  - [Figure: a set with RPDs before and after the hit access. Legend: reused line, inserted line (unused)]
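The RPD bookkeeping on this slide can be sketched in a few lines. The function name and the starting RPD values are hypothetical (the values in the slide's figure are not preserved in the transcript), and the sketch assumes full-precision RPD counters that saturate at 0 rather than any compact per-tag encoding.

```python
def touch(rpds, way, pd):
    """Return the RPDs of one set after `way` is inserted or promoted.

    The touched line is re-protected with the predicted PD; every other
    line in the set moves one access closer to being evictable.
    """
    return [pd if w == way else max(rpd - 1, 0) for w, rpd in enumerate(rpds)]

# A 4-way set with predicted PD = 7; the line in way 1 hits and is promoted.
before = [1, 4, 6, 3]          # illustrative RPDs, not taken from the slide
after = touch(before, way=1, pd=7)
```

After the hit, way 0 reaches RPD = 0 and becomes a candidate for eviction, while the promoted line in way 1 is protected for another 7 accesses to the set.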

The Single-core PDP (Cont.)
- Selecting a victim on a miss
  - A line with RPD = 0 can be replaced
- Two cases when all RPDs > 0 (no unprotected lines)
  - Caches without bypass (inclusive):
    - Unused lines are less likely to be reused than reused lines
    - Replace the unused line with the highest RPD first
    - No unused line: replace the line with the highest RPD
  - Caches with bypass (non-inclusive): bypass the new line
- [Figure legend: reused line, inserted line (unused)]
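The victim selection rules above can be sketched as follows. The `Line` record, the `select_victim` name, and the tie-breaking (lowest way index wins) are assumptions made for illustration; the slide does not specify how ties are resolved.

```python
from dataclasses import dataclass

@dataclass
class Line:
    rpd: int       # remaining protecting distance
    reused: bool   # True once the line has hit at least once since insertion

def select_victim(lines, bypass_allowed):
    """Return the way to replace on a miss, or None to bypass the new line."""
    # Case 1: any unprotected line (RPD == 0) can be replaced.
    for way, line in enumerate(lines):
        if line.rpd == 0:
            return way
    # Case 2: all lines are protected.
    if bypass_allowed:
        return None  # non-inclusive cache: the new line bypasses the cache
    # Inclusive cache: prefer unused lines, then pick the highest RPD.
    unused = [w for w, line in enumerate(lines) if not line.reused]
    candidates = unused if unused else list(range(len(lines)))
    return max(candidates, key=lambda w: lines[w].rpd)

ways = [Line(3, True), Line(5, False), Line(6, True), Line(2, False)]
victim_no_bypass = select_victim(ways, bypass_allowed=False)
victim_bypass = select_victim(ways, bypass_allowed=True)
```

Here no line is unprotected, so the inclusive variant picks the unused line with the highest RPD (way 1) and the non-inclusive variant bypasses.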

Evaluation of the Static PDP
- Static PDP: use the best static PD for each benchmark
  - PD < 256
  - SPDP-NB: static PDP with replacement only
  - SPDP-B: static PDP with replacement and bypass
- Performance: in general, DRRIP < SPDP-NB < SPDP-B
  - 436.cactusADM: a 10% additional miss reduction
  - The two static PDP policies have similar performance
  - 483.xalancbmk: 3 different execution windows have different behavior for SPDP-B

436.cactusADM: Explaining the Performance Difference
- How do the evicted lines occupy the cache?
  - DRRIP:
    - Early evicted lines: 75% of accesses, but occupy only 4% of the cache
    - Late evicted lines: 2% of accesses, but occupy 8% of the cache (pollution)
  - SPDP-NB:
    - Early and late evicted lines: 42% of accesses, but occupy only 4% of the cache
  - SPDP-B:
    - Late evicted lines: 1% of accesses, occupy 3% of the cache, yielding cache space to useful lines
- PDP has less pollution caused by long-RD lines in the cache than RRIP

Case Study: 483.xalancbmk
- The best PD is different in different windows
  - And for different programs
- Need a dynamic policy that finds the best PD
  - Need a model to drive the search
- There is a close relationship between the hit rate, the PD, and the RDD

A Hit Rate Model for Non-inclusive Cache
- The model estimates the hit rate as a function of d_p and the RDD
  - {N_i}, N_t: the RDD
  - d_p: the protecting distance
  - d_e: experimentally set to W (W: cache associativity)
- [Figure: RDD, E, and hit rate]
- Used to find the PD maximizing the hit rate
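The model's equation is not preserved in this transcript, so the sketch below only illustrates how a PD search over an RDD could work. It assumes, without confirmation from the slide, that E has a "hits per unit of occupancy" form: lines reused at distance i <= d_p count as hits and occupy a way for i accesses, while all other lines occupy a way for d_p + d_e accesses. The function name and the example counts are hypothetical.

```python
def best_pd(n, n_t, d_e):
    """Scan candidate protecting distances and return (d_p, E) with the largest E.

    n[i]  : number of accesses observed at reuse distance i (n[0] is unused)
    n_t   : total number of accesses
    d_e   : extra occupancy of lines not reused within d_p
    ASSUMED form: E(d_p) = sum_{i<=d_p} N_i /
                  (sum_{i<=d_p} i*N_i + (N_t - sum_{i<=d_p} N_i) * (d_p + d_e))
    """
    best_d, best_e = None, -1.0
    hits = 0        # running sum of N_i for i <= d_p
    occupancy = 0   # running sum of i * N_i for i <= d_p
    for d_p in range(1, len(n)):
        hits += n[d_p]
        occupancy += d_p * n[d_p]
        denom = occupancy + (n_t - hits) * (d_p + d_e)
        e = hits / denom if denom else 0.0
        if e > best_e:
            best_d, best_e = d_p, e
    return best_d, best_e

# Illustrative RDD: a strong peak at RD = 4 and a weak one at RD = 8.
counts = [0, 0, 0, 0, 10, 0, 0, 0, 1]
pd, e = best_pd(counts, n_t=20, d_e=4)
```

With these counts the search settles on the strong peak: protecting lines out to RD = 8 would capture one more reuse but keep the many non-reused lines in the cache longer.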

PDP Cache Organization
- RD Sampler tracks accesses to several cache sets
  - In the L2 miss/WB stream; can reduce the sampling rate
  - Measures the reuse distance of a new access
- RD Counter Array collects the number of accesses at RD = i, and N_t
  - To reduce overhead, each counter covers a range of RDs
- PD Compute Logic finds the PD that maximizes E
  - The computed PD is used in the next interval (0.5M L3 accesses)
- Reasonable hardware overhead
  - 2 or 3 bits per tag to store the RPD
- [Diagram: the LLC between the higher level and main memory, with the RD Sampler, RD Counter Array, and PD Compute Logic. Signals: access address, RD, RDD, PD]

PDP vs. Existing Policies

Management   Supported policy (*)    Balance            Distance       Model
policy       Replacement  Bypass     Reuse  Pollution   measurement
LRU          Yes          No         No     Yes         Stack-based    No
EELRU [1]    Yes          No         No     Yes         Stack-based    Probabilistic
DIP [2]      Yes          No         Yes    No          N/A            No
RRIP [3]     Yes          No         Yes    No          N/A            No
SDP [4]      No           Yes        Yes    No          N/A            No
PDP          Yes          Yes        Yes    Yes         Access-based   Hit rate

(*) Originally proposed

- EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance
  - However, lines are not always guaranteed to be protected

[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS '99.
[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA '07.
[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA '10.
[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO '10.

Outline
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

PD-based Shared Cache Partitioning
- Each thread has its own PD (thread-aware)
  - Counter array replicated per thread
  - Sampler and compute logic shared
- A thread's PD determines its cache partition
  - Its lines occupy the cache longer if its PD is large
  - The cache is implicitly partitioned per the needs of each thread using thread PDs
- The problem is to find a set of thread PDs that together maximize the hit rate

Shared-Cache Hit Rate Model
- Extending the single-core approach
  - Compute a vector of thread PDs (T = number of threads)
  - Exhaustive search for this vector is not practical
- A heuristic search algorithm finds a combination of the threads' RDD peaks that maximizes the hit rate
  - The single-core model generates the top 3 peaks per thread
  - The complexity is O(T^2)
- See the paper for more detail

Outline
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

Evaluation Methodology
- CMP$im simulator, LLC replacement
- Target cache: LLC

Cache            Params
DCache           32KB, 8-way, 64B, 2 cycles
ICache           32KB, 4-way, 64B, 2 cycles
L2Cache          256KB, 8-way, 64B, 10 cycles
L3Cache (LLC)    2MB, 16-way, 64B, 30 cycles
Memory           200 cycles

Evaluation Methodology (Cont.)
- Benchmarks: SPEC CPU 2006
  - Excluded those that did not stress the LLC
- Single-core: compared to EELRU, SDP, DIP, DRRIP
- Multi-core
  - 4- and 16-core configurations, 80 workloads each
  - The workloads were generated by randomly combining benchmarks
  - Compared to UCP, PIPP, TA-DRRIP
- Our policy: PDP-x, where x is the number of bits per cache line

Single-core PDP
- PDP-x, where x is the number of bits per cache line
- Each benchmark is executed for 1B instructions
- Best with 3 bits per line, but still better than prior work with 2 bits

Adaptation to Program Phases
- 5 benchmarks that demonstrate significant phase changes
- Each benchmark is run for 5B instructions
- [Figure: change of PD. X-axis: 1M LLC accesses]

Adaptation to Program Phases (Cont.)
- [Figure: IPC improvement over DIP]

PD-based Cache Partitioning for 16 Cores
- [Figure: results normalized to TA-DRRIP]

Hardware Overhead

Policy    Per-line bits    Overhead (%)
DIP       4                0.8%
RRIP      2                0.4%
SDP       4                1.4%
PDP-2     2                0.6%
PDP-3     3                0.8%
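As a rough sanity check on the table, the per-line bits alone can be expressed as a fraction of a 64B line's data bits. This is an assumption about how the overhead is measured (the slide does not state the baseline), and the function name is hypothetical.

```python
def per_line_overhead_pct(bits_per_line, line_bytes=64):
    """Per-line metadata bits as a percentage of the line's data bits.

    ASSUMPTION: overhead is taken relative to the data array only
    (64B = 512 bits per line); tag bits and other structures are ignored.
    """
    return 100.0 * bits_per_line / (line_bytes * 8)

rrip = per_line_overhead_pct(2)   # 2 bits per line
dip = per_line_overhead_pct(4)    # 4 bits per line
pdp2 = per_line_overhead_pct(2)   # the per-line part of PDP-2 only
```

Under this assumption the per-line bits account for the RRIP and DIP rows after rounding. For SDP and PDP the table's figures are higher than the per-line bits alone would give, which would be consistent with storage outside the tag array, though the slide does not break this down.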

Other Results
- Exploration of PDP cache parameters
- Cache bypass fraction
- Prefetch-aware PDP
- PD-based cache management policy for 4-core

Conclusions
- Proposed the concept of Protecting Distance (PD)
  - Showed that it can be used to better balance reuse and cache pollution
- Developed a hit rate model as a function of the PD, program behavior, and cache configuration
- Proposed PD-based management policies for both single- and multi-core systems
- PD-based policies outperform existing policies

Thank You!

Backup Slides
- RDD, E, and hit rate of all benchmarks

RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks
- [Figures: per-benchmark plots, continued over three slides]