Presentation transcript:

1 Improving Cache Performance by Combining Cost-Sensitivity and Locality Principles in Cache Replacement Algorithms
Rami Sheikh, North Carolina State University
Mazen Kharbutli, Jordan Univ. of Science and Technology
ICCD 2010, Amsterdam, the Netherlands

Outline 2: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

Motivation 3
The processor-memory performance gap makes L2 cache performance crucial.
Traditionally, L2 cache replacement algorithms focus on improving the hit rate. But cache misses have different costs, so it is better to take the cost of a miss into consideration.
The processor's ability to (partially) hide the L2 cache miss latency differs between misses. It depends on the dependency chain, miss bursts, etc.

Motivation 4: Issued Instructions per Miss histogram (figure).

Contributions 5
A novel, effective, yet simple cost estimation method, based on the number of instructions a processor manages to issue during the miss latency: a reflection of the processor's ability to hide the miss latency.
Few instructions issued during the miss: high-cost miss/block. Many instructions issued: low-cost miss/block.

Contributions 6
LACS: Locality-Aware Cost-Sensitive Cache Replacement Algorithm.
Integrates our novel cost estimation method with a locality algorithm (e.g., LRU).
Attempts to reserve high-cost blocks in the cache while their locality is still high. On a cache miss, a low-cost block is chosen for eviction.
Excellent performance improvement at feasible cost: 15% on average, and up to 85%.
Effective in uniprocessors and CMPs, and for different cache configurations.

Outline 7: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

Related Work 8
Cache replacement algorithms traditionally attempt to reduce the cache miss rate: Belady's OPT algorithm [Belady 1966], dead-block predictors [Kharbutli et al.], and OPT emulators [Rajan 2007].
Cache misses are not uniform and have different costs [Srinivasan 1998, Puzak 2008], motivating a new class of replacement algorithms. Miss cost can be latency, power consumption, penalty, etc.

Related Work 9
Jeong and Dubois [1999, 2003, 2006]: in the context of CC-NUMA multiprocessors, a miss mapping to remote memory costs more than one mapping to local memory. LACS instead estimates cost based on the processor's ability to tolerate the miss latency, not the miss latency value itself.
Jeong et al. [2008]: in the context of uniprocessors, the next access to a block is predicted: load (high cost) or store (low cost), and all load misses are treated equally. LACS does not treat load misses equally (they have different costs), and a store miss may have a high cost.

Related Work 10
Srinivasan et al. [2001]: critical blocks are preserved in a special critical cache, with criticality estimated from the load's dependence chain; no significant improvement under realistic configurations. LACS does not track the dependence chain; it uses a simpler cost heuristic and achieves considerable performance improvement under realistic configurations.

Related Work 11
Qureshi et al. [2006]: based on Memory-Level Parallelism (MLP); cache misses occur in isolation (high cost) or concurrently (low cost). It suffers from pathological cases, so it is integrated with a tournament predictor that chooses between it and LRU (SBAR). LACS does not slow down any of the 20 benchmarks in our study, and it outperforms MLP-SBAR.

Outline 12: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

LACS Storage Organization 13
Block diagram: processor cores with their L1 caches and the L2 cache, augmented with a 32-bit IIC (a counter of issued instructions) and one 32-bit IIR per MSHR entry, used to snapshot the IIC at miss time.
Prediction table: each entry holds a 6-bit hashed tag, a 5-bit cost, and a 1-bit confidence (8K sets x 4 ways x 1.5 bytes/entry = 48 KB).
Total storage overhead: 48 KB, i.e., 9.4% of a 512 KB cache or 4.7% of a 1 MB cache.
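To make the table layout concrete, here is a minimal C sketch of one prediction-table entry; the names, bit-field packing, and types are illustrative assumptions, not the authors' hardware description:

#include <stdint.h>

/* One LACS prediction-table entry, per the slide: 6-bit hashed tag,
   5-bit cost, 1-bit confidence (12 bits of state; the slide's math
   budgets 1.5 bytes per entry). */
typedef struct {
    uint16_t hashed_tag : 6;  /* partial (hashed) tag identifying the block */
    uint16_t cost       : 5;  /* quantized issued-instruction count for the last miss */
    uint16_t confidence : 1;  /* set when consecutive cost estimates agree */
} lacs_entry_t;

/* Capacity from the slide: 8K sets x 4 ways x 1.5 bytes/entry = 48 KB. */
enum { LACS_SETS = 8192, LACS_WAYS = 4 };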

Outline 14: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

LACS Implementation 15
On an L2 cache miss on block B in set S: (1) copy the IIC into the IIR; (2) find a victim; (3) when the miss returns, update B's info.
Step (1): MSHR[B].IIR = IIC

LACS Implementation 16
On an L2 cache miss on block B in set S: (1) copy the IIC into the IIR; (2) find a victim; (3) when the miss returns, update B's info.
Step (2): Identify all low-cost blocks in set S. If there is at least one, choose a victim randomly from among them; otherwise, the LRU block is the victim. Block X is a low-cost block if X.cost > threshold and X.conf == 1 (the cost field counts instructions issued during the miss, so a larger value means an easier-to-hide, lower-cost miss). A sketch of this step follows below.
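A minimal C sketch of the victim-selection step, reusing the lacs_entry_t type from the storage sketch above; lacs_find_victim and lru_victim are hypothetical names, and the slides do not specify a threshold value:

#include <stdlib.h>

/* Assumed to exist elsewhere: baseline LRU victim selection for the set. */
extern int lru_victim(const lacs_entry_t set[LACS_WAYS]);

/* A block is low cost if many instructions were issued during its last
   miss (cost above the threshold) and the estimate is confident. */
static int is_low_cost(const lacs_entry_t *e, unsigned threshold) {
    return e->cost > threshold && e->confidence;
}

/* Step (2): random victim among low-cost blocks; otherwise fall back to LRU. */
int lacs_find_victim(const lacs_entry_t set[LACS_WAYS], unsigned threshold) {
    int low[LACS_WAYS], n = 0;
    for (int w = 0; w < LACS_WAYS; w++)
        if (is_low_cost(&set[w], threshold))
            low[n++] = w;
    return n > 0 ? low[rand() % n] : lru_victim(set);
}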

LACS Implementation 17
On an L2 cache miss on block B in set S: (1) copy the IIC into the IIR; (2) find a victim; (3) when the miss returns, update B's info.
Step (3): When the miss returns, calculate B's new cost: newCost = IIC - MSHR[B].IIR. Then update B's table info: if (newCost == B.cost) B.conf = 1, else B.conf = 0; finally, B.cost = newCost.
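A sketch of steps (1) and (3) in the same assumed C setting; MSHR, IIC, and quantize_5bit are illustrative names (the slides name only IIC and IIR), and the equality test for the confidence bit is an assumption about the slide's garbled update rule:

typedef struct { uint32_t IIR; } mshr_entry_t;  /* per-MSHR snapshot register */

extern uint32_t IIC;                  /* global counter of issued instructions */
extern mshr_entry_t MSHR[];           /* assumed MSHR array, one entry per outstanding miss */
extern uint8_t quantize_5bit(uint32_t issued);  /* hypothetical: maps raw count to 5 bits */

/* Step (1): on a miss to block B, snapshot the IIC into B's MSHR entry. */
void lacs_on_miss(int mshr_idx) {
    MSHR[mshr_idx].IIR = IIC;
}

/* Step (3): when the miss returns, recompute and store B's cost. */
void lacs_on_miss_return(lacs_entry_t *b, int mshr_idx) {
    uint8_t new_cost = quantize_5bit(IIC - MSHR[mshr_idx].IIR);
    b->confidence = (new_cost == b->cost);  /* confident only if the estimate repeats */
    b->cost = new_cost;
}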

Outline 18: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

Evaluation Environment 19
Evaluation using SESC: a detailed, cycle-accurate, execution-driven simulator.
20 of the 26 SPEC2000 benchmarks are used, with reference input sets; 2 billion instructions are simulated after skipping the first 2 billion.
Benchmarks are divided into two groups (GrpA, GrpB). GrpA is L2 cache performance-constrained: ammp, applu, art, equake, gcc, mcf, mgrid, swim, twolf, and vpr.
L2 cache: 512 KB, 8-way, write-back, LRU.

Outline 20: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

Evaluation 21: Performance improvement and L2 cache miss rates (charts).

Evaluation 22
L2 cache miss rates (chart).
Fraction of LRU blocks reserved by LACS that get re-used:
ammp 94%, applu 22%, art 51%, equake 15%, gcc 89%, mcf 1%, mgrid 33%, swim 11%, twolf 21%, vpr 22%.
Low-cost blocks in the cache: <20%. OPT-evicted blocks that were low-cost: 40% to 98%. Strong correlation between the blocks evicted by OPT and their cost.

Evaluation 23: Performance improvement in a CMP architecture (chart).

Evaluation 24
Sensitivity to cache parameters (speedup):

Configuration     Minimum  Average  Maximum
256 KB, 8-way       0%       3%       9%
512 KB, 8-way       0%      15%      85%
1 MB, 8-way        -3%       8%      47%
2 MB, 8-way        -3%      19%     195%
512 KB, 4-way       0%      12%      69%
512 KB, 16-way     -1%      17%     101%

Outline 25: Motivation and Contribution, Related Work, LACS Storage Organization, LACS Implementation, Evaluation Environment, Evaluation, Conclusion

26 LACS's Key Features:
Novelty: a new metric for measuring cost-sensitivity.
Combines two principles: locality and cost-sensitivity.
Performance improvements at feasible cost: 15% average speedup on L2 cache performance-constrained benchmarks.
Effective in uniprocessor and CMP architectures, and for different cache configurations.

27 Thank You! Questions?