Teaching Old Caches New Tricks: Predictor Virtualization
Andreas Moshovos, Univ. of Toronto
Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)

2 Prediction: The Way Forward
[Diagram: a CPU surrounded by predictors – prefetching, branch target and direction, cache replacement, cache hit]
– Prediction has proven useful in many forms – which to choose?
– Application footprints grow, so predictors need to scale to remain effective
– Ideally we want fast, accurate predictions – conventional technology can't deliver this

3 The Problem with Conventional Predictors
Predictors trade off hardware cost, accuracy, and latency:
– What we have: small, fast, not-so-accurate
– What we want: small, fast, accurate
Predictor Virtualization: approximate large, accurate, fast predictors

4 Why Now?
Extra resources: CMPs with large caches
[Diagram: a CMP with four cores, each with its own I$/D$, sharing a multi-MB L2 cache backed by physical memory]

5 Predictor Virtualization (PV)
Use the on-chip cache to store predictor metadata
– Reduce the cost of dedicated predictors
[Diagram: per-core predictors store their metadata in the shared L2 cache]

6 Predictor Virtualization (PV)
Use the on-chip cache to store predictor metadata
– Implement otherwise impractical predictors

7 Research Overview
PV breaks the conventional predictor design trade-offs
– Lowers the cost of adoption
– Facilitates implementation of otherwise impractical predictors
Freeloads on existing resources
– Adapts to demand
Key design challenge:
– How to compensate for the longer latency to metadata
PV in action
– Virtualized "Spatial Memory Streaming"
– Virtualized Branch Target Buffers

8 Talk Roadmap
– PV Architecture
– PV in Action: virtualizing "Spatial Memory Streaming"; virtualizing Branch Target Buffers
– Conclusions

9 PV Architecture
[Diagram: today, an optimization engine sends requests to its dedicated predictor table and gets predictions back; PV virtualizes that table into the L2 cache and physical memory]

10 PV Architecture
[Diagram: the optimization engine now asks a PVProxy for predictions; the PVProxy answers from a small PVCache backed by a PVTable virtualized into the L2 cache and physical memory]
The PVProxy requires access to the L2. It sits on the back side of the L1, which is not as performance critical. (A sketch of this lookup path follows.)
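As a rough illustration, here is a minimal software model of the PVProxy lookup path – a hypothetical sketch, not the hardware design. The class names echo the slide; the set count, entry format, and the policy of returning no prediction while a fill is in flight are all assumptions.

```cpp
// Hypothetical sketch of the PVProxy lookup path: a tiny PVCache answers
// the common case; a miss falls through to the virtualized table (PVTable)
// that lives in the L2/memory hierarchy. Names and sizes are illustrative.
#include <cstdint>
#include <optional>
#include <unordered_map>

struct PredictionEntry { uint64_t tag; uint32_t payload; };

class PVProxy {
public:
    // Returns a prediction if the metadata is on hand; on a PVCache miss
    // the proxy launches a fill and, for this request, predicts "don't know".
    std::optional<uint32_t> lookup(uint64_t key) {
        auto it = pvcache_.find(key % kSets);
        if (it != pvcache_.end() && it->second.tag == key)
            return it->second.payload;   // common case: PVCache hit
        fetchFromL2(key);                // infrequent case: fill from PVTable
        return std::nullopt;             // no prediction while the fill is in flight
    }
private:
    static constexpr uint64_t kSets = 8; // PVCache is deliberately tiny (assumed)
    std::unordered_map<uint64_t, PredictionEntry> pvcache_;
    void fetchFromL2(uint64_t key) {
        // In hardware this is a read of the PVTable region through the back
        // side of the L1, off the critical path; modeled here as a stub fill.
        pvcache_[key % kSets] = PredictionEntry{key, /*payload=*/0};
    }
};
```

Because the fill goes through the back side of the L1, that path can tolerate the extra latency, as the slide notes.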

11 PV Challenge: Prediction Latency
– Common case: PVCache hit – fast
– Infrequent: metadata fill from the L2 cache
– Rare: metadata fill from physical memory – 400 cycles
Key: how to pack metadata in L2 cache blocks to amortize these costs (a back-of-envelope example follows)
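A back-of-envelope sketch of the amortization argument. Every number below except the 400-cycle memory case is an assumption for illustration: one slow fill that brings in a block of metadata is shared by the several predictions that subsequently hit in the PVCache.

```cpp
// Toy amortization arithmetic with assumed latencies (not measured values).
#include <cstdio>

int main() {
    const double pvHit   = 1.0;   // assumed PVCache hit latency (cycles)
    const double l2Fill  = 20.0;  // assumed metadata fill from L2 (cycles)
    const double memFill = 400.0; // rare fill from physical memory (cycles)
    const double fMem    = 0.1;   // assumed fraction of fills that miss in L2
    const int    reuse   = 11;    // predictions served per fetched block

    const double fillCost = (1.0 - fMem) * l2Fill + fMem * memFill;
    const double avg = (fillCost + (reuse - 1) * pvHit) / reuse;
    std::printf("amortized metadata latency ~= %.1f cycles per prediction\n", avg);
    return 0;
}
```

With these assumed numbers the average drops to roughly 6 cycles per prediction, versus a 58-cycle expected fill cost if every access had to reach the L2 or memory.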

12 To Virtualize or Not To Virtualize
Predictors should be redesigned with PV in mind. Overcoming the latency challenge:
– Metadata reuse
  Intrinsic: one entry used for multiple predictions
  Temporal: one entry reused in the near future
  Spatial: one miss amortized by several subsequent hits
– Metadata access pattern predictability
  Enables predictor metadata prefetching
This looks similar to designing caches, BUT:
– Predictions do not have to be correct all the time
– There is a time limit on their usefulness

13 PV in Action
Data prefetching – virtualize "Spatial Memory Streaming" [ISCA06]
– Within 1% of the original performance
– Hardware cost from ~60KB down to < 1KB
Branch prediction – virtualize branch target buffers
– Increases the perceived BTB capacity
– Up to 12.75% IPC improvement with 8% hardware overhead

14 Spatial Memory Streaming [ISCA06]
[Diagram: spatial access patterns observed over memory regions are recorded in a Pattern History Table]
[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, "Spatial Memory Streaming."

15 Spatial Memory Streaming (SMS)
[Diagram: a detector (~1KB) extracts spatial patterns from the data access stream; a predictor (~60KB) maps trigger accesses to patterns and issues prefetches]
The ~60KB predictor table is what PV virtualizes. A simplified sketch of the mechanism follows.
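A heavily simplified sketch of the SMS idea. The real design keys patterns on a PC-plus-region-offset trigger and tracks live region generations; the PC-only keying, pattern accumulation per PC, and the region/block sizes here are all assumptions.

```cpp
// Simplified SMS sketch: the detector records which blocks inside a spatial
// region were touched (a bit vector); the predictor later replays that
// pattern as prefetches when the same trigger PC touches a new region.
#include <cstdint>
#include <bitset>
#include <unordered_map>
#include <vector>

constexpr int kBlocksPerRegion = 32;   // assumed: 2KB region / 64B blocks
using Pattern = std::bitset<kBlocksPerRegion>;

std::unordered_map<uint64_t, Pattern> patternHistory; // keyed by trigger PC (assumed)

// Detector side: mark the block offset touched within the region.
void recordAccess(uint64_t pc, uint64_t addr) {
    patternHistory[pc].set((addr / 64) % kBlocksPerRegion);
}

// Predictor side: on a trigger access, expand the stored pattern into
// prefetch addresses for the newly visited region.
std::vector<uint64_t> predictPrefetches(uint64_t pc, uint64_t regionBase) {
    std::vector<uint64_t> out;
    auto it = patternHistory.find(pc);
    if (it == patternHistory.end()) return out;
    for (int b = 0; b < kBlocksPerRegion; ++b)
        if (it->second.test(b)) out.push_back(regionBase + b * 64);
    return out;
}
```

For example, if a load at some PC touched blocks {0, 2, 3} of one region, the next trigger by that PC in a fresh region prefetches the same three offsets.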

16 Virtualizing SMS
[Diagram: the virtual table packs (pattern, tag) entries into L2 cache lines – 11 ways x 1K sets – while the dedicated PVCache keeps only 11 ways x 8 sets; one L2 cache line holds one set of entries, with a few bytes unused]
Region-level prefetching is naturally tolerant of longer prediction latencies. Simply pack predictor entries spatially (layout sketched below).
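A minimal layout sketch of that packing. The field widths are assumptions chosen so that 11 (tag, pattern) entries fill one 64-byte L2 line, matching the 11-way geometry on the slide:

```cpp
// Illustrative packing of SMS predictor entries into one L2 cache line.
// Entry encoding is assumed; only the 11-ways-per-line shape is from the slide.
#include <cstdint>
#include <cstddef>

#pragma pack(push, 1)
struct SmsEntry {
    uint8_t  tag;      // partial tag identifying the trigger (assumed width)
    uint32_t pattern;  // 32-block spatial bit vector (assumed width)
};                     // 5 bytes per entry under these assumptions
#pragma pack(pop)

struct VirtualTableLine {
    SmsEntry ways[11];                           // 11 entries, as on the slide
    uint8_t  unused[64 - 11 * sizeof(SmsEntry)]; // remainder of the 64B line
};
static_assert(sizeof(VirtualTableLine) == 64, "must match one L2 cache line");
```

Fetching one such line on a PVCache miss brings in an entire 11-way set at once, which is what lets the spatial packing amortize the L2 access.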

17 Experimental Methodology
SimFlex: full-system, cycle-accurate simulator
Baseline processor configuration
– 4-core CMP, OoO cores
– L1D/L1I: 64KB, 4-way set-associative
– Unified L2: 8MB, 16-way set-associative
Commercial workloads
– Web servers: Apache and Zeus
– TPC-C: DB2 and Oracle
– TPC-H: several queries
– Developed by the Impetus group at CMU (Anastasia Ailamaki & Babak Falsafi, PIs)

18 SMS Performance Potential
[Plot: percentage of L1 read misses covered; the conventional predictor degrades with limited storage]

19 Virtualized SMS
[Plot: speedup, higher is better]
Hardware cost: original prefetcher ~60KB; virtualized prefetcher < 1KB

20 Impact of Virtualization on L2 Requests
[Plot: percentage increase in L2 requests]

21 Impact of Virtualization on Off-Chip Bandwidth
[Plot: off-chip bandwidth increase]

22 PV in Action
Data prefetching – virtualize "Spatial Memory Streaming" [ISCA06]
– Same performance
– Hardware cost from ~60KB down to < 1KB
Branch prediction – virtualize branch target buffers
– Increases the perceived BTB capacity
– Up to 12.75% IPC improvement with 8% hardware overhead

23 The Need for Larger BTBs
[Plot: branch MPKI (lower is better) vs. BTB entries]
Commercial applications benefit from large BTBs

24 Virtualizing BTBs: Phantom-BTB
[Diagram: a small, fast dedicated BTB indexed by PC, backed by a large, slow virtual table in the L2 cache]
Latency challenge: branch prediction does not tolerate longer prediction latencies.
Solution: predictor metadata prefetching
– The virtual table is decoupled from the BTB
– A virtual table entry is a temporal group

25 Facilitating Metadata Prefetching
Intuition: programs mostly follow similar paths.
[Diagram: branches recorded along a detection path recur along a subsequent path]

26 Temporal Groups
Past misses are a good indicator of future misses. The dedicated predictor acts as a filter: only misses reach the virtual table.

27 Fetch Trigger
A preceding miss triggers the temporal group fetch. The trigger is not precise – it identifies a region around the miss.

28-29 Temporal Group Prefetching
[Diagram sequence: a BTB miss triggers the fetch of its temporal group from the virtual table into the prefetch buffer]

30 Phantom-BTB Architecture
[Diagram: the BTB, indexed by PC, is augmented with a Temporal Group Generator and a Prefetch Engine, both connected to the L2 cache]
– Temporal Group Generator: generates and installs temporal groups in the L2 cache
– Prefetch Engine: prefetches temporal groups

31 Temporal Group Generation
[Diagram: on a BTB miss, branch metadata flows into the Temporal Group Generator, which installs completed groups in the L2 cache]
– BTB misses generate temporal groups
– BTB hits do not generate any PBTB activity
(A sketch of this flow follows.)
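A hedged sketch of the generation flow, assuming the 6-entry temporal groups from the methodology slide. The group-anchoring policy and the map standing in for the L2-resident virtual table are modeling simplifications, not the actual design:

```cpp
// Sketch of temporal group generation: BTB misses (never hits) stream into
// a small buffer; a full group is installed in the virtual table, keyed by
// the miss that anchored it. Group size follows the methodology slide.
#include <cstdint>
#include <array>
#include <unordered_map>
#include <vector>

struct BranchMeta { uint64_t pc, target; };

class TemporalGroupGenerator {
public:
    void onBtbMiss(const BranchMeta& m) {
        if (count_ == 0) groupKey_ = m.pc;  // first miss anchors the group (assumed)
        group_[count_++] = m;
        if (count_ == kGroupSize) {
            // Install the group; a map stands in for the L2-resident table.
            virtualTable_[groupKey_] =
                std::vector<BranchMeta>(group_.begin(), group_.end());
            count_ = 0;
        }
    }
    // onBtbHit: intentionally nothing -- hits generate no PBTB activity.
private:
    static constexpr int kGroupSize = 6;    // 6-entry temporal group
    std::array<BranchMeta, kGroupSize> group_{};
    int count_ = 0;
    uint64_t groupKey_ = 0;
    std::unordered_map<uint64_t, std::vector<BranchMeta>> virtualTable_;
};
```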

32 Branch Metadata Prefetching
[Diagram: a BTB miss consults the virtual table in the L2 cache and fills the prefetch buffer; lookups probe the BTB and the prefetch buffer in parallel]
– BTB misses trigger metadata prefetches
– Parallel lookup in the BTB and the prefetch buffer
– Prefetch buffer hits supply targets the BTB is missing
(The lookup path is sketched below.)
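A sketch of that lookup path under the same assumptions. Here the caller passes in the BTB probe result, and the parallel lookups are modeled sequentially for clarity; all names are illustrative:

```cpp
// Sketch of the Prefetch Engine lookup path: BTB and prefetch buffer are
// probed for a target; a miss in both pulls the temporal group from the
// virtual table (in L2) into the prefetch buffer for near-future misses.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

using VirtualTable =
    std::unordered_map<uint64_t, std::vector<std::pair<uint64_t, uint64_t>>>;

class PrefetchEngine {
public:
    explicit PrefetchEngine(const VirtualTable& vt) : vt_(vt) {}

    // btbHit carries the BTB probe result; in hardware both probes run in parallel.
    std::optional<uint64_t> lookupTarget(uint64_t pc,
                                         std::optional<uint64_t> btbHit) {
        if (btbHit) return btbHit;                 // BTB hit: nothing to do
        if (auto it = buffer_.find(pc); it != buffer_.end())
            return it->second;                     // prefetch-buffer hit
        prefetchGroup(pc);                         // miss: pull group from L2
        return std::nullopt;
    }
private:
    void prefetchGroup(uint64_t missPc) {
        if (auto it = vt_.find(missPc); it != vt_.end())
            for (auto& [pc, target] : it->second)
                buffer_[pc] = target;              // stage the group members
    }
    const VirtualTable& vt_;
    std::unordered_map<uint64_t, uint64_t> buffer_; // 64 entries in the design
};
```

In the evaluated design the prefetch buffer holds 64 entries; this sketch leaves it unbounded to stay short.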

33 Phantom-BTB Advantages
"Pay-as-you-go" approach
– Practical design
– Increases the perceived BTB capacity
– Dynamic allocation of resources: branch metadata is allocated on demand
– On-the-fly adaptation to application demands
  Branch metadata generation and retrieval happen only on BTB misses – only if the application sees misses
  Metadata survives in the L2 as long as there is sufficient capacity and demand

34 Experimental Methodology
Flexus: cycle-accurate, full-system simulator
Uniprocessor, OoO
– 1K-entry conventional BTB
– 64KB 2-way ICache/DCache
– 4MB 16-way L2 cache
Phantom-BTB
– 64-entry prefetch buffer
– 6-entry temporal groups
– 4K-entry virtual table
Commercial workloads

35 PBTB vs. Conventional BTBs
[Plot: speedup, higher is better]
Performance within 1% of a 4K-entry BTB with 3.6x less storage

36 Phantom-BTB with Larger Dedicated BTBs
[Plot: speedup, higher is better]
PBTB remains effective with larger dedicated BTBs

37 Increase in L2 MPKI
[Plot: L2 MPKI, lower is better]
Marginal increase in L2 misses

38 Increase in L2 Accesses
[Plot: L2 accesses per kilo-instruction]
PBTB follows the application's demand for BTB capacity

39 Summary
Predictor metadata stored in the memory hierarchy
– Benefits: reduces dedicated predictor resources; emulates large predictor tables for increased predictor accuracy
– Why now? Large on-chip caches, CMPs, and the need for large predictors
– PV advantages: predictor adaptation; metadata sharing
Moving forward
– Virtualize other predictors
– Expose the predictor interface to the software level