
15-740/18-740 Computer Architecture
Lecture 17: Prefetching, Caching, Multi-core
Prof. Onur Mutlu
Carnegie Mellon University

Announcements
Milestone meetings: meet with Evangelos, Lavanya, Vivek, and me, especially if you received my feedback and I asked to meet.

Last Time
Markov prefetching
Content-directed prefetching
Execution-based prefetchers

Multi-Core Issues in Prefetching and Caching

Prefetching in Multi-Core (I)
Prefetching shared data: coherence misses
Prefetch efficiency is a lot more important: bus bandwidth is more precious
Prefetchers on different cores can deny service to each other and to demand requests: DRAM contention, bus contention, cache conflicts
Need to coordinate the actions of independent prefetchers for best system performance: each prefetcher has different accuracy, coverage, timeliness

Shortcoming of Local Prefetcher Throttling
(Figure: Cores 0 and 1 locally throttle their prefetchers up under FDP (prefetcher degree 2 → 4); in the shared cache sets, their prefetched blocks (Pref 0, Pref 1) displace the demand blocks of Cores 2 and 3 (Dem 2, Dem 3).)
Local-only prefetcher control techniques have no mechanism to detect inter-core interference.

Prefetching in Multi-Core (II)
Ideas for coordinating different prefetchers' actions:
Utility-based prioritization: prioritize prefetchers that provide the best marginal utility on system performance
Cost-benefit analysis: compute the cost-benefit of each prefetcher to drive prioritization
Heuristic-based methods: a global controller overrides the local controller's throttling decision based on the interference and accuracy of prefetchers
Ebrahimi et al., "Coordinated Control of Multiple Prefetchers in Multi-Core Systems," MICRO 2009.

Hierarchical Prefetcher Throttling
Local control's goal: maximize the prefetching performance of core i independently.
Global control's goal: keep track of and control prefetcher-caused inter-core interference in the shared memory system.
Global control accepts or overrides decisions made by local control to improve overall system performance.
(Figure: for core i's prefetcher, local control uses the prefetcher's accuracy to make a local throttling decision; global control combines that decision with cache pollution feedback from the shared cache and bandwidth feedback from the memory controller to produce the final throttling decision.)

Hierarchical Prefetcher Throttling Example
- High accuracy
- High pollution
- High bandwidth consumed while other cores need bandwidth
(Figure: local control sees high accuracy Acc(i) and decides to throttle up; global control sees high pollution Pol(i) from the pollution filter and high bandwidth use BW(i) while other cores need bandwidth (high BWNO(i)), so it enforces throttle down as the final decision.)
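A minimal sketch of this global override logic, assuming simple normalized metrics for accuracy, pollution, and bandwidth and made-up thresholds (the real mechanism uses hardware feedback counters, not these names):

```python
# Sketch of hierarchical prefetcher throttling: global control may accept or
# override the local (per-core) throttling decision when the prefetcher
# interferes with other cores. Metric names and thresholds are illustrative
# assumptions, not the actual hardware interface.

def global_throttle_decision(local_decision, accuracy, pollution,
                             bw_used, bw_needed_by_others,
                             acc_thresh=0.6, pol_thresh=0.1, bw_thresh=0.5):
    """Return the final throttling decision for core i's prefetcher."""
    if pollution > pol_thresh:
        return "throttle_down"    # prefetches evict other cores' useful blocks
    if bw_used > bw_thresh and bw_needed_by_others > bw_thresh:
        return "throttle_down"    # bandwidth consumed while other cores need it
    if accuracy < acc_thresh:
        return "throttle_down"    # inaccurate prefetcher: never let it throttle up
    return local_decision         # no inter-core interference: accept local decision

# The example above: locally high accuracy (local control says throttle up), but
# high pollution and high bandwidth use while other cores need bandwidth, so
# global control enforces throttle down.
print(global_throttle_decision("throttle_up", accuracy=0.9, pollution=0.4,
                               bw_used=0.8, bw_needed_by_others=0.7))
```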

Multi-Core Issues in Caching
More pressure on the memory/cache hierarchy → cache efficiency a lot more important
Private versus shared caching
Providing fairness/QoS in shared multi-core caches
Migration of shared data in private caches
How to organize/connect caches: non-uniform cache access and cache interconnect design
Placement/insertion: identifying what is most profitable to insert into the cache; minimizing dead/useless blocks
Replacement: cost-aware: which block is most profitable to keep?

Cache Coherence
Basic question: If multiple processors cache the same block, how do they ensure they all see a consistent state?
(Figure: two processors P1 and P2, each with a private cache, connected through an interconnection network to main memory, where location x holds the value 1000.)

The Cache Coherence Problem
One processor executes ld r2, x and caches the value 1000 (x = 1000 in main memory).

The Cache Coherence Problem
The other processor also executes ld r2, x; both caches now hold x = 1000.

The Cache Coherence Problem
One of the processors then executes add r1, r2, r4 and st x, r1, updating x to 2000 in its own cache; the other processor's cache (and main memory) still holds 1000.

The Cache Coherence Problem
When the other processor now executes ld r5, x, it should NOT load the stale value 1000.
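The stale read can be reproduced with a toy software model of two private caches that never invalidate each other (a sketch of the example above, not a real memory system):

```python
# Toy model of the coherence problem: two private caches, no coherence actions.
memory = {"x": 1000}
cache = [dict(), dict()]            # private caches of processor 0 and processor 1

def load(p, addr):
    if addr not in cache[p]:
        cache[p][addr] = memory[addr]   # fill from memory on a miss
    return cache[p][addr]

def store(p, addr, value):
    cache[p][addr] = value              # write-back cache: memory not updated yet

r2  = load(0, "x")                  # first processor:  ld r2, x  -> 1000
r2b = load(1, "x")                  # other processor:  ld r2, x  -> 1000
store(0, "x", r2 + 1000)            # first processor:  add + st x -> its cache holds 2000
r5  = load(1, "x")                  # other processor:  ld r5, x  -> still 1000 (stale!)
print(r5)                           # prints 1000, but it should NOT load 1000
```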

Cache Coherence: Whose Responsibility?
Software:
Can the programmer ensure coherence if caches are invisible to software?
What if the ISA provided the following instructions?
FLUSH-LOCAL A: flushes/invalidates the cache block containing address A from a processor's local cache. When does the programmer need to FLUSH-LOCAL an address?
FLUSH-GLOBAL A: flushes/invalidates the cache block containing address A from all other processors' caches. When does the programmer need to FLUSH-GLOBAL an address?
Hardware:
Simplifies software's job
One idea: invalidate all other copies of block A when a processor writes to it

Snoopy Cache Coherence
Caches "snoop" (observe) each other's write/read operations.
A simple protocol: write-through, no-write-allocate cache
Actions: PrRd, PrWr, BusRd, BusWr
(State diagram with two states, Valid and Invalid: PrRd in Invalid issues BusRd and moves to Valid; PrRd in Valid is a hit with no bus action (PrRd/--); PrWr issues BusWr (write-through); a snooped BusWr from another cache invalidates the block, Valid → Invalid.)
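A sketch of this Valid/Invalid protocol as a state machine; the transitions follow the diagram above (state logic only, with no bus or timing model):

```python
# Valid/Invalid snoopy protocol for a write-through, no-write-allocate cache.
# Each cache reacts to its own processor's actions (PrRd, PrWr) and to snooped
# bus writes (BusWr) issued by other caches.

VALID, INVALID = "Valid", "Invalid"

def processor_action(state, action):
    """Return (new_state, bus_action) for the local processor's request."""
    if action == "PrRd":
        if state == INVALID:
            return VALID, "BusRd"      # read miss: fetch the block, become Valid
        return VALID, None             # read hit: no bus action (PrRd/--)
    if action == "PrWr":
        # Write-through: every write goes on the bus; no-write-allocate means a
        # write does not change the state of a block that is not cached.
        return state, "BusWr"
    raise ValueError(action)

def snoop(state, bus_action):
    """Return the new state after observing another cache's bus action."""
    if bus_action == "BusWr" and state == VALID:
        return INVALID                 # another core wrote the block: invalidate
    return state
```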

Multi-core Issues in Caching
How does the cache hierarchy change in a multi-core system?
Private cache: cache belongs to one core
Shared cache: cache is shared by multiple cores
(Figure: four cores sharing a single L2 cache versus four cores with private L2 caches, both organizations in front of the DRAM memory controller.)

Shared Caches Between Cores
Advantages:
Dynamic partitioning of available cache space: no fragmentation due to static partitioning
Easier to maintain coherence: shared data and locks do not ping-pong between caches
Disadvantages:
Cores incur conflict misses due to other cores' accesses: misses due to inter-core interference; some cores can destroy the hit rate of other cores (what kind of access patterns could cause this?)
Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)
High bandwidth harder to obtain (N cores → N ports?)

Shared Caches: How to Share?
Free-for-all sharing:
Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
Not thread/application aware: an incoming block evicts a block regardless of which threads the blocks belong to
Problems (see the sketch below):
A cache-unfriendly application can destroy the performance of a cache-friendly application
Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
Reduced performance, reduced fairness
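The first problem is easy to demonstrate with a thread-oblivious LRU set shared by a cache-friendly thread and a streaming thread; the access patterns below are made-up illustrations:

```python
from collections import OrderedDict

class LRUSet:
    """One set of a thread-oblivious shared cache (free-for-all LRU)."""
    def __init__(self, ways):
        self.ways, self.lines = ways, OrderedDict()
    def access(self, tag):
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)              # refresh recency on a hit
        else:
            if len(self.lines) == self.ways:
                self.lines.popitem(last=False)       # evict LRU, whoever owns it
            self.lines[tag] = True
        return hit

def run(streaming_on):
    cache_set = LRUSet(ways=4)
    friendly_hits = 0
    for i in range(1000):
        friendly_hits += cache_set.access(("friendly", i % 3))   # small, reused working set
        if streaming_on:
            cache_set.access(("streaming", i))                   # never-reused blocks
    return friendly_hits

print(run(streaming_on=False))   # friendly thread hits almost every time
print(run(streaming_on=True))    # its hit rate collapses under free-for-all LRU
```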

Problem with Shared Caches
(Figure sequence: thread t1 on Processor Core 1 and thread t2 on Processor Core 2 each access the shared L2 cache through their private L1 caches, first individually and then together.)
t2's throughput is significantly reduced due to unfair cache sharing.

Controlled Cache Sharing
Utility-based cache partitioning:
Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.
Fair cache partitioning:
Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
Shared/private mixed cache mechanisms:
Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.
Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ISCA 2009.

Utility Based Shared Cache Partitioning
Goal: maximize system throughput
Observation: not all threads/applications benefit equally from caching → simple LRU replacement is not good for system throughput
Idea: allocate more cache space to applications that obtain the most benefit from more space
The high-level idea can be applied to other shared resources as well.
Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

Utility Based Cache Partitioning (I)
Utility of going from a to b ways: U(a, b) = Misses with a ways − Misses with b ways
(Figure: misses per 1000 instructions versus the number of allocated ways of a 16-way 1MB L2, illustrating applications with high utility, low utility, and saturating utility.)
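In code, utility is just the difference between two points on an application's miss curve; the curves below are made-up numbers for illustration:

```python
def utility(miss_curve, a, b):
    """U(a, b) = misses with a ways - misses with b ways, for b > a."""
    return miss_curve[a] - miss_curve[b]

# Hypothetical MPKI as a function of allocated ways (index 0 = 0 ways, up to 16).
high_utility = [50, 40, 30, 22, 15, 10, 8, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6]
low_utility  = [12, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

print(utility(high_utility, 2, 6))   # large benefit from 4 extra ways
print(utility(low_utility, 2, 6))    # barely benefits from the same 4 ways
```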

Utility Based Cache Partitioning (II)
(Figure: misses per 1000 instructions (MPKI) of equake and vpr as a function of allocated ways, comparing an LRU allocation with a utility-based (UTIL) allocation.)
Idea: give more cache to the application that benefits more from cache.

Utility Based Cache Partitioning (III)
Three components:
Utility monitors (UMON) per core
Partitioning algorithm (PA)
Replacement support to enforce partitions
(Figure: Core1 and Core2, each with I$ and D$, share an L2 cache backed by main memory; UMON1 and UMON2 observe each core's accesses and feed the partitioning algorithm (PA), which sets the partition enforced in the shared L2.)

Utility Monitors
For each core, simulate LRU using an auxiliary tag store (ATS)
Hit counters in the ATS count hits per recency position
LRU is a stack algorithm: hit counts → utility
E.g. hits(2 ways) = H0 + H1
(Figure: the main tag store (MTS) and the auxiliary tag store (ATS) hold the same sets; the ATS keeps per-position hit counters H0 (MRU) through H15 (LRU).)
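A software sketch of one UMON: it simulates LRU over the core's accesses in an auxiliary tag store and counts hits per recency position, so hits with a ways = H0 + ... + H(a-1). The set count, way count, and set indexing below are simplified assumptions:

```python
class UMON:
    """Auxiliary tag store plus per-recency-position hit counters for one core."""
    def __init__(self, num_sets=32, ways=16):
        self.num_sets = num_sets
        self.ways = ways
        self.tags = [[] for _ in range(num_sets)]   # each list: MRU ... LRU tags
        self.hits = [0] * ways                      # H0 (MRU) ... H15 (LRU)

    def access(self, block_addr):
        s = block_addr % self.num_sets
        tags = self.tags[s]
        if block_addr in tags:
            self.hits[tags.index(block_addr)] += 1  # count hit at its stack position
            tags.remove(block_addr)
        elif len(tags) == self.ways:
            tags.pop()                              # evict the LRU tag
        tags.insert(0, block_addr)                  # install/move to MRU

    def hits_with(self, a):
        """LRU is a stack algorithm: hits with a ways = H0 + ... + H(a-1)."""
        return sum(self.hits[:a])

# Example: feed a block-address stream from one core and query the counters.
mon = UMON()
for addr in [1, 2, 3, 1, 2, 3, 1]:
    mon.access(addr)
print(mon.hits_with(4))
```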

Dynamic Set Sampling
Extra tags incur hardware and power overhead
DSS reduces overhead [Qureshi+ ISCA'06]
32 sets sufficient (analytical bounds)
Storage < 2kB/UMON
(Figure: instead of keeping an auxiliary tag directory (ATD) entry for every set of the main tag directory (MTD), the UMON with DSS keeps ATD entries and hit counters H0 (MRU) through H15 (LRU) only for a sampled subset of the sets.)
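A back-of-the-envelope check of the < 2kB/UMON figure; the tag and counter widths below are assumptions for illustration, not the exact values from the paper:

```python
# Rough UMON storage with dynamic set sampling (assumed field sizes).
sampled_sets = 32        # DSS: 32 sampled sets are sufficient
ways         = 16
tag_bits     = 24        # assumed tag width per ATD entry
valid_bits   = 1
counter_bits = 32        # one hit counter per recency position

atd_bits      = sampled_sets * ways * (tag_bits + valid_bits)
counter_total = ways * counter_bits
total_bytes   = (atd_bits + counter_total) / 8
print(total_bytes)       # ~1.6 kB, under the 2 kB/UMON budget
```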

Partitioning Algorithm
Evaluate all possible partitions and select the best.
With a ways to core1 and (16 − a) ways to core2:
Hits_core1(a) = H0 + H1 + … + H(a−1) (from UMON1)
Hits_core2(16 − a) = H0 + H1 + … + H(16−a−1) (from UMON2)
Select the a that maximizes Hits_core1 + Hits_core2.
Partitioning is done once every 5 million cycles.
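For two cores, the exhaustive evaluation is a short loop over the hit counters read from the two UMONs (a sketch; counter reset/ageing between repartitioning intervals is omitted):

```python
def best_partition(umon1_hits, umon2_hits, total_ways=16, min_ways=1):
    """Pick a (ways for core1) maximizing combined hits; core2 gets the rest."""
    best_a, best_hits = None, -1
    for a in range(min_ways, total_ways - min_ways + 1):
        hits1 = sum(umon1_hits[:a])                 # H0 + ... + H(a-1)
        hits2 = sum(umon2_hits[:total_ways - a])    # H0 + ... + H(16-a-1)
        if hits1 + hits2 > best_hits:
            best_a, best_hits = a, hits1 + hits2
    return best_a, total_ways - best_a

# Invoked once every 5 million cycles with the hit counters gathered since the
# last repartitioning.
```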

Way Partitioning
Way partitioning support:
Each line has core-id bits.
On a miss, count ways_occupied in the set by the miss-causing app.
If ways_occupied < ways_given: the victim is the LRU line belonging to another app.
Otherwise: the victim is the LRU line belonging to the miss-causing app.
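The victim-selection rule reduces to a few lines (a sketch; a cache line is represented here as a (core_id, lru_rank) pair, which is an illustrative simplification):

```python
def choose_victim(set_lines, miss_core, ways_given):
    """set_lines: list of (core_id, lru_rank) pairs, higher lru_rank = older.
    ways_given: per-core way quota. Returns index of the line to evict."""
    ways_occupied = sum(1 for core, _ in set_lines if core == miss_core)
    if ways_occupied < ways_given[miss_core]:
        # Under quota: take the LRU line belonging to some other core.
        candidates = [i for i, (core, _) in enumerate(set_lines) if core != miss_core]
    else:
        # At or over quota: recycle the miss-causing core's own LRU line.
        candidates = [i for i, (core, _) in enumerate(set_lines) if core == miss_core]
    return max(candidates, key=lambda i: set_lines[i][1])
```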

Performance Metrics
Three metrics for performance:
Weighted speedup (default metric): perf = IPC1/SingleIPC1 + IPC2/SingleIPC2; correlates with reduction in execution time
Throughput: perf = IPC1 + IPC2; can be unfair to a low-IPC application
Hmean-fairness: perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2); balances fairness and performance
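The three metrics written out in code; ipc is each application's IPC when sharing the cache and single_ipc its IPC when running alone (the example numbers are made up):

```python
from statistics import harmonic_mean

def weighted_speedup(ipc, single_ipc):
    return sum(i / s for i, s in zip(ipc, single_ipc))

def throughput(ipc):
    return sum(ipc)

def hmean_fairness(ipc, single_ipc):
    return harmonic_mean([i / s for i, s in zip(ipc, single_ipc)])

# Example: app1 barely slows down when sharing, app2's IPC is halved.
print(weighted_speedup([1.9, 0.5], [2.0, 1.0]))   # 1.45
print(throughput([1.9, 0.5]))                      # 2.4 (hides app2's slowdown)
print(hmean_fairness([1.9, 0.5], [2.0, 1.0]))      # ~0.66 (penalizes the imbalance)
```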

Utility Based Cache Partitioning Performance
Four cores sharing a 2MB 32-way L2.
(Figure: performance of LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) on four workload mixes: Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), Mix4 (mcf-art-eqk-wupw).)

Utility Based Cache Partitioning
Advantages over LRU:
+ Better utilizes the shared cache
Disadvantages/Limitations:
- Scalability: partitioning is limited to ways. What if you have numWays < numApps?
- Scalability: how is utility computed in a distributed cache?
- What if past behavior is not a good predictor of utility?