18-742 Parallel Computer Architecture, Spring 2011
Lecture 25: Shared Resource Management
Prof. Onur Mutlu, Carnegie Mellon University

Announcements
Schedule for the rest of the semester:
- April 20: Milestone II presentations (same format as last time)
- April 27: Oral Exam (30 minutes per person, in my office, closed book/notes; all covered content could be part of the exam)
- May 6: Project poster session (HH 1112, 2-6pm)
- May 10: Project report due

Reviews and Reading List
Due Today (April 11), before class:
- Kung, "Why Systolic Architectures?" IEEE Computer 1982.
Upcoming Topics (we will not cover all of them):
- Shared Resource Management
- Memory Consistency
- Synchronization
- Main Memory
- Architectural Support for Debugging
- Parallel Architecture Case Studies

Readings: Shared Cache Management
Required:
- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
- Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
- Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.
- Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ISCA 2009.
Recommended:
- Kim et al., "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," ASPLOS 2002.
- Qureshi et al., "Adaptive Insertion Policies for High-Performance Caching," ISCA 2007.
- Lin et al., "Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems," HPCA 2008.

Readings: Shared Main Memory
Required:
- Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007.
- Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Memory Controllers," ISCA 2008.
- Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," HPCA 2010.
- Kim et al., "Thread Cluster Memory Scheduling," MICRO 2010.
- Ebrahimi et al., "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems," ASPLOS 2010.
Recommended:
- Lee et al., "Prefetch-Aware DRAM Controllers," MICRO 2008.
- Rixner et al., "Memory Access Scheduling," ISCA 2000.
- Zheng et al., "Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency," MICRO 2008.

Resource Sharing Concept
Idea: Instead of dedicating a hardware resource to one hardware context, allow multiple contexts to use it.
- Example resources: functional units, pipeline, caches, buses, memory
Why?
+ Resource sharing improves utilization/efficiency, and hence throughput
  - As we saw with (simultaneous) multithreading: when a resource is left idle by one thread, another thread can use it; no need to replicate shared data
+ Reduces communication latency
  - For example, shared data is kept in the same cache in SMT processors
+ Compatible with the shared memory programming model

Resource Sharing Disadvantages
Resource sharing results in contention for resources:
- When the resource is not idle, another thread cannot use it
- If space is occupied by one thread, another thread needs to re-occupy it
- Sometimes reduces the performance of some or all threads: a thread's performance can be worse than when it runs alone
- Eliminates performance isolation, leading to inconsistent performance across runs: thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades QoS: it causes unfairness and starvation

Need for QoS and Shared Resource Management
Why is unpredictable performance (or lack of QoS) bad?
- Makes the programmer's life difficult: an optimized program can get low performance, and performance varies widely depending on co-runners
- Causes discomfort to the user: an important program can starve (examples abound with shared software resources)
- Makes system management difficult: how do we enforce a Service Level Agreement when sharing of hardware resources is uncontrollable?

Resource Sharing vs. Partitioning
- Sharing improves throughput: better utilization of space
- Partitioning provides performance isolation (predictable performance): dedicated space
Can we get the benefits of both?
Idea: Design shared resources in a controllable/partitionable way.

Shared Hardware Resources
Memory subsystem (in both MT and CMP):
- Non-private caches
- Interconnects
- Memory controllers, buses, banks
I/O subsystem (in both MT and CMP):
- I/O and DMA controllers
- Ethernet controllers
Processor (in MT):
- Pipeline resources
- L1 caches

Resource Sharing Issues and Related Metrics
- System performance
- Fairness
- Per-application performance (QoS)
- Power
- Energy
- System cost
- Lifetime
- Reliability, effect of faults
- Security, information leakage
Partitioning provides isolation between apps/threads; sharing (free-for-all) provides no isolation.

Main Memory in the System
[Figure: a four-core chip in which each core has a private L2 cache, all cores share an L3 cache, and a DRAM interface connects the chip to the DRAM memory controller and DRAM banks]

Modern Memory Systems (Multi-Core)
[Figure]

Memory System is the Major Shared Resource
[Figure: multiple threads' memory requests interfere throughout the shared memory system]

Multi-core Issues in Caching
How does the cache hierarchy change in a multi-core system?
- Private cache: the cache belongs to one core (a shared block can be present in multiple caches)
- Shared cache: the cache is shared by multiple cores
[Figure: four cores with private L2 caches vs. four cores sharing a single L2 cache, both in front of the DRAM memory controller]

Shared Caches Between Cores
Advantages:
- High effective capacity
- Dynamic partitioning of available cache space: no fragmentation due to static partitioning
- Easier to maintain coherence (a cache block is in a single location)
- Shared data and locks do not ping-pong between caches
Disadvantages:
- Slower access
- Cores incur conflict misses due to other cores' accesses: misses caused by inter-core interference; some cores can destroy the hit rate of other cores
- Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

Shared Caches: How to Share?
Free-for-all sharing:
- Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
- Not thread/application aware
- An incoming block evicts a block regardless of which threads the blocks belong to
Problems:
- A cache-unfriendly application can destroy the performance of a cache-friendly application
- Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
- Reduced performance, reduced fairness

Controlled Cache Sharing
Utility-based cache partitioning:
- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
- Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.
Fair cache partitioning:
- Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
Shared/private mixed cache mechanisms:
- Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.
- Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," ISCA 2009.

Utility-Based Shared Cache Partitioning
Goal: Maximize system throughput.
Observation: Not all threads/applications benefit equally from caching, so simple LRU replacement is not good for system throughput.
Idea: Allocate more cache space to the applications that obtain the most benefit from more space.
The high-level idea can be applied to other shared resources as well.
- Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
- Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

Marginal Utility of a Cache Way
Utility U(a, b) = Misses with a ways - Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility curves]
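To make the definition concrete, the following minimal Python sketch (with made-up MPKI numbers, not data from the paper) evaluates U(a, b) from a per-way miss curve:

    # Hypothetical misses-per-1000-instructions when an app is given 1..8 ways
    # (illustrative numbers only).
    mpki = {1: 50, 2: 30, 3: 22, 4: 20, 5: 19, 6: 19, 7: 19, 8: 19}

    def utility(a, b):
        # U(a, b) = misses with a ways - misses with b ways (for b > a)
        return mpki[a] - mpki[b]

    print(utility(1, 2))  # 20 -> high utility: the second way removes many misses
    print(utility(4, 8))  # 1  -> saturating utility: extra ways barely help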

Utility-Based Shared Cache Partitioning: Motivation
[Figure: misses per 1000 instructions (MPKI) vs. number of ways from a 16-way 1MB L2 for equake and vpr, with the LRU and UTIL allocation points marked]
Improve performance by giving more cache to the application that benefits more from cache.

Utility-Based Cache Partitioning (III)
Three components:
- Utility Monitors (UMON), one per core
- Partitioning Algorithm (PA)
- Replacement support to enforce partitions
[Figure: two cores, each with I$ and D$ and a UMON, sharing an L2 cache backed by main memory; the PA reads both UMONs]

Utility Monitors
- For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD)
- Hit counters in the ATD count hits per recency position
- LRU is a stack algorithm, so the hit counts give utility directly, e.g., hits(2 ways) = H0 + H1
[Figure: Main Tag Directory (MTD) and ATD over sets A-H, with per-position hit counters H0 (MRU) through H15 (LRU)]
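The following Python sketch is a deliberately simplified model of what one UMON set does (true LRU over a single set, ignoring the MTD and all hardware detail): it records, for every hit, the recency position at which the tag was found, so hits(n ways) is simply the sum of the first n counters.

    class UMONSet:
        """Simplified utility monitor for one sampled set."""
        def __init__(self, num_ways=16):
            self.num_ways = num_ways
            self.stack = []                  # index 0 = MRU, last = LRU
            self.hit_ctrs = [0] * num_ways   # H0..H15: hits per recency position

        def access(self, tag):
            if tag in self.stack:
                pos = self.stack.index(tag)  # recency position of the hit
                self.hit_ctrs[pos] += 1
                self.stack.remove(tag)
            elif len(self.stack) == self.num_ways:
                self.stack.pop()             # evict the LRU tag
            self.stack.insert(0, tag)        # (re)insert at MRU

        def hits_with(self, ways):
            # LRU is a stack algorithm: an access that hits at position p
            # would also hit in any cache with more than p ways.
            return sum(self.hit_ctrs[:ways])

    umon = UMONSet()
    for tag in [1, 2, 3, 1, 2, 4, 1]:
        umon.access(tag)
    print(umon.hits_with(2), umon.hits_with(4))  # 0 hits with 2 ways, 3 with 4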

Dynamic Set Sampling
- Extra tags incur hardware and power overhead
- Dynamic Set Sampling (DSS) reduces this overhead [Qureshi+ ISCA'06]
- 32 sampled sets are sufficient (analytical bounds)
- Storage < 2kB per UMON
[Figure: the MTD covers all sets, while the UMON's ATD tracks only a sampled subset of sets with per-position hit counters H0 (MRU) through H15 (LRU)]
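As a back-of-the-envelope check of the storage claim, with assumed (not paper-exact) parameters of 32 sampled sets, 16 ways, 25 bits per ATD entry (tag plus valid bit), and sixteen 32-bit hit counters:

    sampled_sets, ways = 32, 16
    atd_entry_bits = 25           # assumed tag + valid bit width
    counter_bits = 32             # assumed hit-counter width
    total_bits = sampled_sets * ways * atd_entry_bits + ways * counter_bits
    print(total_bits / 8 / 1024)  # ~1.6 kB, consistent with the <2 kB figure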

Partitioning Algorithm
- Evaluate all possible partitions and select the best
- With a ways to core 1 and (16 - a) ways to core 2:
    Hits_core1(a) = H0 + H1 + ... + H(a-1)          (from UMON1)
    Hits_core2(16 - a) = H0 + H1 + ... + H(16-a-1)  (from UMON2)
- Select the a that maximizes Hits_core1 + Hits_core2
- Partitioning is done once every 5 million cycles
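A minimal sketch of this exhaustive two-core evaluation, assuming the per-position hit counters h1 and h2 have already been read out of the two UMONs (the counter values below are made up for illustration):

    def best_two_core_partition(h1, h2, total_ways=16):
        """Try every split (a ways to core 1, total_ways - a to core 2)
        and return the split that maximizes total hits."""
        best_a, best_hits = None, -1
        for a in range(1, total_ways):            # give each core at least one way
            hits = sum(h1[:a]) + sum(h2[:total_ways - a])
            if hits > best_hits:
                best_a, best_hits = a, hits
        return best_a, total_ways - best_a, best_hits

    # Hypothetical hit counters: core 1 benefits mostly from its first few ways,
    # core 2 keeps benefiting from additional ways.
    h1 = [90, 40, 20, 10] + [1] * 12
    h2 = [30, 28, 26, 24, 22, 20, 18, 16] + [2] * 8
    print(best_two_core_partition(h1, h2))   # (4, 12, 352)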

Way Partitioning
Way partitioning support [Suh+ HPCA'02, Iyer ICS'04]:
1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the miss-causing app:
   - If ways_occupied < ways_given: the victim is the LRU line from another app
   - Otherwise: the victim is the LRU line from the miss-causing app
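A minimal sketch of the victim-selection rule (my own software model; real hardware operates on per-line core-id bits in the tag array):

    def choose_victim(set_lines, miss_core, ways_given):
        """set_lines: list of (core_id, age) per line in the set, larger age = older.
        Returns the index of the line to evict."""
        ways_occupied = sum(1 for core, _ in set_lines if core == miss_core)
        if ways_occupied < ways_given[miss_core]:
            # Under its quota: grow by evicting the oldest line of another core.
            candidates = [i for i, (c, _) in enumerate(set_lines) if c != miss_core]
        else:
            # At or over its quota: recycle the oldest of its own lines.
            candidates = [i for i, (c, _) in enumerate(set_lines) if c == miss_core]
        return max(candidates, key=lambda i: set_lines[i][1])

    # Example: 4-way set; core 0 is allocated 3 ways, core 1 is allocated 1 way.
    lines = [(0, 2), (1, 0), (1, 3), (0, 1)]
    print(choose_victim(lines, miss_core=0, ways_given={0: 3, 1: 1}))
    # -> 2: core 0 is under its quota, so the oldest line of core 1 is evicted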

Performance Metrics
Three metrics for performance:
1. Weighted Speedup (default metric):
   perf = IPC1/SingleIPC1 + IPC2/SingleIPC2
   Correlates with reduction in execution time.
2. Throughput:
   perf = IPC1 + IPC2
   Can be unfair to a low-IPC application.
3. Hmean-fairness:
   perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2)
   Balances fairness and performance.
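A small Python sketch of the three metrics, assuming each application's IPC has been measured both when co-running (ipc) and when running alone (single_ipc); the numbers are made up:

    from statistics import harmonic_mean

    def weighted_speedup(ipc, single_ipc):
        return sum(i / s for i, s in zip(ipc, single_ipc))

    def throughput(ipc):
        return sum(ipc)

    def hmean_fairness(ipc, single_ipc):
        return harmonic_mean([i / s for i, s in zip(ipc, single_ipc)])

    ipc        = [0.8, 1.5]    # hypothetical co-running IPCs
    single_ipc = [1.0, 2.0]    # hypothetical stand-alone IPCs
    print(weighted_speedup(ipc, single_ipc))  # 0.8 + 0.75 = 1.55
    print(throughput(ipc))                    # 2.3
    print(hmean_fairness(ipc, single_ipc))    # ~0.77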

Weighted Speedup Results for UCP
[Figure]

IPC Results for UCP
[Figure]
UCP improves average throughput by 17%.

Any Problems with UCP So Far?
- Scalability
- Non-convex curves?
Time complexity of partitioning is low for two cores (the number of possible partitions is roughly the number of ways), but the number of possible partitions increases exponentially with the number of cores. For a 32-way cache:
- 4 cores: 6545 possible partitions
- 8 cores: 15.4 million possible partitions
The problem is NP-hard, so we need a scalable partitioning algorithm.
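The partition counts above can be sanity-checked by counting the ways of splitting 32 identical ways among N cores (allowing a core to receive zero ways), which is C(ways + cores - 1, cores - 1):

    from math import comb

    def num_partitions(ways, cores):
        # Ways are interchangeable, so a partition is a weak composition.
        return comb(ways + cores - 1, cores - 1)

    print(num_partitions(32, 2))   # 33
    print(num_partitions(32, 4))   # 6545
    print(num_partitions(32, 8))   # 15380937 (~15.4 million)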

Greedy Algorithm [Stone+ ToC'92]
- The greedy algorithm (GA) allocates one block to the app that has the maximum utility for one block; repeat until all blocks are allocated
- Optimal partitioning when utility curves are convex
- Pathological behavior for non-convex curves
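A minimal sketch of the greedy allocator (my own formulation, with utility curves represented as hits per cumulative number of blocks):

    def greedy_partition(hit_curves, total_blocks):
        """hit_curves[app][n] = hits when app owns n blocks; each curve must
        have total_blocks + 1 entries. Returns blocks allocated per app."""
        alloc = [0] * len(hit_curves)
        for _ in range(total_blocks):
            # Gain each app would get from exactly one more block.
            gains = [curve[alloc[a] + 1] - curve[alloc[a]]
                     for a, curve in enumerate(hit_curves)]
            winner = max(range(len(gains)), key=lambda a: gains[a])
            alloc[winner] += 1
        return alloc

    # Hypothetical convex hit curves for two apps sharing 8 blocks.
    a_hits = [0, 50, 80, 95, 100, 102, 103, 104, 105]
    b_hits = [0, 30, 55, 70, 80, 85, 88, 90, 91]
    print(greedy_partition([a_hits, b_hits], 8))   # [4, 4]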

Problem with Greedy Algorithm
Problem: GA considers the benefit only of the immediate block, so it fails to exploit large gains from looking ahead.
Example: in each iteration, the utility of one more block is U(A) = 10 misses and U(B) = 0 misses, so all blocks are assigned to A, even if B achieves the same miss reduction with fewer blocks.
[Figure: misses vs. blocks assigned for apps A and B]

Lookahead Algorithm
- Marginal Utility (MU) = utility per unit of cache resource: MU(a, b) = U(a, b) / (b - a)
- GA considers MU for one block; the lookahead algorithm (LA) considers MU for all possible allocations
- Select the app that has the maximum MU and allocate it as many blocks as required to reach that maximum MU
- Repeat until all blocks are assigned

Lookahead Algorithm Example
Iteration 1: MU(A) = 10 misses / 1 block, MU(B) = 80 misses / 3 blocks, so B gets 3 blocks.
Next five iterations: MU(A) = 10 misses / 1 block, MU(B) = 0, so A gets 1 block each time.
Result: A gets 5 blocks and B gets 3 blocks (optimal).
Time complexity ≈ ways^2 / 2 (512 operations for 32 ways).
[Figure: misses vs. blocks assigned for apps A and B]
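A minimal sketch of the lookahead allocator (my own formulation, using the same hit-curve representation as the greedy sketch above) that reproduces this example:

    def lookahead_partition(hit_curves, total_blocks):
        """hit_curves[app][n] = hits when app owns n blocks."""
        alloc = [0] * len(hit_curves)
        remaining = total_blocks
        while remaining > 0:
            best = None   # (marginal utility per block, app, extra blocks)
            for a, curve in enumerate(hit_curves):
                for extra in range(1, remaining + 1):
                    mu = (curve[alloc[a] + extra] - curve[alloc[a]]) / extra
                    if best is None or mu > best[0]:
                        best = (mu, a, extra)
            _, app, extra = best
            alloc[app] += extra
            remaining -= extra
        return alloc

    # Curves mimicking the example (hits gained = misses removed): A gains 10
    # per block; B gains nothing until it has 3 blocks, then 80 at once.
    a_hits = [0, 10, 20, 30, 40, 50, 60, 70, 80]
    b_hits = [0, 0, 0, 80, 80, 80, 80, 80, 80]
    print(lookahead_partition([a_hits, b_hits], 8))   # [5, 3], as on the slide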

UCP Results
Four cores sharing a 2MB 32-way L2, comparing LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll):
- Mix1 (gap-applu-apsi-gzp)
- Mix2 (swm-glg-mesa-prl)
- Mix3 (mcf-applu-art-vrtx)
- Mix4 (mcf-art-eqk-wupw)
LA performs similarly to EvalAll, with low time complexity.

Utility-Based Cache Partitioning
Advantages over LRU:
+ Improves system throughput
+ Better utilizes the shared cache
Disadvantages:
- Fairness, QoS?
Limitations:
- Scalability: partitioning is limited to ways; what if numWays < numApps?
- Scalability: how is utility computed in a distributed cache?
- What if past behavior is not a good predictor of utility?