
1 Cooperative Caching for Chip Multiprocessors
Jichuan Chang†, Enric Herrero‡, Ramon Canal‡ and Gurindar S. Sohi*
†HP Labs   ‡Universitat Politècnica de Catalunya   *University of Wisconsin-Madison
M. S. Obaidat and S. Misra (Editors), Chapter 13, Cooperative Networking (Wiley)

2 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

3 Motivation - Background
Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs.
Critical for CMPs:
– Processor/memory gap
– Limited pin bandwidth
Current designs:
– Shared cache: sharing can lead to contention
– Private caches: isolation can waste resources
[Figure: a 4-core CMP with a slow, narrow path to off-chip memory; a shared cache suffers capacity contention while private caches leave capacity wasted]

4 Motivation – Challenges
Key challenges:
– Growing on-chip wire delay
– Expensive off-chip accesses
– Destructive inter-thread interference
– Diverse workload characteristics
Three important demands for CMP caching:
– Capacity: reduce off-chip accesses
– Latency: reduce remote on-chip references
– Isolation: reduce inter-thread interference
Need to combine the strengths of both private and shared cache designs.

5 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

6 CMP Cooperative Caching
Form an aggregate global cache via cooperative private caches:
– Use private caches to attract data for fast reuse
– Share capacity through cooperative policies
– Throttle cooperation to find an optimal sharing point
Inspired by cooperative file/web caches:
– Similar latency tradeoff
– Similar algorithms
[Figure: four cores, each with private L1I/L1D and L2 caches, connected by an on-chip network]

7 CMP Cooperative Caching
– "Private" L2 caches reduce access latency.
– A central coherence engine (CCE) with duplicated tags maintains coherence on-chip.
– Spilling: evicted blocks are forwarded to other caches for a more efficient use of cache space (N-chance forwarding mechanism).
[Figure: cores with private L1 and L2 caches connected through a bus interconnect to the CCE and main memory]
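The slide only names the N-chance forwarding mechanism; the following is a minimal sketch of how spilling might work, assuming a per-block recirculation counter and a randomly chosen recipient cache. The class and function names are illustrative, not taken from the chapter.

```python
import random

class Block:
    def __init__(self, tag, dirty=False):
        self.tag = tag
        self.dirty = dirty
        self.recirculations = 0      # how many times this block has already been spilled

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []             # crude LRU order: index 0 is the LRU entry

    def insert(self, block):
        victim = self.blocks.pop(0) if len(self.blocks) >= self.capacity else None
        self.blocks.append(block)
        return victim                # the caller decides what happens to the victim

def spill(block, peers, n=2):
    """N-chance forwarding: an evicted block is forwarded to a randomly chosen
    peer cache up to n times; after that it leaves the on-chip aggregate cache
    (written back to memory if dirty).  The value of n and the random choice of
    host are assumptions for illustration."""
    if block.recirculations >= n or not peers:
        return None
    block.recirculations += 1
    host = random.choice(peers)      # pick a recipient private cache at random
    return host.insert(block)        # the host may evict its own victim in turn
```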

8 Distributed Cooperative Caching
Objective: keep the benefits of cooperative caching while improving scalability and energy consumption.
– Distributed directory (distributed coherence engines, DCEs) with a different tag-allocation mechanism.
[Figure: cores with private L1 and L2 caches connected through a bus interconnect to the DCEs and main memory]
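The slide does not detail how tags are distributed across the DCEs. A common way to build such a distributed directory is to interleave block addresses across the engines; the sketch below assumes that scheme purely for illustration.

```python
def home_dce(block_address, num_dces, block_offset_bits=6):
    """Map a physical block address to its home distributed coherence engine (DCE).
    Interleaving on the address bits just above the block offset is an assumed
    scheme; the actual DCE tag-allocation mechanism differs from the centralized
    duplicate-tag directory but is not described on this slide."""
    return (block_address >> block_offset_bits) % num_dces

# Example: with 8 DCEs and 64-byte blocks, address 0x12345 maps to DCE 5.
print(home_dce(0x12345, num_dces=8))
```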

9 Tag Structure Comparison

10 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

11 3. Applications of CMP Cooperative Caching
Several techniques have appeared that take advantage of cooperative caching for CMPs:
1. For latency reduction – Cooperation throttling
2. For adaptive repartitioning – Elastic Cooperative Caching
3. For performance isolation – Cooperative Cache Partitioning

12 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

13 Policies to Reduce Off-chip Accesses
Cooperation policies for capacity sharing:
– (1) Cache-to-cache transfers of clean data
– (2) Replication-aware replacement
– (3) Global replacement of inactive data
Implemented by two unified techniques:
– Policies enforced by cache replacement/placement
– Information/data exchange supported by modifying the coherence protocol

14 Policy (1) - Make use of all on-chip data
Don't go off-chip if on-chip (clean) data exist.
Beneficial and practical for CMPs:
– A peer cache is much closer than the next-level storage
– Affordable implementations of "clean ownership"
Important for all workloads:
– Multi-threaded: (mostly) read-only shared data
– Single-threaded: spill into peer caches for later reuse
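A minimal sketch of the miss-handling idea behind policy (1), assuming a simple dictionary-based directory that records one on-chip holder per block. The data structures and names are illustrative, not the protocol from the chapter.

```python
def handle_l2_miss(addr, directory, peer_caches, memory):
    """Policy (1): before going off-chip, ask the on-chip directory whether any
    peer cache holds the block, even in a clean state, and transfer it
    cache-to-cache.  'directory' maps a block address to the id of a cache that
    holds it (or is absent if no on-chip copy exists)."""
    holder = directory.get(addr)
    if holder is not None:
        return peer_caches[holder][addr]   # fast on-chip cache-to-cache transfer
    return memory[addr]                    # off-chip access only when no on-chip copy exists

# Example: core 0 misses on block 0x40; core 2 holds a clean copy.
directory = {0x40: 2}
peer_caches = [{}, {}, {0x40: "data"}, {}]
memory = {0x40: "data"}
assert handle_l2_miss(0x40, directory, peer_caches, memory) == "data"
```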

15 Policy (2) – Control replication
Intuition: increase the number of unique on-chip data blocks.
Latency/capacity tradeoff:
– Evict singlets only when no replicas exist
– Modify the default cache replacement policy
"Spill" an evicted singlet into a peer cache:
– Can further reduce on-chip replication
– Randomly choose a recipient cache for spilling singlets
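A minimal sketch of replication-aware victim selection, assuming an LRU-ordered set and a helper that says whether a block has another on-chip copy (in a real design this comes from directory or coherence state). Names and structures are illustrative.

```python
import random

def choose_victim(cache_set, is_replicated):
    """Policy (2): prefer to evict a replicated block (one with another on-chip
    copy) over a 'singlet' (the only on-chip copy).  'cache_set' is ordered from
    LRU to MRU; among replicated blocks the least recently used is chosen."""
    replicated = [b for b in cache_set if is_replicated(b)]
    if replicated:
        return replicated[0]      # losing a replica costs no unique on-chip data
    return cache_set[0]           # only singlets left: evict the LRU singlet (and spill it)

def on_evict(victim, is_replicated, peer_caches):
    """An evicted singlet is spilled into a randomly chosen peer cache instead of
    being dropped, keeping the number of unique on-chip blocks high."""
    if not is_replicated(victim) and peer_caches:
        random.choice(peer_caches).append(victim)

# Example: "B" has another on-chip copy, so it is evicted before the LRU singlet "A".
lru_set = ["A", "B", "C"]
assert choose_victim(lru_set, lambda b: b == "B") == "B"
```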

16 Policy (3) - Global cache management
Approximate global-LRU replacement:
– Combine global spill/reuse history with local LRU
Identify and replace globally inactive data:
– A block first becomes the LRU entry in its local cache
– It is set as MRU if spilled into a peer cache
– If it later becomes the LRU entry again, it is evicted globally
1-chance forwarding (1-Fwd):
– Blocks can only be spilled once if not reused
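A small sketch of 1-chance forwarding, assuming one spill bit per block and a random choice of host cache; reuse clears the bit, and a block that reaches the LRU position with the bit set is evicted globally. All structures are illustrative.

```python
from collections import OrderedDict
import random

class PrivateL2:
    """Tiny model of one private L2; OrderedDict order runs from LRU (front) to MRU (back)."""
    def __init__(self, capacity, peers):
        self.capacity = capacity
        self.peers = peers               # the other private caches on the chip
        self.lines = OrderedDict()       # tag -> {'spilled': bool}

    def access(self, tag):
        """A reuse moves the block to MRU and restores its one spill chance."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            self.lines[tag]['spilled'] = False
            return True
        return False

    def insert(self, tag, spilled=False):
        if len(self.lines) >= self.capacity:
            self._evict()
        self.lines[tag] = {'spilled': spilled}   # new and spilled-in blocks enter as MRU

    def _evict(self):
        tag, meta = self.lines.popitem(last=False)     # local LRU entry
        if meta['spilled']:
            return                       # already had its one chance: evict globally
        others = [c for c in self.peers if c is not self]
        if not others:
            return
        random.choice(others).insert(tag, spilled=True)  # spill once, as MRU elsewhere

# Wire up four cooperating private caches.
caches = []
for _ in range(4):
    caches.append(PrivateL2(capacity=4, peers=caches))
```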

17 Cooperation Throttling
Why throttling?
– Further tradeoff between capacity and latency
Two probabilities help make decisions:
– Cooperation probability: controls replication
– Spill probability: throttles spilling
[Figure: design spectrum between shared and private caches; cooperative caching spans from CC 100% down to CC 0%, which retains only policy (1)]
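A minimal sketch of how the two probabilities could gate the cooperative policies. The exact decision points are an assumption; the slide only states that one probability controls replication and the other throttles spilling.

```python
import random

def throttled(probability):
    """Return True with the given probability; used to gate cooperative actions."""
    return random.random() < probability

def replacement_victim(local_lru_victim, replication_aware_victim, cooperation_prob):
    """Cooperation probability: with probability cooperation_prob use the
    replication-aware choice (policy (2)); otherwise fall back to plain local LRU.
    1.0 corresponds to CC 100%, 0.0 to CC 0% (only policy (1) remains active)."""
    return replication_aware_victim if throttled(cooperation_prob) else local_lru_victim

def should_spill(spill_prob):
    """Spill probability: independently throttles how often evicted singlets are
    actually spilled to a peer cache rather than simply dropped."""
    return throttled(spill_prob)
```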

18 Performance Evaluation
[Figure: performance results for SPECOMP and single-threaded workloads]

19 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

20 CC for Adaptive Repartitioning
Tradeoff between minimizing off-chip misses and avoiding inter-thread interference.
Elastic Cooperative Caching adapts caches according to application requirements:
– High cache requirements -> big local private cache
– Low data reuse -> cache space is reassigned to increase the size of the global shared cache for spilled blocks

21 Elastic Cooperative Caching Structure
– Local shared/private cache with an independent repartitioning unit per node.
– Only the local core can allocate in the private partition.
– Distributed coherence engines maintain coherence.
– The shared partition allocates blocks evicted (spilled) from all private regions.
– A spilled-block allocator distributes blocks evicted from the private partition among the nodes.
– Every N cycles, the repartitioning unit resizes the partitions based on LRU hits in the shared and private partitions.

22 Repartitioning Unit – Working Example
Independent partitioning per node, distributed structure.
– Requests that hit update the repartitioning counter; the counter is checked every repartition cycle.
– Counter > HT (high threshold): Private++ (the private partition grows).
– Counter < LT (low threshold): Shared++ (the shared partition grows).
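A minimal sketch of one repartitioning step using the thresholds from the slide. What exactly the counter accumulates between repartition cycles is assumed here (hits attributed to the private partition); the slide only shows the HT/LT comparisons.

```python
def repartition(counter, private_ways, total_ways, high_threshold, low_threshold):
    """Elastic Cooperative Caching repartitioning step, run every N cycles.
    Counter above HT: grow the private partition by one way (Private++).
    Counter below LT: give one way back to the shared partition (Shared++)."""
    if counter > high_threshold and private_ways < total_ways:
        private_ways += 1          # Private++
    elif counter < low_threshold and private_ways > 0:
        private_ways -= 1          # Shared++
    return private_ways            # shared_ways = total_ways - private_ways

# Example: a 16-way cache currently split 8 private / 8 shared.
ways = repartition(counter=37, private_ways=8, total_ways=16,
                   high_threshold=32, low_threshold=8)
assert ways == 9                   # counter > HT, so one way moves to the private partition
```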

23 Spilled-Block Allocator – Working Example
– Only data from the private region can be spilled.
– Nodes broadcast their current private/shared way counts; the allocator does not need perfectly updated information and stays out of the critical path.
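A minimal sketch of choosing a spill target from the broadcast way counts. Weighting the random choice by each node's shared capacity is an assumption for illustration; the slide only states that the information may be stale and is kept off the critical path.

```python
import random

def pick_spill_target(shared_ways_by_node, self_id):
    """Pick a node to receive a block spilled from the private region, based on
    the (possibly stale) broadcast count of shared ways at each node."""
    candidates = [(node, ways) for node, ways in shared_ways_by_node.items()
                  if node != self_id and ways > 0]
    if not candidates:
        return None                        # nowhere to spill: write back instead
    nodes, weights = zip(*candidates)
    return random.choices(nodes, weights=weights, k=1)[0]

# Example: node 0 spills; nodes 2 and 3 advertise shared ways.
print(pick_spill_target({0: 2, 1: 0, 2: 6, 3: 4}, self_id=0))
```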

24 Performance and Energy-Efficiency Evaluation
[Figure: results showing +12% and +24% improvements over ASR]

25 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

26 Time-share Based Partitioning
Throughput-fairness dilemma:
– Cooperation: taking turns to speed up
– Multiple time-sharing partitions (MTP)
QoS guarantee:
– Cooperatively shrink/expand across MTPs
– Bound the average slowdown over the long term
[Figure: four threads P1-P4 time-sharing the cache; a single spatial partition and MTP both reach IPC = 0.52 and WS = 2.42, but MTP raises FS from 1.22 to 1.97 and bounds QoS at 0 instead of -0.52. The fairness improvement and QoS guarantee are reflected by the higher FS and bounded QoS values.]
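The slide quotes WS, FS and QoS numbers without defining them. Assuming the forms commonly used in this line of work (with $SP_i$ the speedup of thread $i$ relative to its per-metric baseline, which the slide does not specify, and $N$ threads), the metrics are:

```latex
\mathrm{WS}  = \sum_{i=1}^{N} SP_i                    % weighted speedup: raw throughput
\qquad
\mathrm{FS}  = \frac{N}{\sum_{i=1}^{N} 1/SP_i}        % fair speedup: harmonic mean of speedups
\qquad
\mathrm{QoS} = \sum_{i=1}^{N} \min(0,\; SP_i - 1)     % total slowdown below baseline; 0 means no thread is hurt
```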

27 MTP Benefits
Better than a single spatial partition (SSP):
– MTP with long-term QoS is almost the same as MTP without QoS
– Offline analysis based on profile information, 210 workloads (job mixes)
[Figure: percentage of workloads achieving various fair speedup (FS) values]

28 Better than MTP
MTP issues:
– Not needed if LRU performs better (LRU is often near-optimal [Stone et al., IEEE TOC '92])
– Partitioning is more complex than SSP
Cooperative Cache Partitioning (CCP):
– Integration with Cooperative Caching (CC)
– Exploits CC's latency and LRU-based sharing benefits
– Simplifies the partitioning algorithm
– Total execution time = Epochs(CC) + Epochs(MTP), weighted by the number of threads benefiting from CC vs. MTP

29 Partitioning Heuristic
When is MTP better than CC?
– QoS: the total speedup must exceed the total slowdown (over N partitions)
– The speedup should be large, since CC is already good at fine-grained tuning
Thrashing test: Speedup > (N-1) x Slowdown
[Figure: normalized throughput versus allocated cache ways (16-way total, 4-core) for the baseline, C_shrink and C_expand allocations, illustrating the speedup and slowdown used in the test]
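The thrashing test from the slide, written out directly. The interpretation of speedup and slowdown as gains/losses of the C_expand and C_shrink allocations relative to the baseline is an assumption based on the figure.

```python
def thrashing_test(speedup, slowdown, n_partitions):
    """A thread justifies multiple time-sharing partitions only if the speedup it
    gains in its expanded epoch outweighs the slowdown it suffers in the other
    N-1 shrunk epochs: speedup > (N - 1) * slowdown."""
    return speedup > (n_partitions - 1) * slowdown

# Example: with 4 partitions, a 0.5 speedup does not justify a 0.2 slowdown.
assert thrashing_test(0.5, 0.2, 4) is False
```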

30 Partitioning Algorithm
1. S = all threads minus supplier threads (e.g., gcc, swim).
   Allocate supplier threads their gPar (guaranteed partition, the minimum capacity needed for QoS) [Yeh/Reinman CASES '05].
   For threads in S, initialize their C_expand and C_shrink.
2. Apply the thrashing test iteratively to each thread in S.
   If thread t fails, allocate t its gPar and remove t from S.
   Update C_expand and C_shrink for the other threads in S.
3. Repeat until S is empty or all threads in S pass the test.
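A control-flow sketch of the algorithm above. All callbacks are placeholders: gPar returns a thread's guaranteed partition, init_expand_shrink recomputes C_expand/C_shrink for the current candidate set, and thrashing_test applies the Speedup > (N-1) x Slowdown check. Only the control flow is taken from the slide.

```python
def ccp_partition(threads, is_supplier, gPar, init_expand_shrink, thrashing_test):
    """Sketch of the CCP partitioning algorithm (steps 1-3 on the slide)."""
    allocation = {}

    # Step 1: supplier threads get their guaranteed partition immediately.
    S = [t for t in threads if not is_supplier(t)]
    for t in threads:
        if is_supplier(t):
            allocation[t] = gPar(t)

    # Steps 2-3: iteratively drop threads that fail the thrashing test.
    changed = True
    while S and changed:
        changed = False
        expand, shrink = init_expand_shrink(S)        # C_expand / C_shrink per thread
        for t in list(S):
            if not thrashing_test(expand[t], shrink[t], len(S)):
                allocation[t] = gPar(t)               # failed: fall back to guaranteed partition
                S.remove(t)
                changed = True
                break                                 # recompute C_expand / C_shrink and retry

    # Threads remaining in S share the leftover capacity via MTP epochs.
    return allocation, S
```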

31 Fair Speedup Results
Two groups of workloads:
– PAR: MTP better than CC (partitioning helps), 67 out of 210 workloads
– LRU: CC better than MTP (partitioning hurts), 143 out of 210 workloads
[Figure: percentage of workloads achieving various fair speedup values for the PAR and LRU groups]

32 Performance and QoS Evaluation

33 Outline
1. Motivation
2. Cooperative Caching for CMPs
3. Applications of CMP Cooperative Caching
   1) Latency Reduction
   2) Adaptive Repartitioning
   3) Performance Isolation
4. Conclusions

34 4. Conclusions
– The on-chip cache hierarchy plays an important role in CMPs and must provide fast and fair data access for multiple, competing processor cores.
– Cooperative Caching for CMPs is an effective framework for managing caches in such an environment.
– Cooperative sharing mechanisms, and the philosophy of using cooperation for conflict resolution, can be applied to many other resource management problems.

