CMP L2 Cache Management
Presented by: Yang Liu, CPS221, Spring 2008
Based on:
- Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. N. Vijaykumar
- ASR: Adaptive Selective Replication for CMP Caches, B. Beckmann, M. Marty, and D. Wood
Outline
- Motivation
- Related Work (1) – Non-Uniform Caches
- CMP-NuRAPID
- Related Work (2) – Replication Schemes
- ASR
Motivation
- Two options for L2 caches in CMPs:
  - Shared: high hit latency because of wire delay
  - Private: more misses because of replication
- Need hybrid L2 caches
- Keep in mind:
  - On-chip communication is fast
  - On-chip capacity is limited
NUCA
- Non-Uniform Cache Architecture
- Places frequently accessed data closest to the core for fast access
- Couples tag and data placement, so only one or two ways of each set can be placed close to the processor
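A minimal sketch of that coupling, with made-up way/bank counts and latencies: when the way number statically fixes the bank, only the first couple of ways in a set enjoy the nearest bank.

```python
# Illustrative numbers only: 8 ways striped across 4 banks, with
# assumed per-bank access latencies from nearest to farthest.
WAYS, BANKS = 8, 4
LATENCY = [4, 7, 10, 13]          # cycles, nearest bank to farthest

def access_latency(way):
    bank = way * BANKS // WAYS    # way number fixes the bank
    return LATENCY[bank]

print([access_latency(w) for w in range(WAYS)])
# [4, 4, 7, 7, 10, 10, 13, 13] -> at most two ways per set are fast
```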
NuRAPID
- Non-uniform access with Replacement And Placement usIng Distance associativity
- Decouples the set-associative way number from data placement
- Divides the cache data array into d-groups (distance groups)
- Uses forward and reverse pointers:
  - Forward: from tag to data
  - Reverse: from data to tag
  - One-to-one?
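A minimal sketch of the decoupling, with invented field names: the tag entry keeps its usual set/way position but points at wherever the data lives, and the data frame points back, so either side can be found when the other moves.

```python
from dataclasses import dataclass

@dataclass
class TagEntry:          # sits at a fixed (set, way) in the tag array
    tag: int
    dgroup: int          # forward pointer: which d-group holds the data...
    frame: int           # ...and which frame within that d-group

@dataclass
class DataFrame:         # sits in some d-group of the data array
    data: bytes
    set_idx: int         # reverse pointer: set in the tag array...
    way: int             # ...and way within that set

# Moving a block to a faster d-group touches only the data side: copy the
# block, then update the tag's forward pointer and the new frame's reverse
# pointer. The tag itself, and thus the set/way lookup, never moves.
```

The slide's "One-to-one?" question asks whether this tag/frame mapping stays strictly one-to-one.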
CMP-NuRAPID – Overview
- Hybrid organization: private tag arrays, shared data array
- Controlled Replication (CR)
- In-Situ Communication (ISC)
- Capacity Stealing (CS)
CMP-NuRAPID – Structure
- Needs a carefully chosen d-group preference for each core
CMP-NuRAPID – Data and Tag Arrays
- Tag arrays snoop on the bus to maintain coherence
- The data array is accessed through a crossbar
CMP-NuRAPID – Controlled Replication
- For read-only sharing
- First use: no copy is made, saving capacity
- Second use: the block is replicated, reducing future access latency
- Net effect: avoids the off-chip misses that unrestricted replication causes
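A hedged sketch of that read policy (the bookkeeping sets stand in for the paper's hardware state): a first remote read reuses the existing copy, and only a reuse earns a local replica.

```python
touched = set()      # (block, core) pairs that have read remotely once
replicas = set()     # (block, core) pairs holding a local replica

def cr_read(block, core):
    if (block, core) in replicas:
        return "local hit"                  # fastest d-group
    if (block, core) not in touched:
        touched.add((block, core))          # first use: no copy made
        return "remote on-chip hit"
    replicas.add((block, core))             # second use: replicate locally
    return "remote hit, replica made"

# Core 1 reads block A three times: no copy, then replicate, then fast hit.
print(cr_read("A", 1))   # remote on-chip hit
print(cr_read("A", 1))   # remote hit, replica made
print(cr_read("A", 1))   # local hit
```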
CMP-NuRAPID – Timing Issues
- A read may start before an invalidation and finish after it:
  - Mark the tag of a block being read from a farther d-group as busy, so the invalidation cannot complete underneath it
- A read may start after the invalidation begins and finish before the invalidation completes:
  - Before sending a read request to a farther d-group, put an entry in a queue that preserves bus-transaction order
CMP-NuRAPID – In-Situ Communication
- For read-write sharing
- Adds a Communication (C) state
- L1 caches write through all C blocks, so the shared L2 copy stays current
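A rough sketch of the idea, assuming ISC keeps a single shared L2 copy that writers update and readers consume in place; the dictionaries are stand-ins for real cache structures, and everything beyond the C state itself is an assumption here.

```python
l2_copy = {}                       # block -> value: the single shared L2 copy
state = {}                         # block -> coherence state, e.g. "C"

def l1_write(block, value):
    if state.get(block) == "C":
        l2_copy[block] = value     # write-through keeps the shared copy current

def l1_read(block):
    return l2_copy[block]          # consumers read the copy in situ

state["X"] = "C"
l1_write("X", 42)
print(l1_read("X"))                # 42, with no invalidation round-trips
```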
CMP-NuRAPID – Capacity Stealing
- Demotes less frequently used data to unused frames in the d-groups closest to cores with smaller capacity demands
- Placement and promotion:
  - Place all private blocks in the d-group closest to the initiating core
  - Promote a block directly to the core's closest d-group
CMP-NuRAPID – Capacity Stealing (cont.)
- Demotion and replacement (sketched below):
  - Demote the victim block to the next-fastest d-group instead of evicting it
  - Choose victims in the order invalid, private, then shared
- Doesn't this kind of demotion pollute another core's fastest d-group?
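A hedged sketch of the demotion/replacement flow (frame layout and helper names are illustrative): pick victims in invalid/private/shared order, and push a displaced valid block one d-group farther out instead of dropping it.

```python
def pick_victim(frames):
    # Prefer invalid frames, then private blocks, then shared blocks.
    for wanted in ("invalid", "private", "shared"):
        for f in frames:
            if f["state"] == wanted:
                return f
    return frames[0]

def place(block, dgroups, level=0):
    if level == len(dgroups):
        return                                  # fell off the slowest d-group
    victim = pick_victim(dgroups[level])
    demoted = None if victim["state"] == "invalid" else dict(victim)
    victim.update(state=block["state"], tag=block["tag"])
    if demoted is not None:
        place(demoted, dgroups, level + 1)      # demote rather than evict

dgroups = [[{"state": "shared", "tag": 7}], [{"state": "invalid", "tag": None}]]
place({"state": "private", "tag": 3}, dgroups)
print(dgroups)   # block 3 takes the fast d-group; block 7 is demoted
```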
CMP-NuRAPID – Methodology
- Simics, 4-core CMP
- 8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups
- Both multithreaded and multiprogrammed workloads
CMP-NuRAPID – Multithreaded
CMP-NuRAPID – Multiprogrammed
Replication Schemes
- Cooperative Caching:
  - Private L2 caches
  - Restricts replication under certain criteria
- Victim Replication:
  - Shared L2 cache
  - Allows replication under certain criteria
- Both use static replication policies; how about a dynamic one?
ASR – Overview
- Adaptive Selective Replication: dynamic cache-block replication
- Replicate blocks when the benefits exceed the costs:
  - Benefit: lower L2 hit latency
  - Cost: more L2 misses
ASR – Sharing Types
- Single requestor: blocks accessed by only one processor
- Shared read-only: blocks read, but not written, by multiple processors
- Shared read-write: blocks accessed by multiple processors, with at least one write
- ASR focuses on replicating shared read-only blocks (classified below):
  - High locality
  - Small capacity footprint
  - Large portion of requests
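A tiny classifier for these three categories, assuming per-block reader and writer sets are available (the tracking structures are hypothetical):

```python
def sharing_type(readers, writers):
    procs = readers | writers       # every processor that touched the block
    if len(procs) <= 1:
        return "single requestor"
    return "shared read-write" if writers else "shared read-only"

print(sharing_type({0}, set()))          # single requestor
print(sharing_type({0, 1, 2}, set()))    # shared read-only
print(sharing_type({0, 1}, {1}))         # shared read-write
```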
ASR – SPR
- Selective Probabilistic Replication
- Assumes private L2 caches and selectively limits replication on L1 evictions
- Uses probabilistic filtering to make local replication decisions, as sketched below
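A minimal sketch of the filter, assuming a per-level probability table (the actual level count and probabilities are tuned empirically in the paper; these values are placeholders):

```python
import random

REPL_PROB = [0.0, 1/64, 1/16, 1/4, 1.0]      # assumed probability per level

def on_l1_eviction(block, level, local_l2):
    if block.get("sharing") != "read-only":
        return                               # replicate only shared read-only blocks
    if random.random() < REPL_PROB[level]:
        local_l2[block["tag"]] = block       # make a local L2 replica

# At level 3, roughly one in four evicted read-only blocks is replicated.
local_l2 = {}
on_l1_eviction({"tag": 9, "sharing": "read-only"}, level=3, local_l2=local_l2)
```

Keeping the decision local and probabilistic avoids global coordination, and frequently used blocks are evicted from L1 more often, so over time they get more chances to be replicated.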
ASR – Balancing Replication
ASR – Replication Control
- Replication levels: C (current), H (higher), L (lower)
- Cycle estimates: H (hit cycles-per-instruction) and M (miss cycles-per-instruction)
- Wait until there are enough events to ensure a fair cost/benefit comparison
- Wait until four consecutive evaluation intervals predict the same change before changing the replication level
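A hedged sketch of the whole control loop under those two rules. The event threshold and level count are assumptions, and the per-level cycle predictions are taken as inputs (the paper derives its hit and miss cycles-per-instruction estimates from dedicated monitoring hardware, which this sketch does not model).

```python
MIN_EVENTS = 10_000     # assumed: enough events for a fair comparison
NEEDED_AGREE = 4        # intervals that must agree before a level change

class ReplicationController:
    def __init__(self, levels=6):            # level count is an assumption
        self.level, self.levels = 0, levels
        self.pending, self.streak = None, 0

    def end_interval(self, events, cycles):
        """cycles: predicted hit+miss cycles-per-instruction at the current
        ('cur'), next-higher ('up'), and next-lower ('down') replication levels."""
        if events < MIN_EVENTS:
            return                           # too few events: no fair comparison
        move = None
        if self.level + 1 < self.levels and cycles["up"] < cycles["cur"]:
            move = +1                        # more replication predicted to pay off
        elif self.level > 0 and cycles["down"] < cycles["cur"]:
            move = -1                        # less replication predicted to pay off
        if move is None:
            self.pending, self.streak = None, 0
            return
        if move == self.pending:
            self.streak += 1                 # same prediction as last interval
        else:
            self.pending, self.streak = move, 1
        if self.streak >= NEEDED_AGREE:      # four consecutive agreements: commit
            self.level += move
            self.pending, self.streak = None, 0

ctl = ReplicationController()
for _ in range(4):
    ctl.end_interval(events=20_000, cycles={"cur": 1.0, "up": 0.9, "down": 1.1})
print(ctl.level)   # 1: four agreeing intervals raised the level once
```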
ASR – Designs Supported by SPR
- SPR-VR (Victim Replication):
  - Adds 1 bit per L2 cache block to identify replicas
  - Disallows replication when the local cache set is filled with owner blocks that have identified sharers
- SPR-NR (CMP-NuRAPID):
  - Stores a 1-bit counter per remote processor for each L2 block
  - Removes the shared-bus overhead (How?)
- SPR-CC (Cooperative Caching):
  - Models the centralized tag structure using an idealized distributed tag structure
ASR – Methodology
- Two CMP configurations: Current and Future
- 8 processors
- Writeback, write-allocate caches
- Both commercial and scientific workloads
- Throughput as the metric
ASR – Memory Cycles
ASR – Speedup
Conclusion
- Hybrid is better
- Dynamic is better
- Tradeoffs are needed
- How does it scale…