Slide 1: Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors
Milo Martin, Pacia Harper, Dan Sorin§, Mark Hill, and David Wood
University of Wisconsin; §Duke University
Wisconsin Multifacet Project
Slide 2: Cache-Coherence: A Latency/Bandwidth Tradeoff
[Chart: average miss latency vs. bandwidth usage (cost), with broadcast snooping as a design point]
Broadcast snooping:
- Evolved from bus to switched interconnect
- Direct data response (no indirection)
- Extra traffic (due to broadcast)
- Burns bandwidth for low latency
Slide 3: Cache-Coherence: A Latency/Bandwidth Tradeoff
[Chart adds the directory protocol as a second design point]
Directory protocols:
- Send to the directory, which forwards as needed
- Avoids broadcast
- Adds latency for cache-to-cache misses
- Sacrifices latency for a scalability/bandwidth improvement
Slide 4: Cache-Coherence: A Latency/Bandwidth Tradeoff
[Chart adds an "ideal" design point; goal: move toward the ideal]
A destination set is the set of processors to which a coherence request is sent. The right choice depends upon available bandwidth and other factors. We present four predictor types and explore indexing modes: something for everyone.
- If you like directory protocols, we can reduce the average miss latency at the cost of modest extra bandwidth
- If you like snooping protocols, we can significantly reduce the traffic at the cost of a modest loss in average miss latency
Approach: Destination-Set Prediction
- Learn from past coherence behavior
- Predict the destinations for each coherence request
- Correct predictions avoid both indirection and broadcast
Slide 5: System Model
[Diagram: processor/memory nodes (processor, caches, network interface, destination-set predictor, and a directory controller with directory state at the home memory), connected by an interconnect]
Similar to current systems; only the differences are highlighted:
- Difference from a directory protocol: ordered interconnect
- Difference from a snooping system: directory state at memory
- Difference from both: a simple bandwidth-adaptive mechanism
Slide 6: Outline
- Introduction
- Quantifying potential benefit
- Predictors
- Conclusions
(Predictor description and evaluation are combined in what follows.)
Slide 7: Potential Benefit: Two Questions
[Chart: average miss latency (y-axis) vs. bandwidth use (x-axis), with broadcast snooping, directory, and ideal points]
1. Is there a significant performance difference? (y-axis)
2. Is there a significant bandwidth difference? (x-axis)
Slide 8: Potential Benefit: Question 1 of 2
[Chart-only slide]
Slide 9: Potential Benefit: Question 1 of 2
[Chart: latency vs. bandwidth, with question #1 on the y-axis]
Yes, there is a significant performance difference:
- Cache-to-cache L2 misses are frequent (35%-96%)
- The directory lookup is on the critical path
- Larger caches don't help; they make the percentage worse
Slide 10: Potential Benefit: Question 2 of 2
[Chart: latency vs. bandwidth, with question #2 on the x-axis]
Is there a significant bandwidth difference?
Slide 11: Potential Benefit: Question 2 of 2
- Broadcast is overkill: only ~10% of requests require more than one other processor
- And the gap will grow with more processors
Slide 12: Potential Benefit: Question 2 of 2
[Chart: latency vs. bandwidth, with question #2 on the x-axis]
Yes, there is a significant bandwidth difference:
- Only ~10% of requests contact more than one other processor
- Broadcast is overkill
- The gap will grow with more processors
Slide 13: Outline
- Introduction
- Quantifying potential benefit
- Predictors
- Conclusions
Slide 14: Protocols for Destination-Set Prediction
[Chart: predicting the "minimal set" matches the directory protocol; predicting "broadcast" matches broadcast snooping; destination-set prediction targets the region between them, near the ideal]
- Allow direct responses for "correct" predictions
- Exploit sharing patterns (describe the properties we want, not the mechanisms)
- A random predictor would fall along a line connecting the two extremes
- One protocol unites directory protocols, snooping protocols, and the predictive design points in between (a request-handling sketch follows below)
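To make the tradeoff concrete, here is a minimal sketch of how a predicted destination set might be acted on, under assumed names (handle_request, the "home" node, the message accounting); the paper defines the protocol itself, not this code. A prediction is "correct" when it covers every processor that must see the request; otherwise the directory forwards to the missed sharers, re-introducing indirection.

```python
# Hypothetical sketch: acting on a destination-set prediction.
# Names and message accounting are illustrative, not from the paper.

def handle_request(requester: str, predicted: set, true_sharers: set) -> dict:
    # Every request goes to at least the "minimal set":
    # the requester itself and the home directory.
    destinations = predicted | {requester, "home"}
    if true_sharers <= destinations:
        # Correct prediction: the owner responds directly (snooping-like latency).
        return {"indirection": False, "messages": len(destinations)}
    # Under-prediction: the directory forwards to the missed sharers,
    # putting a directory-style indirection back on the critical path.
    missed = true_sharers - destinations
    return {"indirection": True, "messages": len(destinations) + len(missed)}

# Predicting broadcast behaves like snooping (direct, but many messages);
# predicting only the minimal set behaves like a directory protocol.
print(handle_request("P0", {"P1", "P2"}, {"P1"}))  # covered: direct response
print(handle_request("P0", set(), {"P3"}))         # minimal set: indirection
```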
Slide 15: Predictor Design Space
[Diagram: requests and training information flow into the destination-set predictor, which produces predictions]
Training information:
1. Responses to the processor's own requests
2. Coherence requests from others (read & write)
Basic organization:
- One predictor at each processor's L2 cache
- Accessed in parallel with the L2 tags
- No modifications to the processor core
- All predictors are simple cache-like (tagged) structures
- Indexed with the data block address
- Single-level predictor
Prediction:
- On a predictor tag miss, send to the minimal set (directory & self)
- Otherwise, generate a prediction (as described next; a base-structure sketch follows below)
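The shared machinery might look like the following sketch (assumed class and method names; the slides specify behavior, not an implementation): a tagged table indexed by block address that falls back to the minimal set on a tag miss, with each concrete predictor supplying its own entry state and prediction rule.

```python
# Hypothetical base class reused by the predictor sketches that follow.

class TaggedPredictor:
    def __init__(self, block_shift: int = 6):
        self.entries = {}          # tag -> per-predictor state (unbounded table)
        self.shift = block_shift   # dropping 6 offset bits indexes by 64B block

    def _tag(self, address: int) -> int:
        return address >> self.shift

    def predict(self, address: int, minimal_set: set) -> set:
        state = self.entries.get(self._tag(address))
        if state is None:
            # Tag miss: send only to the minimal set (directory & self).
            return set(minimal_set)
        return set(minimal_set) | self._predict(state)

    def _predict(self, state) -> set:
        raise NotImplementedError  # supplied by each concrete predictor
```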
Slide 16: Evaluation Methods
Six multiprocessor workloads:
- Online transaction processing (OLTP)
- Java middleware (SPECjbb)
- Static and dynamic web serving (Apache & Slash)
- Scientific applications (Barnes & Ocean)
Simulation environment:
- Full-system simulation using Simics
- 16-processor SPARC MOSI multiprocessor
- Many parameters (see paper)
- Traces (for exploration) & timing simulation (for runtime results)
Slide 17: Trace-based Predictor Evaluation
[Chart: OLTP workload, with snooping and directory as the endpoints]
- The rate of indirection corresponds to latency and can be thought of as prediction accuracy
- Traffic corresponds to bandwidth
- Traces let us quickly explore the design space (a metric sketch follows below)
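One way to read such trace plots (an assumed formulation, for intuition only, reusing the per-miss outcomes from the handle_request sketch above): summarize a predictor by its indirection rate, a latency proxy, and its messages per miss, a bandwidth proxy.

```python
# Assumed trace metrics; illustrative, not the paper's exact definitions.

def summarize(outcomes: list) -> dict:
    """outcomes: per-miss dicts like those returned by handle_request()."""
    misses = len(outcomes)
    return {
        # Lower indirection rate -> closer to snooping's latency.
        "indirection_rate": sum(o["indirection"] for o in outcomes) / misses,
        # Fewer messages per miss -> closer to directory's traffic.
        "messages_per_miss": sum(o["messages"] for o in outcomes) / misses,
    }

print(summarize([{"indirection": False, "messages": 17},
                 {"indirection": True,  "messages": 4}]))
```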
Slide 18: Predictor #1: Broadcast-if-shared
Goal: the performance of snooping, with fewer broadcasts
- Broadcast for "shared" data, minimal set for "private" data
Each entry: valid bit, 2-bit counter
- Decrement on data from memory
- Increment on data from a processor
- Increment on another processor's request
Prediction:
- If counter > 1, broadcast
- Otherwise, send only to the minimal set
(A sketch follows below.)
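A minimal sketch of these rules on top of the earlier TaggedPredictor base (the saturating-counter bounds and the method names are assumptions):

```python
ALL_PROCESSORS = {f"P{i}" for i in range(16)}  # 16-processor system, as simulated

class BroadcastIfShared(TaggedPredictor):      # base class from the earlier sketch
    def train_response(self, address, from_memory: bool):
        # Decrement on data from memory; increment on data from a processor.
        self._bump(address, -1 if from_memory else +1)

    def train_other_request(self, address):
        # Increment on another processor's request for this block.
        self._bump(address, +1)

    def _bump(self, address, delta):
        tag = self._tag(address)
        # Saturating 2-bit counter: clamp to [0, 3].
        self.entries[tag] = min(max(self.entries.get(tag, 0) + delta, 0), 3)

    def _predict(self, counter):
        # Counter > 1: the block looks shared, so broadcast.
        return ALL_PROCESSORS if counter > 1 else set()
```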
Slide 19: Predictor #1: Broadcast-if-shared
[Chart: OLTP workload; broadcast-if-shared plotted against directory and snooping]
- Unbounded predictor, indexed with the data block address (64B)
- Achieves the performance of snooping with less traffic
Slide 20: Predictor #2: Owner
Goal: traffic similar to directory, with fewer indirections
- Predict one extra processor (the "owner")
- Captures pairwise sharing and the write part of migratory sharing
Each entry: valid bit, predicted owner ID
- Set "owner" on data from another processor
- Set "owner" on another processor's request to write
- Unset "owner" on a response from memory
Prediction:
- If "valid", predict "owner" + minimal set
- Otherwise, send only to the minimal set
(A sketch follows below.)
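The same table shape with an owner ID as the entry state (again a sketch; the method names are assumptions):

```python
class Owner(TaggedPredictor):                    # base class from the earlier sketch
    def train_data_from_processor(self, address, owner):
        # Set "owner" on data arriving from another processor.
        self.entries[self._tag(address)] = owner

    def train_other_write(self, address, writer):
        # Set "owner" on another processor's request to write.
        self.entries[self._tag(address)] = writer

    def train_data_from_memory(self, address):
        # Unset "owner" on a response from memory (invalidates the entry).
        self.entries.pop(self._tag(address), None)

    def _predict(self, owner):
        # Valid entry: the predicted owner joins the minimal set.
        return {owner}
```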
Slide 21: Predictor #2: Owner
[Chart: OLTP workload; owner plotted alongside broadcast-if-shared, directory, and snooping]
- Achieves the traffic of directory with higher performance
Slide 22: Predictor #3: Group
Goal: approach the ideal bandwidth/latency point
- Detect groups of sharers
- Captures temporary groups or logical partitions (LPAR)
Each entry: N 2-bit counters (one per processor)
- On a response or request from another processor, increment the corresponding counter
- Train down by occasionally decrementing all counters (every 2N increments)
Prediction:
- Begin with the minimal set
- For each processor, if the corresponding counter > 1, add it to the predicted set
(A sketch follows below.)
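A sketch of the per-processor counters and the periodic train-down (the global increment count and where the decay is applied are assumptions about bookkeeping the slide leaves open):

```python
class Group(TaggedPredictor):                    # base class from the earlier sketch
    def __init__(self, num_processors: int = 16, **kwargs):
        super().__init__(**kwargs)
        self.n = num_processors
        self.increments = 0                      # counts increments for train-down

    def train_from(self, address, proc: int):
        counters = self.entries.setdefault(self._tag(address), [0] * self.n)
        counters[proc] = min(counters[proc] + 1, 3)   # saturating 2-bit counter
        self.increments += 1
        if self.increments % (2 * self.n) == 0:       # every 2N increments,
            for c in self.entries.values():           # train down: decrement all
                for i in range(self.n):
                    c[i] = max(c[i] - 1, 0)

    def _predict(self, counters):
        # Add every processor whose counter exceeds 1 to the minimal set.
        return {f"P{i}" for i, c in enumerate(counters) if c > 1}
```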
Slide 23: Predictor #3: Group
[Chart: OLTP workload; group plotted with owner, broadcast-if-shared, directory, and snooping]
- A design point between directory and snooping protocols
Slide 24: Predictor #4: Owner/Group
A hybrid: the Group predictor handles exclusive (write) requests, the Owner predictor handles shared (read) requests (a sketch follows below)
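The composition might be as simple as dispatching on the request type (an assumed wiring of the two earlier sketches):

```python
class OwnerGroup:
    def __init__(self):
        self.owner = Owner()   # consulted for shared (read) requests
        self.group = Group()   # consulted for exclusive (write) requests

    def predict(self, address, minimal_set, is_write: bool):
        chosen = self.group if is_write else self.owner
        return chosen.predict(address, minimal_set)
```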
Slide 25: Predictor #4: Owner/Group
[Chart: OLTP workload; owner/group plotted between owner and group]
- A hybrid design point between Owner and Group
Slide 26: Indexing Design Space
- Index by cache block (64B), as shown so far
- Index by program counter (PC): simple schemes are not as effective with PCs
- Index by macroblock (256B or 1024B): exploits the spatial predictability of sharing misses across spatially related blocks
(An indexing sketch follows below.)
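Concretely, the indexing modes differ only in how many low-order address bits are dropped before forming the tag (a sketch assuming power-of-two sizes):

```python
def block_tag(address: int) -> int:
    return address >> 6                        # 64B data block (as shown so far)

def macroblock_tag(address: int, size: int = 1024) -> int:
    # 256B macroblock -> shift by 8; 1024B -> shift by 10.
    return address >> (size.bit_length() - 1)

# Two neighboring 64B blocks share one 1024B-macroblock predictor entry:
assert macroblock_tag(0x1000) == macroblock_tag(0x1040)
assert block_tag(0x1000) != block_tag(0x1040)
```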
Slide 27: PC Indexing
[Chart: PC-indexed vs. data-block-indexed predictors]
- Data block indexing yields better predictions
Slide 28: Macroblock Indexing
[Chart: macroblock-indexed predictors]
- Macroblock indexing is an improvement
- Group improves substantially (from 30% to 10%)
Slide 29: Finite Size Predictors
[Chart: finite-size predictors vs. unbounded]
- 8192 entries: a 32KB to 64KB predictor
(A sizing estimate follows below.)
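The size range is consistent with simple arithmetic (an assumed breakdown; the paper's exact layout may differ): at 16 processors, the Group predictor's counters alone account for the lower figure.

```python
# Rough sizing check (assumed breakdown, for intuition only).
entries = 8192
counter_bits = 16 * 2                      # Group: one 2-bit counter per processor
print(entries * counter_bits / 8 / 1024)   # -> 32.0 (KB of counter state alone)
# Tags and valid bits add further state, toward the 64KB upper figure.
```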
Slide 30: Runtime Results
Which point in the design space should we simulate?
- As available bandwidth approaches infinity, snooping performs best (no indirections)
- As available bandwidth approaches zero, directory performs best (bandwidth efficient)
- Bandwidth/latency is a cost/performance tradeoff
- Cost is difficult to quantify: the cost of chip bandwidth, plus other associated costs (snoop bandwidth, power use)
- Bandwidth under-design will reduce performance
In the paper: we measure runtime & traffic, simulating plentiful (but limited) bandwidth
- This is a bandwidth-rich configuration; with less available bandwidth, the snooping system would not always be the highest-performing solution
- Like other architecture papers that estimate die area overheads, we provide an indirect measure of cost and show performance
Slide 31: Runtime Results: OLTP
[Chart: runtime and traffic for OLTP; traffic is mostly data traffic]
- This is a bandwidth-rich configuration; with less available bandwidth, the snooping system would not always be the highest-performing solution
- Recall why there is such a large performance gap: frequent cache-to-cache misses put the directory lookup on the critical path
- Group: 1/2 the runtime of directory, 2/3 the traffic of snooping
Slide 32: Conclusions
- Destination-Set Prediction provides better bandwidth/latency tradeoffs (not just the extremes of snooping and directory)
- Significant benefit from macroblock indexing
- An 8192-entry predictor requires only 64KB of space
- Result summary: 90% of the performance of snooping with only 15% more bandwidth than directory