Slide 1: Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors
Milo Martin, Pacia Harper, Dan Sorin§, Mark Hill, and David Wood
University of Wisconsin; §Duke University
Wisconsin Multifacet Project: http://www.cs.wisc.edu/multifacet/
(C) 2003 Milo Martin
Slide 2: Cache Coherence: A Latency/Bandwidth Tradeoff
(Footer on each slide: Destination-Set Prediction – ISCA’03 – Milo Martin)
Broadcast snooping: evolved from bus to switched interconnect
+ Direct data response (no indirection)
– Extra traffic (due to broadcast)
[Figure: average miss latency vs. bandwidth usage (cost); broadcast snooping burns bandwidth for low latency]
Slide 3: Cache Coherence: A Latency/Bandwidth Tradeoff
Directory protocols: send to the directory, which forwards as needed
+ Avoids broadcast
– Adds latency for cache-to-cache misses
[Figure: directory protocol sacrifices latency for scalability, plotted against broadcast snooping]
Slide 4: Cache Coherence: A Latency/Bandwidth Tradeoff
Approach: destination-set prediction
– Learn from past coherence behavior
– Predict the destinations for each coherence request
– A correct prediction avoids both indirection and broadcast
Goal: move toward the "ideal" design point
[Figure: "ideal" point with lower latency than the directory and lower bandwidth than snooping]
Slide 5: System Model
Processor/memory nodes, each containing:
– Caches, processor, memory, and network interface
– Directory controller and directory state
– Destination-set predictor
[Figure: nodes ($, P, M) on an interconnect; each node holds a predictor, and the "home" node holds the directory for a block]
Slide 6: Contributions
Destination-set predictors are:
– Simple: single-level, cache-like structures
– Low-cost: ~64kB (similar in size to the L2 tags)
– Effective: significantly closer to "ideal"
Exploit spatial predictability
– Aggregate spatially related information
– Capturing "spatial predictability" → better accuracy
– Reduces predictor sizes
Workload characterization for predictor design
– Commercial and technical workloads
– See paper
Slide 7: Outline
– Introduction
– Quantifying potential benefit
– Protocols
– Predictors
– Conclusions
Slide 8: Potential Benefit: Question 1 of 2
1. Is there a significant performance difference? Yes!
– Frequent cache-to-cache L2 misses (35%–96%)
– Large difference in latency (2x, or 100+ ns)
– Median of ~20% runtime reduction (up to 50%)
[Figure: latency/bandwidth plot highlighting the y-axis gap between the directory and snooping points]
Slide 9: Potential Benefit: Question 2 of 2
2. Is there a significant bandwidth difference? Yes!
– Only ~10% of requests contact more than one other processor
– Broadcast is overkill (see paper for a histogram)
– The gap will grow with more processors
[Figure: latency/bandwidth plot highlighting the x-axis gap between the directory and snooping points]
Slide 10: Outline
– Introduction
– Quantifying potential benefit
– Protocols
– Predictors
– Conclusions
Slide 11: Protocols for Destination-Set Prediction
Protocol goals:
– Allow direct responses for "correct" predictions
– Be no worse than the base protocols
– A continuum: erase the snooping/directory duality
Predicting "broadcast" behaves like snooping; predicting the "minimal set" behaves like a directory protocol
[Figure: latency/bandwidth plot with predicted sets spanning the space between the directory and snooping points]
Slide 12: Protocols for Destination-Set Prediction
Many possible protocols for implementation:
– Multicast snooping [Bilir et al.] and [Sorin et al.]
– Predictive directory protocols [Acacio et al.]
– Token Coherence [Martin et al.]
The requestor predicts the recipients
– Always includes the directory + self (the "minimal set")
The directory at the home memory audits predictions
– Tracks sharers/owner (just like a directory protocol)
– If the predicted set is sufficient, it acts as snooping (direct response)
– If insufficient, it acts as a directory (forwards the request)
The protocol is not the main focus of this work
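The directory's audit step described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's implementation; the function name and the write/read distinction as modeled here are assumptions.

```python
# Hypothetical sketch of the home directory "auditing" a predicted
# destination set against the sharers/owner state it tracks.
def audit(predicted_set, sharers, owner, is_write):
    """Return 'sufficient' (behaves like snooping: direct responses)
    or 'insufficient' (behaves like a directory: forward the request)."""
    # A write must reach every sharer plus the owner; a read needs the owner.
    needed = (set(sharers) | {owner}) if is_write else {owner}
    return 'sufficient' if needed <= set(predicted_set) else 'insufficient'
```

A correct (superset) prediction lets the owner and sharers respond directly; an under-prediction simply degrades to the directory's forwarding path, which is why the protocol is never worse than the base directory protocol.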
Slide 13: Outline
– Introduction
– Quantifying potential benefit
– Protocols
– Predictors
– Conclusions
Slide 14: Predictor Design Space
Basic organization:
– One predictor at each processor's L2 cache
– Accessed in parallel with the L2 tags
– No modifications to the processor core
Training information:
1. Responses to the processor's own requests
2. Coherence requests from others (read and write)
[Figure: destination-set predictor box with requests in and predictions out]
Slide 15: Our Destination-Set Predictors
All are simple cache-like (tagged) predictors:
– Indexed with the data block address
– Single-level structure
Prediction:
– On a tag miss, send to the minimal set (directory and self)
– Otherwise, generate a prediction (as described next)
Evaluation intermingled with the predictor descriptions:
– Three predictors (more in the paper)
– Exploit spatial predictability
– Limit predictor size
– Runtime results
Slide 16: Evaluation Methods
Six multiprocessor workloads:
– Online transaction processing (OLTP)
– Java middleware (SPECjbb)
– Static and dynamic web serving (Apache & Slash)
– Scientific applications (Barnes & Ocean)
Simulation environment:
– Full-system simulation using Simics
– 16-processor SPARC MOSI multiprocessor
– Many parameters (see paper)
– Traces (for design-space exploration) and timing simulation (for runtime results)
See "Simulating a $2M Server on $2K PC" [IEEE Computer, Feb 2003]
Slide 17: Trace-based Predictor Evaluation
Traces let us quickly explore the design space
[Figure: OLTP workload; the y-axis corresponds to latency, the x-axis to bandwidth; the directory and snooping protocols mark the endpoints]
Slide 18: Predictor #1: Broadcast-if-shared
Goal: the performance of snooping, with fewer broadcasts
– Broadcast for "shared" data
– Minimal set for "private" data
Each entry: a valid bit and a 2-bit counter
– Decrement on data from memory
– Increment on data from another processor
– Increment on another processor's request
Prediction:
– If the counter > 1, broadcast
– Otherwise, send only to the minimal set
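The training and prediction rules above can be sketched as a small Python class. This is a minimal sketch of the slide's counter scheme; class and method names are illustrative, and an unbounded dict stands in for the tagged, finite predictor table.

```python
# Minimal sketch of the broadcast-if-shared predictor: one 2-bit
# saturating counter per entry, indexed by data block address.
class BroadcastIfShared:
    def __init__(self, block_bits=6):      # 64B blocks -> 6 offset bits
        self.block_bits = block_bits
        self.counters = {}                 # tag -> counter in 0..3

    def _tag(self, addr):
        return addr >> self.block_bits

    # --- Training events ---
    def on_data_from_memory(self, addr):   # block looks "private": decrement
        t = self._tag(addr)
        self.counters[t] = max(self.counters.get(t, 0) - 1, 0)

    def on_data_from_processor(self, addr):  # cache-to-cache: increment
        t = self._tag(addr)
        self.counters[t] = min(self.counters.get(t, 0) + 1, 3)

    def on_other_request(self, addr):        # another processor's request
        self.on_data_from_processor(addr)    # also increments

    # --- Prediction ---
    def predict(self, addr):
        """'broadcast' if the counter > 1, else 'minimal' (directory + self)."""
        c = self.counters.get(self._tag(addr))
        if c is None:                        # tag miss: minimal set
            return 'minimal'
        return 'broadcast' if c > 1 else 'minimal'
```

The 2-bit counter provides hysteresis: a block must look shared twice in a row before the predictor commits bandwidth to a broadcast.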
Slide 19: Predictor #1: Broadcast-if-shared
Unbounded predictor, indexed by data block (64B)
Result: the performance of snooping with less traffic
[Figure: OLTP workload; broadcast-if-shared plotted between the directory and snooping points]
Slide 20: Predictor #2: Owner
Goal: traffic similar to a directory, with fewer indirections
– Predict one extra processor (the "owner")
– Captures pairwise sharing and the write part of migratory sharing
Each entry: a valid bit and a predicted owner ID
– Set "owner" on data from another processor
– Set "owner" on another processor's request to write
– Unset "owner" on a response from memory
Prediction:
– If valid, predict "owner" + minimal set
– Otherwise, send only to the minimal set
Slide 21: Predictor #2: Owner
Result: the traffic of a directory with higher performance
[Figure: OLTP workload; owner plotted near the directory's traffic but well below its latency; broadcast-if-shared shown for comparison]
Slide 22: Predictor #3: Group
Goal: approach the ideal bandwidth/latency point
– Detect groups of sharers
– Handles temporary groups or logical partitions (LPARs)
Each entry: N 2-bit counters
– On a response or request from another processor, increment that processor's counter
– Train down by occasionally decrementing all counters (every 2N increments)
Prediction:
– Begin with the minimal set
– For each processor, if its counter > 1, add it to the predicted set
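The group predictor generalizes the earlier counters to one per processor. The sketch below is again illustrative Python under the same assumptions (unbounded table, invented names); the decay policy follows the slide's "decrement all counters every 2N increments".

```python
# Minimal sketch of the group predictor: N 2-bit counters per entry,
# with periodic "train down" decay.
class GroupPredictor:
    def __init__(self, my_id, n_procs=16, block_bits=6):
        self.my_id = my_id
        self.n = n_procs
        self.block_bits = block_bits
        self.entries = {}   # tag -> {'ctr': [0..3]*N, 'incs': count}

    def _entry(self, addr):
        t = addr >> self.block_bits
        return self.entries.setdefault(t, {'ctr': [0] * self.n, 'incs': 0})

    def train(self, addr, proc):
        """A response or request was observed from processor `proc`."""
        e = self._entry(addr)
        e['ctr'][proc] = min(e['ctr'][proc] + 1, 3)
        e['incs'] += 1
        if e['incs'] >= 2 * self.n:        # every 2N increments,
            e['ctr'] = [max(c - 1, 0) for c in e['ctr']]  # decay all
            e['incs'] = 0

    def predict(self, addr):
        """Minimal set plus every processor whose counter > 1."""
        dests = {'directory', self.my_id}
        e = self._entry(addr)
        for proc, c in enumerate(e['ctr']):
            if c > 1:
                dests.add(proc)
        return dests
```

The decay step is what lets the predicted set track temporary sharing groups: processors that stop touching a block gradually fall back out of the set.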
Slide 23: Predictor #3: Group
A design point between the directory and snooping protocols
[Figure: OLTP workload; group plotted between owner and broadcast-if-shared]
Slide 24: Indexing Design Space
Index by cache block (64B)
– Works well (as shown)
Index by program counter (PC)
– Simple schemes are not as effective with PCs (see paper)
Index by macroblock (256B or 1024B)
– Exploits the spatial predictability of sharing misses
– Aggregates information for spatially related blocks
– E.g., reading a shared buffer, process migration
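Macroblock indexing is just a coarser index function: drop more low-order address bits so that neighboring blocks share one predictor entry. A tiny sketch (the helper name is illustrative):

```python
# Macroblock indexing: all 64B blocks inside the same macroblock map to
# the same predictor entry, so they share (and aggregate) training state.
def predictor_index(addr, macroblock_size=1024):
    return addr // macroblock_size

# Two different 64B blocks inside one 1024B macroblock:
a = 0x10040   # offset 0x040 within the macroblock at 0x10000
b = 0x10200   # offset 0x200 within the same macroblock
```

A miss on block `a` thus trains the prediction that will be used for block `b`, which is exactly how a sequential read of a shared buffer becomes predictable before each individual block is first touched.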
Slide 25: Macroblock Indexing
Macroblock indexing is an improvement for all three predictors
– Group improves substantially (30% → 10%)
[Figure: OLTP workload; broadcast-if-shared, owner, and group each plotted with 64B, 256B, and 1024B indexing]
Slide 26: Finite Size Predictors
8192 entries → a 32kB to 64kB predictor
– 2–4% of the L2 cache size (smaller than the L2 tags)
[Figure: OLTP workload with 1024B macroblock indexing; unbounded vs. 8192-entry versions of each predictor]
Slide 27: Runtime Results
What point in the design space should we simulate?
– As available bandwidth → infinite, snooping performs best (no indirections)
– As available bandwidth → 0, the directory performs best (bandwidth efficient)
Bandwidth/latency is a cost/performance tradeoff
– Cost is difficult to quantify (the cost of chip bandwidth)
– Other associated costs (snoop bandwidth, power use)
– Under-designing bandwidth will reduce performance
Our evaluation: measure runtime and traffic
– Simulate plentiful (but limited) bandwidth
Slide 28: Runtime Results: OLTP
The predictors achieve 1/2 the runtime of the directory with 2/3 the traffic of snooping
[Figure: runtime and traffic bars for directory, broadcast-if-shared, owner, group, and snooping; the traffic is mostly data]
Slide 29: More Runtime Results
[Figure: runtime results for the remaining workloads]
Slide 30: Conclusions
Destination-set prediction is effective
– Provides better bandwidth/latency tradeoffs (not just the extremes of snooping and directory)
– Significant benefit from macroblock indexing
– Result summary: 90% of the performance of snooping with only 15% more bandwidth than the directory
Simple, low-cost predictors
– Many further improvements possible
One current disadvantage: protocol complexity
– Solution: use Token Coherence [ISCA 2003]
Slide 31: (divider before backup slides)
Slide 32: Potential Benefit: Two Questions
1. Significant performance difference? (y-axis)
2. Significant bandwidth difference? (x-axis)
[Figure: latency/bandwidth plot with the directory, snooping, and ideal points]
Slide 33: Significant Performance Difference? Yes!
Frequent cache-to-cache misses
– 35%–96% of L2 misses (average of 65%)
– 0.3 to 5.2 per 1000 instructions
– Larger caches don't help
Large difference in latency (2x)
– "Two-hop" vs. "three-hop" misses
– The directory lookup is on the critical path
– 130ns difference (80ns directory + 50ns network)
– 260 cycles (at 2GHz) every 200 instructions
– Median of ~20% (up to 50%) runtime reduction
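The slide's latency arithmetic checks out as a quick back-of-envelope calculation (the 2GHz clock is stated on the slide):

```python
# Indirection penalty of a three-hop (directory) miss vs. a two-hop miss:
directory_ns = 80          # directory lookup on the critical path
network_ns = 50            # extra network traversal
extra_ns = directory_ns + network_ns   # 130 ns per indirected miss
cycles_per_ns = 2          # 2 GHz clock
extra_cycles = extra_ns * cycles_per_ns  # 260 cycles
```

At 0.3 to 5.2 cache-to-cache misses per 1000 instructions, paying ~260 cycles per indirection is what produces the ~20% median runtime reduction when the indirection is avoided.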
Slide 34: Significant Bandwidth Difference? Yes!
Broadcast is overkill: only ~10% of requests require more than one other processor
(and the gap will grow with more processors)
Slide 35: Sharing Temporal Locality
45%–90% of misses are in the top 10,000 blocks
[Figure: cumulative miss coverage per workload (OLTP, Apache, Slashcode, SPECjbb, Ocean, Barnes-Hut), ranging from ~45% to ~70% to ~90%]
Slide 36: Sharing Temporal Locality: Macroblocks
80% of misses go to the top 10,000 macroblocks (1024B)
Slide 37: Contributions
1. Predictor design-space exploration
– Effective: significantly closer to "ideal"
– Simple: single-level, cache-like structures
– Reasonable cost: ~64kB (similar in size to the L2 tags)
– Key result: aggregating spatially related information captures spatial predictability and allows smaller predictor sizes
2. Our evaluation includes:
– Commercial and scientific workloads
– Characterization of the relevant aspects of these workloads
– Detailed timing-simulation results