(C) 2003 Milo Martin. Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors. Milo Martin, Pacia Harper, Dan Sorin, Mark Hill, and David Wood.


(C) 2003 Milo Martin
Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors
Milo Martin, Pacia Harper, Dan Sorin§, Mark Hill, and David Wood
University of Wisconsin, §Duke University
Wisconsin Multifacet Project

Destination-Set Prediction – ISCA’03 - Milo Martin slide 2
Cache Coherence: A Latency/Bandwidth Tradeoff
Broadcast snooping
– Evolved from bus to switched interconnect
+ Direct data response (no indirection)
– Extra traffic (due to broadcast)
[Figure: average miss latency vs. bandwidth usage (cost); broadcast snooping burns bandwidth for low latency]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 3
Cache Coherence: A Latency/Bandwidth Tradeoff
Directory protocols
– Send to the directory, which forwards as needed
+ Avoids broadcast
– Adds latency for cache-to-cache misses
[Figure: directory protocol sacrifices latency for scalability, relative to broadcast snooping]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 4
Cache Coherence: A Latency/Bandwidth Tradeoff
Approach: Destination-set Prediction
– Learn from past coherence behavior
– Predict destinations for each coherence request
– Correct predictions avoid both indirection and broadcast
Goal: move toward the “ideal” design point
[Figure: “ideal” point with both low miss latency and low bandwidth usage, between directory and snooping]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 5
System Model
Processor/memory nodes, connected by an interconnect; each node contains a processor, caches, memory, a directory controller, and a network interface
– Destination-set predictor at each node
– Directory state at each node
[Figure: interconnect linking processor/memory nodes; the requestor consults its destination-set predictor, and the home node holds the directory]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 6
Contributions
Destination-set predictors are:
– Simple: single-level, cache-like structures
– Low-cost: ~64kB (similar in size to the L2 tags)
– Effective: significantly closer to “ideal”
Exploit spatial predictability
– Aggregate spatially-related information
– Capturing “spatial predictability” gives better accuracy
– Reduces predictor sizes
Workload characterization for predictor design
– Commercial and technical workloads
– See paper

Destination-Set Prediction – ISCA’03 - Milo Martin slide 7
Outline
– Introduction
– Quantifying potential benefit
– Protocols
– Predictors
– Conclusions

Destination-Set Prediction – ISCA’03 - Milo Martin slide 8
Potential Benefit: Question 1 of 2
1. Is there a significant performance difference? Yes!
– Frequent cache-to-cache L2 misses (35%–96%)
– Large difference in latency (2x, or 100+ ns)
– Median of ~20% runtime reduction (up to 50%)
[Figure: latency/bandwidth tradeoff; the performance gap is the y-axis distance between the directory protocol and the ideal point]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 9
Potential Benefit: Question 2 of 2
2. Is there a significant bandwidth difference? Yes!
– Only ~10% of requests contact more than one other processor
– Broadcast is overkill (see paper for a histogram)
– The gap will grow with more processors
[Figure: latency/bandwidth tradeoff; the bandwidth gap is the x-axis distance between broadcast snooping and the ideal point]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 10
Outline
– Introduction
– Quantifying potential benefit
– Protocols
– Predictors
– Conclusions

Destination-Set Prediction – ISCA’03 - Milo Martin slide 11
Protocols for Destination-Set Prediction
Protocol goals
– Allow direct responses for “correct” predictions
– No worse than the base protocols
A continuum: predicting “broadcast” behaves like snooping, and predicting the “minimal set” behaves like a directory protocol, erasing the snooping/directory duality.
[Figure: predicted sets span the design space between the directory protocol and broadcast snooping]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 12
Protocols for Destination-Set Prediction
Many possible protocols for implementation
– Multicast snooping [Bilir et al.] & [Sorin et al.]
– Predictive directory protocols [Acacio et al.]
– Token Coherence [Martin et al.]
Requestor predicts recipients
– Always includes the directory + self (the “minimal set”)
Directory at the home memory audits predictions
– Tracks sharers/owner (just like a directory protocol)
– Prediction “sufficient” → acts as snooping (direct response)
– Prediction “insufficient” → acts as directory (forwards the request)
The protocol is not the main focus of this work
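The audit step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entry fields (`owner`, `sharers`) and return values are hypothetical names, and a real protocol would also handle races and acknowledgment counting.

```python
def audit_request(predicted_set, requestor, directory_entry, is_write):
    """Directory-side audit of a predicted destination set (sketch).

    If the prediction covers every node the request needed to reach,
    the protocol behaves like snooping (direct response); otherwise
    the directory forwards the request, behaving like a directory
    protocol.
    """
    # Nodes that actually must see this request, per directory state:
    # reads need the owner; writes need the owner and all sharers.
    needed = {directory_entry["owner"]}
    if is_write:
        needed |= directory_entry["sharers"]
    needed.discard(requestor)   # requestor already has the request
    needed.discard(None)        # no owner means memory responds

    if needed <= predicted_set:
        return "sufficient"     # acts as snooping: direct response
    # Forward only to the nodes the prediction missed.
    return ("insufficient", needed - predicted_set)
```

With an entry owned by node 2 and shared by nodes 1 and 3, a read predicted to {0, 2} is sufficient, while a write predicted only to {0} must be forwarded to {1, 2, 3}.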

Destination-Set Prediction – ISCA’03 - Milo Martin slide 13
Outline
– Introduction
– Quantifying potential benefit
– Protocols
– Predictors
– Conclusions

Destination-Set Prediction – ISCA’03 - Milo Martin slide 14
Predictor Design Space
Basic organization
– One predictor at each processor’s L2 cache
– Accessed in parallel with the L2 tags
– No modifications to the processor core
Training information
1. Responses to own requests
2. Coherence requests from others (read & write)
[Figure: requests and training information flow into the destination-set predictor, which emits predictions]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 15
Our Destination-Set Predictors
All are simple cache-like (tagged) predictors
– Indexed with the data block address
– Single-level predictor
Prediction
– On a tag miss, send to the minimal set (directory & self)
– Otherwise, generate a prediction (as described next)
Evaluation intermingled with the predictor descriptions
– Three predictors (more in the paper)
– Exploiting spatial predictability
– Limiting predictor size
– Runtime results

Destination-Set Prediction – ISCA’03 - Milo Martin slide 16
Evaluation Methods
Six multiprocessor workloads
– Online transaction processing (OLTP)
– Java middleware (SPECjbb)
– Static and dynamic web serving (Apache & Slash)
– Scientific applications (Barnes & Ocean)
Simulation environment
– Full-system simulation using Simics
– 16-processor SPARC MOSI multiprocessor
– Many parameters (see paper)
– Traces (for exploration) & timing simulation (for runtime results)
– See “Simulating a $2M Commercial Server on a $2K PC” [IEEE Computer, Feb 2003]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 17
Trace-based Predictor Evaluation
– Quickly explore the design space
[Figure: OLTP workload; predictors plotted between the directory and snooping extremes — the y-axis corresponds to latency, the x-axis to bandwidth]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 18
Predictor #1: Broadcast-if-shared
Goal: the performance of snooping, with fewer broadcasts
– Broadcast for “shared” data
– Minimal set for “private” data
Each entry: valid bit, 2-bit counter
– Decrement on data from memory
– Increment on data from a processor
– Increment on another processor’s request
Prediction
– If counter > 1, broadcast
– Otherwise, send only to the minimal set
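The counter policy above can be sketched as follows, assuming an unbounded table for clarity (the real predictor is a finite, tagged, cache-like structure; the event names here are illustrative):

```python
class BroadcastIfShared:
    """Sketch of the broadcast-if-shared predictor: one saturating
    2-bit counter per block. Counter values above 1 mean the block
    looks shared, so broadcast; otherwise send only to the minimal
    set (directory + self). Tag management and eviction are omitted.
    """

    def __init__(self):
        self.counter = {}  # block address -> saturating 2-bit counter

    def train(self, block, event):
        c = self.counter.get(block, 0)
        if event == "data_from_memory":           # looks private
            c = max(c - 1, 0)
        elif event in ("data_from_processor",     # looks shared
                       "other_request"):
            c = min(c + 1, 3)
        self.counter[block] = c

    def predict(self, block):
        if block not in self.counter:
            return "minimal"          # tag miss: directory + self
        return "broadcast" if self.counter[block] > 1 else "minimal"
```

Two shared-looking events push the counter past the threshold and switch the block to broadcast; a response from memory decays it back toward the minimal set.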

Destination-Set Prediction – ISCA’03 - Milo Martin slide 19
Predictor #1: Broadcast-if-shared
Unbounded predictor, indexed with the data block address (64B)
[Figure: OLTP workload; broadcast-if-shared achieves the performance of snooping with less traffic]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 20
Predictor #2: Owner
Goal: traffic similar to directory, with fewer indirections
– Predict one extra processor (the “owner”)
– Captures pairwise sharing and the write part of migratory sharing
Each entry: valid bit, predicted owner ID
– Set “owner” on data from another processor
– Set “owner” on another processor’s request to write
– Unset “owner” on a response from memory
Prediction
– If “valid”, predict “owner” + minimal set
– Otherwise, send only to the minimal set
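The owner policy can be sketched in the same style; again the table is unbounded for clarity and the event names are illustrative, not the paper's exact interface:

```python
class OwnerPredictor:
    """Sketch of the owner predictor: each entry holds a valid bit
    and a predicted owner ID (here, presence in the dict serves as
    the valid bit). Traffic stays near directory levels because at
    most one extra destination is ever predicted."""

    def __init__(self):
        self.owner = {}  # block address -> predicted owner node ID

    def train(self, block, event, node=None):
        if event in ("data_from_processor", "other_write_request"):
            self.owner[block] = node      # remember the likely owner
        elif event == "data_from_memory":
            self.owner.pop(block, None)   # invalidate: memory owns it

    def predict(self, block, minimal_set):
        # Predict owner + minimal set when valid, else minimal set only.
        if block in self.owner:
            return set(minimal_set) | {self.owner[block]}
        return set(minimal_set)
```

For example, after node 5 supplies the data, later misses to the same block add node 5 to the minimal set; a subsequent response from memory clears the entry.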

Destination-Set Prediction – ISCA’03 - Milo Martin slide 21
Predictor #2: Owner
[Figure: OLTP workload; the owner predictor achieves the traffic of directory with higher performance]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 22
Predictor #3: Group
Goal: approach the ideal bandwidth/latency point
– Detect groups of sharers
– Temporary groups or logical partitions (LPARs)
Each entry: N 2-bit counters
– On a response or request from another processor, increment the corresponding counter
– Train down by occasionally decrementing all counters (every 2N increments)
Prediction
– Begin with the minimal set
– For each processor, if the corresponding counter > 1, add it to the predicted set
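A minimal sketch of the group policy, assuming an unbounded table and illustrative names (a real predictor would be a finite tagged structure, and the decay interval could be implemented differently):

```python
class GroupPredictor:
    """Sketch of the group predictor: each entry keeps one saturating
    2-bit counter per node (N counters total). Counters increment on
    responses/requests from that node; all counters in an entry decay
    once every 2N increments ("train down"). Nodes whose counter
    exceeds 1 join the predicted set."""

    def __init__(self, n_nodes):
        self.n = n_nodes
        self.table = {}  # block -> [increments_since_decay, counters]

    def train(self, block, node):
        entry = self.table.setdefault(block, [0, [0] * self.n])
        since, counters = entry
        counters[node] = min(counters[node] + 1, 3)
        since += 1
        if since >= 2 * self.n:   # periodic decay of the whole entry
            counters[:] = [max(c - 1, 0) for c in counters]
            since = 0
        entry[0] = since

    def predict(self, block, minimal_set):
        if block not in self.table:
            return set(minimal_set)
        _, counters = self.table[block]
        return set(minimal_set) | {i for i, c in enumerate(counters)
                                   if c > 1}
```

A node that has interacted with the block at least twice recently crosses the threshold and joins the group; nodes seen only once, or long ago, stay out.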

Destination-Set Prediction – ISCA’03 - Milo Martin slide 23
Predictor #3: Group
[Figure: OLTP workload; the group predictor is a design point between the directory and snooping protocols]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 24
Indexing Design Space
Index by cache block (64B)
– Works well (as shown)
Index by program counter (PC)
– Simple schemes are not as effective with PCs
– See paper
Index by macroblock (256B or 1024B)
– Exploits the spatial predictability of sharing misses
– Aggregates information for spatially-related blocks
– E.g., reading a shared buffer, process migration
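Macroblock indexing amounts to dropping the low-order offset bits of the address before the predictor lookup, so all cache blocks inside one macroblock share (and jointly train) a single predictor entry. A minimal sketch, with 1024B as one of the slide's example granularities:

```python
def predictor_index(addr, macroblock_bytes=1024):
    """Map an address to its macroblock-granularity predictor index
    by masking off the low offset bits (macroblock size must be a
    power of two)."""
    return addr & ~(macroblock_bytes - 1)
```

For example, 64B blocks at addresses 0x10040 and 0x10380 map to the same 1024B macroblock index 0x10000, so training from one block sharpens predictions for its neighbors.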

Destination-Set Prediction – ISCA’03 - Milo Martin slide 25
Macroblock Indexing
– Macroblock indexing is an improvement
– Group improves substantially (30% → 10%)
[Figure: OLTP workload; broadcast-if-shared, owner, and group with 64B, 256B, and 1024B block indexing, plotted between directory and snooping]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 26
Finite Size Predictors
– 8192 entries → 32kB to 64kB per predictor
– 2–4% of the L2 cache size (smaller than the L2 tags)
[Figure: OLTP workload, 1024B macroblock index; unbounded vs. 8192-entry versions of broadcast-if-shared, owner, and group, plotted between directory and snooping]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 27
Runtime Results
What point in the design space should we simulate?
– As available bandwidth → infinite, snooping performs best (no indirections)
– As available bandwidth → 0, directory performs best (bandwidth efficient)
Bandwidth/latency is a cost/performance tradeoff
– Cost is difficult to quantify (cost of chip bandwidth)
– Other associated costs (snoop bandwidth, power use)
– Bandwidth under-design will reduce performance
Our evaluation: measure runtime & traffic
– Simulate plentiful (but limited) bandwidth

Destination-Set Prediction – ISCA’03 - Milo Martin slide 28
Runtime Results: OLTP
– Half the runtime of directory, two-thirds the traffic of snooping
[Figure: runtime and traffic bars for directory, broadcast-if-shared, owner, group, and snooping; the traffic is mostly data traffic]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 29 More Runtime Results

Destination-Set Prediction – ISCA’03 - Milo Martin slide 30
Conclusions
Destination-set prediction is effective
– Provides better bandwidth/latency tradeoffs (not just the extremes of snooping and directory)
– Significant benefit from macroblock indexing
– Result summary: 90% of the performance of snooping, with only 15% more bandwidth than directory
Simple, low-cost predictors
– Many further improvements are possible
One current disadvantage: protocol complexity
– Solution: use Token Coherence [ISCA 2003]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 31

Destination-Set Prediction – ISCA’03 - Milo Martin slide 32
Potential Benefit: Two Questions
1. Is there a significant performance difference? (y-axis)
2. Is there a significant bandwidth difference? (x-axis)
[Figure: latency/bandwidth tradeoff with the directory protocol, broadcast snooping, and the ideal point]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 33
Significant Performance Difference? Yes!
Frequent cache-to-cache misses
– 35%–96% of L2 misses (average of 65%)
– 0.3 to 5.2 per 1000 instructions
– Larger caches don’t help
Large difference in latency (2x)
– “Two-hop” vs. “three-hop” misses
– Directory lookup is on the critical path
– 130ns difference (80ns directory + 50ns network) → 260 cycles (at 2GHz) every 200 instructions
– Median of ~20% (up to 50%) runtime reduction

Destination-Set Prediction – ISCA’03 - Milo Martin slide 34
Significant Bandwidth Difference? Yes!
– Broadcast is overkill: only ~10% of requests require more than one other processor
– And the gap will grow with more processors

Destination-Set Prediction – ISCA’03 - Milo Martin slide 35
Sharing Temporal Locality
– 45%–90% of misses are to the top 10,000 blocks
[Figure: per-workload curves for SPECjbb, Ocean, Slashcode, Apache, OLTP, and Barnes-Hut, ranging from ~45% to ~70% to ~90%]

Destination-Set Prediction – ISCA’03 - Milo Martin slide 36
Sharing Temporal Locality – Macroblocks
– 80% of misses are to the top 10,000 macroblocks (1024B)

Destination-Set Prediction – ISCA’03 - Milo Martin slide 37
Contributions
1. Predictor design-space exploration
– Effective: significantly closer to “ideal”
– Simple: single-level, cache-like structures
– Reasonable cost: ~64kB (similar in size to the L2 tags)
– Key result: aggregating spatially-related information captures spatial predictability and allows smaller predictor sizes
2. Our evaluation includes:
– Commercial and scientific workloads
– Characterization of the relevant aspects of these workloads
– Detailed timing-simulation results