(C) 2003 Milo Martin Token Coherence: Decoupling Performance and Correctness Milo Martin, Mark Hill, and David Wood Wisconsin Multifacet Project


(C) 2003 Milo Martin Token Coherence: Decoupling Performance and Correctness Milo Martin, Mark Hill, and David Wood Wisconsin Multifacet Project University of Wisconsin—Madison

Token Coherence – Milo Martin slide 2
We See Two Problems in Cache Coherence
1. Protocol ordering bottlenecks
– Artifact of conservatively resolving racing requests
– “Virtual bus” interconnect (snooping protocols)
– Indirection (directory protocols)
2. Protocol enhancements compound complexity
– Fragile, error prone & difficult to reason about
– Why? A distributed & concurrent system
– Often enhancements are too complicated to implement (predictive/adaptive/hybrid protocols)
Performance and correctness are tightly intertwined

Token Coherence – Milo Martin slide 3
Rethinking Cache-Coherence Protocols
Goal of invalidation-based coherence
– Invariant: many readers -or- single writer
– Enforced by globally coordinated actions
Key innovation: enforce this invariant directly using tokens
– Fixed number of tokens per block
– One token to read, all tokens to write
Guarantees safety in all cases
– Global invariant enforced with only local rules
– Independent of races, request ordering, etc.

Token Coherence – Milo Martin slide 4
Token Coherence: A New Framework for Cache Coherence
Goal: decouple performance and correctness
– Fast in the common case
– Correct in all cases
To remove ordering bottlenecks (focus of this talk)
– Ignore races (fast common case)
– Tokens enforce safety (all cases)
To reduce complexity (briefly described)
– Performance enhancements (fast common case)
– Without affecting correctness (all cases)
– (without increased complexity)

Token Coherence – Milo Martin slide 5
Outline
– Overview
– Problem: ordering bottlenecks
– Solution: Token Coherence (TokenB)
– Evaluation
– Further exploiting decoupling
– Conclusions

Token Coherence – Milo Martin slide 6
Technology Trends
High-speed point-to-point links
– No (multi-drop) busses
Increasing design integration
– “Glueless” multiprocessors
– Improve cost & latency
Desire: low-latency interconnect
– Avoid “virtual bus” ordering
– Enabled by directory protocols
Technology trends → unordered interconnects

Token Coherence – Milo Martin slide 7
Workload Trends
Commercial workloads
– Many cache-to-cache misses
– Clusters of small multiprocessors
Goals:
– Direct cache-to-cache misses (2 hops, not 3 hops)
– Moderate scalability
Workload trends → avoid indirection, broadcast OK
[Diagram: directory-protocol indirection among processors P and memory M]

Token Coherence – Milo Martin slide 8
Basic Approach
Low-latency protocol
– Broadcast with direct responses
– As in snooping protocols
Low-latency interconnect
– Use unordered interconnect
– As in directory protocols
Fast & works fine with no races…
…but what happens in the case of a race?

Token Coherence – Milo Martin slide 9
Basic approach… but not yet correct
Initial state: P0 no copy, P1 no copy, P2 read/write
1. P0 issues a request to write (delayed to P2 in the interconnect)
2. Ack
3. P1 issues a request to read

Token Coherence – Milo Martin slide 10
Basic approach… but not yet correct
4. P2 responds with data to P1; both now hold the block read-only (P0 still has no copy)

Token Coherence – Milo Martin slide 11
Basic approach… but not yet correct
5. P0’s delayed write request arrives at P2 (P1 and P2 hold the block read-only)

Token Coherence – Milo Martin slide 12
Basic approach… but not yet correct
6. P2 responds to P0: P2 invalidates (no copy) and P0 becomes read/write, but P1 still holds the block read-only

Token Coherence – Milo Martin slide 13
Basic approach… but not yet correct
Problem: P0 (read/write) and P1 (read-only) are in inconsistent states
Locally “correct” operation, globally inconsistent

Token Coherence – Milo Martin slide 14
Contribution #1: Token Counting
Tokens control reading & writing of data
At all times, all blocks have T tokens
– E.g., one token per processor
– One or more tokens to read
– All tokens to write
Tokens: in caches, memory, or in transit
Components exchange tokens & data
Provides safety in all cases
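The counting rule above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not the paper's hardware: each block has a fixed total of T tokens, one or more tokens permit reading, and only all T tokens permit writing.

```python
# Token-counting safety rule: fixed total of tokens per block;
# >= 1 token to read, all tokens to write.

TOTAL_TOKENS = 16  # e.g., one token per processor in a 16p system


class CachedBlock:
    def __init__(self, tokens=0):
        self.tokens = tokens

    def can_read(self):
        return self.tokens >= 1              # one or more tokens to read

    def can_write(self):
        return self.tokens == TOTAL_TOKENS   # all tokens to write


def tokens_conserved(holders):
    """Safety rests on conservation: across caches, memory, and
    messages in transit, a block's tokens always sum to T."""
    return sum(h.tokens for h in holders) == TOTAL_TOKENS
```

Because the check is purely local (a cache inspects only its own token count), no global ordering of requests is needed to keep a writer and a reader from coexisting.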

Token Coherence – Milo Martin slide 15
Basic Approach (Revisited)
As before:
– Broadcast with direct responses (like snooping)
– Use unordered interconnect (like directory)
Track tokens for safety
More refinement in a moment…

Token Coherence – Milo Martin slide 16
Token Coherence Example
Initial state: P0 T=0, P1 T=0, P2 T=16 (R/W) — P2 holds the max (all) tokens
1. P0 issues a request to write (delayed to P2 in the interconnect)
2. P0’s request remains delayed
3. P1 issues a request to read

Token Coherence – Milo Martin slide 17
Token Coherence Example
4. P2 responds with data and one token to P1 (P2: T=15, read-only; P1: T=1, read-only)

Token Coherence – Milo Martin slide 18
Token Coherence Example
5. P0’s delayed request arrives at P2 (P1: T=1, read-only; P2: T=15, read-only)

Token Coherence – Milo Martin slide 19
Token Coherence Example
6. P2 responds to P0 with data and all 15 of its tokens (P2: T=0; P0: T=15, read-only)

Token Coherence – Milo Martin slide 20
Token Coherence Example
Resulting state: P0 T=15 (read-only), P1 T=1 (read-only), P2 T=0

Token Coherence – Milo Martin slide 21
Token Coherence Example
Now what? P0 wants all tokens but holds only 15 (P1 has 1, P2 has 0)

Token Coherence – Milo Martin slide 22
Basic Approach (Re-Revisited)
As before:
– Broadcast with direct responses (like snooping)
– Use unordered interconnect (like directory)
– Track tokens for safety
Reissue requests as needed
– Needed due to racing requests (uncommon)
– Timeout to detect failed completion (wait twice the average miss latency)
– Small hardware overhead
– All races handled in this uniform fashion
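The reissue rule above is simple enough to sketch. This is an illustrative software model (the names and structure are assumptions, not the hardware): a requester that has not collected the tokens it needs within roughly twice the average miss latency just broadcasts its request again, which handles every race uniformly.

```python
# Reissue-on-timeout: the only race-recovery mechanism needed in
# the common case is "wait, then ask again".

class Requester:
    def __init__(self, avg_miss_latency):
        self.timeout = 2 * avg_miss_latency  # heuristic from the slide
        self.issued_at = None
        self.reissues = 0

    def issue(self, now):
        self.issued_at = now                 # broadcast the request here

    def on_cycle(self, now, has_needed_tokens):
        if has_needed_tokens:
            self.issued_at = None            # request completed
        elif (self.issued_at is not None
              and now - self.issued_at >= self.timeout):
            self.reissues += 1
            self.issue(now)                  # uniform handling of all races
```

Because tokens guarantee safety regardless of when a request is reissued, the timeout value affects only performance, never correctness.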

Token Coherence – Milo Martin slide 23
Token Coherence Example
7. Timeout! P0 reissues its request
8. P1 responds with its token (T=1)

Token Coherence – Milo Martin slide 24
Token Coherence Example
P0’s request has completed: P0 T=16 (R/W), P1 T=0, P2 T=0
One final issue: what about starvation?

Token Coherence – Milo Martin slide 25
Contribution #2: Guaranteeing Starvation-Freedom
Handle pathological cases
– Infrequently invoked
– Can be slow, inefficient, and simple
When normal requests fail to succeed (4x)
– Use a longer timeout and issue a persistent request
– Request persists until satisfied
– Table at each processor
– “Deactivate” upon completion
Implementation
– Arbiter at memory orders persistent requests
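A sketch of the persistent-request mechanism described above (the class and method names are hypothetical): once a request has failed about 4 times, the node escalates to a persistent request; an arbiter at memory orders the active entries, the oldest persistent request for a block wins until its owner deactivates it, and every node forwards that block's tokens to the winner.

```python
# Persistent-request ordering: an arbiter keeps active requests in
# activation order, so some starving request eventually wins and
# completes, then deactivates.

from collections import deque


class PersistentRequestArbiter:
    def __init__(self):
        self.active = deque()            # (processor, block) in arrival order

    def activate(self, proc, block):
        self.active.append((proc, block))

    def winner(self, block):
        """All nodes forward tokens for `block` to this processor."""
        for proc, b in self.active:
            if b == block:
                return proc              # oldest persistent request wins
        return None

    def deactivate(self, proc, block):
        self.active.remove((proc, block))  # "deactivate" upon completion
```

Since this path is invoked rarely (under 0.3% of misses in the evaluated workloads), it can afford to be slow and centralized without hurting common-case performance.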

Token Coherence – Milo Martin slide 26
Outline
– Overview
– Problem: ordering bottlenecks
– Solution: Token Coherence (TokenB)
– Evaluation
– Further exploiting decoupling
– Conclusions

Token Coherence – Milo Martin slide 27
Evaluation Goal: Four Questions
1. Are reissued requests rare? Yes
2. Can Token Coherence outperform snooping? Yes: lower-latency unordered interconnect
3. Can Token Coherence outperform directory? Yes: direct cache-to-cache misses
4. Is broadcast overhead reasonable? Yes (for 16 processors)
Quantitative evidence for qualitative behavior

Token Coherence – Milo Martin slide 28
Workloads and Simulation Methods
Workloads
– OLTP: on-line transaction processing
– SPECjbb: Java middleware workload
– Apache: static web serving workload
– All workloads use Solaris 8 for SPARC
Simulation methods
– 16 processors
– Simics full-system simulator
– Out-of-order processor model
– Detailed memory system model
– Many assumptions and parameters (see paper)

Token Coherence – Milo Martin slide 29
Q1: Reissued Requests (percent of all L2 misses)
[Chart: OLTP, SPECjbb, Apache]

Token Coherence – Milo Martin slide 30
Q1: Reissued Requests (percent of all L2 misses)

Outcome | OLTP | SPECjbb | Apache
Not Reissued | 98% | 98% | 96%
Reissued Once | 2% | 2% | 3%
Reissued > 1 | 0.4% | 0.3% | 0.7%
Persistent Requests (Reissued > 4) | 0.2% | 0.1% | 0.3%

Yes; reissued requests are rare (these workloads, 16p)

Token Coherence – Milo Martin slide 31
Q2: Runtime: Snooping vs. Token Coherence
Hierarchical switch (“tree”) interconnect
Similar performance on the same interconnect
[Chart]

Token Coherence – Milo Martin slide 32
Q2: Runtime: Snooping vs. Token Coherence
Direct (“torus”) interconnect
Snooping not applicable on an unordered interconnect
[Chart]

Token Coherence – Milo Martin slide 33 Q2: Runtime: Snooping vs. Token Coherence Yes; Token Coherence can outperform snooping (15-28% faster) Why? Lower-latency interconnect

Token Coherence – Milo Martin slide 34 Q3: Runtime: Directory vs. Token Coherence Yes; Token Coherence can outperform directories (17-54% faster with slow directory) Why? Direct “2-hop” cache-to-cache misses

Token Coherence – Milo Martin slide 35 Q4: Traffic per Miss: Directory vs. Token Yes; broadcast overheads reasonable for 16 processors (directory uses 21-25% less bandwidth)

Token Coherence – Milo Martin slide 36
Q4: Traffic per Miss: Directory vs. Token
Yes; broadcast overheads reasonable for 16 processors (directory uses 21-25% less bandwidth)
Why? Requests are smaller than data (8B vs. 64B)
[Chart legend: requests & forwards vs. responses]

Token Coherence – Milo Martin slide 37
Outline
– Overview
– Problem: ordering bottlenecks
– Solution: Token Coherence (TokenB)
– Evaluation
– Further exploiting decoupling
– Conclusions

Token Coherence – Milo Martin slide 38
Contribution #3: Decoupled Coherence
A cache coherence protocol splits into two parts:
– Correctness substrate (all cases): safety (token counting) and starvation freedom (persistent requests)
– Performance protocol (common cases): many implementation choices

Token Coherence – Milo Martin slide 39
Example Opportunities of Decoupling
Example #1: broadcast is not required — predict a destination-set [ISCA ‘03]
– Based on past history
– Need not be correct (rely on persistent requests)
– Enables larger or more cost-effective systems
Example #2: predictive push
– Requires no changes to correctness substrate
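Destination-set prediction of the kind mentioned in Example #1 can be sketched as follows. This is an illustrative predictor with hypothetical names (see the ISCA '03 destination-set prediction paper for real designs): multicast only to processors that recently responded for the block, and because a wrong prediction merely falls back to reissue/persistent requests, the correctness substrate is untouched.

```python
# A trivial history-based destination-set predictor: the predicted
# set is only a hint, so mispredictions cost bandwidth/latency,
# never correctness.

from collections import defaultdict


class DestinationSetPredictor:
    def __init__(self):
        self.sharers = defaultdict(set)     # block -> recent responders

    def predict(self, block):
        return set(self.sharers[block])     # hint: who to multicast to

    def train(self, block, responder):
        self.sharers[block].add(responder)  # learn from past responses
```

Richer policies (owner prediction, group prediction, broadcast-if-shared) slot into the same interface, since the substrate never depends on where hints are sent.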

Token Coherence – Milo Martin slide 40
Conclusions
Token Coherence (broadcast version)
– Low cache-to-cache miss latency (no indirection)
– Avoids “virtual bus” interconnects
– Faster and/or cheaper
Token Coherence (in general)
– Correctness substrate: tokens for safety, persistent requests for starvation freedom
– Performance protocol for performance
– Decouple correctness from performance → enables further protocol innovation

Token Coherence – Milo Martin slide 41

Token Coherence – Milo Martin slide 42
Cache-Coherence Protocols
Goal: provide a consistent view of memory
Permissions in each cache per block
– One read/write -or- many readers
Cache coherence protocols
– Distributed & complex
– Correctness critical
– Performance critical
Races: the main source of complexity
– Requests for the same block at the same time

Token Coherence – Milo Martin slide 43
Evaluation Parameters
Processors
– SPARC ISA
– 2 GHz, 11 pipe stages
– 4-wide fetch/execute
– Dynamically scheduled
– 128-entry ROB
– 64-entry scheduler
Memory system
– 64-byte cache lines
– 128KB L1 instruction and data, 4-way SA, 2 ns (4 cycles)
– 4MB L2, 4-way SA, 6 ns (12 cycles)
– 2GB main memory, 80 ns (160 cycles)
Interconnect
– 15 ns link latency
– Switched tree (4 link latencies per 2-hop round trip)
– 2D torus (2 link latencies on average per 2-hop round trip)
– Link bandwidth: 3.2 Gbyte/second
Coherence protocols
– Aggressive snooping
– Alpha-like directory
– 72-byte data messages
– 8-byte request messages

Token Coherence – Milo Martin slide 44 Q3: Runtime: Directory vs. Token Coherence Yes; Token Coherence can outperform directories (17-54% faster with slow directory) Why? Direct “2-hop” cache-to-cache misses

Token Coherence – Milo Martin slide 45
More Information in Paper
Traffic optimization
– Transfer tokens without data
– Add an “owner” token
– Note: no silent read-only replacements; worst case 10% interconnect traffic overhead
Comparison to AMD’s Hammer protocol
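The traffic optimization mentioned above can be sketched roughly as follows. This is a simplified illustration (the dict fields and function are hypothetical, not the paper's protocol): most tokens can travel in small token-only messages, and only the single "owner" token must be accompanied by the data block, which is why many responses can be request-sized (8B) rather than data-sized (64B).

```python
# Token-only responses: a non-owner holder sheds tokens without data;
# data travels only when the owner token is relinquished.

def make_response(line, tokens_requested):
    """`line` models one cached block: {'tokens', 'owner', 'data'}."""
    give = min(line['tokens'], tokens_requested)
    msg = {'tokens': give, 'data': None}        # small 8B-style message
    if line['owner'] and give == line['tokens']:
        msg['data'] = line['data']              # owner token carries the data
        line['owner'] = False
    line['tokens'] -= give
    return msg
```

This sketch omits the no-silent-replacement rule the slide notes: a cache evicting read-only tokens must send them somewhere, which is the source of the worst-case 10% traffic overhead.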

Token Coherence – Milo Martin slide 46
Verifiability & Complexity
Divide and conquer complexity
– Formal verification is work in progress
– Difficult to quantify, but promising
– All races handled uniformly (reissuing)
Local invariants
– Safety is response-centric; independent of requests
– Locally enforced with tokens
– No reliance on global ordering properties
Explicit starvation avoidance
– Simple mechanism
Further innovation → no correctness worries

Token Coherence – Milo Martin slide 47
Traditional vs. Token Coherence
Traditional protocols
– Sensitive to request ordering (interconnect or directory)
– Monolithic: complicated; intertwine correctness and performance
Token Coherence
– Track tokens (safety)
– Persistent requests (starvation avoidance)
– Requests are only “hints”
– Separates correctness and performance

Token Coherence – Milo Martin slide 48
Conceptual Interface
– Performance protocol sends “hint” requests
– Correctness substrate carries data, tokens, & persistent requests
[Diagram: P0, P1, memory M, and controller]

Token Coherence – Milo Martin slide 49
Conceptual Interface
– “I’m looking for block B”
– “Please send block B to P1”
– “Here is block B & one token” (or, no response)
[Diagram: P0, P1, memory M, and controller]
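The exchange on this slide can be modeled in a few lines. This is a hedged sketch with assumed names: the performance protocol's messages are only hints, and a controller may honor one by moving a token (and data) or ignore it entirely, the "(or, no response)" case, without ever violating token conservation.

```python
# Hints vs. substrate: only the substrate moves tokens, so an ignored
# or misdirected hint costs latency, never correctness.

class Node:
    def __init__(self, tokens=0, data=None):
        self.tokens, self.data = tokens, data


def honor_hint(src, dst, honor):
    """Controller-side handling of 'please send block B to P1'."""
    if honor and src.tokens > 0:
        dst.data = src.data
        dst.tokens += 1      # "here is block B & one token"
        src.tokens -= 1
    # if not honored: no response; the requester times out and reissues
```

Either branch preserves the total token count, which is exactly why the performance protocol is free to be speculative.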

Token Coherence – Milo Martin slide 50
Snooping v. Directories: Which is Better?
Snooping multiprocessors
– Use broadcast
– “Virtual bus” interconnect
+ Directly locate data (2 hops)
Directory-based multiprocessors
– Directory tracks writer or readers
+ Avoids broadcast
+ Avoids “virtual bus” interconnect
– Indirection for cache-to-cache (3 hops)
Examine workload and technology trends

Token Coherence – Milo Martin slide 51
Workload Trends
Commercial workloads
– Many cache-to-cache misses or sharing misses
– Cluster small- or moderate-scale multiprocessors
Goals:
– Low cache-to-cache miss latency (2 hops)
– Moderate scalability
Workload trends → snooping protocols

Token Coherence – Milo Martin slide 52
Technology Trends
High-speed point-to-point links
– No (multi-drop) busses
Increasing design integration
– “Glueless” multiprocessors
– Improve cost & latency
Desire: unordered interconnect
– No “virtual bus” ordering
– Decouple interconnect & protocol
Technology trends → directory protocols

Token Coherence – Milo Martin slide 53
Multiprocessor Taxonomy
[2×2 taxonomy: interconnect (virtual bus vs. unordered) × cache-to-cache latency (high vs. low)]
– Snooping: virtual bus, low latency (fits workload trends)
– Directories: unordered interconnect, high latency (fits technology trends)
– Our goal: unordered interconnect with low cache-to-cache latency

Token Coherence – Milo Martin slide 54
Overview
Two approaches to cache coherence — the problem:
– Workload trends → snooping protocols
– Technology trends → directory protocols
– Want a protocol that fits both trends
A solution:
– Unordered broadcast & direct response
– Track tokens, reissue requests, prevent starvation
Generalization: a new coherence framework
– Decouple correctness from “performance” policy
– Enables further opportunities