Token Coherence: Decoupling Performance and Correctness

Presentation transcript:

Token Coherence: Decoupling Performance and Correctness
Milo Martin, Mark Hill, and David Wood
Wisconsin Multifacet Project, http://www.cs.wisc.edu/multifacet/
University of Wisconsin-Madison

Speaker notes: Something to ponder during the presentation: why wasn't this done before? Technology and workload trends; the past focus on scalability; the mindset that each request must succeed the first time; nobody thought of this decoupling. Separate goal setting from goal achieving. Other points: previous systems have not done this (an unordered broadcast protocol); it is not about logical time; give the context; replacements are latency-tolerant; tokens are kept in memory in the ECC bits; how the timeout interval is selected; decouple the interconnect from the protocol (Alpha and AMD moved toward directories, IBM and Sun toward snooping); remove the snooping/directory duality; implementation costs (bits in caches and memory, the starvation-prevention mechanism); token counting vs. token collection/finding. (Avoid oral pauses: "Ok?" and "Right?")

We See Two Problems in Cache Coherence
1. Protocol ordering bottlenecks
   - An artifact of conservatively resolving racing requests
   - "Virtual bus" interconnect (snooping protocols)
   - Indirection (directory protocols)
2. Protocol enhancements compound complexity
   - Fragile, error-prone, and difficult to reason about. Why? Because it is a distributed and concurrent system.
   - Enhancements are often too complicated to implement (predictive/adaptive/hybrid protocols)
   - Performance and correctness are tightly intertwined

Speaker notes: Hardware shared memory has been successful, especially for cost-effective, small-scale systems. Further enhancements → more complexity: low-latency/predictive/adaptive/hybrid protocols; direct communication and highly integrated nodes; avoiding ordered or synchronous interconnects, global snoop responses, and logical timestamping.

Rethinking Cache-Coherence Protocols
- Goal of invalidation-based coherence. Invariant: many readers -or- a single writer. Traditionally enforced by globally coordinated actions.
- Key innovation: enforce this invariant directly using tokens. A fixed number of tokens per block; one token to read, all tokens to write (see the sketch below).
- Guarantees safety in all cases: a global invariant enforced with only local rules, independent of races, request ordering, etc.
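
A minimal sketch of the token rules as local permission checks; the names (TOTAL_TOKENS, BlockState) are illustrative, not from the paper:

    constexpr int TOTAL_TOKENS = 16;   // fixed number T of tokens per block

    struct BlockState {
        int tokens = 0;   // tokens this node currently holds for the block
    };

    // One or more tokens permit reading.
    bool canRead(const BlockState& b)  { return b.tokens >= 1; }

    // All T tokens are required to write, so a writer provably excludes
    // every reader without any global coordination.
    bool canWrite(const BlockState& b) { return b.tokens == TOTAL_TOKENS; }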

Token Coherence: A New Framework for Cache Coherence
- Goal: decouple performance and correctness. Fast in the common case; correct in all cases.
- To remove ordering bottlenecks: ignore races (fast common case); tokens enforce safety (all cases).
- To reduce complexity (the focus of this talk): performance enhancements (fast common case) without affecting correctness (all cases) and without increased complexity. Not just new enhancements, but many previous ones that were too complicated to implement.
- Decouple the cache-coherence problem into safety (do no harm), forward progress (do some good), and performance (policy and optimizations).

Speaker notes: We propose a new framework for coherence that decouples correctness and performance, removing bottlenecks and enabling performance enhancements without concern for correctness. It creates a "performance design space" in which designers can do whatever they like, separate from the substrate rather than interwoven with it. As designers add performance enhancements, they do not add complexity to the correctness substrate; this lets them actually implement these enhancements in the performance design space. (Briefly described.)

Outline
- Overview
- Problem: ordering bottlenecks
- Solution: Token Coherence (TokenB)
- Evaluation
- Further exploiting decoupling
- Conclusions

Technology trends → unordered interconnects
- High-speed point-to-point links; no (multi-drop) busses.
- Increasing design integration: "glueless" multiprocessors improve cost and latency.
- Desire: a low-latency interconnect that avoids "virtual bus" ordering. Enabled by directory protocols.

Speaker notes: Green squares are processor/memory nodes. Point out the problem with the second interconnect: a shared bus is like Ethernet; it does not scale and is a central bottleneck.

Workload trends → avoid indirection; broadcast is OK
[diagram: a directory protocol's 3-hop indirection (1-2-3 through memory M) vs. a direct 2-hop miss between processors]
- Commercial workloads exhibit many cache-to-cache misses.
- Clusters of small multiprocessors.
- Goals: direct cache-to-cache misses (2 hops, not 3 hops); moderate scalability.

Speaker notes: Turn clustering from a competitor into a "feature."

Basic Approach
[diagram: a direct 2-hop miss on a low-latency interconnect]
- Low-latency protocol: broadcast with direct responses, as in snooping protocols.
- Low-latency interconnect: use an unordered interconnect, as in directory protocols.
- Fast, and it works fine with no races... but what happens in the case of a race?

Basic approach... but not yet correct
[animation frame: P0 "No Copy", P1 "No Copy", P2 "Read/Write"]
- Step 1: P0 issues a request to write; the copy headed for P2 is delayed in the interconnect. (Step 2: ack.)
- Step 3: P1 issues a request to read.

Basic approach... but not yet correct
[animation frame: P1 and P2 both now "Read-only"]
- Step 4: P2 responds with data to P1.

Basic approach... but not yet correct
[animation frame]
- Step 5: P0's delayed request arrives at P2.

Speaker notes: "The last writer responds with data" side-steps the ownership issues.

Basic approach... but not yet correct
[animation frame: P2 becomes "No Copy", P0 becomes "Read/Write", P1 remains "Read-only"]
- Steps 6-7: P2 responds with data to P0.

Basic approach... but not yet correct
[animation frame: final state]
- Problem: P0 ("Read/Write") and P1 ("Read-only") are in inconsistent states, allowing inconsistent access to data and leading to program errors.
- Locally "correct" operation, globally inconsistent.

Contribution #1: Token Counting
- Tokens control the reading and writing of data.
- At all times, each block has T tokens; e.g., one token per processor.
- One or more tokens to read; all tokens to write.
- Tokens live in caches, in memory, or in transit; components exchange tokens and data (see the sketch below).
- Provides safety in all cases.

Speaker notes: This is the "inspiration" part of the talk. We have been building multiprocessors for decades, and we are the first we know of to take this approach; we are maintaining a different kind of state. Point out that there is roughly one token per processor.
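
An illustrative sketch (not the authors' implementation) of how a cache controller might surrender tokens when it observes another processor's broadcast request. In the real protocol only the most recent writer attaches the data block; that selection is elided here:

    enum class ReqType { Read, Write };

    struct Response {
        int  tokens  = 0;      // tokens handed over (0 = ignore the request)
        bool hasData = false;  // whether the 64-byte block rides along
    };

    Response onRemoteRequest(ReqType type, int& held, bool holdsData) {
        Response r;
        if (type == ReqType::Write && held > 0) {
            // A writer needs every token: give up all we hold.
            r = {held, holdsData};
            held = 0;
        } else if (type == ReqType::Read && held > 1 && holdsData) {
            // A reader needs only one token; keep the rest to go on reading.
            r = {1, true};
            held -= 1;
        }
        return r;
    }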

Basic Approach (Revisited)
- As before: broadcast with direct responses (like snooping); use an unordered interconnect (like a directory).
- New: track tokens for safety, i.e., never read or write data incoherently.
- More refinement in a moment...

Token Coherence Example
[animation frame: P0 T=0, P1 T=0, P2 T=16 (R/W); max tokens = 16]
- Step 1: P0 issues a request to write; the copy headed for P2 is delayed.
- Step 3: P1 issues a request to read, which is also delayed.

Token Coherence Example
[animation frame: P2 now T=15 (read-only); P1 receives T=1 (read-only)]
- Step 4: P2 responds with data and one token to P1.

Token Coherence Example
[animation frame]
- Step 5: P0's delayed write request arrives at P2.

Token Coherence Example
[animation frame: P2 goes from T=15 to T=0; P0 receives T=15]
- Steps 6-7: P2 responds to P0 with data and all 15 of its remaining tokens.

Token Coherence Example
[animation frame: P0 holds T=15 (read-only), P1 holds T=1 (read-only), P2 holds T=0]

Token Coherence Example
[animation frame: P0 T=15 (read-only), P1 T=1 (read-only)]
- Now what? P0 wants all the tokens but holds only 15.

Basic Approach (Re-Revisited)
- As before: broadcast with direct responses (like snooping); use an unordered interconnect (like a directory); track tokens for safety (never read or write data incoherently).
- New: reissue requests as needed, which is required only for racing requests (uncommon). A timeout detects failed completion: wait twice the average miss latency. Small hardware overhead (see the sketch below).
- All races are handled in this uniform fashion.
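
A minimal sketch (assumed structure, not the paper's hardware) of the reissue policy attached to an outstanding-miss entry:

    #include <cstdint>

    struct MshrEntry {                 // one outstanding miss
        uint64_t issuedCycle     = 0;
        int      tokensCollected = 0;
        int      tokensNeeded    = 1;  // 1 for a read, all T tokens for a write
        int      retries         = 0;
    };

    void rebroadcastRequest(MshrEntry&)     { /* re-send on the interconnect */ }
    void issuePersistentRequest(MshrEntry&) { /* escalate; see below */ }

    void checkTimeout(MshrEntry& m, uint64_t now, uint64_t avgMissLatency) {
        if (m.tokensCollected >= m.tokensNeeded) return;        // completed
        if (now - m.issuedCycle <= 2 * avgMissLatency) return;  // still waiting
        m.issuedCycle = now;
        if (++m.retries >= 4)
            issuePersistentRequest(m);  // repeated failure: starvation-free path
        else
            rebroadcastRequest(m);      // ordinary reissue
    }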

Token Coherence Example
[animation frame]
- Step 8: P0 times out and reissues its request.
- Step 9: P1 responds with its one token (T=1).

Token Coherence Example
[animation frame: P0 T=16 (R/W), P1 T=0, P2 T=0]
- P0's request has completed.
- One final issue: what about starvation?

Contribution #2: Guaranteeing Starvation-Freedom
- Handle pathological cases: they are infrequently invoked, so the mechanism can be slow, inefficient, and simple.
- When normal requests repeatedly fail to succeed (4x): use a longer timeout and issue a persistent request.
- The request persists until satisfied: a table at each processor; "deactivate" upon completion (see the sketch below).
- Implementation: an arbiter at memory orders persistent requests.

Speaker notes: Say more about why they persist; emphasize the persistent nature, not the implementation.
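
A hypothetical sketch of the per-processor persistent-request table; the memory-side arbiter (not shown) activates one starving requester per block at a time, and the field names are illustrative:

    #include <cstdint>
    #include <map>

    struct PersistentReq { int requester; bool wantsAllTokens; };

    // Block address -> active persistent request; the arbiter installs
    // entries in the same order at every processor.
    std::map<uint64_t, PersistentReq> table;

    // While an entry is active, any tokens (and data) this node holds or
    // receives for that block must be forwarded to the starving requester.
    bool mustForward(uint64_t addr, int myId) {
        auto it = table.find(addr);
        return it != table.end() && it->second.requester != myId;
    }

    // Once satisfied, the requester tells everyone to clear the entry.
    void deactivate(uint64_t addr) { table.erase(addr); }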

Outline
- Overview
- Problem: ordering bottlenecks
- Solution: Token Coherence (TokenB)
- Evaluation
- Further exploiting decoupling
- Conclusions

Evaluation Goal: Four Questions
1. Are reissued requests rare? Yes.
2. Can Token Coherence outperform snooping? Yes: a lower-latency unordered interconnect.
3. Can Token Coherence outperform directories? Yes: direct cache-to-cache misses.
4. Is the broadcast overhead reasonable? Yes (for 16 processors).

Speaker notes: Quantitative evidence for qualitative behavior. Exact speedup numbers are the job of industry for their particular designs.

Workloads and Simulation Methods
- OLTP: on-line transaction processing.
- SPECjbb: Java middleware workload.
- Apache: static web-serving workload.
- All workloads run Solaris 8 for SPARC.
- Simulation methods: 16 processors; Simics full-system simulator; out-of-order processor model; detailed memory-system model; many assumptions and parameters (see paper).

Q1: Reissued Requests (percent of all L2 misses)
[chart: breakdown for OLTP, SPECjbb, and Apache; the data appears in the next slide]

Q1: Reissued Requests (percent of all L2 misses)

    Outcome                               OLTP    SPECjbb   Apache
    Not reissued                          98%               96%
    Reissued once                         2%                3%
    Reissued > 1                          0.4%    0.3%      0.7%
    Persistent requests (reissued > 4)    0.2%              0.1%

Yes; reissued requests are rare (for these workloads, at 16 processors).

Q2: Runtime: Snooping vs. Token Coherence
[chart: hierarchical switch ("tree") interconnect]
- Similar performance on the same interconnect.

Q2: Runtime: Snooping vs. Token Coherence
[chart: direct ("torus") interconnect]
- Snooping is not applicable on this interconnect.

Q2: Runtime: Snooping vs. Token Coherence
- Yes; Token Coherence can outperform snooping (15-28% faster).
- Why? A lower-latency interconnect.

Q3: Runtime: Directory vs. Token Coherence
- Yes; Token Coherence can outperform directories (17-54% faster than a slow directory).
- Why? Direct "2-hop" cache-to-cache misses.

Q4: Traffic per Miss: Directory vs. Token Coherence
- Yes; broadcast overheads are reasonable for 16 processors (the directory uses 21-25% less bandwidth).

Speaker notes: Why the amount of data sent differs: a slight difference between the protocols in handling O→M upgrades.

Q4: Traffic per Miss: Directory vs. Token Coherence
[chart: traffic split into requests & forwards vs. responses]
- Yes; broadcast overheads are reasonable for 16 processors (the directory uses 21-25% less bandwidth).
- Why? Requests are smaller than data (8B vs. 64B).

Speaker notes: Why the amount of data sent differs: a slight difference between the protocols in handling O→M upgrades.

Outline
- Overview
- Problem: ordering bottlenecks
- Solution: Token Coherence (TokenB)
- Evaluation
- Further exploiting decoupling
- Conclusions

Contribution #3: Decoupled Coherence
[diagram: a cache coherence protocol split into a correctness substrate (all cases) and a performance protocol (common cases)]
- Correctness substrate: safety (token counting) and starvation freedom (persistent requests).
- Performance protocol: many implementation choices.

Example Opportunities of Decoupling
[diagram: P0...Pn; a "predictive push" of all 16 tokens]
- Example #1: broadcast is not required. Predict a destination set [ISCA '03] based on past history; the prediction need not be correct (rely on persistent requests). Enables larger or more cost-effective systems (see the sketch below).
- Example #2: predictive push. Requires no changes to the correctness substrate.
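
A sketch, with assumed data structures, of a destination-set predictor in the spirit of the ISCA '03 reference: requests go only to the predicted sharers, and a misprediction merely triggers the reissue/persistent-request path rather than a correctness problem:

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>

    constexpr int NPROC = 16;
    using DestSet = std::bitset<NPROC>;

    // Block address -> processors that responded to past requests.
    std::unordered_map<uint64_t, DestSet> history;

    DestSet predictDestinations(uint64_t addr, int homeNode) {
        DestSet d;
        auto it = history.find(addr);
        if (it != history.end()) d = it->second;  // reuse prior responders
        d.set(homeNode);                          // always include the home/memory
        return d;                                 // may be far fewer than all nodes
    }

    void recordResponder(uint64_t addr, int proc) { history[addr].set(proc); }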

Conclusions
- Token Coherence (broadcast version): low cache-to-cache miss latency (no indirection); avoids "virtual bus" interconnects; faster and/or cheaper.
- Token Coherence (in general): a correctness substrate (tokens for safety, persistent requests for starvation freedom) plus a performance protocol for performance.
- Decoupling correctness from performance enables further protocol innovation.

Cache-Coherence Protocols
- Goal: provide a consistent view of memory.
- Permissions in each cache, per block: one read/write -or- many readers.
- Cache coherence protocols are distributed and complex, yet both correctness-critical and performance-critical.
- Races are the main source of complexity: two or more requests for the same block at the same time.

Evaluation Parameters
- Processors: SPARC ISA; 2 GHz, 11 pipeline stages; 4-wide fetch/execute; dynamically scheduled; 128-entry ROB; 64-entry scheduler.
- Memory system: 64-byte cache lines; 128KB L1 instruction and data caches, 4-way set-associative, 2 ns (4 cycles); 4MB L2, 4-way set-associative, 6 ns (12 cycles); 2GB main memory, 80 ns (160 cycles).
- Interconnect: 15 ns link latency; switched tree (4 link latencies): 240-cycle 2-hop round trip; 2D torus (2 link latencies on average): 120-cycle 2-hop round trip; 3.2 GB/s link bandwidth (checked below).
- Coherence protocols: aggressive snooping; Alpha 21364-like directory; 72-byte data messages; 8-byte request messages.
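
As a quick consistency check on the interconnect figures: at 2 GHz a cycle is 0.5 ns, and a 2-hop round trip (request plus response) crosses the network twice, so

    tree:  2 traversals x 4 links x 15 ns = 120 ns = 240 cycles
    torus: 2 traversals x 2 links x 15 ns =  60 ns = 120 cycles

which matches the round-trip latencies quoted above.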

Q3: Runtime: Directory vs. Token Coherence
- Yes; Token Coherence can outperform directories (17-54% faster than a slow directory).
- Why? Direct "2-hop" cache-to-cache misses.

Speaker notes: The comparison also simulates the un-buildable design point of a super-fast, perfect directory cache.

More Information in the Paper
- Traffic optimization: transfer tokens without data by adding an "owner" token (see the sketch below).
- Note: no silent read-only replacements.
- Worst case: 10% interconnect traffic overhead.
- Comparison to AMD's Hammer protocol.
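
A hedged sketch of the owner-token refinement mentioned here: only the holder of the distinguished owner token must attach the 64-byte data block, so other holders can ship small bare-token messages. The message layout is an assumption, not the paper's encoding:

    #include <cstdint>

    struct TokenMsg {
        uint64_t addr;
        int      plainTokens;  // non-owner tokens being transferred
        bool     ownerToken;   // the one distinguished token per block
        bool     hasData;      // data must ride with the owner token so the
                               // only valid copy is never silently dropped
    };

    TokenMsg makeTokenMsg(uint64_t addr, int n, bool owner) {
        return TokenMsg{addr, n, owner, /*hasData=*/owner};
    }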

Verifiability & Complexity
- Divide and conquer the complexity. (Formal verification is work in progress: difficult to quantify, but promising.)
- All races are handled uniformly (by reissuing).
- Local invariants: safety is response-centric and independent of requests; it is locally enforced with tokens, with no reliance on global ordering properties.
- Explicit starvation avoidance via a simple mechanism.
- Further innovation → no correctness worries.

Traditional vs. Token Coherence
- Traditional protocols: sensitive to request ordering (by the interconnect or the directory); monolithic and complicated; intertwine correctness and performance.
- Token Coherence: tracks tokens (safety) and uses persistent requests (starvation avoidance); requests are only "hints"; separates correctness from performance.

Conceptual Interface
[diagram: performance-protocol controllers at M, P0, and P1 exchange "hint" requests; the correctness substrate beneath them carries data, tokens, and persistent requests]

Conceptual Interface (Example)
[diagram: P1's performance protocol announces "I'm looking for block B"; a controller hints "Please send block B to P1"; the correctness substrate delivers "Here is block B & one token" (or there is no response); a code sketch of this interface follows]
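
A sketch, with assumed names, of the decoupled interface these two slides draw: the performance protocol can only issue advisory hints, while all permission-changing traffic flows through the correctness substrate:

    #include <cstdint>

    struct CorrectnessSubstrate {
        // Safety: moving tokens is the only way permissions ever change.
        virtual void sendTokens(int dst, uint64_t addr, int n, bool withData) = 0;
        // Forward progress: the escalation path when hints go unanswered.
        virtual void activatePersistentRequest(uint64_t addr, int requester) = 0;
        virtual ~CorrectnessSubstrate() = default;
    };

    struct PerformanceProtocol {
        // Hints are advisory: dropping, duplicating, or misrouting them can
        // hurt performance but can never violate coherence.
        virtual void hintRequest(uint64_t addr, bool write) = 0;
        virtual ~PerformanceProtocol() = default;
    };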

Snooping vs. Directories: Which Is Better?
[diagram: a snooping 2-hop miss vs. a directory 3-hop miss through memory M]
- Snooping multiprocessors: use broadcast on a "virtual bus" interconnect; directly locate data (2 hops).
- Directory-based multiprocessors: the directory tracks the writer or readers; avoid broadcast and the "virtual bus" interconnect; indirection for cache-to-cache misses (3 hops).
- Examine workload and technology trends.

Speaker notes: This question has been debated for 20+ years; if one approach were simply better, we would have only one. Sun, Intel, IBM: snooping. Alpha, SGI: directories. Spend time here on ordering in snooping and on directory indirection; define cache-to-cache misses (our workloads have many of them).

Workload trends → snooping protocols
- Commercial workloads exhibit many cache-to-cache (sharing) misses.
- Clusters of small- or moderate-scale multiprocessors.
- Goals: low cache-to-cache miss latency (2 hops); moderate scalability.

Speaker notes: Turn clustering from a competitor into a "feature."

Technology trends → directory protocols
- High-speed point-to-point links; no (multi-drop) busses.
- Increasing design integration: "glueless" multiprocessors improve cost and latency.
- Desire: an unordered interconnect with no "virtual bus" ordering; decouple the interconnect and the protocol.

Speaker notes: Green squares are processor/memory nodes. Point out the problem with the second interconnect: a shared bus is like Ethernet; it does not scale and is a central bottleneck.

Multiprocessor Taxonomy

                                 Cache-to-cache latency
    Interconnect                 High             Low
    "Virtual bus" (ordered)                       Snooping
    Unordered                    Directories      Our goal: best of both worlds

- Workload trends favor low cache-to-cache latency; technology trends favor unordered interconnects.
- Goal: replace both protocols with one approach that is best in nearly all cases.

Speaker notes: This is a "goal setting" slide, part of separating goal setting from goal achieving. (Mark asked me to make this an "up and to the right" graph, but after trying it I prefer this layout because it matches the eye flow: left-to-right and top-to-bottom.)

Overview
- Two approaches to cache coherence: workload trends → snooping protocols; technology trends → directory protocols. We want a protocol that fits both trends.
- A solution: unordered broadcast and direct responses; track tokens, reissue requests, prevent starvation.
- Generalization: a new coherence framework that decouples correctness from "performance" policy, enabling further opportunities.