(C) 2005 Multifacet Project Token Coherence: A Framework for Implementing Multiple-CMP Systems Mike Marty 1, Jesse Bingham 2, Mark Hill 1, Alan Hu 2, Milo.

Presentation transcript:

(C) 2005 Multifacet Project
Token Coherence: A Framework for Implementing Multiple-CMP Systems
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania
February 17th, 2005

Improving Multiple-CMP Systems using Token Coherence
Slide 2: Summary
Microprocessor → Chip Multiprocessor (CMP)
Symmetric Multiprocessor (SMP) → Multiple CMPs
Problem: Coherence with Multiple CMPs
Old Solution: Hierarchical Directory (Complex & Slow)
New Solution: Apply Token Coherence
–Developed for glueless multiprocessor [2003]
–Keep: Flat for Correctness
–Exploit: Hierarchical for performance
Less Complex & Faster than Hierarchical Directory

Slide 3: Outline
Motivation and Background
–Coherence in Multiple-CMP Systems
–Example: DirectoryCMP
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation

Slide 4: Coherence in Multiple-CMP Systems
Chip Multiprocessors (CMPs) emerging
Larger systems will be built with Multiple CMPs
[Diagram: CMP 1 through CMP 4 joined by an interconnect; inside a CMP, processors (P) with I and D caches share an L2 over an on-chip interconnect]

Slide 5: Problem: Hierarchical Coherence
Intra-CMP protocol for coherence within CMP
Inter-CMP protocol for coherence between CMPs
Interactions between protocols increase complexity
–explodes state space
[Diagram: CMP 1 through CMP 4 joined by an interconnect; labels: Intra-CMP Coherence, Inter-CMP Coherence]

Slide 6: Improving Multiple-CMP Systems with Token Coherence
Token Coherence allows Multiple-CMP systems to be...
–Flat for correctness, but
–Hierarchical for performance
[Diagram: CMP 1 through CMP 4 joined by an interconnect; labels: Correctness Substrate, Performance Protocol, Low Complexity, Fast]

Slide 7: Example: DirectoryCMP
2-level MOESI Directory
RACE CONDITIONS!
[Diagram: CMP 0 and CMP 1 hold processors P0 through P7, each with an L1 I&D, and each CMP has a Shared L2 / directory; a Memory/Directory sits above them. Store B is issued twice. Messages shown: getx, fwd, inv, ack, data/ack, WB. State labels for block B: S, O, [S O], [M I]]

Slide 8: Token Coherence Summary
Token Coherence separates performance from correctness
Correctness Substrate: Enforces coherence invariant and prevents starvation
1. Safety with Token Counting
2. Starvation Avoidance with Persistent Requests
Performance Policy: Makes the common case fast
–Transient requests to seek tokens (unordered, untracked, unacknowledged)
–Possible prediction, multicast, filters, etc.

Slide 9: Outline
Motivation and Background
Token Coherence: Flat for Correctness
–Safety
–Starvation Avoidance
Token Coherence: Hierarchical for Performance
Evaluation

Slide 10: Example: Token Coherence [ISCA 2003]
Each memory block initialized with T tokens
Tokens stored in memory, caches, & messages
At least one token to read a block
All tokens to write a block
[Diagram: P0 through P3, each with an L1 I&D and an L2, and memories mem 0 through mem 3 on an interconnect; Load B and Store B requests shown]

Slide 11: Extending to Multiple-CMP System
[Diagram: P0 through P3, each with an L1 I&D and an L2, with mem 0 and mem 1 on an interconnect; CMP 0 and CMP 1 are drawn with their own interconnect and a Shared L2 each]

Slide 12: Extending to Multiple-CMP System
Token counting remains flat
Tokens to caches
–Handles shared caches and other complex hierarchies
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches, a Shared L2, and an on-chip interconnect; mem 0 and mem 1; Store B shown]

Slide 13: Safety Recap
Safety: Maintain coherence invariant
–Only one writer, or multiple readers
Tokens for Safety
–T tokens associated with each memory block
–# tokens encoded in 1 + log2 T bits
–Processor acquires all tokens to write, a single token to read
Tokens passed to nodes in glueless multiprocessor scheme
–But CMPs have private and shared caches
Tokens passed to caches in Multiple-CMP system
–Arbitrary cache hierarchy easily handled
–Flat for correctness
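The token-counting rule on this slide can be illustrated with a small model. This is only a sketch under assumptions: the class `TokenBlock`, the holder names, and the helper `token_count_bits` are hypothetical and do not come from the talk. The bit count is taken to be the width needed to store any count from 0 to T, which equals 1 + log2 T when T is a power of two.

```python
class TokenBlock:
    """Token bookkeeping for one memory block (hypothetical model)."""

    def __init__(self, total_tokens, caches):
        self.total = total_tokens
        # Every token starts at memory; caches start with none.
        self.held = {"mem": total_tokens}
        for cache in caches:
            self.held[cache] = 0

    def move(self, src, dst, count):
        """Transfer tokens; tokens are never created or destroyed."""
        if count < 0 or count > self.held[src]:
            raise ValueError("cannot move more tokens than the source holds")
        self.held[src] -= count
        self.held[dst] += count
        assert sum(self.held.values()) == self.total  # conservation

    def can_read(self, holder):
        # At least one token is required to read.
        return self.held[holder] >= 1

    def can_write(self, holder):
        # All T tokens are required to write.
        return self.held[holder] == self.total


def token_count_bits(total_tokens):
    """Bits needed to store a count in 0..T (1 + log2 T for a power-of-two T)."""
    return total_tokens.bit_length()
```

Because the holders are individual caches rather than nodes, the same counting works unchanged for private L1s and a shared L2, which is the sense in which the scheme stays flat.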

Slide 14: Some Token Counting Implications
Memory must store tokens
–Separate RAM
–Use extra ECC bits
–Token cache
T sized to # caches to allow read-only copies in all caches
Replacements cannot be silent
–Tokens must not be lost or dropped
Targeted for invalidate-based protocols
–Not a solution for write-through or update protocols
Tokens must be identified by block address
–Address must be in all token-carrying messages

Slide 15: Starvation Avoidance
Request messages can miss tokens
–In-flight tokens
Transient Requests are not tracked throughout system
–Incorrect filtering, multicast, destination-set prediction, etc.
Possible Solution: Retries
–Retry w/ optional randomized backoff is effective for races
Guaranteed Solution: Persistent Requests
–Heavyweight request guaranteed to succeed
–Should be rare (uses more bandwidth)
–Locates all tokens in the system
–Orders competing requests
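The two solutions on this slide compose naturally: try a transient request, optionally retry with randomized backoff, then fall back to a persistent request. The sketch below assumes hypothetical callbacks `send_transient` (returns True when enough tokens arrived) and `send_persistent`; the exponential backoff window is an illustrative choice, not something the talk specifies.

```python
import random


def acquire_tokens(send_transient, send_persistent,
                   max_retries=3, base_delay=1, rng=None):
    """Return ("transient" | "persistent", list of backoff delays used)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries + 1):
        if send_transient():
            return "transient", delays
        if attempt < max_retries:
            # Randomized backoff: the window doubles on each failed attempt
            # (an assumed policy, chosen only for illustration).
            delays.append(rng.randint(0, base_delay * (2 ** attempt)))
    # Transient requests kept missing tokens: escalate to the
    # heavyweight request that is guaranteed to succeed.
    send_persistent()
    return "persistent", delays
```

Setting `max_retries=0` gives the policy of one transient attempt followed directly by a persistent request.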

Slide 16: Starvation Avoidance
Tokens move freely in the system
–Transient requests can miss in-flight tokens
–Incorrect speculation, filters, prediction, etc.
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, mem 0 and mem 1; competing Store B requests (GETX) shown]

Slide 17: Starvation Avoidance
Solution: issue Persistent Request
–Heavyweight request guaranteed to succeed
–Methods: Centralized [2003] and Distributed (New)
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2s, mem 0 and mem 1; Store B shown]

Slide 18: Old Scheme: Central Arbiter [2003]
–Processors issue persistent requests
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; competing Store B requests after a timeout; arbiter 0 holds entries B: P0, B: P2, B: P1]

Slide 19: Old Scheme: Central Arbiter [2003]
–Processors issue persistent requests
–Arbiter orders and broadcasts activate
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; arbiter 0 holds entries B: P0, B: P2, B: P1 and broadcasts B: P0]

Slide 20: Old Scheme: Central Arbiter [2003]
–Processor sends deactivate to arbiter
–Arbiter broadcasts deactivate (and next activate)
–Bottom Line: handoff is 3 message latencies
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; arbiter 0 holds entries B: P2, B: P1; B: P0 is deactivated and B: P2 activated; messages numbered 1, 2, 3]
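The central-arbiter handoff can be sketched as a queue with one active request at a time. The class `CentralArbiter`, its `log` of broadcasts, and the single-active-request simplification are assumptions made for illustration; the talk does not describe the arbiter's internal structure.

```python
from collections import deque


class CentralArbiter:
    """Orders persistent requests in arrival order (hypothetical model)."""

    def __init__(self):
        self.queue = deque()  # waiting (processor, block) pairs
        self.active = None    # the request currently activated
        self.log = []         # broadcasts sent by the arbiter

    def request(self, proc, block):
        self.queue.append((proc, block))
        self._activate_next()

    def deactivate(self, proc):
        """Called when the active requester is done (message 1 to the arbiter)."""
        if self.active is None or self.active[0] != proc:
            raise ValueError("only the active requester may deactivate")
        # The arbiter broadcasts the deactivate and the next activate.
        self.log.append(("deactivate",) + self.active)
        self.active = None
        self._activate_next()

    def _activate_next(self):
        if self.active is None and self.queue:
            self.active = self.queue.popleft()
            self.log.append(("activate",) + self.active)
```

Every handoff passes through the arbiter before tokens can be forwarded to the next requester, which is where the extra message latencies come from.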

Slide 21: Improved Scheme: Distributed Arbitration [NEW]
–Processors broadcast persistent requests
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; competing Store B requests; every cache holds a table with entries P0: B, P1: B, P2: B]

Slide 22: Improved Scheme: Distributed Arbitration [NEW]
–Processors broadcast persistent requests
–Fixed priority (processor number)
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; every cache holds a table with entries P0: B, P1: B, P2: B; P0: B is the active request]

Slide 23: Improved Scheme: Distributed Arbitration [NEW]
–Processors broadcast persistent requests
–Fixed priority (processor number)
–Processors broadcast deactivate
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; every cache holds a table with entries P0: B, P1: B, P2: B; P1: B becomes active after one message, numbered 1]

Slide 24: Improved Scheme: Distributed Arbitration [NEW]
–Bottom line: Handoff is a single message latency
Subtle point: P0 and P1 must wait until next “wave”
[Diagram: CMP 0 (P0, P1) and CMP 1 (P2, P3) with mem 0 and mem 1; every cache's table now holds P1: B, P2: B; P1: B is active]

Slide 25: Implementing Distributed Persistent Requests
Table at each cache
–Sized to N entries for each processor (we use N=1)
–Indexed by processor ID
–Content-addressable by Address
Each incoming message must access table
–Not on the critical path, so the CAM can be slow
Activate/deactivate reordering cannot be allowed
–Persistent request virtual channel must be point-to-point ordered
–Or, other solution such as sequence numbers or acks
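A minimal sketch of the per-cache table, assuming N=1 and fixed priority by processor number (lowest number wins). The class `PersistentTable` and its method names are hypothetical, and the "wave" rule that makes a finished processor wait before re-issuing is deliberately not modeled here.

```python
class PersistentTable:
    """Per-cache persistent request table with one entry per processor (N=1)."""

    def __init__(self, num_procs):
        # Indexed by processor ID; each entry holds a block address or None.
        self.entry = [None] * num_procs

    def activate(self, proc, addr):
        self.entry[proc] = addr

    def deactivate(self, proc):
        self.entry[proc] = None

    def winner(self, addr):
        """Content-addressed lookup: the lowest-numbered processor with a
        persistent request for addr is the active one, or None if there is none."""
        for proc, pending in enumerate(self.entry):
            if pending == addr:
                return proc
        return None
```

Since every cache holds the same table contents, each one can decide locally where to forward tokens as soon as a deactivate arrives, with no arbiter round trip.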

Slide 26: Implementing Distributed Persistent Requests
Should reads be distinguished from writes?
–Not necessary, but
–Persistent Read request is helpful
Implications of flat distributed arbitration
–Simple → flat for correctness
–Global broadcast when used; fortunately persistent requests are rare in typical workloads (0.3%)
–Bad workload (very high contention) would burn bandwidth
–Maximum # processors must be architected
What about a hierarchical persistent request scheme?
–Possible, but correctness is no longer flat
–Make the common case fast

Slide 27: Reducing Unnecessary Traffic
Problem: Which token-holding cache responds with data?
Solution: Distinguish one token as the owner token
–The owner includes data with token response
–Clean vs. dirty owner distinction also useful for writebacks
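To make the owner-token rule concrete, the sketch below builds the responses to a request for all tokens. The `Holder` record and `write_responses` function are hypothetical names, and the message layout (a plain tuple) is an assumption for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Holder:
    name: str
    tokens: int                   # tokens held, owner token included
    has_owner: bool               # True if one of them is the owner token
    data: Optional[bytes] = None  # cached copy of the block, if any


def write_responses(holders):
    """Every token holder gives up its tokens, but only the holder of
    the owner token attaches data to its response."""
    responses = []
    for h in holders:
        if h.tokens == 0:
            continue  # nothing to send
        payload = h.data if h.has_owner else None
        responses.append((h.name, h.tokens, h.has_owner, payload))
    return responses
```

With four tokens spread over three holders, the requester receives three token messages but only one copy of the data.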

Slide 28: Outline
Motivation and Background
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
–TokenCMP
–Another look at performance policies
Evaluation

Slide 29: Hierarchical for Performance: TokenCMP
Target System:
–2-8 CMPs
–Private L1s, shared L2 per CMP
–Any interconnect, but high-bandwidth
Performance Policy Goals:
–Aggressively acquire tokens
–Exploit on-chip locality and bandwidth
–Respect cache hierarchy
–Detect and handle missed tokens

Slide 30: Hierarchical for Performance: TokenCMP
Approach:
–On L1 miss, broadcast within own CMP; local cache responds if possible
–On L2 miss, broadcast to other CMPs
–Appropriate L2 bank responds or broadcasts within its CMP (optionally filter)
–Responses between CMPs carry extra tokens for future locality
Handling missed tokens:
–Timeout after average memory latency
–Invoke persistent request (no retries)
Larger systems can use filters, multicast, soft-state directories
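The escalation order on this slide can be summarized as a small decision function. This is a rough sketch under assumptions: `locate_tokens` and the nested-dictionary layout are hypothetical, memory is treated as just another entry in the map, and tokens missing from the map stand in for tokens that are in flight.

```python
def locate_tokens(cached_tokens, my_cmp, needed):
    """cached_tokens maps a CMP name to {cache name: tokens held for the block}.
    Returns the widest request scope needed to gather `needed` tokens."""
    on_chip = sum(cached_tokens[my_cmp].values())
    if on_chip >= needed:
        # An L1 miss is satisfied by broadcasting within the requester's CMP.
        return "intra-CMP broadcast"
    everywhere = sum(sum(caches.values()) for caches in cached_tokens.values())
    if everywhere >= needed:
        # An L2 miss: broadcast to the other CMPs.
        return "inter-CMP broadcast"
    # Some tokens are in flight and the transient request missed them;
    # after the timeout the requester issues a persistent request.
    return "persistent request after timeout"
```

A read needs one token and a write needs all T, so the same block can be an on-chip hit for a load and an inter-CMP broadcast for a store.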

Slide 31: Other Optimizations in TokenCMP
Implementing E-state
–Memory responds with all tokens on read request
–Use clean/dirty owner distinction to eliminate writing back unwritten data
Implementing Migratory Sharing
–What is it? A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block
–In TokenCMP, simply return all tokens
Non-speculative delay
–Hold block for some # cycles so permission isn’t stolen prematurely
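Both the E-state and the migratory-sharing optimizations reduce to a choice of how many tokens to return on a read. The function below is a sketch with hypothetical names; returning a single token in the default case is an assumption, since a real policy may send extra tokens.

```python
def tokens_for_read(responder_tokens, total_tokens,
                    is_memory=False, responder_wrote=False):
    """How many tokens a responder hands over for a read request."""
    has_all = responder_tokens == total_tokens
    if has_all and is_memory:
        # E-state: memory holds every token, so the reader gets them all.
        return total_tokens
    if has_all and responder_wrote:
        # Migratory sharing: the responder had exclusive permission and
        # wrote the block, so exclusive permission moves to the reader.
        return total_tokens
    # Otherwise hand over one token if the responder has any (assumed default).
    return min(1, responder_tokens)
```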

Slide 32: Another Look at Performance Policies
How to find tokens?
–Broadcast
–Broadcast w/ filters
–Multicast (destination-set prediction)
–Directories (soft or hard)
Who responds with data?
–Owner token: TokenCMP uses Owner token for Inter-CMP responses
–Other heuristics: for TokenCMP intra-CMP responses, cache responds if it has extra tokens

Slide 33: Transient Requests May Reduce Complexity
Processor holds the only required state about request
L2 controller in TokenCMP very simple:
–Re-broadcasts L1 request message on a miss
–Re-broadcasts or filters external request messages
–Possible states: no tokens (I), all tokens (M), some tokens (S)
–Bounce unexpected tokens to memory
DirectoryCMP’s L2 controller is complex
–Allocates MSHR on miss and forward
–Issues invalidates and receives acks
–Orders all intra-CMP requests and writebacks
–57 states in our L2 implementation!
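The three stable states listed for the TokenCMP L2 controller follow directly from the token count, as the sketch below shows. The function name `l2_state` is hypothetical and nothing else about the controller is modeled.

```python
def l2_state(tokens, total_tokens):
    """Map a token count for one block to the state named on the slide."""
    if not 0 <= tokens <= total_tokens:
        raise ValueError("token count out of range")
    if tokens == 0:
        return "I"  # no tokens
    if tokens == total_tokens:
        return "M"  # all tokens
    return "S"      # some tokens
```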

Slide 34: Writebacks
DirectoryCMP uses “3-phase writebacks”
–L1 issues writeback request
–L2 enters transient state or blocks request
–L2 responds with writeback ack
–L1 sends data
TokenCMP uses “fire-and-forget” writebacks
–Immediately send tokens and data
–Heuristic: Only send data if # tokens > 1

Slide 35: Outline
Motivation and Background
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation
–Model checking
–Performance w/ commercial workloads
–Robustness

Slide 36: TokenCMP Evaluation
Simple?
–Some anecdotal examples and comparisons
–Model checking
Fast?
–Full-system simulation w/ commercial workloads
Robust?
–Micro-benchmarks to simulate high contention

Slide 37: Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
Methods:
–TLA+ and TLC
–DirectoryCMP omits all intra-CMP details
–TokenCMP’s correctness substrate modeled
Result:
–Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
–Correctness Substrate verified to be correct and deadlock-free
–All possible performance protocols correct

Slide 38: Performance Evaluation
Target System:
–4 CMPs, 4 procs/CMP
–2GHz OoO SPARC, 8MB shared L2 per chip
–Directly connected interconnect
Methods: Multifacet GEMS simulator
–Simics augmented with timing models
–Released soon:
Benchmarks:
–Performance: Apache, Spec, OLTP
–Robustness: Locking uBenchmark

Slide 39: Full-system Simulation: Runtime
–TokenCMP performs 9-50% faster than DirectoryCMP

Slide 40: Full-system Simulation: Runtime
–TokenCMP performs 9-50% faster than DirectoryCMP
[Chart labels: DRAM Directory, Perfect L2]

Slide 41: Full-system Simulation: Inter-CMP Traffic
–TokenCMP traffic is reasonable (or better)
–DirectoryCMP control overhead greater than broadcast for small system

Slide 42: Full-system Simulation: Intra-CMP Traffic

Slide 43: Performance Robustness
Locking micro-benchmark
[Chart: axis runs from less contention to more contention; series label: correctness substrate only]

Slide 44: Performance Robustness
Locking micro-benchmark
[Chart: axis runs from less contention to more contention; series label: correctness substrate only]

Slide 45: Performance Robustness
Locking micro-benchmark
[Chart: axis runs from less contention to more contention]

Slide 46: Summary
Microprocessor → Chip Multiprocessor (CMP)
Symmetric Multiprocessor (SMP) → Multiple CMPs
Problem: Coherence with Multiple CMPs
Old Solution: Hierarchical Directory (Complex & Slow)
New Solution: Apply Token Coherence
–Developed for glueless multiprocessor [2003]
–Keep: Flat for Correctness
–Exploit: Hierarchical for performance
Less Complex & Faster than Hierarchical Directory