(C) 2005 Multifacet Project
Token Coherence: A Framework for Implementing Multiple-CMP Systems
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania
February 17th, 2005
Improving Multiple-CMP Systems using Token Coherence
Slide 2: Summary
[Figure labels: Microprocessor, Chip Multiprocessor (CMP), Symmetric Multiprocessor (SMP), Multiple CMPs]
Problem: Coherence with Multiple CMPs
Old Solution: Hierarchical Directory
–Complex & Slow
New Solution: Apply Token Coherence
–Developed for glueless multiprocessor [2003]
–Keep: Flat for Correctness
–Exploit: Hierarchical for performance
Less Complex & Faster than Hierarchical Directory
Slide 3: Outline
Motivation and Background
–Coherence in Multiple-CMP Systems
–Example: DirectoryCMP
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation
Slide 4: Coherence in Multiple-CMP Systems
Chip Multiprocessors (CMPs) emerging
Larger systems will be built with Multiple CMPs
[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP shows processors (P) with I and D caches, an on-chip interconnect, and an L2]
Slide 5: Problem: Hierarchical Coherence
Intra-CMP protocol for coherence within CMP
Inter-CMP protocol for coherence between CMPs
Interactions between protocols increase complexity
–explodes state space
[Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Inter-CMP Coherence, Intra-CMP Coherence]
Slide 6: Improving Multiple CMP Systems with Token Coherence
Token Coherence allows Multiple-CMP systems to be...
–Flat for correctness, but
–Hierarchical for performance
[Figure: CMP 1 to CMP 4 joined by an interconnect; labels: Correctness Substrate, Performance Protocol, Low Complexity, Fast]
Slide 7: Example: DirectoryCMP
2-level MOESI Directory
RACE CONDITIONS!
[Figure: CMP 0 and CMP 1, with processors P0 to P7, each with an L1 I&D cache; each CMP has a Shared L2 / directory, and both attach to Memory/Directory. Store B requests exchange getx, fwd, inv, ack, data/ack, and WB messages. Block state labels: B: [S O], B: [M I]]
Slide 8: Token Coherence Summary
Token Coherence separates performance from correctness
Correctness Substrate: Enforces coherence invariant and prevents starvation
1. Safety with Token Counting
2. Starvation Avoidance with Persistent Requests
Performance Policy: Makes the common case fast
–Transient requests to seek tokens (unordered, untracked, unacknowledged)
–Possible prediction, multicast, filters, etc.
Slide 9: Outline
Motivation and Background
Token Coherence: Flat for Correctness
–Safety
–Starvation Avoidance
Token Coherence: Hierarchical for Performance
Evaluation
Slide 10: Example: Token Coherence [ISCA 2003]
Each memory block initialized with T tokens
Tokens stored in memory, caches, & messages
At least one token to read a block
All tokens to write a block
[Figure: P0 to P3, each with L1 I&D and L2 caches, joined by an interconnect to memory modules (mem 0, mem 3); the animation shows Load B and Store B requests]
Slide 11: Extending to Multiple-CMP System
[Figure: the flat system of P0 to P3 (each with L1 I&D and L2) is regrouped into CMP 0 and CMP 1, each with its own on-chip interconnect and a Shared L2, attached to mem 0 and mem 1]
Slide 12: Extending to Multiple-CMP System
Token counting remains flat
Tokens are passed to caches
–Handles shared caches and other complex hierarchies
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches, an on-chip interconnect, and a Shared L2, attached to mem 0 and mem 1; a Store B is shown]
Slide 13: Safety Recap
Safety: Maintain coherence invariant
–Only one writer, or multiple readers
Tokens for Safety
–T tokens associated with each memory block
–Token count encoded in $1 + \log_2 T$ bits
–Processor acquires all tokens to write, a single token to read
Tokens passed to nodes in glueless multiprocessor scheme
–But CMPs have private and shared caches
Tokens passed to caches in Multiple-CMP system
–Arbitrary cache hierarchy easily handled
–Flat for correctness
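The token-counting rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the value `T = 16`, the `Holder` class, and the helper names `transfer` and `token_count_bits` are hypothetical and do not come from the talk. The sketch shows that tokens are only moved, never created or destroyed, so a writer (all T tokens) and a reader (at least one token) cannot coexist.

```python
from dataclasses import dataclass

T = 16  # assumed number of tokens per block; the slides leave T open


@dataclass
class Holder:
    """A cache, memory module, or message holding tokens for one block."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # At least one token is needed to read the block.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All T tokens are needed to write the block.
        return self.tokens == T


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders without creating or destroying any."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot move more tokens than the source holds")
    src.tokens -= count
    dst.tokens += count


def token_count_bits(total: int) -> int:
    """Bits needed to store a count in 0..total.

    Equals 1 + log2(total) when total is a power of two.
    """
    return total.bit_length()


# Memory starts with every token; two L1 caches then obtain tokens.
memory = Holder("mem 0", tokens=T)
l1_p0 = Holder("P0 L1")
l1_p1 = Holder("P1 L1")
transfer(memory, l1_p0, 1)       # P0 can now read
transfer(memory, l1_p1, T - 1)   # P1 can read but still cannot write
```

P1 becomes a writer only after P0 gives up its last token, at which point P0 can no longer read.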
Slide 14: Some Token Counting Implications
Memory must store tokens
–Separate RAM
–Use extra ECC bits
–Token cache
T sized to # caches to allow read-only copies in all caches
Replacements cannot be silent
–Tokens must not be lost or dropped
Targeted for invalidate-based protocols
–Not a solution for write-through or update protocols
Tokens must be identified by block address
–Address must be in all token-carrying messages
Slide 15: Starvation Avoidance
Request messages can miss tokens
–In-flight tokens
–Transient Requests are not tracked throughout system
–Incorrect filtering, multicast, destination-set prediction, etc.
Possible Solution: Retries
–Retry w/ optional randomized backoff is effective for races
Guaranteed Solution: Persistent Requests
–Heavyweight request guaranteed to succeed
–Should be rare (uses more bandwidth)
–Locates all tokens in the system
–Orders competing requests
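The escalation from transient requests to a persistent request might be sketched as follows. The function name `acquire_tokens` and its callback parameters are hypothetical, and a timed-out transient request is modeled simply as a callback that returns `False`; randomized backoff is left out.

```python
from typing import Callable, Tuple


def acquire_tokens(
    send_transient: Callable[[], bool],
    send_persistent: Callable[[], None],
    max_transient_attempts: int = 1,
) -> Tuple[str, int]:
    """Try transient requests first, then fall back to a persistent request.

    send_transient returns True if enough tokens arrived before the
    timeout (assumed interface). send_persistent is assumed to always
    succeed eventually, as the correctness substrate guarantees.
    Returns how the tokens were obtained and how many transient
    requests were sent.
    """
    for attempt in range(1, max_transient_attempts + 1):
        if send_transient():
            return ("transient", attempt)
    # Every transient attempt missed tokens: use the heavyweight path.
    send_persistent()
    return ("persistent", max_transient_attempts)
```

With `max_transient_attempts=1` the sketch never retries a transient request: a single timeout leads straight to a persistent request.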
Slide 16: Starvation Avoidance
Tokens move freely in the system
–Transient requests can miss in-flight tokens
–Incorrect speculation, filters, prediction, etc.
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with L1 I&D caches, Shared L2, mem 0, and mem 1; competing Store B requests issue GETX messages]
Slide 17: Starvation Avoidance
Solution: issue Persistent Request
–Heavyweight request guaranteed to succeed
–Methods: Centralized [2003] and Distributed (New)
[Figure: same two-CMP system, with one outstanding Store B]
Slide 18: Old Scheme: Central Arbiter [2003]
–Processors issue persistent requests
[Figure: same two-CMP system with three outstanding Store B requests; label: timeout; arbiter 0 holds B: P0, B: P2, B: P1]
Slide 19: Old Scheme: Central Arbiter [2003]
–Processors issue persistent requests
–Arbiter orders and broadcasts activate
[Figure: arbiter 0 holds B: P0, B: P2, B: P1 and broadcasts the activation B: P0]
Slide 20: Old Scheme: Central Arbiter [2003]
–Processor sends deactivate to arbiter
–Arbiter broadcasts deactivate (and next activate)
–Bottom Line: handoff is 3 message latencies
[Figure: arbiter 0 holds B: P2, B: P1; B: P0 is deactivated and B: P2 is activated; the three handoff messages are numbered 1, 2, 3]
Slide 21: Improved Scheme: Distributed Arbitration [NEW]
–Processors broadcast persistent requests
[Figure: same two-CMP system with outstanding Store B requests; every cache holds a table with entries P0: B, P1: B, P2: B]
Slide 22: Improved Scheme: Distributed Arbitration [NEW]
–Processors broadcast persistent requests
–Fixed priority (processor number)
[Figure: every cache holds a table with entries P0: B, P1: B, P2: B; the highlighted request is P0: B]
Slide 23: Improved Scheme: Distributed Arbitration [NEW]
–Processors broadcast persistent requests
–Fixed priority (processor number)
–Processors broadcast deactivate
[Figure: every cache holds a table with entries P0: B, P1: B, P2: B; the highlighted request is P1: B; the single handoff message is numbered 1]
Slide 24: Improved Scheme: Distributed Arbitration [NEW]
–Bottom line: Handoff is a single message latency
–Subtle point: P0 and P1 must wait until next "wave"
[Figure: every cache's table now holds only P1: B and P2: B; the highlighted request is P1: B]
Slide 25: Implementing Distributed Persistent Requests
Table at each cache
–Sized to N entries for each processor (we use N=1)
–Indexed by processor ID
–Content-addressable by address
Each incoming message must access table
–Not on the critical path, so the CAM can be slow
Activate/deactivate reordering cannot be allowed
–Persistent request virtual channel must be point-to-point ordered
–Or use another solution, such as sequence numbers or acks
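A minimal sketch of the per-cache table, assuming N=1 entry per processor as on the slide. The class name `PersistentRequestTable` and its methods are hypothetical. The sketch also assumes that the lowest processor number has the highest fixed priority, which matches the P0, P1, P2 order in the animation but is not stated outright. It does not model the "wave" rule or message ordering.

```python
from typing import List, Optional


class PersistentRequestTable:
    """One table per cache: one entry per processor (N=1), indexed by processor ID."""

    def __init__(self, num_processors: int) -> None:
        # entries[p] is the block address processor p has an active
        # persistent request for, or None if it has none.
        self.entries: List[Optional[str]] = [None] * num_processors

    def activate(self, proc_id: int, address: str) -> None:
        """Record a broadcast persistent request from proc_id."""
        self.entries[proc_id] = address

    def deactivate(self, proc_id: int) -> None:
        """Clear proc_id's entry when its deactivate broadcast arrives."""
        self.entries[proc_id] = None

    def active_requester(self, address: str) -> Optional[int]:
        """Content-addressable lookup: who should receive tokens for address?

        Assumption: the lowest-numbered requesting processor wins.
        """
        for proc_id, requested in enumerate(self.entries):
            if requested == address:
                return proc_id
        return None
```

Because every cache applies the same fixed priority to the same table contents, each cache can pick the next requester locally once a deactivate arrives, without consulting an arbiter.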
Slide 26: Implementing Distributed Persistent Requests
Should reads be distinguished from writes?
–Not necessary, but
–Persistent Read request is helpful
Implications of flat distributed arbitration
–Simple flat for correctness
–Global broadcast when used
–Fortunately, persistent requests are rare in typical workloads (0.3%)
–Bad workload (very high contention) would burn bandwidth
–Maximum # processors must be architected
What about a hierarchical persistent request scheme?
–Possible, but correctness is no longer flat
–Make the common case fast
Slide 27: Reducing Unnecessary Traffic
Problem: Which token-holding cache responds with data?
Solution: Distinguish one token as the owner token
–The owner includes data with token response
–Clean vs. dirty owner distinction also useful for writebacks
Slide 28: Outline
Motivation and Background
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
–TokenCMP
–Another look at performance policies
Evaluation
Slide 29: Hierarchical for Performance: TokenCMP
Target System:
–2-8 CMPs
–Private L1s, shared L2 per CMP
–Any interconnect, but high-bandwidth
Performance Policy Goals:
–Aggressively acquire tokens
–Exploit on-chip locality and bandwidth
–Respect cache hierarchy
–Detect and handle missed tokens
Slide 30: Hierarchical for Performance: TokenCMP
Approach:
–On L1 miss, broadcast within own CMP (local cache responds if possible)
–On L2 miss, broadcast to other CMPs
–Appropriate L2 bank responds or broadcasts within its CMP (optionally filter)
–Responses between CMPs carry extra tokens for future locality
Handling missed tokens:
–Timeout after average memory latency
–Invoke persistent request (no retries)
Larger systems can use filters, multicast, soft-state directories
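The lookup order in this approach can be sketched as a two-stage search. Everything here is a simplification for illustration: the function `locate_tokens`, the cache names, and the dictionary model (cache name to a map from block to token count) are hypothetical, and broadcasts are modeled as plain loops rather than messages.

```python
from typing import Dict, Optional, Tuple

# Assumed model: each cache maps a block address to the tokens it holds.
Cache = Dict[str, int]


def locate_tokens(
    block: str,
    on_chip: Dict[str, Cache],
    off_chip: Dict[str, Cache],
) -> Tuple[str, Optional[str]]:
    """Return where tokens for block were found, searching on-chip first."""
    # Stage 1: on an L1 miss, broadcast within the requester's own CMP.
    for name, cache in on_chip.items():
        if cache.get(block, 0) > 0:
            return ("intra-CMP", name)
    # Stage 2: on an L2 miss, broadcast to the other CMPs.
    for name, cache in off_chip.items():
        if cache.get(block, 0) > 0:
            return ("inter-CMP", name)
    # Nothing responded: the requester times out and would then
    # invoke a persistent request (no retries in TokenCMP).
    return ("timeout", None)


local = {"P1 L1": {}, "CMP 0 L2": {"A": 4}}
remote = {"CMP 1 L2": {"B": 16}}
```

In this example, block "A" is found on chip, block "B" only after the inter-CMP broadcast, and any other block ends in a timeout.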
Slide 31: Other Optimizations in TokenCMP
Implementing E-state
–Memory responds with all tokens on read request
–Use clean/dirty owner distinction to eliminate writing back unwritten data
Implementing Migratory Sharing
–What is it? A processor's read request results in exclusive permission if responder has exclusive permission and wrote the block
–In TokenCMP, simply return all tokens
Non-speculative delay
–Hold block for some # cycles so permission isn't stolen prematurely
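The migratory-sharing rule reduces to a small decision about how many tokens a responder returns for a read request. The function name `tokens_for_read_response` is hypothetical, and returning exactly one token in the non-migratory case is an assumption made for this sketch; a real policy may return more.

```python
def tokens_for_read_response(
    held_tokens: int,
    total_tokens: int,
    wrote_block: bool,
) -> int:
    """How many tokens a responder gives up for an incoming read request."""
    if held_tokens == 0:
        return 0  # nothing to give; this cache does not respond
    if held_tokens == total_tokens and wrote_block:
        # Migratory sharing: the responder had exclusive permission and
        # wrote the block, so hand over all tokens (exclusive permission).
        return total_tokens
    # Assumption: otherwise hand over a single token (read permission).
    return 1
```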
Slide 32: Another Look at Performance Policies
How to find tokens?
–Broadcast
–Broadcast w/ filters
–Multicast (destination-set prediction)
–Directories (soft or hard)
Who responds with data?
–Owner token: TokenCMP uses Owner token for Inter-CMP responses
–Other heuristics: For TokenCMP intra-CMP responses, cache responds if it has extra tokens
Slide 33: Transient Requests May Reduce Complexity
Processor holds the only required state about request
L2 controller in TokenCMP very simple:
–Re-broadcasts L1 request message on a miss
–Re-broadcasts or filters external request messages
–Possible states: no tokens (I), all tokens (M), some tokens (S)
–Bounce unexpected tokens to memory
DirectoryCMP's L2 controller is complex
–Allocates MSHR on miss and forward
–Issues invalidates and receives acks
–Orders all intra-CMP requests and writebacks
–57 states in our L2 implementation!
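The three TokenCMP L2 states listed above follow directly from the token count, which a short sketch makes concrete. The function name `l2_state` is hypothetical.

```python
def l2_state(held_tokens: int, total_tokens: int) -> str:
    """Map a token count to the stable state named on the slide."""
    if not 0 <= held_tokens <= total_tokens:
        raise ValueError("token count out of range")
    if held_tokens == 0:
        return "I"  # no tokens
    if held_tokens == total_tokens:
        return "M"  # all tokens
    return "S"      # some tokens
```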
Slide 34: Writebacks
DirectoryCMP uses "3-phase writebacks"
–L1 issues writeback request
–L2 enters transient state or blocks request
–L2 responds with writeback ack
–L1 sends data
TokenCMP uses "fire-and-forget" writebacks
–Immediately send tokens and data
–Heuristic: Only send data if # tokens > 1
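A fire-and-forget writeback can be sketched as building a single message with no handshake. The `WritebackMessage` type and the function name are hypothetical. The sketch implements only the token-count heuristic stated on the slide; it deliberately ignores the owner token and the clean/dirty distinction, which a real implementation would also have to consider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WritebackMessage:
    address: int
    tokens: int
    data: Optional[bytes]  # None when the heuristic omits the data


def fire_and_forget_writeback(
    address: int, tokens: int, data: bytes
) -> WritebackMessage:
    """Build the one message an evicting cache sends; no request/ack phases."""
    if tokens < 1:
        raise ValueError("a writeback must carry at least one token")
    # Heuristic from the slide: only send data if # tokens > 1.
    return WritebackMessage(address, tokens, data if tokens > 1 else None)
```

Compared with the three-phase scheme, the evicting cache needs no transient state: once the message is sent, it holds no tokens for the block.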
Slide 35: Outline
Motivation and Background
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation
–Model checking
–Performance w/ commercial workloads
–Robustness
Slide 36: TokenCMP Evaluation
Simple?
–Some anecdotal examples and comparisons
–Model checking
Fast?
–Full-system simulation w/ commercial workloads
Robust?
–Micro-benchmarks to simulate high contention
Slide 37: Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
Methods:
–TLA+ and TLC
–DirectoryCMP omits all intra-CMP details
–TokenCMP's correctness substrate modeled
Result:
–Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
–Correctness Substrate verified to be correct and deadlock-free
–All possible performance protocols correct
Slide 38: Performance Evaluation
Target System:
–4 CMPs, 4 procs/CMP
–2GHz OoO SPARC, 8MB shared L2 per chip
–Directly connected interconnect
Methods: Multifacet GEMS simulator
–Simics augmented with timing models
–Released soon:
Benchmarks:
–Performance: Apache, Spec, OLTP
–Robustness: Locking uBenchmark
Slide 40: Full-system Simulation: Runtime
–TokenCMP performs 9-50% faster than DirectoryCMP
[Figure: runtime chart; labels: DRAM Directory, Perfect L2]
Slide 41: Full-system Simulation: Inter-CMP Traffic
–TokenCMP traffic is reasonable (or better)
–DirectoryCMP control overhead greater than broadcast for small system
[Figure: inter-CMP traffic chart]
Slide 42: Full-system Simulation: Intra-CMP Traffic
[Figure: intra-CMP traffic chart]
Slide 43: Performance Robustness
[Figure: locking micro-benchmark results; axis runs from less contention to more contention; one series is labeled "correctness substrate only"]
Slide 45: Performance Robustness
[Figure: locking micro-benchmark results; axis runs from less contention to more contention]
Slide 46: Summary
[Figure labels: Microprocessor, Chip Multiprocessor (CMP), Symmetric Multiprocessor (SMP), Multiple CMPs]
Problem: Coherence with Multiple CMPs
Old Solution: Hierarchical Directory
–Complex & Slow
New Solution: Apply Token Coherence
–Developed for glueless multiprocessor [2003]
–Keep: Flat for Correctness
–Exploit: Hierarchical for performance
Less Complex & Faster than Hierarchical Directory