Improving Multiple-CMP Systems with Token Coherence

Presentation transcript:

Improving Multiple-CMP Systems with Token Coherence
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania
Thanks to Intel, NSERC, NSF, and Sun

Summary
Microprocessor → Chip Multiprocessor (CMP); Symmetric Multiprocessor (SMP) → Multiple CMPs
Problem: coherence with multiple CMPs
Old solution: hierarchical protocols, which are complex and slow
New solution: apply Token Coherence, developed for glueless multiprocessors [ISCA 2003]
Keep: flat for correctness
Exploit: hierarchical structure for performance
Result: less complex and faster than a hierarchical directory

Outline
Motivation and Background
Coherence in Multiple-CMP Systems
Example: DirectoryCMP
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation

Coherence in Multiple-CMP Systems
Chip Multiprocessors (CMPs) are emerging; larger systems will be built with multiple CMPs.
[Figure: four CMPs on a system interconnect; each CMP contains processors with L1 I&D caches and an L2]

Problem: Hierarchical Coherence
An intra-CMP protocol maintains coherence within each CMP; an inter-CMP protocol maintains coherence between CMPs.
Interactions between the two protocols increase complexity and explode the state space.
[Figure: four CMPs on an interconnect, with intra-CMP coherence inside each chip and inter-CMP coherence between chips]

Improving Multiple-CMP Systems with Token Coherence
Token Coherence allows Multiple-CMP systems to be flat for correctness, but hierarchical for performance: low complexity and fast.
[Figure: a flat correctness substrate spanning all four CMPs, with a hierarchical performance protocol layered on top of the interconnect]

Example: DirectoryCMP
A 2-level MOESI directory protocol: race conditions!
[Figure: two CMPs, each with four processors (L1 I&D caches), a shared L2/directory, and a memory/directory. Two concurrent Store B requests race with a writeback: getx, fwd, inv/ack, and data/ack messages interleave while block B transitions S→O in one directory and M→I in the other]

Outline
Motivation and Background
Token Coherence: Flat for Correctness
Safety
Starvation Avoidance
Token Coherence: Hierarchical for Performance
Evaluation

Example: Token Coherence [ISCA 2003]
Each memory block is initialized with T tokens.
Tokens are stored in memory, caches, and messages.
At least one token is required to read a block.
All T tokens are required to write a block.
[Figure: four processors with L1 I&D and L2 caches on an interconnect; loads and stores to block B move tokens between memories and caches]
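The token-counting rules above can be sketched in a few lines. This is an illustrative toy (the class names and T value are assumptions, not the authors' code); the key properties are token conservation and the read/write permission checks.

```python
T = 4  # assumed token count per block in this toy system

class Cache:
    """Minimal holder of tokens for a single block."""
    def __init__(self, name):
        self.name = name
        self.tokens = 0

    def can_read(self):
        return self.tokens >= 1      # at least one token to read

    def can_write(self):
        return self.tokens == T      # all tokens to write

def transfer(src, dst, n):
    """Move n tokens; tokens are conserved, never created or destroyed."""
    assert src.tokens >= n
    src.tokens -= n
    dst.tokens += n

memory = Cache("mem")
memory.tokens = T                    # all tokens start at memory
p0, p1 = Cache("P0"), Cache("P1")

transfer(memory, p0, 1)              # P0 holds one token: read-only
assert p0.can_read() and not p0.can_write()

transfer(memory, p1, T - 1)          # P1 collects the remaining tokens...
transfer(p0, p1, 1)                  # ...including P0's
assert p1.can_write()                # all T tokens => exclusive write permission
```

Because token counts are conserved regardless of where tokens sit (memory, cache, or in-flight message), safety holds without any hierarchy-aware reasoning.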

Extending to a Multiple-CMP System
[Figure: two CMPs, each with private L1 I&D caches and a shared L2, connected by an inter-chip interconnect to memories]

Extending to a Multiple-CMP System
Token counting remains flat: tokens flow directly to caches.
This handles shared caches and other complex hierarchies.
[Figure: processors in different CMPs each issue Store B; tokens move across the inter-chip interconnect to the requesting caches]

Starvation Avoidance
Tokens move freely in the system, so transient requests can miss in-flight tokens (incorrect speculation, filters, prediction, etc.).
[Figure: three processors issue GETX requests for block B while its tokens are in flight]

Starvation Avoidance
Solution: issue a Persistent Request, a heavyweight request guaranteed to succeed.
Methods: Centralized [ISCA 2003] and Distributed (new).

Old Scheme: Central Arbiter [ISCA 2003]
Processors issue persistent requests after a timeout.
[Figure: three processors time out on Store B and send persistent requests to arbiter 0, which queues B: P0, B: P1, B: P2]

Old Scheme: Central Arbiter [ISCA 2003]
Processors issue persistent requests.
The arbiter orders them and broadcasts an activate message.
[Figure: arbiter 0 activates B: P0; every cache and memory records the active persistent request]

Old Scheme: Central Arbiter [ISCA 2003]
(1) The processor sends a deactivate to the arbiter; (2) the arbiter broadcasts the deactivate (and the next activate).
Bottom line: handoff takes 3 message latencies.

Improved Scheme: Distributed Arbitration [NEW]
Processors broadcast persistent requests; every node records them in a table.
[Figure: P0, P1, and P2 each broadcast a persistent request for B; all caches and memories record P0: B, P1: B, P2: B]

Improved Scheme: Distributed Arbitration [NEW]
Processors broadcast persistent requests.
A fixed priority (by processor number) arbitrates among them.

Improved Scheme: Distributed Arbitration [NEW]
Processors broadcast persistent requests; fixed priority (processor number) selects the winner.
When finished, processors broadcast a deactivate.

Improved Scheme: Distributed Arbitration [NEW]
Bottom line: handoff is a single message latency.
Subtle point: P0 and P1 must wait until the next “wave” before reissuing.
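The distributed scheme can be sketched as a replicated table plus a fixed-priority rule. This is an assumed simplification (single shared table, no wave bookkeeping) standing in for the per-node tables in the slides:

```python
# Each node replicates a table of active persistent requests; the
# lowest-numbered processor wins for a given block. Here one dict
# stands in for all the replicated tables.

active = {}   # block -> set of requesting processor ids

def broadcast_activate(block, proc):
    """A processor broadcasts a persistent request; all nodes record it."""
    active.setdefault(block, set()).add(proc)

def broadcast_deactivate(block, proc):
    """The winner broadcasts a deactivate when its request completes."""
    active[block].discard(proc)

def winner(block):
    """Fixed priority: lowest processor number among active requesters."""
    reqs = active.get(block, set())
    return min(reqs) if reqs else None

# P0, P1, P2 all issue persistent requests for block B
for p in (0, 1, 2):
    broadcast_activate("B", p)
assert winner("B") == 0        # P0 is activated first

broadcast_deactivate("B", 0)   # P0 finishes and deactivates
assert winner("B") == 1        # handoff: P1 wins with no arbiter round-trip
```

Because every node can compute the winner locally from its table, the handoff needs only the single deactivate broadcast, versus the three message latencies of the central-arbiter scheme. (The real scheme additionally gates reissued requests into the next "wave" for fairness.)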

Outline
Motivation and Background
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation

Hierarchical for Performance: TokenCMP
Target system: 2-8 CMPs; private L1s and a shared L2 per CMP; any interconnect, but high-bandwidth.
Performance policy goals:
Aggressively acquire tokens
Exploit on-chip locality and bandwidth
Respect the cache hierarchy
Detect and handle missed tokens

Hierarchical for Performance: TokenCMP
Approach:
On an L1 miss, broadcast within the requester's own CMP; a local cache responds if possible.
On an L2 miss, broadcast to the other CMPs; the appropriate L2 bank responds or broadcasts within its own CMP (optionally filtered).
Responses between CMPs carry extra tokens for future locality.
Handling missed tokens:
Time out after the average memory latency.
Invoke a persistent request (no retries).
Larger systems can use filters, multicast, or soft-state directories.
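The escalation path above can be sketched as a simple control flow. The classes, threshold value, and return strings are hypothetical stand-ins; real TokenCMP must also deal with in-flight tokens, filtering, and per-bank broadcast.

```python
TOTAL_TOKENS = 4
AVG_MEM_LATENCY = 100  # cycles; assumed timeout threshold

class CMPStub:
    """Stand-in for one CMP's caches: just a token count for one block."""
    def __init__(self, tokens):
        self.tokens = tokens

    def respond(self, need_all):
        # Respond only if this CMP holds enough tokens for the request type.
        return self.tokens == TOTAL_TOKENS if need_all else self.tokens >= 1

def issue_persistent_request(block):
    return "persistent request"   # guaranteed to succeed, no retries

def acquire_tokens(block, need_all, local_cmp, other_cmps, cycles_waited=0):
    # 1. L1 miss: broadcast within the requester's own CMP first.
    if local_cmp.respond(need_all):
        return "local response"
    # 2. L2 miss: broadcast to the other CMPs; the appropriate L2 bank
    #    responds or re-broadcasts inside its own CMP.
    for cmp_ in other_cmps:
        if cmp_.respond(need_all):
            return "remote response"
    # 3. Transient requests can miss in-flight tokens; after roughly the
    #    average memory latency, escalate to a persistent request.
    if cycles_waited >= AVG_MEM_LATENCY:
        return issue_persistent_request(block)
    return "waiting"

# A read (one token suffices) satisfied by a remote CMP:
assert acquire_tokens("B", False, CMPStub(0), [CMPStub(2)]) == "remote response"
# A write (all tokens needed) whose tokens are in flight escalates:
assert acquire_tokens("B", True, CMPStub(1), [CMPStub(2)],
                      cycles_waited=150) == "persistent request"
```

The point of the ordering is that the common case (tokens already on-chip) never leaves the CMP, while the rare missed-token case falls back to the flat correctness substrate rather than retrying.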

Outline
Motivation and Background
Token Coherence: Flat for Correctness
Token Coherence: Hierarchical for Performance
Evaluation
Model checking
Performance with commercial workloads
Robustness

TokenCMP Evaluation
Simple? Model checking.
Fast? Full-system simulation with commercial workloads.
Robust? Micro-benchmarks to simulate high contention.

Complexity Evaluation with Model Checking
Methods: TLA+ and TLC; the DirectoryCMP model omits all intra-CMP details, while TokenCMP's correctness substrate is modeled fully.
Result: complexity is similar between TokenCMP and the non-hierarchical DirectoryCMP.
The correctness substrate was verified correct and deadlock-free on a small configuration with varied parameters; all possible performance protocols remain correct.

Performance Evaluation
Target system: 4 CMPs, 4 processors per CMP; 2 GHz out-of-order SPARC cores; 8 MB shared L2 per chip; directly connected interconnect.
Methods: Multifacet GEMS simulator (Simics augmented with timing models); released soon at http://www.cs.wisc.edu/gems (ISCA 2005 tutorial!).
Benchmarks: performance with Apache, SPEC, and OLTP; robustness with a locking micro-benchmark.

Full-system Simulation: Runtime
TokenCMP performs 9-50% faster than DirectoryCMP.

Full-system Simulation: Runtime
TokenCMP performs 9-50% faster than DirectoryCMP.
[Chart: runtimes, with additional bars for a DRAM Directory configuration and a Perfect L2 bound]

Full-system Simulation: Traffic
TokenCMP traffic is reasonable (or better).
DirectoryCMP's control overhead is greater than broadcast for a small system.

Performance Robustness
Locking micro-benchmark (correctness substrate only).
[Chart: runtime from more contention to less contention]


Performance Robustness
Locking micro-benchmark.
[Chart: runtime from more contention to less contention]

Summary
Microprocessor → Chip Multiprocessor (CMP); Symmetric Multiprocessor (SMP) → Multiple CMPs
Problem: coherence with multiple CMPs
Old solution: hierarchical protocols, which are complex and slow
New solution: apply Token Coherence, developed for glueless multiprocessors [ISCA 2003]
Keep: flat for correctness
Exploit: hierarchical structure for performance
Result: less complex and faster than a hierarchical directory

Full-system Simulation: Traffic

Full-system Simulation: Intra-CMP Traffic