Token Tenure: PATCHing Token Counting Using Directory-Based Cache Coherence Arun Raghavan, Colin Blundell, Milo Martin University of Pennsylvania {arraghav,


Token Tenure: PATCHing Token Counting Using Directory-Based Cache Coherence Arun Raghavan, Colin Blundell, Milo Martin University of Pennsylvania {arraghav, blundell,

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License.
You are free:
  to Share: to copy, distribute, display, and perform the work
  to Remix: to make derivative works
Under the following conditions:
  Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to:
Any of the above conditions can be waived if you get permission from the copyright holder.
Apart from the remix rights granted under this license, nothing in this license impairs or restricts the author's moral rights.
[ 2 ] PATCH - Arun Raghavan - MICRO 2008

Why Yet Another Coherence Protocol?

                                   Fast sharing   Avoids broadcast   Scalable interconnect
Snoopy                             ✔              ✗                  ✗
Directory (track sharers)          ✗              ✔                  ✔
Token Coherence (token counting)   ✔              ✗                  ✔
Our goal                           ✔              ✔                  ✔

This work: combining directory and token counting

Overview
Begin with a standard directory protocol
  Fast sharing misses? Direct requests
  Ensure safety? Token counting
  Broadcast-free forward progress? Token Tenure
    Directory selects one requestor to retain tokens
    Requestors give up tokens after a timeout interval
PATCH: Predictive, Adaptive Token Counting Hybrid
  Send request “hints” directly to predicted sharers
  Retain scalability? Lowest-priority, best-effort delivery
Fast sharing misses, scales as directory

Directory Operation
[Diagram: processors P0, P1, P2 and the directory. Initially the directory records P0 as owner; P0 is in state M, P1 and P2 are in state I.]
Messages shown for a store miss, in order:
  GetM to the directory
  Fwd(P1), acks=1 from the directory
  Data, acks=1 to the requestor (old owner goes from M to I, requestor from I to M)
  Unblock to the directory, which then records P1 as owner

Directory with Direct Requests?
[Diagram: processors P0, P1, P2 and the directory, which records P1 as owner. P1 starts in state M; the other two processors start in state I.]
Messages shown:
  Load miss: GetS sent both to the directory and directly to the owner; the owner returns Data and goes from M to O, the requestor goes to S
  Store miss: GetM to the directory; Fwd(P0), acks=1 from the directory; Data, acks=1 to the requestor, which goes to M while the old owner goes to I
Incoherence!!
Why? Direct requests break key directory assumption

Restoring Coherence
Coherence invariant: one writer or many readers
Directory: enforces implicitly by distributed algorithm
  Assumes complete state information at the directory
Alternative: encode permission with token count
  Fixed number of tokens per cache block
  Need all tokens to write
  One or more tokens to read
Explicitly enforces coherence invariant
  Without regard to races, protocol details
Token Coherence [ISCA ’03]
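The token-count rule above can be made concrete with a short sketch. This is an illustration only, not the protocol from the talk: the names `Holder`, `transfer`, and the value of `TOTAL_TOKENS` are assumptions, and details a real protocol needs (valid data, an owner token) are left out.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # assumed fixed number of tokens for one cache block


@dataclass
class Holder:
    """A cache or memory that may hold some of the block's tokens."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # One or more tokens grant read permission.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All tokens are required for write permission.
        return self.tokens == TOTAL_TOKENS


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot send tokens the sender does not hold")
    src.tokens -= count
    dst.tokens += count


memory = Holder("memory", TOTAL_TOKENS)
p0, p1 = Holder("P0"), Holder("P1")
transfer(memory, p0, 1)
transfer(memory, p1, 1)
# Both caches may now read, and no holder has all tokens, so nobody may write.
```

Because the total is conserved, a writer (all tokens) and any other reader (at least one token) cannot coexist, whatever order the messages arrive in.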

Directory with Direct Requests: Tokens
[Diagram: processors P0, P1, P2 and the directory, which records P1 as owner. Messages shown: Load miss, GetS, Data; Store miss, GetM, Fwd(P0), Data.]
P0 Starves

Token Coherence Solution: Persistent Requests
[Diagram: the same race between the load miss and the store miss; a Persistent Request is issued.]
Broadcast
Table at each processor
N² state

Our Solution
[Diagram: the same race between the load miss and the store miss.]
P0’s request reached directory first
Directory declares P0 winner
Directory forwards to P0
P2 non-winner
  Inferred after timeout

Token Tenure
Tokens can be tenured or untenured
  Tokens are untenured by default
  Untenured tokens must be sent to the directory...
  ...unless tenured within timeout window
Active (winner) requestors tenure tokens
  Directory activates one request at a time
  Directory explicitly informs active requestor
  Multiple processors can hold tenured tokens
Why does this ensure forward progress?
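The tenure rule listed above can be sketched as a small per-block state machine. This is a simplified model under stated assumptions: the class `CacheLine`, its method names, and the cycle-based timer are hypothetical, and deactivation once a request completes is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheLine:
    """Hypothetical per-block token state at one processor."""
    tokens: int = 0
    tenured: bool = False
    deadline: Optional[int] = None  # cycle at which untenured tokens must leave

    def receive_tokens(self, count: int, now: int, timeout: int) -> None:
        self.tokens += count
        # Tokens arrive untenured unless this requestor is already active.
        if not self.tenured and self.deadline is None:
            self.deadline = now + timeout

    def activate(self) -> None:
        """The directory informed this requestor that it is the active one."""
        self.tenured = True
        self.deadline = None

    def tick(self, now: int) -> int:
        """Return the number of tokens to send to the directory this cycle."""
        if self.tenured or self.deadline is None or now < self.deadline:
            return 0
        returned, self.tokens, self.deadline = self.tokens, 0, None
        return returned


winner, loser = CacheLine(), CacheLine()
winner.receive_tokens(2, now=0, timeout=10)
loser.receive_tokens(1, now=0, timeout=10)
winner.activate()  # only the active requestor keeps its tokens past the timeout
```

In this model a racing requestor that is never activated hands its tokens to the directory once the timeout expires, so they can flow on to the active requestor.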

Flow of Tokens to Active Requestor
[Diagram: untenured and tenured tokens flowing among the directory, the active requestor, and a racing requestor. Edge labels: Direct request, Forwarded request, Racing Request, Timeout, Bounce.]
Restore directory’s ability to resolve races
Implementation: add timeout

Token Tenure: Implementation
No common-case performance impact
  Activation off critical path of miss
  Token count still determines permissions
No additional traffic
  Activation piggybacked on forwarded messages
Set timeout to twice average round-trip latency
  Avoid early timeout...
  ...but minimize slowing down winner in races
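The timeout rule above is simple arithmetic, shown here as a sketch. The slides do not say how the average round-trip latency is measured; the running mean and the class name `TenureTimeout` are assumptions made for illustration.

```python
class TenureTimeout:
    """Tracks a running mean of round-trip latency (assumed measurement method)."""

    def __init__(self, initial_latency: float) -> None:
        self.samples = 1
        self.mean = float(initial_latency)

    def observe(self, round_trip: float) -> None:
        # Incremental update of the arithmetic mean.
        self.samples += 1
        self.mean += (round_trip - self.mean) / self.samples

    def timeout(self) -> float:
        # Rule from the slide: twice the average round-trip latency.
        return 2.0 * self.mean


estimator = TenureTimeout(initial_latency=100)
estimator.observe(200)
current_timeout = estimator.timeout()  # mean is 150 cycles, so the timeout is 300
```

A larger multiplier would make early timeouts rarer but would hold tokens away from the winner for longer in a race, which is the trade-off the slide names.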

Using Direct Requests
Direct requests to no, some or all processors
[Chart: normalized runtime for jbb, oltp, apache, barnes, ocean; 64 processors, 16B/cycle. Legible bar labels: 7%, 28%, 8%, 18%; average 14%.]
Direct requests improve performance
But at what cost?
Dest. Set Prediction [ISCA ’03]

Direct Requests: Runtime and Traffic
Direct requests to no, some or all processors
  If successful, two-hop miss
  Else, directory forwards anyway
[Charts: normalized runtime and normalized traffic for jbb, oltp, apache, barnes, ocean.]
Predictors see benefit using fewer direct requests
PATCH-NoDirect and Directory have identical traffic
PATCH-Broadcast has >100% overhead

Best-Effort Direct Requests
Direct requests in PATCH
  1. Strictly in addition to directory requests
  2. Don’t need explicit acks
  → direct requests can be dropped arbitrarily
Best-effort delivery
  Lowest priority, deliver strictly on a “do-no-harm” basis
  If queued up too long in switches or controllers: drop
  → lower bound: PATCH-NoDirect performance
Adequate bandwidth? Drop no requests
Scarce bandwidth? Drop all requests
Never worse than directory
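The best-effort policy above can be illustrated with a two-queue output port. This is a minimal sketch, not the interconnect from the talk: `Message`, `SwitchPort`, and the `max_direct_wait` threshold are hypothetical names and parameters.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    kind: str          # "direct" for a best-effort hint, anything else must be delivered
    enqueued_at: int   # cycle at which the message entered the port


class SwitchPort:
    def __init__(self, max_direct_wait: int) -> None:
        self.max_direct_wait = max_direct_wait  # assumed drop threshold, in cycles
        self.normal = deque()
        self.direct = deque()
        self.dropped = 0

    def enqueue(self, msg: Message) -> None:
        (self.direct if msg.kind == "direct" else self.normal).append(msg)

    def next_to_send(self, now: int) -> Optional[Message]:
        # Drop direct requests that have waited too long; no ack depends on them.
        while self.direct and now - self.direct[0].enqueued_at > self.max_direct_wait:
            self.direct.popleft()
            self.dropped += 1
        # Direct requests have lowest priority: send them only when nothing else waits.
        if self.normal:
            return self.normal.popleft()
        if self.direct:
            return self.direct.popleft()
        return None


port = SwitchPort(max_direct_wait=5)
port.enqueue(Message("direct", enqueued_at=0))
port.enqueue(Message("directory", enqueued_at=1))
first = port.next_to_send(now=2)  # the directory request goes first
```

Since a dropped direct request is always backed by the request sent to the directory, dropping every hint in this model degrades to plain directory behaviour.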

Best-Effort Direct Requests
[Chart: normalized runtime versus number of processors; data labels 29% and 20%.]
Broadcast performance with plentiful bandwidth
Converges with directory performance at 512 processors
Adapt dynamically; one-size-fits-all
Better than both

Enhancing Directory Scalability
[Diagram, Directory: a Req reaches the directory, which sends Forward messages and receives Acks; caches shown in states I, I, S, S; directory bits shown: 0 1.]

Enhancing Directory Scalability
Coarse directories: 1 bit for k sharers
Fan-out delivery of forwards: worst case O(N) traffic
Requires acks from non-sharers too
  Multiple unicast messages (no ack combining)
  Worst case O(N√N) on 2D torus interconnect
[Diagram, Directory-coarse: Req, Forward and Ack messages; caches shown in states I, I, S, S.]
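A coarse sharer vector of the kind described above can be sketched as follows. The function names and the contiguous grouping of processors (processor p belongs to group p // k) are assumptions for illustration.

```python
from typing import List, Set


def encode(sharers: Set[int], num_procs: int, k: int) -> List[bool]:
    """One bit per group of k processors; a bit is set if any member is a sharer."""
    groups = (num_procs + k - 1) // k
    bits = [False] * groups
    for p in sharers:
        bits[p // k] = True
    return bits


def forward_targets(bits: List[bool], num_procs: int, k: int) -> List[int]:
    """Every processor in a marked group receives the forward, sharer or not."""
    return [p for p in range(num_procs) if bits[p // k]]


bits = encode({1, 9}, num_procs=16, k=4)
targets = forward_targets(bits, num_procs=16, k=4)
# Two real sharers expand to eight forwarded-to processors: 0-3 and 8-11.
```

The directory shrinks from 16 bits to 4 here, at the cost of forwards (and, in a conventional directory, acks) for the six processors that never held the block.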

Enhancing Directory Scalability
With PATCH only token holders need to respond
  Avoid “unnecessary acknowledgements”
  When # of sharers is small, prevents acks from dominating
Even more scalable than directory
[Diagram: Directory-coarse (Req, Forward, Ack) beside PATCH-coarse (Req, Forward); caches shown in states I, I, S, S.]
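The difference in response traffic can be counted with a small sketch. It assumes, as the slides state, that a coarse directory needs an ack from every forwarded-to processor while PATCH needs a response only from token holders; the function names are hypothetical.

```python
from typing import Iterable


def acks_directory_coarse(forward_targets: Iterable[int]) -> int:
    """Coarse directory: every forwarded-to processor acks, sharer or not."""
    return len(set(forward_targets))


def acks_patch_coarse(forward_targets: Iterable[int], token_holders: Iterable[int]) -> int:
    """PATCH: only forwarded-to processors that actually hold tokens respond."""
    return len(set(forward_targets) & set(token_holders))


targets = range(8)   # one coarse bit covering 8 processors (assumed example)
holders = {1, 5}     # only two of them hold tokens
directory_acks = acks_directory_coarse(targets)
patch_responses = acks_patch_coarse(targets, holders)
```

In this example the coarse directory collects 8 acks while PATCH collects 2 responses, and the gap widens as each bit covers more processors.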

Coarse Directory: Runtime and Traffic
256 processors
[Chart, traffic comparison: normalized traffic versus coarseness (sharers/bit) for Directory and PATCH; data labels 319% and 32%.]
[Chart, runtime comparison at 2B/cycle: normalized runtime; data labels 142% and 3.6%.]
PATCH has high tolerance to inexactness

Related Work
Token counting
Token Coherence [Martin+, ISCA ’03]
Priority Requests [Cuesta+, PDP ’07]
Virtual Hierarchies [Marty+, ISCA ’07]
Ring Order [Marty+, MICRO ’06]
Predictive direct requests
Multicast snooping [Bilir+, ISCA ’99]
Owner Prediction [Acacio+, SC ’02]
Producer-Consumer sharing [Cheng+, HPCA ’07]
Virtual Circuit Tree Multicast [Jerger+, ISCA ’08]
Bandwidth Adaptive Snooping [Martin+, HPCA ’02]
Embedded ring snooping
Uncorq [Strauss+, MICRO ’07]

Conclusion
PATCH
  Directory protocol foundation
  Fast sharing? Direct requests
  Safety? Token counting
  Forward progress? Token tenure
    Broadcast-free
  Retain scaling of directory? Best-effort delivery
Resulting properties
  One-size-fits-all
  Opportunistically uses bandwidth for performance
  Yet scales no worse than directory