A Systematic Methodology to Develop Resilient Cache Coherence Protocols Konstantinos Aisopos (Princeton, MIT) Li-Shiuan Peh (MIT)

Presentation transcript:

A Systematic Methodology to Develop Resilient Cache Coherence Protocols Konstantinos Aisopos (Princeton, MIT) Li-Shiuan Peh (MIT)

Motivation
The CMP era is here, enabled by aggressive transistor scaling. Shrinking transistor dimensions lead to unreliable silicon (10K-100K FITs, i.e., errors occurring on the order of months apart) [1,2].
[Figure: a tiled CMP; each tile contains a core (P), private cache (P$), shared cache slice (S$), cache controller (CC), and network interface (NIC), connected by a mesh of routers (R).]
[1] R. Bauman (TI), IEEE Design & Test of Computers, vol. 22 (3), 2005
[2] J. Graham (MoSys), EE Times, 2002

Motivation
With unreliable silicon, the loss of a single coherence message can deadlock the system.
Goal: a resilient cache coherence protocol.
[Figure: the same tiled CMP; a data request is dropped by a router in the mesh, stalling the transaction.]

Outline Motivation Methodology – Walkthrough: a resilient transaction – Defining resilience properties – Enforcing resilience properties Evaluation – Overhead – Performance Conclusions

Walkthrough Example: a resilient transaction
1. initiator sends request (M) to the directory
2. directory forwards the request to the sharers
3. sharers invalidate their copy and acknowledge
4. request completes and the initiator sends an unblock to the directory
5. directory updates its sharing vector and may now process succeeding requests
[Diagram: requestor R, sharers S1 and S2, and the directory exchanging the request (M), invalidation acks, and the unblock message; the directory's sharing vector moves from S{R,S1,S2} to M.]
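The following is a minimal Python sketch of the baseline transaction above. The class and message names (Requestor, Directory, Sharer, "request_M", "unblock") and the direct method calls standing in for the network are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the baseline GetM transaction (steps 1-5 above).
# All names (Requestor, Directory, Sharer, message strings) are illustrative.

class Sharer:
    def __init__(self, name):
        self.name, self.state = name, "S"

    def handle_request_M(self, requestor):
        self.state = "I"                      # step 3: invalidate local copy
        return ("ack", self.name, requestor)  # ...and acknowledge

class Directory:
    def __init__(self, sharers):
        self.sharers, self.state = set(sharers), "S"

    def handle_request_M(self, requestor):
        self.state = "B"                      # busy until the unblock arrives
        # step 2: forward the request to every sharer
        return [("request_M", requestor, s) for s in self.sharers]

    def handle_unblock(self, requestor):
        self.state = "M"                      # step 5: record the new owner
        self.sharers = {requestor}

class Requestor:
    def __init__(self, name):
        self.name, self.state, self.acks = name, "I", 0

    def issue_request(self):
        self.state = "IM"                     # step 1: transient, waiting for acks
        return ("request_M", self.name, "dir")

    def handle_ack(self, expected):
        self.acks += 1
        if self.acks == expected:             # step 4: all sharers invalidated
            self.state = "M"
            return ("unblock", self.name, "dir")

# Drive one transaction end to end:
d, r = Directory({"S1", "S2"}), Requestor("R")
sharers = {"S1": Sharer("S1"), "S2": Sharer("S2")}
r.issue_request()                          # step 1
fwd = d.handle_request_M("R")              # step 2
for _, _, dest in fwd:
    sharers[dest].handle_request_M("R")    # step 3
    r.handle_ack(expected=len(fwd))        # step 4 fires on the last ack
d.handle_unblock("R")                      # step 5
assert r.state == "M" and d.sharers == {"R"}
```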

Walkthrough Example: a resilient transaction
1. initiator sends request (M) to the directory
2. the request is lost
3. initiator resends the request after a timeout
4. directory forwards the request to the sharers
(…transaction continues identically as before)
[Diagram: the request (M) from R is dropped in the network before reaching the directory and is re-sent after the timeout expires.]
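A sketch of the timeout-driven retransmission in step 3; the MSHR field names, the per-cycle tick interface, and the timeout value are assumptions made for illustration.

```python
# Sketch: the requestor keeps the outstanding request in an MSHR entry and
# resends it if the transaction makes no progress within a timeout window.
# Field names and the cycle-based interface are illustrative assumptions.

TIMEOUT_CYCLES = 1 << 13      # assumes the 13-bit timer (overhead slide) bounds the window

class MSHREntry:
    def __init__(self, address, request_msg):
        self.address = address
        self.request_msg = request_msg    # retained so the request can be regenerated
        self.timer = 0

    def tick(self, send):
        """Called every cycle while the transaction is outstanding."""
        self.timer += 1
        if self.timer >= TIMEOUT_CYCLES:
            send(self.request_msg)        # step 3: resend the (possibly lost) request
            self.timer = 0                # restart the timeout window

    def progress(self):
        """Called whenever a message belonging to this transaction arrives."""
        self.timer = 0

# Usage: the resend fires once no progress has been seen for TIMEOUT_CYCLES cycles.
sent = []
entry = MSHREntry(address=0x40, request_msg=("request_M", "R", "dir"))
for _ in range(TIMEOUT_CYCLES):
    entry.tick(sent.append)
assert sent == [("request_M", "R", "dir")]
```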

Walkthrough Example: a resilient transaction (duplicate request)
A resent request may arrive at nodes that already processed the original, i.e., as a duplicate:
1. initiator resends its request
2. directory forwards the request to the sharers (again)
3. sharers acknowledge (again)
(…transaction completes identically as before)
To tolerate a duplicate request, each node must (1) transit to the same state and (2) generate the same messages as it did for the original request.
[Diagram: the duplicate request (M) replays the original exchange among R, S1, S2, and the directory; a sharer that is already in I re-sends its ack, and the busy directory re-forwards the request.]
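A sketch of requirements (1) and (2) at the directory: the duplicate request re-takes the same transition and regenerates the same forwarded messages. The class shape and message tuples are illustrative assumptions.

```python
# Sketch: the directory tolerates a duplicate request by staying in the same
# state (busy, for the same transaction) and regenerating the same forwarded
# messages. Names and structure are illustrative.

class Directory:
    def __init__(self, sharers):
        self.sharers, self.state, self.pending = set(sharers), "S", None

    def handle_request_M(self, requestor):
        if self.pending != requestor:        # original request
            self.state, self.pending = "B", requestor
        # (1) same state (B) and (2) same forwarded messages, original or duplicate
        return sorted(("request_M", requestor, s) for s in self.sharers)

d = Directory({"S1", "S2"})
first = d.handle_request_M("R")
dup = d.handle_request_M("R")    # the resent request arrives as a duplicate
assert first == dup and d.state == "B"
```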

Outline Motivation Methodology – Walkthrough: a resilient transaction – Defining resilience properties – Enforcing resilience properties Evaluation – Overhead – Performance Conclusions

Defining the Resilience Properties
A message loss suspends the transaction; the requestor regenerates its request after a timeout. For the regenerated request to be safe, every node involved must make the same state transition and generate the same outgoing messages as it did for the original request.
[Diagram: a request/response exchange between the requestor R and a remote node, annotated with "same state transition, same outgoing messages" at both ends.]

Defining the Resilience Properties
Property 1: the initiator remains transient throughout the transaction
Property 2: replicated messages roll back to the same earlier state
Property 3: retain the information needed to regenerate messages
[Diagram: a transaction FSM that moves from a stable state through transient states (via msgA, msgB) back to a stable state after the last message; the three properties constrain these transitions.]
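As a sketch of what Property 1 demands, the check below runs over an abstract transition table; the table encoding, the state names, and the 'Done' event are illustrative assumptions rather than the paper's tooling.

```python
# Sketch: encode an initiator FSM as {(state, incoming_msg): (next_state, outgoing_msgs)}
# and check Property 1: the initiator may only move from a transient state to a
# stable state when the final 'Done' acknowledgement arrives, so every message of
# the transaction can still be resent until then.

def violates_property1(fsm, transient, stable):
    return [(s, msg, nxt)
            for (s, msg), (nxt, _out) in fsm.items()
            if s in transient and nxt in stable and msg != "Done"]

# Example (hypothetical MESI-like initiator states):
fsm = {
    ("IM", "data+acks"): ("Md", ["unblock"]),   # stays transient, waits for done
    ("Md", "Done"):      ("M",  []),            # becomes stable only on Done
}
assert violates_property1(fsm, transient={"IM", "Md"}, stable={"M", "I"}) == []
```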

Outline Motivation Methodology – Walkthrough: a resilient transaction – Defining resilience properties – Enforcing resilience properties Evaluation – Overhead – Performance Conclusions

Enforcing Property 1
The initiator must remain transient throughout a transaction, so that it is able to resend lost messages.
Counter-example: the initiator sends the unblock and immediately transitions to a stable state; if the unblock is lost, the now-stable initiator cannot resend it.
Enforcement: detect every outgoing message that transits the initiator to a stable state; replace that stable state with a transient state, and only become stable once a 'done' acknowledgement (e.g., from the directory) arrives.
[Diagram: the initiator's FSM before and after enforcement; the directory responds with 'done' after receiving the unblock.]
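A sketch of the enforcement step as a mechanical FSM transformation; the transition-table encoding and the generated 'waiting done' state names (e.g., Md) are assumptions, though the naming mirrors the overhead table later in the deck.

```python
# Sketch: for every transition whose outgoing message would leave the initiator
# in a stable state, substitute a 'waiting done' transient state and add a
# transition that reaches the original stable state only when 'Done' arrives.

def enforce_property1(fsm, stable):
    resilient = {}
    for (state, msg), (nxt, out) in fsm.items():
        if nxt in stable and out:          # outgoing msg would end the transaction
            wait = nxt + "d"               # e.g. M -> Md ("M, waiting done")
            resilient[(state, msg)] = (wait, out)
            resilient[(wait, "Done")] = (nxt, [])
        else:
            resilient[(state, msg)] = (nxt, out)
    return resilient

base = {("IM", "data+acks"): ("M", ["unblock"])}       # initiator goes stable at once
print(enforce_property1(base, stable={"M", "E", "S", "I"}))
# {('IM', 'data+acks'): ('Md', ['unblock']), ('Md', 'Done'): ('M', [])}
```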

Enforcing Property 2
A replicated message must roll the node back to the same earlier state that the original message transitioned it to, so the duplicate is handled identically.
Problem: if two FSM branches (through states T1 and T2) merge into the same state TM, a replicated msgA arriving at TM cannot tell which branch it belongs to (T1 or T2?).
Enforcement: disassociate the branches after the merging point, e.g., by splitting TM into TM1 and TM2, one per branch.
[Diagram: the two branches reaching a common state TM before enforcement, and the split states TM1/TM2 after enforcement.]
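A sketch of branch disassociation over the same assumed transition-table encoding; the splitting rule and the generated names TM1/TM2 are illustrative.

```python
# Sketch: if two FSM branches merge into the same state, split that state so each
# branch keeps its own copy; a replicated message then maps to a unique
# predecessor state and can roll back unambiguously.

def disassociate_branches(fsm):
    preds = {}                                        # merged state -> incoming edges
    for (state, msg), (nxt, out) in fsm.items():
        preds.setdefault(nxt, []).append((state, msg, out))

    resilient = dict(fsm)
    for merged, incoming in preds.items():
        if len(incoming) < 2:
            continue                                  # not a merge point
        for i, (state, msg, out) in enumerate(incoming, start=1):
            split = f"{merged}{i}"                    # e.g. TM -> TM1, TM2
            resilient[(state, msg)] = (split, out)
            for (s, m), (n, o) in fsm.items():        # copies inherit TM's transitions
                if s == merged:
                    resilient[(split, m)] = (n, o)
        for (s, m) in list(fsm):                      # the merged state itself goes away
            if s == merged:
                resilient.pop((s, m), None)
    return resilient

# T1 --msgA--> TM and T2 --msgA--> TM become T1 -> TM1 and T2 -> TM2:
base = {("T1", "msgA"): ("TM", []), ("T2", "msgA"): ("TM", []),
        ("TM", "msgB"): ("Stable", [])}
rfsm = disassociate_branches(base)
assert rfsm[("T1", "msgA")] == ("TM1", []) and rfsm[("T2", "msgA")] == ("TM2", [])
assert rfsm[("TM1", "msgB")] == ("Stable", []) and ("TM", "msgB") not in rfsm
```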

Enforcing Property 3
Retain the information needed to regenerate every outgoing message, in case a replicated request is received.
Example: a sharer that supplies the unique copy of the data is invalidated, but moves to a transient state (TI) and retains the unique data until it receives permission to discard it; a replicated request (M) can therefore still be answered with the same data response.
[Diagram: requestor R, the directory, and the sharer holding the unique data; the sharer sends its invalidate ack but retains the data until 'permission' arrives.]
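A sketch of the retained-data idea: the method and message names are assumptions; only the states (M, TI, I) and the 'permission' message follow the slide.

```python
# Sketch: an owner sharer acknowledges an invalidation but retains the unique
# data in a transient state (TI) until it receives permission to discard it, so
# a replicated forwarded request gets the exact same data response.

class OwnerSharer:
    def __init__(self, data):
        self.state, self.data = "M", data        # holds the unique copy

    def handle_forwarded_request(self, requestor):
        self.state = "TI"                         # invalidated, data retained
        return ("data", self.data, requestor)     # response can be regenerated

    def handle_permission(self):
        self.state, self.data = "I", None         # only now drop the unique data

o = OwnerSharer(data=0xCAFE)
first = o.handle_forwarded_request("R")
again = o.handle_forwarded_request("R")           # replicated request
assert first == again and o.state == "TI"
o.handle_permission()
assert o.state == "I"
```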

Outline Motivation Methodology – Walkthrough: a resilient transaction – Defining resilience properties – Enforcing resilience properties Evaluation – Overhead – Performance Conclusions

Evaluation: Overhead

Directory-based protocol (static directory node, MESI): 9 base states grow to 17 resilient states (4 to 5 state bits).
  base stable states: Modified, Exclusive, Shared, Invalid
  base transient states: IM (I→M), IS (I→S), SM (S→M), ISI (IS→I), MI (M→I)
  added resilient states: Md (M, waiting done), Ed (E, waiting done), Sd (S, waiting done), Id (I, waiting done), Sp (S, waiting permission), Ip (I, waiting permission), Ma (M, waiting ack), Sa (S, waiting ack)

Broadcast-based protocol (AMD Hammer, MOESI): 12 base cache states grow to 22 resilient states (4 to 5 state bits).
  base stable states: Modified, Owned, Exclusive, Shared, Invalid
  base transient states: IM (I→M), IS (I→S), SM (S→M), SE (S→E), SS (S→S), OM (O→M), WB req
  added resilient states: Md (M, waiting done), Ed (E, waiting done), Sd (S, waiting done), Id (I, waiting done), MId (MI, waiting done), Sp (S, waiting permission), Ip (I, waiting permission), Ma (M, waiting ack), Ea (E, waiting ack), Sa (S, waiting ack)

No state was introduced into the critical path of serving a request.
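For reference, the resilient directory-protocol cache states can be written down as a Python enum; the enum packaging and the bit-count arithmetic are mine, the state names come from the table above.

```python
# Sketch: the MESI cache states after applying the methodology (directory protocol).
# Grouping into enums is illustrative packaging; the names come from the table above.
from enum import Enum, auto

class BaseState(Enum):
    M = auto(); E = auto(); S = auto(); I = auto()                      # stable
    IM = auto(); IS = auto(); SM = auto(); ISI = auto(); MI = auto()    # transient

class AddedResilientState(Enum):
    Md = auto(); Ed = auto(); Sd = auto(); Id = auto()                  # waiting for 'done'
    Sp = auto(); Ip = auto()                                            # waiting for 'permission'
    Ma = auto(); Sa = auto()                                            # waiting for acks

TOTAL_STATES = len(BaseState) + len(AddedResilientState)    # 9 + 8 = 17
STATE_BITS = (TOTAL_STATES - 1).bit_length()                # ceil(log2(17)) = 5 bits
assert (TOTAL_STATES, STATE_BITS) == (17, 5)
```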

Evaluation: Overhead
Each Miss Status Holding Register (MSHR) entry (an MSHR holds 4-32 entries) already stores the PC, address, requestor, flags, and state; resilience adds per entry:
  timer: 13 bits (counts 0 to 2^13)
  state: 1 bit
  response bitvector: 64 bits
  transaction ID: 6 bits
  → about 11 bytes per entry
Total storage overhead: < 0.5 KB / core (worst case: 2 KB / core) (*)
(*) assuming a 64-node CMP with in-order cores
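The quoted figure can be sanity-checked with a little arithmetic; the rounding and the choice of the 32-entry upper bound are my assumptions.

```python
# Sanity check of the per-core MSHR overhead; the added-field widths come from
# the list above, the rounding and 32-entry upper bound are assumptions.
added_bits = 13 + 1 + 64 + 6            # timer + state + response bitvector + trans ID
bytes_per_entry = -(-added_bits // 8)   # ceil(84 / 8) = 11 bytes, as quoted
max_in_order_entries = 32               # upper end of the 4-32 entry range
overhead_bytes = bytes_per_entry * max_in_order_entries
assert overhead_bytes == 352 and overhead_bytes < 512   # < 0.5 KB / core
# (the quoted 2 KB / core worst case would correspond to a larger MSHR,
#  presumably for a more aggressive core; an assumption, not stated on the slide)
```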

Evaluation: Performance
Simulator: Wisconsin Multifacet GEMS

System configuration:
  Processors: in-order SPARC cores
  L1 caches: 64 KB/node, 3 cycles, 4-way, 64-byte blocks
  L2 caches: 1 MB/node, 6 cycles
  Memory: 4 controllers × 1 GB, 160 cycles

Network-on-Chip:
  Topology: 8x8 mesh
  Channels: 64-bit
  VNets: 5
  Routing: XY

Evaluation: Performance (directory protocol)
Metric: runtime overhead vs. the non-resilient baseline (lower is better).
[Chart: overhead across the SPLASH benchmarks (fft, fmm, lu, radix, water-nsq, water-sp, cholesky) and the PARSEC benchmarks (blackscholes, canneal, fluidanimate, swaptions, x264) plus their average; labeled bars range from 1.1% to 11%.]

Evaluation: Performance (broadcast protocol)
Metric: runtime overhead vs. the non-resilient baseline (lower is better).
[Chart: overhead across the same SPLASH and PARSEC benchmarks plus their average; labeled bars range from 0.5% to 56%.]

Outline Motivation Methodology – Walkthrough: a resilient transaction – Defining resilience properties – Enforcing resilience properties Evaluation – Overhead – Performance Conclusions

Conclusions
We have presented a generic methodology that turns a coherence protocol into a resilient coherence protocol by enforcing 3 properties, with:
  minimal hardware overhead (< 2 KB / node)
  small performance overhead
    – directory-based protocol: 1.4% (1 fault / msec)
    – broadcast-based protocol: 2.4% (1 fault / msec)

Thank You! Questions?

BACKUP SLIDES

Why performance overhead?
  transactions last longer => a request may have to wait for outstanding conflicting requests to complete
  data remain in caches for longer (3-way handshake) => longer cache replacement duration
  more messages are injected into the NoC => more network traffic => higher average NoC latency

Transaction Duration
B: baseline protocol, no faults; R: resilient protocol, 1 fault / 10 μsec
L1: transaction served by a sharer's L1; L2: transaction served by the directory (L2)
[Chart: average transaction duration for B vs. R; the resilient protocol lengthens transactions by +12% and +18%.]

Transaction Duration (continued)
[Chart: the same comparison for benchmarks with large working sets and shared data, where transaction duration grows by 11% and 24%.]
Large working sets and shared data => a high number of requests (high traffic); retransmissions can saturate the network.

Network Traffic
[Chart: network traffic under the baseline and resilient protocols, shown for the most congested link and averaged over all links.]

Enforcing the Resilience Properties  A single message type transits to a unique state in every FSM branch P2 … … T1T1 T2T2 msgA … Case 2: identical messages in same branch X Y msgA T count =1 T count =2 ack SM + acks =1 ack SM + acks =2 R request (M) SM + acks =0 … M

Enforcing the Resilience Properties  A single message type transits to a unique state in every FSM branch P2 … … msgA … Case 2: identical messages in same branch X Y msgA T count =1 T count =2 … … X msgA T [XYZ=100] msgA … Y T [XYZ=110]

Enforcing the Resilience Properties  A single message type transits to a unique state in every FSM branch P2 … … msgA … Case 2: identical messages in same branch X Y msgA T count =1 T count =2 … … X msgA T [XYZ=100] msgA … X T [XYZ=100] (duplicate)