Published by Moris Williams. Modified over 8 years ago.
1
A Systematic Methodology to Develop Resilient Cache Coherence Protocols
Konstantinos Aisopos (Princeton, MIT), Li-Shiuan Peh (MIT)
2
Motivation
- CMP era is here, enabled by aggressive transistor scaling
- shrinking transistor dimensions => unreliable silicon (10K-100K FITs, frequency of errors: months) [1,2]
[Figure: tiled CMP drawn as a mesh of routers (R); tile labels: NIC, P$, S$, P, CC.]
[1] R. Bauman (TI), IEEE Design & Test of Computers, vol. 22 (3), 2005
[2] J. Graham (MoSys), EE Times, 2002
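The FIT figures can be turned into a rough time between errors. The sketch below uses the standard definition of FIT (expected failures per 10^9 device-hours) and assumes, as a simplification not stated on the slide, that the quoted rate applies to the whole chip. The function name and the 730 hours/month constant are illustrative choices.

```python
HOURS_PER_MONTH = 730.0  # approximation: 8760 hours per year / 12


def mean_months_between_failures(fit: float) -> float:
    """Convert a FIT rate (failures per 1e9 device-hours) to months per failure."""
    hours_between_failures = 1e9 / fit
    return hours_between_failures / HOURS_PER_MONTH


for fit in (10_000, 100_000):
    print(f"{fit} FIT -> {mean_months_between_failures(fit):.1f} months between failures")
```

Under that assumption, the upper end of the quoted range corresponds to roughly one error per year of operation per chip.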
3
Motivation
- CMP era is here, enabled by aggressive transistor scaling
- shrinking transistor dimensions => unreliable silicon (10K-100K FITs, frequency of errors: months)
- loss of a single coherence message: deadlock
Goal: resilient cache coherence protocol
[Figure: the same router mesh, with a data request travelling between a requestor and a sharer (S); tile labels: NIC, P$, S$, P, CC.]
4
Outline
- Motivation
- Methodology
  – Walkthrough: a resilient transaction
  – Defining resilience properties
  – Enforcing resilience properties
- Evaluation
  – Overhead
  – Performance
- Conclusions
5
Walkthrough Example: transaction
[Figure: message-sequence diagram between the initiator R, the sharers S1 and S2, and the directory (dir). Messages: request (M), ack, unblock. R: S -> SM -> M. S1, S2: S -> I. dir: S{ } -> B M -> M{ }.]
1. initiator sends request to the directory
2. directory forwards request to the sharers
3. sharers invalidate their copy and acknowledge
4. request completes and initiator sends unblock to the dir
5. dir updates sharing vector and may now process succeeding requests
6
Walkthrough Example: resilient transaction
[Figure: message-sequence diagram (R, S1, S2, dir); R is in SM and its request (M) to the directory is sent twice.]
1. initiator sends request to the directory
2. request is lost
3. initiator resends request after a timeout
4. directory forwards request to the sharers
(…transaction continues identically as before)
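The timeout-and-resend steps above can be sketched as a small retry loop. This is a minimal model, not the protocol's implementation: `send_with_retry`, `lossy_channel`, and the attempt limit are hypothetical names and parameters, and a lost message is modeled as the channel returning `None` once the timeout expires.

```python
def send_with_retry(channel, request, max_attempts=5):
    """Resend `request` until `channel` delivers it.

    `channel(request)` returns None when the message is lost (timeout expires).
    Returns the response and the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        response = channel(request)
        if response is not None:
            return response, attempt
    raise TimeoutError(f"no response after {max_attempts} attempts")


def lossy_channel(drop_first_n):
    """Build a channel that loses the first `drop_first_n` messages."""
    dropped = {"count": 0}

    def deliver(request):
        if dropped["count"] < drop_first_n:
            dropped["count"] += 1
            return None  # message lost
        return ("forwarded to sharers", request)

    return deliver


response, attempts = send_with_retry(lossy_channel(1), "request (M)")
print(response, attempts)
```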
7
Walkthrough Example: resilient transaction
[Figure: message-sequence diagram (R, S1, S2, dir). Messages: request (M), ack. R: S -> SM. dir: S{R,S1,S2} -> B M.]
1. initiator resends its request
8
Walkthrough Example: resilient transaction
[Figure: message-sequence diagram as on the previous slide, next to a directory FSM fragment: S -request (S)-> B S -unblock-> S; S -request (M)-> B M -unblock-> M; a request (M) arriving in B M is marked "?".]
1. initiator resends its request
tolerate a duplicate request:
(1) transit to same state
(2) generate the same messages
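A duplicate-tolerant directory can be sketched as below. This is an illustrative model under stated assumptions: the class and method names are hypothetical, only the S and B M states from the slide are modeled (B M is written `B_M`), and a conflicting request that arrives while the directory is blocked is simply not served here.

```python
class Directory:
    """Toy directory that handles request (M) idempotently."""

    def __init__(self, sharers):
        self.state = "S"
        self.sharers = set(sharers)
        self.requestor = None
        self.forward_to = []

    def on_request_m(self, src):
        if self.state == "S":
            # First copy of the request: block and remember what was forwarded.
            self.state = "B_M"
            self.requestor = src
            self.forward_to = sorted(self.sharers - {src})
        if self.state == "B_M" and src == self.requestor:
            # Original or duplicate: same state, same outgoing messages.
            return [("request (M)", dst) for dst in self.forward_to]
        return []  # conflicting request while blocked: not modeled


d = Directory({"R", "S1", "S2"})
first = d.on_request_m("R")
second = d.on_request_m("R")  # duplicate after a timeout
print(first == second, d.state)
```

Keeping `forward_to` around after the first request is what lets the duplicate regenerate exactly the same messages.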
9
Walkthrough Example: resilient transaction
[Figure: message-sequence diagram (R, S1, S2, dir). Messages: request (M), ack. R: S -> SM. dir: S{R,S1,S2} -> B M.]
1. initiator resends its request
2. directory forwards the request to sharers (again)
10
Walkthrough Example: resilient transaction
[Figure: sharers S1 and S2 each receive request (M) more than once and answer with ack each time; sharer state: S -> I.]
tolerate a duplicate request:
(1) transit to same state
(2) generate the same messages
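The same rule at a sharer can be sketched as a pure transition function. The function name is hypothetical and only the S and I states shown on the slide are covered; the point is that a duplicate request (M) arriving in I still produces an ack.

```python
def sharer_on_request_m(state):
    """Return (next_state, outgoing_messages) for a sharer receiving request (M)."""
    if state not in ("S", "I"):
        raise ValueError(f"state {state!r} is not modeled in this sketch")
    # From S: invalidate the copy and acknowledge.
    # From I (duplicate request): stay invalid and acknowledge again.
    return "I", ["ack"]


state = "S"
trace = []
for _ in range(2):  # the original request, then a duplicate
    state, outgoing = sharer_on_request_m(state)
    trace.append((state, outgoing))
print(trace)
```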
11
Walkthrough Example: resilient transaction
[Figure: message-sequence diagram (R, S1, S2, dir). Messages: request (M), ack. R: S -> SM -> M.]
1. initiator resends its request
2. directory forwards the request to sharers (again)
3. sharers acknowledge (again)
(…transaction completes identically as before)
13
Defining the Resilience Properties
[Figure: requestor R sends a request through a chain of nodes and receives a response.]
- message loss => transaction suspended
- the requestor regenerates its request after timeout
- every node that receives the regenerated request performs the same state transition and sends the same outgoing messages
14
Defining the Resilience Properties
[Figure: requestor R goes stable -> (request) -> transient … -> (last message) -> stable; node X receives msgA and moves to state A; node Y receives msgA and sends msgB.]
- Property 1: initiator remains transient throughout the transaction
- Property 2: replicate msgs roll-back to same earlier state
- Property 3: retain information to regenerate msgs
16
Enforcing Property 1
Property 1: the initiator remains transient throughout a transaction, to be able to resend lost messages
[Figure: initiator timeline: stable -> (request) -> transient … -> (last message) -> stable.]
17
Enforcing Property 1
Property 1: the initiator remains transient throughout a transaction, to be able to resend lost messages
counter-example: on the response the initiator sends unblock to the dir and becomes stable; if the unblock is lost, the initiator cannot resend unblock
[Figure: initiator stable -> (request) -> transient -> (response) -> stable, sending unblock to dir; in the fixed version the final stable state is replaced by a transient state that waits for done.]
Enforcement:
- detect every outgoing message that transits the initiator to stable state
- replace the stable with a transient state, and wait for done
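The two enforcement steps above can be sketched as a rewrite over a transition table. This is a minimal sketch under assumptions: transitions are hypothetical `(state, event, next_state, outgoing_messages)` tuples, the new transient state is named by appending `d` (mirroring names such as Md used in the evaluation), and the resend-on-timeout edge is an illustrative addition. It also assumes each stable state is entered with only one distinct set of outgoing messages.

```python
STABLE = {"M", "E", "S", "I"}


def enforce_property_1(transitions, stable=STABLE):
    """Redirect message-sending transitions into a stable state to a 'waiting done' state."""
    result = []
    added = set()
    for state, event, nxt, msgs in transitions:
        if nxt in stable and msgs:
            wait = nxt + "d"  # e.g. M -> Md (M, waiting done)
            result.append((state, event, wait, msgs))
            if wait not in added:
                added.add(wait)
                # Still transient, so the last message can be resent if it is lost.
                result.append((wait, "timeout", wait, msgs))
                # Only the explicit done message releases the initiator to stable.
                result.append((wait, "done", nxt, []))
        else:
            result.append((state, event, nxt, msgs))
    return result


base = [("SM", "ack", "M", ["unblock"])]
for transition in enforce_property_1(base):
    print(transition)
```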
18
Enforcing Property 2
Property 2: replicate messages roll-back to the earlier state the original message transitioned to
[Figure: msgA moves a node to state A; a replicate msgA returns it to A.]
19
Enforcing Property 2
Property 2: replicate messages roll-back to the earlier state the original message transitioned to
[Figure: two FSM branches leaving S, one through T1 and one through T2, merge at state TM; once in TM, a replicate msgA is ambiguous: T1 or T2? In the fixed FSM the merged state is split into TM1 and TM2.]
- disassociate branches after merging point
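One way to detect the ambiguity described above is to compute, for every state, which states a given message could have led to on some path reaching it. The sketch below does that with a simple fixed-point pass. The edge lists are hypothetical (the slide's exact FSM is not recoverable from the transcript), and the function names are illustrative.

```python
def rollback_targets(edges, msg):
    """Map each state to the set of states `msg` transitioned to on a path reaching it.

    `edges` is a list of (source, label, destination) tuples.
    """
    targets = {}
    changed = True
    while changed:
        changed = False
        for src, label, dst in edges:
            # An edge labeled msg sets the rollback target; other edges inherit it.
            new = {dst} if label == msg else targets.get(src, set())
            if not new <= targets.get(dst, set()):
                targets[dst] = targets.get(dst, set()) | new
                changed = True
    return targets


def ambiguous_states(edges, msg):
    """States in which a replicate `msg` has more than one possible rollback state."""
    return {s for s, t in rollback_targets(edges, msg).items() if len(t) > 1}


merged = [("A", "msgA", "T1"), ("B", "msgA", "T2"), ("T1", "x", "TM"), ("T2", "y", "TM")]
split = [("A", "msgA", "T1"), ("B", "msgA", "T2"), ("T1", "x", "TM1"), ("T2", "y", "TM2")]
print(ambiguous_states(merged, "msgA"), ambiguous_states(split, "msgA"))
```

Splitting the merged state per branch removes the ambiguity because each copy then has exactly one rollback target.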
20
Enforcing Property 3
Property 3: retain info to regenerate every outgoing message, in case a replicate request is received
[Figure: a sharer holding unique data in state M receives request (M) from R via the dir, sends the unique data and moves to I; a second request (M) then arrives at the sharer.]
21
Enforcing Property 3
Property 3: retain info to regenerate every outgoing message, in case a replicate request is received
[Figure: the sharer moves from M to a transient state TI in which it retains the unique data; it moves to I only after receiving invalidate permission, which it answers with invalidate ack. The requestor side shows state TM.]
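The retained-data idea can be sketched as a small state machine for the node that owns the unique copy. This is an illustrative model, not the paper's specification: the class and method names are hypothetical, and only the states M, TI and I and the message names from the slide are used.

```python
class Owner:
    """Toy owner of unique data that can regenerate its data response."""

    def __init__(self, data):
        self.state = "M"
        self.data = data

    def on_request_m(self):
        if self.state == "M":
            self.state = "TI"  # transient: the unique data is retained
        if self.state == "TI":
            # Original or replicate request: the same data message is regenerated.
            return ("unique data", self.data)
        return None  # state I: permission to invalidate was already received

    def on_invalidate_permission(self):
        if self.state == "M":
            raise ValueError("no request (M) has been served yet")
        # Safe to drop the data now; a duplicate permission is acknowledged again.
        self.state = "I"
        self.data = None
        return "invalidate ack"


owner = Owner(data=42)
first = owner.on_request_m()
replicate = owner.on_request_m()
print(first == replicate, owner.state)
print(owner.on_invalidate_permission(), owner.state)
```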
23
Evaluation: Overhead
directory-based protocol (static directory node, MESI): 9 to 17 states (4 to 5 bits)
  base states
    stable: Modified, Exclusive, Shared, Invalid
    transient: IM (I -> M), IS (I -> S), SM (S -> M), ISI (IS -> I), MI (M -> I)
  resilient states
    Md (M, waiting done), Ed (E, waiting done), Sd (S, waiting done), Id (I, waiting done)
    Sp (S, waiting permission), Ip (I, waiting permission), Ma (M, waiting ack), Sa (S, waiting ack)
broadcast-based protocol (AMD Hammer, MOESI): 12 to 22 cache states (4 to 5 bits)
  base states
    stable: Modified, Owned, Exclusive, Shared, Invalid
    transient: IM (I -> M), IS (I -> S), SM (S -> M), SE (S -> E), SS (S -> S), OM (O -> M), WB req
  resilient states
    Md (M, waiting done), Ed (E, waiting done), Sd (S, waiting done), Id (I, waiting done), MId (MI, waiting done)
    Sp (S, waiting permission), Ip (I, waiting permission), Ma (M, waiting ack), Ea (E, waiting ack), Sa (S, waiting ack)
No state was introduced into the critical path of serving a request
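The "4 to 5 bits" figures follow from the number of bits needed to encode each state count. A short check, with an illustrative helper name:

```python
def state_bits(num_states: int) -> int:
    """Minimum number of bits needed to encode `num_states` distinct states."""
    return max(1, (num_states - 1).bit_length())


for label, base, resilient in (("directory", 9, 17), ("broadcast", 12, 22)):
    print(label, state_bits(base), "->", state_bits(resilient), "bits")
```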
24
Evaluation: Overhead
Miss Status Holding Register (MSHR): 4-32 entries
  fields: PC, address, requestor, flags, state
  resilience fields (11 bytes):
    timer (0 to 2^13): 13 bits
    state: 1 bit
    response bitvector: 64 bits
    trans ID: 6 bits
total storage overhead: < 0.5 KB / core (worst-case: 2 KB / core) (*)
(*) assuming a 64-node CMP with in-order cores
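The 11-byte figure and the sub-0.5 KB total can be reproduced from the listed field widths. The sketch assumes the 11 bytes are per MSHR entry and that the total scales linearly with the 4-32 entries; it does not attempt to reproduce the 2 KB worst case, whose assumptions are not given on the slide.

```python
FIELD_BITS = {
    "timer": 13,
    "state": 1,
    "response bitvector": 64,
    "trans ID": 6,
}


def bytes_per_entry(field_bits=FIELD_BITS) -> int:
    """Round the summed field widths up to whole bytes."""
    total_bits = sum(field_bits.values())
    return -(-total_bits // 8)  # ceiling division


def total_overhead_bytes(entries: int) -> int:
    return entries * bytes_per_entry()


print(bytes_per_entry())  # bytes added per entry
print(total_overhead_bytes(4), total_overhead_bytes(32))  # range for 4 to 32 entries
```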
25
Evaluation: Performance
Simulator: Wisconsin Multifacet GEMS
System Configuration
  Processors: in-order SPARC cores
  L1 Caches: 64KB/node, 3 cycles, 4-way, 64Byte blk
  L2 Caches: 1MB/node, 6 cycles
  Memory: 4 controllers * 1GB, 160 cycles
Network-on-Chip
  Topology: 8x8 mesh
  Channels: 64-bit
  VNets: 5
  Routing: XY
26
Evaluation: Performance (directory protocol)
metric: runtime overhead vs. non-resilient baseline (lower is better)
[Figure: bar chart over SPLASH (fft, fmm, lu, radix, water-nsq, water-sp), PARSEC (blackscholes, canneal, fluidanimate, swaptions, x264) and AVERAGE. Value labels on the chart: 7.4%, 11%, 1.4%, 1.8%, 1.1%, 3.5%.]
27
Evaluation: Performance (broadcast protocol)
metric: runtime overhead vs. non-resilient baseline
[Figure: bar chart over SPLASH (fft, fmm, lu, radix, water-nsq, water-sp), PARSEC (blackscholes, canneal, fluidanimate, swaptions, x264) and AVERAGE. Value labels on the chart: 2.4%, 5.1%, 0.5%, 20.4%, 51%, 56%.]
29
Conclusions
We have presented a generic methodology: coherence protocol -> resilient coherence protocol, by enforcing 3 properties
- minimal hardware overhead (< 2KB / node)
- small performance overhead
  – directory-based protocol: 1.4% (1 fault / msec)
  – broadcast-based protocol: 2.4% (1 fault / msec)
30
Thank You! Questions?
31
BACKUP SLIDES
32
Why performance overhead?
- transactions last longer => a request may have to wait for outstanding conflicting requests to complete
- data remain in caches for longer (3-way handshake) => longer cache replacement duration
- more messages are injected in the NoC => more network traffic => higher average NoC latency
33
Transaction Duration
[Figure: transaction duration, baseline vs. resilient; value labels on the chart: +12%, +18%.]
B: baseline protocol, no faults
R: resilient protocol, 1 fault / 10 μsec
L1: transaction served by sharer's L1
L2: transaction served by directory (L2)
34
Transaction Duration
[Figure: transaction duration, baseline vs. resilient; value labels on the chart: 11%, 24%.]
B: baseline protocol, no faults
R: resilient protocol, 1 fault / 10 μsec
L1: transaction served by sharer's L1
L2: transaction served by directory (L2)
large working sets, shared data => high number of requests (high traffic)
(!) retransmissions saturate the network
35
Network Traffic most congested link average over all links
36
Enforcing the Resilience Properties
P2: a single message type transits to a unique state in every FSM branch
Case 2: identical messages in same branch
[Figure: msgA from X moves the receiver to T (count = 1); msgA from Y moves it to T (count = 2). Example: R sends request (M) and enters SM + acks = 0; successive acks lead to SM + acks = 1, SM + acks = 2, … and finally M.]
37
Enforcing the Resilience Properties
P2: a single message type transits to a unique state in every FSM branch
Case 2: identical messages in same branch
[Figure: the counter states T (count = 1), T (count = 2) are replaced by states indexed by sender: msgA from X leads to T [XYZ=100]; msgA from Y then leads to T [XYZ=110].]
38
Enforcing the Resilience Properties
P2: a single message type transits to a unique state in every FSM branch
Case 2: identical messages in same branch
[Figure: with sender-indexed states, msgA from X leads to T [XYZ=100]; a second msgA from X (duplicate) leads to T [XYZ=100] again.]
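The difference between counting acks and recording who sent them can be sketched as follows. The class name and interface are hypothetical; the bit string mirrors the T [XYZ=…] labels on the slides. With a plain counter, a duplicate ack from X would be indistinguishable from a first ack from Y, whereas the per-sender bit vector makes the duplicate land in the same state.

```python
class AckCollector:
    """Track acks per sender so a duplicate ack does not change the state."""

    def __init__(self, expected):
        self.expected = list(expected)
        self.seen = {node: 0 for node in self.expected}

    def on_ack(self, src):
        self.seen[src] = 1  # setting a bit is idempotent
        return self.bits()

    def bits(self):
        return "".join(str(self.seen[node]) for node in self.expected)

    def complete(self):
        return all(self.seen.values())


collector = AckCollector("XYZ")
print(collector.on_ack("X"))  # first ack from X
print(collector.on_ack("X"))  # duplicate from X: same state
print(collector.on_ack("Y"))  # ack from Y
```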
39
[Figure: 8x8 grid of nodes numbered 0-63, with nodes 18, 20 and 26 marked.]