Published by Tony Whittum. Modified over 9 years ago.
1
Coherence Ordering for Ring-based Chip Multiprocessors
Mike Marty and Mark D. Hill
University of Wisconsin-Madison
2
Overview
Rings are a viable interconnect for future CMPs
Problem: Ring != Bus for ordering
  ▫ Bus-based snooping coherence is not sufficient
Solutions:
  ▫ ORDERING-POINT: establish an ordering point
  ▫ GREEDY-ORDER: greedily order requests
  ▫ RING-ORDER: complete requests in ring order
RING-ORDER offers both stability and performance
3
Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion
4
Future CMPs
Bus? Crossbar? Packet-Switched? Ring?
5
The “Cell” Processor
6
Ring Interconnect
Why?
  ▫ Short, fast point-to-point links
  ▫ Fewer (data) ports
  ▫ Less complex than packet-switched
  ▫ Simple, distributed arbitration
  ▫ Exploitable ordering for coherence
8
Cache Coherence for a Ring
9
Ring is broadcast and offers ordering
Apply existing bus-based snooping protocols? NO!
Order properties of ring are different
10
Ring Order != Bus Order
[Ring diagram labels: P9, P3, P6, P12; requests A and B; observed orders {A, B} and {B, A}]
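The point of this slide can be illustrated with a minimal sketch. It assumes a unidirectional ring with 12 positions where every message moves one hop per step, and it assumes (hypothetically, since the diagram does not survive in this transcript) that request A is injected at P12 and request B at P6 at the same time. The function name `arrival_order` is illustrative only.

```python
RING_SIZE = 12  # assumed number of ring positions (P1..P12)


def arrival_order(observer, requests):
    """Return request names in the order they reach `observer`.

    `requests` maps an injecting position to a request name. All requests
    are injected at the same time and travel in increasing position order.
    """
    def hops(src):
        # Hops from src to observer on a unidirectional ring.
        return (observer - src) % RING_SIZE

    return [name for src, name in sorted(requests.items(), key=lambda kv: hops(kv[0]))]


# Hypothetical placement: A issued at P12, B issued at P6.
requests = {12: "A", 6: "B"}

order_at_p3 = arrival_order(3, requests)
order_at_p9 = arrival_order(9, requests)
print("P3 sees", order_at_p3)
print("P9 sees", order_at_p9)
```

Under these assumptions P3 and P9 disagree on the order of A and B, which is exactly what a bus would never allow: on a bus every snooper observes the same single order.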
11
Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion
12
Snooping Protocols for Rings
Assumptions:
  ▫ Unidirectional ring
      Multiple rings per-address OK
  ▫ Write-back, write-invalidate caches
  ▫ Eager request forwarding
      e.g., forward message then snoop [Strauss et al. ISCA 2006]
Can total bus order be recreated? YES
13
ORDERING-POINT Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; S; ordering point; Store; P9 getM (inactive)]
14
ORDERING-POINT Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; I; S; ordering point; Store; P9 getM; own request ordered]
15
ORDERING-POINT Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; I; S; I; ordering point; Store; P9 getM; own request ordered]
16
ORDERING-POINT Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; I; S; I; ordering point; Store; Data to P9; own request ordered; P9 ACK]
17
ORDERING-POINT Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; I; S; I; ordering point; Store; Data to P9; own request ordered; P9 ACK; Store; P6 getM]
18
ORDERING-POINT Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; ordering point; Data to P6; Store; P6 getM; Store Complete]
19
Bottom line: ORDERING-POINT
Requests totally ordered
  + Stable, predictable performance
Slow
  – Requests not active immediately
Extra control overhead
  – N + N/2 hops for request message
  – N/2 hops for Ack message
Can requests be active immediately? YES (e.g., IBM Power4/5)
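The hop counts above can be reproduced with a small calculation. This is one possible reading of the slide, not a statement of the authors' model: it assumes a unidirectional ring of N nodes, a request that first travels to the ordering point and then makes one full lap once ordered, and an Ack that travels from the ordering point back to the requester. The function name and N = 12 are illustrative assumptions.

```python
from statistics import mean


def ordering_point_hops(n_nodes, ordering_point=0):
    """Average hop counts over all requesters other than the ordering point."""
    requesters = [p for p in range(n_nodes) if p != ordering_point]

    # Hops from each requester forward to the ordering point.
    to_point = [(ordering_point - p) % n_nodes for p in requesters]
    # Hops from the ordering point forward to each requester.
    from_point = [(p - ordering_point) % n_nodes for p in requesters]

    # Assumed: reach the ordering point, then one full lap as an active request.
    request_hops = mean(to_point) + n_nodes
    # Assumed: the Ack travels from the ordering point to the requester.
    ack_hops = mean(from_point)
    return request_hops, ack_hops


request_hops, ack_hops = ordering_point_hops(12)
print(request_hops, ack_hops)
```

With N = 12 the averages come out to N + N/2 hops for the request and N/2 hops for the Ack, matching the figures on the slide under the stated assumptions.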
20
GREEDY-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; P9 getM; S; Store; P12; response:; I; Store]
21
GREEDY-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; Store; P12; P9 getM; response: ACK; I will send data]
22
GREEDY-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; Store; P12; P9 getM; response: ACK; I will send data; Store; P6 getM; response:]
23
GREEDY-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; O; Store; P12; I will send data; Store; P6 getM; response: acked; Data to P9]
24
GREEDY-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; Store; P12; Store; P6 getM; response: acked; Data to P9; M; RETRY]
25
Bottom line: GREEDY-ORDER
Average case is fast
  + Request active immediately
Requires combined snoop response
  ▫ Synchronous timing of snoops for efficiency
Resorts to unbounded # of retries in conflict
  ▫ Will conditions eventually allow request completion?
  ▫ Probabilistic system (e.g., Ethernet)
26
Recap
Existing Solutions:
1. ORDERING-POINT
     Establishes total order
     Extra latency and control message overhead
2. GREEDY-ORDER
     Fast in common case
     Unbounded retries
Ideal Solution
  ▫ Fast for average case
  ▫ Stable for worst case (no retries)
27
New Approach: RING-ORDER
+ Requests complete in order of ring position
  ▫ Fully exploits ring ordering
+ Initial requests always succeed
  ▫ No retries, no ordering point
  ▫ Fast, stable, predictable performance
Key: Use token counting
  ▫ All tokens to write, one token to read
28
RING-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; P9 getM; Store; P12; legend: token, priority token]
29
RING-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; P9 getM; Store; P12; legend: token, priority token; FurthestDest = P9]
30
RING-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; Store; P12; Store; FurthestDest = P9; P6 getM]
31
RING-ORDER Example
[Ring diagram labels: P9 P3 P6 P10 P11 P1 P2 P4 P5 P7 P8; Store; P12; Store Complete; FurthestDest = P9; Store Complete]
32
RING-ORDER Recap
Key: Exploit order of ring with token counting
  ▫ Requests never race with tokens
Furthest Destination field
  ▫ Carried in responses, tracked in MSHRs
  ▫ Determines if tokens need to keep moving
Priority token ensures liveness
Data satisfies all requestors during traversal
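The token-counting rule behind RING-ORDER (all tokens to write, one token to read) can be sketched as follows. This is a deliberately simplified illustration, not the protocol from the talk: it omits the priority token, the Furthest Destination field, and message timing, and it simply hands tokens to the requester as the ring is walked. `TOTAL_TOKENS`, `CacheLine`, and `collect_for_write` are hypothetical names.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # hypothetical number of tokens per block


@dataclass
class CacheLine:
    tokens: int = 0  # tokens this node currently holds for the block


def can_read(line):
    # One token is enough to read.
    return line.tokens >= 1


def can_write(line):
    # Writing requires every token, so no other node can still read.
    return line.tokens == TOTAL_TOKENS


def collect_for_write(ring, requester):
    """Visit nodes in ring order after `requester` and hand it their tokens."""
    n = len(ring)
    for hop in range(1, n):
        node = ring[(requester + hop) % n]
        ring[requester].tokens += node.tokens
        node.tokens = 0
    return can_write(ring[requester])


# Four nodes; node 1 holds one token and node 3 holds three.
ring = [CacheLine(0), CacheLine(1), CacheLine(0), CacheLine(3)]
readers_before = [i for i, line in enumerate(ring) if can_read(line)]
write_ok = collect_for_write(ring, requester=0)
readers_after = [i for i, line in enumerate(ring) if can_read(line)]
print(readers_before, write_ok, readers_after)
```

The safety property is visible in the token total: it never changes, so once one node holds all tokens no other node can hold a readable copy.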
33
RING-ORDER vs. Token Coherence
                               Token Coherence                  RING-ORDER
Safety                         token counting                   token counting
Liveness                       retries + persistent requests    priority token + ring order
DRAM state (bits per block)    log2(# tokens)                   1
34
Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion
35
Applying to Baseline CMP
36
Interfacing with Memory Controllers
Problem: When should memory respond?
Solution: 1 bit per block of memory
  ▫ Owner bit for ORDERING-POINT and GREEDY-ORDER
  ▫ Token-count bit for RING-ORDER
      All or none tokens
Cache the bits in a Memory Interface Cache
  ▫ Eliminates costly DRAM accesses
  ▫ Enables GREEDY-ORDER to meet snoop timing
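A minimal sketch of the all-or-none token-count bit and the Memory Interface Cache is shown below. It assumes that memory initially holds all tokens for every block, that memory gives up all tokens whenever it responds, and that the bit cache is unbounded; none of these details are stated on the slide. The class and method names are hypothetical.

```python
class MemoryInterface:
    def __init__(self):
        self.dram_bits = {}     # block -> True if memory holds all tokens
        self.bit_cache = {}     # cached copies of those bits (unbounded here)
        self.dram_accesses = 0  # DRAM reads needed just to look up a bit

    def _holds_all_tokens(self, block):
        if block not in self.bit_cache:
            self.dram_accesses += 1
            # Assumption: memory starts out holding all tokens.
            self.bit_cache[block] = self.dram_bits.get(block, True)
        return self.bit_cache[block]

    def _set_bit(self, block, value):
        # Write-through update; write traffic is not counted in this sketch.
        self.bit_cache[block] = value
        self.dram_bits[block] = value

    def on_request(self, block):
        """Return True if memory should respond to this request."""
        if self._holds_all_tokens(block):
            self._set_bit(block, False)  # all tokens leave with the response
            return True
        return False  # some cache holds the tokens, so memory stays silent

    def on_all_tokens_returned(self, block):
        self._set_bit(block, True)


mem = MemoryInterface()
first = mem.on_request(0x40)     # memory holds all tokens: respond
second = mem.on_request(0x40)    # tokens are in a cache: stay silent
mem.on_all_tokens_returned(0x40)
third = mem.on_request(0x40)     # tokens are back: respond again
print(first, second, third, mem.dram_accesses)
```

In this sketch only the first lookup touches DRAM; later decisions are served from the cached bit, which is the effect the slide attributes to the Memory Interface Cache.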
37
Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
  ▫ Methodology
  ▫ Runtime
  ▫ Traffic
  ▫ Performance Stability
Conclusion
38
Methodology
Full-system Simulation
  ▫ Virtutech Simics
  ▫ Wisconsin GEMS
      GPL software: http://www.cs.wisc.edu/gems
Workloads:
  ▫ Commercial: OLTP, Apache, SpecJBB, Zeus
  ▫ Scientific: OMPart, OMPfma3d, OMPmgrid
Protocols:
  ▫ ORDERING-POINT
  ▫ GREEDY-ORDER (called –IDEAL in paper)
  ▫ RING-ORDER
39
Simulation Parameters 1/2
SPARC, 4GHz
8MB, 16-way, 25-cycle bank access
1MB, 4-way, 15-cycle data access
64KB I&D, 4-way, 2-cycle access
40
Simulation Parameters 2/2
Memory Interface Cache: 128KB, 16-way, 256 bits per tag
Ring Link: 8 cycles total delay, 80 bytes per cycle
275-cycle DRAM access
41
Normalized Runtime
RING-ORDER is up to 52% faster than ORDERING-POINT
42
Ring Bandwidth
RING-ORDER uses up to 34% less bandwidth
43
GREEDY-ORDER Starvation
[Timeline figure with columns: time, Processor 3, Processor 4, Processor 6, Processor 7. Events in listed order:]
631597033 issue getM
......045 RETRY #2
......059 RETRY #10
......081 Complete
......083 RETRY #1
......087 ack p7, send data
......111 issue getM
......116 RETRY #11
......127 Complete
......140 RETRY #2
......148 ack p3, send data
......161 RETRY #1
......180 issue getM
......197 RETRY #3
......198 Complete
......205 ack p7, send data
......218 RETRY #2
......237 issue getM
......254 RETRY #4
......255 Complete
......262 ack p3, send data
issue getM
+70,000 cycles
RETRY #1402
44
Retries
MAX Retries/Request
             GREEDY-ORDER    RING-ORDER
Apache       10              0
OLTP         8               0
SpecJBB      11              0
Zeus         14              0
OMPmgrid     timed out       0
OMPart       29              0
OMPfma3d     10              0
RING-ORDER offers stable, bounded performance
45
Conclusion
Rings are a viable interconnect for CMPs
Ring != Bus for ordering
RING-ORDER protocol offers best of:
  ▫ ORDERING-POINT (stable) and,
  ▫ GREEDY-ORDER (fast)
P.S. RING-ORDER requires NO system-wide snoop response
  ▫ Useful for hierarchy of rings
46
BACKUP SLIDES
47
Flexible Snooping [Strauss et al. ISCA 2006]
Eager vs. Lazy forwarding
Key Differences:
  ▫ Targets coherence between bus-based CMPs
  ▫ Logical ring on message-passing interconnect
  ▫ Protocol similar to GREEDY-ORDER
      Uses a separate combined snoop response message
RING-ORDER also works with logical ring
  ▫ Possible to extend protocol to send data off the ring
Lazy vs. Eager Forwarding applies to RING-ORDER
  ▫ Synergistic fit to reduce snoop power