Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.


Overview
Rings a viable interconnect for future CMPs
Problem: Ring != Bus for ordering
▫Bus-based snooping coherence not sufficient
Solutions:
▫ORDERING-POINT: establish an ordering point
▫GREEDY-ORDER: greedily order requests
▫RING-ORDER: complete requests in ring order
RING-ORDER offers both stability and performance

Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

Future CMPs
Bus? Crossbar? Packet-Switched? Ring?

The “Cell” Processor

Ring Interconnect
Why?
▫Short, fast point-to-point links
▫Fewer (data) ports
▫Less complex than packet-switched
▫Simple, distributed arbitration
▫Exploitable ordering for coherence

Cache Coherence for a Ring

Ring is broadcast and offers ordering
Apply existing bus-based snooping protocols?
NO! Order properties of ring are different

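Ring Order != Bus Order
[Figure: ring with nodes P3, P6, P9 and P12; two messages A and B are observed in the order {A, B} at one point on the ring and {B, A} at another.]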
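The difference from a bus can be made concrete with a small sketch. It assumes a unidirectional 12-node ring, one time step per hop, and two messages injected at the same instant. The injection points (P9 for A, P3 for B) and the function names are illustrative assumptions, not taken from the slides.

```python
N = 12  # hypothetical ring size: nodes P1..P12


def hops(src, dst, n=N):
    """Hops from src to dst when messages travel in one ring direction."""
    return (dst - src) % n


def arrival_order(observer, sources):
    """Order in which `observer` sees messages injected simultaneously.

    `sources` maps a message name to the node that injects it. Each hop
    takes one time step, so fewer hops means earlier arrival.
    """
    return sorted(sources, key=lambda msg: hops(sources[msg], observer))


sources = {"A": 9, "B": 3}                  # assumed injection points
order_at_p12 = arrival_order(12, sources)   # A: 3 hops, B: 9 hops
order_at_p6 = arrival_order(6, sources)     # A: 9 hops, B: 3 hops
print(order_at_p12, order_at_p6)
```

On a bus every node would observe the same order. Here the two observers disagree, which is why a bus-based snooping protocol cannot be reused unchanged.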

Snooping Protocols for Rings
Assumptions:
▫Unidirectional ring (multiple rings per-address OK)
▫Write-back, write-invalidate caches
▫Eager request forwarding, e.g., forward message then snoop [Strauss et al. ISCA 2006]
Can total bus order be recreated? YES

ORDERING-POINT Example
[Figure sequence: ring with nodes P1 to P11 and an ordering point; the block starts in state O at one node and S at another. Labels recovered from each frame:]
Frame 1: Store; P9 getM (inactive)
Frame 2: Store; P9 getM, own request ordered; O → I
Frame 3: Store; P9 getM, own request ordered; O → I, S → I
Frame 4: Store; Data to P9; own request ordered; P9 ACK
Frame 5: Store; Data to P9; own request ordered; P9 ACK; Store, P6 getM
Frame 6: Data to P6; Store, P6 getM; Store Complete

Bottom line: ORDERING-POINT
Requests totally ordered
+ Stable, predictable performance
Slow
– Requests not active immediately
Extra control overhead
– N + N/2 hops for request message
– N/2 hops for Ack message
Can requests be active immediately? YES (e.g., IBM Power4/5)
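The control overhead quoted above can be worked through for a concrete ring size. The sketch below only evaluates the slide's two hop counts; the function name and the choice of N = 12 (the ring size drawn in the example figures) are assumptions for illustration.

```python
def ordering_point_control_hops(n):
    """Control-message hops on an n-node ring, using the slide's figures.

    Returns (request_hops, ack_hops, total_hops).
    """
    request_hops = n + n / 2   # N + N/2 hops for the request message
    ack_hops = n / 2           # N/2 hops for the Ack message
    return request_hops, ack_hops, request_hops + ack_hops


request_hops, ack_hops, total_hops = ordering_point_control_hops(12)
print(request_hops, ack_hops, total_hops)
```

Under these figures the control messages for one request cover 2N hops in total, that is, two full laps of the ring.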

GREEDY-ORDER Example
[Figure sequence: ring with nodes P1 to P12; the block starts in state O at one node and S at another, and two nodes have a pending Store. Labels recovered from each frame:]
Frame 1: P9 getM, response: (empty); S → I
Frame 2: P9 getM, response: ACK; O → I, will send data
Frame 3: P9 getM, response: ACK; O → I, will send data; P6 getM, response: (empty)
Frame 4: O → I, will send data; P6 getM, response: acked; Data to P9
Frame 5: P6 getM, response: acked; Data to P9; M; RETRY

Bottom line: GREEDY-ORDER
Average case is fast
+ Request active immediately
Requires combined snoop response
▫Synchronous timing of snoops for efficiency
Resorts to unbounded # of retries in conflict
▫Will conditions eventually allow request completion?
▫Probabilistic system (e.g. Ethernet)
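The "probabilistic system" point can be illustrated with a toy model. This is not the protocol itself: it assumes each attempt independently loses to a competing request with a fixed probability, which is a deliberate simplification, and every name in it is hypothetical.

```python
import random


def retries_until_success(conflict_probability, rng, max_attempts=100_000):
    """Count the retries one request needs under a toy contention model.

    Assumption: each attempt loses to a competing request with
    probability `conflict_probability` and must then be reissued.
    Returns None if the request never wins within `max_attempts`.
    """
    for retries in range(max_attempts):
        if rng.random() >= conflict_probability:
            return retries
    return None  # the cap is ours; the model itself imposes no bound


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [retries_until_success(0.5, rng) for _ in range(10_000)]
print("mean retries:", sum(samples) / len(samples))
print("worst case seen:", max(samples))
```

The mean stays small, but nothing in the model caps the worst case, which matches the slide's concern that completion is only probable rather than guaranteed.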

Recap
Existing Solutions:
1. ORDERING-POINT
  Establishes total order
  Extra latency and control message overhead
2. GREEDY-ORDER
  Fast in common case
  Unbounded retries
Ideal Solution
▫Fast for average case
▫Stable for worst case (no retries)

New Approach: RING-ORDER
+ Requests complete in order of ring position
▫Fully exploits ring ordering
+ Initial request always succeeds
▫No retries, no ordering point
▫Fast, stable, predictable performance
Key: Use token counting
▫All tokens to write, one token to read
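The token-counting rule (all tokens to write, one token to read) can be sketched as a permission check. The token total of 4, the class and the function names are assumptions made for illustration. The sketch covers only the safety rule, not ring ordering or the priority token.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # hypothetical number of tokens per block


@dataclass
class Holder:
    """A cache or memory holding some of a block's tokens."""
    tokens: int = 0

    def can_read(self):
        return self.tokens >= 1             # one token to read

    def can_write(self, total=TOTAL_TOKENS):
        return self.tokens == total         # all tokens to write


def transfer(src, dst, count):
    """Move tokens between holders; tokens are never created or destroyed."""
    if not 0 <= count <= src.tokens:
        raise ValueError("cannot send tokens the source does not hold")
    src.tokens -= count
    dst.tokens += count


memory, p9, p6 = Holder(TOTAL_TOKENS), Holder(), Holder()
transfer(memory, p9, 1)                 # P9 obtains one token: it may read
reader_ok = p9.can_read() and not p9.can_write()
transfer(memory, p6, 3)                 # P6 collects the rest ...
transfer(p9, p6, 1)                     # ... including P9's token
writer_ok = p6.can_write() and not p9.can_read()
print(reader_ok, writer_ok)
```

Because the total is conserved, a writer holding every token implies that no other holder can still read the block.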

RING-ORDER Example
[Figure sequence: ring with nodes P1 to P12; legend: token, priority token. Labels recovered from each frame:]
Frame 1: Store; P9 getM
Frame 2: Store; P9 getM; FurthestDest = P9
Frame 3: Store; Store; FurthestDest = P9; P6 getM
Frame 4: Store Complete; FurthestDest = P9; Store Complete

RING-ORDER Recap
Key: Exploit Order of Ring with token counting
▫Requests never race with tokens
Furthest Destination field
▫Carried in responses, tracked in MSHRs
▫Determines if tokens need to keep moving
Priority token ensures liveness
Data satisfies all requestors during traversal

RING-ORDER vs. Token Coherence
                             Token Coherence                 RING-ORDER
Safety                       token counting                  token counting
Liveness                     retries + persistent requests   priority token + ring order
DRAM state (bits per block)  log2(# tokens)                  1
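The DRAM-state comparison can be evaluated for a few token counts. The sketch applies the log2(# tokens) entry as written, rounded up to whole bits. The token counts tried and the names used are assumptions for illustration.

```python
import math

RING_ORDER_DRAM_BITS = 1  # one bit per block, per the comparison above


def token_coherence_dram_bits(num_tokens):
    """Bits per block from the log2(# tokens) entry, rounded up."""
    return math.ceil(math.log2(num_tokens))


for num_tokens in (4, 16, 64):   # hypothetical token counts
    print(num_tokens, token_coherence_dram_bits(num_tokens),
          RING_ORDER_DRAM_BITS)
```

The per-block cost for Token Coherence grows with the number of tokens, while the RING-ORDER cost stays at one bit.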

Applying to Baseline CMP

Interfacing with Memory Controllers
Problem: When should memory respond?
Solution: 1-bit per block of memory
▫Owner bit for ORDERING-POINT and GREEDY-ORDER
▫Token-count bit for RING-ORDER (all or none tokens)
Cache the bits in a Memory Interface Cache
▫Eliminates costly DRAM accesses
▫Enables GREEDY-ORDER to meet snoop timing
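A minimal sketch of the idea follows, assuming the per-block bit is stored alongside the data in DRAM and a small LRU structure caches recently used bits. The class, its capacity, the LRU policy and the default for untouched blocks are all assumptions. The slides give only the cache's size and associativity.

```python
from collections import OrderedDict


class MemoryInterfaceCache:
    """Toy LRU cache of the one-bit-per-block state kept for memory."""

    def __init__(self, capacity, dram_bits):
        self.capacity = capacity
        self.dram_bits = dram_bits     # backing store: block address -> bit
        self.cached = OrderedDict()    # most recently used entry is last
        self.dram_accesses = 0

    def lookup(self, block):
        if block in self.cached:
            self.cached.move_to_end(block)
            return self.cached[block]
        self.dram_accesses += 1        # miss: the bit must come from DRAM
        # Assumed default: memory owns any block it has no record of.
        bit = self.dram_bits.get(block, True)
        self.cached[block] = bit
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)   # evict least recently used
        return bit


def memory_should_respond(mic, block):
    """Memory responds only if its bit is set (owner, or holds all tokens)."""
    return mic.lookup(block)


mic = MemoryInterfaceCache(capacity=2, dram_bits={0x40: False})
first = memory_should_respond(mic, 0x40)    # a cache owns this block
second = memory_should_respond(mic, 0x80)   # memory owns this block
third = memory_should_respond(mic, 0x40)    # hit: no DRAM access needed
print(first, second, third, mic.dram_accesses)
```

The same lookup serves either reading of the bit: an owner bit for ORDERING-POINT and GREEDY-ORDER, or an all-or-none token bit for RING-ORDER.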

Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
▫Methodology
▫Runtime
▫Traffic
▫Performance Stability
Conclusion

Methodology
Full-system Simulation
▫Virtutech Simics
▫Wisconsin GEMS (GPL software, http://www.cs.wisc.edu/gems)
Workloads:
▫Commercial: OLTP, Apache, SpecJBB, Zeus
▫Scientific: OMPart, OMPfma3d, OMPmgrid
Protocols:
▫ORDERING-POINT
▫GREEDY-ORDER (called –IDEAL in paper)
▫RING-ORDER

Simulation Parameters 1/2
[Figure labels:]
SPARC, 4GHz
8MB, 16-way, 25-cycle bank access
1MB, 4-way, 15-cycle data access
64KB I&D, 4-way, 2-cycle access

Simulation Parameters 2/2
Memory Interface Cache: 128KB, 16-way, 256-bits per tag
Ring Link: 8-cycles total delay, 80-bytes per cycle
275-cycle DRAM access

Normalized Runtime
RING-ORDER is up to 52% faster than ORDERING-POINT

Ring Bandwidth
RING-ORDER uses up to 34% less bandwidth

GREEDY-ORDER Starvation
[Trace excerpt. Columns in the original: time, Processor 3, Processor 4, Processor 6, Processor 7; the processor column of each event was lost in extraction. Annotations on the slide: "RETRY #1402", "issue getM", "+70,000 cycles".]
time        event
631597033   issue getM
......045   RETRY #2
......059   RETRY #10
......081   Complete
......083   RETRY #1
......087   ack p7, send data
......111   issue getM
......116   RETRY #11
......127   Complete
......140   RETRY #2
......148   ack p3, send data
......161   RETRY #1
......180   issue getM
......197   RETRY #3
......198   Complete
......205   ack p7, send data
......218   RETRY #2
......237   issue getM
......254   RETRY #4
......255   Complete
......262   ack p3, send data

Retries
MAX Retries/Request
            GREEDY-ORDER   RING-ORDER
Apache      10             0
OLTP        8              0
SpecJBB     11             0
Zeus        14             0
OMPmgrid    timed out      0
OMPart      29             0
OMPfma3d    10             0
RING-ORDER offers stable, bounded performance

Conclusion
Rings a viable interconnect for CMPs
Ring != Bus for ordering
RING-ORDER protocol offers best of:
▫ORDERING-POINT (stable) and,
▫GREEDY-ORDER (fast)
P.S. RING-ORDER requires NO system-wide snoop response
▫Useful for hierarchy of rings

BACKUP SLIDES

Flexible Snooping [Strauss et al. ISCA 2006]
Eager vs. Lazy forwarding
Key Differences:
▫Targets coherence between bus-based CMPs
▫Logical ring on message-passing interconnect
▫Protocol similar to GREEDY-ORDER (uses a separate combined snoop response message)
RING-ORDER also works with logical ring
▫Possible to extend protocol to send data off the ring
Lazy vs. Eager Forwarding applies to RING-ORDER
▫Synergistic fit to reduce snoop power
