Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.

Slides:

Advertisements

Similar presentations

Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors Ali ShafieeNarges Shahidi Amirali Baniasadi Sharif University of Technology.

Advertisements

Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison.

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy J. Zebchuk, E. Safi, and A. Moshovos.

Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.

A KTEC Center of Excellence 1 Cooperative Caching for Chip Multiprocessors Jichuan Chang and Gurindar S. Sohi University of Wisconsin-Madison.

Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

Zhongkai Chen 3/25/2010. Jinglei Wang; Yibo Xue; Haixia Wang; Dongsheng Wang Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China This paper.

Circuit-Switched Coherence Natalie Enright Jerger*, Li-Shiuan Peh +, Mikko Lipasti* *University of Wisconsin - Madison + Princeton University 2 nd IEEE.

University of Utah1 Interconnect-Aware Coherence Protocols for Chip Multiprocessors Liqun Cheng Naveen Muralimanohar Karthik Ramani Rajeev Balasubramonian.

CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.

Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.

(C) 2003 Milo Martin Token Coherence: Decoupling Performance and Correctness Milo Martin, Mark Hill, and David Wood Wisconsin Multifacet Project

Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison

Token Tenure: PATCHing Token Counting Using Directory-Based Cache Coherence Arun Raghavan, Colin Blundell, Milo Martin University of Pennsylvania {arraghav,

1 Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.

Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser The Technion – Israel.

1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.

(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.

(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,

1 Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations.

Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University.

Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.

CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

1 Lecture 21: Coherence and Interconnection Networks Papers: Flexible Snooping: Adaptive Filtering and Forwarding in Embedded Ring Multiprocessors, UIUC,

Multiprocessor Cache Coherency

Interactions Between Compression and Prefetching in Chip Multiprocessors Alaa R. Alameldeen* David A. Wood Intel CorporationUniversity of Wisconsin-Madison.

(C) 2005 Multifacet Project Token Coherence: A Framework for Implementing Multiple-CMP Systems Mike Marty 1, Jesse Bingham 2, Mark Hill 1, Alan Hu 2, Milo.

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

Cooperative Caching for Chip Multiprocessors Jichuan Chang Guri Sohi University of Wisconsin-Madison ISCA-33, June 2006.

Déjà Vu Switching for Multiplane NoCs NOCS’12 University of Pittsburgh Ahmed Abousamra Rami MelhemAlex Jones.

QoS Support in High-Speed, Wormhole Routing Networks Mario Gerla, B. Kannan, Bruce Kwan, Prasasth Palanti,Simon Walton.

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos

1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi

Cache Coherence Techniques for Multicore Processors Dissertation Defense Mike Marty 12/19/2007.

A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005.

Virtual Hierarchies to Support Server Consolidation Mike Marty Mark Hill University of Wisconsin-Madison ISCA 2007.

Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo.

CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008 Based on: Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z.

Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan.

Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P.

Token Coherence: Decoupling Performance and Correctness Milo M. D. Martin Mark D. Hill David A. Wood University of Wisconsin-Madison ISCA-30 (2003)

 Intel’s Tara-scale computing project 100 cores, >100 threads Datacenter-on-a-chip  Sun’s Niagara2 (T2) 8 cores, 64 Threads  Key design issues Architecture.

Timestamp snooping: an approach for extending SMPs Milo M. K. Martin et al. Summary by Yitao Duan 3/22/2002.

March University of Utah CS 7698 Token Coherence: Decoupling Performance and Correctness Article by: Martin, Hill & Wood Presented by: Michael Tabet.

Architecture and Design of the AlphaServer GS320 Gharachorloo, et al. (Compaq) Presented by Curt Harting

The University of Adelaide, School of Computer Science

Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.

Presented by: Nick Kirchem Feb 13, 2004

ASR: Adaptive Selective Replication for CMP Caches

Architecture and Design of AlphaServer GS320

A New Coherence Method Using A Multicast Address Network

Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.

Pablo Abad, Pablo Prieto, Valentin Puente, Jose-Angel Gregorio

Multiprocessor Cache Coherency

Cache Memory Presentation I

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

Energy-Efficient Address Translation

CS5102 High Performance Computer Systems Distributed Shared Memory

Reducing Memory Reference Energy with Opportunistic Virtual Caching

Impact of Interconnection Network resources on CMP performance

Improving Multiple-CMP Systems with Token Coherence

Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti

11 – Snooping Cache and Directory Based Multiprocessors

Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP

CS 213 Lecture 11: Multiprocessor 3: Directory Organization

CS 6290 Many-core & Interconnect

Token Coherence: Decoupling Performance and Correctness

Presentation transcript:

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison

Overview Rings a viable interconnect for future CMPs Problem: Ring != Bus for ordering ▫Bus-based snooping coherence not sufficient Solutions: ▫O RDERING -P OINT : establish an ordering point ▫G REEDY -O RDER : greedily order requests ▫R ING -O RDER : complete requests in ring order R ING -O RDER offers and performance

Outline Introduction and Motivation Ring-based Coherence Protocols Application to a CMP Results Conclusion

Future CMPs Bus? Crossbar? Packet-Switched?Ring?

The “Cell” Processor

Ring Interconnect Why?  Short, fast point-to-point links  Fewer (data) ports  Less complex than packet-switched  Simple, distributed arbitration  Exploitable ordering for coherence

Cache Coherence for a Ring

Ring is broadcast and offers ordering Apply existing bus-based snooping protocols? NO! Order properties of ring are different

Ring Order != Bus Order P9P3 P6 P12 A B {A, B} {B, A}

Outline Introduction and Motivation Ring-based Coherence Protocols Application to a CMP Results Conclusion

Snooping Protocols for Rings Assumptions: ▫Unidirectional ring Multiple rings per-address OK ▫Write-back, write-invalidate caches ▫Eager request forwarding e.g., forward message then snoop [Strauss et al. ISCA 2006] Can total bus order be recreated? YES

O RDERING -P OINT Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O S ordering point Store P9 getM (inactive)

O RDERING -P OINT Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O  I S ordering point Store P9 getM own request ordered

O RDERING -P OINT Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O  I S  I ordering point Store P9 getM own request ordered

O RDERING -P OINT Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O  I S  I ordering point Store Data to P9 own request ordered P9 ACK

O RDERING -P OINT Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O  I S  I ordering point Store Data to P9 own request ordered P9 ACK Store P6 getM

O RDERING -P OINT Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 ordering point Data to P6 Store P6 getM Store Complete

Bottom line: O RDERING- P OINT Requests totally ordered + Stable, predictable performance Slow – Requests not active immediately Extra control overhead – N + N/2 hops for request message – N/2 hops for Ack message Can requests be active immediately? YES (e.g., IBM Power4/5)

G REEDY -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O P9 getM S Store P12 response:  I Store

G REEDY -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O Store P12 P9 getM response: ACK  I will send data

G REEDY -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O Store P12 P9 getM response: ACK  I will send data Store P6 getM response:

G REEDY -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 O Store P12  I will send data Store P6 getM response: acked Data to P9

G REEDY -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 Store P12 Store P6 getM response: acked Data to P9 M RETRY

Bottom line: G REEDY -O RDER Average case is fast + Request active immediately Requires combined snoop response ▫Synchronous timing of snoops for efficiency Resorts to unbounded # of retries in conflict ▫Will conditions eventually allow request completion? ▫Probabilistic system (e.g. Ethernet)

Recap Existing Solutions: 1.O RDERING- P OINT Establishes total order Extra latency and control message overhead 2.G REEDY -O RDER Fast in common case Unbounded retries Ideal Solution ▫Fast for average case ▫Stable for worse-case (no retries)

New Approach: R ING -O RDER + Requests complete in order of ring position ▫Fully exploits ring ordering + Initial requests always succeeds ▫No retries, No ordering point ▫Fast, stable, predictable performance Key: Use token counting ▫All tokens to write, one token to read

R ING -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 P9 getM Store P12 = token = priority token

R ING -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 P9 getM Store P12 = token = priority token FurthestDest = P9

R ING -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 Store P12 Store FurthestDest = P9 P6 getM

R ING -O RDER Example P9P3 P6 P10 P11 P1 P2 P4 P5 P7 P8 Store P12 Store Complete FurthestDest = P9 Store Complete

R ING -O RDER Recap Key: Exploit Order of Ring with token counting ▫Requests never race with tokens Furthest Destination field ▫Carried in responses, tracked in MSHRs ▫Determines if tokens need to keep moving Priority token ensures liveness Data satisfies all requestors during traversal

R ING -O RDER vs. Token Coherence Token CoherenceR ING -O RDER Safetytoken counting Liveness retries + persistent requests priority token + ring order DRAM state (bits per block) Log 2 (# tokens)1

Outline Introduction and Motivation Ring-based Coherence Protocols Application to a CMP Results Conclusion

Applying to Baseline CMP

Interfacing with Memory Controllers Problem: When should memory respond? Solution: 1-bit per block of memory ▫Owner bit for O RDERING -P OINT and G REEDY -O RDER ▫Token-count bit for R ING -O RDER All or none tokens Cache the bits in a Memory Interface Cache ▫Eliminates costly DRAM accesses ▫Enable G REEDY -O RDER to meet snoop timing

Outline Introduction and Motivation Ring-based Coherence Protocols Application to a CMP Results ▫Metholodogy ▫Runtime ▫Traffic ▫Performance Stability Conclusion

Methodology Full-system Simulation ▫Virtutech Simics ▫Wisconsin GEMS GPL software Workloads: ▫Commercial: OLTP, Apache, SpecJBB, Zeus ▫Scientific: OMPart, OMPfma3d, OMPmgrid Protocols: ▫O RDERING -P OINT ▫G REEDY -O RDER (called –I DEAL in paper) ▫R ING -O RDER

Simulation Parameters 1/2 SPARC 4GHz 8MB, 16-way 25-cycle bank access 1MB, 4-way 15-cycle data access 64KB I&D, 4-way 2-cycle access

Simulation Parameters 2/2 Memory Interface Cache 128KB, 16-way 256-bits per tag Ring Link: 8-cycles total delay 80-bytes per cycle 275-cycle DRAM access

Normalized Runtime R ING - O RDER is up to 52% faster than O RDERING- P OINT

Ring Bandwidth R ING - O RDER uses up to 34% less bandwidth

G REEDY -O RDER Starvation RETRY #1402 time Processor 3Processor 4Processor 6Processor issue getM RETRY # RETRY # Complete RETRY # ack p7, send data issue getM RETRY # Complete RETRY # ack p3, send data RETRY # issue getM RETRY # Complete ack p7, send data RETRY # issue getM RETRY # Complete ack p3, send data issue getM +70,000 cycles

Retries MAX Retries/Request G REEDY -O RDER R ING -O RDER Apache 100 OLTP 80 SpecJBB 110 Zeus 140 OMPmgrid timed out0 OMPart 290 OMPfma3d 100 R ING - O RDER offers stable, bounded performance

Conclusion Rings a viable interconnect for CMPs Ring != Bus for ordering R ING -O RDER protocol offers best of: ▫O RDERING -P OINT (stable) and, ▫G REEDY -O RDER (fast) P.S. R ING -O RDER requires NO system-wide snoop response ▫Useful for hierarchy of rings

BACKUP SLIDES

Flexible Snooping [Strauss et al. ISCA 2006] Eager vs. Lazy forwarding Key Differences: ▫Targets coherence between bus-based CMPs ▫Logical ring on message-passing interconnect ▫Protocol similar to G REEDY -O RDER Uses a separate combined snoop response message R ING -O RDER also works with logical ring ▫Possible to extend protocol to send data off the ring Lazy vs. Eager Forwarding applies to R ING -O RDER ▫Synergistic fit to reduce snoop power