(C) 2002 Milo MartinHPCA, Feb. 2002 Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.

Slides:

Advertisements

Similar presentations

Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin—Madison.

Advertisements

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.

Cache Coherence Mechanisms (Research project) CSCI-5593

1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)

Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.

Circuit-Switched Coherence Natalie Enright Jerger*, Li-Shiuan Peh +, Mikko Lipasti* *University of Wisconsin - Madison + Princeton University 2 nd IEEE.

(C) 2002 Daniel SorinWisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery.

CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.

Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.

(C) 2003 Milo Martin Token Coherence: Decoupling Performance and Correctness Milo Martin, Mark Hill, and David Wood Wisconsin Multifacet Project

Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison

CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.

CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.

(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,

1 E. Bolotin – The Power of Priority, NoCs 2007 The Power of Priority : NoC based Distributed Cache Coherency Evgeny Bolotin, Zvika Guz, Israel Cidon,

Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.

February 11, 2003Ninth International Symposium on High Performance Computer Architecture Memory System Behavior of Java-Based Middleware Martin Karlsson,

Multiprocessor Cache Coherency

Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.

Interactions Between Compression and Prefetching in Chip Multiprocessors Alaa R. Alameldeen* David A. Wood Intel CorporationUniversity of Wisconsin-Madison.

Presented by Deepak Srinivasan Alaa Aladmeldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark Hill and David Wood Computer Sciences.

(C) 2005 Multifacet Project Token Coherence: A Framework for Implementing Multiple-CMP Systems Mike Marty 1, Jesse Bingham 2, Mark Hill 1, Alan Hu 2, Milo.

Cooperative Caching for Chip Multiprocessors Jichuan Chang Guri Sohi University of Wisconsin-Madison ISCA-33, June 2006.

(C) 2003 Mulitfacet ProjectUniversity of Wisconsin-Madison Evaluating a $2M Commercial Server on a $2K PC and Related Challenges Mark D. Hill Multifacet.

Quantifying and Comparing the Impact of Wrong-Path Memory References in Multiple-CMP Systems Ayse Yilmazer, University of Rhode Island Resit Sendag, University.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.

Simulating a $2M Commercial Server on a $2K PC Alaa R. Alameldeen, Milo M.K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill.

Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.

(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.

Using Prediction to Accelerate Coherence Protocols Shubu Mukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Compaq Computer.

Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.

Moshovos © 1 RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence Andreas Moshovos

Virtual Hierarchies to Support Server Consolidation Mike Marty Mark Hill University of Wisconsin-Madison ISCA 2007.

MIAO ZHOU, YU DU, BRUCE CHILDERS, RAMI MELHEM, DANIEL MOSSÉ UNIVERSITY OF PITTSBURGH Writeback-Aware Bandwidth Partitioning for Multi-core Systems with.

CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008 Based on: Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z.

Token Coherence: Decoupling Performance and Correctness Milo M. D. Martin Mark D. Hill David A. Wood University of Wisconsin-Madison ISCA-30 (2003)

ECE/CS 552: Shared Memory © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith.

Department of Computer Science 6 th Annual Austin CAS Conference – 24 February 2005 Ricardo Portillo, Diana Villa, Patricia J. Teller The University of.

“An Evaluation of Directory Schemes for Cache Coherence” Presented by Scott Weber.

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Evaluation – Metrics, Simulation, and Workloads Copyright 2004 Daniel.

Timestamp snooping: an approach for extending SMPs Milo M. K. Martin et al. Summary by Yitao Duan 3/22/2002.

March University of Utah CS 7698 Token Coherence: Decoupling Performance and Correctness Article by: Martin, Hill & Wood Presented by: Michael Tabet.

CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.

The University of Adelaide, School of Computer Science

Presented by: Nick Kirchem Feb 13, 2004

ASR: Adaptive Selective Replication for CMP Caches

Architecture and Design of AlphaServer GS320

The University of Adelaide, School of Computer Science

A New Coherence Method Using A Multicast Address Network

Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.

CMSC 611: Advanced Computer Architecture

The University of Adelaide, School of Computer Science

CS5102 High Performance Computer Systems Distributed Shared Memory

Improving Multiple-CMP Systems with Token Coherence

E. Bilir, R. Dickson, Y. Hu, M. Plakal, D. Sorin,

11 – Snooping Cache and Directory Based Multiprocessors

CS 213 Lecture 11: Multiprocessor 3: Directory Organization

Simulating a $2M Commercial Server on a $2K PC

Token Coherence: Decoupling Performance and Correctness

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Dynamic Verification of Sequential Consistency

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 18: Coherence and Synchronization

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Presentation transcript:

(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin—Madison

BASH – Milo Martin slide 2 Two classes of multiprocessors Snooping (SMP) multiprocessors –Broadcast-based  use more interconnect bandwidth +Directly locate owner  low latency cache-to-cache transfers (36% - 91% of misses are cache-to-cache transfers in our commercial workloads) Directory-based multiprocessors +Indirection  bandwidth-efficient & scalable –Indirection  higher latency cache-to-cache transfers Problem: higher performing approach varies with: –Configuration (e.g., number of processors) –Workload (e.g., cache miss rate)

BASH – Milo Martin slide 3 Which approach is best? Micro-benchmark 64 processors

BASH – Milo Martin slide 4 Bandwidth Adaptive Snooping Hybrid (BASH) Goals –Best performance aspects of both approaches High performance for many configurations & workloads Future workload properties unknown at design time –Single design Coherence logic integrated with processors One part for many systems Hybrid protocol –Snooping-like broadcast requests –Directory-like “unicast” requests Bandwidth adaptive –Estimate available bandwidth –Adjust rate of broadcast based on estimate

BASH – Milo Martin slide 5 Best of both protocols Micro-benchmark 64 processors

BASH – Milo Martin slide 6 Outline Overview Bandwidth adaptive mechanism Hybrid protocol Evaluation Conclusions

BASH – Milo Martin slide 7 Ordered Interconnect $ P M $ P M $ P M $ P M System model Ordered interconnect Processor/Memory nodes –Directory state –Adaptive mechanism Bandwidth Adaptive Mechanism Network Interface Caches Processor Memory Directory Controller

BASH – Milo Martin slide 8 Bandwidth adaptive mechanism Choose broadcast or unicast for each miss Goal: minimize latency - avoid extreme queuing delay Approach: limit average interconnect utilization –Contention dominates miss latency at high utilizations –Interconnect utilization goal (e.g., 75%) –Adjust rate of broadcast –Feedback control system

BASH – Milo Martin slide 9 Implementation Two counters at each processor –Utilization counter (Above or below utilization threshold?) –Policy counter (Probability of broadcast?) At each processor –Each cycle: Monitor local link & adjust utilization counter –Each sampling interval: Adjust policy counter based on utilization counter –Each miss: Compare policy counter with a random number Why random? –Steady state of mixed broadcasts and unicasts –Enables us to avoid oscillation

BASH – Milo Martin slide 10 Outline Overview Bandwidth adaptive mechanism Hybrid protocol –Snooping-like operation –Directory-like operation –Complexity & Scalability Evaluation Conclusions

BASH – Milo Martin slide 11 Ordered broadcast Marker places request in total order marker request  Snooping-like operation P2P2 Owner P1P1 Shared P3P3 Invalid P0P0 Requestor M0M0 Home Data  Low latency cache-to-cache, but requires broadcast Owner: P 1

BASH – Milo Martin slide 12 marker request  Add indirection Uses order to avoid acks Similar to Alpha GS320 marker  re-request Directory-like operation P2P2 Owner P1P1 Shared P3P3 Invalid P0P0 Requestor M0M0 Home Data  Avoids broadcast, but frequently adds indirection Owner: P 1, Sharers: {P 2 }

BASH – Milo Martin slide 13 Protocol races Choose broadcast or unicast for each miss Protocol simultaneously allows –Broadcast requests –Unicast requests –Forwarded requests –Writebacks Like all protocols, BASH has protocol races

BASH – Milo Martin slide 14 Protocol race example P2P2 Owner P1P1 Shared P3P3 Requestor P0P0 M0M0 Home Owner: P 1, Sharers: {P 2 } Broadcast Unicast re-request Data

BASH – Milo Martin slide 15 Protocol race example P2P2 Invalid P1P1 P3P3 Modified P0P0 Requestor M0M0 Home Owner: P 3, Sharers: Ø Unicast re-request

BASH – Milo Martin slide 16 Protocol race example P2P2 Invalid P1P1 P3P3 Modified P0P0 Requestor M0M0 Home Owner: P 3, Sharers: Ø Unicast re-request Data 2nd re-request

BASH – Milo Martin slide 17 Protocol races Race detection: directory audits all requests –Observes all requests –Compares request destination set with current sharers –Occasionally needs to re-issue a request Requests are processed uniformly –Processors - respond with data or invalidate –Directory - audit request, may forward data or request See paper for more information

BASH – Milo Martin slide 18 Complexity One “cost” of implementing BASH Quantifying complexity is difficult… –Protocol controllers are finite state machines –Similar number of states –BASH has twice as many events and transitions Moderate complexity –Additive, not multiplicative Similar to Multicast Snooping –Original proposal [Bilir et al., ISCA 1999] –Enhanced, specified & verified [Sorin et al., TPDS 2002]

BASH – Milo Martin slide 19 Scalability Limited by ordered interconnect –BASH eliminates broadcast-only nature of snooping Recent systems with an ordered interconnect –Compaq AlphaServer GS320 (32 processor) - directory –Sun UE15000 (106 processors) - snooping –Fujitsu PrimePower 2000 (128 processors) - snooping Potential alternative –Timestamp Snooping network [Martin et al., ASPLOS 2000]

BASH – Milo Martin slide 20 Outline Overview Bandwidth adaptive mechanism Hybrid protocol Evaluation Conclusions

BASH – Milo Martin slide 21 Workloads & methods Workloads [ CAECW ‘02 ] –OLTP: IBM’s DB2 & TPCC-like (1GB database) –Static web: Apache –Dynamic web: SlashCode –Java middleware: SpecJBB –Scientific workload: Barnes-Hut Setup and tuned for 16 processors Full system simulation –Virtutech’s Simics –Solaris 8 on SPARC V9 –Blocking processor model Memory system simulator –Captures timing, races, and all transient states

BASH – Milo Martin slide 22 Three Questions 1)Is our adaptive mechanism effective? 2)Does BASH adapt to multiple workloads? 3)Does BASH adapt to multiple configurations?

BASH – Milo Martin slide 23 (1) SpecJBB on 16 processors

BASH – Milo Martin slide 24 (1) SpecJBB on 16 processors, 4x broadcast cost

BASH – Milo Martin slide 25 (1) SpecJBB on 16 processors, 4x broadcast cost

BASH – Milo Martin slide 26 (2) Can BASH adapt to multiple workloads? 1600 MB/s links SimilarSnooping Directory

BASH – Milo Martin slide 27 (2) Can BASH adapt to multiple workloads? 1600 MB/s links

BASH – Milo Martin slide 28 (3) Can BASH adapt to multiple configurations? Micro-benchmark 1600 MB/s links

BASH – Milo Martin slide 29 (3) Can BASH adapt to multiple configurations? Micro-benchmark 1600 MB/s links

BASH – Milo Martin slide 30 Results Summary 1)Is our adaptive mechanism effective? Yes 2)Does BASH adapt to multiple workloads? Yes 3)Does BASH adapt to multiple configurations? Yes

BASH – Milo Martin slide 31 Conclusions Bandwidth Adaptive Snooping Hybrid (BASH) –Hybrid of snooping and directories –Simple bandwidth adaptive mechanism Adapts to various workloads & system configurations –Robust performance –Outperforms base protocols in some cases Future directions –Focus bandwidth on likely cache-to-cache transfers –Explore multicasts –Power-adaptive coherence

BASH – Milo Martin slide 32

BASH – Milo Martin slide 33 Queuing model motivation Knee A multiprocessor as a simple queuing model –Exponential service & think time distributions “interconnect” “ processors ” requestsresponses