Coherence Decoupling: Making Use of Incoherence J. Huh, J. Chang, D. Burger, G. Sohi ASPLOS 2004.

Motivation
- Multithreading and multiprocessing have become common.
- When a cache line is marked invalid, often not all of the data in the line is incorrect.
- If the data in invalid lines can be used speculatively, there is great potential for performance improvement.

Background: Cache Coherence Protocol
- Used in shared-memory multiprocessors to manage correct data sharing.
- Vital to the design of multiprocessors, since it contributes the most to inter-processor communication latency.

Proposed Idea
- Separate the traditional cache coherence protocol into two parts:
  - Speculative cache lookup (SCL): uses a speculative value from an invalid cache line, allowing the processor to keep working.
  - Safe coherence protocol: obtains the correct value, which is then compared with the value provided by SCL.

Coherence Decoupling

Related Work
- Customized coherence protocols
- Speculative coherence operations
  - Dynamic self-invalidation, coherence message predictors, token coherence, etc.
- Speculation on the outcome of events in multiprocessor execution

Coherence Decoupling Architecture
Must support the following:
1. Split: a means to split a memory operation into a speculative load and a coherence operation
2. Compute: mechanisms to support execution with speculative values
3. Recover: a means to recover and roll back upon misprediction
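The three requirements can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism from the talk: `Core`, `CacheLine`, `decoupled_load`, and the `fetch_coherent` callback are hypothetical names, register state stands in for all speculative state, and the coherence request is modeled sequentially although real hardware would overlap it with the speculative computation.

```python
from dataclasses import dataclass, field


@dataclass
class CacheLine:
    data: list  # word values left behind in an invalidated line (possibly stale)


@dataclass
class Core:
    regs: dict = field(default_factory=dict)  # stand-in for architectural state

    def checkpoint(self):
        return dict(self.regs)

    def rollback(self, snapshot):
        self.regs = dict(snapshot)


def decoupled_load(core, line, word, fetch_coherent, compute):
    """Run `compute` on a speculative value, then verify and recover if needed."""
    # Split: the load becomes a speculative read plus a coherence request.
    speculative = line.data[word]
    snapshot = core.checkpoint()
    # Compute: dependent work proceeds on the speculative value.
    compute(core, speculative)
    # The safe protocol supplies the coherent value (modeled as a plain call).
    correct = fetch_coherent()
    if correct == speculative:
        return "hit"
    # Recover: discard speculative work and redo it with the correct value.
    core.rollback(snapshot)
    compute(core, correct)
    line.data[word] = correct
    return "misspeculation"
```

Calling `decoupled_load` with a `fetch_coherent` that returns the stale value takes the fast path; any other value triggers the rollback and replay.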

SCL Protocols for Coherence Decoupling
- Use a simple safe coherence protocol and rely on an aggressive SCL protocol to increase performance.
- Two components of an SCL protocol:
  - Read component: obtains the speculative value.
  - Update component: updates an invalid cache line so that subsequent speculative reads can use it (can be left out in some SCL protocols).

Read vs. Update Components
- An SCL protocol with only a read component can be used if the word in an invalid block has:
  - Not changed remotely (false sharing)
  - Changed remotely to the same value (silent stores)
  - Changed remotely to a different value and then back to the original value (temporally silent stores)
- For truly shared data, an update component needs to be added.
  - It speculatively sends data around the system by writing it into invalid cache lines.
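The cases above can be made concrete with a small classifier. This is a sketch for illustration only: `classify_word` and its `remote_writes` argument (the ordered values other processors wrote to the word after the local copy was invalidated) are hypothetical names, and real hardware never sees this history directly; it only learns whether the stale value matched.

```python
def classify_word(stale_value, remote_writes):
    """Classify why a word in an invalid line may still hold a usable value."""
    if not remote_writes:
        # Another word in the line was written, not this one.
        return "false sharing"
    if all(v == stale_value for v in remote_writes):
        # Every remote store rewrote the value that was already there.
        return "silent store"
    if remote_writes[-1] == stale_value:
        # The value changed and later reverted to the original.
        return "temporally silent store"
    return "true sharing"


def read_only_scl_is_correct(stale_value, remote_writes):
    """A read-only SCL protocol speculates correctly in all but true sharing."""
    return classify_word(stale_value, remote_writes) != "true sharing"
```

For example, `classify_word(5, [9, 5])` falls in the temporally silent case, so the stale value 5 is still the right speculation, while `classify_word(5, [9])` is true sharing and would need an update component.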

SCL Protocol: Read Component
- CD: use the locally cached incoherent value for every L2 miss.
  - Simple, but since it is triggered on every load operation it could produce many misspeculations.
- CD-F: add a PC-indexed confidence predictor to filter speculations.
  - Reduces the number of (mis)speculative reads, thus improving the average accuracy.
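One plausible shape for the CD-F filter is a table of saturating counters indexed by the load's PC. The sketch below is an assumption-laden illustration, not the predictor evaluated in the talk: the table size, counter width, threshold, initial value, and reset-on-misspeculation policy are all hypothetical choices, as is the name `ConfidenceFilter`.

```python
class ConfidenceFilter:
    """PC-indexed saturating counters that gate speculative reads."""

    def __init__(self, entries=1024, max_count=3, threshold=2):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        # Assumption: counters start at the threshold, so a load speculates
        # until it is first observed to misspeculate.
        self.table = [threshold] * entries

    def _index(self, pc):
        return pc % self.entries

    def should_speculate(self, pc):
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, was_correct):
        # Training can happen even when speculation was suppressed, because
        # the stale value can still be compared with the coherent one.
        i = self._index(pc)
        if was_correct:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = 0  # assumed policy: reset on a misspeculation
```

A load PC that misspeculates once is filtered out until it has matched the coherent value `threshold` times in a row, which trades some lost opportunities for fewer rollbacks.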

SCL Protocol: Update Component
- CD-IA: use invalidation piggyback to update all invalid blocks.
- CD-C: use invalidation piggyback if the value is compressed.

SCL Protocol: Update Component (continued)
- CD-N: update all sharers after N writes to a block.
  - Increases the number of messages (bandwidth).
- CD-W: update on every write if any sharers exist.
- CD is assumed wherever write update is used.
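The CD-N policy amounts to a per-block write counter on the writer's side. The following is a minimal sketch with hypothetical names (`UpdateAfterNWrites`, `on_write`); it only decides when an update message would be sent and counts those messages, and it does not model sharer tracking or the data transfer itself.

```python
class UpdateAfterNWrites:
    """Decide when a writer should push a block's data to its sharers."""

    def __init__(self, n):
        self.n = n
        self.pending = {}  # block address -> writes since the last update
        self.messages = 0  # update messages sent so far (bandwidth proxy)

    def on_write(self, block):
        """Record a write; return True when an update should be broadcast."""
        count = self.pending.get(block, 0) + 1
        if count >= self.n:
            self.pending[block] = 0
            self.messages += 1
            return True
        self.pending[block] = count
        return False
```

In this sketch, a larger `n` sends fewer messages but leaves sharers with staler values; `n = 1` sends an update on every write, which resembles CD-W apart from the sharer check.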

Methodology
- Simulator: MP-Sauce and SimpleScalar
- 16-node SMP systems simulated
- Coherence protocol used: simple invalidation-based snooping-bus protocol
- Benchmarks simulated: 3 commercial applications and 5 scientific shared-memory benchmarks from the SPLASH2 suite

Results: Microbenchmarks
- Simple-fs: loads falsely shared data and then executes (in)dependent instructions.
- Critical-fs: forces a data dependence between two loads by placing consecutive false-sharing misses on the critical path.

L2 Miss Profiling Results

Coherence Decoupling Accuracy Results
- Protocols compared: CD, CD-F, CD-IA, CD-C, CD-N, CD-W

Timing Results

Bandwidth Requirements

Latency Tolerance Profiles
- Executed instructions during coherence decoupling
- The number of control-dependent instructions will grow in future processors.

Conclusions
- Coherence misses are a significant fraction of L2 misses, ranging from 10% to 80%.
- Coherence decoupling has the potential to hide the miss latency for 40% to 90% of coherence misses.
- Misspeculation occurs 20% of the time.