Using Prediction to Accelerate Coherence Protocols

Using Prediction to Accelerate Coherence Protocols Shubhendu S. Mukherjee and Mark D. Hill, University of Wisconsin-Madison

The topic once again: Using Prediction to Accelerate Coherence Protocols
Discuss the concept of using prediction in a coherence protocol
See how it can be used to accelerate the protocol

Organization
Introduction
Background: Directory Protocol; Two-level Branch Predictor
Cosmos: Basic Structure; Obtaining Predictions; Implementation Issues
Integration with a Coherence Protocol: How and when to act on the predictions; Handling mis-predictions; Performance
Evaluation: Benchmarks; Results
Summary and Conclusions

Introduction
Large shared-memory multiprocessors suffer long latencies for misses to remotely cached blocks
Proposals to lessen these latencies: multithreading, non-blocking caches, application-specific coherence protocols, predicting future sharing patterns to overlap execution with coherence work
Drawbacks: more complex programming model, need for sophisticated compilers, and existing predictors are directed at specific sharing patterns known a priori
Need for a general predictor, hence this paper!

Introduction
If the "general" predictor is not in the army, then what is it? A general predictor would sit beside a standard directory or cache module, monitor coherence activity, and take appropriate actions
See the design of the Cosmos coherence message predictor
Evaluate Cosmos on some scientific applications
All's well that ends well? Summarize and conclude

Background: 6810 strikes back!
Structure of a Directory Protocol
Distributed-memory multiprocessor with hardware-based cache coherence
Directory and memory distributed among the processors; the physical address gives the location of a block's memory
Nodes connected to each other via a scalable interconnect; messages routed from sender to receiver
The directory keeps track of sharing states, which are?

Directory Structure (diagram: four nodes, each with a processor & caches, memory, I/O, and a directory, connected by an interconnection network)

Example: Coherence Protocol Actions (diagram: Processor 1 writes block A while Processor 2 holds a cached copy)
1. P1 sends a Wr request to Dir 1
2. Dir 1 sends an Inval request to Dir 2
3. Dir 2 invalidates the cached copy at P2
4. Dir 2 sends an Inval response to Dir 1
5. Dir 1 sends a Wr response to P1
Point to ponder: multiple long-latency operations, performed sequentially

Background: 6810 strikes back!
Branch predictor
Need: execute probable instructions without waiting, thus improving performance
Two-level, basically a local predictor:
Use the PC of the branch to index into a Branch History Table (local histories)
Use this BHT entry to index into a per-branch Pattern History Table to obtain a branch prediction

Two-Level Predictor (diagram): 6 bits of the branch PC index a table of 64 entries of 14-bit per-branch histories; the selected 14-bit history (e.g., 10110111011001) then indexes a second-level table of 16K 2-bit saturating counters
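Below is a minimal software sketch of the two-level local predictor described above, assuming the sizes on the slide (a 64-entry table of 14-bit per-branch histories and a 16K-entry table of 2-bit counters); the function names and the choice of PC bits are illustrative.

# Minimal two-level (local) branch predictor sketch, using the sizes on the slide.

BHT_ENTRIES = 64                    # first level: per-branch history registers
HISTORY_BITS = 14                   # each history register holds the last 14 outcomes
PHT_ENTRIES = 1 << HISTORY_BITS     # second level: 16K 2-bit saturating counters

bht = [0] * BHT_ENTRIES             # 14-bit histories
pht = [2] * PHT_ENTRIES             # 2-bit counters, start weakly taken

def predict(pc: int) -> bool:
    """Predict taken/not-taken for the branch at `pc`."""
    hist = bht[(pc >> 2) & (BHT_ENTRIES - 1)]    # 6 PC bits pick the history register
    return pht[hist] >= 2                        # counter value >= 2 means predict taken

def update(pc: int, taken: bool) -> None:
    """Train the predictor with the actual branch outcome."""
    idx = (pc >> 2) & (BHT_ENTRIES - 1)
    hist = bht[idx]
    # Update the saturating counter selected by the current history.
    pht[hist] = min(3, pht[hist] + 1) if taken else max(0, pht[hist] - 1)
    # Shift the outcome into the 14-bit history register.
    bht[idx] = ((hist << 1) | int(taken)) & (PHT_ENTRIES - 1)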

What in the universe is Cosmos?
Cosmos is a Coherence Message Predictor
It predicts the sender and type of the next incoming message for a particular block
Structure: similar to a two-level branch predictor

Structure of Cosmos (diagram)
First level: Message History Table (MHT), one Message History Register (MHR) per block address
Each MHR holds <sender, type> tuples; the number of tuples per MHR constitutes its depth
Second level: Pattern History Tables, one per block address

Structure of Cosmos
The first-level table is called the Message History Table (MHT)
An MHT consists of a series of Message History Registers (MHRs), one per cache block address
An MHR contains a sequence of <sender, type> tuples; the number of tuples is its depth
The second-level table is called the Pattern History Table (PHT); there is one PHT for each MHR
A PHT is indexed by the entry in its MHR; each PHT contains prediction tuples corresponding to MHR entries
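A minimal sketch of how these two levels could be laid out in software; the dictionary/deque layout, names, and MHR_DEPTH constant are illustrative, not the paper's hardware organization.

from collections import defaultdict, deque

MHR_DEPTH = 2   # number of <sender, type> tuples kept per Message History Register

# First level: Message History Table (MHT), one Message History Register (MHR)
# per cache block address; each MHR holds the last MHR_DEPTH <sender, type> tuples.
mht = defaultdict(lambda: deque(maxlen=MHR_DEPTH))

# Second level: one Pattern History Table (PHT) per block address, indexed by the
# tuple of messages currently in that block's MHR, holding the predicted next message.
pht = defaultdict(dict)

def pht_index(block_addr):
    # The PHT index is simply the current contents of the block's MHR.
    return tuple(mht[block_addr])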

An Example: Producer - Consumer
repeat
    ...
    if (producer)
        private_counter++
        shared_counter = private_counter
        barrier
    else if (consumer)
        private_counter = shared_counter
    endif
until done
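For concreteness, a small runnable version of this sharing pattern, with Python threads and a barrier standing in for the processors and the synchronization; the two barrier waits per iteration and the iteration count are illustrative choices, not part of the original fragment.

import threading

ITERATIONS = 5
shared_counter = 0
barrier = threading.Barrier(2)   # producer + consumer

def producer():
    global shared_counter
    private_counter = 0
    for _ in range(ITERATIONS):
        private_counter += 1
        shared_counter = private_counter   # write miss -> Get Wr / Inval Wr traffic
        barrier.wait()
        barrier.wait()                     # let the consumer read before the next write

def consumer():
    for _ in range(ITERATIONS):
        barrier.wait()
        private_counter = shared_counter   # read miss -> Get Rd / Inval Rd traffic
        print("consumer read", private_counter)
        barrier.wait()

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()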

An Example: Producer - Consumer (diagram: Processor 1, the producer, and Processor 2, the consumer, each with caches, memory, and I/O, connected to a directory through an interconnection network)

An Example: Producer - Consumer
Messages seen by the Producer Cache (from the directory):
1. Get Wr response
2. Invalidate Wr request

An Example: Producer - Consumer
Messages seen by the Consumer Cache (from the directory):
1. Get Rd response
2. Invalidate Rd request

An Example: Producer - Consumer
Messages seen by the Directory:
1. Get Wr request from the producer
2. Invalidate Rd response from the consumer
3. Get Rd request from the consumer
4. Invalidate Wr response from the producer

An Example: Producer - Consumer
Sharing Pattern Signature: predictable message patterns
Producer: send Get Wr request to directory; receive Get Wr response from directory; receive Invalidate Wr request from directory; send Invalidate Wr response to directory
Consumer: send Get Rd request to directory; receive Get Rd response from directory; receive Invalidate Rd request from directory; send Invalidate Rd response to directory

Back to Cosmos (diagram): the directory receives a get Rd request from the consumer
The Message History Table, indexed by the global address of shared_counter, records the tuple <P2, get Rd request> in the block's MHR
The Pattern History Table for shared_counter, indexed by that MHR entry, supplies the prediction for the next incoming message: <P1, Inval Wr response>
(P1: producer, P2: consumer)

Back to Cosmos
Obtaining predictions:
Index into the MHT with the address of the cache block
Use the MHR entry to index into the corresponding PHT
Return the prediction (if one exists) from the PHT; the prediction is of the form <sender, message type>
Updating Cosmos:
Write the new <sender, message type> tuple as the prediction for the index corresponding to the current MHR contents
Insert the <sender, message type> tuple into the MHR for the cache block
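A compact, self-contained sketch of the two operations just described, driven by the directory-side message signature from the producer-consumer example; the dict/deque tables, the DEPTH constant, and the driver loop are illustrative.

from collections import defaultdict, deque

DEPTH = 1   # tuples per Message History Register

mht = defaultdict(lambda: deque(maxlen=DEPTH))   # block address -> MHR
pht = defaultdict(dict)                          # block address -> {MHR contents -> predicted msg}

def predict(block):
    """Index the MHT with the block address, then the block's PHT with the MHR contents."""
    return pht[block].get(tuple(mht[block]))     # None if no prediction exists yet

def update(block, msg):
    """Record the <sender, type> tuple `msg` that actually arrived for `block`."""
    index = tuple(mht[block])
    if index:
        pht[block][index] = msg                  # new prediction for the old MHR contents
    mht[block].append(msg)                       # shift the tuple into the MHR

# Messages seen by the directory for shared_counter in the producer-consumer example
# (P1 = producer, P2 = consumer), repeated over a few iterations:
signature = [("P1", "get Wr request"), ("P2", "Inval Rd response"),
             ("P2", "get Rd request"), ("P1", "Inval Wr response")]

block = "shared_counter"
hits = total = 0
for msg in signature * 4:
    guess = predict(block)
    hits += (guess == msg)
    total += 1
    update(block, msg)
print(f"prediction accuracy: {hits}/{total}")

With DEPTH greater than 1, the PHT index becomes the last few tuples rather than the last one, which is how Cosmos disambiguates the out-of-order, multi-consumer signatures shown on the next slides.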

How Cosmos adapts to complex signatures
Consider one producer and two consumers, P1 and P2. The two get Rd requests can arrive out of order; the PHT will then be as shown below:
Index <P1, get Rd request> -> Prediction <P2, get Rd request>
Index <P2, get Rd request> -> Prediction <P1, get Rd request>

How Cosmos adapts to complex signatures
With an MHR depth greater than 1 (here depth 2):
Index <P1, get Rd request>, <P3, get Rd request> -> Prediction <P2, get Rd request>
Index <P2, get Rd request>, <P1, get Rd request> -> Prediction <P3, get Rd request>
Index <P3, get Rd request>, <P2, get Rd request> -> Prediction <P1, get Rd request>

Implementation issues
Storage issues:
It may be possible to merge the first-level table with the cache block state at the cache and at the directory
The second-level table needs more memory to capture the pattern histories for each cache block
If the number of pattern histories per cache block is found to be low, pre-allocate memory for them; if more pattern histories are needed, allocate them from a common pool of dynamically allocated memory
Higher prediction accuracies require greater MHR depths, which may result in large amounts of memory

Integration with a Coherence Protocol
Predictors sit beside the cache and directory modules and accelerate coherence activity in two steps:
Step 1: monitor message activity and make a prediction
Step 2: invoke an action based on the prediction
Key challenges: knowing how and when to act on the predictions, handling mis-predictions, performance
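A hedged sketch of this two-step integration: a hypothetical on_message hook that a directory or cache module could call for every incoming message. The stub predictor (which simply guesses that the last message repeats), the confidence threshold, and the action names are placeholders, not the paper's mechanism.

CONFIDENCE_THRESHOLD = 2   # illustrative: speculate only after repeated confirmation

class StubPredictor:
    """Placeholder standing in for a Cosmos-style predictor (see the earlier sketch)."""
    def __init__(self):
        self.last = {}
        self.streak = {}
    def update(self, block, msg):
        self.streak[block] = self.streak.get(block, 0) + (self.last.get(block) == msg)
        self.last[block] = msg
    def predict(self, block):
        return self.last.get(block)
    def confidence(self, block):
        return self.streak.get(block, 0)

class StubProtocol:
    """Placeholder directory module: logs actions instead of sending real messages."""
    def maybe_speculate(self, block, predicted):
        print(f"speculate on {block}: expect {predicted}")
    def handle(self, block, msg):
        print(f"handle {msg} for {block}")

def on_message(block, msg, predictor, protocol):
    # Step 1: monitor message activity and obtain a prediction for the next message.
    predictor.update(block, msg)
    predicted = predictor.predict(block)
    # Step 2: invoke a speculative action only when confidence is high enough;
    # the normal protocol action still runs, so a wrong prediction costs only the speculation.
    if predicted is not None and predictor.confidence(block) >= CONFIDENCE_THRESHOLD:
        protocol.maybe_speculate(block, predicted)
    protocol.handle(block, msg)

pred, proto = StubPredictor(), StubProtocol()
for msg in [("P2", "get Rd request")] * 4:
    on_message("shared_counter", msg, pred, proto)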

How to act on predictions: some examples (Prediction | Location | Static/Dynamic | Action | Protocol)
Ld/St from processor | Cache | Static | Prefetch block in shared or exclusive state | Stanford DASH protocol
Read-modify-write | Directory | Static | Directory responds with block in exclusive state on a read miss to an idle block | SGI Origin protocol
Read-modify-write | Cache | Static | Cache requests exclusive copy on read miss | Dir1SW, Dir1SW+
Store from a different processor | Cache | Dynamic | Replace the block and return it to the directory; invalidate and replace the block to the directory if exclusive | Dynamic Self-Invalidation
Block migrates between different processors | Directory | Dynamic | On a read miss, return the block to the requesting processor in exclusive state | Migratory protocols

Detecting and Handling Mis-predictions
The usual problem with prediction: mis-predictions may leave the processor state or protocol state inconsistent
Actions taken after predictions can be classified into three categories:
1. Actions that move the protocol between two legal states
2. Actions that move the protocol to a future state, but do not expose this state to the processor
3. Actions that allow both the processor and the protocol to move to future states

Handling Mis-Predictions
Category 1: actions that move the protocol between two legal states
Example: replacement of a cache block, which moves the block from the exclusive to the invalid state
No explicit recovery is needed in this case
(timeline diagram: P1 cache, directory, P2 cache; messages: Get Wr request, Inval Wr response, Get Wr response)

Handling Mis-Predictions
Category 2: actions that move the protocol to a future state, but do not expose this state to the processor
On a mis-prediction, simply discard the future state
If the prediction is correct, commit the future state and expose it to the processor
(timeline diagram: P1 cache, directory, P2 cache; one side predicts, updates protocol state, and generates a message ahead of time; messages: Get Wr request, Inval Wr request, Inval Wr response, Get Wr response)

Handling Mis-Predictions (timeline diagram: a node predicts, updates protocol state, and generates a message; on a mis-prediction, the correct response is sent instead)

Handling Mis-Predictions
Category 3: actions that allow both the processor and the protocol to move to future states
These need greater support for recovery:
Before speculating, both the processor and the protocol checkpoint their states
On detecting a mis-prediction, they roll back to the checkpointed states
On a correct prediction, the current protocol and processor states are committed
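A minimal sketch of the checkpoint/rollback idea for this third category; the dictionary state, deep-copy checkpoint, and function names are illustrative, since real hardware would checkpoint registers and protocol state rather than Python dicts.

import copy

state = {"protocol": {"A": "shared"}, "processor": {"r1": 0}}

def speculate(predicted_action):
    """Checkpoint, apply the predicted action, and return the checkpoint for possible rollback."""
    checkpoint = copy.deepcopy(state)
    predicted_action(state)          # both protocol and processor may move to future states
    return checkpoint

def resolve(checkpoint, prediction_was_correct):
    """Commit the speculative state on a correct prediction, otherwise roll back."""
    if not prediction_was_correct:
        state.clear()
        state.update(checkpoint)     # roll back to the checkpointed states

# Example: speculatively upgrade block A to exclusive and let the processor run ahead.
cp = speculate(lambda s: (s["protocol"].update(A="exclusive"),
                          s["processor"].update(r1=42)))
resolve(cp, prediction_was_correct=False)
print(state)   # rolled back: {'protocol': {'A': 'shared'}, 'processor': {'r1': 0}}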

Performance: how prediction affects runtime
A simplistic execution model is as follows. Let:
p be the prediction accuracy for each message,
f be the fraction of delay incurred on messages predicted correctly (e.g., f = 0 means that the time of a correctly predicted message is completely overlapped with other delays), and
r be the penalty due to a mis-predicted message (e.g., r = 0.5 implies a mis-predicted message takes 1.5 times the delay of a message without prediction).

Performance: how prediction affects runtime
With p, f, and r as defined above, and assuming performance is completely determined by the number of messages on the critical path of a parallel program, the speedup due to prediction is:
speedup = time(without prediction) / time(with prediction) = 1 / (p * f + (1 - p) * (1 + r))

Performance
Example: for a prediction accuracy of 80% (p = 0.8), a mis-prediction penalty of 100% (r = 1), and a prediction-success benefit of 30% (f = 0.3), the speedup is about 56%.
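The model is easy to check numerically; a tiny helper under the slide's assumptions (the function name is mine).

def speedup(p: float, f: float, r: float) -> float:
    """Speedup from the simple message-count model: 1 / (p*f + (1-p)*(1+r))."""
    return 1.0 / (p * f + (1.0 - p) * (1.0 + r))

# Slide's example: 80% accuracy, 30% residual delay on correct predictions,
# 100% penalty on mis-predictions -> 1 / (0.24 + 0.40) = 1.5625, i.e. ~56% faster.
print(speedup(p=0.8, f=0.3, r=1.0))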

Evaluation
Cosmos' prediction accuracy is evaluated using traces of coherence messages obtained from the Wisconsin Stache protocol running five parallel scientific applications
Wisconsin Stache protocol: Stache is a software, full-map, write-invalidate directory protocol that uses part of local memory as a cache for remote data
Benchmarks: five parallel scientific applications: appbt, barnes, dsmc, moldyn, unstructured

Benchmarks
Appbt: a parallel three-dimensional computational fluid dynamics application
Barnes: simulates the interaction of a system of bodies in three dimensions using the Barnes-Hut hierarchical N-body method
Dsmc: studies the properties of a gas by simulating the movement and collision of a large number of particles in a three-dimensional domain, using the discrete simulation Monte Carlo method

Benchmarks
Moldyn: a molecular dynamics application
Unstructured: a computational fluid dynamics application that uses an unstructured mesh to model a physical structure, such as an airplane wing or body

Results (chart: cache (C), directory (D), and overall (O) prediction rates for appbt, barnes, dsmc, moldyn, and unstructured at MHR depths 1 through 4)

Results: Observations
Overall prediction accuracy: 62-86%
Higher accuracy for the cache than for the directory: why?
Prediction accuracy increases with MHR depth, but there is little gain beyond a depth of 3
Appbt: high prediction accuracy; producer-consumer sharing pattern in which the producer reads and writes and the consumer reads
Barnes: lower accuracy than the other applications; nodes of the octree are assigned different shared-memory addresses in different iterations

Results: Observations
Dsmc: highest accuracy of all the applications; producer-consumer sharing pattern in which the producer writes and the consumer reads. Why higher than Appbt?
Moldyn: high accuracy; migratory and producer-consumer sharing patterns
Unstructured: different dominant signatures for the same data structures in different phases of the application

Effects of noise filters (remember them?)
Cosmos noise filter: a saturating counter from 0 to MAXCOUNT, here up to 2
Beyond an MHR depth of 1, the filters do not help much. Why? Predictors with MHR depth > 1 can themselves adapt to noise, giving greater accuracy for repeating noise
(chart: prediction accuracy for appbt, barnes, dsmc, moldyn, and unstructured at MHR depths 1 and 2, with MAXCOUNT = 0, 1, 2)

Summary and Conclusions
Comparison with directed optimizations:
Worse: less cost-effective, since more hardware is required
Better: composing the predictors of several directed optimizations in a single protocol would be more complex than a single Cosmos predictor
Better: Cosmos can discover application-specific sharing patterns not known a priori

Summary and Conclusions
We explored using prediction to accelerate coherence protocols: a protocol executes faster if future actions can be predicted and executed speculatively
We came across Cosmos: a two-level predictor built from an MHT, MHRs, and PHTs, operating on <sender, message-type> tuples
We evaluated Cosmos using scientific applications: high prediction accuracy because of predictable coherence message patterns
Cosmos is more general than directed optimizations, can be easily integrated with a protocol, and can discover and track application-specific patterns not known a priori, but it can be costly because of its large resource usage
Finally, more work is needed to determine whether the high prediction rates can be used to significantly reduce execution time with a coherence protocol

Questions?