Memory Sharing Predictor: The Key to a Speculative Coherent DSM. An-Chow Lai and Babak Falsafi, Purdue University.



Organization
– Introduction
– Directory-based cache coherence
– Pattern-based message predictors
– Memory sharing predictors
– Vector memory sharing predictors
– Speculative coherence operations
– Performance analysis
– Results
– Summary & conclusions

Introduction
Distributed Shared Memory (DSM) multiprocessors:
– Provide a logical shared address space over physically distributed memory
– Programming is easier compared to SMPs
– Non-Uniform Memory Access is the bottleneck: remote accesses are far slower than local accesses
(Figure: a DSM machine)

Efforts to eliminate this gap:
– Custom-designed motherboards: forfeit the excellent cost-performance of off-the-shelf motherboards
– Reducing remote access frequency
– Reducing coherence protocol overhead: requires complex adaptive coherence protocols
– Existing predictors: directed at specific sharing patterns known a priori
– Pattern-based predictors:
  - Dynamically adapt to an application's sharing pattern at runtime
  - Do not modify the base coherence protocol
– Memory Sharing Predictors & Vector Memory Sharing Predictors: the topic of this paper; an improvement on the general pattern-based predictors proposed by Mukherjee & Hill

Directory based cache coherence
(Figure: four nodes, each with a processor & caches, memory, I/O, and a directory, connected by an interconnection network)

Directory based cache coherence
Directory-based cache coherence protocols:
– Each node maintains sharing information for all of its memory blocks
– Modeled as a finite state machine: the states are directory states, the actions are messages
– This paper uses a half-migratory protocol
– A speculative coherent DSM must accurately predict remote accesses and perform coherence actions in a timely manner
(Figures: directory protocol transitions; a remote read request)
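The directory state machine can be sketched as follows. This is a minimal, hypothetical full-map write-invalidate directory for one block (not the paper's half-migratory protocol): each operation returns the coherence messages the home node would send.

```python
# Minimal sketch of a full-map write-invalidate directory for one memory
# block (hypothetical and simplified; the paper's protocol is half-migratory).
class Directory:
    def __init__(self):
        self.state = "Uncached"      # Uncached | Shared | Exclusive
        self.sharers = set()         # nodes currently holding a copy

    def read(self, node):
        """A remote read: fetch a dirty copy if needed, then send the block."""
        msgs = []
        if self.state == "Exclusive":
            owner = next(iter(self.sharers))
            msgs.append(("writeback_request", owner))  # recall the dirty copy
        self.sharers.add(node)
        self.state = "Shared"
        msgs.append(("send_block", node))
        return msgs

    def write(self, node):
        """A remote write/upgrade: invalidate all other sharers first."""
        msgs = [("invalidate", s) for s in self.sharers if s != node]
        self.sharers = {node}
        self.state = "Exclusive"
        msgs.append(("send_block_exclusive", node))
        return msgs
```

A read to an exclusively held block thus costs a three-hop exchange (request, writeback, reply); this is the remote latency that speculation tries to hide.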

Pattern Based Message Predictors
– Predict the sender and type of the next incoming message for a particular block
– Structure: similar to a two-level branch predictor
– History table: captures the most recent sequence of incoming messages for every memory block
– Pattern table: records all observed sequences of coherence messages for every memory block (an entry maps a sequence of messages to a predicted message)
(Figure: a two-level message predictor)

Pattern Based Message Predictors (contd.)
– The depth of the message history register is the number of past messages it keeps track of
– Deeper history => more accurate prediction and less sensitivity to message races
– Deeper history => larger pattern history table => higher cost
(Figure: Message History Table (MHT) and Message History Register (MHR))
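The two-level structure above can be sketched in a few lines. This is a hypothetical simplification (per-block tables collapsed into plain dicts, messages as (sender, type) tuples), not the hardware organization:

```python
from collections import deque

# Minimal two-level message predictor sketch. The history register of
# depth d holds the last d messages for a block; the pattern table maps
# each observed d-message history to the message that followed it.
class MessagePredictor:
    def __init__(self, depth=1):
        self.history = deque(maxlen=depth)   # message history register (MHR)
        self.patterns = {}                   # pattern table: history -> next msg

    def predict(self):
        """Predicted next message, or None if this history is unseen."""
        return self.patterns.get(tuple(self.history))

    def observe(self, msg):
        """Train on the actual incoming message, then shift it into the MHR."""
        key = tuple(self.history)
        if len(key) == self.history.maxlen:  # only train once the MHR is full
            self.patterns[key] = msg
        self.history.append(msg)
```

With depth 1 and an alternating read/write stream, the predictor learns that each writer's message is followed by the reader's request; a deeper `depth` trades a larger pattern table for resilience to interleaved messages.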

Memory Sharing Predictors
Shortcomings of the general message predictor:
– Invalidation acknowledgements may arrive in any order, so they may interfere with prediction of the more important request messages
– They increase the number of pattern table entries (almost doubling it)
– They increase the number of bits needed to encode messages (three requests plus two acknowledgements)
Observations:
– To eliminate the coherence overhead on remote access, it is only necessary to predict memory request messages (read, write, upgrade)
– Predicting coherence acknowledgement messages is pure overhead, as they are always expected to arrive in response to a coherence action

Memory Sharing Predictors
MSP addresses these issues by:
– Predicting only the memory request messages
– Eliminating acknowledgements, and with them all effects of their possible reordering
– Requiring only 2 bits to encode a message type, compared to 3 for the general predictor

VMSP: A Vector MSP
Observations:
– A full-map protocol allows multiple processors to simultaneously cache a read-only copy of a memory block
– A predictor need only identify the sharers, not the order in which they read
Optimization of MSP into VMSP:
– Rather than recording and predicting read requests as individual pattern table entries, encode a sequence of read requests as a bit vector, just as the directory maintains its list of sharers

Vector Memory Sharing Predictor (contd.)
Benefits:
– Reduces the number of pattern table entries
– Eliminates the effect of read reordering on table size
– Effect on history depth: a single vector entry captures all of the sharers
– A win when the number of readers is large (> (2+n)/(2+log n), where n is the number of nodes)

Triggering Request Speculation
Important considerations:
– Predict what remote memory requests arrive
– Predict when remote accesses arrive
– Execute the necessary coherence actions
(Figure: a speculative coherent DSM node and its coherence hardware)

Triggering Request Speculation
A) What remote memory request arrives: relatively simple, read from the pattern history table (which stores which memory accesses occur)
B) When the request arrives: harder
– Speculating too early may take the block away from its current readers
– Speculating too late incurs additional delay and limits the DSM's ability to hide coherence overhead
– Timing was not a problem in COSMOS, since every coherence message was predicted but acted on only after the previous message arrived. With acknowledgement messages removed from the history table, MSP loses that natural trigger, so timing becomes a problem.

Triggering Request Speculation
Two ways to trigger:
1) Speculative Write Invalidation (SWI): based on the most common producer/consumer access pattern: the producer writes to a memory block and then no longer accesses it until it has been read by consumers. Common in parallel commercial database servers.
– MSP predicts that a processor is done writing when that processor writes to some other memory location
– Maintain an early write-invalidate (EWI) table storing the last address written by each processor; when a processor's EWI entry changes, trigger a speculative write invalidation and the subsequent predicted reads
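The EWI trigger reduces to one comparison per write. The sketch below is a hypothetical simplification (one dict entry per processor, block addresses as plain integers):

```python
# Sketch of the early write-invalidate (EWI) table: one entry per
# processor holding the last block address it wrote. When a processor
# writes a *different* block, we guess it has finished producing the
# previous one and fire the speculative write invalidation.
class EWITable:
    def __init__(self):
        self.last_write = {}   # processor id -> last block address written

    def on_write(self, proc, addr):
        """Returns the block to speculatively invalidate, or None."""
        prev = self.last_write.get(proc)
        self.last_write[proc] = addr
        if prev is not None and prev != addr:
            return prev        # producer moved on: invalidate its old block
        return None

ewi = EWITable()
ewi.on_write("P2", 0xA0)       # first write: nothing to trigger
ewi.on_write("P2", 0xA0)       # same block: still producing
ewi.on_write("P2", 0xB0)       # new block: triggers SWI on 0xA0
```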

Comparison with the general message predictor
(Figure: two timelines for a P2-writer / P1-reader example with the directory at P3. Without speculation, P1's read must wait for an invalidate, writeback, and block transfer. With speculation, P2's write to a second address B triggers an early invalidate and writeback of block A, prefetching starts, and P1's later read hits locally.)

Question: What happens if, while speculatively read data is being sent from P3 to P1, P1 has already issued its own request for the data?
– The DSM node, on receiving the speculative message, simply drops it, avoiding any modification to the protocol.

Question: What happens if P1 issues its read request before P2 performs the second write, so SWI never triggers?
– The First Read trigger handles this.
2) First Read:
– If SWI fails, then the first read request that arrives triggers all of the subsequent predicted reads.

Speculative Coherence Operations
Final step:
– Execute a coherence action speculatively
– Verify the accuracy of the prediction
Requirement: co-exist with the base coherence protocol without any protocol modifications
– The MSP merely advises the protocol to execute coherence operations; any misspeculation results in additional coherence operations but never interferes with protocol functionality
– E.g., a premature write invalidation results only in an additional read/write request by the producer
– The MSP advises the protocol to send read-only copies of the block to the predicted requesters

Verification of accuracy
– A reference bit is kept in the remote cache for every speculatively placed block
– On an actual reference, the remote cache clears the bit, verifying that the predicted access occurred
– On invalidation of the block, the reference bit is sent along with the invalidation message
– The MSP at the home node examines this bit and removes mispredicted messages
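The handshake above can be sketched as follows (a hypothetical simplification of the hardware: one bit per speculatively placed line, returned on invalidation so the home node can judge the prediction):

```python
# Sketch of speculation verification via a per-block reference bit.
# The bit is set when a block is placed speculatively and cleared by a
# real access; if it is still set when the block is invalidated, the
# speculation was wasted and the home node can unlearn the pattern.
class RemoteCacheLine:
    def __init__(self, speculative):
        self.ref_bit = speculative   # set only for speculatively placed blocks

    def access(self):
        self.ref_bit = False         # real use: speculation verified

    def invalidate(self):
        """Return the ref bit so the home node can judge the prediction."""
        return self.ref_bit          # True => block was never actually used
```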

Performance Analysis
Performance depends on:
– Speculation accuracy
– Reduction in latency on successful speculation
– Misspeculation penalty
– Speculation opportunity: a computation-intensive application benefits little from speculation
Assumptions:
– When a speculative memory request executes successfully, the entire remote latency is hidden
– Misspeculation only slows the remote access; it does not increase the request frequency

Performance
Performance model parameters:
– c: the application's communication ratio
– f: fraction of speculatively executed requests over all received requests
– p: request prediction accuracy
– l_access: local access latency
– r_access: remote access latency
– rtl = r_access / l_access
– n: misspeculation penalty factor
– N: number of remote requests on the critical path

Performance
Communication speedup = (comm. time w/o speculation) / (comm. time w/ speculation), where

  comm. time w/ speculation = (1-f) N r_access + f N (p l_access + (1-p) n r_access)

so, dividing through by N r_access:

  1 / comm_speedup = (1-f) + f (p/rtl + n (1-p))

Total speedup = (total execution time w/o speculation) / (total execution time w/ speculation):

  1 / total_speedup = (1-c) + c / comm_speedup
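The model reduces to two one-line functions; the sketch below (symbol names follow the slide, example numbers are illustrative only, not the paper's results) makes the arithmetic concrete:

```python
# The slide's performance model as a small calculator.
def comm_speedup(f, p, rtl, n):
    """Communication speedup: 1 / [(1-f) + f*(p/rtl + n*(1-p))]."""
    return 1.0 / ((1 - f) + f * (p / rtl + n * (1 - p)))

def total_speedup(c, f, p, rtl, n):
    """Amdahl-style overall speedup: 1 / [(1-c) + c/comm_speedup]."""
    return 1.0 / ((1 - c) + c / comm_speedup(f, p, rtl, n))

# Illustrative point: half the requests speculated (f=0.5) at 90%
# accuracy (p=0.9), remote/local ratio rtl=10, misspeculation penalty
# n=1.5, communication ratio c=0.4.
s = total_speedup(c=0.4, f=0.5, p=0.9, rtl=10, n=1.5)
```

Two sanity checks fall out directly: with f=0 the speedup is exactly 1 (nothing speculated), and with f=p=1 the communication speedup approaches rtl (every remote access becomes local).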

Speedup vs. various parameters
(Figure: potential speedup in a speculative coherent DSM)

Speedups
Prediction accuracy plays the prominent role in speedup:
– A low prediction accuracy of 10-50% results in slowdown due to the high speculation overhead, while a high prediction accuracy (90%) yields speedup even at moderate communication ratios
– At high prediction accuracy, the slowdown due to an increasing misspeculation penalty is not significant
– f, the fraction of speculated requests, is a measure of how many request messages it takes to learn and predict; for rapidly changing patterns, performance improvement will not be significant even at high prediction accuracy
– A speculative coherent protocol impacts clusters the most because of their high rtl ratio

Simulation & results
– Wisconsin Wind Tunnel II simulating a CC-NUMA with 16 nodes, interconnected through hardware DSM boards over a low-latency switched network
– Full-map write-invalidate protocol with 32-byte coherence blocks
– Benchmarks: appbt, barnes, em3d, moldyn, ocean, tomcatv, unstructured

Results Base predictor accuracy comparison (history depth 1)

Results
– Em3d and Moldyn exhibit producer/consumer sharing with little read sharing => low impact of read ordering => high performance already with MSP
– Unstructured exhibits wide read-sharing in its producer/consumer phase, so MSP achieves a prediction accuracy of less than 65% while VMSP achieves almost 85%

Results Prediction accuracy with varying history depths

Results Messages predicted (correctly predicted) for a history depth of 1

Results Predictor storage overhead

Results
All predictors use 4 bits to encode a processor id:
– COSMOS uses 3 bits to encode a message type => 7 bits per history table entry and 14 bits per pattern table entry (pte) => (7+14) bits per block
– MSP and VMSP use 2 bits to encode a message type
– MSP: 6 bits per history table entry and 12 bits per pte => (6+12) bits per block
– VMSP: 18 bits per history table entry, but (18+6) bits per pte => (18+24) bits per block (in VMSP a read vector is always followed by a write/upgrade and vice versa, so a pte contains at most one entry)
– MSP and VMSP require less storage compared to COSMOS
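The bit arithmetic above can be reproduced directly (a sketch for the 16-node configuration, history depth 1, one pattern-table entry per block). Note that VMSP's per-block figure is larger; its total savings over COSMOS come from needing far fewer pattern-table entries, per the earlier VMSP slide.

```python
# Reproducing the per-block storage arithmetic for a 16-node machine.
PID_BITS = 4                      # 16 nodes
cosmos_msg = PID_BITS + 3         # 3-bit message type -> 7 bits per message
msp_msg = PID_BITS + 2            # 2-bit message type -> 6 bits per message
vmsp_vec = 16 + 2                 # 16-bit read vector + 2-bit type -> 18 bits

# Per block: one history entry + one pte (pattern + prediction).
cosmos_block = cosmos_msg + 2 * cosmos_msg   # 7 + 14 = 21 bits
msp_block = msp_msg + 2 * msp_msg            # 6 + 12 = 18 bits
vmsp_block = vmsp_vec + (vmsp_vec + msp_msg) # 18 + 24 = 42 bits
```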

Summary and Conclusion
– Proposed the Memory Sharing Predictor (MSP) to predict and execute coherence operations speculatively
– MSP eliminates acknowledgement messages from the pattern tables, increasing prediction accuracy from 81% to 86%
– VMSP further improves accuracy up to 93% by using compact vector representations and eliminating perturbations due to read-request reorderings; VMSP also reduces implementation storage
– High-accuracy predictors are the key to a high-performance speculative coherent DSM

Discussions