Shared Memory Consistency Models: A Tutorial. By Sarita V. Adve and Kourosh Gharachorloo. Presenter: Sunita Marathe

Overview: What is a memory consistency model? Uniprocessor memory consistency; multiprocessors; shared-memory multiprocessor memory consistency: 1. the Sequential Consistency (SC) model, 2. relaxed models

Memory Consistency Model: A memory model provides a formal specification of the effect of read and write operations on the memory system and describes how memory appears to the programmer. It bridges the gap between the behavior expected by the programmer and the actual behavior of the program. The memory model affects: programmability (ease of programming), performance (the optimizations it allows), and portability (moving software across different systems).

Uniprocessor memory model: In a non-parallel program, all memory accesses are performed by a single thread of control executing on a single processor. A uniprocessor presents a simple and intuitive view of memory to programmers based on sequential semantics: memory operations are assumed to execute one at a time, in the order specified by the program's code.

Uniprocessor memory model: Memory operations are assumed to execute one at a time, i.e., each operation executes atomically with respect to other operations, in the order specified by the program's code. There is therefore a total ordering on the memory operations, and a read is assumed to return the value of the last write to the same location, where "last" is precisely defined by program order.

Uniprocessor memory model: A processor's speed is orders of magnitude faster than memory access speeds, so compilers and hardware perform various optimizations to hide memory latency. These can result in the overlapping, reordering, or elimination of memory operations. This is acceptable in a single-threaded program as long as program order is preserved between memory operations to the same location, thereby preserving control and data dependences.

Uniprocessor Optimizations: Reordering optimizations include compiler optimizations (register allocation, code motion, etc.) and hardware optimizations occurring at various levels: the processor may issue operations out of order, write buffers cause reordering of a write followed by a read (W → R) to different locations, and non-blocking caches can cause reordering. Reorderings that preserve control and data dependences are safe, since memory is observed by only a single processor/thread (see the sketch below).
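A minimal sketch (hypothetical function and variable names) of why such reorderings are invisible on a uniprocessor: the two stores below touch different locations and have no dependence between them, so the compiler or hardware may complete them in either order without changing what the single thread can observe.

    // Sketch: a single thread cannot tell whether the two independent stores
    // below are reordered, because any later read of a or b still returns the
    // value of the last store to that same location in program order.
    int a = 0, b = 0;

    int f() {
        a = 1;        // store to a
        b = 2;        // store to b; may complete before the store to a
        return a + b; // always returns 3 under any legal reordering
    }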

Multiprocessors are differentiated based on the communication mechanism between nodes. Message passing: each processor has its own memory, and communication is via messages. Shared memory: a single address space, with communication through read/write operations to shared memory.

Shared Memory Multiprocessors: In a typical scalable shared-memory multiprocessor system, the memory is distributed among the nodes (hence local vs. remote memory accesses), the nodes are connected by a general interconnection network whose paths take varying amounts of time, and the processor environment within a node is similar to that of a uniprocessor (write buffers, caches, etc.).

Shared Memory Multiprocessors: Optimizations to hide memory latency assume greater importance in multiprocessors, because memory latency is greater: an operation may involve a remote node, and cache miss rates are larger due to communication among processors.

Shared Memory Model: Multiple processors concurrently operate on shared memory, and all processors need a common view of that memory. This is complicated by the compiler and hardware optimizations required to support a single address space efficiently, which can cause processors to observe distinct views of shared memory. A conceptual model for the semantics of memory operations is therefore needed so that programmers can use shared memory correctly.

Sequential Consistency model: Intuitively, the execution of a multi-threaded program on a multiprocessor should behave the same as an interleaved execution of the threads on a uniprocessor. Consider the multiprocessor as a collection of sequential uniprocessors accessing a common memory, with only a single processor accessing memory at a time. [Figure: processors P1, P2, P3, ..., Pn connected to a single shared memory.]

Sequential Consistency model. Definition: a multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Sequential consistency thus requires the appearance of (1) maintenance of program order among operations from each individual processor, and (2) a single sequential order among operations from all processors, i.e., operations appear to execute one at a time, atomically with respect to one another.

Sequential Consistency model. Initially: Flag1 = Flag2 = 0. P1: Flag1 = 1; if (Flag2 == 0) enter critical section. P2: Flag2 = 1; if (Flag1 == 0) enter critical section. This illustrates the importance of maintaining program order among operations from a single processor. Notice that each processor's read and write are to different memory locations. Sequential consistency is violated if P1 or P2 reorders its write and read, allowing both to read the value 0 and enter the critical section.
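A sketch of this Dekker-style example in C++ (the flag names follow the slide; the thread setup is illustrative). With the default memory_order_seq_cst, C++ atomics give sequentially consistent behavior, so at most one thread can observe 0 and enter the critical section; with memory_order_relaxed, the forbidden outcome becomes possible, mirroring the violation described above.

    #include <atomic>
    #include <thread>

    std::atomic<int> Flag1{0}, Flag2{0};

    void p1() {
        Flag1.store(1);              // seq_cst store (default)
        if (Flag2.load() == 0) {     // seq_cst load (default)
            // critical section: under SC, at most one thread gets here
        }
    }

    void p2() {
        Flag2.store(1);
        if (Flag1.load() == 0) {
            // critical section
        }
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join();
        t2.join();
    }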

Sequential Consistency model. Initially: A = B = 0. P1: A = 1. P2: if (A == 1) B = 1. P3: if (B == 1) reg1 = A. This illustrates the importance of atomic execution of memory operations. Sequential consistency is violated if P1's write of A is seen by P2 but not by P3, while P2's write of B is seen by P3, allowing reg1 to get the value 0.
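A sketch of the same three-processor example with C++ seq_cst atomics (names from the slide; the driver code is illustrative). Under sequential consistency, if P2 sees A == 1 and P3 sees B == 1, then P3 must also see A == 1, so reg1 cannot end up 0.

    #include <atomic>
    #include <thread>

    std::atomic<int> A{0}, B{0};
    int reg1 = -1;  // -1 means "P3 never read A"

    void p1() { A.store(1); }
    void p2() { if (A.load() == 1) B.store(1); }
    void p3() { if (B.load() == 1) reg1 = A.load(); }  // under SC, reg1 != 0 here

    int main() {
        std::thread t1(p1), t2(p2), t3(p3);
        t1.join(); t2.join(); t3.join();
    }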

Implementing Sequential Consistency: architectures without caches; architectures with caches

SC architectures without caches: SC violation due to write buffers with bypassing capability. Initially: Flag1 = Flag2 = 0. P1: Flag1 = 1; if (Flag2 == 0) enter CS. P2: Flag2 = 1; if (Flag1 == 0) enter CS.

SC architectures without caches: SC violation due to write buffers with bypassing capability. Each processor buffers its write and allows a subsequent read to a different address to bypass the buffered write, so both reads of the flags can return the value 0, allowing simultaneous entry into the critical section. This optimization is safe on a uniprocessor system, since a read whose address matches a buffered write gets its value from the write buffer.

SC architectures without caches SC violation due to overlapping Write Operations

SC architectures without caches: SC violation due to overlapping write operations. A general interconnection network alleviates the serialization bottleneck of a bus-based design, and multiple memory modules provide the ability to service multiple operations simultaneously. Problem: write operations issued by the same processor to locations in different memory modules may complete out of order. If P1's write of Head completes before its write of Data, P2 can see the new Head but the old Data. Such out-of-order completion is harmless on a uniprocessor, where memory accesses complete sequentially. Solution: delay injecting the next write into the network until the processor receives an acknowledgment that its previous write has reached its target memory module.
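A sketch of the Data/Head pattern from this slide in C++ (the stored value and the spin loop are illustrative). Under SC, a consumer that observes Head == 1 must also observe the new Data; if the two writes may complete out of order, the consumer can see the new Head but the old Data.

    #include <atomic>

    std::atomic<int> Data{0}, Head{0};

    void producer() {
        Data.store(2000);  // under SC this write must appear to complete first
        Head.store(1);     // signals that Data has been written
    }

    int consumer() {
        while (Head.load() == 0) { /* spin until Head is set */ }
        return Data.load();  // under SC this must return 2000, not the old 0
    }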

SC architectures without caches Non-Blocking Read Operations

SC architectures without caches: SC violation due to non-blocking read operations. If P2 issues its reads in an overlapped fashion, it is possible for P2's read of Data to arrive at memory before P1's write of Data, while P2's read of Head reaches memory after P1's write of Head. This leads to a non-sequentially-consistent outcome.

SC architectures with caches: The replication of shared data introduces three additional issues. (1) The presence of multiple copies requires a mechanism, referred to as the cache coherence protocol, to propagate a newly written value to all cached copies of the modified location. (2) Detecting when a write is complete (to preserve program order between a write and its following operations) requires more transactions in the presence of replication. (3) Propagating changes to multiple copies is inherently a non-atomic operation, making it more challenging to preserve the illusion of atomicity for writes with respect to other operations.

SC architectures with caches: cache coherence model. The basic requirements commonly associated with cache coherence are that a write is eventually made visible to all processors, and that writes to the same location appear to be seen in the same order by all processors (referred to as serialization of writes to the same location). These requirements are not strong enough for sequential consistency, which requires a single serialization of operations to all locations and the maintenance of program order among operations from each individual processor.

SC architectures with caches: cache coherence protocol. A cache coherence protocol is the mechanism that propagates a newly written value to the cached copies of the modified location, typically by either invalidating those copies or updating them to the newly written value. A memory consistency model places an early and a late bound on when the new value can be propagated to any given processor.

SC architectures with caches: detecting the completion of write operations. Assume each processor has a write-through cache and that P2 has Data in its cache. If P1 proceeds with its write of Head after its write of Data reaches memory, but before the resulting update/invalidation reaches P2, it is possible for P2 to see the new value of Head but the old cached value of Data.

SC architectures with caches: detecting the completion of write operations (cont.). Solution: P1 waits for P2's cached copy of Data to be invalidated or updated. Target caches acknowledge the receipt of an invalidation/update message, and when the acknowledgments from all target caches have been collected, the processor that issued the write is notified that the write is complete.
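A highly simplified sketch of the acknowledgment-counting idea (all names and the structure are hypothetical; a real coherence controller is implemented in hardware): the writer's node tracks outstanding invalidation/update messages and considers the write complete only when every target cache has acknowledged.

    #include <atomic>

    // Hypothetical sketch of "wait for all invalidation/update acks before the
    // write is considered complete".
    struct WriteTracker {
        std::atomic<int> pending_acks{0};

        // Called when the write is issued and invalidate/update messages are
        // sent to num_sharers remote caches.
        void begin_write(int num_sharers) { pending_acks.store(num_sharers); }

        // Called once per acknowledgment received from a target cache.
        void on_ack() { pending_acks.fetch_sub(1); }

        // The issuing processor may proceed past this write only when true.
        bool write_complete() const { return pending_acks.load() == 0; }
    };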

SC architectures with caches: maintaining atomicity of writes, condition 1. Sequential consistency is violated if P3 and P4 see the writes to A in a different order and hence read different values for A. Solution: writes to the same location must be serialized; all update/invalidate messages for a given location originate from a single point, and the ordering of these messages between a given source and destination is preserved by the network.
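A sketch of the write-serialization requirement in C++ (the variable and result names are illustrative): two readers that each read A twice must agree on the order of the two writes, so they cannot end up with opposite views of which value came last.

    #include <atomic>

    std::atomic<int> A{0};
    int r1, r2, r3, r4;

    void p1() { A.store(1); }
    void p2() { A.store(2); }

    // Each reader samples A twice. Serialization of writes to A forbids, e.g.,
    // r1 == 1, r2 == 2 (P3 sees 1 before 2) while r3 == 2, r4 == 1 (P4 sees
    // 2 before 1).
    void p3() { r1 = A.load(); r2 = A.load(); }
    void p4() { r3 = A.load(); r4 = A.load(); }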

SC architectures with caches: maintaining atomicity of writes, condition 2. A and B are cached by all processors. Initially: A = B = 0. P1: A = 1. P2: if (A == 1) B = 1. P3: if (B == 1) reg1 = A. Sequential consistency is violated if the update for P1's write of A reaches P2 but not P3, the update for P2's write of B reaches P3 before the update for P1's write of A, and P3 then returns the old value of A from its cache.

SC architectures with caches: maintaining atomicity of writes, condition 2 (cont.). Cause of the SC violation: P2 is allowed to read the new value of A before the update message reaches P3. Solution: prohibit a read from returning a newly written value until all cached copies have acknowledged the receipt of the invalidation or update messages generated by that write.

Relaxed Memory Models allow performance-enhancing optimizations. They are differentiated and compared based on how they relax program order and how they relax write atomicity, and they provide mechanisms (safety nets) to override the program-order relaxations. Of the relaxations considered, the first three concern program order among operations to different locations (the W → R, W → W, and R → RW orderings), and the last two concern write atomicity (allowing a read to return the value of another processor's write, or of its own processor's write, early).

Relaxed Memory models Different model implementations

Relaxing W → R order. Models: IBM 370, SPARC Total Store Order (TSO), and Processor Consistency (PC). They differ in how they relax write atomicity: IBM 370 enforces strict atomicity; TSO relaxes it only by allowing a read to return the value of a buffered write from the same processor; PC places no such restriction. Initially: A = Flag1 = Flag2 = 0. P1: Flag1 = 1; A = 1; r1 = A; r2 = Flag2. P2: Flag2 = 1; A = 2; r3 = A; r4 = Flag1. Result: r1 = 1, r3 = 2, r2 = r4 = 0. This result is possible with TSO and PC, but not with IBM 370.
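A sketch of this example with relaxed C++ atomics, which also permit the outcome on the slide (r1 == 1, r3 == 2, r2 == r4 == 0): each thread may read its own write early, while its read of the other flag bypasses the other thread's still-buffered flag write. The thread driver is illustrative.

    #include <atomic>
    #include <thread>

    std::atomic<int> A{0}, Flag1{0}, Flag2{0};
    int r1, r2, r3, r4;

    void p1() {
        Flag1.store(1, std::memory_order_relaxed);
        A.store(1, std::memory_order_relaxed);
        r1 = A.load(std::memory_order_relaxed);      // may return 1 (own write)
        r2 = Flag2.load(std::memory_order_relaxed);  // may still return 0
    }

    void p2() {
        Flag2.store(1, std::memory_order_relaxed);
        A.store(2, std::memory_order_relaxed);
        r3 = A.load(std::memory_order_relaxed);      // may return 2 (own write)
        r4 = Flag1.load(std::memory_order_relaxed);  // may still return 0
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join(); t2.join();
    }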

Relaxing W → R order (cont.). Initially: A = B = 0. P1: A = 1. P2: if (A == 1) B = 1. P3: if (B == 1) register1 = A. Result: B = 1, register1 = 0. This result is possible with PC, but not with TSO or IBM 370.

Relaxing W → R order (cont.): safety nets. IBM 370: inserting a serialization instruction (a memory synchronization instruction such as compare-and-swap, or a non-memory instruction such as a branch) between the write and the read forces them to serialize. TSO and PC: replacing the write or the read with a read-modify-write enforces serialization.
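A sketch of the safety-net idea using C++ fences as a stand-in for the hardware mechanisms named above (this is not the IBM 370 or SPARC instruction, just an analogous C++ construct): a full fence between the write and the subsequent read prevents the W → R reordering, so when both threads use it, at most one can read 0 and enter the critical section.

    #include <atomic>

    std::atomic<int> Flag1{0}, Flag2{0};

    // Returns true if this thread may enter the critical section.
    bool p1_enters_cs() {
        Flag1.store(1, std::memory_order_relaxed);
        // Full fence: prevents the following read from bypassing the write
        // above (the C++ analogue of a serialization instruction / RMW).
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return Flag2.load(std::memory_order_relaxed) == 0;
    }

    bool p2_enters_cs() {
        Flag2.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return Flag1.load(std::memory_order_relaxed) == 0;
    }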

Relaxing W → W order: SPARC Partial Store Order (PSO). Safety net: insert an STBAR instruction between writes to different locations. The writes in the write buffer that are ahead of the STBAR are completed before any of the writes behind the STBAR are attempted.
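A sketch of the STBAR idea using a C++ release fence as a stand-in for a write-write barrier (the names reuse the earlier Data/Head example; this is an analogy, not the SPARC instruction): the fence ensures the store to Data is made visible before the store to Head.

    #include <atomic>

    std::atomic<int> Data{0}, Head{0};

    void producer_with_write_barrier() {
        Data.store(2000, std::memory_order_relaxed);
        // Analogous to STBAR: earlier writes complete before later writes.
        std::atomic_thread_fence(std::memory_order_release);
        Head.store(1, std::memory_order_relaxed);
    }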

Relaxing all program orders. Example: Weak Ordering. Safety net: inserting a synchronization operation between regions of data operations forces the order between the two regions to be preserved; data operations within a region may be reordered. A synchronization operation is issued only after all previous data operations have completed, and a data operation is issued only after a previous synchronization operation has completed.
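A sketch of the weak-ordering programming style using C++ acquire/release atomics as a stand-in for the model's synchronization operations (the flag and data names are illustrative): ordinary data operations within a region may be reordered freely, while the synchronization operations separate the regions and order them with respect to each other.

    #include <atomic>

    std::atomic<bool> ready{false};  // plays the role of a synchronization variable
    int x = 0, y = 0;                // ordinary data locations

    void writer() {
        x = 1;                                         // data region 1:
        y = 2;                                         // order within it is free
        ready.store(true, std::memory_order_release);  // synchronization operation
    }

    void reader() {
        if (ready.load(std::memory_order_acquire)) {   // synchronization operation
            int sum = x + y;  // data region 2: guaranteed to see x == 1, y == 2
            (void)sum;
        }
    }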