Lecture 4. Memory Consistency Models

COM503 Parallel Computer Architecture & Programming
Prof. Taeweon Suh
Computer Science Education, Korea University

Memory Consistency
What do you expect from the following code?

  Processor 1          Processor 2
  A = 1;               while (flag == 0);
  flag = 1;            print A;

Program order among a processor's accesses to different locations is neither implied nor enforced by coherence.
Coherence requires only that the new value of A eventually become visible to P2 (not necessarily before the new value of flag is observed).
Note that an x86 CPU is superscalar with out-of-order (OOO) execution.
What would you do if you want “print A” to print “1”?
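
One possible answer to the question above, sketched with C11 atomics rather than anything taken from the lecture: publish flag with a release store and wait on it with an acquire load. The function names producer and consumer are hypothetical, and the sketch assumes a C11 compiler that provides <stdatomic.h>.

```c
#include <stdatomic.h>

/* Hypothetical names: A is the payload, flag is the "ready" signal. */
int A = 0;
atomic_int flag = 0;

/* Processor 1: write the data, then publish it.
   The release store keeps the write to A from being reordered after it. */
void producer(void)
{
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Processor 2: spin until the flag is observed, then read the data.
   The acquire load keeps the read of A from being reordered before it. */
int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
        /* busy-wait */
    }
    return A; /* 1 once flag == 1 has been observed */
}
```

In a real program producer and consumer would run on two different threads; the release/acquire pair is what rules out the consumer returning the stale value 0.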

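Demo

  #include <stdio.h>
  #include <omp.h>

  int main()
  {
      int a, b;
      int a_tmp, b_tmp;

      a = 0;
      b = 0;

      /* a and b are shared; each thread gets its own copies of the temporaries */
      #pragma omp parallel num_threads(2) shared(a, b) private(a_tmp, b_tmp)
      {
          //printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());

          /* one thread performs the writes; the other does not wait for it */
          #pragma omp single nowait
          {
              a = 1;
              b = 2;
              //while(1);
          }

          /* deliberately unsynchronized reads of the shared variables */
          a_tmp = a;
          b_tmp = b;
          printf("A = %d, B = %d\n", a_tmp, b_tmp);
      }
      return 0;
  }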

Memory Consistency
Use a barrier.

  Processor 1          Processor 2
  A = 1
  ----------- Barrier (b1) -----------
                       print A

A barrier is often built using reads and writes to ordinary shared variables (e.g., b1 above) rather than a special barrier operation.
Coherence says nothing at all about the order among these accesses.
It would be interesting to see how OpenMP (or Pthreads) implements a barrier at the low level.
However, CPUs typically provide memory barrier (fence) instructions, such as sfence, lfence, and mfence in x86.

Memory Consistency
So, clearly we need something more than coherence to give a shared address space clear semantics.
That is, we need an ordering model that programmers can use to reason about the possible results, and hence the correctness, of their programs.
A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., to become visible to the processors) with respect to one another.
It covers operations to the same location or to different locations, and by the same process or by different processes, so in this sense memory consistency subsumes coherence.

  Processor 1          Processor 2
  A = 1
  ----------- Barrier (b1) -----------
                       print A

Programmer’s Abstraction of Memory Subsystem
[Figure: processors P1, P2, ..., Pn, each with its own partial (program) order, connected through a single switch to Memory]
Processors issue memory references as per program order.
The “switch” is randomly set after each memory reference.
Interleaving the partial (program) orders for different processes may yield a large number of possible total orders.

Sequential Consistency
Sequential consistency (SC) was formalized by Lamport in 1979.
  A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program.
Implementing SC requires that the system (s/w and h/w) follow 2 constraints:
  Program order requirement: memory operations of a process must appear to become visible (to itself and others) in program order.
  Write atomicity: all writes (to any location) should appear to all processors to have occurred in the same order.

Sequential Consistency

  /* Assume initial values of A and B are 0 */
  Processor 1          Processor 2
  (1a) A = 1;          (2a) print B;
  (1b) B = 2;          (2b) print A;

What values of A and B do you expect to be printed on P2?
  (A, B) = (0, 0)? (A, B) = (1, 2)? (A, B) = (1, 0)? (A, B) = (0, 2)?
Under SC, the result (0, 2) for (A, B) is not allowed, since it would then appear that the writes of A and B by P1 executed out of program order.
The execution order 1b, 2a, 2b, 1a is not sequentially consistent.
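
The claim that (0, 2) cannot occur under SC can be checked by brute force. The sketch below is an illustration added here, not part of the lecture: it enumerates every interleaving of (1a), (1b), (2a), (2b) that respects each processor's program order and records which (A, B) pairs get printed. The names seen, explore, and enumerate_sc_outcomes are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* seen[a][b] is set when some SC interleaving prints A == a and B == b.
   a is 0 or 1, b is 0 or 2. */
bool seen[2][3];

/* Pick the next operation from P1 (i1 operations done so far) or from
   P2 (i2 operations done so far), always respecting program order. */
void explore(int i1, int i2, int A, int B, int printedA, int printedB)
{
    if (i1 == 2 && i2 == 2) {
        seen[printedA][printedB] = true;
        return;
    }
    if (i1 == 0) explore(1, i2, 1, B, printedA, printedB); /* (1a) A = 1   */
    if (i1 == 1) explore(2, i2, A, 2, printedA, printedB); /* (1b) B = 2   */
    if (i2 == 0) explore(i1, 1, A, B, printedA, B);        /* (2a) print B */
    if (i2 == 1) explore(i1, 2, A, B, A, printedB);        /* (2b) print A */
}

/* Start from A = B = 0 and visit all six legal interleavings. */
void enumerate_sc_outcomes(void)
{
    memset(seen, 0, sizeof seen);
    explore(0, 0, 0, 0, 0, 0);
}
```

After enumerate_sc_outcomes() returns, seen[0][2] stays false while the other three outcomes are reachable, which matches the slide.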

How to Impose Constraint?
In practice, to constrain compiler optimizations, multithreaded and parallel programs annotate the variables or memory references that are used to preserve orders.
A particularly stringent example is the use of the volatile qualifier in a variable declaration.
It prevents the variable from being register allocated, and any memory operation on the variable from being reordered with respect to operations before or after it in program order.
How much ordering volatile actually provides is language- and compiler-specific: in standard C and C++ it constrains only the compiler's handling of volatile accesses and does not by itself insert hardware fences.

Reordering Impact Example
How would reordering the memory operations affect semantics in a parallel program running on a multiprocessor, and in a threaded program in which the two processes are interleaved on the same processor?

  Processor 1          Processor 2
  A = 1;               while (flag == 0);
  flag = 1;            print A;

The compiler may reorder the writes to A and flag with no impact on a sequential program.
Such a reordering violates our intuition for both parallel programs and multithreaded uniprocessor programs.
For many compilers, these reorderings can be avoided by declaring the variable flag to be of type volatile integer (instead of integer).

Problems with SC
The SC model provides intuitive semantics to the programmer.
  The program order and a consistent interleaving across processes can be quite easily implemented.
However, its drawback is that it restricts many of the performance optimizations that modern uniprocessor compilers and microprocessors employ.
  With the high cost of memory access latency, computer systems achieve higher performance by reordering or overlapping multiple memory or communication operations from a processor.
  Preserving the sufficient conditions for SC does not allow for much reordering or overlap in hardware.
  With SC, the compiler cannot reorder memory accesses even if they are to different locations, which disallows critical performance optimizations such as code motion, common-subexpression elimination, software pipelining, and even register allocation.

Reality Check
Unfortunately, many of the optimizations that are commonly employed in both compilers and processors violate the SC property.
Explicitly parallel programs use uniprocessor compilers, which are concerned only with preserving dependences to the same location.
  So compilers routinely reorder accesses to different locations within a process, and a processor may in fact issue accesses out of the program order seen by the programmer.
Advanced compiler optimizations can change the order in which different memory locations are accessed, or can even eliminate memory operations.
  Examples: common subexpression elimination, constant propagation, register allocation, and loop transformations such as loop splitting, loop reversal, and blocking.

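Example: Register Allocation
How can register allocation lead to a violation of SC even if the hardware satisfies SC?

  Before register allocation
  P1                   P2
  B = 0;               A = 0;
  A = 1;               B = 1;
  u = B;               v = A;

  After register allocation
  P1                   P2
  r1 = 0;              r2 = 0;
  A = 1;               B = 1;
  u = r1;              v = r2;
  B = r1;              A = r2;

The result (u, v) = (0, 0) is disallowed under SC for the original code (treating B = 0 and A = 0 as establishing the initial values), yet the register-allocated code always produces it, because u and v are read from r1 and r2.
A uniprocessor compiler might easily perform these optimizations in each process.
They are valid for sequential programs since the reordered accesses are to different locations.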

Problems with SC
Providing SC at the programmer’s interface implies supporting SC at lower-level interfaces.
If the sufficient conditions for SC are met, a processor waits for an access to complete before issuing the next one.
  So, most of the latency suffered by memory references is directly seen by processors as stall time.
  Although a processor may continue executing non-memory instructions while a single outstanding memory reference is being serviced, the expected benefit from such overlap is tiny, since even without ILP (Instruction-Level Parallelism) every third instruction on average is a memory reference.
So, we need to do something about this performance problem.
Programmer’s interface: we focus mainly on the consistency model as seen by the programmer, that is, at the interface between the programmer and the rest of the system composed of the compiler, operating system, and hardware.
  For example, a processor may preserve all program orders presented to it among memory operations, but if the compiler has already reordered operations, then programmers can no longer reason with the simple model exported by the hardware.

Solutions?
One approach is to preserve SC at the programmer’s interface, but find ways to hide the long stalls from the processor.
1st technique:
  The compiler does not reorder memory operations, but latency tolerance techniques such as data prefetching or multithreading are used to overlap data transfers with one another or with computation.
  However, the actual read and write operations are not issued before previous ones complete in program order.

Solutions?
Or, 2nd technique:
  The compiler reorders operations as long as it can guarantee that SC will not be violated in the results.
    Compiler algorithms have been developed for this (Shasha and Snir 1988; Krishnamurthy and Yelick 1994, 1995).
  At the hardware level, memory operations are issued and executed out of program order, but are guaranteed to become visible to other processors in program order.
    This approach is well suited to dynamically scheduled processors that use an instruction lookahead buffer to find independent instructions to issue.
      Instructions are inserted in the lookahead buffer in program order.
      They are guaranteed to retire from the lookahead buffer in program order.
    Speculative execution, such as:
      Branch prediction.
      Speculative reads: values returned by reads are used even before they are known to be correct; later, roll back if they are incorrect.
Or, change the memory consistency model itself!

Relaxed Consistency Models
A completely different way to overcome the performance limitations imposed by SC is to change the memory consistency model itself.
  That is, do not guarantee such strong ordering constraints to the programmer, but still retain semantics that are intuitive enough to be useful.
The intuition behind the relaxed models is that SC is usually too conservative.
  Many of the orders it preserves are not really needed to satisfy a programmer’s intuition in most situations.
By relaxing the ordering constraints, these relaxed consistency models allow the compiler to reorder accesses before presenting them to the hardware, at least to some extent.
At the hardware level, they allow multiple memory accesses from the same process not only to be outstanding at a time, but even to complete or become visible out of order, thus allowing much of the latency to be overlapped and hidden from the processor.

Example
[Figure: two panels, “Ordering under SC” and “Ordering necessary for correct program semantics”, each showing the code below]

  P1                   P2
  A = 1;               while (flag == 0);
  B = 1;               u = A;
  flag = 1;            v = B;

Writes to variables A and B by P1 can be reordered without affecting the results.
  All we must ensure is that both of them complete before the variable flag is set to 1.
Reads of variables A and B can be reordered at P2 once flag has been observed to change to the value 1.
Even with these reorderings, the results look just like those of an SC execution.

Reality Check
It would be wonderful if system software or hardware could automatically detect which program orders are critical to maintaining SC semantics and allow the others to be violated for higher performance (Shasha and Snir 1988).
However, the problem is intractable (in fact, undecidable) for general programs, and inexact solutions are often too conservative to be very useful.

Relaxed Consistency Model
A relaxed consistency model must specify 2 things:
  Which program orders among memory operations are guaranteed to be preserved by the system, including whether write atomicity will be maintained.
  If not all program orders are guaranteed to be preserved by default, which mechanisms the system provides for a programmer to enforce order explicitly when desired.
As should be clear by now, the compiler and the hardware have their own system specifications, but we focus on the specification that the two together (or the system as a whole) present to the programmer.
For a processor architecture, the specification it exports governs the reorderings that it allows, and it also provides the order-preserving primitives.
  It is often called the processor’s memory model.

Relaxed Consistency Model
A programmer may use the consistency model to reason about correctness and insert the appropriate order-preserving mechanisms.
  However, this is a very low-level interface for a programmer.
  Parallel programming is challenging enough without having to think about reorderings and write atomicity.
What the programmer wants is a methodology for writing “safe” programs.
So, this is a contract: if the program follows certain high-level rules or provides enough program annotations (such as synchronization), then any system on which the program runs will always guarantee a sequentially consistent execution, regardless of the default orderings permitted by the system specifications.

Relaxed Consistency Model
The programmer’s responsibility is to use the rules and annotations, which hopefully does not involve reasoning at the level of potential orderings.
The system’s responsibility is to use the rules and annotations as constraints to maintain the illusion of sequential consistency.

Ordering Specifications
TSO (Total Store Ordering): Sindhu, Frailong, and Cekleov 1991, Sun Microsystems
PC (Processor Consistency): Goodman 1989 and Gharachorloo 1990, Intel Pentium
PSO (Partial Store Ordering)
WO (Weak Ordering): Dubois, Scheurich, and Briggs 1986
RC (Release Consistency): Gharachorloo 1990
RMO (Relaxed Memory Ordering): Weaver and Germond 1994, Sun Sparc V8 and V9
Digital Alpha (Sites 1992) and IBM/Motorola PowerPC (May et al. 1994) models

1. Relaxing the Write-to-Read Program Order
The main motivation is to allow the hardware to hide the latency of write operations.
  While the write miss is still in the write buffer and not yet visible to other processors, the processor can issue and complete reads that hit in its cache.
The models in this class (TSO and PC) preserve the programmer’s intuition quite well, for the most part, even without any special operations.
  TSO and PC allow a read to bypass an earlier incomplete write in program order.
  TSO and PC preserve the ordering of writes in program order.
  However, PC does not guarantee write atomicity.
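
To make the read-bypasses-write behavior concrete, here is a toy, single-threaded model of two processors that each own a one-entry store buffer. It is a sketch under simplifying assumptions (one buffered write per CPU, no caches), and every name in it (struct machine, store, load, drain, run_write_then_read) is hypothetical rather than taken from any real simulator.

```c
#include <stdbool.h>

enum { VAR_A = 0, VAR_B = 1, NUM_VARS = 2, NUM_CPUS = 2 };

/* Toy model: shared memory plus a one-entry store buffer per processor. */
struct machine {
    int  memory[NUM_VARS];
    bool pending[NUM_CPUS];      /* does this CPU hold a buffered write? */
    int  pending_var[NUM_CPUS];
    int  pending_val[NUM_CPUS];
};

/* Make the buffered write (if any) visible to all processors. */
void drain(struct machine *m, int cpu)
{
    if (m->pending[cpu]) {
        m->memory[m->pending_var[cpu]] = m->pending_val[cpu];
        m->pending[cpu] = false;
    }
}

/* A write enters the store buffer. An older buffered write is drained
   first, so writes still reach memory in program order. */
void store(struct machine *m, int cpu, int var, int val)
{
    drain(m, cpu);
    m->pending[cpu] = true;
    m->pending_var[cpu] = var;
    m->pending_val[cpu] = val;
}

/* A read is served from the CPU's own buffered write if the variable
   matches, otherwise from memory. It never waits for the buffer to drain. */
int load(const struct machine *m, int cpu, int var)
{
    if (m->pending[cpu] && m->pending_var[cpu] == var)
        return m->pending_val[cpu];
    return m->memory[var];
}

/* P1: A = 1; read B.   P2: B = 1; read A.
   With wait_for_writes set, each CPU drains its buffer before reading. */
void run_write_then_read(bool wait_for_writes, int *b_seen_by_p1, int *a_seen_by_p2)
{
    struct machine m = {0};
    store(&m, 0, VAR_A, 1);
    store(&m, 1, VAR_B, 1);
    if (wait_for_writes) {
        drain(&m, 0);
        drain(&m, 1);
    }
    *b_seen_by_p1 = load(&m, 0, VAR_B);
    *a_seen_by_p2 = load(&m, 1, VAR_A);
}
```

Without the drain, both reads return 0, an outcome no SC interleaving allows. Draining before the reads restores the SC-looking result in this model.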

Write Atomicity
Write atomicity ensures that nothing a processor does after it has seen the new value produced by a write (e.g., another write that it issues) becomes visible to other processes before they too have seen the new value for that write.
  All writes (to any location) should appear to all processors to have occurred in the same order.
Write serialization says that writes to the same location should appear to all processors to have occurred in the same order.
  Page 288 in the textbook.

Write Atomicity Example
This example illustrates the importance of write atomicity for sequential consistency.

  Processor 1          Processor 2            Processor 3
  A = 1;               while (A == 0);        while (B == 0);
                       B = 1;                 print A;

What happens if P2 writes B before it is guaranteed that P3 has seen the new value of A?

Example Code Sequences
Is SC guaranteed under TSO and PC?

  (a)  P1                   P2
       A = 1;               while (flag == 0);
       flag = 1;            print A;

  (b)  P1                   P2
       A = 1;               print B;
       B = 1;               print A;

  (c)  P1                   P2                     P3
       A = 1;               while (A == 0);        while (B == 0);
                            B = 1;                 print A;

  (d)  P1                   P2
       A = 1;               B = 1;
       print B;             print A;

A popular software-only mutual exclusion algorithm called Dekker’s algorithm (which is used in the absence of hardware support for atomic read-modify-write operations) relies on the property that A and B will not both be read as 0 in (d).

How to Ensure SC Semantics?
To ensure SC semantics when desired (e.g., to port a program written under SC assumptions to a TSO or PC system), we need mechanisms to enforce 2 types of extra orderings.
A read does not complete before an earlier write in program order (applies to both TSO and PC).
  Sun’s Sparc V9 provides memory barrier (MEMBAR) or fence instructions of different flavors that can ensure any desired ordering.
  MEMBAR prevents any read that follows it in program order from issuing before all writes that precede it have completed.
  On architectures that do not provide memory barrier instructions, it is possible to achieve this effect by substituting an atomic read-modify-write operation or sequence for the original read.
  A read-modify-write is treated as being both a read and a write, so it cannot be reordered with respect to previous writes in these models.
Write atomicity for a read operation (applies to PC).
  Replacing a read with a read-modify-write also guarantees write atomicity at that read on machines supporting the PC model.
  Refer to Adve et al. 1993, referenced in the textbook.
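
In portable C11 the role that MEMBAR plays above can be approximated with atomic_thread_fence(memory_order_seq_cst). The sketch below applies it to a write-then-read pattern on two variables; the function names are hypothetical, and which instruction the fence compiles to depends on the compiler and target.

```c
#include <stdatomic.h>

atomic_int A = 0;
atomic_int B = 0;

/* P1: write A, then read B. The sequentially consistent fence keeps the
   read of B from completing before the earlier write to A. */
int p1_write_a_read_b(void)
{
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&B, memory_order_relaxed);
}

/* P2: the mirror image, write B and then read A. */
int p2_write_b_read_a(void)
{
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&A, memory_order_relaxed);
}
```

When the two functions run on different threads, the fences rule out the outcome in which both return 0. Dropping either fence allows that outcome again on hardware that lets reads bypass buffered writes.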

2. Relaxing the W-R and W-W Program Orders
This class allows writes and reads to bypass earlier writes (to different locations).
  It enables multiple write misses to be fully overlapped and to become visible out of program order.
Sun Sparc’s Partial Store Ordering (PSO) model belongs to this category.
The only additional instruction we need over TSO is one that enforces W-W ordering in a process’s program order.
  In Sun’s Sparc V9, this can be achieved by using a MEMBAR instruction.
  Sun’s Sparc V8 provides a special instruction called store barrier (STBAR) to achieve this.

3. Relaxing All Program Orders
No program orders are guaranteed by default.
These models are particularly well matched to superscalar processors whose implementation allows for proceeding past read misses to other memory locations.
Prominent models in this category:
  Weak ordering (WO): the seminal model
  Release consistency (RC)
  Sparc V9 relaxed memory ordering (RMO)
  Digital Alpha model
  IBM PowerPC model

Weak Ordering (WO)
The motivation of WO is quite simple.
  Most parallel programs use synchronization operations to coordinate accesses to data when necessary.
  Between synchronization operations, they do not rely on the order of accesses being preserved.

  P1, P2, … Pn
  ...
  Lock(TaskQ)
      newTask→next = Head;
      if (Head != NULL)
          Head→prev = newTask;
      Head = newTask;
  UnLock(TaskQ)
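
The task-queue fragment above can be written out as compilable C. This is a minimal sketch, assuming a test-and-set spinlock built on C11 atomic_flag stands in for Lock/UnLock. The struct and function names are hypothetical, and a production queue would more likely use a mutex from the threading library.

```c
#include <stdatomic.h>
#include <stddef.h>

struct task {
    struct task *next;
    struct task *prev;
    int id;
};

/* Hypothetical queue type: a list head guarded by a spinlock. */
struct task_queue {
    atomic_flag lock;
    struct task *head;
};

void task_queue_init(struct task_queue *q)
{
    atomic_flag_clear(&q->lock);
    q->head = NULL;
}

/* Lock(TaskQ): acquire ordering keeps the accesses inside the critical
   section from moving above the lock operation. */
void tq_lock(struct task_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

/* UnLock(TaskQ): release ordering makes the list updates visible before
   the lock is seen as free. */
void tq_unlock(struct task_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Insert new_task at the head of the doubly linked list. */
void task_queue_push(struct task_queue *q, struct task *new_task)
{
    tq_lock(q);
    new_task->prev = NULL;
    new_task->next = q->head;
    if (q->head != NULL)
        q->head->prev = new_task;
    q->head = new_task;
    tq_unlock(q);
}
```

The three pointer updates inside the critical section may be reordered freely among themselves. Only their ordering relative to the lock and unlock operations matters, which is the observation WO builds on.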

Illustration of WO
  Block 1: Read/Write …
  Sync (Acquire)
  Block 2: Read/Write …
  Sync (Release)
  Block 3: Read/Write …
Read, write, and read-modify-write operations in blocks 1, 2, and 3 can be arbitrarily reordered within their own block.

Weak Ordering (WO)
The intuitive semantics are not violated by any program reorderings as long as synchronization operations are not reordered with respect to data accesses.
Sufficient conditions to ensure a WO system:
  Before a synchronization operation is issued, the processor waits for all previous operations in program order to have completed.
  Similarly, memory accesses that follow the synchronization operation are not issued until the synchronization operation completes.
When synchronization operations are infrequent, as in many parallel programs, WO typically provides considerable reordering freedom to the hardware and compiler.

Release Consistency (RC)
Improvement over WO:
  An acquire can be reordered with the memory accesses in block 1.
    The purpose of an acquire is to delay the memory accesses in block 2 until the acquire completes.
    There is no reason to wait for block 1 to complete before the acquire can be issued.
  A release can be reordered with the memory accesses in block 3.
    The purpose of a release is to grant access to the new data that are modified before the release in program order.
    There is no reason to delay processing block 3 until the release has completed.
[Figure: blocks 1, 2, and 3 of Read/Write … operations, with Sync (Acquire) between blocks 1 and 2 and Sync (Release) between blocks 2 and 3]

Memory Barriers of Commercial Processors
Processors provide specific instructions called memory barriers or fences that can be used to enforce orderings.
  Synchronization operations (or acquires or releases) cause the compiler to insert the appropriate special instructions, or the programmer can insert these instructions directly.
Alpha supports 2 kinds of fence instructions: the memory barrier (MB) and the write memory barrier (WMB).
  The MB fence is like a synchronization operation in WO.
    It waits for all previously issued memory accesses to complete before issuing any new accesses.
  The WMB fence imposes program order only between writes.
    Thus, a read issued after a WMB can still bypass a write access issued before the WMB.

Memory Barriers of Commercial Processors
The Sparc V9 RMO provides a fence or MEMBAR instruction with 4 flavor bits associated with it.
  Each bit indicates a particular type of ordering to be enforced between previous and following load-store operations.
  The 4 possibilities are R-R, R-W, W-R, and W-W.
  Any combination of these bits can be set, offering a variety of ordering choices.
The IBM PowerPC model provides only a single fence instruction called SYNC, which is equivalent to Alpha’s MB fence.
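
The four flavor bits can be pictured as a bit mask that is consulted for each (earlier access, later access) pair. The sketch below is a model for reasoning only: the constant names are illustrative and are not SPARC or Alpha assembler syntax, and FENCE_MB and FENCE_WMB merely restate the Alpha fences from the previous slide in the same vocabulary.

```c
#include <stdbool.h>

enum access { READ = 0, WRITE = 1 };

/* One bit per (earlier, later) pair, mirroring R-R, R-W, W-R and W-W. */
enum fence_bits {
    ORDER_RR = 1 << 0,
    ORDER_RW = 1 << 1,
    ORDER_WR = 1 << 2,
    ORDER_WW = 1 << 3
};

/* Alpha-style fences expressed as masks: MB orders everything,
   WMB orders only writes with respect to later writes. */
enum {
    FENCE_MB  = ORDER_RR | ORDER_RW | ORDER_WR | ORDER_WW,
    FENCE_WMB = ORDER_WW
};

/* Does a fence with this mask keep `later` (issued after the fence)
   from being reordered before `earlier` (issued before the fence)? */
bool fence_orders(unsigned mask, enum access earlier, enum access later)
{
    unsigned bit;
    if (earlier == READ)
        bit = (later == READ) ? ORDER_RR : ORDER_RW;
    else
        bit = (later == READ) ? ORDER_WR : ORDER_WW;
    return (mask & bit) != 0;
}
```

For example, fence_orders(FENCE_WMB, WRITE, READ) is false, which restates the remark that a read issued after a WMB can still bypass a write issued before it.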

Characteristics of Various Systems

Characteristics of Various Systems

Programmer’s Interface
A program running “correctly” on a system with TSO (with enough memory barriers) will not necessarily work “correctly” on a system with WO.
Programmer
  Programmers ensure that all synchronization operations are explicitly labeled or identified, for example LOCK, UNLOCK, and BARRIER.
System (compiler and hardware)
  The compiler or run-time library translates these synchronization operations into the appropriate order-preserving operations (memory barriers or fences).
  Then, the system (compiler plus hardware) guarantees sequentially consistent executions even though it may reorder operations between synchronization operations.

Backup Slides

A Typical Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology.
[Figure: memory hierarchy from higher level to lower level. On-chip components: CPU core with register file, ITLB, DTLB, L1I (instruction cache), L1D (data cache), and L2 (second-level) cache; then main memory (DRAM); then secondary storage (disk)]
Note that the cache coherence hardware updates or invalidates only the memory and the caches (not the registers of the CPU).

The Memory Hierarchy: Why Does It Work?
Temporal locality (locality in time)
  If a memory location is referenced, then it will tend to be referenced again soon.
  → Keep the most recently accessed data items closer to the processor.
Spatial locality (locality in space)
  If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon.
  → Move blocks consisting of contiguous words closer to the processor.

Example of Locality

  int A[100], B[100], C[100], D;
  int i;
  for (i = 0; i < 100; i++) {
      C[i] = A[i] * B[i] + D;
  }

[Figure: arrays A, B, and C and the scalar D laid out in memory, with four consecutive elements (e.g., A[0] to A[3]) per cache line (block)]
Slide from Prof Sean Lee in Georgia Tech

Volatile
When would you use a variable declaration with volatile, for example, in C?
For lock variables and I/O registers.

True Sharing & False Sharing

Impact of Cache Line Size

Synchronization