
Chapter 7: Introduction to Shared Memory Multiprocessors
Yan Solihin

Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.

Pros/Cons of Providing Shared Memory

Shared memory = an abstraction in which any address in main memory is accessible via load/store instructions from any processor.

Advantages of a shared memory machine:
- Naturally supports shared memory programs
- Single OS image
- Can fit larger problems: total memory = n × the memory of a single node
- Allows fine-grain sharing

Disadvantages of a shared memory machine:
- The cost of providing the shared memory abstraction
  - Cheap on small-scale multiprocessors (2-32 processors) and multicores
  - Expensive on large-scale multiprocessors (> 64 processors)

Objective of the Chapter

- The shared memory abstraction requires hardware support
- Find out what hardware support is needed to construct a shared memory multiprocessor
- What is required to provide a shared memory abstraction?
  - Cache coherence (in systems with caches)
  - Memory consistency models
  - Synchronization support

Cache Coherence Problem Illustration

Illustration 1: “You (A) met a friend (B) yesterday and made an appointment to meet her at a place exactly one month from now. Later you find out that you have an emergency and cannot possibly meet on the agreed date. You want to change the appointment to three months from now.”

Constraints:
- The only mode of communication is mail
- In your mail, you can only write the new appointment date and nothing else

Question 1: How can you ensure that you will meet B?

Principle 1: When a processor performs a write operation on a memory address, the write must be propagated to all other processors (the write propagation principle).

Cache Coherence Problem

Illustration 2: “You have placed a letter in the mailbox telling B to change the meeting to three months from now. Unfortunately, you find out again that you cannot meet three months from now and need to change it to two months from now.”

Question 2: How can you ensure that you will meet B two months from now?

Principle 2: When there are two writes to the same memory address, the order in which the writes are seen by all processors must be the same (the write serialization principle).

Will This Parallel Code Work Correctly?

sum = 0;
begin parallel
for (i = 0; i < 2; i++) {
    lock(id, myLock);
    sum = sum + a[i];
    unlock(id, myLock);
}
end parallel
print sum;

Suppose a[0] = 3 and a[1] = 7. Will it print sum = 10? (A runnable sketch follows below.)
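
For readers who want to experiment, here is a minimal runnable version of the same idea using POSIX threads. It is a sketch, not the book's code: the worker/idx names and the use of a pthread_mutex in place of the abstract lock()/unlock() are assumptions. With the mutex held around the update, it reliably prints sum = 10.

#include <pthread.h>
#include <stdio.h>

int a[2] = {3, 7};
int sum = 0;
pthread_mutex_t myLock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds one element of a[] into the shared sum,
   holding the lock across the read-modify-write. */
void *worker(void *arg) {
    int i = *(int *)arg;
    pthread_mutex_lock(&myLock);
    sum = sum + a[i];
    pthread_mutex_unlock(&myLock);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int idx[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &idx[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("sum = %d\n", sum);  /* prints sum = 10 */
    return 0;
}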

Example 1: Cache Coherence Problem

The same code as on the previous slide, now run on a machine with processors P1, P2, …, Pn, each with its own cache. Suppose a[0] = 3 and a[1] = 7. Will it print sum = 10?

Cache Coherence Problem Illustration

[Figure: processors P1, P2, P3, each with a private write-back cache, connected by a bus to a memory controller and main memory; memory initially holds sum = 0. Cache states: V = valid/clean, D = dirty.]

With no coherence support, the example proceeds as follows:
1. P1 reads &sum; its cache now holds sum = 0 in state V.
2. P1 writes sum = 3; its cached copy becomes sum = 3 in state D, while main memory still holds 0.
3. P2 reads &sum; main memory supplies the stale value, so P2 caches sum = 0 in state V, even though P1 holds sum = 3 dirty.
4. P2 writes sum = 7; its cached copy becomes sum = 7 in state D.
5. P1 reads &sum, hits in its own cache, and prints sum = 3 rather than the expected 10.

Questions

- Do P1 and P2 see the same value of sum at the end?
- How is the problem affected by the cache organization? For example, what happens if we use write-through caches instead?
- What if we do not have caches, or sum is uncacheable? Will it work?

Main Observations

- Write propagation: values written in one cache must be propagated to other caches
- Write serialization: successive writes to the same data must be seen in the same order by all caches
- A protocol is needed to ensure both properties, called a cache coherence protocol
- A cache coherence protocol can be implemented in hardware or software
  - The overheads of software implementations are often very high

Memory Consistency Problem

- Cache coherence deals with the propagation and ordering of memory accesses to a single datum
- Is there also a need to order memory accesses to multiple data items?

Memory Consistency Problem Illustration

P0:
S1: datum = 5;
S2: datumIsReady = 1;

P1:
S3: while (!datumIsReady) {};
S4: print datum;

- Recall that point-to-point synchronization is common in DOACROSS and DOPIPE parallelism
- Correctness of execution depends on the ordering of S1 and S2 on P0, and on the ordering of S3 and S4 on P1
  - What happens if S2 is executed before S1?
  - What happens if S4 is executed before S3?
- In multiprocessors, even if P0 executes S1 before S2, P1 does not necessarily see that order!
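
As an aside, here is a minimal sketch of how this flag idiom can be written safely in C11; the release/acquire pairing is a standard technique, not part of the book's example, and the function names are assumptions. The release store on the flag guarantees that once P1's acquire load observes 1, it also observes datum = 5.

#include <stdatomic.h>
#include <stdio.h>

int datum;                      /* ordinary shared data */
atomic_int datumIsReady;        /* synchronization flag, starts at 0 */

void producer(void) {           /* runs on P0 */
    datum = 5;                                            /* S1 */
    atomic_store_explicit(&datumIsReady, 1,
                          memory_order_release);          /* S2 */
}

void consumer(void) {           /* runs on P1 */
    while (!atomic_load_explicit(&datumIsReady,
                                 memory_order_acquire)) {}  /* S3 */
    printf("%d\n", datum);      /* S4: guaranteed to print 5 */
}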

Global Ordering of Memory Accesses

- Full ordering: all memory accesses issued by a processor are executed in program order, and they are seen by all processors in that order
- However, full ordering is expensive to enforce
- Current solution: the processor's manual specifies which types of ordering are enforced by the hardware and which are not
  - This specification is referred to as the memory consistency model
- It is the programmer's responsibility to write programs that conform to that model
  - Incorrect understanding is a source of non-trivial program errors
  - It also contributes to the non-portability of code

How to Ensure Ordering

- Requires the cooperation of the compiler and the hardware
- The compiler may reorder S1 and S2
  - There is no data dependence between S1 and S2
  - So it must be explicitly told not to reorder them, e.g., by declaring datum and datumIsReady as volatile
  - This prevents the compiler from applying aggressive optimizations that are safe only in sequential programs
- The architecture may also reorder S1 and S2
  - Write buffers, the memory controller, and network delays can all reorder memory operations
  - Many performance-enhancing features for exploiting instruction-level parallelism affect the ordering of instructions
  - Need to ensure ordering while still allowing ILP to be exploited (a sketch follows below)
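
A minimal sketch of the two-level fix the slide alludes to: volatile constrains only the compiler, while an explicit fence also constrains the processor. This mirrors the classic volatile-plus-fence practice rather than strictly conforming C11 (which would make the flag itself atomic, as in the earlier sketch); the fence placement shown is an assumption for illustration.

#include <stdatomic.h>

volatile int datum;          /* volatile: compiler may not reorder */
volatile int datumIsReady;   /* or elide these accesses            */

void producer(void) {        /* runs on P0 */
    datum = 5;                                    /* S1 */
    /* the fence keeps the hardware from reordering S1 and S2 */
    atomic_thread_fence(memory_order_release);
    datumIsReady = 1;                             /* S2 */
}

void consumer(void) {        /* runs on P1 */
    while (!datumIsReady) {}                      /* S3 */
    atomic_thread_fence(memory_order_acquire);
    /* S4: datum is now up to date */
}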

How to Support a Critical Section?

- How do we guarantee mutual exclusion in a critical section?
- Let us see how a lock can be implemented in software:

void lock(int *lockvar) {
    while (*lockvar == 1) {}  // wait until released
    *lockvar = 1;             // acquire lock
}

void unlock(int *lockvar) {
    *lockvar = 0;
}

In machine language, it looks like this:

lock:    ld   R1, &lockvar   // R1 = MEM[&lockvar]
         bnz  R1, lock       // jump to lock if R1 != 0
         sti  &lockvar, #1   // MEM[&lockvar] = 1
         ret                 // return to caller

unlock:  sti  &lockvar, #0   // MEM[&lockvar] = 0
         ret                 // return to caller

Problem with the Solution Just Shown

- Unfortunately, the solution is incorrect
- The sequence of ld, bnz, and sti is not atomic
  - Several threads may be executing it at the same time
- As a result, several threads may enter the critical section simultaneously

Peterson's Algorithm

int turn;
int interested[n];  // initialized to FALSE

void lock(int process, int lvar) {  // process is 0 or 1
    int other = 1 - process;
    interested[process] = TRUE;
    turn = process;
    while (turn == process && interested[other] == TRUE) {}
}
// Post: turn != process or interested[other] == FALSE

void unlock(int process, int lvar) {
    interested[process] = FALSE;
}

Exit from lock() happens only if:
- interested[other] == FALSE: either the other process has not competed for the lock, or it has just called unlock()
- turn != process: the other process is competing, has set turn to itself, and will be blocked in the while() loop

No Race

// Proc 0
interested[0] = TRUE;
turn = 0;
while (turn == 0 && interested[1] == TRUE) {}
// since interested[1] == FALSE,
// Proc 0 enters the critical section

// Proc 1
interested[1] = TRUE;
turn = 1;
while (turn == 1 && interested[0] == TRUE) {}
// since turn == 1 && interested[0] == TRUE,
// Proc 1 waits in the loop until Proc 0
// releases the lock

// Proc 0 unlocks
interested[0] = FALSE;
// now Proc 1 can exit the loop and acquire the lock

There Is a Race

Here both processes call lock() at about the same time; Proc 1's write turn = 1 lands last, so Proc 0 wins.

// Proc 0
interested[0] = TRUE;
turn = 0;
while (turn == 0 && interested[1] == TRUE) {}
// since turn == 1,
// Proc 0 enters the critical section

// Proc 1
interested[1] = TRUE;
turn = 1;
while (turn == 1 && interested[0] == TRUE) {}
// since turn == 1 && interested[0] == TRUE,
// Proc 1 waits in the loop until Proc 0
// releases the lock

// Proc 0 unlocks
interested[0] = FALSE;
// now Proc 1 can exit the loop and acquire the lock

Will Peterson's Algorithm Work?

- Yes, but its correctness depends on the global ordering of:
  A: interested[process] = TRUE;
  B: turn = process;
- So it is related to the memory consistency problem. If A and B are reordered:

// Proc 0
interested[0] = TRUE;
turn = 0;
while (turn == 0 && interested[1] == TRUE) {}
// since interested[1] == FALSE,
// Proc 0 enters the critical section

// Proc 1 (A and B reordered!)
turn = 1;
interested[1] = TRUE;
while (turn == 1 && interested[0] == TRUE) {}
// since turn == 0,
// Proc 1 also enters the critical section

A corrected sketch follows below.
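
Here is a minimal sketch (not the book's code) of how Peterson's algorithm is typically repaired on modern hardware: making turn and interested[] C11 atomics with the default sequentially consistent ordering forbids both the compiler and the processor from reordering A and B. The names mirror the slide's code.

#include <stdatomic.h>
#include <stdbool.h>

atomic_int turn;
atomic_bool interested[2];   /* both start as false */

void lock(int process) {     /* process is 0 or 1 */
    int other = 1 - process;
    /* seq_cst stores: neither compiler nor hardware
       may reorder these two writes */
    atomic_store(&interested[process], true);   /* A */
    atomic_store(&turn, process);               /* B */
    while (atomic_load(&turn) == process &&
           atomic_load(&interested[other])) {}  /* spin */
}

void unlock(int process) {
    atomic_store(&interested[process], false);
}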

Performance Problem

- Assuming full program ordering is guaranteed, Peterson's algorithm achieves mutual exclusion
- However, it is very inefficient
  - For n threads, it needs an interested[n] array
  - Before a thread can enter the critical section, it must ensure that all other elements of interested[] are FALSE
  - This takes O(n) loads and branches
- If non-atomicity is the problem, the hardware can instead provide an atomic sequence of load, modify, and store
  - Typically in the form of an atomic instruction (see the sketch below)
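
For concreteness, here is a minimal test-and-set spinlock sketch using C11's atomic_flag; atomic_flag_test_and_set compiles down to exactly the kind of atomic load-modify-store instruction the slide refers to (the hardware support itself is the subject of Chapter 9). The function names are assumptions.

#include <stdatomic.h>

atomic_flag lockvar = ATOMIC_FLAG_INIT;

void lock(void) {
    /* atomically read the old value and set the flag in one step;
       keep retrying while the old value was already set */
    while (atomic_flag_test_and_set(&lockvar)) {}
}

void unlock(void) {
    atomic_flag_clear(&lockvar);
}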

Conclusion

To support the shared memory abstraction in multiprocessors, we need to provide:
- Cache coherence
- A memory consistency model
- Support for synchronization