Shared Memory Programming: Synchronization Primitives (Ing. Andrea Marongiu). Includes slides from course CS162 at UC Berkeley, by prof. Anthony D. Joseph and Ion Stoica, and from course CS194, by prof. Katherine Yelick.

Shared Memory Programming
A program is a collection of threads of control; in some languages threads can be created dynamically, mid-execution.
- Each thread has a set of private variables, e.g., local stack variables.
- Each thread also has a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate by synchronizing on shared variables.
[Slide figure: processors P0, P1, ..., Pn, each with private memory (private variable i), all reading and writing a shared memory (shared variable s).]

Shared Memory code for computing a sum

static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + sqr(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + sqr(A[i])

The problem is a race condition on variable s. A race condition or data race occurs when:
- two processors (or two threads) access the same variable, and at least one does a write;
- the accesses are concurrent (not synchronized), so they could happen simultaneously.

Shared Memory code for computing a sum

static int s = 0;

Thread 1:
  ...
  compute f(A[i]) and put it in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

Thread 2:
  ...
  compute f(A[i]) and put it in reg0
  reg1 = s
  reg1 = reg1 + reg0
  s = reg1
  ...

Assume A = [3,5], f is the square function, and s = 0 initially. For this program to work, s should be 34 at the end, but it may be 34, 9, or 25. The atomic operations are the individual reads and writes: we never see half of a number, but the += operation is not atomic. All computation happens in (private) registers.
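
The lost-update behavior can be reproduced with a short POSIX-threads program. The sketch below is not from the original slides; the array size N, the worker split, and filling A with ones are illustrative choices. Each thread updates the shared s without synchronization, so the printed result is usually smaller than the expected value.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static long s = 0;                      /* shared and unprotected: data race */

static long sqr(long x) { return x * x; }

static void *worker(void *arg) {
    long half = (long)arg;              /* 0: first half of A, 1: second half */
    for (long i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        s = s + sqr(A[i]);              /* load, add, store: NOT atomic */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;          /* expected sum: N */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)0);
    pthread_create(&t2, NULL, worker, (void *)1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %ld (expected %d)\n", s, N);
    return 0;
}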

Shared Memory code for computing a sum

static int s = 0;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + sqr(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + sqr(A[i])
  s = s + local_s2

Since addition is associative, it's OK to rearrange the order. Right?
- Most computation is on private variables.
- Sharing frequency is also reduced, which might improve speed.
- But there is still a race condition on the update of shared s: that update must be made ATOMIC.

Atomic Operations
To understand a concurrent program, we need to know what the underlying indivisible operations are!
- Atomic operation: an operation that always runs to completion or not at all.
- It is indivisible: it cannot be stopped in the middle, and its state cannot be modified by someone else in the middle.
- Atomic operations are the fundamental building block: without them, threads have no way to work together.
- On most machines, memory references and assignments (i.e., loads and stores) of words are atomic.
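
As a concrete illustration (not from the slides), C11's <stdatomic.h> makes the distinction visible: a word-sized load or store of the shared sum is atomic, but s = s + x is a separate load, add, and store; the library lets a program request the whole update as one indivisible read-modify-write.

#include <stdatomic.h>

static atomic_long s = 0;          /* shared sum */

void add_to_sum(long x) {
    /* one indivisible read-modify-write, backed by a hardware atomic op,
       unlike "s = s + x", which is a separate load, add, and store */
    atomic_fetch_add(&s, x);
}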

Role of Synchronization
"A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
Types of synchronization:
- Mutual exclusion
- Event synchronization: point-to-point, group, global (barriers)
These are the most used forms of synchronization in shared-memory parallel programming. How much hardware support do they need?

Motivation: "Too much milk"
Example: people need to coordinate.

Time   Person A                       Person B
3:00   Look in fridge. Out of milk.
3:05   Leave for store.
3:10   Arrive at store.               Look in fridge. Out of milk.
3:15   Buy milk.                      Leave for store.
3:20   Arrive home, put milk away.    Arrive at store.
3:25                                  Buy milk.
3:30                                  Arrive home, put milk away.

Definitions
- Synchronization: using atomic operations to ensure cooperation between threads. For now, only loads and stores are atomic; it is hard to build anything useful with only reads and writes.
- Mutual exclusion: ensuring that only one thread does a particular thing at a time; one thread excludes the others while doing its task.
- Critical section: a piece of code that only one thread can execute at once. Critical section and mutual exclusion are two ways of describing the same thing; the critical section defines the sharing granularity.

More Definitions
Lock: prevents someone from doing something.
- Lock before entering a critical section and before accessing shared data.
- Unlock when leaving, after accessing shared data.
- Wait if locked. Important idea: all synchronization involves waiting.
Example: fix the milk problem by putting a lock on the refrigerator. Lock it and take the key if you are going to go buy milk. This fixes too much (coarse granularity): your roommate gets angry if they only want orange juice.

Too Much Milk: Correctness properties
We need to be careful about the correctness of concurrent programs, since they are non-deterministic. Always write down the desired behavior first: think first, then code.
What are the correctness properties for the "Too much milk" problem?
- Never more than one person buys.
- Someone buys if needed.
We restrict ourselves to using only atomic load and store operations as building blocks.

Too Much Milk: Solution #1
Use a note to avoid buying too much milk:
- Leave a note before buying (a kind of "lock").
- Remove the note after buying (a kind of "unlock").
- Don't buy if there is a note (wait).
Suppose a computer tries this (remember, only memory reads/writes are atomic):

if (noMilk) {
  if (noNote) {
    leave Note;
    buy milk;
    remove note;
  }
}

Result?

Too Much Milk: Solution #1
One possible interleaving:

Thread A                        Thread B
if (noMilk)
  if (noNote) {
                                if (noMilk)
                                  if (noNote) {
                                    leave Note;
                                    buy milk;
                                    remove note;
                                  }
    leave Note;
    buy milk;
    remove note;
  }

Both threads end up buying milk. We need to atomically update the lock variable (the note).

How to Implement a Lock?
Lock: prevents someone from accessing something.
- Lock before entering a critical section (e.g., before accessing shared data).
- Unlock when leaving, after accessing shared data.
- Wait if locked. Important idea: all synchronization involves waiting. A thread should sleep if it will wait for a long time.
Can we use hardware atomic instructions?

Examples of hardware atomic instructions
Each of these executes as a single atomic operation:

test&set (&address) {            /* most architectures */
  result = M[address];
  M[address] = 1;
  return result;
}

swap (&address, register) {      /* x86 */
  temp = M[address];
  M[address] = register;
  register = temp;
}

compare&swap (&address, reg1, reg2) {  /* */
  if (reg1 == M[address]) {
    M[address] = reg2;
    return success;
  } else {
    return failure;
  }
}
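
On a real machine these primitives are usually reached through a library rather than written by hand. A minimal sketch, assuming C11 <stdatomic.h>: atomic_flag_test_and_set corresponds to the test&set instruction above (and atomic_compare_exchange_strong to compare&swap), so a spin lock can be written as:

#include <stdatomic.h>

/* a spin lock built on the C11 atomic_flag; its test-and-set maps to the
   test&set primitive above */
static atomic_flag lock_busy = ATOMIC_FLAG_INIT;   /* clear = lock is free */

static void acquire(void) {
    while (atomic_flag_test_and_set(&lock_busy))   /* atomically set, return old value */
        ;                                          /* spin while the lock was busy */
}

static void release(void) {
    atomic_flag_clear(&lock_busy);                 /* mark the lock free again */
}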

Implementing Locks with test&set
Simple solution:

int value = 0;  // Free

Acquire() {
  while (test&set(value));  // while busy
}

Release() {
  value = 0;
}

Recall:

test&set (&address) {
  result = M[address];
  M[address] = 1;
  return result;
}

Simple explanation:
- If the lock is free, test&set reads 0 and sets value = 1, so the lock is now busy. It returns 0, so the while loop exits.
- If the lock is busy, test&set reads 1 and sets value = 1 (no change). It returns 1, so the while loop continues.
- When we set value = 0, someone else can get the lock.

Too Much Milk: Solution #2
- Lock.Acquire(): wait until the lock is free, then grab it.
- Lock.Release(): unlock, waking up anyone waiting.
- These are atomic operations: if two threads are waiting for the lock, only one succeeds in grabbing it.
Then our milk problem is easy:

milklock.Acquire();
if (noMilk)
  buy milk;
milklock.Release();

Once again, the section of code between Acquire() and Release() is called a "Critical Section".
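
A minimal sketch of the same structure with POSIX threads (not from the slides; noMilk and buy_milk() are illustrative placeholders), where pthread_mutex_lock/unlock play the roles of Acquire() and Release():

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t milklock = PTHREAD_MUTEX_INITIALIZER;
static bool noMilk = true;               /* illustrative shared state */

static void buy_milk(void) { noMilk = false; }   /* placeholder action */

void buy_milk_if_needed(void) {
    pthread_mutex_lock(&milklock);       /* Acquire(): wait until free, then grab */
    if (noMilk)
        buy_milk();                      /* critical section */
    pthread_mutex_unlock(&milklock);     /* Release(): wake up anyone waiting */
}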

Shared Memory code for computing a sum

static int s = 0;
static lock lk;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + sqr(A[i])
  lock(lk);
  s = s + local_s1
  unlock(lk);

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + sqr(A[i])
  lock(lk);
  s = s + local_s2
  unlock(lk);

Since addition is associative, it's OK to rearrange the order. Most computation is on private variables, and sharing frequency is reduced, which might improve speed. The race condition on the update of shared s is now removed by protecting that update with the lock.
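
A runnable version of this pattern might look like the following POSIX-threads sketch (not from the slides; the array size and contents are illustrative): each thread accumulates into a private local sum and holds the mutex only for the final update of s.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];
static long s = 0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg) {
    long half = (long)arg;                  /* 0: first half of A, 1: second half */
    long local = 0;
    for (long i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        local += (long)A[i] * A[i];         /* all work on private variables */
    pthread_mutex_lock(&lk);
    s += local;                             /* the only shared update, now protected */
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;  /* expected result: N */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, (void *)0);
    pthread_create(&t2, NULL, partial_sum, (void *)1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %ld\n", s);
    return 0;
}

Accumulating locally first keeps the lock off the hot path, which is exactly the sharing-frequency argument made above.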

Performance Criteria for Synchronization Operations
- Latency (time per op): how long does it take if you always win? Especially relevant under light contention.
- Bandwidth (ops per sec): how long does it take (averaged over threads) when many others are trying for it? Especially relevant under high contention.
- Traffic: how many events on shared resources (bus, crossbar, ...)?
- Storage: how much memory is required?
- Fairness: can any one thread be "starved" and never get the lock?

Barriers
- Software algorithms are implemented using locks, flags, and counters.
- Hardware barriers:
  - Wired-AND line separate from the address/data bus: set the input high when you arrive, wait for the output to be high to leave.
  - In practice, multiple wires allow reuse.
  - Useful when barriers are global and very frequent.
  - Difficult to support an arbitrary subset of processors; even harder with multiple processes per processor.
  - Difficult to dynamically change the number and identity of participants (the latter e.g. due to process migration).
  - Not common today on bus-based machines.

A Simple Centralized Barrier
A shared counter maintains the number of processes that have arrived: increment on arrival (under a lock), then check the flag until all numprocs have arrived.

struct bar_type {
  int counter;
  struct lock_type lock;
  int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
  LOCK(bar_name.lock);
  if (bar_name.counter == 0)
    bar_name.flag = 0;                 /* reset flag if first to reach */
  mycount = bar_name.counter++;        /* mycount is private */
  UNLOCK(bar_name.lock);
  if (mycount == p-1) {                /* last to arrive */
    bar_name.counter = 0;              /* reset for next barrier */
    bar_name.flag = 1;                 /* release waiters */
  }
  else
    while (bar_name.flag == 0) {};     /* busy wait for release */
}

Problem?

A Working Centralized Barrier
- Consecutively entering the same barrier doesn't work: we must prevent a process from entering until all have left the previous instance.
- Could use another counter, but that increases latency and contention.
- Sense reversal: wait for the flag to take a different value in consecutive instances; toggle this value only when all processes have reached the barrier.

BARRIER (bar_name, p) {
  local_sense = !(local_sense);        /* toggle private sense variable */
  LOCK(bar_name.lock);
  mycount = bar_name.counter++;        /* mycount is private */
  if (bar_name.counter == p) {         /* last to arrive */
    UNLOCK(bar_name.lock);
    bar_name.counter = 0;              /* reset for the next instance */
    bar_name.flag = local_sense;       /* release waiters */
  }
  else {
    UNLOCK(bar_name.lock);
    while (bar_name.flag != local_sense) {};  /* busy wait for release */
  }
}
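
The same sense-reversal idea can be written with C11 atomics and a pthread mutex; the sketch below is illustrative and not from the slides (P, the thread count, is an assumed constant, and local_sense is kept in thread-local storage).

#include <stdatomic.h>
#include <pthread.h>

#define P 4                                   /* illustrative number of threads */

static pthread_mutex_t bar_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;
static atomic_int flag = 0;
static _Thread_local int local_sense = 0;     /* private per-thread sense */

static void barrier(void) {
    local_sense = !local_sense;               /* toggle private sense variable */
    pthread_mutex_lock(&bar_lock);
    counter++;
    if (counter == P) {                       /* last to arrive */
        counter = 0;                          /* reset for the next instance */
        pthread_mutex_unlock(&bar_lock);
        atomic_store(&flag, local_sense);     /* release waiters */
    } else {
        pthread_mutex_unlock(&bar_lock);
        while (atomic_load(&flag) != local_sense)
            ;                                 /* busy wait for release */
    }
}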

Centralized Barrier Performance
- Latency: the centralized barrier has a critical path length at least proportional to p.
- Traffic: about 3p bus transactions.
- Storage cost: very low (a centralized counter and flag).
- Fairness: the same processor should not always be the last to exit the barrier; there is no such bias in the centralized barrier.
The key problems for the centralized barrier are latency and traffic. Especially with distributed memory, all the traffic goes to the same node.

Improved Barrier Algorithm: Master-Slave barrier
- The master core gathers the slaves at the barrier and then releases them.
- Separate gather and release trees; use separate, per-core polling flags for the different wait stages.
- Advantage: uses ordinary reads/writes (an array of flags) instead of locks.
- 2x(p-1) messages are exchanged over the network.
- Valuable in a distributed network: communication travels along different paths.
[Slide figure: centralized barrier (contention on the shared counter) vs. master-slave barrier.]
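
One possible shape of such a master-slave barrier, sketched with C11 atomics (not from the slides; NCORES and the flag names are illustrative): each core has its own arrival flag and its own release flag, so every wait stage spins on a separate location.

#include <stdatomic.h>

#define NCORES 8                              /* illustrative core count */
static atomic_int arrived[NCORES];            /* per-core "I have arrived" flags */
static atomic_int go[NCORES];                 /* per-core "you may leave" flags */

static void ms_barrier(int my_id) {
    if (my_id == 0) {                         /* master: gather, then release */
        for (int i = 1; i < NCORES; i++)
            while (!atomic_load(&arrived[i]))
                ;                             /* wait for slave i to arrive */
        for (int i = 1; i < NCORES; i++) {
            atomic_store(&arrived[i], 0);     /* reset for the next barrier */
            atomic_store(&go[i], 1);          /* release slave i */
        }
    } else {                                  /* slave: signal, then wait */
        atomic_store(&arrived[my_id], 1);
        while (!atomic_load(&go[my_id]))
            ;                                 /* spin on a private flag */
        atomic_store(&go[my_id], 0);          /* reset for the next barrier */
    }
}

Resetting each flag before it is reused keeps the barrier correct across consecutive invocations.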

Improved Barrier Algorithm: Master-Slave on NUMA
- Not all messages have the same latency, so a locality-aware implementation is needed.
- What if the barrier is implemented on top of a NUMA (cluster-based) shared memory system? e.g., p2012.
[Slide figure: master-slave barrier mapped onto clusters, each with processors (PROC), local memory (MEM), a crossbar (XBAR), and a network interface (NI).]

Improved Barrier Algorithm: Software combining tree
- Only k processors access the same location, where k is the degree of the tree, so there is little contention.
- Separate arrival and exit trees, and use sense reversal.
- Valuable in a distributed network: communication travels along different paths.
- Higher latency (log p steps of work, and O(p) serialized bus transactions).
- Advantage: uses ordinary reads/writes instead of locks.
[Slide figure: centralized barrier (contention on the shared counter) vs. tree barrier.]

Improved Barrier Algorithm: Hierarchical synchronization
- A locality-aware implementation synchronizes hierarchically.
- What if the barrier is implemented on top of a NUMA (cluster-based) shared memory system? e.g., p2012.
[Slide figure: tree barrier mapped onto clusters, each with processors (PROC), local memory (MEM), a crossbar (XBAR), and a network interface (NI).]

Barrier performance

Parallel programming models
A programming model is made up of the languages and libraries that create an abstract view of the machine.
- Control: How is parallelism created? How are dependencies (orderings) enforced?
- Data: Can data be shared, or is it all private? How is shared data accessed or private data communicated?
- Synchronization: What operations can be used to coordinate parallelism? What are the atomic (indivisible) operations?

Parallel programming models
In this and the upcoming lectures we will look at different programming models and the features each provides with respect to control, data, and synchronization:
- Pthreads
- OpenMP
- OpenCL