Synchronization without Contention

Synchronization without Contention
John Mellor-Crummey and Michael Scott
Presented by Shoaib Kamil

Overview
- Review of some lock types
- MCS lock algorithm
- Barriers
- Empirical performance
- Discussion

Review of Lock Types
- test&set: using a test&set instruction, poll a single memory location; acquire the lock by changing the flag from false to true, release it by changing it back
- test-and-test&set: reduces memory/interconnect contention, but only while the lock is held! Exponential backoff helps.
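
A minimal sketch of test-and-test&set with exponential backoff, written with C11 atomics; the names (tts_lock, spin_delay) and the backoff bound are illustrative assumptions, not from the paper:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef atomic_bool tts_lock;                /* true = held, false = free */

    static void spin_delay(unsigned iters) {
        for (volatile unsigned i = 0; i < iters; i++) ;   /* crude busy delay */
    }

    void tts_acquire(tts_lock *l) {
        unsigned backoff = 1;
        for (;;) {
            while (atomic_load(l))               /* "test": spin on a read while the lock is held */
                ;
            if (!atomic_exchange(l, true))       /* "test&set": try to flip false -> true */
                return;                          /* we changed it, so we own the lock */
            spin_delay(backoff);                 /* lost the race: back off exponentially */
            if (backoff < (1u << 16))
                backoff <<= 1;
        }
    }

    void tts_release(tts_lock *l) {
        atomic_store(l, false);                  /* release by changing the flag back */
    }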

Review of Lock Types (cont'd)
- ticket lock: next-ticket and currently-serving counters; acquire by fetch&increment on next-ticket; you own the lock when currently-serving equals your ticket; fair (FIFO order); backoff is effective
- Anderson (array-based queuing lock): fetch&increment to obtain a new location, then spin on that location; the previous owner frees the lock by writing to the next location; reduces contention by polling on unique locations, but requires coherence and O(P * locks) static space
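
A minimal sketch of the ticket lock with C11 atomics (names are illustrative; backoff between polls is omitted; both counters start at 0):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;        /* fetch&increment here to take a ticket */
        atomic_uint currently_serving;  /* ticket currently allowed to hold the lock */
    } ticket_lock;

    void ticket_acquire(ticket_lock *l) {
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);  /* take a ticket */
        while (atomic_load(&l->currently_serving) != my_ticket)     /* wait for our turn */
            ;                                                       /* (backoff could go here) */
    }

    void ticket_release(ticket_lock *l) {
        atomic_fetch_add(&l->currently_serving, 1);  /* hand off to the next ticket, FIFO */
    }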

MCS Lock
- Maintains a queue of requesters
- Each waiter has a local record that points to the next waiter
- Release gives the next waiter the lock

MCS Lock Pseudocode

    type qnode = record
        next   : ^qnode      // ptr to successor in queue
        locked : Boolean     // busy-waiting necessary
    type lock = ^qnode       // ptr to tail of queue

    // I points to a queue link record allocated (in an enclosing scope)
    // in shared memory locally-accessible to the invoking processor

    procedure acquire_lock(L : ^lock; I : ^qnode)
        var pred : ^qnode
        I->next := nil                      // initially, no successor
        pred := fetch_and_store(L, I)       // queue for lock
        if pred != nil                      // lock was not free
            I->locked := true               // prepare to spin
            pred->next := I                 // link behind predecessor
            repeat while I->locked          // busy-wait for lock

    procedure release_lock(L : ^lock; I : ^qnode)
        if I->next = nil                    // no known successor
            if compare_and_swap(L, I, nil)
                return                      // no successor, lock free
            repeat while I->next = nil      // wait for successor
        I->next->locked := false            // pass lock

The wait for I->next in release_lock is necessary because of the window between the successor's fetch_and_store and its pred->next assignment.
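
For illustration, a rough C11 translation of the pseudocode above; a sketch assuming sequentially consistent atomics, with names (mcs_node, mcs_acquire, mcs_release) that are not from the paper:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;    /* successor in the queue */
        atomic_bool                locked;  /* true while we must busy-wait */
    } mcs_node;

    typedef _Atomic(mcs_node *) mcs_lock;   /* points to the tail of the queue */

    void mcs_acquire(mcs_lock *L, mcs_node *I) {
        atomic_store(&I->next, NULL);                /* initially, no successor */
        mcs_node *pred = atomic_exchange(L, I);      /* fetch_and_store: enqueue self at tail */
        if (pred != NULL) {                          /* lock was not free */
            atomic_store(&I->locked, true);          /* prepare to spin */
            atomic_store(&pred->next, I);            /* link behind predecessor */
            while (atomic_load(&I->locked))          /* spin only on our own (local) node */
                ;
        }
    }

    void mcs_release(mcs_lock *L, mcs_node *I) {
        if (atomic_load(&I->next) == NULL) {         /* no known successor */
            mcs_node *expected = I;
            if (atomic_compare_exchange_strong(L, &expected, NULL))
                return;                              /* queue empty: lock is now free */
            while (atomic_load(&I->next) == NULL)    /* successor is mid-enqueue; wait for it */
                ;
        }
        mcs_node *succ = atomic_load(&I->next);
        atomic_store(&succ->locked, false);          /* pass the lock */
    }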

MCS Lock (cont'd)
- An alternate release procedure avoids compare&swap, but does not guarantee FIFO order
- All spinning occurs on a local data item, so there is no unnecessary bus traffic while spinning

Barriers: Previous Work
- Central counter: each processor increments the counter, then waits until the count equals P; large amounts of contention
- Software combining tree: processors are organized in groups of k into a k-ary tree; the last arrival at each leaf travels up toward the root, then wakeup travels back down; less contention, but still spinning on non-local locations
- Tournament barrier: a statically determined processor (not the last one to arrive) travels up the tree; spinning is not local-only on distributed shared-memory machines
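
A minimal sketch of the centralized counter barrier in C11, using sense reversal so the barrier can be reused across episodes; P, the names, and the _Thread_local setup are assumptions for illustration:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 8                              /* number of processors (assumed) */

    static atomic_int  count = P;            /* processors still to arrive */
    static atomic_bool global_sense = false; /* flips once per barrier episode */
    static _Thread_local bool local_sense = true;

    void central_barrier(void) {
        if (atomic_fetch_sub(&count, 1) == 1) {          /* last one to arrive */
            atomic_store(&count, P);                     /* re-arm for the next episode */
            atomic_store(&global_sense, local_sense);    /* release all the spinners */
        } else {
            while (atomic_load(&global_sense) != local_sense)  /* all P spin on one word: contention */
                ;
        }
        local_sense = !local_sense;                      /* expect the opposite sense next time */
    }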

MCS Barrier
- Spins only on locally-accessible flag variables
- Requires O(P) space for P processors
- Performs the theoretical minimum number of network transactions
- Performs O(log P) transactions on the critical path
- Uses two trees with different structures: one for arrival, one for wakeup

MCS Barrier Pseudocode

    type treenode = record
        wsense        : Boolean                    // wakeup sense, written by parent
        parentpointer : ^Boolean
        childpointers : array [0..1] of ^Boolean
        havechild     : array [0..3] of Boolean
        cnotready     : array [0..3] of Boolean
        dummy         : Boolean                    // pseudo-data

    processor private vpid  : integer   // a unique "virtual processor" index
    processor private sense : Boolean
    shared nodes : array [0..P-1] of treenode
        // nodes[vpid] is allocated in shared memory
        // locally-accessible to processor vpid

    // for each processor i, sense is initially true
    // in nodes[i]:
    //   havechild[j]     = (4*i+j+1 < P)
    //   parentpointer    = &nodes[floor((i-1)/4)].cnotready[(i-1) mod 4],
    //                      or &dummy if i = 0
    //   childpointers[0] = &nodes[2*i+1].wsense, or &dummy if 2*i+1 >= P
    //   childpointers[1] = &nodes[2*i+2].wsense, or &dummy if 2*i+2 >= P
    //   initially, cnotready = havechild and wsense = false

    procedure tree_barrier
        with nodes[vpid] do
            repeat until cnotready = [false, false, false, false]
            cnotready := havechild            // re-initialize for next time
            parentpointer^ := false           // signal parent
            if vpid != 0                      // not root: wait until parent wakes me
                repeat until wsense = sense
            // signal children in wakeup tree
            childpointers[0]^ := sense
            childpointers[1]^ := sense
            sense := not sense
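
A rough C11 translation of the tree barrier above, for illustration only; P, the init function, and the way vpid is assigned to each thread are assumptions. One treenode per processor, a 4-ary arrival tree, and a binary wakeup tree:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 8   /* number of processors (assumed) */

    typedef struct {
        atomic_bool  wsense;             /* wakeup sense, written by parent */
        atomic_bool *parentpointer;      /* -> our slot in parent's cnotready */
        atomic_bool *childpointers[2];   /* -> wsense of wakeup-tree children */
        bool         havechild[4];       /* which arrival-tree children exist */
        atomic_bool  cnotready[4];       /* arrival flags, cleared by children */
        atomic_bool  dummy;              /* target for non-existent links */
    } treenode;

    static treenode nodes[P];
    static _Thread_local int  vpid;          /* set by thread setup code (assumed) */
    static _Thread_local bool sense = true;

    void tree_barrier_init(void) {           /* run once before any thread enters the barrier */
        for (int i = 0; i < P; i++) {
            treenode *n = &nodes[i];
            atomic_store(&n->wsense, false);
            for (int j = 0; j < 4; j++) {
                n->havechild[j] = (4 * i + j + 1 < P);
                atomic_store(&n->cnotready[j], n->havechild[j]);
            }
            n->parentpointer = (i == 0) ? &n->dummy
                             : &nodes[(i - 1) / 4].cnotready[(i - 1) % 4];
            n->childpointers[0] = (2 * i + 1 < P) ? &nodes[2 * i + 1].wsense : &n->dummy;
            n->childpointers[1] = (2 * i + 2 < P) ? &nodes[2 * i + 2].wsense : &n->dummy;
        }
    }

    void tree_barrier(void) {
        treenode *n = &nodes[vpid];
        for (int j = 0; j < 4; j++)                      /* wait for arrival-tree children */
            while (atomic_load(&n->cnotready[j]))
                ;
        for (int j = 0; j < 4; j++)                      /* re-initialize for next time */
            atomic_store(&n->cnotready[j], n->havechild[j]);
        atomic_store(n->parentpointer, false);           /* signal parent */
        if (vpid != 0)                                   /* not root: wait until parent wakes me */
            while (atomic_load(&n->wsense) != sense)
                ;
        atomic_store(n->childpointers[0], sense);        /* signal children in wakeup tree */
        atomic_store(n->childpointers[1], sense);
        sense = !sense;
    }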

Results
- MCS scales best on the Butterfly
- Backoff is effective
- The peak in the curve is due to the fact that some parts of lock acquire/release occur in parallel
- Note: the time to release an MCS lock depends on whether a processor is waiting

Results (cont'd)
- On the Symmetry, MCS and Anderson perform best
- Is the Symmetry more representative of actual lock costs?

Barrier Results

Shared Local Memory
- Good for performance
- Helps because it lets processes spin on local items without going to main memory

Conclusions