CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel.



The Synthesis kernel
• A research project at Columbia University
• Synthesis V.0
  o Uniprocessor (Motorola 68020)
  o No virtual memory
• Synthesis V.1
  o Dual 68030s
  o Virtual memory, threads, etc.
  o Lock-free kernel

Locking
• Why do kernels normally use locks?
• Locks support a concurrent programming style based on mutual exclusion
  o Acquire lock on entry to critical sections
  o Release lock on exit
  o Block or spin if lock is held
  o Only one thread at a time executes the critical section
• Locks prevent concurrent access and enable sequential reasoning about critical section code

So why not use locking?
• Granularity decisions
  o Simplicity vs performance
  o Increasingly poor performance (superscalar CPUs)
• Complicates composition
  o Need to know which locks I’m holding before calling a function
  o Need to know if it’s safe to call while holding those locks
• Risk of deadlock
• Propagates thread failures to other threads
  o What if I crash while holding a lock?

Is there an alternative?
• Use lock-free, “optimistic” synchronization
  o Execute the critical section unconstrained, and check at the end to see if you were the only one
  o If so, continue. If not, roll back and retry
• Synthesis uses no locks at all!
• Goal: show that lock-free synchronization is...
  o Sufficient for all OS synchronization needs
  o Practical
  o High performance

Locking is pessimistic
• Murphy's law: “If it can go wrong, it will...”
• In concurrent programming:
  o “If we can have a race condition, we will...”
  o “If another thread could mess us up, it will...”
• Solution:
  o Hide the resources behind locked doors
  o Make everyone wait until we're done
  o That is...if there was anyone at all
  o We pay the same cost either way

Optimistic synchronization
• The common case is often little or no contention
  o Or at least it should be!
  o Do we really need to shut out the whole world?
  o Why not proceed optimistically and only incur cost if we encounter contention?
• If there's little contention, starvation is unlikely in practice
  o So we don’t need to be “wait-free”, which guarantees no starvation
  o Lock-free is easier and cheaper than wait-free
• Small critical sections really help performance

How does it work?
• Copy
  o Write down any state we need in order to retry
• Do the work
  o Perform the computation
• Atomically “test and commit” or retry
  o Compare saved assumptions with the actual state of the world
  o If different, undo work, and start over with new state
  o If preconditions still hold, commit the results and continue
  o This is where the work becomes visible to the world (ideally)
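The copy / do the work / test-and-commit pattern can be sketched with C11 atomics. This is only an illustration: the `balance` variable and the `withdraw` function are hypothetical and do not come from Synthesis, and C11's `atomic_compare_exchange_weak` stands in for the hardware CAS instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared value, for illustration only. */
static _Atomic long balance = 100;

/* Withdraw `amount` if funds allow; returns true on success.
 * A failed commit means some other thread's commit got in first. */
bool withdraw(long amount)
{
    long seen = atomic_load(&balance);      /* copy */
    for (;;) {
        if (seen < amount)
            return false;                   /* nothing committed, nothing to undo */
        long wanted = seen - amount;        /* do the work on private state */
        /* Test and commit: succeeds only if balance still equals `seen`.
         * On failure `seen` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&balance, &seen, wanted))
            return true;
    }
}
```

Because all the work is done on locals, "undoing" a failed attempt costs nothing here: the loop simply recomputes from the refreshed copy.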

Example – stack pop

    Pop() {
    retry:
        old_SP = SP;
        new_SP = old_SP + 1;
        elem = *old_SP;
        if (CAS(old_SP, new_SP, &SP) == FAIL)
            goto retry;
        return elem;
    }


Example – stack pop: annotations
• old_SP, new_SP and elem are locals - won’t change!
• SP is global - may change any time!
• CAS is an “atomic” read-modify-write instruction

CAS
• CAS – single word Compare and Swap
  o An atomic read-modify-write instruction
  o Semantics of the single atomic instruction are:

    CAS(copy, update, mem_addr) {
        if (*mem_addr == copy) {
            *mem_addr = update;
            return SUCCESS;
        } else
            return FAIL;
    }
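On a modern toolchain the same semantics are available through C11's `<stdatomic.h>`. The wrapper below is a sketch that keeps the slide's argument order; the `FAIL`/`SUCCESS` values and the choice of `uintptr_t` as the word type are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

enum { FAIL = 0, SUCCESS = 1 };

/* Same argument order as the slide's CAS(copy, update, mem_addr). */
int CAS(uintptr_t copy, uintptr_t update, _Atomic uintptr_t *mem_addr)
{
    /* The strong variant never fails spuriously, so it fails only when
     * *mem_addr != copy. On failure it writes the observed value into
     * our local `copy`, which we simply discard. */
    return atomic_compare_exchange_strong(mem_addr, &copy, update)
               ? SUCCESS : FAIL;
}
```

Callers that retry in a loop anyway often prefer `atomic_compare_exchange_weak`, which may fail spuriously but can be cheaper on some architectures.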

Example – stack pop: the three steps
• Copy: old_SP = SP;
• Do work: new_SP = old_SP + 1; elem = *old_SP;
• Test and commit: CAS(old_SP, new_SP, &SP), and goto retry on FAIL
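A compilable version of the pop pseudocode might look as follows. It is a sketch under assumptions: the backing array `stack_mem`, its capacity, and the downward-growing layout (SP points at the top element, one past the end means empty) are chosen for the example, and C11 atomics replace the 68030 CAS instruction.

```c
#include <assert.h>
#include <stdatomic.h>

#define STACK_WORDS 8                       /* hypothetical capacity */
static long stack_mem[STACK_WORDS];
/* The stack grows downward; SP points at the top element.
 * SP == stack_mem + STACK_WORDS means the stack is empty. */
static _Atomic(long *) SP = stack_mem + STACK_WORDS;

long Pop(void)
{
    long *old_SP = atomic_load(&SP);                  /* copy */
    for (;;) {
        assert(old_SP < stack_mem + STACK_WORDS);     /* must not pop when empty */
        long elem = *old_SP;                          /* do work */
        long *new_SP = old_SP + 1;
        /* Test and commit; on failure old_SP is reloaded with the current SP. */
        if (atomic_compare_exchange_weak(&SP, &old_SP, new_SP))
            return elem;
    }
}
```

The loop replaces the slide's `goto retry`; the compare-exchange refreshes `old_SP` itself on failure, so no separate reload is needed.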

What made it work?
• It works because we can atomically commit the new stack pointer value and compare the old stack pointer with the one at commit time
• This allows us to verify no other thread has accessed the stack concurrently with our operation
  o i.e. since we took the copy
  o Well, at least we know the address in the stack pointer is the same as it was when we started
• Does this guarantee there was no concurrent activity?
• Does it matter?
• We have to be careful!
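The gap between "the address is the same" and "nothing happened" is commonly called the ABA problem. The sketch below replays one bad interleaving inside a single thread so that it is deterministic; the names `slots`, `top` and `aba_demo` are hypothetical, and "thread B" uses plain stores purely for brevity.

```c
#include <stdatomic.h>
#include <stdbool.h>

static long slots[4];
static _Atomic(long *) top = &slots[3];    /* hypothetical one-element stack */

/* Returns true if a CAS that checks only the address was fooled: it
 * committed even though the element at that address changed in between. */
bool aba_demo(void)
{
    slots[3] = 111;
    atomic_store(&top, &slots[3]);

    /* "Thread A": copy and do work, then get delayed before the commit. */
    long *old_top = atomic_load(&top);
    long  elem    = *old_top;                   /* reads 111 */

    /* "Thread B" runs in the gap: pops 111, then pushes 222 into the same slot. */
    atomic_store(&top, &slots[3] + 1);          /* pop  */
    slots[3] = 222;
    atomic_store(&top, &slots[3]);              /* push */

    /* "Thread A" resumes: the address matches, so the commit succeeds. */
    bool committed =
        atomic_compare_exchange_strong(&top, &old_top, old_top + 1);

    /* A returns 111, which B already popped, and 222 is silently lost. */
    return committed && elem == 111 && slots[3] == 222;
}
```

Whether this matters depends on what the popped values mean to the caller, which is why the slide says care is needed.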

Stack push

    Push(elem) {
    retry:
        old_SP = SP;
        new_SP = old_SP - 1;
        old_val = *new_SP;
        if (CAS2(old_SP, old_val, new_SP, elem, &SP, new_SP) == FAIL)
            goto retry;
    }

Stack push: the three steps
• Copy: old_SP = SP;
• Do work: new_SP = old_SP - 1; old_val = *new_SP;
• Test and commit: CAS2(old_SP, old_val, new_SP, elem, &SP, new_SP), and goto retry on FAIL

Stack push: annotations
• Note: this is a double compare and swap!
• It’s needed to atomically update both the new item and the new stack pointer
• Unnecessary compare: the check of old_val against *new_SP is not needed, only the store of elem

CAS2
• CAS2 = double compare and swap
  o Sometimes referred to as DCAS

    CAS2(copy1, copy2, update1, update2, addr1, addr2) {
        if (*addr1 == copy1 && *addr2 == copy2) {
            *addr1 = update1;
            *addr2 = update2;
            return SUCCESS;
        } else
            return FAIL;
    }
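C11 offers no portable two-address compare-and-swap, so the sketch below is a plain, NON-atomic reference model of CAS2, useful only for checking the logic single-threaded. It also runs the push pseudocode against that model. Assumptions for the example: `SP` is an index into a hypothetical `stack_mem` array rather than a raw address, and the word type is `long`.

```c
#include <assert.h>

enum { FAIL = 0, SUCCESS = 1 };

/* Reference model only: NOT atomic. On the 68030 this is one instruction. */
int CAS2(long copy1, long copy2, long update1, long update2,
         long *addr1, long *addr2)
{
    if (*addr1 == copy1 && *addr2 == copy2) {
        *addr1 = update1;
        *addr2 = update2;
        return SUCCESS;
    }
    return FAIL;
}

#define STACK_WORDS 8                  /* hypothetical capacity */
static long stack_mem[STACK_WORDS];
static long SP = STACK_WORDS;          /* index of top element; STACK_WORDS = empty */

void Push(long elem)
{
    long old_SP, new_SP, old_val;
    do {
        old_SP = SP;                   /* copy */
        assert(old_SP > 0);            /* must not push when full */
        new_SP = old_SP - 1;           /* do work */
        old_val = stack_mem[new_SP];
        /* test and commit: update SP and the new top slot together */
    } while (CAS2(old_SP, old_val, new_SP, elem,
                  &SP, &stack_mem[new_SP]) == FAIL);
}
```

In a real multiprocessor setting the model would have to be replaced by a genuine atomic primitive; running it from several threads as written would be a data race.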


Optimistic synchronization in Synthesis
• Saved state is only one or two words
• Commit is done via
  o Compare-and-Swap (CAS), or
  o Double-Compare-and-Swap (CAS2 or DCAS)
• Can we really do everything in only two words?
  o Every synchronization problem in the Synthesis kernel is reduced to only needing to atomically touch two words at a time!
  o Requires some very clever kernel architecture

Approach
• Build data structures that work concurrently
  o Stacks
  o Queues (array-based to avoid allocations)
  o Linked lists
• Then build the OS around these data structures
• Concurrency is a first-class concern

Why is this trickier than it seems?
• List operations show insert and delete at the head
  o This is the easy case
  o What about insert and delete of interior nodes?
  o Next pointers of deletable nodes are not safe to traverse, even the first time!
  o Need reference counts and DCAS to atomically compare and update the count and pointer values
  o This is expensive, so we may choose to defer deletes instead (more on this later in the course)
• Specialized list and queue implementations can reduce the overheads

The fall-back position
• If you can’t reduce the work such that it requires atomic updates to two or fewer words:
  o Create a single server thread and do the work sequentially on a single CPU
  o Why is this faster than letting multiple CPUs try to do it concurrently?
• Callers pack the requested operation into a message
  o Send it to the server (using lock-free queues!)
  o Wait for a response/callback/...
  o The queue effectively serializes the operations
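The message-passing fall-back can be sketched as follows. This is not the Synthesis implementation: the slides mention array-based queues, whereas this example swaps in a simpler CAS-linked list of requests that the server empties in one atomic exchange. The `request` type, `submit`, `serve_pending` and the "add to a total" operation are all hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical request message: "add `amount` to the shared total". */
struct request {
    long amount;
    struct request *next;
};

static _Atomic(struct request *) inbox = NULL;  /* pending requests */
static long total = 0;                          /* touched only by the server */

/* Callers: link the message in with a single-word CAS. */
void submit(struct request *r)
{
    struct request *head = atomic_load(&inbox);
    do {
        r->next = head;
    } while (!atomic_compare_exchange_weak(&inbox, &head, r));
}

/* Server: take the whole batch with one atomic exchange, restore arrival
 * order, then apply each request sequentially. Returns the number served. */
int serve_pending(void)
{
    struct request *batch = atomic_exchange(&inbox, NULL);
    struct request *fifo = NULL;
    while (batch) {                     /* reverse the LIFO list into FIFO order */
        struct request *next = batch->next;
        batch->next = fifo;
        fifo = batch;
        batch = next;
    }
    int served = 0;
    for (; fifo; fifo = fifo->next, served++)
        total += fifo->amount;          /* sequential: no synchronization needed */
    return served;
}
```

Taking the whole list at once means the server never pops individual nodes, which sidesteps the address-reuse hazard of a CAS-based pop. Only the server touches `total`, so the operation itself can be arbitrarily complex.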

Lock vs lock-free critical sections

    Lock_based_Pop() {
        spin_lock(&lock);
        elem = *SP;
        SP = SP + 1;
        spin_unlock(&lock);
        return elem;
    }

    Lock_free_Pop() {
    retry:
        old_SP = SP;
        new_SP = old_SP + 1;
        elem = *old_SP;
        if (CAS(old_SP, new_SP, &SP) == FAIL)
            goto retry;
        return elem;
    }
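For comparison, the lock-based variant can be made compilable with a minimal test-and-set spinlock built on C11's `atomic_flag`. The spinlock, the `stack_mem` array and its capacity are assumptions for the example; production kernels use more elaborate locks.

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                               /* spin until the holder clears the flag */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

#define STACK_WORDS 8                   /* hypothetical capacity */
static long stack_mem[STACK_WORDS];
static long *SP = stack_mem + STACK_WORDS;   /* protected by `lock` */

long Lock_based_Pop(void)
{
    spin_lock(&lock);
    long elem = *SP;                    /* plain accesses: the lock excludes others */
    SP = SP + 1;
    spin_unlock(&lock);
    return elem;
}
```

Both variants execute one atomic read-modify-write per uncontended pop. The difference shows under contention or failure: a thread that stalls while holding `lock` blocks everyone, whereas a stalled lock-free pop only causes its own retry.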

Conclusions
• This is really intriguing!
• It’s possible to build an entire OS without locks!
• But do you really want to?
  o Does it add or remove complexity?
  o What if hardware only gives you CAS and no DCAS?
  o What if critical sections are large or long lived?
  o What if contention is high?
  o What if we can’t undo the work?
  o … ?