CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IV: OS Support

2 CMPT 401 Summer 2007 © A. Fedorova Outline Continue discussing OS support for threads and processes Alternative distributed systems architectures inspired by limitations of threads Support for IPC Scalable synchronization

3 CMPT 401 Summer 2007 © A. Fedorova Process/Thread Support: Good Enough? Many computer scientists observed limited scalability of MT and MP architectures Performance of a threaded web server M. Welsh, SOSP ‘01

4 CMPT 401 Summer 2007 © A. Fedorova Alternative Web Services Architectures Alternative architectures for web services that rely less heavily on threads/processes: –Single-Process Event-Driven (SPED) –Asymmetric Multiprocess Event-Driven (AMPED) –Stage Event-Driven Architecture (SEDA)

5 CMPT 401 Summer 2007 © A. Fedorova Web Services Architecture. Case Study: A Web server Sequence of actions at the web server Each step can block: –Socket read/accept can block on network I/O –File find/read can block for disk I/O –Send can block on TCP buffer queue How do servers overlap blocking and computation? V. Pai, USENIX ‘99

6 CMPT 401 Summer 2007 © A. Fedorova Multiprocess (MP) or Multithreaded (MT) Architecture: A Review MP: one process performs all steps for a request; I/O and computation overlap naturally; the OS switches to a new process when a process blocks. MT: one thread performs all steps for a request; I/O and computation can overlap only if support for kernel threads is available; modern OSs provide such support. V. Pai, USENIX ‘99

7 CMPT 401 Summer 2007 © A. Fedorova Single Process Event-Driven Architecture A single process executes processing steps for all requests Uses non-blocking network and disk I/O system calls Uses select system call to check on the status of those operations Problem #1: many OSs do not provide non-blocking system calls for disk I/O Problem #2: those that do, do not integrate them with select – cannot check for completion of network and disk I/O simultaneously V. Pai, USENIX ‘99
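To make the SPED structure concrete, here is a minimal C sketch (not from the slides) of a single-process event loop that keeps its sockets non-blocking and multiplexes them with select(); the port number, buffer size, and echo-style handling are illustrative stand-ins for real request processing, and error handling and connection teardown are omitted.

/* Minimal SPED-style event loop (illustrative sketch; error handling and connection teardown omitted). */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                    /* illustrative port */
    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 16);
    fcntl(listener, F_SETFL, O_NONBLOCK);           /* never block in accept() */

    int clients[FD_SETSIZE];
    int nclients = 0;
    int maxfd = listener;

    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(listener, &readable);
        for (int i = 0; i < nclients; i++)
            FD_SET(clients[i], &readable);

        /* A single select() call watches the listening socket and every connection at once. */
        select(maxfd + 1, &readable, NULL, NULL, NULL);

        if (FD_ISSET(listener, &readable)) {
            int c = accept(listener, NULL, NULL);
            if (c >= 0) {
                fcntl(c, F_SETFL, O_NONBLOCK);      /* keep network I/O non-blocking */
                clients[nclients++] = c;
                if (c > maxfd) maxfd = c;
            }
        }
        for (int i = 0; i < nclients; i++) {
            if (FD_ISSET(clients[i], &readable)) {
                char buf[4096];
                ssize_t n = read(clients[i], buf, sizeof buf);      /* returns immediately */
                if (n > 0)
                    write(clients[i], buf, (size_t)n);  /* echo; a real server would parse the request */
                /* As the slide notes, a *disk* read at this point could still block the whole process. */
            }
        }
    }
}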

8 CMPT 401 Summer 2007 © A. Fedorova Asymmetric Multiprocess Event Driven Architecture (AMPED) AMPED = MP + SPED Use SPED architecture for I/O operations with non-blocking interface: socket read/write, accept Use MP architecture for I/O operations without the non-blocking interface: file read/write: mmap the file Use mincore to check if the file is in memory If not, spawn a helper process to bring the file into memory Communicate with the helper process via IPC V. Pai, USENIX ‘99 Flash – a web server implemented using AMPED (V. Pai, et al., USENIX ‘99) Matches or exceeds performance of existing web servers by up to 50%
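A hedged C sketch of the residency check AMPED relies on (assuming the Linux mincore() signature; the file path is hypothetical): the server mmaps the file and uses mincore to decide whether it can serve the file directly or must hand it to a helper process.

/* AMPED-style residency check: is the mapped file already in memory? (sketch; error handling omitted) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/example.html";        /* hypothetical file */
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;

    unsigned char vec[npages];
    mincore(data, st.st_size, vec);                /* one byte per page; low bit set = page resident */

    int all_resident = 1;
    for (size_t i = 0; i < npages; i++)
        if (!(vec[i] & 1)) { all_resident = 0; break; }

    if (all_resident)
        printf("serve from the main event loop (no blocking expected)\n");
    else
        printf("hand off to a helper process to fault the file in\n");

    munmap(data, st.st_size);
    close(fd);
    return 0;
}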

9 CMPT 401 Summer 2007 © A. Fedorova Staged Event-Driven Architecture Observation: AMPED is good, but it is not easy to control application resources. E.g., which event to process first? SEDA: Create a stage for each logical step of processing; Manage each stage separately There is a queue of events for each stage, so you can tell how each stage is loaded Each stage can be processed by several (a small number of) threads Adaptive load shedding – manage queues to control load –E.g., if the stage that involves disk I/O is the bottleneck, drop the queued up requests or reject new requests Dynamic control – adjust the number of threads per stage based on demand M. Welsh, SOSP ‘01

10 CMPT 401 Summer 2007 © A. Fedorova Outline Continue discussing OS support for threads and processes Alternative distributed systems architectures inspired by limitations of threads Support for IPC Support for scalable synchronization Distributed operating systems

11 CMPT 401 Summer 2007 © A. Fedorova OS Support for Inter-Process Communication (IPC) Cooperating processes or threads need to communicate Threads share address space, so they communicate via shared memory What about processes? They do not share an address space. They communicate via: –Unix pipes –Memory-mapped files –Inter-process shared memory

12 CMPT 401 Summer 2007 © A. Fedorova Unix Pipes A pipe is a communication channel between two processes. Using a pipe in a shell: prompt% cat log_file | grep “May 16” (cat writes into the pipe, grep reads from it). Pipes can also be created with the pipe() system call.
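A minimal C sketch, assuming a POSIX system, of what the shell sets up above: the parent plays the cat role and writes into the pipe, the forked child plays the grep role and reads from it. Error handling is omitted.

/* Minimal pipe() sketch: the parent writes ("cat"), the forked child reads ("grep"). */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    pipe(fds);                         /* fds[0] = read end, fds[1] = write end */

    if (fork() == 0) {                 /* child: the reading side */
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf - 1);   /* blocks until data arrives */
        buf[n > 0 ? n : 0] = '\0';
        printf("child received: %s", buf);
        _exit(0);
    }

    close(fds[0]);                     /* parent: the writing side */
    const char *msg = "May 16: example log line\n";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);                     /* closing the write end signals EOF to the reader */
    wait(NULL);
    return 0;
}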

13 CMPT 401 Summer 2007 © A. Fedorova Implementation of Pipes In Solaris, a pipe is a data structure containing two vnodes, a lock, and a buffer. To the user, each end of the pipe is represented by a file descriptor. The user reads/writes the pipe by reading/writing the file descriptor. The OS blocks a process reading from an empty pipe. The OS blocks a process writing into a full pipe (when the buffer is full).

14 CMPT 401 Summer 2007 © A. Fedorova Memory-mapped Files (diagram: the same file is mapped into the address space of process A and the address space of process B; both processes access the shared mapped region)
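A hedged C sketch of the picture above: a parent and a forked child map the same file with MAP_SHARED, so a write through one mapping is visible through the other. The file name and sizes are illustrative; error handling is omitted.

/* Sketch: two processes communicate through a shared mapping of the same file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/ipc_demo";            /* illustrative file name */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, 4096);                           /* give the file one page of backing store */

    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (fork() == 0) {                             /* child: writes through its own mapping */
        strcpy(shared, "hello from the child");
        _exit(0);
    }
    wait(NULL);
    printf("parent reads: %s\n", shared);          /* parent sees the child's update */

    munmap(shared, 4096);
    close(fd);
    unlink(path);
    return 0;
}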

15 CMPT 401 Summer 2007 © A. Fedorova Inter-process Shared Memory Inter-process shared memory: a piece of physical memory set up to be shared among processes Allocate inter-process shared memory using shmget Get permission to use (attach to it) via shmat Disadvantages: shared memory is not cleaned up automatically when processes exit; it needs to be cleaned up explicitly
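A minimal C sketch of the System V calls named above, including the explicit cleanup the slide warns about; IPC_PRIVATE and the segment size are illustrative choices, and error handling is omitted.

/* Sketch: System V shared memory between a parent and a child, with explicit cleanup. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);   /* create a 4 KB segment */
    char *mem = shmat(shmid, NULL, 0);                         /* attach: map it into this address space */

    if (fork() == 0) {
        char *child_mem = shmat(shmid, NULL, 0);               /* child attaches the same segment */
        strcpy(child_mem, "written by the child");
        shmdt(child_mem);
        _exit(0);
    }
    wait(NULL);
    printf("parent reads: %s\n", mem);

    shmdt(mem);
    shmctl(shmid, IPC_RMID, NULL);     /* explicit cleanup – the segment would otherwise outlive both processes */
    return 0;
}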

16 CMPT 401 Summer 2007 © A. Fedorova Performance of IPC IPC involves inter-process context switching – the expensive kind of context switch, because it involves switching address spaces. The cost of a context switch largely determines the cost of IPC, and that cost depends heavily on the hardware.

17 CMPT 401 Summer 2007 © A. Fedorova Outline Continue discussing OS support for threads and processes Alternative distributed systems architectures inspired by limitations of threads Support for IPC Support for scalable synchronization

18 CMPT 401 Summer 2007 © A. Fedorova Synchronization
Thread 1: perform a withdrawal
if (account_balance >= amount) { account_balance -= amount; }
Thread 2: subtract service fee
if (account_balance >= service_fee) { account_balance -= service_fee; }
Unsynchronized access: the account balance can change between Thread 1’s check and its update!!!
Synchronized access:
lock_acquire(account_balance_lock); if (account_balance >= amount) { account_balance -= amount; } lock_release(account_balance_lock);
lock_acquire(account_balance_lock); if (account_balance >= service_fee) { account_balance -= service_fee; } lock_release(account_balance_lock);
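For reference, a compilable C version of the slide’s synchronized access using a pthread mutex; the variable and lock names follow the slide, while the amounts and thread setup are illustrative.

/* Compilable version of the slide's synchronized access, using a pthread mutex. */
#include <pthread.h>
#include <stdio.h>

static double account_balance = 100.0;             /* illustrative starting balance */
static pthread_mutex_t account_balance_lock = PTHREAD_MUTEX_INITIALIZER;

static void *withdraw(void *arg) {
    double amount = 80.0;
    (void)arg;
    pthread_mutex_lock(&account_balance_lock);      /* the check and the update are now one atomic step */
    if (account_balance >= amount)
        account_balance -= amount;
    pthread_mutex_unlock(&account_balance_lock);
    return NULL;
}

static void *charge_fee(void *arg) {
    double service_fee = 30.0;
    (void)arg;
    pthread_mutex_lock(&account_balance_lock);
    if (account_balance >= service_fee)
        account_balance -= service_fee;
    pthread_mutex_unlock(&account_balance_lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, withdraw, NULL);
    pthread_create(&t2, NULL, charge_fee, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final balance: %.2f\n", account_balance);   /* never goes negative */
    return 0;
}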

19 CMPT 401 Summer 2007 © A. Fedorova Synchronization Primitives (SP) Synchronization primitives provide atomic access to a critical section Types of synchronization primitives –mutex –semaphore –lock –condition variable –etc. Synchronization primitives are provided by the OS Can also be implemented by a library (e.g., pthreads) or by the application Hardware provides special atomic instructions for implementing synchronization primitives (test-and-set, compare-and-swap, etc.)

20 CMPT 401 Summer 2007 © A. Fedorova Implementation of SP The performance of applications that use an SP is determined by the implementation of that SP An SP must be scalable – it must continue to perform well as the number of contending threads increases We will look at several implementations of locks to understand how to create a scalable implementation

21 CMPT 401 Summer 2007 © A. Fedorova What should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor Systems usually use a combination: –Spin for a while, then give up the processor We will focus on multiprocessors, so we’ll look at spinlock implementations © Herlihy-Shavit 2007

22 CMPT 401 Summer 2007 © A. Fedorova A Shared Memory Multiprocessor (diagram: several processors, each with a private cache, connected by a shared bus to memory) © Herlihy-Shavit 2007

23 CMPT 401 Summer 2007 © A. Fedorova Basic Spinlock (diagram: threads spin on the lock, one at a time enters the critical section, and the holder resets the lock upon exit) The lock suffers from contention and becomes a sequential bottleneck – no parallelism © Herlihy-Shavit 2007

24 CMPT 401 Summer 2007 © A. Fedorova Review: Test-and-Set We have a boolean value in memory Test-and-set (TAS) –Swap true with prior value –Return value tells if prior value was true or false Can reset just by writing false © Herlihy-Shavit 2007

25 CMPT 401 Summer 2007 © A. Fedorova TAS Provided by the hardware. Example on SPARC: the assembly instruction ldstub (load-store unsigned byte) loads a byte from memory into a register and atomically writes the value 0xFF into the addressed byte. TAS can also be implemented in a high-level language (swap old and new values, return the prior one). Example in Java:
public class AtomicBoolean {
  boolean value;
  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;
    value = newValue;
    return prior;
  }
}
© Herlihy-Shavit 2007

26 CMPT 401 Summer 2007 © A. Fedorova TAS Locks Value of TAS’ed memory shows lock state: –Lock is free: value is false –Lock is taken: value is true Acquire lock by calling TAS: –If result is false, you win –If result is true, you lose Release lock by writing false

27 CMPT 401 Summer 2007 © A. Fedorova TAS Lock in SPARC Assembly
spin_lock:
busy_loop:
        ldstub  [%o0], %o1    ! load old value into %o1, atomically write 0xFF into the byte addressed by %o0
        tst     %o1           ! test whether %o1 equals zero
        bne     busy_loop     ! if %o1 is not zero (lock was already taken), spin
        nop                   ! delay slot for branch
        retl                  ! lock acquired – return
        nop                   ! delay slot for return

28 CMPT 401 Summer 2007 © A. Fedorova TAS Lock in Java
class TASlock {
  AtomicBoolean state = new AtomicBoolean(false);  // initialize lock state to false (unlocked)
  void lock() {
    while (state.getAndSet(true)) {}               // while the lock is taken (true), spin
  }
  void unlock() {
    state.set(false);                              // release the lock – set state to false
  }
}
© Herlihy-Shavit 2007
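The Java version above is from the Herlihy-Shavit slides; as a rough C equivalent (an assumption, not part of the lecture), the same lock can be written with the C11 atomic_flag, whose test-and-set maps to a hardware instruction like ldstub.

/* Rough C equivalent of the TAS lock, built on the C11 atomic_flag. */
#include <stdatomic.h>

typedef struct {
    atomic_flag state;                 /* clear = lock free, set = lock taken */
} tas_lock;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

static void tas_lock_acquire(tas_lock *l) {
    /* atomic_flag_test_and_set returns the prior value: false means we won the lock. */
    while (atomic_flag_test_and_set(&l->state))
        ;                              /* spin: each iteration is a read-modify-write on the shared location */
}

static void tas_lock_release(tas_lock *l) {
    atomic_flag_clear(&l->state);      /* release by writing "false" */
}

Usage is simply a static tas_lock initialized with TAS_LOCK_INIT, with acquire/release wrapped around the critical section.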

29 CMPT 401 Summer 2007 © A. Fedorova Performance of TAS Lock Experiment –N threads on a multiprocessor –Increment shared counter 1 million times (total) –The thread acquires a lock before incrementing the counter –Each thread does 1,000,000/N increments N does not exceed the number of processors, so there is no thread-switching overhead How long should it take? How long does it take?

30 CMPT 401 Summer 2007 © A. Fedorova Expected performance (graph: total time vs. number of threads is flat, matching the ideal – no speedup because there is no parallelism; threads take turns doing lock_acquire, increment, lock_release, the same as a sequential execution) © Herlihy-Shavit 2007

31 CMPT 401 Summer 2007 © A. Fedorova Actual Performance (graph: total time vs. number of threads – the TAS lock curve grows steeply with the number of threads, much worse than ideal) © Herlihy-Shavit 2007

32 CMPT 401 Summer 2007 © A. Fedorova Reasons for Bad TAS Lock Performance Has to do with cache behaviour on the multiprocessor system TAS causes a lot of invalidation misses –This hurts performance To understand what this means, let’s review how caches work

33 CMPT 401 Summer 2007 © A. Fedorova Processor Issues Load Request (diagram: a processor requests data over the bus; memory supplies the data and a copy is placed in the processor’s cache) © Herlihy-Shavit 2007

34 CMPT 401 Summer 2007 © A. Fedorova Another Processor Issues Load Request (diagram: a second processor announces “I want data” on the bus; a cache that already holds the data answers “I got data” and supplies it, so both caches now hold copies) © Herlihy-Shavit 2007

35 CMPT 401 Summer 2007 © A. Fedorova memory Bus Processor Modifies Data cache data Now other copies are invalid data © Herlihy-Shavit 2007

36 CMPT 401 Summer 2007 © A. Fedorova Send Invalidation Message to Others (diagram: the writing processor broadcasts “Invalidate!” on the bus; other caches lose read permission for that data) Memory need not be updated now: the cache holding the modified copy can provide valid data when it is next requested © Herlihy-Shavit 2007

37 CMPT 401 Summer 2007 © A. Fedorova Processor Asks for Data (diagram: another processor requests the data on the bus – “I want data” – and receives it from the cache that holds the valid copy) © Herlihy-Shavit 2007

38 CMPT 401 Summer 2007 © A. Fedorova Multiprocessor Caches: Summary Simultaneous reads and writes of shared data: –Make data invalid Invalidation is bad for performance On next data request: –Data must be fetched from another cache This slows down performance

39 CMPT 401 Summer 2007 © A. Fedorova What This Has to Do with TAS Locks Recall that the TAS lock had bad performance (graph: TAS lock vs. ideal, total time vs. number of threads); invalidations were the cause. Here is why: all spinners do load/store in a loop, they all read/write the same location, and this causes lots of invalidations.

40 CMPT 401 Summer 2007 © A. Fedorova A Solution: Test-And-Test-And-Set Lock Wait until lock “looks” free –Spin on local cache –No bus use while lock busy
class TTASlock {
  AtomicBoolean state = new AtomicBoolean(false);
  void lock() {
    while (true) {
      while (state.get()) {}           // wait until the lock looks free: we read the lock instead of TASing it, avoiding repeated invalidations
      if (!state.getAndSet(true))      // now try to acquire it
        return;
    }
  }
}
© Herlihy-Shavit 2007

41 CMPT 401 Summer 2007 © A. Fedorova TTAS Lock Performance (graph: TAS lock, TTAS lock, and ideal – total time vs. number of threads; TTAS is better, but still far from ideal) © Herlihy-Shavit 2007

42 CMPT 401 Summer 2007 © A. Fedorova The Problem with TTAS Lock When the lock is released: –Everyone tries to acquire it –Everyone does TAS –There is a storm of invalidations Only one processor can use the bus at a time So all processors queue up, waiting for the bus, so they can perform the TAS

43 CMPT 401 Summer 2007 © A. Fedorova A Solution: TTAS Lock with Backoff Intuition: If I fail to get the lock there must be contention So I should back off before trying again Introduce a random “sleep” delay before trying to acquire the lock again © Herlihy-Shavit 2007
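A hedged C sketch of a TTAS lock with randomized exponential backoff, using C11 atomics and POSIX nanosleep; MIN_DELAY_NS and MAX_DELAY_NS are exactly the tunable delay parameters the next slides warn are hard to choose portably.

/* Sketch: TTAS spinlock with randomized exponential backoff. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

#define MIN_DELAY_NS    1000L          /* illustrative tunables: this is the platform-sensitive */
#define MAX_DELAY_NS 1000000L          /* part of the design that is hard to get right portably */

typedef struct { atomic_bool state; } backoff_lock;

static void backoff_lock_init(backoff_lock *l) { atomic_init(&l->state, false); }

static void backoff_lock_acquire(backoff_lock *l) {
    long limit = MIN_DELAY_NS;
    for (;;) {
        while (atomic_load(&l->state))            /* test: spin on a cached read while the lock looks taken */
            ;
        if (!atomic_exchange(&l->state, true))    /* test-and-set: try to grab it */
            return;
        /* Failed even though it looked free: contention, so back off for a random delay. */
        struct timespec ts = { 0, rand() % limit };
        nanosleep(&ts, NULL);
        if (limit < MAX_DELAY_NS)
            limit *= 2;                           /* exponential growth of the backoff window */
    }
}

static void backoff_lock_release(backoff_lock *l) {
    atomic_store(&l->state, false);
}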

44 CMPT 401 Summer 2007 © A. Fedorova TTAS Lock with Backoff: Performance (graph: TAS lock, TTAS lock, backoff lock, and ideal – total time vs. number of threads; the backoff lock performs better than TAS and TTAS) © Herlihy-Shavit 2007

45 CMPT 401 Summer 2007 © A. Fedorova Backoff Locks Better performance than TAS and TTAS Caveats: –Performance is sensitive to the choice of delay parameter –The delay parameter depends on the number of processors and their speed –Easy to tune for one platform –Difficult to write an implementation that will work well across multiple platforms © Herlihy-Shavit 2007

46 CMPT 401 Summer 2007 © A. Fedorova An Idea Avoid useless invalidations –By keeping a queue of threads Each thread –Notifies next in line –Without bothering the others © Herlihy-Shavit 2007

47 CMPT 401 Summer 2007 © A. Fedorova Anderson Queue Lock (diagram: an array of flags, initially T F F F F F F F – one idle “spin” location per thread – plus a “next” counter that points to the next unused spin location) To acquire: getAndIncrement – atomically read the value of “next” and increment it; if the flag at the slot you obtained is TRUE, the lock is acquired © Herlihy-Shavit 2007

48 CMPT 401 Summer 2007 © A. Fedorova Acquiring a Held Lock (diagram: the holder’s slot is marked acquired; an arriving thread does getAndIncrement and spins on its own flag; when the holder releases, it sets the next slot to T and that waiter acquires the lock) © Herlihy-Shavit 2007
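A hedged C sketch of the array-based queue lock described on the last two slides, using C11 atomics; MAX_THREADS and the cache-line padding size are illustrative, and each thread spins only on its own padded flag, so a release invalidates just one waiter’s cache line.

/* Sketch: Anderson array-based queue lock. */
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 8                  /* illustrative capacity: must cover all contending threads */
#define CACHE_LINE 64                  /* illustrative cache-line size, to avoid false sharing */

typedef struct {
    struct {
        atomic_bool can_enter;
        char pad[CACHE_LINE - sizeof(atomic_bool)];
    } flags[MAX_THREADS];              /* one spin location per thread, each on its own cache line */
    atomic_uint next;                  /* index of the next unused spin location */
} anderson_lock;

static void anderson_init(anderson_lock *l) {
    for (int i = 0; i < MAX_THREADS; i++)
        atomic_init(&l->flags[i].can_enter, i == 0);   /* only slot 0 starts as "may enter" */
    atomic_init(&l->next, 0);
}

/* Returns the slot this caller owns; the same slot must be passed to release(). */
static unsigned anderson_acquire(anderson_lock *l) {
    unsigned slot = atomic_fetch_add(&l->next, 1) % MAX_THREADS;   /* getAndIncrement */
    while (!atomic_load(&l->flags[slot].can_enter))
        ;                              /* each thread spins only on its own flag */
    return slot;
}

static void anderson_release(anderson_lock *l, unsigned slot) {
    atomic_store(&l->flags[slot].can_enter, false);                      /* recycle my slot */
    atomic_store(&l->flags[(slot + 1) % MAX_THREADS].can_enter, true);   /* notify the next in line only */
}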

49 CMPT 401 Summer 2007 © A. Fedorova Anderson Lock: Performance (graph: TAS lock, TTAS lock, Anderson lock, and ideal – total time vs. number of threads; the Anderson lock is almost ideal) We avoid all unnecessary invalidations. Portable – no tunable parameters. © Herlihy-Shavit 2007

50 CMPT 401 Summer 2007 © A. Fedorova Scalable Synchronization: Summary Making synchronization primitives scalable is tricky Performance tied to the hardware architecture We looked at these spinlocks: –TAS – poor performance due to invalidations –TTAS – avoids constant invalidations, but a storm of invalidations on lock release –TTAS with backoff – eliminates the storm of invalidations on release –Anderson Queue Lock – completely eliminates all useless invalidations One could think of other optimizations… For more information, look at the references in the syllabus

51 CMPT 401 Summer 2007 © A. Fedorova OS Support For Distributed Systems: Summary (I) Networking –Access to network devices –Implementation of network protocols: TCP, UDP, IP Processes and Threads (because many DS components use MP/MT architectures). Must ensure: –Good load balance –Good response time –Minimize context switches –We looked at how the Solaris time-sharing scheduler does this

52 CMPT 401 Summer 2007 © A. Fedorova OS Support For Distributed Systems: Summary (II) Inter-process communication –Pipes –Memory-mapped files –Inter-process shared memory Scalable Synchronization