1
CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture IV: OS Support
2
Outline
Continue discussing OS support for threads and processes
Alternative distributed systems architectures inspired by limitations of threads
Support for IPC
Scalable synchronization
3
Process/Thread Support: Good Enough?
Many computer scientists have observed the limited scalability of multithreaded (MT) and multiprocess (MP) architectures
[Figure: performance of a threaded web server. M. Welsh, SOSP ’01]
4
Alternative Web Services Architectures
Alternative architectures for web services that rely less heavily on threads/processes:
–Single-Process Event-Driven (SPED)
–Asymmetric Multiprocess Event-Driven (AMPED)
–Staged Event-Driven Architecture (SEDA)
5
Web Services Architecture. Case Study: A Web Server
[Figure: sequence of actions at the web server. V. Pai, USENIX ’99]
Each step can block:
–Socket read/accept can block on network I/O
–File find/read can block on disk I/O
–Send can block on TCP buffer queue
How do servers overlap blocking and computation?
6
Multiprocess (MP) or Multithreaded (MT) Architecture: A Review
MP [Figure: V. Pai, USENIX ’99]
–One process performs all steps for a request
–I/O and computation overlap naturally
–OS switches to a new process when a process blocks
MT [Figure: V. Pai, USENIX ’99]
–One thread performs all steps for a request
–I/O and computation overlap is possible only if support for kernel threads is available
–Modern OSs provide such support
7
Single Process Event-Driven Architecture
A single process executes processing steps for all requests
Uses non-blocking network and disk I/O system calls
Uses the select system call to check on the status of those operations
Problem #1: many OSs do not provide non-blocking system calls for disk I/O
Problem #2: those that do, do not integrate them with select, so the server cannot check for completion of network and disk I/O simultaneously
V. Pai, USENIX ’99
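The select-style loop described on this slide can be sketched in Java with `java.nio.channels.Selector`. This is only an illustration under stated assumptions: the class `SpedLoop`, its methods and the request labels are hypothetical, and in-process pipes stand in for client sockets so that the sketch needs no network.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical single-threaded event loop; not taken from Flash or any real server.
class SpedLoop {
    private final Selector selector;
    private final List<String> handled = new ArrayList<>();

    SpedLoop() throws IOException {
        selector = Selector.open();
    }

    // Register a source in non-blocking mode; the label identifies the "request".
    void register(Pipe.SourceChannel source, String label) throws IOException {
        source.configureBlocking(false);
        source.register(selector, SelectionKey.OP_READ, label);
    }

    // One pass of the loop: wait until some channel is ready (timeoutMillis must
    // be positive; 0 would wait indefinitely), then service every ready channel
    // without blocking on any single one.
    int pollOnce(long timeoutMillis) throws IOException {
        selector.select(timeoutMillis);
        int serviced = 0;
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();
            if (!key.isValid() || !key.isReadable()) continue;
            ByteBuffer buf = ByteBuffer.allocate(256);
            int n = ((Pipe.SourceChannel) key.channel()).read(buf);
            if (n < 0) { key.cancel(); continue; }   // writer closed its end
            buf.flip();
            handled.add(key.attachment() + ":" + StandardCharsets.UTF_8.decode(buf));
            serviced++;
        }
        return serviced;
    }

    List<String> handled() { return handled; }
}
```

A real server would register a `ServerSocketChannel` and client `SocketChannel`s with the same selector. Ordinary disk files cannot be registered with it, which is the limitation the two problems on the slide point at.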
8
Asymmetric Multiprocess Event-Driven Architecture (AMPED)
AMPED = MP + SPED
Use SPED architecture for I/O operations with a non-blocking interface: socket read/write, accept
Use MP architecture for I/O operations without a non-blocking interface: file read/write
–mmap the file
–Use mincore to check if the file is in memory
–If not, spawn a helper process to bring the file into memory
–Communicate with the helper process via IPC
[Figure: V. Pai, USENIX ’99]
Flash: a web server implemented using AMPED (V. Pai, et al., USENIX ’99)
–Matches or exceeds performance of existing web servers by up to 50%
9
Staged Event-Driven Architecture
Observation: AMPED is good, but it is not easy to control application resources. E.g., which event to process first?
SEDA: create a stage for each logical step of processing; manage each stage separately
There is a queue of events for each stage, so you can tell how heavily each stage is loaded
Each stage can be processed by several (a small number of) threads
Adaptive load shedding: manage queues to control load
–E.g., if the stage that involves disk I/O is the bottleneck, drop the queued-up requests or reject new requests
Dynamic control: adjust the number of threads per stage based on demand
M. Welsh, SOSP ’01
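A minimal sketch of a single SEDA-style stage, assuming a bounded queue for load shedding and a growable pool of worker threads for dynamic control. The class `Stage` and its method names are hypothetical and are not taken from the SEDA implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical stage: one event queue plus a small set of worker threads.
class Stage<E> {
    private final BlockingQueue<E> queue;
    private final Consumer<E> handler;
    private final ExecutorService workers = Executors.newCachedThreadPool();

    Stage(int queueCapacity, Consumer<E> handler) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        this.handler = handler;
    }

    // Load shedding: a full queue rejects the event instead of blocking the caller.
    boolean enqueue(E event) { return queue.offer(event); }

    // Queue length is the per-stage load signal a controller can watch.
    int load() { return queue.size(); }

    // Dynamic control: add worker threads to this stage when demand grows.
    void addWorkers(int n) {
        for (int i = 0; i < n; i++) {
            workers.execute(() -> {
                try {
                    while (true) handler.accept(queue.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // shutdown requested
                }
            });
        }
    }

    void shutdown() { workers.shutdownNow(); }
}
```

A server built this way would chain several such stages, with each handler enqueueing its result into the next stage and treating a `false` return from `enqueue` as the signal to drop or reject the request.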
10
Outline
Continue discussing OS support for threads and processes
Alternative distributed systems architectures inspired by limitations of threads
Support for IPC
Support for scalable synchronization
Distributed operating systems
11
OS Support for Inter-Process Communication (IPC)
Cooperating processes or threads need to communicate
Threads share an address space, so they communicate via shared memory
What about processes? They do not share an address space. They communicate via:
–Unix pipes
–Memory-mapped files
–Inter-process shared memory
12
Unix Pipes
A pipe is a communication channel between two processes
Using a pipe in a shell:
    prompt% cat log_file | grep "May 16"
[Figure: cat writes into the pipe; grep reads from it]
Pipes can also be created using the pipe() system call
13
Implementation of Pipes
In Solaris: a data structure containing two vnodes, a lock and a buffer
[Figure: pipe data structure; labels: lock, fnode, buffer, vnode]
To the user, each end of the pipe is represented by a file descriptor
The user reads/writes the pipe by reading/writing the file descriptor
The OS blocks a process reading from an empty pipe
The OS blocks a process writing into a full pipe (when the buffer is full)
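The blocking rules above (reader waits on an empty buffer, writer waits on a full one) can be illustrated with a small in-process analogue. This is a sketch of the idea only, assuming a fixed-size circular buffer guarded by one monitor lock; `BoundedPipe` is a hypothetical class and is not how Solaris implements pipes.

```java
// Hypothetical in-process analogue of a pipe: one lock, one bounded buffer.
class BoundedPipe {
    private final byte[] buffer;
    private int head = 0;   // index of the next byte to read
    private int count = 0;  // number of bytes currently buffered

    BoundedPipe(int capacity) { buffer = new byte[capacity]; }

    // The writer blocks while the buffer is full.
    synchronized void write(byte b) throws InterruptedException {
        while (count == buffer.length) wait();
        buffer[(head + count) % buffer.length] = b;
        count++;
        notifyAll(); // wake a reader blocked on an empty pipe
    }

    // The reader blocks while the buffer is empty.
    synchronized byte read() throws InterruptedException {
        while (count == 0) wait();
        byte b = buffer[head];
        head = (head + 1) % buffer.length;
        count--;
        notifyAll(); // wake a writer blocked on a full pipe
        return b;
    }
}
```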
14
Memory-mapped Files
[Figure: the same file is mapped into the address space of process A and into the address space of process B]
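As a rough illustration of mapping a file into an address space, the sketch below uses Java's `FileChannel.map`. The helper class `SharedMapping` is hypothetical. Two cooperating processes would each call it on the same path; here both mappings would live in one JVM purely for demonstration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps the start of a file into the caller's address space.
class SharedMapping {
    // Map the first `size` bytes of the file read-write; the file is created
    // or grown if needed.
    static MappedByteBuffer map(Path file, int size) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }
    }
}
```

Writes through the returned buffer reach the file eventually, and calling `force()` on the buffer pushes them out explicitly. The Java API does not promise when another mapping of the same file sees a change, so cooperating processes still need their own synchronization.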
15
Inter-process Shared Memory
Inter-process shared memory: a piece of physical memory set up to be shared among processes
Allocate inter-process shared memory using shmget
Get permission to use it (attach to it) via shmat
Disadvantage: shared memory is not cleaned up automatically when processes exit; it needs to be cleaned up explicitly
16
Performance of IPC
IPC involves inter-process context switching
This is the expensive kind of context switch, because it involves switching address spaces
The cost of a context switch determines the cost of IPC, and it largely depends on the hardware
17
Outline
Continue discussing OS support for threads and processes
Alternative distributed systems architectures inspired by limitations of threads
Support for IPC
Support for scalable synchronization
18
Synchronization
Unsynchronized Access
Thread 1: perform a withdrawal
    if (account_balance >= amount) {
        account_balance -= amount;
    }
Thread 2: subtract service fee
    if (account_balance >= service_fee) {
        account_balance -= service_fee;
    }
[Figure: labels 1 to 4 mark the order in which the two threads' statements execute]
The account balance has changed between steps 2 and 4!
Synchronized Access
Thread 1:
    lock_acquire(account_balance_lock);
    if (account_balance >= amount) {
        account_balance -= amount;
    }
    lock_release(account_balance_lock);
Thread 2:
    lock_acquire(account_balance_lock);
    if (account_balance >= service_fee) {
        account_balance -= service_fee;
    }
    lock_release(account_balance_lock);
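The synchronized version can be written as runnable Java. This is a minimal sketch that uses `java.util.concurrent.locks.ReentrantLock` in place of the slide's `lock_acquire`/`lock_release`; the class `Account` and its methods are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical account whose check-then-update runs under one lock.
class Account {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    Account(long initial) { balance = initial; }

    // The check and the update happen under the same lock, so no other thread
    // can change the balance between the test and the subtraction.
    boolean withdraw(long amount) {
        lock.lock();
        try {
            if (balance >= amount) {
                balance -= amount;
                return true;
            }
            return false;
        } finally {
            lock.unlock(); // released even if the body throws
        }
    }

    long balance() {
        lock.lock();
        try {
            return balance;
        } finally {
            lock.unlock();
        }
    }
}
```

With the lock in place, concurrent withdrawals and fee deductions can never drive the balance below zero, because each call sees the balance left by the previous one.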
19
Synchronization Primitives (SP)
Synchronization primitives provide atomic access to a critical section
Types of synchronization primitives
–mutex
–semaphore
–lock
–condition variable
–etc.
Synchronization primitives are provided by the OS
Can also be implemented by a library (e.g., pthreads) or by the application
Hardware provides special atomic instructions for the implementation of synchronization primitives (test-and-set, compare-and-swap, etc.)
20
Implementation of SP
The performance of applications that use an SP is determined by the implementation of the SP
An SP must be scalable: it must continue to perform well as the number of contending threads increases
We will look at several implementations of locks to understand how to create a scalable implementation
21
What should you do if you can’t get a lock?
Keep trying
–“spin” or “busy-wait”
–Good if delays are short
Give up the processor
–Good if delays are long
–Always good on a uniprocessor
Systems usually use a combination:
–Spin for a while, then give up the processor
We will focus on multiprocessors, so we’ll look at spinlock implementations
© Herlihy-Shavit 2007
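The combined strategy (spin for a while, then give up the processor) might look like the sketch below. `SpinThenYieldLock` and its `spinLimit` parameter are hypothetical, and `Thread.yield()` is only a hint to the scheduler; a production lock would more likely block the thread in the OS.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical hybrid lock: busy-wait briefly, then offer the CPU to others.
class SpinThenYieldLock {
    private final AtomicBoolean held = new AtomicBoolean(false);
    private final int spinLimit; // assumed tuning knob, not from the lecture

    SpinThenYieldLock(int spinLimit) { this.spinLimit = spinLimit; }

    void lock() {
        int spins = 0;
        while (held.getAndSet(true)) {   // true means another thread holds the lock
            if (++spins < spinLimit) {
                Thread.onSpinWait();     // short wait: keep the processor
            } else {
                Thread.yield();          // long wait: let another thread run
                spins = 0;
            }
        }
    }

    void unlock() { held.set(false); }
}
```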
22
A Shared Memory Multiprocessor
[Figure: processors with private caches connected to memory by a shared bus]
© Herlihy-Shavit 2007
23
Basic Spinlock
[Figure: threads spin on the lock to enter the critical section (CS); the lock is reset upon exit]
The lock suffers from contention
The critical section is a sequential bottleneck: no parallelism
© Herlihy-Shavit 2007
24
Review: Test-and-Set
We have a boolean value in memory
Test-and-set (TAS)
–Swap true with prior value
–Return value tells if prior value was true or false
Can reset just by writing false
© Herlihy-Shavit 2007
25
TAS
Provided by the hardware
Example SPARC: the assembly instruction ldstub (load-store unsigned byte)
–Loads a byte from memory into a return register
–Writes the value 0xFF into the addressed byte
–Does both atomically
TAS can be implemented in a high-level language. Example in Java:
    public class AtomicBoolean {
        boolean value;
        // Swap old and new values
        public synchronized boolean getAndSet(boolean newValue) {
            boolean prior = value;
            value = newValue;
            return prior;
        }
    }
© Herlihy-Shavit 2007
26
TAS Locks
Value of TAS’ed memory shows lock state:
–Lock is free: value is false
–Lock is taken: value is true
Acquire lock by calling TAS:
–If result is false, you win
–If result is true, you lose
Release lock by writing false
27
TAS Lock in SPARC Assembly
    spin_lock:
    busy_loop:
        ldstub [%o0],%o1   ! load old value into %o1; write 0xFF into memory at the address in %o0
        tst %o1            ! test whether %o1 equals zero
        bne busy_loop      ! if %o1 is not zero (old value is true), spin
        nop                ! delay slot for branch
        retl
        nop                ! delay slot for branch
28
TAS Lock in Java
    class TASlock {
        // Initialize lock state to false (unlocked)
        AtomicBoolean state = new AtomicBoolean(false);
        void lock() {
            // While lock is taken (true), spin
            while (state.getAndSet(true)) {}
        }
        void unlock() {
            // Release the lock: set state to false
            state.set(false);
        }
    }
© Herlihy-Shavit 2007
29
Performance of TAS Lock
Experiment
–N threads on a multiprocessor
–Increment a shared counter 1 million times (total)
–Each thread acquires a lock before incrementing the counter
–Each thread does 1,000,000/N increments
N does not exceed the number of processors, so there is no thread-switching overhead
How long should it take?
How long does it take?
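The experiment can be reproduced with a small harness. This is a sketch under assumptions: `CounterExperiment` is a hypothetical class, timing uses `System.nanoTime()`, and the numbers it prints depend entirely on the machine, so none are shown here.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical harness for the shared-counter experiment with a TAS lock.
class CounterExperiment {
    private final AtomicBoolean lock = new AtomicBoolean(false);
    private long counter = 0;

    // Runs `threads` threads, each doing total/threads locked increments, and
    // returns the elapsed wall-clock time in nanoseconds. If total is not a
    // multiple of threads, the remainder is dropped.
    long run(int threads, int total) throws InterruptedException {
        Thread[] workers = new Thread[threads];
        int perThread = total / threads;
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    while (lock.getAndSet(true)) { Thread.onSpinWait(); } // TAS acquire
                    counter++;                                            // critical section
                    lock.set(false);                                      // release
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return System.nanoTime() - start;
    }

    long counter() { return counter; }

    public static void main(String[] args) throws InterruptedException {
        int maxThreads = Runtime.getRuntime().availableProcessors();
        for (int n = 1; n <= maxThreads; n++) {
            CounterExperiment e = new CounterExperiment();
            long nanos = e.run(n, 1_000_000);
            System.out.printf("%d threads: %d ms%n", n, nanos / 1_000_000);
        }
    }
}
```

Keeping the thread count at or below `availableProcessors()` mirrors the slide's condition that N does not exceed the number of processors.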
30
Expected Performance
[Figure: total time vs. number of threads, showing the ideal curve]
No speedup because there is no parallelism
[Figure: Thread 1 and Thread 2 each repeat lock_acquire, increment, lock_release; the critical sections run one after another, the same as sequential execution]
© Herlihy-Shavit 2007
31
Actual Performance
[Figure: total time vs. number of threads for the TAS lock and the ideal case]
The TAS lock is much worse than ideal
© Herlihy-Shavit 2007
32
Reasons for Bad TAS Lock Performance
Has to do with cache behaviour on the multiprocessor system
TAS causes a lot of invalidation misses
–This hurts performance
To understand what this means, let’s review how caches work
33
Processor Issues Load Request
[Figure: a processor's cache requests data over the bus; memory holds the data]
© Herlihy-Shavit 2007
34
Another Processor Issues Load Request
[Figure: a second processor puts “I want data” on the bus; the cache that already holds the data answers “I got data”]
© Herlihy-Shavit 2007
35
Processor Modifies Data
[Figure: one processor modifies the data in its cache]
Now other copies are invalid
© Herlihy-Shavit 2007
36
Send Invalidation Message to Others
[Figure: the modifying cache broadcasts “Invalidate!” on the bus]
Other caches lose read permission
No need to change memory now: other caches can provide valid data
© Herlihy-Shavit 2007
37
Processor Asks for Data
[Figure: a processor puts “I want data” on the bus; a cache that holds the data supplies it]
© Herlihy-Shavit 2007
38
Multiprocessor Caches: Summary
Simultaneous reads and writes of shared data:
–Make data invalid
Invalidation is bad for performance
On next data request:
–Data must be fetched from another cache
This slows down performance
39
What This Has to Do with TAS Locks
Recall that the TAS lock had bad performance
[Figure: total time vs. number of threads for the TAS lock and the ideal case]
Invalidations were the cause. Here is why:
–All spinners do load/store in a loop
–They all read/write the same location
–This causes lots of invalidations
40
A Solution: Test-And-Test-And-Set Lock
Wait until lock “looks” free
–Spin on local cache
–No bus use while lock busy
    class TTASlock {
        AtomicBoolean state = new AtomicBoolean(false);
        void lock() {
            while (true) {
                // Wait until the lock looks free. We read the lock instead of
                // TASing it, so we avoid repeated invalidations.
                while (state.get()) {}
                // Now try to acquire it.
                if (!state.getAndSet(true)) return;
            }
        }
    }
© Herlihy-Shavit 2007
41
TTAS Lock Performance
[Figure: total time vs. number of threads for the TAS lock, the TTAS lock and the ideal case]
Better, but still far from ideal
© Herlihy-Shavit 2007
42
The Problem with TTAS Lock
When the lock is released:
–Everyone tries to acquire it
–Everyone does TAS
–There is a storm of invalidations
Only one processor can use the bus at a time
So all processors queue up, waiting for the bus, so they can perform the TAS
43
A Solution: TTAS Lock with Backoff
Intuition: if I fail to get the lock, there must be contention
So I should back off before trying again
Introduce a random “sleep” delay before trying to acquire the lock again
© Herlihy-Shavit 2007
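A sketch of a TTAS lock with random backoff follows. The slide only calls for a random delay; doubling the delay window after each failed attempt (exponential backoff) and the two delay bounds are assumptions added here for illustration, and `BackoffLock` is a hypothetical class name.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical TTAS lock that pauses for a random time after a failed TAS.
class BackoffLock {
    // Assumed delay bounds in nanoseconds; real values need per-platform tuning.
    private static final long MIN_DELAY_NANOS = 1_000;
    private static final long MAX_DELAY_NANOS = 1_000_000;
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        long limit = MIN_DELAY_NANOS;
        while (true) {
            while (state.get()) { Thread.onSpinWait(); } // wait until the lock looks free
            if (!state.getAndSet(true)) return;          // TAS succeeded: lock acquired
            // TAS failed, so contention is likely: pause for a random time,
            // then widen the window for the next failure.
            long delay = ThreadLocalRandom.current().nextLong(1, limit + 1);
            LockSupport.parkNanos(delay);
            limit = Math.min(MAX_DELAY_NANOS, 2 * limit);
        }
    }

    void unlock() { state.set(false); }
}
```

The two delay constants are exactly the kind of parameter whose best value changes with the number of processors and their speed.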
44
TTAS Lock with Backoff: Performance
[Figure: total time vs. number of threads for the TAS lock, the TTAS lock, the backoff lock and the ideal case]
© Herlihy-Shavit 2007
45
Backoff Locks
Better performance than TAS and TTAS
Caveats:
–Performance is sensitive to the choice of delay parameter
–The delay parameter depends on the number of processors and their speed
–Easy to tune for one platform
–Difficult to write an implementation that will work well across multiple platforms
© Herlihy-Shavit 2007
46
An Idea
Avoid useless invalidations
–By keeping a queue of threads
Each thread
–Notifies next in line
–Without bothering the others
© Herlihy-Shavit 2007
47
Anderson Queue Lock
flags: an array of locations on which threads spin, one per thread (T F F F F F F F when the lock is idle)
next: points to the next unused “spin” location
Acquiring: getAndIncrement atomically gets the value of next and increments the next pointer
If the flag at the location obtained from next is TRUE, the lock is acquired
[Figure: an idle lock with flags TFFFFFFF; an acquiring thread calls getAndIncrement on next]
© Herlihy-Shavit 2007
48
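Acquiring a Held Lock
[Figure: flags TFFFFFFF and next; one thread has acquired the lock, and a second, acquiring thread calls getAndIncrement and spins on the following flag. When the first thread releases the lock, that flag becomes T and the second thread has acquired the lock]
© Herlihy-Shavit 2007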
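The flags-and-next scheme can be sketched as follows. This is an illustration under assumptions: the class `AndersonLock` is hypothetical, `capacity` must be at least the number of threads that may contend at once, and the padding that a tuned version would use to keep each flag on its own cache line is left out.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical array-based queue lock in the style described above.
class AndersonLock {
    private final AtomicIntegerArray flags;              // 1 = go (T), 0 = wait (F)
    private final AtomicInteger next = new AtomicInteger(0);
    private final ThreadLocal<Integer> mySlot = new ThreadLocal<>();
    private final int size;

    AndersonLock(int capacity) {
        size = capacity;
        flags = new AtomicIntegerArray(capacity);        // all slots start at 0
        flags.set(0, 1);                                 // idle lock: first slot is T
    }

    void lock() {
        // Atomically take the next unused spin location. The modulo wraps the
        // index; this sketch ignores overflow of the int counter.
        int slot = Math.floorMod(next.getAndIncrement(), size);
        mySlot.set(slot);
        while (flags.get(slot) == 0) { Thread.onSpinWait(); } // spin on own slot only
    }

    void unlock() {
        int slot = mySlot.get();
        flags.set(slot, 0);                // reset own slot for later reuse
        flags.set((slot + 1) % size, 1);   // notify only the next thread in line
    }
}
```

Each waiting thread spins on a different array element, so a release changes only the flag that the next thread is watching.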
49
Anderson Lock: Performance
[Figure: total time vs. number of threads for the TAS lock, the TTAS lock, the Anderson lock and the ideal case]
Almost ideal. We avoid all unnecessary invalidations.
Portable: no tunable parameters.
© Herlihy-Shavit 2007
50
Scalable Synchronization: Summary
Making synchronization primitives scalable is tricky
Performance is tied to the hardware architecture
We looked at these spinlocks:
–TAS: poor performance due to invalidations
–TTAS: avoids constant invalidations, but causes a storm of invalidations on lock release
–TTAS with backoff: reduces the storm of invalidations on release
–Anderson Queue Lock: eliminates all useless invalidations
One could think of other optimizations…
For more information, look at the references in the syllabus
51
OS Support For Distributed Systems: Summary (I)
Networking
–Access to network devices
–Implementation of network protocols: TCP, UDP, IP
Processes and Threads (because many DS components use MP/MT architectures). Must ensure:
–Good load balance
–Good response time
–Minimal context switching
–We looked at how the Solaris time-sharing scheduler does this
52
OS Support For Distributed Systems: Summary (II)
Inter-process communication
–Pipes
–Memory-mapped files
–Inter-process shared memory
Scalable Synchronization