Outline for Today Objectives: Linux scheduler Lottery scheduling BRING CARDS TO SHUFFLE
Linux Scheduling Policy
  Runnable process with highest priority and timeslice remaining runs (SCHED_OTHER policy)
  Dynamically calculated priority
    Starts with nice value
    Bonus or penalty reflecting whether the process is I/O bound or compute bound, found by tracking sleep time vs. runnable time:
      sleep_avg - accumulated during sleep up to MAX_SLEEP_AVG (10 ms default), decremented by timer tick while running
Linux Scheduling Policy
  Dynamically calculated timeslice
    The higher the dynamic priority, the longer the timeslice
    [Figure: timeslice scale running from 10ms (low priority, less interactive) through 150ms to 300ms (high priority, more interactive)]
    Recalculated every round when “expired” and “active” swap
    Exceptions for expired interactive tasks
      Go back on active unless there are starving expired tasks
Runqueue for O(1) Scheduler
  [Figure: the runqueue holds two priority arrays, "active" and "expired"; each is an array of priority queues, one per priority level. Higher priority: more I/O, 300ms timeslice. Lower priority: more CPU, 10ms timeslice.]
Runqueue for O(1) Scheduler
  [Figure: task 1 sits in a priority queue of the active array.]
Runqueue for O(1) Scheduler
  [Figure: task 1 is gone from the active array (marked X) and now sits in a priority queue of the expired array.]
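The active/expired swap in the figures above can be sketched in a few lines. This is only an illustration, not the kernel's code: the names PrioArray, RunQueue, pick_next and timeslice_expired are hypothetical, NUM_PRIOS = 8 is an arbitrary small value chosen for the example, and lower index is assumed to mean higher priority.

```python
from collections import deque

NUM_PRIOS = 8  # assumption: small number of levels, index 0 = highest priority


class PrioArray:
    """One priority array: a queue per priority level plus a bitmap."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIOS)]
        self.bitmap = 0  # bit p is set exactly when queues[p] is non-empty

    def enqueue(self, task, prio):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def dequeue_best(self):
        if self.bitmap == 0:
            return None
        # Index of the lowest set bit = highest-priority non-empty queue.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)
        return task, prio


class RunQueue:
    """Per-CPU runqueue with an active and an expired priority array."""

    def __init__(self):
        self.active = PrioArray()
        self.expired = PrioArray()

    def pick_next(self):
        # When every task has used its timeslice, swap the two arrays.
        if self.active.bitmap == 0:
            self.active, self.expired = self.expired, self.active
        return self.active.dequeue_best()

    def timeslice_expired(self, task, prio):
        # A task that used up its timeslice waits in the expired array.
        self.expired.enqueue(task, prio)
```

The bitmap lookup stands in for the find-first-set step; picking the next task never scans the whole task list, which is the point of the design.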
Linux Real-time
  No guarantees
  SCHED_FIFO
    Static priority, effectively higher than SCHED_OTHER processes*
    No timeslice - it runs until it blocks or yields voluntarily
    RR within same priority level
  SCHED_RR
    As above but with a timeslice.
  * Although their priority number ranges overlap
Diversion: Synchronization
  Disable interrupts
  Busy-waiting solutions - spinlocks
    execute a tight loop if critical section is busy
    benefit from specialized atomic (read-modify-write) instructions
  Blocking synchronization
    sleep (enqueued on wait queue) while critical section is busy.
Support for SMP
  Every processor has its own private runqueue
  Locking - spinlock protects runqueue
  Load balancing - pulls tasks from busiest runqueue into mine.
  Affinity - cpus_allowed bitmask constrains a process to a particular set of processors
  [Figure: symmetric multiprocessor; four processors (P), each with its own cache ($), sharing one memory.]
  load_balance runs from schedule( ) when the runqueue is empty, or periodically, especially during idle.
    Prefers to pull processes that are expired, not cache-hot, high priority, and allowed by affinity
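A rough sketch of the pull step, under simplifying assumptions: runqueues are plain lists of (name, cpus_allowed) pairs, and the only filter applied is the affinity bitmask. The helper names can_run_on and pull_task are hypothetical, and the preferences for expired, not cache-hot, and high-priority tasks are left out.

```python
def can_run_on(cpus_allowed, cpu):
    """True if bit `cpu` is set in the affinity bitmask."""
    return bool(cpus_allowed & (1 << cpu))


def pull_task(runqueues, me):
    """Pull one task from the busiest other runqueue into runqueues[me].

    runqueues: list (one entry per CPU) of lists of (name, cpus_allowed).
    Returns the name of the task moved, or None if nothing was moved.
    """
    others = [i for i in range(len(runqueues)) if i != me]
    busiest = max(others, key=lambda i: len(runqueues[i]), default=None)
    if busiest is None or len(runqueues[busiest]) <= len(runqueues[me]):
        return None  # nobody is busier than this CPU
    for idx, (name, allowed) in enumerate(runqueues[busiest]):
        if can_run_on(allowed, me):  # respect the affinity constraint
            runqueues[me].append(runqueues[busiest].pop(idx))
            return name
    return None
```

With CPU 0 idle and CPU 1 holding task "a" (mask 0b10) and task "b" (mask 0b11), pull_task moves "b", because "a" may only run on CPU 1.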
Lottery Scheduling Waldspurger and Weihl (OSDI 94)
Claims
  Goal: responsive control over the relative rates of computation
  Support for modular resource management
  Generalizable to diverse resources
  Efficient implementation of proportional-share resource management: consumption rates of resources by active computations are proportional to relative shares allocated
Basic Idea
  Resource rights are represented by lottery tickets
    abstract, relative (vary dynamically w.r.t. contention), uniform (handle heterogeneity)
    responsiveness: adjusting the relative # tickets is reflected immediately in the next lottery
  At allocation time: hold a lottery; the resource goes to the computation holding the winning ticket.
Fairness
  Expected allocation is proportional to # tickets held - actual allocation becomes closer over time.
  Number of lotteries won by a client: E[w] = n p, where p = t/T
  Response time (# lotteries to wait for first win): E[n] = 1/p
  w = # wins, t = # tickets held, T = total # tickets, n = # lotteries
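The two expectations above are easy to check numerically. The sketch below is illustrative only; the function names are hypothetical, and the simulation uses a seeded generator so the run is repeatable.

```python
import random


def expected_wins(n, t, T):
    """E[w] = n * p with p = t / T."""
    return n * t / T


def expected_wait(t, T):
    """E[n] = 1 / p: expected number of lotteries until the first win."""
    return T / t


def simulate_wins(n, t, T, seed=0):
    """Hold n lotteries over T tickets; count wins for a client holding t."""
    rng = random.Random(seed)
    # The client is assumed to hold ticket numbers 0 .. t-1.
    return sum(1 for _ in range(n) if rng.randrange(T) < t)


# A client holding 5 of 20 tickets, over 100 lotteries:
print(expected_wins(100, 5, 20))   # 25.0
print(expected_wait(5, 20))        # 4.0
print(simulate_wins(10000, 5, 20)) # close to 2500, varies with the seed
```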
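Example List-based Lottery
  Tickets held by each client, in list order: 10, 2, 5, 1, 2 (total 20)
  Random(0, 19) = 15
  Summing along the list: 10, 12, 17. The running sum first exceeds 15 at the third client, so it wins.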
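The list walk in this example can be sketched as follows. The function name hold_lottery and its optional winning parameter (used to replay the slide's draw of 15) are hypothetical.

```python
import random


def hold_lottery(tickets, winning=None, rng=random):
    """Return the index of the client holding the winning ticket.

    tickets: number of tickets held by each client, in list order.
    winning: winning ticket number in [0, total-1]; drawn at random if None.
    """
    total = sum(tickets)
    if winning is None:
        winning = rng.randrange(total)
    running = 0
    for client, held in enumerate(tickets):
        running += held
        if running > winning:  # winning ticket falls in this client's range
            return client
    raise ValueError("winning ticket number out of range")


print(hold_lottery([10, 2, 5, 1, 2], winning=15))  # 2 (the third client)
```

The cost of a draw is linear in the number of clients; keeping clients with many tickets near the front of the list shortens the average walk.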
Bells and Whistles
  Ticket transfers - tickets are objects that can be explicitly passed in messages
    Can be used to solve priority inversions
  Ticket inflation - create more tickets
    Used among mutually trusting clients to dynamically adjust ticket allocations
  Currencies - “local” control, exchange rates
  Compensation tickets - to maintain its share, a client that uses only a fraction f of its quantum has its tickets inflated by 1/f in the next lottery
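Kernel Objects
  [Figure: a ticket object holds an amount and a currency (example: 1000 base). A currency object holds a name (C_name), its backing tickets, its issued tickets, and an active amount (example: 300).]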
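[Figure: currency graph over base, alice, bob, task1, task2, task3 and thread1 to thread4. base (3000) funds alice with 1000 base and bob with 2000 base. alice (200 active) funds task1 with 100 alice and task2 with 200 alice; bob (100 active) funds task3 with 100 bob. Tickets held by threads: 100 task1, 200 task2 and 300 task2 (task2 has 500 active), 100 task3. Exchange rates: 1 bob = 20 base; 1 alice = 5 base; 1 task2 = 0.4 alice = 2 base.]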
[Figure: the same currency graph with alice at 300 active and task1 at 100 active. Funding is unchanged: base (3000) funds alice with 1000 base and bob with 2000 base; alice funds task1 with 100 alice and task2 with 200 alice; bob (100 active) funds task3 with 100 bob. Exchange rates: 1 bob = 20 base; 1 alice = 3.33 base; 1 task2 = 0.4 alice = 1.33 base.]
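The exchange rates in the two currency graphs follow from one rule: a currency's value is the base value of its backing tickets, divided among its active issued amount. A minimal sketch, assuming a hypothetical table layout and function name (base_value), with the numbers taken from the figures:

```python
# name: (backing tickets as (amount, currency) pairs, active amount issued)
CURRENCIES = {
    "alice": ([(1000, "base")], 200),
    "bob": ([(2000, "base")], 100),
    "task2": ([(200, "alice")], 500),
    "task3": ([(100, "bob")], 100),
}


def base_value(amount, currency, currencies=CURRENCIES):
    """Value in base units of `amount` tickets denominated in `currency`."""
    if currency == "base":
        return float(amount)
    backing, active = currencies[currency]
    # Total base value funding this currency, shared by its active tickets.
    funded = sum(base_value(a, c, currencies) for a, c in backing)
    return amount * funded / active


print(base_value(1, "alice"))  # 5.0
print(base_value(1, "task2"))  # 2.0
print(base_value(1, "bob"))    # 20.0

# Second graph: alice now has 300 active, so alice and task2 are worth less.
inflated = dict(CURRENCIES, alice=([(1000, "base")], 300))
print(round(base_value(1, "alice", inflated), 2))  # 3.33
print(round(base_value(1, "task2", inflated), 2))  # 1.33
```

bob and task3 keep their value in the second graph, which is the insulation property currencies are meant to provide.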
Example List-based Lottery
  Tickets in list order, in mixed currencies: 1 base, 2 bob, 5 task3, 2 bob, 10 task2
  Random(0, 2999) = 1500
Compensation
  A holds 400 base, B holds 400 base
  A runs its full 100 msec quantum; B yields at 20 msec
  B uses 1/5 of its allotted time
    Its tickets are worth 400/(1/5) = 2000 base at each subsequent lottery for the rest of this quantum
    That is, B gets a compensation ticket valued at 2000 - 400 = 1600 base
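The arithmetic of this example, as a small sketch. The function names are hypothetical, and times are passed as integer milliseconds so that the fraction f = used / quantum is computed exactly.

```python
def compensated_value(tickets, used_ms, quantum_ms):
    """Ticket value inflated by 1/f, where f = used_ms / quantum_ms."""
    return tickets * quantum_ms / used_ms


def compensation_ticket(tickets, used_ms, quantum_ms):
    """Extra value granted on top of the client's own tickets."""
    return compensated_value(tickets, used_ms, quantum_ms) - tickets


# B holds 400 base and yields after 20 msec of a 100 msec quantum.
print(compensated_value(400, 20, 100))    # 2000.0
print(compensation_ticket(400, 20, 100))  # 1600.0
# A uses its full quantum, so it gets no compensation.
print(compensation_ticket(400, 100, 100)) # 0.0
```

With 2000 base against A's 400, B wins about five times as often as A, but runs one fifth as long each time, so the two end up with equal CPU shares.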
Ticket Transfer
  Synchronous RPC between client and server
    create a ticket in the client’s currency and send it to the server to fund its currency
    on reply, the transfer ticket is destroyed
Control Scenarios Dynamic Control Conditionally and dynamically grant tickets Adaptability Resource abstraction barriers supported by currencies. Insulate tasks.
UI
  mktkt, rmtkt, mkcur, rmcur
  fund, unfund
  lstkt, lscur, fundx (shell)
Prototype Implemented in the Mach microkernel
Relative Rate Accuracy Figure 4: Relative Rate Accuracy. For each allocated ratio, the observed ratio is plotted for each of three 60 second runs. The gray line indicates the ideal where the two ratios are identical.
Fairness Over Time
  8-second time windows over a 200-second execution
  Figure 5: Fairness Over Time. Two tasks executing the Dhrystone benchmark with a 2 : 1 ticket allocation. Averaged over the entire run, the two tasks executed 25378 and 12619 iterations/sec., for an actual ratio of 2.01 : 1.
Client-Server Query Processing Rates Figure 7: Query Processing Rates. Three clients with an 8 : 3 : 1 ticket allocation compete for service from a multithreaded database server. The observed throughput and response time ratios closely match this allocation.
Controlling Video Rates Figure 8: Controlling Video Rates. Three MPEG viewers are given an initial A : B : C = 3 : 2 : 1 allocation, which is changed to 3 : 1 : 2 at the time indicated by the arrow. The total number of frames displayed is plotted for each viewer. The actual frame rate ratios were 1.92 : 1.50 : 1 and 1.92 : 1 : 1.53, respectively, due to distortions caused by the X server.
Insulation Figure 9: Currencies Insulate Loads. Currencies A and B are identically funded. Tasks A1 and A2 are respectively allocated tickets worth 100:A and 200:A. Tasks B1 and B2 are respectively allocated tickets worth 100:B and 200:B. Halfway through the experiment, task B3 is started with an allocation of 300:B. The resulting inflation is locally contained within currency B, and affects neither the progress of tasks in currency A, nor the aggregate A : B progress ratio.
Other Kinds of Resources
  Claim: can be used for any resource where queuing is used
  Control relative waiting times for mutex locks.
    Mutex currency funded out of currencies of waiting threads
    Holder gets inheritance ticket in addition to its own funding, passed on to next holder (resulting from lottery) on release.
  Space sharing - inverse lottery, loser is victim (e.g. in page replacement decision, processor node preemption in MP partitioning)
Lock Funding
  [Figure: before release. Waiting threads fund the lock with their tickets; the lock in turn funds the holding thread (ticket labeled bt).]
Lock Funding
  [Figure: after release. The remaining waiting thread funds the lock; the lock's ticket (labels t, bt) has passed from the old holding thread to the new holding thread.]
Mutex Waiting Times Figure 11: Mutex Waiting Times. Eight threads compete to acquire a lottery-scheduled mutex. The threads are divided into two groups (A, B) of four threads each, with the ticket allocation A : B = 2 : 1. For each histogram, the solid line indicates the mean (μ); the dashed lines indicate one standard deviation about the mean (μ ± σ). The ratio of average waiting times is A : B = 1 : 2.11; the mutex acquisition ratio is 1.80 : 1.
Synchronization
The Trouble with Concurrency in Threads...
  Data: x (shared)
  Thread with counter i: while (i<10) {x = x+1; i++;}
  Thread with counter j: while (j<10) {x = x+1; j++;}
  (See email, spring 02, Jan 22)
  What is the value of x when both threads leave this while loop?
Range of Answers
  One interleaving, in time order (x starts at 0):
    Process 0: LD x                    // x currently 0
    Process 1: LD x                    // x currently 0
    Process 1: Add 1; ST x             // x now 1
    Process 1: Do 8 more full loops    // x = 9
    Process 0: Add 1; ST x             // x now 1, stored over 9
    Process 1: LD x                    // x now 1
    Process 0: Do 9 more full loops    // leaving x at 10
    Process 1: Add 1; ST x             // x = 2, stored over 10
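The interleaving above can be replayed deterministically. In this sketch each thread is a generator that pauses after its load and after its store, and a schedule lists how many steps each thread takes in turn; the names incrementer, run, WORST and SERIAL are hypothetical.

```python
def incrementer(shared, loops=10):
    """One thread: `loops` increments of shared['x'], pausing between steps."""
    for _ in range(loops):
        reg = shared["x"]   # LD x into a private register
        yield "loaded"      # a context switch may happen here
        reg += 1            # Add 1
        shared["x"] = reg   # ST x
        yield "stored"      # ... or here


def run(schedule, loops=10):
    """Replay a schedule of (thread id, number of steps); return final x."""
    shared = {"x": 0}
    threads = [incrementer(shared, loops), incrementer(shared, loops)]
    for tid, steps in schedule:
        for _ in range(steps):
            next(threads[tid])
    return shared["x"]


# One step = one load or one add-and-store, so a full loop is 2 steps.
WORST = [(0, 1), (1, 18), (0, 1), (1, 1), (0, 18), (1, 1)]
SERIAL = [(0, 20), (1, 20)]

print(run(WORST))   # 2
print(run(SERIAL))  # 20
```

Every other schedule of the same 40 steps lands somewhere between these two values.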
Nondeterminism
  while (i<10) {x = x+1; i++;}
  What unit of work can be performed without interruption? Indivisible or atomic operations.
  Interleavings - possible execution sequences of operations drawn from all threads.
  Race condition - final results depend on ordering and may not be “correct”.
  The statement x = x+1 is not atomic; a context switch may occur at each yield( ):
    load value of x into reg
    yield( )
    add 1 to reg
    yield( )
    store reg value at x
Reasoning about Interleavings
  On a uniprocessor, the possible execution sequences depend on when context switches can occur
    Voluntary context switch - the process or thread explicitly yields the CPU (blocking on a system call it makes, invoking a Yield operation).
    Interrupts or exceptions occurring - an asynchronous handler is activated that disrupts the execution flow.
    Preemptive scheduling - a timer interrupt may cause an involuntary context switch at any point in the code.
  On multiprocessors, the ordering of operations on shared memory locations is the important factor.
Critical Sections
  If a sequence of non-atomic operations must be executed as if it were atomic in order to be correct, then we need to provide a way to constrain the possible interleavings in this critical section of our code.
    Critical sections are code sequences that contribute to “bad” race conditions.
    Synchronization is needed around such critical sections.
  Mutual Exclusion - goal is to ensure that critical sections execute atomically w.r.t. related critical sections in other threads or processes.
    How?
The Critical Section Problem
  Each process follows this template:
    while (1) {
      ...other stuff...    // processes in here shouldn’t stop others
      enter_region( );
      critical section
      exit_region( );
    }
  The problem is to define enter_region and exit_region to ensure mutual exclusion with some degree of fairness.
Implementation Options for Mutual Exclusion
  Disable interrupts
  Busy-waiting solutions - spinlocks
    execute a tight loop if critical section is busy
    benefit from specialized atomic (read-modify-write) instructions
  Blocking synchronization
    sleep (enqueued on wait queue) while C.S. is busy
  Synchronization primitives (abstractions, such as locks) which are provided by a system may be implemented with some combination of these techniques.
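To contrast the busy-waiting and blocking options, here is a rough sketch. Python has no user-level test-and-set instruction, so the hypothetical SpinLock below uses a non-blocking threading.Lock.acquire as a stand-in for the atomic read-modify-write; a real spinlock would spin on a hardware instruction instead.

```python
import threading


class SpinLock:
    """Busy-waiting lock: retry a non-blocking acquire in a tight loop."""

    def __init__(self):
        # Stand-in for a memory word updated by an atomic test-and-set.
        self._flag = threading.Lock()

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass  # spin: critical section is busy

    def release(self):
        self._flag.release()


def count_with(lock, num_threads=4, iterations=200):
    """Increment a shared counter under `lock`; return the final count."""
    shared = {"x": 0}

    def worker():
        for _ in range(iterations):
            lock.acquire()
            try:
                shared["x"] += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["x"]


print(count_with(SpinLock()))        # 800, waiters burn CPU while spinning
print(count_with(threading.Lock()))  # 800, waiters sleep until released
```

Both give the same count; the difference is what a waiting thread does in the meantime, which is why spinning suits short critical sections and blocking suits long ones.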