Scheduler Activations: Effective Kernel Support for the User-level Management of Parallelism Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy Presenter: Yi Qiao
Outline Introduction User-level threads: advantages and limitations Effective kernel support for user-level management of parallelism Implementation Performance Summary
Introduction Effectiveness of parallel computing –Largely depends on the performance and cost of primitives used to express and control the parallelism within programs Shared memory between multiple processes –Better for uniprocessor environment Use of threads –Separate the notion of a sequential execution stream from other aspects such as address spaces and I/O descriptors –A significant performance advantage over traditional processes
Problem with Threads User-level threads –Execute within the context of traditional processes Thread management requires no kernel intervention Flexible, easily customized without kernel modification Each process – virtual processor –Multiprogramming, I/O, page faults can lead to poor performance or incorrect behavior of user-level threads Kernel-level threads –Avoid the system integration problem Directly mapped onto physical processors –Too heavyweight An order of magnitude worse than the best-performing user-level threads
Goal of the Work A kernel interface and a user-level thread package that combine the functionality of kernel threads and the performance and flexibility of user-level threads –When no kernel intervention is needed, same performance as the best user-level threads –When the kernel needs to be involved, mimic a kernel thread management system No idle processors No high-priority thread waits for low-priority ones A trap by one thread won’t block others –Simple and easy application-specific customization Challenge –Necessary control and scheduling information is distributed between the kernel and application address space
Approach Each application is provided with a virtual multiprocessor and controls which of its threads run on these processors The OS kernel controls the allocation of processors among address spaces Kernel notifies the address space scheduler of relevant kernel events –Scheduler activation Vectors control to the thread scheduler on a kernel event Thread system notifies the kernel of user-level thread events that affect processor allocation decisions –Thread scheduler Executes user-level threads Makes requests to the kernel
User-level Threads: Performance Advantages and Functionality Limitations Inherent cost in kernel threads management –Accessing thread management operations Kernel trap, parameter copy and checking –Cost of generality A single underlying implementation used by all applications User-level threads improve both performance and flexibility
User-level Threads: Performance Advantages and Functionality Limitations (Cont.) Poor integration of user-level threads with the kernel interface –Kernel threads are the wrong abstraction for supporting user-level thread systems Kernel threads block, resume, and are preempted without notification to user level Kernel threads are scheduled obliviously to user-level thread state Causes problems for both uniprogrammed and multiprogrammed systems –I/O –Page faults
Effective Kernel Support for the User-level Management of Parallelism A new kernel interface + user-level thread system –Functionality of kernel threads –Performance and flexibility of user-level threads –Each user-level thread system is provided with its own virtual multiprocessor, the abstraction of a dedicated physical machine Kernel allocates processors to address spaces – complete control Each user-level thread system has complete control over which threads to run on its allocated processors Kernel vectors events to the appropriate thread scheduler –# of processors changes, I/O, page fault User-level thread system notifies the kernel when needed –Only the subset of user-level operations that may affect processor allocation Application programmer does the same thing as if programming with kernel threads –Programmers are provided with a normal Topaz thread interface
Explicit Vectoring of Kernel Events to the User-level Thread Scheduler Scheduler Activation –Each vectored event causes the user-level thread system to reconsider its scheduling decision –Three roles Serves as a vessel (execution context) for running user-level threads Notifies the user-level thread system of a kernel event Saves the processor context of the activation’s current user-level thread when the thread is stopped by the kernel (I/O or processor preemption) –Data structures similar to those of a traditional kernel thread
Scheduler Activation (Cont.) Distinction between scheduler activations and kernel threads –Once an activation’s user-level thread is stopped by the kernel, the thread is never directly resumed by the kernel –Maintains the invariant that there are always as many running scheduler activations as processors assigned to the address space Events are vectored where a scheduling decision needs to be made
Example: I/O Request/Completion T1: Two processors allocated by the kernel, two upcalls T2: Thread 1 blocks in the kernel, another upcall T3: I/O completes, the kernel preempts one processor and does the upcall T4: The upcall takes a thread off the ready list and runs it Same mechanism is used to reallocate a processor from one address space to another
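The T1 to T4 sequence above can be traced with a small simulation. This is a minimal sketch, not the Topaz or FastThreads code: the class `AddressSpace`, its method names, and the rule for picking the preempted processor are all assumptions made for illustration. It shows that each kernel event arrives on a fresh activation and that the number of running activations always equals the number of allocated processors.

```python
from collections import deque


class AddressSpace:
    """Toy model of one address space: its ready list plus the kernel events it sees."""

    def __init__(self, threads):
        self.ready = deque(threads)  # runnable user-level threads
        self.running = {}            # activation id -> thread (one entry per processor)
        self.blocked = {}            # activation id -> thread blocked in the kernel
        self.next_id = 0

    def _upcall(self, returned_threads=()):
        """A fresh activation vectors control to the user-level scheduler."""
        self.ready.extend(returned_threads)  # stopped threads rejoin the ready list
        self.next_id += 1
        act = self.next_id
        # The scheduler, not the kernel, decides what runs next (FIFO here).
        self.running[act] = self.ready.popleft() if self.ready else None
        return act

    def add_processor(self):
        # T1: the kernel allocates a processor and upcalls on it.
        return self._upcall()

    def block(self, act):
        # T2: the thread on `act` blocks in the kernel; the same processor
        # returns to user level in a new activation.
        self.blocked[act] = self.running.pop(act)
        return self._upcall()

    def unblock(self, act):
        # T3: I/O completes; the kernel preempts one of this address space's
        # processors (arbitrarily the oldest here) to deliver the news.
        victim = next(iter(self.running))
        preempted = self.running.pop(victim)
        unblocked = self.blocked.pop(act)
        # T4: both stopped threads go to the ready list and one is run.
        returned = [unblocked] + ([preempted] if preempted is not None else [])
        return self._upcall(returned)
```

Note that the blocked thread is never resumed directly by the kernel: `unblock` only hands it back to the ready list, and the scheduler decides when it runs again.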
Scheduler Activations (Cont.) Reallocate a processor from one address space to another (multiprogramming) –Stop the old activation, use the processor to do an upcall into the new address space with a new activation –A second processor in the old address space is needed for an upcall there, notifying it that two user-level threads have been stopped Some minor points –If threads have priorities, an additional preemption may be needed –Application is free to build any other concurrency model on top of scheduler activations –Sometimes a user-level thread blocked in the kernel may need to execute further in kernel mode when the I/O completes
Notifying the Kernel of User-level Events Only a small subset of user-level events that affect the kernel processor allocation decision need to be notified –Transition to the state where the address space has more runnable threads than processors –Transition to the state where the address space has more processors than runnable threads How to keep applications honest?
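The two transitions above can be written as a small state check. This is a sketch under assumptions: the class `ProcessorDemand` and the downcall labels `add_more_processors` and `processor_is_idle` are illustrative names, not an actual kernel API. The point it shows is that the kernel is told only when the address space crosses into a new state, not on every thread operation.

```python
class ProcessorDemand:
    """Tracks runnable threads against allocated processors (hypothetical helper)."""

    def __init__(self, processors):
        self.processors = processors
        self.state = "balanced"
        self.downcalls = []  # requests that would be sent to the kernel

    def _classify(self, runnable):
        if runnable > self.processors:
            return "wants_more"
        if runnable < self.processors:
            return "has_idle"
        return "balanced"

    def update(self, runnable):
        """Record a new runnable-thread count; notify only on a state transition."""
        new_state = self._classify(runnable)
        if new_state != self.state:
            if new_state == "wants_more":
                self.downcalls.append(
                    ("add_more_processors", runnable - self.processors))
            elif new_state == "has_idle":
                self.downcalls.append(
                    ("processor_is_idle", self.processors - runnable))
        self.state = new_state
```

With two processors, going from 3 to 5 runnable threads produces no second request, because the address space was already in the "more runnable threads than processors" state.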
Critical Sections Block or preempt a user-level thread in a critical section –Poor performance –Deadlock Solution –Prevention Requires the kernel to yield control over processor allocation to the user level –Recovery The thread system checks if the thread was executing in a critical section –If so, continue it temporarily via a user-level context switch –When it exits the critical section, another context switch relinquishes control back to the original upcall
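The recovery path can be sketched as follows. The names `UserThread` and `handle_preemption_upcall` are hypothetical, and a step counter stands in for real execution inside the critical section. The sketch only shows the control flow: the upcall handler briefly continues the stopped thread until it leaves its critical section, and only then places it on the ready list.

```python
from collections import deque


class UserThread:
    """Stand-in for a user-level thread; `cs_steps_left` > 0 means it holds a lock."""

    def __init__(self, name, cs_steps_left=0):
        self.name = name
        self.cs_steps_left = cs_steps_left

    @property
    def in_critical_section(self):
        return self.cs_steps_left > 0

    def step(self):
        # Models running the thread a little further inside its critical section.
        if self.cs_steps_left > 0:
            self.cs_steps_left -= 1


def handle_preemption_upcall(thread, ready):
    """Recovery: finish the critical section before queueing the stopped thread."""
    continued_steps = 0
    while thread.in_critical_section:
        thread.step()          # user-level context switch into the thread
        continued_steps += 1
    ready.append(thread)       # control is back in the upcall; lock is released
    return continued_steps
```

Because the thread is queued only after it has released its lock, no other thread can spin or block on a lock held by a descheduled thread.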
Implementation Modifying Topaz –Change the Topaz thread management routines to implement scheduler activations –Explicit allocation of processors to address spaces Modifying FastThreads –Process upcalls and provide Topaz with information related to processor allocation decisions A few hundred lines of code added to FastThreads, 1200 lines to Topaz
Implementation (Cont.) Processor Allocation Policy –Processors are divided evenly among the highest-priority address spaces –Processors an address space does not need are then divided evenly among the remainder –Time-sliced only if the number of available processors is not an integer multiple of the number of address spaces that want them –Possible for an address space to use kernel threads instead of scheduler activations Binary compatibility with existing Topaz applications Thread Scheduling Policy –Application can choose any scheduling policy Default: per-processor ready lists following FIFO
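The even-division rule can be illustrated with a short function. This is a sketch, not the Topaz allocator: the function name `allocate`, the `demand` dictionary, and the restriction to address spaces of a single priority level are all assumptions. It hands out whole processors in equal shares, redistributes what a space does not need, and reports how many processors are left over to be time-sliced.

```python
def allocate(processors, demand):
    """Space-share `processors` among address spaces of equal priority.

    `demand` maps an address-space name to the number of processors it wants.
    Returns (allocation, time_sliced), where `time_sliced` is the number of
    processors that cannot be divided evenly among still-unsatisfied spaces.
    """
    alloc = {name: 0 for name in demand}
    unsatisfied = {name for name, want in demand.items() if want > 0}
    free = processors
    # Keep dividing evenly while every unsatisfied space can get a whole processor.
    while unsatisfied and free >= len(unsatisfied):
        share = free // len(unsatisfied)
        for name in sorted(unsatisfied):
            give = min(share, demand[name] - alloc[name])
            alloc[name] += give
            free -= give
        unsatisfied = {name for name in unsatisfied if alloc[name] < demand[name]}
    time_sliced = free if unsatisfied else 0
    return alloc, time_sliced
```

For example, with six processors and demands of 1, 6, and 6, the first space keeps one processor, the other two get two each, and the one remaining processor would be time-sliced between them.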
Implementation (Cont.) Performance Enhancements –Critical Sections – need to check whether the preempted user-level thread holds a lock Thread sets a flag when entering a critical section and clears it when finished –Overhead + latency Make a copy of every low-level critical section by post-processing the compiler-generated code, and continue the preempted thread at the copy of the critical section –No overhead on lock latency in the common case –Management of scheduler activations Caching discarded scheduler activations for later reuse
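Caching discarded activations amounts to keeping a free list. The sketch below is an assumption about the general shape of such a cache, not the Topaz data structure: `ActivationCache` and the dictionary used to represent an activation are hypothetical.

```python
class ActivationCache:
    """Free list of discarded scheduler activations (hypothetical structure)."""

    def __init__(self):
        self._free = []
        self.created = 0  # how many activations had to be built from scratch

    def acquire(self):
        """Reuse a cached activation if one exists, otherwise create a new one."""
        if self._free:
            return self._free.pop()
        self.created += 1
        return {"id": self.created, "saved_context": None}

    def discard(self, activation):
        """Return an activation whose thread state has been handed to user level."""
        activation["saved_context"] = None  # clear stale state before reuse
        self._free.append(activation)
```

The benefit is that an upcall in the common case pays for a list pop rather than for allocating and initializing a new activation.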
Performance Goal: Combining the functionality of kernel threads with the performance and flexibility of user-level threads Evaluation questions –What is the cost of user-level thread operations? Fork, block –What is the cost of communication between kernel and the user level? –What is the overall effect on the performance of applications?
Performance (Cont.) Thread Performance –Cost of user-level thread operations close to those of the FastThreads package Preserve the order of magnitude advantage over kernel threads Upcall Performance –Help determine the “break-even” point to outperform kernel threads –Two user-level threads signal and wait through the kernel 2.4 milliseconds, five times worse than Topaz threads –Built as a quick modification to existing Topaz thread system –Written in Modula-2+, much slower than assembler Production scheduler activation could be faster
Application Performance Compare Topaz kernel threads, FastThreads, and FastThreads on top of scheduler activations –Application An O(N logN) solution to the N-body problem Can be either compute or I/O bound –Memory used by the application can be controlled All tests run on a six processor CVAX Firefly
Application Performance Case 1- Application makes minimal use of kernel services –Enough memory, negligible I/O and no other applications Run as fast as original FastThreads –1 processor, all perform worse than sequential implementation –More processors, kernel threads prevent good performance –Slight divergence of FastThreads and new FastThreads for 4 or 5 processors
Application Performance Case 2 – Kernel involvement required for I/O purposes –New FastThreads performs best As less memory is made available, all three systems degrade quickly –Old FastThreads is the worst one »When a user-level thread blocks, the kernel thread also blocks –New FastThreads and Topaz threads can overlap I/O with useful computation
Application Performance Case 3 – Multiprogramming environment –Two copies of the N-body application on the six processors Speedup of new FastThreads is within 5% of the uniprogramming environment with 3 processors Old FastThreads and Topaz perform much worse –Old FastThreads – physical processors idle waiting for a lock to be released while the lock holder is descheduled –Topaz – common thread operations are more expensive Limitation of the experiments –Limited number of processors makes it impossible to test large parallel applications or higher multiprogramming levels
Conclusion Scheduler Activation – a kernel interface combined with a user-level thread package –Achieves the performance of user-level threads (in the common case) with the functionality of kernel threads (correct behavior in the infrequent case) –Responsibility division Kernel –Processor allocation –Kernel event notification Application address space –Thread scheduling –Notifying the kernel of the subset of user-level events affecting processor allocation decisions –Any user-level concurrency model can be supported