User-Level Interprocess Communication for Shared Memory Multiprocessors By Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Presented by Damon Tyman (all figures/tables from the paper)
The progression LRPC (Lightweight Remote Procedure Call; Feb 1990) Optimize RPC for performance in “common case” cross-domain (same-machine) communication. URPC (User-Level Remote Procedure Call; May 1991) Eliminate the involvement of the kernel in cross-address space communication and thread management, utilizing shared memory and user-level threads.
Introduction Efficient interprocess communication encourages system decomposition across address space boundaries. Advantages of decomposed systems: Failure isolation Extensibility Modularity Unfortunately, when cross-address space communication is slow, these advantages are usually traded for better system performance.
Kernel-level interprocess communication Traditionally, interprocess communication has been the responsibility of the kernel. Problems with kernel-based communication: Architectural performance barriers Performance limited by cost of invoking the kernel and reallocating processor to a different address space. In authors’ earlier work, LRPC, 70% of overhead can be attributed to kernel mediation. Interaction with high-performance user-level threads For satisfactory performance, medium- and fine-grained parallel applications need user-level thread management. Costs (performance and system complexity) high for partitioning strongly interdependent communication and thread management across protection boundaries.
User-level cross-address space communication With shared memory multiprocessor, kernel can be eliminated from cross- address space communication. User-level managed cross-address space communication: Shared memory can be utilized as the data transfer channel. Messages sent between address spaces directly—no kernel intervention. Frequency of processor reallocation can be reduced by preferring threads from the same address space as the processor is already active in. The inherent parallelism of sending and receiving messages can be exploited for improved call performance.
User-level Remote Procedure Call (URPC) URPC provides safe and efficient communication between address spaces on the same machine without kernel mediation. Three components of interprocess communication isolated: processor reallocation, thread management, and data transfer. Control transfer between address spaces handled by thread management and processor reallocation. Compared to traditional “small kernel systems” in which the kernel handles all three components, in URPC only processor reallocation involves the kernel. URPC implementation cross-address space procedure call latency 93 microseconds compared to LRPC’s (using kernel-managed communication) 157 microseconds.
Messages vs. Procedure Calls Commonly, in existing systems communication between applications achieved by sending messages over a narrow channel that supports just a few operations. While powerful construct, message passing may require thinking in a paradigm not supported by the programming language in use. RPC better supports the programming paradigm supported by the programming language. Message-passing may serve as the underlying technique, but programming approach more familiar. With many programming languages, communication within an address space has synchronous procedure call, data typing, and shared memory features. RPC reconciles the above with the untyped, asynchronous messages of cross- address space communication by using an abstraction that hides these characteristics of the cross-address space communication from the user.
Synchronization To the programmer, cross-address space procedure calls appear synchronous (calling thread blocks waiting for return). At the thread management level and below, however, the call is asynchronous. While blocked, another thread can be run. URPC favors threads in the same address space as the processor is already running. As soon as the reply arrives, the blocked thread can be scheduled to any processor assigned to its address space. All the involved scheduling operations can be handled by a user-level thread management system, avoiding the need to reallocate any processors to a new address space (if the callee address space has at least one processor assigned to it).
User view/system components Two software packages used by stub and application code: FastThreads (user-level threads scheduled on top of middleweight kernel threads) and URPC package
Processor reallocation overhead There is significantly less overhead in switching a processor over to a thread in the same address space (context switching) than reallocating to a thread in a different address space (processor reallocation). Processor reallocation high overhead: Requires changing the protected mapping registers that define the virtual address space context. Scheduling costs to choose address space for reallocation. Substantial long-term costs due to poor cache and TLB performance from constant locality switches (greater for processor reallocation than just context switching). Minimal latency same-address space context switch takes about 15 microseconds on the C-VAX while cross-address space processor reallocation takes 55 microseconds (doesn’t consider long-term costs!).
Processor Reallocation Policy An optimistic reallocation policy is employed: threads from the same address space are scheduled whenever possible. Assumes: The client has other work to do. The server will soon have a processor with which to service a request. Optimistic assumptions may not always hold (allowing for a processor reallocation to server) Single-threaded applications High-latency I/O operations Real-time applications (bounded call latency) Server may not have its own processor. Voluntary return of processors cannot be guaranteed. No way to enforce policies regarding return of processors. If client gives processor up to server, may never get back.
Sample Execution One client (an editor) and two servers (window manager and file manager) each have their own address space. T1 and T2 are threads in the editor. Two available processors.
Data transfer using shared memory Data flows between URPC packages in different address spaces over a bidirectional shared memory queue with non-spinning test-and-set locks on either end. Receiver should never have to delay (spin-wait) while sender holds the lock. “Abusability factor” not increased by shared memory message channels Denial of service, bogus results, communication protocol violation still possible. Up to higher-level protocols to filter abuses up to application layer. Communication safety responsibility of stubs Arguments passed in shared memory buffers during binding phase. No kernel copying necessary.
Cross-address space call and thread management Strong correlation between communications functions and thread management synchronization functions RPCs synchronous with respect to calling thread Each communication function has corresponding thread management function (send <-> start, receive <-> stop) Thread classification: Heavyweight: kernel makes no distinction between thread and address space Middleweight: kernel-managed threads but decoupled from address spaces (can be multiple threads per address space) Lightweight: threads managed by user-level libraries Fine-grained parallel applications demand high-performance thread management, only possible with user-level threads. Two-level scheduling: lightweight user-level threads are scheduled on top of weightier kernel-managed threads.
URPC Performance Processing times close for client and server Nearly half of time goes to thread management (dispatch)
URPC Performance C: client processors, S: server processors; T:runnable threads in client address space Latency is traded for throughput; T=2 from T=1 means ~20% latency increase, but ~75% throughput increase
And beyond… LRPC: improve RPC for cross-domain communication performance. URPC: move inter-process communication to user-level as much as possible. Threads managed at user-level but scheduled on top of middleweight threads. Scheduler Activations (Feb 1992): provide better abstraction (than kernel threads) for kernel support of user-level threads.