User-Level Interprocess Communication for Shared Memory Multiprocessors. Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Presented by Arthur Strutzenberg.
Interprocess Communication: The LRPC paper and presentation discussed the need for failure isolation, extensibility, and modularity. There is usually a trade-off between these three needs and performance, and that trade-off is a central theme of this paper as well.
Interprocess Communication: Traditionally, interprocess communication is the responsibility of the kernel. This suffers from two problems: (1) architectural performance, and (2) the interaction between kernel-based communication and user-level threads, where designers generally use a pessimistic (non-cooperative) approach. This raises the question: “How can you have your cake and eat it too?”
Interprocess Communication: What if the communication layer were extracted from the kernel and made part of the user level? This can increase performance by sending messages between address spaces directly, eliminating unnecessary processor reallocation, amortizing processor reallocation (when it is needed) over several independent calls, and exploiting parallelism in message passing.
User-Level Remote Procedure Call (URPC): URPC allows communication between address spaces without kernel mediation. It isolates three concerns from one another: processor reallocation, thread management, and data transfer. The kernel is ONLY responsible for allocating processors to address spaces.
URPC & Communication: Communication between applications and the OS typically uses a narrow channel (ports) with a limited number of operations: create, send, receive, and destroy. Most modern operating systems support RPC.
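To make the “narrow channel” concrete, here is a minimal Python sketch of a port that exposes only the four operations. The class name `Channel` and its in-process queue are hypothetical illustrations; a real OS would implement these operations in the kernel. Notice that nothing in the interface says which processor runs the receiver, or when.

```python
from collections import deque


class Channel:
    """Hypothetical narrow channel (port) with only four operations."""

    def __init__(self) -> None:  # Create
        self._messages = deque()
        self._open = True

    def send(self, message) -> None:  # Send
        self._check_open()
        self._messages.append(message)

    def receive(self):  # Receive: non-blocking, returns None when empty
        self._check_open()
        return self._messages.popleft() if self._messages else None

    def destroy(self) -> None:  # Destroy
        self._open = False
        self._messages.clear()

    def _check_open(self) -> None:
        if not self._open:
            raise RuntimeError("channel has been destroyed")
```

Usage: `ch = Channel(); ch.send("hello"); ch.receive()` returns `"hello"`, and any operation after `ch.destroy()` raises `RuntimeError`.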
URPC & Communication: What does this buy URPC? The definition of RPC generally says little about how the communication channels operate. It also does not specify how processor scheduling (reallocation) interacts with data transfer.
URPC & Communication: URPC exploits this freedom as follows. Messages passed through logical channels are kept in memory that is shared between client and server, and this memory, once allocated, is kept intact. Thread management is done at user level (lightweight instead of “kernel weight”). (Haven’t we read this in another paper?)
URPC & Thread Management: Switching a processor to another thread in the same address space (a context switch) costs less than reallocating it to a thread in a different address space (a processor reallocation). URPC uses this fact, together with the user-level scheduler, to always give preference to threads within the same address space.
URPC & Thread Management: Some numbers for comparison: a context switch within the address space takes 15 microseconds; a processor reallocation takes 55 microseconds.
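A minimal Python sketch of the “prefer the same address space” rule, using the 15 and 55 microsecond figures from this slide. The function `pick_next` and the (thread, address space) tuples are hypothetical; URPC’s real scheduler is a user-level thread package, and this only illustrates the preference and what it saves.

```python
CONTEXT_SWITCH_US = 15  # switch to another thread in the same address space
REALLOCATION_US = 55    # move the processor to a different address space


def pick_next(current_space, runnable):
    """Choose the next thread, preferring one in the current address space.

    `runnable` is a list of (thread_name, address_space) tuples.
    Returns (chosen_tuple, cost_in_microseconds), or (None, 0) if idle.
    """
    for thread, space in runnable:
        if space == current_space:
            return (thread, space), CONTEXT_SWITCH_US
    if runnable:
        # Nothing local to run: fall back to the expensive reallocation.
        return runnable[0], REALLOCATION_US
    return None, 0
```

With these numbers, each local switch chosen instead of a reallocation saves 40 microseconds, and a reallocation costs roughly 3.7 times a context switch.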
URPC & Processor Allocation: What happens when a client invokes a procedure on a server process and the server has no processors allocated to it? URPC calls such a server “underpowered,” and the paper treats this as a load-balancing problem. The solution is reallocation from client to server: a client with an idle processor can elect to reallocate that processor to the server. This is not free; reallocation is expensive and requires a call to the kernel.
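The donation decision can be sketched in a few lines of Python. This is only bookkeeping under stated assumptions: the `AddressSpace` fields and the `is_underpowered` / `maybe_donate` names are hypothetical, “underpowered” is modeled as “has pending requests but no processors,” and the expensive kernel call is reduced to moving a counter.

```python
from dataclasses import dataclass


@dataclass
class AddressSpace:
    name: str
    processors: int            # processors currently allocated
    idle_processors: int       # allocated processors with nothing to run
    pending_requests: int = 0  # messages waiting to be serviced


def is_underpowered(server: AddressSpace) -> bool:
    # Work is queued but no processor is available to do it.
    return server.pending_requests > 0 and server.processors == 0


def maybe_donate(client: AddressSpace, server: AddressSpace) -> bool:
    """Move one idle client processor to an underpowered server.

    In URPC this step is the costly kernel call; here it is just accounting.
    """
    if client.idle_processors > 0 and is_underpowered(server):
        client.processors -= 1
        client.idle_processors -= 1
        server.processors += 1
        return True
    return False
```

Usage: a client with two processors (one idle) and a server with three queued requests but no processors results in one donation; a second call does nothing because the server is no longer underpowered.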
Rationale for URPC: The design of the URPC package presented in this paper has three main components: thread management, data transfer, and processor reallocation.
Let’s kill two birds with one stone: URPC uses an “optimistic reallocation policy,” which makes two assumptions: the client will always have other work to do, and the server will (soon) have a processor available to service messages. This leads to the “amortization of cost”: the cost of a processor reallocation is spread over several calls.
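The amortization argument is simple arithmetic. This Python sketch uses the 15 and 55 microsecond figures quoted on the thread management slide; the function names are hypothetical, and the model assumes every call in a batch is served by the same donated processor while ignoring queueing delays.

```python
import math

REALLOCATION_US = 55    # processor reallocation cost quoted on the slides
CONTEXT_SWITCH_US = 15  # same-address-space context switch cost quoted on the slides


def amortized_cost_us(calls_per_reallocation: int,
                      reallocation_us: float = REALLOCATION_US) -> float:
    """Per-call share of one reallocation spread over several calls."""
    if calls_per_reallocation < 1:
        raise ValueError("at least one call must share the reallocation")
    return reallocation_us / calls_per_reallocation


def calls_to_match_context_switch(reallocation_us: float = REALLOCATION_US,
                                  context_switch_us: float = CONTEXT_SWITCH_US) -> int:
    """Smallest number of calls for which the per-call share is <= one context switch."""
    return math.ceil(reallocation_us / context_switch_us)
```

One call pays the full 55 microseconds, ten calls pay 5.5 each, and from four calls on (13.75 each) the per-call share is already below the cost of a single context switch.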
Why the optimistic approach doesn’t always hold: The approach works less well when the application runs as a single thread, is real-time, performs high-latency I/O, or makes priority invocations. URPC handles these cases by allowing the client’s address space to force a processor reallocation to the server’s address space, even though the client might still have work to do.
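The exceptions above amount to overriding the optimistic default. Here is a hypothetical Python sketch of that decision; the `ClientState` flags are an assumed encoding of the four cases on this slide, not fields from the paper.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    runnable_threads: int          # other client threads that could use the processor
    real_time: bool = False        # caller has latency deadlines
    high_latency_io: bool = False  # caller is waiting on slow I/O
    priority_call: bool = False    # this invocation is marked high priority


def should_force_donation(client: ClientState) -> bool:
    """Optimistic default: keep the processor while the client has other work.

    The exceptional cases override that default and hand the processor
    to the server right away.
    """
    if client.runnable_threads == 0:  # single-threaded: nothing else to run
        return True
    return client.real_time or client.high_latency_io or client.priority_call
```

A busy client with no special flags keeps its processor; a single-threaded client, or any client making a real-time or priority call, forces the reallocation.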
The Kernel handles Processor Reallocation: URPC does this through a kernel call named “Processor.Donate.” It passes control of an idle processor down to the kernel, and then back up to a specified address in the receiving address space.
Voluntary Return of Processors: URPC’s policy for processors donated to a server is “…Upon receipt of a processor from a client address, return the processor when all outstanding messages from the client have generated replies, or when the server determines that the client has become ‘underpowered’….”
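A minimal Python sketch of the quoted return policy from the server’s side. The function name, the trivial reply format, and the `client_underpowered` callback are hypothetical; in URPC the actual return would be another processor donation back to the client.

```python
from collections import deque


def serve_donated_processor(requests, client_underpowered=lambda: False):
    """Run on a donated processor until the policy says to give it back.

    `requests` holds the client's outstanding messages; each one gets a
    trivial reply. Returns (replies, reason_for_returning_the_processor).
    """
    queue = deque(requests)
    replies = []
    while queue:
        if client_underpowered():
            return replies, "client underpowered"
        replies.append(("reply", queue.popleft()))
    return replies, "all messages answered"
```

Nothing in this loop forces the server to return the processor; it is purely voluntary, which is exactly the enforcement gap the next slide raises.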
Parallels to the User Threads Paper: Even though URPC implements a policy/protocol, there is no way to enforce it. This can lead to some interesting side effects, very similar to problems discussed in the User Threads paper. For example, a server thread could conceivably keep a donated processor and use it to handle requests from other clients.
What this leads to…: One word: STARVATION. URPC handles this by using direct (voluntary) reallocation only for load balancing; in other words, the system also needs the notion of preemptive reallocation. Preemptive reallocation must adhere to two rules: no higher-priority thread waits while a lower-priority thread runs, and no processor idles when there is work for it to do (even if the work is in another address space).
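The two preemption rules can be written as a checker over a snapshot of the system. This hypothetical Python sketch treats priorities as integers (higher means more important) and pools waiting threads from all address spaces, matching the “even if the work is in another address space” clause.

```python
def violated_rules(running, waiting, idle_processors):
    """Return the preemption rules a system snapshot violates.

    running / waiting: priorities of running and waiting threads, pooled
    across all address spaces. idle_processors: processors with nothing to do.
    An empty result means the snapshot is acceptable.
    """
    problems = []
    if running and waiting and max(waiting) > min(running):
        problems.append("higher-priority thread waits while lower-priority thread runs")
    if waiting and idle_processors > 0:
        problems.append("processor idles while work is waiting")
    return problems
```

A kernel enforcing these rules would preempt a processor whenever this check returns something, instead of waiting for a server to return it voluntarily.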
Controlling Channel Access: Data flowing between address spaces in URPC uses a bidirectional shared memory queue. The queue has a test-and-set lock on each end, which the paper specifically states must be NON-SPINNING. The protocol is: if the lock is free, acquire it; otherwise, go on and do something else. Remember that this protocol assumes there is always other work to do!
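A minimal Python sketch of the non-spinning protocol, assuming threads in one process stand in for two address spaces sharing memory. `threading.Lock.acquire(blocking=False)` plays the role of a single test-and-set attempt; the class names are hypothetical.

```python
import threading
from collections import deque


class TryLockQueue:
    """One direction of a URPC-style channel guarded by a non-spinning lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items = deque()

    def try_enqueue(self, item) -> bool:
        # One test-and-set style attempt, no spinning: False means
        # "busy, go do something else and try again later".
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self._items.append(item)
            return True
        finally:
            self._lock.release()

    def try_dequeue(self):
        """Return (True, item), or (False, None) if the lock is busy or the queue is empty."""
        if not self._lock.acquire(blocking=False):
            return False, None
        try:
            if self._items:
                return True, self._items.popleft()
            return False, None
        finally:
            self._lock.release()


class BidirectionalChannel:
    """Requests flow client -> server; replies flow server -> client."""

    def __init__(self) -> None:
        self.requests = TryLockQueue()
        self.replies = TryLockQueue()
```

A caller that gets `False` back is expected to run another ready thread and retry later, rather than burn its processor waiting on the lock.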
Data Transfer Using Shared Memory: There is still the risk of what the paper calls the “abusability factor” of RPC, where clients and servers can overload each other, deny service, provide bogus results, or violate communication protocols. URPC passes the responsibility for handling this to the stubs.
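What “the stubs handle it” might look like on the server side, as a hypothetical Python sketch: the message format, `ALLOWED_PROCEDURES` table, and `server_stub` name are assumptions. The point is that data read from shared memory is untrusted, so the stub validates it before dispatching instead of letting a bogus message crash the server.

```python
ALLOWED_PROCEDURES = {"add": 2, "null": 0}  # hypothetical: procedure name -> argument count


def server_stub(message, handlers):
    """Validate a message taken from the shared queue before dispatching it."""
    if not isinstance(message, dict):
        return {"error": "malformed message"}
    proc = message.get("proc")
    args = message.get("args", ())
    if not isinstance(proc, str) or proc not in ALLOWED_PROCEDURES:
        return {"error": "unknown procedure"}
    if not isinstance(args, (list, tuple)) or len(args) != ALLOWED_PROCEDURES[proc]:
        return {"error": "bad arguments"}
    return {"result": handlers[proc](*args)}
```

This covers bogus or protocol-violating messages only; overload and denial of service would need additional checks such as per-client queue limits.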
Cross-Address Space Procedure Call and Thread Management: This section of the paper identifies a correspondence between send/receive (messaging) and start/stop (threads). Does this not remind everybody of a classic paper that we had to read?
Another link to the User Threads Paper: Additionally, the paper makes three arguments about the thread/message relationship: high-performance thread management facilities are needed for fine-grained parallel programs; high performance can only be provided at the user level; and the close interaction between communication and thread management can be exploited.
URPC Performance: Some comparisons (values are in microseconds):
Test | URPC FastThreads | Taos Threads | Ratio of Taos Cost to URPC Cost
Procedure Call | 7 | 7 | 1.0
Fork | 43 | 1192 | 27.7
Fork;Join | 102 | 1574 | 15.4
Yield | 37 | 57 | 1.5
Acquire, Release | 27 | |
PingPong | 53 | 271 | 5.1
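The ratio column can be recomputed from the two cost columns. This Python sketch treats Procedure Call as 7 microseconds in both columns (which its 1.0 ratio implies) and leaves out Acquire, Release because only one value is given for it.

```python
# (test, URPC FastThreads cost in µs, Taos threads cost in µs)
MEASUREMENTS = [
    ("Procedure Call", 7, 7),
    ("Fork", 43, 1192),
    ("Fork;Join", 102, 1574),
    ("Yield", 37, 57),
    ("PingPong", 53, 271),
]


def cost_ratios(rows):
    """Ratio of Taos cost to URPC cost, rounded to one decimal place as in the table."""
    return {name: round(taos / urpc, 1) for name, urpc, taos in rows}
```

Every recomputed ratio matches the table, which is a quick sanity check that the columns were transcribed correctly.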
URPC Performance: A URPC can be broken down into four components: send, poll, receive, and dispatch. Per-component costs (values are in microseconds):
Component | Client | Server
Poll | 18 | 13
Send | 6 | 6
Receive | 10 | 9
Dispatch | 20 | 25
Total | 54 | 53
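A quick Python check that the components add up to the totals. The slide lists a single Send value of 6; this sketch assumes it applies to both client and server, since that is the only value consistent with the 54 and 53 microsecond totals.

```python
BREAKDOWN_US = {
    # component: (client µs, server µs)
    "Poll": (18, 13),
    "Send": (6, 6),
    "Receive": (10, 9),
    "Dispatch": (20, 25),
}


def totals(breakdown):
    """Sum the per-component costs for the client side and the server side."""
    client = sum(c for c, _ in breakdown.values())
    server = sum(s for _, s in breakdown.values())
    return client, server
```

Calling `totals(BREAKDOWN_US)` gives `(54, 53)`, matching the Total row.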
Call Latency and Throughput: Call latency is the time from when a thread calls into the stub until control returns from the stub. Latency and throughput are load dependent, and depend on the number of client processors (C), the number of server processors (S), and the number of runnable threads in the client’s address space (T). The graphs measure how long it takes to make 100,000 “Null” procedure calls into the server in a “tight loop.”
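A minimal Python harness in the spirit of the tight-loop experiment. It only times a local function call, so it does not reproduce URPC or vary C, S, and T; `measure_null_calls` and `null_procedure` are hypothetical names.

```python
import time


def measure_null_calls(call, iterations: int = 100_000):
    """Make `iterations` back-to-back calls in a tight loop.

    Returns (average latency in microseconds, throughput in calls per second).
    """
    start = time.perf_counter()
    for _ in range(iterations):
        call()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return elapsed / iterations * 1e6, iterations / elapsed


def null_procedure():
    """Hypothetical stand-in for the server's Null procedure."""
    return None
```

With one calling thread, throughput is simply the reciprocal of latency; with several client threads and processors, as in the paper’s experiments, the two can diverge.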
Call Latency and Throughput [Figure: call latency and throughput graphs, not reproduced]
Conclusions: In certain circumstances, it makes sense to move the communication layer from the kernel to user space. Most OSes are designed for a uniprocessor system and then ported to a shared-memory multiprocessor (SMMP). URPC is one example of a system designed directly for SMMPs, and it takes direct advantage of the characteristics of such systems.
Conclusions: As a lead-in to Professor Walpole’s discussion and Q&A, let’s conclude by trying to fill out the following table:
RPC Type | Similarities | Differences
Generic RPC | |
LRPC | |
URPC | |