CS 443 Advanced OS – Fabián E. Bustamante, Spring 2005

Lightweight Remote Procedure Call
B. Bershad, T. Anderson, E. Lazowska and H. Levy, U. of Washington
Appears in SOSP 1989
Presented by: Fabián E. Bustamante
2 Introduction
The granularity of the protection mechanism used by an OS has a significant impact on the system's design & use
Capability systems
– Fine-grained protection – each object exists in its own protection domain
– All objects live within a single name or address space
– A process in one domain can act on an object in another only through a protected procedure call
– Parameter passing is simplified by the existence of a global name space containing all objects
– Problems w/ efficient implementations
In distributed computing, large-grained protection mechanisms
– RPC facilitates placement of subsystems on different machines
– The absence of a global address space is ameliorated by automatic stub generators & sophisticated run-time libraries
– Widely used, efficient and convenient model
3 Observation
Small-kernel OSs borrow the large-grained protection & programming models of distributed computing
– Separate components are placed in disjoint domains
– Messages are used for all inter-domain communication
But they also adopt the distributed control-transfer & communication model
– Independent threads exchanging msgs. containing potentially large, structured values
– However, in the common case most communications in an OS are…
  - Cross-domain (between domains on the same machine) rather than cross-machine
  - Simple, because complex data structures are concealed behind abstract system interfaces
– Thus the model violates the common case → low performance or bad modularity, commonly the latter
"Handle normal and worst cases separately as a rule, because the requirements for the two are quite different: The normal case must be fast. The worst case must make some progress." – B. Lampson, "Hints for computer system design."
4 Motivation
Use & performance of RPC (inside the OS)
Frequency of cross-machine activity
– Systems examined
  - The V System – highly decomposed system; everything through msg. passing
  - Taos, the Firefly OS – middle-sized kernel responsible for VM, scheduling and device access; the rest (FS, network protocols, …) accessed through RPC
  - Unix/NFS – all local system functions accessed through kernel traps; RPC for communication w/ the FS

Operating system   Operations that cross machine boundaries (%)
V                  3.0
Taos               5.3
Sun Unix+NFS       0.6
5 Motivation
Parameter size & complexity – based on static & dynamic analysis of SRC RPC usage in the Taos OS
– 28 RPC services w/ 366 procedures defined
– Measured over 4 days and 1.5 million cross-domain procedure calls
  - 95% of calls went to 10 of the 112 procedures actually used
  - 75% went to just 3 of the 112
– Number of bytes transferred – the majority < 200B
– No data types were recursively defined
6 Motivation
The performance of cross-domain RPC (times in µsec)
Null procedure – void Null() { return; }
– Theoretical minimum as a cross-domain operation:
  - One procedure call
  - Kernel trap & change of the processor's VM context on call
  - Kernel trap and context change on return
– Anything above this is overhead

System   Processor       Null (theoretical min.)   Null (actual)   Overhead
Accent   PERQ            444                       2300            1856
Taos     Firefly C-VAX   109                       464             355
Mach     C-VAX           90                        754             664
V        68020           170                       730             560
Amoeba   68020           170                       800             630
DASH     68020           170                       1590            1420
7 Motivation
Overhead – where is the time going?
– Stub overhead – the difference between cross-domain & cross-machine calls is hidden by the lower layers → general but infrequently needed machinery; e.g. 70 µsec just to run the Null stubs
– Message buffer overhead – message exchange between client & server → 4 copies (through the kernel) on call/return (alloc & copy)
– Access validation – the kernel must validate the sender in both directions
– Message transfer – queuing/de-queuing of msgs.
– Scheduling – while the user sees one abstract thread, there is a real thread per domain that must be managed
– Context switch – from client to server and back
– Dispatch – a receiver thread in the server must interpret the msg. & dispatch
Some optimizations tried
– DASH avoids a kernel copy by allocating msgs. out of a region mapped into both kernel & user domains
– Mach & Taos rely on hand-off scheduling to bypass the general scheduler
– Some systems pass a few small parameters in registers
– SRC RPC gives up some safety w/ globally shared buffers, no access validation, etc.
8 Design & Implementation of LRPC
The execution model, programming semantics & large-grained protection model are borrowed from RPC
Binding is done at the granularity of an interface – a set of procedures (the binding state is sketched below)
– A server module exports an interface
  - The LRPC runtime library (server clerk) registers the interface with a name server
– A client binds to the interface by making an import call to the kernel
  - The kernel notifies the server's waiting clerk
  - The clerk replies with a list of PDs (procedure descriptors):
    - One PD per procedure in the interface
      » Entry address in the server domain & size of the A-stack for arguments and return value
  - For each PD
    - The kernel pairwise allocates in the client & server domains a number of A-stacks
      » A-stacks are read-write shared by client & server, and can be shared among procedures
    - The kernel allocates a linkage record for each A-stack
  - The kernel returns to the client
    - a Binding Object – an unforgeable certificate for access to the server's interface (capability-like)
    - the list of A-stacks for the procedures in the interface
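A minimal C sketch of that per-binding state; all type and field names here (proc_desc, astack, linkage, binding_obj) are illustrative stand-ins, not the paper's actual definitions:

/* Hypothetical per-interface binding state; names are invented. */
#include <stddef.h>
#include <stdint.h>

typedef struct proc_desc {        /* one PD per procedure in the interface */
    void   (*entry)(void *);      /* entry address in the server's domain  */
    size_t  astack_size;          /* bytes for arguments + return value    */
} proc_desc;

typedef struct astack {           /* read-write shared by client & server  */
    void  *base;                  /* mapped into both address spaces       */
    size_t size;
} astack;

typedef struct linkage {          /* kernel-private, one per A-stack       */
    void *return_addr;            /* client return address                 */
    void *caller_sp;              /* client stack pointer at call time     */
} linkage;

typedef struct binding_obj {      /* unforgeable, capability-like ticket   */
    uint64_t   key;               /* checked by the kernel on every call   */
    proc_desc *pds;               /* PD list from the server's clerk       */
    astack    *astacks;           /* A-stacks allocated at bind time       */
    unsigned   nprocs;
} binding_obj;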
9 Design & Implementation of LRPC
Calling (condensed into the C sketch after this list)
– The client calls the user-stub
  - The stub puts the arguments into an A-stack obtained from the kernel at binding
  - The stub places the Binding Object, A-stack address & procedure id into registers & traps to the kernel
– The kernel executes in the context of the client thread
  - Verifies the Binding Object & A-stack and locates the linkage record
  - Puts the return address & the current stack pointer into the linkage record
  - Finds an E-stack (execution stack) in the server domain – new or from a pool
  - Updates the thread's user stack pointer to run off the server's E-stack – note it is still the client's thread
  - Reloads the processor's VM registers with those of the server domain
  - Performs an upcall into the server-stub
– The server-stub
  - Calls the server procedure, which executes using the A-stack and E-stack
  - When the server returns, traps to the kernel
  - The kernel does the lightweight context switch back to the client address space
– The client-stub again
  - Reads the return value from the A-stack
  - Returns the result to the client
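A hypothetical walk-through of one call in runnable C. The "kernel" here is an ordinary function; the real kernel runs privileged, verifies the Binding Object, saves the linkage record, and reloads the VM registers as listed above:

/* Toy end-to-end LRPC call; everything in one address space for clarity. */
#include <stdint.h>
#include <stdio.h>

typedef struct { int32_t a, b, result; } add_args;  /* A-stack layout for Add */

/* Server stub: runs on the server's E-stack, reads/writes the A-stack. */
static void Add_server_stub(void *astack)
{
    add_args *args = astack;
    args->result = args->a + args->b;
}

/* Stand-in for the kernel path: validate the binding & A-stack, fill the
   linkage record, pick an E-stack, switch VM context, then upcall. */
static void kernel_lrpc_call(void *astack)
{
    Add_server_stub(astack);       /* upcall into the server-stub           */
    /* on return: lightweight context switch back via the linkage record    */
}

/* Client stub. */
static int32_t Add(int32_t a, int32_t b)
{
    add_args args;                 /* stands in for the shared A-stack      */
    args.a = a; args.b = b;        /* the only argument copy on the way in  */
    kernel_lrpc_call(&args);       /* a kernel trap in the real system      */
    return args.result;            /* result read straight from the A-stack */
}

int main(void)
{
    printf("Add(2, 3) = %d\n", (int)Add(2, 3));   /* -> 5 */
    return 0;
}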
10 Design & Implementation of LRPC
Stub generation
– Two kinds of stubs are automatically generated from a Modula2+ definition file
  - Simple & fast stubs in assembly language for the common cases – 4x faster
  - Complex & general stubs in Modula2+ for complex arguments, exception handling, etc.
LRPC on multiprocessors (a domain-caching sketch follows)
– A locking mechanism is required for A-stacks
– Context-switch cost is reduced further
  - On a single processor the lightweight context switch still incurs big overheads: VM register updates, TLB misses
  - On an MP, popular servers' contexts are cached on idle processors (domain caching)
    - When a client calls a server procedure, the kernel exchanges the caller's processor with the server's
    - The calling thread is placed on that processor
    - On return, the kernel exchanges the processors back
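A toy C sketch of domain caching; the scheduler state (cached_domain, NCPUS) and the dispatch logic are invented for illustration – a real kernel would actually migrate the thread and park the old processor rather than print the decision:

/* Toy domain-caching dispatch; prints the decision instead of switching. */
#include <stdio.h>

#define NCPUS  4
#define NO_CPU (-1)

static int cached_domain[NCPUS];   /* domain whose VM context each idle
                                      CPU currently holds (NO_CPU = none) */

static int find_cached_cpu(int server_domain)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cached_domain[cpu] == server_domain)
            return cpu;            /* an idle CPU already set up for it   */
    return NO_CPU;
}

static void lrpc_dispatch(int caller_cpu, int server_domain)
{
    int idle = find_cached_cpu(server_domain);
    if (idle != NO_CPU)
        /* Exchange processors: the calling thread continues on the CPU
           whose TLB/VM registers already match the server's domain.      */
        printf("swap: thread moves CPU %d -> CPU %d\n", caller_cpu, idle);
    else
        /* Serial fallback: reload VM registers on caller_cpu.            */
        printf("context switch on CPU %d\n", caller_cpu);
}

int main(void)
{
    for (int i = 0; i < NCPUS; i++) cached_domain[i] = NO_CPU;
    cached_domain[2] = 7;          /* pretend CPU 2 idles in domain 7     */
    lrpc_dispatch(0, 7);           /* -> swap to CPU 2                    */
    lrpc_dispatch(0, 9);           /* -> plain context switch             */
    return 0;
}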
11 Design and Implementation of LRPC
Argument copying (contrasted in the C sketch after the tables)
– Conventional RPC: 4 copies – user-stub → RPC msg → kernel → RPC msg → server-stub
– LRPC: one copy – user-stub → A-stack

Copy operations for LRPC vs. message-based RPC:

Operation                     LRPC   Msg. passing   Restricted msg. passing
Call (mutable parameters)     A      ABCE           ADE
Call (immutable parameters)   AE     ABCE           ADE
Return                        F      BCF            BF

Code   Copy operation
A      From client stack to message (or A-stack)
B      From sender domain to kernel domain
C      From kernel domain to receiver domain
D      From sender/kernel space to receiver/kernel domain
E      From message (or A-stack) into server stack
F      From message (or A-stack) into client's results
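An illustrative, runnable C contrast for a mutable 200-byte parameter; each memcpy is labeled with the table's copy codes, and both functions are stand-ins rather than a real RPC runtime:

/* Copy-count contrast: message-based RPC (ABCE) vs. LRPC (A). */
#include <string.h>

#define ARG_BYTES 200

/* Message-based RPC call path for a mutable parameter: 4 copies. */
static void msg_rpc_call(const char *client_args, char *server_stack)
{
    char msg[ARG_BYTES], kernel_buf[ARG_BYTES], recv_msg[ARG_BYTES];
    memcpy(msg, client_args, ARG_BYTES);        /* A: client stack -> message */
    memcpy(kernel_buf, msg, ARG_BYTES);         /* B: sender domain -> kernel */
    memcpy(recv_msg, kernel_buf, ARG_BYTES);    /* C: kernel -> receiver      */
    memcpy(server_stack, recv_msg, ARG_BYTES);  /* E: message -> server stack */
}

/* LRPC call path for a mutable parameter: 1 copy. */
static void lrpc_call(const char *client_args, char *astack)
{
    memcpy(astack, client_args, ARG_BYTES);     /* A: client stack -> A-stack;
                                                   the server uses it in place */
}

int main(void)
{
    char args[ARG_BYTES] = {0}, server_stack[ARG_BYTES], astack[ARG_BYTES];
    msg_rpc_call(args, server_stack);           /* 4 copies */
    lrpc_call(args, astack);                    /* 1 copy   */
    return 0;
}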
12 Evaluation
Tests run on a C-VAX Firefly (times in µsec)
Null is the baseline; the others have "typical" parameter sizes
Each figure is the avg. of 10k cross-domain calls
LRPC/MP uses the idle-processor optimization

Test       Description                                  LRPC/MP   LRPC   Taos
Null       The null cross-domain call                   125       157    464
Add        A proc. w/ two 4B args in & one 4B arg out   130       164    480
BigIn      A proc. w/ one 200B in arg                   173       192    539
BigInOut   A proc. w/ one 200B in/out arg               219       227    636
13 Evaluation
Breakdown of the serial (1-processor) Null LRPC on a C-VAX (all times in µsec)
"Minimum" is the timing breakdown for the theoretically minimum cross-domain call
Stub cost: 18 in the client's stub and 3 in the server's
The in-kernel costs go to binding validation and linkage management
25% of the time is due to TLB misses during virtual address translation – data structures and control sequences were designed to reduce them

Operation                 Minimum   LRPC overhead
Modula2+ procedure call   7         –
Two kernel traps          36        –
Two context switches      66        –
Stubs                     –         21
Kernel transfer           –         27
TOTAL                     109       48

(109 + 48 = 157 µsec, matching the measured Null LRPC time on the previous slide)
14 Evaluation
LRPC avoids locking shared data during call/return to remove contention on shared-memory multiprocessors
Each A-stack queue is guarded by its own lock (see the locking sketch below)
The figure shows call throughput as the number of processors simultaneously making calls increases – domain caching was disabled (each call required a context switch)
[Figure omitted; the surviving data points from the plot are 4,000, 6,300 and 23,000 calls/sec]
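A sketch of that per-queue locking, assuming POSIX threads; the queue layout (astack_node, astack_queue) is invented for illustration. One lock per A-stack queue means calls through different interfaces never contend on a global lock:

/* Per-A-stack-queue locking sketch. */
#include <pthread.h>
#include <stddef.h>

typedef struct astack_node { struct astack_node *next; } astack_node;

typedef struct astack_queue {
    pthread_mutex_t lock;       /* guards only this queue               */
    astack_node    *free_list;  /* A-stacks available for the next call */
} astack_queue;

static astack_node *astack_get(astack_queue *q)
{
    pthread_mutex_lock(&q->lock);   /* contention limited to callers    */
    astack_node *n = q->free_list;  /* of the same interface            */
    if (n) q->free_list = n->next;
    pthread_mutex_unlock(&q->lock);
    return n;
}

static void astack_put(astack_queue *q, astack_node *n)
{
    pthread_mutex_lock(&q->lock);
    n->next = q->free_list;
    q->free_list = n;
    pthread_mutex_unlock(&q->lock);
}

int main(void)
{
    astack_node n;
    astack_queue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
    astack_put(&q, &n);             /* return an A-stack to the queue   */
    return astack_get(&q) == &n ? 0 : 1;
}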
15 The uncommon cases
Work well in the common case & acceptably in the less common ones (just a few examples)
Transparency & cross-machine calls
– Cross-domain or cross-machine? An early decision – the first instruction in the stub (sketched below); the cost of the indirection is negligible by comparison
A-stacks – size and number
– PD lists are defined at compile time
– If the size of the arguments is known, the A-stack size can be determined statically; otherwise a default size (= Ethernet packet size) is used
– Beyond that, use an out-of-band memory segment (expensive but infrequent)
Domain termination – e.g. an unhandled exception or user action
– All resources are reclaimed by the OS
  - All bindings are revoked
  - All threads are stopped
– If the terminating domain is a server handling an LRPC request, the outstanding call must still return to the client domain
  - To handle outstanding threads, a new thread can be created to replace each captured one, and the captured thread is killed upon return
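A hedged C sketch of that first-instruction locality test; the binding layout and both call paths are invented placeholders, not LRPC's real stubs:

/* The stub's first action is a locality branch, so supporting
   cross-machine calls costs the local (common) case a single test. */
#include <stdbool.h>
#include <stdio.h>

typedef struct binding {
    bool local;   /* fixed at bind time: same machine or not */
} binding;

/* Placeholder paths; real stubs share an A-stack or marshal to the net. */
static int lrpc_fast_path(binding *b, int a)   { (void)b; return a; }
static int rpc_network_path(binding *b, int a) { (void)b; return a; }

static int Proc_stub(binding *b, int a)
{
    return b->local ? lrpc_fast_path(b, a) : rpc_network_path(b, a);
}

int main(void)
{
    binding local = { true }, remote = { false };
    printf("%d %d\n", Proc_stub(&local, 1), Proc_stub(&remote, 2));
    return 0;
}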
16 Conclusion
LRPC – combining elements from capability & RPC systems
– Adopts a common-case approach to communication
– A viable communication alternative for small-kernel OSs
– Optimized for communication between protection domains on the same machine
– Combines the control transfer & communication model of capability systems w/ the programming semantics & large-grained protection model of RPC
Techniques
– Simple control transfer – the client's thread does the work in the server's domain
– Simple data transfer – a parameter-passing mechanism similar to an ordinary procedure call (shared argument stack)
– Simple & highly optimized stubs
– Design for concurrency – avoids shared-data-structure bottlenecks
Implemented on the DEC Firefly (C-VAX)