CS 443 Advanced OS – Fabián E. Bustamante, Spring 2005

Lightweight Remote Procedure Call
B. Bershad, T. Anderson, E. Lazowska and H. Levy, U. of Washington
Appears in SOSP 1989
Presented by: Fabián E. Bustamante
2 Introduction
The granularity of the protection mechanism used by an OS has a significant impact on the system's design & use
Capability systems
– Fine-grained protection – each object exists in its own protection domain
– All objects live within a single name or address space
– A process in one domain can act on an object in another only through a protected procedure call
– Parameter passing is simplified by the existence of a global name space containing all objects
– Problems w/ efficient implementations
In distributed computing, large-grained protection mechanisms
– RPC facilitates placement of subsystems on different machines
– The absence of a global address space is ameliorated by automatic stub generators & sophisticated run-time libraries
– Widely used, efficient and convenient model
3 Observation
Small-kernel OSs borrow the large-grained protection & programming models of distributed computing
– Separate components are placed in disjoint domains
– Messages are used for all inter-domain communication
But they also adopt the distributed control-transfer & communication model
– Independent threads exchanging msgs. containing potentially large, structured values
– However, in the common case most communications in an OS are…
  - Cross-domain (between domains on the same machine) rather than cross-machine
  - Simple, because complex data structures are concealed behind abstract system interfaces
– Thus the model violates the common case → low performance or bad modularity, commonly the latter
"Handle normal and worst cases separately as a rule, because the requirements for the two are quite different: The normal case must be fast. The worst case must make some progress." – B. Lampson, "Hints for computer system design."
4 Motivation
Use & performance of RPC (inside the OS)
Frequency of cross-machine activity
– Systems examined
  - The V System – highly decomposed system; everything through msg. passing
  - Taos, the Firefly OS – middle-sized kernel responsible for VM, scheduling and device access; the rest (FS, network protocols, …) accessed through RPC
  - Unix/NFS – all local system functions accessed through kernel traps; RPC for communication w/ the FS

Operating system   Operations that cross machine boundaries (%)
V                  3.0
Taos               5.3
Sun Unix+NFS       0.6
5 Motivation
Parameter size & complexity – based on static & dynamic analysis of SRC RPC usage in the Taos OS
– 28 RPC services w/ 366 procedures defined
– Measured over 4 days and 1.5 million cross-domain procedure calls
  - 95% of calls went to 10 of the 112 procedures actually used
  - 75% went to just 3 of the 112
– Number of bytes transferred – the majority < 200B
– No data types were recursively defined
6 Motivation
The performance of cross-domain RPC (times in µsec)
Null procedure – void Null() { return; }
– Theoretical minimum as a cross-domain operation:
  - One procedure call
  - Kernel trap & change of the processor's VM context on call
  - Kernel trap and context change on return
– Anything above this is overhead

System   Processor       Null (theoretical min.)   Null (actual)   Overhead
Accent   PERQ            444                       2300            1856
Taos     Firefly C-VAX   109                       464             355
Mach     C-VAX           90                        754             664
V        68020           170                       730             560
Amoeba   68020           170                       800             630
DASH     68020           170                       1590            1420
7 Motivation
Overhead – where is the time going?
– Stub overhead – the difference between cross-domain & cross-machine calls is hidden by the lower layers → general but infrequently needed machinery; e.g. 70 µsec just to run the Null stubs
– Message buffer overhead – message exchange between client & server → 4 copies (through the kernel) on call/return (alloc & copy)
– Access validation – the kernel must validate the sender in both directions
– Message transfer – queuing/de-queuing of msgs.
– Scheduling – while the user sees one abstract thread, there is a real thread per domain that must be managed
– Context switch – from client to server and back
– Dispatch – a receiver thread in the server must interpret the msg. & dispatch
Some optimizations tried
– DASH avoids a kernel copy by allocating msgs. out of a region mapped into both kernel & user domains
– Mach & Taos rely on hand-off scheduling to bypass the general scheduler
– Some systems pass a few small parameters in registers
– SRC RPC gives up some safety w/ globally shared buffers, no access validation, etc.
8 Design & Implementation of LRPC
The execution model, programming semantics & large-grained protection model are borrowed from RPC
Binding is done at the granularity of an interface – a set of procedures (the binding state is sketched below)
– A server module exports an interface
  - The LRPC runtime library (server clerk) registers the interface with a name server
– A client binds to the interface by making an import call to the kernel
  - The kernel notifies the server's waiting clerk
  - The clerk replies with a list of PDs (procedure descriptors):
    - One PD per procedure in the interface
      » Entry address in the server domain & size of the A-stack for arguments and return value
  - For each PD
    - The kernel pairwise allocates in the client & server domains a number of A-stacks
      » A-stacks are read-write shared by client & server, and can be shared among procedures
    - The kernel allocates a linkage record for each A-stack
  - The kernel returns to the client
    - a Binding Object – an unforgeable certificate for access to the server's interface (capability-like)
    - the list of A-stacks for the procedures in the interface
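A minimal C sketch of that per-binding state; all type and field names here (proc_desc, astack, linkage, binding_obj) are illustrative stand-ins, not the paper's actual definitions:

/* Hypothetical per-interface binding state; names are invented. */
#include <stddef.h>
#include <stdint.h>

typedef struct proc_desc {        /* one PD per procedure in the interface */
    void   (*entry)(void *);      /* entry address in the server's domain  */
    size_t  astack_size;          /* bytes for arguments + return value    */
} proc_desc;

typedef struct astack {           /* read-write shared by client & server  */
    void  *base;                  /* mapped into both address spaces       */
    size_t size;
} astack;

typedef struct linkage {          /* kernel-private, one per A-stack       */
    void *return_addr;            /* client return address                 */
    void *caller_sp;              /* client stack pointer at call time     */
} linkage;

typedef struct binding_obj {      /* unforgeable, capability-like ticket   */
    uint64_t   key;               /* checked by the kernel on every call   */
    proc_desc *pds;               /* PD list from the server's clerk       */
    astack    *astacks;           /* A-stacks allocated at bind time       */
    unsigned   nprocs;
} binding_obj;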
9 Design & Implementation of LRPC
Calling (condensed into the C sketch after this list)
– The client calls the user-stub
  - The stub puts the arguments into an A-stack obtained from the kernel at binding
  - The stub places the Binding Object, A-stack address & procedure id into registers & traps to the kernel
– The kernel executes in the context of the client thread
  - Verifies the Binding Object & A-stack and locates the linkage record
  - Puts the return address & the current stack pointer into the linkage record
  - Finds an E-stack (execution stack) in the server domain – new or from a pool
  - Updates the thread's user stack pointer to run off the server's E-stack – note it is still the client's thread
  - Reloads the processor's VM registers with those of the server domain
  - Performs an upcall into the server-stub
– The server-stub
  - Calls the server procedure, which executes using the A-stack and E-stack
  - When the server returns, traps to the kernel
  - The kernel does the lightweight context switch back to the client address space
– The client-stub again
  - Reads the return value from the A-stack
  - Returns the result to the client
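A hypothetical walk-through of one call in runnable C. The "kernel" here is an ordinary function; the real kernel runs privileged, verifies the Binding Object, saves the linkage record, and reloads the VM registers as listed above:

/* Toy end-to-end LRPC call; everything in one address space for clarity. */
#include <stdint.h>
#include <stdio.h>

typedef struct { int32_t a, b, result; } add_args;  /* A-stack layout for Add */

/* Server stub: runs on the server's E-stack, reads/writes the A-stack. */
static void Add_server_stub(void *astack)
{
    add_args *args = astack;
    args->result = args->a + args->b;
}

/* Stand-in for the kernel path: validate the binding & A-stack, fill the
   linkage record, pick an E-stack, switch VM context, then upcall. */
static void kernel_lrpc_call(void *astack)
{
    Add_server_stub(astack);       /* upcall into the server-stub           */
    /* on return: lightweight context switch back via the linkage record    */
}

/* Client stub. */
static int32_t Add(int32_t a, int32_t b)
{
    add_args args;                 /* stands in for the shared A-stack      */
    args.a = a; args.b = b;        /* the only argument copy on the way in  */
    kernel_lrpc_call(&args);       /* a kernel trap in the real system      */
    return args.result;            /* result read straight from the A-stack */
}

int main(void)
{
    printf("Add(2, 3) = %d\n", (int)Add(2, 3));   /* -> 5 */
    return 0;
}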
10 Design & Implementation of LRPC
Stub generation
– Two kinds of stubs are automatically generated from a Modula2+ definition file
  - Simple & fast stubs in assembly language for the common cases – 4x faster
  - Complex & general stubs in Modula2+ for complex arguments, exception handling, etc.
LRPC on multiprocessors (a domain-caching sketch follows)
– A locking mechanism is required for A-stacks
– Context-switch cost is reduced further
  - On a single processor the lightweight context switch still incurs big overheads: VM register updates, TLB misses
  - On an MP, popular servers' contexts are cached on idle processors (domain caching)
    - When a client calls a server procedure, the kernel exchanges the caller's processor with the server's
    - The calling thread is placed on that processor
    - On return, the kernel exchanges the processors back
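A toy C sketch of domain caching; the scheduler state (cached_domain, NCPUS) and the dispatch logic are invented for illustration – a real kernel would actually migrate the thread and park the old processor rather than print the decision:

/* Toy domain-caching dispatch; prints the decision instead of switching. */
#include <stdio.h>

#define NCPUS  4
#define NO_CPU (-1)

static int cached_domain[NCPUS];   /* domain whose VM context each idle
                                      CPU currently holds (NO_CPU = none) */

static int find_cached_cpu(int server_domain)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cached_domain[cpu] == server_domain)
            return cpu;            /* an idle CPU already set up for it   */
    return NO_CPU;
}

static void lrpc_dispatch(int caller_cpu, int server_domain)
{
    int idle = find_cached_cpu(server_domain);
    if (idle != NO_CPU)
        /* Exchange processors: the calling thread continues on the CPU
           whose TLB/VM registers already match the server's domain.      */
        printf("swap: thread moves CPU %d -> CPU %d\n", caller_cpu, idle);
    else
        /* Serial fallback: reload VM registers on caller_cpu.            */
        printf("context switch on CPU %d\n", caller_cpu);
}

int main(void)
{
    for (int i = 0; i < NCPUS; i++) cached_domain[i] = NO_CPU;
    cached_domain[2] = 7;          /* pretend CPU 2 idles in domain 7     */
    lrpc_dispatch(0, 7);           /* -> swap to CPU 2                    */
    lrpc_dispatch(0, 9);           /* -> plain context switch             */
    return 0;
}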
11 Design and Implementation of LRPC
Argument copying (contrasted in the C sketch after the tables)
– Conventional RPC: 4 copies – user-stub → RPC msg → kernel → RPC msg → server-stub
– LRPC: one copy – user-stub → A-stack

Copy operations for LRPC vs. message-based RPC:

Operation                     LRPC   Msg. passing   Restricted msg. passing
Call (mutable parameters)     A      ABCE           ADE
Call (immutable parameters)   AE     ABCE           ADE
Return                        F      BCF            BF

Code   Copy operation
A      From client stack to message (or A-stack)
B      From sender domain to kernel domain
C      From kernel domain to receiver domain
D      From sender/kernel space to receiver/kernel domain
E      From message (or A-stack) into server stack
F      From message (or A-stack) into client's results
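An illustrative, runnable C contrast for a mutable 200-byte parameter; each memcpy is labeled with the table's copy codes, and both functions are stand-ins rather than a real RPC runtime:

/* Copy-count contrast: message-based RPC (ABCE) vs. LRPC (A). */
#include <string.h>

#define ARG_BYTES 200

/* Message-based RPC call path for a mutable parameter: 4 copies. */
static void msg_rpc_call(const char *client_args, char *server_stack)
{
    char msg[ARG_BYTES], kernel_buf[ARG_BYTES], recv_msg[ARG_BYTES];
    memcpy(msg, client_args, ARG_BYTES);        /* A: client stack -> message */
    memcpy(kernel_buf, msg, ARG_BYTES);         /* B: sender domain -> kernel */
    memcpy(recv_msg, kernel_buf, ARG_BYTES);    /* C: kernel -> receiver      */
    memcpy(server_stack, recv_msg, ARG_BYTES);  /* E: message -> server stack */
}

/* LRPC call path for a mutable parameter: 1 copy. */
static void lrpc_call(const char *client_args, char *astack)
{
    memcpy(astack, client_args, ARG_BYTES);     /* A: client stack -> A-stack;
                                                   the server uses it in place */
}

int main(void)
{
    char args[ARG_BYTES] = {0}, server_stack[ARG_BYTES], astack[ARG_BYTES];
    msg_rpc_call(args, server_stack);           /* 4 copies */
    lrpc_call(args, astack);                    /* 1 copy   */
    return 0;
}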
12 Evaluation
Tests run on a C-VAX Firefly (times in µsec)
Null is the baseline; the others have "typical" parameter sizes
Each figure is the avg. of 10k cross-domain calls
LRPC/MP uses the idle-processor optimization

Test       Description                                  LRPC/MP   LRPC   Taos
Null       The null cross-domain call                   125       157    464
Add        A proc. w/ two 4B args in & one 4B arg out   130       164    480
BigIn      A proc. w/ one 200B in arg                   173       192    539
BigInOut   A proc. w/ one 200B in/out arg               219       227    636
13 Evaluation
Breakdown of the serial (1-processor) Null LRPC on a C-VAX (all times in µsec)
"Minimum" is the timing breakdown for the theoretically minimum cross-domain call
Stub cost: 18 in the client's stub and 3 in the server's
The in-kernel costs go to binding validation and linkage management
25% of the time is due to TLB misses during virtual address translation – data structures and control sequences were designed to reduce them

Operation                 Minimum   LRPC overhead
Modula2+ procedure call   7         –
Two kernel traps          36        –
Two context switches      66        –
Stubs                     –         21
Kernel transfer           –         27
TOTAL                     109       48

(109 + 48 = 157 µsec, matching the measured Null LRPC time on the previous slide)
14 Evaluation
LRPC avoids locking shared data during call/return to remove contention on shared-memory multiprocessors
Each A-stack queue is guarded by its own lock (see the locking sketch below)
The figure shows call throughput as the number of processors simultaneously making calls increases – domain caching was disabled (each call required a context switch)
[Figure omitted; the surviving data points from the plot are 4,000, 6,300 and 23,000 calls/sec]
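A sketch of that per-queue locking, assuming POSIX threads; the queue layout (astack_node, astack_queue) is invented for illustration. One lock per A-stack queue means calls through different interfaces never contend on a global lock:

/* Per-A-stack-queue locking sketch. */
#include <pthread.h>
#include <stddef.h>

typedef struct astack_node { struct astack_node *next; } astack_node;

typedef struct astack_queue {
    pthread_mutex_t lock;       /* guards only this queue               */
    astack_node    *free_list;  /* A-stacks available for the next call */
} astack_queue;

static astack_node *astack_get(astack_queue *q)
{
    pthread_mutex_lock(&q->lock);   /* contention limited to callers    */
    astack_node *n = q->free_list;  /* of the same interface            */
    if (n) q->free_list = n->next;
    pthread_mutex_unlock(&q->lock);
    return n;
}

static void astack_put(astack_queue *q, astack_node *n)
{
    pthread_mutex_lock(&q->lock);
    n->next = q->free_list;
    q->free_list = n;
    pthread_mutex_unlock(&q->lock);
}

int main(void)
{
    astack_node n;
    astack_queue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
    astack_put(&q, &n);             /* return an A-stack to the queue   */
    return astack_get(&q) == &n ? 0 : 1;
}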
15 The uncommon cases
Work well in the common case & acceptably in the less common ones (just a few examples)
Transparency & cross-machine calls
– Cross-domain or cross-machine? An early decision – the first instruction in the stub (sketched below); the cost of the indirection is negligible by comparison
A-stacks – size and number
– PD lists are defined at compile time
– If the size of the arguments is known, the A-stack size can be determined statically; otherwise a default size (= Ethernet packet size) is used
– Beyond that, use an out-of-band memory segment (expensive but infrequent)
Domain termination – e.g. an unhandled exception or user action
– All resources are reclaimed by the OS
  - All bindings are revoked
  - All threads are stopped
– If the terminating domain is a server handling an LRPC request, the outstanding call must still return to the client domain
  - To handle outstanding threads, a new thread can be created to replace each captured one, and the captured thread is killed upon return
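A hedged C sketch of that first-instruction locality test; the binding layout and both call paths are invented placeholders, not LRPC's real stubs:

/* The stub's first action is a locality branch, so supporting
   cross-machine calls costs the local (common) case a single test. */
#include <stdbool.h>
#include <stdio.h>

typedef struct binding {
    bool local;   /* fixed at bind time: same machine or not */
} binding;

/* Placeholder paths; real stubs share an A-stack or marshal to the net. */
static int lrpc_fast_path(binding *b, int a)   { (void)b; return a; }
static int rpc_network_path(binding *b, int a) { (void)b; return a; }

static int Proc_stub(binding *b, int a)
{
    return b->local ? lrpc_fast_path(b, a) : rpc_network_path(b, a);
}

int main(void)
{
    binding local = { true }, remote = { false };
    printf("%d %d\n", Proc_stub(&local, 1), Proc_stub(&remote, 2));
    return 0;
}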
16 Conclusion
LRPC – combining elements from capability & RPC systems
– Adopts a common-case approach to communication
– A viable communication alternative for small-kernel OSs
– Optimized for communication between protection domains on the same machine
– Combines the control transfer & communication model of capability systems w/ the programming semantics & large-grained protection model of RPC
Techniques
– Simple control transfer – the client's thread does the work in the server's domain
– Simple data transfer – a parameter-passing mechanism similar to an ordinary procedure call (shared argument stack)
– Simple & highly optimized stubs
– Design for concurrency – avoids shared-data-structure bottlenecks
Implemented on the DEC Firefly (C-VAX)