Faster! Vidhyashankar Venkataraman CS614 Presentation

U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Background – Fast Computing
- Emergence of MPPs (Massively Parallel Processors) in the early 90's
  - Hardware components are repackaged into a dense configuration, forming very large parallel computing systems
  - But they require custom software
- Alternative: NOW (Berkeley) – Network Of Workstations
  - Built from inexpensive workstations connected by low-latency, high-bandwidth, scalable interconnects
  - Interconnected through fast switches
- Challenge: build a scalable system that can use the aggregate resources in the network to execute parallel programs efficiently

Issues
- Problem with traditional networking architectures:
  - The software path through the kernel involves several copies – processing overhead
  - On faster networks, applications may not see speed-up commensurate with the network's performance
- Observations:
  - For small messages, processing overhead dominates network latency
  - Most applications use small messages
  - E.g., in a UCB NFS trace, 50% of the bits sent were in messages of 200 bytes or less

Issues (contd.)
- Flexibility concerns: protocol processing happens in the kernel
- Greater flexibility is possible if application-specific information is integrated into protocol processing
  - The protocol can be tuned to the application's needs
  - E.g., customized retransmission of video frames

U-Net Philosophy
Achieve flexibility and performance by:
- Removing the kernel from the critical path
- Placing the entire protocol stack at user level
- Allowing protected user-level access to the network
- Supplying full bandwidth to small messages
- Supporting both novel and legacy protocols

Do MPPs do this?
- Parallel machines like the Meiko CS-2 and Thinking Machines CM-5 have tried to solve the problem of providing user-level access to the network
  - But they use custom networks and network interfaces – no flexibility
- U-Net targets applications on standard workstations, using off-the-shelf components

Basic U-Net Architecture
- Virtualize the network device so that each process has the illusion of owning the network interface (NI)
- A mux/demux layer virtualizes the NI and offers protection
- The kernel is removed from the critical path; it is involved only in setup

The U-Net Architecture
Building blocks (sketched in code below):
- Application endpoints – an application's handle into the network
- Communication segment (CS) – a region of memory holding message buffers
- Message queues (send, receive, and free queues)
Sending:
- Assemble the message in the CS
- Enqueue a message descriptor on the send queue
Receiving (poll-driven or event-driven):
- Dequeue a message descriptor from the receive queue
- Consume the message
- Enqueue the buffer back on the free queue
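
To make the send/receive path concrete, here is a minimal C sketch of an endpoint with a communication segment and its queues. All struct names, fields, and queue sizes are invented for illustration; this is not the actual U-Net interface, and the NI-side work (DMA, filling the receive queue) is only hinted at in comments.

```c
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

typedef struct {            /* descriptor queued on the send/recv/free queues */
    size_t   offset;        /* offset of the buffer within the CS */
    size_t   length;
    unsigned tag;           /* identifies the communication channel */
} unet_desc;

typedef struct unet_endpoint {
    char     *comm_seg;     /* communication segment: pinned buffer memory */
    unet_desc send_q[64];   /* simplified fixed-size circular queues */
    unet_desc recv_q[64];
    unet_desc free_q[64];
    int send_head, recv_head, recv_tail, free_tail;
} unet_endpoint;

/* Send: assemble the message in the CS and enqueue a descriptor. */
static void unet_send(unet_endpoint *ep, unsigned tag,
                      const void *msg, size_t len, size_t cs_off)
{
    memcpy(ep->comm_seg + cs_off, msg, len);        /* assemble in the CS */
    ep->send_q[ep->send_head++ % 64] =
        (unet_desc){ .offset = cs_off, .length = len, .tag = tag };
    /* The NI polls the send queue and transmits the data via DMA. */
}

/* Receive (polling): dequeue a descriptor, consume it, recycle the buffer. */
static bool unet_poll(unet_endpoint *ep, void *out, size_t *len)
{
    if (ep->recv_head == ep->recv_tail)
        return false;                               /* nothing arrived */
    unet_desc d = ep->recv_q[ep->recv_head++ % 64];
    memcpy(out, ep->comm_seg + d.offset, d.length); /* consume the message */
    *len = d.length;
    ep->free_q[ep->free_tail++ % 64] = d;           /* return buffer to free queue */
    return true;
}
```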

U-Net Architecture (contd.)
- More on event handling (upcalls):
  - An upcall can be a UNIX signal handler or a user-level interrupt handler
  - The cost of upcalls is amortized by batching receptions
- Mux/demux (see the sketch below):
  - Each endpoint is uniquely identified by a tag (e.g., a VCI in ATM)
  - The OS performs the initial route setup and security checks, then registers a tag in U-Net for that application
  - The message tag is mapped to a communication channel
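
A hedged sketch of the demultiplexing step, assuming a simple table set up by the OS at channel-registration time. `channel_table`, `unet_demux`, and the opaque `unet_endpoint` type (from the sketch above) are illustrative names, not the real interface.

```c
#include <stddef.h>

#define MAX_CHANNELS 1024

typedef struct unet_endpoint unet_endpoint;          /* endpoint as sketched above */

static unet_endpoint *channel_table[MAX_CHANNELS];   /* filled in by the OS at setup */

/* Called conceptually by the network interface for each arriving message. */
static void unet_demux(unsigned tag, const void *data, size_t len)
{
    if (tag >= MAX_CHANNELS || channel_table[tag] == NULL)
        return;               /* unknown tag: drop; no other application is affected */
    unet_endpoint *ep = channel_table[tag];
    /* Take a buffer from ep's free queue, deposit the data, and enqueue a
     * descriptor on ep's receive queue (details as in the earlier sketch). */
    (void)ep; (void)data; (void)len;
}
```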

Observations
- Buffers have to be preallocated – memory overhead!
- Protected user-level access to the NI is ensured by demarcating protection boundaries
  - Boundaries are defined by endpoints and communication channels
- Applications cannot interfere with each other because:
  - Endpoints, communication segments, and message queues are owned by a single user process
  - Outgoing messages are tagged with the originating endpoint address
  - Incoming messages are demultiplexed by U-Net and delivered only to the correct endpoint

Zero-copy and True Zero-copy
Two levels of sophistication, depending on whether a copy is made into the CS (contrasted in the sketch below):
- Base-level architecture ("zero-copy"):
  - Data is copied into an intermediate buffer in the CS
  - CSes are allocated, aligned, and pinned to physical memory
  - Optimized for small messages
- Direct-access architecture ("true zero-copy"):
  - Data is sent directly out of the application's data structures
  - The sender can also specify the offset at which the data is to be deposited
  - The CS spans the entire process address space
- Limitations in I/O addressing force the implementation to fall back to zero-copy
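
The difference between the two architectures shows up in what a send descriptor points at. A minimal sketch, assuming hypothetical `send_desc` and `enqueue_send` names:

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const void *buf;      /* where the data lives */
    size_t      len;
    size_t      dst_off;  /* direct-access only: offset in the receiver's CS */
    unsigned    tag;
} send_desc;

/* Stub standing in for "hand the descriptor to the NI's send queue". */
static void enqueue_send(send_desc d) { (void)d; }

/* Base level ("zero-copy"): one copy into the pinned CS, then enqueue. */
static void send_base_level(char *comm_seg, size_t cs_off, unsigned tag,
                            const void *msg, size_t len)
{
    memcpy(comm_seg + cs_off, msg, len);            /* the single remaining copy */
    enqueue_send((send_desc){ .buf = comm_seg + cs_off, .len = len,
                              .dst_off = 0, .tag = tag });
}

/* Direct access ("true zero-copy"): the descriptor points straight at the
 * application's own buffer plus a destination offset; no intermediate copy. */
static void send_direct_access(unsigned tag, const void *msg, size_t len,
                               size_t remote_offset)
{
    enqueue_send((send_desc){ .buf = msg, .len = len,
                              .dst_off = remote_offset, .tag = tag });
}
```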

Kernel-emulated Endpoint
- Communication segments and message queues are scarce resources
- Optimization: provide a single kernel-emulated endpoint, multiplexed by the kernel among processes that do not need one of their own
- Cost: performance overhead

U-Net Implementation
- U-Net was implemented on two systems, using Fore Systems SBA-100 and SBA-200 ATM network interfaces
  - But why ATM?
- Setup: SPARCstation 10s and 20s running SunOS, connected by a Fore ASX-200 ATM switch with 140 Mbps fiber links
- SBA-200 firmware:
  - 25 MHz on-board i960 processor, 256 KB RAM, DMA capability
  - The firmware was completely redesigned
- Device driver:
  - Protection is offered through the VM system (CSes are mapped into the owning process only)
  - Also through the memory mappings of the device

U-Net Performance
- Round-trip time (RTT) and bandwidth measurements
- Small messages: 65 μs RTT (with an optimization for single-cell messages)
- The fiber is saturated at message sizes of about 800 bytes

U-Net Active Messages (UAM) Layer
- Active Messages: an RPC-like primitive that can be implemented efficiently on a wide range of hardware
- The basic communication primitive in NOW
- Allows overlapping communication with computation
- Each message contains the data plus a pointer to a handler (see the sketch below)
  - The handler moves the data into the data structures of some ongoing computation
- Provides reliable message delivery
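
A small, runnable sketch of the Active Messages idea: the message names a handler, and delivery is simply an immediate handler invocation that folds the payload into an ongoing computation. Names and the in-process delivery are illustrative; this is not the UAM API.

```c
#include <stdio.h>
#include <stddef.h>

typedef void (*am_handler)(const void *payload, size_t len);

typedef struct {
    unsigned    handler_id;  /* index into the receiver's handler table */
    size_t      len;
    const void *payload;
} active_message;

/* Example handler: accumulate received values into a running sum. */
static double running_sum;
static void sum_handler(const void *payload, size_t len)
{
    (void)len;
    running_sum += *(const double *)payload;
}

static am_handler handler_table[] = { sum_handler };

/* On arrival, the AM layer dispatches straight to the named handler,
 * overlapping communication with the ongoing computation. */
static void am_deliver(const active_message *m)
{
    handler_table[m->handler_id](m->payload, m->len);
}

int main(void)
{
    double x = 2.5;
    active_message m = { .handler_id = 0, .len = sizeof x, .payload = &x };
    am_deliver(&m);
    printf("sum = %g\n", running_sum);   /* prints: sum = 2.5 */
    return 0;
}
```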

AM Micro-benchmarks
- Single-cell RTT:
  - RTT ~71 μs for a 0-32 byte message
  - An overhead of 6 μs over raw U-Net – why?
- Block store bandwidth:
  - 80% of the maximum with 2 KB blocks
  - Almost saturated at 4 KB
  - Good performance!

Split-C Application Benchmarks
- Split-C is a parallel extension to C, implemented on top of UAM
- Tested on 8 processors
- The ATM cluster performs close to the CS-2

TCP/IP and UDP/IP over U-Net
- Good performance here is necessary to demonstrate flexibility
- Traditional IP-over-ATM shows very poor performance
  - E.g., TCP achieves only 55% of the maximum bandwidth
- TCP and UDP over U-Net show improved performance
  - Primarily because of tighter application-network coupling
- IP-over-U-Net:
  - IP-over-ATM does not correspond exactly to IP-over-U-Net
  - Demultiplexing messages arriving on the same VCI is not possible

Performance Graphs
- UDP performance: saw-tooth behavior for Fore UDP
- TCP performance

Conclusion
- U-Net provides a virtual view of the network interface, enabling user-level access to high-speed communication devices
- The two main goals were performance and flexibility, pursued by keeping the kernel off the critical path
- Achieved? Look at the results table in the slides

Lightweight Remote Procedure Calls

Motivation
- Small-kernel OSes implement most services as separate, communicating user-level processes
  - Improves modular structure
  - More protection
  - Eases system design and maintenance
- Cross-domain and cross-machine communication are treated the same – problems?
  - Fails to isolate the common case
  - Performance and simplicity considerations

Measurements
- Measurements show that cross-domain calls predominate:
  - V System – 97%
  - Taos Firefly – 94%
  - Sun UNIX+NFS, diskless – 99.4%
- But how expensive are these RPCs?
  - Taos takes 464 μs for a Null cross-domain call, against a theoretical minimum of 109 μs – about 3.5x overhead
- Most interactions are simple, with small numbers of arguments
  - This can be exploited for optimization

Overheads in Cross-domain Calls
- Stub overhead – an additional execution path through client and server stubs
- Message buffer overhead – a cross-domain call can involve four copy operations per RPC
- Context switch – a VM context switch from the client's domain to the server's, and back again on return
- Scheduling – the programmer sees one abstract thread crossing domains, but the kernel blocks the client's concrete thread and schedules a separate server thread

Available Solutions?
- Eliminating kernel copies (the DASH system)
- Handoff scheduling (Mach and Taos)
- SRC RPC: message buffers are globally shared!
  - Trades safety for performance

Solution Proposed: LRPC
- Implemented on the Firefly system
- A mechanism for communication between protection domains on the same machine
- Motto: strive for performance without forgoing safety
- Basic idea: similar to RPC, but
  - Do not context-switch to a server thread
  - Instead, change the context of the client thread itself, to reduce overhead

Overview of LRPC Design
- The client calls the server through a kernel trap
- The kernel validates the caller
- The kernel dispatches the client thread directly into the server's domain
- The client provides the server with a shared argument stack and its own thread
- The call returns through the kernel to the caller

Implementation – Binding
(Diagram: client thread, kernel, server thread, and the server's clerk.)
- The server's clerk exports the interface, registers with the name server, and waits
- The client traps to the kernel to import the interface; the kernel notifies the clerk
- The clerk sends the kernel a Procedure Descriptor List (PDL)
- The kernel processes the PDL: it allocates A-stacks, linkage records, and a Binding Object (BO)
- The kernel returns the BO and the A-stack list to the client

Data Structures Used and Created
- The kernel receives a Procedure Descriptor List (PDL) from the clerk
  - It contains a Procedure Descriptor (PD) for each procedure, holding the entry address among other information
- The kernel allocates argument stacks (A-stacks), shared between the client and server domains, for each PD
- It allocates a linkage record for each A-stack to record the caller's return address
- It allocates a Binding Object (BO) – the client's key for accessing the server's interface
- (A C rendering of these structures follows below.)
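
A hypothetical C rendering of these binding-time structures. Field names, types, and layouts are invented for illustration and do not come from the paper's code.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {              /* linkage record: one per A-stack */
    void    *return_addr;     /* caller's return address */
    uint32_t caller_domain;   /* identifies the calling domain */
} linkage_record;

typedef struct {              /* argument stack shared by client and server */
    char           *base;     /* mapped into both domains */
    size_t          size;
    linkage_record *linkage;  /* associated linkage record */
} a_stack;

typedef struct {              /* procedure descriptor (PD) */
    void   (*entry)(void);    /* server entry address for this procedure */
    size_t   astack_size;     /* how large its A-stacks must be */
    a_stack *astacks;         /* A-stacks allocated for this procedure */
    int      n_astacks;
} proc_desc;

typedef struct {              /* procedure descriptor list (PDL), sent by the clerk */
    proc_desc *procs;
    int        n_procs;
} pdl;

typedef struct {              /* binding object: the client's key to the interface */
    uint64_t   key;           /* unforgeable value checked by the kernel on each call */
    const pdl *interface;
} binding_object;
```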

Calling
- Client stub: traps to the kernel after
  - Pushing the arguments onto an A-stack
  - Placing the BO, a procedure identifier, and the A-stack address in registers
- Kernel:
  - Validates the client, verifies the A-stack, and locates the PD and a linkage record
  - Stores the return address in the linkage record and pushes it on a stack
  - Switches the client thread's context into the server by running it on a new execution stack (E-stack) from the server's domain
  - Calls the server stub corresponding to the PD
- Server:
  - The client thread runs in the server's domain on the E-stack
  - It can access the parameters on the A-stack and place return values there
  - It returns to the kernel through the stub
- (A simplified code sketch of this sequence follows below.)
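
A runnable, single-process sketch that mimics this sequence: the client stub marshals arguments onto a shared A-stack, "traps" into a kernel routine that validates the binding object and dispatches the same thread into the server stub, and the result comes back on the A-stack. The in-process simulation and all names are illustrative, not the Firefly implementation; the real address-space switch and E-stack setup cannot be shown at user level.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

enum { PROC_ADD = 0 };

typedef struct { char base[256]; } a_stack;          /* stands in for a shared A-stack */
typedef void (*server_stub)(a_stack *);
typedef struct {                                     /* binding object + procedure table */
    uint64_t    key;
    server_stub procs[1];
} binding;

/* --- server side -------------------------------------------------------- */
static void add_server_stub(a_stack *as)             /* runs on the client's own thread */
{
    int x, y, r;
    memcpy(&x, as->base, sizeof x);
    memcpy(&y, as->base + sizeof x, sizeof y);
    r = x + y;                                       /* the actual procedure body */
    memcpy(as->base, &r, sizeof r);                  /* return value goes back via the A-stack */
}

/* --- "kernel" ------------------------------------------------------------ */
static void lrpc_trap(binding *bo, uint64_t key, int proc_id, a_stack *as)
{
    if (bo->key != key) return;                      /* validate the caller's binding */
    /* (record the return address in the linkage record, switch to the
     *  server's domain on an E-stack ... omitted in this simulation)       */
    bo->procs[proc_id](as);                          /* dispatch into the server stub */
}

/* --- client stub --------------------------------------------------------- */
static int add(binding *bo, uint64_t key, int x, int y)
{
    a_stack as;
    memcpy(as.base, &x, sizeof x);                   /* push arguments onto the A-stack */
    memcpy(as.base + sizeof x, &y, sizeof y);
    lrpc_trap(bo, key, PROC_ADD, &as);               /* BO, proc id, A-stack -> kernel */
    int result;
    memcpy(&result, as.base, sizeof result);         /* pick up the return value */
    return result;
}

int main(void)
{
    binding bo = { .key = 0xC0FFEE, .procs = { add_server_stub } };
    printf("add(2, 3) = %d\n", add(&bo, 0xC0FFEE, 2, 3));   /* prints 5 */
    return 0;
}
```

The point of the real mechanism is that the steps elided in `lrpc_trap` replace a full scheduler interaction with a direct domain switch on the caller's own thread.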

Stub Generation
- LRPC stubs are automatically generated, in assembly language, for simple execution paths
  - Sacrifices portability for performance
- Both local and remote stubs are maintained
  - The first instruction of the local stub is a branch statement

What is optimized here?
- Using the same thread in both domains reduces overhead
  - Avoids scheduling decisions
  - Saves the cost of saving and restoring thread state
- Pairwise A-stack allocation guarantees protection from third-party domains
  - What about within the pair? Asynchronous updates?
- The client is validated using the BO – provides security
- Redundant copies are eliminated by using the A-stack
  - 1 copy, against 4 in traditional cross-domain RPC
  - Sometimes two? Optimizations apply

Argument Copy

But… Is It Really Good Enough?
- LRPC trades memory-management cost for reduced call overhead
  - A-stacks have to be allocated at bind time
  - But their size is generally small
- Will LRPC still work if a server migrates from a remote machine to the local machine?

Other Issues – Domain Termination
- An LRPC into a server domain that has terminated should be returned to the client
- An LRPC should not return to the caller if the caller has terminated
- Both cases are handled using binding objects:
  - Revoke the terminated domain's binding objects
  - For threads running LRPCs in the terminated domain, restart new threads in the corresponding callers
  - Invalidate active linkage records – a thread is returned to the first domain on its call chain that still has an active linkage record
  - Otherwise the thread is destroyed

Multiprocessor Issues
- LRPC minimizes the use of shared data structures on the critical path
  - Guaranteed by the pairwise allocation of A-stacks
- Domain contexts are cached on idle processors (sketched below):
  - Threads idle in the server's context on idle processors
  - When a client thread makes an LRPC to that server, processors are swapped
  - This reduces context-switch overhead
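
A simplified sketch of the idle-processor idea, under the assumption of a per-domain "parked CPU" slot. Every name and stub here is invented for illustration; it is not the Firefly kernel interface, and the real system swaps processors rather than merely migrating the call.

```c
#include <stddef.h>

#define MAX_DOMAINS 64

typedef struct lrpc_call lrpc_call;                 /* opaque call record */

/* CPU id parked in each server domain's context, or -1 if none
 * (initialization to -1 omitted in this sketch). */
static int idle_cpu_in_domain[MAX_DOMAINS];

/* Stubs standing in for real kernel operations. */
static void hand_call_to_cpu(lrpc_call *c, int cpu) { (void)c; (void)cpu; }
static void switch_address_space(int domain)        { (void)domain; }
static void run_server_stub(lrpc_call *c)           { (void)c; }

static void lrpc_dispatch(lrpc_call *c, int server_domain)
{
    int cpu = idle_cpu_in_domain[server_domain];
    if (cpu >= 0) {
        /* A processor is already idling in the server's context: give it the
         * call, so no address-space switch is paid on the critical path.   */
        idle_cpu_in_domain[server_domain] = -1;
        hand_call_to_cpu(c, cpu);
    } else {
        /* No cached context: fall back to a normal context switch.         */
        switch_address_space(server_domain);
        run_server_stub(c);
    }
}
```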

Evaluation of LRPC
- Performance of four test programs (times in μs), run on the C-VAX Firefly and averaged over repeated calls

Cost Breakdown for the Null LRPC
- "Minimum" refers to the inherent minimum overhead
- 18 μs are spent in the client stub and 3 μs in the server stub
- About 25% of the time is spent in TLB misses

Throughput on a Multiprocessor
- Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor
- Speedup of 3.7 with 4 processors relative to 1 processor
- Speedup of 4.3 with 5 processors
- SRC RPC shows inferior scaling because a global lock is held during the critical transfer path

Conclusion
LRPC combines:
- The control-transfer and communication model of capability systems
- The programming semantics and large-grained protection model of RPC
It enhances performance by isolating the common case.

NOW
We will see 'NOW' later in one of the subsequent CS614 presentations.