CS533 Concepts of Operating Systems Jonathan Walpole.

Improving IPC by Kernel Design & The Performance of Micro-Kernel-Based Systems

The IPC Dilemma IPC is very important in μ-kernel design -Increases modularity, flexibility, security and scalability Past implementations have been inefficient -Message transfer takes μs

The L3 (μ-kernel based) OS A task consists of: Threads -Communicate via messages that consist of strings and/or memory objects Dataspaces -Memory objects Address space -Where dataspaces are mapped

Redesign Principles IPC performance is the Master! -All design decisions require a performance discussion -If something performs poorly, look for new techniques -Synergetic effects have to be taken into consideration -Design must cover all levels from architecture to coding -The design must be made on a concrete basis -The design must aim at a concrete performance goal

Achievable Performance A simple scenario -Thread A sends a null message to thread B -Minimum of 172 cycles Will aim at 350 cycles (7 μs) -Will actually achieve 250 cycles (5 μs)
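The cycle counts and times on this slide can be cross-checked with a small worked conversion. This is only a sketch: the 50 MHz clock rate is an assumption inferred from the slide's own pairing of 350 cycles with 7 μs, and the name `cycles_to_us` is made up for illustration.

```python
# Assumed clock rate in MHz, inferred from "350 cycles (7 μs)" on the slide.
CLOCK_MHZ = 50


def cycles_to_us(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


for label, cycles in [("minimum", 172), ("goal", 350), ("achieved", 250)]:
    print(f"{label}: {cycles} cycles = {cycles_to_us(cycles):.2f} us")
```

Under that assumption the 172-cycle minimum corresponds to roughly 3.4 μs, so the achieved 250 cycles sits well inside the 350-cycle goal.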

Redesign Levels Architectural -System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks Algorithmic -Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages Interface -Unnecessary Copies, Parameter passing Coding -Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

Architectural Level System Calls -Expensive! So, require as few as possible. -Implement two calls: -Call -Reply & Receive Next -Combines sending an outgoing message with waiting for an incoming message -Schedulers can handle replies the same as requests
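The saving from the two combined calls can be illustrated by counting kernel entries per round trip. This is a toy model, not L3 code: `CountingKernel` and the syscall names are hypothetical, and the only thing modelled is how often user code enters the kernel.

```python
class CountingKernel:
    """Toy kernel that only counts how often user code enters it."""

    def __init__(self):
        self.entries = 0

    def syscall(self, name):
        self.entries += 1


def rpc_separate(kernel):
    """One round trip with separate send and receive calls."""
    kernel.syscall("send")     # client sends the request
    kernel.syscall("receive")  # server receives the request
    kernel.syscall("send")     # server sends the reply
    kernel.syscall("receive")  # client receives the reply


def rpc_combined(kernel):
    """One round trip with Call and Reply & Receive Next."""
    kernel.syscall("call")                    # client: send request, wait for reply
    kernel.syscall("reply_and_receive_next")  # server: send reply, wait for next request


def count_entries(rpc, round_trips):
    kernel = CountingKernel()
    for _ in range(round_trips):
        rpc(kernel)
    return kernel.entries


print(count_entries(rpc_separate, 10), count_entries(rpc_combined, 10))
```

In this model the combined calls halve the number of kernel entries per round trip, which is the motivation the slide gives for requiring as few system calls as possible.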

Messages Complex Messages: -Direct String, Indirect Strings (optional) -Memory Objects -Used to combine sends if no reply is needed -Can transfer values directly from sender’s variables to receiver’s variables. (Figure: a complex message)

Direct Transfer Address spaces have a fixed kernel accessible part -Messages transferred via the kernel part -User A space -> Kernel -> User B space -Requires 2 copies. -Larger messages lead to higher costs (Figure: User A -> Kernel -> User B)

Shared User Level Memory (LRPC, SRC RPC) -Security can be penetrated -Cannot check message’s legality -Long messages -> address space becoming a critical resource -Explicit opening of communication channels -Not application friendly

Temporary Mapping L3 uses a Communication Window -Only kernel accessible, and exists per address space -Target region is temporarily mapped there -Then the message is copied to the communication window and ends up in the correct place in the target address space (Figure: User A, Kernel communication window, User B)

Temporary Mapping Must be fast! With a 2-level page table, only one word needs to be copied -pdir A -> pdir B The TLB must be clean of entries relating to the use of the communication window by other operations -One thread -TLB is always “window clean”. -Multiple threads -Interrupts – TLB is flushed -Thread switch – Invalidate communication window entries
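The difference between the two-copy path and temporary mapping can be sketched with a toy address-space model. Everything here is an assumption made for illustration: each page-directory entry is a reference to a `bytearray` standing in for a mapped region, `WINDOW_SLOT` is a hypothetical reserved index, and the receiver's entry is assumed to be the one copied into the sender's window slot.

```python
WINDOW_SLOT = 3  # hypothetical page-directory index reserved for the window


class AddressSpace:
    def __init__(self, n_slots=4, region_size=16):
        # Each entry references a region; the last slot is the (empty) window.
        self.pdir = [bytearray(region_size) for _ in range(n_slots - 1)] + [None]


def ipc_two_copies(src, src_slot, dst, dst_slot, length):
    """User A -> kernel buffer -> user B: the message is copied twice."""
    kernel_buf = bytes(src.pdir[src_slot][:length])  # copy 1
    dst.pdir[dst_slot][:length] = kernel_buf         # copy 2
    return 2


def ipc_temporary_mapping(src, src_slot, dst, dst_slot, length):
    """Map the target region into the window, then copy the message once."""
    src.pdir[WINDOW_SLOT] = dst.pdir[dst_slot]  # copy one pdir entry (a reference)
    window = src.pdir[WINDOW_SLOT]
    window[:length] = memoryview(src.pdir[src_slot])[:length]  # the single copy
    src.pdir[WINDOW_SLOT] = None  # drop the temporary mapping
    return 1


a, b = AddressSpace(), AddressSpace()
a.pdir[0][:5] = b"hello"
copies = ipc_temporary_mapping(a, 0, b, 1, 5)
print(copies, bytes(b.pdir[1][:5]))
```

The sketch ignores the TLB entirely; in the real design the window entries must also be invalidated, as the slide notes.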

Strict Process Orientation One kernel stack per thread, unlike Mach’s use of continuations and a single kernel stack per CPU May lead to a large number of stacks -Minor problem if stacks are objects in virtual memory

Thread Control Blocks (tcb’s) Hold kernel, hardware, and thread-specific data Stored in a virtual array in shared kernel space (Figure: user area and kernel area, with the tcb and kernel stack in the kernel area)

Tcb Benefits Fast tcb access using thread id -tcb contains the thread’s kernel stack -saves 3 TLB misses per IPC Threads can be locked by unmapping the tcb -Checking for tcb allocation and residence is moved to the page fault handler

Algorithmic Level Thread IDs -L3 uses a 64-bit unique identifier (uid) containing the thread number -The tcb address is easily obtained -by ANDing the lower 32 bits with a bit mask and adding the tcb base address
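The address computation described above can be sketched as follows. The concrete layout is hypothetical: the slide gives neither the mask, the tcb size, nor the base address, so `TCB_SHIFT`, `THREAD_BITS`, `TCB_BASE` and `make_uid` are all assumptions chosen so that the masked bits are already a byte offset into the tcb array.

```python
TCB_SHIFT = 11          # hypothetical: 2 KiB per tcb (including kernel stack)
THREAD_BITS = 16        # hypothetical: width of the thread-number field
TCB_BASE = 0xE000_0000  # hypothetical base of the virtual tcb array
TCB_MASK = ((1 << THREAD_BITS) - 1) << TCB_SHIFT


def make_uid(thread_no, generation=0, low_bits=0):
    """Build an illustrative 64-bit uid with the thread number in the low word."""
    low = (thread_no << TCB_SHIFT) | (low_bits & ((1 << TCB_SHIFT) - 1))
    return (generation << 32) | low


def tcb_address(uid):
    """AND the lower 32 bits with a mask, then add the tcb base address."""
    return ((uid & 0xFFFF_FFFF) & TCB_MASK) + TCB_BASE


uid = make_uid(thread_no=5, generation=7, low_bits=0x3)
print(hex(tcb_address(uid)))
```

Because the thread-number field is stored pre-shifted in this layout, the lookup needs no table access and no multiplication, only one AND and one ADD.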

Algorithmic Level Lazy Scheduling -Only a thread state variable is changed (ready/waiting) -Deletion from queues happens when queues are parsed -Reduces delete operations -Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet!
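A minimal sketch of lazy scheduling, assuming a single ready queue: blocking only flips the state variable, and stale entries are removed when the queue is next parsed. The class names, the `queued` flag and the `queue_ops` counter are hypothetical and exist only to make the saved operations visible.

```python
from collections import deque


class Thread:
    def __init__(self, name):
        self.name = name
        self.state = "waiting"
        self.queued = False  # is the thread physically in the ready queue?


class LazyScheduler:
    def __init__(self):
        self.ready_queue = deque()
        self.queue_ops = 0  # counts real inserts and deletes

    def make_ready(self, thread):
        thread.state = "ready"
        if not thread.queued:  # a stale entry is simply reused
            self.ready_queue.append(thread)
            thread.queued = True
            self.queue_ops += 1

    def block(self, thread):
        thread.state = "waiting"  # IPC path: state change only, no delete

    def pick_next(self):
        """Parse the queue, deleting stale entries until a ready thread is found."""
        while self.ready_queue:
            thread = self.ready_queue[0]
            if thread.state == "ready":
                return thread
            self.ready_queue.popleft()
            thread.queued = False
            self.queue_ops += 1
        return None


sched = LazyScheduler()
a, b = Thread("A"), Thread("B")
sched.make_ready(a)
sched.make_ready(b)
for _ in range(100):  # A blocks and wakes repeatedly during IPC
    sched.block(a)
    sched.make_ready(a)
print(sched.queue_ops)
```

In this model the hundred block and wake-up pairs cost no queue operations at all, because the thread was never physically removed.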

Algorithmic Level Short messages via registers -Register transfers are fast -A high proportion of messages are ≤ 8 bytes -Messages of up to 8 bytes can be transferred in registers with a decent performance gain -May not pay off for other processors

Interface Level Parameter Passing -Use registers whenever possible. -Far more efficient -Give compilers better opportunities to optimize code.

Coding Level Cache Misses -Cache line fill sequence should match the usual data access sequence TLB Misses -Try and pack in one page: -IPC related kernel code -Processor internal tables -Start/end of Larger tables -Most heavily used entries

Coding Level Registers -Segment register loading is expensive -One flat segment covering the complete address space Jumps and Checks -Basic code blocks should be arranged so that as few jumps are taken as possible Process switch -Save/restore of stack pointer and address space only -Other registers (floating point etc) saved/restored only when really necessary

L3 IPC Performance vs Mach IPC

But Is That Enough? What is the impact on overall system performance? Haertig et al. explore the performance of -L4-based Linux -Mach-based Linux -native Linux

L4Linux – micro-kernel based Linux Full binary compatibility with Linux/X86 No changes to architecture-independent Linux code No Linux-specific modifications to the L4 kernel

L4Linux – Design & Implementation Linux implemented as a single Linux server in a μ-kernel task -μ-kernel tasks used for Linux user processes -A single L4 thread handles system calls and page faults -This thread is multiplexed (treated as a virtual CPU) On booting, the Linux server requests memory from its pager -L4 maps physical memory into the server’s address space The Linux server is then the pager for the processes it creates -L4 converts user-process page faults into an RPC -sent to the Linux server -it then maps pages from its address space to the user process
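The pager relationship described above can be sketched as a toy model in which the kernel turns a fault into a call on the faulting process's pager. The classes, the frame numbers and the first-free allocation policy are all hypothetical; real L4 IPC and mapping operations are not modelled.

```python
class LinuxServer:
    """Toy stand-in for the Linux server acting as pager for its processes."""

    def __init__(self, frames):
        self.free_frames = list(frames)  # memory granted by the server's own pager
        self.page_tables = {}            # pid -> {virtual page: frame}

    def handle_fault_rpc(self, pid, vpage):
        table = self.page_tables.setdefault(pid, {})
        if vpage not in table:
            # Hypothetical policy: hand out the first free frame.
            table[vpage] = self.free_frames.pop(0)
        return table[vpage]


class L4Kernel:
    """Toy kernel: converts each page fault into an RPC to the registered pager."""

    def __init__(self):
        self.pagers = {}    # pid -> pager object
        self.rpc_count = 0

    def page_fault(self, pid, vpage):
        self.rpc_count += 1
        return self.pagers[pid].handle_fault_rpc(pid, vpage)


server = LinuxServer(frames=range(100, 104))
kernel = L4Kernel()
kernel.pagers[1] = server  # the server is the pager for process 1
frame = kernel.page_fault(pid=1, vpage=0x40)
print(frame, kernel.rpc_count)
```

The point of the sketch is the direction of control: the kernel itself holds no paging policy and only forwards the fault.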

Interrupt Handling Linux top halves run as one server thread per interrupt source -L4 converts an interrupt to a message to the appropriate thread Linux bottom halves all execute in a single high priority thread Linux interrupt threads are high priority -avoids concurrent execution of Linux code on a uniprocessor

System Calls System calls implemented as IPC -user process sends IPC to the Linux server Modified libc.so or libc.a avoids trap instructions -uses L4 IPC instead to call the Linux server User-level exception handler (trampoline) emulates the native system call ‘trap’ instruction for binary compatibility -L4 redirects the trap to an emulation library, which then uses L4 IPC to call the Linux server

Signals Each user process has a separate signal-handler thread Linux server delivers a signal by sending a message to the user process’s signal-handler thread The signal-handler causes the user process’s main thread to save its state and enter Linux by manipulating the main thread’s SP and PC

Scheduling All thread scheduling is done by the L4 kernel The Linux server’s schedule() routine is only used for multiplexing the Linux server’s main thread across concurrent Linux system calls

Experiment What is the penalty of using L4Linux? -Compare L4Linux to native Linux Does micro-kernel performance matter? -Compare L4Linux to MkLinux Is co-location critical for good performance? -Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks System call overhead on the shortest call, getpid():
Linux: 223 cycles
L4Linux: 526 cycles
L4Linux (trampoline): 753 cycles
MkLinux (in kernel): 2050 cycles
MkLinux (user): 14710 cycles
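The getpid() figures above are easier to compare as ratios to native Linux. The following worked arithmetic uses only the cycle counts from the slide; the dictionary and function names are made up for illustration.

```python
# getpid() cost in cycles, taken from the slide.
GETPID_CYCLES = {
    "Linux": 223,
    "L4Linux": 526,
    "L4Linux (trampoline)": 753,
    "MkLinux (in kernel)": 2050,
    "MkLinux (user)": 14710,
}


def slowdown(system):
    """Cost of getpid() on `system` relative to native Linux."""
    return GETPID_CYCLES[system] / GETPID_CYCLES["Linux"]


for name in GETPID_CYCLES:
    print(f"{name}: {slowdown(name):.1f}x native")
```

On these numbers L4Linux costs a little over twice the native call, while user-level MkLinux costs roughly 66 times as much.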

Microbenchmarks (cont.) Performance of specific system calls

Macrobenchmarks Time to recompile Linux server

Macrobenchmarks (cont.) Commercial test suite simulates system under load

Performance Analysis L4Linux averages 8.3% slower than native Linux -Only 6.8% slower at maximum load MkLinux averages 49% slower than native Linux -60% slower at maximum load Co-located MkLinux averages 29% slower than native Linux -37% at maximum

Conclusions Using the L4 micro-kernel imposes a 5-10% slowdown relative to native Linux -Much faster than previous micro-kernels Is hardware-based protection inherently inefficient? Have we explored fine-grained protection?

Spare Slides

Extensibility Performance A micro-kernel must provide more than just the features of the OS running on top of it. Specialization – improved implementation of OS functionality Extensibility – permits implementation of new services that cannot be easily added to a conventional OS.

Pipes and RPC (1) The first five measurements use the standard pipe mechanism of the Linux kernel. (2) is asynchronous and uses only L4 IPC primitives; it emulates POSIX standard pipes, without signalling, and adds a thread for buffering and cross-address-space communication. (3) is synchronous and uses blocking IPC without buffering data. (4) maps pages into the receiver’s address space.

Virtual Memory Operations The “Fault” operation is an example of extensibility – it measures the time to resolve a page fault by a user-defined pager in a separate address space. “Trap” – Latency between a write operation to a protected page and the invocation of the related exception handler. “Appel1” – Time to access a random protected page. The fault handler unprotects the page, protects some other page, and resumes. “Appel2” – Time to access a random protected page where the fault handler only unprotects the page and resumes.