The Multikernel: A new OS architecture for scalable multicore systems

Slides:



Advertisements
Similar presentations
Multiple Processor Systems
Advertisements

System Area Network Abhiram Shandilya 12/06/01. Overview Introduction to System Area Networks SAN Design and Examples SAN Applications.
SE-292 High Performance Computing
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Dr. Kalpakis CMSC 621, Advanced Operating Systems. Fall 2003 URL: Distributed System Architectures.
Chorus Vs Unix Operating Systems Overview Introduction Design Principles Programmer Interface User Interface Process Management Memory Management File.
Chorus and other Microkernels Presented by: Jonathan Tanner and Brian Doyle Articles By: Jon Udell Peter D. Varhol Dick Pountain.
User-Level Interprocess Communication for Shared Memory Multiprocessors Bershad, B. N., Anderson, T. E., Lazowska, E.D., and Levy, H. M. Presented by Akbar.
Study of Hurricane and Tornado Operating Systems By Shubhanan Bakre.
Multiple Processor Systems
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
Threads, SMP, and Microkernels Chapter 4. Process Resource ownership - process is allocated a virtual address space to hold the process image Scheduling/execution-
Chapter 4 Threads, SMP, and Microkernels Patricia Roy Manatee Community College, Venice, FL ©2008, Prentice Hall Operating Systems: Internals and Design.
Using DSVM to Implement a Distributed File System Ramon Lawrence Dept. of Computer Science
Lightweight Remote Procedure Call Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy Presented by Alana Sweat.
The Multikernel: A new OS architecture for scalable multicore systems Andrew Baumann et al CS530 Graduate Operating System Presented by.
Computer Systems/Operating Systems - Class 8
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm.
Presented By Srinivas Sundaravaradan. MACH µ-Kernel system based on message passing Over 5000 cycles to transfer a short message Buffering IPC L3 Similar.
1. Overview  Introduction  Motivations  Multikernel Model  Implementation – The Barrelfish  Performance Testing  Conclusion 2.
Multiple Processor Systems Chapter Multiprocessors 8.2 Multicomputers 8.3 Distributed systems.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Bugnion et al. Presented by: Ahmed Wafa.
G Robert Grimm New York University Disco.
User Level Interprocess Communication for Shared Memory Multiprocessor by Bershad, B.N. Anderson, A.E., Lazowska, E.D., and Levy, H.M.
3.5 Interprocess Communication Many operating systems provide mechanisms for interprocess communication (IPC) –Processes must communicate with one another.
3.5 Interprocess Communication
Microkernels: Mach and L4
User-Level Interprocess Communication for Shared Memory Multiprocessors Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy Presented.
PRASHANTHI NARAYAN NETTEM.
Ch4: Distributed Systems Architectures. Typically, system with several interconnected computers that do not share clock or memory. Motivation: tie together.
Computer System Architectures Computer System Software
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
Chapter 4 Threads, SMP, and Microkernels Patricia Roy Manatee Community College, Venice, FL ©2008, Prentice Hall Operating Systems: Internals and Design.
Chapter 6 Multiprocessor System. Introduction  Each processor in a multiprocessor system can be executing a different instruction at any time.  The.
Windows 2000 Course Summary Computing Department, Lancaster University, UK.
Threads, SMP, and Microkernels Chapter 4. Process Resource ownership - process is allocated a virtual address space to hold the process image Scheduling/execution-
Ihr Logo Operating Systems Internals & Design Principles Fifth Edition William Stallings Chapter 2 (Part II) Operating System Overview.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster computers –shared memory model ( access nsec) –message passing multiprocessor.
Supporting Multi-Processors Bernard Wong February 17, 2003.
Distributed Shared Memory Based on Reference paper: Distributed Shared Memory, Concepts and Systems.
Disco : Running commodity operating system on scalable multiprocessor Edouard et al. Presented by Vidhya Sivasankaran.
1 Threads, SMP, and Microkernels Chapter Multithreading Operating system supports multiple threads of execution within a single process MS-DOS.
CS533 - Concepts of Operating Systems 1 The Mach System Presented by Catherine Vilhauer.
The Mach System Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne Presented by: Jee Vang.
System Components ● There are three main protected modules of the System  The Hardware Abstraction Layer ● A virtual machine to configure all devices.
Operating System 4 THREADS, SMP AND MICROKERNELS.
A. Frank - P. Weisberg Operating Systems Structure of Operating Systems.
M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young MACH: A New Kernel Foundation for UNIX Development Presenter: Wei-Lwun.
The Mach System Silberschatz et al Presented By Anjana Venkat.
6.894: Distributed Operating System Engineering Lecturers: Frans Kaashoek Robert Morris
Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm.
Lecture 27 Multiprocessor Scheduling. Last lecture: VMM Two old problems: CPU virtualization and memory virtualization I/O virtualization Today Issues.
Operating Systems Unit 2: – Process Context switch Interrupt Interprocess communication – Thread Thread models Operating Systems.
The Multikernel: A New OS Architecture for Scalable Multicore Systems By (last names): Baumann, Barham, Dagand, Harris, Isaacs, Peter, Roscoe, Schupbach,
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
1 Chapter 2: Operating-System Structures Services Interface provided to users & programmers –System calls (programmer access) –User level access to system.
T HE M ULTIKERNEL : A NEW OS ARCHITECTURE FOR SCALABLE MULTICORE SYSTEMS Presented by Mohammed Mustafa Distributed Systems CIS*6000 School of Computer.
The Mach System Sri Ramkrishna.
The Multikernel: A New OS Architecture for Scalable Multicore Systems
Alternative system models
The University of Adelaide, School of Computer Science
CMSC 611: Advanced Computer Architecture
The Multikernel A new OS architecture for scalable multicore systems
Fast Communication and User Level Parallelism
Presented by Neha Agrawal
TORNADO OPERATING SYSTEM
CS510 - Portland State University
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Presentation transcript:

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith

Agenda Review Introduction Optimizing the OS based on hardware Processor changes Shared Memory vs Messaging Barrelfish Overview Barrelfish Goals Barrelfish Architecture Barrelfish Evaluations

Review: Concurrency Topics Improving concurrency with shared memory data structures Non-blocking solutions: data structure specific CAS and CAS2 Load Linked/Store Conditional Non-blocking solutions: more general approaches Hazard Pointers RCU Transactional Memory

Review: Concurrency Topics Improving concurrency with shared memory data structures Improving spin-locks Exponential back-off Time delay Queuing Several improvement techniques drawn from network arbitration schemes (CSMA)

Review: Concurrency Topics Updating the problem statement Synchronization versus Locality Remote memory access versus Messaging Tornado Object-oriented with Clustered Objects Microkernel approach Interprocess communication using Protected Procedure Calls (PPC) Kernel-Kernel communications Remote memory access versus Remote invocation Dependent on cost ratio of remote memory/local memory Psyche: used shared memory for communication channel

The Multikernel: Introduction Today’s computer systems contain more processor cores and increasingly diverse architectures High-performance systems have scaled, but have done so in a system-specific manner Today’s general purpose OS will not be able to scale fast enough to keep up with the new system designs Consider changing OS to distributed system model

Problems of optimizing OS Improving read-write lock on Sun Niagra Exploits shared, banked L2 cache Concurrent writes to shared cache line tracks presence of readers Highly efficient improvement for Sun Niagra Highly inefficient for others –ping-pong between reader caches Windows 7 optimizations Removal of dispatcher lock – a single, global lock Based on NT Kernel code –designed for 32 proc systems Windows 7/Vista originally limited to 64 processor systems Replacement of dispatcher lock with finer-grained locks Windows 7 now scalable to 256 processor systems Customers still asking for 300+ processor support RCU changes for Linux Original RCU API had 7 components which increased to 31 Changes required for the large variety of architecture + workloads Grace period detection limits scalability

Processor Architectural Changes Replacement of single-shared interconnect (front-side bus) Interconnect topology now resembling a message passing network Ring topologies Mesh network topologies

AMD Opteron and Hypertransport Magny-Cours with 12 cores released March 2010 Hypertransport: front-side bus replacement Packet-based serial protocol Multi-processor interconnect for NUMA systems

Intel Nehalem and QPI Nehalem-EX with 8 cores released April 2010 QPI: front-side bus replacement Packet-based serial protocol Multi-processor interconnect for NUMA systems

Shared Memory vs Messaging Duality of message-passing versus shared memory Lauer and Needham (1978) Model selection dependent on hardware architecture Shared memory abstraction is difficult to tune Shared memory Single core update to shared memory in under 30 cycles Sixteen (16) cores updating data requires 12,000 cycles Cores are stalled on cache misses In message passing, use lightweight RPC Scales linearly with number of threads Consider QPI and Hypertransport Messaging abstraction better suited to architecture

Shared Memory vs. Messaging SHM8 Updating 8 cache lines MSG Messaging Note linear scale

Multikernel: updating the OS OS as distributed system with units communicating via messaging Three design principles Make all inter-core communications explicit Make OS structure hardware neutral View state as replicated instead of shared

Barrelfish – multikernel model Make inter-core communications explicit Since hardware behaves as a network, treat OS as a distributed system Asynchronous communications Make OS hardware neutral Use messaging abstraction to avoid extensive optimizations to achieve scalability Focus on optimization of messaging rather than hardware/cache/memory access View state as replicated Maintain state through replication rather than shared memory Provides improved locality

Barrelfish Goals Comparable performance to existing OS Demonstrate evidence of scalability to large number of cores Can be re-targeted to different hardware Can exploit message-passing abstraction to achieve good performance Can exploit modularity of the OS design

Barrelfish System Structure Privileged-mode CPU driver Lightweight RPC for core-to-core communication Alternate optimizations (L4 raw IPC in 420 cycles) Distinguished user-mode monitor Coordinate system state User level applications use thread libraries

Barrelfish CPU Driver Shares no state with other cores Event driven Single-threaded, nonpreemtable Serially processes events Traps from user processes Device interrrupts Interrupts from other cores Implements lightweight, asychronous communication

Barrelfish User-Mode Monitor Coordinate System State Single-core, user-space application supports OS scheduling Handles user-level application requests for state information Memory allocation tables Address space mappings Virtual memory management Inter-core communication through URPC Shared memory used as channel for communication Sender writes message to cache line Receiver polls cache line to read message

Barrelfish Process Structure Collection of dispatcher objects, one for each core the process may execute on Communication is between dispatchers Dispatchers run in user space Shared address space handled in one of two ways Shared hardware page table between dispatchers Replicating hardware page table among dispatchers Thread scheduler on dispatchers Exchange messages to create/unblock threads Migrate threads between dispatchers (and cores)

Barrelfish: Knowledge Engine Service known as System Knowledge Base Populated with information from hardware discovery Online measurements created at boot time Includes pre-defined system knowledge Useful for optimizing communications to a given hardware architecture

Barrelfish Evaluations TLB Shootdown Measure maintenance of TLB consistency by invalidating entries when pages are unmapped Linux/Windows operation Writes operation to a well known location Issue inter-process interrupts to every core that might have a mapping in its TLB (800 cycles/trap) Each core acks the interrupt by writing to shared memory location, invalidates its TLB and resumes

Barrelfish Evaluations TLB Shootdown in Barrelfish Use messages for shootdown Allows optimization of messaging mechanism Broadcast: good for AMD/Hypertransport which is a broadcast network Unicast: good for small number of cores Multicast: good for shared, on-chip L3 cache Two-level tree – send message to 1st core of processor which forwards to other cores as appropriate Newer 4-core AMD/HyperTransport Similar to Intel Nehalem/QPI, 4-core architecture NUMA-Aware Multicast Send messages to highest latency nodes first

Barrelfish Evaluations

Barrelfish Evaluations Unmap latency on 8x4-core AMD Overhead amortized over more efficient multicore communications Barrelfish Fixed LRPC Overhead

Barrelfish Evalutions: Linux

Conclusions Does not beat Linux in performance, but… Limited number of cores in modeling (32) Linux/Windows have had significant investments towards optimizations Barrelfish approach has similarities to large- scale multiprocessor systems like Cray T3