Kernel-Kernel Communication in a Shared-memory Multiprocessor. Eliseu Chaves et al., May 1993. Presented by Tina Swenson, May 27, 2010.



Agenda Introduction Remote Invocation Remote Memory Access RI/RA Combinations Case Study Conclusion

Introduction

Introduction
There is more than one way to handle communication in large shared-memory systems:
◦ Remote memory access  we've studied this a lot!
◦ Remote invocation  message passing
The trade-offs are discussed, and the theories are tested with a case study.

Motivation
The UMA design won't scale; NUMA was seen as the future.
◦ It is implemented in commercial CPUs.
NUMA allows programmers to choose between shared memory and remote invocation.
The authors discuss the trade-offs.

Kernel-Kernel Communication
Each processor has:
◦ The full range of kernel services
◦ Reasonable performance
◦ Access to all memory on the machine
Locality is the key to RI success.
◦ Previous kernel experience shows that most memory accesses tend to be local to the "node".
"...most memory accesses will be local even when using remote memory accesses for interkernel communication, and that the total amount of time spent waiting for replies from other processors when using remote invocation will be small..."

NUMA
NUMA without cache coherence.
Three methods of kernel-kernel communication:
◦ Remote memory access  the operation executes on node i, accessing node j's memory as needed.
◦ Remote invocation  node i's processor sends a message to node j's processor, asking j to perform i's operation.
◦ Bulk data transfer  the kernel moves data from node to node.

Remote Invocation

Remote Invocation (RI)
Instead of moving data around the architecture, move the operations to the data!
Message passing.
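As a rough sketch (not from the paper), remote invocation can be modeled as posting an operation descriptor into the target node's mailbox, which the target processor later drains. The names (`mailbox`, `mailbox_post`, `mailbox_drain`, `bump`) are hypothetical; a real kernel would raise an inter-processor interrupt rather than poll a flag.

```c
/* Hypothetical single-slot mailbox for remote invocation. */
typedef struct {
    void (*op)(void *);   /* operation to run on the remote node */
    void *arg;
} invocation;

typedef struct {
    invocation slot;
    int full;             /* real hardware: signal via inter-processor interrupt */
} mailbox;

static int mailbox_post(mailbox *mb, void (*op)(void *), void *arg) {
    if (mb->full)
        return -1;        /* receiver busy; the caller would retry */
    mb->slot.op = op;
    mb->slot.arg = arg;
    mb->full = 1;
    return 0;
}

static void mailbox_drain(mailbox *mb) {
    if (mb->full) {
        mb->slot.op(mb->slot.arg);  /* op runs against the target node's local memory */
        mb->full = 0;
    }
}

/* Demo operation: increment a counter that lives on the target node. */
static void bump(void *p) { ++*(int *)p; }
```

The point of the structure is that the data never crosses nodes: only the small descriptor does, and the operation itself executes where the data lives.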

Interrupt-Level RI (ILRI)
Fast.
For operations that can be safely executed in an interrupt handler.
Limitations:
◦ Non-blocking operations only (thus no locks)  interrupt handlers lack process context.
◦ Deadlock prevention  severely limits when we can use ILRI.

Process-Level RI (PLRI)
Slower: requires a context switch and possible synchronization with other running processes.
Used for longer operations.
Avoids ILRI's deadlock restrictions because invocations can block.

Remote Memory Access

Memory Considerations
If remote memory access is used, how is it affected by memory consistency models (not covered in this paper)?
◦ Strong consistency models will incur contention.
◦ Weak consistency models widen the cost gap between normal instructions and synchronization instructions
 and require the use of memory barriers.
From Professor Walpole's slides.
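A minimal illustration of the barrier point, using C11 atomics rather than anything NUMA-specific: under a weak consistency model, the writer must publish the data before the flag (release), and the reader must observe the flag before reading the data (acquire). The names `publish` and `consume` are illustrative.

```c
#include <stdatomic.h>

static int data;          /* payload, written before the flag is raised */
static atomic_int ready;  /* publication flag */

static void publish(int v) {
    data = v;
    /* release barrier: the write to data cannot be reordered after this store */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

static int consume(void) {
    /* acquire barrier: the read of data cannot be reordered before this load */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until published */
    return data;
}
```

Without the release/acquire pair, a weakly ordered machine could let the reader see `ready == 1` while still observing a stale value of `data`.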

RI/RA Combinations

Mixing RI/RA ILRI, PLRI and shared memory are compatible, as long as guidelines are followed. “It is easy to use different mechanisms for unrelated data structures.”

Using RA with PLRI
Remote access and process-level remote invocation can be used on the same data structure if:
◦ the synchronization methods are compatible.

Using RA with ILRI
Remote access and interrupt-level remote invocation can be used on the same data structure if:
◦ a hybrid lock is used  interrupt masking AND spin locks.

Using RA with ILRI – Hybrid Lock
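The hybrid-lock figure did not transcribe; here is a minimal user-space sketch of the idea, with interrupt masking simulated by a thread-local flag (a real kernel would use `cli`/`spl`-style primitives, and the type and function names here are hypothetical). Masking interrupts first keeps a local ILRI handler from trying to take the lock the current processor already holds; the spin lock then excludes remote accessors.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag held;   /* spin-lock part: excludes other processors */
} hybrid_lock;

/* Simulated per-CPU interrupt-enable state (stand-in for real masking). */
static _Thread_local bool interrupts_enabled = true;

static void hybrid_acquire(hybrid_lock *l) {
    interrupts_enabled = false;  /* mask ILRIs on this processor first */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin until the remote holder releases */
}

static void hybrid_release(hybrid_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
    interrupts_enabled = true;   /* unmask ILRIs */
}
```
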

Using PLRI and ILRI
PLRI and ILRI can be used on the same data structure if deadlock is avoided:
◦ A processor must always be able to perform incoming invocations while waiting for an outgoing invocation.
◦ Example: a processor cannot make a PLRI with ILRIs blocked in order to access data that is shared by normal and interrupt-level code (from Professor Walpole's slides).

The Costs
Latency.
Impact on local operations.
Contention and throughput.
Whether the mechanism complements or clashes conceptually with the kernel's organization.

Latency
What is the latency difference between RA and RI?
If (R-1)n < C, where R is the remote-to-local memory access time ratio, n is the number of memory accesses the operation performs, and C is the fixed overhead of a remote invocation:
◦ then implement using RA.
If operations require a lot of time:
◦ then implement using RI.
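The inequality can be turned into a tiny decision helper. The function name and the sample numbers in the usage note (the 12:1 Butterfly Plus access-time ratio from the case study, and a guessed RI overhead) are illustrative only.

```c
#include <stdbool.h>

/* (R - 1) * n < C  means the extra cost of making n memory references
   remotely is below the fixed overhead C of one remote invocation,
   so remote access (RA) wins; otherwise remote invocation (RI) wins. */
static bool prefer_remote_access(double R, double n, double C) {
    return (R - 1.0) * n < C;
}
```

With R = 12 and a hypothetical RI overhead of 200 local-access units, a 10-reference operation favors RA ((12-1)*10 = 110 < 200), while a 50-reference operation favors RI (550 > 200).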

Impact on Local Operations
Implicit synchronization:
◦ If PLRI is used for all remote accesses, the data structure is only ever touched from its home node, so local operations can proceed without explicit locks.
◦ This solution depends on there being no preemption.
Explicit synchronization:
◦ Requires atomic instructions, adding cost to local operations even on bus-based nodes.

Contention and Throughput
Operations are serialized at some point!
RI: operations serialize on the processor executing them
◦ even if they share no data.
RA: operations serialize at the memory
◦ if accesses compete for the same lock.

Complement or Clash
Types of kernels:
◦ Procedure-based
 no distinction between user and kernel space
 user programs enter the kernel via traps
 fits RA.
◦ Message-based
 each major kernel resource is its own kernel process
 operations require communication among these kernel processes
 fits RI.

Complement or Clash

Case Study

Psyche on Butterfly Plus
Procedure-based OS
 uses shared memory as the primary kernel communication mechanism.
The authors built in message-based operations
 RI: reorganized code; grouped accesses together, allowing a single RI call.
Non-CC-NUMA
 1 CPU per node
 12:1 remote-to-local access time ratio.

Psyche on Butterfly Plus
High degree of node locality.
RI implemented optimistically.
Spin locks used:
◦ Test-and-test-and-set minimizes latency in the absence of contention; otherwise an atomic instruction is used.
◦ This can be decided on the fly.
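A test-and-test-and-set acquire in C11 atomics, as a sketch of the general technique rather than the Butterfly Plus code: spin read-only on the (locally cached) value, and attempt the expensive atomic operation only when the lock looks free, which keeps coherence and memory traffic down under contention.

```c
#include <stdatomic.h>

static void ttas_acquire(atomic_int *lock) {
    for (;;) {
        /* Test: read-only spin; on a cache-coherent machine this hits
           the local cache and generates no interconnect traffic. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        /* Test-and-set: only now pay for the atomic read-modify-write. */
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;  /* acquired */
    }
}

static void ttas_release(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```
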

Results


Conclusion

Factors Affecting the Choice of RI/RA
Cost of the RI mechanism.
Cost of atomic operations for synchronization.
Ratio of remote to local memory access time.
For cache-coherent machines:
◦ cache line size
◦ false sharing
◦ caching effects reducing the total cost of kernel operations.

Using PLRI, ILRI, and RA
PLRI
◦ Use it once an operation's cost surpasses what ILRI can handle.
◦ Must consider latency, throughput, and the appeal of eliminating explicit synchronization.
ILRI
◦ Node locality is hugely important.
◦ Use it for low-latency operations when you can't do RA.
◦ Use it when the remote node is idle.
 The authors used ILRI for console I/O, kernel debugging, and TLB shootdown.

Observations
On the Butterfly Plus:
◦ ILRI was fast.
◦ Explicit synchronization is costly.
◦ Remote references are much more expensive than local references.
◦ Except for short operations, RI had lower latency; RI might have lower throughput.

Conclusions?
Careful design is required for OSs to scale on modern hardware!
◦ Which means you had better understand the effects of your underlying hardware.
Keep communication to a minimum no matter what solution is used.
Where has the mixing of RI/RA gone?
◦ Monday's paper, for one.
◦ What else?
ccNUMA is in widespread use.
◦ How is RI/RA affected?

Thank You