CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach.

Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology. Appears in HPDC 1997. Presented by: Lei Yang

2 Background
Multiprocessor-based system models
– Parallel vector processor (PVP)
– Symmetric multiprocessor (SMP)
– Massively parallel processor (MPP)
– Distributed shared memory machine
– Cluster of workstations (COW)
COW features
– Each node is a complete workstation minus peripherals (monitor, keyboard, mouse, …)
– Nodes are connected through a commodity network, e.g., Ethernet, FDDI, or an ATM switch
– A complete OS resides on each node

3 Motivation
Problems with COWs
– Communication software performance inherently fails to scale along with host CPU performance
– High communication overhead: software overhead (the time required to prepare and authenticate a message) is significantly higher than hardware overhead (network setup and message propagation time)
Coprocessors on the network interface
– Available on Myrinet and ATM adapters
– But what should coprocessors do to minimize communication overheads?

4 Motivation
The critical step is reducing host communication overheads, rather than network latency. Why?
– Many existing parallel applications are designed to hide network latencies
– Multithreaded applications typically cannot benefit much from pushing network latency below the cost of several user-level thread context switches
– In a cluster, unlike in a parallel machine, the schedulers of distinct nodes are only loosely synchronized, so cooperating application threads drift apart by highly dynamic offsets on the order of tens of microseconds

5 The VCM approach
VCM – Virtual Communication Machine
– Lets applications set up a customized, lightweight communication path between their address spaces and the "wire"
Goal
– Reduce software communication overheads
How
– Transfer selected communication-related processing from the host CPU(s) to the network coprocessor
– A low-level abstraction between applications and the coprocessor
– Applications interact directly with the VCM: the complexity is hidden by a user-level library, and the usual protection is provided by a kernel extension
– VCM and applications operate asynchronously
– VCM and applications communicate through shared memory

6 VCM features
The "intelligent network interface" is the VCM
– (The name was changed in a later journal version)
The VCM has an active role
– Access to the application's address space
– Extensions to shared-memory applications
Zero-copy messaging is available at both the sending and the receiving end
Communication-related processing can be transferred to the network coprocessor
Buffer pages are managed by the application
– The application knows its own behavior best
Multiple VCMs are supported on each host
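The combination of zero-copy messaging and application-managed buffer pages can be illustrated with a small sketch. All names below (FakeCoprocessor, register_pages, deliver) are illustrative, not the paper's actual API: the application hands its own buffer pages to a simulated coprocessor, which deposits incoming data directly into them with no intermediate copy by the host CPU.

```python
# Sketch of zero-copy receive with application-managed buffers.
# The names and API here are hypothetical, for illustration only.
class FakeCoprocessor:
    def __init__(self):
        self.registered = {}              # buffer id -> view of app pages

    def register_pages(self, buf_id, buf):
        # The application, which knows its access pattern best,
        # chooses which buffer to hand to the network interface.
        self.registered[buf_id] = memoryview(buf)

    def deliver(self, buf_id, payload):
        # "DMA" straight into the application's data structure;
        # messages longer than the buffer are truncated to its size.
        view = self.registered[buf_id]
        n = min(len(payload), len(view))
        view[:n] = payload[:n]
        return n

nic = FakeCoprocessor()
app_buf = bytearray(8)                    # lives in the app's address space
nic.register_pages("rx0", app_buf)
nic.deliver("rx0", b"hello")
# app_buf now holds the message; the host CPU never copied it
```

The memoryview makes the zero-copy property concrete: the coprocessor writes through a view of the application's own bytearray rather than into a private staging buffer.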

7 VCM Architecture
The coprocessor is responsible for
– Ensuring data integrity
– Assembling/disassembling messages directly from/into an application's data structures
– Multiplexing/demultiplexing network messages
– Enforcing protection
Three components
– The Virtual Communication Machine, implemented on the network coprocessor
– A kernel extension module, for address-space management and protection
– A user-level library, which hides from applications the complexity of interacting with the VCM and the kernel extension

8 Application–VCM interaction
An application accesses a VCM by registering
– Registration extends the shared memory space with a VCM command area
– Application and VCM interact via the command area
– Program and instruction completion is signaled through status words placed in the command area
Asynchronous operation
– The coprocessor polls for new programs to execute
– The host CPU(s) check for program and instruction completion by polling the status words
– Data transfers are performed only by the coprocessor
Performance improvements
– Loop interaction: bursty invocations with many identical parameters
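The polling protocol above can be modeled in a few lines. This is a sketch under assumed names (CommandArea, host_post, coprocessor_poll are not from the paper): the host writes a small batch of instructions into a shared slot and marks it pending; the coprocessor polls the slots, executes pending programs, and flips the per-slot status word, which the host in turn polls for completion instead of taking an interrupt.

```python
# Hypothetical model of the VCM command area and status-word polling.
PENDING, RUNNING, DONE = 0, 1, 2

class CommandArea:
    """Shared-memory region mapped into both host and coprocessor."""
    def __init__(self, slots=4):
        self.programs = [None] * slots    # instruction batches from the host
        self.status = [DONE] * slots      # per-slot status words

def host_post(area, slot, instructions):
    # Host CPU: write the program first, then mark the slot pending.
    area.programs[slot] = instructions
    area.status[slot] = PENDING

def coprocessor_poll(area):
    # Coprocessor: scan the slots for new programs and execute them.
    results = {}
    for slot, st in enumerate(area.status):
        if st == PENDING:
            area.status[slot] = RUNNING
            results[slot] = [f"executed {insn}" for insn in area.programs[slot]]
            area.status[slot] = DONE      # completion signaled via status word
    return results

area = CommandArea()
host_post(area, 0, ["send buf0 1024B", "send buf1 1024B"])
out = coprocessor_poll(area)
# The host learns of completion by reading area.status[0], not via interrupts
```

Posting a whole batch per slot also hints at why "loop interaction" pays off: a bursty sequence of near-identical invocations amortizes one slot write over many instructions.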

9 Command Area

10 Implementation
Platform
– Cluster of Sun UltraSPARC I Model 170 workstations
– Solaris 2.5
– FORE SBA-200E ATM network cards, each with a 25 MHz i960 microprocessor

11 Implementation
VCM interpreter
– Runs on the coprocessor
– Services requests in order: protection-related instructions, then VCM programs, then loop instructions, then incoming data
Protection and buffer-page management
– The VCM accepts protection-management instructions only from the kernel or from the connection server
– The VCM checks the correctness of all parameters received from an application
– Messages longer than expected are truncated to the size of the receiving buffer
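The interpreter's servicing order and its trust check can be sketched together. The names below (service_order, the TRUSTED set) are illustrative: pending requests are filtered so that protection instructions are honored only from trusted sources, then sorted into the fixed priority order given above.

```python
# Sketch of the VCM interpreter's request ordering and protection check.
# Request kinds and the trust model follow the slide; names are made up.
PRIORITY = {"protection": 0, "program": 1, "loop": 2, "incoming": 3}
TRUSTED = {"kernel", "connection_server"}

def service_order(requests):
    """requests: list of (kind, source) tuples pending at the interpreter."""
    # Drop protection instructions from untrusted sources (e.g. applications).
    accepted = [r for r in requests
                if r[0] != "protection" or r[1] in TRUSTED]
    # Service what remains in the fixed priority order.
    return sorted(accepted, key=lambda r: PRIORITY[r[0]])

pending = [("incoming", "net"), ("protection", "app"),
           ("loop", "app"), ("protection", "kernel"), ("program", "app")]
ordered = service_order(pending)
# The kernel's protection request runs first; the application's is dropped
```

Because Python's sort is stable, requests of the same kind keep their arrival order, which matches the intuition of a polling interpreter draining queues by priority.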

12 Implementation VCM instruction set

13 Evaluation
Microbenchmarks
Synthetic client/server application
– Ten client workstations issue back-to-back data requests to the server workstation
Traveling Salesman Problem (TSP)
Georgia Tech Time Warp (GTW)
– A parallel kernel for discrete-event simulation
– PHold, a synthetic application
– PCS, a wireless network simulation

14 Performance – Microbenchmarks
– Latency is linear in message size
– The maximum send rate approaches the maximum data capacity of the wire

15 Performance - Client/server application Outgoing bandwidth of the server as a function of the request size, when the server uses one or two interfaces.

16 Performance – TSP

17 Performance – PHold

18 Performance – PCS

19 Limitations
Requires special hardware: a network adapter card equipped with
– A network coprocessor
– A few megabytes of fast memory
– One or more DMA engines under the control of the coprocessor
– Network-specific hardware to assist with performance-critical processing (e.g., CRC)
How hard is it to port shared-memory applications to a VCM-based COW?

20 Conclusion
Host communication overhead is the crucial factor
VCM
– Flexible integration between network and application
– Low overhead on the host processor
– Latency and bandwidth close to the hardware limits
– Enables zero-copy messaging
– Allows porting of certain shared-memory parallel applications to a VCM-based COW
Performance is attractive, and the contribution is valuable