6/28/98 SPAA/PODC 1: High-Performance Clusters part 2: Generality
David E. Culler, Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998

6/28/98 SPAA/PODC 2: What's Different about Clusters?
Commodity parts?
Communications packaging?
Incremental scalability?
Independent failure?
Intelligent network interfaces?
Fast, scalable communication?
=> Complete system on every node
–virtual memory
–scheduler
–file system
–...

6/28/98 SPAA/PODC 3: Topics: Part 2
Virtual Networks
–communication meets virtual memory
Scheduling
Parallel I/O
Clusters of SMPs
VIA

6/28/98 SPAA/PODC 4: General Purpose Requirements
Many timeshared processes
–each with direct, protected access
User and system
Client/server, parallel clients, parallel servers
–they grow, shrink, handle node failures
Multiple packages in a process
–each may have its own internal communication layer
Use communication as easily as memory

6/28/98 SPAA/PODC 5: Virtual Networks
An endpoint abstracts the notion of "attached to the network"
A virtual network is a collection of endpoints that can name each other
Many processes on a node can each have many endpoints, each with its own protection domain

6/28/98 SPAA/PODC 6: How Are They Managed?
How do you get direct hardware access for performance with a large space of logical resources?
Just like virtual memory
–the active portion of the large logical space is bound to physical resources
[Figure: processes 1..n on a node, each with endpoints backed either by host memory behind the processor or by NIC memory behind the network interface]

6/28/98 SPAA/PODC 7: Endpoint Transition Diagram
[State diagram: COLD (paged host memory) <-> WARM (R/O paged host memory) <-> HOT (R/W NIC memory); transitions labeled Read, Evict, Swap, Write, Msg Arrival]
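
A minimal sketch of how a driver might track these states; the struct layout and helper functions are illustrative assumptions, not the Berkeley NOW driver code:

    /* Hypothetical endpoint state machine; nic_* and copy_* are assumed helpers. */
    enum ep_state { EP_COLD, EP_WARM, EP_HOT };

    extern void *nic_alloc_frame(void);          /* may evict another endpoint */
    extern void  nic_free_frame(void *frame);
    extern void  copy_to_nic(void *nic, void *host);
    extern void  copy_from_nic(void *host, void *nic);

    struct endpoint {
        enum ep_state state;
        void *host_frame;   /* pageable backing store in host memory    */
        void *nic_frame;    /* frame in NIC memory when HOT, else NULL  */
    };

    /* Write or message arrival promotes toward HOT: bind to a NIC frame. */
    static void ep_touch_rw(struct endpoint *ep)
    {
        if (ep->state != EP_HOT) {
            ep->nic_frame = nic_alloc_frame();
            copy_to_nic(ep->nic_frame, ep->host_frame);
            ep->state = EP_HOT;
        }
    }

    /* Swap demotes under pressure: frame contents return to host memory. */
    static void ep_swap_out(struct endpoint *ep)
    {
        if (ep->state == EP_HOT) {
            copy_from_nic(ep->host_frame, ep->nic_frame);
            nic_free_frame(ep->nic_frame);
            ep->nic_frame = NULL;
            ep->state = EP_WARM;   /* still readable from host memory */
        }
    }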

6/28/98 SPAA/PODC 8: Network Interface Support
NIC has endpoint frames
Services active endpoints
Signals misses to the driver
–using a system endpoint
[Figure: NIC with endpoint frames 0..7, transmit and receive paths, and an endpoint-miss path to the driver]

6/28/98 SPAA/PODC 9: Solaris System Abstractions
Segment driver: manages portions of an address space
Device driver: manages an I/O device
Virtual network driver

6/28/98 SPAA/PODC 10: LogP Performance
Competitive latency
Increased NIC processing
Difference mostly:
–ack processing
–protection check
–data structures
–code quality
Virtualization cheap
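
As a reminder of the metric being reported: in the LogP model (latency L, per-message processor overhead o, gap g, P processors), a short message costs, ignoring contention:

    t_{one-way} = L + 2o, \qquad t_{round-trip} = 2(L + 2o) = 2L + 4o

so the ack processing and protection checks above show up directly in o.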

6/28/98 SPAA/PODC 11: Bursty Communication among Many
[Figure: clients issue a burst of messages to a server, which performs work per message]

6/28/98 SPAA/PODC 12: Multiple VNs, Single-threaded Server

6/28/98 SPAA/PODC 13: Multiple VNs, Multithreaded Server

6/28/98 SPAA/PODC 14: Perspective on Virtual Networks
Networking abstractions are vertical stacks
–new function => new layer
–poke through for performance
Virtual networks provide a horizontal abstraction
–basis for building new, fast services
Open questions
–What is the communication "working set"?
–What placement, replacement, ...?

6/28/98 SPAA/PODC 15: Beyond the Personal Supercomputer
Able to timeshare parallel programs
–with fast, protected communication
Mix with sequential and interactive jobs
Use fast communication in OS subsystems
–parallel file system, network virtual memory, ...
Nodes have a powerful, local OS scheduler
Problem: local schedulers do not know to run parallel jobs in parallel

6/28/98 SPAA/PODC 16: Local Scheduling
Schedulers act independently, without global control
A program waits while trying to communicate with peers that are not running
Result: large slowdowns for fine-grain programs!
=> need coordinated scheduling

6/28/98 SPAA/PODC 17: Explicit Coscheduling
Global context switch according to a precomputed schedule
How do you build it? Does it work?

6/28/98 SPAA/PODC 18: Typical Cluster Subsystem Structures
[Figure: two organizations built from applications (A), local services (LS), and global services (GS) communicating across nodes: a master-slave structure with a central master, and a peer-to-peer structure with a GS component on every node]

6/28/98 SPAA/PODC 19: Ideal Cluster Subsystem Structure
Obtain coordination without explicit subsystem interaction, only through the events in the program
–very easy to build
–potentially very robust to component failures
–inherently "service on demand"
–scalable
Local service component can evolve
[Figure: per-node A/LS/GS components coordinating implicitly through program events]

6/28/98 SPAA/PODC 20: Three Approaches Examined in NOW
GLUNIX: explicit master-slave (user level)
–matrix algorithm to pick the PP (parallel program)
–uses stops & signals to try to force the desired PP to run
Explicit peer-to-peer scheduling assist with VNs
–coscheduling daemons decide on the PP and kick the Solaris scheduler
Implicit
–modify the parallel run-time library so that it gets itself coscheduled with the standard scheduler
[Figure: the three structures, from master-slave through peer-to-peer daemons to fully implicit]

6/28/98 SPAA/PODC 21: Problems with Explicit Coscheduling
Implementation complexity
Need to identify parallel programs in advance
Interacts poorly with interactive use and load imbalance
Introduces new potential faults
Scalability

6/28/98 SPAA/PODC 22: Why Implicit Coscheduling Might Work
Active message request-reply model
Infer non-local state from local observations; react to maintain coordination

    observation         implication               action
    fast response       partner scheduled         spin
    delayed response    partner not scheduled     block

[Figure: four workstations timesharing Jobs A and B; a requester spins while responses come back quickly, sleeps when a response is delayed]

6/28/98 SPAA/PODC 23: Obvious Questions
Does it work?
How long do you spin?
What are the requirements on the local scheduler?

6/28/98 SPAA/PODC 24: How Long to Spin?
Answer: round-trip time + context switch + message processing
–round-trip to stay scheduled together
–plus wake-up to get scheduled together
–keep spinning if serving messages
»interval of 3 x wake-up
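
A minimal sketch of the spin-then-block wait this rule implies; the constants and helper functions are illustrative assumptions, not the NOW runtime interface:

    #include <stdint.h>

    /* Assumed helpers from the messaging layer / OS. */
    extern int      poll_reply(void);        /* nonzero once the reply arrives    */
    extern int      poll_and_serve(void);    /* serve incoming msgs; nonzero if any served */
    extern uint64_t now_usec(void);
    extern void     block_until_reply(void); /* sleep; woken on message arrival   */

    /* Illustrative costs, following the slide's rule of thumb. */
    #define RTT_USEC        30   /* round-trip time                */
    #define WAKEUP_USEC     50   /* context switch + wake-up       */
    #define MSG_PROC_USEC   10   /* per-message processing         */

    void wait_for_reply(void)
    {
        uint64_t start = now_usec();
        uint64_t limit = RTT_USEC + WAKEUP_USEC + MSG_PROC_USEC;

        while (now_usec() - start < limit) {
            if (poll_reply())
                return;                /* fast response: partner scheduled, stay on CPU */
            if (poll_and_serve())
                start = now_usec();    /* serving messages: extend the spin interval    */
        }
        block_until_reply();           /* delayed response: partner likely descheduled  */
    }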

6/28/98 SPAA/PODC 25: Does It Work?

6/28/98 SPAA/PODC 26: Synthetic Bulk-Synchronous Apps
Range of granularity and load imbalance
–spin wait: 10x slowdown

6/28/98 SPAA/PODC 27: With a Mixture of Reads
Block-immediate: 4x slowdown

6/28/98 SPAA/PODC 28: Timesharing Split-C Programs

6/28/98 SPAA/PODC 29: Many Questions
What about
–mix of jobs?
–sequential jobs?
–unbalanced placement?
–fairness?
–scalability?
How broadly can implicit coordination be applied in the design of cluster subsystems?
Can resource management be completely decentralized?
–computational economies, ecologies

6/28/98 SPAA/PODC 30: A Look at Serious File I/O
[Figure: traditional I/O system vs. NOW I/O system, each node a processor-memory (P-M) pair with local disks]
Benchmark problem: sort a large number of 100-byte records with 10-byte keys
–start on disk, end on disk
–accessible as files (use the file system)
–Datamation sort: 1 million records
–Minute sort: as many records as possible in a minute

6/28/98 SPAA/PODC 31: NOW-Sort Algorithm
Read
–N/P records from disk -> memory
Distribute
–scatter keys to the processors holding the result buckets
–gather keys from all processors
Sort
–partial radix sort on each bucket
Write
–write records to disk (2-pass version: gather data runs onto disk, then local, external merge sort)
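
A skeleton of the per-node phases above (one-pass version); the I/O and communication helpers are hypothetical stand-ins for the NOW-Sort runtime, not its real interface:

    #define KEY_BYTES 10
    #define REC_BYTES 100

    typedef struct {
        unsigned char key[KEY_BYTES];
        unsigned char data[REC_BYTES - KEY_BYTES];
    } record_t;

    extern record_t *read_records_from_disk(int node, long n);
    extern int  bucket_of(const unsigned char *key, int P);  /* key-range owner */
    extern void scatter_key(int dest, const record_t *r);    /* send to owner   */
    extern record_t *gather_keys(long *n_out);               /* receive bucket  */
    extern void radix_sort_bucket(record_t *r, long n);      /* partial radix   */
    extern void write_records_to_disk(int node, const record_t *r, long n);

    void now_sort_node(int me, int P, long N)
    {
        long n_local = N / P;                                  /* Read       */
        record_t *recs = read_records_from_disk(me, n_local);

        for (long i = 0; i < n_local; i++)                     /* Distribute */
            scatter_key(bucket_of(recs[i].key, P), &recs[i]);

        long n_bucket;
        record_t *bucket = gather_keys(&n_bucket);

        radix_sort_bucket(bucket, n_bucket);                   /* Sort       */
        write_records_to_disk(me, bucket, n_bucket);           /* Write      */
    }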

6/28/98 SPAA/PODC 32: Key Implementation Techniques
Performance isolation: highly tuned local disk-to-disk sort
–manage local memory
–manage disk striping
–memory-mapped I/O with madvise, buffering
–manage overlap with threads
Efficient communication
–completely hidden under disk I/O
–competes for I/O-bus bandwidth
Self-tuning software
–probe available memory, disk bandwidth, trade-offs
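
For instance, memory-mapped I/O with access-pattern advice looks roughly like this (a generic POSIX sketch, not the NOW-Sort source):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map an input run and tell the VM system we will read it sequentially,
     * so read-ahead is aggressive and pages can be dropped once consumed. */
    void *map_run(const char *path, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                      /* the mapping stays valid after close */
        if (p == MAP_FAILED) return NULL;
        madvise(p, len, MADV_SEQUENTIAL);
        return p;
    }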

6/28/98 SPAA/PODC 33: World-Record Disk-to-Disk Sort
Sustains 500 MB/s of disk bandwidth and 1,000 MB/s of network bandwidth
–but only in the wee hours of the morning

6/28/98 SPAA/PODC 34: Towards a Cluster File System
Remote disk system built on a virtual network
[Figure: client linked with RDlib talking to an RD server over active messages]
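
A sketch of a remote-disk read in active-message style; am_request, am_reply, the handler IDs, and wait_for_reply are generic stand-ins, not the actual RDlib interface:

    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_XFER 8192
    enum { RD_READ_HANDLER, RD_DATA_HANDLER };

    extern int  local_disk_fd;
    extern void am_request(int node, int handler, void *arg, long off, int len);
    extern void am_reply(int node, int handler, void *dst, void *src, int len);
    extern void wait_for_reply(void);

    /* Client: ask the RD server for len bytes at offset. */
    void rd_read(int server, void *buf, long offset, int len)
    {
        am_request(server, RD_READ_HANDLER, buf, offset, len);
        wait_for_reply();               /* spin-then-block, as on slide 24 */
    }

    /* Server: handler runs on message arrival, reads the local disk,
     * and replies with the data targeted at the client's buffer. */
    void rd_read_handler(int client, void *client_buf, long offset, int len)
    {
        char tmp[MAX_XFER];
        if (len > MAX_XFER) len = MAX_XFER;
        pread(local_disk_fd, tmp, (size_t)len, (off_t)offset);
        am_reply(client, RD_DATA_HANDLER, client_buf, tmp, len);
    }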

6/28/98 SPAA/PODC 35: Streaming Transfer Experiment

6/28/98 SPAA/PODC 36: Results
Data distribution affects resource utilization, not delivered bandwidth

6/28/98 SPAA/PODC 37: I/O Bus Crossings

6/28/98 SPAA/PODC 38: Opportunity: PDISK
Producers dump data into an I/O "river"; consumers pull it out
Hash data records across disks
Match producers to consumers
Integrated with work scheduling
Fast communication: remote queues
Fast I/O: streaming disk queues
[Figure: groups of processors (producers and consumers) connected through the river of streaming disk queues]

6/28/98 SPAA/PODC 39: What Will Be the Building Block?
[Figure: design space of processors per node vs. number of nodes: SMPs (processors on a memory interconnect), NOWs (single-processor nodes in a network cloud), and clusters of SMPs (SMPs joined by network cards)]

6/28/98 SPAA/PODC 40: Multi-Protocol Communication
A uniform programming model is key
Multiprotocol messaging
–careful layout of message queues
–concurrent objects
–polling the network hurts memory
Shared virtual memory
–relies on underlying messages
Polling vs. contention
[Figure: send/write and receive/read paths through both shared memory and the network communication layer]
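
A hedged sketch of what a multi-protocol poll looks like: check peers in the same SMP via shared memory first, then poll the NIC. The queue types and functions are hypothetical, not the Berkeley AM layer:

    typedef struct msg msg_t;
    typedef struct shm_queue shm_queue_t;
    typedef struct nic_ep nic_ep_t;

    extern shm_queue_t my_shm_rx;    /* shared-memory queue from SMP peers  */
    extern nic_ep_t    my_endpoint;  /* virtual-network endpoint on the NIC */
    extern int  shm_queue_pop(shm_queue_t *q, msg_t **m);
    extern int  nic_poll(nic_ep_t *ep, msg_t **m);
    extern void dispatch(msg_t *m);

    int poll_once(void)
    {
        msg_t *m;
        if (shm_queue_pop(&my_shm_rx, &m)) {  /* intra-SMP path: cheap loads      */
            dispatch(m);
            return 1;
        }
        if (nic_poll(&my_endpoint, &m)) {     /* inter-node path: crosses I/O bus */
            dispatch(m);
            return 1;
        }
        return 0;  /* empty poll; frequent empty polls cost memory bandwidth */
    }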

6/28/98 SPAA/PODC 41: LogP Analysis of Shared-Memory AM

6/28/98 SPAA/PODC 42: Virtual Interface Architecture
[Figure: the VIA stack. Applications (sockets, MPI, legacy code, etc.) sit on the VI user agent ("libvia"). Slow-path operations (open, connect, map memory) go through the VIA kernel driver; fast-path descriptor reads and writes go directly to the VI-capable NIC. Each VI has send (S), receive (R), and completion (COMP) queues, with doorbells notifying the NIC of posted requests.]

6/28/98 SPAA/PODC 43: VIA Implementation Overview
[Figure: host memory holds descriptor queues and Tx/Rx data buffers (kernel memory mapped into the application); the NIC exposes mapped doorbell pages. Numbered steps from the original figure: 1 Write, 2 DMAReq, 3 DMARd, 4 DMAReq, 5 DMARd, 7 DMAWrt; that is, the host writes a descriptor and rings the doorbell, the NIC DMA-reads the descriptor and then the data, and on the receive side the NIC DMA-writes into a posted buffer.]
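
A rough sketch of the user-level send path this figure implies; the descriptor layout and doorbell encoding are invented for illustration, not the actual VIPL wire format:

    #include <stdint.h>

    /* Illustrative VIA-style send descriptor. */
    typedef struct {
        uint64_t next;            /* link to next descriptor in the queue    */
        uint64_t buf_addr;        /* registered data buffer (virtual addr)   */
        uint32_t buf_len;
        uint32_t mem_handle;      /* protection tag from memory registration */
        volatile uint32_t status; /* NIC writes completion status here       */
    } via_desc_t;

    /* Post a send: (1) write the descriptor in host memory, then ring the
     * memory-mapped doorbell so the NIC DMA-reads the descriptor (steps 2-3)
     * and then the data (steps 4-5). */
    void via_post_send(via_desc_t *dq, int slot,
                       volatile uint32_t *doorbell,
                       void *buf, uint32_t len, uint32_t mh)
    {
        via_desc_t *d = &dq[slot];
        d->buf_addr   = (uint64_t)(uintptr_t)buf;
        d->buf_len    = len;
        d->mem_handle = mh;
        d->status     = 0;
        __sync_synchronize();        /* descriptor visible before the doorbell */
        *doorbell = (uint32_t)slot;  /* user-level write to the NIC-mapped page */
    }

Note how the protection check moves into the NIC: the mem_handle must match the region registered through the kernel on the slow path.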

6/28/98 SPAA/PODC 44: Current VIA Performance

6/28/98 SPAA/PODC 45: VIA Ahead
You will be able to buy decent clusters
Virtualization in host memory is easy
–will it go beyond pinned regions?
–still need to manage active endpoints (doorbells)
Complex descriptor queues will hinder low-latency short messages
–NICs will chew on them, but many instructions on the host
Need to re-examine where error handling, flow control, and retry are performed
Interactions with scheduling, I/O, locking, etc. will dominate application speed-up
–will demand new development methodologies

6/28/98 SPAA/PODC 46: Conclusions
A complete system on every node makes clusters a very powerful architecture
–can finally get serious about I/O
Extend the system globally
–virtual memory systems,
–schedulers,
–file systems, ...
Efficient communication enables new solutions to classic systems challenges
Opens a rich set of issues for parallel processing beyond the personal supercomputer or LAN
–where SPAA and PODC meet