1 MPI-2: Extending the Message-Passing Interface Rusty Lusk Argonne National Laboratory

2 Outline l Background l Review of strict message-passing model l Dynamic Process Management –Dynamic process startup –Dynamic establishment of connections l One-sided communication –Put/get –Other operations l Miscellaneous MPI-2 features –Generalized requests –Bindings for C++/Fortran-90; interlanguage issues l Parallel I/O

3 Reaction to MPI-1 l Initial public reaction: –It’s too big! –It’s too small! l Implementations appeared quickly –Freely available implementations (MPICH, LAM, CHIMP) helped expand the user base –MPP vendors (IBM, Intel, Meiko, HP-Convex, SGI, Cray) found they could get high performance from their machines with MPI. l MPP users: –quickly added MPI to the set of message-passing libraries they used; –gradually began to take advantage of MPI capabilities. l MPI became a requirement in procurements.

4 OSC Users Poll Results l Diverse collection of users l All MPI functions in use, including “obscure” ones. l Extensions requested: –parallel I/O –process management –connecting to running processes –put/get, active messages –interrupt-driven receive –non-blocking collectives –C++ bindings –threads, odds and ends

5 MPI-2 Origins l Began meeting in March 1995, with –veterans of MPI-1 –new vendor participants (especially Cray and SGI, and Japanese manufacturers) l Goals: –Extend computational model beyond message-passing –Add new capabilities –Respond to user reaction to MPI-1 l MPI-1.1 released in June, 1995 with MPI-1 repairs, some bindings changes l MPI-1.2 and MPI-2 released July, 1997

6 Contents of MPI-2 l Extensions to the message-passing model –Dynamic process management –One-sided operations –Parallel I/O l Making MPI more robust and convenient –C++ and Fortran 90 bindings –External interfaces, handlers –Extended collective operations –Language interoperability –MPI interaction with threads

7 Intercommunicators l Contain a local group and a remote group l Point-to-point communication is between a process in one group and a process in the other. l Can be merged into a normal (intra) communicator. Created by MPI_Intercomm_create in MPI-1. l Play a more important role in MPI-2, created in multiple ways.

8 Intercommunicators l In MPI-1, created out of separate intracommunicators. l In MPI-2, created by partitioning an existing intracommunicator. l In MPI-2, the intracommunicators may come from different MPI_COMM_WORLDs. (Figure: a local group and a remote group, with sends from processes in one group to processes in the other.)

9 Dynamic Process Management l Issues –maintaining simplicity, flexibility, and correctness –interaction with operating system, resource manager, and process manager –connecting independently started processes l Spawning new processes is collective, returning an intercommunicator. –Local group is group of spawning processes. –Remote group is group of new processes. –New processes have own MPI_COMM_WORLD. –MPI_Comm_get_parent lets new processes find parent communicator.

10 Spawning New Processes (Figure: the parents call MPI_Spawn over any of their communicators and the children call MPI_Init; the result is a new intercommunicator, which the children see as the parent intercommunicator alongside their own MPI_COMM_WORLD.)

11 Spawning Processes MPI_Comm_spawn(command, argv, numprocs, info, root, comm, intercomm, errcodes) Tries to start numprocs processes running command, passing them command-line arguments argv. The operation is collective over comm. Spawnees are in the remote group of intercomm. Errors are reported on a per-process basis in errcodes. The info argument can optionally specify hostname, archname, wdir, path, file, and softness.
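A minimal parent-side sketch of this call (the executable name "worker", the process count of 4, and the lack of error checking are assumptions for illustration):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm workers;
        int errcodes[4];

        MPI_Init(&argc, &argv);
        /* Collective over MPI_COMM_WORLD; starts 4 copies of "worker".     */
        /* The remote group of the returned intercommunicator is the set of */
        /* spawned processes; per-process errors come back in errcodes.     */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, errcodes);
        /* ... communicate with the workers through the intercommunicator ... */
        MPI_Comm_free(&workers);
        MPI_Finalize();
        return 0;
    }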

12 Spawning Multiple Executables l MPI_Comm_spawn_multiple(... ) Arguments command, argv, numprocs, info all become arrays. Still collective

13 In the Children MPI_Init (only MPI programs can be spawned) MPI_COMM_WORLD consists of the processes spawned with one call to MPI_Comm_spawn. MPI_Comm_get_parent obtains the parent intercommunicator. –Same as the intercommunicator returned by MPI_Comm_spawn in the parents. –Remote group is the spawners. –Local group is those spawned.
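A matching child-side sketch, assuming the children were started by the spawn call sketched earlier:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent;

        MPI_Init(&argc, &argv);
        /* MPI_COMM_WORLD here contains only the processes from this spawn call. */
        MPI_Comm_get_parent(&parent);
        if (parent == MPI_COMM_NULL) {
            /* Started directly rather than spawned. */
        } else {
            /* parent is an intercommunicator; its remote group is the spawners. */
        }
        MPI_Finalize();
        return 0;
    }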

14 Manager-Worker Example l Single manager process decides how many workers to create and which executable they should run. l Manager spawns n workers, and addresses them as 0, 1, 2,..., n-1 in new intercomm. Workers address each other as 0, 1,... n-1 in MPI_COMM_WORLD, address manager as 0 in parent intercomm. l One can find out how many processes can usefully be spawned.
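A fragment illustrating the addressing, assuming the intercommunicators workers (in the manager) and parent (in a worker) from the sketches above, an integer work item, and a worker index i:

    int work;   /* a work item; how it is produced is outside this sketch */

    /* Manager: send a work item to worker i through the spawn intercommunicator. */
    MPI_Send(&work, 1, MPI_INT, i, 0, workers);

    /* Worker: receive from the manager, which is rank 0 of the parent intercomm. */
    MPI_Recv(&work, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);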

15 Establishing Connections l Two sets of MPI processes may wish to establish connections, e.g., –Two parts of an application started separately. –A visualization tool wishes to attach to an application. –A server wishes to accept connections from multiple clients. Both server and client may be parallel programs. l Establishing connections is collective but asymmetric (“Client”/“Server”). l Connection results in an intercommunicator.

16 Establishing Connections Between Parallel Programs (Figure: MPI_Accept in the server and MPI_Connect in the client together produce a new intercommunicator.)

17 Connecting Processes l Server: –MPI_Open_port( info, port_name ) »system supplies port_name » might be host:num; might be low-level switch # –MPI_Comm_accept( port_name, info, root, comm, intercomm ) »collective over comm »returns intercomm; remote group is clients l Client: –MPI_Comm_connect( port_name, info, root, comm, intercomm ) »remote group is server
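A minimal sketch of both sides, with error handling omitted and the assumption that the client learns the port name out of band (e.g., from a file, or via the name service on the next slide):

    /* Server side */
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Open_port(MPI_INFO_NULL, port_name);   /* the system chooses the port */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
    /* ... exchange messages with the client over the intercommunicator ... */
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port_name);

    /* Client side (port_name obtained out of band) */
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm server;

    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);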

18 Optional Name Service l MPI_Publish_name( service_name, info, port_name ) l MPI_Lookup_name( service_name, info, port_name ) Together these tie a service_name known to users to the system-supplied port_name, so that clients can connect by name.
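A sketch using a made-up service name "my-service"; the server publishes the port name it obtained from MPI_Open_port, and the client looks it up instead of learning it out of band:

    /* Server: publish the port name under a well-known service name. */
    MPI_Publish_name("my-service", MPI_INFO_NULL, port_name);
    /* ... accept connections as before ... */
    MPI_Unpublish_name("my-service", MPI_INFO_NULL, port_name);

    /* Client: look up the port name, then connect as before. */
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Lookup_name("my-service", MPI_INFO_NULL, port_name);
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);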

19 Bootstrapping l MPI_Join( fd, intercomm ) l collective over two processes connected by a socket. fd is a file descriptor for an open, quiescent socket. intercomm is a new intercommunicator. l Can be used to build up full MPI communication. fd is not used for MPI communication.
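In the published MPI-2 standard this call appears as MPI_Comm_join; a minimal sketch, assuming fd is a connected socket obtained with ordinary socket calls:

    int fd;              /* an open, quiescent socket, e.g. from connect()/accept() */
    MPI_Comm intercomm;

    /* Collective over the two processes at either end of the socket. */
    MPI_Comm_join(fd, &intercomm);
    /* From here on, intercomm carries MPI traffic; fd itself is not used by MPI. */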

20 One-Sided Operations: Issues l Balancing efficiency and portability across a wide class of architectures –shared-memory multiprocessors –NUMA architectures –distributed-memory MPP’s –Workstation networks l Retaining “look and feel” of MPI-1 l Dealing with subtle memory behavior issues: cache coherence, sequential consistency l Synchronization is separate from data movement.

21 Remote Memory Access Windows MPI_Win_create( base, size, disp_unit, info, comm, win ) Exposes memory given by (base, size) to RMA operations by other processes in comm. win is window object used in RMA operations. Disp_unit scales displacements: –1 (no scaling) or sizeof(type), where window is an array of elements of type type. –Allows use of array indices. –Allows heterogeneity.
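A sketch exposing a local array of doubles (the array size is arbitrary):

    #define N 1000

    double buf[N];
    MPI_Win win;

    /* Expose buf to RMA by the other processes in MPI_COMM_WORLD.        */
    /* disp_unit = sizeof(double) lets targets address elements by index. */
    MPI_Win_create(buf, (MPI_Aint)(N * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* ... access and exposure epochs go here ... */
    MPI_Win_free(&win);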

22 Remote Memory Access Windows (Figure: four processes, each exposing a window; one process issues Put and Get operations directly into the windows of the others.)

23 One-Sided Communication Calls MPI_Put - stores into remote memory MPI_Get - reads from remote memory MPI_Accumulate - updates remote memory l All are non-blocking: data transfer is initiated, but may continue after call returns. l Subsequent synchronization on window is needed to ensure operations are complete.

24 Put, Get, and Accumulate l MPI_Put( origin_addr, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win ) l MPI_Get(... ) l MPI_Accumulate(..., op,... ) op is as in MPI_Reduce, but no user-defined operations are allowed.

25 Synchronization Multiple methods for synchronizing on a window: MPI_Win_fence - like a barrier, supports the BSP model MPI_Win_{start, complete, post, wait} - for closer control, involves groups of processes MPI_Win_{lock, unlock} - provides a shared-memory model.
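A sketch of a fence-synchronized put into the window created in the earlier sketch; the target rank (1) and element index (5) are arbitrary. In the standard the target is named by a rank and a displacement, which disp_unit scales:

    double value = 42.0;

    MPI_Win_fence(0, win);               /* open an epoch (collective over win) */
    /* Store one double into element 5 of the window exposed by rank 1. */
    MPI_Put(&value, 1, MPI_DOUBLE,
            1,                           /* target rank         */
            5,                           /* target displacement */
            1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);               /* complete the transfer (collective)  */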

26 Extended Collective Operations l In MPI-1, collective operations are restricted to ordinary (intra) communicators. l In MPI-2, most collective operations apply also to intercommunicators, with appropriately different semantics. l E.g., Bcast/Reduce in the intercommunicator resulting from spawning new processes goes from/to the root in the spawning processes to/from the spawned processes. l In-place extensions
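A sketch of a broadcast over the spawn intercommunicator from the earlier sketches: the root in the spawning group passes MPI_ROOT, any other spawning processes pass MPI_PROC_NULL, and the spawned processes pass the root's rank in the remote group:

    int data = 0;

    /* In the spawning group, at the root (others would pass MPI_PROC_NULL): */
    MPI_Bcast(&data, 1, MPI_INT, MPI_ROOT, workers);

    /* In the spawned group: receive from rank 0 of the spawning group. */
    MPI_Bcast(&data, 1, MPI_INT, 0, parent);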

27 External Interfaces l Purpose: to ease extending MPI by layering new functionality portably and efficiently l Aids integrated tools (debuggers, performance analyzers) l In general, provides portable access to parts of MPI implementation internals. l Already being used in layering I/O part of MPI on multiple MPI implementations.

28 Components of MPI External Interface Specification l Generalized requests –Users can create custom non-blocking operations with an interface similar to MPI’s. –MPI_Waitall can wait on a combination of built-in and user-defined operations. l Naming objects –Set/Get name on communicators, datatypes, windows. l Adding error classes and codes l Datatype decoding l Specification for thread-compliant MPI
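A small sketch of the object-naming calls, which tools such as debuggers can use to label communicators (the communicator comm and the name string are assumptions):

    char name[MPI_MAX_OBJECT_NAME];
    int len;

    MPI_Comm_set_name(comm, "ocean-solver");
    MPI_Comm_get_name(comm, name, &len);   /* a tool can retrieve the label later */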

29 C++ Bindings l C++ binding alternatives: –use C bindings –Class library (e.g., OOMPI) –“minimal” binding l Chose “minimal” approach l Most MPI functions are member functions of MPI classes: –example: MPI::COMM_WORLD.send(... ) l Others are in MPI namespace l C++ bindings for both MPI-1 and MPI-2

30 Fortran Issues l “Fortran” now means Fortran-90. l MPI can’t take advantage of some new Fortran (-90) features, e.g., array sections. l Some MPI features are incompatible with Fortran-90. –e.g., communication operations with different types for first argument, assumptions about argument copying. l MPI-2 provides “basic” and “extended” Fortran support.

31 Fortran l Basic support: –mpif.h must be valid in both fixed- and free-form format. l Extended support: –mpi module –some new functions using parameterized types

32 Language Interoperability l Single MPI_Init l Passing MPI objects between languages l Constant values, error handlers l Sending in one language; receiving in another l Addresses l Datatypes l Reduce operations

33 Why MPI is a Good Setting for Parallel I/O l Writing is like sending and reading is like receiving. l Any parallel I/O system will need: –collective operations –user-defined datatypes to describe both memory and file layout –communicators to separate application-level message passing from I/O-related message passing –non-blocking operations l I.e., lots of MPI-like machinery

34 Introduction to I/O in MPI l I/O in MPI can be thought of as Unix I/O plus (lots of) other stuff. Basic operations: MPI_File_{open, close, read, write, seek} l Parameters to these operations match Unix, aiding a straightforward port from Unix I/O to MPI I/O. l However, to get performance and portability, more advanced features must be used.
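A minimal sketch of this Unix-like subset, with each process writing its own block at an explicit offset (the file name "datafile" and block size of 100 ints are assumptions):

    MPI_File fh;
    int rank, buf[100];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* ... fill buf ... */
    /* Collective open over MPI_COMM_WORLD. */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Independent write at an explicit per-process offset. */
    MPI_File_write_at(fh, (MPI_Offset)(rank * 100 * sizeof(int)),
                      buf, 100, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);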

35 What is Parallel I/O? l Multiple processes participate. l Application is aware of parallelism. l Preferably the “file” is itself stored on a parallel file system with multiple disks. l That is, I/O is parallel at both ends: –application program –I/O hardware l The focus here is on the application program end.

36 Typical Parallel File System (Figure: compute nodes connected by an interconnect to I/O nodes, which manage the disks.)

37 MPI I/O Features l Noncontiguous access in both memory and file l Use of explicit offsets l Individual and shared file pointers l Nonblocking I/O l Collective I/O l File interoperability l Portable data representation l Mechanism for providing hints applicable to a particular implementation and I/O environment (e.g., number of disks, striping factor): info
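A sketch of passing hints at open time; "striping_factor" and "striping_unit" are reserved I/O hints, and an implementation is free to ignore them (fh is assumed declared as before):

    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");       /* number of I/O devices */
    MPI_Info_set(info, "striping_unit",   "1048576");  /* stripe size in bytes  */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);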

38 Collective I/O in MPI l A critical optimization in parallel I/O l Allows communication of the “big picture” to the file system l Framework for two-phase I/O, in which communication precedes I/O (can use MPI machinery) l Basic idea: build large blocks, so that reads/writes in the I/O system will be large. (Figure: many small individual requests versus one large collective access.)
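A sketch of the collective form of the earlier write: each process first sets a file view selecting its own block, then all processes write together so the library can merge the small requests into large ones (rank, buf, and fh are carried over from the earlier sketch):

    /* File view: this process's data starts at its own displacement;    */
    /* the etype and filetype are both MPI_INT, "native" representation. */
    MPI_File_set_view(fh, (MPI_Offset)(rank * 100 * sizeof(int)),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    /* Collective write: every process of the communicator participates, */
    /* allowing two-phase I/O behind the scenes.                         */
    MPI_File_write_all(fh, buf, 100, MPI_INT, MPI_STATUS_IGNORE);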

39 Typical Access Pattern (Figure: an array distributed (block, block) across processes, and the resulting noncontiguous access pattern in the file.)

40 Solution: “Two-Phase” I/O l Trade computation and communication for I/O. l The interface describes the overall pattern at an abstract level. l Data is written to the file in large blocks to amortize the effect of high I/O latency. l Message-passing among compute nodes is used to redistribute data as needed. l It is critical that the I/O operation be collective, i.e., executed by all processes.

41 Independent Writes l On Paragon l Lots of seeks and small writes l Time shown = 130 seconds

42 Collective Write l On Paragon l Communication and computation precede the seeks and writes l Time shown = 2.75 seconds

43 MPI-2 Status Assessment l Released July, 1997 l All MPP vendors now have MPI-1 (1.0, 1.1, or 1.2). l Free implementations (MPICH, LAM, CHIMP) support heterogeneous workstation networks. l MPI-2 implementations are being undertaken now by all vendors. –Fujitsu has a complete MPI-2 implementation. l MPI-2 is harder to implement than MPI-1 was. l MPI-2 implementations are appearing piecemeal, with I/O first. –I/O is available in most MPI implementations. –One-sided communication is available in some (e.g., HP and Fujitsu).

44 Summary l MPI-2 provides major extensions to the original message-passing model targeted by MPI-1. l MPI-2 can deliver to libraries and applications portability across a diverse set of environments. l Implementations are under way. l Sources: –The MPI standard documents are available at –2-volume book: MPI - The Complete Reference, available from MIT Press –More tutorial books coming soon.

45 The End