The Future of MPI
William Gropp
Argonne National Laboratory
www.mcs.anl.gov/~gropp

Presentation transcript:

The Future of MPI William Gropp Argonne National Laboratory

University of Chicago Department of Energy
The Success of MPI
Applications
  • Most recent Gordon Bell prize winners use MPI
Libraries
  • Growing collection of powerful software components
Tools
  • Performance tracing (Vampir, Jumpshot, etc.)
  • Debugging (Totalview, etc.)
Results
  • Papers:
Clusters
  • Ubiquitous parallel computing

Why Was MPI Successful?
It addresses all of the following issues:
  • Portability
  • Performance
  • Simplicity and Symmetry
  • Modularity
  • Composability
  • Completeness

Portability and Performance
Portability does not require a “lowest common denominator” approach
  • Good design allows the use of special, performance-enhancing features without requiring hardware support
  • MPI’s nonblocking message-passing semantics allow but do not require “zero-copy” data transfers
BTW, it is “Greatest Common Denominator”

Simplicity and Symmetry
MPI is organized around a small number of concepts
  • The number of routines is not a good measure of complexity
  • Fortran: large number of intrinsic functions
  • C and Java runtimes are large
  • Development frameworks: hundreds to thousands of methods
  • This doesn’t bother millions of programmers

Measuring Complexity
Complexity should be measured in the number of concepts, not functions or size of the manual
MPI is organized around a few powerful concepts
  • Point-to-point message passing
  • Datatypes
  • Blocking and nonblocking buffer handling
  • Communication contexts and process groups

Elegance of Design
MPI often uses one concept to solve multiple problems
Example: Datatypes
  • Describe noncontiguous data transfers, necessary for performance
  • Describe data formats, necessary for heterogeneous systems
“Proof” of elegance:
  • Datatypes are exactly what is needed for high-performance I/O, added in MPI-2
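
To make the "noncontiguous data transfers" bullet concrete, here is a minimal pure-Python sketch of what a strided datatype describes. It is a model, not MPI code: the helper names `vector_offsets` and `pack` are hypothetical, and the only thing borrowed from MPI is the (count, blocklength, stride) parameterization used by `MPI_Type_vector`, measured in elements.

```python
# Pure-Python model of a strided ("vector") layout; this is not MPI code.
# Assumption: the layout is given by (count, blocklength, stride) in elements,
# the same three quantities that MPI_Type_vector takes.

def vector_offsets(count, blocklength, stride):
    """Return the element offsets selected by a (count, blocklength, stride) layout."""
    return [block * stride + i
            for block in range(count)
            for i in range(blocklength)]

def pack(buffer, offsets):
    """Gather the selected elements into a contiguous list."""
    return [buffer[o] for o in offsets]

# A 4x5 row-major matrix stored as a flat list: element (r, c) sits at r*COLS + c.
ROWS, COLS = 4, 5
matrix = list(range(ROWS * COLS))

# One column is ROWS blocks of 1 element, each COLS elements apart.
column_offsets = vector_offsets(count=ROWS, blocklength=1, stride=COLS)

# Shift the layout by 2 elements to select column 2.
column = pack(matrix, [o + 2 for o in column_offsets])
print(column)
```

The point of the sketch is that one small description (three integers) stands for the whole access pattern, so the same description can be handed to a send, a receive, or a file operation.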

Parallel I/O
Collective model provides high I/O performance
  • Matches the application’s most general view: objects distributed among processes
MPI datatypes extend the I/O model to noncontiguous data in both memory and file
  • Unix readv/writev apply only to memory
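
The following pure-Python sketch illustrates why a collective view of file access can help. It assumes a hypothetical decomposition (a row-major 2-D array in a file, split by column blocks across processes) and hypothetical helper names; it models only the file extents involved and performs no I/O.

```python
# Model of file extents for a column-block decomposition; no real I/O happens.
# Assumptions: the array is stored row-major in the file, offsets are counted
# in elements, and the number of processes divides the number of columns.

def column_block_extents(nrows, ncols, nprocs, rank):
    """Return the (offset, length) file extents owned by one rank."""
    width = ncols // nprocs
    first = rank * width
    return [(r * ncols + first, width) for r in range(nrows)]

def merge_extents(extents):
    """Merge touching or overlapping (offset, length) extents."""
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] >= off:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged

NROWS, NCOLS, NPROCS = 4, 8, 4
per_rank = [column_block_extents(NROWS, NCOLS, NPROCS, r) for r in range(NPROCS)]

# Each rank alone needs NROWS small, separated pieces of the file.
print(per_rank[1])

# Seen together, the requests of all ranks cover one contiguous range.
combined = merge_extents([e for extents in per_rank for e in extents])
print(combined)
```

Each process on its own issues several small noncontiguous requests, while the union of all requests is a single contiguous range. A collective call is what makes that combined picture available to the implementation.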

Parallel I/O Performance with MPI-IO
[Figure panels: Structured Mesh I/O; Unstructured Grid I/O. Annotation: POSIX-style I/O (POSIX too slow to show).]

No One Is Perfect
Groups and group manipulation
  • MPI provides many routines for creating and manipulating groups (e.g., MPI_Group_intersection, MPI_Comm_group, MPI_Comm_create)
  • None of these is needed (in MPI-1); MPI_Comm_split should be used instead
But groups are needed in MPI-2 for scalable remote memory synchronization, another example of a powerful concept having multiple uses
Cancel of sends
  • Difficult to implement correctly
  • Little benefit to virtually all applications
  • Semantics don’t even match what users often want (stop message even if started)
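
As an illustration of why `MPI_Comm_split` can stand in for explicit group manipulation, here is a pure-Python model of its partitioning rule: processes that pass the same color land in the same new communicator, and new ranks are ordered by key with ties broken by old rank. The function `comm_split` below is a hypothetical model, not an MPI binding, and `None` stands in for `MPI_UNDEFINED`.

```python
# Pure-Python model of the MPI_Comm_split partitioning rule; not MPI code.
# colors[r] and keys[r] are the values old rank r would pass to the call.

def comm_split(colors, keys):
    """Return {old_rank: (color, new_rank)} for every rank with a color."""
    groups = {}
    for rank, color in enumerate(colors):
        if color is None:  # models MPI_UNDEFINED: rank joins no new communicator
            continue
        groups.setdefault(color, []).append(rank)

    result = {}
    for color, members in groups.items():
        # Order by key; ties are broken by rank in the old communicator.
        members.sort(key=lambda r: (keys[r], r))
        for new_rank, old_rank in enumerate(members):
            result[old_rank] = (color, new_rank)
    return result

# Six processes viewed as a 2x3 grid, split into one communicator per row.
NCOLS = 3
colors = [rank // NCOLS for rank in range(6)]  # row index
keys = [rank % NCOLS for rank in range(6)]     # column index
rows = comm_split(colors, keys)
print(rows)
```

Each process only supplies two integers; no process has to enumerate the membership of any group, which is the convenience the slide is pointing at.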

Modularity
Modern algorithms are hierarchical
  • Do not assume that all operations involve all or only one process
  • Provide tools that don’t limit the user
Modern software is built from components
  • MPI designed to support libraries
  • Many applications have no explicit MPI calls; all MPI contained within well-designed libraries

Composability
Environments are built from components
  • Compilers, libraries, runtime systems
  • MPI designed to “play well with others”
MPI exploits newest advancements in compilers
  • … without ever talking to compiler writers
  • OpenMP is an example

Completeness
MPI provides a complete parallel programming model and avoids simplifications that limit the model
  • Contrast: Models that require that synchronization only occurs collectively for all processes or tasks
Make sure that the functionality is there when the user needs it
  • Don’t force the user to start over with a new programming model when a new feature is needed

Is Ease of Use the Overriding Goal?
MPI is often described as “the assembly language of parallel programming”
C and Fortran have been described as “portable assembly languages”
Ease of use is important. But completeness is more important.
  • Don’t force users to switch to a different approach as their application evolves

Lessons From MPI
A general programming model for high-performance technical computing must address many issues to succeed
Even that is not enough. Also needs:
  • Good design
  • Buy-in by the community
  • Effective implementations
MPI achieved these through an Open Standards Process

An Open and Balanced Process
Balanced representation from
  • Users: what users want and need
      - Including correctness
  • Implementers (vendors): what can be provided
      - Many MPI features determined by implementation needs
  • Researchers: directions and futures
      - MPI planned for interoperation with OpenMP before OpenMP conceived
      - Support for libraries strongly influenced by research

Where Next?
Improving MPI
  • Simplifying and enhancing the expression of MPI programs
Improving MPI Implementations
  • Performance
New Directions
  • What can displace (or complement) MPI?
(Yesterday’s panel presentation on programming models project and tomorrow’s panel on the future of supercomputing)

Improving MPI
Simpler interfaces
  • Use compiler or precompiler techniques to support simpler, integrated syntax
  • Fortran 95 arrays, datatypes in C/C++
Eliminate function calls
  • Use program analysis and transformation to inline operations
More tools for correctness and performance debugging
  • MPI profiling interface is a good start
  • Debugger interface used by Totalview is an example of tool development
  • Effort to provide a common interface to internal performance data, such as idle time waiting for a message
Changes to MPI
  • E.g., MPI-2 RMA lacks a read-modify-write
  • But don’t hold your breath
These require research and experimentation before they are ready for a standardization process

Improving MPI Implementations
Faster point-to-point
  • Some current implementations make unnecessary copies
Collective operations
  • Better algorithms exist
      - SMP optimizations
      - Scatter-gather broadcast, reduce, etc.
Optimizing for new hardware
  • RDMA networks
  • NIC-enabled remote atomic operations
Wide area networks
  • Optimizations for high latency
  • Speculative sends
  • Quality of service extensions (through MPI attributes)
Massive scaling
  • Many implementations optimize internal buffers for modest numbers of processes
  • Some MPI routines (e.g., MPI_Graph_create) do not have scalable definitions
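
The slide does not say which collective algorithms it has in mind. As one well-known illustration of how the choice of algorithm matters, the pure-Python sketch below compares the communication schedule of a naive linear broadcast with a binomial-tree broadcast from rank 0. The function names are hypothetical and no messages are actually sent; only the (sender, receiver) pairs per round are computed.

```python
# Communication schedules for broadcasting from rank 0; a model, not MPI code.
# Each round is a list of (sender, receiver) pairs that can proceed in parallel.

def linear_broadcast_rounds(nprocs):
    """Root sends to every other rank, one message per round."""
    return [[(0, dst)] for dst in range(1, nprocs)]

def binomial_broadcast_rounds(nprocs):
    """Every rank that already holds the data forwards it each round."""
    rounds = []
    have = 1  # ranks 0 .. have-1 currently hold the data
    while have < nprocs:
        pairs = [(src, src + have)
                 for src in range(have)
                 if src + have < nprocs]
        rounds.append(pairs)
        have *= 2  # the set of data holders doubles (capped by nprocs)
    return rounds

for n in (8, 64):
    print(n, len(linear_broadcast_rounds(n)), len(binomial_broadcast_rounds(n)))

print(binomial_broadcast_rounds(8))
```

The linear schedule needs one round per receiving process, while the tree schedule needs a number of rounds that grows with the logarithm of the process count. Real implementations weigh further factors such as message size and network topology.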

More Improvements for MPI Implementations
Reduce latency
  • Automatic techniques to compress code paths
  • Closer match to hardware capabilities
Improve RMA
  • Many current implementations are at best functional
Parallel I/O, particularly for clusters
  • Communication aggregation
  • Reliability in the presence of faults
Fault tolerance
  • Exploit MPI intercommunicators to generalize the two-party model
Thread-safe and efficient implementations
  • Lock-free design
  • Software engineering for common MPI implementation source tree
Many groups working on improved MPI implementations
  • MPICH-2 is an all-new and efficient implementation
      - Includes many of these ideas
      - Designed, as MPICH was, to encourage others to experiment and extend MPI

What’s New in MPICH2
Beta-test version available for groups that expect to perform research on MPI implementations with MPICH2
  • Version 0.92 released last Friday
Contains
  • All of MPI-1, MPI-I/O, service functions from MPI-2, active-target RMA
  • C, C++, Fortran 77 bindings
  • Example devices for TCP, Infiniband, shared memory
  • Documentation
Passes extensive correctness tests
  • Intel test suite (as corrected); good unit test suite
  • MPICH test suite; adequate system test suite
  • Notre Dame C++ tests, based on IBM C test suite
  • Passes more tests than MPICH1

MPICH2 Research
All-new implementation is our vehicle for research in
  • Thread safety and efficiency (e.g., avoid thread locks)
  • Optimized MPI datatypes
  • Optimized Remote Memory Access (RMA)
  • High scalability (64K MPI processes and more)
  • Exploiting Remote Direct Memory Access (RDMA) capable networks
  • All of MPI-2, including dynamic process management, parallel I/O, RMA
  • Usability and robustness
      - Software engineering techniques that automate and simplify creating and maintaining a solid, user-friendly implementation
      - Allow extensive runtime error checking but do not require it
      - Integrated performance debugging
  • Clean interfaces to other system components such as scalable process managers

Some Target Platforms
Clusters (TCP, UDP, Infiniband, Myrinet, Proprietary Interconnects, …)
Clusters of SMPs
Grids (UDP, TCP, Globus I/O, …)
Cray Red Storm
BlueGene/x
  • 64K processors; 64K address spaces
  • ANL/IBM developing MPI for BG/L
QCDoC
Cray X1 (at least I/O)
Other systems

(Logical) Structure of MPICH-2
[Diagram; box labels: MPICH-2, ADI-3, ADIO, PMI, MPD, Vendors, Channel Interface, Multi-Method, MM, BG/L, Portals, Myrinet, Other NIC, TCP, shmem, Infiniband, Fork, remshell, bproc, Unix (python), Windows, PVFS, XFS, NFS, HFS, SFS, Other parallel file systems. Legend: Existing, In Progress, For others.]

Conclusions
The Future of MPI is Bright!
  • Higher-performance implementations
  • More libraries and applications
  • Better tools for developing and tuning MPI programs
  • Leverage of complementary technologies
Full MPI-2 implementations will become common
  • Several already exist; many ES apps use MPI RMA