Open MPI - A High Performance, Fault Tolerant MPI Library
Richard L. Graham
Advanced Computing Laboratory, Group Leader (acting)
Overview
- Open MPI Collaboration
- MPI
- Run-time
- Future directions
Collaborators
- Los Alamos National Laboratory (LA-MPI)
- Sandia National Laboratory
- Indiana University (LAM/MPI)
- The University of Tennessee (FT-MPI)
- High Performance Computing Center, Stuttgart (PACX-MPI)
- University of Houston
- Cisco Systems
- Mellanox
- Voltaire
- Sun
- Myricom
- IBM
- QLogic
URL:
A Convergence of Ideas
[Diagram: LA-MPI (LANL), LAM/MPI (IU), FT-MPI (U of TN), PACX-MPI (HLRS), Robustness (CSU), Fault Detection (LANL, Industry), Grid (many), Autonomous Computing (many), and FDDP (Semi. Mfg. Industry) converging into Open MPI, Open RTE, and Resilient Computing Systems]
Components
- Formalized interfaces
- Specifies “black box” implementation
- Different implementations available at run-time
- Can compose different systems on the fly (see the sketch below)
[Diagram: a caller invoking Interface 1, Interface 2, Interface 3]
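The component idea can be pictured as a table of function pointers behind a fixed interface, with the concrete implementation chosen by name at run time. The following is a minimal sketch under that assumption; the type and function names (component_t, sm_*, ib_*) are illustrative and are not the actual Open MPI MCA code.

```c
/* A "black box" component: the framework defines a table of function
 * pointers, each implementation fills it in, and the caller selects one by
 * name at run time.  Hypothetical names, not the actual Open MPI MCA code. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;                                 /* used for run-time selection  */
    int (*init)(void);                                /* prepare the component        */
    int (*send)(int peer, const void *buf, int len);  /* implementation-specific send */
} component_t;

/* Two independent implementations of the same interface. */
static int sm_init(void) { puts("shared-memory transport ready"); return 0; }
static int sm_send(int peer, const void *buf, int len) {
    (void)buf; printf("sm: %d bytes to peer %d\n", len, peer); return 0;
}
static int ib_init(void) { puts("InfiniBand transport ready"); return 0; }
static int ib_send(int peer, const void *buf, int len) {
    (void)buf; printf("ib: %d bytes to peer %d\n", len, peer); return 0;
}

static component_t components[] = {
    { "sm", sm_init, sm_send },
    { "ib", ib_init, ib_send },
};

/* The caller composes the system on the fly; it only ever sees the interface. */
int main(void)
{
    const char *wanted = "sm";   /* could come from an environment variable or parameter */
    for (unsigned i = 0; i < sizeof(components) / sizeof(components[0]); ++i) {
        if (strcmp(components[i].name, wanted) == 0) {
            components[i].init();
            return components[i].send(1, "hello", 5);
        }
    }
    return 1;
}
```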
Performance Impact
MPI
Two-Sided Communications
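For reference, two-sided communication pairs an explicit send on one rank with a matching receive on another. A minimal standard MPI example follows; it uses plain MPI calls and nothing specific to Open MPI internals.

```c
/* Minimal two-sided (send/receive) exchange between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* sender side        */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* matching receive   */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. `mpirun -np 2 ./a.out`.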
P2P Component Frameworks
Shared Memory - Bandwidth
Shared Memory - Latency
IB Performance Latency
[Figure: latency vs. message size, Open MPI vs. MVAPICH (anomaly?)]
IB Performance Bandwidth
GM Performance Data Ping-Pong Latency (usec)
[Figure: latency vs. data size, Open MPI vs. MPICH-GM]
GM Performance Data Ping-Pong Latency (usec) - Data FT
[Figure: latency vs. data size, Open MPI (OB1), Open MPI (FT), and LA-MPI (FT)]
GM Performance Data Ping-Pong Bandwidth
MX Ping-Pong Latency (usec)
[Figure: latency vs. message size, Open MPI (MTL) vs. MPICH-MX]
MX Performance Data Ping-Pong Bandwidth (MB/sec)
XT3 Performance Latency
Implementation      1-Byte Latency
Native Portals      5.30 us
MPICH-2             7.14 us
Open MPI            8.50 us
XT3 Performance Bandwidth
Collective Operations
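The measurements that follow cover MPI_Reduce and MPI_Bcast. As a reminder of what is being timed, here is a minimal standard MPI use of both collectives; it is illustrative only and says nothing about any particular implementation's algorithms.

```c
/* MPI_Reduce combines a value from every rank onto a root; MPI_Bcast pushes
 * a value from the root back out to every rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Sum each rank's contribution onto rank 0. */
    int contribution = rank + 1;
    MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Broadcast the result back out to all ranks. */
    MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d of %d sees sum %d\n", rank, size, sum);

    MPI_Finalize();
    return 0;
}
```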
MPI Reduce - Performance
MPI Broadcast - Performance
MPI Reduction - II
Open RTE
Open RTE - Design Overview
- Seamless, transparent environment for high-performance applications
- Inter-process communication within and across cells
- Distributed publish/subscribe registry
- Supports event-driven logic across applications and cells
- Persistent, fault tolerant
- Dynamic “spawn” of processes and applications, both within and across cells (see the sketch below)
[Diagram: cells ranging from a single computer to a cluster to a grid]
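Of the features above, dynamic spawn is the one most directly visible to MPI applications, through MPI-2 dynamic process management. The sketch below uses the standard MPI_Comm_spawn call; the worker executable name "./worker" is a hypothetical placeholder.

```c
/* Dynamically launch additional processes via MPI-2 dynamic process
 * management.  The child executable "./worker" is a hypothetical name. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* Ask the run-time to launch 4 more processes and connect them to us
     * through an intercommunicator. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);

    printf("spawned 4 worker processes\n");

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}
```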
Open RTE - Components
[Diagram: components spanning cells from a single computer to a cluster to a grid]
General Purpose Registry
- Cached, distributed storage/retrieval system (a put/get sketch follows this list)
  - All common data types plus user-defined
  - Heterogeneity between storing process and recipient automatically resolved
- Publish/subscribe
  - Supports event-driven coordination and notification
  - Subscribe to individual data elements, groups of elements, wildcard collections
  - Specify actions that trigger notifications
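A rough feel for the storage side can be given with a toy put/get sketch. The names and types below are illustrative assumptions, not the actual Open RTE registry API; the explicit type tag stands in for the machinery that resolves heterogeneity between the storing and receiving processes.

```c
/* Toy put/get registry sketch with hypothetical names. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { REG_INT32, REG_STRING } reg_type_t;

typedef struct {
    const char *key;                               /* lookup key within a container */
    reg_type_t  type;                              /* explicit type tag             */
    union { int32_t i32; const char *str; } value;
} reg_entry_t;

static reg_entry_t store[16];   /* stand-in for the distributed, cached backing store */
static int store_len = 0;

static void reg_put(const char *container, reg_entry_t entry)
{
    (void)container;            /* a real registry scopes entries by container tokens */
    if (store_len < 16) store[store_len++] = entry;
}

static const reg_entry_t *reg_get(const char *container, const char *key)
{
    (void)container;
    for (int i = 0; i < store_len; ++i)
        if (strcmp(store[i].key, key) == 0) return &store[i];
    return NULL;
}

int main(void)
{
    reg_entry_t e = { "num-procs", REG_INT32, { .i32 = 128 } };
    reg_put("job-1", e);

    const reg_entry_t *found = reg_get("job-1", "num-procs");
    if (found != NULL && found->type == REG_INT32)
        printf("num-procs = %d\n", (int)found->value.i32);
    return 0;
}
```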
Subscription Services
- Subscribe to a container and/or keyval entry
  - Can be entered before data arrives
- Specifies data elements to be monitored
  - Container tokens and/or data keys
  - Wildcards supported
- Specifies the action that generates an event
  - Data entered, modified, deleted
  - Number of matching elements equals, exceeds, or falls below a specified level
  - Number of matching elements transitions (increases/decreases) through a specified level
- Events generate a message to the subscriber
  - Includes the specified data elements
  - Delivered asynchronously to the specified callback function on the subscribing process (see the sketch below)
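The subscription side can be sketched the same way: the subscriber registers a key pattern, a trigger condition, and a callback, and the registry fires the callback when a write satisfies the trigger. All names here are illustrative assumptions rather than the real Open RTE subscription API, and delivery is synchronous for brevity where the real system delivers asynchronously.

```c
/* Toy subscription sketch with hypothetical names. */
#include <stdio.h>
#include <string.h>

typedef enum { TRIG_ON_WRITE, TRIG_COUNT_REACHES } trigger_t;

typedef void (*subscription_cb)(const char *key, int value, void *user_data);

typedef struct {
    const char     *key_pattern;  /* key to watch; "*" acts as a wildcard   */
    trigger_t       trigger;      /* which event generates a notification   */
    int             threshold;    /* level used by TRIG_COUNT_REACHES       */
    subscription_cb callback;     /* delivered to the subscribing process   */
    void           *user_data;
} subscription_t;

static subscription_t subs[8];
static int nsubs = 0;

static void subscribe(subscription_t s) { if (nsubs < 8) subs[nsubs++] = s; }

/* Simulate a write into the registry and check each subscription's trigger. */
static void registry_write(const char *key, int value)
{
    for (int i = 0; i < nsubs; ++i) {
        int match = strcmp(subs[i].key_pattern, "*") == 0 ||
                    strcmp(subs[i].key_pattern, key) == 0;
        if (!match) continue;
        if (subs[i].trigger == TRIG_ON_WRITE ||
            (subs[i].trigger == TRIG_COUNT_REACHES && value >= subs[i].threshold))
            subs[i].callback(key, value, subs[i].user_data);
    }
}

static void on_event(const char *key, int value, void *user_data)
{
    (void)user_data;
    printf("notified: %s = %d\n", key, value);
}

int main(void)
{
    /* Notify once the monitored count reaches 4 (e.g. all daemons reported in). */
    subscription_t s = { "num-daemons", TRIG_COUNT_REACHES, 4, on_event, NULL };
    subscribe(s);

    registry_write("num-daemons", 2);  /* below the level: no notification  */
    registry_write("num-daemons", 4);  /* reaches the level: callback fires */
    return 0;
}
```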
Future Directions
Revise MPI Standard
- Clarify the standard
- Standardize the interface
- Simplify the standard
- Make the standard more “H/W friendly”
Beyond Simple Performance Measures
- Performance and scalability are important, but what about future HPC systems?
  - Heterogeneity
  - Multi-core
  - Mix of processors
  - Mix of networks
  - Fault tolerance
Focus on Programmability
- Performance and scalability are important, but what about programmability?