
1 MPICH2 – A High-Performance and Widely Portable Open-Source MPI Implementation
Darius Buntinas, Argonne National Laboratory

2 Overview
MPICH2
– High-performance
– Open-source
– Widely portable
MPICH2-based implementations
– IBM for BG/L and BG/P
– Cray for XT3/4
– Intel
– Microsoft
– SiCortex
– Myricom
– Ohio State

3 Outline
Architectural overview
Nemesis – a new communication subsystem
New features and optimizations
– Intranode communication
– Optimizing non-contiguous messages
– Optimizing large messages
Current work in progress
– Optimizations
– Multi-threaded environments
– Process manager
– Other optimizations
Libraries and tools

4 Traditional MPICH2 Developer APIs
Two APIs for porting MPICH2 to new communication architectures
– ADI3
– CH3
ADI3 – implement a new device
– Richer interface (~60 functions)
– More work to port
– More flexibility
CH3 – implement a new CH3 channel (sketched below)
– Simpler interface (~15 functions)
– Easier to port
– Less flexibility
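To make the two porting paths concrete, here is a rough sketch of what a channel boils down to. The struct and function names are hypothetical illustrations, not MPICH2's actual symbols.

#include <stddef.h>

/* Hypothetical sketch: a CH3-style channel is essentially a small set
 * of entry points the device calls; these names are illustrative only,
 * not MPICH2's real interface. */
typedef struct channel_ops {
    int (*init)(int *rank, int *size);            /* set up the channel */
    int (*send)(int dest, void *buf, size_t len); /* start a send       */
    int (*progress)(int blocking);                /* advance pending comm */
    int (*finalize)(void);                        /* tear down          */
    /* ... on the order of 15 such entries for a CH3 channel; an ADI3
     * device implements roughly 60, gaining control over message
     * matching, datatypes, and collectives in exchange for the work. */
} channel_ops_t;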

5 [Architecture diagram: applications call the MPI layer through the MPI and PMPI interfaces; the ADI3 interface sits above the CH3 device, whose channels (Nemesis, Sock, SSHM, SHM) reach the network; Nemesis network modules behind the netmod interface include TCP, IB/iWARP, PSM, MX, GM, and the BG and Cray ports; ROMIO's ADIO interface targets PVFS, GPFS, XFS, and other file systems; process managers (MPD, SMPD, Gforker) plug in through the PMI interface; MPE and Jumpshot provide tooling]
Support for high-speed networks
– 10-Gigabit Ethernet iWARP, QLogic PSM, InfiniBand, Myrinet (MX and GM)
Supports proprietary platforms
– BlueGene/L, BlueGene/P, SiCortex, Cray
Distributed with the ROMIO MPI-IO library
Profiling and visualization tools (MPE, Jumpshot)

6 Nemesis
Nemesis is a new CH3 channel for MPICH2
– Shared memory for intranode communication
   Lock-free queues
   Scalability
   Improved intranode performance
– Network modules for internode communication
   New interface
New developer API – the Nemesis netmod interface
– Simpler interface than ADI3
– More flexible than CH3

7 Nemesis: Lock-Free Queues
Atomic memory operations
Scalable
– One receive queue per process (sketched below)
Optimized to reduce cache misses
[Diagram: each process has its own receive queue and free queue; senders enqueue directly onto the receiver's queue]
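A minimal sketch of the idea behind such a queue: any number of senders enqueue with a single atomic exchange, and the one receiver dequeues with plain loads and stores on the common path. This is an illustration using GCC's __sync builtins, not Nemesis's actual code, and it glosses over the memory-barrier and empty-queue subtleties a real implementation must handle.

#include <stddef.h>

typedef struct cell {
    struct cell *next;
    /* message payload would follow */
} cell_t;

typedef struct queue {
    cell_t *head;   /* read and written only by the single receiver */
    cell_t *tail;   /* updated atomically by any number of senders  */
} queue_t;

/* sender side: one atomic exchange publishes the new tail */
void enqueue(queue_t *q, cell_t *c)
{
    c->next = NULL;
    cell_t *prev = __sync_lock_test_and_set(&q->tail, c); /* atomic XCHG */
    if (prev == NULL)
        q->head = c;      /* queue was empty (simplified; real code
                             handles this transition more carefully) */
    else
        prev->next = c;   /* link the old tail to the new cell */
}

/* receiver side: no atomics needed for a single consumer */
cell_t *dequeue(queue_t *q)
{
    cell_t *c = q->head;
    if (c != NULL)
        q->head = c->next;
    return c;
}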

8 Nemesis Network Modules
Improved interface for network modules
– Allows optimized handling of noncontiguous data
– Allows optimized transfer of large data
– Optimized small contiguous-message path: < 2.5 µs over QLogic PSM
Future work
– Multiple network modules, e.g., Myrinet for intra-cluster and TCP for inter-cluster communication
– Dynamically loadable modules

9 Optimized Non-contiguous Data
Issues with non-contiguous data
– Representation
– Manipulation: packing, generating other representations (e.g., I/O vectors), etc.
Dataloops – MPICH2's optimized internal datatype representation
– Efficiently describe non-contiguous data
– Utilities to efficiently manipulate non-contiguous data
The dataloop is passed to the network module
– Previously, an I/O vector was generated and then passed
– Now the netmod implementation manipulates the dataloop itself: e.g., TCP builds an iov, while IB and PSM pack the data into a send buffer
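For context, this is how an application describes non-contiguous data to MPI in the first place; MPICH2 turns such a derived datatype into a dataloop internally. The ranks and sizes here are arbitrary example values.

#include <mpi.h>

int main(int argc, char **argv)
{
    double buf[200];
    MPI_Datatype stride2;
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 200; i++)
        buf[i] = (double)i;

    /* every other double: 100 blocks of 1 element, stride 2 --
       a non-contiguous pattern described once, as a datatype */
    MPI_Type_vector(100, 1, 2, MPI_DOUBLE, &stride2);
    MPI_Type_commit(&stride2);

    if (rank == 0)
        MPI_Send(buf, 1, stride2, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1, stride2, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&stride2);
    MPI_Finalize();
    return 0;
}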

10 Optimized Large-Message Transfer Using Rendezvous
MPICH2 uses a rendezvous protocol to transfer large messages
– Original implementation: the channel was oblivious to rendezvous
   CH3 sent RTS, CTS, and DATA packets
   Shared memory: large messages were sent through the queue
   Netmod: the netmod would perform its own, redundant rendezvous
– But queues may not be the most efficient mechanism for large data
   E.g., network RDMA, an inter-process copy mechanism, or a copy buffer can be faster
Developed the LMT (large-message transfer) interface to support these various mechanisms (sketched below)
– Sender transfers the data (put)
– Receiver transfers the data (get)
– Both sender and receiver participate in the transfer
Modified CH3 to use LMT
– Works with the rendezvous protocol
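Schematically, the control traffic looks like the following; the packet type and field names are hypothetical illustrations, not MPICH2's actual definitions.

#include <stddef.h>

/* hypothetical rendezvous control packets for the LMT path */
enum lmt_pkt_type { LMT_RTS, LMT_CTS, LMT_DONE };

struct lmt_pkt {
    enum lmt_pkt_type type;
    int    src, tag, context;  /* MPI matching information            */
    size_t len;                /* total message length                */
    long   cookie;             /* transport handle the peer needs to
                                  reach the data: an RDMA key, a
                                  shared-memory buffer ID, etc.       */
};

/* Flow, schematically:
 *   1. sender   -> RTS (matching info + sender-side cookie)
 *   2. receiver -> CTS (receiver-side cookie) once the recv is posted
 *   3. data moves by whichever mode the netmod chose:
 *        put  - sender writes into memory the receiver exposed
 *        get  - receiver pulls from memory the sender exposed
 *        both - sender and receiver cooperate (e.g., the shared-memory
 *               double-buffered copy on the next slide)
 *   4. DONE completes the MPI requests on both sides
 */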

11 Optimization: LMT for Intranode Communication
For intranode transfers, LMT copies through a buffer in shared memory
The sender allocates a shared-memory region
– Sends the buffer ID to the receiver in the RTS packet
The receiver attaches to the memory region
Both sender and receiver participate in the transfer
– Using double-buffering (see the sketch below)
[Diagram: sender and receiver copying through the shared buffer]
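A minimal sketch of the sender side of the double-buffering loop, with hypothetical names and sizes: while the receiver drains one chunk, the sender fills the other. Real code needs memory barriers and a smarter wait than spinning.

#include <string.h>
#include <stddef.h>

#define CHUNK 32768          /* per-chunk copy-buffer size (illustrative) */

/* hypothetical layout of the shared-memory region named in the RTS */
typedef struct {
    volatile int    full[2]; /* does chunk i hold valid data?          */
    volatile size_t len[2];  /* number of bytes in chunk i             */
    char            data[2][CHUNK];
} copybuf_t;

/* sender: alternate between the two chunks; the receiver runs the
 * mirror-image loop, copying out and clearing full[i] */
void lmt_shm_send(copybuf_t *cb, const char *src, size_t n)
{
    int i = 0;
    while (n > 0) {
        size_t c = n < CHUNK ? n : CHUNK;
        while (cb->full[i])
            ;                        /* wait for receiver to drain chunk */
        memcpy(cb->data[i], src, c);
        cb->len[i]  = c;
        cb->full[i] = 1;             /* publish; needs a barrier in real code */
        src += c;
        n   -= c;
        i   ^= 1;                    /* switch to the other chunk */
    }
}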

12 Current Work in Progress
Optimizations
Multi-threaded environments
Process manager
Other work
Atomic operations library

13 Current Optimization Work
Handle the common case fast: eager contiguous messages
– Identify this case early in the operation
– Call the netmod's send_eager_contig() function directly
Bypass the receive queue (contrasted in the sketch below)
– Currently: check the unexpected queue, post on the posted queue, check the network
– Optimized: check the unexpected queue, check the network
– Reduced instruction count by 48%
Eliminate function calls
– Collapse layers where possible
Merge Nemesis with CH3
– Move Nemesis functionality into CH3
– CH3 shared-memory support
– New CH3 channel/netmod interface
Cache-aware placement of fields in structures
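The receive-queue bypass can be contrasted as follows; the helper names and types are hypothetical illustrations of the two orderings, not MPICH2's actual code.

typedef struct request request_t;
typedef struct match { int rank, tag, context; } match_t;

/* hypothetical helpers, declared only */
request_t *unexpected_queue_find(match_t m);
request_t *posted_queue_append(match_t m);
request_t *network_try_match(match_t m);
void progress_poke(void);

/* current path: always pay the posted-queue bookkeeping, then let the
 * progress engine do the matching */
request_t *recv_current(match_t m)
{
    request_t *req = unexpected_queue_find(m);
    if (req) return req;              /* message already arrived       */
    req = posted_queue_append(m);     /* queue bookkeeping             */
    progress_poke();                  /* check the network             */
    return req;
}

/* optimized path: try the network directly and only fall back to the
 * posted queue if nothing matches (the 48% instruction-count saving) */
request_t *recv_fast(match_t m)
{
    request_t *req = unexpected_queue_find(m);
    if (req) return req;
    if ((req = network_try_match(m))) return req;
    return posted_queue_append(m);    /* slow path                     */
}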

14 Fine-Grained Threading
MPICH2 supports multi-threaded applications
– MPI_THREAD_MULTIPLE (requested as shown below)
Currently, thread safety is implemented with a single lock
– The lock is acquired on entering an MPI function
– And released on exit
– It is also released when making blocking communication system calls
This limits concurrency in communication
– Only one thread can be in the progress engine at a time
New architectures have multiple DMA engines for communication
– These can work independently of each other
Concurrency in the progress engine is needed for maximum performance
– Even without independent network hardware, internal concurrency can improve performance
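For reference, this is how an application requests the fully multi-threaded level; MPI may grant a lower level, so the returned value must be checked.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* ask for full multi-threaded support */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: only thread level %d available\n",
                provided);

    /* ... any thread may now call MPI concurrently; with the current
       single global lock, such calls serialize inside the library ... */

    MPI_Finalize();
    return 0;
}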

15 Multicore-Aware Collectives
Intranode communication is much faster than internode communication
Take advantage of this in collective algorithms
E.g., broadcast (sketched below)
– Send to one process per node; that process broadcasts to the other processes on its node
A step further: collectives over shared memory
– E.g., broadcast: within a node, one process writes the data to a shared-memory region and the other processes read it
– Issues: memory traffic, cache misses, etc.
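A sketch of the two-level broadcast, assuming the root is global rank 0. Deriving the node color by hashing the processor name is an illustrative shortcut; a real implementation would use actual topology information.

#include <mpi.h>

void bcast_two_level(void *buf, int count, MPI_Datatype dt, MPI_Comm comm)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int i, len, rank, node_rank, color = 0;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(comm, &rank);
    MPI_Get_processor_name(name, &len);
    for (i = 0; i < len; i++)          /* crude hash: same node name   */
        color = (color * 31 + name[i]) & 0x7fffffff;  /* => same color */

    /* processes on the same node share node_comm */
    MPI_Comm_split(comm, color, rank, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* one leader per node (node_rank 0) forms leader_comm */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, dt, 0, leader_comm);    /* inter-node    */
        MPI_Comm_free(&leader_comm);
    }
    MPI_Bcast(buf, count, dt, 0, node_comm);          /* intra-node    */
    MPI_Comm_free(&node_comm);
}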

16 Process Manager
Enhanced support for third-party process managers
– PBS, Slurm
– Working on others
Replacement for the existing process managers
– Scalable to tens of thousands of nodes and beyond
– Fault-tolerant
– Topology-aware

17 Other Work
Heterogeneous data representations
– Different architectures use different data representations: e.g., big/little-endian, 32/64-bit, IEEE/non-IEEE floating point
– Important for heterogeneous clusters and grids
– Uses the existing datatype-manipulation utilities
Fault-tolerance support
– CIFTS – a fault-tolerance backplane
– Fault detection and reporting

18 Atomic Operations Library
Lock-free algorithms use atomic assembly instructions
Assembly instructions are non-portable
– They must be ported to each architecture and compiler
We are working on an atomic operations library
– Implementations for various architectures and compilers
– Stand-alone library
– Not all atomic operations are natively supported on all architectures: e.g., some have LL/SC but no SWAP
– Such operations can be emulated using the operations the platform does provide (see the sketch below)
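For instance, a platform with compare-and-swap but no native SWAP can emulate it with a retry loop; here is a sketch using GCC's __sync builtin (an LL/SC platform would build compare-and-swap the same way, so one primitive bootstraps the rest of the library).

/* emulate atomic SWAP on top of compare-and-swap */
static inline int atomic_swap(volatile int *ptr, int newval)
{
    int old;
    do {
        old = *ptr;               /* read the current value            */
        /* retry if another thread changed *ptr before our CAS landed */
    } while (!__sync_bool_compare_and_swap(ptr, old, newval));
    return old;                   /* the value we displaced            */
}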

19 Tools Included in MPICH2
MPE library for tracing MPI and other calls
Scalable log-file format (SLOG2)
Jumpshot tool for visualizing log files
– Supports threads
Collchk library for checking that the application calls collective operations correctly
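Tools like MPE and collchk hook MPI calls through the standard PMPI profiling interface. A minimal hand-rolled wrapper, independent of MPE's own API, looks like this: the tool defines the MPI_ symbol and forwards to the real implementation via the PMPI_ name.

#include <mpi.h>
#include <stdio.h>

/* intercept MPI_Bcast, time it, and forward to the implementation */
int MPI_Bcast(void *buf, int count, MPI_Datatype dt, int root,
              MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Bcast(buf, count, dt, root, comm);
    fprintf(stderr, "MPI_Bcast: %d elements, %.6f s\n",
            count, MPI_Wtime() - t0);
    return rc;
}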

20 For more information…
MPICH2 website
– http://www.mcs.anl.gov/research/projects/mpich2
SVN repository
– svn co https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk mpich2
Developer pages
– http://wiki.mcs.anl.gov/mpich2/index.php/Developer_Documentation
Mailing lists
– mpich2-maint@mcs.anl.gov
– mpich-discuss@mcs.anl.gov
Me
– buntinas@mcs.anl.gov
– http://www.mcs.anl.gov/~buntinas

