OPTIMIZATION STRATEGIES FOR MPI-INTEROPERABLE ACTIVE MESSAGES
Xin Zhao, Pavan Balaji, William Gropp, Rajeev Thakur
University of Illinois at Urbana-Champaign and Argonne National Laboratory
ScalCom'13, December 21, 2013

Slide 2: Data-Intensive Applications
Examples: graph algorithms (BFS), sequence assembly.
Common characteristics:
- Organized around sparse structures
- High communication-to-computation ratio
- Irregular communication pattern
- Complex computation performed on remote data
(Figure: DNA sequence assembly, where fragments such as ACGCGATTCAG and GCGATTCAGTA overlap to form the consensus sequence ACGCGATTCAGTA; a remote search is issued from the local node to a remote node.)

Slide 3: Message Passing Models
MPI is the industry-standard communication runtime for high-performance computing. The slide contrasts three models (process 0 / process 1 diagrams):
- Two-sided communication: explicit sends and receives (Send(data) matched by Receive(data)).
- One-sided (RMA) communication: explicit sends, implicit receives, simple remote operations (Put(data), Get(data), Acc(data)).
- Active messages: the sender explicitly sends a message; upon the message's arrival a handler is triggered on the target, without the receiver being explicitly involved, enabling user-defined operations on the remote process (the handler's reply is processed by a reply handler on the origin).
On the slide, two-sided and one-sided communication are marked ✗ and active messages ✓ with respect to data-intensive workloads.
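For readers less familiar with the first two models, here is a minimal C sketch (assuming two MPI processes) contrasting two-sided send/receive with a one-sided put inside a fence epoch; buffer contents and tags are illustrative only.

```c
/* Minimal sketch contrasting MPI two-sided and one-sided (RMA) communication.
 * Assumes exactly two processes; values and tags are illustrative only. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Two-sided: both sides participate explicitly. */
    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* One-sided (RMA): each process exposes a window; process 0 writes into
     * process 1's window without process 1 issuing a matching receive. */
    MPI_Win_create(&data, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&data, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
```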

Slide 4: Past Work: MPI-Interoperable Generalized Active Messages
MPI-AM: an MPI-interoperable framework that can dynamically manage data movement and user-defined remote computation.
- Correctness semantics: memory consistency (the MPI runtime must ensure consistency of the window; the figure shows the AM handler bracketed by memory barriers in the SEPARATE window model and followed by flushing cache lines back in the UNIFIED window model), three different types of ordering, and concurrency (by default the runtime behaves "as if" AMs execute in sequential order; the user can relax this by setting an MPI assert).
- Streaming AMs: a "segment" is the minimum number of elements required for AM execution; segmenting achieves a pipeline effect and reduces buffer requirements [ICPADS 2013].
- Explicit and implicit buffer management: user buffers use a rendezvous protocol and guarantee correct execution; system buffers use an eager protocol but are not always sufficient.
- Workflow (figure): AM input data flows from the origin input buffer to the target input buffer, the AM handler runs against a target persistent buffer, and AM output data returns through the target and origin output buffers.
[ICPADS 2013] X. Zhao, P. Balaji, W. Gropp, R. Thakur, "MPI-Interoperable and Generalized Active Messages", in Proceedings of ICPADS 2013.
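To make the streaming idea concrete, the following is a conceptual C sketch of per-segment handler invocation; it is a simplification for illustration only and does not reflect the framework's actual internals.

```c
/* Conceptual illustration of streaming AMs: the target invokes the user
 * handler once per arriving segment instead of waiting for the whole input,
 * which pipelines transfer and computation and bounds buffer usage.
 * This is a simplification, not the framework's real implementation. */
#include <stddef.h>

typedef void (*am_handler_t)(const char *seg, size_t seg_elems, void *persistent);

void run_streaming_am(const char *input, size_t total_elems, size_t seg_elems,
                      am_handler_t handler, void *persistent) {
    for (size_t off = 0; off < total_elems; off += seg_elems) {
        size_t n = (total_elems - off < seg_elems) ? total_elems - off : seg_elems;
        handler(input + off, n, persistent);  /* one handler call per segment */
    }
}
```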

Slide 5: How MPI-Interoperable AMs Work
MPI-AM leverages the MPI RMA interface, interleaving AMs with regular RMA operations inside the same epoch:
- Passive target mode (EXCLUSIVE or SHARED lock): figure shows a lock...unlock epoch on the origin containing PUT(X), AM(X), GET(Y), and ACC(Z) targeting the same process.
- Active target mode (Fence or Post-Start-Complete-Wait): figure shows fence-delimited epochs across processes 0, 1, and 2 containing GET(Y), PUT(X), AM(U), GET(W), AM(M), and PUT(N).
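A minimal sketch of such a mixed epoch, using only standard MPI-3 RMA calls; the AM itself is shown as a placeholder comment, since its full argument list appears only in the backup slides.

```c
/* Sketch of interleaving an AM with regular RMA operations inside a
 * passive-target epoch, as on the slide. Only standard MPI-3 RMA calls are
 * used; the AM is indicated by a comment because the full MPIX_AM(V)
 * signature is given later in the backup slides. */
#include <mpi.h>

void mixed_epoch(MPI_Win win, int target, int *x, int *y, int *z) {
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(x, 1, MPI_INT, target, 0, 1, MPI_INT, win);                  /* PUT(X) */
    /* MPIX_AM(...X..., target, ..., win);  user-defined handler runs on target */
    MPI_Get(y, 1, MPI_INT, target, 1, 1, MPI_INT, win);                  /* GET(Y) */
    MPI_Accumulate(z, 1, MPI_INT, target, 2, 1, MPI_INT, MPI_SUM, win);  /* ACC(Z) */
    MPI_Win_unlock(target, win);
}
```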

Slide 6: Performance Shortcomings with MPI-AM
- Synchronization stalls in data buffering (figure: origin-target timeline): the origin requests a user buffer on the target and stalls until an acknowledgement arrives; the target cannot reserve the user buffer until the AM running on the system buffer finishes, so incoming data is received in the system buffer and output data is returned through the system buffer.
- Inefficiency in data transmission (figure: an AM split into segments 1-3, with a local node issuing remote searches against consensus and query DNA sequences held on remote nodes 1 and 2).
Effective strategies are needed to improve performance, at both the system level and the user level.

Slide 7: Opt #1: Auto-Detected Exclusive User Buffer
- MPI internally detects the EXCLUSIVE passive-target mode; the optimization is transparent to the user.
- SHARED lock: a hand-shake is required whenever an AM uses the user buffer.
- EXCLUSIVE lock: only one hand-shake is required for the entire epoch (figure: lock, hand-shake, AM 1, AM 2, unlock).

Slide 8: Opt #2: User-Defined Exclusive User Buffer
- The user can specify, at window creation, an amount of user buffer guaranteed to be available to certain processes (figure: each process calls win_create with a per-rank table of buffer sizes, e.g., 30 MB for one rank).
- Beneficial for the SHARED passive-target mode and for active-target mode.
- No hand-shake operation is required at all.
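A hypothetical sketch of how such a guarantee might be declared at window creation; the info key name and value format are assumptions for illustration, since the slide does not give the exact interface.

```c
/* Hypothetical sketch of declaring a guaranteed per-rank AM user buffer at
 * window creation. The info key "am_exclusive_buf_size" and its value format
 * are assumptions, not the framework's documented interface. */
#include <mpi.h>
#include <stdlib.h>

void create_am_window(MPI_Comm comm, size_t win_bytes, MPI_Win *win) {
    void *base;
    MPI_Info info;
    MPI_Info_create(&info);
    /* Assumed hint: reserve 30 MB of exclusive AM user buffer for rank 1. */
    MPI_Info_set(info, "am_exclusive_buf_size", "1:30M");
    MPI_Win_allocate((MPI_Aint)win_bytes, 1, info, comm, &base, win);
    MPI_Info_free(&info);
}
```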

Slide 9: Opt #3: Improving Data Transmission
Two different models for returning AM output data:
- Contiguous output data layout: straightforward and convenient at first glance, but out-of-order AMs require buffering or reordering.
- Non-contiguous output data layout: requires a new API (not a big deal) and must transfer back a counts array, with packing and unpacking of segments, but no buffering or reordering is needed.
The choice between data packing and direct data transmission is controlled by a system-specific threshold (figure: segments #1-#4 are packed into pipeline units #1 and #2 on the target and unpacked on the origin).

Slide 10: Experimental Settings
- BLUES cluster at ANL: 310 nodes, each consisting of 16 cores, connected with QLogic QDR InfiniBand.
- Implementation based on MPICH-3.1b1.
- Micro-benchmarks, two common operations:
  - Remote search of string sequences (20 characters per sequence).
  - Remote summation of absolute values in two arrays (100 integers per array).
  - In both cases, result data is returned to the origin.
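As an illustration of the second micro-benchmark, here is a sketch of a remote-summation handler body; the real MPIX_AM user function also takes datatype and segment arguments (see the backup slides), which are omitted here for brevity.

```c
/* Illustrative sketch (not the paper's exact handler) of the remote-summation
 * micro-benchmark: add the absolute values of the incoming array and of the
 * target's persistent array element-wise, and write the result to the output
 * buffer that is returned to the origin. */
#include <stdlib.h>

void abs_sum_handler(const int *input, const int *persistent,
                     int *output, int count) {
    for (int i = 0; i < count; i++)
        output[i] = abs(input[i]) + abs(persistent[i]);
}
```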

Slide 11: Effect of Exclusive User Buffer
(Figure: performance results showing a 25% scalability improvement.)

Slide 12: Comparison between MPIX_AM and MPIX_AMV
(Figure: performance results.)
- The system-specific threshold helps eliminate packing and unpacking overhead.
- MPIX_AMV(0.8) transmits more data than MPIX_AM due to the additional counts array.

Slide 13: Conclusion
- Data-intensive applications are increasingly important, and MPI by itself is not a well-suited model for them.
- In previous work we proposed the MPI-AM framework to make data-intensive applications more efficient while requiring less programming effort.
- The current MPI-AM framework has performance shortcomings.
- Our optimization strategies, including auto-detected and user-defined methods, effectively reduce synchronization overhead and improve the efficiency of data transmission.

Slide 14: References and Thanks
Our previous work on MPI-AM:
- Towards Asynchronous and MPI-Interoperable Active Messages. Xin Zhao, Darius Buntinas, Judicael Zounmevo, James Dinan, David Goodell, Pavan Balaji, Rajeev Thakur, Ahmad Afsahi, William Gropp. In Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013).
- MPI-Interoperable Generalized Active Messages. Xin Zhao, Pavan Balaji, William Gropp, Rajeev Thakur. In Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013).
More about the MPI-3 RMA interface can be found in the MPI-3 standard.
Thanks for your attention!

BACKUP

Slide 16: Data-Intensive Applications
"Traditional" applications:
- Organized around dense vectors or matrices
- Regular communication, using MPI SEND/RECV or collectives
- Low communication-to-computation ratio
- Examples: stencil computation, matrix multiplication, FFT
Data-intensive applications:
- Organized around graphs and sparse vectors
- Irregular, data-dependent communication pattern
- High communication-to-computation ratio
- Examples: bioinformatics, social network analysis

Slide 17: Vector Version of the AM API
MPIX_AMV arguments:
- IN: origin_input_addr, origin_input_segment_count, origin_input_datatype
- OUT: origin_output_addr; IN: origin_output_segment_count, origin_output_datatype
- IN: num_segments, target_rank, target_input_datatype
- IN: target_persistent_disp, target_persistent_count, target_persistent_datatype
- IN: target_output_datatype, am_op, win
- OUT: output_segment_counts
MPIX_AMV_USER_FUNCTION arguments:
- IN: input_addr, input_segment_count, input_datatype
- INOUT: persistent_addr, persistent_count, persistent_datatype
- OUT: output_addr, output_segment_count, output_datatype
- IN: num_segments, segment_offset
- OUT: output_segment_counts
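A possible C rendering of the MPIX_AMV argument list above; the exact parameter types and the MPIX_Am_op handle are assumptions inferred from the slide, not the framework's published header.

```c
/* Hypothetical C prototype inferred from the argument list above. Exact types
 * (int vs. MPI_Count, the MPIX_Am_op handle, etc.) are assumptions. */
#include <mpi.h>

typedef int MPIX_Am_op;   /* assumed handle type for a registered AM operation */

int MPIX_AMV(const void *origin_input_addr, int origin_input_segment_count,
             MPI_Datatype origin_input_datatype,
             void *origin_output_addr, int origin_output_segment_count,
             MPI_Datatype origin_output_datatype,
             int num_segments, int target_rank,
             MPI_Datatype target_input_datatype,
             MPI_Aint target_persistent_disp, int target_persistent_count,
             MPI_Datatype target_persistent_datatype,
             MPI_Datatype target_output_datatype,
             MPIX_Am_op am_op, MPI_Win win,
             int output_segment_counts[]);
```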

Slide 18: Effect of Exclusive User Buffer
(Figure: performance results; scalability improves by about 25%, and providing more exclusive buffer greatly reduces contention.)

Slide 19: Auto-Detected Exclusive User Buffer: Handling Detached User Buffers
- One more hand-shake is required after MPI_WIN_FLUSH, because the user buffer may have been detached on the target.
- The user can pass a hint telling MPI that there will be no buffer detachment; in that case MPI eliminates the hand-shake after MPI_WIN_FLUSH.
(Figure: within an EXCLUSIVE-lock epoch, AM 1 and AM 2 follow the initial hand-shake; after a flush and a possible "detach user buffer" on the target, a second hand-shake precedes AM 3, AM 4, and AM 5.)
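A hypothetical sketch of the "no buffer detachment" hint; the info key name is an assumption, since the slide only says that a hint can be passed.

```c
/* Hypothetical sketch of passing the "no buffer detachment" hint. The info
 * key "am_no_buf_detach" is assumed for illustration; with such a hint set,
 * the runtime could skip the extra hand-shake after MPI_Win_flush. */
#include <mpi.h>

void am_epoch_with_flush(MPI_Win win, int target) {
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "am_no_buf_detach", "true");  /* assumed key */
    MPI_Win_set_info(win, info);                      /* standard MPI-3 call */
    MPI_Info_free(&info);

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    /* ... issue AMs 1 and 2 here (e.g., via MPIX_AM) ... */
    MPI_Win_flush(target, win);  /* no second hand-shake needed with the hint */
    /* ... issue AMs 3-5 here ... */
    MPI_Win_unlock(target, win);
}
```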

Slide 20: Background: MPI One-Sided Synchronization Modes
Two synchronization modes:
- Passive target mode (SHARED or EXCLUSIVE lock / lock_all): the origin opens an access epoch with lock, issues operations such as PUT(Y) and ACC(X), and closes it with unlock; the target is not involved.
- Active target mode (post-start-complete-wait or fence): the origin's access epoch is bounded by start and complete, while the target's exposure epoch is bounded by post and wait.
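A minimal sketch of the two synchronization modes using standard MPI-3 RMA calls; it assumes a window win already created over MPI_COMM_WORLD, and the payloads are illustrative only.

```c
/* Minimal sketch of the two MPI-3 RMA synchronization modes used by MPI-AM. */
#include <mpi.h>

void passive_target_epoch(MPI_Win win, int target, int *val) {
    /* Passive target: only the origin synchronizes. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(val, 1, MPI_INT, target, 0, 1, MPI_INT, win);                 /* PUT(Y) */
    MPI_Accumulate(val, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win); /* ACC(X) */
    MPI_Win_unlock(target, win);
}

void active_target_epoch(MPI_Win win, MPI_Group origin_grp, MPI_Group target_grp,
                         int is_origin, int target, int *val) {
    if (is_origin) {
        /* Access epoch on the origin. */
        MPI_Win_start(target_grp, 0, win);
        MPI_Put(val, 1, MPI_INT, target, 0, 1, MPI_INT, win);
        MPI_Accumulate(val, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_complete(win);
    } else {
        /* Exposure epoch on the target. */
        MPI_Win_post(origin_grp, 0, win);
        MPI_Win_wait(win);
    }
}
```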