Automatic Tuning of Collective Communications in MPI

Presentation transcript:

Automatic Tuning of Collective Communications in MPI
Rajesh Nishtala, Kushal Chakrabarti, Neil Patel, Kaushal Sanghavi
Computer Science Division, University of California at Berkeley

No single implementation of MPI collectives is optimal across all environmental variables (e.g. node architecture and network load). Our Probabilistic Algorithm Selection System (PASS) accounts for this and optimizes the collectives by learning from the implementations' past performance. PASS optimizes operations above the level of the underlying MPI point-to-point operations, making it cluster- and implementation-independent, and thus adaptive and extensible. Our new tuned implementations yield up to 10x speedups through the use of pipelining.

Figure 1: Tree topologies used by the implementations: binary tree, chain tree, sequential tree, and binomial tree.

Performance results for four different implementations of MPI collectives were gathered while varying the following parameters:
- Cluster interconnect
- Node architecture
- Number of nodes
- Pipeline segment size

Figure 2: The best implementation varies across clusters and with the number of nodes. For instance, the chain tree is always the best implementation on Seaborg, while it is best only for large numbers of nodes on Millennium. Because the binary tree is better for smaller numbers of nodes and the base implementation is suboptimal, there is room for performance tuning.

Figure 3: Because there is no definitive winning implementation of gather on CITRIS, transient cluster conditions can cause a different implementation to emerge as the best performer. PASS dynamically accounts for these changes and selects the optimal implementation.

PASS most frequently chooses top-performing implementations, but will occasionally search for better ones.

Figure 4: Pipelined transmission of messages yields significant improvements. Although performance is very sensitive to the unit of transmission (segment size), PASS will discover the optimal size.

Figure 5: PASS accounts for varying performance times (shown above) by usually choosing the optimal implementations. Averaged over multiple trials, the difference between the PASS mean and median lines illustrates the effect of exploration. Future work will focus on fine-tuning PASS so that the negative effects of exploration are negligible.

http://www.cs.berkeley.edu/~rajeshn/mpi_opt.pdf
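
To make "collectives built on top of MPI point-to-point operations" concrete, the following is a minimal C sketch of a broadcast over one of the Figure 1 topologies, the binomial tree. It is not the poster's code; the function name binomial_bcast and the tag value are our own, and error handling is omitted.

#include <mpi.h>

/* Broadcast buf from root to all ranks in comm using a binomial tree,
 * expressed entirely in MPI point-to-point calls. */
static void binomial_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Rank relative to the root, so the root behaves like rank 0. */
    int rel = (rank - root + size) % size;

    /* Receive from the parent: the lowest set bit of rel identifies it. */
    int mask = 1;
    while (mask < size) {
        if (rel & mask) {
            int src = (rank - mask + size) % size;
            MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward to children at decreasing distances. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < size) {
            int dst = (rank + mask) % size;
            MPI_Send(buf, count, type, dst, 0, comm);
        }
        mask >>= 1;
    }
}

Because only MPI_Send/MPI_Recv are used, the same routine runs unmodified on any MPI implementation and cluster, which is the property that makes PASS portable.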
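
The pipelining behind the up-to-10x speedups (Figure 4) can be sketched with a segmented chain-tree broadcast. The segment size seg_count below corresponds to the "pipeline segment size" parameter that PASS searches over; the root is assumed to be rank 0 and the buffer contiguous bytes. This is an illustrative sketch, not the poster's implementation, which could use nonblocking sends for additional overlap.

/* Broadcast count bytes down a chain (rank i -> rank i+1) in segments of
 * seg_count bytes, so that later segments overlap with the forwarding of
 * earlier ones along the chain. */
static void chain_bcast_pipelined(char *buf, int count, int seg_count,
                                  MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int prev = (rank == 0) ? MPI_PROC_NULL : rank - 1;            /* upstream   */
    int next = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;      /* downstream */

    for (int off = 0; off < count; off += seg_count) {
        int n = (count - off < seg_count) ? count - off : seg_count;
        /* Receive this segment from the predecessor (no-op at the root). */
        MPI_Recv(buf + off, n, MPI_CHAR, prev, 0, comm, MPI_STATUS_IGNORE);
        /* Forward it downstream immediately so the next segment can follow. */
        MPI_Send(buf + off, n, MPI_CHAR, next, 0, comm);
    }
}

Small segments keep every link of the chain busy but pay more per-message overhead; large segments do the opposite. That trade-off is why performance is so sensitive to segment size and why PASS tunes it rather than fixing it.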
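
The poster describes PASS only as learning from past performance, usually choosing the best-known implementation while occasionally searching for better ones. One way to realize that behavior is an epsilon-greedy rule over running mean times; the sketch below is our own illustration under that assumption, and the names selector_t, select_impl, and record_time are hypothetical.

#include <stdlib.h>

#define N_IMPLS 4   /* binary, chain, sequential, binomial tree variants */

typedef struct {
    double mean_time[N_IMPLS];  /* running mean of observed times per impl */
    int    samples[N_IMPLS];    /* number of observations per impl         */
    double epsilon;             /* exploration probability, e.g. 0.05      */
} selector_t;

/* Pick an implementation: usually the fastest seen so far, occasionally a
 * random one so that changed cluster conditions (Figure 3) are detected. */
static int select_impl(const selector_t *s)
{
    if ((double)rand() / RAND_MAX < s->epsilon)
        return rand() % N_IMPLS;                 /* explore */
    int best = 0;
    for (int i = 1; i < N_IMPLS; i++)            /* exploit */
        if (s->mean_time[i] < s->mean_time[best])
            best = i;
    return best;
}

/* Fold a newly measured collective time into the running mean. */
static void record_time(selector_t *s, int impl, double t)
{
    s->samples[impl]++;
    s->mean_time[impl] += (t - s->mean_time[impl]) / s->samples[impl];
}

The occasional random choices are what separate the PASS mean from the median in Figure 5: exploration sometimes pays a small cost now in exchange for noticing when a different implementation has become the fastest.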