MPI implementation – collective communication MPI_Bcast implementation.

Presentation transcript:

MPI implementation – collective communication MPI_Bcast implementation

Collective routines
A collective communication involves a group of processes.
–Assumption: the collective operation is realized on top of point-to-point communications.
–There are many ways (algorithms) to carry out a collective operation with point-to-point operations.
How do we choose the best algorithm?

Two-phase design
Design collective algorithms under an abstract model:
–Ignore physical constraints such as topology, network contention, etc.
–Obtain a theoretically efficient algorithm under the model.
Then map the algorithm effectively onto the physical system:
–Contention-free communication.

Design collective algorithms under an abstract model
A typical system model:
–All processes are connected by an interconnect that provides the same capacity between every pair of processes.

Design collective algorithms under an abstract model
Models for point-to-point communication cost (time):
–Linear model: T(m) = c*m. OK when m is very large.
–Hockney's model: T(m) = a + c*m, where a is the latency term and c is the bandwidth (per-byte) term.
–LogP family of models.
–Other, more complex models.
Typical cost (time) model for the whole operation:
–All processes start at the same time.
–Time = last completion time – start time.

MPI_Bcast
(Figure: the root's buffer A is replicated so that every process ends up holding a copy of A.)

First try: the root sends to all receivers (flat tree algorithm)

if (myrank == root) {
  for (i = 0; i < nprocs; i++)
    if (i != root)
      MPI_Send(data, …, i, …);
} else {
  MPI_Recv(data, …, root, …);
}
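For concreteness, a minimal compilable version of the flat-tree idea is sketched below; the buffer name, element count, and MPI_CHAR datatype are placeholders chosen for this sketch, not taken from the slides.

#include <mpi.h>

/* Flat-tree broadcast sketch: the root sends the whole buffer to every
   other rank using point-to-point messages; called on all ranks of comm. */
int bcast_flat_tree(char *buf, int count, int root, MPI_Comm comm)
{
    int rank, nprocs, i;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    if (rank == root) {
        for (i = 0; i < nprocs; i++)
            if (i != root)    /* the root already has the data */
                MPI_Send(buf, count, MPI_CHAR, i, 0, comm);
    } else {
        MPI_Recv(buf, count, MPI_CHAR, root, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}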

Broadcast time using Hockney's model?
–Communication time = (P-1) * (a + c*m)
Can we do better than that? What is the lower bound of the communication time for this operation?
–In the latency term: how many communication steps does it take to complete the broadcast?
–In the bandwidth term: how much data must each node send/receive to complete the operation?

Lower bound?
In the latency term (a):
–How many steps does it take to complete the broadcast?
–1, 2, 4, 8, 16, … → at least log(P) steps.
In the bandwidth term (c):
–How much data must each process send/receive to complete the operation?
–Each node must receive at least one full message: lower_bound (bandwidth) = c*m
Combined lower bound = log(P)*a + c*m
–For small messages (m is small): we optimize the log(P)*a term.
–For large messages (c*m >> P*a): we optimize the c*m term.
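Restating the argument as a worked bound under Hockney's model T(m) = a + c*m (a paraphrase of the slide's reasoning, not a formula from the slides):

% Latency: each communication step can at most double the number of
% processes that already hold the message, so reaching all P processes
% takes at least \lceil \log_2 P \rceil steps:
T_{\mathrm{bcast}} \;\ge\; \lceil \log_2 P \rceil \, a
% Bandwidth: every non-root process must receive all m bytes at least once:
T_{\mathrm{bcast}} \;\ge\; c\,m
% Combined target used on the following slides (the two bounds are not
% necessarily attainable at the same time):
T_{\mathrm{bcast}} \;\gtrsim\; \lceil \log_2 P \rceil \, a + c\,m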

Flat tree is not optimal in either a or c!
Binary broadcast tree:
–Much more concurrency.
Communication time?
–Each internal node sends to its two children one after the other, so: 2*(a+c*m)*tree_height = 2*(a+c*m)*log(P)

A better broadcast tree: the binomial tree
Number of steps needed: log(P)
Communication time? (a+c*m)*log(P)
The latency term is optimal; this algorithm is widely used to broadcast small messages.
Step 1: 0 → 1
Step 2: 0 → 2, 1 → 3
Step 3: 0 → 4, 1 → 5, 2 → 6, 3 → 7
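A hedged sketch of this binomial pattern with point-to-point calls is shown below; it assumes the root is rank 0 (a general root would work with ranks relative to the root), and the buffer/datatype are placeholders.

#include <mpi.h>

/* Binomial-tree broadcast sketch (root fixed at rank 0).  In step k,
   with mask = 2^(k-1), every rank < mask forwards the data to
   rank + mask, matching the step-by-step pattern on the slide. */
int bcast_binomial(char *buf, int count, MPI_Comm comm)
{
    int rank, nprocs, mask;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (mask = 1; mask < nprocs; mask <<= 1) {
        if (rank < mask) {                      /* I already have the data */
            if (rank + mask < nprocs)
                MPI_Send(buf, count, MPI_CHAR, rank + mask, 0, comm);
        } else if (rank < 2 * mask) {           /* my turn to receive this step */
            MPI_Recv(buf, count, MPI_CHAR, rank - mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
    return MPI_SUCCESS;
}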

Optimizing the bandwidth term
We don't want the root to send the whole message to every receiver – that alone spends the c*m bandwidth budget many times over.
–Chop the data into small chunks.
–Scatter-allgather algorithm.

Scatter-allgather algorithm
P0 sends 2*P messages of size m/P.
Time: 2*P * (a + c*m/P) = 2*P*a + 2*c*m
–The bandwidth term is close to optimal.
–This algorithm is used in MPICH for broadcasting large messages.
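The structure of the algorithm can be illustrated with MPI's own scatter and allgather calls, as in the sketch below; this is only an illustration (it assumes count is divisible by the number of processes), not the MPICH implementation itself.

#include <mpi.h>

/* Scatter-allgather broadcast sketch: phase 1 scatters one m/P chunk to
   each process; phase 2 allgathers the chunks so every process ends up
   with the full buffer.  Assumes count % nprocs == 0. */
int bcast_scatter_allgather(char *buf, int count, int root, MPI_Comm comm)
{
    int rank, nprocs, chunk;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    chunk = count / nprocs;

    if (rank == root)    /* root's own chunk stays in place */
        MPI_Scatter(buf, chunk, MPI_CHAR, MPI_IN_PLACE, chunk, MPI_CHAR, root, comm);
    else                 /* others receive their chunk into its slot in buf */
        MPI_Scatter(NULL, chunk, MPI_CHAR, buf + rank * chunk, chunk, MPI_CHAR, root, comm);

    /* each rank contributes the chunk already sitting at buf + rank*chunk */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, buf, chunk, MPI_CHAR, comm);
    return MPI_SUCCESS;
}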

How about chopping the message even further: linear-tree pipelined broadcast (bcast-linear.c)
–S segments, each of m/S bytes.
–Total steps: S+P-1
–Time: (S+P-1)*(a + c*m/S) = (S+P-1)*a + ((S+P-1)/S)*c*m
–When S >> P-1, (S+P-1)/S ≈ 1, so Time ≈ (S+P-1)*a + c*m: near optimal in the bandwidth term.
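A hedged sketch of the pipelined pattern is given below (this is not the course's bcast-linear.c, which is not reproduced here); it assumes the pipeline order is simply rank 0 → 1 → … → P-1 with rank 0 as the root, and treats the number of segments as a tuning parameter.

#include <mpi.h>

/* Linear-tree pipelined broadcast sketch: the message is cut into
   segments; each rank receives a segment from its predecessor and
   forwards it to its successor, so segments flow down the chain. */
int bcast_pipeline(char *buf, int count, int nseg, MPI_Comm comm)
{
    int rank, nprocs, off, len, seg;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    seg = (count + nseg - 1) / nseg;             /* segment size; last one may be shorter */

    for (off = 0; off < count; off += seg) {
        len = (off + seg <= count) ? seg : count - off;
        if (rank != 0)                           /* receive segment from predecessor */
            MPI_Recv(buf + off, len, MPI_CHAR, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        if (rank != nprocs - 1)                  /* forward segment to successor */
            MPI_Send(buf + off, len, MPI_CHAR, rank + 1, 0, comm);
    }
    return MPI_SUCCESS;
}

A production version would overlap communication, e.g. posting MPI_Irecv for segment s+1 while sending segment s, but the blocking form above is enough to show the pipeline structure.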

Summary
Under the abstract model:
–For small messages: binomial tree.
–For very large messages: linear-tree pipeline.
–For medium-sized messages: ???
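One way to answer the medium-size question is to plug measured values of a and c into the Hockney-model formulas from the previous slides and pick the algorithm with the smallest predicted time. The helpers below sketch that comparison; the parameter values are whatever your own measurements give, not numbers from the slides.

#include <math.h>

/* Predicted broadcast times under Hockney's model T(m) = a + c*m. */
double t_flat(double a, double c, double P, double m)       { return (P - 1) * (a + c * m); }
double t_binomial(double a, double c, double P, double m)   { return ceil(log2(P)) * (a + c * m); }
double t_scatter_ag(double a, double c, double P, double m) { return 2 * P * a + 2 * c * m; }
double t_pipeline(double a, double c, double P, double m, double S)
                                                            { return (S + P - 1) * (a + c * m / S); }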

Second phase: mapping the theoretically good algorithms onto the actual system
Algorithms for small messages can usually be applied directly.
–Small messages usually do not cause networking issues.
Algorithms for large messages usually need attention.
–Large messages can easily cause network contention problems.

Realizing linear-tree pipelined broadcast on an SMP/multicore cluster (e.g. linprog1 + linprog2)
An SMP/multicore cluster roughly forms a tree topology.

Linear pipelined broadcast on a tree topology
Communication pattern in the linear pipelined algorithm:
–Let F: {0, 1, …, P-1} → {0, 1, …, P-1} be a one-to-one mapping function. The pattern can be F(0) → F(1) → F(2) → … → F(P-1).
–To achieve maximum performance, we need to find a mapping such that F(0) → F(1) → F(2) → … → F(P-1) does not have contention.

An example of a bad mapping: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
–The S0 → S1 link must carry the traffic for 0 → 1, 2 → 3, 4 → 5, and 6 → 7.
A good mapping: 0 → 2 → 4 → 6 → 1 → 3 → 5 → 7
–The S0 → S1 link only carries the traffic for 6 → 1.
(Here the even-numbered nodes are attached to switch S0 and the odd-numbered nodes to switch S1.)

Algorithm for finding a contention-free mapping of the linear pipelined pattern on a tree
–Starting from the switch connected to the root, perform a depth-first search (DFS).
–Number the switches in DFS order.
–Group the machines connected to each switch; order the groups by the DFS switch number.
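A hedged sketch of this procedure in C is shown below; the topology representation (switch adjacency lists and per-switch machine lists) is a hypothetical input format invented for this illustration.

#define MAX_SWITCHES 64
#define MAX_MACHINES 1024

int nswitches;                               /* number of switches              */
int adj[MAX_SWITCHES][MAX_SWITCHES];         /* adj[s][i]: i-th neighbor of s   */
int deg[MAX_SWITCHES];                       /* number of neighbors of switch s */
int machines[MAX_SWITCHES][MAX_MACHINES];    /* machines attached to switch s   */
int nmachines[MAX_SWITCHES];

/* DFS over the switch tree: emit the machines attached to the current
   switch, then visit its child switches, which yields the grouping and
   ordering described above. */
static void dfs(int sw, int parent, int order[], int *pos)
{
    int i;
    for (i = 0; i < nmachines[sw]; i++)
        order[(*pos)++] = machines[sw][i];
    for (i = 0; i < deg[sw]; i++)
        if (adj[sw][i] != parent)
            dfs(adj[sw][i], sw, order, pos);
}

/* order[] receives the contention-free pipeline order F(0), ..., F(P-1);
   root_switch is the switch the broadcast root is attached to. */
void contention_free_order(int root_switch, int order[])
{
    int pos = 0;
    dfs(root_switch, -1, order, &pos);
}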

Example: the contention-free linear pattern for the example topology (figure omitted) is n0 → n1 → n8 → n9 → n16 → n17 → n24 → n25 → n2 → n3 → n10 → n11 → n18 → n19 → n26 → n27 → n4 → n5 → n12 → n13 → n20 → n21 → n28 → n29 → n6 → n7 → n14 → n15 → n22 → n23 → n30 → n31

More on this broadcast study can be found in our paper:
–P. Patarasuk, A. Faraj, and X. Yuan, "Pipelined Broadcast on Ethernet Switched Clusters," Journal of Parallel and Distributed Computing, 68(6), June 2008.