Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links
Ernie Chan


Authors
- Ernie Chan, Robert van de Geijn (Department of Computer Sciences, The University of Texas at Austin)
- William Gropp, Rajeev Thakur (Mathematics and Computer Science Division, Argonne National Laboratory)

Outline
- Testbed Architecture
- Model of Parallel Computation
- Sending Simultaneously
- Collective Communication
- Generalized Algorithms
- Performance Results
- Conclusion

Testbed Architecture
- IBM Blue Gene/L
  - 3D torus point-to-point interconnect network
  - One rack: 1024 dual-processor nodes, two 8 x 8 x 8 midplanes
  - Special feature to send simultaneously: use multiple calls to MPI_Isend
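
The deck shows no code, but the "multiple calls to MPI_Isend" bullet suggests a pattern like the minimal sketch below: post nonblocking sends to several torus neighbors so the hardware can drive multiple links at once, then wait for all of them. The function name, neighbor list, and tag are illustrative, not taken from the authors' implementation.

```c
/* Minimal sketch (not the authors' code): drive several torus links at once by
 * posting nonblocking sends to multiple neighbors, then waiting for completion.
 * Assumes at most 2N = 6 neighbors (3D torus); ranks and tag are illustrative. */
#include <mpi.h>

void send_to_neighbors(void *buf, int count, const int *neighbors,
                       int num_neighbors, MPI_Comm comm)
{
    MPI_Request reqs[6];                       /* 2N = 6 links on a 3D torus */
    for (int i = 0; i < num_neighbors; i++)
        MPI_Isend(buf, count, MPI_BYTE, neighbors[i], 0, comm, &reqs[i]);
    MPI_Waitall(num_neighbors, reqs, MPI_STATUSES_IGNORE);
}
```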


Model of Parallel Computation
- Target architectures: distributed-memory parallel architectures
- Indexing: p computational nodes, indexed 0 … p - 1
- Logically fully connected: a node can send directly to any other node

Model of Parallel Computation
- Topology: N-dimensional torus

Model of Parallel Computation
- Old model of communicating between nodes:
  - Unidirectional sending or receiving
  - Simultaneous sending and receiving
  - Bidirectional exchange

Model of Parallel Computation
- New model of communicating between nodes:
  - A node can send or receive with 2N other nodes simultaneously along its 2N different links
  - It cannot perform a bidirectional exchange on any link while sending or receiving simultaneously with multiple nodes

Model of Parallel Computation
- Cost of communicating a message: α + nβ
  - α: startup time (latency)
  - n: number of bytes to communicate
  - β: per-byte transfer time (the reciprocal of the bandwidth)


Sending Simultaneously
- Old cost of communication with sends to multiple nodes
  - Cost to send to m separate nodes: (α + nβ) m

Sending Simultaneously
- New cost of communication with simultaneous sends
  - (α + nβ) m can be replaced with (α + nβ) + (α + nβ)(m - 1) τ
    - (α + nβ): cost of one send
    - (α + nβ)(m - 1) τ: cost of the extra sends
    - 0 ≤ τ ≤ 1
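
To make the two cost expressions concrete, here is a small sketch that evaluates the old model (α + nβ)m against the simultaneous-send model (α + nβ)(1 + (m - 1)τ). The numeric values of α, β, τ, and n below are placeholders for illustration, not measurements from the Blue Gene/L benchmarks.

```c
/* Sketch: compare the old cost model, (alpha + n*beta) * m, with the
 * simultaneous-send model, (alpha + n*beta) * (1 + (m - 1) * tau), 0 <= tau <= 1.
 * All parameter values below are illustrative placeholders. */
#include <stdio.h>

static double cost_sequential(double alpha, double beta, double n, int m)
{
    return (alpha + n * beta) * m;
}

static double cost_simultaneous(double alpha, double beta, double n, int m, double tau)
{
    return (alpha + n * beta) * (1.0 + (m - 1) * tau);
}

int main(void)
{
    double alpha = 3.0e-6, beta = 6.4e-9, tau = 0.1, n = 1048576.0;
    for (int m = 1; m <= 6; m++)               /* up to 2N = 6 simultaneous sends */
        printf("m=%d  sequential=%.4g s  simultaneous=%.4g s\n",
               m, cost_sequential(alpha, beta, n, m),
               cost_simultaneous(alpha, beta, n, m, tau));
    return 0;
}
```

With τ = 0 the extra sends are free; with τ = 1 the model degenerates to the old sequential one.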

Sending Simultaneously
- Benchmarking sending simultaneously
  - Log-log timing graphs
  - Midplane: 512 nodes
  - Sending simultaneously with 1 – 6 neighbors
  - Message sizes from 8 bytes to 4 MB

Sending Simultaneously
- [Log-log benchmark timing graphs]
- Cost of communication with simultaneous sends: (α + nβ)(1 + (m - 1) τ)


Collective Communication
- Broadcast (Bcast): motivating example [before/after diagram]
- Scatter [before/after diagram]
- Allgather [before/after diagram]

Collective Communication
- Broadcast can be implemented as a Scatter followed by an Allgather

- [Diagram: the Scatter step followed by the Allgather step]
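
A straightforward way to realize this composition with stock MPI collectives is sketched below. It is a minimal illustration, assuming a message length divisible by the number of ranks and omitting error handling; it is not the authors' tuned Blue Gene/L implementation.

```c
/* Sketch: long-vector broadcast as MPI_Scatter followed by MPI_Allgather.
 * Assumes nbytes is divisible by the number of ranks; error handling omitted. */
#include <mpi.h>
#include <stdlib.h>

void bcast_scatter_allgather(char *buf, int nbytes, int root, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int chunk = nbytes / p;                    /* each rank ends up owning one chunk */
    char *piece = malloc(chunk);

    /* Scatter: the root deals one chunk of the message to every rank. */
    MPI_Scatter(buf, chunk, MPI_BYTE, piece, chunk, MPI_BYTE, root, comm);
    /* Allgather: every rank collects all chunks, reassembling the full message. */
    MPI_Allgather(piece, chunk, MPI_BYTE, buf, chunk, MPI_BYTE, comm);

    free(piece);
}
```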

Collective Communication
- Lower bounds (latency):
  - Broadcast: log_{2N+1}(p) α
  - Scatter: log_{2N+1}(p) α
  - Allgather: log_{2N+1}(p) α

Collective Communication
- Lower bounds (bandwidth):
  - Broadcast: nβ / (2N)
  - Scatter: ((p - 1) / p) nβ / (2N)
  - Allgather: ((p - 1) / p) nβ / (2N)


Generalized Algorithms
- Short-vector algorithms: Minimum-Spanning Tree (MST)
- Long-vector algorithms: Bucket Algorithm

Generalized Algorithms: Minimum-Spanning Tree

Generalized Algorithms: Minimum-Spanning Tree
- Recursively divide the network of nodes in half
- Cost of the MST Bcast: log_2(p) (α + nβ)
- What if we can send to N nodes simultaneously?
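
For reference, the standard (binary) MST broadcast whose cost is log_2(p)(α + nβ) can be written with plain point-to-point MPI as in the sketch below. This is the textbook binomial-tree pattern rooted at rank 0, not the generalized 2N+1-way version developed on the following slides.

```c
/* Sketch: minimum-spanning-tree (binomial) broadcast rooted at rank 0,
 * using only point-to-point MPI.  log2(p) rounds, each costing alpha + n*beta. */
#include <mpi.h>

void mst_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Receive phase: every rank except 0 receives exactly once, from the rank
     * obtained by clearing its lowest set bit (its parent in the binomial tree). */
    int mask = 1;
    while (mask < p) {
        if (rank & mask) {
            MPI_Recv(buf, count, type, rank - mask, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Send phase: forward to the subtrees this rank is responsible for,
     * halving the distance each round (recursive halving of the node set). */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < p)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        mask >>= 1;
    }
}
```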

Generalized Algorithms: Minimum-Spanning Tree
- Divide the p nodes into N + 1 partitions
- The partitions must be disjoint on the N-dimensional mesh
- Divide the dimensions using a counter that decrements from N
- Now divide into 2N + 1 partitions instead

Generalized Algorithms: Minimum-Spanning Tree
- Cost of the new generalized MST Bcast: log_{2N+1}(p) (α + nβ)
- Attains the lower bound for latency!

Generalized Algorithms: Minimum-Spanning Tree
- MST Scatter: only send the data that must reside in that partition at each step
- Cost of the new generalized MST Scatter: log_{2N+1}(p) α + ((p - 1) / p) nβ / (2N)
- Attains the lower bounds for both latency and bandwidth!
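
A point-to-point sketch of the (non-generalized) MST scatter is shown below to make "only send the data that must reside in that partition" concrete. It assumes rank 0 as the root, a power-of-two number of ranks, and a caller-provided scratch buffer; the generalized 2N+1-way variant and arbitrary p are not handled.

```c
/* Sketch: binomial (MST) scatter rooted at rank 0 for a power-of-two number of
 * ranks.  At each step a subtree root forwards only the blocks destined for the
 * other half of its subtree.  'tmp' is caller-provided scratch of p*blocksize bytes. */
#include <mpi.h>
#include <string.h>

void mst_scatter(const char *sendbuf, char *recvbuf, int blocksize,
                 char *tmp, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                   /* assumed to be a power of two */

    if (rank == 0)
        memcpy(tmp, sendbuf, (size_t)p * blocksize);

    /* Receive the chunk of blocks covering this rank's subtree ('mask' ranks). */
    int mask = 1;
    while (mask < p) {
        if (rank & mask) {
            MPI_Recv(tmp, mask * blocksize, MPI_BYTE, rank - mask, 0,
                     comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward the upper half of the held blocks to the root of the far subtree. */
    mask >>= 1;
    while (mask > 0) {
        MPI_Send(tmp + (size_t)mask * blocksize, mask * blocksize, MPI_BYTE,
                 rank + mask, 0, comm);
        mask >>= 1;
    }

    memcpy(recvbuf, tmp, blocksize);           /* the first held block is ours */
}
```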

Generalized Algorithms: Bucket Algorithm

Generalized Algorithms: Bucket Algorithm
- Send data messages of size n/p at each step
- Cost of the Bucket Allgather: (p - 1) α + ((p - 1) / p) nβ
- What if we can send to N nodes simultaneously?
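
The single-ring version of the bucket allgather, which realizes the (p - 1)α + ((p - 1)/p)nβ cost above, can be sketched as follows. The buffer layout (p equal-sized blocks, block i owned by rank i) is an assumption for illustration, and the torus-specific multi-ring generalization comes next.

```c
/* Sketch: bucket (ring) allgather.  Each rank starts with its own block at
 * offset rank*blocksize in buf; after p-1 steps every rank holds all p blocks.
 * Each step passes the most recently received block one hop around the ring. */
#include <mpi.h>

void bucket_allgather(char *buf, int blocksize, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;                /* send direction */
    int left  = (rank - 1 + p) % p;            /* receive direction */

    for (int step = 0; step < p - 1; step++) {
        int send_block = (rank - step + p) % p;
        int recv_block = (rank - step - 1 + 2 * p) % p;
        MPI_Sendrecv(buf + (size_t)send_block * blocksize, blocksize, MPI_BYTE,
                     right, 0,
                     buf + (size_t)recv_block * blocksize, blocksize, MPI_BYTE,
                     left, 0, comm, MPI_STATUS_IGNORE);
    }
}
```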

Generalized Algorithms: Bucket Algorithm
- Collect data around N buckets simultaneously
- However, a node cannot send to N neighbors at each step

Generalized Algorithms: Bucket Algorithm
- Assume collecting data in buckets is free in all but one dimension
- Let D be an N-tuple giving the number of nodes in each dimension of the torus, so that |D| = N and ∏_{i=1}^{N} D_i = p

Generalized Algorithms: Bucket Algorithm
- Cost of the new generalized Bucket Allgather: (d - N) α + ((D_j - 1) / D_j) nβ
  where d = Σ_{i=1}^{N} D_i and j is chosen so that D_j ≥ D_i for all i ≠ j

Generalized Algorithms: Bucket Algorithm
- New generalized Bcast: an MST Scatter followed by a Bucket Allgather
- Cost of the new long-vector Bcast: (log_{2N+1}(p) + d - N) α + ((p - 1) / (2Np) + (D_j - 1) / D_j) nβ
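
The two reconstructed cost formulas can be evaluated directly; the helpers below follow the slides' symbols (D, d, D_j, α, β, n) and are only a sketch for checking the model, with the logarithm left un-ceilinged exactly as the slides write it.

```c
/* Sketch: evaluate the reconstructed cost models.  D holds the torus extents
 * D_1..D_N; alpha, beta, n follow the slides' notation.  Illustrative only. */
#include <math.h>

double bucket_allgather_cost(const int *D, int N, double alpha, double beta, double n)
{
    int d = 0, Dj = 1;
    for (int i = 0; i < N; i++) {
        d += D[i];
        if (D[i] > Dj) Dj = D[i];              /* D_j >= D_i for all i */
    }
    return (d - N) * alpha + ((double)(Dj - 1) / Dj) * n * beta;
}

double long_vector_bcast_cost(const int *D, int N, double alpha, double beta, double n)
{
    int d = 0, Dj = 1, p = 1;
    for (int i = 0; i < N; i++) {
        d += D[i];
        p *= D[i];
        if (D[i] > Dj) Dj = D[i];
    }
    double latency   = (log((double)p) / log(2.0 * N + 1.0) + d - N) * alpha;
    double bandwidth = ((double)(p - 1) / (2.0 * N * p)
                        + (double)(Dj - 1) / Dj) * n * beta;
    return latency + bandwidth;
}
```

For an 8 x 8 x 8 midplane (D = {8, 8, 8}, N = 3, p = 512), d = 24 and D_j = 8, so the allgather cost evaluates to 21α + (7/8)nβ.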


Performance Results
- Log-log timing graphs
- Collective communication operations: Broadcast, Scatter, Allgather
- Algorithms: MST, Bucket
- Message sizes from 8 bytes to 4 MB

Performance Results
- [Timing graph: single point-to-point communication]
- [Timing graph: my-bcast-MST]
- [Additional timing graphs]


Conclusion
- IBM Blue Gene/L supports sending simultaneously over multiple links
  - Benchmarking, along with model checking, verifies this claim
- The new generalized algorithms show clear performance gains

Conclusion
- Future directions:
  - Room for optimization to reduce implementation overhead
  - What if we are not using MPI_COMM_WORLD?
  - A possible new algorithm for the Bucket Algorithm