Scaling Collective Multicast on Fat-tree Networks
Sameer Kumar, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
ICPADS '04

Slide 1: Scaling Collective Multicast on Fat-tree Networks. Sameer Kumar, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign. ICPADS '04.

Slide 2: Collective Communication
- Communication operation in which all processors, or a large subset, participate; broadcast is one example.
- Often a performance impediment.
- All-to-all communication:
  - All-to-all personalized communication (AAPC)
  - All-to-all multicast (AAM)

Slide 3: Communication Model
- Overhead of a point-to-point message: T_p2p = α + mβ, where α is the total software overhead of sending the message, β is the per-byte network overhead, and m is the size of the message.
- Direct all-to-all overhead: T_AAM = (P - 1) × (α + mβ).
- The α term dominates when m is small; the β term dominates when m is large (see the sketch below).
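As a rough illustration, the two regimes of this model can be evaluated numerically. A minimal sketch, reusing the 9 µs and 294 MB/s figures quoted later in the deck; the node count is made up.

```c
#include <stdio.h>

/* Cost model from the slide: T_p2p = alpha + m*beta and
 * T_AAM = (P-1) * (alpha + m*beta).  Constants are illustrative. */
static double t_p2p(double alpha, double beta, double m) {
    return alpha + m * beta;
}

static double t_aam_direct(int P, double alpha, double beta, double m) {
    return (P - 1) * t_p2p(alpha, beta, m);
}

int main(void) {
    double alpha = 9e-6;        /* 9 us software overhead (slide 27) */
    double beta  = 1.0 / 294e6; /* per-byte cost at 294 MB/s (slide 27) */
    int P = 512;                /* hypothetical node count */

    /* Small message: the alpha term dominates. */
    printf("m = 64 B: %.3f ms\n", t_aam_direct(P, alpha, beta, 64) * 1e3);
    /* Large message: the beta term dominates. */
    printf("m = 1 MB: %.3f ms\n", t_aam_direct(P, alpha, beta, 1 << 20) * 1e3);
    return 0;
}
```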

Slide 4: Optimization Strategies
- Short messages: parameter α dominates.
  - Message combining to reduce the total number of messages.
  - Multistage algorithms that send messages along a virtual topology.
- Large messages: parameter β dominates, and network contention becomes the issue.
  - Network-topology-specific optimizations that minimize contention.

Slide 5: Direct Strategies
- Direct strategies optimize all-to-all multicast for large messages.
- They minimize network contention through topology-specific optimizations that take advantage of contention-free schedules.

Slide 6: Fat-tree Networks
- Popular network topology for clusters.
- Bisection bandwidth is O(P).
- The network scales to several thousand nodes.
- Topology: k-ary n-tree.

Slide 7: k-ary n-trees
[Figure: (a) 4-ary 1-tree, (b) 4-ary 2-tree, (c) 4-ary 3-tree.]

Slide 8: Contention-Free Permutations
- Fat-trees have a nice property: some processor permutations are contention free.
- Prefix permutation k: processor i sends data to processor i XOR k (illustrated on the next four slides).
- Cyclic shift by k: processor i sends a message to processor (i + k) mod P; contention free for certain values of k.
- Contention-free permutations were presented by Heller et al. for the CM-5.
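A small sketch of the two destination functions, under the assumption (consistent with slides 9-13) that prefix permutations XOR the rank and cyclic shifts add modulo P:

```c
#include <stdio.h>

/* Destinations for the two contention-free permutation families.
 * P is assumed to be a power of two, matching the fat-tree arity. */
static int prefix_perm(int i, int k)         { return i ^ k; }
static int cyclic_shift(int i, int k, int P) { return (i + k) % P; }

int main(void) {
    int P = 16;
    for (int i = 0; i < P; i++)
        printf("i = %2d   prefix(2) -> %2d   shift(2) -> %2d\n",
               i, prefix_perm(i, 2), cyclic_shift(i, 2, P));
    return 0;
}
```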

Slide 9: Prefix Permutation by 1. Processor p sends to p XOR 1.

Slide 10: Prefix Permutation by 2. Processor p sends to p XOR 2.

Slide 11: Prefix Permutation by 3. Processor p sends to p XOR 3.

Slide 12: Prefix Permutation by 4. Processor p sends to p XOR 4.

Slide 13: Cyclic Shift by k. [Figure: cyclic shift by 2.]

Slide 14: Quadrics: HPC Interconnect
- Popular interconnect; several machines in the Top500 use Quadrics, including Pittsburgh's Lemieux (6 TF) and ASCI Q (20 TF).
- Features: low latency (5 µs for MPI), high bandwidth (320 MB/s per node), fat-tree topology, scales to 2K nodes.

Slide 15: Effect of Contention on Throughput
[Plot: node bandwidth (MB/s) vs. the k-th permutation.]
- Bandwidth drops at k = 4, 16, 64.
- Sending data from main memory is much slower.

Slide 16: Performance Bottlenecks
- 320-byte packet size; the packet protocol restricts bandwidth to faraway nodes.
- PCI/DMA bandwidth is restrictive: achievable bandwidth is only 128 MB/s.

Slide 17: Quadrics Packet Protocol
[Diagram: the sender sends a header and payload for the first packet; the receiver acks; the sender sends the next packet only after the first has been acked.]
- For nearby nodes the ack returns quickly, giving full link utilization.

Slide 18: Far-Away Messages
[Diagram: the same protocol, but the ack from a faraway node takes much longer to return.]
- For faraway nodes the sender idles between packets, giving low utilization (see the sketch below).
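The utilization loss follows from the one-outstanding-packet protocol: effective bandwidth is bounded by packet size over round-trip time. A back-of-the-envelope sketch; the RTT values are hypothetical, and only the 320-byte packet size comes from the slides.

```c
#include <stdio.h>

/* Stop-and-wait bound: with one outstanding packet, effective
 * bandwidth = packet_size / round_trip_time. */
int main(void) {
    const double pkt = 320.0;                      /* bytes (slide 16) */
    const double rtt_us[] = {1.0, 2.0, 4.0, 8.0};  /* hypothetical RTTs */
    for (int i = 0; i < 4; i++) {
        double mbps = pkt / (rtt_us[i] * 1e-6) / 1e6;
        printf("RTT %.0f us -> at most %.0f MB/s\n", rtt_us[i], mbps);
    }
    return 0;
}
```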

Slide 19: AAM on Fat-tree Networks
- Overcome the bottlenecks:
  - Messages sent from NIC memory have 2.5× better performance.
  - Avoid sending messages to faraway nodes.
  - Use contention-free permutations (a permutation here means every processor sends a message to a different destination).

Slide 20: AAM Strategy: Ring
- Performs all-to-all multicast by sending messages along a ring formed by the processors 0, 1, 2, …, i, i+1, …, P-1.
- Equivalent to P-1 cyclic-shift-by-1 operations; congestion free.
- Has appeared in the literature before.
- Drawback: processors send different messages in each step (visible in the forwarding logic of the sketch below).
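A minimal MPI sketch of the ring strategy, assuming an allgather-style interface (each rank contributes one m-byte block and ends up with all P blocks); an illustration, not the paper's Elan-level implementation:

```c
#include <mpi.h>
#include <string.h>

/* Ring AAM (sketch): in step s every rank forwards the block it
 * received in step s-1 to its right neighbour, so after P-1 steps
 * every rank holds every block.  'all' must hold P*m bytes. */
void aam_ring(const char *mine, char *all, int m, MPI_Comm comm) {
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int right = (rank + 1) % P, left = (rank - 1 + P) % P;

    memcpy(all + (size_t)rank * m, mine, m);       /* own block */
    for (int s = 1; s < P; s++) {
        int send_idx = (rank - s + 1 + P) % P;     /* block forwarded now */
        int recv_idx = (rank - s + P) % P;         /* block arriving now */
        MPI_Sendrecv(all + (size_t)send_idx * m, m, MPI_CHAR, right, 0,
                     all + (size_t)recv_idx * m, m, MPI_CHAR, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```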

Slide 21: Prefix-Send Strategy
- P-1 prefix permutations: in stage j, processor i sends a message to processor i XOR (j+1).
- Congestion free; can send messages from Elan memory.
- Bad performance on large fat-trees: it sends P/2 messages to faraway nodes at distance P/2 or more, and wire/switch delays restrict performance (see the sketch below).
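The same hypothetical allgather-style interface as the ring sketch above; each rank always sends its own block, so unlike the ring there is no forwarding:

```c
#include <mpi.h>
#include <string.h>

/* Prefix-send AAM (sketch): stage j pairs rank i with i XOR j, which
 * is its own inverse, so each stage is a pairwise exchange.
 * P is assumed to be a power of two. */
void aam_prefix_send(const char *mine, char *all, int m, MPI_Comm comm) {
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    memcpy(all + (size_t)rank * m, mine, m);
    for (int j = 1; j < P; j++) {
        int partner = rank ^ j;
        MPI_Sendrecv(mine, m, MPI_CHAR, partner, 0,
                     all + (size_t)partner * m, m, MPI_CHAR, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```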

Slide 22: k-Prefix Strategy
- Hybrid of the ring and prefix-send strategies: prefix send is used within partitions of size k, and a ring is used between the partitions.
- Our contribution!
[Figure: a ring across fat-tree partitions of size k, with prefix send within each partition.]
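A destination schedule for k-Prefix might look like the following; this is one plausible reading of the slide (prefix send inside a size-k partition, then partition-level cyclic shifts), not the paper's exact schedule:

```c
/* Hypothetical k-Prefix schedule: steps 0..k-2 are XOR exchanges
 * inside the size-k partition; the remaining steps shift whole
 * partitions around a ring, each node keeping its local index.
 * k and P/k are assumed to be powers of two. */
int kprefix_dest(int i, int step, int k, int P) {
    int part = i / k, local = i % k;
    if (step < k - 1)                  /* within-partition prefix send */
        return part * k + (local ^ (step + 1));
    int shift = step - k + 2;          /* ring across partitions */
    return ((part + shift) % (P / k)) * k + local;
}
```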

Slide 23: Performance
[Table: node bandwidth (MB/s) each way, with columns Nodes / MPI / Prefix / K-Prefix; the values did not survive transcription.]
- Our strategies send messages from Elan memory.

Slide 24: Cost Equation
- α: host and network software overhead.
- α_b: cost of a barrier (barriers are needed to synchronize the nodes).
- β_em: per-byte network transmission cost.
- δ: copying overhead to NIC memory.
- P: number of processors.
- k: size of the partition in k-Prefix.
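The equation itself did not survive transcription. A plausible form consistent with the symbol list and the strategy (copy once to NIC memory, P-1 transmission steps, one barrier per partition round) would be the following; this is an assumption, not the paper's exact expression:

$$T_{k\text{-Prefix}} \approx \delta m + (P-1)\,(\alpha + \beta_{em}\, m) + \frac{P}{k}\,\alpha_b$$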

Slide 25: k-Prefixlb Strategy
- The k-Prefixlb strategy synchronizes the nodes after every few steps.

Slide 26: CPU Overhead
- Strategies should also be evaluated on compute overhead.
- Asynchronous, non-blocking primitives are needed (see the sketch below).
- A data-driven system like Charm++ supports this automatically.
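For example, an MPI-3 non-blocking collective lets the host compute while the AAM progresses; MPI_Iallgather is used here as a stand-in for an asynchronous AAM interface, not the interface the paper used:

```c
#include <mpi.h>

/* Overlap sketch: start the AAM (allgather), do useful work, then
 * wait.  The CPU overhead of a strategy is whatever it steals from
 * the compute() phase. */
void aam_overlap(const char *mine, char *all, int m,
                 MPI_Comm comm, void (*compute)(void)) {
    MPI_Request req;
    MPI_Iallgather(mine, m, MPI_CHAR, all, m, MPI_CHAR, comm, &req);
    compute();                        /* overlapped computation */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```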

Slide 27: Predicted vs. Actual Performance
[Plot: predicted vs. measured node bandwidth.]
- The predicted plot assumes α = 9 µs, α_b = 15 µs, and β and δ corresponding to 294 MB/s.

Slide 28: Missing Nodes
- Missing nodes arise when nodes in the fat tree are down.
- Prefix-send and k-Prefix do badly in this scenario.
[Table: node bandwidth with 1 missing node, with columns Nodes / MPI / Prefix-Send / K-Prefix; the values did not survive transcription.]

Slide 29: k-Shift Strategy
- Processor i sends data to the consecutive nodes [i-k/2+1, …, i-1, i+1, …, i+k/2] and to i+k (see the sketch below).
- Contention free, with good performance on non-contiguous nodes when k = 8.
- Our contribution!
- k-Shift gains because most of the destinations for each node do not change in the presence of missing nodes.
[Table: node bandwidth (MB/s) with one missing node, comparing K-Prefix and K-Shift; the values did not survive transcription.]
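A sketch of the k-Shift destination list, assuming destinations wrap modulo P (the slide does not show the wrap-around handling):

```c
/* k-Shift destinations for node i (sketch): the k-1 consecutive
 * neighbours i-k/2+1 .. i+k/2 excluding i itself, plus i+k,
 * all taken modulo P.  'dest' must hold at least k entries. */
int kshift_dests(int i, int k, int P, int *dest) {
    int n = 0;
    for (int d = -k / 2 + 1; d <= k / 2; d++)
        if (d != 0)
            dest[n++] = ((i + d) % P + P) % P;
    dest[n++] = (i + k) % P;
    return n;                 /* number of destinations written (= k) */
}
```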

Slide 30: Conclusion
- We optimize AAM for Quadrics QsNet.
- Copying a message to the NIC and sending it from there gives more bandwidth.
- k-Prefix avoids sending messages to faraway nodes.
- Missing nodes are handled by the k-Shift strategy.
- Cluster interconnects other than Quadrics also have such problems.
- Impressive performance results.
- CPU overhead should be a metric for evaluating AAM strategies.

Slide 31: Future Work