
Slide 1: Scaling Collective Multicast Fat-tree Networks
Sameer Kumar, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign. ICPADS '04.

Slide 2 (07/07/04, ICPADS '04): Collective Communication
A collective communication operation is one in which all processors, or a large subset, participate; broadcast is one example. Collective communication is a performance impediment, particularly all-to-all communication: all-to-all personalized communication (AAPC) and all-to-all multicast (AAM).

Slide 3: Communication Model
The overhead of a point-to-point message is T_p2p = α + mβ, where α is the total software overhead of sending the message, β is the per-byte network overhead, and m is the size of the message. The direct all-to-all overhead is T_AAM = (P − 1) × (α + mβ). α dominates when m is small; β dominates when m is large.
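As a quick sanity check, the cost model above can be evaluated numerically. The α = 9 μs and 320 MB/s values below echo figures quoted elsewhere in the talk (slides 14 and 27), but this sketch is illustrative, not a reproduction of the paper's measurements:

```python
# Sketch of the point-to-point and direct all-to-all multicast cost model.
# alpha = 9e-6 s and 320 MB/s echo numbers quoted elsewhere in the talk;
# treat them as illustrative, not as the paper's measurements.

ALPHA = 9e-6          # software overhead per message (s)
BETA = 1 / 320e6      # per-byte network overhead (s/byte), from 320 MB/s

def t_p2p(m, alpha=ALPHA, beta=BETA):
    """Cost of one point-to-point message of m bytes: alpha + m * beta."""
    return alpha + m * beta

def t_aam_direct(m, p, alpha=ALPHA, beta=BETA):
    """Direct all-to-all multicast: each node sends P - 1 messages."""
    return (p - 1) * t_p2p(m, alpha, beta)

# alpha dominates for small m, beta for large m:
small = t_aam_direct(64, 256)        # overhead-bound
large = t_aam_direct(1 << 20, 256)   # bandwidth-bound
```

For a 64-byte payload the α term is roughly 45× the β term, while for a 1 MB payload the ratio flips by several orders of magnitude, which is why the two regimes call for different strategies.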

Slide 4: Optimization Strategies
Short messages: the parameter α dominates. Use message combining to reduce the total number of messages, e.g. a multistage algorithm that sends messages along a virtual topology.
Large messages: the parameter β dominates and network contention matters. Use network-topology-specific optimizations that minimize contention.

Slide 5: Direct Strategies
Direct strategies optimize all-to-all multicast for large messages by minimizing network contention, using topology-specific optimizations that take advantage of contention-free schedules.

Slide 6: Fat-tree Networks
A popular network topology for clusters: bisection bandwidth is O(P), and the network scales to several thousand nodes. Topology: k-ary n-tree.

Slide 7: k-ary n-trees
Figure: (a) 4-ary 1-tree, (b) 4-ary 2-tree, (c) 4-ary 3-tree.

Slide 8: Contention-Free Permutations
Fat-trees have a nice property: some processor permutations are contention free.
Prefix permutation by k: processor i sends data to processor i XOR k.
Cyclic shift by k: processor i sends a message to processor (i + k) mod P.
Conditions for contention-free permutations were presented by Heller et al. for the CM-5.

Slide 9: Prefix Permutation by 1
Processor p sends to p XOR 1 (illustrated on nodes 0-7).

Slide 10: Prefix Permutation by 2
Processor p sends to p XOR 2 (illustrated on nodes 0-7).

Slide 11: Prefix Permutation by 3
Processor p sends to p XOR 3 (illustrated on nodes 0-7).

Slide 12: Prefix Permutation by 4, and so on
Processor p sends to p XOR 4 (illustrated on nodes 0-7).

Slide 13: Cyclic Shift by k
Figure: cyclic shift by 2 on nodes 0-7.
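The two permutation families from the preceding slides can be sketched and verified in a few lines of Python (written for this transcript, not taken from the paper); the check confirms that each map really is a permutation, i.e. every processor receives exactly one message:

```python
# Destination maps for the two permutation families on P = 8 processors.

def prefix_dest(i, k):
    """Prefix permutation by k: processor i sends to i XOR k."""
    return i ^ k

def cyclic_dest(i, k, nprocs):
    """Cyclic shift by k: processor i sends to (i + k) mod P."""
    return (i + k) % nprocs

P = 8
for k in range(1, P):
    # each map hits every destination exactly once => a valid permutation
    assert sorted(prefix_dest(i, k) for i in range(P)) == list(range(P))
    assert sorted(cyclic_dest(i, k, P) for i in range(P)) == list(range(P))
```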

Slide 14: Quadrics: HPC Interconnect
A popular interconnect; several systems in the Top500 use Quadrics, including Pittsburgh's Lemieux (6 TF) and ASCI Q (20 TF). Features: low latency (5 μs for MPI), high bandwidth (320 MB/s per node), fat-tree topology, scales to 2K nodes.

Slide 15: Effect of Contention on Throughput
Figure: node bandwidth (MB/s) against the k-th permutation, showing drops in bandwidth at k = 4, 16, 64. Sending data from main memory is much slower.

Slide 16: Performance Bottlenecks
320-byte packet size; the packet protocol restricts bandwidth to faraway nodes; PCI/DMA bandwidth is restrictive, so the achievable bandwidth is only 128 MB/s.

Slide 17: Quadrics Packet Protocol
Nearby nodes: full utilization. The sender sends the header and payload of the first packet; the receiver acks the header; the sender sends the next packet after the first has been acked.

Slide 18: Far-Away Messages
Faraway nodes: low utilization. The exchange is the same (send header, send payload, receive ack, send the next packet), but the sender waits longer for each ack.
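A back-of-the-envelope model makes the utilization gap concrete: with a stop-and-wait protocol the sender idles for one round trip per 320-byte packet, so utilization falls as distance (and hence round-trip time) grows. The RTT values below are illustrative, not measured Quadrics numbers:

```python
# Stop-and-wait utilization: one 320-byte packet per round trip.
# RTT values are illustrative assumptions, not Quadrics measurements.

PACKET = 320        # bytes per packet (from slide 16)
LINK_BW = 320e6     # link bandwidth, bytes/s (from slide 14)

def utilization(rtt):
    """Fraction of time the link carries payload under stop-and-wait."""
    serialize = PACKET / LINK_BW            # time to put one packet on the wire
    return serialize / (serialize + rtt)

near = utilization(0.1e-6)   # nearby node: short RTT, high utilization
far = utilization(2e-6)      # faraway node: long RTT, low utilization
```

With these numbers a nearby node keeps the link about 91% busy while a faraway node drops to about 33%, which is the effect the slide's diagram illustrates.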

Slide 19: AAM on Fat-tree Networks
Overcoming the bottlenecks: messages sent from NIC memory achieve 2.5× better performance; avoid sending messages to faraway nodes; use contention-free permutations. (A permutation: every processor sends a message to a different destination.)

Slide 20: AAM Strategy: Ring
Performs all-to-all multicast by sending messages along a ring formed by the processors. Equivalent to P − 1 cyclic-shift-by-1 operations, and contention free. Has appeared in the literature before. Drawback: processors send different messages in each step.
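The ring strategy can be emulated on plain lists (a sketch written for this transcript, not code from the paper): in each of the P − 1 steps every processor forwards to its right neighbour the block it received in the previous step, so each block travels once around the ring:

```python
# Emulation of the ring AAM strategy: P - 1 cyclic-shift-by-1 steps.

def ring_aam(blocks):
    """blocks[i] is processor i's own data; returns what each ends up with."""
    nprocs = len(blocks)
    received = [[blocks[i]] for i in range(nprocs)]
    in_flight = list(blocks)                 # block each processor currently holds
    for _ in range(nprocs - 1):
        # processor i receives from i - 1 (equivalently, sends to i + 1, mod P)
        in_flight = [in_flight[(i - 1) % nprocs] for i in range(nprocs)]
        for i in range(nprocs):
            received[i].append(in_flight[i])
    return received

result = ring_aam(["a", "b", "c", "d"])
# after P - 1 steps every processor holds every block
assert all(sorted(r) == ["a", "b", "c", "d"] for r in result)
```

The drawback noted on the slide is visible here: `in_flight` changes every step, so a processor sends a different message each time rather than re-sending one buffer.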

Slide 21: Prefix-Send Strategy
P − 1 prefix permutations: in stage j, processor i sends a message to processor i XOR (j + 1). Contention free, and messages can be sent from Elan memory. But performance is bad on large fat-trees: each node sends P/2 messages to faraway nodes at distance P/2 or more, and wire/switch delays restrict performance.
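The prefix-send schedule can be written down directly from the rule above; this sketch only computes stage destinations, not the data movement:

```python
# Prefix-send schedule: stage j (0-based) pairs processor i with i XOR (j + 1).
# Each stage is a prefix permutation, hence contention free on a fat-tree;
# later stages pair nodes that are far apart, which is the weakness cited.

def prefix_send_schedule(nprocs):
    """sched[j][i] = destination of processor i in stage j."""
    return [[i ^ (j + 1) for i in range(nprocs)] for j in range(nprocs - 1)]

sched = prefix_send_schedule(8)
# over all P - 1 stages, processor 0 reaches every other processor exactly once
assert sorted(stage[0] for stage in sched) == list(range(1, 8))
```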

Slide 22: k-Prefix Strategy
A hybrid of the ring strategy and prefix send: prefix send is used within partitions of size k, and the ring is used between the partitions. Our contribution!
Figure: ring across fat-trees of size k, prefix send within.
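A sketch of the k-Prefix communication pattern, assuming processors are grouped into consecutive partitions of size k; the helper name and layout here are illustrative, and the paper's exact schedule may differ:

```python
# Hypothetical helper illustrating the k-Prefix pattern: prefix send (XOR)
# inside a partition of k consecutive nodes, plus a ring hop to the matching
# node in the next partition (ring hops forward remote data around).

def k_prefix_neighbors(i, nprocs, k):
    base = (i // k) * k                                      # partition start
    local = [base + ((i - base) ^ j) for j in range(1, k)]   # prefix send, nearby
    ring = (i + k) % nprocs                                  # inter-partition hop
    return local, ring

local, ring = k_prefix_neighbors(5, 16, 4)
assert sorted(local) == [4, 6, 7]   # stays inside partition [4..7]
assert ring == 9                    # matching node in the next partition
```

The point of the hybrid is visible in the destinations: the XOR traffic never leaves a k-node subtree, and the only long-distance send is the single hop of distance k, avoiding the distance-P/2 messages of pure prefix send.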

Slide 23: Performance
Node bandwidth (MB/s) each way:

Nodes   MPI   Prefix   K-Prefix
64      123   260      265
128      99   224      259
144      94   -        261
256      95   215      256

Our strategies send messages from Elan memory.

Slide 24: Cost Equation
Symbols in the cost model: α, host and network software overhead; α_b, cost of a barrier (barriers are needed to synchronize the nodes); β_em, per-byte network transmission cost; δ, copying overhead to NIC memory; P, number of processors; k, size of the partition in k-Prefix.

Slide 25: k-Prefixlb Strategy
The k-Prefixlb strategy synchronizes the nodes after a few steps.

Slide 26: CPU Overhead
Strategies should also be evaluated on compute overhead. Asynchronous, non-blocking primitives are needed; a data-driven system like Charm++ supports this automatically.

Slide 27: Predicted vs. Actual Performance
The predicted plot assumes α = 9 μs, α_b = 15 μs, and β = δ = 294 MB/s.

Slide 28: Missing Nodes
Missing nodes arise from down nodes in the fat-tree. Prefix-Send and k-Prefix do badly in this scenario.

Node bandwidth (MB/s) with 1 missing node:

Nodes   MPI   Prefix-Send   K-Prefix
128     72    158           169
240     69    -             173

Slide 29: k-Shift Strategy
Processor i sends data to the consecutive nodes [i − k/2 + 1, …, i − 1, i + 1, …, i + k/2] and to i + k. Contention free, with good performance on non-contiguous nodes when k = 8. Our contribution!
k-Shift gains because most of each node's destinations do not change in the presence of missing nodes.

Node bandwidth (MB/s) with one missing node:

Nodes   K-Shift   K-Prefix
128     196       169
240     197       173
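The k-shift destination set follows directly from the slide's description; this helper (hypothetical, written for this transcript) computes it modulo the node count:

```python
# Destination set per the slide: the k - 1 consecutive neighbours
# [i - k/2 + 1 .. i + k/2] excluding i itself, plus i + k, all modulo P.

def k_shift_dests(i, nprocs, k):
    dests = [(i + d) % nprocs
             for d in range(-(k // 2) + 1, k // 2 + 1) if d != 0]
    dests.append((i + k) % nprocs)
    return dests

# e.g. node 0 of 240 with k = 8 sends to three left neighbours,
# four right neighbours, and node k = 8:
assert k_shift_dests(0, 240, 8) == [237, 238, 239, 1, 2, 3, 4, 8]
```

Because the destinations are consecutive neighbours rather than a fixed XOR pattern, removing one node from the machine perturbs only a few of these sets, which is the robustness the slide reports.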

Slide 30: Conclusion
We optimize AAM for Quadrics QsNet. Copying a message to the NIC and sending it from there gives more bandwidth. k-Prefix avoids sending messages to faraway nodes. Missing nodes are handled by the k-Shift strategy. Cluster interconnects other than Quadrics also have such problems. Impressive performance results. CPU overhead should be a metric for evaluating AAM strategies.

Slide 31: Future Work

