
1 Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li
HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. Good morning, everyone! I am Jinkun Geng, from Tsinghua University. Today the topic I would like to share with you is HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. This work was directed by my supervisor, Prof. Dan Li, in cooperation with Yang Cheng, Shuai Wang, and Junfeng Li.

2 ACM SIGCOMM Workshop on NetAI
As the name of the workshop indicates, AI is really a hot topic in recent years and has penetrated every field. As network researchers, it is worthwhile for us to combine networking and AI. In the past few days, we have witnessed numerous works focusing on AI for networking, and in the following part of the workshop we will see more of them. In this presentation, I will introduce something about networking for AI, and HiPS is one example in this direction.

3 Distributed Machine Learning
Background: Computation and Communication. As is well known, distributed machine learning (DML) has become more and more popular in recent years. Its performance mainly concerns two dimensions: one is computation, and the other is communication.

4 Background Strong Computation Power (GPU & TPU)
On the computation side, with the rapid development of GPUs, TPUs, and other accelerators, computation power has become strong; the communication side, however, is now confronted with significant challenges.

5 Background Communication Challenge
TCP: High Latency & Low Throughput, Kernel Overheads, etc. RDMA: a Promising Alternative to TCP. In recent years, TCP has been blamed for its high latency and low throughput, largely due to the kernel overheads it introduces. Considering this, RDMA serves as an alternative technology: it offloads the processing logic into hardware and eliminates the kernel-based overheads, achieving ultra-low latency, and is thus considered a promising alternative to TCP.

6 Background A MNIST Benchmark with 1 million paras
To demonstrate the benefit of RDMA for DML, we conducted a simple experiment: we trained an MNIST benchmark containing 1 million parameters, running a worker on one machine and a parameter server on a remote machine, under three simple configurations. The results show that RoCE-based DML can reduce the training time by about 50% compared with its TCP-based counterpart.

7 Background RoCE/RDMA –multi-vendor ecosystem
Many Problems in Fat-Tree based Deployment. Though RoCE has been widely applied in practice and has formed a multi-vendor ecosystem, it should be noted that many problems still exist when we deploy RoCE in Fat-Tree at large scale.

8 Background Fat-Tree based Deployment
PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]. Resilient RoCE: performance sacrifice [Chelsio Tech]. Synchronization performance. PFC pause frame storms have been identified in previous literature as one problem in RoCE deployment. Our NS-3 simulations also observed that PFC pause frames can easily occur in Fat-Tree, especially under all-to-all communication, even when DCQCN is adopted to mitigate them. Recently, Mellanox has claimed that ConnectX-4 and later RNICs and switches support resilient RoCE, which eliminates PFC for congestion control; however, resilient RoCE is criticized by its rivals, because packet recovery sacrifices performance. More importantly, existing works mainly focus on the logical view when designing synchronization algorithms, without considering the real physical topology, which may damage synchronization performance.

9 Server-Centric Networks
Background: Fat-Tree based Deployment. PFC pause frame storm [SIGCOMM'15, '16]. Resilient RoCE: performance sacrifice. Targeting the PFC-related problems, we decided not to use Fat-Tree as the topology; instead, we chose server-centric networks.

10 Hierarchical Synchronization
Background: Fat-Tree based Deployment. Synchronization Performance. Meanwhile, we fully consider the physical topology and leverage hierarchical synchronization for high performance.

11 Background Server-Centric Networks
Fewer hops lead to fewer PFC pause frames. Servers prevent the cascading effect of PFC pause frames. Why do we choose server-centric networks to mitigate PFC pause frame storms? There are two main reasons. According to our simulations and some empirical study, we find that PFC pause frame storms are more likely to occur when flows traverse multiple hops; on the contrary, if we avoid multi-hop paths and only involve single-hop flows, PFC pause frames are significantly reduced or even avoided. Besides, in server-centric networks such as BCube and Torus, the servers have forwarding intelligence and can help constrain the cascading effect: even when PFC pause frames occur, a server can absorb them into its memory, so they will not spread through the whole network and cause complete failure.

12 Background Synchronization Algorithm PS-based Mesh-based Ring-based
Then we turn to the synchronization algorithm. Why do we design a hierarchical algorithm? To answer this question, we first review the synchronization algorithms in existing DML systems.

13 Background Synchronization Algorithm PS-based (Pull+Push)
In PS-based synchronization, the process is divided into two stages: pull and push. A minimal sketch follows.
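Below is a toy sketch of one PS-based synchronization round (the class and method names are illustrative, not from the paper): workers push gradients to a parameter server, which aggregates them, and workers then pull the updated parameters back.

```python
import numpy as np

# Toy sketch of one PS-based synchronization round (illustrative names):
# workers push gradients, the server aggregates, workers pull.

class ParameterServer:
    def __init__(self, params):
        self.params = params   # global model parameters
        self.buffer = []       # gradients received in this round

    def push(self, grad):
        self.buffer.append(grad)

    def apply(self, lr=0.1):
        # average the workers' gradients and take one SGD step
        self.params -= lr * np.mean(self.buffer, axis=0)
        self.buffer = []

    def pull(self):
        return self.params.copy()

ps = ParameterServer(np.zeros(4))
for grad in [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]:  # 3 workers push
    ps.push(grad)
ps.apply()
print(ps.pull())  # every worker pulls the same updated parameters
```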

14 Background Synchronization Algorithm Mesh-based (Diffuse+Collect)
Mesh-based synchronization can be considered a special case of the PS-based algorithm with load balancing: every server acts as the parameter server for one shard of the model. The sync process is also divided into two main steps, diffuse and collect; see the sketch below.
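A minimal single-process sketch of the diffuse+collect pattern, assuming N servers and a model split evenly into N shards (all names and shapes are our own illustration):

```python
import numpy as np

# Toy sketch of mesh-based (diffuse + collect) sync: N servers, model
# split evenly into N shards; server j "owns" shard j.
N = 3
grads = [np.arange(6, dtype=float) * (i + 1) for i in range(N)]  # per-server grads
shards = [np.array_split(g, N) for g in grads]

# Diffuse: server j receives shard j from every peer and aggregates it,
# so the aggregation load is balanced across all servers.
aggregated = [sum(shards[i][j] for i in range(N)) / N for j in range(N)]

# Collect: every server fetches all aggregated shards and reassembles
# the full model.
synced = np.concatenate(aggregated)
print(synced)  # identical on every server after collect
```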

15 Background Synchronization Algorithm Ring-based (Scatter+Gather)
The ring-based algorithm is well known as Ring Allreduce. For N servers in the ring, it needs N-1 steps of scatter(-reduce) followed by N-1 steps of gather; a sketch follows.
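Below is a minimal single-process simulation of Ring Allreduce over N = 4 servers (the chunk-indexing scheme is one standard variant, not necessarily the paper's): in scatter-reduce each server forwards one chunk per step to its ring successor, which adds it in; in the gather phase the fully reduced chunks circulate around the ring.

```python
import numpy as np

# Single-process sketch of Ring Allreduce (scatter-reduce + all-gather).
# chunks[i][j] is chunk j held by server i; real systems run these
# transfers in parallel, which this toy loop does not model.
N = 4
data = [np.full(N, float(i + 1)) for i in range(N)]   # per-server gradients
chunks = [list(np.array_split(d, N)) for d in data]

# Scatter-reduce: in step s, server i sends chunk (i - s) mod N to its
# successor, which adds it in. After N-1 steps, server i holds the fully
# reduced chunk (i + 1) mod N.
for s in range(N - 1):
    for i in range(N):
        j = (i - s) % N
        chunks[(i + 1) % N][j] += chunks[i][j]

# All-gather: in step s, server i forwards its freshest reduced chunk
# (i + 1 - s) mod N to its successor, overwriting the stale copy there.
for s in range(N - 1):
    for i in range(N):
        j = (i + 1 - s) % N
        chunks[(i + 1) % N][j] = chunks[i][j].copy()

print(np.concatenate(chunks[0]))  # elementwise sum 1+2+3+4 = 10 everywhere
```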

16 Background Synchronization Algorithm Ring-based (Scatter+Gather)
In the logical view, we may feel that these algorithms are very efficient and can work well at large scale. However, when we consider the real topology, the situation becomes more complex: will there be bandwidth imbalance? Will there be link conflicts? Is the parallelism fully leveraged? Perhaps the answer is no.

17 HiPS Design Map Logic View and Physical Structure
Flexible (Topology-Aware), Hierarchical (Efficient). We believe an efficient synchronization algorithm for large-scale DML should take the physical topology into consideration, mapping the logical view onto the physical structure. So we decided to design a flexible and hierarchical algorithm, which we name HiPS.

18 HiPS Design HiPS in BCube
HiPS fully takes the physical topology into consideration. We divide the whole sync process into several stages; each stage chooses one of the three sync algorithms, and the parameters are aggregated at each stage, so that the sync workload is reduced stage by stage, just like a dumbbell.

19 HiPS Design HiPS in BCube
To explain this better, we take HiPS in BCube as an example to illustrate its execution.

20 HiPS Design HiPS in BCube
To further understand HiPS, let us look at the parameter states of one server, for example, server <01>.

21 HiPS Design HiPS in BCube (Server <01>)
First, the server enters Stage 0 and conducts a sync step such as Diffuse or Scatter, gaining three partially aggregated parameter blocks from its group. Then it enters Stage 1 and syncs only those aggregated parameters; for example, it also conducts Mesh-based sync in Stage 1. After Diffuse, it holds one parameter block globally aggregated from all 9 servers; after Collect, it holds three such globally aggregated blocks. Finally, it returns to Stage 0 and continues the sync, eventually obtaining all the synchronized parameters. The sketch below mimics this two-stage flow.
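As a toy illustration of the two-stage flow (our own simplification: it averages whole gradients per group and skips the per-stage sharding that makes the real workload shrink like a dumbbell), here is a HiPS-style sync over 9 servers arranged as BCube(3,1):

```python
import numpy as np

# Toy two-stage HiPS-style sync on BCube(3,1): 9 servers in 3 level-0
# groups of 3 (names and shapes are illustrative).

def mesh_allreduce(grads):
    # stand-in for one mesh-based diffuse+collect within a group:
    # every member ends up holding the group average
    avg = np.mean(grads, axis=0)
    return [avg.copy() for _ in grads]

N, GROUP = 9, 3
grads = [np.full(6, float(i)) for i in range(N)]  # per-server gradients

# Stage 0: sync inside each level-0 group (single-hop neighbors only).
for g in range(N // GROUP):
    lo = g * GROUP
    grads[lo:lo + GROUP] = mesh_allreduce(grads[lo:lo + GROUP])

# Stage 1: the k-th member of every group syncs with its counterparts,
# so already-aggregated data crosses group boundaries exactly once.
for k in range(GROUP):
    idx = [g * GROUP + k for g in range(N // GROUP)]
    for i, p in zip(idx, mesh_allreduce([grads[i] for i in idx])):
        grads[i] = p

print(grads[0])  # the global average, identical on all 9 servers
```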

22 HiPS Design HiPS in BCube
If we look at the synchronization workload in each stage, it can be illustrated as a dumbbell: the workload for each stage shrinks.

23 HiPS Design HiPS in Torus
As for Torus, we have a similar design. Notice that both BCube and Torus enjoy good symmetry, so we can launch several parallel processes to fully utilize the links and bandwidth and accelerate the synchronization process.

24 Theoretical Evaluation
Here we present the theoretical evaluation for BCube and Torus.

25 Theoretical Evaluation
To better illustrate the gap, we compare the global synchronization time (GST) of the different synchronization algorithms in the figure. For both BCube and Torus, HiPS achieves better synchronization performance than the existing flat algorithms; a rough cost-model sketch follows.
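As a back-of-the-envelope illustration only (our own assumption: the standard latency-bandwidth cost model with N servers, model size P, link bandwidth B, and per-step start-up latency alpha; this is not the paper's exact derivation), flat Ring Allreduce costs roughly

```latex
T_{\mathrm{ring}} \approx 2(N-1)\left(\alpha + \frac{P}{NB}\right),
```

while a two-stage hierarchical scheme with groups of size $n = \sqrt{N}$ pays only $O(\sqrt{N})$ start-ups per phase:

```latex
T_{\mathrm{hier}} \approx 4\left(\sqrt{N}-1\right)\alpha + c\,\frac{P}{B}, \qquad c = O(1).
```

The start-up term thus drops from $O(N)\,\alpha$ to $O(\sqrt{N})\,\alpha$ while the bandwidth term stays comparable, which is consistent with the start-up overhead of Ring Allreduce noted later in the testbed discussion.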

26 Theoretical Evaluation
This is the corresponding comparison for Torus.

27 Future Work Conduct Further Comparative Study
Integrate HiPS into DML systems. As future work for this workshop paper, we have two main directions: conducting a further comparative study and integrating HiPS into DML systems. Several months have passed since the paper's acceptance, and we have made further progress; I will show you the preliminary results of our follow-up work.

28 Simulation Evaluation
NS-3 simulation with a VGG workload.
GSTs in BCube: compared with PSS/MS, HiPS reduces the GST by 37.5%~61.9%; compared with RS, by 49.6%~66.4%.
GSTs in Torus: compared with PSS/MS, HiPS reduces the GST by 52.9%~87.4%; compared with RS, by 48.0%~65.6%.
(Figures: GST comparison with RDMA in BCube; GST comparison with RDMA in Torus.)

29 Testbed Evaluation System Instance of HiPS: BML
We add a custom OP in TensorFlow. The testbed has 9 servers, each equipped with 2 RNICs, forming BCube(3,1). MNIST and VGG19 serve as benchmarks, with Ring Allreduce in a Ring topology and Mesh-based (P2P) sync in Fat-Tree as baselines. HiPS outperforms the baselines by 18.7%~56.4%; a sketch of the OP wiring follows.
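Below is a hypothetical sketch of how a custom synchronization OP could be wired into a TensorFlow training step; `hips_sync_op.so` and `hips_sync` are illustrative names, not BML's actual interface (only `tf.load_op_library` is the real TensorFlow entry point for custom ops).

```python
import tensorflow as tf

# Hypothetical: load a compiled custom sync OP (illustrative file name).
hips = tf.load_op_library('./hips_sync_op.so')

@tf.function
def train_step(model, optimizer, x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, logits, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Instead of a flat allreduce, hand each gradient to the custom OP,
    # which would aggregate it hierarchically over BCube via RDMA
    # (hypothetical call).
    synced = [hips.hips_sync(g) for g in grads]
    optimizer.apply_gradients(zip(synced, model.trainable_variables))
    return loss
```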

30 Testbed Evaluation (figure: HiPS outperforms the baselines by 18.7%~56.4%)

31 Testbed Evaluation
According to our experimental study, HiPS outperforms the Mesh-based baseline by 18.7%~56.4%. Meanwhile, since we have only built a two-layer BCube, the theoretical GST of HiPS in BCube equals that of the bidirectional Ring-based algorithm; yet HiPS is still slightly faster than Ring Allreduce, because Ring Allreduce needs more send operations. Although Ring Allreduce has been proved bandwidth-optimal, it needs more start-up time, which can become a potential overhead. The results are not perfect, and we are still optimizing the current work.

32 Ongoing Work Conduct Further Comparative Study
Optimize HiPS in DML systems. More cases of Network for AI. These are our ongoing directions following this workshop paper; we will continue along this line and conduct more research on Network for AI.

33 https://nasp.cs.tsinghua.edu.cn/
Thanks! NASP Research Group

