Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li

Presentation transcript:

Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning Good morning, everyone! I am Jinkun Geng, from Tsinghua University. Today the topic I would like to share with you is HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning. This work is directed by my supervisor, Prof. Dan Li, and done in cooperation with Yang Cheng, Shuai Wang, and Junfeng Li.

ACM SIGCOMM Workshop on NetAI As indicated by the name of the workshop, AI has been a hot topic in recent years and has penetrated into every field. As network researchers, it is worthwhile for us to combine networking and AI. In the past few days, we have witnessed numerous works focusing on AI for networking, and in the following part of the workshop we will see more of them. In this presentation, I will introduce something about networking for AI, and HiPS is one example in this direction.

Distributed Machine Learning Background Computation Communication As we all know, DML has become more and more popular in recent years. Its performance mainly concerns two dimensions: one is computation, and the other is communication.

Background Strong Computation Power (GPU & TPU) On the computation side, with the rapid development of GPUs, TPUs, and other accelerators, computation power has become very strong, but the communication side is now confronted with significant challenges.

Background Communication Challenge TCP: High Latency & Low Throughput, Kernel Overheads, etc. RDMA: Promising Alternative to TCP In recent years, TCP has been blamed for its high latency and low throughput, which is largely due to the kernel overheads it introduces. Considering this, RDMA serves as an alternative technology: it offloads the processing logic into the hardware and eliminates the kernel-based overheads, so it can achieve ultra-low latency and is considered a promising alternative to TCP.

Background An MNIST Benchmark with 1 Million Parameters To demonstrate the benefit of RDMA for DML, we conducted a simple experiment: we train an MNIST benchmark, which contains 1 million parameters, for 10,000 iterations under three simple configurations, running a worker on one machine and a parameter server on a remote machine. The results show that RoCE-based DML can reduce the training time by about 50% compared with its TCP-based counterpart.

Background RoCE/RDMA: Multi-vendor Ecosystem Many Problems in Fat-Tree-based Deployment Though RoCE has been widely applied in practice and has formed a multi-vendor ecosystem, it should be noted that many problems still exist when we deploy RoCE in Fat-Tree at large scale.

Background Fat-Tree-based Deployment PFC pause frame storm [SIGCOMM'15,'16, NS-3 Simulation] Resilient RoCE: Performance Sacrifice [Chelsio-Tech] Synchronization Performance PFC pause frame storms have been identified in previous literature as one problem in RoCE deployment. Our NS-3 simulations also observed that PFC pause frames can easily occur in Fat-Tree, especially under all-to-all communication, even when we adopt DCQCN to mitigate them. Recently, Mellanox has claimed that CX-4 and higher-level RNICs and switches can support resilient RoCE, which does not require PFC for congestion control. However, resilient RoCE is criticized by its rivals because it sacrifices performance due to packet recovery. More importantly, the existing works mainly focus on the logical view when designing synchronization algorithms, without considering the real physical topology, which may damage synchronization performance.

Server-Centric Networks Background Fat-Tree-based Deployment PFC pause frame storm [SIGCOMM'15,'16] Resilient RoCE: Performance Sacrifice To address the PFC-related problems, we decide not to use Fat-Tree as the topology; instead, we choose server-centric networks.

Hierarchical Synchronization Background Fat-Tree-based Deployment Synchronization Performance Meanwhile, we fully consider the physical topology and leverage hierarchical synchronization for high performance.

Background Server-Centric Networks Fewer hops lead to fewer PFC pause frames Servers prevent the cascading effect of PFC pause frames Why do we choose server-centric networks to mitigate PFC pause frame storms? There are two main reasons. According to our simulation and some empirical studies, we find that PFC pause frame storms are more likely to occur when flows traverse multiple hops. On the contrary, if we avoid multi-hop flows and only involve single-hop flows, PFC pause frames will be significantly reduced or even avoided. Besides, in server-centric networks such as BCube and Torus, the servers have forwarding intelligence and can help constrain the cascading effect. Even when a PFC pause frame occurs, the server can absorb it into its memory, so it will not spread to the whole network and cause a complete failure.

Background Synchronization Algorithm PS-based Mesh-based Ring-based Then we turn to the synchronization algorithm. Why do we design a hierarchical algorithm? To answer this question, we first review the synchronization algorithms used in existing DML systems.

Background Synchronization Algorithm PS-based (Pull+Push) In the PS-based algorithm, the synchronization process is divided into two stages, pull and push, as sketched below.
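For clarity, here is a minimal Python sketch of the two PS-based stages. The single parameter server, the gradient averaging, and the learning-rate update are illustrative assumptions for this sketch, not the implementation of any particular system.

```python
import numpy as np

class ParameterServer:
    """Minimal sketch of PS-based synchronization (Pull + Push)."""

    def __init__(self, num_params):
        self.params = np.zeros(num_params)   # global model parameters
        self.pending = []                    # gradients pushed in this round

    def push(self, grad):
        # Push stage: each worker sends its local gradient to the server.
        self.pending.append(np.asarray(grad, dtype=float))

    def pull(self, lr=0.01):
        # Pull stage: once the workers have pushed, the server applies the
        # averaged gradient and returns the updated parameters.
        if self.pending:
            self.params -= lr * np.mean(self.pending, axis=0)
            self.pending = []
        return self.params.copy()
```

In each iteration every worker calls push() with its local gradient and then pull() to fetch the refreshed model; the single server is the obvious bottleneck that the mesh-based scheme on the next slide removes.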

Background Synchronization Algorithm Mesh-based (Diffuse+Collect) The mesh-based algorithm can be considered a special case of the PS-based algorithm with load balancing. Its synchronization process is also divided into two main steps, diffuse and collect, as in the sketch below.
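A rough Python sketch of Diffuse and Collect follows; the equal sharding of the parameter space across nodes and the averaging are assumptions made purely for illustration.

```python
import numpy as np

def mesh_sync(grads):
    """Sketch of mesh-based synchronization (Diffuse + Collect).

    grads: list of per-node gradient vectors, one per node.
    Each node is assumed to own one equal shard of the parameter space.
    """
    n = len(grads)
    shards = [np.array_split(np.asarray(g, dtype=float), n) for g in grads]

    # Diffuse: every node sends shard j of its gradient to node j,
    # which aggregates that shard over all nodes.
    aggregated = [sum(shards[i][j] for i in range(n)) / n for j in range(n)]

    # Collect: node j sends its aggregated shard back to every node,
    # so each node can reassemble the full synchronized gradient.
    synced = np.concatenate(aggregated)
    return [synced.copy() for _ in range(n)]
```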

Background Synchronization Algorithm Ring-based (Scatter+Gather) The ring-based algorithm is well known as Ring AllReduce. For N servers in the ring, it needs N-1 steps of Scatter and N-1 steps of Gather, as simulated in the sketch below.
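The N-1 Scatter steps and N-1 Gather steps can be simulated in a few lines of Python. This is only a sketch of the standard Ring AllReduce pattern; real systems overlap these point-to-point transfers rather than looping sequentially.

```python
import numpy as np

def ring_allreduce(grads):
    """Minimal simulation of ring-based synchronization (Scatter + Gather).

    grads: list of equal-length per-node gradient vectors, one per node.
    """
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in grads]

    # Scatter (reduce-scatter): in step s, node i sends chunk (i - s) mod n
    # to its ring successor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Gather (all-gather): in step s, node i forwards the fully reduced
    # chunk (i + 1 - s) mod n to its successor, which overwrites its copy.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    # Every node now holds the sum over all nodes; average and reassemble.
    return [np.concatenate(chunks[i]) / n for i in range(n)]
```

After the Scatter phase, node i holds the fully reduced chunk (i + 1) mod n; the Gather phase then circulates these reduced chunks so that every node ends with the complete result.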

Background Synchronization Algorithm Ring-based (Scatter+Gather) In the logical view, we may feel that these algorithms are very efficient and can work well at large scale. However, when we consider the real topology, the situation becomes more complex: will there be bandwidth imbalance? Will there be link conflicts? Is the parallelism fully leveraged? Perhaps the answer is no.

HiPS Design Map the Logical View onto the Physical Structure Flexible (Topology-Aware) Hierarchical (Efficient) We believe an efficient synchronization algorithm for large-scale DML should take the physical topology into consideration, and we need to map the logical view onto the physical structure. So we decide to design a flexible and hierarchical algorithm, which we name HiPS.

HiPS Design HiPS in BCube HiPS fully takes the physical topology into consideration. We divide the whole synchronization process into several stages; each stage chooses one of the three sync algorithms, and the parameters are aggregated in each stage, so the synchronization workload is reduced stage by stage, just like a dumbbell.

HiPS Design HiPS in BCube To explain this better, we take HiPS in BCube as an example to illustrate its execution.

HiPS Design HiPS in BCube To further understand HiPS, let us look at the parameter states of one server as an example.

HiPS Design HiPS in BCube (Server <01>) First, the server enters Stage 0 and conducts a sync, such as Diffuse or Scatter, obtaining three partially aggregated parameters from its group. Then it enters Stage 1 and conducts sync only on the aggregated parameters; for example, it also conducts mesh-based sync in Stage 1. After Diffuse it obtains one global parameter aggregated from the 9 servers, and after Collect it obtains three global parameters aggregated from the 9 servers. Finally, it returns to Stage 0 and continues the sync, after which it obtains all the synchronized parameters from the servers. A simplified sketch of this staged flow follows below.
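To make the stage-by-stage flow concrete, here is a small Python sketch of a two-stage hierarchical synchronization on an n x n server grid, used as a simplified stand-in for BCube(3,1). The row/column grouping, the use of mesh-style sync in both stages, and the averaging are assumptions for illustration, not the exact HiPS schedule.

```python
import numpy as np

def hips_two_stage(grads, group_size):
    """Illustrative two-stage hierarchical synchronization.

    grads: dict mapping (row, col) -> gradient vector, for an n x n grid.
    group_size: n, the number of servers per group.
    """
    n = group_size
    shards = {k: np.array_split(np.asarray(v, dtype=float), n)
              for k, v in grads.items()}

    # Stage 0 (intra-group diffuse): within each row, server (r, c) becomes
    # responsible for shard c and aggregates that shard over its row.
    stage0 = {(r, c): sum(shards[(r, i)][c] for i in range(n)) / n
              for r in range(n) for c in range(n)}

    # Stage 1 (inter-group sync): the servers holding the same shard index c
    # (one per row) aggregate their partial results into the global shard.
    global_shard = {c: sum(stage0[(r, c)] for r in range(n)) / n
                    for c in range(n)}

    # Back to Stage 0 (intra-group collect): each server distributes its
    # global shard within its row, so every server reassembles the full model.
    synced = np.concatenate([global_shard[c] for c in range(n)])
    return {k: synced.copy() for k in grads}
```

In this simplified form every server ends up with parameters averaged over all n x n servers, while each transfer stays within a single group at a time, which is why the per-stage workload keeps shrinking.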

HiPS Design HiPS in BCube If we take a look at the synchronization workload in each stage, it can be illustrated like a dumbbell: the workload for each stage shrinks.

HiPS Design HiPS in Torus For Torus, we have a similar design. Notice that both BCube and Torus enjoy good symmetry, so we can launch several parallel processes to fully utilize the links and bandwidth and accelerate the synchronization process.

Theoretical Evaluation Here we present the theoretical evaluation for BCube and Torus.

Theoretical Evaluation To better illustrate the gap, we compare the global synchronization time of the different synchronization algorithms in the figure. We can see that for both BCube and Torus, HiPS achieves better synchronization performance than the existing flat algorithms.

Theoretical Evaluation This is the comparison for Torus.

Future Work Conduct Further Comparative Study Integrate HiPS into DML Systems As future work for this workshop paper, we have two main directions. Since the paper was accepted, several months have passed; we have made further progress, and I will show you the preliminary results of our follow-up work.

Simulation Evaluation NS-3 Simulation with VGG Workload GSTs in BCube: compared with PSS/MS, HiPS reduces the GST by 37.5%~61.9%; compared with RS, HiPS reduces the GST by 49.6%~66.4%. GSTs in Torus: compared with PSS/MS, HiPS reduces the GST by 52.9%~87.4%; compared with RS, HiPS reduces the GST by 48.0%~65.6%. GST comparison with RDMA in BCube. GST comparison with RDMA in Torus.

Testbed Evaluation System Instance of HiPS: BML Add an OP in TensorFlow 9 servers, each equipped with 2 RNICs (BCube(3,1)) MNIST and VGG19 as benchmarks Ring AllReduce in a ring topology and mesh-based (P2P) sync in Fat-Tree as baselines 18.7%~56.4%

Testbed Evaluation 18.7%~56.4%

18.7%~56.4% Testbed Evaluation According to our experimental study, HiPS outperforms the mesh-based baseline by 18.7%~56.4%. Meanwhile, since we have only built a two-layer BCube, the theoretical GST of HiPS in BCube and the GST of the bidirectional ring-based algorithm are equal. Yet we find that HiPS is still a little faster than Ring AllReduce, because Ring AllReduce needs more send operations. Although Ring AllReduce has been proved to be bandwidth-optimal, it needs more start-up time, which can become a potential overhead. The result is not perfect, and we are still optimizing the current work.

Ongoing Work Conduct Further Comparative Study Optimize HiPS in DML Systems More Cases of Network for AI As for the future work of this workshop paper, we will continue in this direction and conduct more research on networking for AI.

Thanks! NASP Research Group https://nasp.cs.tsinghua.edu.cn/