Rethinking Transport Layer Design for Distributed Machine Learning


Rethinking Transport Layer Design for Distributed Machine Learning Jiacheng Xia1, Gaoxiong Zeng1, Junxue Zhang1,2, Weiyan Wang1, Wei Bai3, Junchen Jiang4, Kai Chen1,5 Hello everyone, I'm Jiacheng Xia from SING Lab, HKUST. Today I'm presenting our work titled "Rethinking Transport Layer Design for Distributed Machine Learning". In this work we present an important feature: DML tolerates data loss up to a certain degree, and we argue that a transport layer protocol leveraging this feature brings a non-trivial improvement in DML speed. 8/17/19 APNet' 19, Beijing, China

Growth of Machine Learning Growing applications of AI, many of which leverage "machine learning". Our work: running distributed machine learning over a reliable data transfer protocol does NOT lead to optimal performance! In recent years we have seen a growing trend of artificial intelligence. Many AI applications leverage machine learning to solve a specific task. To keep up with growing data volumes, distributed training of models is often used to better leverage the available computational resources and increase processing speed. While existing DML systems rely on reliable transport protocols like TCP, in this paper we point out that this does not necessarily lead to the best performance.

ML as Iterative Approximation Many ML applications iteratively "learn" a mathematical model to describe data, represented as minimizing an objective function, e.g. with Stochastic Gradient Descent (SGD). Why do we even argue that DML can run over a somewhat unreliable data transfer protocol? To answer this question we first need to understand how machine learning works. The goal of many machine learning algorithms can be expressed as finding a model that describes the dataset, typically via iterative approximation, for example the stochastic gradient descent algorithm. The figure on the right illustrates the contour map of an objective function. The goal of SGD is to find the minimum objective value, i.e. the yellow star representing the error of the model. Each SGD step moves in the gradient direction (the direction orthogonal to the tangent line) as a greedy reduction of the loss value, approaching the minimizer over multiple iterations.
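As a minimal illustration of the iterative-approximation view (my own sketch, not the paper's code), one can write plain SGD as repeated greedy steps against the gradient; here `grad_fn` and the quadratic objective are hypothetical examples:

```python
import numpy as np

def sgd(grad_fn, w0, lr=0.1, iters=100):
    """Plain SGD: repeatedly step against the gradient direction."""
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w -= lr * grad_fn(w)  # greedy move along the negative gradient
    return w

# Example: minimize f(w) = ||w - c||^2, whose unique minimizer is c.
c = np.array([1.0, -2.0])
w_star = sgd(lambda w: 2 * (w - c), w0=[0.0, 0.0])
```

Each iteration only refines the previous estimate, which is the property the rest of the talk builds on.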

Distributed Machine Learning (DML) After each iteration, workers exchange their parameter updates. Often uses "synchronous training" for best performance, so the slowest worker determines the speed. Parameter Servers, Workers, Data Shards. We focus our discussion on data-parallel training of DML. In data-parallel distributed training, each worker takes a subset of the data and exchanges its training results every iteration. The figure below illustrates the parameter server architecture. This update process is often synchronized for the best performance: every worker must finish the current iteration and exchange its results with the others before starting the next one. This implies that a DML job's completion time is determined by the speed of the slowest worker. 8/17/19 APNet' 19, Beijing, China
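The synchronous update described above can be sketched in a few lines (an illustrative simplification under my own naming, not the actual system): every worker pushes a gradient on its shard, the server averages them, and nobody proceeds until the aggregate is pulled back.

```python
import numpy as np

def sync_iteration(w, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel iteration (sketch):
    workers push per-shard gradients, the parameter server
    averages them, and all workers pull the same update
    before the next iteration may start."""
    grads = [grad_fn(w, shard) for shard in shards]  # workers push
    avg = sum(grads) / len(grads)                    # server aggregates
    return w - lr * avg                              # workers pull

# Toy example: two workers whose gradients pull w toward shard means.
shards = [np.array([0.0, 2.0]), np.array([4.0, 6.0])]
w_new = sync_iteration(0.0, shards, lambda w, s: w - s.mean())
```

Because the list comprehension stands in for a barrier, the iteration time is gated by the slowest `grad_fn` call, mirroring the slowest-worker effect on real clusters.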

Packet Losses in DML Multiple flows start simultaneously -> likely to have losses (even TCP timeouts). Small flows finish within a few RTTs, so RTO >> FCT without timeout. With synchronous training, the tail FCT determines job speed. For DML tasks, recovering packet losses can be an expensive operation. Firstly, DML traffic incurs multiple flows almost at the same time, e.g. the push and pull phases of the parameter server architecture. Secondly, these flows are small flows that finish within a few RTTs; in datacenter networks this is much smaller than a TCP timeout. As the tail FCT determines job speed, the job encounters a severe slowdown if a timeout occurs. 8/17/19 APNet' 19, Beijing, China

Faster Computations With the growing speed of hardware, computations are faster, so timeouts have a larger effect.

Model | Iteration time (no timeouts) | Slowdown w/ timeouts (RTO = 5ms)
MLP | 7ms | 2.4
Matrix Factorization | 6ms | 2.7
ResNet-18 (CNN) | 25ms | 1.4
LSTM (RNN) | 82ms | 1.12

Furthermore, as machine learning processors become more and more powerful, the time to complete one iteration becomes shorter; some iterations take only a few milliseconds. In such cases, if timeouts occur at both the beginning and end of an iteration, they add a non-trivial overhead to the job. As the table shows, such applications may take more than twice the original time to finish. 8/17/19 APNet' 19, Beijing, China
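The slowdown column follows directly from the iteration model on the next slide: a timeout at each of the two communication phases adds 2 * RTO to the iteration, e.g. (7ms + 2 x 5ms) / 7ms ~ 2.4 for the MLP. A quick check:

```python
def slowdown(iter_ms, rto_ms=5):
    """Iteration-time inflation when a timeout hits both the pull
    and the push phase of one iteration (adds 2 * RTO total)."""
    return (iter_ms + 2 * rto_ms) / iter_ms

# Reproduces the table: MLP 2.4x, MF 2.7x, ResNet-18 1.4x, LSTM 1.12x
for name, t in [("MLP", 7), ("Matrix Factorization", 6),
                ("ResNet-18", 25), ("LSTM", 82)]:
    print(f"{name}: {slowdown(t):.2f}x")
```

The faster the compute, the smaller the denominator, which is why hardware progress makes the timeout overhead relatively worse.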

High Cost of Loss Recovery High recovery cost, e.g. TCP timeouts: with fast computation, completion time is >2x longer with timeouts. Why do we observe such a severe effect of timeouts? We model the behaviour of a typical iteration on one worker. As the figure below shows (compute phase between a worker pull and a worker push), the iteration completion time is determined by the local computation time plus the communication time at the beginning and end of every iteration. If timeouts take place in both the pull and push phases, they add 2*RTO to the original completion time. This overhead is significantly larger than the original time and therefore slows down the whole process. 8/17/19 APNet' 19, Beijing, China

Handling Packet Drops: Necessary? Timeouts act as a "backup" to recover packet drops. Is it necessary to handle every packet drop for DML? NO. DML is inherently iterative approximation, so it only requires approximately correct results. DML algorithms (e.g. SGD) are greedy optimizations that can recover from slightly incorrect results. Typically, timeouts serve as a "backup" mechanism to guarantee the protocol is reliable, that is, to recover every lost packet. But we have seen the exceedingly high cost of doing so for DML applications. At this point we take a step back and ask a fundamental question: is it really necessary to respond to every packet drop in DML network transport? The answer is likely to be negative. Firstly, as DML is inherently iterative approximation, it only requires the results in each step to be approximately correct. Secondly, as many DML algorithms like SGD conduct greedy optimization, errors from previous iterations can be recovered in later iterations. These features give us a chance to achieve the same DML application performance (i.e., accuracy of a classification task) even with ignored packet drops. 8/17/19 APNet' 19, Beijing, China

ML is Bounded-Loss Tolerant Same rounds, reduced JCT; more rounds, reduced JCT; does not converge. We mimic packet drops in the network by randomly marking a fraction of parameter values as "lost". Our key observation is that DML is bounded-loss tolerant. Concretely, the time-loss relation can be described in three phases. In the first phase, as the data loss rate increases, the model still achieves the same performance as the baseline setting where no data is lost. Beyond some bound, the loss results in an increased number of convergence rounds, or the model even fails to converge. We mark applications that fail to converge within 100 epochs as not converging. What does this imply for job completion time? As the packet drops that require timeouts to recover are only a small fraction, it is safe to directly proceed without recovering these packets. We emulate parameter loss locally and compute communication time with NS-3 simulations. 8/17/19 APNet' 19, Beijing, China

ML View of Bounded Loss Tolerance SGD starts each new estimate from the result of the previous iteration, so it can recover from "incorrect" results. With bounded loss, SGD still converges to the same point. After the first iteration produces an "incorrect" result, the lossy run does not take the same gradient direction as the lossless (blue) trajectory in the second iteration, yet both reach the same minimizer. Lossless SGD vs. "Lossy" SGD. 8/17/19 APNet' 19, Beijing, China
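A toy experiment makes this recovery concrete (my own sketch under assumed parameters, not the paper's emulation): drop each gradient coordinate with bounded probability and SGD still lands on the same minimizer, because surviving later steps correct the missed ones.

```python
import numpy as np

def lossy_sgd(grad_fn, w0, lr=0.1, iters=300, drop=0.2, seed=0):
    """SGD where each gradient coordinate is independently 'lost'
    (treated as zero) with probability `drop`, mimicking packet
    drops that are ignored instead of retransmitted."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        g = grad_fn(w)
        mask = rng.random(g.shape) >= drop  # surviving coordinates
        w -= lr * g * mask
    return w

# Same quadratic objective as the lossless case: minimizer is c.
c = np.array([1.0, -2.0])
w_lossy = lossy_sgd(lambda w: 2 * (w - c), w0=[0.0, 0.0])
```

With a 20% drop rate the run simply needs the occasional extra step per coordinate; only past some loss bound does convergence degrade, matching the three-phase behaviour on the previous slide.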

Existing Solutions are Insufficient Reduced communications? Unreliable protocols? Despite this bounded loss tolerance feature, it cannot be trivially applied to benefit DML. Reduced-communication approaches still run over reliable transport, so they do not avoid the timeout problem. Using a completely unreliable protocol also does not solve the problem, since unbounded loss hurts convergence. A "simplified protocol", explained in the following, has the potential to significantly outperform these settings. 8/17/19 APNet' 19, Beijing, China

Packet Drops on Different Schemes Packet drops occur under different parameter synchronization schemes: Parameter Server (PS) and Ring AllReduce (RING). This problem widely exists in both PS and RING schemes. On the other hand, ignoring packet losses can significantly cut down the tail flow completion time of DML. This motivates us to design a new protocol based on the bounded loss tolerance feature. The green bars show the gain from a simplified protocol we designed; we explain the setup on the next slide. 8/17/19 APNet' 19, Beijing, China
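For readers less familiar with the RING scheme, here is a logical sketch of Ring AllReduce (a standard description, not this paper's implementation): n-1 reduce-scatter steps followed by n-1 allgather steps between ring neighbors, after which every worker holds the element-wise sum of all inputs. Every step involves many small neighbor-to-neighbor transfers, which is why RING is just as exposed to packet drops as PS.

```python
import numpy as np

def ring_allreduce(data):
    """Ring AllReduce over n simulated workers. Each worker's vector
    is split into n chunks; chunks circulate around the ring."""
    n = len(data)
    chunks = [np.array_split(np.array(d, dtype=float), n) for d in data]
    # Reduce-scatter: worker i ends up owning the full sum of chunk (i+1) % n
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy())
                 for i in range(n)]
        for i, idx, buf in sends:       # snapshot, then apply all sends
            chunks[(i + 1) % n][idx] += buf
    # Allgather: circulate the fully reduced chunks around the ring
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy())
                 for i in range(n)]
        for i, idx, buf in sends:
            chunks[(i + 1) % n][idx] = buf
    return [np.concatenate(c) for c in chunks]

# Three workers; every worker should end with the element-wise sum.
out = ring_allreduce([[1.0, 2.0, 3.0, 4.0],
                      [10.0, 20.0, 30.0, 40.0],
                      [100.0, 200.0, 300.0, 400.0]])
```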

A Simplified Protocol Minimizes the time for the receiver to collect a predefined threshold of packets. TCP-like congestion control logic. Receivers notify the application layer once a predefined threshold of data has been received. Preliminary results in the NS-3 simulator. 8/17/19 APNet' 19, Beijing, China
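The receiver-side idea can be sketched as follows (my own simplification with hypothetical names, not the actual NS-3 code): congestion control stays TCP-like, but delivery to the application fires as soon as the configured fraction of a block has arrived, so a few unrecovered drops never stall an iteration behind an RTO.

```python
class BoundedLossReceiver:
    """Receiver side of a threshold-delivery transport (sketch):
    hand the block to the application once `threshold` of its
    packets have arrived, instead of waiting for full reliability."""

    def __init__(self, total_pkts, threshold=0.97):
        self.needed = int(total_pkts * threshold)
        self.received = set()       # sequence numbers seen so far
        self.delivered = False

    def on_packet(self, seq):
        self.received.add(seq)      # duplicates ignored by the set
        if not self.delivered and len(self.received) >= self.needed:
            self.delivered = True   # notify the application layer early
        return self.delivered

# With a 90% threshold, 100-packet blocks complete at packet 90.
rx = BoundedLossReceiver(100, threshold=0.9)
```

Setting `threshold` below the model's loss-tolerance bound is what connects this protocol back to the bounded-loss-tolerance observation.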

Results: Simplified Protocol [Simulation] 1.1-2.1x speedup on both the PS and RING schemes. Note the results are model-specific, since some models are more computation-intensive than others. You can find the detailed evaluation setup in our paper. 8/17/19 APNet' 19, Beijing, China

Reduced Tail FCT The speedup results from reduced tail FCTs. A bounded-loss tolerant protocol benefits DML by ignoring some packet drops. 8/17/19 APNet' 19, Beijing, China

Future Work We have seen that leveraging bounded loss tolerance has huge potential to speed up DML. Next steps: a concrete testbed implementation of bounded-loss tolerant protocols, and a software prototype on top of this protocol. 8/17/19 APNet' 19, Beijing, China

Summary DML applications run over reliable data transfer today, but that is not necessarily the only way. DML applications are bounded-loss tolerant, due to their stochastic (iterative approximation) nature. Ignoring some packet drops significantly reduces job completion time without affecting application performance. 8/17/19 APNet' 19, Beijing, China

Thanks! Q & A