Accelerating Distributed Machine Learning by Smart Parameter Server

Presentation transcript:

Accelerating Distributed Machine Learning by Smart Parameter Server (Jinkun Geng, Dan Li and Shuai Wang). Hello, everyone. I am Jinkun Geng from Tsinghua University. Today the topic I would like to share with you is "Accelerating Distributed Machine Learning by Smart Parameter Server". This work was done in cooperation with my supervisor, Professor Dan Li, and Mr. Shuai Wang.

Background: Distributed machine learning has become common practice because of: 1. The explosive growth of data size. As we all know, machine learning has become the hottest topic in every field, and distributed machine learning has become common practice. On one hand, the training data is experiencing explosive growth,

Background: Distributed machine learning has become common practice because of: 2. The increasing complexity of training models. And on the other hand, the training model is also becoming more and more complex in order to gain stronger learning capability. ImageNet Competition: <10 layers (Hinton, 2012), 22 (Google, 2014), 152 (Microsoft, 2015), 1207 (SenseTime, 2016).

Background: The Parameter Server (PS)-based architecture is widely supported by mainstream DML systems. The increasing size of both the training data and the training model motivates the development of DML systems, and the parameter server architecture is widely supported by mainstream DML systems such as TensorFlow, MXNet, PyTorch and so on.

Background: However, the power of the PS architecture has not been fully exploited. 1. Communication redundancy. 2. The straggler problem. The first issue is communication redundancy: during the iterative DML process, much redundant communication is involved. The second issue is the straggler problem: as mentioned in yesterday's talk, when we use BSP for DML, the overall performance is determined by the slowest worker, that is, the straggler.

Background: A deeper insight… 1. The worker-centric design is less efficient. 2. The PS can be more intelligent (i.e., a Smart PS). So how can we optimize DML performance and mitigate these two problems? With a deeper insight into the current PS-based architecture, we note that existing works all follow a worker-centric design, which can be less efficient. Actually, the parameter server usually holds more information than the workers, which can be better exploited. In other words, the PS can hold a global view of the parameter dependencies, whereas each worker only holds a partial view of its own parameters. Therefore, we can make the PS more intelligent and design a Smart PS.

Background: To make the PS more intelligent, it should be dependency-aware and straggler-assistant. Targeting the aforementioned two problems, if we want to accelerate DML with a smart PS, then the PS should be dependency-aware as well as straggler-assistant. I will talk about these two features in the following slides.

A Simple Model of Parameters. In order to study the dependency among parameters, we first need a simple model of the parameters, because current DML models may have millions or even billions of parameters, and the cost is unaffordable if we study the dependency between every pair of them. Fortunately, reviewing typical DML models, we find that a certain group of parameters may have similar behavior during the DML process, so we can treat them as one unit. For example, in a deep neural network the parameters in one layer share the same behavior, so we partition the parameters into parameter units (PUs) in a layer-wise manner. Another example is matrix factorization, where each matrix block can be considered as one unit. The partition of the parameters is user-defined, and here we just illustrate it with some typical examples; users can choose their own way to define the parameter units.
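To make the PU idea concrete, below is a minimal Python sketch of user-defined PU partitioning; all names (ParameterUnit, layerwise_partition, blockwise_partition) are illustrative assumptions, not the actual SmartPS interface.

```python
# Minimal sketch of user-defined parameter-unit (PU) partitioning.
# Names are illustrative, not the real SmartPS API.

from dataclasses import dataclass
from typing import List

@dataclass
class ParameterUnit:
    pu_id: int              # identifier the PS uses to track dependencies
    param_names: List[str]  # parameters that behave as one unit

def layerwise_partition(layer_names: List[str]) -> List[ParameterUnit]:
    """For a DNN, treat all parameters of one layer as a single PU."""
    return [ParameterUnit(i, [f"{name}.weight", f"{name}.bias"])
            for i, name in enumerate(layer_names)]

def blockwise_partition(n_row_blocks: int, n_col_blocks: int) -> List[ParameterUnit]:
    """For matrix factorization, treat each matrix block as a single PU."""
    return [ParameterUnit(r * n_col_blocks + c, [f"block[{r}][{c}]"])
            for r in range(n_row_blocks) for c in range(n_col_blocks)]

# e.g. layerwise_partition(["conv1", "conv2", "fc"]) yields 3 PUs
```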

Workflow of PS-based DML. So, with the simple model of PUs, let's have a look at the workflow of PS-based DML. Generally speaking, the workflow can be divided into four phases. In the first step, workers pull the parameters from the PS. In the second step, workers compute to refine the parameters. In the third step, they push the refined parameters to the PSes. In the fourth step, the PSes aggregate the parameters and wait for the workers to pull them back and start the next iteration.
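As a reference point for the strategies that follow, here is a minimal sketch of this four-phase loop in Python; the ps and worker objects and their methods (pull, compute, push, aggregate) are hypothetical, not the API of any specific PS system.

```python
# Minimal sketch of one iteration of the classic PS-based workflow
# (pull -> compute -> push -> aggregate). All objects and methods are hypothetical.

def worker_iteration(worker, ps):
    params = ps.pull(worker.needed_pus)   # phase 1: pull parameters from the PS
    updates = worker.compute(params)      # phase 2: local computation / parameter refinement
    ps.push(worker.id, updates)           # phase 3: push refined parameters to the PS

def ps_aggregate(ps, updates_from_workers):
    # phase 4: the PS aggregates the updates and holds the new parameters
    # until workers pull them back for the next iteration
    for pu_id, update in updates_from_workers.items():
        ps.params[pu_id] = ps.aggregate(ps.params[pu_id], update)
```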

Workflow of PS-based DML. If we depict the four phases on a timeline, it can be illustrated as in this picture. We can see that the makespan of one iteration is composed of four main parts.

Workflow of PS-based DML. In our work on SmartPS, we mainly focus on the three communication-related parts (t_R, t_I and t_S) and try to accelerate DML by optimizing them. As for the remaining part, the computation time t_C, there is already a series of existing works, such as work stealing, that optimize it, and those works can be complemented by SmartPS to further improve DML performance.

Design Strategies. To make the PS more intelligent: 1. Selective update (t_R). 2. Proactive push (t_I). 3. Prioritized transmission (t_R and t_I). 4. Unnecessary push blockage (t_S and t_R). In order to accelerate DML with more intelligence in the PS, we design four main strategies, and I will introduce them one by one.

Strategy 1: Selective Update. Recall the workflow of DML. We can see that for typical DML tasks, especially model-parallel tasks, not every worker needs all the fresh parameters in every iteration.

Strategy 1: Selective Update. (PU_{i,j} denotes PU_j in iteration i.) For example, suppose there are four workers and each worker holds just one PU. The dependencies among the PUs can be illustrated as in this picture.

Strategy 1: Selective Update. (PU_{i,j} denotes PU_j in iteration i.) When we use BSP for DML, the overall performance is mainly determined by the slowest worker, that is, worker 3.

Strategy 1: Selective Update. (PU_{i,j} denotes PU_j in iteration i.) But we can see that, during the two iterations, the PU on worker 0 does not depend on the PU from worker 3, so worker 0 does not need to pull the parameters from worker 3. Instead, it just needs to selectively pull the parameters from worker 1 to update its local parameters, and the transmission time is saved.
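A minimal sketch of how a dependency-aware PS could serve such a selective pull, assuming a hypothetical per-iteration dependency table; the names dependencies, params and send are illustrative, not the actual SmartPS interface.

```python
# Sketch of Strategy 1 under the dependency example above: worker 0's PU
# depends only on worker 1's PU for the next iteration, so the PS sends it
# only that PU instead of all fresh parameters. All names are hypothetical.

def selective_update(ps, worker_id, iteration):
    needed = ps.dependencies[worker_id][iteration]   # e.g. worker 0 -> [PU 1]
    fresh = {pu_id: ps.params[pu_id] for pu_id in needed}
    ps.send(worker_id, fresh)                        # skip PUs the worker does not depend on
```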

Strategy 2: Proactive Push. (PU_{i,j} denotes PU_j in iteration i.) Meanwhile, though worker 0 does not depend on worker 3, worker 3 does depend on worker 0. Remember that worker 3 is the slowest whereas worker 0 is the fastest. So even while worker 3 is still computing and has not yet asked for the parameters, worker 0 has already prepared its parameters and pushed them to the PS. The PS can therefore proactively push the parameters to worker 3 even before it asks for them. In this way, the parameters do not remain idle on the PS and can be accessed by the worker earlier.
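A sketch of how the proactive push could be triggered on the PS side when a fresh PU arrives; consumers_of, is_still_computing and send are assumed, hypothetical helpers rather than real API calls.

```python
# Sketch of Strategy 2: as soon as a fast worker pushes a PU that a slow
# worker depends on, the PS forwards it proactively instead of waiting for
# the slow worker's pull request. All names are hypothetical.

def on_push(ps, src_worker, pu_id, value, iteration):
    ps.params[pu_id] = ps.aggregate(ps.params[pu_id], value)
    for dst in ps.consumers_of(pu_id, iteration):     # workers whose next iteration needs this PU
        if ps.is_still_computing(dst):
            ps.send(dst, {pu_id: ps.params[pu_id]})   # push before the pull request arrives
```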

Strategy 3: Straggler-Assistant. (PU_{i,j} denotes PU_j in iteration i.) As for the third strategy, we try to assist the stragglers with prioritized parameter transmission. Taking data-parallel DML tasks as an example, the PU on each worker is needed by the PUs on all the other workers.

Strategy 3: Straggler-Assistant. Baseline (no assistance): We assume that worker 0 is the slowest, whereas worker 2 is the fastest; then worker 0 becomes the straggler. If we treat all workers the same, then when the parameters are available on the PS, the PS will transmit the parameters to every worker equally. That is to say, they will share the bandwidth and will access the parameters at the same time.

Strategy 3: Straggler-Assistant. Straggler-assistant transmission: However, if we first transmit the parameters to worker 0, then to worker 1, and finally to worker 2, then worker 0 can access the parameters earlier and start the next iteration earlier. In this way, the performance gap between worker 0 and the other workers is narrowed in the following iterations, and the overall training can be accelerated.
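A sketch of the prioritized transmission order, assuming the PS keeps a rough per-worker progress estimate; estimated_speed and send are hypothetical helpers.

```python
# Sketch of Strategy 3: instead of sharing the PS bandwidth equally, send the
# fresh parameters to the slowest worker first, then to the faster ones.
# estimated_speed() is a hypothetical per-worker progress estimate.

def prioritized_transmission(ps, fresh_params, workers):
    # the slowest worker (the straggler) gets the full bandwidth first
    for w in sorted(workers, key=lambda w: ps.estimated_speed(w)):
        ps.send(w, fresh_params)   # sequential, prioritized sends
```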

Strategy 4: Blocking Unnecessary Pushes. As for the last strategy, we try to utilize the data locality of DML tasks. For example, we can see that the PU on worker 2 is not needed by the others during the two iterations; therefore, after the first iteration, worker 2 does not need to push its parameters to the PS. In other words, we just block the unnecessary pushes and save the transmission bandwidth and time.
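A sketch of how a push could be blocked when no other worker will read the PU within a small lookahead window; consumers_of is assumed to return a set of worker ids and, like the other names here, is hypothetical.

```python
# Sketch of Strategy 4: skip pushes for PUs that no other worker will read in
# the upcoming iterations, saving transmission bandwidth. The dependency
# lookup (consumers_of) is assumed, not a real API.

def maybe_push(worker, ps, pu_id, update, iteration, lookahead=2):
    needed_by_others = any(
        ps.consumers_of(pu_id, iteration + k) - {worker.id}   # set of other consumers
        for k in range(1, lookahead + 1))
    if needed_by_others:
        ps.push(worker.id, {pu_id: update})
    # otherwise keep the PU local and block the unnecessary push
```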

Evaluation. Experiment setting: 17 nodes with different performance configurations (1 PS + 16 workers); 2 benchmarks: Matrix Factorization and PageRank; 5 baselines: BSP, ASP, SSP (slack=1), SSP (slack=2), SSP (slack=3). Okay, so with the four intelligent strategies integrated, we conduct comparative experiments to evaluate SmartPS.

Evaluation MF Benchmark: With a common threshold, SmartPS reduces the training time by 68.1%~90.3% compared with the baselines.

Evaluation PR Benchmark: With a common threshold, SmartPS reduces the training time by 65.7%~84.9% compared with the baselines.

Further Discussion. Comparison to some recent works (SmartPS vs. TICTAC/P3), commonalities: 1. Both leverage knowledge of parameter dependencies. 2. Both leverage prioritized transmission for DML acceleration.

Further Discussion. Comparison to some recent works (SmartPS vs. TICTAC/P3), differences:
SmartPS: general solution (data-parallel and model-parallel); TICTAC/P3: CNN-specific.
SmartPS: straggler-assistant; TICTAC/P3: straggler-oblivious.
SmartPS: inter- and intra-iteration overlap; TICTAC/P3: only inter-iteration overlap.
SmartPS: compatible with ASP and SSP; TICTAC/P3: only works for BSP (worsens the performance gap under ASP).

Ongoing Work. A deeper insight into the PS-based architecture… Functions of the PS: 1. Parameter distribution -> data access control. 2. Parameter aggregation -> data operation. Function of the worker: 1. Parameter refinement -> data operation.

Ongoing Work. [Slide figure: the PS functions of parameter distribution and parameter aggregation, and the worker function of parameter refinement, are recast as data access control and data operations; the access-control part is handled by lightweight tokens.]

Next Generation of SmartPS: Parameter Server -> Token Server. 1. Decouple data (access) control from data operation. 2. A lightweight and smart Token Server instead of a Parameter Server.
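Since this is ongoing work, the following is only a speculative sketch of what the decoupling might look like: a lightweight token server that grants access tokens (data access control) while the parameter data itself flows between workers (data operation). Every name here is a hypothetical assumption, not part of the presented system.

```python
# Speculative sketch of the token-server idea: the lightweight token server
# only decides who may access which PU and when (data access control),
# while the actual parameter data is moved elsewhere (data operation).
# All class, method, and field names are hypothetical.

class TokenServer:
    def __init__(self, dependencies):
        self.dependencies = dependencies   # global view of PU dependencies per worker/iteration

    def grant(self, worker_id, iteration):
        # hand out tokens naming the PUs (and their current holders) that this
        # worker is allowed to fetch for the given iteration
        return [{"pu_id": pu, "holder": holder}
                for pu, holder in self.dependencies[worker_id][iteration]]
```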

Thanks! NASP Research Group: https://nasp.cs.tsinghua.edu.cn/ https://www.gengjinkun.com/