Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu
University of Rochester, ETH Zurich, University of California, Davis, IBM T. J. Watson Research Center
NIPS 2017 (oral)
Distributed Environment for Big Data/Model: 352 GPUs (P100), about 50k RMB per P100
Model Parallelism. Ideal situation vs. different workloads and varied performance: overall progress is degraded by the slowest worker!
Data Parallelism
Hierarchical Topology: within each machine, GPU0-GPU3 connect to the CPU through a PCIe switch; machines are linked to one another through network switches.
Centralized vs Decentralized
Related Work
P2P network: wireless sensing networks
[Zhang and Kwok, ICML 2014]: ADMM without speedup
[Yuan et al., Optimization 2016]: inconsistent with the centralized solution
[Wu et al., ArXiv 2016]: convergence in the asynchronous setting
Decentralized parallel stochastic algorithms
[Lan et al., ArXiv 2017]: computational complexity $O(P/\epsilon^2)$ for general convex, $O(P/\epsilon)$ for strongly convex objectives
[Sirb et al., BigData 2016]: asynchronous approach
None of these is proved to achieve speedup as the number of nodes increases.
Contributions
The first positive answer to the question: "Can decentralized algorithms be faster than their centralized counterparts?"
Theoretical analysis
Large-scale empirical experiments (up to 112 GPUs for ResNet-20)
Problem Formulation
Stochastic optimization problem: $\min_{x\in\mathbb{R}^N} f(x) := \mathbb{E}_{\xi\sim\mathcal{D}} F(x;\xi)$
Covers deep learning, linear regression, logistic regression.
Distributed Setting
Each node $i$ holds a local data distribution $\mathcal{D}_i$ and local objective $f_i(x) = \mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi)$:
$$\min_{x\in\mathbb{R}^d} f(x) = \frac{1}{P}\sum_{i=1}^{P} f_i(x) = \frac{1}{P}\sum_{i=1}^{P} \mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi)$$
Two data settings: $\mathcal{D}_1 = \mathcal{D}_2 = \cdots = \mathcal{D}_P$ (shared data), or each $\mathcal{D}_i$ is a proper partition of $\mathcal{D}$.
Runtime: Decentralized Setting
Each node $i$ maintains a local variable $x_i^{(t)}$ trained on its local data $\mathcal{D}_i$. The output is the average of the local variables, $\bar{x} = \frac{1}{P}\sum_{i=1}^{P} x_i^{(T)}$. We expect the local variables to reach consensus: $x_1^{(T)}, x_2^{(T)}, \dots, x_P^{(T)} \to \bar{x}^{(T)}$.
Algorithm: at Iteration $t$
Each node $i$ samples $\xi_i^{(t)}\sim\mathcal{D}_i$, exchanges its local variable with its neighbors, and updates
$$x_i^{(t+1)} = \sum_{j\in N_i\cup\{i\}} w_{ij}\, x_j^{(t)} - \gamma\, \partial F_i\!\left(x_i^{(t)};\xi_i^{(t)}\right)$$
where $W=(w_{ij})\in\mathbb{R}^{P\times P}$ is a symmetric doubly stochastic matrix: $w_{ij}\in[0,1]$, $w_{ij}=w_{ji}$, and $\sum_j w_{ij}=1$.
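To make the update rule concrete, here is a minimal NumPy sketch of one synchronous D-PSGD iteration. It is an illustrative simulation, not the authors' CNTK/MPI implementation; the callable `stochastic_grad` and the dense matrix multiply are assumptions made for brevity.

```python
import numpy as np

def dpsgd_step(x, W, gamma, stochastic_grad):
    """One synchronous D-PSGD iteration, simulated for all P workers at once.

    x: (P, d) array; row i is worker i's local model x_i^t
    W: (P, P) symmetric doubly stochastic mixing matrix
    gamma: step size
    stochastic_grad: callable (i, x_i) -> stochastic gradient of F_i at x_i
    """
    P = x.shape[0]
    # Neighborhood averaging: row i of W @ x equals sum_j w_ij * x_j^t.
    # Only the nonzero w_ij matter, so a real deployment exchanges models
    # with graph neighbors only (cost proportional to the node's degree).
    mixed = W @ x
    # Local stochastic gradient step, evaluated at the current local model x_i^t.
    grads = np.stack([stochastic_grad(i, x[i]) for i in range(P)])
    return mixed - gamma * grads
```

Because each worker only ever exchanges its model with its graph neighbors, its per-iteration communication cost scales with its degree in the topology, which is the $O(\mathrm{Deg}(\text{network}))$ entry in the comparison table later.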
Convergence Rate Analysis
Assumptions:
Bounded variance within each node: $\mathbb{E}_{\xi\sim\mathcal{D}_i}\left\|\nabla F_i(x;\xi)-\nabla f_i(x)\right\|^2 \le \sigma^2$, $\forall i,\forall x$
Bounded variance across nodes: $\mathbb{E}_{i\sim\mathcal{U}([P])}\left\|\nabla f_i(x)-\nabla f(x)\right\|^2 \le \varsigma^2$, $\forall x$
Convergence of the algorithm: with $\rho = \left(\max\{|\lambda_2(W)|,\ |\lambda_P(W)|\}\right)^2$, $D_2 = 1 - \frac{18\gamma^2 P L^2}{(1-\sqrt{\rho})^2}$, and $D_1 = \frac{1}{2} - \frac{9\gamma^2 L^2 P}{(1-\sqrt{\rho})^2 D_2}$,
$$\frac{1}{T}\left(\frac{1-\gamma L}{2}\sum_{t=0}^{T-1}\mathbb{E}\left\|\frac{\partial f(X_t)\mathbf{1}_P}{P}\right\|^2 + D_1\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla f\!\left(\frac{X_t\mathbf{1}_P}{P}\right)\right\|^2\right) \le \frac{f(0)-f^*}{\gamma T} + \frac{\gamma L}{2P}\sigma^2 + \frac{\gamma^2 L^2 P\sigma^2}{(1-\rho)D_2} + \frac{9\gamma^2 L^2 P\varsigma^2}{(1-\sqrt{\rho})^2 D_2}$$
Convergence Rate Analysis
If $\gamma = \frac{1}{2L + \sigma\sqrt{T/P}}$, then:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla f\!\left(\frac{X_t\mathbf{1}_P}{P}\right)\right\|^2 \le \frac{8\left(f(0)-f^*\right)L}{T} + \frac{\left(8f(0)-8f^*+4L\right)\sigma}{\sqrt{TP}}$$
provided $T$ is sufficiently large, in particular:
$$T \ge \frac{4L^4P^5}{\sigma^6\left(f(0)-f^*+L\right)^2}\left(\frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{(1-\sqrt{\rho})^2}\right)^2 \quad\text{and}\quad T \ge \frac{72L^2P^2}{\sigma^2(1-\sqrt{\rho})^2}$$
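Reading off the linear speedup (my paraphrase of the corollary above, not an additional result): once $T$ clears the two thresholds, the $1/\sqrt{TP}$ term dominates the bound, so driving the averaged gradient norm below a target accuracy $\epsilon$ requires
$$O\!\left(\frac{1}{T} + \frac{1}{\sqrt{TP}}\right) \le \epsilon \quad\Longleftarrow\quad T = \Omega\!\left(\frac{1}{\epsilon} + \frac{1}{P\epsilon^2}\right),$$
and for small $\epsilon$ the $1/(P\epsilon^2)$ term dominates, so the required number of iterations decreases linearly as workers are added. This is the linear-speedup claim in the comparison table on the next slide.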
Centralized PSGD vs. Decentralized PSGD

| Algorithm | Communication complexity on the busiest node | Convergence rate | Computational complexity |
|---|---|---|---|
| Centralized PSGD (mini-batch) | $O(P)$ | $O\!\left(\frac{1}{\sqrt{TP}}\right)$ | $O\!\left(\frac{P}{\epsilon} + \frac{1}{\epsilon^2}\right)$ |
| Decentralized PSGD | $O(\mathrm{Deg}(\text{network}))$ | $O\!\left(\frac{1}{T} + \frac{1}{\sqrt{TP}}\right)$ | |

D-PSGD is better than C-PSGD: it avoids the communication traffic jam on the busiest node. Linear speedup.
Ring Network
$$W = \begin{pmatrix} 1/3 & 1/3 & & & 1/3 \\ 1/3 & 1/3 & 1/3 & & \\ & \ddots & \ddots & \ddots & \\ & & 1/3 & 1/3 & 1/3 \\ 1/3 & & & 1/3 & 1/3 \end{pmatrix} \in \mathbb{R}^{P\times P}$$
Linear speedup can be achieved if:
$P = O\!\left(T^{1/9}\right)$ if the nodes share the data $\mathcal{D}$
$P = O\!\left(T^{1/13}\right)$ if $\mathcal{D}$ is partitioned
These results are too loose!
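To make the ring example concrete, here is a small NumPy sketch (an illustration under the stated $1/3$ weights, not code from the paper) that builds this $W$ for a chosen $P$ and checks the quantities the analysis relies on: symmetry, stochastic rows, and $\rho = \left(\max\{|\lambda_2(W)|, |\lambda_P(W)|\}\right)^2$.

```python
import numpy as np

def ring_mixing_matrix(P):
    """Ring topology: each worker averages itself and its two ring
    neighbors with weight 1/3 (symmetric, doubly stochastic)."""
    W = np.zeros((P, P))
    for i in range(P):
        W[i, i] = 1 / 3
        W[i, (i - 1) % P] = 1 / 3
        W[i, (i + 1) % P] = 1 / 3
    return W

W = ring_mixing_matrix(8)
assert np.allclose(W, W.T)              # symmetric
assert np.allclose(W.sum(axis=1), 1.0)  # each row sums to 1

# rho governs how fast the network mixes; it enters the bounds through
# the 1/(1-rho) and 1/(1-sqrt(rho))^2 factors.
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]  # eigenvalues, descending
rho = max(abs(eigs[1]), abs(eigs[-1])) ** 2
print(f"rho = {rho:.4f}")
```

As $P$ grows the ring mixes more slowly and $\rho$ approaches 1, inflating the $1/(1-\rho)$ and $1/(1-\sqrt{\rho})^2$ factors; this is consistent with the conservative restrictions $P = O(T^{1/9})$ and $P = O(T^{1/13})$ above.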
Convergence Rate for the Average of Local Variables
If $\gamma = \frac{1}{2L + \sigma\sqrt{T/P}}$, we have
$$\frac{1}{TP}\,\mathbb{E}\sum_{t=0}^{T-1}\sum_{i=1}^{P}\left\|\frac{\sum_{j=1}^{P} x_j^{(t)}}{P} - x_i^{(t)}\right\|^2 \le \frac{P\gamma^2 A}{D_2}$$
where
$$A = \frac{2\sigma^2}{1-\rho} + \frac{18\varsigma^2}{(1-\sqrt{\rho})^2} + \frac{L^2}{D_1}\left(\frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{(1-\sqrt{\rho})^2}\right) + \frac{18}{(1-\sqrt{\rho})^2 D_1}\left(\frac{f(0)-f^*}{\gamma T} + \frac{\gamma L\sigma^2}{2P}\right)$$
The running average of $\mathbb{E}\left\|\frac{\sum_{j=1}^{P} x_j^{(t)}}{P} - x_i^{(t)}\right\|^2$ converges to 0 at an $O\!\left(\frac{1}{T}\right)$ rate.
Experimental Settings
Task: ResNet on CIFAR-10
Implementations:
Centralized: CNTK with MPI's AllReduce primitive, and standard parameter-server based synchronous SGD with one parameter server
Decentralized: CNTK with MPI point-to-point primitives
EASGD: standard Torch implementation
Machines:
7 GPUs: single local machine, Nvidia TITAN Xp
10 GPUs: 10 p2.xlarge EC2 instances, Nvidia K80
16 GPUs: 16 local machines, Nvidia K20
112 GPUs: 4 p2.16xlarge and 6 p2.8xlarge EC2 instances, Nvidia K80
Comparison between D-PSGD and two centralized implementations (7 and 10 GPUs)
Convergence Rate
D-PSGD Speedup
D-PSGD Communication Patterns
Convergence comparison between D-PSGD and EAMSGD (EASGD’s momentum variant)
Convergence comparison between D-PSGD and Momentum SGD
Thanks