Presentation transcript:

Ten Words … that promise the capacity to digest massive datasets and offer powerful predictive analytics on top of them. These principles and strategies span a continuum from application, to engineering, to theoretical research. They are revealed by exposing the underlying statistical and algorithmic characteristics that are unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to show how these principles have been harnessed to design and develop high-performance distributed ML software. Machine Learning (ML) has become a primary mechanism for distilling structured information and knowledge from raw data. Conventional ML research and development, which excels in model, algorithm, and theory innovations, is now challenged by the growing prevalence of Big Data collections, such as the hundreds of hours of video uploaded to video-sharing sites every minute…

Tushar's Birthday Bombs. It's Tushar's birthday today and he has N friends. Friends are numbered [0, 1, 2, …, N-1], and the i-th friend has a positive strength S(i). Today being his birthday, his friends have planned to give him birthday bombs (kicks :P). Tushar's friends know Tushar's pain-bearing limit and will hit accordingly. If Tushar's resistance is denoted by R (>= 0), find the lexicographically smallest order of friends to kick Tushar so that the cumulative kick strength (the sum of the strengths of the friends who kick) does not exceed his resistance capacity and the total number of kicks is maximized. Note that each friend can kick an unlimited number of times (if a friend kicks x times, his strength is counted x times). For example: if R = 11 and S = [6, 8, 5, 4, 7], then the answer is [0, 2].
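One way to solve this is the following greedy sketch (Python; the function name birthday_bombs is a hypothetical helper, not from the slide): the friend with the smallest strength fixes the maximum number of kicks, and the leftover budget is then spent, position by position from the left, on substituting lower-indexed friends.

```python
def birthday_bombs(R, S):
    # The friend with the minimum strength determines the maximum number of kicks.
    min_s = min(S)
    if min_s > R:
        return []
    k = R // min_s              # maximum possible number of kicks
    budget = R - k * min_s      # slack available to "buy" smaller indices
    min_idx = S.index(min_s)    # first friend attaining the minimum strength
    order = [min_idx] * k
    # Greedily replace kicks from the left with lower-indexed (more expensive)
    # friends whenever the extra strength still fits in the remaining budget.
    for pos in range(k):
        for j in range(min_idx):
            extra = S[j] - min_s
            if extra <= budget:
                order[pos] = j
                budget -= extra
                break
    return order

print(birthday_bombs(11, [6, 8, 5, 4, 7]))   # [0, 2], as in the example
```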

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. University of Rochester, ETH Zurich, University of California, Davis, IBM T. J. Watson Research Center. NIPS 2017 (oral).

Distributed Environment for Big Data / Big Models: 352 GPUs (P100), about 50k RMB per P100.

Model Parallelism: in the ideal situation the workload is evenly split, but with different workloads and varied per-worker performance, overall progress is degraded to that of the slowest worker!

Data Parallelism

Hierarchical Topology (figure): within each machine, GPU0–GPU3 connect to the CPU through a PCIe switch; the machines are connected to each other through network switches.

Centralized vs Decentralized

Related Work. P2P networks and wireless sensor networks: [Zhang and Kwok, ICML 2014]: ADMM without speedup; [Yuan et al., Optimization 2016]: inconsistent with the centralized solution; [Wu et al., arXiv 2016]: convergence in the asynchronous setting. Decentralized parallel stochastic algorithms: [Lan et al., arXiv 2017]: computational complexity $O(P/\epsilon^2)$ for general convex and $O(P/\epsilon)$ for strongly convex objectives; [Sirb et al., BigData 2016]: an asynchronous approach. None of them is proved to achieve speedup as the number of nodes increases.

Contributions: the first positive answer to the question "Can decentralized algorithms be faster than their centralized counterparts?", supported by theoretical analysis and large-scale empirical experiments (112 GPUs for ResNet-20).

Problem Formulation. Stochastic optimization problem: $\min_{x \in \mathbb{R}^N} f(x) := \mathbb{E}_{\xi \sim \mathcal{D}} F(x;\xi)$. Instances: deep learning, linear regression, logistic regression.
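As a concrete instance of this formulation (a minimal sketch under assumptions made here: a least-squares loss $F(x;\xi) = \tfrac{1}{2}(a^\top x - b)^2$ on synthetic data, not from the slides), plain SGD samples $\xi \sim \mathcal{D}$ at every step and follows the stochastic gradient of $F(x;\xi)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
x_true = rng.normal(size=d)

def sample():
    # One draw xi = (a, b) from D: a noisy linear measurement of x_true.
    a = rng.normal(size=d)
    return a, a @ x_true + 0.1 * rng.normal()

x = np.zeros(d)
gamma = 0.01
for t in range(5000):
    a, b = sample()                 # xi ~ D
    grad = (a @ x - b) * a          # gradient of F(x; xi) = 0.5 * (a @ x - b)^2
    x -= gamma * grad               # SGD step on f(x) = E[F(x; xi)]

print(np.linalg.norm(x - x_true))   # small residual: x approaches the minimizer
```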

Distributed Setting (figure: five nodes, node $i$ holding $F_i(x;\xi)$, $\mathcal{D}_i$, and $f_i(x)$). The objective is $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{P} \sum_{i=1}^{P} f_i(x) = \frac{1}{P} \sum_{i=1}^{P} \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x;\xi)$. Two data settings: $\mathcal{D}_1 = \mathcal{D}_2 = \cdots = \mathcal{D}_P$ (all nodes share the data), or each $\mathcal{D}_i$ is a proper partition of $\mathcal{D}$.

Runtime: Decentralized Setting (figure: five nodes, node $i$ holding its local model $x_i^{(t)}$ and data $\mathcal{D}_i$). We expect $x_1^{(T)}, x_2^{(T)}, \ldots, x_P^{(T)} \to x^{(T)}$, or that the average $\bar{x} = \frac{1}{P}\sum_{i=1}^{P} x_i^{(T)}$ is optimal.

Algorithm at Iteration $t$ (figure: each node $i$ holds $x_i^{(t)}$, samples $\xi_i^{(t)} \sim \mathcal{D}_i$, and evaluates $F_i(x_i^{(t)}; \xi_i^{(t)})$ while exchanging models with its neighbors). The update is $x_i^{(t+1)} = \sum_{j \in N_i \cup \{i\}} x_j^{(t)} w_{ij} - \gamma\, \partial F_i(x_i^{(t)}; \xi_i^{(t)})$, where $W \in \mathbb{R}^{P \times P}$ is a symmetric doubly stochastic matrix: $w_{ij} \in [0,1]$, $w_{ij} = w_{ji}$, and $\sum_j w_{ij} = 1$.
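A minimal NumPy sketch of this update (an illustration under assumed settings, not the authors' CNTK implementation; the ring topology, synthetic least-squares data, and step size are choices made here): every node mixes the neighbouring copies with the weights $w_{ij}$ and then takes a local stochastic gradient step.

```python
import numpy as np

rng = np.random.default_rng(1)
P, d, gamma, T = 5, 10, 0.01, 3000

# Ring mixing matrix W: each node averages itself and its two neighbours (1/3 each).
W = np.zeros((P, P))
for i in range(P):
    for j in (i - 1, i, i + 1):
        W[i, j % P] = 1.0 / 3.0

x_true = rng.normal(size=d)

def sample():
    # One draw xi_i from D_i (here all nodes share the same distribution D).
    a = rng.normal(size=d)
    return a, a @ x_true + 0.1 * rng.normal()

X = np.zeros((P, d))                              # row i is node i's local model x_i
for t in range(T):
    grads = np.empty_like(X)
    for i in range(P):                            # in a real system this loop runs in parallel
        a, b = sample()
        grads[i] = (a @ X[i] - b) * a             # stochastic gradient dF_i(x_i^t; xi_i^t)
    X = W @ X - gamma * grads                     # x_i^{t+1} = sum_j w_ij x_j^t - gamma * grad_i

x_bar = X.mean(axis=0)
print(np.linalg.norm(x_bar - x_true))             # the averaged model is accurate
print(np.abs(X - x_bar).max())                    # local copies stay close to the average
```

Running it shows the local copies staying close to their average while the average converges, which is the behaviour the convergence analysis on the following slides quantifies.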

Convergence Rate Analysis. Assumptions: bounded variance within each node, $\mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F_i(x;\xi) - \nabla f_i(x)\|^2 \le \sigma^2$ for all $i$ and $x$; bounded variance across nodes, $\mathbb{E}_{i \sim \mathcal{U}([P])} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \varsigma^2$ for all $x$; and $\rho = \left(\max\{|\lambda_2(W)|, |\lambda_P(W)|\}\right)^2$. Convergence of the algorithm:
$$\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{1-\gamma L}{2}\,\mathbb{E}\left\|\partial f\!\left(\frac{X_t \mathbf{1}_P}{P}\right)\right\|^2 + D_1\,\mathbb{E}\left\|\nabla f\!\left(\frac{X_t \mathbf{1}_P}{P}\right)\right\|^2\right) \le \frac{f(0)-f^*}{\gamma T} + \frac{\gamma L}{2P}\sigma^2 + \frac{\gamma^2 L^2 P \sigma^2}{(1-\rho) D_2} + \frac{9\gamma^2 L^2 P \varsigma^2}{(1-\sqrt{\rho})^2 D_2},$$
where $D_1 = \frac{1}{2} - \frac{9\gamma^2 L^2 P}{(1-\sqrt{\rho})^2 D_2}$ and $D_2 = 1 - \frac{18\gamma^2 P L^2}{(1-\sqrt{\rho})^2}$.

Convergence Rate Analysis. If $\gamma = \frac{1}{2L + \sigma\sqrt{T/P}}$, then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla f\!\left(\frac{X_t \mathbf{1}_P}{P}\right)\right\|^2 \le \frac{8\,(f(0)-f^*)\,L}{T} + \frac{\left(8 f(0) - 8 f^* + 4L\right)\sigma}{\sqrt{TP}},$$
provided $T$ is sufficiently large, in particular
$$T \ge \frac{4 L^4 P^5}{\sigma^6 \left(f(0)-f^*+L\right)^2}\left(\frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{(1-\sqrt{\rho})^2}\right)^2 \quad\text{and}\quad T \ge \frac{72 L^2 P^2}{\sigma^2 (1-\sqrt{\rho})^2}.$$

Centralized PSGD vs. Decentralized PSGD. Centralized PSGD (mini-batch): communication complexity on the busiest node $O(P)$, convergence rate $O(1/\sqrt{TP})$. Decentralized PSGD: communication complexity on the busiest node $O(\mathrm{Deg}(\mathrm{network}))$, convergence rate $O(1/T + 1/\sqrt{TP})$, computational complexity $O(P/\epsilon + 1/\epsilon^2)$. D-PSGD is better than C-PSGD: it avoids the communication traffic jam on the busiest node while achieving linear speedup.

Ring Network. The mixing matrix is
$$W = \begin{pmatrix} \tfrac13 & \tfrac13 & & & \tfrac13 \\ \tfrac13 & \tfrac13 & \tfrac13 & & \\ & \ddots & \ddots & \ddots & \\ & & \tfrac13 & \tfrac13 & \tfrac13 \\ \tfrac13 & & & \tfrac13 & \tfrac13 \end{pmatrix} \in \mathbb{R}^{P \times P}.$$
Linear speedup can be achieved if $P = O(T^{1/9})$ when all nodes share $\mathcal{D}$, and $P = O(T^{1/13})$ when $\mathcal{D}$ is partitioned. These results are too loose!
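A quick sanity check for this mixing matrix (a NumPy sketch, not part of the paper; ring_mixing_matrix is a made-up helper): build the ring $W$, verify it is symmetric and doubly stochastic, and compute the $\rho = (\max\{|\lambda_2(W)|, |\lambda_P(W)|\})^2$ that appears in the convergence bounds.

```python
import numpy as np

def ring_mixing_matrix(P):
    # Symmetric doubly stochastic matrix of the ring graph: self + two neighbours, 1/3 each.
    W = np.zeros((P, P))
    for i in range(P):
        for j in (i - 1, i, i + 1):
            W[i, j % P] = 1.0 / 3.0
    return W

W = ring_mixing_matrix(16)
assert np.allclose(W, W.T)                  # symmetric
assert np.allclose(W.sum(axis=1), 1.0)      # row sums are 1
assert np.allclose(W.sum(axis=0), 1.0)      # column sums are 1

# rho = (max{|lambda_2(W)|, |lambda_P(W)|})^2, the quantity in the convergence bounds.
abs_eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
rho = abs_eigs[1] ** 2
print(rho)                                  # approaches 1 as P grows, i.e. slower mixing
```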

Convergence Rate for the Average of Local Variables. If $\gamma = \frac{1}{2L + \sigma\sqrt{T/P}}$, we have
$$\frac{1}{TP}\,\mathbb{E}\sum_{t=0}^{T-1}\sum_{i=1}^{P}\left\|\frac{\sum_{j=1}^{P} x_j^{(t)}}{P} - x_i^{(t)}\right\|^2 \le \frac{P \gamma^2 A}{D_2},$$
where
$$A = \frac{2\sigma^2}{1-\rho} + \frac{18\varsigma^2}{(1-\sqrt{\rho})^2} + \frac{L^2}{D_1}\left(\frac{\sigma^2}{1-\rho} + \frac{9\varsigma^2}{(1-\sqrt{\rho})^2}\right) + \frac{18}{(1-\sqrt{\rho})^2}\left(\frac{f(0)-f^*}{\gamma T} + \frac{\gamma L \sigma^2}{2P D_1}\right).$$
The running average of $\mathbb{E}\sum_{i=1}^{P}\left\|\frac{\sum_{j=1}^{P} x_j^{(t)}}{P} - x_i^{(t)}\right\|^2$ converges to 0 at an $O(1/T)$ rate.

Experimental Settings. ResNet on CIFAR-10. Implementations: Centralized: CNTK with MPI's AllReduce primitive, and standard parameter-server based synchronous SGD with one parameter server; Decentralized: CNTK with MPI point-to-point primitives; EASGD: the standard Torch implementation. Machines: 7 GPUs: a single local machine with Nvidia TITAN Xp; 10 GPUs: 10 p2.xlarge EC2 instances with Nvidia K80; 16 GPUs: 16 local machines with Nvidia K20; 112 GPUs: 4 p2.16xlarge and 6 p2.8xlarge EC2 instances with Nvidia K80.

Comparison between D-PSGD and two centralized implementations (7 and 10 GPUs)

Convergence Rate

D-PSGD Speedup

D-PSGD Communication Patterns

Convergence comparison between D-PSGD and EAMSGD (EASGD’s momentum variant)

Convergence comparison between D-PSGD and Momentum SGD

Thanks