Chilimbi, et al. (2014) Microsoft Research


Project Adam: Building an Efficient and Scalable Deep Learning Training System. Chilimbi, et al. (2014), Microsoft Research. Presented by Saifuddin Hitawala, October 17, 2016, CS 848, University of Waterloo

Traditional Machine Learning: guided by an objective function, humans hand-craft features; data flows through those features into a classifier that produces the prediction.

Deep Learning: guided by an objective function, the deep learning model maps data directly to predictions, replacing the hand-crafted feature and classifier stages.

Deep Learning learns a hierarchy of features: edges, then textures and shapes, then faces and object properties.

Problem with Deep Learning: the computation required grows with the complexity of the task, the size of the model, and the amount of (weakly labelled) training data. Current computational needs are on the order of petaFLOPS!

Accuracy scales with data and model size

Adam: Scalable Deep Learning Platform. Three components: data servers, which perform data transformations and help prevent over-fitting; the model training system, which executes inputs, checks for errors, and uses the errors to update weights; and parameter servers, which maintain the weight updates.
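
To make the division of labour concrete, here is a minimal Python sketch of the three roles, assuming a toy linear model and a single in-process parameter store; the class and method names (DataServer, ParameterServer, ModelReplica, next_batch, update) are illustrative assumptions, not interfaces from the Adam codebase.

import numpy as np

class DataServer:
    """Serves training batches; in Adam this is also where data
    transformations are applied to help prevent over-fitting."""
    def __init__(self, x, y, seed=0):
        self.x, self.y = x, y
        self.rng = np.random.default_rng(seed)

    def next_batch(self, size):
        idx = self.rng.integers(0, len(self.x), size)
        return self.x[idx], self.y[idx]

class ParameterServer:
    """Holds the global weights and applies additive updates."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def get(self):
        return self.w.copy()

    def update(self, delta):
        self.w += delta                       # additive, so update order does not matter

class ModelReplica:
    """Fetches weights, checks for errors on a batch, publishes an update."""
    def __init__(self, ps, ds):
        self.ps, self.ds = ps, ds

    def train_step(self, lr=0.01):
        xb, yb = self.ds.next_batch(32)
        w = self.ps.get()
        err = xb @ w - yb                     # execute the input, check for errors
        grad = xb.T @ err / len(xb)
        self.ps.update(-lr * grad)            # use the errors to update the weights

# Tiny end-to-end run on synthetic data.
rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 8))
y = x @ rng.normal(size=8)
ds, ps = DataServer(x, y), ParameterServer(8)
replica = ModelReplica(ps, ds)
for _ in range(500):
    replica.train_step()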

System Architecture: data shards feed multiple model replicas (data parallelism); each replica is partitioned across several model workers (model parallelism); all replicas read from and publish updates to a global model parameter store.
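
A small sketch of the two axes of parallelism, assuming a single fully connected layer stored as a NumPy array; the helper names split_model and split_data are hypothetical, chosen only to illustrate how one replica's weights and the training data might be sharded.

import numpy as np

def split_model(weights, n_workers):
    """Model parallelism: each worker in a replica owns a slice of the layer."""
    return np.array_split(weights, n_workers, axis=1)

def split_data(samples, n_replicas):
    """Data parallelism: each model replica trains on its own data shard."""
    return np.array_split(samples, n_replicas, axis=0)

weights = np.random.randn(1024, 1024)                # one large layer
data = np.random.randn(8_000, 1024)                  # training inputs

model_shards = split_model(weights, n_workers=4)     # within one replica
data_shards = split_data(data, n_replicas=8)         # across replicas

# Worker 0 of replica 0 computes its slice of the layer output for its shard;
# the global model parameter store (not shown) reconciles the weight updates
# that the replicas publish.
partial_output = data_shards[0] @ model_shards[0]    # shape (1000, 256)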

Asynchronous weight updates: multiple threads run on a single training machine, each processing a different input and computing its own weight update. Because weight updates are associative and commutative, no locks are required on the shared weights; the result is simply the sum Δw = Δw_7 + Δw_24 + Δw_6 + … regardless of the order in which the threads apply their updates. The same property makes this useful for scaling across multiple machines.
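
Below is a Hogwild-style sketch of lock-free updates on one machine, assuming a simple linear model; Python threads stand in for Adam's training threads, and the per-thread loop is illustrative rather than the paper's actual training code.

import threading
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(10_000, 256))
targets = inputs @ rng.normal(size=256)
shared_w = np.zeros(256)                           # shared weights, never locked

def worker(thread_id, steps=500, lr=1e-3):
    local_rng = np.random.default_rng(thread_id)
    for _ in range(steps):
        i = local_rng.integers(0, len(inputs))     # each thread draws its own input
        x, y = inputs[i], targets[i]
        delta_w = -lr * (x @ shared_w - y) * x     # this thread's weight update
        # Weight updates are additions, so they commute: no lock is taken on
        # shared_w and the final weights are the sum of all the deltas.
        np.add(shared_w, delta_w, out=shared_w)    # in-place, lock-free add

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()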

Model partitioning: less is more. The model is partitioned across multiple machines. To avoid streaming from disk, each model shard is kept in memory to take advantage of DRAM bandwidth; but memory bandwidth is still a bottleneck, so Adam goes one level lower and sizes each shard's working set to fit in the L3 cache. This makes training speed significantly higher on each machine.
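
A back-of-the-envelope sketch of the cache-sizing argument, assuming 32-bit weights and an illustrative 32 MB L3 cache per machine (both numbers are assumptions, not figures from the paper).

import math

L3_BYTES = 32 * 1024 * 1024        # assumed L3 cache size per machine
BYTES_PER_WEIGHT = 4               # float32

def shards_for_layer(n_in, n_out, cache_bytes=L3_BYTES):
    """How many shards a layer's weight matrix must be split into so that
    each shard's working set stays resident in L3 cache."""
    layer_bytes = n_in * n_out * BYTES_PER_WEIGHT
    return math.ceil(layer_bytes / cache_bytes)

# A 4096 x 4096 fully connected layer is 64 MB of float32 weights, so under
# these assumptions it needs at least 2 cache-resident shards.
print(shards_for_layer(4096, 4096))   # -> 2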

Asynchronous batch updates: each model replica publishes its weight updates to the parameter server, and the communication between the model replicas and the parameter server becomes the bottleneck. Adam therefore aggregates weight updates locally (Δw = Δw_3 + Δw_2 + Δw_1 + …) and applies them to the parameter server in batches, giving a huge improvement in scalability.
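
A sketch of the aggregation step, reusing the toy ParameterServer interface from the earlier sketch; the choice of 64 local updates per batch is an arbitrary illustrative value.

import numpy as np

class BatchingReplica:
    """Accumulates local weight updates and publishes them in one batch."""
    def __init__(self, ps, dim, updates_per_batch=64):
        self.ps = ps                              # toy ParameterServer from above
        self.local_delta = np.zeros(dim)          # running sum of Δw
        self.pending = 0
        self.updates_per_batch = updates_per_batch

    def apply_update(self, delta_w):
        # Summing updates locally and sending the total later yields the same
        # final weights as sending each Δw individually, because the updates
        # are associative and commutative.
        self.local_delta += delta_w
        self.pending += 1
        if self.pending == self.updates_per_batch:
            self.ps.update(self.local_delta)      # one round-trip instead of 64
            self.local_delta = np.zeros_like(self.local_delta)
            self.pending = 0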

Local weight computation: asynchronous batch updates do not work well for fully connected layers, because the weight update is the outer product Δw = α · δ · aᵀ of the error-gradient vector δ and the activation vector a, scaled by the learning rate α, which is O(N²) values to communicate.

Local weight computation: instead, the replica sends the activation and error-gradient vectors to the parameter server, where the matrix multiply is performed locally. This reduces the communication overhead from O(N²) to O(K·(M+N)) for a mini-batch of K inputs and an M×N weight matrix, and it also offloads computation from the model training machines to the parameter server machines.
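
A sketch of the communication saving, assuming an M×N fully connected layer and a mini-batch of K inputs, with 2048/2048/32 as illustrative sizes; server_side_update is a hypothetical name for the matrix multiply that the parameter server performs.

import numpy as np

M, N, K = 2048, 2048, 32                     # layer output/input sizes, batch size
lr = 0.01
activations = np.random.randn(K, N)          # a: one activation vector per input
error_grads = np.random.randn(K, M)          # δ: one error-gradient vector per input

# Naive scheme: the replica forms Δw itself and ships O(M*N) values.
delta_w = lr * error_grads.T @ activations
print(delta_w.size)                          # 4_194_304 values on the wire

# Adam's scheme: ship the K activation and K error-gradient vectors,
# O(K*(M+N)) values, and let the parameter server do the matrix multiply.
print(error_grads.size + activations.size)   # 131_072 values on the wire

def server_side_update(w, error_grads, activations, lr=0.01):
    """Runs on the parameter server machine, which also takes over the
    O(M*N*K) multiply from the model training machines."""
    return w + lr * error_grads.T @ activations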

System optimizations. Whole-system co-design: model partitioning ("less is more") and local weight computation. Exploiting asynchrony: multi-threaded weight updates without locks, and asynchronous batch updates.

Model size scaling

Parameter server performance

Scaling during ImageNet training

Trained model accuracy at scale

Summary. Pros: world-record accuracy on large-scale benchmarks; highly optimized and scalable; fault tolerant. Cons: thoroughly optimized for deep neural networks, so it is unclear whether it can be applied to other models; focused on solving the ImageNet problem and beating Google's benchmark; no effort is spent on improving or optimizing the learning algorithm itself.

Questions: Can this system be generalized to work as well for other AI problems, such as speech, sentiment analysis, or even robotics, as it does for vision? How does it compare when evaluated on other types of models that do not use backpropagation? Thank you!