DistBelief: Large Scale Distributed Deep Networks Quoc V. Le

DistBelief: Large Scale Distributed Deep Networks Quoc V. Le Google & Stanford Joint work with: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Rajat Monga, Andrew Ng, Marc’Aurelio Ranzato, Paul Tucker, Ke Yang Thanks: Samy Bengio, Geoff Hinton, Andrew Senior, Vincent Vanhoucke, Matt Zeiler

Deep Learning Most of Google is doing AI, and AI is hard. Deep learning works well for many problems. Focus: scale deep learning to bigger models. Paper at the conference: Dean et al., 2012. Now used by Google Voice Search, StreetView, Image Search, Translate, ...

Deep Learning Use very large scale brain simulations: automatically learn high-level representations from raw data; learn from both labeled and unlabeled data. Recent academic deep learning results improve on the state of the art in many areas (images, video, speech, NLP, ...) using modest model sizes (<= ~50M parameters). We want to scale this approach up to much bigger models: currently ~2B parameters, want ~10B-100B parameters. General approach: parallelize at many levels.

Hypothesis Useful high-level representations arise from very large models trained on very large amounts of data. But we are impatient: we want to train such models quickly, for faster progress, easier experimentation with new ideas, etc. Parallelism is our friend!

[Diagram, built up over three slides: a model trained on training data; the model partitioned across machines; each machine partition spread across multiple cores.]

Basic DistBelief Model Training Unsupervised or supervised objective. Minibatch stochastic gradient descent (SGD). Model parameters sharded by partition. 10s, 100s, or 1000s of cores per model.
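
The model parallelism here splits a single network's parameters and computation across machines, so that only edges crossing a partition boundary require network communication. Below is a minimal single-process sketch of the idea, assuming a toy fully connected layer split column-wise across two partitions; the layer sizes and ReLU activation are illustrative choices, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy fully connected layer with 512 inputs and 256 outputs,
# split column-wise into two partitions as a stand-in for two machines.
W_parts = [rng.normal(scale=0.01, size=(512, 128)) for _ in range(2)]
b_parts = [np.zeros(128) for _ in range(2)]

def forward(x):
    # Each partition computes its slice of the output independently;
    # in a distributed setting these slices would live on different
    # machines and only the concatenation would cross machine boundaries.
    outs = [np.maximum(0.0, x @ W + b) for W, b in zip(W_parts, b_parts)]
    return np.concatenate(outs, axis=-1)

x = rng.normal(size=(4, 512))   # a minibatch of 4 examples
h = forward(x)
print(h.shape)                  # (4, 256)
```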

Basic DistBelief Model Training Making a single model bigger and faster was the right first step. But training is still slow with large data sets if a model only considers tiny minibatches (10-100s of items) of data at a time. How can we add another dimension of parallelism, and have multiple model instances train on data in parallel?

Two Approaches to Multi-Model Training (1) Downpour: Asynchronous Distributed SGD (2) Sandblaster: Distributed L-BFGS

Asynchronous Distributed Stochastic Gradient Descent [Diagram: a single model reads its data, pulls parameters p from the parameter server, and pushes back updates ∆p; the server applies p' = p + ∆p, then p'' = p' + ∆p'.]

Asynchronous Distributed Stochastic Gradient Descent [Diagram: many model workers, each reading from its own data shard, asynchronously push gradients ∆p to the parameter server and pull back updated parameters p' = p + ∆p.]
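
A minimal sketch of the Downpour-style update rule p' = p + ∆p, assuming a shared parameter-server object and a few worker threads; the quadratic toy objective, learning rate, and all class and function names are illustrative, not the production system.

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the global parameters; workers push deltas and pull copies."""
    def __init__(self, dim):
        self.p = np.zeros(dim)
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.p.copy()

    def push(self, delta):
        # p' = p + delta, applied as soon as a gradient arrives,
        # regardless of how stale the worker's copy of p was.
        with self.lock:
            self.p += delta

def worker(ps, data_shard, lr=0.1):
    for x in data_shard:
        p = ps.pull()              # possibly stale parameters
        grad = 2.0 * (p - x)       # gradient of ||p - x||^2 on one example
        ps.push(-lr * grad)        # send the delta back to the server

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 3.0])
shards = [target + 0.1 * rng.normal(size=(200, 3)) for _ in range(4)]

ps = ParameterServer(dim=3)
threads = [threading.Thread(target=worker, args=(ps, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
print(ps.p)   # ends up close to [1, -2, 3] despite stale gradients
```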

Asynchronous Distributed Stochastic Gradient Descent Will the workers' use of stale parameters to compute gradients mess the whole thing up? From an engineering standpoint, this is much better than a single model with the same number of total machines: synchronization boundaries involve fewer machines; better robustness to individual slow machines; makes forward progress even during evictions/restarts.

Asynchronous Distributed Stochastic Gradient Descent A few recent papers: A Neural Probabilistic Language Model, Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, JMLR 2003. Slow Learners are Fast, John Langford, Alexander J. Smola, Martin Zinkevich, NIPS 2009. Distributed Delayed Stochastic Optimization, Alekh Agarwal, John Duchi, NIPS 2011. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, Feng Niu, Benjamin Recht, Christopher Re, Stephen J. Wright, NIPS 2011. These results only bound the damage for convex problems, and our problems are far from convex; in practice, the approach works quite well on many of our applications.

L-BFGS: a Big Batch Alternative to SGD.
Async-SGD: first derivatives only; many small steps; mini-batched data (10s of examples); tiny compute and data requirements per step; theory is dicey; at most 10s or 100s of model replicas.
L-BFGS: first and second derivatives; larger, smarter steps; mega-batched data (millions of examples); huge compute and data requirements per step; strong theoretical grounding; 1000s of model replicas.

L-BFGS: a Big Batch Alternative to SGD. Leverages the same parameter server implementation as Async-SGD, but uses it to shard computation within a mega-batch; a coordinator directs the model workers and parameter server with small messages. Some current numbers: 20,000 cores in a single cluster; up to 1 billion data items per mega-batch (in ~1 hour). More network friendly at large scales than Async-SGD. Opens the possibility of running on multiple data centers...
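
A toy-scale sketch of the Sandblaster division of labor: a coordinator splits a mega-batch across shards (standing in for model replicas), sums their losses and gradients, and hands the full-batch result to an off-the-shelf L-BFGS routine. SciPy's optimizer and the linear least-squares objective are stand-ins chosen for this sketch, not the distributed implementation described in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))                     # the "mega-batch"
y = X @ rng.normal(size=20) + 0.01 * rng.normal(size=100_000)

def shard_loss_and_grad(w, Xs, ys):
    # What a single model replica would compute on its shard of the mega-batch.
    r = Xs @ w - ys
    return 0.5 * np.sum(r * r), Xs.T @ r

def coordinator(w, n_shards=8):
    # Split the mega-batch, gather per-shard losses/gradients, and reduce.
    total_loss, total_grad = 0.0, np.zeros_like(w)
    for Xs, ys in zip(np.array_split(X, n_shards), np.array_split(y, n_shards)):
        l, g = shard_loss_and_grad(w, Xs, ys)
        total_loss += l
        total_grad += g
    return total_loss, total_grad

res = minimize(coordinator, np.zeros(20), jac=True, method="L-BFGS-B")
print(res.success, np.round(res.x[:5], 3))
```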

Key ideas Model parallelism via partitioning. Data parallelism via Downpour SGD (with asynchronous communication). Data parallelism via Sandblaster L-BFGS.

Applications Acoustic Models for Speech. Unsupervised Feature Learning for Still Images. Neural Language Models.

Acoustic Modeling for Speech Recognition [Architecture: input is 11 frames of 40-value log energy power spectra plus the label for the central frame; one or more hidden layers of a few thousand nodes each; output is an 8000-label softmax.]
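
A numpy sketch of that forward pass, with the input and output sizes taken from the slide; the hidden-layer widths, ReLU activation, and initialization are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

IN_DIM = 11 * 40          # 11 frames of 40-value log energy power spectra
HIDDEN = [2048, 2048]     # "a few thousand nodes each" (exact sizes assumed)
OUT = 8000                # 8000-label softmax

dims = [IN_DIM] + HIDDEN + [OUT]
layers = [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
          for a, b in zip(dims[:-1], dims[1:])]

def predict(frames):
    h = frames
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)                # hidden layers (ReLU assumed)
    W, b = layers[-1]
    logits = h @ W + b
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

batch = rng.normal(size=(32, IN_DIM))        # 32 windows of 11 stacked frames
probs = predict(batch)
print(probs.shape, probs.sum(axis=-1)[:3])   # (32, 8000), rows sum to ~1
```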

Acoustic Modeling for Speech Recognition Async SGD and L-BFGS can both speed up model training, and this results in real improvements in final transcription quality: a significant reduction in word error rate. Reaching the same model quality that DistBelief reached in 4 days took 55 days on a GPU. DistBelief can also support much larger models than a GPU, which we expect will mean higher quality.

Applications Acoustic Models for Speech. Unsupervised Feature Learning for Still Images. Neural Language Models.

Purely Unsupervised Feature Learning in Images Deep sparse auto-encoders (with pooling and local contrast normalization). 1.15 billion parameters (100x larger than the largest deep network in the literature). Data are 10 million unlabeled YouTube thumbnails (200x200 pixels). Trained on 16k cores for 3 days using Async-SGD.
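
To make the training objective concrete, here is a tiny single-layer sparse auto-encoder step in numpy: reconstruction error plus a sparsity penalty on the hidden code. The pooling, local contrast normalization, and billion-parameter scale are omitted; the L1 penalty, ReLU encoder, and all sizes are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, LR, SPARSITY = 400, 64, 0.01, 0.1   # 20x20 patches, 64 features (toy sizes)

W_enc = rng.normal(scale=0.01, size=(D, H))
W_dec = rng.normal(scale=0.01, size=(H, D))

def step(x):
    """One SGD step on 0.5*||x_hat - x||^2 + SPARSITY*||h||_1 for a batch of patches."""
    global W_enc, W_dec
    h = np.maximum(0.0, x @ W_enc)          # encoder (ReLU assumed)
    x_hat = h @ W_dec                       # linear decoder
    err = x_hat - x
    g_dec = h.T @ err                       # gradient w.r.t. decoder weights
    g_h = err @ W_dec.T + SPARSITY * np.sign(h)
    g_h *= (h > 0)                          # gate through the ReLU
    g_enc = x.T @ g_h                       # gradient w.r.t. encoder weights
    W_enc -= LR * g_enc / len(x)
    W_dec -= LR * g_dec / len(x)
    return 0.5 * np.mean(np.sum(err**2, axis=1))

patches = rng.normal(size=(10_000, D))      # stand-in for image patches
for _ in range(100):
    loss = step(patches[rng.integers(0, 10_000, size=64)])
print(round(loss, 4))
```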

[Images: optimal stimulus for the face neuron and optimal stimulus for the cat neuron.]

A Meme

Semi-supervised Feature Learning in Images But we do have some labeled data, so let's fine-tune this same network for a challenging image classification task. ImageNet: 16 million images, 20,000 categories, recurring academic competitions.
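
A sketch of what the fine-tuning step looks like at toy scale: start from encoder weights that would come from unsupervised pretraining, add a new softmax classification layer on top, and let supervised gradients update both. The 10-class task, the sizes, and the random stand-in "pretrained" weights are all placeholders, nothing like the real 20,000-category setup.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C, LR = 400, 64, 10, 0.1              # toy sizes and a 10-class task

# Pretend these came from unsupervised pretraining; here they are random.
W_enc = rng.normal(scale=0.01, size=(D, H))
W_cls = np.zeros((H, C))                    # new classification layer

def fine_tune_step(x, labels):
    """One supervised SGD step; gradients update both the new head and the encoder."""
    global W_enc, W_cls
    h = np.maximum(0.0, x @ W_enc)
    logits = h @ W_cls
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    g_logits = p.copy()                     # cross-entropy gradient at the logits
    g_logits[np.arange(len(labels)), labels] -= 1.0
    g_logits /= len(labels)
    g_cls = h.T @ g_logits
    g_enc = x.T @ ((g_logits @ W_cls.T) * (h > 0))
    W_cls -= LR * g_cls
    W_enc -= LR * g_enc
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

x = rng.normal(size=(32, D))
y = rng.integers(0, C, size=32)
for _ in range(300):
    loss = fine_tune_step(x, y)
print(round(loss, 4))    # cross-entropy decreases as head and encoder adapt
```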

22,000 is a lot of categories… [Example images: stingray, manta ray.] smoothhound, smoothhound shark, Mustelus mustelus; American smooth dogfish, Mustelus canis; Florida smoothhound, Mustelus norrisi; whitetip shark, reef whitetip shark, Triaenodon obseus; Atlantic spiny dogfish, Squalus acanthias; Pacific spiny dogfish, Squalus suckleyi; hammerhead, hammerhead shark; smooth hammerhead, Sphyrna zygaena; smalleye hammerhead, Sphyrna tudes; shovelhead, bonnethead, bonnet shark, Sphyrna tiburo; angel shark, angelfish, Squatina squatina, monkfish; electric ray, crampfish, numbfish, torpedo; smalltooth sawfish, Pristis pectinatus; guitarfish; roughtail stingray, Dasyatis centroura; butterfly ray; eagle ray; spotted eagle ray, spotted ray, Aetobatus narinari; cownose ray, cow-nosed ray, Rhinoptera bonasus; manta, manta ray, devilfish; Atlantic manta, Manta birostris; devil ray, Mobula hypostoma; grey skate, gray skate, Raja batis; little skate, Raja erinacea

Semi-supervised Feature Learning in Images [Images: example top stimuli after fine-tuning on ImageNet for neurons 5 through 13.]

Semi-supervised Feature Learning in Images ImageNet classification results on ImageNet 2011 (20k categories): chance: 0.005%; best reported: 9.5%; our network: 20% (original network + dropout).

Applications Acoustic Models for Speech. Unsupervised Feature Learning for Still Images. Neural Language Models.

[Diagram: a ~100-D joint embedding space, with words such as Paris, Obama, SeaWorld, dolphin, and porpoise placed as points.]

Neural Language Models [Diagram: the words "the cat sat on the" are each mapped through a word embedding matrix E, passed through optional hidden layers, and fed to a prediction layer trained with a hinge loss or softmax.] 100s of millions of parameters, but the gradients are very sparse. E is a matrix of dimension ||Vocab|| x d; the top prediction layer has ||Vocab|| x h parameters. Most ideas are from Bengio et al. 2003 and Collobert & Weston 2008.
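
A numpy sketch of one training step for such a model, mainly to show why the gradients are sparse: only the embedding rows for the words in the current context are touched. The toy vocabulary size, single tanh hidden layer, and softmax output (rather than the slide's hinge-loss alternative) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, H = 10_000, 50, 100       # toy sizes; real vocabularies are far larger
CONTEXT = 4                         # predict the next word from 4 previous words

E = rng.normal(scale=0.01, size=(VOCAB, D))        # word embedding matrix, ||Vocab|| x d
W_h = rng.normal(scale=0.01, size=(CONTEXT * D, H))
W_out = rng.normal(scale=0.01, size=(H, VOCAB))    # prediction layer, ||Vocab|| x h params

def lm_step(context_ids, target_id, lr=0.1):
    """One SGD step; only CONTEXT rows of E receive any gradient."""
    global E, W_h, W_out
    x = E[context_ids].reshape(-1)                 # concatenate the context embeddings
    h = np.tanh(x @ W_h)
    logits = h @ W_out
    logits -= logits.max()
    p = np.exp(logits); p /= p.sum()
    loss = -np.log(p[target_id] + 1e-12)

    g_logits = p.copy(); g_logits[target_id] -= 1.0
    g_h = (W_out @ g_logits) * (1.0 - h * h)
    g_x = (W_h @ g_h).reshape(CONTEXT, D)
    W_out -= lr * np.outer(h, g_logits)
    W_h   -= lr * np.outer(x, g_h)
    E[context_ids] -= lr * g_x                     # sparse update: only these rows change
    return loss

ctx = np.array([5, 17, 256, 940])                  # "the cat sat on" as toy word ids
print(round(lm_step(ctx, target_id=3), 3))
```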

Visualizing the Embedding [Table: example nearest neighbors in a 50-D embedding trained on a 7B-word Google News corpus, for the query words "apple", "Apple", and "iPhone".]
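
The lookup behind that kind of table is just cosine similarity over rows of the embedding matrix. A sketch with a random toy embedding; the vocabulary and vectors are placeholders, not the trained Google News embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "Apple", "iPhone", "orange", "banana", "Microsoft",
         "Android", "pear", "Google", "Samsung"]
E = rng.normal(size=(len(vocab), 50))            # stand-in for a trained 50-D embedding
E /= np.linalg.norm(E, axis=1, keepdims=True)    # unit-normalize the rows

def nearest(word, k=3):
    """Return the k words whose embeddings have the highest cosine similarity."""
    v = E[vocab.index(word)]
    sims = E @ v                                 # cosine similarity (rows are unit norm)
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest("apple"))   # with trained vectors this would return fruit / tech neighbors
```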

Summary DistBelief parallelizes a single deep learning model over 10s to 1000s of cores. A centralized parameter server lets you use 1s to 100s of model replicas to simultaneously minimize your objective through asynchronous distributed SGD, or 1000s of replicas for L-BFGS. Deep networks work well for a host of applications. Speech: a supervised model with broad connectivity; DistBelief can train higher-quality models in much less time than a GPU. Images: a semi-supervised model with local connectivity; beats state-of-the-art performance on ImageNet, a challenging academic data set. Neural language models are complementary to N-gram models: interpolated perplexity falls by 33%.