Machine Learning and AI via Brain Simulations. Andrew Ng, Stanford University.

Presentation transcript:

Machine Learning and AI via Brain Simulations. Andrew Ng, Stanford University. Adam Coates, Quoc Le, Honglak Lee, Andrew Saxe, Andrew Maas, Chris Manning, Jiquan Ngiam, Richard Socher, Will Zou. Thanks to Google: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Andrea Frome, Rajat Monga, Marc'Aurelio Ranzato, Paul Tucker, Kay Le.

Coursera ,000

Coursera: Courses from Top Universities

Coursera: Courses from Top Universities. 30 of the top 60 universities worldwide (Academic Ranking of World Universities); the #1 or #2 ranked university in 14 countries.

This talk: Deep Learning. Using brain simulations: make learning algorithms much better and easier to use; make revolutionary advances in machine learning and AI. Vision shared with many researchers, e.g., Samy Bengio, Yoshua Bengio, Tom Dean, Jeff Dean, Nando de Freitas, Jeff Hawkins, Geoff Hinton, Quoc Le, Yann LeCun, Honglak Lee, Tommy Poggio, Marc'Aurelio Ranzato, Ruslan Salakhutdinov, Yoram Singer, Josh Tenenbaum, Kai Yu, Jason Weston, and others. I believe this is our best shot at progress towards real AI.

What do we want computers to do with our data? Images/video: label ("Motorcycle"), suggest tags, image search, and more. Audio: speech recognition, music classification, speaker identification, and more. Text: web search, anti-spam, machine translation, and more.

Computer vision is hard! (Motorcycle images.)

What do we want computers to do with our data? Images/video: label ("Motorcycle"), suggest tags, image search, and more. Audio: speech recognition, speaker identification, music classification, and more. Text: web search, anti-spam, machine translation, and more. Machine learning performs well on many of these problems, but it is a lot of work. What is it about machine learning that makes it so hard to use?

Machine learning and feature representations: Input -> Learning algorithm.

Machine learning and feature representations: Input -> Feature representation -> Learning algorithm.

How is computer perception done? Image -> low-level features -> grasp point. Image -> vision features -> detection. Audio -> audio features -> speaker ID. Text -> text features -> text classification, machine translation, information retrieval, ...

Feature representations: Input -> Feature representation -> Learning algorithm.

Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH.
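To make concrete what one of these hand-engineered descriptors involves, here is a minimal sketch (not from the talk) of computing HoG features with scikit-image; the sample image and parameter values are purely illustrative.

```python
# Minimal sketch: a hand-engineered HoG descriptor computed with scikit-image.
# The sample image and parameter values below are illustrative, not from the talk.
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())      # any grayscale image
descriptor = hog(image,
                 orientations=9,              # gradient-orientation bins
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))
print(descriptor.shape)  # one long, fixed-length feature vector for a classifier
```

Every number in that vector comes from a design decision a person made; the point of the following slides is to learn such representations instead.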

Audio features: ZCR, spectrogram, MFCC, rolloff, flux.

NLP features: parser features, named-entity recognition, stemming, part of speech, anaphora, ontologies (WordNet). Coming up with features is difficult, time-consuming, and requires expert knowledge. "Applied machine learning" is basically feature engineering.

Feature representations: Input -> Feature representation -> Learning algorithm.

The "one learning algorithm" hypothesis: the auditory cortex learns to see. [Roe et al., 1992]

The "one learning algorithm" hypothesis: the somatosensory cortex learns to see. [Metin & Frost, 1989]

Feature learning problem: given a 14x14 image patch x, we can represent it using 196 real numbers (the raw pixel values). Problem: can we learn a better feature vector to represent it?

First stage of visual processing: V1. V1 is the first stage of visual processing in the brain. Neurons in V1 are typically modeled as edge detectors. (Figures: neuron #1 and neuron #2 of visual cortex, model.)
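As a sketch of what "modeled as edge detectors" means, the following numpy snippet builds a Gabor-style oriented filter, a standard model of a V1 simple cell; the filter size and parameters are arbitrary choices for illustration.

```python
import numpy as np

def gabor(size=15, wavelength=6.0, theta=np.pi / 4, sigma=3.0):
    """Oriented edge detector of the kind used to model V1 simple cells."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_r = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + y_r ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_r / wavelength)  # oscillation across the edge
    return envelope * carrier

# A linear "neuron": its response to an image patch is just a dot product.
patch = np.random.rand(15, 15)
response = float(np.sum(gabor() * patch))
```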

Learning sensor representations: sparse coding (Olshausen & Field, 1996). Input: images x(1), x(2), …, x(m) (each in R^(n x n)). Learn: a dictionary of bases φ1, …, φk (also in R^(n x n)), so that each input x can be approximately decomposed as x ≈ Σj aj φj (sum over j = 1…k), with the aj's mostly zero ("sparse"). This lets us represent a 14x14 image patch succinctly, e.g. as [a7 = 0.8, a36 = 0.3, a41 = 0.5], i.e. it indicates which "basic edges" make up the image. [NIPS 2006, 2007]
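The encoding step can be sketched in a few lines of numpy. This is only an illustrative ISTA-style solver with made-up sizes and a random dictionary standing in for the learned bases; in the full method the dictionary itself is also learned, by alternating this step with a dictionary update.

```python
import numpy as np

def sparse_encode(x, Phi, lam=0.1, lr=0.01, steps=200):
    """ISTA-style sparse coding: find a with x ~ Phi @ a and most a_j equal to 0."""
    a = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ (Phi @ a - x)        # gradient of 0.5 * ||x - Phi a||^2
        a = a - lr * grad                   # gradient step on the fit term
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)  # soft-threshold
    return a

# Toy usage: a random dictionary standing in for 64 learned "edge" bases.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((196, 64))        # 14x14 patches, 64 bases as columns
Phi /= np.linalg.norm(Phi, axis=0)          # unit-norm bases
x = rng.standard_normal(196)                # a flattened patch
a = sparse_encode(x, Phi)
print(np.count_nonzero(a), "active coefficients out of", a.size)
```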

Sparse coding illustration. Natural images -> learned bases (φ1, …, φ64): "edges". Test example: x ≈ 0.8 * φ36 + 0.3 * φ42 + 0.5 * φ63, so its feature representation is [a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0]: a more succinct, higher-level representation.

More examples. x ≈ 0.6 * φ15 + 0.8 * φ28 + 0.4 * φ37; represent as [a15 = 0.6, a28 = 0.8, a37 = 0.4]. x ≈ 1.3 * φ5 + 0.9 * φ18 + 0.3 * φ29; represent as [a5 = 1.3, a18 = 0.9, a29 = 0.3]. The method "invents" edge detection: it automatically learns to represent an image in terms of the edges that appear in it, giving a more succinct, higher-level representation than the raw pixels. The learned bases are quantitatively similar to primary visual cortex (area V1) in the brain.

Sparse coding applied to audio [Evan Smith & Mike Lewicki, 2006]. The image shows 20 basis functions learned from unlabeled audio.

Learning feature hierarchies. Input image (pixels) -> "sparse coding" (edges; cf. V1) -> higher layer (combinations of edges; cf. V2). [Lee, Ranganath & Ng, 2007] [Technical details: sparse autoencoder or a sparse version of Hinton's DBN.]

Learning feature hierarchies. Input image -> model of V1 -> higher layer (model of V2?) -> higher layer (model of V3?). [Lee, Ranganath & Ng, 2007] [Technical details: sparse autoencoder or a sparse version of Hinton's DBN.]
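For the "technical details" note above, here is a heavily simplified numpy sketch of a one-layer sparse autoencoder, with random arrays standing in for image patches and a plain L1 sparsity penalty in place of the KL penalty usually used; it shows the shape of the computation rather than the actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny one-layer sparse autoencoder: 196 pixel inputs -> 64 hidden "edge" features.
n_in, n_hid = 196, 64
W1 = rng.standard_normal((n_hid, n_in)) * 0.01
b1 = np.zeros(n_hid)
W2 = rng.standard_normal((n_in, n_hid)) * 0.01
b2 = np.zeros(n_in)

X = rng.standard_normal((1000, n_in))     # stand-in for 14x14 image patches
lr, beta = 0.1, 0.01                      # learning rate, sparsity weight

for step in range(200):
    H = sigmoid(X @ W1.T + b1)            # hidden activations = learned features
    Xhat = H @ W2.T + b2                  # linear reconstruction of the input
    m = X.shape[0]

    # Gradients of mean squared reconstruction error plus an L1 penalty on H.
    dXhat = (Xhat - X) / m
    dW2 = dXhat.T @ H
    db2 = dXhat.sum(axis=0)
    dH = dXhat @ W2 + beta / m            # L1 term pushes activations toward 0
    dZ1 = dH * H * (1 - H)                # back through the sigmoid
    dW1 = dZ1.T @ X
    db1 = dZ1.sum(axis=0)

    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
```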

Hierarchical sparse coding (sparse DBN), trained on face images: pixels -> edges -> object parts (combinations of edges) -> object models. [Honglak Lee] Training set: aligned images of faces.

Machine learning applications

Unsupervised feature learning (self-taught learning): learn features from unlabeled images, then use them to train a classifier on labeled examples (motorcycles vs. not motorcycles). Testing: what is this? [This uses unlabeled data; one can learn the features from labeled data too.] [Lee, Raina and Ng, 2006; Raina, Lee, Battle, Packer & Ng, 2007]
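A minimal sketch of this two-stage recipe, assuming scikit-learn is available and using random arrays in place of real images: learn a sparse dictionary on unlabeled data, re-represent the small labeled set with it, then train an ordinary classifier.

```python
# Illustrative self-taught-learning sketch (placeholder data, not the talk's pipeline).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.standard_normal((5000, 196))    # stand-in for unlabeled patches
X_labeled = rng.standard_normal((200, 196))       # small labeled set
y_labeled = rng.integers(0, 2, size=200)          # motorcycle vs. not

# Step 1: unsupervised feature learning on the unlabeled data.
dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
dico.fit(X_unlabeled)

# Step 2: represent the labeled examples with the learned (sparse) features.
features = dico.transform(X_labeled)

# Step 3: train an ordinary supervised classifier on the new representation.
clf = LogisticRegression(max_iter=1000).fit(features, y_labeled)
```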

Video activity recognition (Hollywood2 benchmark):
Hessian + ESURF [Willems et al. 2008]: 38%
Harris3D + HOG/HOF [Laptev et al. 2003, 2004]: 45%
Cuboids + HOG/HOF [Dollar et al. 2005, Laptev 2004]: 46%
Hessian + HOG/HOF [Laptev 2004, Willems et al. 2008]: 46%
Dense + HOG/HOF [Laptev 2004]: 47%
Cuboids + HOG3D [Klaser 2008, Dollar et al. 2005]: 46%
Unsupervised feature learning (our method): 52%
Unsupervised feature learning significantly improves on the previous state of the art. [Le, Zhou & Ng, 2011]

Results across domains (prior art vs. Stanford feature learning):
Audio: TIMIT phone classification: prior art (Clarkson et al., 1999) 79.6%, Stanford feature learning 80.3%. TIMIT speaker identification: prior art (Reynolds, 1995) 99.7%, Stanford feature learning 100.0%.
Images: CIFAR object classification: prior art (Ciresan et al., 2011) 80.5%, Stanford feature learning 82.0%. NORB object classification: prior art (Scherer et al., 2010) 94.4%, Stanford feature learning 95.0%.
Multimodal (audio/video): AVLetters lip reading: prior art (Zhao et al., 2009) 58.9%, Stanford feature learning 65.8%.
Video: Hollywood2 classification: prior art (Laptev et al., 2004) 48%, Stanford feature learning 53%. KTH: prior art (Wang et al., 2010) 92.1%, Stanford feature learning 93.9%. UCF: prior art (Wang et al., 2010) 85.6%, Stanford feature learning 86.5%. YouTube: prior art (Liu et al., 2009) 71.2%, Stanford feature learning 75.8%.
Text/NLP: Paraphrase detection: prior art (Das & Smith, 2009) 76.1%, Stanford feature learning 76.4%. Sentiment (MR/MPQA data): prior art (Nakagawa et al., 2010) 77.3%, Stanford feature learning 77.7%.

How do you build a high-accuracy learning system?

Supervised learning with labeled data. Choices of learning algorithm: memory-based, Winnow, perceptron, Naive Bayes, SVM, .... What matters the most? [Chart from Banko & Brill, 2001: accuracy vs. training set size (millions), for each algorithm.] "It's not who has the best algorithm that wins. It's who has the most data."

Unsupervised learning: having a large number of features is critical. The specific learning algorithm is important, but algorithms that can scale to many features have a big advantage. [Adam Coates]

Learning from Labeled data

[Diagrams: a single model trained on the training data; the model split into partitions placed on separate machines; each machine's model partition further spread across its cores (model parallelism).]

Basic DistBelief model training: an unsupervised or supervised objective, optimized with minibatch stochastic gradient descent (SGD); model parameters are sharded by partition; 10s, 100s, or 1000s of cores per model.
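For reference, the (non-distributed) minibatch SGD loop being sharded here looks like the following; the toy least-squares model and all sizes are illustrative.

```python
# Minimal minibatch SGD loop on a toy least-squares model (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 50))
y = X @ rng.standard_normal(50)              # synthetic targets
p = np.zeros(50)                             # model parameters
lr, batch = 0.01, 128

for step in range(1000):
    idx = rng.integers(0, len(y), size=batch)          # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ p - yb) / batch                # minibatch gradient
    p -= lr * grad                                     # SGD update
```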

Basic DistBelief model training: parallelize across ~100 machines (~1,600 cores) and train with stochastic gradient descent. But training is still slow with large data sets, so add another dimension of parallelism and run multiple model instances in parallel.

Asynchronous distributed stochastic gradient descent: a model replica fetches parameters p from the parameter server, computes an update Δp on its data, and sends it back; the server applies p' = p + Δp, then p'' = p' + Δp', and so on.

Asynchronous distributed stochastic gradient descent: many model workers, each on its own data shard, push updates Δp to a shared parameter server and fetch fresh parameters, with the server applying p' = p + Δp as updates arrive.
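Here is a toy sketch of that asynchronous pattern (not DistBelief itself): worker threads compute gradient steps on their own data shards and push Δp updates to a shared parameter server, which applies p' = p + Δp without waiting for the other workers.

```python
# Toy asynchronous SGD with a parameter server (threads on one machine, not
# a distributed system; the least-squares model and sizes are illustrative).
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.p = np.zeros(dim)
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.p.copy()

    def apply_update(self, delta):
        with self.lock:
            self.p += delta              # p' = p + Δp

def worker(server, shard, lr=0.01, steps=200):
    X, y = shard
    for _ in range(steps):
        p = server.fetch()               # possibly stale parameters
        i = np.random.randint(len(y))
        grad = (X[i] @ p - y[i]) * X[i]  # least-squares gradient on one example
        server.apply_update(-lr * grad)  # Δp is a gradient step

rng = np.random.default_rng(0)
true_p = rng.standard_normal(10)
shards = []
for _ in range(4):                        # four data shards, one per worker
    X = rng.standard_normal((500, 10))
    shards.append((X, X @ true_p))

server = ParameterServer(dim=10)
threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("error:", np.linalg.norm(server.p - true_p))
```

Because each worker reads parameters that may already be stale by the time its update arrives, the updates are only approximately correct; that is the robustness/throughput trade-off described on the next slide.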

From an engineering standpoint, this is superior to a single model with the same total number of machines: better robustness to individual slow machines, and forward progress continues even during evictions/restarts.

Acoustic modeling for speech recognition: Async SGD and L-BFGS can both speed up model training. Reaching the same model quality that DistBelief reached in 4 days took 55 days on a GPU. DistBelief can also support much larger models than a GPU (useful for unsupervised learning).

Speech recognition on Android

Application to Google Streetview [with Yuval Netzer, Julian Ibarz]

Learning from Unlabeled data

Unsupervised learning: having a large number of features is critical. The specific learning algorithm is important, but algorithms that can scale to many features have a big advantage. [Adam Coates]

(training: 50,000 32x32 images) 10 million parameters

(training: 10,000,000 200x200 images) 1 billion parameters

Training procedure: what features can we learn if we train a massive model on a massive amount of data? Can we learn a "grandmother cell"? Train on 10 million images (YouTube) using 1,000 machines (16,000 cores) for 1 week, then test on novel images. Training set: YouTube. Test set: FITW + ImageNet.

The face neuron: top stimuli from the test set, and the optimal stimulus found by numerical optimization. Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

[Histogram: frequency of the face neuron's feature value for faces vs. random distractors.] Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

Invariance properties: the face neuron's feature response plotted against horizontal shifts, vertical shifts (up to 20 pixels), 3D rotation angle, and scale factor (0.4x to 1.6x). Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

The cat neuron: top stimuli from the test set, and the average of the top stimuli from the test set. [Raina, Madhavan and Ng, 2008]

Best stimuli for learned features 1 through 13. One layer of the network: image size 200, 3 input channels, 8 output channels (maps), receptive field size 18, LCN size 5, pooling size 5; its output (an image with 8 channels) is the input to the layer above. Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

ImageNet classification: 22,000 categories, 14,000,000 images. Prior approaches use hand-engineered features (SIFT, HOG, LBP), spatial pyramids, and sparse coding/compression. Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

ImageNet classification: 22,000 classes, e.g.:
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
… (Image examples: stingray, manta ray.)

Unsupervised feature learning (self-taught learning): learn features from unlabeled images, then use them to train a classifier on labeled examples (motorcycles vs. not motorcycles). Testing: what is this? [This uses unlabeled data; one can learn the features from labeled data too.] [Lee, Raina and Ng, 2006; Raina, Lee, Battle, Packer & Ng, 2007]

ImageNet classification with 22,000 classes: random guess 0.005%; state of the art (Weston & Bengio, 2011) 9.5%; feature learning from raw pixels: ? Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

ImageNet classification with 22,000 classes: random guess 0.005%; state of the art (Weston & Bengio, 2011) 9.5%; feature learning from raw pixels: 18.3%. Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012.

Scaling up with HPC. GPUs with CUDA: one very fast node, but limited memory and hard to scale out. "Cloud" infrastructure: many inexpensive nodes, but communication bottlenecks and node failures. HPC GPU cluster: GPUs with an Infiniband network fabric; difficult to program, with lots of MPI and CUDA code.

Stanford GPU cluster. Current system: 64 GPUs in 16 machines; tightly optimized CUDA for deep learning operations; 47x faster than a single-GPU implementation; trains an 11.2-billion-parameter, 9-layer neural network in under 4 days.

Discussion: Engineering vs. Data

Discussion: Engineering vs. Data. [Chart: contribution to performance, split between human ingenuity and data/learning.]

Discussion: Engineering vs. Data. [Chart: contribution to performance over time, with "now" marked.]

Deep learning: let's learn our features. Discover the fundamental computational principles that underlie perception. Scaling up has been key to achieving good performance. Didn't talk about: recursive deep learning for NLP. Online tutorial on deep learning: Credits: Stanford: Adam Coates, Quoc Le, Honglak Lee, Andrew Saxe, Andrew Maas, Chris Manning, Jiquan Ngiam, Richard Socher, Will Zou. Google: Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Andrea Frome, Rajat Monga, Marc'Aurelio Ranzato, Paul Tucker, Kay Le.