Yoshua Bengio, Statistical Machine Learning Chair, U. Montreal. ICML 2011 Workshop on Unsupervised and Transfer Learning, July 2nd, 2011, Bellevue, WA.


How to Beat the Curse of Many Factors of Variation?
 Compositionality: exponential gain in representational power
 Distributed representations
 Deep architecture

Distributed Representations
 Many neurons active simultaneously
 Input represented by the activation of a set of features that are not mutually exclusive
 Can be exponentially more efficient than local representations
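To make the efficiency argument concrete, here is a minimal toy sketch (not from the slides; sizes are arbitrary) contrasting a local one-hot code with a distributed binary code in numpy: n binary features can distinguish 2**n inputs, while a local code needs one unit per distinguishable input.

```python
import numpy as np

# A local (one-hot) code needs one unit per distinguishable input,
# while a distributed binary code with n units can distinguish 2**n inputs.
n = 8
inputs = np.arange(2 ** n)

# Distributed code: n binary features per input.
distributed = ((inputs[:, None] >> np.arange(n)) & 1).astype(float)

# Local code: 2**n one-hot units per input.
local = np.eye(2 ** n)[inputs]

print(distributed.shape)  # (256, 8)   -> 8 features suffice
print(local.shape)        # (256, 256) -> 256 features needed
```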

Local vs Distributed

RBM Hidden Units Carve Input Space (figure: hidden units h1, h2, h3 carving up the 2-D input space of x1, x2)

Unsupervised Deep Feature Learning
 Classical: pre-process data with PCA = leading factors
 New: learning multiple levels of features/factors, often over-complete
 Greedy layer-wise strategy: starting from the raw input x, train one unsupervised layer at a time on top of the previous one, then add a supervised output P(y|x) and fine-tune (Hinton et al. 2006, Bengio et al. 2007, Ranzato et al. 2007)
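A minimal sketch of the greedy layer-wise recipe using scikit-learn's BernoulliRBM as a stand-in for the unsupervised building blocks and logistic regression as the supervised output layer; the data, layer sizes, and hyper-parameters are placeholders, and the final joint fine-tuning of all layers described in the papers above is not performed here.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data: 500 binary "images" of 64 pixels with random 0/1 labels.
rng = np.random.RandomState(0)
X = (rng.rand(500, 64) > 0.5).astype(np.float64)
y = rng.randint(0, 2, size=500)

# Greedy layer-wise stage: each RBM is trained unsupervised on the
# representation produced by the layer below it.
pipeline = Pipeline([
    ("rbm1", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)),
    # Supervised stage: a classifier on top of the learned features.
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
print("train accuracy:", pipeline.score(X, y))
```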

Why Deep Learning?
 Hypothesis 1: a deep hierarchy of features is needed to efficiently represent and learn the complex abstractions needed for AI and mammal intelligence (computational & statistical efficiency)
 Hypothesis 2: unsupervised learning of representations is a crucial component of the solution (optimization & regularization)
 Theoretical and ML-experimental support for both
 Cortex: deep architecture, the same learning rule everywhere

Deep Motivations
 Brains have a deep architecture
 Humans' ideas & artifacts are composed from simpler ones
 Insufficient depth can be exponentially inefficient
 Distributed (possibly sparse) representations are necessary for non-local generalization; exponentially more efficient than 1-of-N enumeration of latent variable values
 Multiple levels of latent variables allow combinatorial sharing of statistical strength (figure: shared intermediate representations between raw input x and tasks 1-3)

Deep Architectures are More Expressive
 Theoretical arguments: 2 layers of logic gates, formal neurons, or RBF units = universal approximator
 (figure: 2 shallow layers with units 1 ... 2^n vs. deeper layers with units 1 ... n)
 Theorems for all 3 (Håstad et al. 1986 & 1991; Bengio et al. 2007): functions compactly represented with k layers may require exponential size with k-1 layers

Sharing Components in a Deep Architecture
 A polynomial expressed with shared components (sum-product network): the advantage of depth may grow exponentially (Bengio & Delalleau, Learning Workshop 2011)

Parts Are Composed to Form Objects (Lee et al., ICML 2009)
 Layer 1: edges
 Layer 2: parts
 Layer 3: objects

Deep Architectures and Sharing Statistical Strength, Multi-Task Learning
 Generalizing better to new tasks is crucial to approach AI
 Deep architectures learn good intermediate representations that can be shared across tasks
 Good representations make sense for many tasks
 (figure: shared intermediate representation h between raw input x and task outputs y1, y2, y3)
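An illustrative forward pass (random placeholder weights, not a trained model) showing one shared intermediate representation h feeding several task-specific output heads:

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = rng.rand(5, 20)            # 5 examples, 20 raw input features
W = rng.randn(20, 50) * 0.1    # shared encoder weights (learned jointly in practice)
h = sigmoid(x @ W)             # shared intermediate representation h

V1 = rng.randn(50, 3) * 0.1    # head for task 1 (3 classes)
V2 = rng.randn(50, 7) * 0.1    # head for task 2 (7 classes)
y1 = softmax(h @ V1)           # task 1 predictions
y2 = softmax(h @ V2)           # task 2 predictions
print(y1.shape, y2.shape)      # (5, 3) (5, 7)
```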

Feature and Sub-Feature Sharing
 Different tasks can share the same high-level features
 Different high-level features can be built from the same set of lower-level features
 More levels = up to exponential gain in representational efficiency
 (figure: sharing vs. not sharing intermediate low-level and high-level features across task outputs y1 ... yN)

Representations as Coordinate Systems
 PCA: removes low-variance directions; easy, but what if the signal has low variance? We would like to disentangle factors of variation, keeping them all.
 Overcomplete representations: richer, even if the underlying distribution concentrates near a low-dimensional manifold.
 Sparse/saturated features allow for variable-dimension manifolds: the few different sensitive features at x define a local chart coordinate system.

Effect of Unsupervised Pre-training (AISTATS 2009 + JMLR 2010, with Erhan, Courville, Manzagol, Vincent, S. Bengio)

Effect of Depth (figure: results without pre-training vs. with pre-training)

Unsupervised Feature Learning: a Regularizer to Find Better Local Minima of Generalization Error
 Unsupervised pre-training acts like a regularizer
 Helps to initialize in the basin of attraction of local minima with better generalization error

Non-Linear Unsupervised Feature Extraction Algorithms
 CD for RBMs
 SML (PCD) for RBMs
 Sampling beyond Gibbs (e.g. tempered MCMC)
 Mean-field + SML for DBMs
 Sparse auto-encoders
 Sparse Predictive Decomposition
 Denoising Auto-Encoders
 Score Matching / Ratio Matching
 Noise-Contrastive Estimation
 Pseudo-Likelihood
 Contractive Auto-Encoders
See my book / review paper (F&TML 2009): Learning Deep Architectures for AI

Auto-Encoders
 Reconstruction = decoder(encoder(input))
 Probable inputs have small reconstruction error
 Linear decoder/encoder = PCA up to rotation
 Can be stacked to form highly non-linear representations, increasing disentangling (Goodfellow et al., NIPS 2009)
 (figure: input -> encoder -> code = latent features -> decoder -> reconstruction)
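A minimal numpy sketch of such an auto-encoder with tied weights and squared reconstruction error, trained by plain gradient descent on random data; this is illustrative only, not the exact models used in the talk.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: 200 examples, 30 input dimensions in [0, 1].
X = rng.rand(200, 30)
n_in, n_hid = X.shape[1], 10

W = rng.randn(n_in, n_hid) * 0.1       # tied weights: the decoder uses W.T
b_enc, b_dec = np.zeros(n_hid), np.zeros(n_in)
lr = 0.5

for epoch in range(200):
    H = sigmoid(X @ W + b_enc)         # code = encoder(input)
    R = sigmoid(H @ W.T + b_dec)       # reconstruction = decoder(code)
    loss = 0.5 * np.mean(np.sum((R - X) ** 2, axis=1))

    # Back-propagate the squared reconstruction error.
    d_out = (R - X) * R * (1 - R) / X.shape[0]
    d_hid = (d_out @ W) * H * (1 - H)
    grad_W = d_out.T @ H + X.T @ d_hid  # two terms because the weights are tied
    W -= lr * grad_W
    b_dec -= lr * d_out.sum(axis=0)
    b_enc -= lr * d_hid.sum(axis=0)

print("final reconstruction loss:", loss)
```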

Restricted Boltzmann Machine (RBM)
 The most popular building block for deep architectures
 Bipartite undirected graphical model between observed units x and hidden units h
 Inference is trivial: P(h|x) and P(x|h) factorize

Gibbs Sampling in RBMs
 P(h|x) and P(x|h) factorize: easy inference, convenient Gibbs sampling x → h → x → h ...
 Chain: start at x1, then h1 ~ P(h|x1), x2 ~ P(x|h1), h2 ~ P(h|x2), x3 ~ P(x|h2), h3 ~ P(h|x3), ...
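A sketch of the factorized conditionals of a binary RBM and a few Gibbs steps, using untrained (random) parameters W, b, c just to show the mechanics; a real RBM would fit these with CD or SML/PCD as listed earlier.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_vis, n_hid = 6, 4
W = rng.randn(n_vis, n_hid) * 0.5   # untrained weights, illustration only
b = np.zeros(n_vis)                 # visible biases
c = np.zeros(n_hid)                 # hidden biases

def sample_h_given_x(x):
    p = sigmoid(x @ W + c)          # P(h_j = 1 | x), factorizes over j
    return (rng.rand(*p.shape) < p).astype(float), p

def sample_x_given_h(h):
    p = sigmoid(h @ W.T + b)        # P(x_i = 1 | h), factorizes over i
    return (rng.rand(*p.shape) < p).astype(float), p

# Gibbs chain x1 -> h1 -> x2 -> h2 -> ...
x = (rng.rand(1, n_vis) < 0.5).astype(float)
for step in range(3):
    h, _ = sample_h_given_x(x)
    x, _ = sample_x_given_h(h)
    print("step", step + 1, "x =", x.astype(int))
```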

RBMs are Universal Approximators
 Adding one hidden unit (with proper choice of parameters) guarantees increasing the likelihood
 With enough hidden units, an RBM can perfectly model any discrete distribution
 RBMs with a variable number of hidden units = non-parametric
 (Le Roux & Bengio 2008, Neural Computation)

Denoising Auto-Encoder
 Learns a vector field pointing towards higher-probability regions
 Minimizes a variational lower bound on a generative model
 Similar to pseudo-likelihood
 A form of regularized score matching
 (figure: the corrupted input is mapped back to a reconstruction of the clean input)
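A small sketch of the key mechanics, assuming masking noise and placeholder encoder weights: the code is computed from the corrupted input, while the reconstruction loss is measured against the clean input. Training would proceed by gradient descent as in the auto-encoder sketch above.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = (rng.rand(100, 30) > 0.5).astype(float)    # clean binary inputs
corruption_level = 0.25

# Masking noise: randomly set a fraction of the input dimensions to 0.
mask = rng.rand(*X.shape) > corruption_level
X_corrupted = X * mask

W = rng.randn(30, 10) * 0.1                    # placeholder encoder weights
H = sigmoid(X_corrupted @ W)                   # code computed from the CORRUPTED input
R = sigmoid(H @ W.T)                           # reconstruction

# Key point: the loss compares the reconstruction to the CLEAN input.
cross_entropy = -np.mean(
    np.sum(X * np.log(R + 1e-8) + (1 - X) * np.log(1 - R + 1e-8), axis=1)
)
print("denoising reconstruction cross-entropy:", cross_entropy)
```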

Stacked Denoising Auto-Encoders
 No partition function: the training criterion can be measured directly
 Encoder & decoder: any parametrization
 Performs as well as or better than stacking RBMs for unsupervised pre-training
 (results on Infinite MNIST)

Contractive Auto-Encoders
 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction (Rifai, Vincent, Muller, Glorot & Bengio, ICML 2011)
 Higher Order Contractive Auto-Encoders (Rifai, Mesnil, Vincent, Muller, Bengio, Dauphin & Glorot, ECML 2011)
 Part of the winning toolbox in the final phase of the Unsupervised & Transfer Learning Challenge 2011

 Training criterion: wants contraction in all directions, but cannot afford contraction in the manifold directions (reconstruction must still work there)
 Few active units represent the active subspace (local chart)
 The Jacobian's spectrum is peaked = local low-dimensional representation / relevant factors
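For a sigmoid encoder h = sigmoid(Wx + b), the Jacobian of the code with respect to the input is diag(h(1-h)) W, so the contractive penalty (the squared Frobenius norm of that Jacobian) has a cheap closed form. A minimal numpy sketch with placeholder weights:

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

X = rng.rand(50, 30)                 # batch of inputs
W = rng.randn(30, 20) * 0.1          # encoder weights (placeholder values)
b = np.zeros(20)

H = sigmoid(X @ W + b)               # codes, shape (50, 20)

# For h_j = sigmoid(sum_i W_ij x_i + b_j), dh_j/dx_i = h_j (1 - h_j) W_ij, so
#   ||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ij^2
contractive_penalty = np.mean(
    np.sum((H * (1 - H)) ** 2 * np.sum(W ** 2, axis=0), axis=1)
)
print("average contractive penalty:", contractive_penalty)

# The CAE objective adds lambda * contractive_penalty to the reconstruction error.
```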

Unsupervised and Transfer Learning Challenge: 1st Place in Final Phase (figure: challenge score from raw data and from 1-, 2-, 3- and 4-layer representations)

Transductive Representation Learning
 The validation and test sets have different classes than the training set, hence very different input distributions
 Directions that matter to distinguish them might have small variance under the training set
 Solution: perform the last level of unsupervised feature learning (PCA) on the validation / test set input data
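A minimal scikit-learn sketch of the idea (random placeholder features and illustrative sizes): lower-level features would be learned on training data, but the final PCA is fit on the unlabeled validation/test inputs themselves.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Pretend these are higher-level features produced by layers learned on training data.
train_features = rng.rand(1000, 200)
test_features = rng.rand(300, 200) + 0.5   # different input distribution (new classes)

# Conventional: fit the final projection on the training distribution.
pca_train = PCA(n_components=20).fit(train_features)

# Transductive: fit the final PCA on the unlabeled test/validation inputs themselves,
# so directions with small variance under training but relevant at test time are kept.
pca_transductive = PCA(n_components=20).fit(test_features)
test_codes = pca_transductive.transform(test_features)
print(test_codes.shape)   # (300, 20)
```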

Domain Adaptation (ICML 2011)
 Small (4-domain) Amazon benchmark: we beat the state-of-the-art handsomely
 The sparse rectifier SDA finds more features that tend to be useful either for predicting the domain or the sentiment, but not both

Sentiment Analysis: Transfer Learning
 25 Amazon.com domains: toys, software, video, books, music, beauty, ...
 Unsupervised pre-training of the input representation on all domains
 Supervised SVM trained on 1 domain, generalizing out-of-domain
 Baseline: bag-of-words + SVM
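A sketch of the kind of bag-of-words + SVM baseline named above, using scikit-learn and made-up example reviews; this is not the paper's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny made-up reviews standing in for one Amazon domain.
reviews = [
    "great product works perfectly",
    "terrible quality broke after a day",
    "love it highly recommend",
    "waste of money very disappointed",
]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

baseline = make_pipeline(CountVectorizer(), LinearSVC())
baseline.fit(reviews, labels)

# Out-of-domain test sentence (e.g. from another product category).
print(baseline.predict(["awful item do not buy"]))
```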

Sentiment Analysis: Large-Scale Out-of-Domain Generalization
 Relative loss incurred by going from in-domain to out-of-domain testing
 340k examples from Amazon; domains range in size from 56 examples (tools) to 124k (music)

Representing Sparse High-Dimensional Stuff
 Deep Sparse Rectifier Neural Networks (Glorot, Bordes & Bengio, AISTATS 2011)
 Sampled Reconstruction for Large-Scale Learning of Embeddings (Dauphin, Glorot & Bengio, ICML 2011)
 (figure: sparse input -> code = latent features -> dense output probabilities; encoding is cheap, full reconstruction is expensive)
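A rough, hedged sketch of the reconstruction-sampling idea behind the second paper, as I read it: rather than reconstructing every dimension of a huge sparse input, reconstruct the non-zero dimensions plus a small random sample of the zero ones, reweighting so that the sampled loss estimates the full loss. All sizes and the placeholder "reconstruction" values below are illustrative, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.RandomState(0)

n_dims = 100_000                      # e.g. vocabulary size for bag-of-words input
x = np.zeros(n_dims)
nonzero = rng.choice(n_dims, size=50, replace=False)
x[nonzero] = 1.0                      # very sparse input

# Reconstruct only the non-zero dims plus a random sample of the zero dims.
n_sampled_zeros = 200
zero_dims = np.setdiff1d(np.arange(n_dims), nonzero)
sampled_zeros = rng.choice(zero_dims, size=n_sampled_zeros, replace=False)
idx = np.concatenate([nonzero, sampled_zeros])

# Importance weights keep the sampled loss an unbiased estimate of the full loss:
# each sampled zero stands in for (number of zeros / number sampled) of them.
weights = np.concatenate([
    np.ones(len(nonzero)),
    np.full(n_sampled_zeros, len(zero_dims) / n_sampled_zeros),
])

reconstruction = rng.rand(len(idx))   # placeholder reconstruction probabilities
loss = np.sum(weights * (reconstruction - x[idx]) ** 2)
print("reconstructed", len(idx), "of", n_dims, "dimensions; weighted loss:", loss)
```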

Speedup from Sampled Reconstruction

Deep Self-Taught Learning for Handwritten Character Recognition
 Y. Bengio & 16 others (IFT6266 class project & AISTATS 2011 paper)
 Discriminate 62 character classes (upper, lower, digits), 800k to 80M examples
 Deep learners beat the state-of-the-art on NIST and reach human-level performance
 Deep learners benefit more from perturbed (out-of-distribution) data
 Deep learners benefit more from the multi-task setting


Improvement due to training on perturbed data (figure: relative improvement, deep vs. shallow models)
 {SDA,MLP}1: trained only on data with distortions: thickness, slant, affine, elastic, pinch
 {SDA,MLP}2: trained with all perturbations

Improvement due to the multi-task setting (figure: relative improvement, deep vs. shallow models)
 Multi-task: train on all categories, test on target categories (shared representations)
 Not multi-task: train and test on target categories only (no sharing, separate models)

Comparing against Humans and the State of the Art
 All 62 character classes; deep learners reach human performance
 [1] Granger et al., 2007
 [2] Pérez-Cortes et al., 2000 (nearest neighbor)
 [3] Oliveira et al., 2002b (MLP)
 [4] Milgram et al., 2005 (SVM)

Tips & Tricks
 Don't be scared by the many hyper-parameters: use random sampling (not grid search) & clusters / GPUs (see the sketch after this list)
 The learning rate is the most important hyper-parameter, along with the top-level dimension
 Make sure the selected hyper-parameter value is not on the border of its search interval
 Early stopping
 Use non-linear dimensionality reduction (NLDR) for visualization
 Simulate the final evaluation scenario
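A sketch of random hyper-parameter sampling; the parameter names and ranges below are illustrative, not a recipe from the talk.

```python
import numpy as np

rng = np.random.RandomState(42)

def sample_hyperparameters():
    """Draw one random configuration (illustrative ranges)."""
    return {
        # Learning rate sampled log-uniformly: the most important hyper-parameter.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        # Top-level dimension (number of units in the last hidden layer).
        "top_layer_size": int(rng.choice([256, 512, 1024, 2048])),
        "n_layers": int(rng.randint(1, 5)),
        "corruption_level": rng.uniform(0.0, 0.5),
    }

# Each trial can run on a different machine / GPU; collect validation errors and
# check that the best values do not sit on the border of the sampled intervals.
trials = [sample_hyperparameters() for _ in range(25)]
for t in trials[:3]:
    print(t)
```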

Conclusions
 Deep Learning: powerful arguments & generalization principles
 Unsupervised Feature Learning is crucial: many new algorithms and applications in recent years
 Deep Learning is particularly suited for multi-task learning, transfer learning, domain adaptation, self-taught learning, and semi-supervised learning with few labels

Theano: numpy → GPU
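Assuming this slide refers to Theano, the LISA lab's library for compiling numpy-like symbolic expressions to CPU or GPU, here is a minimal example of the "numpy → GPU" idea; the GPU part is enabled through Theano's configuration (e.g. THEANO_FLAGS=device=gpu), not shown here.

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic, numpy-like expression; Theano compiles it for CPU or GPU
# depending on configuration.
x = T.matrix("x")
W = theano.shared(np.random.randn(784, 500).astype(theano.config.floatX), name="W")
b = theano.shared(np.zeros(500, dtype=theano.config.floatX), name="b")

hidden = T.nnet.sigmoid(T.dot(x, W) + b)
encode = theano.function(inputs=[x], outputs=hidden)

batch = np.random.rand(64, 784).astype(theano.config.floatX)
print(encode(batch).shape)   # (64, 500)
```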

Thank you! Questions? LISA ML Lab team: