Multimodal Learning with Deep Boltzmann Machines


Multimodal Learning with Deep Boltzmann Machines
Nitish Srivastava, Ruslan Salakhutdinov, University of Toronto. Presentation by Nir Halay.

Introduction
We start by introducing a basic building block: the restricted Boltzmann machine (RBM). An RBM is a generative stochastic neural network that learns a probability distribution over its inputs, and it is trained with an unsupervised learning algorithm. RBMs have found applications in dimensionality reduction, classification, feature learning, and more.

Restricted Boltzmann Machine
An RBM can be described by an undirected bipartite graph (shown on the slide): each observation x depends on a latent random vector h. Their connection is described by the energy function

E(x, h) := -h^T W x - c^T x - b^T h
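As a concrete illustration, here is a minimal NumPy sketch of this energy function; the array names W, b, c and the toy dimensions are placeholders, not taken from the paper.

    import numpy as np

    def rbm_energy(x, h, W, b, c):
        """E(x, h) = -h^T W x - c^T x - b^T h for binary vectors x, h."""
        return -h @ W @ x - c @ x - b @ h

    # toy example with random parameters
    rng = np.random.default_rng(0)
    K, H = 6, 4                              # visible and hidden dimensions
    W = rng.normal(scale=0.1, size=(H, K))   # weight matrix
    b, c = np.zeros(H), np.zeros(K)          # hidden and visible biases
    x = rng.integers(0, 2, size=K)
    h = rng.integers(0, 2, size=H)
    print(rbm_energy(x, h, W, b, c))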

RBM – joint distribution
Given the energy function E(x, h) = -h^T W x - c^T x - b^T h, the joint probability distribution of x and h is defined as

p(x, h) := (1/Z) exp(-E(x, h)),

where Z is a normalization constant (the partition function, which is intractable). Given a set of observations of x, we want to find the distribution parameters W, c, b.

RBM – properties
We discuss the case of binary variables, i.e. x ∈ {0,1}^K and h ∈ {0,1}^H. Some basic properties of the RBM:
Conditional independence: p(x | h) = prod_k p(x_k | h) and p(h | x) = prod_j p(h_j | x).
Conditional distributions: p(x_k = 1 | h) = sigm(c_k + h^T W_{.k}), where W_{.k} is the k-th column of W, and p(h_j = 1 | x) = sigm(b_j + W_{j.} x), where W_{j.} is the j-th row of W.
Proofs of these properties are given in Hugo Larochelle's lectures: https://youtu.be/p4Vh_zMw-HQ?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH
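A small sketch of these conditionals, continuing the hypothetical NumPy setup above (W of shape (H, K), b of shape (H,), c of shape (K,)):

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def p_h_given_x(x, W, b):
        """p(h_j = 1 | x) = sigm(b_j + W_{j.} x), for all j at once."""
        return sigm(b + W @ x)

    def p_x_given_h(h, W, c):
        """p(x_k = 1 | h) = sigm(c_k + h^T W_{.k}), for all k at once."""
        return sigm(c + W.T @ h)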

RBM – marginal probability
The marginal probability of x is

p(x) = sum_h p(x, h) = (1/Z) sum_h exp(-E(x, h)) = ... = (1/Z) exp( c^T x + sum_j log(1 + exp(b_j + W_{j.} x)) ).

Define the free energy

F(x) := -c^T x - sum_j log(1 + exp(b_j + W_{j.} x)).

The marginal probability can then be written as p(x) = (1/Z) exp(-F(x)).
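A sketch of the free energy under the same hypothetical setup; np.logaddexp(0, z) computes log(1 + exp(z)) in a numerically stable way.

    import numpy as np

    def free_energy(x, W, b, c):
        """F(x) = -c^T x - sum_j log(1 + exp(b_j + W_{j.} x))."""
        return -c @ x - np.sum(np.logaddexp(0.0, b + W @ x))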

RBM – fitting
Now we need to fit the model, i.e. we want to find the parameters W, c, b that best fit the data {x_t}, t = 1, ..., T. We use the maximum-likelihood estimator:

(W, c, b) = argmin_{W,c,b} -log p(x_1, ..., x_T) = argmin_{W,c,b} -(1/T) sum_{t=1}^T log p(x_t).

We would like to use stochastic gradient descent (SGD), so we need the gradient of -log p(x) with respect to each parameter θ ∈ {W, c, b}:

d(-log p(x))/dθ = -d/dθ log sum_h p(x, h)
= -d/dθ log (1/Z) sum_h exp(-E(x, h))
= -d/dθ log [ sum_h exp(-E(x, h)) / sum_{x',h} exp(-E(x', h)) ]
= -d/dθ log sum_h exp(-E(x, h)) + d/dθ log sum_{x',h} exp(-E(x', h)).

RBM – fitting
Calculate the first term:

d/dθ log sum_h exp(-E(x, h))
= -[ sum_h exp(-E(x, h)) dE(x, h)/dθ ] / [ sum_{h'} exp(-E(x, h')) ]
= -sum_h [ (1/Z) exp(-E(x, h)) / ( (1/Z) sum_{h'} exp(-E(x, h')) ) ] dE(x, h)/dθ
= -sum_h [ p(x, h) / p(x) ] dE(x, h)/dθ
= -sum_h p(h | x) dE(x, h)/dθ
= -E[ dE(x, h)/dθ | x ].

Similarly, for the second term:

d/dθ log sum_{x',h} exp(-E(x', h)) = -E[ dE(x, h)/dθ ].

RBM – fitting
Finally,

d(-log p(x_t))/dθ = E[ dE(x_t, h)/dθ | x_t ]  (positive phase)  -  E[ dE(x, h)/dθ ]  (negative phase).

The positive phase can be computed analytically:

E[ dE(x, h)/dW | x ] = E[ d(-h^T W x)/dW | x ] = E[ -h x^T | x ] = -ĥ(x) x^T
E[ dE(x, h)/db | x ] = E[ d(-b^T h)/db | x ] = E[ -h | x ] = -ĥ(x)
E[ dE(x, h)/dc | x ] = E[ d(-c^T x)/dc | x ] = E[ -x | x ] = -x,

where ĥ(x) := p(h = 1 | x) = sigm(b + W x).
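A sketch of these positive-phase statistics under the running hypothetical NumPy setup:

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def positive_phase_stats(x, W, b, c):
        """Analytic conditional expectations of dE/dW, dE/db, dE/dc given x."""
        h_hat = sigm(b + W @ x)          # h_hat(x) = p(h = 1 | x)
        dE_dW = -np.outer(h_hat, x)      # E[dE/dW | x] = -h_hat(x) x^T
        dE_db = -h_hat                   # E[dE/db | x] = -h_hat(x)
        dE_dc = -x.astype(float)         # E[dE/dc | x] = -x
        return dE_dW, dE_db, dE_dc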

RBM – Contrastive Divergence
The negative phase is intractable. Hinton proposed (in 2002) an approximation of this term:

E[ dE(x, h)/dθ ] = E[ E[ dE(x, h)/dθ | x ] ] ≈ E[ dE(x̃, h)/dθ | x̃ ],

where x̃ is a sample from our model distribution. We can generate x̃ with a Gibbs sampler (an MCMC method), starting the chain from the training example x_t.

RBM – Contrastive Divergence
Pseudo-code for the Contrastive Divergence (CD) algorithm appears as a figure on the slide; a sketch of one update is given below.
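The original pseudo-code is not reproduced here, so this is a minimal NumPy sketch of a single CD-1 update on one binary training example, assuming float parameter arrays and a learning rate lr; it illustrates the technique rather than the authors' exact implementation.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(x_t, W, b, c, lr=0.01, rng=None):
        """One CD-1 step; W, b, c are float arrays updated in place."""
        if rng is None:
            rng = np.random.default_rng()

        # Positive phase: analytic expectation given the data point.
        h_hat_data = sigm(b + W @ x_t)

        # One step of Gibbs sampling, starting from x_t.
        h_sample = (rng.random(h_hat_data.shape) < h_hat_data).astype(float)
        x_hat = sigm(c + W.T @ h_sample)
        x_tilde = (rng.random(x_hat.shape) < x_hat).astype(float)
        h_hat_model = sigm(b + W @ x_tilde)

        # Gradient step: positive statistics minus negative statistics.
        W += lr * (np.outer(h_hat_data, x_t) - np.outer(h_hat_model, x_tilde))
        b += lr * (h_hat_data - h_hat_model)
        c += lr * (x_t - x_tilde)
        return W, b, c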

RBM – Contrastive Divergence
Intuitive meaning of the Contrastive Divergence (CD) algorithm: the CD update is (up to a learning rate)

θ^(k+1) = θ^(k) - E[ dE(x_t, h)/dθ | x_t ] + E[ dE(x̃, h)/dθ | x̃ ].

Hence, we move in a direction that decreases the mean energy at the true example x_t and increases the mean energy at the model sample x̃. Low energy corresponds to high probability, and high energy to low probability. When the mean energies at x_t and x̃ are equal, CD has converged.

RBM – Example
Dataset: MNIST (handwritten digit images). The slide shows sample digits and the filters learned by the RBM.

RBM Extension – GB RBM
Gaussian-Bernoulli RBM (Hinton and Salakhutdinov, 2006): the visible layer is real-valued, v ∈ R^D, and the hidden layer is binary, h ∈ {0,1}^F. The energy function (given on the slide) adds a quadratic term in v to the binary case. The conditional independence structure is the same as in the regular RBM, and the conditional distributions become

v_i | h ~ N( b_i + σ_i^2 sum_j W_{ij} h_j, σ_i^2 )
h_j | v ~ Bernoulli( sigm( a_j + sum_{i=1}^D W_{ij} v_i / σ_i^2 ) ).

The CD algorithm is very similar to that of the regular RBM.
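A hedged sketch of sampling from these conditionals; the names W, a, b, sigma2 are placeholders, with W of shape (D, F) following the slide's indexing.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sample_v_given_h(h, W, b, sigma2, rng):
        """v_i | h ~ N(b_i + sigma_i^2 * sum_j W_ij h_j, sigma_i^2)."""
        mean = b + sigma2 * (W @ h)
        return mean + np.sqrt(sigma2) * rng.standard_normal(mean.shape)

    def sample_h_given_v(v, W, a, sigma2, rng):
        """h_j | v ~ Bernoulli(sigm(a_j + sum_i W_ij v_i / sigma_i^2))."""
        p = sigm(a + W.T @ (v / sigma2))
        return (rng.random(p.shape) < p).astype(float)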

RBM Extension – RSM
Replicated Softmax Model: assume a binary matrix V that identifies the words in a document, with v_{ik} = 1 if the i-th word position contains the k-th word of the vocabulary (the slide shows a small example with seven word positions and the vocabulary Cat, Dog, Cow). We want to learn the distribution of this matrix V. A regular RBM would vectorize this matrix and use the corresponding energy function (shown on the slide).

RBM Extension – RSM
Replicated Softmax Model: if h_1, ..., h_F are topic indicators, the order of the words can be ignored, which means that W_{ijk} = W_{i'jk} and b_{ik} = b_{i'k} for all i, i'. This motivates the energy function shown on the slide, in which only the counts v̂_k := sum_{i=1}^M v_{ik} of the k-th word appear, with an additional normalization of the bias term of the hidden layer. The conditional independence structure is the same as in the regular RBM, and the CD algorithm is very similar.

Deep Boltzmann Machine
A DBM is a Boltzmann machine with more than one hidden layer. For example, with three hidden layers the energy function is defined as

E(v, h^1, h^2, h^3) := -h^{1T} W^1 v - h^{2T} W^2 h^1 - h^{3T} W^3 h^2 - b^{1T} h^1 - b^{2T} h^2 - b^{3T} h^3 - c^T v,

and the joint probability distribution of v, h^1, h^2 and h^3 is defined as

p(v, h^1, h^2, h^3) := (1/Z) exp( -E(v, h^1, h^2, h^3) ).

Deep Boltzmann Machine
The conditional probability distributions are given by

p(v_k = 1 | h^1) = sigm( c_k + h^{1T} W^1_{.k} )
p(h^1_j = 1 | h^2, v) = sigm( b^1_j + W^1_{j.} v + h^{2T} W^2_{.j} )
p(h^2_m = 1 | h^3, h^1) = sigm( b^2_m + W^2_{m.} h^1 + h^{3T} W^3_{.m} )
p(h^3_l = 1 | h^2) = sigm( b^3_l + W^3_{l.} h^2 ).

Each layer is conditionally independent of all the other layers given its neighbors; for example, p(h^1_j = 1 | h^3, h^2, v) = p(h^1_j = 1 | h^2, v). The contrastive divergence algorithm is quite expensive in this context, so we may resort to a suboptimal (approximate) solution.
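A hedged sketch of one layer-wise Gibbs sweep over these conditionals; the names W1, W2, W3, b1, b2, b3, c are placeholders, binary units are assumed, and h^1 and h^3 can be sampled together because they are conditionally independent given v and h^2.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bern(p, rng):
        return (rng.random(p.shape) < p).astype(float)

    def gibbs_sweep(v, h1, h2, h3, W1, W2, W3, b1, b2, b3, c, rng):
        """One sweep: sample (h1, h3) given (v, h2), then (v, h2) given (h1, h3)."""
        h1 = bern(sigm(b1 + W1 @ v + W2.T @ h2), rng)
        h3 = bern(sigm(b3 + W3 @ h2), rng)
        v  = bern(sigm(c + W1.T @ h1), rng)
        h2 = bern(sigm(b2 + W2 @ h1 + W3.T @ h3), rng)
        return v, h1, h2, h3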

Multimodal DBM – Motivation
Data often consists of multiple diverse modalities. A multimodal DBM can model the joint distribution of two dependent variables of different modalities, for example the joint distribution of an image and its text tags, p(v_img, v_txt). Once the joint distribution is learned, we can sample from the conditional distributions p(v_img | v_txt) and p(v_txt | v_img).

Multimodal DBM – Structure
A multimodal DBM has the structure shown on the slide: each modality has its own DBM, and the two are connected through a joint hidden layer. In this example, v_m represents the image and v_t represents the text.

Multimodal DBM – Distribution
The energy function decomposes over the two modality-specific DBMs and the joint layer:
Image DBM: E_m(v_m, h^{1m}, h^{2m}) := [v_m ↔ h^{1m}: GB RBM term] + [h^{1m} ↔ h^{2m}: RBM term]
Text DBM: E_t(v_t, h^{1t}, h^{2t}) := [v_t ↔ h^{1t}: RSM term] + [h^{1t} ↔ h^{2t}: RBM term]
Connection layer: E_c(h^{2m}, h^{2t}, h^3) := [h^3 ↔ h^{2m}: RBM term] + [h^3 ↔ h^{2t}: RBM term]
Define h := (h^{1m}, h^{1t}, h^{2m}, h^{2t}, h^3). We want an expression for the joint probability of v_m and v_t, and for all the other conditional probabilities.

Multimodal DBM – Distribution
The joint probability of v_m and v_t can be written as

p(v_m, v_t) = sum_h p(v_m, v_t, h^{1m}, h^{1t}, h^{2m}, h^{2t}, h^3)
= sum_h p(h^{2m}, h^{2t}, h^3) p(v_m, h^{1m} | h^{2m}, h^{2t}, h^3) p(v_t, h^{1t} | h^{2m}, h^{2t}, h^3)
= sum_h p(h^{2m}, h^{2t}, h^3) p(v_m, h^{1m} | h^{2m}) p(v_t, h^{1t} | h^{2t})
= ... = the explicit expression for the image-text example given in Eq. (7) of the paper.

The conditional distributions of each layer given its neighbors are also presented in the paper, in Eq. (8).

Multimodal DBM – Fitting
As in the regular RBM, the gradient of the log-likelihood consists of two terms:
Positive phase (data expectation): in contrast to the regular RBM, this term is hard to compute analytically, so an approximate distribution over the hidden layers is used to compute it.
Negative phase (model expectation): this term is approximated with a Gibbs sampler, as in the regular RBM, initialized with the approximate distribution from the positive phase.

Multimodal DBM – Fitting
We approximate the posterior distribution: P(h | v) ≈ Q(h | v). A naive (fully factorized) approximation assumes conditional independence,

Q(h | v) = prod_l prod_i q(h^l_i | v), with q(h^l_i = 1 | v) := μ^l_i and μ := (μ^{1m}, μ^{1t}, μ^{2m}, μ^{2t}, μ^3).

The vector μ is obtained by maximizing the variational bound shown on the slide. Exact fixed-point iterations for the components of μ are presented in Eq. (12) of the paper.
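The exact multimodal updates are those of Eq. (12); the sketch below only illustrates the generic mean-field fixed-point pattern for a simplified DBM with two hidden layers and hypothetical parameters W1, W2, b1, b2.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mean_field(v, W1, W2, b1, b2, n_iters=10):
        """Fixed-point mean-field updates for a two-hidden-layer DBM:
        mu1 <- sigm(b1 + W1 v + W2^T mu2),  mu2 <- sigm(b2 + W2 mu1)."""
        mu1 = sigm(b1 + W1 @ v)        # initialize with a bottom-up pass
        mu2 = sigm(b2 + W2 @ mu1)
        for _ in range(n_iters):
            mu1 = sigm(b1 + W1 @ v + W2.T @ mu2)
            mu2 = sigm(b2 + W2 @ mu1)
        return mu1, mu2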

Multimodal DBM – Applications
Generating missing modalities: given an image v_m, we can sample from the conditional distribution p(v_t | v_m) using a Gibbs sampler, and vice versa. The slide shows samples from p(v_m | v_t).

Multimodal DBM – Applications
Other applications:
Inferring joint representations: we can sample from p(h^3 | v_m, v_t) to obtain joint features.
Discriminative tasks: after learning, the multimodal DBM can be used to initialize a multilayer neural network by partially unrolling the lower layers (Salakhutdinov and Hinton, 2009b).

Multimodal DBM – Experiments
Dataset: MIR Flickr (Huiskes and Lew, 2008). 1 million images from the Flickr website along with their user-assigned tags; 25,000 of the images have been annotated with 24 labels, including object categories. There are more than 800,000 distinct tags in the data set; each text input was represented using a vocabulary of the 2,000 most frequent tags in the 1-million-image collection.

Multimodal DBM – Experiments
Experiment: representation of multimodal inputs. A logistic regression classifier is trained on the representation at the joint hidden layer (results shown on the slide).

Multimodal DBM – Experiments
Experiment: retrieval tasks with multimodal queries. Performance is measured by precision (how many selected items are relevant?) and recall (how many relevant items are selected?); results are shown on the slide.

Multimodal DBM – Experiments
Experiment: retrieval tasks with unimodal queries, measured by the same precision and recall criteria; results are shown on the slide.

Multimodal DBM – Experiments
Experiment: video-audio data (CUAVE data set – spoken digits).

Questions?