Deep Architectures for Artificial Intelligence
Learning Features: The Past
The traditional model of pattern recognition (since the late 50's) involves fixed kernel machines and hand-crafted features.
The first learning machine was the "Perceptron". Built at Cornell in 1960, the Perceptron was a linear classifier on top of a simple feature extractor.
The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching.
Learning Features: The Future
Designing a feature extractor by hand requires considerable effort by experts.
Modern approaches are therefore based on trainable features AND a trainable classifier.
Machine Learning
Supervised learning: the training data consists of inputs together with their corresponding outputs.
Unsupervised learning: the training data consists of inputs without their corresponding outputs.
Neural networks
Generative model: models the joint distribution of input and output, P(x, y).
Discriminative model: models the posterior probabilities, P(y | x).
[Figure: joint densities P(x, y1), P(x, y2) versus posteriors P(y1 | x), P(y2 | x)]
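The two views are linked by a standard identity (not shown on the slide, added here for concreteness):

```latex
% A discriminative posterior can always be recovered from a generative joint model:
P(y \mid x) \;=\; \frac{P(x, y)}{P(x)} \;=\; \frac{P(x, y)}{\sum_{y'} P(x, y')}
```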
Neural networks
Two-layer neural networks (sigmoid neurons) are trained with back-propagation:
Step 1: Randomly initialize the weights and compute the output vector (forward pass).
Step 2: Evaluate the gradient of an error function with respect to the weights.
Step 3: Adjust the weights. Repeat steps 1-3 until the error is low enough.
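A minimal sketch of these three steps for a two-layer sigmoid network in NumPy; the toy data, layer sizes, and learning rate are illustrative assumptions (bias terms omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((100, 4))                               # toy inputs
y = (X.sum(axis=1, keepdims=True) > 2).astype(float)  # toy targets

# Step 1: randomly initialize the weights
W1 = rng.standard_normal((4, 8)) * 0.1
W2 = rng.standard_normal((8, 1)) * 0.1
lr = 0.5

for epoch in range(1000):
    # forward pass: determine the output vector
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)

    # Step 2: gradient of a squared-error function (backward pass)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Step 3: adjust the weights; repeat until the error is low enough
    W2 -= lr * h.T @ d_out / len(X)
    W1 -= lr * X.T @ d_h / len(X)
```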
Deep Neural Networks
ANNs with more than two hidden layers are referred to as deep.
Given enough hidden neurons, a single hidden layer is enough to approximate any function to any degree of precision.
However, too many neurons may quickly make the network infeasible to train.
Adding layers greatly improves the network's learning capacity, thus reducing the number of neurons needed.
Deep Learning
Deep Learning is about representing high-dimensional data.
Learning representations of data means discovering and disentangling the independent explanatory factors that underlie the data distribution.
The Manifold Hypothesis: natural data lives on a low-dimensional (non-linear) manifold, because the variables in natural data are mutually dependent.
Internal intermediate representations can be viewed as latent variables to be inferred, and deep belief networks are a particular type of latent-variable model.
Hierarchy of Representations
A hierarchy of representations with increasing levels of abstraction; each stage is a kind of trainable feature transform.
Image recognition: pixel → edge → texton → motif → part → object
Text: character → word → word group → clause → sentence → story
Speech: sample → spectral band → sound → phoneme → word
How to train deep models?
Purely supervised:
Initialize parameters randomly.
Train in supervised mode, typically with SGD, using backprop to compute gradients.
Used in most practical systems for speech and image recognition.
Unsupervised, layerwise + supervised classifier on top:
Train each layer unsupervised, one after the other.
Train a supervised classifier on top, keeping the other layers fixed.
Good when very few labeled samples are available (a sketch of this option follows the list).
Unsupervised, layerwise + global supervised fine-tuning:
Train each layer unsupervised, one after the other, then add a classifier layer and retrain the whole thing supervised.
Good when the label set is poor (e.g. pedestrian detection).
Unsupervised pre-training often uses regularized auto-encoders.
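One possible sketch of the second strategy (unsupervised, layerwise + supervised classifier on top) using scikit-learn's BernoulliRBM; the toy data, layer sizes, and hyperparameters are assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = (rng.random((200, 64)) > 0.5).astype(float)  # toy binary inputs
y = rng.integers(0, 2, size=200)                 # toy labels

# Two RBM layers trained unsupervised (by contrastive divergence),
# then a supervised classifier trained on the top-level features.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=200)),
])
model.fit(X, y)
```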
Boltzmann Machine Model
One input (visible) layer and one hidden layer, typically with binary states for every unit.
Stochastic (vs. deterministic), recurrent (vs. feed-forward).
Generative model (vs. discriminative): it estimates the distribution of the observations (say, p(image)), while traditional discriminative networks only estimate the labels (say, p(label | image)).
It defines an energy of the network and a probability for each unit's state (the scalar T is referred to as the "temperature"):
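In the standard formulation (notation assumed here: binary states s_i, weights w_ij, biases θ_i, energy gap ΔE_i):

```latex
E \;=\; -\sum_{i<j} w_{ij}\, s_i s_j \;-\; \sum_i \theta_i\, s_i,
\qquad
p(s_i = 1) \;=\; \frac{1}{1 + e^{-\Delta E_i / T}},
\quad
\Delta E_i \;=\; \sum_j w_{ij}\, s_j + \theta_i .
```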
Restricted Boltzmann Machine Model
A bipartite graph with no intra-layer connections: units are connected only between the visible and hidden layers.
The RBM does not have the temperature factor T; the rest is the same as the BM.
One important feature of the RBM is that the visible units and the hidden units are conditionally independent given each other, which will lead to a beautiful result later on:
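Concretely, the conditional distributions factorize over units (the standard RBM property; v_i visible, h_j hidden):

```latex
p(h \mid v) \;=\; \prod_j p(h_j \mid v),
\qquad
p(v \mid h) \;=\; \prod_i p(v_i \mid h).
```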
Restricted Boltzmann Machine Model
Two ingredients define a Restricted Boltzmann Machine:
States of all the units: obtained through the probability distribution.
Weights of the network: obtained through training (Contrastive Divergence).
As mentioned before, the objective of the RBM is to estimate the distribution of the input data, and this goal is fully determined by the weights, given the input.
Energy defined for the RBM:
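In standard notation (assumed here: visible biases a_i, hidden biases b_j, weights w_ij), the RBM energy takes the form:

```latex
E(v, h) \;=\; -\sum_i a_i v_i \;-\; \sum_j b_j h_j \;-\; \sum_i \sum_j v_i\, w_{ij}\, h_j .
```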
Restricted Boltzmann Machine Model
Distribution of the visible layer of the RBM (the Boltzmann distribution), where Z is the partition function, defined as the sum of e^{-E(v,h)} over all possible configurations of {v, h}.
Probability that a unit is on (binary state 1), where σ is the logistic/sigmoid function:
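In the same (assumed) notation, the standard forms are:

```latex
p(v) \;=\; \frac{1}{Z} \sum_h e^{-E(v, h)},
\qquad
Z \;=\; \sum_{v, h} e^{-E(v, h)},
\\[4pt]
p(h_j = 1 \mid v) \;=\; \sigma\Big(b_j + \sum_i v_i\, w_{ij}\Big),
\qquad
p(v_i = 1 \mid h) \;=\; \sigma\Big(a_i + \sum_j w_{ij}\, h_j\Big),
\qquad
\sigma(x) \;=\; \frac{1}{1 + e^{-x}} .
```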
Deep Belief Net
DBNs are based on stacks of RBMs (layers: data, h1, h2, h3 in the figure).
The top two hidden layers form an undirected associative memory (regarded as a shorthand for an infinite stack), and the remaining hidden layers form a directed acyclic graph.
The red arrows in the figure are NOT part of the generative model; they are just for inference.
Training Deep Belief Nets
The previous discussion gives an intuition for training a stack of RBMs one layer at a time.
Hinton proved that this greedy learning algorithm is efficient, in the sense that it never decreases a variational bound on the log likelihood of the data.
First, learn all the weights tied.
Training Deep Belief Nets
Then freeze the bottom layer and relearn all the other layers.
Training Deep Belief Nets
Then freeze the bottom two layers and relearn all the other layers.
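A compact sketch of this greedy layer-wise procedure with one-step Contrastive Divergence (CD-1); the layer sizes, learning rate, number of epochs, and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.05, epochs=30):
    """Train one RBM on `data` with one-step Contrastive Divergence (CD-1)."""
    n_visible = data.shape[1]
    W = rng.standard_normal((n_visible, n_hidden)) * 0.01
    a = np.zeros(n_visible)   # visible biases
    b = np.zeros(n_hidden)    # hidden biases
    for _ in range(epochs):
        # positive phase: sample hidden states from p(h | v)
        ph = sigmoid(data @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        # negative phase: reconstruct the visible layer, then re-infer the hiddens
        pv = sigmoid(h @ W.T + a)
        ph2 = sigmoid(pv @ W + b)
        # CD-1 updates: <v h>_data minus <v h>_reconstruction
        W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
        a += lr * (data - pv).mean(axis=0)
        b += lr * (ph - ph2).mean(axis=0)
    return W, b

# Greedy stacking: train a layer, freeze it, and train the next layer on its output.
v0 = (rng.random((500, 64)) > 0.5).astype(float)  # toy binary "data" layer
W1, b1 = train_rbm(v0, n_hidden=32)               # learn the first RBM
h1 = sigmoid(v0 @ W1 + b1)                        # freeze layer 1, propagate up
W2, b2 = train_rbm(h1, n_hidden=16)               # learn the second RBM on h1
```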
Training Deep Belief Nets
Each time we learn a new layer, the inference at the lower layers becomes incorrect, but the variational bound on the log probability of the data improves (proved by Hinton).
Since the inference at the lower layers becomes incorrect, Hinton uses a fine-tuning procedure, the wake-sleep algorithm, to adjust the weights.
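For reference, the variational bound in question is the standard lower bound (Q denotes the approximate posterior given by the recognition weights; notation assumed here):

```latex
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\, \log p(v, h) \;+\; \mathcal{H}\big(Q(h \mid v)\big).
```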
Training Deep Belief Nets
Wake-sleep algorithm:
Wake phase: do a bottom-up pass; sample h using the recognition weights, based on the input v, for each RBM, and then adjust the generative weights by the RBM learning rule.
Sleep phase: do a top-down pass, starting from a random state of h at the top layer, and generate v; then modify the recognition weights.
An analogy for the wake-sleep algorithm:
Wake phase: if reality differs from what is imagined, modify the generative weights to make what is imagined as close as possible to reality.
Sleep phase: if the illusions produced by the concepts learned during the wake phase differ from those concepts, modify the recognition weights to make the illusions as close as possible to the concepts.
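A sketch of the corresponding delta-rule updates in standard wake-sleep notation (ε is a learning rate, s are sampled binary states, p and q are the top-down and bottom-up predictions; these symbols are assumptions, not taken from the slides):

```latex
% wake phase: adjust generative weights toward reproducing the sampled states below
\Delta w^{\text{gen}}_{jk} \;=\; \epsilon\, s_j\,\big(s_k - p_k\big),
\qquad p_k = \sigma\Big(\textstyle\sum_j s_j\, w^{\text{gen}}_{jk}\Big)
\\[4pt]
% sleep phase: adjust recognition weights toward reproducing the "dreamed" states above
\Delta w^{\text{rec}}_{kj} \;=\; \epsilon\, s_k\,\big(s_j - q_j\big),
\qquad q_j = \sigma\Big(\textstyle\sum_k s_k\, w^{\text{rec}}_{kj}\Big)
```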
Useful Resources
Webpages:
Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
Deep Learning Tutorials: http://deeplearning.net/tutorial/
Hinton's Tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
Fergus's Tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
People:
Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
Yoshua Bengio: www.iro.umontreal.ca/~bengioy
Yann LeCun: http://yann.lecun.com/
Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php