Advanced topics
Outline: Self-taught learning; Learning feature hierarchies (Deep learning); Scaling up
Self-taught learning
Supervised learning. Labeled training images: cars, motorcycles. Testing: What is this? Sometimes, the most data wins. So how do we get more data? Even with AMT (Amazon Mechanical Turk), labeling is often slow and expensive.
Semi-supervised learning. Labeled images: car, motorcycle. Unlabeled images (all cars/motorcycles). Testing: What is this?
Self-taught learning. Labeled images: car, motorcycle. Unlabeled images (random internet images). Testing: What is this?
Self-taught learning. Learn features f1, f2, …, fk from the unlabeled images with sparse coding, LCC, etc. Use the learned f1, f2, …, fk to represent the training/test sets: each labeled image (car, motorcycle) is re-expressed through its activations a1, a2, …, ak. If the labeled training set is small, this can give a huge performance boost.
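A minimal sketch of the "represent with learned features" step, under stated assumptions: the feature matrix F below is random rather than learned, and the activations are computed with ISTA (iterative soft-thresholding), which is one common solver for the sparse coding objective and not necessarily the one used in the tutorial. The names F, lam and sparse_activations are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_activations(x, F, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5 * ||x - F a||^2 + lam * ||a||_1.

    F has shape (d, k): one column per feature f1, ..., fk.
    """
    L = np.linalg.norm(F, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = F.T @ (F @ a - x)           # gradient of the reconstruction term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
F = rng.standard_normal((16, 8))           # stand-in for learned features (assumption)
F /= np.linalg.norm(F, axis=0)             # unit-norm columns
x = 2.0 * F[:, 3]                          # an input built from feature 3
a = sparse_activations(x, F)               # new representation a1, ..., ak
```

The vector a would then replace the raw input x when training and testing the supervised classifier.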
Learning feature hierarchies/Deep learning
Why feature hierarchies. From bottom to top: pixels; edges; object parts (combination of edges); object models.
Deep learning algorithms: Stack sparse coding algorithm; Deep Belief Network (DBN) (Hinton); Deep sparse autoencoders (Bengio). [Other related work: LeCun, Lee, Yuille, Ng …]
Deep learning with autoencoders: Logistic regression; Neural network; Sparse autoencoder; Deep autoencoder.
Logistic regression. Logistic regression has a learned parameter vector θ. On input x, it outputs $h_\theta(x) = g(\theta^T x)$, where $g(z) = 1/(1 + e^{-z})$. Draw a logistic regression unit as: [Figure: inputs x1, x2, x3 and bias +1 feeding a single unit.]
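A minimal sketch of a single logistic unit, assuming the standard sigmoid output 1 / (1 + exp(-θᵀx)) with the +1 bias input appended to x. The function name logistic_unit is hypothetical.

```python
import math

def logistic_unit(x, theta):
    """Return g(theta^T [x, 1]) with g(z) = 1 / (1 + exp(-z)).

    theta has one more entry than x; its last entry multiplies the +1 bias input.
    """
    z = sum(t * xi for t, xi in zip(theta, list(x) + [1.0]))
    return 1.0 / (1.0 + math.exp(-z))

# With all-zero parameters the unit is maximally uncertain.
out = logistic_unit([0.5, -1.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```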
Neural Network. String a lot of logistic units together. Example 3-layer network: [Figure: inputs x1, x2, x3 and bias +1 (Layer 1); hidden units a1, a2, a3 and bias +1 (Layer 2); output unit (Layer 3).]
Neural Network. Example 4-layer network with 2 output units: [Figure: inputs x1, x2, x3 and bias +1 (Layer 1); hidden Layers 2 and 3, each with bias +1; two output units (Layer 4).]
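A sketch of the forward pass through such a network, assuming sigmoid units throughout and hidden layers of 3 units each (the hidden sizes are a guess; the slide only fixes 3 inputs and 2 outputs). The weights are random placeholders, not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Propagate x through a chain of logistic layers.

    weights: list of matrices of shape (n_out, n_in + 1); the last column
    multiplies the +1 bias unit of the previous layer.
    """
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = sigmoid(W @ np.append(a, 1.0))
    return a

rng = np.random.default_rng(0)
sizes = [3, 3, 3, 2]   # Layer 1 (inputs) through Layer 4 (2 output units)
weights = [rng.standard_normal((n_out, n_in + 1))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = forward([1.0, 0.0, -1.0], weights)
```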
Neural Network example [Courtesy of Yann LeCun]
Training a neural network. Given training set (x1, y1), (x2, y2), (x3, y3), …, adjust parameters θ (for every node) to make $h_\theta(x_i) \approx y_i$. (Use gradient descent with the “backpropagation” algorithm. Susceptible to local optima.)
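A sketch of gradient descent with backpropagation, under assumptions not stated on the slide: a squared-error loss, one hidden layer of 4 sigmoid units, a learning rate of 0.5, and XOR as toy data. Because training is susceptible to local optima, the final loss depends on the random initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, y, W1, b1, W2, b2):
    """Squared-error loss of a 1-hidden-layer sigmoid network, plus gradients."""
    A = sigmoid(X @ W1 + b1)             # hidden activations, shape (m, h)
    H = sigmoid(A @ W2 + b2)             # network outputs, shape (m, 1)
    m = X.shape[0]
    loss = 0.5 * np.mean((H - y) ** 2)
    dZ2 = (H - y) * H * (1 - H) / m      # backprop through the output sigmoid
    dZ1 = (dZ2 @ W2.T) * A * (1 - A)     # backprop through the hidden sigmoid
    return loss, (X.T @ dZ1, dZ1.sum(0), A.T @ dZ2, dZ2.sum(0))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
params = [rng.standard_normal((2, 4)), np.zeros(4),
          rng.standard_normal((4, 1)), np.zeros(1)]

initial_loss, _ = loss_and_grads(X, y, *params)
for _ in range(5000):
    _, grads = loss_and_grads(X, y, *params)
    params = [p - 0.5 * g for p, g in zip(params, grads)]   # gradient descent step
final_loss, _ = loss_and_grads(X, y, *params)
```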
Unsupervised feature learning with a neural network. Autoencoder: the network is trained to output its input (i.e., learn the identity function). The solution is trivial unless we either constrain the number of units in Layer 2 (learn a compressed representation) or constrain Layer 2 to be sparse. [Figure: inputs x1, …, x6 and bias +1 (Layer 1); hidden units a1, a2, a3 and bias +1 (Layer 2); outputs x1, …, x6 (Layer 3).]
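Unsupervised feature learning with a neural network. Training a sparse autoencoder: given unlabeled training set x1, x2, …, minimize over θ the objective $\|h_\theta(x) - x\|^2 + \lambda \sum_i |a_i|$, where the first term is the reconstruction error term, the second is the L1 sparsity term with weight λ, and a1, a2, a3 are the hidden-layer activations.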
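A sketch of how the two terms of a sparse autoencoder objective could be computed, assuming sigmoid hidden units, a linear output layer, and a sparsity weight lam. These choices and all names are illustrative assumptions, not the tutorial's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_objective(X, W1, b1, W2, b2, lam):
    """Return (reconstruction error + lam * L1 sparsity, hidden activations)."""
    A = sigmoid(X @ W1 + b1)                  # hidden activations a1, a2, a3
    X_hat = A @ W2 + b2                       # reconstruction of the input
    reconstruction = np.sum((X_hat - X) ** 2)
    sparsity = lam * np.sum(np.abs(A))
    return reconstruction + sparsity, A

rng = np.random.default_rng(0)
X = rng.random((10, 6))                       # 10 unlabeled examples, 6 inputs each
W1, b1 = 0.1 * rng.standard_normal((6, 3)), np.zeros(3)
W2, b2 = 0.1 * rng.standard_normal((3, 6)), np.zeros(6)
objective, A = sparse_autoencoder_objective(X, W1, b1, W2, b2, lam=0.1)
```

Raising lam trades reconstruction accuracy for sparser hidden activations.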
Unsupervised feature learning with a neural network. [Figure: autoencoder with inputs x1, …, x6 and bias +1 (Layer 1); hidden units a1, a2, a3 and bias +1 (Layer 2); reconstructed outputs x1, …, x6 (Layer 3).]
Unsupervised feature learning with a neural network. [Figure: the same network without the output layer; inputs x1, …, x6 (Layer 1) feed hidden units a1, a2, a3 (Layer 2).] The hidden activations [a1, a2, a3] are the new representation for the input.
Unsupervised feature learning with a neural network. [Figure: inputs x1, …, x6; first hidden layer a1, a2, a3; second hidden layer b1, b2, b3; each layer with bias +1.] Train the parameters of the second layer so that its reconstruction $\hat{a} \approx a$, subject to the bi's being sparse.
Unsupervised feature learning with a neural network. [Figure: inputs x1, …, x6; hidden layers a1, a2, a3 and b1, b2, b3.] The activations [b1, b2, b3] are the new representation for the input.
Unsupervised feature learning with a neural network. [Figure: inputs x1, …, x6; hidden layers a1, a2, a3; b1, b2, b3; and c1, c2, c3; each layer with bias +1.]
Unsupervised feature learning with a neural network. The activations [c1, c2, c3] are the new representation for the input. Use [c1, c2, c3] as the representation to feed to the learning algorithm.
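A sketch of the layer-by-layer procedure: train a sparse autoencoder on the inputs, keep its hidden activations as the new representation, and repeat on that representation. The trainer below is a deliberately small gradient-descent loop under assumed settings (sigmoid encoder, linear decoder, L1 penalty weight lam, fixed step count); all names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sparse_autoencoder(X, n_hidden, lam=1e-3, lr=0.1, n_steps=500, seed=0):
    """Fit an autoencoder by gradient descent and return its encoder (W, b)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W1, b1 = 0.1 * rng.standard_normal((d, n_hidden)), np.zeros(n_hidden)
    W2, b2 = 0.1 * rng.standard_normal((n_hidden, d)), np.zeros(d)
    for _ in range(n_steps):
        A = sigmoid(X @ W1 + b1)              # hidden activations
        R = (A @ W2 + b2 - X) / m             # gradient of the reconstruction term
        dA = R @ W2.T + lam / m               # L1 term: A > 0, so d|A|/dA = 1
        dZ1 = dA * A * (1 - A)                # backprop through the sigmoid
        W2 -= lr * (A.T @ R)
        b2 -= lr * R.sum(0)
        W1 -= lr * (X.T @ dZ1)
        b1 -= lr * dZ1.sum(0)
    return W1, b1

def stack_autoencoders(X, hidden_sizes):
    """Greedy layer-wise training: each layer is trained on the previous layer's output."""
    layers, H = [], X
    for n_hidden in hidden_sizes:
        W, b = train_sparse_autoencoder(H, n_hidden)
        layers.append((W, b))
        H = sigmoid(H @ W + b)                # new representation feeds the next layer
    return layers

def encode(X, layers):
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X

rng = np.random.default_rng(1)
X = rng.random((50, 6))                       # toy unlabeled data with 6 inputs
layers = stack_autoencoders(X, [3, 3, 3])     # layers a, b, c
C = encode(X, layers)                         # [c1, c2, c3] for each example
```

The rows of C would then be passed to a supervised learner in place of the raw inputs.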
Deep Belief Net. A Deep Belief Net (DBN) is another algorithm for learning a feature hierarchy. Building block: 2-layer graphical model (Restricted Boltzmann Machine). Can then learn additional layers one at a time.
Restricted Boltzmann machine (RBM). Input [x1, x2, x3, x4]; Layer 2 [a1, a2, a3] (binary-valued). MRF with joint distribution: $P(x, a) \propto \exp\big(\sum_{i,j} W_{ij} x_i a_j\big)$. Use Gibbs sampling for inference. Given observed inputs x, want maximum likelihood estimation: $\max_W \log P(x)$, where $P(x) = \sum_a P(x, a)$.
Restricted Boltzmann machine (RBM). Input [x1, x2, x3, x4]; Layer 2 [a1, a2, a3] (binary-valued). Gradient ascent on log P(x) over the weights $W_{ij}$: $W_{ij} := W_{ij} + \alpha\,([x_i a_j]_{\text{obs}} - [x_i a_j]_{\text{prior}})$. $[x_i a_j]_{\text{obs}}$ comes from fixing x to the observed value and sampling a from P(a|x). $[x_i a_j]_{\text{prior}}$ comes from running Gibbs sampling to convergence. Adding a sparsity constraint on the ai's usually improves results.
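A sketch of one weight update for a binary RBM. Two assumptions differ from the slide and are worth naming: the prior statistics come from a single Gibbs step (contrastive divergence, CD-1) rather than from running Gibbs sampling to convergence, and the model has no bias terms, so P(a_j = 1 | x) = sigmoid(Σ_i W_ij x_i). Hidden probabilities are used in place of samples when accumulating statistics, a common variance-reduction choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, alpha, rng):
    """One CD-1 step for a bias-free binary RBM with weights W of shape (n_x, n_a)."""
    # Observed phase: fix x to the data and sample a from P(a | x).
    p_a = sigmoid(X @ W)
    a = (rng.random(p_a.shape) < p_a).astype(float)
    # Approximate prior phase: one Gibbs step x -> a -> x' -> a'.
    p_x = sigmoid(a @ W.T)
    x_neg = (rng.random(p_x.shape) < p_x).astype(float)
    p_a_neg = sigmoid(x_neg @ W)
    obs = X.T @ p_a / len(X)                  # estimate of [x_i a_j]_obs
    prior = x_neg.T @ p_a_neg / len(X)        # CD-1 estimate of [x_i a_j]_prior
    return W + alpha * (obs - prior)

rng = np.random.default_rng(0)
X = (rng.random((20, 4)) < 0.5).astype(float)  # 20 binary inputs [x1, x2, x3, x4]
W = 0.01 * rng.standard_normal((4, 3))         # 3 hidden units [a1, a2, a3]
W_new = cd1_update(X, W, alpha=0.1, rng=rng)
```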
Deep Belief Network. Similar to a sparse autoencoder in many ways. Stack RBMs on top of each other to get a DBN: Input [x1, x2, x3, x4]; Layer 2 [a1, a2, a3]; Layer 3 [b1, b2, b3]. Train with approximate maximum likelihood, often with a sparsity constraint on the ai's.
Deep Belief Network. Input [x1, x2, x3, x4]; Layer 2 [a1, a2, a3]; Layer 3 [b1, b2, b3]; Layer 4 [c1, c2, c3]. One of the challenges is scaling up: most people work with 14x14 up to 32x32 inputs.
Deep learning examples
Convolutional DBN for audio. [Figure: spectrogram input; detection units; max pooling unit.]
Convolutional DBN for audio. Time-invariant features. [Figure: spectrogram.]
Probabilistic max pooling. Convolutional neural net: the pooling unit computes max {x1, x2, x3, x4}, where the xi are real numbers. Convolutional DBN: the pooling unit computes max {x1, x2, x3, x4}, where the xi are {0,1} and mutually exclusive. Thus, 5 possible cases: all xi are 0, or exactly one of x1, x2, x3, x4 is 1. This collapses 2^n configurations into n+1 configurations and permits bottom-up and top-down inference.
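A sketch of the distribution over the n+1 allowed configurations, assuming the softmax-style form used in the convolutional DBN literature: each detection unit i receives a bottom-up input I_i, P(x_i = 1) = exp(I_i) / (1 + Σ_j exp(I_j)), and the all-off case has probability 1 / (1 + Σ_j exp(I_j)). The function name and the example inputs are hypothetical.

```python
import math

def prob_max_pool(inputs):
    """Return ([P(x_i = 1) for each i], P(all x_i = 0)) for one pooling block."""
    m = max(0.0, max(inputs))                 # shift for numerical stability
    on = [math.exp(v - m) for v in inputs]    # unnormalized weight of "unit i on"
    off = math.exp(-m)                        # unnormalized weight of "all off"
    z = off + sum(on)
    return [w / z for w in on], off / z

# Four detection units with zero input: the 5 cases are equally likely.
p_on, p_off = prob_max_pool([0.0, 0.0, 0.0, 0.0])
p_pool = 1.0 - p_off                          # probability that the pooling unit is on
```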
Convolutional DBN for audio. [Figure: spectrogram.]
Convolutional DBN for audio. [Figure: two stacked CDBN layers. One CDBN layer consists of detection units followed by max pooling; the second CDBN layer repeats this on top of the first.]
CDBNs for speech: learned first-layer bases. For visual bases, one can look at them and see whether they make sense (e.g., correspond to Gabors). Try to perform a similar analysis on audio bases.
Convolutional DBN for images. Input data V: visible nodes (binary or real). Wk: “filter” weights (shared). Detection layer H: hidden nodes (binary). Max-pooling layer P: “max-pooling” nodes (binary). Within each pooling block, at most one hidden node is active.
Convolutional DBN on face images. Learned hierarchy, from bottom to top: pixels; edges; object parts (combination of edges); object models. Note: Sparsity is important for these results.
Learning of object parts. Examples of learned object parts from object categories: faces, cars, elephants, chairs.
Training on multiple objects. Trained on 4 classes (cars, faces, motorbikes, airplanes). Second layer: shared features and object-specific features. Third layer: more specific features. [Figures: second-layer bases learned from 4 object categories; third-layer bases learned from 4 object categories; plot of H(class | neuron active).]
Hierarchical probabilistic inference. Generating posterior samples from faces by “filling in” experiments (cf. Lee and Mumford, 2003). Combine bottom-up and top-down inference. [Figure rows: input images; samples from feedforward inference (control); samples from full posterior inference.] (Aglioti et al., 1994; Halligan et al., 1993; Weinstein, 1969; Ramachandran, 1998; Sadato et al., 1996; Halligan et al., 1999)
Key issue in feature learning: Scaling up
Scaling up with graphics processors. [Figure: peak GFlops, 2003 to 2008, US$250 NVIDIA GPU vs. Intel CPU. Source: NVIDIA CUDA Programming Guide.] For comparison (http://www.cbsnews.com/stories/2000/06/29/tech/main210684.shtml): a supercomputer with 12.3 Tflops cost $110 million and was used to simulate nuclear weapon testing; that is like 13 graphics cards costing $250 each. 40 people with US$250 graphics cards would have ranked #18 on the top supercomputers list 2 years back (http://www.top500.org/list/2006/11/100).
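The comparison can be checked with a few lines of arithmetic. The per-card throughput below is not quoted on the slide; it is the value implied by "12.3 Tflops is like 13 cards", and "40 people" is read as 40 cards.

```python
# Figures quoted on the slide.
supercomputer_tflops = 12.3
supercomputer_cost_usd = 110e6
n_cards = 13
card_cost_usd = 250

# Implied quantities (assumptions derived from the slide's comparison).
tflops_per_card = supercomputer_tflops / n_cards
cards_cost_usd = n_cards * card_cost_usd
cost_ratio = supercomputer_cost_usd / cards_cost_usd
forty_cards_tflops = 40 * tflops_per_card

print(f"Implied peak per card: {tflops_per_card:.2f} Tflops")
print(f"Cost of {n_cards} cards: ${cards_cost_usd:,}")
print(f"Supercomputer cost / card cost: {cost_ratio:,.0f}x")
print(f"40 cards: {forty_cards_tflops:.1f} Tflops peak")
```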
Scaling up with GPUs. [Figure: approx. number of parameters (millions); using GPU (Raina et al., 2009).]
Unsupervised feature learning: Does it work?
State-of-the-art task performance (accuracy)
Audio
  TIMIT phone classification: prior art (Clarkson et al., 1999) 79.6%; Stanford feature learning 80.3%
  TIMIT speaker identification: prior art (Reynolds, 1995) 99.7%; Stanford feature learning 100.0%
Images
  CIFAR object classification: prior art (Yu and Zhang, 2010) 74.5%; Stanford feature learning 75.5%
  NORB object classification: prior art (Ranzato et al., 2009) 94.4%; Stanford feature learning 96.2%
Video
  UCF activity classification: prior art (Kalser et al., 2008) 86%; Stanford feature learning 87%
  Hollywood2 classification: prior art (Laptev, 2004) 47%; Stanford feature learning 50%
Multimodal (audio/video)
  AVLetters lip reading: prior art (Zhao et al., 2009) 58.9%; Stanford feature learning 63.1%
Summary. Instead of hand-tuning features, use unsupervised feature learning! Sparse coding, LCC. Advanced topics: self-taught learning, deep learning, scaling up.
Other resources. Workshop page: http://ufldl.stanford.edu/eccv10-tutorial/ Code for sparse coding, LCC. References. Full online tutorial.