Semi-Supervised Learning in Gigantic Image Collections

Presentation transcript:

Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University), Yair Weiss (Hebrew University), Antonio Torralba (MIT). Presented by Gunnar Atli Sigurdsson.

Spectrum of Label Information: human annotations, noisy labels, unlabeled. One property of images on the internet is that they have a wide range of label information. A tiny fraction have been labeled by humans and so have reliable labels, but a much larger fraction have some kind of noisy label: there is some text associated with the image, from the image name or the surrounding HTML, that gives a cue as to what is in the image but is not very accurate. And of course, we also have a large amount of data with no labels at all. So we would like a framework that can make use of all these types of labels.

Semi-Supervised Learning: the classification function should be smooth with respect to the data density. (Figure panels: Data, Supervised, Semi-Supervised.) One such technique is semi-supervised learning. Consider a toy dataset with just two labels.

Semi-Supervised Learning using Graph Laplacian. W is the n x n affinity matrix (n = # of points). We consider approaches of this type that are based on the graph Laplacian. Here each image is a vertex in a graph and the weight of the edge between two vertices is given by an affinity, W_ij = exp(-||x_i - x_j||^2 / (2 eps^2)). So for n points we have an n x n affinity matrix W, from which we can compute the normalized graph Laplacian L = D^{-1/2} (D - W) D^{-1/2}, where D is the diagonal degree matrix with D_ii = sum_j W_ij.
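As a concrete illustration of this construction, here is a minimal numpy sketch (function and variable names are mine; the Gaussian affinity and normalization follow the definitions just given):

```python
import numpy as np

def normalized_laplacian(X, eps=1.0):
    """Build the n x n Gaussian affinity W and the normalized graph Laplacian
    L = D^{-1/2} (D - W) D^{-1/2} for data X of shape (n, d)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * eps ** 2))            # affinity between all pairs
    d = W.sum(axis=1)                                 # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt    # normalized Laplacian
    return W, L
```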

SSL using Graph Laplacian. We want to find the label function f that minimizes J(f) = f^T L f + (f - y)^T Lambda (f - y), where y holds the labels and Lambda is diagonal with Lambda_ii = lambda if point i is labeled and 0 otherwise. The first term measures smoothness, the second agreement with the labels. In SSL, we solve for a label function f over the data points. The graph Laplacian measures the smoothness of f, while the second term constrains f to agree with the labels, weighted by lambda according to the reliability of each label. Writing the lambdas in matrix form, the optimal f is the solution of the n x n linear system (L + Lambda) f = Lambda y (n = # points).
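For completeness, a direct solve of this system (impractical at large n, but useful on toy data) might look like the following sketch; names are illustrative:

```python
import numpy as np

def solve_label_function(L, y, labeled_mask, lam=100.0):
    """Minimize f^T L f + (f - y)^T Lambda (f - y) by solving the
    n x n system (L + Lambda) f = Lambda y."""
    Lambda = np.diag(lam * labeled_mask.astype(float))   # lambda if labeled, 0 otherwise
    return np.linalg.solve(L + Lambda, Lambda @ y)
```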

Eigenvectors of Laplacian: approximate the system. Smooth vectors will be linear combinations of the eigenvectors U with small eigenvalues. Instead of directly solving the n x n linear system, we can model f as a linear combination of the smallest few eigenvectors of the Laplacian, where U holds the eigenvectors and alpha the coefficients. These smallest eigenvectors are smooth with respect to the data density: the smallest is just a DC term, but the 2nd smallest splits the data horizontally and the 3rd splits it vertically.

Rewrite System. Let f = U alpha, where U = smallest k eigenvectors of L and alpha = coefficients; k is a user parameter (typically ~100) and y = labels. The optimal alpha is now the solution of the k x k system (Sigma + U^T Lambda U) alpha = U^T Lambda y, where Sigma is the diagonal matrix of the corresponding eigenvalues. So if we use the k smallest eigenvectors as a basis, k being a value we select, typically 100 or so, then instead of solving an n x n system we just need to solve a k x k system for the coefficients alpha, from which we can compute the label function f. Problem: how do we get the eigenvectors?
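In code, the reduced problem is a small dense solve (a sketch; sigma holds the eigenvalues matching the columns of U):

```python
import numpy as np

def solve_coefficients(U, sigma, y, labeled_mask, lam=100.0):
    """f = U @ alpha, where alpha solves the k x k system
    (Sigma + U^T Lambda U) alpha = U^T Lambda y."""
    Lambda = np.diag(lam * labeled_mask.astype(float))
    A = np.diag(sigma) + U.T @ Lambda @ U      # k x k
    b = U.T @ (Lambda @ y)
    alpha = np.linalg.solve(A, b)
    return U @ alpha                            # label function f over all points
```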

Approach. So our approach takes a different route.

Overview of Approach: compute approximate eigenvectors. In Nystrom we reduce the number of data points down to a set of landmarks, and the cost is polynomial in the number of landmarks. By contrast, in our approach we consider the limit as the number of points goes to infinity, so that we have a continuous density. A key point is that our approach is linear in the number of data points.

Consider the Limit as n → ∞. Consider x to be drawn from a toy 2D distribution p(x). Let L_p(F) be a smoothness operator on p(x) for a function F(x): L_p(F) = ∫∫ (F(x_1) - F(x_2))^2 W(x_1, x_2) p(x_1) p(x_2) dx_1 dx_2. This operator penalizes functions that vary in areas of high density; notice that it is a continuous analogue of the graph Laplacian. Nearby locations x_1 and x_2 with high affinity will have similar values of F, if F is smooth. We now analyze the eigenfunctions of L_p(F).

Eigenvectors & Eigenfunctions. To get an intuition as to what these eigenfunctions look like, we show them in the bottom row of the figure. Notice that they are continuous functions that capture the same structure as the discrete eigenvectors.

Key Assumption: Separability of the Input Data. Claim: if p is separable, then eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalues. A key assumption we make in our method is that the input distribution is separable. So for our toy 2D data, we assume the joint density factors as a product of the two marginal distributions, p(x_1, x_2) = p(x_1) p(x_2), and one can show that eigenfunctions of these marginals are eigenfunctions of the joint density.

Preprocessing. We need to make the data separable, so we rotate it using PCA. One important pre-processing operation is that we rotate the data to make it more separable, as assumed by our algorithm. Currently we do this with PCA, although other options are possible.
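The rotation itself is a plain PCA via the SVD (a sketch; the centering and the number of retained dimensions are my choices):

```python
import numpy as np

def pca_rotate(X, n_components=32):
    """Rotate data into its principal axes so the dimensions are approximately
    independent, as required by the separability assumption."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T     # rotated (and optionally reduced) data
```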

Numerical Approximations to Eigenfunctions in 1D. Consider p(x_1), with 300,000 points drawn from the distribution p(x). Let's look at how we compute the eigenfunctions for one of these marginal distributions. Given a large set of observed data, which we assume is drawn from the density, we can form a histogram h(x_1), which will be an approximation to the true marginal.

Numerical Approximations to Eigenfunctions in 1D. Solve for the values of the eigenfunction at a set of discrete locations (the histogram bin centers) and the associated eigenvalues; this is a B x B system (B = # histogram bins, e.g. 50). We solve for the values of the eigenfunction g and the associated eigenvalues at the bin centers via the generalized eigenproblem (D_hat - P W P) g = sigma P D_hat g, where P is the approximated density at the selected points, W is the affinity between those points, and D_hat is the degree of the points as before. The size of this system is given by the number of histogram bins, which is small, e.g. 50 or so.
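Each marginal therefore reduces to a tiny generalized eigenproblem. A sketch follows (scipy's eigh handles the generalized form; the histogram smoothing and the affinity bandwidth eps are assumptions of mine):

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(data_1d, num_bins=50, eps=0.2, k=4):
    """Approximate the k smoothest eigenfunctions of one 1-D marginal.

    Returns bin centers, eigenfunction values g (num_bins x k), eigenvalues."""
    hist, edges = np.histogram(data_1d, bins=num_bins, density=True)
    hist = hist + 1e-6 * hist.max()                    # avoid empty bins
    centers = 0.5 * (edges[:-1] + edges[1:])
    P = np.diag(hist / hist.sum())                     # density at bin centers
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    PWP = P @ W @ P
    D_hat = np.diag(PWP.sum(axis=1))                   # degree of the points
    # generalized eigenproblem (D_hat - P W P) g = sigma * P D_hat g
    sigma, g = eigh(D_hat - PWP, P @ D_hat)
    return centers, g[:, :k], sigma[:k]
```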

From Eigenfunctions to Approximate Eigenvectors. Take each data point and do a 1-D interpolation in each eigenfunction; this is a very fast operation. Having obtained the eigenfunctions, how do we compute the approximate eigenvectors? For each data point we simply interpolate the eigenfunction value at its location, which gives the corresponding eigenvector value.
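The interpolation step is a single np.interp call per eigenfunction (a sketch, continuing the illustrative helpers above):

```python
import numpy as np

def interpolate_eigenvectors(X_rotated, dim, centers, g):
    """Approximate eigenvector values for every data point by 1-D linear
    interpolation of the eigenfunctions g defined at the histogram bin centers."""
    x = X_rotated[:, dim]
    cols = [np.interp(x, centers, g[:, j]) for j in range(g.shape[1])]
    return np.stack(cols, axis=1)      # one approximate eigenvector per column
```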

Overall Algorithm (an end-to-end sketch in code follows):
1. Rotate the data to maximize separability (PCA).
2. For each of the d input dimensions: construct a 1D histogram (to approximate the density), then solve numerically for the eigenfunctions and eigenvalues.
3. Order the eigenfunctions from all dimensions by increasing eigenvalue and take the first k.
4. Interpolate the data into these k eigenfunctions, yielding approximate eigenvectors of the Laplacian.
5. Solve the k x k least-squares system to give the label function.
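Putting the pieces together, an end-to-end sketch built from the illustrative helpers above (pca_rotate, eigenfunctions_1d, interpolate_eigenvectors, solve_coefficients; none of these are the authors' code):

```python
import numpy as np

def approximate_eigenvectors(X, k=100, num_bins=50):
    """Steps 1-4 of the algorithm: returns n x k approximate eigenvectors U
    and their eigenvalues, ready for the k x k solve (step 5)."""
    Xr = pca_rotate(X)
    candidates = []                                 # (eigenvalue, eigenvector) pairs
    for dim in range(Xr.shape[1]):
        centers, g, sigma = eigenfunctions_1d(Xr[:, dim], num_bins=num_bins)
        vecs = interpolate_eigenvectors(Xr, dim, centers, g)
        for j in range(g.shape[1]):
            candidates.append((sigma[j], vecs[:, j]))
    candidates.sort(key=lambda c: c[0])             # increasing eigenvalue
    sigma_k = np.array([c[0] for c in candidates[:k]])
    U = np.stack([c[1] for c in candidates[:k]], axis=1)
    return U, sigma_k

# f = solve_coefficients(U, sigma_k, y, labeled_mask)  # step 5
```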

Experiments on Toy Data

Nystrom Comparison. With Nystrom, too few landmark points result in highly unstable eigenvectors. Here we compare Nystrom to our eigenfunction approach: when only a few landmarks are used, Nystrom fails to capture the structure of the data, but our eigenfunction approach uses all the data points and we get the correct solution.

Nystrom Comparison. Eigenfunctions fail when the data has significant dependencies between dimensions. If the input data has significant dependencies between dimensions, like in this example, then our eigenfunction approach fails, while Nystrom, despite the small set of landmarks, gets the correct solution.

Experiments on Real Data

Experiments. In our first set of experiments we use 63,000 images from 126 classes (e.g. "dump truck", "emu") downloaded from Internet image search engines. Human labels for these images were gathered by Alex Krizhevsky and Geoff Hinton.

Real 1-D eigenfunctions of PCA'd Gist descriptors, shown for each input dimension.

Protocol Task is to re-rank images of each class (class/non-class) Use eigenfunctions computed on all 63,000 images Vary number of labeled examples Measure precision @ 15% recall

[Results plots; total number of images: 4800, 5000, 6000, 8000.]

80 Million Images

Running on 80 million images: PCA to 32 dimensions, k = 48 eigenfunctions. For each class, labels propagate through the 80 million images. The approximate eigenvectors are precomputed (~20 GB); label propagation itself is fast, under 0.1 seconds per keyword.

Example query: "Japanese Spaniel", with 3 positive and 3 negative labels from the CIFAR set.

Summary. A semi-supervised scheme that can scale to really large problems: it is linear in the number of points. Rather than sub-sampling the data, we take the limit of infinite unlabeled data. The method assumes the input data distribution is separable, and can propagate labels in a graph with 80 million nodes in a fraction of a second.

Training Deep Neural Networks on Noisy Labels with Bootstrapping. Scott E. Reed & Honglak Lee. Presented by Gunnar Atli Sigurdsson.

Perceptual Consistency. We want the classifier to predict the same label for similar percepts, similar to the "smoothness" objective above. This gives the classifier a reason to reject implausible labels and to fill in missing ones.

Method

Consistency via reconstruction. Model the noisy label as t = W q, where q is the network's prediction and W captures how labels are corrupted; this explicitly models the noise distribution.

Model-free consistency via bootstrapping. This is softmax regression with a minimum-entropy-style regularization: the training target becomes a mixture of the noisy label and the model's own current prediction, using either the soft predicted probabilities or the hard arg-max prediction.
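As a concrete illustration of the two variants, a minimal numpy sketch of bootstrapping losses (the mixing weight beta and the epsilon for numerical stability are my choices; q is the model's predicted class distribution and t the one-hot noisy label):

```python
import numpy as np

def soft_bootstrap_loss(q, t, beta=0.95):
    """Target mixes the noisy label with the predicted probabilities."""
    target = beta * t + (1.0 - beta) * q
    return -np.mean(np.sum(target * np.log(q + 1e-12), axis=1))

def hard_bootstrap_loss(q, t, beta=0.8):
    """Target mixes the noisy label with the one-hot arg-max prediction."""
    z = np.eye(q.shape[1])[np.argmax(q, axis=1)]
    target = beta * t + (1.0 - beta) * z
    return -np.mean(np.sum(target * np.log(q + 1e-12), axis=1))
```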

Summary. Weakly-supervised deep learning: simple add-on approaches improve performance and handle noisy, incomplete, and subjective labels.

Learning from Noisy Labels with Deep Neural Networks. Sainbayar Sukhbaatar & Rob Fergus.

Noisy labels. In reality the true label is a latent variable. Examples: Google search results, user tags, keywords, etc. Approaches: overfitting avoidance, preprocessing, priors, or marginalizing out the latent true label (a surrogate cost function). The authors refer to a logistic regression paper and follow this last approach; marginalizing out means that the final cost function is a linear combination of cost functions.

Motivation. In a standard CNN, mislabeled examples effectively cancel correct ones: training on 50K correct plus 40K incorrect labels behaves roughly like training on 10K correct labels. Assumptions: label noise is random conditioned on the true class, and each class is only mislabeled as a small set of others. The approach is "bottom-up": the model predicts the noisy labels rather than having them cleaned before being shown to the model. A top-down variant was also considered, but it learned a degenerate solution and gave garbage results.

Approach. We want to learn the noise distribution from the correctly labeled and noisily labeled samples. We augment a CNN with an extra linear layer Q whose weights represent probabilities: the noisy-label prediction is the base network's class posterior passed through Q. If Q is fixed, gradients simply pass through it, and multiplying by Q amounts to marginalizing over the latent true label.
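A minimal numpy sketch of this noise layer (the row convention for Q, i.e. Q[i, j] = assumed probability that true class i appears as noisy class j, is my assumption for illustration):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def noisy_label_probs(logits, Q):
    """p(noisy label | x): the base model's posterior pushed through Q.
    Each row of Q sums to 1."""
    p_true = softmax(logits)        # (N, K) p(true class | x)
    return p_true @ Q               # (N, K) p(noisy class | x)

def noisy_cross_entropy(logits, noisy_onehot, Q):
    """Loss against the observed noisy labels; gradients flow through Q
    into the base network (and into Q itself when Q is being learned)."""
    p_noisy = noisy_label_probs(logits, Q)
    return -np.mean(np.sum(noisy_onehot * np.log(p_noisy + 1e-12), axis=1))
```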

Estimating the noise distribution with clean data. Q has too many parameters for cross-validation. If clean data is available: train a model M, use confusion matrices of M on the clean and noisy data to estimate the noise distribution, or use Bayes rule to get Q.
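A deliberately simplified sketch: if some images carry both a trusted label and a noisy label, Q can be estimated directly as a row-normalized confusion matrix between the two (this pairing assumption is mine, for illustration; the paper estimates Q from model confusion matrices as described above):

```python
import numpy as np

def estimate_Q_from_pairs(true_labels, noisy_labels, num_classes):
    """Q[i, j] ~ p(noisy = j | true = i), from (true, noisy) label pairs."""
    Q = np.zeros((num_classes, num_classes))
    for t, n in zip(true_labels, noisy_labels):
        Q[t, n] += 1
    Q += 1e-3                                    # smooth rows with no counts
    return Q / Q.sum(axis=1, keepdims=True)
```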

Estimating the noise distribution with noisy data. Here the noise distribution is unknown, so we use back-propagation to learn Q, projecting it back onto the set of valid probability matrices, with a trace regularizer on Q that forces the network to use Q. The authors prove that the only global minimum of the regularized estimate is the ideal noise matrix. In practice a schedule controls when to start updating Q and when to apply the weight decay.
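A hedged sketch of one such update step (the per-row simplex projection and the exact form of the trace penalty are my assumptions, standing in for the paper's projection and regularization details):

```python
import numpy as np

def project_rows_to_simplex(Q):
    """Project each row of Q onto the probability simplex so Q stays a valid
    conditional distribution after a gradient step."""
    K = Q.shape[1]
    out = np.empty_like(Q)
    for i, v in enumerate(Q):
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, K + 1) > (css - 1))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        out[i] = np.maximum(v - theta, 0.0)
    return out

def q_update(Q, grad_from_loss, lr=0.01, trace_weight=0.1):
    """One gradient step on Q. The trace penalty tr(Q) only touches the
    diagonal, discouraging Q from collapsing to the identity (which would
    mean the base network, not Q, absorbs the label noise)."""
    grad = grad_from_loss + trace_weight * np.eye(Q.shape[0])
    return project_rows_to_simplex(Q - lr * grad)
```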

Experiments, synthetic noise. SVHN (the Google Street View House Numbers dataset); labels are randomly changed.

Experiments, synthetic noise. Mix together clean and noisy datasets. CIFAR-10 consists of 60K small 32x32 images from 10 object categories (Alex Krizhevsky).

Experiments, synthetic noise. 20K clean images plus 30K noisy images; test on the 30K noisy images. Baseline on 10K clean: 30% test error.

Experiments, 50K CIFAR-10. In practice, the noisy data is down-weighted in the loss. With many outside images the noise model is violated, yet adding 150K random images with uniform labels still improves over the baseline.

Experiments, inherent label noise. 1.3M clean images from ImageNet 2012 plus 1.4M noisy images from Google; same performance as AlexNet with 15M clean images.