Semi-Supervised Learning in Gigantic Image Collections

Presentation transcript:

Semi-Supervised Learning in Gigantic Image Collections. Rob Fergus (New York University), Yair Weiss (Hebrew University), Antonio Torralba (MIT).

Gigantic Image Collections. What does the world look like? Object recognition for large-scale image search; high-level image statistics. Our goal is to develop object recognition and image search techniques that can scale to the billions of images on the Internet.

Spectrum of Label Information: human annotations, noisy labels, unlabeled. One property of images on the Internet is that they carry a wide range of label information. A tiny fraction have been labeled by humans and so have reliable labels, but a much larger fraction have some kind of noisy label: text associated with the image, from the file name or surrounding HTML, that gives a cue as to what is in the image but is not very accurate. And of course we also have a large amount of data with no labels at all. So we would like a framework that can make use of all these types of labels.

Semi-Supervised Learning. The classification function should be smooth with respect to the data density. [Toy example: data, supervised solution, semi-supervised solution.] One such technique is semi-supervised learning; consider a toy dataset with just two labeled points.

Semi-Supervised Learning using the Graph Laplacian [Zhu03, Zhou04]. W is the n x n affinity matrix (n = # of points), with W_ij = exp(-||x_i - x_j||^2 / 2 epsilon^2). Graph Laplacian: L = D^{-1/2}(D - W)D^{-1/2}, where D = diag(sum_j W_ij). We consider approaches of this type, based on the graph Laplacian: each image is a vertex in a graph and the weight of the edge between two vertices is given by the affinity above. So for n points we have an n x n affinity matrix W, from which we compute the normalized graph Laplacian L.
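
To make the construction concrete, here is a minimal NumPy sketch (not the authors' code) of building the Gaussian affinity matrix W and the normalized graph Laplacian L for a small dataset; the bandwidth `eps` is an illustrative choice.

```python
import numpy as np

def normalized_laplacian(X, eps=1.0):
    """X: (n, d) data matrix. Returns (W, L) with L = D^{-1/2} (D - W) D^{-1/2}."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # (n, n) squared distances
    W = np.exp(-sq_dists / (2.0 * eps ** 2))        # Gaussian affinities
    np.fill_diagonal(W, 0.0)                        # no self-affinity
    d = W.sum(axis=1)                               # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt  # normalized graph Laplacian
    return W, L
```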

SSL using the Graph Laplacian. We want the label function f that minimizes J(f) = f^T L f + sum_i lambda_i (f_i - y_i)^2, where y are the labels and lambda_i = lambda if point i is labeled and 0 otherwise; the first term measures smoothness, the second agreement with the labels. In SSL we solve for a label function f over the data points: the graph Laplacian measures the smoothness of f, while the second term constrains f to agree with the labels, weighted by lambda according to the reliability of each label. Writing Lambda = diag(lambda_i), the optimum satisfies (L + Lambda) f = Lambda y, an n x n linear system (n = # points).
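
A minimal sketch of the direct n x n solve, assuming the cost function above; the label weight `lam` and the +1/-1 label encoding are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def ssl_solve(L, y, labeled_mask, lam=100.0):
    """L: (n, n) Laplacian; y: (n,) labels (+1/-1, arbitrary where unlabeled);
    labeled_mask: (n,) bool. Returns the label function f."""
    Lam = np.diag(lam * labeled_mask.astype(float))   # weight only the labeled points
    f = np.linalg.solve(L + Lam, Lam @ y)             # the n x n linear system (L + Lam) f = Lam y
    return f
```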

Eigenvectors of the Laplacian. Smooth label functions are well approximated by linear combinations of the eigenvectors U of L with the smallest eigenvalues: f = U alpha [Belkin & Niyogi 06, Schoelkopf & Smola 02, Zhu et al 03, 08]. Instead of directly solving the n x n linear system, we model f as a linear combination of the smallest few eigenvectors of the Laplacian: U are the eigenvectors and alpha the coefficients. These eigenvectors are smooth with respect to the data density: the smallest is just a DC term, the 2nd smallest splits the data horizontally and the 3rd splits it vertically.

Rewrite the System. Let U = the k smallest eigenvectors of L, with eigenvalues Sigma = diag(sigma_1, ..., sigma_k), and alpha = the coefficients; k is a user parameter (typically ~100). The optimal alpha is now the solution of the k x k system (Sigma + U^T Lambda U) alpha = U^T Lambda y, and f = U alpha. So if we use the k smallest eigenvectors as a basis, instead of solving an n x n system we just solve a k x k system for the coefficients alpha, from which we compute the label function f.
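
A sketch of the reduced solve under the same assumptions, using SciPy's `eigsh` to obtain the k smallest eigenpairs; parameter values are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def ssl_solve_reduced(L, y, labeled_mask, k=100, lam=100.0):
    """Solve the k x k system (Sigma + U^T Lam U) alpha = U^T Lam y, return f = U alpha."""
    vals, U = eigsh(L, k=k, which='SM')               # k smallest eigenpairs of L
    # (shift-invert, eigsh(L, k=k, sigma=0), is usually faster for the smallest eigenvalues)
    Lam = np.diag(lam * labeled_mask.astype(float))
    A = np.diag(vals) + U.T @ Lam @ U                 # k x k system matrix
    alpha = np.linalg.solve(A, U.T @ (Lam @ y))
    return U @ alpha                                  # label function f over all n points
```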

Computational Bottleneck. Consider a dataset of 80 million images. Inverting L means inverting an 80 million x 80 million matrix; finding the eigenvectors of L means diagonalizing an 80 million x 80 million matrix. So if we want to scale to really large datasets we have a problem: directly solving the original linear system requires inverting an 80 million x 80 million matrix, and using the eigenvector basis requires diagonalizing that same matrix just to find the eigenvectors.

Large-Scale SSL: Related Work. Nystrom method: pick a small set of landmark points, compute exact eigenvectors on these, and interpolate the solution to the rest. Other approaches include mixture models (Zhu and Lafferty '05), sparse grids (Garcke and Griebel '05) and sparse graphs (Tsang and Kwok '06); see the Zhu '08 survey. Many large-scale SSL techniques resemble Nystrom, which computes the exact eigenvectors on a small set of landmarks and then interpolates the remaining points to give an approximate solution; others adaptively group the data points and compute eigenvectors on these groupings.
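
For comparison, a rough sketch of the Nystrom idea applied to a Gaussian affinity matrix (a common variant, not necessarily the exact scheme compared in the talk): compute eigenvectors on m landmarks, then extend them to all n points.

```python
import numpy as np

def nystrom_eigenvectors(X, m=500, k=100, eps=1.0, seed=0):
    """Approximate the top-k eigenvectors of the n x n affinity matrix using m landmarks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)       # landmark points
    Xm = X[idx]
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    W_mm = np.exp(-d2(Xm, Xm) / (2 * eps ** 2))           # landmark-landmark affinities
    W_nm = np.exp(-d2(X, Xm) / (2 * eps ** 2))            # all-points-to-landmark affinities
    vals, vecs = np.linalg.eigh(W_mm)                     # exact eigendecomposition on landmarks
    vals, vecs = vals[-k:], vecs[:, -k:]                  # keep the k largest eigenpairs of W_mm
    U = W_nm @ vecs / vals                                # Nystrom extension to all n points
    return U, vals
```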

Our Approach So our approach takes a different route.

Overview of Our Approach. Nystrom reduces the number of data points down to a set of landmarks and is polynomial in the number of landmarks. By contrast, we consider the limit as the number of points n goes to infinity, so that we have a continuous density; a key point is that our approach is linear in the number of data points.

Consider the Limit as n → ∞. Consider x to be drawn from a 2D distribution p(x). Let L_p(F) be a smoothness operator on p(x) for a function F(x): L_p(F) = ∫ (F(x_1) - F(x_2))^2 W(x_1, x_2) p(x_1) p(x_2) dx_1 dx_2, where W(x_1, x_2) = exp(-||x_1 - x_2||^2 / 2 epsilon^2). The smoothness operator penalizes functions that vary in areas of high density. We analyze the eigenfunctions of L_p(F). Note that this is a continuous analogue of the graph Laplacian: nearby locations x_1 and x_2 with high affinity will have similar values of F if F is smooth.

Eigenvectors & Eigenfunctions. To get an intuition for what these eigenfunctions look like, we show them in the bottom row: they are continuous functions that capture the same structure as the discrete eigenvectors.

Key Assumption: Separability of the Input Data. Claim: if p is separable, i.e. p(x_1, x_2) = p(x_1) p(x_2), then eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalues [Nadler et al. 06, Weiss et al. 08]. So for our toy 2D data we assume the joint density is modeled as a product of the two marginal distributions p(x_1) and p(x_2), and one can show that eigenfunctions of these marginals are eigenfunctions of the joint density.

Numerical Approximation of Eigenfunctions in 1D. 300,000 points drawn from the distribution p(x); consider the marginal p(x_1). Given a large set of observed data, which we assume is drawn from the density, we form a histogram h(x_1), which is an approximation to the true marginal.

Numerical Approximation of Eigenfunctions in 1D. We solve for the values g of the eigenfunction at a set of discrete locations (the histogram bin centers) and the associated eigenvalues, via a generalized eigenvalue problem of the form (D_tilde - P W_tilde P) g = sigma P D_hat g, where W_tilde is the affinity between the discrete bin centers, P = diag(h(x_1)), D_hat is the diagonal matrix of column sums of P W_tilde, and D_tilde is the diagonal matrix of column sums of P W_tilde P. This is a B x B system (B = # histogram bins, e.g. 50), so it is small and cheap to solve.
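
A sketch of the 1-D eigenfunction solve, implementing the generalized eigenproblem as written above; the bandwidth, bin count and the small floor on the histogram are assumptions made for illustration and numerical stability, so treat this as a sketch rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def eigenfunctions_1d(x, B=50, eps=0.2, k=4):
    """Approximate the k smallest eigenfunctions of one marginal from a histogram of x."""
    h, edges = np.histogram(x, bins=B, density=True)
    h = h + 1e-8                                      # floor empty bins so P stays positive definite
    centers = 0.5 * (edges[:-1] + edges[1:])
    W = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))  # affinity between bins
    P = np.diag(h)
    PW = P @ W
    PWP = PW @ P
    D_hat = np.diag(PW.sum(axis=1))
    D_tilde = np.diag(PWP.sum(axis=1))
    # generalized eigenproblem: (D_tilde - P W P) g = sigma * (P D_hat) g
    sigma, g = eigh(D_tilde - PWP, P @ D_hat)
    return centers, g[:, :k], sigma[:k]               # k smallest eigenvalues and eigenfunctions
```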

1D Approximate Eigenfunctions So if we solve this system, then what do the eigenfunctions look like? Well, here are the smallest 3 eigenfunctions of the marginal of the 1st dimension. You can see that the first one is fairly constant in regions of high density and then changes rapidly in the middle. The second has an extra kink and the 3rd has two kinks. 1st Eigenfunction of h(x1) 2nd Eigenfunction of h(x1) 3rd Eigenfunction of h(x1)

Separability over Dimension. Build a histogram over dimension 2, h(x_2), and solve for the eigenfunctions of h(x_2). We do the same thing for the 2nd dimension: here the marginal is similar to a Gaussian, and these are its 1st, 2nd and 3rd eigenfunctions.

From Eigenfunctions to Approximate Eigenvectors. Take each data point and do a 1-D interpolation in each eigenfunction; this is a very fast operation. Having obtained the eigenfunctions, for each data point we simply interpolate the eigenfunction value (defined at the 50 histogram bin centers) at the point's coordinate to get the corresponding entry of the approximate eigenvector.
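
The interpolation step is essentially one call to `np.interp` per (dimension, eigenfunction) pair; a minimal sketch:

```python
import numpy as np

def interpolate_eigenfunction(x_dim, bin_centers, g):
    """x_dim: (n,) values of one input dimension; bin_centers: (B,);
    g: (B,) eigenfunction values at the bin centers. Returns (n,) eigenvector entries."""
    return np.interp(x_dim, bin_centers, g)   # 1-D linear interpolation, very fast
```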

Preprocessing. We need to make the data separable, so we rotate it using PCA. This pre-processing step rotates the data to make it more separable, as assumed by our algorithm; currently we do this with PCA, although other options are possible.

Overall Algorithm (see the sketch below):
1. Rotate the data to maximize separability (currently using PCA).
2. For each of the d input dimensions: construct a 1D histogram and solve numerically for the eigenfunctions/eigenvalues.
3. Order the eigenfunctions from all dimensions by increasing eigenvalue and take the first k.
4. Interpolate the data into the k eigenfunctions, yielding approximate eigenvectors of the Laplacian.
5. Solve the k x k least-squares system to give the label function.
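
An end-to-end sketch of this pipeline, reusing the hypothetical helpers `eigenfunctions_1d()` and `interpolate_eigenfunction()` from the earlier sketches; all parameter values are illustrative, not the authors' settings.

```python
import numpy as np

def semi_supervised_eigenfunctions(X, y, labeled_mask, k=100, B=50, lam=100.0):
    # 1. Rotate the data to (approximately) separate the dimensions via PCA.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xr = Xc @ Vt.T                                     # PCA-rotated data

    # 2. Per input dimension: histogram + numerical eigenfunctions.
    funcs = []                                         # entries: (eigenvalue, dim, centers, g)
    for d in range(Xr.shape[1]):
        centers, G, sigmas = eigenfunctions_1d(Xr[:, d], B=B, k=6)
        for j in range(1, G.shape[1]):                 # skip the constant (DC) eigenfunction
            funcs.append((sigmas[j], d, centers, G[:, j]))

    # 3. Keep the k eigenfunctions with smallest eigenvalues; interpolate the data into them.
    funcs.sort(key=lambda t: t[0])
    funcs = funcs[:k]
    U = np.stack([interpolate_eigenfunction(Xr[:, d], c, g)
                  for (_, d, c, g) in funcs], axis=1)  # approximate eigenvectors, shape (n, k)
    sigma = np.array([s for (s, _, _, _) in funcs])

    # 4. Solve the k x k system for the label-function coefficients.
    Lam = np.diag(lam * labeled_mask.astype(float))
    A = np.diag(sigma) + U.T @ Lam @ U
    alpha = np.linalg.solve(A, U.T @ (Lam @ y))
    return U @ alpha                                   # label function over all data points
```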

Experiments on Toy Data

Nystrom Comparison With Nystrom, too few landmark points result in highly unstable eigenvectors Here we compare Nystrom to our eigenfunction approach. When only a few landmarks are used, Nystrom fails to capture the structure of the data. But our eigenfunction approach uses all the datapoints and we get the correct solution.

Nystrom Comparison Eigenfunctions fail when data has significant dependencies between dimensions But if the input data has significant dependencies between dimensions, like in this example, then our eigenfunction approach fails. Here Nystrom, despite the small set of landmarks, gets the correct solution.

Experiments on Real Data

Experiments. Images from 126 classes downloaded from Internet search engines, 63,000 images in total (e.g. dump truck, emu). Labels (correct/incorrect) were provided by Alex Krizhevsky, Vinod Nair & Geoff Hinton (CIFAR & U. Toronto), and our first set of experiments uses this 63,000-image set.

Input Image Representation. Pixels are not a convenient representation, so we use the Gist descriptor (Oliva & Torralba, 2001): apply oriented Gabor filters over different scales and average the filter energy in each bin. L2 distance between Gist vectors is a rough substitute for human perceptual distance. We represent each image with a single global descriptor and use PCA to project it down to 64 dimensions.
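
A rough sketch of a Gist-like descriptor (not Oliva & Torralba's implementation): oriented Gabor filters at a few scales, with filter energy averaged over a 4 x 4 grid. The scikit-image calls and parameter choices are assumptions for illustration; the input is assumed to be a 2-D grayscale image.

```python
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_like(image, n_orient=8, freqs=(0.1, 0.2, 0.3), grid=4):
    """image: 2-D grayscale array. Returns a global descriptor (e.g. 3 * 8 * 16 = 384 dims)."""
    image = resize(image, (128, 128), anti_aliasing=True)
    feats = []
    for f in freqs:                                    # a few spatial frequencies (scales)
        for i in range(n_orient):                      # several orientations
            theta = np.pi * i / n_orient
            real, imag = gabor(image, frequency=f, theta=theta)
            energy = np.sqrt(real ** 2 + imag ** 2)    # Gabor filter energy
            # average the energy in each cell of a grid x grid spatial partition
            cells = energy.reshape(grid, 128 // grid, grid, 128 // grid).mean(axis=(1, 3))
            feats.append(cells.ravel())
    return np.concatenate(feats)
```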

Are Dimensions Independent? Joint histograms for pairs of dimensions from the raw 384-dimensional Gist, and for pairs of dimensions after PCA to 64 dimensions. MI is the mutual information score; 0 = independent.

Real 1-D Eigenfunctions of PCA'd Gist Descriptors. [Plot: eigenfunctions shown per input dimension.]

Protocol. The task is to re-rank the images of each class (class/non-class). We use eigenfunctions computed on all 63,000 images, vary the number of labeled examples, and measure precision at 15% recall.
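
A minimal sketch of the evaluation metric, precision at 15% recall, computed from per-image scores and binary class labels; the implementation details are my own, not the authors'.

```python
import numpy as np

def precision_at_recall(scores, labels, recall_level=0.15):
    """scores: (n,) ranking scores; labels: (n,) binary class membership."""
    order = np.argsort(-scores)                        # rank images by descending score
    labels = labels[order].astype(float)
    tp = np.cumsum(labels)                             # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    idx = np.searchsorted(recall, recall_level)        # first rank reaching 15% recall
    return precision[min(idx, len(precision) - 1)]
```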

[Results plots: per-class performance, with total number of images per plot 4800, 5000, 8000 and 6000.]

80 Million Images

Running on 80 million images. PCA to 32 dimensions, k = 48 eigenfunctions. For each class, labels propagate through all 80 million images. The approximate eigenvectors are precomputed (~20 GB); label propagation itself is fast, under 0.1 sec per keyword.

Japanese Spaniel: 3 positive and 3 negative labels from the CIFAR set.

Airbus, Ostrich, Auto

Summary. A semi-supervised scheme that can scale to really large problems: linear in the number of points. Rather than sub-sampling the data, we take the limit of infinite unlabeled data; this assumes the input data distribution is separable. Labels can be propagated in a graph with 80 million nodes in fractions of a second. A related paper in this NIPS by Nadler, Srebro & Zhou; see the spotlights on Wednesday.

Future Work. Potentially use 2D or 3D histograms instead of 1D (requires more data); consider diagonal eigenfunctions; sharing of labels between classes.

Comparison of Approaches

Exact vs. Approximate Eigenvectors. Eigenvalues (exact : approximate): 0.0531 : 0.0535, 0.1920 : 0.1928, 0.2049 : 0.2068, 0.2480 : 0.5512, 0.3580 : 0.7979. For this toy dataset we can compute the exact eigenvectors of the graph Laplacian and the approximate eigenvectors from our eigenfunction approach. Note the similarity of the eigenvalues for the first 3 eigenvectors; after that the exact eigenvectors start mixing across dimensions.

Are Dimensions Independent? Joint histogram for pairs of dimensions from raw 384-dimensional Gist Joint histogram for pairs of dimensions after ICA ICA MI is mutual information score. 0 = Independent

Varying # Eigenfunctions

Leveraging Noisy Labels. Images in the dataset have noisy labels: the keyword used in the Internet search engine query. These can easily be incorporated into the SSL scheme by giving them 1/10th the weight of a hand-labeled example.
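
A minimal sketch of how such noisy labels could enter the same framework: they get 1/10th of the hand-label weight on the diagonal of the label-weight matrix; the weight values and helper names are illustrative.

```python
import numpy as np

def build_label_weights(n, hand_labeled_idx, noisy_idx, lam=100.0):
    """Diagonal weight matrix combining hand labels and noisy search-engine labels."""
    w = np.zeros(n)
    w[noisy_idx] = lam / 10.0        # weak trust in search-engine keywords
    w[hand_labeled_idx] = lam        # full trust in human labels (overrides noisy weight)
    return np.diag(w)                # use in place of Lam in the solves sketched earlier
```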

Leveraging Noisy Labels

Effect of Noisy Labels

Complexity Comparison. Key: n = # data points (big, > 10^6), l = # labeled points (small, < 100), m = # landmark points, d = # input dims (~100), k = # eigenvectors (~100), b = # histogram bins (~50).
Nystrom: select m landmark points; get the smallest k eigenvectors of an m x m system; interpolate the n points into the k eigenvectors; solve a k x k linear system. Polynomial in the number of landmarks.
Eigenfunction: rotate the n points; form d 1-D histograms; solve d linear systems, each b x b; do k 1-D interpolations of the n points; solve a k x k linear system. Linear in the number of data points.

Semi-Supervised Learning using the Graph Laplacian [Zhu03, Zhou04]. Graph with V = data points (n in total) and E = edges weighted by the n x n affinity matrix W; the normalized graph Laplacian L is computed from W as defined earlier.

Numerical Approximation of Eigenfunctions in 1D (detail). Solve for the values g of the eigenfunction at the discrete locations (histogram bin centers) and the associated eigenvalues: a B x B generalized eigenproblem (B = # histogram bins = 50), with P = diag(h(x_1)) and W_tilde the affinity between the discrete locations, as given earlier.

Real 1-D Eigenfunctions of PCA'd Gist Descriptors. [Plot: eigenfunction value vs. histogram bin (1-50) for eigenfunctions 1-256; color indicates the input dimension, with x ranging from xmin to xmax.]