Fisher kernels for image representation & generative classification models
Jakob Verbeek, December 11, 2009

Plan for this course
- Introduction to machine learning
- Clustering techniques: k-means, Gaussian mixture density
- Gaussian mixture density continued: parameter estimation with EM
- Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels
- Classification techniques 2: discriminative methods, kernels
- Decomposition of images: topic models, …

Classification
- Training data consists of "inputs", denoted x, and corresponding output "class labels", denoted y.
- The goal is to correctly predict the class label of a test data input.
- Learn a "classifier" f(x) from the input data that outputs the class label, or a probability over the class labels.
- Example: input: an image; output: a category label, e.g. "cat" vs. "no cat".
- Classification can be binary (two classes) or multi-class (a larger number of classes). In binary classification we often refer to one class as "positive" and the other as "negative".
- A binary classifier creates a boundary in the input space between the areas assigned to each class.

Example of classification
- Given: training images and their categories.
- What are the categories of these test images?

Discriminative vs. generative methods
- Generative probabilistic methods:
  - Model the density of inputs x from each class, p(x|y).
  - Estimate the class prior probability p(y).
  - Use Bayes' rule to infer the distribution over classes given the input.
- Discriminative (probabilistic) methods:
  - Directly estimate the class probability given the input: p(y|x).
  - Some methods have no probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0.

Generative classification methods
- Generative probabilistic methods:
  - Model the density of inputs x from each class, p(x|y).
  - Estimate the class prior probability p(y).
  - Use Bayes' rule to infer the distribution over classes given the input.
- Modeling the class-conditional densities over the inputs x requires selecting a model class:
  - Parametric models: e.g. Gaussian (for continuous data), Bernoulli (for binary data), …
  - Semi-parametric models: mixtures of Gaussians, mixtures of Bernoullis, …
  - Non-parametric models: histograms over one-dimensional or multi-dimensional data, the nearest-neighbor method, kernel density estimators.
- Given the class-conditional models, classification is trivial: just apply Bayes' rule (a minimal sketch follows below).
- Adding a new class only requires adding a new class-conditional model; the existing class-conditional models stay as they are.
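
A minimal sketch in Python of this generative pipeline, assuming Gaussian class-conditional densities; the function names and the arrays X (inputs) and y (labels) are illustrative, not code from the course.

```python
import numpy as np

# Generative classification sketch: fit p(x|y) as one Gaussian per class,
# estimate p(y) from class frequencies, and apply Bayes' rule at test time.

def fit_gaussian_generative(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),            # p(y=c)
            "mean": Xc.mean(axis=0),              # ML estimate of the class mean
            "cov": np.cov(Xc, rowvar=False),      # estimate of the class covariance
        }
    return params

def log_gaussian(x, mean, cov):
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def posterior(x, params):
    # Bayes' rule: p(y=c|x) proportional to p(x|y=c) p(y=c)
    log_joint = {c: np.log(p["prior"]) + log_gaussian(x, p["mean"], p["cov"])
                 for c, p in params.items()}
    m = max(log_joint.values())
    unnorm = {c: np.exp(v - m) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}
```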

Histogram methods
- Suppose we have N data points; use a histogram with C cells.
- How to set the density level in each cell? The maximum (log-)likelihood estimator is
  - proportional to the number of points n in the cell,
  - inversely proportional to the volume V of the cell,
  giving p = n / (N V) per cell (a short sketch follows below).
- Problems with the histogram method:
  - The number of cells scales exponentially with the dimension of the data.
  - The density estimate is discontinuous.
  - How to choose the cell size?
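
A short Python sketch of the density level n / (N V) for one-dimensional data; the bin count and the sample data are illustrative choices.

```python
import numpy as np

# Histogram density estimator: each cell gets density n / (N * V),
# where n is the number of points in the cell and V its width.

def histogram_density(data, n_bins=20):
    data = np.asarray(data)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    widths = np.diff(edges)                      # cell "volume" V in 1-D
    density = counts / (len(data) * widths)      # ML estimate n / (N V)
    return edges, density

# Usage: the estimated density integrates to ~1 over the data range.
rng = np.random.default_rng(0)
edges, density = histogram_density(rng.normal(size=1000))
print(np.sum(density * np.diff(edges)))          # ~1.0
```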

The ‘curse of dimensionality’
- The number of bins increases exponentially with the dimensionality of the data.
  - A fine division of each dimension: many empty bins.
  - A rough division of each dimension: a poor density model.
- A probability distribution over D binary variables takes at least 2^D values (at least 2 values per variable).
- The number of cells can be reduced by assuming independence between the components of x: the naïve Bayes model.
  - The model is "naïve" since it assumes that all variables are independent…
  - This is unrealistic for high-dimensional data, where variables tend to be dependent, and gives a poor density estimator.
  - Classification performance can still be good using the derived p(y|x).

Example of generative classification
- Hand-written digit classification:
  - Input: binary 28x28 scanned digit images, collected into a 784-dimensional vector.
  - Desired output: the class label of the image.
- Generative model:
  - An independent Bernoulli model for each class: one probability per pixel per class.
  - The maximum likelihood estimator is the average pixel value per class.
  - Classify using Bayes' rule (a minimal sketch follows below).
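
A minimal sketch of the per-class independent Bernoulli model and Bayes-rule classification, assuming a hypothetical binary pixel array X of shape (N, 784) and label vector y; the clipping constant is an illustrative regularization choice.

```python
import numpy as np

# Independent Bernoulli model per class: one probability per pixel per class,
# estimated as the average pixel value, then classification via Bayes' rule.

def fit_bernoulli(X, y, eps=1e-3):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])                 # p(y=c)
    # ML estimate: average pixel value per class, clipped so logs stay finite.
    mus = np.array([X[y == c].mean(axis=0).clip(eps, 1 - eps) for c in classes])
    return classes, priors, mus

def predict(X, classes, priors, mus):
    # log p(x|y=c) = sum_d [ x_d log mu_cd + (1 - x_d) log(1 - mu_cd) ]
    log_lik = X @ np.log(mus.T) + (1 - X) @ np.log(1 - mus.T)
    log_post = log_lik + np.log(priors)          # Bayes' rule up to a constant
    return classes[np.argmax(log_post, axis=1)]
```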

k-nearest-neighbor estimation method
- Idea: fix the number of samples in the cell, and find the right cell size.
- The probability of finding a point in a sphere A centered on x with volume v is P = ∫_A p(x') dx'.
- For a smooth density, approximately constant in a small region, P ≈ p(x) v.
- Alternatively, estimate P from the fraction of training data in the sphere around x: P ≈ k/N.
- Combine the above to obtain the estimate p(x) ≈ k / (N v).

k-nearest-neighbor estimation method
- Method in practice:
  - Choose k.
  - For a given x, compute the volume v of the sphere around x that contains k samples.
  - Estimate the density as p(x) ≈ k / (N v); a short sketch follows below.
  - The volume of a sphere with radius r in d dimensions is v = (π^(d/2) / Γ(d/2 + 1)) r^d.
- What effect does k have? (Data sampled from a mixture of Gaussians is shown in green in the slide figure.)
  - Larger k: larger region, smoother estimate.
- Selection of k: leave-one-out cross-validation, selecting the k that maximizes the data log-likelihood.
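
A short Python sketch of this estimator, p(x) ≈ k / (N v), with Euclidean distances and illustrative data; it is a sketch under those assumptions, not code from the course.

```python
import numpy as np
from math import gamma, pi

# k-NN density estimation: grow a sphere around x until it contains k
# training points, then estimate the density as k / (N * volume).

def sphere_volume(radius, d):
    """Volume of a d-dimensional sphere with the given radius."""
    return (pi ** (d / 2) / gamma(d / 2 + 1)) * radius ** d

def knn_density(x, train, k):
    train = np.asarray(train)
    n, d = train.shape
    dists = np.linalg.norm(train - x, axis=1)
    radius = np.sort(dists)[k - 1]               # distance to the k-th neighbor
    return k / (n * sphere_volume(radius, d))

# Usage on 2-D Gaussian samples; near the mode the true density is ~1/(2*pi).
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
print(knn_density(np.zeros(2), train, k=10))
```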

k-nearest-neighbor classification rule
- Use k-nearest-neighbor density estimation to find p(x|class), and apply Bayes' rule for classification.
- k-nearest-neighbor classification:
  - Find the sphere volume v that captures k data points around x: p(x) ≈ k / (N v).
  - Use the same sphere for the per-class estimates: p(x|y=c) ≈ k_c / (N_c v), where k_c of the k points belong to class c.
  - Estimate the global class priors: p(y=c) ≈ N_c / N.
  - Calculate the class posterior distribution with Bayes' rule, which reduces to p(y=c|x) = k_c / k (a short sketch follows below).
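
A minimal sketch of the resulting classification rule, where the class posterior reduces to the fraction k_c / k of the k nearest neighbors that belong to class c; the arrays are illustrative.

```python
import numpy as np

# k-NN classification: take the k nearest training points, and use the
# per-class fractions among them as the posterior p(y=c|x) = k_c / k.

def knn_classify(x, train_X, train_y, k):
    train_X, train_y = np.asarray(train_X), np.asarray(train_y)
    dists = np.linalg.norm(train_X - x, axis=1)
    neighbors = train_y[np.argsort(dists)[:k]]           # labels of k nearest points
    classes, counts = np.unique(neighbors, return_counts=True)
    posterior = dict(zip(classes, counts / k))           # p(y=c|x) = k_c / k
    return classes[np.argmax(counts)], posterior
```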

k-nearest-neighbor classification rule
- Effect of k on the classification boundary:
  - A larger number of neighbors means larger regions and smoother class boundaries.

Kernel density estimation methods
- Consider a simple estimator of the cumulative distribution function: F(x) = (1/N) sum_n 1[x_n <= x].
- Its derivative gives an estimator of the density function, but this is just a set of delta peaks.
- The derivative is defined as p(x) = lim_{h -> 0} (F(x+h) - F(x-h)) / (2h).
- Consider a non-limiting value of h: p(x) ≈ (1/(2hN)) sum_n 1[|x - x_n| <= h].
- Each data point adds 1/(2hN) within distance h around it; the sum of these "blocks" gives the estimate.

Kernel density estimation methods
- We can use a kernel other than the "block" function to obtain a smooth estimator.
- A widely used kernel function is the (multivariate) Gaussian: its contribution decreases smoothly with the distance to the data point (a short sketch follows below).
- Choice of the smoothing parameter:
  - A larger "kernel" size gives a smoother density estimator.
  - Use the average distance between samples, or use cross-validation.
- The method can be used for multivariate data, or within a naïve Bayes model.
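
A minimal sketch of kernel density estimation with an isotropic Gaussian kernel; the bandwidth h and the sample data are illustrative choices.

```python
import numpy as np

# Gaussian kernel density estimate: p(x) = (1/N) * sum_n N(x; x_n, h^2 I).

def gaussian_kde(x, train, h):
    train = np.asarray(train)
    n, d = train.shape
    sq_dists = np.sum((train - x) ** 2, axis=1)
    norm = (2 * np.pi * h ** 2) ** (d / 2)               # Gaussian normalizer
    return np.mean(np.exp(-0.5 * sq_dists / h ** 2) / norm)

# Usage: a smoother estimate is obtained by increasing the bandwidth h.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
print(gaussian_kde(np.zeros(2), train, h=0.3))
```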

Summary of generative classification methods
- (Semi-)parametric models (e.g. p(data|class) is a Gaussian or a mixture):
  - No need to store the data, but possibly too strong assumptions on the data density.
  - Can lead to a poor fit of the data, and a poor classification result.
- Non-parametric models:
  - Histograms: only practical in low-dimensional spaces (up to roughly 5 dimensions); a high-dimensional space leads to many cells, many of which will be empty. Naïve Bayes modeling helps in higher-dimensional cases.
  - k-nearest-neighbor & kernel density estimation: need to store all training data, and need to find the nearest neighbors or the points with non-zero kernel evaluation (costly).
(Figure: example density estimates with a histogram, k-NN, and kernel density estimation.)

Discriminative vs. generative methods
- Generative probabilistic methods:
  - Model the density of inputs x from each class, p(x|y).
  - Estimate the class prior probability p(y).
  - Use Bayes' rule to infer the distribution over classes given the input.
- Discriminative (probabilistic) methods (next week):
  - Directly estimate the class probability given the input: p(y|x).
  - Some methods have no probabilistic interpretation, e.g. they fit a function f(x) and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0.
- Hybrid generative-discriminative models:
  - Fit a density model to the data.
  - Use properties of this model as input for a classifier.
  - Example: Fisher vectors for image representation.

Clustering for visual vocabulary construction
- Clustering of local image descriptors using k-means or a mixture of Gaussians.
- Recap of the image representation pipeline (a short sketch follows below):
  - Extract image regions at various locations and scales.
  - Compute a descriptor for each region (e.g. SIFT).
  - (Softly) assign each descriptor to the clusters.
  - Make a histogram for the complete image by summing the vector representations of the descriptors.
  - Use this histogram as input to the image classification method.
(Figure: image regions, cluster indexes, and example histogram values.)
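
A minimal sketch of the soft-assignment and histogram steps, assuming a diagonal-covariance mixture-of-Gaussians vocabulary whose weights pi, means mu, and variances var are given; all names are hypothetical.

```python
import numpy as np

# Turn a set of local descriptors (N x D) into one image-level histogram (K,)
# by summing the soft assignments to the K Gaussian mixture components.

def soft_assign(descriptors, pi, mu, var):
    """Responsibilities q_nk of each descriptor n for each component k."""
    d = descriptors.shape[1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var), axis=1))
    sq = (((descriptors[:, None, :] - mu[None, :, :]) ** 2) / var[None, :, :]).sum(-1)
    log_p = np.log(pi) + log_norm - 0.5 * sq             # log pi_k N(x_n; mu_k, var_k)
    log_p -= log_p.max(axis=1, keepdims=True)            # numerical stability
    q = np.exp(log_p)
    return q / q.sum(axis=1, keepdims=True)

def bow_histogram(descriptors, pi, mu, var):
    """Image representation: sum of soft assignments over all regions."""
    return soft_assign(descriptors, pi, mu, var).sum(axis=0)
```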

Fisher vector motivation
- Feature vector quantization is computationally expensive in practice.
- The run time is linear in:
  - N: the number of feature vectors, ~10^3 per image
  - D: the number of dimensions, ~10^2 (SIFT)
  - K: the number of clusters, ~10^3 for recognition
- So in total on the order of 10^8 multiplications per image just to assign the SIFT descriptors to visual words, of which we only keep the histogram of visual word counts.
- Can we do this more efficiently?!
- Reading material: "Fisher Kernels on Visual Vocabularies for Image Categorization", F. Perronnin and C. Dance, CVPR 2007, Xerox Research Centre Europe, Meylan.

Fisher vector image representation
- A MoG / k-means representation stores the number of points per cell:
  - Many clusters are needed to represent the distribution of descriptors in an image, which increases the computational cost.
- The Fisher vector adds 1st and 2nd order moments:
  - A more precise description of the regions assigned to each cluster, so fewer clusters are needed for the same accuracy.
  - The representation is (2D+1) times larger, at the same computational cost: the required terms are already calculated when computing the soft assignments q_nk of image regions to the clusters (Gaussian mixture components).

Image representation using Fisher kernels
- General idea of the Fisher vector representation:
  - Fit a probabilistic model to the data.
  - Use the derivative of the data log-likelihood with respect to the model parameters as the data representation, e.g. for classification. [Jaakkola & Haussler, "Exploiting generative models in discriminative classifiers", in Advances in Neural Information Processing Systems 11, 1999.]
- Here, we use a mixture of Gaussians to cluster the region descriptors, and concatenate the derivatives to obtain the data representation (illustrative gradient expressions follow below).
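
The slide's formulas did not survive the transcript; as an illustration (standard log-likelihood gradients for a diagonal-covariance mixture of Gaussians, not necessarily the slide's exact notation), the per-descriptor derivatives take the form below, with q_{nk} the soft assignment of descriptor x_n to component k.

```latex
% L_n = \log \sum_k \pi_k \, \mathcal{N}(x_n; \mu_k, \mathrm{diag}(\sigma_k^2))
\begin{align}
\frac{\partial L_n}{\partial \mu_{kd}}
  &= q_{nk}\,\frac{x_{nd} - \mu_{kd}}{\sigma_{kd}^2}, \\
\frac{\partial L_n}{\partial \sigma_{kd}}
  &= q_{nk}\left(\frac{(x_{nd} - \mu_{kd})^2}{\sigma_{kd}^3} - \frac{1}{\sigma_{kd}}\right).
\end{align}
% Summing over the descriptors n of an image and concatenating over k and d
% gives the (unnormalized) Fisher vector of that image.
```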

Image representation using Fisher kernels
- Extended representation of image descriptors using a MoG:
  - The displacement of each descriptor from the cluster center.
  - The squares of the displacements from the center.
  - From 1 number per descriptor per cluster to 1 + D + D^2 (D = data dimension).
- A simplified version is obtained when:
  - using this representation for a linear classifier, and
  - using diagonal covariance matrices, with the variance in each dimension given by a vector v_k.
- Computed for a single image region descriptor and summed over all descriptors, this gives per cluster:
  - 1 number: the soft count of regions assigned to the cluster,
  - D numbers: the weighted average of the assigned descriptors,
  - D numbers: the weighted variance of the descriptors in all dimensions.
  (A short sketch of these statistics follows below.)
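
A minimal sketch of accumulating these per-cluster statistics over the descriptors of one image, assuming the soft assignments q, means mu, and variances var are given; the exact normalization of the full Fisher vector is omitted, so this illustrates the statistics rather than the course's exact formulation.

```python
import numpy as np

# Per-cluster statistics of one image: soft counts (K,), first-order
# moments (K x D), and second-order moments (K x D), concatenated into a
# single vector of length (2D + 1) * K.

def fisher_statistics(descriptors, q, mu, var):
    """descriptors: (N, D); q: (N, K) soft assignments; mu, var: (K, D)."""
    counts = q.sum(axis=0)                                      # soft counts per cluster
    diff = descriptors[:, None, :] - mu[None, :, :]             # (N, K, D) displacements
    first = np.einsum('nk,nkd->kd', q, diff / np.sqrt(var))     # weighted displacements
    second = np.einsum('nk,nkd->kd', q, diff ** 2 / var - 1)    # weighted variance terms
    return np.concatenate([counts, first.ravel(), second.ravel()])
```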

Fisher vector image representation
- A MoG / k-means representation stores the number of points per cell: many clusters are needed to represent the distribution of descriptors in an image.
- The Fisher vector adds 1st and 2nd order moments:
  - A more precise description of the regions assigned to each cluster, so fewer clusters are needed for the same accuracy.
  - The representation is (2D+1) times larger, at the same computational cost: the terms are already calculated when computing the soft assignments.
  - The computational cost is O(NKD), since we need the differences between all clusters and all data points.

Images from the categorization task PASCAL VOC
- A yearly "competition" for image classification (also object localization, segmentation, and body-part localization).

Fisher vector: results
- BOV-supervised learns a separate mixture model for each image class, so that some of the visual words become class-specific.
- MAP: assign the image to the class whose MoG assigns the maximum likelihood to the region descriptors.
- The other results are based on a linear classifier of the image descriptions.
- Similar performance is reached using 16x fewer Gaussians; the unsupervised/universal representation works well.

Plan for this course
- Introduction to machine learning
- Clustering techniques: k-means, Gaussian mixture density
- Gaussian mixture density continued: parameter estimation with EM
- Classification techniques 1: introduction, generative methods, semi-supervised
  - Reading for next week: previous papers (!), nothing new; available on the course website http://lear.inrialpes.fr/~verbeek/teaching
- Classification techniques 2: discriminative methods, kernels
- Decomposition of images: topic models, …