Discriminative classification methods, kernels, and topic models Jakob Verbeek January 8, 2010
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models
Generative vs Discriminative Classification
Training data consists of "inputs", denoted x, and corresponding output "class labels", denoted y. The goal is to predict the class label y for a given test input x.
Generative probabilistic methods
–Model the density of the inputs x from each class, p(x|y)
–Estimate the class prior probability p(y)
–Use Bayes' rule to infer the distribution over classes given the input
Discriminative (probabilistic) methods
–Directly estimate the class probability given the input: p(y|x)
–Some methods do not have a probabilistic interpretation, but fit a function f(x) and assign classes based on sign(f(x))
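As a small illustration of the generative route (not from the slides; the 1-D Gaussian densities and priors below are made-up values), a minimal Python sketch of classification via Bayes' rule:

import numpy as np
from scipy.stats import norm

# assumed already estimated from training data (illustrative values)
priors = {0: 0.7, 1: 0.3}
densities = {0: norm(loc=-1.0, scale=1.0), 1: norm(loc=+2.0, scale=1.0)}

def posterior(x):
    # Bayes' rule: p(y|x) is proportional to p(x|y) p(y)
    joint = {y: densities[y].pdf(x) * priors[y] for y in priors}
    z = sum(joint.values())
    return {y: joint[y] / z for y in joint}

print(posterior(0.5))   # distribution over the two classes for a test input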
Discriminative vs generative methods
Generative probabilistic methods
–Model the density of inputs x from each class, p(x|y)
–Parametric models: assume a specific form of p(x|y), e.g. Gaussian, Bernoulli, …
–Semi-parametric models: combine simple class-conditional models, e.g. mixtures of Gaussians
–Non-parametric models: make no assumptions on the form of p(x|y), e.g. k-nearest-neighbour, histograms, kernel density estimators
Discriminative (probabilistic) methods
–Directly estimate the class probability given the input: p(y|x)
–Logistic discriminant
–Support vector machines
Discriminant function
1. Choose a class of decision functions in feature space.
2. Estimate the function parameters from the training set.
3. Classify a new pattern on the basis of this decision rule.
Compare the kNN example from last week, which needs to store all the data, with separation using a smooth curve, for which we only need to store the curve parameters.
Linear functions in feature space
The decision function is linear in the features: f(x) = w^T x + w_0
Classification is based on the sign of f(x)
The slope is determined by w; the offset from the origin is determined by w_0
The decision surface is the (d-1)-dimensional hyper-plane orthogonal to w given by f(x) = 0
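A concrete numerical illustration (the weight values are made up, not from the slides):

import numpy as np

w = np.array([1.0, -2.0])   # slope: orientation of the decision hyper-plane
w0 = 0.5                    # offset from the origin

def f(x):
    # linear decision function f(x) = w^T x + w0
    return w @ x + w0

x_test = np.array([0.3, 0.8])
label = 1 if f(x_test) > 0 else -1   # classify on the sign of f(x)
print(f(x_test), label)              # f = 0.3 - 1.6 + 0.5 = -0.8  ->  class -1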
Dealing with more than two classes
First idea: construct a classifier for all classes from several 2-class classifiers
–E.g. separate one class vs all others, or separate all pairs of classes from each other
–Learn the 2-class "base" classifiers independently
It is not clear what should be done in some regions
–1 vs all: a region may be claimed by several classes
–Pair-wise: the base classifiers may not agree
Dealing with more than two classes
Alternative: use a linear function for each class, and assign a sample to the class whose function has the maximum value
The decision regions are convex in this case
–If two points fall in a region, then all points on the line connecting them do too
Logistic discrimination
We go back to the concept of x as a stochastic vector and assume that the log-ratio of the class-conditional densities is linear in x:
log [ p(x|y=1) / p(x|y=0) ] = w^T x + w_0
This is valid for a variety of parametric models, examples to come…
In this case we can show that p(y=1|x) = sigma(w^T x + w_0), where sigma(z) = 1 / (1 + exp(-z)) is the logistic (sigmoid) function.
Thus we see that the class boundary is obtained by setting the linear function in the exponent to zero.
Logistic discrimination example
Suppose each p(x|y) is Gaussian with the same covariance matrix. In this case we have
log [ p(x|y=1) / p(x|y=0) ] = (mu_1 - mu_0)^T Sigma^{-1} x + const,
which is linear in x. Thus the probability of the class given the data is a logistic discriminant model.
Points with equal probability lie in a hyper-plane orthogonal to w.
[Figures: p(data|class) for the two classes, and the resulting p(class|data).]
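A minimal sketch of this standard result in code (the means, covariance and prior below are illustrative assumptions): with shared covariance, the quadratic terms cancel and w = Sigma^{-1}(mu_1 - mu_0) gives a logistic posterior.

import numpy as np

mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
prior1 = 0.5

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu0)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(prior1 / (1 - prior1))

def p_class1(x):
    # posterior takes the logistic discriminant form sigma(w^T x + w0)
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

print(p_class1(np.array([0.0, 0.5])))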
Multi-class logistic discriminant
Assume that the log-ratio of the class-conditional densities is linear in x for every pair of classes.
In that case we find that p(y=c|x) = exp(w_c^T x + w_c0) / sum_k exp(w_k^T x + w_k0).
This is known as the "soft-max" over the linear functions corresponding to the classes.
For any given pair of classes we find that they are equally likely on a hyper-plane in the feature space.
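A minimal sketch of the soft-max over per-class linear scores (the weights are made-up values, not from the slides):

import numpy as np

def softmax_probs(x, W, w0):
    # per-class linear functions f_c(x) = W[c] @ x + w0[c]
    scores = W @ x + w0
    # soft-max turns the scores into class probabilities p(y=c|x)
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # 3 classes, 2 features
w0 = np.zeros(3)
print(softmax_probs(np.array([0.5, -0.2]), W, w0))    # probabilities summing to 1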
Parameter estimation for logistic discriminant
For the 2-class or multi-class model, we set the parameters using maximum likelihood
–Maximize the (log) likelihood of predicting the correct class label for the training data
–Thus we sum, over all training data, the log-probability of the correct label: L(w) = sum_n log p(y_n | x_n)
The derivative of the log-likelihood has an intuitive interpretation: for class c it is
dL/dw_c = sum_n ( [y_n = c] - p(y=c|x_n) ) x_n, where [·] is the indicator function (1 if true, else 0).
–The expected number of points from each class should equal the actual number.
–The expected value of each feature, weighting points by p(y|x), should equal its empirical expectation.
There is no closed-form solution, so use a gradient-based method to find the optimal parameters
–Note 1: the log-likelihood is concave in w, so there are no local optima
–Note 2: w is a linear combination of the data points
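A minimal sketch of maximum-likelihood training of the 2-class model by batch gradient ascent (step size, iteration count and the toy data are illustrative assumptions, not from the slides):

import numpy as np

def fit_logistic(X, y, steps=1000, lr=0.1):
    # X: (n, d) inputs, y: (n,) labels in {0, 1}
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])      # append a constant 1 to absorb the offset w0
    w = np.zeros(d + 1)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))     # p(y=1|x) for all training points
        grad = Xb.T @ (y - p)                 # gradient of the log-likelihood
        w += lr * grad / n                    # ascent step; objective is concave
    return w

# toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fit_logistic(X, y))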
Motivation for support vector machines
Suppose the classes are linearly separable
–There is a continuum of classifiers with zero error on the training data
Generalization is not good in this case, but it is better if a margin is introduced.
Using statistical learning theory, larger margins can be formally shown to yield better generalization performance.
Support vector machines
A classification method for 2-class problems; let the class labels be y in {-1, +1}.
Suppose that the classes are linearly separable.
We seek a linear separating classification function, i.e. y_n (w^T x_n + w_0) > 0 for all training data.
To obtain a margin we require y_n (w^T x_n + w_0) >= b.
Since this requirement is invariant to scaling, we set b = 1
–Otherwise a large b could be compensated for by multiplying w and w_0 by a large number.
We get two hyper-planes, w^T x + w_0 = +1 and w^T x + w_0 = -1, and the size of the margin between them is 2 / ||w||.
Points on the margin are called support vectors.
[Figures: the two classes plotted against x_1 (roundness) and x_2 (color), without and with a margin.]
Support vector machines: objective
Our objective is to find the classification function with maximum margin.
Maximizing the margin means minimizing ||w||^2, subject to the constraints that the function separates the training data: y_n (w^T x_n + w_0) >= 1 for all n.
Using the technique of Lagrange multipliers for constrained optimization, we can solve this problem by minimizing the following w.r.t. w and w_0, and maximizing w.r.t. the alpha parameters:
L(w, w_0, alpha) = (1/2) ||w||^2 - sum_n alpha_n [ y_n (w^T x_n + w_0) - 1 ], with alpha_n >= 0.
This is a quadratic function in w and w_0, and the optimal values are found by setting the derivatives to zero.
Support vector machines: optimization
Setting the derivatives to zero we find w = sum_n alpha_n y_n x_n and sum_n alpha_n y_n = 0.
–Note: w is a linear combination of the training data points.
If we substitute this into the Lagrangian function, we get the dual form, which has to be maximized w.r.t. the alpha variables, subject to the constraints alpha_n >= 0 and sum_n alpha_n y_n = 0:
L(alpha) = sum_n alpha_n - (1/2) sum_n sum_m alpha_n alpha_m y_n y_m x_n^T x_m
–Note: the training data appear only in the form of inner products here.
The dual function is quadratic in the alphas subject to linear constraints, and can be solved using standard quadratic solvers.
Only the data points precisely on the margin get a non-zero alpha.
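In practice the quadratic program is handled by standard packages; a minimal sketch with scikit-learn (assuming it is installed; the toy data and the choice of C are illustrative), where the non-zero dual variables correspond to the support vectors:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(+2, 1, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel='linear', C=1e3)   # a large C approximates the hard-margin SVM
clf.fit(X, y)

print(clf.support_)                 # indices of the support vectors (non-zero alphas)
print(clf.dual_coef_)               # alpha_n * y_n for the support vectors
print(clf.coef_, clf.intercept_)    # w and w0, with w a linear combination of the data

The parameter C belongs to the soft-margin formulation of the next slide; a large value approximates the separable case treated here.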
SVMs: non-separable classes
Before, we assumed that the classes were separable
–and found the w that separates the classes with maximum margin.
What if the classes are non-separable?
Allow points to violate the margin by an amount xi_n: y_n (w^T x_n + w_0) >= 1 - xi_n
–the xi_n are non-negative and known as "slack variables".
Replace the infeasible class-separation constraint with a penalty for large slacks.
New objective function: minimize (1/2) ||w||^2 + C sum_n xi_n.
Support vectors are either on the margin, or on the wrong side of the margin.
Summary: linear discriminant analysis
The two most widely used linear classifiers in practice:
–Logistic discriminant (supports more than 2 classes directly)
–Support vector machines (multi-class extensions recently developed)
In both cases the weight vector w is a linear combination of the data points.
This means that we only need the inner products between data points to calculate the linear functions.
–The function k that computes the inner products is called a kernel function.
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels
5) Classification techniques 2: discriminative methods, kernel functions
6) Decomposition of images: topic models
Motivation for generalized linear functions
Linear classifiers are attractive because of their simplicity
–Computationally efficient
–SVMs and logistic discriminant do not suffer from an objective function with multiple local optima.
The fact that only linear functions can be implemented limits their use to discriminating (approximately) linearly separable classes.
Two possible extensions:
1) Use a richer class of functions, not only linear ones
–In this case optimization is often much more difficult
2) Extend or replace the original set of features with non-linear functions of the original features
Generalized linear functions
The data vector x is replaced by a vector of m fixed functions of x: phi(x) = ( phi_1(x), …, phi_m(x) )^T.
The decision function f(x) = w^T phi(x) + w_0 is linear in phi(x) but non-linear in x.
This can turn the problem into a linearly separable one.
Example of an alternative feature space
Suppose that one class 'encapsulates' the other class: a linear classifier does not work very well...
Let's map our features to a new space spanned by phi(x) = ( x_1^2, sqrt(2) x_1 x_2, x_2^2 ).
A circle around the origin with radius r in the original space, x_1^2 + x_2^2 = r^2, is now described by a plane in the new space: z_1 + z_3 = r^2.
We can use the classification function f(x) = w^T phi(x) + w_0 with w = (1, 0, 1)^T and w_0 = -r^2.
The kernel function of our example
Let's compute the inner product in the new feature space we defined:
phi(x)^T phi(z) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (x^T z)^2.
Thus, we simply square the standard inner product, and do not need to explicitly map the points to the new feature space!
This becomes useful if the new feature space has many features.
The kernel function is a shortcut to compute inner products in feature space, without explicitly mapping the data to that space.
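A quick numerical check of this identity (the test points are arbitrary made-up values):

import numpy as np

def phi(x):
    # explicit map of a 2-D input to the new feature space
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # kernel: the squared standard inner product
    return (x @ z) ** 2

x = np.array([0.7, -1.2])
z = np.array([2.0, 0.5])
print(phi(x) @ phi(z), k(x, z))   # both give the same value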
Example: non-linear support vector machine
An example where the classes are separable, but not linearly. A Gaussian kernel is used, k(x, x') = exp( -||x - x'||^2 / (2 sigma^2) ).
This corresponds to using an infinite number of non-linear features.
[Figure: the resulting non-linear decision surface.]
Nonlinear support vector machines
We saw that to find the parameters of an SVM we need the data only in the form of inner products.
We can replace these directly by evaluations of the kernel function: x_n^T x_m is replaced by k(x_n, x_m).
For the parameters we thus find w = sum_n alpha_n y_n phi(x_n).
We compute the classification function for a new sample x as f(x) = sum_n alpha_n y_n k(x_n, x) + w_0.
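Written as code, the kernelized decision function might look like this (a sketch only; the support set, alphas and w0 would come from the dual optimization and are placeholder values here):

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def decision_function(x, support_X, support_y, alphas, w0, kernel=gaussian_kernel):
    # f(x) = sum_n alpha_n y_n k(x_n, x) + w0 : inner products replaced by kernel evaluations
    return sum(a * yn * kernel(xn, x) for a, yn, xn in zip(alphas, support_y, support_X)) + w0

# dummy support set, for illustration only
sX = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
sy = [+1, -1]
print(decision_function(np.array([0.2, 0.1]), sX, sy, alphas=[0.5, 0.5], w0=0.0))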
Nonlinear logistic discriminant
A similar procedure can also be applied to the logistic discriminant.
Again we express w directly as a linear combination of the transformed training data: w = sum_n alpha_n phi(x_n).
We then optimize the log-likelihood of correct classification of the training data as a function of the alpha variables.
For the multi-class version we obtain a linear function per class, and class membership probability estimates given by the soft-max function.
The kernel function
Starting with an alternative representation phi(x) in some feature space, we can find the corresponding kernel function as the 'program':
1. Map x and y to the new feature space.
2. Compute the inner product between the mapped points.
A kernel is positive definite if the NxN matrix K containing all pairwise evaluations on N arbitrary points is positive definite, i.e. for non-zero w: w^T K w > 0.
If a kernel computes inner products then it is positive definite.
Mercer's theorem: if k is positive definite, then there exists a feature space in which k computes the inner products.
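A rough empirical check of this condition (necessary but only on a finite sample of points; the kernel and the random points are arbitrary choices): build the Gram matrix and look at its smallest eigenvalue.

import numpy as np

def kernel_matrix(points, k):
    n = len(points)
    return np.array([[k(points[i], points[j]) for j in range(n)] for i in range(n)])

k = lambda x, z: (1.0 + x @ z) ** 2          # polynomial kernel of degree 2
pts = [np.random.randn(3) for _ in range(10)]
K = kernel_matrix(pts, k)
print(np.linalg.eigvalsh(K).min())           # should not be (significantly) negative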
The “kernel trick” has many applications We used kernels to compute inner-products to find non-linear SVMs, and logistic discriminant models The main advantage of the “kernel trick” is that we can use feature spaces with a vast number of “derived” features without being confronted with the computational burden of explicitly mapping the data points to this space. The same trick can also be used for many other inner-product based methods for classification, regression, clustering, and dimension reduction. –K-means –Mixture of Gaussians –Principal component analysis
Image categorization with Bags-of-Words
A range of kernel functions are used in categorization tasks. Popular examples include
–The chi-square RBF kernel over bag-of-words histograms
–The pyramid-match kernel
–... see the slides of Cordelia Schmid's lecture 2 ...
Any interesting image similarity measure that can be shown to yield a positive definite kernel can be plugged into SVMs or logistic discriminant.
Kernels can be combined by
–Summation
–Products
–Exponentiation
–Many other ways
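For instance, one common form of the chi-square RBF kernel over bag-of-words histograms is k(h, h') = exp( -gamma * sum_i (h_i - h'_i)^2 / (h_i + h'_i) ), with constant factors absorbed into gamma. A small sketch (gamma and the histograms are placeholder values):

import numpy as np

def chi2_rbf_kernel(h1, h2, gamma=1.0, eps=1e-10):
    # chi-square distance between two normalized histograms, exponentiated RBF-style
    d = np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return np.exp(-gamma * d)

h1 = np.array([0.2, 0.5, 0.3])   # toy visual-word histograms
h2 = np.array([0.1, 0.6, 0.3])
print(chi2_rbf_kernel(h1, h2))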
Summary of classification methods
Generative probabilistic methods
–Model the density of inputs x from each class, p(x|y): parametric, semi-parametric, and non-parametric models
–Classification using Bayes' rule
Discriminative (probabilistic) methods
–Directly estimate a decision function f(x) for classification: logistic discriminant, support vector machines
–Linear methods are easily extended by using kernel functions
Hybrid methods
–Model the training input data with a probabilistic model
–Use the probabilistic model to generate features for a discriminative model
–Example: Fisher kernels combined with a linear classifier
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models
Topic models for image decomposition
So far we mainly looked at clustering and classifying whole images.
What about localizing categories in an image, i.e. decomposing a scene into the different elements that constitute it?
How can we decompose this beach scene into sky, water, sand and vegetation?
Topic models for image decomposition
How can we decompose this beach scene (sky, water, sand, vegetation)?
We need to model the appearance of the different categories.
It is difficult to use shape models
–Some categories do not have a particular shape: water, sky, grass, etc.
–Other categories occlude each other: buildings occluded by cars, trees, people.
Typically a relatively small number of categories appears per image
–Several tens, not all the thousands of categories we might want to model.
Topic Models for text documents
Model each text as a mixture of different topics
–A fixed set of topics is used (categories, for images)
–The mix of topics is specific to each text (the mix of categories)
Words are modelled by
–Selecting a topic given the document-specific mix, i.e. using p(topic | document)
–Selecting a word from the chosen topic, i.e. using p(word | topic)
Words are no longer independent
–Seeing some words, you can guess the topics and so predict other words
Topic Models for text documents
Words are modelled by
–Selecting a topic given the document-specific mix using p(topic | document)
–Selecting a word from the chosen topic using p(word | topic)
Given these distributions we can assign words to topics using p(topic | word, document)
Topic Models for text documents Words no longer independent –Seeing some words, you can guess topics to predict other words Polysemous words can be disambiguated using the context –“Bank” can refer to financial institution or side of a river –Seeing many financial terms in the text, we infer that the “finance” topic is strongly represented in the text, and that “Bank” refers to the first meaning.
Topic Models for images
Each image has its own specific mix of visual topics / categories; all documents mix the same set of topics.
Image regions are extracted over a dense grid and at multiple scales
–Each image region is assigned to a visual word, e.g. using k-means clustering of region descriptors such as SIFT
–Possibly also use color information, and the position of the region in the image (top, bottom, …)
Example of image regions assigned to different topics: clouds, trees, sky, mountain.
Probabilistic Latent Semantic Analysis (PLSA)
Some notation:
–w: (visual) words
–t: topics
–d: documents / images
–p(t|d): document-specific topic mix
–p(w|t): topic-specific distribution over words
The distribution over words for a document, p(w|d), is a mixture of the topic distributions: p(w|d) = sum_t p(w|t) p(t|d).
To sample a word:
–Sample a topic from p(t|d)
–Sample a word from that topic using p(w|t)
We can now infer to which topic a word belongs using Bayes' rule: p(t|w,d) = p(w|t) p(t|d) / sum_t' p(w|t') p(t'|d).
Interpreting a new image with PLSA
Suppose that we have learned the topic distributions p(word|topic); these are assumed to be known.
We want to assign image regions to topics
–The topic mix p(topic|document) for the new image is an unknown parameter that needs to be estimated for the new document.
Find the topic mix p(t|d) using maximum likelihood with the EM algorithm
–Associate a topic variable t_n with each visual word w_n
Remember the EM algorithm
–In the E-step we compute the posterior distribution over the hidden variables: here, the distribution over topics for each visual word
–In the M-step we maximize the expected joint log-likelihood of the observed and hidden variables given the current posteriors: here, the likelihood of the topic and word variables
Interpreting a new image with PLSA
In the E-step we compute the posterior distribution over the hidden variables, i.e. the distribution over topics for each visual word:
q_wt = p(t|w,d) = p(w|t) p(t|d) / sum_t' p(w|t') p(t'|d),
where p(t|d) is the probability of topic t given the document (to be estimated for the new document) and p(w|t) is the probability of word w given topic t (assumed known).
The posterior is the same for two occurrences of the same word, so we use q_wt to denote the posterior probability of topic t for word w.
Interpreting a new image with PLSA
In the M-step we maximize the expected joint log-likelihood of the observed and hidden variables given the current posteriors, i.e. the likelihood of the topic and word variables.
Using n_wd to denote how often word w appears in the document, the expected log-likelihood can be rewritten as
sum_w n_wd sum_t q_wt [ log p(w|t) + log p(t|d) ].
Maximizing this with respect to the topic mixing parameters, we obtain
p(t|d) = sum_w n_wd q_wt / sum_w n_wd,
i.e. the expected number of words assigned to the topic divided by the total number of words in the document.
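A minimal sketch of these two steps for a single new document, assuming p(w|t) has already been learned (array shapes, the uniform initialisation and the iteration count are illustrative assumptions):

import numpy as np

def plsa_infer_topic_mix(n_wd, p_w_t, n_iter=100):
    # n_wd: (W,) word counts in the document; p_w_t: (W, T) known p(word|topic)
    W, T = p_w_t.shape
    p_t_d = np.full(T, 1.0 / T)                     # initialise the topic mix uniformly
    for _ in range(n_iter):
        # E-step: q_wt = p(t|w,d) proportional to p(w|t) p(t|d)
        q = p_w_t * p_t_d + 1e-12                   # small constant avoids division by zero
        q /= q.sum(axis=1, keepdims=True)
        # M-step: p(t|d) = expected number of words assigned to t / total number of words
        p_t_d = (n_wd[:, None] * q).sum(axis=0) / n_wd.sum()
    return p_t_d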
Interpreting a new image with PLSA
Repeat the EM algorithm until convergence of p(t|d).
After convergence, read off the word-to-topic assignments from p(t|w,d).
Example images decomposed over 22 topics [Verbeek & Triggs, CVPR'07].
Learning the topics from labeled images
As training data we have images labeled with the categories they contain; it is not known which image regions belong to each category.
Learn the PLSA model, forcing each image to mix only the labeled topics
–Example: an image labeled as {building, grass, sky, tree}
–Image regions are assigned to topics during learning with the EM algorithm
–E-step: compute p(t|w,d) as before, but set the posterior to zero for topics not appearing in the labeling
–M-step: as before for the topic mix p(t|d) in each image; topics not in the labeling get weight zero
The distribution over words for each topic is estimated as
p(w|t) = sum_d n_wd q_wtd / sum_w' sum_d n_w'd q_w'td,
i.e. the expected number of times w is assigned to topic t divided by the total number of words assigned to t.
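The labels only change the E-step. A sketch of the masked posterior for one document and of the topic-word update over all documents (the function names, the per-document organisation of q, and allowed_topics as a list of topic indices are all hypothetical choices, not from the paper):

import numpy as np

def masked_e_step(p_w_t, p_t_d, allowed_topics):
    # posterior p(t|w,d) with topics outside the image labels forced to zero
    q = p_w_t * p_t_d                                   # (W, T), for one document
    mask = np.zeros(q.shape[1])
    mask[allowed_topics] = 1.0
    q = q * mask + 1e-12 * mask                         # zero out disallowed topics
    return q / q.sum(axis=1, keepdims=True)

def update_topic_word(n_wd_all, q_all):
    # p(w|t): expected number of times word w is assigned to topic t over all documents,
    # divided by the total number of words assigned to topic t
    counts = sum(n[:, None] * q for n, q in zip(n_wd_all, q_all))   # (W, T)
    return counts / counts.sum(axis=0, keepdims=True)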
Interpreting new images with PLSA
More example images taken from [Verbeek & Triggs, CVPR'07].
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …
Next week: last course by Cordelia Schmid
Exam: 9h-12h, Friday January 29, 2010, room D207 (Ensimag)
–Not allowed to bring any material
–Prepare from slides + papers
–All material available on