Advanced Topics in Computer and Human Vision


Kernel-Based Methods. Presented by Jason Friedman and Lena Gorelick. Advanced Topics in Computer and Human Vision, Spring 2003.

Agenda: Structural Risk Minimization (SRM); Support Vector Machines (SVM); Feature Space vs. Input Space; Kernel PCA; Kernel Fisher Discriminant Analysis (KFDA)


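Structural Risk Minimization (SRM) Definition: Training set with l observations. Each observation consists of a pair: a vector $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$, and a label $y_i \in \{-1, +1\}$. Example: a 16x16 image gives a vector with n = 256 components.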

Structural Risk Minimization (SRM) The task: “Generalization”: find a mapping $x \mapsto f(x, \alpha)$, i.e. given a previously unseen x, find a suitable y. Assumption: training and test data are drawn from the same probability distribution p(x,y), i.e. (x,y) is “similar” to $(x_1,y_1), \dots, (x_l,y_l)$.

Structural Risk Minimization (SRM) - Learning Machine Definition: A learning machine is a family of functions $\{f(\alpha)\}$, where $\alpha$ is a set of parameters. For a task of learning two classes, $f(x,\alpha) \in \{-1, 1\}\ \forall x, \alpha$. Example: the class of oriented lines in $\mathbb{R}^2$: $\mathrm{sign}(\alpha_1 x_1 + \alpha_2 x_2 + \alpha_3)$. Another example of a class of functions: $\sin(\alpha x)$.

Structural Risk Minimization (SRM) - Capacity vs. Generalization Definition: The capacity of a learning machine measures its ability to learn any training set without error. Too much capacity leads to overfitting (e.g. rejecting a new tree because it does not have the same # of leaves as the training trees). Too little capacity leads to underfitting (e.g. accepting anything as a tree if its color is green).

Structural Risk Minimization (SRM) - Capacity vs. Generalization For small sample sizes, overfitting or underfitting might occur. The best generalization performance is achieved when the right balance is struck between the accuracy on the particular training set and the capacity of the machine.

Structural Risk Minimization (SRM) – Capacity vs. Generalization Solution: Restrict the complexity (capacity) of the function class. Intuition: “Simple” function that explains most of the data is preferable to a “complex” one.

Structural Risk Minimization (SRM) - VC dimension What is a “simple”/“complex” function? Definition: Given l points (which can be labeled in $2^l$ ways), the set of points is shattered by the function class $\{f(\alpha)\}$ if for each labeling there is a function which correctly assigns those labels.

Structural Risk Minimization (SRM) - VC dimension Three points can be shattered by oriented lines in $\mathbb{R}^2$: there exists a set of 3 points that can be shattered, and there does not exist a set of 4 points that can be shattered. Definition: The VC dimension of $\{f(\alpha)\}$ is the maximum number of points that can be shattered by $\{f(\alpha)\}$; it is a measure of capacity.
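The shattering definition can be checked by brute force on small point sets. The sketch below is only an illustration: it uses a perceptron with a capped number of epochs as a stand-in separability test (an assumption, not part of the slides), so a `False` result means "no separating line found within the budget". The point sets `three` and `square` are hypothetical examples, and the second one only shows that this particular set of 4 points is not shattered.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if an oriented line realizing the labels is found."""
    w = [0.0, 0.0, 0.0]  # (w1, w2, bias)
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:
                # Standard perceptron update on a misclassified point.
                w[0] += y * x1
                w[1] += y * x2
                w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separator found within the epoch budget

def is_shattered(points):
    """Try every one of the 2^l labelings of the given points."""
    return all(
        linearly_separable(points, labels)
        for labels in product((-1, 1), repeat=len(points))
    )

three = [(0, 0), (1, 0), (0, 1)]           # hypothetical 3-point set
square = [(0, 0), (1, 0), (0, 1), (1, 1)]  # hypothetical 4-point set

print(is_shattered(three))
print(is_shattered(square))
```

The XOR-style labelings of the square are the ones that no oriented line can realize.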

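Structural Risk Minimization (SRM) - VC dimension Theorem: The VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is n+1. However, a low # of parameters does not necessarily imply a low VC dimension. Example: $f(x,\alpha) = \theta(\sin(\alpha x))$, where $\theta(y) = 1$ for $y > 0$ and $\theta(y) = -1$ otherwise, has a single parameter but infinite VC dimension.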

Structural Risk Minimization (SRM) - Bounds Definition: Actual risk $R(\alpha) = \int \frac{1}{2} |y - f(x,\alpha)| \, dp(x,y)$. The goal is to minimize $R(\alpha)$. But we can't measure the actual risk, since we don't know p(x,y).

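Structural Risk Minimization (SRM) - Bounds Definition: Empirical risk $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i,\alpha)|$. $R_{emp}(\alpha) \to R(\alpha)$ as $l \to \infty$. But for a small training set, deviations might occur.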
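As a small illustration of the empirical risk, the sketch below computes $R_{emp} = \frac{1}{2l}\sum_i |y_i - f(x_i)|$ for labels in {-1, +1}, which is simply the fraction of misclassified training points. The classifier `sign_line` and the four samples are hypothetical.

```python
def empirical_risk(f, samples):
    """R_emp = 1/(2l) * sum_i |y_i - f(x_i)| for labels in {-1, +1}."""
    l = len(samples)
    return sum(abs(y - f(x)) for x, y in samples) / (2 * l)

def sign_line(x):
    """Hypothetical classifier: sign(x1 - x2)."""
    return 1 if x[0] - x[1] > 0 else -1

# Hypothetical training set of (x, y) pairs; the last point is misclassified.
samples = [((2, 1), 1), ((0, 3), -1), ((1, 0), 1), ((0, 1), 1)]

print(empirical_risk(sign_line, samples))
```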

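Structural Risk Minimization (SRM) - Bounds Risk bound: with probability $(1-\eta)$, $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}$, where h is the VC dimension of the function class and the square-root term is the confidence term. Not valid for infinite VC dimension. Note: the bound is independent of p(x,y).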


Structural Risk Minimization (SRM) - Principled Method A principled method for choosing a learning machine for a given task:

SRM: (1) Divide the class of functions into nested subsets. (2) Either calculate h for each subset, or get a bound on it. (3) Train each subset to achieve minimal empirical error. (4) Choose the subset with the minimal risk bound. [Figure: risk bound vs. complexity.] Notice that some learning machines perform very well even though their risk bound is trivial.
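The selection step of SRM can be sketched in a few lines. This assumes the VC risk bound in the form given in Burges' SVM tutorial, $R_{emp} + \sqrt{(h(\log(2l/h)+1) - \log(\eta/4))/l}$; the candidate subsets and their empirical risks are made-up numbers for illustration only.

```python
import math

def risk_bound(emp_risk, h, l, eta=0.05):
    """Empirical risk plus the VC confidence term (holds with prob. 1 - eta)."""
    confidence = math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)
    return emp_risk + confidence

l = 1000  # hypothetical training set size

# Hypothetical nested subsets: (VC dimension h, minimal empirical risk reached).
candidates = [(2, 0.30), (10, 0.10), (200, 0.01)]

# Choose the subset whose risk bound is smallest.
best_h, best_emp = min(candidates, key=lambda c: risk_bound(c[1], c[0], l))
print(best_h, round(risk_bound(best_emp, best_h, l), 3))
```

The richest subset has the lowest empirical risk, but its confidence term dominates, so a subset of intermediate capacity is selected.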


Support Vector Machines (SVM) Currently the “en vogue” approach to classification. Successful applications in bioinformatics, text, handwriting recognition, and image processing. Introduced by Boser, Guyon and Vapnik, 1992. SVMs are a particular instance of Kernel Machines.

Linear SVM – Separable case Two given classes are linearly separable

Linear SVM - definitions Separating hyperplane H: $w \cdot x + b = 0$. w is normal to H. $|b|/\|w\|$ is the perpendicular distance from H to the origin. $d_+$ ($d_-$) is the shortest distance from H to the closest positive (negative) point.


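Linear SVM - definitions If H is a separating hyperplane, then $w \cdot x_i + b > 0$ for $y_i = +1$ and $w \cdot x_i + b < 0$ for $y_i = -1$. No training points fall between H1 and H2, the hyperplanes parallel to H that pass through the closest positive and the closest negative points.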

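Linear SVM - definitions By scaling w and b, we can require that $x_i \cdot w + b \ge +1$ for $y_i = +1$ and $x_i \cdot w + b \le -1$ for $y_i = -1$. Or more simply: $y_i (x_i \cdot w + b) - 1 \ge 0\ \forall i$. Equality holds $\Leftrightarrow$ $x_i$ lies on H1: $w \cdot x + b = +1$ or on H2: $w \cdot x + b = -1$. The training points for which the equality holds are called “support vectors”, and their removal would change the solution.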

Linear SVM - definitions Note: w is no longer a unit vector Margin is now 2 / ||w|| Find hyperplane with the largest margin.
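To make the margin 2/||w|| concrete, the sketch below compares two hyperplanes that both satisfy the scaled constraints $y_i(w \cdot x_i + b) \ge 1$ on a hypothetical four-point data set. The names `wide` and `narrow` and the data are assumptions for illustration.

```python
import math

def margin(w):
    """Distance between H1 and H2 for a canonically scaled hyperplane."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, samples):
    """Check y_i (w . x_i + b) >= 1 for every training point."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in samples
    )

# Hypothetical linearly separable data.
samples = [((2, 0), 1), ((3, 1), 1), ((-2, 0), -1), ((-3, -1), -1)]

wide = ([0.5, 0.0], 0.0)    # H1 and H2 pass through (2, 0) and (-2, 0)
narrow = ([1.0, 0.0], 0.0)  # also feasible, but with a larger ||w||

for w, b in (wide, narrow):
    print(satisfies_constraints(w, b, samples), margin(w))
```

Both hyperplanes are feasible; the one with the smaller ||w|| has the larger margin, which is why the SVM minimizes ||w||.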

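Linear SVM - maximizing margin Maximizing the margin $\Leftrightarrow$ minimizing $\|w\|^2$ subject to the above constraints $\Rightarrow$ more room for unseen points to fall $\Rightarrow$ restricts the capacity: the VC-dimension bound for such hyperplanes grows with $R^2\|w\|^2$, where R is the radius of the smallest ball around the data.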

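Linear SVM - Constrained Optimization Introduce Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$. “Primal” formulation: $L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$. Minimize $L_P$ with respect to w and b. Require $\alpha_i \ge 0$.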

Linear SVM - Constrained Optimization The objective function is quadratic (convex). Each linear constraint defines a convex set. The intersection of convex sets is a convex set $\Rightarrow$ we can formulate the “Wolfe Dual” problem. The dual problem is easier to solve.

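Linear SVM - Constrained Optimization The Solution: Maximize $L_P$ with respect to $\alpha_i$. Require that the gradient of $L_P$ with respect to w and b vanishes: $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. Substitute into $L_P$ to give: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$. Maximize $L_D$ with respect to $\alpha_i$, subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.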
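The dual can be checked by hand on a two-point problem. The sketch below evaluates the standard dual objective $L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i \cdot x_j$ and recovers $w = \sum_i \alpha_i y_i x_i$. The data and the multipliers are hypothetical; for this toy case the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, and $L_D = 2a - 2a^2$ is maximal at $a = 0.5$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alphas, xs, ys):
    """L_D = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)."""
    n = len(xs)
    quad = sum(
        alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
        for i in range(n) for j in range(n)
    )
    return sum(alphas) - 0.5 * quad

def weight_vector(alphas, xs, ys):
    """w = sum_i a_i y_i x_i."""
    dim = len(xs[0])
    return [sum(a * y * x[k] for a, y, x in zip(alphas, ys, xs)) for k in range(dim)]

# Hypothetical two-point training set.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]  # satisfies sum_i alpha_i y_i = 0

w = weight_vector(alphas, xs, ys)
print(w, dual_objective(alphas, xs, ys), 0.5 * dot(w, w))
```

At the optimum the dual value equals the primal value $\frac{1}{2}\|w\|^2$, and the resulting margin 2/||w|| = 2 is exactly the distance between the two points.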

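Linear SVM - Constrained Optimization Karush Kuhn Tucker conditions: $w - \sum_i \alpha_i y_i x_i = 0$; $\sum_i \alpha_i y_i = 0$; $y_i (x_i \cdot w + b) - 1 \ge 0$; $\alpha_i \ge 0$; $\alpha_i \left( y_i (x_i \cdot w + b) - 1 \right) = 0$. The Karush Kuhn Tucker conditions are necessary and sufficient for $\alpha_i$, w and b to be a solution.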

Linear SVM - Constrained Optimization Using the Karush Kuhn Tucker conditions: if $\alpha_i > 0$ then $x_i$ lies either on H1 or H2 $\Rightarrow$ the solution is sparse in $\alpha_i$. Those training points are called “support vectors”; their removal would change the solution. Removing non-support-vector points will not change the solution.

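SVM - Test Phase Given the unseen sample x, we take the class of x to be $\mathrm{sign}(w \cdot x + b) = \mathrm{sign}\left( \sum_i \alpha_i y_i \, x_i \cdot x + b \right)$, where the sum runs over the support vectors.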

Linear SVM - Non-separable case The separable case corresponds to an empirical risk of zero. For noisy data this might not be the minimum of the actual risk (overfitting). Moreover, the separable formulation has no feasible solution in the non-separable case.

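Linear SVM - Non-separable case Relax the constraints by introducing positive slack variables $\xi_i \ge 0$: $x_i \cdot w + b \ge +1 - \xi_i$ for $y_i = +1$ and $x_i \cdot w + b \le -1 + \xi_i$ for $y_i = -1$. $\sum_i \xi_i$ is an upper bound on the number of training errors.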

Linear SVM - Non-separable case Assign an extra cost to errors: minimize $\frac{1}{2}\|w\|^2 + C \left( \sum_i \xi_i \right)^k$, where C is a penalty parameter chosen by the user. For k=1,2 it is still a quadratic programming problem.
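A short sketch of the soft-margin quantities, assuming the usual definitions $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$ and objective $\frac{1}{2}\|w\|^2 + C(\sum_i \xi_i)^k$. The hyperplane, the data, and the value of C are hypothetical; the third point is deliberately placed on the wrong side.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, samples):
    """xi_i = max(0, 1 - y_i (w . x_i + b)): how far each point violates its margin."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in samples]

def soft_margin_objective(w, b, samples, C, k=1):
    """1/2 ||w||^2 + C (sum_i xi_i)^k."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, samples)) ** k

# Hypothetical data: the third point has label +1 but lies on the negative side.
samples = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((-0.5, 0.0), 1)]
w, b = [0.5, 0.0], 0.0

xi = slacks(w, b, samples)
errors = sum(1 for s in xi if s > 1)  # a point is misclassified only if xi_i > 1
print(xi, errors, soft_margin_objective(w, b, samples, C=10.0))
```

The sum of the slacks (1.25) upper-bounds the number of training errors (1), as stated on the previous slide.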

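Linear SVM - Non-separable case Lagrange formulation again (for k=1): $L_P = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \{ y_i (x_i \cdot w + b) - 1 + \xi_i \} - \sum_i \mu_i \xi_i$, where the $\mu_i$ are the Lagrange multipliers enforcing $\xi_i \ge 0$. “Wolfe Dual” problem - maximize: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ subject to: $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$. The solution: $w = \sum_i \alpha_i y_i x_i$.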

Linear SVM – Non-separable case Karush Kuhn Tucker conditions:

Linear SVM - Non-separable case Using the Karush Kuhn Tucker conditions: the solution is sparse in $\alpha_i$.

Nonlinear SVM Non linear decision function might be needed

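Nonlinear SVM - Feature Space Map the data to a high-dimensional (possibly infinite) feature space F: $\Phi: \mathbb{R}^n \to F$, and consider the same algorithm in F instead of the input space. The solution depends only on dot products $\Phi(x_i) \cdot \Phi(x_j)$. For infinite F the dot product is “difficult”, but if there were a function $k(x_i,x_j)$ s.t. $k(x_i,x_j) = \Phi(x_i) \cdot \Phi(x_j)$ $\Rightarrow$ no need to know $\Phi$ explicitly.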

Nonlinear SVM – Toy example Input Space Feature Space

Nonlinear SVM – Avoid the Curse Curse of dimensionality: The difficulty of estimating a problem increases drastically with the dimension But! Learning in F may be simpler if one uses low complexity function class (hyperplanes)

Nonlinear SVM - Kernel Functions Kernel functions exist! For some mappings and some feature spaces they effectively compute dot products in feature space. Having a kernel, we can use it without explicitly calculating, or even knowing, what $\Phi$ is and what the dimension of F is. Given a kernel, $\Phi$ and F are not unique. The F with the smallest dimension is called the minimal embedding space.
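A quick numerical check of the kernel idea: for 2-D inputs, the homogeneous polynomial kernel $(x \cdot y)^2$ equals an ordinary dot product after the explicit map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. This particular map is a standard textbook choice (one of several possible), and the two test vectors are arbitrary.

```python
import math

def poly_kernel(x, y, p=2):
    """Homogeneous polynomial kernel (x . y)^p for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** p

def phi(x):
    """One explicit feature map for p = 2: R^2 -> R^3."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, -1.0)

in_input_space = poly_kernel(x, y)
in_feature_space = sum(a * b for a, b in zip(phi(x), phi(y)))
print(in_input_space, in_feature_space)
```

Both numbers agree up to floating-point rounding, although the kernel never forms the 3-D vectors.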

Nonlinear SVM - Kernel Functions Given a function, how do we know whether it is a kernel or not? Mercer's condition: there exists a pair $\{\Phi, F\}$ such that $k(x,y) = \Phi(x) \cdot \Phi(y)$ iff for any g(x) s.t. $\int g(x)^2 \, dx$ is finite, $\int k(x,y)\, g(x)\, g(y) \, dx \, dy \ge 0$. Mercer's condition is satisfied for polynomial kernels $k(x_i,x_j) = (x_i \cdot x_j)^p$.

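Nonlinear SVM - Kernel Functions Formulation of the algorithm in terms of kernels: maximize $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i,x_j)$; the decision function is $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i \, k(x_i,x) + b \right)$.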
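In kernel form, the decision function is $\mathrm{sign}(\sum_i \alpha_i y_i k(x_i, x) + b)$, summed over the support vectors. The sketch below evaluates it with the polynomial kernel $(x \cdot y)^2$. The support vectors, multipliers and bias are hand-picked for illustration (not the output of a trained SVM); with these values the function reduces to $\mathrm{sign}(4 x_1 x_2)$, an XOR-like rule that no linear classifier in input space can express.

```python
def poly_kernel(x, y, p=2):
    """Homogeneous polynomial kernel (x . y)^p."""
    return sum(a * b for a, b in zip(x, y)) ** p

def decide(x, support_vectors, alphas, labels, b, kernel):
    """sign( sum_i alpha_i y_i k(x_i, x) + b )."""
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if s > 0 else -1

# Hand-picked (hypothetical) expansion; note sum_i alpha_i y_i = 0.
support_vectors = [(1.0, 1.0), (1.0, -1.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

for x in [(2.0, 3.0), (-2.0, 3.0), (-1.0, -1.0), (1.0, -2.0)]:
    print(x, decide(x, support_vectors, alphas, labels, b, poly_kernel))
```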

Nonlinear SVM- Kernel Functions Kernels frequently used:
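The list from this slide did not survive in the transcript. As a stand-in, the sketch below implements three kernels that are commonly used in the SVM literature (polynomial, Gaussian RBF, sigmoid); the exact parameterizations and the parameter names `p`, `c`, `sigma`, `kappa`, `delta` are assumptions, not necessarily the forms shown in the original presentation.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, p=2, c=1.0):
    """(x . y + c)^p"""
    return (dot(x, y) + c) ** p

def rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """tanh(kappa (x . y) - delta); satisfies Mercer's condition only for some parameters."""
    return math.tanh(kappa * dot(x, y) - delta)

x, y = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```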

Nonlinear SVM - Feature Space Example: polynomial kernel with d=256, p=4 $\Rightarrow$ dim(F) = 183,181,376, where F is the smallest embedding feature space. Usually: high-dimensional feature space $\Rightarrow$ bad generalization performance. A hyperplane {w,b} requires dim(F) + 1 parameters, but solving the SVM means adjusting only l+1 parameters. This is the main gain.
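The dimension quoted above can be reproduced by counting the monomials of degree p in d variables, $\binom{d+p-1}{p}$, which is the dimension of the minimal embedding space of the homogeneous polynomial kernel $(x \cdot y)^p$. A one-function sketch (requires Python 3.8+ for `math.comb`):

```python
import math

def poly_feature_dim(d, p):
    """Number of distinct degree-p monomials in d variables."""
    return math.comb(d + p - 1, p)

print(poly_feature_dim(256, 4))  # d = 256 pixels, p = 4
print(poly_feature_dim(2, 2))    # the 2-D, degree-2 case: x1^2, x1*x2, x2^2
```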

SVM - Solution $L_D$ is convex $\Rightarrow$ the solution is a global optimum. Two types of non-uniqueness: (1) {w,b} is not unique; (2) {w,b} is unique, but the set $\{\alpha_i\}$ is not. Prefer the set with fewer support vectors (sparse).

Nonlinear SVM-Toy Example


Methods of solution: quadratic programming packages; chunking; decomposition methods; sequential minimal optimization.

Applications - SVM Extracting Support Data for a Given Task Bernhard Schölkopf Chris Burges Vladimir Vapnik Proceedings, First International Conference on Knowledge Discovery & Data Mining. 1995

Applications - SVM Input: USPS handwritten digits (16x16 images). Training set: 7300. Testing set: 2000. Constructed: 10 class/non-class SVM classifiers. Take the class with maximal output.
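The "take the class with maximal output" rule can be sketched as follows. The ten scoring functions here are simple stand-ins on scalar inputs, not trained SVMs; in the experiment described above each would be the real-valued output of one class/non-class SVM.

```python
def classify(x, classifiers):
    """Return the index of the class/non-class classifier with maximal output."""
    outputs = [f(x) for f in classifiers]
    return max(range(len(outputs)), key=lambda c: outputs[c])

# Stand-in scoring functions (NOT trained SVMs): classifier c prefers inputs near c.
classifiers = [lambda x, c=c: 1.0 - abs(x - c) for c in range(10)]

print(classify(6.8, classifiers))
print(classify(0.2, classifiers))
```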

Applications - SVM Three different types of classifiers for each digit:

Applications - SVM The similar performance of the 3 types of kernels means that it is more important to control the capacity than to choose the type of decision function.

Applications - SVM Predicting the Optimal Decision Functions - SRM If the amount of data is very limited, we would prefer to predict the optimal decision function without a validation set.



Feature Space vs. Input Space Suppose the solution is a linear combination in feature space: $w = \sum_i \alpha_i \Phi(x_i)$. What does this mean in input space? We cannot generally say that each point in feature space has a preimage.

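Feature Space vs. Input Space If there exists z such that $\Phi(z) = w = \sum_i \alpha_i \Phi(x_i)$, and k is an invertible function $f_k$ of $(x \cdot y)$, then we can compute z as $z = \sum_{j=1}^{N} f_k^{-1}\left( \sum_i \alpha_i k(x_i, e_j) \right) e_j$, where $\{e_1, \dots, e_N\}$ is an orthonormal basis of the input space.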

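Feature Space vs. Input Space The polynomial kernel $k(x,y) = (x \cdot y)^p$ is invertible when p is odd. Then the preimage of $w = \sum_i \alpha_i \Phi(x_i)$ is given by $z = \sum_{j=1}^{N} \sqrt[p]{\sum_i \alpha_i (x_i \cdot e_j)^p} \; e_j$.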
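A minimal sketch of the exact pre-image formula $z = \sum_j f_k^{-1}(\sum_i \alpha_i k(x_i, e_j))\, e_j$, assuming the homogeneous polynomial kernel $(x \cdot y)^p$ with odd p, so that $f_k(t) = t^p$ has the real inverse $t \mapsto \mathrm{sign}(t)|t|^{1/p}$. The formula is only valid when an exact pre-image exists; the test case uses the trivial expansion $w = \Phi(z_0)$, for which the pre-image must be $z_0$ itself. All names are illustrative.

```python
import math

def signed_root(v, p):
    """Real inverse of t -> t**p for odd p (keeps the sign of v)."""
    return math.copysign(abs(v) ** (1.0 / p), v)

def kernel(x, y, p=3):
    """Homogeneous polynomial kernel (x . y)^p with odd p."""
    return sum(a * b for a, b in zip(x, y)) ** p

def preimage(alphas, xs, p=3):
    """z_j = f_k^{-1}( sum_i alpha_i k(x_i, e_j) ) over the standard basis e_j."""
    dim = len(xs[0])
    z = []
    for j in range(dim):
        e_j = [1.0 if k == j else 0.0 for k in range(dim)]
        s = sum(a * kernel(x, e_j, p) for a, x in zip(alphas, xs))
        z.append(signed_root(s, p))
    return z

z0 = (2.0, -1.0, 0.5)
z = preimage([1.0], [z0])  # w = Phi(z0), so the pre-image should be z0
print(z)
```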

Feature Space vs. Input Space In general, we cannot find the exact preimage. Look for an approximation z such that $\| w - \Phi(z) \|^2$ is small.


PCA Regular PCA: Find the direction u s.t. projecting n points in d dimensions onto u gives the largest variance. u is the leading eigenvector of the covariance matrix C: $Cu = \lambda u$.
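For a concrete picture of regular PCA, the sketch below builds the covariance matrix of a small hypothetical 2-D data set and finds its leading eigenvector by power iteration. Power iteration is used here only to keep the example dependency-free; any eigen-solver would do.

```python
import math

def covariance(points):
    """Covariance matrix C = 1/n * sum_i (x_i - mean)(x_i - mean)^T."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    return [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
            for a in range(d)]

def leading_eigenpair(C, iterations=200):
    """Power iteration: returns (lambda, u) with C u ~= lambda u and ||u|| = 1."""
    d = len(C)
    u = [1.0] * d
    lam = 0.0
    for _ in range(iterations):
        v = [sum(C[a][b] * u[b] for b in range(d)) for a in range(d)]
        lam = math.sqrt(sum(x * x for x in v))
        u = [x / lam for x in v]
    return lam, u

# Hypothetical data scattered around the line y = x.
points = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 1.0), (2.0, 1.9)]
C = covariance(points)
lam, u = leading_eigenpair(C)
print(lam, u)
```

The direction found is close to (1, 1)/√2, the direction along which the points vary most.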

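Kernel PCA Extension to feature space: compute the covariance matrix $C = \frac{1}{l} \sum_{j=1}^{l} \Phi(x_j) \Phi(x_j)^T$ (assuming the data are centered in feature space) and solve the eigenvalue problem $CV = \lambda V$ $\Rightarrow$ $V = \sum_{i=1}^{l} \alpha_i \Phi(x_i)$.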

Kernel PCA Define in terms of dot products: $K_{ij} = \Phi(x_i) \cdot \Phi(x_j) = k(x_i,x_j)$. Multiplying both sides of $CV = \lambda V$ by $\Phi(x_k)$ from the left lets us express everything in terms of dot products. Then the problem becomes: $l \lambda \alpha = K \alpha$. We need to solve a linear eigenvalue problem, but now K is l x l rather than d x d.
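The kernel eigenproblem $l\lambda\alpha = K\alpha$ can be sketched without any linear-algebra library. The code below builds an RBF Gram matrix for six hypothetical points in two clusters, centers it in feature space (the centering formula is the standard one; the slides assume centered data), and extracts the leading eigenvector by power iteration. Normalizing $\alpha$ by the square root of K's eigenvalue makes the feature-space eigenvector V have unit length.

```python
import math

def rbf(x, y, width=0.1):
    """Gaussian RBF kernel exp(-||x - y||^2 / width)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / width)

def centered_gram(points, kernel):
    """Gram matrix K_ij = k(x_i, x_j), centered in feature space."""
    l = len(points)
    K = [[kernel(a, b) for b in points] for a in points]
    row_mean = [sum(row) / l for row in K]
    total_mean = sum(row_mean) / l
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(l)] for i in range(l)]

def leading_eigenpair(K, iterations=500):
    """Power iteration: returns (eigenvalue of K, unit eigenvector)."""
    l = len(K)
    # Start from a non-constant vector: constants lie in the null space of centered K.
    u = [float(i + 1) for i in range(l)]
    lam = 0.0
    for _ in range(iterations):
        v = [sum(K[i][j] * u[j] for j in range(l)) for i in range(l)]
        lam = math.sqrt(sum(x * x for x in v))
        u = [x / lam for x in v]
    return lam, u

# Hypothetical data: two small clusters.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
K = centered_gram(points, rbf)
lam, u = leading_eigenpair(K)          # lam is the eigenvalue of K, i.e. l * lambda
alpha = [x / math.sqrt(lam) for x in u]  # scaling so that V . V = 1

# First nonlinear principal component of each training point: (K alpha)_i.
features = [sum(K[i][j] * alpha[j] for j in range(len(points)))
            for i in range(len(points))]
print(features)
```

The first component takes opposite signs on the two clusters, which is the kind of structure the later toy-example slides visualize.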

Kernel PCA - Extracting Features To extract features of a new pattern x with kernel PCA, project the pattern on the first n eigenvectors $V^k$: $V^k \cdot \Phi(x) = \sum_{i=1}^{l} \alpha_i^k \, k(x_i, x)$, $k = 1, \dots, n$.

Kernel PCA Kernel PCA is used for: de-noising; compression; interpretation (visualization); extracting features for classifiers.

Kernel PCA - Toy example Regular PCA:

Kernel PCA - Toy example

Applications – Kernel PCA Kernel PCA Pattern Reconstruction via Approximate Pre-Images B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, pages 147-152, Berlin, 1998. Springer Verlag.

Applications - Kernel PCA Recall: $\Phi(x)$ can be reconstructed from its principal components $\beta_k = V^k \cdot \Phi(x)$. Projection operator: $P_n \Phi(x) = \sum_{k=1}^{n} \beta_k V^k$.

Applications - Kernel PCA Denoising: when $P_n \Phi(x) \neq \Phi(x)$, we can't guarantee the existence of a pre-image. First n directions $\to$ main structure. The remaining directions $\to$ noise.

Applications - Kernel PCA Find z that minimizes $\rho(z) = \| \Phi(z) - P_n \Phi(x) \|^2$. Use gradient descent starting with x. $\rho$ might be non-zero at the minimum.

Applications - Kernel PCA Input toy data: 3 point sources (100 points each) with Gaussian noise $\sigma = 0.1$. Using an RBF kernel.

Applications – Kernel PCA Lines of constant feature value for the first 8 nonlinear principal components extracted with k(x,y)=exp(-|| x-y||^2/0.1)

Applications - Kernel PCA Kernel PCA denoising by reconstructing from projections onto the eigenvectors. 20 new points for each Gaussian, represented in feature space by projecting onto the first principal components.

Applications – Kernel PCA How the original points move in denoising

Applications – Kernel PCA Compare to the linear PCA

Applications - Kernel PCA Real-world data: 256-dimensional handwritten digits. Training: 300. Testing: 50. Used an RBF kernel. Gaussian noise: sigma=0.5. Speckle noise: p=0.4.


Fisher Linear Discriminant Finds a direction w, projected on which the classes are “best” separated

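Fisher Linear Discriminant For “best” separation, maximize $J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$, where $\tilde{m}_i$ is the projected mean of class i, and $\tilde{s}_i$ is the std. of the projected class i.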

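Fisher Linear Discriminant Equivalent to finding w which maximizes: $J(w) = \frac{w^T S_B w}{w^T S_W w}$, where $S_B = (m_1 - m_2)(m_1 - m_2)^T$ and $S_W = \sum_{i=1,2} \sum_{x \in X_i} (x - m_i)(x - m_i)^T$.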

Fisher Linear Discriminant The solution is given by: $w \propto S_W^{-1} (m_1 - m_2)$.
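A minimal 2-D sketch of the closed-form Fisher direction $w = S_W^{-1}(m_1 - m_2)$, with the 2x2 inverse written out by hand. The two classes are hypothetical: both are elongated along the diagonal and offset vertically, so the direction joining the class means is not the best one to project on.

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, m):
    """2x2 scatter matrix sum_x (x - m)(x - m)^T."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class1, class2):
    """w = S_W^{-1} (m1 - m2), using the explicit inverse of a 2x2 matrix."""
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    S = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    diff = (m1[0] - m2[0], m1[1] - m2[1])
    return ((S[1][1] * diff[0] - S[0][1] * diff[1]) / det,
            (-S[1][0] * diff[0] + S[0][0] * diff[1]) / det)

# Hypothetical classes.
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.5)]
class2 = [(2.0, 0.0), (3.0, 1.2), (4.0, 2.0)]

w = fisher_direction(class1, class2)
proj1 = [w[0] * x + w[1] * y for x, y in class1]
proj2 = [w[0] * x + w[1] * y for x, y in class2]
print(w, min(proj1), max(proj2))
```

Projected onto w, the two classes no longer overlap; a threshold between `max(proj2)` and `min(proj1)` separates them.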

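Kernel Fisher Discriminant Kernel formulation: $J(w) = \frac{w^T S_B^{\Phi} w}{w^T S_W^{\Phi} w}$, where $S_B^{\Phi} = (m_1^{\Phi} - m_2^{\Phi})(m_1^{\Phi} - m_2^{\Phi})^T$, $S_W^{\Phi} = \sum_{i=1,2} \sum_{x \in X_i} (\Phi(x) - m_i^{\Phi})(\Phi(x) - m_i^{\Phi})^T$ and $m_i^{\Phi} = \frac{1}{l_i} \sum_{j=1}^{l_i} \Phi(x_j^i)$.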

Kernel Fisher Discriminant From the theory of reproducing kernels: $w = \sum_{i=1}^{l} \alpha_i \Phi(x_i)$. Substituting it into J(w) reduces the problem to maximizing: $J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}$.

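Kernel Fisher Discriminant where $M = (M_1 - M_2)(M_1 - M_2)^T$ with $(M_i)_j = \frac{1}{l_i} \sum_{k=1}^{l_i} k(x_j, x_k^i)$, and $N = \sum_{j=1,2} K_j (I - 1_{l_j}) K_j^T$, where $K_j$ is the $l \times l_j$ kernel matrix for class j, I is the identity and $1_{l_j}$ is a matrix with all entries $1/l_j$.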

Kernel Fisher Discriminant The solution is to solve the generalized eigenproblem $M \alpha = \lambda N \alpha$, which is l x l rather than d x d. The projection of a new pattern x is then: $w \cdot \Phi(x) = \sum_{i=1}^{l} \alpha_i k(x_i, x)$. Find a suitable threshold, e.g. the mean of the average projections of the two classes, or one chosen to minimize the risk.

Kernel Fisher Discriminant - Constrained Optimization The problem can also be formulated as a constrained optimization problem.

Kernel Fisher Discriminant - Toy Example Comparison of the feature found by KFD (left) and those found by kernel PCA (1st eigenvector: middle; 2nd eigenvector: right). A polynomial kernel was used. Data: two noisy parabolic shapes mirrored at the x and y axes.

Applications - Fisher Discriminant Analysis Fisher Discriminant Analysis with Kernels. S. Mika et al. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999.

Applications - Fisher Discriminant Analysis Input: USPS handwritten digits. Training set: 3000. Constructed: 10 class/non-class KFD classifiers. Take the class with maximal output.

Applications - Fisher Discriminant Analysis Results (KFDA vs. SVM): 3.7% error on the ten-class classifier, using an RBF kernel with width parameter 0.3*256. Compare to 4.2% using SVM.

Summary: Structural Risk Minimization (SRM); Support Vector Machines (SVM); Feature Space vs. Input Space; Kernel PCA; Kernel Fisher Discriminant Analysis (KFDA)