Linear Learning Machines and SVM: The Perceptron Algorithm Revisited


Content
- Linear Learning Machines and SVM: The Perceptron Algorithm revisited
- Functional and Geometric Margin
- Novikoff theorem
- Dual Representation
- Learning in the Feature Space
- Kernel-Induced Feature Space
- Making Kernels
- The Generalization Problem
- Probably Approximately Correct Learning
- Structural Risk Minimization

Basic Notation
- Input space: X ⊆ R^n
- Output space: Y = {−1, +1} for classification, Y ⊆ R for regression
- Hypothesis: h ∈ H, h: X → Y
- Training set: S = {(x_1, y_1), ..., (x_l, y_l)}
- Test error: ε(h), also denoted R(α)
- Dot product: ⟨x, z⟩

The Perceptron Algorithm revisited: linear separation of the input space. The decision function is a hyperplane in input space, f(x) = ⟨w, x⟩ + b, and points are classified by h(x) = sign(f(x)). The algorithm requires that the input patterns be linearly separable, i.e. that there exist a linear discriminant function with zero training error. We assume that this is the case.

The Perceptron Algorithm (primal form)

    initialize w_0 = 0, b_0 = 0, k = 0, R = max_i ‖x_i‖
    repeat
        error = false
        for i = 1..l
            if y_i (⟨w_k, x_i⟩ + b_k) <= 0 then
                w_{k+1} = w_k + η y_i x_i
                b_{k+1} = b_k + η y_i R²
                k = k + 1
                error = true
            end if
        end for
    until (error == false)
    return k, (w_k, b_k), where k is the number of mistakes
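
A minimal NumPy sketch of this primal perceptron, assuming the update rule above; the function name, the max_epochs cap, and the toy data are illustrative choices, not part of the slides:

    import numpy as np

    def perceptron_primal(X, y, eta=1.0, max_epochs=100):
        """Primal perceptron; X is (l, n), y is (l,) with labels +1/-1."""
        l, n = X.shape
        w = np.zeros(n)
        b = 0.0
        k = 0                                  # number of mistakes / updates
        R = np.max(np.linalg.norm(X, axis=1))
        for _ in range(max_epochs):            # cap in case the data are not separable
            error = False
            for i in range(l):
                if y[i] * (np.dot(w, X[i]) + b) <= 0:   # misclassified point
                    w += eta * y[i] * X[i]
                    b += eta * y[i] * R**2
                    k += 1
                    error = True
            if not error:                      # a full pass with no mistakes
                break
        return w, b, k

    # Toy linearly separable data in R^2
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    w, b, k = perceptron_primal(X, y)
    print(w, b, k)                             # k = number of updates made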

The Perceptron Algorithm: comments. The perceptron works by adding misclassified positive examples to, or subtracting misclassified negative examples from, an arbitrary initial weight vector, which (without loss of generality) we assumed to be the zero vector. So the final weight vector is a linear combination of the training points, w = Σ_{i=1}^{l} α_i y_i x_i, where, since the sign of the coefficient of x_i is given by the label y_i, the α_i are positive values, proportional to the number of times misclassification of x_i has caused the weight to be updated. α_i is called the embedding strength of the pattern x_i.

Functional and Geometric Margin. The notion of the margin of a data point w.r.t. a linear discriminant will turn out to be an important concept. The functional margin of a linear discriminant (w, b) w.r.t. a labeled pattern (x_i, y_i) is defined as γ_i = y_i (⟨w, x_i⟩ + b). If the functional margin is negative, then the pattern is incorrectly classified; if it is positive, the classifier predicts the correct label. The larger γ_i, the further away x_i is from the discriminant. This is made more precise by the notion of the geometric margin.

Functional and Geometric Margin cont. The geometric margin of (w, b) w.r.t. (x_i, y_i) is the functional margin of the normalized hyperplane (w/‖w‖, b/‖w‖), i.e. γ_i = y_i (⟨w, x_i⟩ + b) / ‖w‖. (Figure: geometric margins of two points, and the margin of a training set.)

Functional and Geometric Margin cont. The geometric margin measures the Euclidean distance of a point from the decision boundary. Finally, γ = min_i γ_i is called the (functional) margin of (w, b) w.r.t. the data set S = {(x_i, y_i)}. The margin of a training set S is the maximum geometric margin over all hyperplanes; a hyperplane realizing this maximum is a maximal margin hyperplane. (Figure: maximal margin hyperplane.)
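
These margin definitions are straightforward to compute; here is a small NumPy sketch, where the helper names and the example hyperplane are illustrative assumptions:

    import numpy as np

    def functional_margins(w, b, X, y):
        """gamma_i = y_i * (<w, x_i> + b) for every training point."""
        return y * (X @ w + b)

    def geometric_margins(w, b, X, y):
        """Functional margins of the normalized hyperplane (w/||w||, b/||w||)."""
        return functional_margins(w, b, X, y) / np.linalg.norm(w)

    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    w, b = np.array([1.0, 1.0]), 0.0            # an illustrative separating hyperplane

    print(functional_margins(w, b, X, y))       # per-point functional margins
    print(geometric_margins(w, b, X, y).min())  # margin of (w, b) w.r.t. the set S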

Novikoff theorem. Theorem: Suppose that there exists a vector w* with ‖w*‖ = 1 and a bias term b* such that the margin on a (non-trivial) data set S is at least γ, i.e. y_i (⟨w*, x_i⟩ + b*) ≥ γ for i = 1, ..., l. Then the number of update steps in the perceptron algorithm is at most k ≤ (2R/γ)², where R = max_i ‖x_i‖.

Novikoff theorem cont. Comments: The Novikoff theorem says that no matter how small the margin, if a data set is linearly separable, then the perceptron will find a solution that separates the two classes in a finite number of steps. More precisely, the number of update steps (and hence the runtime) depends on the margin and is inversely proportional to the squared margin. The bound is invariant under rescaling of the patterns, and the learning rate does not matter.
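
The bound can be checked empirically. The sketch below (function name, toy data, and the chosen unit-norm separator are illustrative assumptions) counts the perceptron's updates on a separable set and compares them with (2R/γ)²:

    import numpy as np

    def perceptron_mistakes(X, y, eta=1.0, max_epochs=100):
        """Run the primal perceptron and return the number of updates k."""
        w, b, k = np.zeros(X.shape[1]), 0.0, 0
        R = np.max(np.linalg.norm(X, axis=1))
        for _ in range(max_epochs):
            error = False
            for i in range(len(y)):
                if y[i] * (np.dot(w, X[i]) + b) <= 0:
                    w += eta * y[i] * X[i]
                    b += eta * y[i] * R**2
                    k += 1
                    error = True
            if not error:
                break
        return k

    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])

    w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # a unit-norm separator (b* = 0 here)
    gamma = np.min(y * (X @ w_star))               # its margin on S
    R = np.max(np.linalg.norm(X, axis=1))

    k = perceptron_mistakes(X, y)
    print(k, (2 * R / gamma) ** 2)                 # k should not exceed the bound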

Dual Representation. The decision function can be rewritten as follows: f(x) = ⟨w, x⟩ + b = Σ_{j=1}^{l} α_j y_j ⟨x_j, x⟩ + b. The update rule can likewise be rewritten: if y_i (Σ_{j=1}^{l} α_j y_j ⟨x_j, x_i⟩ + b) ≤ 0, then α_i ← α_i + η and b ← b + η y_i R². The learning rate η only influences the overall scaling of the hyperplanes; it does not affect an algorithm with a zero starting vector, so we can set η = 1.
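
A sketch of the resulting dual perceptron with η = 1, where only the Gram matrix of dot products is needed (names and toy data are illustrative):

    import numpy as np

    def perceptron_dual(X, y, max_epochs=100):
        """Dual perceptron: learns embedding strengths alpha instead of w."""
        l = len(y)
        G = X @ X.T                        # Gram matrix, G[i, j] = <x_i, x_j>
        R2 = np.max(np.diag(G))            # R^2 = max_i ||x_i||^2
        alpha = np.zeros(l)
        b = 0.0
        for _ in range(max_epochs):
            error = False
            for i in range(l):
                if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                    alpha[i] += 1.0        # eta = 1
                    b += y[i] * R2
                    error = True
            if not error:
                break
        return alpha, b

    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1, 1, -1, -1])
    alpha, b = perceptron_dual(X, y)
    w = (alpha * y) @ X                    # recover the primal weight vector
    print(alpha, b, w)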

Duality: First Property of SVMs. Duality is the first feature of Support Vector Machines: SVMs are Linear Learning Machines represented in a dual fashion, and the data appear only inside dot products (both in the decision function and in the training algorithm). The matrix G with entries G_ij = ⟨x_i, x_j⟩ is called the Gram matrix.

Limitations of Linear Classifiers. Linear Learning Machines (LLMs) cannot deal with non-linearly separable data or with noisy data. Moreover, this formulation only deals with vectorial data.

Limitations of Linear Classifiers cont. Neural-network solution: multiple layers of thresholded linear functions (multi-layer neural networks), trained with learning algorithms such as back-propagation. SVM solution: a kernel representation. Approximation-theoretic issues become independent of the learning-theoretic ones, and the learning algorithm is decoupled from the specifics of the application area, which is encoded in the design of the kernel.

Learning in the Feature Space. Map the data into a feature space where they are linearly separable (i.e. map attributes to features): x → φ(x), with φ: X → F.

Learning in the Feature Space cont. Example: consider the target function f(m_1, m_2, r) = C m_1 m_2 / r², giving the gravitational force between two bodies. The observable quantities are the masses m_1, m_2 and the distance r. A linear machine could not represent this function, but the change of coordinates (m_1, m_2, r) → (ln m_1, ln m_2, ln r) gives the linear representation ln f = ln C + ln m_1 + ln m_2 − 2 ln r.
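
A small sketch of this point: after mapping to log coordinates, an ordinary least-squares fit recovers the force law exactly. The constant, the synthetic data, and the use of least squares are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    C = 6.674e-11                                   # illustrative constant
    m1 = rng.uniform(1.0, 10.0, size=200)
    m2 = rng.uniform(1.0, 10.0, size=200)
    r = rng.uniform(1.0, 5.0, size=200)
    f = C * m1 * m2 / r**2                          # target: gravitational force

    # Feature map: (m1, m2, r) -> (ln m1, ln m2, ln r), target -> ln f
    Phi = np.column_stack([np.log(m1), np.log(m2), np.log(r), np.ones_like(r)])
    coef, *_ = np.linalg.lstsq(Phi, np.log(f), rcond=None)

    print(coef)   # approximately [1, 1, -2, ln C]: linear in the new coordinates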

Learning in the Feature Space cont. The task of choosing the most suitable representation is known as feature selection. The space X is referred to as the input space, while F = {φ(x) : x ∈ X} is called the feature space. Frequently one seeks the smallest possible set of features that still conveys the essential information (dimensionality reduction).

Problems with the Feature Space. Working in high-dimensional feature spaces solves the problem of expressing complex functions, BUT: there is a computational problem (working with very large vectors) and a generalization-theory problem (the curse of dimensionality).

Implicit Mapping to Feature Space. We will introduce kernels, which solve the computational problem of working with many dimensions, can make it possible to use infinite dimensions efficiently in time and space, and offer other advantages, both practical and conceptual.

Kernel-Induced Feature Space. In order to learn non-linear relations we select non-linear features. Hence, the set of hypotheses we consider will be functions of the type f(x) = Σ_i w_i φ_i(x) + b, where φ: X → F is a non-linear map from the input space to the feature space. In the dual representation, the data points only appear inside dot products: f(x) = Σ_j α_j y_j ⟨φ(x_j), φ(x)⟩ + b.

Kernels. A kernel is a function that returns the value of the dot product between the images of its two arguments: K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. When using kernels, the dimensionality of the feature space F is not necessarily important; we may not even know the map φ explicitly. Given a function K, it is possible to verify that it is a kernel.

Kernels cont. One can use LLMs in a feature space by simply rewriting them in the dual representation and replacing dot products with kernels, ⟨x_1, x_2⟩ ← K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩, so that the decision function becomes f(x) = Σ_j α_j y_j K(x_j, x) + b.
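
Applying that substitution to the dual perceptron sketched earlier gives a kernel perceptron. Below is a self-contained sketch with a Gaussian RBF kernel on XOR-like data that no linear classifier in input space can separate; the kernel choice, sigma, and the data are illustrative assumptions:

    import numpy as np

    def rbf_kernel(X1, X2, sigma=1.0):
        """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for all pairs."""
        d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
        return np.exp(-d2 / (2 * sigma**2))

    def kernel_perceptron(X, y, kernel, max_epochs=100):
        l = len(y)
        G = kernel(X, X)                   # Gram matrix in the feature space
        alpha, b = np.zeros(l), 0.0
        for _ in range(max_epochs):
            error = False
            for i in range(l):
                if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                    alpha[i] += 1.0
                    b += y[i]              # R^2 = max_i K(x_i, x_i) = 1 for the RBF kernel
                    error = True
            if not error:
                break
        return alpha, b

    # XOR-like data: not linearly separable in input space
    X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
    y = np.array([1, 1, -1, -1])
    alpha, b = kernel_perceptron(X, y, rbf_kernel)

    def decision(x_new):                   # f(x) = sum_j alpha_j y_j K(x_j, x) + b
        return np.sum(alpha * y * rbf_kernel(X, x_new[None, :]).ravel()) + b

    print([np.sign(decision(x)) for x in X])   # reproduces the labels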

The Kernel Matrix (Gram Matrix)

    K(1,1)  K(1,2)  K(1,3)  ...  K(1,m)
    K(2,1)  K(2,2)  K(2,3)  ...  K(2,m)
    ...     ...     ...     ...  ...
    K(m,1)  K(m,2)  K(m,3)  ...  K(m,m)

where K(i,j) is shorthand for K(x_i, x_j).

The Kernel Matrix cont. The kernel matrix is the central structure in kernel machines. It is an information 'bottleneck': it contains all the information necessary for the learning algorithm, fusing information about the data AND about the kernel. It has many interesting properties.

Mercer's Theorem. The kernel matrix is symmetric positive definite, and any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space. More formally, Mercer's theorem: every (semi-)positive definite, symmetric function K(x, z) is a kernel, i.e. there exists a mapping φ such that it is possible to write K(x, z) = ⟨φ(x), φ(z)⟩. Definition of positive definiteness: ∫∫ K(x, z) g(x) g(z) dx dz ≥ 0 for every function g with finite L2 norm.
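
For a finite sample, the Mercer condition appears as positive semi-definiteness of the kernel matrix, which can be checked numerically via its eigenvalues. In this sketch (kernels and data are illustrative assumptions), a Gaussian RBF Gram matrix has non-negative eigenvalues, while the squared Euclidean distance, which is not a kernel, does not:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))

    def gram(kernel, X):
        return np.array([[kernel(a, b) for b in X] for a in X])

    rbf = lambda a, b: np.exp(-np.sum((a - b)**2) / 2.0)     # a Mercer kernel
    not_a_kernel = lambda a, b: np.sum((a - b)**2)           # squared distance

    for name, k in [("rbf", rbf), ("squared distance", not_a_kernel)]:
        eigvals = np.linalg.eigvalsh(gram(k, X))
        print(name, "min eigenvalue:", eigvals.min())
        # A valid kernel gives eigenvalues >= 0 (up to numerical noise);
        # the squared-distance "kernel" produces negative eigenvalues.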

Mercer's Theorem cont. Eigenvalue expansion of Mercer kernels: K(x, z) = Σ_i λ_i φ_i(x) φ_i(z), where λ_i ≥ 0 are the eigenvalues and φ_i the eigenfunctions of the kernel. That is, the eigenfunctions, weighted by the eigenvalues, act as features!

Simple examples of kernels are: K(x, z) = (⟨x, z⟩ + 1)^d, which is a polynomial kernel of degree d; K(x, z) = exp(−‖x − z‖² / (2σ²)), which is the Gaussian RBF kernel; and K(x, z) = tanh(κ⟨x, z⟩ + θ), corresponding to a two-layer sigmoidal neural network.

Example: Polynomial Kernels. A standard instance: in R², the homogeneous degree-2 polynomial kernel K(x, z) = ⟨x, z⟩² corresponds to the explicit feature map φ(x) = (x_1², x_2², √2 x_1 x_2), since ⟨φ(x), φ(z)⟩ = (x_1 z_1 + x_2 z_2)² = K(x, z).
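
These kernels are one-liners in code, and the degree-2 example above can be verified directly; everything below (names, test vectors, parameter values) is an illustrative sketch, and note that the sigmoid kernel satisfies Mercer's condition only for some parameter values:

    import numpy as np

    poly = lambda x, z, d=3: (np.dot(x, z) + 1.0) ** d       # polynomial of degree d
    rbf = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z)**2) / (2 * sigma**2))
    sigmoid = lambda x, z, kappa=1.0, theta=-1.0: np.tanh(kappa * np.dot(x, z) + theta)

    # Explicit feature map for the homogeneous degree-2 polynomial kernel <x, z>^2 in R^2
    def phi(x):
        return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(poly(x, z), rbf(x, z), sigmoid(x, z))
    print(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))         # both equal 1 (up to rounding)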

Making Kernels. The set of kernels is closed under certain operations: if K, K' are kernels, then K + K' is a kernel; cK is a kernel, for c > 0; aK + bK' is a kernel, for a, b > 0; etc. One can make complex kernels from simple ones: modularity!

Second Property of SVMs: SVMs are Linear Learning Machines that use a dual representation and operate in a kernel-induced feature space; that is, f(x) = Σ_i α_i y_i K(x_i, x) + b is a linear function in the feature space implicitly defined by K.

A bad kernel would be one whose kernel matrix is mostly diagonal: all points are nearly orthogonal to each other, so there are no clusters and no structure. (Such a kernel matrix looks like the identity: ones on the diagonal, near-zero entries everywhere else.)

No Free Kernel. If the mapping is into a space with too many irrelevant features, the kernel matrix becomes diagonal. Some prior knowledge of the target is needed in order to choose a good kernel.

The Generalization Problem. The curse of dimensionality: it is easy to overfit in high-dimensional spaces (regularities found in the training set may be accidental, i.e. they would not be found again in a test set). The SVM problem as stated is ill-posed: many hyperplanes that separate the data exist. We need a principled way to choose the best possible hyperplane.

The Generalization Problem cont. The "capacity" of a machine is its ability to learn any training set without error. "A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree." (C. Burges)

Probably Approximately Correct Learning: Assumptions and Definitions. Suppose we are given l observations (x_1, y_1), ..., (x_l, y_l), and that train and test points are drawn randomly (i.i.d.) from some unknown probability distribution D(x, y). The machine learns the mapping x_i → y_i and outputs a hypothesis h(x, α). A particular choice of α generates a "trained machine". The expectation of the test error, or "expected risk", is R(α) = ∫ ½ |y − h(x, α)| dD(x, y).

A Bound on the Generalization Performance. The "empirical risk" is R_emp(α) = (1/(2l)) Σ_{i=1}^{l} |y_i − h(x_i, α)|. Choose some η such that 0 ≤ η ≤ 1. With probability 1 − η the following bound holds (Vapnik, 1995): R(α) ≤ R_emp(α) + sqrt( (d (ln(2l/d) + 1) − ln(η/4)) / l )   (3), where d is called the VC dimension and is a measure of the "capacity" of the machine. The r.h.s. of (3) is called the "risk bound" of h(x, α) under distribution D.

A Bound on the Generalization Performance cont. The second term on the right-hand side is called the VC confidence. Three key points about the risk bound: it is independent of D(x, y); it is usually not possible to compute the left-hand side; and if we know d, we can compute the right-hand side. This gives us a way to compare learning machines!

The VC Dimension. Definition: the VC dimension of a set of functions H is d if and only if there exists a set of d points such that these points can be labeled in all 2^d possible configurations and, for each labeling, a member of H can be found which correctly assigns those labels, but no set of q points with q > d satisfies this property.

The VC Dimension cont. Said another way: the VC dimension is the size of the largest subset of X shattered by H (every dichotomy implemented). The VC dimension measures the capacity of the set H of functions. If for any number N it is possible to find N points that can be separated in all 2^N possible ways, we say that the VC dimension of the set is infinite.

The VC Dimension: Example. Suppose that the data live in R², and the set H consists of oriented straight lines (linear discriminants). While it is possible to find three points that can be shattered by this set of functions, it is not possible to find four. Thus the VC dimension of the set of linear discriminants in R² is three.
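
A brute-force check of this claim; the particular point configurations and the random search over lines are illustrative choices:

    import numpy as np
    from itertools import product

    def shattered_by_lines(points, n_trials=20000, seed=0):
        """Check whether random oriented lines sign(w.x + b) realize every labeling."""
        rng = np.random.default_rng(seed)
        realized = set()
        for _ in range(n_trials):
            w = rng.normal(size=2)
            b = rng.normal()
            realized.add(tuple(np.sign(points @ w + b).astype(int)))
        needed = {tuple(lab) for lab in product([-1, 1], repeat=len(points))}
        return needed <= realized

    three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout
    print(shattered_by_lines(three))   # True: 3 points can be shattered
    print(shattered_by_lines(four))    # False: some labeling of 4 points is impossible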

The VC Dimension cont. Theorem 1: Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent. Corollary: the VC dimension of the set of oriented hyperplanes in R^n is n + 1, since we can always choose n + 1 points, and then choose one of them as origin, such that the position vectors of the remaining n points are linearly independent, but we can never choose n + 2 such points (no n + 1 vectors in R^n can be linearly independent).

The VC Dimension cont. The VC dimension can be infinite even when the number of parameters of the set of hypothesis functions is low. Example: for the one-parameter family f(x, a) = sign(sin(a x)), for any integer l and any labels y_1, ..., y_l ∈ {−1, +1} we can find l points and a parameter a such that those points are shattered by f(x, a). Those points are x_i = 10^{−i}, i = 1, ..., l, and the parameter is a = π(1 + Σ_{i=1}^{l} (1 − y_i) 10^i / 2).
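
A quick numerical check of this construction (l = 8 and the random labels are illustrative choices; much larger l runs into floating-point limits):

    import numpy as np

    rng = np.random.default_rng(2)
    l = 8
    y = rng.choice([-1, 1], size=l)                  # arbitrary labels
    i = np.arange(1, l + 1)
    x = 10.0 ** (-i)                                 # points x_i = 10^-i
    a = np.pi * (1.0 + np.sum((1 - y) * 10.0 ** i) / 2.0)

    predictions = np.sign(np.sin(a * x))
    print(np.array_equal(predictions, y))            # True: all l points are shattered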

Minimizing the Bound by Minimizing d. (Figure: VC confidence as a function of d/l.)

Minimizing the Bound by Minimizing d cont. The figure shows the dependence of the VC confidence (the second term in (3)) on d/l, given a 95% confidence level (η = 0.05) and a training sample of size l = 10000. One should choose the learning machine whose set of functions has minimal d. For d/l > 0.37 (with η = 0.05 and l = 10000) the VC confidence exceeds 1, so for higher d/l the bound is not tight.
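
The VC confidence term in (3) is easy to evaluate numerically; this sketch (assuming natural logarithms in the bound, as written above) reproduces the behaviour described on this slide:

    import numpy as np

    def vc_confidence(d, l, eta=0.05):
        """Second term of the risk bound (3): sqrt((d(ln(2l/d)+1) - ln(eta/4)) / l)."""
        return np.sqrt((d * (np.log(2.0 * l / d) + 1.0) - np.log(eta / 4.0)) / l)

    l = 10000
    for ratio in [0.05, 0.1, 0.2, 0.37, 0.5]:
        d = ratio * l
        print(f"d/l = {ratio:4.2f}  VC confidence = {vc_confidence(d, l):.3f}")
    # Around d/l ~ 0.37 the VC confidence reaches 1, so beyond that point the
    # bound exceeds 1 and is no longer informative.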

Example. Question: what are the VC dimension and the empirical risk of the nearest-neighbor classifier? Any number of points, labeled arbitrarily, will be learned successfully (each training point is its own nearest neighbor), so d = ∞ and the empirical risk is 0. The bound therefore provides no information in this example.

Structural Risk Minimization. Finding a learning machine with the minimum upper bound on the actual risk leads us to a method of choosing an optimal machine for a given task. This is the essential idea of structural risk minimization (SRM). Let H_1 ⊂ H_2 ⊂ H_3 ⊂ ... be a sequence of nested subsets of hypotheses whose VC dimensions satisfy d_1 < d_2 < d_3 < ... SRM then consists of finding that subset of functions which minimizes the upper bound on the actual risk. This can be done by training a series of machines, one for each subset, where for a given subset the goal of training is to minimize the empirical risk. One then takes the trained machine in the series whose sum of empirical risk and VC confidence is minimal.
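
A toy sketch of the SRM recipe under strong simplifying assumptions: the nested classes are polynomial classifiers of increasing degree on the real line, the VC dimension of each class is approximated by its number of parameters, the classifiers are fit by a least-squares surrogate, and the class minimizing empirical risk plus VC confidence is selected. All of these choices are illustrative, not part of the slides:

    import numpy as np

    def vc_confidence(d, l, eta=0.05):
        return np.sqrt((d * (np.log(2.0 * l / d) + 1.0) - np.log(eta / 4.0)) / l)

    rng = np.random.default_rng(3)
    l = 200
    x = rng.uniform(-3.0, 3.0, size=l)
    y = np.sign(x - 0.5)                           # true concept: a simple threshold
    y[rng.random(l) < 0.1] *= -1                   # 10% label noise

    best = None
    for degree in range(1, 8):                     # nested classes H_1 ⊂ H_2 ⊂ ...
        Phi = np.vander(x, degree + 1)             # polynomial features
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares surrogate fit
        emp_risk = np.mean(np.sign(Phi @ coef) != y)     # training error rate
        d = degree + 1                             # crude VC-dimension proxy
        bound = emp_risk + vc_confidence(d, l)
        if best is None or bound < best[0]:
            best = (bound, degree, emp_risk)
        print(degree, round(emp_risk, 3), round(bound, 3))

    print("selected degree:", best[1])             # smallest class that fits the data well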