Statistical Learning Theory: Classification Using Support Vector Machines. John DiMona. Some slides based on those of Prof. Andrew Moore (CMU): http://www.cs.cmu.edu/~awm/tutorials

(Rough) Outline: Empirical Data Modeling; Risk Minimization Theory (Empirical Risk Minimization, Structural Risk Minimization); Optimal Separating Hyperplanes; Support Vector Machines; Example; Questions

Empirical Data Modeling: Observations of a system are collected. Based on these observations, a process of induction is used to build up a model of the system. This model is used to deduce responses of the system not yet observed. Observations could be a wide variety of things (medical records, ecological data, consumer trends) depending on what you are trying to model. The goal is to use what is already known to develop a generalized model of the system.

Empirical Data Modeling: Data obtained through observation is finite and sampled by nature. Typically this sampling is non-uniform. Due to the high-dimensional nature of some problems, the data will form only a sparse distribution in the input space. Creating a model from this type of data is an ill-posed problem. Observational data is messy: sparsely distributed and incomplete. How can we use this incomplete data to better understand the underlying rules of a system?

Empirical Data Modeling: [Figure: target space and hypothesis space, marking the globally optimal model, the best reachable model, and the selected model.] The goal in modeling is to choose the model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.

Error in Modeling: Approximation Error is a consequence of the hypothesis space not exactly fitting the target space; the underlying function may lie outside the hypothesis space. A poor choice of the model space will result in a large approximation error (model mismatch). Estimation Error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space. Together these form the Generalization Error. Approximation Error: based on the way we define the problem and represent the data, we may not be able to model the system exactly (in our medical example this could translate to not representing all relevant attributes of a person's medical condition). Estimation Error: a result of the inductive process selecting a suboptimal model in the hypothesis space.

Error in Modeling: [Figure: the same diagram as before, marking the globally optimal model, the best reachable model, and the selected model.] The goal in modeling is to choose the model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.

Pattern Recognition: Given a system y = f(x), where x is the item we want to classify and y is the true classification of that item, develop a model f̂(x) that best predicts the behavior of the system for all possible items x. Given a set of data that fall into two classifications, we want to develop a model that will classify these elements with the least amount of error.

Supervised Learning: A generator (G) of a set of vectors x, observed independently from the system with a fixed, unknown probability distribution P(x). A supervisor (S) who returns an output value y to every input vector x, according to the system's conditional probability function P(y|x) (also unknown). A learning machine (LM) capable of implementing a set of functions f(x, α), α ∈ Λ, where Λ is a set of parameters. Generator: creates possible data items x. Supervisor: correctly classifies each x as y = 1 or y = −1. Learning machine: capable of implementing all functions f(x, α) in the hypothesis space.

Supervised Learning: Training: the generator creates a set of vectors x_1, …, x_ℓ and the supervisor provides the correct classification for each, forming the training set (x_1, y_1), …, (x_ℓ, y_ℓ). The learning machine develops an estimation function using the training data. The estimation function is then used to classify new, unseen data.

Risk Minimization: In order to choose the best estimation function we must have a measure of discrepancy between the true classification y of an item x and an estimated classification f(x, α). For pattern recognition we use L(y, f(x, α)) = 0 if y = f(x, α), and 1 otherwise. This is called the loss function. For pattern recognition we say that if the estimated classification is correct there is no loss; otherwise there is total loss.

Risk Minimization: The expected value of loss with respect to some estimation function f(x, α) is the risk functional R(α) = ∫ L(y, f(x, α)) dP(x, y), where P(x, y) = P(x) P(y|x) is the (unknown) joint distribution of the data.

Risk Minimization: The expected value of loss with respect to some estimation function f(x, α) is the risk functional R(α) = ∫ L(y, f(x, α)) dP(x, y). Goal: find the function f(x, α₀) that minimizes the risk R(α) (over all functions f(x, α), α ∈ Λ). We want to find the estimation function that has the lowest risk out of all the estimation functions in the hypothesis space.

Risk Minimization: The expected value of loss with respect to some estimation function f(x, α) is the risk functional R(α) = ∫ L(y, f(x, α)) dP(x, y). Goal: find the function f(x, α₀) that minimizes the risk R(α) (over all functions f(x, α), α ∈ Λ). Problem: by definition we don't know P(x, y). How can we measure the accuracy of a model without this knowledge?

To make things clearer… For the coming discussion we will shorten notation in the following ways: the training pairs (x_1, y_1), …, (x_ℓ, y_ℓ) will be referred to as z_1, …, z_ℓ, and the loss function L(y, f(x, α)) will be written Q(z, α).

Empirical Risk Minimization (ERM): Instead of measuring risk over the set of all possible x, measure it over just the training set: R_emp(α) = (1/ℓ) Σ_{i=1..ℓ} Q(z_i, α). The empirical risk must converge uniformly to the actual risk over the set of loss functions, in both directions. Idea: measure the risk of an estimation function over the data for which the behavior of the system is actually known, the training data! In order for this to be an accurate heuristic it must converge in the same way the actual risk does. This is the approach taken by classical neural networks, such as those trained with the back-propagation method.
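To make the ERM idea concrete, here is a minimal sketch in Python/NumPy (my own illustration, not from the original slides; the linear decision function and data are hypothetical) that computes the empirical risk of a classifier as the average 0/1 loss over the training set:

```python
import numpy as np

def empirical_risk(decision_fn, X, y):
    """Average 0/1 loss of decision_fn over the training set (X, y)."""
    predictions = np.array([decision_fn(x) for x in X])
    losses = (predictions != y).astype(float)  # 0 if correct, 1 otherwise (the 0/1 loss)
    return losses.mean()

# Example: a hypothetical linear indicator function f(x, a) = sign(w.x + b)
w, b = np.array([1.0, -2.0]), 0.5
f = lambda x: np.sign(w @ x + b)

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 1.0]])
y_train = np.array([1, -1, 1, -1])
print(empirical_risk(f, X_train, y_train))  # 0.0 on this toy set
```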

VC Dimension (Vapnik–Chervonenkis Dimension): The VC dimension is a scalar value that measures the capacity of a set of functions. The VC dimension of a set of functions is h if and only if there exists a set of h points that can be separated by these functions in all 2^h possible configurations, and no set of h + 1 points exists satisfying this property. The VC dimension describes the ability of a set of functions to separate data in a space.

VC Dimension (Vapnik–Chervonenkis Dimension): Three points in the plane can be shattered by the set of linear indicator functions, whereas four points cannot. The set of linear indicator functions in n-dimensional space has a VC dimension equal to n + 1.
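The shattering claim can be checked numerically. The sketch below is my own illustration, not from the slides; it tests linear separability of each labeling as an LP feasibility problem using SciPy's linprog, and confirms that three points in general position can be shattered while the XOR labeling of four points cannot:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """True if some w, b satisfy y_i (w.x_i + b) >= 1 (LP feasibility test)."""
    n = X.shape[1]
    # Variables z = [w_1, ..., w_n, b]; constraints -y_i*(w.x_i + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (n + 1))
    return res.success

def shattered(X):
    """True if every +/-1 labeling of the points in X is linearly separable."""
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(shattered(three))  # True: 3 points in the plane can be shattered
print(shattered(four))   # False: the XOR labeling is not linearly separable
```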

Upper Bound for Risk: It can be shown that R(α) ≤ R_emp(α) + Φ(ℓ/h), where Φ is the confidence interval and h is the VC dimension. ERM only minimizes R_emp(α), and Φ, the confidence interval, is fixed based on the VC dimension of the set of functions, which is determined a priori. When implementing ERM one must tune the confidence interval based on the problem to avoid underfitting/overfitting the data. The actual risk of a function is bounded by the empirical risk of that function plus a confidence interval that depends on the number of training examples and the VC dimension. ERM is inaccurate when the confidence interval is large. The structure of the hypothesis space determines the confidence interval, and therefore when implementing ERM one needs a priori knowledge to build a learning machine with a small confidence interval.
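The exact expression for the confidence interval is not recoverable from the transcript; for reference, a commonly quoted form of the bound for the 0/1 loss (holding with probability at least 1 − η over ℓ training examples, with h the VC dimension; exact constants vary between statements) is:

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}}
```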

Structural Risk Minimization (SRM): SRM attempts to minimize the right-hand side of the inequality R(α) ≤ R_emp(α) + Φ(ℓ/h) over both terms simultaneously. The first term depends upon a specific function's error and the second depends on the VC dimension of the space that function is in; therefore the VC dimension must be a controlling variable. Structural risk minimization attempts to minimize the entire right-hand side of the inequality. The VC dimension is a controlling factor in this because both the confidence interval and the function set are dependent on it.

Structural Risk Minimization (SRM): We define our hypothesis space to be the set of functions S = {Q(z, α), α ∈ Λ}. We say that S_k is the hypothesis space of VC dimension h_k, such that S_1 ⊂ S_2 ⊂ … ⊂ S_n ⊂ … with h_1 ≤ h_2 ≤ … ≤ h_n. For a set of observations z_1, …, z_ℓ, SRM chooses the function minimizing the empirical risk in the subset S_k for which the guaranteed risk (the bound on the actual risk) is minimal. SRM breaks up the hypothesis space into nested subsets based on VC dimension, and effectively chooses the model with the lowest empirical risk from the subset that produces the smallest guaranteed risk.

Structural Risk Minimization (SRM): SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the VC dimension increases, the minima of the empirical risks decrease but the confidence interval increases. SRM is more general than ERM because it uses the subset S_k for which minimizing the empirical risk yields the best bound on the actual risk. There is a trade-off between minimizing empirical risk and choosing a VC dimension that provides a small confidence interval. SRM is more general than ERM because the VC dimension of a machine implementing ERM must be chosen a priori.

Support Vector Classification: Uses the SRM principle to separate two classes by a linear indicator function which is induced from available examples. The goal is to produce a classifier that will work well on unseen examples, i.e., one that generalizes well.

Linear Classifiers: [Figure: 2-D training points; one marker denotes +1, the other denotes −1.] Imagine a training set such as this. What is the best way to separate this data?

Linear Classifiers: [Figure: the same data with several candidate separating lines.] Imagine a training set such as this. What is the best way to separate this data? All of these separators are correct, but which is the best?

Linear Classifiers: [Figure: the maximum margin separator, with the nearest data points highlighted as support vectors.] Imagine a training set such as this. What is the best way to separate this data? All of these separators are correct, but which is the best? The maximum margin classifier maximizes the distance from the hyperplane to the nearest data points (the support vectors).

Defining the Optimal Hyperplane: The optimal hyperplane separates the training set with the largest margin, M.

Defining the Optimal Hyperplane: The optimal hyperplane separates the training set with the largest margin. The margin M is defined as the distance from any point on the minus plane to the closest point on the plus plane.

Defining the Optimal Hyperplane: The optimal hyperplane separates the training set with the largest margin. The margin M is defined as the distance from any point on the minus plane to the closest point on the plus plane. We need to find M in terms of w.

Defining the Optimal Hyperplane: The plus plane is {x : w · x + b = +1} and the minus plane is {x : w · x + b = −1}. Because w is perpendicular to the hyperplane, a point x⁻ on the minus plane reaches the closest point x⁺ on the plus plane along the direction of w: x⁺ = x⁻ + λw for some scalar λ.

Defining the Optimal Hyperplane: From w · x⁺ + b = 1 and w · x⁻ + b = −1 we get w · (x⁻ + λw) + b = 1, so λ = 2 / (w · w).

Defining the Optimal Hyperplane: The margin is therefore M = |x⁺ − x⁻| = λ|w| = 2 / ||w||.

Defining the Optimal Hyperplane: So we want to maximize M = 2 / ||w||, or equivalently minimize (1/2)||w||², subject to y_i (w · x_i + b) ≥ 1 for every training example.

Quadratic Programming: Minimizing (1/2)||w||² under the constraints y_i (w · x_i + b) ≥ 1 is equivalent to maximizing W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j) in the non-negative quadrant α_i ≥ 0, under the constraint Σ_i α_i y_i = 0. This is derived using the Lagrange functional.
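As an illustration (my own sketch, not part of the original slides; it assumes scikit-learn, whose SVC solves exactly this quadratic program), a maximum-margin linear classifier can be fit to separable data and its support vectors, weight vector, and margin inspected directly:

```python
import numpy as np
from sklearn.svm import SVC

# A small linearly separable training set.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [6.0, 5.0], [7.0, 6.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
print("margin M = 2/||w|| =", 2 / np.linalg.norm(clf.coef_[0]))
```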

Extensions: It is possible to extend to non-separable training sets by adding slack (error) variables ξ_i ≥ 0 and minimizing (1/2)||w||² + C Σ_i ξ_i subject to y_i (w · x_i + b) ≥ 1 − ξ_i. Data can be split into more than two classifications by using successive runs on the resulting classes.
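A short sketch of the soft-margin extension (again my own illustration using scikit-learn on synthetic overlapping data, not from the slides): the parameter C controls how heavily the slack variables are penalized, trading margin width against training errors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
               rng.normal(+1.0, 1.2, size=(50, 2))])  # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum():3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```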

Support Vector (SV) Machines: Map the input vectors into a high-dimensional feature space using a kernel function. In this feature space the optimal separating hyperplane is constructed. [Figure: input space mapped into feature space, where the optimal hyperplane is constructed.]

Support Vector (SV) Machines 1-Dimensional Example

Support Vector (SV) Machines 1-Dimensional Example Easy!

Support Vector (SV) Machines 1-Dimensional Example Easy! Harder (impossible)

Support Vector (SV) Machines 1-Dimensional Example Easy! Harder (impossible) Project into a higher dimension

Support Vector (SV) Machines 1-Dimensional Example Easy! Harder (impossible) Project into a higher dimension Magic…
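The "magic" of the 1-dimensional example can be reproduced with a tiny script (an illustration of the idea assuming NumPy and scikit-learn; the mapping x → (x, x²) is my choice, not necessarily the one used in the slide's figure): points that are not separable on the line become linearly separable after projecting into a higher dimension.

```python
import numpy as np
from sklearn.svm import SVC

# Not linearly separable in 1-D: the -1 points surround the +1 points.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, -1, -1])

# Project into 2-D with the mapping x -> (x, x^2).
X2 = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(X2, y)
print("training accuracy after projection:", clf.score(X2, y))  # 1.0
```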

Support Vector (SV) Machines: Some possible ways to implement SV machines: polynomial learning machines, radial basis function machines, and (two-layer) neural networks. These methods all implement different kernel functions.
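In scikit-learn these implementations correspond directly to the kernel argument. This is a hedged illustration of my own: it uses the library's default kernel parameters and a toy dataset rather than anything from the original presentation.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# polynomial, radial basis function, and sigmoid (neural-network-style) kernels
for kernel in ("poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel:>8} kernel: training accuracy {clf.score(X, y):.2f}")
```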

Two-Layer Neural Network Approach: The kernel is a sigmoid function, K(x, x_i) = S(v (x · x_i) + c), and the machine implements the rule f(x) = sign( Σ_i α_i S(v (x · x_i) + c) + b ). Using this technique the following are found automatically: the architecture of the two-layer machine, determining the number N of units in the first layer (the number of support vectors); the vectors of the weights in the first layer (the support vectors); and the vector of weights for the second layer (the values of α).

Two-Layer Neural Network Approach: This is a powerful implementation because it automatically determines the number of support vectors (the number of neurons in the first layer), the nonlinear transformation into the feature space (the weights on the first layer), and the values of α (the weights for the second layer).

Handwritten Digit Recognition: Used the U.S. Postal Service database: 7,300 training patterns and 2,000 test patterns. The resolution of the database was 16 × 16 pixels, yielding a 256-dimensional input space.
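The experiment cannot be reproduced exactly without the USPS data, but a comparable small-scale sketch (my own, using scikit-learn's bundled 8 × 8 digits dataset, i.e., a 64-dimensional input space rather than 256; the numbers it prints are not the results in the table below) looks like this:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 8x8 images -> 64-dimensional inputs
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

for kernel in ("poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel} SVM test error: {100 * (1 - clf.score(X_test, y_test)):.1f}%")
```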

Handwritten Digit Recognition: Results on the USPS test set.
Classifier            Raw error (%)
Human performance     2.5
Decision tree, C4.5   16.2
Polynomial SVM        4.0
RBF SVM               4.1
Neural SVM            4.2

Exam Question 1 What are the two components of Generalization Error?

Exam Question 1 What are the two components of Generalization Error? Approximation Error and Estimation Error

Exam Question 2 What is the main difference between Empirical Risk Minimization and Structural Risk Minimization?

Exam Question 2 What is the main difference between Empirical Risk Minimization and Structural Risk Minimization? ERM: keep the confidence interval fixed (chosen a priori) while minimizing the empirical risk. SRM: minimize both the confidence interval and the empirical risk simultaneously.

Exam Question 3 What differs between SVM implementations (polynomial, radial, NN, etc.)?

Exam Question 3 What differs between SVM implementations (polynomial, radial, NN, etc.)? The Kernel function.

References
- Vapnik, V. The Nature of Statistical Learning Theory.
- Gunn, S. Support Vector Machines for Classification and Regression. http://www.dec.usc.es/persoal/cernadas/tc03/mc/SVM.pdf
- Moore, A. SVM Tutorial. http://www.cs.cmu.edu/~awm/tutorials

Any Questions?