1
Statistical Learning Theory: Classification Using Support Vector Machines
John DiMona. Some slides based on Prof. Andrew Moore at CMU.
2
(Rough) Outline
- Empirical Modeling
- Risk Minimization Theory
  - Empirical Risk Minimization
  - Structural Risk Minimization
- Optimal Separating Hyperplanes
- Support Vector Machines
- Example
- Questions
3
Empirical Data Modeling
Observations of a system are collected. Based on these observations, a process of induction is used to build up a model of the system. This model is then used to deduce responses of the system not yet observed. Observations could be a wide variety of things (medical records, ecological data, consumer trends) depending on what you are trying to model. The goal is to use what is already known to develop a generalized model of a system.
4
Empirical Data Modeling
Data obtained through observation is finite and sampled by nature, and typically this sampling is non-uniform. Due to the high-dimensional nature of some problems, the data will form only a sparse distribution in the input space. Creating a model from this type of data is an ill-posed problem. Observational data is messy: sparsely distributed and incomplete. How can we use this incomplete data to better understand the underlying rules of a system?
5
Empirical Data Modeling
[Diagram: Globally Optimal Model, Best Reachable Model, Selected Model] The goal in modeling is to choose the model from the hypothesis space that is closest (with respect to some error measure) to the underlying function in the target space.
6
Error in Modeling Approximation Error is a consequence of the hypothesis space not exactly fitting the target space; the underlying function may lie outside the hypothesis space. A poor choice of the model space will result in a large approximation error (model mismatch). Estimation Error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space. Together these form the Generalization Error. Approximation error: based on the way we define the problem and represent the data, we may not be able to model the system exactly (in our medical example this could translate to not representing all relevant attributes of a person's medical condition). Estimation error: the result of the inductive process selecting a suboptimal model in the hypothesis space.
7
Error in Modeling [Diagram repeated from slide 5: Globally Optimal Model, Best Reachable Model, Selected Model; the goal is to choose the model in the hypothesis space closest to the underlying function in the target space.]
8
Pattern Recognition Given a system (x, y), where:
x is the item we want to classify and y is the true classification of that item.
Develop a model f(x) that best predicts the behavior of the system for all possible items. Given a set of data that fall into two classifications, we want to develop a model that will classify these elements with the least amount of error.
9
Supervised Learning A generator (G) of a set of vectors x, observed independently from the system with a fixed, unknown probability distribution P(x). A supervisor (S) who returns an output value y to every input vector x, according to the system's conditional probability function P(y|x) (also unknown). A learning machine (LM) capable of implementing a set of functions f(x, α), where α is a set of parameters. Generator: creates possible data items x. Supervisor: correctly classifies each x to y = 1 or y = -1. Learning machine: capable of implementing all functions f(x, α) in the hypothesis space.
10
Supervised Learning Training: the generator creates a set of vectors and the supervisor provides the correct classification of each to form the training set. The learning machine develops an estimation function using the training data. The estimation function is then used to classify new, unseen data.
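Written out (the slide's formulas did not transcribe; this is the standard formulation used by Vapnik, consistent with the definitions above), the training set is a set of pairs drawn independently from the fixed but unknown joint distribution P(x, y) = P(x) P(y|x):

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_\ell, y_\ell), \qquad y_i \in \{-1, +1\} \]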
11
Risk Minimization In order to choose the best estimation function we must have a measure of discrepancy between the true classification of x and an estimated classification f(x, α). For pattern recognition we use the loss function: if the estimated classification is correct there is no loss; otherwise there is total loss.
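Written as a formula (reconstructed here, since the slide's equation did not transcribe), the 0/1 loss described above is:

\[ L\bigl(y, f(x, \alpha)\bigr) = \begin{cases} 0, & y = f(x, \alpha) \\ 1, & y \ne f(x, \alpha) \end{cases} \]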
12
Risk Minimization The expected value of the loss with respect to some estimation function f(x, α) defines the risk of that function.
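The risk functional itself (standard form, given here because the slide's equation did not transcribe) is the expected loss under the unknown joint distribution:

\[ R(\alpha) = \int L\bigl(y, f(x, \alpha)\bigr)\, dP(x, y) \]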
13
Risk Minimization The expected value of the loss with respect to some estimation function f(x, α) defines its risk. Goal: find the function that minimizes the risk over all functions f(x, α). We want to find the estimation function that has the lowest risk out of all the estimation functions in the hypothesis space.
14
Risk Minimization The expected value of the loss with respect to some estimation function f(x, α) defines its risk. Goal: find the function that minimizes the risk over all functions f(x, α). Problem: by definition we don't know the distribution P(x, y). How can we measure the accuracy of a model without this knowledge?
15
To make things clearer…
For the coming discussion we will shorten notation in the following ways: the training set will be referred to as … , and the loss function will be written … .
16
Empirical Risk Minimization (ERM)
Instead of measuring risk over the set of all possible items, measure it over just the training set. The empirical risk must converge uniformly to the actual risk over the set of loss functions, in both directions. Idea: measure the risk of an estimation function over the data for which P(x) and P(y|x) are known: the training data! In order for this to be an accurate heuristic it must converge in the same way the actual risk does. This is the approach taken by classical neural networks, such as those using the back-propagation method.
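The empirical risk replaces the unknown distribution with an average over the training examples (standard form, reconstructed since the slide's equation did not transcribe):

\[ R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} L\bigl(y_i, f(x_i, \alpha)\bigr) \]

ERM chooses the function f(x, α) that minimizes R_emp over the hypothesis space.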
17
VC Dimension (Vapnik–Chervonenkis Dimension)
The VC dimension is a scalar value that measures the capacity of a set of functions. The VC dimension of a set of functions is h if and only if there exists a set of h points that can be separated by those functions in all possible configurations, and no set of h + 1 points exists satisfying this property. The VC dimension describes the ability of a set of functions to separate data in a space.
18
VC Dimension (Vapnik–Chervonenkis Dimension)
Three points in the plane can be shattered by the set of linear indicator functions, whereas four points cannot. The set of linear indicator functions in n-dimensional space has a VC dimension equal to n + 1.
19
Upper Bound for Risk It can be shown that the actual risk is bounded by the empirical risk plus a confidence interval that depends on the VC dimension h of the set of functions and the number of observations.
ERM only minimizes the empirical risk; the confidence interval is fixed, based on the VC dimension of the set of functions, which is determined a priori. When implementing ERM one must tune the confidence interval based on the problem to avoid underfitting/overfitting the data. The actual risk of a function is bounded by the empirical risk of that function plus a confidence interval depending on the training data and the VC dimension. ERM is inaccurate when the confidence interval is large. The structure of the hypothesis space determines the confidence interval, and therefore when implementing ERM one needs a priori knowledge to build a learning machine with a small confidence interval.
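The inequality referred to above has the following standard form from Vapnik's theory (reconstructed here because the slide's equation did not transcribe): with probability at least 1 - η, for every function in a set of VC dimension h trained on ℓ observations,

\[ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}} \]

The square-root term is the confidence interval: it grows with the VC dimension h and shrinks as the number of observations ℓ grows.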
20
Structural Risk Minimization (SRM)
SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously. The first term depends on a specific function's error and the second depends on the VC dimension of the space that function is in; therefore the VC dimension must be a controlling variable. Structural risk minimization attempts to minimize the entire right-hand side of the inequality. The VC dimension is a controlling factor because both the confidence interval and the function set depend on it.
21
Structural Risk Minimization (SRM)
We define our hypothesis space to be the set of functions S. We say that S_k is the hypothesis subset of VC dimension h_k, with the subsets nested as shown below. For a set of observations, SRM chooses the function minimizing the empirical risk in the subset S_k for which the guaranteed risk is minimal. SRM breaks up the hypothesis space into subsets based on VC dimension. SRM effectively chooses the model with the lowest empirical risk in the VC dimension that produces the smallest guaranteed risk.
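The nested structure referred to above is usually written as follows (standard SRM form, assumed here since the slide's formulas did not transcribe):

\[ S_1 \subset S_2 \subset \cdots \subset S_n, \qquad h_1 \le h_2 \le \cdots \le h_n \]

where S_k is the subset of the hypothesis space with VC dimension h_k. For each k one minimizes the empirical risk within S_k, and SRM selects the k whose guaranteed risk (empirical risk plus confidence interval) is smallest.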
22
Structural Risk Minimization (SRM)
SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the VC dimension increases, the minima of the empirical risks decrease but the confidence interval increases. SRM is more general than ERM because it uses the subset for which minimizing the empirical risk yields the best bound on the actual risk. There is a trade-off between minimizing the empirical risk and choosing a VC dimension that provides a small confidence interval. SRM is more general than ERM because the VC dimension of a machine implementing ERM must be chosen a priori.
23
Support Vector Classification
Uses the SRM principle to separate two classes by a linear indicator function induced from the available examples. The goal is to produce a classifier that will work well on unseen examples, i.e. one that generalizes well.
24
Linear Classifiers [scatter plot; one marker denotes +1, the other denotes -1]
Imagine a training set such as this. What is the best way to separate this data?
25
Linear Classifiers [scatter plot; one marker denotes +1, the other denotes -1]
Imagine a training set such as this. What is the best way to separate this data? All of these are correct but which is the best?
26
Linear Classifiers [scatter plot; one marker denotes +1, the other denotes -1]
Imagine a training set such as this. What is the best way to separate this data? All of these are correct, but which is the best? The maximum margin classifier maximizes the distance from the hyperplane to the nearest data points (the support vectors).
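As a concrete illustration of the maximum margin classifier, here is a minimal sketch using scikit-learn (a library choice not made in the slides) on made-up data; the point is only to show the support vectors and the margin 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D training set; labels follow the slides' +1 / -1 convention.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 3.5],    # class +1
              [4.0, 1.0], [5.0, 0.5], [4.5, 1.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin case for separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                           # normal vector w of the hyperplane
print("support vectors:\n", clf.support_vectors_)
print("margin M = 2 / ||w|| =", 2.0 / np.linalg.norm(w))
```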
27
Defining the Optimal Hyperplane
The optimal hyperplane separates the training set with the largest margin M.
28
Defining the Optimal Hyperplane
The optimal hyperplane separates the training set with the largest margin M. The margin is defined as the distance from any point on the minus plane to the closest point on the plus plane.
29
Defining the Optimal Hyperplane
The optimal hyperplane separates the training set with the largest margin M. The margin is defined as the distance from any point on the minus plane to the closest point on the plus plane. We need to find M in terms of w.
30
Defining the Optimal Hyperplane
Because w is perpendicular to the hyperplane
31-37
Defining the Optimal Hyperplane
[Slides 31-37 derive the margin M in terms of w step by step; the equations did not transcribe. The derivation concludes: so we want to maximize … or minimize … (see the reconstruction below).]
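The step-by-step derivation on slides 31-37 did not transcribe; what follows is the standard argument (as in the Andrew Moore tutorial these slides credit), offered as a reconstruction. Let the plus and minus planes be w · x + b = +1 and w · x + b = -1, take x⁻ on the minus plane, and let x⁺ be the closest point to it on the plus plane. Because w is perpendicular to both planes,

\[ x^+ = x^- + \lambda w \quad \text{for some scalar } \lambda, \qquad
   w \cdot x^+ + b = 1 \;\Rightarrow\; \underbrace{(w \cdot x^- + b)}_{=-1} + \lambda\, (w \cdot w) = 1 \;\Rightarrow\; \lambda = \frac{2}{w \cdot w} \]

\[ M = \lVert x^+ - x^- \rVert = \lambda \lVert w \rVert = \frac{2}{\lVert w \rVert} \]

So maximizing the margin M is equivalent to minimizing \(\tfrac{1}{2}\lVert w \rVert^2\).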
38
Quadratic Programming
Minimizing this objective is equivalent to maximizing a dual equation in the non-negative quadrant under an equality constraint. This is derived using the Lagrange functional (the standard form is given below).
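The dual problem referred to on this slide has the following standard form (reconstructed, since the slide's equations did not transcribe): maximize, over Lagrange multipliers α_i ≥ 0 (the non-negative quadrant),

\[ W(\alpha) = \sum_{i=1}^{\ell} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \]

subject to the constraint \(\sum_{i=1}^{\ell} \alpha_i y_i = 0\). Only the support vectors receive nonzero α_i, and w = \(\sum_i \alpha_i y_i x_i\).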
39
Extensions It is possible to extend to non-separable training sets by adding an error (slack) parameter and minimizing a penalized objective (see below). Data can be split into more than two classifications by using successive runs on the resulting classes.
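The non-separable extension mentioned above is usually written with slack variables ξ_i and a penalty constant C (standard soft-margin objective, assumed here since the slide's formula did not transcribe):

\[ \min_{w, b, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{\ell} \xi_i
   \qquad \text{subject to} \qquad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \]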
40
Support Vector (SV) Machines
Maps the input vectors into a high-dimensional feature space using a kernel function. In this feature space the optimal separating hyperplane is constructed. [Diagram: Input Space → Feature Space, with the optimal hyperplane in feature space]
41
Support Vector (SV) Machines
1-Dimensional Example
42
Support Vector (SV) Machines
1-Dimensional Example Easy!
43
Support Vector (SV) Machines
1-Dimensional Example Easy! Harder (impossible)
44
Support Vector (SV) Machines
1-Dimensional Example Easy! Harder (impossible) Project into a higher dimension
45
Support Vector (SV) Machines
1-Dimensional Example Easy! Harder (impossible) Project into a higher dimension Magic…
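A small sketch of this 1-dimensional example with made-up data (scikit-learn assumed, as before): points that cannot be separated on the line become separable after projecting x into (x, x²), and a polynomial kernel performs an equivalent mapping implicitly:

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, 1, 1])      # inner vs. outer points: impossible in 1-D

# Explicit projection into a higher-dimensional feature space.
X_feat = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear").fit(X_feat, y)
print(clf.score(X_feat, y))             # expected 1.0: separable in the (x, x^2) plane

# The same idea with an implicit mapping via a degree-2 polynomial kernel.
clf_poly = SVC(kernel="poly", degree=2, C=10.0).fit(x.reshape(-1, 1), y)
print(clf_poly.score(x.reshape(-1, 1), y))
```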
46
Support Vector (SV) Machines
Some possible ways to implement SV machines:
- Polynomial Learning Machines
- Radial Basis Function Machines
- Multi-Layer Neural Networks
These methods all implement different kernel functions, sketched below.
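Illustrative forms of the kernels behind the three machines listed above; the exact parameterizations (d, sigma, kappa, c) are common textbook choices, not taken from the slides:

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    # Polynomial learning machine: K(x, z) = (x . z + 1)^d
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, sigma=1.0):
    # Radial basis function machine: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
    # Two-layer neural network machine: K(x, z) = tanh(kappa (x . z) + c)
    return np.tanh(kappa * np.dot(x, z) + c)
```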
47
Two-Layer Neural Network Approach
Kernel is a sigmoid function. Implements the rule shown below. Using this technique the following are found automatically:
- the architecture of the two-layer machine, determining the number N of units in the first layer (the number of support vectors)
- the vectors of the weights in the first layer
- the vector of weights for the second layer (the values of α)
This is a powerful implementation because it automatically determines: the number of support vectors (the number of neurons in the first layer), the nonlinear transformation into the feature space (the weights on the first layer), and the values of α (the weights for the second layer).
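The decision rule this machine implements has the standard support-vector form (reconstructed; the slide's formula did not transcribe). With the sigmoid kernel, each support vector acts as one hidden unit of the first layer and the α_i are the second-layer weights:

\[ f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i\, y_i \tanh\!\bigl(\kappa\,(x_i \cdot x) + c\bigr) + b \right) \]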
48
Two-Layer Neural Network Approach
49
Handwritten Digit Recognition
Used the U.S. Postal Service database: 7,300 training patterns and 2,000 test patterns. The resolution of the database was 16 x 16 pixels, yielding a 256-dimensional input space.
50
Handwritten Digit Recognition
Classifier            Raw error %
Human performance     2.5
Decision tree, C4.5   16.2
Polynomial SVM        4.0
RBF SVM               4.1
Neural SVM            4.2
51
Exam Question 1 What are the two components of Generalization Error?
52
Exam Question 1 What are the two components of Generalization Error?
Approximation Error and Estimation Error
53
Exam Question 2 What is the main difference between Empirical Risk Minimization and Structural Risk Minimization?
54
Exam Question 2 What is the main difference between Empirical Risk Minimization and Structural Risk Minimization?
ERM: Keep the confidence interval fixed (chosen a priori) while minimizing empirical risk.
SRM: Minimize both the confidence interval and the empirical risk simultaneously.
55
Exam Question 3 What differs between SVM implementations (polynomial, radial, NN, etc.)?
56
Exam Question 3 What differs between SVM implementations (polynomial, radial, NN, etc.)? The Kernel function.
57
References
Vapnik: The Nature of Statistical Learning Theory
Gunn: Support Vector Machines for Classification and Regression
Andrew Moore's SVM Tutorial
58
Any Questions?