The Nature of Statistical Learning Theory by V. Vapnik


1 The Nature of Statistical Learning Theory by V. Vapnik
Statistical Learning Theory & Classification Based on Support Vector Machines. The Nature of Statistical Learning Theory by V. Vapnik. Turn off all the lights; introduce myself and my major; introduce the topic. Anders Melen

2 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session. We are going to start with empirical data modeling, which we've talked about many times in this course, so it should feel very familiar; if you're unsure, you should pick up the idea pretty quickly in a few minutes. Then comes a quick, simplified explanation of what statistical learning theory is all about, an example of what supervised learning is, risk minimization, and so on. There is a lot to talk about here. Every time I reach a new section I'm going to jump back to the table of contents so you know we're moving on to a new topic. I generally talk really fast during presentations, so if I do, please feel free to yell at me so you can keep up. I also tried to strike a good balance between covering high-level concepts and some important low-level details.

3 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session. Let's start off with something we should all be familiar with: empirical data modeling.

4 Empirical Data Modeling
Observations of a system are collected. Induction on the observations is used to build up a model of the system. The model is then used to deduce responses of the system in unobserved cases. Sampling is typically non-uniform, and high-dimensional problems will form a sparse distribution in the input space. This basically refers to any kind of computer modeling that uses empirical observations rather than known mathematical relationships, empirical data being data that has been gathered by observation.
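As a minimal sketch of this observe-induce-deduce loop (the linear system and noise model below are illustrative assumptions, not from the slides):

```python
# Hedged toy example: induce a model from empirical observations of a
# system, then deduce its response at an input we never observed.
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.uniform(0.0, 10.0, size=30)              # non-uniform sampling
y_obs = 2.0 * x_obs + 1.0 + rng.normal(0, 0.5, 30)   # noisy observations

coeffs = np.polyfit(x_obs, y_obs, deg=1)             # induction: build the model
print(np.polyval(coeffs, 4.2))                       # deduction: predict unseen input
```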

5 Modeling Error
Approximation error is the consequence of the hypothesis space not fitting the target space. [Diagram: globally optimal model, best reachable model, selected model.] The underlying function may lie outside the hypothesis space, so a poor choice of the model space will result in a large approximation error (model mismatch).

6 Modeling Error
Estimation error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space. [Diagram: the gap between the globally optimal model and the best reachable model is the approximation error; the gap between the best reachable model and the selected model is the estimation error.] Together, the approximation error and the estimation error form the generalization error, the gap between the globally optimal model and the selected model.

7 Modeling Error
Approximation error is the consequence of the hypothesis space not fitting the target space. [Diagram: globally optimal model, best reachable model, selected model.] Goal: choose a model from the hypothesis space which is closest (with respect to some error measure) to the function in the target space.

8 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

9 Statistical Learning Theory
Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17) Learning itself falls into several categories: unsupervised learning, supervised learning, online learning, and reinforcement learning. The first two are the most important here.

10 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

11 Model of Supervised Learning
Training: the supervisor takes each generated x value and returns an output value y. Each (x, y) pair becomes part of the training set, drawn from the joint distribution F(x,y) = F(x) F(y|x): (x1, y1), (x2, y2), … , (xl, yl). Many of the algorithms we have already discussed in this course are based on supervised learning. The factored form F(x)F(y|x) is the expanded form of the conditional probability most of us are familiar with; it reads “the probability of y given x.” What the diagram is essentially saying is that we pass each training x to both the supervisor S and the learning machine LM, where each training row has a solution y. Thus we supervise the building of a model that can accurately predict future values outside of our training set.
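A minimal sketch of this setup (the particular generator and supervisor below are illustrative assumptions, not from the slides): x is drawn according to F(x), and the supervisor returns y according to F(y|x), yielding the training pairs.

```python
# Hedged sketch: generator draws x ~ F(x); supervisor returns y ~ F(y|x).
import numpy as np

rng = np.random.default_rng(0)
l = 100
x = rng.uniform(-1.0, 1.0, size=l)               # generator: x ~ F(x)
y = np.sin(np.pi * x) + rng.normal(0, 0.1, l)    # supervisor: y ~ F(y|x)
training_set = list(zip(x, y))                   # (x1, y1), ..., (xl, yl)
```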

12 Table of Contents Empirical Data Modeling
What is Statistical Learning Theory Model of Supervised Learning Risk Minimization Vapnik-Chervonenkis Dimensions Structural Risk Management (SRM) Support Vector Classification Optimal Separating Hyperplane & Quadratic Programming Support Vector Machines (SVM) Exam Questions Q & A Session 11

13 Risk Minimization
To find the best function, we need to measure loss. L(y, F(x,𝛂)) is the discrepancy function, based on the y's returned by the supervisor and the ŷ's generated by the estimating functions; we want the predictor F for which the expected loss is minimized. This is a very common setup in machine learning: we look at training data drawn from an unknown distribution and determine a predictor f() such that the expected loss is minimized, giving us better predictive accuracy.
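For completeness, the quantity being minimized is the risk functional, the expected loss over the unknown joint distribution (reconstructed here in the book's standard form, since the slide's formula image is missing from this transcript):

R(𝛂) = ∫ L(y, F(x,𝛂)) dF(x,y)

Learning means choosing the 𝛂 that minimizes R(𝛂) when F(x,y) is unknown and only training samples are available.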

14 Risk Minimization: Pattern Recognition
In pattern recognition, the supervisor's output y can only take on two values, y ∈ {0,1}, and the loss takes the values shown below. The risk function then gives the probability that the supervisor and the estimating function give different answers. The main difference in the pattern recognition setting is that the output y is a Boolean value.
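The slide's loss table did not survive extraction; the standard 0-1 loss it refers to is:

L(y, F(x,𝛂)) = 0 if y = F(x,𝛂), and 1 if y ≠ F(x,𝛂)

so that the risk R(𝛂) is simply the probability of a classification error.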

15 Some Simplifications From Here On
Training set: {(x1,y1), … , (xl,yl)} → {z1, … , zl}. Loss function: L(y, F(x,𝛂)) → Q(z,𝛂). From here on I want to declare a few simplifications, namely for the training set and the loss function. Keep these in mind, as they are used throughout the rest of the presentation.

16 Empirical Risk Minimization (ERM)
We want to measure the risk over the training set rather than over the set of all possible observations.
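The formula the slide shows is a lost image; the standard empirical risk is the average loss over the l training samples:

Remp(𝛂) = (1/l) Σ(i=1..l) Q(zi,𝛂)

and the ERM principle approximates the minimizer of R(𝛂) by the minimizer of Remp(𝛂).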

17 Empirical Risk Minimization (ERM)
The empirical risk must converge to the actual risk over the set of loss functions. This is the really important fact, denoted by the limit on the slide.
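The limit itself is missing from the transcript; the standard uniform (two-sided) convergence condition is:

lim(l→∞) P{ sup(𝛂∈𝞚) |R(𝛂) − Remp(𝛂)| > ε } = 0 for every ε > 0

i.e., the worst-case gap between empirical and actual risk over the whole function class must vanish in probability as the sample size grows.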

18 Empirical Risk Minimization (ERM)
In both directions! The convergence indeed has to hold in both directions; again, the limit denotes the required convergence criterion.

19 What do we need to address here?
What are the necessary and sufficient conditions for consistency of a learning process based on the ERM principle? At what rate does the learning process converge? How can we control the rate of convergence of learning?

20 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

21 Vapnik-Chervonenkis Dimensions
Let's just call them VC dimensions. Developed by Alexey Jakovlevich Chervonenkis & Vladimir Vapnik. The VC dimension is a scalar value that measures the capacity of a set of functions.

22 Vapnik-Chervonenkis Dimensions
The VC dimension of a set of functions is responsible for the generalization ability of learning machines. The VC dimension of a set of indicator functions Q(z,𝛂), 𝛂 ∈ 𝞚, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set.
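For intuition, here is a hedged sketch (assuming scikit-learn and NumPy are available; the points and helper function are illustrative, not from the slides) that brute-force checks the classic fact that linear classifiers in the plane can shatter 3 points in general position:

```python
# Hedged sketch: verify that 3 non-collinear points in R^2 can be labeled
# in all 2^3 = 8 ways by linear classifiers, i.e., they are "shattered".
import itertools
import numpy as np
from sklearn.svm import LinearSVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # general position

def separable(X, y):
    if len(set(y)) < 2:           # single-class labelings are trivially separable
        return True
    clf = LinearSVC(C=1e6, max_iter=100000).fit(X, y)
    return clf.score(X, y) == 1.0  # perfect training accuracy = linearly separable

shattered = all(separable(points, list(labels))
                for labels in itertools.product([0, 1], repeat=3))
print("3 points shattered by linear classifiers:", shattered)  # True
```

No arrangement of 4 points in the plane can be shattered by lines, so the VC dimension of linear indicator functions in R² is exactly 3.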

23 Upper Bound For Risk
It can be shown that the actual risk is bounded by the empirical risk plus a confidence interval that depends on the VC dimension h (see the reconstructed bound below).
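The inequality itself is a lost image; a standard form of the bound from the book (for indicator functions, holding with probability 1 − η) is:

R(𝛂) ≤ Remp(𝛂) + sqrt( (h(ln(2l/h) + 1) − ln(η/4)) / l )

where the square-root term is the confidence interval, h is the VC dimension, and l is the number of training samples.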

24 Upper Bound For Risk
ERM only minimizes Remp(𝛂); the confidence interval is fixed, determined a priori by the VC dimension of the set of functions. To avoid overfitting and underfitting, the confidence interval must also be tuned to the problem, which ERM alone does not do.

25 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

26 Structural Risk Minimization (SRM)
SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously.

27 Structural Risk Minimization (SRM)
The empirical-risk term depends on a specific function's error, while the confidence-interval term depends on the VC dimension of the space the functions live in. The VC dimension is the controlling variable.

28 Structural Risk Minimization (SRM)
We define the hypothesis space S to be the set of functions Q(z,𝛂), 𝛂 ∈ 𝞚. We say that Sk = {Q(z,𝛂)}, 𝛂 ∈ 𝞚k, is the hypothesis subspace of VC dimension k, where the subspaces are nested as shown below. For a set of observations {z1, …, zn}, SRM will choose the loss function Q(z,𝛂) minimizing the empirical risk in the subset Sk for which the guaranteed risk is minimal. There are a few takeaway messages for SRM. SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function: as the VC dimension increases, the minima of the empirical risk decrease, but the confidence interval increases. SRM is more general than ERM alone because it uses the subset Sk for which minimizing Remp(𝛂) (the empirical risk of 𝛂) yields the best bound on R(𝛂) (the risk of 𝛂).
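The nesting condition the slide's missing formula expressed is, in its standard form:

S1 ⊂ S2 ⊂ … ⊂ Sn, with VC dimensions h1 ≤ h2 ≤ … ≤ hn

so each larger subset has capacity at least as great as the one before it.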

29 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

30 Support Vector Classification
Uses the SRM principle to separate two classes by a linear indicator function induced from examples in the training set. The goal is to produce a classifier that will work well on unseen test examples; we want the classifier with the maximum generalizing capacity, which essentially means the classifier with the lowest risk. How would you classify this data into two distinct classes?

31 Support Vector Classification
All of these lines work as linear classifiers, but which one is the best choice? It's really easy for us to look at the data and come up with a lot of linear classifiers that classify it accurately. But which one is really the best choice, and what makes it the best?

32 Support Vector Classification
The margin of a linear classifier is defined as the width the boundary can be increased by before hitting a data point. Here the yellow band denotes the largest boundary width, i.e., the margin, of a randomly chosen linear classifier. Can we do better? Of course!

33 Support Vector Classification
How about a better choice of vector? Here you can clearly see we were able to double the margin by choosing a different linear classifier. This is the simplest kind of SVM, called a linear SVM.

34 Support Vector Classification
The classifier is defined by an intercept b and a weight vector w perpendicular to the separating line, so that the sign of the dot product w · x + b tells us which side of the boundary a point lies on. [Diagram: plus plane and minus plane.]

35 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

36 Optimal Separating Hyperplane
Margin M = |x- − x+|. Plus plane: (w · x) + b ≥ +1. Minus plane: (w · x) + b ≤ −1. The margin is defined as the distance from any point on the minus plane to the closest point on the plus plane. How do we find the margin in terms of b and w?

37 Optimal Separating Hyperplane
M = |x- − x+|, with (w · x+) + b = +1 and (w · x-) + b = −1, where x+ and x- are the closest points on the plus and minus planes. We define these two equations from the values of our two planes, and we write x+ = x- + 𝛌w, introducing the scalar 𝛌 on w, which we want to solve for.

38 Optimal Separating Hyperplane
We can do some quick math to combine the two equations: substituting x+ = x- + 𝛌w into (w · x+) + b = +1 gives w · (x- + 𝛌w) + b = 1, i.e., w · x- + b + 𝛌(w · w) = 1.

39 Optimal Separating Hyperplane
Since (w · x-) + b = −1, the combined equation w · x- + b + 𝛌(w · w) = 1 becomes −1 + 𝛌(w · w) = 1, so 𝛌 = 2 / (w · w). Thus we have solved the combined equations for 𝛌.

40 Optimal Separating Hyperplane
M = |x- − x+| = |𝛌w| = 𝛌|w| = 𝛌 sqrt(w · w). Now we plug the value of 𝛌 back into this equation for the distance between the two planes, i.e., the margin.
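Putting the derivation together (the final step is reconstructed, since the slide's concluding formula is an image missing from this transcript):

M = 𝛌 sqrt(w · w) = (2 / (w · w)) sqrt(w · w) = 2 / sqrt(w · w) = 2 / ||w||

so maximizing the margin M is equivalent to minimizing ½(w · w), which is exactly the quadratic objective the following slides refer to.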

41 Optimal Separating Hyperplane
General optimal hyperplane: extend to non-separable training sets by adding an error parameter for each training point and minimizing a combined objective (see below). Given that, we can come up with a generalized optimal hyperplane. The data can also be split into more than two classifications by using successive runs on the resulting classes.
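The minimized expression is a lost image; the standard soft-margin objective from the book is:

minimize ½(w · w) + C Σi ξi subject to yi((w · xi) + b) ≥ 1 − ξi, ξi ≥ 0

where the slack variables ξi are the per-point error parameters and the constant C sets the trade-off between margin width and training errors.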

42 Quadratic Programming
In the linearly separable case we'd want to minimize ½(w · w); in the dual formulation we instead want to maximize a functional over the non-negative quadrant, 𝛂i ≥ 0 for i = 1, …, l, under the constraint that Σi 𝛂iyi = 0. The dual is spelled out below.
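The maximized functional is another lost image; in the standard formulation it is:

W(𝛂) = Σi 𝛂i − ½ Σi Σj 𝛂i 𝛂j yi yj (xi · xj)

maximized subject to 𝛂i ≥ 0 for i = 1, …, l and Σi 𝛂iyi = 0. The training points with 𝛂i > 0 are the support vectors.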

43 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

44 Support Vector Machines (SVM)
Map input vectors x into a high-dimensional feature space, where inner products between images zi and z are computed by a kernel function: (zi · z) = K(xi, x). In this feature space the optimal separating hyperplane is constructed. A practical sketch follows.
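As a hedged, practical sketch (assuming scikit-learn; the circular toy data is an illustrative assumption, not from the slides), a polynomial kernel lets the optimizer construct the hyperplane in the implicit feature space without ever computing the mapping explicitly:

```python
# Hedged sketch: kernelized SVM separating data with no linear solution.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # circle: not linearly separable

# Degree-2 polynomial kernel ((x . x') + 1)^2; hyperplane lives in feature space.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors:", clf.n_support_.sum())
```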

45 Support Vector Machines (SVM)
Feature space… optimal hyperplane… what the heck am I talking about? I really can't stand this diagram, so I found a little animation that better illustrates the concept of an SVM.

46 Support Vector Machines (SVM)
As you watch, notice there exists no linear solution to this classification problem, so what are we going to do? We are going to use a polynomial kernel so that we can find a plane that correctly classifies the data points. To simplify: we project the data into a higher-dimensional space where a separating hyperplane that accurately classifies the data exists.

47 Support Vector Machines (SVM)
Let's try a basic one-dimensional example! Let's start off with a really basic one-dimensional example: can we find a separator that accurately classifies the data points?

48 Support Vector Machines (SVM)
Aw snap, that was easy! Well, that was really easy!

49 Support Vector Machines (SVM)
OK, what about a harder one-dimensional example? Let's try a slightly harder one-dimensional example. Can anyone figure out what we need to do?

50 Support Vector Machines (SVM)
Project the lower-dimensional data into a higher-dimensional space, just like in the animation! All we need to do is use a little math to project our lower-dimensional data into a higher-dimensional space, as sketched below.
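A minimal sketch of that projection (the data values are illustrative assumptions, not from the slides): inner points belong to one class and outer points to the other, so no threshold on x separates them, but the map x → (x, x²) makes the classes linearly separable.

```python
# Hedged sketch: lift 1-D data into 2-D so a line can separate the classes.
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])   # outer vs. inner points

features = np.column_stack([x, x ** 2])    # project into 2-D feature space
# In feature space the horizontal line  x^2 = 2  separates the two classes:
predictions = np.where(features[:, 1] > 2.0, 1, -1)
print(np.array_equal(predictions, y))      # True: linearly separable now
```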

51 Support Vector Machines (SVM)
There are several ways to implement an SVM: polynomial learning machines (like the animation), radial basis function machines, and two-layer neural networks. The different implementations all use different kernel functions. We already saw a quick polynomial example in the animation; let's take a quick look at the two-layer neural network implementation.
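For reference, the standard kernels behind these three machine types (reconstructed from the book, since the slide lists no formulas) are:

Polynomial: K(x, xi) = ((x · xi) + 1)^d
Radial basis function: K(x, xi) = exp(−||x − xi||² / σ²)
Two-layer neural network: K(x, xi) = S(v(x · xi) + c), where S is a sigmoid function

The machine type is determined entirely by the choice of kernel K.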

52 Simple Neural Network
Neural networks are computer science models inspired by nature! The brain is a massive natural neural network consisting of neurons and synapses, and neural networks can be described with a graphical model. If anyone has taken Josh Bongard's Evolutionary Robotics course, you should be well versed in neural networks. For those of you who haven't, I'll try to explain what they are quickly so you'll have some context for the following section.

53 Simple Neural Network
Neurons → nodes; synapses → edges. As a computer scientist you should be familiar with graphs as a data structure. To model a natural neural network like a brain we need to define a few things: neurons can be thought of as nodes, synapses can be thought of as edges, and synapses have weights associated with them. When values are passed into an input node they propagate through the neural network, being transformed by the synapse weights, until they reach the output nodes. [Diagram: biological form vs. neural network model.]

54 Two-Layer Neural Network
The kernel is a sigmoid function, implementing the rules shown on the slide.

55 Two-Layer Neural Network
Using this technique, the following are found automatically: the architecture of a two-layer machine; the number N of units in the first layer (the number of support vectors); the vectors of the weights wi = xi in the first layer; and the vector of weights for the second layer (the values of 𝛂).

56 Optical Character Recognition (OCR)
Data from the U.S. Postal Service database (1990): 7,300 training patterns and 2,000 test patterns collected from real-life zip codes. This is a real-world study done on U.S. Postal Service data from 1990, comparing C4.5 and a couple of different approaches to support vector machines. The purpose of the study was to perform optical character recognition on handwritten zip codes using different classification implementations.

57 Optical Character Recognition (OCR)
[Image: a large table of handwritten digit examples.] I found this image in the previous presentation, and other than the fact that it's a giant table of handwritten examples, I'm not really sure what its purpose is; I imagine it had been part of the training data itself, with a solutions matrix somewhere else.

58 Optical Character Recognition (OCR)
Here are the results of the study, with human performance as a control. [Image: results table not preserved in this transcript.]

59 Conclusion
The quality of a learning machine is characterized by three main components: How rich and universal is the set of functions that the LM can approximate? How well can the machine generalize? How fast does the learning process for this machine converge?

60 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

61 Exam Question #1
What is the main difference between polynomial learning machines, radial basis function machines, and neural network learning machines? Answer: the kernel function.

62 Exam Question #2
What is empirical data modeling? Give a summary of the main concept and its components. Answer: empirical data modeling is induction on observations to build up a model; the model is then used to deduce responses of an unobserved system.

63 Exam Question #3
What must Remp(𝛂) do over the set of loss functions? Answer: it must converge to the actual risk R(𝛂) over the set of loss functions.

64 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

65 End Any questions?

