
1 STATISTICAL LEARNING THEORY & CLASSIFICATIONS BASED ON SUPPORT VECTOR MACHINES. Presenter: Xipei Liu. 2016-05-02. Vapnik, Vladimir. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1995.

2 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

3 Intro: Book and Jargon. Book structure: set-up, practice, theory, summary. Chap 1: what is the learning problem; Chap 2: philosophy of learning theory; Chaps 3-4: optimizing generalization ability; Chap 5: neural networks, SVM; Chap 6: some highlights.

4 Intro: Book and Jargon. About statistical learning (machine learning): Supervised learning (regression, classification): each observation has both a predictor and a response measurement (x, y). Unsupervised learning (clustering, PCA): each observation has only predictors (x). Parametric methods (e.g. regressions): two-step, model based, assuming a functional form for the observations. Non-parametric methods (e.g. kernel density estimation, clustering): no explicit functional model.
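
To make the distinction concrete, here is a minimal sketch (illustrative, not from the book) contrasting the two settings; the data is synthetic and it assumes NumPy and scikit-learn are available.

```python
# Illustrative sketch: supervised learning uses (x, y) pairs, unsupervised uses x only.
import numpy as np
from sklearn.linear_model import LinearRegression  # supervised: regression
from sklearn.cluster import KMeans                 # unsupervised: clustering

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))                         # predictors x
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # responses y (supervised setting only)

reg = LinearRegression().fit(X, y)                    # learns the mapping x -> y
print("estimated slope:", reg.coef_[0])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # finds structure in x alone
print("cluster sizes:", np.bincount(km.labels_))
```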

5 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

6 Empirical Data Modeling. Process: observations of a system are collected; induction on the observations is used to build a model of the system; the model is then used to deduce responses of the unobserved system. Sampling is typically non-uniform, and high-dimensional problems form a sparse distribution in the input space.

7 Modeling Error. Approximation error is the consequence of the hypothesis space not fitting the target space. (Diagram: globally optimal model, best reachable model, selected model.)

8 Modeling Error. Goal: choose a model from the hypothesis space which is closest (with respect to some error measure) to the function in the target space. Approximation error is the consequence of the hypothesis space not fitting the target space. (Diagram: globally optimal model, best reachable model, selected model.)

9 Modeling Error. Estimation error is the error between the best reachable model in our hypothesis space and the model we actually selected. Together with the approximation error, this forms the generalization error. (Diagram: globally optimal model, best reachable model, selected model; approximation error + estimation error = generalization error.)

10 Modeling Error. The gap between the globally optimal model and the selected model is the generalization error, which measures how well our data model adapts to new and unobserved data.

11 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

12 Model of Supervised Learning. Statistical learning theory considers the learning problem as a problem of finding a desired dependence using a limited number of observations. The general model of learning from examples: (i) a generator (G) of random vectors x, drawn independently from a fixed but unknown probability distribution function F(x); (ii) a supervisor (S) who returns an output value y to every input vector x, according to a conditional distribution function F(y|x), also fixed but unknown; (iii) a learning machine (LM) capable of implementing a set of functions f(x, α), α ∈ Λ, where Λ is a set of parameters.

13 Model of Supervised Learning. Training: (S) takes each generated x and returns an output value y; each pair (x, y) then goes to (LM), which returns some value f(x, α). The selection of the desired function is based on a training set of l independent and identically distributed (i.i.d.) observations (x_1, y_1), ..., (x_l, y_l).
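
As an illustration only (the distributions below are made up, since F(x) and F(y|x) are unknown to the learner), a small NumPy sketch of how G and S produce the i.i.d. training set:

```python
# Illustrative sketch of the learning model: a generator G draws x i.i.d. from a
# fixed distribution F(x); a supervisor S returns y according to a fixed
# conditional F(y|x); the resulting pairs (x_i, y_i) form the training set.
import numpy as np

rng = np.random.default_rng(1)
l = 20                                          # number of i.i.d. observations

def generator(n):
    # G: draws x from a distribution that is unknown to the learning machine
    return rng.uniform(-1.0, 1.0, size=n)

def supervisor(x):
    # S: returns y according to a (noisy) conditional distribution F(y|x)
    return (x + rng.normal(scale=0.2, size=x.shape) > 0).astype(int)

x_train = generator(l)
y_train = supervisor(x_train)
training_set = list(zip(x_train, y_train))      # (x_1, y_1), ..., (x_l, y_l)
print(training_set[:3])
```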

14 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

15 Risk Minimization. To find the best approximation, we measure the loss L(y, f(x, α)): the discrepancy between the y provided by the supervisor and the response f(x, α) provided by the learning machine. The risk functional is the expected loss, R(α) = ∫ L(y, f(x, α)) dF(x, y), where F(x, y) is unknown; the only available information is in the training set.

16 Risk Minimization. Pattern recognition: the supervisor's output y can take only two values, y ∈ {0, 1}, and the loss is the 0-1 loss, L(y, f(x, α)) = 0 if y = f(x, α) and 1 otherwise. So the risk functional determines the probability that the answers given by the supervisor and by the estimation function differ.

17 Risk Minimization. The risk is the expected value of the loss with respect to some estimation function: R(α) = ∫ Q(z, α) dF(z), where z = (x, y) and Q(z, α) = L(y, f(x, α)).

18 Next: given the training set z_1, ..., z_l and the loss function Q(z, α), we therefore approximate the expected risk by the empirical risk R_emp(α) = (1/l) Σ_{i=1..l} Q(z_i, α).

19 Empirical Risk Minimization (ERM). Measure the risk just over the training set: approximate the function Q(z, α_0) that minimizes R(α) by the function Q(z, α_l) that minimizes R_emp(α).

20 Empirical Risk Minimization (ERM). For ERM to be consistent, the empirical risk must converge to the actual risk over the set of loss functions, in both directions.
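
As a toy illustration (not from the book), the sketch below performs ERM for a one-parameter set of threshold classifiers f(x, a) = 1 if x > a and 0 otherwise, under the 0-1 loss; the data-generating rule is invented for the example.

```python
# Illustrative ERM sketch: pick the parameter that minimizes the empirical risk
# R_emp(a) = (1/l) * sum_i L(y_i, f(x_i, a)) over the training set.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = (x > 0.1).astype(int)                   # the "true" dependence, unknown to the learner

def empirical_risk(a, x, y):
    predictions = (x > a).astype(int)       # f(x, a) = indicator of x > a
    return float(np.mean(predictions != y)) # average 0-1 loss

candidates = np.linspace(-1, 1, 201)        # the set of functions, indexed by a
risks = [empirical_risk(a, x, y) for a in candidates]
a_erm = candidates[int(np.argmin(risks))]   # ERM chooses the empirical minimizer
print(f"ERM threshold a = {a_erm:.2f}, empirical risk = {min(risks):.3f}")
```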

21 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

22 VC Dimension (Vapnik-Chervonenkis). Definition: the VC dimension VC(H) of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

23 VC Dimension (Vapnik-Chervonenkis). The VC dimension is a scalar value that measures the capacity of a set of functions, and it is responsible for the generalization ability of learning machines. The VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, is the maximum number h of vectors z_1, ..., z_h that can be separated into two classes in all 2^h possible ways using functions of the set.
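
For intuition, the sketch below (illustrative, not from the book) brute-forces the shattering condition for three non-collinear points in the plane, consistent with the known VC dimension of 3 for linear indicator functions in R^2; it uses scikit-learn's LinearSVC with a large C as a stand-in for a hard linear separator.

```python
# Illustrative shattering check: can linear classifiers realize all 2^3 labelings
# of three non-collinear points in the plane?
import itertools
import numpy as np
from sklearn.svm import LinearSVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # non-collinear points

def separable(X, y):
    # True if some linear classifier reproduces the labeling y exactly on X
    clf = LinearSVC(C=1e6, max_iter=100000).fit(X, y)
    return bool(np.array_equal(clf.predict(X), y))

labelings = [np.array(lab) for lab in itertools.product([0, 1], repeat=3)]
# The two constant labelings are trivially realizable by a classifier with a bias
# term, and LinearSVC requires two classes, so only the mixed labelings are tested.
shattered = all(separable(points, lab) for lab in labelings if len(set(lab)) == 2)
print("all 2^3 labelings realizable:", shattered)
```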

24 Upper Bound for Risk. It can be shown that, with probability at least 1 - η, R(α) ≤ R_emp(α) + sqrt( (h (ln(2l/h) + 1) - ln(η/4)) / l ), where the square-root term is the confidence interval and h is the VC dimension.

25 Upper Bound for Risk. ERM only minimizes R_emp(α); the confidence interval is fixed by the VC dimension of the set of functions, which is determined a priori. When implementing ERM, one must therefore tune the confidence interval based on the problem to avoid underfitting or overfitting the data.
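
To see the trade-off numerically, here is a small sketch (illustrative) that evaluates the confidence term from the bound above for a few VC dimensions at a fixed sample size; the confidence level 1 - eta is an assumed value.

```python
# Illustrative sketch: the VC confidence term grows with the VC dimension h and
# shrinks with the sample size l.
import math

def vc_confidence(h, l, eta=0.05):
    # sqrt( (h * (ln(2l/h) + 1) - ln(eta/4)) / l ), as in the bound above
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

for h in (5, 50, 500):
    print(f"h = {h:3d}, l = 1000, confidence term = {vc_confidence(h, 1000):.3f}")
```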

26 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

27 Structural Risk Minimization (SRM). SRM attempts to minimize the right-hand side of the inequality above over both terms simultaneously.

28 Structural Risk Minimization (SRM). The empirical risk depends on the specific function chosen, while the confidence interval depends on the VC dimension of the set of functions that the machine can implement. The VC dimension is therefore the controlling variable.

29 Structural Risk Minimization (SRM). We define a structure on the hypothesis space: a nested sequence of subsets S_1 ⊂ S_2 ⊂ ... ⊂ S_n, where S_k is a set of functions of VC dimension h_k, such that h_1 ≤ h_2 ≤ ... ≤ h_n.

30 Structural Risk Minimization (SRM). SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the VC dimension increases, the minimum of the empirical risk decreases but the confidence interval increases. SRM is more general than ERM because it uses the subset for which minimizing the empirical risk yields the best bound on the actual risk, as in the sketch below.
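
A rough sketch of the SRM selection rule over a nested family (illustrative only; the random-search ERM and the assumed VC dimensions h_k = k + 1 for degree-k polynomial sign classifiers in one dimension are simplifications made for the example):

```python
# Illustrative SRM sketch: minimize the empirical risk inside each nested subset
# S_1 ⊂ S_2 ⊂ ..., then pick the subset whose empirical risk plus confidence term
# gives the smallest bound.
import math
import numpy as np

def vc_confidence(h, l, eta=0.05):
    # same confidence term as in the bound above
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

rng = np.random.default_rng(3)
l = 200
x = rng.uniform(-1, 1, size=l)
y = (np.sin(3 * x) > 0).astype(int)                     # invented target rule

def erm_in_subset(degree, x, y, n_candidates=2000):
    # Crude ERM inside S_k: random search over sign-of-polynomial classifiers
    best = 1.0
    for _ in range(n_candidates):
        coefs = rng.normal(size=degree + 1)
        pred = (np.polyval(coefs, x) > 0).astype(int)
        best = min(best, float(np.mean(pred != y)))
    return best

best_k, best_bound = None, float("inf")
for k in range(1, 8):                                   # S_k: polynomials of degree <= k
    emp = erm_in_subset(k, x, y)
    bound = emp + vc_confidence(k + 1, l)               # assumed h_k = k + 1
    print(f"degree {k}: R_emp = {emp:.3f}, bound = {bound:.3f}")
    if bound < best_bound:
        best_k, best_bound = k, bound
print("SRM choice: degree", best_k)
```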

31 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

32 Support Vector Machines (SVM): the feature space and the optimal hyperplane.

33 Getting started: support vector classification uses the SRM principle to separate two classes by a linear indicator function induced from the available examples in the training set. The goal is to produce a classifier that will work well on unseen test examples, i.e. the classifier with the maximum generalization capacity and the lowest risk.

34 Simplest case: linear classifiers. How would you classify this data?

35 Simplest case: linear classifiers. All of these lines work as linear classifiers; which one is the best?

36 Simplest case: linear classifiers. Define the margin of a linear classifier as the width by which the boundary can be increased before hitting a data point.

37 Simplest case: linear classifiers. Support vectors are the data points that the margin pushes up against. We want the maximum-margin linear classifier; this simplest form of SVM is called a linear SVM.
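
A minimal sketch (illustrative, not from the book) of a linear SVM on invented, separable 2-D data, using scikit-learn's SVC with a large C to approximate the hard-margin case; it reports the support vectors and the geometric margin 2/||w|| derived in the next slides.

```python
# Illustrative linear SVM: the support vectors are the points the margin pushes
# up against, and the geometric margin equals 2 / ||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],      # class 0
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)             # very large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("support vectors:\n", clf.support_vectors_)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
```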

38 Simplest case: linear classifiers. The +1 zone and the -1 zone are bounded by the plus plane and the minus plane. We can define these two planes by w, a vector perpendicular to them, and an offset b, so that (w * x) + b equals +1 on the plus plane and -1 on the minus plane.

39 The optimal separating hyperplane. How can we find the margin M in terms of w and b when the planes are defined as: plus plane, (w * x) + b = +1; minus plane, (w * x) + b = -1; and the classifier itself is the plane (w * x) + b = 0? Then...

40 The optimal separating hyperplane. The margin M is defined as the distance from any point on the minus plane to the closest point on the plus plane. The classification constraints are (w * x) + b ≥ +1 for points in the +1 class and (w * x) + b ≤ -1 for points in the -1 class.

41-48 The optimal separating hyperplane: deriving the margin. Take any point x- on the minus plane and let x+ = x- + λw be the closest point on the plus plane (the closest point lies along w, the direction perpendicular to both planes). Since (w * x+) + b = +1 and (w * x-) + b = -1, subtracting gives λ (w * w) = 2, so λ = 2 / (w * w). The margin is therefore M = |x+ - x-| = λ ||w|| = 2 / ||w||. Thus, we want to maximize M = 2 / ||w||, or equivalently minimize (w * w) / 2.

49 The optimal separating hyperplane. It is possible to extend this to non-separable training sets by adding slack (error) variables ξ_i and minimizing (w * w)/2 + C Σ ξ_i. Data can also be split into more than two classes by using successive runs on the resulting classes.

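
A short sketch (illustrative; the data and the values of C are invented) showing the role of the error parameter: in scikit-learn's SVC the constant C weights the slack term, so a small C tolerates more margin violations and a large C approaches the separable formulation.

```python
# Illustrative soft-margin sketch: C trades margin width against training error.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),       # class 0
               rng.normal(loc=2.0, size=(50, 2))])      # class 1 (overlapping)
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:6.2f}  training accuracy = {clf.score(X, y):.2f}  "
          f"support vectors = {len(clf.support_)}")
```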

51 Support Vector Machines (SVM). Getting there: map the input vectors x into a high-dimensional feature space, where the inner product of the images is given by a kernel function, (z_i * z) = K(x, x_i). The optimal separating hyperplane is then constructed in this feature space.

52 Support Vector Machines (SVM). Let's try a basic one-dimensional example!

53 Support Vector Machines (SVM). It is easy!

54 Support Vector Machines (SVM). OK, a harder one-dimensional example?

55 Support Vector Machines (SVM). Project the lower-dimensional data into a higher-dimensional space, as shown; in that space the classes become linearly separable.
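
A one-dimensional illustration (the map phi(x) = (x, x^2) is chosen for this example, not taken from the slides): points that no single threshold can separate become linearly separable after the projection.

```python
# Illustrative feature-space projection: lift x to (x, x^2) and separate linearly.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([1, 1, 0, 0, 0, 1, 1])        # outer vs. middle points: no single threshold works

phi = np.hstack([x, x ** 2])               # projection into a 2-D feature space
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print("training accuracy in feature space:", clf.score(phi, y))   # now separable
```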

56 Support Vector Machines (SVM). How are SV machines implemented? Polynomial learning machines, radial basis function machines, and two-layer neural networks. Each of these SV machine implementations uses a different kernel function.
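
As an illustration of "only the kernel differs" (invented data via scikit-learn's make_moons), the sketch below trains the same SVC with a polynomial, an RBF, and a sigmoid kernel; the sigmoid kernel is the one that corresponds to the two-layer neural network construction on the following slides.

```python
# Illustrative kernel comparison: the same SVM machinery with different kernels.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ("poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(f"{kernel:8s} kernel: training accuracy = {clf.score(X, y):.2f}")
```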

57 A little more: simple neural networks. Neural networks are computer science models inspired by nature; the brain is a massive natural neural network consisting of neurons and synapses. Neural networks can be modeled using a graphical model.

58 Simple Neural Network. Neurons → nodes; synapses → edges. (Diagram: molecular form vs. neural network model.)

59 Two-Layer Neural Network. Implementing the rules: the kernel is a sigmoid function, e.g. K(x, x_i) = tanh(v (x * x_i) + c).

60 Two-Layer Neural Network. Using this technique the following are found automatically: (i) the architecture of the two-layer machine; (ii) the number N of units in the first layer (the number of support vectors); (iii) the vectors of the weights w_i = x_i in the first layer; (iv) the vector of weights for the second layer (the values of α).

61 Conclusion. The quality of a learning machine is characterized by three main components: (a) how rich and universal is the set of functions that the LM can approximate; (b) how well can the machine generalize; (c) how fast does the learning process for this machine converge.

62 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

63 Exam Question #1. What is the main difference between Empirical Risk Minimization and Structural Risk Minimization?

64 Exam Question #1. ERM: keep the confidence interval fixed (chosen a priori) while minimizing the empirical risk. SRM: minimize both the confidence interval and the empirical risk simultaneously.

65 Exam Question #2. What differs between SVM implementations, i.e. polynomial learning machines, radial basis function machines, and neural network learning machines?

66 Exam Question #2. What differs between SVM implementations, i.e. polynomial learning machines, radial basis function machines, and neural network learning machines? The kernel function.

67 Exam Question #3. What must R_emp(α) do over the set of loss functions?

68 Exam Question #3. What must R_emp(α) do over the set of loss functions? It must converge to the actual risk R(α).

69 Table of Contents • Intro: Book and Jargon • Empirical Data Modeling • Statistical Learning Theory & Supervised Learning Model • Risk Minimization • Vapnik-Chervonenkis (VC) Dimension • Structural Risk Minimization (SRM) • Support Vector Machines (SVM) • Exam Questions • Q & A

70 Book Recommendations • Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. • Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. • Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. [http://www.deeplearningbook.org/]

