Binary Classification Problem Linearly Separable Case


1 Binary Classification Problem: Linearly Separable Case
Which one is the best? [Figure: two classes of points, A+ (Solvent) and A- (Bankrupt), with several candidate separating planes.]

2 Support Vector Machines: Maximizing the Margin between Bounding Planes
The two bounding planes are x^T w = γ + 1 and x^T w = γ - 1; the distance (margin) between them is 2 / ||w||_2.

3 Algebra of the Classification Problem: Linearly Separable Case (There Exist w, γ)
Given l points in the n-dimensional real space R^n, represented by an l x n matrix A (row A_i is the i-th point). Membership of each point in the classes A+ or A- is specified by an l x l diagonal matrix D: D_ii = +1 if A_i is in A+ and D_ii = -1 if A_i is in A-. Separate A+ and A- by two bounding planes x^T w = γ + 1 and x^T w = γ - 1 such that:
A_i w >= γ + 1 for D_ii = +1 and A_i w <= γ - 1 for D_ii = -1,
or equivalently D(Aw - eγ) >= e, where e is the vector of ones.
Predict the membership of a new data point x by sign(x^T w - γ).

4 Summary of the Notations
Let S = {(x^1, y_1), (x^2, y_2), ..., (x^l, y_l)} be a training dataset, represented by the matrix A in R^{l x n} (with A_i = (x^i)^T) and the diagonal matrix D (with D_ii = y_i in {+1, -1}). The pointwise constraints y_i (w^T x^i - γ) >= 1, i = 1, ..., l, are equivalent to D(Aw - eγ) >= e, where e is the vector of ones.
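As a small sanity check, here is a sketch of this notation in matrix form, assuming NumPy is available (the data points, w, and γ below are made up for illustration):

    import numpy as np

    # Four training points in R^2: the first two belong to A+, the last two to A-.
    A = np.array([[2.0, 2.0],
                  [3.0, 3.0],
                  [0.0, 0.0],
                  [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])        # class labels, the diagonal of D
    D = np.diag(y)                      # l x l diagonal label matrix
    e = np.ones(len(y))

    # A candidate separating plane x^T w = gamma (chosen by hand here).
    w = np.array([1.0, 1.0])
    gamma = 2.5

    # D(Aw - e*gamma) >= e  <=>  every point lies on the correct side of its bounding plane.
    print(D @ (A @ w - gamma * e))      # [1.5, 3.5, 2.5, 1.5] -- all >= 1, so the planes separate the data
    assert np.all(D @ (A @ w - gamma * e) >= e)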

5 The Mathematical Model for SVMs:
Linear Program and Quadratic Program. An optimization problem in which the objective function and all constraints are linear functions is called a linear programming (LP) problem; certain SVM formulations (e.g., the 1-norm SVM) are in this category. If the objective function is convex quadratic while the constraints are all linear, then the problem is called a convex quadratic programming (QP) problem; the standard SVM formulation is in this category.
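To make the QP concrete, here is a minimal sketch of the hard-margin SVM written as a convex quadratic program, assuming the cvxpy modeling library is available (the toy data are invented for illustration):

    import numpy as np
    import cvxpy as cp

    # Toy linearly separable data: rows of A are points, y holds the +/-1 labels.
    A = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)        # normal vector of the separating plane
    gamma = cp.Variable()     # offset of the separating plane

    # Convex quadratic objective with linear constraints: a convex QP.
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, A @ w - gamma) >= 1]

    cp.Problem(objective, constraints).solve()
    print("w =", w.value, "gamma =", gamma.value, "margin =", 2 / np.linalg.norm(w.value))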

6 Optimization Problem Formulation
Problem setting: Given functions f, g_i (i = 1, ..., k) and h_j (j = 1, ..., m) defined on a domain Ω ⊆ R^n, solve
min f(x) over x in Ω, subject to g_i(x) <= 0 and h_j(x) = 0,
where f is called the objective function and g_i, h_j are called the constraints.

7 The Most Important Concept in Optimization (Minimization)
A point x* is said to be an optimal solution of an unconstrained minimization problem if there exists no descent direction at x*. This implies the first-order optimality condition ∇f(x*) = 0.
A point x* is said to be an optimal solution of a constrained minimization problem if there exists no feasible descent direction at x*. This implies the KKT optimality conditions. A descent direction may still exist, but moving along it would leave the feasible region.
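For reference, a sketch of these optimality conditions in LaTeX form, using the generic problem min f(x) s.t. g_i(x) <= 0, h_j(x) = 0 from the previous slide (the symbols follow that formulation):

    % Unconstrained problem: no descent direction at x^* means the gradient vanishes.
    \[
      \nabla f(x^*) = 0 .
    \]
    % Constrained problem: the KKT conditions (stationarity, primal/dual feasibility,
    % and complementary slackness) with multipliers \alpha_i \ge 0 and \beta_j.
    \[
    \begin{aligned}
      &\nabla f(x^*) + \sum_{i=1}^{k} \alpha_i \nabla g_i(x^*) + \sum_{j=1}^{m} \beta_j \nabla h_j(x^*) = 0, \\
      &g_i(x^*) \le 0, \qquad h_j(x^*) = 0, \qquad \alpha_i \ge 0, \qquad \alpha_i \, g_i(x^*) = 0 .
    \end{aligned}
    \]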

8 Minimum Principle
Let f be a convex and differentiable function and let F ⊆ R^n be the feasible region. Then x* in F is a minimizer of f over F if and only if
∇f(x*)^T (x - x*) >= 0 for all x in F.
Example: for f(x) = x^2 minimized over F = [1, 2], the minimizer is x* = 1, where f'(1)(x - 1) = 2(x - 1) >= 0 for all x in [1, 2], even though f'(1) = 2 ≠ 0.

9

10 Kuhn-Tucker Stationary-point Problem
Minimization Problem (MP) vs. Kuhn-Tucker Stationary-point Problem (KTSP).
MP: min f(x) such that g(x) <= 0, x in Ω.
KTSP: Find x* in Ω and multipliers α* such that
∇f(x*) + α*^T ∇g(x*) = 0, α*^T g(x*) = 0, g(x*) <= 0, α* >= 0.

11 Support Vector Classification
(Linearly Separable Case, Primal) The hyperplane (w*, γ*) that solves the minimization problem
min (1/2) ||w||_2^2 subject to D(Aw - eγ) >= e
realizes the maximal margin hyperplane, with geometric margin 1 / ||w*||_2 (distance 2 / ||w*||_2 between the two bounding planes).
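A quick illustration, assuming scikit-learn and NumPy are available (the data are invented): fit a (nearly) hard-margin linear SVM and read the geometric margin off the learned w.

    import numpy as np
    from sklearn.svm import SVC

    # Toy separable data in R^2.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])

    # A very large C approximates the hard-margin (linearly separable) formulation.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w = clf.coef_.ravel()
    print("w =", w, "gamma =", -clf.intercept_[0])   # sklearn uses w.x + b, so gamma = -b
    print("geometric margin = 1/||w|| =", 1.0 / np.linalg.norm(w))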

12 Support Vector Classification
(Linearly Separable Case, Dual Form) The dual problem of the previous MP:
max e^T α - (1/2) α^T D A A^T D α subject to e^T D α = 0, α >= 0.
Applying the KKT optimality conditions, we have w = A^T D α, i.e. w is a linear combination of the training points with α_i > 0 (the support vectors). But where is γ? Don't forget the complementarity condition α_i [D_ii (A_i w - γ) - 1] = 0, which determines γ from any support vector.

13 Dual Representation of SVM
(Key of Kernel Methods: w = A^T D α* = Σ_i y_i α_i* x^i) The hypothesis is determined by (α*, γ*):
h(x) = sign( Σ_i y_i α_i* <x^i, x> - γ* ),
so both training and prediction depend on the data only through inner products <x^i, x>.
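A sketch of this dual representation with scikit-learn (the library and the toy data are assumptions): the fitted model exposes y_i α_i as dual_coef_, and the decision function can be rebuilt from inner products with the support vectors alone.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    # dual_coef_ holds y_i * alpha_i for the support vectors only.
    w_from_dual = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_dual, clf.coef_))          # w = sum_i y_i alpha_i x_i

    # The hypothesis needs only inner products <x_i, x> with the support vectors.
    x_new = np.array([[2.5, 1.0]])
    f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
    print(np.allclose(f, clf.decision_function(x_new)))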

14 Soft Margin SVM (Nonseparable Case)
If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above. Introduce a slack variable ξ_i >= 0 for each training point; the inequality system
D(Aw - eγ) + ξ >= e, ξ >= 0
is then always feasible: e.g., take w = 0, γ = 0, ξ = e.
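As a sketch (scikit-learn and the overlapping toy data below are assumptions), the slack of each training point can be recovered from a fitted soft-margin SVM as ξ_i = max(0, 1 - y_i f(x_i)):

    import numpy as np
    from sklearn.svm import SVC

    # Overlapping (not linearly separable) toy data.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 2.5], [0.0, 0.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1.0).fit(X, y)     # finite C: soft-margin formulation

    # Slack of each point: how far it falls inside (or past) its bounding plane.
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    print(xi)   # points violating the margin get xi > 0; well-separated points get xi = 0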

15

16 Two Different Measures of Training Error
2-Norm Soft Margin: min (1/2) ||w||_2^2 + (C/2) ||ξ||_2^2 subject to D(Aw - eγ) + ξ >= e.
1-Norm Soft Margin: min (1/2) ||w||_2^2 + C e^T ξ subject to D(Aw - eγ) + ξ >= e, ξ >= 0.

17 Why Do We Maximize the Margin? (Based on Statistical Learning Theory)
The Structural Risk Minimization (SRM) principle: the expected risk will be less than or equal to the empirical risk (training error) + the VC confidence (error bound). Maximizing the margin keeps the capacity (VC) term small, so the bound on the expected risk stays small.

18 Goal of Learning Algorithms
The early learning algorithms were designed to find as accurate a fit to the data as possible. A classifier is said to be consistent if it performs correct classification of the training data. The ability of a classifier to correctly classify data not in the training set is known as its generalization. (Think of the "Bible code", or the 1994 Taipei mayoral election polls.) The goal is to predict the real future, NOT to fit the data in your hand or to predict the desired results.

19 Probably Approximately Correct Learning (PAC Model)
Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution D. When we evaluate the "quality" of a hypothesis (classification function) h, we should take the unknown distribution into account (i.e., the "average error" or "expected error" made by h). We call such a measure the risk functional and denote it as err_D(h).

20 Generalization Error of the PAC Model
Let S be a set of l training examples chosen i.i.d. according to D. Treat the generalization error err_D(h_S) as a random variable depending on the random selection of S. Find a bound on the tail of its distribution in the form
ε = ε(l, H, δ),
i.e., a function of the sample size l, the hypothesis space H, and δ, where δ is the confidence level of the error bound, which is chosen by the learner.

21 Probably Approximately Correct
We assert:
Pr( { S : err_D(h_S) >= ε(l, H, δ) } ) < δ, or equivalently Pr( { S : err_D(h_S) < ε(l, H, δ) } ) >= 1 - δ.
That is, with probability at least 1 - δ ("probably"), the error made by the hypothesis h_S will be less than the error bound ε(l, H, δ) ("approximately correct"), and the bound does not depend on the unknown distribution D.

22 PAC vs. Opinion Polls
With a valid sample of 1,265 respondents, estimating the sampling error under simple random sampling (SRS), at the 95% confidence level the maximum error should not exceed ±2.76%.
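The ±2.76% figure follows from the standard worst-case (p = 0.5) formula for a 95% confidence interval; a quick check in plain Python (nothing assumed beyond the math module):

    import math

    n = 1265                               # sample size
    z = 1.96                               # 95% confidence level
    moe = z * math.sqrt(0.5 * 0.5 / n)     # worst-case margin of error at p = 0.5
    print(f"maximum sampling error ~ {100 * moe:.2f}%")   # ~ 2.76%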

23 Find the Hypothesis with Minimum Expected Risk?
Let the training examples S = {(x^1, y_1), ..., (x^l, y_l)} be chosen i.i.d. according to the distribution D with probability density p(x, y). The expected misclassification error made by h is
err_D(h) = ∫ (1/2) |y - h(x)| p(x, y) dx dy.
The ideal hypothesis h* should have the smallest expected risk: err_D(h*) <= err_D(h) for all h. Unrealistic, because p(x, y) is unknown!

24 Empirical Risk Minimization (ERM)
(D and p(x, y) are not needed.) Replace the expected risk over p(x, y) by an average over the training examples. The empirical risk:
err_emp(h) = (1/l) Σ_{i=1}^{l} (1/2) |y_i - h(x^i)|.
Find the hypothesis with the smallest empirical risk: err_emp(h*) <= err_emp(h) for all h. Only focusing on the empirical risk will cause overfitting.
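A minimal numerical sketch of the empirical risk (NumPy assumed; the labels and predictions are made up):

    import numpy as np

    y_true = np.array([1, -1, 1, 1, -1])    # training labels y_i in {-1, +1}
    y_pred = np.array([1, -1, -1, 1, -1])   # h(x^i) for each training point

    # Empirical risk = average 0/1 loss = (1/l) * sum_i 0.5 * |y_i - h(x^i)|
    emp_risk = np.mean(0.5 * np.abs(y_true - y_pred))
    print(emp_risk)   # 0.2 -- one of the five points is misclassified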

25 VC Confidence (The Bound between err_D(h) and err_emp(h))
The following inequality holds with probability 1 - δ:
err_D(h) <= err_emp(h) + sqrt( ( v (log(2l/v) + 1) - log(δ/4) ) / l ),
where v is the VC-dimension of the hypothesis space and l is the number of training examples.
C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998), pp. 121-167.

26 Why Do We Maximize the Margin? (Based on Statistical Learning Theory)
The Structural Risk Minimization (SRM) principle: the expected risk will be less than or equal to the empirical risk (training error) + the VC confidence (error bound). Maximizing the margin keeps the capacity (VC) term small, so the bound on the expected risk stays small.

27 Capacity (Complexity) of Hypothesis Space: VC-dimension
A given training set S is shattered by the hypothesis space H if and only if for every labeling of S there is some h in H consistent with this labeling. Example: three non-collinear points in R^2 are shattered by the oriented hyperplanes (lines) in R^2.

28 Shattering Points with Hyperplanes
Can you always shatter three points with a line in R^2? No: not if the three points are collinear.
Theorem: Consider a set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
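A brute-force sketch of this idea (scikit-learn and NumPy assumed; the three non-collinear points are an invented example): enumerate every labeling of three points and check that a linear classifier can realize it.

    import itertools
    import numpy as np
    from sklearn.svm import SVC

    # Three non-collinear points in R^2 (taking the first as origin, the
    # position vectors of the other two are linearly independent).
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

    shattered = True
    for labels in itertools.product([-1, 1], repeat=3):
        y = np.array(labels)
        if len(set(labels)) == 1:
            continue                      # a single-class labeling is trivially realizable
        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        if clf.score(X, y) < 1.0:         # the labeling could not be realized by a line
            shattered = False
    print("shattered:", shattered)        # True: lines in R^2 shatter these 3 points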

29 Definition of VC-dimension
(A Capacity Measure of Hypothesis Space H) The Vapnik-Chervonenkis dimension, VC(H), of a hypothesis space H defined over the input space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
Let H = { all hyperplanes in R^n }; then VC(H) = n + 1.

30 Let H = { all hyperplanes in R^n }; then VC(H) = n + 1.
Lemma: Two sets of points may be separated by a hyperplane if and only if the intersection of their convex hulls is empty.
Theorem: Given m points in R^n, choose any one of the points as origin. Then these points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

