Published by Oswin Stephens. Modified over 9 years ago.
1
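Machine Learning
Chen Yu (陈昱)
Institute of Computer Science and Technology, Peking University
Information Security Engineering Research Center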
2
Basic Course Information
Instructor: Chen Yu (陈昱), chen_yu@pku.edu.cn, Tel: 82529680
Teaching assistant: Cheng Zaixing (程再兴), wataloo@hotmail.com, Tel: 62763742
Course web page: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht
3
Ch9 Maximum Margin Classifier
- Support vector machine
- Overlapping class distribution
- Multiclass SVM
- SVM for regression
- Computational learning theory
4
Preview
Recall what we learned in Ch8: linear parametric models (e.g. those for regression) can be recast into an equivalent 'dual representation' in which the whole model depends only on kernel functions evaluated at the training inputs.
Some benefits of using kernel functions:
- They can represent infinite-dimensional feature spaces (e.g. the Gaussian kernel).
- They can handle symbolic objects.
- They provide a way of extending many well-known algorithms via the kernel trick.
5
Preview (2)
A major drawback of kernel methods: during both training and prediction we have to evaluate the kernel function at every training sample, which is computationally expensive.
It is desirable to evaluate the kernel function only at a subset of the training data → sparse kernel machines.
One kind of sparse kernel machine starts from the idea of maximum margin → support vectors and the support vector machine (SVM).
The SVM is a decision machine rather than a probabilistic model: it outputs a class decision, not posterior probabilities.
6
Preview (3)
Another kind of sparse kernel machine is based on a Bayesian viewpoint, e.g. the relevance vector machine, which is typically sparser than the SVM.
8
Introduction: Margin and Support Vector
Consider a two-class classification problem using a linear model of the form
$$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b \qquad (7.1)$$
Note: in the above formula we make the bias parameter b explicit.
Assume that the training set consists of N samples $\{(\mathbf{x}_1, t_1), \dots, (\mathbf{x}_N, t_N)\}$, where $t_n \in \{1, -1\}$, and that a new instance x is classified according to the sign of y(x).
Furthermore, we assume that the training set is linearly separable in feature space, i.e. there exists a function of the form (7.1) such that $y(\mathbf{x}_n) > 0$ for all $\mathbf{x}_n$ with $t_n = 1$, and $y(\mathbf{x}_n) < 0$ otherwise.
9
Introduction (2)
Many such separating linear functions may exist. When they do, we look for one that minimizes a certain error function. The SVM approaches this problem through the concept of the margin, defined as the smallest distance between the hyperplane y(x) = 0 and any of the training points.
In the SVM the decision boundary is chosen to be the one for which the margin is maximized.
One motivation for maximizing the margin comes from computational learning theory.
10
An Illustration
Left figure: the margin is defined as the perpendicular distance between the decision boundary and the closest of the data points.
Right figure: maximizing the margin leads to a particular choice of decision boundary; it is determined by a subset of the data points, known as support vectors (indicated by circles in the picture).
11
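Finding Maximum Margin Solution
Recall that the distance from a point x to the hyperplane defined by y(x) = 0, where y(x) takes the form (7.1), is $|y(\mathbf{x})| / \|\mathbf{w}\|$.
Since we are only interested in solutions for which all training points are correctly classified, i.e. $t_n y(\mathbf{x}_n) > 0$ for n = 1, …, N, we can rewrite the distance of $\mathbf{x}_n$ to the hyperplane as
$$\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|} \qquad (7.2)$$
It follows that the maximum margin solution is found by solving
$$\arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_n \left[ t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \right] \right\} \qquad (7.3)$$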
12
Finding Maximum Margin Solution (2)
Observe that if we rescale both w and b by the same factor k > 0, the distance $t_n y(\mathbf{x}_n) / \|\mathbf{w}\|$ is unchanged. We can therefore rescale w and b so that
$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) = 1 \qquad (7.4)$$
for the points closest to the hyperplane. For such w and b, all training examples satisfy the constraints
$$t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \ge 1, \quad n = 1, \dots, N \qquad (7.5)$$
This is called the canonical representation of the decision hyperplane. Notice that for a finite training set, at least one point satisfies the constraint with equality.
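The rescaling argument can be checked numerically. The sketch below assumes an identity feature map φ(x) = x and a small hand-made data set; the names `geometric_margin` and `canonical` are illustrative and do not come from the slides.

```python
import numpy as np

def geometric_margin(w, b, X, t):
    """Smallest value of t_n * y(x_n) / ||w|| over the training set."""
    y = X @ w + b
    return np.min(t * y) / np.linalg.norm(w)

def canonical(w, b, X, t):
    """Rescale (w, b) so that min_n t_n * y(x_n) equals 1."""
    scale = np.min(t * (X @ w + b))
    if scale <= 0:
        raise ValueError("(w, b) does not separate the data")
    return w / scale, b / scale

# Toy data (assumed): two points per class on the line x1 = x2.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

w, b = np.array([1.0, 1.0]), -2.0          # some separating hyperplane
margin = geometric_margin(w, b, X, t)      # distance of the closest point
w_c, b_c = canonical(w, b, X, t)           # canonical representation
```

After rescaling, the margin is unchanged and equals 1 / ||w_c||, which is why maximizing the margin reduces to minimizing the norm of w.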
13
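Equivalent Quadratic Programming Problem
Notice that maximizing $1/\|\mathbf{w}\|$ is equivalent to minimizing $\|\mathbf{w}\|^2$. It follows that the optimization problem (7.3) is equivalent to the following quadratic programming problem:
$$\arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.6)$$
subject to the constraints (7.5), $t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \ge 1$ for n = 1, …, N.
This constrained problem is equivalent to minimizing the following Lagrangian function w.r.t. w and b, and maximizing it w.r.t. the multipliers $\mathbf{a} = (a_1, \dots, a_N)^T$ with $a_n \ge 0$:
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \{ t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - 1 \} \qquad (7.7)$$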
14
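Dual Representation
Setting the partial derivatives of L w.r.t. the components of w and w.r.t. b to zero, we obtain
$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n) \qquad (7.8)$$
$$0 = \sum_{n=1}^{N} a_n t_n \qquad (7.9)$$
Plugging (7.8) into L and using (7.9), we eliminate w and b and rewrite L in terms of a alone, which is then to be maximized:
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \qquad (7.10)$$
where $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$ is the kernel function.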
15
Dual Representation (2)
(7.10) together with the constraints
$$a_n \ge 0, \quad n = 1, \dots, N \qquad (7.11)$$
$$\sum_{n=1}^{N} a_n t_n = 0 \qquad (7.12)$$
is called the dual representation of the original maximum margin problem (7.6) subject to constraint (7.5).
The solution to a quadratic programming problem in M variables has, in general, computational complexity $O(M^3)$.
When the kernel function k is positive definite, the function $\tilde{L}(\mathbf{a})$ is bounded above, giving rise to a well-defined maximization problem.
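As a small sanity check of the dual, the sketch below evaluates the dual objective, sum_n a_n - (1/2) sum_{n,m} a_n a_m t_n t_m k(x_n, x_m), on an assumed two-point, one-dimensional data set with a linear kernel. The equality constraint forces a_1 = a_2, so a simple grid search over that single value is enough here; the function name `dual_objective` is illustrative.

```python
import numpy as np

def dual_objective(a, t, K):
    """Dual Lagrangian: sum(a) - 0.5 * (a*t)^T K (a*t)."""
    v = a * t
    return a.sum() - 0.5 * v @ K @ v

# Toy data (assumed): x = +1 with label +1, x = -1 with label -1.
x = np.array([1.0, -1.0])
t = np.array([1.0, -1.0])
K = np.outer(x, x)                     # linear kernel k(x, z) = x * z

# sum_n a_n t_n = 0 forces a_1 = a_2 = s; scan s >= 0 on a grid.
grid = np.linspace(0.0, 1.0, 101)
values = [dual_objective(np.array([s, s]), t, K) for s in grid]
best = grid[int(np.argmax(values))]    # maximizer of 2s - 2s^2

w = np.sum(best * t * x)               # w = sum_n a_n t_n x_n
```

For this data the objective reduces to 2s - 2s², whose maximum at s = 1/2 gives w = 1, i.e. a margin of 1/|w| = 1, matching the geometry of the two points.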
16
Prediction
To classify a new input x using the training data, evaluate the sign of
$$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b \qquad (7.13)$$
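A minimal sketch of this prediction rule, y(x) = sum_n a_n t_n k(x, x_n) + b, with a Gaussian kernel. The two training points, the kernel width `gamma`, and the multiplier values are assumptions chosen so that the dual solution can be worked out by hand (a_1 = a_2 = 1 / (1 - e^{-2}), b = 0); they are not taken from the slides.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def predict(x, X_train, t, a, b, kernel=gaussian_kernel):
    """Return y(x) and its sign; only points with a_n > 0 contribute."""
    y = b
    for x_n, t_n, a_n in zip(X_train, t, a):
        if a_n > 0:
            y += a_n * t_n * kernel(x, x_n)
    return y, (1 if y >= 0 else -1)

# Two training points (assumed), both of which are support vectors.
X_train = np.array([[0.0, 0.0], [2.0, 0.0]])
t = np.array([-1.0, 1.0])
a_star = 1.0 / (1.0 - np.exp(-2.0))    # hand-derived dual solution for this toy case
a = np.array([a_star, a_star])
b = 0.0

y_pos, label_pos = predict(np.array([2.0, 0.0]), X_train, t, a, b)
y_neg, label_neg = predict(np.array([0.0, 0.0]), X_train, t, a, b)
```

Both training points evaluate to t_n y(x_n) = 1, as expected for support vectors.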
17
Support Vectors
It can be shown (Appendix E in the book) that a constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which here are
$$a_n \ge 0 \qquad (7.14)$$
$$t_n y(\mathbf{x}_n) - 1 \ge 0 \qquad (7.15)$$
$$a_n \{ t_n y(\mathbf{x}_n) - 1 \} = 0 \qquad (7.16)$$
It follows that for every data point, either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1$. Any point with $a_n = 0$ plays no role in the prediction for a new instance. The remaining points are called support vectors; they lie on the maximum margin hyperplanes in feature space (Fig 7.1).
18
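Determine the Parameter b
Notice that any support vector $\mathbf{x}_n$ satisfies $t_n y(\mathbf{x}_n) = 1$. Let S denote the set of indices of the support vectors. For any n ∈ S, plugging (7.13) into (7.16) gives
$$t_n \Big( \sum_{m \in S} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) + b \Big) = 1 \qquad (7.17)$$
For numerical stability, we multiply both sides of (7.17) by $t_n$, use $t_n^2 = 1$, average the resulting equations over all support vectors, and solve for b:
$$b = \frac{1}{N_S} \sum_{n \in S} \Big( t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) \Big) \qquad (7.18)$$
where $N_S$ is the number of support vectors.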
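The averaging step can be sketched as follows. The data set, the linear kernel, and the multiplier values (a = [0.25, 0, 0.25, 0], which can be verified by hand for this toy problem) are assumptions for illustration; `bias_from_support_vectors` is a hypothetical helper name.

```python
import numpy as np

def bias_from_support_vectors(K, t, a, tol=1e-10):
    """Average t_n - sum_{m in S} a_m t_m k(x_n, x_m) over the support vectors S."""
    S = np.flatnonzero(a > tol)                  # indices with a_n > 0
    y_no_bias = K[np.ix_(S, S)] @ (a[S] * t[S])  # kernel expansion without b
    return float(np.mean(t[S] - y_no_bias))

# Toy data (assumed) with a linear kernel; points 0 and 2 are the support vectors.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
a = np.array([0.25, 0.0, 0.25, 0.0])

b = bias_from_support_vectors(K, t, a)
w = X.T @ (a * t)                                # explicit weights, linear kernel only
```

With these values the recovered hyperplane is w = (0.5, 0.5), b = -1, and every training point satisfies t_n y(x_n) ≥ 1.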
19
Error Function
We can express the maximum margin classifier in terms of minimizing an error function with a quadratic regularizer:
$$\sum_{n=1}^{N} E_\infty(y(\mathbf{x}_n) t_n - 1) + \lambda \|\mathbf{w}\|^2 \qquad (7.19)$$
where $E_\infty(z) = 0$ if z ≥ 0, and ∞ otherwise.
20
An Illustrative Example
Example of synthetic data from two classes in a 2-dim input space, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel. Also shown are the decision boundary, the margin boundaries, and the support vectors.
21
Summary of this Section
Consider a two-class classification problem and assume that the training points are linearly separable in feature space.
Maximize the margin → minimize the length of the parameter vector subject to constraints → dual representation (containing kernel functions) → the optimal solution satisfies the KKT conditions → kernel functions need to be evaluated only at a sparse subset of the training points, the support vectors.
23
Overlapping Class Distribution
In the previous section we assumed that the training data points are linearly separable in feature space.
If in reality this assumption does not hold, or holds only because of the flexibility of the feature space, exact separation of the training data can lead to poor generalization.
Recall that the maximum margin classifier is equivalent to minimizing the following error function:
$$\sum_{n=1}^{N} E_\infty(y(\mathbf{x}_n) t_n - 1) + \lambda \|\mathbf{w}\|^2$$
We can lessen the penalty imposed by $E_\infty$ on misclassification, so that data points are allowed to be on the wrong side of the margin boundary, but with a penalty that increases with the distance from that boundary.
24
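Slack Variable
For each training data point $\mathbf{x}_n$ introduce a slack variable $\xi_n$, defined as follows: $\xi_n = 0$ if $\mathbf{x}_n$ is on or inside the correct margin boundary, and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise.
(Figure: values of $\xi_n$ as a function of the position of $\mathbf{x}_n$.) A point on the decision boundary has $\xi_n = 1$, a point inside the margin but on the correct side has $0 < \xi_n < 1$, and a misclassified point has $\xi_n > 1$.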
25
Error Function
The error function corresponding to the relaxation of the hard constraint is
$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.21)$$
where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin, i.e. between training error and model complexity.
Furthermore, the constraint (7.5), $t_n y(\mathbf{x}_n) \ge 1$, becomes
$$t_n y(\mathbf{x}_n) \ge 1 - \xi_n \qquad (7.20)$$
together with $\xi_n \ge 0$, for n = 1, …, N.
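For a fixed (w, b), the smallest slack values that satisfy the relaxed constraints are ξ_n = max(0, 1 - t_n y(x_n)). The sketch below evaluates the soft-margin objective C Σ ξ_n + ½||w||² for an assumed linear model and three hand-picked points, one of which is misclassified; all names and values are illustrative.

```python
import numpy as np

def soft_margin_objective(w, b, X, t, C):
    """Return C * sum(xi) + 0.5 * ||w||^2 and the slack values xi."""
    xi = np.maximum(0.0, 1.0 - t * (X @ w + b))  # smallest feasible slack
    return C * xi.sum() + 0.5 * (w @ w), xi

# Toy data (assumed): the third point carries label +1 but lies on the negative side.
X = np.array([[2.0, 2.0], [0.0, 0.0], [0.5, 0.5]])
t = np.array([1.0, -1.0, 1.0])
w, b = np.array([0.5, 0.5]), -1.0

objective, xi = soft_margin_objective(w, b, X, t, C=1.0)
misclassified = xi > 1.0                         # xi_n > 1 means wrong side of the boundary
```

Here the first two points sit exactly on the margin (ξ = 0) while the third has ξ = 1.5, so increasing C makes that single violation progressively more expensive relative to the margin term.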
26
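Lagrangian of the Optimization
The Lagrangian corresponding to minimizing (7.21) subject to the constraints (7.20) and $\xi_n \ge 0$ is
$$L(\mathbf{w}, b, \boldsymbol{\xi}, \mathbf{a}, \boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n\{t_n y(\mathbf{x}_n) - 1 + \xi_n\} - \sum_{n=1}^{N}\mu_n\xi_n \qquad (7.22)$$
where $\{a_n \ge 0\}$ and $\{\mu_n \ge 0\}$ are Lagrange multipliers. The corresponding KKT conditions are
$$a_n \ge 0, \quad t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0, \quad a_n(t_n y(\mathbf{x}_n) - 1 + \xi_n) = 0$$
$$\mu_n \ge 0, \quad \xi_n \ge 0, \quad \mu_n\xi_n = 0$$
for n = 1, …, N.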
27
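Solving the Lagrangian
Taking the partial derivatives of L w.r.t. $w_i$, b, and $\xi_n$ and setting them to 0, we obtain
$$\mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n) \qquad (7.29)$$
$$\sum_{n=1}^{N} a_n t_n = 0 \qquad (7.30)$$
$$a_n = C - \mu_n \qquad (7.31)$$
Plugging (7.29) into (7.22) and using (7.30) and (7.31) to simplify, we eliminate not only w but also b and $\xi_n$ from L, and obtain
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \qquad (7.32)$$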
28
Solving the Lagrangian (2)
In summary, we want to maximize (7.32) w.r.t. a, subject to the following constraints:
$$0 \le a_n \le C \qquad (7.33)$$
$$\sum_{n=1}^{N} a_n t_n = 0 \qquad (7.34)$$
for n = 1, …, N. The constraints (7.33) are known as box constraints.
Notice that (7.32) is the same as (7.10), but (7.33) is more restrictive than (7.11): each $a_n$ is now also bounded above by C.
For prediction, (7.13) still holds, since it is again derived from (7.29).
29
Interpreting the Solution
As before, data points with $a_n = 0$ do not contribute to the prediction, and the remaining ones constitute the support vectors. Furthermore, support vectors satisfy
$$t_n y(\mathbf{x}_n) = 1 - \xi_n \qquad (7.35)$$
We now use the additional constraints induced by relaxing the hard margin constraints:
- If $a_n < C$, then $\mu_n > 0$ and consequently $\xi_n = 0$, so $t_n y(\mathbf{x}_n) = 1$, i.e. the corresponding $\mathbf{x}_n$ lies on the margin.
- If $a_n = C$, then $t_n y(\mathbf{x}_n) \le 1$; the corresponding $\mathbf{x}_n$ can lie on or inside the margin or on the wrong side of the decision boundary, and is classified correctly if $\xi_n \le 1$ and incorrectly if $\xi_n > 1$.
30
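Determine the Parameter b
To compute the parameter b, we only consider those support vectors with $0 < a_n < C$, for which $\xi_n = 0$ and hence $t_n y(\mathbf{x}_n) = 1$. We can then apply the same trick as in the previous section to compute b (both cases share the same prediction formula), and obtain
$$b = \frac{1}{N_M} \sum_{n \in M} \Big( t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) \Big) \qquad (7.37)$$
where M denotes the set of indices n with $0 < a_n < C$, $N_M$ is its size, and S is the set of all support vectors.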
31
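ν-SVM
It involves maximizing the following dual Lagrangian:
$$\tilde{L}(\mathbf{a}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
subject to the following constraints:
$$0 \le a_n \le 1/N, \qquad \sum_{n=1}^{N} a_n t_n = 0, \qquad \sum_{n=1}^{N} a_n \ge \nu$$
Here the parameter ν (replacing C) can be interpreted as both an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.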
32
An Illustration of ν-SVM
The following figure illustrates an example of applying the ν-SVM to a synthetic data set. The ν-SVM uses a Gaussian kernel of the form $\exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$ with γ = 0.45; support vectors are indicated by circles in the figure.
33
Algorithms for the Quadratic Programming
Notice that the dual Lagrangian is a quadratic function of the $a_i$, and the constraints define a convex region, so any local optimum is also a global optimum.
Practical approaches to the constrained quadratic programming problem:
- Chunking: since the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix that correspond to zero Lagrange multipliers, the original problem can be broken into a series of smaller ones. This idea can be implemented using protected conjugate gradients.
34
Algorithms (2)
- Decomposition methods: these also break the original problem into a series of smaller ones, but all of the resulting quadratic programming problems have a fixed size, so the method can be applied to training sets of any size.
- Sequential minimal optimization (SMO), a popular method: it takes the idea of chunking to the extreme and considers just two Lagrange multipliers at a time, so that each sub-problem can be solved analytically. At each step the pair of Lagrange multipliers is chosen by heuristics (the original heuristics are based on the KKT conditions).
See http://kernel-machines.org for a collection of software dealing with SVMs and Gaussian processes.
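To make the two-multiplier idea concrete, here is a simplified SMO-style sketch. It is not Platt's original algorithm: instead of KKT-based heuristics it simply cycles over all pairs (i, j), which is only practical for tiny problems. The function name `smo_train`, the toy data, the value of C, and the stopping rule are assumptions.

```python
import numpy as np

def smo_train(K, t, C=10.0, max_passes=100, tol=1e-10):
    """Cycle over all pairs (i, j); solve each two-multiplier sub-problem in closed form."""
    N = len(t)
    a = np.zeros(N)
    for _ in range(max_passes):
        changed = False
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                y = K @ (a * t)                  # outputs without bias (b cancels in E_i - E_j)
                E_i, E_j = y[i] - t[i], y[j] - t[j]
                # Feasible interval for a_j given 0 <= a <= C and sum_n a_n t_n = 0.
                if t[i] != t[j]:
                    lo, hi = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
                else:
                    lo, hi = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
                if hi - lo < tol or eta < tol:
                    continue
                a_j = float(np.clip(a[j] + t[j] * (E_i - E_j) / eta, lo, hi))
                if abs(a_j - a[j]) < tol:
                    continue
                a[i] += t[i] * t[j] * (a[j] - a_j)   # keeps sum_n a_n t_n unchanged
                a[j] = a_j
                changed = True
        if not changed:
            break
    # Bias: average over support vectors strictly inside the box (0 < a_n < C).
    M = np.flatnonzero((a > tol) & (a < C - tol))
    b = float(np.mean(t[M] - K[M] @ (a * t)))
    return a, b

# Toy separable data (assumed) with a linear kernel.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
a, b = smo_train(X @ X.T, t)
w = X.T @ (a * t)
```

On this data only the two closest points end up with nonzero multipliers. The bias step assumes at least one multiplier lies strictly between 0 and C; a production solver would handle the degenerate case and would pick pairs far more selectively.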
35
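Some Remarks on SVM
Dimensionality of the feature space in kernel methods: consider the kernel function on a 2-dim input space given by $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$. Expanding,
$$k(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2)^2 = (x_1^2, \sqrt{2}x_1x_2, x_2^2)(z_1^2, \sqrt{2}z_1z_2, z_2^2)^T = \phi(\mathbf{x})^T\phi(\mathbf{z})$$
In this example the original 2-dim input space can be regarded as a 2-dim manifold embedded in a 3-dim feature space.
The SVM can be modified to provide a probabilistic output for the prediction of a new instance, e.g. by fitting a logistic sigmoid to the SVM output:
$$p(t = 1 \mid \mathbf{x}) = \sigma(A\,y(\mathbf{x}) + B)$$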
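The identity between the quadratic kernel and an explicit 3-dim feature map can be verified numerically. The feature map φ(x) = (x1², √2·x1·x2, x2²) used below is the standard one for this kernel; the test vectors are arbitrary.

```python
import numpy as np

def k(x, z):
    """Quadratic kernel on the 2-dim input space."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit feature map into the 3-dim feature space."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = k(x, z)                      # computed in the input space
feature_value = float(phi(x) @ phi(z))      # computed in the feature space
```

Both routes give the same number, but the kernel never forms the 3-dim vectors, which is the point of the kernel trick when the feature space is large or infinite-dimensional.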
37
Multiple 2-class SVM for Multiclass SVM
Consider K-class classification.
- One-versus-the-rest: the k-th model $y_k(\mathbf{x})$ is trained using the data from the k-th class as positive examples and the remaining data as negative examples. Some challenges:
  - An input might be assigned to multiple classes.
  - The resulting training sets are imbalanced. One solution is to modify the target values.
  - We might also define a single objective function for training all K SVMs simultaneously, based on maximizing the margin from each class to the remaining classes.
- One-versus-one: train a 2-class SVM on every possible pair of classes.
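One common heuristic for resolving inputs that several one-versus-the-rest models claim is to pick the class whose model gives the largest output. The sketch below assumes K = 3 already-trained linear models given as hypothetical (w_k, b_k) pairs; note that the outputs of separately trained SVMs are not guaranteed to be on comparable scales, so this rule is a heuristic rather than a principled decision.

```python
import numpy as np

def one_vs_rest_predict(x, models):
    """Return the index of the model with the largest output y_k(x), plus all outputs."""
    scores = np.array([w @ x + b for w, b in models])
    return int(np.argmax(scores)), scores

# Hypothetical linear models (w_k, b_k), one per class.
models = [
    (np.array([1.0, 0.0]), 0.0),
    (np.array([-1.0, 0.0]), 0.0),
    (np.array([0.0, 1.0]), -0.5),
]

x = np.array([0.2, 2.0])
label, scores = one_vs_rest_predict(x, models)
claimed_by = int(np.sum(scores > 0))   # number of models that say "positive"
```

For this input two models return a positive output, which is exactly the ambiguity mentioned above; the argmax rule picks the third class.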
39
Single-Class SVM
Consider an unsupervised learning problem related to probability density estimation. Instead of modeling the density itself, in certain situations it is simpler to find a smooth boundary enclosing a region of high density, i.e. a data point drawn from the distribution falls inside the boundary with a predetermined probability (desirably close to 1).
Sample application scenario: anomaly detection, where the training data consist mostly of normal samples plus a few abnormal ones.
- Approach 1: find a hyperplane that separates all but a fixed fraction of the training data from the origin, while at the same time maximizing the distance of the hyperplane from the origin (Schölkopf et al. 2001).
- Approach 2: find the smallest sphere in feature space that contains all but a fraction of the data points (Tax & Duin, 1998).
41
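Recall: Regularized Least Squares
Consider an error function of the form
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \qquad \text{(data term + regularizer)}$$
With the sum-of-squares error function and a quadratic regularizer, we get
$$\frac{1}{2}\sum_{n=1}^{N} \{ y(\mathbf{x}_n) - t_n \}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$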
42
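Error Function for Support Vector Regression
To obtain sparse solutions, the quadratic error function on the previous slide is replaced by an ε-insensitive error function, defined as follows:
$$E_\epsilon(y(\mathbf{x}) - t) = \begin{cases} 0, & \text{if } |y(\mathbf{x}) - t| < \epsilon \\ |y(\mathbf{x}) - t| - \epsilon, & \text{otherwise} \end{cases} \qquad (7.51)$$
(Figure: the ε-insensitive error function in red, the quadratic error function in green.)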
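A short numerical comparison of the two error functions; the prediction values, targets, and ε below are made up for illustration.

```python
import numpy as np

def eps_insensitive(y, t, eps):
    """Zero inside the tube |y - t| < eps, linear outside it."""
    return np.maximum(0.0, np.abs(y - t) - eps)

def quadratic(y, t):
    """Sum-of-squares error per point, for comparison."""
    return 0.5 * (y - t) ** 2

y = np.array([1.0, 1.05, 2.0])   # assumed predictions
t = np.array([1.0, 1.0, 1.0])    # assumed targets
eps = 0.1

e_eps = eps_insensitive(y, t, eps)
e_quad = quadratic(y, t)
```

The second point has a small but nonzero residual: the quadratic error penalizes it, while the ε-insensitive error ignores it. Points with zero error do not become support vectors, which is the source of sparsity.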
43
Error Function (2)
Therefore the error function for support vector regression is
$$C \sum_{n=1}^{N} E_\epsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.52)$$
We can re-express the optimization problem by introducing slack variables. Since the ε-insensitive error function has two branches, we need two slack variables for each $\mathbf{x}_n$:
- $\eta_n \ge 0$, where $\eta_n > 0$ corresponds to the case $t_n > y(\mathbf{x}_n) + \epsilon$
- $\hat{\eta}_n \ge 0$, where $\hat{\eta}_n > 0$ corresponds to the case $t_n < y(\mathbf{x}_n) - \epsilon$
44
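Error Function (3)
Introducing the slack variables allows points to lie outside the ε-tube, provided the corresponding slack variables are nonzero:
$$t_n \le y(\mathbf{x}_n) + \epsilon + \eta_n \qquad (7.53)$$
$$t_n \ge y(\mathbf{x}_n) - \epsilon - \hat{\eta}_n \qquad (7.54)$$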
45
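Error Function (4)
Therefore the error function can be written as
$$C \sum_{n=1}^{N} (\eta_n + \hat{\eta}_n) + \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.55)$$
which must be minimized subject to the constraints $\eta_n \ge 0$ and $\hat{\eta}_n \ge 0$, as well as (7.53) and (7.54), for n = 1, …, N.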
46
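Lagrangian of the Optimization
The Lagrangian corresponding to the constrained minimization is
$$L = C\sum_{n=1}^{N}(\eta_n + \hat{\eta}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(\mu_n\eta_n + \hat{\mu}_n\hat{\eta}_n) - \sum_{n=1}^{N} a_n(\epsilon + \eta_n + y(\mathbf{x}_n) - t_n) - \sum_{n=1}^{N} \hat{a}_n(\epsilon + \hat{\eta}_n - y(\mathbf{x}_n) + t_n) \qquad (7.56)$$
where $a_n \ge 0$, $\hat{a}_n \ge 0$, $\mu_n \ge 0$ and $\hat{\mu}_n \ge 0$ are Lagrange multipliers.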
47
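Solving the Lagrangian
Taking the partial derivatives of L w.r.t. $w_i$, b, $\eta_n$ and $\hat{\eta}_n$ and setting them to 0, we obtain
$$\mathbf{w} = \sum_{n=1}^{N} (a_n - \hat{a}_n)\phi(\mathbf{x}_n) \qquad (7.57)$$
$$\sum_{n=1}^{N} (a_n - \hat{a}_n) = 0 \qquad (7.58)$$
$$a_n + \mu_n = C \qquad (7.59)$$
$$\hat{a}_n + \hat{\mu}_n = C \qquad (7.60)$$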
48
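Solving the Lagrangian (2)
Plugging (7.57) into (7.56) and using (7.58)--(7.60) to simplify, we obtain the dual function to be maximized
$$\tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_n - \hat{a}_n)(a_m - \hat{a}_m)k(\mathbf{x}_n, \mathbf{x}_m) - \epsilon\sum_{n=1}^{N}(a_n + \hat{a}_n) + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \qquad (7.61)$$
again with box constraints:
$$0 \le a_n \le C, \qquad 0 \le \hat{a}_n \le C$$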
49
Prediction
Substituting (7.57) into (7.1), the prediction for a new input x using the training data is given by
$$y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b \qquad (7.64)$$
50
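KKT Conditions
Similarly, the corresponding KKT conditions are given by
$$a_n(\epsilon + \eta_n + y(\mathbf{x}_n) - t_n) = 0$$
$$\hat{a}_n(\epsilon + \hat{\eta}_n - y(\mathbf{x}_n) + t_n) = 0$$
$$(C - a_n)\eta_n = 0$$
$$(C - \hat{a}_n)\hat{\eta}_n = 0$$
To save space, the inequalities are omitted from this set of KKT conditions.
Notice that $a_n$ and $\hat{a}_n$ cannot both be nonzero for the same n: the two bracketed terms in the first two conditions sum to $2\epsilon + \eta_n + \hat{\eta}_n > 0$, so they cannot vanish simultaneously.
Similarly, we can also estimate b using the above KKT conditions …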
51
Interpreting the Solution
Similarly, points with $a_n = 0$ do not contribute to the prediction, and the remaining $\mathbf{x}_n$ with $a_n > 0$ constitute the support vectors. Furthermore, they satisfy $t_n y(\mathbf{x}_n) = 1 - \xi_n$.
We now use the additional constraints induced by relaxing the hard margin constraints:
- If $a_n < C$, then $\mu_n > 0$ and consequently $\xi_n = 0$, so $t_n y(\mathbf{x}_n) = 1$, i.e. the corresponding $\mathbf{x}_n$ lies on the margin.
- If $a_n = C$, then $t_n y(\mathbf{x}_n) \le 1$; the corresponding $\mathbf{x}_n$ can lie anywhere, and can be labeled correctly or incorrectly, depending on the sign of $t_n y(\mathbf{x}_n)$.
53
Recall: Shattering a Set of Instances
Def. A dichotomy of a set S is a partition of S into two disjoint subsets.
Def. A set of instances S is shattered by a hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
(Figure: a set of 3 instances shattered.)
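The definition can be tested by brute force for small point sets and linear classifiers in the plane. The sketch below enumerates every labeling and uses a perceptron with an epoch cap as the separability test, which is an assumption made for brevity: the perceptron is guaranteed to converge on separable data, but a "not separable" answer after the cap is only a heuristic in general. Function names and point sets are illustrative.

```python
import itertools
import numpy as np

def linearly_separable(X, labels, max_epochs=1000):
    """Perceptron on augmented inputs; True if it reaches zero mistakes."""
    Xa = np.hstack([X, np.ones((len(X), 1))])    # append 1 for the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(Xa, labels):
            if t * (w @ x) <= 0:
                w += t * x                       # standard perceptron update
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def is_shattered(X):
    """True if every dichotomy of X is realized by some hyperplane."""
    return all(
        linearly_separable(X, np.array(labels, dtype=float))
        for labels in itertools.product([-1, 1], repeat=len(X))
    )

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
xor_points = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

shatters_three = is_shattered(three_points)
shatters_four = is_shattered(xor_points)
```

Three points in general position are shattered by lines in the plane, while the four corner points are not, because the XOR labeling has no separating line.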
54
Recall: VC Dimension
Motivation: what if H cannot shatter X? Try finite subsets of X.
Def. The VC dimension of a hypothesis space H defined over an instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.
Roughly speaking, the VC dimension measures how many (training) points can be separated, for all possible labelings, using functions of a given class.
55
Sample Complexity from VC Dimension
How many randomly drawn examples suffice to ε-exhaust the version space $VS_{H,D}$ with probability at least 1 − δ? (Blumer et al. 1989)
$$m \ge \frac{1}{\epsilon}\left( 4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\epsilon) \right)$$
Remark: this bound often reflects the worst case, since it is valid under very general conditions, whereas in real-world applications we often deal with distributions that have significant regularity. One attempt to tighten the bound is the PAC-Bayesian framework, which considers a distribution over the hypothesis space H.
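To get a feel for how loose such worst-case bounds are, the sketch below evaluates the bound in the form commonly quoted from Blumer et al. (1989), m ≥ (1/ε)(4 log₂(2/δ) + 8·VC(H)·log₂(13/ε)). The function name and the example values (lines in the plane, VC dimension 3, ε = 0.1, δ = 0.05) are assumptions.

```python
import math

def vc_sample_bound(vc_dim, eps, delta):
    """Number of examples sufficient under the VC-dimension bound (rounded up)."""
    return math.ceil(
        (4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps
    )

# Linear classifiers in the plane have VC dimension 3.
m = vc_sample_bound(vc_dim=3, eps=0.1, delta=0.05)
m_tighter_eps = vc_sample_bound(vc_dim=3, eps=0.05, delta=0.05)
```

Even for this very simple hypothesis class the bound asks for close to two thousand examples, and it grows faster than linearly in 1/ε.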
56
Margin and VC Dimension
Consider the function class of hyperplanes of the form $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$, defined over an M-dim space X. Let D, a subset of X, be a training data set, and let h denote the VC dimension of this function class over D.
For a dichotomy of D that is separable by a hyperplane, we can find a separating hyperplane with maximum margin, and denote its weight vector by $\mathbf{w}_m$ (under the canonical representation).
Let R be the radius of the smallest ball around the training data set, then the following inequalities hold:
Recall that 1/delta is a lower bound of the margin.
57
Summary Points of Ch9
Main concepts: margin, support vector, KKT conditions.
To construct an SVM and its variants (including the SVM for overlapping class distributions and the SVM for regression) from a training data set, we follow this procedure: minimize an error function with constraints → introduce Lagrange multipliers → dual representation (containing kernel functions) → the optimal solution satisfies the KKT conditions → kernel functions need to be evaluated only at a sparse subset of the training points, the support vectors.
Other useful knowledge: multiple 2-class SVMs for multi-class classification, one-class SVM, ν-SVM.
58
HW
7.3, 7.4, 7.5 (10pt each, due on Nov 28)
7.1 (bonus problem, 10pt, due on Nov 28)