1
Support Vector Machines. H. Clara Pong, Julie Horrocks 1, Marianne Van den Heuvel 2, Francis Tekpetey 3, B. Anne Croy 4. 1 Mathematics & Statistics, University of Guelph; 2 Biomedical Sciences, University of Guelph; 3 Obstetrics and Gynecology, University of Western Ontario; 4 Anatomy & Cell Biology, Queen's University
2
Outline
- Background
- Separating Hyper-plane & Basis Expansion
- Support Vector Machines
- Simulations
- Remarks
3
Background: Motivation
- The IVF (In-Vitro Fertilization) project
- 18 infertile women, each undergoing IVF treatment
- Outcome (output, Y): binary (pregnancy)
- Predictor (input, X): longitudinal data (adhesion of CD56 bright cells)
4
Background: Classification methods
Relatively new method: Support Vector Machines
- V. Vapnik: first proposed in 1979
- Maps the input space into a high-dimensional feature space
- Constructs a linear classifier in the new feature space
Traditional method: Discriminant Analysis
- R.A. Fisher: 1936
- Classifies according to the values of the discriminant functions
- Assumption: the predictors X in a given class have a multivariate Normal distribution
5
Separating Hyper-plane
Suppose there are 2 classes (A, B): y = 1 for group A, y = -1 for group B. Let a hyper-plane be defined as f(X) = β_0 + β^T X = 0; then f(X) = 0 is the decision boundary that separates the two groups:
f(X) = β_0 + β^T X > 0 for X Є A
f(X) = β_0 + β^T X < 0 for X Є B
Given X_0 Є A, it is misclassified when f(X_0) < 0; given X_0 Є B, it is misclassified when f(X_0) > 0.
(Figure: the hyper-plane f(X) = β_0 + β^T X = 0 separating region A, where f(X) > 0, from region B, where f(X) < 0.)
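To make the sign rule concrete, here is a small worked example; the particular hyper-plane and points are illustrative choices, not taken from the slides.

```latex
% Illustrative hyper-plane in R^2 with \beta_0 = -1 and \beta = (1, 2)^T:
\[
f(X) = \beta_0 + \beta^T X = -1 + x_1 + 2x_2 .
\]
% Classify by the sign of f:
\[
X_0 = (2, 1):\quad f(X_0) = -1 + 2 + 2 = 3 > 0
\;\Rightarrow\; X_0 \text{ is assigned to group } A,
\]
\[
X_1 = (0.2, 0.1):\quad f(X_1) = -1 + 0.2 + 0.2 = -0.6 < 0
\;\Rightarrow\; X_1 \text{ is assigned to group } B.
\]
```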
6
Separating Hyper-plane
The perceptron learning algorithm searches for a hyper-plane that minimizes the distance of misclassified points to the decision boundary. However, this does not provide a unique solution.
(Figure: a separating hyper-plane f(X) = β_0 + β^T X = 0.)
7
Optimal Separating Hyper-plane
Let C be the distance of the closest point from the two groups to the hyper-plane. The optimal separating hyper-plane is the unique separating hyper-plane f(X) = β*_0 + β*^T X = 0, where (β*_0, β*) maximizes C.
(Figure: the hyper-plane f(X) = β*_0 + β*^T X = 0 with margin C on either side.)
8
Optimal Separating Hyper-plane
Maximization problem: choose (β_0, β) to maximize the margin C (the reconstructed objective and dual are given below).
Dual Lagrange problem, subject to:
1. α_i [y_i (x_i^T β + β_0) - 1] = 0
2. α_i ≥ 0 for all i = 1…N
3. β = Σ_{i=1…N} α_i y_i x_i
4. Σ_{i=1…N} α_i y_i = 0
5. the Kuhn-Tucker conditions
f(X) depends only on the x_i for which α_i ≠ 0 (the support vectors).
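The objective functions on this slide were images; what follows is a reconstruction of the standard separable-case formulation (as in reference 2), written to match the slide's notation.

```latex
% Primal: maximize the margin C
\[
\max_{\beta_0,\,\beta,\;\|\beta\|=1} C
\quad \text{subject to} \quad
y_i(\beta_0 + x_i^T\beta) \ge C,\;\; i = 1,\dots,N,
\]
% which is equivalent to
\[
\min_{\beta_0,\,\beta}\; \tfrac{1}{2}\|\beta\|^2
\quad \text{subject to} \quad
y_i(\beta_0 + x_i^T\beta) \ge 1,\;\; i = 1,\dots,N,
\qquad \text{with } C = 1/\|\beta\| .
\]
% Dual Lagrange problem: maximize
\[
L_D = \sum_{i=1}^{N}\alpha_i
      - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
        \alpha_i\alpha_j\,y_i y_j\,x_i^T x_j
\quad \text{subject to} \quad
\alpha_i \ge 0,\;\; \sum_{i=1}^{N}\alpha_i y_i = 0 .
\]
```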
9
Optimal Separating Hyper-plane
(Figure: the optimal hyper-plane f(X) = β*_0 + β*^T X = 0 with margin C on each side; the points lying on the margins are the support vectors.)
10
Basis Expansion
Suppose there are p inputs, X = (x_1 … x_p). Let h_k(X) be a transformation that maps X from R^p to R; h_k(X) is called a basis function. H = {h_1(X), …, h_m(X)} is the basis of a new feature space (dim = m).
Example: X = (x_1, x_2), H = {h_1(X), h_2(X), h_3(X)} with
h_1(X) = h_1(x_1, x_2) = x_1,
h_2(X) = h_2(x_1, x_2) = x_2,
h_3(X) = h_3(x_1, x_2) = x_1 x_2,
so X_new = H(X) = (x_1, x_2, x_1 x_2).
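A minimal R sketch of this particular basis expansion; the data are made up and the helper name H_expand is mine, not from the slides.

```r
# Explicit basis expansion for X = (x1, x2):
# H(X) = (h1, h2, h3) = (x1, x2, x1 * x2).
H_expand <- function(X) {
  cbind(h1 = X[, 1],
        h2 = X[, 2],
        h3 = X[, 1] * X[, 2])
}

# Hypothetical 2-dimensional inputs.
set.seed(1)
X <- matrix(rnorm(10), ncol = 2, dimnames = list(NULL, c("x1", "x2")))

X_new <- H_expand(X)   # new feature space of dimension m = 3
head(X_new)

# A classifier that is linear in X_new (b0 + b1*x1 + b2*x2 + b3*x1*x2)
# is non-linear in the original inputs (x1, x2).
```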
11
Support Vector Machines
The optimal hyper-plane {X | f(X) = β*_0 + β*^T X = 0}, together with the decision function f(X) = β*_0 + β*^T X, is called the Support Vector Classifier.
Separable case: all points lie outside of the margins. The classification rule is the sign of the decision function.
(Figure: the support vector classifier f(X) = β*_0 + β*^T X = 0 with margin C on either side.)
12
Support Vector Machines
Hyper-plane: {X | f(X) = β_0 + β^T X = 0}.
Non-separable case: the training data are not separable. Let S_i = C - y_i f(X_i) when X_i crosses the margin of its group, and S_i = 0 when X_i lies outside the margin; X_i crosses the margin of its group when C - y_i f(X_i) > 0.
Let ξ_i C = S_i, so ξ_i is the proportion of the margin C by which the prediction has crossed the margin. Misclassification occurs when S_i > C (ξ_i > 1).
13
Support Vector Machines
The overall misclassification is measured by Σ ξ_i, which is bounded by δ.
Maximization problem and dual Lagrange problem (non-separable case): maximize L_D subject to 0 ≤ α_i ≤ ζ and Σ α_i y_i = 0 (the reconstructed formulas are given below), together with:
1. α_i [y_i (x_i^T β + β_0) - (1 - ξ_i)] = 0
2. ν_i ≥ 0 for all i = 1…N
3. β = Σ α_i y_i x_i
4. the Kuhn-Tucker conditions
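As on the separable-case slide, the objective functions here were images; below is the standard non-separable (soft-margin) formulation written in the slide's notation, with ζ playing the role of the cost parameter.

```latex
% Primal (non-separable case): allow margin violations xi_i >= 0
\[
\min_{\beta_0,\,\beta,\,\xi}\;
\tfrac{1}{2}\|\beta\|^2 + \zeta\sum_{i=1}^{N}\xi_i
\quad \text{subject to} \quad
y_i(\beta_0 + x_i^T\beta) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
\]
% Dual Lagrange problem: maximize
\[
L_D = \sum_{i=1}^{N}\alpha_i
      - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
        \alpha_i\alpha_j\,y_i y_j\,x_i^T x_j
\quad \text{subject to} \quad
0 \le \alpha_i \le \zeta,\;\; \sum_{i=1}^{N}\alpha_i y_i = 0 .
\]
```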
14
Support Vector Machines
SVM searches for an optimal hyper-plane in a new feature space in which the data are more separable. Suppose H = {h_1(X), …, h_m(X)} is the basis for the new feature space F; every element of F is a linear basis expansion of X. The linear classifier becomes f(X) = β_0 + β^T H(X), and the dual Lagrange problem is the same as before with x_i replaced by H(x_i).
15
Support Vector Machines
The kernel is the inner product in the new feature space: K(X, X′) = ⟨H(X), H(X′)⟩. This implies the kernel and the basis transformation define one another (a worked example follows below).
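As a worked instance of "the kernel and the basis transformation define one another", here is the standard degree-2 polynomial case for two inputs; it is a textbook example rather than one taken from the slides.

```latex
% For X = (x_1, x_2), take the degree-2 polynomial kernel:
\[
K(X, X') = (1 + \langle X, X'\rangle)^2
         = (1 + x_1 x_1' + x_2 x_2')^2 .
\]
% Expanding the square gives
\[
K(X, X') = 1 + 2x_1x_1' + 2x_2x_2'
         + (x_1x_1')^2 + (x_2x_2')^2 + 2x_1x_1'x_2x_2',
\]
% which is exactly the inner product \(\langle H(X), H(X')\rangle\) for the basis
\[
H(X) = \bigl(1,\;\sqrt{2}\,x_1,\;\sqrt{2}\,x_2,\;x_1^2,\;x_2^2,\;\sqrt{2}\,x_1x_2\bigr).
\]
```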
16
Support Vector Machines
Dual Lagrange function: L_D = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j). This shows the basis transformation in SVM does not need to be defined explicitly. The most common kernels:
1. d-th degree polynomial: K(x, x′) = (1 + ⟨x, x′⟩)^d
2. Radial basis: K(x, x′) = exp(-γ ‖x - x′‖²)
3. Neural network: K(x, x′) = tanh(κ_1 ⟨x, x′⟩ + κ_2)
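A hedged R sketch showing how these three kernels map onto the e1071 package cited in the references (reference 3). The data here are made up, and e1071's "sigmoid" kernel plays the role of the neural-network kernel.

```r
library(e1071)  # svm(); see reference 3

# Made-up 2-class data for illustration only.
set.seed(42)
n   <- 100
x   <- matrix(rnorm(2 * n), ncol = 2)
y   <- factor(ifelse(x[, 1] * x[, 2] > 0, 1, 0))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# d-th degree polynomial kernel: (coef0 + gamma * <x, x'>)^degree
fit_poly <- svm(y ~ ., data = dat, kernel = "polynomial",
                degree = 2, coef0 = 1, cost = 1)

# Radial basis kernel: exp(-gamma * ||x - x'||^2)
fit_rbf  <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)

# "Neural network" (sigmoid) kernel: tanh(gamma * <x, x'> + coef0)
fit_nn   <- svm(y ~ ., data = dat, kernel = "sigmoid",
                gamma = 0.5, coef0 = 0, cost = 1)

# Training error for one of the fits.
mean(predict(fit_rbf, dat) != dat$y)
```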
17
Simulations
- 3 cases, 100 simulations per case
- Each simulation consists of 200 points: 100 points from each group
- Input space: 2-dimensional, X = (x_1, x_2)
- Output: Y Є {0, 1} (2 groups)
- Half of the points are randomly selected as the training set
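The slides do not include the simulation code, so here is a minimal sketch of what one Case 1 run might look like, assuming the MASS and e1071 packages. The group means and the shared identity covariance are illustrative placeholders; the actual parameter values used in the study are not given in the slides.

```r
library(MASS)   # mvrnorm(), lda()
library(e1071)  # svm()

set.seed(123)
n_sim <- 100
err_lda <- err_svm <- numeric(n_sim)   # test misclassification rates

for (s in 1:n_sim) {
  # Case 1 style data: two Normal groups with a common covariance matrix.
  g0 <- mvrnorm(100, mu = c(0, 0),     Sigma = diag(2))
  g1 <- mvrnorm(100, mu = c(1.5, 1.5), Sigma = diag(2))
  dat <- data.frame(rbind(g0, g1),
                    y = factor(rep(c(0, 1), each = 100)))
  names(dat)[1:2] <- c("x1", "x2")

  # Half of the 200 points randomly selected as the training set.
  train <- sample(nrow(dat), nrow(dat) / 2)

  fit_lda <- lda(y ~ x1 + x2, data = dat[train, ])
  fit_svm <- svm(y ~ x1 + x2, data = dat[train, ], kernel = "radial")

  pred_lda <- predict(fit_lda, dat[-train, ])$class
  pred_svm <- predict(fit_svm, dat[-train, ])

  err_lda[s] <- mean(pred_lda != dat$y[-train])
  err_svm[s] <- mean(pred_svm != dat$y[-train])
}

c(LDA = mean(err_lda), SVM = mean(err_svm))   # average test error over 100 runs
```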
18
Simulations: Case 1 (Normal with the same covariance matrix)
(Scatter plot: black ~ group 0, red ~ group 1.)
19
Simulations: Case 1 misclassification rates (over 100 simulations)

           Training            Testing
Method     Mean      SD        Mean      SD
LDA        7.85%     2.65      8.07%     2.51
SVM        6.98%     2.33      8.48%     2.81
20
Simulations: Case 2 (Normal with unequal covariance matrices)
(Scatter plot: black ~ group 0, red ~ group 1.)
21
Simulations: Case 2 misclassification rates (over 100 simulations)

           Training            Testing
Method     Mean      SD        Mean      SD
QDA        15.5%     3.75      16.84%    3.48
SVM        13.6%     4.03      18.8%     4.01
22
Simulations: Case 3 (Non-normal)
(Scatter plot: black ~ group 0, red ~ group 1.)
23
Simulations: Case 3 misclassification rates (over 100 simulations)

           Training            Testing
Method     Mean      SD        Mean      SD
QDA        14%       3.79      16.8%     3.63
SVM        9.34%     3.46      14.8%     3.21
24
Simulations
Paired t-test for differences in misclassification rates.
H0: mean difference = 0; Ha: mean difference ≠ 0.
Case 1: mean difference (LDA - SVM) = -0.41, se = 0.3877; t = -1.057, p-value = 0.29 (not significant).
Case 2: mean difference (QDA - SVM) = -1.96, se = 0.4170; t = -4.70, p-value = 8.42e-06 (significant).
Case 3: mean difference (QDA - SVM) = 2, se = 0.4218; t = 4.74, p-value = 7.13e-06 (significant).
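A small R sketch of this paired comparison, continuing the hypothetical err_lda / err_svm vectors from the Case 1 simulation sketch above (the per-simulation test misclassification rates for the two methods):

```r
# Paired t-test of H0: mean difference = 0 versus Ha: mean difference != 0,
# pairing the two methods within each simulated data set.
t.test(err_lda, err_svm, paired = TRUE)

# Equivalently, a one-sample t-test on the per-simulation differences.
t.test(err_lda - err_svm)
```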
25
Remarks
Support Vector Machines:
- Maps the original input space onto a feature space of higher dimension
- Makes no assumption about the distribution of the X's
Performance:
- Discriminant Analysis and SVM perform similarly when (X|Y) has a Normal distribution and both groups share the same Σ
- Discriminant Analysis performs better when the covariance matrices of the two groups are different
- SVM performs better when the input X violates the distributional assumption
26
References
1. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. New York: Cambridge University Press, 2000.
2. J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. New York: Springer, 2001.
3. D. Meyer, C. Chang, and C. Lin. R Documentation: Support Vector Machines. http://www.maths.lth.se/help/R/.R/library/e1071/html/svm.html Last updated: March 2006.
4. H. Planatscher and J. Dietzsch. SVM-Tutorial using R (e1071-package). http://www.potschi.de/svmtut/svmtut.htm
5. M. Van Den Heuvel, J. Horrocks, S. Bashar, S. Taylor, S. Burke, K. Hatta, E. Lewis, and A. Croy. Menstrual Cycle Hormones Induce Changes in Functional Interactions Between Lymphocytes and Endothelial Cells. Journal of Clinical Endocrinology and Metabolism, 2005.