Classification: Support Vector Machine 10/10/07
What hyperplane (line) can separate the two classes of data?
But there are many other choices! Which one is the best?
M: margin
Optimal separating hyperplane: the best hyperplane is the one that maximizes the margin, M.
Computing the margin width
A hyperplane is x^T β + β_0 = 0, with the "plus" plane x^T β + β_0 = 1 and the "minus" plane x^T β + β_0 = -1.
Find x⁺ on the "plus" plane and x⁻ on the "minus" plane so that x⁺ - x⁻ is perpendicular to the hyperplane (i.e., parallel to β). Then M = |x⁺ - x⁻|.
Since x⁺^T β + β_0 = 1 and x⁻^T β + β_0 = -1, subtracting gives (x⁺ - x⁻)^T β = 2, and because x⁺ - x⁻ is parallel to β,
M = |x⁺ - x⁻| = 2/|β|.
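A quick numeric check of the margin formula (a minimal sketch; the particular β and β_0 values are made up for illustration):

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative, not from the slides)
beta = np.array([3.0, 4.0])
beta0 = -2.0

# Pick any point on the "minus" plane x^T beta + beta0 = -1
x_minus = np.array([0.0, (-1 - beta0) / beta[1]])

# Step from x_minus along the direction of beta by 2/||beta||^2
x_plus = x_minus + 2.0 * beta / beta.dot(beta)

print(x_plus.dot(beta) + beta0)          # 1.0 -> x_plus lies on the "plus" plane
print(np.linalg.norm(x_plus - x_minus))  # 0.4
print(2.0 / np.linalg.norm(beta))        # 0.4 -> M = 2/||beta||
```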
Computing the margin width
The hyperplane is separating if y_i (x_i^T β + β_0) ≥ 1 for every training point (x_i, y_i), with y_i ∈ {-1, +1}.
The maximization problem is: maximize M = 2/|β| over β and β_0, subject to y_i (x_i^T β + β_0) ≥ 1 for all i.
The points that lie exactly on the margin, with y_i (x_i^T β + β_0) = 1, are the support vectors.
Optimal separating hyperplane
Rewrite the problem as: minimize (1/2)|β|^2 subject to y_i (x_i^T β + β_0) ≥ 1 for all i.
Lagrange function: L_P = (1/2)|β|^2 - Σ_i α_i [ y_i (x_i^T β + β_0) - 1 ], with α_i ≥ 0.
To minimize, set the partial derivatives with respect to β and β_0 to 0:
β = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
Can be solved by quadratic programming.
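A minimal sketch of fitting the optimal separating hyperplane on separable toy data, using scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin problem (the data and the C value are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated Gaussian clouds (hypothetical, linearly separable data)
X = np.vstack([rng.randn(20, 2) + [3, 3], rng.randn(20, 2) - [3, 3]])
y = np.array([1] * 20 + [-1] * 20)

# Very large C approximates the hard-margin optimal separating hyperplane
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta = clf.coef_.ravel()      # the fitted beta
beta0 = clf.intercept_[0]     # the fitted beta_0
print("margin M = 2/||beta|| =", 2.0 / np.linalg.norm(beta))
print("support vectors:", clf.support_vectors_)
```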
What is the best hyperplane when the two classes are non-separable? Idea: allow some points to lie on the wrong side of the margin, but not by much.
Support vector machine
When the two classes are not separable, the problem is slightly modified: find β, β_0, and slack variables ξ_i ≥ 0 that minimize (1/2)|β|^2 + C Σ_i ξ_i, subject to y_i (x_i^T β + β_0) ≥ 1 - ξ_i for all i.
Can be solved using quadratic programming.
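A minimal sketch of the soft-margin version on overlapping classes; the data, the C values, and the use of scikit-learn's SVC are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Overlapping clouds: no hyperplane separates them perfectly (hypothetical data)
X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])
y = np.array([1] * 50 + [-1] * 50)

# Smaller C tolerates more slack (more points on the wrong side of the margin)
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```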
Convert a non-separable case to a separable one by a nonlinear transformation.
(Figures: data that are non-separable in 1D become separable after the nonlinear transformation.)
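One common illustration of this idea (an assumption about what the figures show, not taken from the slides) maps a 1D point x to (x, x²), so that a "middle class vs. outer class" pattern can be separated by a line in the transformed space:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1D data: class -1 in the middle, class +1 on both sides
x = np.linspace(-3, 3, 61)
y = np.where(np.abs(x) < 1.5, -1, 1)

# A linear classifier cannot separate the classes in 1D
lin_1d = SVC(kernel="linear").fit(x.reshape(-1, 1), y)
print("1D linear accuracy:", lin_1d.score(x.reshape(-1, 1), y))

# After the nonlinear transformation h(x) = (x, x^2), a line can separate them
H = np.column_stack([x, x ** 2])
lin_2d = SVC(kernel="linear").fit(H, y)
print("transformed-space accuracy:", lin_2d.score(H, y))
```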
Kernel function
Introduce nonlinear basis functions h(x) = (h_1(x), ..., h_M(x)), and work in the transformed feature space. The separating function is then f(x) = h(x)^T β + β_0.
In the dual, the solution depends on the transformed data only through inner products, so in fact all you need is the kernel function K(x, x') = <h(x), h(x')>.
Common kernels:
– dth-degree polynomial: K(x, x') = (1 + <x, x'>)^d
– radial basis (Gaussian): K(x, x') = exp(-γ |x - x'|^2)
– neural network (sigmoid): K(x, x') = tanh(κ_1 <x, x'> + κ_2)
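A minimal sketch comparing a linear kernel with a radial-basis kernel on data that are not linearly separable (the dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # roughly chance level
print("RBF kernel accuracy:", rbf.score(X, y))        # close to 1.0
```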
Applications
Prediction of central nervous system embryonic tumor outcome (Pomeroy et al. 2002)
42 patient samples, 5 cancer types; the array contains 6817 genes.
Question: are different tumor types distinguishable from their gene expression patterns?
Gene expression profiles within a cancer type cluster together (Pomeroy et al. 2002)
PCA based on all genes (Pomeroy et al. 2002)
PCA based on a subset of informative genes (Pomeroy et al. 2002)
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks (Khan et al. 2001)
Four different cancer types, 88 samples, 6567 genes.
Goal: predict cancer type from gene expression data.
Procedures (Khan et al. 2001)
Filter out genes that have low expression values (retain 2308 genes).
Dimension reduction using PCA: select the top 10 principal components.
3-fold cross-validation.
Artificial Neural Network
(Khan et al. 2001)
Procedures (Khan et al. 2001)
Filter out genes that have low expression values (retain 2308 genes).
Dimension reduction using PCA: select the top 10 principal components.
3-fold cross-validation, repeated 1250 times.
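A minimal sketch of this kind of pipeline (expression filtering, PCA down to 10 components, 3-fold cross-validation with a small neural network); the expression matrix, the filtering threshold, and the network size are illustrative assumptions, not the exact choices of Khan et al.:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Hypothetical expression matrix: 88 samples x 6567 genes, 4 tumor classes
X = rng.lognormal(size=(88, 6567))
y = rng.randint(0, 4, size=88)

# Step 1: filter out genes with low expression (threshold is an assumption)
mean_expr = X.mean(axis=0)
X = X[:, mean_expr > np.percentile(mean_expr, 65)]

# Steps 2-3: PCA to 10 components, then a small neural network, in 3-fold CV
model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print("3-fold CV accuracy:", cross_val_score(model, X, y, cv=cv).mean())
```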
Acknowledgement
Sources of slides:
– Cheng Li
– …01/kdd2001-tutorial-final.pdf (URL truncated in the original)
– …pt (URL truncated in the original)
Aggregating predictors
Sometimes aggregating several predictors performs better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which can be the same kind of predictor trained on slightly perturbed training datasets. The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
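A minimal sketch of one such aggregation scheme: bagging unstable classification trees trained on bootstrap-perturbed versions of the training set (the dataset and the number of trees are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)

print("single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```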
AdaBoost
Step 1: Initialize the observation weights w_i = 1/N, i = 1, ..., N.
Step 2: For m = 1 to M,
– Fit a classifier G_m(x) to the training data using weights w_i.
– Compute the weighted error err_m = Σ_i w_i I(y_i ≠ G_m(x_i)) / Σ_i w_i.
– Compute α_m = log((1 - err_m) / err_m).
– Set w_i ← w_i · exp[α_m · I(y_i ≠ G_m(x_i))]; misclassified observations are given more weight.
Step 3: Output G(x) = sign[ Σ_m α_m G_m(x) ].
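A minimal sketch of these steps, using decision stumps as the base classifier G_m (the data and the choice of stump are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                        # labels in {-1, +1}
N, M = len(y), 50

w = np.full(N, 1.0 / N)                # Step 1: initialize observation weights
stumps, alphas = [], []

for m in range(M):                     # Step 2
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (G.predict(X) != y).astype(float)
    err = np.clip(np.dot(w, miss) / w.sum(), 1e-10, 1 - 1e-10)  # err_m
    alpha = np.log((1 - err) / err)                             # alpha_m
    w = w * np.exp(alpha * miss)       # upweight misclassified observations
    w = w / w.sum()
    stumps.append(G)
    alphas.append(alpha)

# Step 3: weighted majority vote G(x) = sign(sum_m alpha_m G_m(x))
F = sum(a * g.predict(X) for a, g in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```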
Boosting
Optimal separating hyperplane (dual)
Substituting, we get the Lagrange (Wolfe) dual function
L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j,
to be maximized subject to α_i ≥ 0 and Σ_i α_i y_i = 0. (To complete the steps, see Burges et al.)
If α_i > 0, then y_i (x_i^T β + β_0) = 1; these x_i's are called the support vectors.
Since β = Σ_i α_i y_i x_i, β is determined only by the support vectors.
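A minimal sketch checking that β is determined only by the support vectors, using scikit-learn's SVC, whose dual_coef_ attribute stores α_i y_i for the support vectors (the data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) + [2, 2], rng.randn(30, 2) - [2, 2]])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# beta reconstructed from the support vectors: beta = sum_i alpha_i y_i x_i
beta_from_sv = clf.dual_coef_ @ clf.support_vectors_  # dual_coef_ holds alpha_i * y_i
print(beta_from_sv.ravel())
print(clf.coef_.ravel())   # identical up to rounding
```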
Support vector machine (dual)
The Lagrange function is
L_P = (1/2)|β|^2 + C Σ_i ξ_i - Σ_i α_i [ y_i (x_i^T β + β_0) - (1 - ξ_i) ] - Σ_i μ_i ξ_i.
Setting the partial derivatives with respect to β, β_0, and ξ_i to 0 gives
β = Σ_i α_i y_i x_i,  Σ_i α_i y_i = 0,  α_i = C - μ_i.
Substituting, we get the dual
L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j,
subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0.