Presentation on theme: "Classification: Support Vector Machine 10/10/07. What hyperplane (line) can separate the two classes of data?"— Presentation transcript:

1 Classification: Support Vector Machine 10/10/07

2 What hyperplane (line) can separate the two classes of data?

3 But there are many other choices! Which one is the best?

4 What hyperplane (line) can separate the two classes of data? But there are many other choices! Which one is the best? M: margin

5 Optimal separating hyperplane: the best hyperplane is the one that maximizes the margin, M.

6 Computing the margin width. A hyperplane is $x^T\beta + \beta_0 = 0$; the "plus" and "minus" planes are $x^T\beta + \beta_0 = 1$ and $x^T\beta + \beta_0 = -1$. Find $x^+$ on the plus plane and $x^-$ on the minus plane so that $x^+ - x^-$ is perpendicular to $\beta$. Then $M = |x^+ - x^-|$.

7 Computing the margin width (continued). Since $x^{+T}\beta + \beta_0 = 1$ and $x^{-T}\beta + \beta_0 = -1$, subtracting gives $(x^+ - x^-)^T\beta = 2$. Because $x^+ - x^-$ is parallel to $\beta$, the margin is $M = |x^+ - x^-| = 2/|\beta|$.

8 Computing the margin width (continued). The hyperplane is separating if $y_i(x_i^T\beta + \beta_0) > 0$ for all $i$. Scaling $\beta$ and $\beta_0$ so that the closest points satisfy $y_i(x_i^T\beta + \beta_0) = 1$, the maximization problem is: maximize $M = 2/|\beta|$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1$ for all $i$. The points lying exactly on the margin are the support vectors.

9 Optimal separating hyperplane. Rewrite the problem as: minimize $\tfrac{1}{2}|\beta|^2$ over $\beta, \beta_0$, subject to $y_i(x_i^T\beta + \beta_0) \ge 1$ for all $i$. The Lagrange (primal) function is $L_P = \tfrac{1}{2}|\beta|^2 - \sum_i \alpha_i [y_i(x_i^T\beta + \beta_0) - 1]$. To minimize, set the partial derivatives with respect to $\beta$ and $\beta_0$ to 0. The problem can be solved by quadratic programming.
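As a concrete illustration of these quantities, the following minimal sketch (assuming scikit-learn and NumPy; the toy data are made up, and a very large C is used to approximate the hard-margin problem) recovers $\beta$, $\beta_0$, the margin $M = 2/|\beta|$, and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem;
# the quadratic program is solved internally.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta = clf.coef_.ravel()               # beta
beta0 = clf.intercept_[0]              # beta_0
margin = 2.0 / np.linalg.norm(beta)    # M = 2 / |beta|

print("beta   =", beta)
print("beta0  =", beta0)
print("margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```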

10 What is the best hyperplane when the two classes are non-separable? Idea: allow some points to lie on the wrong side of the margin, but not by much.

11 Support vector machine. When the two classes are not separable, the problem is slightly modified: find $\beta, \beta_0$ minimizing $\tfrac{1}{2}|\beta|^2 + C\sum_i \xi_i$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$, where the slack variables $\xi_i$ measure how far points fall on the wrong side of their margin. This can also be solved using quadratic programming.
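A small sketch of the soft-margin trade-off (assuming scikit-learn; the overlapping data are synthetic): a smaller C tolerates more margin violations and typically yields a wider margin and more support vectors, while a larger C penalizes violations more heavily.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (non-separable) synthetic classes
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:>6}:  margin={margin:.3f}  "
          f"#support vectors={clf.support_vectors_.shape[0]}")
```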

12 Convert a non-separable case to a separable one by a nonlinear transformation: non-separable in 1D.

13 Convert a non-separable case to a separable one by a nonlinear transformation: separable in 1D after the transformation.

14 Kernel function. Introduce nonlinear basis functions $h(x)$ and work on the transformed features. The separating function is then $f(x) = h(x)^T\beta + \beta_0$, which the dual solution expresses as $f(x) = \sum_i \alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0$. In fact, all you need is the kernel function $K(x, x') = \langle h(x), h(x')\rangle$. Common kernels: the dth-degree polynomial $K(x, x') = (1 + \langle x, x'\rangle)^d$, the radial basis $K(x, x') = \exp(-|x - x'|^2/c)$, and the neural-network (sigmoid) kernel $K(x, x') = \tanh(\kappa_1\langle x, x'\rangle + \kappa_2)$.
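To make this concrete, here is a minimal sketch (assuming scikit-learn; the 1-D toy data and the class boundary at |x| = 1.5 are made up) showing that points that cannot be separated on the line become separable after the explicit transform h(x) = (x, x²), and that a degree-2 polynomial kernel achieves the same separation without forming h(x) explicitly.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class +1 in the middle, class -1 on both sides (not separable in 1-D)
x = np.linspace(-3, 3, 61)
y = np.where(np.abs(x) < 1.5, 1, -1)

# Explicit nonlinear transform h(x) = (x, x^2): separable by a line in the new space
H = np.column_stack([x, x ** 2])
print(SVC(kernel="linear", C=100.0).fit(H, y).score(H, y))   # should reach 1.0

# Equivalent: keep the original 1-D data and use a degree-2 polynomial kernel
X1 = x.reshape(-1, 1)
clf = SVC(kernel="poly", degree=2, coef0=1, C=100.0).fit(X1, y)
print(clf.score(X1, y))                                      # should reach 1.0
```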

15 Applications

16 Prediction of central nervous system embryonic tumor outcome (Pomeroy et al. 2002). 42 patient samples, 5 cancer types; the array contains 6817 genes. Question: are different tumor types distinguishable from their gene expression patterns?

17

18 Gene expression profiles within a cancer type cluster together (Pomeroy et al. 2002)

19 PCA based on all genes (Pomeroy et al. 2002)

20 PCA based on a subset of informative genes (Pomeroy et al. 2002)
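A minimal sketch of the kind of PCA projection shown in these figures (assuming scikit-learn; the expression matrix and labels are random stand-ins, and the ANOVA F-test used to pick the "informative" subset is an assumption, not necessarily the criterion used by Pomeroy et al.).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 6817))     # 42 samples x 6817 genes (random stand-in)
y = rng.integers(0, 5, size=42)     # 5 tumor classes (random labels)

# PCA based on all genes
pcs_all = PCA(n_components=2).fit_transform(X)

# PCA based on a subset of informative genes (here: top 100 by ANOVA F-score)
X_sub = SelectKBest(f_classif, k=100).fit_transform(X, y)
pcs_sub = PCA(n_components=2).fit_transform(X_sub)

print(pcs_all.shape, pcs_sub.shape)   # (42, 2) (42, 2)
```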

21

22

23 Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks (Khan et al. 2001). Four different cancer types, 88 samples, 6567 genes. Goal: predict cancer types from gene expression data.

24 (Khan et al. 2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

25 Procedures (Khan et al. 2001): filter out genes that have low expression values (retain 2308 genes); dimension reduction using PCA, selecting the top 10 principal components; 3-fold cross-validation.
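A sketch of this procedure in scikit-learn terms. The expression filter threshold, the random data, and the logistic-regression classifier (a stand-in for the linear artificial neural network used in the paper) are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.lognormal(size=(88, 6567))   # 88 samples x 6567 genes (random stand-in)
y = rng.integers(0, 4, size=88)      # 4 cancer types (random labels)

# 1. Filter out genes with low expression (arbitrary threshold,
#    keeping roughly the top third to mimic retaining ~2308 genes)
mean_expr = X.mean(axis=0)
X = X[:, mean_expr > np.quantile(mean_expr, 0.65)]

# 2.-3. PCA to the top 10 principal components, then a classifier,
#        evaluated by 3-fold cross-validation
model = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv))
```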

26 Artificial Neural Network

27

28 (Khan et al. 2001)

29 Procedures (Khan et al. 2001): filter out genes that have low expression values (retain 2308 genes); dimension reduction using PCA, selecting the top 10 principal components; 3-fold cross-validation, repeated 1250 times.

30

31

32 Acknowledgement. Sources of slides: Cheng Li; http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf; www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt

33 Aggregating predictors. Sometimes aggregating several predictors performs better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which can be the same kind of predictor trained on slightly perturbed training datasets. The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
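A minimal sketch of this aggregation idea (bagging unstable classification trees over bootstrap-perturbed training sets), assuming scikit-learn; the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)              # unstable base learner
bagged = BaggingClassifier(tree, n_estimators=100, random_state=0)

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```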

34 AdaBoost. Step 1: Initialize the observation weights $w_i = 1/N$, $i = 1, \dots, N$. Step 2: For $m = 1$ to $M$: fit a classifier $G_m(x)$ to the training data using the weights $w_i$; compute the weighted error $\mathrm{err}_m = \sum_i w_i I(y_i \ne G_m(x_i)) / \sum_i w_i$ and $\alpha_m = \log[(1 - \mathrm{err}_m)/\mathrm{err}_m]$; set $w_i \leftarrow w_i \exp[\alpha_m I(y_i \ne G_m(x_i))]$, so that misclassified observations are given more weight. Step 3: Output $G(x) = \mathrm{sign}[\sum_{m=1}^M \alpha_m G_m(x)]$.
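A direct transcription of these steps into code (a sketch assuming scikit-learn; the synthetic data and the choice of decision stumps as the weak learners G_m are assumptions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)             # labels in {-1, +1}
N, M = len(y), 50

w = np.full(N, 1.0 / N)                 # Step 1: initialize weights w_i = 1/N
alphas, learners = [], []
for m in range(M):                      # Step 2
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (G.predict(X) != y)
    err = np.sum(w * miss) / np.sum(w)              # weighted error err_m
    alpha = np.log((1 - err) / (err + 1e-16))       # alpha_m
    w = w * np.exp(alpha * miss)        # misclassified obs get more weight
    alphas.append(alpha)
    learners.append(G)

# Step 3: output the weighted vote G(x) = sign(sum_m alpha_m G_m(x))
F = sum(a * G.predict(X) for a, G in zip(alphas, learners))
print("training accuracy:", np.mean(np.sign(F) == y))
```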

35 Boosting

36 Optimal separating hyperplane (continued). Substituting, we get the Lagrange (Wolfe) dual function $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_k \alpha_i\alpha_k y_i y_k x_i^T x_k$, to be maximized subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$. To complete the steps, see Burges et al. If $\alpha_i > 0$, then $y_i(x_i^T\beta + \beta_0) = 1$; these $x_i$'s are called the support vectors. Since $\beta = \sum_i \alpha_i y_i x_i$, the solution is determined only by the support vectors.
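The relation $\beta = \sum_i \alpha_i y_i x_i$ can be checked numerically. The sketch below (assuming scikit-learn, whose `dual_coef_` attribute stores $\alpha_i y_i$ for the support vectors only; the data are synthetic) compares the sum over support vectors with the $\beta$ reported directly for a linear kernel.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
beta_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta_from_sv, clf.coef_))   # True: beta depends only on the SVs
```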

37 Support vector machine (continued). The Lagrange (primal) function is $L_P = \tfrac{1}{2}|\beta|^2 + C\sum_i \xi_i - \sum_i \alpha_i [y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_i \mu_i \xi_i$. Setting the partial derivatives to 0 and substituting, we get the dual $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_k \alpha_i\alpha_k y_i y_k x_i^T x_k$, subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.

