Computational Statistics with Application to Bioinformatics
Prof. William H. Press
Spring Term, 2008
The University of Texas at Austin

Unit 18: Support Vector Machines (SVMs)

Unit 18: Support Vector Machines (SVMs) (Summary)

The simplest case of SVM is linear separation by a "fat plane"
– a quadratic programming problem in the space of features
– its dual problem lives in a space whose dimension is the number of data points
– both problems are readily solvable by standard algorithms

Next, allow "soft margin" approximate linear separation
– this gives an almost identical quadratic programming problem
– there is now a free smoothing parameter λ
– λ is not the ROC parameter, though it does affect TPR vs. FPR

The kernel trick
– embedding in a high-dimensional space makes linear separation much more powerful
– this is easily accomplished in the dual space
– in fact, you can "guess a kernel" and jump to an effectively infinite-dimensional space
– you have to guess or search for the optimal parameters of your kernel

Good SVM implementations (e.g., SVMlight) are freely available
– we demonstrate by classifying yeast genes as mitochondrial or not

A Support Vector Machine (SVM) is an algorithm for supervised binary classification.

First, consider the case of data that is perfectly separable by a "fat plane"; this is called "maximum margin SVM". The classifier projects each x to a single coordinate along w, and finding the fattest separating plane is a quadratic programming problem. Notice that this problem "lives" in a space whose dimension is the same as that of x (the feature space).
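For reference, here is a standard way to write that quadratic programming problem (conventional textbook notation, which may differ slightly from the symbols on the original slide):

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad\text{subject to}\quad y_i\,(w\cdot x_i + b)\ \ge\ 1,\qquad i = 1,\dots,n,$$

where $y_i = \pm 1$ are the class labels. The two bounding planes $w\cdot x + b = \pm 1$ are a distance $2/\lVert w\rVert$ apart, so minimizing $\lVert w\rVert$ makes the separating "fat plane" as fat as possible; the classifier is $\operatorname{sign}(w\cdot x + b)$.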

It turns out that every "primal" problem in quadratic programming (or, more generally, convex programming) has an equivalent "dual" problem [strong duality and Kuhn-Tucker theorems]. The dual problem for the perfectly separable SVM is written in terms of the "Gram matrix" G_ij and a vector of all ones (see below). Notice that this dual problem lives in a space of dimension "number of data points", and that, except for calculating G_ij, the dimension of the feature space is irrelevant. This is going to let us move into infinite-dimensional spaces! Once you have found the minimizing α's, you get the actual answer (w and b) from them, and the classifier is the sign of w·x + b.
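In conventional notation (a standard form, not necessarily the slide's exact symbols), the dual is

$$\max_{\alpha}\ \ \mathbf{1}^{T}\alpha \;-\; \tfrac{1}{2}\,\alpha^{T} G\,\alpha, \qquad G_{ij} \equiv y_i\,y_j\,(x_i\cdot x_j),$$

subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$, where $\mathbf{1}$ is the vector of all ones and $G$ is the (label-weighted) Gram matrix. The primal solution is then recovered as

$$w = \sum_i \alpha_i\, y_i\, x_i,$$

with $b$ fixed by requiring $y_i(w\cdot x_i + b) = 1$ at any support vector (any point with $\alpha_i > 0$), and the classifier is $\operatorname{sign}(w\cdot x + b)$.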

The "1-Norm Soft-Margin SVM" and its dual. We state, but gloss over, the details: to go from "maximum margin" to "soft margin" (allowing imperfect separation), the only equation that changes is one constraint on the α's (written out below), where λ is a new "softness parameter" (inverse penalty for unseparated points). See NR3.

You get to choose λ. It is a smoothing parameter:
– λ → 0: smooth classifier, really fat fat-plane, less accurate on the training data, but possibly more robust on new data
– λ → ∞: sharp classifier, thinner fat-plane, more accurate on the training data, but possibly less robust on new data

TPR and FPR will vary with λ, but λ is not the ROC parameter, which would vary b. You should view each choice of λ as being a different classifier, with its own ROC curve. There is not always one that dominates.
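A sketch of what changes, in one common way of writing it (the convention in which λ → ∞ recovers the hard-margin problem, matching the limits described above; the slide's own notation may differ): the dual constraint $\alpha_i \ge 0$ becomes the box constraint

$$0 \ \le\ \alpha_i \ \le\ \lambda,$$

with everything else in the dual unchanged. Equivalently, the primal acquires slack variables $\xi_i \ge 0$, one per data point:

$$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + \lambda \sum_i \xi_i \quad\text{subject to}\quad y_i\,(w\cdot x_i + b) \ \ge\ 1 - \xi_i .$$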

"The Kernel Trick". This is the real power of SVM; otherwise it would just be linear fiddling around. Imagine an embedding function Φ that maps the n-dimensional feature space into a much higher, N-dimensional space. The point is that very complicated separations in n-space can be represented as linear separations in N-space. A simple example is an embedding whose components are all the monomials of degree ≤ 2 in the original features, which would allow all quadratic separations.
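For instance, in two dimensions such an embedding (a standard illustration, not necessarily the slide's exact example) is

$$\Phi(x_1, x_2) \;=\; \bigl(1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2\bigr).$$

A linear separation $w\cdot\Phi(x) + b = 0$ in this 6-dimensional space is a general conic section back in the original $(x_1, x_2)$ plane, so circles, ellipses, parabolas, and hyperbolas all become available as decision boundaries.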

In the embedding space, the (dual) problem to be solved has exactly the same form as before, except that the Gram matrix is now built from dot products of the embedded points, with the kernel K(x_i, x_j) = Φ(x_i)·Φ(x_j). Now the "kernel trick" is: instead of guessing an embedding Φ, just guess a kernel K directly! This is breathtakingly audacious, but it often works! A guessed kernel could have come from some embedding, even if you don't know what the embedding is, provided it has the right properties (listed below).
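Those properties are the standard Mercer conditions (stated here in their usual form): K must be symmetric, $K(x, y) = K(y, x)$, and positive semidefinite, meaning that for any finite set of points $x_1,\dots,x_m$ and any coefficients $c_1,\dots,c_m$,

$$\sum_{i,j} c_i\, c_j\, K(x_i, x_j) \ \ge\ 0,$$

i.e., every Gram matrix built from K has no negative eigenvalues. Any such K is the dot product of some embedding Φ, even if Φ is effectively infinite-dimensional and never written down.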

In practice, you rarely make up your own kernels, but rather use one or another of the standard, tried-and-true forms (listed below). The kernel trick, by the way, is applicable to various other learning algorithms, including PCA. The field is called "kernel-based learning algorithms".
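The usual menu (standard definitions; the parameter names here are the conventional ones, not necessarily the slide's):

– linear: $K(x, y) = x\cdot y$
– polynomial ("power") kernel: $K(x, y) = (x\cdot y + c)^d$
– Gaussian radial basis function: $K(x, y) = \exp\!\bigl(-\gamma\,\lVert x - y\rVert^2\bigr)$
– sigmoid: $K(x, y) = \tanh(\kappa\, x\cdot y + c)$ (a valid Mercer kernel only for some parameter values)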

Example: learn a 2-d classification on data that has no good linear separation. (Figure: the resulting decision boundaries for a polynomial kernel with d = 8 and for a Gaussian radial basis function kernel.) When you pick a kernel, you have to play around to find good parameters, and also a good choice of λ, the softness parameter. SVM is fundamentally heuristic. Always divide your data into disjoint training and testing sets for this, or else you will get bitten by "overtraining".
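As a concrete sketch of that kind of parameter search (this uses scikit-learn's SVC rather than the SVMlight program demonstrated below, and the toy data and grid values are invented for illustration):

# Grid-search kernel parameters on a training set, then score once on a
# disjoint test set. C plays the role of the softness parameter lambda.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))                     # toy 2-d features
y = (X[:, 0]**2 + X[:, 1]**2 > 1.0).astype(int)   # no linear separation

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}  # powers of 10
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))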

Although NR3 has a "toy" SVM code, this is a case where you should use someone else's well-optimized code. For SVM, a good (free) starting point is SVMlight. Its input is a sparse "label index:value" format, one example per line:

mytrain.txt:
+1 1:0.4 2:-1.5 3:-0.1 4:3.2 …
-1 1:-1.6 2:-0.5 3:-0.2 4:-1.2 …
+1 1:-0.9 3:-0.7 4:1.1 …
…

> svm_learn -t 1 -d 2 mytrain.txt mymodel
> svm_classify mytest.txt mymodel myanswer

Here -t 1 -d 2 selects, e.g., the power (polynomial) kernel with d = 2.
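If your features are already sitting in an array, a short sketch like the following writes them out in that same sparse format (this uses scikit-learn's dump_svmlight_file helper and the made-up numbers from the example above; the file name is a placeholder):

# Write a feature matrix and +/-1 labels as "label index:value" lines,
# with 1-based feature indices as in mytrain.txt above.
import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.array([[ 0.4, -1.5, -0.1,  3.2],
              [-1.6, -0.5, -0.2, -1.2]])
y = np.array([+1, -1])

dump_svmlight_file(X, y, "mytrain.txt", zero_based=False)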

Let's see if we can learn to predict yeast mitochondrial genes from gene expression profiles. (Figure: expression profiles of genes known to be expressed in mitochondria, svm_mito.txt, and of genes known not to be expressed in mitochondria, svm_nonmito.txt; both files are on the course web site.)

Train and classify with a linear kernel (svm_train.txt and svm_test.txt are on the course web site):

svm_learn -t 0 svm_train.txt svm_model
Scanning examples...done
Reading examples into memory OK. (3034 examples read)
Setting default regularization parameter C=
Optimizing done. (2091 iterations)

svm_classify svm_test.txt svm_model svm_output
Reading model...OK. (789 support vectors read)
Classifying test examples done
Runtime (without IO) in cpu-seconds: 0.00
Accuracy on test set: 89.92% (910 correct, 102 incorrect, 1012 total)
Precision/recall on test set: 79.31%/33.82%

(Table: the corresponding confusion matrix, actual vs. prediction, with cells TP, FP, FN, TN.)

SVMlight parameterizes the confusion matrix by ACC, precision, recall, and N! (Why?)

Derive yet another confusion matrix transformation, now from (N, accuracy, precision, recall) to (TP, FN, FP, TN). (Table: 2×2 confusion matrix, actual vs. prediction, with cells TP, FP, FN, TN.)
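One way to do it (a worked sketch; the slide leaves this as an exercise): write the definitions $R = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, $P = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, $A = (\mathrm{TP}+\mathrm{TN})/N$, so that

$$\mathrm{FN} = \mathrm{TP}\,\frac{1-R}{R}, \qquad \mathrm{FP} = \mathrm{TP}\,\frac{1-P}{P}, \qquad \mathrm{TN} = A\,N - \mathrm{TP}.$$

Substituting into $\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN} = N$ and solving gives

$$\mathrm{TP} = \frac{(1-A)\,N\,P\,R}{R\,(1-P) + P\,(1-R)},$$

after which FN, FP, TN follow from the three relations above. As a check, the linear-kernel run on the previous slide (N = 1012, A = 0.8992, P = 0.7931, R = 0.3382) gives TP ≈ 46, FP ≈ 12, FN ≈ 90, TN ≈ 864, which indeed sum to 1012.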

Now the same thing with a Gaussian radial basis function kernel:

svm_learn -t 2 -g .001 svm_train.txt svm_model
Scanning examples...done
Reading examples into memory OK. (3034 examples read)
Setting default regularization parameter C=
Optimizing
Checking optimality of inactive variables...done.

svm_classify svm_test.txt svm_model svm_output
Reading model...OK. (901 support vectors read)
Classifying test examples done
Runtime (without IO) in cpu-seconds: 0.56
Accuracy on test set: 91.50% (926 correct, 86 incorrect, 1012 total)
Precision/recall on test set: 91.67%/40.44%

(Table: the corresponding confusion matrix, actual vs. prediction, with cells TP, FP, FN, TN.)
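Applying the transformation derived on the previous slide to these numbers (N = 1012, A = 0.9150, P = 0.9167, R = 0.4044) gives TP ≈ 55, FP ≈ 5, FN ≈ 81, TN ≈ 871. Note that TP + FN = 136 positives in the test set, the same as for the linear kernel, as it must be.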

SVM is not a panacea for all binary classification problems
– actually, I had to play around to find a convincing example
– mitochondrial genes worked, but nuclear genes didn't!

It works best when you have about equal numbers of + and – examples in the training set
– but there are weighting schemes that try to correct for this
– mitochondrial genes worked well, even at 1:10 in the training set

Although you have to guess parameters, you can usually try powers-of-10 values to bracket them
– you can automate a systematic search if you need to

Every once in a while, SVM really works well on a problem!

Also, SVM contains many key ideas of convex programming
– which are also used in interior-point methods for linear programming

There are lots of generalizations of SVM.