Support Vector Machines


Support Vector Machines Some slides were borrowed from Andrew Moore's PowerPoint slides on SVMs. Andrew's PowerPoint repository is here: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Support Vector Machines
- Very popular ML technique
- Became popular in the late 90s (Vapnik 1995; 1998)
- Invented in the late 70s (Vapnik, 1979)
- Controls complexity and overfitting, so works well on a wide range of practical problems
- Can handle high-dimensional vector spaces, which makes feature selection less critical
- Very fast and memory-efficient implementations, e.g., svm_light
- Not always the best solution, especially for problems with small vector spaces
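
The slide mentions svm_light as one fast implementation; as a minimal sketch (using scikit-learn and a synthetic dataset, neither of which is from the slides), training and evaluating a linear SVM looks like this:

```python
# Minimal sketch: train a linear SVM on synthetic data (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# 1000 points in a 20-dimensional feature space, two classes
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(C=1.0)   # C controls the complexity/overfitting trade-off
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```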

Linear Classifiers
f(x, w, b) = sign(w · x - b)
[Several slides show the same setup on different datasets: a scatter of points labeled +1 and -1, with the question "How would you classify this data?"]

Linear Classifiers
f(x, w, b) = sign(w · x - b)
Any of these would be fine... but which is best?

Classifier Margin
f(x, w, b) = sign(w · x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin
f(x, w, b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM: linear SVM). Support vectors are those datapoints that the margin pushes up against.

Why Maximum Margin?
- Intuitively this feels safest.
- If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
- LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector datapoints.
- There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
- Empirically it works very, very well.
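
A small experiment can illustrate the LOOCV point: refitting after removing a non-support-vector point leaves the boundary essentially unchanged. This sketch assumes scikit-learn and a synthetic, well-separated dataset.

```python
# Removing a non-support-vector point leaves the learned (w, b) essentially unchanged.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=7)

clf = SVC(kernel="linear", C=1e3).fit(X, y)
non_sv = np.setdiff1d(np.arange(len(X)), clf.support_)  # indices of non-support vectors

# Drop one non-support-vector point and refit
keep = np.delete(np.arange(len(X)), non_sv[0])
clf2 = SVC(kernel="linear", C=1e3).fit(X[keep], y[keep])

print("w before:", clf.coef_, "b before:", clf.intercept_)
print("w after :", clf2.coef_, "b after :", clf2.intercept_)  # (nearly) identical
```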

Specifying a line and margin
[Figure: the classifier boundary, with a plus-plane bounding the "Predict Class = +1" zone and a minus-plane bounding the "Predict Class = -1" zone.]
How do we represent this mathematically? ...in m input dimensions?

Specifying a line and margin
Plus-plane  = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = -1 }
Classifier boundary: w · x + b = 0
Classify as:
+1 if w · x + b >= 1
-1 if w · x + b <= -1
"Universe explodes" if -1 < w · x + b < 1
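
The decision rule above can be written directly in code. A minimal NumPy sketch (the weight vector, bias, and test points are made-up values for illustration):

```python
# Sketch of the three-zone decision rule from the slide.
import numpy as np

def classify(x, w, b):
    s = np.dot(w, x) + b
    if s >= 1:
        return +1    # in the "Predict Class = +1" zone (at or beyond the plus-plane)
    elif s <= -1:
        return -1    # in the "Predict Class = -1" zone (at or beyond the minus-plane)
    else:
        return 0     # inside the margin: "universe explodes" (no training point should land here)

w, b = np.array([2.0, -1.0]), 0.5
print(classify(np.array([1.0, 0.0]), w, b))   # w.x + b = 2.5  -> +1
print(classify(np.array([-1.0, 0.0]), w, b))  # w.x + b = -1.5 -> -1
print(classify(np.array([0.0, 0.0]), w, b))   # w.x + b = 0.5  -> inside the margin
```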

Learning the Maximum Margin Classifier
M = margin width = 2 / ||w||
Given a guess of w and b we can:
- Compute whether all data points are in the correct half-planes
- Compute the width of the margin
So write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
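
As a hedged illustration, a library such as scikit-learn performs the search over w and b for us; given the learned w, the margin width follows from M = 2 / ||w||:

```python
# Fit a linear SVM on synthetic data and compute the margin width M = 2 / ||w||.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=3)
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]              # learned weight vector
b = clf.intercept_[0]         # learned bias
M = 2.0 / np.linalg.norm(w)   # distance between w.x + b = +1 and w.x + b = -1
print("margin width:", M)
```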

Learning SVMs
- Trick #1: Just find the points that would be closest to the optimal separating plane ("support vectors") and work directly from those instances.
- Trick #2: Represent the task as a quadratic optimization problem, and use quadratic programming techniques.
- Trick #3 (the "kernel trick"): Instead of using the raw features, represent the data in a high-dimensional feature space constructed from a set of basis functions (e.g., polynomial and Gaussian combinations of the base features), and find the separating plane / SVM in that high-dimensional space. Voila: a nonlinear classifier! (A sketch follows below.)
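
A sketch of Tricks #2 and #3 using scikit-learn (whose SVC solves the quadratic program and applies the kernel implicitly): a linear SVM fails on concentric circles, while an RBF (Gaussian) kernel separates them. The dataset and parameters are illustrative.

```python
# Kernel trick sketch: linear SVM vs. RBF-kernel SVM on data no line can separate.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)          # Gaussian basis functions

print("linear kernel accuracy:", linear.score(X, y))  # roughly chance (~0.5)
print("RBF kernel accuracy:   ", rbf.score(X, y))     # near 1.0
```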

SVM Performance
- Can handle very large feature spaces (e.g., 100K features)
- Relatively fast
- Anecdotally they work very, very well indeed
- Example: they are currently the best-known classifier on a well-studied handwritten-character recognition benchmark

Binary vs. multi-class classification
SVMs can only do binary classification. Two approaches to multi-class classification:
- One-vs-all: turn an n-way classification into n binary classification tasks (e.g., for the zoo problem: mammal vs. not-mammal, fish vs. not-fish, ...) and pick the class whose classifier gives the highest score.
- One-vs-one: train n*(n-1)/2 pairwise classifiers (mammal vs. fish, mammal vs. reptile, etc.) that vote on the result.
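
Both strategies are easy to try with an off-the-shelf library; the sketch below assumes scikit-learn and the 3-class iris dataset (not from the slides).

```python
# One-vs-rest and one-vs-one multi-class SVMs built from binary LinearSVCs.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)  # n binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)   # n*(n-1)/2 binary classifiers

print(len(ovr.estimators_))  # 3 (one "class vs. rest" SVM per class)
print(len(ovo.estimators_))  # 3 (3*2/2 pairwise SVMs)
```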

Feature Engineering
Feature engineering is finding features for data instances that make machine learning algorithms more effective. It is usually a combination of domain knowledge and experimentation. For example, consider the problem of classifying an email message as spam or not: what are good features?

Example of a Spam Message
Received: from nm37-vm8.bullet.mail.gq1.yahoo.com (nm37-vm8.bullet.mail.gq1.yahoo.com [98.136.217.44])
Date: Mon, 2 May 2016 15:26:36 +0000 (UTC)
From: "Mrs.Sabeen Gharam" <SabeenGharam1@outlook.com>
Reply-To: "Mrs.Sabeen Gharam" <aramsabeen1@gmail.com>
Subject: CAN YOU HANDLE INVESTMENT?

CAN YOU HANDLE INVESTMENT? My name is Mrs. Sabeen Gharam Aram widow of late Mohamed Assad Aram from idlib in syria arab republic, am in an urgent need of an individual or organization abroad that is willing to help to secure, move and invest my late husband's fund which he has offshore; the investment will depend on the agreement with me. Mrs.Sabeen Gharam Aram

Possible Features
- Each of the words in the message
- Is there a TO: header?
- Does FROM: match REPLY-TO:?
- Is the FROM: address a free email service?
- Is the subject in all caps?
- Length of message in log(#bytes)
- Are there attachments?
- Extension of any attached file
- TLD of the FROM: address (e.g., .com, .edu, .org, ...)
- ... dozens more ...
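
As a hedged sketch (the helper name, feature names, and free-mail list are illustrative, not from the slides), a few of these features can be computed from a raw message with Python's standard email module:

```python
# Turn a few of the features listed above into a feature dict for one message.
import email
import math

FREE_MAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}  # illustrative list

def extract_features(raw_message: str) -> dict:
    msg = email.message_from_string(raw_message)
    from_addr = (msg.get("From") or "").lower()
    reply_to = (msg.get("Reply-To") or "").lower()
    subject = msg.get("Subject") or ""

    return {
        "has_to_header": msg.get("To") is not None,
        "from_matches_reply_to": from_addr == reply_to,
        "from_free_mail": any(dom in from_addr for dom in FREE_MAIL),
        "subject_all_caps": subject.isupper(),
        "log_bytes": math.log(len(raw_message.encode("utf-8")) + 1),
        "from_tld": from_addr.rsplit(".", 1)[-1].rstrip(">") if "." in from_addr else "",
    }
```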

Evaluating Features
- Choose some features, generate training data, train the system, and evaluate using 10-fold cross-validation.
- Some systems can show you the features that contribute the most to classification.
- If not, perform ablation experiments: drop a feature and re-train, or add a new feature and train.
- You can also automate this to try adding/removing various subsets of features.
- Rely on your own knowledge and intuition as well.
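
A sketch of this workflow with scikit-learn: a 10-fold cross-validated baseline followed by a simple drop-one-feature ablation loop. The feature names and data here are placeholders.

```python
# 10-fold CV baseline plus a drop-one-feature ablation loop (placeholder data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

feature_names = ["has_to_header", "from_matches_reply_to", "subject_all_caps", "log_bytes"]
X, y = make_classification(n_samples=500, n_features=len(feature_names), random_state=0)

baseline = cross_val_score(LinearSVC(), X, y, cv=10).mean()
print("all features:", baseline)

# Ablation: drop one feature at a time, retrain, and compare to the baseline
for i, name in enumerate(feature_names):
    X_drop = np.delete(X, i, axis=1)
    score = cross_val_score(LinearSVC(), X_drop, y, cv=10).mean()
    print(f"without {name}: {score:.3f} (change {score - baseline:+.3f})")
```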

Feature Engineering for Text Classification
Typical features: words and/or phrases, along with term frequency or (better) TF-IDF scores. ΔTFIDF amplifies the training-set signal by using the ratio of the IDF for the negative and positive collections, which results in a significant boost in accuracy.
Text: The quick brown fox jumped over the lazy white dog.
Features: the 2, quick 1, brown 1, fox 1, jumped 1, over 1, lazy 1, white 1, dog 1, the quick 1, quick brown 1, brown fox 1, fox jumped 1, jumped over 1, over the 1, lazy white 1, white dog 1
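
The unigram + bigram counts above can be reproduced with a vectorizer; a minimal sketch assuming scikit-learn (TfidfVectorizer would substitute TF-IDF scores for the raw counts):

```python
# Unigram + bigram bag-of-words counts for the example sentence.
from sklearn.feature_extraction.text import CountVectorizer

text = ["The quick brown fox jumped over the lazy white dog."]
vec = CountVectorizer(ngram_range=(1, 2), lowercase=True)
counts = vec.fit_transform(text)

for term, idx in sorted(vec.vocabulary_.items()):
    print(term, counts[0, idx])   # e.g. "the" -> 2, "quick brown" -> 1
```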

ΔTFIDF BoW Feature Set
The value of feature t in document d is
V(t, d) = C(t, d) * log2(N_t / P_t)
where
- C(t, d) = count of term t in document d
- N_t = number of negative labeled training docs with term t
- P_t = number of positive labeled training docs with term t
Values are normalized to avoid bias towards longer documents. This gives greater weight to rare (significant) words and downplays very common words. It is similar to a unigram + bigram BoW in other respects.
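
A hedged sketch of this weighting, using the simplified form V(t, d) = C(t, d) · log2(N_t / P_t); the +1 smoothing and the toy document-frequency tables are assumptions for illustration, and length normalization is omitted:

```python
# Delta TFIDF weighting sketch (simplified form; +1 smoothing is an assumption).
import math
from collections import Counter

def delta_tfidf(doc_tokens, neg_doc_freq, pos_doc_freq):
    """doc_tokens: tokens of one document.
    neg_doc_freq / pos_doc_freq: {term: number of negative / positive
    training documents containing the term}."""
    counts = Counter(doc_tokens)   # C(t, d)
    return {
        t: c * math.log2((neg_doc_freq.get(t, 0) + 1) /
                         (pos_doc_freq.get(t, 0) + 1))
        for t, c in counts.items()
    }

# Toy usage: "dreadful" occurs mostly in negative training docs, so it gets a
# large positive weight; "superb" occurs mostly in positive docs, so negative.
neg = {"dreadful": 40, "superb": 2, "movie": 50}
pos = {"dreadful": 3, "superb": 45, "movie": 55}
print(delta_tfidf("dreadful plot , superb acting".split(), neg, pos))
```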

Example: ΔTFIDF vs TFIDF vs TF
[Table: the 15 features with the highest values under Δtfidf, tfidf, and plain tf weighting for a review of City of Angels.]

Improvement over TFIDF (Uni- + Bi-grams)
- Movie reviews: 88.1% accuracy vs. 84.65%, significant at the 95% confidence level
- Subjectivity detection (opinionated or not): 91.26% vs. 89.4%, significant at the 99.9% confidence level
- Congressional support for a bill (voted for / against): 72.47% vs. 66.84%, significant at the 99.9% confidence level
- Enron email spam detection (spam or not): 98.917% vs. 96.6168%, significant at the 99.995% confidence level
- All tests used 10-fold cross-validation
- At least as good as mincuts + subjectivity detectors on movie reviews (87.2%)