Nonparametric Methods: Support Vector Machines


Nonparametric Methods: Support Vector Machines. Oliver Schulte, CMPT 726 Machine Learning.

The Support Vector Machine Classification Formula

Weighted Nearest Neighbour? A 5-nearest-neighbour classifier labels the green point as blue, but the red triangles are much closer. We could instead consider a weighted average vote in which closer data points count for more.
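As a rough sketch of such a distance-weighted vote (not part of the original slides; it assumes NumPy arrays and uses inverse-distance weighting, which is just one illustrative choice of weight):

```python
import numpy as np

def weighted_knn_vote(X_train, y_train, x_query, k=5):
    """Distance-weighted k-NN: closer neighbours get a larger vote."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weighting is one possible choice of weight.
    weights = 1.0 / (dists[nearest] + 1e-12)
    votes = {}
    for w, label in zip(weights, y_train[nearest]):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```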

Issues for a Weighted Vote. Support Vector Machines provide a form of weighted average vote. To understand the motivation, let's consider the challenges in developing this idea: measuring closeness, encoding class labels, and restricting the vote to important data points.

Measuring Closeness. How do we quantify how close a data point xj is to the query point x? We can use a kernel that converts distances to similarity/closeness (more on that later). For now we use the dot product (x · xj): it measures the cosine of the angle between the two vectors (collinear vectors have cosine 1), and for centered vectors it equals N times their covariance. A possible classification formula (not yet final): h(x) = [Σj (x · xj) yj] − b, where the bias term b is subtracted for technical reasons, consistent with the text, and can be positive or negative.
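A small NumPy check of the two facts about the dot product quoted above (the vectors are made-up examples, not from the slides):

```python
import numpy as np

x  = np.array([1.0, 2.0])
xj = np.array([2.0, 4.0])                      # collinear with x

dot = x @ xj
cosine = dot / (np.linalg.norm(x) * np.linalg.norm(xj))
print(dot, cosine)                             # cosine = 1.0 for collinear vectors

# For centered vectors, the dot product is N times the (population) covariance.
a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 5.0])
lhs = (a - a.mean()) @ (b - b.mean())
rhs = len(a) * np.cov(a, b, bias=True)[0, 1]
print(np.isclose(lhs, rhs))                    # True
```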

Encoding Class Labels. Consider the average vote Σj (x · xj) yj. If negative class labels are encoded as y = 0, those examples simply disappear from the sum. Solution: encode the negative class as y = −1 (we are treating the labels as real numbers anyway). Positive neighbours then vote for the positive class, and negative neighbours vote for the negative class.
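A tiny illustration of why the −1 encoding matters (the closeness values are made up for the example):

```python
import numpy as np

# Two neighbours equally close to the query: one positive, one negative.
closeness = np.array([0.9, 0.9])

y_01  = np.array([1, 0])     # negative class encoded as 0: its vote vanishes
y_pm1 = np.array([1, -1])    # negative class encoded as -1: it votes against

print(np.sum(closeness * y_01))   # 0.9 -> looks like a clear positive win
print(np.sum(closeness * y_pm1))  # 0.0 -> a tie, as it should be
```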

Global Importance (Big Idea!). So does h(x) = [Σj (x · xj) yj] − b work? Not really, because it sums over all instances. Computational problem: with, say, 10K instances (e.g. movies), we need to compute the dot product for all 10K, so predictions are slow. Statistical problem: the many distant instances dominate the few close ones, so we get similar predictions for every query point. SVM solution: add a weight αj for each data point and enforce sparsity so that many data points get weight 0.

SVM Classification Formula. Compute the weighted, thresholded vote [Σj αj (x · xj) yj] − b. If the vote is > 0, label x as positive; if it is < 0, label x as negative. In symbols: h(x) = sign([Σj αj (x · xj) yj] − b). How can we learn the weights αj? Notice that the weights are like parameters, but more data points mean more weights.
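A direct transcription of this decision rule as a sketch (not from the slides; it assumes the weights alphas and the bias b have already been learned, and labels are encoded as ±1):

```python
import numpy as np

def svm_predict(x, X_train, y_train, alphas, b):
    """h(x) = sign( sum_j alpha_j * (x . x_j) * y_j  -  b ).

    Assumes labels y_j in {-1, +1} and that the weights alphas and the
    bias b are given (most alphas will be exactly 0)."""
    vote = np.sum(alphas * y_train * (X_train @ x)) - b
    return np.sign(vote)
```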

Learning Importance Weights

SVMs Are Linear Classifiers (assuming the dot product, that is). The dot product is linear in its arguments: Σj αj yj (xj · x) = Σj (αj yj xj · x) = (Σj αj yj xj) · x. Defining w := Σj αj yj xj, the SVM discriminant function is w · x − b, a linear classifier whose weight vector is a linear combination of the data points.
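A quick numeric check that collapsing the weighted sum into a single vector w gives the same discriminant value (random toy data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                 # 5 data points in 2D
y = np.array([1, -1, 1, -1, 1])
alphas = rng.uniform(size=5)
b = 0.3
x = rng.normal(size=2)                      # a query point

w = (alphas * y) @ X                        # w = sum_j alpha_j * y_j * x_j

expanded  = np.sum(alphas * y * (X @ x)) - b
collapsed = w @ x - b
print(np.isclose(expanded, collapsed))      # True: the same linear classifier
```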

Important Points for Linear Classifiers. Assume linear separability. Which points matter for determining the line? The borderline points! These are called support vectors. (The same effect can be shown in 1D.)

Line Drawing. The second big idea of SVMs: where should we draw the line between the classes? Right in the middle!

The Maximum Margin Classifier. The distance from the positive/negative class to the decision boundary is the distance of the closest positive/negative point to the boundary. The margin is the minimum distance of either class to the boundary. A line in the middle puts both classes at the same distance: this is the maximum-margin line. (In the figure, the circled points are the support vectors; a line closer to one class gives a smaller margin.)

Objective Function. How do we find the weights? After a lot of math, maximizing the following function does it: argmax over α of { Σj αj − ½ Σj,k αj αk yj yk (xj · xk) }, subject to αj ≥ 0 and Σj αj yj = 0. This is a quadratic programming problem: convex and tractable. Points with αj > 0 are the support vectors; for most j we have αj = 0. We can replace (xj · xk) by other similarity metrics.
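In practice this QP is handed to an off-the-shelf solver. A sketch using scikit-learn's SVC (an assumption, not mentioned in the slides; it solves the closely related soft-margin problem, so a large C approximates the hard-margin case) and checking the properties stated above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data; a large C approximates the hard-margin QP.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)
y = 2 * y - 1                                   # encode the classes as -1 / +1

clf = SVC(kernel='linear', C=1e6).fit(X, y)

# dual_coef_ holds alpha_j * y_j for the support vectors only;
# every other training point has alpha_j = 0 (the sparsity from the slides).
alpha_times_y = clf.dual_coef_.ravel()
print("support vectors:", len(clf.support_), "out of", len(X))
print("sum_j alpha_j * y_j =", alpha_times_y.sum())   # ~0, the dual constraint
```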

The Kernel Trick

Linear Non-Separability. What if the data are not linearly separable? Recall: we can transform the data with basis functions so that they become linearly separable.

Linear Classification With Gaussian Basis Functions. Left panel: data points in 2D, with the centres of two Gaussian basis functions shown as crosses and their contours as green circles; φ1 measures closeness to the blue centre and φ2 measures closeness to the red centre. Right panel: each point is mapped to the pair (φ1, φ2); in this space the classes are linearly separable by the black line, which corresponds to the black circle in the original space. Intuitive example: think of the Gaussian centres as indicating parts of a picture or parts of a body. (Figure: Bishop 4.12; SVM video at https://www.youtube.com/watch?v=3liCbRZPrZA.)

Transformation. In the original space the boundary between the classes is a circle (x1)² + (x2)² = constant, so the data are linearly non-separable. After mapping each point to ((x1)², (x2)²), that boundary becomes a straight line, so the data are linearly separable.
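A sketch of this transformation on synthetic two-ring data (the radii and the threshold are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Inner disc (class -1) and outer ring (class +1): a circular boundary,
# so the classes are not linearly separable in the original (x1, x2) space.
theta = rng.uniform(0.0, 2.0 * np.pi, 100)
r = np.concatenate([rng.uniform(0.0, 1.0, 50), rng.uniform(2.0, 3.0, 50)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([-np.ones(50), np.ones(50)])

# Basis-function transformation: phi(x) = ((x1)^2, (x2)^2).
Phi = X ** 2

# The circular boundary x1^2 + x2^2 = c becomes the straight line
# phi1 + phi2 = c, so a single threshold now separates the classes.
print(np.all((Phi.sum(axis=1) > 1.5) == (y == 1)))    # True
```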

The Kernel Trick. Naive option: compute the basis functions, then compute the dot product of the basis-function vectors. Kernel trick: compute the dot product of the basis-function vectors directly, using a similarity metric on the original feature vectors.

Example. Define f1 = (x1)², f2 = (x2)², f3 = √2·x1·x2 and g1 = (z1)², g2 = (z2)², g3 = √2·z1·z2. Exercise: show that (f1, f2, f3) · (g1, g2, g3) = [(x1, x2) · (z1, z2)]².
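A numeric spot-check of this identity (the particular x and z are made up; the exercise itself asks for the algebraic proof):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

def phi(v):
    """phi(v) = ((v1)^2, (v2)^2, sqrt(2)*v1*v2), as in the slide."""
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

lhs = phi(x) @ phi(z)          # dot product in basis-function space
rhs = (x @ z) ** 2             # the same value from the original vectors
print(lhs, rhs)                # both are 1.0 for this x and z
```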

Kernels. A kernel converts distance to similarity, e.g. K(d) = max{0, 1 − (2d/10)²}. Kernels can also be defined directly as similarity metrics K(x, z). Many kernels exist, e.g. for vectors, matrices, strings, graphs, images, ...
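A sketch of both styles of kernel (the Gaussian version below uses the squared Euclidean distance, a common convention; it is an assumption here, not prescribed by this slide):

```python
import numpy as np

def kernel_from_distance(d):
    """The slide's example of turning a distance into a similarity."""
    return max(0.0, 1.0 - (2.0 * d / 10.0) ** 2)

def gaussian_kernel(x, z, sigma=1.0):
    """A kernel defined directly as a similarity metric K(x, z)."""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))

print(kernel_from_distance(0.0))                                  # 1.0: identical points
print(gaussian_kernel(np.array([0.0, 0.0]), np.array([1.0, 1.0])))  # ~0.37
```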

Mercer's Theorem (1909). For any reasonable kernel K, there is a set of basis functions F such that K(x, z) = F(x) · F(z). In words, the kernel computes the dot product in basis-function space without actually computing the basis functions! For SVMs, this means we can find the support vectors for non-separable problems without having to compute basis functions. The set of basis functions can be very large, even infinite.

Example 1. An SVM with the Gaussian kernel K(x, z) = exp{−dist(x, z) / 2σ²}. The support vectors are circled in the figure. A linear decision boundary in basis-function space is non-linear in the original feature space.
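A sketch reproducing this kind of example with scikit-learn (the library, the toy dataset, and the parameter values are assumptions for illustration, not from the slides):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Data that is not linearly separable in the original feature space.
X, y = make_moons(n_samples=100, noise=0.15, random_state=0)
y = 2 * y - 1                                   # labels as -1 / +1

# RBF (Gaussian) kernel: K(x, z) = exp(-gamma * ||x - z||^2).
clf = SVC(kernel='rbf', gamma=2.0, C=1.0).fit(X, y)

# These correspond to the circled points in the slide's figure.
print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))
```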

Example 2. An SVM trained using the cubic polynomial kernel K(x, z) = (x · z + 1)³. The data are not linearly separable in the original feature space, but are linearly separable in the kernel's basis-function space.
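A corresponding sketch with scikit-learn (again an assumption for illustration; with gamma=1 and coef0=1, the library's polynomial kernel matches (x · z + 1)³):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.1, random_state=0)
y = 2 * y - 1                                   # labels as -1 / +1

# Cubic polynomial kernel K(x, z) = (x . z + 1)^3:
# in scikit-learn this is kernel='poly' with degree=3, gamma=1, coef0=1.
clf = SVC(kernel='poly', degree=3, gamma=1.0, coef0=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```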