COMP24111: Machine Learning and Optimisation

COMP24111: Machine Learning and Optimisation
Chapter 4: Support Vector Machines
Dr. Tingting Mu (email: tingting.mu@manchester.ac.uk)

Outline
- Understand concepts such as hyperplane and distance.
- Understand the basic idea of the support vector machine (SVM).
- Understand the difference between hard-margin and soft-margin SVM.
- Understand the core idea of the kernel trick.
- Understand the different data split schemes used in machine learning experiments.
- Know the different classification performance measures.

History and Information
Vapnik and Lerner (1963) introduced the generalised portrait algorithm. The algorithm implemented by SVMs is a nonlinear generalisation of the generalised portrait algorithm.
The support vector machine was first introduced in 1992: Boser et al., "A training algorithm for optimal margin classifiers", Proceedings of the 5th Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
More on SVM history: http://www.svms.org/history.html
Centralised website: http://www.kernel-machines.org
Popular textbook: N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000. http://www.support-vector.net
Popular libraries: LIBSVM, MATLAB SVM.

Hyperplane and Distance
A hyperplane is the set of points x satisfying wTx + b = 0. In 2D space (w1x1 + w2x2 + b = 0) it is a straight line; in 3D space (w1x1 + w2x2 + w3x3 + b = 0) it is a plane.
The vector w gives the direction (normal) of the hyperplane. The distance from an arbitrary point x to the hyperplane is r = (wTx + b) / ||w||2; whether r is positive or negative depends on which side of the hyperplane x lies. The distance from the origin (0,0) to the hyperplane is |b| / ||w||2.
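
To make the distance formulas concrete, here is a minimal sketch in Python/NumPy (the lecture gives no code; the vector w, bias b, and point x below are illustrative values only):
```python
import numpy as np

# Illustrative 3D hyperplane w^T x + b = 0 (values chosen arbitrarily for this sketch).
w = np.array([1.0, 2.0, 2.0])
b = -3.0

def signed_distance(x, w, b):
    """Signed distance r = (w^T x + b) / ||w||2 from point x to the hyperplane w^T x + b = 0.
    The sign of r tells us which side of the hyperplane x lies on."""
    return (w @ x + b) / np.linalg.norm(w)

x = np.array([2.0, 1.0, 0.0])
print(signed_distance(x, w, b))       # distance from an arbitrary point x to the plane
print(abs(b) / np.linalg.norm(w))     # distance from the origin to the plane: |b| / ||w||2
```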

Parallel Hyperplanes
We focus on two parallel hyperplanes, wTx + b = +1 and wTx + b = -1, lying on either side of the central hyperplane wTx + b = 0. Geometrically, each of them is at distance ρ = 1 / ||w||2 from the central hyperplane, so the distance between the two parallel hyperplanes is 2 / ||w||2.

We start from an ideal classification case: the linearly separable case.

Separation Margin
Given the two parallel hyperplanes wTx + b = +1 and wTx + b = -1, we separate two classes of data by preventing samples from falling into the margin:
wTxi + b >= +1 for yi = +1, and wTxi + b <= -1 for yi = -1,
or, as an equivalent expression, yi(wTxi + b) >= 1 for all i.
The region bounded by these two hyperplanes is called the separation "margin", and its width is 2 / ||w||2.

Support Vector Machine (SVM)
Among the many hyperplanes that can separate the data, which is better? The aim of SVM is simply to find the optimal hyperplane that separates the two classes of data points with the widest margin, while stopping training samples from falling into the margin. This results in the following constrained optimisation:
minimise (1/2)||w||2² over w and b (margin maximisation, since the margin is 2 / ||w||2), subject to yi(wTxi + b) >= 1 for all i.

Support Vectors
Support vectors are the data points that satisfy yi(wTxi + b) = 1, i.e. they lie exactly on the upper hyperplane wTx + b = +1 or the lower hyperplane wTx + b = -1, at distance 1 / ||w||2 from the optimal hyperplane wTx + b = 0. These points are the most difficult to classify and are very important for the location of the optimal hyperplane.

SVM Training
SVM training is the process of solving the following constrained optimisation problem:
minimise (1/2)||w||2² subject to yi(wTxi + b) >= 1 for all i.
Following calculus and optimisation theory, we introduce Lagrange multipliers λi >= 0 and formulate the Lagrangian function:
L(w, b, λ) = (1/2)||w||2² - Σi λi [yi(wTxi + b) - 1].
By differentiating L() with respect to w and b and setting the gradients to zero, we can express the Lagrangian using only the multipliers (this is called the dual problem).

SVM Training (continued)
We only need to optimise the dual problem with respect to the multipliers:
maximise Σi λi - (1/2) Σi Σj λi λj yi yj xiTxj, subject to λi >= 0 for all i and Σi λi yi = 0.
This is a quadratic programming (QP) problem in optimisation, and there are many QP solvers available: https://en.wikipedia.org/wiki/Quadratic_programming
The SVM we have learned so far is called the hard-margin SVM.
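
As a concrete illustration of the dual QP, the sketch below solves it numerically for a tiny, hand-made linearly separable toy set. It uses SciPy's general-purpose SLSQP solver rather than a dedicated QP library, and the data values are assumptions made up for this example only:
```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negative dual objective: we maximise sum(lam) - 0.5 lam^T Q lam, so minimise its negative.
    return 0.5 * lam @ Q @ lam - lam.sum()

cons = [{'type': 'eq', 'fun': lambda lam: lam @ y}]   # constraint: sum_i lambda_i y_i = 0
bounds = [(0, None)] * len(y)                         # lambda_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=cons)

lam = res.x
w = (lam * y) @ X                       # recover w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                         # support vectors have non-zero multipliers
b = np.mean(y[sv] - X[sv] @ w)          # estimate b from the support vectors
print(w, b, lam.round(3))
```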

So far, we have worked on simple, separable data patterns. What if the data points are non-separable? In practice, no datasets are ideally linearly separable, which means that some data points are bound to be misclassified by a linear hyperplane.

Non-separable Patterns
We introduce slack variables ξi >= 0, which measure the deviation of each i-th point from the ideal situation, and relax the previous constraints to the equivalent expression:
yi(wTxi + b) >= 1 - ξi, with ξi >= 0.
We no longer force all the points to stay outside the margin. A point inside the region of separation but still on the correct side of the decision boundary has 0 <= ξi < 1; a point on the wrong side of the decision boundary has ξi > 1.

Modified Optimisation
In addition to maximising the margin as before, we need to keep all slacks ξi as small as possible to minimise the classification errors. The modified SVM optimisation problem now is:
minimise (1/2)||w||2² + C Σi ξi, subject to yi(wTxi + b) >= 1 - ξi and ξi >= 0 for all i.
The above constrained optimisation problem can again be converted to a QP problem. C >= 0 is a user-defined parameter which controls the regularisation: it sets the trade-off between model complexity and non-separable patterns. This formulation is called the soft-margin SVM.
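
The sketch below shows the practical effect of the regularisation parameter C using scikit-learn's soft-margin SVM (an assumed tool chosen purely for illustration; the lecture does not prescribe a library), on synthetic overlapping data made up for this example:
```python
import numpy as np
from sklearn.svm import SVC

# Toy non-separable data: two overlapping Gaussian clouds (illustrative values only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, (30, 2)), rng.normal(1, 1.5, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

for C in (0.01, 1, 100):          # small C: wider margin, more slack allowed; large C: fewer violations
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_)      # number of support vectors per class changes with C
```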

Alternative Formulation
An alternative way to formulate the soft-margin SVM uses the hinge loss: max(0, 1 - y f(x)), where y is the true label (-1 or +1) and f(x) is the model output. The slack variable ξi corresponds to the hinge loss error of the i-th data point.
For the hinge loss function max(0, 1 - a) with a = y f(x):
- zero error if a is greater than 1 (correctly classified, outside the margin);
- small error if a is less than 1 but still positive (correct side, but inside the margin);
- large error if a is negative (the point is on the wrong side of the boundary).
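
A minimal sketch of the hinge loss as defined above, with made-up labels and model outputs chosen to show the zero / small / large error cases:
```python
import numpy as np

def hinge_loss(y_true, f_x):
    """Hinge loss max(0, 1 - y*f(x)) for labels y in {-1, +1} and model outputs f(x)."""
    a = y_true * f_x
    return np.maximum(0.0, 1.0 - a)

y_true = np.array([1,   1,    -1,   -1])
f_x    = np.array([2.0, 0.4, -0.2,  0.7])
print(hinge_loss(y_true, f_x))
# [0.  0.6 0.8 1.7]: zero error (a > 1), small errors (0 < a < 1, inside margin), large error (a < 0, wrong side)
```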

So far, we can handle linear data patterns. What if the data points follow a non-linear pattern that no linear hyperplane can separate?

Kernel Trick
The kernel trick is a method for handling nonlinear data patterns. Similar to the linear basis function model, we project each data point from the original input space into a new feature space, so that patterns that are nonlinearly separable in the input space become linearly separable after the projection. Instead of directly deriving the mapping function φ(x), we derive the inner product kernel function K(x, y) = φ(x)Tφ(y). (The inner product between two points x and y is xTy.)

Kernel Trick
Examples of kernel functions (x and y are two data points):
- Linear: K(x, y) = xTy (no parameter)
- Polynomial: K(x, y) = (xTy + 1)^p (p is a user-defined parameter)
- Gaussian, also called radial basis function (RBF): K(x, y) = exp(-||x - y||2² / (2σ²)) (σ is the user-defined width parameter)
- Hyperbolic tangent: K(x, y) = tanh(α xTy + β) (α and β are user-defined parameters)
Kernel SVM: the use of the kernel trick changes the formulation of the dual problem and the decision function.
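
Assuming the standard textbook parameterisations listed above (the slide's exact expressions are not reproduced in the transcript and may differ slightly, e.g. the polynomial kernel is sometimes written without the +1), a minimal sketch of these kernels:
```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                             # no parameter

def polynomial_kernel(x, y, p=3):
    return (x @ y + 1) ** p                                  # p: user-defined degree

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))  # sigma: user-defined width

def tanh_kernel(x, y, alpha=0.01, beta=0.0):
    return np.tanh(alpha * (x @ y) + beta)                   # alpha, beta: user-defined parameters

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y, p=2), rbf_kernel(x, y, sigma=1.0))
```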

SVM Decision Function
After solving the QP problem for the SVM, we obtain the optimal values of the multipliers λi. Support vectors are the training samples associated with non-zero multipliers; they lie on the margin boundaries, fall within the margin, or are misclassified.
Linear SVM decision function: f(x) = sign(Σi λi yi xiTx + b).
Nonlinear (kernel) SVM decision function: f(x) = sign(Σi λi yi K(xi, x) + b).
The bias parameter b can be estimated from the support vectors.
Kernel SVM demos: https://www.youtube.com/watch?v=3liCbRZPrZA and https://www.youtube.com/watch?v=ndNE8he7Nnk
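
To illustrate that the decision function really is a weighted sum of kernel evaluations at the support vectors, the sketch below fits scikit-learn's SVC (an assumed tool, not one mandated by the lecture) on synthetic data and reproduces its decision values from dual_coef_ (which stores λi·yi), support_vectors_, and intercept_:
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=40, centers=2, random_state=0)   # synthetic two-class data
clf = SVC(kernel='rbf', gamma=0.5, C=1.0).fit(X, y)

def decision(x):
    # sum_i (lambda_i * y_i) * K(x_i, x) + b, with the RBF kernel exp(-gamma * ||x_i - x||^2)
    k = np.exp(-0.5 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

x0 = X[0]
print(decision(x0), clf.decision_function(x0.reshape(1, -1))[0])   # the two values should match
```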

Iris classification example: soft-margin SVM with Gaussian kernel (σ = 1, C = 1) versus soft-margin SVM with Gaussian kernel (σ = 0.1, C = 1).

Iris classification example: soft-margin SVM with polynomial kernel (p = 1, C = 1) versus soft-margin SVM with polynomial kernel (p = 1, C = 0.01); the smaller C leaves more support vectors.

Machine Learning Experiments
We are given a set of data samples (examples). How can we use the data properly to train the model parameters, estimate the model performance, and select the best model among different options (e.g., the best hyperparameter setting)?
Can we train the model with set A, select hyperparameters with set A, and assess the model again with set A? No, we cannot! What can we do? Randomly split the data!

Holdout Method
A single split of the whole dataset into two groups: a training set, used to train the classifier, and a test set, used to estimate the error rate of the trained classifier.
Drawbacks: if the dataset is small, we may not be able to set aside a portion of it for testing, and the holdout estimate of the error rate can be misleading if we happen to get an "unfortunate" split.
Better methods: random subsampling, k-fold cross validation, leave-one-out cross validation, and the bootstrap.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
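
A minimal holdout sketch using scikit-learn (an assumed choice; any implementation of a single split would do), with an illustrative 70/30 split of the Iris data:
```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# One single split: 70% for training, 30% held out for testing (ratio chosen for illustration).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel='rbf', C=1.0).fit(X_tr, y_tr)
print(1 - clf.score(X_te, y_te))   # holdout estimate of the error rate
```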

Random Subsampling
Perform K splits of the entire dataset; each split randomly selects a fixed number of samples for testing (for example, 8 samples per split). For each split, the classifier is trained from scratch on the training samples, and its error rate Ei is estimated on the testing samples. The final error estimate is the average E = (1/K) Σi Ei.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
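
A sketch of random subsampling, assuming K = 10 splits with 8 test samples each (matching the slide's example of 8 test samples per split), scikit-learn's ShuffleSplit as the splitter, and an illustrative linear SVM on Iris:
```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
errors = []
for train_idx, test_idx in ShuffleSplit(n_splits=10, test_size=8, random_state=0).split(X):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])  # retrain from scratch for each split
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))      # E_i for the i-th split
print(np.mean(errors))   # final estimate: average of the E_i
```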

k-fold Cross Validation
Divide the entire dataset into K folds (partitions). For each of the K experiments, use (K-1) folds for training and the remaining fold for estimating the error rate Ei. The final error estimate is E = (1/K) Σi Ei.
Advantage: all the examples in the dataset are eventually used for both training and testing. The slides illustrate this with an example of 4-fold cross validation.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
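
A sketch of 4-fold cross validation (the fold count taken from the slide's example; the linear-kernel SVM and Iris data are illustrative choices):
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel='linear'), X, y,
                         cv=KFold(n_splits=4, shuffle=True, random_state=0))
print(1 - scores)           # per-fold error rates E_i
print(np.mean(1 - scores))  # E = (1/K) * sum_i E_i
```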

Leave-one-out Cross Validation
Leave-one-out (LOO) cross validation is the degenerate case of k-fold cross validation: with a total of N samples, LOO is an N-fold cross validation, so each fold holds out exactly one sample.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
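
The same pattern gives LOO by using one fold per sample; a short sketch with the same illustrative model and data as above:
```python
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=LeaveOneOut())  # N folds, one sample each
print(1 - scores.mean())   # LOO error estimate
```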

Bootstrap
The bootstrap is based on sampling with replacement: choose an example from the given set, put it back into the set, and then choose another example. Repeat the following process K times: randomly select (with replacement) M examples and use them for training; the remaining examples that were not selected are used for testing (so the number of testing samples can change between repeats). The final error estimate is the average of the error rates over the K repeats.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
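
A minimal bootstrap sketch, assuming K = 10 repeats and M = N (resampling as many training examples as there are samples); the out-of-bag examples form the test set, whose size varies between repeats:
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
N, errors = len(y), []
for _ in range(10):                                   # K bootstrap repeats
    train_idx = rng.integers(0, N, size=N)            # sample M = N indices with replacement
    test_idx = np.setdiff1d(np.arange(N), train_idx)  # examples never selected are used for testing
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))
print(np.mean(errors))   # final error estimate averaged over the repeats
```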

Hyperparameter Selection
Different hyperparameter settings correspond to different model options (Model 1, Model 2, ..., Model T), so the selection of hyperparameters is also known as model selection. For each candidate model, use one of the data split schemes above (holdout, random subsampling, k-fold cross validation, leave-one-out, bootstrap) on the training data to train the model and estimate its error. The final model is the candidate with the minimum error estimate.
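
A sketch of hyperparameter selection via cross validation, using scikit-learn's GridSearchCV as one possible implementation of the "train each candidate, estimate its error, keep the minimum" loop (the candidate C and gamma grids are illustrative assumptions):
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# Each (C, gamma) pair is one candidate model; 5-fold CV estimates its error on the training data.
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100],
                                'gamma': [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, 1 - grid.best_score_)   # selected hyperparameters and their CV error estimate
```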

Example: handwriting digit recognition, choosing λ for a regularised least squares model. We have 200 training samples, and 31 possible settings for λ are checked (these correspond to 31 model options). Random subsampling is used to estimate the performance of each model: 100 out of the 200 samples are selected to calculate the error rate in each experiment, and 50 experiments are run in total. The mean and standard deviation of the 50 classification error rates are plotted as error bars for each λ option, from which a good choice of λ can be read off. Relevant MATLAB commands: mean(), std(), errorbar().

Classification Performance Measures
Classification accuracy can be unreliable when assessing unbalanced data. For example, if there are 95 samples from class A and only 5 from class B in the data set, a classifier can reach 95% accuracy simply by labelling every observation as class A.
A confusion matrix is a table with two rows and two columns that reports the numbers of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). In multi-class classification, a confusion matrix can be computed for each class (e.g., a confusion matrix for the "cat" class).
Sensitivity (recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F1-score = 2 × Precision × Recall / (Precision + Recall)
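
A sketch computing these measures from scratch for a hypothetical binary "cat" vs "not cat" problem (the label vectors below are made up for illustration):
```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # 1 = cat, 0 = not cat
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

TP = np.sum((y_pred == 1) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = TP / (TP + FN)               # recall
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(TP, TN, FP, FN, sensitivity, specificity, precision, f1)
```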

Summary
In this lecture, we have learned: support vector machines for classification, different data partition methods used in machine learning experiments, and classification performance measures. In the next lecture, we will talk about deep learning models!