1 COMP24111: Machine Learning and Optimisation
Chapter 4: Support Vector Machines
Dr. Tingting Mu

2 Outline
Understand concepts such as hyperplane and distance.
Understand the basic idea of the support vector machine (SVM).
Understand the difference between hard-margin and soft-margin SVM.
Understand the core idea of the kernel trick.
Understand the different data split schemes used in machine learning experiments.
Know different classification performance measures.

3 History and Information
Vapnik and Lerner (1963) introduced the generalised portrait algorithm. The algorithm implemented by SVMs is a nonlinear generalisation of the generalised portrait algorithm.
The support vector machine was first introduced in 1992: Boser et al., "A training algorithm for optimal margin classifiers", Proceedings of the 5th Annual Workshop on Computational Learning Theory, Pittsburgh, 1992.
More on SVM history: centralised website.
Popular textbook: N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.
Popular libraries: LIBSVM, MATLAB SVM.

4 Hyperplane and Distance
A hyperplane is the set of points satisfying wTx + b = 0. In 2D space (w1x1 + w2x2 + b = 0) it is a straight line; in 3D space (w1x1 + w2x2 + w3x3 + b = 0) it is a plane.
The vector w gives the hyperplane direction (its normal), and ||w|| denotes its Euclidean norm.
The signed distance from an arbitrary point x to the plane is r = (wTx + b) / ||w||; whether r is positive or negative depends on which side of the hyperplane x lies. Setting x to the origin (0, 0), the distance from the origin to the plane is b / ||w||.
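To make the distance formula concrete, here is a minimal NumPy sketch (not part of the slides) that evaluates the signed distance r = (wTx + b) / ||w|| for a made-up 2D hyperplane with w = (3, 4) and b = -12.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hyperplane normal (direction); example values only
b = -12.0                  # bias term; example value only

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane wTx + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([0.0, 0.0]), w, b))  # -2.4: the origin, at distance |b|/||w|| = 12/5
print(signed_distance(np.array([4.0, 3.0]), w, b))  #  2.4: a point on the opposite side, same distance
```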

5 Parallel Hyperplanes
We focus on two parallel hyperplanes on either side of the decision hyperplane wTx + b = 0: the upper hyperplane wTx + b = +1 and the lower hyperplane wTx + b = -1.
Geometrically, each of them lies at distance ρ = 1 / ||w|| from the decision hyperplane, so the distance between the two parallel planes is 2 / ||w||.

6 We start from an ideal classification case:
the linearly separable case, in which a single hyperplane can separate the two classes perfectly.

7 Separation Margin
Given the two parallel hyperplanes wTx + b = +1 and wTx + b = -1, we separate two classes of data by preventing samples from falling into the margin:
wTxi + b >= +1 for samples with label yi = +1, and wTxi + b <= -1 for samples with label yi = -1
(equivalent expression: yi(wTxi + b) >= 1 for all i).
The region bounded by these two hyperplanes is called the separation "margin", and its width is given by ρ = 2 / ||w||.

8 Support Vector Machine (SVM)
The aim of SVM is simply to find an optimal hyperplane that separates two classes of data points with the widest margin: among all hyperplanes that separate the data, the one with the largest margin is the better choice.
This results in the following constrained optimisation, which combines margin maximisation with stopping training samples from falling into the margin:
minimise (1/2)||w||^2 over w and b, subject to yi(wTxi + b) >= 1 for all training samples i.

9 Support Vectors
Support vectors: data points that satisfy yi(wTxi + b) = 1, i.e., that lie exactly on the upper hyperplane wTx + b = +1 or the lower hyperplane wTx + b = -1.
These points are the most difficult to classify and are very important for the location of the optimal hyperplane wTx + b = 0: the optimal hyperplane and the margin width 2 / ||w|| are determined by the support vectors alone.

10 SVM Training
SVM training is the process of solving the following constrained optimisation problem:
minimise (1/2)||w||^2 over w and b, subject to yi(wTxi + b) >= 1 for all i.
Following calculus and optimisation theory, we introduce Lagrange multipliers λi >= 0 and formulate the Lagrangian function:
L(w, b, λ) = (1/2)||w||^2 - Σi λi [yi(wTxi + b) - 1].
By differentiating L() with respect to w and b and setting the gradients to zero (which gives w = Σi λi yi xi and Σi λi yi = 0), we can express the Lagrangian using only the multipliers; this is called the dual problem.

11 SVM Training
We only need to optimise the dual problem with respect to the multipliers:
maximise Σi λi - (1/2) Σi Σj λi λj yi yj xiTxj, subject to λi >= 0 for all i and Σi λi yi = 0.
This is called a quadratic programming (QP) problem in optimisation, and there are many QP solvers available.
The SVM we have learned so far is called the hard-margin SVM.
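As an illustration only, the sketch below trains an (approximately) hard-margin linear SVM with scikit-learn's SVC, which wraps the LIBSVM QP solver mentioned in the slides; the toy data, the choice of scikit-learn, and the very large C used to mimic the hard margin are all assumptions for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data: three points per class.
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],    # class -1
              [4.0, 4.0], [5.0, 3.5], [4.5, 5.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C leaves essentially no room for slack, approximating the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)          # points with non-zero multipliers
print("margin width 2/||w|| =", 2 / np.linalg.norm(clf.coef_))
```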

12 Non-separable data patterns
So far, we have worked on simple cases with separable data patterns. What if the data points follow a non-separable pattern?
In practice, datasets are rarely ideally linearly separable, which means that some data points are bound to be misclassified by a linear hyperplane.

13 Non-separable Patterns
We use slack variables ξi >= 0, which measure the deviation of the i-th point from the ideal situation, to relax the previous constraints to the equivalent expression:
yi(wTxi + b) >= 1 - ξi, with ξi >= 0 for all i.
We no longer push all the points to stay outside the margin:
a point within the region of separation but still on the correct side of the decision boundary has 0 <= ξi < 1;
a point on the wrong side of the decision boundary has ξi > 1.

14 Modified Optimisation
In addition to maximising the margin as before, we need to keep all slacks ξi as small as possible to minimise the classification errors. The modified SVM optimisation problem is now:
minimise (1/2)||w||^2 + C Σi ξi over w, b and ξ, subject to yi(wTxi + b) >= 1 - ξi and ξi >= 0 for all i.
The above constrained optimisation problem can be converted to a QP problem. C >= 0 is a user-defined parameter that controls the regularisation: it sets the trade-off between model complexity and the tolerance of non-separable patterns. This formulation is called the soft-margin SVM.

15 Alternative Formulation
An alternative way to formulate the soft-margin SVM is through the hinge loss: max(0, 1 - (true label) × (model output)), where the true label is -1 or +1.
The slack variable ξi corresponds to the hinge loss error of the i-th data point.
For the hinge loss function max(0, 1 - a):
zero error if a is greater than 1 (very happy!);
small error if a is less than 1 but still positive (not too bad...);
large error if a is negative (very unhappy!).
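A small NumPy sketch of the hinge loss described above; the labels and model outputs below are made up to show the zero / small / large error cases.

```python
import numpy as np

def hinge_loss(y_true, f_x):
    """Hinge loss max(0, 1 - y*f(x)) for labels in {-1, +1} and raw model outputs f(x)."""
    return np.maximum(0.0, 1.0 - y_true * f_x)

y = np.array([+1, +1, +1, -1])        # true labels (made up)
f = np.array([2.3, 0.4, -0.7, -1.5])  # model outputs (made up)
print(hinge_loss(y, f))               # [0.  0.6 1.7 0. ]: zero, small, large, zero error
```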

16 Non-linear data patterns
So far, we can handle linear data patterns, where the two classes are separated by a hyperplane. What if the data points follow a non-linear pattern that no single hyperplane can separate well?

17 Kernel Trick
The kernel trick is a method to handle non-linear data patterns.
Similar to the linear basis function model, we project each data point from the original input space to a new feature space, so that patterns that are non-linearly separable before projection become linearly separable after projection.
Instead of directly deriving the mapping function φ, we derive the inner product kernel function K(x, y) = φ(x)Tφ(y). (The inner product between two points x and y in the original space is xTy.)
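The sketch below illustrates the kernel-trick idea with one well-known example that is not necessarily the one used in the lecture: for the kernel K(x, y) = (xTy)^2 in 2D, an explicit mapping is φ(x) = (x1^2, √2·x1·x2, x2^2), so the kernel equals an inner product in the new feature space without φ ever being computed during training.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2D kernel K(x, y) = (xTy)^2 (illustrative example)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])   # made-up points
y = np.array([3.0, 0.5])

print((x @ y) ** 2)        # kernel computed directly in the original space: 16.0
print(phi(x) @ phi(y))     # same value via the explicit mapping: 16.0
```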

18 Kernel Trick
Kernel SVM: the use of the kernel trick changes the formulation of the dual problem and of the decision function.
Examples of kernel functions K(x, y) (common forms):
Linear: K(x, y) = xTy; no parameter.
Polynomial: K(x, y) = (xTy + 1)^p; p is a user-defined parameter.
Gaussian, also called radial basis function (RBF): K(x, y) = exp(-||x - y||^2 / (2σ^2)); σ is the user-defined width parameter.
Hyperbolic tangent: K(x, y) = tanh(α xTy + β); α and β are user-defined parameters.
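For reference, here is a hedged NumPy sketch of common forms of the kernels in the table; the exact expressions and the parameter values (p, σ, α, β) used below are assumptions and may differ from the lecture's slides.

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, p=3):
    return (x @ y + 1) ** p            # one common form of the degree-p polynomial kernel

def gaussian_kernel(x, y, sigma=1.0):  # RBF kernel with width parameter sigma
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def tanh_kernel(x, y, alpha=0.1, beta=0.0):
    return np.tanh(alpha * (x @ y) + beta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])   # made-up points
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, tanh_kernel):
    print(k.__name__, k(x, y))
```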

19 SVM Decision Function
After solving the QP problem for SVM, we obtain the optimal values of the multipliers λi.
Support vectors are the training samples associated with non-zero multipliers; in the soft-margin case they lie on the margin boundary, within the margin, or on the wrong side of the decision boundary.
Linear SVM decision function: f(x) = sign(wTx + b), with w = Σi λi yi xi.
Nonlinear (kernel) SVM decision function: f(x) = sign(Σi λi yi K(xi, x) + b), where the sum runs over the support vectors.
The bias parameter b can be estimated from the support vectors.
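As a sanity-check sketch (an assumption-laden example, not the lecture's code), one can verify the kernel decision function Σi λi yi K(xi, x) + b against scikit-learn: for a fitted binary SVC, dual_coef_ stores λi·yi for the support vectors and intercept_ stores b.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

# Made-up non-linear two-class data.
X, y = make_moons(n_samples=100, noise=0.2, random_state=0)
gamma = 0.5                                   # RBF parameter; corresponds to 1/(2*sigma^2)
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)   # K(xi, x) for all support vectors xi
manual = clf.dual_coef_ @ K + clf.intercept_            # sum_i lambda_i * y_i * K(xi, x) + b

print(np.allclose(manual.ravel(), clf.decision_function(X)))   # True
```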

20 Iris Classification Example
Comparison of a soft-margin SVM with Gaussian kernel (σ = 1, C = 1) and a soft-margin SVM with Gaussian kernel (σ = 0.1, C = 1).

21 Iris Classification Example
Comparison of a soft-margin SVM with polynomial kernel (p = 1, C = 1) and a soft-margin SVM with polynomial kernel (p = 1, C = 0.01); the smaller C yields more support vectors.
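A rough sketch of how such a comparison could be reproduced with scikit-learn's Iris data; the feature pair, kernel parameters and C values here are illustrative assumptions rather than the lecture's exact settings.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y < 2, :2], y[y < 2]      # two classes, first two features, for a 2D picture

for C in (1.0, 0.01):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    # Smaller C tolerates more slack, so more points end up as support vectors.
    print(f"C={C}: support vectors per class = {clf.n_support_}")
```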

22 Machine Learning Experiments
We are given a set of data samples (examples). How can we use the data properly to train the model parameters, estimate the model performance, and select the best model among different options (e.g., the best hyperparameter setting)?
Can we train the model with set A, select the hyperparameters with set A, and assess the model again with set A? No, you cannot! What can we do? Randomly split the data!

23 Holdout Method
A single split of the whole dataset into two groups:
Training set: used to train the classifier.
Test set: used to estimate the error rate of the trained classifier.
Drawbacks: if the dataset is small, we may not be able to set aside a portion of it for testing, and the holdout estimate of the error rate can be misleading if we happen to get an "unfortunate" split.
Better methods: random subsampling, k-fold cross validation, leave-one-out cross validation, and the bootstrap.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
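A minimal holdout sketch using scikit-learn; the Iris data, the SVM classifier, and the 70/30 split ratio are example choices, not prescribed by the slide.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# One single random split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("holdout error rate:", 1 - clf.score(X_test, y_test))
```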

24 Random Subsampling
Perform K data splits of the entire dataset; each split randomly selects a fixed number of samples for testing.
For each data split, the classifier is trained from scratch using the training samples, and its error rate is estimated on the testing samples (denoted by Ei for the i-th split).
The final error estimate is computed as the average E = (1/K) Σi Ei.
For example, one could randomly select 8 samples for testing in each split.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
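A sketch of random subsampling with scikit-learn's ShuffleSplit, averaging the per-split errors Ei as described above; the dataset, classifier, K = 10 and the 20% test fraction are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
errors = []
# K = 10 splits, each drawing a fresh random test set of fixed size.
for train_idx, test_idx in ShuffleSplit(n_splits=10, test_size=0.2, random_state=0).split(X):
    clf = SVC(kernel="rbf", C=1.0).fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))   # E_i for split i

print("estimated error:", np.mean(errors))   # E = (1/K) * sum_i E_i
```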

25 k-fold Cross Validation
Divide the entire dataset into K partitions (folds). For each of the K experiments, use (K-1) folds for training and the remaining fold for estimating the error rate Ei.
The error estimate is computed as the average E = (1/K) Σi Ei.
Advantage: all the examples in the dataset are eventually used for both training and testing.
Example: 4-fold cross validation.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
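A sketch of the 4-fold example above using scikit-learn's cross_val_score; the dataset and classifier are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=0)            # 4 folds
accuracies = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=cv)

print("per-fold error rates:", 1 - accuracies)                  # E_1 ... E_4
print("cross-validated error:", np.mean(1 - accuracies))        # average of the folds
```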

26 Leave-one-out Cross Validation
Leave-one-out (LOO) cross validation is the degenerate case of k-fold cross validation: with a total of N samples, LOO is an N-fold cross validation, training on N-1 samples and testing on the single remaining sample in each fold.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
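The same idea as an N-fold special case, sketched with scikit-learn's LeaveOneOut splitter (again, the dataset and classifier are arbitrary example choices).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# N experiments, each testing on exactly one held-out sample.
accuracies = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=LeaveOneOut())
print("LOO error estimate:", 1 - np.mean(accuracies))
```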

27 Bootstrap
The bootstrap is based on sampling with replacement: choose an example from the given set, put that example back into the set, and then choose another example.
Repeat the following process K times: randomly select (with replacement) M examples and use them for training; the remaining examples that were never selected are used for testing. The number of testing samples can therefore change between repeats.
The final error estimate is computed as the average of the K error rates, E = (1/K) Σi Ei.
Slides prepared based on Lecture 13, Introduction to Pattern Analysis, R. Gutierrez-Osuna.
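A bootstrap sketch in NumPy under the common assumption M = N: each repeat samples N training indices with replacement and tests on the examples that were never selected, so the test-set size varies between repeats as noted above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
N, K = len(X), 20
errors = []
for _ in range(K):
    train_idx = rng.integers(0, N, size=N)               # sampling with replacement (M = N)
    test_idx = np.setdiff1d(np.arange(N), train_idx)     # examples never selected
    clf = SVC(kernel="rbf", C=1.0).fit(X[train_idx], y[train_idx])
    errors.append(1 - clf.score(X[test_idx], y[test_idx]))

print("bootstrap error estimate:", np.mean(errors))      # average over the K repeats
```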

28 Hyperparameter Selection
Different hyperparameter settings correspond to different model options (Model 1, Model 2, ..., Model T). Selection of hyperparameters is also known as model selection.
Use one of the data split schemes above (holdout, random subsampling, k-fold CV, LOO, bootstrap) on the training data to train each candidate model and estimate its error; the final model is the candidate with the minimum error estimate.
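A sketch of hyperparameter selection by cross validation: a small made-up grid of C values stands in for the T model options, each candidate's error is estimated by 5-fold CV, and the minimiser is refitted on all the training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]          # T = 5 candidate model options (made up)
cv_errors = [1 - np.mean(cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=5))
             for C in candidates]

best_C = candidates[int(np.argmin(cv_errors))]      # keep the option with minimum error
final_model = SVC(kernel="rbf", C=best_C).fit(X, y) # refit the chosen model on all training data
print("errors:", dict(zip(candidates, cv_errors)), "-> chosen C:", best_C)
```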

29 Example: Handwritten Digit Recognition
Choosing λ for a regularised least squares model.
We have 200 training samples, and 31 possible settings for λ are checked (these correspond to 31 model options). Random subsampling is used to estimate the performance of each model: in each experiment, 100 of the 200 samples are selected to calculate the error rate, and 50 experiments are run in total.
The mean and standard deviation of the 50 classification error rates are plotted as error bars for each λ option; a good choice of λ is one with a low mean error. Relevant MATLAB commands: mean(), std(), errorbar().

30 Classification Performance Measures
Classification accuracy can be unreliable when assessing unbalanced data. For example, if there are 95 samples from class A and only 5 from class B in the data set, a classifier can reach 95% accuracy simply by labelling every observation as class A.
A confusion matrix is a table with two rows and two columns that reports the numbers of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). In multi-class classification, a confusion matrix can be computed for each class (e.g., a confusion matrix for the "cat" class).
Sensitivity (recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F1-score = 2 × Precision × Recall / (Precision + Recall)
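A small sketch computing these measures on a made-up unbalanced binary example, with the manual formulas checked against scikit-learn's implementations.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # made-up labels (1 = positive class)
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])   # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(sensitivity, specificity, precision, f1)
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred), f1_score(y_true, y_pred))
```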

31 Summary
In this lecture, we have learned:
support vector machines for classification;
different data partition methods for machine learning experiments;
classification performance measures.
In the next lecture, we will talk about deep learning models!

