Linear machines (March 9)

1 Decision surfaces We now focus on the decision surfaces. Linear machines = linear decision surface: a non-optimal but tractable model.

Decision surface of the Bayes classifier with normal densities (the Σi = Σ case)

Decision tree and decision regions

4 Linear discriminant function Two-category classifier: choose ω1 if g(x) > 0, else choose ω2 if g(x) < 0. If g(x) = 0 the decision is undefined; g(x) = 0 defines the decision surface. Linear machine = linear discriminant function: g(x) = wᵗx + w0, where w is the weight vector and w0 is the constant bias.
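A minimal sketch (Python/NumPy, illustrative values; not code from the slides) of the two-category rule above:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def classify_two_class(x, w, w0):
    """Choose omega_1 if g(x) > 0, omega_2 if g(x) < 0; undefined on the surface."""
    value = g(x, w, w0)
    if value > 0:
        return "omega_1"
    if value < 0:
        return "omega_2"
    return "undefined"  # x lies exactly on the decision surface g(x) = 0

w = np.array([1.0, -2.0])   # weight vector (arbitrary example)
w0 = 0.5                    # constant bias
print(classify_two_class(np.array([3.0, 1.0]), w, w0))  # g = 1.5 > 0 -> omega_1
```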

6 More than 2 categories c linear discriminant functions: ωi is predicted if gi(x) > gj(x) for all j ≠ i; i.e. the pairwise decision surfaces define the decision regions.
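A hedged sketch of the c-category rule (NumPy; the weight matrix below is only illustrative):

```python
import numpy as np

def classify_multiclass(x, W, w0):
    """c linear discriminants g_i(x) = W[i]^t x + w0[i]; predict the class with the largest g_i."""
    scores = W @ x + w0            # shape (c,)
    return int(np.argmax(scores))  # index i of the predicted class omega_i

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])       # c = 3 classes, d = 2 features
w0 = np.array([0.0, 0.0, 0.5])
print(classify_multiclass(np.array([2.0, 1.0]), W, w0))  # scores [2.0, 1.0, -2.5] -> class 0
```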

8 Expression power of linear machines It can be proved that linear machines can only define convex decision regions, i.e. concave regions cannot be learnt. Moreover, the decision boundaries cannot be higher-order surfaces (such as ellipsoids).

Homogeneous coordinates
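The slide itself is a figure; a standard reconstruction of the augmentation it presumably shows, using the a and y notation of the later slides:

```latex
% Homogeneous (augmented) coordinates: absorb the bias into the weight vector.
y = \begin{pmatrix} 1 \\ x \end{pmatrix} \in \mathbb{R}^{d+1},
\qquad
a = \begin{pmatrix} w_0 \\ w \end{pmatrix} \in \mathbb{R}^{d+1},
\qquad
g(x) = w^{t}x + w_0 = a^{t}y .
```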

10 Training linear machines

11 Training linear machines Searching for the values of w that separate the classes. Usually a goodness function is used as the objective function, e.g. the criterion J(a) introduced on the following slides.

12 Two categories: normalisation (normalised version) If yi belongs to ω2, replace yi by -yi, then search for an a for which aᵗyi > 0 for all i. There is no unique solution.
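A minimal sketch of this normalisation step (NumPy; array names are illustrative):

```python
import numpy as np

def normalise(Y, labels):
    """Replace y_i by -y_i for samples of class omega_2, so that a correct a
    satisfies a^t y_i > 0 for every training sample."""
    Y = Y.copy()
    Y[labels == 2] *= -1
    return Y

# After this step, any a with (Y_norm @ a > 0).all() separates the two classes.
```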

13 Iterative optimisation The solution minimises J(a). Iterative improvement of J(a): from a(k) to a(k+1), moving along a step direction with a given learning rate.

14 Gradient descent The learning rate is a function of k, i.e. it describes a cooling strategy.

15 Gradient descent
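A generic gradient-descent sketch (Python); J, its gradient and the cooling schedule are placeholders, not necessarily the slide's exact choices:

```python
import numpy as np

def gradient_descent(grad_J, a0, eta0=1.0, n_steps=100, eps=1e-6):
    """a(k+1) = a(k) - eta(k) * grad J(a(k)), with a simple cooling schedule eta(k) = eta0 / (k + 1)."""
    a = np.asarray(a0, dtype=float)
    for k in range(n_steps):
        grad = grad_J(a)
        if np.linalg.norm(grad) < eps:     # stop when the gradient (almost) vanishes
            break
        a = a - (eta0 / (k + 1)) * grad
    return a

# Example: minimise J(a) = ||a||^2 (gradient 2a); the minimum is a = 0.
print(gradient_descent(lambda a: 2 * a, a0=[3.0, -1.0]))
```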

16 Learning rate?

17 Perceptron rule

18 Perceptron rule Y(a): the set of training samples misclassified by a. If Y(a) is empty, Jp(a) = 0; otherwise Jp(a) > 0.
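The criterion itself is an image on the slide; the standard perceptron criterion (as in Duda et al.), which matches the description above, is:

```latex
J_p(a) \;=\; \sum_{y \in Y(a)} \bigl(-a^{t}y\bigr),
\qquad
\nabla J_p(a) \;=\; \sum_{y \in Y(a)} (-y),
```

so Jp(a) = 0 when Y(a) is empty, and positive otherwise (misclassified normalised samples have aᵗy ≤ 0).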

19 Perceptron rule Using Jp(a) in the gradient descent:

20 Training samples misclassified by a(k). Perceptron convergence theorem: if the training dataset is linearly separable, the batch perceptron algorithm finds a solution in a finite number of steps.
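A hedged sketch of the batch perceptron (η(k) = 1, normalised samples as on slide 12; variable names are illustrative):

```python
import numpy as np

def batch_perceptron(Y, max_iter=1000):
    """Batch perceptron on normalised samples Y (one row per sample);
    returns a with Y @ a > 0 if the data are linearly separable."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]      # Y(a): samples with a^t y <= 0
        if len(misclassified) == 0:        # J_p(a) = 0 -> finished
            return a
        a = a + misclassified.sum(axis=0)  # a(k+1) = a(k) + sum of misclassified y (eta = 1)
    return a                               # may not converge if the data are not separable
```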

21 η(k) = 1, online learning. Stochastic gradient descent: estimate the gradient based on a few training examples.

Online vs offline learning
Online learning algorithms: the model is updated after each training instance (or after a small batch).
Offline learning algorithms: the training dataset is processed as a whole.
Advantages of online learning:
- The update is straightforward
- The training dataset can be streamed
- Implicit adaptation
Disadvantages of online learning:
- Its accuracy might be lower
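For contrast with the batch sketch above, a minimal online (per-instance) update, again assuming normalised samples:

```python
import numpy as np

def online_perceptron_step(a, y):
    """Online variant: update immediately after seeing a single sample y."""
    if a @ y <= 0:          # misclassified -> single-sample update (eta = 1)
        a = a + y
    return a                # the model can thus be updated from a data stream

a = np.zeros(3)
for y in np.array([[1.0, 2.0, 0.5], [1.0, -1.0, 1.5]]):   # samples arriving one by one
    a = online_perceptron_step(a, y)
```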

23 Not linearly separable case Change the loss function so that it accounts for each training example, e.g. via the signed distance from the decision surface.

SVM

25 Which one to prefer?

26 Margin: the gap around the decision surface. It is defined by the training instances closest to the decision surface (the support vectors).

28 Support Vector Machine (SVM) An SVM is a linear machine whose objective function incorporates the maximisation of the margin. This provides generalisation ability.

SVM: linearly separable case

30 Linear SVM: linearly separable case Training database: labelled pairs (xt, yt). Searching for w s.t. the separation constraints hold (in either of the two equivalent forms sketched below).
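The formulas on this slide are images; a standard reconstruction of the linearly separable formulation they presumably show:

```latex
\text{Training database: } \{(x_t, y_t)\},\quad y_t \in \{+1,-1\}.
\quad \text{Search for } w, w_0 \text{ such that}
\begin{cases}
w^{t}x_t + w_0 \ge +1 & \text{if } y_t = +1,\\
w^{t}x_t + w_0 \le -1 & \text{if } y_t = -1,
\end{cases}
\quad\text{or equivalently}\quad
y_t\,(w^{t}x_t + w_0) \ge 1 \;\;\forall t.
```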

31 Linear SVM: linearly separable case Denote the size of the margin by ρ. In the linearly separable case we prefer a unique solution: argmax ρ = argmin ‖w‖ (the margin is inversely proportional to ‖w‖).

32 Linear SVM: linearly separable case Convex quadratic optimisation problem…

33 Linear SVM: linearly separable case The form of the solution: w is a weighted average of the training instances; xt is a support vector iff its weight is non-zero, so only the support vectors count.
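A standard reconstruction of the solution form referenced above (the usual result of the margin maximisation):

```latex
w \;=\; \sum_{t} \alpha_t\, y_t\, x_t,
\qquad \alpha_t \ge 0,
\qquad x_t \text{ is a support vector iff } \alpha_t > 0,
```

i.e. w is a weighted average of the training instances and only the support vectors contribute.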

SVM: not linearly separable case

36 Linear SVM: not linearly separable case The slack variable ξ allows incorrect classifications ("soft margin"): ξt = 0 if the classification is correct, otherwise it is the distance from the margin. C is a metaparameter for the trade-off between the margin size and the incorrect classifications.
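A minimal soft-margin sketch with scikit-learn (an assumed library choice, not prescribed by the slides); C is the trade-off metaparameter described above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two overlapping Gaussian classes, so the data are not perfectly separable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C -> wide margin, much slack; large C -> narrow margin, few misclassifications.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)   # number of support vectors per class (typically shrinks as C grows)
```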

SVM: non-linear case

Generalised linear discriminant functions E.g. a quadratic decision surface. Generalised linear discriminant functions: g(x) = Σi ai yi(x), where the yi : R^d → R are arbitrary functions. g(x) is not linear in x, but it is linear in the yi (it is a hyperplane in the y-space).
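A hedged illustration of the idea (the feature map below is only one possible choice): map x into quadratic features y(x) and use a linear machine in the y-space.

```python
import numpy as np

def quadratic_features(x):
    """y(x) for a 2-D input: a quadratic surface in x becomes a hyperplane in y-space."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

# g(x) = a^t y(x) is non-linear in x but linear in y(x).
a = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])         # corresponds to g(x) = x1^2 + x2^2 - 1 (a circle)
print(quadratic_features(np.array([2.0, 0.0])) @ a)   # 3.0 > 0 -> outside the circle
```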

Example

43 Non-linear SVM

44 Non-linear SVM Φ is a mapping into a higher-dimensional (k-dimensional) space. For any dataset there exists a mapping into a higher-dimensional space in which the dataset becomes linearly separable.

45 The kernel trick g(x) can be written using only inner products of mapped points. The explicit computation of the mapping into the high-dimensional space can be omitted whenever the kernel K(xt, x) = Φ(xt)ᵗΦ(x) can be computed directly.

46 Example: polynomial kernel K(x, y) = (xᵗy)^p. With d = 256 original dimensions and p = 4, the dimensionality h of the high-dimensional space is enormous; on the other hand, K(x, y) is known and feasible to calculate, while the inner product in the high-dimensional space is not.
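A small numerical check of the trick for p = 2 (for K(x, y) = (xᵗy)^2 the explicit feature map consists of all pairwise products xi xj):

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous quadratic kernel: all products x_i * x_j."""
    return np.outer(x, x).ravel()                # dimension d^2 instead of d

rng = np.random.default_rng(1)
x, z = rng.normal(size=5), rng.normal(size=5)

kernel_value = (x @ z) ** 2                      # cheap: one d-dimensional inner product
explicit_value = phi(x) @ phi(z)                 # expensive: inner product in d^2 dimensions
print(np.isclose(kernel_value, explicit_value))  # True: the kernel gives the same number
```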

47 Kernels in practice There is no rule of thumb for selecting the appropriate kernel.
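The slide presumably lists the usual candidates; a hedged sketch of kernels commonly tried in practice (their parameters p, c, gamma are typically tuned by cross-validation):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=3, c=1.0):
    return (x @ z + c) ** p

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))
```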

48–51 The XOR example
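The slides work this example on figures; a hedged sketch showing that the XOR points, which no single line can separate, are handled by a degree-2 polynomial kernel:

```python
import numpy as np
from sklearn.svm import SVC

# The classic XOR configuration.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

# K(x, z) = (x^t z + 1)^2; a very large C approximates the hard-margin case.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
print(clf.predict(X))     # all four points classified correctly
print(clf.n_support_)     # in this example every point typically ends up a support vector
```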

52 Notes on SVM Training is a global optimisation problem (exact optimisation). The performance of an SVM is highly dependent on the choice of the kernel and its parameters. Finding the appropriate kernel for a particular task is "magic".

53 Notes on SVM Complexity depends on the number of support vectors, but not on the dimensionality of the feature space. In practice, it achieves good enough generalisation ability even with a small training database.

Summary
Linear machines
Gradient descent
Perceptron
SVM
- Linearly separable case
- Not separable case
- Non-linear SVM