Linear Classifiers / SVM
Soongsil University, Dept. of Industrial and Information Systems Engineering, Intelligence Systems Lab.

Linear Classifiers (figures: sample data points, their features, and a labeled training set)

Linear Classifiers: How Do We Classify Them Using a Computer?

Linear Classifiers: How Do We Classify Them Using a Computer? (figure: an alternative, equivalent formulation, labeled "or" in the original)

Linear Classifiers: Linear Classification

Optimal Hyperplane (figure: a candidate separating hyperplane with misclassified points)

Which Separating Hyperplane to Use? (figure: data plotted against Var 1 and Var 2 with several candidate hyperplanes)

Optimal Hyperplane. In the figure, points denote the +1 and -1 classes. Any of these hyperplanes would be fine... but which is best?

Support Vector Machines. Three main ideas:
- Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin.
- Extend this definition to problems that are not linearly separable: add a penalty term for misclassifications.
- Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

Optimal Hyperplane. Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Canonical Hyperplane. The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).

Normal Vector (figure: the weight vector w is normal to the separating hyperplane)

Maximizing the Margin. IDEA 1: Select the separating hyperplane that maximizes the margin! (figure: margin width between the two classes in the Var 1-Var 2 plane)

Support Vectors (figure: the margin width and the support vectors, the points lying on the margin boundaries, that determine it)

Margin Width (figure: hyperplane B1 with margin boundaries b11 and b12 at distance d). The inner product with the vector w has the following geometric meaning: it measures the (scaled) projection of a point onto the direction normal to the hyperplane.

Setting Up the Optimization Problem. There is a scale and unit for the data such that k = 1; the problem then becomes one of maximizing the margin subject to the rescaled constraints (figure: Var 1 vs Var 2 with the two margin boundaries).

Setting Up the Optimization Problem. The width of the margin is 2/||w||, so the problem is to maximize 2/||w|| subject to every training point lying on the correct side of its margin boundary.

Setting Up the Optimization Problem. If class 1 corresponds to +1 and class 2 to -1, the two constraints can be rewritten as y_i (w^T x_i + b) >= 1 for all i. So the problem becomes: minimize ||w||^2 = w^T w subject to y_i (w^T x_i + b) >= 1.

Linear, Hard-Margin SVM Formulation. Find w, b that solve: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i.
- The problem is convex, so there is a unique global minimum value (when feasible).
- There is also a unique minimizer, i.e. the weight vector w and bias b that attain the minimum.
- It is not solvable if the data are not linearly separable.
- It is a quadratic program (QP): very efficient computationally with modern constrained-optimization engines (handles thousands of constraints and training instances).
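
As a rough illustration of the hard-margin case, the following sketch (assuming scikit-learn and NumPy are available; the toy data are invented) approximates the hard margin with a linear-kernel SVC and a very large C, then reports the resulting margin width 2/||w||. It is only an illustration, not part of the original slides.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable 2-D toy data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.4, (20, 2)),
               rng.normal([+2, +2], 0.4, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()            # weight vector w
b = clf.intercept_[0]            # bias b
print("w =", w, "b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```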

Finding the Decision Boundary. Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i. The decision boundary should classify all points correctly: y_i (w^T x_i + b) >= 1 for all i. The decision boundary can be found by solving the following constrained optimization problem: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1. The Lagrangian of this optimization problem is L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [y_i (w^T x_i + b) - 1], with α_i >= 0.

Lagrangian of the SVM optimization problem (the derivation spans several slides whose equations appeared as images).

Setting the partial derivatives of L(w, b, α) with respect to w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0. Substituting these results back into the Lagrangian and simplifying yields the dual objective.

Remember the dual problem! Two functions are built from the Lagrangian L(x, λ): first find the x that minimizes L(x, λ), then solve the problem of finding the λ that maximizes this minimum.

The Dual Problem. By setting the derivatives of the Lagrangian to zero, the optimization problem can be rewritten in terms of the α_i (the dual problem): maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i >= 0 and Σ_i α_i y_i = 0. This is a quadratic programming (QP) problem, and a global maximum over the α_i can always be found. The weight vector can be recovered by w = Σ_i α_i y_i x_i.

The Dual Problem (continued). If the number of training examples is very large, SVM training can become very slow, because the number of parameters α_i in the dual problem grows with the number of training points.
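
To make the dual concrete, here is a minimal sketch (assuming the cvxopt package; the two-class toy data are invented) that solves the dual QP stated above and then recovers w and b from the multipliers.

```python
import numpy as np
from cvxopt import matrix, solvers

# Invented, linearly separable toy data (x_i in R^2, y_i in {-1, +1}).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual QP in cvxopt's form: minimize (1/2) a^T P a + q^T a
#   subject to G a <= h (i.e. alpha_i >= 0) and A a = b (i.e. sum_i alpha_i y_i = 0).
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

# Recover w = sum_i alpha_i y_i x_i, and b from the support vectors.
w = (alpha * y) @ X
sv = alpha > 1e-6                       # support vectors have nonzero alpha
b_val = np.mean(y[sv] - X[sv] @ w)
print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b_val)
```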

A Geometrical Interpretation (figure: Class 1 and Class 2 points with their Lagrange multipliers; α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and all other α_i = 0, so only three points are support vectors).

Characteristics of the Solution. The KKT conditions imply that many of the α_i are zero:
- w is a linear combination of a small number of data points.
- The x_i with non-zero α_i are called support vectors (SV).
- The decision boundary is determined only by the SVs.
- Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σ_j α_{t_j} y_{t_j} x_{t_j}.
For testing with a new data point z: compute f(z) = Σ_j α_{t_j} y_{t_j} x_{t_j}^T z + b and classify z as class 1 if the sum is positive, class 2 otherwise.
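
In scikit-learn the same quantities are exposed after fitting: `dual_coef_` stores y_i * α_i for the support vectors, `support_vectors_` stores the x_i with nonzero α_i, and the decision value for a new point z can be reconstructed by hand. A sketch with made-up data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 0.5, (30, 2)), rng.normal(1.5, 0.5, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

sv = clf.support_vectors_          # the x_i with nonzero alpha_i
coef = clf.dual_coef_.ravel()      # entries are y_i * alpha_i
b = clf.intercept_[0]

z = np.array([0.3, -0.2])
f_manual = coef @ (sv @ z) + b     # f(z) = sum_j y_j alpha_j x_j^T z + b
f_sklearn = clf.decision_function(z.reshape(1, -1))[0]
print(f_manual, f_sklearn)         # the two values should agree
print("class:", 1 if f_manual >= 0 else -1)
```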

Support Vector Machines. Three main ideas:
- Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin.
- Extend this definition to problems that are not linearly separable: add a penalty term for misclassifications.
- Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

Non-Linearly Separable Data. What if the training set is not linearly separable? Allow some instances to fall within the margin, but penalize them: introduce slack variables ξ_i. (figure: Var 1 vs Var 2 with margin violations)

Formulating the Optimization Problem. The constraint becomes y_i (w^T x_i + b) >= 1 - ξ_i, with ξ_i >= 0. The objective function, (1/2)||w||^2 + C Σ_i ξ_i, penalizes misclassified instances and those within the margin. C trades off margin width against misclassifications; it is chosen by the user, and a large C means a higher penalty on errors.

Soft Margin Hyperplane. The slack ξ_i is given by ξ_i = max(0, 1 - y_i (w^T x_i + b)). The ξ_i are "slack variables" in the optimization: ξ_i = 0 if there is no error for x_i, and Σ_i ξ_i is an upper bound on the number of training errors. The optimization problem becomes: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i (w^T x_i + b) >= 1 - ξ_i and ξ_i >= 0.

Soft Margin Classification. Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft. We need to minimize (1/2) w^T w + C Σ_i ξ_i subject to y_i (w^T x_i + b) >= 1 - ξ_i and ξ_i >= 0 for all i.

Linear, Soft-Margin SVMs. The algorithm tries to keep the ξ_i at zero while maximizing the margin. Notice that it does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes. Other formulations use ξ_i^2 instead. As C → ∞, we get closer to the hard-margin solution.
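
A small sketch of the C trade-off (scikit-learn assumed; the noisy data are invented): with a few mislabeled points, a small C tolerates margin violations and keeps a wide margin, while a very large C approaches hard-margin behavior.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.8, (40, 2)), rng.normal(1.0, 0.8, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
y[:3] = 1                       # flip a few labels to simulate noise/outliers

for C in (0.1, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:>7}: {clf.n_support_.sum()} support vectors, "
          f"margin width = {margin:.3f}")
```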

Robustness of Soft vs Hard Margin SVMs (figure: a soft-margin SVM with slack variables ξ_i next to a hard-margin SVM on the same data; the soft-margin solution approaches the hard-margin one as C → ∞ and tolerates more violations as C → 0).

Soft vs Hard Margin SVM.
- Soft margin: always has a solution, is more robust to outliers, and gives smoother surfaces (in the non-linear case).
- Hard margin: does not require guessing the cost parameter (it requires no parameters at all).

Linear SVMs: Overview.
- The classifier is a separating hyperplane.
- The most "important" training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrange multipliers α_i.
- Both in the dual formulation of the problem and in the solution, training points appear only inside inner products: find α_1, ..., α_N by solving the dual, then classify with f(x) = Σ_i α_i y_i x_i^T x + b.

Support Vector Machines. Three main ideas:
- Define what an optimal hyperplane is (in a way that can be identified computationally efficiently): maximize the margin.
- Extend this definition to problems that are not linearly separable: add a penalty term for misclassifications.
- Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

Extension to a Non-linear Decision Boundary. So far we have only considered a large-margin classifier with a linear decision boundary; how do we generalize it to be nonlinear? Key idea: transform x_i to a higher-dimensional space to "make life easier".
- Input space: the space in which the points x_i are located.
- Feature space: the space of the φ(x_i) after the transformation.
Why transform?
- A linear operation in the feature space is equivalent to a non-linear operation in the input space.
- Classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature x_1 x_2 makes the problem linearly separable.

Non-linear SVMs. Datasets that are linearly separable (perhaps with some noise) work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? (figures: 1-D data on the x axis that no threshold can split, and the same data lifted into the (x, x^2) plane, where a line separates them)
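
The 1-D picture on this slide is easy to reproduce: points of one class sit between points of the other, so no threshold on x separates them, but adding the squared feature x^2 makes them linearly separable. A sketch with invented data (scikit-learn assumed):

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class -1 sits in the middle, class +1 at both ends.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])

# No threshold on x alone separates this, but mapping x -> (x, x^2)
# lifts the data into 2-D, where a line suffices.
X_lifted = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear", C=1e3).fit(X_lifted, y)
print("training accuracy after lifting:", clf.score(X_lifted, y))  # expect 1.0
```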

Disadvantages of Linear Decision Surfaces (figure: a dataset in the Var 1-Var 2 plane that no straight line separates well)

Advantages of Non-Linear Surfaces (figure: a curved decision boundary separating the classes in the Var 1-Var 2 plane)

Linear Classifiers in High-Dimensional Spaces. Find a function φ(x) that maps the data to a different space (figure: the original Var 1-Var 2 space mapped to Constructed Feature 1 and Constructed Feature 2, where a linear classifier suffices).

Transforming the Data. Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional! The kernel trick comes to the rescue. (figure: points in the input space mapped by φ(.) into the feature space)

Mapping Data to a High-Dimensional Space. Find a function φ(x) that maps the data to a different space; the SVM formulation then stays the same with x replaced by φ(x). The data appear as φ(x), and the weights w are now weights in the new space. The explicit mapping is expensive if φ(x) is very high-dimensional, so solving the problem without explicitly mapping the data is desirable!

The Kernel Trick. Recall the SVM optimization problem: the data points only appear as inner products. As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly. Many common geometric operations (angles, distances) can be expressed by inner products. Define the kernel function K by K(x_i, x_j) = φ(x_i)^T φ(x_j).

An Example for φ(.) and K(.,.). Suppose φ(.) is given (the specific map was shown as an image on the slide). An inner product in the feature space can then be computed directly from the original inputs, so if we define the kernel function accordingly there is no need to carry out φ(.) explicitly. This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.

Kernel Example. The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j). A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_i^T x_j)^2. We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

K(x_i, x_j) = (1 + x_i^T x_j)^2
            = 1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2
            = [1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2]^T [1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2]
            = φ(x_i)^T φ(x_j),  where φ(x) = [1, x_1^2, √2 x_1 x_2, x_2^2, √2 x_1, √2 x_2].
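
The identity above can also be checked numerically. The sketch below (NumPy assumed) compares (1 + x_i^T x_j)^2 with φ(x_i)^T φ(x_j) for a few random 2-D vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.y)^2."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

rng = np.random.default_rng(0)
for _ in range(5):
    xi, xj = rng.normal(size=2), rng.normal(size=2)
    k_direct = (1.0 + xi @ xj) ** 2      # kernel evaluated in the input space
    k_mapped = phi(xi) @ phi(xj)         # inner product in the feature space
    print(f"{k_direct:.6f}  {k_mapped:.6f}")   # the two columns should match
```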

Kernel Example (continued). With the kernel and feature map defined above, we can solve XOR! (the worked equations on this slide appeared as images)

What Functions are Kernels? For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome. Mercer's theorem: every positive semi-definite symmetric function is a kernel. Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

K = [ K(x_1, x_1)  K(x_1, x_2)  K(x_1, x_3)  ...  K(x_1, x_N)
      K(x_2, x_1)  K(x_2, x_2)  K(x_2, x_3)  ...  K(x_2, x_N)
      ...
      K(x_N, x_1)  K(x_N, x_2)  K(x_N, x_3)  ...  K(x_N, x_N) ]
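
Mercer's condition can be probed empirically: build the Gram matrix K on a sample and check that its eigenvalues are non-negative up to numerical tolerance. A sketch with the Gaussian kernel on random points (the bandwidth σ = 1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
sigma = 1.0

# Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

eigvals = np.linalg.eigvalsh(K)               # K is symmetric
print("smallest eigenvalue:", eigvals.min())  # should be >= -1e-10 or so
```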

Kernel Functions. In practical use of SVMs, only the kernel function is specified (not φ(.)). The kernel function can be thought of as a similarity measure between the input objects. Not every similarity measure can be used as a kernel function, however: Mercer's condition states that any positive semi-definite kernel K(x, y) can be expressed as a dot product in a high-dimensional space.

Examples of Kernel Functions.
- Linear: K(x_i, x_j) = x_i^T x_j.
- Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p.
- Gaussian (radial-basis function): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)); closely related to radial basis function neural networks.
- Sigmoid: K(x_i, x_j) = tanh(β_0 x_i^T x_j + β_1); it does not satisfy the Mercer condition for all β_0 and β_1.
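
The kernels listed above are easy to write down; scikit-learn's SVC also accepts such a kernel as a callable that returns the Gram matrix K(X, Y). A sketch, with made-up parameter values and toy data:

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernel(X, Y):
    return X @ Y.T

def poly_kernel(X, Y, p=3):
    return (1.0 + X @ Y.T) ** p

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(X, Y, beta0=0.01, beta1=-1.0):
    # Not a valid (Mercer) kernel for every choice of beta0, beta1.
    return np.tanh(beta0 * X @ Y.T + beta1)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel=rbf_kernel, C=1.0).fit(X, y)   # custom kernel as a callable
print("training accuracy:", clf.score(X, y))
```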

Modification Due to the Kernel Function. Change all inner products to kernel functions: for training, the original dual uses x_i^T x_j, and with a kernel function it uses K(x_i, x_j) instead.

Modification Due to the Kernel Function. For testing, new data z is classified as class 1 if f(z) >= 0 and class 2 if f(z) < 0, where the inner products x_i^T z in the original decision function are likewise replaced by K(x_i, z).

Example. Suppose we have five 1-D data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with {1, 2, 6} in class 1 and {4, 5} in class 2, so y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1. We use the polynomial kernel of degree 2, K(x, y) = (xy + 1)^2, and set C to 100. We first find the α_i (i = 1, ..., 5) by solving the dual QP.

Example (continued). Using a QP solver, we get α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833. Verify (at home) that the constraints are indeed satisfied. The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}. The discriminant function is f(z) = Σ_i α_i y_i (x_i z + 1)^2 + b ≈ 0.6667 z^2 - 5.333 z + b. The bias b is recovered by solving f(x_2 = 2) = 1, or f(x_4 = 5) = -1, or f(x_5 = 6) = 1 (x_2 and x_5 lie on the +1 margin and x_4 lies on the -1 margin); all give b = 9.
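
The arithmetic of this example is easy to verify in a few lines (a minimal check, assuming NumPy; the quoted multipliers 7.333 and 4.833 are taken at full precision as 22/3 and 29/6 so that the margins come out at exactly ±1):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.0, 2.5, 0.0, 22 / 3, 29 / 6])   # 7.333..., 4.833...
b = 9.0

def f(z):
    """Discriminant f(z) = sum_i alpha_i y_i (x_i z + 1)^2 + b."""
    return np.sum(alpha * y * (x * z + 1.0) ** 2) + b

print("dual constraint sum_i alpha_i y_i =", np.sum(alpha * y))  # ~0
for z in (2.0, 5.0, 6.0):                     # support vectors -> +1, -1, +1
    print(f"f({z}) = {f(z):.3f}")
print([bool(np.sign(f(z)) == yi) for z, yi in zip(x, y)])  # all True
```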

Homework. Solve the classical XOR problem, i.e. find the non-linear discriminant function!
- Dataset: Class 1: x_1 = (-1, -1), x_4 = (+1, +1); Class 2: x_2 = (-1, +1), x_3 = (+1, -1).
- Kernel function: polynomial of order 2, K(x, x') = (x^T x' + 1)^2.
- To achieve linear separability, we will use C = ∞.
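
If you want a numerical cross-check of your hand-derived answer (not a substitute for the analytical derivation asked for above), the sketch below (scikit-learn assumed) fits the same degree-2 polynomial kernel with a very large C as a stand-in for C = ∞:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([1, -1, -1, 1])               # class 1: (-1,-1) and (+1,+1)

# (x^T x' + 1)^2  <->  poly kernel with degree=2, gamma=1, coef0=1;
# a huge C approximates the hard-margin case C = infinity.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
print("y_i * alpha_i:", clf.dual_coef_)    # all four points are support vectors
print("b:", clf.intercept_)
print("decision values:", clf.decision_function(X))   # signs match y
```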

Robustness of Soft vs Hard Margin SVMs (figure repeated: a soft-margin SVM with slack variables ξ_i next to a hard-margin SVM on the same data).

Choosing the Kernel Function.
- Probably the trickiest part of using an SVM.
- The kernel function is important because it creates the kernel matrix, which summarizes all the data.
- Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...).
- In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try for most applications.
- An SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions chosen automatically by the SVM.

Strengths of SVM.
- Training is relatively easy: there are no local optima, unlike in neural networks.
- It scales relatively well to high-dimensional data.
- The trade-off between classifier complexity and error can be controlled explicitly.
- Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors.
- Performing logistic regression (a sigmoid fit) on the SVM outputs for a data set can map the SVM outputs to probabilities.

Weaknesses of SVM.
- You need to choose a "good" kernel function.
- It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance.
- It only considers two classes. How do we do multi-class classification with an SVM? Answer:
  1) With output arity m, learn m SVMs: SVM 1 learns "output == 1" vs "output != 1", SVM 2 learns "output == 2" vs "output != 2", ..., SVM m learns "output == m" vs "output != m".
  2) To predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region.
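
The one-vs-rest scheme described above takes only a few lines; the sketch below (scikit-learn assumed, with an invented 3-class toy set) trains one SVM per class and predicts with whichever classifier pushes the point furthest into its positive region.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
centers = np.array([[0, 0], [4, 0], [2, 4]])
X = np.vstack([rng.normal(c, 0.7, (30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

# One binary SVM per class: "class k" vs "not class k".
models = {k: SVC(kernel="rbf", C=1.0).fit(X, np.where(y == k, 1, -1))
          for k in np.unique(y)}

def predict(z):
    scores = {k: m.decision_function(z.reshape(1, -1))[0]
              for k, m in models.items()}
    return max(scores, key=scores.get)   # most positive decision value wins

print(predict(np.array([3.9, 0.2])))     # expected: 1 (the blob near (4, 0))
```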

Summary: Steps for Classification.
1. Prepare the pattern matrix {(x_i, y_i)}.
2. Select a kernel function.
3. Select the error parameter C: you can use the values suggested by the SVM software, or set apart a validation set to determine the parameter value.
4. Execute the training algorithm (to find all α_i).
5. New data can be classified using the α_i and the support vectors.
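
Those five steps map directly onto a short scikit-learn pipeline. The sketch below is only one possible realization (the data set and parameter grid are made up); it holds out validation folds to choose C and the RBF width, as step 3 suggests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Prepare the pattern matrix {(x_i, y_i)}.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 2-4. Select a kernel, tune C (and gamma) on validation folds, then train.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                           "svc__gamma": ["scale", 0.01, 0.1]}, cv=5)
grid.fit(X_tr, y_tr)

# 5. Classify new data with the fitted alphas / support vectors.
print("best params:", grid.best_params_)
print("test accuracy:", grid.best_estimator_.score(X_te, y_te))
```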

The Dual of the SVM Formulation.
- Original SVM formulation: n inequality constraints, n positivity constraints, and n slack variables ξ_i.
- The (Wolfe) dual of this problem: one equality constraint, n positivity constraints, n variables α_i (the Lagrange multipliers), and a more complicated objective function.
- NOTICE: the data only appear as inner products φ(x_i)^T φ(x_j).

Nonlinear SVM: Overview. An SVM locates a separating hyperplane in the feature space and classifies points in that space. It does not need to represent the space explicitly; it simply defines a kernel function. The kernel function plays the role of the dot product in the feature space.

SVM Applications. SVMs have been used successfully in many real-world problems:
- text (and hypertext) categorization
- image classification
- ranking (e.g., Google searches)
- bioinformatics (protein classification, cancer classification)
- handwritten character recognition

Handwritten digit recognition (example application; figure)

Comparison with Neural Networks.
Neural networks:
- Hidden layers map to lower-dimensional spaces.
- The search space has multiple local minima.
- Training is expensive; classification is extremely efficient.
- Requires choosing the number of hidden units and layers.
- Very good accuracy in typical domains.
SVMs:
- The kernel maps to a very high-dimensional space.
- The search space has a unique minimum.
- Training is extremely efficient; classification is extremely efficient.
- The kernel and the cost C are the two parameters to select.
- Very good accuracy in typical domains; extremely robust.

Conclusions.
- SVMs express learning as a mathematical program, taking advantage of the rich theory of optimization.
- SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces.
- SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.

Suggested Further Reading.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
- P.-H. Chen, C.-J. Lin, and B. Schölkopf. A Tutorial on ν-Support Vector Machines.
- N. Cristianini. ICML 2001 tutorial on support vector machines.
- K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, 12(2), May 2001.
- B. Schölkopf. SVM and Kernel Methods. Tutorial given at the NIPS conference.
- Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning. Springer, 2001.

References.
- Burges, C. "A Tutorial on Support Vector Machines for Pattern Recognition." Bell Labs.
- Law, Martin. "A Simple Introduction to Support Vector Machines." Michigan State University.
- Prabhakar, K. "An Introduction to Support Vector Machines."
