
1 Lecture Slides for INTRODUCTION TO Machine Learning, ETHEM ALPAYDIN, © The MIT Press, 2004. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml

2 CHAPTER 10: Linear Discrimination

3 Likelihood- vs. Discriminant-based Classification
Likelihood-based: assume a model for p(x|C_i) and use Bayes' rule to calculate P(C_i|x). Choose C_i if g_i(x) = log P(C_i|x) is maximum.
Discriminant-based: assume a model for the discriminant g_i(x|Φ_i) directly; no density estimation. Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.

4 Linear Discriminant
Linear discriminant (a weighted sum of the inputs; see the sketch below). Advantages:
- Simple: O(d) space/computation.
- Knowledge extraction: a weighted sum of attributes; positive/negative weights and their magnitudes can be interpreted (e.g. credit scoring).
- Optimal when the p(x|C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable.
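The discriminant itself is not in the extracted text; the standard form from the textbook's notation (a reconstruction, not the slide's original image) is:
\[ g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} \]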

5 Generalized Linear Model
Quadratic discriminant: a higher-complexity model. Instead of paying for that higher complexity, we can keep a linear classifier by adding higher-order (product) terms as new inputs: map from x to z using nonlinear basis functions and use a linear discriminant in the z-space. The linear function defined in the z-space corresponds to a non-linear function in the x-space (see the sketch below).
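The equations on this slide are not in the extracted text; presumably they are the textbook's standard forms (a reconstruction):
\[ g_i(x \mid W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0} \]
and, for the basis-function route with d = 2 inputs, for example,
\[ z = \phi(x) = [\,1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2\,]^T, \qquad g_i(z) = \sum_j w_{ij} z_j, \]
which is quadratic in x but linear in z.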

6 Two Classes
Choose C_1 if g_1(x) > g_2(x), and C_2 if g_2(x) > g_1(x). Define (see below):
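The definition that follows on the slide is presumably the single combined discriminant (a reconstruction):
\[ g(x) = g_1(x) - g_2(x) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0, \]
so that we choose C_1 if g(x) > 0 and C_2 otherwise.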

7 Learning the Discriminants
As we have seen before, when p(x|C_i) ~ N(μ_i, Σ), the optimal discriminant is a linear one. So we can estimate μ_i and Σ from data and plug them into the g_i's to find the linear discriminant functions (see below). Of course, any other way of learning the parameters can also be used (e.g. perceptron, gradient descent, logistic regression, ...).
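The plug-in parameters referred to here are presumably the standard Gaussian shared-covariance results (a reconstruction, not the slide's image):
\[ w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\,\mu_i^T \Sigma^{-1}\mu_i + \log P(C_i), \qquad g_i(x) = w_i^T x + w_{i0}. \]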

8 When K > 2
Combine K two-class problems, each one separating one class from all other classes (one-vs-all).

9 Multiple Classes
Why? Any problem? The decision regions based on the g_i's are convex (indicated in blue in the figure); the distance of x to boundary i is |g_i(x)|/||w_i||. This assumes that the classes are linearly separable; a reject option may be used otherwise. How to train? How to decide on a test instance?

10 Pairwise Separation
If the classes are not linearly separable from the rest, pairwise separation can be used: one linear discriminant for each pair of classes, K(K-1)/2 linear discriminants in total.

12 A Bit of Geometry

13 Dot Product and Projection
<w, p> = w^T p = ||w|| ||p|| cos θ. The projection of p onto w is ||p|| cos θ = w^T p / ||w||.

14 Geometry
The points x on the separating hyperplane have g(x) = w^T x + w_0 = 0; hence for points on the boundary, w^T x = -w_0. Thus these points all have the same projection onto the weight vector w, namely w^T x / ||w|| (by the definition of projection and dot product), and this equals -w_0 / ||w||. Hence the perpendicular distance of the boundary to the origin is |w_0| / ||w||, and the distance of any point x to the decision boundary is |g(x)| / ||w||.
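A small numerical check of these distance formulas (a sketch with made-up example values; the function name and data are hypothetical, not from the slides):

import numpy as np

# Hypothetical hyperplane g(x) = w^T x + w0 = 0 in 2-D
w = np.array([3.0, 4.0])
w0 = -5.0

def distance_to_boundary(x, w, w0):
    """Distance of point x to the hyperplane w^T x + w0 = 0, i.e. |g(x)| / ||w||."""
    return abs(w @ x + w0) / np.linalg.norm(w)

origin = np.zeros(2)
print(distance_to_boundary(origin, w, w0))                 # |w0|/||w|| = 5/5 = 1.0
print(distance_to_boundary(np.array([1.0, 2.0]), w, w0))   # |3 + 8 - 5|/5 = 1.2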

15 Support Vector Machines

16 Vapnik and Chervonenkis, 1963; Boser, Guyon and Vapnik, 1992 (kernel trick); Cortes and Vapnik, 1995 (soft margin).
The SVM is a machine learning algorithm which
- solves classification problems,
- uses a flexible representation of the class boundaries,
- implements automatic complexity control to reduce overfitting, and
- has a single global minimum which can be found in polynomial time.
It is popular because it can be easy to use, it often has good generalization performance, and the same algorithm solves a variety of problems with little tuning.

17 SVM Concepts
- Convex programming and duality
- Using maximum margin to control complexity
- Representing non-linear boundaries with feature expansion
- The kernel trick for efficient optimization

18 Linear Separators
Which of the linear separators is optimal?

19 Classification Margin
The distance r from an example x_i to the separator is given below. Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the distance between the support vectors from the two classes.
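The distance formula referenced here is presumably the standard one (a reconstruction; r and ρ are the quantities marked in the slide's figure, and y_i is the class label):
\[ r = \frac{y_i\,(w^T x_i + w_0)}{\lVert w\rVert}. \]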

20 Maximum Margin Classification
Maximizing the margin is good according to intuition. It implies that only the support vectors matter; the other training examples are ignorable.

21 SVM as a 2-Class Linear Classifier
Optimal separating hyperplane: the separating hyperplane that maximizes the margin (Cortes and Vapnik, 1995; Vapnik, 1995). Note the condition >= 1 in the constraints (not just > 0): if the classes are linearly separable, we can always impose this by rescaling w and w_0 without affecting the separating hyperplane.
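The constraint referred to above is presumably the canonical-form condition (a reconstruction, using the textbook's notation r^t in {+1, -1} for the label of instance x^t):
\[ r^t\,(w^T x^t + w_0) \ge 1 \quad \text{for all } t. \]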

22 Optimal Separating Hyperplane (Cortes and Vapnik, 1995; Vapnik, 1995)

23 Maximizing the Margin
The distance from the discriminant to the closest instances on either side is called the margin. To maximize the margin, we minimize the Euclidean norm of the weight vector w; the geometric relationship below shows why. For the support vectors, the constraint is satisfied with equality.
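The relationships on this slide are presumably the standard ones (a reconstruction): the support vectors satisfy r^t(w^T x^t + w_0) = 1, so they lie at distance 1/||w|| from the boundary and the margin is 2/||w||. Maximizing the margin is therefore the quadratic program
\[ \min_{w,\,w_0}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{subject to} \quad r^t\,(w^T x^t + w_0) \ge 1, \ \forall t. \]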

24 Maximizing the Margin: Alternate Explanation
The distance from the discriminant to the closest instances on either side is called the margin. The distance of x^t to the hyperplane is r^t(w^T x^t + w_0)/||w|| (the signed distance made positive using the label r^t). We require that this distance is at least some value ρ for every instance. We would like to maximize ρ, but we can do so in infinitely many ways by scaling w; for a unique solution, we fix ρ||w|| = 1 and minimize ||w||.

25 Unconstrained Problem Using Lagrange Multipliers
We rewrite the constrained problem as an unconstrained one using Lagrange multipliers α^t (nonnegative numbers, one per constraint). The solution, if it exists, is always at a saddle point of the Lagrangian: L_p should be minimized with respect to w and w_0 and maximized with respect to the α^t's.
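The primal Lagrangian on this slide is presumably the standard one (a reconstruction of the missing equation):
\[ L_p = \tfrac{1}{2}\lVert w\rVert^2 - \sum_t \alpha^t\left[ r^t\,(w^T x^t + w_0) - 1 \right], \qquad \alpha^t \ge 0. \]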


29 Lagrange Multipliers, Two Variables
The figure shows the constraint curve g(x, y) = k together with several level curves of f. These have the equations f(x, y) = c, where c = 7, 8, 9, 10, 11.

30 Lagrange Multipliers, Two Variables
To maximize f(x, y) subject to g(x, y) = k is to find the largest value of c such that the level curve f(x, y) = c intersects g(x, y) = k.

31 Lagrange Multipliers, Two Variables
It appears that this happens when these curves just touch each other, that is, when they have a common tangent line. Otherwise, the value of c could be increased further.

32 Lagrange Multipliers, Two Variables
This means that the normal lines at the point (x_0, y_0) where they touch are identical, so the gradient vectors are parallel: ∇f(x_0, y_0) = λ ∇g(x_0, y_0) for some scalar λ.


34 The convex quadratic optimization problem can be solved in the dual form, where we use the stationarity conditions (the local-minimum constraints with respect to w and w_0; see below) and maximize with respect to the α^t's.
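The stationarity conditions referred to here are presumably (a reconstruction):
\[ \frac{\partial L_p}{\partial w} = 0 \ \Rightarrow\ w = \sum_t \alpha^t r^t x^t, \qquad \frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow\ \sum_t \alpha^t r^t = 0. \]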

35 Figure from: http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html

36 Maximize L_d with respect to the α^t's only. This is a quadratic programming problem; thanks to the convexity of the problem, the optimal value of L_p equals that of L_d.
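The dual L_d on this slide is presumably the standard SVM dual (a reconstruction):
\[ L_d = \sum_t \alpha^t - \tfrac{1}{2}\sum_t \sum_s \alpha^t \alpha^s\, r^t r^s\, (x^t)^T x^s, \quad \text{subject to} \quad \sum_t \alpha^t r^t = 0, \ \alpha^t \ge 0. \]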

37 To every convex program corresponds a dual; solving the original (primal) problem is equivalent to solving the dual.

38 The support vectors are the instances x^t for which α^t > 0; for these the constraint is tight, r^t(w^T x^t + w_0) = 1, and they lie on the margin.

39 The size of the dual depends on N, the number of training instances, and not on the input dimensionality d. Maximize L_d with respect to the α^t's only: a quadratic programming problem, and thanks to convexity the optimal value of L_p equals that of L_d.

40 Calculating the Parameters w and w_0
Note that for each instance either the constraint is exactly satisfied (= 1), in which case α^t can be non-zero, or the constraint is satisfied with slack (> 1), in which case α^t must be zero. Once we solve for the α^t's, we see that most of them are 0 and only a small number are positive; the corresponding x^t's are called the support vectors.

41 Finding the Weights

42 Calculating the Parameters w and w_0
Once we have the Lagrange multipliers, we can compute w and w_0 (see below), where SV is the set of support vectors.
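The formulas referred to are presumably (a reconstruction):
\[ w = \sum_{t \in SV} \alpha^t r^t x^t, \qquad w_0 = r^t - w^T x^t \ \text{ for any support vector } x^t \]
(in practice w_0 is often averaged over all support vectors for numerical stability).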

43 We make decisions by comparing each query x with only the support vectors: choose class C_1 if the discriminant is positive, C_2 if it is negative.
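The decision function being described is presumably (a reconstruction):
\[ g(x) = w^T x + w_0 = \sum_{t \in SV} \alpha^t r^t\, (x^t)^T x + w_0, \]
choosing C_1 if g(x) > 0 and C_2 otherwise.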


45 The Non-Linearly Separable Case

46 In the non-separable case we cannot find a feasible solution using the previous approach: the objective function (L_d) grows arbitrarily large. So we relax the constraints, but only when necessary, and introduce a further cost for doing so.


48 Soft Margin Hyperplane
When the data are not linearly separable, there are three cases for each instance (shown in the figure): Case 1: the instance is on the correct side and outside the margin (slack ξ^t = 0); Case 2: it is correctly classified but inside the margin (0 < ξ^t < 1); Case 3: it is misclassified (ξ^t > 1).

49 Soft Margin Hyperplane
Define the soft error as the sum of the slack variables ξ^t. The new primal adds this penalty to the objective (see below). The parameter C can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin and fitting the training data. The soft error term is an upper bound on the number of training errors. Additional Lagrange multipliers are introduced to enforce the positivity of the ξ^t's.
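The soft error and new primal on this slide are presumably (a reconstruction):
\[ \text{soft error} = \sum_t \xi^t, \qquad \min_{w,\,w_0,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_t \xi^t \quad \text{subject to} \quad r^t(w^T x^t + w_0) \ge 1 - \xi^t, \ \xi^t \ge 0, \]
with Lagrange multipliers α^t for the margin constraints and μ^t for the positivity constraints ξ^t ≥ 0.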

50 Soft Margin Hyperplane
The new dual is the same as the old one, subject to the additional constraint 0 ≤ α^t ≤ C. As in the separable case, instances that are not support vectors vanish with α^t = 0, and the remaining instances define the boundary.

51 Kernel Functions in SVM

52 We can handle the overfitting problem: even if we have lots of parameters, large margins yield simple classifiers. "All" that is left is efficiency; the solution is the kernel trick.

53 Kernel Functions
Instead of trying to fit a non-linear model, we can map the problem to a new space through a non-linear transformation and use a linear model in the new space. Say the new space is computed by the basis functions z = φ(x), where z_j = φ_j(x), j = 1, ..., k; that is, φ(x) = [φ_1(x), φ_2(x), ..., φ_k(x)] maps the d-dimensional x-space to the k-dimensional z-space.


55 Kernel Functions

56 Kernel Machines
Preprocess the input x by the basis functions: z = φ(x), so g(z) = w^T z becomes g(x) = w^T φ(x). In the SVM solution, we find kernel functions K(x, y) such that the inner products of basis functions are replaced by a kernel function evaluated in the original input space (see below).
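The kernelized SVM solution being described is presumably (a reconstruction):
\[ w = \sum_t \alpha^t r^t \phi(x^t) \ \Rightarrow\ g(x) = \sum_t \alpha^t r^t\, \phi(x^t)^T \phi(x) + w_0 = \sum_t \alpha^t r^t\, K(x^t, x) + w_0. \]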

57 Kernel Functions
Consider polynomials of degree q (Cherkassky and Mulier, 1998); see the example below.
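The polynomial kernel of degree q is presumably the standard one (a reconstruction):
\[ K(x, y) = (x^T y + 1)^q. \]
For example, with d = 2 and q = 2, (x^T y + 1)^2 = 1 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2 + x_1^2y_1^2 + x_2^2y_2^2, which corresponds to the basis functions φ(x) = [1, √2 x_1, √2 x_2, √2 x_1x_2, x_1^2, x_2^2]^T, since K(x, y) = φ(x)^T φ(y).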


60 Examples of Kernel Functions
- Linear: K(x_i, x_j) = x_i^T x_j; the mapping φ(x) is x itself.
- Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p; the mapping φ(x) has on the order of C(d+p, p) dimensions.
- Gaussian (radial-basis function): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)); the mapping φ(x) is infinite-dimensional, every point is mapped to a function (a Gaussian).
The higher-dimensional space still has intrinsic dimensionality d, but linear separators in it correspond to non-linear separators in the original space.
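A small sketch of these kernels in code, also verifying numerically that the degree-2 polynomial kernel equals the inner product of the explicit feature map from slide 57 (the function names and example vectors are made up for illustration):

import numpy as np

def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, p=2):
    return (1.0 + x @ y) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     np.sqrt(2) * x[0] * x[1],
                     x[0] ** 2, x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y), phi(x) @ phi(y))   # both give (1 + 1)^2 = 4.0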

61 Typically k is much larger than d, and possibly larger than N, so using the dual, where the complexity depends on N rather than on k, is advantageous. We use the soft margin hyperplane: if C is too large, there is too high a penalty for non-separable points (too many support vectors); if C is too small, we may have underfitting. Decide by cross-validation.

62 Other Kernel Functions
Polynomials of degree q and radial-basis functions (as on slide 60), and sigmoidal functions such as K(x, y) = tanh(2 x^T y + 1).

63 What Functions Are Kernels? (Advanced)
For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome. Any function that satisfies certain constraints, called the Mercer conditions, can be a kernel function (Cherkassky and Mulier, 1998). Every positive semi-definite symmetric function is a kernel; such functions correspond to a positive semi-definite symmetric Gram matrix:
K = [ K(x_1,x_1)  K(x_1,x_2)  K(x_1,x_3)  ...  K(x_1,x_n)
      K(x_2,x_1)  K(x_2,x_2)  K(x_2,x_3)  ...  K(x_2,x_n)
      ...
      K(x_n,x_1)  K(x_n,x_2)  K(x_n,x_3)  ...  K(x_n,x_n) ]
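A quick way to check this condition numerically on a sample is to verify that the Gram matrix has no negative eigenvalues (a sketch, not from the slides; the helper names and sample data are made up):

import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j]) for an array of points X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-10):
    """A symmetric matrix is positive semi-definite iff all eigenvalues are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rbf = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)
X = np.random.randn(20, 3)            # 20 random 3-D sample points
print(is_psd(gram_matrix(rbf, X)))    # True: the Gaussian kernel is PSD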

64 Informally, kernel methods implicitly define the class of possible patterns by introducing a notion of similarity between data: the choice of similarity is a choice of relevant features. More formally, kernel methods exploit information about the inner products between data items. Many standard algorithms can be rewritten so that they only require inner products between the data (inputs); kernel functions are inner products in some (potentially very complex) feature space. If the kernel is given, there is no need to specify what features of the data are being used, and kernel functions make it possible to use infinite-dimensional feature spaces efficiently in time and space.

65 String Kernels
For example, given two documents, D_1 and D_2, the number of words appearing in both may form a kernel. Define φ(D_1) as the M-dimensional binary vector whose dimension i is 1 if word w_i appears in D_1 and 0 otherwise. Then φ(D_1)^T φ(D_2) counts the number of shared words. If we define K(D_1, D_2) directly as the number of shared words, there is no need to preselect the M words, no need to create the bag-of-words representation explicitly, and M can be as large as we want.
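A minimal sketch of such a shared-word kernel (a toy illustration; the documents and whitespace tokenization are made up, not from the slides):

def shared_word_kernel(doc1: str, doc2: str) -> int:
    """K(D1, D2) = number of distinct words appearing in both documents,
    computed without ever building an explicit bag-of-words vector."""
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    return len(words1 & words2)

d1 = "the support vector machine maximizes the margin"
d2 = "a large margin classifier is a support vector machine"
print(shared_word_kernel(d1, d2))  # 4 shared words: support, vector, machine, margin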

66 Projecting into Higher Dimensions

67 SVM Applications
Cortes and Vapnik, 1995: handwritten digit classification. The 16x16 bitmaps give 256 dimensions; a polynomial kernel with q = 3 gives a feature space with about 10^6 dimensions. There was no overfitting on a training set of 7300 instances, with an average of 148 support vectors over different training sets. The expected test error rate is Exp_N[P(error)] = Exp_N[#support vectors] / N (about 0.02 for this example).

68 The choice of the optimal hyperplane as the decision surface is not only intuitive; it also follows the structural risk minimization principle: for good generalization, choose the model that minimizes both the training error and the VC dimension.

69 Structural Risk Minimization
Vapnik 1995, 1998: the set of optimal hyperplanes described by the equation w_0^T x + b_0 = 0 has a VC dimension h such that h <= min(⌈D^2/ρ^2⌉, d) + 1, where D is the diameter of the smallest ball containing all the input points, ρ is the margin of separation, and d is the original input dimension.

70 SVM History and Applications
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s. SVMs represent a general methodology for many pattern recognition problems: classification, regression, feature extraction, clustering, novelty detection, etc. SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data. SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97] and principal component analysis [Schölkopf et al. '99]. The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the α_i's at a time, e.g. SMO [Platt '99] and SVMlight [Joachims '99].

71 Advantages of SVMs
- There are no problems with local minima, because the solution is a quadratic programming problem with a global minimum.
- The optimal solution can be found in polynomial time.
- There are few model parameters to select: the penalty term C, and the kernel function and its parameters (e.g. the spread σ in the case of RBF kernels).
- The final results are stable and repeatable (e.g. no random initial weights).
- The SVM solution is sparse; it involves only the support vectors.
- SVMs rely on elegant and principled learning methods.
- SVMs provide a method to control complexity independently of dimensionality.
- SVMs have been shown (theoretically and empirically) to have excellent generalization capabilities.

72 Challenges
- Can the kernel functions be selected in a principled manner? SVMs still require selection of a few parameters, typically through cross-validation.
- How does one incorporate domain knowledge? Currently this is done through the selection of the kernel and the introduction of "artificial" examples.
- How interpretable are the results provided by an SVM?
- What is the optimal data representation for an SVM? What is the effect of feature weighting? How does an SVM handle categorical or missing features?
- Do SVMs always perform best? Can they beat a hand-crafted solution for a particular problem?
- Do SVMs eliminate the model selection problem?

73 More explanations or demonstrations can be found at:
- http://www.support-vector-machines.org/index.html
- Haykin, Chapter 6, pp. 318-339
- Burges, C.J.C., "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998
- http://www.dtreg.com/svm.htm
Software:
- SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. Distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).
- LIBSVM (Library for Support Vector Machines), developed by Chang and Lin, is also widely used: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Developed in C++ and Java, it also supports multi-class classification, weighted SVM for unbalanced data, cross-validation and automatic model selection. It has interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels: linear, polynomial, radial basis function, and neural (tanh).
Applets to play with:
- http://lcn.epfl.ch/tutorial/english/svm/html/index.html
- http://cs.stanford.edu/people/karpathy/svmjs/demo/
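As a quick way to try these ideas in code, scikit-learn's SVC class (which wraps LIBSVM internally) trains a soft-margin kernel SVM in a few lines; this is a sketch on a made-up synthetic dataset, not an example from the slides:

import numpy as np
from sklearn.svm import SVC

# Toy 2-class data: two Gaussian blobs (invented for illustration)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

# RBF-kernel soft-margin SVM; C and gamma would normally be chosen by cross-validation
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))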

74 Maybe the best and clearest applet (next few slides): http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html. Play with the applet and try to get an understanding of the different kernels, the effect of C, and the resulting boundaries. Note: in each iteration, the applet adds more support vectors while trying to increase the margin (ρ).


