Machine Learning CS 165B Spring 2012
Course outline: Introduction (Ch. 1); Concept learning (Ch. 2); Decision trees (Ch. 3); Ensemble learning; Neural networks (Ch. 4); Linear classifiers; Support vector machines; Bayesian learning (Ch. 6); Instance-based learning; Clustering; Computational learning theory (?)
Linear classifiers. A linear classifier has the form f(x, w, b) = sign(w·x + b). (Figure: 2-D training points, one marker denoting the +1 class and the other the −1 class, with a different candidate separating line shown on each slide.) How would you classify this data?
Linear classifiers. Many different separating lines classify this training data perfectly. Any of these would be fine... but which is best?
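As a quick illustration (my addition, not from the original slides), a minimal sketch of this decision rule in Python on made-up 2-D data:

```python
import numpy as np

def linear_classify(X, w, b):
    """Predict +1/-1 for each row of X using f(x, w, b) = sign(w . x + b)."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)   # ties at 0 are assigned to +1

# Toy example: two clusters separated by the line x1 + x2 = 3 (illustrative data).
X = np.array([[1.0, 1.0], [0.5, 1.5], [3.0, 2.5], [2.5, 3.5]])
w = np.array([1.0, 1.0])   # one of many possible separating hyperplanes
b = -3.0
print(linear_classify(X, w, b))   # -> [-1 -1  1  1]
```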
Classifier margin. For a linear classifier f(x, w, b) = sign(w·x + b), define the margin as the width by which the decision boundary could be widened (symmetrically, perpendicular to itself) before hitting a datapoint.
Maximum margin. The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM). The support vectors are the datapoints that the margin pushes up against.
Why maximum margin?
1. Intuitively this feels safest.
2. If we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. The model is immune to removal of any non-support-vector datapoints.
4. There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.
Specifying the margin. Plus-plane = { x : w·x + b = +1 }; minus-plane = { x : w·x + b = −1 }; the classifier boundary is w·x + b = 0, halfway between them. Classify as +1 if w·x + b ≥ 1 and as −1 if w·x + b ≤ −1; the zone −1 < w·x + b < 1 is off-limits ("universe explodes"): no training point may fall inside the margin. (Figure: the "predict class = +1" and "predict class = −1" zones on either side of the three parallel planes w·x + b = +1, 0, −1.)
Learning the maximum-margin classifier. Given a guess of w and b we can (a) check whether all data points lie in the correct half-planes and (b) compute the width of the margin, M = 2 / ||w|| (the distance between the planes w·x + b = +1 and w·x + b = −1, measured between points x⁺ and x⁻ on the two planes). So now we just need a program that searches the space of w's and b's to find the widest margin consistent with all the datapoints. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
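A small sketch (mine, not from the slides) of the two checks named above for a candidate (w, b): whether every point satisfies the unit functional-margin constraint, and the resulting margin width M = 2/||w||:

```python
import numpy as np

def margin_checks(X, y, w, b):
    """Return (all constraints satisfied?, margin width) for a candidate (w, b)."""
    functional_margins = y * (X @ w + b)          # y_i (w . x_i + b)
    feasible = bool(np.all(functional_margins >= 1))
    width = 2.0 / np.linalg.norm(w)               # M = 2 / ||w||
    return feasible, width

# Reusing the toy data from the earlier sketch.
X = np.array([[1.0, 1.0], [0.5, 1.5], [3.0, 2.5], [2.5, 3.5]])
y = np.array([-1, -1, 1, 1])
print(margin_checks(X, y, np.array([1.0, 1.0]), -3.0))   # (True, ~1.41)
print(margin_checks(X, y, np.array([0.8, 0.8]), -2.8))   # (True, ~1.77): wider margin, a better guess
```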
Classification rule. In the solution, α_i is 0 for all data points except the support vectors; y_i ∈ {+1, −1} defines the classification. The final classification rule is quite simple: f(x) = sign( b + Σ_{i∈SV} α_i y_i (x_i·x) ), where SV is the set of support vectors. All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight α_i to use on each support vector.
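A hedged sketch of this classification rule (my illustration, not the course's code), assuming the support vectors, their labels, and their α weights are already known:

```python
import numpy as np

def svm_predict(x, support_vectors, sv_labels, sv_alphas, b, kernel=np.dot):
    """f(x) = sign( b + sum over SV of alpha_i * y_i * K(x_i, x) )."""
    total = b
    for x_i, y_i, a_i in zip(support_vectors, sv_labels, sv_alphas):
        total += a_i * y_i * kernel(x_i, x)
    return 1 if total >= 0 else -1
```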
Why SVM? A very popular machine learning technique.
– Became popular in the late 90s (Vapnik 1995; 1998); invented in the late 70s (Vapnik, 1979).
– Controls complexity and overfitting, so it works well on a wide range of practical problems.
– Because of this, it can handle high-dimensional vector spaces, which makes feature selection less critical.
– Very fast and memory-efficient implementations exist, e.g., svm_light.
– It is not always the best solution, especially for problems with small vector spaces.
Kernel trick. Instead of using the raw features directly, represent the data in a high-dimensional feature space constructed from a set of basis functions (e.g., polynomial and Gaussian combinations of the base features). Then find a separating plane (an SVM) in that high-dimensional space. Voilà: a nonlinear classifier!
Binary vs. multi-class classification. A basic SVM can only do binary classification.
– One-vs-all: turn an N-way classification into N binary classification tasks (e.g., for the zoo problem: mammal vs. not-mammal, fish vs. not-fish, ...) and pick the class that results in the highest score (sketched below).
– One-vs-one: train N(N−1)/2 pairwise classifiers (mammal vs. fish, mammal vs. reptile, etc.) that vote on the result.
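A rough sketch of the one-vs-all scheme (my illustration; `train_binary_svm` and `score` are hypothetical placeholders for any binary SVM trainer and its decision-score function):

```python
import numpy as np

def train_one_vs_all(X, labels, classes, train_binary_svm):
    """Train one binary SVM per class: class c vs. the rest."""
    models = {}
    for c in classes:
        y = np.where(labels == c, 1, -1)     # relabel: c -> +1, everything else -> -1
        models[c] = train_binary_svm(X, y)   # hypothetical trainer returning a scoring model
    return models

def predict_one_vs_all(x, models, score):
    """Pick the class whose binary SVM gives the highest (signed) score."""
    return max(models, key=lambda c: score(models[c], x))
```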
From LSVM to general SVM. Suppose we are in one dimension: what would an SVM do with this data? (Figure: positive and negative points on a number line around x = 0.)
Not a big surprise: the separator is a single threshold, with a positive "plane" on one side and a negative "plane" on the other. (Figure: the 1-D data around x = 0 with the two margin boundaries marked.)
Harder 1-dimensional dataset. These points are not linearly separable: no single threshold classifies them correctly. What can we do now? (Figure: 1-D points around x = 0 arranged so that no threshold separates the classes.)
Harder 1-dimensional dataset. Transform the data points from one dimension to two dimensions with some nonlinear basis function (these give rise to what are called kernel functions).
Harder 1-dimensional dataset. After the transformation, these points are linearly separable! (Figure: the lifted 2-D points split by a straight line.)
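A minimal sketch of this idea (my addition). The slide does not name the exact basis function, so the classic quadratic lift z = (x, x²) is assumed here, along with made-up 1-D data:

```python
import numpy as np

def lift(x):
    """Map a 1-D point to 2-D: x -> (x, x^2)."""
    return np.array([x, x * x])

# Hypothetical 1-D data: negatives near 0, positives further out (not separable by one threshold).
xs = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
ys = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

Z = np.stack([lift(x) for x in xs])
# In the lifted space the line z2 = 4 (i.e., w = (0, 1), b = -4) separates the classes:
print(ys * (Z @ np.array([0.0, 1.0]) - 4.0) > 0)   # all True
```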
Non-linear SVMs. General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ : x → φ(x).
Nonlinear SVMs: the kernel trick. With this mapping, our discriminant function becomes g(x) = Σ_{i∈SV} α_i y_i φ(x_i)ᵀφ(x) + b. There is no need to know the mapping explicitly, because we only use dot products of feature vectors, in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(x_i, x_j) = φ(x_i)ᵀφ(x_j).
Nonlinear SVMs: the kernel trick. An example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_iᵀx_j)². We need to show that K(x_i, x_j) = φ(x_i)ᵀφ(x_j) for some φ:
K(x_i, x_j) = (1 + x_iᵀx_j)² = 1 + x_i1²x_j1² + 2x_i1x_j1x_i2x_j2 + x_i2²x_j2² + 2x_i1x_j1 + 2x_i2x_j2
= [1, x_i1², √2·x_i1x_i2, x_i2², √2·x_i1, √2·x_i2]ᵀ [1, x_j1², √2·x_j1x_j2, x_j2², √2·x_j1, √2·x_j2]
= φ(x_i)ᵀφ(x_j), where φ(x) = [1, x_1², √2·x_1x_2, x_2², √2·x_1, √2·x_2].
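A quick numerical check of this identity (my addition), comparing (1 + x_iᵀx_j)² against φ(x_i)ᵀφ(x_j) for random 2-D vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)
k_direct = (1.0 + xi @ xj) ** 2
k_via_phi = phi(xi) @ phi(xj)
print(np.isclose(k_direct, k_via_phi))   # True: the kernel equals the dot product in feature space
```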
Nonlinear SVMs: the kernel trick. Examples of commonly used kernel functions:
– Linear kernel: K(x_i, x_j) = x_iᵀx_j.
– Polynomial kernel of degree p: K(x_i, x_j) = (1 + x_iᵀx_j)^p.
– Gaussian (radial basis function, RBF) kernel: K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)).
– Sigmoid kernel: K(x_i, x_j) = tanh(β_0 x_iᵀx_j + β_1).
In general, functions that satisfy Mercer's condition can be kernel functions.
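Hedged Python definitions of these kernels (the slide's exact parameterizations are not visible in the extracted text, so common conventions are used below):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    # Note: this one satisfies Mercer's condition only for some parameter choices.
    return np.tanh(beta0 * (xi @ xj) + beta1)
```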
Mercer's condition. Necessary and sufficient condition for a valid kernel: consider the Gram matrix K over any finite set of points S, with entries K_ij = K(x_i, x_j) for x_i, x_j ∈ S. K must be positive semi-definite: zᵀKz ≥ 0 for all real vectors z.
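A small sketch of this check (my addition): build the Gram matrix on a sample of points and test positive semi-definiteness via its eigenvalues:

```python
import numpy as np

def gram_matrix(kernel, points):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])

def looks_positive_semidefinite(K, tol=1e-9):
    """z^T K z >= 0 for all z  <=>  all eigenvalues of the symmetric K are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Example: the RBF kernel passes on a random sample of points.
rng = np.random.default_rng(1)
pts = rng.normal(size=(10, 3))
K = gram_matrix(lambda a, b: np.exp(-np.linalg.norm(a - b)**2 / 2.0), pts)
print(looks_positive_semidefinite(K))   # True on this sample; Mercer requires it for every sample
```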
Making new kernels from old. New kernels can be made from existing kernels by operations such as addition, multiplication, and rescaling: for example, if K_1 and K_2 are kernels and c > 0, then cK_1, K_1 + K_2, and K_1·K_2 are also kernels.
What the kernel trick achieves. All of the computations needed to find the maximum-margin separator can be expressed in terms of scalar products between pairs of datapoints (in the high-dimensional feature space). These scalar products are the only part of the computation that depends on the dimensionality of the high-dimensional space, so if we had a fast way to do the scalar products, we would not have to pay a price for solving the learning problem in the high-dimensional space. The kernel trick is just a magic way of doing scalar products a whole lot faster than is usually possible: it relies on choosing a mapping to the high-dimensional feature space that allows fast scalar products.
The kernel trick. For many mappings from a low-dimensional space to a high-dimensional space, there is a simple operation on two vectors in the low-dimensional space that computes the scalar product of their two images in the high-dimensional space. (Figure: two routes from a pair of low-D vectors to the same number: map both to high-D with φ and do the scalar product the obvious way, or let the kernel do the work directly in low-D.)
Mathematical details (linear SVM). Let {x_1, ..., x_n} be our data set and let y_i ∈ {+1, −1} be the class label of x_i. With discriminant g(x) = wᵀx + b, the decision boundary should classify all points correctly: y_i g(x_i) > 0. For any given w, b, the distance of x_i from the decision surface is y_i g(x_i) / ||w|| = y_i (wᵀx_i + b) / ||w||, and the minimum distance over all x_i from the decision surface is min_i { y_i (wᵀx_i + b) } / ||w||. The best set of parameters is arg max_{w,b} [ min_i { y_i (wᵀx_i + b) } / ||w|| ].
Stretching w and b. Setting w ← κw and b ← κb leaves the distance from the decision surface unchanged but can change the value of y_i (wᵀx_i + b). Example: w = (1, 1), ||w|| = √2, b = −1, x_i = (0.8, 0.4). Then y_i (wᵀx_i + b) = 0.2 and the distance from the decision surface is 0.2 / √2. Choosing κ = 1/0.2 gives y_i (wᵀx_i + b) = 1 while the distance from the decision surface is still 0.2 / √2. This can be used to normalize any (w, b) so that the lower bound in the inequality constraint equals 1 (next slide).
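A short numerical check of this example (mine, not from the slides):

```python
import numpy as np

w, b, x, y = np.array([1.0, 1.0]), -1.0, np.array([0.8, 0.4]), 1
for kappa in (1.0, 1.0 / 0.2):                       # original scale, then the rescaled (w, b)
    wk, bk = kappa * w, kappa * b
    functional = y * (wk @ x + bk)                   # changes with kappa: 0.2 -> 1.0
    distance = functional / np.linalg.norm(wk)       # unchanged: 0.2 / sqrt(2) in both cases
    print(functional, distance)
```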
Reformulated optimization problem. After this normalization, y_i (wᵀx_i + b) ≥ 1 for all i, with equality for some i; therefore min_i { y_i (wᵀx_i + b) } = 1. Solve arg max_{w,b} 1/||w|| subject to y_i (wᵀx_i + b) ≥ 1, or equivalently arg min_{w,b} (1/2)||w||² subject to y_i (wᵀx_i + b) ≥ 1. This is quadratic programming (a special case of convex optimization): no local minima, a clean geometric view, and well-established tools for solving it (e.g., CPLEX). However, the story continues: we will reformulate the optimization.
Detour: constrained optimization. Minimize f(x_1, x_2) = x_1² + x_2² − 1 subject to h(x_1, x_2) = x_1 + x_2 − 1 = 0. This particular problem can be solved the usual way, by substituting for x_2 and differentiating. In general, consider the constraint surface h(x) = 0: the gradient ∇h is normal to this surface. Since we are minimizing f, ∇f is also normal to the surface at the optimal point. Therefore ∇f and ∇h are parallel or anti-parallel at the optimum, so there exists a parameter λ ≠ 0 such that ∇f + λ∇h = 0.
Linear programming. Find x maximizing cᵀx subject to Ax ≤ b and x ≥ 0 (the standard form). A very, very useful problem: 1300+ papers, 100+ books, 10+ courses, hundreds of companies. Main methods: the simplex method and interior-point methods.
Lagrange multipliers. The method of Lagrange multipliers gives a set of necessary conditions for identifying the optimal points of constrained optimization problems. This is done by converting a constrained problem into an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers. The classical problem formulation, minimize f(x_1, x_2, ..., x_n) subject to h(x_1, x_2, ..., x_n) = 0, can be converted to: minimize L(x, λ) = f(x) + λ h(x), where L(x, λ) is the Lagrangian function and λ is an unspecified constant called the Lagrange multiplier.
Finding the optimum using Lagrange multipliers. New problem: minimize L(x, λ) = f(x) + λ h(x). Take the partial derivatives with respect to x and λ and set them to 0. For the example, minimize f(x_1, x_2) = x_1² + x_2² − 1 subject to h(x_1, x_2) = x_1 + x_2 − 1 = 0 becomes: minimize L(x, λ) = x_1² + x_2² − 1 + λ(x_1 + x_2 − 1). Solution below.
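The solution itself is not visible in the extracted slide; setting the three partial derivatives to zero gives (a worked sketch):

```latex
\frac{\partial L}{\partial x_1} = 2x_1 + \lambda = 0,\qquad
\frac{\partial L}{\partial x_2} = 2x_2 + \lambda = 0,\qquad
\frac{\partial L}{\partial \lambda} = x_1 + x_2 - 1 = 0
\;\Longrightarrow\;
x_1 = x_2 = \tfrac12,\quad \lambda = -1,\quad
f\bigl(\tfrac12,\tfrac12\bigr) = \tfrac14 + \tfrac14 - 1 = -\tfrac12 .
```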
Another example: minimize −xy subject to x + y − 6 = 0 (the objective and constraint are reconstructed from the values quoted in the modified example below). Introduce a Lagrange multiplier λ for the constraint and construct the Lagrangian L = −xy + λ(x + y − 6). The stationary points satisfy ∂L/∂x = −y + λ = 0, ∂L/∂y = −x + λ = 0, ∂L/∂λ = x + y − 6 = 0. So x = y = λ = 3, and the constrained minimum of −xy is −9.
How about inequality constraints? The search for the SVM maximum margin involves inequality constraints: minimize f(x) subject to h(x) ≤ 0. Two possibilities:
– h(x) < 0: the constraint is inactive; the solution is provided by ∇f(x) = 0, i.e., λ = 0 in the Lagrangian equation.
– h(x) = 0: the constraint is active; the solution is as in the equality-constraint case, with λ > 0 and ∇f, ∇h anti-parallel.
In either case, λ h(x) = 0.
Karush-Kuhn-Tucker conditions. Minimize f(x) subject to h(x) ≤ 0, with Lagrangian L(x, λ) = f(x) + λ h(x). At the optimal solution, λ ≥ 0 and λ h(x) = 0. This can be extended to multiple constraints and is valid for a convex objective function with convex constraints. The problem is the same as min_x ( max_{λ≥0} L(x, λ) ):
– Consider the function max_{λ≥0} L(x, λ) for a fixed x.
– If h(x) > 0, this maximum is ∞.
– If h(x) ≤ 0, the maximum is attained with λ h(x) = 0, and the function equals f(x).
– Therefore the minimum over all x is the same as minimizing f(x) subject to the constraint.
Primal and dual functions. Primal: p* = min_x max_{λ≥0} L(x, λ). Dual: d* = max_{λ≥0} min_x L(x, λ). Weak duality: d* ≤ p* (always). Strong duality: d* = p* (which holds here, for a convex objective with convex constraints).
Modified example: minimize −xy subject to the inequality constraint x + y − 6 ≤ 0. Introduce a Lagrange multiplier λ ≥ 0 for the inequality constraint and construct the Lagrangian L = −xy + λ(x + y − 6). Stationary points: Case λ = 0: x = y = 0, −xy = 0. Case λ ≠ 0: x = y = λ = 3, −xy = −9. The primal and dual optima coincide at −9, the constrained optimum.
Returning to the SVM optimization. Minimize (1/2)||w||² subject to y_i (wᵀx_i + b) ≥ 1 for all i: quadratic programming with linear constraints. The Lagrangian function is L(w, b, α) = (1/2)||w||² − Σ_i α_i [ y_i (wᵀx_i + b) − 1 ], with Lagrange multipliers α_i ≥ 0 (KKT condition), giving the problem min_{w,b} max_{α≥0} L(w, b, α).
Solving the dual optimization problem. Setting the derivatives of L with respect to w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0. Plug these in and reduce.
Reduction. At the optimal solution to the dual,
L(w, b, α) = ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j − Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j − b Σ_i α_i y_i + Σ_i α_i
= Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j.
Solving the optimization problem. The Lagrangian dual problem: maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j over α, subject to α_i ≥ 0 and Σ_i α_i y_i = 0, instead of the primal min_{w,b} max_{α≥0} L(w, b, α). Which is easier to solve? The primal has #dims = dims(w) variables, while the dual has #dims = dims(α) variables, one per training point.
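A hedged sketch of solving this dual numerically (my illustration; it assumes SciPy's general-purpose SLSQP solver rather than a dedicated QP package such as CPLEX or LibSVM):

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y, C=None):
    """Maximize sum(a) - 1/2 a^T P a with P_ij = y_i y_j x_i . x_j,
    subject to a_i >= 0 (and a_i <= C for the soft margin) and sum(a_i y_i) = 0."""
    n = len(y)
    P = np.outer(y, y) * (X @ X.T)
    objective = lambda a: 0.5 * a @ P @ a - a.sum()      # minimize the negative of the dual
    constraints = [{"type": "eq", "fun": lambda a: a @ y}]
    bounds = [(0.0, C) for _ in range(n)]                # C=None means no upper bound (hard margin)
    result = minimize(objective, np.zeros(n), method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x

# Usage sketch with the toy 2-D data from the earlier sketches:
X = np.array([[1.0, 1.0], [0.5, 1.5], [3.0, 2.5], [2.5, 3.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alphas = solve_dual(X, y)
print(np.round(alphas, 3))   # nonzero entries mark the support vectors
```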
Karush-Kuhn-Tucker constraint: α_i ( y_i (wᵀx_i + b) − 1 ) = 0. For every data point, either α_i = 0 (not a support vector), or y_i (wᵀx_i + b) = 1 (a support vector, lying on the margin hyperplane wᵀx + b = ±1).
A geometrical interpretation. (Figure: ten points from class 1 and class 2 with their multipliers marked; here α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and all other α_i = 0.) The support vectors are the points whose α values are different from zero: they hold up the separating plane!
Solution steps.
1. Solve for α.
2. Plug in and solve for w = Σ_i α_i y_i x_i.
3. Solve for b: for any support vector, y_i (wᵀx_i + b) = 1, i.e., y_i ( Σ_{j∈SV} α_j y_j x_jᵀx_i + b ) = 1, so b = y_i − Σ_{j∈SV} α_j y_j x_jᵀx_i. Averaging over all support vectors, b = (1/s) Σ_{i∈SV} [ y_i − Σ_{j∈SV} α_j y_j x_jᵀx_i ], where s is the number of support vectors.
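A short sketch of steps 2 and 3 (mine), given the α's returned by a dual solver:

```python
import numpy as np

def recover_w_and_b(X, y, alphas, tol=1e-6):
    """w = sum_i a_i y_i x_i; b averaged over the support vectors (a_i > tol)."""
    w = (alphas * y) @ X
    sv = alphas > tol
    b = np.mean(y[sv] - X[sv] @ w)   # y_i - w . x_i, averaged over the support vectors
    return w, b
```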
Nonlinear SVM: optimization. Use K(x_i, x_j) instead of x_iᵀx_j wherever the dot product appears; the optimization technique stays the same.
Non-linearly separable problems. We allow an "error" ξ_i in the classification of each point, based on the output of the discriminant function wᵀx + b; Σ_i ξ_i approximates the number of misclassified samples. New objective function: (1/2)||w||² + C Σ_i ξ_i, where C is the tradeoff between error and margin, chosen by the user; a large C means a higher penalty on errors. (Figure: overlapping class 1 and class 2 points, with slack for the points on the wrong side of their margin.)
Non-linearly separable case. Formulation: minimize (1/2)||w||² + C Σ_i ξ_i such that y_i (wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i. The parameter C can be viewed as a way to control over-fitting: high C means fit the data, low C means maximize the margin. (Question: for a given data point, which constraint on ξ_i is active?)
Lagrangian formulation (overlapping classes). Minimize (1/2)||w||² + C Σ_i ξ_i such that y_i (wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. The Lagrangian is L(w, b, ξ, α, μ) = (1/2)||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i (wᵀx_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i, with Lagrange multipliers α_i ≥ 0 and μ_i ≥ 0 (KKT conditions), and the problem is min_{w,b,ξ} max_{α,μ≥0} L.
Solving the dual optimization problem. Setting the derivatives of L to zero gives w = Σ_i α_i y_i x_i, Σ_i α_i y_i = 0, and C − α_i − μ_i = 0 for each i. Plug these in and reduce.
The dual Lagrangian is identical to the separable case, maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j subject to Σ_i α_i y_i = 0, except that the multiplier constraint becomes 0 ≤ α_i ≤ C.
Support vectors (soft margin). KKT condition: α_i ( y_i (wᵀx_i + b) − 1 + ξ_i ) = 0, with 0 ≤ α_i ≤ C. For every data point, either α_i = 0 (not a support vector), or y_i (wᵀx_i + b) = 1 − ξ_i (a support vector). If α_i > 0 and ξ_i = 0, the point lies on the margin. If α_i = C, then μ_i = 0 and the point can be either correctly classified (ξ_i ≤ 1) or misclassified (ξ_i > 1). The solution for b is found as in the separable case.
Support vector machine: algorithm.
1. Choose a kernel function.
2. Choose a value for C.
3. Solve the quadratic programming problem (many software packages are available) to find the support vectors and the α_i.
4. Construct the discriminant function f(x) = sign( Σ_{i∈SV} α_i y_i K(x_i, x) + b ).
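In practice these steps are wrapped by library implementations. A minimal usage sketch with scikit-learn (an assumption on my part; the course itself mentions svm_light and LibSVM), with parameters chosen to match the kernel of the 1-D example on the next slide:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

# Steps 1-3: choose a kernel and C, then let the library solve the QP.
# With gamma=1 and coef0=1, sklearn's poly kernel is (x.x' + 1)^2, i.e. K(x, y) = (xy + 1)^2.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=100.0)
clf.fit(X, y)

# Step 4: the fitted object is the discriminant function.
print(clf.support_)                 # indices of the support vectors
print(clf.predict([[2.5], [4.5]]))  # class predictions for new points
```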
Example. Suppose we have five 1-D data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with x_1, x_2, x_5 in class 1 and x_3, x_4 in class 2, i.e., y_1 = 1, y_2 = 1, y_3 = −1, y_4 = −1, y_5 = 1. We use a polynomial kernel of degree 2, K(x, y) = (xy + 1)², and C is set to 100. We first find α_i (i = 1, ..., 5) by maximizing Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0.
Example (continued). Using a QP solver we get α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833. (Verify that the constraints are indeed satisfied.) The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}. The discriminant function is f(x) = 2.5·(2x + 1)² − 7.333·(5x + 1)² + 4.833·(6x + 1)² + b. b is recovered by solving f(2) = 1, or f(5) = −1, or f(6) = 1, as x_2, x_4, x_5 lie on the margin; all give b = 9. (Do the support vectors lie on the margin?)
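A quick numerical verification of this example (my addition), plugging the quoted α's into f(t) = Σ_i α_i y_i (x_i t + 1)² + b with b = 9:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
b = 9.0

print(np.sum(alpha * y))   # ~0: the equality constraint sum(alpha_i y_i) = 0 holds

def f(t):
    """Discriminant value at t using the degree-2 polynomial kernel K(u, v) = (uv + 1)^2."""
    return np.sum(alpha * y * (x * t + 1.0) ** 2) + b

for t in x:
    print(t, round(f(t), 2))   # the support vectors x = 2, 5, 6 give roughly +1, -1, +1
                               # (rounding in the quoted alphas), and all signs match the labels
```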
Example (continued). (Figure: the value of the discriminant function plotted along the 1-D axis, with x_1 through x_5 marked; it is positive near the class-1 points and negative near the class-2 points. Check where the support vectors sit.)
Performance. Support vector machines work very well in practice.
– The user must choose the kernel function and its parameters, but the rest is automatic.
– The test performance is very good.
They can be expensive in time and space for big datasets:
– The computation of the maximum-margin hyperplane depends on the cube of the number of training cases.
– We need to store all the support vectors.
SVMs are very good if you have no idea what structure to impose on the task.
Summary: strengths.
– Training is relatively easy: we do not have to deal with local minima as in ANNs; the SVM solution is always global and unique.
– Less prone to overfitting.
– A simple, easy-to-understand geometric interpretation; no large networks to mess around with.
Summary: weaknesses.
– Training (and testing) is quite slow compared to ANNs, because of the quadratic programming.
– Essentially a binary classifier, though there are tricks to get around this.
– Very sensitive to noise: a few off data points can completely throw off the algorithm.
– Biggest drawback: the choice of kernel function. There is no set-in-stone theory for choosing a kernel function for a given problem (still an area of research); once a kernel function is chosen, there is only one modifiable parameter, the error penalty C.
Choosing the kernel function. Probably the trickiest part of using an SVM. The kernel function is important because it creates the kernel matrix, which summarizes all of the data. Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...), and there is even research on estimating the kernel matrix from available information. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try.
Main points.
– Large-margin classifier: better generalization ability and less over-fitting.
– The kernel trick: map data points to a higher-dimensional space to make them linearly separable; since only the dot product is used, we never need to represent the mapping explicitly.
Additional resources: http://www.kernel-machines.org/ ; LibSVM (with a demo): http://www.csie.ntu.edu.tw/~cjlin/libsvm/