1
Support Vector Machines
Joseph Gonzalez
2
From a linear classifier to ...
*One of the most famous slides you will see, ever!
3
Maximum margin: the maximum possible separation between the positive and negative training examples.
4
The Big Idea
[Figure: X and O training points in 2-D]
- How many people are scared of SVMs? Maybe no one... anyway, it's a really, really simple classifier.
- Given a set of points, what do we want?
- How about a separator that is "as separating as possible": the maximum margin.
- Note that if we do this, only the "boundary" points determine the decision boundary. Those will be called support vectors!
5
Geometric Intuition
The SUPPORT VECTORS are the training points closest to the decision boundary. [Figure: X and O points in 2-D with the support vectors marked]
6
Geometric Intuition
[Figure: the same data with one more X added; the set of support vectors changes]
7
Primal Version
min 1/2||w||² + C ∑ξi  s.t.  yi(w·xi + b) ≥ 1 − ξi,  ξi ≥ 0
(Look at that tiny vector: for it, w·x is greater than 0.)
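The soft-margin primal above can be minimized directly. A minimal sketch, assuming full-batch subgradient descent on the hinge-loss form of the objective (the function name and the toy dataset are hypothetical, not from the slides):

```python
import numpy as np

def svm_primal_subgradient(X, y, C=10.0, lr=0.001, epochs=3000):
    """Minimize 1/2||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b))
    by full-batch subgradient descent (hypothetical helper)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                       # points violating the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy separable data: positives above the line x2 = x1, negatives below.
X = np.array([[0., 2.], [1., 3.], [2., 0.], [3., 1.]])
y = np.array([1., 1., -1., -1.])
w, b = svm_primal_subgradient(X, y)
```

Each point with margin below 1 pulls w by C·yi·xi, while the 1/2||w||² term shrinks w otherwise; the slack variables ξi never appear explicitly because max(0, 1 − yi(w·xi + b)) plays their role.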
8
DUAL Version
max ∑αi − 1/2 ∑αiαjyiyj(xi·xj)  s.t.  ∑αiyi = 0,  C ≥ αi ≥ 0
Where did this come from? Remember Lagrange multipliers: we "incorporate" the constraints into the objective, then solve the problem in the "dual" space of Lagrange multipliers.
- Lagrange multipliers are a useful tool; here's a summary.
- But WHY did we go to the dual form?
9
Primal vs Dual
Primal: min 1/2||w||² + C ∑ξi  s.t.  yi(w·xi + b) ≥ 1 − ξi,  ξi ≥ 0
Dual: max ∑αi − 1/2 ∑αiαjyiyj(xi·xj)  s.t.  ∑αiyi = 0,  C ≥ αi ≥ 0
Number of parameters? The primal has one weight per feature; the dual has one αi per example. For a large number of features, the DUAL is preferred; with a large number of examples, many αi can go to zero anyway!
- The dual also makes the kernel trick possible.
- Some solvers are optimized for this form (as Carlos also said).
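To see the sparsity claim concretely, here is a hedged sketch that solves the dual as a box-constrained QP using SciPy's general-purpose SLSQP solver (a stand-in for a dedicated QP solver; the dataset is made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 20 points each.
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

# Dual objective, negated so a minimizer can handle it:
# minimize 0.5 a^T Q a - sum(a), with Q_ij = y_i y_j (x_i . x_j).
Q = (y[:, None] * X) @ (y[:, None] * X).T
C = 1.0
res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),
    x0=np.zeros(len(y)),
    jac=lambda a: Q @ a - 1.0,
    bounds=[(0, C)] * len(y),                  # C >= alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum alpha_i y_i = 0
    method="SLSQP",
)
alpha = res.x
n_support = (alpha > 1e-6).sum()  # most alphas vanish: only boundary points
                                  # keep nonzero weight
```

There is one αi per training example, but only the few points near the boundary end up with αi > 0.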
10
DUAL: the “Support vector” version
max ∑αi − 1/2 ∑αiαjyiyj(xi·xj)  s.t.  ∑αiyi = 0,  C ≥ αi ≥ 0
Wait... how do we predict y for a new point x? y = sign(w·x + b)
How do we find w? The derivative of the Lagrangian is 0 at a minimum; setting the derivative with respect to w to zero gives w = ∑i αi yi xi, so y = sign(∑i αi yi (xi·x) + b).
How do we find b? From any support vector: b = yi − w·xi.
How do we find α? Quadratic programming.
How do we find C? Cross-validation!
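Putting the prediction recipe together on a toy case. The two points and their α values below are hand-solvable and chosen for illustration; they are not the slide's example:

```python
import numpy as np

# Two training points: x1 = (0,0) with y = +1, x2 = (2,0) with y = -1.
# Solving the dual by hand gives alpha1 = alpha2 = 1/2 (hypothetical toy case).
X = np.array([[0., 0.], [2., 0.]])
y = np.array([1., -1.])
alpha = np.array([0.5, 0.5])

# w from the stationarity condition, b from a support vector.
w = (alpha * y) @ X          # = sum_i alpha_i y_i x_i = [-1, 0]
b = y[0] - w @ X[0]          # = 1

def predict(x_new):
    # Equivalent kernel form: sign(sum_i alpha_i y_i (x_i . x_new) + b)
    return np.sign((alpha * y) @ (X @ x_new) + b)
```

Both prediction forms agree; the kernel form never needs w explicitly, which is what makes kernels possible later. The decision boundary here is the vertical line x1 = 1.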
11
“Support Vector”s?
max ∑αi − 1/2 ∑αiαjyiyj(xi·xj)  s.t.  ∑αiyi = 0,  C ≥ αi ≥ 0
Worked example with two points: O at x1 = (0,1) with y1 = +1, X at x2 = (2,2) with y2 = −1.
Plugging in (x1·x1 = 0+1 = 1, x2·x2 = 4+4 = 8, x1·x2 = 0+2 = 2):
max α1 + α2 + 2α1α2 − α1²/2 − 4α2²  s.t.  α1 − α2 = 0,  C ≥ αi ≥ 0
The constraint gives α1 = α2 = α, so we maximize 2α − 5/2α², i.e. 5/2·α(4/5 − α), which peaks at α = 2/5.
w = ∑i αi yi xi = .4([0 1] − [2 2]) = .4[−2 −1]
b = y − w·x at x1: b = 1 − .4([−2 −1]·[0 1]) = 1 + .4 = 1.4
Well... it's the ratio between the alphas that matters, right? Did everyone get what "support vectors" are? Tell me!
- What is the decision boundary?
- Which points are the support vectors?
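The worked example can be checked numerically; this sketch just replays the slide's arithmetic and confirms that both points sit exactly on the margin:

```python
import numpy as np

# The slide's two points: O at (0,1) with y = +1, X at (2,2) with y = -1.
x1, x2 = np.array([0., 1.]), np.array([2., 2.])

# With alpha1 = alpha2 = a, the dual objective reduces to 2a - (5/2)a^2;
# a crude grid search recovers the maximizer a = 2/5.
a = np.linspace(0, 1, 10001)
obj = 2 * a - 2.5 * a ** 2
a_star = a[obj.argmax()]

w = 0.4 * (x1 - x2)          # = 0.4 * [-2, -1]
b = 1 - w @ x1               # = 1.4

# Both points lie exactly on the margin: y_i (w . x_i + b) = 1.
m1 = 1 * (w @ x1 + b)
m2 = -1 * (w @ x2 + b)
```

Since both margins equal 1, both points are support vectors, and the decision boundary is the line w·x + b = 0 between them.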
12
“Support Vector”s?
max ∑αi − 1/2 ∑αiαjyiyj(xi·xj)  s.t.  ∑αiyi = 0,  C ≥ αi ≥ 0
Now add a third point, another O. The "power" (alpha) for the positive class is spread across the positive support vectors. What is α3? Try this at home! [Figure: X at (2,2), O at (0,1), and a second O labeled α3]
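The slide leaves α3 as an exercise and does not give the third point's coordinates, so this sketch assumes a hypothetical O at (1,0) and solves the hard-margin dual numerically with SciPy:

```python
import numpy as np
from scipy.optimize import minimize

# The slide's two points plus an ASSUMED third positive point at (1,0)
# (the exact location on the slide is not specified here).
X = np.array([[0., 1.], [2., 2.], [1., 0.]])
y = np.array([1., -1., 1.])

# Minimize the negated dual: 0.5 a^T Q a - sum(a), Q_ij = y_i y_j (x_i . x_j).
Q = (y[:, None] * X) @ (y[:, None] * X).T
res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),
    x0=np.zeros(3),
    jac=lambda a: Q @ a - 1.0,
    bounds=[(0, None)] * 3,          # hard margin: no upper bound C
    constraints={"type": "eq", "fun": lambda a: a @ y},
    method="SLSQP",
)
alpha = res.x   # the positive "power" now splits across both O points
```

For this assumed configuration the analytic answer is α = (2/9, 4/9, 2/9): all three points are support vectors, the two positives share the "power" equally, and ∑αiyi = 0 still holds.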
13
Playing With SVMs
14
More on Kernels
Kernels represent inner products:
K(a,b) = a·b, or more generally K(a,b) = φ(a)·φ(b)
The kernel trick allows an extremely complex φ( ) while keeping K(a,b) simple.
Goal: avoid having to directly construct φ( ) at any point in the algorithm.
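To make the trick concrete: for 2-D inputs, the degree-2 polynomial kernel K(a,b) = (a·b)² equals the inner product under the explicit feature map φ(x) = (x1², √2·x1x2, x2²). This is a standard identity; the example vectors below are arbitrary:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on 2-D inputs:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(a, b):
    # The same inner product, computed without ever building phi.
    return (a @ b) ** 2

a, b = np.array([1., 2.]), np.array([3., -1.])
lhs = K(a, b)           # (1*3 + 2*(-1))^2 = 1
rhs = phi(a) @ phi(b)   # identical value, via the explicit 3-D feature space
```

The kernel evaluation costs one 2-D dot product; the explicit route builds 3-D features first, and for higher degrees that gap grows combinatorially.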
15
Kernels
The complexity of the optimization problem remains dependent only on the dimensionality of the input space, not on that of the feature space!
16
Can We Use Kernels to Measure Distances?
Can we measure the distance between φ(a) and φ(b) using only K(a,b)?
17
Continued: Yes! Expand the squared distance in feature space:
||φ(a) − φ(b)||² = φ(a)·φ(a) − 2 φ(a)·φ(b) + φ(b)·φ(b) = K(a,a) − 2K(a,b) + K(b,b)
So distances in feature space can be computed from kernel evaluations alone.
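A sketch of this identity with the Gaussian (RBF) kernel, whose φ is infinite-dimensional yet whose feature-space distances are still computable (the inputs and bandwidth are arbitrary):

```python
import numpy as np

def K(a, b, gamma=0.5):
    # Gaussian (RBF) kernel: phi is infinite-dimensional, but K stays cheap.
    return np.exp(-gamma * np.sum((a - b) ** 2))

def feature_dist_sq(a, b):
    # ||phi(a) - phi(b)||^2 = K(a,a) - 2 K(a,b) + K(b,b)
    return K(a, a) - 2 * K(a, b) + K(b, b)

a, b = np.array([0., 0.]), np.array([1., 1.])
d2 = feature_dist_sq(a, b)   # = 2 - 2*exp(-1), measured without phi
```

Since K(x,x) = 1 for the RBF kernel, the squared feature-space distance is always 2 − 2K(a,b), bounded in [0, 2] no matter how far apart a and b are.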
18
Popular Kernel Methods
Gaussian Processes
Kernel Regression (Smoothing): Nadaraya-Watson Kernel Regression
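A minimal Nadaraya-Watson sketch with a Gaussian kernel, assuming a made-up noisy dataset and bandwidth chosen for illustration:

```python
import numpy as np

def nadaraya_watson(x_query, X, y, gamma=200.0):
    # Prediction is a kernel-weighted average of the training targets:
    #   yhat(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i)
    w = np.exp(-gamma * (X - x_query) ** 2)   # Gaussian kernel weights
    return (w * y).sum() / w.sum()

# Noisy samples of sin(2*pi*x) on [0, 1] (hypothetical data).
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * X) + rng.normal(0.0, 0.1, 50)

yhat = nadaraya_watson(0.25, X, y)  # should land near the peak, sin(pi/2) = 1
```

The bandwidth (here controlled by gamma) trades bias against variance: wider kernels smooth more but flatten the peaks.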