Introduction to Machine Learning 236756
Tutorial 6: The SVM Optimization Problem & Kernels
Which Is Better? ($\mathcal{X} = \mathbb{R}^2$)
Margin

The margin of a linear separator is the distance from the separating hyperplane to the closest instance point. Large margins are intuitively more stable: if noise is added to the data, it is more likely to still be separated.
The Margin

Separator: $\{x : \langle w, \phi(x)\rangle = 0\}$

$\mathrm{margin}(w) = |\langle w, \phi(x_0)\rangle|$, where $x_0$ is the closest point (assuming $\|w\| = 1$)
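To make the definition concrete, here is a minimal NumPy sketch (the function name and the matrix `X` of feature vectors are illustrative, not from the slides) that computes the margin of a given separator over a sample:

```python
import numpy as np

def margin(w, X):
    """Margin of the separator <w, x> = 0 over the rows of X."""
    w_unit = w / np.linalg.norm(w)     # enforce ||w|| = 1, as the slide assumes
    return np.min(np.abs(X @ w_unit))  # distance of the closest point
```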
Hard-SVM Equivalent (realizable case)

Solve: $w_0 = \operatorname{argmin}_w \|w\|$ s.t. $\forall i \in [m]:\ y_i\langle w, \phi(x_i)\rangle \ge 1$

Output: $\hat{w} = w_0 / \|w_0\|$
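As an illustration (not part of the tutorial), this program can be handed to a generic convex solver. A minimal sketch with cvxpy, assuming the rows of `X` are the feature vectors $\phi(x_i)$, and minimizing the equivalent objective $\|w\|^2$:

```python
import numpy as np
import cvxpy as cp

def hard_svm(X, y):
    """Hard-SVM: minimize ||w||^2 s.t. y_i <w, x_i> >= 1 for all i."""
    m, n = X.shape
    w = cp.Variable(n)
    constraints = [cp.multiply(y, X @ w) >= 1]   # one margin constraint per example
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value / np.linalg.norm(w.value)     # output w0 / ||w0||
```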
Soft-SVM

Data separability may be a strong requirement (especially with a margin…). Instead, solve:

$\operatorname{argmin}_{w,\xi}\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \xi_i$ s.t. $\forall i \in [m]:\ y_i\langle w, \phi(x_i)\rangle \ge 1 - \xi_i$ and $\xi_i \ge 0$

At the optimum we must have, for all $i$: $\xi_i = \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$.
Soft-SVM: Equivalent Definition

Solve: $\operatorname{argmin}_w\ \lambda\|w\|^2 + L_S^{\mathrm{hinge}}(w)$

Reminder: $L_S^{\mathrm{hinge}}(w) = \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
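For concreteness, the unconstrained objective is easy to evaluate directly. A small NumPy sketch (the function name and passing $\lambda$ as `lam` are choices of this illustration):

```python
import numpy as np

def soft_svm_objective(w, X, y, lam):
    """lambda * ||w||^2 + empirical hinge loss, as in the slide."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))  # max{0, 1 - y_i <w, x_i>}
    return lam * (w @ w) + hinge.mean()
```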
The Representer Theorem

$w^\star \in \operatorname{argmin}_w\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$

Theorem: $w^\star \in \mathrm{span}\{\phi(x_1), \ldots, \phi(x_m)\}$. In other words, $\exists\, \alpha_1, \ldots, \alpha_m$ s.t. $w^\star = \sum_i \alpha_i \phi(x_i)$.

Moreover, we can set, for some optimal $\alpha$: $\alpha_i = 0$ for all $i$ s.t. $y_i\langle w^\star, \phi(x_i)\rangle > 1$.

Support vectors: $\{\phi(x_i) : \alpha_i \neq 0\}$ … hence "SVM".
The Representer Theorem

$w^\star \in \operatorname{argmin}_w\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$

Since the solution is of the form $w^\star = \sum_j \alpha_j \phi(x_j)$, substituting it back into the objective gives

$\operatorname{argmin}_\alpha\ \lambda\sum_{i,j}\alpha_i\alpha_j\langle\phi(x_i), \phi(x_j)\rangle + \frac{1}{m}\sum_i \max\Big\{0,\ 1 - y_i\sum_j \alpha_j\langle\phi(x_j), \phi(x_i)\rangle\Big\}$
The Representer Theorem

From the previous slide we can conclude that the optimization problem depends only on the values $\langle\phi(x_i), \phi(x_j)\rangle$ for $i, j \in [m]$. The function $K(x, x') = \langle\phi(x), \phi(x')\rangle$ is called a kernel.
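Since the problem depends on the data only through these inner products, it suffices to precompute the $m \times m$ Gram matrix once. A minimal sketch (the `kernel` callable and the example kernel below are assumptions of this illustration):

```python
import numpy as np

def gram_matrix(X, kernel):
    """G[i, j] = K(x_i, x_j) -- all the optimizer ever needs to see."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# e.g. a degree-2 polynomial kernel
poly2 = lambda x, xp: (x @ xp + 1) ** 2
```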
Can Any Function Be a Kernel?

No! It must be an inner product of some feature space. Technical condition: for any finite set of points, the Gram matrix (the matrix of kernel evaluations) must be positive semidefinite. $K$ is a valid kernel if and only if for every $m$ and every $x_1, x_2, \ldots, x_m$:

$G = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_m) \\ \vdots & \ddots & \vdots \\ K(x_m, x_1) & \cdots & K(x_m, x_m) \end{pmatrix} \succeq 0$

In particular, $K$ must be symmetric, but this is not enough.
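The condition can be checked numerically on any finite sample. A sketch (the tolerance is an arbitrary allowance for floating-point round-off):

```python
import numpy as np

def is_valid_gram(G, tol=1e-10):
    """Symmetric with nonnegative eigenvalues <=> positive semidefinite."""
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol
```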
Kernels As Prior Knowledge

If we think that the positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2. A kernel encodes a measure of similarity between objects, but it must be a valid inner-product function.
Polynomial Kernels

Higher degrees: $K(x, x') = (\langle x, x'\rangle + 1)^r$ corresponds to a feature space with all monomials of degree $\le r$: $D = O(n^r)$ features, yet still $O(n)$ to calculate. Predictors: all degree-$r$ polynomials.

Variant: $K(x, x') = (\langle x, x'\rangle + a)^r$. A larger $a$ gives higher weight to lower-order features; $a$ specifies a prior bias / knowledge.

(Speaker note: add $a$ in the $r = 2$ example. We get $[a^2,\ a\sqrt{2}\,x_1,\ \ldots,\ \sqrt{2}\,x_1 x_2,\ \ldots,\ x_1^2,\ \ldots]$. If $a$ is very large, a weight vector that uses only first-order features can be small, but to use higher-order features we need a larger $w$, i.e., a higher norm.)
Kernels

Kernel: $K(x, x') = \langle\phi(x), \phi(x')\rangle$

Sometimes calculating (and even specifying) the inner product is easier than writing out and evaluating $\phi$.

E.g., $x \in \mathbb{R}^n$, $\phi(x) \in \mathbb{R}^D$:

$\phi(x) = [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2, \ldots, \sqrt{2}x_n,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3, \ldots, \sqrt{2}x_{n-1}x_n,\ x_1^2, \ldots, x_n^2]$

"Linear separators" in this feature space = all quadratics. Calculating $\phi$ explicitly: $O(D) = O(n^2)$.

Claim: $\langle\phi(x), \phi(x')\rangle = (\langle x, x'\rangle + 1)^2$. Time to calculate: $O(n)$.

(Speaker note: write down the polynomial expansion
$K(x, x') = \big(\sum_t x_t x'_t + 1\big)^2 = \sum_t (x_t x'_t)^2 + 1 + 2\sum_{s<t} x_s x'_s x_t x'_t + 2\sum_s x_s x'_s$.)
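The claim is easy to verify numerically. The sketch below (feature ordering chosen to match the slide) compares the explicit $O(n^2)$-dimensional map against the $O(n)$ kernel evaluation:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * x[i] for i in range(n)]                              # sqrt(2) x_i
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    feats += [x[i] ** 2 for i in range(n)]                                      # x_i^2
    return np.array(feats)

x, xp = np.random.randn(5), np.random.randn(5)
assert np.isclose(phi(x) @ phi(xp), (x @ xp + 1) ** 2)  # the claim holds
```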
Gaussian Kernels (RBF: Radial Basis Functions)

Assume $\mathcal{X} \subseteq \mathbb{R}$ and $\phi : \mathcal{X} \to \mathbb{R}^\infty$ with $\phi(x)_i = \frac{1}{\sqrt{i!}}\, e^{-x^2/2}\, x^i$. Then $\langle\phi(x), \phi(x')\rangle = e^{-(x - x')^2/2}$ (check this!).

More generally, assuming $\mathcal{X} \subseteq \mathbb{R}^n$, we can also define a kernel with $\langle\phi(x), \phi(x')\rangle = e^{-\gamma\|x - x'\|^2}$.
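One way to "check this" is to truncate the infinite feature map: with enough coordinates the partial inner product matches the closed form (the truncation length 50 is an arbitrary choice of this sketch):

```python
import numpy as np
from math import factorial

def phi_coords(x, N=50):
    """First N coordinates of the infinite RBF feature map (X subset of R)."""
    return np.array([np.exp(-x**2 / 2) * x**i / np.sqrt(factorial(i)) for i in range(N)])

x, xp = 0.7, -0.3
print(phi_coords(x) @ phi_coords(xp))  # truncated series
print(np.exp(-(x - xp)**2 / 2))        # closed form; they agree to high precision
```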
Exercise: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html
Solving SVMs Efficiently

The objective is convex in $\alpha_1, \ldots, \alpha_m$. Gradient descent! …with a randomization trick: instead of moving in the direction of the gradient, we move in the direction of a random vector whose expectation equals the gradient: $\mathbb{E}[\text{random vector}] = \text{gradient}$.
SGD for SVM

Parameter: $T$ (number of steps)
Initialize: $\beta^{(1)} = 0 \in \mathbb{R}^m$
For $t = 1, \ldots, T$:
  Let $\alpha^{(t)} = \frac{1}{\lambda t}\beta^{(t)}$
  Choose $i$ uniformly at random from $[m]$  // random example
  For all $j \neq i$ set $\beta_j^{(t+1)} = \beta_j^{(t)}$
  If $y_i \sum_j \alpha_j^{(t)} K(x_j, x_i) < 1$  // is $x_i$'s prediction incorrect / low-margin?
    Set $\beta_i^{(t+1)} = \beta_i^{(t)} + y_i$
  Else
    Set $\beta_i^{(t+1)} = \beta_i^{(t)}$
Output: $\hat{w} = \sum_j \bar{\alpha}_j \phi(x_j)$ where $\bar{\alpha} = \frac{1}{T}\sum_t \alpha^{(t)}$
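A direct NumPy transcription of the pseudocode, given a precomputed Gram matrix `K` (the function name and the RNG seeding are choices of this sketch):

```python
import numpy as np

def sgd_kernel_svm(K, y, lam, T, seed=0):
    """Kernelized SGD for Soft-SVM; K is the m x m Gram matrix, y in {-1,+1}^m."""
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    beta = np.zeros(m)
    alpha_bar = np.zeros(m)
    for t in range(1, T + 1):
        alpha = beta / (lam * t)           # alpha^(t) = beta^(t) / (lambda * t)
        i = rng.integers(m)                # random example
        if y[i] * (alpha @ K[:, i]) < 1:   # incorrect / low-margin prediction
            beta[i] += y[i]                # only coordinate i changes
        alpha_bar += alpha
    return alpha_bar / T                   # w = sum_j alpha_bar[j] * phi(x_j)
```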