Introduction to Machine Learning


1 Introduction to Machine Learning 236756
Tutorial 6: The SVM Optimization Problem and Kernels

2 Which Is Better? $\mathcal{X} = \mathbb{R}^2$

3 Margin
The margin of a linear separator is the distance from the separating hyperplane to the closest instance point.
Large margins are intuitively more stable: if noise is added to the data, it is more likely to still be separated.

4 The Margin
Separating hyperplane: $\{x : \langle w, \phi(x)\rangle = 0\}$
$\mathrm{margin}(w) = |\langle w, \phi(x_0)\rangle|$, where $\phi(x_0)$ is the closest instance point (assume $\|w\| = 1$)

5 Hard-SVM Equivalent (realizable case)
Solve: $w_0 = \operatorname{argmin}_w \|w\|^2$ s.t. $\forall i \in [m]:\ y_i\langle w, \phi(x_i)\rangle \ge 1$
Output: $\hat{w} = \frac{w_0}{\|w_0\|}$
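A minimal sketch of this program, assuming cvxpy as the QP solver and taking $\phi$ as the identity on a hypothetical separable toy set (not from the tutorial):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (phi = identity), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(w))        # argmin ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]        # y_i <w, x_i> >= 1
cp.Problem(objective, constraints).solve()

w0 = w.value
w_hat = w0 / np.linalg.norm(w0)                   # output w0 / ||w0||
print("w_hat =", w_hat)
```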

6 Soft-SVM
Data separability may be a strong requirement (especially with margin…)
Solve: $\operatorname{argmin}_{w,\xi}\ \lambda\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \xi_i$ s.t. $\forall i \in [m]:\ y_i\langle w, \phi(x_i)\rangle \ge 1 - \xi_i$ and $\forall i:\ \xi_i \ge 0$
At the optimum, we must have $\forall i:\ \xi_i = \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
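The same cvxpy sketch extended with slack variables $\xi_i$ (again a hypothetical toy set; $\lambda$ is a free parameter here):

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0], [-0.5, -0.5]])
y = np.array([1, 1, -1, -1, 1])          # last point violates separability
m, lam = len(y), 0.1

w = cp.Variable(2)
xi = cp.Variable(m)
objective = cp.Minimize(lam * cp.sum_squares(w) + cp.sum(xi) / m)
constraints = [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

# At the optimum each xi_i equals the hinge loss max{0, 1 - y_i <w, x_i>}.
print(np.round(xi.value, 3))
print(np.round(np.maximum(0, 1 - y * (X @ w.value)), 3))
```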

7 Soft-SVM: Equivalent Definition
Solve: $\operatorname{argmin}_w\ \lambda\|w\|^2 + L_S^{hinge}(w)$
Reminder: $L_S^{hinge}(w) = \frac{1}{m}\sum_{i=1}^m \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
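A minimal numpy sketch of this objective (function names and toy data are mine, not from the tutorial):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Empirical hinge loss: (1/m) * sum_i max{0, 1 - y_i <w, x_i>}."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def soft_svm_objective(w, X, y, lam):
    """Regularized objective: lambda * ||w||^2 + hinge loss."""
    return lam * np.dot(w, w) + hinge_loss(w, X, y)

# Tiny usage example on hypothetical data:
X = np.array([[2.0, 2.0], [-1.0, -3.0]])
y = np.array([1, -1])
print(soft_svm_objective(np.array([0.5, 0.5]), X, y, lam=0.1))
```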

8 The Representer Theorem
$w^* \in \operatorname{argmin}_w\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$ (i.e., $\lambda\|w\|^2 + L_S^{hinge}(w)$)
Theorem: $w^* \in V = \operatorname{span}\{\phi(x_1), \ldots, \phi(x_m)\}$
In other words, $\exists\,\alpha_1, \ldots, \alpha_m$ s.t. $w^* = \sum_i \alpha_i\,\phi(x_i)$
Moreover, for some optimal $w^*$ we can set $\alpha_i = 0$ for all $i$ s.t. $y_i\langle w^*, \phi(x_i)\rangle > 1$
Support vectors: $\{\phi(x_i) : \alpha_i \ne 0\}$ … hence "SVM"

9 The Representer Theorem
$w^* \in \operatorname{argmin}_w\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
Since the solution is of the form $w^* = \sum_j \alpha_j\,\phi(x_j)$, substituting it back into the objective gives
$\operatorname{argmin}_{\alpha}\ \lambda\sum_{i,j}\alpha_i\alpha_j\langle\phi(x_i), \phi(x_j)\rangle + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\sum_j \alpha_j\langle\phi(x_j), \phi(x_i)\rangle\}$
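This substitution can be coded directly: with $G$ the Gram matrix, $\|w\|^2 = \alpha^\top G \alpha$ and $\langle w, \phi(x_i)\rangle = (G\alpha)_i$, so the objective is a function of $\alpha$ and $G$ alone. A minimal numpy sketch (names are mine):

```python
import numpy as np

def kernelized_objective(alpha, G, y, lam):
    """Soft-SVM objective after substituting w = sum_j alpha_j phi(x_j).

    G is the m x m Gram matrix G[i, j] = <phi(x_i), phi(x_j)>, so
    ||w||^2 = alpha^T G alpha and <w, phi(x_i)> = (G @ alpha)[i].
    """
    margins = y * (G @ alpha)
    return lam * alpha @ G @ alpha + np.mean(np.maximum(0.0, 1.0 - margins))
```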

10 The Representer Theorem
From the previous slide we can conclude that the optimization problem depends only on the values of $\langle\phi(x_i), \phi(x_j)\rangle$ for $i, j \in [m]$.
The function $K(x, x') = \langle\phi(x), \phi(x')\rangle$ is called a kernel.

11 Can any Function be a Kernel?
No! It must be an inner product in some feature space.
Technical condition: for any finite set of points, the Gram matrix (the matrix of kernel evaluations) must be positive semidefinite.
$K$ is a valid kernel if and only if $\forall m\ \forall x_1, x_2, \ldots, x_m$:
$G = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_m) \\ \vdots & \ddots & \vdots \\ K(x_m, x_1) & \cdots & K(x_m, x_m) \end{pmatrix} \succeq 0$
In particular, $K$ must be symmetric, but this is not enough.
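A minimal numpy sketch of this necessary check on one finite point set (the helper function and example kernels are mine, for illustration):

```python
import numpy as np

def is_valid_gram(K_func, points, tol=1e-9):
    """Build the Gram matrix on one finite point set and test the
    necessary condition: symmetry and all eigenvalues >= -tol (PSD)."""
    G = np.array([[K_func(x, xp) for xp in points] for x in points])
    if not np.allclose(G, G.T):               # kernel must be symmetric
        return False
    return np.linalg.eigvalsh(G).min() >= -tol

points = [0.0, 1.0, 2.0]
print(is_valid_gram(lambda x, xp: (x * xp + 1) ** 2, points))   # True: polynomial kernel
print(is_valid_gram(lambda x, xp: -abs(x - xp), points))        # False: Gram matrix not PSD
```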

12 Kernels As Prior Knowledge
If we think that the positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2.
A kernel encodes a measure of similarity between objects. It must be a valid inner product function.

13 Polynomial Kernels
Higher degrees: $K(x, x') = (\langle x, x'\rangle + 1)^r$ corresponds to a feature space with all monomials of degree up to $r$.
$D = O(d^r)$ features, still only $O(d)$ to calculate.
Predictors: all degree-$r$ polynomials.
Variant: $K(x, x') = (\langle x, x'\rangle + a)^r$. Larger $a$ gives higher weight to lower-order features; $a$ specifies a prior bias / prior knowledge.
(Adding $a$ in the $r = 2$ example gives $\phi(x) = [a,\ \sqrt{2a}\,x_1,\ \ldots,\ \sqrt{2}\,x_1 x_2,\ \ldots,\ x_1^2,\ \ldots]$. If $a$ is very large, a weight vector that uses only first-order features can have a small norm, while using higher-order features requires a larger $w$, i.e. a higher norm.)
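A minimal sketch with scikit-learn (an assumption, not part of the tutorial): in SVC's polynomial kernel $(\gamma\langle x, x'\rangle + \texttt{coef0})^{\texttt{degree}}$, coef0 plays the role of $a$ and degree the role of $r$; the toy "ellipse" data is hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + 4 * X[:, 1] ** 2 < 1.5, 1, -1)   # ellipse-shaped classes

# Degree-2 polynomial kernel: K(x, x') = (gamma * <x, x'> + coef0)^2.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```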

14 Kernels
Kernel: $K(x, x') = \langle\phi(x), \phi(x')\rangle$
Sometimes calculating (and even specifying) the inner product is easier than writing out and evaluating $\phi(\cdot)$.
E.g., $x \in \mathbb{R}^d$, $\phi(x) \in \mathbb{R}^D$:
$\phi(x) = [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \ldots,\ \sqrt{2}x_d,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \ldots,\ \sqrt{2}x_{d-1}x_d,\ x_1^2,\ x_2^2,\ \ldots,\ x_d^2]$
"Linear separators" in this feature space = all quadratics.
Calculating explicitly: $O(D) = O(d^2)$.
Claim: $\langle\phi(x), \phi(x')\rangle = (\langle x, x'\rangle + 1)^2$. Time to calculate: $O(d)$.
(To see this, write out the polynomial expansion: $K(x, x') = (\sum_t x_t x'_t + 1)^2 = \sum_t (x_t x'_t)^2 + 2\sum_{s<t} x_s x'_s x_t x'_t + 2\sum_t x_t x'_t + 1$, matching the coordinates of $\phi$ term by term.)
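A quick numerical check of the claim, comparing the explicit $O(d^2)$ feature map against the $O(d)$ kernel evaluation (a sketch; the helper phi is mine):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map (D = O(d^2) coordinates)."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, cross, x ** 2])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=5), rng.normal(size=5)

lhs = phi(x) @ phi(xp)          # O(d^2) work
rhs = (x @ xp + 1) ** 2         # O(d) work
print(np.isclose(lhs, rhs))     # True
```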

15 Gaussian Kernels (RBF: Radial Basis Functions)
Assume $\mathcal{X} \subseteq \mathbb{R}$. Define $\phi : \mathcal{X} \to \mathbb{R}^{\infty}$ by $\phi(x)_i = \frac{1}{\sqrt{i!}}\,e^{-x^2/2}\,x^i$.
Then $\langle\phi(x), \phi(x')\rangle = e^{-\|x - x'\|^2/2}$ (check this!).
More generally, assuming $\mathcal{X} \subseteq \mathbb{R}^n$, one can also define a kernel with $\langle\phi(x), \phi(x')\rangle = e^{-\gamma\|x - x'\|^2}$.
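A sketch verifying the one-dimensional case by truncating the infinite feature map after a few coordinates (assuming the $e^{-x^2/2}$ convention written above):

```python
import numpy as np
from math import factorial, exp

def phi_truncated(x, n_terms=30):
    """First n_terms coordinates of the infinite feature map
    phi(x)_i = (1 / sqrt(i!)) * exp(-x^2 / 2) * x^i."""
    return np.array([exp(-x ** 2 / 2) * x ** i / np.sqrt(factorial(i))
                     for i in range(n_terms)])

x, xp = 0.7, -0.3
lhs = phi_truncated(x) @ phi_truncated(xp)
rhs = exp(-(x - xp) ** 2 / 2)
print(np.isclose(lhs, rhs))     # True (up to truncation error)
```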

16 http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html

17

18

19

20 Solving SVMs Efficiently
The objective is convex in $\alpha_1, \ldots, \alpha_m$.
Gradient Descent! … with a randomization trick: instead of moving in the direction of the gradient, we move in the direction of a random vector whose expectation equals the gradient, $\mathbb{E}[\text{random vector}] = \text{gradient}$.

21 SGD for SVM
Parameter: $T$ (number of steps). Initialize: $\beta^{(1)} = 0 \in \mathbb{R}^m$.
For $t = 1, \ldots, T$:
  Let $\alpha^{(t)} = \frac{1}{\lambda t}\,\beta^{(t)}$
  Choose $i$ uniformly at random from $[m]$   // random example
  For all $j \ne i$ set $\beta_j^{(t+1)} = \beta_j^{(t)}$
  If $y_i \sum_j \alpha_j^{(t)} K(x_j, x_i) < 1$   // $x_i$'s prediction incorrect / low margin
    Set $\beta_i^{(t+1)} = \beta_i^{(t)} + y_i$
  Else
    Set $\beta_i^{(t+1)} = \beta_i^{(t)}$
Output $\hat{w} = \sum_j \bar{\alpha}_j\,\phi(x_j)$ where $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^T \alpha^{(t)}$

