Introduction to Machine Learning


1 Introduction to Machine Learning 236756
Tutorial 6: The SVM Optimization Problem and Kernels

2 Which Is Better? $\mathcal{X} = \mathbb{R}^2$

3 Margin
The margin of a linear separator is the distance from the separating hyperplane to the closest instance point.
Large margins are intuitively more stable: if noise is added to the data, it is more likely to still be separated.

4 The Margin
Separating hyperplane: $\{x : \langle w, \phi(x)\rangle = 0\}$
$\mathrm{margin}(w) = |\langle w, \phi(x_0)\rangle|$, where $\phi(x_0)$ is the closest instance point (assume $\|w\| = 1$)

5 Hard-SVM Equivalent (realizable case)
Solve: $w_0 = \operatorname{argmin}_w \|w\|^2$ s.t. $\forall i \in [m]:\ y_i\langle w, \phi(x_i)\rangle \ge 1$
Output: $\hat{w} = \frac{w_0}{\|w_0\|}$
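A minimal sketch of this program, assuming cvxpy as the QP solver and taking $\phi$ as the identity on a hypothetical separable toy set (not from the tutorial):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (phi = identity), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(w))        # argmin ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]        # y_i <w, x_i> >= 1
cp.Problem(objective, constraints).solve()

w0 = w.value
w_hat = w0 / np.linalg.norm(w0)                   # output w0 / ||w0||
print("w_hat =", w_hat)
```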

6 Soft-SVM
Data separability may be a strong requirement (especially with margin…)
Solve: $\operatorname{argmin}_{w,\xi}\ \lambda\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \xi_i$ s.t. $\forall i \in [m]:\ y_i\langle w, \phi(x_i)\rangle \ge 1 - \xi_i$ and $\forall i:\ \xi_i \ge 0$
At the optimum, we must have $\forall i:\ \xi_i = \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
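The same cvxpy sketch extended with slack variables $\xi_i$ (again a hypothetical toy set; $\lambda$ is a free parameter here):

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0], [-0.5, -0.5]])
y = np.array([1, 1, -1, -1, 1])          # last point violates separability
m, lam = len(y), 0.1

w = cp.Variable(2)
xi = cp.Variable(m)
objective = cp.Minimize(lam * cp.sum_squares(w) + cp.sum(xi) / m)
constraints = [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

# At the optimum each xi_i equals the hinge loss max{0, 1 - y_i <w, x_i>}.
print(np.round(xi.value, 3))
print(np.round(np.maximum(0, 1 - y * (X @ w.value)), 3))
```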

7 Soft-SVM: Equivalent Definition
Solve: $\operatorname{argmin}_w\ \lambda\|w\|^2 + L_S^{hinge}(w)$
Reminder: $L_S^{hinge}(w) = \frac{1}{m}\sum_{i=1}^m \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
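A minimal numpy sketch of this objective (function names and toy data are mine, not from the tutorial):

```python
import numpy as np

def hinge_loss(w, X, y):
    """Empirical hinge loss: (1/m) * sum_i max{0, 1 - y_i <w, x_i>}."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def soft_svm_objective(w, X, y, lam):
    """Regularized objective: lambda * ||w||^2 + hinge loss."""
    return lam * np.dot(w, w) + hinge_loss(w, X, y)

# Tiny usage example on hypothetical data:
X = np.array([[2.0, 2.0], [-1.0, -3.0]])
y = np.array([1, -1])
print(soft_svm_objective(np.array([0.5, 0.5]), X, y, lam=0.1))
```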

8 The Representer Theorem
$w^* \in \operatorname{argmin}_w\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$ (i.e., $\lambda\|w\|^2 + L_S^{hinge}(w)$)
Theorem: $w^* \in V = \operatorname{span}\{\phi(x_1), \ldots, \phi(x_m)\}$
In other words, $\exists\,\alpha_1, \ldots, \alpha_m$ s.t. $w^* = \sum_i \alpha_i\,\phi(x_i)$
Moreover, for some optimal $w^*$ we can set $\alpha_i = 0$ for all $i$ s.t. $y_i\langle w^*, \phi(x_i)\rangle > 1$
Support vectors: $\{\phi(x_i) : \alpha_i \ne 0\}$ … hence "SVM"

9 The Representer Theorem
$w^* \in \operatorname{argmin}_w\ \lambda\|w\|^2 + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\langle w, \phi(x_i)\rangle\}$
Since the solution is of the form $w^* = \sum_j \alpha_j\,\phi(x_j)$, substituting it back into the objective gives
$\operatorname{argmin}_{\alpha}\ \lambda\sum_{i,j}\alpha_i\alpha_j\langle\phi(x_i), \phi(x_j)\rangle + \frac{1}{m}\sum_i \max\{0,\ 1 - y_i\sum_j \alpha_j\langle\phi(x_j), \phi(x_i)\rangle\}$
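This substitution can be coded directly: with $G$ the Gram matrix, $\|w\|^2 = \alpha^\top G \alpha$ and $\langle w, \phi(x_i)\rangle = (G\alpha)_i$, so the objective is a function of $\alpha$ and $G$ alone. A minimal numpy sketch (names are mine):

```python
import numpy as np

def kernelized_objective(alpha, G, y, lam):
    """Soft-SVM objective after substituting w = sum_j alpha_j phi(x_j).

    G is the m x m Gram matrix G[i, j] = <phi(x_i), phi(x_j)>, so
    ||w||^2 = alpha^T G alpha and <w, phi(x_i)> = (G @ alpha)[i].
    """
    margins = y * (G @ alpha)
    return lam * alpha @ G @ alpha + np.mean(np.maximum(0.0, 1.0 - margins))
```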

10 The Representer Theorem
From the previous slide we can conclude that the optimization problem depends only on the values of $\langle\phi(x_i), \phi(x_j)\rangle$ for $i, j \in [m]$.
The function $K(x, x') = \langle\phi(x), \phi(x')\rangle$ is called a kernel.

11 Can any Function be a Kernel?
No! It must be an inner product in some feature space.
Technical condition: for any finite set of points, the Gram matrix (the matrix of kernel evaluations) must be positive semidefinite.
$K$ is a valid kernel if and only if $\forall m\ \forall x_1, x_2, \ldots, x_m$:
$G = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_m) \\ \vdots & \ddots & \vdots \\ K(x_m, x_1) & \cdots & K(x_m, x_m) \end{pmatrix} \succeq 0$
In particular, $K$ must be symmetric, but this is not enough.
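A minimal numpy sketch of this necessary check on one finite point set (the helper function and example kernels are mine, for illustration):

```python
import numpy as np

def is_valid_gram(K_func, points, tol=1e-9):
    """Build the Gram matrix on one finite point set and test the
    necessary condition: symmetry and all eigenvalues >= -tol (PSD)."""
    G = np.array([[K_func(x, xp) for xp in points] for x in points])
    if not np.allclose(G, G.T):               # kernel must be symmetric
        return False
    return np.linalg.eigvalsh(G).min() >= -tol

points = [0.0, 1.0, 2.0]
print(is_valid_gram(lambda x, xp: (x * xp + 1) ** 2, points))   # True: polynomial kernel
print(is_valid_gram(lambda x, xp: -abs(x - xp), points))        # False: Gram matrix not PSD
```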

12 Kernels As Prior Knowledge
If we think that the positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2.
A kernel encodes a measure of similarity between objects. It must be a valid inner product function.

13 Polynomial Kernels
Higher degrees: $K(x, x') = (\langle x, x'\rangle + 1)^r$ corresponds to a feature space with all monomials of degree up to $r$.
$D = O(d^r)$ features, still only $O(d)$ to calculate.
Predictors: all degree-$r$ polynomials.
Variant: $K(x, x') = (\langle x, x'\rangle + a)^r$. Larger $a$ gives higher weight to lower-order features; $a$ specifies a prior bias / prior knowledge.
(Adding $a$ in the $r = 2$ example gives $\phi(x) = [a,\ \sqrt{2a}\,x_1,\ \ldots,\ \sqrt{2}\,x_1 x_2,\ \ldots,\ x_1^2,\ \ldots]$. If $a$ is very large, a weight vector that uses only first-order features can have a small norm, while using higher-order features requires a larger $w$, i.e. a higher norm.)
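A minimal sketch with scikit-learn (an assumption, not part of the tutorial): in SVC's polynomial kernel $(\gamma\langle x, x'\rangle + \texttt{coef0})^{\texttt{degree}}$, coef0 plays the role of $a$ and degree the role of $r$; the toy "ellipse" data is hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + 4 * X[:, 1] ** 2 < 1.5, 1, -1)   # ellipse-shaped classes

# Degree-2 polynomial kernel: K(x, x') = (gamma * <x, x'> + coef0)^2.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```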

14 Kernels
Kernel: $K(x, x') = \langle\phi(x), \phi(x')\rangle$
Sometimes calculating (and even specifying) the inner product is easier than writing out and evaluating $\phi(\cdot)$.
E.g., $x \in \mathbb{R}^d$, $\phi(x) \in \mathbb{R}^D$:
$\phi(x) = [1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \ldots,\ \sqrt{2}x_d,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \ldots,\ \sqrt{2}x_{d-1}x_d,\ x_1^2,\ x_2^2,\ \ldots,\ x_d^2]$
"Linear separators" in this feature space = all quadratics.
Calculating explicitly: $O(D) = O(d^2)$.
Claim: $\langle\phi(x), \phi(x')\rangle = (\langle x, x'\rangle + 1)^2$. Time to calculate: $O(d)$.
(To see this, write out the polynomial expansion: $K(x, x') = (\sum_t x_t x'_t + 1)^2 = \sum_t (x_t x'_t)^2 + 2\sum_{s<t} x_s x'_s x_t x'_t + 2\sum_t x_t x'_t + 1$, matching the coordinates of $\phi$ term by term.)
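A quick numerical check of the claim, comparing the explicit $O(d^2)$ feature map against the $O(d)$ kernel evaluation (a sketch; the helper phi is mine):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map (D = O(d^2) coordinates)."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, cross, x ** 2])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=5), rng.normal(size=5)

lhs = phi(x) @ phi(xp)          # O(d^2) work
rhs = (x @ xp + 1) ** 2         # O(d) work
print(np.isclose(lhs, rhs))     # True
```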

15 Gaussian Kernels (RBF: Radial Basis Functions)
Assume $\mathcal{X} \subseteq \mathbb{R}$. Define $\phi : \mathcal{X} \to \mathbb{R}^{\infty}$ by $\phi(x)_i = \frac{1}{\sqrt{i!}}\,e^{-x^2/2}\,x^i$.
Then $\langle\phi(x), \phi(x')\rangle = e^{-\|x - x'\|^2/2}$ (check this!).
More generally, assuming $\mathcal{X} \subseteq \mathbb{R}^n$, one can also define a kernel with $\langle\phi(x), \phi(x')\rangle = e^{-\gamma\|x - x'\|^2}$.
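A sketch verifying the one-dimensional case by truncating the infinite feature map after a few coordinates (assuming the $e^{-x^2/2}$ convention written above):

```python
import numpy as np
from math import factorial, exp

def phi_truncated(x, n_terms=30):
    """First n_terms coordinates of the infinite feature map
    phi(x)_i = (1 / sqrt(i!)) * exp(-x^2 / 2) * x^i."""
    return np.array([exp(-x ** 2 / 2) * x ** i / np.sqrt(factorial(i))
                     for i in range(n_terms)])

x, xp = 0.7, -0.3
lhs = phi_truncated(x) @ phi_truncated(xp)
rhs = exp(-(x - xp) ** 2 / 2)
print(np.isclose(lhs, rhs))     # True (up to truncation error)
```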

16 http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html

17

18

19

20 Solving SVMs Efficiently
The objective is convex in $\alpha_1, \ldots, \alpha_m$.
Gradient Descent! … with a randomization trick: instead of moving in the direction of the gradient, we move in the direction of a random vector whose expectation equals the gradient, $\mathbb{E}[\text{random vector}] = \text{gradient}$.

21 SGD for SVM
Parameter: $T$ (number of steps). Initialize: $\beta^{(1)} = 0 \in \mathbb{R}^m$.
For $t = 1, \ldots, T$:
  Let $\alpha^{(t)} = \frac{1}{\lambda t}\,\beta^{(t)}$
  Choose $i$ uniformly at random from $[m]$   // random example
  For all $j \ne i$ set $\beta_j^{(t+1)} = \beta_j^{(t)}$
  If $y_i \sum_j \alpha_j^{(t)} K(x_j, x_i) < 1$   // $x_i$'s prediction incorrect / low margin
    Set $\beta_i^{(t+1)} = \beta_i^{(t)} + y_i$
  Else
    Set $\beta_i^{(t+1)} = \beta_i^{(t)}$
Output $\hat{w} = \sum_j \bar{\alpha}_j\,\phi(x_j)$ where $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^T \alpha^{(t)}$

