1
Discriminative Machine Learning, Topic 3: SVM Duality. Slides available online at http://mpawankumar.info. M. Pawan Kumar (based on Prof. A. Zisserman's course material).
2
Linear Classifier. Training loss is 0: a linear classifier is an appropriate choice. (Figure: linearly separable data plotted against x_1 and x_2.)
3
Linear Classifier. Training loss is small: a linear classifier is still an appropriate choice. (Figures: two data sets plotted against x_1 and x_2.)
4
Linear Classifier. Left: training loss is small, so a linear classifier is an appropriate choice. Right: training loss is large, so a linear classifier is not an appropriate choice. (Figures: two data sets plotted against x_1 and x_2.)
5
Linear Classifier. Left: training loss is small, so a linear classifier is an appropriate choice. Right: training loss is large, but it is the feature vector, not the linear classifier, that is inappropriate. (Figures: two data sets plotted against x_1 and x_2.)
6
Feature Vector. We were using Φ(x) = [x_1, x_2]. Instead, let us use Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]. (Figures: the data in the original (x_1, x_2) space and in the lifted (x_1^2, x_2^2, √2 x_1 x_2) space.)
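To make the lifted feature concrete, here is a minimal NumPy sketch (the ring-shaped toy data and the particular separating weights are illustrative assumptions, not taken from the slides): points that no line can separate in (x_1, x_2) are separated by a hyperplane after applying Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2].

```python
import numpy as np

def lift(x):
    """Quadratic feature map Phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

# Toy data: an inner circle (one class) surrounded by an outer circle (the other).
theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)
inner = 0.5 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
outer = 2.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)

# In the lifted space the weights w = [1, 1, 0] with bias -1 implement
# x1^2 + x2^2 - 1, i.e. the circular boundary becomes a hyperplane.
w, b = np.array([1.0, 1.0, 0.0]), -1.0
score = lambda x: w @ lift(x) + b
print(all(score(x) < 0 for x in inner), all(score(x) > 0 for x in outer))  # True True
```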
7
Feature Vector. Use a D-dimensional feature vector; the parameters will also be D-dimensional. For a large D (D >> n), the data may become linearly separable, giving accurate classification, but the large number of parameters to learn makes optimization inefficient. Can we somehow avoid this?
8
Outline: Reformulation (Examples, SVM Learning Problem); SVM Dual; Kernels.
9
Optimization – Simple Example. min ξ s.t. ξ ≥ 3. (Number line with candidates 2, 3, 4: the optimum is ξ = 3 ✔.)
10
Optimization – Simple Example. min ξ s.t. ξ ≥ 3, ξ ≥ 4, ξ ≥ 2. (Of the candidates 2, 3, 4, the optimum is now ξ = 4 ✔.) We have to use the maximum lower bound. Let us make this a bit more abstract.
11
Constrained Optimization - Example. min ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2. Among the candidates w_1 + w_2, 2w_1 + 3w_2 and max{w_1 + w_2, 2w_1 + 3w_2}, the optimum is max{w_1 + w_2, 2w_1 + 3w_2} ✔. We have to use the maximum lower bound. Let us consider the other direction.
12
Unconstrained Optimization - Example. min_w f(w_1, w_2) + max{w_1 + w_2, 2w_1 + 3w_2}.
13
Unconstrained Optimization - Example. min_{w,ξ} f(w_1, w_2) + ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2. This is an equivalent constrained optimization problem; we will call ξ a slack variable. We will reformulate the SVM learning problem in the same way.
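A quick numerical check of this equivalence, using scipy and a hypothetical choice f(w_1, w_2) = w_1^2 + w_2^2 (the slides leave f unspecified): minimizing the max directly and minimizing with a slack variable ξ should give the same optimal value.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda w: w[0] ** 2 + w[1] ** 2  # hypothetical choice of f(w1, w2)

# Unconstrained form: min_w f(w) + max{w1 + w2, 2*w1 + 3*w2}.
direct = minimize(lambda w: f(w) + max(w[0] + w[1], 2 * w[0] + 3 * w[1]),
                  x0=np.zeros(2), method="Nelder-Mead")

# Constrained form with slack xi (z = [w1, w2, xi]):
# min f(w) + xi  s.t.  xi >= w1 + w2,  xi >= 2*w1 + 3*w2.
cons = [{"type": "ineq", "fun": lambda z: z[2] - (z[0] + z[1])},
        {"type": "ineq", "fun": lambda z: z[2] - (2 * z[0] + 3 * z[1])}]
slack = minimize(lambda z: f(z[:2]) + z[2], x0=np.zeros(3),
                 method="SLSQP", constraints=cons)

print(direct.fun, slack.fun)  # both approximately -0.5 for this choice of f
```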
14
Outline: Reformulation (Examples, SVM Learning Problem); SVM Dual; Kernels.
15
SVM Learning Problem. min_w λ||w||^2 + Σ_i [ max_y { w^T Ψ(x_i, y) + Δ(y_i, y) } - w^T Ψ(x_i, y_i) ].
16
SVM Learning Problem. min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y (slight abuse of notation).
17
SVM Learning Problem. min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all y, where (with a slight abuse of notation) Ψ(x_i, y_i, y) = Ψ(x_i, y) - Ψ(x_i, y_i). This is a convex quadratic program.
18
Convex Quadratic Program. min_z z^T Q z + z^T q + C s.t. z^T a_i ≤ b_i, i = 1, …, m, with Q ⪰ 0 (positive semidefinite). Many efficient solvers exist. But we already know how to optimize the original problem; the value of the reformulation is that it allows us to write down the dual.
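As an illustration of "many efficient solvers", here is a generic convex QP of this shape posed in CVXPY (the matrices below are made-up placeholder data, not anything from the slides); any off-the-shelf QP solver would do.

```python
import cvxpy as cp
import numpy as np

# Placeholder problem data; Q must be positive semidefinite.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0], [-1.0, 2.0]])  # rows are the a_i
b = np.array([1.0, 2.0])

z = cp.Variable(2)
objective = cp.Minimize(cp.quad_form(z, Q) + q @ z)  # the additive constant C is irrelevant
problem = cp.Problem(objective, [A @ z <= b])
problem.solve()
print(problem.value, z.value)
```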
19
Outline: Reformulation; SVM Dual (Example, Generalization); Kernels.
20
Example. min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Since 0 ≤ ξ, we have (w_1^2 + w_2^2) + 0 ≤ (w_1^2 + w_2^2) + ξ: a lower bound on the objective.
21
Example. min_w (w_1^2 + w_2^2) + 0, a lower bound on the objective. Setting the derivatives with respect to w to 0 gives w_1 = 0, w_2 = 0, so this lower bound is 0.
22
Example. min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Multiplying each constraint by 1/3 and summing gives w_1 + w_2 + 2/3 ≤ ξ, so (w_1^2 + w_2^2) + w_1 + w_2 + 2/3 is a lower bound on the objective.
23
Example. min_w (w_1^2 + w_2^2) + w_1 + w_2 + 2/3, a lower bound on the objective. Setting the derivatives with respect to w to 0 gives w_1 = -1/2, w_2 = -1/2, so this lower bound is 1/6. How can we find the maximum lower bound?
24
Example. min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Multiply the constraints by α_1, α_2, α_3 respectively, with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0 and α_1 + α_2 + α_3 = 1, and sum: (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ (α_1 + α_2 + α_3)ξ.
25
Example. min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Since α_1 + α_2 + α_3 = 1 (with α_1, α_2, α_3 ≥ 0), the combined constraint simplifies to (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ ξ.
26
Example. This gives the lower bound min_w (w_1^2 + w_2^2) + (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2), for any α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0 with α_1 + α_2 + α_3 = 1. Setting the derivatives with respect to w to 0 gives w_1 = -(2α_1 + α_2)/2 and w_2 = -(α_1 + 2α_2)/2.
27
Example. Substituting w_1 = -(2α_1 + α_2)/2 and w_2 = -(α_1 + 2α_2)/2 back in, the maximum lower bound is max_α -(2α_1 + α_2)^2/4 - (α_1 + 2α_2)^2/4 + (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1.
28
Example. Equivalently, the dual problem is min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1, which is again a convex quadratic program. Weak duality: the value of the dual (in its original maximization form) for any feasible α is ≤ the value of the primal for any feasible w.
29
Example. Dual problem (a convex quadratic program): min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1. Strong duality: the value of the dual for the optimal α equals the value of the primal for the optimal w. (A numerical check is sketched below.)
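A minimal numerical check of this worked example (the solver choice is just for illustration): solving the primal over (w, ξ) and the dual over α should give the same optimal value, approximately 2/9.

```python
import numpy as np
from scipy.optimize import minimize

# Primal: min (w1^2 + w2^2) + xi  s.t. 2*w1 + w2 + 1 <= xi,
#                                      w1 + 2*w2 + 1 <= xi,  0 <= xi.
primal_cons = [{"type": "ineq", "fun": lambda z: z[2] - (2 * z[0] + z[1] + 1)},
               {"type": "ineq", "fun": lambda z: z[2] - (z[0] + 2 * z[1] + 1)},
               {"type": "ineq", "fun": lambda z: z[2]}]
primal = minimize(lambda z: z[0] ** 2 + z[1] ** 2 + z[2],
                  x0=np.array([0.0, 0.0, 2.0]),  # a feasible starting point
                  method="SLSQP", constraints=primal_cons)

# Dual in its maximization form:
# max -(2a1+a2)^2/4 - (a1+2a2)^2/4 + (a1+a2)  s.t. a >= 0, a1+a2+a3 = 1.
def neg_dual(a):
    a1, a2, _ = a
    return (2 * a1 + a2) ** 2 / 4 + (a1 + 2 * a2) ** 2 / 4 - (a1 + a2)

dual = minimize(neg_dual, x0=np.full(3, 1.0 / 3.0), method="SLSQP",
                constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1}],
                bounds=[(0, None)] * 3)

print(primal.fun, -dual.fun)  # strong duality: both approximately 2/9
```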
30
Outline: Reformulation; SVM Dual (Example, Generalization); Kernels.
31
SVM Learning Problem. min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all i, y, where Ψ(x_i, y_i, y) = Ψ(x_i, y) - Ψ(x_i, y_i). Introduce a multiplier α_i(y) for each constraint, with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.
32
SVM Learning Problem. min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. Σ_y α_i(y) (w^T Ψ(x_i, y_i, y) + Δ(y_i, y)) ≤ ξ_i, obtained by multiplying each constraint by α_i(y) and summing over y, with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.
33
SVM Learning Problem. The lower bound is min_w λ||w||^2 + Σ_i Σ_y α_i(y) (w^T Ψ(x_i, y_i, y) + Δ(y_i, y)), with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i. Setting the derivatives with respect to w to 0 gives w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ): a linear combination of joint feature vectors.
34
SVM Learning Problem. Substituting w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ), the maximum lower bound is max_α -(1/(4λ)) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) + Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ).
35
SVM Learning Problem. Equivalently: min_α (1/(4λ)) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ) and w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ).
36
SVM Dual Problem. min_α (1/(4λ)) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ) and w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ). How do we deal with high-dimensional features?
37
Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results).
38
Prediction. w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ). Define β_i(y) = α_i(y) / (2λ). We consider the multiclass M-SVM here; binary classification is covered in the example sheet.
39
Prediction. w = -Σ_i Σ_y β_i(y) Ψ(x_i, y_i, y).
40
Prediction. w = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y)). Given a test input x, the prediction is y(w) = argmax_ŷ w^T Ψ(x, ŷ).
41
Prediction. w = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y)). We need the score w^T Ψ(x, ŷ).
42
Prediction. w^T Ψ(x, ŷ) = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y))^T Ψ(x, ŷ). We need to compute dot products of features. Let us take a closer look at one product term.
43
Dot Product. Ψ(x_i, y)^T Ψ(x, ŷ). Recall that Ψ(x, 1) = [Φ(x); 0; …; 0], Ψ(x, 2) = [0; Φ(x); …; 0], and so on: Φ(x) occupies the block of the corresponding class. Hence Ψ(x_i, y)^T Ψ(x, ŷ) = 0 if y ≠ ŷ.
44
Dot Product. With the same block structure, Ψ(x_i, y)^T Ψ(x, ŷ) = Φ(x_i)^T Φ(x) if y = ŷ.
45
Dot Product. Ψ(x_i, y)^T Ψ(x, ŷ) = Φ(x_i)^T Φ(x) if y = ŷ. So we do not need the feature vector Φ(·) itself; we only need a function that computes the dot product: a kernel. Isn't that just as expensive as computing the features, i.e. an O(D) operation for D-dimensional features? The next slides show that it need not be.
46
Kernel. For the feature Φ(x) = [x_1, x_2], we can use the kernel k(x, x') = x_1 x'_1 + x_2 x'_2.
47
Kernel. The kernel k(x, x') = x_1^2 (x'_1)^2 + x_2^2 (x'_2)^2 + 2 x_1 x'_1 x_2 x'_2 corresponds to the feature Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2].
48
Kernel. The kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)) corresponds to a feature that is infinite dimensional.
49
Prediction - Summary. y(w) = argmax_ŷ Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y))^T Ψ(x, ŷ). Many of the dot products are 0; compute the non-zero ones using kernels. Compute the score for every possible ŷ and choose the maximum score to make the prediction.
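A sketch of this kernelized prediction in Python. The data structures (training inputs xs, their labels ys, coefficients beta[i][y], the label set, and a kernel function k) are assumed conventions, not fixed by the slides; the key point is that it uses the rule Ψ(x_i, y)^T Ψ(x, ŷ) = k(x_i, x) if y = ŷ and 0 otherwise, so no explicit Φ is ever formed.

```python
def score(x_test, y_hat, xs, ys, beta, labels, k):
    """w^T Psi(x_test, y_hat), written purely in terms of kernel evaluations."""
    s = 0.0
    for i, (x_i, y_i) in enumerate(zip(xs, ys)):
        k_val = k(x_i, x_test)  # the only place the inputs are touched
        for y in labels:
            # Psi(x_i, y_i)^T Psi(x, y_hat) = k(x_i, x) if y_i == y_hat, else 0
            # Psi(x_i, y)^T   Psi(x, y_hat) = k(x_i, x) if y   == y_hat, else 0
            s += beta[i][y] * k_val * ((y_i == y_hat) - (y == y_hat))
    return s

def predict(x_test, xs, ys, beta, labels, k):
    """Choose the label with the maximum score."""
    return max(labels, key=lambda y_hat: score(x_test, y_hat, xs, ys, beta, labels, k))
```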
50
Kernel. Commonly used kernels: Linear, k(x, x') = x^T x'. Polynomial, k(x, x') = (1 + x^T x')^d, whose Φ(·) has all polynomial terms up to degree d. Gaussian or RBF, k(x, x') = exp(-||x - x'||^2 / (2σ^2)), whose Φ(·) is infinite dimensional.
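The three kernels as plain NumPy functions, plus a small check that the explicit map Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2] from the Feature Vector slide reproduces the homogeneous degree-2 kernel (x^T x')^2 (the polynomial kernel (1 + x^T x')^d above additionally contains the lower-degree terms):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, d=2):
    # Corresponding Phi(.) has all polynomial terms up to degree d.
    return (1.0 + x @ xp) ** d

def rbf_kernel(x, xp, sigma=1.0):
    # Gaussian / RBF kernel; the corresponding Phi(.) is infinite dimensional.
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

# Check: the explicit quadratic map matches (x^T x')^2 without ever forming Phi.
phi = lambda x: np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2))  # True
```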
51
Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results).
52
SVM Dual Problem. min_α (1/(4λ)) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ). We need to compute Q, which only requires dot products: the kernel trick.
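Expanding Q with the dot-product rule from the Prediction slides gives Q(i,y,j,ŷ) = k(x_i, x_j) [1{y = ŷ} - 1{y = y_j} - 1{y_i = ŷ} + 1{y_i = y_j}], so each entry needs a single kernel evaluation. A minimal sketch (variable names are assumed conventions):

```python
def Q_entry(i, y, j, y_hat, xs, ys, k):
    """Q(i, y, j, y_hat) = Psi(x_i, y_i, y)^T Psi(x_j, y_j, y_hat),
    computed from one kernel evaluation k(x_i, x_j)."""
    sign = (y == y_hat) - (y == ys[j]) - (ys[i] == y_hat) + (ys[i] == ys[j])
    return k(xs[i], xs[j]) * sign
```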
53
Computational Efficiency. min_α (1/(4λ)) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ). Is this a convex quadratic program? Yes: Q ⪰ 0 for Mercer kernels.
54
Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results).
55
Data. The data is not linearly separable in the original space, so we use an RBF kernel.
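The experiments on the following slides use the course's own code; as a rough stand-in, here is a scikit-learn sketch of the same setup. It is an approximation, not the slides' implementation: SVC is parameterized by C and gamma, which correspond roughly to C ≈ 1/(2λ) and gamma = 1/(2σ^2) for the λ and σ quoted on the Results slides (λ = 0 corresponds to an effectively unbounded C), and make_circles stands in for the unspecified toy data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

sigma, lam = 1.0, 0.01                      # one of the settings from the Results slides
gamma = 1.0 / (2.0 * sigma ** 2)            # sklearn RBF: exp(-gamma * ||x - x'||^2)
C = 1.0 / (2.0 * lam) if lam > 0 else 1e6   # lambda = 0 -> effectively hard margin

clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
print(clf.score(X, y))  # training accuracy
```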
56
Results. σ = 1.0, λ = 0.
57
Results. σ = 1.0, λ = 0.01. Increasing λ increases the margin.
58
Results. σ = 1.0, λ = 0.1.
59
Results. σ = 1.0, λ = 0.
60
Results. σ = 0.25, λ = 0.
61
Results. σ = 0.1, λ = 0. How does σ affect prediction?
62
Results. σ = 0.1, λ = 0. (See the example sheet.)
63
Questions?