Discriminative Machine Learning Topic 3: SVM Duality Slides available online M. Pawan Kumar (Based on Prof. A. Zisserman’s course material)
Linear Classifier: Training loss is 0, so a linear classifier is an appropriate choice. [Figure: linearly separable data in the (x_1, x_2) plane]
Linear Classifier: Training loss is small, so a linear classifier is an appropriate choice. [Figures: data in the (x_1, x_2) plane]
Linear Classifier: Training loss is small, so a linear classifier is an appropriate choice. Training loss is large, so a linear classifier is not an appropriate choice. [Figures: data in the (x_1, x_2) plane]
Linear Classifier: Training loss is small, so a linear classifier is an appropriate choice. Training loss is large, so the feature vector is not appropriate. [Figures: data in the (x_1, x_2) plane]
Feature Vector: We were using Φ(x) = [x_1, x_2]. Instead, let us use Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]. [Figures: data in the original (x_1, x_2) space and in the new feature space]
Feature Vector: Use a D-dimensional feature vector; the parameters will also be D-dimensional. For a large D (D >> n), the data may be linearly separable, giving accurate classification, but there is a large number of parameters to learn, making optimization inefficient. Can we somehow avoid this?
Outline: Reformulation (Examples, SVM Learning Problem); SVM Dual; Kernels
Optimization – Simple Example: min ξ s.t. ξ ≥ 4, ξ ≥ 2
Optimization – Simple Example: min ξ s.t. ξ ≥ 4, ξ ≥ 2. We have to use the maximum lower bound, ξ = 4 ✔. Let us make this a bit more abstract.
Constrained Optimization - Example: min ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2. The optimal value is max{w_1 + w_2, 2w_1 + 3w_2} ✔. We have to use the maximum lower bound. Let us consider the other direction.
Unconstrained Optimization - Example: min f(w_1, w_2) + max{w_1 + w_2, 2w_1 + 3w_2}
Unconstrained Optimization - Example: min f(w_1, w_2) + ξ s.t. ξ ≥ w_1 + w_2, ξ ≥ 2w_1 + 3w_2. This is an equivalent constrained optimization problem. We will call ξ a slack variable. Next, reformulate the SVM learning problem.
Outline: Reformulation (Examples, SVM Learning Problem); SVM Dual; Kernels
SVM Learning Problem: min_w λ||w||^2 + Σ_i [ max_y { w^T Ψ(x_i, y) + Δ(y_i, y) } - w^T Ψ(x_i, y_i) ]
SVM Learning Problem: min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y) + Δ(y_i, y) - w^T Ψ(x_i, y_i) ≤ ξ_i for all y (slight abuse of notation)
SVM Learning Problem: min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all y. Slight abuse of notation: Ψ(x_i, y_i, y) = Ψ(x_i, y) - Ψ(x_i, y_i). This is a convex quadratic program.
Convex Quadratic Program: min_z z^T Q z + z^T q + C s.t. a_i^T z ≤ b_i, i = 1,…,m, with Q ⪰ 0. Many efficient solvers exist, but we already know how to optimize. The reformulation allows us to write down the dual.
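As an aside, problems in this standard form can be handed to an off-the-shelf solver. Below is a minimal sketch using the cvxpy library; the matrices Q, q, A (whose rows are the a_i) and b are made-up toy values, not quantities from the slides.

```python
import numpy as np
import cvxpy as cp

# Toy instance of  min_z  z^T Q z + q^T z + C  s.t.  A z <= b  (Q must be PSD).
Q = np.array([[2.0, 0.0], [0.0, 2.0]])   # positive semidefinite
q = np.array([1.0, 1.0])
C = 0.0
A = np.array([[1.0, 1.0], [-1.0, 2.0]])
b = np.array([1.0, 2.0])

z = cp.Variable(2)
objective = cp.Minimize(cp.quad_form(z, Q) + q @ z + C)
constraints = [A @ z <= b]
problem = cp.Problem(objective, constraints)
problem.solve()

print("optimal z:", z.value, "optimal value:", problem.value)
```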
Outline: Reformulation; SVM Dual (Example, Generalization); Kernels
Example: min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Since 0 ≤ ξ, (w_1^2 + w_2^2) + 0 ≤ (w_1^2 + w_2^2) + ξ: a lower bound on the objective.
Example: min_w (w_1^2 + w_2^2) + 0 is a lower bound on the objective. Set derivatives with respect to w to 0: w_1 = 0, w_2 = 0, giving a lower bound of 0.
Example: min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Multiply each constraint by 1/3 and add: w_1 + w_2 + 2/3 ≤ ξ, giving a lower bound on the objective.
Example: min_w (w_1^2 + w_2^2) + w_1 + w_2 + 2/3 is a lower bound on the objective. Set derivatives with respect to w to 0: w_1 = -1/2, w_2 = -1/2, giving a lower bound of 1/6. How can I find the maximum lower bound?
Example: min_{w,ξ} (w_1^2 + w_2^2) + ξ s.t. 2w_1 + w_2 + 1 ≤ ξ, w_1 + 2w_2 + 1 ≤ ξ, 0 ≤ ξ. Multiply the constraints by α_1, α_2, α_3 respectively, with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1, and add: (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ (α_1 + α_2 + α_3)ξ
Example: since α_1 + α_2 + α_3 = 1, the combined constraint simplifies to (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2) ≤ ξ
Example: min_w (w_1^2 + w_2^2) + (2α_1 + α_2)w_1 + (α_1 + 2α_2)w_2 + (α_1 + α_2), with α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1, is a lower bound on the objective. Set derivatives with respect to w to 0: w_1 = -(2α_1 + α_2)/2, w_2 = -(α_1 + 2α_2)/2
Example: Substituting w_1 = -(2α_1 + α_2)/2, w_2 = -(α_1 + 2α_2)/2 (derivatives with respect to w set to 0), the maximum lower bound is found by max_α -(2α_1 + α_2)^2/4 - (α_1 + 2α_2)^2/4 + (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1
Example, Dual problem: min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1. This is a convex quadratic program. Weak duality: the value of the dual for any feasible α ≤ the value of the primal for any feasible w.
Example, Dual problem: min_α (2α_1 + α_2)^2/4 + (α_1 + 2α_2)^2/4 - (α_1 + α_2) s.t. α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0, α_1 + α_2 + α_3 = 1. This is a convex quadratic program. Strong duality: the value of the dual for the optimal α = the value of the primal for the optimal w.
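A quick numerical sanity check of strong duality on this example; a sketch using scipy, with the constraint constants as reconstructed above, and the 2/9 value in the comments being my own calculation rather than a figure from the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Primal:  min_{w, xi}  w1^2 + w2^2 + xi
#          s.t. 2*w1 + w2 + 1 <= xi,  w1 + 2*w2 + 1 <= xi,  0 <= xi
def primal_obj(v):
    w1, w2, xi = v
    return w1 ** 2 + w2 ** 2 + xi

primal_cons = [
    {"type": "ineq", "fun": lambda v: v[2] - (2 * v[0] + v[1] + 1)},
    {"type": "ineq", "fun": lambda v: v[2] - (v[0] + 2 * v[1] + 1)},
    {"type": "ineq", "fun": lambda v: v[2]},
]
primal = minimize(primal_obj, x0=[0.0, 0.0, 1.0], constraints=primal_cons)

# Dual (maximisation form):  max_alpha  -(2a1+a2)^2/4 - (a1+2a2)^2/4 + (a1+a2)
#                            s.t. alpha >= 0, alpha1 + alpha2 + alpha3 = 1
def neg_dual_obj(a):
    a1, a2, _ = a
    return (2 * a1 + a2) ** 2 / 4 + (a1 + 2 * a2) ** 2 / 4 - (a1 + a2)

dual_cons = [{"type": "eq", "fun": lambda a: a.sum() - 1}]
dual = minimize(neg_dual_obj, x0=[1 / 3, 1 / 3, 1 / 3],
                bounds=[(0, None)] * 3, constraints=dual_cons)

print("primal optimum:", primal.fun)   # expect ~2/9 = 0.2222
print("dual optimum:  ", -dual.fun)    # expect ~2/9 = 0.2222 (strong duality)
```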
Outline: Reformulation; SVM Dual (Example, Generalization); Kernels
SVM Learning Problem: min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all i, y, where Ψ(x_i, y_i, y) = Ψ(x_i, y) - Ψ(x_i, y_i). Introduce multipliers α_i(y) with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.
SVM Learning Problem: min_{w,ξ} λ||w||^2 + Σ_i ξ_i s.t. w^T Ψ(x_i, y_i, y) + Δ(y_i, y) ≤ ξ_i for all y. Multiplying the constraints by α_i(y) and adding gives Σ_y α_i(y) (w^T Ψ(x_i, y_i, y) + Δ(y_i, y)) ≤ ξ_i, where α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i.
SVM Learning Problem: min_w λ||w||^2 + Σ_i Σ_y α_i(y) (w^T Ψ(x_i, y_i, y) + Δ(y_i, y)), with α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, is a lower bound on the objective. Set derivatives with respect to w to 0: w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ), a linear combination of joint feature vectors.
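Spelling out the stationarity step referred to above (no new content, just the gradient of the bound set to zero):

```latex
\nabla_w \Big[ \lambda \|w\|^2 + \sum_i \sum_y \alpha_i(y)\,\big(w^T \Psi(x_i, y_i, y) + \Delta(y_i, y)\big) \Big]
= 2\lambda w + \sum_i \sum_y \alpha_i(y)\,\Psi(x_i, y_i, y) = 0
\quad\Rightarrow\quad
w = -\frac{1}{2\lambda} \sum_i \sum_y \alpha_i(y)\,\Psi(x_i, y_i, y)
```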
SVM Learning Problem: substituting w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ) gives the maximum lower bound: max_α -(1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) + Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ)
SVM Learning Problem: equivalently, with w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ), min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ)
SVM Dual Problem: min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ) and w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ). How do we deal with high dimensional features?
Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results)
Prediction: w = -Σ_i Σ_y α_i(y) Ψ(x_i, y_i, y) / (2λ). Define β_i(y) = α_i(y) / (2λ). We consider the M-SVM here; binary classification is in the example sheet.
Prediction: w = -Σ_i Σ_y β_i(y) Ψ(x_i, y_i, y)
Prediction: w = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y)). Given a test input x, y(w) = argmax_ŷ w^T Ψ(x, ŷ)
Prediction: with w = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y)), consider the score w^T Ψ(x, ŷ)
Prediction: w^T Ψ(x, ŷ) = Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y))^T Ψ(x, ŷ). We need to compute dot products of features. Let us take a closer look at one product term.
Dot Product: Ψ(x_i, y)^T Ψ(x, ŷ). Recall Ψ(x, 1) = [Φ(x); 0; …], Ψ(x, 2) = [0; Φ(x); …], and so on. Hence Ψ(x_i, y)^T Ψ(x, ŷ) = 0 if y ≠ ŷ
Dot Product: Ψ(x_i, y)^T Ψ(x, ŷ) = Φ(x_i)^T Φ(x) if y = ŷ
Dot Product: Ψ(x_i, y)^T Ψ(x, ŷ) = Φ(x_i)^T Φ(x) if y = ŷ. We do not need the feature vector Φ(·); we only need a function that computes the dot product: a kernel. Isn't that as expensive as feature computation? Computing the dot product explicitly is an O(D) operation for D-dimensional features.
Kernel: We can use the kernel k(x, x') = x_1 x'_1 + x_2 x'_2. Corresponding feature? Φ(x) = [x_1, x_2]. [Figure: data in the (x_1, x_2) plane]
Kernel: k(x, x') = x_1^2 (x'_1)^2 + x_2^2 (x'_2)^2 + 2 x_1 x'_1 x_2 x'_2. Corresponding feature? Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2]. [Figure: data in the (x_1, x_2) plane]
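A small numerical check that this kernel and the explicit feature map agree, i.e. Φ(x)^T Φ(x') = (x^T x')^2; the function names below are my own, not from the slides.

```python
import numpy as np

# Explicit quadratic feature map phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]
def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

# Kernel k(x, x') = (x^T x')^2, which expands to the expression on the slide
def k(x, x_prime):
    return (x @ x_prime) ** 2

x, x_prime = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(x_prime))   # approximately 1.0
print(k(x, x_prime))           # 1.0
```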
Kernel: k(x, x') = exp(-||x - x'||^2 / 2σ^2). Corresponding feature? Infinite dimensional. [Figure: data in the (x_1, x_2) plane]
Prediction - Summary: y(w) = argmax_ŷ Σ_i Σ_y β_i(y) (Ψ(x_i, y_i) - Ψ(x_i, y))^T Ψ(x, ŷ). Many dot products are 0; compute the non-zero dot products using kernels. Compute the score for every possible ŷ and choose the maximum score to make a prediction.
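For concreteness, a minimal sketch of this kernelized prediction rule, assuming the block joint feature Ψ(x, y) from the Dot Product slides (Φ(x) placed in block y). The training data and β coefficients below are made-up placeholders; in practice β comes from the dual solution.

```python
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

def predict(x, X_train, y_train, beta, n_classes, kernel=rbf_kernel):
    # score(y_hat) = sum_i sum_y beta[i, y] * (Psi(x_i, y_i) - Psi(x_i, y))^T Psi(x, y_hat)
    #             = sum_i sum_y beta[i, y] * ([y_i == y_hat] - [y == y_hat]) * k(x_i, x)
    scores = np.zeros(n_classes)
    for i, (x_i, y_i) in enumerate(zip(X_train, y_train)):
        k_i = kernel(x_i, x)          # the only feature computation we need
        for y_hat in range(n_classes):
            for y in range(n_classes):
                scores[y_hat] += beta[i, y] * ((y_i == y_hat) - (y == y_hat)) * k_i
    return int(np.argmax(scores))

# Toy usage with made-up data and coefficients.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 2))
y_train = np.array([0, 1, 2, 1, 0])
beta = rng.random((5, 3))
print(predict(np.array([0.1, -0.2]), X_train, y_train, beta, n_classes=3))
```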
Kernel: Commonly used kernels: Linear, k(x, x') = x^T x'; Polynomial, k(x, x') = (1 + x^T x')^d, where Φ(·) has all polynomial terms up to degree d; Gaussian or RBF, k(x, x') = exp(-||x - x'||^2 / 2σ^2), where Φ(·) is infinite dimensional.
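The same three kernels written as plain functions; a sketch, with the parameter defaults d = 2 and σ = 1.0 being arbitrary choices rather than values from the slides.

```python
import numpy as np

def linear_kernel(x, x_prime):
    return x @ x_prime

def polynomial_kernel(x, x_prime, d=2):
    return (1 + x @ x_prime) ** d

def rbf_kernel(x, x_prime, sigma=1.0):
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

x, x_prime = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, x_prime), polynomial_kernel(x, x_prime), rbf_kernel(x, x_prime))
```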
Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results)
SVM Dual Problem: min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ). We need to compute Q, which only requires dot products: the kernel trick.
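Under the same block joint feature assumption as in the prediction slides, each entry of Q expands into kernel evaluations only; a sketch, with my own function and variable names.

```python
import numpy as np

# Q(i, y, j, y_hat) = Psi(x_i, y_i, y)^T Psi(x_j, y_j, y_hat), where
# Psi(x_i, y_i, y) = Psi(x_i, y) - Psi(x_i, y_i) and Psi(x, y) puts Phi(x) in block y:
#   Q = k(x_i, x_j) * ( [y == y_hat] - [y == y_j] - [y_i == y_hat] + [y_i == y_j] )
def rbf_kernel(x, x_prime, sigma=1.0):
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

def Q_entry(x_i, y_i, y, x_j, y_j, y_hat, kernel=rbf_kernel):
    agreement = (y == y_hat) - (y == y_j) - (y_i == y_hat) + (y_i == y_j)
    return kernel(x_i, x_j) * agreement

# Toy usage with made-up inputs.
print(Q_entry(np.array([0.0, 1.0]), 0, 2, np.array([1.0, 1.0]), 1, 2))
```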
Computational Efficiency: min_α (1/4λ) Σ_{i,j} Σ_{y,ŷ} α_i(y) α_j(ŷ) Q(i,y,j,ŷ) - Σ_i Σ_y α_i(y) Δ(y_i, y) s.t. α_i(y) ≥ 0 for all i, y and Σ_y α_i(y) = 1 for all i, where Q(i,y,j,ŷ) = Ψ(x_i, y_i, y)^T Ψ(x_j, y_j, ŷ). Is this a convex quadratic program? Yes: Q ⪰ 0 for Mercer kernels.
Outline: Reformulation; SVM Dual; Kernels (Prediction, Learning, Results)
Data: Not linearly separable in the original space. Use an RBF kernel.
Results: σ = 1.0, λ = 0
Results: σ = 1.0, λ = 0.01. Increasing λ increases the margin.
Results: σ = 1.0, λ = 0.1
Results: σ = 1.0, λ = 0
Results: σ = 0.25, λ = 0
Results: σ = 0.1, λ = 0. How does σ affect prediction?
Results: σ = 0.1, λ = 0. (Example sheet.)
Questions?