Geometrical intuition behind the dual problem
Based on: K. P. Bennett, E. J. Bredensteiner, "Duality and Geometry in SVM Classifiers", Proceedings of the International Conference on Machine Learning, 2000
Geometrical intuition behind the dual problem
Convex hulls for two sets of points A and B.
[Figure: the point sets A and B in the $(x_1, x_2)$ plane, each enclosed by its convex hull]
Geometrical intuition behind the dual problem
The convex hulls of A and B consist of all possible points $P_A$ and $P_B$ of the form
$$P_A = \sum_{i \in A} \lambda_i \mathbf{x}_i, \qquad \sum_{i \in A} \lambda_i = 1, \quad \lambda_i \ge 0,\ i \in A,$$
$$P_B = \sum_{i \in B} \lambda_i \mathbf{x}_i, \qquad \sum_{i \in B} \lambda_i = 1, \quad \lambda_i \ge 0,\ i \in B.$$
[Figure: points $P_A$ and $P_B$ inside the convex hulls of A and B in the $(x_1, x_2)$ plane]
Geometrical intuition behind the dual problem
For the plane separating A and B we choose the bisector of the segment joining the specific $P_A$ and $P_B$ that are at minimal distance $\|P_A - P_B\|^2$.
[Figure: the closest pair $P_A$, $P_B$ between the two hulls. Here $P_A$ has a single non-zero coefficient (it is a vertex of the hull of A), while $P_B$ has two non-zero coefficients (it lies on an edge of the hull of B). The convex-combination constraints on $\lambda$ are the same as on the previous slide.]
Geometrical intuition behind the dual problem
Writing the minimal distance in terms of the convex-combination weights, with labels $y_i = +1$ for $i \in A$ and $y_j = -1$ for $j \in B$:
$$\min \|P_A - P_B\|^2 = \min_{\boldsymbol{\lambda}} \Big\| \sum_{i \in A} \lambda_i \mathbf{x}_i - \sum_{j \in B} \lambda_j \mathbf{x}_j \Big\|^2 = \min_{\boldsymbol{\lambda}} \Big\| \sum_{i \in A} \lambda_i y_i \mathbf{x}_i + \sum_{j \in B} \lambda_j y_j \mathbf{x}_j \Big\|^2 = \min_{\boldsymbol{\lambda}} \sum_{i,j} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$$
subject to $\sum_{i \in A} \lambda_i = 1$, $\sum_{i \in B} \lambda_i = 1$, and $\lambda_i \ge 0$.
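A minimal Python sketch of this picture (assuming NumPy and SciPy are available; the toy points and variable names are purely illustrative): find the closest pair of points between the two hulls by minimizing $\|P_A - P_B\|^2$ over the convex-combination weights.

```python
# Closest pair of points between the convex hulls of two toy 2-D point sets,
# found by minimizing ||P_A - P_B||^2 over convex-combination weights
# (lambda_i >= 0, summing to 1 within each set).
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 1.0]])   # toy points of class A
B = np.array([[4.0, 0.5], [5.0, 1.5], [4.5, 2.5]])   # toy points of class B
nA, nB = len(A), len(B)

def sq_distance(lam):
    diff = lam[:nA] @ A - lam[nA:] @ B               # P_A - P_B
    return diff @ diff                               # ||P_A - P_B||^2

constraints = [
    {"type": "eq", "fun": lambda lam: lam[:nA].sum() - 1.0},   # sum of lambda over A = 1
    {"type": "eq", "fun": lambda lam: lam[nA:].sum() - 1.0},   # sum of lambda over B = 1
]
bounds = [(0.0, None)] * (nA + nB)                   # lambda_i >= 0

lam0 = np.concatenate([np.full(nA, 1.0 / nA), np.full(nB, 1.0 / nB)])
res = minimize(sq_distance, lam0, bounds=bounds, constraints=constraints, method="SLSQP")

P_A, P_B = res.x[:nA] @ A, res.x[nA:] @ B
print("P_A =", P_A, "P_B =", P_B, "distance =", np.linalg.norm(P_A - P_B))
```

The optimal weights here play the same role as the dual variables λ in the SVM dual derived below.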
Going from the Primal to the Dual
Constrained optimization problem:
$$\min_{\mathbf{w}} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, m$$
where $y_i$ is the label and $\mathbf{x}_i$ the input.
This is a convex optimization problem (convex objective and constraints), with a unique solution if the datapoints are linearly separable.
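For intuition only, a small sketch that feeds this primal problem to a generic constrained solver on a tiny linearly separable toy set (assuming SciPy's SLSQP; the data and names are illustrative):

```python
# Hard-margin primal SVM, min 0.5 * ||w||^2 s.t. y_i (w . x_i + b) >= 1,
# solved numerically on a tiny linearly separable toy set.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 0.5], [5.0, 1.5]])  # toy inputs
y = np.array([+1.0, +1.0, -1.0, -1.0])                          # toy labels

def objective(params):
    w = params[:2]
    return 0.5 * (w @ w)                      # 1/2 ||w||^2

def margins(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0              # each entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margins}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margins =", margins(res.x))
```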
Going from the Primal to the Dual
Lagrangian of the constrained problem:
$$L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{m} \lambda_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \big]$$
Going from the Primal to the Dual
KKT stationarity conditions:
$$\nabla_{\mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\lambda}) = \mathbf{w} - \sum_{i=1}^{m} \lambda_i y_i \mathbf{x}_i = 0, \qquad \nabla_{b} L(\mathbf{w}, b, \boldsymbol{\lambda}) = -\sum_{i=1}^{m} \lambda_i y_i = 0$$
Going from the Primal to the Dual
Hence
$$\mathbf{w} = \sum_{i=1}^{m} \lambda_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^{m} \lambda_i y_i = 0,$$
plus the complementary-slackness KKT condition
$$\lambda_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \big] = 0, \quad i = 1, \dots, m.$$
Going from the Primal to the Dual
Substituting $\mathbf{w} = \sum_{i=1}^{m} \lambda_i y_i \mathbf{x}_i$ and $\sum_{i=1}^{m} \lambda_i y_i = 0$ back into the Lagrangian:
$$L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\Big\| \sum_{i=1}^{m} \lambda_i y_i \mathbf{x}_i \Big\|^2 - \sum_{i,j=1}^{m} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j - \sum_{i=1}^{m} \lambda_i y_i b + \sum_{i=1}^{m} \lambda_i = -\frac{1}{2}\sum_{i,j=1}^{m} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j + \sum_{i=1}^{m} \lambda_i$$
Going from the Primal to the Dual
Equivalent dual problem:
$$\max_{\boldsymbol{\lambda}} \ -\frac{1}{2}\sum_{i,j=1}^{m} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j + \sum_{i=1}^{m} \lambda_i \quad \text{s.t.} \quad \lambda_i \ge 0,\ i = 1, \dots, m, \quad \sum_{i=1}^{m} \lambda_i y_i = 0$$
Solution to the Dual Problem
Equivalent dual problem:
$$\max_{\boldsymbol{\lambda}} \ -\frac{1}{2}\sum_{i,j=1}^{m} \lambda_i \lambda_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j + \sum_{i=1}^{m} \lambda_i \quad \text{s.t.} \quad \lambda_i \ge 0,\ i = 1, \dots, m, \quad \sum_{i=1}^{m} \lambda_i y_i = 0$$
It admits the following solution:
$$h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{m} \lambda_i y_i\, \mathbf{x}_i \cdot \mathbf{x} + b \Big), \qquad b = y_i - \sum_{j=1}^{m} \lambda_j y_j\, \mathbf{x}_j \cdot \mathbf{x}_i \ \text{ for any } i \text{ with } \lambda_i > 0.$$
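A matching sketch of the dual route (same assumptions: NumPy/SciPy, illustrative toy data): solve the dual QP for λ, then recover w, b, and the decision function from the support vectors, i.e. the points with λ_i > 0.

```python
# Dual route on the same kind of toy data: solve the dual QP for lambda,
# then recover w, b and the decision function from the support vectors.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 0.5], [5.0, 1.5]])  # toy inputs
y = np.array([+1.0, +1.0, -1.0, -1.0])                          # toy labels
m = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i . x_j

def neg_dual(lam):
    # negated dual objective, so that minimizing it maximizes the dual
    return 0.5 * lam @ Q @ lam - lam.sum()

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=[(0.0, None)] * m,                                  # lambda_i >= 0
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])  # sum_i lambda_i y_i = 0
lam = res.x

w = (lam * y) @ X                                # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                                  # support vectors: lambda_i > 0
b = np.mean(y[sv] - X[sv] @ w)                   # b = y_i - sum_j lambda_j y_j x_j . x_i

def h(x):
    return np.sign(x @ w + b)                    # decision function

print("lambda =", lam, "w =", w, "b =", b, "predictions =", h(X))
```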
What are Support Vector Machines?
- Linear classifiers
- (Mostly) binary classifiers
- Supervised training
- Good generalization with explicit bounds
Main Ideas Behind Support Vector Machines
- Maximal margin
- Dual space
- Linear classifiers in a high-dimensional space via a non-linear mapping
- Kernel trick
Quadratic Programming
Using the Lagrangian
Dual Space
Dual Form
Non-linear separation of datasets
Linear separation is impossible in most problems.
[Illustration from Prof. Mohri's lecture notes]
Non-separable datasets
Solutions:
1) Non-linear classifiers
2) Increase the dimensionality of the dataset and add a non-linear mapping Φ (see the sketch below)
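A small sketch of option 2 (assuming NumPy; the ring-shaped toy data and the particular map Φ are illustrative): a quadratic map makes a radially separated dataset linearly separable in the higher-dimensional space.

```python
# Two concentric rings in 2-D are not linearly separable, but the quadratic
# map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) sends them to 3-D where a plane
# separates them (the squared radius becomes a linear feature).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 50)
inner = np.c_[np.cos(theta), np.sin(theta)]               # class +1, radius 1
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]   # class -1, radius 3

def phi(X):
    # explicit degree-2 feature map (no constant or linear terms, for brevity)
    return np.c_[X[:, 0] ** 2, np.sqrt(2.0) * X[:, 0] * X[:, 1], X[:, 1] ** 2]

# In the mapped space the plane with normal (1, 0, 1) and offset 5 separates
# the rings, since x1^2 + x2^2 is 1 on the inner ring and 9 on the outer one.
Z = phi(np.vstack([inner, outer]))
scores = Z @ np.array([1.0, 0.0, 1.0]) - 5.0
print("inner all negative:", (scores[:50] < 0).all(),
      "outer all positive:", (scores[50:] > 0).all())
```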
Kernel Trick
The kernel acts as a "similarity measure" between two data samples.
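A tiny illustration of this idea (assuming NumPy; the degree-2 polynomial kernel is used only as an example): the kernel value equals the inner product of the mapped points, without ever forming Φ(x) explicitly.

```python
# The kernel returns the inner product in the mapped space without computing
# Phi explicitly: for the quadratic map above, k(x, z) = (x . z)^2.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(phi(x) @ phi(z), poly2_kernel(x, z))   # both print 2.25
```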
Kernel Trick Illustrated
Curse of Dimensionality Due to the Non-Linear Mapping
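To make the blow-up concrete (a quick Python sketch; the specific dimensions and degrees are arbitrary examples): the explicit polynomial feature space of degree up to d in n input dimensions has C(n + d, d) coordinates.

```python
# Number of monomial features of degree <= d in n input dimensions is
# C(n + d, d): the explicit mapped space quickly becomes enormous, which is
# exactly what the kernel trick lets us avoid materializing.
from math import comb

for n in (10, 100, 1000):
    for d in (2, 3, 5):
        print(f"n = {n:4d}, degree <= {d}: {comb(n + d, d):,} features")
```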
Positive Semi-Definite (P.S.D.) Kernels (Mercer Condition)
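Mercer's condition requires the Gram matrix K with K_ij = k(x_i, x_j) to be symmetric positive semi-definite for every finite set of points. A quick numerical check (NumPy sketch on random illustrative data):

```python
# Numerical check of the Mercer / PSD condition: the Gram matrix of a valid
# kernel (here a Gaussian kernel) has no negative eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                       # 20 random 3-D points

def gaussian_gram(X, gamma=0.5):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    return np.exp(-gamma * sq)

K = gaussian_gram(X)
eigvals = np.linalg.eigvalsh(K)                    # K is symmetric
print("smallest eigenvalue:", eigvals.min(), "PSD:", eigvals.min() >= -1e-10)
```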
Advantages of SVM
- Work very well...
- Error bounds are easy to obtain: generalization error is small and predictable
- Fool-proof method: (mostly) three kernels to choose from: Gaussian; linear and polynomial; sigmoid (see the sketch below)
- Very small number of parameters to optimize
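A short sketch of how few choices this leaves in practice (illustrative toy dataset; scikit-learn's kernel names are used, where "rbf" is the Gaussian kernel):

```python
# Trying the standard kernels with scikit-learn's SVC on a small toy problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ["linear", "poly", "rbf", "sigmoid"]:   # "rbf" is the Gaussian kernel
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:8s} cross-validated accuracy = {score:.2f}")
```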
Limitations of SVM
Size limitation: the size of the kernel matrix grows quadratically with the number of training vectors (see the sketch below).
Speed limitations:
1) During training: a very large quadratic programming problem has to be solved numerically.
   Solutions:
   - Chunking
   - Sequential Minimal Optimization (SMO): breaks the QP problem into many small QP problems that are solved analytically
   - Hardware implementations
2) During testing: the cost scales with the number of support vectors.
   Solution: Online SVM
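To put the size limitation in numbers (a quick back-of-the-envelope sketch; the training-set sizes are arbitrary examples):

```python
# The dense kernel (Gram) matrix holds m^2 float64 entries, so its memory
# footprint grows quadratically with the number of training vectors m.
for m in (1_000, 100_000, 1_000_000):
    gib = m * m * 8 / 2**30
    print(f"m = {m:>9,d}  ->  kernel matrix of about {gib:,.2f} GiB")
```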