Support Vector Machine I Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019
Administrative Please use Piazza, not email. HW 0 grades are back; re-grade requests accepted for one week. HW 1 due soon. HW 2 on the way.
Regularized logistic regression (Figure: overfit decision boundary over Tumor Size and Age.) Cost function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log\left(1-h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$, with $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ Slide credit: Andrew Ng
Gradient descent (Regularized) Repeat { $\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$ ; $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$ for $j = 1, \ldots, n$ } where $h_\theta(x) = \frac{1}{1+e^{-\theta^\top x}}$ and the bracketed term is $\frac{\partial}{\partial\theta_j} J(\theta)$. Slide credit: Andrew Ng
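A minimal NumPy sketch of this update (not from the lecture); it assumes a design matrix X whose first column is all ones and, as in the update above, leaves $\theta_0$ unregularized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for logistic regression.
    X: (m, n+1) with X[:, 0] == 1; y: (m,) in {0, 1}; theta: (n+1,)."""
    m = X.shape[0]
    error = sigmoid(X @ theta) - y          # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m                # (1/m) * sum_i error_i * x_j^(i)
    grad[1:] += (lam / m) * theta[1:]       # regularize theta_1..theta_n only
    return theta - alpha * grad
```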
$\|\theta\|_1$: Lasso regularization $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$ LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding $\min_{\theta}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda\,|\theta|$ Solution: $\hat\theta = \begin{cases} \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle - \lambda & \text{if } \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle > \lambda \\ 0 & \text{if } \frac{1}{m}\left|\langle \boldsymbol{x}, \boldsymbol{y}\rangle\right| \le \lambda \\ \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle + \lambda & \text{if } \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle < -\lambda \end{cases}$ Equivalently, $\hat\theta = S_\lambda\!\left(\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle\right)$, where $S_\lambda(x) = \mathrm{sign}(x)\,(|x| - \lambda)_+$ is the soft-thresholding operator.
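A short NumPy sketch of the soft-thresholding operator and the resulting single-predictor lasso solution (not from the lecture); it assumes the predictor has been standardized so that $\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{x}\rangle = 1$.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Closed-form lasso solution for one standardized predictor."""
    m = len(y)
    return soft_threshold(np.dot(x, y) / m, lam)
```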
Multiple predictors: Cyclic Coordinate Descent $\min_{\theta}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j + \sum_{k\ne j} x_k^{(i)}\theta_k - y^{(i)}\right)^2 + \lambda\sum_{k\ne j}|\theta_k| + \lambda\,|\theta_j|$ For each $j$, update $\theta_j$ by solving $\min_{\theta_j}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j - r_j^{(i)}\right)^2 + \lambda\,|\theta_j|$ where the partial residual is $r_j^{(i)} = y^{(i)} - \sum_{k\ne j} x_k^{(i)}\theta_k$.
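A sketch of this cyclic update in NumPy (not from the lecture); it assumes each column of X is standardized so that $\frac{1}{m}\langle x_j, x_j\rangle = 1$, which makes the per-coordinate minimizer exactly a soft-thresholded inner product.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the lasso objective above."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_sweeps):
        for j in range(n):
            # Partial residual r_j^(i): remove every coordinate except j.
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_j / m, lam)
    return theta
```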
L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology Regularization function | Name | Solver $\|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$ | Tikhonov regularization (Ridge regression) | Closed form $\|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$ | LASSO regression | Proximal gradient descent, least angle regression $\alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent
Things to remember Overfitting: a complex model does well on the training set but performs poorly on the test set. Cost function: norm penalty (L1 or L2). Regularized linear regression: gradient descent with weight decay (L2 norm). Regularized logistic regression.
Support Vector Machine Cost function Large margin classification Kernels Using an SVM
Support Vector Machine Cost function Large margin classification Kernels Using an SVM
Logistic regression $h_\theta(x) = g(\theta^\top x)$, where $g(z) = \frac{1}{1+e^{-z}}$ and $z = \theta^\top x$. Suppose we predict “y = 1” if $h_\theta(x) \ge 0.5$, i.e., $z = \theta^\top x \ge 0$; predict “y = 0” if $h_\theta(x) < 0.5$, i.e., $z = \theta^\top x < 0$. Slide credit: Andrew Ng
Alternative view $h_\theta(x) = g(\theta^\top x)$, $g(z) = \frac{1}{1+e^{-z}}$, $z = \theta^\top x$. If “y = 1”, we want $h_\theta(x) \approx 1$, i.e., $z = \theta^\top x \gg 0$. If “y = 0”, we want $h_\theta(x) \approx 0$, i.e., $z = \theta^\top x \ll 0$. Slide credit: Andrew Ng
Cost function for Logistic Regression $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$ (Plots: the two cost curves against $h_\theta(x) \in [0, 1]$, one for $y = 1$ and one for $y = 0$.) Slide credit: Andrew Ng
Alternative view of logistic regression $\text{Cost}(h_\theta(x), y) = -y \log h_\theta(x) - (1-y)\log\left(1 - h_\theta(x)\right) = y\left(-\log \frac{1}{1+e^{-\theta^\top x}}\right) + (1-y)\left(-\log\left(1 - \frac{1}{1+e^{-\theta^\top x}}\right)\right)$ (Plots: $-\log\frac{1}{1+e^{-\theta^\top x}}$ for $y = 1$ and $-\log\left(1-\frac{1}{1+e^{-\theta^\top x}}\right)$ for $y = 0$, against $\theta^\top x$.)
Logistic regression (logistic loss): $\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log\left(1-h_\theta(x^{(i)})\right)\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ Support vector machine (hinge loss): $\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ (Plots: $\text{cost}_1$ and $\text{cost}_0$ are piecewise-linear approximations of the two logistic-loss curves, plotted against $\theta^\top x$.)
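For concreteness, here is one common way to write the two surrogates in code (a sketch, not necessarily the lecture's exact definition of $\text{cost}_1$/$\text{cost}_0$): the hinge pieces are zero beyond the margin, and the logistic counterparts are included for comparison.

```python
import numpy as np

def cost1(z):            # used when y = 1; zero once z >= 1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):            # used when y = 0; zero once z <= -1
    return np.maximum(0.0, 1.0 + z)

def logistic_cost1(z):   # -log(1 / (1 + e^{-z})) = -log h_theta(x)
    return np.log1p(np.exp(-z))

def logistic_cost0(z):   # -log(1 - 1 / (1 + e^{-z}))
    return np.log1p(np.exp(z))
```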
Optimization objective for SVM $\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ 1) Multiply through by $m$ (dropping the constant $\frac{1}{m}$ does not change the minimizer). 2) Use $C = \frac{1}{\lambda}$ as the trade-off parameter: $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ Slide credit: Andrew Ng
Hypothesis of SVM $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ Hypothesis: $h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$ Slide credit: Andrew Ng
Support Vector Machine Cost function Large margin classification Kernels Using an SVM
Support vector machine $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ If “y = 1”, we want $\theta^\top x \ge 1$ (not just $\ge 0$). If “y = 0”, we want $\theta^\top x \le -1$ (not just $< 0$). (Plots: $\text{cost}_1(\theta^\top x)$ is zero for $\theta^\top x \ge 1$; $\text{cost}_0(\theta^\top x)$ is zero for $\theta^\top x \le -1$.) Slide credit: Andrew Ng
SVM decision boundary $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ Let's say we have a very large $C$. Then the first term must be driven to zero: whenever $y^{(i)} = 1$: $\theta^\top x^{(i)} \ge 1$; whenever $y^{(i)} = 0$: $\theta^\top x^{(i)} \le -1$. The problem reduces to $\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$, and $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$. Slide credit: Andrew Ng
SVM decision boundary: Linearly separable case (Figure: linearly separable data in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
SVM decision boundary: Linearly separable case (Figure: the SVM decision boundary with its margin in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
Large margin classifier in the presence of outliers (Figure: decision boundaries in the $x_1$–$x_2$ plane for $C$ very large vs. $C$ not too large.) Slide credit: Andrew Ng
Vector inner product $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$, $v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$. $\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2} \in \mathbb{R}$. $p$ = (signed) length of the projection of $v$ onto $u$. Then $u^\top v = p \cdot \|u\| = u_1 v_1 + u_2 v_2$. (Figure: $u$ and $v$ drawn in the plane with the projection $p$.) Slide credit: Andrew Ng
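A quick NumPy check of this identity with made-up vectors:

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)      # ||u|| = sqrt(u_1^2 + u_2^2)
p = np.dot(u, v) / norm_u       # signed length of the projection of v onto u

# u^T v equals both p * ||u|| and u_1*v_1 + u_2*v_2.
assert np.isclose(np.dot(u, v), p * norm_u)
assert np.isclose(np.dot(u, v), u[0] * v[0] + u[1] * v[1])
```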
SVM decision boundary Simplification: $\theta_0 = 0$, $n = 2$. $\frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right) = \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$, subject to $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$ and $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$. What is $\theta^\top x^{(i)}$? By the projection view, $\theta^\top x^{(i)} = p^{(i)}\,\|\theta\| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)}$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$. (Figure: $x^{(i)}$ and $\theta$ in the plane with the projection $p^{(i)}$.) Slide credit: Andrew Ng
SVM decision boundary $\min_\theta\ \frac{1}{2}\|\theta\|^2$ s.t. $p^{(i)}\,\|\theta\| \ge 1$ if $y^{(i)} = 1$, and $p^{(i)}\,\|\theta\| \le -1$ if $y^{(i)} = 0$. Simplification: $\theta_0 = 0$, $n = 2$. If the projections $p^{(1)}, p^{(2)}, \ldots$ are small, then $\|\theta\|$ must be large to satisfy the constraints; if $p^{(1)}, p^{(2)}, \ldots$ are large, $\|\theta\|$ can be small. Minimizing $\|\theta\|$ therefore favors boundaries with large projections, i.e., a large margin. (Figure: two candidate boundaries, one giving small projections $p^{(i)}$ and one giving large projections.) Slide credit: Andrew Ng
Support Vector Machine Cost function Large margin classification Kernels Using an SVM
Non-linear decision boundary Predict $y = 1$ if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$. Equivalently, write the hypothesis as $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\cdots$. Is there a different/better choice of the features $f_1, f_2, f_3, \cdots$? (Figure: a non-linear decision boundary in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
Kernel (Figure: three landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $x_1$–$x_2$ plane.) Given $x$, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$: $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$. Gaussian kernel: $\text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$ Slide credit: Andrew Ng
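A small sketch of the Gaussian similarity in NumPy (the landmark and test points below are made up): a point near the landmark gets a feature close to 1, a distant point gets a feature close to 0.

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma):
    """exp(-||x - l||^2 / (2 * sigma^2))."""
    diff = x - landmark
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

l1 = np.array([3.0, 5.0])
print(gaussian_similarity(np.array([3.1, 4.9]), l1, sigma=1.0))  # ~0.99 (near)
print(gaussian_similarity(np.array([9.0, 0.0]), l1, sigma=1.0))  # ~0    (far)
```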
(Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $x_1$–$x_2$ plane and the resulting decision region.) Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$. Ex: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$, with $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$. Slide credit: Andrew Ng
Choosing the landmarks Given $x$: $f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$. Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$. Where to get $l^{(1)}, l^{(2)}, l^{(3)}, \cdots$? Use $l^{(1)}, l^{(2)}, l^{(3)}, \ldots, l^{(m)}$, one per training example. (Figure: training points in the $x_1$–$x_2$ plane serving as landmarks.) Slide credit: Andrew Ng
SVM with kernels Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, l^{(3)} = x^{(3)}, \cdots, l^{(m)} = x^{(m)}$. Given example $x$: $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $\cdots$, collected into $f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_m \end{bmatrix}$ with $f_0 = 1$. For training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \rightarrow f^{(i)}$. Slide credit: Andrew Ng
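A sketch of this feature mapping (not from the lecture), assuming the Gaussian similarity and an intercept feature $f_0 = 1$:

```python
import numpy as np

def kernel_features(x, landmarks, sigma):
    """Map x to f = [f_0, f_1, ..., f_m], where f_i is the Gaussian
    similarity to landmark l^(i) and f_0 = 1 is the intercept feature."""
    diffs = landmarks - x                    # (m, n)
    sq_dists = np.sum(diffs ** 2, axis=1)    # ||x - l^(i)||^2 for each i
    f = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.concatenate(([1.0], f))

def training_features(X, sigma):
    """Row i is f^(i), the feature vector of x^(i), with landmarks = X."""
    return np.vstack([kernel_features(x, X, sigma) for x in X])
```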
SVM with kernels Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$; predict $y = 1$ if $\theta^\top f \ge 0$. Training (original): $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ Training (with kernel): $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$
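As a sanity check on the kernelized objective, here is a sketch that evaluates it directly (not from the lecture); it uses the hinge surrogates $\text{cost}_1(z) = \max(0, 1-z)$ and $\text{cost}_0(z) = \max(0, 1+z)$ and, as the sum over $j \ge 1$ suggests, leaves $\theta_0$ unpenalized.

```python
import numpy as np

def svm_kernel_objective(theta, F, y, C):
    """C * sum_i [y_i cost1(theta^T f^(i)) + (1 - y_i) cost0(theta^T f^(i))]
       + 0.5 * sum_{j>=1} theta_j^2.
    F: (m, m+1) matrix whose rows are f^(i); y: (m,) in {0, 1}."""
    z = F @ theta
    hinge = y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z)
    return C * hinge.sum() + 0.5 * np.sum(theta[1:] ** 2)
```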
SVM parameters $C = \frac{1}{\lambda}$. Large $C$: lower bias, higher variance. Small $C$: higher bias, lower variance. $\sigma^2$: Large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance. Small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance. Slide credit: Andrew Ng
SVM song Video source: https://www.youtube.com/watch?v=g15bqtyidZs
SVM Demo https://cs.stanford.edu/people/karpathy/svmjs/demo/
Support Vector Machine Cost function Large margin classification Kernels Using an SVM
Using SVM Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$. Need to specify: Choice of parameter $C$. Choice of kernel (similarity function): Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$. Gaussian kernel: $f_i = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$, where $l^{(i)} = x^{(i)}$; need to choose $\sigma^2$; need proper feature scaling. Slide credit: Andrew Ng
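A minimal usage sketch with scikit-learn, whose SVC wraps libsvm (the toy data and hyperparameter values are placeholders, not from the lecture); note that gamma plays the role of $\frac{1}{2\sigma^2}$ and that StandardScaler handles the feature scaling.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data with a non-linear boundary (inside vs. outside a circle).
X = np.random.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

sigma = 1.0
clf = make_pipeline(
    StandardScaler(),                                        # feature scaling
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)),  # Gaussian kernel
)
clf.fit(X, y)
print(clf.predict([[0.1, 0.2], [2.0, 2.0]]))
```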
Kernel (similarity) functions Note: not all similarity functions make valid kernels. Many off-the-shelf kernels are available: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel. Slide credit: Andrew Ng
Multi-class classification (Figure: multiple classes in the $x_1$–$x_2$ plane with one-vs.-all decision boundaries.) Use the one-vs.-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, obtaining $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(K)}$. Pick the class $i$ with the largest $\left(\theta^{(i)}\right)^\top x$. Slide credit: Andrew Ng
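A small sketch of the one-vs.-all decision rule (training of each $\theta^{(i)}$ is assumed to have been done already, e.g., with an SVM package as above):

```python
import numpy as np

def one_vs_all_predict(Theta, x):
    """Theta: (K, n+1) array; row i holds theta^(i). x: (n+1,) with the
    intercept term included. Returns the class with the largest score."""
    scores = Theta @ x            # theta^(i)^T x for each class i
    return int(np.argmax(scores))
```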
Logistic regression vs. SVMs $n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples. If $n$ is large relative to $m$ (e.g., $n = 10{,}000$, $m = 10$–$1{,}000$): use logistic regression or an SVM without a kernel (“linear kernel”). If $n$ is small and $m$ is intermediate (e.g., $n = 1$–$1{,}000$, $m = 10$–$10{,}000$): use an SVM with a Gaussian kernel. If $n$ is small and $m$ is large (e.g., $n = 1$–$1{,}000$, $m = 50{,}000+$): create/add more features, then use logistic regression or a linear SVM. A neural network is likely to work well for most of these cases, but may be slower to train. Slide credit: Andrew Ng
Things to remember Cost function: $\min_\theta\ C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)}) \right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$ Large margin classification (Figure: maximum-margin boundary in the $x_1$–$x_2$ plane.) Kernels Using an SVM