
1 Support Vector Machine I
Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019

2 Administrative
Please use Piazza instead of email.
HW 0 grades are back; regrade requests are open for one week. HW 1 is due soon. HW 2 is on the way.

3 Regularized logistic regression
[Figure: classification data plotted by tumor size and age with a highly non-linear decision boundary]
Cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
$h_\theta(x) = g\!\left(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots\right)$
Slide credit: Andrew Ng

4 Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$  (the bracketed term is $\frac{\partial}{\partial\theta_j}J(\theta)$)
}
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
Slide credit: Andrew Ng
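As a concrete illustration, here is a minimal NumPy sketch of one regularized gradient-descent update, assuming a design matrix X whose first column is all ones and labels y in {0, 1} (the function and variable names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gd_step(theta, X, y, alpha, lam):
    """One gradient-descent update for regularized logistic regression.
    X: (m, n+1) with a leading column of ones; y: (m,) in {0, 1}; theta: (n+1,)."""
    m = X.shape[0]
    h = sigmoid(X @ theta)              # h_theta(x) for every example
    grad = X.T @ (h - y) / m            # unregularized gradient, all coordinates
    grad[1:] += (lam / m) * theta[1:]   # weight decay on theta_1..theta_n, not theta_0
    return theta - alpha * grad
```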

5 $\|\theta\|_1$: Lasso regularization
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator

6 Single predictor: Soft Thresholding
$\underset{\theta}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda|\theta|$
$\hat{\theta} = \begin{cases} \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle - \lambda & \text{if } \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle > \lambda \\ 0 & \text{if } \frac{1}{m}|\langle\boldsymbol{x},\boldsymbol{y}\rangle| \le \lambda \\ \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle + \lambda & \text{if } \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle < -\lambda \end{cases}$
$\hat{\theta} = S_\lambda\!\left(\frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle\right)$, where the soft-thresholding operator is $S_\lambda(x) = \operatorname{sign}(x)\,(|x| - \lambda)_+$
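A small NumPy sketch of the soft-thresholding operator and the resulting single-predictor lasso solution; it assumes the predictor has been standardized so that $\frac{1}{m}\sum_i x_i^2 = 1$, as the closed form above requires (names are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Closed-form lasso solution for one standardized predictor, no intercept:
    theta_hat = S_lambda(<x, y> / m)."""
    m = y.shape[0]
    return soft_threshold(np.dot(x, y) / m, lam)
```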

7 Multiple predictors: cyclic coordinate descent
$\underset{\theta}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j + \sum_{k\neq j} x_k^{(i)}\theta_k - y^{(i)}\right)^2 + \lambda\sum_{k\neq j}|\theta_k| + \lambda|\theta_j|$
For each $j$, update $\theta_j$ by solving
$\underset{\theta_j}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j - r_j^{(i)}\right)^2 + \lambda|\theta_j|$, where $r_j^{(i)} = y^{(i)} - \sum_{k\neq j} x_k^{(i)}\theta_k$ is the partial residual. See the sketch below.
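A minimal sketch of the cyclic coordinate-descent loop, again assuming standardized columns (so each satisfies $\frac{1}{m}\sum_i x_{ij}^2 = 1$) and no intercept; each pass applies the single-predictor soft-thresholding update to one coordinate while holding the others fixed:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_passes=100):
    """Cyclic coordinate descent for the lasso objective above.
    X: (m, n) with standardized columns so (1/m) * sum_i X[i, j]**2 == 1."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_passes):
        for j in range(n):
            # Partial residual: what remains of y after the other coordinates explain it.
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(np.dot(X[:, j], r_j) / m, lam)
    return theta
```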

8 L1 and L2 balls
[Figure: the L1 (diamond-shaped) and L2 (circular) norm balls]

9 Terminology
Regularization function | Name | Solver
$\|\theta\|_2^2 = \sum_{j=1}^{n}\theta_j^2$ | Tikhonov regularization (ridge regression) | Closed form
$\|\theta\|_1 = \sum_{j=1}^{n}|\theta_j|$ | LASSO regression | Proximal gradient descent, least angle regression
$\alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent
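For reference, scikit-learn exposes all three of these regularizers; a short usage sketch (note that sklearn calls the regularization strength alpha, and its elastic-net parameterization differs slightly from the convex combination written above):

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                     # L2 penalty (Tikhonov / ridge)
lasso = Lasso(alpha=0.1)                     # L1 penalty, solved by coordinate descent
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2 penalties

# Typical use (X_train, y_train assumed given):
# ridge.fit(X_train, y_train); ridge.coef_
```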

10 Things to remember
Overfitting: a complex model can do well on the training set but perform poorly on the test set.
Cost function: add a norm penalty (L1 or L2).
Regularized linear regression: gradient descent with weight decay (L2 norm).
Regularized logistic regression.

11 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

12 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

13 Logistic regression
$h_\theta(x) = g(\theta^\top x)$, with the sigmoid $g(z) = \frac{1}{1 + e^{-z}}$
Suppose we predict “y = 1” if $h_\theta(x) \ge 0.5$, i.e. $z = \theta^\top x \ge 0$,
and predict “y = 0” if $h_\theta(x) < 0.5$, i.e. $z = \theta^\top x < 0$.
Slide credit: Andrew Ng

14 Alternative view
$h_\theta(x) = g(\theta^\top x)$, $g(z) = \frac{1}{1 + e^{-z}}$
If “y = 1”, we want $h_\theta(x) \approx 1$, i.e. $z = \theta^\top x \gg 0$.
If “y = 0”, we want $h_\theta(x) \approx 0$, i.e. $z = \theta^\top x \ll 0$.
Slide credit: Andrew Ng

15 Cost function for Logistic Regression
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$
[Figure: the two loss curves plotted against $h_\theta(x) \in [0, 1]$]
Slide credit: Andrew Ng

16 Alternative view of logistic regression
$\text{Cost}(h_\theta(x), y) = -y\log h_\theta(x) - (1-y)\log\left(1 - h_\theta(x)\right)$
$= y\left(-\log\frac{1}{1 + e^{-\theta^\top x}}\right) + (1-y)\left(-\log\left(1 - \frac{1}{1 + e^{-\theta^\top x}}\right)\right)$
[Figure: $-\log\frac{1}{1 + e^{-\theta^\top x}}$ (the $y=1$ case) and $-\log\left(1 - \frac{1}{1 + e^{-\theta^\top x}}\right)$ (the $y=0$ case), plotted against $\theta^\top x$]

17 Logistic regression (logistic loss)
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log\left(1 - h_\theta(x^{(i)})\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Support vector machine (hinge loss)
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
[Figure: the logistic losses for $y=1$ and $y=0$ together with their piecewise-linear approximations $\text{cost}_1$ and $\text{cost}_0$, plotted against $\theta^\top x$]
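The slides draw $\text{cost}_1$ and $\text{cost}_0$ as piecewise-linear approximations of the two logistic-loss curves; a common concrete choice is the hinge form below (a sketch for plotting and intuition, not necessarily the exact slopes drawn on the slide):

```python
import numpy as np

def cost1(z):
    """Loss used when y = 1: zero once z = theta'x >= 1, linear below that."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Loss used when y = 0: zero once z = theta'x <= -1, linear above that."""
    return np.maximum(0.0, 1.0 + z)

def logistic_cost_y1(z):
    """Logistic loss for y = 1, for comparison with cost1."""
    return -np.log(1.0 / (1.0 + np.exp(-z)))

z = np.linspace(-2, 2, 9)
print(cost1(z))            # hinge for the y = 1 case
print(logistic_cost_y1(z)) # smooth counterpart
```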

18 Optimization objective for SVM
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
1) Multiply by $m$ (scaling an objective by a positive constant does not change the minimizer).
2) Multiply by $C = \frac{1}{\lambda}$.
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Slide credit: Andrew Ng

19 Hypothesis of SVM
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Hypothesis: $h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$
Slide credit: Andrew Ng

20 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

21 Support vector machine
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
If “y = 1”, we want $\theta^\top x \ge 1$ (not just $\ge 0$).
If “y = 0”, we want $\theta^\top x \le -1$ (not just $< 0$).
[Figure: $\text{cost}_1(\theta^\top x)$ and $\text{cost}_0(\theta^\top x)$ plotted against $\theta^\top x$]
Slide credit: Andrew Ng

22 SVM decision boundary
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Let’s say we have a very large $C$. Then the optimizer will drive the loss term to zero:
whenever $y^{(i)} = 1$: $\theta^\top x^{(i)} \ge 1$; whenever $y^{(i)} = 0$: $\theta^\top x^{(i)} \le -1$.
This leaves
$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.}\ \ \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1,\ \ \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$
Slide credit: Andrew Ng

23 SVM decision boundary: Linearly separable case
[Figure: linearly separable training data in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng

24 SVM decision boundary: Linearly separable case
[Figure: the SVM decision boundary in the $(x_1, x_2)$ plane with the margin to the closest examples marked]
Slide credit: Andrew Ng

25 Large margin classifier in the presence of outliers
[Figure: in the $(x_1, x_2)$ plane, the boundary obtained when $C$ is very large versus when $C$ is not too large, in the presence of an outlier]
Slide credit: Andrew Ng

26 Vector inner product
$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},\quad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$
$\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2} \in \mathbb{R}$
$p$ = (signed) length of the projection of $v$ onto $u$
$u^\top v = p\cdot\|u\| = u_1 v_1 + u_2 v_2$
Slide credit: Andrew Ng
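A quick numeric check of the projection identity $u^\top v = p\cdot\|u\|$ with made-up vectors:

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)     # ||u|| = sqrt(u1^2 + u2^2)
p = np.dot(u, v) / norm_u      # signed length of the projection of v onto u
assert np.isclose(np.dot(u, v), p * norm_u)   # u'v = p * ||u|| = u1*v1 + u2*v2
print(p, norm_u, np.dot(u, v))
```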

27 SVM decision boundary
$\frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right) = \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$
$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.}\ \ \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1,\ \ \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$
Simplification: $\theta_0 = 0$, $n = 2$.
What is $\theta^\top x^{(i)}$? Projecting $x^{(i)}$ onto $\theta$: $\theta^\top x^{(i)} = p^{(i)}\,\|\theta\| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)}$
Slide credit: Andrew Ng

28 SVM decision boundary
$\min_\theta\ \frac{1}{2}\|\theta\|^2 \quad \text{s.t.}\ \ p^{(i)}\,\|\theta\| \ge 1 \text{ if } y^{(i)} = 1,\ \ p^{(i)}\,\|\theta\| \le -1 \text{ if } y^{(i)} = 0$
Simplification: $\theta_0 = 0$, $n = 2$.
If $p^{(1)}, p^{(2)}$ are small, then $\|\theta\|$ must be large; if $p^{(1)}, p^{(2)}$ are large, then $\|\theta\|$ can be small. Minimizing $\|\theta\|$ therefore prefers boundaries for which the projections, and hence the margins, are large.
[Figure: two candidate boundaries; the one with larger projections $p^{(i)}$ of the examples onto $\theta$ allows a smaller $\|\theta\|$]
Slide credit: Andrew Ng

29 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

30 Non-linear decision boundary
Predict $y = 1$ if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$
Rewrite as $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\cdots$
Is there a different/better choice of the features $f_1, f_2, f_3, \cdots$?
[Figure: data in the $(x_1, x_2)$ plane requiring a non-linear decision boundary]
Slide credit: Andrew Ng

31 Kernel
Given $x$, compute new features based on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane:
$f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$
Gaussian kernel: $\text{similarity}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$
Slide credit: Andrew Ng
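A direct transcription of the Gaussian similarity into NumPy (the function name is my own):

```python
import numpy as np

def gaussian_similarity(x, l, sigma):
    """exp(-||x - l||^2 / (2 sigma^2)): 1 when x == l, approaching 0 far from l."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_similarity(x, x, 1.0))                     # exactly 1.0 at the landmark
print(gaussian_similarity(x, np.array([5.0, 6.0]), 1.0))  # ~ 0 far away
```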

32 Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$,
with $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$.
Example: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$. A point close to $l^{(1)}$ has $f_1 \approx 1$ and $f_2, f_3 \approx 0$, so the score is about $-0.5 + 1 = 0.5 \ge 0$ and we predict $y = 1$; a point far from all landmarks scores about $-0.5$, so we predict $y = 0$.
[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
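Checking the example numerically with hypothetical landmark locations (the landmark coordinates and $\sigma$ below are made up for illustration; only $\theta$ comes from the slide):

```python
import numpy as np

theta = np.array([-0.5, 1.0, 1.0, 0.0])     # theta_0..theta_3 from the slide
landmarks = [np.array([3.0, 2.0]),          # hypothetical l(1)
             np.array([1.0, 4.5]),          # hypothetical l(2)
             np.array([5.0, 5.0])]          # hypothetical l(3)
sigma = 1.0

def predict(x):
    sims = [np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2)) for l in landmarks]
    f = np.array([1.0] + sims)              # f_0 = 1, then f_1..f_3
    return int(theta @ f >= 0)

print(predict(np.array([3.1, 2.0])))  # near l(1): score ~ -0.5 + 1 = 0.5 -> predicts 1
print(predict(np.array([9.0, 9.0])))  # far from all landmarks: score ~ -0.5 -> predicts 0
```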

33 Choosing the landmarks
Given $x$: $f_i = \text{similarity}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$
Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$
Where do we get $l^{(1)}, l^{(2)}, l^{(3)}, \cdots$?
[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \ldots, l^{(m)}$ placed at the training examples in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng

34 SVM with kernels
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)},\ l^{(2)} = x^{(2)},\ l^{(3)} = x^{(3)},\ \cdots,\ l^{(m)} = x^{(m)}$.
Given an example $x$: $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $\ldots$, with $f_0 = 1$.
For a training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \rightarrow f^{(i)}$, where $f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_m \end{bmatrix}$
Slide credit: Andrew Ng
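A sketch of the feature mapping $x \rightarrow f$ when the landmarks are the training inputs themselves (names illustrative):

```python
import numpy as np

def kernel_features(x, landmarks, sigma):
    """Map x to f = [1, f_1, ..., f_m] with f_i = Gaussian similarity(x, l(i))."""
    sims = [np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2)) for l in landmarks]
    return np.array([1.0] + sims)       # f_0 = 1 is the intercept feature

# With landmarks = the rows of X_train, every example x(i) becomes an (m+1)-vector f(i):
# F = np.vstack([kernel_features(x, X_train, sigma) for x in X_train])
```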

35 SVM with kernels
Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$; predict $y = 1$ if $\theta^\top f \ge 0$.
Training (original):
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Training (with kernel):
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$

36 SVM parameters
$C = \frac{1}{\lambda}$. Large $C$: lower bias, higher variance. Small $C$: higher bias, lower variance.
$\sigma^2$. Large $\sigma^2$: the features $f_i$ vary more smoothly; higher bias, lower variance. Small $\sigma^2$: the features $f_i$ vary less smoothly; lower bias, higher variance.
Slide credit: Andrew Ng
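In practice $C$ and $\sigma^2$ are usually picked by cross-validation; a hedged scikit-learn sketch (sklearn's gamma corresponds to $\frac{1}{2\sigma^2}$, and the grid values below are just placeholders):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100],      # large C: lower bias, higher variance
              "gamma": [0.01, 0.1, 1.0]}   # large gamma = small sigma^2: lower bias, higher variance
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_   # X_train, y_train assumed given
```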

37 SVM song

38 SVM Demo

39 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

40 Using SVM
Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$.
Need to specify:
Choice of parameter $C$.
Choice of kernel (similarity function):
Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$.
Gaussian kernel: $f_i = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$, where $l^{(i)} = x^{(i)}$. Need to choose $\sigma^2$, and need proper feature scaling. See the sketch below.
Slide credit: Andrew Ng
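A minimal scikit-learn sketch matching the two choices above; scikit-learn wraps libsvm (SVC) and liblinear (LinearSVC), its gamma plays the role of $\frac{1}{2\sigma^2}$, and the StandardScaler provides the feature scaling the slide asks for (parameter values are placeholders):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Gaussian (RBF) kernel SVM with feature scaling.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))

# Linear kernel (no similarity features), backed by liblinear.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0))

# rbf_svm.fit(X_train, y_train); rbf_svm.predict(X_test)   # data assumed given
```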

41 Kernel (similarity) functions
Note: not all similarity functions make valid kernels (a valid kernel must satisfy Mercer's theorem). Many off-the-shelf kernels are available:
Polynomial kernel
String kernel
Chi-square kernel
Histogram intersection kernel
Slide credit: Andrew Ng

42 Multi-class classification
Use the one-vs.-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, obtaining $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^\top x$. A sketch follows below.
[Figure: multiple classes in the $(x_1, x_2)$ plane separated by one-vs.-all boundaries]
Slide credit: Andrew Ng
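A hedged sketch of the one-vs.-all setup in scikit-learn (LinearSVC already does one-vs.-rest by default; the explicit wrapper below just makes the K-classifier structure visible):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Trains K binary SVMs (class i vs. the rest); prediction picks the class i
# whose decision function (theta_i)'x is largest.
ovr_svm = OneVsRestClassifier(LinearSVC(C=1.0))
# ovr_svm.fit(X_train, y_train); ovr_svm.predict(X_test)   # data assumed given
```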

43 Logistic regression vs. SVMs
$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples.
If $n$ is large relative to $m$ (e.g., $n = 10{,}000$, $m$ = 10 to 1,000): use logistic regression or an SVM without a kernel (“linear kernel”).
If $n$ is small and $m$ is intermediate (e.g., $n$ = 1 to 1,000, $m$ = 10 to 10,000): use an SVM with a Gaussian kernel.
If $n$ is small and $m$ is large (e.g., $n$ = 1 to 1,000, $m$ = 50,000+): create/add more features, then use logistic regression or a linear SVM.
A neural network is likely to work well in most of these cases, but may be slower to train.
Slide credit: Andrew Ng

44 Things to remember
Cost function:
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$
Large margin classification
Kernels
Using an SVM
[Figure: the large-margin decision boundary in the $(x_1, x_2)$ plane with the margin marked]

