
1 Support Vector Machine I
Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019

2 Administrative
Please use Piazza instead of email.
HW 0 grades are back; regrade requests are open for one week. HW 1 is due soon. HW 2 is on the way.

3 Regularized logistic regression
[Figure: classification data plotted by tumor size and age with a highly non-linear decision boundary]
Cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
$h_\theta(x) = g\!\left(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots\right)$
Slide credit: Andrew Ng

4 Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$  (the bracketed term is $\frac{\partial}{\partial\theta_j}J(\theta)$)
}
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
Slide credit: Andrew Ng
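As a concrete illustration, here is a minimal NumPy sketch of one regularized gradient-descent update, assuming a design matrix X whose first column is all ones and labels y in {0, 1} (the function and variable names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gd_step(theta, X, y, alpha, lam):
    """One gradient-descent update for regularized logistic regression.
    X: (m, n+1) with a leading column of ones; y: (m,) in {0, 1}; theta: (n+1,)."""
    m = X.shape[0]
    h = sigmoid(X @ theta)              # h_theta(x) for every example
    grad = X.T @ (h - y) / m            # unregularized gradient, all coordinates
    grad[1:] += (lam / m) * theta[1:]   # weight decay on theta_1..theta_n, not theta_0
    return theta - alpha * grad
```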

5 $\|\theta\|_1$: Lasso regularization
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator

6 Single predictor: Soft Thresholding
$\underset{\theta}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda|\theta|$
$\hat{\theta} = \begin{cases} \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle - \lambda & \text{if } \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle > \lambda \\ 0 & \text{if } \frac{1}{m}|\langle\boldsymbol{x},\boldsymbol{y}\rangle| \le \lambda \\ \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle + \lambda & \text{if } \frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle < -\lambda \end{cases}$
$\hat{\theta} = S_\lambda\!\left(\frac{1}{m}\langle\boldsymbol{x},\boldsymbol{y}\rangle\right)$, where the soft-thresholding operator is $S_\lambda(x) = \operatorname{sign}(x)\,(|x| - \lambda)_+$
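A small NumPy sketch of the soft-thresholding operator and the resulting single-predictor lasso solution; it assumes the predictor has been standardized so that $\frac{1}{m}\sum_i x_i^2 = 1$, as the closed form above requires (names are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Closed-form lasso solution for one standardized predictor, no intercept:
    theta_hat = S_lambda(<x, y> / m)."""
    m = y.shape[0]
    return soft_threshold(np.dot(x, y) / m, lam)
```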

7 Multiple predictors: cyclic coordinate descent
$\underset{\theta}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j + \sum_{k\neq j} x_k^{(i)}\theta_k - y^{(i)}\right)^2 + \lambda\sum_{k\neq j}|\theta_k| + \lambda|\theta_j|$
For each $j$, update $\theta_j$ by solving
$\underset{\theta_j}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j - r_j^{(i)}\right)^2 + \lambda|\theta_j|$, where $r_j^{(i)} = y^{(i)} - \sum_{k\neq j} x_k^{(i)}\theta_k$ is the partial residual. See the sketch below.
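A minimal sketch of the cyclic coordinate-descent loop, again assuming standardized columns (so each satisfies $\frac{1}{m}\sum_i x_{ij}^2 = 1$) and no intercept; each pass applies the single-predictor soft-thresholding update to one coordinate while holding the others fixed:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_passes=100):
    """Cyclic coordinate descent for the lasso objective above.
    X: (m, n) with standardized columns so (1/m) * sum_i X[i, j]**2 == 1."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_passes):
        for j in range(n):
            # Partial residual: what remains of y after the other coordinates explain it.
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(np.dot(X[:, j], r_j) / m, lam)
    return theta
```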

8 L1 and L2 balls
[Figure: the L1 (diamond-shaped) and L2 (circular) norm balls]

9 Terminology
Regularization function | Name | Solver
$\|\theta\|_2^2 = \sum_{j=1}^{n}\theta_j^2$ | Tikhonov regularization (ridge regression) | Closed form
$\|\theta\|_1 = \sum_{j=1}^{n}|\theta_j|$ | LASSO regression | Proximal gradient descent, least angle regression
$\alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent
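For reference, scikit-learn exposes all three of these regularizers; a short usage sketch (note that sklearn calls the regularization strength alpha, and its elastic-net parameterization differs slightly from the convex combination written above):

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                     # L2 penalty (Tikhonov / ridge)
lasso = Lasso(alpha=0.1)                     # L1 penalty, solved by coordinate descent
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2 penalties

# Typical use (X_train, y_train assumed given):
# ridge.fit(X_train, y_train); ridge.coef_
```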

10 Things to remember
Overfitting: a complex model can do well on the training set but perform poorly on the test set.
Cost function: add a norm penalty (L1 or L2).
Regularized linear regression: gradient descent with weight decay (L2 norm).
Regularized logistic regression.

11 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

12 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

13 Logistic regression
$h_\theta(x) = g(\theta^\top x)$, with the sigmoid $g(z) = \frac{1}{1 + e^{-z}}$
Suppose we predict “y = 1” if $h_\theta(x) \ge 0.5$, i.e. $z = \theta^\top x \ge 0$,
and predict “y = 0” if $h_\theta(x) < 0.5$, i.e. $z = \theta^\top x < 0$.
Slide credit: Andrew Ng

14 Alternative view
$h_\theta(x) = g(\theta^\top x)$, $g(z) = \frac{1}{1 + e^{-z}}$
If “y = 1”, we want $h_\theta(x) \approx 1$, i.e. $z = \theta^\top x \gg 0$.
If “y = 0”, we want $h_\theta(x) \approx 0$, i.e. $z = \theta^\top x \ll 0$.
Slide credit: Andrew Ng

15 Cost function for Logistic Regression
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$
[Figure: the two loss curves plotted against $h_\theta(x) \in [0, 1]$]
Slide credit: Andrew Ng

16 Alternative view of logistic regression
$\text{Cost}(h_\theta(x), y) = -y\log h_\theta(x) - (1-y)\log\left(1 - h_\theta(x)\right)$
$= y\left(-\log\frac{1}{1 + e^{-\theta^\top x}}\right) + (1-y)\left(-\log\left(1 - \frac{1}{1 + e^{-\theta^\top x}}\right)\right)$
[Figure: $-\log\frac{1}{1 + e^{-\theta^\top x}}$ (the $y=1$ case) and $-\log\left(1 - \frac{1}{1 + e^{-\theta^\top x}}\right)$ (the $y=0$ case), plotted against $\theta^\top x$]

17 Logistic regression (logistic loss)
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log\left(1 - h_\theta(x^{(i)})\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Support vector machine (hinge loss)
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
[Figure: the logistic losses for $y=1$ and $y=0$ together with their piecewise-linear approximations $\text{cost}_1$ and $\text{cost}_0$, plotted against $\theta^\top x$]
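The slides draw $\text{cost}_1$ and $\text{cost}_0$ as piecewise-linear approximations of the two logistic-loss curves; a common concrete choice is the hinge form below (a sketch for plotting and intuition, not necessarily the exact slopes drawn on the slide):

```python
import numpy as np

def cost1(z):
    """Loss used when y = 1: zero once z = theta'x >= 1, linear below that."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Loss used when y = 0: zero once z = theta'x <= -1, linear above that."""
    return np.maximum(0.0, 1.0 + z)

def logistic_cost_y1(z):
    """Logistic loss for y = 1, for comparison with cost1."""
    return -np.log(1.0 / (1.0 + np.exp(-z)))

z = np.linspace(-2, 2, 9)
print(cost1(z))            # hinge for the y = 1 case
print(logistic_cost_y1(z)) # smooth counterpart
```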

18 Optimization objective for SVM
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
1) Multiply by $m$ (scaling an objective by a positive constant does not change the minimizer).
2) Multiply by $C = \frac{1}{\lambda}$.
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Slide credit: Andrew Ng

19 Hypothesis of SVM
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Hypothesis: $h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$
Slide credit: Andrew Ng

20 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

21 Support vector machine
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
If “y = 1”, we want $\theta^\top x \ge 1$ (not just $\ge 0$).
If “y = 0”, we want $\theta^\top x \le -1$ (not just $< 0$).
[Figure: $\text{cost}_1(\theta^\top x)$ and $\text{cost}_0(\theta^\top x)$ plotted against $\theta^\top x$]
Slide credit: Andrew Ng

22 SVM decision boundary
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Let’s say we have a very large $C$. Then the optimizer will drive the loss term to zero:
whenever $y^{(i)} = 1$: $\theta^\top x^{(i)} \ge 1$; whenever $y^{(i)} = 0$: $\theta^\top x^{(i)} \le -1$.
This leaves
$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.}\ \ \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1,\ \ \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$
Slide credit: Andrew Ng

23 SVM decision boundary: Linearly separable case
[Figure: linearly separable training data in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng

24 SVM decision boundary: Linearly separable case
[Figure: the SVM decision boundary in the $(x_1, x_2)$ plane with the margin to the closest examples marked]
Slide credit: Andrew Ng

25 Large margin classifier in the presence of outliers
[Figure: in the $(x_1, x_2)$ plane, the boundary obtained when $C$ is very large versus when $C$ is not too large, in the presence of an outlier]
Slide credit: Andrew Ng

26 Vector inner product
$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},\quad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$
$\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2} \in \mathbb{R}$
$p$ = (signed) length of the projection of $v$ onto $u$
$u^\top v = p\cdot\|u\| = u_1 v_1 + u_2 v_2$
Slide credit: Andrew Ng
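A quick numeric check of the projection identity $u^\top v = p\cdot\|u\|$ with made-up vectors:

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)     # ||u|| = sqrt(u1^2 + u2^2)
p = np.dot(u, v) / norm_u      # signed length of the projection of v onto u
assert np.isclose(np.dot(u, v), p * norm_u)   # u'v = p * ||u|| = u1*v1 + u2*v2
print(p, norm_u, np.dot(u, v))
```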

27 SVM decision boundary
$\frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right) = \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$
$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.}\ \ \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1,\ \ \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$
Simplification: $\theta_0 = 0$, $n = 2$.
What is $\theta^\top x^{(i)}$? Projecting $x^{(i)}$ onto $\theta$: $\theta^\top x^{(i)} = p^{(i)}\,\|\theta\| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)}$
Slide credit: Andrew Ng

28 SVM decision boundary
$\min_\theta\ \frac{1}{2}\|\theta\|^2 \quad \text{s.t.}\ \ p^{(i)}\,\|\theta\| \ge 1 \text{ if } y^{(i)} = 1,\ \ p^{(i)}\,\|\theta\| \le -1 \text{ if } y^{(i)} = 0$
Simplification: $\theta_0 = 0$, $n = 2$.
If $p^{(1)}, p^{(2)}$ are small, then $\|\theta\|$ must be large; if $p^{(1)}, p^{(2)}$ are large, then $\|\theta\|$ can be small. Minimizing $\|\theta\|$ therefore prefers boundaries for which the projections, and hence the margins, are large.
[Figure: two candidate boundaries; the one with larger projections $p^{(i)}$ of the examples onto $\theta$ allows a smaller $\|\theta\|$]
Slide credit: Andrew Ng

29 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

30 Non-linear decision boundary
Predict $y = 1$ if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$
Rewrite as $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\cdots$
Is there a different/better choice of the features $f_1, f_2, f_3, \cdots$?
[Figure: data in the $(x_1, x_2)$ plane requiring a non-linear decision boundary]
Slide credit: Andrew Ng

31 Kernel
Given $x$, compute new features based on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane:
$f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$
Gaussian kernel: $\text{similarity}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$
Slide credit: Andrew Ng
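A direct transcription of the Gaussian similarity into NumPy (the function name is my own):

```python
import numpy as np

def gaussian_similarity(x, l, sigma):
    """exp(-||x - l||^2 / (2 sigma^2)): 1 when x == l, approaching 0 far from l."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_similarity(x, x, 1.0))                     # exactly 1.0 at the landmark
print(gaussian_similarity(x, np.array([5.0, 6.0]), 1.0))  # ~ 0 far away
```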

32 Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$,
with $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$.
Example: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$. A point close to $l^{(1)}$ has $f_1 \approx 1$ and $f_2, f_3 \approx 0$, so the score is about $-0.5 + 1 = 0.5 \ge 0$ and we predict $y = 1$; a point far from all landmarks scores about $-0.5$, so we predict $y = 0$.
[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng
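Checking the example numerically with hypothetical landmark locations (the landmark coordinates and $\sigma$ below are made up for illustration; only $\theta$ comes from the slide):

```python
import numpy as np

theta = np.array([-0.5, 1.0, 1.0, 0.0])     # theta_0..theta_3 from the slide
landmarks = [np.array([3.0, 2.0]),          # hypothetical l(1)
             np.array([1.0, 4.5]),          # hypothetical l(2)
             np.array([5.0, 5.0])]          # hypothetical l(3)
sigma = 1.0

def predict(x):
    sims = [np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2)) for l in landmarks]
    f = np.array([1.0] + sims)              # f_0 = 1, then f_1..f_3
    return int(theta @ f >= 0)

print(predict(np.array([3.1, 2.0])))  # near l(1): score ~ -0.5 + 1 = 0.5 -> predicts 1
print(predict(np.array([9.0, 9.0])))  # far from all landmarks: score ~ -0.5 -> predicts 0
```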

33 Choosing the landmarks
Given $x$: $f_i = \text{similarity}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$
Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$
Where do we get $l^{(1)}, l^{(2)}, l^{(3)}, \cdots$?
[Figure: landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \ldots, l^{(m)}$ placed at the training examples in the $(x_1, x_2)$ plane]
Slide credit: Andrew Ng

34 SVM with kernels
Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$,
choose $l^{(1)} = x^{(1)},\ l^{(2)} = x^{(2)},\ l^{(3)} = x^{(3)},\ \cdots,\ l^{(m)} = x^{(m)}$.
Given an example $x$: $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $\ldots$, with $f_0 = 1$.
For a training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \rightarrow f^{(i)}$, where $f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_m \end{bmatrix}$
Slide credit: Andrew Ng
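A sketch of the feature mapping $x \rightarrow f$ when the landmarks are the training inputs themselves (names illustrative):

```python
import numpy as np

def kernel_features(x, landmarks, sigma):
    """Map x to f = [1, f_1, ..., f_m] with f_i = Gaussian similarity(x, l(i))."""
    sims = [np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2)) for l in landmarks]
    return np.array([1.0] + sims)       # f_0 = 1 is the intercept feature

# With landmarks = the rows of X_train, every example x(i) becomes an (m+1)-vector f(i):
# F = np.vstack([kernel_features(x, X_train, sigma) for x in X_train])
```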

35 SVM with kernels
Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$; predict $y = 1$ if $\theta^\top f \ge 0$.
Training (original):
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Training (with kernel):
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$

36 SVM parameters
$C = \frac{1}{\lambda}$. Large $C$: lower bias, higher variance. Small $C$: higher bias, lower variance.
$\sigma^2$. Large $\sigma^2$: the features $f_i$ vary more smoothly; higher bias, lower variance. Small $\sigma^2$: the features $f_i$ vary less smoothly; lower bias, higher variance.
Slide credit: Andrew Ng
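In practice $C$ and $\sigma^2$ are usually picked by cross-validation; a hedged scikit-learn sketch (sklearn's gamma corresponds to $\frac{1}{2\sigma^2}$, and the grid values below are just placeholders):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100],      # large C: lower bias, higher variance
              "gamma": [0.01, 0.1, 1.0]}   # large gamma = small sigma^2: lower bias, higher variance
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_   # X_train, y_train assumed given
```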

37 SVM song

38 SVM Demo

39 Support Vector Machine
Cost function Large margin classification Kernels Using an SVM

40 Using SVM
Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$.
Need to specify:
Choice of parameter $C$.
Choice of kernel (similarity function):
Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$.
Gaussian kernel: $f_i = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$, where $l^{(i)} = x^{(i)}$. Need to choose $\sigma^2$, and need proper feature scaling. See the sketch below.
Slide credit: Andrew Ng
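A minimal scikit-learn sketch matching the two choices above; scikit-learn wraps libsvm (SVC) and liblinear (LinearSVC), its gamma plays the role of $\frac{1}{2\sigma^2}$, and the StandardScaler provides the feature scaling the slide asks for (parameter values are placeholders):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Gaussian (RBF) kernel SVM with feature scaling.
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))

# Linear kernel (no similarity features), backed by liblinear.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0))

# rbf_svm.fit(X_train, y_train); rbf_svm.predict(X_test)   # data assumed given
```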

41 Kernel (similarity) functions
Note: not all similarity functions make valid kernels (a valid kernel must satisfy Mercer's theorem). Many off-the-shelf kernels are available:
Polynomial kernel
String kernel
Chi-square kernel
Histogram intersection kernel
Slide credit: Andrew Ng

42 Multi-class classification
Use the one-vs.-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, obtaining $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(K)}$. Pick the class $i$ with the largest $(\theta^{(i)})^\top x$. A sketch follows below.
[Figure: multiple classes in the $(x_1, x_2)$ plane separated by one-vs.-all boundaries]
Slide credit: Andrew Ng
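A hedged sketch of the one-vs.-all setup in scikit-learn (LinearSVC already does one-vs.-rest by default; the explicit wrapper below just makes the K-classifier structure visible):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Trains K binary SVMs (class i vs. the rest); prediction picks the class i
# whose decision function (theta_i)'x is largest.
ovr_svm = OneVsRestClassifier(LinearSVC(C=1.0))
# ovr_svm.fit(X_train, y_train); ovr_svm.predict(X_test)   # data assumed given
```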

43 Logistic regression vs. SVMs
$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples.
If $n$ is large relative to $m$ (e.g., $n = 10{,}000$, $m$ = 10 to 1,000): use logistic regression or an SVM without a kernel (“linear kernel”).
If $n$ is small and $m$ is intermediate (e.g., $n$ = 1 to 1,000, $m$ = 10 to 10,000): use an SVM with a Gaussian kernel.
If $n$ is small and $m$ is large (e.g., $n$ = 1 to 1,000, $m$ = 50,000+): create/add more features, then use logistic regression or a linear SVM.
A neural network is likely to work well in most of these cases, but may be slower to train.
Slide credit: Andrew Ng

44 Things to remember
Cost function:
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$
Large margin classification
Kernels
Using an SVM
[Figure: the large-margin decision boundary in the $(x_1, x_2)$ plane with the margin marked]

