1
Support Vector Machine I
Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019
2
Administrative Please use Piazza; no emails.
HW 0 grades are back. Re-grade requests are open for one week. HW 1 due soon. HW 2 on the way.
3
Regularized logistic regression
Cost function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$, with hypothesis $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$. (Figure: non-linear decision boundary over Tumor Size vs. Age.) Slide credit: Andrew Ng
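A minimal NumPy sketch of this regularized cost, not from the slides; the function and variable names are illustrative, and the intercept $\theta_0$ is left unregularized as is conventional.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """J(theta) for regularized logistic regression.

    X: (m, n+1) design matrix with a leading column of ones,
    y: (m,) labels in {0, 1}, lam: regularization strength lambda.
    """
    m = len(y)
    h = sigmoid(X @ theta)
    # Cross-entropy data term (note the leading minus sign).
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    # L2 penalty; theta[0] (the intercept) is conventionally not regularized.
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return data_term + reg_term
```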
4
Gradient descent (Regularized)
Repeat { $\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$;  $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$ }, where $h_\theta(x) = \frac{1}{1+e^{-\theta^\top x}}$ and the bracketed term is $\frac{\partial}{\partial\theta_j}J(\theta)$. Slide credit: Andrew Ng
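A vectorized sketch of one pass through this loop, reusing `sigmoid()` from the previous sketch; the $\frac{\lambda}{m}\theta_j$ term follows from differentiating the $\frac{\lambda}{2m}\sum_j\theta_j^2$ penalty, and the intercept is not decayed.

```python
def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)              # sigmoid() as defined in the sketch above
    grad = X.T @ (h - y) / m            # (1/m) * sum_i (h - y) * x_j
    grad[1:] += (lam / m) * theta[1:]   # weight decay on theta_1..theta_n only
    return theta - alpha * grad
```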
5
$\|\theta\|_1$: Lasso regularization
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$. LASSO: Least Absolute Shrinkage and Selection Operator
6
Single predictor: Soft Thresholding
$\underset{\theta}{\mathrm{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\big(x^{(i)}\theta - y^{(i)}\big)^2 + \lambda|\theta|$. Solution: $\hat{\theta} = \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle - \lambda$ if $\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle > \lambda$; $\hat{\theta} = 0$ if $\frac{1}{m}|\langle \boldsymbol{x}, \boldsymbol{y}\rangle| \le \lambda$; $\hat{\theta} = \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle + \lambda$ if $\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle < -\lambda$. Compactly, $\hat{\theta} = S_\lambda\!\big(\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle\big)$, where the soft-thresholding operator is $S_\lambda(x) = \operatorname{sign}(x)\,(|x| - \lambda)_+$.
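A small sketch of the soft-thresholding operator and the resulting single-predictor lasso solution. It assumes the predictor is standardized so that $\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{x}\rangle = 1$, the usual setting for this closed form; the names are illustrative.

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Closed-form lasso fit for one predictor: theta_hat = S_lambda(<x, y> / m).

    Assumes x is standardized so that (1/m) <x, x> = 1.
    """
    m = len(y)
    return soft_threshold(np.dot(x, y) / m, lam)
```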
7
Multiple predictors: Cyclic Coordinate Descent
$\underset{\theta}{\mathrm{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\Big(x_j^{(i)}\theta_j + \sum_{k\ne j} x_k^{(i)}\theta_k - y^{(i)}\Big)^2 + \lambda\sum_{k\ne j}|\theta_k| + \lambda|\theta_j|$. For each $j$, update $\theta_j$ by solving $\underset{\theta_j}{\mathrm{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\big(x_j^{(i)}\theta_j - r_j^{(i)}\big)^2 + \lambda|\theta_j|$, where $r_j^{(i)} = y^{(i)} - \sum_{k\ne j} x_k^{(i)}\theta_k$ is the partial residual.
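A sketch of the cyclic coordinate-descent loop this slide describes, reusing `soft_threshold` from the previous sketch. It again assumes standardized columns, $\frac{1}{m}\langle x_j, x_j\rangle = 1$, so each coordinate update reduces to a soft-thresholded inner product; the iteration count is arbitrary.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for the lasso (illustrative sketch).

    Assumes each column of X is standardized so that (1/m) <x_j, x_j> = 1.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        for j in range(n):
            # Partial residual r_j: remove feature j's current contribution.
            r_j = y - X @ theta + X[:, j] * theta[j]
            # Univariate lasso in theta_j -> soft thresholding (previous slide).
            theta[j] = soft_threshold(np.dot(X[:, j], r_j) / m, lam)
    return theta
```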
8
L1 and L2 balls
9
Terminology: regularization function, name, and solver.
$\|\theta\|_2^2 = \sum_{j=1}^{n}\theta_j^2$ — Tikhonov regularization (ridge regression); closed-form solution.
$\|\theta\|_1 = \sum_{j=1}^{n}|\theta_j|$ — LASSO regression; proximal gradient descent, least angle regression.
$\alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2^2$ — elastic net regularization; proximal gradient descent.
10
Things to remember
Overfitting: a complex model can do well on the training set but perform poorly on the testing set. Cost function: norm penalty, L1 or L2. Regularized linear regression: gradient descent with weight decay (L2 norm). Regularized logistic regression.
11
Support Vector Machine
Cost function, Large margin classification, Kernels, Using an SVM
12
Support Vector Machine
Cost function, Large margin classification, Kernels, Using an SVM
13
Logistic regression: $h_\theta(x) = g(\theta^\top x)$ with $g(z) = \frac{1}{1+e^{-z}}$ and $z = \theta^\top x$. Suppose we predict “y = 1” if $h_\theta(x) \ge 0.5$, i.e., $z = \theta^\top x \ge 0$, and predict “y = 0” if $h_\theta(x) < 0.5$, i.e., $z = \theta^\top x < 0$. Slide credit: Andrew Ng
14
Alternative view: $h_\theta(x) = g(\theta^\top x)$ with $g(z) = \frac{1}{1+e^{-z}}$ and $z = \theta^\top x$. If “y = 1”, we want $h_\theta(x) \approx 1$, i.e., $z = \theta^\top x \gg 0$. If “y = 0”, we want $h_\theta(x) \approx 0$, i.e., $z = \theta^\top x \ll 0$. Slide credit: Andrew Ng
15
Cost function for Logistic Regression
$\mathrm{Cost}(h_\theta(x), y) = -\log h_\theta(x)$ if $y = 1$; $-\log\big(1 - h_\theta(x)\big)$ if $y = 0$. (Figure: the two costs plotted against $h_\theta(x) \in [0, 1]$.) Slide credit: Andrew Ng
16
Alternative view of logistic regression
$\mathrm{Cost}(h_\theta(x), y) = -y\log h_\theta(x) - (1-y)\log\big(1-h_\theta(x)\big) = y\left(-\log\frac{1}{1+e^{-\theta^\top x}}\right) + (1-y)\left(-\log\Big(1-\frac{1}{1+e^{-\theta^\top x}}\Big)\right)$. The two cases are $-\log\frac{1}{1+e^{-\theta^\top x}}$ if $y = 1$ and $-\log\big(1-\frac{1}{1+e^{-\theta^\top x}}\big)$ if $y = 0$. (Figure: both losses plotted against $\theta^\top x$ over roughly $[-2, 2]$.)
17
Logistic regression (logistic loss)
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\big({-\log h_\theta(x^{(i)})}\big) + (1-y^{(i)})\big({-\log(1-h_\theta(x^{(i)}))}\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Support vector machine (hinge loss): $\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
(Figure: the $y=1$ and $y=0$ logistic losses and their piecewise-linear surrogates $\mathrm{cost}_1$, $\mathrm{cost}_0$, plotted against $\theta^\top x$.)
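To make the comparison concrete, here is one common way to write the two $y = 1$ losses as functions of $z = \theta^\top x$: the smooth logistic loss and a hinge-style surrogate for $\mathrm{cost}_1$ (with $\mathrm{cost}_0$ as its mirror image). The exact slope of the piecewise-linear surrogate drawn on the slides is a modeling choice; the hinge form below is the standard one.

```python
import numpy as np

def logistic_loss_pos(z):
    """-log(1 / (1 + e^{-z})): the y = 1 logistic loss as a function of z."""
    return np.log1p(np.exp(-z))

def cost1(z):
    """Hinge-style surrogate for y = 1: zero once z >= 1, linear below."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Hinge-style surrogate for y = 0: zero once z <= -1, linear above."""
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-2, 2, 9)
print(logistic_loss_pos(z))  # smooth, never exactly zero
print(cost1(z))              # flat zero for z >= 1, linear penalty otherwise
```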
18
Optimization objective for SVM
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$. Two rescalings that do not change the minimizer: 1) multiply by $m$; 2) multiply by $C = \frac{1}{\lambda}$. This gives $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$. Slide credit: Andrew Ng
19
Hypothesis of SVM. Training objective: $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$. Hypothesis: $h_\theta(x) = 1$ if $\theta^\top x \ge 0$, and $h_\theta(x) = 0$ if $\theta^\top x < 0$. Slide credit: Andrew Ng
20
Support Vector Machine
Cost function, Large margin classification, Kernels, Using an SVM
21
Support vector machine
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$. If “y = 1”, we want $\theta^\top x \ge 1$ (not just $\ge 0$). If “y = 0”, we want $\theta^\top x \le -1$ (not just $< 0$). (Figure: $\mathrm{cost}_1(\theta^\top x)$ and $\mathrm{cost}_0(\theta^\top x)$ plotted against $\theta^\top x$.) Slide credit: Andrew Ng
22
SVM decision boundary. $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$. Let’s say we have a very large $C$: whenever $y^{(i)} = 1$, we need $\theta^\top x^{(i)} \ge 1$; whenever $y^{(i)} = 0$, we need $\theta^\top x^{(i)} \le -1$. The problem becomes $\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$ and $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$. Slide credit: Andrew Ng
23
SVM decision boundary: Linearly separable case
(Figure: linearly separable data in the $x_1$–$x_2$ plane with a separating decision boundary.) Slide credit: Andrew Ng
24
SVM decision boundary: Linearly separable case
(Figure: the large-margin decision boundary in the $x_1$–$x_2$ plane, with the margin marked.) Slide credit: Andrew Ng
25
Large margin classifier in the presence of outliers
(Figure: decision boundaries in the $x_1$–$x_2$ plane when $C$ is very large vs. when $C$ is not too large.) Slide credit: Andrew Ng
26
Vector inner product. $u = (u_1, u_2)^\top$, $v = (v_1, v_2)^\top$. $\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2} \in \mathbb{R}$. $p$ = length of the projection of $v$ onto $u$. Then $u^\top v = p\cdot\|u\| = u_1 v_1 + u_2 v_2$. Slide credit: Andrew Ng
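A quick numeric check of the identity $u^\top v = p\cdot\|u\|$, with made-up vectors:

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)        # ||u|| = sqrt(u1^2 + u2^2)
p = np.dot(u, v) / norm_u         # length of the projection of v onto u
print(np.dot(u, v), p * norm_u)   # both equal u1*v1 + u2*v2 = 5.0
```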
27
SVM decision boundary. Simplification: $\theta_0 = 0$, $n = 2$. Then $\frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\big(\theta_1^2 + \theta_2^2\big) = \frac{1}{2}\Big(\sqrt{\theta_1^2 + \theta_2^2}\Big)^2 = \frac{1}{2}\|\theta\|^2$, and the problem is $\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$ s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$ and $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$. What is $\theta^\top x^{(i)}$? $\theta^\top x^{(i)} = p^{(i)}\cdot\|\theta\| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)}$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$. (Figure: $x^{(i)}$ and its projection $p^{(i)}$ onto $\theta$.) Slide credit: Andrew Ng
28
SVM decision boundary. $\min_\theta\ \frac{1}{2}\|\theta\|^2$ s.t. $p^{(i)}\cdot\|\theta\| \ge 1$ if $y^{(i)} = 1$ and $p^{(i)}\cdot\|\theta\| \le -1$ if $y^{(i)} = 0$. Simplification: $\theta_0 = 0$, $n = 2$. If $p^{(1)}, p^{(2)}$ are small, then $\|\theta\|$ must be large; if $p^{(1)}, p^{(2)}$ are large, then $\|\theta\|$ can be small. (Figure: two candidate boundaries in the $x_1$–$x_2$ plane, one giving small projections $p^{(i)}$ and one giving large projections.) Slide credit: Andrew Ng
29
Support Vector Machine
Cost function, Large margin classification, Kernels, Using an SVM
30
Non-linear decision boundary
Predict $y = 1$ if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$, i.e., a hypothesis of the form $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\cdots$. Is there a different/better choice of the features $f_1, f_2, f_3, \cdots$? (Figure: a non-linear decision boundary in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
31
Kernel. Given $x$, compute new features depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$: $f_1 = \mathrm{similarity}(x, l^{(1)})$, $f_2 = \mathrm{similarity}(x, l^{(2)})$, $f_3 = \mathrm{similarity}(x, l^{(3)})$. Gaussian kernel: $\mathrm{similarity}(x, l^{(i)}) = \exp\!\Big(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\Big)$. (Figure: three landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
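A short sketch of the Gaussian similarity as defined here; the landmark values are made up for illustration.

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma):
    """Gaussian (RBF) kernel: exp(-||x - l||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
l1 = np.array([1.0, 2.0])    # x on top of the landmark  -> similarity 1.0
l2 = np.array([5.0, -1.0])   # x far from the landmark   -> similarity near 0
print(gaussian_similarity(x, l1, sigma=1.0), gaussian_similarity(x, l2, sigma=1.0))
```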
32
Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$. Example: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$, with $f_1 = \mathrm{similarity}(x, l^{(1)})$, $f_2 = \mathrm{similarity}(x, l^{(2)})$, $f_3 = \mathrm{similarity}(x, l^{(3)})$. (Figure: the three landmarks in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
33
Choosing the landmarks
Given $x$: $f_i = \mathrm{similarity}(x, l^{(i)}) = \exp\!\Big(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\Big)$. Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$. Where to get $l^{(1)}, l^{(2)}, l^{(3)}, \ldots, l^{(m)}$? (Figure: training points in the $x_1$–$x_2$ plane used as candidate landmarks.) Slide credit: Andrew Ng
34
SVM with kernels. Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, l^{(3)} = x^{(3)}, \cdots, l^{(m)} = x^{(m)}$. Given an example $x$: $f_1 = \mathrm{similarity}(x, l^{(1)})$, $f_2 = \mathrm{similarity}(x, l^{(2)})$, $\cdots$. For a training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \rightarrow f^{(i)}$, where $f = (f_0, f_1, f_2, \ldots, f_m)^\top$. Slide credit: Andrew Ng
35
SVM with kernels. Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$.
Predict $y = 1$ if $\theta^\top f \ge 0$. Training (original): $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$. Training (with kernel): $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$
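A sketch of the feature mapping this slide uses, building $f = (1, f_1, \ldots, f_m)$ from landmarks placed at the training inputs; it reuses `gaussian_similarity` from the earlier sketch, and `X_train` / `theta` in the comments are hypothetical names, not objects defined on the slides.

```python
import numpy as np

def kernel_features(x, landmarks, sigma):
    """Map x to f = [1, f_1, ..., f_m] with f_i = gaussian_similarity(x, l^(i))."""
    f = [1.0]  # f_0 = 1, the intercept feature
    for l in landmarks:
        f.append(gaussian_similarity(x, l, sigma))
    return np.array(f)

# With landmarks at the training inputs (l^(i) = x^(i)) and a hypothetical
# X_train of shape (m, n), the training examples become
#   F = np.stack([kernel_features(x, X_train, sigma) for x in X_train])
# and a new x is classified as y = 1 when
#   theta @ kernel_features(x, X_train, sigma) >= 0   (theta is hypothetical).
```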
36
SVM parameters. $C = \frac{1}{\lambda}$. Large $C$: lower bias, high variance. Small $C$: higher bias, low variance. $\sigma^2$: large $\sigma^2$ makes the features $f_i$ vary more smoothly (higher bias, lower variance); small $\sigma^2$ makes the features $f_i$ vary less smoothly (lower bias, higher variance). Slide credit: Andrew Ng
37
SVM song
38
SVM Demo
39
Support Vector Machine
Cost function, Large margin classification, Kernels, Using an SVM
40
Using SVM. Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$. Need to specify: choice of parameter $C$, and choice of kernel (similarity function). Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$. Gaussian kernel: $f_i = \exp\!\Big(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\Big)$, where $l^{(i)} = x^{(i)}$; need to choose $\sigma^2$, and need proper feature scaling. Slide credit: Andrew Ng
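As a concrete example of this workflow, here is a sketch using scikit-learn, whose `SVC` wraps libsvm; scikit-learn itself is an assumption (not a package named on the slide), the data is synthetic, and `gamma` plays the role of $\frac{1}{2\sigma^2}$ for the Gaussian (RBF) kernel.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (for illustration only): X is (m, n), y has labels in {0, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # non-linear boundary

# Feature scaling + Gaussian (RBF) kernel; C and gamma = 1/(2 sigma^2) must be chosen.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy of the fitted classifier
```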
41
Kernel (similarity) functions
Note: not all similarity functions make valid kernels. Many off-the-shelf kernels are available: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel. Slide credit: Andrew Ng
42
Multi-class classification
Use the one-vs.-all method. Train $K$ SVMs, one to distinguish $y = i$ from the rest, to get $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(K)}$. Pick the class $i$ with the largest $\big(\theta^{(i)}\big)^\top x$. (Figure: multiple classes in the $x_1$–$x_2$ plane.) Slide credit: Andrew Ng
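A sketch of the one-vs.-all scheme described above; scikit-learn's `LinearSVC` is used here as a stand-in for a linear SVM solver, which is an assumption rather than anything specified on the slide.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, K, C=1.0):
    """Train K binary SVMs, one per class k, each separating y == k from the rest."""
    models = []
    for k in range(K):
        clf = LinearSVC(C=C)
        clf.fit(X, (y == k).astype(int))   # class k vs. all other classes
        models.append(clf)
    return models

def predict_one_vs_all(models, X):
    # Pick the class whose classifier gives the largest score theta^(i)^T x.
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)
```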
43
Logistic regression vs. SVMs
$n$ = number of features ($x \in \mathbb{R}^{n+1}$), $m$ = number of training examples. If $n$ is large relative to $m$ (e.g., $n = 10{,}000$, $m = 10$–$1000$): use logistic regression or an SVM without a kernel (“linear kernel”). If $n$ is small and $m$ is intermediate (e.g., $n = 1$–$1000$, $m = 10$–$10{,}000$): use an SVM with a Gaussian kernel. If $n$ is small and $m$ is large (e.g., $n = 1$–$1000$, $m = 50{,}000+$): create/add more features, then use logistic regression or a linear SVM. A neural network is likely to work well for most of these cases, but is slower to train. Slide credit: Andrew Ng
44
Things to remember
Cost function: $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\mathrm{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$. Large margin classification (figure: decision boundary with margin in the $x_1$–$x_2$ plane). Kernels. Using an SVM.