Support Vector Machine I

Presentation transcript:

Support Vector Machine I Jia-Bin Huang Virginia Tech ECE-5424G / CS-5824 Spring 2019

Administrative Please use Piazza rather than email. HW 0 grades are back; re-grade requests are accepted for one week. HW 1 is due soon. HW 2 is on the way.

Regularized logistic regression (example: classifying tumors by size and age). Cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
with hypothesis
$h_\theta(x) = g\left(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots\right)$
Slide credit: Andrew Ng
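A minimal NumPy sketch of this cost function (the names `sigmoid`, `regularized_cost`, `X`, `y`, and `lam` are illustrative, not from the slides); it assumes a design matrix with a leading column of ones and, as is conventional, leaves $\theta_0$ unpenalized.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Regularized logistic-regression cost J(theta).
    X: (m, n+1) design matrix with a leading column of ones,
    y: (m,) labels in {0, 1}, lam: regularization strength lambda."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return data_term + reg_term
```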

Gradient descent (regularized). Repeat {
$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$  (the bracketed term is $\frac{\partial}{\partial\theta_j}J(\theta)$)
} where $h_\theta(x) = \frac{1}{1+e^{-\theta^\top x}}$.
Slide credit: Andrew Ng
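A sketch of one update step, reusing `sigmoid` from the previous snippet; it uses the usual $\frac{\lambda}{m}\theta_j$ scaling for the penalty gradient and skips regularization of $\theta_0$.

```python
def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update (theta_0 not regularized)."""
    m = len(y)
    h = sigmoid(X @ theta)               # h_theta(x) = 1 / (1 + exp(-theta^T x))
    grad = X.T @ (h - y) / m             # (1/m) * sum_i (h - y) * x_j for every j
    grad[1:] += (lam / m) * theta[1:]    # add (lambda/m) * theta_j for j >= 1
    return theta - alpha * grad          # theta_j := theta_j - alpha * dJ/dtheta_j
```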

$\|\theta\|_1$: Lasso regularization
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator

Single predictor: soft thresholding
$\underset{\theta}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda|\theta|$
$\hat{\theta} = \begin{cases} \frac{1}{m}\langle\mathbf{x},\mathbf{y}\rangle - \lambda & \text{if } \frac{1}{m}\langle\mathbf{x},\mathbf{y}\rangle > \lambda \\ 0 & \text{if } \frac{1}{m}\left|\langle\mathbf{x},\mathbf{y}\rangle\right| \le \lambda \\ \frac{1}{m}\langle\mathbf{x},\mathbf{y}\rangle + \lambda & \text{if } \frac{1}{m}\langle\mathbf{x},\mathbf{y}\rangle < -\lambda \end{cases}$
i.e., $\hat{\theta} = S_\lambda\!\left(\frac{1}{m}\langle\mathbf{x},\mathbf{y}\rangle\right)$, where the soft-thresholding operator is $S_\lambda(x) = \operatorname{sign}(x)\,(|x| - \lambda)_+$.
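A small sketch of the operator and the single-predictor solution; it assumes the predictor column is standardized so that $\frac{1}{m}\mathbf{x}^\top\mathbf{x} = 1$, and the function names are mine.

```python
def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Exact lasso solution with one standardized predictor: S_lambda(<x, y> / m)."""
    m = len(y)
    return soft_threshold(x @ y / m, lam)
```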

Multiple predictors: cyclic coordinate descent
$\underset{\theta}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j + \sum_{k\ne j} x_k^{(i)}\theta_k - y^{(i)}\right)^2 + \lambda\sum_{k\ne j}|\theta_k| + \lambda|\theta_j|$
For each $j$, update $\theta_j$ by solving
$\underset{\theta_j}{\text{minimize}}\ \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j - r_j^{(i)}\right)^2 + \lambda|\theta_j|$, where $r_j^{(i)} = y^{(i)} - \sum_{k\ne j} x_k^{(i)}\theta_k$ is the partial residual.
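A minimal sketch of the cyclic updates, reusing `soft_threshold` from the previous snippet; it assumes each column of `X` is standardized so that $\frac{1}{m}x_j^\top x_j = 1$, which makes every one-dimensional subproblem a single soft-thresholding step.

```python
def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for the lasso (standardized columns assumed)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        for j in range(n):
            # Partial residual r_j: remove feature j's current contribution.
            r_j = y - X @ theta + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_j / m, lam)
    return theta
```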

L1 and L2 balls Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf

Terminology
Regularization function | Name | Solver
$\|\theta\|_2^2 = \sum_{j=1}^{n}\theta_j^2$ | Tikhonov regularization (ridge regression) | Closed form
$\|\theta\|_1 = \sum_{j=1}^{n}|\theta_j|$ | LASSO regression | Proximal gradient descent, least-angle regression
$\alpha\|\theta\|_1 + (1-\alpha)\|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent
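For the "closed form" entry, here is a minimal sketch of the ridge solution $\theta = (X^\top X + \lambda I)^{-1}X^\top y$; for simplicity it penalizes every coefficient, whereas in practice the intercept is usually left out of the penalty.

```python
def ridge_closed_form(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lambda * I) theta = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```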

Things to remember
Overfitting: a complex model does well on the training set but performs poorly on the test set.
Cost function: add a norm penalty (L1 or L2).
Regularized linear regression: gradient descent with weight decay (L2 norm).
Regularized logistic regression: the same norm penalty added to the logistic-regression cost.

Support Vector Machine Cost function Large margin classification Kernels Using an SVM

Support Vector Machine Cost function Large margin classification Kernels Using an SVM

Logistic regression: $h_\theta(x) = g(\theta^\top x)$ with the sigmoid $g(z) = \frac{1}{1+e^{-z}}$ and $z = \theta^\top x$. Suppose we predict "y = 1" if $h_\theta(x) \ge 0.5$, i.e. $\theta^\top x \ge 0$, and predict "y = 0" if $h_\theta(x) < 0.5$, i.e. $\theta^\top x < 0$. Slide credit: Andrew Ng

Alternative view: with $h_\theta(x) = g(\theta^\top x)$ and $g(z) = \frac{1}{1+e^{-z}}$, if "y = 1" we want $h_\theta(x) \approx 1$, i.e. $\theta^\top x \gg 0$; if "y = 0" we want $h_\theta(x) \approx 0$, i.e. $\theta^\top x \ll 0$. Slide credit: Andrew Ng

Cost function for logistic regression:
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$
(Plots: each cost curve as a function of $h_\theta(x) \in [0, 1]$.) Slide credit: Andrew Ng

Alternative view of logistic regression:
$\text{Cost}(h_\theta(x), y) = -y\log h_\theta(x) - (1-y)\log\left(1-h_\theta(x)\right) = y\left(-\log\frac{1}{1+e^{-\theta^\top x}}\right) + (1-y)\left(-\log\left(1-\frac{1}{1+e^{-\theta^\top x}}\right)\right)$
(Plots: the $y=1$ cost $-\log\frac{1}{1+e^{-\theta^\top x}}$ and the $y=0$ cost $-\log\left(1-\frac{1}{1+e^{-\theta^\top x}}\right)$ as functions of $\theta^\top x$.)

Logistic regression (logistic loss):
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log\left(1-h_\theta(x^{(i)})\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Support vector machine (hinge loss):
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
(Plots: $\text{cost}_1$ and $\text{cost}_0$ are piecewise-linear approximations of the $y=1$ and $y=0$ logistic-loss curves, as functions of $\theta^\top x$.)
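A short sketch comparing the two surrogate losses as functions of $z = \theta^\top x$, reusing `sigmoid` from above; the hinge form $\max(0, 1-z)$ is the standard choice for $\text{cost}_1$ and is an assumption here rather than a quote from the slides.

```python
z = np.linspace(-3, 3, 200)

# y = 1 case
logistic_loss_y1 = -np.log(sigmoid(z))        # -log(1 / (1 + e^{-z}))
hinge_loss_y1 = np.maximum(0.0, 1.0 - z)      # cost_1(z): zero once z >= 1

# y = 0 case (mirror image around z = 0)
logistic_loss_y0 = -np.log(1.0 - sigmoid(z))  # -log(1 - 1 / (1 + e^{-z}))
hinge_loss_y0 = np.maximum(0.0, 1.0 + z)      # cost_0(z): zero once z <= -1
```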

Optimization objective for SVM. Start from
$\min_\theta\ \frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
1) Multiply through by $m$ (scaling by a constant does not change the minimizing $\theta$). 2) Reparameterize with $C = \frac{1}{\lambda}$:
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Slide credit: Andrew Ng

Hypothesis of SVM. With
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
the hypothesis is
$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^\top x \ge 0 \\ 0 & \text{if } \theta^\top x < 0 \end{cases}$
Slide credit: Andrew Ng

Support Vector Machine Cost function Large margin classification Kernels Using an SVM

Support vector machine
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
If "y = 1", we want $\theta^\top x \ge 1$ (not just $\ge 0$); if "y = 0", we want $\theta^\top x \le -1$ (not just $< 0$). (Plots: $\text{cost}_1(\theta^\top x)$ is zero for $\theta^\top x \ge 1$; $\text{cost}_0(\theta^\top x)$ is zero for $\theta^\top x \le -1$.) Slide credit: Andrew Ng

SVM decision boundary
$\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Let's say we have a very large $C$. Then the optimizer drives the first term to zero: whenever $y^{(i)} = 1$, $\theta^\top x^{(i)} \ge 1$; whenever $y^{(i)} = 0$, $\theta^\top x^{(i)} \le -1$. The problem becomes
$\min_\theta\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.} \quad \theta^\top x^{(i)} \ge 1 \text{ if } y^{(i)} = 1, \quad \theta^\top x^{(i)} \le -1 \text{ if } y^{(i)} = 0$
Slide credit: Andrew Ng

SVM decision boundary: linearly separable case (figure: training data in the $x_1$–$x_2$ plane). Slide credit: Andrew Ng

SVM decision boundary: linearly separable case (figure: the separating boundary and its margin in the $x_1$–$x_2$ plane). Slide credit: Andrew Ng

Large margin classifier in the presence of outliers (figure: in the $x_1$–$x_2$ plane, the decision boundary when $C$ is very large versus when $C$ is not too large). Slide credit: Andrew Ng

Vector inner product. Let $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ and $v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$. Then $\|u\|$ = length of vector $u$ = $\sqrt{u_1^2 + u_2^2} \in \mathbb{R}$, and $p$ = (signed) length of the projection of $v$ onto $u$. We have $u^\top v = p\cdot\|u\| = u_1 v_1 + u_2 v_2$. Slide credit: Andrew Ng
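A tiny NumPy check of this identity; the particular vectors are arbitrary examples of mine.

```python
u = np.array([2.0, 1.0])      # arbitrary example vectors
v = np.array([1.0, 3.0])

norm_u = np.sqrt(u @ u)       # ||u|| = sqrt(u1^2 + u2^2)
p = (u @ v) / norm_u          # signed length of the projection of v onto u
assert np.isclose(u @ v, p * norm_u)   # u^T v = p * ||u|| = u1*v1 + u2*v2
```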

SVM decision boundary. Simplification: $\theta_0 = 0$, $n = 2$. Then
$\frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right) = \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 = \frac{1}{2}\|\theta\|^2$
and the problem is $\min_\theta\ \frac{1}{2}\|\theta\|^2$ s.t. $\theta^\top x^{(i)} \ge 1$ if $y^{(i)} = 1$, $\theta^\top x^{(i)} \le -1$ if $y^{(i)} = 0$. What is $\theta^\top x^{(i)}$? By the inner-product picture, $\theta^\top x^{(i)} = p^{(i)}\,\|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$. Slide credit: Andrew Ng

SVM decision boundary. Simplification: $\theta_0 = 0$, $n = 2$.
$\min_\theta\ \frac{1}{2}\|\theta\|^2$ s.t. $p^{(i)}\,\|\theta\| \ge 1$ if $y^{(i)} = 1$, $p^{(i)}\,\|\theta\| \le -1$ if $y^{(i)} = 0$.
If the projections $p^{(1)}, p^{(2)}$ are small, then $\|\theta\|$ must be large; if $p^{(1)}, p^{(2)}$ are large, then $\|\theta\|$ can be small. Minimizing $\|\theta\|$ therefore favors boundaries with large projections, i.e. a large margin. Slide credit: Andrew Ng

Support Vector Machine Cost function Large margin classification Kernels Using an SVM

Non-linear decision boundary. Predict $y = 1$ if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0$, i.e. $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots$ with $f_1 = x_1$, $f_2 = x_2$, $f_3 = x_1 x_2$, $\cdots$. Is there a different/better choice of the features $f_1, f_2, f_3, \cdots$? Slide credit: Andrew Ng

Kernel. Given $x$, compute new features based on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$ in the $x_1$–$x_2$ plane:
$f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $f_3 = \text{similarity}(x, l^{(3)})$
Gaussian kernel: $\text{similarity}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$. Slide credit: Andrew Ng
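A minimal sketch of the Gaussian similarity feature (function name mine); it returns a value close to 1 when $x$ is near the landmark and close to 0 far away.

```python
def gaussian_similarity(x, landmark, sigma):
    """f = exp(-||x - l||^2 / (2 sigma^2)): ~1 near the landmark, ~0 far away."""
    diff = x - landmark
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))
```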

Predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$, with $f_i = \text{similarity}(x, l^{(i)})$. Example: $\theta_0 = -0.5$, $\theta_1 = 1$, $\theta_2 = 1$, $\theta_3 = 0$, so points near $l^{(1)}$ or $l^{(2)}$ are classified as $y = 1$. Slide credit: Andrew Ng

Choosing the landmarks. Given $x$, compute $f_i = \text{similarity}(x, l^{(i)}) = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$ and predict $y = 1$ if $\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 \ge 0$. Where do we get $l^{(1)}, l^{(2)}, l^{(3)}, \cdots, l^{(m)}$? Place one landmark at each training example. Slide credit: Andrew Ng

SVM with kernels. Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \cdots, l^{(m)} = x^{(m)}$. Given an example $x$, compute $f_1 = \text{similarity}(x, l^{(1)})$, $f_2 = \text{similarity}(x, l^{(2)})$, $\cdots$. For training example $(x^{(i)}, y^{(i)})$: $x^{(i)} \to f^{(i)}$, where $f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_m \end{bmatrix}$ with $f_0 = 1$. Slide credit: Andrew Ng
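A sketch of this feature mapping for a whole dataset at once (function name mine); it assumes the landmarks are the training examples and uses the Gaussian similarity defined above.

```python
def kernel_features(X, landmarks, sigma):
    """Map each row of X to f = [f_0, f_1, ..., f_m] with f_0 = 1 and
    f_i = Gaussian similarity to landmark l^(i) (the training examples)."""
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))       # (num_examples, m)
    return np.hstack([np.ones((X.shape[0], 1)), F])  # prepend f_0 = 1
```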

SVM with kernels. Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$; predict $y = 1$ if $\theta^\top f \ge 0$.
Training (original): $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Training (with kernel): $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$

SVM parameters. $C = \frac{1}{\lambda}$: large $C$ means lower bias, higher variance; small $C$ means higher bias, lower variance. $\sigma^2$: large $\sigma^2$ makes the features $f_i$ vary more smoothly (higher bias, lower variance); small $\sigma^2$ makes them vary less smoothly (lower bias, higher variance). Slide credit: Andrew Ng

SVM song (video source: https://www.youtube.com/watch?v=g15bqtyidZs)

SVM Demo https://cs.stanford.edu/people/karpathy/svmjs/demo/

Support Vector Machine Cost function Large margin classification Kernels Using an SVM

Using an SVM. Use an SVM software package (e.g., liblinear, libsvm) to solve for $\theta$. You need to specify: the choice of parameter $C$, and the choice of kernel (similarity function). Linear kernel: predict $y = 1$ if $\theta^\top x \ge 0$. Gaussian kernel: $f_i = \exp\!\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$ with $l^{(i)} = x^{(i)}$; you also need to choose $\sigma^2$ and apply proper feature scaling. Slide credit: Andrew Ng
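A hedged end-to-end sketch using scikit-learn, whose `SVC` wraps libsvm; the toy dataset and the values of `C` and `gamma` are placeholders of mine, not recommendations, and `gamma` plays the role of $\frac{1}{2\sigma^2}$ in the notation above.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data standing in for a real dataset.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler provides the feature scaling mentioned above;
# kernel="rbf" is the Gaussian kernel, gamma ~ 1 / (2 * sigma^2).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # held-out accuracy
```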

Kernel (similarity) functions. Note: not all similarity functions make valid kernels. Many off-the-shelf kernels are available: polynomial kernel, string kernel, chi-square kernel, histogram intersection kernel. Slide credit: Andrew Ng

Multi-class classification (figure: multiple classes in the $x_1$–$x_2$ plane). Use the one-vs.-all method: train $K$ SVMs, one to distinguish $y = i$ from the rest, obtaining $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(K)}$; then pick the class $i$ with the largest $(\theta^{(i)})^\top x$. Slide credit: Andrew Ng
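A minimal sketch of the one-vs.-all prediction step (function and array names mine); it assumes the $K$ per-class parameter vectors have already been trained and stacked into one array.

```python
def one_vs_all_predict(Theta, X):
    """Theta: (K, n+1) array whose i-th row is theta^(i) from the SVM trained to
    separate class i from the rest; X: (m, n+1) design matrix with a leading 1.
    Returns, for each example, the class i with the largest (theta^(i))^T x."""
    scores = X @ Theta.T           # (m, K) matrix of per-class scores
    return np.argmax(scores, axis=1)
```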

Logistic regression vs. SVMs. Let $n$ = number of features ($x \in \mathbb{R}^{n+1}$) and $m$ = number of training examples.
If $n$ is large relative to $m$ (e.g., $n = 10{,}000$, $m = 10$–$1{,}000$): use logistic regression or an SVM without a kernel ("linear kernel").
If $n$ is small and $m$ is intermediate ($n = 1$–$1{,}000$, $m = 10$–$10{,}000$): use an SVM with a Gaussian kernel.
If $n$ is small and $m$ is large ($n = 1$–$1{,}000$, $m = 50{,}000+$): create/add more features, then use logistic regression or a linear SVM.
A neural network is likely to work well in most of these cases, but may be slower to train. Slide credit: Andrew Ng

Things to remember. Cost function: $\min_\theta\ C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^\top f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^\top f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{m}\theta_j^2$. Large margin classification (figure: the max-margin boundary and its margin in the $x_1$–$x_2$ plane). Kernels. Using an SVM.