CS480/680: Intro to ML
Lecture 08: Soft-margin SVM
Yao-Liang Yu, 10/11/18
Outline: Formulation, Dual, Optimization, Extension
Hard-margin SVM
- Primal with hard constraints, and its dual.
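The slide's formulas did not survive extraction; the standard hard-margin primal and dual (the slide's own notation may differ) are:

\min_{w,b}\ \tfrac12\|w\|^2 \quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\ \ i=1,\dots,n,

\max_{\alpha \ge 0}\ \sum_i \alpha_i - \tfrac12\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j \quad\text{s.t.}\quad \sum_i \alpha_i y_i = 0.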
What if the data is inseparable?
Soft-margin SVM (Cortes & Vapnik’95)
- Primal with hard constraints: minimize ‖w‖²/2, where ‖w‖ is proportional to 1/margin.
- Primal with soft constraints: a hyper-parameter C trades the margin term against the training error, and the constraints use the raw prediction w·x + b (no sign).
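In standard notation (a reconstruction, since the slide's equations were lost), the soft-constraint primal with slack variables ξ_i and hyper-parameter C is

\min_{w,b,\xi}\ \tfrac12\|w\|^2 + C\sum_{i=1}^n \xi_i \quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1-\xi_i,\ \ \xi_i \ge 0,

which is equivalent to the unconstrained hinge-loss form \min_{w,b}\ \tfrac12\|w\|^2 + C\sum_i \max\bigl(0,\ 1-y_i(w^\top x_i+b)\bigr).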
Zero-one loss
Find a prediction rule f so that, on an unseen random X, our prediction sign(f(X)) has a small chance of being different from the true label Y.
The hinge loss
- The hinge loss upper-bounds the zero-one loss.
- [Plot comparing the zero-one, hinge, squared hinge, logistic, and exponential losses as functions of the margin.]
- Correctly classified points with margin below 1 still suffer (hinge) loss!
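Written in terms of the margin t = y f(x), the losses in the plot are (a standard reconstruction; the slide may scale them differently):

\ell_{0/1}(t)=\mathbf{1}[t\le 0],\quad
\ell_{\mathrm{hinge}}(t)=\max(0,1-t),\quad
\ell_{\mathrm{sq\text{-}hinge}}(t)=\max(0,1-t)^2,\quad
\ell_{\exp}(t)=e^{-t},\quad
\ell_{\log}(t)=\log_2(1+e^{-t}).

Each surrogate upper-bounds the zero-one loss (the logistic loss needs the base-2 logarithm for this), and the hinge loss is positive whenever t < 1, even for correctly classified points.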
Classification-calibration
- We want to minimize the zero-one loss, but end up minimizing some other (surrogate) loss.
- Theorem (Bartlett, Jordan, McAuliffe’06). Any convex margin loss ℓ is classification-calibrated iff ℓ is differentiable at 0 and ℓ’(0) < 0.
- Classification calibration: the population minimizer of the surrogate risk has the same sign as P(Y=1|X=x) − 1/2, i.e., it agrees with the Bayes rule.
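As a quick check of the theorem (not on the slide): the hinge loss ℓ(t) = max(0, 1−t) is convex and, near t = 0, equals 1−t, so it is differentiable at 0 with ℓ’(0) = −1 < 0; hence it is classification-calibrated. The zero-one loss itself is neither convex nor differentiable at 0, so the theorem does not apply to it.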
Outline: Formulation, Dual, Optimization, Extension
Important optimization trick
- Rewrite a maximum inside the objective as a joint minimization over x and an auxiliary variable t (see the reconstruction below).
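A standard form of this trick (the slide's own equations were lost): a pointwise maximum in the objective can be replaced by an auxiliary variable and inequality constraints,

\min_x\ \max\{f(x),\,g(x)\}\ =\ \min_{x,\,t}\ t \quad\text{s.t.}\quad f(x)\le t,\ \ g(x)\le t,

where the minimization is joint over x and t. Applied to the hinge term, \max\bigl(0,\ 1-y_i(w^\top x_i+b)\bigr) becomes \xi_i with constraints \xi_i\ge 0 and \xi_i\ge 1-y_i(w^\top x_i+b).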
Slack for “wrong” prediction
Lagrangian
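With multipliers α_i ≥ 0 for the margin constraints and β_i ≥ 0 for ξ_i ≥ 0, the Lagrangian of the soft-margin primal is (standard form; the slide's notation may differ):

L(w,b,\xi,\alpha,\beta)=\tfrac12\|w\|^2 + C\sum_i\xi_i + \sum_i \alpha_i\bigl(1-\xi_i-y_i(w^\top x_i+b)\bigr) - \sum_i \beta_i\xi_i.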
Dual problem
- Only dot products between the data points are needed!
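Minimizing the Lagrangian over (w, b, ξ) and maximizing over the multipliers gives the standard soft-margin dual (reconstructed):

\max_{\alpha}\ \sum_i\alpha_i-\tfrac12\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j
\quad\text{s.t.}\quad 0\le\alpha_i\le C,\ \ \sum_i\alpha_i y_i=0,

where the data enter only through the dot products x_i^\top x_j.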
The effect of C
- What happens as C → 0? As C → ∞? (C → ∞ forces the slacks to 0, recovering the hard-margin SVM.)
- Primal variables (w, b) ∈ R^d × R; dual variable α ∈ R^n.
Karush-Kuhn-Tucker conditions
- Primal constraints on w, b and ξ
- Dual constraints on α and β
- Complementary slackness
- Stationarity
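Spelled out for the soft-margin SVM (standard form, reconstructed from the Lagrangian above):

Primal feasibility: \ y_i(w^\top x_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0.
Dual feasibility: \ \alpha_i\ge 0,\ \ \beta_i\ge 0.
Complementary slackness: \ \alpha_i\bigl(1-\xi_i-y_i(w^\top x_i+b)\bigr)=0,\ \ \beta_i\xi_i=0.
Stationarity: \ w=\sum_i\alpha_i y_i x_i,\ \ \sum_i\alpha_i y_i=0,\ \ C=\alpha_i+\beta_i.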
Parsing the equations
Support Vectors
Recover b
- Take any i such that 0 < α_i < C. Then x_i lies on the margin hyperplane, so b can be solved for.
- How to recover ξ?
- What if there is no such i?
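Concretely (standard recovery, reconstructed from the KKT conditions): for any i with 0 < α_i < C,

y_i(w^\top x_i + b) = 1 \ \Rightarrow\ b = y_i - w^\top x_i \ \ (\text{since } y_i^2=1), \qquad
\xi_j = \max\bigl(0,\ 1 - y_j(w^\top x_j + b)\bigr)\ \ \text{for all } j.

In practice b is often averaged over all such i; if no such i exists, b is only pinned down to an interval by the remaining constraints and is typically chosen by a one-dimensional search.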
More examples
Outline: Formulation, Dual, Optimization, Extension
Gradient Descent
- Step size (learning rate): constant if L is smooth; diminishing otherwise.
- Uses the (generalized) gradient.
- Each full-gradient step costs O(nd)! (See the sketch below.)
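A minimal full-batch (sub)gradient descent sketch on the soft-margin primal (not from the slides; the function name, step size, and iteration count are illustrative assumptions):

import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Full-batch (sub)gradient descent on
       0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w @ x_i + b)).
       X: (n, d) array; y: (n,) array with labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)        # y_i * f(x_i)
        viol = margins < 1               # points that suffer hinge loss
        # each iteration touches all n points -> O(nd)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b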
Stochastic Gradient Descent (SGD)
- The full gradient is an average over n samples; a single random sample suffices.
- Each update costs only O(d).
- Use a diminishing step size, e.g., 1/sqrt(t) or 1/t.
- Refinements: averaging, momentum, variance reduction, etc.
- Sampling: without replacement; cycle; permute in each pass. (See the sketch below.)
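A stochastic counterpart, one random example per update with a ~1/t step size. This sketch uses the averaged parameterization (λ/2)‖w‖² + (1/n)Σ hinge, which matches the C form with C = 1/(λn); the names and constants are illustrative, not the lecture's exact algorithm:

import numpy as np

def svm_sgd(X, y, lam=0.01, n_epochs=10, seed=0):
    """SGD on (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i*(w @ x_i + b)).
       One random sample per update -> O(d) per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):      # permute in each pass (sampling w/o replacement)
            t += 1
            eta = 1.0 / (lam * t)         # diminishing step size ~ 1/t
            active = y[i] * (X[i] @ w + b) < 1
            grad_w = lam * w - (y[i] * X[i] if active else 0.0)
            grad_b = -y[i] if active else 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b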
The derivative
- What about the zero-one loss?
- All the other losses are differentiable.
- What about the perceptron?
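For reference (standard computation, not copied from the slide): the hinge term \ell_i(w,b)=\max\bigl(0,\ 1-y_i(w^\top x_i+b)\bigr) has subgradient

\partial_w \ell_i = \begin{cases} -y_i x_i & \text{if } y_i(w^\top x_i+b) < 1,\\ 0 & \text{if } y_i(w^\top x_i+b) > 1,\end{cases}

(anything in between at equality), and similarly for b with x_i replaced by 1. The zero-one loss, by contrast, has zero derivative almost everywhere, so (sub)gradient methods receive no signal from it.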
Solving the dual
- Each (projected) gradient step costs O(n²).
- A constant step size η_t = η can be chosen.
- Faster algorithms exist: e.g., choose a pair (α_p, α_q) and derive a closed-form update (the idea behind SMO). A simple sketch follows below.
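A projected-gradient sketch on the dual (illustrative, not the lecture's exact method; note that the equality constraint Σ_i α_i y_i = 0 is only handled approximately here):

import numpy as np

def svm_dual_projected_gradient(X, y, C=1.0, eta=0.001, n_iters=1000):
    """Projected gradient ascent on
       max_a  sum(a) - 0.5 * a^T Q a,   0 <= a_i <= C,
       where Q_ij = y_i y_j x_i . x_j.  Each step costs O(n^2) once Q is formed.
       The box constraint is enforced by clipping; proper solvers also project
       onto sum_i a_i y_i = 0, or update pairs (a_p, a_q) in closed form (SMO)."""
    n = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                            # Q_ij = y_i y_j <x_i, x_j>
    a = np.zeros(n)
    for _ in range(n_iters):
        grad = 1.0 - Q @ a                   # gradient of the dual objective
        a = np.clip(a + eta * grad, 0.0, C)  # ascent step + projection onto the box
    w = (a[:, None] * Yx).sum(axis=0)        # recover w = sum_i a_i y_i x_i
    return a, w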
A little history on optimization
- Gradient descent first mentioned in (Cauchy, 1847).
- First rigorous convergence proof in (Curry, 1944).
- SGD proposed and analyzed in (Robbins & Monro, 1951).
Herbert Robbins (1915 – 2001)
Outline: Formulation, Dual, Optimization, Extension
Multiclass (Crammer & Singer’01)
- Separate the prediction for the correct class from the predictions for the wrong classes by a “safety margin”.
- The soft-margin version is similar.
- Many other variants exist.
- The calibration theory is more involved.
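The slide's formulation did not survive extraction; the standard Crammer–Singer hard-margin program is

\min_{W}\ \tfrac12\sum_k\|w_k\|^2 \quad\text{s.t.}\quad w_{y_i}^\top x_i \ \ge\ w_k^\top x_i + 1 \ \ \text{for all } i \text{ and all } k\ne y_i,

with prediction \hat y(x)=\arg\max_k\, w_k^\top x. The soft-margin version adds one slack ξ_i per example, replacing 1 by 1−ξ_i and penalizing C\sum_i\xi_i.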
Regression (Drucker et al.’97)
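The slide's equations were lost; the standard ε-insensitive support vector regression of Drucker et al. is

\min_{w,b,\xi,\xi^*}\ \tfrac12\|w\|^2 + C\sum_i(\xi_i+\xi_i^*)\quad\text{s.t.}\quad
y_i - w^\top x_i - b \le \varepsilon+\xi_i,\ \ w^\top x_i + b - y_i \le \varepsilon+\xi_i^*,\ \ \xi_i,\xi_i^*\ge 0.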
Large-scale training (You, Demmel, et al.’17)
- Randomly partition the training data evenly across p nodes.
- Train an SVM independently on each node.
- Compute the center of the data on each node.
- For a test sample: find the nearest center (node / SVM) and predict using the corresponding node / SVM. (See the sketch below.)
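A sketch of this scheme (the base solver LinearSVC, the use of the plain data mean as the center, and other details are illustrative assumptions, not taken from the paper):

import numpy as np
from sklearn.svm import LinearSVC

def train_partitioned_svm(X, y, p=4, C=1.0, seed=0):
    """Randomly partition the data into p parts, train one linear SVM per part,
       and remember each part's center."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), p)   # random, roughly even partition
    models, centers = [], []
    for part in parts:
        models.append(LinearSVC(C=C).fit(X[part], y[part]))  # train independently per node
        centers.append(X[part].mean(axis=0))                  # center of this node's data
    return models, np.vstack(centers)

def predict_partitioned_svm(models, centers, X_test):
    """For each test point, find the nearest center and use that node's SVM."""
    preds = []
    for x in X_test:
        node = np.argmin(np.linalg.norm(centers - x, axis=1))
        preds.append(models[node].predict(x[None, :])[0])
    return np.array(preds)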
Questions?