
Machine Learning Week 4 Lecture 1

Hand-In Data: it is coming online later today. I keep a test set of approximately 1000 test images; that will be your real test. You are most welcome to add regularization as we discussed last week, but it is not a requirement. Hand-in version 4 is available.

Recap: overfitting, what is going on, and ways to fix it.

Overfitting: as the amount of data increases, overfitting decreases; as noise increases, overfitting increases; as target complexity increases, overfitting increases.

Learning Theory Perspective: out-of-sample error is bounded by in-sample error plus model complexity. Instead of picking a simpler hypothesis set, prefer "simpler" hypotheses h from the set we already have. Define what "simple" means with a complexity measure Ω(h) and minimize the augmented error E_in(h) + λ Ω(h).

Regularization: minimize in-sample error plus model complexity. Weight decay uses the penalty λ w^T w, so the gradient step becomes w <- (1 - 2ηλ) w - η ∇E_in(w): every round we take a step towards the zero vector before following the error gradient.
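As a concrete illustration, here is a minimal Python/NumPy sketch of one such step; the loss gradient, learning rate and λ values are made up for illustration and are not the course's hand-in code.

    import numpy as np

    def weight_decay_step(w, grad_loss, lr=0.1, lam=0.01):
        # Augmented error E_in(w) + lam * w.w adds 2*lam*w to the gradient.
        # Equivalently: shrink w towards the zero vector, then follow the loss gradient.
        return (1 - 2 * lr * lam) * w - lr * grad_loss

    w = np.array([1.0, -2.0, 0.5])
    grad = np.array([0.2, -0.1, 0.0])   # gradient of the unregularized in-sample error at w
    w = weight_decay_step(w, grad)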

Why are small weights better? A practical perspective: because in practice we believe that noise is noisy. Stochastic noise is high frequency, and deterministic noise is also non-smooth, so penalizing large weights biases us towards smoother hypotheses. Sometimes the weights are weighted differently in the penalty, and the bias term gets a free ride (it is usually not penalized).

Regularization Summary: more art than science. Use VC theory and bias-variance as guides. Weight decay is a universal technique, based on the practical belief that noise is noisy (non-smooth). Open question: which λ to use? Many other regularizers exist. Regularization is extremely important; to quote the book, it is a "necessary evil".

Validation: regularization estimates the overfit penalty, while validation estimates the out-of-sample error directly. Remember to keep the test set untouched.

Model Selection: given t models m_1, ..., m_t, which is better? Train each on D_train, validate on D_val, and compute E_val(m_1), E_val(m_2), ..., E_val(m_t). Pick the model with the minimum validation error. I use this to find λ for my weight decay.
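A hedged sketch of this recipe in Python, using linear regression with weight decay (ridge regression) as the model family; the toy data, split sizes and candidate λ values are illustrative assumptions, not the course setup.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                       # toy data, not the hand-in data
    y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=100)

    X_train, y_train = X[:80], y[:80]                   # D_train
    X_val, y_val = X[80:], y[80:]                       # D_val

    def train_ridge(X, y, lam):
        # Linear regression with weight decay, solved in closed form.
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def e_val(w, X, y):
        return np.mean((X @ w - y) ** 2)

    lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]               # candidate models m_1, ..., m_t
    errors = [e_val(train_ridge(X_train, y_train, lam), X_val, y_val) for lam in lambdas]
    best_lam = lambdas[int(np.argmin(errors))]          # pick the minimum E_val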

Cross Validation: the dilemma of increasing the validation set size K is that the E_val estimate tightens but E_val itself increases, because less data is left for training. Small K versus large K: we would like the best of both. Cross validation gives us that.

K-Fold Cross Validation: split the data into N/K parts of size K. For each part, train on all but that part and test on the remaining part. Pick the model that is best on average over the N/K partitions. The usual choice is K = N/10, i.e. 10 folds (we do not have all day).
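The same procedure sketched in Python; note that the slide's K is the size of each part, so there are N/K folds, while the code parameterizes by the number of folds directly. Data and model are the same illustrative assumptions as before.

    import numpy as np

    def cross_val_error(X, y, lam, n_folds=10):
        # Split the indices into n_folds parts; train on all but one part, test on the held-out part.
        folds = np.array_split(np.arange(len(y)), n_folds)
        errors = []
        for i in range(n_folds):
            val_idx = folds[i]
            tr_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            w = np.linalg.solve(X[tr_idx].T @ X[tr_idx] + lam * np.eye(X.shape[1]),
                                X[tr_idx].T @ y[tr_idx])
            errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
        return np.mean(errors)                # average over the N/K partitions

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))             # illustrative toy data
    y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=100)
    lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
    best_lam = min(lambdas, key=lambda lam: cross_val_error(X, y, lam))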

Today: Support Vector Machines. Margin intuition, the optimization problem, convex optimization, Lagrange multipliers, and the Lagrangian for the SVM. WARNING: a fair amount of linear algebra and analysis coming up.

Support Vector Machines: an overview of what we cover today and what is left for next time.

Notation: the target y is in {-1,+1}. We write the parameters as w and b. The hyperplane we consider is w^T x + b = 0. The data is D = {(x_i, y_i)}. For now, assume D is linearly separable.

Hyperplanes again: if w^T x + b ≥ 0, return +1; else return -1. For w, b to classify x_i correctly we need y_i (w^T x_i + b) > 0.
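In NumPy the decision rule is one line; w, b and the query points below are placeholder values.

    import numpy as np

    def predict(w, b, X):
        # Return +1 where w^T x + b >= 0, else -1.
        return np.where(X @ w + b >= 0, 1, -1)

    w, b = np.array([1.0, -1.0]), 0.5           # illustrative parameters
    X = np.array([[2.0, 0.0], [-1.0, 3.0]])
    print(predict(w, b, X))                     # [ 1 -1]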

Functional Margins: which prediction are you more certain about? Intuition: find w such that for every x_i, |w^T x_i + b| is large and x_i is classified correctly.

Functional Margins (useful later): for each point we define the functional margin γ̂_i = y_i (w^T x_i + b). Define the functional margin of the hyperplane, i.e. of the parameters w, b, as γ̂ = min_i γ̂_i. It is negative if w, b misclassifies a point.
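In code, the per-point functional margins and the hyperplane's functional margin are simply the following (w, b and the points are again placeholder values):

    import numpy as np

    X = np.array([[2.0, 0.0], [-1.0, 3.0], [0.5, -0.5]])
    y = np.array([1, -1, 1])
    w, b = np.array([1.0, -1.0]), 0.0

    functional_margins = y * (X @ w + b)            # per-point margin y_i (w^T x_i + b)
    hyperplane_margin = functional_margins.min()    # negative if some point is misclassified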

Geometric Margin. Idea: maximize the geometric margin. Let's get to work.

Learning Theory Perspective: there are far fewer large-margin hyperplanes than arbitrary separating hyperplanes, so demanding a large margin effectively reduces the complexity of the hypothesis set.

Geometric Margin: how far is x_i from the hyperplane? Let L be the point where the segment from x_i, dropped along the direction w, meets the hyperplane, and let γ_i be its length, so L = x_i - γ_i w/||w||. Since L is on the hyperplane, w^T L + b = 0. Multiply in, using the definition of L, and solve: γ_i = (w^T x_i + b)/||w||.

Geometric Margin: if x_i is on the other side of the hyperplane we get an identical calculation with the sign flipped. In general we get γ_i = y_i (w^T x_i + b)/||w||.

Geometric Margin, picture view: normalize so that ||w|| = 1 and extend w to an orthonormal basis. The distance from x_i to the hyperplane is then the length of the projection of x_i onto w plus the shift, i.e. the distance in the w direction plus b.

Geometric Margins: for each point we define the geometric margin γ_i = y_i (w^T x_i + b)/||w||. Define the geometric margin of the hyperplane, i.e. of the parameters w, b, as γ = min_i γ_i. The geometric margin is invariant under scaling of w, b.
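The geometric margins differ from the functional margins only by the factor 1/||w||, and scaling w, b indeed leaves them unchanged; a quick check with placeholder values:

    import numpy as np

    X = np.array([[2.0, 0.0], [-1.0, 3.0], [0.5, -0.5]])
    y = np.array([1, -1, 1])
    w, b = np.array([1.0, -1.0]), 0.0

    def geometric_margin(w, b, X, y):
        # gamma = min_i y_i (w^T x_i + b) / ||w||
        return (y * (X @ w + b) / np.linalg.norm(w)).min()

    print(geometric_margin(w, b, X, y))
    print(geometric_margin(10 * w, 10 * b, X, y))   # same value: invariant under scaling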

Functional and Geometric Margins: they are related by ||w||, namely γ = γ̂ / ||w||.

Optimization: maximize the geometric margin γ subject to every point having margin at least γ, i.e. y_i (w^T x_i + b) ≥ γ̂ for all i, with γ = γ̂ / ||w||. We may scale w, b any way we want, and rescaling w, b rescales the functional margin γ̂, so we can force γ̂ = 1 as a scale constraint.

Optimization: with γ̂ forced to 1, we maximize 1/||w|| subject to y_i (w^T x_i + b) ≥ 1 for all i. Maximizing 1/||w|| is the same as minimizing ||w||^2, so we minimize (1/2)||w||^2 instead. This is a quadratic program, hence convex.

Linearly Separable SVM: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i. This is a constrained problem; we need to study the theory of Lagrange multipliers to understand it in detail.
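Since this primal is a small convex QP, it can also be handed directly to a generic solver. Below is a hedged sketch using the cvxpy library on made-up, linearly separable toy data; the course does not prescribe this solver or this data.

    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])   # toy, separable
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(X.shape[1])
    b = cp.Variable()
    # Minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i.
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                         [cp.multiply(y, X @ w + b) >= 1])
    problem.solve()
    print(w.value, b.value)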

Lagrange Multipliers: for the problem of minimizing f(x) subject to g_i(x) ≤ 0 and h_i(x) = 0, define the Lagrangian L(x, α, β) = f(x) + Σ_i α_i g_i(x) + Σ_i β_i h_i(x). We only consider convex f, g_i and affine h_i (the method is more general). α, β are called Lagrange multipliers.

Primal Problem: consider θ_P(x) = max over α ≥ 0 and β of L(x, α, β). x is primal infeasible if g_i(x) > 0 for some i or h_i(x) ≠ 0 for some i. If x is primal infeasible, then either g_i(x) > 0 for some i and maximizing over α_i ≥ 0 makes α_i g_i(x) unbounded, or h_i(x) ≠ 0 for some i and maximizing over β makes β_i h_i(x) unbounded; either way θ_P(x) = ∞.

If x is primal feasible: g_i(x) ≤ 0 for all i, so maximizing over α_i ≥ 0 gives the optimum at α_i g_i(x) = 0; and h_i(x) = 0 for all i, so β_i h_i(x) = 0 and β is irrelevant. Hence θ_P(x) = f(x) for feasible x.

Primal Problem: we have turned the constraints into an ∞ value inside the objective function, so p* = min_x θ_P(x) = min_x max over α ≥ 0, β of L(x, α, β) is exactly the constrained optimum we are looking for, and a minimizer of θ_P is an optimal x.
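Written out in LaTeX (standard convex optimization notation), the argument of the last three slides is:

    \theta_P(x) = \max_{\alpha \ge 0,\,\beta} L(x,\alpha,\beta)
                = \begin{cases} f(x) & \text{if } x \text{ is primal feasible,} \\ \infty & \text{otherwise,} \end{cases}
    \qquad
    p^* = \min_x \theta_P(x) = \min_x \max_{\alpha \ge 0,\,\beta} L(x,\alpha,\beta).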

Dual Problem: define θ_D(α, β) = min_x L(x, α, β), and call α, β dual feasible if α_i ≥ 0 for all i. This implies θ_D(α, β) ≤ p* for every dual feasible α, β.

Weak and Strong Duality: weak duality says the dual optimum d* = max over dual feasible α, β of θ_D(α, β) is at most the primal optimum p*. Question: when are they equal?
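The weak duality inequality behind the question, written out in LaTeX (for a primal optimal x* and any dual feasible α, β):

    \theta_D(\alpha,\beta) = \min_x L(x,\alpha,\beta) \le L(x^*,\alpha,\beta)
        = f(x^*) + \sum_i \alpha_i g_i(x^*) + \sum_i \beta_i h_i(x^*) \le f(x^*) = p^*,
    \qquad\text{hence}\qquad
    d^* = \max_{\alpha \ge 0,\,\beta} \theta_D(\alpha,\beta) \le p^*.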

Strong Duality, Slater's Condition: if f and the g_i are convex, the h_i are affine, and the problem is strictly feasible, i.e. there exists a primal feasible x such that g_i(x) < 0 for all i, then d* = p* (strong duality). Assume from now on that this is the case.

Complementary Slackness: let x* be primal optimal and α*, β* dual optimal with p* = d*. Then f(x*) = θ_D(α*, β*) ≤ L(x*, α*, β*) = f(x*) + Σ_i α_i* g_i(x*) ≤ f(x*), so everything holds with equality. Since every term α_i* g_i(x*) is non-positive and they sum to zero, α_i* g_i(x*) = 0 for all i. This is complementary slackness.

Karush-Kuhn-Tucker (KKT) Conditions: let x* be primal optimal and α*, β* dual optimal (p* = d*). Primal feasibility: g_i(x*) ≤ 0 and h_i(x*) = 0 for all i. Dual feasibility: α_i* ≥ 0 for all i. Complementary slackness: α_i* g_i(x*) = 0 for all i. Stationarity: since x* minimizes L(x, α*, β*), the gradient ∇_x L(x*, α*, β*) = 0. In our setting the KKT conditions are necessary and sufficient for optimality.

Finally, back to SVM: minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1, i.e. g_i(w, b) = 1 - y_i (w^T x_i + b) ≤ 0 for all i. Define the Lagrangian (no β required, since there are no equality constraints): L(w, b, α) = (1/2)||w||^2 + Σ_i α_i (1 - y_i (w^T x_i + b)).

SVM Dual Form: we need the minimum of L over w and b, so we take derivatives and solve for 0. Setting ∇_w L = w - Σ_i α_i y_i x_i to 0 gives w = Σ_i α_i y_i x_i: w is a specific linear combination of the input points.

SVM Dual Form: the derivative with respect to b is -Σ_i α_i y_i, which must be 0. We get the constraint Σ_i α_i y_i = 0.

SVM Dual Form: insert w = Σ_i α_i y_i x_i into the Lagrangian above; the (1/2)||w||^2 term becomes (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j.

SVM Dual Form: insert the constraint Σ_i α_i y_i = 0 into the expression above; the term involving b vanishes.

SVM Dual Form: after simplification the Lagrangian reduces to W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j.
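The substitution spelled out in LaTeX (this is the standard calculation the last few slides perform):

    L(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 + \sum_i \alpha_i\bigl(1 - y_i(w^\top x_i + b)\bigr),
    \qquad w = \sum_i \alpha_i y_i x_i, \quad \sum_i \alpha_i y_i = 0
    \;\Longrightarrow\;
    L = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j.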

SVM Dual Problem: we found the minimum over w, b; now maximize over α. Maximize W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 for all i and Σ_i α_i y_i = 0. Remember that w = Σ_i α_i y_i x_i.
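As a sanity check, the dual can also be handed to a generic solver. A hedged cvxpy sketch with the same made-up separable toy data as before (not the course's data or a required method); the quadratic term is written as ||Σ_i α_i y_i x_i||^2 to keep the problem in a form the solver accepts directly.

    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])   # toy, separable
    y = np.array([1.0, 1.0, -1.0, -1.0])

    G = y[:, None] * X          # rows y_i x_i, so ||G^T a||^2 = sum_ij a_i a_j y_i y_j x_i^T x_j
    alpha = cp.Variable(len(y))
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha))
    constraints = [alpha >= 0, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()

    w = G.T @ alpha.value       # w = sum_i alpha_i y_i x_i
    print(alpha.value, w)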

Intercept b*: take any support vector x_i, for which the constraint is tight. In the case y_i = 1 it gives w*^T x_i + b* = 1, and in the case y_i = -1 it gives w*^T x_i + b* = -1. Either way b* = y_i - w*^T x_i.

Making Predictions: predict the sign of w*^T x + b* = Σ_i α_i* y_i x_i^T x + b*. Only the support vectors, the points with α_i* > 0, contribute to the sum.
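A hedged sketch of the prediction step given dual-optimal α (for example alpha.value from the dual solve above); the 1e-6 threshold for detecting support vectors is an arbitrary illustrative choice.

    import numpy as np

    def svm_predict(alpha, X, y, X_new, tol=1e-6):
        # By complementary slackness only support vectors (alpha_i > 0) contribute.
        sv = alpha > tol
        w = (alpha[sv] * y[sv]) @ X[sv]        # w = sum over support vectors of alpha_i y_i x_i
        i = np.flatnonzero(sv)[0]
        b = y[i] - X[i] @ w                    # from y_i (w^T x_i + b) = 1 with y_i in {-1,+1}
        return np.sign(X_new @ w + b)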

Complementary slackness gives α_i* (1 - y_i (w*^T x_i + b*)) = 0, so α_i* > 0 is only possible for points whose margin constraint is tight. These points are the support vectors: the vectors that support the plane.

SVM Summary: for linearly separable data, minimize (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i. Solving via the Lagrangian dual expresses w as a combination of the support vectors, which alone determine the maximum margin hyperplane.