Binary Classification Problem Linearly Separable Case


Binary Classification Problem: Linearly Separable Case. Which separating plane is the best? (Figure: two linearly separable point sets, $A^+$ = solvent firms and $A^-$ = bankrupt firms.)

Support Vector Machines: Maximizing the Margin between Bounding Planes. (Figure: the two bounding planes $x^T w = \gamma + 1$ and $x^T w = \gamma - 1$; the margin between them is $2/\|w\|_2$.)

Algebra of the Classification Problem: Linearly Separable Case (there exist $w$ and $\gamma$). Given $l$ points in the $n$-dimensional real space $R^n$, represented by an $l \times n$ matrix $A$. Membership of each point $A_i$ in the classes $A^+$ or $A^-$ is specified by an $l \times l$ diagonal matrix $D$: $D_{ii} = +1$ if $A_i \in A^+$ and $D_{ii} = -1$ if $A_i \in A^-$. Separate $A^+$ and $A^-$ by two bounding planes $x^T w = \gamma + 1$ and $x^T w = \gamma - 1$ such that $A_i w \ge \gamma + 1$ for $D_{ii} = +1$ and $A_i w \le \gamma - 1$ for $D_{ii} = -1$, or more succinctly $D(Aw - e\gamma) \ge e$, where $e$ is the vector of ones. Predict the membership of a new data point $x$ by $\operatorname{sign}(x^T w - \gamma)$.

Summary of the Notation. Let $S = \{(x^1, y_1), \dots, (x^l, y_l)\} \subseteq R^n \times \{-1, 1\}$ be a training dataset, represented by the matrices $A = [x^1 \; x^2 \cdots x^l]^T \in R^{l \times n}$ and $D = \operatorname{diag}(y_1, \dots, y_l) \in R^{l \times l}$. The separating conditions $y_i(\langle w, x^i \rangle + b) \ge 1$, $i = 1, \dots, l$, are then equivalent to $D(Aw + eb) \ge e$, where $e = [1, \dots, 1]^T$ (with $b = -\gamma$ in the bounding-plane notation above).
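
A minimal numerical sketch of this matrix notation, assuming a hypothetical toy dataset and a candidate $(w, b)$ (all names here are illustrative, not from the slides):

```python
import numpy as np

# Toy training set: rows of A are the points x^i, y holds the +/-1 labels.
A = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

D = np.diag(y)                 # l x l diagonal label matrix
e = np.ones(len(y))            # vector of ones

# A candidate separating plane (w, b); (w, b) separates the data iff D(Aw + e b) >= e.
w = np.array([1.0, 1.0])
b = -2.0

print(D @ (A @ w + b * e))                     # entrywise >= 1 means every constraint holds
print(np.all(D @ (A @ w + b * e) >= e))        # True for this toy example
```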

The Mathematical Model for SVMs: Linear Program and Quadratic Program. An optimization problem in which the objective function and all constraints are linear functions is called a linear programming (LP) problem; the 1-norm SVM formulation is in this category. If the objective function is convex quadratic while the constraints are all linear, then the problem is called a convex quadratic programming (QP) problem; the standard SVM formulation is in this category.
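
In symbols, the two generic problem classes referred to here are (a generic sketch, not a specific SVM instance):

```latex
% Linear program (LP): linear objective, linear constraints.
\min_{x}\; c^{T} x
\quad \text{s.t.}\quad A x \le b .

% Convex quadratic program (QP): convex quadratic objective, linear constraints.
\min_{x}\; \tfrac{1}{2}\, x^{T} Q x + c^{T} x
\quad \text{s.t.}\quad A x \le b, \qquad Q \succeq 0 .
```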

Optimization Problem Formulation. Problem setting: given functions $f$, $g_i$ ($i = 1, \dots, k$) and $h_j$ ($j = 1, \dots, m$), defined on a domain $\Omega \subseteq R^n$, solve $\min_{x \in \Omega} f(x)$ subject to $g_i(x) \le 0$, $i = 1, \dots, k$, and $h_j(x) = 0$, $j = 1, \dots, m$, where $f$ is called the objective function and the $g_i$, $h_j$ are called the constraints.
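
A small SciPy sketch of this setting, using a made-up objective and constraints purely to illustrate the roles of $f$, $g_i$ and $h_j$ (the functions are hypothetical, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2   # objective f(x)
g = lambda x: 1.0 - x[0] - x[1]                   # inequality constraint g(x) <= 0
h = lambda x: x[0] - 2 * x[1]                     # equality constraint h(x) = 0

# SciPy's convention is fun(x) >= 0 for 'ineq', so pass -g to express g(x) <= 0.
res = minimize(f, x0=np.zeros(2),
               constraints=[{'type': 'ineq', 'fun': lambda x: -g(x)},
                            {'type': 'eq',   'fun': h}])
print(res.x, res.fun)
```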

The Most Important Concept in Optimization (Minimization). A point is said to be an optimal solution of an unconstrained minimization problem if there exists no descent direction; this implies the first-order optimality condition $\nabla f(x^*) = 0$. A point is said to be an optimal solution of a constrained minimization problem if there exists no feasible descent direction; this implies the KKT optimality conditions. (A descent direction may still exist, but moving along it would leave the feasible region.)

Minimum Principle. Let $f$ be a convex and differentiable function and let $C$ be the feasible region. Then $x^* \in C$ is an optimal solution of $\min_{x \in C} f(x)$ if and only if $\nabla f(x^*)^T (x - x^*) \ge 0$ for all $x \in C$.

Kuhn-Tucker Stationary-Point Problem. Minimization Problem (MP) vs. Kuhn-Tucker Stationary-Point Problem (KTSP).
MP: $\min_{x \in \Omega} f(x)$ such that $g(x) \le 0$.
KTSP: Find $\bar{x} \in \Omega$ and $\bar{\alpha} \in R^k$ such that $\nabla f(\bar{x}) + \bar{\alpha}^T \nabla g(\bar{x}) = 0$, $\bar{\alpha}^T g(\bar{x}) = 0$, $g(\bar{x}) \le 0$, and $\bar{\alpha} \ge 0$.

Support Vector Classification (Linearly Separable Case, Primal). The hyperplane $(w, b)$ that solves the minimization problem $\min_{w, b} \frac{1}{2}\|w\|_2^2$ subject to $y_i(\langle w, x^i \rangle + b) \ge 1$, $i = 1, \dots, l$, realizes the maximal margin hyperplane with geometric margin $\gamma = 1/\|w\|_2$.
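
A minimal sketch of solving this primal QP directly with CVXPY, assuming a linearly separable toy dataset (variable names are illustrative):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: rows of X are the points x^i, y holds the +/-1 labels.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

w = cp.Variable(n)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (<w, x^i> + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()

print(w.value, b.value)                                  # maximal margin hyperplane
print("geometric margin:", 1 / np.linalg.norm(w.value))  # 1 / ||w||_2
```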

Support Vector Classification (Linearly Separable Case, Dual Form). The dual problem of the previous MP: $\max_{\alpha} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x^i, x^j \rangle$ subject to $\sum_{i=1}^{l} y_i \alpha_i = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, l$. Applying the KKT optimality conditions, we have $w = \sum_{i=1}^{l} y_i \alpha_i x^i$. But where is $b$? It does not appear in the dual; it is recovered from the complementarity condition $\alpha_i\,[\,y_i(\langle w, x^i \rangle + b) - 1\,] = 0$, e.g. $b = y_j - \langle w, x^j \rangle$ for any $j$ with $\alpha_j > 0$. Don't forget the equality constraint $\sum_{i=1}^{l} y_i \alpha_i = 0$.
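
A sketch of the same toy problem solved through the dual QP with CVXPY; the tiny ridge added to the Gram matrix is only a numerical safeguard, not part of the formulation:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

# Q_ij = y_i y_j <x^i, x^j>; positive semidefinite by construction.
Q = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(l)

alpha = cp.Variable(l)
prob = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
                  [y @ alpha == 0, alpha >= 0])
prob.solve()

a = alpha.value
w = (a * y) @ X                  # w = sum_i alpha_i y_i x^i  (KKT condition)
sv = np.argmax(a)                # index of a support vector (alpha_j > 0)
b = y[sv] - X[sv] @ w            # b from the complementarity condition
print(a.round(4), w, b)
```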

Dual Representation of SVM (Key of Kernel Methods: $w^* = \sum_{i=1}^{l} \alpha_i^* y_i x^i$). The hypothesis is determined by $(\alpha^*, b^*)$: $h(x) = \operatorname{sign}(\langle w^*, x \rangle + b^*) = \operatorname{sign}\big(\sum_{i=1}^{l} \alpha_i^* y_i \langle x^i, x \rangle + b^*\big)$. The data enter only through inner products, which is what makes the kernel trick possible.
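
A sketch of evaluating this dual decision function using nothing but inner products; the names `X`, `y`, `a`, `b` are assumed to come from the dual sketch above:

```python
import numpy as np

def decision(x_new, X, y, alpha, b):
    """h(x) = sign( sum_i alpha_i y_i <x^i, x> + b ): only inner products of data points
    are used, so <x^i, x> could be replaced by any kernel k(x^i, x)."""
    return np.sign(np.sum(alpha * y * (X @ x_new)) + b)

# Example call, reusing the names from the dual sketch above (illustrative only):
# print(decision(np.array([1.5, 1.0]), X, y, a, b))
```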

Soft Margin SVM (Nonseparable Case). If the data are not linearly separable, the primal problem is infeasible and the dual problem is unbounded above. Introduce a slack variable $\xi_i \ge 0$ for each training point and relax the constraints to $y_i(\langle w, x^i \rangle + b) \ge 1 - \xi_i$. The resulting inequality system is always feasible, e.g. with $w = 0$, $b = 0$, $\xi = e$ (all slacks equal to 1).

Two Different Measures of Training Error.
2-Norm Soft Margin: $\min_{w, b, \xi} \frac{1}{2}\|w\|_2^2 + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2$ subject to $y_i(\langle w, x^i \rangle + b) \ge 1 - \xi_i$.
1-Norm Soft Margin: $\min_{w, b, \xi} \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{l} \xi_i$ subject to $y_i(\langle w, x^i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
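
A sketch of the 1-norm soft margin QP in CVXPY; the data and the value of C are made up for illustration:

```python
import numpy as np
import cvxpy as cp

# A slightly overlapping toy set; C trades margin width against training error.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.4],
              [-1.0, -1.0], [-2.0, 0.0], [0.6, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
l, n = X.shape
C = 1.0

w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(l)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                  [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
prob.solve()
print(w.value, b.value, xi.value.round(3))
```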

Why Do We Maximize the Margin? (Based on Statistical Learning Theory.) Structural Risk Minimization (SRM): the expected risk will be less than or equal to the empirical risk (training error) plus the VC confidence (error bound).

Goal of Learning Algorithms. The early learning algorithms were designed to find as accurate a fit to the training data as possible. A classifier is said to be consistent if it correctly classifies all of the training data. The ability of a classifier to correctly classify data not in the training set is known as its generalization. (Bible code? The 1994 Taipei mayoral election?) The goal is to predict the real future, not to fit the data in your hand or to predict the desired results.

Probably Approximately Correct Learning: the PAC Model. Key assumption: training and testing data are generated i.i.d. according to a fixed but unknown distribution $D$. When we evaluate the "quality" of a hypothesis (classification function) $h$, we should take the unknown distribution into account (i.e. the "average error" or "expected error" made by $h$). We call such a measure the risk functional and denote it by $\operatorname{err}_D(h)$.

Generalization Error of the PAC Model. Let $S = \{(x^1, y_1), \dots, (x^l, y_l)\}$ be a set of training examples chosen i.i.d. according to $D$. Treat the generalization error $\operatorname{err}_D(h_S)$ as a random variable depending on the random selection of $S$. Find a bound on the tail of its distribution in the form $\varepsilon = \varepsilon(l, H, \delta)$, i.e. $\Pr\big(\operatorname{err}_D(h_S) > \varepsilon\big) < \delta$, where $\delta$ is the confidence level of the error bound, which is given by the learner.

Probably Approximately Correct. We assert: $\Pr\big(\operatorname{err}_D(h_S) > \varepsilon\big) < \delta$, or equivalently $\Pr\big(\operatorname{err}_D(h_S) \le \varepsilon\big) \ge 1 - \delta$. The error made by the hypothesis $h_S$ will then be less than the error bound $\varepsilon$ with probability at least $1 - \delta$, and the bound does not depend on the unknown distribution $D$.

PAC vs. opinion polls: with 1,265 valid samples drawn by simple random sampling (SRS), the maximum sampling error at the 95% confidence level does not exceed ±2.76%.
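
A quick check of the quoted ±2.76% figure, using the worst-case margin of error for a proportion at the 95% confidence level:

```python
import math

n = 1265                                 # number of valid samples
z = 1.96                                 # 95% confidence level
margin = z * math.sqrt(0.25 / n)         # worst case p = 0.5
print(f"max sampling error: +/-{100 * margin:.2f}%")   # about 2.76%
```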

Find the Hypothesis with Minimum Expected Risk? Let the training examples $(x, y) \in X \times \{-1, 1\}$ be chosen i.i.d. according to $D$ with probability density $p(x, y)$. The expected misclassification error made by $h$ is $R[h] = \int \frac{1}{2} |h(x) - y| \, dp(x, y)$. The ideal hypothesis $h^*$ should have the smallest expected risk: $R[h^*] \le R[h]$ for all $h \in H$. Unrealistic: $p(x, y)$ is unknown!

Empirical Risk Minimization (ERM) ($D$ and $p(x, y)$ are not needed). Replace the expected risk over $p(x, y)$ by an average over the training examples. The empirical risk: $R_{\text{emp}}[h] = \frac{1}{l} \sum_{i=1}^{l} \frac{1}{2} |h(x^i) - y_i|$. Find the hypothesis $h_{\text{emp}}$ with the smallest empirical risk: $R_{\text{emp}}[h_{\text{emp}}] \le R_{\text{emp}}[h]$ for all $h \in H$. Focusing only on the empirical risk will cause overfitting.
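
A one-line sketch of the empirical risk for ±1 labels (the predictions are hypothetical, for illustration only):

```python
import numpy as np

y      = np.array([ 1, -1,  1,  1, -1])   # true labels
y_pred = np.array([ 1, -1, -1,  1, -1])   # h(x^i) on the training set

# R_emp[h] = (1/l) * sum_i (1/2)|h(x^i) - y_i|  =  fraction of training mistakes
r_emp = np.mean(0.5 * np.abs(y_pred - y))
print(r_emp)                               # 0.2 here: one error out of five
```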

VC Confidence (the bound between $R[h]$ and $R_{\text{emp}}[h]$). The following inequality holds with probability $1 - \delta$: $R[h] \le R_{\text{emp}}[h] + \sqrt{\dfrac{v\big(\ln\frac{2l}{v} + 1\big) - \ln\frac{\delta}{4}}{l}}$, where $v$ is the VC dimension of the hypothesis space. C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2) (1998), pp. 121-167.
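
A sketch that evaluates this VC confidence term numerically; the values of l, v and delta are made up for illustration:

```python
import math

def vc_confidence(l, v, delta):
    """VC confidence term from Burges (1998): sqrt((v*(ln(2l/v) + 1) - ln(delta/4)) / l)."""
    return math.sqrt((v * (math.log(2 * l / v) + 1) - math.log(delta / 4)) / l)

l, v, delta = 10000, 3, 0.05          # sample size, VC dimension, confidence parameter
print(vc_confidence(l, v, delta))     # the bound: R[h] <= R_emp[h] + this quantity
```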

Why Do We Maximize the Margin? (Based on Statistical Learning Theory.) Structural Risk Minimization (SRM): the expected risk will be less than or equal to the empirical risk (training error) plus the VC confidence (error bound).

Capacity (Complexity) of a Hypothesis Space $H$: VC-dimension. A given training set $S$ is shattered by $H$ if and only if, for every labeling of $S$, some $h \in H$ is consistent with this labeling. Example: three points in general position (linearly independent after choosing one as origin) can be shattered by oriented hyperplanes in $R^2$.

Shattering Points with Hyperplanes. Can you always shatter three points with a line in $R^2$? (No: not if the three points are collinear.) Theorem: Consider some set of $m$ points in $R^n$. Choose any one of the points as origin. Then the $m$ points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining $m - 1$ points are linearly independent.
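
A tiny sketch of this theorem's criterion: pick one point as origin and check whether the position vectors of the rest are linearly independent (a rank check; the point sets are illustrative):

```python
import numpy as np

def shatterable_by_hyperplanes(points):
    """Criterion from the theorem above: choose the first point as origin; the m points can
    be shattered by oriented hyperplanes iff the remaining m-1 position vectors are
    linearly independent."""
    P = np.asarray(points, dtype=float)
    V = P[1:] - P[0]                      # position vectors relative to the chosen origin
    return np.linalg.matrix_rank(V) == len(V)

print(shatterable_by_hyperplanes([[0, 0], [1, 0], [0, 1]]))   # True: non-collinear in R^2
print(shatterable_by_hyperplanes([[0, 0], [1, 1], [2, 2]]))   # False: collinear points
print(shatterable_by_hyperplanes([[0, 0, 0], [1, 0, 0],
                                  [0, 1, 0], [0, 0, 1]]))     # True: 4 points in R^3
```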

Definition of VC-dimension (a Capacity Measure of the Hypothesis Space $H$). The Vapnik-Chervonenkis dimension, $\operatorname{VC}(H)$, of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ shattered by $H$ (when such a subset exists). If arbitrarily large finite subsets of $X$ can be shattered by $H$, then $\operatorname{VC}(H) \equiv \infty$. Let $H = \{\text{all oriented hyperplanes in } R^n\}$; then $\operatorname{VC}(H) = n + 1$.

Let $H = \{\text{all oriented hyperplanes in } R^n\}$; then $\operatorname{VC}(H) = n + 1$. Lemma: Two sets of points can be separated by a hyperplane if and only if the intersection of their convex hulls is empty. Theorem: Given $m$ points in $R^n$, choose any one of the points as origin. Then these points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining $m - 1$ points are linearly independent.