Classification
Course web page: vision.cis.udel.edu/~cv
May 14, 2003, Lecture 34

Announcements
Read the selection from Trucco & Verri on deformable contours for the guest lecture on Friday.
On Monday I'll cover neural networks (Forsyth & Ponce, Chapter 22.4) and begin reviewing for the final.

Outline
Linear discriminants
– Two-class
– Multicategory
Criterion functions for computing discriminants
Generalized linear discriminants

Discriminants for Classification
Previously, the decision boundary was chosen based on underlying class probability distributions:
– Completely known distribution
– Estimate parameters for a distribution of known form
– Nonparametrically approximate an unknown distribution
Idea: Ignore the class distributions and simply assume the decision boundary has a known form with unknown parameters
Discriminant function (two-class): Which side of the boundary is a data point on?
– Linear discriminant ⇒ hyperplane decision boundary
In general, not optimal

Two-Class Linear Discriminants
Represent the n-dimensional data points x in homogeneous coordinates: y = (x^T, 1)^T
The decision boundary is the hyperplane a = (w^T, w_0)^T
– w = (w_1, …, w_n)^T: the plane's normal (weight vector)
– w_0: bias or threshold; the plane's distance from the origin in x space is w_0 / ||w|| (in y space the plane always passes through the origin)

Discriminant Function
Define the two-class linear discriminant function with the dot product g(x) = a^T y
– g(x) = 0 ⇒ the normal vector and y are orthogonal ⇒ y lies on the plane
– g(x) > 0 ⇒ the angle between the vectors is acute ⇒ y lies on the side the normal points to ⇒ classify as c_1
– g(x) < 0 ⇒ the angle between the vectors is obtuse ⇒ y lies on the plane's other side ⇒ classify as c_2
(figure from Duda et al.)
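A minimal sketch of this sign test in Python with NumPy; the helper names `augment` and `classify_two_class` and the example weights are illustrative choices, not from the lecture:

```python
import numpy as np

def augment(x):
    """Map an n-dimensional point x to homogeneous coordinates y = (x^T, 1)^T."""
    return np.append(x, 1.0)

def classify_two_class(a, x):
    """Return 1 (class c1) if g(x) = a^T y >= 0, else 2 (class c2)."""
    g = a @ augment(x)
    return 1 if g >= 0 else 2

# Example: normal w = (1, -1) and bias w0 = 0.5, so a = (1, -1, 0.5).
a = np.array([1.0, -1.0, 0.5])
print(classify_two_class(a, np.array([2.0, 0.0])))   # g = 2.5 > 0, so class 1
```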

Distance to Decision Boundary
The distance from y to the hyperplane in y space is given by the projection |a^T y| / ||a||
Since ||a|| ≥ ||w||, this is a lower bound on the distance of x to the hyperplane in x space
(figure from Duda et al.)
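A small illustration of the two distances, assuming NumPy; the splitting of a into (w, w_0) follows the slide's notation, but the function itself is a sketch:

```python
import numpy as np

def distances(a, x):
    """Distance of y = (x, 1) to the hyperplane a^T y = 0 in y space,
    and the corresponding (larger or equal) distance of x to the boundary in x space."""
    y = np.append(x, 1.0)
    w = a[:-1]
    g = a @ y                               # g(x) = a^T y = w^T x + w_0
    d_y = abs(g) / np.linalg.norm(a)        # distance in augmented y space
    d_x = abs(g) / np.linalg.norm(w)        # true distance in x space, >= d_y
    return d_y, d_x
```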

Multicategory Linear Discriminants
Given C categories, define C discriminant functions g_i(x) = a_i^T y
Classify x as a member of c_i if g_i(x) ≥ g_j(x) for all j ≠ i
(figure from Duda et al.)
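A possible NumPy sketch of this argmax rule, assuming the C weight vectors a_i are stacked as the rows of a matrix A (a convention chosen here, not stated in the lecture):

```python
import numpy as np

def classify_multicategory(A, x):
    """A is a (C x (n+1)) matrix whose i-th row is the weight vector a_i.
    Return the index i maximizing g_i(x) = a_i^T y."""
    y = np.append(x, 1.0)
    return int(np.argmax(A @ y))
```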

Characterizing Solutions
Separability: There exists at least one a in weight space (the space of candidate vectors a) that classifies all samples correctly
Solution region: The region of weight space in which every a separates the classes (not the same as the decision regions!)
(figures from Duda et al.: separable data vs. non-separable data)

Normalization
Suppose each data point y_i is classified correctly as c_1 when a^T y_i > 0 and as c_2 when a^T y_i < 0
Idea: Replace the c_2-labeled samples with their negation -y_i
– This simplifies things, since now we need only look for an a such that a^T y_i > 0 for all of the data
(figure from Duda et al.)
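One way this normalization might look in NumPy, assuming labels coded as 1 and 2 as in the slides (the function name is an illustrative choice):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Given augmented samples Y (rows y_i) and labels in {1, 2},
    negate the class-2 rows so that a correct a satisfies a^T y_i > 0 for all i."""
    Y = np.asarray(Y, dtype=float).copy()
    labels = np.asarray(labels)
    Y[labels == 2] *= -1.0
    return Y
```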

Margin
Set a minimum distance that the decision hyperplane must keep from the nearest data point by requiring a^T y ≥ b. For a particular point y_i, this distance is b / ||y_i||
Intuitively, we want a maximal margin
(figure from Duda et al.)

Criterion Functions
To actually solve for a discriminant a, define a criterion function J(a; y_1, …, y_d) that is minimized when a is a solution
– For example, let J_e = the number of misclassified data points; it is minimal (J_e = 0) for solutions
For practical purposes, we will use something like gradient descent on J to arrive at a solution
– J_e is unsuitable for gradient descent since it is piecewise constant, so its gradient is zero or undefined everywhere
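For concreteness, J_e on normalized samples could be computed as below (a sketch assuming the normalization routine above); the flat regions are exactly why its gradient carries no information:

```python
import numpy as np

def J_e(a, Y_norm):
    """Number of misclassified (normalized) samples: rows y_i with a^T y_i <= 0.
    Piecewise constant in a, hence a poor candidate for gradient descent."""
    return int(np.sum(Y_norm @ a <= 0))
```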

Example: Plot of J_e (figure from Duda et al.)

Perceptron Criterion Function
Define the following piecewise linear function: J_p(a) = Σ_{y ∈ Y(a)} (-a^T y), where Y(a) is the set of samples misclassified by a
This is proportional to the sum of distances between the misclassified samples and the decision hyperplane
(figure from Duda et al.)
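A sketch of the batch perceptron rule this criterion leads to, assuming normalized samples and NumPy; the learning rate and iteration cap are arbitrary choices, not values from the lecture:

```python
import numpy as np

def perceptron(Y_norm, lr=1.0, max_iter=1000):
    """Batch perceptron: gradient descent on J_p(a) = sum over misclassified y of (-a^T y).
    The gradient is minus the sum of misclassified samples, so each step adds them back."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_iter):
        misclassified = Y_norm[Y_norm @ a <= 0]
        if len(misclassified) == 0:          # separable data: all a^T y_i > 0, done
            break
        a += lr * misclassified.sum(axis=0)
    return a
```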

Non-Separable Data: Error Minimization
The perceptron assumes separability and won't stop otherwise
– It only focuses on erroneous classifications
Idea: Minimize the mean squared error over all of the data
Trying to put the decision hyperplane exactly at the margin leads to linear equations rather than linear inequalities: a^T y_i = y_i^T a = b_i
Stack all data points as row vectors y_i^T and collect the margins b_i to get the system of equations Y a = b
This can be solved with the pseudoinverse: a = Y^+ b
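A minimal pseudoinverse solution in NumPy using np.linalg.pinv; the default margin vector b of all ones is a choice made here for illustration:

```python
import numpy as np

def mse_discriminant(Y_norm, b=None):
    """Minimum-squared-error solution a = Y^+ b via the pseudoinverse.
    Y_norm: normalized sample matrix (rows y_i^T); b: target margins (default: all ones)."""
    if b is None:
        b = np.ones(Y_norm.shape[0])
    return np.linalg.pinv(Y_norm) @ b
```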

Non-Separable Data: Error Minimization
An alternative to the pseudoinverse approach is gradient descent on the criterion function J_s(a) = ||Ya - b||^2
This is called the Widrow-Hoff or least-mean-squared (LMS) procedure
It doesn't necessarily converge to a separating hyperplane even if one exists
Advantages
– Avoids the problems that occur when Y^T Y is singular
– Avoids the need to manipulate large matrices
(figure from Duda et al.)
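A rough sketch of the sequential LMS update, assuming NumPy; a fixed learning rate is used here for simplicity, whereas Duda et al. use a decreasing step size to guarantee convergence:

```python
import numpy as np

def lms(Y_norm, b=None, lr=0.01, max_iter=10000):
    """Widrow-Hoff / LMS: sequential gradient descent on J_s(a) = ||Ya - b||^2,
    visiting one sample at a time instead of ever forming Y^T Y."""
    n_samples, dim = Y_norm.shape
    if b is None:
        b = np.ones(n_samples)
    a = np.zeros(dim)
    for k in range(max_iter):
        i = k % n_samples
        y_i = Y_norm[i]
        a += lr * (b[i] - a @ y_i) * y_i     # gradient step for the single sample y_i
    return a
```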

Generalized Linear Discriminants
We originally constructed the vector y from the n-vector x by simply adding one coordinate for the homogeneous representation
We can go further and use any number m of arbitrary functions: y = (y_1(x), …, y_m(x)), sometimes called a basis expansion
Even if the y_i(x) are nonlinear, we can still use linear methods in the m-dimensional y space
Why? Because we can then represent nonlinear discriminant functions in x straightforwardly

Example: Quadratic Discriminant
Define the 1-D quadratic discriminant function as g(x) = a_1 + a_2 x + a_3 x^2
– This is nonlinear in x, so we can't directly use the methods described thus far
– But by mapping to 3-D with y = (1, x, x^2)^T, we can use linear methods (e.g., perceptron, LMS) to solve for a = (a_1, a_2, a_3)^T in y space
Inefficiency: The number of variables may overwhelm the amount of data for larger n, since a full quadratic expansion in n dimensions has m = (n + 1)(n + 2)/2 terms
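A toy end-to-end example of this mapping, assuming NumPy; the 1-D data set and the use of the MSE/pseudoinverse solution are choices made here for illustration:

```python
import numpy as np

def quadratic_features(x):
    """Map a scalar x to y = (1, x, x^2), so g(x) = a_1 + a_2 x + a_3 x^2 = a^T y."""
    return np.array([1.0, x, x * x])

# Class 1 for |x| < 1, class 2 otherwise: not linearly separable in x.
xs = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
labels = np.array([2, 2, 1, 1, 1, 2, 2])
Y = np.vstack([quadratic_features(x) for x in xs])
Y[labels == 2] *= -1.0                       # normalization trick from above
a = np.linalg.pinv(Y) @ np.ones(len(xs))     # MSE solution a = Y^+ b in y space
print(np.sign(Y @ a))                        # all positive: every sample classified correctly
```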

Example: Quadratic Discriminant (figure from Duda et al.): no linear decision boundary separates the classes in x space, but a hyperplane separates them in y space

Support Vector Machines (SVM)
Map the input nonlinearly to a higher-dimensional space (where in general there is a separating hyperplane)
Find the separating hyperplane that maximizes the distance to the nearest data point (i.e., the margin)
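As an illustration only (scikit-learn is not part of the lecture), an off-the-shelf SVM with a degree-2 polynomial kernel applied to the toy 1-D data above; the kernel plays the role of the quadratic mapping, and the SVM finds the maximum-margin hyperplane in that feature space:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D data that is not linearly separable in x (class 1 inside, class 2 outside).
X = np.array([[-2.0], [-1.5], [-0.5], [0.0], [0.5], [1.5], [2.0]])
y = np.array([2, 2, 1, 1, 1, 2, 2])

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0)   # implicit quadratic feature map
clf.fit(X, y)
print(clf.predict(X))            # recovers the training labels
print(clf.support_vectors_)      # the points that determine the margin
```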

Example: SVM for Gender Classification of Faces
Data: 1,755 cropped face images (21 x 12 pixels)
Error rates
– Human: 30.7% (hampered by the lack of hair cues?)
– SVM: 3.4% (5-fold cross-validation)
(face images courtesy of B. Moghaddam; the figure shows humans' top misclassifications; from Moghaddam & Yang, 2001)