Linear Discriminant Functions Chapter 5 (Duda et al.)


Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

Generative vs Discriminant Approach Generative approaches find the discriminant function by first estimating the probability distribution of the patterns belonging to each class. Discriminant approaches find the discriminant function explicitly, without assuming a probability distribution.

Generative Approach – Example (two categories) It is more common to use a single discriminant function (dichotomizer) instead of two. Examples: g(x) = P(ω1|x) - P(ω2|x), or g(x) = ln [p(x|ω1)/p(x|ω2)] + ln [P(ω1)/P(ω2)]; decide ω1 if g(x) > 0.
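As a concrete illustration (not from the slides), here is a minimal sketch of a generative dichotomizer assuming 1-D Gaussian class-conditional densities with estimated means, variances, and priors; the function names and the synthetic data are hypothetical.

import numpy as np

def fit_gaussian_dichotomizer(x1, x2):
    """Estimate per-class Gaussian parameters and priors from training samples."""
    n1, n2 = len(x1), len(x2)
    return {
        "mu": (np.mean(x1), np.mean(x2)),
        "var": (np.var(x1), np.var(x2)),
        "prior": (n1 / (n1 + n2), n2 / (n1 + n2)),
    }

def g(x, params):
    """Dichotomizer g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2); decide w1 if g(x) > 0."""
    (m1, m2), (v1, v2), (p1, p2) = params["mu"], params["var"], params["prior"]
    log_like_ratio = (-0.5 * np.log(v1) - (x - m1) ** 2 / (2 * v1)) \
                     - (-0.5 * np.log(v2) - (x - m2) ** 2 / (2 * v2))
    return log_like_ratio + np.log(p1 / p2)

# Hypothetical 1-D data: class w1 centered at -1, class w2 centered at +2
rng = np.random.default_rng(0)
x1 = rng.normal(-1.0, 1.0, 100)
x2 = rng.normal(2.0, 1.0, 100)
params = fit_gaussian_dichotomizer(x1, x2)
print("g(0.5) =", g(0.5, params))   # positive -> decide w1, negative -> decide w2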

Discriminant Approach Specify a parametric form of the discriminant function, for example a linear discriminant g(x) = wtx + w0. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.

Discriminant Approach (cont’d) Find the “best” decision boundary (i.e., estimate w and w0) using a set of training examples xk.

Discriminant Approach (cont’d) The solution is found by minimizing a criterion function (e.g., the “training error” or “empirical risk”): J(w, w0) = (1/n) Σk [zk - g(xk)]², where zk is the correct class of xk and g(xk) is the predicted class. Learning algorithms can then be applied to find the solution.

Linear Discriminant Functions: two-categories case A linear discriminant function has the form g(x) = wtx + w0. The decision boundary g(x) = 0 is a hyperplane, whose orientation is determined by w and whose location is determined by w0: w is the normal to the hyperplane, and if w0 = 0 the hyperplane passes through the origin.
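A minimal sketch (not from the slides) of the two-class linear discriminant and its decision rule; the weights w, w0 and the test points are hypothetical.

import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Decide w1 if g(x) > 0, w2 if g(x) < 0; points with g(x) = 0 may go to either class."""
    return "w1" if g(x, w, w0) > 0 else "w2"

# Hypothetical parameters: hyperplane x1 + 2*x2 - 1 = 0 in 2-D
w, w0 = np.array([1.0, 2.0]), -1.0
print(classify(np.array([2.0, 1.0]), w, w0))   # g = 3  -> w1
print(classify(np.array([0.0, 0.0]), w, w0))   # g = -1 -> w2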

Geometric Interpretation of g(x) g(x) provides an algebraic measure of the distance of x from the hyperplane. x can be expressed as x = xp + r (w/||w||), where xp is the projection of x onto the hyperplane, r is the signed distance of x from the hyperplane, and w/||w|| is the unit normal giving the direction of r.

Geometric Interpretation of g(x) (cont’d) Substituting x into g(x): g(x) = wt(xp + r w/||w||) + w0 = wtxp + w0 + r (wtw)/||w|| = r ||w||, since g(xp) = wtxp + w0 = 0 and wtw = ||w||².

Geometric Interpretation of g(x) (cont’d) Therefore, the signed distance of x from the hyperplane is r = g(x)/||w||. Setting x = 0 gives the distance of the origin from the hyperplane: r0 = g(0)/||w|| = w0/||w||.
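A small sketch (assumed, not from the slides) that computes the signed distance via r = g(x)/||w||, reusing the hypothetical w, w0 from the previous snippet.

import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x)/||w|| of point x from the hyperplane w^t x + w0 = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

w, w0 = np.array([1.0, 2.0]), -1.0
print(signed_distance(np.array([2.0, 1.0]), w, w0))  # positive: x lies on the w1 side
print(signed_distance(np.zeros(2), w, w0))           # distance of the origin: w0/||w||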

Linear Discriminant Functions: multi-category case There are several ways to devise multi-category classifiers using linear discriminant functions: (1) One against the rest (c two-class problems). Problem: ambiguous regions.

Linear Discriminant Functions: multi-category case (cont’d) (2) One against another (i.e., c(c-1)/2 pairs of classes). Problem: ambiguous regions.

Linear Discriminant Functions: multi-category case (cont’d) To avoid the problem of ambiguous regions: define c linear discriminant functions gi(x) = witx + wi0, i = 1, …, c, and assign x to ωi if gi(x) > gj(x) for all j ≠ i. The resulting classifier is called a linear machine (see Chapter 2).
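A minimal sketch of a linear machine with hypothetical weights for c = 3 classes in 2-D; the argmax implements the rule “assign x to ωi if gi(x) > gj(x) for all j ≠ i”.

import numpy as np

# Hypothetical parameters for c = 3 classes: rows of W are the wi, b holds the wi0
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([0.0, 0.5, -0.5])

def linear_machine(x):
    """Evaluate gi(x) = wi^t x + wi0 for every class and return the index of the largest."""
    scores = W @ x + b
    return int(np.argmax(scores))          # class index i with the largest gi(x)

print(linear_machine(np.array([2.0, 0.0])))   # region where g1 is largest -> 0
print(linear_machine(np.array([-2.0, 2.0])))  # region where g2 is largest -> 1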

Linear Discriminant Functions: multi-category case (cont’d) A linear machine divides the feature space into c convex decision regions: if x is in region Ri, then gi(x) is the largest. Note: although there are c(c-1)/2 pairs of regions, there are typically fewer decision boundaries.

Linear Discriminant Functions: multi-category case (cont’d) The decision boundary between adjacent regions Ri and Rj is a portion of the hyperplane Hij given by gi(x) = gj(x), i.e., (wi - wj)tx + (wi0 - wj0) = 0. (wi - wj) is normal to Hij, and the signed distance from x to Hij is (gi(x) - gj(x)) / ||wi - wj||.

Higher Order Discriminant Functions Can produce more complicated decision boundaries than linear discriminant functions.

Generalized discriminants Defined through special functions yi(x) called φ functions: g(x) = Σi αi yi(x) = αty, where α is a d̂-dimensional weight vector. The φ functions yi(x) map a point from the d-dimensional x-space to a point in the d̂-dimensional y-space (usually d̂ >> d).

Generalized discriminants (cont’d) The resulting discriminant function is linear in y-space; it separates points in the transformed space by a hyperplane passing through the origin.

Example For d = 1, take the φ functions y = (1, x, x²)t, so that g(x) = α1 + α2x + α3x². The corresponding decision regions R1, R2 in the x-space are not simply connected!

Example (cont’d) The mapping y = (1, x, x²)t maps the line (x-space) to a parabola in y-space. The plane αty = 0 divides the y-space into two decision regions.
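A short sketch of this d = 1 example with an assumed weight vector α (not from the slides): map x to y = (1, x, x²), classify with the linear discriminant αty in y-space, and observe that the positive region in x-space is a union of two disjoint intervals.

import numpy as np

def phi(x):
    """phi-function mapping from 1-D x-space to 3-D y-space: y = (1, x, x^2)."""
    return np.array([1.0, x, x * x])

# Hypothetical weights: g(x) = -1 - 0.5*x + x^2, positive roughly for x < -0.78 or x > 1.28
alpha = np.array([-1.0, -0.5, 1.0])

def classify(x):
    """Decide w1 if alpha^t y > 0, else w2; linear in y-space, quadratic in x-space."""
    return "w1" if alpha @ phi(x) > 0 else "w2"

for x in (-2.0, 0.0, 2.0):
    print(x, classify(x))   # w1, w2, w1 -> region R1 is not simply connected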

Learning: two-category, linearly separable case Given a linear discriminant function g(x) = wtx + w0, the goal is to “learn” the parameters w and w0 from a set of n labeled samples xi, where each xi has a class label ω1 or ω2.

Augmented feature/parameter space Simplify notation by augmenting the vectors: y = (1, x1, …, xd)t = (1, xt)t and α = (w0, w1, …, wd)t = (w0, wt)t, so that g(x) = wtx + w0 = αty. Dimensionality: d → (d+1).

Classification in augmented space Discriminant: g(x) = αty. Classification rule: if αtyi > 0 assign yi to ω1; else if αtyi < 0 assign yi to ω2.
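A brief sketch (hypothetical data) of the augmented notation: prepend a 1 to each feature vector so that the bias w0 becomes the first component of α and the rule reduces to checking the sign of αty.

import numpy as np

def augment(x):
    """Map x = (x1, ..., xd) to y = (1, x1, ..., xd); dimensionality d -> d+1."""
    return np.concatenate(([1.0], x))

def classify(x, alpha):
    """g(x) = alpha^t y; assign to w1 if positive, w2 if negative."""
    return "w1" if alpha @ augment(x) > 0 else "w2"

# Hypothetical alpha = (w0, w) for the hyperplane x1 + 2*x2 - 1 = 0 used earlier
alpha = np.array([-1.0, 1.0, 2.0])
print(classify(np.array([2.0, 1.0]), alpha))   # w1
print(classify(np.array([0.0, 0.0]), alpha))   # w2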

Learning in augmented space: two-category, linearly separable case Given a linear discriminant function g(x) = αty, the goal is to learn the weights (parameters) α from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.

Learning in augmented space: effect of training examples Every training sample yi places a constraint on the weight vector α: αty = 0 defines a hyperplane in parameter space having y as its normal vector, and the sample requires α to lie on its positive side. Given n examples, the solution α must lie in the intersection of n half-spaces. (Figure: parameter space (α1, α2).)

Learning in augmented space: effect of training examples (cont’d) The solution can be visualized either in parameter space (α1, α2) or in feature space (y1, y2).

Uniqueness of Solution The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, e.g.: “Find the unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane.”

Iterative Optimization Define an error function J(α) (i.e., misclassifications) that is minimized if α is a solution vector. Minimize J(α) iteratively: α(k+1) = α(k) + η(k) p(k), where p(k) is the search direction and η(k) is the learning rate. How should we define p(k)?

Choosing pk using Gradient Descent Take p(k) = -∇J(α(k)), giving the update α(k+1) = α(k) - η(k) ∇J(α(k)), where η(k) is the learning rate.
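A generic sketch of this update (not the slides’ own code); the criterion J and its gradient here are a hypothetical quadratic, used only to show the loop and stopping test.

import numpy as np

def gradient_descent(grad_J, alpha0, eta=0.1, theta=1e-6, max_iter=1000):
    """Iterate alpha(k+1) = alpha(k) - eta * grad J(alpha(k)) until the step is tiny."""
    alpha = alpha0.astype(float)
    for k in range(max_iter):
        step = eta * grad_J(alpha)
        alpha -= step
        if np.linalg.norm(step) < theta:       # stopping criterion on the update size
            break
    return alpha

# Hypothetical quadratic criterion J(alpha) = 0.5 * alpha^t A alpha - b^t alpha
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_J = lambda a: A @ a - b
print(gradient_descent(grad_J, np.zeros(2)))   # approaches the minimizer A^{-1} b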

Gradient Descent (cont’d) (Figure: gradient descent trajectory in the solution space of J(α).)

Gradient Descent (cont’d) What is the effect of the learning rate η? If η is too small, descent is slow but converges to the solution; if η is too large, descent is fast but overshoots (and may never reach) the solution.

Gradient Descent (cont’d) How should we choose the learning rate η(k)? Using a second-order Taylor series approximation around α(k), J(α) ≈ J(α(k)) + ∇Jt(α - α(k)) + 1/2 (α - α(k))t H (α - α(k)), where H is the Hessian (matrix of second derivatives), the optimum learning rate is η(k) = ||∇J||² / (∇Jt H ∇J). If J(α) is quadratic, then H is constant, which implies that the learning rate is constant.

Choosing pk using Newton’s Method Take p(k) = -H⁻¹∇J, giving the update α(k+1) = α(k) - H⁻¹∇J(α(k)); each step requires inverting the Hessian H.

Newton’s method (cont’d) If J(α) is quadratic, Newton’s method converges in one step!

Gradient descent vs Newton’s method Newton’s method typically reaches the minimum in fewer iterations, but each iteration is more expensive since it requires computing and inverting the Hessian; gradient descent takes cheaper steps but may need many of them.
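A tiny comparison sketch on the same hypothetical quadratic criterion as above: the Newton step α - H⁻¹∇J lands on the minimizer in one iteration, while gradient descent needs many smaller steps.

import numpy as np

# Same hypothetical quadratic J(alpha) = 0.5 * alpha^t A alpha - b^t alpha as before
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_J = lambda a: A @ a - b          # gradient
H = A                                 # Hessian of a quadratic is constant

alpha_newton = np.zeros(2) - np.linalg.inv(H) @ grad_J(np.zeros(2))
print("Newton (1 step):", alpha_newton)              # exactly A^{-1} b

alpha_gd = np.zeros(2)
for _ in range(50):                                   # gradient descent needs many steps
    alpha_gd -= 0.1 * grad_J(alpha_gd)
print("Gradient descent (50 steps):", alpha_gd)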

“Normalized” Problem Find α such that αtyi > 0 for all i. If yi belongs to ω2, replace yi by -yi (“normalization”). Instead of seeking a hyperplane that separates patterns from different categories, we now seek a hyperplane that puts all normalized patterns on the same (positive) side.

Perceptron rule Find α such that αtyi > 0 for all i. Use gradient descent on the perceptron criterion Jp(α) = Σy∈Y(α) (-αty), where Y(α) is the set of samples misclassified by α. If Y(α) is empty, Jp(α) = 0; otherwise, Jp(α) > 0.

Perceptron rule (cont’d) The gradient of Jp(α) is ∇Jp(α) = Σy∈Y(α) (-y). The perceptron update rule is obtained using gradient descent: α(k+1) = α(k) + η(k) Σy∈Y(α(k)) y, or, using one misclassified example at a time, α(k+1) = α(k) + η(k) y.

Perceptron rule (cont’d) Batch perceptron algorithm: starting from an arbitrary α, repeatedly add η(k) times the sum of the misclassified examples Y(α) to α, until the magnitude of the update falls below a threshold θ.
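A sketch of the batch perceptron update under these assumptions (fixed η, hypothetical threshold and data), operating on normalized, augmented samples as described above.

import numpy as np

def batch_perceptron(Y, eta=1.0, theta=1e-6, max_iter=1000):
    """Y: rows are normalized augmented samples (samples from w2 already negated).
    Repeat alpha <- alpha + eta * sum of misclassified samples until the update is tiny."""
    alpha = np.zeros(Y.shape[1])
    for k in range(max_iter):
        misclassified = Y[Y @ alpha <= 0]          # samples with alpha^t y <= 0
        if len(misclassified) == 0:
            break                                  # all samples on the positive side
        update = eta * misclassified.sum(axis=0)
        alpha += update
        if np.linalg.norm(update) < theta:
            break
    return alpha

# Hypothetical 2-D data, already augmented with a leading 1 and normalized (w2 negated)
Y = np.array([[ 1.0,  2.0,  1.0],    # w1 sample (2, 1)
              [ 1.0,  1.0,  2.0],    # w1 sample (1, 2)
              [-1.0,  1.0, -1.0],    # w2 sample (-1, 1), negated
              [-1.0,  2.0,  1.0]])   # w2 sample (-2, -1), negated
alpha = batch_perceptron(Y)
print(alpha, Y @ alpha > 0)          # all True -> separating vector found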

Perceptron rule (cont’d) Example: each update moves the hyperplane so that the training samples end up on its positive side.

Perceptron rule (cont’d) Fixed-increment, single-sample perceptron: η(k) = 1 and the weights are updated using one misclassified example at a time. Perceptron Convergence Theorem: if the training samples are linearly separable, then the sequence of weight vectors generated by the above algorithm will terminate at a solution vector in a finite number of steps.
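A sketch of the fixed-increment single-sample variant (η = 1), again on the hypothetical normalized augmented samples from the batch sketch; it cycles through the samples and adds each misclassified one to α.

import numpy as np

def single_sample_perceptron(Y, max_epochs=1000):
    """Fixed-increment rule: whenever alpha^t y <= 0, set alpha <- alpha + y.
    Terminates when a full pass over the samples makes no mistakes (separable case)."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                     # present one example at a time
            if alpha @ y <= 0:
                alpha += y              # eta(k) = 1
                errors += 1
        if errors == 0:
            return alpha
    return alpha                        # may not separate if the data are not separable

# Same hypothetical normalized data as in the batch sketch
Y = np.array([[ 1.0,  2.0,  1.0],
              [ 1.0,  1.0,  2.0],
              [-1.0,  1.0, -1.0],
              [-1.0,  2.0,  1.0]])
print(single_sample_perceptron(Y))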

Perceptron rule (cont’d) Example (single-sample updates; order of presentation: y2, y3, y1, y3): the “batch” algorithm leads to a smoother trajectory in solution space than the single-sample version.