Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis.


Generative vs Discriminant Approach Generative approaches estimate the discriminant function by first estimating the probability distribution of the patterns belonging to each class. Discriminant approaches estimate the discriminant function explicitly, without assuming a probability distribution.

Generative Approach (two categories) It is more common to use a single discriminant function (dichotomizer) instead of two. Examples: g(x) = P(ω₁|x) − P(ω₂|x), or g(x) = ln p(x|ω₁)/p(x|ω₂) + ln P(ω₁)/P(ω₂). If g(x)=0, then x lies on the decision boundary and can be assigned to either class.

Discriminant Approach (two categories) Specify a parametric form of the discriminant function, for example, a linear discriminant: g(x) = wᵗx + w₀. Decide ω₁ if g(x) > 0 and ω₂ if g(x) < 0. If g(x)=0, then x lies on the decision boundary and can be assigned to either class.

Discriminant Approach (cont’d) (two categories) Find the “best” decision boundary (i.e., estimate w and w₀) using a set of training examples xₖ.

Discriminant Approach (cont’d) (two categories) The solution can be found by minimizing an error function, e.g., the “training error” or “empirical risk”: J(w, w₀) = (1/n) Σₖ [zₖ − g(xₖ)]², where zₖ is the correct class label (e.g., zₖ = ±1) and g(xₖ) is the predicted class. “Learning” algorithms can be applied to find the solution.
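As a concrete illustration (not from the slides; the function name empirical_risk and the toy data are hypothetical), a minimal Python sketch of the training error of a candidate linear discriminant:

import numpy as np

def empirical_risk(w, w0, X, z):
    """Fraction of training samples misclassified by g(x) = w^T x + w0.
    X: (n, d) array of samples x_k;  z: (n,) array of correct labels (+1 for w1, -1 for w2)."""
    predictions = np.sign(X @ w + w0)      # predicted class for each x_k
    return np.mean(predictions != z)       # training error (empirical risk)

# Toy usage: two 2-D samples, one per class
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
z = np.array([+1, -1])
print(empirical_risk(np.array([1.0, 1.0]), 0.0, X, z))   # 0.0 (both classified correctly)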

Linear Discriminants (two categories) A linear discriminant has the following form: g(x) = wᵗx + w₀. The decision boundary, g(x)=0, is a hyperplane whose orientation is determined by w and whose location by w₀. – w is the normal to the hyperplane – If w₀=0, the hyperplane passes through the origin

Geometric Interpretation of g(x) g(x) provides an algebraic measure of the distance of x from the hyperplane. x can be expressed as x = xₚ + r (w/||w||), where xₚ is the projection of x onto the hyperplane, r is the signed distance of x from the hyperplane, and the unit vector w/||w|| gives the direction of r.

Geometric Interpretation of g(x) (cont’d) Substituting x into g(x): g(x) = wᵗx + w₀ = wᵗ(xₚ + r w/||w||) + w₀ = (wᵗxₚ + w₀) + r wᵗw/||w|| = r ||w||, since g(xₚ) = 0 and wᵗw = ||w||².

Geometric Interpretation of g(x) (cont’d) The distance of x from the hyperplane is therefore given by r = g(x)/||w||. Setting x=0 gives the distance of the origin from the hyperplane: r₀ = g(0)/||w|| = w₀/||w||.
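A small numerical check of these formulas (hypothetical w, w₀, and x, not taken from the slides):

import numpy as np

w = np.array([3.0, 4.0])    # normal to the hyperplane (||w|| = 5)
w0 = -10.0                  # bias term

def g(x):
    return w @ x + w0       # linear discriminant g(x) = w^T x + w0

x = np.array([6.0, 8.0])
r = g(x) / np.linalg.norm(w)     # signed distance of x from the hyperplane
r0 = w0 / np.linalg.norm(w)      # signed distance of the origin (set x = 0)
print(r, r0)                     # 8.0, -2.0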

Linear Discriminant Functions: multi-category case There are several ways to devise multi-category classifiers using linear discriminant functions: (1) One against the rest: use c classifiers, each separating one class from all the others. Problem: ambiguous regions where the classifications conflict.

Linear Discriminant Functions: multi-category case (cont’d) (2) One against another: use one classifier for each of the c(c−1)/2 pairs of classes. Problem: ambiguous regions remain.

Linear Discriminant Functions: multi-category case (cont’d) To avoid the problem of ambiguous regions: – Define c linear discriminant functions gᵢ(x) = wᵢᵗx + wᵢ₀, i = 1, …, c. – Assign x to ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i. The resulting classifier is called a linear machine (see Chapter 2).
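A minimal sketch of this decision rule (hypothetical function name and toy weights, not from the slides):

import numpy as np

def linear_machine_predict(x, W, w0):
    """Assign x to the class i with the largest g_i(x) = w_i^T x + w_i0.
    W:  (c, d) array whose rows are the weight vectors w_i;  w0: (c,) array of biases w_i0."""
    g = W @ x + w0               # all c discriminant values
    return int(np.argmax(g))     # index of the winning class

# Toy usage with c = 3 classes in d = 2 dimensions
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.0])
print(linear_machine_predict(np.array([2.0, 0.5]), W, w0))   # 0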

Linear Discriminant Functions: multi-category case (cont’d) A linear machine divides the feature space into c convex decision regions. – If x is in region Rᵢ, then gᵢ(x) is the largest discriminant value. Note: although there are c(c−1)/2 pairs of regions, there are typically fewer decision boundaries.

Linear Discriminant Functions: multi-category case (cont’d) The decision boundary between adjacent regions Rᵢ and Rⱼ is a portion of the hyperplane Hᵢⱼ given by gᵢ(x) = gⱼ(x), i.e., (wᵢ − wⱼ)ᵗx + (wᵢ₀ − wⱼ₀) = 0. (wᵢ − wⱼ) is normal to Hᵢⱼ, and the signed distance from x to Hᵢⱼ is (gᵢ(x) − gⱼ(x)) / ||wᵢ − wⱼ||.

Higher Order Discriminant Functions Can produce more complicated decision boundaries than linear discriminant functions.

Linear Discriminants – Revisited Augmented feature/parameter space: y = [1, x₁, …, x_d]ᵗ (augmented feature vector) and α = [w₀, w₁, …, w_d]ᵗ (augmented parameter vector), so that g(x) = wᵗx + w₀ = αᵗy.

Linear Discriminants – Revisited (cont’d) Discriminant: g(x) = αᵗy. Classification rule: If αᵗyᵢ > 0 assign yᵢ to ω₁, else if αᵗyᵢ < 0 assign yᵢ to ω₂.

Generalized Discriminants First, map the data to a space of higher dimensionality: – non-linearly separable → linearly separable. This can be accomplished using special transformation functions (φ functions) that map a point from the d-dimensional space to a point in a d̂-dimensional space, where d̂ >> d.

Generalized Discriminants (cont’d) A generalized discriminant is a linear discriminant in the d̂-dimensional space: g(x) = Σᵢ aᵢ φᵢ(x) = αᵗy, where y = [φ₁(x), …, φ_d̂(x)]ᵗ. It separates points in the d̂-space by a hyperplane passing through the origin.

Example φ functions Consider d = 1 and the mapping functions y = φ(x) = [1, x, x²]ᵗ, giving the discriminant g(x) = a₁ + a₂x + a₃x². The corresponding decision regions R₁, R₂ in the d-space are not simply connected (not linearly separable).

Example (cont’d) y = φ(x) maps the line (the d-space) onto a parabola in the d̂-space (here d̂ = 3). The problem has now become linearly separable: the plane αᵗy = 0 divides the d̂-space into two decision regions.
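To see the idea numerically, here is a hypothetical 1-D example (the weights a and the sample points are illustrative assumptions, not the slide’s figure): the outer points and the inner points cannot be separated by a threshold on the line, but they become linearly separable after the quadratic mapping.

import numpy as np

def phi(x):
    """Map a 1-D point to the 3-D space y = (1, x, x^2)."""
    return np.array([1.0, x, x * x])

# Two classes that are NOT linearly separable on the line:
# w1 = {-3, 3} (outer points), w2 = {0, 1} (inner points)
a = np.array([-4.0, 0.0, 1.0])     # hypothetical weights: g(x) = -4 + x^2

for x, label in [(-3, "w1"), (3, "w1"), (0, "w2"), (1, "w2")]:
    print(x, label, a @ phi(x) > 0)   # True for w1 samples, False for w2 samples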

Learning: linearly separable case (two categories) Given a linear discriminant function g(x) = αᵗy, the goal is to “learn” the parameters (weights) α from a set of n labeled samples yᵢ, where each yᵢ has a class label ω₁ or ω₂.

Learning: effect of training examples Every training sample yᵢ places a constraint on the weight vector α; let’s see how. Case 1: visualize the solution in “parameter space” (a₁, a₂): – αᵗy=0 defines a hyperplane in the parameter space with y being the normal vector. – Given n examples, the solution α must lie in the intersection of n half-spaces.

Learning: effect of training examples Case 2: visualize solution in “feature space”: – α t y=0 defines a hyperplane in the feature space with α being the normal vector. – Given n examples, the solution α must lie within a certain region.

Uniqueness of Solution The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, for example: “Find the unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane.”

“Learning” Using Iterative Optimization Minimize an error function J(α) (e.g., the classification error) with respect to α. Minimize J(α) iteratively: α(k+1) = α(k) + η(k) pₖ, where α(k) is the estimate at the k-th iteration, α(k+1) is the estimate at the (k+1)-th iteration, pₖ is the search direction, and η(k) is the learning rate (search step). How should we choose pₖ?

Choosing pₖ using Gradient Descent Set pₖ = −∇J(α(k)), which gives the update α(k+1) = α(k) − η(k) ∇J(α(k)), where η(k) is the learning rate.
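A generic sketch of this update loop (hypothetical function name gradient_descent and a toy quadratic objective, assumed for illustration):

import numpy as np

def gradient_descent(grad_J, alpha0, eta=0.1, max_iter=1000, tol=1e-6):
    """Iterate alpha(k+1) = alpha(k) - eta * grad_J(alpha(k)) until the step is tiny."""
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(alpha)
        alpha = alpha - step
        if np.linalg.norm(step) < tol:    # stop when the update is negligible
            break
    return alpha

# Toy quadratic J(a) = ||a - a_star||^2, whose gradient is 2 (a - a_star)
a_star = np.array([1.0, -2.0])
print(gradient_descent(lambda a: 2 * (a - a_star), np.zeros(2)))   # approximately [1, -2]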

Gradient Descent (cont’d) (Figure: the search space of J(α) and the trajectory followed by gradient descent.)

Gradient Descent (cont’d) What is the effect of the learning rate η? If η is too large, descent is fast but may overshoot the solution; if η is too small, descent is slow but converges to the solution.

Gradient Descent (cont’d) How should the learning rate η(k) be chosen? Using a second-order Taylor series approximation of J(α) around α(k), setting α = α(k+1) and substituting the gradient-descent update, the optimum learning rate is η(k) = ||∇J||² / (∇Jᵗ H ∇J), where H is the Hessian (matrix of second derivatives) of J evaluated at α(k).
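A quick numerical evaluation of this formula on a hypothetical quadratic objective (the Hessian H and starting point are illustrative assumptions):

import numpy as np

H = np.array([[2.0, 0.0], [0.0, 10.0]])   # Hessian of J(a) = 0.5 a^T H a
grad = lambda a: H @ a                    # gradient of that quadratic

a = np.array([4.0, 1.0])
g = grad(a)
eta_opt = (g @ g) / (g @ H @ g)           # optimum rate: ||grad J||^2 / (grad J^T H grad J)
a_next = a - eta_opt * g                  # one gradient-descent step with the optimum rate
print(eta_opt, a_next)                    # for a quadratic, this step exactly minimizes J along -grad J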

Choosing pₖ using Newton’s Method Set pₖ = −H⁻¹∇J(α(k)), which gives the update α(k+1) = α(k) − H⁻¹∇J(α(k)); this requires inverting the Hessian H at each iteration.

Newton’s method (cont’d) If J(α) is quadratic, Newton’s method converges in one step!
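A small verification of the one-step property on a hypothetical quadratic (H, b, and the starting point are assumed for illustration):

import numpy as np

H = np.array([[2.0, 1.0], [1.0, 4.0]])    # positive-definite Hessian of the quadratic J
b = np.array([1.0, -1.0])
# J(a) = 0.5 a^T H a - b^T a  has gradient H a - b and minimizer a* = H^{-1} b
grad = lambda a: H @ a - b

a = np.array([10.0, 10.0])                # arbitrary starting point
a = a - np.linalg.solve(H, grad(a))       # one Newton step: a - H^{-1} grad J(a)
print(a, np.allclose(grad(a), 0))         # gradient is zero: converged in a single step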

Gradient descent vs Newton’s method Newton’s method typically needs fewer iterations, but each iteration is more expensive because the Hessian must be computed and inverted; gradient descent takes cheaper steps but may need many more of them.

“Dual” Problem If yᵢ belongs to ω₂, replace yᵢ by −yᵢ. Instead of seeking a hyperplane that separates patterns from different categories, we now seek a hyperplane that puts the normalized patterns on the same (positive) side, i.e., find α such that αᵗyᵢ > 0 for all i. The classification rule on the original samples is unchanged: if αᵗyᵢ > 0 assign yᵢ to ω₁, else if αᵗyᵢ < 0 assign yᵢ to ω₂.
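A minimal sketch of this normalization step (hypothetical augmented samples and labels, assumed for illustration):

import numpy as np

# Augmented samples y_i = (1, x_i) with class labels 1 (w1) or 2 (w2)
Y = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
labels = np.array([1, 1, 2, 2])

Y_norm = Y.copy()
Y_norm[labels == 2] *= -1                 # replace y_i by -y_i for class w2

alpha = np.array([0.0, 1.0])              # a separating weight vector for this toy set
print(np.all(Y_norm @ alpha > 0))         # True: all normalized samples lie on the positive side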

Perceptron rule Find α such that αᵗyᵢ > 0 for all samples. Use gradient descent, assuming that the error function to be minimized is the perceptron criterion Jₚ(α) = Σ (−αᵗy), where the sum is over Y(α), the set of samples misclassified by α. If Y(α) is empty, Jₚ(α)=0; otherwise, Jₚ(α)>0.

Perceptron rule (cont’d) The gradient of Jₚ(α) is ∇Jₚ(α) = Σ (−y), summed over y ∈ Y(α). The perceptron update rule is obtained using gradient descent: α(k+1) = α(k) + η(k) Σ y (summing over the samples misclassified by α(k)), or, in single-sample form, α(k+1) = α(k) + η(k) yᵏ for one misclassified sample yᵏ.

Perceptron rule (cont’d) Batch perceptron algorithm: at each iteration, add η(k) times the sum of the misclassified examples Y(α) to α, and stop when no examples are misclassified (or when the update falls below a threshold). A sketch is shown below.
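A Python sketch of the batch update (hypothetical function name batch_perceptron; it assumes the samples have already been normalized as described above):

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=100):
    """Batch perceptron on normalized augmented samples (rows of Y).
    At each step, add the sum of all currently misclassified samples to alpha."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ alpha <= 0]          # Y(alpha): samples with alpha^T y <= 0
        if len(misclassified) == 0:
            break                                   # Y(alpha) is empty: J_p(alpha) = 0, done
        alpha = alpha + eta * misclassified.sum(axis=0)
    return alpha

# Same normalized toy samples as before (class w2 already multiplied by -1)
Y = np.array([[1.0, 2.0], [1.0, 1.0], [-1.0, 1.0], [-1.0, 2.0]])
print(batch_perceptron(Y))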

Perceptron rule (cont’d) Example: each update moves the hyperplane so that the training samples end up on its positive side (figure in parameter space (a₁, a₂)).

Perceptron rule (cont’d) Perceptron Convergence Theorem: if the training samples are linearly separable, the perceptron algorithm (e.g., the fixed-increment, single-sample version with η(k)=1, processing one example at a time) will terminate at a solution vector in a finite number of steps.
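A sketch of the fixed-increment, single-sample version (hypothetical function name and the same toy normalized samples as above, assumed for illustration):

import numpy as np

def perceptron_single_sample(Y, max_epochs=100):
    """Fixed-increment, single-sample perceptron (eta(k) = 1).
    Cycles through the normalized samples one at a time, adding any misclassified
    sample to alpha; stops after a full pass with no errors."""
    alpha = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                      # one example at a time
            if alpha @ y <= 0:           # misclassified (or on the boundary)
                alpha = alpha + y        # update: alpha <- alpha + y
                errors += 1
        if errors == 0:                  # converged (samples are linearly separable)
            break
    return alpha

Y = np.array([[1.0, 2.0], [1.0, 1.0], [-1.0, 1.0], [-1.0, 2.0]])
print(perceptron_single_sample(Y))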

Perceptron rule (cont’d) order of examples: y 2 y 3 y 1 y 3 “Batch” algorithm leads to a smoother trajectory in solution space.