1
Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis
2
Generative vs Discriminant Approach Generative approaches estimate the discriminant function by first estimating the probability distribution of the patterns belonging to each class. Discriminant approaches estimate the discriminant function explicitly, without assuming a probability distribution.
3
Generative Approach (two categories) It is more common to use a single discriminant function (dichotomizer) instead of two. Examples: g(x) = P(ω₁|x) − P(ω₂|x), or g(x) = ln[p(x|ω₁)/p(x|ω₂)] + ln[P(ω₁)/P(ω₂)]. If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
4
Discriminant Approach (two categories) Specify a parametric form of the discriminant function, for example, a linear discriminant: g(x) = wᵗx + w₀. Decide ω₁ if g(x) > 0 and ω₂ if g(x) < 0. If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
5
Discriminant Approach (cont’d) (two categories) Find the “best” decision boundary (i.e., estimate w and w₀) using a set of training examples xₖ.
6
Discriminant Approach (cont’d) (two categories) The solution can be found by minimizing an error function (e.g., the “training error” or “empirical risk”), which measures the disagreement between the predicted class and the correct class label of each training example. “Learning” algorithms can be applied to find the solution.
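As an illustration (not taken from the slides), here is a minimal numpy sketch of one such error function: the fraction of training examples whose predicted class disagrees with the correct label. The data X, the labels z (+1 for ω₁, −1 for ω₂), and the parameters w, w₀ are all hypothetical.

```python
import numpy as np

def training_error(w, w0, X, z):
    """Fraction of samples whose predicted class (sign of g(x_k)) disagrees with z_k."""
    predictions = np.sign(X @ w + w0)
    return np.mean(predictions != z)

# Hypothetical training set: rows of X are the x_k, z holds the correct class labels.
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5]])
z = np.array([+1, +1, -1])
print(training_error(np.array([1.0, 0.0]), 0.0, X, z))   # 0.0 for this toy data
```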
7
Linear Discriminants (two categories) A linear discriminant has the following form: g(x) = wᵗx + w₀. The decision boundary (g(x) = 0) is a hyperplane whose orientation is determined by w and whose location by w₀. – w is the normal to the hyperplane – If w₀ = 0, the hyperplane passes through the origin
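A minimal sketch of this decision rule, using hypothetical values for w and w₀:

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector: normal to the hyperplane
w0 = 0.5                    # hypothetical bias: shifts the hyperplane from the origin

def g(x):
    return w @ x + w0       # g(x) = w^t x + w0

def classify(x):
    return "omega_1" if g(x) > 0 else "omega_2"   # decide omega_1 if g(x) > 0

print(classify(np.array([1.0, 1.0])))   # g = 2 - 1 + 0.5 = 1.5 > 0, so omega_1
```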
8
Geometric Interpretation of g(x) g(x) provides an algebraic measure of the distance of x from the hyperplane. Any x can be expressed as x = xₚ + r·(w/‖w‖), where xₚ is the orthogonal projection of x onto the hyperplane and r is the signed distance of x from the hyperplane, measured along the direction of w.
9
Geometric Interpretation of g(x) (cont’d) Substituting x = xₚ + r·(w/‖w‖) in g(x): g(x) = wᵗxₚ + w₀ + r·(wᵗw)/‖w‖ = r‖w‖, since g(xₚ) = wᵗxₚ + w₀ = 0 and wᵗw = ‖w‖².
10
Geometric Interpretation of g(x) (cont’d) The signed distance of x from the hyperplane is therefore given by r = g(x)/‖w‖. Setting x = 0 gives the distance of the origin from the hyperplane: w₀/‖w‖.
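A short sketch of this distance computation, reusing the hypothetical w and w₀ from the earlier example:

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical hyperplane parameters (as before)
w0 = 0.5

def signed_distance(x):
    return (w @ x + w0) / np.linalg.norm(w)   # r = g(x) / ||w||

print(signed_distance(np.array([1.0, 1.0])))                   # an arbitrary point
print(signed_distance(np.zeros(2)), w0 / np.linalg.norm(w))    # origin: w0 / ||w||
```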
11
Linear Discriminant Functions: multi-category case There are several ways to devise multi-category classifiers using linear discriminant functions: (1) One against the rest. Problem: ambiguous regions.
12
Linear Discriminant Functions: multi-category case (cont’d) (2) One against another (i.e., c(c−1)/2 pairs of classes). Problem: ambiguous regions.
13
Linear Discriminant Functions: multi-category case (cont’d) To avoid the problem of ambiguous regions: – Define c linear discriminant functions gᵢ(x) = wᵢᵗx + wᵢ₀, i = 1, …, c – Assign x to ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i. The resulting classifier is called a linear machine (see Chapter 2).
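A minimal sketch of a linear machine with hypothetical weights (W holds the wᵢ as rows, b the wᵢ₀):

```python
import numpy as np

rng = np.random.default_rng(0)
c, d = 3, 2                   # hypothetical: three classes, two features
W = rng.normal(size=(c, d))   # row i holds w_i
b = rng.normal(size=c)        # b[i] holds w_i0

def linear_machine(x):
    g = W @ x + b             # g_i(x) = w_i^t x + w_i0 for every class
    return int(np.argmax(g))  # assign x to the class with the largest discriminant

print(linear_machine(np.array([0.3, -0.7])))
```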
14
Linear Discriminant Functions: multi-category case (cont’d) A linear machine divides the feature space into c convex decision regions. – If x is in region Rᵢ, then gᵢ(x) is the largest discriminant. Note: although there are c(c−1)/2 pairs of regions, there are typically fewer decision boundaries.
15
Linear Discriminant Functions: multi-category case (cont’d) The decision boundary between adjacent regions Rᵢ and Rⱼ is a portion of the hyperplane Hᵢⱼ given by gᵢ(x) = gⱼ(x), i.e., (wᵢ − wⱼ)ᵗx + (wᵢ₀ − wⱼ₀) = 0. (wᵢ − wⱼ) is normal to Hᵢⱼ, and the signed distance from x to Hᵢⱼ is (gᵢ(x) − gⱼ(x))/‖wᵢ − wⱼ‖.
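A small sketch of this signed distance, again with hypothetical linear-machine parameters W (rows wᵢ) and b (entries wᵢ₀):

```python
import numpy as np

def distance_to_Hij(x, W, b, i, j):
    g_i = W[i] @ x + b[i]
    g_j = W[j] @ x + b[j]
    return (g_i - g_j) / np.linalg.norm(W[i] - W[j])   # (g_i - g_j) / ||w_i - w_j||

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # hypothetical weights
b = np.array([0.0, 0.5, -0.5])
print(distance_to_Hij(np.array([0.2, 0.8]), W, b, 0, 1))
```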
16
Higher Order Discriminant Functions Can produce more complicated decision boundaries than linear discriminant functions.
17
Linear Discriminants – Revisited Augmented feature/parameter space: y = (1, x₁, …, x_d)ᵗ is the augmented feature vector and α = (w₀, w₁, …, w_d)ᵗ is the augmented parameter (weight) vector, so that g(x) = wᵗx + w₀ = αᵗy.
18
Linear Discriminants – Revisited (cont’d) Discriminant: g(x) = αᵗy. Classification rule: if αᵗyᵢ > 0, assign yᵢ to ω₁; else if αᵗyᵢ < 0, assign yᵢ to ω₂.
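A short sketch showing that the augmented form gives the same value as the original discriminant (w and w₀ are the hypothetical values used earlier):

```python
import numpy as np

def augment(x):
    return np.concatenate(([1.0], x))     # y = (1, x_1, ..., x_d)^t

w = np.array([2.0, -1.0])                 # hypothetical original parameters
w0 = 0.5
alpha = np.concatenate(([w0], w))         # alpha = (w0, w_1, ..., w_d)^t

x = np.array([1.0, 1.0])
print(alpha @ augment(x), w @ x + w0)     # identical: g(x) = alpha^t y = w^t x + w0
```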
19
Generalized Discriminants First, map the data to a space of higher dimensionality. – Non-linearly separable → linearly separable. This can be accomplished using special transformation functions (φ functions) that map a point x in the d-dimensional space to a point y = φ(x) in a d̂-dimensional space (where d̂ ≫ d).
20
Generalized Discriminants (cont’d) A generalized discriminant is a linear discriminant in the d̂-dimensional space: g(x) = αᵗy, where y = φ(x). It separates points in the d̂-space by a hyperplane passing through the origin.
21
Example Consider the following φ (mapping) functions for d = 1: y = φ(x) = (1, x, x²)ᵗ. Discriminant: g(x) = α₁ + α₂x + α₃x². The corresponding decision regions R₁, R₂ in the d-space are not simply connected (not linearly separable).
22
Example (cont’d) The mapping takes the line (the d-space) to a parabola in the d̂-space (d̂ = 3). The problem has now become linearly separable: the plane αᵗy = 0 divides the d̂-space into two decision regions.
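A minimal sketch of this d = 1 example; the weight values in alpha are illustrative, chosen only so that the positive region on the line is not simply connected:

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])       # y = (1, x, x^2)^t

alpha = np.array([-1.0, 1.0, 2.0])        # illustrative weights: g(x) = -1 + x + 2x^2

def g(x):
    return alpha @ phi(x)                 # linear in y, quadratic in x

for x in (-2.0, 0.0, 2.0):
    print(x, "omega_1" if g(x) > 0 else "omega_2")   # R_1 is split into two intervals
```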
23
Learning: linearly separable case (two categories) Given a linear discriminant function g(x) = αᵗy, the goal is to “learn” the parameters (weights) α from a set of n labeled samples yᵢ, where each yᵢ has a class label ω₁ or ω₂.
24
Learning: effect of training examples Every training sample yᵢ places a constraint on the weight vector α; let’s see how. Case 1: visualize the solution in “parameter space”: – αᵗy = 0 defines a hyperplane in the parameter space with y being the normal vector. – Given n examples, the solution α must lie in the intersection of n half-spaces. (Figure: the parameter space with axes α₁, α₂.)
25
Learning: effect of training examples (cont’d) Case 2: visualize the solution in “feature space”: – αᵗy = 0 defines a hyperplane in the feature space with α being the normal vector. – Given n examples, the solution α must lie within a certain region.
26
Uniqueness of Solution The solution vector α is usually not unique; we can impose certain constraints to enforce uniqueness, for example: “Find the unit-length weight vector that maximizes the minimum distance from the training examples to the separating plane.”
27
“Learning” Using Iterative Optimization Minimize an error function J(α) (e.g., the classification error) with respect to α. Minimize J(α) iteratively: α(k+1) = α(k) + η(k)·pₖ, where α(k) is the estimate at the k-th iteration, α(k+1) the estimate at the (k+1)-th iteration, η(k) the learning rate (search step), and pₖ the search direction. How should we choose pₖ?
28
Choosing pₖ using Gradient Descent Take the search direction to be the negative gradient, pₖ = −∇J(α(k)), which gives the update α(k+1) = α(k) − η(k)·∇J(α(k)), where η(k) is the learning rate.
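A minimal gradient-descent sketch on a hypothetical quadratic error function (not the perceptron criterion), just to show the update rule in code:

```python
import numpy as np

# Toy quadratic J(alpha) = 0.5 alpha^t H alpha - b^t alpha with hypothetical H, b.
H = np.array([[3.0, 0.5], [0.5, 1.0]])    # positive definite, so J has a unique minimum
b = np.array([1.0, -2.0])

def grad_J(a):
    return H @ a - b                      # gradient of J at a

alpha = np.zeros(2)
eta = 0.1                                 # fixed learning rate
for k in range(200):
    alpha = alpha - eta * grad_J(alpha)   # alpha(k+1) = alpha(k) - eta(k) grad J(alpha(k))

print(alpha, np.linalg.solve(H, b))       # approaches the true minimizer H^{-1} b
```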
29
Gradient Descent (cont’d) (Figure: the error surface J(α) over the search space.)
30
Gradient Descent (cont’d) What is the effect of the learning rate η? If η is too large, the search moves fast but overshoots the solution; if η is too small, it is slow but converges to the solution.
31
Gradient Descent (cont’d) How should we choose the learning rate η(k)? Using a second-order Taylor series approximation of J(α) around α(k), setting α = α(k+1) and substituting the gradient-descent update gives the optimum learning rate η(k) = ‖∇J‖²/(∇Jᵗ H ∇J), where H is the Hessian matrix (2nd derivatives) of J evaluated at α(k).
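A short sketch evaluating this optimum learning rate for the hypothetical quadratic used in the previous sketch (there H is exactly the Hessian of J):

```python
import numpy as np

H = np.array([[3.0, 0.5], [0.5, 1.0]])        # Hessian of the toy quadratic J
b = np.array([1.0, -2.0])

alpha = np.zeros(2)
grad = H @ alpha - b                          # gradient of J at alpha
eta_opt = (grad @ grad) / (grad @ H @ grad)   # ||grad J||^2 / (grad J^t H grad J)
print(eta_opt)
```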
32
Choosing pₖ using Newton’s Method Take pₖ = −H⁻¹∇J(α(k)), which gives the update α(k+1) = α(k) − H⁻¹∇J(α(k)); this requires inverting the Hessian H.
33
Newton’s method (cont’d) If J(α) is quadratic, Newton’s method converges in one step!
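A sketch on the same hypothetical quadratic: a single Newton step lands exactly on the minimizer:

```python
import numpy as np

H = np.array([[3.0, 0.5], [0.5, 1.0]])    # Hessian of the toy quadratic J
b = np.array([1.0, -2.0])

alpha = np.zeros(2)
grad = H @ alpha - b                      # gradient of J at alpha
alpha = alpha - np.linalg.inv(H) @ grad   # alpha(k+1) = alpha(k) - H^{-1} grad J(alpha(k))
print(alpha, np.linalg.solve(H, b))       # identical: one Newton step suffices
```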
34
Gradient descent vs Newton’s method Newton’s method typically needs fewer iterations, but each iteration is more expensive because it requires computing and inverting the Hessian; gradient descent uses only first derivatives.
35
“Dual” Problem Original problem: seek a hyperplane that separates patterns from different categories; classification rule: if αᵗyᵢ > 0 assign yᵢ to ω₁, else if αᵗyᵢ < 0 assign yᵢ to ω₂. “Normalized” problem: if yᵢ is in ω₂, replace yᵢ by −yᵢ, and find α such that αᵗyᵢ > 0 for all i; i.e., seek a hyperplane that puts the normalized patterns on the same (positive) side.
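A minimal sketch of this normalization step on hypothetical augmented samples Y with labels:

```python
import numpy as np

Y = np.array([[1.0,  0.5,  1.0],       # hypothetical augmented samples (first entry is 1)
              [1.0,  2.0,  0.0],
              [1.0, -1.0, -0.5],
              [1.0, -2.0,  1.0]])
labels = np.array([1, 1, 2, 2])        # class (omega_1 or omega_2) of each sample

Y_norm = Y.copy()
Y_norm[labels == 2] *= -1              # replace y_i by -y_i for the omega_2 samples

alpha = np.array([0.1, 1.0, 0.2])      # a candidate weight vector
print(np.all(Y_norm @ alpha > 0))      # True iff alpha puts all normalized samples on
                                       # the positive side, i.e., separates the classes
```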
36
Perceptron rule Goal: find α such that αᵗyᵢ > 0 for all (normalized) samples. Use gradient descent with the perceptron criterion as the error function: Jₚ(α) = Σ_{y ∈ Y(α)} (−αᵗy), where Y(α) is the set of samples misclassified by α. If Y(α) is empty, Jₚ(α) = 0; otherwise, Jₚ(α) > 0.
37
Perceptron rule (cont’d) The gradient of Jₚ(α) is ∇Jₚ(α) = Σ_{y ∈ Y(α)} (−y). The perceptron update rule is then obtained using gradient descent: α(k+1) = α(k) + η(k) Σ_{y ∈ Y(α(k))} y.
38
Perceptron rule (cont’d) (Batch Perceptron algorithm: initialize α; at each iteration, collect the set Y(α) of misclassified examples and add η(k) times their sum to α; stop when Y(α) is empty.)
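A runnable sketch of this batch procedure, assuming the samples have already been augmented and normalized (ω₂ samples negated) as above and are linearly separable:

```python
import numpy as np

def batch_perceptron(Y_norm, eta=1.0, max_iter=1000):
    alpha = np.zeros(Y_norm.shape[1])
    for _ in range(max_iter):
        Y_mis = Y_norm[Y_norm @ alpha <= 0]        # Y(alpha): misclassified samples
        if len(Y_mis) == 0:                        # J_p(alpha) = 0: solution found
            return alpha
        alpha = alpha + eta * Y_mis.sum(axis=0)    # add eta times the sum of Y(alpha)
    return alpha                                   # returned if max_iter is reached

Y_norm = np.array([[ 1.0, 0.5,  1.0],              # hypothetical normalized samples
                   [ 1.0, 2.0,  0.0],
                   [-1.0, 1.0,  0.5],
                   [-1.0, 2.0, -1.0]])
print(batch_perceptron(Y_norm))
```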
39
Perceptron rule (cont’d) Example: each update moves the hyperplane so that the training samples end up on its positive side.
40
Perceptron rule (cont’d) Perceptron Convergence Theorem: if the training samples are linearly separable, then the perceptron algorithm will terminate at a solution vector in a finite number of steps (e.g., the fixed-increment, single-sample version with η(k) = 1, which processes one example at a time).
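A sketch of the fixed-increment, single-sample variant referred to by the theorem, again assuming normalized, augmented, linearly separable samples:

```python
import numpy as np

def single_sample_perceptron(Y_norm, max_passes=1000):
    alpha = np.zeros(Y_norm.shape[1])
    for _ in range(max_passes):
        updated = False
        for y in Y_norm:                 # cycle through the examples in a fixed order
            if alpha @ y <= 0:           # y is misclassified by the current alpha
                alpha = alpha + y        # fixed increment: eta(k) = 1
                updated = True
        if not updated:                  # a full pass with no updates: converged
            return alpha
    return alpha

Y_norm = np.array([[1.0, 0.5, 1.0], [1.0, 2.0, 0.0],
                   [-1.0, 1.0, 0.5], [-1.0, 2.0, -1.0]])   # hypothetical data (as before)
print(single_sample_perceptron(Y_norm))
```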
41
Perceptron rule (cont’d) (Figure: single-sample updates with the order of examples y₂, y₃, y₁, y₃, ….) The “batch” algorithm leads to a smoother trajectory in solution space.