1
Classification
Course web page: vision.cis.udel.edu/~cv
May 14, 2003   Lecture 34
2
Announcements
Read the selection from Trucco & Verri on deformable contours for the guest lecture on Friday
On Monday I'll cover neural networks (Forsyth & Ponce Chapter 22.4) and begin reviewing for the final
3
Outline
Linear discriminants
–Two-class
–Multicategory
Criterion functions for computing discriminants
Generalized linear discriminants
4
Discriminants for Classification
Previously, the decision boundary was chosen based on the underlying class probability distributions:
–Completely known distribution
–Estimate parameters for a distribution of known form
–Nonparametrically approximate an unknown distribution
Idea: Ignore the class distributions and simply assume the decision boundary has a known form with unknown parameters
Discriminant function (two-class): Which side of the boundary is a data point on?
–Linear discriminant: hyperplane decision boundary (in general, not optimal)
5
Two-Class Linear Discriminants
Represent the n-dimensional data points x in homogeneous coordinates: y = (x^T, 1)^T
The decision boundary is the hyperplane a = (w^T, w_0)^T
–w = (w_1, …, w_n)^T: the plane's normal (weight vector)
–w_0: the plane's distance from the origin (bias or threshold) in x space (in y space the plane always passes through the origin)
6
Discriminant Function
Define the two-class linear discriminant function with the dot product g(x) = a^T y
–g(x) = 0: the normal vector and y are orthogonal, so y lies on the plane
–g(x) > 0: the angle between the vectors is acute, so y is on the side of the plane that the normal points to; classify as c_1
–g(x) < 0: the angle between the vectors is obtuse, so y is on the plane's other side; classify as c_2
from Duda et al.
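A minimal NumPy sketch of this two-class rule; the helper names (to_homogeneous, classify_two_class) and the example weights are illustrative, not from the lecture.

```python
import numpy as np

def to_homogeneous(X):
    """Append a 1 to each n-dimensional point x, giving y = (x^T, 1)^T."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def classify_two_class(a, X):
    """Return class 1 where g(x) = a^T y > 0 and class 2 where g(x) < 0."""
    Y = to_homogeneous(X)
    g = Y @ a                      # discriminant values a^T y
    return np.where(g > 0, 1, 2)   # the sign of g(x) picks the side of the hyperplane

# a = (w^T, w_0)^T: normal vector w and bias w_0 of the decision hyperplane
a = np.array([1.0, -1.0, 0.5])               # hypothetical weights for 2-D data
X = np.array([[2.0, 1.0], [0.0, 3.0]])
print(classify_two_class(a, X))              # -> [1 2]
```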
7
Distance to Decision Boundary
The distance from y to the hyperplane in y space is given by the projection a^T y / ||a||
Since ||a|| ≥ ||w||, this is a lower bound on the distance from x to the hyperplane in x space
from Duda et al.
8
Multicategory Linear Discriminants
Given C categories, define C discriminant functions g_i(x) = a_i^T y
Classify x as a member of c_i if g_i(x) > g_j(x) for all j ≠ i
from Duda et al.
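The multicategory rule can be sketched the same way; here A stacks the C weight vectors a_i as rows, and the function name and the numbers are again made up for illustration.

```python
import numpy as np

def classify_multicategory(A, X):
    """A is C x (n+1), with row i holding the weight vector a_i.
    Assign each x to the class whose discriminant g_i(x) = a_i^T y is largest."""
    Y = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous coordinates
    G = Y @ A.T                                    # G[k, i] = g_i(x_k)
    return np.argmax(G, axis=1)                    # index of the winning class

A = np.array([[ 1.0,  0.0, 0.0],                   # three hypothetical discriminants
              [-1.0,  1.0, 0.0],                   # for 2-D data
              [ 0.0, -1.0, 0.5]])
X = np.array([[3.0, 0.5], [-2.0, 4.0]])
print(classify_multicategory(A, X))                # -> [0 1]
```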
9
Characterizing Solutions
Separability: There exists at least one a in weight space (y space) that classifies all samples correctly
Solution region: The region of weight space in which every a separates the classes (not the same as the decision regions!)
Figure (from Duda et al.): separable data vs. non-separable data
10
Normalization
Suppose each data point y_i is classified correctly: as c_1 when a^T y_i > 0 and as c_2 when a^T y_i < 0
Idea: Replace the c_2-labeled samples with their negation -y_i
–This simplifies things, since now we need only look for an a such that a^T y_i > 0 for all of the data
from Duda et al.
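A possible sketch of this normalization trick, assuming labels are coded 1 for c_1 and 2 for c_2; the function name is hypothetical.

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the c_2-labeled rows of Y so that a single condition,
    a^T y_i > 0 for every row, means all samples are classified correctly."""
    signs = np.where(labels == 1, 1.0, -1.0)    # +1 for c_1, -1 for c_2
    return Y * signs[:, None]
```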
11
Margin
Set a minimum distance that the decision hyperplane can be from the nearest data point by requiring a^T y ≥ b. For a particular point y_i, this distance is b/||y_i||
Intuitively, we want a maximal margin
from Duda et al.
12
Criterion Functions
To actually solve for a discriminant a, define a criterion function J(a; y_1, …, y_d) that is minimized when a is a solution
–For example, let J_e = the number of misclassified data points, which is minimal (J_e = 0) for solutions
For practical purposes, we will use something like gradient descent on J to arrive at a solution
–J_e is unsuitable for gradient descent since it is piecewise constant: its gradient is zero (or undefined) everywhere and so provides no descent direction
13
Example: Plot of J_e (from Duda et al.)
14
Perceptron Criterion Function
Define the following piecewise-linear function: J_p(a) = Σ_{y ∈ Y(a)} (-a^T y), where Y(a) is the set of samples misclassified by a
This is proportional to the sum of distances between the misclassified samples and the decision hyperplane
from Duda et al.
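A hedged sketch of batch gradient descent on J_p, assuming the samples have already been normalized as on the previous slide (so a sample is misclassified when a^T y <= 0); the function name, learning rate, and iteration cap are illustrative choices.

```python
import numpy as np

def perceptron(Y_norm, eta=1.0, max_iter=1000):
    """Batch perceptron: rows of Y_norm are normalized homogeneous samples
    (c_2 rows negated), so a sample is misclassified exactly when a^T y <= 0."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_iter):
        mis = Y_norm[Y_norm @ a <= 0]        # Y(a): currently misclassified samples
        if len(mis) == 0:                    # J_p(a) = 0: everything classified correctly
            break
        a = a + eta * mis.sum(axis=0)        # step along -grad J_p = sum of misclassified y
    return a                                 # may never separate non-separable data
```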
15
Non-Separable Data: Error Minimization
The perceptron assumes separability and won't stop otherwise
–It focuses only on erroneous classifications
Idea: Minimize the mean squared error over all of the data
Trying to put the decision hyperplane exactly at the margin leads to linear equations rather than linear inequalities: a^T y_i = y_i^T a = b_i
Stack all the data points as row vectors y_i^T and collect the margins b_i to get the system of equations Ya = b
This can be solved with the pseudoinverse: a = Y^+ b
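The pseudoinverse solution is essentially one line in NumPy; the sample matrix below and the choice b = 1 for every margin are hypothetical.

```python
import numpy as np

def mse_discriminant(Y, b):
    """Least-squares solution of Y a = b via the pseudoinverse, a = Y^+ b."""
    return np.linalg.pinv(Y) @ b

Y = np.array([[ 1.0,  2.0,  1.0],      # hypothetical normalized homogeneous samples
              [ 0.5,  1.5,  1.0],
              [-2.0, -1.0, -1.0],
              [-1.5, -0.5, -1.0]])
b = np.ones(len(Y))                    # e.g., all margins set to 1
a = mse_discriminant(Y, b)
print(a, Y @ a)                        # Ya should be close to b
```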
16
Non-Separable Data: Error Minimization
An alternative to the pseudoinverse approach is gradient descent on the criterion function J_s(a) = ||Ya - b||^2
This is called the Widrow-Hoff or least-mean-squares (LMS) procedure
It doesn't necessarily converge to a separating hyperplane even if one exists
Advantages
–Avoids the problems that occur when Y^T Y is singular
–Avoids the need to manipulate large matrices
from Duda et al.
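A rough single-sample (Widrow-Hoff style) sketch of gradient descent on J_s; the fixed learning rate and epoch count are arbitrary illustration values, and in practice a decreasing step size is often used.

```python
import numpy as np

def lms(Y, b, eta=0.01, n_epochs=100):
    """Widrow-Hoff / LMS: single-sample gradient steps on J_s(a) = ||Ya - b||^2."""
    a = np.zeros(Y.shape[1])
    for _ in range(n_epochs):
        for y_k, b_k in zip(Y, b):
            a = a + eta * (b_k - a @ y_k) * y_k   # negative gradient for one sample
    return a
```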
17
Generalized Linear Discriminants
We originally constructed the vector y from the n-vector x by simply adding one coordinate for the homogeneous representation
We can go further and use any number m of arbitrary functions: y = (y_1(x), …, y_m(x)), sometimes called a basis expansion
Even if the y_i(x) are nonlinear, we can still use linear methods in the m-dimensional y space
Why? Because a linear discriminant in y space corresponds to a nonlinear discriminant in x space, so nonlinear decision boundaries can be handled straightforwardly
18
Example: Quadratic Discriminant
Define the 1-D quadratic discriminant function g(x) = a_1 + a_2 x + a_3 x^2
–This is nonlinear in x, so we can't directly use the methods described thus far
–But by mapping to 3-D with y = (1, x, x^2)^T, we can use linear methods (e.g., perceptron, LMS) to solve for a = (a_1, a_2, a_3)^T in y space
Inefficiency: The number of variables may overwhelm the amount of data for larger n, since a general quadratic discriminant in n dimensions has m = (n + 1)(n + 2)/2 terms
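A small sketch of this 1-D quadratic example: map x to y = (1, x, x^2) and evaluate the discriminant as a linear function of y; the coefficients below are made up.

```python
import numpy as np

def quadratic_features(x):
    """Map scalar x to y = (1, x, x^2) so the quadratic discriminant is linear in y."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

# any linear method (perceptron, LMS, ...) could now be run on Y instead of x
x = np.array([-2.0, -0.5, 0.5, 2.0])
Y = quadratic_features(x)              # shape (4, 3)
a = np.array([-1.0, 0.0, 1.0])         # hypothetical coefficients (a_1, a_2, a_3)
print(np.sign(Y @ a))                  # g(x) = a_1 + a_2*x + a_3*x^2 via the linear form
```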
19
Example: Quadratic Discriminant
Figure (from Duda et al.): no linear decision boundary separates the classes in x space, but a hyperplane separates them in y space
20
Support Vector Machines (SVM)
Map the input nonlinearly to a higher-dimensional space (where in general there is a separating hyperplane)
Find the separating hyperplane that maximizes the distance to the nearest data point (i.e., the margin)
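For completeness, a hedged scikit-learn sketch (not part of the original lecture) of an SVM with an RBF kernel on toy data that is not linearly separable in the original space; the dataset and parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

# toy data: two classes separated by distance from the origin (not linearly separable)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# the RBF kernel implicitly maps the inputs to a higher-dimensional space;
# the SVM then finds the maximum-margin separating hyperplane there
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.score(X, y))        # training accuracy of the max-margin classifier
```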
21
Example: SVM for Gender Classification of Faces
Data: 1,755 cropped face images of size 21 x 12
Error rates
–Human: 30.7% (hampered by lack of hair cues?)
–SVM: 3.4% (5-fold cross-validation)
Figure (courtesy of B. Moghaddam): humans' top misclassifications
from Moghaddam & Yang, 2001