Classification Course web page: vision.cis.udel.edu/~cv May 14, 2003 Lecture 34
Announcements Read selection from Trucco & Verri on deformable contours for guest lecture on Friday On Monday I’ll cover neural networks (Forsyth & Ponce Chapter 22.4), and begin reviewing for the final
Outline Linear discriminants –Two-class –Multicategory Criterion functions for computing discriminants Generalized linear discriminants
Discriminants for Classification Previously, the decision boundary was chosen based on the underlying class probability distributions –Completely known distributions –Estimated parameters for distributions of known form –Nonparametric approximation of an unknown distribution Idea: Ignore the class distributions and simply assume the decision boundary is of known form with unknown parameters Discriminant function (two-class): Which side of the boundary is the data point on? –Linear discriminant: hyperplane decision boundary; in general, not optimal
Two-Class Linear Discriminants Represent the n-dimensional data points x in homogeneous coordinates: y = (x^T, 1)^T The decision boundary is the hyperplane a = (w^T, w_0)^T –w = (w_1, …, w_n)^T: plane's normal (weight vector) –w_0: plane's distance from the origin (bias or threshold) in x space (the plane always passes through the origin of y space)
Discriminant Function Define the two-class linear discriminant function with the dot product g(x) = a^T y –g(x) = 0: the normal vector to the plane and y are orthogonal; y is on the plane –g(x) > 0: the angle between the vectors is acute; y is on the side of the plane that the normal points to; classify as c_1 –g(x) < 0: the angle between the vectors is obtuse; y is on the plane's other side; classify as c_2 from Duda et al.
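As a concrete illustration (not from the lecture; the function names and the example hyperplane are made up), a minimal NumPy sketch of the sign rule above:

```python
import numpy as np

def to_homogeneous(x):
    """Append a 1 to the n-vector x to get y = (x^T, 1)^T."""
    return np.append(x, 1.0)

def classify_two_class(a, x):
    """Apply the two-class rule: c_1 if g(x) = a^T y > 0, c_2 if g(x) < 0."""
    g = a @ to_homogeneous(x)
    if g > 0:
        return "c1"
    if g < 0:
        return "c2"
    return "on boundary"

# Example: a = (w^T, w_0)^T with w = (1, 1), w_0 = -1, i.e. boundary x_1 + x_2 = 1
a = np.array([1.0, 1.0, -1.0])
print(classify_two_class(a, np.array([2.0, 2.0])))  # c1 (g = 3 > 0)
print(classify_two_class(a, np.array([0.0, 0.0])))  # c2 (g = -1 < 0)
```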
Distance to Decision Boundary The distance from y to the hyperplane in y space is given by the projection a^T y / ||a|| Since ||a|| ≥ ||w||, this is a lower bound on the distance of x to the hyperplane in x space from Duda et al.
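A small sketch (again assuming NumPy, with an arbitrary example hyperplane) that computes both distances and confirms the bound, since ||a||^2 = ||w||^2 + w_0^2 ≥ ||w||^2:

```python
import numpy as np

def distances(a, x):
    """Return (distance of y to the hyperplane in y space, distance of x to the
    decision boundary in x space), where a = (w^T, w_0)^T and y = (x^T, 1)^T."""
    y = np.append(x, 1.0)
    w = a[:-1]
    g = a @ y                      # g(x) = a^T y
    return abs(g) / np.linalg.norm(a), abs(g) / np.linalg.norm(w)

a = np.array([1.0, 1.0, -1.0])
d_y, d_x = distances(a, np.array([2.0, 2.0]))
print(d_y <= d_x)  # True: the y-space projection is a lower bound
```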
Multicategory Linear Discriminants Given C categories, define C discriminant functions g_i(x) = a_i^T y Classify x as a member of c_i if g_i(x) > g_j(x) for all j ≠ i from Duda et al.
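A minimal sketch of the multicategory rule, assuming (purely for convenience) that the C weight vectors a_i are stacked as the rows of a matrix A:

```python
import numpy as np

def classify_multicategory(A, x):
    """Return the index i maximizing g_i(x) = a_i^T y, with the a_i as rows of A."""
    y = np.append(x, 1.0)
    return int(np.argmax(A @ y))

A = np.array([[ 1.0, 0.0, 0.0],    # g_0 grows with x_1
              [-1.0, 0.0, 0.0],    # g_1 shrinks with x_1
              [ 0.0, 1.0, 0.0]])   # g_2 grows with x_2
print(classify_multicategory(A, np.array([3.0, 1.0])))  # 0
```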
Characterizing Solutions Separability: There exists at least one a in weight space (y space) that classifies all samples correctly Solution region: The region of weight space in which every a separates the classes (not the same as the decision regions!) from Duda et al. (panels: separable data, non-separable data)
Normalization Suppose each data point y_i is classified correctly as c_1 when a^T y_i > 0 and as c_2 when a^T y_i < 0 Idea: Replace the c_2-labeled samples with their negations -y_i –This simplifies things, since now we need only look for an a such that a^T y_i > 0 for all of the data from Duda et al.
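A sketch of this normalization step, assuming the homogeneous samples are the rows of a matrix Y and the class labels are 1 or 2 (a hypothetical data layout, not prescribed by the lecture):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the c_2-labeled rows of Y so that a separating a satisfies
    a^T y_i > 0 for every returned row."""
    Y = Y.astype(float).copy()
    Y[labels == 2] *= -1.0
    return Y

Y = np.array([[1.0, 2.0, 1.0],     # a c_1 sample (unchanged)
              [3.0, 1.0, 1.0]])    # a c_2 sample (negated)
print(normalize_samples(Y, np.array([1, 2])))
```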
Margin Set a minimum distance that the decision hyperplane must keep from the nearest data point by requiring a^T y ≥ b. For a particular point y_i, this distance is b/||y_i|| Intuitively, we want a maximal margin from Duda et al.
Criterion Functions To actually solve for a discriminant a, define a criterion function J(a; y_1, …, y_d) that is minimized when a is a solution –For example, let J_e = the number of misclassified data points Minimal (J_e = 0) for solutions For practical purposes, we will use something like gradient descent on J to arrive at a solution –J_e is unsuitable for gradient descent since it is piecewise constant
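For concreteness, a sketch of J_e on normalized samples (points exactly on the boundary are counted as errors here, which is one possible convention):

```python
import numpy as np

def J_e(a, Y):
    """Number of normalized homogeneous samples (rows of Y) misclassified by a,
    i.e. those with a^T y <= 0."""
    return int(np.sum(Y @ a <= 0))
```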
Example: Plot of J_e from Duda et al.
Perceptron Criterion Function Define the following piecewise-linear function: J_p(a) = Σ_{y ∈ Y(a)} (-a^T y), where Y(a) is the set of samples misclassified by a This is proportional to the sum of the distances between the misclassified samples and the decision hyperplane from Duda et al.
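A sketch of batch gradient descent on J_p: the gradient of J_p is the sum of -y over the misclassified samples, so the update is a ← a + η Σ y. The data layout (normalized homogeneous samples as rows of Y) and the fixed step size η = 1 are assumptions:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron on normalized samples: a solution satisfies a^T y > 0 for
    every row of Y. Each step adds eta times the sum of the misclassified samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]          # samples with a^T y <= 0
        if len(mis) == 0:
            break                     # every sample is classified correctly
        a = a + eta * mis.sum(axis=0)
    return a

# Tiny separable example (third row is a c_2 sample that has already been negated)
Y = np.array([[2.0, 1.0,  1.0],
              [1.0, 2.0,  1.0],
              [1.0, 1.0, -1.0]])
a = batch_perceptron(Y)
print(Y @ a > 0)  # all True once a separating vector is found
```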
Non-Separable Data: Error Minimization The perceptron assumes separability and won't stop otherwise –It focuses only on the erroneous classifications Idea: Minimize the mean squared error over all of the data Trying to put the decision hyperplane exactly at the margin leads to linear equations rather than linear inequalities: a^T y_i = y_i^T a = b_i Stack all of the data points as row vectors y_i^T and collect the margins b_i to get the system of equations Ya = b Can solve with the pseudoinverse a = Y⁺ b, where Y⁺ = (Y^T Y)^(-1) Y^T when Y^T Y is nonsingular
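A sketch of the pseudoinverse solution using NumPy's pinv (which computes Y⁺ via the SVD, so it also copes with a singular Y^T Y; the margins b_i are set to 1 here just for illustration):

```python
import numpy as np

def mse_discriminant(Y, b):
    """Least-squares solution of Ya = b: a = Y^+ b."""
    return np.linalg.pinv(Y) @ b

Y = np.array([[2.0, 1.0,  1.0],
              [1.0, 2.0,  1.0],
              [1.0, 1.0, -1.0]])   # normalized samples as rows
b = np.ones(3)                      # margins b_i
a = mse_discriminant(Y, b)
print(np.allclose(Y @ a, b))        # True here because this Y happens to be invertible
```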
Non-Separable Data: Error Minimization An alternative to the pseudoinverse approach is gradient descent on the criterion function J_s(a) = ||Ya - b||^2 This is called the Widrow-Hoff or least-mean-squared (LMS) procedure It doesn't necessarily converge to a separating hyperplane even if one exists Advantages –Avoids the problems that occur when Y^T Y is singular –Avoids the need to manipulate large matrices from Duda et al.
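A sketch of the single-sample Widrow-Hoff/LMS update a ← a + η (b_k - a^T y_k) y_k; the fixed learning rate and epoch count are simplifications (in practice η is usually decreased over time to aid convergence):

```python
import numpy as np

def lms(Y, b, eta=0.01, n_epochs=100):
    """Descend J_s(a) = ||Ya - b||^2 one sample at a time, never forming Y^T Y."""
    a = np.zeros(Y.shape[1])
    for _ in range(n_epochs):
        for y_k, b_k in zip(Y, b):
            a = a + eta * (b_k - a @ y_k) * y_k
    return a
```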
Generalized Linear Discriminants We originally constructed the vector y from the n-vector x by simply adding a 1 coordinate for the homogeneous representation We can go further and use any number m of arbitrary functions: y = (y_1(x), …, y_m(x))^T (sometimes called a basis expansion) Even if the y_i(x) are nonlinear, we can still use linear methods in the m-dimensional y space Why? Because we can then handle nonlinear discriminant functions straightforwardly
Example: Quadratic Discriminant Define the 1-D quadratic discriminant function g(x) = a_1 + a_2 x + a_3 x^2 –This is nonlinear in x, so we can't directly use the methods described thus far –But by mapping to 3-D with y = (1, x, x^2)^T, we can use linear methods (e.g., perceptron, LMS) to solve for a = (a_1, a_2, a_3)^T in y space Inefficiency: The number of variables may overwhelm the amount of data for larger n, since a full quadratic discriminant in n dimensions has m = (n + 1)(n + 2)/2 terms
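A sketch of this 1-D example: map x to y = (1, x, x^2)^T and evaluate the quadratic discriminant as a dot product in y space (the coefficients are chosen arbitrarily for illustration):

```python
import numpy as np

def quadratic_features(x):
    """Map a scalar x to y = (1, x, x^2)^T so that g(x) = a_1 + a_2 x + a_3 x^2 = a^T y."""
    return np.array([1.0, x, x * x])

a = np.array([-1.0, 0.0, 1.0])            # g(x) = x^2 - 1, i.e. classify |x| > 1 as c_1
print(a @ quadratic_features(2.0) > 0)    # True  (c_1)
print(a @ quadratic_features(0.5) > 0)    # False (c_2)
```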
Example: Quadratic Discriminant (figure from Duda et al.: no linear decision boundary separates the data in x space, but a hyperplane separates them in y space)
Support Vector Machines (SVM) Map the input nonlinearly to a higher-dimensional space (where, in general, there is a separating hyperplane) Find the separating hyperplane that maximizes the distance to the nearest data point (i.e., the margin)
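As one possible illustration (scikit-learn is an assumed library choice here, not the implementation used in the study on the next slide), an RBF-kernel SVM implicitly maps the input to a higher-dimensional space and finds a maximum-margin separating hyperplane there:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable in x space
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0)  # kernelized maximum-margin classifier
clf.fit(X, labels)
print(clf.score(X, labels))     # training accuracy
```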
Example: SVM for Gender Classification of Faces Data: 21 x 12 cropped face images (courtesy of B. Moghaddam) Error rates –Human: 30.7% (hampered by lack of hair cues?) –SVM: 3.4% (5-fold cross-validation) Figure: humans' top misclassifications, from Moghaddam & Yang, 2001