1
Linear Models for Classification
Berkay Topçu
2
Linear Models for Classification
Goal: take an input vector x and assign it to one of K discrete classes Ck, where k = 1, ..., K.
Linear separation of classes: the decision boundaries are linear functions of the input x.
3
Generalized Linear Models
We wish to predict discrete class labels or, more generally, posterior class probabilities, which lie in the range (0, 1).
The classification model is a linear function of the parameters: y(x) = f(w^T x + w_0), where f(.) is a nonlinear activation function.
Classification can be performed directly in the original input space x, or on a fixed nonlinear transformation of the input variables using a vector of basis functions φ(x).
4
Discriminant Functions
Linear discriminant: y(x) = w^T x + w_0, where w is the weight vector and w_0 is the bias.
If y(x) ≥ 0, assign x to class C1, and to class C2 otherwise.
The decision boundary is given by y(x) = 0: w determines the orientation of the decision surface and w_0 determines its location.
Compact notation: y(x) = w~^T x~, where w~ = (w_0, w) and x~ = (1, x).
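As an illustration only (not from the slides), a minimal Python sketch of such a two-class linear discriminant, with the weights w and bias w_0 chosen arbitrarily:

```python
import numpy as np

# Hypothetical weights for a 2-D input; any w, w0 define one linear discriminant.
w = np.array([2.0, -1.0])   # orientation of the decision surface
w0 = 0.5                    # bias: shifts the surface away from the origin

def classify(x):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, else to C2."""
    return "C1" if w @ x + w0 >= 0 else "C2"

print(classify(np.array([1.0, 1.0])))    # y = 1.5  -> C1
print(classify(np.array([-1.0, 2.0])))   # y = -3.5 -> C2
```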
5
Multiple Classes
Build a K-class discriminant (K > 2) by combining a number of two-class discriminant functions.
One-versus-the-rest: K - 1 classifiers, each separating the points in one particular class Ck from the points not in that class.
One-versus-one: K(K-1)/2 binary discriminant functions, one for every possible pair of classes (a concrete count is given below).
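For a concrete count (simple arithmetic, not on the slide): with K = 4 classes, one-versus-the-rest uses K - 1 = 3 binary classifiers, while one-versus-one uses K(K-1)/2 = 4·3/2 = 6.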
6
Multiple Classes
A single K-class discriminant comprising K linear functions: y_k(x) = w_k^T x + w_k0.
Assign x to class Ck if y_k(x) > y_j(x) for all j ≠ k.
How do we learn the parameters of the linear discriminant functions?
7
Least Squares for Classification
Each class Ck is described by its own linear model y_k(x) = w_k^T x + w_k0, written jointly as y(x) = W~^T x~.
Training data set {x_n, t_n} for n = 1, ..., N, where t_n is a 1-of-K (one-hot) target vector.
Define the matrix T whose nth row is the vector t_n^T, and the matrix X~ whose nth row is x~_n^T.
8
Least Squares for Classification
Minimizing the sum-of-squares error function E(W~) = (1/2) Tr{ (X~ W~ - T)^T (X~ W~ - T) }.
Solution: W~ = (X~^T X~)^{-1} X~^T T = X~^† T, where X~^† is the pseudo-inverse of X~.
Discriminant function: y(x) = W~^T x~ = T^T (X~^†)^T x~.
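A minimal Python sketch of this least-squares solution on assumed toy data (two Gaussian blobs), using the pseudo-inverse and predicting by the largest output:

```python
import numpy as np

# Assumed toy data: X is N x D, t holds class labels in {0, ..., K-1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
t = np.array([0] * 20 + [1] * 20)
K = 2

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias feature
T = np.eye(K)[t]                                     # 1-of-K target matrix
W_tilde = np.linalg.pinv(X_tilde) @ T                # W~ = X~^dagger T

def predict(x):
    """Assign x to the class whose linear function gives the largest output."""
    y = W_tilde.T @ np.concatenate([[1.0], x])
    return int(np.argmax(y))

print(predict(np.array([0.0, 0.0])))   # expected: class 0
print(predict(np.array([3.0, 3.0])))   # expected: class 1
```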
9
Fisher’s Linear Discriminant
Dimensionality reduction: take the D-dimensional input vector x and project it to one dimension using y = w^T x.
Choose the projection that maximizes class separation.
Two-class problem: N1 points of class C1 and N2 points of class C2, with class means m1 and m2.
Fisher's idea: choose w to give a large separation between the projected class means and a small variance within each class, thereby minimizing class overlap.
10
Fisher’s Linear Discriminant
The Fisher criterion: J(w) = (m2 - m1)^2 / (s1^2 + s2^2), the separation of the projected class means divided by the within-class variance of the projected data.
Equivalently J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class and S_W the within-class covariance matrix.
Maximizing J(w) gives w ∝ S_W^{-1} (m2 - m1).
11
Fisher’s Linear Discriminant
For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis, Hastie, Buja and Tibshirani).
For multiple classes: the weight values are determined by the eigenvectors that correspond to the largest eigenvalues of S_W^{-1} S_B.
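A minimal sketch of the two-class case on assumed toy data, computing the Fisher direction w ∝ S_W^{-1}(m2 - m1):

```python
import numpy as np

# Two assumed Gaussian classes; find Fisher's projection direction.
rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, (30, 2))
X2 = rng.normal([2, 2], 1.0, (30, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
w = np.linalg.solve(S_W, m2 - m1)                         # w proportional to S_W^{-1}(m2 - m1)

# The projected class means should be well separated along w.
print((X1 @ w).mean(), (X2 @ w).mean())
```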
12
The Perceptron Algorithm
The input vector x is first transformed using a fixed nonlinear transformation φ(x), giving y(x) = f(w^T φ(x)) with targets t ∈ {-1, +1}.
For all correctly classified training samples we would have w^T φ(x_n) t_n > 0.
Perceptron criterion: we need to minimize E_P(w) = - Σ_{n∈M} w^T φ(x_n) t_n, where M is the set of misclassified patterns.
13
The Perceptron Algorithm – Stochastic Gradient Descent
Cycle through the training patterns in turn. If a pattern is correctly classified, the weight vector remains unchanged; otherwise update it with w^(τ+1) = w^(τ) + η φ(x_n) t_n.
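A minimal sketch of this update rule, assuming toy linearly separable data and the simple choice φ(x) = (1, x), i.e. the raw inputs plus a bias feature:

```python
import numpy as np

# Assumed toy, linearly separable data; targets in {-1, +1}.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
t = np.array([-1] * 20 + [1] * 20)
Phi = np.hstack([np.ones((40, 1)), X])       # phi(x) = (1, x)

w = np.zeros(3)
eta = 1.0                                    # learning rate
for _ in range(100):                         # cycle through the patterns in turn
    for phi_n, t_n in zip(Phi, t):
        if (w @ phi_n) * t_n <= 0:           # misclassified pattern
            w = w + eta * phi_n * t_n        # perceptron update
print(w)
```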
14
Probabilistic Generative Models
Generative models depend on simple assumptions about the distribution of the data: model the class-conditional densities p(x|Ck) and the priors p(Ck), then use Bayes' theorem to obtain the posteriors p(Ck|x).
For two classes, p(C1|x) = σ(a) with a = ln[ p(x|C1) p(C1) / (p(x|C2) p(C2)) ].
Logistic sigmoid function: σ(a) = 1 / (1 + exp(-a)), which maps the whole real axis to the finite interval (0, 1).
15
Continuous Inputs - Gaussian
Assume the class-conditional densities p(x|Ck) are Gaussian with a shared covariance matrix Σ.
Case of two classes: the posterior becomes p(C1|x) = σ(w^T x + w_0), a logistic sigmoid of a linear function of x.
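The resulting parameters (the standard shared-covariance result, reconstructed here because the slide formulas did not survive extraction):

w = Σ^{-1} (μ1 - μ2)
w_0 = -(1/2) μ1^T Σ^{-1} μ1 + (1/2) μ2^T Σ^{-1} μ2 + ln( p(C1) / p(C2) )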
16
Maximum Likelihood Solution
Likelihood function (two classes, t_n = 1 for class C1, prior p(C1) = π):
p(t, X | π, μ1, μ2, Σ) = Π_n [ π N(x_n | μ1, Σ) ]^{t_n} [ (1 - π) N(x_n | μ2, Σ) ]^{1 - t_n}.
Maximizing the log-likelihood with respect to each parameter gives closed-form estimates (see below).
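The resulting maximum likelihood estimates (standard results for the shared-covariance Gaussian model, stated here since the slide equations were not captured):

π = N1 / N
μ1 = (1/N1) Σ_n t_n x_n,    μ2 = (1/N2) Σ_n (1 - t_n) x_n
Σ = (N1/N) S1 + (N2/N) S2,  where S_k is the covariance of the data assigned to class Ck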
17
Probabilistic Discriminative Models
In the probabilistic generative model the number of parameters grows quadratically with M (the number of dimensions).
However, logistic regression, p(C1|φ) = σ(w^T φ), has only M adjustable parameters.
Maximum likelihood solution for logistic regression: minimize the energy function, the negative log-likelihood (cross-entropy error)
E(w) = - Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  where y_n = σ(w^T φ_n).
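For reference, the gradient of this error function takes the simple form ∇E(w) = Σ_n (y_n - t_n) φ_n (a standard result, added here because the IRLS updates on the next slides are built on it).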
18
Iterative Reweighted Least Squares
Newton-Raphson iterative optimization: w^(new) = w^(old) - H^{-1} ∇E(w), where H is the Hessian.
Applied to the linear regression sum-of-squares error it gives w^(new) = (Φ^T Φ)^{-1} Φ^T t in a single step, the same as the standard least-squares solution.
19
Iterative Reweighted Least Squares
Newton-Raphson update for the negative log-likelihood: w^(new) = w^(old) - (Φ^T R Φ)^{-1} Φ^T (y - t), where R is a diagonal matrix with R_nn = y_n (1 - y_n).
Equivalently w^(new) = (Φ^T R Φ)^{-1} Φ^T R z with z = Φ w^(old) - R^{-1}(y - t): a weighted least-squares problem whose weights R are recomputed at every iteration, hence iterative reweighted least squares (IRLS).
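A minimal IRLS sketch on assumed toy data (two Gaussian blobs), following the update above:

```python
import numpy as np

# Assumed toy two-class data with targets in {0, 1}.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
t = np.array([0.0] * 30 + [1.0] * 30)
Phi = np.hstack([np.ones((60, 1)), X])            # basis functions: bias plus raw inputs

w = np.zeros(3)
for _ in range(10):                               # a few Newton-Raphson steps usually suffice
    y = 1.0 / (1.0 + np.exp(-Phi @ w))            # y_n = sigma(w^T phi_n)
    R = np.diag(y * (1.0 - y))                    # reweighting matrix
    H = Phi.T @ R @ Phi                           # Hessian of the cross-entropy error
    w = w - np.linalg.solve(H, Phi.T @ (y - t))   # Newton-Raphson / IRLS update
print(w)
```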
20
Maximum Margin Classifiers
Support vector machines for the two-class problem: y(x) = w^T φ(x) + b, with targets t_n ∈ {-1, +1}.
Assuming a linearly separable data set, there exists at least one choice of the parameters (w, b) satisfying t_n y(x_n) > 0 for all n.
We seek the solution that gives the smallest generalization error.
Margin: the smallest distance between the decision boundary and any of the samples; the SVM chooses the boundary that maximizes it.
21
Support Vector Machines
Optimization of the parameters by maximizing the margin.
Maximizing the margin is equivalent to minimizing (1/2) ||w||^2, subject to the constraints t_n (w^T φ(x_n) + b) ≥ 1 for n = 1, ..., N.
The constrained problem is handled by the introduction of Lagrange multipliers a_n ≥ 0, one per constraint.
22
Support Vector Machines - Lagrange Multipliers
Minimizing the Lagrangian with respect to w and b and maximizing it with respect to a gives w = Σ_n a_n t_n φ(x_n) and Σ_n a_n t_n = 0.
The dual form: maximize L~(a) = Σ_n a_n - (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n t_n = 0, where k(x, x') = φ(x)^T φ(x').
This is a quadratic programming problem: a quadratic function of a is optimized subject to linear constraints.
23
Support Vector Machines
Overlapping class distributions (data that are not linearly separable).
Slack variable ξ_n ≥ 0: the distance of a point from its margin boundary when it lies on the wrong side (ξ_n = 0 otherwise).
Goal: maximize the margin while penalizing points that lie on the wrong side of the margin boundary, i.e. minimize C Σ_n ξ_n + (1/2) ||w||^2, where C controls the trade-off between the slack penalty and the margin.
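For illustration only (the slides do not mention a specific library), a soft-margin SVM of this form can be fitted with scikit-learn; its parameter C plays exactly the trade-off role above:

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data with overlapping classes.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1.5, (50, 2)), rng.normal(1, 1.5, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

# Small C: wide margin, many margin violations tolerated.
# Large C: narrow margin, violations penalized heavily.
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(C, clf.n_support_)                 # number of support vectors per class
```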
24
SVM-Overlapping Class Distributions
The dual Lagrangian is identical to the separable case, except that the multipliers are now box-constrained: 0 ≤ a_n ≤ C.
Again this represents a quadratic programming problem.
25
Support Vector Machines
Relation to logistic regression: the hinge loss used in the SVM and the error function of logistic regression can both be viewed as continuous approximations to the ideal misclassification error (MCE).
[Figure: loss functions plotted against y·t. Black: misclassification error, blue: hinge loss, red: logistic regression error, green: squared error.]
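For reference, the standard definitions of these losses in terms of z = t·y(x) (not written out on the slide):

hinge loss: E(z) = max(0, 1 - z)
logistic error: E(z) = ln(1 + exp(-z))   (in such comparisons it is often rescaled by 1/ln 2 so that it passes through (0, 1))
squared error: E(z) = (1 - z)^2
misclassification error: E(z) = 1 if z < 0, else 0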