Ch 4. Linear Models for Classification (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized and revised by Hee-Woong Lim.

(C) 2006, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Classification Models
- Linear classification model
  - The decision surface is a (D-1)-dimensional hyperplane in the D-dimensional input space.
  - 1-of-K coding scheme for K > 2 classes, such as $t = (0, 1, 0, 0, 0)^T$
- Discriminant function
  - Directly assigns each vector x to a specific class.
  - e.g. Fisher's linear discriminant
- Approaches using conditional probability
  - Separation of the inference and decision stages
  - Two approaches
    - Direct modeling of the posterior probability
    - Generative approach
      - Models the likelihood and the prior probability to calculate the posterior probability
      - Capable of generating samples

Discriminant Functions - Multiple Classes
- One-versus-the-rest classifier
  - K-1 classifiers for a K-class discriminant, each separating one class from all the others
  - Ambiguous when more than one classifier says ‘yes’.
- One-versus-one classifier
  - K(K-1)/2 binary discriminant functions, one per pair of classes
  - Majority voting
  - Ambiguous when the votes are tied.
[Figure: one-versus-the-rest vs. one-versus-one]
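The one-versus-one voting rule can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides: the function name `one_vs_one_predict`, the `pairwise` dictionary of scoring callables, and the toy thresholds are all assumptions made for the example.

```python
from itertools import combinations


def one_vs_one_predict(x, pairwise, num_classes):
    """Majority vote over K(K-1)/2 pairwise discriminants.

    pairwise[(i, j)] is a hypothetical callable that returns a positive
    score when x looks like class i and a non-positive score for class j.
    Returns (winner, votes); winner is None when the top vote is tied.
    """
    votes = [0] * num_classes
    for i, j in combinations(range(num_classes), 2):
        score = pairwise[(i, j)](x)
        votes[i if score > 0 else j] += 1
    best = max(votes)
    winners = [k for k, v in enumerate(votes) if v == best]
    return (winners[0] if len(winners) == 1 else None), votes


# Toy setup: three classes on a line, separated pairwise by thresholds.
pairwise = {
    (0, 1): lambda x: 1.0 - x,  # class 0 beats class 1 if x < 1
    (0, 2): lambda x: 2.0 - x,  # class 0 beats class 2 if x < 2
    (1, 2): lambda x: 3.0 - x,  # class 1 beats class 2 if x < 3
}

# Cyclic preferences (0 beats 1, 1 beats 2, 2 beats 0) give a three-way tie.
cyclic = {
    (0, 1): lambda x: 1.0,
    (0, 2): lambda x: -1.0,
    (1, 2): lambda x: 1.0,
}

print(one_vs_one_predict(0.5, pairwise, 3))
print(one_vs_one_predict(0.5, cyclic, 3))
```

The second call illustrates the ambiguity noted on the slide: every class collects one vote, so the sketch returns `None` rather than picking a winner arbitrarily.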

Discriminant Functions - Multiple Classes (Cont’d)
- K-class discriminant comprising K linear functions, $y_k(x) = w_k^T x + w_{k0}$
  - Assigns x to the class whose function has the maximum output, i.e. to $C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$.
- The decision regions are always singly connected and convex.
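A minimal sketch of the K-class discriminant follows, assuming 2-D inputs and hand-picked weights (the names `classify`, `weights`, and `biases` are illustrative, not from the slides). It also spot-checks the convexity claim on one pair of points: if two points fall in the same decision region, so does their midpoint.

```python
def classify(x, weights, biases):
    """Assign x to the class k with the largest y_k(x) = w_k . x + w_k0."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical parameters for K = 3 classes in a 2-D input space.
weights = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
biases = [0.0, 0.0, -0.5]

a, b = (2.0, 0.0), (3.0, 1.0)
midpoint = tuple((ai + bi) / 2 for ai, bi in zip(a, b))

# Both endpoints and their midpoint land in the same region (class 0).
print(classify(a, weights, biases),
      classify(b, weights, biases),
      classify(midpoint, weights, biases))
```

Checking one midpoint is only an illustration of the property, not a proof of it.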

Approaches for Learning the Parameters of Linear Discriminant Functions
- Least-squares method
- Fisher’s linear discriminant
  - Relation to least squares
  - Multiple classes
- Perceptron algorithm

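Least-Squares Method
- Minimization of the sum-of-squares error (SSE)
- 1-of-K binary coding scheme for the target vector t
- For a training data set $\{x_n, t_n\}$, where n = 1,…,N, let the n-th row of $\tilde{X}$ be $(1, x_n^T)$ and the n-th row of $T$ be $t_n^T$. The sum-of-squares error function is
  $E_D(\tilde{W}) = \frac{1}{2} \mathrm{Tr}\{(\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T)\}$
- Minimizing the SSE gives
  $\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^{\dagger} T$,
  where $\tilde{X}^{\dagger}$ is the pseudo-inverse of $\tilde{X}$.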

Least-Squares Method (Cont’d) - Limits and Disadvantages
- The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1].
- Vulnerable to outliers
  - The SSE function penalizes predictions that are ‘too correct’, i.e. points far from the decision boundary on the correct side.
  - Corresponds to ML under a Gaussian conditional distribution
  - Unimodal vs. multimodal
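The sum-to-one property and the out-of-range outputs can both be seen on a toy problem. The sketch below assumes 1-D inputs with a bias term, so the normal equations reduce to a 2x2 system that can be solved by hand; the function names and the data are made up for illustration.

```python
def fit_least_squares_1d(xs, labels, num_classes):
    """Closed-form least squares for 1-D inputs with a bias term.

    Solves the 2x2 normal equations (X^T X) W = X^T T, where each row of
    X is (1, x_n) and each row of T is a 1-of-K target vector.
    Returns (bias_row, slope_row), each of length num_classes.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx  # determinant of X^T X
    bias_row, slope_row = [], []
    for k in range(num_classes):
        t = [1.0 if c == k else 0.0 for c in labels]
        st = sum(t)
        sxt = sum(x * tk for x, tk in zip(xs, t))
        bias_row.append((sxx * st - sx * sxt) / det)
        slope_row.append((n * sxt - sx * st) / det)
    return bias_row, slope_row


def predict(w, x):
    """Evaluate y_k(x) = bias_k + slope_k * x for every class k."""
    bias_row, slope_row = w
    return [b + s * x for b, s in zip(bias_row, slope_row)]


# Toy data: three classes spread along a line.
xs = [0.0, 1.0, 4.0, 5.0, 9.0, 10.0]
labels = [0, 0, 1, 1, 2, 2]

w = fit_least_squares_1d(xs, labels, 3)
y_far = predict(w, 10.0)
print(y_far, sum(y_far))
```

At x = 10 the outputs still sum to 1, but the output for class 0 is negative, so the values cannot be read as probabilities.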

Least-Squares Method (Cont’d) - Limits and Disadvantages
- The lack of robustness comes from the following:
  - The least-squares method corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution.
  - Binary target vectors are far from Gaussian.
[Figure: least-squares solution vs. logistic regression]

Fisher’s Linear Discriminant
- Linear classification model viewed as dimensionality reduction from the D-dimensional space to one dimension: $y = w^T x$
  - In the case of two classes
- Find w such that the projected classes are well separated.

Fisher’s Linear Discriminant (Cont’d)
- Maximizing the projected mean distance?
  - The distance between the class means $m_1$ and $m_2$ projected onto w: $w^T (m_2 - m_1)$
  - Not appropriate when the class covariances are strongly nondiagonal.

Fisher’s Linear Discriminant (Cont’d)
- Incorporate the within-class variance of the projected data.
- Find w that maximizes the Fisher criterion
  $J(w) = \frac{w^T S_B w}{w^T S_W w}$
  - $S_B = (m_2 - m_1)(m_2 - m_1)^T$: between-class covariance matrix
  - $S_W$: within-class covariance matrix
- J(w) is maximized when $(w^T S_B w) S_W w = (w^T S_W w) S_B w$.
  - $S_B w$ is always in the direction of $(m_2 - m_1)$, which gives Fisher’s linear discriminant: $w \propto S_W^{-1} (m_2 - m_1)$
- If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
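The effect of the within-class term can be illustrated on a small 2-D example. The sketch below assumes two classes with the same strongly correlated spread; `fisher_direction`, `project`, and the data points are hypothetical and chosen so that the numbers are easy to follow.

```python
def fisher_direction(class1, class2):
    """Return w proportional to S_W^{-1} (m2 - m1) for 2-D points."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    def scatter(points, m):
        # Sum of outer products (x - m)(x - m)^T, stored as a, b, d for
        # the symmetric matrix [[a, b], [b, d]].
        a = sum((p[0] - m[0]) ** 2 for p in points)
        b = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        d = sum((p[1] - m[1]) ** 2 for p in points)
        return a, b, d

    m1, m2 = mean(class1), mean(class2)
    a1, b1, d1 = scatter(class1, m1)
    a2, b2, d2 = scatter(class2, m2)
    a, b, d = a1 + a2, b1 + b2, d1 + d2  # within-class matrix S_W
    det = a * d - b * b
    dm = [m2[0] - m1[0], m2[1] - m1[1]]
    # Multiply the explicit 2x2 inverse of S_W by (m2 - m1).
    return [(d * dm[0] - b * dm[1]) / det, (a * dm[1] - b * dm[0]) / det]


def project(points, w):
    """Project each point onto w, i.e. compute y = w^T x."""
    return [w[0] * p[0] + w[1] * p[1] for p in points]


# Toy data: class2 is class1 shifted by (2, 0); both are correlated in x and y.
class1 = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
class2 = [(2.0, 0.0), (3.0, 2.0), (4.0, 1.0), (5.0, 3.0)]

w = fisher_direction(class1, class2)
print(w)
print(project(class1, w), project(class2, w))
```

In this example the class means differ only along the first axis, and projecting onto that axis leaves the classes overlapping (x ranges 0..3 and 2..5). The Fisher direction tilts away from the mean difference and separates the two projected classes.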

Fisher’s Linear Discriminant - Relation to Least Squares
- The Fisher criterion is a special case of least squares
  - when the target values are set to $N/N_1$ for class $C_1$ and $-N/N_2$ for class $C_2$, where $N_k$ is the number of patterns in class $C_k$.

Fisher’s Discriminant for Multiple Classes
- K > 2 classes
- Dimension reduction from D to D’
  - D’ > 1 linear features, $y_k = w_k^T x$ (k = 1,…,D’)
- Generalization of $S_W$ and $S_B$
  - $S_B$ comes from the decomposition of the total covariance matrix, $S_T = S_W + S_B$ (Duda and Hart, 1973).

Fisher’s Discriminant for Multiple Classes (Cont’d)
- Covariance matrices in the projected y-space: $s_W$ and $s_B$
- Fukunaga’s criterion: $J(W) = \mathrm{Tr}\{s_W^{-1} s_B\}$
- Another criterion: $J(W) = |s_B| / |s_W|$
  - Duda et al., ‘Pattern Classification’, Ch. 3.8.3
  - Determinant: the product of the eigenvalues, i.e. of the variances in the principal directions.

Perceptron Algorithm
- Classification of x by a perceptron: $y(x) = f(w^T \phi(x))$, where $f(a) = +1$ if $a \geq 0$ and $-1$ otherwise
- Error functions
  - The total number of misclassified patterns
    - Piecewise constant and discontinuous, so the gradient is zero almost everywhere.
  - Perceptron criterion: $E_P(w) = -\sum_{n \in M} w^T \phi_n t_n$, where M is the set of misclassified patterns and $t_n \in \{-1, +1\}$

Perceptron Algorithm (Cont’d)
- Stochastic gradient descent algorithm: $w^{(\tau+1)} = w^{(\tau)} + \eta \phi_n t_n$ for a misclassified pattern $\phi_n$
- The error from a misclassified pattern is reduced after each update.
  - This does not imply that the overall error is reduced.
- Perceptron convergence theorem
  - If an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps.
- However…
  - Learning can be slow, the algorithm never converges on data that are not linearly separable, and it does not generalize readily to K > 2 classes.
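The perceptron update can be sketched as follows. This is a minimal illustration under stated assumptions: targets are coded as -1/+1, each feature vector carries a leading constant 1 for the bias, and a pattern lying exactly on the boundary is treated as misclassified so that the all-zero starting vector gets updated. The function name and the `max_epochs` cap are not from the slides.

```python
def train_perceptron(features, targets, eta=1.0, max_epochs=100):
    """Perceptron learning with targets in {-1, +1}.

    features: list of feature vectors phi_n (first component is a constant 1).
    Returns (w, converged); converged is True once a full pass makes no update.
    """
    w = [0.0] * len(features[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, t in zip(features, targets):
            activation = sum(wi * pi for wi, pi in zip(w, phi))
            if activation * t <= 0:  # misclassified, or exactly on the boundary
                w = [wi + eta * t * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:
            return w, True
    return w, False


# Linearly separable toy data.
features = [(1.0, 2.0, 1.0), (1.0, 3.0, 2.0), (1.0, -1.0, -1.0), (1.0, -2.0, -3.0)]
targets = [1, 1, -1, -1]
w, converged = train_perceptron(features, targets)
print(w, converged)

# XOR is not linearly separable, so the loop stops only at max_epochs.
xor_features = [(1.0, 0.0, 0.0), (1.0, 0.0, 1.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
xor_targets = [-1, 1, 1, -1]
_, xor_converged = train_perceptron(xor_features, xor_targets)
print(xor_converged)
```

The epoch cap is a practical addition: on non-separable data such as XOR the updates would otherwise continue indefinitely, which is the limitation the slide points out.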

Probabilistic Generative Models
- Computation of posterior probabilities using class-conditional densities and class priors
- Two classes: $p(C_1 | x) = \sigma(a)$, where $a = \ln \frac{p(x | C_1) p(C_1)}{p(x | C_2) p(C_2)}$ and $\sigma(a) = 1 / (1 + e^{-a})$ is the logistic sigmoid
- Generalization to K > 2 classes: $p(C_k | x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$, where $a_k = \ln p(x | C_k) p(C_k)$
  - The normalized exponential is also known as the softmax function, i.e. a smoothed version of the ‘max’ function.
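Both activation functions are short to write down. The sketch below is one numerically careful way to code them, not the only one; the branch in `sigmoid` and the max-subtraction in `softmax` are standard overflow guards added for this example.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def softmax(a):
    """Normalized exponential; subtracting max(a) leaves the result unchanged."""
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]


# With two classes the softmax reduces to a sigmoid of the difference a_1 - a_2.
print(softmax([2.0, -1.0])[0], sigmoid(3.0))

# When one a_k dominates, the softmax approaches a hard 'max' indicator.
print(softmax([0.0, 0.0, 100.0]))
```

The first line of output shows the two-class case and the K-class case agreeing, which is why the sigmoid can be seen as the K = 2 special case of the softmax.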

Probabilistic Generative Models - Continuous Inputs
- Posterior probabilities when the class-conditional densities are Gaussian
  - All classes share the same covariance matrix $\Sigma$.
- Two classes: $p(C_1 | x) = \sigma(w^T x + w_0)$ with $w = \Sigma^{-1} (\mu_1 - \mu_2)$
  - The quadratic terms in x from the exponents cancel.
  - The resulting decision boundary is linear in input space.
  - The prior only shifts the decision boundary, i.e. it produces parallel shifts of the contours of constant posterior probability.
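The cancellation of the quadratic terms can be written out explicitly. The derivation below is a sketch that assumes Gaussian class-conditional densities with means $\mu_1$, $\mu_2$ and a shared covariance $\Sigma$; the Gaussian normalization constants are equal for both classes and drop out of the ratio.

```latex
\begin{align}
a &= \ln\frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)} \\
  &= -\tfrac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1)
     + \tfrac{1}{2}(x-\mu_2)^T \Sigma^{-1} (x-\mu_2)
     + \ln\frac{p(C_1)}{p(C_2)} \\
  &= \underbrace{(\mu_1-\mu_2)^T \Sigma^{-1}}_{w^T}\, x
     \; \underbrace{-\, \tfrac{1}{2}\mu_1^T \Sigma^{-1} \mu_1
     + \tfrac{1}{2}\mu_2^T \Sigma^{-1} \mu_2
     + \ln\frac{p(C_1)}{p(C_2)}}_{w_0}
\end{align}
```

The two $x^T \Sigma^{-1} x$ terms have opposite signs and cancel, leaving $a = w^T x + w_0$. The priors enter only through $w_0$, which is why changing them moves the boundary without rotating it.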

Probabilistic Generative Models - Continuous Inputs (Cont’d)
- Generalization to K classes
  - When the classes share the same covariance matrix, the decision boundaries are again linear.
  - If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.

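Probabilistic Generative Models - Maximum Likelihood Solution
- Determine the parameters of the class-conditional densities and the class priors by maximum likelihood from a training data set.
- Two classes
  - The likelihood function, with $t_n = 1$ for class $C_1$, $t_n = 0$ for class $C_2$, and prior $p(C_1) = \pi$:
    $p(\mathbf{t}, X | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} [\pi \mathcal{N}(x_n | \mu_1, \Sigma)]^{t_n} [(1 - \pi) \mathcal{N}(x_n | \mu_2, \Sigma)]^{1 - t_n}$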

Probabilistic Generative Models - Maximum Likelihood Solution (Cont’d)
- Two classes (cont’d)
  - Maximization of the likelihood with respect to $\pi$
    - Terms of the log likelihood that depend on $\pi$: $\sum_{n=1}^{N} \{t_n \ln \pi + (1 - t_n) \ln(1 - \pi)\}$
    - Setting the derivative with respect to $\pi$ equal to zero: $\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N}$
  - Maximization with respect to $\mu_1$: $\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n$, and analogously $\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n$

Probabilistic Generative Models - Maximum Likelihood Solution (Cont’d)
- Two classes (cont’d)
  - Maximization of the likelihood with respect to the shared covariance matrix $\Sigma$: $\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$, where $S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T$
- The result is a weighted average of the covariance matrices associated with each class.
- However, it is not robust to outliers.
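The maximum likelihood estimates for the two-class, shared-covariance model amount to counting and averaging. The sketch below assumes labels coded as 1 for $C_1$ and 0 for $C_2$; `fit_shared_covariance` and the toy data are illustrative names and values, not from the slides.

```python
def fit_shared_covariance(xs, ts):
    """ML estimates (pi, mu1, mu2, Sigma) for two Gaussian classes.

    xs: list of D-dimensional points; ts: labels, 1 for C_1 and 0 for C_2.
    """
    n, d = len(xs), len(xs[0])
    n1 = sum(ts)
    n2 = n - n1
    pi = n1 / n
    mu1 = [sum(t * x[i] for x, t in zip(xs, ts)) / n1 for i in range(d)]
    mu2 = [sum((1 - t) * x[i] for x, t in zip(xs, ts)) / n2 for i in range(d)]
    sigma = [[0.0] * d for _ in range(d)]
    for x, t in zip(xs, ts):
        mu = mu1 if t == 1 else mu2
        diff = [x[i] - mu[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                # Summing over all points and dividing by N equals
                # (N1/N) S_1 + (N2/N) S_2.
                sigma[i][j] += diff[i] * diff[j] / n
    return pi, mu1, mu2, sigma


# Toy data: four points in C_1 and two points in C_2.
xs = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (5.0, 5.0), (7.0, 5.0)]
ts = [1, 1, 1, 1, 0, 0]

pi, mu1, mu2, sigma = fit_shared_covariance(xs, ts)
print(pi, mu1, mu2, sigma)
```

Because every point contributes a squared deviation to `sigma`, a single distant point can change the estimate considerably, which is the robustness concern raised on the slide.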

Probabilistic Generative Models - Discrete Features
- Discrete feature values, here binary: $x_i \in \{0, 1\}$
- A general distribution would correspond to a table of $2^D$ entries.
  - With D inputs, the table size grows exponentially with the number of features.
  - Naïve Bayes assumption: the features are independent when conditioned on the class $C_k$, so $p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$
  - $a_k(x) = \sum_{i=1}^{D} \{x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki})\} + \ln p(C_k)$ is linear with respect to the features, as with continuous features.
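A naive Bayes model with binary features needs only D parameters per class instead of a full table. The sketch below is an illustration under stated assumptions: the function names and data are invented, and the Laplace smoothing constant `alpha` is an addition of this example (not part of the slides) that keeps every estimated mean strictly between 0 and 1 so the logarithms are defined.

```python
import math


def fit_bernoulli_naive_bayes(xs, labels, num_classes, alpha=1.0):
    """Estimate class priors and per-feature Bernoulli means mu_ki.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    d = len(xs[0])
    priors, mus = [], []
    for k in range(num_classes):
        rows = [x for x, c in zip(xs, labels) if c == k]
        priors.append(len(rows) / len(xs))
        mus.append([(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                    for i in range(d)])
    return priors, mus


def linear_scores(x, priors, mus):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)."""
    return [sum(xi * math.log(m) + (1 - xi) * math.log(1 - m)
                for xi, m in zip(x, mu)) + math.log(p)
            for p, mu in zip(priors, mus)]


# Toy data: three binary features, two classes.
xs = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
labels = [0, 0, 0, 1, 1, 1]

priors, mus = fit_bernoulli_naive_bayes(xs, labels, 2)
scores = linear_scores((1, 1, 0), priors, mus)
print(mus)
print(scores)
```

Since each $a_k(x)$ is a sum of per-feature terms plus a constant, flipping one feature changes the score by the same amount regardless of the other features, which is what linearity in x means here.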

Probabilistic Generative Models - Exponential Family
- For both Gaussian-distributed and discrete inputs…
  - The posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.
- Generalization to class-conditional densities of the exponential family
  - Exponential family: $p(x | \lambda_k) = h(x) g(\lambda_k) \exp\{\lambda_k^T u(x)\}$
  - Restrict to the subclass for which u(x) = x.
  - Two classes: $a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$
  - K classes: $a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)$
  - Both are again linear with respect to x.
