
1 Ch 4. Linear Models for Classification. Adapted from Seung-Joon Yi, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

2 Recall: given {x_n, t_n}, model t = y(x, w) + ε. Regression: find w for modeling y(x, w), which is real-valued. Prediction: forget about w, and find t, which is real-valued, for a given x. Now, classification: t takes only discrete values, or probability values in (0, 1). Partition the feature space: a (D-1)-dimensional hyperplane for a D-dimensional input space. 1-of-K coding scheme for K > 2 classes, e.g. t = (0, 1, 0, 0, 0)^T.

3 Need to generalize to y(x) = f(w^T x + w_0), where f is the activation function.

4 Contents. Deterministic models (discriminant functions): find y_k(x) to partition the feature space into decision regions. Probabilistic models. Generative models: inference, model p(x|C_k) and p(C_k); decision, obtain p(C_k|x) via Bayes' theorem. Discriminative models: model p(C_k|x) directly.

5 Discriminant Function. A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k. Linear discriminants: y(x) = w^T x + w_0.
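
A minimal two-class sketch (not from the slides; it assumes NumPy, and the weight values are hypothetical): x is assigned by the sign of y(x) = w^T x + w_0.

import numpy as np

def linear_discriminant(x, w, w0):
    # Assign to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2.
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

# Hypothetical example values, for illustration only.
w = np.array([1.0, -2.0])
w0 = 0.5
print(linear_discriminant(np.array([3.0, 1.0]), w, w0))  # y = 1.5 >= 0, so C1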

6 CONCEPT OF SPACE

7 Vector Spaces. A space of vectors, closed under addition and scalar multiplication.

8 Scalar product, dot product, norm.

9 Norm of Images

10 Orthogonal Images, Distance, Basis.

11 Cauchy-Schwarz and Triangle Inequalities: ||U + V|| ≤ ||U|| + ||V|| (the triangle inequality, which follows from the Cauchy-Schwarz inequality).

12 Schwarz Inequality: |<U, V>| ≤ ||U|| ||V||.

13 Discriminant Functions: Two Classes. Classification by hyperplanes: y(x) = w^T x + w_0, assigning x to C_1 if y(x) >= 0 and to C_2 otherwise; or, equivalently, y(x) = w~^T x~ with w~ = (w_0, w^T)^T and x~ = (1, x^T)^T.

14 Discriminant Functions: Multiple Classes. One-versus-the-rest classifier: K-1 classifiers for a K-class discriminant; ambiguous when more than one classifier says 'yes'. One-versus-one classifier: K(K-1)/2 binary discriminant functions with majority voting; still ambiguous when scores are equal. (Figures: one-versus-the-rest and one-versus-one decision regions.)

15 Discriminant Functions: Multiple Classes (cont'd). A K-class discriminant comprising K linear functions y_k(x) = w_k^T x + w_k0 assigns x to the class with the maximum output. The decision regions are always singly connected and convex.

16 The decision boundary between C_k and C_j is a hyperplane: y_k(x) = y_j(x), i.e. (w_k - w_j)^T x + (w_k0 - w_j0) = 0.

17 Approaches for Learning the Parameters of Linear Discriminant Functions: the least squares method; Fisher's linear discriminant (relation to least squares, multiple classes); the perceptron algorithm.

18 Least Squares Method. Minimization of the sum-of-squares error (SSE), using a 1-of-K binary coding scheme for the target vector t. For a training data set {x_n, t_n}, n = 1, ..., N, the sum-of-squares error function is quadratic in the weights, so minimizing it gives a closed-form solution via the pseudo-inverse of the (augmented) design matrix.
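
A minimal sketch of this closed-form fit (assuming NumPy; data and shapes are illustrative): stack the augmented inputs into X~, the one-hot targets into T, take W~ = pinv(X~) T, and classify by the largest output.

import numpy as np

def least_squares_fit(X, T):
    # X: N x D inputs, T: N x K one-hot targets. Returns W_tilde of shape (D+1) x K.
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column of ones
    return np.linalg.pinv(X_tilde) @ T                   # pseudo-inverse solution

def least_squares_predict(W_tilde, x):
    # Assign x to the class with the largest output y_k(x).
    x_tilde = np.concatenate([[1.0], x])
    return int(np.argmax(W_tilde.T @ x_tilde))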

19 Least Squares Method (cont'd): Limitations. The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure that the outputs lie in the range [0, 1]. It is vulnerable to outliers, because the SSE function penalizes predictions that are 'too correct', i.e. far from the decision boundary on the correct side. This reflects ML under a Gaussian conditional distribution: a unimodal assumption, vs. the multimodal nature of the targets.

20 Least Squares Method (cont'd): Limitations. The lack of robustness comes from the fact that the least-squares method corresponds to maximum likelihood under the assumption of a Gaussian distribution, while binary target vectors are far from satisfying this assumption. (Figures: least-squares solution vs. logistic regression.)

21 Fisher's Linear Discriminant. A linear classification model viewed as dimensionality reduction from the D-dimensional space to one dimension. In the case of two classes: find w such that the projected data are well clustered (separated).

22 Fisher's Linear Discriminant (cont'd). Maximizing the projected mean distance? That is, the distance between the cluster means m_1 and m_2 projected onto w. Not appropriate when the covariances are non-diagonal.

23 Fisher's Linear Discriminant (cont'd). Also take the within-class variance of the projected data into account, via the criterion J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class covariance matrix and S_W the within-class covariance matrix. Find w that maximizes J(w); J(w) is maximized when w is proportional to S_W^{-1}(m_2 - m_1): Fisher's linear discriminant. If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
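
A minimal two-class sketch of this direction (assuming NumPy; X1 and X2 are illustrative arrays whose rows are the samples of each class):

import numpy as np

def fisher_direction(X1, X2):
    # w proportional to S_W^{-1} (m2 - m1), returned with unit norm.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)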

24 Fisher's Linear Discriminant: Relation to Least Squares. The Fisher criterion arises as a special case of least squares when the target values are set to N/N_1 for class C_1 and -N/N_2 for class C_2.

25 Fisher's Discriminant for Multiple Classes. K > 2 classes; dimensionality reduction from D to D'. D' > 1 linear features y_k, k = 1, ..., D'. Generalization of S_W and S_B: S_B is obtained from the decomposition of the total covariance matrix (Duda and Hart, 1997).

26 Fisher's Discriminant for Multiple Classes (cont'd). Covariance matrices in the projected y-space. Fukunaga's criterion. Another criterion (Duda et al., Pattern Classification, Ch. 3.8.3) uses the determinant: the product of the eigenvalues, i.e. the variances in the principal directions.

27 Fisher's Discriminant for Multiple Classes (cont'd).

28 Perceptron (F. Rosenblatt). Connectionism and the birth of ANNs (Kohonen, Hopfield), inspired by biological neurons. y = f(w^T x).

29 Activation function: f(w^T x). Transform x to Φ(x) to make the space linearly separable; the activation function then becomes f(w^T Φ(x)). Note that Φ(x) = [1, φ_1(x), φ_2(x), ..., φ_D(x)]^T, with the first element fixed to 1 (bias).

30 Goal: find w. Define a cost function and minimize it with respect to w. We want w^T Φ(x_n) t_n >= 0 for every n. Recall t ∈ {-1, +1}.

31 Perceptron Criterion. Let M be the set of samples misclassified by w: E_P(w) = -Σ_{n ∈ M} w^T Φ(x_n) t_n.

32 Stochastic gradient descent algorithm: for each misclassified pattern, w^{(τ+1)} = w^{(τ)} + η Φ(x_n) t_n. The error from a misclassified pattern is reduced after each iteration, but this does not imply that the overall error is reduced. Perceptron convergence theorem: if an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it.
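
A minimal sketch of these updates (assuming NumPy, targets in {-1, +1}, and the feature map taken to be the raw input plus a bias term):

import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    # Stochastic perceptron updates: w <- w + eta * phi_n * t_n on each mistake.
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])  # bias feature fixed to 1
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:              # misclassified pattern
                w += eta * phi_n * t_n
                mistakes += 1
        if mistakes == 0:                           # converged (separable case)
            break
    return w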

33 Perceptron Algorithm (cont'd). (Figure panels (a)-(d).)

34 Problems with the Perceptron: learning speed, poor results for linearly non-separable data, difficulty applying it to multiple classes, and it is deterministic.

35 Probabilistic Approaches to Classification. Generative models: inference, model the class-conditional densities and class priors; decision, apply Bayes' theorem to find the posterior class probabilities. Probabilistic discriminative models: use the functional form of the generalized linear model explicitly and determine its parameters directly using maximum likelihood.

36 Logistic Sigmoid Function. It originally arose in models of population growth. It resembles the cumulative distribution function of a Normal random variable. If the class-conditional densities are Normal, the posteriors become logistic sigmoids. A simple logistic function may be defined by the formula σ(a) = 1 / (1 + e^{-a}).

37 Posterior probabilities can be formulated as follows. Two classes: a logistic sigmoid acting on a linear function of x. K classes: a softmax transformation of a linear function of x. Then the parameters of the densities, as well as the class priors, can be determined using maximum likelihood.

38 Probabilistic Generative Models: Two Classes. Recall that, given the class-conditional densities and priors, the posterior can be expressed by the logistic sigmoid of a quantity a, called the logit (log-odds).
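
For reference, the standard two-class forms of these equations are:

p(C_1 \mid x) = \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad
a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)}.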

39 Probabilistic Generative Models: K Classes. The posterior can be expressed by the softmax function (normalized exponential).
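
For reference, the standard K-class form is:

p(C_k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad
a_k = \ln\big( p(x \mid C_k)\, p(C_k) \big).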

40 Probabilistic Generative Models: Gaussian Class Conditionals, Two Classes. Assume the same covariance matrix Σ for both classes. Note: the quadratic terms in x from the exponents cancel; the resulting decision boundary is linear in input space; the prior only shifts the decision boundary, i.e. the contours remain parallel.
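
With a shared covariance \Sigma, the standard result is a logistic sigmoid of a linear function of x:

p(C_1 \mid x) = \sigma(w^\top x + w_0), \qquad
w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
w_0 = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^\top \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}.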

41 Probabilistic Generative Models: Gaussian Class Conditionals for K Classes. When the covariance matrix is shared, the decision boundaries are linear. When each class-conditional density has its own covariance matrix, a_k becomes a quadratic function of x, giving rise to a quadratic discriminant.

42 Probabilistic Generative Models: Maximum Likelihood Solution. Determine the parameters using maximum likelihood from a training data set. Two classes: write down the likelihood function.

43 Q: Find the parameters of p(C_k|x): the priors P(C_k), the means μ_1 and μ_2, and the shared covariance Σ.

44 Probabilistic Generative Models: Maximum Likelihood Solution. Let P(C_1) = π and P(C_2) = 1 - π.

45 Probabilistic Generative Models: maximize the log likelihood with respect to π, μ_1, μ_2, and Σ.
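
For reference, the standard likelihood and its maximizers (with t_n = 1 for C_1, t_n = 0 for C_2, and N_1 + N_2 = N) are:

p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[\pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\big]^{t_n} \big[(1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\big]^{1 - t_n},

\pi = \frac{N_1}{N}, \quad
\mu_1 = \frac{1}{N_1}\sum_{n} t_n x_n, \quad
\mu_2 = \frac{1}{N_2}\sum_{n} (1 - t_n) x_n, \quad
\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2,

where S_k is the covariance of the data belonging to class C_k.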

46 Probabilistic Generative Models: Discrete Features. For binary feature values, a general distribution would correspond to a table with on the order of 2^D entries per class: with D inputs, the table size grows exponentially with the number of features. Under the naive Bayes assumption, the features are treated as independent, conditioned on the class C_k, and a_k(x) is again linear with respect to the features, as in the continuous case.
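
Under this assumption, with binary features x_i \in \{0, 1\}, the standard form is:

p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}, \qquad
a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(C_k).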

47 Bayes Decision Boundaries: 2D (Pattern Classification, Duda et al., p. 42).

48 Bayes Decision Boundaries: 3D (Pattern Classification, Duda et al., p. 43).

49 Probabilistic Generative Models: Exponential Family. For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (two classes) or softmax (K classes) activation functions. This generalizes to class-conditional densities from the exponential family, for the subclass in which u(x) = x; the argument of the sigmoid or softmax is again linear with respect to x.

50 Probabilistic Discriminative Models. Goal: find p(C_k|x) directly. Discriminative training: maximize a likelihood defined through p(C_k|x). This improves prediction performance when p(x|C_k) is poorly estimated.

51 Fixed Basis Functions. Assume a fixed nonlinear transformation of the inputs using a vector of basis functions Φ(x). The resulting decision boundaries will be linear in the feature space: y(x) = w^T Φ(x).

52 Logistic Regression Model. Posterior probability of a class for the two-class problem: p(C_1|Φ) = σ(w^T Φ). Number of adjustable parameters (M-dimensional feature space, two classes): with Gaussian class-conditional densities (generative model), 2M parameters for the means and M(M+1)/2 parameters for the shared covariance matrix, growing quadratically with M; with logistic regression (discriminative model), M parameters for w, growing linearly with M.

53 Logistic Regression (cont'd). Determining the parameters using ML: the likelihood function gives, as its negative log, the cross-entropy error function. The cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities if a coding scheme based on a distribution q is used, rather than the 'true' distribution p.

54 The gradient of the error function with respect to w takes the same form as in the linear regression model: the error times the basis vector.
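
For reference, with y_n = \sigma(w^\top \phi_n), the standard cross-entropy error and its gradient are:

E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}, \qquad
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n.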

55 Iterative Reweighted Least Squares. For the linear regression models of Ch. 3, the ML solution under the assumption of Gaussian noise has a closed form, as a consequence of the quadratic dependence of the log likelihood on the parameters w. For the logistic regression model there is no longer a closed-form solution, but the error function is convex and has a unique minimum, so an efficient iterative technique can be used: the Newton-Raphson update to minimize E(w), w_new = w_old - H^{-1} ∇E(w), where H is the Hessian matrix of second derivatives of E(w).

56 Iterative Reweighted Least Squares (cont'd). For the sum-of-squares error function, the Newton-Raphson update reproduces the standard least-squares solution in a single step, since the error is quadratic. For the cross-entropy error function, the Newton-Raphson update gives iteratively reweighted least squares: the weighting matrix depends on w, so it must be recomputed at each iteration.
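
A minimal IRLS sketch for two-class logistic regression (assuming NumPy; the small ridge term added to the Hessian is an assumption for numerical stability, not part of the textbook update):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=20):
    # Newton-Raphson / IRLS: w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t).
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))               # weighting matrix R
        H = Phi.T @ R @ Phi + 1e-8 * np.eye(M)   # Hessian (ridge term for stability)
        grad = Phi.T @ (y - t)                   # gradient of the cross-entropy error
        w = w - np.linalg.solve(H, grad)
    return w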

57 Multiclass Logistic Regression. Posterior probability for multiclass classification: p(C_k|Φ) is the softmax of a_k = w_k^T Φ. We can use ML to determine the parameters directly: the likelihood function uses the 1-of-K coding scheme, and its negative log is the cross-entropy error function for multiclass classification.

58 Multiclass Logistic Regression (cont'd). The derivative of the error function has the same form: the product of the error times the basis function. The Hessian matrix can be computed, and the IRLS algorithm can again be used for batch processing.

59 Probit Regression. For a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, this is not the case for all choices of class-conditional density, so it may be worth exploring other types of discriminative probabilistic model.

60 Probit Regression (cont'd). Noisy threshold model: the corresponding activation function is obtained when the threshold θ is drawn from a density p(θ). When p(θ) is a zero-mean, unit-variance Gaussian, this gives the probit function, which has a sigmoidal shape. The generalized linear model based on a probit activation function is known as probit regression.
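
For reference, the activation function of the noisy threshold model and the probit function are:

f(a) = \int_{-\infty}^{a} p(\theta)\, d\theta, \qquad
\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta.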

61 Canonical Link Functions. We have seen that, for some models, the derivative of the error function with respect to the parameter w takes the form of the error times the feature vector: the logistic regression model with sigmoid activation function, and the multiclass model with softmax activation function. This is a general result of assuming a conditional distribution for the target variable from the exponential family, together with a corresponding choice of activation function known as the canonical link function.

62 Canonical Link Functions (cont'd). For conditional distributions of the target variable from the exponential family, the derivative of the log likelihood with respect to the model parameters again reduces to a (scaled) sum over n of the error times the feature vector, provided the activation function is chosen as the canonical link function.

63 The Laplace Approximation. We cannot integrate exactly over the parameter vector, since the posterior is no longer Gaussian. The Laplace approximation finds a Gaussian approximation centered on a mode of the distribution, using a Taylor expansion of the logarithm of the target function around that mode; this yields the approximating Gaussian distribution.

64 The Laplace Approximation (cont'd): the M-dimensional case.
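
For reference, the standard M-dimensional Laplace approximation around a mode z_0 is:

\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2}(z - z_0)^\top A (z - z_0), \qquad
A = -\,\nabla\nabla \ln f(z)\big|_{z = z_0},

q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\!\big\{ -\tfrac{1}{2}(z - z_0)^\top A (z - z_0) \big\} = \mathcal{N}(z \mid z_0, A^{-1}).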

65 Model Comparison and BIC. The Laplace approximation to the normalization constant Z can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison. Consider a set of models with parameters θ_i; the log of the model evidence can then be approximated, and a further approximation, under additional assumptions, yields the Bayesian Information Criterion (BIC).
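
For reference, the standard approximations are:

\ln p(D) \simeq \ln p(D \mid \theta_{\mathrm{MAP}}) + \ln p(\theta_{\mathrm{MAP}}) + \frac{M}{2}\ln(2\pi) - \frac{1}{2}\ln|A|,

and, under further assumptions (broad prior, full-rank Hessian),

\ln p(D) \simeq \ln p(D \mid \theta_{\mathrm{MAP}}) - \frac{1}{2} M \ln N \quad \text{(BIC)}.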

66 Bayesian Logistic Regression. Exact Bayesian inference is intractable. With a Gaussian prior on w, the log posterior is the sum of the log prior and the log likelihood (the negative cross-entropy error), and we use a Laplace approximation of the posterior distribution.

67 Predictive Distribution. It can be obtained by marginalizing with respect to the posterior distribution p(w|t), which is approximated by a Gaussian q(w): the prediction reduces to an integral of σ(a) over a, where a = w^T Φ and the distribution of a is a marginal of a Gaussian, which is also Gaussian.

68 Predictive Distribution (cont'd). The resulting variational approximation to the predictive distribution: to integrate over a, we make use of the close similarity between the logistic sigmoid function and the probit function, which finally gives a closed-form approximation.
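
For reference, the standard sigmoid-probit approximation used here is:

\sigma(a) \approx \Phi(\lambda a) \ \text{with} \ \lambda^2 = \pi/8, \qquad
\int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da \approx \sigma\!\big(\kappa(\sigma_a^2)\, \mu_a\big), \qquad
\kappa(\sigma^2) = \big(1 + \pi\sigma^2/8\big)^{-1/2}.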

