
1 Prototype Classification Methods. Fu Chang, Institute of Information Science, Academia Sinica, 2788-3799 ext. 1819, fchang@iis.sinica.edu.tw

2 Types of Prototype Methods
Crisp model (K-means, KM): prototypes are the centers of non-overlapping clusters
Fuzzy model (fuzzy c-means, FCM): prototypes are weighted averages of all samples
Gaussian mixture model (GM): prototypes are the component distributions of a mixture
Linear discriminant analysis (LDA): prototypes are projected sample means
K-nearest neighbor classifier (K-NN)
Learning vector quantization (LVQ)

3 Prototypes through Clustering Given the number k of prototypes, find k clusters whose centers serve as the prototypes. Commonality of these methods: they use an iterative algorithm aimed at decreasing an objective function, they may converge to local minima, and both the number k and an initial solution must be specified.

4 Clustering Objectives The aim of the iterative algorithm is to decrease the value of an objective function. Notation: samples $x_1, x_2, \ldots, x_n$; prototypes $p_1, p_2, \ldots, p_k$; $L_2$-distance $d(x_j, p_i) = \lVert x_j - p_i \rVert$.

5 Objectives (Cnt'd) Crisp objective, fuzzy objective, and Gaussian mixture objective; standard forms are written out below.
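
For reference, the following are standard forms of these three objectives, assuming the notation of the previous slide, with fuzzy memberships $u_{ij}$, fuzzifier $m > 1$, and mixture weights $\alpha_i$:

```latex
% Crisp (K-means) objective: total squared distance to the assigned prototype
J_{\mathrm{KM}} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - p_i \rVert^2

% Fuzzy c-means objective, subject to \sum_{i=1}^{c} u_{ij} = 1 for every sample j
J_{\mathrm{FCM}} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \lVert x_j - p_i \rVert^2

% Gaussian mixture objective: the log-likelihood to be maximized
J_{\mathrm{GM}} = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i \, p_i(x_j \mid \theta_i)
```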

6 K-Means Clustering

7 The Algorithm
Initialize k seed prototypes p_1, p_2, ..., p_k
Grouping: assign each sample to its nearest prototype, forming non-overlapping clusters
Centering: the centers of the clusters become the new prototypes
Repeat the grouping and centering steps until convergence (a sketch follows below)
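
A minimal NumPy sketch of the grouping/centering loop (the function name, random-seed initialization, and iteration cap are illustrative choices, not from the slides):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate grouping and centering until assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize k seed prototypes by picking k distinct samples at random.
    prototypes = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Grouping: assign each sample to its nearest prototype (squared L2 distance).
        dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed, so the objective can no longer decrease
        labels = new_labels
        # Centering: each non-empty cluster's mean becomes its new prototype.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                prototypes[i] = members.mean(axis=0)
    return prototypes, labels
```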

8 Justification Grouping: assigning samples to their nearest prototypes can only decrease the objective. Centering: also decreases the objective, because for any vectors $y_1, \ldots, y_m$ with mean $\bar{y}$ and any vector $p$, $\sum_{i} \lVert y_i - p \rVert^2 \ge \sum_{i} \lVert y_i - \bar{y} \rVert^2$, and equality holds only if $p = \bar{y}$.

9 Exercise: 1. Prove that for any group of vectors $y_1, \ldots, y_m$ with mean $\bar{y}$ and any vector $p$, the inequality $\sum_{i} \lVert y_i - p \rVert^2 \ge \sum_{i} \lVert y_i - \bar{y} \rVert^2$ always holds. 2. Prove that equality holds only when $p = \bar{y}$. 3. Use this fact to prove that the centering step decreases the objective function.

10 Fuzzy c-Means Clustering

11 Crisp vs. Fuzzy Membership Membership matrix $U_{c \times n}$, where $u_{ij}$ is the grade of membership of sample $j$ with respect to prototype $i$. Crisp membership: $u_{ij} \in \{0, 1\}$ with $\sum_{i=1}^{c} u_{ij} = 1$ for each $j$. Fuzzy membership: $u_{ij} \in [0, 1]$ with $\sum_{i=1}^{c} u_{ij} = 1$ for each $j$.

12 Fuzzy c-means (FCM) The objective function of FCM is $J = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \lVert x_j - p_i \rVert^2$ with fuzzifier $m > 1$, subject to $\sum_{i=1}^{c} u_{ij} = 1$ for each $j$.

13 FCM (Cnt'd) Introducing a Lagrange multiplier $\lambda_j$ for each constraint $\sum_{i=1}^{c} u_{ij} = 1$, we rewrite the objective function as $\bar{J} = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \lVert x_j - p_i \rVert^2 + \sum_{j=1}^{n} \lambda_j \bigl(\sum_{i=1}^{c} u_{ij} - 1\bigr)$

14 FCM (Cnt'd) Setting the partial derivatives to zero, we obtain the 1st equation $\sum_{i=1}^{c} u_{ij} - 1 = 0$ (from $\partial \bar{J}/\partial \lambda_j$) and the 2nd equation $m\, u_{ij}^{m-1} \lVert x_j - p_i \rVert^2 + \lambda_j = 0$ (from $\partial \bar{J}/\partial u_{ij}$)

15 FCM (Cnt'd) From the 2nd equation, we obtain $u_{ij} = \bigl(\tfrac{-\lambda_j}{m \lVert x_j - p_i \rVert^2}\bigr)^{1/(m-1)}$. From this fact and the 1st equation, we obtain $\bigl(\tfrac{-\lambda_j}{m}\bigr)^{1/(m-1)} \sum_{l=1}^{c} \bigl(\tfrac{1}{\lVert x_j - p_l \rVert^2}\bigr)^{1/(m-1)} = 1$

16 FCM (Cnt'd) Therefore, $\bigl(\tfrac{-\lambda_j}{m}\bigr)^{1/(m-1)} = 1 \big/ \sum_{l=1}^{c} \bigl(1/\lVert x_j - p_l \rVert^2\bigr)^{1/(m-1)}$, and $-\lambda_j = m \Big/ \Bigl(\sum_{l=1}^{c} \bigl(1/\lVert x_j - p_l \rVert^2\bigr)^{1/(m-1)}\Bigr)^{m-1}$

17 FCM (Cnt'd) Together with the 2nd equation, we obtain the updating rule for $u_{ij}$: $u_{ij} = 1 \Big/ \sum_{l=1}^{c} \bigl( \lVert x_j - p_i \rVert^2 \big/ \lVert x_j - p_l \rVert^2 \bigr)^{1/(m-1)}$

18 FCM (Cnt'd) On the other hand, setting the derivative of $J$ with respect to $p_i$ to zero, we obtain $-2 \sum_{j=1}^{n} u_{ij}^{m} (x_j - p_i) = 0$

19 FCM (Cnt'd) It follows that $\sum_{j=1}^{n} u_{ij}^{m} x_j = \bigl(\sum_{j=1}^{n} u_{ij}^{m}\bigr) p_i$. Finally, we obtain the update rule for $p_i$: $p_i = \dfrac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}$

20 FCM (Cnt'd) To summarize: FCM alternates between updating the memberships $u_{ij}$ and the prototypes $p_i$ with the two rules above, until the memberships (or the objective) stop changing; a sketch follows below.
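
A minimal NumPy sketch of these two alternating updates (the fuzzifier default m = 2, the tolerance, and all names are my own illustrative choices):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means: alternate the u_ij and p_i update rules derived above."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Start from a random membership matrix whose columns sum to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # Prototype update: weighted means with weights u_ij^m.
        W = U ** m
        P = (W @ X) / W.sum(axis=1, keepdims=True)
        # Membership update: u_ij = 1 / sum_l (d_ij^2 / d_lj^2)^(1/(m-1)).
        d2 = ((X[None, :, :] - P[:, None, :]) ** 2).sum(axis=2)  # shape (c, n)
        d2 = np.maximum(d2, 1e-12)                               # avoid division by zero
        U_new = d2 ** (-1.0 / (m - 1))
        U_new /= U_new.sum(axis=0, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return P, U
```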

21 K-means vs. Fuzzy c-means (figure: sample points)

22 K-means vs. Fuzzy c-means (figures: K-means result and fuzzy c-means result)

23 Expectation-Maximization (EM) Algorithm

24 What Is Given Observed data: X = {x_1, x_2, ..., x_n}, each drawn independently from a mixture of probability distributions with density $p(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i\, p_i(x \mid \theta_i)$, where $\sum_{i=1}^{k} \alpha_i = 1$ and $\Theta = (\alpha_1, \ldots, \alpha_k, \theta_1, \ldots, \theta_k)$.

25 Incomplete vs. Complete Data The incomplete-data log-likelihood is given by $\log L(\Theta \mid X) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i\, p_i(x_j \mid \theta_i)$, which is difficult to optimize because of the log of a sum. The complete-data log-likelihood $\log L(\Theta \mid X, H) = \log p(X, H \mid \Theta)$ can be handled much more easily, where H is the set of hidden random variables (here, which component generated each sample). How do we compute the distribution of H?

26 EM Algorithm E-Step: first find the expected value $Q(\Theta, \Theta^{g})$ of the complete-data log-likelihood, where $\Theta^{g}$ is the current estimate of $\Theta$. M-Step: update the estimate by maximizing this expected value over $\Theta$. Repeat the two steps until convergence; the two steps are written out below.
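
In symbols, the two steps take the following standard form (with H the hidden variables and $\Theta^{g}$ the current estimate):

```latex
% E-step: expected complete-data log-likelihood under the current estimate \Theta^{g}
Q(\Theta, \Theta^{g})
  = E\!\left[\, \log p(X, H \mid \Theta) \;\middle|\; X, \Theta^{g} \right]
  = \sum_{h} p(h \mid X, \Theta^{g}) \, \log p(X, h \mid \Theta)

% M-step: re-estimate the parameters by maximizing the expectation
\Theta^{g+1} = \arg\max_{\Theta} \; Q(\Theta, \Theta^{g})
```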

27 E-M Steps

28 Justification The expected value computed in the E-step gives a lower bound of the log-likelihood; one way to write the bound is shown in (1) below.
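
For any distribution q(h) over the hidden variables, the standard argument runs as follows; the decomposition labeled (1) is the expression discussed on the next slide:

```latex
% Jensen's inequality applied to the incomplete-data log-likelihood
\log p(X \mid \Theta)
  = \log \sum_{h} q(h)\,\frac{p(X, h \mid \Theta)}{q(h)}
  \;\ge\; \sum_{h} q(h)\,\log \frac{p(X, h \mid \Theta)}{q(h)}

% The bound decomposes into a (negative) relative entropy plus a term independent of h
\sum_{h} q(h)\,\log \frac{p(X, h \mid \Theta)}{q(h)}
  = -\,D_{\mathrm{KL}}\!\bigl(q(h)\,\big\|\,p(h \mid X, \Theta)\bigr)
    + \log p(X \mid \Theta)
  \tag{1}
```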

29 Justification (Cnt'd) The maximum of the lower bound equals the log-likelihood. The first term of (1) is the (negative) relative entropy of q(h) with respect to $p(h \mid X, \Theta)$, and the second term, $\log p(X \mid \Theta)$, is a quantity that does not depend on h. We obtain the maximum of (1) when the relative entropy becomes zero, that is, when $q(h) = p(h \mid X, \Theta)$. With this choice, the first term becomes zero and (1) achieves its upper bound, which is $\log p(X \mid \Theta)$.

30 Details of EM Algorithm Let $\Theta^{g} = (\alpha_1^{g}, \ldots, \alpha_k^{g}, \theta_1^{g}, \ldots, \theta_k^{g})$ be the guessed values of $\Theta$. For the given $\Theta^{g}$, we can compute the posterior probability of component i given sample $x_j$: $p(i \mid x_j, \Theta^{g}) = \dfrac{\alpha_i^{g}\, p_i(x_j \mid \theta_i^{g})}{\sum_{l=1}^{k} \alpha_l^{g}\, p_l(x_j \mid \theta_l^{g})}$

31 Details (Cnt'd) We then consider the expected value $Q(\Theta, \Theta^{g}) = \sum_{j=1}^{n} \sum_{i=1}^{k} p(i \mid x_j, \Theta^{g}) \,\log\bigl(\alpha_i\, p_i(x_j \mid \theta_i)\bigr)$

32 Details (Cnt'd) To maximize over the mixture weights subject to $\sum_{i=1}^{k} \alpha_i = 1$, form the Lagrangian $\sum_{j=1}^{n}\sum_{i=1}^{k} p(i \mid x_j, \Theta^{g}) \log \alpha_i + \lambda\bigl(\sum_{i=1}^{k} \alpha_i - 1\bigr)$; setting its partial derivative with respect to $\alpha_i$ to zero gives $\dfrac{1}{\alpha_i}\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g}) + \lambda = 0$. (2)

33 Details (Cnt'd) From (2), we derive that λ = -n and $\alpha_i = \dfrac{1}{n}\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})$. Based on these values, we can derive the optimal $\theta_i$, of which only the following part of $Q$ involves $\theta_i$: $\sum_{j=1}^{n}\sum_{i=1}^{k} p(i \mid x_j, \Theta^{g}) \log p_i(x_j \mid \theta_i)$

34 Exercise: 4. Deduce from (2) that λ = -n and $\alpha_i = \dfrac{1}{n}\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})$

35 Gaussian Mixtures The d-dimensional Gaussian distribution is given by $p_i(x \mid \mu_i, \Sigma_i) = \dfrac{1}{(2\pi)^{d/2}\,\lvert \Sigma_i \rvert^{1/2}} \exp\!\bigl(-\tfrac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)\bigr)$. For Gaussian mixtures, $\theta_i = (\mu_i, \Sigma_i)$, so the parameters to estimate are the weights $\alpha_i$, the means $\mu_i$, and the covariances $\Sigma_i$.

36 Gaussian Mixtures (Cnt'd) Partial derivative: taking the derivative of $Q$ with respect to $\mu_i$ and setting it to zero, we obtain $\mu_i = \dfrac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})\, x_j}{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}$

37 Gaussian Mixtures (Cnt'd) Taking the derivative of $Q$ with respect to $\Sigma_i$ and setting it to zero, we get (many details are omitted) $\Sigma_i = \dfrac{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})\,(x_j - \mu_i)(x_j - \mu_i)^{T}}{\sum_{j=1}^{n} p(i \mid x_j, \Theta^{g})}$

38 Gaussian Mixtures (Cnt'd) To summarize: each EM iteration first computes the posteriors $p(i \mid x_j, \Theta^{g})$ (E-step) and then updates $\alpha_i$, $\mu_i$, and $\Sigma_i$ with the three formulas above (M-step); a sketch follows below.
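
A compact NumPy sketch of the EM loop for a Gaussian mixture built from the rules above (the covariance regularization term `reg` and all names are illustrative assumptions, not from the slides):

```python
import numpy as np

def gmm_em(X, k, n_iter=100, seed=0, reg=1e-6):
    """EM for a Gaussian mixture: E-step computes posteriors p(i | x_j),
    M-step re-estimates weights, means, and covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize means from random samples, covariances as identity, equal weights.
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    cov = np.stack([np.eye(d) for _ in range(k)])
    alpha = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[j, i] = p(i | x_j, current parameters).
        r = np.empty((n, k))
        for i in range(k):
            diff = X - mu[i]
            inv = np.linalg.inv(cov[i])
            quad = np.einsum("jd,de,je->j", diff, inv, diff)
            norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov[i]))
            r[:, i] = alpha[i] * np.exp(-0.5 * quad) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the three update rules summarized above.
        nk = r.sum(axis=0)                      # effective counts per component
        alpha = nk / n                          # new mixture weights
        mu = (r.T @ X) / nk[:, None]            # new means
        for i in range(k):
            diff = X - mu[i]
            cov[i] = (r[:, i, None] * diff).T @ diff / nk[i] + reg * np.eye(d)
    return alpha, mu, cov
```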

39 Linear Discriminant Analysis (LDA)

40 Illustration

41 Definitions Given: samples $x_1, x_2, \ldots, x_n$; $n_i$ of them belong to class i, i = 1, 2, ..., c. Definitions: sample mean for class i: $m_i = \tfrac{1}{n_i} \sum_{x \in \text{class } i} x$; scatter matrix for class i: $S_i = \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^{T}$.

42 Scatter Matrices Total scatter matrix: $S_T = \sum_{j=1}^{n} (x_j - m)(x_j - m)^{T}$, where m is the mean of all samples. Within-class scatter matrix: $S_W = \sum_{i=1}^{c} S_i$. Between-class scatter matrix: $S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^{T}$. Note that $S_T = S_W + S_B$.

43 Multiple Discriminant Analysis We seek vectors $w_i$, i = 1, 2, ..., c-1, and project the samples x to the (c-1)-dimensional space $y = W^{T} x$. The criterion for $W = (w_1, w_2, \ldots, w_{c-1})$ is $J(W) = \dfrac{\lvert W^{T} S_B W \rvert}{\lvert W^{T} S_W W \rvert}$

44 Multiple Discriminant Analysis (Cnt'd) Consider the Lagrangian $L(w_i, \lambda_i) = w_i^{T} S_B w_i - \lambda_i \bigl(w_i^{T} S_W w_i - 1\bigr)$. Take the partial derivative $\partial L / \partial w_i = 2 S_B w_i - 2 \lambda_i S_W w_i$. Setting the derivative to zero, we obtain the generalized eigenvalue problem $S_B w_i = \lambda_i S_W w_i$.

45 Multiple Discriminant Analysis (Cnt'd) Find the roots of the characteristic equation $\det(S_B - \lambda S_W) = 0$ as eigenvalues, and then solve $S_B w_i = \lambda_i S_W w_i$ for the $w_i$ corresponding to the largest c-1 eigenvalues.

46 LDA Prototypes The prototype of each class is the mean of that class's projected samples, where the projection is through the matrix W. In the testing phase, all test samples are projected through the same optimal W, and the nearest prototype is the winner; a sketch follows below.
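
A minimal sketch of this procedure, assuming SciPy is available for the generalized eigenvalue problem $S_B w = \lambda S_W w$ (the small regularization of $S_W$ and all names are my own):

```python
import numpy as np
from scipy.linalg import eigh

def lda_prototypes(X, y, reg=1e-6):
    """Fit LDA: build S_W and S_B, keep the top c-1 generalized eigenvectors as W,
    and store each class's projected mean as its prototype."""
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem S_B w = lambda S_W w; keep the c-1 largest eigenvalues.
    vals, vecs = eigh(S_B, S_W + reg * np.eye(d))
    W = vecs[:, np.argsort(vals)[::-1][: len(classes) - 1]]
    prototypes = {c: (X[y == c] @ W).mean(axis=0) for c in classes}
    return W, prototypes

def lda_classify(x, W, prototypes):
    """Project a test sample through W and return the class of the nearest prototype."""
    z = x @ W
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))
```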

47 K-Nearest Neighbor (K-NN) Classifier

48 K-NN Classifier For each test sample x, find the K nearest training samples and classify x by a majority vote among those K neighbors (a sketch follows below). For the nearest-neighbor rule (K = 1), the asymptotic error rate P satisfies $P^{*} \le P \le P^{*}\bigl(2 - \tfrac{c}{c-1} P^{*}\bigr)$, where $P^{*}$ is the Bayes error and c is the number of classes. This shows that the error rate is at most twice the Bayes error.
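
A short brute-force sketch of the voting rule (all names are my own; ties are broken arbitrarily by the counter):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, K=3):
    """Classify x by a majority vote among its K nearest training samples (L2 distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```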

49 Learning Vector Quantization (LVQ)

50 LVQ Algorithm
1. Initialize R prototypes for each class: m_1(k), m_2(k), ..., m_R(k), where k = 1, 2, ..., K.
2. Sample a training point x and find the prototype m_j(k) nearest to x:
a) If x and m_j(k) match in class type, move the prototype toward x: m_j(k) ← m_j(k) + ε (x - m_j(k))
b) Otherwise, move it away from x: m_j(k) ← m_j(k) - ε (x - m_j(k))
3. Repeat step 2, decreasing the learning rate ε at each iteration (a sketch follows below)
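
A sketch of this loop using the standard LVQ1 update; the linearly decaying step size and all names are illustrative choices:

```python
import numpy as np

def lvq1(X, y, R=2, epochs=20, eps0=0.1, seed=0):
    """LVQ1: initialize R prototypes per class, then attract the nearest prototype
    toward correctly labeled samples and repel it from incorrectly labeled ones."""
    rng = np.random.default_rng(seed)
    protos, labels = [], []
    for k in np.unique(y):
        idx = rng.choice(np.flatnonzero(y == k), size=R, replace=False)
        protos.append(X[idx])
        labels.extend([k] * R)
    protos = np.vstack(protos).astype(float)
    labels = np.array(labels)
    for epoch in range(epochs):
        eps = eps0 * (1.0 - epoch / epochs)     # decrease the step size each pass
        for i in rng.permutation(len(X)):
            x, k = X[i], y[i]
            j = np.argmin(np.linalg.norm(protos - x, axis=1))  # nearest prototype
            if labels[j] == k:
                protos[j] += eps * (x - protos[j])  # same class: move toward x
            else:
                protos[j] -= eps * (x - protos[j])  # different class: move away
    return protos, labels
```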

51 References
F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, John Wiley & Sons, 1999.
J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," www.cs.berkeley.edu/~daf/appsem/WordsAndPictures/Papers/bilmes98gentle.pdf
T. P. Minka, "Expectation-Maximization as Lower Bound Maximization," www.stat.cmu.edu/~minka/papers/em.html
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.

