ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Mixture Densities Maximum Likelihood Estimates.

Presentation transcript:

ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 31: CLUSTERING
Objectives: Mixture Densities; Maximum Likelihood Estimates; Application to Gaussian Mixture Models; k-Means Clustering; Fuzzy k-Means Clustering
Resources: J.B.: EM Estimation; A.M.: GMM Models; D.D.: Clustering; C.B.: Unsupervised Clustering; Wiki: K-Means; A.M.: Hierarchical Clustering; MU: Introduction to Clustering; Java PR Applet

ECE 8527: Lecture 31, Slide 1 Introduction
Training procedures that use labeled samples are referred to as supervised. Unsupervised procedures use unlabeled data. There are seven basic reasons why we are interested in unsupervised methods:
1) Collecting and labeling data is very costly and nontrivial (often this is a research problem in itself).
2) Heuristic (application-specific) methods exist that allow us to improve a classifier trained using supervised techniques by introducing large amounts of unlabeled data. This is often faster than labeling data.
3) We would like to exploit “found” data such as that available on the Internet. Often this data is not truth-marked or is only partially transcribed.
4) Reversal of the training process: train on unlabeled data and then use supervision to label the groupings.
5) Models often need to be adapted over time.
6) Use unsupervised methods to find features that will be useful for categorization.
7) Perform rapid exploratory analysis to gain insight into a new problem.
In this chapter, we begin with parametric models and then step back and consider nonparametric techniques.

ECE 8527: Lecture 31, Slide 2 Mixture Densities
Assume:
• The samples come from a known number, c, of classes.
• The prior probabilities, $P(\omega_j)$, for j = 1, …, c, are known.
• The forms of the class-conditional probability densities, $p(x|\omega_j, \theta_j)$, are known.
• The values of the c parameter vectors $\theta_1, \ldots, \theta_c$ are unknown.
• The category labels are unknown.
The probability density function for the samples is given by:
$$p(x|\theta) = \sum_{j=1}^{c} p(x|\omega_j, \theta_j)\, P(\omega_j)$$
where $\theta = (\theta_1, \ldots, \theta_c)^t$. The prior probabilities, $P(\omega_j)$, are called the mixing parameters and sum to one.
A density, $p(x|\theta)$, is said to be identifiable if $\theta \neq \theta'$ implies there exists an x such that $p(x|\theta) \neq p(x|\theta')$. (A density is unidentifiable if we cannot recover a unique θ from an infinite amount of data.) Identifiability of θ is a property of the model and not of the procedure used to estimate the model.
We have already discussed methods to estimate these mixture coefficients.
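As a small illustration of the mixture density on this slide, the sketch below evaluates p(x|θ) as the prior-weighted sum of the component densities. It assumes univariate Gaussian components; the function names and the numeric values are illustrative choices, not taken from the lecture.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, priors, means, variances):
    """p(x|theta) = sum_j p(x|w_j, theta_j) P(w_j) for Gaussian components."""
    return sum(P * gaussian_pdf(x, m, v)
               for P, m, v in zip(priors, means, variances))

# Illustrative (assumed) parameters: two components with unit variance.
priors = [1.0 / 3.0, 2.0 / 3.0]   # mixing parameters, sum to one
means = [-2.0, 2.0]
variances = [1.0, 1.0]

density_at_zero = mixture_pdf(0.0, priors, means, variances)
print(density_at_zero)
```

Because the mixing parameters sum to one and each component integrates to one, the mixture itself integrates to one, which is easy to confirm numerically on a fine grid.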

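ECE 8527: Lecture 31, Slide 3 Maximum Likelihood Estimates
Given a set $D = \{x_1, \ldots, x_n\}$ of n unlabeled samples drawn independently from the mixture density, the likelihood of the observed samples is:
$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$$
The maximum likelihood estimate, $\hat{\theta}$, is the value of θ that maximizes $p(D|\theta)$. If we differentiate the log-likelihood, $l = \sum_{k=1}^{n} \ln p(x_k|\theta)$, with respect to $\theta_i$:
$$\nabla_{\theta_i} l = \sum_{k=1}^{n} \frac{1}{p(x_k|\theta)} \nabla_{\theta_i} \left[ \sum_{j=1}^{c} p(x_k|\omega_j, \theta_j)\, P(\omega_j) \right]$$
Assume $\theta_i$ and $\theta_j$ are functionally independent if i ≠ j. Substitute the posterior:
$$P(\omega_i|x_k, \theta) = \frac{p(x_k|\omega_i, \theta_i)\, P(\omega_i)}{p(x_k|\theta)}$$
The gradient can be written as:
$$\nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i|x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k|\omega_i, \theta_i)$$
The gradient must vanish at the value of $\theta_i$ that maximizes the log-likelihood. Therefore, the ML solution must satisfy:
$$\sum_{k=1}^{n} P(\omega_i|x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k|\omega_i, \hat{\theta}_i) = 0, \quad i = 1, \ldots, c$$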
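The gradient result on this slide (the sum over samples of the posterior times the gradient of the component log-density) can be checked numerically. The sketch below does so for the mean of one component, assuming univariate Gaussian components with known variances, for which the derivative of ln p(x|ω_i, μ_i) with respect to μ_i is (x − μ_i)/σ_i². The data values and parameters are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, priors, means, variances):
    """P(w_i|x, theta) for every component i, via Bayes' rule."""
    joint = [P * gaussian_pdf(x, m, v) for P, m, v in zip(priors, means, variances)]
    total = sum(joint)                      # this is p(x|theta)
    return [j / total for j in joint]

def log_likelihood(data, priors, means, variances):
    """l = sum_k ln p(x_k|theta)."""
    return sum(
        math.log(sum(P * gaussian_pdf(x, m, v)
                     for P, m, v in zip(priors, means, variances)))
        for x in data)

def grad_wrt_mean(data, i, priors, means, variances):
    """sum_k P(w_i|x_k, theta) * (x_k - mu_i) / var_i (Gaussian case)."""
    return sum(posteriors(x, priors, means, variances)[i]
               * (x - means[i]) / variances[i] for x in data)

# Hypothetical data and parameters, chosen only for the check.
data = [-2.5, -1.0, 0.3, 1.8, 2.2]
priors, means, variances = [0.4, 0.6], [-1.0, 1.5], [1.0, 1.0]

analytic = grad_wrt_mean(data, 0, priors, means, variances)
print(analytic)
```

Comparing `analytic` against a central finite difference of `log_likelihood` in the first mean is a quick sanity check that the posterior-weighted form is the true gradient.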

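ECE 8527: Lecture 31, Slide 4 Generalization of the ML Estimate
We can generalize these results to include the prior probability, $P(\omega_i)$, among the unknown quantities. The search for the maximum value of $p(D|\theta)$ extends over θ and $P(\omega_i)$, subject to the constraints:
$$P(\omega_i) \geq 0, \quad i = 1, \ldots, c, \qquad \sum_{i=1}^{c} P(\omega_i) = 1$$
It can be shown that the ML estimates must satisfy:
$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})$$
$$\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k|\omega_i, \hat{\theta}_i) = 0$$
$$\hat{P}(\omega_i|x_k, \hat{\theta}) = \frac{p(x_k|\omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k|\omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$
The first equation simply states that the estimate of the prior is computed by averaging the posteriors over the entire data set. The second equation just restates the ML principle that the optimal value of θ produces a maximum. The third equation we have seen before in the HMM section of this course. So the good news here is that doing the obvious maximizes the posterior.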

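ECE 8527: Lecture 31, Slide 5 Unknown Mean Vectors
If the only unknown quantities are the mean vectors, $\mu_i$, we can write:
$$\ln p(x|\omega_i, \mu_i) = -\ln\left[(2\pi)^{d/2} |\Sigma_i|^{1/2}\right] - \frac{1}{2} (x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i)$$
and its derivative:
$$\nabla_{\mu_i} \ln p(x|\omega_i, \mu_i) = \Sigma_i^{-1} (x - \mu_i)$$
The ML solution must satisfy:
$$\sum_{k=1}^{n} P(\omega_i|x_k, \hat{\mu})\, \Sigma_i^{-1} (x_k - \hat{\mu}_i) = 0$$
Rearranging terms:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i|x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i|x_k, \hat{\mu})}$$
But this does not give us a new estimate explicitly, because $\hat{\mu}$ appears on both sides, nor does it typically give closed-form solutions. Instead, we can use an iterative hill-climbing (gradient-ascent) approach:
$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i|x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i|x_k, \hat{\mu}(j))}$$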
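The iterative estimate of the means can be sketched in a few lines: each new mean is the posterior-weighted average of the samples, with the posteriors computed from the current means. This is a minimal sketch assuming a one-dimensional mixture with known priors and a known, shared variance; the sample size, seed, starting point, and stopping tolerance are arbitrary illustrative choices.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def update_means(data, means, priors, var=1.0):
    """One pass: mu_i <- sum_k P(w_i|x_k) x_k / sum_k P(w_i|x_k)."""
    c = len(means)
    num = [0.0] * c
    den = [0.0] * c
    for x in data:
        joint = [priors[i] * gaussian_pdf(x, means[i], var) for i in range(c)]
        total = sum(joint)
        for i in range(c):
            post = joint[i] / total         # posterior under the current means
            num[i] += post * x
            den[i] += post
    return [num[i] / den[i] for i in range(c)]

def estimate_means(data, init_means, priors, var=1.0, max_iters=200, tol=1e-9):
    """Repeat the update until the means stop changing (or max_iters)."""
    means = list(init_means)
    for _ in range(max_iters):
        new_means = update_means(data, means, priors, var)
        if max(abs(a - b) for a, b in zip(new_means, means)) < tol:
            return new_means
        means = new_means
    return means

# Synthetic data (assumed setup): priors 1/3 and 2/3, true means -2 and 2.
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) if rng.random() < 1.0 / 3.0 else rng.gauss(2.0, 1.0)
        for _ in range(500)]
est = estimate_means(data, [-1.0, 1.0], [1.0 / 3.0, 2.0 / 3.0])
print(est)
```

The result depends on the starting point: the procedure climbs to a local maximum of the likelihood, so a poor initialization can end at a different peak.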

ECE 8527: Lecture 31, Slide 6 Example
Consider the simple two-component, one-dimensional normal mixture. Generate 25 samples sequentially assuming $\mu_1 = -2$ and $\mu_2 = 2$. The likelihood function is calculated as a function of the estimates for the two means. We see that while the maximum is achieved near the true means, there are two peaks of comparable height:
• $\mu_1$ = … and $\mu_2$ = …
• $\mu_1$ = … and $\mu_2$ = …

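ECE 8527: Lecture 31, Slide 7 All Parameters Unknown
If the means, covariances, and priors are all unknown, the ML principle yields singular solutions. If we only consider solutions in the neighborhood of the largest local maximum, we can derive estimation equations:
$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})$$
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})}$$
$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})\, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i|x_k, \hat{\theta})}$$
$$\hat{P}(\omega_i|x_k, \hat{\theta}) = \frac{p(x_k|\omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k|\omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$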
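When priors, means, and variances are all unknown, the estimation equations can be applied repeatedly: compute the posteriors from the current parameters, then re-estimate each prior, mean, and variance as posterior-weighted averages. The sketch below does this for a one-dimensional Gaussian mixture. The variance floor `var_floor` is an added safeguard against the singular solutions mentioned on the slide; it is an assumption of this sketch, not part of the lecture, as are the synthetic data and starting values.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def reestimate(data, priors, means, variances, var_floor=1e-3):
    """One round of the estimation equations for a 1-D Gaussian mixture."""
    n, c = len(data), len(priors)
    # Posteriors P(w_i|x_k, theta) under the current parameters.
    resp = []
    for x in data:
        joint = [priors[i] * gaussian_pdf(x, means[i], variances[i]) for i in range(c)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    new_priors, new_means, new_vars = [], [], []
    for i in range(c):
        weight = sum(r[i] for r in resp)
        mu = sum(r[i] * x for r, x in zip(resp, data)) / weight
        var = sum(r[i] * (x - mu) ** 2 for r, x in zip(resp, data)) / weight
        new_priors.append(weight / n)          # average posterior over the data
        new_means.append(mu)                   # posterior-weighted mean
        new_vars.append(max(var, var_floor))   # floor avoids a collapsing component
    return new_priors, new_means, new_vars

# Synthetic data (assumed): 30% from N(-2, 1), 70% from N(3, 0.25).
rng = random.Random(1)
data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(3.0, 0.5)
        for _ in range(600)]

priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(200):
    priors, means, variances = reestimate(data, priors, means, variances)
print(priors, means, variances)
```

A fixed iteration count keeps the example short; in practice one would usually stop when the log-likelihood or the parameters stop changing.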

ECE 8527: Lecture 31, Slide 8 K-Means Clustering
An approximate technique to determine the parameters of a mixture distribution is k-Means: k is the number of cluster centers, c, and “means” refers to the iterative process for finding the cluster centroids.
We observe that the probability, $\hat{P}(\omega_i|x_k, \hat{\theta})$, is large when the squared Mahalanobis distance, $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)$, is small. Suppose we merely compute the squared Euclidean distance, $\|x_k - \hat{\mu}_i\|^2$, find the mean $\hat{\mu}_m$ nearest to $x_k$, and approximate $\hat{P}(\omega_i|x_k, \hat{\theta})$ as:
$$\hat{P}(\omega_i|x_k, \hat{\theta}) \approx \begin{cases} 1 & i = m \\ 0 & \text{otherwise} \end{cases}$$
We can formally define the k-Means clustering algorithm:
• Initialize: select the number of clusters, c, and seed the means, $\mu_1, \ldots, \mu_c$.
• Iterate:
  o Classify the n samples according to the nearest mean, $\mu_i$.
  o Recompute each mean using the $n_i$ samples assigned to cluster i.
• Until: no change in any $\mu_i$.
• Done: return $\mu_1, \ldots, \mu_c$.
Later we will see this is one case of an iterative optimization algorithm. There are many ways to cluster, recompute means, merge/split clusters and stop.
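The k-Means steps listed on this slide translate almost directly into code. The following is a minimal sketch using squared Euclidean distance on points stored as tuples; the handling of an empty cluster (keep its previous mean) and the iteration cap are assumptions of this sketch, since the slide does not specify them.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(samples, init_means, max_iters=100):
    """Alternate nearest-mean classification and mean recomputation."""
    means = [tuple(m) for m in init_means]
    for _ in range(max_iters):
        # Classify each sample according to the nearest mean.
        clusters = [[] for _ in means]
        for x in samples:
            nearest = min(range(len(means)), key=lambda i: sq_dist(x, means[i]))
            clusters[nearest].append(x)
        # Recompute each mean from the samples assigned to its cluster.
        new_means = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_means.append(tuple(
                    sum(p[d] for p in members) / len(members) for d in range(dim)))
            else:
                new_means.append(means[i])  # assumption: keep an empty cluster's mean
        if new_means == means:              # stop when no mean changes
            break
        means = new_means
    return means

# Two well-separated toy groups (illustrative data).
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = k_means(samples, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

On this toy data the two seeds are already in the right groups, so the means settle at the centroid of each group after the first recomputation.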

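ECE 8527: Lecture 31, Slide 9 Fuzzy K-Means Clustering
In k-Means, each data point is assumed to reside in one and only one cluster. We can allow “fuzzy” membership: a data point $x_j$ can appear in cluster i with probability $\hat{P}(\omega_i|x_j, \hat{\theta})$. We can minimize a heuristic global cost function:
$$J_{fuz} = \sum_{i=1}^{c} \sum_{j=1}^{n} \left[\hat{P}(\omega_i|x_j, \hat{\theta})\right]^b \|x_j - \mu_i\|^2$$
where b is a free parameter that adjusts the blending of different clusters. The probabilities of cluster membership for each point are normalized as:
$$\sum_{i=1}^{c} \hat{P}(\omega_i|x_j) = 1, \quad j = 1, \ldots, n$$
The relevant reestimation equations are:
$$\mu_i = \frac{\sum_{j=1}^{n} \left[\hat{P}(\omega_i|x_j)\right]^b x_j}{\sum_{j=1}^{n} \left[\hat{P}(\omega_i|x_j)\right]^b}$$
$$\hat{P}(\omega_i|x_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \quad d_{ij} = \|x_j - \mu_i\|^2$$
This can be viewed as a form of soft quantization, and fits nicely with our general notion of probabilistic modeling and EM estimation.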
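A fuzzy k-Means iteration alternates two updates: memberships proportional to (1/d)^(1/(b−1)), normalized over clusters, where d is the squared distance from the point to the cluster mean, and means computed as averages weighted by the memberships raised to the power b. The sketch below works on one-dimensional data and assumes b > 1; the small constant `eps` that guards against a zero distance, the stopping tolerance, and the toy data are assumptions of this sketch.

```python
def memberships_for(data, means, b=2.0, eps=1e-12):
    """Normalized memberships: proportional to (1/d_ij)^(1/(b-1)), d = squared distance."""
    rows = []
    for x in data:
        weights = [(1.0 / max((x - m) ** 2, eps)) ** (1.0 / (b - 1.0)) for m in means]
        total = sum(weights)
        rows.append([w / total for w in weights])   # each row sums to one
    return rows

def fuzzy_k_means(data, init_means, b=2.0, max_iters=100, tol=1e-10):
    """Alternate membership and mean updates until the means settle."""
    means = list(init_means)
    for _ in range(max_iters):
        u = memberships_for(data, means, b)
        new_means = []
        for i in range(len(means)):
            powered = [row[i] ** b for row in u]     # memberships raised to b
            new_means.append(sum(w * x for w, x in zip(powered, data)) / sum(powered))
        shift = max(abs(a - c) for a, c in zip(new_means, means))
        means = new_means
        if shift < tol:
            break
    return means, memberships_for(data, means, b)

# Toy one-dimensional data with two obvious groups (illustrative).
data = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
fuzzy_means, fuzzy_memberships = fuzzy_k_means(data, [1.0, 8.0])
print(fuzzy_means)
```

As b approaches 1 the memberships become nearly hard assignments, as in ordinary k-Means, while larger b blends the clusters more strongly.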

ECE 8527: Lecture 31, Slide 10 Demonstrations

ECE 8527: Lecture 31, Slide 11 Summary Introduced the concept of unsupervised clustering. Reviewed the reestimation equations for ML estimates of mixtures. Discussed application to Gaussian mixture distributions. Introduced k-Means and Fuzzy k-Means clustering. Demonstrated clustering using the Java Pattern Recognition applet.