Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.

Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

Pattern Classification, Chapter 10 1 Unsupervised Learning & Clustering
Introduction
Mixture Densities and Identifiability
Maximum-Likelihood Estimates
Application to Normal Mixtures
Data Description and Clustering
Criterion Functions for Clustering
Hierarchical Clustering
Component Analysis

Pattern Classification, Chapter 10 2 Introduction
Previously, all our training samples were labeled; learning from such samples is said to be "supervised." We now investigate "unsupervised" procedures that use unlabeled samples. There are at least five reasons for doing so:
1. Collecting and labeling a large set of samples can be costly.
2. We can train with large amounts of (less expensive) unlabeled data and only then use supervision to label the groupings found; this is appropriate for large "data mining" applications.
3. Unsupervised methods are also appropriate in applications where the characteristics of the patterns change slowly with time.
4. We can use unsupervised methods to identify features that will then be useful for categorization.
5. We gain some insight into the nature (or structure) of the data.

Pattern Classification, Chapter 10 3 Mixture Densities and Identifiability
We begin with the assumption that the functional forms of the underlying probability densities are known and that the only thing to be learned is the value of an unknown parameter vector. We make the following assumptions:
1. The samples come from a known number c of classes.
2. The prior probabilities P(ω_j), j = 1, ..., c, are known.
3. The forms of the class-conditional densities p(x | ω_j, θ_j), j = 1, ..., c, are known.
4. The values of the c parameter vectors θ_1, θ_2, ..., θ_c are unknown.
5. The category labels are unknown.

Pattern Classification, Chapter 10 4 This density function (written out below) is called a mixture density. Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a maximum a posteriori (MAP) classifier on the derived densities.
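Under the assumptions of the previous slide, the mixture density has the standard form, with the class-conditional densities as component densities and the priors as mixing parameters:

```latex
p(\mathbf{x} \mid \boldsymbol{\theta})
  = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j),
  \qquad \boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c)
```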

Pattern Classification, Chapter 10 5 Definition: a density p(x | θ) is said to be identifiable if θ ≠ θ' implies that there exists an x such that p(x | θ) ≠ p(x | θ'), i.e., a density is identifiable only if we can recover unique parameters from it. Most mixtures of commonly encountered real-world density functions are identifiable, especially continuous densities. The textbook gives some unidentifiable examples (one is sketched below). Although identifiability can be a problem, we will always assume that the densities we are dealing with are identifiable.
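One standard non-identifiable example, along the lines of the discrete case discussed in the text (the specific form here is given for illustration), is a mixture of two Bernoulli components with equal priors:

```latex
P(x \mid \boldsymbol{\theta})
  = \tfrac{1}{2}\,\theta_1^{\,x}(1-\theta_1)^{1-x}
  + \tfrac{1}{2}\,\theta_2^{\,x}(1-\theta_2)^{1-x},
  \qquad x \in \{0, 1\}
```

Here P(x = 1 | θ) = (θ_1 + θ_2)/2, so the data determine only the sum θ_1 + θ_2; the individual parameters cannot be recovered uniquely.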

Pattern Classification, Chapter 10 6 Maximum-Likelihood Estimates
Suppose that we have a set D = {x_1, ..., x_n} of n unlabeled samples drawn independently from the mixture density, where θ is fixed but unknown. To estimate θ, take the gradient of the log-likelihood with respect to each θ_i and set it to zero; the resulting conditions are sketched below.
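A sketch of the quantities involved, following the standard mixture-likelihood derivation: the log-likelihood of the sample set and the gradient condition for each component's parameters are

```latex
l(\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta}),
\qquad
\nabla_{\boldsymbol{\theta}_i}\, l
  = \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta})\,
    \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)
  = 0
```

where P(ω_i | x_k, θ) = p(x_k | ω_i, θ_i) P(ω_i) / p(x_k | θ) is the posterior probability of class ω_i given x_k.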

Pattern Classification, Chapter 10 7 Applications to Normal Mixtures
p(x | ω_i, θ_i) ~ N(μ_i, Σ_i)
Case 1 is the simplest case.

Case   μ_i   Σ_i   P(ω_i)   c
1      ?     x     x        x
2      ?     ?     ?        x
3      ?     ?     ?        ?

x = known, ? = unknown

Pattern Classification, Chapter 10 8 Case 1: Unknown mean vectors
This "simplest" case is not easy, and the textbook derives an iterative gradient-ascent (hill-climbing) procedure to maximize the log-likelihood function. Example: consider a simple two-component, one-dimensional normal mixture. Set μ_1 = -2 and μ_2 = 2 and draw 25 samples from this mixture. The log-likelihood function is sketched below.
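A sketch of the log-likelihood for this example, assuming (for illustration) unit-variance components with mixing weights P(ω_1) and P(ω_2):

```latex
l(\mu_1, \mu_2)
  = \sum_{k=1}^{25} \ln\!\Big[
      P(\omega_1)\,\tfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_k - \mu_1)^2}
    + P(\omega_2)\,\tfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_k - \mu_2)^2}
    \Big]
```

Gradient ascent on l(μ_1, μ_2), starting from an initial guess, climbs to a local maximum of this surface.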

Pattern Classification, Chapter 10 9

Pattern Classification, Chapter 10 10 k-Means Clustering
A popular approximation method for estimating the c mean vectors μ_1, μ_2, ..., μ_c: replace the squared Mahalanobis distance with the squared Euclidean distance, find the mean nearest to x_k, and approximate the posterior P(ω_i | x_k, θ) by a hard assignment to that nearest mean (see below). An iterative scheme is then used to find the estimates of μ_1, ..., μ_c; the number of iterations required is usually much smaller than the number of samples.
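A sketch of the approximation described above: the posterior that weights each sample in the exact maximum-likelihood update is replaced by a hard assignment to the nearest mean, which turns the update into a simple cluster average,

```latex
\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) \approx
\begin{cases}
1 & \text{if } i = \arg\min_{j} \lVert \mathbf{x}_k - \hat{\boldsymbol{\mu}}_j \rVert^2 \\[2pt]
0 & \text{otherwise}
\end{cases}
\qquad\Rightarrow\qquad
\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{\mathbf{x}_k \in D_i} \mathbf{x}_k
```

where D_i is the set of the n_i samples currently assigned to mean i.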

Pattern Classification, Chapter 10 11 If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

Begin
  initialize n, c, μ_1, μ_2, ..., μ_c (randomly selected)
  do
    classify the n samples according to the nearest μ_i
    recompute μ_i
  until no change in μ_i
  return μ_1, μ_2, ..., μ_c
End
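A minimal Python/NumPy sketch of the procedure above. Initializing the means with randomly selected samples follows the pseudocode; the convergence test and the toy data at the bottom are illustrative assumptions, not from the slides.

```python
import numpy as np

def k_means(X, c, rng=None, max_iter=100):
    """Cluster the rows of X into c clusters; return (means, labels)."""
    rng = np.random.default_rng(rng)
    # Initialize the c means with randomly selected samples.
    means = X[rng.choice(len(X), size=c, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Classify each sample according to the nearest mean (squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster lost all its samples.
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(c)])
        if np.allclose(new_means, means):   # no change in the means: converged
            break
        means = new_means
    return means, labels

# Toy two-cluster data set for illustration only.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
    means, labels = k_means(X, c=2, rng=0)
    print(means)
```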

Pattern Classification, Chapter 10 12 Two-class example: compare with the maximum-likelihood result in the previous figure.

Pattern Classification, Chapter 10 13 Three-class example: convergence in three iterations.

Pattern Classification, Chapter 10 14 Data Description and Clustering
The structure of multidimensional patterns is important for clustering. If we know that the data come from a specific distribution, they can be represented by a compact set of parameters (sufficient statistics), e.g., the mean and covariance matrix for a normal distribution (estimates below). If the samples are assumed to come from a specific distribution but actually do not, these statistics give a misleading representation of the data (next figure).
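For the normal case mentioned above, the compact description consists of the sample mean and sample covariance, i.e., the maximum-likelihood estimates:

```latex
\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k,
\qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n}
  (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^{\mathsf{T}}
```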

Pattern Classification, Chapter 10 15

Pattern Classification, Chapter 10 16 A mixture of normal distributions can approximate a large variety of situations, i.e., essentially any density function. In these cases, one can use parametric methods to estimate the parameters of the mixture density. However, if little prior knowledge can be assumed, the assumption of a parametric form is meaningless: we would actually be imposing structure on the data rather than finding structure in it. In these cases, one can use nonparametric methods to estimate the unknown mixture density. If the goal is to find subclasses, one can use a clustering procedure to identify groups of data points having strong internal similarities.

Pattern Classification, Chapter 10 17 Similarity measures
The question is how to judge that the samples in one cluster are more similar to one another than to samples in other clusters. There are two issues:
How do we measure the similarity between samples?
How do we evaluate a partitioning of a set into clusters?
The most obvious measure of similarity between two samples is the distance between them, i.e., we define a metric. Once this measure is defined, we expect the distance between samples in the same cluster to be significantly less than the distance between samples in different clusters.

Pattern Classification, Chapter 10 18 Euclidean distance is one possible metric: a simple criterion is to assign samples to the same cluster if the distance between them is less than a threshold d_0. Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not to general transformations that distort the distance relationships.

Pattern Classification, Chapter 10 19

Pattern Classification, Chapter 10 20

Pattern Classification, Chapter 10 21

Pattern Classification, Chapter 10 22 Scaling axes

Pattern Classification, Chapter 10 23 Scaling for unit variance may be undesirable

Pattern Classification, Chapter 10 24 Criterion Functions for Clustering
The second issue: how do we evaluate a partitioning of a set into clusters? Clustering can be posed as the optimization of a criterion function, e.g.:
1. The sum-of-squared-error criterion and its variants
2. Scatter criteria
The sum-of-squared-error criterion is written out below.
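In standard form, with D_i the set of the n_i samples assigned to cluster i and m_i its mean vector (the same m_i and J_e referred to on the next slide):

```latex
J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2,
\qquad
\mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x}
```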

Pattern Classification, Chapter 10 25 This criterion defines clusters by their mean vectors m_i, in the sense that it minimizes the sum of the squared lengths of the errors x - m_i. The optimal partition is the one that minimizes J_e; it is also called the minimum-variance partition. It works well when the clusters form well-separated compact clouds, and less well when there are large differences in the number of samples in different clusters.

Pattern Classification, Chapter 10 26 Scatter Criteria
These criteria use the scatter matrices from multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B, with S_T = S_B + S_W, where S_T is the total scatter matrix. (Note: S_T depends only on the set of samples, not on the partitioning.) Various criteria can then be used to minimize the within-cluster scatter or maximize the between-cluster scatter (definitions below).
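The standard definitions of these matrices, for a partition into clusters D_1, ..., D_c with cluster means m_i, cluster sizes n_i, and total mean m:

```latex
S_W = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i}
        (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^{\mathsf{T}},
\qquad
S_B = \sum_{i=1}^{c} n_i\, (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^{\mathsf{T}},
\qquad
S_T = \sum_{\mathbf{x}} (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^{\mathsf{T}} = S_W + S_B
```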

Pattern Classification, Chapter 10 27

Pattern Classification, Chapter 10 28 The trace criterion (minimize the trace of S_W) is the simplest scalar measure of the scatter matrix: it is proportional to the sum of the variances in the coordinate directions and is, in practice, just the sum-of-squared-error criterion. Since tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results are obtained by maximizing tr[S_B]. However, minimizing the within-cluster criterion J_e = tr[S_W] is equivalent to maximizing the between-cluster criterion tr[S_B] (written out below).
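In trace form, the two equivalent criteria are:

```latex
J_e = \operatorname{tr}[S_W]
    = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} \lVert \mathbf{x} - \mathbf{m}_i \rVert^2,
\qquad
\operatorname{tr}[S_B] = \sum_{i=1}^{c} n_i\, \lVert \mathbf{m}_i - \mathbf{m} \rVert^2
                       = \operatorname{tr}[S_T] - \operatorname{tr}[S_W]
```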

Pattern Classification, Chapter 10 29 Hierarchical Clustering
Often clusters are not disjoint but have subclusters, which in turn have sub-subclusters, and so on. Consider a sequence of partitions of the n samples into c clusters: the first is a partition into n clusters, each containing exactly one sample; the second is a partition into n-1 clusters; the third into n-2; and so on, until the n-th partition, in which a single cluster contains all of the samples. At level k in the sequence, c = n - k + 1.

Pattern Classification, Chapter 10 30 Given any two samples x and x', they will be grouped together at some level, and once they are grouped at level k they remain grouped at all higher levels. Hierarchical clustering is naturally represented by a tree called a dendrogram.

Pattern Classification, Chapter 10 31 The similarity values may help to determine whether the groupings are natural or forced, but if they are evenly distributed, no such information can be gained. Another representation is based on Venn diagrams.

Pattern Classification, Chapter 10 32 Hierarchical clustering procedures can be divided into agglomerative and divisive (a sketch of the former follows).
Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by successively merging clusters.
Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.
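A minimal Python sketch of the agglomerative (bottom-up) procedure. The single-linkage (nearest-neighbor) cluster distance and the toy data are illustrative assumptions; the slide does not fix a particular inter-cluster distance.

```python
import numpy as np

def agglomerative(X, c_target):
    """Merge singleton clusters bottom-up until c_target clusters remain.

    Uses single-linkage: the distance between two clusters is the smallest
    pairwise distance between their members.  Returns a list of index lists.
    """
    clusters = [[i] for i in range(len(X))]          # start with n singleton clusters
    while len(clusters) > c_target:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance between clusters a and b
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])              # merge the two nearest clusters
        del clusters[b]
    return clusters

# Toy example: 6 points forming two obvious groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
print(agglomerative(X, c_target=2))
```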

Pattern Classification, Chapter 10 33 The problem of the number of clusters
Typically, the number of clusters is known. When it is not, there are several ways to proceed. When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc., and compare the resulting criterion values. Another approach is to set a threshold for the creation of a new cluster. These approaches are similar to the model selection procedures typically used to determine the topology and number of states (e.g., clusters, parameters) of a model for a given application.

Pattern Classification, Chapter 10 34 Component Analysis
Combine features to reduce the dimension of the feature space. Linear combinations are simple to compute and analytically tractable: the idea is to project high-dimensional data onto a lower-dimensional space. Two classical approaches for finding an "optimal" linear transformation:
PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense.
MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense (a generalization of Fisher's two-class linear discriminant).

Pattern Classification, Chapter 10 35 PCA (Principal Component Analysis)
The projection that best represents the data in a least-squares sense. The scatter matrix of the cloud of samples is proportional to the maximum-likelihood estimate of the covariance matrix. Unlike a class covariance matrix, however, the scatter matrix includes samples from all classes. The least-squares (maximum-scatter) solution is simply the subspace spanned by the d' < d eigenvectors of the scatter matrix that correspond to its d' largest eigenvalues. Because the scatter matrix is real and symmetric, these eigenvectors are orthogonal.
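A minimal Python sketch along the lines described above: form the scatter matrix of the centered samples, take the eigenvectors belonging to the d' largest eigenvalues, and project. The synthetic data at the bottom is an illustrative assumption.

```python
import numpy as np

def pca_project(X, d_prime):
    """Project the rows of X onto the d' principal components.

    Returns (Y, W): the projected data and the d x d' matrix of
    (orthogonal) eigenvectors used for the projection.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the cloud of samples
    scatter = Xc.T @ Xc                            # scatter matrix (n * ML covariance estimate)
    eigvals, eigvecs = np.linalg.eigh(scatter)     # real symmetric -> orthogonal eigenvectors
    order = np.argsort(eigvals)[::-1][:d_prime]    # indices of the d' largest eigenvalues
    W = eigvecs[:, order]
    return Xc @ W, W

# Illustrative 3-D data that mostly varies along one direction, reduced to 1-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(100, 3))
Y, W = pca_project(X, d_prime=1)
print(Y.shape, W.shape)   # (100, 1) (3, 1)
```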

Pattern Classification, Chapter 10 36 Fisher Linear Discriminant
While PCA seeks directions that are efficient for representation, discriminant analysis seeks directions that are efficient for discrimination (criterion below).
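For two classes, the standard Fisher criterion and its solution are:

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}} S_B\, \mathbf{w}}{\mathbf{w}^{\mathsf{T}} S_W\, \mathbf{w}},
\qquad
\mathbf{w}^{*} = S_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2)
```

where S_W and S_B are the within- and between-class scatter matrices and m_1, m_2 are the class means.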

Pattern Classification, Chapter 10 37 Multiple Discriminant Analysis
A generalization of Fisher's linear discriminant to more than two classes (see below).
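In the multi-class case, the projection matrix W (with at most c - 1 columns) maximizes the ratio of the between-class to the within-class scatter of the projected samples; its columns are the generalized eigenvectors corresponding to the largest eigenvalues:

```latex
J(W) = \frac{\lvert W^{\mathsf{T}} S_B\, W \rvert}{\lvert W^{\mathsf{T}} S_W\, W \rvert},
\qquad
S_B\, \mathbf{w}_i = \lambda_i\, S_W\, \mathbf{w}_i
```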

Pattern Classification, Chapter 10 38 Videos
Seeds are data samples.
Seeds are random points and not data samples.