
1 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, with the permission of the authors and the publisher

2 Introduction
Previously, all our training samples were labeled; those samples were said to be "supervised". We now investigate "unsupervised" procedures that use unlabeled samples. There are at least five reasons for this:
- Collecting and labeling a large set of samples can be costly
- We can train with large amounts of (less expensive) unlabeled data and only then use supervision to label the groupings found; this is appropriate for large "data mining" applications
- This is also appropriate in many applications where the characteristics of the patterns change slowly with time
- We can use unsupervised methods to identify features that will then be useful for categorization
- We gain some insight into the nature (or structure) of the data
Pattern Classification, Chapter 10

3 Mixture Densities and Identifiability
We begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector. We make the following assumptions:
- The samples come from a known number c of classes
- The prior probabilities P(ωj) for each class are known, j = 1, …, c
- The class-conditional densities p(x | ωj, θj), j = 1, …, c, are known
- The values of the c parameter vectors θ1, θ2, …, θc are unknown
- The category labels are unknown
Pattern Classification, Chapter 10

4 This density function is called a mixture density
Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ. Once θ is known, we can decompose the mixture into its components and use a maximum a posteriori (MAP) classifier on the derived densities. Pattern Classification, Chapter 10
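The slide's equation image is not in the transcript; as a reconstruction in the textbook's notation, the mixture density referred to here is

$$
p(\mathbf{x} \mid \boldsymbol{\theta}) \;=\; \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j),
\qquad \boldsymbol{\theta} = (\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_c),
$$

where the P(ωj) are the known mixing parameters (priors) and the p(x | ωj, θj) are the component densities. Once an estimate $\hat{\boldsymbol{\theta}}$ is available, the MAP classifier assigns x to the class ωi with the largest posterior

$$
P(\omega_i \mid \mathbf{x}, \hat{\boldsymbol{\theta}}) \;=\;
\frac{p(\mathbf{x} \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, P(\omega_i)}{p(\mathbf{x} \mid \hat{\boldsymbol{\theta}})}.
$$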

5 Maximum-Likelihood Estimates
Suppose that we have a set D = {x1, …, xn} of n unlabeled samples drawn independently from the mixture density, where θ is fixed but unknown. To estimate θ, take the gradient of the log-likelihood with respect to θi and set it to zero. Pattern Classification, Chapter 10
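The equations the slide refers to are not reproduced in the transcript; as a reconstruction in the textbook's notation, the log-likelihood of the sample set and the resulting gradient condition are

$$
l(\boldsymbol{\theta}) \;=\; \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta}),
\qquad
\nabla_{\boldsymbol{\theta}_i}\, l
\;=\; \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\,
\nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) \;=\; \mathbf{0},
$$

where the posterior is $P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, P(\omega_i) \,/\, p(\mathbf{x}_k \mid \hat{\boldsymbol{\theta}})$. The maximum-likelihood estimates $\hat{\boldsymbol{\theta}}_i$ must satisfy these conditions.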

6 Applications to Normal Mixtures
p(x | i, i) ~ N(i, i) Case 1 = Simplest case Case i i P(i) c 1 ? x 2 3 x = known ? = unknown Open your presentation with an attention-getting incident. Choose an incident your audience relates to. The incidence is the evidence that supports the action and proves the benefit. Beginning with a motivational incident prepares your audience for the action step that follows. Pattern Classification, Chapter 10

7 Case 1: Unknown mean vectors
This "simplest" case is not easy, and the textbook obtains an iterative gradient-ascent (hill-climbing) procedure to maximize the log-likelihood function. Pattern Classification, Chapter 10
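A minimal sketch of that iteration, assuming the priors P(ωi) and covariances Σi are known and only the means are estimated: each mean is re-estimated as a posterior-weighted average of the samples. The function name and arguments are illustrative, not the book's, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_means(X, priors, covs, mu0, n_iter=50):
    """Case 1 sketch: re-estimate each unknown mean as the average of the
    samples weighted by the posterior P(omega_i | x_k) under the current means."""
    mus = np.array(mu0, dtype=float)                        # shape (c, d)
    for _ in range(n_iter):
        # prior times class-conditional density, one row per class: shape (c, n)
        dens = np.stack([priors[i] * multivariate_normal.pdf(X, mus[i], covs[i])
                         for i in range(len(priors))])
        post = dens / dens.sum(axis=0, keepdims=True)       # posteriors P(omega_i | x_k)
        mus = (post @ X) / post.sum(axis=1, keepdims=True)  # posterior-weighted sample means
    return mus
```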

8 k-Means Clustering
- A popular approximation method for estimating the c mean vectors μ1, μ2, …, μc
- Replace the squared Mahalanobis distance by the squared Euclidean distance
- Find the mean μ̂m nearest to xk and approximate the posterior P̂(ωi | xk, μ̂) as 1 for i = m and 0 otherwise
- Use the iterative scheme to find μ̂1, μ̂2, …, μ̂c
- The number of iterations is usually much less than the number of samples
Pattern Classification, Chapter 10

9 If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:
begin  initialize n, c, μ1, μ2, …, μc (randomly selected)
    do  classify the n samples according to the nearest μi
        recompute μi
    until no change in μi
    return μ1, μ2, …, μc
end
Pattern Classification, Chapter 10
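A minimal NumPy sketch of this loop (initialization by drawing c distinct samples at random; names are illustrative, and X is assumed to be an (n, d) array):

```python
import numpy as np

def k_means(X, c, seed=0):
    """Minimal k-means sketch following the loop above."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=c, replace=False)].astype(float)  # random initial means
    while True:
        # classify each sample according to the nearest mean (squared Euclidean distance)
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)     # shape (n, c)
        labels = d2.argmin(axis=1)
        # recompute each mean from the samples assigned to it
        new_mus = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mus[i]
                            for i in range(c)])
        if np.allclose(new_mus, mus):     # until no change in the means
            return new_mus, labels
        mus = new_mus
```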

10 Three-class example – convergence in three iterations
Pattern Classification, Chapter 10

11 Scaling for unit variance may be undesirable
Pattern Classification, Chapter 10

12 Hierarchical Clustering
Many times, clusters are not disjoint but may have subclusters, which in turn have sub-subclusters, etc. Consider a sequence of partitions of the n samples into c clusters:
- The first is a partition into n clusters, each one containing exactly one sample
- The second is a partition into n−1 clusters, the third into n−2, and so on, until the n-th, in which there is only one cluster containing all of the samples
- At level k in the sequence, c = n−k+1
Pattern Classification, Chapter 10

13 Hierarchical clustering → tree called a dendrogram
Given any two samples x and x′, they will be grouped together at some level, and if they are grouped at level k, they remain grouped at all higher levels. The resulting hierarchy can be represented by a tree called a dendrogram. Pattern Classification, Chapter 10
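A quick sketch of drawing such a dendrogram, assuming SciPy and Matplotlib are available; the toy data and the 'single' linkage choice are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))  # toy 2-D samples
Z = linkage(X, method='single')                    # agglomerative merge sequence
dendrogram(Z)                                      # the tree: leaves are samples, height = merge level
plt.xlabel('sample index')
plt.ylabel('distance at which clusters merge')
plt.show()
```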

14 Another representation is based on Venn diagrams
The similarity values may help to determine whether the groupings are natural or forced, but if they are evenly distributed, no information can be gained. Pattern Classification, Chapter 10

15 Hierarchical clustering can be divided into agglomerative and divisive approaches
- Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by successively merging clusters, as in the sketch below
- Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters
Pattern Classification, Chapter 10
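A minimal sketch of the agglomerative (bottom-up) procedure, recording the whole sequence of partitions; the nearest-means merge criterion used here is an illustrative choice, not the only one the textbook discusses:

```python
import numpy as np

def agglomerate(X):
    """Start with n singleton clusters and repeatedly merge the two nearest
    clusters (nearest cluster means), keeping every intermediate partition."""
    clusters = [[i] for i in range(len(X))]              # partition into n singletons
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        means = [X[c].mean(axis=0) for c in clusters]
        best, best_d = None, np.inf
        for i in range(len(clusters)):                   # find the closest pair of clusters
            for j in range(i + 1, len(clusters)):
                d = np.sum((means[i] - means[j]) ** 2)
                if d < best_d:
                    best, best_d = (i, j), d
        i, j = best
        clusters[i] = clusters[i] + clusters[j]          # merge the pair
        del clusters[j]
        partitions.append([list(c) for c in clusters])
    return partitions                                    # one partition per level
```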

16 The problem of the number of clusters
Typically, the number of clusters is known. When it is not, there are several ways to proceed:
- When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc., and examine how the criterion changes (see the sketch below)
- Another approach is to state a threshold for the creation of a new cluster
These approaches are similar to model-selection procedures, typically used to determine the topology and number of states (e.g., clusters, parameters) of a model, given a specific application.
Pattern Classification, Chapter 10
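A short sketch of the first approach, assuming scikit-learn is available: run k-means for several values of c and compare the criterion value (here the sum of squared distances to the nearest mean, exposed as inertia_); the toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data

# Repeat the clustering with c = 1, 2, 3, ... and record the criterion value.
# A pronounced "elbow" in J as c increases is one informal way to choose c.
for c in range(1, 7):
    J = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X).inertia_
    print(f"c = {c}:  criterion J = {J:.1f}")
```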

17 k-Means Clustering Videos
Seeds are data samples
Dynamic examples with a large number of points
Pattern Classification, Chapter 10

