
1 Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation. Jeremy Tantrum, Department of Statistics, University of Washington; joint work with Alejandro Murua & Werner Stuetzle, Insightful Corporation and University of Washington. This work has been supported by NSA grant 62-1942.

2 Motivating Example Consider clustering documents: the Topic Detection and Tracking corpus contains 15,863 news stories from one year of Reuters and CNN, with 25,000 unique words and possibly many topics. The challenges: large numbers of observations, high dimensions, and many groups.

3 Goal of Clustering Detect that there are 5 or 6 groups, and assign observations to groups.

4 Nonparametric Clustering Premise: observations are sampled from a density p(x); groups correspond to modes of p(x).

5 Nonparametric Clustering Fitting: estimate p(x) nonparametrically and find significant modes of the estimate.

6 Model-Based Clustering Premise: observations are sampled from a mixture density p(x) = Σ_g π_g p_g(x); groups correspond to mixture components.
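The mixture premise can be made concrete by simulating from it: draw a component g with probability π_g, then draw from p_g. A minimal sketch with two illustrative Gaussian components (the weights and parameters below are made up, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(4)
pi = [0.3, 0.7]                      # mixture weights pi_g, sum to 1
mus, sds = [-2.0, 3.0], [1.0, 0.5]   # illustrative component parameters

g = rng.choice(2, size=1000, p=pi)   # latent group membership
x = rng.normal(np.take(mus, g), np.take(sds, g))  # draw from the chosen p_g
```

Clustering then amounts to recovering the latent g (and the component parameters) from x alone.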

7 Model-Based Clustering Fitting: estimate the mixing proportions π_g and the parameters of each component p_g(x).

8 Model-Based Clustering Fitting a mixture of Gaussians: use the EM algorithm to maximize the log likelihood. EM alternates between estimating the probabilities of each observation belonging to each group and maximizing the likelihood given these probabilities, and it requires a good starting point.
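The EM alternation described on the slide can be sketched in a few lines. This is a simplified one-dimensional version for illustration, assuming a given starting point; the function name and setup are mine, not the talk's:

```python
import numpy as np

def em_gmm_1d(x, means, sds, weights, n_iter=50):
    """EM for a 1-D mixture of Gaussians from a given starting point."""
    x = np.asarray(x, dtype=float)
    means, sds, weights = map(np.array, (means, sds, weights))
    for _ in range(n_iter):
        # E-step: probability of each observation belonging to each group
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / sds) ** 2) \
               / (sds * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize the likelihood given these probabilities
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        sds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)
    return means, sds, weights

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
means, sds, weights = em_gmm_1d(x, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

With a reasonable starting point the estimates land near the true means; a poor starting point can leave EM in a bad local maximum, which is why the next slide's hierarchical initialization matters.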

9 Model-Based Clustering Hierarchical clustering provides a good starting point for the EM algorithm. Start with every point being its own cluster, then repeatedly merge the two closest clusters, where closeness is measured by the decrease in likelihood when the two clusters are merged (using the classification likelihood, not the mixture likelihood). The algorithm is quadratic in the number of observations.

10 Likelihood Distance [Figure: merging two overlapping densities p_1(x) and p_2(x) into a single p(x) gives a small decrease in likelihood; merging two well-separated densities gives a big decrease.]
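The likelihood distance pictured above can be computed directly: fit each cluster and the merged cluster by maximum likelihood and take the drop in total log likelihood. A one-dimensional Gaussian sketch (my own illustration, not the paper's code):

```python
import numpy as np

def gauss_loglik(x):
    # Maximized log likelihood of a 1-D Gaussian fit to x (MLE variance).
    n, var = len(x), np.var(x)
    return -0.5 * n * (np.log(2 * np.pi * var) + 1)

def merge_cost(a, b):
    # Decrease in classification log likelihood when clusters a and b merge.
    return gauss_loglik(a) + gauss_loglik(b) - gauss_loglik(np.concatenate([a, b]))

rng = np.random.default_rng(1)
near = merge_cost(rng.normal(0, 1, 100), rng.normal(0.5, 1, 100))
far = merge_cost(rng.normal(0, 1, 100), rng.normal(10, 1, 100))
# Merging similar clusters costs little; merging distant clusters costs a lot.
```

Hierarchical model-based clustering greedily merges the pair with the smallest such cost at each step.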

11 Bayesian Information Criterion Choose the number of clusters by maximizing the Bayesian Information Criterion, BIC = 2 log L − r log n: the log likelihood penalized for complexity, where r is the number of parameters and n is the number of observations.
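A small worked example of how BIC trades fit against complexity. On clearly bimodal data, the gain in log likelihood from a two-component model outweighs the penalty for its extra parameters (the split-at-zero fit below is a deliberate simplification standing in for a real EM fit):

```python
import numpy as np

def bic(loglik, r, n):
    # BIC = 2*log-likelihood - r*log(n); larger is better.
    return 2 * loglik - r * np.log(n)

def norm_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
n = len(x)

# One cluster: r = 2 parameters (mean, sd).
ll1 = norm_logpdf(x, x.mean(), x.std()).sum()

# Two clusters (crudely split at 0): r = 5 parameters (2 means, 2 sds, 1 weight).
lo, hi = x[x < 0], x[x >= 0]
w = len(lo) / n
dens = w * np.exp(norm_logpdf(x, lo.mean(), lo.std())) \
       + (1 - w) * np.exp(norm_logpdf(x, hi.mean(), hi.std()))
ll2 = np.log(dens).sum()
# bic(ll2, 5, n) exceeds bic(ll1, 2, n): BIC picks two clusters here.
```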

12 Fractionation Split the original data of size n into n/M fractions of size M, where M is the largest number of observations for which a hierarchical O(M²) algorithm is computationally feasible. Partition each fraction into αM clusters (meta-observations), with α < 1, giving αn clusters in total; if αn > M, repeat the process on the meta-observations. Fractionation was invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.

13 Fractionation There are αn meta-observations after the first round, α²n after the second, and αⁱn after the i-th. The i-th pass clusters α^(i−1)·n/M fractions at O(M²) operations each, so the total number of operations is nM(1 + α + α² + …) ≤ nM/(1−α): the total running time is linear in n!
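A back-of-the-envelope check of the linearity claim. This toy operation counter (my own sketch, with made-up values of n, M and α) runs the geometric cascade of passes and confirms that doubling n roughly doubles the work:

```python
def fractionation_ops(n, M, alpha):
    """Count O(M^2)-per-fraction operations across all fractionation passes."""
    total, remaining = 0.0, float(n)
    while remaining > M:
        fractions = remaining / M
        total += fractions * M * M      # O(M^2) hierarchical clustering per fraction
        remaining = alpha * remaining   # each fraction yields alpha*M meta-observations
    total += remaining * remaining      # final pass fits in one fraction
    return total

ratio = fractionation_ops(2_000_000, 1000, 0.2) / fractionation_ops(1_000_000, 1000, 0.2)
# ratio is close to 2: cost grows linearly in n, as the nM/(1-alpha) bound predicts.
```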

14 Model-Based Fractionation Use model-based clustering: the meta-observations contain all sufficient statistics (n_i, μ_i, Σ_i), where n_i is the number of observations (size), μ_i is the mean (location), and Σ_i is the covariance matrix (shape and volume).
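Because (n_i, μ_i, Σ_i) are sufficient, two meta-observations can be merged exactly without revisiting the raw data: the merged covariance is the within-cluster scatter plus the shift of each mean from the pooled mean. A sketch of that update (my own formulation, not the paper's implementation):

```python
import numpy as np

def merge_meta(n1, mu1, S1, n2, mu2, S2):
    """Merge two meta-observations (n_i, mu_i, Sigma_i) into one."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    d1, d2 = mu1 - mu, mu2 - mu
    # Pooled MLE covariance: within-cluster scatter + between-mean scatter.
    S = (n1 * (S1 + np.outer(d1, d1)) + n2 * (S2 + np.outer(d2, d2))) / n
    return n, mu, S

rng = np.random.default_rng(3)
a = rng.normal(size=(50, 2))
b = rng.normal(5, 1, size=(60, 2))
n, mu, S = merge_meta(len(a), a.mean(0), np.cov(a.T, bias=True),
                      len(b), b.mean(0), np.cov(b.T, bias=True))
# (n, mu, S) equal the count, mean, and MLE covariance of the pooled data.
```

This is what lets hierarchical model-based clustering run on meta-observations as if they were (weighted) data points.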

15 Model-Based Fractionation An example: 400 observations in 4 groups. [Figures: the observations in each of four fractions; 10 meta-observations from each fraction; the 40 meta-observations together; and the final clusters chosen by BIC.] Success!

16 Example 2 The data: 400 observations in 25 groups. [Figures: the observations in each of four fractions; 10 meta-observations from each fraction; the 40 meta-observations together; and the clusters chosen by BIC.] Fractionation fails!

17 Refractionation Problem: if the number of meta-observations generated from a fraction is less than the number of groups in that fraction, then two or more groups will be merged, and once observations from two groups are merged they can never be split again. Solution: apply fractionation repeatedly, using the meta-observations from the previous pass to create “better” fractions.

18 Example 2 Continued [Figures: the 40 meta-observations; 4 new clusters; 4 new fractions.]

19 Example 2 – Pass 2 [Figures: the observations in the new fractions; the clusters from each of the four fractions; the 40 meta-observations; and the clusters chosen by BIC.]

20 Example 2 – Pass 3 [Figures: the 40 meta-observations of pass 2; 4 new clusters and 4 new fractions; the observations in the new fractions; the clusters from each fraction; the 40 meta-observations; and the clusters chosen by BIC.] Refractionation succeeds!

21 Realistic Example 1100 documents from the TDT corpus, partitioned by people into 19 topics and transformed into a 50-dimensional space using Latent Semantic Indexing. [Figure: projection of the data onto a plane; colors represent topics.]

22 Realistic Example We want to create a dataset with more observations and more groups. Idea: replace each group with a scaled and transformed version of the entire dataset.

24 Realistic Example To measure the similarity of clusters to groups, use the Fowlkes-Mallows index: the geometric average of the probability that 2 randomly chosen observations from the same cluster are in the same group, and the probability that 2 randomly chosen observations from the same group are in the same cluster. An index near 1 means the clusters are good estimates of the groups. Clustering the 1100 documents gives a Fowlkes-Mallows index of 0.76, our “gold standard”.
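The index described above can be computed directly from the two labelings by counting pairs (a sketch; the function name is mine):

```python
import numpy as np

def fowlkes_mallows(groups, clusters):
    """Geometric mean of the two same-pair conditional probabilities."""
    groups, clusters = np.asarray(groups), np.asarray(clusters)
    pairs = lambda counts: (counts * (counts - 1) // 2).sum()
    # Pairs that share BOTH a group and a cluster, via joint counts n_ji.
    joint = {}
    for g, c in zip(groups, clusters):
        joint[(g, c)] = joint.get((g, c), 0) + 1
    same_both = sum(v * (v - 1) // 2 for v in joint.values())
    _, gcounts = np.unique(groups, return_counts=True)
    _, ccounts = np.unique(clusters, return_counts=True)
    return same_both / np.sqrt(pairs(gcounts) * pairs(ccounts))
```

Identical labelings (up to relabeling) score 1; labelings whose pairs never agree score 0.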

25 Realistic Example 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions. Fraction size ≈ 1000 with 100 meta-observations per fraction; 4 passes of fractionation, choosing 361 clusters. Distribution of the number of groups per fraction (n_f = number of fractions):

Pass | Min | Median | Max | n_f
   1 | 270 |    289 | 296 |  20
   2 |   1 |     88 | 150 |  18
   3 |   1 |      9 |  60 |  17
   4 |   1 |      9 |  58 |  16

26 Realistic Example 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions. Fraction size ≈ 1000 with 100 meta-observations per fraction; 4 passes of fractionation, choosing 361 clusters. Purity of the clusters is the sum of the number of groups represented in each cluster; 361 is perfect.

Pass | Fowlkes-Mallows | Purity
   1 |           0.325 |   1729
   2 |           0.554 |    908
   3 |           0.616 |    671
   4 |           0.613 |    651

27 Realistic Example 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions. Fraction size ≈ 1000 with 100 meta-observations per fraction; 4 passes of fractionation, choosing 361 clusters. Refractionation purifies fractions, and successfully deals with the case where the number of groups is greater than αM, the number of meta-observations.

28 Contributions Model-based fractionation extends the fractionation idea to the parametric setting: it incorporates information about the size, shape and volume of clusters, chooses the number of clusters, and is still linear in n. Model-based refractionation extends fractionation to handle larger numbers of groups.

29 Extensions Extend to 100,000s of observations and 1000s of groups (currently the number of groups must be less than M). Extend to a more flexible class of models: with small groups in high dimensions we need a more constrained model (fewer parameters) than the full covariance model, such as a mixture of factor analyzers.


31 Fowlkes-Mallows Index The geometric mean of Pr(2 documents in the same group | they are in the same cluster) and Pr(2 documents in the same cluster | they are in the same group), computed from the contingency table of true groups (rows) against clusters (columns):

Groups \ Clusters |    1 |    2 | … |    I | Total
                1 | n_11 | n_12 | … | n_1I | n_1·
                2 | n_21 | n_22 | … | n_2I | n_2·
                … |    … |    … | … |    … | …
                J | n_J1 | n_J2 | … | n_JI | n_J·
            Total | n_·1 | n_·2 | … | n_·I | n

