Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation Advisor: Dr. Hsu Graduate: You-Cheng Chen Authors: Jeremy Tantrum, Alejandro Murua, Werner Stuetzle
Outline Motivation, Objective, Introduction, Model-based Fractionation, Model-based ReFractionation, Example, Conclusions, Personal Opinion
Motivation Propose an extended method that improves the performance of model-based clustering and apply it to large datasets.
Objective Apply Fractionation and Refractionation to model-based clustering.
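Introduction Model-based clustering in a nutshell Sample: the observations are drawn from a mixture density $p(x) = \sum_{g=1}^{G} \pi_g \, p_g(x)$, where $p_g$ is the density modeling group g and $\pi_g$ is the prior probability that a randomly chosen observation belongs to group g.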
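The mixture model on this slide can be illustrated with a minimal sketch. It assumes univariate Gaussian group densities; the function names and the two-group parameter values are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x, priors, means, sds):
    """Mixture density: sum over groups of prior * group density."""
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(priors, means, sds))

def posterior(x, priors, means, sds):
    """Probability that x belongs to each group, by Bayes' rule."""
    weighted = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, means, sds)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-group example (illustrative values only).
priors, means, sds = [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]
post = posterior(0.5, priors, means, sds)
```

An observation near 0 is assigned almost entirely to the first group even though that group has the smaller prior, because the second group's density is negligible there.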
Introduction Model-based clustering in a nutshell We can use the Approximate Weight of Evidence (AWE) to estimate the number of groups.
Introduction Previous work on model-based clustering for large datasets The Scalable EM (SEM) algorithm can be used to fit mixture models to large datasets, but it cannot estimate the number of groups. The simplest and potentially fastest approach is to draw a sample of the data.
2. Fractionation Original Fractionation algorithm
1. Split the data into fractions of size M.
2. Cluster each fraction into a fixed number $\alpha M$ of clusters, where $\alpha < 1$. Summarize each cluster by its mean. We refer to these cluster means as meta-observations.
3. If the total number of meta-observations is greater than M, return to step 1, treating the meta-observations as the data.
4. Cluster the meta-observations into G clusters.
5. Assign each individual observation to the cluster with the closest mean.
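The five steps above can be sketched as follows. This is a rough illustration, not the authors' implementation: the slide does not say which clustering routine is used inside each fraction, so a naive centroid-merging agglomerative routine (`cluster_means`, a hypothetical helper) stands in for it, fractions are taken in data order, and `alpha` is assumed to be the fraction of M kept as clusters.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_means(points, k):
    """Stand-in clusterer: merge the two clusters with the closest means
    until k clusters remain, then return the k cluster means."""
    clusters = [(tuple(p), 1) for p in points]  # (mean, size)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (mi, ni), (mj, nj) = clusters[i], clusters[j]
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(mi, mj))
        clusters[i] = (merged, ni + nj)
        del clusters[j]  # j > i, so index i is unaffected
    return [mean for mean, _ in clusters]

def fractionate(data, M, alpha, G):
    """Steps 1-5 of Fractionation; returns (cluster means, labels)."""
    assert 0 < alpha < 1 and M >= 2
    per_fraction = max(1, min(M - 1, round(alpha * M)))
    current = list(data)
    while True:
        metas = []
        for start in range(0, len(current), M):          # step 1
            fraction = current[start:start + M]
            k = min(per_fraction, len(fraction))
            metas.extend(cluster_means(fraction, k))     # step 2
        if len(metas) <= M:                              # step 3
            break
        current = metas
    centers = cluster_means(metas, G)                    # step 4
    labels = [min(range(len(centers)),
                  key=lambda c: sq_dist(x, centers[c]))  # step 5
              for x in data]
    return centers, labels

# Hypothetical data: two well-separated 1-D groups, interleaved.
data = [((i % 2) * 10 + 0.01 * (i // 2),) for i in range(40)]
centers, labels = fractionate(data, M=10, alpha=0.3, G=2)
```

With these settings the first pass yields 12 meta-observations, which exceeds M = 10, so a second pass is needed before the final clustering into G = 2 clusters.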
Model-based Fractionation Main differences: In model-based Fractionation, each cluster is represented by its sufficient statistics: the mean, the covariance, and the number of observations. AWE is used to determine the number of clusters.
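Keeping the mean, covariance, and count for each cluster means two clusters can be merged without revisiting the raw observations. The sketch below shows that pooling identity; it assumes the maximum-likelihood covariance (dividing by n), and the names `summarize` and `merge` are hypothetical, not from the paper.

```python
def summarize(points):
    """Return (n, mean, cov) for a list of d-dimensional tuples.
    The covariance divides by n (maximum-likelihood form)."""
    n = len(points)
    d = len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(d)] for a in range(d)]
    return n, mean, cov

def merge(s1, s2):
    """Combine two cluster summaries into the summary of their union."""
    n1, m1, c1 = s1
    n2, m2, c2 = s2
    n = n1 + n2
    d = len(m1)
    mean = [(n1 * m1[k] + n2 * m2[k]) / n for k in range(d)]
    diff = [m1[k] - m2[k] for k in range(d)]
    # Pooled scatter = within-cluster scatter + between-means term.
    cov = [[(n1 * c1[a][b] + n2 * c2[a][b]
             + (n1 * n2 / n) * diff[a] * diff[b]) / n
            for b in range(d)] for a in range(d)]
    return n, mean, cov

# Hypothetical points, used only to check the identity.
A = [(0.0, 1.0), (2.0, 3.0), (1.0, 5.0)]
B = [(4.0, 0.0), (6.0, 2.0)]
merged = merge(summarize(A), summarize(B))
direct = summarize(A + B)
```

The merged summary agrees with the summary computed directly from the pooled points, which is what allows meta-observations to stand in for their members in later passes.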
3. Model-based ReFractionation Step 4 of the Fractionation algorithm is replaced by steps 4a and 4b:
4a. Cluster the meta-observations into G clusters, where G is determined by the AWE criterion.
4b. Define the fractions for the i-th pass.
3.1 Illustration M = 100, 4 fractions, 40 meta-observations
3.1 Illustration Step 4a: use AWE to find G = 25. Step 4b
3.1 Illustration Second pass
3.1 Illustration 2nd pass, 3rd pass
3.2 Scope of (Re)Fractionation Let $n_g$ be the number of groups in the data, $n_f$ the number of fractions, and $n_c$ the number of clusters generated from each fraction. Step 2: if $n_g > n_c$, this will lead to impure clusters.
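The impurity condition is a counting argument: if a fraction contains observations from more groups than it is given clusters, at least one cluster must mix groups. A small sketch, with a hypothetical helper `impure_clusters` and made-up labels, makes this concrete.

```python
from collections import defaultdict

def impure_clusters(groups, clusters):
    """Return the sorted ids of clusters that contain more than one group."""
    members = defaultdict(set)
    for g, c in zip(groups, clusters):
        members[c].add(g)
    return sorted(c for c, gs in members.items() if len(gs) > 1)

# Hypothetical fraction with three groups.
groups = [0, 0, 1, 1, 2, 2]
too_few = [0, 0, 1, 1, 1, 1]   # only two clusters: groups 1 and 2 are mixed
enough = [0, 0, 1, 1, 2, 2]    # three clusters: every cluster can be pure

mixed = impure_clusters(groups, too_few)
clean = impure_clusters(groups, enough)
```

Having enough clusters does not guarantee purity, but having too few rules it out.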
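4. Example 4.1 Measuring the agreement between groups and clusters Fowlkes-Mallows index: $FM = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$, where TP counts the pairs of observations that share both a group and a cluster, FP the pairs that share a cluster but not a group, and FN the pairs that share a group but not a cluster.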
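The Fowlkes-Mallows index can be computed by counting pairs of observations, using its standard pair-counting definition. The sketch below is a direct quadratic-time illustration; the function name and the example labels are hypothetical.

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(groups, clusters):
    """Fowlkes-Mallows index from true group labels and cluster labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(groups)), 2):
        same_group = groups[i] == groups[j]
        same_cluster = clusters[i] == clusters[j]
        if same_group and same_cluster:
            tp += 1      # pair kept together correctly
        elif same_cluster:
            fp += 1      # pair joined although the groups differ
        elif same_group:
            fn += 1      # pair split although the group is shared
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))

# Hypothetical labelings of four observations.
perfect = fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0])
partial = fowlkes_mallows([0, 0, 1, 1], [0, 0, 0, 1])
```

The index depends only on which observations are grouped together, so relabeling the clusters (as in the first call) does not change it.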
4.3 Example 1 Groups = 19, n = 22000, M = 1000, clusters = 100
4.3 Example 3 Groups = 361, n = 20900, M = 1045, clusters = 100
Conclusions We can study the performance of the AWE criterion for estimating the number of groups in a mixture of factor analyzers model.
Personal Opinion We can draw on the advantages of other clustering methods to compensate for the weaknesses of our own.