Bayesian Hierarchical Clustering
Paper by K. Heller and Z. Ghahramani, ICML 2005
Presented by David Williams, Paper Discussion Group
Outline
– Traditional Hierarchical Clustering
– Bayesian Hierarchical Clustering
  – Algorithm
  – Results
– Potential Application
Hierarchical Clustering
Given a set of data points, the output is a tree:
– Leaves are the data points
– Internal nodes are nested clusters
Examples:
– Evolutionary tree of living organisms
– Internet newsgroups
– Newswire documents
Traditional Hierarchical Clustering
Bottom-up agglomerative algorithm:
– Begin with each data point in its own cluster
– Iteratively merge the two “closest” clusters
– Stop when a single cluster remains
– Closeness is based on a given distance measure (e.g., Euclidean distance between cluster means)
Limitations:
– No guide to choosing the “correct” number of clusters, or where to prune the tree
– The distance metric must be selected by hand (especially hard for data such as images or sequences)
– No principled way to evaluate how good a result is, to compare it to other models, or to make predictions and cluster new data with an existing hierarchy
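Below is a minimal sketch of the traditional bottom-up procedure using SciPy's agglomerative clustering; the toy dataset, the linkage choice, and the hand-picked cluster count t=3 are all illustrative assumptions. The need to supply t by hand is exactly the pruning limitation noted above.

```python
# Minimal sketch of traditional agglomerative clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data (an assumption): three well-separated 2-D Gaussian blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])

# Bottom-up merging; "average" linkage measures closeness by the mean
# pairwise Euclidean distance between clusters.
Z = linkage(X, method="average", metric="euclidean")

# The tree itself gives no guidance on where to cut: the number of
# clusters (t=3) must be supplied by hand.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```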
Bayesian Hierarchical Clustering (BHC)
Basic idea:
– Use marginal likelihoods to decide which clusters to merge
– Ask what the probability is that all the data in a potential merge were generated from the same mixture component, compared against the exponentially many alternative clusterings at lower levels of the tree
– The generative model used is a Dirichlet Process Mixture model (DPM)
BHC Algorithm Overview
– One-pass, bottom-up method
– Initializes each data point in its own cluster, then iteratively merges pairs of clusters
– Uses a statistical hypothesis test to choose which clusters to merge
– At each stage, the algorithm considers merging all pairs of existing trees
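As a rough illustration, here is one way the merge loop could look in Python. The Node class and merge_score function are hypothetical placeholders, not the authors' implementation; merge_score stands in for the posterior merge probability r_k developed on the next slides.

```python
# High-level sketch of the BHC merge loop. Node and merge_score are
# hypothetical placeholders: merge_score(t_i, t_j) would return the
# posterior probability r_k that trees t_i and t_j form one cluster.
from itertools import combinations

def bhc(points, merge_score, Node):
    # Start with every data point in its own (leaf) tree.
    trees = [Node(leaf=p) for p in points]
    while len(trees) > 1:
        # Consider merging every pair of existing trees and greedily
        # take the pair with the highest merge probability.
        (i, j), r = max(
            ((pair, merge_score(trees[pair[0]], trees[pair[1]]))
             for pair in combinations(range(len(trees)), 2)),
            key=lambda item: item[1],
        )
        merged = Node(left=trees[i], right=trees[j], r=r)
        trees = [t for k, t in enumerate(trees) if k not in (i, j)]
        trees.append(merged)
    return trees[0]  # root of the final hierarchy
```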
BHC Algorithm: Merging
Two hypotheses are compared:
– H1: all of the data in the pair of trees to be merged were generated i.i.d. from the same probabilistic model with unknown parameters (e.g., a Gaussian)
– H2: the data in the pair of trees contain two or more clusters
Hypothesis H1
Probability of the data under H1, integrating over the prior $p(\theta \mid \beta)$ on the parameters:
$$p(D_k \mid H_1^k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta$$
where $D_k$ is the data in the two trees to be merged. The integral is tractable when a conjugate prior is employed.
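As a concrete instance of the tractable conjugate case, here is the closed-form log marginal likelihood of binary data under a Bernoulli model with a Beta prior (the paper's experiments use conjugate models of this kind; the hyperparameters a and b below are illustrative assumptions).

```python
# Worked example of the conjugate integral: log p(D | H1) for binary
# data under a per-dimension Bernoulli model with a Beta(a, b) prior.
import numpy as np
from scipy.special import betaln

def log_marginal_bernoulli(D, a=1.0, b=1.0):
    """log p(D | H1) for an (N, d) binary array D, integrating out
    each dimension's Bernoulli parameter against Beta(a, b)."""
    N = D.shape[0]
    heads = D.sum(axis=0)  # per-dimension counts of ones
    # Beta-Bernoulli: B(a + heads, b + N - heads) / B(a, b) per dimension.
    return np.sum(betaln(a + heads, b + N - heads) - betaln(a, b))

D = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1]])
print(log_marginal_bernoulli(D))
```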
Hypothesis H2
Probability of the data under H2 is a product over the two sub-trees:
$$p(D_k \mid H_2^k) = p(D_i \mid T_i)\, p(D_j \mid T_j)$$
Prior that all points belong to one cluster: $\pi_k = p(H_1^k)$
Probability of the data in tree $T_k$ (mixing the two hypotheses):
$$p(D_k \mid T_k) = \pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)$$
Merging Clusters
From Bayes' rule, the posterior probability of the merged hypothesis is
$$r_k = \frac{\pi_k\, p(D_k \mid H_1^k)}{p(D_k \mid T_k)}$$
The pair of trees with the highest $r_k$ is merged. Natural place to cut the final tree: where $r_k < 0.5$.
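A small sketch of how r_k could be computed stably in log space, assuming the log inputs (log π_k, log p(D_k|H1), and the two subtree marginals) are already available; the function name and example values are hypothetical.

```python
# Hedged sketch: the merge posterior r_k computed in log space.
import numpy as np
from scipy.special import logsumexp

def log_merge_posterior(log_pi, log_p_h1, log_p_left, log_p_right):
    # log p(D_k | T_k): logsumexp over the two weighted hypotheses.
    log_p_tree = logsumexp([
        log_pi + log_p_h1,
        np.log1p(-np.exp(log_pi)) + log_p_left + log_p_right,
    ])
    # log r_k = log pi_k + log p(D_k | H1) - log p(D_k | T_k)
    return (log_pi + log_p_h1) - log_p_tree, log_p_tree

log_r, _ = log_merge_posterior(np.log(0.5), -10.0, -6.0, -7.0)
print(np.exp(log_r))  # merge is favored when r_k > 0.5
```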
Dirichlet Process Mixture Models (DPMs)
– The probability that a new data point joins an existing cluster is proportional to the number of points already in that cluster
– The concentration parameter α controls the probability of the new point creating a new cluster
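This sequential "rich get richer" rule is the Chinese restaurant process view of the DPM prior; a small simulation of it, with an illustrative α, might look like this.

```python
# CRP view of the DPM prior: point n joins existing cluster c with
# probability n_c / (n - 1 + alpha) and starts a new cluster with
# probability alpha / (n - 1 + alpha).
import numpy as np

def crp_assignments(n_points, alpha, seed=0):
    rng = np.random.default_rng(seed)
    counts = []       # points per cluster so far
    assignments = []
    for n in range(n_points):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= n + alpha        # normalize over existing + new cluster
        c = rng.choice(len(probs), p=probs)
        if c == len(counts):
            counts.append(1)      # new cluster created
        else:
            counts[c] += 1
        assignments.append(c)
    return assignments

print(crp_assignments(10, alpha=1.0))
```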
Merged Hypothesis Prior
A DPM with concentration α defines a prior on all partitions of the $n_k$ data points in $D_k$. The prior on the merged hypothesis, $\pi_k$, is the relative mass of the partition in which all $n_k$ points belong to one cluster versus all other partitions of those $n_k$ points consistent with the tree structure.
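The paper computes π_k bottom-up alongside the tree: leaves start with d_i = α and π_i = 1, and a merge of subtrees i and j with n_k total points sets d_k = αΓ(n_k) + d_i d_j and π_k = αΓ(n_k)/d_k. A log-space sketch of that recursion (function name assumed) is below.

```python
# Log-space sketch of the paper's pi_k recursion, avoiding overflow
# in Gamma(n_k) for large subtrees.
import numpy as np
from scipy.special import gammaln, logsumexp

def log_pi_k(alpha, n_k, log_d_left, log_d_right):
    """Return (log pi_k, log d_k) for a merge of two subtrees."""
    log_term = np.log(alpha) + gammaln(n_k)  # log(alpha * Gamma(n_k))
    log_d_k = logsumexp([log_term, log_d_left + log_d_right])
    return log_term - log_d_k, log_d_k

# Leaves have d = alpha; merging two singletons (n_k = 2, alpha = 1)
# gives pi_k = 0.5, i.e. equal prior mass on "one cluster" vs. "two".
print(np.exp(log_pi_k(1.0, 2, np.log(1.0), np.log(1.0))[0]))
```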
DPM
The other quantities needed for the posterior merge probabilities can also be written and computed under the DPM (see the derivations and proofs in the paper).
Results
Some sample results…
Unique Aspects of the Algorithm
– A hierarchical way of organizing nested clusters, not a hierarchical generative model
– Derived from DPMs
– The hypothesis test is not one vs. two clusters at each stage; it is one cluster vs. many other clusterings consistent with the sub-trees
– Not iterative and does not require sampling
Summary
– Defines a probabilistic model of the data and can compute the probability of a new data point belonging to any cluster in the tree
– Model-based criterion for deciding which clusters to merge
– Bayesian hypothesis testing is used to decide which merges are advantageous and to decide the appropriate depth of the tree
– The algorithm can be interpreted as an approximate inference method for a DPM, and gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data
Why This Paper?
Mixed-type data problems: both continuous and discrete features. How to perform density estimation?
– One way: partition the continuous data into groups determined by the values of the discrete features
– Problem: the number of groups grows quickly (e.g., 5 discrete features, each of which can take 4 values, gives 4^5 = 1024 groups)
– How to determine which groups should be combined to reduce the total number of groups?
– Possible solution: the idea in this paper, except that rather than the leaves being individual data points, they would be groups of data points as determined by the discrete feature values