1
Bayesian Hierarchical Clustering
Paper by K. Heller and Z. Ghahramani, ICML 2005
Presented by Hao-Wei Yeh
2
Outline
- Background: traditional methods
- Bayesian Hierarchical Clustering (BHC)
  - Basic ideas
  - Dirichlet Process Mixture Model (DPM)
  - Algorithm
- Experiment results
- Conclusion
3
Background: Traditional Methods
4
Hierarchical Clustering
- Given: data points
- Output: a tree (a series of nested clusters)
  - Leaves: data points
  - Internal nodes: nested clusters
- Examples
  - Evolutionary tree of living organisms
  - Internet newsgroups
  - Newswire documents
5
Traditional Hierarchical Clustering
- Bottom-up agglomerative algorithm
- Closeness based on a given distance measure (e.g. Euclidean distance between cluster means)
6
Traditional Hierarchical Clustering (cont’d)
Limitations
- No guide to choosing the correct number of clusters, or where to prune the tree.
- Distance metric selection (especially for data such as images or sequences).
- Evaluation (no probabilistic model):
  - How do we evaluate how good the result is?
  - How do we compare it to other models?
  - How do we make predictions and cluster new data with an existing hierarchy?
7
BHC: Bayesian Hierarchical Clustering
8
Basic ideas
- Use marginal likelihoods to decide which clusters to merge:
  P(data to merge were generated from the same mixture component)
  vs. P(data to merge came from different mixture components)
- Generative model: Dirichlet Process Mixture Model (DPM)
9
Dirichlet Process Mixture Model (DPM)
- Formal definition
- Different perspectives
  - Infinite limit of a finite mixture model (motivation and problems)
  - Stick-breaking process (what a generated distribution looks like)
  - Chinese restaurant process (CRP), Polya urn scheme (a minimal CRP sketch follows below)
- Benefits
  - Conjugate prior
  - Unlimited number of clusters
- “Rich get richer” — does it really work? Depends! (Pitman-Yor process, uniform process, ...)
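To make the “rich get richer” behavior of the CRP concrete, here is a minimal sketch (not from the slides) that samples cluster assignments from a Chinese restaurant process with concentration alpha; the function name and structure are illustrative.

```python
import random

def sample_crp(n_points, alpha, seed=0):
    """Sample table (cluster) assignments from a Chinese restaurant process.

    Each new customer sits at an existing table with probability proportional
    to that table's current size, or opens a new table with probability
    proportional to alpha -- the "rich get richer" effect.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of customers at table k
    assignments = []
    for _ in range(n_points):
        weights = counts + [alpha]                        # existing tables, then "new table"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)                              # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# A larger concentration alpha tends to create more clusters.
print(sample_crp(20, alpha=1.0))
print(sample_crp(20, alpha=10.0))
```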
10
BHC Algorithm - Overview
Same as the traditional approach:
- One-pass, bottom-up method
- Initializes each data point in its own cluster, and iteratively merges pairs of clusters.
Difference:
- Uses a statistical hypothesis test, rather than a distance metric, to choose which clusters to merge (a high-level sketch of this loop follows below).
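A high-level sketch of the greedy agglomerative loop, assuming a `merge_posterior(ti, tj)` function that returns the merge posterior r_k defined on the next slides; the function names and structure are illustrative, not the paper's code.

```python
def bhc_agglomerate(points, make_leaf, make_merged, merge_posterior):
    """One-pass, bottom-up skeleton of the BHC loop.

    make_leaf(x)            -> a tree containing a single data point
    make_merged(ti, tj)     -> the tree obtained by merging ti and tj
    merge_posterior(ti, tj) -> r_k, the posterior probability that the data
                               under ti and tj came from one component
    """
    trees = [make_leaf(x) for x in points]
    while len(trees) > 1:
        # Pick the pair of trees whose merge has the highest posterior probability.
        i, j = max(
            ((a, b) for a in range(len(trees)) for b in range(a + 1, len(trees))),
            key=lambda ab: merge_posterior(trees[ab[0]], trees[ab[1]]),
        )
        merged = make_merged(trees[i], trees[j])
        trees = [t for idx, t in enumerate(trees) if idx not in (i, j)] + [merged]
    return trees[0]
```

This naive pair search recomputes every score each round; caching scores (only pairs involving the newly merged tree change) keeps the number of score evaluations quadratic, matching the O(n²) complexity noted on the limitations slide.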
11
BHC Algorithm - Concepts
Two hypotheses to compare:
1. All the data was generated i.i.d. from the same probabilistic model with unknown parameters.
2. The data has two or more clusters in it.
12
Hypothesis H1
Probability of the data under H1:

  p(D_k | H1) = ∫ p(D_k | θ) p(θ | β) dθ

- p(θ | β): prior over the parameters
- D_k: data in the two trees to be merged
- The integral is tractable with a conjugate prior.
(A worked example for binary data with a Beta prior follows below.)
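As an illustration of the tractable integral, here is a sketch of the closed-form marginal likelihood for binary data with a conjugate Beta(a, b) prior on each Bernoulli parameter; this particular component model is an assumption for the example, not prescribed by the slides.

```python
import numpy as np
from scipy.special import betaln

def bernoulli_beta_marginal_loglik(X, a=1.0, b=1.0):
    """Log marginal likelihood of binary data X (N x D) under the hypothesis
    that all rows come from one product-of-Bernoullis component, with a
    conjugate Beta(a, b) prior on each dimension's parameter:

        p(X | H1) = prod_d  B(a + m_d, b + N - m_d) / B(a, b)

    where m_d is the number of ones in column d and B is the Beta function.
    """
    X = np.asarray(X)
    N, _ = X.shape
    m = X.sum(axis=0)                                  # ones per dimension
    return float(np.sum(betaln(a + m, b + N - m) - betaln(a, b)))

# Tightly clustered binary data scores higher than mixed data.
same = np.array([[1, 1, 0], [1, 1, 0], [1, 1, 0]])
mixed = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
print(bernoulli_beta_marginal_loglik(same), bernoulli_beta_marginal_loglik(mixed))
```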
13
Hypothesis H2
Probability of the data under H2 is a product over the two sub-trees:

  p(D_k | H2) = p(D_i | T_i) p(D_j | T_j)
14
BHC Algorithm - Working Flow
From Bayes’ rule, the posterior probability of the merged hypothesis is

  r_k = π_k p(D_k | H1) / p(D_k | T_k),
  p(D_k | T_k) = π_k p(D_k | H1) + (1 - π_k) p(D_i | T_i) p(D_j | T_j)

- π_k depends on the number of data points and the DPM concentration; p(D_k | H1) depends on the hidden features (the underlying distribution).
- The pair of trees with the highest r_k is merged.
- Natural place to cut the final tree: nodes where r_k < 0.5.
(A sketch of this computation follows below.)
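A minimal sketch of this computation in log space for numerical stability; the function and argument names are illustrative, and log p(D_k | H1) would come from a conjugate component model such as the Beta-Bernoulli example above.

```python
import numpy as np

def merge_posterior(log_pi_k, log_p_h1, log_p_tree_i, log_p_tree_j):
    """Compute log r_k and log p(D_k | T_k) from the quantities on this slide:

        p(D_k | T_k) = pi_k p(D_k | H1) + (1 - pi_k) p(D_i | T_i) p(D_j | T_j)
        r_k          = pi_k p(D_k | H1) / p(D_k | T_k)

    All inputs and outputs are log probabilities.
    """
    log_merged = log_pi_k + log_p_h1
    log_one_minus_pi = np.log1p(-np.exp(log_pi_k))     # log(1 - pi_k), requires pi_k < 1
    log_split = log_one_minus_pi + log_p_tree_i + log_p_tree_j
    log_p_tree_k = np.logaddexp(log_merged, log_split)
    log_r_k = log_merged - log_p_tree_k
    return log_r_k, log_p_tree_k

# The pair with the highest r_k is merged; once the tree is built,
# cutting at nodes where r_k < 0.5 (log_r_k < log 0.5) gives the clustering.
```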
15
Tree-Consistent Partitions
Consider the tree over {1, 2, 3, 4} shown on the slide and all 15 possible partitions of {1, 2, 3, 4}:
(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)
- (1 2)(3)(4) and (1 2 3)(4) are tree-consistent partitions.
- (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.
(A small enumeration sketch follows below.)
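A small sketch that enumerates the tree-consistent partitions of a binary tree by recursion; the tree (((1, 2), 3), 4) used in the example is an assumption, chosen to be consistent with the partitions listed on the slide (the slide's figure is not reproduced here).

```python
def leaves(tree):
    """Set of leaf labels under a tree; a tree is a label or a pair (left, right)."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])

def tree_consistent_partitions(tree):
    """All tree-consistent partitions of the leaves of a binary tree.

    Each internal node either keeps all of its leaves in one block, or combines
    any tree-consistent partition of its left subtree with any of its right subtree.
    """
    if not isinstance(tree, tuple):
        return [[leaves(tree)]]
    left, right = tree
    parts = [[leaves(tree)]]                      # the single-block option
    for pl in tree_consistent_partitions(left):
        for pr in tree_consistent_partitions(right):
            parts.append(pl + pr)
    return parts

# Assumed tree matching the slide's examples: (((1, 2), 3), 4).
for p in tree_consistent_partitions((((1, 2), 3), 4)):
    print(sorted(tuple(sorted(block)) for block in p))
# Prints 4 partitions, including (1 2)(3)(4) and (1 2 3)(4);
# (1)(2 3)(4) and (1 3)(2 4) do not appear.
```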
16
Merged Hypothesis Prior (π_k)
- Based on the DPM (CRP perspective).
- π_k = P(all points in D_k belong to one cluster), given a partition consistent with the tree T_k.
- The d terms account for all tree-consistent partitions and are computed bottom-up along with π_k:
    leaf i:           d_i = α,  π_i = 1
    internal node k:  d_k = α Γ(n_k) + d_i d_j,   π_k = α Γ(n_k) / d_k
  where n_k is the number of points under node k, α is the DPM concentration, and i, j are the children of k.
(A short computational sketch follows below.)
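A sketch of this bottom-up recursion in log space (gammaln avoids overflow for large n_k); the function names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def leaf_prior(alpha):
    """Leaf initialization: d_i = alpha, pi_i = 1 (so log pi_i = 0)."""
    return np.log(alpha), 0.0

def merge_prior(alpha, n_k, log_d_i, log_d_j):
    """Return (log d_k, log pi_k) for an internal node with n_k data points
    and children whose log d values are log_d_i and log_d_j:

        d_k  = alpha * Gamma(n_k) + d_i * d_j
        pi_k = alpha * Gamma(n_k) / d_k
    """
    log_front = np.log(alpha) + gammaln(n_k)          # log(alpha * Gamma(n_k))
    log_d_k = np.logaddexp(log_front, log_d_i + log_d_j)
    log_pi_k = log_front - log_d_k
    return log_d_k, log_pi_k
```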
17
Predictive Distribution
- BHC allows us to define predictive distributions for new data points: p(x | D) is a mixture of the component predictives p(x | D_k) over the nodes of the tree, weighted by the merge posteriors.
- Note: p(x | D) ≠ p(x | D_k, H1) even at the root, because the tree marginal still assigns weight to partitions with more than one cluster.
(A sketch follows below.)
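One way to compute this predictive density, following the paper's construction, is to weight each node's component predictive by ω_k = r_k ∏ (1 - r_i) over k's ancestors i; the node attributes and helper names below are illustrative assumptions.

```python
import numpy as np

def predictive_logdensity(x, root, comp_logpdf):
    """Sketch of log p(x | D): mix each node's component predictive p(x | D_k)
    with weight w_k = r_k * prod over ancestors i of (1 - r_i).

    Assumed (illustrative) node attributes:
      node.log_r      log r_k at that node (0.0, i.e. r_k = 1, for leaves)
      node.children   [] for leaves, [left, right] otherwise
      node.data       the data points under the node
    comp_logpdf(x, data) is the component's posterior predictive log-density.
    """
    log_terms = []

    def walk(node, log_path):            # log_path = sum of log(1 - r_i) over ancestors
        log_terms.append(log_path + node.log_r + comp_logpdf(x, node.data))
        if node.children:
            log_one_minus_r = np.log1p(-np.exp(node.log_r)) if node.log_r < 0 else -np.inf
            for child in node.children:
                walk(child, log_path + log_one_minus_r)

    walk(root, 0.0)
    return np.logaddexp.reduce(log_terms)
```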
18
Approximate Inference for the DPM
- BHC forms a lower bound on the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions (the tree-consistent ones).
- Idea: deterministically sum over the partitions with high probability, thereby accounting for most of the mass.
- Compared to MCMC methods, this is deterministic and efficient.
19
Learning Hyperparameters
- α: concentration parameter of the DPM.
- β: defines the base distribution G0 (the prior over component parameters).
- Learned by recursive gradients and an EM-like method.
20
To Sum Up for BHC
- A statistical model for comparing merges, which also decides when to stop.
- Allows us to define predictive distributions for new data points.
- Provides approximate inference for the DPM marginal likelihood.
- Hyperparameters: α (concentration parameter), β (defines the base distribution G0).
21
Unique Aspects of the BHC Algorithm
- A hierarchical way of organizing nested clusters, not a hierarchical generative model; it is derived from the (flat) DPM.
- Hypothesis test: one cluster vs. many other clusterings (rather than one vs. two clusters at each stage).
- Not iterative and does not require sampling (except for learning the hyperparameters).
22
Results from the experiments
28
Conclusion and some take-home notes
29
Conclusion
BHC addresses the limitations of traditional hierarchical clustering:
- No guide to choosing the correct number of clusters or where to prune the tree → a natural stopping criterion (cut where r_k < 0.5).
- Distance metric selection → a model-based criterion (marginal likelihoods).
- Evaluation, comparison, and inference → a full probabilistic model.
- Along the way, it also yields some useful results for the DPM (a lower bound on its marginal likelihood).
Solved!
30
Summary
- Defines a probabilistic model of the data; can compute the probability of a new data point belonging to any cluster in the tree.
- Model-based criterion to decide on merging clusters.
- Bayesian hypothesis testing is used to decide which merges are advantageous, and to decide the appropriate depth of the tree.
- The algorithm can be interpreted as an approximate inference method for a DPM; it gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data.
31
Limitations
- Inherent greediness of the agglomerative procedure.
- No incorporation of tree uncertainty.
- O(n²) complexity for building the tree.
32
References
Main paper: K. Heller and Z. Ghahramani, Bayesian Hierarchical Clustering, ICML 2005.
Thesis: Katherine Ann Heller, Efficient Bayesian Methods for Clustering.
Other references:
- Wikipedia
- Paper slides:
  - www.ee.duke.edu/~lcarin/emag/.../DW_PD_100705.ppt
  - http://cs.brown.edu/courses/csci2950-p/fall2011/lectures/2011-10-13_ghosh.pdf
- General ML: http://blog.echen.me/
33
References (cont’d)
DPM & nonparametric Bayes:
- http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt
- https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf
- http://www.iro.umontreal.ca/~lisa/seminaires/31-10-2006.pdf
- http://videolectures.net/mlss07_teh_dp/
- http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf
- http://www.cns.nyu.edu/~eorhan/notes/dpmm.pdf (easy to read)
- http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
Heavier reading:
- http://stat.columbia.edu/~porbanz/reports/OrbanzTeh2010.pdf
- http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf
- http://www.stat.uchicago.edu/~pmcc/reports/clusters.pdf
Hierarchical DPM:
- http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
Other methods:
- https://people.cs.umass.edu/~wallach/publications/wallach10alternative.pdf
34
Thank You for Your Attention!