1
Bayesian Hierarchical Clustering
Paper by K. Heller and Z. Ghahramani, ICML 2005
Presented by Hao-Wei Yeh
2
Outline
- Background: traditional methods
- Bayesian Hierarchical Clustering (BHC)
  - Basic ideas
  - Dirichlet Process Mixture Model (DPM)
  - Algorithm
- Experiment results
- Conclusion
3
Background: Traditional Methods
4
Hierarchical Clustering
- Given: data points
- Output: a tree (a series of nested clusters)
  - Leaves: data points
  - Internal nodes: nested clusters
- Examples
  - Evolutionary tree of living organisms
  - Internet newsgroups
  - Newswire documents
5
Traditional Hierarchical Clustering
- Bottom-up agglomerative algorithm
- Closeness based on a given distance measure (e.g. Euclidean distance between cluster means)
6
Traditional Hierarchical Clustering (cont’d)
Limitations
- No guide to choosing the correct number of clusters, or where to prune the tree.
- Distance metric selection (especially for data such as images or sequences).
- Evaluation (no probabilistic model):
  - How do we evaluate how good the result is?
  - How do we compare it to other models?
  - How do we make predictions and cluster new data with an existing hierarchy?
7
BHC: Bayesian Hierarchical Clustering
8
Basic ideas
- Use marginal likelihoods to decide which clusters to merge:
  P(data to merge were generated from the same mixture component)
  vs. P(data to merge came from different mixture components)
- Generative model: Dirichlet Process Mixture Model (DPM)
9
Dirichlet Process Mixture Model (DPM)
- Formal definition
- Different perspectives
  - Infinite limit of a finite mixture model (motivation and problems)
  - Stick-breaking process (what a generated distribution looks like)
  - Chinese restaurant process (CRP), Polya urn scheme (a minimal CRP sketch follows below)
- Benefits
  - Conjugate prior
  - Unlimited number of clusters
- “Rich get richer” — does it really work? Depends! (Pitman-Yor process, uniform process, ...)
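To make the “rich get richer” behavior of the CRP concrete, here is a minimal sketch (not from the slides) that samples cluster assignments from a Chinese restaurant process with concentration alpha; the function name and structure are illustrative.

```python
import random

def sample_crp(n_points, alpha, seed=0):
    """Sample table (cluster) assignments from a Chinese restaurant process.

    Each new customer sits at an existing table with probability proportional
    to that table's current size, or opens a new table with probability
    proportional to alpha -- the "rich get richer" effect.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of customers at table k
    assignments = []
    for _ in range(n_points):
        weights = counts + [alpha]                        # existing tables, then "new table"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)                              # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# A larger concentration alpha tends to create more clusters.
print(sample_crp(20, alpha=1.0))
print(sample_crp(20, alpha=10.0))
```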
10
BHC Algorithm - Overview
Same as the traditional approach:
- One-pass, bottom-up method
- Initializes each data point in its own cluster, and iteratively merges pairs of clusters.
Difference:
- Uses a statistical hypothesis test, rather than a distance metric, to choose which clusters to merge (a high-level sketch of this loop follows below).
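A high-level sketch of the greedy agglomerative loop, assuming a `merge_posterior(ti, tj)` function that returns the merge posterior r_k defined on the next slides; the function names and structure are illustrative, not the paper's code.

```python
def bhc_agglomerate(points, make_leaf, make_merged, merge_posterior):
    """One-pass, bottom-up skeleton of the BHC loop.

    make_leaf(x)            -> a tree containing a single data point
    make_merged(ti, tj)     -> the tree obtained by merging ti and tj
    merge_posterior(ti, tj) -> r_k, the posterior probability that the data
                               under ti and tj came from one component
    """
    trees = [make_leaf(x) for x in points]
    while len(trees) > 1:
        # Pick the pair of trees whose merge has the highest posterior probability.
        i, j = max(
            ((a, b) for a in range(len(trees)) for b in range(a + 1, len(trees))),
            key=lambda ab: merge_posterior(trees[ab[0]], trees[ab[1]]),
        )
        merged = make_merged(trees[i], trees[j])
        trees = [t for idx, t in enumerate(trees) if idx not in (i, j)] + [merged]
    return trees[0]
```

This naive pair search recomputes every score each round; caching scores (only pairs involving the newly merged tree change) keeps the number of score evaluations quadratic, matching the O(n²) complexity noted on the limitations slide.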
11
BHC Algorithm - Concepts
Two hypotheses to compare:
1. All the data was generated i.i.d. from the same probabilistic model with unknown parameters.
2. The data has two or more clusters in it.
12
Hypothesis H1
Probability of the data under H1:

  p(D_k | H1) = ∫ p(D_k | θ) p(θ | β) dθ

- p(θ | β): prior over the parameters
- D_k: data in the two trees to be merged
- The integral is tractable with a conjugate prior.
(A worked example for binary data with a Beta prior follows below.)
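As an illustration of the tractable integral, here is a sketch of the closed-form marginal likelihood for binary data with a conjugate Beta(a, b) prior on each Bernoulli parameter; this particular component model is an assumption for the example, not prescribed by the slides.

```python
import numpy as np
from scipy.special import betaln

def bernoulli_beta_marginal_loglik(X, a=1.0, b=1.0):
    """Log marginal likelihood of binary data X (N x D) under the hypothesis
    that all rows come from one product-of-Bernoullis component, with a
    conjugate Beta(a, b) prior on each dimension's parameter:

        p(X | H1) = prod_d  B(a + m_d, b + N - m_d) / B(a, b)

    where m_d is the number of ones in column d and B is the Beta function.
    """
    X = np.asarray(X)
    N, _ = X.shape
    m = X.sum(axis=0)                                  # ones per dimension
    return float(np.sum(betaln(a + m, b + N - m) - betaln(a, b)))

# Tightly clustered binary data scores higher than mixed data.
same = np.array([[1, 1, 0], [1, 1, 0], [1, 1, 0]])
mixed = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
print(bernoulli_beta_marginal_loglik(same), bernoulli_beta_marginal_loglik(mixed))
```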
13
Hypothesis H2
Probability of the data under H2 is a product over the two sub-trees:

  p(D_k | H2) = p(D_i | T_i) p(D_j | T_j)
14
BHC Algorithm - Working Flow
From Bayes’ rule, the posterior probability of the merged hypothesis is

  r_k = π_k p(D_k | H1) / p(D_k | T_k),
  p(D_k | T_k) = π_k p(D_k | H1) + (1 - π_k) p(D_i | T_i) p(D_j | T_j)

- π_k depends on the number of data points and the DPM concentration; p(D_k | H1) depends on the hidden features (the underlying distribution).
- The pair of trees with the highest r_k is merged.
- Natural place to cut the final tree: nodes where r_k < 0.5.
(A sketch of this computation follows below.)
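A minimal sketch of this computation in log space for numerical stability; the function and argument names are illustrative, and log p(D_k | H1) would come from a conjugate component model such as the Beta-Bernoulli example above.

```python
import numpy as np

def merge_posterior(log_pi_k, log_p_h1, log_p_tree_i, log_p_tree_j):
    """Compute log r_k and log p(D_k | T_k) from the quantities on this slide:

        p(D_k | T_k) = pi_k p(D_k | H1) + (1 - pi_k) p(D_i | T_i) p(D_j | T_j)
        r_k          = pi_k p(D_k | H1) / p(D_k | T_k)

    All inputs and outputs are log probabilities.
    """
    log_merged = log_pi_k + log_p_h1
    log_one_minus_pi = np.log1p(-np.exp(log_pi_k))     # log(1 - pi_k), requires pi_k < 1
    log_split = log_one_minus_pi + log_p_tree_i + log_p_tree_j
    log_p_tree_k = np.logaddexp(log_merged, log_split)
    log_r_k = log_merged - log_p_tree_k
    return log_r_k, log_p_tree_k

# The pair with the highest r_k is merged; once the tree is built,
# cutting at nodes where r_k < 0.5 (log_r_k < log 0.5) gives the clustering.
```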
15
Tree-Consistent Partitions
Consider the tree over {1, 2, 3, 4} shown on the slide and all 15 possible partitions of {1, 2, 3, 4}:
(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)
- (1 2)(3)(4) and (1 2 3)(4) are tree-consistent partitions.
- (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.
(A small enumeration sketch follows below.)
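A small sketch that enumerates the tree-consistent partitions of a binary tree by recursion; the tree (((1, 2), 3), 4) used in the example is an assumption, chosen to be consistent with the partitions listed on the slide (the slide's figure is not reproduced here).

```python
def leaves(tree):
    """Set of leaf labels under a tree; a tree is a label or a pair (left, right)."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])

def tree_consistent_partitions(tree):
    """All tree-consistent partitions of the leaves of a binary tree.

    Each internal node either keeps all of its leaves in one block, or combines
    any tree-consistent partition of its left subtree with any of its right subtree.
    """
    if not isinstance(tree, tuple):
        return [[leaves(tree)]]
    left, right = tree
    parts = [[leaves(tree)]]                      # the single-block option
    for pl in tree_consistent_partitions(left):
        for pr in tree_consistent_partitions(right):
            parts.append(pl + pr)
    return parts

# Assumed tree matching the slide's examples: (((1, 2), 3), 4).
for p in tree_consistent_partitions((((1, 2), 3), 4)):
    print(sorted(tuple(sorted(block)) for block in p))
# Prints 4 partitions, including (1 2)(3)(4) and (1 2 3)(4);
# (1)(2 3)(4) and (1 3)(2 4) do not appear.
```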
16
Merged Hypothesis Prior (π_k)
- Based on the DPM (CRP perspective).
- π_k = P(all points in D_k belong to one cluster), given a partition consistent with the tree T_k.
- The d terms account for all tree-consistent partitions and are computed bottom-up along with π_k:
    leaf i:           d_i = α,  π_i = 1
    internal node k:  d_k = α Γ(n_k) + d_i d_j,   π_k = α Γ(n_k) / d_k
  where n_k is the number of points under node k, α is the DPM concentration, and i, j are the children of k.
(A short computational sketch follows below.)
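A sketch of this bottom-up recursion in log space (gammaln avoids overflow for large n_k); the function names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def leaf_prior(alpha):
    """Leaf initialization: d_i = alpha, pi_i = 1 (so log pi_i = 0)."""
    return np.log(alpha), 0.0

def merge_prior(alpha, n_k, log_d_i, log_d_j):
    """Return (log d_k, log pi_k) for an internal node with n_k data points
    and children whose log d values are log_d_i and log_d_j:

        d_k  = alpha * Gamma(n_k) + d_i * d_j
        pi_k = alpha * Gamma(n_k) / d_k
    """
    log_front = np.log(alpha) + gammaln(n_k)          # log(alpha * Gamma(n_k))
    log_d_k = np.logaddexp(log_front, log_d_i + log_d_j)
    log_pi_k = log_front - log_d_k
    return log_d_k, log_pi_k
```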
17
Predictive Distribution
- BHC allows us to define predictive distributions for new data points: p(x | D) is a mixture of the component predictives p(x | D_k) over the nodes of the tree, weighted by the merge posteriors.
- Note: p(x | D) ≠ p(x | D_k, H1) even at the root, because the tree marginal still assigns weight to partitions with more than one cluster.
(A sketch follows below.)
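One way to compute this predictive density, following the paper's construction, is to weight each node's component predictive by ω_k = r_k ∏ (1 - r_i) over k's ancestors i; the node attributes and helper names below are illustrative assumptions.

```python
import numpy as np

def predictive_logdensity(x, root, comp_logpdf):
    """Sketch of log p(x | D): mix each node's component predictive p(x | D_k)
    with weight w_k = r_k * prod over ancestors i of (1 - r_i).

    Assumed (illustrative) node attributes:
      node.log_r      log r_k at that node (0.0, i.e. r_k = 1, for leaves)
      node.children   [] for leaves, [left, right] otherwise
      node.data       the data points under the node
    comp_logpdf(x, data) is the component's posterior predictive log-density.
    """
    log_terms = []

    def walk(node, log_path):            # log_path = sum of log(1 - r_i) over ancestors
        log_terms.append(log_path + node.log_r + comp_logpdf(x, node.data))
        if node.children:
            log_one_minus_r = np.log1p(-np.exp(node.log_r)) if node.log_r < 0 else -np.inf
            for child in node.children:
                walk(child, log_path + log_one_minus_r)

    walk(root, 0.0)
    return np.logaddexp.reduce(log_terms)
```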
18
Approximate Inference for the DPM
- BHC forms a lower bound on the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions (the tree-consistent ones).
- Idea: deterministically sum over the partitions with high probability, thereby accounting for most of the mass.
- Compared to MCMC methods, this is deterministic and efficient.
19
Learning Hyperparameters
- α: concentration parameter of the DPM.
- β: defines the base distribution G0 (the prior over component parameters).
- Learned by recursive gradients and an EM-like method.
20
To Sum Up for BHC
- A statistical model for comparing merges, which also decides when to stop.
- Allows us to define predictive distributions for new data points.
- Provides approximate inference for the DPM marginal likelihood.
- Hyperparameters: α (concentration parameter), β (defines the base distribution G0).
21
Unique Aspects of the BHC Algorithm
- A hierarchical way of organizing nested clusters, not a hierarchical generative model; it is derived from the (flat) DPM.
- Hypothesis test: one cluster vs. many other clusterings (rather than one vs. two clusters at each stage).
- Not iterative and does not require sampling (except for learning the hyperparameters).
22
Results from the experiments
28
Conclusion and some take-home notes
29
Conclusion
BHC addresses the limitations of traditional hierarchical clustering:
- No guide to choosing the correct number of clusters or where to prune the tree → a natural stopping criterion (cut where r_k < 0.5).
- Distance metric selection → a model-based criterion (marginal likelihoods).
- Evaluation, comparison, and inference → a full probabilistic model.
- Along the way, it also yields some useful results for the DPM (a lower bound on its marginal likelihood).
Solved!
30
Summary
- Defines a probabilistic model of the data; can compute the probability of a new data point belonging to any cluster in the tree.
- Model-based criterion to decide on merging clusters.
- Bayesian hypothesis testing is used to decide which merges are advantageous, and to decide the appropriate depth of the tree.
- The algorithm can be interpreted as an approximate inference method for a DPM; it gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data.
31
Limitations
- Inherent greediness of the agglomerative procedure.
- No incorporation of tree uncertainty.
- O(n²) complexity for building the tree.
32
References
Main paper: K. Heller and Z. Ghahramani, Bayesian Hierarchical Clustering, ICML 2005.
Thesis: Katherine Ann Heller, Efficient Bayesian Methods for Clustering.
Other references:
- Wikipedia
- Paper slides:
  - www.ee.duke.edu/~lcarin/emag/.../DW_PD_100705.ppt
  - http://cs.brown.edu/courses/csci2950-p/fall2011/lectures/2011-10-13_ghosh.pdf
- General ML: http://blog.echen.me/
33
References (cont’d)
DPM & nonparametric Bayes:
- http://nlp.stanford.edu/~grenager/papers/dp_2005_02_24.ppt
- https://www.cs.cmu.edu/~kbe/dp_tutorial.pdf
- http://www.iro.umontreal.ca/~lisa/seminaires/31-10-2006.pdf
- http://videolectures.net/mlss07_teh_dp/
- http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf
- http://www.cns.nyu.edu/~eorhan/notes/dpmm.pdf (easy to read)
- http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
Heavier reading:
- http://stat.columbia.edu/~porbanz/reports/OrbanzTeh2010.pdf
- http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/dp.pdf
- http://www.stat.uchicago.edu/~pmcc/reports/clusters.pdf
Hierarchical DPM:
- http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
Other methods:
- https://people.cs.umass.edu/~wallach/publications/wallach10alternative.pdf
34
Thank You for Your Attention!