Bayesian Hierarchical Clustering Paper by K. Heller and Z. Ghahramani ICML 2005 Presented by David Williams Paper Discussion Group (10.07.05)

Outline
Traditional Hierarchical Clustering
Bayesian Hierarchical Clustering
– Algorithm
– Results
Potential Application

Hierarchical Clustering
Given a set of data points, the output is a tree
– Leaves are the data points
– Internal nodes are nested clusters
Examples
– Evolutionary tree of living organisms
– Internet newsgroups
– Newswire documents

Traditional Hierarchical Clustering
Bottom-up agglomerative algorithm
– Begin with each data point in its own cluster
– Iteratively merge the two "closest" clusters
– Stop when a single cluster remains
– Closeness is based on a given distance measure (e.g., Euclidean distance between cluster means)
Limitations
– No guide to choosing the "correct" number of clusters, or where to prune the tree
– Choice of distance metric can be difficult (especially for data such as images or sequences)
– Hard to evaluate how good the result is, to compare it to other models, or to make predictions and cluster new data with the existing hierarchy
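For concreteness, here is a minimal sketch of the bottom-up agglomerative procedure just described, using Euclidean distance between cluster means as the closeness measure; the function name and data layout are illustrative choices, not from the paper.

import numpy as np

def agglomerative_cluster(points):
    """Bottom-up agglomerative clustering over an (n, d) array of points.
    Repeatedly merges the two clusters whose means are closest in
    Euclidean distance until a single cluster (the root) remains."""
    # Each cluster starts as one data point; "children" records the tree.
    clusters = [{"members": [i], "children": None} for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        # Consider every pair of current clusters and keep the closest pair.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                mean_a = points[clusters[a]["members"]].mean(axis=0)
                mean_b = points[clusters[b]["members"]].mean(axis=0)
                dist = np.linalg.norm(mean_a - mean_b)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = {"members": clusters[a]["members"] + clusters[b]["members"],
                  "children": (clusters[a], clusters[b])}
        # Replace the two merged clusters with their new parent node.
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters[0]  # root of the hierarchy

The limitations listed above show up directly here: the procedure always returns a complete tree, and nothing in it says where to cut the tree or how good the resulting clustering is.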

Bayesian Hierarchical Clustering (BHC)
Basic idea:
– Use marginal likelihoods to decide which clusters to merge
– Ask what the probability is that all the data in a potential merge were generated from the same mixture component, compared against the exponentially many alternative clusterings at lower levels of the tree
– The generative model used is a Dirichlet Process Mixture model (DPM)

BHC Algorithm Overview
– One-pass, bottom-up method
– Initializes each data point in its own cluster, then iteratively merges pairs of clusters
– Uses a statistical hypothesis test to choose which clusters to merge
– At each stage, the algorithm considers merging all pairs of existing trees

BHC Algorithm: Merging
Two hypotheses are compared:
1. All the data in the pair of trees to be merged were generated i.i.d. from the same probabilistic model with unknown parameters (e.g., a Gaussian)
2. The data contain two or more clusters, i.e., they follow some other partition consistent with the two subtrees

Hypothesis H1
– Probability of the data under H1: integrate the likelihood against a prior over the parameters (see the expression below)
– D_k is the data in the two trees to be merged
– The integral is tractable when a conjugate prior is employed
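The slide's equations were images and did not survive the transcript; in the paper's notation, with p(\theta \mid \beta) the prior over the parameters, the marginal likelihood under H1 is

  p(D_k \mid H_1^k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta
                    = \int \Big[ \prod_{x^{(i)} \in D_k} p(x^{(i)} \mid \theta) \Big] p(\theta \mid \beta)\, d\theta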

Hypothesis H2
– Probability of the data under H2: a product over the two sub-trees
– Prior that all points belong to one cluster: π_k
– Probability of the data in tree T_k: a mixture of the two hypotheses, weighted by π_k (see below)
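Reconstructing the missing expressions from the paper, with T_i and T_j the two subtrees being merged into T_k:

  p(D_k \mid H_2^k) = p(D_i \mid T_i)\, p(D_j \mid T_j)

  p(D_k \mid T_k) = \pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)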

Merging Clusters
– From Bayes' rule, the posterior probability of the merged hypothesis, r_k, is computed (see below)
– The pair of trees with the highest merge probability is merged
– Natural place to cut the final tree: the nodes where the merge probability drops below one half
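The corresponding expression from the paper (filled in here because the slide's equation was an image):

  r_k = \frac{\pi_k\, p(D_k \mid H_1^k)}{\pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j)} = \frac{\pi_k\, p(D_k \mid H_1^k)}{p(D_k \mid T_k)}

and the tree is cut at the nodes where r_k < 0.5.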

Dirichlet Process Mixture Models (DPMs)
– The probability of a new data point joining an existing cluster is proportional to the number of points already in that cluster
– α controls the probability of the new point creating a new cluster
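In the Chinese-restaurant-process view of the DPM (a standard result, not specific to this paper), if n points have been assigned so far and cluster c currently holds n_c of them, the next point obeys

  p(\text{join cluster } c) = \frac{n_c}{n + \alpha}, \qquad p(\text{start a new cluster}) = \frac{\alpha}{n + \alpha}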

Merged Hypothesis Prior
– The DPM with concentration parameter α defines a prior on all partitions of the n_k data points in D_k
– The prior on the merged hypothesis, π_k, is the relative mass of the partition in which all n_k points belong to one cluster, versus all other partitions of those n_k points consistent with the tree structure
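Concretely, the paper computes π_k with a bottom-up recursion over the tree: initializing d_i = α and π_i = 1 at each leaf, every internal node k with n_k points and children left_k, right_k uses

  d_k = \alpha\, \Gamma(n_k) + d_{\mathrm{left}_k}\, d_{\mathrm{right}_k}, \qquad \pi_k = \frac{\alpha\, \Gamma(n_k)}{d_k}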

DPM
Other quantities needed for the posterior merged-hypothesis probabilities can also be written and computed with the DPM (see the math and proofs in the paper).
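Putting the last few slides together, the following is a small illustrative implementation of the greedy BHC loop for binary data with a Beta-Bernoulli conjugate model. It is a sketch, not the authors' code: the hyperparameter values are assumptions, and the same structure applies to any other conjugate model (e.g., a Gaussian with its conjugate prior, as mentioned on the merging slide).

import numpy as np
from scipy.special import betaln, gammaln

ALPHA = 1.0                  # DPM concentration parameter (assumed value)
BETA_A, BETA_B = 1.0, 1.0    # Beta prior hyperparameters (assumed values)

def log_ml_one_cluster(data):
    """log p(D | H1): marginal likelihood of binary data under a
    Beta-Bernoulli model, assuming all points come from one cluster."""
    data = np.atleast_2d(data)
    n = data.shape[0]
    ones = data.sum(axis=0)
    zeros = n - ones
    return np.sum(betaln(BETA_A + ones, BETA_B + zeros) - betaln(BETA_A, BETA_B))

class Node:
    """A (sub)tree; stores d_k, p(D_k | T_k), and the merge probability r_k in log space."""
    def __init__(self, data, children=None):
        self.data = np.atleast_2d(data)
        self.children = children
        n = self.data.shape[0]
        if children is None:                       # leaf: d_i = alpha, pi_i = 1
            self.log_d = np.log(ALPHA)
            self.log_p_tree = log_ml_one_cluster(self.data)
        else:
            left, right = children
            # d_k = alpha*Gamma(n_k) + d_left*d_right;  pi_k = alpha*Gamma(n_k) / d_k
            self.log_d = np.logaddexp(np.log(ALPHA) + gammaln(n), left.log_d + right.log_d)
            log_pi = np.log(ALPHA) + gammaln(n) - self.log_d
            log_h1 = log_ml_one_cluster(self.data)
            log_h2 = left.log_p_tree + right.log_p_tree
            # p(D_k | T_k) = pi_k * p(D_k | H1) + (1 - pi_k) * p(D_i | T_i) * p(D_j | T_j)
            self.log_p_tree = np.logaddexp(log_pi + log_h1,
                                           np.log1p(-np.exp(log_pi)) + log_h2)
            self.log_r = log_pi + log_h1 - self.log_p_tree   # log r_k

def bhc(points):
    """One-pass bottom-up BHC: greedily merge the pair of trees with the highest r_k."""
    trees = [Node(p) for p in np.atleast_2d(points)]
    while len(trees) > 1:
        merged, i, j = max(((Node(np.vstack([a.data, b.data]), (a, b)), i, j)
                            for i, a in enumerate(trees)
                            for j, b in enumerate(trees) if j > i),
                           key=lambda t: t[0].log_r)
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]   # root; cut the tree wherever a node has log_r < log(0.5)

For example, bhc(np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]])) builds a tree over four binary points; nodes whose log_r is at least log(0.5) mark subtrees that plausibly form a single cluster, and cutting below 0.5 gives the clustering described on the merging slide.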

Results Some sample results…

Unique Aspects of Algorithm
– It is a hierarchical way of organizing nested clusters, not a hierarchical generative model
– It is derived from DPMs
– The hypothesis test at each stage is not one cluster vs. two clusters, but one cluster vs. many other clusterings
– It is not iterative and does not require sampling

Summary
– Defines a probabilistic model of the data; can compute the probability of a new data point belonging to any cluster in the tree
– Model-based criterion to decide on merging clusters
– Bayesian hypothesis testing is used to decide which merges are advantageous and to decide the appropriate depth of the tree
– The algorithm can be interpreted as an approximate inference method for a DPM; it gives a new lower bound on the marginal likelihood by summing over exponentially many clusterings of the data

Why This Paper?
Mixed-type data problems: both continuous and discrete features. How to perform density estimation?
– One way: partition the continuous data into groups determined by the values of the discrete features
– Problem: the number of groups grows quickly (e.g., 5 discrete features, each of which can take 4 values, gives 4^5 = 1024 groups)
– How do we determine which groups should be combined to reduce the total number of groups?
– Possible solution: the idea in this paper, except that rather than the leaves being individual data points, the leaves would be groups of data points as determined by the discrete feature values