
Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes. Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei, NIPS 2004. Presented by Yuting Qi, ECE Dept., Duke Univ., 08/26/05.



Presentation transcript:

1

2 Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes. Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei, NIPS 2004. Presented by Yuting Qi, ECE Dept., Duke Univ., 08/26/05.

3 Overview: Motivation; Dirichlet Processes; Hierarchical Dirichlet Processes; Inference; Experimental results; Conclusions.

4 Motivation. Multi-task learning as clustering. Goal: share clusters among multiple related clustering problems (model-based). Approach: hierarchical, nonparametric Bayesian, built on DP mixture models, which learn a generative model over the data and treat the cluster labels as hidden variables.

5 Dirichlet Processes. Let (Θ, B) be a measurable space, G_0 a probability measure on that space, and α a positive real number. A Dirichlet process is the distribution of a random probability measure G over (Θ, B) such that, for every finite measurable partition (A_1, …, A_r) of Θ, (G(A_1), …, G(A_r)) ~ Dir(α G_0(A_1), …, α G_0(A_r)); we then write G ~ DP(α, G_0). Properties: a draw G from a DP is discrete with probability one, G = Σ_k β_k δ_{θ_k}, where the atoms θ_k ~ G_0 are i.i.d. and the weights β_k are random and depend on α; consequently, draws from G are generally not distinct.
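As an aside not on the original slides, the stick-breaking property can be made concrete with a short Python sketch that draws an approximate G ~ DP(α, G_0) under a finite truncation; the function name, the truncation level, and the example base measure G_0 = N(0, 1) are illustrative assumptions.

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, truncation=100, rng=None):
    """Approximate draw G ~ DP(alpha, G0) via truncated stick-breaking.

    Returns atom locations theta_k ~ G0 and weights beta_k (summing to
    slightly less than 1 because of the truncation)."""
    rng = np.random.default_rng(rng)
    # Stick-breaking: v_k ~ Beta(1, alpha), beta_k = v_k * prod_{l<k} (1 - v_l)
    v = rng.beta(1.0, alpha, size=truncation)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    # Atoms are i.i.d. draws from the base measure G0
    theta = np.array([base_sampler(rng) for _ in range(truncation)])
    return theta, beta

# Example with G0 = N(0, 1) and alpha = 2 (illustrative values)
theta, beta = stick_breaking_dp(2.0, lambda r: r.normal(0.0, 1.0), truncation=50, rng=0)
print(beta[:5], beta.sum())
```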

6 Chinese Restaurant Process. CRP (the Pólya urn scheme): Φ_1, …, Φ_{i-1} are i.i.d. random variables distributed according to G; θ_1, …, θ_K are the distinct values taken on by Φ_1, …, Φ_{i-1}; n_k is the number of Φ_{i'} equal to θ_k, 0 < i' < i. Integrating out G, the predictive distribution is Φ_i | Φ_1, …, Φ_{i-1} ~ Σ_{k=1}^{K} (n_k / (i - 1 + α)) δ_{θ_k} + (α / (i - 1 + α)) G_0. This slide is from "Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process", NLP Group, Stanford, Feb. 2005.
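A minimal Python sketch of the CRP seating scheme just described (not part of the slides); the names and parameter values are assumptions, but the seating probabilities follow the predictive rule above.

```python
import numpy as np

def crp_assignments(n, alpha, rng=None):
    """Sample table assignments for n customers from CRP(alpha).

    With i customers already seated, the next customer joins existing
    table k with probability n_k / (i + alpha) and opens a new table
    with probability alpha / (i + alpha)."""
    rng = np.random.default_rng(rng)
    tables = []        # tables[k] = number of customers at table k
    assignments = []
    for i in range(n):
        probs = np.array(tables + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(0)   # open a new table
        tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp_assignments(20, alpha=1.0, rng=0)
print(assignments)
print(tables)   # table sizes show the rich-get-richer behaviour
```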

7 DP Mixture Model. One of the most important applications of the DP: a nonparametric prior distribution on the components of a mixture model, G ~ DP(α, G_0), Φ_i | G ~ G, x_i | Φ_i ~ F(Φ_i). Why not apply the DP directly to density estimation? Because a draw G is discrete with probability one, so it is better suited as a prior over mixture components.
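To illustrate the DP mixture as a generative model, here is a hedged Python sketch that samples data from a DP mixture of 1-D Gaussians using CRP seating; the base measure N(0, 5^2), the unit observation noise, and all names are assumptions made for the example, not the paper's setup.

```python
import numpy as np

def sample_dp_mixture(n, alpha, rng=None):
    """Generate n points from a DP mixture of 1-D Gaussians.

    Cluster assignments follow CRP(alpha); a new cluster draws its mean
    from the base measure G0 = N(0, 5^2); the likelihood F is N(mean, 1)."""
    rng = np.random.default_rng(rng)
    means, counts, x = [], [], []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(means):                      # new component
            means.append(rng.normal(0.0, 5.0))   # theta_k ~ G0
            counts.append(0)
        counts[k] += 1
        x.append(rng.normal(means[k], 1.0))      # x_i ~ F(theta_k)
    return np.array(x), means

data, component_means = sample_dp_mixture(200, alpha=1.0, rng=1)
print(len(component_means), "components used")
```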

8 HDP – Problem statement. We have J groups of data, {X_j}, j = 1, …, J. For each group, X_j = {x_ji}, i = 1, …, n_j. In each group, X_j = {x_ji} is modeled with a mixture model. The mixing proportions are specific to the group. Different groups share the same set of mixture components (underlying clusters), but each group uses a different combination of the mixture components. Goal: discover the distribution of mixture components within each group, and discover how components are shared across groups.

9 HDP – General representation. G_0: the global probability measure, G_0 ~ DP(r, H), with concentration parameter r and base measure H. G_j: the probability measure for group j, G_j ~ DP(α, G_0). Φ_ji: the hidden parameter of the distribution F(Φ_ji) corresponding to x_ji. The overall model is a two-level DP: G_0 | r, H ~ DP(r, H); G_j | α, G_0 ~ DP(α, G_0); Φ_ji | G_j ~ G_j; x_ji | Φ_ji ~ F(Φ_ji).
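The two-level construction can be sketched in Python with a truncated stick-breaking representation: the global weights β are drawn as in a single DP, and each group's proportions are drawn as π_j ~ Dirichlet(α β), which is the distribution DP(α, G_0) induces on the weights of a finite (truncated) G_0. The base measure H = N(0, 3^2), the Gaussian likelihood, and all names and values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def hdp_generate(groups, n_per_group, gamma=5.0, alpha=1.0, K=50, rng=None):
    """Sketch of the two-level HDP generative process, truncated to K atoms.

    Global level:  beta ~ GEM(gamma), theta_k ~ H (here H = N(0, 3^2)).
    Group level:   pi_j ~ Dirichlet(alpha * beta)  (weights of DP(alpha, G_0)).
    Observations:  z_ji ~ pi_j,  x_ji ~ N(theta_{z_ji}, 1)."""
    rng = np.random.default_rng(rng)
    # Global stick-breaking weights and shared atoms
    v = rng.beta(1.0, gamma, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()                 # fold the truncation remainder back in
    theta = rng.normal(0.0, 3.0, size=K)
    data = []
    for _ in range(groups):
        pi_j = rng.dirichlet(alpha * beta + 1e-12)   # group-specific proportions
        z = rng.choice(K, size=n_per_group, p=pi_j)  # component assignments
        data.append(rng.normal(theta[z], 1.0))
    return data, theta, beta

data, theta, beta = hdp_generate(groups=3, n_per_group=100, rng=0)
print([d.shape for d in data])
```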

10 HDP – General representation. G_0 places non-zero mass only on its atoms, G_0 = Σ_k β_k δ_{θ_k}, where the θ_k are i.i.d. random variables distributed according to H. Since each G_j ~ DP(α, G_0), every G_j places its mass on those same atoms, which is what allows mixture components to be shared across groups.

11 HDP – Chinese restaurant franchise. First level (within each group, a DP mixture): Φ_j1, …, Φ_j(i-1) are i.i.d. random variables distributed according to G_j; ψ_j1, …, ψ_jT_j are the values taken on by Φ_j1, …, Φ_j(i-1); n_jt is the number of Φ_ji' equal to ψ_jt, 0 < i' < i. Second level (across groups, sharing components): the base measure of each group is itself a draw from a DP, ψ_jt | G_0 ~ G_0 with G_0 ~ DP(r, H); θ_1, …, θ_K are the values taken on by the ψ_jt; m_k is the number of ψ_jt equal to θ_k over all j, t.

12 HDP – Chinese restaurant franchise. Values of Φ_ji are shared among groups. Integrating out G_0 yields a CRP at the second level as well: each ψ_jt takes the value of an existing θ_k with probability proportional to m_k, or a new draw from H with probability proportional to r.

13 Inference – MCMC. Gibbs sampling of the posterior in the CR franchise: instead of working with Φ_ji and ψ_jt directly to obtain p(Φ, ψ | X), p(t, k, θ | X) is obtained by sampling t, k, and θ, where t = {t_ji}, t_ji being the table index that Φ_ji is associated with (Φ_ji = ψ_j,t_ji), and k = {k_jt}, k_jt being the index of the dish that ψ_jt takes (ψ_jt = θ_{k_jt}). Using the prior distributions given by the CR franchise, the posterior is sampled iteratively: sampling t, sampling k, sampling θ.

14 Experiments on synthetic data. Data description: there are three groups of data; each group is a Gaussian mixture; different groups can share the same clusters; each cluster has 50 2-D data points with independent features. Group 1 uses clusters [1, 2, 3, 7], Group 2 uses [3, 4, 5, 7], and Group 3 uses [5, 6, 1, 7].
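A small Python sketch (not from the slides) that generates synthetic data with the same structure: seven shared 2-D Gaussian clusters, three overlapping groups, and 50 points per cluster with independent features. The cluster means and the unit variance are placeholders, since the slide does not give the actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Seven shared 2-D cluster means (illustrative placeholders)
cluster_means = {k: rng.uniform(-10, 10, size=2) for k in range(1, 8)}

# Which clusters each group uses, as on the slide
group_clusters = {1: [1, 2, 3, 7], 2: [3, 4, 5, 7], 3: [5, 6, 1, 7]}

groups = {}
for j, clusters in group_clusters.items():
    # 50 points per cluster, independent features (diagonal unit covariance)
    points = [rng.normal(cluster_means[k], 1.0, size=(50, 2)) for k in clusters]
    groups[j] = np.vstack(points)

for j, X in groups.items():
    print(f"group {j}: {X.shape[0]} points from clusters {group_clusters[j]}")
```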

15 Experiments on synthetic data. HDP definition: here, F(x_ji | Φ_ji) is a Gaussian distribution with Φ_ji = {μ_ji, σ_ji}; Φ_ji takes one of the values θ_k = {μ_k, σ_k}, k = 1, 2, …. The base measure H is a joint Normal-Gamma distribution: μ ~ N(m, σ/β) and σ^{-1} ~ Gamma(a, b), where m, β, a, b are given hyperparameters. Goal: model each group as a Gaussian mixture, and model the cluster distribution over groups.
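For concreteness, a short Python sketch of drawing a component (μ, σ) from the Normal-Gamma base measure H as written on the slide; treating σ as a variance and Gamma(a, b) as shape/rate are interpretive assumptions, and the hyperparameter values in the example are illustrative, not those used in the experiments.

```python
import numpy as np

def sample_normal_gamma(m, beta, a, b, rng=None):
    """Draw (mu, sigma) from the Normal-Gamma base measure H:
    sigma^{-1} ~ Gamma(a, b) (shape a, rate b) and mu ~ N(m, sigma / beta),
    with sigma read as a variance (an assumption)."""
    rng = np.random.default_rng(rng)
    sigma = 1.0 / rng.gamma(shape=a, scale=1.0 / b)   # sigma^{-1} ~ Gamma(a, b)
    mu = rng.normal(m, np.sqrt(sigma / beta))         # mu ~ N(m, sigma / beta)
    return mu, sigma

# Illustrative hyperparameter values
mu, sigma = sample_normal_gamma(m=0.0, beta=0.1, a=2.0, b=2.0, rng=0)
print(mu, sigma)
```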

16 Experiments on synthetic data. Results: global distribution, i.e. the shared components estimated over all groups and the corresponding mixing proportions. The number of components is open-ended; only part of the set is shown.

17 Experiments on synthetic data. Mixture within each group: the number of components in each group is also open-ended; only part of the set is shown.

18 Conclusions & discussions. This hierarchical Bayesian method can automatically determine the appropriate number of mixture components. A set of DPs is coupled through a shared base measure to achieve component sharing among groups. The priors are nonparametric; this is not nonparametric density estimation.




