1
Hierarchical Bayesian Nonparametrics with Applications
Michael I. Jordan, University of California, Berkeley
Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux
June 20, 2009
2
Hierarchy Standard nonparametric mixtures:
3
Hierarchy Standard nonparametric mixture: Hierarchical nonparametric mixtures:
4
Hierarchy Standard nonparametric mixtures: Hierarchical nonparametric mixtures: Cf. nonhierarchical alternatives for sharing atoms:
5
De Finetti Exchangeability: invariance of the joint probability distribution of an infinite sequence of random variables to permutation Theorem (De Finetti, 1935). If $x_1, x_2, \ldots$ are infinitely exchangeable, then the joint probability distribution has a representation as a conditional independence mixture: $p(x_1, \ldots, x_N) = \int \prod_{i=1}^N p(x_i \mid G)\, dP(G)$
6
Application: Speaker Diarization (figure: an audio stream segmented by speaker, e.g., John, Jane, Bob, Jill)
7
Application: Protein Modeling A protein is a folded chain of amino acids The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles) Empirical plots of phi and psi angles are called Ramachandran diagrams
8
Application: Protein Modeling We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood We thus have a linked set of density estimation problems
9
Haplotype Modeling Consider $M$ binary markers in a genomic region There are $2^M$ possible haplotypes---i.e., states of a single chromosome –but in fact, far fewer are seen in human populations A genotype is a set of unordered pairs of markers from one individual Given a set of genotypes (multiple individuals), estimate the underlying haplotypes
10
Haplotype Modeling This is a mixture modeling problem: –where the key problem is inference for the number of clusters (the number of distinct haplotypes) Consider now the case of multiple groups of genotype data (e.g., ethnic groups) Geneticists would like to find clusters within each group but they would also like to share clusters between the groups
11
Stick-Breaking A general way to obtain distributions on countably infinite spaces The classical example: Define an infinite sequence of beta random variables: $\beta_k \sim \mathrm{Beta}(1, \alpha)$, $k = 1, 2, \ldots$ And then define an infinite random sequence as follows: $\pi_k = \beta_k \prod_{j<k}(1 - \beta_j)$ This can be viewed as breaking off portions of a stick
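A minimal numerical sketch of this construction, truncated at a finite number of sticks (the concentration value and variable names are illustrative, not taken from the slides):

```python
import numpy as np

def stick_breaking_weights(alpha, num_sticks, rng=None):
    """Truncated stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=num_sticks)                       # portion broken off each time
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])   # stick length left before each break
    return betas * remaining

pi = stick_breaking_weights(alpha=2.0, num_sticks=1000)
print(pi[:5], pi.sum())   # weights decay stochastically; the sum approaches 1 as the truncation grows
```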
12
Constructing Random Measures It's not hard to see that $\sum_{k=1}^\infty \pi_k = 1$ (wp1) Now define the following object: $G = \sum_{k=1}^\infty \pi_k \delta_{\phi_k}$, where the $\phi_k$ are independent draws from a distribution $G_0$ on some space Because $\sum_k \pi_k = 1$, $G$ is a probability measure---it is a random measure The distribution of $G$ is known as a Dirichlet process: $G \sim \mathrm{DP}(\alpha, G_0)$ What exchangeable marginal distribution does this yield when $G$ is integrated out in the De Finetti setup?
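A hedged sketch of a truncated draw of such a random measure, reusing the stick-breaking weights and taking a standard normal as an illustrative base distribution $G_0$:

```python
import numpy as np

def truncated_dp_draw(alpha, base_sampler, num_atoms=1000, rng=None):
    """Return atom locations phi_k ~ G0 and stick-breaking weights pi_k of a truncated DP(alpha, G0) draw."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=num_atoms)
    weights = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    atoms = base_sampler(num_atoms, rng)          # independent draws from the base measure G0
    return atoms, weights

# Example: G0 = N(0, 1); sampling from G means picking atom phi_k with probability pi_k.
atoms, weights = truncated_dp_draw(2.0, lambda n, rng: rng.normal(0.0, 1.0, size=n))
samples = atoms[np.random.default_rng(0).choice(len(atoms), size=10, p=weights / weights.sum())]
```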
13
Chinese Restaurant Process (CRP) A random process in which customers sit down in a Chinese restaurant with an infinite number of tables –first customer sits at the first table –the $n$th subsequent customer sits at a table drawn from the following distribution: occupied table $k$ with probability proportional to $m_k$, and the next unoccupied table with probability proportional to $\alpha$ –where $m_k$ is the number of customers currently at table $k$ and where $\mathcal{F}_{n-1}$ denotes the state of the restaurant after $n-1$ customers have been seated
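A short simulation of the seating process (the concentration value and variable names are illustrative):

```python
import numpy as np

def chinese_restaurant_process(num_customers, alpha, rng=None):
    """Return a table assignment for each customer under CRP(alpha)."""
    rng = np.random.default_rng(rng)
    assignments = [0]                    # first customer sits at the first table
    counts = [1]                         # customers currently at each table
    for n in range(1, num_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= n + alpha               # occupied table k w.p. m_k/(n+alpha); new table w.p. alpha/(n+alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)             # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(chinese_restaurant_process(20, alpha=1.0, rng=0))
```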
14
The CRP and Mixture Modeling Data points are customers; tables are mixture components –the CRP defines a prior distribution on the partitioning of the data and on the number of tables This prior can be completed with: –a likelihood---e.g., associate a parameterized probability distribution with each table –a prior for the parameters---the first customer to sit at table $k$ chooses the parameter vector $\phi_k$ for that table from a prior $G_0$ So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality
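A minimal sketch of completing the prior in this way, taking table assignments as given (for instance from the CRP simulation above) and using a Gaussian prior and Gaussian likelihood as illustrative choices:

```python
import numpy as np

def crp_mixture_data(assignments, prior_std=5.0, noise_std=1.0, rng=None):
    """Draw one parameter per table from the prior, then one observation per customer from its table's likelihood."""
    rng = np.random.default_rng(rng)
    num_tables = max(assignments) + 1
    table_means = rng.normal(0.0, prior_std, size=num_tables)             # phi_k ~ G0
    data = rng.normal(table_means[np.asarray(assignments)], noise_std)    # x_i ~ N(phi_{z_i}, noise_std^2)
    return table_means, data
```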
15
CRP Prior, Gaussian Likelihood, Conjugate Prior
16
Exchangeability As a prior on the partition of the data, the CRP is exchangeable The prior on the parameter vectors associated with the tables is also exchangeable The latter probability model is generally called the Polya urn model. Letting $\theta_i$ denote the parameter vector associated with the $i$th data point, we have: $\theta_i \mid \theta_1, \ldots, \theta_{i-1} \sim \frac{1}{i-1+\alpha}\sum_{j<i}\delta_{\theta_j} + \frac{\alpha}{i-1+\alpha} G_0$ From these conditionals, a short calculation shows that the joint distribution for $(\theta_1, \ldots, \theta_N)$ is invariant to order (this is the exchangeability proof)
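A sketch of sampling parameters directly from these Polya urn conditionals, with a standard-normal base measure as an illustrative choice:

```python
import numpy as np

def polya_urn_draws(num_draws, alpha, base_sampler, rng=None):
    """theta_i equals an earlier theta_j w.p. 1/(i-1+alpha) each, or a fresh G0 draw w.p. alpha/(i-1+alpha)."""
    rng = np.random.default_rng(rng)
    thetas = []
    for i in range(num_draws):
        if rng.random() < alpha / (i + alpha):
            thetas.append(base_sampler(rng))        # new draw from the base measure
        else:
            thetas.append(thetas[rng.integers(i)])  # reuse a previous parameter, chosen uniformly
    return thetas

print(polya_urn_draws(10, alpha=1.0, base_sampler=lambda rng: rng.normal()))
```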
17
Dirichlet Process Mixture Models
18
Marginal Probabilities To obtain the marginal probability of the parameters $\theta_1, \theta_2, \ldots$ we need to integrate out $G$ This marginal distribution turns out to be the Chinese restaurant process (more precisely, it's the Polya urn model)
19
Multiple estimation problems We often face multiple, related estimation problems E.g., multiple Gaussian means: $x_{ij} \sim N(\theta_i, \sigma^2)$ Maximum likelihood: $\hat{\theta}_i = \bar{x}_i$, the sample mean of group $i$ Maximum likelihood often doesn't work very well –want to “share statistical strength”
20
Hierarchical Bayesian Approach The Bayesian or empirical Bayesian solution is to view the parameters $\theta_i$ as random variables, related via an underlying shared variable Given this overall model, posterior inference yields shrinkage---the posterior mean for each $\theta_i$ combines data from all of the groups
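A hedged numerical sketch of this shrinkage effect for Gaussian groups with known variances (the variances and the plug-in estimate of the shared mean are illustrative simplifications; a full hierarchical treatment would place priors on them):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(0.0, 1.0, size=8)                      # group means drawn around a common value
data = [rng.normal(mu, 2.0, size=5) for mu in true_means]      # few, noisy observations per group

sigma2, tau2 = 4.0, 1.0                                        # assumed noise and prior variances
mu0 = np.mean([x.mean() for x in data])                        # crude plug-in for the shared mean

for x in data:
    n = len(x)
    w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)               # weight on the group's own data
    posterior_mean = w * x.mean() + (1 - w) * mu0              # shrinks the ML estimate toward mu0
    print(f"ML: {x.mean():6.2f}   shrunk: {posterior_mean:6.2f}")
```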
21
Hierarchical Modeling Recall the plate notation: Equivalent to:
22
Hierarchical Dirichlet Process Mixtures (Teh, Jordan, Beal, & Blei, JASA 2006)
23
Marginal Probabilities First integrate out the group-level measures $G_i$, then integrate out the shared base measure $G_0$
24
Chinese Restaurant Franchise (CRF)
25
Protein Folding (cont.) We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood
26
Protein Folding (cont.)
27
Natural Language Parsing Key idea: lexicalization of probabilistic context-free grammars –the grammatical rules (e.g., S → NP VP) are conditioned on the specific lexical items (words) that they derive –this leads to huge numbers of potential rules, and (ad hoc) clustering methods are used to control the number of rules –note that this is a set of clustering problems---each grammatical context defines a clustering problem
28
HDP-PCFG Use the HDP to link these multiple clustering problems and to let the number of grammatical rules be chosen adaptively (Liang, Jordan & Klein, 2009)
29
Nonparametric Hidden Markov Models Essentially a dynamic mixture model in which the mixing proportion is a transition probability Use Bayesian nonparametric tools to allow the cardinality of the state space to be random –urn model point of view (Beal et al., “infinite HMM”) –Dirichlet process point of view (Teh et al., “HDP-HMM”)
30
Hidden Markov Models (figure: a sequence of hidden states evolving over time, each emitting an observation)
34
HDP-HMM Dirichlet process: –state space of unbounded cardinality Hierarchical Bayes: –ties the state transition distributions together
35
HDP-HMM Average transition distribution: $\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$ State-specific transition distributions: $\pi_j \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta)$ –the sparsity of $\beta$ is shared across the states
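A sketch of this construction under the commonly used weak-limit approximation, in which GEM(γ) is replaced by a finite symmetric Dirichlet over L states and each DP(α, β) row becomes a Dirichlet draw centered on the shared β (the values of L, γ, and α are illustrative):

```python
import numpy as np

def hdp_hmm_transitions(L, gamma, alpha, rng=None):
    """Weak-limit approximation: beta ~ Dir(gamma/L, ..., gamma/L); pi_j ~ Dir(alpha * beta)."""
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))                       # shared average transition distribution
    pi = np.array([rng.dirichlet(alpha * beta) for _ in range(L)])    # one transition row per state
    return beta, pi

beta, pi = hdp_hmm_transitions(L=10, gamma=5.0, alpha=5.0, rng=0)
# States with tiny beta_k receive tiny mass in every row: the sparsity pattern of beta is shared.
```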
36
State Splitting The HDP-HMM inadequately models temporal persistence of states –the DP bias is insufficient to prevent unrealistically rapid dynamics –this reduces predictive performance
37
“Sticky” HDP-HMM State-specific base measure: $\pi_j \mid \alpha, \kappa, \beta \sim \mathrm{DP}\!\left(\alpha + \kappa, \frac{\alpha\beta + \kappa\delta_j}{\alpha + \kappa}\right)$ –increased probability of self-transition ($\kappa = 0$ recovers the original HDP-HMM)
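Under the same weak-limit approximation, the sticky variant only adds κ to the jth coordinate of the Dirichlet base for row j; a minimal sketch (the κ value is illustrative):

```python
import numpy as np

def sticky_transitions(L, gamma, alpha, kappa, rng=None):
    """Sticky weak limit: pi_j ~ Dir(alpha * beta + kappa * e_j), boosting the self-transition pi_jj."""
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))
    pi = np.array([rng.dirichlet(alpha * beta + kappa * np.eye(L)[j]) for j in range(L)])
    return beta, pi

_, pi = sticky_transitions(L=10, gamma=5.0, alpha=5.0, kappa=50.0, rng=0)
print(np.diag(pi))   # self-transition probabilities are pushed up (kappa = 0 recovers the HDP-HMM)
```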
38
Direct Assignment Sampler (a Rao-Blackwellized Gibbs sampler) Marginalize: –transition densities –emission parameters Sequentially sample the state assignments: –a conjugate base measure yields a closed form: Chinese restaurant prior × likelihood Drawback: the sampler splits a true state across several model states, which are then hard to merge
39
Blocked Resampling Use a weak limit approximation to the HDP-HMM (truncate the state space at $L$ states) Compute backwards messages: $m_t(k) \propto p(y_{t+1:T} \mid z_t = k)$ Block sample the full state sequence $z_{1:T}$ from its conditional
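A hedged sketch of the backward-filtering / forward-sampling step at a fixed truncation level, with one-dimensional Gaussian emissions as an illustrative likelihood; the inputs (initial distribution, transition matrix, emission means and standard deviation) are assumed to come from the surrounding Gibbs sweep:

```python
import numpy as np

def block_sample_states(obs, pi0, pi, means, std, rng=None):
    """Backward messages m_t(k) proportional to p(y_{t+1:T} | z_t = k), then forward block-sample z_{1:T}."""
    rng = np.random.default_rng(rng)
    obs = np.asarray(obs, dtype=float)
    T, L = len(obs), len(pi0)
    lik = np.exp(-0.5 * ((obs[:, None] - means) / std) ** 2) / (std * np.sqrt(2 * np.pi))  # lik[t, k]
    msgs = np.ones((T, L))
    for t in range(T - 2, -1, -1):
        msgs[t] = pi @ (lik[t + 1] * msgs[t + 1])
        msgs[t] /= msgs[t].sum()                      # normalize for numerical stability
    z = np.empty(T, dtype=int)
    probs = pi0 * lik[0] * msgs[0]
    z[0] = rng.choice(L, p=probs / probs.sum())
    for t in range(1, T):
        probs = pi[z[t - 1]] * lik[t] * msgs[t]
        z[t] = rng.choice(L, p=probs / probs.sum())
    return z
```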
40
Results: Gaussian Emissions (figure: blocked vs. sequential samplers, sticky HDP-HMM vs. HDP-HMM)
41
Results: Fast Switching (figure: observations and true state sequence; sticky HDP-HMM vs. HDP-HMM)
42
Results: Multinomial Emissions Observations are embedded in a discrete topological space, so there is no sense of closeness to help with grouping The sticky parameter improves the predictive probability of test sequences –histogram of log-likelihoods of test sequences using parameters sampled at thinned iterations in the 10,000 to 30,000 range (sticky HDP-HMM vs. HDP-HMM)
43
Multimodal Emissions Approximate multimodal emissions with a DP mixture Temporal state persistence disambiguates the model (figure: states, mixture components, observations)
44
Results: Multimodal Emissions (figure: 5-state HMM, varying number of emission components, equal mixture weights; sticky HDP-HMM vs. HDP-HMM)
45
Speaker Diarization (figure: an audio stream segmented by speaker, e.g., John, Jane, Bob, Jill)
46
NIST Evaluations NIST Rich Transcription 2004-2007 meeting recognition evaluations (21 meetings) ICSI results have been the current state of the art Meeting-by-meeting comparison
47
Results: 21 meetings

                      Overall DER   Best DER   Worst DER
Sticky HDP-HMM        17.84%        1.26%      34.29%
Non-Sticky HDP-HMM    23.91%        6.26%      46.95%
ICSI                  18.37%        4.39%      32.23%
48
Results: Meeting 1 (AMI_20041210-1052) Sticky DER = 1.26% ICSI DER = 7.56%
49
Results: Meeting 18 (VT_20050304-1300) Sticky DER = 4.81% ICSI DER = 22.00%
50
Results: Meeting 16 (NIST_20051102-1323)
51
Nonparametric Hidden Markov Trees A tree in which each parent-child edge is a Markov transition We need to tie the parent-child transitions across the parent states; this is again done with the HDP (Kivinen, Sudderth & Jordan, 2009)
52
Nonparametric Hidden Markov Trees Local Gaussian Scale Mixture (31.84 dB) (Kivinen, Sudderth & Jordan, 2009)
53
Nonparametric Hidden Markov Trees HDP-HMT (32.10 dB) (Kivinen, Sudderth & Jordan, 2009)
54
The Beta Process The Dirichlet process naturally yields a multinomial random variable (which table is the customer sitting at?) Problem: in many problem domains we have a very large (combinatorial) number of possible tables –using the Dirichlet process means having a large number of parameters, which may overfit –perhaps instead want to characterize objects as collections of attributes (“sparse features”)? –i.e., binary matrices with more than one 1 in each row
55
Indian Buffet Process (IBP) Indian restaurant with infinitely many dishes in a buffet line Customers 1 through $N$ enter the restaurant –the first customer samples $\mathrm{Poisson}(\alpha)$ dishes –the $n$th customer samples a previously sampled dish $k$ with probability $m_k / n$, then samples $\mathrm{Poisson}(\alpha / n)$ new dishes (Griffiths & Ghahramani, 2002)
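A short simulation of the buffet process as described above (variable names are illustrative):

```python
import numpy as np

def indian_buffet_process(num_customers, alpha, rng=None):
    """Return a binary customers-by-dishes matrix sampled from the IBP."""
    rng = np.random.default_rng(rng)
    dish_counts = []                                                       # m_k: customers who tried dish k
    rows = []
    for n in range(1, num_customers + 1):
        old = [1 if rng.random() < m / n else 0 for m in dish_counts]      # previously sampled dish k w.p. m_k / n
        new = rng.poisson(alpha / n)                                       # Poisson(alpha / n) brand-new dishes
        dish_counts = [m + z for m, z in zip(dish_counts, old)] + [1] * new
        rows.append(old + [1] * new)
    K = len(dish_counts)
    return np.array([r + [0] * (K - len(r)) for r in rows])

print(indian_buffet_process(5, alpha=2.0, rng=0))
```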
60
Indian Buffet Process (IBP) The IBP is usually derived by taking a finite limit of a process on a finite matrix But this leaves some issues somewhat obscured: –Is the IBP exchangeable? –Why the Poisson number of dishes in each row? –Is the IBP conjugate to some stochastic process? –Is there an underlying stochastic process with the IBP as its marginal?
61
Completely Random Processes Completely random measures are measures on a set $\Omega$ that assign independent mass to nonintersecting subsets of $\Omega$ –e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof (The Dirichlet process is not a completely random process –but it's a normalized gamma process) Completely random processes are discrete wp1 (up to a possible deterministic continuous component) Completely random processes are random measures, not necessarily random probability measures (Kingman, 1968)
62
Completely Random Processes Assign independent mass to nonintersecting subsets of $\Omega$ (Kingman, 1968) (figure: atoms of a completely random measure)
63
Completely Random Processes Consider a non-homogeneous Poisson process on the product space $\Omega \times \mathbb{R}_+$ with rate function obtained from some product measure Sample from this Poisson process and connect the samples vertically to their coordinates in $\Omega$ (Kingman, 1968)
64
Beta Processes The product measure is called a Lévy measure For the beta process, this measure lives on $\Omega \times [0,1]$ and is given as follows: $\nu(d\omega, dp) = c\, p^{-1}(1-p)^{c-1}\, dp\, B_0(d\omega)$ –the factor in $p$ is a degenerate Beta(0, c) distribution and $B_0$ is the base measure And the resulting random measure can be written simply as: $B = \sum_k p_k \delta_{\omega_k}$ (Hjort, Kim, et al.)
65
Beta Processes
66
Beta Process -> IBP Theorem: The beta process is the De Finetti mixing measure underlying the IBP Thus: –the IBP is obviously exchangeable –it’s clear why the Poisson assumption arises –this leads to new algorithms for the IBP based on augmentation –we’re motivated to consider hierarchies of beta processes (Thibaux & Jordan, 2007)
67
Beta Process and Bernoulli Process
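A minimal sketch of this pairing, using the finite Beta(α/K, 1) approximation from the IBP derivation for the beta process weights (the truncation level K and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, num_draws = 50, 3.0, 10

# Finite approximation to a beta process draw: K atoms with weights pi_k ~ Beta(alpha/K, 1).
pi = rng.beta(alpha / K, 1.0, size=K)

# Each Bernoulli process draw turns those weights into a binary feature vector.
Z = rng.random((num_draws, K)) < pi            # Z[n, k] ~ Bernoulli(pi_k)
print(Z.sum(axis=1))                           # number of active features per draw
```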
68
BP and BeP Sample Paths
69
Hierarchical Beta Processes A hierarchical beta process is a beta process whose base measure is itself random and drawn from a beta process
70
Text Classification (figure: a words-by-documents binary matrix with document categories such as SPORTS, SCIENCE, and POLITICS)
71
Prediction (figure: average log likelihood vs. smoothing parameter when predicting class 1 based on 10 samples per class; a hierarchy of beta processes compared with Laplace smoothing)
72
Classification Accuracy

Hierarchies of BP    58%
Best Naïve Bayes     50%

20-newsgroup dataset (unbalanced dataset, with categories of size 2 to 100)
73
Stick-Breaking Representations
74
Thibaux & Jordan: size-biased ordering. Teh et al.: inverse Lévy measure approach.
75
Conclusions Hierarchical modeling has at least as important a role to play in Bayesian nonparametrics as it plays in classical Bayesian parametric modeling In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies For papers and more details: www.cs.berkeley.edu/~jordan/publications.html