1
Completely Random Measures for Bayesian Nonparametrics Michael I. Jordan University of California, Berkeley Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux June 14, 2010
2
Information Theory and Statistics Many classical and neo-classical connections –hypothesis testing, large deviations, empirical processes, consistency of density estimation, compressed sensing, MDL priors, reference priors, divergences and asymptotics
3
Information Theory and Statistics Many classical and neo-classical connections –hypothesis testing, large deviations, empirical processes, consistency of density estimation, compressed sensing, MDL priors, reference priors, divergences and asymptotics Some of these connections are frequentist and some are Bayesian Some of the connections involve nonparametrics and some involve parametric modeling –it appears that most of the nonparametric work is being conducted within a frequentist framework I want to introduce you to Bayesian nonparametrics
4
Bayesian Nonparametrics At the core of Bayesian inference is Bayes theorem: p(θ | x) ∝ p(x | θ) p(θ) For parametric models, we let θ denote a Euclidean parameter and write: p(θ | x) ∝ p(x | θ) p(θ) For Bayesian nonparametric models, we let G be a general stochastic process (an “infinite-dimensional random variable”) and write (non-rigorously): p(G | x) ∝ p(x | G) p(G) This frees us to work with flexible data structures
5
Bayesian Nonparametrics (cont) Examples of stochastic processes we'll mention today include distributions on: –directed trees of unbounded depth and unbounded fan-out –partitions –grammars –sparse binary infinite-dimensional matrices –copulae –distributions General mathematical tool: completely random processes
6
Bayesian Nonparametrics Replace distributions on finite-dimensional objects with distributions on infinite-dimensional objects such as function spaces, partitions and measure spaces –mathematically this simply means that we work with stochastic processes A key construction: random measures These can be used to provide flexibility at the higher levels of Bayesian hierarchies:
7
Stick-Breaking A general way to obtain distributions on countably infinite spaces The classical example: Define an infinite sequence of beta random variables: β_k ~ Beta(1, α), k = 1, 2, … And then define an infinite random sequence of weights as follows: π_k = β_k ∏_{j<k} (1 − β_j) This can be viewed as breaking off portions of a unit-length stick: β_1 of the stick, then β_2 of what remains, and so on
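A minimal numerical sketch of this construction (assuming a concentration parameter α and using NumPy; the function name is illustrative):

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng=None):
    """Draw the first K stick-breaking weights of a GEM(alpha) sequence.

    beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j).
    """
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=K)                       # break points
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                                   # lengths of the broken-off pieces

weights = stick_breaking_weights(alpha=2.0, K=20, rng=0)
print(weights.sum())   # close to 1 for moderately large K
```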
8
Constructing Random Measures It's not hard to see that Σ_k π_k = 1 (wp1) Now define the following object: G = Σ_k π_k δ_{φ_k} –where the atoms φ_k are independent draws from a base distribution G_0 on some space Because the weights sum to one, G is a probability measure: it is a random probability measure The distribution of G is known as a Dirichlet process: G ~ DP(α, G_0) There are other random measures that can be obtained this way, but there are other approaches as well…
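A hedged sketch of drawing such a random probability measure (via a truncated stick-breaking approximation) and then sampling observations from it; the base measure G_0, the truncation level K, and all names are illustrative assumptions:

```python
import numpy as np

def sample_dp_measure(alpha, base_sampler, K, rng=None):
    """Truncated stick-breaking approximation to G ~ DP(alpha, G0).

    Returns atom locations phi_k (drawn i.i.d. from G0) and weights pi_k."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=K)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights /= weights.sum()                        # renormalize after truncation
    atoms = base_sampler(rng, K)                    # independent draws from G0
    return atoms, weights

# Example: base measure G0 = N(0, 1); then draw observations from the random G.
atoms, weights = sample_dp_measure(alpha=1.0,
                                   base_sampler=lambda rng, k: rng.normal(0.0, 1.0, size=k),
                                   K=100, rng=1)
rng = np.random.default_rng(2)
draws = rng.choice(atoms, size=10, p=weights)       # samples from the discrete random measure
print(draws)
```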
9
Random Measures from Stick-Breaking (figure: atoms φ_k on a line with stick-breaking weights π_k)
10
Completely Random Measures Completely random measures are measures on a set Ω that assign independent mass to nonintersecting subsets of Ω –e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof (The Dirichlet process is not a completely random measure –but it is a normalized gamma process) Completely random measures are discrete wp1 (up to a possible deterministic continuous component) Completely random measures are random measures, not necessarily random probability measures (Kingman, 1968)
11
Completely Random Measures Consider a non-homogeneous Poisson process on Ω × R⁺ with rate function obtained from some product measure ν(dω, dp) Sample from this Poisson process and connect the samples vertically to their coordinates in Ω (Kingman, 1968) (figure: sampled atoms and their vertical weights)
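As a rough illustration (not the construction used in the talk's figures), one can approximately simulate such a completely random measure by discarding jumps smaller than ε, integrating the Lévy density on a grid to get the Poisson rate, and drawing jump sizes by inverse CDF; everything here, including the gamma-process example, is an assumption made for illustration:

```python
import numpy as np

def sample_crm(levy_density, base_sampler, eps=1e-3, p_max=1.0,
               grid_size=2000, rng=None):
    """Approximate draw of a completely random measure.

    levy_density(p): density of the Levy measure in the weight coordinate,
    assumed to factor as levy_density(p) dp * B0(d omega). Jumps smaller
    than eps are discarded (the usual truncation approximation)."""
    rng = np.random.default_rng(rng)
    p_grid = np.linspace(eps, p_max, grid_size)
    dens = levy_density(p_grid)
    cell = p_grid[1] - p_grid[0]
    total_mass = dens.sum() * cell                  # Poisson rate of jumps >= eps
    n_atoms = rng.poisson(total_mass)
    cdf = np.cumsum(dens) * cell / total_mass       # inverse-CDF sampling on the grid
    jumps = np.interp(rng.uniform(size=n_atoms), cdf, p_grid)
    locations = base_sampler(rng, n_atoms)          # i.i.d. from the base measure
    return locations, jumps

# Example: gamma process (concentration c = 1) with a uniform base measure on [0, 1].
locs, jumps = sample_crm(lambda p: np.exp(-p) / p,
                         lambda rng, n: rng.uniform(0.0, 1.0, size=n),
                         p_max=10.0, rng=0)
print(len(jumps), jumps.sum())
```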
12
Beta Processes The product measure ν is called a Lévy measure For the beta process, this measure lives on Ω × [0, 1] and is given as follows: ν(dω, dp) = c p⁻¹ (1 − p)^{c−1} dp B₀(dω) –the weight factor is a degenerate Beta(0, c) density and B₀ is the base measure And the resulting random measure can be written simply as: B = Σ_k p_k δ_{ω_k}
13
Beta Processes
14
Gamma Processes, Dirichlet Processes and Bernoulli Processes The gamma process uses an improper gamma density in the Lévy measure: ν(dω, dp) = c p⁻¹ e^{−cp} dp G₀(dω) Once again, the resulting random measure is an infinite weighted sum of atoms To obtain the Dirichlet process, simply normalize the gamma process The Bernoulli process is obtained by treating the beta process as an infinite collection of coins and tossing those coins: Z = Σ_k z_k δ_{ω_k}, where z_k ~ Bernoulli(p_k)
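A small illustrative sketch of a beta process/Bernoulli process pair using the standard finite Beta(α/K, 1) approximation (which recovers the Indian buffet process marginals as K grows); the parameterization and names are assumptions:

```python
import numpy as np

def beta_bernoulli_features(alpha, K, n_rows, rng=None):
    """Finite approximation: pi_k ~ Beta(alpha/K, 1) for large K, then each
    row of Z tosses the K coins independently, z_nk ~ Bernoulli(pi_k).
    As K -> infinity the marginal of Z approaches the Indian buffet process."""
    rng = np.random.default_rng(rng)
    pi = rng.beta(alpha / K, 1.0, size=K)           # approximate beta-process weights
    Z = rng.binomial(1, pi, size=(n_rows, K))       # Bernoulli process draws
    return pi, Z

pi, Z = beta_bernoulli_features(alpha=3.0, K=500, n_rows=10, rng=0)
print(Z.sum(axis=1))   # active features per row; roughly Poisson(alpha)
```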
15
Beta Process and Bernoulli Process
16
BP and BeP Sample Paths
17
Beta Processes vs. Dirichlet Processes The Dirichlet process yields a classification of each data point into a single class (a “cluster”) The beta process allows each data point to belong to multiple classes (a “feature vector”) Essentially, the Dirichlet process yields a single coin with an infinite number of sides Essentially, the beta process yields an infinite collection of coins with mostly small probabilities, and the Bernoulli process tosses those coins to yield a binary feature vector
18
The De Finetti Theorem An infinite sequence of random variables (x_1, x_2, …) is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation Theorem: infinite exchangeability holds if and only if p(x_1, …, x_N) = ∫ ∏_{i=1}^N G(x_i) dP(G) for all N, for some random measure G
19
Random Measures and Their Marginals Taking G to be a completely random or stick-breaking measure, we obtain various interesting combinatorial stochastic processes: –Dirichlet process => Chinese restaurant process (Polya urn) –Beta process => Indian buffet process –Hierarchical Dirichlet process => Chinese restaurant franchise –HDP-HMM => infinite HMM –Nested Dirichlet process => nested Chinese restaurant process
20
Dirichlet Process Mixture Models
21
Chinese Restaurant Process (CRP) A random process in which customers sit down in a Chinese restaurant with an infinite number of tables –first customer sits at the first table –the nth subsequent customer sits at a table drawn from the following distribution: P(occupied table k | F_{n−1}) ∝ m_k, P(next unoccupied table | F_{n−1}) ∝ α –where m_k is the number of customers currently at table k and where F_{n−1} denotes the state of the restaurant after n − 1 customers have been seated
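A short simulation of the CRP as just described (names are illustrative):

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha, rng=None):
    """Return table assignments for n customers under CRP(alpha).

    With n customers seated, the next customer joins table k with
    probability m_k / (n + alpha) and starts a new table with
    probability alpha / (n + alpha)."""
    rng = np.random.default_rng(rng)
    assignments = [0]                       # first customer sits at the first table
    counts = [1]
    for n in range(1, n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= n + alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)                # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(chinese_restaurant_process(20, alpha=1.5, rng=0))
```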
22
The CRP and Clustering Data points are customers; tables are mixture components –the CRP defines a prior distribution on the partitioning of the data and on the number of tables This prior can be completed with: –a likelihood; e.g., associate a parameterized probability distribution with each table –a prior for the parameters; the first customer to sit at table k chooses the parameter vector θ_k for that table from a prior So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality
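A hedged generative sketch of such a CRP mixture with a Gaussian likelihood and a normal prior on the per-table means; the specific likelihood and prior values are illustrative assumptions, not the ones used in the talk:

```python
import numpy as np

def crp_gaussian_mixture(n, alpha, prior_mean=0.0, prior_std=3.0,
                         obs_std=0.5, rng=None):
    """Generate data from a CRP mixture: each new table draws its mean from
    the normal prior; each customer draws an observation from the Gaussian
    at its table."""
    rng = np.random.default_rng(rng)
    counts, means, data, labels = [], [], [], []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                       # first customer at a new table
            counts.append(0)
            means.append(rng.normal(prior_mean, prior_std))
        counts[k] += 1
        labels.append(k)
        data.append(rng.normal(means[k], obs_std))
    return np.array(data), labels

x, z = crp_gaussian_mixture(200, alpha=1.0, rng=0)
print(len(set(z)), "clusters generated")
```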
23
CRP Prior, Gaussian Likelihood, Conjugate Prior
24
Application: Protein Modeling A protein is a folded chain of amino acids The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles) Empirical plots of phi and psi angles are called Ramachandran diagrams
25
Application: Protein Modeling We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood We thus have a linked set of density estimation problems
26
Multiple Estimation Problems We often face multiple, related estimation problems E.g., multiple Gaussian means: y_{ij} ~ N(θ_i, σ²), j = 1, …, n_i, for groups i = 1, …, m Maximum likelihood: θ̂_i = ȳ_i, the sample mean of group i Maximum likelihood often doesn't work very well –want to “share statistical strength”
27
Hierarchical Bayesian Approach The Bayesian or empirical Bayesian solution is to view the parameters θ_i as random variables, related via an underlying variable (e.g., θ_i ~ N(μ, τ²)) Given this overall model, posterior inference yields shrinkage: the posterior mean for each group combines data from all of the groups
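For concreteness, a small sketch of the shrinkage computation under a known-variance normal-normal model, with μ, τ², and σ² treated as known; the model and names are assumptions made for illustration:

```python
import numpy as np

def shrinkage_estimates(group_means, n_per_group, sigma2, mu, tau2):
    """Posterior means for theta_i under y_ij ~ N(theta_i, sigma2),
    theta_i ~ N(mu, tau2), with mu, tau2, sigma2 known.

    Each estimate is a precision-weighted average of the group mean and mu,
    so it is "shrunk" toward the shared value mu."""
    group_means = np.asarray(group_means, dtype=float)
    w = (n_per_group / sigma2) / (n_per_group / sigma2 + 1.0 / tau2)
    return w * group_means + (1.0 - w) * mu

ybar = np.array([2.3, -0.4, 1.1, 5.0])
print(shrinkage_estimates(ybar, n_per_group=5, sigma2=1.0, mu=ybar.mean(), tau2=0.5))
```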
28
Hierarchical Modeling (figure: the plate notation for the hierarchical model and the equivalent expanded graph)
29
Hierarchical Dirichlet Process Mixtures
30
Chinese Restaurant Franchise (CRF) (figure: a global menu of dishes shared by Restaurant 1, Restaurant 2, and Restaurant 3)
31
Protein Folding (cont.) We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood
32
Protein Folding (cont.)
33
Speaker Diarization (figure: an audio timeline segmented by speaker: John, Jane, Bob, Jill)
34
Motion Capture Analysis Goal: Find coherent “behaviors” in the time series that transfer to other time series (e.g., jumping, reaching)
35
Hidden Markov Models (figure, built up over several slides: a sequence of hidden states over time generating observations)
39
Issues with HMMs How many states should we use? –we don’t know the number of speakers a priori –we don’t know the number of behaviors a priori How can we structure the state space? –how to encode the notion that a particular time series makes use of a particular subset of the states? –how to share states among time series? We’ll develop a Bayesian nonparametric approach to HMMs that solves these problems in a simple and general way
40
HDP-HMM Dirichlet process: –state space of unbounded cardinality Hierarchical DP: –ties state transition distributions (figure: graphical model of the state sequence over time)
41
HDP-HMM Average transition distribution: β ~ GEM(γ) State-specific transition distributions: π_j ~ DP(α, β) –the sparsity of β is shared across the state-specific distributions π_j
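A sketch of this construction using the common weak-limit (finite Dirichlet) approximation with L states; the approximation and parameter names are assumptions:

```python
import numpy as np

def hdp_hmm_transitions(L, gamma, alpha, rng=None):
    """Weak-limit approximation to the HDP-HMM with L states.

    beta ~ Dirichlet(gamma/L, ..., gamma/L) is the shared average transition
    distribution; each row pi_j ~ Dirichlet(alpha * beta), so all rows share
    beta's sparsity pattern."""
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))
    pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(L)])
    return beta, pi

beta, pi = hdp_hmm_transitions(L=10, gamma=2.0, alpha=5.0, rng=0)
print(pi.shape, pi.sum(axis=1))   # each row is a transition distribution
```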
42
State Splitting HDP-HMM inadequately models temporal persistence of states DP bias insufficient to prevent unrealistically rapid dynamics Reduces predictive performance
43
“Sticky” HDP-HMM Add a state-specific component to the base measure: π_j ~ DP(α + κ, (αβ + κδ_j)/(α + κ)) Increased probability of self-transition (figure: original vs. sticky transition distributions)
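The same weak-limit sketch with the sticky self-transition bias κ added to each row's own index (again an illustrative approximation):

```python
import numpy as np

def sticky_hdp_hmm_transitions(L, gamma, alpha, kappa, rng=None):
    """Weak-limit sticky HDP-HMM: row j gets an extra mass kappa on its own
    index, pi_j ~ Dirichlet(alpha * beta + kappa * e_j), which biases the
    chain toward self-transitions (temporal persistence)."""
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))
    pi = np.vstack([rng.dirichlet(alpha * beta + kappa * np.eye(L)[j])
                    for j in range(L)])
    return beta, pi

beta, pi = sticky_hdp_hmm_transitions(L=10, gamma=2.0, alpha=5.0, kappa=10.0, rng=0)
print(np.diag(pi).mean(), "average self-transition probability")
```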
45
Speaker Diarization (figure: an audio timeline segmented by speaker: John, Jane, Bob, Jill)
46
NIST Evaluations NIST Rich Transcription 2004-2007 meeting recognition evaluations 21 meetings ICSI results have been the current state of the art (figure: meeting-by-meeting comparison)
47
Results: 21 meetings

                      Overall DER   Best DER   Worst DER
Sticky HDP-HMM        17.84%        1.26%      34.29%
Non-Sticky HDP-HMM    23.91%        6.26%      46.95%
ICSI                  18.37%        4.39%      32.23%
48
HDP-PCFG The most successful parsers of natural language are based on lexicalized probabilistic context-free grammars (PCFGs) We want to learn PCFGs from data without hand-tuning of the number or kind of lexical categories
49
HDP-PCFG
50
The Beta Process The Dirichlet process naturally yields a multinomial random variable (which table is the customer sitting at?) Problem: in many problem domains we have a very large (combinatorial) number of possible tables –using the Dirichlet process means having a large number of parameters, which may overfit –perhaps instead want to characterize objects as collections of attributes (“sparse features”)? –i.e., binary matrices with more than one 1 in each row
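The marginal of the beta/Bernoulli construction over such binary matrices is the Indian buffet process; a short simulation sketch (the function name is illustrative):

```python
import numpy as np

def indian_buffet_process(n_customers, alpha, rng=None):
    """Sample a binary feature matrix from the IBP(alpha).

    Customer n samples each previously used dish k with probability m_k / n,
    then tries Poisson(alpha / n) brand-new dishes."""
    rng = np.random.default_rng(rng)
    dish_counts = []                       # m_k for each dish so far
    rows = []
    for n in range(1, n_customers + 1):
        row = [int(rng.random() < m / n) for m in dish_counts]
        new_dishes = rng.poisson(alpha / n)
        row.extend([1] * new_dishes)
        dish_counts = [m + r for m, r in zip(dish_counts, row)] + [1] * new_dishes
        rows.append(row)
    K = len(dish_counts)
    Z = np.zeros((n_customers, K), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = indian_buffet_process(10, alpha=2.0, rng=0)
print(Z)
```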
51
Beta Processes
52
Beta Process and Bernoulli Process
53
BP and BeP Sample Paths
54
Multiple Time Series Goals: –transfer knowledge among related time series in the form of a library of “behaviors” –allow each time series model to make use of an arbitrary subset of the behaviors Method: –represent behaviors as states in a nonparametric HMM –use the beta/Bernoulli process to pick out subsets of states
55
BP-AR-HMM Bernoulli process determines which states are used Beta process prior: –encourages sharing –allows variability
56
Motion Capture Results
57
Document Corpora “Bag-of-words” models of document corpora have a long history in the field of information retrieval The bag-of-words assumption is an exchangeability assumption for the words in a document –it is generally a very rough reflection of reality –but it is useful, for both computational and statistical reasons Examples –text (bags of words) –images (bags of patches) –genotypes (bags of polymorphisms) Motivated by De Finetti, we wish to find useful latent representations for these “documents”
58
Short History of Document Modeling Finite mixture models Finite admixture models (latent Dirichlet allocation) Bayesian nonparametric admixture models (the hierarchical Dirichlet process) Abstraction hierarchy mixture models (nested Chinese restaurant process) Abstraction hierarchy admixture models (nested beta process)
59
Finite Mixture Models The mixture components are distributions on individual words in some vocabulary (e.g., for text documents, a multinomial over lexical items) –often referred to as “topics” The generative model of a document: –select a mixture component –repeatedly draw words from this mixture component The mixing proportions are corpus-level, not document-specific Major drawback: each document can express only a single topic
60
Finite Admixture Models The mixture components are distributions on individual words in some vocabulary (e.g., for text documents, a multinomial over lexical items) –often referred to as “topics” The generative model of a document: –repeatedly select a mixture component –draw a word from this mixture component The mixing proportions are document-specific Now each document can express multiple topics
61
Latent Dirichlet Allocation A word is represented as a multinomial random variable w A topic allocation is represented as a multinomial random variable z A document is modeled via a Dirichlet random variable θ (its topic proportions) The variables α and β are hyperparameters (Blei, Ng, and Jordan, 2003)
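A compact sketch of the LDA generative process described above; the hyperparameter values and names are illustrative assumptions:

```python
import numpy as np

def lda_generate(n_docs, doc_len, n_topics, vocab_size,
                 alpha=0.5, eta=0.1, rng=None):
    """Generative process of LDA: topics ~ Dirichlet(eta), per-document
    proportions theta ~ Dirichlet(alpha), topic allocation z ~ Multinomial(theta),
    word w ~ Multinomial(topic_z)."""
    rng = np.random.default_rng(rng)
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))
        z = rng.choice(n_topics, size=doc_len, p=theta)      # topic allocations
        words = [rng.choice(vocab_size, p=topics[k]) for k in z]
        docs.append(words)
    return topics, docs

topics, docs = lda_generate(n_docs=3, doc_len=20, n_topics=4, vocab_size=50, rng=0)
print(docs[0])
```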
62
Abstraction Hierarchies Words in documents are organized not only by topics but also by level of abstraction Models such as LDA don’t capture this notion; common words often appear repeatedly across many topics Idea: Let documents be represented as paths down a tree of topics, placing topics focused on common words near the top of the tree
63
Nesting In the hierarchical models, all of the atoms are available to all of the “groups” In nested models, the atoms are subdivided into smaller collections, and the groups select among the collections E.g., the nested CRP of Blei, Griffiths and Jordan -each table in the Chinese restaurant indexes another Chinese restaurant E.g., the nested DP of Rodriguez, Dunson and Gelfand -each atom in a high-level DP is itself a DP
64
Nested Chinese Restaurant Process (Blei, Griffiths and Jordan, JACM, 2010)
65
Nested Chinese Restaurant Process
70
Hierarchical Latent Dirichlet Allocation The generative model for a document is as follows: -use the nested CRP to select a path down the infinite tree -draw θ to obtain a distribution on levels along the path -repeatedly draw a level from θ and draw a word from the topic distribution at that level
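A rough sketch of the path-selection step via a truncated-depth nested CRP, followed by a per-document level distribution; the depth, the symmetric Dirichlet on levels, and all names are illustrative assumptions rather than the exact priors in the paper:

```python
import numpy as np

def ncrp_paths(n_docs, depth, gamma, rng=None):
    """Sample document paths down a depth-'depth' tree via the nested CRP:
    at each node a document picks an existing child with probability
    proportional to its count, or a brand-new child with probability
    proportional to gamma."""
    rng = np.random.default_rng(rng)
    child_counts = {}                      # node (tuple path) -> list of child counts
    paths = []
    for _ in range(n_docs):
        node = ()
        for _ in range(depth):
            counts = child_counts.setdefault(node, [])
            probs = np.array(counts + [gamma], dtype=float)
            probs /= probs.sum()
            c = rng.choice(len(probs), p=probs)
            if c == len(counts):
                counts.append(0)           # create a new child restaurant
            counts[c] += 1
            node = node + (c,)
        paths.append(node)
    return paths

# Each document then draws a level distribution and mixes over the topics
# sitting at the nodes on its path (sketched here with a symmetric Dirichlet).
rng = np.random.default_rng(0)
for path in ncrp_paths(n_docs=5, depth=3, gamma=1.0, rng=0):
    level_dist = rng.dirichlet(np.ones(3))
    print(path, np.round(level_dist, 2))
```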
71
Psychology Today Abstracts
72
Pros and Cons of Hierarchical LDA The hierarchical LDA model yields more interpretable topics than flat LDA The maximum a posteriori tree can be interesting But the model is out of the spirit of classical topic models because it only allows one “theme” per document We would like a model that can both mix over levels in a tree and mix over paths within a single document
73
Nested Beta Processes In the nested CRP, at each level of the tree the procedure selects only a single outgoing branch Instead of using the CRP, use the beta process (or the gamma process) together with the Bernoulli process (or the Poisson process) to allow multiple branches to be selected:
74
Nested Beta Processes
77
Conclusions Hierarchical modeling and nesting have at least as important a role to play in Bayesian nonparametrics as they play in classical Bayesian parametric modeling In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies Completely random measures are an important tool in constructing nonparametric priors For papers, software, tutorials and more details: www.cs.berkeley.edu/~jordan/publications.html