Nonparametric Bayesian Learning Michael I. Jordan University of California, Berkeley September 17, 2009 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux
Computer Science and Statistics Separated in the 40's and 50's, but merging in the 90's and 00’s What computer science has done well: data structures and algorithms for manipulating data structures What statistics has done well: managing uncertainty and justification of algorithms for making decisions under uncertainty What machine learning attempts to do: hasten the merger along
Bayesian Nonparametrics At the core of Bayesian inference is Bayes theorem: For parametric models, we let denote a Euclidean parameter and write: For Bayesian nonparametric models, we let be a general stochastic process (an “infinite-dimensional random variable”) and write: This frees us to work with flexible data structures
Bayesian Nonparametrics (cont) Examples of stochastic processes we'll mention today include distributions on: directed trees of unbounded depth and unbounded fan-out partitions sparse binary infinite-dimensional matrices copulae distributions General mathematical tool: completely random processes
Hierarchical Bayesian Modeling Hierarchical modeling is a key idea in Bayesian inference It's essentially a form of recursion in the parametric setting, it just means that priors on parameters can themselves be parameterized in the nonparametric setting, it means that a stochastic process can have as a parameter another stochastic process
Speaker Diarization John Jane Bob Ji l l The speaker diarization task involves parsing an audio file into a set of labels of who is speaking when. One approach to this problem is to model the audio, or features of the audio, as observations generated from a hidden Markov model. Recall that the HMM is a class of doubly stochastic processes consisting of an underlying discrete-valued Markov process and a set of observations which are conditionally independent given this state sequence. Here, the state of the HMM represents the speaker label. Let’s assume we know there are K people speaking. If we use an HMM to model this process, the Markov structure on the state sequence dictates that the distribution over who is speaking at the next time step solely depends on who is speaking at the current time step. There are two key features of this application that are common to many datasets modeled by HMMs. The first is that there is a high probability of self-transition, because if you are currently speaking, you are likely to continue speaking. The second is that the speaker-specific emissions are complex, multi-modal distributions.
Motion Capture Analysis Goal: Find coherent “behaviors” in the time series that transfer to other time series (e.g., jumping, reaching) 7
Hidden Markov Models Time State states observations Page 8 Now that we’ve discussed the Dirichlet process, we’ll go through some background on hidden Markov models and then show how the Dirichlet process can be used as a prior on the parameters of these models. I am assuming that most of you are familiar with HMMs, so I’ll run through this quickly. The primary goal of these slides is to establish some notation that will make it easier to parse the HDP-HMM. The hidden Markov model is a class of doubly stochastic processes consisting of an underlying discrete state sequence, represented by the z variables, which is modeled as Markov with state-specific transition distributions \pi. You are probably more familiar with defining a transition matrix P, with each element defining the probability of going from state i to state j, but we will find it useful to instead talk about the collection of transitions distributions that comprise the rows of this matrix. Conditioned on the state sequence, the observations y are independent emissions from distributions parameterized by the \theta parameter for that state. One can view a sample path of the state sequence as a walk through a state vs. time lattice. Page 8 8
Hidden Markov Models Time modes observations Page 9 Assume we start in mode 2 at the first time step. Here we show the set of possible transitions from time step 1 to time step 2. The weights of the arrows indicate the probability of each transition. These probabilities are captured by the transition distribution \pi_2. Page 9 9
Hidden Markov Models Time states observations Page 10 10 Now assume we transitioned to state 1 and have the following possible transitions… Page 10 10
Hidden Markov Models Time states observations And so on. Page 11 11
Issues with HMMs How many states should we use? we don’t know the number of speakers a priori we don’t know the number of behaviors a priori How can we structure the state space? how to encode the notion that a particular time series makes use of a particular subset of the states? how to share states among time series? We’ll develop a Bayesian nonparametric approach to HMMs that solves these problems in a simple and general way
Bayesian Nonparametrics Replace distributions on finite-dimensional objects with distributions on infinite-dimensional objects such as function spaces, partitions and measure spaces mathematically this simply means that we work with stochastic processes A key construction: random measures These are often used to provide flexibility at the higher levels of Bayesian hierarchies:
Stick-Breaking A general way to obtain distributions on countably infinite spaces The classical example: Define an infinite sequence of beta random variables: And then define an infinite random sequence as follows: This can be viewed as breaking off portions of a stick:
Constructing Random Measures It's not hard to see that (wp1) Now define the following object: where are independent draws from a distribution on some space Because , is a probability measure---it is a random measure The distribution of is known as a Dirichlet process: What exchangeable marginal distribution does this yield when integrated against in the De Finetti setup?
Chinese Restaurant Process (CRP) A random process in which customers sit down in a Chinese restaurant with an infinite number of tables first customer sits at the first table th subsequent customer sits at a table drawn from the following distribution: where is the number of customers currently at table and where denotes the state of the restaurant after customers have been seated
The CRP and Clustering Data points are customers; tables are mixture components the CRP defines a prior distribution on the partitioning of the data and on the number of tables This prior can be completed with: a likelihood---e.g., associate a parameterized probability distribution with each table a prior for the parameters---the first customer to sit at table chooses the parameter vector, , for that table from a prior So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality
CRP Prior, Gaussian Likelihood, Conjugate Prior
Dirichlet Process Mixture Models
Multiple estimation problems We often face multiple, related estimation problems E.g., multiple Gaussian means: Maximum likelihood: Maximum likelihood often doesn't work very well want to “share statistical strength”
Hierarchical Bayesian Approach The Bayesian or empirical Bayesian solution is to view the parameters as random variables, related via an underlying variable Given this overall model, posterior inference yields shrinkage---the posterior mean for each combines data from all of the groups
Hierarchical Modeling The plate notation: Equivalent to:
Hierarchical Dirichlet Process Mixtures (Teh, Jordan, Beal, & Blei, JASA 2006)
Marginal Probabilities First integrate out the , then integrate out
Chinese Restaurant Franchise (CRF)
Application: Protein Modeling A protein is a folded chain of amino acids The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles) Empirical plots of phi and psi angles are called Ramachandran diagrams
Application: Protein Modeling We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood We thus have a linked set of density estimation problems
Protein Folding (cont.) We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood
Protein Folding (cont.)
Nonparametric Hidden Markov models Essentially a dynamic mixture model in which the mixing proportion is a transition probability Use Bayesian nonparametric tools to allow the cardinality of the state space to be random obtained from the Dirichlet process point of view (Teh, et al, “HDP-HMM”)
Hidden Markov Models Time State states observations Now that we’ve discussed the Dirichlet process, we’ll go through some background on hidden Markov models and then show how the Dirichlet process can be used as a prior on the parameters of these models. I am assuming that most of you are familiar with HMMs, so I’ll run through this quickly. The primary goal of these slides is to establish some notation that will make it easier to parse the HDP-HMM. The hidden Markov model is a class of doubly stochastic processes consisting of an underlying discrete state sequence, represented by the z variables, which is modeled as Markov with state-specific transition distributions \pi. You are probably more familiar with defining a transition matrix P, with each element defining the probability of going from state i to state j, but we will find it useful to instead talk about the collection of transitions distributions that comprise the rows of this matrix. Conditioned on the state sequence, the observations y are independent emissions from distributions parameterized by the \theta parameter for that state. One can view a sample path of the state sequence as a walk through a state vs. time lattice.
HDP-HMM Dirichlet process: Hierarchical Bayes: Time State state space of unbounded cardinality Hierarchical Bayes: ties state transition distributions The HDP-HMM allows for an unbounded number of possible states. That is, K goes to infinity, by defining a set of countably infinite transition distributions. The Dirichlet process part of the HDP allows for this unbounded state space, just like it allowed for an unknown number of mixture components in the mixture of Gaussian model. In addition, the Dirichlet process encourages the use of only a spare subset of these HMM states, which is analogous to the reinforcement of mixture components. The hierarchical layering of these processes ties together the state spaces of each state-specific transition distribution, and through this process, creates a shared sparse set of possible states.
HDP-HMM Average transition distribution: A bit more formally, we start with an average transition distribution defined according to the stick-breaking construction and then use this distribution to define an infinite set of state-specific transition distributions, each of which is distributed according to a Dirichlet process with \beta as the base measure. This implies that the expected set of weights of each of these distributions is equivalent to \beta. Thus, the sparsity induced by \beta is shared by each of the different state-specific transitions distributions. State-specific transition distributions: sparsity of b is shared
State Splitting Let’s return to the 3-mode HMM example with the true labels shown here and the inferred labels shown here with errors shown in red. As before, we see the split into redundant states which are rapidly switched between. In this scenario, the DP’s bias towards simpler models is insufficient in preventing this unrealistically fast switching. There are a few things to note about this issue. First, splitting into redundant states can reduce the predictive performance of the learned model since each state has fewer observations from which to infer model parameters. Second, in applications such as speaker diarization, one cares about the accuracy of the inferred label sequence and we’re not just performing model averaging. HDP-HMM inadequately models temporal persistence of states DP bias insufficient to prevent unrealistically rapid dynamics Reduces predictive performance
“Sticky” HDP-HMM original sticky state-specific base measure Specifically, we consider augmenting the HDP-HMM by adding a self-transition parameter \kappa. The average transition density \beta remains the same, but every state-specific transition density is defined according to a Dirichlet process with an added weight on the component of the base measure corresponding to a self-transition. Now, the expected transition distribution has weights which are a convex combination of the global weights and state-specific weights. We can qualitatively compare to the transition distributions we had before, and see that we now have a larger probability of self-transition. state-specific base measure Increased probability of self-transition
Speaker Diarization John Jane Bob Ji l l We return to the NIST speaker diarization database described at the beginning of the talk. Recall that this database consists of 21 recorded conference meetings with ground truth labels, and from this data, we aim to both learn the number of speakers and segment the audio into speaker-homogenous regions.
Meeting by Meeting Comparison NIST Evaluations Meeting by Meeting Comparison NIST Rich Transcription 2004-2007 meeting recognition evaluations 21 meetings ICSI results have been the current state-of-the-art One dataset that we revisit later in the talk is the NIST Rich Transcription set of 21 meetings used for evaluations in 2004-2007. For the past 6 years the Berkeley ICSI team has won the NIST competition by a large margin. Their approach is based on agglomerative clustering. This system is highly engineered to this task and has been developed over many years by a large team of researchers. We will show that the nonparametric Bayesian model we develop provides performance that is competitive with this state-of-the-art, and with significant improvements over the performance achieved by the original HDP-HMM. In this plot, we show the official NIST speaker diarization error rate, or DER, that each of these algorithms achieved on the 21 meetings. This plot clearly demonstrates the importance of the extensions we develop in this talk. 37
Results: 21 meetings Overall DER Best DER Worst DER Sticky HDP-HMM Using 50,000 Gibbs iterations just on meeting 16, our results are now as follows. We have an overall DER that is slightly better than that of ICSI. Furthermore, if you do a meeting-by-meeting comparison we see that we are beating ICSI on 13 out of the 21 meetings. Overall DER Best DER Worst DER Sticky HDP-HMM 17.84% 1.26% 34.29% Non-Sticky HDP-HMM 23.91% 6.26% 46.95% ICSI 18.37% 4.39% 32.23% 38
Results: Meeting 1 (AMI_20041210-1052) To show some qualitative results, here is a meeting in which we had a very low DER of 1.26%. The true speaker labels are plotted here and our most-likely estimated sequence here. Errors are shown in red. Sticky DER = 1.26% ICSI DER = 7.56%
Results: Meeting 18 (VT_20050304-1300) Sticky DER = 4.81% ICSI DER = 22.00%
Results: Meeting 16 (NIST_20051102-1323) Let’s look at that meeting in a bit more detail. Here we show plots of log-likelihood and Hamming distance versus Gibbs iteration for each of the 10 initializations of the sampler. We had previously run the Gibbs sampler to 10,000 iterations, but we see that the sampler is very slow to mix so we decided to run the sampler out to 100,000 iterations. For this meeting, the max-likelihood sample corresponds to the trial with a Hamming distance significantly lower than the other meetings. However, there is a cluster of trials that merge a speaker, such as depicted here. Thus, this segmentation becomes the one that minimizes the expected Hamming distance error. That is, the “typical” segmentation. We thought that running the sampler longer would allow for more chains to find that speaker, but unfortunately the sampler is just extremely slow to mix. The reason behind this is the fact that the parameter associated with a new state is just a draw from our prior; that draw has to somehow better explain the merged speaker than the other parameters that have already been informed by the data. In high-dimensional settings, such as this one, that can take a while to accomplish. However, the likelihood of this segmentation has increased enough to separate from the other chains so that most of them are thrown out as not having mixed.
The Beta Process The Dirichlet process naturally yields a multinomial random variable (which table is the customer sitting at?) Problem: in many problem domains we have a very large (combinatorial) number of possible tables using the Dirichlet process means having a large number of parameters, which may overfit perhaps instead want to characterize objects as collections of attributes (“sparse features”)? i.e., binary matrices with more than one 1 in each row
Completely Random Processes (Kingman, 1968) Completely random measures are measures on a set that assign independent mass to nonintersecting subsets of e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof (The Dirichlet process is not a completely random process but it's a normalized gamma process) Completely random processes are discrete wp1 (up to a possible deterministic continuous component) Completely random processes are random measures, not necessarily random probability measures
Completely Random Processes (Kingman, 1968) x x x x x x x x x x x x x x x Assigns independent mass to nonintersecting subsets of
Completely Random Processes (Kingman, 1968) Consider a non-homogeneous Poisson process on with rate function obtained from some product measure Sample from this Poisson process and connect the samples vertically to their coordinates in x
Beta Processes The product measure is called a Levy measure (Hjort, Kim, et al.) The product measure is called a Levy measure For the beta process, this measure lives on and is given as follows: And the resulting random measure can be written simply as: degenerate Beta(0,c) distribution Base measure
Beta Processes
Beta Process and Bernoulli Process
BP and BeP Sample Paths
Beta Process Marginals (Thibaux & Jordan, 2007) Theorem: The beta process is the De Finetti mixing measure underlying the a stochastic process on binary matrices known as the Indian buffet process (IBP)
Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2002) Indian restaurant with infinitely many dishes in a buffet line Customers through enter the restaurant the first customer samples dishes the th customer samples a previously sampled dish with probability then samples new dishes
Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2002) Indian restaurant with infinitely many dishes in a buffet line Customers through enter the restaurant the first customer samples dishes the th customer samples a previously sampled dish with probability then samples new dishes
Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2002) Indian restaurant with infinitely many dishes in a buffet line Customers through enter the restaurant the first customer samples dishes the th customer samples a previously sampled dish with probability then samples new dishes
Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2002) Indian restaurant with infinitely many dishes in a buffet line Customers through enter the restaurant the first customer samples dishes the th customer samples a previously sampled dish with probability then samples new dishes
Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2002) Indian restaurant with infinitely many dishes in a buffet line Customers through enter the restaurant the first customer samples dishes the th customer samples a previously sampled dish with probability then samples new dishes
Beta Process Point of View The IBP is usually derived by taking a finite limit of a process on a finite matrix But this leaves some issues somewhat obscured: is the IBP exchangeable? why the Poisson number of dishes in each row? is the IBP conjugate to some stochastic process? These issues are clarified from the beta process point of view A draw from a beta process yields a countably infinite set of coin-tossing probabilities, and each draw from the Bernoulli process tosses these coins independently
Hierarchical Beta Processes A hierarchical beta process is a beta process whose base measure is itself random and drawn from a beta process
Multiple Time Series Goals: Method: transfer knowledge among related time series in the form of a library of “behaviors” allow each time series model to make use of an arbitrary subset of the behaviors Method: represent behaviors as states in a nonparametric HMM use the beta/Bernoulli process to pick out subsets of states Specifically, we are going to assume that each of the time series can be modeled as having Markov switching dynamics with an arbitrarily large set of dynamical modes. The way the time series are related is via sharing some subset of the dynamical modes. For example, we might have a collection of motion capture movies of multiple people performing jumping jacks. By jointly modeling the time series, we can improve our estimates of the dynamic parameters for each mode, especially in the case of limited data, and also discover interesting relationships between the sequences. In addition to allowing for infinitely many possible dynamical modes, we would also like to allow for behaviors unique to a given sequence. For example, one movie might have a person doing side twists while another does not. We show that the beta process prior provides a useful Bayesian nonparametric prior in this case.
IBP-AR-HMM Bernoulli process determines which states are used Beta process prior: encourages sharing allows variability We now show how we can use such a feature model to relate multiple time series. We assume that each of our N time series can be described as a switching VAR process. Then, the set of features associated with the time series defines the set of behavioral modes that object switches between. Specifically, for each object, imagine we define a set of feature-constrained transition distributions that restrict the object to only switching between it’s chosen behaviors. One can visualize this process as defining transition distributions in an infinite dimensional space and then doing an element wise product with the object’s feature vector, and renormalizing. The overall generative model is summarized here. Through our beta process prior, we encourage objects to share behavior modes while allowing some object-specific behaviors. Our formulation also allows variability in how the object switches between its chosen behaviors.
Motion Capture Results Here is a plot of the results from that sample. We show a series of skeleton plots, each corresponding to a learned contiguous segment of at least 2 seconds. We then join all of the segments categorized under the same feature label within a box. The color of the box indicates the true behavior category.
Conclusions Hierarchical modeling has at least an important a role to play in Bayesian nonparametrics as is plays in classical Bayesian parametric modeling In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies For papers and more details: www.cs.berkeley.edu/~jordan/publications.html