Nonparametric Bayesian Learning


Nonparametric Bayesian Learning. Michael I. Jordan, University of California, Berkeley. September 17, 2009. Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux.

Computer Science and Statistics. Separated in the 40's and 50's, but merging in the 90's and 00's. What computer science has done well: data structures and algorithms for manipulating data structures. What statistics has done well: managing uncertainty and justification of algorithms for making decisions under uncertainty. What machine learning attempts to do: hasten the merger along.

Bayesian Nonparametrics. At the core of Bayesian inference is Bayes' theorem: posterior $\propto$ likelihood $\times$ prior. For parametric models, we let $\theta$ denote a Euclidean parameter and write $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$. For Bayesian nonparametric models, we let $G$ be a general stochastic process (an "infinite-dimensional random variable") and write $p(G \mid x) \propto p(x \mid G)\, p(G)$. This frees us to work with flexible data structures.

Bayesian Nonparametrics (cont). Examples of stochastic processes we'll mention today include distributions on: directed trees of unbounded depth and unbounded fan-out; partitions; sparse binary infinite-dimensional matrices; copulae; distributions. General mathematical tool: completely random processes.

Hierarchical Bayesian Modeling. Hierarchical modeling is a key idea in Bayesian inference; it's essentially a form of recursion. In the parametric setting, it just means that priors on parameters can themselves be parameterized. In the nonparametric setting, it means that a stochastic process can have as a parameter another stochastic process.

Speaker Diarization John Jane Bob Jill The speaker diarization task involves parsing an audio file into a set of labels of who is speaking when. One approach to this problem is to model the audio, or features of the audio, as observations generated from a hidden Markov model. Recall that the HMM is a class of doubly stochastic processes consisting of an underlying discrete-valued Markov process and a set of observations which are conditionally independent given this state sequence. Here, the state of the HMM represents the speaker label. Let’s assume we know there are K people speaking. If we use an HMM to model this process, the Markov structure on the state sequence dictates that the distribution over who is speaking at the next time step solely depends on who is speaking at the current time step. There are two key features of this application that are common to many datasets modeled by HMMs. The first is that there is a high probability of self-transition, because if you are currently speaking, you are likely to continue speaking. The second is that the speaker-specific emissions are complex, multi-modal distributions.

Motion Capture Analysis Goal: Find coherent “behaviors” in the time series that transfer to other time series (e.g., jumping, reaching)

Hidden Markov Models (figure: state vs. time lattice with states and observations). Now that we’ve discussed the Dirichlet process, we’ll go through some background on hidden Markov models and then show how the Dirichlet process can be used as a prior on the parameters of these models. I am assuming that most of you are familiar with HMMs, so I’ll run through this quickly. The primary goal of these slides is to establish some notation that will make it easier to parse the HDP-HMM. The hidden Markov model is a class of doubly stochastic processes consisting of an underlying discrete state sequence, represented by the z variables, which is modeled as Markov with state-specific transition distributions \pi. You are probably more familiar with defining a transition matrix P, with each element defining the probability of going from state i to state j, but we will find it useful to instead talk about the collection of transition distributions that comprise the rows of this matrix. Conditioned on the state sequence, the observations y are independent emissions from distributions parameterized by the \theta parameter for that state. One can view a sample path of the state sequence as a walk through a state vs. time lattice.
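To pin down the notation used in these HMM slides, here is a minimal sketch of the generative model in display form (standard HMM notation with a generic emission family F; not a formula taken verbatim from the slides):

```latex
% HMM generative model: Markov state sequence z_t, conditionally independent emissions y_t
z_t \mid z_{t-1} \;\sim\; \pi_{z_{t-1}}, \qquad
y_t \mid z_t \;\sim\; F(\theta_{z_t}).
```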

Hidden Markov Models (figure: possible transitions in the state vs. time lattice). Assume we start in mode 2 at the first time step. Here we show the set of possible transitions from time step 1 to time step 2. The weights of the arrows indicate the probability of each transition. These probabilities are captured by the transition distribution \pi_2.

Hidden Markov Models (figure continued). Now assume we transitioned to state 1 and have the following possible transitions…

Hidden Markov Models (figure continued). And so on.

Issues with HMMs. How many states should we use? We don’t know the number of speakers a priori; we don’t know the number of behaviors a priori. How can we structure the state space? How to encode the notion that a particular time series makes use of a particular subset of the states? How to share states among time series? We’ll develop a Bayesian nonparametric approach to HMMs that solves these problems in a simple and general way.

Bayesian Nonparametrics. Replace distributions on finite-dimensional objects with distributions on infinite-dimensional objects such as function spaces, partitions and measure spaces; mathematically, this simply means that we work with stochastic processes. A key construction: random measures. These are often used to provide flexibility at the higher levels of Bayesian hierarchies.

Stick-Breaking. A general way to obtain distributions on countably infinite spaces. The classical example: define an infinite sequence of beta random variables, $\beta_k \sim \mathrm{Beta}(1, \alpha_0)$ for $k = 1, 2, \ldots$, and then define an infinite random sequence of weights as follows: $\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$. This can be viewed as breaking off portions of a unit-length stick.

Constructing Random Measures. It's not hard to see that $\sum_{k=1}^{\infty} \pi_k = 1$ (wp1). Now define the following object: $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$, where the $\theta_k$ are independent draws from a distribution $G_0$ on some space. Because $\sum_k \pi_k = 1$, $G$ is a probability measure---it is a random measure. The distribution of $G$ is known as a Dirichlet process: $G \sim \mathrm{DP}(\alpha_0, G_0)$. What exchangeable marginal distribution does this yield when $G$ is integrated out in the De Finetti setup?
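As a concrete illustration, here is a minimal Python sketch (not from the talk) of a truncated stick-breaking construction of a Dirichlet process draw; the truncation level K and the standard-normal base measure are arbitrary choices for this example:

```python
import numpy as np

def truncated_dp_draw(alpha, K=500, base_sampler=np.random.standard_normal, rng=None):
    """Approximate draw G = sum_k pi_k * delta_{theta_k} from DP(alpha, G0),
    via stick-breaking truncated at K atoms."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=K)                       # beta_k ~ Beta(1, alpha)
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pis = betas * sticks                                       # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    thetas = base_sampler(K)                                   # theta_k ~ G0 (here, standard normal)
    return pis, thetas

# Example: sample 5 observations from the (approximate) random measure G
pis, thetas = truncated_dp_draw(alpha=2.0, rng=0)
idx = np.random.default_rng(1).choice(len(pis), size=5, p=pis / pis.sum())
print(thetas[idx])
```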

Chinese Restaurant Process (CRP). A random process in which customers sit down in a Chinese restaurant with an infinite number of tables: the first customer sits at the first table; the $n$-th subsequent customer sits at a table drawn from the following distribution: $P(\text{occupied table } k \mid \mathcal{F}_{n-1}) \propto m_k$ and $P(\text{next unoccupied table} \mid \mathcal{F}_{n-1}) \propto \alpha_0$, where $m_k$ is the number of customers currently at table $k$ and where $\mathcal{F}_{n-1}$ denotes the state of the restaurant after $n-1$ customers have been seated.
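A minimal sketch of sampling a CRP partition according to the seating probabilities above (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def sample_crp(n_customers, alpha, rng=None):
    """Sample table assignments for n_customers under CRP(alpha)."""
    rng = np.random.default_rng(rng)
    counts = []                       # counts[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        # Probability proportional to table counts, plus alpha for a new table
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)          # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(sample_crp(20, alpha=1.0, rng=0))
```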

The CRP and Clustering. Data points are customers; tables are mixture components: the CRP defines a prior distribution on the partitioning of the data and on the number of tables. This prior can be completed with: a likelihood---e.g., associate a parameterized probability distribution with each table; and a prior for the parameters---the first customer to sit at table $k$ chooses the parameter vector $\theta_k$ for that table from a prior $G_0$. So we now have defined a full Bayesian posterior for a mixture model of unbounded cardinality.

CRP Prior, Gaussian Likelihood, Conjugate Prior

Dirichlet Process Mixture Models

Multiple estimation problems. We often face multiple, related estimation problems. E.g., multiple Gaussian means: $y_{ij} \sim N(\theta_i, \sigma^2)$, $j = 1, \ldots, n_i$. Maximum likelihood: $\hat{\theta}_i = \bar{y}_{i}$, the group sample mean. Maximum likelihood often doesn't work very well; we want to “share statistical strength”.

Hierarchical Bayesian Approach. The Bayesian or empirical Bayesian solution is to view the parameters $\theta_i$ as random variables, related via an underlying variable. Given this overall model, posterior inference yields shrinkage---the posterior mean for each $\theta_i$ combines data from all of the groups.
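For concreteness, a standard worked instance of this shrinkage (a conjugate normal-normal model with known variances and known top-level mean $\mu$; this particular parameterization is an illustration, not taken from the slides):

```latex
% y_{ij} | theta_i ~ N(theta_i, sigma^2),  theta_i | mu ~ N(mu, tau^2)
\mathbb{E}[\theta_i \mid y]
  = \frac{\tau^2}{\tau^2 + \sigma^2/n_i}\,\bar{y}_{i}
  + \frac{\sigma^2/n_i}{\tau^2 + \sigma^2/n_i}\,\mu,
\qquad
\bar{y}_{i} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}.
```

Each posterior mean is pulled toward the shared mean $\mu$, with more shrinkage for groups with less data (smaller $n_i$).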

Hierarchical Modeling (figures: the plate notation and its expanded equivalent).

Hierarchical Dirichlet Process Mixtures (Teh, Jordan, Beal, & Blei, JASA 2006)

Marginal Probabilities. First integrate out the group-level random measures $G_j$, then integrate out the shared base measure $G_0$; this yields the Chinese restaurant franchise.

Chinese Restaurant Franchise (CRF)

Application: Protein Modeling. A protein is a folded chain of amino acids. The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles). Empirical plots of phi and psi angles are called Ramachandran diagrams.

Application: Protein Modeling. We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms. We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood. We thus have a linked set of density estimation problems.

Protein Folding (cont.) We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood

Protein Folding (cont.)

Nonparametric Hidden Markov Models. Essentially a dynamic mixture model in which the mixing proportion is a transition probability. Use Bayesian nonparametric tools to allow the cardinality of the state space to be random, obtained from the Dirichlet process point of view (Teh et al., “HDP-HMM”).

Hidden Markov Models (figure: state vs. time lattice with states and observations, as introduced earlier).

HDP-HMM. Dirichlet process: state space of unbounded cardinality. Hierarchical Bayes: ties state transition distributions. The HDP-HMM allows for an unbounded number of possible states. That is, K is taken to infinity by defining a countably infinite set of transition distributions. The Dirichlet process part of the HDP allows for this unbounded state space, just like it allowed for an unknown number of mixture components in the mixture-of-Gaussians model. In addition, the Dirichlet process encourages the use of only a sparse subset of these HMM states, which is analogous to the reinforcement of mixture components. The hierarchical layering of these processes ties together the state spaces of each state-specific transition distribution, and through this process, creates a shared sparse set of possible states.

HDP-HMM. Average transition distribution: $\beta \mid \gamma \sim \mathrm{GEM}(\gamma)$ (stick-breaking). A bit more formally, we start with an average transition distribution defined according to the stick-breaking construction and then use this distribution to define an infinite set of state-specific transition distributions, each of which is distributed according to a Dirichlet process with $\beta$ as the base measure: $\pi_j \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta)$. This implies that the expected set of weights of each of these distributions is equal to $\beta$. Thus, the sparsity induced by $\beta$ is shared by each of the different state-specific transition distributions.
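A minimal sketch of this construction under a weak-limit (finite, L-state) approximation, which is a common way to simulate or do inference in the HDP-HMM; the truncation level L and all variable names are choices made for this illustration:

```python
import numpy as np

def hdp_hmm_transitions(L, gamma, alpha, rng=None):
    """Weak-limit approximation to the HDP-HMM prior on transition rows:
    beta ~ Dirichlet(gamma/L, ..., gamma/L)          (approximates GEM(gamma))
    pi_j ~ Dirichlet(alpha * beta)  for each state j (approximates DP(alpha, beta))."""
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))
    pis = np.vstack([rng.dirichlet(alpha * beta) for _ in range(L)])
    return beta, pis

beta, pis = hdp_hmm_transitions(L=10, gamma=3.0, alpha=5.0, rng=0)
print(beta.round(3))
print(pis[0].round(3))   # transition distribution out of state 0
```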

State Splitting. Let’s return to the 3-mode HMM example with the true labels shown here and the inferred labels shown here with errors shown in red. As before, we see the split into redundant states which are rapidly switched between. In this scenario, the DP’s bias towards simpler models is insufficient to prevent this unrealistically fast switching. There are a few things to note about this issue. First, splitting into redundant states can reduce the predictive performance of the learned model since each state has fewer observations from which to infer model parameters. Second, in applications such as speaker diarization, one cares about the accuracy of the inferred label sequence and we’re not just performing model averaging. In short: the HDP-HMM inadequately models temporal persistence of states; the DP bias is insufficient to prevent unrealistically rapid dynamics; and this reduces predictive performance.

“Sticky” HDP-HMM (original vs. sticky state-specific base measure). Specifically, we consider augmenting the HDP-HMM by adding a self-transition parameter \kappa. The average transition density \beta remains the same, but every state-specific transition density is defined according to a Dirichlet process with an added weight on the component of the base measure corresponding to a self-transition: $\pi_j \mid \alpha, \kappa, \beta \sim \mathrm{DP}\big(\alpha + \kappa,\; \tfrac{\alpha \beta + \kappa \delta_j}{\alpha + \kappa}\big)$. Now, the expected transition distribution has weights which are a convex combination of the global weights and state-specific weights. We can qualitatively compare to the transition distributions we had before, and see that we now have a larger probability of self-transition.
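Continuing the weak-limit sketch from above, the sticky variant simply adds extra mass $\kappa$ to the self-transition component of each row's Dirichlet parameter (again an illustrative approximation, not the talk's code):

```python
import numpy as np

def sticky_hdp_hmm_transitions(L, gamma, alpha, kappa, rng=None):
    """Weak-limit approximation to the sticky HDP-HMM prior:
    pi_j ~ Dirichlet(alpha * beta + kappa * e_j), boosting self-transitions."""
    rng = np.random.default_rng(rng)
    beta = rng.dirichlet(np.full(L, gamma / L))
    pis = np.vstack([
        rng.dirichlet(alpha * beta + kappa * np.eye(L)[j])   # extra mass on j -> j
        for j in range(L)
    ])
    return beta, pis

beta, pis = sticky_hdp_hmm_transitions(L=10, gamma=3.0, alpha=5.0, kappa=20.0, rng=0)
print(np.diag(pis).round(2))   # self-transition probabilities are now large
```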

Speaker Diarization John Jane Bob Jill We return to the NIST speaker diarization database described at the beginning of the talk. Recall that this database consists of 21 recorded conference meetings with ground truth labels, and from this data, we aim to both learn the number of speakers and segment the audio into speaker-homogenous regions.

NIST Evaluations: Meeting by Meeting Comparison. NIST Rich Transcription 2004-2007 meeting recognition evaluations; 21 meetings; ICSI results have been the current state of the art. One dataset that we revisit later in the talk is the NIST Rich Transcription set of 21 meetings used for evaluations in 2004-2007. For the past 6 years the Berkeley ICSI team has won the NIST competition by a large margin. Their approach is based on agglomerative clustering. This system is highly engineered to this task and has been developed over many years by a large team of researchers. We will show that the nonparametric Bayesian model we develop provides performance that is competitive with this state-of-the-art, and with significant improvements over the performance achieved by the original HDP-HMM. In this plot, we show the official NIST speaker diarization error rate, or DER, that each of these algorithms achieved on the 21 meetings. This plot clearly demonstrates the importance of the extensions we develop in this talk.

Results: 21 meetings. Using 50,000 Gibbs iterations just on meeting 16, our results are now as follows. We have an overall DER that is slightly better than that of ICSI. Furthermore, if you do a meeting-by-meeting comparison we see that we are beating ICSI on 13 out of the 21 meetings.

                       Overall DER   Best DER   Worst DER
  Sticky HDP-HMM           17.84%      1.26%      34.29%
  Non-Sticky HDP-HMM       23.91%      6.26%      46.95%
  ICSI                     18.37%      4.39%      32.23%

Results: Meeting 1 (AMI_20041210-1052) To show some qualitative results, here is a meeting in which we had a very low DER of 1.26%. The true speaker labels are plotted here and our most-likely estimated sequence here. Errors are shown in red. Sticky DER = 1.26% ICSI DER = 7.56%

Results: Meeting 18 (VT_20050304-1300) Sticky DER = 4.81% ICSI DER = 22.00%

Results: Meeting 16 (NIST_20051102-1323) Let’s look at that meeting in a bit more detail. Here we show plots of log-likelihood and Hamming distance versus Gibbs iteration for each of the 10 initializations of the sampler. We had previously run the Gibbs sampler to 10,000 iterations, but we see that the sampler is very slow to mix so we decided to run the sampler out to 100,000 iterations. For this meeting, the max-likelihood sample corresponds to the trial with a Hamming distance significantly lower than the other trials. However, there is a cluster of trials that merge a speaker, such as depicted here. Thus, this segmentation becomes the one that minimizes the expected Hamming distance error. That is, the “typical” segmentation. We thought that running the sampler longer would allow for more chains to find that speaker, but unfortunately the sampler is just extremely slow to mix. The reason behind this is the fact that the parameter associated with a new state is just a draw from our prior; that draw has to somehow better explain the merged speaker than the other parameters that have already been informed by the data. In high-dimensional settings, such as this one, that can take a while to accomplish. However, the likelihood of this segmentation has increased enough to separate from the other chains so that most of them are thrown out as not having mixed.

The Beta Process. The Dirichlet process naturally yields a multinomial random variable (which table is the customer sitting at?). Problem: in many problem domains we have a very large (combinatorial) number of possible tables; using the Dirichlet process means having a large number of parameters, which may overfit. Perhaps we instead want to characterize objects as collections of attributes (“sparse features”)? I.e., binary matrices with more than one 1 in each row.

Completely Random Processes (Kingman, 1968). Completely random measures are measures on a set $\Omega$ that assign independent mass to nonintersecting subsets of $\Omega$; e.g., Brownian motion, gamma processes, beta processes, compound Poisson processes and limits thereof. (The Dirichlet process is not a completely random process, but it is a normalized gamma process.) Completely random processes are discrete wp1 (up to a possible deterministic continuous component). Completely random processes are random measures, not necessarily random probability measures.

Completely Random Processes (Kingman, 1968) (figure: atoms of a completely random measure). Assigns independent mass to nonintersecting subsets of $\Omega$.

Completely Random Processes (Kingman, 1968). Consider a non-homogeneous Poisson process on $\Omega \times \mathbb{R}_+$ with rate function obtained from some product measure. Sample from this Poisson process and connect the samples vertically to their coordinates in $\Omega$ (the x-axis).

Beta Processes. The product measure is called a Lévy measure (Hjort, Kim, et al.). For the beta process, this measure lives on $\Omega \times [0,1]$ and is given as follows: $\nu(d\omega, dp) = c\, p^{-1} (1-p)^{c-1}\, dp\, B_0(d\omega)$, i.e., a degenerate Beta(0, c) distribution times the base measure $B_0$. And the resulting random measure can be written simply as $B = \sum_k p_k\, \delta_{\omega_k}$.
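As a concrete (and deliberately simplified) illustration of the beta process paired with the Bernoulli process of the following slides, here is a finite-K approximation in the c = 1 case, with feature probabilities $\pi_k \sim \mathrm{Beta}(\alpha/K, 1)$, which recovers the IBP as $K \to \infty$; all names are illustrative, not from the talk:

```python
import numpy as np

def finite_beta_bernoulli(n_objects, K, alpha, rng=None):
    """Finite approximation to the beta process (c = 1) paired with Bernoulli draws:
    pi_k ~ Beta(alpha/K, 1), z_nk ~ Bernoulli(pi_k).
    Rows of Z are sparse binary feature vectors; K -> infinity recovers the IBP."""
    rng = np.random.default_rng(rng)
    pis = rng.beta(alpha / K, 1.0, size=K)            # feature (coin-tossing) probabilities
    Z = rng.binomial(1, pis, size=(n_objects, K))      # each object tosses every coin
    return pis, Z

pis, Z = finite_beta_bernoulli(n_objects=8, K=50, alpha=3.0, rng=0)
print(Z.sum(axis=1))   # active features per object (roughly Poisson(alpha) for large K)
```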

Beta Processes

Beta Process and Bernoulli Process

BP and BeP Sample Paths

Beta Process Marginals (Thibaux & Jordan, 2007) Theorem: The beta process is the De Finetti mixing measure underlying a stochastic process on binary matrices known as the Indian buffet process (IBP).

Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2002). Indian restaurant with infinitely many dishes in a buffet line. Customers $1$ through $N$ enter the restaurant: the first customer samples $\mathrm{Poisson}(\alpha)$ dishes; the $n$-th customer samples a previously sampled dish $k$ with probability $m_k / n$ (where $m_k$ is the number of previous customers who sampled dish $k$), then samples $\mathrm{Poisson}(\alpha / n)$ new dishes.
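A minimal sketch of simulating an IBP-distributed binary matrix directly from this buffet description (variable names are illustrative, not from the talk):

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Simulate an IBP(alpha) binary matrix: rows are customers, columns are dishes."""
    rng = np.random.default_rng(rng)
    dish_counts = []                                   # m_k for each dish sampled so far
    rows = []
    for n in range(1, n_customers + 1):
        row = [int(rng.random() < m / n) for m in dish_counts]   # old dish k w.p. m_k / n
        n_new = rng.poisson(alpha / n)                           # Poisson(alpha/n) new dishes
        row += [1] * n_new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    K = len(dish_counts)
    Z = np.array([r + [0] * (K - len(r)) for r in rows])          # pad earlier rows with zeros
    return Z

print(sample_ibp(10, alpha=2.0, rng=0))
```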


Beta Process Point of View. The IBP is usually derived by taking a finite limit of a process on a finite matrix. But this leaves some issues somewhat obscured: is the IBP exchangeable? Why the Poisson number of dishes in each row? Is the IBP conjugate to some stochastic process? These issues are clarified from the beta process point of view: a draw from a beta process yields a countably infinite set of coin-tossing probabilities, and each draw from the Bernoulli process tosses these coins independently.

Hierarchical Beta Processes A hierarchical beta process is a beta process whose base measure is itself random and drawn from a beta process

Multiple Time Series. Goals: transfer knowledge among related time series in the form of a library of “behaviors”; allow each time series model to make use of an arbitrary subset of the behaviors. Method: represent behaviors as states in a nonparametric HMM; use the beta/Bernoulli process to pick out subsets of states. Specifically, we are going to assume that each of the time series can be modeled as having Markov switching dynamics with an arbitrarily large set of dynamical modes. The way the time series are related is via sharing some subset of the dynamical modes. For example, we might have a collection of motion capture movies of multiple people performing jumping jacks. By jointly modeling the time series, we can improve our estimates of the dynamic parameters for each mode, especially in the case of limited data, and also discover interesting relationships between the sequences. In addition to allowing for infinitely many possible dynamical modes, we would also like to allow for behaviors unique to a given sequence. For example, one movie might have a person doing side twists while another does not. We show that the beta process provides a useful Bayesian nonparametric prior in this case.

IBP-AR-HMM. Bernoulli process determines which states are used. Beta process prior: encourages sharing; allows variability. We now show how we can use such a feature model to relate multiple time series. We assume that each of our N time series can be described as a switching VAR process. Then, the set of features associated with the time series defines the set of behavioral modes that the object switches between. Specifically, for each object, imagine we define a set of feature-constrained transition distributions that restrict the object to only switching between its chosen behaviors. One can visualize this process as defining transition distributions in an infinite-dimensional space and then doing an element-wise product with the object’s feature vector, and renormalizing. The overall generative model is summarized here. Through our beta process prior, we encourage objects to share behavior modes while allowing some object-specific behaviors. Our formulation also allows variability in how the object switches between its chosen behaviors.
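A minimal sketch of the feature-constrained transition construction described above (element-wise masking of transition rows by an object's binary feature vector, then renormalizing); the finite stand-in transition prior and all names are assumptions of this illustration:

```python
import numpy as np

def feature_constrained_transitions(pis, f):
    """Mask an object's transition rows by its binary feature vector f and renormalize,
    so the object only switches among its own active behaviors."""
    masked = pis * f[None, :]                 # element-wise product with the feature vector
    masked = masked[f.astype(bool)]           # keep rows for active behaviors only
    return masked / masked.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
L = 6
pis = rng.dirichlet(np.ones(L), size=L)       # stand-in transition rows over L behaviors
f = np.array([1, 0, 1, 1, 0, 0])              # this object uses behaviors 0, 2, 3
print(feature_constrained_transitions(pis, f).round(2))
```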

Motion Capture Results Here is a plot of the results from that sample. We show a series of skeleton plots, each corresponding to a learned contiguous segment of at least 2 seconds. We then join all of the segments categorized under the same feature label within a box. The color of the box indicates the true behavior category.

Conclusions. Hierarchical modeling has at least as important a role to play in Bayesian nonparametrics as it plays in classical Bayesian parametric modeling. In particular, infinite-dimensional parameters can be controlled recursively via Bayesian nonparametric strategies. For papers and more details: www.cs.berkeley.edu/~jordan/publications.html