Nonparametric hidden Markov models Jurgen Van Gael and Zoubin Ghahramani

Introduction
- HMMs: time-series models with discrete hidden states
- Infinite HMM (iHMM): a nonparametric Bayesian approach
- Equivalence between the Polya urn and HDP interpretations of the iHMM
- Inference algorithms: collapsed Gibbs sampler, beam sampler
- Using the iHMM: a simple sequence labeling task

From HMMs to Bayesian HMMs
- An example HMM: speech recognition
  - hidden state sequence: phones
  - observations: acoustic signals
  - the parameters π, φ either come from a physical model of speech or can be learned from recordings of speech
- Computational questions (see the sketch below)
  1. (π, φ, K) is given: apply Bayes' rule to find the posterior over the hidden variables; the computation can be done with a dynamic programming routine, the forward-backward algorithm
  2. K is given but π, φ are not: apply EM
  3. (π, φ, K) is not given: penalization methods, etc.
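As a sketch of question 1, the forward-backward recursions can be written as below. This is an illustrative implementation, not the chapter's code; the names pi0, pi, phi and the per-step normalisation are assumptions.

```python
import numpy as np

def forward_backward(y, pi0, pi, phi):
    """Posterior marginals p(s_t | y) for a discrete HMM.

    y    : array of T observed symbol indices
    pi0  : (K,) initial state distribution
    pi   : (K, K) transition matrix, pi[i, j] = p(s_t = j | s_{t-1} = i)
    phi  : (K, V) emission matrix, phi[k, v] = p(y_t = v | s_t = k)
    """
    T, K = len(y), len(pi0)
    alpha = np.zeros((T, K))          # forward messages (normalised each step)
    beta = np.ones((T, K))            # backward messages

    alpha[0] = pi0 * phi[:, y[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ pi) * phi[:, y[t]]
        alpha[t] /= alpha[t].sum()    # normalise to avoid underflow

    for t in range(T - 2, -1, -1):
        beta[t] = pi @ (phi[:, y[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    post = alpha * beta               # p(s_t | y) up to a per-t constant
    return post / post.sum(axis=1, keepdims=True)
```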

From HMMs to Bayesian HMMs
- Fully Bayesian approach
  - add priors for π and φ and extend the full joint pdf accordingly (see the reconstruction below)
  - compute the marginal likelihood (evidence) to compare, choose, or average over different values of K
  - computing the marginal likelihood analytically is intractable
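The joint density and the marginal likelihood referred to above appear as formulas on the slide; a hedged reconstruction in the same spirit (π for transitions, φ for emissions, s for the hidden sequence) is

```latex
p(\mathbf{y}, \mathbf{s}, \pi, \phi \mid K)
  = p(\pi)\, p(\phi) \prod_{t=1}^{T} p(s_t \mid s_{t-1}, \pi)\, p(y_t \mid s_t, \phi),
\qquad
p(\mathbf{y} \mid K) = \iint \sum_{\mathbf{s}} p(\mathbf{y}, \mathbf{s}, \pi, \phi \mid K)\; d\pi\, d\phi .
```

The sum over exponentially many state sequences inside the integral is what makes the evidence intractable to compute analytically.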

From HMMs to Bayesian HMMs
- Methods for dealing with the intractability
  - MCMC 1: estimate the marginal likelihood explicitly (annealed importance sampling, bridge sampling); computationally expensive
  - MCMC 2: switch between different values of K (reversible jump MCMC)
  - approximation using a good state sequence: given the hidden states, independence of the parameters and conjugacy between prior and likelihood let the marginal likelihood be computed analytically
  - variational Bayesian inference: compute a lower bound on the marginal likelihood and apply VB inference

Infinite HMM – hierarchical Polya urn
- iHMM: instead of defining a separate HMM for every K, implicitly define a distribution over the number of visited states
- Polya urn:
  - add a ball of a new colour with probability α / (α + Σ_i n_i)
  - add a ball of colour i with probability n_i / (α + Σ_j n_j)
  - a nonparametric clustering scheme
- Hierarchical Polya urn (see the simulation sketch below):
  - assume a separate urn, Urn(k), for each state k
  - at each time step t, select a ball from Urn(s_{t-1}), the urn of the previous state
  - the transition probability from state i to state j is interpreted through the number of balls of colour j in urn i
  - with probability proportional to the concentration parameter, the draw is deferred to a shared oracle urn
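A minimal simulation sketch of the hierarchical Polya urn described above. The function name, the α/γ parameter names, and the choice of state 0 as the starting state are illustrative assumptions; the two-level structure (per-state urns deferring to a shared oracle urn) follows the slide.

```python
import random
from collections import defaultdict

def sample_ihmm_states(T, alpha, gamma, seed=0):
    """Simulate T states from a hierarchical Polya urn (iHMM prior sketch).

    Each state i owns an urn over next-state colours; with probability
    alpha / (alpha + sum_j n_ij) the draw is deferred to a shared oracle
    urn, which with probability gamma / (gamma + sum_j m_j) opens a
    brand-new colour (state).
    """
    rng = random.Random(seed)
    urn = defaultdict(lambda: defaultdict(int))   # urn[i][j] = n_ij
    oracle = defaultdict(int)                     # oracle[j] = m_j
    n_colours = 1
    oracle[0] += 1                                # assumption: start in state 0
    states = [0]
    for _ in range(T - 1):
        i = states[-1]
        counts = urn[i]
        total = sum(counts.values())
        if rng.random() < alpha / (alpha + total):
            # defer to the shared oracle urn
            o_total = sum(oracle.values())
            if rng.random() < gamma / (gamma + o_total):
                j = n_colours                     # open a brand-new state
                n_colours += 1
            else:
                j = rng.choices(list(oracle), weights=list(oracle.values()))[0]
            oracle[j] += 1
        else:
            j = rng.choices(list(counts), weights=list(counts.values()))[0]
        urn[i][j] += 1
        states.append(j)
    return states

# example: states = sample_ihmm_states(1000, alpha=1.0, gamma=1.0)
```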

Infinite HMM – HDP

HDP and hierarchical Polya Urn

Inference
- Gibbs sampler: O(KT²)
- Approximate Gibbs sampler: O(KT)
- The state-sequence variables are strongly correlated, which leads to slow mixing
- Beam sampler: an auxiliary-variable MCMC algorithm
  - resamples the whole Markov chain at once
  - hence suffers less from slow mixing

Inference – collapsed Gibbs sampler

n Sampling s t : u Conditional likelihood of y t : u Second factor: a draw from a Polya urn

Inference – collapsed Gibbs sampler

Inference – Beam sampler

- Compute only for finitely many (s_t, s_{t-1}) pairs

Inference – Beam sampler
- Complexity: O(TK²) when K states are represented (see the sketch below)
- Remarks: the auxiliary variables need not be sampled from a uniform distribution; a Beta distribution could also be used to bias the auxiliary variables towards the boundaries of the slice interval
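A compact sketch of one beam-sampling sweep over the state sequence. It assumes the currently instantiated K×K transition matrix pi, an emission-likelihood function, and a NumPy random generator are available, and it omits the surrounding steps (resampling the transition and emission parameters, the shared DP weights, and extending K when needed); all helper names and the fixed start state are illustrative.

```python
import numpy as np

def beam_sample_states(y, s, pi, emis_lik, rng):
    """One beam-sampling sweep over the state sequence (sketch).

    y        : observations, length T
    s        : current state sequence, length T (assumption: predecessor of s[0] is state 0)
    pi       : (K, K) currently instantiated transition probabilities
    emis_lik : function (k, y_t) -> p(y_t | s_t = k)
    rng      : numpy.random.Generator
    """
    T, K = len(y), pi.shape[0]
    # auxiliary slice variables: u_t ~ Uniform(0, pi[s_{t-1}, s_t])
    u = np.array([rng.uniform(0, pi[s[t - 1] if t > 0 else 0, s[t]]) for t in range(T)])

    # forward filtering, keeping only transitions with pi[i, j] > u_t
    alpha = np.zeros((T, K))
    alpha[0] = (pi[0] > u[0]) * np.array([emis_lik(k, y[0]) for k in range(K)])
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        trans_ok = (pi > u[t])                        # indicator 1[u_t < pi_ij]
        alpha[t] = (alpha[t - 1] @ trans_ok) * np.array([emis_lik(k, y[t]) for k in range(K)])
        alpha[t] /= alpha[t].sum()

    # backward sampling of the whole sequence at once
    new_s = np.zeros(T, dtype=int)
    new_s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, new_s[t + 1]] > u[t + 1])
        new_s[t] = rng.choice(K, p=w / w.sum())
    return new_s
```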

Example: unsupervised part-of-speech (PoS) tagging
- PoS tagging: annotating the words in a sentence with their appropriate part-of-speech tag
  - "The man sat" → 'The': determiner, 'man': noun, 'sat': verb
  - An HMM is commonly used:
    - observations: words
    - hidden states: unknown PoS tags
    - usually learned from a corpus of annotated sentences, but building such a corpus is expensive
  - In the iHMM:
    - a multinomial likelihood is assumed
    - the base distribution H is a symmetric Dirichlet, so it is conjugate to the multinomial likelihood (see the predictive-probability sketch below)
  - Trained on section 0 of the WSJ portion of the Penn Treebank: 1917 sentences, with the word tokens as observations and 7904 word types (dictionary size)
  - The sampler is initialized with 50 states and run for a number of iterations
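Because the symmetric Dirichlet base measure is conjugate to the multinomial emission model, the emission parameters can be integrated out inside the samplers. A sketch of the resulting predictive word probability; the function name and the value of the Dirichlet parameter are illustrative, not taken from the text.

```python
def word_predictive(word, state, counts, state_totals, V, beta_e=0.1):
    """Predictive probability of `word` under `state` with the emission
    parameters integrated out, assuming a symmetric Dirichlet(beta_e) base
    measure over a dictionary of V word types.

    counts[state][word] : number of times `word` is currently assigned to `state`
    state_totals[state] : total number of tokens currently assigned to `state`
    """
    return (counts[state][word] + beta_e) / (state_totals[state] + V * beta_e)
```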

Example: unsupervised part-of-speech (PoS) tagging
- Top 5 words for the five most common states
  - top line: state ID and frequency
  - rows: top 5 words with their frequency in the sample
  - state 9: the class of prepositions
  - state 12: determiners and possessive pronouns
  - state 8: punctuation plus some coordinating conjunctions
  - state 18: nouns
  - state 17: personal pronouns

Beyond the iHMM: input-output (IO) iHMM
- A Markov chain affected by external factors
  - example: a robot drives around in a room while taking pictures (room index → picture)
  - if the robot follows a particular policy, its actions can be integrated as an input to the iHMM (IO-iHMM)
  - this gives a three-dimensional transition matrix (see below)
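One way to write the three-dimensional transition matrix mentioned above, with a_t denoting the input (action) at time t; the indexing convention is an assumption, not taken verbatim from the slide:

```latex
p(s_t = j \mid s_{t-1} = i,\; a_t = a) = \pi^{(a)}_{ij},
```

so each input value a indexes its own transition matrix over the (unbounded) state space.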

Beyond the iHMM: sticky and block-diagonal iHMM
- The weight on the diagonal of the transition matrix controls the frequency of state transitions
- Probability of staying in state i for g time steps: geometric in the self-transition probability (see the reconstruction below)
- Sticky iHMM: add prior probability mass to the diagonal of the transition matrix and apply dynamic-programming-based inference
  - appropriate for segmentation problems where the number of segments is not known a priori
  - an extra parameter puts more weight on the diagonal entries and controls the switching rate
- Block-diagonal iHMM: for grouping states
  - the sticky iHMM is the special case of blocks of size 1
  - larger blocks allow unsupervised clustering of states
  - used for unsupervised learning of view-based object models from video data, where each block corresponds to an object
  - intuition: temporally contiguous video frames are more likely to correspond to different views of the same object than to different objects
- Hidden semi-Markov model: assumes an explicit duration model for the time spent in a particular state
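The two formulas on this slide are not preserved in the transcript. Hedged reconstructions consistent with the text are the geometric duration implied by the self-transition probability π_ii, and the sticky transition prior used in the related literature, where extra mass κ is placed on the diagonal (the symbol κ is chosen here for illustration):

```latex
p(\text{stay in state } i \text{ for } g \text{ steps}) = \pi_{ii}^{\,g}\,(1 - \pi_{ii}),
\qquad
\pi_i \mid \beta \;\sim\; \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha \beta + \kappa\, \delta_i}{\alpha + \kappa}\right).
```

Larger κ makes self-transitions more likely and hence lowers the switching rate.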

Beyond the iHMM: iHMM with a Pitman-Yor base distribution
- Frequency vs. rank of colours (on a log-log scale)
  - the DP is quite specific about the distribution implied by the Polya urn: the number of colours that appear only once or twice is very small
  - the Pitman-Yor process gives more control over the tails
  - the Pitman-Yor process fits a power-law distribution (a linear fit in the log-log plot); see the urn sketch below
  - the DP can be replaced by the Pitman-Yor process in most cases
  - helpful comments on the beam sampler
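A minimal sketch of the Pitman-Yor urn generalising the Polya urn above; the discount parameter d > 0 is what produces the power-law tail over colour frequencies (function and parameter names are illustrative):

```python
import random
from collections import Counter

def pitman_yor_urn(n, alpha, d, seed=0):
    """Draw n colours from a Pitman-Yor urn with concentration alpha and
    discount 0 <= d < 1 (d = 0 recovers the Dirichlet-process Polya urn)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n):
        k = len(counts)                 # distinct colours seen so far
        total = sum(counts.values())
        p_new = (alpha + d * k) / (alpha + total) if total > 0 else 1.0
        if rng.random() < p_new:
            colour = k                  # open a new colour
        else:
            # existing colour j chosen with weight n_j - d
            colour = rng.choices(list(counts), weights=[c - d for c in counts.values()])[0]
        counts[colour] += 1
    return counts                       # colour -> frequency
```

Plotting frequency against rank of the returned counts on a log-log scale gives an approximately straight line for d > 0, matching the power-law behaviour described above, whereas d = 0 (the DP case) does not.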

Beyond the iHMM: autoregressive iHMM, SLD-iHMM
- AR-iHMM: observations follow autoregressive dynamics (sketched below)
- SLD-iHMM (switching linear dynamical iHMM): some of the continuous variables are observed and the unobserved continuous variables follow linear dynamics
- Figures on the slide: the SLD model and the FA-HMM model
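Hedged sketches of the two dynamics mentioned above, with notation chosen here for illustration (A, C, and the noise terms are not specified in the transcript): in the AR-iHMM each observation regresses on its predecessor with state-specific coefficients, while in the SLD-iHMM a latent continuous state follows state-switching linear-Gaussian dynamics and is observed through a linear map:

```latex
\text{AR-iHMM:}\quad y_t = A_{s_t}\, y_{t-1} + \varepsilon_t,
\qquad\qquad
\text{SLD-iHMM:}\quad x_t = A_{s_t}\, x_{t-1} + \varepsilon_t,\quad y_t = C\, x_t + \eta_t,
```

with ε_t and η_t Gaussian noise terms.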