1 Nonparametric hidden Markov models Jurgen Van Gael and Zoubin Ghahramani

2 Introduction
- Hidden Markov models (HMMs): time series models with discrete hidden states
- Infinite HMM (iHMM): a nonparametric Bayesian approach
- Equivalence between the Polya urn and HDP interpretations of the iHMM
- Inference algorithms: collapsed Gibbs sampler, beam sampler
- Using the iHMM for a simple sequence labeling task

3 Introduction

4 From HMMs to Bayesian HMMs
- An example of an HMM: speech recognition
  - Hidden state sequence: phones
  - Observations: acoustic signals
  - The transition and emission parameters (π, θ) either come from a physical model of speech or can be learned from recordings of speech
- Computational questions
  - 1. (π, θ, K) given: apply Bayes' rule to find the posterior over the hidden variables; the computation is done by a dynamic programming routine, the forward-backward algorithm (sketched below)
  - 2. K given, (π, θ) not given: apply EM
  - 3. (π, θ, K) not given: penalization methods, etc.
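A minimal sketch of the forward-backward recursion mentioned above, for a finite HMM. The names `pi0`, `A`, `B` (initial distribution, transition matrix, emission matrix) are illustrative choices, not notation taken from the slides.

```python
import numpy as np

def forward_backward(obs, pi0, A, B):
    """Posterior marginals p(s_t | y_1:T) for a finite HMM.

    obs : sequence of observation indices, length T
    pi0 : (K,)  initial state distribution
    A   : (K, K) transition matrix, A[i, j] = p(s_t = j | s_{t-1} = i)
    B   : (K, V) emission matrix,   B[k, v] = p(y_t = v | s_t = k)
    """
    T, K = len(obs), len(pi0)
    alpha = np.zeros((T, K))   # scaled forward messages
    beta = np.zeros((T, K))    # scaled backward messages
    c = np.zeros(T)            # scaling constants (give the log-likelihood)

    # forward pass
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # backward pass
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]

    gamma = alpha * beta                      # posterior marginals p(s_t | y)
    gamma /= gamma.sum(axis=1, keepdims=True)
    log_likelihood = np.log(c).sum()          # log p(y_1:T | pi0, A, B)
    return gamma, log_likelihood
```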

5 From HMMs to Bayesian HMMs
- Fully Bayesian approach
  - Add priors for (π, θ) and extend the full joint pdf accordingly
  - Compute the marginal likelihood (evidence) to compare, choose, or average over different values of K
  - Analytic computation of the marginal likelihood is intractable
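Written out, the evidence sums over the hidden state sequence and integrates the parameters out of the joint (notation assumed here, consistent with the (π, θ, K) used above):

```latex
p(y_{1:T} \mid K) \;=\; \int \Big[ \sum_{s_{1:T}} p(y_{1:T} \mid s_{1:T}, \theta)\, p(s_{1:T} \mid \pi) \Big]\, p(\pi \mid K)\, p(\theta \mid K)\; d\pi\, d\theta
```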

6 From HMMs to Bayesian HMMs
- Methods for dealing with the intractability
  - MCMC 1: estimate the marginal likelihood explicitly (annealed importance sampling, bridge sampling); computationally expensive
  - MCMC 2: switch between different values of K (reversible jump MCMC)
  - Approximation using a good state sequence: given the hidden states, independence of the parameters and conjugacy between prior and likelihood make the marginal likelihood analytically computable
  - Variational Bayesian inference: compute a lower bound on the marginal likelihood and apply VB inference

7 Infinite HMM – hierarchical Polya Urn
- iHMM: instead of defining K different HMMs, implicitly define a distribution over the number of visited states
- Polya urn (with concentration parameter α and n_i balls of color i):
  - add a ball of a new color with probability α / (α + Σ_i n_i)
  - add a ball of color i with probability n_i / (α + Σ_i n_i)
  - a nonparametric clustering scheme
- Hierarchical Polya urn (simulated in the sketch below):
  - assume a separate urn for each state k
  - at each time step t, draw a ball from the urn of the previous state s_{t-1}
  - the transition probability from state i to state j is interpreted through the number of balls of color j in urn i
  - probability of drawing from the oracle: with some probability the draw is deferred to a shared oracle urn, which ties the state-specific urns together and can introduce new states
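A minimal simulation sketch of the hierarchical Polya urn transition scheme described above, under an assumed parameterization: each state's urn defers to a shared oracle urn with probability proportional to a parameter `beta`, and the oracle urn creates brand-new states with probability proportional to `gamma` (both names are assumptions, not notation from the slides).

```python
import random
from collections import defaultdict

def sample_ihmm_path(T, beta=1.0, gamma=1.0, seed=0):
    """Simulate a state sequence from a hierarchical Polya urn.

    urns[i][j] : number of balls of color j in the urn of state i
    oracle[j]  : number of balls of color j in the shared oracle urn
    """
    rng = random.Random(seed)
    urns = defaultdict(lambda: defaultdict(int))
    oracle = defaultdict(int)
    n_states = 1                       # start with a single state, labelled 0
    states = [0]

    for _ in range(1, T):
        i = states[-1]
        counts = urns[i]
        total = sum(counts.values())
        if rng.random() < total / (total + beta):
            # draw an existing ball from urn i, proportional to its counts
            j = rng.choices(list(counts), weights=list(counts.values()))[0]
        else:
            # defer to the oracle urn, which may create a brand-new state
            o_total = sum(oracle.values())
            if rng.random() < o_total / (o_total + gamma):
                j = rng.choices(list(oracle), weights=list(oracle.values()))[0]
            else:
                j = n_states           # new color = new state
                n_states += 1
            oracle[j] += 1             # the oracle only grows when consulted
        urns[i][j] += 1
        states.append(j)
    return states

print(sample_ihmm_path(20))
```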

8

9 Infinite HMM – HDP

10 HDP and hierarchical Polya Urn

11 Inference
- Gibbs sampler: O(KT²)
- Approximate Gibbs sampler: O(KT)
- The state sequence variables are strongly correlated → slow mixing
- Beam sampler: an auxiliary-variable MCMC algorithm
  - resamples the whole Markov chain at once
  - hence suffers less from slow mixing

12 Inference – collapsed Gibbs sampler

13
- Sampling s_t:
  - first factor: the conditional likelihood of y_t
  - second factor: a draw from a Polya urn
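The two factors on this slide have the usual collapsed-Gibbs form; written here in assumed notation, with s_{-t} denoting all states except s_t:

```latex
p(s_t \mid s_{-t}, y_{1:T}) \;\propto\; \underbrace{p(y_t \mid s_t, s_{-t}, y_{-t})}_{\text{conditional likelihood of } y_t}\;\times\;\underbrace{p(s_t \mid s_{t-1}, s_{-t})\, p(s_{t+1} \mid s_t, s_{-t})}_{\text{Polya urn transition factors}}
```

The first factor is the conditional likelihood of y_t; the second is evaluated from the hierarchical Polya urn counts with s_t removed.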

14 Inference – collapsed Gibbs sampler

15 Inference – Beam sampler

16
- Compute only for finitely many (s_t, s_{t-1}) values: conditioned on the auxiliary variables, the number of transitions the dynamic program has to consider is finite.

17 Inference – Beam sampler
- Complexity: O(TK²) when K states are represented
- Remark: the auxiliary variables need not be sampled from a uniform distribution; a Beta distribution could also be used to bias the auxiliary variables toward the boundaries of their sampling interval (a simplified sweep is sketched below)
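A simplified sketch of one beam-sampling sweep over an already-instantiated set of K represented states. The step that grows K by stick-breaking until the leftover transition mass falls below the smallest slice is omitted, and a uniform initial state distribution is assumed; all names are illustrative, not from the slides.

```python
import numpy as np

def beam_resample_states(obs_lik, pi, states, rng):
    """One beam-sampling sweep over the state sequence (sketch).

    obs_lik : (T, K) array, obs_lik[t, k] = p(y_t | s_t = k)
    pi      : (K, K) transition matrix over the represented states
    states  : length-T current state sequence
    rng     : numpy Generator, e.g. np.random.default_rng(0)
    """
    T, K = obs_lik.shape
    states = np.asarray(states)

    # 1. slice variables: u_t ~ Uniform(0, pi[s_{t-1}, s_t]); u[0] is unused
    u = np.concatenate([[0.0], rng.uniform(0, pi[states[:-1], states[1:]])])

    # 2. forward filtering, keeping only transitions with pi > u_t
    fwd = np.zeros((T, K))
    fwd[0] = obs_lik[0] / obs_lik[0].sum()          # uniform initial distribution assumed
    for t in range(1, T):
        allowed = (pi > u[t]).astype(float)         # indicator of "wide enough" transitions
        fwd[t] = obs_lik[t] * (fwd[t - 1] @ allowed)
        fwd[t] /= fwd[t].sum()

    # 3. backward sampling of the whole state sequence at once
    new_states = np.empty(T, dtype=int)
    new_states[-1] = rng.choice(K, p=fwd[-1])
    for t in range(T - 2, -1, -1):
        w = fwd[t] * (pi[:, new_states[t + 1]] > u[t + 1])
        new_states[t] = rng.choice(K, p=w / w.sum())
    return new_states
```

Because the whole chain is resampled jointly in step 3, the sweep avoids the strong coupling between neighbouring states that slows down the one-at-a-time Gibbs sampler.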

18 Example: unsupervised part-of-speech (PoS) tagging
- PoS tagging: annotating the words in a sentence with their appropriate part-of-speech tag
  - "The man sat" → 'The': determiner, 'man': noun, 'sat': verb
  - An HMM is commonly used
    - observations: words
    - hidden states: unknown PoS tags
    - usually learned from a corpus of annotated sentences; building such a corpus is expensive
  - In the iHMM
    - a multinomial likelihood is assumed
    - the base distribution H is a symmetric Dirichlet, so it is conjugate to the multinomial likelihood
  - Trained on section 0 of the WSJ portion of the Penn Treebank: 1917 sentences with a total of 50282 word tokens (observations) and 7904 word types (dictionary size)
  - The sampler was initialized with 50 states and run for 50000 iterations

19 Example: unsupervised part-of-speech (PoS) tagging
- Top 5 words for the five most common states
  - top line: state ID and frequency
  - rows: top 5 words with their frequency in the sample
  - state 9: a class of prepositions
  - state 12: determiners and possessive pronouns
  - state 8: punctuation and some coordinating conjunctions
  - state 18: nouns
  - state 17: personal pronouns

20 Beyond the iHMM: input-output (IO) iHMM
- The Markov chain is affected by external factors
  - A robot drives around in a room while taking pictures (room index → picture)
  - If the robot follows a particular policy, its actions can be integrated as an input to the iHMM (IO-iHMM)
  - This leads to a three-dimensional transition matrix, indexed by the input, the previous state, and the next state (spelled out below)
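In symbols, with a_t denoting the input (action) at time t (notation assumed; the shared-base construction in the second expression is one natural choice, not necessarily the one used in the slides):

```latex
p(s_t = j \mid s_{t-1} = i,\ a_t = a) = \pi^{(a)}_{ij}, \qquad \pi^{(a)}_{i\,\cdot} \sim \mathrm{DP}(\alpha, \beta) \ \ \text{for every input } a \text{ and state } i
```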

21 Beyond the iHMM: sticky and block-diagonal iHMM
- The weight on the diagonal of the transition matrix controls the frequency of state transitions
- The probability of staying in state i for g steps is geometric in the self-transition probability (written out below)
- Sticky iHMM: add prior probability mass to the diagonal of the transition matrix and apply a dynamic-programming-based inference algorithm
  - appropriate for segmentation problems where the number of segments is not known a priori
  - the extra weight on the diagonal entries is controlled by a parameter that governs the switching rate
- Block-diagonal iHMM: grouping of states
  - the sticky iHMM is the special case of blocks of size 1
  - larger blocks allow unsupervised clustering of states
  - used for unsupervised learning of view-based object models from video data, where each block corresponds to an object
  - intuition: temporally contiguous video frames are more likely to correspond to different views of the same object than to different objects
- Hidden semi-Markov model
  - assumes an explicit duration model for the time spent in a particular state
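The duration claim above is the geometric distribution induced by a self-transition probability π_{ii}; the second expression shows one standard way (assumed notation, with a stickiness parameter κ added to the diagonal) of placing the extra prior mass mentioned on this slide:

```latex
p(\text{stay in state } i \text{ for } g \text{ steps, then leave}) = \pi_{ii}^{\,g}\,(1 - \pi_{ii}), \qquad \pi_{i\,\cdot} \sim \mathrm{DP}\!\left(\alpha + \kappa,\ \frac{\alpha \beta + \kappa\, \delta_i}{\alpha + \kappa}\right)
```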

22 Beyond the iHMM: iHMM with Pitman-Yor base distribution
- Frequency vs. rank of colors (on a log-log scale)
  - The DP is quite specific about the distribution implied by its Polya urn: the number of colors that appear only once or twice is very small
  - The Pitman-Yor process allows more control over the tails
  - The Pitman-Yor process fits a power-law distribution (a linear fit in the log-log plot)
  - The DP can be replaced by a Pitman-Yor process in most cases
  - Helpful comments on the beam sampler for this setting (a minimal urn sketch follows)
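A minimal sketch of the Pitman-Yor urn mentioned above, to contrast with the plain Polya urn on slide 7; `d` is the discount and `alpha` the concentration (names are assumptions), and setting d = 0 recovers the DP urn.

```python
import random
from collections import Counter

def pitman_yor_urn(n_draws, alpha=1.0, d=0.5, seed=0):
    """Draw colors from a Pitman-Yor urn; d = 0 gives the ordinary Polya (DP) urn."""
    rng = random.Random(seed)
    counts = Counter()              # counts[i] = number of balls of color i
    for n in range(n_draws):        # n = total number of balls drawn so far
        k = len(counts)             # number of distinct colors so far
        # brand-new color with probability (alpha + k*d) / (alpha + n)
        if rng.random() < (alpha + k * d) / (alpha + n):
            counts[k] += 1
        else:
            # existing color i with probability (counts[i] - d) / (alpha + n)
            colors = list(counts)
            weights = [counts[i] - d for i in colors]
            counts[rng.choices(colors, weights=weights)[0]] += 1
    return counts

# heavier tail than the DP urn: many colors are seen only once or twice
print(sorted(pitman_yor_urn(1000).values(), reverse=True)[:10])
```

The discount d is what produces the power-law frequency-rank behaviour discussed on this slide; with d = 0 the number of distinct colors grows only logarithmically in the number of draws.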

23 Beyond the iHMM: autoregressive iHMM, SLD-iHMM
- AR-iHMM: the observations follow autoregressive dynamics
- SLD-iHMM: part of the continuous variables are observed and the unobserved variables follow linear dynamics
- [Figures: the SLD model and the FA-HMM model]

