A Left-to-Right HDP-HMM with HDPM Emissions
Amir Harati, Joseph Picone and Marc Sobel
Institute for Signal and Information Processing, Temple University
Philadelphia, Pennsylvania, USA

Abstract
Recently, nonparametric Bayesian (NPB) methods have become a popular alternative to parametric Bayesian approaches. In these methods, we do not fix the model complexity a priori; instead, we place a prior over the complexity (or model structure). The Hierarchical Dirichlet Process hidden Markov model (HDP-HMM) is the nonparametric Bayesian equivalent of an HMM. However, the HDP-HMM is restricted to an ergodic topology and uses a Dirichlet process mixture (DPM) to achieve a mixture distribution-like emission model. We present a new type of HDP-HMM that preserves the useful left-to-right properties of a conventional HMM, yet still supports automated learning of structure and complexity from data. We also introduce an HDP-HMM with HDPM emissions, which allows a model to share data points among different states. This new model achieves a better likelihood than the original HDP-HMM and has much better scalability properties.

Outline
 Nonparametric Bayesian Models
 Nonparametric Hidden Markov Models
 Acoustic Modeling in Speech Recognition
 Left-to-Right HDP-HMM Models
 HDP-HMM with HDP Emissions
 Results
 Summary of Contributions

Nonparametric Bayesian
Issues with parametric model selection/averaging:
1. Computational cost
2. Discrete optimization
3. Choice of criteria
What nonparametric Bayesian methods promise:
1. Inferring the model from the data
2. Immunity to over-fitting
3. A well-defined mathematical framework

Dirichlet Distribution
Functional form: $\mathrm{Dir}(q \mid \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} q_k^{\alpha_k - 1}$, where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$.
 q ∈ ℝ^K: a probability mass function (pmf).
 α: a vector of concentration parameters.
 α can be interpreted as pseudo-observations; the total number of pseudo-observations is α_0.
The Dirichlet distribution is the conjugate prior for a multinomial distribution.
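To make the pseudo-observation interpretation concrete, here is a minimal sketch (not from the slides; parameter values are illustrative) that draws a pmf from a Dirichlet prior with NumPy and shows how observed multinomial counts simply add to the concentration parameters under conjugacy:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 5.0])   # concentration parameters (pseudo-counts)
q = rng.dirichlet(alpha)            # one pmf drawn from Dirichlet(alpha)
print("sampled pmf:", q, "sum:", q.sum())

# Conjugacy: after observing multinomial counts n, the posterior is
# Dirichlet(alpha + n) -- the counts simply add to the pseudo-counts.
n = np.array([10, 0, 4])
posterior_mean = (alpha + n) / (alpha + n).sum()
print("posterior mean pmf:", posterior_mean)
```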


Dirichlet Process (DP)
A Dirichlet process can be viewed as a Dirichlet distribution split infinitely many times: each weight is recursively subdivided (q_1 into q_11 and q_12, q_2 into q_21 and q_22, and so on). A DP is a discrete distribution with an infinite number of atoms.
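A common way to make the "infinite splitting" view concrete is the stick-breaking construction (Sethuraman, 1994). The following sketch draws a truncated sample from DP(α, H), assuming a standard normal base measure H purely for illustration:

```python
import numpy as np

def sample_dp(alpha, trunc=100, seed=1):
    """Truncated stick-breaking draw from DP(alpha, H), with H = N(0, 1)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=trunc)    # stick-breaking fractions
    # pi_k = v_k * prod_{j<k} (1 - v_j)
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = rng.normal(size=trunc)          # atom locations drawn from H
    return atoms, weights

atoms, weights = sample_dp(alpha=2.0)
print("mass on the 10 largest atoms:", np.sort(weights)[-10:].sum())
```

Small α concentrates the mass on a few atoms; large α spreads it out, which is why α is called a concentration parameter.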

Hierarchical Dirichlet Process (HDP)
Grouped data clustering: consider data organized into several groups (e.g., documents). A DP can be used to define a mixture over each group, but each mixture would then be independent of the others. Often we want to share components among the mixtures (e.g., to share topics among documents).
The HDP (Teh et al., 2006) ties the groups together through a shared discrete base measure:
a) $G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$
b) $G_j \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0)$ for each group $j$
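As a rough illustration of how an HDP shares atoms across groups, the sketch below uses a truncated approximation (the truncation level and parameter values are illustrative, not from the slides): global weights are drawn at the top level, and each group then reweights the same shared atoms.

```python
import numpy as np

rng = np.random.default_rng(2)
trunc, gamma, alpha = 50, 5.0, 10.0

# Top level: global weights beta ~ GEM(gamma) over one shared set of atoms.
v = rng.beta(1.0, gamma, size=trunc)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
shared_atoms = rng.normal(size=trunc)

# Group level: with a discrete base measure, a truncated DP(alpha, beta) draw
# reduces to pi_j ~ Dirichlet(alpha * beta); every group reuses shared_atoms.
pi_groups = [rng.dirichlet(alpha * beta + 1e-10) for _ in range(3)]
for j, pi in enumerate(pi_groups):
    print(f"group {j}: heaviest shared atom index = {pi.argmax()}")
```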

Hidden Markov Models (HMMs)
Markov chain: a stochastic process with the Markov property. States are observed at each time step, and the probability of being in any state at time t+1 is a function only of the state at time t.
Hidden Markov model: a Markov chain whose states are not observed; the observed sequence is the output of a probability distribution associated with each state. A model is characterized by:
1. The number of states.
2. The transition probabilities between these states.
3. The parameters of the probability distribution for each state.
Properties: an HMM is a parametric model; Expectation-Maximization (EM) is used to learn the model.
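For reference, a minimal sketch of the standard forward algorithm (cf. Rabiner, 1989) for a discrete-emission HMM; all probabilities here are made up for illustration:

```python
import numpy as np

A = np.array([[0.7, 0.3],      # transition probabilities (state x state)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # emission probabilities (state x symbol)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # initial state distribution
obs = [0, 1, 1, 0]             # observed symbol sequence

alpha = pi * B[:, obs[0]]      # forward variable at t = 0
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]   # propagate and absorb the next observation
print("P(obs | model) =", alpha.sum())
```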

Hierarchical Dirichlet Process-Based HMM (HDP-HMM)
Definition (graphical model shown as a figure): z_t, s_t and x_t represent a state, a mixture component and an observation, respectively.
Inference algorithms are used to infer the values of the latent variables (z_t and s_t); a variation of the forward-backward procedure is used for training.
K_z: maximum number of states. K_s: maximum number of components per mixture.

The Acoustic Modeling Problem in Speech Recognition
The goal of speech recognition is to map acoustic data into word sequences:
$\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid A) = \operatorname*{arg\,max}_{W} \frac{P(A \mid W)\, P(W)}{P(A)}$
In this formulation, P(W|A) is the probability of a particular word sequence given the acoustic observations, P(W) is the language model, P(A) is the probability of the observed acoustic data (usually ignored, since it is constant over W), and P(A|W) is the acoustic model.
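A toy sketch of the resulting decoding rule: since P(A) is constant over word sequences, we pick the candidate maximizing log P(A|W) + log P(W). The candidate sequences and scores below are hypothetical:

```python
# Hypothetical log-probabilities for three candidate word sequences.
candidates = {
    "the cat sat": {"log_p_A_given_W": -120.3, "log_p_W": -8.1},
    "the cat sad": {"log_p_A_given_W": -121.0, "log_p_W": -12.7},
    "a cat sat":   {"log_p_A_given_W": -123.5, "log_p_W": -9.0},
}

best = max(candidates,
           key=lambda w: candidates[w]["log_p_A_given_W"] + candidates[w]["log_p_W"])
print("decoded word sequence:", best)
```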

Left-to-Right HDP-HMM with HDP Emissions
Problem statement:
 In many pattern recognition applications involving temporal structure, such as speech processing, a left-to-right topology is preferred or sometimes required.
 In speech recognition, all units usually use the same topology and the same number of mixture components; i.e., the complexity is fixed for all models. Given more data, a model's structure remains the same and only the estimates of its parameters (e.g., means and covariances) improve.
 Different models are trained on different amounts of data, yet their complexity is the same. This means some models are over-trained and others under-trained.
 Because of the lack of hierarchical structure, extending the model is heuristic. For example, there is no consensus on sharing data in gender-specific modeling: some train completely separate models for each group, while others use various heuristic methods to share data.

Left-to-Right HDP-HMM with HDP Emissions
Related work:
 Bourlard (1993) and others proposed replacing Gaussian mixture models (GMMs) with a neural network based on a multilayer perceptron (MLP). MLPs were shown to generate reasonable estimates of the posterior distribution of an output class conditioned on the input patterns. This hybrid HMM-MLP system works slightly better than traditional HMM-GMM systems, but the gain was not significant.
 In a similar vein, Lefèvre (2003) and Shang (2009) used nonparametric density estimators (e.g., kernel methods) to replace GMMs.
 Henter et al. (2012) introduced the Gaussian process dynamical model (GPDM) to completely replace HMMs in acoustic modeling. This model has been used only in speech synthesis and as of now lacks a corresponding recognition algorithm.
These efforts can be classified as nonparametric non-Bayesian approaches. Each models the emission distributions nonparametrically, but none addresses the model topology problem.

Left-to-Right HDP-HMM with HDP Emissions
LR-HDP-HMM with HDP emissions:
 The proposed model addresses both the topology and the emission distribution modeling problems.
 It is based on the well-defined HDP-HMM model.
 Three new features are introduced:
1. The HDP-HMM is an ergodic model; we extend the definition to a left-to-right topology.
2. The HDP-HMM uses a DPM to model the emissions of each state; our model uses an HDP instead, allowing components of the emission distributions to be shared within an HMM. This is particularly important for left-to-right models, where the number of discovered states is usually larger than for an ergodic HMM, leaving fewer data points associated with each state.
3. Non-emitting "initial" and "final" states are included in the final definition.

Left-to-Right HDP-HMM with HDP Emissions
LR-HDP-HMM (cont.):
 A new inference algorithm based on a block sampler has been derived.
 The new model is more accurate and avoids some of the intrinsic problems of parametric HMMs (e.g., over-trained and under-trained models).
 The hierarchical definition of the model within the Bayesian framework makes it relatively easy to share data among related models (e.g., models of the same phoneme for different clusters of speakers).
 We have shown that the computation time for an HDP-HMM with HDP emissions is proportional to K_s, while for an HDP-HMM with DPM emissions it is proportional to K_s * K_z. This means the HDP-HMM/HDPM is more scalable as the maximum bound on model complexity increases.

Left-to-Right HDP-HMM with HDP Emissions
Definition and graphical model (shown as figures).
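For intuition only: one crude way to impose a left-to-right topology on a finite transition matrix is to mask backward transitions and renormalize, as sketched below. The paper's actual construction instead builds the constraint into the HDP prior itself; this snippet is not the paper's method.

```python
import numpy as np

def make_left_to_right(A):
    """Zero out transitions to earlier states (j < i) and renormalize rows."""
    lr = np.triu(A)                              # keep only j >= i transitions
    return lr / lr.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
A = rng.dirichlet(np.ones(4), size=4)            # a random ergodic transition matrix
print(make_left_to_right(A))
```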

Left-to-Right HDP-HMM with HDP Emissions
Adding non-emitting states: the inference algorithm estimates the probability of a self-transition (P_1) and of transitions to other emitting states (P_2), but each state can also transition to a non-emitting state (P_3). Since P_1 + P_2 + P_3 = 1, we can re-estimate P_1 and P_3 while holding P_2 fixed. The problem is analogous to tossing a coin until the first head appears (a geometric distribution), whose standard maximum likelihood (ML) estimate gives
$\hat{P}_1 = (1 - P_2)\,\frac{k_i}{k_i + M}, \qquad \hat{P}_3 = (1 - P_2)\,\frac{M}{k_i + M}$
where M is the number of examples in which state i is the last state of the model (during inference) and k_i is the number of self-transitions for state i.
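A minimal sketch of this re-estimation, assuming the standard geometric ML estimate written above; the input numbers are illustrative:

```python
def reestimate(k_i, M, P2):
    """Re-estimate P1 (self-loop) and P3 (exit) for state i with P2 fixed."""
    p_self = k_i / (k_i + M)          # ML estimate of the conditional self-loop prob
    P1 = (1.0 - P2) * p_self          # self-transition
    P3 = (1.0 - P2) * (1.0 - p_self)  # transition to the non-emitting final state
    return P1, P3

P1, P3 = reestimate(k_i=42, M=10, P2=0.3)
print(P1, P3, P1 + P3 + 0.3)          # the three probabilities sum to 1
```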

Results
Toy problem: a 4-state LR-HMM in which the emission distribution for each state is a GMM with up to three components.

Results
HDP-HMM/DPM computation time is proportional to K_s * K_z, while HDP-HMM/HDPM inference time is proportional to K_s. At each iteration the HDP-HMM/DPM model requires K_s * K_z log-likelihood computations, but in the HDP-HMM/HDPM model the mixture components are shared among all states, so the actual number of computations is almost proportional to K_s.
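A back-of-the-envelope sketch of the per-observation likelihood counts; the values of K_z and K_s are illustrative, not from the paper:

```python
K_z, K_s = 50, 20      # max states, max mixture components per state

dpm_evals = K_z * K_s  # DPM emissions: each state owns its own components
hdpm_evals = K_s       # HDPM emissions: one shared pool of components
print(f"DPM: {dpm_evals} likelihood evals; HDPM: {hdpm_evals} per observation")
```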

Results
Figure: automatically derived model structures (without the first and last non-emitting states) obtained with the left-to-right HDP-HMM for: (a) /aa/ with 175 examples; (b) /sh/ with 100 examples; (c) /aa/ with 2,256 examples; (d) /sh/ with 1,317 examples. The data used in this illustration were extracted from the training portion of the TIMIT Corpus.

A Comparison of Classification Error Rates
Model                            Error Rate
HMM/GMM (10 components)          27.8%
LR-HDP-HMM/GMM (1 component)     26.7%
LR-HDP-HMM                       24.1%

Summary of Contributions
1. Introduced left-to-right HDP-HMMs with HDP emissions and a corresponding inference algorithm.
2. Showed that HDP emissions can replace DPM emissions in most applications (for both left-to-right and ergodic models) without losing performance, while the scalability of the model improves significantly.
3. Showed that LR HDP-HMM models can learn multimodality in the data. For example, for a single phoneme, an LR HDP-HMM can learn parallel paths corresponding to different types of speakers while still sharing data among states when HDPM emissions are used.

References
Bacchiani, M., & Ostendorf, M. (1999). Joint lexicon, acoustic unit inventory and model design. Speech Communication, 29(2-4), 99–114.
Bourlard, H., & Morgan, N. (1993). Connectionist Speech Recognition: A Hybrid Approach. Springer.
Dusan, S., & Rabiner, L. (2006). On the relation between maximum spectral transition positions and phone boundaries. Proceedings of INTERSPEECH (pp. 1317–1320). Pittsburgh, Pennsylvania, USA.
Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2011). A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics, 5(2A), 1020–1056.
Harati, A., Picone, J., & Sobel, M. (2012). Applications of Dirichlet process mixtures to speaker adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4321–4324). Kyoto, Japan.
Harati, A., Picone, J., & Sobel, M. (2013). Speech segmentation using hierarchical Dirichlet processes. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (p. TBD). Vancouver, Canada.
Henter, G. E., Frean, M. R., & Kleijn, W. B. (2012). Gaussian process dynamical models for nonparametric speech representation and synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4505–4508). Kyoto, Japan.
Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. Proceedings of the Association for Computational Linguistics (pp. 40–49). Jeju, Republic of Korea.
Lefèvre, F. (2003). Non-parametric probability estimation for HMM-based automatic speech recognition. Computer Speech & Language, 17(2-3), 113–136.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.
Shang, L. (2009). Nonparametric discriminant HMM and application to facial expression recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2090–2096). Miami, Florida, USA.
Shin, W., Lee, B.-S., Lee, Y.-K., & Lee, J.-S. (2000). Speech/non-speech classification using multiple features for robust endpoint detection. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1899–1402). Istanbul, Turkey.
Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., West, M., & Cron, A. (2010). Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2), 419–438.
Teh, Y., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.