A Left-to-Right HDP-HMM with HDPM Emissions Amir Harati, Joseph Picone and Marc Sobel Institute for Signal and Information Processing Temple University.

Presentation transcript:

A Left-to-Right HDP-HMM with HDPM Emissions Amir Harati, Joseph Picone and Marc Sobel Institute for Signal and Information Processing Temple University Philadelphia, Pennsylvania, USA

48th Annual Conference on Information Sciences and Systems, March 20

Abstract
Nonparametric Bayesian (NPB) methods, in which a prior is placed over the model complexity (or model structure), are a popular alternative to conventional Bayesian approaches. The Hierarchical Dirichlet Process hidden Markov model (HDP-HMM) is the nonparametric Bayesian equivalent of an HMM. The HDP-HMM is restricted to an ergodic topology and uses a Dirichlet Process Mixture (DPM) to achieve a mixture distribution-like model. A new type of HDP-HMM is introduced that: preserves the useful left-to-right properties of a conventional HMM, yet still supports automated learning of the structure and complexity from data; uses HDPM emissions, which allow a model to share data points among different states; introduces non-emitting states. This new model produces better likelihoods relative to the original HDP-HMM and has much better scalability properties.

Nonparametric Bayesian Models
Parametric models: number of parameters fixed; model selection / averaging; discrete optimization. Nonparametric Bayesian models: inferring the model from the data; alternative to model selection; immunity to over-fitting.

Dirichlet Distributions – Prior for Bayesian Models
In the functional form: q ∈ ℝ^k is a probability mass function (pmf); {α_i} is a vector of concentration parameters that can be interpreted as pseudo-observations. Pseudo-observations reflect your beliefs about the priors and are related to the number of observations in each category previously seen. The total number of pseudo-observations is α_0. The Dirichlet distribution is a conjugate prior for a multinomial distribution.
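For reference, the textbook Dirichlet density can be written in the notation used here as follows. This is the standard form, stated as background rather than copied from the slide; the counts n_i in the second line are an illustrative symbol for the number of observations in category i.

```latex
\mathrm{Dir}(q \mid \alpha_1, \dots, \alpha_k)
  = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}
    \prod_{i=1}^{k} q_i^{\alpha_i - 1},
\qquad q_i \ge 0,\quad \sum_{i=1}^{k} q_i = 1,\quad \alpha_0 = \sum_{i=1}^{k} \alpha_i

% Conjugacy: after observing multinomial counts n_1, ..., n_k,
q \mid n_1, \dots, n_k \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_k + n_k)
```

The conjugate update makes the pseudo-observation reading concrete: each observed count is simply added to the corresponding α_i.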

Example: A Distribution over the 3-Dimensional Probability Simplex
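A minimal sketch of what such an example might look like numerically, assuming NumPy is available. The concentration parameters (2, 3, 5) are illustrative values chosen here, not taken from the slide.

```python
import numpy as np

# Illustrative concentration parameters (an assumption, not from the slides).
alpha = np.array([2.0, 3.0, 5.0])

rng = np.random.default_rng(0)
# Each row is one pmf over three categories, i.e. a point on the simplex.
samples = rng.dirichlet(alpha, size=5000)

row_sums = samples.sum(axis=1)          # every sample sums to one
empirical_mean = samples.mean(axis=0)   # Monte Carlo estimate of E[q]
expected_mean = alpha / alpha.sum()     # analytic mean: alpha_i / alpha_0
```

Larger values of α_0 concentrate the samples around the mean, while values below one push mass toward the corners of the simplex.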

Dirichlet Processes – Infinite Sequence of Random Variables
A Dirichlet process can be viewed as a Dirichlet distribution split infinitely many times (for example, q_1 and q_2 are split into q_11, q_12 and q_21, q_22, and so on). The result is a discrete distribution with an infinite number of atoms. H: base distribution. α: concentration parameter.
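One common way to draw from a Dirichlet process is the stick-breaking construction of Sethuraman (1994). The sketch below is a truncated approximation under stated assumptions: the truncation level, the standard normal base distribution H, and the function names are all illustrative choices, not part of the presentation.

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Return the first `truncation` stick-breaking weights of a DP(alpha, H)."""
    v = rng.beta(1.0, alpha, size=truncation)            # stick proportions
    # Length of stick remaining before each break: prod_{l<k} (1 - v_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

def sample_truncated_dp(alpha, base_sampler, truncation, rng):
    """Draw atoms from the base distribution and pair them with weights."""
    weights = stick_breaking(alpha, truncation, rng)
    atoms = base_sampler(truncation, rng)
    return atoms, weights

rng = np.random.default_rng(0)
# Hypothetical base distribution H: a standard normal.
atoms, weights = sample_truncated_dp(
    alpha=2.0,
    base_sampler=lambda n, r: r.normal(0.0, 1.0, size=n),
    truncation=200,
    rng=rng,
)
```

A small α puts most of the mass on a few atoms; a large α spreads it over many, so the draw looks more like H itself.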

Hierarchical Dirichlet Process – Nonparametric Clustering
Dirichlet Process Mixture (DPM): An infinite mixture model assumes that the data come from a mixture of an infinite number of distributions. Hierarchical Dirichlet Process (HDP): Consider data organized into several groups (e.g., documents). A DP can be used to define a mixture over each group. A common DP can be used to model a base distribution for all DPs.
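The sharing of mixture components across groups can be sketched with a truncated construction: a global DP supplies a discrete set of atoms, and each group-level DP draws its own atoms from that global set. This is a hedged illustration only; the truncation level, the concentration names `gamma` and `alpha`, and the function names are assumptions made here.

```python
import numpy as np

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights, closed so that they sum to one."""
    v = rng.beta(1.0, concentration, size=truncation)
    v[-1] = 1.0  # assign all remaining mass to the last stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_hdp_weights(gamma, alpha, n_groups, truncation, rng):
    """Return global weights and per-group weights over the same atoms."""
    beta = stick_breaking(gamma, truncation, rng)
    beta = beta / beta.sum()  # guard against floating-point drift
    group_weights = np.zeros((n_groups, truncation))
    for j in range(n_groups):
        sticks = stick_breaking(alpha, truncation, rng)
        # Each group-level stick picks one of the shared global atoms.
        atom_ids = rng.choice(truncation, size=truncation, p=beta)
        np.add.at(group_weights[j], atom_ids, sticks)
    return beta, group_weights

rng = np.random.default_rng(0)
beta, group_weights = sample_hdp_weights(
    gamma=3.0, alpha=2.0, n_groups=3, truncation=50, rng=rng
)
```

Every group places its mass on the same global atoms but with different weights, which is the property that lets groups share clusters.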

Hidden Markov Models
Markov chain: A memoryless stochastic process. States are observed at each time, t. The probability of being at any state at time t+1 is a function of the state at time t. Hidden Markov models (HMMs): A Markov chain where states are not observed. An observed sequence is the output of a probability distribution associated with each state. A model is characterized by: number of states; transition probabilities between these states; emission probability distributions for each state. Expectation-Maximization (EM) is used for training.
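The three ingredients listed above (states, transition probabilities, emission distributions) are enough to evaluate the likelihood of an observation sequence with the forward algorithm. The following is a minimal sketch for discrete emissions; the two-state left-to-right model and its numbers are made up for illustration.

```python
import numpy as np

def forward_log_likelihood(initial, transition, emission, observations):
    """Log-likelihood of a discrete observation sequence under an HMM.

    initial:    (N,) initial state probabilities
    transition: (N, N) matrix, transition[i, j] = P(state j | state i)
    emission:   (N, V) matrix, emission[i, v] = P(symbol v | state i)
    """
    alpha = initial * emission[:, observations[0]]
    log_likelihood = 0.0
    for obs in observations[1:]:
        scale = alpha.sum()                 # rescale to avoid underflow
        log_likelihood += np.log(scale)
        alpha = (alpha / scale) @ transition * emission[:, obs]
    return log_likelihood + np.log(alpha.sum())

# Hypothetical left-to-right model: state 0 can stay or move to state 1.
initial = np.array([1.0, 0.0])
transition = np.array([[0.7, 0.3],
                       [0.0, 1.0]])
emission = np.array([[0.5, 0.4, 0.1],
                     [0.1, 0.3, 0.6]])
observations = [0, 1, 2]

log_lik = forward_log_likelihood(initial, transition, emission, observations)
```

The zero in the lower-left corner of `transition` is what makes the topology left-to-right: once the model leaves state 0 it cannot return.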

Hierarchical Dirichlet Process-Based HMM (HDP-HMM)
In the definition and graphical model, z_t, s_t and x_t represent a state, mixture component and observation, respectively. Inference algorithms are used to infer the values of the latent variables (z_t and s_t). A variation of the forward-backward procedure is used for training. K_z is the maximum number of states and K_s is the maximum number of components per mixture.
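For orientation, an HDP-HMM with DPM emissions is commonly written in the following form, after the sticky HDP-HMM of Fox et al. (2011). The hyperparameters γ, α, κ and σ, the base distribution H and the emission family F are symbols from that formulation and are assumptions here; they are not recovered from this slide.

```latex
\begin{aligned}
\beta \mid \gamma &\sim \mathrm{GEM}(\gamma) \\
\pi_j \mid \alpha, \kappa, \beta &\sim
  \mathrm{DP}\!\left(\alpha + \kappa,\ \frac{\alpha\beta + \kappa\,\delta_j}{\alpha + \kappa}\right) \\
\psi_j \mid \sigma &\sim \mathrm{GEM}(\sigma) \\
\theta_{j,m} \mid H &\sim H \\
z_t \mid z_{t-1}, \{\pi_j\} &\sim \pi_{z_{t-1}} \\
s_t \mid z_t, \{\psi_j\} &\sim \psi_{z_t} \\
x_t \mid z_t, s_t, \{\theta_{j,m}\} &\sim F\!\left(\theta_{z_t, s_t}\right)
\end{aligned}
```

Here π_j is the transition distribution out of state j, ψ_j is the mixture-weight distribution of state j, and θ_{j,m} are the parameters of component m in state j.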

The Acoustic Modeling Problem in Speech Recognition
The goal of speech recognition is to map the acoustic data into word sequences using Bayes' rule, where: P(W|A) is the probability of a particular word sequence given acoustic observations; P(W) is the language model; P(A) is the probability of the observed acoustic data and usually can be ignored; P(A|W) is the acoustic model.
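The relationship among these four terms is the standard Bayes decomposition, written out here for reference; the symbol \hat{W} for the recognized word sequence is notation introduced for this note.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

P(A) drops out of the maximization because it does not depend on W.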

Left-to-Right HDP-HMM with HDPM Emissions
In many pattern recognition applications involving temporal structure, such as speech recognition, a left-to-right topology is used to model the temporal order of the signal. In speech recognition, all acoustic units use the same topology and the same number of mixtures; i.e., the complexity is fixed for all models. Given more data, a model's structure (e.g., the topology) will remain the same and only the parameter values change. The amount of data associated with each model varies, which implies some models are overtrained while others are undertrained. Because of the lack of hierarchical structure, techniques for extending the model tend to be heuristic. Example: gender-specific models are trained as separate models. A counterexample is decision trees, which provide a principled hierarchical structure for clustering.

Relevant Work
Bourlard (1993) and others proposed to replace Gaussian mixture models (GMMs) with a neural network based on a multilayer perceptron (MLP). It was shown that MLPs generate reasonable estimates of a posterior distribution of an output class conditioned on the input patterns. This hybrid HMM-MLP system produced small gains over traditional HMMs. Lefèvre (2003) and Shang (2009) replaced GMMs with nonparametric density estimators (e.g., kernel methods). Henter et al. (2012) introduced a Gaussian process dynamical model (GPDM) for speech synthesis. Each of these approaches was proposed to model the emission distributions using a nonparametric method, but none addressed the model topology problem.

New Features of the HDP-HMM/HDPM Model
Introduce an HDP-HMM with a left-to-right topology (which is crucial for modeling the temporal structure in speech). Incorporate HDP emissions into an HDP-HMM, which allows a common pool of mixture components to be shared among states. Non-emitting "initial" and "final" states are included in the final definition; these are critical for modeling finite sequences and connecting models.

Mathematical Definition
[Slide shows the model definition and its graphical model.]

Non-emitting States
The inference algorithm estimates the probability of self-transitions (P_1) and transitions to other emitting states (P_2), but each state can also transition to a non-emitting state (P_3). Since P_1 + P_2 + P_3 = 1, we can reestimate P_1 and P_3 by fixing P_2. This is similar to tossing a coin until a first head is obtained (which can be modeled as a geometric distribution). A maximum likelihood (ML) estimate can be obtained in terms of M and k_i, where M is the number of examples in which state i is the last state of the model and k_i is the number of self-transitions for state i.
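The coin-tossing analogy can be sketched as follows. This is one possible reading of the reestimation, not the authors' formula: it assumes that P_2 is held fixed, that the remaining mass 1 - P_2 is split between staying (P_1) and exiting (P_3), and that the exit fraction is the ML estimate of a geometric distribution from M observed exits and the total number of self-transitions. The function name and the per-example count list are hypothetical.

```python
def reestimate_exit_probabilities(p2, self_transition_counts):
    """Hedged sketch: split the mass 1 - p2 between P1 (stay) and P3 (exit).

    p2: fixed probability of moving to another emitting state.
    self_transition_counts: one entry per example that ends in this state,
        giving the number of self-transitions observed in that example
        (an assumed interpretation of k_i).
    """
    if not 0.0 <= p2 < 1.0:
        raise ValueError("p2 must lie in [0, 1)")
    m = len(self_transition_counts)          # number of observed exits (M)
    if m == 0:
        raise ValueError("need at least one example ending in this state")
    total_self = sum(self_transition_counts)
    # Geometric ML estimate: exits / (exits + self-transitions).
    exit_fraction = m / (m + total_self)
    p3 = (1.0 - p2) * exit_fraction
    p1 = (1.0 - p2) * (1.0 - exit_fraction)
    return p1, p3

p1, p3 = reestimate_exit_probabilities(p2=0.2, self_transition_counts=[3, 1, 4])
```

Under these assumptions the three probabilities still sum to one, and a state that rarely self-loops before the sequence ends receives a larger exit probability.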

Results – Simulation
Data is generated from an LR-HMM with 1 to 3 mixtures per state. Held-out data is used to assess the models.

Results – Computation Time and Scalability
HDP-HMM/DPM computation time is proportional to K_s × K_z. HDP-HMM/HDPM inference time is proportional to K_s, because the mixture components are shared among all states.

Results – TIMIT Classification
The data used in this illustration was extracted from the TIMIT Corpus, where a phoneme-level transcription is available. MFCC features plus their 1st and 2nd derivatives are used (39 dimensions). A state-of-the-art parametric HMM/GMM is used for comparison. Classification results show around 15% improvement.

A Comparison of Classification Error Rates

Model                           Error Rate
HMM/GMM (10 components)         27.8%
LR-HDP-HMM/GMM (1 component)    26.7%
LR-HDP-HMM                      24.1%

Results – TIMIT Classification
Automatically derived model structures (without the first and last non-emitting states) for: (a) /aa/ with 175 examples; (b) /sh/ with 100 examples; (c) /aa/ with 2256 examples; (d) /sh/ with 1317 examples.

Summary
Performance results: HDPM emissions can replace DPM emissions in most applications (for both LR and ergodic models) without losing performance, while the scalability of the model improves significantly. LR HDP-HMM models can also learn multimodality in data. For example, for a single phoneme, an LR HDP-HMM can learn parallel paths corresponding to different types of speakers, while at the same time data can be shared among states if HDPM emissions are used. Three theoretical contributions: a left-to-right HDP-HMM model; introducing HDP emissions to the HDP-HMM model; introducing non-emitting states to the model. Compared to the EM algorithm, the inference algorithm for HDP-HMM models is computationally expensive. One future direction is to investigate approaches based on variational inference for these models. Another direction is to add another level of hierarchy to the model to share data among several HDP-HMM models.

References
1. Bourlard, H., & Morgan, N. (1993). Connectionist Speech Recognition: A Hybrid Approach. Springer.
2. Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2011). A Sticky HDP-HMM with Application to Speaker Diarization. The Annals of Applied Statistics, 5(2A), 1020–.
3. Harati, A., Picone, J., & Sobel, M. (2012). Applications of Dirichlet Process Mixtures to Speaker Adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4321–4324). Kyoto, Japan.
4. Harati, A., Picone, J., & Sobel, M. (2013). Speech Segmentation Using Hierarchical Dirichlet Processes. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (p. TBD). Vancouver, Canada.
5. Lefèvre, F. (2003). Non-parametric probability estimation for HMM-based automatic speech recognition. Computer Speech & Language, 17(2-3), 113–.
6. Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 879–.
7. Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 639–.
8. Shang, L. (2009). Nonparametric Discriminant HMM and Application to Facial Expression Recognition. IEEE Conference on Computer Vision and Pattern Recognition (pp. 2090–2096). Miami, FL, USA.
9. Shin, W., Lee, B.-S., Lee, Y.-K., & Lee, J.-S. (2000). Speech/non-speech classification using multiple features for robust endpoint detection. Proceedings of the IEEE International Conference on ASSP (pp. 1399–1402). Istanbul, Turkey.
10. Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., West, M., & Cron, A. (2010). Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures. Journal of Computational and Graphical Statistics, 19(2), 419–.
11. Teh, Y., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476), 1566–1581.

Biography
Amir Harati is a PhD candidate in the Department of Electrical and Computer Engineering at Temple University. He received his Bachelor's Degree from Tabriz University in 2004 and his Master's Degree from K.N. Toosi University in 2008, both in Electrical and Computer Engineering. He has also worked as a signal processing researcher for Bina-Pardaz LTD in Mashhad, Iran, where he was responsible for developing algorithms for geolocation using a variety of types of emitter technology. He is currently pursuing a PhD in Electrical Engineering at Temple University. The focus of his research is the application of nonparametric Bayesian methods in acoustic modeling for speech recognition. He is also the senior scientist on a commercialization project involving a collaboration between Temple Hospital and the Neural Engineering Data Consortium to automatically interpret EEG signals. Mr. Harati has published one journal paper and five conference papers on machine learning applications in signal processing. He is a member of the IEEE and Eta Kappa Nu.