Presentation transcript:

APPLICATIONS OF DIRICHLET PROCESS MIXTURES TO SPEAKER ADAPTATION
Amir H. Harati Nejad Torbati and Joseph Picone, Department of Electrical and Computer Engineering; Marc Sobel, Department of Statistics

Adaptation
Definition: adjusting model parameters for new speakers. Adjusting all parameters requires too much data and is computationally complex. Solution: create clusters and adapt all models in a cluster together. Clusters are organized hierarchically; the classical solution is a binary regression tree, constructed using a centroid splitting algorithm. In transform-based adaptation, a transformation is calculated for each cluster. In Maximum Likelihood Linear Regression (MLLR), the transforms are computed using a maximum likelihood (ML) criterion.

Introduction
The performance of speaker-independent acoustic models in speech recognition is significantly lower than that of speaker-dependent models, yet training speaker-dependent models is impractical due to the limited amount of data. One of the most popular solutions is speaker adaptation, which transforms the means and covariances of all Gaussian components. Because of the huge number of components, we often need to tie (cluster) components together. The complexity of the model should be adapted to the available data.

Dirichlet Process Mixture (DPM)
One of the classical problems in clustering is determining the number of clusters, i.e., the complexity of the model. Dirichlet process (DP) mixture models use a nonparametric Bayesian framework to put a prior on the number of clusters: DPMs induce a prior under which the number of clusters can grow logarithmically with the number of observations. When combined with a likelihood, we obtain a complete model.
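To illustrate that logarithmic growth, here is a minimal simulation sketch of the Chinese restaurant process (the sequential view of the DP, cf. Figure 4), run at the same sample sizes shown in Figure 1. The concentration parameter alpha = 1.0 is an arbitrary illustrative choice, not a value from the poster.

```python
import random

def crp_num_clusters(n_customers, alpha=1.0, seed=0):
    """Simulate a Chinese restaurant process; return the number of tables.

    Customer n+1 sits at an existing table with probability proportional
    to its occupancy, or starts a new table with probability proportional
    to alpha. Tables correspond to clusters in a DPM.
    """
    rng = random.Random(seed)
    tables = []  # occupancy count per table
    for n in range(n_customers):
        r = rng.uniform(0, n + alpha)  # n customers already seated
        acc = 0.0
        for i, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[i] += 1  # join an existing table
                break
        else:
            tables.append(1)  # open a new table (a new cluster)
    return len(tables)

for n in (20, 200, 2000):  # the sample sizes shown in Figure 1
    print(n, crp_num_clusters(n))
```

With alpha = 1.0 the expected number of occupied tables after n customers is approximately log(n), which is why the complexity of a DPM adapts to the amount of available data.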
Algorithm
Premise: replace the binary regression tree in MLLR with a DPM.
Procedure:
1. Train a speaker-independent (SI) model. Collect all mixture components and their frequencies of occurrence.
2. Generate samples for each component and cluster them using a DPM model.
3. Construct a tree structure from the final result using a bottom-up approach: start from the terminal nodes and merge them based on Euclidean distance.
4. Assign clusters to each component using a majority-vote scheme.
5. With the resulting tree, compute the transformation matrix using a maximum likelihood approach.
Inference is accomplished using three different variational algorithms (a rough code sketch of the procedure follows this list):
1. Accelerated variational Dirichlet process mixtures (AVDP).
2. Collapsed variational stick-breaking (CSB).
3. Collapsed Dirichlet priors (CDP).
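The poster's three inference algorithms are specific variational methods; as a generic stand-in, scikit-learn's BayesianGaussianMixture implements a truncated variational DP mixture, which is enough to sketch steps 2–4. The feature dimension, component count, and occupancy counts below are hypothetical placeholders, not values from the RM experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Step 1 stand-in: pretend we collected 50 SI mixture components
# (39-dim means, e.g. MFCCs plus deltas) and their occupancy counts.
si_means = rng.normal(size=(50, 39))
si_counts = rng.integers(50, 500, size=50)

# Step 2: sample from each component in proportion to its frequency,
# then cluster the pooled samples with a DP mixture.
samples, owner = [], []
for k, (mu, count) in enumerate(zip(si_means, si_counts)):
    x = rng.normal(loc=mu, scale=0.5, size=(count // 10, 39))
    samples.append(x)
    owner.append(np.full(len(x), k))
X, owner = np.vstack(samples), np.concatenate(owner)

dpm = BayesianGaussianMixture(
    n_components=20,  # truncation level; the DP prior prunes unused components
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)
labels = dpm.predict(X)

# Step 3: build the regression tree bottom-up, merging the discovered
# cluster centroids on Euclidean distance.
clusters = np.unique(labels)
centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
tree = linkage(centroids, method="average", metric="euclidean")

# Step 4: assign each SI component to a cluster by majority vote
# over the labels of its own samples.
assignment = {k: np.bincount(labels[owner == k]).argmax()
              for k in range(len(si_means))}
print("discovered clusters:", len(clusters))
```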
Results for Monophone Models
Experiments used Resource Management (RM): monophone models with a single Gaussian mixture model, and 12 different speakers with 600 training utterances. The clustering result resembles broad phonetic classes. The DPM finds 6 clusters in the data, while the regression tree finds only 2, and the word error rate (WER) can be reduced by more than 10%.

Results for Cross-Word Models
Cross-word triphone models also use a single Gaussian mixture model, with 12 different speakers and 600 training utterances. The clusters generated using the DPM have acoustically and phonetically meaningful interpretations. AVDP works better for moderate amounts of data, while CDP and CSB work better for larger amounts of data.

Conclusion
It has been shown that, with enough data, DPMs surpass regression-tree results, producing clusters that have a meaningful acoustic interpretation. Lower performance in some cases may be related to our tree-construction approach. In this work we assigned each "distribution" to just one cluster; an obvious extension is to use some form of soft tying. The RM dataset and single Gaussian mixture models were used in this research; in the future we can expand the complexity of the model (e.g., multiple mixture components). We are also investigating the use of nonparametric Bayesian HMMs (HDP-HMM) in training, so that training and testing operate under matched conditions with the HDP-HMM inside the training loop.

Figures
Figure 1. Model complexity as a function of available data: (a) 20, (b) 200, (c) 2000 data points.
Figure 2. Mapping speaker-independent models to speaker-dependent models.
Figure 3. Dirichlet process mixture.
Figure 4. Chinese restaurant process.
Figure 5. A comparison of regression tree and AVDP approaches for monophone models.
Figure 6. The number of discovered clusters.
Figure 7. Comparison of WERs between regression-tree-based MLLR and several DPM inference algorithms for cross-word acoustic models.
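Returning to step 5 of the algorithm: in MLLR, each regression class (a node of the tree) receives an affine mean transform W = [A, b] estimated under an ML criterion; with diagonal covariances the exact solution is obtained row by row, as in Leggetter and Woodland's formulation. The sketch below is a simplified variant, assuming identity covariances and equal occupancy counts, under which the ML objective reduces to least squares on the extended means. All inputs are hypothetical.

```python
import numpy as np

def mllr_mean_transform(si_means, adapted_means):
    """Estimate an affine transform W = [A | b] mapping SI means toward
    the adaptation data.

    Simplified ML estimate: with identity covariances and equal occupancy
    for every Gaussian in the regression class, the MLLR objective reduces
    to least squares on the extended means [mu; 1].
    """
    xi = np.hstack([si_means, np.ones((len(si_means), 1))])  # (K, d+1)
    W, *_ = np.linalg.lstsq(xi, adapted_means, rcond=None)   # (d+1, d)
    return W.T                                               # (d, d+1)

# Hypothetical regression class: 8 Gaussians with 3-dim means.
rng = np.random.default_rng(1)
mu_si = rng.normal(size=(8, 3))
A_true = np.diag([1.1, 0.9, 1.0])
b_true = np.array([0.3, -0.2, 0.1])
mu_sd = mu_si @ A_true.T + b_true  # means suggested by the adaptation data

W = mllr_mean_transform(mu_si, mu_sd)
mu_hat = np.hstack([mu_si, np.ones((8, 1))]) @ W.T  # adapted means A*mu + b
print(np.allclose(mu_hat, mu_sd))  # True: the transform is recovered
```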