A Comparative Analysis of Bayesian Nonparametric Variational Inference Algorithms for Speech Recognition
John Steinberg
Institute for Signal and Information Processing, Temple University, Philadelphia, Pennsylvania, USA

Introduction

The Motivating Problem
A set of data is generated from multiple distributions, but it is unclear how many.

Parametric models:
- Require a priori assumptions about data structure
- Underlying structure is approximated with a limited number of mixtures
- Number of mixtures is rigidly set

Nonparametric models:
- Do not require a priori assumptions about data structure
- Underlying structure is learned from the data
- Number of mixtures can evolve
- Distributions over distributions: needs a prior!
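To make the contrast concrete, here is a minimal sketch (not part of the original experiments) using scikit-learn's truncated variational Dirichlet process mixture; the synthetic data, truncation level, and weight threshold are illustrative choices. The fixed-K GMM must be told how many components to use, while the DP mixture pushes the weights of unneeded components toward zero.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from 3 Gaussians; pretend the count is unknown.
X = np.concatenate([rng.normal(-4, 0.5, 300),
                    rng.normal(0, 1.0, 300),
                    rng.normal(5, 0.7, 300)]).reshape(-1, 1)

# Parametric: the number of mixtures is fixed a priori.
gmm = GaussianMixture(n_components=8, random_state=0).fit(X)

# Nonparametric (truncated DP): start from a generous truncation and let the
# posterior drive the weights of unused components toward zero.
dpm = BayesianGaussianMixture(
    n_components=8,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(X)

print("GMM weights:", np.round(gmm.weights_, 3))
print("DPM weights:", np.round(dpm.weights_, 3))
print("Effective DPM components:", int(np.sum(dpm.weights_ > 0.01)))
```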

Goals
- Use a nonparametric Bayesian approach to learn the underlying structure of speech data
- Investigate the viability of three variational inference algorithms for acoustic modeling:
  - Accelerated Variational Dirichlet Process Mixtures (AVDPM)
  - Collapsed Variational Stick Breaking (CVSB)
  - Collapsed Dirichlet Priors (CDP)
- Assess performance:
  - Compare error rates to parametric GMM models
  - Understand computational complexity

Goals: An Explanation
Why use Dirichlet process mixtures (DPMs)?
- Goal: automatically determine an optimal number of mixture components for each phoneme model
- DPMs generate the priors needed to solve this problem!

What is "stick breaking"?
- Step 1: Let p_1 = θ_1. The remaining stick now has length 1 - θ_1.
- Step 2: Break off a fraction θ_2 of the remaining stick. Now p_2 = θ_2(1 - θ_1), and the remaining stick has length (1 - θ_1)(1 - θ_2).
- If this is repeated k times, the remaining stick has length (1 - θ_1)(1 - θ_2)···(1 - θ_k), and the corresponding weight is p_k = θ_k(1 - θ_1)···(1 - θ_(k-1)).
[Figure: a unit-length stick broken into segments θ_1, θ_2, θ_3, ...]
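The construction is easy to simulate. The sketch below is a minimal numpy illustration of sampling truncated stick-breaking (GEM) weights; the function name, concentration value, and truncation level are illustrative choices, not taken from the thesis.

```python
import numpy as np

def stick_breaking_weights(alpha, k, seed=None):
    """Sample k stick-breaking (GEM) weights for a truncated DP.

    Each break fraction theta_i ~ Beta(1, alpha); the weight p_i is the
    fraction broken off of whatever stick length remains.
    """
    rng = np.random.default_rng(seed)
    theta = rng.beta(1.0, alpha, size=k)                  # break fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
    return theta * remaining                              # p_i = theta_i * prod_{j<i}(1 - theta_j)

weights = stick_breaking_weights(alpha=1.0, k=10, seed=0)
print(np.round(weights, 3), "unbroken remainder:", round(1.0 - weights.sum(), 3))
```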

Background

Inference: An Approximation
Inference: estimating probabilities in statistically meaningful ways.
Parameter estimation is computationally difficult:
- Distributions over distributions imply an infinite number of parameters
- Posteriors, p(y|x), cannot be solved analytically

Variational inference:
- Uses independence assumptions to create simpler variational distributions, q(y), that approximate p(y|x)
- Optimizes q over Q = {q_1, q_2, ..., q_m} using an objective function, e.g. the Kullback-Leibler divergence
- Constraints can be added to Q to improve computational efficiency
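As an illustration of that objective, the sketch below evaluates the closed-form KL divergence between univariate Gaussians and selects the best q from a small candidate family Q; the target p and the candidate family are hypothetical stand-ins for the intractable posterior and the constrained variational family.

```python
import numpy as np

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) for two univariate Gaussians."""
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)

# Hypothetical target posterior p(y|x) (known here only for illustration).
mu_p, var_p = 2.0, 1.5

# Candidate variational family Q: Gaussians with variance fixed at 1.0.
candidates = [(mu, 1.0) for mu in np.linspace(-2.0, 4.0, 13)]
best = min(candidates, key=lambda q: kl_gaussian(q[0], q[1], mu_p, var_p))
print("Best q in Q (mean, variance):", best)
```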

Variational Inference Algorithms
Accelerated Variational Dirichlet Process Mixtures (AVDPM):
- Incorporates kd-trees to improve efficiency
- Complexity: O(N log N) + O(2^depth)

Collapsed Variational Stick Breaking (CVSB) and Collapsed Dirichlet Priors (CDP):
- Truncate the DPM to a maximum of K clusters and marginalize out the mixture weights
- Complexity: O(TN)

[Figure: a kd-tree recursively partitioning the data points A, B, C, D, E, F, G.]
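The kd-tree acceleration can be sketched as follows: each node caches sufficient statistics (count, sum, sum of squares) for the points beneath it, so a variational update can treat an entire subtree as a single unit and the per-iteration work scales with 2^depth rather than N. This is a simplified illustration of the idea, not the actual AVDPM implementation; the class and parameter names are hypothetical.

```python
import numpy as np

class KDNode:
    """kd-tree node caching sufficient statistics for the points beneath it."""
    def __init__(self, points, depth=0, max_depth=4):
        self.n = len(points)                        # count
        self.sum = points.sum(axis=0)               # per-dimension sum
        self.sum_sq = (points ** 2).sum(axis=0)     # per-dimension sum of squares
        self.left = self.right = None
        if depth < max_depth and self.n > 1:
            axis = depth % points.shape[1]          # cycle through the dimensions
            order = points[:, axis].argsort()       # split at the median
            mid = self.n // 2
            self.left = KDNode(points[order[:mid]], depth + 1, max_depth)
            self.right = KDNode(points[order[mid:]], depth + 1, max_depth)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
root = KDNode(X, max_depth=4)                       # at most 2^4 = 16 leaves
print("root count:", root.n, "root mean:", np.round(root.sum / root.n, 3))
```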

Experimental Setup

Experimental Design & Data
Phoneme recognition (TIMIT, CH-E, CH-M):
- Acoustic models trained for phoneme alignment
- Phoneme alignments generated using HTK

Corpora:
- TIMIT: studio-recorded, read speech; 630 speakers, ~130,000 phones; 39 phoneme labels
- CALLHOME English (CH-E): spontaneous, conversational telephone speech; 120 conversations, ~293,000 training samples; 42 phoneme labels
- CALLHOME Mandarin (CH-M): spontaneous, conversational telephone speech; 120 conversations, ~250,000 training samples; 92 phoneme labels

Evaluation: Error Rate Comparison
- NN: TIMIT 30.54% (140 neurons in hidden layer); CH-E 47.62% (170 neurons in hidden layer); CH-M 52.92% (130 neurons in hidden layer)
- KNN: TIMIT 32.08% (K = __); CH-E __% (K = __); CH-M __% (K = 28)
- RF: TIMIT 32.04% (150 trees); CH-E 48.95% (150 trees); CH-M 55.72% (150 trees)
- GMM: TIMIT 38.02% (# Mixt. = __); CH-E __% (# Mixt. = __); CH-M __% (# Mixt. = 64)
- AVDPM: TIMIT 37.14% (Init. Depth = 4, Avg. # Mixt. = __); CH-E __% (Init. Depth = 6, Avg. # Mixt. = __); CH-M __% (Init. Depth = 8, Avg. # Mixt. = 5.01)
- CVSB: TIMIT 40.30% (Trunc. Level = 4, Avg. # Mixt. = __); CH-E __% (Trunc. Level = 6, Avg. # Mixt. = __); CH-M __% (Trunc. Level = 6, Avg. # Mixt. = 5.75)
- CDP: TIMIT 40.24% (Trunc. Level = 4, Avg. # Mixt. = __); CH-E __% (Trunc. Level = 10, Avg. # Mixt. = __); CH-M __% (Trunc. Level = 6, Avg. # Mixt. = 5.75)

Computational Complexity: Training Samples
[Figure: computational complexity as a function of the number of TIMIT training samples.]

Conclusions and Future Work
Conclusions:
- AVDPM, CVSB, and CDP yield error rates comparable to GMM models
- AVDPM, CVSB, and CDP use far fewer mixtures per phoneme label than standard GMMs
- AVDPM is much better suited for large corpora, since the kd-tree significantly reduces training time without substantially degrading error rates
- The performance gap between CH-E and CH-M can be attributed to the number of phoneme labels

Future work:
- Investigate methods to improve covariance estimation
- Apply AVDPM to HDP-HMM systems to move toward a fully Bayesian nonparametric speech recognizer

Acknowledgements
Thanks to my committee for all of the help and support:
- Dr. Iyad Obeid
- Dr. Joseph Picone
- Dr. Marc Sobel
- Dr. Chang-Hee Won
- Dr. Alexander Yates

Thanks to my research group for all of their patience and support:
- Amir Harati
- Shuang Lu

Thanks also to the Linguistic Data Consortium (LDC) for awarding a data scholarship to this project and providing the lexicon and transcripts for CALLHOME Mandarin, and to Owlsnest.¹
¹ This research was supported in part by the National Science Foundation through Major Research Instrumentation Grant No. CNS.

Brief Bibliography of Related Research
1. Bussgang, J. (2012). Seeing Both Sides. Retrieved November 27, 2012.
2. Ng, A. (2012). Machine Learning. Retrieved November 27, 2012.
3. Kurihara, K., Welling, M., & Vlassis, N. (2007). Accelerated variational Dirichlet process mixtures. Advances in Neural Information Processing Systems. Cambridge, Massachusetts, USA: MIT Press (editors: B. Schölkopf and J. C. Hofmann).
4. Kurihara, K., Welling, M., & Teh, Y. W. (2007). Collapsed variational Dirichlet process mixture models. Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India.
5. Steinberg, J., & Picone, J. (2012). HTK Tutorials.
6. Harati, A., Picone, J., & Sobel, M. (2012). Applications of Dirichlet Process Mixtures to Speaker Adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4321–4324). Kyoto, Japan.
7. Frigyik, B., Kapila, A., & Gupta, M. (2010). Introduction to the Dirichlet Distribution and Related Processes. Seattle, Washington, USA.
8. Blei, D. M., & Jordan, M. I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1, 121–144.
9. Zografos, V. Wikipedia. Retrieved November 27, 2012.
10. Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 879–893.
11. Quintana, F. A., & Muller, P. (2004). Nonparametric Bayesian Data Analysis. Statistical Science, 19(1), 95–.
12. Harati, A., & Picone, J. (2012). Applications of Dirichlet Process Models to Speech Processing and Machine Learning. IEEE Section Meeting, Northern Virginia. Fairfax, Virginia, USA.
13. Teh, Y. W. (2007). Dirichlet Processes: Tutorial and Practical Course. Retrieved November 30, 2012.