A Comparative Analysis of Bayesian Nonparametric Variational Inference Algorithms for Speech Recognition
John Steinberg
Institute for Signal and Information Processing, Temple University, Philadelphia, Pennsylvania, USA

Abstract
Nonparametric Bayesian models have become increasingly popular in speech recognition tasks such as language and acoustic modeling because of their ability to discover underlying structure in an iterative manner. These methods do not require a priori assumptions about the structure of the data, such as the number of mixture components, and can learn this structure directly. Dirichlet process mixtures (DPMs) are a widely used Bayesian nonparametric method that can serve as a prior for determining an optimal number of mixture components, and their respective weights, in a Gaussian mixture model (GMM) used to model speech. Because DPMs potentially have infinitely many parameters, inference algorithms are needed to make posterior calculations tractable. The focus of this work is an evaluation of three such Bayesian variational inference algorithms, which have only recently become computationally viable: Accelerated Variational Dirichlet Process Mixtures (AVDPM), Collapsed Variational Stick Breaking (CVSB), and Collapsed Dirichlet Priors (CDP). To eliminate other effects on performance, such as language models, a phoneme classification task is chosen so that the viability of these algorithms for acoustic modeling can be assessed more clearly. Evaluations were conducted on the CALLHOME English and Mandarin corpora, two languages that, from a human perspective, are phonologically very different. This work shows that these inference algorithms yield error rates comparable to a baseline Gaussian mixture model (GMM) but with up to a factor of 20 fewer mixture components. AVDPM is shown to be the most attractive choice because it delivers the most compact models and is computationally efficient, enabling its application to big data problems.

Introduction

The Motivating Problem
A set of data is generated from multiple distributions, but it is unclear how many.
- Parametric methods assume the number of distributions is known a priori.
- Nonparametric methods learn the number of distributions from the data, e.g., by modeling a distribution over distributions.

Goals
Use a nonparametric Bayesian approach to learn the underlying structure of speech data.
Investigate the viability of three variational inference algorithms for acoustic modeling:
- Accelerated Variational Dirichlet Process Mixtures (AVDPM)
- Collapsed Variational Stick Breaking (CVSB)
- Collapsed Dirichlet Priors (CDP)
Assess performance:
- Compare error rates to parametric GMM models
- Understand computational complexity
- Identify any language-specific artifacts

Goals: An Explanation
Why use Dirichlet process mixtures (DPMs)?
- Goal: automatically determine an optimal number of mixture components for each phoneme model.
- DPMs generate the priors needed to solve this problem.
What is "stick breaking"?
Step 1: Let $p_1 = \theta_1$. The remaining stick now has length $1 - \theta_1$.
Step 2: Break off a fraction $\theta_2$ of the remaining stick. Now $p_2 = \theta_2 (1 - \theta_1)$, and the remaining stick has length $(1 - \theta_1)(1 - \theta_2)$.
If this is repeated k times, the remaining stick has length $\prod_{i=1}^{k}(1 - \theta_i)$ and the corresponding weight is $p_k = \theta_k \prod_{i=1}^{k-1}(1 - \theta_i)$.
[Figure: a unit-length stick successively broken into segments of length θ1, θ2, θ3, ...]
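The construction above translates directly into a few lines of code. The following is a minimal sketch, assuming each break fraction is drawn from a Beta(1, α) distribution as in the standard stick-breaking construction of a DP; the function and variable names are ours, not from the slides.

```python
import numpy as np

def stick_breaking_weights(alpha, k, seed=None):
    """Generate k mixture weights by breaking a unit-length stick.

    Each break fraction theta_i ~ Beta(1, alpha); the i-th weight is
    theta_i times the length of the stick left after the first i-1 breaks.
    """
    rng = np.random.default_rng(seed)
    theta = rng.beta(1.0, alpha, size=k)                        # break fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
    weights = theta * remaining                                 # p_i = theta_i * prod_{j<i} (1 - theta_j)
    leftover = np.prod(1.0 - theta)                             # length of the unbroken remainder
    return weights, leftover

# Example: a smaller alpha concentrates the mass in the first few components.
weights, leftover = stick_breaking_weights(alpha=1.0, k=10, seed=0)
print(weights.round(3), round(leftover, 3))
```

The weights plus the leftover length always sum to 1, which is what makes these quantities usable as prior mixture weights.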

Background

Speech Recognition Overview

Parametric vs. Nonparametric Models
Parametric:
- Requires a priori assumptions about data structure
- Underlying structure is approximated with a limited number of mixtures
- Number of mixtures is rigidly set
Nonparametric:
- Does not require a priori assumptions about data structure
- Underlying structure is learned from data
- Number of mixtures can evolve
- Distributions over distributions, so a prior is needed!
Complex models frequently require inference algorithms for approximation!

Taxonomy of Nonparametric Models
Inference algorithms are needed to approximate these infinitely complex models.

Why Use Dirichlet Processes?
The assignment of data to mixture components in a GMM is naturally characterized by a multinomial distribution.
- Choosing the number of mixtures and their weights requires priors.
- Dirichlet distributions (DDs) and Dirichlet processes (DPs) are the conjugate priors of the multinomial distribution.
- DDs and DPs can find the optimal GMM structure.
- DPs produce discrete priors over a discrete number of mixtures.
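To spell out the conjugacy argument (a standard result, stated here in our own notation rather than taken from the slides): if the mixture weights q have a Dirichlet prior and the component assignment counts are multinomial given q, the posterior over q is again Dirichlet, so it can be updated in closed form:

$$ q \sim \mathrm{Dir}(\alpha_1,\ldots,\alpha_k), \quad (n_1,\ldots,n_k) \sim \mathrm{Mult}(n, q) \;\Rightarrow\; q \mid n_1,\ldots,n_k \sim \mathrm{Dir}(\alpha_1 + n_1,\ldots,\alpha_k + n_k). $$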

Dirichlet Distributions
Dirichlet distribution pdf:
$$ \mathrm{Dir}(q;\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} q_i^{\alpha_i - 1} $$
- $q \in \mathbb{R}^k$: a probability mass function (pmf)
- $\alpha$: a concentration parameter
Properties of Dirichlet distributions:
- Agglomerative property (joining)
- Decimative property (splitting)
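For reference, the two properties can be stated as follows; this is our paraphrase of standard results (see the Frigyik, Kapila, and Gupta tutorial in the bibliography), not text taken from the slides.

Agglomerative (joining): if $(q_1,\ldots,q_k) \sim \mathrm{Dir}(\alpha_1,\ldots,\alpha_k)$, then
$$ (q_1 + q_2,\, q_3,\ldots,q_k) \sim \mathrm{Dir}(\alpha_1 + \alpha_2,\, \alpha_3,\ldots,\alpha_k). $$
Decimative (splitting): if in addition $(\tau_1, \tau_2) \sim \mathrm{Dir}(\alpha_1\beta_1, \alpha_1\beta_2)$ with $\beta_1 + \beta_2 = 1$, then
$$ (q_1\tau_1,\, q_1\tau_2,\, q_2,\ldots,q_k) \sim \mathrm{Dir}(\alpha_1\beta_1,\, \alpha_1\beta_2,\, \alpha_2,\ldots,\alpha_k). $$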

Dirichlet Processes (DPs)
A Dirichlet process is a Dirichlet distribution split infinitely many times.
[Figure: the probabilities q1 and q2 are repeatedly split into q11, q12, q21, q22, and so on.]
These discrete probabilities are used as a prior for our infinite mixture model.

Inference: An Approximation
Inference: estimating probabilities in statistically meaningful ways.
Parameter estimation is computationally difficult:
- Distributions over distributions imply infinitely many parameters.
- Posteriors, p(y|x), cannot be solved analytically.
Sampling methods (e.g., MCMC):
- Samples estimate the true distribution.
- Drawbacks: a large number of samples is needed for accuracy; the step size must be chosen carefully; the "burn-in" phase must be monitored and controlled.

Variational Inference
Converts the sampling problem into an optimization problem:
- Avoids the need for careful monitoring of sampling.
- Uses independence assumptions to create simpler variational distributions, q(y), that approximate p(y|x).
- Optimizes q over a family Q = {q_1, q_2, ..., q_m} using an objective function, e.g. the Kullback-Leibler divergence.
- EM or other gradient-descent algorithms can be used.
- Constraints can be added to Q to improve computational efficiency.
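Written out explicitly (a standard formulation of variational inference, not copied from the slides), the optimization picks the member of Q closest to the true posterior in KL divergence, which is equivalent to maximizing the evidence lower bound:

$$ q^{*} = \arg\min_{q \in Q} \mathrm{KL}\big(q(y) \,\|\, p(y \mid x)\big), \qquad \log p(x) = \underbrace{\mathbb{E}_{q}[\log p(x, y)] - \mathbb{E}_{q}[\log q(y)]}_{\mathcal{L}(q)} + \mathrm{KL}\big(q(y) \,\|\, p(y \mid x)\big). $$

Because $\log p(x)$ does not depend on q, minimizing the KL term is the same as maximizing $\mathcal{L}(q)$, which is what EM-style coordinate updates do in practice.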

Variational Inference Algorithms
Accelerated Variational Dirichlet Process Mixtures (AVDPM):
- Limits computation of Q: for i > T, q_i is set to its prior.
- Incorporates KD trees to improve efficiency.
- Complexity: O(N log N) + O(2^depth).
[Figure: a KD tree recursively partitioning the data points A through G into smaller and smaller nodes.]
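The role of the KD tree can be illustrated with a toy sketch: the data are recursively split along alternating dimensions, and truncating the recursion at a given depth leaves at most 2^depth cells, each summarized by sufficient statistics that stand in for its member points. This is only a simplified illustration of the partitioning idea under our own assumptions (synthetic data, median splits); the actual AVDPM node updates are described in Kurihara, Welling, and Vlassis (reference 3).

```python
import numpy as np

def kd_partition(x, depth, axis=0):
    """Recursively split data by the median along alternating axes.

    Truncating at `depth` leaves at most 2**depth cells; only each
    cell's sufficient statistics (count and mean) are kept.
    """
    if depth == 0 or len(x) <= 1:
        return [(len(x), x.mean(axis=0))]
    order = np.argsort(x[:, axis])
    mid = len(x) // 2
    next_axis = (axis + 1) % x.shape[1]
    return (kd_partition(x[order[:mid]], depth - 1, next_axis) +
            kd_partition(x[order[mid:]], depth - 1, next_axis))

rng = np.random.default_rng(0)
features = rng.normal(size=(10000, 40))       # toy 40-dimensional "frames"
cells = kd_partition(features, depth=4)       # at most 2**4 = 16 cells
print(len(cells), "cells; first cell summarizes", cells[0][0], "points")
```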

Variational Inference Algorithms [4]
Collapsed Variational Stick Breaking (CVSB):
- Truncates the DPM to a maximum of K clusters and marginalizes out the mixture weights.
- Creates a finite DP.
Collapsed Dirichlet Priors (CDP):
- Truncates the DPM to a maximum of K clusters and marginalizes out the mixture weights.
- Assigns cluster sizes with a symmetric prior.
CVSB and CDP complexity: O(TN).

Experimental Setup

Experimental Design & Data
Phoneme classification (TIMIT):
- Manual alignments
Phoneme recognition (TIMIT, CH-E, CH-M):
- Acoustic models trained for phoneme alignment
- Phoneme alignments generated using HTK
Corpora:
- TIMIT: studio-recorded, read speech; 630 speakers, ~130,000 phones; 39 phoneme labels
- CALLHOME English (CH-E): spontaneous, conversational telephone speech; 120 conversations, ~293,000 training samples; 42 phoneme labels
- CALLHOME Mandarin (CH-M): spontaneous, conversational telephone speech; 120 conversations, ~250,000 training samples; 92 phoneme labels

Experimental Setup: Feature Extraction
[Figure: raw audio data is windowed into frames and converted to 39 MFCC features plus duration (40 features per frame); the frames of each segment are divided into three regions (round down, round up, remainder), and each region is averaged into a 40-dimensional vector (F1_AVG, F2_AVG, F3_AVG), yielding a 3x40 feature matrix.]
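A minimal sketch of this pooling step, assuming each phoneme segment arrives as a (num_frames x 40) array of MFCC-plus-duration features and that "round down / round up / remainder" refers to how the frames are apportioned among the three regions; that reading of the diagram, and all names below, are our assumptions.

```python
import numpy as np

def segment_to_feature_matrix(frames):
    """Average one phoneme segment's frames into a 3 x 40 matrix.

    frames: (num_frames, 40) array (39 MFCCs + duration per frame).
    Region sizes: floor(n/3), ceil(n/3), and whatever remains
    (assumed split; very short segments would need special handling).
    """
    n = len(frames)
    n1 = n // 3                    # round down
    n2 = int(np.ceil(n / 3))       # round up
    regions = [frames[:n1], frames[n1:n1 + n2], frames[n1 + n2:]]
    return np.vstack([r.mean(axis=0) for r in regions])

# Example with a synthetic 11-frame segment.
segment = np.random.default_rng(0).normal(size=(11, 40))
print(segment_to_feature_matrix(segment).shape)   # (3, 40)
```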

Baseline Algorithms: GMMs
[Figures: baseline GMM results for TIMIT, and for CH-E and CH-M.]

Inference Algorithm Tuning
[Figures: AVDPM: KD tree depth vs. average error (%); CVSB and CDP: truncation level vs. average error (%); results shown for TIMIT and for CH-E and CH-M.]

Evaluation: Error Rate Comparison
- NN: TIMIT 30.54% (140 neurons in hidden layer); CH-E 47.62% (170 neurons in hidden layer); CH-M 52.92% (130 neurons in hidden layer)
- KNN: TIMIT 32.08% (K = –); CH-E –% (K = –); CH-M –% (K = 28)
- RF: TIMIT 32.04% (150 trees); CH-E 48.95% (150 trees); CH-M 55.72% (150 trees)
- GMM: TIMIT 38.02% (# mixt. = –); CH-E –% (# mixt. = –); CH-M –% (# mixt. = 64)
- AVDPM: TIMIT 37.14% (init. depth = 4, avg. # mixt. = –); CH-E –% (init. depth = 6, avg. # mixt. = –); CH-M –% (init. depth = 8, avg. # mixt. = 5.01)
- CVSB: TIMIT 40.30% (trunc. level = 4, avg. # mixt. = –); CH-E –% (trunc. level = 6, avg. # mixt. = –); CH-M –% (trunc. level = 6, avg. # mixt. = 5.75)
- CDP: TIMIT 40.24% (trunc. level = 4, avg. # mixt. = –); CH-E –% (trunc. level = 10, avg. # mixt. = –); CH-M –% (trunc. level = 6, avg. # mixt. = 5.75)

Computational Complexity: Truncation Level and Depth
[Figure: complexity as a function of truncation level and KD tree depth on TIMIT, with the current operating points marked.]

Computational Complexity: Training Samples
[Figure: complexity as a function of the number of training samples on TIMIT.]

Conclusions and Future Work
Conclusions:
- AVDPM, CVSB, and CDP yield error rates comparable to GMM models.
- AVDPM, CVSB, and CDP use far fewer mixtures per phoneme label than standard GMMs.
- AVDPM is much better suited to large corpora because the KD tree significantly reduces training time without substantially degrading error rates.
- The performance gap between CH-E and CH-M can be attributed to the number of labels.
Future Work:
- Investigate methods to improve covariance estimation.
- Apply AVDPM to HDP-HMM systems to move toward a fully Bayesian nonparametric speech recognizer.

Publications
Submitted:
- Conference paper (INTERSPEECH 2013): Steinberg, J., Harati Nejad Torbati, A. H., & Picone, J. (2013). A Comparative Analysis of Bayesian Nonparametric Inference Algorithms for Acoustic Modeling in Speech Recognition. Proceedings of INTERSPEECH. Lyon, France.
Under development:
- Journal paper: Steinberg, J., Harati Nejad Torbati, A. H., & Picone, J. (2013). Phone Classification Using Bayesian Nonparametric Methods. To be submitted to the International Journal of Speech Technology, summer 2013.
- Project website: Steinberg, J. (2013). DPM-Based Phoneme Classification with Variational Inference Algorithms.

Acknowledgements
Thanks to my committee for all of their help and support: Dr. Iyad Obeid, Dr. Joseph Picone, Dr. Marc Sobel, Dr. Chang-Hee Won, and Dr. Alexander Yates.
Thanks to my research group for all of their patience and support: Amir Harati and Shuang Lu.
Thanks to the Linguistic Data Consortium (LDC) for awarding a data scholarship to this project and providing the lexicon and transcripts for CALLHOME Mandarin.
Thanks to Owlsnest.¹
¹ This research was supported in part by the National Science Foundation through Major Research Instrumentation Grant No. CNS.

Brief Bibliography of Related Research
1. Bussgang, J. (2012). Seeing Both Sides. Retrieved November 27, 2012.
2. Ng, A. (2012). Machine Learning. Retrieved November 27, 2012.
3. Kurihara, K., Welling, M., & Vlassis, N. (2007). Accelerated variational Dirichlet process mixtures. Advances in Neural Information Processing Systems. MIT Press, Cambridge, Massachusetts, USA.
4. Kurihara, K., Welling, M., & Teh, Y. W. (2007). Collapsed variational Dirichlet process mixture models. Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India.
5. Steinberg, J., & Picone, J. (2012). HTK Tutorials.
6. Harati, A., Picone, J., & Sobel, M. (2012). Applications of Dirichlet Process Mixtures to Speaker Adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4321–4324). Kyoto, Japan.
7. Frigyik, B., Kapila, A., & Gupta, M. (2010). Introduction to the Dirichlet Distribution and Related Processes. Seattle, Washington, USA.
8. Blei, D. M., & Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1, 121–144.
9. Zografos, V. Wikipedia. Retrieved November 27, 2012.
10. Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.
11. Quintana, F. A., & Müller, P. (2004). Nonparametric Bayesian Data Analysis. Statistical Science, 19(1), 95–110.
12. Harati, A., & Picone, J. (2012). Applications of Dirichlet Process Models to Speech Processing and Machine Learning. IEEE Section Meeting, Northern Virginia. Fairfax, Virginia, USA.
13. Teh, Y. W. (2007). Dirichlet Processes: Tutorial and Practical Course. Retrieved November 30, 2012.