
A Comparative Analysis of Bayesian Nonparametric Variational Inference Algorithms for Phoneme Recognition
A Thesis Proposal by John Steinberg
Institute for Signal and Information Processing, Temple University, Philadelphia, Pennsylvania, USA

Abstract

Nonparametric Bayesian models have become increasingly popular in speech recognition tasks such as language and acoustic modeling due to their ability to discover underlying structure in an iterative manner. Nonparametric methods do not require a priori assumptions about the structure of the data, such as the number of mixture components, and can learn this structure directly from the data. Dirichlet process mixtures (DPMs) are a widely used nonparametric method. These processes are an extension of the Dirichlet distribution, which is defined over vectors of non-negative numbers that sum to one. Thus, DPMs are particularly useful for modeling distributions of distributions. Because DPMs potentially require an infinite number of parameters, inference algorithms are needed to approximate these models for sampling purposes. The focus of this work is an evaluation of three such Bayesian inference algorithms, which have only recently become computationally viable: Accelerated Variational Dirichlet Process Mixtures (AVDPM), Collapsed Variational Stick Breaking (CVSB), and Collapsed Dirichlet Priors (CDP). Rather than conducting a complete speech recognition experiment, where classification is affected by several factors (e.g., language and acoustic modeling), a simple phone classification task is chosen to more clearly assess the efficacy of these algorithms. Furthermore, this evaluation is conducted using the CALLHOME Mandarin and English corpora, two languages that, from a human perspective, are phonologically very different. This serves two purposes: (1) artifacts from either language that influence classification will be identified; (2) if no such artifacts exist, then these algorithms will have strongly supported their use for future multi-language recognition tasks. Finally, Mandarin misclassification error rates have consistently been much higher than those from comparable English experiments. Thus, the final goal of this work is to show whether these three inference algorithms can help reduce this gap while maintaining reasonable computational complexity.

Introduction

The Motivating Problem

A set of data is generated from multiple distributions, but it is unclear how many.
- Parametric methods assume the number of distributions is known a priori.
- Nonparametric methods learn the number of distributions from the data, e.g., with a model of a distribution of distributions.
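A toy illustration of this problem (not from the original slides): observations pooled from several Gaussian components when the analyst does not know how many components generated them. All values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
true_k = int(rng.integers(2, 6))                 # number of generating distributions, hidden from the analyst
means = rng.uniform(-10.0, 10.0, size=true_k)

# Pool observations from all hidden components into a single, unlabeled data set.
data = np.concatenate([rng.normal(m, 1.0, size=100) for m in means])
rng.shuffle(data)

# A parametric model must fix the number of components up front;
# a nonparametric (DPM-based) model infers it from the data.
print(f"observations: {data.size}, hidden components: {true_k}")
```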

Goals

Investigate the viability of three variational inference algorithms for acoustic modeling:
- Accelerated Variational Dirichlet Process Mixtures (AVDPM)
- Collapsed Variational Stick Breaking (CVSB)
- Collapsed Dirichlet Priors (CDP)

Assess performance:
- Compare error rates to standard GMM models.
- Understand computational complexity.
- Identify any language-specific artifacts.

Background

Speech Recognition Overview

Parametric vs. Nonparametric Models

Parametric:
- Requires a priori assumptions about data structure.
- Underlying structure is approximated with a limited number of mixtures.
- Number of mixtures is rigidly set.

Nonparametric:
- Does not require a priori assumptions about data structure.
- Underlying structure is learned from the data.
- Number of mixtures can evolve (distributions of distributions; needs a prior!).

Complex models frequently require inference algorithms for approximation!

Taxonomy of Nonparametric Models

Nonparametric Bayesian Models
- Regression: neural networks, wavelet-based modeling, multivariate regression, spline models
- Model Selection: Dirichlet processes
- Survival Analysis: neutral-to-the-right processes, dependent increments, competing risks, proportional hazards

Inference algorithms are needed to approximate these infinitely complex models.

Dirichlet Distributions

pdf: $p(q \mid \alpha) = \dfrac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} q_i^{\alpha_i - 1}$
- $q \in \mathbb{R}^k$: a probability mass function (pmf)
- $\alpha$: a concentration parameter
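As a quick check on this definition, the following sketch (an illustration added here, assuming NumPy) draws samples from a Dirichlet distribution and confirms that each draw is a valid pmf; the alpha values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 5.0, 3.0])            # concentration parameters, one per component
samples = rng.dirichlet(alpha, size=1000)    # each row is a pmf over 3 outcomes

print(samples.sum(axis=1)[:5])               # every draw sums to 1
print(samples.mean(axis=0))                  # sample mean approaches alpha / alpha.sum()
print(alpha / alpha.sum())
```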

Properties of Dirichlet Distributions
- Agglomerative property (joining): merging components of a Dirichlet-distributed pmf yields another Dirichlet distribution.
- Decimative property (splitting): splitting a component of a Dirichlet-distributed pmf yields another Dirichlet distribution.
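Stated formally (standard results, as presented in the Dirichlet tutorial cited as [7], using the notation of the previous slide):

$$
\begin{aligned}
&\text{Agglomerative: } (q_1,\dots,q_k) \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_k)
\;\Rightarrow\; (q_1+q_2,\,q_3,\dots,q_k) \sim \mathrm{Dir}(\alpha_1+\alpha_2,\,\alpha_3,\dots,\alpha_k) \\
&\text{Decimative: if additionally } (\tau_1,\tau_2) \sim \mathrm{Dir}(\alpha_1\beta_1,\,\alpha_1\beta_2) \text{ with } \beta_1+\beta_2=1, \text{ then} \\
&\qquad (q_1\tau_1,\,q_1\tau_2,\,q_2,\dots,q_k) \sim \mathrm{Dir}(\alpha_1\beta_1,\,\alpha_1\beta_2,\,\alpha_2,\dots,\alpha_k)
\end{aligned}
$$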

Dirichlet Processes (DPs)

A Dirichlet process is a Dirichlet distribution split infinitely many times.
[Figure: a pmf (q1, q2) repeatedly split into finer pieces (q11, q12, q21, q22, ...)]
These discrete probabilities are used as a prior for our infinite mixture model.
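An equivalent construction that is often used to make this precise (not shown on the slide) is the stick-breaking representation of a DP with concentration parameter $\alpha$ and base distribution $G_0$:

$$
\beta_j \sim \mathrm{Beta}(1,\alpha), \qquad
\pi_j = \beta_j \prod_{i=1}^{j-1} (1-\beta_i), \qquad
\theta_j \sim G_0, \qquad
G = \sum_{j=1}^{\infty} \pi_j\, \delta_{\theta_j},
$$

so a draw $G$ from the DP is itself a discrete distribution, which is exactly what makes it usable as a prior over mixture components.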

Inference: An Approximation

Parameter estimation is computationally difficult:
- Distributions of distributions require infinitely many parameters.
- Posteriors, p(y|x), cannot be solved analytically.

Sampling methods are most common:
- Markov chain Monte Carlo (MCMC): samples estimate the true distribution.
- Drawbacks:
  - Needs a large number of samples for accuracy.
  - Step size must be chosen carefully.
  - The "burn-in" phase must be monitored/controlled.
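For context, a minimal random-walk Metropolis sketch (not one of the algorithms evaluated in this work) that makes the step-size and burn-in issues concrete; the target density, function names, and all parameter values are illustrative.

```python
import numpy as np

def metropolis(log_target, x0, n_samples=5000, step=0.5, burn_in=1000, rng=None):
    """Random-walk Metropolis: draws approximate samples from exp(log_target)."""
    rng = np.random.default_rng() if rng is None else rng
    x, lp, samples = x0, log_target(x0), []
    for i in range(n_samples + burn_in):
        prop = x + step * rng.standard_normal()       # step size must be tuned carefully
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:       # accept/reject
            x, lp = prop, lp_prop
        if i >= burn_in:                              # discard the burn-in phase
            samples.append(x)
    return np.array(samples)

# Illustrative target: a two-component Gaussian mixture (unnormalized log-density).
log_p = lambda x: np.logaddexp(-0.5 * (x + 2.0) ** 2, -0.5 * (x - 2.0) ** 2)
draws = metropolis(log_p, x0=0.0)
print(draws.mean(), draws.std())
```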

Variational Inference

Converts the sampling problem into an optimization problem:
- Avoids the need for careful monitoring of sampling.
- Uses independence assumptions to create simpler variational distributions, q(y), that approximate p(y|x).
- Optimizes q over Q = {q_1, q_2, ..., q_m} using an objective function, e.g., the Kullback-Leibler divergence.
- Constraints can be added to Q to improve computational efficiency.
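A minimal sketch (not part of the proposal) of the Kullback-Leibler divergence between two discrete distributions, the kind of objective variational inference minimizes; the example pmfs are made up.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) for discrete distributions; 0 only when q == p."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

# Illustrative: a simple approximation q of a target pmf p.
p = np.array([0.70, 0.20, 0.10])
q = np.array([0.60, 0.30, 0.10])
print(kl_divergence(q, p))   # small, nonzero value; minimizing this pulls q toward p
```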

Variational Inference Algorithms: AVDPM

Accelerated Variational Dirichlet Process Mixtures (AVDPM):
- Limits computation of Q: for i > T, q_i is set to its prior.
- Incorporates KD trees to improve efficiency.
- The number of splits is controlled to balance computation and accuracy.
[Figure: a KD tree recursively partitioning the data points A-G into progressively smaller cells]
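A toy sketch of the axis-aligned KD-tree partitioning that AVDPM exploits; this is not the AVDPM algorithm itself, and the depth limit, data, and function names are illustrative.

```python
import numpy as np

def kd_partition(points, depth=0, max_depth=2):
    """Recursively split points at the median along alternating axes; returns the leaf cells.
    AVDPM-style accelerations share variational statistics within each cell."""
    if depth == max_depth or len(points) <= 1:
        return [points]
    axis = depth % points.shape[1]                     # cycle through feature dimensions
    median = np.median(points[:, axis])
    left = points[points[:, axis] <= median]
    right = points[points[:, axis] > median]
    if len(left) == 0 or len(right) == 0:              # cannot split further
        return [points]
    return kd_partition(left, depth + 1, max_depth) + kd_partition(right, depth + 1, max_depth)

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 2))                   # 200 two-dimensional feature vectors
cells = kd_partition(data, max_depth=3)
print([len(c) for c in cells])                         # deeper trees -> smaller, more accurate cells
```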

Variational Inference Algorithms: CVSB and CDP [4]

Collapsed Variational Stick Breaking (CVSB):
- Truncates the DPM at a maximum of K clusters and marginalizes out the mixture weights.
- Creates a finite DP.

Collapsed Dirichlet Priors (CDP):
- Truncates the DPM at a maximum of K clusters and marginalizes out the mixture weights.
- Assigns cluster sizes with a symmetric prior.
- Creates many small clusters that can later be collapsed.
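To make the truncation concrete, here is a NumPy sketch of drawing mixture weights from a stick-breaking prior truncated at K components, the construction that CVSB and CDP operate on; this is not the authors' implementation, and the values of alpha and K are illustrative.

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng=None):
    """Draw K mixture weights from a stick-breaking prior truncated at K components."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=K)
    betas[-1] = 1.0                                    # truncation: last break takes the remaining stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining                           # weights sum to exactly 1

weights = truncated_stick_breaking(alpha=1.0, K=8, rng=np.random.default_rng(0))
print(weights, weights.sum())                          # most mass typically falls on the first few clusters
```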

The Story So Far…

- DPs can be used as a prior to select statistically meaningful distributions for our model.
- Distributions of distributions (DPs) require an infinite number of parameters, so these complex models need inference algorithms.
- Speech data are complex and the underlying structure is unknown; nonparametric models find this structure automatically.
- Approach: apply variational inference algorithms for DPs to a speech recognition task.
  - A complete speech recognition experiment has too many performance-altering variables.
  - Phoneme classification is more controlled.

Experimental Setup

Experimental Design & Data

Phoneme classification (TIMIT):
- Manual alignments.

Phoneme recognition (TIMIT, CH-E, CH-M):
- Acoustic models trained for phoneme alignment.
- Phoneme alignments generated using HTK.

Corpora:
- TIMIT: studio-recorded, read speech; 630 speakers, ~130,000 phones; 39 phoneme labels.
- CALLHOME English (CH-E): spontaneous, conversational telephone speech; 120 conversations, ~293,000 training samples; 42 phoneme labels.
- CALLHOME Mandarin (CH-M): spontaneous, conversational telephone speech; 120 conversations, ~250,000 training samples; 92 phoneme labels.

Experimental Setup: Feature Extraction

- Raw audio is converted into frames of 39 MFCC features plus duration (40 features per frame).
- The frames in each window are averaged over three regions (region sizes follow a round-down / round-up / remainder split of the frame count) to produce F1avg, F2avg, and F3avg, each of 40 features.
- The result is a 3 x 40 feature matrix per segment.
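A small sketch of the three-region averaging described above, under the assumption that the regions are contiguous blocks of frames sized by the round-down / round-up / remainder split; the function name and synthetic data are illustrative.

```python
import numpy as np

def three_region_average(frames):
    """Average MFCC+duration frames over three regions of a segment (assumes >= 3 frames).
    Region sizes: round down, round up, and the remainder of the frame count."""
    n = len(frames)
    n1 = n // 3                       # first region: round down
    n2 = int(np.ceil(n / 3))          # second region: round up
    regions = [frames[:n1], frames[n1:n1 + n2], frames[n1 + n2:]]
    return np.vstack([r.mean(axis=0) for r in regions])    # 3 x 40 feature matrix

segment = np.random.default_rng(0).standard_normal((17, 40))   # 17 frames of 40 features
print(three_region_average(segment).shape)                      # (3, 40)
```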

Baseline Algorithms: GMMs

Misclassification error (%) on the evaluation set for increasing numbers of mixture components K:
- TIMIT: 36.80, 34.80, 31.56, 36.02, 36.56, 39.00, 40.40, 37.82
- CH-E: 63.28, 60.62, 63.55, 61.74, 59.69, 58.41, 58.37, 58.84
- CH-M: 68.63, 66.32, 68.27, 65.30, 62.65, 63.53, 63.57 (no result for K = 256)
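A sketch of the kind of GMM baseline described here: one Gaussian mixture per phoneme label, classifying a segment by maximum log-likelihood. This is not the actual baseline code; the feature dimensions, label names, component count, and data are placeholders, and it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_baseline(features_by_label, n_components=16):
    """Fit one GMM per phoneme label; features_by_label maps label -> (n_samples, n_dims) array."""
    return {label: GaussianMixture(n_components=n_components, covariance_type="diag").fit(x)
            for label, x in features_by_label.items()}

def classify(models, x):
    """Assign each row of x to the phoneme whose GMM gives the highest log-likelihood."""
    labels = list(models)
    scores = np.stack([models[label].score_samples(x) for label in labels], axis=1)
    return [labels[i] for i in scores.argmax(axis=1)]

# Illustrative two-class example with random 120-dimensional (3 x 40, flattened) segment features.
rng = np.random.default_rng(0)
data = {"aa": rng.standard_normal((500, 120)), "iy": rng.standard_normal((500, 120)) + 0.5}
models = train_gmm_baseline(data, n_components=4)
print(classify(models, rng.standard_normal((3, 120)) + 0.5))
```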

Inference Algorithm Tuning: CH-E & CH-M

[Figures: AVDPM error (%) vs. KD tree depth; CVSB and CDP error (%) vs. truncation level]

AVDPM:
- Error generally improves with increased tree depth.
- Requires additional iterations.

CVSB:
- Optimal number of mixtures < 8.

CDP:
- Optimal number of mixtures < 32.
- Requires additional iterations.

Error Rate Comparison

CH-E:
  Algorithm   Best Error Rate   Avg. k per Phoneme
  GMM         58.41%            128
  AVDPM       56.65%            3.45
  CVSB        56.54%            11.60
  CDP         57.14%            27.93

CH-M:
  Algorithm   Best Error Rate   Avg. k per Phoneme
  GMM         62.65%            64
  AVDPM       62.59%            2.15
  CVSB        63.08%            3.86
  CDP         62.89%            9.45

- AVDPM, CVSB, and CDP achieve results comparable to GMMs.
- AVDPM, CVSB, and CDP require significantly fewer parameters than GMMs.

Complexity: AVDPM

[Figure: AVDPM CPU time vs. KD tree depth]

Complexity: CVSB & CDP

[Figures: CPU time vs. truncation level and CPU time vs. number of training samples, for CVSB and CDP]

Conclusions

- AVDPM, CVSB, and CDP slightly outperform standard GMM error rates for both CH-E and CH-M.
- AVDPM, CVSB, and CDP all generate far fewer mixtures per phoneme label than standard GMM systems.
- AVDPM's complexity grows exponentially with KD tree depth.
- CVSB's and CDP's complexity grows linearly with the number of training samples and the truncation level.
- The performance gap between CH-E and CH-M can readily be attributed to the number of phoneme labels.

Acknowledgements

Thanks to my committee for all of the help and support:
- Dr. Iyad Obeid
- Dr. Joseph Picone
- Dr. Marc Sobel
- Dr. Chang-Hee Won
- Dr. Alexander Yates

Thanks to my research group for all of their patience and support:
- Amir Harati
- Shuang Lu

The Linguistic Data Consortium (LDC), for awarding a data scholarship to this project and providing the lexicon and transcripts for CALLHOME Mandarin.

Owlsnest¹

¹ This research was supported in part by the National Science Foundation through Major Research Instrumentation Grant No. CNS.

Brief Bibliography of Related Research

[1] Bussgang, J. (2012). Seeing Both Sides. Retrieved November 27, 2012, from http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html
[2] Ng, A. (2012). Machine Learning. Retrieved November 27, 2012.
[3] Kurihara, K., Welling, M., & Vlassis, N. (2007). Accelerated variational Dirichlet process mixtures. Advances in Neural Information Processing Systems. Cambridge, Massachusetts, USA: MIT Press (editors: B. Schölkopf and J. C. Hofmann).
[4] Kurihara, K., Welling, M., & Teh, Y. W. Collapsed variational Dirichlet process mixture models. Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, Jan.
[5] Picone, J. (2012). HTK Tutorials.
[6] Harati Nejad Torbati, A. H., Picone, J., & Sobel, M. (2012). Applications of Dirichlet process mixtures to speaker adaptation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4321–4324. Philadelphia, Pennsylvania, USA: IEEE.
[7] Frigyik, B., Kapila, A., & Gupta, M. (2010). Introduction to the Dirichlet Distribution and Related Processes. Seattle, Washington, USA.
[8] Blei, D. M., & Jordan, M. I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, vol. 1, pp. 121–144.
[9] Zografos, V. Wikipedia. Retrieved November 27, 2012.
[10] Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
[11] Schalkwyk, J. Overview of the Recognition Process. Retrieved December 2, 2012.
[12] Quintana, F. A., & Müller, P. (2004). Nonparametric Bayesian data analysis. Statistical Science, 19(1), 95–110.
[13] Picone, J., & Harati, A. (2012). Applications of Dirichlet Process Models to Speech Processing and Machine Learning. Presentation given at IEEE Nova, Virginia.
[14] Teh, Y. W. (2007). Dirichlet Processes: Tutorial and Practical Course. Retrieved November 30, 2012.