Presentation transcript:

A NONPARAMETRIC BAYESIAN APPROACH FOR AUTOMATIC DISCOVERY OF A LEXICON AND ACOUSTIC UNITS

Amir Harati, Conversational Technologies, Jibo Inc.
Joseph Picone, The Institute for Signal and Information Processing, Temple University
Temple University College of Engineering
www.isip.piconepress.com | www.temple.edu/engineering

Introduction
State-of-the-art speech recognition systems use data-intensive context-dependent phonemes as acoustic units.
Resources such as a language model or a lexicon may not be available for all languages.
Learning the lexicon and acoustic units directly from data is therefore attractive for low-resource languages.
Learning acoustic units is an example of a problem in which the complexity of the model (e.g., the number of units) is unknown in advance.

Automatically Discovered Units
Speech recognizers require sub-word acoustic units (e.g., phonemes).
Most approaches learn units in a two-step process: (1) segmentation and (2) clustering.
Instead of an initial segmentation/clustering step, we first learn a transducer.
Automatically discovered units (ADUs) are relatively stationary.
Learning: use an ergodic HDPHMM/DHDPHMM to train the transducer.
Decoding: use the Viterbi or forward-backward algorithm to find the optimum path or a probability distribution over paths.

Experiments: Relation to Phonemes
Experiments use the TIMIT dataset.
The ADU transducer is learned from the training subset of TIMIT.
The test subset of TIMIT is decoded using the learned ADU transducer, and the resulting ADU units are aligned with phonemes.
ADU units are modeled by single states of the HDPHMM (Gaussian mixtures) and are therefore more stationary than phonemes.

Experiments: Lexicon Learning
The ADU transducer is trained on TIMIT, but the lexicon is learned from Resource Management (RM).
For low-complexity ASR systems, ADUs perform better, but the performance gains diminish as system complexity increases.

Figure 1. Model Complexity as a Function of Available Data: (a) 20, (b) 200, (c) 2,000 Data Points
Figure 2. Example of ADUs vs. Phonemes
Figure 3. Relationship Between ADUs and Phonemes
Table 4. A Comparison of Lexicon Learning Algorithms
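The decoding step mentioned above (using the Viterbi algorithm to find the optimum unit sequence) can be illustrated with a minimal sketch for a generic discrete-output HMM. This is a toy illustration only, not the HDPHMM implementation from the poster; the transition, emission, and initial matrices are hypothetical.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Most likely state path for an observation sequence.

    log_A: (S, S) log transition matrix, log_B: (S, V) log emission
    matrix, log_pi: (S,) log initial probabilities, obs: symbol list.
    """
    S = log_A.shape[0]
    T = len(obs)
    delta = np.full((T, S), -np.inf)      # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)     # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (from, to) scores
        psi[t] = np.argmax(scores, axis=0)            # best predecessor per state
        delta[t] = scores[psi[t], np.arange(S)] + log_B[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                     # backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state, 2-symbol model: states prefer to emit their own symbol.
pi = np.array([0.9, 0.1])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
print(viterbi(np.log(A), np.log(B), np.log(pi), [0, 0, 1, 1, 1]))
```

In the ADU transducer, the same recursion runs over HDPHMM states with Gaussian-mixture emissions; the forward-backward algorithm replaces the max with a sum when a probability distribution over paths (a posteriorgram) is needed.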
Nonparametric Bayesian Models
The finite mixture model has a nonparametric Bayesian equivalent, the Dirichlet process mixture (DPM), whose generative model places no a priori limit on the number of components.
The hierarchical Dirichlet process (HDP) extends this to multiple groups, where each group is modeled with its own mixture while components are shared across groups.
Extending the same approach to HMMs yields the HDPHMM: an infinite number of states, with each state's output modeled by a DPM.
We previously introduced an extension (DHDPHMM) that allows outputs to be modeled by an HDP.
An implementation is available at: https://github.com/amir1981/hdphmm_lib

Lexicon Learning
A lexicon is a mapping of words into acoustic units.
When using ADUs, the lexicon must also be learned; we assume the existence of parallel transcriptions.
The algorithm uses a variant of the dynamic time warping (DTW) algorithm that can align a short sequence inside a longer one.

Learning Algorithm
1. Generate the posteriorgram representation for all utterances in the dataset using an ADU transducer.
2. Generate an approximate alignment between the words and the output stream of the ADU transducer.
3. Use the aligned transcriptions to extract all examples of each word.
4. Use a sub-sequence DTW algorithm to align all examples of a word and find the instance with the least average edit distance to the other instances.
5. Generate a lexicon and use it to train a new ASR system.
6. Force-align the transcriptions using the new ASR system.
7. Use the aligned transcriptions to extract all examples of each word.
8. If convergence has not been reached, go to step 4.

Experiments: Spoken Term Detection
Task: given a sample of a word (a query), find all of its occurrences in the dataset.
Approach: convert both the database and the query into ADU sequences, then search for the query in the database using a sub-sequence DTW algorithm.

Future Work
Develop nonparametric Bayesian models that can represent non-stationary units. Currently each unit is modeled by a single state of the HDPHMM; instead, we need HDPHMMs in which each state is itself modeled by another HMM.
Evaluate on other datasets and languages to validate these results.
Investigate new approaches for mapping ADU units to words; for example, train a grapheme-to-phoneme (G2P) model using parallel streams of ADUs and letters.

References
A. Harati and J. Picone, “Speech Acoustic Unit Segmentation Using Hierarchical Dirichlet Processes,” in Proceedings of INTERSPEECH, 2013, pp. 637–641.
A. Harati and J. Picone, “A Nonparametric Bayesian Approach for Spoken Term Detection by Example Query,” in Proceedings of INTERSPEECH, 2016, p. 313.
A. Harati and J. Picone, “A Doubly Hierarchical Dirichlet Process Hidden Markov Model with a Non-Ergodic Structure,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 174–184, 2016.

Table 1. Spoken Term Detection by Query
Table 2. Segmentation Performance
Table 3. Error Examples
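The Dirichlet process at the heart of the DPM and HDP models above is a discrete distribution: a weighted sum of impulse functions whose weights can be drawn via the stick-breaking construction. The following is a minimal sketch; the truncation tolerance `tol` is an illustrative parameter, not part of the poster's models.

```python
import numpy as np

def stick_breaking_weights(alpha, rng, tol=1e-8):
    """Draw the mixture weights of a Dirichlet process sample.

    Each step breaks off a Beta(1, alpha)-distributed fraction of the
    remaining stick; we truncate once the leftover mass drops below tol.
    """
    weights, remaining = [], 1.0
    while remaining > tol:
        v = rng.beta(1.0, alpha)        # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return np.array(weights)

w = stick_breaking_weights(5.0, np.random.default_rng(0))
print(len(w), w.sum())
```

A larger concentration `alpha` spreads mass over more components, which is how these models let the effective number of units grow with the available data rather than being fixed in advance.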
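The sub-sequence DTW used in both the lexicon-learning algorithm and the spoken term detection experiments can be sketched as follows. A real system would compare posteriorgram frames with a suitable distance; this hedged sketch uses scalar sequences and absolute difference purely for brevity.

```python
import numpy as np

def subsequence_dtw(query, ref):
    """Align `query` against the best-matching span of the longer `ref`.

    Unlike classical DTW, the query may start and end anywhere inside
    the reference, so a short word example can be located within a full
    utterance.  Returns (cost, end_index) of the best match.
    """
    Q, R = len(query), len(ref)
    D = np.full((Q, R), np.inf)
    D[0] = [abs(query[0] - r) for r in ref]   # free start anywhere in ref
    for i in range(1, Q):
        for j in range(R):
            d = abs(query[i] - ref[j])
            best = D[i - 1, j]                # vertical step
            if j > 0:
                best = min(best, D[i, j - 1], D[i - 1, j - 1])
            D[i, j] = d + best
    end = int(np.argmin(D[-1]))               # free end anywhere in ref
    return float(D[-1, end]), end

# Toy example: the query [3, 4, 5] matches ref exactly, ending at index 4.
print(subsequence_dtw([3, 4, 5], [0, 0, 3, 4, 5, 9]))
```

For spoken term detection, the query's ADU sequence is slid over each database utterance this way, and spans whose alignment cost falls below a threshold are reported as detections.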