Nonparametric Bayesian Approach for Automatic Generation of Subword Units: An Initial Study
Amir Harati
Institute for Signal and Information Processing, Temple University, Philadelphia, Pennsylvania, USA

Motivation
- Acoustic mismatch in the lexicon can reduce the performance of the system.
- For many languages, sufficient resources are not available.
- The approach extends to recognition of other signals (e.g., gestures).
- The number of subword units is unknown and should grow with the data, so the problem naturally fits a nonparametric framework.
- Each unit (at least for HMM-based recognizers) is assumed to be stationary (piecewise constant), so for segmentation it can be modeled with a single HMM state.

Problem
- Input: speech waveform and utterance-level transcription.
- Output: subword units and a lexicon.
- General approaches: several methods have been proposed over the last two decades. They differ in their individual stages, but the process can generally be decomposed into three steps: segmentation, clustering, and lexicon building.
- The first two steps build the subword units; the last generates the dictionary. Some algorithms perform all three steps jointly.
- Segmentation is generally accomplished with dynamic programming methods, using criteria such as change in spectrum (see the sketch after this slide). Clustering is accomplished using K-means or other heuristic methods.
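Since the slide names change in spectrum as a typical segmentation criterion, here is a minimal sketch of that idea, assuming frame-level features are available as rows of a NumPy array. The function name and threshold value are illustrative assumptions, not taken from any of the cited methods.

```python
# Minimal sketch: hypothesize a segment boundary wherever the spectrum
# changes abruptly between consecutive frames. All names and the
# threshold are illustrative assumptions.
import numpy as np

def spectral_change_boundaries(feats, threshold=0.35):
    """feats: (num_frames, dim) array of frame-level features.
    Returns indices of frames where a new segment begins."""
    unit = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-8)
    # Cosine distance between each frame and its predecessor.
    cos_dist = 1.0 - np.sum(unit[1:] * unit[:-1], axis=1)
    return np.flatnonzero(cos_dist > threshold) + 1

# Toy example with random 39-dimensional "frames".
feats = np.random.randn(200, 39)
print(spectral_change_boundaries(feats)[:10])
```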

Proposal
- Use an HDP-HMM for segmentation. Segments can straddle word boundaries; how should those be handled? Should multiple-word entries be allowed? Should we segment whole utterances or individual words?
- Use a DPM for clustering. How many units, and how do we control that? (A sketch of one possible realization follows this slide.)
- Lexicon generation will be accomplished separately. After generating the lexicon, we can use EM and forced alignment to resegment the speech, recluster the segments, and regenerate the lexicon until convergence.
- One of the major problems is word segmentation. Can we use forced alignment with cross-word (xwrd) models? Another approach exists in the literature (explained later).
- How many pronunciations per word should exist? Simplest answer: all of them. Alternatively, force all instances of a word to have an equal number of segments (too restrictive), or first cluster all instances and then segment around the centers of these clusters. The dictionary can be pruned later using forced alignment.
- How should speaker variation be handled? Is it acceptable as-is (possibly yielding many units), or should adaptation or a similar technique be used?
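As a hedged sketch of the proposed DPM clustering step, scikit-learn's variational truncated Dirichlet-process mixture can stand in for a full DPM sampler. The segment representation (one feature vector per segment) and all parameter values below are assumptions; the slide leaves them open.

```python
# Sketch of DPM clustering of segments. The random placeholder features
# would in practice be, e.g., the mean feature vector of each segment
# produced by the HDP-HMM segmentation step.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

segments = np.random.randn(500, 39)  # placeholder segment features

dpm = BayesianGaussianMixture(
    n_components=100,  # truncation level: an upper bound, not the unit count
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,  # alpha; larger values favor more units
    covariance_type="diag",
    max_iter=500,
    random_state=0,
)
labels = dpm.fit_predict(segments)
print("discovered units:", len(np.unique(labels)))
```

This also suggests one answer to "how to control" the number of units: the concentration parameter (alpha) governs how readily new clusters are created, while the truncation level only caps the count.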

Plan
- TIMIT and RM have been used in other work. Starting from TIMIT seems more beneficial, especially because we can compare against the performance of a system with hand-labeled phonemes. TIMIT can also be used to find word boundaries, so we can compare with algorithms that find word boundaries automatically.
- Many different feature-extraction settings could be tried, but since almost all prior work used a 10 ms frame rate with 25 ms windows and 39 MFCC features, we should start there (see the sketch after this slide).
- Each step of the experiment will be run as a separate sub-experiment; for example, feature extraction would be exp000, initial segmentation using the HDP-HMM exp001, and so on. In this way we can try many settings for each step without recomputing the others.
- Outline:
  1. Feature extraction.
  2. Baseline experiment using a phoneme-based system (both context-independent and context-dependent models).
  3. Generate word-boundary information using both the time-alignment information and forced alignment with CD models.
  4. Segment utterances using the HDP-HMM.
  5. Cluster all segments obtained in the previous step using a DPM, and generate the initial subword inventory.
  6. Generate a lexicon using word boundaries obtained from time information, word boundaries obtained from forced alignment, or the method described in Goussard and Niesler (PRASA 2010). Since some of the discovered units will span word boundaries, the lexicon can contain multiple-word entries; depending on their number, the dictionary can be pruned in a later step using a score similar to Bacchiani et al. (ICASSP 1996).
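The plan fixes only three front-end values (10 ms frame rate, 25 ms window, 39 MFCC features). Below is one possible realization using librosa; the audio path is a hypothetical placeholder, and everything beyond those three values is an assumption.

```python
# One way to compute the 39-dimensional MFCC front end named in the plan:
# 13 static coefficients plus deltas and delta-deltas, 10 ms frame rate,
# 25 ms analysis window. The file path is a hypothetical placeholder.
import numpy as np
import librosa

y, sr = librosa.load("timit/train/dr1/fcjf0/sa1.wav", sr=16000)
hop = int(0.010 * sr)   # 10 ms frame rate -> 160 samples at 16 kHz
win = int(0.025 * sr)   # 25 ms window     -> 400 samples at 16 kHz

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop, n_fft=win)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
feats = np.vstack([mfcc, delta, delta2]).T  # shape: (num_frames, 39)
```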

Next Step
Depending on the results, there are several other things we can do next:
- Try different frame rates and window lengths.
- Estimate the hyperparameters using a held-out set. The algorithm is generally not sensitive to these hyperparameters, but this could help tune it.
- After generating the lexicon, use methods similar to Bacchiani and Ostendorf (Speech Communication, 1999) to resegment the audio with the new lexicon via Viterbi decoding and forced alignment, then cluster the new segments and regenerate the lexicon. In this case the new segments are not generated by the HDP-HMM, but clustering can still be accomplished with a DPM.
- If the number of pronunciations per word is too high, first group all instances of each word, cluster within each group, and generate subword units from those clusters.
- Run only CI models in the initial experiments; after obtaining acceptable results, run CD experiments. Proper questions for the decision tree will have to be designed in that case.
- Replace DPM clustering with an HDP. In this case all instances of a word are placed in one group, but all groups share the same underlying clusters. (I expect this to yield a better inventory.)

Lexicon Generation in Goussard and Niesler (PRASA 2010)
- Initial pronunciation dictionary: either assign the acoustic clusters (subword units) in an utterance to the words of that utterance uniformly, or assign them in proportion to each word's length.
- Intermediate pronunciation dictionary: use an HMM to align words and acoustic clusters. A left-to-right HMM with skips is used, in which each state represents a word. Each state carries a probability distribution describing how likely that state is to be associated with each acoustic cluster. The probability that cluster $c_i$ is associated with word $w_j$ is $P(c_i \mid w_j) = N(c_i, w_j) / \sum_k N(c_k, w_j)$, where the counts $N(c_i, w_j)$ on the right-hand side are accumulated over the whole corpus (see the sketch after this slide).
- Final pronunciation dictionary: prune the dictionary using forced alignment to reduce the number of pronunciations.
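A minimal sketch of that count-based estimate: accumulate, over the corpus, how often each acoustic cluster is aligned to each word, then normalize per word. The (word, cluster) alignment pairs are assumed to come from the HMM alignment step described on this slide.

```python
# Count-and-normalize estimate of P(cluster | word) from alignment pairs.
from collections import Counter, defaultdict

def cluster_given_word(alignment_pairs):
    """alignment_pairs: iterable of (word, cluster_id) tuples."""
    counts = defaultdict(Counter)
    for word, cluster in alignment_pairs:
        counts[word][cluster] += 1
    # Normalize each word's cluster counts into a distribution.
    return {w: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for w, cnt in counts.items()}

# Toy corpus of aligned (word, cluster) pairs.
pairs = [("water", 3), ("water", 3), ("water", 7), ("thirsty", 5)]
print(cluster_given_word(pairs))
# {'water': {3: 0.666..., 7: 0.333...}, 'thirsty': {5: 1.0}}
```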

References
- Bacchiani, M., Ostendorf, M., Sagisaka, Y., and Paliwal, K. (1996). "Design of a speech recognition system based on non-uniform segmental units." Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1.
- Bacchiani, M. and Ostendorf, M. (1999). "Joint lexicon, acoustic unit inventory and model design." Speech Communication, vol. 29, pp. 99–114.
- Goussard, G. and Niesler, T. (2010). "Automatic discovery of subword units and pronunciations for automatic speech recognition using TIMIT." Proceedings of the Twenty-First Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Stellenbosch, South Africa.
- Huijbregts, M., McLaren, M., and van Leeuwen, D. (2011). "Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection." Proceedings of ICASSP.
- Paliwal, K. K. (1990). "Lexicon-building methods for an acoustic sub-word based speech recognizer." Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 2.
- Singh, R., Raj, B., and Stern, R. M. (2002). "Automatic generation of subword units for speech recognition systems." IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 89–99.
- Varadarajan, B., Khudanpur, S., and Dupoux, E. (2008). "Unsupervised learning of acoustic sub-word units." Proceedings of HLT '08, Columbus, Ohio.
- Varadarajan, B. and Khudanpur, S. (2008). "Automatically learning speaker-independent acoustic subword units." Proceedings of INTERSPEECH.