
CMU Robust Vocabulary-Independent Speech Recognition System
Hsiao-Wuen Hon and Kai-Fu Lee, ICASSP 1991
Presenter: Fang-Hui Chu

Reference
- CMU Robust Vocabulary-Independent Speech Recognition System, Hsiao-Wuen Hon and Kai-Fu Lee, ICASSP 1991

Outline
- Introduction
- Larger Training Database
- Between-Word Triphone
- Decision Tree Allophone Clustering
- Summary of Experiments and Results
- Conclusions

Introduction
- This paper reports efforts to improve the performance of CMU's robust vocabulary-independent (VI) speech recognition system on the DARPA speaker-independent resource management task.
- The first improvement is the incorporation of more dynamic features in the acoustic front-end processing (adding second-order differenced cepstra and power).
- The second improvement is the collection of more general English data, from which more phonetic variabilities, such as word-boundary context, can be modeled.
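The paper gives no implementation detail for the front-end, but first- and second-order differenced cepstra are a standard construction. Below is a minimal NumPy sketch; the window offset and the plain symmetric-difference formula are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def delta(features, k=2):
    """First-order time difference of a (frames x dims) feature matrix.

    Uses the simple symmetric difference x[t+k] - x[t-k]; regression-based
    deltas are more common today, but this illustrates the idea.
    """
    padded = np.pad(features, ((k, k), (0, 0)), mode="edge")
    return padded[2 * k:] - padded[:-2 * k]

# Cepstra plus power for each frame, e.g. from an LPC or filterbank front-end.
frames = np.random.randn(100, 13)          # 12 cepstra + power (illustrative)
d1 = delta(frames)                         # first-order differences
d2 = delta(d1)                             # second-order differences, as added in the paper
observation = np.hstack([frames, d1, d2])  # augmented front-end vector
```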

Introduction (cont.)
- With more detailed models (such as between-word triphones), coverage on new tasks was reduced.
- A new decision-tree-based subword clustering algorithm finds more suitable models for the subword units not covered in the training set.
- The vocabulary-independent system suffered much more than the vocabulary-dependent system from differences between the recording environments at TI and at CMU.

Larger Training Database
- The vocabulary-independent results improved dramatically as the amount of vocabulary-independent training data increased.
- Adding 5,000 more general English sentences to the vocabulary-independent training set yielded only a small improvement, reducing the error rate from 9.4% to 9.1%.
- The subword modeling technique may have reached an asymptote, so additional sentences are not giving much improvement.

Between-Word Triphone
- Because the subword models are phonetic models, one way to model more acoustic-phonetic detail is to incorporate more context information.
- Between-word triphones are modeled in the vocabulary-independent system by adding three more contexts: word-beginning, word-ending, and single-phone word positions.
- In the past, it has been argued that between-word triphones might be learning grammatical constraints instead of modeling acoustic-phonetic variations.
- The results here show the contrary, since in vocabulary-independent systems the grammars used in training and in recognition are completely different.
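To make the word-boundary contexts concrete, here is a small illustrative sketch of how triphone labels with boundary-position tags could be generated from a word's phone string; the label format and marker symbols are my own assumptions, not the paper's notation:

```python
def triphones(phones, prev_word_end="SIL", next_word_start="SIL"):
    """Label each phone with its left/right context, tagging word positions.

    'b' / 'e' tag word-beginning and word-ending positions; a one-phone
    word is tagged 's'; word-internal phones get 'i'. Illustrative only.
    """
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else prev_word_end
        right = phones[i + 1] if i < len(phones) - 1 else next_word_start
        if len(phones) == 1:
            pos = "s"                      # single-phone word
        elif i == 0:
            pos = "b"                      # word beginning
        elif i == len(phones) - 1:
            pos = "e"                      # word ending
        else:
            pos = "i"                      # word internal
        labels.append(f"{left}-{p}+{right}/{pos}")
    return labels

print(triphones(["S", "P", "IY", "CH"]))   # phones of "speech" (illustrative)
```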

Decision Tree Allophone Clustering
- At the root of the decision tree is the set of all triphones corresponding to a phone.
- Each node has a binary "question" about the triphones' contexts, including left, right, and word-boundary contexts, e.g. "Is the right phoneme a back vowel?"
- These questions are created using human speech knowledge and are designed to capture classes of contextual effects.
- To find the generalized triphone for a triphone, the tree is traversed by answering the question attached to each node, until a leaf node is reached.
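A minimal sketch of that traversal; the node structure and the example question are assumptions for illustration:

```python
class Node:
    def __init__(self, question=None, yes=None, no=None, cluster=None):
        self.question = question    # predicate over a triphone's contexts
        self.yes, self.no = yes, no
        self.cluster = cluster      # generalized-triphone id at a leaf

def find_cluster(node, triphone):
    """Walk the tree by answering each node's question until a leaf."""
    while node.cluster is None:
        node = node.yes if node.question(triphone) else node.no
    return node.cluster

# e.g. a root question "Is the right phoneme a back vowel?"
BACK_VOWELS = {"UW", "UH", "OW", "AO", "AA"}
root = Node(question=lambda t: t["right"] in BACK_VOWELS,
            yes=Node(cluster="phone/gen-1"),
            no=Node(cluster="phone/gen-2"))
print(find_cluster(root, {"left": "S", "right": "UW"}))  # -> phone/gen-1
```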

Decision Tree Allophone Clustering (cont.)

Decision Tree Allophone Clustering (cont.)
- The metric for splitting is an information-theoretic distance measure based on the amount of entropy reduction when splitting a node.
- The question dividing node m into nodes a and b is chosen such that P(m)H(m) - P(a)H(a) - P(b)H(b) is maximized.
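As a concrete reading of that criterion, the sketch below scores a candidate question by the weighted entropy reduction it achieves; estimating P and H from raw counts is an assumption about details the slide leaves open:

```python
import numpy as np

def entropy(counts):
    """Entropy H of a discrete distribution given raw counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def split_gain(counts_m, counts_a, counts_b, total):
    """P(m)H(m) - P(a)H(a) - P(b)H(b), with P estimated from counts."""
    gain = (counts_m.sum() / total) * entropy(counts_m)
    gain -= (counts_a.sum() / total) * entropy(counts_a)
    gain -= (counts_b.sum() / total) * entropy(counts_b)
    return gain

# Toy example: one node's data split by one candidate question.
m = np.array([40, 40, 20], dtype=float)     # distribution at node m
a = np.array([38, 2, 10], dtype=float)      # data answering "yes"
b = m - a                                   # data answering "no"
print(split_gain(m, a, b, total=m.sum()))   # larger gain = better split
```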

Decision Tree Allophone Clustering (cont.)
The algorithm to generate a decision tree for a phone:
1. Generate an HMM for every triphone.
2. Create a tree with one (root) node, consisting of all triphones.
3. Find the best composite question for each node:
   (a) Generate a tree with simple questions at each node.
   (b) Cluster leaf nodes into two classes, representing the composite question.
4. Split the node with the overall best question.
5. Unless some convergence criterion is met, go to step 3.

Decision Tree Allophone Clustering (cont.)
- If only simple questions are allowed in the algorithm, the data may be over-fragmented, resulting in similar leaves in different locations of the tree.
- This problem is dealt with by using composite questions: questions that involve conjunctive and disjunctive combinations of all questions and their negations.
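One way to represent such a composite question is as a disjunction of conjunctions of simple questions and their negations; the tiny evaluator below is my own illustration of that idea, not the paper's data structure:

```python
# A simple question is a predicate over a triphone's contexts.
is_back_vowel = lambda t: t["right"] in {"UW", "UH", "OW", "AO", "AA"}
is_nasal_left = lambda t: t["left"] in {"M", "N", "NG"}

def composite(dnf):
    """Build a composite question from a disjunction of conjunctions.

    `dnf` is a list of clauses; each clause is a list of (question, polarity)
    pairs, where polarity=False negates the simple question.
    """
    def ask(triphone):
        return any(all(q(triphone) == pol for q, pol in clause)
                   for clause in dnf)
    return ask

# (back vowel on the right AND nasal on the left) OR (no back vowel on the right)
q = composite([[(is_back_vowel, True), (is_nasal_left, True)],
               [(is_back_vowel, False)]])
print(q({"left": "M", "right": "UW"}))   # True
```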

Decision Tree Allophone Clustering (cont.)
The significant improvement here is due to three reasons:
1. Improved tree growing and pruning techniques.
2. The models in this study are more detailed and consistent, which makes it easier to find appropriate and meaningful questions.
3. Triphone coverage is lower in this study, so decision-tree-based clustering is able to find more suitable models.

Summary of Experiments and Results
- All experiments were evaluated on the speaker-independent DARPA resource management task: a 991-word continuous-speech task, with a standard word-pair grammar of perplexity 60 used throughout.
- The test set consists of 320 sentences from 32 speakers.
- For the vocabulary-dependent (VD) system, the standard DARPA speaker-independent database of 3,990 sentences from 109 speakers was used to train the system under different configurations.

Summary of Experiments and Results (cont.)
- The baseline vocabulary-independent (VI) system was trained from a pool of VI sentences; 5,000 of these were the TIMIT and Harvard sentences, and the remainder were general English sentences recorded at CMU.

Conclusions
- This paper presented several techniques that substantially improve the performance of CMU's vocabulary-independent speech recognition system.
- The subword units can be enhanced by modeling more acoustic-phonetic variations.
- Future work includes refining and constraining the types of questions that can be asked to split the decision tree.
- There is still a non-negligible degradation across recording conditions.