Network Training for Continuous Speech Recognition
Author: Issac John Alphonso
Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University
Contact Information: Box 0452, Mississippi State University, Mississippi State, Mississippi
URL: isip.msstate.edu/publications/books/msstate_theses/2003/network_training/

INTRODUCTION ORGANIZATION
Motivation: Why do we need a new training paradigm?
Network Training: The differences between network training and traditional training.
Experiments: Verification of the approach using industry-standard databases (e.g., TIDigits, Alphadigits, and Resource Management).
Outline: Motivation, Network Training, Experiments, Conclusions

INTRODUCTION MOTIVATION
A traditional trainer uses an EM-based framework to estimate the parameters of a speech recognition system. It re-estimates the acoustic models in several complicated stages, each of which is prone to error. A network trainer reduces the complexity of the training process by using flexible transcriptions, while achieving comparable performance and retaining the robustness of the existing EM-based framework.

NETWORK TRAINER TRAINING RECIPE
The flat start stage seeds the means and variances of the speech and non-speech models. The context-independent stage inserts an optional silence model between words. The state-tying stage clusters the model parameters via linguistic rules to compensate for sparse training data. The context-dependent stage is similar to the context-independent stage, except that words are modeled using phonetic context.
Pipeline: Flat Start → CI Training → State Tying → CD Training (context-independent, then context-dependent); see the sketch below.
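The recipe is easy to express as a driver loop. Below is a minimal Python sketch of the four stages, assuming a dict-based model store and placeholder EM routines; none of these function names come from the actual ISIP trainer.

import statistics

def flat_start(utterances):
    # Seed the speech and non-speech models with the global mean/variance.
    frames = [frame for utt in utterances for frame in utt]
    mu, var = statistics.mean(frames), statistics.pvariance(frames)
    return {"speech": {"mean": mu, "var": var},
            "non_speech": {"mean": mu, "var": var}}

def em_pass(models, utterances):
    # Placeholder for one Baum-Welch re-estimation pass over the data.
    return models

def tie_states(models, rules):
    # Placeholder: cluster sparse model parameters via linguistic rules.
    return models

def train(utterances, rules, passes=4):
    models = flat_start(utterances)           # 1. flat start
    for _ in range(passes):                   # 2. context-independent training
        models = em_pass(models, utterances)
    models = tie_states(models, rules)        # 3. state tying
    for _ in range(passes):                   # 4. context-dependent training
        models = em_pass(models, utterances)
    return models

# Example with two tiny "utterances" of scalar features:
print(train([[0.1, 0.3, 0.2], [0.5, 0.4]], rules=None))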

NETWORK TRAINER FLEXIBLE TRANSCRIPTIONS
Traditional Trainer: sil hh ae v sil
Network Trainer: SILENCE HAVE SILENCE
The network trainer uses word-level transcriptions, which do not impose restrictions on the word pronunciation. The traditional trainer uses phone-level transcriptions, which fix the canonical pronunciation of the word. Using orthographic transcriptions removes the need to deal directly with phonetic contexts during training.
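As a toy illustration (not the ISIP code), the trainer can derive the phone sequence it needs on the fly from the word-level transcription and a pronunciation lexicon; the tiny lexicon below covers only the slide's example.

LEXICON = {"HAVE": ["hh", "ae", "v"], "SILENCE": ["sil"]}

def expand(words, lexicon=LEXICON):
    # Map a word-level transcription to its phone sequence.
    return [phone for word in words for phone in lexicon[word]]

print(expand(["SILENCE", "HAVE", "SILENCE"]))
# -> ['sil', 'hh', 'ae', 'v', 'sil']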

NETWORK TRAINER FLEXIBLE TRANSCRIPTIONS
The network trainer uses a silence word, which precludes the need to insert silence into the phonetic pronunciation. The traditional trainer deals with silence between words by explicitly specifying it in the phonetic pronunciation. A sketch of the idea follows.
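Below is a hedged sketch of injecting the silence word automatically, so transcribers never write it. The function name and the parenthesized optional-word notation are illustrative, and the fixed/optional distinction anticipates the next slide.

def with_optional_silence(words, silence="SILENCE"):
    # Bracket the utterance with fixed silence; allow optional silence
    # (written here as "(SILENCE)") between words.
    network = [silence]
    for word in words:
        network.append(word)
        network.append(f"({silence})")
    network[-1] = silence   # the final silence is fixed, not optional
    return network

print(with_optional_silence(["SHOW", "ALL", "FLIGHTS"]))
# -> ['SILENCE', 'SHOW', '(SILENCE)', 'ALL', '(SILENCE)', 'FLIGHTS', 'SILENCE']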

NETWORK TRAINER DUAL SILENCE MODELLING
Multi-Path: the multi-path silence model is used between words.
Single-Path: the single-path silence model is used at utterance ends.

NETWORK TRAINER DUAL SILENCE MODELLING
The network trainer uses a fixed silence at utterance bounds and an optional silence between words. We use a fixed silence at utterance bounds to avoid an underestimated silence model.
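To make the two topologies concrete, here is a rough sketch of them as HMM transition matrices over three emitting states plus an exit state. The probabilities and the exact skip structure are invented for illustration; the topology in the thesis may differ.

import numpy as np

# Single-path (utterance bounds): strict left-to-right, so every state
# must be visited and the model always absorbs a minimum duration.
single_path = np.array([
    [0.6, 0.4, 0.0, 0.0],   # s1 -> s1 or s2
    [0.0, 0.6, 0.4, 0.0],   # s2 -> s2 or s3
    [0.0, 0.0, 0.6, 0.4],   # s3 -> s3 or exit
    [0.0, 0.0, 0.0, 1.0],   # exit (absorbing)
])

# Multi-path (between words): skip transitions let very short pauses
# pass through the model in only a frame or two.
multi_path = np.array([
    [0.5, 0.3, 0.1, 0.1],   # s1 can skip ahead, even straight to exit
    [0.0, 0.5, 0.3, 0.2],   # s2 can skip to exit
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 1.0],
])

assert np.allclose(single_path.sum(axis=1), 1.0)
assert np.allclose(multi_path.sum(axis=1), 1.0)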

EXPERIMENTS TIDIGITS: WER COMPARISON

Stage                 WER    Insertions  Deletions  Substitutions
Traditional Trainer   7.7%   0.1%        2.5%       5.0%
Network Trainer       7.6%   0.1%        2.4%       5.0%

The network trainer achieves performance comparable to the traditional trainer, converging to essentially the same word error rate.
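For reference, each row decomposes the word error rate in the standard way, WER = (S + D + I) / N over N reference words (small rounding differences aside). A quick sketch, with made-up counts rather than the actual TIDigits tallies:

def wer(subs, dels, ins, n_ref):
    # Word error rate as a percentage of reference words.
    return 100.0 * (subs + dels + ins) / n_ref

print(f"{wer(25, 12, 1, 500):.1f}%")   # -> 7.6%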

EXPERIMENTS AD: WER COMPARISON

Stage                 WER     Insertions  Deletions  Substitutions
Traditional Trainer   38.0%   0.8%        3.0%       34.2%
Network Trainer       35.3%   0.8%        2.2%       34.2%

The network trainer achieves comparable, and here slightly better, performance: it converges to a lower word error rate than the traditional trainer on Alphadigits.

EXPERIMENTS RM: WER COMPARISON

Stage                 WER     Insertions  Deletions  Substitutions
Traditional Trainer   25.7%   1.9%        6.7%       17.1%
Network Trainer       27.5%   2.6%        7.1%       17.9%

The network trainer achieves performance comparable to the traditional trainer; the 1.8% degradation is not statistically significant (MAPSSWE test).

CONCLUSIONS SUMMARY
Explored the effectiveness of a novel training recipe for re-estimating speech recognition models. Analyzed performance on three databases:
For TIDigits, at 7.6% WER, the network trainer was better by about 0.1%.
For OGI Alphadigits, at 35.3% WER, the network trainer was better by about 2.7%.
For Resource Management, at 27.5% WER, performance degraded by about 1.8% (not statistically significant).