
Integrated Stochastic Pronunciation Modeling Dong Wang Supervisors: Simon King, Joe Frankel, James Scobbie

Contents  Problems we are addressing  Previous research  Integrated stochastic pronunciation modeling  Current experimental results  Work plan

Problems we are addressing 1. Constructing a lexicon is time consuming. 2. Traditional lexicon-based triphone systems lack robustness to pronunciation variation in real speech: linguistics-based lexica seldom consider real speech, and the decomposition from words to acoustic units, through lexica and decision trees, is deterministic.

Previous research  Alternative pronunciation generation Utilize real speech to expand the lexicon.  Automatic lexicon generation Utilize real speech to create a lexicon.  Hidden sequence modeling (HSM) Build a probabilistic mapping from phonemes to context dependent phones.

Previous research Remaining problems: 1. reliance on linguistics-based lexica 2. deterministic mapping from words to acoustic units

Integrated stochastic pronunciation modeling Integrated Stochastic Pronunciation Modeling (ISPM) Build a flexible three-layer architecture that represents pronunciation variation through probabilistic mappings, aiming at better performance than traditional triphone-based systems. Focus on the grapheme-based ISPM system, which eliminates the human effort of lexicon construction.

Integrated stochastic pronunciation modeling Grapheme-based ISPM

Integrated stochastic pronunciation modeling  Spelling simplification model (SSM) Map a letter string with regular pronunciation into a simple grapheme according to the context. e.g., EA->E Map a letter string with several pronunciations to simple graphemes, with appearance probability attached, e.g., OUGH->O (0.6) AF (0.4) Examining the transcription from the grapheme decoding against the reference transcription will help find the mapping.  Grapheme pronunciation model (GPM) The probabilistic mapping between the canonical layer and acoustic layer. LMs/decision trees/ANNs can all be examined here.

Integrated stochastic pronunciation modeling  Why graphemes? Simple relationship between word spellings and sub-word units helps generate baseforms for any words, so avoid human efforts on lexicon construction. It is easy to handle OOV words and reconstruct words from grapheme strings. Building and applying grapheme-based LMs will be simple. Internal composition of phonology rules and acoustic clues makes it suitable for some applications, such as spoken term detection and language identification.

Integrated stochastic pronunciation modeling  Direct grapheme ISPM Direct grapheme ISPM: SSM is a 1:1 mapping

Integrated stochastic pronunciation modeling  Hidden grapheme ISPM Hidden grapheme ISPM: SSM is a n:m mapping

Integrated stochastic pronunciation modeling  Training A divide-and-conquer approach, as in HSM, will be utilized for ISPM training. With this approach, SSM,GPM and AM are optimized iteratively and alternately within an EM framework, which ensures the process to converge to a local optimum. The acoustic units will be grown from a set of initial single-letter grapheme HMMs, as in the automatic lexicon generation approach.  Decoding The optimized ISPM will be used to expand searching graphs fed to the viterbi decoder. No changes are required in the decoder itself.  Implementation steps The SSM and GPM are well separated so can be designed/implemented respectively, and then are combined together. The SSM is relatively simpler therefore will be implmented first.

Integrated stochastic pronunciation modeling  The proposed ISPM will be evaluated on three tasks: Large vocabulary speech recognition (LVSR) Spoken term detection (STD) Language identification (LID) Simplest grapheme (NONO) Simple grapheme (SSM) Direct grapheme (GPM) Hidden grapheme (SSM+GPM) LVSR ★★★ ★★ ★ STD ★★ ★★ ★★ ★★ ★ LID ★★ Performance gain expectation from ISPM

Current experimental results  Large vocabulary speech recognition Training(h.)Development(h.)Evaluation(h.) WSJCAM RT04S Training vocTest vocLanguage model WSJCAM0 WSJ-5kWSJ 3-gram RT04S CMU+festiva l CMUAMI 3-gram WSJCAM0 for read speech and RT04S for spontaneous speech on the meeting domain Experiment settings for the LVSR task Data corpora for the LVSR task

Current experimental results  Large vocabulary speech recognition Experimental results of the LVSR task:
  Corpus     Phoneme system (WER)   Grapheme system (WER)
  WSJCAM0    11.3%                  15.8%
  RT04S      44.5%                  54.5%
Contribution of context-dependent modeling:
  System     CI (WER)   CD (WER)
  Phoneme    21.2%      9.8%
  Grapheme   48.4%      13.0%

Current experimental results Conclusions The grapheme-based system generally performs worse than the phoneme-based one, especially on the RT04S meeting-domain task, where a 10% absolute performance degradation is observed. A grapheme-based system relies on context-dependent modeling more than a phoneme-based system does, and requires more Gaussian mixture components. State-tying questions that reflect phonological rules are helpful. Other experiments showed that manually designed multi-letter graphemes do not help significantly.  Large vocabulary speech recognition Contribution of phonology-oriented questions to the grapheme system:
  Phoneme (WER)   Grapheme, extended questions (WER)   Grapheme, singleton questions (WER)
  11.3%           15.8%                                16.5%

Current experimental results  Spoken term detection sub-word lattice based architecture for STD

Current experimental results  Figure of Merit (FOM): average detection rate over the range [1,10] false alarms per hour.  Occurrence-weighted value (OCC) phonegrapheme FOM OCC ATWV WER44.5%54.5% STD performance on the RT04S task  Spoken term detection  Actual term-weighted value(ATWV)

Current experimental results  Spoken term detection A Grapheme-based STD systems is attractive because OOV words can be handled easily and the lattice search is efficient and simple. In our experiments the phoneme-based STD system works better. We suppose this because some unpopular terms are more difficult for the grapheme-based system to recognize. If similar ASR performance can be achieved, the grapheme-based system will outperform the phoneme-based one, as shown in the right figure.

Current experimental results  Spoken term detection We have demonstrated that in Spanish, which holds simple grapheme- phoneme relationship and achieves close ASR performance with phoneme and grapheme based systems, the grapheme-based STD system outperforms the phoneme-based one.

Current experimental results  Language identification parallel phone/grapheme recognizer architecture for LID

Current experimental results  Language identification GlobalPhone is used for the initial experiments, but we will move to NIST standard corpora. Detection error rate (DER), defined as the number of incorrect detections divided by the total number of trials, is used as the metric. Results on 3 seconds of speech across 4 languages are reported. Both whole-sentence scores and scores averaged over sub-word units are tested as the ANN input. DER (%) was measured for the phone, grapheme and phone+grapheme systems, with both unit-likelihood and sentence-likelihood scoring.
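A trivial sketch of the DER computation as defined above, with made-up trial labels (names and data are ours):

```python
def detection_error_rate(decisions, references):
    """DER: number of incorrect language detections divided by the total number of trials."""
    assert len(decisions) == len(references)
    errors = sum(1 for hyp, ref in zip(decisions, references) if hyp != ref)
    return errors / len(references)

# Toy usage: four 3-second trials, one misclassified -> DER = 0.25
print(detection_error_rate(["en", "es", "de", "en"], ["en", "es", "en", "en"]))
```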

Work plan  Phase I: Simple grapheme-based system 1. Finish the STD experiments with high-order LMs (by Jan.2008). 2. Finish the LID oriented tuning (by Nov.2007). 3. Apply powerful LMs to the LID task (by Jan.2008). 4. Finish the SSM design (by Jan.2008). 5. Apply the SSM on LVSR RTS04 and STD (by Feb.2008).  Phase II: Integrated stochastic pronunciation modeling 1. Finish the direct-grapheme architecture (GPM) design (by Jul.2008). 2. Test the direct-grapheme architecture on the LVSR RTS04 task (by Oct.2008). 3. Finish the hidden-grapheme architecture (GPM+SSM) (by Jan.2009). 4. Test the hidden-grapheme architecture on the LVSR RTS04 task (by Feb.2009).  Phase III: Applications based on ISPM 1. Finish the test on the STD task (by May 2009). 2. Finish the test on the LID task (by May 2009).