ETRW Modelling Pronunciation variation for ASR ESCA Tutorial & Research Workshop Modelling pronunciation variation for ASR INTRODUCING MULTIPLE PRONUNCIATIONS.

Presentation transcript:

ETRW Modelling Pronunciation variation for ASR
ESCA Tutorial & Research Workshop: Modelling pronunciation variation for ASR
INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS
Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)

Presentation Contents
- Introduction
- The strategy applied
- CSR
  - Task
  - System Architecture
  - Results
- ISR
  - Task
  - System Architecture
  - Results
- Conclusions and Future Work

Introduction (I)
- Pronunciation variation is a common source of recognition errors
- Rule-based strategy to incorporate pronunciation alternatives for Spanish
- Phonetic rules covering actual speaking habits and context dependencies (not dialectal variation) have been explored
- Alternate pronunciations can be found even within the same speaker

Introduction (II)
- The lexicon should account for these different possibilities even within the same dialect
- It is important to study the impact of the rules on the lexicon
- Nearly 20% error rate reduction for the continuous speech task
- No significant change for the isolated word hypothesis generator

The strategy applied (I)
- Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations
- It handles coarticulation and assimilation effects at word boundaries in continuous speech
- Rules are accurate enough for Spanish because the transformation from grapheme to allophone is simple
- Rules are selected according to expert linguistic knowledge of the Castilian Spanish speaking style

The strategy applied (II)
Examples of variations considered:
- DIFFERENT HABITS: exámen: /e k s a m e n/
  - [e k s á m e~ n]
  - [e ɣ s á m e~ n]
  - [e s á m e~ n]
- CONTEXT DEPENDENT: bote: /b o t e/
  - un bote: [ú m b ó t e]
  - el bote: [e l β ó t e]
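The two kinds of rule illustrated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' transcriber: the rule table, the function names `expand_variants` and `join_words`, and the ASCII stand-ins `G` (weakened velar in the second "exámen" variant) and `B` (softened /b/ in "el bote") are all hypothetical, and stress and nasalization marks are omitted.

```python
from itertools import product

# Hypothetical optional rule: the canonical /k s/ cluster may be realized
# in three ways. "G" is an assumed ASCII stand-in for the weakened velar.
OPTIONAL_RULES = {
    ("k", "s"): [("k", "s"), ("G", "s"), ("s",)],
}

def expand_variants(canonical):
    """Return every pronunciation obtainable by applying the optional rules."""
    slots = []
    i = 0
    while i < len(canonical):
        for pattern, alternatives in OPTIONAL_RULES.items():
            if tuple(canonical[i:i + len(pattern)]) == pattern:
                slots.append(alternatives)
                i += len(pattern)
                break
        else:
            # No rule matches here: the allophone is kept unchanged.
            slots.append([(canonical[i],)])
            i += 1
    return [[p for chunk in combo for p in chunk] for combo in product(*slots)]

def join_words(prev, word):
    """Apply two assumed boundary rules that cover only the slide's examples:
    n + b -> m b (nasal assimilation); any other phone + b -> B."""
    prev, word = list(prev), list(word)
    if prev and word and word[0] == "b":
        if prev[-1] == "n":
            prev[-1] = "m"
        else:
            word[0] = "B"  # assumed stand-in for the softened /b/
    return prev + word

variants = expand_variants(["e", "k", "s", "a", "m", "e", "n"])
un_bote = join_words(["u", "n"], ["b", "o", "t", "e"])
el_bote = join_words(["e", "l"], ["b", "o", "t", "e"])
```

Each element of `variants` would become a separate lexicon entry for the same word, while `join_words` shows why context-dependent rules only matter when words are concatenated.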

The strategy applied (III)
- To limit the increase in lexicon size (i.e. perplexity), we have empirically searched for the minimum number of rules that produces significant improvements
- For the isolated word hypothesis generator, the number of rules had to be reduced further so as not to worsen the recognition rates

CSR Task
- Domain: Navy Resources Management in Spanish
- Speaker Dependent Task
- Training: 600 sentences, 4 speakers
- Test: 100 sentences, the same 4 speakers
- Base dictionary size: 979 words
- Extended dictionary size: 1211 words (+23.7%)

CSR System Architecture
- One-pass algorithm without any grammar
- In the lexicon, some words have several entries, each with an alternative allophone sequence
- Three parameter sets (10 MFCC + energy, delta, and delta-delta), each quantized with its own codebook of 256 centroids
- Discrete and semicontinuous HMMs for basic allophones (47) and triphones (350)
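The front end described above (a static stream plus two derivative streams, each vector-quantized separately) can be sketched as follows. The slides do not give the delta formula or the distance measure, so the symmetric difference in `deltas` and the squared Euclidean distance in `quantize` are assumptions, and all function names are hypothetical.

```python
def deltas(frames):
    """Assumed delta formula: (f[t+1] - f[t-1]) / 2, repeating edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def quantize(vector, codebook):
    """Index of the nearest centroid (squared Euclidean distance, assumed)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )

def encode(static_frames, codebooks):
    """Turn each frame into three codebook indices: static, delta, delta-delta.

    `codebooks` holds one codebook per stream; in the system described each
    would contain 256 centroids of dimension 11 (10 MFCC + energy).
    """
    d1 = deltas(static_frames)
    d2 = deltas(d1)
    streams = (static_frames, d1, d2)
    return [
        tuple(quantize(stream[t], cb) for stream, cb in zip(streams, codebooks))
        for t in range(len(static_frames))
    ]
```

A discrete HMM would then model the resulting index triples instead of the raw feature vectors.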

CSR Results

ISR Task
- Domain: Proper Names, telephone environment
- Hypothesis / Verification scheme
- Tested on the Hypothesis Generator so far
- Training: 5800 words, 3000 speakers
- Test: 2500 words, 2250 speakers
- Base dictionary size: 1175 words
- Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) excluding some rules
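The lexicon growth figures quoted for both tasks follow directly from the dictionary sizes. A small check, with the hypothetical helper `growth_percent`:

```python
def growth_percent(base, extended):
    """Relative lexicon growth in percent, rounded to one decimal place."""
    return round(100.0 * (extended - base) / base, 1)

csr_growth = growth_percent(979, 1211)            # CSR, all rules
isr_growth_all = growth_percent(1175, 1266)       # ISR, same rules as CSR
isr_growth_reduced = growth_percent(1175, 1193)   # ISR, some rules excluded
```

The same rule set enlarges the proper-name lexicon far less than the CSR lexicon, which is consistent with the context-dependent rules having little to act on in isolated words.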

ISR Hypothesis Generator (I)
- 8 MFCC + energy and 8 delta MFCC + delta energy, in 2 codebooks of 256 centroids each
- PSBU (Phonetic String Build-Up) very quickly generates a string over an alphabet of 53 allophone-like units
- Lexical Access: DP algorithm that matches the phonetic string against the dictionary, which may include multiple pronunciations
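A minimal sketch of the lexical access step, assuming a plain edit-distance alignment with unit costs (the actual system uses its own alignment costs, which the slides do not specify). The function names and the toy lexicon are hypothetical; a word with several pronunciations is scored by its best-matching entry.

```python
def alignment_cost(hyp, ref, sub=1.0, ins=1.0, dele=1.0):
    """DP alignment cost between a recognized string and one pronunciation."""
    # prev[j]: cost of aligning the hyp prefix seen so far with ref[:j]
    prev = [j * dele for j in range(len(ref) + 1)]
    for h in hyp:
        cur = [prev[0] + ins]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j - 1] + (0.0 if h == r else sub),  # match or substitution
                prev[j] + ins,                           # extra unit in hyp
                cur[j - 1] + dele,                       # ref unit missing in hyp
            ))
        prev = cur
    return prev[-1]

def lexical_access(hyp, lexicon, n_best=12):
    """Rank words by the cost of their best pronunciation; keep the n best."""
    scored = [
        (min(alignment_cost(hyp, pron) for pron in prons), word)
        for word, prons in lexicon.items()
    ]
    scored.sort()
    return [word for _, word in scored[:n_best]]

# Toy lexicon (hypothetical entries); "maximo" has two pronunciations.
lexicon = {
    "maximo": [list("maksimo"), list("masimo")],
    "marina": [list("marina")],
}
candidates = lexical_access(list("masimo"), lexicon)
```

Here the reduced pronunciation matches the recognized string exactly, so `maximo` ranks first; without the alternative entry it would still be reachable, but at a cost of one deletion.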

ISR Hypothesis Generator (II)
[Block diagram of the Hypothesis Generator. Labels: Speech; Preprocessing & VQ processes; Phonetic String Build-Up; Lexical Access; HMMs; VQ books; Durations; Alignment costs; Dictionary; Indexes; Phonetic string; List of Candidate Words]

ISR Results for the 12 best hypotheses

Conclusions and Future Work (I)
- Selecting the appropriate model for each context is important when two words are concatenated in CSR, hence the rules that create different entries depending on context. For ISR these rules are not useful.
- The acoustic model may not have enough resolution to take advantage of the alternatives proposed by the rules; these rules should work better in the verifier for ISR.

Conclusions and Future Work (II)
- It is important to study the real impact of the rules on the lexicon. For example, dialectal rules should reduce recognition error rates in a similar way for both CSR and ISR.
- We want to test these kinds of rules, plus dialectal variability rules, in the verifier stage of the ISR system.