ETRW Modelling Pronunciation variation for ASR
ESCA Tutorial & Research Workshop
Modelling pronunciation variation for ASR

INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS
Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)

Presentation Contents
• Introduction
• The strategy applied
• CSR
  – Task
  – System Architecture
  – Results
• ISR
  – Task
  – System Architecture
  – Results
• Conclusions and Future Work

Introduction (I)
• Pronunciation variation is a common source of recognition errors
• Rule-based strategy to incorporate pronunciation alternatives for Spanish
• Phonetic rules for actual speaking habits and context dependencies (not dialectal ones) have been explored
• Alternate pronunciations can be found even within the same speaker

Introduction (II)
• The lexicon should consider these different possibilities even within the same dialect
• It is important to study the impact of the rules on the lexicon
• Nearly 20% error rate reduction for the continuous speech task
• No significant change for the isolated word hypothesis generator case

The strategy applied (I)
• Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations
• It deals with coarticulation and assimilation effects at word boundaries for continuous speech
• Rules are accurate enough for Spanish because the transformation from grapheme to allophone is straightforward
• Rules are selected according to expert linguistic knowledge of the Castilian Spanish speaking style

The strategy applied (II)
• Examples of variations considered:
  – DIFFERENT HABITS: exámen: /e k s a m e n/
    • [e k s á m e~ n]
    • [e s á m e~ n]
    • [e s á m e~ n]
  – CONTEXT DEPENDENT: bote: /b o t e/
    • un bote: [ú m b ó t e]
    • el bote: [e l ó t e]
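
The two kinds of variation above can be illustrated with a minimal Python sketch. It is not the authors' transcriber: the rule table, the function names `expand_pronunciations` and `join_words`, and the simplified symbols (stress and nasalization marks are dropped) are all assumptions made for illustration. It only covers the "k s" habit alternation and the nasal assimilation seen in "un bote".

```python
# Hypothetical rule table (an assumption, not the authors' rule set):
# a phoneme sub-sequence maps to the alternative allophone sequences
# it may surface as.
HABIT_RULES = {
    ("k", "s"): [("k", "s"), ("s",)],  # "ks" may be kept or reduced to "s"
}


def expand_pronunciations(phonemes, rules=HABIT_RULES):
    """Return every allophone sequence obtainable by applying the rules."""
    results = [()]
    i = 0
    while i < len(phonemes):
        for pattern, alternatives in rules.items():
            if tuple(phonemes[i:i + len(pattern)]) == pattern:
                # Branch: each partial result continues with each alternative.
                results = [r + alt for r in results for alt in alternatives]
                i += len(pattern)
                break
        else:
            # No rule applies here: copy the phoneme unchanged.
            results = [r + (phonemes[i],) for r in results]
            i += 1
    return results


def join_words(left, right):
    """Apply one word-boundary rule: final /n/ becomes [m] before initial /b/."""
    left = list(left)
    if left and right and left[-1] == "n" and right[0] == "b":
        left[-1] = "m"
    return left + list(right)


variants = expand_pronunciations("e k s a m e n".split())
boundary = join_words(["u", "n"], ["b", "o", "t", "e"])
```

Here `variants` holds one sequence with "k s" and one with "s" only, and `boundary` is the concatenation with the nasal assimilated to "m". Each rule that fires multiplies the number of lexicon entries for a word, which is why the number of rules matters for lexicon size.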

The strategy applied (III)
• We have empirically searched for the minimum number of rules that produces significant improvements, in order to limit the increase in lexicon size (i.e. perplexity)
• For the isolated word hypothesis generator case, a further reduction in the number of rules has been necessary in order not to worsen the recognition rates

CSR Task
• Domain: Navy Resources Management in Spanish
• Speaker Dependent Task
• Training: 600 sentences, 4 speakers
• Test: 100 sentences, the same 4 speakers
• Base dictionary size: 979 words
• Extended dictionary size: 1211 words (+23.7%)

CSR System Architecture
• One-pass algorithm without any grammar
• In the lexicon some words have several entries, each with an alternative allophone sequence
• Three parameter sets, (10 MFCC + Energy), delta and delta 2, in 3 different codebooks with 256 centroids each
• Discrete and semicontinuous HMM models for basic allophones (47) and triphones (350)
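
As a rough illustration of the multi-codebook front end, the sketch below quantizes each parameter stream with its own codebook, so that a frame becomes one index per stream. The squared Euclidean distance, the function names, and the tiny two-centroid codebooks are assumptions chosen to keep the example short; the system described here uses 256 centroids per codebook.

```python
def nearest_centroid(vector, codebook):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))


def quantize_frame(streams, codebooks):
    """Quantize each parameter stream of one frame with its own codebook."""
    return tuple(nearest_centroid(s, cb) for s, cb in zip(streams, codebooks))


# Toy data (assumed): three streams standing in for static, delta and
# delta-2 parameters, each with a 2-centroid codebook of 2-dimensional vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
    [[0.0, -1.0], [0.0, 1.0]],
]
frame = ([0.9, 0.8], [-0.7, 0.1], [0.2, 0.6])
symbols = quantize_frame(frame, codebooks)
```

A discrete HMM would then score the resulting index triple, one observation symbol per codebook.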

CSR Results

ISR Task
• Domain: Proper Names, telephone environment
• Hypothesis / Verification scheme
• Tested on the Hypothesis Generator so far
• Training: 5800 words, 3000 speakers
• Test: 2500 words, 2250 speakers
• Base dictionary size: 1175 words
• Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) excluding some rules
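
The percentages quoted for the dictionary sizes are relative increases over the base dictionary. A short check in Python, using only the sizes given on the task slides (the helper name `relative_growth` is ours):

```python
def relative_growth(base, extended):
    """Percentage increase of the extended dictionary over the base one."""
    return 100.0 * (extended - base) / base


csr_growth = round(relative_growth(979, 1211), 1)           # 23.7
isr_growth_all_rules = round(relative_growth(1175, 1266), 1)  # 7.7
isr_growth_reduced = round(relative_growth(1175, 1193), 1)    # 1.5
```

The same rule set enlarges the CSR lexicon roughly three times as much as the ISR lexicon.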

ISR Hypothesis Generator (I)
• 8 MFCC + Energy and 8 delta MFCC + delta Energy, in 2 codebooks of 256 centroids each
• The Phonetic String Build-Up (PSBU) module very quickly generates a string of alphabet units (53 allophone-like units)
• Lexical Access: DP algorithm that matches the phonetic string against the dictionary, where multiple pronunciations may be included

ISR Hypothesis Generator (II)
[Figure: block diagram of the Hypothesis Generator. Processing blocks: Preprocessing & VQ processes; Phonetic String Build-Up; Lexical Access. Resources: HMMs, VQ books, Durations, Alignment costs, Dictionary, Indexes. Signals: Speech, Phonetic string, List of Candidate Words.]
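
The lexical access step can be sketched as an edit-distance style dynamic programming match. This is a simplified illustration, not the system's implementation: it assumes uniform substitution, insertion and deletion costs where the real system uses its own alignment costs, and the names `alignment_cost` and `lexical_access` as well as the toy lexicon are hypothetical. A word with several pronunciations is scored by its best-matching entry.

```python
def alignment_cost(observed, reference, sub=1.0, extra=1.0, missing=1.0):
    """DP alignment cost between two unit sequences (uniform costs assumed)."""
    # prev[j]: cost of aligning the observed prefix seen so far
    # with the first j units of the reference pronunciation.
    prev = [j * missing for j in range(len(reference) + 1)]
    for i, o in enumerate(observed, start=1):
        curr = [i * extra]
        for j, r in enumerate(reference, start=1):
            curr.append(min(
                prev[j - 1] + (0.0 if o == r else sub),  # match or substitution
                prev[j] + extra,                         # extra observed unit
                curr[j - 1] + missing,                   # reference unit missing
            ))
        prev = curr
    return prev[-1]


def lexical_access(observed, lexicon, n_best=12):
    """Rank words by the cost of their best-matching pronunciation."""
    scored = []
    for word, pronunciations in lexicon.items():
        best = min(alignment_cost(observed, p) for p in pronunciations)
        scored.append((best, word))
    scored.sort()
    return [word for _, word in scored[:n_best]]


# Toy lexicon (assumed): one word with two pronunciations, one with a single entry.
lexicon = {
    "examen": [list("eksamen"), list("esamen")],
    "bote": [list("bote")],
}
candidates = lexical_access(list("esamen"), lexicon, n_best=1)
```

With the reduced pronunciation in the lexicon, the observed string matches "examen" at zero cost; without it, the same string would still cost one deletion against the full form.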

ISR Results for the 12 best hypotheses

Conclusions and Future Work (I)
• For CSR, selecting the appropriate model for each context is important when two words are concatenated: the rules produce different entries depending on context. For ISR these rules are not useful.
• The acoustic model may not have enough resolution to take advantage of the alternatives proposed by the rules: these rules should work better in the verifier for ISR.

Conclusions and Future Work (II)
• It is important to study the real impact of the rules on the lexicon. For example, dialectal rules should reduce recognition error rates in a similar way for both CSR and ISR.
• We want to test this kind of rule, plus dialectal variability rules, on the verifier stage of the ISR system.