LORIA Irina Illina Dominique Fohr Christophe Cerisara Torino Meeting March 9-10, 2006
HIWIRE
Work package 1:
– Missing Data
Work package 2:
– Non Native Speech Recognition
WP1: Missing Data
New approach for noisy speech recognition
Two steps:
– Training of mask models
– Recognition with masks
Missing Data: Training of Mask Models
Computation of mask vectors ("oracle") for each frame
– Spectrum with cube-root compression
– Spectrum for clean data and noisy data
– For each frequency band f (1..12): if SNR > 0 dB then mask(f) = 0, else mask(f) = 1
Clustering of mask vectors
– Euclidean distance
– N clusters are used (N = 31): each element of a cluster is represented by a mask vector and the corresponding frame vector (MFCC)
Training of one GMM per cluster
– Observations: observation vectors associated with frames (MFCC+D)
Training of one ergodic HMM (N states)
– Each state is one of the previous GMMs
– Only state transition probabilities are trained
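The oracle mask rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the per-band energies of the clean and noisy signals are available in the linear (uncompressed) domain, and that the noise energy can be approximated as noisy minus clean. The function names `oracle_mask` and `cuberoot`, and the `floor` parameter, are hypothetical.

```python
import math

def cuberoot(bands):
    """Cube-root compression of a list of non-negative band energies."""
    return [b ** (1.0 / 3.0) for b in bands]

def oracle_mask(clean_bands, noisy_bands, floor=1e-10):
    """Oracle mask for one frame, following the slide's convention:
    mask(f) = 0 if the local SNR is above 0 dB, else mask(f) = 1.

    Assumption: noise energy is approximated as (noisy - clean),
    floored to stay positive.
    """
    mask = []
    for clean, noisy in zip(clean_bands, noisy_bands):
        noise = max(noisy - clean, floor)
        snr_db = 10.0 * math.log10(max(clean, floor) / noise)
        mask.append(0 if snr_db > 0.0 else 1)
    return mask

# Toy frame with 3 bands instead of 12 (values are made up).
clean = [4.0, 1.0, 2.0]
noisy = [5.0, 3.0, 4.0]
print(oracle_mask(clean, noisy))
```

Since cube-root compression is monotonic, comparing clean and noise energies before or after compression gives the same 0 dB decision; only the subtraction that estimates the noise needs the linear domain.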
Missing Data: Recognition with Masks
Compute mask vector for each frame
– MFCC coefficients
– Viterbi alignment using the ergodic HMM
– Each frame -> one state -> mask
Perform marginalization with the masked frames
– Spectrum with cube-root compression
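The two recognition steps above can be sketched as follows, under several assumptions. The per-frame state scores `log_emit` stand in for the GMM log-likelihoods computed on MFCC features. Each state is assumed to carry one representative mask (`cluster_masks`, hypothetical). Marginalization is shown for a single diagonal Gaussian, where integrating out the unreliable bands reduces to skipping them. All names and toy values are illustrative.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence; log_emit[t][s] scores state s at frame t."""
    n_states = len(log_init)
    delta = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: delta[p] + log_trans[p][s])
            new_delta.append(delta[best] + log_trans[best][s] + log_emit[t][s])
            ptr.append(best)
        delta = new_delta
        backptr.append(ptr)
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def marginal_log_likelihood(spectrum, mask, mean, var):
    """Diagonal-Gaussian log-likelihood over reliable bands only.
    Slide convention: mask(f) = 1 marks a band with SNR <= 0 dB, which is skipped."""
    ll = 0.0
    for x, m, mu, v in zip(spectrum, mask, mean, var):
        if m == 0:
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
    return ll

# Toy ergodic HMM with 2 states instead of 31 (all values are made up).
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
log_emit = [[math.log(p) for p in row] for row in probs]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
cluster_masks = {0: [0, 0, 1], 1: [1, 1, 0]}  # hypothetical mask per state

path = viterbi(log_init, log_trans, log_emit)
frame_masks = [cluster_masks[s] for s in path]  # each frame -> one state -> mask
print(path, frame_masks)
```

In this sketch the state sequence is decoded on one feature stream and the resulting masks are applied to another (the cube-root compressed spectrum), which mirrors the slide's split between MFCC for mask estimation and spectrum for marginalization.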
Missing Data: Experiments
Training
– Aurora2
– 4 noises (Test A), 4 SNRs (5 10 dB)
Test
– Aurora2
– Test A and B
Missing Data: Experiments
Table: Baseline (multi-style) vs. Missing Data on Test A and Test B; rows: clean, SNR levels, Average
WP2: Non Native Speech Recognition
Method based on phone confusion
– Presented in Granada meeting
– Extract confusion rules between English phones and native acoustic models
– English phone -> French phone
– ah -> a
– ah ->
Method based on graphemic constraint
– Presented in Athens meeting
– Phone pronunciation depends on word grapheme
– English phone [grapheme] -> French phone
– ah [A] -> a (Approach)
– ah [E] -> (cancEl)
Non Native Speech Recognition: Method based on Graphemic Constraint
Idea:
– Example 1: APPROACH /ah p r ow ch/
  APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch)
– Example 2: POSITION /p ah z ih sh ah n/
  POSITION (P, p) (O, ah) (S, z) (I, ih) (TI, sh) (O, ah) (N, n)
Alignment between graphemes and phones for each word of the lexicon
– Using a discrete HMM
– Each state of the HMM is a phone symbol
Lexicon modification: add graphemes for each word (as in examples 1 and 2)
Confusion rules extraction
– (grapheme, English phone) → list of non native phones
– Example: (A, ah) → a
Confusion rules integration in acoustic models
Recognition
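The rule extraction step above can be sketched as a counting procedure. This is only an illustration under stated assumptions: the input is assumed to be a list of (grapheme, English phone, observed non native phone sequence) triples obtained from some prior alignment, and a rule is kept when it is seen at least `min_count` times. The function name, the threshold and the toy observations are hypothetical; only the (A, ah) → a rule comes from the slide.

```python
from collections import Counter, defaultdict

def extract_confusion_rules(observations, min_count=2):
    """Map (grapheme, English phone) to a list of non native phone sequences.

    observations: iterable of (grapheme, english_phone, native_phones),
    where native_phones is a tuple of phone symbols.
    Assumption: a rule is kept if observed at least min_count times.
    """
    counts = defaultdict(Counter)
    for grapheme, english_phone, native_phones in observations:
        counts[(grapheme, english_phone)][native_phones] += 1

    rules = {}
    for key, counter in counts.items():
        kept = [phones for phones, n in counter.most_common() if n >= min_count]
        if kept:
            rules[key] = kept
    return rules

# Toy observations (made up, except that (A, ah) -> a matches the slide).
observations = [
    ("A", "ah", ("a",)),
    ("A", "ah", ("a",)),
    ("A", "ah", ("e",)),   # seen once, dropped by the threshold
    ("O", "ah", ("o",)),
    ("O", "ah", ("o",)),
]
rules = extract_confusion_rules(observations)
print(rules)
```

Keying the rules on the (grapheme, phone) pair rather than on the phone alone is what lets the same English phone /ah/ map to different non native phones depending on the spelling.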
Example of acoustic model modification for English phone /t /
Figure: extracted rules (English phones -> French phones) and modified structure of the HMM for model /t / (English model, French models)
Experiments: HIWIRE Database
Training French acoustic models: Broadcast News corpus
Training English acoustic models: TIMIT
Non native speech recognition: 50 sentences per speaker for rules extraction, 50 sentences per speaker for test
Table: WER and SER for French, Italian and Spanish, by approach used; rows: baseline, confusion, graphemes confusion, for the Thales grammar and for the word loop grammar
Questions about prototype
Which noise robustness approaches will be put in the prototype?
Which speaker robustness approaches will be put in the prototype?
How to integrate noise and speaker robustness approaches at the same time?
Which grammar to use: Thales grammar or large vocabulary grammar?
Real time recognition?