Slide 1: Voicing Features

Horacio Franco, Martin Graciarena, Andreas Stolcke, Dimitra Vergyri, Jing Zheng
STAR Lab, SRI International

Slide 2: Phonetically Motivated Features

Problem:
- Cepstral coefficients fail to capture many discriminative cues.
- The front end is optimized for traditional Mel cepstral features.
- Front-end parameters are a compromise solution across all phones.

Slide 3: Phonetically Motivated Features

Proposal:
- Enrich the Mel cepstral feature representation with phonetically motivated features from independent front ends.
- Optimize each specific front end to improve discrimination.
- Robust broad-class phonetic features provide "anchor points" in acoustic-phonetic decoding.
- General framework for multiple phonetic features.

First approach: voicing features.

Slide 4: Voicing Features

Voicing feature algorithms:
1. Normalized peak autocorrelation (PA). For a time frame $x$ with autocorrelation $R_x(\tau) = \sum_n x[n]\,x[n+\tau]$,
   $\mathrm{PA} = \max_{\tau} R_x(\tau) / R_x(0)$,
   with the max computed over lags in the pitch region 80 Hz to 450 Hz.
2. Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES). If $Y = (y_1, \dots, y_N)$ is the cepstrum (or spectrum) restricted to the pitch region and normalized to sum to one, and $H$ is the entropy of $Y$, then
   $\mathrm{EC} = H(Y) = -\sum_i y_i \log y_i$,
   with the entropy computed over the pitch region 80 Hz to 450 Hz.
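A minimal sketch of the two definitions above, assuming 16 kHz audio and numpy; the function names, windowing, and numerical guards are illustrative choices, not the original SRI implementation:

```python
import numpy as np

def peak_autocorr(frame, fs=16000, fmin=80.0, fmax=450.0):
    """Normalized peak autocorrelation (PA) over the pitch lag range."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # R(0), R(1), ...
    if r[0] <= 0:
        return 0.0
    lag_min = int(fs / fmax)   # shortest period (highest pitch)
    lag_max = int(fs / fmin)   # longest period (lowest pitch)
    return float(r[lag_min:lag_max + 1].max() / r[0])

def cepstral_entropy(frame, fs=16000, fmin=80.0, fmax=450.0, nfft=4096):
    """Entropy of the real cepstrum restricted to the pitch quefrency range.
    Voiced frames show a sharp cepstral peak, hence low entropy."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    cep = np.fft.irfft(np.log(spec + 1e-10))      # real cepstrum
    q_min = int(fs / fmax)                        # pitch region in quefrency bins
    q_max = int(fs / fmin)
    y = np.abs(cep[q_min:q_max + 1])
    y = y / (y.sum() + 1e-10)                     # normalize to a distribution Y
    return float(-(y * np.log(y + 1e-10)).sum())  # H(Y)
```

The spectral-entropy variant (ES) follows the same pattern with the linear spectrum in place of the cepstrum.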

Slide 5: Voicing Features

3. Correlation with a template and DP alignment [Arcienega, ICSLP'02]. The Discrete Logarithm Fourier Transform (DLFT) is computed over the frequency band of interest for the speech signal. If IT is an impulse train, the template is the DLFT of IT; the correlation for frame j is computed between the signal DLFT and the template, and the DP-optimal correlation is obtained by dynamic-programming alignment across frames, with the max computed in the pitch region 80 Hz to 450 Hz.
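The exact formulas did not survive the transcript, so the following is only a rough, illustrative stand-in for the idea (not Arcienega's formulation): correlate the frame's log-frequency magnitude spectrum with impulse-train templates at candidate pitches and keep the maximum; the DP smoothing across frames is omitted. All names and constants are assumptions.

```python
import numpy as np

def template_voicing(frame, fs=16000, fmin=80.0, fmax=450.0,
                     band=(100.0, 3000.0), n_bins=512, n_harm=20):
    """Max correlation of a log-frequency spectrum with impulse-train templates."""
    nfft = 4096
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    log_grid = np.logspace(np.log10(band[0]), np.log10(band[1]), n_bins)
    sig = np.interp(log_grid, freqs, spec)                  # spectrum on log-f axis
    sig = (sig - sig.mean()) / (sig.std() + 1e-10)

    best = -np.inf
    for f0 in np.linspace(fmin, fmax, 100):                 # candidate pitches
        # template: harmonics of an impulse train at k*f0, rendered on the grid
        tmpl = np.zeros(n_bins)
        for k in range(1, n_harm + 1):
            if k * f0 > band[1]:
                break
            tmpl += np.exp(-0.5 * ((log_grid - k * f0) / (0.01 * k * f0)) ** 2)
        tmpl = (tmpl - tmpl.mean()) / (tmpl.std() + 1e-10)
        best = max(best, float(np.dot(sig, tmpl) / n_bins))
    return best
```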

Slide 6: Voicing Features

Preliminary exploration of voicing features:
- Best feature combination: peak autocorrelation + entropy of cepstrum.
- Complementary behavior of the autocorrelation and entropy features for high and low pitch:
  - Low pitch: time periods are well separated, therefore the correlation is well defined.
  - High pitch: harmonics are well separated and the cepstrum is well defined.

Slide 7: Voicing Features

[Figure: voicing feature trajectories plotted over time, aligned with the phone sequence "w er k ay n d ax f s: aw th ax v dh ey ax r".]

Slide 8: Voicing Features

Integration of voicing features:

1. Juxtaposing voicing features:
- Juxtapose the two voicing features with the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD).
- Voicing feature front end: use the same frame rate as the MFCC front end and optimize the temporal window duration.
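A minimal sketch of the juxtaposition, assuming per-frame MFCC+D+DD vectors (39-dim) and the two per-frame voicing values are already computed; the names are illustrative:

```python
import numpy as np

def juxtapose(mfcc_ddd, pa, ec):
    """Append the two voicing features to each MFCC+D+DD frame.

    mfcc_ddd : (T, 39) MFCC + delta + delta-delta frames
    pa, ec   : (T,) per-frame peak autocorrelation and cepstral entropy,
               computed at the same 10 ms frame rate but over a longer
               analysis window (e.g. ~100 ms, per the tuning on the next slide)
    returns  : (T, 41) juxtaposed feature vectors
    """
    return np.hstack([mfcc_ddd, pa[:, None], ec[:, None]])
```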

Slide 9: Voicing Features

Setup:
- Train on the small Switchboard database (64 hours); test on dev2001. WER reported for both sexes.
- Features: MFCC+D+DD, 25.6 ms frame every 10 ms, VTLN and speaker mean/variance normalization.
- Genone acoustic model: non-cross-word, MLE-trained, gender-dependent. Bigram LM.

Window length optimization (WER):
  Baseline                                 41.4%
  Baseline + 2 voicing (25.6 ms window)    41.2%
  Baseline + 2 voicing (75 ms window)      40.7%
  Baseline + 2 voicing (87.5 ms window)    40.5%
  Baseline + 2 voicing (100 ms window)     40.4%
  Baseline + 2 voicing (112.5 ms window)   41.2%

Slide 10: Voicing Features

2. Voiced/unvoiced posterior features:
- Use the posterior voicing probability as a feature, computed from a 2-state HMM.
- Juxtaposed feature dimension is 40. Similar setup as before; males-only results.
- Soft voiced/unvoiced transitions may not be captured, because the posterior feature behaves much like a binary feature.

  Recognition system                 WER
  Baseline                           39.2%
  Baseline + voicing posterior       39.7%
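The slide does not spell out how the posterior is computed beyond "2-state HMM"; the following is a generic forward-backward sketch over assumed per-frame voiced/unvoiced log-likelihoods. The state models, the self-transition probability p_stay, and all names are illustrative assumptions.

```python
import numpy as np

def voicing_posterior(loglik_v, loglik_u, p_stay=0.95):
    """Per-frame P(voiced) from a 2-state (voiced/unvoiced) HMM via forward-backward.

    loglik_v, loglik_u : (T,) per-frame log-likelihoods of the voiced and unvoiced
                         states (e.g. from Gaussians on the voicing features).
    """
    T = len(loglik_v)
    obs = np.exp(np.stack([loglik_v, loglik_u], axis=1))         # (T, 2) likelihoods
    A = np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])   # transition matrix

    alpha = np.zeros((T, 2))
    alpha[0] = 0.5 * obs[0]
    alpha[0] /= alpha[0].sum()                                   # scaled forward pass
    for t in range(1, T):
        alpha[t] = obs[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()

    beta = np.zeros((T, 2))
    beta[-1] = 1.0                                               # scaled backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (obs[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)                    # state posteriors
    return gamma[:, 0]                                           # P(voiced) per frame
```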

Slide 11: Voicing Features

3. Window of voicing features + HLDA:
- Juxtapose the MFCC features with a window of voicing features around the current frame, then apply dimensionality reduction with HLDA. The final feature has 39 dimensions.
- Same setup as before, with MFCC+D+DD+3rd differences. Both sexes.
- The HLDA baseline is 1.5% absolute better; voicing improves it by a further 1%.

  Recognition system                           WER (%)
  Baseline + HLDA                              39.9
  Baseline + 1 frame, 2 voicing + HLDA
  Baseline + 5 frames, 2 voicing + HLDA        38.9
  Baseline + 9 frames, 2 voicing + HLDA        39.5
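A sketch of the feature construction only; HLDA estimation itself (an EM procedure over class-dependent covariances) is not shown and is assumed to have been done offline. The dimensions assume 13 base cepstra, so MFCC+D+DD+3rd diffs is 52-dimensional, plus a 5-frame window of the 2 voicing features (10 dimensions), projected to 39.

```python
import numpy as np

def stack_window(voicing, half_width=2):
    """Stack a +/- half_width frame window of voicing features around each frame.

    voicing : (T, 2) per-frame voicing features
    returns : (T, 2 * (2*half_width + 1)); half_width=2 gives 5 frames = 10 dims
    """
    T = voicing.shape[0]
    padded = np.pad(voicing, ((half_width, half_width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * half_width + 1)])

def project_hlda(mfcc_3d, voicing_win, hlda_matrix):
    """Juxtapose and reduce with a pre-estimated HLDA-style projection.

    mfcc_3d     : (T, 52) MFCC + D + DD + 3rd differences
    voicing_win : (T, 10) windowed voicing features
    hlda_matrix : (39, 62) projection estimated offline (estimation not shown)
    """
    full = np.hstack([mfcc_3d, voicing_win])   # (T, 62) juxtaposed feature
    return full @ hlda_matrix.T                # (T, 39) final feature
```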

Slide 12: Voicing Features

4. Delta of voicing features + HLDA:
- Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature.
- Same setup as before, with MFCC+D+DD+3rd differences. Males only.
- No gain: the reason may be that variability in the voicing features produces noisy deltas, whereas the HLDA weighting of the "window of voicing features" acts like an average.

  Recognition system                              WER
  Baseline + HLDA                                 37.5%
  Baseline + voicing + delta voicing + HLDA       37.6%

=> The best overall configuration was MFCC+D+DD+3rd differences plus 10 voicing features + HLDA.
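For concreteness, a standard regression-delta sketch of what "delta and delta-delta of the voicing features" means; the +/-2 frame window is an assumption, and delta-deltas are obtained by applying the same function to the deltas.

```python
import numpy as np

def deltas(feat, n=2):
    """Regression deltas d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2).

    feat : (T, D) feature matrix; edges are handled by repeating the end frames.
    """
    T = feat.shape[0]
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = np.zeros_like(feat)
    for k in range(1, n + 1):
        out += k * (padded[n + k:n + k + T] - padded[n - k:n - k + T])
    return out / denom
```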

Slide 13: Voicing Features

Voicing features in the SRI CTS Eval Sept '03 system:
- Adaptation of MMIE cross-word models with and without voicing features, using the best configuration of voicing features.
- Train on the full SWBD + CTRANS data; test on Eval'02.
- Features: MFCC+D+DD+3rd differences + HLDA.
- Adaptation: 9 full-matrix MLLR transforms. Adaptation hypotheses from an MLE non-cross-word model, PLP front end with voicing features.

  Recognition system                 WER
  Baseline EVAL                      25.6%
  Baseline EVAL + voicing            25.1%

Slide 14: Voicing Features

Hypothesis examples:

  REF:           OH REALLY WHAT WHAT KIND OF PAPER
  HYP BASELINE:  OH REALLY WHICH WAS KIND OF PAPER
  HYP VOICING:   OH REALLY WHAT WHAT KIND OF PAPER

  REF:           YOU KNOW HE S JUST SO UNHAPPY
  HYP BASELINE:  YOU KNOW YOU JUST I WANT HAPPY
  HYP VOICING:   YOU KNOW HE S JUST SO I WANT HAPPY

Slide 15: Voicing Features

Error analysis:
- In one experiment, 54% of speakers obtained a WER reduction (some up to 4% absolute); the remaining 46% showed a small WER increase.
- A more detailed study of speaker-dependent performance is still needed.

Implementation:
- Implemented a voicing-feature engine in the DECIPHER system.
- Fast computation: one FFT and two inverse FFTs per frame yield both voicing features.
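The "one FFT and two IFFTs per frame" remark matches a standard identity: the autocorrelation is the inverse FFT of the power spectrum (Wiener-Khinchin), and the real cepstrum is the inverse FFT of the log magnitude spectrum, so both voicing features can share a single forward FFT. A minimal sketch of that sharing (not the DECIPHER implementation; names and constants are illustrative):

```python
import numpy as np

def voicing_pair(frame, fs=16000, fmin=80.0, fmax=450.0, nfft=4096):
    """Compute (peak autocorrelation, cepstral entropy) from one FFT + two IFFTs."""
    windowed = (frame - frame.mean()) * np.hamming(len(frame))
    spec = np.fft.rfft(windowed, nfft)                      # the one forward FFT
    # zero-padding (nfft >> frame length + max pitch lag) avoids circular wrap-around
    autocorr = np.fft.irfft(np.abs(spec) ** 2)              # IFFT #1: autocorrelation
    cepstrum = np.fft.irfft(np.log(np.abs(spec) + 1e-10))   # IFFT #2: real cepstrum

    lo, hi = int(fs / fmax), int(fs / fmin)                 # pitch lag/quefrency range
    pa = autocorr[lo:hi + 1].max() / (autocorr[0] + 1e-10)

    y = np.abs(cepstrum[lo:hi + 1])
    y /= y.sum() + 1e-10
    ec = -(y * np.log(y + 1e-10)).sum()
    return float(pa), float(ec)
```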

Slide 16: Voicing Features

Conclusions:
- Explored how to represent and integrate the voicing features for best performance.
- Achieved a 1% absolute (~2% relative) gain in the first pass (with the small training set), and more than 0.5% absolute (2% relative) with the full training set in the higher rescoring passes of the DECIPHER LVCSR system.

Future work:
- Further explore feature combination and selection.
- Develop more reliable voicing features; the current features do not always reflect actual voicing activity.
- Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).


