Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR
Xin Lei, Mei-Yuh Hwang and Mari Ostendorf
SSLI-LAB, Department of Electrical Engineering, University of Washington, Seattle, WA 98195

Overview
Motivation
– Tone plays a crucial role in Mandarin speech.
– Tone depends on the F0 contour over a time span much longer than a single frame; frame-level F0 and its derivatives capture only part of the tonal patterns.
Our approach
– Use an MLP with a longer input window to extract tone-related posteriors as a better model of tone.
– Incorporate the tone-related posteriors in the feature vector.

Five Tones in Mandarin
– High-level (tone 1), high-rising (tone 2), low-dipping (tone 3), high-falling (tone 4) and neutral (tone 5)
– Characterized by the syllable-level F0 contour
Confusion caused by tones
– 烟 (yan1, cigarette), 盐 (yan2, salt), 眼 (yan3, eye), 厌 (yan4, hate)
Common tone modeling techniques
– Tonal acoustic units (such as tonal syllables or tonemes)
– Adding pitch and its derivatives to the feature vector

Tandem Features
Hermansky et al. proposed the tandem approach:
– Use log MLP posterior outputs as input features for the Gaussian mixture models of a conventional speech recognizer.
References
– H. Hermansky et al., "Tandem connectionist feature extraction for conventional HMM systems", ICASSP 2000
– N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition", ICASSP 2004

Tone-related Posterior Feature Extraction
Two types of features are considered:
Tone posterior feature
– MLP targets: five tones and a no-tone target
– MLP input: 9-frame window of MFCC+F0 features
Toneme posterior feature
– MLP targets: 62 speech tonemes, plus one silence target, one laughter target, and one target for all other noise
– PCA reduces the output to 25 dimensions

Experiment Setup
Corpora
– Mandarin conversational telephone speech collected by HKUST
– Training set: train04, 251 phone calls, 57.7 hours of speech
– Test set: dev04, 24 calls totaling 2 hours, manually segmented
Front-end features
– Standard MFCC/PLP features from the SRI Decipher front end
– 3-dimensional pitch and delta features from ESPS get_f0
– Pitch features post-processed with the SRI pitch tool graphtrack (smoothing, plus reduction of pitch doubling and halving errors)
– Various MLP-generated tone-related posterior features
Decoding setup
– 7-class unsupervised MLLR adaptation
– Rescoring a lattice of word hypotheses with a trigram language model

MLP Training Results
MLP training with the ICSI QuickNet package
– Features: 9-frame window of MFCC + F0
– Tone/toneme labels: from forced alignment
– 10% of the data used for cross validation
Cross validation results on MLP training:

  Targets   Cardinality   Frame Accuracy
  tone      6             80.3%
  toneme    64            68.6%

For reference, frame accuracy for 46 English phones is around 67%.

Speech Recognition Results
Tone posterior feature results (non-crossword models):

  Feature                       Dimension   CER
  PLP                           39          36.8%
  PLP+F0                        42          35.7%
  PLP+(tone posterior)          45          35.4%
  PLP+PCA(tone posterior)       42          35.6%
  PLP+F0+(tone posterior)       48          35.2%
  PLP+F0+PCA(tone posterior)    45          35.2%

Toneme posterior feature results (non-crossword models):

  Feature                              Dimension   CER
  PLP+F0                               42          35.7%
  PLP+PCA(toneme posterior)            64          33.7%
  PLP+F0+PCA(toneme posterior)         67          33.2%
  PLP+PCA(tone, toneme posterior)      64          33.3%
  PLP+F0+PCA(tone, toneme posterior)   67          33.3%

Cross-word system results:

  Feature                        Dimension   CER
  PLP+F0                         42          35.0%
  PLP+F0+PCA(toneme posterior)   67          33.0%

Conclusions
– Tone posterior features by themselves outperform F0.
– Tone posteriors and the plain F0 features complement each other, though with some overlap.
– Toneme posteriors yield more improvement than tone posteriors, since they carry both tonal and segmental information.
– Combining tone-related posterior features with F0 gives a 2-2.5% absolute improvement in CER.

Future Work
– Context-dependent tone ("tri-tone") classification
– Tri-tone clustering
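The MLP input described above is a 9-frame window of MFCC+F0 features, i.e. each frame is stacked with its four left and four right neighbors. A minimal numpy sketch of that stacking (not the authors' code; the function name and edge-padding-by-repetition choice are illustrative assumptions):

```python
import numpy as np

def stack_context(frames, context=4):
    """Stack each frame with its +/-`context` neighbors.

    context=4 gives the 9-frame window used for the tone MLP input.
    Utterance edges are padded by repeating the first/last frame.
    frames: (T, D) array -> returns (T, (2*context+1)*D) array.
    """
    T, _ = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], context, axis=0),
                             frames,
                             np.repeat(frames[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].ravel()
                     for t in range(T)])
```

With 42-dimensional MFCC+F0 frames this produces 378-dimensional MLP inputs; other padding schemes (e.g. zero padding) are equally plausible at utterance boundaries.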
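The tandem recipe in the poster takes the MLP posterior outputs, applies a log, reduces them with PCA (to 25 dimensions for the toneme case), and appends the result to the base PLP+F0 features. A minimal numpy sketch of that pipeline under those stated assumptions (function name is illustrative; the poster does not specify implementation details such as the epsilon floor):

```python
import numpy as np

def tandem_features(posteriors, base_feats, out_dim=25, eps=1e-10):
    """Append PCA-reduced log MLP posteriors to base acoustic features.

    posteriors: (T, K) MLP posterior outputs (rows sum to 1)
    base_feats: (T, D) base features, e.g. 42-dim PLP+F0
    Returns a (T, D + out_dim) feature matrix.
    """
    logp = np.log(posteriors + eps)          # log posteriors (tandem)
    centered = logp - logp.mean(axis=0)
    # PCA via eigendecomposition of the covariance matrix
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:out_dim]]
    reduced = centered @ top                 # project to out_dim components
    return np.concatenate([base_feats, reduced], axis=1)
```

With 64 toneme posteriors, 42-dim PLP+F0 and out_dim=25 this yields the 67-dimensional PLP+F0+PCA(toneme posterior) configuration from the results tables. In practice the PCA transform would be estimated once on training data and reused, rather than refit per utterance as this sketch does.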