1
AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR
Issam Bazzi, Alex Acero, and Li Deng
Microsoft Research, One Microsoft Way, Redmond, WA, USA, 2003
Presented by Ts'ai Chung-Ming, Speech Lab, NTUST, 2007
2
Outline
Introduction
The Model
EM Training
Formant Tracking
Experiment Results
Conclusion
3
Ts'ai, Chung-Ming, Speech Lab, NTUST, 20073/14 Introduction Traditional methods use LPC or matching stored templates of spectral cross sections In either case, formant tracking is error-prone due to not enough candidates or templates This paper uses a predictor codebook of MFCC to present formant relationships Also, this method explores the complete formant space, avoiding premature elimination in LPC or template matching
4
Ts'ai, Chung-Ming, Speech Lab, NTUST, 20074/14 The Model o t = F(x t ) + r t o t is observed MFCC coefficients x t is vocal tract resonances (VTR) and corresponding bandwidths F(x t ) is the quantized frequency and bandwidth of formants, named predictor codebook r t is the residual signal
5
Ts'ai, Chung-Ming, Speech Lab, NTUST, 20075/14 Constructing F(x) All-pole model Assume there are I formants x = (F 1, B 1, F 2, B 2,……, F I, B I ) Then use z-transfrom to get H(z): Finally, each quantized VTR x can be transformed into a MFCC series F(x)
6
Ts'ai, Chung-Ming, Speech Lab, NTUST, 20076/14 EM Training (1/2) Use a single Gaussian to model r t T frames utterance, θ is parameters (mean and covariance) of Gaussian Assume formant values x are uniformly distributed, and can take any of C quantized values
7
EM Training (2/2)
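The update equations from this slide are not reproduced above. As an illustration only, under the setup on the previous slide (single Gaussian residual, uniform prior over the C codewords), one EM iteration can be sketched as follows: in the E-step, compute per-frame posteriors over codewords from the current residual Gaussian; in the M-step, re-estimate the residual mean and covariance from the posterior-weighted residuals. The function name `em_step`, the codebook matrix `F_codebook`, and the diagonal-covariance choice are assumptions, not taken from the paper.

```python
import numpy as np

def em_step(O, F_codebook, mu, var):
    """One EM iteration for o_t = F(x_t) + r_t with r_t ~ N(mu, diag(var))
    and a uniform prior over the C codebook entries.

    O: (T, D) observed MFCC frames; F_codebook: (C, D) predictor codebook.
    Returns updated (mu, var) and the per-frame codeword posteriors (T, C).
    (For a codebook as large as the paper's, the C axis would be processed
    in chunks rather than materialized at once, as done here for clarity.)
    """
    # Residual of every frame against every codeword: (T, C, D)
    resid = O[:, None, :] - F_codebook[None, :, :]
    # Diagonal-Gaussian log-likelihoods; constants drop out under the uniform prior.
    loglik = -0.5 * (((resid - mu) ** 2) / var + np.log(var)).sum(axis=2)
    loglik -= loglik.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(loglik)
    post /= post.sum(axis=1, keepdims=True)          # E-step posteriors (T, C)
    # M-step: posterior-weighted mean and variance of the residual.
    w = post[:, :, None]
    mu_new = (w * resid).sum(axis=(0, 1)) / O.shape[0]
    var_new = (w * (resid - mu_new) ** 2).sum(axis=(0, 1)) / O.shape[0]
    return mu_new, var_new, post
```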
8
Formant Tracking (1/2)
Frame-by-Frame Tracking: formants in each frame are estimated independently
Maximum a Posteriori (MAP): a one-to-one mapping that picks the single most likely codebook entry per frame
Minimum Mean Squared Error (MMSE): the posterior-weighted average over codebook entries
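Both frame-by-frame estimators can be read directly off the per-frame codeword posteriors. A minimal sketch, assuming posteriors like those returned by the hypothetical `em_step` above and a matrix `X_codebook` (C x 2I) holding the VTR vector of each codebook entry:

```python
import numpy as np

def track_frame_by_frame(post, X_codebook):
    """post: (T, C) per-frame codeword posteriors.
    X_codebook: (C, 2I) VTR vectors (F_1, B_1, ..., F_I, B_I) per entry."""
    x_map = X_codebook[post.argmax(axis=1)]   # MAP: single best entry per frame
    x_mmse = post @ X_codebook                # MMSE: posterior-weighted average
    return x_map, x_mmse
```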
9
Formant Tracking (2/2)
Tracking with Continuity Constraints
First-order state model: x_t = x_{t-1} + w_t, where w_t is Gaussian with zero mean and diagonal covariance Σ_w
The MAP estimate under this model can be obtained with a Viterbi search
The MMSE estimate is much more complex; the paper obtains it with an approximate method that is not described in detail here
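A minimal sketch of the MAP search under the continuity constraint, assuming per-frame codeword log-likelihoods (e.g. from the EM model above) and the same hypothetical `X_codebook`. This brute-force O(T·C²) Viterbi is only illustrative; a codebook the size of the paper's would need pruning or other approximations not shown here.

```python
import numpy as np

def viterbi_track(loglik, X_codebook, sigma_w):
    """loglik: (T, C) per-frame codeword log-likelihoods.
    X_codebook: (C, 2I) VTR vectors; sigma_w: (2I,) std devs of the jump w_t."""
    T, C = loglik.shape
    # Transition log-scores under w_t ~ N(0, diag(sigma_w**2)): entry [i, j] is prev i -> next j.
    diff = X_codebook[None, :, :] - X_codebook[:, None, :]
    logtrans = -0.5 * ((diff / sigma_w) ** 2).sum(axis=2)
    score = loglik[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + logtrans          # (prev C, next C)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + loglik[t]
    # Backtrace the best codeword sequence.
    path = np.empty(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return X_codebook[path]                        # (T, 2I) tracked VTRs
```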
10
Experiment Settings
Track 3 formants
Frequencies are first mapped onto the mel scale and then uniformly quantized; bandwidths are uniformly quantized directly
The constraint F1 < F2 < F3 is enforced, giving 767,500 codebook entries in total
Gain = 1
MFCCs are 12-dimensional, without C_0
20 utterances from one male speaker are used for EM training
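As an illustration of the quantization described above, the sketch below enumerates a small codebook: frequencies uniformly spaced on the mel scale, bandwidths uniformly spaced in Hz, and only entries satisfying F1 < F2 < F3 kept. The ranges and grid sizes are made-up placeholders and do not reproduce the 767,500 entries used in the paper.

```python
import numpy as np
from itertools import product

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def build_codebook(f_range=(200.0, 3500.0), n_f=20, b_range=(40.0, 300.0), n_b=5):
    """Enumerate (F1, B1, F2, B2, F3, B3) entries with F1 < F2 < F3.
    Frequency grid is uniform on the mel scale; bandwidth grid is uniform in Hz."""
    freqs = mel_to_hz(np.linspace(hz_to_mel(f_range[0]), hz_to_mel(f_range[1]), n_f))
    bands = np.linspace(b_range[0], b_range[1], n_b)
    entries = []
    for f1, f2, f3 in product(freqs, repeat=3):
        if not (f1 < f2 < f3):
            continue                               # enforce formant ordering
        for b1, b2, b3 in product(bands, repeat=3):
            entries.append((f1, b1, f2, b2, f3, b3))
    return np.array(entries)                       # (C, 6) codebook of VTR vectors
```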
11
Experiment Results, "they were what"
12
Experiment Results, with bandwidth
13
Experiment Results, residual
14
Conclusion
The method is fully unsupervised and requires no labeling
It works well in unvoiced frames
No gross errors were observed
It may be applied to speech recognition systems