Published by Whitney Terry. Modified over 9 years ago.
1
Spontaneous speech recognition using a statistical model of VTR-dynamics
Team members: L. Deng (co-tech. team leader), J. Ma, M. Schuster, J. Bridle (co-tech. team leader), H. Richards, T. Kamm, J. Picone, S. Pike, R. Reagan
WS98 closing day presentation by Li Deng
2
VTR-model evaluation: overview
[Block diagram. Labels: speech waves; preprocessor; MFCC; HTK; N-best; automatic boundary adjustment; new EM training; new recognizer; decoded word sequence]
3
What is new
- Fundamental equation of speech recognition (not new)
- Same decision rule, but provides a sharper acoustic probability P(O|W) than the HMM
- Use knowledge to provide model structure (intermediate representation: high-level phonological symbols and low-level acoustics)
- Use acoustic data to train model parameters
4
What is new (cont'd)
- Compact parameter set (15,000 vs. 3,500,000 in HMM)
- Context dependence over utterance length, inherent in the model structure
- No triphones and no heuristic tyings, but an elaborate model structure for internal production-affiliated dynamics
- Many versions of the model, depending on the nature of the internal dynamics
- WS98 version: Vocal-Tract Resonance (VTR) dynamic model (VTRs are related to but distinct from formants)
5
Mathematical formulation of the model
- Highly constrained, time-varying, nonlinear dynamic system
- Formulated as a statistical (state-space) generative model:
  Z: VTR; O: observations (MFCCs); T: target; : rate of dynamics; j: unit.
  Note the asymptotic property (“spatial attractor”):
- Continuity constraint across units: continuity between units j and j+1 accounts for long-distance context dependence
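The slide's equations did not survive extraction. As a minimal sketch of what a target-directed state equation with the "spatial attractor" property can look like, the scalar, noise-free recursion below assumes the form z[k+1] = phi_j * z[k] + (1 - phi_j) * T_j. This form, the symbol `phi` for the rate of dynamics, and all numeric values are assumptions made for illustration, not taken from the slides.

```python
def simulate_vtr(units, z0):
    """Noise-free scalar sketch of target-directed dynamics.

    units: list of (target, phi, n_frames), one tuple per unit j.
    The assumed recursion is z <- phi * z + (1 - phi) * target, with 0 < phi < 1.
    The state is NOT reset at unit boundaries, which is the continuity constraint.
    """
    z = z0
    track = []
    for target, phi, n_frames in units:
        for _ in range(n_frames):
            z = phi * z + (1.0 - phi) * target
            track.append(z)
    return track


# Hypothetical values: a long unit aiming at 500, then a short unit aiming at 1500.
track = simulate_vtr([(500.0, 0.8, 40), (1500.0, 0.8, 5)], z0=300.0)
print("end of unit 1:", track[39])  # approaches the target (attractor)
print("end of unit 2:", track[44])  # short unit: target not reached (undershoot)
```

Because the second unit starts from wherever the first one ended, the trajectory inside a unit depends on its neighbours, which is one way to read the slide's claim that the continuity constraint accounts for context dependence.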
6
VTR-dynamics Illustration
7
Stochastic nonlinear “supersegment” model
Extends the stochastic segment model (Ostendorf et al., IEEE T-SAP, 1996) by
- use of physically motivated “global” continuity constraints across unit boundaries
- one “super”-segment per utterance, not per unit
- use of physically motivated nonlinearity in the observation equation
- use of special structure in the state equation for VTR dynamics to ensure “attractor” and local-continuity properties
- extension of the parameter-estimation algorithm to the nonlinear, constrained case
8
Likelihood computation
The likelihood is computed from the innovation sequence, which becomes a white sequence after the continuous dynamic state Z_k is estimated by the Extended Kalman Filter (EKF).
Note: boundaries for units (j) are fixed in all experiments reported.
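Since the slide's likelihood formula is missing, here is a hedged sketch of the standard innovation-form Gaussian log-likelihood, reduced to the scalar case. It assumes each innovation e_k is zero-mean Gaussian with variance s_k as produced by a Kalman-type filter; the function name and the toy numbers are illustrative only.

```python
import math


def innovation_log_likelihood(innovations, variances):
    """Sum of Gaussian log-densities of the innovations (scalar case).

    innovations: prediction errors e_k = o_k - predicted observation.
    variances:   innovation variances s_k reported by the filter.
    Whiteness of the innovations is what lets the per-frame terms be summed.
    """
    total = 0.0
    for e, s in zip(innovations, variances):
        total += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
    return total


# Toy comparison: a hypothesis whose model predicts the data well yields
# small innovations and therefore a higher score.
good = innovation_log_likelihood([0.1, -0.2, 0.05], [1.0, 1.0, 1.0])
bad = innovation_log_likelihood([1.5, -2.0, 1.0], [1.0, 1.0, 1.0])
print(good > bad)
```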
9
Extended Kalman Filter
- Predictor:
- Filter:
- Kalman gain K is computed by recursions (error covariance; Jacobian of the MLP nonlinearity in h(Z)).
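The predictor and filter equations are missing from the transcript. The following is a minimal scalar sketch of one generic EKF step, not the workshop implementation. It assumes the target-directed state equation z[k+1] = phi * z[k] + (1 - phi) * target + noise, and it uses `tanh` as a stand-in for the MLP observation function h; `phi`, `q`, `r`, and all numbers are hypothetical.

```python
import math


def ekf_step(z_est, p_est, obs, phi, target, q, r, h, h_prime):
    """One scalar EKF step: predict, linearize, then filter.

    z_est, p_est: previous state estimate and its error variance.
    q, r:         state-noise and observation-noise variances.
    h, h_prime:   observation function and its derivative (the Jacobian).
    Returns the new estimate, new variance, innovation, and innovation variance.
    """
    # Predictor: propagate the estimate and error variance through the state equation.
    z_pred = phi * z_est + (1.0 - phi) * target
    p_pred = phi * p_est * phi + q

    # Linearize the observation function around the predicted state.
    jac = h_prime(z_pred)
    innovation = obs - h(z_pred)
    s = jac * p_pred * jac + r          # innovation variance

    # Filter: the Kalman gain weights the innovation.
    gain = p_pred * jac / s
    z_new = z_pred + gain * innovation
    p_new = (1.0 - gain * jac) * p_pred
    return z_new, p_new, innovation, s


def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2


z, p, e, s = ekf_step(z_est=0.0, p_est=1.0, obs=0.6, phi=0.5, target=1.0,
                      q=0.1, r=0.2, h=math.tanh, h_prime=tanh_prime)
print(z, p, e, s)
```

The returned innovation and innovation variance are exactly the quantities a likelihood computation based on the innovation sequence would consume.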
10
ML parameter estimation of nonlinear model: EM algorithm
- E-step: the conditional expectations are computed by the EKF.
11
EM algorithm (cont'd)
- M-step computation
  - a)
  - b) an approximate solution is given by the back-propagation algorithm when using the MLP.
12
Economy of model parameters
- Dynamic-VTR based recognizer: <15,000
  breakdown: 9 MLPs: 9x(100x3+100x12) : 3x(42+42)
- HMM recognizer: 3,500,000
  breakdown: 39x12x2x3500
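The breakdowns above can be checked with a few lines of arithmetic. The products are taken directly from the slide. The variable names are only labels for the slide's terms: what each factor stands for is not stated in the transcript, and the label of the second VTR term was lost in extraction.

```python
# Dynamic-VTR recognizer, as written on the slide.
mlp_params = 9 * (100 * 3 + 100 * 12)   # the "9 MLPs" term
second_term = 3 * (42 + 42)             # term whose label is missing in the source
vtr_total = mlp_params + second_term

# HMM recognizer, as written on the slide.
hmm_total = 39 * 12 * 2 * 3500

print("VTR model:", vtr_total)
print("HMM:", hmm_total)
print("ratio:", round(hmm_total / vtr_total))
```

The VTR terms sum to 13,752, consistent with the "<15,000" on the slide. The HMM product is 3,276,000, so the slide's 3,500,000 is a rounded figure.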
13
Initial experiment (A)
18 utterances (conversation 3107A); same speaker; reference plus 5 manually created hypotheses
Word error rate
14
Large-scale experiments: conditions
- Training data
  - 30 min (sw97 training set)
  - single speaker (male)
- Test data
  - 1241 utterances (ws97 dev set; 9970 words)
  - 23 male speakers (not including the training speaker)
  - reference and N-best (N up to 100) hypotheses are time-aligned by the triphone-HMM system (ws97)
15
Results: 1241 male test utterances
- WER: HMM and VTR-model comparison
- All automatic alignment by HMM (ws97)
- WER as a function of N in the N-best paradigm (N=5 in this slide)
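The N-best evaluation described here can be sketched as follows: each utterance has a list of candidate transcriptions, a model score selects one, and WER is the total word-level edit distance divided by the number of reference words. The scoring function and the toy sentences are hypothetical placeholders, not data from the experiments.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # match or substitution
        prev = cur
    return prev[-1]


def rescore_nbest(hypotheses, score):
    """Pick the hypothesis with the highest model score."""
    return max(hypotheses, key=score)


def wer(pairs):
    """pairs: list of (reference, chosen hypothesis). Returns errors / reference words."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


# Toy example with a made-up score table standing in for model log-likelihoods.
nbest = ["i want to go", "i won to go", "i want go"]
toy_scores = {"i want to go": -10.0, "i won to go": -12.5, "i want go": -11.0}
best = rescore_nbest(nbest, toy_scores.get)
print(best, wer([("i want to go", best)]))
```

Raising N enlarges the candidate pool, which gives the rescoring model more chances both to recover the correct transcription and to pick a wrong one, so WER as a function of N is a natural diagnostic.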
16
Speaker variation in WER
17
Explain why the VTR model does the right job
- Since the model is based on physical parameters (e.g. VTR), we can analyze the experimental results with physical insight and understanding (penetrability)
- HMM systems would have a hard time doing this
- We used the methodology of “model synthesis” to do the analysis/diagnosis
18
Model synthesis (correct hypothesis; VTR by EKF)
19
Model synthesis (incorrect hypothesis; EKF)
20
Future work (short term)
- Integrate segmentation/scoring; lattice rescoring
- On-line adaptation of noise variances (varying with time and regions)
- Speaker normalization (in MFCCs and in targets)
- Investigate more effective constraints (in model structure and in EKF algorithm)
- Multiple targets or a distribution of targets (log-normal)
- Fast adaptation of rate/target parameters to a new speaker
21
Future work (long term)
- Design an acoustic processor with a “smoothness” property (matching that of the production variable): auditory-motivated preprocessor?
- Develop a feature-overlap mechanism (constrained by high-level linguistic information) as pronunciation model
- Integrate the feature-based model with dynamic-phonetic models (VTR, articulatory models, etc.)
- A computational theory of speech perception
23
Model synthesis (correct hypothesis)
24
Model synthesis (incorrect; model parameters)
25
Experimental Setup
[Block diagram. Labels: speech waves; preprocessor; MFCC; HTK; N-best; boundary adjustment; EM-based training; Dynam-VTR recognizer; decoded word sequence; “correct” transcriptions; lang. model]
26
Initial experiment (B)
10 utterances (conversation 2724B); separate speaker; reference plus 5 manually created hypotheses
Word error rate
27
Initial experiment (C)
10 utterances (conversation 2724B); separate speaker; reference plus N-best (N=5) hypotheses automatically created by the triphone-HMM system (ws97)
Word error rate