Spontaneous speech recognition using a statistical model of VTR-dynamics

Presentation transcript:

Spontaneous speech recognition using a statistical model of VTR-dynamics
Team members: L. Deng (co-tech. team leader), J. Ma, M. Schuster, J. Bridle (co-tech. team leader), H. Richards, T. Kamm, J. Picone, S. Pike, R. Reagan
WS98 closing day presentation by Li Deng

VTR-model evaluation: overview
[Block diagram. Labels: speech waves, preprocessor, MFCC, HTK, N-best, automatic boundary adjustment, EM training (new), recognizer (new), decoded word sequence]

What is new
- Fundamental equation of speech recognition (not new)
- Same decision rule, but provides a sharper acoustic probability P(O|W) than the HMM
- Uses knowledge to provide model structure (an intermediate representation between high-level phonological symbols and low-level acoustics)
- Uses acoustic data to train model parameters
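For reference, the "fundamental equation" named on this slide is the standard Bayes decision rule. It is written out here in the slide's notation, with O the observations and W a word sequence. The prior P(W) is the language model, which the slide does not name explicitly.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
        = \arg\max_{W} P(O \mid W)\,P(W)
```

The VTR model changes only how the acoustic term P(O|W) is computed; the decision rule stays the same.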

What is new (cont’d)
- Compact parameter set (15,000 vs. 3,500,000 in HMM)
- Context dependence over utterance length, inherent in model structure
- No triphones, no heuristic tyings, but an elaborate model structure for internal production-affiliated dynamics
- Many versions of the model, depending on the nature of the internal dynamics
- WS98 version: Vocal-Tract Resonance (VTR) dynamic model (VTRs are related to but distinct from formants)

Mathematical formulation of the model
- Highly constrained, time-varying, nonlinear dynamic system
- Formulated as a statistical (state-space) generative model: [equations]
  Z: VTR; O: observations (MFCCs); T: target; [symbol]: rate of dynamics; j: unit.
  Note the asymptotic property (“spatial attractor”): [equation]
- Continuity constraint across units (between units j and j+1) accounts for long-distance context dependence
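The equations on this slide did not survive in the transcript. The sketch below gives a form that is consistent with the symbol list on the slide and with target-directed state-space models of this kind. The symbol Φ_j for the rate of dynamics, the noise terms W_k and V_k, and the exact placement of indices are assumptions, not recovered from the slide. h(·) is the MLP observation nonlinearity referred to on the EKF slide.

```latex
\begin{aligned}
Z_{k+1} &= \Phi_j Z_k + (I - \Phi_j)\,T_j + W_k && \text{(state equation within unit } j\text{)} \\
O_k &= h(Z_k) + V_k && \text{(observation equation)} \\
\lim_{k \to \infty} E[Z_k] &= T_j && \text{(attractor, for a stable } \Phi_j \text{ held fixed)}
\end{aligned}
```

Under this form, taking expectations of the state equation with zero-mean W_k gives a recursion whose fixed point is T_j, which is one way to read the "spatial attractor" remark.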

VTR-dynamics Illustration

Stochastic nonlinear “supersegment” model
Extends the stochastic segment model (Ostendorf et al., IEEE T-SAP, 1996) by:
- use of physically motivated “global” continuity constraints across unit boundaries
- one “super”-segment per utterance, not per unit
- use of physically motivated nonlinearity in the observation equation
- use of special structure in the state equation for VTR dynamics to ensure “attractor” and local-continuity properties
- extension of the parameter-estimation algorithm to the nonlinear, constrained case
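To make the continuity and attractor properties concrete, here is a minimal noise-free sketch in Python. It assumes a scalar, first-order target-directed update for the state (the slide's actual state equation is not preserved), and the targets, rates, and the function name `simulate_supersegment` are hypothetical values chosen for illustration.

```python
def simulate_supersegment(units, z0, frames_per_unit):
    """Noise-free scalar VTR trajectory over a whole utterance.

    units: list of (target, rate) pairs, one per unit j, with 0 <= rate < 1.
    The state at the end of unit j is the initial state of unit j+1,
    which is the continuity constraint across unit boundaries.
    """
    z = z0
    trajectory = []
    for target, rate in units:
        for _ in range(frames_per_unit):
            # Assumed target-directed update: z moves toward the unit's target.
            z = rate * z + (1.0 - rate) * target
            trajectory.append(z)
    return trajectory


# Hypothetical targets (Hz) and rates for three consecutive units.
units = [(500.0, 0.8), (1500.0, 0.8), (700.0, 0.8)]
traj = simulate_supersegment(units, z0=500.0, frames_per_unit=10)
```

Because the state is never reset at a boundary, the start of each unit depends on how far the previous unit got toward its own target. That dependence is how a single trajectory per utterance carries context across units.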

Likelihood computation
[equation]
where the innovation [symbol] becomes a white sequence after the continuous dynamic state Z_k is estimated by the Extended Kalman Filter (EKF).
Note: boundaries for units (j) are fixed in all experiments reported.
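The likelihood equation itself is missing from the transcript. The usual innovations form for a linearized Gaussian state-space model is given below as a sketch. The symbols e_k for the innovation, Σ_{e,k} for its covariance, Ẑ_{k|k-1} for the one-step EKF prediction, and D for the observation dimension are assumed notation, not taken from the slide.

```latex
\begin{aligned}
e_k &= O_k - h(\hat{Z}_{k|k-1}) \\
\log p(O_1, \dots, O_N \mid W) &\approx -\frac{1}{2} \sum_{k=1}^{N}
  \left[ D \log(2\pi) + \log \lvert \Sigma_{e,k} \rvert + e_k^{\top} \Sigma_{e,k}^{-1} e_k \right]
\end{aligned}
```

The sum factorizes over frames because the innovations are (approximately) white, which is the property the slide points out.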

Extended Kalman Filter
- Predictor: [equation]
- Filter: [equation]
- Kalman gain K is computed by recursions (error covariance; Jacobian of the MLP nonlinearity in h(Z)).
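The predictor, filter, and gain recursions can be sketched for a scalar state as follows. This is an illustration under stated assumptions, not the workshop implementation: the state update is an assumed target-directed form, `tanh` stands in for the MLP nonlinearity h(Z), and all names and numeric values (`ekf_step`, `q`, `r`, the targets) are hypothetical.

```python
import math


def h(z):
    # Stand-in observation nonlinearity (the slides use an MLP here).
    return math.tanh(z)


def h_jacobian(z):
    # Derivative of tanh, playing the role of the MLP Jacobian.
    return 1.0 - math.tanh(z) ** 2


def ekf_step(z_est, p_est, obs, target, rate, q, r):
    """One scalar EKF recursion; returns (z_new, p_new, innovation, innovation_var)."""
    # Predictor: propagate the state estimate and its error variance.
    z_pred = rate * z_est + (1.0 - rate) * target
    p_pred = rate * p_est * rate + q
    # Linearize the observation equation around the prediction.
    jac = h_jacobian(z_pred)
    innovation = obs - h(z_pred)
    innovation_var = jac * p_pred * jac + r
    # Kalman gain from the error variance and the Jacobian.
    gain = p_pred * jac / innovation_var
    # Filter: correct the prediction with the weighted innovation.
    z_new = z_pred + gain * innovation
    p_new = (1.0 - gain * jac) * p_pred
    return z_new, p_new, innovation, innovation_var


def ekf_log_likelihood(observations, z0, p0, target, rate, q, r):
    """Sum of Gaussian log-densities of the innovations."""
    z, p, total = z0, p0, 0.0
    for obs in observations:
        z, p, e, s = ekf_step(z, p, obs, target, rate, q, r)
        total += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
    return total


# Noise-free synthetic observations from a hypothetical unit (target 0.5, rate 0.8).
z_true, observations = 0.0, []
for _ in range(20):
    z_true = 0.8 * z_true + (1.0 - 0.8) * 0.5
    observations.append(h(z_true))

ll_matched = ekf_log_likelihood(observations, 0.0, 0.01, target=0.5, rate=0.8, q=0.001, r=0.01)
ll_mismatched = ekf_log_likelihood(observations, 0.0, 0.01, target=-0.5, rate=0.8, q=0.001, r=0.01)
```

In this toy setting the hypothesis with the matching target scores a higher log-likelihood than the mismatched one, which is the mechanism a rescoring recognizer of this kind relies on.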

ML parameter estimation of nonlinear model: EM algorithm
- E-step: [equation]
  where [terms] and the conditional expectations are computed by the EKF.

EM algorithm (cont’d)
- M-step computation
  - a) [equation]
  - b) an approximate solution is given by the back-propagation algorithm when using the MLP.

Economy of model parameters
- Dynamic-VTR based recognizer: <15,000
  - breakdown: 9 MLPs: 9 x (100 x 3 + 100 x 12); remaining term: 3 x (42 + 42)
- HMM recognizer: 3,500,000
  - breakdown: 39 x 12 x 2 x 3500
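The breakdowns can be checked by evaluating them directly. The variable names below are hypothetical, and the individual factors are not labelled on the slide, so the snippet multiplies them out without interpreting them.

```python
# Parameter counts, evaluated directly from the breakdowns on the slide.
mlp_params = 9 * (100 * 3 + 100 * 12)   # 9 MLPs
other_vtr_params = 3 * (42 + 42)        # second term of the VTR breakdown
vtr_total = mlp_params + other_vtr_params

hmm_total = 39 * 12 * 2 * 3500

print(vtr_total)                      # 13752
print(hmm_total)                      # 3276000
print(round(hmm_total / vtr_total))   # 238
```

The VTR total of 13,752 agrees with the "<15,000" figure. The HMM product comes to 3,276,000, the same order as the quoted 3,500,000, so the HMM has roughly 240 times as many parameters.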

Initial experiment (A)
18 utterances (conversation 3107A); same speaker; reference plus 5 manually created hypotheses
[Results: word error rate]

Large-scale experiments: conditions
- Training data
  - 30 min (sw97 training set)
  - single speaker (male)
- Test data
  - 1241 utterances (ws97 dev set; 9970 words)
  - 23 male speakers (not including training speaker)
  - reference and N-best (N up to 100) hypotheses are time-aligned by triphone-HMM system (ws97)

Results: 1241 male test utterances
- WER: HMM and VTR-model comparison
- All automatic alignment by HMM (ws97)
- WER as a function of N in N-best paradigm (N=5 in this slide)
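As a sketch of how WER in an N-best rescoring paradigm can be computed, the Python below picks the best-scoring hypothesis among the top N for each utterance and accumulates word-level edit distance against the reference. The function names, the toy utterances, and the scores are hypothetical; the workshop's actual scoring tools are not described in the slides.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def rescore_wer(utterances, n):
    """WER when the best-scoring of the top-n hypotheses is chosen per utterance.

    utterances: list of (reference, nbest), where nbest is a list of
    (hypothesis, score) pairs in first-pass order; higher score is better.
    """
    errors = words = 0
    for reference, nbest in utterances:
        best_hyp, _ = max(nbest[:n], key=lambda pair: pair[1])
        errors += word_errors(reference, best_hyp)
        words += len(reference.split())
    return errors / words


# Toy data with made-up rescoring scores.
utterances = [
    ("i think so", [("i think though", -12.0), ("i think so", -10.0)]),
    ("yeah right", [("yeah right", -5.0), ("you write", -9.0)]),
]
wer_n1 = rescore_wer(utterances, 1)
wer_n2 = rescore_wer(utterances, 2)
```

With N=1 the rescoring model has no choice and the WER equals that of the first-pass top hypothesis. Larger N lets the rescoring model recover from first-pass errors, and also gives it more chances to choose wrongly.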

Speaker variation in WER

Explain why the VTR model does the right job
- Since the model is based on physical parameters (e.g. VTR), we can analyze the experimental results with physical insight and understanding (penetrability)
- HMM systems would have a hard time doing this
- We used the methodology of “model synthesis” to do the analysis/diagnosis

Model synthesis (correct hypothesis; VTR by EKF)

Model synthesis (incorrect hypothesis; EKF)

Future work (short term)
- Integrate segmentation/scoring; lattice rescoring
- On-line adaptation of noise variances (varying with time and regions)
- Speaker normalization (in MFCCs and in targets)
- Investigate more effective constraints (in model structure and in EKF algorithm)
- Multiple targets or a distribution of targets (log-normal)
- Fast adaptation of rate/target parameters to a new speaker

Future work (long term)
- Design an acoustic processor with a “smoothness” property (matching that of the production variable); auditory-motivated preprocessor?
- Develop a feature-overlap mechanism (constrained by high-level linguistic information) as pronunciation model
- Integrate the feature-based model with dynamic-phonetic models (VTR, articulatory models, etc.)
- A computational theory of speech perception

Model synthesis (correct hypothesis)

Model synthesis (incorrect; model parameters)

Experimental setup
[Block diagram. Labels: speech waves, preprocessor, MFCC, HTK, N-best, “correct” transcriptions, language model, boundary adjustment, EM-based training, Dynam-VTR recognizer, decoded word sequence]

Initial experiment (B)
10 utterances (conversation 2724B); separate speaker; reference plus 5 manually created hypotheses
[Results: word error rate]

Initial experiment (C)
10 utterances (conversation 2724B); separate speaker; reference plus N-best (N=5) hypotheses automatically created by triphone-HMM system (ws97)
[Results: word error rate]