CRANDEM: Conditional Random Fields for ASR. Jeremy Morris, 11/21/2008.


1 CRANDEM: Conditional Random Fields for ASR Jeremy Morris 11/21/2008

2 Outline Background – Tandem HMMs & CRFs Crandem HMM Phone recognition Word recognition

3 Background Conditional Random Fields (CRFs)  Discriminative probabilistic sequence model  Directly defines a posterior probability of a label sequence given a set of observations
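The definition above can be written out explicitly. For a linear-chain CRF over a label sequence y and observations x, using the state and transition feature-function terminology that appears later in these slides (the λ/μ weight notation is the standard one from the CRF literature, not taken verbatim from the slides):

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{t} \Big[ \sum_{i} \lambda_i\, s_i(y_t, \mathbf{x}, t)
  + \sum_{j} \mu_j\, f_j(y_{t-1}, y_t, \mathbf{x}, t) \Big] \Big)
```

where the s_i are state feature functions, the f_j are transition feature functions, and Z(x) sums the exponentiated score over all possible label sequences, which is what makes P(y|x) a proper posterior defined directly, with no generative model of x.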

Background Problem: How do we make use of CRF classification for word recognition?  Attempt to use CRFs directly?  Attempt to fit CRFs into current state-of-the-art models for speech recognition? Here we focus on the latter approach  How can we integrate what we learn from the CRF into a standard HMM-based ASR system? 4

5 Background Tandem HMM  Generative probabilistic sequence model  Uses outputs of a discriminative model (e.g. ANN MLPs) as input feature vectors for a standard HMM

6 Background Tandem HMM  ANN MLP classifiers are trained on labeled speech data Classifiers can be phone classifiers, phonological feature classifiers  Classifiers output posterior probabilities for each frame of data E.g. P(Q|X), where Q is the phone class label and X is the input speech feature vector

7 Background Tandem HMM  Posterior feature vectors are used by an HMM as inputs  In practice, posteriors are not used directly Log posterior outputs or “linear” outputs are more frequently used  “linear” here means the outputs of the MLP with no softmax applied to transform them into probabilities Since HMMs model phones as Gaussian mixtures, the goal is to make these outputs look more “Gaussian” Additionally, Principal Components Analysis (PCA) is applied to decorrelate the features, to suit diagonal covariance matrices

8 Idea: Crandem Use a CRF classifier to create inputs to a Tandem-style HMM  CRF labels provide better per-frame accuracy than the input MLPs  We’ve shown CRFs to provide better phone recognition than a Tandem system with the same inputs This suggests that we may get some gain from using CRF features in an HMM

9 Idea: Crandem Problem: CRF output doesn’t match MLP output  MLP output is a per-frame vector of posteriors  CRF outputs a probability across the entire sequence Solution: Use Forward-Backward algorithm to generate a vector of posterior probabilities

10 Forward-Backward Algorithm The Forward-Backward algorithm is already used during CRF training  Similar to the forward-backward algorithm for HMMs  Forward pass collects feature functions for the timesteps prior to the current timestep  Backward pass collects feature functions for the timesteps following the current timestep  Information from both passes is combined to determine the probability of being in a given state at a particular timestep

11 Forward-Backward Algorithm

12 Forward-Backward Algorithm This form allows us to use the CRF to compute a vector of local posteriors y at any timestep t. We use this to generate features for a Tandem-style system  Take log features, decorrelate with PCA
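The combination of forward and backward passes described above can be sketched for a generic linear-chain model. This is a minimal NumPy illustration; the log-score matrices and their sizes are hypothetical stand-ins for what the CRF's weighted feature functions would actually produce.

```python
import numpy as np

def local_posteriors(state_scores, trans_scores):
    """Per-frame label posteriors via forward-backward.

    state_scores: (T x K) log-scores for each label at each frame.
    trans_scores: (K x K) log-scores for each label transition.
    Returns a (T x K) matrix whose row t is P(y_t = k | x).
    """
    T, K = state_scores.shape
    alpha = np.zeros((T, K))  # forward pass: score of prefixes ending in k
    beta = np.zeros((T, K))   # backward pass: score of suffixes starting at k
    alpha[0] = state_scores[0]
    for t in range(1, T):
        # log-sum-exp over the previous label j for each current label k
        m = alpha[t - 1][:, None] + trans_scores
        alpha[t] = state_scores[t] + np.logaddexp.reduce(m, axis=0)
    for t in range(T - 2, -1, -1):
        m = trans_scores + state_scores[t + 1][None, :] + beta[t + 1][None, :]
        beta[t] = np.logaddexp.reduce(m, axis=1)
    # alpha + beta gives the unnormalized log-marginal of each label at t;
    # normalizing each row (by log Z) yields the local posterior vector.
    log_post = alpha + beta
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    return np.exp(log_post)

post = local_posteriors(np.random.randn(5, 3), np.random.randn(3, 3))
```

Each row of the result is a posterior distribution over labels for one frame, which is exactly the per-frame vector shape the Tandem pipeline expects from an MLP.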

13 Phone Recognition Pilot task – phone recognition on TIMIT  MLPs with 61 phone-label outputs trained on TIMIT, mapped down to 39 labels for evaluation  Crandem compared to Tandem and a standard PLP HMM baseline model  As with previous CRF work, we use the outputs of an ANN MLP as inputs to our CRF  Various CRF models examined (state feature functions only, state+transition functions), and various input feature spaces examined (phone classifier and phonological feature classifier)

14 Phone Recognition Phonological feature attributes  Detector outputs describe phonetic features of a speech signal Place, Manner, Voicing, Vowel Height, Backness, etc. A phone is described with a vector of feature values Phone class attributes  Detector outputs describe the phone label associated with a portion of the speech signal /t/, /d/, /aa/, etc.

15 Phone Recognition

16 Phone Recognition - Results

17 Results (Fosler-Lussier & Morris 08)
Model                           Phone Accuracy
PLP HMM reference               68.1%
Tandem (61 feas)                70.6%
Tandem (48 feas)                70.8%
CRF (state)                     69.9%
CRF (state+trans)               70.7%
Crandem (state) – log           71.1%
Crandem (state+trans) – log     71.7%
Crandem (state) – unnorm        71.2%
Crandem (state+trans) – unnorm  71.8%
* Significant (p<0.05) improvement at 0.6% difference between models

18 Results (Fosler-Lussier & Morris 08)
Model                           Phone Accuracy
PLP HMM reference               68.1%
Tandem (105 feas)               70.9%
Tandem (48 feas)                71.2%
CRF (state)                     71.4%
CRF (state+trans)               71.6%
Crandem (state) – log           71.7%
Crandem (state+trans) – log     72.4%
Crandem (state) – unnorm        71.7%
Crandem (state+trans) – unnorm  72.4%
* Significant (p≤0.05) improvement at 0.6% difference between models

19 Word Recognition Second task – Word recognition  Dictionary for word recognition has 54 distinct phones instead of 48, so new CRFs and MLPs were trained to provide input features  MLPs and CRFs again trained on TIMIT to provide both phone classifier output and phonological feature classifier output  Initial experiments – use MLPs and CRFs trained on TIMIT to generate features for WSJ recognition Next pass – use MLPs and CRFs trained on TIMIT to align label files for WSJ, then train MLPs and CRFs for WSJ recognition

20 Initial Results
Model                            Word Accuracy
MFCC HMM reference               90.85%
Tandem MLP (54 feas)             90.30%
Tandem MFCC+MLP (54 feas)        90.90%
Crandem (54 feas) (state)        90.95%
Crandem (54 feas) (state+trans)  90.77%
Crandem MFCC+CRF (state)         92.29%
Crandem MFCC+CRF (state+trans)   92.40%
* Significant (p≤0.05) improvement at roughly 1% difference between models

21 Initial Results
Model                            Word Accuracy
MFCC HMM reference               90.85%
Tandem MLP (98 feas)             91.26%
Tandem MFCC+MLP (98 feas)        92.04%
Crandem (98 feas) (state)        91.31%
Crandem (98 feas) (state+trans)  90.49%
Crandem MFCC+CRF (state)         92.47%
Crandem MFCC+CRF (state+trans)   92.62%
* Significant (p≤0.05) improvement at roughly 1% difference between models

22 Initial Results
Model                      Word Accuracy
MFCC HMM reference         90.85%
Tandem MLP (WSJ 54)        90.41%
Tandem MFCC+MLP (WSJ 54)   92.21%
Crandem (WSJ 54) (state)   89.58%
* Significant (p≤0.05) improvement at roughly 1% difference between models

23 Word Recognition Problems  Some of the models show slight but significant improvement over their Tandem counterparts Unfortunately, which changes will cause an improvement is not yet predictable  Transition features give a slight degradation when used on their own, and a slight improvement when the classifier is mixed with MFCCs  Retraining directly on WSJ data does not give an improvement for the CRF Gains from CRF training are wiped away if we just retrain the MLPs on WSJ data

24 Word Recognition Problems (cont.)  The only model that gives improvement for the Crandem system is a CRF model trained on linear outputs from MLPs Softmax outputs – much worse than baseline Log softmax outputs – ditto This doesn’t seem right, especially given the results from the Crandem phone recognition experiments  These were trained on softmax outputs  I suspect “implementor error” here, though I haven’t tracked down my mistake yet

25 Word Recognition Problems (cont.)  Because of the “linear inputs only” issue, certain features have yet to be examined fully “Hifny”-style Gaussian scores have not provided any gain – scaling of these features may be preventing them from being useful

26 Current Work Sort out problems with CRF models  Why is it so sensitive to the input feature type? (linear vs. log vs. softmax)  If this sensitivity is “built in” to the model, how can I appropriately scale features to include them in the model that works? Move on to next problem – direct decoding on CRF lattices