Combining Phonetic Attributes Using Conditional Random Fields
Jeremy Morris and Eric Fosler-Lussier – Department of Computer Science and Engineering, The Ohio State University

A Conditional Random Field (CRF) is a mathematical model for sequences that is similar in many ways to a Hidden Markov Model (HMM), but is discriminative rather than generative in nature. Here we explore the application of the CRF model to ASR by building a system that performs first-pass phonetic recognition using discriminatively trained phonetic attributes. This system achieves a level of accuracy on a phone recognition task superior to that of a comparably trained HMM.

Phonetic Attributes

- Phonetic attributes are defined via linguistic properties, following the International Phonetic Association (IPA) phonetic chart:
  - Consonants are defined by their sonority, voicing, manner, and place of articulation
  - Vowels are defined by their sonority, voicing, height, frontness, roundness, and tenseness
  - Additional features cover silence
- Phonetic attributes are extracted by multi-layer perceptron (MLP) neural network classifiers:
  - Classifiers are trained on 12th-order PLP cepstral and delta coefficients derived from the speech data
  - Speech data is broken into frames of 25 ms, with overlapping frames starting every 10 ms
  - The input is a vector of PLP and delta coefficients for a nine-frame window centered on the current frame, with four frames of context on either side
- Each classifier outputs a posterior probability for each of its possible attribute values given the data; for any given frame, the posteriors from a single classifier sum to one
- The MLP classifiers were trained on a phonetically transcribed corpus:
  - Phonetic attribute labels were derived from the IPA description of each transcribed phone (see Figure 1)
  - For training purposes, all phones are assumed to take their canonical attribute values, and attribute boundaries occur at phone boundaries

FIGURE 1: Spectrogram of the phrase "that experience", shown with phonetic labels and the corresponding neural network posterior distributions over each phonetic attribute for each frame of speech. Attributes are listed top to bottom in the same order in which they appear in Table 1.

TABLE 1: Phonetic attributes

Attribute | Possible output values
----------|-----------------------
SONORITY  | vowel, obstruent, sonorant, syllabic, silence
VOICE     | voiced, unvoiced, n/a
MANNER    | fricative, stop, closure, flap, nasal, approximant, nasalflap, n/a
PLACE     | labial, dental, alveolar, palatal, velar, glottal, lateral, rhotic, n/a
HEIGHT    | high, mid, low, lowhigh, midhigh, n/a
FRONT     | front, back, central, backfront, n/a
ROUND     | round, nonround, roundnonround, nonroundround, n/a
TENSE     | tense, lax, n/a

Conditional Random Fields

A CRF is a discriminative model of a sequence that models the posterior probability of a label sequence given a set of observed data (Lafferty et al., 2001). A CRF can be described by the following equation:

P(y|x) = (1/Z(x)) exp( Σ_i [ Σ_j λ_j s_j(y_i, x, i) + Σ_k μ_k t_k(y_{i-1}, y_i, x, i) ] )

where each s_j is a state feature function with weight λ_j, each t_k is a transition feature function with weight μ_k, and Z(x) is a normalization term that sums the same quantity over all possible label sequences.

State feature functions:
- Associate observations in the data at a particular time segment with the label at that segment
- Written s(y, x, i), where y is the label, x is the observed data, and i is the time frame
- Take a non-zero value when the current label at frame i matches y and some observation in x holds for frame i; otherwise the value is zero (a minimal sketch of such a function follows below)
- Prior work using CRFs in speech recognition has used Gaussian attributes to build state feature functions (Gunawardana et al., 2005)
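To make the indicator form concrete, here is a minimal Python sketch of a state feature function built from an MLP attribute posterior. All names (make_state_feature, fake_stop_posterior) and the toy data are illustrative, not the authors' implementation; a real system would run a trained classifier over the PLP feature window for each frame.

```python
def make_state_feature(attribute_posterior, target_label):
    """Build a state feature function s(y, x, i).

    attribute_posterior: maps one frame of observations x[i] to the MLP's
        posterior for a single attribute value (e.g. the STOP output).
    target_label: the phone label the feature is tied to (e.g. "/t/").
    """
    def s(y, x, i):
        # Non-zero only when the label at frame i matches the tied label;
        # the value is then the attribute posterior for frame i.
        return attribute_posterior(x[i]) if y[i] == target_label else 0.0
    return s

# Stand-in for an MLP's STOP-attribute posterior (hypothetical).
fake_stop_posterior = lambda frame: float(frame[0])

s_stop_t = make_state_feature(fake_stop_posterior, "/t/")

x = [[0.9], [0.2], [0.1]]    # three toy frames of "observations"
y = ["/t/", "/ae/", "/ae/"]  # one candidate label sequence
print(s_stop_t(y, x, 0))     # 0.9 -> label is /t/, so the feature fires
print(s_stop_t(y, x, 1))     # 0.0 -> label is /ae/, so the feature is zero
```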
Transition feature functions:
- Associate observations in the data at a particular time segment with the transition from the previous label into the current label
- Written t(y, y', x, i), where y is the current label, y' is the previous label, x is the observed data, and i is the time frame
- Take a non-zero value when the current label at frame i matches y, the previous label matches y', and some observation in x holds for frame i

For our model, a state feature function is a single output of our MLP phonetic attribute classifiers tied to a single label. Example:

s_j(y, x, i) = MLP_stop(x_i) * δ(y_i = /t/)

This state feature function takes the value of our MLP classifier's output for the STOP attribute when the label at time i is /t/; otherwise it takes the value zero.

Currently, transition feature functions do not use the outputs of the MLP neural networks: the value of a transition function is 1 if the label pair matches the pair defined for the function, and 0 if it does not.

Each feature function has an associated weight. A weight is high when a non-zero value of its feature function is strongly associated with a particular label, contributing a large value to the computation of that label's probability. Weights are trained by maximizing the log likelihood of the training set with respect to the model.

The strength of the CRF model is its ability to use arbitrary features as input. In traditional HMMs, dependencies among features can lead to computationally difficult models, so features are usually required to be independent. In a CRF, no independence assumption is made on the features; features can have arbitrary dependencies. The sketch below walks through the full probability computation on a toy example.
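The following self-contained Python sketch shows how these pieces fit the CRF equation: it scores a toy label sequence with weighted state and transition features, then normalizes by brute-force enumeration of all label sequences (real systems compute Z(x) with dynamic programming instead). The label set, posteriors, and weights are all made up, and each label here gets a single state feature for brevity, whereas the real model ties many attribute posteriors to each label.

```python
import itertools
from math import exp

PHONES = ["/t/", "/ae/"]  # toy label set
N = 3                     # toy number of frames

# Toy per-frame attribute posteriors for each label (stand-ins for MLP outputs).
posteriors = {"/t/": [0.8, 0.1, 0.2], "/ae/": [0.2, 0.9, 0.8]}

# Hypothetical trained weights: lambda for state features, mu for transitions.
state_w = {"/t/": 1.5, "/ae/": 1.2}
trans_w = {(a, b): (0.1 if a == b else 0.3) for a in PHONES for b in PHONES}

def score(y):
    """Sum of weighted state and transition feature values (the exponent)."""
    total = sum(state_w[y[i]] * posteriors[y[i]][i] for i in range(N))
    total += sum(trans_w[(y[i - 1], y[i])] for i in range(1, N))
    return total

# Brute-force partition function Z(x): sum over all label sequences.
Z = sum(exp(score(seq)) for seq in itertools.product(PHONES, repeat=N))

y = ("/t/", "/ae/", "/ae/")
print("P(y|x) =", exp(score(y)) / Z)  # the CRF posterior for this sequence
```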
Results

- Phone-level accuracies of the CRF system were compared to those of a baseline Tandem system (Hermansky et al., 2000), which uses the outputs of the neural networks as inputs to a Hidden Markov Model system
- The Tandem system was trained with both triphone and monophone label contexts:
  - Triphone labels attach a single left and right context phone to each label, allowing a finer level of context to be used when labels are assigned; for example, the context for the phone /ae/ in the phone string /k ae t/ is different from that in /k ae p/, since the right context differs
  - Monophone labels are single phonetic labels
- CRF system results are reported for monophone labels only

Table 2 breaks the results down into three categories:
- Phone Correctness: was the correct phone hypothesized?
- Phone Accuracy: correctness penalized for overgeneration
- Phone Precision: when a phone is hypothesized, how often is it right?

TABLE 2: Phone recognition comparisons

Model         | Phone Accuracy | Phone Correct | Phone Precision | Parameters
--------------|----------------|---------------|-----------------|-----------
Tandem (mono) | 61.48%         | 63.50%        | 73.66%          | >28,000
Tandem (tri)  | 66.69%         | 72.52%        | 73.44%          | >2 million
CRF (mono)    | 65.23%         | 66.74%        | 77.66%          | ~4,500

Discussion

- The CRF system trained on monophones has accuracy results that fall between those of the monophone-trained and triphone-trained Tandem systems
- The CRF system makes many fewer insertions (extra hypothesized phones) than the Tandem systems
- The CRF system also makes many more deletions (missed phones where one should be hypothesized) than the Tandem systems
- The CRF system makes fewer hypotheses overall than either Tandem system
- The precision measurement shows how often a hypothesis is a correct one: when the CRF system makes a hypothesis, it is correct more often than the Tandem systems
- The CRF uses far fewer parameters than either of the Tandem systems, yet achieves a comparable result
- Further work has shown that combining phonetic attribute posteriors with phonetic class posteriors yields a superior result for the CRF over the Tandem systems (Morris and Fosler-Lussier, 2006)

These results suggest some means to improve the performance of the CRF system:
- Adding newly extracted attributes (such as a boundary detector) to incorporate as transition features
- Adding a penalty factor on transition weights to generate more transitions
- Adding more contextual attributes to the state features to gain some level of triphonic context

FIGURE 2: Graphical model for a CRF phone labelling of the word "that" (/dh/ /ae/ /dx/). Vectors containing the neural network outputs of phonetic attribute posteriors for each time segment, as described in Figure 1, are used as observations by the state feature functions to determine the identity of the phone in that time slice. Arcs between the phone labels indicate the transition feature functions determined by the CRF.

References

J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", in Proceedings of the 18th International Conference on Machine Learning, 2001.
H. Hermansky, D. Ellis, and S. Sharma, "Tandem connectionist feature stream extraction for conventional HMM systems", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000.
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, "Hidden Conditional Random Fields for Phone Classification", in Proceedings of Interspeech, 2005.
J. Morris and E. Fosler-Lussier, "Discriminative Phonetic Recognition with Conditional Random Fields", in HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, 2006.
M. Rajamanohar and E. Fosler-Lussier, "An evaluation of hierarchical articulatory feature detectors", in IEEE Automatic Speech Recognition and Understanding Workshop.
S. Sarawagi, "CRF package for Java".
D. Johnson et al., "ICSI QuickNet software".
S. Young et al., "HTK HMM software".

This work was supported by NSF ITR grant IIS; the opinions and conclusions expressed in this work are those of the authors and not of any funding agency.