
Discriminative Phonetic Recognition with Conditional Random Fields
Jeremy Morris & Eric Fosler-Lussier
The Ohio State University Speech & Language Technologies Lab
HLT-NAACL 2006 Computationally Hard Problems and Joint Inference in Speech and Language Processing Workshop
June 9, 2006

2 Introduction
Conditional Random Fields (CRFs) offer some benefits over traditional HMM models for sequence labeling:
– They directly model the posterior probability of a label sequence given an observation sequence
– They make no independence assumptions about the observations
This lack of an independence assumption makes CRFs an attractive model for speech recognition. We are interested in combining arbitrary speech attributes to build a hypothesis of the observed speech.
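For reference, the linear-chain CRF models the label posterior directly. In the standard formulation of Lafferty et al. (2001), with state feature functions s_i and transition feature functions f_j:

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t}\sum_{i} \lambda_i\, s_i(y_t, \mathbf{x}, t)
          + \sum_{t}\sum_{j} \mu_j\, f_j(y_{t-1}, y_t, \mathbf{x}, t) \Big)
```

Here Z(x) is the per-utterance normalizer; in the system described below, the MLP posteriors serve as the state feature functions.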

3–5 Speech Attributes
Two different types of speech attributes:
– Phone classes are trained to indicate when a particular timeslice of speech is a particular phone (e.g. /t/, /v/, etc.)
– Phonological feature classes are trained to indicate when a particular timeslice of speech exhibits a particular phonological feature
Examples:
/t/ — Manner: stop; Place of articulation: dental; Voicing: unvoiced
/d/ — Manner: stop; Place of articulation: dental; Voicing: voiced
/iy/ — Height: high; Backness: front; Roundness: nonround
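As a hedged illustration of how such an attribute inventory might be represented in code (the feature values mirror the slide's examples; the dictionary layout and helper function are ours, not the authors'):

```python
# Hypothetical phone-to-phonological-feature mapping, mirroring the
# slide's examples; the paper's full inventory has 43 feature classes.
PHONE_FEATURES = {
    "/t/":  {"manner": "stop", "place": "dental", "voicing": "unvoiced"},
    "/d/":  {"manner": "stop", "place": "dental", "voicing": "voiced"},
    "/iy/": {"height": "high", "backness": "front", "roundness": "nonround"},
}

def shared_features(phone_a: str, phone_b: str) -> set:
    """Feature dimensions on which two phones agree."""
    a, b = PHONE_FEATURES[phone_a], PHONE_FEATURES[phone_b]
    return {dim for dim in a if dim in b and a[dim] == b[dim]}

print(shared_features("/t/", "/d/"))  # {'manner', 'place'}
```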

6 Speech Attributes (cont.)
Attribute classifiers are trained using MLP neural networks that emit posterior probabilities P(attribute | acoustics):
– These posteriors can also be viewed as indicator functions for the given classes
– The outputs are highly correlated with each other
We want to combine the observations given by these indicator functions to get a hypothesis for the speech; a minimal classifier sketch follows.
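A minimal sketch of such an attribute classifier, assuming PyTorch and illustrative layer sizes (the original MLPs were trained with different tooling; everything here is an assumption except the softmax posterior output):

```python
import torch
import torch.nn as nn

class AttributeMLP(nn.Module):
    """Per-frame attribute classifier emitting P(attribute | acoustics)."""
    def __init__(self, n_inputs=39, n_hidden=100, n_classes=61):
        super().__init__()
        # Hypothetical sizes: 39-dim acoustic frame, 100 hidden units,
        # 61 phone classes (matching the "all 61" condition in the results).
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.Sigmoid(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, frames):
        # Each softmax row is a per-frame posterior distribution.
        return torch.softmax(self.net(frames), dim=-1)

mlp = AttributeMLP()
posteriors = mlp(torch.randn(8, 39))  # 8 frames -> (8, 61) posteriors
assert torch.allclose(posteriors.sum(dim=-1), torch.ones(8), atol=1e-5)
```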

7 Tandem Systems
Tandem systems are HMM-based systems that use neural network outputs as acoustic features (Hermansky and Ellis, 2000):
– The neural network outputs are used to train an HMM
– The HMM's Gaussian observation models assume the feature dimensions are independent of each other
– The features are therefore decorrelated through principal components analysis (PCA) before training and testing (a preprocessing sketch follows this slide)
[Figure: HMM with states /k/ and /iy/ generating observations X derived from the MLP posteriors Pr(attr|X)]
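A minimal sketch of the Tandem preprocessing step, assuming log-posteriors are taken before PCA (standard in Tandem systems) and illustrative dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 1000 frames of 104-dim attribute posteriors
# (104 = phones + phonological features, as in the "Combined" condition).
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(104), size=1000)

log_post = np.log(posteriors + 1e-10)       # compress the skewed posterior range
pca = PCA(n_components=39)                  # keep top 39 components ("top 39")
tandem_feats = pca.fit_transform(log_post)  # decorrelated features for the HMM
print(tandem_feats.shape)                   # (1000, 39)
```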

8 CRF System
We implement a CRF model using the neural network outputs as state feature functions:
[Figure: linear-chain CRF with labels /k/ and /iy/, with the MLP posteriors Pr(attr|X) as state feature inputs]
– We compare the results to a Tandem system trained on the same features
– No PCA decorrelation is performed on the CRF inputs
A minimal decoding sketch follows.
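As a hedged sketch of decoding with posteriors as state features (random placeholder weights stand in for trained CRF parameters; this is our illustration, not the authors' code):

```python
import numpy as np

def viterbi(state_scores, trans_scores):
    """Best label path given per-frame state scores (T x K) and
    label-pair transition scores (K x K)."""
    T, K = state_scores.shape
    delta = state_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans_scores  # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + state_scores[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# State score for label y at frame t: a weighted sum of the raw
# (non-decorrelated) MLP posteriors, sum_i lambda[y, i] * P(attr_i | x_t).
rng = np.random.default_rng(1)
posteriors = rng.dirichlet(np.ones(104), size=50)  # 50 frames x 104 attrs
lam = rng.normal(size=(61, 104))                   # placeholder weights
state_scores = posteriors @ lam.T                  # 50 x 61
trans = rng.normal(size=(61, 61))                  # placeholder transitions
print(viterbi(state_scores, trans)[:10])
```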

9 Phone Accuracy Results

                           Tandem (triphones)   CRF (monophones)
Phones (all 61)            67.32%               66.89%
Phon. Features (all 43)    66.85%               63.84%
Combined (all 104)         66.80%               67.87%
Combined (top 39)          67.85%               —

(top 39 = top 39 PCA components of the combined features; Tandem system only)

10 Discussion
The CRF model is much more conservative in its hypothesis generation than the Tandem model:
– Many fewer insertions, many more deletions
– All-features CRF: 6500 deletions, 731 insertions
– All-features Tandem (top 39): 3184 deletions, 2511 insertions
– The label state space of the Tandem model is much larger than that of the CRF
Transition information is currently unused:
– Adding transition feature functions built on observed data may improve results (one possible data-driven construction is sketched below)
The benefit of this model over the traditional Tandem model is that arbitrary features can be easily added:
– We want to explore adding arbitrary features to the model to see how performance changes (e.g. speaking rate, stress, pitch, etc.)
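One hedged illustration of what a transition feature "built on observed data" could look like: smoothed log phone-bigram scores estimated from training label sequences (our construction, not the authors' specification):

```python
import numpy as np

def bigram_transition_scores(label_seqs, n_labels, alpha=1.0):
    """Smoothed log bigram scores over observed label sequences; one
    possible data-driven transition feature for the CRF."""
    counts = np.full((n_labels, n_labels), alpha)  # add-alpha smoothing
    for seq in label_seqs:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev, cur] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

scores = bigram_transition_scores([[0, 1, 1, 2], [2, 1, 0]], n_labels=3)
print(scores.round(2))
```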

11 [Figure: Spectrogram of the phrase "that experience", shown with phonetic labels and corresponding neural network posterior distributions over each phonetic class]

12 Phone Precision Results

                           Tandem (triphones)   CRF (monophones)
Phones                     73.93%               79.22%
Phon. Features             73.62%               76.61%
Combined                   73.33%               79.49%
Combined (top 39)          74.43%               —