OSU ASAT Status Report
Jeremy Morris, Yu Wang, Ilana Bromberg, Eric Fosler-Lussier, Keith Johnson
13 October 2006



Personnel changes
Jeremy and Yu are not currently on the project
– Jeremy is being funded on the AFRL/DAGSI project "Lexicon learning from orthography"; however, he is continuing to help in his spare time
– Yu is currently in transition
New student (to some!): Ilana Bromberg
– Technically funded as of 10/1, but ran some experiments for an ICASSP paper in September
– Still sorting out her project for this year

Future potential changes
May transition in another student in WI 06
– Carry on further with some of Jeremy's experiments

What's new?
First pass on the parsing framework
– Last time: talked about different models (Naïve Bayes, Dirichlet modeling, MaxEnt models)
– This time: settled on the Conditional Random Fields framework
  Monophone CRF phone recognition beats triphone HTK recognition using attribute detectors
  Ready for your input!
More boundary work
– Small improvements seen when integrating boundary information into HMM recognition
– Still to be seen whether it helps CRFs

Parsing
Desired: the ability to combine the output of multiple, correlated attribute detectors to produce
– Phone sequences
– Word sequences
Handle both semi-static & dynamic events
– Traditional phonological features
– Landmarks, boundaries, etc.
CRFs are a good bet for this

Conditional Random Fields
A form of discriminative modelling
– Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
Processes evidence bottom-up
– Combines multiple features of the data
– Builds the probability P(sequence | data), i.e. the conditional probability of the label sequence given the data
Minimal assumptions about the input
– Inputs don't need to be decorrelated (cf. diagonal-covariance HMMs)

Conditional Random Fields
CRFs are based on the idea of Markov Random Fields
– Modelled as an undirected graph connecting labels with observations
– Observations in a CRF are not modelled as random variables
[Figure: label sequence /k/ /iy/ linked to observations X X X X X]
Transition functions add associations between transitions from one label to another
State functions help determine the identity of the state

Conditional Random Fields
The Hammersley-Clifford theorem states that a random field is an MRF iff its distribution can be written as an exponential whose exponent is the sum of the clique potentials of the undirected graph
One possible state feature function for our attributes and labels: f([x is stop], /t/)
– One possible (strong) weight value for this state feature: λ = 10
One possible transition feature function: g(x, /iy/, /k/), indicating /k/ followed by /iy/
– One possible weight value for this transition feature: μ = 4
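The exponential form this slide refers to is presumably the standard linear-chain CRF distribution (a reconstruction in the notation of Lafferty et al., with λ the state-feature weights and μ the transition-feature weights):

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\;
  \frac{1}{Z(\mathbf{x})}
  \exp\!\Bigg(
    \sum_{t} \sum_{i} \lambda_i \, f_i(y_t, \mathbf{x}, t)
    \;+\;
    \sum_{t} \sum_{j} \mu_j \, g_j(y_{t-1}, y_t, \mathbf{x}, t)
  \Bigg)
```

Here Z(x) is the normalizer obtained by summing the exponential over all possible label sequences, which is what makes the model a proper conditional distribution.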

Conditional Random Fields
Conceptual overview
– Each attribute of the data we are trying to model fits into a feature function that associates the attribute with a possible label
  A positive value if the attribute appears in the data
  A zero value if the attribute is not in the data
– Each feature function carries a weight that gives the strength of that feature function for the proposed label
  High positive weights indicate a good association between the feature and the proposed label
  High negative weights indicate a negative association between the feature and the proposed label
  Weights close to zero indicate the feature has little or no impact on the identity of the label
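The weighted-feature-function idea above can be sketched in a few lines. This is a toy illustration, not the project's Java implementation; the attribute names and weight values are invented for the example.

```python
import math

# Each (attribute, label) pair is a state feature function; its weight
# encodes the strength of the association (values here are illustrative).
state_weights = {
    ("stop", "/t/"): 10.0,    # strong positive association
    ("stop", "/iy/"): -8.0,   # strong negative association
    ("voiced", "/iy/"): 6.0,
}
# Transition feature functions associate consecutive label pairs.
trans_weights = {
    ("/k/", "/iy/"): 4.0,     # /k/ followed by /iy/
}

def unnormalized_score(labels, frames):
    """Sum weighted feature functions over a label sequence.

    frames: list of sets of attributes detected at each frame.
    Returns exp(sum of weights); normalizing over all label
    sequences would turn this into the CRF probability."""
    score = 0.0
    for t, (label, attrs) in enumerate(zip(labels, frames)):
        for a in attrs:
            score += state_weights.get((a, label), 0.0)
        if t > 0:
            score += trans_weights.get((labels[t - 1], label), 0.0)
    return math.exp(score)
```

With a "stop" attribute detected in the first frame, the labeling /t/ /iy/ outscores /k/ /iy/, exactly the behavior the weights encode.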

Experimental Setup
Attribute detectors
– ICSI QuickNet neural networks
Two different types of attributes
– Phonological feature detectors
  Place, manner, voicing, vowel height, backness, etc.
  Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
– Phone detectors
  Neural network outputs based on the phone labels – one output per label
Classifiers were applied to 2960 utterances from the TIMIT training set

Experimental Setup
Outputs from the neural nets are themselves treated as feature functions for the observed sequence – each attribute/label combination gives us a value for one feature function
– Note that this makes the feature functions non-binary, unlike most NLP uses of CRFs
– Along the lines of Gaussian-based CRFs (e.g., Microsoft)
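The non-binary feature functions described here might look like the following sketch: instead of a 0/1 indicator, the feature function returns the detector's posterior whenever the candidate label matches. The attribute names and posterior values are illustrative assumptions.

```python
def state_feature(attr, label):
    """Build a real-valued state feature function for (attr, label).

    The returned function fires (with the NN posterior as its value)
    only when the candidate label matches, mirroring the
    attribute/label pairing described on the slide."""
    def f(candidate_label, frame_posteriors):
        # frame_posteriors: dict mapping attribute -> NN posterior
        if candidate_label == label:
            return frame_posteriors.get(attr, 0.0)
        return 0.0
    return f

# Hypothetical example: the "stop"/t/ feature from the earlier slide.
f_stop_t = state_feature("stop", "/t/")
posteriors = {"stop": 0.92, "voiced": 0.15}
```

For label /t/ this feature contributes 0.92 (the detector's confidence) rather than a hard 1, which is the difference from the binary features typical in NLP CRFs.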

Experiment 1
Goal: implement a Conditional Random Field model on ASAT-style phonological feature data
– Perform phone recognition
– Compare results to those obtained via a Tandem HMM system

Experiment 1 - Results

Model                  Phone Accuracy   Phone Correct
Tandem [monophone]     61.48%           63.50%
Tandem [triphone]      66.69%           72.52%
CRF [monophone]        65.29%           66.81%

A CRF system trained on monophones with these features achieves accuracy superior to the monophone HMM
– The CRF comes close to achieving triphone HMM accuracy
– The CRF uses many fewer parameters

Experiment 2
Goals:
– Apply the CRF model to phone classifier data
– Apply the CRF model to combined phonological feature classifier data and phone classifier data
Perform phone recognition
Compare results to those obtained via a Tandem HMM system

Experiment 2 - Results

Model                          Phone Acc   Phone Correct
Tandem [mono] (phones)         60.48%      63.30%
Tandem [tri] (phones)          67.32%      73.81%
CRF [mono] (phones)            66.89%      68.49%
Tandem [mono] (phones/feas)    61.78%      63.68%
Tandem [tri] (phones/feas)     67.96%      73.40%
CRF [mono] (phones/feas)       68.00%      69.58%

Note that the Tandem HMM result is the best result using only the top 39 features following a principal components analysis

Experiment 3
Goal:
– Previous CRF experiments used phone posteriors for the CRF, and linear outputs transformed via a Karhunen-Loève (KL) transform for the HMM system
  This transformation is needed to improve HMM performance by decorrelating the inputs
– Using the same linear outputs as the HMM system, do our results change?
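The KL transform used to decorrelate the HMM inputs is, in this discrete setting, a projection onto the eigenvectors of the data's covariance matrix (equivalent to PCA). A minimal numpy sketch, with a hypothetical matrix X of per-frame linear NN outputs:

```python
import numpy as np

def kl_transform(X, keep=None):
    """Karhunen-Loeve transform: project X (frames x dims) onto the
    eigenvectors of its covariance matrix, decorrelating dimensions.

    keep: optionally retain only the top-variance dimensions
    (e.g. keep=39, as in the Tandem HMM setup on the slides)."""
    Xc = X - X.mean(axis=0)                  # center each dimension
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    W = eigvecs[:, order]
    if keep is not None:
        W = W[:, :keep]
    return Xc @ W

# Illustrative data standing in for the NN linear outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
Y = kl_transform(X)
```

After the transform, the covariance of Y is diagonal, which is exactly the property diagonal-covariance HMMs need and CRFs do not.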

Experiment 3 - Results

Model                           Phone Accuracy   Phone Correct
CRF (phones) posteriors         67.27%           68.77%
CRF (phones) linear KL          66.60%           68.25%
CRF (phones) post. + linear     68.18%           69.87%
CRF (features) posteriors       65.25%           66.65%
CRF (features) linear KL        66.32%           67.95%
CRF (features) post + linear    66.89%           68.48%
CRF (features) linear (no KL)   65.89%           68.46%

Also shown: adding both feature sets together, giving the system supposedly redundant information, leads to a gain in accuracy

Experiment 4
Goal:
– Previous CRF experiments did not allow for realignment of the training labels
  Label boundaries provided by the TIMIT hand transcribers were used throughout training
  HMM systems are allowed to shift boundaries during EM learning
– If we allow for realignment in our training process, can we improve the CRF results?

Experiment 4 - Results

Model                       Phone Accuracy   Phone Correct
Tandem [tri] (phones)       67.32%           73.81%
CRF (phones) no realign     67.27%           68.77%
CRF (phones) realign        69.63%           72.40%
Tandem [tri] (features)     66.69%           72.52%
CRF (features) no realign   65.25%           66.65%
CRF (features) realign      67.52%           70.13%

Allowing realignment gives accuracy results for a monophone-trained CRF that are superior to a triphone-trained HMM, with fewer parameters

Code status
Current version: Java-based, multithreaded
– TIMIT training takes a few days on an 8-processor machine
At test time, the CRF generates an AT&T FSM lattice
– Use the AT&T FSM tools to decode
– Will (hopefully) make it easier to decode words
Code is stable enough to try different kinds of experiments quickly
– Ilana joined the group and ran an experiment within a month

Joint models of attributes
Monica's work showed that modeling attribute detection with joint detectors worked better
– e.g., modeling manner/place jointly is better
– cf. Chang et al.: hierarchical detectors work better
This study: can we improve phonetic attribute-based detection by using phone classifiers and summing?
– Phone classifier: the ultimate joint modeling

Independent vs Joint Feature Modeling
Baseline 1:
– 61 phone posteriors (joint modeling)
Baseline 2:
– 44 feature posteriors (independent modeling)
Experiment:
– Feature posteriors derived from the 61 phone posteriors
– In each frame: the weight for each feature is the summed weight of each phone exhibiting that feature
  e.g., P(stop) = P(/p/) + P(/t/) + P(/k/) + P(/b/) + P(/d/) + P(/g/)
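The summing rule above is straightforward to sketch. The phone-to-feature map below is a tiny illustrative fragment, not the full 61-phone inventory used in the experiment.

```python
# Map each phonological feature to the phones that exhibit it
# (illustrative fragment only).
FEATURE_PHONES = {
    "stop": ["/p/", "/t/", "/k/", "/b/", "/d/", "/g/"],
    "voiced": ["/b/", "/d/", "/g/", "/iy/"],
}

def feature_posteriors(phone_post):
    """Derive per-frame feature posteriors from phone posteriors:
    P(feature) = sum of P(phone) over phones exhibiting the feature."""
    return {
        feat: sum(phone_post.get(p, 0.0) for p in phones)
        for feat, phones in FEATURE_PHONES.items()
    }

# One hypothetical frame of phone posteriors.
frame = {"/p/": 0.1, "/t/": 0.5, "/b/": 0.2, "/iy/": 0.1}
```

For this frame, P(stop) = 0.1 + 0.5 + 0.2 = 0.8 and P(voiced) = 0.2 + 0.1 = 0.3, exactly the summation the slide describes.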

Results: Joint vs Independent Modeling

[Table: Posterior Type | Number of Weights | Accuracy | % Correct, with rows for Phonemes, Features, and Features derived from Phonemes; the numeric values did not survive the transcript]

Removal of Feature Classes

Results of Feature Class Removal

Continued work on phone boundary detection
Basic idea: eventually we want to use these as transition functions in the CRF
– The CRF was still under development when this study was done
– Added features corresponding to P(boundary | data) to the HMM

Phone Boundary Detection
Input features: phonological features, acoustic features (PLP), and phone classifier outputs
Classification methods: MLP, metric-based method
Evaluation and Results
– Using phonological features as the input representation was modestly better than using the phone posterior estimates themselves
– Phonological feature representations also seemed to edge out direct acoustic representations, though phonological feature MLPs are more complex to train
– The nonlinear representations learned by the MLP were better for boundary detection than metric-based methods

How to incorporate phone boundaries, estimated by a multi-layer perceptron (MLP), into an HMM system

Five-state HMM phone model to capture boundary information
To integrate phone boundary information into speech recognition, the phone boundary information was concatenated to the MFCCs as additional input features. We explicitly modeled the entering and exiting states of a phone as separate, one-frame distributions. The two additional boundary states are intended to catch phone-boundary transitions, while the three self-looped states in the center model phone-internal information. Escape arcs are also included to bypass the boundary states for short phones.
[Figure: proposed five-state HMM phone model]

Experiments
For simplicity, the linear outputs from the PLP+MLP detector were used as the phone boundary features, instead of those from the features+MLP detector. Several experiments were conducted:
0) Baseline system: standard 39 MFCCs
1) MFCCs + phone boundary features (no KLT)
2) MFCCs + phone boundary features, decorrelated using the Karhunen-Loève transformation (KLT)
3) MFCCs + phone boundary features, with a KL transformation over all features
4) MFCCs (KLTed), to show the effect of the KL transformation on the MFCCs alone
Training and recognition were conducted with the HTK toolkit on the TIMIT data set. When reaching the 4-mixture stage, some experiments failed due to data sparsity; we adopted a hybrid 2/4-mixture strategy, promoting triphones to 4 mixtures when the data was sufficient.
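The transition structure described above can be sketched as a transition matrix in the HTK convention, with non-emitting entry and exit states. The probability values below are illustrative placeholders, not the trained values from the experiments.

```python
import numpy as np

# 7x7 transition matrix in HTK style: state 0 = non-emitting entry,
# state 1 = entering-boundary state (exactly one frame, no self-loop),
# states 2-4 = self-looped phone-internal states, state 5 =
# exiting-boundary state (exactly one frame), state 6 = non-emitting
# exit. Escape arcs 0->2 and 4->6 bypass the boundary states so that
# short phones are not forced through them.
A = np.zeros((7, 7))
A[0, 1], A[0, 2] = 0.9, 0.1                 # enter via boundary, or escape
A[1, 2] = 1.0                               # boundary-in lasts one frame
A[2, 2], A[2, 3] = 0.6, 0.4                 # internal states self-loop
A[3, 3], A[3, 4] = 0.6, 0.4
A[4, 4], A[4, 5], A[4, 6] = 0.6, 0.3, 0.1   # exit via boundary, or escape
A[5, 6] = 1.0                               # boundary-out lasts one frame
```

Because states 1 and 5 have no self-loop, the boundary distributions each account for exactly one frame, which is how the model "attunes" the HMM to phone-boundary transitions.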

Results & Conclusion

Results
The proposed 5-state HMM models performed better than their conventional 3-state counterparts on all training datasets. Decorrelation improved recognition accuracy on binary boundaries, and including the MFCCs in the decorrelation improved recognition further. For comparison, several experiments were also conducted on a 5-state HMM with a traditional, left-to-right, all-self-loops transition matrix; the results showed vastly increased deletions, indicating a bias against short-duration phones, whereas the proposed model is balanced between insertions and deletions. Recently, I modified the decision-tree questions in the tied-state triphone step and pushed the model to 16-mixture Gaussians; part of those results are also shown in the table.

Inputs                       5-state phone recognition acc.   3-state phone recognition acc.
Baseline: MFCC               [not preserved]                  [not preserved]
1) MFCC + Boundaries         63.41%                           62.37%
2) MFCC + KLT(Boundaries)    [not preserved]                  [not preserved]
3) KLT(MFCC + Boundaries)    63.78% (mix)                     62.47% (mix)
4) KLT(MFCC)                 [not preserved]                  [not preserved]

Conclusion
Phonological features perform better than acoustic features as inputs to phone boundary classifiers; the results suggest that pattern changes in the phonological feature space may lead to robust boundary detection. By exploring the potential space of boundary representations, we argue that phonetic transitions are very important for automatic speech recognition, and that HMMs can be attuned to phone-boundary transitions by explicitly modeling phone transition states. The combined strategy of binary boundary features, KLT, and the 5-state representation gives almost a 2% absolute improvement in phone recognition. Considering that the boundary information we integrated is one of the simplest representations, this result is rather encouraging. In future work, we hope to integrate phone boundary information as additional features to the CRF.