Modeling Speech using POMDPs


Modeling Speech using POMDPs
In this work we apply a new model, the POMDP, in place of the traditional HMM to acoustically model the speech signal. We use state-of-the-art techniques to build and decode the new model, and we demonstrate improved recognition results on a small data set.

Description of a POMDP
A Markov Decision Process (MDP) is a mathematical formalization of problems in which a decision maker, an agent, must choose the actions that maximize its expected reward as it interacts with its environment. MDPs have been used to model an agent's behavior in
–planning problems
–robot navigation problems
In a fully observable MDP the agent always knows precisely what state it is in.
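
To make the MDP formalism concrete, here is a minimal sketch of value iteration on a toy three-state chain. The states, actions, and rewards are purely illustrative and are not the speech model described in these slides.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """P[s][a] -> list of (prob, next_state); R[s][a] -> immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best action's expected discounted return
            best = max(R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy chain: state 2 is absorbing and pays reward 1 per step.
states = [0, 1, 2]
actions = ["stay", "go"]
P = {0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]},
     1: {"stay": [(1.0, 1)], "go": [(1.0, 2)]},
     2: {"stay": [(1.0, 2)], "go": [(1.0, 2)]}}
R = {0: {"stay": 0.0, "go": 0.0},
     1: {"stay": 0.0, "go": 0.0},
     2: {"stay": 1.0, "go": 1.0}}
V = value_iteration(states, actions, P, R)
```

With discount 0.9 the absorbing state converges to value 1 / (1 - 0.9) = 10, and each step away discounts that by 0.9.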

If an agent cannot determine its state, its world is said to be partially observable. In such a situation we use a generalization of the MDP called a Partially Observable Markov Decision Process (POMDP).
POMDP vs. HMM
–differs from an HMM: there are multiple transitions between two states, representing actions, and a reward is added to each state
–as with an HMM, you do not know which state you are in
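
Because a POMDP agent never knows its state exactly, it maintains a belief, a probability distribution over states, updated after each action and observation. A minimal sketch with toy two-state numbers (all probabilities here are hypothetical, not taken from the slides):

```python
def belief_update(belief, action, obs, T, O):
    """b'(s') proportional to O(obs | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(belief)
    new_b = {}
    for s2 in states:
        # predict: probability mass flowing into s2 under the chosen action
        predicted = sum(T[s][action].get(s2, 0.0) * belief[s] for s in states)
        # correct: reweight by how well s2 explains the observation
        new_b[s2] = O[s2][action].get(obs, 0.0) * predicted
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

# Toy model: A tends to stay, B is absorbing; A mostly emits "x".
T = {"A": {"act": {"A": 0.7, "B": 0.3}}, "B": {"act": {"B": 1.0}}}
O = {"A": {"act": {"x": 0.9, "y": 0.1}}, "B": {"act": {"x": 0.1, "y": 0.9}}}
b = belief_update({"A": 0.5, "B": 0.5}, "act", "x", T, O)
```

Seeing "x" shifts the belief strongly toward state A, since A emits "x" far more often than B.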

POMDP in Speech
As with HMMs
–left-to-right topology with 3 to 5 states
–states represent pronunciation tasks: the beginning, middle, and end of a phoneme
–observed acoustic features are associated with each state
–randomness in state transitions still accounts for time stretching within the phoneme: short, long, or hurried pronunciations
–randomness in the observations still accounts for the variability in pronunciations
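
The left-to-right topology with self-loops can be sketched as a transition matrix; the self-loop probability of 0.6 below is an arbitrary illustrative value, not one used in the slides.

```python
def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Each state either loops on itself (time stretching: long or hurried
    pronunciations) or steps one state to the right; no skips, no backtracking."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i < n_states - 1:
            A[i][i] = self_loop          # stay in the same pronunciation task
            A[i][i + 1] = 1.0 - self_loop  # advance: beginning -> middle -> end
        else:
            A[i][i] = 1.0                # final state absorbs
    return A

A = left_to_right_transitions()
```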

Differs from HMMs
–In theory: model all possible context classes (an infinite number), and all contexts of a particular context class
–In practice: model three context classes (triphone, biphone, monophone), and all contexts of a particular context class
–Use the actions of our model to represent context
[Figure: a phoneme model with beginning, middle, and end states]

Training a POMDP
We train each context class independently on the same training data
–treated as HMM models trained using standard EM
We then collect all context models for each phoneme over the different context classes and combine them into a single, unified POMDP model
–we label each action with both the context and the context class that the particular HMM model belongs to
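
A hypothetical sketch of the combination step: per-context HMMs, trained independently with EM, are gathered under one POMDP whose actions are labelled with the (context class, context) pair they came from. The dictionary layout and names are illustrative, not the authors' implementation.

```python
def combine_into_pomdp(phoneme, hmms_by_class):
    """hmms_by_class: {context_class: {context: trained_hmm_params}}."""
    pomdp = {"phoneme": phoneme, "actions": {}}
    for cls, models in hmms_by_class.items():
        for ctx, hmm in models.items():
            # each action carries both the context and its context class,
            # so decoding can tell which model a transition belongs to
            pomdp["actions"][(cls, ctx)] = hmm
    return pomdp

# Toy combination for the phoneme "ow" (HMM parameters stubbed as strings).
hmms = {"triphone": {"t-ow+m": "hmm1"},
        "biphone": {"ow+m": "hmm2"},
        "monophone": {"ow": "hmm3"}}
model = combine_into_pomdp("ow", hmms)
```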

Decoding a POMDP
We examine three decoding strategies based on Viterbi:
–Uniform Mixed Model (UMM) Viterbi
–Weighted Mixed Model (WMM) Viterbi
–Cross-Context Mixed Model (CMM) Viterbi

UMM Viterbi
From the Viterbi point of view
–Add all context classes to the mix and allow Viterbi to choose the best path through the entire search space
–Relax the context rules by matching up all partial-context phonemes
–wild-card all monophones to match up with all biphones and triphones sharing the same center phone
–wild-card all biphones to match up with triphones whose other context they share
–Add a class weight, W_c, to each context class c, applied to each model as we enter it
From the POMDP point of view
–the model constrains actions: we add the constraint that we must leave a state with the same action by which we entered it
–this ensures the model's context, as in an HMM

–relax the constraint by allowing the choice of different context classes within the model
–this differs from an HMM
–the class weight is a reward given at the start state for entering the model
[Figure: Viterbi expansion of "tomato", which has two pronunciations, "t-ow-m-ey-t-ow" and "t-ow-m-aa-t-ow", shown for (a) standard Viterbi and (b) UMM Viterbi]
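
All three decoding strategies build on the standard Viterbi algorithm. Below is a minimal log-space version on a toy two-state model; the class-weight and wild-card machinery of UMM Viterbi would be layered on top of this base, and all probabilities are illustrative.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an observation sequence, in log space."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, bp = V[-1], {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + log_trans[r][s])
            col[s] = prev[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            bp[s] = best_prev
        V.append(col)
        back.append(bp)
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for bp in reversed(back):   # follow backpointers to recover the path
        path.append(bp[path[-1]])
    path.reverse()
    return path, V[-1][last]

lg = math.log
states = ["A", "B"]
log_start = {"A": lg(0.9), "B": lg(0.1)}
log_trans = {"A": {"A": lg(0.7), "B": lg(0.3)},
             "B": {"A": lg(0.3), "B": lg(0.7)}}
log_emit = {"A": {"x": lg(0.9), "y": lg(0.1)},
            "B": {"x": lg(0.2), "y": lg(0.8)}}
path, score = viterbi(["x", "y", "y"], states, log_start, log_trans, log_emit)
```

For the observation sequence x, y, y the best path starts in A (which favors x) and switches to B (which favors y).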

WMM Viterbi
Similar to UMM Viterbi, except that we now weigh each context model of each context class individually, based on frequency counts of its occurrence in the training data:
  w_c^m = L_c + min(f_c^m / K_c, 1) * (W_c - L_c)
where
–f_c^m is the frequency count for model m of context class c
–L_c is the lower bound for context class c
–W_c is the upper bound for context class c
–K_c is the frequency-count cutoff threshold for context class c
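
The formula interpolates between the bounds: models never seen in training get the lower bound L_c, and models seen at least K_c times are capped at the upper bound W_c. A direct transcription (the example counts and bounds below are illustrative):

```python
def model_weight(f, L, W, K):
    """WMM model weight: w_c^m = L_c + min(f_c^m / K_c, 1) * (W_c - L_c)."""
    return L + min(f / K, 1.0) * (W - L)

# Illustrative settings: lower bound 0, upper bound 5, count cutoff 90.
w_unseen = model_weight(0, 0.0, 5.0, 90)     # rare model -> lower bound
w_half = model_weight(45, 0.0, 5.0, 90)      # halfway to the cutoff
w_at_cutoff = model_weight(90, 0.0, 5.0, 90)
w_frequent = model_weight(900, 0.0, 5.0, 90) # capped at the upper bound
```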

CMM Viterbi
Similar to WMM Viterbi, except that our POMDP model now relaxes the constraint on actions
–allows cross-model jumps
–jumps are weighted by the model weight w_c^m
–the constraint is relaxed to a sub-class of context models: models can jump between a triphone and the associated biphone and monophone whose partial context they share

Various strategies for relaxing the cross-model jump constraints
–Maximum cross context: for each cross-context model jump, add the weight to the likelihood score and choose the jump that yields the highest score
–Expanded cross context: choose all context model jumps at every state, adding the weight to the likelihood score of each jump
–Restricted forms of both Maximum and Expanded: add the constraint that once we choose a lower-order context class model, we cannot go back to a higher-order context class model; we stay within our own class or a lower one
–the idea is to abandon higher-order models that perform poorly
[Figure: jumps between the triphone t-ow+m, the biphone ow+m, and the monophone ow]
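
The restricted variants amount to an ordering check over context classes: a jump is legal only if the target class is at or below the current one. A hypothetical sketch (the ordering values are illustrative):

```python
# Order context classes by how much context they model.
CLASS_ORDER = {"triphone": 3, "biphone": 2, "monophone": 1}

def jump_allowed(current_cls, target_cls, restricted=True):
    """Restricted variant: once in a lower-order class, never jump back up;
    stay within the current class or a lower one."""
    if not restricted:
        return True   # Maximum/Expanded without the restriction
    return CLASS_ORDER[target_cls] <= CLASS_ORDER[current_cls]
```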

Experiments
We tested our model on the TIMIT data set:
–TIMIT: read English sentences
–45 phonemes, ~8000-word dictionary
–3 hours of training data: 3869 utterances by 387 speakers
–6 minutes of decoding data: 110 utterances by 11 speakers, independent of the training data
–trigram language model built from the training data and an outside source (OGI: Stories and NatCell)

We found the best system configuration for the corpus.
–created 16-mixture SCTM models for each HMM context class using the ISIP prototype system (v 5.10)
–ran baselines for all 3 HMM models
[Table: baseline WER and accuracy on TIMIT for the triphone, biphone, and monophone models; the numeric values were not preserved in this transcript]

Results
Results for all three modified Viterbi algorithms are similar to those on the development set. The POMDP model shows robustness to different test sets
–it is not tuned to the data
[Table: TIMIT results per Viterbi variant, listing L_c / W_c / K_c settings for each context class with WER and accuracy. Recoverable settings: UMM uses W_c only (triphone 5, biphone 100, monophone 10); WMM uses triphone 0/5/90 and biphone 25/100/90; CMM uses triphone 0/5/90 and biphone 45/100/90; the remaining settings and the WER and accuracy values were not preserved in this transcript]

Future Work
Apply the new model to a larger data set.
Find a better method to generate the individual context model weights
–e.g., the linear interpolation and backoff techniques used in language modeling
Find a better method for adjusting the overall POMDP model context class weights for the various decoding strategies
–the current method of experimentation is inefficient
For CMM Viterbi, find better ways to constrain cross-model jumps outside of partial context classes
–use a technique similar to the linguistic information used in tying mixtures at the state level