Hierarchical Multi-Stream Posterior Based Speech Recognition System


Hierarchical Multi-Stream Posterior Based Speech Recognition System
Hamed Ketabdar, Herve Bourlard and Samy Bengio
IDIAP Research Institute, Martigny, Switzerland
MLMI 2005 Workshop, Edinburgh, UK

Main idea (1)
Estimating more informative posteriors by taking into account:
- Prior knowledge about the problem (e.g. phone transition probabilities, minimum phone durations)
- Contextual information
- Multiple streams of features

Main idea (2)
A principled approach towards using posteriors in hierarchical structures:
- Dividing the problem into multi-level sub-problems. In each layer:
  - Integrating relevant prior and contextual information
  - Combining different kinds of features

Posterior based speech recognition systems
Posterior estimation:
- Based on local features
- Without taking into account prior knowledge about the problem
Posterior usage:
- As features for a standard HMM/GMM recognizer (e.g. Tandem)
- As local scores for a decoder (e.g. hybrid HMM/ANN system)
Speaker notes: The discussion is phrased in terms of phonemes, but the approach and the issues are general and apply to many pattern recognition problems. As a review of state-of-the-art posterior based speech recognition systems: the posteriors are usually estimated using MLPs. Most of the time the MLP only sees the slice of the speech signal represented by the current feature vector. It also knows nothing about prior knowledge related to the problem, for example the minimum duration over which a phoneme can appear, or the legal sequences of phonemes. It only sees the current frame, possibly concatenated with a small number of neighbouring frames, and decides about phoneme posteriors based on that knowledge.
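For reference (not on the original slide), the standard way hybrid HMM/ANN systems turn MLP posteriors into local decoder scores is the scaled likelihood obtained from Bayes' rule, with the class prior P(q_t = i) estimated from training data:

```latex
% Scaled likelihood used as the local score in hybrid HMM/ANN decoding
\frac{p(x_t \mid q_t = i)}{p(x_t)} \;=\; \frac{P(q_t = i \mid x_t)}{P(q_t = i)}
```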

Prior knowledge, contextual information
- Information about a phoneme is spread over time, so contextual information should be useful.
- There is usually some prior knowledge about the problem and some assumptions can be made. For example, transitions between some phonemes cannot happen or are less probable; lexical information is another example.
- Question: is there any way to introduce prior and contextual information in posterior estimation?
Speaker notes: Give an example of an illegal sequence of phonemes. Point out that MLPs normally cannot take this prior knowledge into account; at best, feature vectors can be concatenated to somehow capture contextual information. Then, to address this problem, move to the next slide.

“Gamma” posterior estimation
The idea: estimate posteriors through an HMM, based on the “gamma” state posterior definition (the formula on the slide is sketched below).
Posterior estimation taking into account:
- Prior knowledge encoded in the model M
- Whole-sequence contextual information
Speaker notes: Say something about integration over the HMM states.
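The formula itself is not preserved in the transcript; the standard HMM "gamma" state posterior, consistent with the description above (whole observation sequence x_1..x_T, model M), is:

```latex
% "Gamma" state posterior: probability of being in state i at time t,
% given the whole observation sequence and the model M
\gamma_t(i) \;\triangleq\; P(q_t = i \mid x_1, \ldots, x_T, M)
```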

“Gamma” posterior estimation
The “gamma” posterior can be written in terms of the forward and backward HMM recursions (sketched below).
Forward and backward recursions: functions of observation likelihoods (or scaled likelihoods) and state transition probabilities.
Speaker notes: The transition probability term is the place where prior knowledge can be introduced in the form of topological constraints. When showing the formula for alpha, mention that the emission probability term can be a likelihood or a scaled likelihood, and can be estimated using GMMs or MLPs. For further details, refer to [Ref1, Ref2].
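The recursions are not reproduced in the transcript; the standard forward-backward relations the slide refers to are (with p(x_t | q_t = i) standing for either a likelihood or a scaled likelihood, and a_{ji} the state transition probabilities):

```latex
% Forward and backward recursions
\alpha_t(i) = p(x_t \mid q_t = i)\,\sum_j \alpha_{t-1}(j)\, a_{ji},
\qquad
\beta_t(i) = \sum_j a_{ij}\, p(x_{t+1} \mid q_{t+1} = j)\, \beta_{t+1}(j)

% Gamma posterior in terms of the recursions
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_j \alpha_t(j)\,\beta_t(j)}
```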

Example: introducing prior knowledge
[Figure: phone posteriors estimated by the MLP vs. “gamma” phone posteriors]
Prior knowledge: minimum phone duration is 3 frames.
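A minimal sketch (not the authors' implementation) of the effect illustrated on this slide, assuming the minimum duration is enforced by expanding each phone into three left-to-right sub-states and computing gamma posteriors with forward-backward:

```python
# Encode "minimum phone duration = 3 frames" as an HMM topology: each phone is
# expanded into 3 left-to-right sub-states that must all be visited, and gamma
# posteriors are obtained with the standard forward-backward recursions.
import numpy as np

def gamma_posteriors(scaled_lik, trans, init):
    """scaled_lik: (T, S) per-state scores, trans: (S, S), init: (S,)."""
    T, S = scaled_lik.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = init * scaled_lik[0]
    for t in range(1, T):
        alpha[t] = scaled_lik[t] * (alpha[t - 1] @ trans)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (scaled_lik[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Two phones, each expanded into 3 sub-states. Sub-states only allow a self-loop
# or a move to the next sub-state; leaving a phone is only possible from its last
# sub-state, which enforces the 3-frame minimum duration.
n_phones, n_sub = 2, 3
S = n_phones * n_sub
trans = np.zeros((S, S))
for p in range(n_phones):
    for k in range(n_sub):
        s = p * n_sub + k
        trans[s, s] = 0.5                      # self-loop
        if k < n_sub - 1:
            trans[s, s + 1] = 0.5              # forced progression inside the phone
        else:
            for q in range(n_phones):          # phone-to-phone transition
                trans[s, q * n_sub] = 0.5 / n_phones

# Fake MLP phone posteriors for 8 frames, replicated over each phone's sub-states.
mlp_post = np.array([[0.9, 0.1], [0.2, 0.8], [0.9, 0.1], [0.9, 0.1],
                     [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]])
scaled_lik = np.repeat(mlp_post, n_sub, axis=1)
init = np.zeros(S); init[::n_sub] = 1.0 / n_phones

gamma = gamma_posteriors(scaled_lik, trans, init)
# Collapse sub-state posteriors back to phone posteriors:
phone_gamma = gamma.reshape(len(mlp_post), n_phones, n_sub).sum(axis=2)
print(np.round(phone_gamma, 2))  # the single-frame "blip" toward the second phone is smoothed out
```

The topology, stream values and phone inventory here are purely illustrative; the point is that the transition structure (prior knowledge) suppresses isolated, physiologically implausible single-frame posterior spikes that the raw MLP produces.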

Multi-stream “gamma” posterior
The extension of the gamma posterior estimation idea to the multi-stream case. We define the multi-stream gamma posterior (sketched below).
Estimating more informative posteriors by:
- Combining multiple feature streams carrying complementary information
- Taking into account prior knowledge (encoded in the model M) and contextual information (the whole observation sequence)
Speaker notes: The multi-stream case is the main concern of this talk; mention the corresponding recursions and the usual HMM assumptions.
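The definition on the slide is not preserved; the natural form, consistent with the description above (S feature streams, each observed over the whole sequence 1..T), would be:

```latex
% Multi-stream "gamma" state posterior (S streams, whole observation sequences)
\gamma_t(i) \;\triangleq\; P\!\left(q_t = i \,\middle|\, x_{1:T}^{(1)}, \ldots, x_{1:T}^{(S)}, M\right)
```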

Multi-stream “gamma” posterior estimation
We define multi-stream forward and backward recursions (sketched below). They can be written in terms of single-stream forward and backward recursions, under some independence assumptions. The multi-stream gamma posterior can then be written in terms of the multi-stream forward and backward recursions.
Speaker notes: Give some details about the multi-stream posterior estimation. Here or on the next slide, say: “Now we have a theoretical framework for combining the streams and for posterior estimation taking into account contextual information and prior knowledge.” Mention that the recursions can be rewritten in terms of single-stream forward and backward recursions under the independence assumption, and define the symbols and terms used.
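The slide's formulas are not preserved in the transcript. One consistent reading of the independence assumption (per-frame emission scores multiplying across streams, with the factorization into single-stream recursions left to the paper) gives recursions of the form:

```latex
% Multi-stream forward/backward recursions, assuming the streams are
% conditionally independent given the state
\alpha_t(i) = \left[\prod_{s=1}^{S} p\!\left(x_t^{(s)} \mid q_t = i\right)\right] \sum_j \alpha_{t-1}(j)\, a_{ji},
\qquad
\beta_t(i) = \sum_j a_{ij} \left[\prod_{s=1}^{S} p\!\left(x_{t+1}^{(s)} \mid q_{t+1} = j\right)\right] \beta_{t+1}(j)

% and, as in the single-stream case,
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_j \alpha_t(j)\,\beta_t(j)}
```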

Experiments with multi-stream posteriors
Which streams to combine? The streams that are combined should carry some complementary information.
Candidates: TempoRAl Pattern (TRAP) features and PLP cepstral features.
Speaker notes: Here I present the first results of the multi-stream posterior estimation method. The first step in building the system is to decide on the feature streams. Different feature streams should carry some complementary information; TRAP and PLP features are good candidates since they do. Other candidates could be static and dynamic features, e.g. PLP and delta-PLP features, which also carry some complementary information.

Feature streams
- PLP cepstral features: represent the whole spectrum over a short period of time.
- TempoRAl Pattern (TRAP) features: represent critical-band spectral energies over a long time span.

Hierarchical multi-stream posterior based speech recognition system

Results

CTS database:
  Features                        WER
  PLP posteriors                  48.7%
  TRAP posteriors                 55.1%
  Inverse entropy combination     (missing)
  Multi-stream gamma posteriors   46.8%

OGI digits database:
  Features                        WER
  PLP posteriors                  3.6%
  TRAP posteriors                 4.8%
  Inverse entropy combination     3.5%
  Multi-stream gamma posteriors   2.9%

Databases:
- OGI digits: recognition of continuous digits, 11 words
- Reduced-vocabulary version of the DARPA Conversational Telephone Speech (CTS) task, 1000 words

Speaker notes: Tell about the specifications of each database: number of utterances, amount of audio, etc.

Conclusions
- Proposing a theoretical framework for:
  - Multiple feature stream combination
  - Posterior estimation taking into account prior knowledge and contextual information
- Designing a hierarchical ASR system based on the multi-stream posterior estimation method
Speaker notes: So far the emission probabilities have been estimated with MLPs; further work could estimate emission likelihoods with GMMs. The MLP usually gives a kind of yes/no answer, in which the posteriors are either very high or very low, even for wrong cases.