Hierarchical Multi-Stream Posterior Based Speech Recognition System

1 Hierarchical Multi-Stream Posterior Based Speech Recognition System
Hamed Ketabdar, Herve Bourlard and Samy Bengio
IDIAP Research Institute, Martigny, Switzerland
MLMI 2005 Workshop, Edinburgh, UK

2 Main idea (1)
Estimating more informative posteriors by taking into account:
- Prior knowledge about the problem (e.g. phone transition probabilities, phone minimum durations)
- Contextual information
- Multiple streams of features

3 Main idea (2)
- Principled approach towards using posteriors in hierarchical structures
- Dividing the problem into multi-level sub-problems. In each layer:
  - Integrating relevant prior and contextual information
  - Combining different kinds of features

4 Posterior based speech recognition systems
Posterior estimation:
- Based on local features
- Without taking into account prior knowledge about the problem
Posteriors usage:
- As features for a standard HMM/GMM recognizer (e.g. Tandem)
- As local scores for a decoder (e.g. hybrid HMM/ANN system)

Speaker notes: This slide reviews state-of-the-art posterior-based speech recognition systems. The discussion is in terms of phonemes, but the approach and the issues are general and apply to many pattern recognition problems. The posteriors are usually estimated using MLPs. Most of the time the MLP sees only the slice of the speech signal represented by the current feature vector. It also knows nothing about prior knowledge related to the problem, for example the minimum duration of a phoneme or the legal sequences of phonemes. It sees only the current frame, possibly concatenated with a small number of neighboring frames, and decides on the phoneme posteriors from that alone.

5 Prior knowledge, Contextual information
- Information about a phoneme is spread over time, so contextual information should be useful.
- There is usually some prior knowledge and some assumptions about the problem, for example lexical information, or the fact that transitions between some phonemes cannot happen or are less probable.
Question: Is there any way to introduce prior and contextual information into posterior estimation?

Speaker notes: Give an example of an illegal sequence of phonemes. Mention that MLPs normally cannot take prior knowledge into account, and can only concatenate feature vectors to capture some contextual information. Close by saying "in order to address this problem, ..." and then go to the next slide.

6 “Gamma” posterior estimation
The idea: Estimate posteriors through an HMM, based on the “gamma” state posterior definition [equation not preserved in transcript].
Posterior estimation taking into account:
- Prior knowledge encoded in the model M
- Whole-sequence contextual information

Speaker notes: Say something about integration. Also say something [note truncated].
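A state posterior of this kind is conventionally written as below. The notation is an assumption made here, not taken from the slide: q_t is the HMM state at frame t, x_{1:T} is the whole observation sequence of length T, and M is the model that encodes the prior knowledge.

```latex
\gamma(i, t) \;\triangleq\; p\left(q_t = i \mid x_{1:T},\, M\right)
```

Conditioning on x_{1:T} rather than on the current frame alone is what brings in whole-sequence context, and conditioning on M is what brings in the prior knowledge.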

7 “Gamma” posterior estimation
The “gamma” posterior can be written in terms of the forward and backward HMM recursions [equations not preserved in transcript].
Forward and backward recursions: functions of observation likelihoods (or scaled likelihoods) and state transition probabilities.

Speaker notes: The transition probability term is where prior knowledge can be introduced, in the form of topological constraints. When showing the formula for alpha, mention that the emission probability term can be a likelihood or a scaled likelihood, and can be estimated using GMMs or MLPs. For further details refer to Ref1, Ref2.
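The recursions described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration of the standard forward-backward algorithm, not the authors' implementation. The function name `gamma_posteriors`, the array layout, and the per-frame normalization (used only to avoid numerical underflow) are assumptions made for this sketch.

```python
import numpy as np

def gamma_posteriors(emis, trans, init):
    """State posteriors from standard forward-backward recursions.

    emis  : (T, S) emission likelihoods or scaled likelihoods per frame and state
    trans : (S, S) trans[i, j] = p(q_t = j | q_{t-1} = i); zeros encode
            forbidden transitions (topological constraints)
    init  : (S,) initial state probabilities
    Returns a (T, S) array whose rows sum to 1.
    """
    T, S = emis.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))

    # Forward recursion, renormalized at every frame for numerical stability.
    alpha[0] = init * emis[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = emis[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()

    # Backward recursion, also renormalized at every frame.
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emis[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # Gamma is proportional to alpha * beta; the per-frame scale factors
    # cancel once each row is normalized.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

With a uniform transition matrix and uniform initial probabilities the model carries no prior knowledge, and the result reduces to the per-frame normalized emission scores. Any non-uniform `trans` is where constraints on phone sequences enter.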

8 Example: introducing prior knowledge
[Figure: two panels, "Phone posteriors estimated by MLP" and "“Gamma” phone posteriors"]
Prior knowledge: Minimum phone duration is 3
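One common way to encode a minimum duration in an HMM topology is to expand each phone into a chain of states that must be traversed in order. The sketch below assumes the duration is counted in frames and that a phone may be followed by any other phone with equal probability; the slide states neither. The names `min_duration_transitions`, `phone_posteriors`, and the `p_stay` parameter are hypothetical.

```python
import numpy as np

def min_duration_transitions(n_phones, min_dur=3, p_stay=0.5):
    """Transition matrix in which every phone lasts at least `min_dur` frames.

    Each phone is a left-to-right chain of `min_dur` states. Only the last
    state of a chain may loop on itself or jump to the first state of
    another phone.
    """
    if n_phones < 2:
        raise ValueError("need at least two phones to define exit transitions")
    n_states = n_phones * min_dur
    trans = np.zeros((n_states, n_states))
    for p in range(n_phones):
        first = p * min_dur
        last = first + min_dur - 1
        # Inner states must advance: no self-loop and no early exit.
        for s in range(first, last):
            trans[s, s + 1] = 1.0
        # Last state: stay in the phone or move to the start of another one.
        trans[last, last] = p_stay
        others = [q for q in range(n_phones) if q != p]
        for q in others:
            trans[last, q * min_dur] = (1.0 - p_stay) / len(others)
    return trans

def phone_posteriors(state_gamma, n_phones, min_dur=3):
    """Sum state posteriors over the states that belong to each phone."""
    T = state_gamma.shape[0]
    return state_gamma.reshape(T, n_phones, min_dur).sum(axis=2)
```

Passing such a matrix to a forward-backward routine yields state posteriors that respect the duration constraint; `phone_posteriors` then folds them back to one value per phone and frame.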

9 Multi-stream “gamma” posterior
The extension of the gamma posterior estimation idea to the multi-stream case. We define: [equation not preserved in transcript]
Estimating more informative posteriors by:
- Combining multiple feature streams having complementary information
- Taking into account prior knowledge (encoded in the model M) and contextual information (whole observation sequence)

Speaker notes: Say that the multi-stream case is the main concern of this talk. Mention that the same recursions and the usual HMM assumptions apply.

10 Multi-stream “gamma” posterior estimation
- We define multi-stream forward and backward recursions as follows: [equations not preserved in transcript]
- Multi-stream forward and backward recursions can be written in terms of single-stream forward and backward recursions (under some independence assumptions).
- The multi-stream gamma posterior can be written in terms of the multi-stream forward and backward recursions: [equation not preserved in transcript]

Speaker notes: Give some details about the multi-stream posterior estimation. For multi-stream posterior estimation we need to estimate [note truncated]. Here or on the next slide, say: "Well, now we have a theoretical framework for combining the streams and for posterior estimation that takes into account contextual information and prior knowledge." Mention that the recursions can be rewritten in terms of single-stream forward and backward recursions under the independence assumption. Also explain the symbols, terms, definitions, etc.
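The exact independence assumptions are not preserved in this transcript. The sketch below uses one simple choice, stated here as an assumption and not as the authors' derivation: the streams are conditionally independent given the state, so the joint emission score of a frame is the product of the per-stream scores. The function names are hypothetical.

```python
import numpy as np

def forward_backward(emis, trans, init):
    """Standard forward-backward state posteriors, normalized per frame."""
    T, S = emis.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = init * emis[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = emis[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emis[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def multi_stream_gamma(stream_emis, trans, init):
    """Gamma posteriors given several feature streams.

    stream_emis : list of (T, S) arrays, one per stream, holding emission
                  likelihoods or scaled likelihoods
    ASSUMPTION: streams are conditionally independent given the state, so
    the joint emission score is the element-wise product across streams.
    """
    joint = np.prod(np.stack(stream_emis), axis=0)
    return forward_backward(joint, trans, init)
```

Under this assumption a stream whose scores are identical for every state carries no information, and the multi-stream posterior reduces to the single-stream posterior of the remaining stream.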

11 Experiments with multi-stream posteriors
Which streams to combine?
- The streams that are combined should carry complementary information.
- Candidates: TempoRAl Pattern (TRAP) features and PLP cepstral features.

Speaker notes: Here I explain the first results of the multi-stream posterior estimation method. The first step in building the system is to choose the feature streams. Different feature streams should carry complementary information, and TRAP and PLP features are good candidates for this reason. Other candidates are static and dynamic features, such as PLP and delta-PLP features, which also carry some complementary information.

12 Feature streams
- PLP cepstral features: show the whole spectrum for a limited period of time.
- TempoRAl Pattern (TRAP) features: represent critical-band spectral energies over a long time span.

13 Hierarchical multi-stream posterior based speech recognition system

14 Results

CTS database (reduced-vocabulary version of the DARPA Conversational Telephone Speech task, 1000 words):
  Features                        WER
  PLP posteriors                  48.7%
  TRAP posteriors                 55.1%
  Inverse entropy combination     (value missing in source)
  Multi-stream gamma posteriors   46.8%

OGI digits database (recognition of continuous digits, 11 words):
  Features                        WER
  PLP posteriors                  3.6%
  TRAP posteriors                 4.8%
  Inverse entropy combination     3.5%
  Multi-stream gamma posteriors   2.9%

Speaker notes: Describe the specifications of each database: number of utterances, duration, etc.
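The slide does not say how the inverse entropy combination baseline was computed. A common form of the technique weights each stream's frame-level posterior vector by the inverse of its entropy, so that the more confident stream dominates. The sketch below assumes that form, two streams, and a small `eps` floor. None of these choices is confirmed by the source, and the function name is hypothetical.

```python
import numpy as np

def inverse_entropy_combination(post_a, post_b, eps=1e-12):
    """Frame-wise weighted sum of two posterior streams.

    post_a, post_b : (T, C) arrays of class posteriors, rows summing to 1
    Each stream's weight at frame t is proportional to 1 / entropy of its
    posterior vector at that frame (ASSUMED variant of the technique).
    """
    def entropy(p):
        return -(p * np.log(p + eps)).sum(axis=1)

    # Floor the entropies so a one-hot posterior does not divide by zero.
    h = np.maximum(np.stack([entropy(post_a), entropy(post_b)]), eps)  # (2, T)
    w = 1.0 / h
    w /= w.sum(axis=0, keepdims=True)
    return w[0][:, None] * post_a + w[1][:, None] * post_b
```

Unlike the gamma posterior approach, this kind of combination works frame by frame and uses neither the transition model nor the rest of the observation sequence.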

15 Conclusions
- Proposing a theoretical framework for:
  - Multiple feature stream combination
  - Posterior estimation taking into account prior knowledge and contextual information
- Designing a hierarchical ASR system based on the multi-stream posterior estimation method

Speaker notes: Say that so far I have estimated emission probabilities with MLPs. Further work could estimate emission likelihoods with GMMs. The MLP usually gives a kind of yes/no answer, in which the posteriors are either very high or very low, even for wrong cases.

