The Use of Context in Large Vocabulary Speech Recognition Julian James Odell March 1995 Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy.

The Use of Context in Large Vocabulary Speech Recognition Julian James Odell March 1995 Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy. Presenter: Hsu-Ting Wei

2

3 Context

4 Contents (cont.)

5 Introduction The use of context dependent models introduces two major problems: –1. Sparse and unevenly distributed training data –2. The need for an efficient decoding strategy that incorporates context dependencies both within words and across word boundaries

6 Introduction (cont.) Problem 1 (ch3) –Constructs robust and accurate recognizers using decision-tree based clustering techniques –Linguistic knowledge is used to guide the clustering –The approach allows the construction of models that depend upon contextual effects occurring across word boundaries Problem 2 (ch4~) –The thesis presents a new decoder design that can use these models efficiently –The decoder can generate a lattice of word hypotheses with little computational overhead.
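The decision-tree clustering mentioned above can be pictured as a greedy splitting loop: pool all context-dependent states at the root, then repeatedly apply the phonetic question whose yes/no split gives the largest log-likelihood gain, stopping below a threshold. A minimal illustrative sketch (the `stats_loglik` scoring function and the toy question set are hypothetical stand-ins, not the thesis's actual sufficient statistics):

```python
def grow_tree(states, questions, stats_loglik, min_gain):
    """Greedy decision-tree state clustering (simplified sketch).

    states: context-dependent states pooled at the current node.
    questions: list of (name, predicate) pairs over phonetic context.
    stats_loglik: maps a set of states to the log-likelihood of
        modelling them all with a single shared distribution.
    Splitting stops when no question improves log-likelihood by min_gain.
    """
    base = stats_loglik(states)
    best = None
    for name, pred in questions:
        yes = [s for s in states if pred(s)]
        no = [s for s in states if not pred(s)]
        if not yes or not no:
            continue  # question does not partition this node
        gain = stats_loglik(yes) + stats_loglik(no) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None or best[0] < min_gain:
        return states  # leaf: these states share one tied model
    _, name, yes, no = best
    return (name,
            grow_tree(yes, questions, stats_loglik, min_gain),
            grow_tree(no, questions, stats_loglik, min_gain))
```

In a real system the states carry Gaussian statistics and the gain is computed from occupancy-weighted variances; the control flow, however, is exactly this recursion.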

7 Ch3. Context dependency in speech 3.1 Contextual Variation –To maximize the accuracy of HMM based speech recognition systems, it is necessary to tailor their architecture carefully so that they exploit the strengths of HMMs while minimizing the effects of their weaknesses. Signal parameterisation Model structure –Ensure that the between-class variance is higher than the within-class variance

8 Ch3. Context dependency in speech (cont.) Most of the variability inherent in speech is due to contextual effects: –Session effects Speaker effects –The major source of variation Environmental effects –Controlled by minimizing background noise and ensuring that the same microphone is used –Local effects Utterance-level effects –Co-articulation, stress, emphasis By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased.

9 Ch3. Context dependency in speech (cont.) Session effects –A speaker dependent (SD) system is significantly more accurate than a similar speaker independent (SI) system. –Speaker effects Gender and age Dialect Style –To make an SI system approach SD performance, we can: Operate several recognizers in parallel Adapt the recognizer to match the new speaker

10 Ch3. Context dependency in speech (cont.) Session effects (cont.) –Operating recognizers in parallel Disadvantage: –The computational load appears to rise linearly with the number of systems Advantage: –One system tends to dominate quickly, so the computational load is high for only the first few seconds of speech, by which point the speaker type is answered

11 Ch3. Context dependency in speech (cont.) Session effects (cont.) –Adapting the recognizer to match the new speaker Problem: there may be insufficient data to update the model –It is possible to make use of both techniques: initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system (e.g. with MAP or MLLR) to better match the speaker.
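As a reminder of what the MLLR adaptation mentioned above does, it estimates an affine transform of the Gaussian mean vectors from the adaptation data (this is the standard formulation, not taken from this slide set):

```latex
\hat{\mu} = A\mu + b = W\xi, \qquad \xi = \begin{bmatrix} 1 \\ \mu \end{bmatrix}
```

where $W = [\,b \;\; A\,]$ is estimated by maximum likelihood and can be shared across a cluster of Gaussians, so adaptation needs far less data than re-estimating every mean individually.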

12 Ch3. Context dependency in speech (cont.) Local effects –Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than that of the same phone occurring in a variety of contexts. –Ex: ”We were away with William in Sea World.” w iy w er… s iy w er

13 Ch3. Context dependency in speech (cont.) Local effects –Context Dependent Phonetic Models in LIMSI –45 monophone contexts (Festival CMU: 41) »STEAK = sil s t ey k sil –2071 biphone contexts (Festival CMU: 1364) »STEAK = sil sil-s s-t t-ey ey-k sil –95221 triphone contexts »STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil –Word Boundaries Word Internal Context Dependency (Intra-word) –STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil Cross Word Context Dependency (Inter-word) => can increase accuracy –STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
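The monophone-to-triphone expansions above follow a simple mechanical rule, which can be sketched as follows (treating `sil` as context independent, as in the slide's examples; this is an illustration, not HTK's actual expansion code):

```python
def to_triphones(phones):
    """Expand a monophone sequence into l-p+r triphone names.

    phones: list such as ["sil", "s", "t", "ey", "k", "sil"].
    Phones at the sequence edges keep only the context that exists,
    so calling this on a single word (no sil) yields the word-internal
    expansion, while a whole sil-delimited utterance yields the
    cross-word expansion.
    """
    out = []
    for i, p in enumerate(phones):
        if p == "sil":
            out.append(p)  # silence is modelled context independently
            continue
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out
```

For example, `to_triphones(["sil", "s", "t", "ey", "k", "sil"])` reproduces the cross-word STEAK expansion shown above, and `to_triphones(["s", "t", "ey", "k"])` reproduces the word-internal one.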

14 English dictionary Festlex CMU - Lexicon (American English) for the Festival Speech System ( ) –40 distinct phones ("hello" nil (((hh ax l) 0) ((ow) 1))) ("world" nil (((w er l d) 1)))

15 English dictionary (cont.) The LIMSI dictionary phone set (1993) –45 phones

16 Linguistic knowledge (cont.) General questions (e.g. nasals, fricatives, liquids)

17 Linguistic knowledge (cont.) Vowel questions

18 Linguistic knowledge (cont.) Consonant questions (e.g. fortis consonants, lenis consonants, apical, strident, syllabic, fricatives, affricates)

19 Linguistic knowledge (cont.) Questions used in HTK for state tying
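For reference, phonetic questions in HTK are written as `QS` commands in an HHEd edit script, and `TB` then performs the tree-based state tying; a small illustrative fragment (the question names, phone labels, and threshold here are made-up examples, not the ones used in the thesis):

```
QS "L_Nasal" { m-*,n-*,ng-* }
QS "R_Vowel" { *+aa,*+ae,*+iy,*+ow }
TB 350.0 "s_state2" {(s, *-s, s+*, *-s+*).state[2]}
```

Each `QS` line defines a yes/no question on the left (`x-*`) or right (`*+x`) context, and `TB` grows one tree per item list, splitting while the log-likelihood gain exceeds the threshold.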

20 Ch4. Decoding This chapter describes several decoding techniques suitable for the recognition of continuous speech using HMMs. It is concerned with the use of cross-word context dependent acoustic models and long span language models. –4.1 Requirements: the ideal decoder –4.2 Time-Synchronous decoding Token passing Beam pruning N-Best decoding Limitations Back-Off implementation –4.3 Best First Decoding A* Decoding The stack decoder for speech recognition –4.4 A Hybrid approach
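The token passing and beam pruning items in the outline above combine naturally: each live state holds a token (a path log score), tokens propagate along transitions every frame, and tokens falling more than a beam width below the frame's best are discarded. A minimal single-network sketch in Python (the data layout is invented for illustration; a real decoder also records traceback and word-end link information):

```python
import math

def token_pass(obs_loglik, trans, beam=10.0):
    """One-pass token-passing Viterbi search with beam pruning.

    obs_loglik: list of frames, each a list of per-state observation
        log-likelihoods.
    trans: dict mapping (src, dst) -> transition log-probability.
    beam: tokens more than `beam` below the frame's best are pruned.
    Returns the best path log score after the final frame.
    """
    tokens = {0: 0.0}  # start with a single live token in state 0
    for frame in obs_loglik:
        new_tokens = {}
        for i, score in tokens.items():
            for (src, dst), logp in trans.items():
                if src != i:
                    continue
                cand = score + logp + frame[dst]
                if cand > new_tokens.get(dst, -math.inf):
                    new_tokens[dst] = cand  # keep only the best token per state
        best = max(new_tokens.values())
        # Beam pruning: drop tokens far below the current best.
        tokens = {s: v for s, v in new_tokens.items() if v > best - beam}
    return max(tokens.values())
```

Keeping only the best token per state is the Viterbi approximation; widening `beam` trades computation for a lower chance of pruning away the eventual best path.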

21 Ch4. Decoding (cont.) 4.1 Requirements –Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance Acoustic model likelihood Language model likelihood
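The two likelihoods on this slide combine through the standard Bayes decision rule, which the ideal decoder maximizes over word sequences $W$ given the acoustics $O$:

```latex
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \, p(O \mid W)\, P(W)
```

where $p(O \mid W)$ is the acoustic model likelihood and $P(W)$ the language model probability; the evidence $p(O)$ is constant over hypotheses and can be dropped from the maximization.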

22 Ch4. Decoding (cont.) 4.1 Requirements (cont.) –The ideal decoder would have the following characteristics: Efficiency: ensure that the system does not lag behind the speaker. Accuracy: find the most likely grammatical sequence of words for each utterance. Scalability: the computation required by the decoder should increase less than linearly with the size of the vocabulary. Versatility: allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency (e.g. n-gram language models + cross-word context dependent models).

23 Conclusion Implemented right-biphone and triphone recognition tasks in HTK