Automatic Continuous Speech Recognition

Automatic Continuous Speech Recognition
- Problems with isolated word recognition:
  - Every new task contains novel words without any available training data.
  - There are simply too many words, and these words may have different acoustic realizations, which increases variability:
    - coarticulation across "words"
    - speaking rate
  - We do not know where the word boundaries are.

- In CSR, should we use words? Or what is the basic unit that best represents the salient acoustic and phonetic information?

Model Unit Issues
- Accurate: represents the acoustic realizations that appear in different contexts.
- Trainable: enough data is available to estimate the unit's parameters.
- Generalizable: new words can be derived from the unit inventory.

Comparison of Different Units
- Words:
  - Small task: accurate, trainable, not generalizable.
  - Large vocabulary: accurate, not trainable, not generalizable.
- Phonemes:
  - Large vocabulary: not accurate, trainable, over-generalizable.

- Syllables:
  - English: 30,000 syllables: not very accurate, not trainable, generalizable.
  - Chinese: 1,200 tone-dependent syllables; Japanese: 50 syllables: accurate, trainable, generalizable.
- Allophones: realizations of a phoneme in different contexts.
  - Accurate, not trainable, generalizable.
  - Triphones are an example of allophones.

Training in Sphinx
1. The phoneme set is trained.
2. Triphones are created.
3. Senones are created.
4. Senones are pruned.
5. Triphones are trained.
6. Senones are trained, growing from 1 Gaussian to 8 or 16 Gaussians per state.
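The growth from 1 to 8 or 16 Gaussians per state is typically done by repeatedly splitting each Gaussian into two (perturbing the mean by a fraction of the standard deviation and halving the weight), then re-estimating. A minimal sketch of the splitting step, assuming diagonal 1-D components; this is an illustration, not the actual Sphinx code:

```python
import numpy as np

def split_gaussians(means, variances, weights, perturb=0.2):
    """Double a Gaussian mixture by splitting each component in two:
    each mean is perturbed by +/- perturb * stddev, weights are halved."""
    offset = perturb * np.sqrt(variances)
    new_means = np.concatenate([means - offset, means + offset])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights / 2.0, weights / 2.0])
    return new_means, new_vars, new_weights

# Grow a 1-Gaussian state to 8 Gaussians in three splits (1 -> 2 -> 4 -> 8);
# in a real trainer, Baum-Welch re-estimation runs after every split.
m, v, w = np.array([0.0]), np.array([1.0]), np.array([1.0])
for _ in range(3):
    m, v, w = split_gaussians(m, v, w)
```

After each split the mixture weights still sum to one, so the model remains a valid density before re-estimation refines it.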

- Context independent: phonemes.
  - Sphinx: model_architecture/Telefonica.ci.mdef
- Context dependent: triphones.
  - Sphinx: model_architecture/Telefonica.untied.mdef

Clustering Acoustic-Phonetic Units
- Many phones have similar effects on their neighboring phones; hence, many triphones have very similar Markov states.
- A senone is a cluster of similar Markov states.
- Advantages:
  - More training data per parameter.
  - Less memory used.

Senonic Decision Tree (SDT)
- An SDT classifies the Markov states of the triphones represented in the training corpus by asking linguistic questions composed of conjunctions, disjunctions, and/or negations of a set of predetermined questions.

Linguistic Questions

Question   Phones in Each Question
Aspgen     hh, sil
Alvstp     d, t
Dental     dh, th
Labstp     b, p
Liquid     l, r
Lw         l, w
S/Sh       s, sh
...        ...
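Each linguistic question is just a named set of phones: it answers "yes" for a triphone context whenever the neighboring phone belongs to that set. A minimal sketch using the question names from the table above:

```python
# Linguistic questions: each name maps to the set of phones it covers.
QUESTIONS = {
    "Aspgen": {"hh", "sil"},
    "Alvstp": {"d", "t"},
    "Dental": {"dh", "th"},
    "Labstp": {"b", "p"},
    "Liquid": {"l", "r"},
    "Lw":     {"l", "w"},
    "S/Sh":   {"s", "sh"},
}

def answers(question, phone):
    """True if `phone` answers `question` positively."""
    return phone in QUESTIONS[question]
```

For example, a left-context /t/ answers "Alvstp" positively but "Dental" negatively, so those two questions separate alveolar stops from dental fricatives at a tree node.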

[Figure: decision tree for classifying the second state of a /k/ triphone. The root asks "Is the left phone (LP) a sonorant or nasal?"; further questions (e.g. "Is the right phone (RP) a back-R?", "Is LP /s, z, sh/?", "Is the right phone voiced?") lead to the leaves Senone 1 through Senone 6.]

[Figure: the same decision tree applied to the /k/ in the word "welcome", answering each question for that context until a single senone leaf is reached.]
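Walking such a tree is a simple loop: answer the question at the current node and descend until a senone leaf is reached. The sketch below uses a hypothetical two-level fragment, not the exact tree from the slide:

```python
# A node is either a leaf senone id (a string), or a tuple
# (question_fn, yes_subtree, no_subtree).
def left_sonorant_or_nasal(ctx):
    return ctx["left"] in {"l", "r", "w", "y", "m", "n", "ng"}

def right_back_r(ctx):
    return ctx["right"] == "r"

TREE = (left_sonorant_or_nasal,
        (right_back_r, "senone_1", "senone_2"),  # yes branch
        "senone_3")                              # no branch

def classify(tree, ctx):
    """Descend the tree, answering each linguistic question, until a leaf."""
    while isinstance(tree, tuple):
        question, yes, no = tree
        tree = yes if question(ctx) else no
    return tree
```

A triphone state with left context /m/ and right context /r/ would land in senone_1; one with left context /k/ goes straight to senone_3.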

- The tree can be constructed automatically by searching, at each node, for the question that gives the maximum entropy decrease.
  - Sphinx:
    - Construction: $base_dir/c_scripts/03.bulidtrees
    - Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree
- As the tree grows, it needs to be pruned.
  - Sphinx:
    - $base_dir/c_scripts/04.bulidtrees
    - Results:
      - $base_dir/trees/Telefonica.500/A-0.dtree
      - $base_dir/Telefonica_arquitecture/Telefonica.500.mdef
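The "maximum entropy decrease" criterion can be sketched as an information-gain computation: a question splits the training items at a node into yes/no children, and the best question is the one whose size-weighted child entropies fall furthest below the parent entropy. The sketch below works on discrete labels for clarity; a real trainer scores splits with Gaussian likelihoods:

```python
from collections import Counter
import math

def entropy(counts):
    """Shannon entropy (bits) of a Counter of label frequencies."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total)
                for c in counts.values() if c)

def entropy_decrease(node_labels, question):
    """Entropy drop when `question` (a predicate on a label) splits the
    node into yes/no children, each weighted by its share of the data."""
    yes = Counter(l for l in node_labels if question(l))
    no = Counter(l for l in node_labels if not question(l))
    n = len(node_labels)
    parent = entropy(Counter(node_labels))
    children = (sum(yes.values()) / n) * entropy(yes) \
             + (sum(no.values()) / n) * entropy(no)
    return parent - children
```

A question that perfectly separates two equally frequent classes yields a decrease of exactly one bit; the tree builder would pick that question over any partial split.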

Subword unit Models based on HMMs

Words
- Words can be modeled using composite HMMs.
- A null transition is used to go from one subword unit to the following one.
- Example: the word "two" as /sil/ /t/ /uw/ /sil/.
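The concatenation can be sketched by chaining each subword HMM into one word-level transition matrix, where the exit arc of a unit's last state feeds the first state of the next unit (this realizes the null transition). The sketch below assumes 3-state left-to-right phone models with fixed probabilities; real models have trained, state-specific values:

```python
import numpy as np

def phone_hmm(n_states=3, self_loop=0.6):
    """A left-to-right phone HMM: each state loops on itself or advances.
    The extra column is the exit arc out of the unit."""
    A = np.zeros((n_states, n_states + 1))
    for i in range(n_states):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    return A

def concat_word_hmm(phones, n_states=3):
    """Chain the phone HMMs for `phones` into one word-level matrix.
    The exit column of unit p lands on the first state of unit p+1."""
    total = n_states * len(phones)
    A = np.zeros((total, total + 1))
    for p in range(len(phones)):
        off = p * n_states
        A[off:off + n_states, off:off + n_states + 1] += phone_hmm(n_states)
    return A

word = concat_word_hmm(["t", "uw"])  # the word "two" as /t/ + /uw/
```

Every row still sums to one, and the last column of the final matrix is the exit arc of the whole word, ready to be chained to the next word or to /sil/.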

Continuous Speech Training

- For each training utterance, the subword units are concatenated to form word models.
  - Sphinx: dictionary:
    - $base_dir/training_input/dict.txt
    - $base_dir/training_input/train.lbl

- Let's assume we are going to train the phonemes in the sentence:
  - "Two four six."
- The phonemes of this sentence are:
  - /t/ /w/ /o/ /f/ /o/ /r/ /s/ /i/ /x/
- Therefore the HMM will be:
  - /sil/ /t/ /uw/ /sil/ /f/ /o/ /r/ /s/ /i/ /x/

- We can estimate the parameters of each HMM using the forward-backward re-estimation formulas already defined.

- The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.
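The alignment falls out of the recursions themselves: the state-occupancy probabilities gamma[t, i] computed from the forward and backward passes say how much each frame belongs to each state, with no manual segmentation. A minimal discrete-observation sketch (real acoustic models use Gaussian emissions and log-space arithmetic):

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Return gamma[t, i] = P(state i at time t | obs) for a discrete HMM.
    A: N x N transitions, B[i, k]: prob of symbol k in state i, pi: initial."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state HMM aligned to a 3-frame observation sequence
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
gamma = forward_backward(A, B, pi, [0, 1, 0])
```

Each row of gamma is a soft assignment of that frame to the states; the re-estimation formulas simply accumulate these occupancies in place of hand-made segment boundaries.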

Language Models for Large Vocabulary Speech Recognition

- Instead of using only the acoustic likelihood P(X|W), recognition can be improved by maximizing the a posteriori probability:
  - W* = argmax_W P(W|X) = argmax_W P(X|W) P(W)
- P(W) is given by the language model, and the maximization is carried out by the Viterbi search.

Language Models for Large Vocabulary Speech Recognition
- Goal: provide an estimate of the probability of a "word" sequence (w1 w2 w3 ... wQ) for the given recognition task.
- By the chain rule of probability:
  - P(w1 w2 ... wQ) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wQ|w1 ... wQ-1)

- Since it is impossible to reliably estimate all of these conditional probabilities, in practice an N-gram language model is used:
  - P(wj | w1 ... wj-1) ≈ P(wj | wj-N+1 ... wj-1)
- In practice, reliable estimators are obtained for N = 1 (unigram), N = 2 (bigram), or possibly N = 3 (trigram).

Examples:
- Unigram: P(Maria loves Pedro) = P(Maria) P(loves) P(Pedro)
- Bigram: P(Maria loves Pedro) = P(Maria|&lt;s&gt;) P(loves|Maria) P(Pedro|loves) P(&lt;/s&gt;|Pedro)

CMU-Cambridge Language Modeling Tools n $base_dir/c_scripts/languageModelling


P(Wi | Wi-2, Wi-1) = C(Wi-2 Wi-1 Wi) / C(Wi-2 Wi-1)

where
- C(Wi-2 Wi-1 Wi) = total number of times the sequence Wi-2 Wi-1 Wi was observed
- C(Wi-2 Wi-1) = total number of times the sequence Wi-2 Wi-1 was observed
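The estimator above is just a ratio of counts over the training text; a minimal sketch (toy corpus, maximum-likelihood only, no smoothing):

```python
from collections import Counter

def trigram_prob(corpus, w1, w2, w3):
    """MLE trigram estimate P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2)."""
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bi = Counter(zip(corpus, corpus[1:]))
    if bi[(w1, w2)] == 0:
        return 0.0  # history never seen; real systems smooth or back off
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

corpus = "two four six two four eight two four six".split()
p = trigram_prob(corpus, "two", "four", "six")  # 2 of 3 "two four" histories
```

Here "two four" occurs three times and is followed by "six" twice, so the estimate is 2/3; unseen trigrams get probability zero, which is exactly why smoothing or backing off to bigrams is needed in practice.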