Automatic Continuous Speech Recognition
Automatic Continuous Speech Recognition
• Problems with isolated word recognition:
  – Every new task contains novel words without any available training data.
  – There are simply too many words, and these words may have different acoustic realizations. This increases variability:
    • coarticulation across "words"
    • speaking rate (speech velocity)
  – We do not know where the word boundaries are.
• In CSR, should we use words? Or what is the best basic unit to represent the salient acoustic and phonetic information?
Model Unit Issues
• Accurate:
  – Represents the acoustic realizations that appear in different contexts.
• Trainable.
• Generalizable:
  – New words can be derived from the units.
Comparison of Different Units
• Words:
  – Small task: accurate, trainable, not generalizable.
  – Large vocabulary: accurate, not trainable, not generalizable.
• Phonemes:
  – Large vocabulary: not accurate, trainable, over-generalizable.
• Syllables:
  – English: about 30,000 syllables: not very accurate, not trainable, generalizable.
  – Chinese: 1,200 tone-dependent syllables.
  – Japanese: about 50 syllables: accurate, trainable, generalizable.
• Allophones: realizations of phonemes in different contexts.
  – Accurate, not trainable, generalizable.
  – Triphones are an example of allophones.
Training in Sphinx
• The phoneme set (context-independent models) is trained.
• Triphones are created.
• Triphones are trained.
• Senones are created.
• Senones are pruned.
• Senones are trained: from 1 Gaussian up to 8 or 16 Gaussians per state.
• Context-independent units: phonemes
  – SPHINX: model_architecture/Telefonica.ci.mdef
• Context-dependent units: triphones
  – SPHINX: model_architecture/Telefonica.untied.mdef
Clustering Acoustic-Phonetic Units
• Many phones have similar effects on their neighboring phones; hence, many triphones have very similar Markov states.
• A senone is a cluster of similar Markov states (a small similarity sketch follows below).
• Advantages:
  – More training data.
  – Less memory used.
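To make "similar Markov states" concrete, here is a minimal sketch, assuming each state's output density is a single diagonal-covariance Gaussian and using a symmetric KL divergence as the similarity measure. The function names and numbers are illustrative only; Sphinx actually ties states with the senonic decision trees described on the next slides.

```python
import numpy as np

def symmetric_kl_diag_gauss(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians.

    States whose output densities are close under this measure are
    candidates for sharing a single senone.
    """
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    kl_pq = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl_qp = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1.0)
    return kl_pq + kl_qp

# Toy example: middle states of three different triphones (2-dim features)
state_a = (np.array([1.0, 0.2]), np.array([0.5, 0.3]))
state_b = (np.array([1.1, 0.1]), np.array([0.6, 0.3]))
state_c = (np.array([-2.0, 3.0]), np.array([1.0, 1.0]))

print(symmetric_kl_diag_gauss(*state_a, *state_b))  # small -> cluster together
print(symmetric_kl_diag_gauss(*state_a, *state_c))  # large -> keep separate
```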
Senonic Decision Tree (SDT)
• An SDT classifies the Markov states of the triphones represented in the training corpus by asking linguistic questions composed of conjunctions, disjunctions, and/or negations of a set of predetermined questions.
Linguistic Questions

  Question   Phones in Each Question
  Aspgen     hh, sil
  Alvstp     d, t
  Dental     dh, th
  Labstp     b, p
  Liquid     l, r
  Lw         l, w
  S/Sh       s, sh
  …          …
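A minimal sketch of how the question table above might be held in code: each question name maps to its phone class, and a small helper answers one question about a triphone's left or right context. The names simply mirror the table; the helper is a hypothetical illustration, not the Sphinx data structure.

```python
# Linguistic questions: question name -> set of phones in that class
LINGUISTIC_QUESTIONS = {
    "Aspgen": {"hh", "sil"},
    "Alvstp": {"d", "t"},
    "Dental": {"dh", "th"},
    "Labstp": {"b", "p"},
    "Liquid": {"l", "r"},
    "Lw":     {"l", "w"},
    "S/Sh":   {"s", "sh"},
}

def answers(question, phone):
    """True if `phone` belongs to the phone class named by `question`."""
    return phone in LINGUISTIC_QUESTIONS[question]

# Example: is a left-context /t/ a dental? an alveolar stop?
print(answers("Dental", "t"))   # False
print(answers("Alvstp", "t"))   # True
```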
Decision Tree for Classifying the Second State of a /k/ Triphone
[Figure: decision tree whose nodes ask linguistic questions, e.g. "Is the left phone (LP) a sonorant or nasal?", "Is the right phone (RP) a back-R?", "Is the LP /s, z, sh, zh/?", "Is the RP voiced?", "Is the LP a back-L or (LP neither a nasal nor RP a LAX-vowel)?"; the leaves are Senones 1–6.]
When applied to the word "welcome"
[Figure: the same decision tree traversed for a triphone in the word "welcome", ending in one of Senones 1–6.]
• The tree can be constructed automatically by searching, at each node, for the question that gives the maximum decrease in entropy (sketched below).
  – Sphinx:
    • Construction: $base_dir/c_scripts/03.bulidtrees
    • Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree
• When the tree grows too large, it needs to be pruned.
  – Sphinx:
    • Pruning: $base_dir/c_scripts/04.bulidtrees
    • Results:
      • $base_dir/trees/Telefonica.500/A-0.dtree
      • $base_dir/Telefonica_arquitecture/Telefonica.500.mdef
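A minimal sketch of the splitting criterion, under simplifying assumptions: each training state is summarized by a discrete label (e.g. a VQ codeword) and by the left-context phone of its triphone; for every candidate question the data are split into "yes"/"no" sets and the drop in weighted entropy is measured. SphinxTrain uses continuous densities and a likelihood-based criterion, so this only illustrates the idea, not the implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values()) if labels else 0.0

def entropy_decrease(data, question_phones):
    """data: list of (left_context_phone, label). Entropy drop for one question."""
    yes = [lab for ctx, lab in data if ctx in question_phones]
    no  = [lab for ctx, lab in data if ctx not in question_phones]
    before = entropy([lab for _, lab in data])
    after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(data)
    return before - after

def best_question(data, questions):
    """Pick the question with the maximum entropy decrease at this node."""
    return max(questions.items(), key=lambda q: entropy_decrease(data, q[1]))

# Toy node: left contexts with acoustic labels 0/1
data = [("d", 0), ("t", 0), ("t", 0), ("b", 1), ("p", 1), ("l", 1)]
questions = {"Alvstp": {"d", "t"}, "Labstp": {"b", "p"}, "Liquid": {"l", "r"}}
print(best_question(data, questions)[0])   # -> "Alvstp"
```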
Subword Unit Models Based on HMMs
Words
• Words can be modeled using composite HMMs.
• A null transition is used to go from one subword unit to the next.
[Figure: composite HMM for the word "two": /sil/ /t/ /uw/ /sil/]
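A minimal sketch (not the Sphinx representation) of building such a composite word HMM by concatenating 3-state left-to-right phone HMMs. The helper names are hypothetical; the "null transition" joining consecutive units is realized here by routing each unit's exit probability into the first state of the next unit.

```python
import numpy as np

def phone_hmm(n_states=3, self_loop=0.6):
    """Transition matrix with an extra 'exit' column for a left-to-right HMM."""
    a = np.zeros((n_states, n_states + 1))
    for i in range(n_states):
        a[i, i] = self_loop          # stay in the same state
        a[i, i + 1] = 1 - self_loop  # move to the next state (or exit)
    return a

def concatenate(units):
    """Join per-phone HMMs into one composite transition matrix."""
    n = sum(u.shape[0] for u in units)
    comp = np.zeros((n, n + 1))       # last column = exit of the whole word
    offset = 0
    for u in units:
        k = u.shape[0]
        comp[offset:offset + k, offset:offset + k] = u[:, :k]
        comp[offset + k - 1, offset + k] = u[-1, k]   # exit feeds next unit (null transition)
        offset += k
    return comp

# /sil/ /t/ /uw/ /sil/ for the word "two"
word_two = concatenate([phone_hmm() for _ in "sil t uw sil".split()])
print(word_two.shape)   # (12, 13): 12 emitting states plus an exit column
```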
Continuous Speech Training
• For each training utterance, the subword units are concatenated to form the word models.
  – Sphinx: dictionary and training labels:
    • $base_dir/training_input/dict.txt
    • $base_dir/training_input/train.lbl
• Let's assume we are going to train the phonemes in the sentence:
  – "Two four six."
• The phonemes of this sentence are:
  – /t/ /uw/ /f/ /o/ /r/ /s/ /i/ /x/
• Therefore the composite HMM will be:
  – /sil/ /t/ /uw/ /sil/ /f/ /o/ /r/ /s/ /i/ /x/
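A small illustration of how a pronunciation dictionary such as dict.txt turns the transcript "two four six" into the phone string used to build the composite HMM. The dictionary entries below are made up to mirror the slide, and an optional /sil/ is allowed between every pair of words, whereas the slide draws it only between "two" and "four"; the real dict.txt format differs.

```python
# Hypothetical pronunciation dictionary mirroring the slide's phone set
DICT = {
    "two":  ["t", "uw"],
    "four": ["f", "o", "r"],
    "six":  ["s", "i", "x"],
}

def transcript_to_phones(words, optional_sil=True):
    """Leading /sil/, then each word's phones, optionally a /sil/ between words."""
    phones = ["sil"]
    for i, w in enumerate(words):
        phones += DICT[w]
        if optional_sil and i < len(words) - 1:
            phones.append("sil")
    return phones

print(transcript_to_phones("two four six".split()))
# ['sil', 't', 'uw', 'sil', 'f', 'o', 'r', 'sil', 's', 'i', 'x']
```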
• We can estimate the parameters of each HMM using the forward-backward re-estimation formulas already defined.
• The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.
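A compact sketch of the forward-backward pass for a discrete-observation HMM, using the standard textbook recursions (the continuous-density models used in Sphinx follow the same pattern, and a practical implementation would add scaling or log arithmetic). The state-occupancy posteriors gamma[t, j] are what implicitly "align" each frame to a state of the composite model, and they feed the re-estimation formulas mentioned above.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """A: (N,N) transitions, B: (N,M) emission probs, pi: (N,) initial, obs: int symbols."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # P(state at t | whole utterance)
    return gamma

# Two states, two observation symbols: the posteriors drift from state 0 to 1,
# i.e. the frames are aligned to states without any manual segmentation.
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([1.0, 0.0])
print(forward_backward(A, B, pi, [0, 0, 1, 1, 1]).round(2))
```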
Language Models for Large Vocabulary Speech Recognition
• Instead of using only the acoustic likelihood:
    W* = argmax_W P(X | W)
• recognition can be improved by maximizing the posterior probability:
    W* = argmax_W P(W | X) = argmax_W P(X | W) P(W)
  where P(X | W) is computed by Viterbi decoding over the acoustic models and P(W) is given by the language model.
Language Models for Large Vocabulary Speech Recognition
• Goal:
  – Provide an estimate of the probability of a "word" sequence (w1 w2 w3 ... wQ) for the given recognition task.
• This can be factored with the chain rule:
    P(w1 w2 ... wQ) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wQ | w1 w2 ... wQ-1)
• Since it is impossible to reliably estimate the conditional probabilities for long histories,
• in practice an N-gram language model is used:
    P(wj | w1 w2 ... wj-1) ≈ P(wj | wj-N+1 ... wj-1)
• In practice, reliable estimators are obtained for N = 1 (unigram), N = 2 (bigram), or possibly N = 3 (trigram).
Examples:
• Unigram: P(Maria loves Pedro) = P(Maria) P(loves) P(Pedro)
• Bigram: P(Maria loves Pedro) = P(Maria | <s>) P(loves | Maria) P(Pedro | loves) P(</s> | Pedro)
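A toy illustration of the bigram factorization on the slide, computed in the log domain. The probabilities below are invented for the example; a real model would estimate them from counts (next slide) and apply smoothing.

```python
import math

# Hypothetical bigram probabilities for the sentence on the slide
bigram = {
    ("<s>", "Maria"): 0.20,
    ("Maria", "loves"): 0.10,
    ("loves", "Pedro"): 0.05,
    ("Pedro", "</s>"): 0.30,
}

def bigram_logprob(words):
    """log P(sentence) under the bigram model, with <s> and </s> markers."""
    words = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram[(w1, w2)]) for w1, w2 in zip(words, words[1:]))

print(bigram_logprob(["Maria", "loves", "Pedro"]))   # log P(Maria loves Pedro)
```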
CMU-Cambridge Language Modeling Tools
• $base_dir/c_scripts/languageModelling
P(Wi | Wi-2, Wi-1) = C(Wi-2 Wi-1 Wi) / C(Wi-2 Wi-1)

where
  C(Wi-2 Wi-1 Wi) = total number of times the sequence Wi-2 Wi-1 Wi was observed
  C(Wi-2 Wi-1)    = total number of times the sequence Wi-2 Wi-1 was observed
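The same maximum-likelihood trigram estimate computed from a toy corpus (two start symbols give the first word a full trigram history). This is only the raw relative frequency; real toolkits such as the CMU-Cambridge tools mentioned above add discounting and back-off on top of these counts.

```python
from collections import Counter

# Toy corpus of two sentences
corpus = [
    "<s> <s> Maria loves Pedro </s>".split(),
    "<s> <s> Maria loves Juan </s>".split(),
]

# Count trigrams C(w1 w2 w3) and their bigram histories C(w1 w2)
tri = Counter(t for s in corpus for t in zip(s, s[1:], s[2:]))
bi  = Counter(b for s in corpus for b in zip(s, s[1:]))

def p_trigram(w1, w2, w3):
    """P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2)."""
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

print(p_trigram("Maria", "loves", "Pedro"))   # 0.5
```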