Research on the Modeling of Chinese Continuous Speech Recognition Xiao Xi
Content: Tri-phone Modeling of Chinese Continuous Speech; Pinyin Pre-processing of Language Model
Issues in selecting appropriate acoustic units. Acceptable accuracy: the unit should be accurate enough to represent acoustic-phonetic events. Easy to model: new word models can be derived from these predefined units. Easy to train: there should be enough data to estimate the unit parameters. Word, syllable, semi-syllable or phoneme unit?
Characteristics of Chinese. Chinese is a tonal language with 4 basic tone patterns. Tone carries meaning, e.g. mai3 (买, to buy) and mai4 (卖, to sell) have opposite meanings. There are about 1254 tonal syllables and 408 un-tonal syllables. Pinyin is the transcription of pronunciation.
Characteristics of Chinese. All characters are monosyllabic. Each syllable is composed of an initial and a final semi-syllable. The initial semi-syllable is mainly the consonant of a Chinese syllable. The final semi-syllable follows the initial semi-syllable and is mainly a simple or compound vowel.
Bi-phone modeling of Initial-Final structure. Bi-phone modeling considers only the intra-syllable constraint, i.e. the combinations of initial and final semi-syllables that are valid according to phonetic knowledge. 100 initial models. 41 final models (un-tonal) or 164 final models (considering tone).
Bi-phone modeling by HMM: an initial is modeled by a 2-state HMM; a final is modeled by a 4-state HMM.
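As a rough illustration of this topology, the sketch below builds left-to-right transition tables for a 2-state initial and a 4-state final and chains them into one 6-state syllable model. The slides give no transition values and do not say whether state skips are allowed, so the self-loop probability of 0.5, the no-skip topology, and the function names are all assumptions made for illustration.

```python
def left_to_right(n_states, self_loop=0.5):
    """Build a no-skip left-to-right transition table.

    Each row has n_states + 1 columns; the last column is the exit
    transition. The self_loop value is a placeholder, not a trained value.
    """
    rows = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop              # stay in the same state
        row[i + 1] = 1.0 - self_loop    # advance to the next state (or exit)
        rows.append(row)
    return rows


def concat(a, b):
    """Chain model a into model b: a's exit leads into b's first state."""
    nb = len(b)
    na = len(a)
    out = [row + [0.0] * nb for row in a]       # a's exit column now points at b[0]
    out += [[0.0] * na + row for row in b]      # b keeps its own exit column
    return out


initial_hmm = left_to_right(2)   # initial semi-syllable: 2 states
final_hmm = left_to_right(4)     # final semi-syllable: 4 states
syllable_hmm = concat(initial_hmm, final_hmm)
```

In a real system the transition probabilities would be estimated from data; the point here is only the state layout of an initial followed by a final.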
Tri-phone modeling of Initial-Final structure. The left and right context of each semi-syllable is considered. Semi-syllables with different contexts are regarded as different tri-phones. The number of tri-phone models increases dramatically. Sharing techniques are employed to trade off model accuracy against the shortage of training corpus.
Tri-phone modeling by HMM: considers the co-articulation influence of the previous syllable.
The Sharing Strategy. Evolving the tri-phone models directly from the bi-phone models yields too many models, e.g. 164*100*164 tri-phones. The intra-syllable initial-final model remains unchanged. The inter-syllable tri-phone expansion is derived from the final-class/initial-class definition (sharing).
The Sharing Strategy (cont.). Classification of the final models: categorized into 29 classes according to the ending vowel (30 classes if SILENCE is included). Two schemes: un-tonal classification, 29 classes; tonal classification, 112 classes.
The Sharing Strategy (cont.). Classification of the initial models: categorized into 27 classes, considering the influence of the previous FINAL (28 classes if SILENCE is included). The tone of the syllable is regarded as less important in initial modeling.
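To make the effect of sharing concrete, the sketch below compares the naive tri-phone count quoted in the slides with a rough upper bound under class-based contexts. The bound rests on an assumption that is not confirmed by the slides: each initial model is expanded by the class of the preceding final, and each final model by the class of the following initial. The function names are hypothetical, and the model counts actually used in the experiments are not recoverable from this text.

```python
# Figures taken from the slides.
N_INITIALS = 100          # bi-phone initial models
N_FINALS_TONAL = 164      # final models, considering tone
N_FINAL_CLASSES = 30      # 29 un-tonal final classes + SILENCE
N_INITIAL_CLASSES = 28    # 27 initial classes + SILENCE


def naive_triphone_count(n_finals, n_initials):
    """Full expansion quoted in the slides: final * initial * final."""
    return n_finals * n_initials * n_finals


def shared_upper_bound(n_initials, n_finals, n_final_classes, n_initial_classes):
    """Upper bound under the ASSUMED class-based expansion.

    Initials get a left context (class of the previous final);
    finals get a right context (class of the next initial).
    """
    context_initials = n_initials * n_final_classes
    context_finals = n_finals * n_initial_classes
    return context_initials + context_finals


naive = naive_triphone_count(N_FINALS_TONAL, N_INITIALS)
shared = shared_upper_bound(
    N_INITIALS, N_FINALS_TONAL, N_FINAL_CLASSES, N_INITIAL_CLASSES
)
print(naive, shared)
```

Under these assumptions the class-based scheme stays in the thousands of models, while the full expansion runs to millions, which is the trade-off the sharing strategy is meant to address.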
The Sharing Strategy (cont.)
Tri-phone Experiment: different tri-phone models were compared (bi-phone is the baseline system).
Experimental Results – 1st Cand: 863 + Intel bj sh male, TEST ON 98test data
Experimental Results – 25 Cands: 863 + Intel bj sh male, TEST ON 98test data
Error rate vs. model complexity
Advantage of Phonetic Context Based Tri-phone Modeling. The training algorithm is easy to implement and saves time. Training with less data is possible. Tri-phone models based on phonetic context knowledge are accurate and can significantly improve ASR performance.
Language Model for Chinese Continuous Speech Recognition: capable of processing multi-length and multi-candidate output of the ASR; tolerant of deletion, insertion and substitution errors of the ASR; converts Pinyin strings to Chinese characters correctly.
Framework of speech recognition: W is the spoken sentence, A is its Pinyin sequence, and O is the observation of the sentence’s acoustic features. The sentence W consists of L Chinese characters.
Framework of speech recognition (cont.): for simplicity, the summation (Σ) over Pinyin sequences is replaced by the likelihood of the best Pinyin candidate. Here P(W, A) is the Chinese language model and P(O|A) is the acoustic model. In the following, we focus on the language model.
Multi-pass strategy of Language Model: P(W|A) is the Pinyin-to-Chinese-character conversion model, P(A) is the Pinyin language model, P(W) is the Chinese character language model, and P(O|A) is the acoustic model. In the multi-pass language model, the Pinyin model is used to refine the output of the acoustic model, and the refined output is then fed into the P(W|A) model.
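The formulas on the framework and multi-pass slides did not survive extraction. The block below is a reconstruction from the symbol definitions given in the text, not necessarily the author's exact notation. It assumes that the acoustics depend on the sentence only through its Pinyin sequence, i.e. P(O|W, A) = P(O|A), and it drops the constant P(O).

```latex
\begin{align}
\hat{W} &= \arg\max_{W} P(W \mid O)
         = \arg\max_{W} \sum_{A} P(O \mid A)\, P(W, A) \\
        &\approx \arg\max_{W} \max_{A} \; P(O \mid A)\, P(W, A) \\
        &= \arg\max_{W} \max_{A} \; P(O \mid A)\, P(A)\, P(W \mid A),
\qquad W = w_1 w_2 \cdots w_L
\end{align}
```

The second line replaces the sum by its largest term, and the third factors P(W, A) = P(A) P(W|A), which is what allows the Pinyin model P(A) to be applied in a first pass before the conversion model. One way the character language model P(W) could enter is through Bayes' rule, P(W|A) = P(A|W) P(W) / P(A); whether the original slide used this form is not known.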
Advantage of the multi-pass language model. A Pinyin-based tri-gram model is much simpler than a character-based tri-gram model: at most 1254 tonal syllables versus at least 6000 frequently used Chinese characters. The time needed to process the multi-length and multi-candidate output of the acoustic model is acceptable.
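The size difference can be put in numbers with a quick calculation. The sketch below compares the number of possible tri-grams over the two vocabularies. It is only an upper bound on the event space: real tri-gram models are sparse and store far fewer entries, and the function name is introduced here for illustration.

```python
def trigram_space(vocab_size):
    """Number of distinct tri-grams that could occur over a vocabulary."""
    return vocab_size ** 3


pinyin_trigrams = trigram_space(1254)   # tonal syllables (from the slides)
char_trigrams = trigram_space(6000)     # frequently used characters (lower bound)
ratio = char_trigrams / pinyin_trigrams

print(f"Pinyin tri-grams:    {pinyin_trigrams:,}")
print(f"Character tri-grams: {char_trigrams:,}")
print(f"Ratio: about {ratio:.0f}x")
```

With these vocabulary sizes the character-level space is roughly two orders of magnitude larger than the Pinyin-level space.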
Convert Pinyin Lattice to Words. Example: the speech is “我们来了” (“We have come”). The Pinyin lattice after rescoring is:
wo3 men2 lai2 le1
wo4 men1 lai4 le4
wo1 men4 lei1 lo1
wu3 min2 la1 lei1
The word graph is created by checking the lexicon and the LM (trigram):
Word graph (Start → End), one row of candidate words per line:
我们 来 了
我 们 赖 勒
握 门 莱 乐
五 民 拉 垒
屋门 啦 嘞
The result is the best path through the word graph: Start → 我们 → 来 → 了 → End
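The best-path search over such a word graph can be sketched with a small dynamic program. The slides score paths with a tri-gram LM; to stay short, the sketch below swaps in unigram scores, and every probability in it is invented for illustration. The edge list follows the word graph of this example, with positions 0 to 4 marking syllable boundaries; the names `EDGES`, `PROB` and `best_path` are hypothetical.

```python
import math

# Word graph edges: (start_position, end_position, word).
EDGES = [
    (0, 2, "我们"), (0, 2, "屋门"),
    (0, 1, "我"), (0, 1, "握"), (0, 1, "五"),
    (1, 2, "们"), (1, 2, "门"), (1, 2, "民"),
    (2, 3, "来"), (2, 3, "赖"), (2, 3, "莱"), (2, 3, "拉"), (2, 3, "啦"),
    (3, 4, "了"), (3, 4, "勒"), (3, 4, "乐"), (3, 4, "垒"), (3, 4, "嘞"),
]

# HYPOTHETICAL unigram probabilities, not taken from any real corpus.
PROB = {"我们": 0.02, "来": 0.01, "了": 0.03, "我": 0.02, "们": 0.005}
DEFAULT_PROB = 1e-4   # assumed floor for words without a listed probability


def best_path(edges, n_positions):
    """Return (log_score, words) of the highest-scoring Start-to-End path."""
    best = {0: (0.0, [])}   # position -> (best log score, words so far)
    for pos in range(1, n_positions + 1):
        for start, end, word in edges:
            if end != pos or start not in best:
                continue
            score = best[start][0] + math.log(PROB.get(word, DEFAULT_PROB))
            if pos not in best or score > best[pos][0]:
                best[pos] = (score, best[start][1] + [word])
    return best[n_positions]


score, words = best_path(EDGES, 4)
print(" ".join(words))
```

Because every edge points forward, processing positions in increasing order guarantees that the best score at an edge's start is final before the edge is extended. A tri-gram version would keep the last two words as part of the search state instead of the position alone.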
Experiment on Pinyin language model. Training corpus: 20 million Chinese characters. Test set: 1680 sentences, about 35455 Chinese characters. Acoustic model: tri-phone, duration-distribution-based HMM.
Experiment result
Conclusion from Experiment on Pinyin language model. Dramatic improvement in the accuracy of the refined candidates. 45% improvement in the first candidate’s hit rate by using the tri-gram Pinyin model. The top-20 candidates’ hit rate (97.21%) exceeds the top-100 candidates’ hit rate (97.12%) of the baseline system.
The End
Q & A