Research on the Modeling of Chinese Continuous Speech Recognition

Xiao Xi

Contents
Tri-phone Modeling of Chinese Continuous Speech
Pinyin Pre-processing of Language Model

Issues in selecting appropriate acoustic units
Acceptable accuracy: the unit should be accurate enough to represent acoustic-phonetic events.
Easy to model: new word models can be derived from the predefined units.
Easy to train: there should be enough data to estimate the unit parameters.
Word, syllable, semi-syllable, or phoneme unit?

Characteristics of Chinese
Chinese is a tonal language with 4 basic tone patterns.
Tone carries meaning: for example, mai3 (买, to buy) and mai4 (卖, to sell) have opposite meanings.
There are about 1254 tonal syllables and 408 un-tonal syllables.
Pinyin is the transcription of pronunciation.

Characteristics of Chinese (cont.)
All characters are monosyllabic.
Each syllable is composed of an initial and a final semi-syllable.
The initial semi-syllable is usually the consonant of the syllable.
The final semi-syllable follows the initial and is usually a simple or compound vowel.
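As an illustration of the initial/final structure, the following minimal sketch splits a tone-numbered Pinyin syllable into its initial, final, and tone. The initial inventory (including the treatment of `y` and `w` as initials) and the function name `split_syllable` are assumptions made for this example and are not taken from the slides.

```python
# Pinyin consonant initials; two-letter initials come first so that "zh" is
# matched before "z". Treating "y" and "w" as initials is a simplification
# assumed for this example.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable):
    """Split e.g. 'mai3' into (initial, final, tone).

    The tone is None when the syllable carries no tone digit.
    """
    tone = None
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    # Zero-initial syllable such as "an": the whole syllable is the final.
    return "", syllable, tone
```

With this sketch, `split_syllable("mai3")` and `split_syllable("mai4")` share the initial `m` and the final `ai` and differ only in tone.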

Bi-phone modeling of Initial-Final structure
Bi-phone modeling considers only the intra-syllable constraint, i.e. the legal combinations of initial and final semi-syllables according to phonetic knowledge.
100 initial models
41 final models (un-tonal) or 164 final models (tonal)

Bi-phone modeling by HMM
An initial is modeled by a 2-state HMM.
A final is modeled by a 4-state HMM.
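The state layout above can be sketched as transition matrices. The example below assumes a left-to-right topology without skip transitions and a placeholder self-loop probability of 0.5; neither is stated on the slides, and the function names are hypothetical.

```python
def left_to_right(num_states, self_loop=0.5):
    """Transition matrix of a left-to-right HMM without skips.

    Row i holds the transition probabilities out of emitting state i.
    The extra last column is the exit transition out of the model.
    The self-loop value is a placeholder, not a trained parameter.
    """
    matrix = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # advance to the next state (or exit)
        matrix.append(row)
    return matrix


def concatenate(first, second):
    """Chain two models: leaving `first` enters state 0 of `second`."""
    n1, n2 = len(first), len(second)
    matrix = []
    for row in first:
        # The exit column of `first` (index n1) becomes state 0 of `second`.
        matrix.append(row + [0.0] * n2)
    for row in second:
        matrix.append([0.0] * n1 + row)
    return matrix


initial_hmm = left_to_right(2)   # 2-state initial
final_hmm = left_to_right(4)     # 4-state final
syllable_hmm = concatenate(initial_hmm, final_hmm)  # 6 emitting states
```

The concatenated syllable model has 6 emitting states plus one exit column, and every row still sums to 1.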

Tri-phone modeling of Initial-Final structure
Both the left context and the right context of a semi-syllable are considered.
Semi-syllables with different contexts are regarded as different tri-phones.
The number of tri-phone models increases dramatically.
Sharing techniques are employed to trade off model accuracy against the shortage of training data.
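A minimal sketch of context expansion: each semi-syllable is relabeled with its left and right neighbours. The `l-c+r` label format follows the common HTK-style convention and the `sil` boundary symbol is an assumption; the slides do not specify a naming scheme.

```python
def to_triphones(units, boundary="sil"):
    """Label each unit with its neighbours in 'left-center+right' form."""
    padded = [boundary] + list(units) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


# Semi-syllables of "wo3 men2": initial, final, initial, final.
labels = to_triphones(["w", "o3", "m", "en2"])
```

The same initial `m` preceded by a different final would receive a different label, which is why the number of distinct models grows so quickly.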

Tri-phone modeling by HMM
Considering the co-articulation influence of the previous syllable

The Sharing Strategy
Expanding the bi-phone models directly into tri-phone models yields too many models, e.g. 164 × 100 × 164 tri-phones.
The intra-syllable initial-final model remains unchanged.
The inter-syllable tri-phone expansion is derived from the final-class and initial-class definitions (sharing).
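The size of the unshared model set follows from simple multiplication. The sketch below reads 164 × 100 × 164 as (left final context) × (initial model) × (right final context); that reading and the function name are assumptions.

```python
NUM_INITIAL_MODELS = 100      # bi-phone initial models, from the slides
NUM_TONAL_FINAL_MODELS = 164  # tonal final models, from the slides


def naive_triphone_count(left_contexts, centers, right_contexts):
    """Number of models when every context combination gets its own model."""
    return left_contexts * centers * right_contexts


unshared = naive_triphone_count(NUM_TONAL_FINAL_MODELS,
                                NUM_INITIAL_MODELS,
                                NUM_TONAL_FINAL_MODELS)
print(unshared)  # 2689600
```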

The Sharing Strategy (cont.)
Classification of the final models
Finals are categorized into 29 classes according to the ending vowel (30 classes if SILENCE is included).
Two schemes:
Un-tonal classification: 29 classes
Tonal classification: 112 classes

Classification of the initial models
Initials are categorized into 27 classes, considering the influence of the previous final (28 classes if SILENCE is included).
The tone of the syllable is regarded as less important in initial modeling.
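The class-based sharing can be sketched as a relabeling step: the intra-syllable part of a label is kept, and only the cross-syllable context is replaced by a class name. The class tables below are toy examples invented for illustration, since the slides do not list the members of the final or initial classes, and the label format and function names are likewise assumptions.

```python
# Toy class tables (hypothetical). Finals are grouped by ending vowel,
# initials by a made-up articulation grouping.
FINAL_CLASS = {"a": "F_a", "ia": "F_a", "ua": "F_a",
               "i": "F_i", "ai": "F_i", "ei": "F_i",
               "o": "F_o", "uo": "F_o", "ao": "F_o"}
INITIAL_CLASS = {"b": "I_1", "p": "I_1", "m": "I_1",
                 "d": "I_2", "t": "I_2", "l": "I_2"}


def shared_initial_label(prev_final, initial, final):
    """Initial model whose left context is the class of the previous final.

    prev_final=None stands for SILENCE before the syllable.
    """
    left = "SIL" if prev_final is None else FINAL_CLASS[prev_final]
    return f"{left}-{initial}+{final}"


def shared_final_label(initial, final, next_initial):
    """Final model whose right context is the class of the next initial.

    next_initial=None stands for SILENCE after the syllable.
    """
    right = "SIL" if next_initial is None else INITIAL_CLASS[next_initial]
    return f"{initial}-{final}+{right}"
```

In this sketch, an initial `l` followed by `ai` maps to the same model whether the previous final was `ai` or `ei`, because both belong to the same toy class.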

Tri-phone Experiment
Different tri-phone models are compared in the experiment, with the bi-phone system as the baseline.

Experimental Results – 1st Cand
863 + Intel bj sh male, tested on 98test data

Experimental Results – 25 Cands
863 + Intel bj sh male, tested on 98test data

Error rate vs. model complexity

Advantage of Phonetic Context Based Tri-phone Modeling
The training algorithm is easy to implement and saves time.
Training with less data is possible.
Tri-phone models based on phonetic context knowledge are accurate and can significantly improve ASR performance.

Language Model for Chinese Continuous Speech Recognition
Capable of processing the multi-length, multi-candidate output of the ASR
Tolerant of the ASR's deletion, insertion, and substitution errors
Converts Pinyin strings to Chinese characters correctly

Framework of speech recognition
Here W is the spoken sentence, A is its Pinyin string, and O is the observed acoustic feature sequence of the sentence. The sentence W consists of L Chinese characters.

Framework of speech recognition (cont.)
For simplicity, the summation Σ over Pinyin candidates is replaced by the likelihood of the best Pinyin candidate. Here P(W, A) is the Chinese language model and P(O/A) is the acoustic model. In the following, we focus on the language model.
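One way to write the decoding rule that these definitions describe is given below. It is a sketch under two assumptions: the standard Bayes decomposition, and that the observation O depends on the sentence W only through its Pinyin string A. Here P(O \mid A) corresponds to the slides' P(O/A).

```latex
\begin{align*}
\hat{W} &= \arg\max_{W} P(W \mid O)
         = \arg\max_{W} \sum_{A} P(W, A \mid O) \\
        &= \arg\max_{W} \sum_{A} P(O \mid A)\, P(W, A) \\
        &\approx \arg\max_{W} \max_{A} P(O \mid A)\, P(W, A)
\end{align*}
```

The second line drops the factor 1/P(O), which does not depend on W, and the last line replaces the sum over A by its largest term.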

Multi-pass strategy of Language Model
Here P(W/A) is the Pinyin-to-Chinese-character conversion model, P(A) is the Pinyin language model, P(W) is the Chinese character language model, and P(O/A) is the acoustic model. In the multi-pass language model, the Pinyin model first refines the output of the acoustic model, and the refined output is then fed into the P(W/A) model.
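The factorization behind these definitions can be written out as follows. The first line is just the product rule and Bayes' rule, with P(W \mid A) standing for the slides' P(W/A); the two-pass form in the last two lines is an assumption about how the passes are chained, not a formula taken from the slides.

```latex
\begin{align*}
P(W, A) &= P(W \mid A)\, P(A), \qquad
P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)} \\
\hat{A} &= \arg\max_{A} P(O \mid A)\, P(A)
  && \text{(pass 1: the Pinyin model refines the acoustic output)} \\
\hat{W} &= \arg\max_{W} P(W \mid \hat{A})
  && \text{(pass 2: Pinyin-to-character conversion)}
\end{align*}
```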

Advantage of the multi-pass language model
A Pinyin-based tri-gram model is much simpler than a character-based tri-gram model: at most 1254 tonal syllables versus at least 6000 frequently used Chinese characters.
The multi-length, multi-candidate output of the acoustic model can be processed in acceptable time.
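In terms of possible tri-gram entries, the gap is 1254³ ≈ 1.97 × 10⁹ for tonal syllables against 6000³ = 2.16 × 10¹¹ for characters. The sketch below shows how a small Pinyin tri-gram model could rescore acoustic candidates. The add-one smoothing, the `lm_weight` parameter, and all function names are assumptions for illustration; the slides do not describe the smoothing or score combination actually used.

```python
import math
from collections import Counter


def train_trigram(sentences):
    """Count tri-grams and their bi-gram histories over Pinyin sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for syllables in sentences:
        padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(syllables, model):
    """Add-one smoothed tri-gram log-probability of a Pinyin string."""
    tri, bi, vocab_size = model
    padded = ["<s>", "<s>"] + list(syllables) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        history = tuple(padded[i - 2:i])
        trigram = tuple(padded[i - 2:i + 1])
        total += math.log((tri[trigram] + 1) / (bi[history] + vocab_size))
    return total


def rescore(candidates, model, lm_weight=1.0):
    """Order (syllables, acoustic_log_score) pairs by combined score."""
    scored = [(acoustic + lm_weight * log_prob(syllables, model), syllables)
              for syllables, acoustic in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [syllables for _, syllables in scored]
```

A candidate with a slightly worse acoustic score can move to the top when its syllable sequence is much more likely under the Pinyin model.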

Convert Pinyin Lattice to Words
Example: the utterance is “我们来了” (We have come). The Pinyin lattice after rescoring is:
wo3 men2 lai2 le1
wo4 men1 lai4 le4
wo1 men lei1 lo1
wu3 min la lei1
The word graph is created by checking the lexicon and the LM (trigram):

[Figure: word graph from Start to End over the candidates 我们来了, 我们赖勒, 握门莱乐, 五民拉垒, 屋门啦嘞]

The result is the best path through the word graph:
[Figure: word graph from Start to End over the candidates 我们来了, 我们赖勒, 握门莱乐, 五民拉垒, 屋门啦嘞]
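A best-path search of this kind can be sketched with dynamic programming. For brevity the example scores words with unigram log-probabilities rather than the trigram LM named on the slide, and it works on a single Pinyin sequence instead of the full lattice. The lexicon entries and their probabilities are hypothetical values chosen for illustration.

```python
import math

# Hypothetical lexicon: Pinyin tuple -> list of (word, log-probability).
LEXICON = {
    ("wo3",): [("我", math.log(0.05))],
    ("men2",): [("门", math.log(0.01))],
    ("wo3", "men2"): [("我们", math.log(0.03))],
    ("lai2",): [("来", math.log(0.02)), ("莱", math.log(0.001))],
    ("le1",): [("了", math.log(0.04)), ("勒", math.log(0.0005))],
}


def best_path(syllables, lexicon, max_word_len=4):
    """Viterbi-style search for the highest-scoring word sequence."""
    n = len(syllables)
    # best[i] = (score, words) of the best path covering syllables[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            prev_score, prev_words = best[start]
            if prev_score == -math.inf:
                continue  # no path reaches this start position
            key = tuple(syllables[start:end])
            for word, logp in lexicon.get(key, []):
                score = prev_score + logp
                if score > best[end][0]:
                    best[end] = (score, prev_words + [word])
    return best[n][1]


words = best_path(["wo3", "men2", "lai2", "le1"], LEXICON)
```

With these toy scores the two-syllable word 我们 beats the character-by-character reading 我 + 门, so the search returns 我们, 来, 了.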

Experiment on Pinyin language model
Training corpus: 20 million Chinese characters
Testing sentences: 1680 sentences, about Chinese characters.
Acoustic model: tri-phone, duration-distribution-based HMM

Experiment result

Conclusion from Experiment on Pinyin language model
Dramatic improvement in the accuracy of the refined candidates
45% improvement in the first candidate's hit rate by using the tri-gram Pinyin model
The top-20 candidates' hit rate (97.21%) exceeds the baseline system's top-100 candidates' hit rate (97.12%)
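A top-N hit rate of the kind quoted above can be computed as sketched below. The definition used here (the reference item appears among the first N ranked candidates for its position) is an assumption, since the slides do not spell out how the hit rate is measured; the function name is hypothetical.

```python
def top_n_hit_rate(references, candidate_lists, n):
    """Fraction of references found among their first n ranked candidates."""
    if not references:
        raise ValueError("references must not be empty")
    hits = sum(1 for ref, cands in zip(references, candidate_lists)
               if ref in cands[:n])
    return hits / len(references)


references = ["wo3", "men2", "lai2"]
candidate_lists = [["wo3", "wo4"], ["men1", "men2"], ["lei1", "la"]]
top1 = top_n_hit_rate(references, candidate_lists, 1)
top2 = top_n_hit_rate(references, candidate_lists, 2)
```

Comparing a refined top-20 rate against a baseline top-100 rate, as the conclusion does, shows that rescoring lets a much shorter candidate list carry the same coverage.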

The End
