Research on the Modeling of Chinese Continuous Speech Recognition


Research on the Modeling of Chinese Continuous Speech Recognition Xiao Xi

Content
Tri-phone Modeling of Chinese Continuous Speech
Pinyin Pre-processing of the Language Model

Issues in selecting appropriate acoustic units
Acceptable accuracy: the unit should be accurate enough to represent acoustic-phonetic events.
Easy to model: new word models can be derived from these predefined units.
Easy to train: we should have enough data to estimate the unit parameters.
Word, syllable, semi-syllable, or phoneme units?

Characteristics of Chinese
Chinese is a tonal language with 4 basic tone patterns.
Tone is meaningful for understanding, e.g. mai3 (买, to buy) and mai4 (卖, to sell) have opposite meanings.
About 1254 tonal syllables and 408 toneless syllables.
Pinyin is the transcription of pronunciation.

Characteristics of Chinese (cont.)
All characters are monosyllabic.
Each syllable is composed of an initial and a final semi-syllable.
The initial semi-syllable is mainly the consonant of a Chinese syllable.
The final semi-syllable follows the initial and is mainly a simple or compound vowel.

Bi-phone modeling of the Initial-Final structure
Bi-phone modeling only considers the intra-syllable constraint, i.e. the phonetically valid combinations of initial and final semi-syllables.
100 initial models
41 final models (toneless) or 164 final models (with tone)

Bi-phone modeling by HMM
An initial is modeled by a 2-state HMM.
A final is modeled by a 4-state HMM.
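As an illustration of these topologies (my own sketch, not from the slides), the initial and final models can be written as left-to-right transition matrices; the 0.6 self-loop probability is an arbitrary placeholder, since the slides give only the state counts:

```python
# Sketch: left-to-right HMM transition matrices for an initial
# (2 emitting states) and a final (4 emitting states) semi-syllable.
def left_to_right_transitions(n_states, self_loop=0.6):
    """Each state either stays in place (self_loop) or advances one step;
    the extra last column is a non-emitting exit state."""
    A = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for i in range(n_states):
        A[i][i] = self_loop          # stay in the current state
        A[i][i + 1] = 1.0 - self_loop  # move to the next state
    return A

initial_hmm = left_to_right_transitions(2)  # initial semi-syllable: 2 states
final_hmm = left_to_right_transitions(4)    # final semi-syllable: 4 states

# A whole syllable model concatenates the two: 2 + 4 = 6 emitting states.
print(len(initial_hmm), len(final_hmm))  # 2 4
```

Concatenating the initial and final chains gives the 6-state syllable model implied by the Initial-Final structure.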

Tri-phone modeling of the Initial-Final structure
The left and right contexts of each semi-syllable are considered.
Semi-syllables with different contexts are regarded as different tri-phones.
The number of tri-phone models increases dramatically.
Sharing techniques are employed to trade off model accuracy against the shortage of training data.

Tri-phone modeling by HMM
Considers the co-articulation influence of the previous syllable.

The Sharing Strategy
There are too many models if we expand tri-phones directly from the bi-phone models, e.g. 164 × 100 × 164 tri-phones.
The intra-syllable initial-final models remain unchanged.
The inter-syllable tri-phone expansion is derived from the final-class and initial-class definitions (sharing).

The Sharing Strategy (cont.)
Classification of the final models:
Categorized into 29 classes according to the ending vowel; 30 classes if SILENCE is included.
Two schemes: toneless classification (29 classes) and tonal classification (112 classes).

The Sharing Strategy (cont.)
Classification of the initial models:
Categorized into 27 classes, considering the influence of the previous final; 28 classes if SILENCE is included.
The tone of the syllable is regarded as less important in initial modeling.
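A back-of-the-envelope count (my own arithmetic, using the class sizes quoted above) shows how much the class-based sharing shrinks the inventory of context-dependent units:

```python
# Naive tri-phone expansion: every final x initial x final combination.
initials, finals = 100, 164
full_triphones = finals * initials * finals
print(full_triphones)  # 2689600

# With sharing: each initial is conditioned only on the left final class
# (29 classes + SILENCE), each final only on the right initial class
# (27 classes + SILENCE).
final_classes = 29 + 1
initial_classes = 27 + 1
shared = initials * final_classes + finals * initial_classes
print(shared)  # 7592
```

So sharing reduces the unit inventory by roughly three orders of magnitude, which is what makes the models trainable on a limited corpus.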

The Sharing Strategy (cont.)

Tri-phone Experiment
Different tri-phone models for the experiments (the bi-phone system is the baseline).

Experimental Results – 1st Candidate
[Table not transcribed. Training: 863 + Intel bj sh male data; tested on the '98 test data.]

Experimental Results – 25 Candidates
[Table not transcribed. Training: 863 + Intel bj sh male data; tested on the '98 test data.]

Error rate vs. model complexity

Advantages of Phonetic-Context-Based Tri-phone Modeling
The training algorithm is easy to implement and time-saving.
It works with less training data.
Tri-phone models based on phonetic-context knowledge are accurate and significantly improve ASR performance.

Language Model for Chinese Continuous Speech Recognition
Capable of processing the multi-length, multi-candidate output of the ASR.
Tolerant of the ASR's deletion, insertion, and substitution errors.
Converts Pinyin strings to Chinese characters correctly.

Framework of speech recognition
Here W is the spoken sentence, A is its Pinyin string, and O is the observed acoustic features of the sentence. The sentence W comprises L Chinese characters.

Framework of speech recognition (cont.)
For simplicity, we replace the summation over all Pinyin strings with the likelihood of the best Pinyin candidate. Here P(W, A) is the Chinese language model and P(O|A) is the acoustic model. In the following, we focus on the language model.
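The slide's formulas were not transcribed; a plausible reconstruction from the definitions in the surrounding text (W: sentence, A: Pinyin string, O: acoustic observations) is:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \sum_{A} P(W, A)\, P(O \mid A)
        \approx \arg\max_{W} \max_{A} P(W, A)\, P(O \mid A)
```

The last step is the best-candidate approximation described in the text: the sum over all Pinyin strings A is replaced by the single most likely candidate.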

Multi-pass strategy of the Language Model
Here P(W|A) is the Pinyin-to-Chinese-character conversion model, P(A) is the Pinyin language model, P(W) is the Chinese character language model, and P(O|A) is the acoustic model. In the multi-pass language model, the Pinyin model first refines the output of the acoustic model, which is then fed into the P(W|A) conversion model.

Advantages of the multi-pass language model
The Pinyin-based tri-gram model is much simpler than the character-based tri-gram model: there are at most 1254 tonal syllables, versus at least 6000 frequently used Chinese characters.
Processing the multi-length, multi-candidate output of the acoustic model takes acceptable time.

Convert the Pinyin Lattice to Words
Example: the utterance is “我们来了” (“We have come”). The Pinyin lattice after rescoring is:
wo3 men2 lai2 le1
wo4 men1 lai4 le4
wo1 men4 lei1 lo1
wu3 min2 la1 lei1
The word graph is created by consulting the lexicon and the LM (tri-gram):

[Word graph from Start to End; candidate words per position include 我们, 我/们, 握/门, 五/民, 屋门; 来/赖/莱/拉/啦; 了/勒/乐/垒/嘞.]

The result is the best path through the word graph: 我们 来 了
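The lattice search above can be sketched as a small Viterbi pass. This is a toy illustration with made-up acoustic scores and a uniform stand-in for the trained Pinyin bigram, not the paper's implementation:

```python
import math

# Each position holds Pinyin candidates with acoustic log-scores (made up).
lattice = [
    {"wo3": -0.1, "wo4": -1.2, "wo1": -1.5, "wu3": -2.0},
    {"men2": -0.2, "men1": -1.0, "men4": -1.4, "min2": -2.1},
    {"lai2": -0.3, "lai4": -1.1, "lei1": -1.6, "la1": -2.2},
    {"le1": -0.2, "le4": -0.9, "lo1": -1.7, "lei1": -2.3},
]

def bigram_logp(prev, cur):
    """Stand-in for a trained Pinyin bigram model (uniform here)."""
    return math.log(0.25)

def viterbi(lattice):
    # paths: candidate -> (total log-score, best path ending in it)
    paths = {c: (s, [c]) for c, s in lattice[0].items()}
    for column in lattice[1:]:
        new_paths = {}
        for cur, acoustic in column.items():
            best = max(
                (score + bigram_logp(path[-1], cur) + acoustic, path)
                for score, path in paths.values()
            )
            new_paths[cur] = (best[0], best[1] + [cur])
        paths = new_paths
    return max(paths.values())

score, best_path = viterbi(lattice)
print(best_path)  # ['wo3', 'men2', 'lai2', 'le1']
```

With a real bigram the language model can override the top acoustic candidate at a position, which is exactly the refinement the multi-pass strategy relies on.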

Experiment on the Pinyin language model
Training corpus: 20 million Chinese characters.
Test set: 1680 sentences, about 35,455 Chinese characters.
Acoustic model: tri-phone, duration-distribution-based HMM.

Experimental results

Conclusions from the Experiment on the Pinyin Language Model
The accuracy of the refined candidates improves dramatically.
The first candidate's hit rate improves by 45% with the tri-gram Pinyin model.
The top-20 candidates' hit rate (97.21%) exceeds the top-100 candidates' hit rate (97.12%) of the baseline system.

The End

Q & A