Problems of Modeling Phone Deletion in Conversational Speech for Speech Recognition
Brian Mak and Tom Ko, Hong Kong University of Science and Technology


Follow-up Work of Our Previous Paper
Tom Ko and Brian Mak, "Improving Speech Recognition by Explicit Modeling of Phone Deletions," in ICASSP. We extend our investigation of modeling phone deletion in conversational speech, and we present some plausible explanations for why phone deletion modeling is more successful in read speech.

Motivations of Modeling Phone Deletion
The phone deletion rate is about 12% in conversational speech. [Greenberg, "Speaking in shorthand — a syllable-centric perspective for understanding pronunciation variation," ESCA Workshop 1998]
Phone deletions cannot be modeled well by triphone training. [Jurafsky, "What kind of pronunciation variation is hard for triphones to model?" ICASSP 2001]

Explicit Modeling of Phone Deletions
Conceptually, we may explicitly model phone deletions by adding skip arcs over a phone sequence (e.g., ah b aw t for the word ABOUT). Practically, this requires a unit bigger than a phone to implement the skip arcs. Since we want to capture the word-specific behavior of phone deletion, we choose to use whole-word models.
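As a minimal sketch of the idea (not the authors' implementation; the function name and the arc-list graph representation are hypothetical), a skip arc can be placed in parallel with each phone arc of a left-to-right phone sequence, so that traversing the skip arc corresponds to deleting that phone:

```python
def build_phone_graph_with_skips(phones):
    """Build a left-to-right phone graph as (from_state, to_state, label) arcs.

    State i sits before phones[i]; the final state is len(phones).
    For each phone we add a normal arc that emits it, plus a parallel
    non-emitting skip arc (label None) that models deletion of that phone.
    A real system would restrict which skip arcs are added and attach
    deletion probabilities to them.
    """
    arcs = []
    for i, phone in enumerate(phones):
        arcs.append((i, i + 1, phone))  # phone is realized
        arcs.append((i, i + 1, None))   # skip arc: phone is deleted
    return arcs

# The word ABOUT with phone sequence ah b aw t:
arcs = build_phone_graph_with_skips(["ah", "b", "aw", "t"])
```

Any path from state 0 to state 4 then spells a pronunciation variant of ABOUT with zero or more phones deleted.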

Problems in Creating Whole Word Models
Assume a vocabulary size of W (~5k) and a set of N (~40) phones:
=> large number of context-dependent models: W^3 vs. N^3
=> training data sparsity

Solutions
To decrease the number of model parameters:
1. Bootstrapping: construct the word models from triphone models.
2. Fragmentation: cut the word units into several segments to limit the increase in the number of units during tri-unit expansion.

Bootstrapping from Triphone Models
Word: ABOUT. Phonetic transcription: ah b aw t. Context-dependent triphone units: sil-ah+b, ah-b+aw, b-aw+t, aw-t+sil. These triphone models are concatenated to bootstrap the whole-word model ah^b^aw^t (ABOUT).
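The triphone expansion above can be sketched as follows (a hypothetical helper, using HTK-style l-c+r triphone names with sil as the surrounding context):

```python
def transcription_to_triphones(phones, left="sil", right="sil"):
    """Expand a phone list into cross-word triphone names (l-c+r notation)."""
    ctx = [left] + list(phones) + [right]
    return ["{}-{}+{}".format(ctx[i], ctx[i + 1], ctx[i + 2])
            for i in range(len(phones))]

# The word ABOUT:
triphones = transcription_to_triphones(["ah", "b", "aw", "t"])
# sil-ah+b, ah-b+aw, b-aw+t, aw-t+sil
word_unit = "^".join(["ah", "b", "aw", "t"])  # ah^b^aw^t
```

The models of these four triphones can then be concatenated to initialize the whole-word unit ah^b^aw^t.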

Fragmentation
Consider the model of the word "CONSIDER": the whole-word unit k^ah^n^s^ih^d^er is cut into four sub-word-unit segments (the slide shows the phone string k ah n s ih d er divided into 1st, 2nd, 3rd and 4th segments).

Fragmentation
Assume the vocabulary size is 5k and #monophones is 40. Word: ABOUT, phonetic transcription: ah b aw t.

                 Not fragmented        Fragmented
CI mono-unit :   ah^b^aw^t             ah, b^aw, t
CD tri-unit  :   ?-ah^b^aw^t+?         ?-ah+b^aw, ah-b^aw+t, b^aw-t+?
#models      :   40 x 5k x 40 = 8M     40 x 5k x 2 + 5k = 0.4M
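The model counts above follow from simple arithmetic; this sketch just reproduces them under the slide's assumptions (5k words, 40 monophones):

```python
V = 5_000  # vocabulary size
N = 40     # number of monophones, hence of cross-word contexts

# Not fragmented: every whole-word unit needs every left x right
# cross-word phone context.
models_not_fragmented = N * V * N  # 8,000,000 (= 8M)

# Fragmented (as on the slide): only the first and last fragment of each
# word see cross-word contexts (N variants each); interior fragments are
# word-internal, contributing roughly one unit per word.
models_fragmented = N * V * 2 + V  # 405,000 (~ 0.4M)
```

Fragmentation thus cuts the number of context-dependent units by roughly a factor of 20 in this setting.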

Context-dependent Fragmented Word Models (CD-FWM)
Consider the model of the word "CONSIDER" again: the same four sub-word-unit segments of k^ah^n^s^ih^d^er are now made context-dependent (the slide repeats the fragmentation figure with context-dependent segments).

Setup of the Read Speech Experiment
Training set : WSJ0 + WSJ1 (46,995 utterances), about 44 hours of read speech, 302 speakers
Dev. set : WSJ0 Nov92 evaluation set (330 utterances)
Test set : WSJ1 Nov93 evaluation set (205 utterances)
Vocabulary size in test set : 5,000
#Triphones : 17,107
#HMM states : 5,864
#Gaussians / state : 16
#States / phone : 3
Language model : bigram
Feature vector : standard 39-dimensional MFCC

Result of the Read Speech Experiment

Model                                                    #CD Phones   #CD SWUs   #Skip Arcs   Word Acc.
Baseline triphones                                       17,107       -          -            ...%
CD-FWM : no phone deletion arcs                          -            58,581     -            ...%
CD-FWM : + phone deletion arcs (words with >= 4 phones)  -            58,581     11,...       92.4%

Setup of the Conversational Speech Experiment
Training set : Partitions A, B and C of the SVitchboard 500-word tasks (13,597 utterances), about 3.69 hours of conversational speech, 324 speakers
Dev. set : Partition D of the SVitchboard 500-word tasks (4,871 utterances)
Test set : Partition E of the SVitchboard 500-word tasks (5,202 utterances)
Vocabulary size in test set : 500
#Triphones : 4,558
#HMM states : 660
#Gaussians / state : 16
#States / phone : 3
Language model : bigram
Feature vector : standard 39-dimensional PLP

Result of the Conversational Speech Experiment

Model                                                    #CD Phones   #CD SWUs   #Skip Arcs   Word Acc.
Baseline triphones                                       4,558        -          -            ...%
CD-FWM : no phone deletion arcs                          -            4,...      -            ...%
CD-FWM : + phone deletion arcs (words with >= 4 phones)  -            4,...      ...          44.43%

Analysis of Word Tokens Coverage
Words differ greatly in their frequency of occurrence in spoken English. In conversational speech, the most common words occur far more frequently than the least common ones, and most of them are short words (<= 3 phones). [Greenberg, "Speaking in shorthand — a syllable-centric perspective for understanding pronunciation variation," ESCA Workshop 1998]
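Token coverage of this kind can be computed from a word-frequency list and a pronunciation dictionary. A toy sketch (the helper name and the miniature data are hypothetical, merely mimicking the dominance of short function words in conversational speech):

```python
def long_word_coverage(token_counts, pron_dict, min_phones=4):
    """Fraction of word tokens whose pronunciation has >= min_phones phones."""
    total = sum(token_counts.values())
    long_tokens = sum(count for word, count in token_counts.items()
                      if len(pron_dict[word]) >= min_phones)
    return long_tokens / total

# Toy conversational-style data: short function words dominate.
counts = {"i": 50, "you": 40, "about": 10}
prons = {"i": ["ay"], "you": ["y", "uw"], "about": ["ah", "b", "aw", "t"]}
coverage = long_word_coverage(counts, prons)  # 10 / 100 = 0.1
```

Only the tokens counted by this fraction can benefit from phone deletion arcs when deletion modeling is restricted to words with >= 4 phones.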

Comparison of Word Tokens Coverage
The coverage of words with >= 4 phones is smaller in the conversational speech test set (20% vs. 50% in the read speech test set). The coverage of words with >= 6 phones is much smaller still (3.5% vs. 26%). As a result, the improvement of our proposed method in conversational speech may not be as pronounced as in read speech.

Breakdown of #Words According to Result in the Conversational Speech Experiment

                                    CD-FWM With Phone Deletion
CD-FWM Without Phone Deletion       Correct            Wrong
Correct                             9,...              ... (confused)
Wrong                               342 (corrected)    9,407

Summary & Future Work
We proposed a method of modeling pronunciation variations from the acoustic modeling perspective. The pronunciation weights are captured naturally by the skip-arc probabilities in the context-dependent fragmented word models (CD-FWM).
Currently, phone deletion modeling is not applied to short words (<= 3 phones), which cover 80% of the tokens in conversational speech.
We would like to investigate which set of skip arcs leads to the largest gain. If the skip arcs that cause more confusion than improvement are removed, recognition performance can be improved further.

The End