Presentation transcript:

Robust word boundary detection in spontaneous speech using acoustic and lexical cues. Authors: Andreas Tsiartas, Prasanta Kumar Ghosh, Panayiotis Georgiou, Shrikanth Narayanan, University of Southern California. Presenter: ChiaHua

Outline: INTRODUCTION, DESCRIPTION OF THE FEATURES, EXPERIMENTAL SETUP, RESULTS, SUMMARY

INTRODUCTION Automatic word boundary detection can be used to assess speech recognition performance, to make recognizers faster, and to detect regions of out-of-vocabulary (OOV) words. In the past, researchers have tried to address this problem using only the acoustic information from speech. While acoustic cues carry useful word boundary information, they suffer from certain limitations.

INTRODUCTION Importantly, acoustic cues often fail to give clues about word boundaries, particularly when the beginning of a word gets co-articulated with the end of the previous word, as happens in many lexical contexts. This is especially common in spontaneous speech. It is in these cases that lexical cues can become an important factor for estimating word boundaries.

INTRODUCTION In our work, we focus on capturing lexical information from the ASR framework (not necessarily the best hypothesis) and incorporating it with improved acoustic features to obtain a robust word boundary estimate. The rationale is that although the lexical hypotheses may be erroneous, they provide rough, albeit potentially imperfect, word segmentation information which can be advantageously used in conjunction with additional acoustic information for improved word boundary detection.

DESCRIPTION OF THE FEATURES Acoustic Features: Short-time energy. Since speakers frequently reduce their loudness level while making transitions from one word to the next, the smoothed short-time energy is used as a cue:

$$\bar{E}[m] = \frac{1}{3} \sum_{i=m-1}^{m+1} E[i], \qquad E[i] = \frac{1}{N} \sum_{n = iN_{sh} - N/2}^{iN_{sh} + N/2} x^2[n]$$

$x[n]$: speech signal. $N_{sh}$: frame shift in number of samples. $N$: length of an analysis frame in number of samples. $m$: frame index. (Note: the short-time energy reflects the loudness level, and small noise components in the speech can be removed based on its magnitude; it is related to the quadratic mean, $M = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2} = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}$.)
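A minimal NumPy sketch of the short-time energy feature as defined above (frame length N, frame shift N_sh in samples, followed by the three-frame smoothing); the function names and the exact frame alignment are illustrative choices, not taken from the paper.

```python
import numpy as np

def short_time_energy(x, N=320, N_sh=160):
    """Per-frame energy E[i] = (1/N) * sum of x^2 over an N-sample window."""
    energies = []
    for center in range(N // 2, len(x) - N // 2, N_sh):
        frame = np.asarray(x[center - N // 2 : center + N // 2], dtype=float)
        energies.append(np.sum(frame ** 2) / N)
    return np.array(energies)

def smoothed_energy(E):
    """Three-frame moving average: E_bar[m] = (E[m-1] + E[m] + E[m+1]) / 3."""
    return np.convolve(E, np.ones(3) / 3.0, mode="same")
```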

DESCRIPTION OF THE FEATURES Acoustic Features: Short-time zero-crossing rate (ZCR). When there is reduced speech activity in a word-to-word transition, we expect the ZCR at that word boundary to differ from the ZCR within the word. The ZCR is computed by counting the number of times the signal crosses level zero within an analysis frame. (Note: if a signal has many zero crossings, it changes rapidly and therefore likely contains high-frequency information.) Reference: A Method for Voiced/Unvoiced Classification of Noisy Speech by Analyzing Time-Domain Features of Spectrogram Image.
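A short sketch of the frame-level zero-crossing count, under the same framing assumptions as the energy sketch above (illustrative only, not the authors' exact implementation):

```python
import numpy as np

def zero_crossing_rate(x, N=320, N_sh=160):
    """Count how many times the signal crosses level zero in each analysis frame."""
    rates = []
    for start in range(0, len(x) - N + 1, N_sh):
        frame = np.asarray(x[start:start + N], dtype=float)
        signs = np.sign(frame)
        signs[signs == 0] = 1.0              # treat exact zeros as positive
        rates.append(int(np.count_nonzero(np.diff(signs))))
    return np.array(rates)
```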

DESCRIPTION OF THE FEATURES Acoustic Features: Short-time pitch frequency. The speech signal is pseudo-periodic only during the voiced portions, and this voicing can be tracked by computing the pitch value in every analysis frame.
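The slide does not say which pitch tracker is used; the following is a naive autocorrelation-based frame pitch estimate purely for illustration, with an assumed sampling rate and pitch range.

```python
import numpy as np

def frame_pitch(frame, fs=8000, f_min=60.0, f_max=400.0):
    """Naive autocorrelation pitch estimate for one analysis frame.

    Returns 0.0 when the frame shows no clear periodicity (treated as unvoiced).
    fs, f_min and f_max are assumptions, not values from the paper."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    lag_min = int(fs / f_max)                     # shortest admissible pitch period
    lag_max = min(int(fs / f_min), len(ac) - 1)   # longest admissible pitch period
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    if ac[lag] / ac[0] < 0.3:                     # weak peak -> call it unvoiced
        return 0.0
    return fs / lag
```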

[Figure: example utterance "Yeah and it was usually", annotated with clip and discontinuity cues at word boundaries.]

DESCRIPTION OF THE FEATURES Lexical Features. We define a lexical-information-based boundary confidence coefficient (BCC) at frame $l$ as follows:

$$BCC[l] = \frac{1}{N_{th}} \sum_{k=1}^{N_l} g(p_k), \qquad g(p_k) = \begin{cases} 1 & \text{if } p_k > p_{th} \\ 0 & \text{otherwise} \end{cases}$$

$N_l$: the total number of paths in the lattice passing through frame $l$. $p_k$: the probability of the $k$th path in the lattice from its start frame to its end frame.
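A sketch of the BCC computation, assuming each lattice path has been reduced to a (probability, start frame, end frame) triple; that representation, and the function name, are illustrative rather than the authors'.

```python
def boundary_confidence_coefficient(paths, l, p_th, N_th=50):
    """BCC[l]: number of lattice paths through frame l with probability above p_th,
    normalized by the constant N_th (50 in the experimental setup below).

    `paths` is assumed to be an iterable of (p_k, start_frame, end_frame) tuples."""
    count = 0
    for p_k, start, end in paths:
        if start <= l <= end and p_k > p_th:   # path passes through frame l and g(p_k) = 1
            count += 1
    return count / float(N_th)
```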

EXPERIMENTAL SETUP SwitchBoard corpus: we perform word boundary detection on spontaneous speech utterances. We randomly selected 7000 utterances (training: 5000; test: 2000, containing 4% OOVs). The recognition system that we use in this setup is Sphinx-3, with acoustic models trained on WSJ and TIMIT data. Features: 12 MFCCs and energy, along with first and second derivatives. We use word boundaries tagged by Mississippi State University as the reference to train and evaluate the system.

EXPERIMENTAL SETUP Feature extraction: phoneme recognition is performed on all utterances using Sphinx; we rescore the phoneme lattice using a general-purpose language model (LM) and extract the BCC feature. The LM is composed of 16K unigrams, 1.6M bigrams and 1.7M trigrams. Classifier setup: we use the features described above to train a Hidden Markov Model (HMM) based classifier for word boundary detection; it recognizes a boundary symbol and a non-boundary symbol.
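The slide only states that the classifier is HMM-based with boundary and non-boundary symbols; below is a generic two-class Viterbi decoding sketch, with the emission model (per-frame class log-likelihoods) left abstract because it is not specified here.

```python
import numpy as np

def viterbi_two_class(loglik, log_trans, log_prior):
    """Most likely boundary (1) / non-boundary (0) label sequence.

    loglik:    (T, 2) per-frame log-likelihoods of the feature vector under each class
    log_trans: (2, 2) log transition probabilities, rows = previous class
    log_prior: (2,)   log initial class probabilities"""
    T = loglik.shape[0]
    delta = np.empty((T, 2))
    back = np.zeros((T, 2), dtype=int)
    delta[0] = log_prior + loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[previous, current]
        back[t] = np.argmax(scores, axis=0)          # best previous class per current class
        delta[t] = np.max(scores, axis=0) + loglik[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):                   # trace back the best path
        labels[t] = back[t + 1, labels[t + 1]]
    return labels
```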

EXPERIMENTAL SETUP Evaluation: we first compute the precision and recall of the word boundary detection and then report the F-score. An estimated word boundary location is considered correct if it is within 10 frames of the actual boundary location. We chose the following parameters for our experiment: $N = 320$, $N_{sh} = 160$ and $N_{th} = 50$.
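A small sketch of the evaluation protocol described above: precision, recall and F-score with a 10-frame tolerance; the greedy one-to-one matching rule is an assumption, since the slide does not spell it out.

```python
def boundary_f_score(predicted, reference, tolerance=10):
    """Precision / recall / F-score over detected boundary frame indices.

    A predicted boundary counts as correct if it lies within `tolerance` frames
    of a reference boundary that has not already been matched."""
    reference = sorted(reference)
    used = [False] * len(reference)
    hits = 0
    for p in sorted(predicted):
        for i, r in enumerate(reference):
            if not used[i] and abs(p - r) <= tolerance:
                used[i] = True
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score
```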

RESULTS

SUMMARY In this paper, we proposed a novel lexically derived feature, the boundary confidence coefficient (BCC), to be used in addition to acoustic features to improve automatic word boundary detection in spontaneous speech. The robustness of our feature was demonstrated through experimental evaluation at various levels of white and pink noise, up to 5 dB. The results obtained using lexical cues are significant. We also observe that the F-score for colored noise is poorer than that for white noise at the same SNR level. The BCC feature also suffers at low SNR, since the ASR lattice becomes noisier due to the poor acoustics of noisy speech.