AN EFFICIENT KEYWORD SPOTTING TECHNIQUE USING A COMPLEMENTARY LANGUAGE FOR FILLER MODELS TRAINING Authors: Panikos Heracleous, Tohru Shimizu Reporter: Chen, Tzan Hwei

2 References Panikos Heracleous and Tohru Shimizu, "An Efficient Keyword Spotting Technique Using a Complementary Language for Filler Models Training", Eurospeech 2003. R. C. Rose and D. B. Paul, "A Hidden Markov Model Based Keyword Recognition System", ICASSP 1990.

3 Outline Introduction Proposed system Definitions of evaluation measures Experiments Conclusions

4 Introduction The task of keyword spotting is to detect a given set of keywords (one or more) in an utterance. In a keyword spotter, not only the keywords but also the non-keyword and noise components must be explicitly modeled. A set of HMMs (garbage or filler models) is chosen to represent the non-keyword intervals.
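To make that structure concrete, here is a minimal sketch (not the authors' implementation; all names are illustrative) of a spotting network in which keyword models compete in parallel with a self-looping set of filler models during decoding:

```python
# Toy sketch of a keyword-spotting grammar: keyword branches compete with a
# filler (garbage) loop, so non-keyword speech is absorbed by the fillers.
# Names and structure are illustrative, not taken from the paper.

def build_spotting_network(keywords, filler_models):
    """Return a toy network: start -> (keyword -> end | filler -> start)."""
    arcs = []
    for kw in keywords:          # each keyword is a parallel branch to the end node
        arcs.append({"label": kw, "type": "keyword", "src": "start", "dst": "end"})
    for f in filler_models:      # filler self-loop covers non-keyword intervals of any length
        arcs.append({"label": f, "type": "filler", "src": "start", "dst": "start"})
    return {"nodes": ["start", "end"], "arcs": arcs}

net = build_spotting_network(["tokyo", "kyoto"], ["garbage_01", "garbage_02"])
print(len(net["arcs"]))  # 4 arcs: two keyword branches plus two filler self-loops
```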

5 Introduction (cont) The choice of an appropriate garbage model set is a critical issue. The most common approaches are as follows: (1) the training corpus for a specific task is split into keyword and non-keyword (extraneous) data; the main disadvantage of such methods is task dependency. (2) the garbage models are selected from a set of common acoustic models; the main disadvantage of such methods is the high rate of false rejections.

6 Proposed system We propose a novel method for modeling the non-keyword intervals based on the use of bilingual hidden Markov models. The goals are to develop a task-independent keyword spotter and to overcome the problem of overlapping contexts.

7 Proposed system (cont) The main requirement in our approach is acoustic similarity between the target and 'garbage' languages. We compared the two languages based on the International Phonetic Alphabet (IPA). The IPA, developed by the International Phonetic Association, is a set of symbols that represents the sounds of language in written form.

8 Proposed system (cont) Based on the IPA comparison, American English covers the Japanese language acoustically. The English HMM garbage models, trained on a large speech corpus of guaranteed non-keyword speech, are expected to represent the non-keyword intervals without rejecting true keyword hits.
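A hedged sketch of how such an IPA-based comparison can be made concrete: treat each phone inventory as a set of IPA symbols and measure what fraction of the target-language set the complementary language covers. The symbol lists below are small placeholders, not the inventories used in the paper:

```python
# Hedged sketch: phone inventories as IPA symbol sets. The sets below are
# illustrative placeholders, not the actual Japanese/English inventories.
japanese_phones = {"a", "i", "ɯ", "e", "o", "k", "g", "s", "z", "t", "d",
                   "n", "h", "b", "p", "m", "j", "r", "w"}
english_phones = {"a", "i", "u", "e", "o", "k", "g", "s", "z", "t", "d",
                  "n", "h", "b", "p", "m", "j", "r", "w",
                  "f", "v", "θ", "ð", "l", "ʃ", "ʒ"}

coverage = len(japanese_phones & english_phones) / len(japanese_phones)
print(f"English symbols cover {coverage:.0%} of the Japanese inventory")
```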

9 Proposed system (cont) Fig. 1. Block diagram of the system

10 Proposed system (cont) The background network is composed of garbage models connected to form syllables, as in the Japanese language. Using the background network, we can account for the variability over time of the keyword scores, so the decision separating true keyword hits from false alarms becomes more reliable.
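Slide 11 shows a histogram of log likelihood ratio scores, which suggests the decision step normalizes the keyword score by the background-network score over the same frames and thresholds the result. A minimal sketch under that assumption (function names and the threshold value are hypothetical):

```python
def keyword_llr(keyword_loglik, background_loglik, num_frames):
    """Duration-normalized log likelihood ratio between a putative keyword hit
    and the background (garbage) network decoded over the same frames."""
    return (keyword_loglik - background_loglik) / num_frames

def accept_hit(keyword_loglik, background_loglik, num_frames, threshold=0.0):
    # Hits scoring above the threshold are kept; the rest are treated as
    # false alarms. The threshold here is purely illustrative.
    return keyword_llr(keyword_loglik, background_loglik, num_frames) > threshold

# Example: a putative hit spanning 120 frames.
print(accept_hit(keyword_loglik=-5400.0, background_loglik=-5520.0, num_frames=120))  # True
```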

11 Proposed system (cont) Fig. 2. Histogram of log likelihood ratio scores

12 Definitions of evaluation measures Recognition Rate (RCR) - the percentage of keywords detected. Rejection Rate (RJR) - the percentage of non-keywords rejected. Equal Rate (ER) - the operating point at which RCR and RJR are equal.
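A minimal sketch of computing these measures from per-utterance confidence scores and ground-truth labels; sweeping the threshold to locate the equal rate is an assumption about the procedure, not something stated on the slide:

```python
import numpy as np

def rcr_rjr(scores, is_keyword, threshold):
    """RCR: fraction of keyword utterances detected (score >= threshold).
    RJR: fraction of non-keyword utterances rejected (score < threshold)."""
    scores = np.asarray(scores, dtype=float)
    is_keyword = np.asarray(is_keyword, dtype=bool)
    rcr = float(np.mean(scores[is_keyword] >= threshold))
    rjr = float(np.mean(scores[~is_keyword] < threshold))
    return rcr, rjr

def equal_rate(scores, is_keyword):
    """Sweep candidate thresholds and return the point where RCR and RJR are closest."""
    best = None
    for t in np.unique(scores):
        rcr, rjr = rcr_rjr(scores, is_keyword, t)
        if best is None or abs(rcr - rjr) < abs(best[1] - best[2]):
            best = (t, rcr, rjr)
    return best  # (threshold, RCR, RJR) at the approximately equal operating point
```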

13 Experiments The Japanese keywords are represented by gender-dependent, context-dependent HMMs. The feature vectors are of size 38 (12 MFCC + 12 delta-MFCC + 12 delta-delta-MFCC + delta-energy + delta-delta-energy). A set of 28 context-independent, 3-state, single-Gaussian HMMs trained on the same speech corpus is chosen as the Japanese garbage models for comparison purposes.
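A hedged sketch of assembling such a 38-dimensional frame vector with librosa; the slides do not give front-end details such as sampling rate or window settings, so those choices below are assumptions:

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """12 MFCC + 12 delta + 12 delta-delta + delta-energy + delta-delta-energy = 38 dims per frame."""
    y, sr = librosa.load(wav_path, sr=8000)              # telephone-band speech (assumed rate)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, num_frames), c0 included
    c, log_e = mfcc[1:], mfcc[0:1]                       # 12 cepstra; c0 used as a log-energy proxy
    d_c, dd_c = librosa.feature.delta(c), librosa.feature.delta(c, order=2)
    d_e, dd_e = librosa.feature.delta(log_e), librosa.feature.delta(log_e, order=2)
    feats = np.vstack([c, d_c, dd_c, d_e, dd_e])         # (38, num_frames)
    return feats.T                                       # one 38-dimensional vector per frame
```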

14 Experiments (cont) The English garbage models are represented by context-independent, 3-state, single-Gaussian HMMs; twenty-eight models trained on the MACROPHONE American English telephone speech corpus are used. At most one keyword is allowed per utterance, and the vocabulary consists of 100 keywords.

16 Experiments (cont) The test set consists of Japanese telephone speech containing 1,548 short utterances (of which only 1,133 contain a keyword). Fig. 3. Recognition rates (left) and rejection rates (right) using clean test data in the first pass.

17 Experiments (cont) Fig. 4. Performance using English garbage models (clean test data) (2nd pass) Fig. 5. Performance using Japanese garbage models (clean test data) (2nd pass)

18 Experiments (cont) Fig. 6. Recognition rates using noisy test data (first pass) Fig. 7. Rejection rates using noisy test data (first pass)

19 Conclusions The main advantage of this method is its task independence; in addition, parameter tuning (e.g., the word insertion penalty) does not have a serious effect on the performance. In a future study, we plan to evaluate our method using larger vocabularies.

20 Utterance verification evaluation

21 The two verification measures are the True Rejection Rate (TRR) and the False Rejection Rate (FRR).
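For completeness, a minimal sketch using the common definitions (the slide does not spell them out): TRR is the fraction of incorrect (false-alarm) hypotheses that are rejected, and FRR is the fraction of correct keyword hypotheses that are wrongly rejected.

```python
def verification_rates(decisions):
    """decisions: iterable of (is_correct_hypothesis, was_rejected) pairs."""
    false_alarms = [rejected for correct, rejected in decisions if not correct]
    true_hits = [rejected for correct, rejected in decisions if correct]
    trr = sum(false_alarms) / len(false_alarms) if false_alarms else 0.0  # True Rejection Rate
    frr = sum(true_hits) / len(true_hits) if true_hits else 0.0          # False Rejection Rate
    return trr, frr

# Example: 3 false alarms (2 correctly rejected), 4 true hits (1 wrongly rejected).
decisions = [(False, True), (False, True), (False, False),
             (True, False), (True, False), (True, True), (True, False)]
print(verification_rates(decisions))  # (0.666..., 0.25)
```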