Word-subword based keyword spotting with implications in OOV detection Jan “Honza” Černocký, Igor Szöke, Mirko Hannemann, Stefan Kombrink, Brno University of Technology.

Presentation transcript:

Word-subword based keyword spotting with implications in OOV detection Jan “Honza” Černocký, Igor Szöke, Mirko Hannemann, Stefan Kombrink, Brno University of Technology (BUT). 44th Asilomar Conference on Signals, Systems and Computers, 2010.

Agenda: Word-based STD, the OOV problem, subwords; Experiments; Sub-word units; Hybrid word-subword system; What can we do with OOVs; Conclusion.

Goal of STD and glossary of terms. Goal: detect keywords or key-phrases in input speech and, for each detection, output identity, position and score. Glossary: Large Vocabulary Continuous Speech Recognizer (LVCSR) – a system converting spoken speech into text. Out-of-vocabulary (OOV) – a word which is not in the LVCSR vocabulary. Term – a textual entry consisting of one or more words in sequence. Spoken Term Detection (STD) – a way to search for a term in spoken data. Subword(s) – unit(s) that are parts of words (phones, syllables, automatically derived units, etc.).

Word-based STD. Due to the presence of a language model, word-based STD systems reach better accuracies than acoustic ones.

Implementation. The term is searched for in the recognition lattice, which allows the posterior probability of the term to be estimated.
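As a rough illustration only (not the BUT implementation), term search in a word lattice could look like the sketch below; the Arc structure, the assumption that arcs already carry word posteriors (e.g. from a forward-backward pass), and the min-of-posteriors detection score are all simplifying assumptions.

# Minimal sketch of term search in a recognition lattice (illustration only).
from collections import namedtuple

Arc = namedtuple("Arc", "start end word posterior t_start t_end")

def find_term(lattice_arcs, term_words):
    """Return (start_time, end_time, score) for each occurrence of the term."""
    # index arcs by their start node for quick expansion
    by_start = {}
    for a in lattice_arcs:
        by_start.setdefault(a.start, []).append(a)
    hits = []
    for first in lattice_arcs:
        if first.word != term_words[0]:
            continue
        # follow arcs that spell the rest of the term
        path = [first]
        for w in term_words[1:]:
            nxt = [a for a in by_start.get(path[-1].end, []) if a.word == w]
            if not nxt:
                path = None
                break
            path.append(max(nxt, key=lambda a: a.posterior))
        if path:
            # crude detection score: the weakest link of the path
            score = min(a.posterior for a in path)
            hits.append((path[0].t_start, path[-1].t_end, score))
    return hits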

The OOV problem. REF: THIS IS AN EXAMPLE OF RECOGNIZER OUTPUT / REC: THIS IS AMEX APPLE OF RECOGNIZER OUTPUT. One OOV causes several errors: the OOV cannot be found in the LVCSR output, it impairs recognition of neighboring words, and it usually carries a lot of information (named entities). => We need to handle OOVs! This matters for word accuracy, for spoken term detection accuracy, and for practical reasons (memory, CPU, index size, etc.).

Answer to the OOV problem – sub-word STD. A subword recognizer is built (its output is a subword lattice). The term is converted from words to a sequence of subwords, and this sequence is searched for in the subword lattice.

Agenda: Word-based STD, the OOV problem, subwords; Experiments; Sub-word units; Hybrid word-subword system; What can we do with OOVs; Conclusion.

Evaluation – TWV (Term-Weighted Value). Defined by NIST for the NIST STD 2006 evaluation: a single number, higher is better; it depends on score normalization and requires a full STD system.
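For reference, the TWV definition from the NIST STD 2006 evaluation plan (not spelled out on the slide):

TWV(\theta) = 1 - \frac{1}{|T|} \sum_{t \in T} \big[ P_{\mathrm{miss}}(t,\theta) + \beta \, P_{\mathrm{FA}}(t,\theta) \big], \qquad \beta = \frac{C}{V}\big(\mathrm{Pr}_{\mathrm{term}}^{-1} - 1\big) \approx 999.9,

where P_{\mathrm{miss}}(t,\theta) = 1 - N_{\mathrm{correct}}(t,\theta)/N_{\mathrm{true}}(t) and P_{\mathrm{FA}}(t,\theta) = N_{\mathrm{spurious}}(t,\theta)/N_{\mathrm{NT}}(t), with N_{\mathrm{NT}}(t) taken as the duration of the evaluation speech in seconds minus N_{\mathrm{true}}(t).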

Normalization-independent evaluation – UBTWV. UBTWV (Upper Bound Term Weighted Value) finds the optimum decision threshold for each term; a single number, higher is better; independent of normalization.
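A minimal sketch of the per-term threshold search behind UBTWV as we read the slide (not the actual BUT scoring code); the per-term detection lists and trial counts are assumed inputs.

# Sketch of UBTWV: for each term, pick the decision threshold that maximises its
# term-weighted value, then average over terms (illustration only).
BETA = 999.9  # NIST STD 2006 weighting

def term_value(detections, n_true, n_nontarget, threshold, beta=BETA):
    """detections: list of (score, is_correct) pairs for one term."""
    n_corr = sum(1 for s, ok in detections if ok and s >= threshold)
    n_spur = sum(1 for s, ok in detections if not ok and s >= threshold)
    p_miss = 1.0 - n_corr / n_true
    p_fa = n_spur / n_nontarget
    return 1.0 - (p_miss + beta * p_fa)

def ubtwv(per_term):
    """per_term: dict term -> (detections, n_true, n_nontarget)."""
    values = []
    for detections, n_true, n_nontarget in per_term.values():
        # candidate thresholds: each detection score, plus +inf = reject everything
        cands = [s for s, _ in detections] + [float("inf")]
        values.append(max(term_value(detections, n_true, n_nontarget, th)
                          for th in cands))
    return sum(values) / len(values)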

Data. NIST STD 2006 evaluation: 3 h of English telephone conversations; terms of one or more words, occurring 4737/196 times.

Recognizer I. LVCSR developed in the AMI/AMIDA projects: a state-of-the-art system including VTLN, MPE, posterior features, SAT, and 3 passes. Acoustic models trained on 278 h of speech; language model trained on 977M word tokens (50k vocabulary). The dictionary was pruned to generate OOVs -> WRDRED. Word accuracy: 69.04%.

Recognizer II.

Results. Compared: words, words converted to phones, and a phone recognizer. Phones are too small => we need longer units.

Agenda: Word-based STD, the OOV problem, subwords; Experiments; Sub-word units; Hybrid word-subword system; What can we do with OOVs; Conclusion.

Better subwords – phone multigrams. Statistics of phone n-grams (up to length 6) are collected from training data (phone transcriptions of speech). Probabilities of all units are estimated. The training data are segmented by the most probable sequence of multigrams. Statistics are recomputed and rarely occurring units are deleted; several iterations. Finally, an n-gram language model is estimated on top of the multigram segmentation of the training data.
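A simplified sketch of this iterative recipe, assuming unigram unit probabilities and omitting the n-gram LM that the real system estimates on top; the count threshold and iteration count are arbitrary placeholders.

# Simplified phone-multigram estimation: count phone n-grams up to MAX_LEN,
# Viterbi-segment the transcriptions with unit probabilities, re-estimate, prune.
import math
from collections import Counter

MAX_LEN = 6
MIN_COUNT = 5
ITERATIONS = 4

def viterbi_segment(phones, logprob):
    """Split a phone sequence into the most probable sequence of units."""
    n = len(phones)
    best = [0.0] + [-math.inf] * n          # best log-prob of a prefix
    back = [0] * (n + 1)                    # where the last unit started
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            unit = tuple(phones[j:i])
            if unit in logprob and best[j] + logprob[unit] > best[i]:
                best[i] = best[j] + logprob[unit]
                back[i] = j
    units, i = [], n
    while i > 0:
        units.append(tuple(phones[back[i]:i]))
        i = back[i]
    return list(reversed(units))

def train_multigrams(transcriptions):
    # initial statistics: all phone n-grams up to MAX_LEN occurring in the data
    counts = Counter(tuple(t[i:i + k])
                     for t in transcriptions
                     for k in range(1, MAX_LEN + 1)
                     for i in range(len(t) - k + 1))
    for _ in range(ITERATIONS):
        total = sum(counts.values())
        logprob = {u: math.log(c / total) for u, c in counts.items()}
        # re-segment the data and recount the units
        counts = Counter(u for t in transcriptions
                         for u in viterbi_segment(t, logprob))
        # prune rare units, but keep single phones so segmentation cannot fail
        counts = Counter({u: c for u, c in counts.items() if c >= MIN_COUNT})
        for t in transcriptions:
            for p in t:
                counts[(p,)] = max(counts[(p,)], 1)
    return counts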

Constrained multigrams. nosil – silence is not part of a multigram unit. noxwrd – word-boundary information is added to the multigram unit. Term (word representation): PRIME MINISTER. Term pronunciation: p r ay m m ih n ih s t axr. Term (subword representation): *p-r-ay m* *m-ih-n ih-s t-axr*
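To make the noxwrd example concrete, here is a hedged sketch of converting a term pronunciation into boundary-marked units by greedy longest match; the unit inventory below is invented for illustration and is not the trained one.

# Sketch: segment each word of the term into multigram units and mark word
# boundaries with '*' as on the slide (greedy longest match, toy inventory).
INVENTORY = {("p", "r", "ay"), ("m",), ("m", "ih", "n"), ("ih", "s"), ("t", "axr")}
MAX_LEN = 3

def segment_word(phones):
    units, i = [], 0
    while i < len(phones):
        for k in range(min(MAX_LEN, len(phones) - i), 0, -1):
            if tuple(phones[i:i + k]) in INVENTORY:
                units.append("-".join(phones[i:i + k]))
                i += k
                break
        else:
            raise ValueError("phone %r not covered by the inventory" % phones[i])
    # noxwrd: units never cross a word boundary; mark the boundaries with '*'
    units[0] = "*" + units[0]
    units[-1] = units[-1] + "*"
    return units

term = [["p", "r", "ay", "m"], ["m", "ih", "n", "ih", "s", "t", "axr"]]  # PRIME MINISTER
print(" ".join(u for word in term for u in segment_word(word)))
# -> *p-r-ay m* *m-ih-n ih-s t-axr*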

Results. Subword search can process OOV terms, but it is not as accurate as word search for in-vocabulary terms, and it consumes more index space. => We need a combination of word and subword searches.

Agenda: Word-based STD, the OOV problem, subwords; Experiments; Sub-word units; Hybrid word-subword system; What can we do with OOVs; Conclusion.

Parallel word-subword … works, but one needs to maintain and run two systems.
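Our reading of the parallel setup, as a minimal sketch: route in-vocabulary terms to the word system and OOV terms to the subword system; the search callables and the subword conversion are placeholders supplied by the caller.

# Sketch of the parallel word-subword setup: two complete STD systems are kept,
# and each query is routed by whether all its words are in the LVCSR vocabulary.
def parallel_search(term, vocabulary, search_word, search_subword, to_subwords):
    """search_word / search_subword / to_subwords are caller-supplied callables."""
    words = term.split()
    if all(w in vocabulary for w in words):
        return search_word(words)               # in-vocabulary: word index
    return search_subword(to_subwords(words))   # OOV: subword index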

Hybrid word-subword

Implementation by composition of networks

Multigram dictionary for the hybrid system. For the hybrid system, phone multigrams must not be trained on utterances; they are trained on a dictionary. We experimented with the LVCSR dictionary vs. a big dictionary vs. an OOV dictionary.

Results – different configurations. Pruning factors play a role in memory consumption, index size and real-time factor. A “reasonable” system is ~2.5x slower than the word system and has a ~2.5x bigger index; it matches the accuracy of the word system for in-vocabulary terms, and OOVs are found.

Agenda: Word-based STD, the OOV problem, subwords; Experiments; Sub-word units; Hybrid word-subword system; What can we do with OOVs; Conclusion.

OOV detection by the hybrid system. Comparing the subword confidence measure to a threshold => detection of OOVs.
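A hedged sketch of that thresholding: in the hybrid output, stretches decoded as subword units whose average confidence exceeds a threshold are reported as OOV detections. The token structure below is an assumption, not the BUT lattice format.

# Sketch: consecutive subword tokens with high enough confidence -> one OOV detection.
from collections import namedtuple

Token = namedtuple("Token", "label t_start t_end confidence is_subword")

def detect_oovs(tokens, threshold=0.5):
    detections, run = [], []
    for tok in tokens + [Token("<end>", 0, 0, 0.0, False)]:  # sentinel flushes last run
        if tok.is_subword:
            run.append(tok)
            continue
        if run:
            conf = sum(t.confidence for t in run) / len(run)
            if conf >= threshold:
                phones = " ".join(t.label for t in run)
                detections.append((run[0].t_start, run[-1].t_end, phones, conf))
            run = []
    return detections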

OOV recovery. Phoneme-to-grapheme (P2G) conversion is used to derive the written form of a detected OOV.

Alignment error model. Some detected OOVs can even be converted back to in-vocabulary words! But the phone pronunciation in the 1-best output is not ideal… hence the alignment error model: its parameters (probabilities of deletion, insertion and substitution) are trained from data, and it can process the dictionary and look up detected OOVs.
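A minimal sketch of such a lookup, assuming a weighted edit distance whose costs come from deletion/insertion/substitution probabilities; the probability tables below are placeholders, not the trained ones.

# Sketch of the alignment-error-model lookup: weighted edit distance between the
# hypothesized OOV pronunciation and dictionary pronunciations.
import math

P_DEL, P_INS = 0.03, 0.03          # placeholder global probabilities
P_SUB = {("m", "n"): 0.10}         # placeholder pair-specific substitutions
P_SUB_DEFAULT = 0.01

def cost_sub(a, b):
    return 0.0 if a == b else -math.log(P_SUB.get((a, b), P_SUB_DEFAULT))

def align_cost(hyp, ref):
    """Negative log-probability of turning the hypothesized phones into the reference."""
    c_del, c_ins = -math.log(P_DEL), -math.log(P_INS)
    d = [[0.0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(1, len(hyp) + 1):
        d[i][0] = i * c_del
    for j in range(1, len(ref) + 1):
        d[0][j] = j * c_ins
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i][j] = min(d[i - 1][j] + c_del,
                          d[i][j - 1] + c_ins,
                          d[i - 1][j - 1] + cost_sub(hyp[i - 1], ref[j - 1]))
    return d[-1][-1]

def lookup_oov(hyp_phones, dictionary):
    """dictionary: word -> list of phones; returns the closest in-vocabulary word."""
    return min(dictionary, key=lambda w: align_cost(hyp_phones, dictionary[w]))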

Going more complex… We can construct a wFST accounting for sequences of in-vocabulary words, in-vocabulary words plus common prefixes and suffixes, OOVs, and combinations: m ey sh en -> INFORMATION; ae l k ax hh aa l ih z em (ALCOHOLISM) -> ALCOHOL / ISM; aa f ax s m ae k s (’Office Max’) -> OFFICE OOV1572.

OOV clustering. The alignment model allows evaluation of similarity between detected OOVs, so clustering is possible.
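A small sketch of clustering detected OOVs from their pairwise alignment distances, here with SciPy's hierarchical clustering; align_cost is the distance from the previous sketch and the cut threshold is arbitrary, so this only illustrates the idea, not the paper's exact method.

# Sketch: agglomerative clustering of detected OOVs by pairwise alignment distance.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_oovs(oov_pronunciations, align_cost, cut=3.0):
    n = len(oov_pronunciations)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # symmetrise, since the error model need not be symmetric
            d = 0.5 * (align_cost(oov_pronunciations[i], oov_pronunciations[j]) +
                       align_cost(oov_pronunciations[j], oov_pronunciations[i]))
            dist[i, j] = dist[j, i] = d
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=cut, criterion="distance")
    return labels  # OOVs sharing a label are candidates for the same new word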

Agenda: Word-based STD, the OOV problem, subwords; Experiments; Sub-word units; Hybrid word-subword system; What can we do with OOVs; Conclusion.

Conclusion. A subword system with constrained multigrams gives very good STD performance and an OOV-tolerant system. The improved hybrid word-subword system was tested from the STD accuracy and real-application points of view: it brings a better accuracy/size ratio and is faster than the standalone system, and it works well in a real indexing & search engine. With a hybrid system, we can recover OOVs (simple P2G or a more elaborate model), measure similarity of OOVs, cluster them, find re-occurring ones, and update the vocabulary.

Reading and playing with: Igor Szöke: Hybrid word-subword spoken term detection, Ph.D. thesis, Brno University of Technology, Oct 2010. Stefan Kombrink, Mirko Hannemann, Lukáš Burget and Hynek Heřmanský: Recovery of Rare Words in Lecture Speech, in Proc. Text, Speech and Dialogue (TSD) 2010, Brno, 2010. Mirko Hannemann, Stefan Kombrink, Martin Karafiát and Lukáš Burget: Similarity Scoring for Recognizing Repeated Out-of-Vocabulary Words, in Proc. Interspeech 2010, Makuhari, Japan, 2010. … and the ‘Publications’ section of the BUT Speech@FIT web pages.

Thank you for your attention.