Word and Sub-word Indexing Approaches for Reducing the Effects of OOV Queries on Spoken Audio. Beth Logan, Pedro J. Moreno, Om Deshmukh. Cambridge Research Laboratory.


Word and Sub-word Indexing Approaches for Reducing the Effects of OOV Queries on Spoken Audio
Beth Logan, Pedro J. Moreno, Om Deshmukh
Cambridge Research Laboratory

Audio indexing and OOVs
 Audio indexing has emerged as a novel application of ASR and IR technologies
 However, OOVs are a limiting factor
–While only 1.5% of indexed words, they represent 13% of queries
–Based on our index (active since Dec. 1999)
 The cost of retraining dictionaries/acoustics/LMs is just too high!
 Subword recognizers might solve the problem but are too inaccurate

Types of OOVs on word ASR
 OOVs occur both in queries and in the audio
 The ASR system makes mistakes
–It maps an OOV onto similar-sounding sequences (deletions/substitutions/insertions)
 TALIBAN → (ASR) → TELL A BAND
 ENRON → (ASR) → AND RON
 ANTHRAX → (ASR) → AMTRAK
–or it can simply make a mistake
 COMPAQ → (ASR) → COMPACT

Solutions
 Abandon word-based approaches?
–Phoneme-based ASR → too many false alarms
–Subword-based ASR → a compromise between words and phonemes
–But then no word transcript is available
 Very useful in the UI
 Allows rapid navigation of multimedia
 Combine approaches?
–What is the optimal way of combining?

Experimental Setup
 Broadcast-news-style audio
–75 hours of HUB4-96/HUB4-97 audio as the testing/indexed corpus
–65 hours of HUB4-96 (disjoint) for acoustic training of the ASR models
 Large newspaper corpora for LM training
 Queries selected from index logs (www.speechbot.com)
–OOV rate in queries artificially raised to 50%

Experimental Setup
 Approximate tf.idf plus a score based on the proximity of query terms
–Long documents are broken into 10-second pseudo-documents
–Hits occurring in high-density areas are considered more relevant
 Precision-recall plots for performance
–False-alarm rate as a secondary metric
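The retrieval step above can be sketched as follows. This is a minimal illustration, not the paper's exact formula: the tf.idf weighting is the textbook form, and the proximity factor (boosting pseudo-documents where several query terms co-occur) is an assumed stand-in for the slide's density-based relevance.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_terms, all_docs):
    """Score one 10-second pseudo-document against a query.

    Approximate tf.idf with an illustrative proximity boost:
    pseudo-documents containing several distinct query terms
    (a 'high-density area') score higher.
    """
    n_docs = len(all_docs)
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = sum(1 for d in all_docs if term in d)  # document frequency
        score += tf[term] * math.log(n_docs / df)
    hits = sum(1 for t in set(query_terms) if tf[t] > 0)
    # co-occurrence bonus: a hypothetical stand-in for density weighting
    return score * (1.0 + 0.5 * (hits - 1)) if hits else 0.0
```

Ranking the pseudo-documents by this score and sweeping a threshold yields the precision-recall curves the slides report.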

Query Examples

 In dictionary   Count   Out of dictionary   Count
 Bill Clinton    56      Cunaman             70
 Clinton         626     Fayed               52
 Microsoft       40      Dodi                37
 China           226     Plavsic             18
 Jesus           11      Mair                70

 OOVs are 20% by count
 OOVs are 50% of all queries
–Results normalized per query, then merged

Speech Recognition Systems
 Large-vocabulary word-based system
–CMU Sphinx3-derived system, 65k-word vocabulary, 3-gram LM
 Particle-based system
–7,000 particles, particle 3-gram LM
 Phonetic recognizer
–Phonemes derived from the word recognizer output

Phonetic Indexing Systems
 Experiments based on a phonetic index and a phone-sequence index
–Expand each query word into phonemes
–Build the expanded query as a sequence of N phones with an overlap of M phones
 Example (N=3, M=2):
 TALIBAN → T AE L IH B AE N → T-AE-L, AE-L-IH, L-IH-B, IH-B-AE, B-AE-N
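The expansion above is mechanical enough to sketch directly; a window of N phones advances by N minus M phones each step, reproducing the TALIBAN example:

```python
def expand_query(phones, n=3, m=2):
    """Expand a query's phone sequence into overlapping phone n-grams.

    n: subsequence length; m: overlap between consecutive subsequences,
    so the window advances by (n - m) phones per step.
    """
    step = n - m
    grams = []
    i = 0
    while i + n <= len(phones):
        grams.append("-".join(phones[i:i + n]))
        i += step
    return grams

expand_query("T AE L IH B AE N".split())
# → ['T-AE-L', 'AE-L-IH', 'L-IH-B', 'IH-B-AE', 'B-AE-N']
```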

Particle-based recognition system
 Particles are phone sequences
–Based on Ed Whitaker's Ph.D. work for Russian
–Arbitrary length, learned from data
–From 1 to 3 phones, word-internal
–All 1-phone particles are in the particle set
 Worst case: a word is a sequence of 1-phone particles
 Best case: a word is a single multiphone particle (BUT, THE)
 Once particles are learned, everything works as in LVCSR systems
–Particle dictionary: particles to phones
–Particle LM: unigrams, bigrams, trigrams, backoff weights
–Acoustics: triphones

Learning Particles
 Training words are mapped from orthographic to the default phonetic representation; a word-delimiter phone is added to the end of each word
 Initialize (l=1): decompose all words into l-character particles
 Iterate:
–Insert the next candidate particle into all words, compute the change in likelihood (particle-bigram leaving-one-out criterion on the training corpus), then remove it; repeat until all l-character particles have been tried
–Insert the best l-character particle
–If there is no improvement, set l = l+1
–Terminate when the desired number of particles is reached
 Once the particle set (7,000) is determined, transform the text corpora to particles and learn the LM
 A trigram particle model is built with Katz back-off and Good-Turing discounting
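The greedy loop above can be sketched in miniature. This toy version scores a candidate particle by how many particle tokens it removes from the corpus decomposition, a deliberately simplified stand-in for the particle-bigram leaving-one-out likelihood criterion the slides describe:

```python
def decompose(word_phones, particles):
    """Greedily decompose a phone sequence, preferring the longest
    matching particle (up to 3 phones) at each position; single
    phones are always available as fallback particles."""
    out, i = [], 0
    while i < len(word_phones):
        for length in range(min(3, len(word_phones) - i), 0, -1):
            cand = tuple(word_phones[i:i + length])
            if length == 1 or cand in particles:
                out.append(cand)
                i += length
                break
    return out

def learn_particles(corpus, n_particles):
    """Toy greedy particle learning: repeatedly add the word-internal
    candidate (2-3 phones) that shrinks the corpus decomposition most.
    The real criterion is a bigram leaving-one-out likelihood."""
    particles = set()
    for _ in range(n_particles):
        base = sum(len(decompose(w, particles)) for w in corpus)
        candidates = {tuple(w[i:i + l]) for w in corpus
                      for l in (2, 3) for i in range(len(w) - l + 1)}
        best, best_gain = None, 0
        for c in candidates - particles:
            trial = particles | {c}
            gain = base - sum(len(decompose(w, trial)) for w in corpus)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no improvement: stop (the full algorithm
            break         # would instead grow the particle length l)
        particles.add(best)
    return particles
```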

Particle Recognizer: Examples
 Recognizer transcript
–IN WASHINGTON TODAY A CONGRESSIONAL COMMITTEE HAS BEEN STUDYING BAD OR WORSE BEHAVIOR…
–IH_N W_AA SH_IH_NG T_AH_N T_AH_D EY AH K_AH N_G R_EH SH_AH N_AH_L K_AH_M IH_T_IY IH_Z B_AH_N S_T AH_D_IY IH_NG B_AE_D AO_R W_ER_S B_IH HH_EY_V Y_ER …
 Dictionary examples
–T_AH_N (as in washingTON) → T AH N
–T_AH_D (as in TODay) → T AH D
–HH_EY_V (as in beHAVior) → HH EY V
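Because the particle names in the examples above encode their phones with underscores, recovering the phone stream from a particle transcript is a simple split-and-concatenate; this sketch assumes that naming convention holds for every particle in the dictionary:

```python
def particles_to_phones(particle_seq):
    """Map a particle transcript back to a phone sequence, assuming
    each particle name lists its phones joined by underscores
    (e.g. 'T_AH_N' → ['T', 'AH', 'N'])."""
    phones = []
    for particle in particle_seq:
        phones.extend(particle.split("_"))
    return phones

particles_to_phones(["T_AH_D", "EY"])
# → ['T', 'AH', 'D', 'EY']
```

The resulting phone stream can then be indexed with the same overlapping phone n-gram scheme used for the phonetic recognizer.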

Experimental Results
 Table comparing the systems (Word, Particle, Phonemes, Phonemes (5/4), Linear Combine, OOV Combine) on 11-point average precision, recall, top-5 precision, top-10 precision, and false positives; the numeric values did not survive extraction
 Results averaged over ALL queries (OOV and non-OOV)
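The 11-point average precision reported in the (lost) table is a standard IR metric; for one query it can be computed from a ranked list as follows:

```python
def eleven_point_ap(ranked_relevance, n_relevant):
    """11-point interpolated average precision for one query.

    ranked_relevance: booleans in rank order, True where the retrieved
    item is relevant; n_relevant: total relevant items for the query.
    At each recall level 0.0, 0.1, ..., 1.0, take the maximum precision
    achieved at that recall or higher, then average the 11 values.
    """
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / n_relevant)
    points = []
    for level in (i / 10 for i in range(11)):
        attained = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(attained) if attained else 0.0)
    return sum(points) / 11
```

Per the slides, such per-query results are normalized and then merged across the query set.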

Experimental results: non-OOV
 Non-OOV 11-point precision-recall plot

Experimental results: OOV
 OOV 11-point precision-recall plot

Combining recognizers
 Simplest approach
–For non-OOV queries, use the word recognizer
–For OOV queries, use the phonetic recognizer
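The simplest combination above amounts to a one-line router; the index labels here are illustrative names, not identifiers from the paper:

```python
def route_query(query_terms, word_vocab):
    """Route a query by OOV status: the word index when every term is
    in the recognizer's vocabulary, otherwise the phonetic index."""
    if all(term in word_vocab for term in query_terms):
        return "word"
    return "phonetic"
```

A query containing even one OOV term falls back to the phonetic index, since the word recognizer cannot have produced that term in its transcript.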

Future Work
 Continue exploration of the particle approach
–Cross-word particles
 Explore query-expansion techniques based on acoustic confusability
 Explore new index-combination schemes
–Bayesian combination
 Take into account uncertainty in the recognizers
–Combine confidence scores into IR
 Explore classic IR techniques
–Query expansion, relevance feedback

Conclusions
 Subword approaches can help recover some of the OOVs
–But at the cost of higher false alarms
 No single approach (word/subword/phoneme) can solve the problem alone
 Combining different recognizers looks promising
–How to combine is still an open research question
 The space of possible queries is very large and discrete; effective techniques are elusive…

Combining recognizers

Background
 OOVs occur both in queries and in the audio
–TALIBAN → (ASR) → TELL A BAND
–ENRON → (ASR) → AND RON
–ANTHRAX → (ASR) → AMTRAK
 Possible solutions:
–Map queries to subwords, build the index with subwords
–Map queries to similar-sounding words, build the index with words
–Combine both approaches