A Maximum Entropy Language Model Integrating N-grams and Topic Dependencies for Conversational Speech Recognition
Sanjeev Khudanpur and Jun Wu, Johns Hopkins University, ICASSP 1999

Abstract This paper presents a compact language model which incorporates local dependencies in the form of N-grams and long-distance dependencies through the dynamic topic of the conversation. These sources of information are integrated using the maximum entropy principle.

Introduction Most current algorithms for language modeling tend to suffer from an acute myopia, basing their predictions of the next word on only a few immediately preceding words. This paper presents a method for exploiting long-distance dependencies through dynamic models of the topic of the conversation.

Introduction Most other methods exploit topic-related information for language modeling by constructing separate N-gram models for each individual topic. Such a construction, however, results in fragmentation of the training text. The usual remedy is to interpolate each topic-specific N-gram model with a topic-independent model constructed using all the available data.

Introduction In the method presented here, the term frequencies are treated as topic-dependent features of the corpus and the N-gram frequencies as topic-independent features. An admissible model is then required to satisfy constraints that reflect both sets of features –The ME principle is used to select a statistical model which meets all these constraints.

Combining N-gram and Topic Dependencies Let V denote the vocabulary of a speech recognizer
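The equations on this slide did not survive the transcript. As a hedged reconstruction from the surrounding slides, the quantity being modeled is the probability of the next word given the two preceding words and the current topic, with t_k denoting the topic assigned to the portion of the conversation containing the k-th word:

P(w_k \mid w_{k-1}, w_{k-2}, t_k), \qquad w_k, w_{k-1}, w_{k-2} \in V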

The ME Framework

The ME Framework Intuition suggests that (1) not every word in the vocabulary will have a strong dependence on the topic of the conversation, and (2) estimating a separate conditional pmf for each (w_{k-1}, w_{k-2}, t_k), however, fragments the training data. We therefore seek a model which, in addition to topic-independent N-gram constraints, meets topic-dependent marginal constraints:
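The constraint equations themselves are missing from the transcript. One plausible form of the topic-dependent marginal constraint, with \#(\cdot) denoting training-corpus counts, is

\sum_{w_{k-1}, w_{k-2}} P(w_{k-2}, w_{k-1}, t)\, P(w \mid w_{k-1}, w_{k-2}, t) = \frac{\#(w, t)}{\#(t)}

imposed only for words w found to be topic-dependent, alongside the usual topic-independent unigram, bigram and trigram marginal constraints obtained by summing over topics.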

The ME Framework It is known that the ME model has an exponential form, with one parameter corresponding to each linear constraint placed on the model and a suitable normalization constant Z. The generalized iterative scaling (GIS) algorithm is used to compute the ME model parameters.
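The exponential form referred to on this slide was dropped from the transcript. Assuming unigram, bigram, trigram and topic-unigram features, consistent with the rest of the talk, a plausible reconstruction is

P(w_k \mid w_{k-1}, w_{k-2}, t_k) = \frac{\alpha_{w_k}\, \beta_{w_{k-1}, w_k}\, \gamma_{w_{k-2}, w_{k-1}, w_k}\, \delta_{t_k, w_k}}{Z(w_{k-1}, w_{k-2}, t_k)}

where each factor is the exponentiated parameter of one active constraint and Z(\cdot) normalizes over the vocabulary for every history-topic pair; GIS iteratively rescales these factors until all constraints are satisfied.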

Topic Assignment for Test Utterances A topic may be assigned to –The entire test conversation –Each utterance –Parts of an utterance Topic assignment for an utterance may in turn be based on –Only that utterance –That utterance plus a few preceding utterances –That utterance plus a few preceding and succeeding utterances

Experimental Results on Switchboard Training set –1200 conversations containing a total of 2.1M words –Each conversation is annotated with one of 70 topics, namely the topic recommended to the callers during data collection Testing set –19 conversations (38 conversation sides) comprising over 2400 utterances Vocabulary –22K words, including all the words in the training and test sets

Experimental Results on Switchboard For every test utterance, a list of the 2500 best hypotheses is generated by an HTK-based recognizer using state-clustered cross-word triphone HMMs with Gaussian mixture output densities and a back-off bigram LM. The word error rate (WER) obtained by rescoring these hypotheses and the average perplexity of the test-set transcriptions are reported.
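As an illustration of the rescoring step (not the authors' actual HTK setup), the sketch below combines each hypothesis's first-pass acoustic score with the log probability of a rescoring language model; the function names, the LM scale and the word insertion penalty are assumptions.

```python
# Minimal N-best rescoring sketch. `lm_logprob` stands in for the rescoring LM
# (e.g., the topic-dependent ME model); scale and penalty values are illustrative.
def rescore_nbest(hypotheses, lm_logprob, lm_scale=12.0, word_penalty=0.0):
    """hypotheses: list of (word_list, acoustic_logprob) pairs from the first pass.
    lm_logprob: callable mapping a word list to a language-model log probability."""
    best_words, best_score = None, float("-inf")
    for words, acoustic in hypotheses:
        score = acoustic + lm_scale * lm_logprob(words) + word_penalty * len(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```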

Baseline Experiments –Standard back-off trigram model –ME model with only N-gram constraints –Minimum count thresholds applied to the bigram and trigram features

Estimation of Topic-Conditional Models Each conversation side in the training corpus is processed to obtain a representative vector of weighted frequencies of vocabulary terms, excluding stop words –where a stop word is any of a list of about 700 words with low semantic content which are ignored by the topic classifier. These vectors are then clustered using a K-means procedure (K=70). Words whose unigram frequency f_t in a cluster t differs significantly from their frequency f in the whole corpus are designated as topic-related words. There are roughly 300 such words for every topic cluster, and about 16K such words in the 22K vocabulary.
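A rough sketch of this clustering step is shown below, under stated assumptions: scikit-learn's TfidfVectorizer and KMeans stand in for the paper's term weighting and K-means procedure, and the simple frequency-ratio threshold replaces whatever significance test was actually used.

```python
# Sketch only: cluster conversation sides on tf-idf vectors (stop words removed),
# then flag words that are much more frequent inside a cluster than corpus-wide.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def cluster_and_pick_topic_words(sides, stop_words, k=70, ratio=2.0):
    tfidf = TfidfVectorizer(stop_words=list(stop_words)).fit_transform(sides)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)
    labels = kmeans.labels_

    cv = CountVectorizer(stop_words=list(stop_words))
    counts = cv.fit_transform(sides).toarray()          # raw term counts per side
    vocab = cv.get_feature_names_out()
    corpus_freq = counts.sum(axis=0) / counts.sum()     # corpus-wide frequency f

    topic_words = {}
    for t in range(k):
        cluster_counts = counts[labels == t].sum(axis=0)
        cluster_freq = cluster_counts / max(cluster_counts.sum(), 1)   # cluster frequency f_t
        topic_words[t] = vocab[cluster_freq > ratio * corpus_freq].tolist()
    return kmeans, topic_words
```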

Topic Assignment During Testing A topic is assigned to each test conversation or utterance by one of: –Manual assignment of topics to the conversation –Automatic topic assignment based on the reference transcriptions –Automatic assignment based on the 10-best hypotheses generated by the first recognition pass –Assignment by an oracle so as to minimize perplexity (or WER)

Topic Assignment During Testing A topic is assigned to each utterance based on the 10-best hypotheses of the current and the three preceding utterances. Note that the utterance-level topic assignment of Table 3 is more effective than the conversation-level assignment (Table 2): a 0.8% absolute WER reduction and a 9.7% relative perplexity reduction.
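A hedged sketch of such an utterance-level assignment appears below: the window text (10-best hypotheses of the current and three preceding utterances) is mapped into the same tf-idf space and compared to the cluster centroids by cosine similarity. The `min_similarity` fallback to the topic-independent model is an illustrative assumption, not the criterion used in the paper.

```python
# Sketch only: `vectorizer` is the fitted TfidfVectorizer and `centroids` the
# KMeans cluster centers (e.g., kmeans.cluster_centers_) from training.
import numpy as np

def assign_topic(window_hypotheses, vectorizer, centroids, min_similarity=0.05):
    vec = vectorizer.transform([" ".join(window_hypotheses)]).toarray().ravel()
    norm = np.linalg.norm(vec)
    if norm == 0.0:
        return None                  # no content words: fall back to the topic-independent LM
    sims = centroids @ vec / (np.linalg.norm(centroids, axis=1) * norm + 1e-12)
    best = int(np.argmax(sims))
    return best if sims[best] >= min_similarity else None
```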

Topic Assignment During Testing To gain insight into improved performance from utterance- level topic assignment, we examine agreement between topics assigned at the two level As seen in Table 4, 8 out of 10 utterances prefer the topic- independent model, probably serving vital discourse function (e.g., acknowledgments, back-channel responses)

Analysis of Recognition Performance The vocabulary is divided into two sets: –Topic words and all other words Each word token in the reference transcription and in the recognizer's output is then marked as belonging to one of the two sets, and the perplexity of each set is calculated separately. About 7% of the tokens in the test set have topic-dependent constraints.

Analysis of Recognition Performance Divide the vocabulary simply into content-bearing words and stop words Nearly 25% of tokens in the test set are content-bearing and the remainder are stop words

ME vs. Interpolated Topic N-grams Construct back-off unigram, bigram and trigram models specific to each topic using the partitioning of the 2.1M-word corpus. Interpolate each topic-specific N-gram model with the topic-independent trigram model, choosing the interpolation weight to minimize the perplexity of the test set.
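For reference, this interpolation baseline has the standard form below, where h is the N-gram history, P_t the topic-specific model, P the topic-independent trigram, and \lambda_t a per-topic weight chosen to minimize test-set perplexity as the slide describes:

P_{\text{interp}}(w \mid h, t) = \lambda_t\, P_t(w \mid h) + (1 - \lambda_t)\, P(w \mid h)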

Topic-Dependent ME Model Including N-grams With Lower Counts

Conclusion An ME language model which combines topic dependencies and N-gram constraints has been presented. The performance improvement on content-bearing words is even more significant. A small number of topic-dependent unigram constraints are able to provide this improvement because the information they provide is complementary to the N-grams.