Statistical Language Modeling for Speech Recognition and Information Retrieval Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

Outline
What is Statistical Language Modeling?
Statistical Language Modeling for Speech Recognition, Information Retrieval, and Document Summarization
Categorization of Statistical Language Models
Main Issues for Statistical Language Models
Conclusions

What is Statistical Language Modeling?
Statistical language modeling (LM) aims to capture the regularities of human natural language and to quantify the acceptability of a given word sequence.
(Adapted from Joshua Goodman's public presentation slides)

What is Statistical LM Used for?
Statistical language modeling has been a continual focus of active research in the speech and language community over the past three decades.
It has also been introduced to information retrieval (IR) problems, where it provides an effective and theoretically attractive probabilistic framework for building IR systems.
Other application domains: machine translation, input method editors (IME), optical character recognition (OCR), bioinformatics, etc.

Statistical LM for Speech Recognition
Speech recognition: finding, among the many possible word sequences, the word sequence that has the maximum posterior probability given the input speech utterance.
The posterior decomposes into two components, acoustic modeling and language modeling, as sketched below.
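The standard MAP decoding rule this slide refers to, written out (the original slide's equation image is not preserved; X denotes the acoustic observations and W a candidate word sequence):
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic modeling}} \; \underbrace{P(W)}_{\text{language modeling}}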

Statistical LM for Information Retrieval
Information retrieval (IR): identifying the information items or "documents" within a large collection that best match (are most relevant to) a "query" provided by a user to describe his or her information need.
Query-likelihood retrieval model: a query is regarded as being generated from a "relevant" document that satisfies the information need.
Estimate the likelihood of each document in the collection being the relevant document, and rank the documents accordingly; the ranking score combines the query likelihood and the document prior probability (see the sketch below).
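A hedged sketch of the query-likelihood ranking formula implied here (the symbols D, Q, and q_i are my notation, not taken from the slide):
P(D \mid Q) \propto \underbrace{P(Q \mid D)}_{\text{query likelihood}} \; \underbrace{P(D)}_{\text{document prior}},
\qquad
P(Q \mid D) = \prod_{i=1}^{|Q|} P(q_i \mid D) \quad \text{(unigram assumption)}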

Statistical LM for Document Summarization
Estimate the likelihood of each sentence of the document being included in the summary, and rank the sentences accordingly.
The sentence generative probability (or document likelihood) can be taken as a relevance measure between the document and the sentence.
The sentence prior probability is, to some extent, a measure of the importance of the sentence itself. A sketch of the ranking rule follows.
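A minimal sketch of the corresponding ranking rule, by analogy with the retrieval case (S for a sentence and D for the document are my notation):
P(S \mid D) \propto \underbrace{P(D \mid S)}_{\text{sentence generative probability}} \; \underbrace{P(S)}_{\text{sentence prior probability}}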

n-Gram Language Models (1/3)
For a word sequence W = w_1 w_2 ... w_m, the probability P(W) can be decomposed into a product of conditional probabilities (the multiplication/chain rule, written out below), each conditioned on the history of w_i.
However, it is impossible to estimate and store P(w_i | w_1, ..., w_{i-1}) when the history is long (the curse of dimensionality).
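The chain-rule decomposition in its standard form (the slide's own equation image is not preserved):
P(W) = P(w_1 w_2 \cdots w_m)
     = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, \ldots, w_{m-1})
     = \prod_{i=1}^{m} P(w_i \mid w_1^{\,i-1})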

n-Gram Language Models (2/3)
n-gram approximation: the history is truncated to the most recent n-1 words (a history of length n-1).
Also called (n-1)-order Markov modeling; it is the most prevalent type of language model. E.g., trigram modeling (n = 3).
How do we find the probabilities? Maximum likelihood estimation: get real text and start counting (empirically).
Note that a counted probability may be zero if the n-gram never occurs in the text.
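A minimal, illustrative sketch (my own, not from the slides) of maximum-likelihood trigram estimation by counting; the function and variable names are hypothetical:

from collections import defaultdict

def train_trigram_mle(sentences):
    """Estimate P(w3 | w1, w2) as count(w1 w2 w3) / count(w1 w2)."""
    tri_counts = defaultdict(int)   # counts of (w1, w2, w3)
    ctx_counts = defaultdict(int)   # counts of the (w1, w2) history
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(len(padded) - 2):
            w1, w2, w3 = padded[i], padded[i + 1], padded[i + 2]
            tri_counts[(w1, w2, w3)] += 1
            ctx_counts[(w1, w2)] += 1
    return {tri: n / ctx_counts[tri[:2]] for tri, n in tri_counts.items()}

# Trigrams never seen in training get probability zero -- the data-sparseness
# problem that the later smoothing slides address.
model = train_trigram_mle([["i", "read", "a", "book"], ["i", "read", "a", "paper"]])
print(model[("read", "a", "book")])   # 0.5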

n-Gram Language Models (3/3)
Minimum Word Error (MWE) discriminative training:
Given a training set of observation (speech) sequences, the MWE criterion aims to minimize the expected word errors over these observation sequences.
The MWE objective function can be optimized with respect to the language model probabilities using the Extended Baum-Welch (EBW) algorithm.

n-Gram-Based Retrieval Model (1/2)
Each document is treated as a probabilistic generative model consisting of a set of n-gram distributions for predicting the query.
Document models can be optimized with the expectation-maximization (EM) or minimum classification error (MCE) training algorithm, given a set of query/relevant-document pairs.
Features:
1. A formal mathematical framework
2. Uses collection statistics rather than heuristics
3. The retrieval system can be gradually improved through usage

n-Gram-Based Retrieval Model (2/2)
MCE training:
Given a query and a desired relevant document, define a classification error function ("> 0" means a misclassification; "<= 0" means a correct decision).
Transform the error function into a smooth loss function, and iteratively update the weighting parameters, e.g., as sketched below.
All irrelevant documents in the answer set can also be taken into consideration.
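A hedged sketch of the generic MCE recipe the slide outlines; the error function d, sigmoid loss, and gradient step below follow the standard MCE/GPD formulation and are not necessarily the exact forms on the original slide (D_R is the relevant document, D_I a competing irrelevant one, gamma a slope parameter, epsilon a step size):
d(Q, D_R) = -\log P(Q \mid D_R) + \log P(Q \mid D_I)
\ell(d) = \frac{1}{1 + \exp(-\gamma d)}
\theta \leftarrow \theta - \varepsilon \, \frac{\partial \ell(d)}{\partial \theta}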

n-Gram-Based Summarization Model (1/2)
Each sentence of the spoken document is treated as a probabilistic generative model of n-grams, while the spoken document itself is the observation.
The sentence model is estimated from the sentence; the collection model is estimated from a large corpus, so that every word in the vocabulary receives some nonzero probability (see the sketch below).
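A hedged sketch of the smoothed sentence model implied here, in the usual unigram form (P(w|S), P(w|C), and the weight lambda are my notation, not the slide's):
P(w \mid S) = \lambda \, P_{\mathrm{ML}}(w \mid S) + (1 - \lambda)\, P(w \mid C), \qquad 0 < \lambda < 1
so a word unseen in the sentence still receives probability mass from the collection model P(w|C).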

n-Gram-Based Summarization Model (2/2)
Relevance Model (RM): used to improve the estimation of the sentence models.
Each sentence has its own associated relevance model, constructed from the subset of documents in the collection that are relevant to the sentence.
The relevance model is then linearly combined with the original sentence model to form a more accurate sentence model.

Categorization of Statistical Language Models (1/4)

Categorization of Statistical Language Models (2/4)
1. Word-based LMs
The n-gram model is usually the basic model of this category.
Many other models in this category are designed to overcome the major drawback of n-gram models, that is, to capture long-distance word dependence without rapidly increasing the model complexity.
E.g., the mixed-order Markov model and the trigger-based language model.
2. Word class (or topic)-based LMs
These models are similar to the n-gram model, but the relationships among words are constructed via (latent) word classes.
Once the relationships are established, the probability of a decoded word given the history words can easily be determined.
E.g., the class-based n-gram model, the aggregate Markov model, and the word topical mixture model (WTMM).

Categorization of Statistical Language Models (3/4)
3. Structure-based LMs
Using grammatical constraints, rules for a sentence can be derived and represented as a parse tree.
Candidate words can then be selected according to the sentence patterns or the head words of the history.
E.g., the structured language model.
4. Document class (or topic)-based LMs
Words are aggregated in a document to represent topics (or concepts). During speech recognition, the history is treated as an incomplete document, and the associated latent topic distributions can be discovered on the fly.
The decoded words related to the topics that the history most probably belongs to can therefore be selected.
E.g., the mixture-based language model, probabilistic latent semantic analysis (PLSA), and latent Dirichlet allocation (LDA).

Categorization of Statistical Language Models (4/4)
Ironically, the most successful statistical language modeling techniques use very little knowledge of what language really is: the word sequence may as well be a sequence of arbitrary symbols, with no deep structure, intention, or thought behind them.
F. Jelinek therefore urged researchers to "put language back into language modeling" ("Closing remarks," 1995 Language Modeling Summer Workshop, Baltimore).

Main Issues for Statistical Language Models
Evaluation: how can you tell a good language model from a bad one? Run a speech recognizer, or adopt other statistical measurements.
Smoothing: deal with the data sparseness of real training data; various approaches have been proposed.
Adaptation: the subject matter and lexical characteristics of the linguistic content of utterances or documents (e.g., news articles) can be very diverse and often change over time; LMs should be adapted accordingly.
Caching: if you say something, you are likely to say it again later; adjust the word frequencies observed in the current conversation.

Evaluation (1/7)
The two most common metrics for evaluating a language model are the word recognition error rate (WER) and perplexity (PP).
Word recognition error rate:
Requires running a speech recognition system (slow!)
Requires handling the combination of acoustic probabilities and language model probabilities (weighting or penalizing between them)

Evaluation (2/7)
Perplexity:
Perplexity is the geometric average of the inverse language model probability (it measures language model difficulty, not acoustic difficulty/confusability).
It can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model.
For trigram modeling, the corresponding formula is sketched below.
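The standard perplexity definition the slide refers to (the slide's equation image is not preserved; this is the usual form, for a test text of N words):
PP(W) = P(w_1 w_2 \cdots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}
For trigram modeling:
PP(W) = \left( \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1}) \right)^{-1/N}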

Evaluation (3/7)
More about perplexity:
Perplexity is an indication of the complexity of the language, provided we have an accurate estimate of P(W).
A language with higher perplexity means that the number of words branching from a previous word is larger on average.
A language model with perplexity L has roughly the same difficulty as another language model in which every word can be followed by L different words with equal probabilities.
Examples:
Ask a speech recognizer to recognize the digits "0, 1, 2, 3, 4, 5, 6, 7, 8, 9": easy, perplexity 10.
Ask a speech recognizer to recognize names at a large institute (10,000 persons): hard, perplexity about 10,000.

Evaluation (4/7)
More about perplexity (cont.):
Training-set perplexity measures how well the language model fits the training data.
Test-set perplexity evaluates the generalization capability of the language model.
When we say "perplexity," we usually mean test-set perplexity.

Evaluation (5/7)
Is a language model with lower perplexity better?
The true (optimal) model for the data has the lowest possible perplexity; the lower the perplexity, the closer we are to the true model.
Typically, perplexity correlates well with speech recognition word error rate: it correlates better when the compared models are trained on the same data, and less well when the training data changes.
The 20,000-word continuous speech recognition Wall Street Journal (WSJ) task has a perplexity of about 128-176 (trigram); the 2,000-word conversational Air Travel Information System (ATIS) task has a perplexity of less than 20.

Evaluation (6/7)
The perplexity of a bigram model with different vocabulary sizes.

Evaluation (7/7)
A rough rule of thumb (recommended by Rosenfeld):
A reduction of 5% in perplexity is usually not practically significant.
A 10%-20% reduction is noteworthy, and usually translates into some improvement in application performance.
A perplexity improvement of 30% or more over a good baseline is quite significant.
However, perplexity cannot always reflect the difficulty of a speech recognition task. Consider three tasks of recognizing 10 isolated words with IBM ViaVoice; each vocabulary has 10 equally likely words, so the perplexity is the same, yet the word error rates differ:

Vocabulary                                                      Perplexity   WER (%)
zero, one, two, three, four, five, six, seven, eight, nine      10           5
John, Tom, Sam, Bon, Ron, Susan, Sharon, Carol, Laura, Sarah    10           7
bit, bite, boot, bait, bat, bet, beat, boat, burt, bart         10           9

Smoothing (1/3)
The maximum likelihood (ML) estimate of language model probabilities has been shown previously: each trigram or bigram probability is a ratio of counts, as written out below.
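The count-ratio estimates referred to here, in standard form (the slide's equation images are not preserved; c(.) denotes the count of a word string in the training corpus):
P_{\mathrm{ML}}(w_3 \mid w_1, w_2) = \frac{c(w_1, w_2, w_3)}{c(w_1, w_2)}
\qquad
P_{\mathrm{ML}}(w_2 \mid w_1) = \frac{c(w_1, w_2)}{c(w_1)}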

Smoothing (2/3)
Data sparseness: many perfectly possible events (word successions) in the test set may not be observed in the training data.
E.g., with bigram modeling: if P(read | Mulan) = 0, then P(Mulan read a book) = 0, so P(W) = 0 and P(X|W)P(W) = 0.
Whenever a string W with P(W) = 0 occurs during the speech recognition task, an error will be made.

Smoothing (3/3)
Smoothing: assign a nonzero probability to all strings (events/word successions), even those that never occur in the training data.
Smoothing tends to make distributions flatter by adjusting low probabilities upward and high probabilities downward.

Smoothing: Simple Models
Add-one smoothing: for example, pretend each trigram occurs once more often than it actually does.
Add-delta smoothing: add a small constant delta instead of one.
Both work badly! DO NOT USE THESE TWO.
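The corresponding formulas in their standard form (V is the vocabulary; these are not reproduced from the slide itself):
Add-one:   P(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + |V|}
Add-delta: P(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) + \delta}{c(w_{i-2}, w_{i-1}) + \delta\,|V|}, \quad 0 < \delta \le 1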

Smoothing: Back-Off Models
The general form of n-gram back-off: if the n-gram has been seen, use its (discounted) n-gram estimate; otherwise back off to the smoothed (n-1)-gram, scaled by a normalizing factor chosen so that the conditional probabilities sum to 1 (a standard formulation is sketched below).
E.g. (the history "中華民國 總統" means "President of the Republic of China"; the candidate words are names of R.O.C. presidents and politicians):
P(蔣介石 | 中華民國 總統) = 0.3
P(嚴家淦 | 中華民國 總統) = 0.1
P(蔣經國 | 中華民國 總統) = 0.3
P(李登輝 | 中華民國 總統) = 0.2
P(陳水扁 | 中華民國 總統) = 0, but P(陳水扁 | 總統) = 0.3
P(蘇貞昌 | 中華民國 總統) = 0, but P(蘇貞昌 | 總統) = 0.1
......
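A standard way to write the back-off rule described above (my notation; P* is the discounted estimate and alpha(.) the normalizing factor):
P_{\mathrm{BO}}(w_i \mid w_{i-n+1}^{\,i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-n+1}^{\,i-1}) & \text{if } c(w_{i-n+1}^{\,i}) > 0 \\
\alpha(w_{i-n+1}^{\,i-1}) \, P_{\mathrm{BO}}(w_i \mid w_{i-n+2}^{\,i-1}) & \text{otherwise}
\end{cases}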

Smoothing: Interpolated Models
The general form of the interpolated n-gram model: linearly combine the higher-order estimate with the smoothed lower-order distribution (sketched below).
The key difference between back-off and interpolated models: for n-grams with nonzero counts, interpolated models still use information from lower-order distributions, while back-off models do not.
Moreover, in interpolated models, n-grams with the same counts can have different probability estimates.
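A standard form of the interpolated model described above (again my notation, not the slide's own equation):
P_{\mathrm{INT}}(w_i \mid w_{i-n+1}^{\,i-1}) =
\lambda_{w_{i-n+1}^{\,i-1}} \, P_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{\,i-1})
+ \bigl(1 - \lambda_{w_{i-n+1}^{\,i-1}}\bigr) \, P_{\mathrm{INT}}(w_i \mid w_{i-n+2}^{\,i-1})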

Caching (1/2)
The basic idea of caching is to accumulate the n-grams dictated so far in the current document/conversation and use them to build a dynamic n-gram model, e.g., a trigram interpolated with a unigram cache, or a trigram interpolated with a bigram cache (see the sketch below).
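A sketch of the trigram-plus-unigram-cache combination mentioned here, in its standard form (lambda and the cache history H are my notation):
P_{\mathrm{cache}}(w_i \mid w_{i-2}, w_{i-1}, H) =
\lambda \, P(w_i \mid w_{i-2}, w_{i-1}) + (1 - \lambda) \, \frac{c_H(w_i)}{|H|}
where c_H(w_i) is the number of times w_i has appeared so far in the cache history H. The bigram-cache variant replaces the last term with a cache-based bigram estimate.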

Caching (2/2)
Caching in real life:
Someone says "I swear to tell the truth"; the system hears "I swerve to smell the soup."
Someone then says "The whole truth", and, with the cache, the system hears "The toll booth": the errors are locked in.
Caching works well when users correct errors as they go, but works poorly, or even hurts, without corrections. The cache remembers!
(Swerve: to turn aside suddenly; to veer off course.)
(Adapted from Joshua Goodman's public presentation slides)

LM Integrated into Speech Recognition
Theoretically, the acoustic and language model probabilities are simply multiplied.
Practically, the language model is weighted more heavily, because it is a better predictor while the acoustic scores aren't "real" probabilities, and insertions are penalized, e.g., as sketched below.
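A hedged sketch of the two decoding criteria the slide contrasts (standard practice; the LM scale alpha and word insertion penalty beta are my notation):
Theoretically:  \hat{W} = \arg\max_W \; P(X \mid W) \, P(W)
Practically:    \hat{W} = \arg\max_W \; \log P(X \mid W) + \alpha \log P(W) - \beta \, |W|
with alpha > 1 and beta > 0 applied per word of the hypothesis W.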

n-Gram Language Model Adaptation (1/4)
Count merging:
The n-gram conditional probabilities (over all possible n-gram histories and the whole vocabulary) form multinomial distributions.
Their parameters are given sets of independent Dirichlet prior distributions with appropriate hyperparameters.
The MAP estimate maximizes the posterior distribution of the parameters.

n-Gram Language Model Adaptation (2/4)
Count merging (cont.):
Maximize the posterior distribution of the parameters subject to the constraint that the conditional probabilities for each history sum to 1.
Differentiate the resulting Lagrangian with respect to each parameter (introducing a Lagrange multiplier for the constraint) and solve.

n-Gram Language Model Adaptation (3/4)
Count merging (cont.):
With one parameterization of the prior distribution, the MAP estimate reduces to the count merging adaptation formula, which combines the counts from the background corpus and the adaptation corpus (sketched below).
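A hedged sketch of the usual count-merging formula (the subscripts B and A denote the background and adaptation corpora; the weights alpha, beta and the exact notation are assumptions, since the slide's equation is not preserved):
P(w \mid h) = \frac{\alpha \, c_B(h, w) + \beta \, c_A(h, w)}{\alpha \, c_B(h) + \beta \, c_A(h)}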

n-Gram Language Model Adaptation (4/4)
Model interpolation:
With a second parameterization of the prior distribution, the MAP estimate reduces to the model interpolation adaptation formula, which linearly combines the background and adaptation n-gram models (sketched below).
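A hedged sketch of the model interpolation formula (lambda and the subscripts B, A are my notation):
P(w \mid h) = \lambda \, P_B(w \mid h) + (1 - \lambda) \, P_A(w \mid h), \qquad 0 \le \lambda \le 1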

Known Weaknesses of Current n-Gram LMs
Brittleness across domains: current language models are extremely sensitive to changes in the style or topic of the text on which they are trained (e.g., conversations vs. news broadcasts, fiction vs. politics).
Language model adaptation can help: using in-domain or contemporary text corpora/speech transcripts; static or dynamic adaptation; local contextual (n-gram) or global semantic/topical information.
False independence assumption: in order to remain trainable, n-gram modeling assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words, i.e., (n-1)-order Markov modeling.

Conclusions
Statistical language modeling has been demonstrated to be an effective probabilistic framework for NLP, ASR, and IR-related applications.
Many issues remain to be solved for statistical language modeling, e.g.:
Unknown word (or spoken term) detection
Discriminative training of language models
Adaptation of language models across different domains and genres
Fusion of various (or different levels of) features for language modeling: positional information? rhetorical (structural) information?

References
J.R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication 42, 93-108, 2004.
X. Liu, W.B. Croft. Statistical language modeling for information retrieval. Annual Review of Information Science and Technology 39, Chapter 1, 2005.
R. Rosenfeld. Two decades of statistical language modeling: where do we go from here? Proceedings of the IEEE 88(8), August 2000.
J. Goodman. A bit of progress in language modeling, extended version. Microsoft Research Technical Report MSR-TR-2001-72, 2001.
H.S. Chiu, B. Chen. Word topical mixture models for dynamic language model adaptation. ICASSP 2007.
J.W. Kuo, B. Chen. Minimum word error based discriminative training of language models. Interspeech 2005.
B. Chen, H.M. Wang, L.S. Lee. Spoken document retrieval and summarization. Advances in Chinese Spoken Language Processing, Chapter 13, 2006.

Maximum Likelihood Estimate (MLE) for n-Grams (1/2)
Given a training corpus T and a language model M, the goal is to choose M so as to maximize the likelihood of T.
Essentially, the distribution of the sample counts sharing the same history is a multinomial distribution; n-grams with the same history are collected together.
E.g., given a Chinese corpus containing "...陳水扁 總統 訪問 美國 紐約..." and "...陳水扁 總統 在 巴拿馬 表示..." ("...President Chen Shui-bian visited New York, U.S.A...", "...President Chen Shui-bian stated in Panama..."), estimate the bigram P(總統 | 陳水扁) = ?

Maximum Likelihood Estimate (MLE) for n-Grams (2/2)
Taking the logarithm of the corpus likelihood, we obtain a sum of count-weighted log conditional probabilities.
For each history, we maximize this sum over the words subject to the constraint that the conditional probabilities sum to 1, which yields the relative-frequency (count-ratio) estimate.
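A compact version of the derivation sketched above (standard MLE with a Lagrange multiplier; the notation c(h, w) for counts and h for a history is mine):
\log P(T \mid M) = \sum_{h} \sum_{w} c(h, w) \, \log P(w \mid h)
For each history h, maximize \sum_{w} c(h, w) \log P(w \mid h) subject to \sum_{w} P(w \mid h) = 1.
Setting the derivative of the Lagrangian to zero gives
P_{\mathrm{ML}}(w \mid h) = \frac{c(h, w)}{\sum_{w'} c(h, w')} = \frac{c(h, w)}{c(h)}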