Language Modeling for Speech Recognition

Veton Këpuska, 28 November 2018

Outline:
Introduction
n-gram language models
Probability estimation
Evaluation
Beyond n-grams

Speech recognizers seek the word sequence Ŵ that is most likely given the acoustic evidence A.
Speech recognition involves acoustic processing, acoustic modeling, language modeling, and search.
Language models (LMs) assign a probability estimate P(W) to word sequences W = {w1, ..., wn}, subject to the constraint that these probabilities sum to one over all word sequences.
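In standard form, the recognizer's decision rule and the LM normalization constraint are:

```latex
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W),
\qquad \sum_{W} P(W) = 1
```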

Language models help guide and constrain the search among alternative word hypotheses during recognition.

Language Model Requirements

Finite-State Networks (FSNs)
Language space defined by a word network or graph.
Describable by a regular phrase-structure grammar, e.g., A → aB | a
Finite coverage can present difficulties for ASR.
Graph arcs or rules can be augmented with probabilities.

Context-Free Grammars (CFGs)
Language space defined by context-free rewrite rules, e.g., A → BC | a
A more powerful representation than FSNs.
Stochastic CFG rules have associated probabilities, which can be learned automatically from a corpus.
Finite coverage can present difficulties for ASR.

Word-Pair Grammars and Bigrams
Language space defined by lists of legal word pairs.
Can be implemented efficiently within Viterbi search.
Finite coverage can present difficulties for ASR.
Bigrams define probabilities for all word pairs and can produce a nonzero P(W) for all possible sentences.

Example of LM Impact (Lee, 1988)
Resource Management domain: speaker-independent, continuous-speech corpus.
Sentences generated from a finite-state network; 997-word vocabulary.
Word-pair perplexity ≈ 60; bigram perplexity ≈ 20.
Error rate includes substitutions, deletions, and insertions.

Note on Perplexity
Language can be thought of as an information source whose outputs are words wi belonging to the vocabulary of the language.
Perplexity is derived from the probability P(W) that a language model assigns to a test word string W.
The cross-entropy H(W) of a model P(wi | wi-n+1, …, wi-1) on data W can be approximated from P(W) and Nw, where Nw is the length of the text W measured in words (see the formulas below).

The perplexity PP(W) of a language model P(W) is defined as the reciprocal of the (geometric) average probability assigned by the model to each word in the test set W.
This measure, related to cross-entropy by PP(W) = 2^H(W), is known as test-set perplexity.
Perplexity can be roughly interpreted as the geometric mean of the branching factor of the text when presented to the language model.
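In standard form:

```latex
H(W) \approx -\frac{1}{N_w} \log_2 P(W), \qquad PP(W) = 2^{H(W)}
```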

LM Formulation for ASR
Language model probabilities P(W) are usually incorporated into the ASR search as early as possible.
Since most searches proceed unidirectionally (left to right), P(W) is usually factored with the chain rule (shown below), where hi = {w1, …, wi-1} is the word history for wi.
The history hi is often reduced to an equivalence class Φ(hi).
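Written out:

```latex
P(W) = \prod_{i=1}^{n} P(w_i \mid h_i) \approx \prod_{i=1}^{n} P\big(w_i \mid \Phi(h_i)\big),
\qquad h_i = \{w_1, \ldots, w_{i-1}\}
```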

Good equivalence classes maximize the information about the next word wi given its history, i.e., they preserve as much of hi as possible in Φ(hi).
Language models that require the full word sequence W are usually applied as post-processing filters.

n-gram Language Models
n-gram models use the previous n-1 words to represent the history: Φ(hi) = {wi-1, …, wi-n+1}.
Probabilities are estimated from frequencies and counts in a training corpus (see the example below).
Due to sparse-data problems, n-gram estimates are typically smoothed with lower-order frequencies, subject to the constraint that each conditional distribution still sums to one.
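For a trigram, the maximum-likelihood (relative-frequency) estimate is computed from training-corpus counts c(·):

```latex
f(w_i \mid w_{i-2} w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}
```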

Bigrams are easily incorporated into Viterbi search.
Trigrams have been used for large-vocabulary recognition since the mid-1970s and remain the dominant language model.

IBM Trigram Example (Jelinek, 1997)

IBM Trigram Example (cont.)

n-gram Issues: Sparse Data (Jelinek, 1985)
Text corpus of IBM patent descriptions: 1.5 million words for training, 300,000 words for testing.
Vocabulary restricted to the 1,000 most frequent words.
23% of the trigrams occurring in the test corpus were absent from the training corpus!
In general, a vocabulary of size V has V^n possible n-grams (e.g., 20,000 words yield 400 million bigrams and 8 trillion trigrams).

n-gram Interpolation
Smoothed probabilities are a linear combination of relative frequencies of different orders (see below).
The interpolation weights λ are computed with the EM algorithm on held-out data.
Different λ's can be used for different histories hi.
A simpler, fixed formulation of the λ's can also be used.
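For example, an interpolated trigram combines trigram, bigram, and unigram frequencies:

```latex
P(w_i \mid w_{i-2} w_{i-1}) =
  \lambda_3 f(w_i \mid w_{i-2} w_{i-1})
+ \lambda_2 f(w_i \mid w_{i-1})
+ \lambda_1 f(w_i),
\qquad \sum_k \lambda_k = 1, \; \lambda_k \ge 0
```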

The interpolated estimates can also be defined recursively, with each n-gram order interpolated with the smoothed (n-1)-gram estimate.

Interpolation Example

Deleted Interpolation
Initialize the λ's (e.g., to a uniform distribution).
Compute the probability P(j|wi) that the j-th frequency estimate was used when word wi was generated.
Recompute the λ's from these posteriors over the ni words in the held-out data.
Iterate until convergence (a sketch follows this list).
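A minimal Python sketch of this EM loop for two estimates (bigram and unigram); the function names f_bigram and f_unigram, the held-out corpus format, and the iteration count are illustrative assumptions, not taken from the slides.

```python
# Deleted interpolation: EM re-estimation of the mixture weights (lambdas)
# on held-out data, for a bigram/unigram mixture.

def deleted_interpolation(heldout, f_bigram, f_unigram,
                          iters=20, lambdas=(0.5, 0.5)):
    lam_bi, lam_uni = lambdas
    for _ in range(iters):
        # Expected counts of "which estimate generated this word"
        exp_bi, exp_uni = 0.0, 0.0
        for h, w in heldout:                 # heldout: list of (history, word)
            p_bi = lam_bi * f_bigram(w, h)
            p_uni = lam_uni * f_unigram(w)
            total = p_bi + p_uni
            if total == 0.0:
                continue                     # word unseen by both estimates
            exp_bi += p_bi / total           # posterior P(j = bigram | w, h)
            exp_uni += p_uni / total         # posterior P(j = unigram | w, h)
        # Re-estimate the mixture weights from the expected counts
        denom = exp_bi + exp_uni
        lam_bi, lam_uni = exp_bi / denom, exp_uni / denom
    return lam_bi, lam_uni
```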

Back-Off n-grams (Katz, 1987)
ML estimates are used when counts are large.
Low-count estimates are reduced (discounted) to provide probability mass for unseen sequences.
Zero-count estimates are based on a weighted (n-1)-gram probability (see the bigram form below).
Discounting is typically based on the Good-Turing estimate.
The back-off factor q(w1) is chosen so that each conditional distribution sums to one.
Higher-order n-grams are computed recursively.
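In the bigram case, the back-off estimate takes the form (with d_c the discount applied to a bigram seen c times):

```latex
P(w_2 \mid w_1) =
\begin{cases}
d_{c}\, \dfrac{c(w_1 w_2)}{c(w_1)}, & c(w_1 w_2) > 0 \\[1ex]
q(w_1)\, P(w_2), & c(w_1 w_2) = 0
\end{cases}
```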

Good-Turing Estimate
Consider the probability that a word with true probability θ occurs r times in N samples, and the probability that it occurs r+1 times in N+1 samples (both binomial).
Assume the nr words occurring r times all have the same value of θ.

Assuming large N, we can solve for θ, or equivalently for a discounted count r* (given below).
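The resulting discounted count and probability estimate are:

```latex
r^{*} = (r + 1)\,\frac{n_{r+1}}{n_r}, \qquad P_{GT} = \frac{r^{*}}{N}
```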

Good-Turing Example (Church and Gale, 1991)
The GT estimate for an item occurring r times out of N is r*/N, where nr is the number of items (e.g., n-grams) occurring r times.
Example: bigram counts from a 22-million-word corpus of AP news articles (273,000-word vocabulary).

Integration into Viterbi Search
Bigrams can be efficiently incorporated into Viterbi search using an intermediate node between words.
The weight through the intermediate node is Q(wi) = 1 - λi for the interpolated model, and Q(wi) = q(wi) for the back-off model.

Evaluating Language Models
Recognition accuracy
Qualitative assessment (random sentence generation, sentence reordering)
Information-theoretic measures

Random Sentence Generation: Air Travel Domain (Bigram)
Show me the flight earliest flight from Denver
How many flights that flight leaves around is the Eastern Denver
I want a first class
Show me a reservation the last flight from Baltimore for the first
I would like to fly from Dallas
I get from Pittsburgh
Which just small
In Denver on October
I would like to San Francisco
Is flight flying
What flights from Boston to San Francisco
How long can you book a hundred dollars
I would like to Denver to Boston and Boston
Make ground transportation is the cheapest
Are the next week on AA eleven ten
First class
How many airlines from Boston on May thirtieth
What is the city of three PM
What about twelve and Baltimore

Random Sentence Generation: Air Travel Domain (Trigram)
What type of aircraft
What is the fare on flight two seventy two
Show me the flights I’ve Boston to San Francisco on Monday
What is the cheapest one way
Okay on flight number seven thirty six
What airline leaves earliest
Which airlines from Philadelphia to Dallas
I’d like to leave at nine eight
What airline
How much does it cost
How many stops does Delta flight five eleven o’clock PM that go from
What AM
Is Eastern from Denver before noon
Earliest flight from Dallas
I need to Philadelphia
Describe to Baltimore on Wednesday from Boston
I’d like to depart before five o’clock PM
Which flights do these flights leave after four PM and lunch and <unknown>

Sentence Reordering (Jelinek, 1991)
Scramble the words of a sentence, then find the most probable order with the language model.
Results with a trigram LM on short sentences from spontaneous dictation:
63% of reordered sentences were identical to the original; 86% had the same meaning.

IBM Sentence Reordering (each pair shows the two orderings)
would I report directly to you  /  I would report directly to you
now let me mention some of the disadvantages  /  let me mention some of the disadvantages now
he did this several hours later  /  this he did several hours later
this is of course of interest to IBM  /  of course this is of interest to IBM
approximately seven years I have known John  /  I have known John approximately seven years
these people have a fairly large rate of turnover  /  of these people have a fairly large turnover rate
in our organization research has two missions  /  in our missions research organization has two
exactly how this might be done is not clear  /  clear is not exactly how this might be done

Quantifying LM Complexity
One LM is better than another if it can predict an n-word test corpus W with a higher probability P(W).
For LMs representable by the chain rule, comparisons are usually based on the average per-word log probability, LP.
A more intuitive representation of LP is the perplexity PP = 2^LP (a uniform LM has PP equal to the vocabulary size).
PP is often interpreted as an average branching factor.
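A minimal Python sketch of this computation, assuming a function model_prob(word, history) that returns P(w | h); the function name and corpus format are illustrative assumptions.

```python
import math

def perplexity(test_words, model_prob):
    logprob = 0.0
    history = []
    for w in test_words:
        p = model_prob(w, tuple(history))
        logprob += math.log2(p)          # accumulate log2 P(w_i | h_i)
        history.append(w)
    lp = -logprob / len(test_words)      # average per-word logprob, LP
    return lp, 2.0 ** lp                 # (LP, PP = 2^LP)
```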

Perplexity Examples

Language Entropy
The average logprob LP is related to the overall uncertainty of the language, quantified by its entropy H.
If W is obtained from a well-behaved (ergodic) source, the per-word logprob of a single long sample converges to its expected value, so H can be estimated from one sufficiently long text.
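In standard form, for word sequences W of length n:

```latex
H = -\lim_{n \to \infty} \frac{1}{n} \sum_{W} P(W) \log_2 P(W)
  \;=\; -\lim_{n \to \infty} \frac{1}{n} \log_2 P(W) \quad \text{(ergodic source)}
```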

The entropy H is a theoretical lower bound on LP: no model can achieve an average per-word logprob below H.

Human Language Entropy (Shannon, 1951)
An attempt to estimate the entropy of language as modeled by humans.
Subjects guessed the next symbol in text, which allowed their probability distribution to be measured.
Letters were used instead of words to simplify the experiments.
Shannon estimated Ĥ ≈ 1 bit/letter.

Why do n-grams work so well?
Probabilities are based on data (the more the better).
Parameters are determined automatically from corpora.
They incorporate local syntax, semantics, and pragmatics.
Many languages have a strong tendency toward standard word order and are thus substantially local.
They are relatively easy to integrate into forward search methods such as Viterbi (bigram) or A*.

Problems with n-grams
Unable to incorporate long-distance constraints.
Not well suited to flexible-word-order languages.
Cannot easily accommodate new vocabulary items, alternative domains, or dynamic changes (e.g., discourse).
Not as good as humans at identifying and correcting recognizer errors or at predicting following words (or letters).
Do not capture meaning for speech understanding.

Clustering Words
Many words have similar statistical behavior, e.g., days of the week, months, cities.
n-gram performance can be improved by clustering words.
Hard clustering puts a word into a single cluster; soft clustering allows a word to belong to multiple clusters.
Clusters can be created manually or automatically.
Manually created clusters have worked well for small domains; automatic clusters have been created bottom-up or top-down.

Bottom-Up Word Clustering (Brown et al., 1992)
Word clusters can be created automatically by forming clusters in a stepwise-optimal (greedy) fashion.
Bottom-up clusters are created by considering the impact on a metric of merging words wa and wb to form a new cluster wab.
Example metrics for a bigram language model: minimum decrease in average mutual information, or minimum increase in training-set conditional entropy.
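For a class bigram model, merges are typically chosen to minimize the loss in the average mutual information between adjacent classes:

```latex
I(c_1; c_2) = \sum_{c_1, c_2} P(c_1 c_2)\, \log_2 \frac{P(c_2 \mid c_1)}{P(c_2)}
```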

Example of Word Clustering

Word-Class n-gram Models
Word-class n-grams cluster words into equivalence classes: W = {w1, …, wn} → {c1, …, cn}.
If clusters are non-overlapping, P(W) is approximated by a product of word-given-class and class n-gram probabilities (see below).
Fewer parameters than word n-grams.
Relatively easy to add new words to existing clusters.
Can be linearly combined with word n-grams if desired.
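For a class bigram, for example:

```latex
P(W) \approx \prod_{i=1}^{n} P(w_i \mid c_i)\, P(c_i \mid c_{i-1})
```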

Predictive Clustering (Goodman, 2000)
For word-class n-grams: P(wi|hi) ≈ P(wi|ci) P(ci|ci-1, …).
Predictive clustering is exact: P(wi|hi) = P(wi|hi, ci) P(ci|hi).
The history hi can be clustered differently for the two terms.
This model can be larger than the word n-gram, but has been shown to produce good results when combined with pruning.

Phrase-Class n-grams (PCNG) (McCandless, 1994)
Probabilistic context-free rules parse phrases: W = {w1, …, wn} → U = {u1, …, um}.
An n-gram model produces the probability of the resulting units.
P(W) is the product of the parsing and n-gram probabilities: P(W) = Pr(W) Pn(U).
An intermediate representation between word-based n-grams and stochastic context-free grammars.
Context-free rules can be learned automatically.

PCNG Example

PCNG Experiments
Air Travel Information Service (ATIS) domain: spontaneous spoken-language understanding.
21,000 training, 2,500 development, and 2,500 test sentences; 1,956-word vocabulary.

Decision Tree Language Models (Bahl et al., 1989)
Equivalence classes are represented in a decision tree: branch nodes contain questions about the history hi, and leaf nodes contain equivalence classes.
The word n-gram formulation fits the decision-tree model.
A minimum-entropy criterion is used for construction.
Significant computation is required to produce the trees.

Exponential Language Models
P(wi|hi) is modeled as a normalized product of exponentially weighted features fj(wi, hi), where the λ's are parameters and Z(hi) is a normalization factor (see below).
Binary-valued features can express arbitrary relationships between the word and its history.
When the expectations E[fj(w, h)] are constrained to match their empirical values, the ML estimates of the λ's correspond to the maximum-entropy distribution.
ML solutions are iterative and can be extremely slow.
Demonstrated perplexity and WER gains on large-vocabulary tasks.
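In standard form:

```latex
P(w_i \mid h_i) = \frac{1}{Z(h_i)} \exp\Big( \sum_{j} \lambda_j f_j(w_i, h_i) \Big),
\qquad Z(h_i) = \sum_{w} \exp\Big( \sum_{j} \lambda_j f_j(w, h_i) \Big)
```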

Adaptive Language Models
Cache-based language models combine statistics of recently used words with a static language model: P(wi|hi) = λ Pc(wi|hi) + (1-λ) Ps(wi|hi).
Trigger-based language models increase word probabilities when key words are observed in the history hi.
Self-triggers provide significant information.
Information metrics are used to find triggers.
Triggers can be incorporated into the maximum-entropy formulation.
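A minimal Python sketch of a cache-based LM; the static_prob interface, cache size, and interpolation weight are illustrative assumptions rather than the values used on the slides.

```python
from collections import Counter, deque

class CacheLM:
    def __init__(self, static_prob, cache_size=200, lam=0.1):
        self.static_prob = static_prob           # static model: P_s(w | h)
        self.cache = deque(maxlen=cache_size)    # recently observed words
        self.counts = Counter()
        self.lam = lam                           # weight on the cache component

    def observe(self, word):
        if len(self.cache) == self.cache.maxlen:
            self.counts[self.cache[0]] -= 1      # word about to fall out of the cache
        self.cache.append(word)
        self.counts[word] += 1

    def prob(self, word, history):
        # Cache component: unigram over the recent-word window
        p_cache = self.counts[word] / len(self.cache) if self.cache else 0.0
        # Interpolate the cache and static estimates
        return self.lam * p_cache + (1.0 - self.lam) * self.static_prob(word, history)
```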

Trigger Examples (Lau, 1994)
Triggers were determined automatically from the WSJ corpus (37 million words) using average mutual information.
The top seven triggers per word were used in the language model.

Language Model Pruning
n-gram language models can get very large (e.g., 6 bytes per n-gram).
Simple techniques can reduce the parameter size: prune n-grams with too few occurrences, or prune n-grams that have a small impact on model entropy.
Trigram count-based pruning example: broadcast news transcription (TV and radio broadcasts); 25K vocabulary; 166M training words (~1 GB); 25K test words.
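A minimal sketch of count-based pruning, assuming trigram counts in a dict mapping (w1, w2, w3) to a count; the threshold is an illustrative assumption, and a pruned model still needs its back-off weights recomputed.

```python
def count_prune(trigram_counts, min_count=2):
    # Keep only trigrams seen at least min_count times
    return {ngram: c for ngram, c in trigram_counts.items() if c >= min_count}

# Example usage:
# pruned = count_prune(trigram_counts, min_count=2)
```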

Entropy-Based Pruning (Stolcke, 1998)
Uses a Kullback-Leibler (KL) distance criterion to prune n-grams with low impact on model entropy:
Select a pruning threshold θ.
Compute the perplexity (relative entropy) increase caused by pruning each n-gram.
Remove the n-grams whose increase falls below θ, and recompute the back-off weights.
Example: re-sorting Broadcast News N-best lists with 4-grams.

Perplexity vs. Error Rate (Rosenfeld et al., 1995)
Switchboard human-human telephone conversations.
2.1 million words for training, 10,000 words for testing.
23,000-word vocabulary; bigram perplexity of 109.
Bigram-generated word-lattice search (10% word error).

References
X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Prentice-Hall, 2001.
K. Church and W. Gale, "A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating Probabilities of English Bigrams," Computer Speech & Language, 1991.
F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
S. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. ASSP-35, 1987.
K. F. Lee, The CMU SPHINX System, Ph.D. Thesis, CMU, 1988.
R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go from Here?," Proceedings of the IEEE, 88(8), 2000.
C. Shannon, "Prediction and Entropy of Printed English," Bell System Technical Journal, 1951.