Language Modeling Lecture 12 Spoken Language Processing Prof. Andrew Rosenberg

Presentation transcript:

Language Modeling Lecture 12 Spoken Language Processing Prof. Andrew Rosenberg

Approaches to Language Modeling
Context-Free Grammars
– Use in Sphinx
N-gram models

Context-Free Grammars
Defined in formal language theory
– Terminals: e.g. cat
– Non-terminal symbols: e.g. NP, VP
– Start symbol (a non-terminal): e.g. S
– Rewrite rules: e.g. S -> NP VP
Start with the start symbol and rewrite using the rules; the derivation is done when no non-terminals remain.

Small Grammar for English
S -> NP VP
VP -> V PP
NP -> DetP N
N -> cat | mat
V -> is
PP -> Prep NP
Prep -> on
DetP -> the
Input: the cat is on the mat
Parse tree: [S [NP [DetP the] [N cat]] [VP [V is] [PP [Prep on] [NP [DetP the] [N mat]]]]]

More Complicated Grammar
S -> NP VP
S -> VP
VP -> V PP
VP -> V NP
VP -> V
NP -> DetP NP
NP -> N NP
NP -> N
PP -> Prep NP
N -> cat | mat | food | bowl | Mary
V -> is | likes | sits
Prep -> on | in | under
DetP -> the | a
Mary likes the cat bowl
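
The rewriting process on these grammar slides can be checked mechanically. The following is a minimal sketch of a recognizer for the grammar above, not a parser used by any ASR toolkit. The names GRAMMAR, recognizes, derives and matches are made up for this illustration, and the approach (memoized top-down search over token spans) assumes the grammar has no empty productions, which holds here.

```python
from functools import lru_cache

# The "More Complicated Grammar" from the slide, one list of alternatives per non-terminal.
GRAMMAR = {
    "S": [["NP", "VP"], ["VP"]],
    "VP": [["V", "PP"], ["V", "NP"], ["V"]],
    "NP": [["DetP", "NP"], ["N", "NP"], ["N"]],
    "PP": [["Prep", "NP"]],
    "N": [["cat"], ["mat"], ["food"], ["bowl"], ["Mary"]],
    "V": [["is"], ["likes"], ["sits"]],
    "Prep": [["on"], ["in"], ["under"]],
    "DetP": [["the"], ["a"]],
}


def recognizes(sentence, start="S"):
    """Return True if the start symbol can be rewritten into the sentence."""
    tokens = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # A terminal matches exactly one token; a non-terminal tries each rule.
        if symbol not in GRAMMAR:
            return j == i + 1 and tokens[i] == symbol
        return any(matches(tuple(rhs), i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def matches(rhs, i, j):
        # Does the symbol sequence rhs derive tokens[i:j]?
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # The first symbol takes a non-empty prefix, the rest takes the remainder.
        return any(
            derives(rhs[0], i, k) and matches(rhs[1:], k, j)
            for k in range(i + 1, j)
        )

    return len(tokens) > 0 and derives(start, 0, len(tokens))


print(recognizes("Mary likes the cat bowl"))
print(recognizes("cat the likes"))
```

The first sentence is accepted through NP -> N NP (cat bowl), and the second is rejected because no rule lets a determiner follow a noun at the start of a sentence.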

Using CFGs in Simple ASR applications
LHS of rules are semantic categories:
– LIST -> show me | I want | can I see | …
– DEPARTTIME -> (after | around | before) HOUR | morning | afternoon | evening
– HOUR -> one | two | three … | twelve (am | pm)
– FLIGHTS -> (a) flight | flights
– ORIGIN -> from CITY
– DESTINATION -> to CITY
– CITY -> Boston | San Francisco | Denver | Washington

Sphinx Grammar Format
Variables are surrounded by <>
Terminals are not (e.g., FRIDAY, TICKET)
X Y is concatenation (e.g., I WANT)
(X | Y) means X or Y (e.g., (WANT | NEED))
[X] means X is optional (e.g., [ON] FRIDAY)
* is Kleene closure
Can include “probabilities”:
– = /10/ open | /2/ close | /1/ delete | /1/ move;
For productions to be available to Sphinx, they must be declared “public”

Examples
public = ((what trains leave) | (what time can I travel) | (is there a train)) (from | to) (from | to) on [ ]
= Boston | NewYork | Washington | Baltimore;
= morning | evening
= Friday | Monday

Problems for Larger Vocabulary Applications
CFGs are complicated to build and hard to modify to accommodate new data:
– Add capability to make a reservation
– Add capability to ask for help
– Add ability to understand greetings
– …
Parsing input with large CFGs can be slow in real-time applications
In large applications we use n-gram models

Next Word Prediction
The air traffic control supervisor who admitted falling asleep while on duty at Reagan National Airport has been suspended, and the head of the Federal Aviation Administration on Friday ordered new rules to ensure a similar incident doesn't take place. FAA chief Randy Babbitt said he has directed controllers at regional radar facilities to contact the towers of airports where there is only one controller on duty at night before sending planes on for landings. Babbitt also said regional controllers have been told that if no controller can be raised at the airport, they must offer pilots the option of diverting to another airport. Two commercial jets were unable to contact the control tower early Wednesday and had to land without gaining clearance.

Word Prediction
How do we know which words occur together?
– Domain knowledge
– Syntactic knowledge
– Lexical knowledge
Can we model this knowledge computationally?
– Simple statistics do pretty well.
– Most common way of constraining ASR predictions to conform to probabilities of word sequences: language modeling via N-grams

N-Gram Models of Language
Use the previous N-1 words in a sequence to predict the next word.
Language Model (LM)
– unigrams, bigrams, trigrams, 4-grams
How do we train these models to discover co-occurrence probabilities?
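
To make the unit of counting concrete, here is a small sketch of how n-grams might be pulled out of a token sequence. The helper name ngrams and the example sentence are chosen for this illustration only; whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat is on the mat".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so the six-token example gives five bigrams and four trigrams.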

Finding Corpora
Corpora are collections of text and speech
– Available online
– Brown Corpus
– Wall Street Journal, AP newswire, web
– DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, TDT, Communicator)

Tokenization: Counting Words in Corpora
What is a word?
– e.g., are cat and cats the same word? What about Cat and cat?
– September and Sept?
– zero and oh?
– Is _ a word? *? ‘(‘? Uh?
– Should we count parts of words? Going to Bo- Boston.
– How many words are there in don’t? gonna?
– Is any token separated by white space a word? In Japanese, Thai, and Chinese text, how do we identify words?

Terminology
Sentence: unit of written language
Utterance: unit of spoken language (prosodic phrase)
Wordform: inflected form as it appears in the corpus
Lemma: an abstract form, shared by word forms having the same stem, part of speech and word sense
– stands for the class of words with stem X
Types: number of distinct words in a corpus (vocabulary size)
Tokens: total number of words
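
The type and token counts depend on the tokenization decisions raised on the previous slide. The sketch below counts both under one hypothetical rule (a token is a run of letters, optionally with an internal apostrophe), with and without case folding. The rule, the function name tokenize and the sample text are assumptions for illustration, not a recommended tokenizer.

```python
import re

text = "The cat sat. The cats don't sit on the mat."


def tokenize(text, lowercase):
    # Hypothetical rule: letters, optionally followed by an apostrophe and more letters.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    return [w.lower() for w in words] if lowercase else words


for lowercase in (False, True):
    toks = tokenize(text, lowercase)
    # Tokens: total number of words. Types: number of distinct words.
    print(f"lowercase={lowercase} tokens={len(toks)} types={len(set(toks))}")
```

Folding case merges The and the into one type, so the vocabulary shrinks while the token count stays the same.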

Simple word probability
Assume a language has T word types and N tokens. How likely is word y to follow word x?
– Simplest model: 1/T
But is every word equally likely?
– Alternative 1: estimate the likelihood of y occurring in new text from its general frequency of occurrence in a corpus (unigram probability): ct(y)/N
But is every word equally likely in every context?
– Alternative 2: condition the likelihood of y occurring on the context of previous words: ct(x,y)/ct(x)
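
The three estimates on this slide can be compared side by side. The following sketch computes 1/T, ct(y)/N and ct(x,y)/ct(x) on a toy corpus invented for this example; the function names are illustrative, and returning 0.0 for an unseen context x is a simplification made here, not something the slide specifies.

```python
from collections import Counter

corpus = "the cat is on the mat the cat likes the food".split()

unigram_counts = Counter(corpus)                  # ct(x)
bigram_counts = Counter(zip(corpus, corpus[1:]))  # ct(x, y)
N = len(corpus)           # number of tokens
T = len(unigram_counts)   # number of types


def p_uniform(y):
    """Simplest model: every type is equally likely."""
    return 1 / T


def p_unigram(y):
    """Alternative 1: relative frequency of y in the corpus."""
    return unigram_counts[y] / N


def p_bigram(y, x):
    """Alternative 2: relative frequency of y among the words that follow x."""
    if unigram_counts[x] == 0:
        return 0.0  # simplification: unseen context
    return bigram_counts[(x, y)] / unigram_counts[x]


print(p_uniform("cat"), p_unigram("cat"), p_bigram("cat", "the"))
```

In this corpus cat follows the in two of the four occurrences of the, so the bigram estimate is much higher than either context-free estimate.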

Computing word sequence probabilities
Compute probability of a word given a preceding sequence
– P(the mythical unicorn…) = P(the | ) * P(mythical | the) * P(unicorn | the mythical) …
Joint probability: $P(w_{n-1}, w_n) = P(w_n \mid w_{n-1}) \, P(w_{n-1})$
– Chain Rule: decompose the joint probability, e.g. $P(w_1, w_2, w_3)$, as
  $P(w_1, w_2, \ldots, w_n) = P(w_1) \, P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$
But… the longer the sequence, the less likely we are to find it in a training corpus
– P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal)

Bigram Model
Markov assumption: the probability of a word depends only on a limited history
Approximate
– P(unicorn | the mythical) by P(unicorn | mythical)
Generalization: the probability of a word depends only on the n-1 previous words
– trigrams, 4-grams, 5-grams…
– the higher n is, the more training data is needed

From
– P(the mythical unicorn…) = P(the | ) * P(mythical | the) * P(unicorn | the mythical) …
To
– P(the, mythical, unicorn) = P(unicorn | mythical) * P(mythical | the) * P(the | )
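
Written out in general form, the step from the chain rule to the bigram model looks as follows. This is a restatement of the slides rather than new material. One notational assumption: $w_0$ denotes a sentence-start marker, so the first factor of the bigram product conditions on it.

```latex
\begin{align*}
P(w_1, \ldots, w_n)
  &= \prod_{k=1}^{n} P(w_k \mid w_1, \ldots, w_{k-1})
  && \text{(chain rule, exact)} \\
  &\approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})
  && \text{(bigram approximation, Markov assumption)}
\end{align*}
```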

Bigram Counts
[Table: bigram counts; rows and columns: eats, honey, mythical, cat, unicorn, the, a]

Determining Bigram Probabilities
Normalization: divide each row's counts by the appropriate unigram count for $w_{n-1}$
Computing the bigram probability of mythical mythical
– C(m,m)/C(all m-initial bigrams)
– p(m|m) = 2 / 35 ≈ 0.057
Maximum Likelihood Estimation (MLE): relative frequency

A Simple Example
P(a mythical cat…) = P(a | ) P(mythical | a) P(cat | mythical) … P( | …) = 90/1000 * 5/200 * 8/35 …
Needed:
– Bigram counts for each of these word pairs (x,y)
– Counts for each unigram (x) to normalize
– P(y|x) = ct(x,y)/ct(x)
Why do we usually represent bigram probabilities as log probabilities?
What do these bigrams intuitively capture?
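
One common answer to the log-probability question is numerical: products of many small probabilities underflow, while sums of logs do not. The sketch below multiplies the three factors shown on the slide and then shows the underflow on an artificial sequence of 200 factors of 0.01, a made-up case chosen only to trigger the effect.

```python
import math

# The three bigram factors shown on the slide.
factors = [90 / 1000, 5 / 200, 8 / 35]

product = math.prod(factors)
log_prob = sum(math.log(p) for p in factors)

print(product)             # direct product of the probabilities
print(math.exp(log_prob))  # same value, recovered from the sum of logs

# Artificial long sequence: 200 words, each with probability 0.01.
long_sequence = [0.01] * 200
print(math.prod(long_sequence))                # underflows to 0.0
print(sum(math.log(p) for p in long_sequence)) # still a usable finite number
```

Working in log space also turns multiplication into addition, which is cheaper and keeps comparisons between hypotheses meaningful even for long utterances.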

Training and Testing
N-gram probabilities come from a training corpus
– overly narrow corpus: probabilities don't generalize
– overly general corpus: probabilities don't reflect task or domain
A separate test corpus is used to evaluate the model, typically using standard metrics
– held-out test set; development (dev) test set
– cross validation
– results tested for statistical significance: how do they differ from a baseline? Other results?

Evaluating N-gram Models: Perplexity
Information-theoretic, intrinsic metric that usually correlates with extrinsic measures (e.g. ASR performance)
At each choice point in a grammar or LM
– Weighted average branching factor: average number of choices y following x, weighted by their probabilities of occurrence
– Or: if LM(1) assigns more probability to the test set sentences than LM(2), then LM(1) has the lower perplexity and models the test set better
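
The slide describes perplexity informally. Under the standard definition (the exponential of the average negative log probability per word), it can be sketched as below. The function name perplexity and the probability lists are invented for this example; a real evaluation would take the per-word probabilities from a trained model on a test set.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test sequence, given the model's probability for each word."""
    n = len(word_probs)
    avg_neg_log = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log)


# A model that always spreads probability evenly over 4 choices has perplexity 4,
# matching the "weighted average branching factor" reading.
print(perplexity([0.25] * 8))

# A model that assigns more probability to the test words gets lower perplexity.
print(perplexity([0.5, 0.5, 0.5]), perplexity([0.1, 0.1, 0.1]))
```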

N-gram Properties
As we increase the value of N, the accuracy of an n-gram model increases – why?
N-grams are quite sensitive to the corpus they are trained on
A few events (words) occur with high frequency, e.g.?
– Easy to collect statistics on these
A very large number occur with low frequency, e.g.?
– You may wait an arbitrarily long time to get valid statistics on these
– Some of the zeros in the table are really zeros
– Others are just low-frequency events you haven't seen yet
– How to allow for these events in unseen data?

N-gram Smoothing
Every n-gram training matrix is sparse, even for very large corpora
– Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the word distribution list
Solution:
– Estimate the likelihood of unseen n-grams
– Problem: how to adjust the rest of the corpus to accommodate these ‘phantom’ n-grams?
– Many techniques described in J&M
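
The slide does not single out a technique, so as one illustration here is add-one (Laplace) smoothing, among the simplest of the methods covered in textbooks. It adds 1 to every bigram count and adds the vocabulary size to the denominator so that each row still sums to 1. The corpus and the name p_add_one are invented for this sketch, and add-one is known to shift too much mass to unseen events in practice.

```python
from collections import Counter

corpus = "the cat is on the mat".split()
vocab = sorted(set(corpus))

# Count each word as a bigram history (the last token has no successor).
history_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus, corpus[1:]))


def p_mle(y, x):
    """Unsmoothed relative frequency; zero for unseen bigrams."""
    return bigram_counts[(x, y)] / history_counts[x] if history_counts[x] else 0.0


def p_add_one(y, x):
    """Add-one estimate: (ct(x,y) + 1) / (ct(x) + V)."""
    return (bigram_counts[(x, y)] + 1) / (history_counts[x] + len(vocab))


print(p_mle("cat", "the"), p_add_one("cat", "the"))  # seen bigram loses some mass
print(p_mle("is", "the"), p_add_one("is", "the"))    # unseen bigram gains some
print(sum(p_add_one(y, "the") for y in vocab))       # row total
```

The seen bigram drops from 1/2 to 2/7 and the unseen one rises from 0 to 1/7, which is the redistribution of probability mass that the ‘phantom’ n-grams require.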

Backoff methods
For e.g. a trigram model
– Compute unigram, bigram and trigram probabilities
– In use: where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability
E.g. An omnivorous unicorn
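
The backoff rule on this slide can be sketched directly: use the trigram estimate if the trigram was seen, else the bigram, else the unigram. The toy corpus and the name backoff_score are assumptions for this example. The sketch applies no discounting, so the returned scores do not form a normalized probability distribution; methods such as Katz backoff add discounting and backoff weights to fix that.

```python
from collections import Counter

corpus = "a mythical unicorn eats honey an omnivorous cat eats honey".split()

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))


def backoff_score(w, u, v):
    """Score w after the history (u, v) with the longest n-gram seen in training."""
    if tri[(u, v, w)] > 0:
        return "trigram", tri[(u, v, w)] / bi[(u, v)]
    if bi[(v, w)] > 0:
        return "bigram", bi[(v, w)] / uni[v]
    return "unigram", uni[w] / len(corpus)


# "an omnivorous unicorn": neither the trigram nor the bigram was seen.
print(backoff_score("unicorn", "an", "omnivorous"))
# "cat eats honey" was seen as a trigram.
print(backoff_score("honey", "cat", "eats"))
# "omnivorous eats honey": trigram unseen, bigram "eats honey" seen.
print(backoff_score("honey", "omnivorous", "eats"))
```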

Language modeling toolkits
The CMU-Cambridge LM toolkit (CMULM)
The SRILM toolkit

Next Class
Human Speech Perception
Reading: J&M Chapter 4