CS 595-052 Machine Learning and Statistical Natural Language Processing
Prof. Shlomo Argamon, Room: 237C
Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze
Requirements:
–Several programming projects
–Research proposal

Machine Learning
[Diagram: Training Examples → Learning Algorithm → Learned Model; Test Examples → Learned Model → Classification/Labeling Results]

Modeling
Decide how to represent learned models:
–Decision rules
–Linear functions
–Markov models
–…
Type chosen affects generalization accuracy (on new data)

Generalization

Example Representation
Set of features:
–Continuous
–Discrete (ordered and unordered)
–Binary
–Sets vs. sequences
Classes:
–Continuous vs. discrete
–Binary vs. multivalued
–Disjoint vs. overlapping

Learning Algorithms
Find a “good” hypothesis “consistent” with the training data
–Many hypotheses may be consistent, so we may need a “preference bias”
–No hypothesis may be consistent, so we may need to settle for a “nearly” consistent one
May rule out some hypotheses to start with:
–Feature reduction

Estimating Generalization Accuracy
Accuracy on the training data says nothing about new examples!
Must train and test on different example sets
Estimate generalization accuracy over multiple train/test divisions
Sources of estimation error:
–Bias: Systematic error in the estimate
–Variance: How much the estimate changes between different runs

Cross-validation
1. Divide the training data into k sets
2. Repeat for each set i:
   a. Train on the remaining k-1 sets
   b. Test on set i
3. Average the k accuracies (and compute statistics)
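A minimal sketch of this procedure; the train and evaluate callables are hypothetical placeholders for whatever learner and accuracy measure are being evaluated:

```python
import random

def cross_validate(examples, train, evaluate, k=10, seed=0):
    """Estimate generalization accuracy with k-fold cross-validation.
    `train(examples)` returns a model; `evaluate(model, examples)` returns
    an accuracy in [0, 1]. Both are placeholder callables."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]          # k roughly equal folds
    accuracies = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j in range(k) if j != i for x in folds[j]]
        model = train(training)
        accuracies.append(evaluate(model, held_out))
    return sum(accuracies) / k, accuracies
```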

Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
   Note: We expect ~0.632n distinct examples
2. Train a model and evaluate:
   acc_0 = accuracy of the model on the non-chosen examples
   acc_S = accuracy of the model on the n training examples
3. Estimate accuracy as 0.632*acc_0 + 0.368*acc_S
4. Average accuracies over b different runs
Also note: there are other similar bootstrapping techniques
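A sketch of the 0.632 bootstrap described above, with the same hypothetical train/evaluate placeholders as in the cross-validation sketch:

```python
import random

def bootstrap_632(examples, train, evaluate, b=50, seed=0):
    """0.632 bootstrap accuracy estimate; `train` and `evaluate` are
    placeholder callables as in the cross-validation sketch."""
    rng = random.Random(seed)
    examples = list(examples)
    n = len(examples)
    estimates = []
    for _ in range(b):
        sample = [rng.choice(examples) for _ in range(n)]    # draw n with replacement
        chosen_ids = {id(x) for x in sample}
        unseen = [x for x in examples if id(x) not in chosen_ids]
        model = train(sample)
        acc_0 = evaluate(model, unseen)    # accuracy on non-chosen examples
        acc_s = evaluate(model, sample)    # accuracy on the n training examples
        estimates.append(0.632 * acc_0 + 0.368 * acc_s)
    return sum(estimates) / b
```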

Bootstrapping vs. Cross-validation
Cross-validation:
–Equal participation of all examples
–Class distribution in the test sets depends on the distribution in the training sets
–Stratified cross-validation: equalize class distributions
Bootstrap:
–Often has higher bias (fewer distinct examples)
–Best for small datasets

Natural Language Processing
Extract useful information from natural language texts (articles, books, web pages, queries, etc.)
Traditional method: Handcrafted lexicons, grammars, parsers
Statistical approach: Learn how to process language from a corpus of real usage

Some Statistical NLP Tasks
1. Part-of-speech tagging – How to distinguish between book the noun and book the verb
2. Shallow parsing – Pick out phrases of different types from a text, such as the purple people eater or would have been going
3. Word sense disambiguation – How to distinguish between river bank and bank as a financial institution
4. Alignment – Find the correspondence between words, sentences, and paragraphs of a source text and its translation

A Paradigmatic Task
Language Modeling: Predict the next word of a text (probabilistically):
P(w_n | w_1 w_2 … w_{n-1}) = m(w_n | w_1 w_2 … w_{n-1})
To do this perfectly, we must capture true notions of grammaticality
So: Better approximation of the probability of “the next word” ⇒ Better language model

Measuring “Surprise”
The lower the probability of the actual word, the more the model is “surprised”:
H(w_n | w_1 … w_{n-1}) = -log_2 m(w_n | w_1 … w_{n-1})
(the conditional entropy of w_n given w_{1,n-1})
Cross-entropy: Suppose the actual distribution of the language is p(w_n | w_1 … w_{n-1}); then our model is on average surprised by:
E_p[H(w_n | w_{1,n-1})] = Σ_w p(w_n = w | w_{1,n-1}) H(w_n = w | w_{1,n-1}) = E_p[-log_2 m(w_n | w_{1,n-1})]

Estimating the Cross-Entropy
How can we estimate E_p[H(w_n | w_{1,n-1})] when we don’t (by definition) know p?
Assume:
–Stationarity: The language doesn’t change
–Ergodicity: The language never gets “stuck”
Then:
E_p[H(w_n | w_{1,n-1})] = lim_{n→∞} (1/n) Σ_{i=1}^{n} H(w_i | w_{1,i-1})

Perplexity
Commonly used measure of “model fit”:
perplexity(w_{1,n}, m) = 2^{H(w_{1,n}, m)} = m(w_{1,n})^{-(1/n)}
How many “choices” for the next word on average?
Lower perplexity = better model
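A small sketch of how perplexity could be computed from a conditional model; model_prob is a hypothetical placeholder for m(w_n | w_{1,n-1}):

```python
import math

def perplexity(model_prob, words):
    """Per-word perplexity of `words` under a conditional model.
    `model_prob(word, history)` is a placeholder returning m(w_n | w_1,n-1)."""
    total_surprise = 0.0
    for i, w in enumerate(words):
        total_surprise += -math.log2(model_prob(w, words[:i]))  # -log2 m(w | history)
    cross_entropy = total_surprise / len(words)                 # average bits per word
    return 2 ** cross_entropy                                   # perplexity = 2^H
```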

N-gram Models
Assume a “limited horizon”:
P(w_k | w_1 w_2 … w_{k-1}) = P(w_k | w_{k-n+1} … w_{k-1})
–Each word depends only on the last n-1 words
Specific cases:
–Unigram model: P(w_k) – words independent
–Bigram model: P(w_k | w_{k-1})
Learning task: estimate these probabilities from a given corpus

Using Bigrams
Compute the probability of a sentence:
W = The cat sat on the mat
P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
Generate a random text and examine it for “reasonableness”
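As an illustration, a sketch of the bigram sentence probability above, assuming a hypothetical bigram_prob(w, prev) function and <START>/<END> pseudo-tokens:

```python
def sentence_probability(bigram_prob, sentence):
    """P(W) under a bigram model. `bigram_prob(w, prev)` is a placeholder
    returning P(w | prev); <START>/<END> are pseudo-tokens."""
    tokens = ["<START>"] + sentence.split() + ["<END>"]
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        p *= bigram_prob(w, prev)
    return p

# e.g. sentence_probability(model, "The cat sat on the mat")
```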

Maximum Likelihood Estimation
P_MLE(w_1 … w_n) = C(w_1 … w_n) / N
P_MLE(w_n | w_1 … w_{n-1}) = C(w_1 … w_n) / C(w_1 … w_{n-1})
Problem: Data Sparseness!!
–For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus
–The larger the context, the greater the problem
–But there are always new cases not seen before!
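A sketch of MLE bigram estimation by counting, under the assumption that the corpus is given as lists of tokens:

```python
from collections import Counter

def mle_bigram_model(corpus_sentences):
    """MLE bigram estimates from a corpus given as lists of tokens:
    P_MLE(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus_sentences:
        padded = ["<START>"] + list(tokens) + ["<END>"]
        unigrams.update(padded[:-1])               # counts of conditioning words
        bigrams.update(zip(padded, padded[1:]))    # counts of adjacent pairs
    def prob(w, prev):
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob
```

The returned prob function can be plugged into the bigram sentence-probability sketch above; any bigram unseen in the corpus gets probability 0, which is exactly the sparseness problem.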

Smoothing
Idea: Take some probability away from seen events and assign it to unseen events
Simple method (Laplace): Give every event an a priori count of 1
P_Lap(X) = (C(X) + 1) / (N + B)
where X is any entity and B is the number of entity types
Problem: Assigns too much probability to new events
–The more event types there are, the worse this becomes
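A one-line sketch of the Laplace estimate (parameter names are illustrative):

```python
def laplace_prob(count, total, num_types):
    """Laplace (add-one) estimate: P_Lap(X) = (C(X) + 1) / (N + B),
    where N is the total count and B the number of entity types."""
    return (count + 1) / (total + num_types)

# An unseen event (count 0) now gets probability 1 / (N + B) instead of 0.
```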

Interpolation
Lidstone: P_Lid(X) = (C(X) + d) / (N + dB)   [d < 1]
Johnson: equivalently, P_Lid(X) = μ P_MLE(X) + (1 – μ)(1/B), where μ = N/(N + dB)
Problems:
–How to choose d?
–Doesn’t match low-frequency events well
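A sketch of the Lidstone estimate and of Johnson's mixture view, which computes the same value (function and parameter names are illustrative):

```python
def lidstone_prob(count, total, num_types, d=0.5):
    """Lidstone: P_Lid(X) = (C(X) + d) / (N + d*B), with d < 1."""
    return (count + d) / (total + d * num_types)

def lidstone_as_mixture(count, total, num_types, d=0.5):
    """Johnson's view: the same value as a mixture of the MLE and the
    uniform distribution 1/B, with mu = N / (N + d*B)."""
    mu = total / (total + d * num_types)
    return mu * (count / total) + (1 - mu) * (1 / num_types)
```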

Held-out Estimation
Idea: Estimate how often events will occur in new data from data “unseen” during training
Divide the data into “training” and “held-out” subsets:
C_1(X) = frequency of X in the training data
C_2(X) = frequency of X in the held-out data
T_r = Σ_{X: C_1(X)=r} C_2(X)
P_ho(X) = T_r / (N_r N)   where r = C_1(X) and N_r is the number of types with C_1(X) = r
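A sketch of held-out estimation, under the assumption that N is the total held-out count; types unseen in training (r = 0) would need to be enumerated separately:

```python
from collections import Counter

def held_out_estimates(train_counts, heldout_counts, heldout_total):
    """Held-out estimate P_ho(X) = T_r / (N_r * N) for each training count r >= 1.
    `train_counts` and `heldout_counts` map each type X to its count in the
    respective half; `heldout_total` stands in for N (an assumption here)."""
    T_r, N_r = Counter(), Counter()
    for x, r in train_counts.items():
        T_r[r] += heldout_counts.get(x, 0)   # held-out mass of types seen r times
        N_r[r] += 1                          # number of types seen r times in training
    return {r: T_r[r] / (N_r[r] * heldout_total) for r in N_r}
```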

Deleted Estimation
Generalize to use all the data: divide the data into 2 subsets (a, b ∈ {0, 1}):
N^a_r = number of entities X s.t. C_a(X) = r
T^{ab}_r = Σ_{X: C_a(X)=r} C_b(X)
P_del(X) = (T^{01}_r + T^{10}_r) / (N (N^0_r + N^1_r))   where C(X) = r
Needs a large data set
Overestimates unseen data, underestimates infrequent data

Good-Turing
For observed items, discount the item count:
r* = (r+1) E[N_{r+1}] / E[N_r]
The idea is that the chance of seeing the item one more time is about E[N_{r+1}] / E[N_r]
For unobserved items, the total probability is E[N_1] / N
–So, if we assume a uniform distribution over unknown items, we have:
P(X) = E[N_1] / (N_0 N)
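A sketch of Good-Turing count adjustment using the raw frequencies of frequencies in place of the expectations E[N_r] (no smoothing function applied; see the Issues slide):

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Good-Turing discounting: r* = (r+1) * N_{r+1} / N_r, using raw
    frequencies of frequencies as estimates of E[N_r]."""
    N_r = Counter(counts.values())     # N_r = number of types seen exactly r times
    adjusted = {}
    for x, r in counts.items():
        if N_r.get(r + 1, 0) > 0:
            adjusted[x] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[x] = r            # no N_{r+1} data: keep the raw count
    return adjusted

# The total probability mass reserved for unseen items is N_1 / N,
# i.e. (number of types seen once) / (total token count).
```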

Good-Turing Issues
Has problems with high-frequency items (since E[N_{r_max + 1}] = 0, we get r_max* = (r_max + 1) E[N_{r_max + 1}] / E[N_{r_max}] = 0)
Usual answers:
–Use only for low-frequency items (r < k)
–Smooth E[N_r] by a function S(r)
How to divide probability among unseen items?
–Uniform distribution
–Estimate which seem more likely than others…

Back-off Models
If the high-order n-gram has insufficient data, use a lower-order n-gram:
P_bo(w_i | w_{i-n+1, i-1}) =
   (1 - d(w_{i-n+1, i-1})) P(w_i | w_{i-n+1, i-1})   if enough data
   α(w_{i-n+1, i-1}) P_bo(w_i | w_{i-n+2, i-1})      otherwise
Note the recursive formulation
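A simplified recursive back-off sketch, not a full Katz implementation: the discount d, back-off weight α, and the “enough data” test are taken as given placeholder callables:

```python
def backoff_prob(word, history, ngram_prob, has_enough_data, d, alpha):
    """Recursive back-off sketch. `ngram_prob`, `has_enough_data`, `d`, and
    `alpha` are placeholder callables standing in for a trained model's
    tables and weights; `history` is a tuple of preceding words."""
    if not history:                                          # unigram base case
        return ngram_prob(word, history)
    if has_enough_data(tuple(history) + (word,)):
        return (1 - d(history)) * ngram_prob(word, history)  # discounted n-gram estimate
    return alpha(history) * backoff_prob(word, history[1:],  # back off to shorter history
                                         ngram_prob, has_enough_data, d, alpha)
```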

Linear Interpolation
More generally, we can interpolate:
P_int(w_i | h) = Σ_k λ_k(h) P_k(w_i | h)
Interpolation between different orders
Usually set the weights by iterative training (gradient descent or the EM algorithm)
Partition histories h into equivalence classes
Need to be responsive to the amount of data!
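A sketch of linear interpolation with fixed (history-independent) weights; in practice the λ_k(h) would be trained, e.g. with EM:

```python
def interpolated_prob(word, history, models, lambdas):
    """Linear interpolation P_int(w | h) = sum_k lambda_k * P_k(w | h).
    `models` is a list of placeholder conditional models P_k(word, history);
    here the lambda_k are fixed rather than history-dependent and must sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p(word, history) for lam, p in zip(lambdas, models))
```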