
1 CS 595-052 Machine Learning and Statistical Natural Language Processing
Prof. Shlomo Argamon, argamon@iit.edu
Room: 237C
Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze
Requirements:
–Several programming projects
–Research proposal

2 Machine Learning
[Diagram: Training Examples → Learning Algorithm → Learned Model; Test Examples → Learned Model → Classification/Labeling Results]

3 Modeling
Decide how to represent learned models:
–Decision rules
–Linear functions
–Markov models
–…
The type of model chosen affects generalization accuracy (on new data).

4 Generalization

5 Example Representation
Features:
–Continuous
–Discrete (ordered and unordered)
–Binary
–Sets vs. sequences
Classes:
–Continuous vs. discrete
–Binary vs. multivalued
–Disjoint vs. overlapping

6 Learning Algorithms
Find a "good" hypothesis "consistent" with the training data:
–Many hypotheses may be consistent, so we may need a "preference bias"
–No hypothesis may be consistent, so we may need to settle for one that is "nearly" consistent
Some hypotheses may be ruled out from the start:
–Feature reduction

7 Estimating Generalization Accuracy
Accuracy on the training data says nothing about accuracy on new examples!
We must train and test on different example sets, and estimate generalization accuracy over multiple train/test divisions.
Sources of estimation error:
–Bias: systematic error in the estimate
–Variance: how much the estimate changes between different runs

8 Cross-validation
1. Divide the training data into k sets (folds)
2. For each fold i:
   a. Train on the remaining k-1 folds
   b. Test on the held-out fold i
3. Average the k accuracies (and compute statistics)
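Below is a minimal Python sketch of this procedure. The `train_model` and `accuracy` callables are placeholders for whatever learner and evaluation function are being used; they are not from the slides.

```python
import random

def cross_validate(examples, train_model, accuracy, k=10, seed=0):
    """Estimate generalization accuracy by k-fold cross-validation.

    `train_model(train_set)` returns a model; `accuracy(model, test_set)`
    returns a number.  Both are hypothetical placeholders.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]        # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_model(train)                    # train on the other k-1 folds
        scores.append(accuracy(model, test))          # test on the held-out fold
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)
    return mean, var
```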

9 Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
   Note: we expect ~0.632n distinct examples
2. Train the model and evaluate:
   acc_0 = accuracy of the model on the non-chosen examples
   acc_S = accuracy of the model on the n training examples
3. Estimate accuracy as 0.632·acc_0 + 0.368·acc_S
4. Average the accuracies over b different runs
Also note: there are other, similar bootstrapping techniques.
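A sketch of the 0.632 bootstrap estimate described above, again with hypothetical `train_model` and `accuracy` callables standing in for the actual learner:

```python
import random

def bootstrap_632(examples, train_model, accuracy, b=50, seed=0):
    """Average the 0.632 bootstrap accuracy estimate over b resampling runs."""
    rng = random.Random(seed)
    examples = list(examples)
    n = len(examples)
    estimates = []
    for _ in range(b):
        sample = [rng.choice(examples) for _ in range(n)]       # draw n with replacement
        chosen = {id(x) for x in sample}
        unseen = [x for x in examples if id(x) not in chosen]   # ~0.368n examples
        if not unseen:          # extremely unlikely for reasonable n; skip the run
            continue
        model = train_model(sample)
        acc0 = accuracy(model, unseen)    # accuracy on the non-chosen examples
        accS = accuracy(model, sample)    # accuracy on the bootstrap training sample
        estimates.append(0.632 * acc0 + 0.368 * accS)
    return sum(estimates) / len(estimates)
```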

10 Bootstrapping vs. Cross-validation
Cross-validation:
–Equal participation of all examples
–The class distribution in each test fold depends on the distribution in the training folds
–Stratified cross-validation: equalize the class distribution across folds
Bootstrap:
–Often has higher bias (fewer distinct training examples)
–Best for small datasets

11 Natural Language Processing
Extract useful information from natural language texts (articles, books, web pages, queries, etc.)
Traditional method: handcrafted lexicons, grammars, and parsers
Statistical approach: learn how to process language from a corpus of real usage

12 Some Statistical NLP Tasks
1. Part-of-speech tagging: how to distinguish between book the noun and book the verb
2. Shallow parsing: pick out phrases of different types from a text, such as the purple people eater or would have been going
3. Word sense disambiguation: how to distinguish between river bank and bank as a financial institution
4. Alignment: find the correspondence between the words, sentences, and paragraphs of a source text and its translation

13 A Paradigmatic Task
Language modeling: predict the next word of a text (probabilistically):
P(w_n | w_1 w_2 … w_{n-1}) = m(w_n | w_1 w_2 … w_{n-1})
To do this perfectly, we must capture true notions of grammaticality.
So: a better approximation of the probability of "the next word" ⇒ a better language model.

14 Measuring "Surprise"
The lower the probability of the actual word, the more the model is "surprised":
H(w_n | w_1 … w_{n-1}) = -log_2 m(w_n | w_1 … w_{n-1})
(the conditional entropy of w_n given w_{1,n-1})
Cross-entropy: suppose the actual distribution of the language is p(w_n | w_1 … w_{n-1}); then our model is on average surprised by:
E_p[H(w_n | w_{1,n-1})] = Σ_w p(w_n = w | w_{1,n-1}) H(w_n = w | w_{1,n-1}) = E_p[-log_2 m(w_n | w_{1,n-1})]

15 Estimating the Cross-Entropy
How can we estimate E_p[H(w_n | w_{1,n-1})] when we don't (by definition) know p?
Assume:
–Stationarity: the language doesn't change
–Ergodicity: the language never gets "stuck"
Then:
E_p[H(w_n | w_{1,n-1})] = lim_{n→∞} (1/n) Σ_{i=1}^{n} H(w_i | w_{1,i-1})

16 Perplexity
A commonly used measure of "model fit":
perplexity(w_{1,n}, m) = 2^{H(w_{1,n}, m)} = m(w_{1,n})^{-1/n}
How many "choices" for the next word are there on average?
Lower perplexity = better model.
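As an illustration, the following sketch computes perplexity from a list of per-word model probabilities m(w_i | w_{1,i-1}); the list itself would come from whatever language model is being evaluated.

```python
import math

def perplexity(word_probs):
    """Perplexity of a model over a test text, given the model's conditional
    probability m(w_i | w_1..w_{i-1}) for each of the n words:
    perplexity = 2 ** (-(1/n) * sum(log2 m(w_i | history)))."""
    n = len(word_probs)
    log_prob = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_prob / n)

# A model that assigns each of 4 words probability 1/8 is, on average,
# "choosing" among 8 equally likely continuations:
print(perplexity([0.125, 0.125, 0.125, 0.125]))   # -> 8.0
```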

17 N-gram Models
Assume a "limited horizon":
P(w_k | w_1 w_2 … w_{k-1}) = P(w_k | w_{k-n+1} … w_{k-1})
–Each word depends only on the last n-1 words
Specific cases:
–Unigram model: P(w_k) (words independent)
–Bigram model: P(w_k | w_{k-1})
Learning task: estimate these probabilities from a given corpus.

18 Using Bigrams
Compute the probability of a sentence:
W = The cat sat on the mat
P(W) = P(The | START) P(cat | The) P(sat | cat) P(on | sat) P(the | on) P(mat | the) P(END | mat)
Generate a random text and examine it for "reasonableness".
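A small sketch of this computation, using a made-up `bigram_prob` table; the numbers are purely illustrative, and a real table would be estimated from a corpus as on the next slide.

```python
# Hypothetical bigram probabilities (illustrative values only).
bigram_prob = {
    ("<s>", "the"): 0.30, ("the", "cat"): 0.05, ("cat", "sat"): 0.20,
    ("sat", "on"): 0.40, ("on", "the"): 0.35, ("the", "mat"): 0.02,
    ("mat", "</s>"): 0.25,
}

def sentence_probability(words, bigram_prob):
    """P(W) as a product of bigram probabilities, with <s>/</s> as START/END."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob.get((prev, cur), 0.0)   # unseen bigram -> probability 0
    return p

print(sentence_probability("the cat sat on the mat".split(), bigram_prob))
```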

19 Maximum Likelihood Estimation
P_MLE(w_1 … w_n) = C(w_1 … w_n) / N
P_MLE(w_n | w_1 … w_{n-1}) = C(w_1 … w_n) / C(w_1 … w_{n-1})
Problem: data sparseness!!
For the vast majority of possible n-grams we get 0 probability, even in a very large corpus.
The larger the context, the greater the problem, and there are always new cases not seen before!
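A minimal sketch of MLE bigram estimation from a toy corpus, with <s> and </s> standing in for the START/END markers of the previous slide.

```python
from collections import Counter

def mle_bigram_model(sentences):
    """Maximum likelihood bigram estimates:
    P_MLE(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1})."""
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram.update(tokens[:-1])                 # counts of conditioning contexts
        bigram.update(zip(tokens, tokens[1:]))      # counts of adjacent pairs
    return {(u, v): c / unigram[u] for (u, v), c in bigram.items()}

corpus = [s.split() for s in ["the cat sat on the mat", "the dog sat"]]
probs = mle_bigram_model(corpus)
print(probs[("the", "cat")])   # C(the cat)/C(the) = 1/3
```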

20 Smoothing
Idea: take some probability away from seen events and assign it to unseen events.
Simple method (Laplace): give every event an a priori count of 1:
P_Lap(X) = (C(X) + 1) / (N + B)
where X is any entity and B is the number of entity types.
Problem: assigns too much probability to new events; the more event types there are, the worse this becomes.
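A sketch of the Laplace estimator, taking the events to be words and assuming a vocabulary size B of 10,000 for illustration.

```python
from collections import Counter

def laplace_probs(counts, B):
    """Add-one (Laplace) estimates: P(x) = (C(x) + 1) / (N + B),
    where B is the number of possible event types."""
    N = sum(counts.values())
    return lambda x: (counts.get(x, 0) + 1) / (N + B)

counts = Counter({"the": 3, "cat": 1, "sat": 1})
p = laplace_probs(counts, B=10_000)   # assumed vocabulary size
print(p("the"), p("unicorn"))         # unseen words get mass too; with large B, too much
```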

21 Interpolation
Lidstone: P_Lid(X) = (C(X) + d) / (N + dB), with d < 1
Johnson: P_Lid(X) = m·P_MLE(X) + (1 - m)·(1/B), where m = N/(N + dB)
How do we choose d?
Doesn't match low-frequency events well.
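The corresponding Lidstone estimator, differing from the Laplace sketch only in the pseudo-count d (the value d = 0.5 below is just an example choice).

```python
def lidstone_probs(counts, B, d=0.5):
    """Lidstone estimates: P(x) = (C(x) + d) / (N + d*B), 0 < d < 1.
    Equivalently m*P_MLE(x) + (1 - m)/B with m = N/(N + d*B)."""
    N = sum(counts.values())
    return lambda x: (counts.get(x, 0) + d) / (N + d * B)

counts = {"the": 3, "cat": 1, "sat": 1}
p = lidstone_probs(counts, B=10_000, d=0.5)
print(p("the"), p("unicorn"))
```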

22 Held-out Estimation
Idea: estimate the frequency of events on unseen data by using separate "unseen" (held-out) data.
Divide the data into "training" and "held-out" subsets:
C_1(X) = frequency of X in the training data
C_2(X) = frequency of X in the held-out data
T_r = Σ_{X : C_1(X) = r} C_2(X)
P_ho(X) = T_r / (N_r N)   where r = C_1(X) and N_r is the number of types with training frequency r
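A sketch of held-out estimation for seen items. It reads N in the formula as the size of the held-out sample, which is an interpretation rather than something stated on the slide; handling unseen items (r = 0) would additionally require the number of unseen types.

```python
from collections import Counter

def held_out_probs(train_items, heldout_items):
    """Held-out estimate for an item X with training frequency r:
    P_ho(X) = T_r / (N_r * N), with T_r the total held-out count of all items
    whose training frequency is r, N_r the number of such items, and N taken
    here to be the held-out sample size."""
    c1, c2 = Counter(train_items), Counter(heldout_items)
    N = sum(c2.values())
    Nr, Tr = Counter(), Counter()
    for x, r in c1.items():
        Nr[r] += 1
        Tr[r] += c2[x]
    def prob(x):
        r = c1[x]
        if r == 0:
            raise ValueError("unseen items need N_0, the number of unseen types")
        return Tr[r] / (Nr[r] * N)
    return prob

p = held_out_probs("the cat sat on the mat".split(), "the dog sat on the rug".split())
print(p("cat"))   # estimate for a once-seen training word
```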

23 Deleted Estimation
Generalize held-out estimation to use all of the data. Divide the data into two subsets, 0 and 1:
N_r^a = number of entities X such that C_a(X) = r
T_r^{ab} = Σ_{X : C_a(X) = r} C_b(X)
P_del(X) = (T_r^{01} + T_r^{10}) / (N (N_r^0 + N_r^1))   where r = C(X)
Needs a large data set.
Overestimates unseen data, underestimates infrequent data.

24 Good-Turing
For observed items, discount the item count:
r* = (r + 1) E[N_{r+1}] / E[N_r]
The idea is that the chance of seeing the item one more time is about E[N_{r+1}] / E[N_r].
For unobserved items, the total probability is E[N_1] / N.
–So, if we assume a uniform distribution over unknown items, we have:
P(X) = E[N_1] / (N_0 N)
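A sketch of the Good-Turing count adjustment. It uses raw counts-of-counts N_r in place of the smoothed expectations E[N_r] on the slide, so it is only sensible for small r (as the next slide notes).

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Adjusted counts r* = (r + 1) * N_{r+1} / N_r, falling back to the raw
    count when N_{r+1} = 0 (the high-frequency problem on the next slide)."""
    Nr = Counter(counts.values())          # N_r: number of types seen r times
    adjusted = {}
    for x, r in counts.items():
        if Nr.get(r + 1, 0) > 0:
            adjusted[x] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            adjusted[x] = r                # no N_{r+1}: keep the raw count
    return adjusted

counts = Counter("the cat sat on the mat near the cat".split())
print(good_turing_adjusted_counts(counts))
```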

25 Good-Turing Issues
Has problems with high-frequency items (consider r*_max = E[N_{r_max + 1}] / E[N_{r_max}] = 0).
Usual answers:
–Use it only for low-frequency items (r < k)
–Smooth E[N_r] by a function S(r)
How should we divide the probability among unseen items?
–Uniform distribution
–Estimate which unseen items seem more likely than others…

26 Back-off Models
If the high-order n-gram has insufficient data, use a lower-order n-gram:
P_bo(w_i | w_{i-n+1,i-1}) =
  (1 - d(w_{i-n+1,i-1})) · P(w_i | w_{i-n+1,i-1})   if there is enough data
  α(w_{i-n+1,i-1}) · P_bo(w_i | w_{i-n+2,i-1})      otherwise
Note the recursive formulation.
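A simplified recursive back-off sketch. It uses a fixed back-off weight in place of the normalizing α(·) in the formula above (closer in spirit to "stupid backoff" than to a properly normalized Katz back-off), and it assumes `counts` maps n-gram tuples of every order to their corpus counts.

```python
def backoff_prob(word, history, counts, alpha=0.4, threshold=1):
    """Use the highest-order estimate that has enough data; otherwise back
    off to a shorter history.  `counts` maps n-gram tuples of all orders
    (including 1-tuples) to counts.  The fixed `alpha` stands in for the
    normalizing alpha(history), so this is not a true probability distribution."""
    history = tuple(history)
    ngram = history + (word,)
    if history and counts.get(ngram, 0) >= threshold and counts.get(history, 0) > 0:
        return counts[ngram] / counts[history]       # enough data: relative frequency
    if not history:                                  # unigram base case
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((word,), 0) / total if total else 0.0
    return alpha * backoff_prob(word, history[1:], counts, alpha, threshold)
```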

27 Linear Interpolation
More generally, we can interpolate between models of different orders:
P_int(w_i | h) = Σ_k λ_k(h) P_k(w_i | h)
The weights are usually set by iterative training on held-out data (gradient descent or the EM algorithm).
Histories h are partitioned into equivalence classes.
The weights need to be responsive to the amount of data!
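A sketch of interpolating bigram, unigram, and uniform estimates with hand-picked weights. The weight values and the vocabulary size V are illustrative assumptions; in practice the weights would be tuned on held-out data as described above, possibly per history equivalence class.

```python
def interpolated_bigram_prob(word, prev, unigram_p, bigram_p,
                             lambdas=(0.7, 0.25, 0.05), V=10_000):
    """Linear interpolation of bigram, unigram, and uniform estimates:
    P_int(w | prev) = l1*P(w | prev) + l2*P(w) + l3*(1/V).
    `unigram_p` and `bigram_p` are dicts of pre-estimated probabilities."""
    l1, l2, l3 = lambdas
    return (l1 * bigram_p.get((prev, word), 0.0)
            + l2 * unigram_p.get(word, 0.0)
            + l3 * (1.0 / V))

unigram_p = {"the": 0.07, "cat": 0.01}
bigram_p = {("the", "cat"): 0.05}
print(interpolated_bigram_prob("cat", "the", unigram_p, bigram_p))
```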

