600.465 - Intro to NLP - J. Eisner1 Structured Prediction with Perceptrons and CRFs Time flies like an arrow N PP NP VPD N VP S Time flies like an arrow.

600.465 - Intro to NLP - J. Eisner1 Structured Prediction with Perceptrons and CRFs Time flies like an arrow N PP NP VPD N VP S Time flies like an arrow N VP NP NVD N S Time flies like an arrow V PP NP NPD N VP S Time flies like an arrow V NP VVD N V V S S … ?

600.465 - Intro to NLP - J. Eisner2 Structured Prediction with Perceptrons and CRFs But now, model structures! Back to conditional log-linear modeling …

600.465 - Intro to NLP - J. Eisner3

Reply today to claim your … goodmailspam Wanna get pizza tonight? goodmailspam Thx; consider enlarging the … goodmailspam Enlarge your hidden … goodmailspam

600.465 - Intro to NLP - J. Eisner6 … S  NP VP NP[+wh] V S/V/NP VP NP PP P S  N VP Det N S 

600.465 - Intro to NLP - J. Eisner7 … … S  NP VP NP[+wh] V S/V/NP VP NP PP P S  N VP Det N S  … NP  NP VP NP CP/NP VP NP NP PP NP  N VP Det N NP 

600.465 - Intro to NLP - J. Eisner8 Time flies like an arrow …

600.465 - Intro to NLP - J. Eisner9 Time flies like an arrow …

Structured prediction The general problem  Given some input x  Occasionally empty, e.g., no input needed for a generative n- gram or model of strings (randsent)  Consider a set of candidate outputs y  Classifications for x(small number: often just 2)  Taggings of x (exponentially many)  Parses of x (exponential, even infinite)  Translations of x (exponential, even infinite) ……  Want to find the “best” y, given x 600.465 - Intro to NLP - J. Eisner10

11 Remember Weighted CKY … (find the minimum-weight parse) time 1 flies 2 like 3 an 4 arrow 5 0NP3 Vst3 NP10 S8 NP24 S22 1 NP4 VP4 NP18 S21 VP18 2 P2V5P2V5 PP12 VP16 3 Det1NP10 4 N8N8 1 S  NP VP 6 S  Vst NP 2 S  S PP 1 VP  V NP 2 VP  VP PP 1 NP  Det N 2 NP  NP PP 3 NP  NP NP 0 PP  P NP

12 We used weighted CKY to implement probabilistic CKY for PCFGs time 1 flies 2 like 3 an 4 arrow 5 0NP3 Vst3 NP10 S8 NP24 S22 1 NP4 VP4 NP18 S21 VP18 2 P2V5P2V5 PP12 VP16 3 Det1NP10 4 N8N8 1 S  NP VP 6 S  Vst NP 2 S  S PP 1 VP  V NP 2 VP  VP PP 1 NP  Det N 2 NP  NP PP 3 NP  NP NP 0 PP  P NP 2 -8 2 -12 2 -2 multiply to get 2 -22 But is weighted CKY good for anything else??

13 Can set weights to log probs S NP time VP flies PP P like NP Det an N arrow w( | S ) = w( S  NP VP ) + w( NP  time ) + w( VP  VP NP ) + w( VP  flies ) + … Just let w( X  Y Z ) = -log p( X  Y Z | X ) Then lightest tree has highest prob But is weighted CKY good for anything else?? Do the weights have to be probabilities?

600.465 - Intro to NLP - J. Eisner14 Probability is Useful  We love probability distributions!  We’ve learned how to define & use p(…) functions.  Pick best output text T from a set of candidates  speech recognition (HW2); machine translation; OCR; spell correction...  maximize p 1 (T) for some appropriate distribution p 1  Pick best annotation T for a fixed input I  text categorization; parsing; part-of-speech tagging …  maximize p(T | I); equivalently maximize joint probability p(I,T)  often define p(I,T) by noisy channel: p(I,T) = p(T) * p(I | T)  speech recognition & other tasks above are cases of this too:  we’re maximizing an appropriate p 1 (T) defined by p(T | I)  Pick best probability distribution (a meta-problem!)  really, pick best parameters  : train HMM, PCFG, n-grams, clusters …  maximum likelihood; smoothing; EM if unsupervised (incomplete data)  Bayesian smoothing: max p(  |data) = max p( , data) =p(  )p(data|  ) summary of half of the course (statistics)

600.465 - Intro to NLP - J. Eisner15 Probability is Flexible  We love probability distributions!  We’ve learned how to define & use p(…) functions.  We want p(…) to define probability of linguistic objects  Trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)  Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best-paths, forward-backward, collocations)  Vectors (decis.lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)  We’ve also seen some not-so-probabilistic stuff  Syntactic features, semantics, morph., Gold. Could be stochasticized?  Methods can be quantitative & data-driven but not fully probabilistic: transf.-based learning, bottom-up clustering, LSA, competitive linking  But probabilities have wormed their way into most things  p(…) has to capture our intuitions about the ling. data summary of other half of the course (linguistics)

600.465 - Intro to NLP - J. Eisner16 An Alternative Tradition  Old AI hacking technique:  Possible parses (or whatever) have scores.  Pick the one with the best score.  How do you define the score?  Completely ad hoc!  Throw anything you want into the stew  Add a bonus for this, a penalty for that, etc.  “Learns” over time – as you adjust bonuses and penalties by hand to improve performance.  Total kludge, but totally flexible too …  Can throw in any intuitions you might have

 Given some input x  Consider a set of candidate outputs y  Define a scoring function score(x,y) Linear function: A sum of feature weights (you pick the features!)  Choose y that maximizes score(x,y) Scoring by Linear Models 600.465 - Intro to NLP - J. Eisner17 Ranges over all features, e.g., k=5 (numbered features) or k=“see Det Noun” (named features) Whether (x,y) has feature k(0 or 1) Or how many times it fires(  0) Or how strongly it fires(real #) Weight of feature k (learned or set by hand)

 Given some input x  Consider a set of candidate outputs y  Define a scoring function score(x,y) Linear function: A sum of feature weights (you pick the features!)  Choose y that maximizes score(x,y) Linear model notation 600.465 - Intro to NLP - J. Eisner18 (learned or set by hand)

 Choose y that maximizes score(x,y)  But how?  Easy when only a few candidates y (text classification, WSD, …) : just try each one in turn!  Harder for structured prediction: but you now know how!  Find the best string, path, or tree …  That’s what Viterbi-style or Dijkstra-style algorithms are for.  That is, use dynamic programming to find the score of the best y.  Then follow backpointers to recover the y that achieves that score. Finding the best y given x 600.465 - Intro to NLP - J. Eisner19

time 1 flies 2 like 3 an 4 arrow 5 0 NP3 Vst3 NP10 S8 S13 NP24 S22 S27 NP24 S27 S22 S27 1 NP4 VP4 NP18 S21 VP18 2 P2V5P2V5 PP12 VP16 3 Det1NP10 4 N8N8 1 S  NP VP 6 S  Vst NP 2 S  S PP 1 VP  V NP 2 VP  VP PP 1 NP  Det N 2 NP  NP PP 3 NP  NP NP 0 PP  P NP Given sentence x You know how to find max-score parse y (or min-cost parse) Provided that the score of a parse = total score of its rules Time flies like an arrow N PP NP VPD N VP S

Given word sequence x You know how to find max-score tag sequence y Provided that the score of a tagged sentence = total score of its emissions and transitions These don’t have to be log-probabilities! Emission scores assess tag-word compatibility Transition scores assess goodness of tag bigrams Bill directed a cortege of autos through the dunes …? Prep Adj Verb Verb Noun Verb PN Adj Det Noun Prep Noun Prep Det Noun

Given upper string x You know how to find max-score path that accepts x (or min-cost path) Provided that the score of a path = total score of its arcs Then choose lower string y from that best path (So in effect, score(x,y) is score of best path that transduces x to y) Q: How do you make sure that the path accepts aaaaaba? A: Compose with a straight-line automaton, then find best path.

 “Provided that the score of a parse = total score of its rules”  “Provided that the score of a tagged sentence = total score of its transitions and emissions”  “Provided that the score of a path = total score of its arcs” How does this fit with what linear models can do? When can you efficiently choose best y? 600.465 - Intro to NLP - J. Eisner23 e.g, θ 3 = score of VP  VP PP θ 8 = score of V  flies f 3 (x,y) = # times VP  VP PP appears in y f 8 (x,y) = # times V  flies appears in (x,y)

 So it’s fine to have one feature for each rule, transition, emission, arc, …  E.g., VP  VP PP or V  flies  The feature counts the # of occurrences of that substructure in (x,y)  But is that all? Or can we allow other features and remain efficient?  Features that count configurations smaller than a rule (or arc)?  Backoff feature V…  V… P… ?  Backoff feature foo  foo PP (asks whether PP is “adjoined” to some foo)?  Sure. These just add to the overall score of a rule like VP  VP PP that we then use in the parsing algorithm. When can you efficiently choose best y? 600.465 - Intro to NLP - J. Eisner24

 So it’s fine to have one feature for each rule, transition, emission, arc, …  E.g., VP  VP PP or V  flies  The feature counts the # of occurrences of that substructure in (x,y)  But is that all? Or can we allow other features and remain efficient?  Features that count configurations bigger than a rule (or arc)?  E.g., “NP  him” is a good rule when the NP is immediately to right of a V  Would have to change algorithm  Or, enrich CFG nonterminals with attributes (or split states of FSM)  NP[subject]  him vs. NP[object]  him  Now the information about the configuration can be seen locally within one rule When can you efficiently choose best y? 600.465 - Intro to NLP - J. Eisner25

 So it’s fine to have one feature for each rule, transition, emission, arc, …  E.g., VP  VP PP or V  flies  The feature counts the # of occurrences of that substructure in (x,y)  But is that all? Or can we allow other features and remain efficient?  Features that count configurations bigger than a rule (or arc)?  Reasonably easy case: “Does the tree have even depth along left spine?”  Harder case: “Do left and right children have the same # of words?”  Extra-hard case: “Is the # of NPs in the parse a prime number?” When can you efficiently choose best y? 600.465 - Intro to NLP - J. Eisner26

 So it’s fine to have one feature for each rule, transition, emission, arc, …  E.g., VP  VP PP or V  flies  The feature counts the # of occurrences of that substructure in (x,y)  But is that all? Or can we allow other features and remain efficient?  Features that count configurations bigger than a rule (or arc)?  Surprisingly easy case:  Features that look at a rule in y together with any properties of x! When can you efficiently choose best y? 600.465 - Intro to NLP - J. Eisner27

 Choose y that maximizes score(x,y)  But how?  Easy when only a few candidates y (text classification, WSD, etc.) : just try each one in turn!  Harder for structured prediction: but you now know how!  At least for linear scoring functions with certain kinds of features.  Generalizing beyond this is an active area!  Approximate inference in graphical models, integer linear programming, weighted MAX-SAT, etc. … see 600.325/425 Declarative Methods Linear model notation 600.465 - Intro to NLP - J. Eisner28

Finding the best y given x  Given some input x  Consider a set of candidate outputs y  Define a scoring function score(x,y)  We’re talking about linear functions: A sum of feature weights  Choose y that maximizes score(x,y)  Easy when only two candidates y (spam classification, binary WSD, etc.) : just try both!  Hard for structured prediction: but you now know how!  At least for linear scoring functions with certain kinds of features.  Generalizing beyond this is an active area!  Approximate inference in graphical models, integer linear programming, weighted MAX-SAT, etc. … see 600.325/425 Declarative Methods 600.465 - Intro to NLP - J. Eisner29

600.465 - Intro to NLP - J. Eisner30 An Alternative Tradition  Old AI hacking technique:  Possible parses (or whatever) have scores.  Pick the one with the best score.  How do you define the score?  Completely ad hoc!  Throw anything you want into the stew  Add a bonus for this, a penalty for that, etc.  “Learns” over time – as you adjust bonuses and penalties by hand to improve performance.  Total kludge, but totally flexible too …  Can throw in any intuitions you might have Exposé at 9 Probabilistic Revolution Not Really a Revolution, Critics Say Log-probabilities no more than scores in disguise “We’re just adding stuff up like the old corrupt regime did,” admits spokesperson really so alternative?

600.465 - Intro to NLP - J. Eisner32 Nuthin’ but adding weights  n-grams: … + log p(w7 | w5, w6) + log p(w8 | w6, w7) + …  PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) …  Can describe any linguistic object as collection of “features” (here, a tree’s “features” are all of its component rules) (different meaning of “features” from singular/plural/etc.)  Weight of the object = total weight of features  Our weights have always been conditional log-probs (  0)  but what if we changed that?  HMM tagging: … + log p(t7 | t5, t6) + log p(w7 | t7) + …  Noisy channel: [ log p(source) ] + [ log p(data | source) ]  Cascade of FSTs: [ log p(A) ] + [ log p(B | A) ] + [ log p(C | B) ] + …  Naïve Bayes: log(Class) + log(feature1 | Class) + log(feature2 | Class) + …

Change log p(this | that) to  (this ; that) 600.465 - Intro to NLP - J. Eisner34 What if our weights were arbitrary real numbers?  n-grams: … +  (w7 ; w5, w6) +  (w8 ; w6, w7) + …  PCFG:  (NP VP ; S) +  (Papa ; NP) +  (VP PP ; VP) …  HMM tagging: … +  (t7 ; t5, t6) +  (w7 ; t7) + …  Noisy channel: [  (source) ] + [  (data ; source) ]  Cascade of FSTs: [  (A) ] + [  (B ; A) ] + [  (C ; B) ] + …  Naïve Bayes:  (Class) +  (feature1 ; Class) +  (feature2 ; Class) … In practice,  is a hash table Maps from feature name (a string or object) to feature weight (a float) e.g.,  (NP VP ; S) = weight of the S  NP VP rule, say -0.1 or +1.3

Change log p(this | that) to  (this ; that)  (that & this) [prettier name] 600.465 - Intro to NLP - J. Eisner35 What if our weights were arbitrary real numbers?  n-grams: … +  (w5 w6 w7) +  (w6 w7 w8) + …  PCFG:  (S  NP VP) +  (NP  Papa) +  (VP  VP PP) …  HMM tagging: … +  (t5 t6 t7) +  (t7  w7) + …  Noisy channel: [  (source) ] + [  (source, data) ]  Cascade of FSTs: [  (A) ] + [  (A, B) ] + [  (B, C) ] + …  Naïve Bayes:  (Class) +  (Class, feature 1) +  (Class, feature2) … In practice,  is a hash table Maps from feature name (a string or object) to feature weight (a float) e.g.,  (S  NP VP) = weight of the S  NP VP rule, say -0.1 or +1.3 WCFG (multi-class) logistic regression

Change log p(this | that) to  (that & this) 600.465 - Intro to NLP - J. Eisner36 What if our weights were arbitrary real numbers?  n-grams: … +  (w5 w6 w7) +  (w6 w7 w8) + …  Best string is the one whose trigrams have the highest total weight  PCFG:  (S  NP VP) +  (NP  Papa) +  (VP  VP PP) …  Best parse is one whose rules have highest total weight (use CKY/Earley)  HMM tagging: … +  (t5 t6 t7) +  (t7  w7) + …  Best tagging has highest total weight of all transitions and emissions  Noisy channel: [  (source) ] + [  (source, data) ]  To guess source: max (weight of source + weight of source-data match)  Naïve Bayes:  (Class) +  (Class, feature 1) +  (Class, feature 2) …  Best class maximizes prior weight + weight of compatibility with features WCFG (multi-class) logistic regression

time 1 flies 2 like 3 an 4 arrow 5 0 NP3 Vst3 NP10 S8 S13 NP24 S22 S27 NP24 S27 S22 S27 1 NP4 VP4 NP18 S21 VP18 2 P2V5P2V5 PP12 VP16 3 Det1NP10 4 N8N8 1 S  NP VP 6 S  Vst NP 2 S  S PP 1 VP  V NP 2 VP  VP PP 1 NP  Det N 2 NP  NP PP 3 NP  NP NP 0 PP  P NP Given sentence x You know how to find max-score parse y (or min-cost parse) Provided that the score of a parse = a sum over its indiv. rules Each rule score can add up several features of that rule But a feature can’t look at 2 rules at once (how to solve?) fund TO NP to TO NP projectsSBAR S that... SBAR...

Given upper string x You know how to find lower string y such that score(x,y) is highest Provided that score(x,y) is a sum of arc scores along the best path that transduces x to y Each arc score can add up several features of that arc But a feature can’t look at 2 arcs at once (how to solve?)

 Given some input x  Consider a set of candidate outputs y  Define a scoring function score(x,y) Linear function: A sum of feature weights (you pick the features!)  Choose y that maximizes score(x,y) Linear model notation 600.465 - Intro to NLP - J. Eisner39 Ranges over all features, e.g., k=5 (numbered features) or k=“see Det Noun” (named features) Whether (x,y) has feature k(0 or 1) Or how many times it fires(  0) Or how strongly it fires(real #) Weight of feature k To be learned …

 Given some input x  Consider a set of candidate outputs y  Define a scoring function score(x,y) Linear function: A sum of feature weights (you pick the features!)  Choose y that maximizes score(x,y) Linear model notation 600.465 - Intro to NLP - J. Eisner40 To be learned …

600.465 - Intro to NLP - J. Eisner41 Probabilists Rally Behind Paradigm “.2,.4,.6,.8! We’re not gonna take your bait!” 1.Can estimate our parameters automatically  e.g., log p(t7 | t5, t6) (trigram tag probability)  from supervised or unsupervised data 2.Our results are more meaningful  Can use probabilities to place bets, quantify risk  e.g., how sure are we that this is the correct parse? 3.Our results can be meaningfully combined  modularity!  Multiply indep. conditional probs – normalized, unlike scores  p(English text) * p(English phonemes | English text) * p(Jap. phonemes | English phonemes) * p(Jap. text | Jap. phonemes)  p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology) 83% of ^

600.465 - Intro to NLP - J. Eisner42 Probabilists Regret Being Bound by Principle  Problem with our course’s “principled” approach:  All we’ve had is the chain rule + backoff.  But this forced us to make some tough “either-or” decisions.  p(t7 | t5, t6): do we want to back off to t6 or t5?  p(S  NP VP | S) with features: do we want to back off first from number or gender features first?  p(spam | message text): which words of the message do we back off from?? p(Paul Revere wins | weather’s clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, … )

News Flash! Hope arrives …  So far: Chain rule + backoff  = directed graphical model  = Bayesian network or Bayes net  = locally normalized model  We do have a good trick to help with this:  Conditional log-linear model [look back at smoothing lecture]  Solves problems on previous slide!  Computationally a bit harder to train  Have to compute Z(x) for each condition x

Gradient-based training  Gradually try to adjust  in a direction that will improve the function we’re trying to maximize  So compute that function’s partial derivatives with respect to the feature weights in  : the gradient.  Here’s how the key part works out: General function maximization algorithms include gradient ascent, L-BFGS, simulated annealing … E as in EM …These feature expectations are just what forward-backward computes! Inside-outside too!

600.465 - Intro to NLP - J. Eisner45 Why Bother?  Gives us probs, not just scores.  Can use ’em to bet, or combine w/ other probs.  We can now learn weights from data!

 So far: Chain rule + backoff  = directed graphical model  = Bayesian network or Bayes net  = locally normalized model  Also consider: Markov Random Field  = undirected graphical model  = log-linear model (globally normalized)  = exponential model  = maximum entropy model  = Gibbs distribution News Flash! More hope …

600.465 - Intro to NLP - J. Eisner47 Maximum Entropy  Suppose there are 10 classes, A through J.  I don’t give you any other information.  Question: Given message m: what is your guess for p(C | m)?  Suppose I tell you that 55% of all messages are in class A.  Question: Now what is your guess for p(C | m)?  Suppose I also tell you that 10% of all messages contain Buy and 80% of these are in class A or C.  Question: Now what is your guess for p(C | m), if m contains Buy ?  OUCH!

600.465 - Intro to NLP - J. Eisner48 Maximum Entropy ABCDEFGHIJ Buy.051.0025.029.0025 Other.499.0446  Column A sums to 0.55 (“55% of all messages are in class A”)

600.465 - Intro to NLP - J. Eisner49 Maximum Entropy ABCDEFGHIJ Buy.051.0025.029.0025 Other.499.0446  Column A sums to 0.55  Row Buy sums to 0.1 (“10% of all messages contain Buy ”)

600.465 - Intro to NLP - J. Eisner50 Maximum Entropy ABCDEFGHIJ Buy.051.0025.029.0025 Other.499.0446  Column A sums to 0.55  Row Buy sums to 0.1  ( Buy, A) and ( Buy, C) cells sum to 0.08 (“80% of the 10%”)  Given these constraints, fill in cells “as equally as possible”: maximize the entropy (related to cross-entropy, perplexity) Entropy = -.051 log.051 -.0025 log.0025 -.029 log.029 - … Largest if probabilities are evenly distributed

600.465 - Intro to NLP - J. Eisner51 Maximum Entropy ABCDEFGHIJ Buy.051.0025.029.0025 Other.499.0446  Column A sums to 0.55  Row Buy sums to 0.1  ( Buy, A) and ( Buy, C) cells sum to 0.08 (“80% of the 10%”)  Given these constraints, fill in cells “as equally as possible”: maximize the entropy  Now p( Buy, C) =.029 and p(C | Buy ) =.29  We got a compromise: p(C | Buy ) < p(A | Buy ) <.55

600.465 - Intro to NLP - J. Eisner52 Generalizing to More Features ABCDEFGH… Buy.051.0025.029.0025 Other.499.0446 <$100 Other

600.465 - Intro to NLP - J. Eisner53 What we just did  For each feature (“contains Buy”), see what fraction of training data has it  Many distributions p(c,m) would predict these fractions (including the unsmoothed one where all mass goes to feature combos we’ve actually seen)  Of these, pick distribution that has max entropy  Amazing Theorem: This distribution has the form p(m,c) = (1/Z( )) exp  i i f i (m,c)  So it is log-linear. In fact it is the same log-linear distribution that maximizes  j p(m j, c j ) as before!  Gives another motivation for our log-linear approach.

600.465 - Intro to NLP - J. Eisner54 Overfitting  If we have too many features, we can choose weights to model the training data perfectly.  If we have a feature that only appears in spam training, not ling training, it will get weight  to maximize p(spam | feature) at 1.  These behaviors overfit the training data.  Will probably do poorly on test data.

600.465 - Intro to NLP - J. Eisner55 Solutions to Overfitting 1.Throw out rare features.  Require every feature to occur > 4 times, and to occur at least once with each output class. 2.Only keep 1000 features.  Add one at a time, always greedily picking the one that most improves performance on held-out data. 3.Smooth the observed feature counts. 4.Smooth the weights by using a prior.  max p( |data) = max p(, data) =p( )p(data| )  decree p( ) to be high when most weights close to 0

600.465 - Intro to NLP - J. Eisner1 Structured Prediction with Perceptrons and CRFs Time flies like an arrow N PP NP VPD N VP S Time flies like an arrow.

Similar presentations

Presentation on theme: "600.465 - Intro to NLP - J. Eisner1 Structured Prediction with Perceptrons and CRFs Time flies like an arrow N PP NP VPD N VP S Time flies like an arrow."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

600.465 - Intro to NLP - J. Eisner1 Structured Prediction with Perceptrons and CRFs Time flies like an arrow N PP NP VPD N VP S Time flies like an arrow.

Similar presentations

Presentation on theme: "600.465 - Intro to NLP - J. Eisner1 Structured Prediction with Perceptrons and CRFs Time flies like an arrow N PP NP VPD N VP S Time flies like an arrow."— Presentation transcript:

Similar presentations

About project

Feedback