
CRFs vs CMMs, and Stacking




1 CRFs vs CMMs, and Stacking
William W. Cohen Sep 30, 2009

2 Announcements
Wednesday, 9/29: Project abstract due, one/person.
Next Wed, 10/4: Sign up for a slot to present a paper (20 min + Q/A time). Warning: I might shuffle the schedule around a little after I see the proposals.
Next Friday, 10/8: Project abstract due, one/team. Put addresses of project members on the proposal.

3 Conditional Random Fields
Review

4 Label Bias Problem - 1
Consider this as an HMM, with enough training data to model it perfectly:
Pr(0123|rib) = 1, Pr(0453|rob) = 1, Pr(0123) = Pr(0453) = 0.5
Pr(rob|0123) = Pr(r|0) * Pr(o|1) * Pr(b|2) = 1 * 0 * 1 = 0
Pr(rob|0453) = Pr(r|0) * Pr(o|4) * Pr(b|5) = 1 * 1 * 1 = 1
Pr(0123|rob) = … = Pr(rob|0123) * Pr(0123) / Z
Pr(0453|rob) = … = Pr(rob|0453) * Pr(0453) / Z
(computed with forward/backward)
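A minimal Python sketch of that HMM arithmetic, assuming the emission and transition tables implied by the slide (state 3 is treated as a non-emitting final state):

```python
# Toy rib/rob HMM with the "perfectly trained" parameters from the slide.
emit = {0: {'r': 1.0}, 1: {'i': 1.0}, 2: {'b': 1.0},
        4: {'o': 1.0}, 5: {'b': 1.0}}
trans = {(0, 1): 0.5, (0, 4): 0.5, (1, 2): 1.0, (2, 3): 1.0,
         (4, 5): 1.0, (5, 3): 1.0}

def hmm_joint(path, word):
    """Pr(word, path): product of transition and emission probabilities."""
    p = 1.0
    for i, (s, s_next) in enumerate(zip(path, path[1:])):
        p *= emit.get(s, {}).get(word[i], 0.0) * trans.get((s, s_next), 0.0)
    return p

word = 'rob'
joint = {p: hmm_joint(p, word) for p in [(0, 1, 2, 3), (0, 4, 5, 3)]}
Z = sum(joint.values())
print({p: v / Z for p, v in joint.items()})
# {(0, 1, 2, 3): 0.0, (0, 4, 5, 3): 1.0} -- the HMM labels "rob" correctly
```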

5 Label Bias Problem - 2
Consider this as an MEMM, with enough training data to model it perfectly: Pr(0123|rib) = 1, Pr(0453|rob) = 1.
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1
No "next-state classifier" will model this well. There are some things HMMs can learn that MEMMs can't.
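The same example under a locally normalized next-state classifier; a minimal sketch with the "perfectly trained" local probabilities hard-coded (the point is the per-state normalization, not the classifier itself):

```python
# States 1, 2, 4 and 5 each have a single successor in the training data, so
# per-state normalization assigns that successor probability 1 regardless of
# the observation -- the label bias effect.
def local_prob(prev, obs, nxt):
    if prev == 0:                              # the only state with a real choice
        return 0.5 if nxt in (1, 4) else 0.0   # Pr(1|0,r) = Pr(4|0,r) = 0.5
    only_successor = {1: 2, 2: 3, 4: 5, 5: 3}[prev]
    return 1.0 if nxt == only_successor else 0.0

def memm_path_prob(path, word):
    """Product of the locally normalized next-state probabilities."""
    p = 1.0
    for i, (s, s_next) in enumerate(zip(path, path[1:])):
        p *= local_prob(s, word[i], s_next)
    return p

print(memm_path_prob((0, 1, 2, 3), 'rob'))   # 0.5 -- the wrong path is just as likely
print(memm_path_prob((0, 4, 5, 3), 'rob'))   # 0.5
```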

6 From MEMMs to CRFs

7 CRF inference
Succinct, and like MaxEnt: you can see the locality.
[Lattice figure: the sentence "When will prof Cohen post …", with candidate labels B, I, O at each position.]
To classify, find the highest-weight path through the lattice. The normalizer Z is the sum of the weights of all paths through the lattice.

8 [The same lattice figure: "When will prof Cohen post …" with candidate labels B, I, O at each position, highlighting which features each edge can see (the locality).]

9 With Z[j,y] we can also compute stuff like:
What's the probability that y2 = "B"? What's the probability that y2 = "B" and y3 = "I"?
[Same lattice figure: "When will prof Cohen post …" with candidate labels B, I, O at each position.]
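Reading Z[j,y] as the forward/backward sums over the lattice, a self-contained sketch of how those tables give marginals such as Pr(y2 = "B") (same placeholder potential as above):

```python
import math

LABELS = ['B', 'I', 'O']

def edge_weight(y_prev, y, x, i):
    # Same placeholder potential as in the decoding sketch above.
    return math.exp(1.0 if (y_prev, y) != ('O', 'I') else -1.0)

def forward(x):
    """alpha[j][y]: total weight of all partial paths ending in y at position j."""
    alphas = [{y: edge_weight(None, y, x, 0) for y in LABELS}]
    for i in range(1, len(x)):
        alphas.append({y: sum(alphas[-1][p] * edge_weight(p, y, x, i)
                              for p in LABELS) for y in LABELS})
    return alphas

def backward(x):
    """beta[j][y]: total weight of all ways of completing a path from y at j."""
    betas = [{y: 1.0 for y in LABELS}]
    for i in range(len(x) - 1, 0, -1):
        betas.insert(0, {y: sum(edge_weight(y, n, x, i) * betas[0][n]
                                for n in LABELS) for y in LABELS})
    return betas

def marginal(x, j, label):
    """Pr(y_j = label | x) = alpha[j][label] * beta[j][label] / Z."""
    alphas, betas = forward(x), backward(x)
    Z = sum(alphas[-1].values())
    return alphas[j][label] * betas[j][label] / Z

x = 'When will prof Cohen post'.split()
print(marginal(x, 1, 'B'))   # the probability that y2 = "B" (position 1, 0-indexed)
```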

10 CRF learning
[Lattice figure: the sentence "When will prof Cohen post the notes …", with candidate labels B, I, O at each position.]
Goal is to learn how to weight edges in the graph, e.g.:
weight(yi, yi+1) = 2*[(yi = B or I) and isCap(xi)] + 1*[yi = B and isFirstName(xi)] - 5*[yi+1 ≠ B and isLower(xi) and isUpper(xi+1)]
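That example rule transcribes directly into code; here isFirstName is backed by a hypothetical gazetteer and isUpper(xi+1) is read as "starts with an uppercase letter" (both assumptions for illustration):

```python
# Direct transcription of the slide's example weighting rule. FIRST_NAMES is a
# hypothetical gazetteer; the weights 2, 1, -5 are the slide's illustrative values.
FIRST_NAMES = {'William', 'Vitor'}

def is_cap(w):   return w[:1].isupper()
def is_lower(w): return w.islower()

def weight(y_i, y_next, x_i, x_next):
    w = 0.0
    w += 2.0 * ((y_i in ('B', 'I')) and is_cap(x_i))
    w += 1.0 * ((y_i == 'B') and (x_i in FIRST_NAMES))
    w -= 5.0 * ((y_next != 'B') and is_lower(x_i) and is_cap(x_next))
    return w

print(weight('B', 'I', 'Cohen', 'post'))   # 2.0: capitalized token inside a name
```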

11 CRF learning – from Sha & Pereira

12 CRF learning – from Sha & Pereira

13 CRF learning – from Sha & Pereira
Something like forward-backward. Idea: define a matrix of y,y' "affinities" at stage j: Mj[y,y'] = the "unnormalized probability" of a transition from y to y' at stage j, as in the notes above. Then the matrix product Mj * Mj+1 accumulates the "unnormalized probability" of every path through stages j and j+1.
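A small numpy sketch of this matrix view, with a placeholder score function in place of learned feature weights:

```python
# M_j[y, y'] holds the unnormalized score of moving from y to y' at position j;
# M_j @ M_{j+1} sums the scores of all two-step paths through stages j and j+1.
import numpy as np

LABELS = ['B', 'I', 'O']

def M(j, x, score):
    """Build the |Y| x |Y| matrix of exp(edge scores) at position j."""
    return np.array([[np.exp(score(y, y2, x, j)) for y2 in LABELS]
                     for y in LABELS])

score = lambda y, y2, x, j: 1.0 if (y, y2) != ('O', 'I') else -1.0   # placeholder
x = 'When will prof Cohen post'.split()
M1, M2 = M(1, x, score), M(2, x, score)
two_step = M1 @ M2          # [y, y'']: total weight of all paths y -> y' -> y''
print(two_step.shape)       # (3, 3)
```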

14 Forward backward ideas
[Trellis figure: three stages of "name" vs. "nonName" states, with edge weights labeled b, c, d, f, g, h.]

15 CRF learning – from Sha & Pereira

16 Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

17 Sha & Pereira results in minutes, 375k examples

18 Klein & Manning: Conditional Structure vs Estimation

19 Task 1: WSD (Word Sense Disambiguation)
Bush’s election-year ad campaign will begin this summer, with... (sense1) Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2) Class is sense1/sense2, features are context words.

20 Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model: Use conditional rule to predict sense s from context-word observations o. Standard NB training maximizes “joint likelihood” under independence assumption
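A minimal sketch of that decision rule (the conditional arg-max read off a jointly trained multinomial NB model), with made-up counts purely for illustration:

```python
# Pick the sense maximizing log P(s) + sum_w log P(w | s); the counts below
# are invented for this example, not the paper's data.
import math
from collections import Counter

prior = {'sense1': 0.5, 'sense2': 0.5}
word_given_sense = {
    'sense1': Counter({'election': 5, 'campaign': 4, 'summer': 1}),
    'sense2': Counter({'trails': 5, 'whacking': 4, 'summer': 1}),
}

def predict(context, alpha=1.0):
    """Arg-max of the (smoothed) joint likelihood, read off conditionally."""
    vocab = len(set().union(*[set(c) for c in word_given_sense.values()]))
    best, best_score = None, -math.inf
    for s in prior:
        counts = word_given_sense[s]
        total = sum(counts.values())
        score = math.log(prior[s]) + sum(
            math.log((counts[w] + alpha) / (total + alpha * vocab))
            for w in context)
        if score > best_score:
            best, best_score = s, score
    return best

print(predict(['election', 'campaign', 'summer']))   # sense1
```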

21 Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize conditional likelihood (sound familiar?), or maybe the SenseEval score, or maybe even… [the objective formulas appeared as figures on the original slide].
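The objective formulas did not survive transcription; as a hedged reconstruction (standard forms, not necessarily the slide's exact notation), the joint vs. conditional likelihood objectives being contrasted are:

```latex
% Reconstruction of the standard objectives (not the slide's exact figures):
% s_d is the sense of example d, o_d its vector of context-word observations.
\mathrm{JL}(\theta) = \sum_d \log P_\theta(s_d, \mathbf{o}_d)
                    = \sum_d \Bigl[ \log P_\theta(s_d) + \sum_i \log P_\theta(o_{d,i} \mid s_d) \Bigr]
\qquad
\mathrm{CL}(\theta) = \sum_d \log P_\theta(s_d \mid \mathbf{o}_d)
                    = \sum_d \log \frac{P_\theta(s_d, \mathbf{o}_d)}{\sum_{s'} P_\theta(s', \mathbf{o}_d)}
```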

22 In other words… MaxEnt vs. Naïve Bayes: different "optimization goals"…
… or, dropping a constraint about f’s and λ’s

23 Task 1: WSD (Word Sense Disambiguation)
Optimize JL with standard NB learning. Optimize SCL and CL with conjugate gradient, also over "non-deficient models", using Lagrange penalties to enforce a "soft" version of the deficiency constraint (this makes sure the non-conditional version is a valid probability distribution). "Punt" on optimizing accuracy. There is a penalty for extreme predictions in SCL.

24

25 Conclusion: maxent beats NB?
All generalizations are wrong?

26 Task 2: POS Tagging
A sequential problem: replace NB with an HMM model. Standard algorithms maximize joint likelihood. Claim: keeping the same model but maximizing conditional likelihood leads to a CRF. Is this true? The alternative is conditional structure (a CMM).

27 HMM CRF

28 Using conditional structure vs maximizing conditional likelihood
The CMM factors Pr(s,o) into Pr(s|o)Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o); i.e., the JL estimate equals the CL estimate for Pr(s|o).

29 Task 2: POS Tagging Experiments with a simple feature set:
For a fixed model, CL is preferred to JL (the CRF beats the HMM). For a fixed objective, the HMM is preferred to the MEMM/CMM.

30 Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies. There is too much emphasis on the observation and not enough on the previous states ("observation bias"). Put another way: label bias predicts over-prediction of states with few outgoing transitions, or more generally, low-entropy transition distributions...

31 Error analysis for POS tagging

32 Stacked Sequential Learning
William W. Cohen, Center for Automated Learning and Discovery, Carnegie Mellon University; Vitor Carvalho, Language Technology Institute, Carnegie Mellon University

33 Outline
Motivation: MEMMs don't work on segmentation tasks
New method: stacked sequential MaxEnt; stacked sequential anything
Results
More results...
Conclusions

34 However, in celebration of the locale, I will present these results in the style of Sir Walter Scott (1771-1832), author of "Ivanhoe" and other classics. In that pleasant district of merry Pennsylvania which is watered by the river Mon, there extended since ancient times a large computer science department. Such being our chief scene, the date of our story refers to a period towards the middle of the year...

35 Chapter 1, in which a graduate student (Vitor) discovers a bug in his advisor’s code that he cannot fix The problem: identifying reply and signature sections of messages. The method: classify each line as reply, signature, or other.

36 Chapter 1, in which a graduate student discovers a bug in his advisor’s code that he cannot fix
The problem: identifying reply and signature sections of messages. The method: classify each line as reply, signature, or other. The warmup: classify each line as signature or nonsignature, using learning methods from Minorthird and a dataset of 600+ messages. The results: from [CEAS-2004, Carvalho & Cohen]...

37 Chapter 1, in which a graduate student discovers a bug in his advisor’s code that he cannot fix
But... Minorthird's version of MEMMs has an accuracy of less than 70% (guessing the majority class gives an error rate of around 10%!)

38 Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMMs).
From data, learn a MaxEnt model Pr(yi|yi-1,xi). To classify a sequence x1,x2,..., search for the best y1,y2,... with Viterbi beam search. This is a probabilistic classifier using the previous label Yi-1 as a feature (or conditioned on Yi-1): Pr(Yi | Yi-1, f1(Xi), f2(Xi), ...), where the fj are features of Xi.
[Chain diagram: observations Xi-1, Xi, Xi+1 and labels Yi-1, Yi, Yi+1 (reply/sig).]
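A minimal sketch of the decoding side, with a toy stand-in for the learned local model Pr(Yi | Yi-1, features of Xi):

```python
# Beam-search decoding for an MEMM-style local model. local_prob is a
# placeholder; a real MEMM would apply its trained MaxEnt classifier here.
import math

LABELS = ['reply', 'sig', 'other']

def local_prob(y, y_prev, x, i):
    # Toy "sticky" model: strongly prefers repeating the previous label.
    if y_prev is None:
        return 1.0 / len(LABELS)
    return 0.8 if y == y_prev else 0.1

def beam_decode(x, beam_size=3):
    beam = [([], 0.0)]                       # (label sequence, log-probability)
    for i in range(len(x)):
        candidates = []
        for seq, lp in beam:
            prev = seq[-1] if seq else None
            for y in LABELS:
                candidates.append((seq + [y],
                                   lp + math.log(local_prob(y, prev, x, i))))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam[0][0]

print(beam_decode(['line 1', 'line 2', '-- Bill']))   # e.g. ['reply', 'reply', 'reply']
```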

39 Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMMs) ... and also praise their many virtues relative to CRFs.
MEMMs are easy to implement. MEMMs train quickly: there is no probabilistic inference in the inner loop of learning, and you can use any old classifier (even if it's not probabilistic). MEMMs scale well with the number of classes and the length of the history: Pr(Yi | Yi-1, Yi-2, ..., f1(Xi), f2(Xi), ...).
[Chain diagram: observations Xi-1, Xi, Xi+1 and labels Yi-1, Yi, Yi+1.]

40 The flashback ends and we return again to our document analysis task, on which the elegant MEMM method fails miserably for reasons unknown. MEMMs have an accuracy of less than 70% on this problem – but why?

41 Chapter 2, in which, in the fullness of time, the mystery is investigated...
[Figure: predicted vs. true labels for a message, showing a long run of false-positive predictions.]
...and it transpires that the classifier often predicts a signature block that is much longer than is correct, as if the MEMM "gets stuck" predicting the sig label.

42 Chapter 2, in which, in the fullness of time, the mystery is investigated...
...and it transpires that Pr(Yi=sig | Yi-1=sig) = 1-ε as estimated from the data, giving the previous-label feature a very high weight.
[Chain diagram: observations Xi-1, Xi, Xi+1 and labels Yi-1, Yi, Yi+1 (reply/sig).]

43 Chapter 2, in which, in the fullness of time, the mystery is investigated...
We added "sequence noise" by randomly switching around 10% of the lines: this lowers the weight for the previous-label feature, improves performance for MEMMs, and degrades performance for CRFs. Adding noise in this case, however, is a loathsome bit of hackery.

44 Chapter 2, in which, in the fullness of time, the mystery is investigated...
Label bias problem: CRFs can represent some distributions that MEMMs cannot [Lafferty et al 2001], e.g., the "rib-rob" problem; but this doesn't explain why MaxEnt >> MEMMs. Observation bias problem: MEMMs can overweight "observation" features [Klein and Manning 2002]; but here we observe the opposite: the history features are overweighted.
[Diagram relating CRFs, MEMMs, MaxEnt, and the rib-rob example.]

45 Chapter 2, in which, in the fullness of time, the mystery is investigated... and an explanation is proposed.
From data, learn a MaxEnt model Pr(yi|yi-1,xi). To classify a sequence x1,x2,..., search for the best y1,y2,... with Viterbi beam search: a probabilistic classifier using the previous label Yi-1 as a feature (or conditioned on Yi-1).
[Chain diagram: observations Xi-1, Xi, Xi+1 and labels Yi-1, Yi, Yi+1 (reply/sig).]

46 Chapter 2, in which, in the fullness of time, the mystery is investigated... and an explanation is proposed.
From data, learn a MaxEnt model Pr(yi|yi-1,xi); to classify a sequence x1,x2,..., search for the best y1,y2,... with Viterbi beam search. The learning data is noise-free, including the values for Yi-1, but at classification time the values for Yi-1 are noisy, since they come from predictions. I.e., the history values used at learning time are a poor approximation of the values seen at classification time.

47 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
From data, learn a MaxEnt model Pr(yi|yi-1,xi); to classify a sequence x1,x2,..., search for the best y1,y2,... with Viterbi beam search. While learning, replace the true value for Yi-1 with an approximation of the predicted value of Yi-1. To approximate the value predicted by MEMMs, use the value predicted by non-sequential MaxEnt in a cross-validation experiment. After Wolpert [1992] we call this stacked MaxEnt: find approximate Y's with a MaxEnt-learned hypothesis, and then apply the sequential model to that.

48 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
Learn Pr(yi|xi) with MaxEnt and save the model as f(x).
Do k-fold cross-validation with MaxEnt, saving the cross-validated predictions y'i = fk(xi).
Augment the original examples with the y''s and compute history features: g(x,y') → x'.
Learn Pr(yi|x'i) with MaxEnt and save the model as f'(x').
To classify: augment x with y' = f(x), and apply f' to the resulting x'; i.e., return f'(g(x, f(x))).
[Diagram: inputs Xi-1, Xi, Xi+1 feed f to produce the estimated labels Y'i-1, Y'i, Y'i+1, which feed f' to produce Yi-1, Yi, Yi+1.]
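A minimal sketch of this recipe, using scikit-learn's LogisticRegression as the MaxEnt learner (an assumption; the talk used Minorthird), with labels assumed to be integer-encoded and a symmetric window of cross-validated predictions standing in for the history features:

```python
# Stacked sequential MaxEnt sketch: X has one feature row per line of the
# message, y one integer label per line.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def history_features(X, y_hat, w=2):
    """g(x, y'): append the estimated labels of the w previous/next lines."""
    n = len(X)
    hist = np.array([[y_hat[j] if 0 <= j < n else -1
                      for j in range(i - w, i + w + 1) if j != i]
                     for i in range(n)])
    return np.hstack([X, hist])

def train_stacked(X, y, k=5, w=2):
    f = LogisticRegression(max_iter=1000).fit(X, y)            # save f(x)
    y_cv = cross_val_predict(LogisticRegression(max_iter=1000),
                             X, y, cv=k)                        # y'_i = f_k(x_i)
    X_aug = history_features(X, y_cv, w)                        # g(x, y') -> x'
    f_prime = LogisticRegression(max_iter=1000).fit(X_aug, y)   # save f'(x')
    return f, f_prime

def predict_stacked(f, f_prime, X, w=2):
    """Return f'(g(x, f(x)))."""
    return f_prime.predict(history_features(X, f.predict(X), w))
```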

49 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
StackedMaxEnt (k=5) outperforms MEMMs and non-sequential MaxEnt, but not CRFs. StackedMaxEnt can also be easily extended: it's easy (but expensive) to increase the depth of stacking; it's easy to increase the history size; it's easy to build features for "future" estimated Yi's as well as "past" Yi's; and stacking can be applied to any other sequential learner.

50 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
StackedMaxEnt can also be easily extended: it's easy (but expensive) to increase the depth of stacking; it's cheap to increase the history size; it's easy to build features for "future" estimated Yi's as well as "past" Yi's; and stacking can be applied to any other sequential learner.
[Diagram: deeper stacking, with each layer's estimated labels Ŷi-1, Ŷi, Ŷi+1 feeding the layer above, on top of the inputs Xi-1, Xi, Xi+1.]

51 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
StackedMaxEnt can also be easily extended: it's easy (but expensive) to increase the depth of stacking; it's cheap to increase the history size; it's easy to build features for "future" estimated Yi's as well as "past" Yi's; and stacking can be applied to any other sequential learner.
[Diagram: each Yi conditioned on a window of estimated labels Ŷ to its left and right, on top of the inputs Xi-1, Xi, Xi+1.]

52 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
StackedMaxEnt can also be easily extended: it's cheap to increase the history size, and to build features for "future" estimated Yi's as well as "past" Yi's.
[Diagram: Yi depends on a window of observations Xi-2 ... Xi+1 and the estimated labels Ŷ in that window.]

53 Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
The same recipe with a CRF as the base learner. StackedMaxEnt can also be easily extended: it's easy (but expensive) to increase the depth of stacking; it's cheap to increase the history size; it's easy to build features for "future" estimated Yi's as well as "past" Yi's; and stacking can be applied to any other sequential learner. Learn Pr(yi|xi) with the base learner and save the model as f(x); do k-fold cross-validation, saving the cross-validated predictions y'i = fk(xi); augment the original examples with the y''s and compute history features g(x,y') → x'; learn Pr(yi|x'i) and save the model as f'(x'). To classify: augment x with y' = f(x), and apply f' to the resulting x'; i.e., return f'(g(x, f(x))).

54 Chapter 3, in which a novel extension to MEMMs is proposed and several diverse variants of the extension are evaluated on signature-block finding....
[Results chart comparing the non-sequential MaxEnt baseline, the CRF baseline, stacked MaxEnt with no "future" features, and stacked MaxEnt / stacked CRFs with a large history+future window, across window/history sizes.]
The reduction in error rate for stacked MaxEnt (s-ME) vs. CRFs is 46%, which is statistically significant. With large windows, stacked MaxEnt is better than the CRF baseline.

55 Chapter 4, in which the experiment above is repeated on a new domain, and then repeated again on yet another new domain.
[Results table: with stacking (w=k=5) vs. without stacking, on newsgroup FAQ segmentation (2 labels x three newsgroups) and on video segmentation.]

56 Chapter 4, in which the experiment above is repeated on a new domain, and then repeated again on yet another new domain.

57 Chapter 5, in which all the experiments above were repeated for a second set of learners: the voted perceptron (VP), the voted-perceptron-trained HMM (VP-HMM), and their stacked versions.

58 Chapter 5, in which all the experiments above were repeated for a second set of learners: the voted perceptron (VP), the voted-perceptron-trained HMM (VP-HMM), and their stacked versions.
Stacking usually* improves or leaves unchanged: MaxEnt (p>0.98), VotedPerc (p>0.98), VP-HMM (p>0.98), CRFs (p>0.92).
*on a randomly chosen problem, using a 1-tailed sign test

59 Chapter 4b, in which the experiment above is repeated again for yet one more new domain....
Classify pop songs as "happy" or "sad". One-second-long song "frames" inherit the mood of their containing song. Song frames are classified with a sequential classifier, and a song's mood is the majority class of all its frames. 52,188 frames from 201 songs, 130 features per frame; used k=5, w=25.

60 Epilog: in which the speaker discusses certain issues of possible interest to the listener, who is now fully informed of the technical issues (or it may be, only better rested) and thus receptive to such commentary.
Scope: we considered only segmentation tasks (sequences with long runs of identical labels) and 2-class problems; this is where the MEMM fails.
Issue: the learner is brittle w.r.t. its assumptions; the training data for the local model is assumed to be error-free, which is systematically wrong.
Solution: sequential stacking, a model-free way to improve robustness. Stacked MaxEnt outperforms or ties CRFs on 8/10 tasks; stacked VP outperforms CRFs on 8/9 tasks. Stacking is a meta-learning method that applies to any base learner, and it can also reduce the error of CRFs substantially. Experiments with non-segmentation problems (NER) showed no large gains.

61 Epilog: in which the speaker discusses certain issues of possible interest to the listener, who is now fully informed of the technical issues (or it may be, only better rested) and thus receptive to such commentary ... and in which finally, the speaker realizes that the structure of the epic romantic novel is ill-suited to talks of this ilk, and perhaps even the very medium of PowerPoint itself, but nonetheless persists with a final animation... Sir W. Scott, R.I.P.


Download ppt "CRFs vs CMMs, and Stacking"

Similar presentations


Ads by Google