
1 2011/03/11 Yi-Ting Huang
Evans, R. (2001). Applying Machine Learning Toward an Automatic Classification of It. Literary and Linguistic Computing, 16(1), pp. 45-57.
Muller, C. (2006). Automatic Detection of Nonreferential It in Spoken Multi-Party Dialog. In Proceedings of the European Chapter of the ACL (EACL), pp. 49-56.
Bergsma, S., Lin, D., & Goebel, R. (2008). Distributional Identification of Non-Referential Pronouns. In Proceedings of ACL-08: HLT, pp. 10-18.

2 Outline
- Applying Machine Learning Toward an Automatic Classification of It
- Automatic Detection of Nonreferential It in Spoken Multi-Party Dialog
- Distributional Identification of Non-Referential Pronouns
- Comparison: Definition, Features, Method and Results

3 Applying Machine Learning Toward an Automatic Classification of It. Literary and Linguistic Computing, 16(1), pp. 45-57, 2001.

4 Research problem and purpose
Research problem: the pronoun it has several quite different uses, e.g.
- 'Do not sweep the dust_i, when dry, you will only recirculate it_i.' (anaphoric)
- 'It is worth having more than one size or a good-quality set with interchangeable bits.' (pleonastic)
- 'I can make it.' (idiomatic)
Research purpose: this paper proposes an automatic classification system for the pronoun it.

5 Definition
- Nominal anaphoric (anaphoric; backward search): 'Do not sweep the dust_i, when dry, you will only recirculate it_i.'
- Clause anaphoric (anaphoric; backward search): 'One day in 1970, fifty thousand women marched down Fifth Avenue in New York_i. It_i is said to have been the biggest women's gathering since suffrage days.' (the antecedent is a preceding clause rather than a noun phrase)
- Proaction (anaphoric; backward search): 'Mays walloped four home runs in a span of nine innings_i. Incidentally, only two did it_i before a home audience.' Here it combines with do to form a unit that takes its interpretation from a preceding verb phrase in the text.
- Cataphoric (anaphoric; forward search): 'When it_i fell, the glass_i broke.'
- Discourse topic (non-anaphoric): 'Always use a tool for the job it was designed to do. Always use tools correctly. If it feels very awkward, stop.'
- Pleonastic (non-anaphoric): 'It is worth having more than one size or a good-quality set with interchangeable bits.'
- Idiomatic/stereotypic (non-anaphoric): 'I take it you're going now.'

6 Method: Preprocessing
Each text is preprocessed to obtain the part of speech and morphological lemma of each word, and the dependency relations between the words, using the CFDG parser.

7 Method: Features
- Positional information: the position of the instance in terms of word position in the sentence and sentence position in the paragraph.
- Context number of POS: the number of elements suggestive of the pronoun's class in the surrounding text.
- Context lemmas: lemmas of preceding material such as verbs, and of following material such as verbs or adjectives, in the same sentence as the instance.
- Context POS: the parts of speech of eight tokens, four words before and four words after the instance.
- Context pattern: the sequences used at present are 'adjective + noun phrase' and 'complementizer + noun phrase', as in 'It was obvious the book would fall' and 'It was obvious that the book would fall'.
- Context grammar form: proximity of following elements such as complementizers, -ing forms of verbs, and prepositions, expressed in tokens.
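
A minimal, hypothetical sketch of what such a feature vector might look like for one occurrence of it. The four-token windows and the feature names follow the slide; the function, the tuple-based token representation, and the example sentence are assumptions for illustration, and the dependency-parse information used by the real system is omitted.

```python
# Sketch of Evans-style features for one occurrence of "it".
# Tokens are (word, lemma, pos) tuples; parser output is not reproduced here.

def extract_features(sentence, i, sent_index):
    """sentence: list of (word, lemma, pos) tuples; i: index of 'it'."""
    before = sentence[max(0, i - 4):i]   # four tokens to the left
    after = sentence[i + 1:i + 5]        # four tokens to the right
    feats = {
        # positional information
        "word_position_in_sentence": i,
        "sentence_position_in_paragraph": sent_index,
        # context lemmas: nearest preceding verb, nearest following verb/adjective
        "prev_verb_lemma": next((lem for _, lem, pos in reversed(before)
                                 if pos.startswith("VB")), "NONE"),
        "next_verb_or_adj_lemma": next((lem for _, lem, pos in after
                                        if pos.startswith(("VB", "JJ"))), "NONE"),
    }
    # context POS: parts of speech of the eight surrounding tokens
    for k, (_, _, pos) in enumerate(before):
        feats[f"pos_before_{k}"] = pos
    for k, (_, _, pos) in enumerate(after):
        feats[f"pos_after_{k}"] = pos
    return feats

example = [("It", "it", "PRP"), ("was", "be", "VBD"), ("obvious", "obvious", "JJ"),
           ("that", "that", "IN"), ("the", "the", "DT"), ("book", "book", "NN"),
           ("would", "would", "MD"), ("fall", "fall", "VB")]
print(extract_features(example, 0, 0))
```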

8 Experiment: Corpora
Corpora: SUSANNE (Surface and Underlying Structural Analysis of Naturalistic English) and the BNC.
Data distribution:
- nominal anaphoric: 67.93% (2,154)
- clause anaphoric: 0.82% (26)
- proaction uses: 0.06% (2)
- cataphoric: 0.09% (3)
- discourse topic mentions: 2.08% (66)
- pleonastic: 26.77% (849)
- idiomatic/stereotypic constructions: 2.24% (71)
The corpus contains 368,830 words with 3,171 examples of it.

9 Experiment: Baseline system
- Rule-based system (baseline): Litman (1996) used C4.5 and CGRENDEL to derive classification procedures from human-annotated training data.
- Machine learning based system: k-NN in the TiMBL (Tilburg University's Memory Based Learner) package, with k = 15 and gain ratio (Quinlan, 1993) as the distance metric.

Litman, D. J. (1996). Cue phrase classification using machine learning. Journal of Artificial Intelligence Research, 4, 53-94.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
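
A rough approximation of the memory-based setup, assuming scikit-learn in place of TiMBL. TiMBL's gain-ratio feature weighting has no direct scikit-learn equivalent, so this sketch simply runs k-NN over one-hot encoded symbolic features; the toy feature vectors and labels are invented for illustration, and k is reduced because of the tiny example set (the paper used k = 15).

```python
# Simplified memory-based classification of "it" instances (k-NN over symbolic features).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

X = [["VBD", "JJ", "start"], ["VB", "NN", "mid"],
     ["MD", "JJ", "start"], ["VBD", "NN", "end"]]   # toy symbolic feature vectors
y = ["pleonastic", "nominal", "pleonastic", "nominal"]

clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),          # symbolic -> binary features
    KNeighborsClassifier(n_neighbors=3),             # k = 15 in the paper
)
clf.fit(X, y)
print(clf.predict([["VBD", "JJ", "mid"]]))
```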

10 Experiment: Result (results table shown as a figure on the slide)

11 Experiment: Error analysis
- The features assigned to instances in the training sets are most appropriate for classification of pleonastic instances.
- The training data are insufficient for the rarer classes, given that classification is made on the basis of fifteen nearest neighbours.

12 Conclusion
- This paper proposes an automatic classification system for the pronoun it.
- The system is based on memory-based learning and was found to compare well with a rule-based classification system.
- The system was found to be totally ineffective in the identification of clause anaphoric, proaction, and cataphoric uses of it.

13 Automatic Detection of Nonreferential It in Spoken Multi-Party Dialog. In Proceedings of the European Chapter of the ACL (EACL), pp. 49-56, 2006.

14 Research problem and purpose
- Problem: determining whether a pronoun in a multi-party dialog refers to a preceding noun phrase or is instead nonreferential.
- Purpose: we present an implemented machine learning system for the automatic detection of nonreferential it in spoken dialog based on shallow features.

15 Definition
- normal (anaphoric)
- vague (anaphoric): covers instances of it which are indeed referential, but whose referent is not an identifiable linguistic string in the context of the pronoun. A frequent (but not the only) type of vague it is the one referring to the current discourse topic.
- discarded (non-anaphoric): e.g. the first it_1 in "MN059: Yeah. Yeah. Yeah. I'm sure I could learn a lot about um, yeah, just how to - how to come up with these structures, cuz it_1's - it_2's very easy to whip up something quickly."
- extrapos it (non-anaphoric): the second it_2 above is the subject of an extraposition construction (to-infinitive, ing-form, or that-clause with or without complementizer).
- prop-it (non-anaphoric): "So it seems like a lot of - some of the issues are the same."; prop-it is used otherwise.
- other (non-anaphoric): the actual tag set was larger, including categories like idiom, which, however, the annotators turned out to use extremely rarely.

16 Method: Preprocessing
- Segmentation: we removed all single dashes (i.e. sentence breaks and interruption points), non-lexicalised filled pauses (like em and eh), and all word fragments.
- We also removed a short list of expressions like actually, you know, I mean, but also so and sort of.
- POS tags were obtained automatically with the Stanford tagger.
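
A hypothetical sketch of this transcript cleanup. The lists of filled pauses and discourse-marker expressions here are small illustrative subsets taken from the slide, and the exact regular expressions are assumptions rather than the paper's implementation.

```python
# Rough transcript cleanup: drop dashes, filled pauses, fragments, and discourse markers.
import re

FILLED_PAUSES = {"um", "uh", "em", "eh"}
DISCOURSE_MARKERS = {"actually", "you know", "i mean", "so", "sort of"}

def clean_utterance(text):
    text = re.sub(r"\s-\s", " ", text)                       # single dashes (interruption points)
    tokens = [t for t in text.split()
              if not t.endswith("-")                          # word fragments
              and t.lower().strip(",.") not in FILLED_PAUSES]  # filled pauses
    cleaned = " ".join(tokens)
    for marker in sorted(DISCOURSE_MARKERS, key=len, reverse=True):
        cleaned = re.sub(rf"\b{re.escape(marker)}\b", "", cleaned, flags=re.I)
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_utterance("I'm sure I could learn a lot about um, yeah, "
                      "just how to - how to come up with these structures."))
```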

17 Method: Features (shown as a table on the slide; see the feature comparison on slide 41: syntactic patterns, lexical features, distance features, oblique, seem list, and discard)

18 Experiment: Corpora
- Corpus: ICSI Meeting Corpus (Janin et al., 2003).
- Annotation: two annotators. The annotators received instructions including descriptions and examples for all categories, and a decision tree diagram which used wh-question formation as a test to distinguish extrapos it and prop-it on the one hand from normal and vague on the other.
- Inter-annotator agreement: reported as percentage agreement and the respective kappa value per category (table shown on the slide).
- Data distribution: normal: 588 (57%), vague: 48 (4%), discarded: 222 (21%), extrapos it: 71 (7%), prop-it: 88 (8%). Total: 1,017 instances, 62.5% of which were referential and 37.5% non-referential.

19 Experiment: Result
- Method: we used JRip, the WEKA reimplementation of Ripper.
- All following figures were obtained by means of ten-fold cross-validation.
- (The learned rules and results table are shown as a figure on the slide.)

20 Analysis
In the None condition:
- the discarded feature is very successful because it produces no false positives.
- the seem list feature alone was able to correctly identify 22 instances, producing 9 false positives.
- a rule involving distance features produces 37 true positives and 16 false positives.
- the feature encoding the distance to the next complementizer produces 14 true positives and five false positives.
In the Sentence Breaks condition:
- the model produced the identical rule using the discarded feature.
- the same applies to the seem list feature, which produces 23 true positives and 6 false positives.
- a rule based on the distance to the next to-infinitive and the next adjective produces 57 true positives and only 30 false positives.
In the Interruption Points condition:
- complex rules.

21 Conclusion
- We present an implemented machine learning system for the automatic detection of nonreferential it in spoken dialog.
- The system builds on shallow features extracted from dialog transcripts.
- Our experiments indicate a level of performance that makes the system usable as a preprocessing filter for a coreference resolution system.

22 Distributional Identification of Non-Referential Pronouns. In Proceedings of ACL-08: HLT, pp. 10-18, 2008.

23 Research problem and purpose
- Problem: determining whether a pronoun in text refers to a preceding noun phrase or is instead nonreferential.
- Purpose: we extract the surrounding textual context of the pronoun and gather, from a large corpus, the distribution of words that occur within that context; we then learn to reliably classify these distributions as representing either referential or non-referential pronoun instances.

24 Definition
- Anaphoric pronoun (anaphoric): in "You can make it in advance.", it is an anaphoric pronoun referring to some previous noun phrase, like "the sauce" or "an appointment".
- Idiomatic expression (non-anaphoric): in "You can make it in Hollywood.", it is part of the idiomatic expression "make it", meaning "succeed".
- Clause anaphoric (non-anaphoric): the word It in the sentence "The paper reported that it had snowed. It was obvious." is considered referential. From our perspective, this interpretation is somewhat arbitrary. Indeed, annotation experiments using very fine-grained categories show low annotation reliability (Muller, 2006).
- Pleonastic (non-anaphoric): "It was obvious that it had snowed." is considered non-referential.

25 Method: Preprocessing
- Lower-case all tokens.
- Convert sequences of digits to the # symbol.
- Run the Porter stemmer (Porter, 1980).
- To generalize rare names, convert capitalized words longer than five characters to a special NE tag.
- A few simple rules stem the irregular verbs be, have, do, and said, and convert the common contractions 'nt, 's, 'm, 're, 've, 'd, and 'll to their most likely stem.
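
A minimal sketch of these normalization steps, assuming NLTK's Porter stemmer. The irregular-verb and contraction mappings below are small illustrative subsets, not the paper's full lists.

```python
# Token normalization: lowercase, digits -> #, NE tag, stemming, simple contraction rules.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
IRREGULAR = {"is": "be", "was": "be", "are": "be", "has": "have", "had": "have",
             "does": "do", "did": "do", "said": "say"}
CONTRACTIONS = {"'nt": "not", "n't": "not", "'s": "is", "'m": "be", "'re": "be",
                "'ve": "have", "'d": "would", "'ll": "will"}

def normalize(token):
    token = token.lower()
    if token.isdigit():
        return "#"                        # digit sequences -> #
    if token in CONTRACTIONS:
        return CONTRACTIONS[token]
    if token in IRREGULAR:
        return IRREGULAR[token]
    return stemmer.stem(token)

def normalize_sentence(tokens):
    out = []
    for raw in tokens:
        # generalize rare names: capitalized words longer than five characters
        if raw[0].isupper() and len(raw) > 5:
            out.append("NE")
        else:
            out.append(normalize(raw))
    return out

print(normalize_sentence("The paper reported that it had snowed in 2008".split()))
```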

26 Method: n-gram model
We then find all n-grams matching our patterns, allowing any token to match the wildcard in place of it. (The slide shows a figure with example context patterns and their pattern fillers.)
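
A hypothetical sketch of how the context patterns around an occurrence of it can be built, with the pronoun position replaced by a wildcard in each possible slot of the n-gram. In the paper these patterns are then looked up in the Google Web 1T n-gram data; the function below is only an illustration of the pattern construction.

```python
# Build the n context patterns of length n around the pronoun at index i,
# with the pronoun position replaced by a wildcard.
def context_patterns(tokens, i, n=5, wildcard="*"):
    """tokens: normalized sentence; i: index of the pronoun."""
    patterns = []
    for offset in range(n):               # wildcard can sit in any of the n positions
        start = i - offset
        end = start + n
        if start < 0 or end > len(tokens):
            patterns.append(None)         # pattern would span beyond the sentence
            continue
        gram = tokens[start:end]
        gram[offset] = wildcard
        patterns.append(" ".join(gram))
    return patterns

sent = "you can make it in advance".split()
for p in context_patterns(sent, sent.index("it")):
    print(p)
```

Patterns that would cross the sentence boundary come back as None, which is exactly the situation the indicator features on the next slide encode.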

27 Method: Features
- Five 5-gram count features: for each of the five 5-gram patterns, ordered by the position of the wildcard, there are features for the logarithm of counts for filler types #1, #2, ..., #5.
- Four 4-gram count features: before taking the logarithm, the counts are smoothed by adding a fixed number (40) to all observed values.
- Nine (5 + 4) indicator features: each indicates that the corresponding pattern is not available because the it-position would cause the pattern to span beyond the current sentence.
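
A minimal sketch of turning pattern-filler counts into these feature values: the log of smoothed counts per filler type, plus an indicator when a pattern is unavailable. The counts and the grouping of filler types below are invented for illustration; in the paper the counts come from the Google Web 1T corpus.

```python
# Convert per-pattern filler counts into log-count and indicator features.
import math

SMOOTH = 40   # fixed constant added before taking the logarithm

def pattern_features(filler_counts):
    """filler_counts: one entry per pattern, either a dict mapping filler type
    -> corpus count, or None when the pattern spans past the sentence."""
    features = []
    for counts in filler_counts:
        if counts is None:
            features.extend([0.0] * 5 + [1.0])        # indicator: pattern missing
            continue
        for filler in ("it", "they", "other_pronoun", "noun", "other"):  # illustrative filler types
            features.append(math.log(counts.get(filler, 0) + SMOOTH))
        features.append(0.0)                          # indicator: pattern present
    return features

fake_counts = [None, None, {"it": 4000, "noun": 150}, {"it": 2500, "they": 30}, None]
print(pattern_features(fake_counts))
```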

28 Method: Classification
- Note that leaving the pattern counts unnormalized automatically allows patterns with higher counts to contribute more to the prediction of their associated instances.
- For classification, we use a maximum entropy model from the logistic regression package in Weka with all default parameter settings.
- Note that the maximum entropy classifier actually produces a probability of non-referentiality, which is thresholded at 50% to make a classification.
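
A sketch of this classification step, assuming scikit-learn's logistic regression in place of the Weka package named on the slide; the training vectors and labels are toy values for illustration.

```python
# Maximum entropy (logistic regression) over log-count features,
# thresholding the predicted probability of non-referentiality at 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[8.1, 7.9, 0.0], [3.7, 3.7, 1.0],
                    [8.4, 8.0, 0.0], [3.9, 4.1, 1.0]])   # toy log-count features
y_train = np.array([1, 0, 1, 0])                          # 1 = non-referential

model = LogisticRegression().fit(X_train, y_train)
p_nonref = model.predict_proba(np.array([[7.8, 7.5, 0.0]]))[0, 1]
label = "non-referential" if p_nonref >= 0.5 else "referential"
print(f"P(non-referential) = {p_nonref:.2f} -> {label}")
```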

29 Experiment: Corpora
- n-gram corpus: the Google Web 1T 5-gram Corpus. Tokens appearing less than 200 times have been mapped to the UNK symbol; only n-grams appearing more than 40 times are included.
- Experimental corpora: It-Bank, available online.
  - the dry-run and formal sets (129 instances) from MUC-7, merged into a single data set.
  - 1,020 instances annotated in a collection of Science News articles (from 1995-2000), downloaded from the Science News website.
  - 709 instances annotated in the WSJ portion of the DARPA TIPSTER Project (Harman, 1992).
  - 279 instances annotated in the English portion of the Europarl Corpus (Koehn, 2005).

30 Experiment: Corpora (cont.)
- Annotation: three annotators. The Kappa statistic, with P(E) computed from the confusion matrices, was a high 0.90 for A1-A2, and 0.79 and 0.81 for the other pairs, around the 0.80 considered to be good reliability.
- Data distribution: shown as a table on the slide.
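
A small worked sketch of the agreement statistic mentioned above: Cohen's kappa computed from a two-annotator confusion matrix, with P(E) derived from the matrix margins. The matrix values below are made up for illustration, not the paper's actual annotation counts.

```python
# Cohen's kappa from a confusion matrix between two annotators.
import numpy as np

def cohens_kappa(confusion):
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_observed = np.trace(confusion) / total              # P(A): observed agreement
    row_marg = confusion.sum(axis=1) / total
    col_marg = confusion.sum(axis=0) / total
    p_expected = float(np.dot(row_marg, col_marg))        # P(E): chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

# made-up counts: rows = annotator A1, columns = annotator A2
matrix = [[120, 8],
          [6, 66]]
print(round(cohens_kappa(matrix), 3))
```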

31 Experiment: Baseline systems
- LL: Lappin and Leass (1994). The patterns are robust to intervening words and modifiers (e.g. "it was never thought by the committee that...") provided the sentence is parsed correctly.
- MINIPL: this system (Cherry and Bergsma, 2005), also for Minipar, additionally detects instances of it labelled with Minipar's pleonastic category Subj.
- DISTRIB: the distributional method proposed in this paper.
- COMBO: adds the LL and MINIPL decisions as binary features to DISTRIB.

Lappin, S., & Leass, H. J. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4), 535-561.

32 Experiment: Results
- The difference between COMBO and DISTRIB is not statistically significant, while both are significantly better than the rule-based approaches.
- This provides strong motivation for a "light-weight" approach to non-referential it detection: one that does not require parsing or hand-crafted rules and is easily ported to new languages and text domains.

33 Experiment: Results
We can reduce data fragmentation by removing features:
- if we only use the length-4 patterns (4-grams) in COMBO (labelled COMBO4), performance increases dramatically on Europarl and MUC, while dipping slightly for the larger Sci-News and WSJ sets.
- selecting just the three most useful filler-type counts as features (#1, #2, #5) boosts F-Score on Europarl to 86.5%, 10% above the full COMBO system.

34 Discussion 1
- One key question is to what extent a limited context restricts identification performance.
- We first tested the importance of the pattern length by using only the length-4 counts in the DISTRIB system (Train/Test split). Surprisingly, the drop in F-Score was only one percent, to 74.8%. Using only the length-5 counts drops F-Score to 71.4%.
- Neither difference is statistically significant; however, there seem to be diminishing returns from longer context patterns.

35 Discussion 2
- Another way to view the limited context is to ask: given the amount of context we have, are we making optimum use of it? We answer this by seeing how well humans can do with the same information.
- Our system uses 5-gram context patterns, so we provide these same nine-token windows to our human subjects and ask them to decide whether the pronouns refer to previous noun phrases or not, based on these contexts.
- Subjects first performed a dry-run experiment on separate development data. They were shown their errors, and sources of confusion were clarified.

36 Discussion 2 & Error Analysis 1
- The human results show a range of preferences for precision versus recall, with both F-Score and Accuracy on average below the performance of COMBO.
- It is instructive to inspect the 25 Test-200 instances that the COMBO system classified incorrectly, given human performance on this same set. 17/25 COMBO errors were also made by one or more human subjects, suggesting system errors are also mostly due to limited context, e.g. "it takes an astounding amount of time to compare very long DNA sequences with each other."

37 Discussion 2 & Error Analysis 2
- 4/8 could have referred to entire sentences or clauses rather than nouns. These confusing cases, for both humans and our system, result from our definition of a referential pronoun: pronouns with verbal or clause antecedents are considered non-referential.
- If an antecedent verb or clause is replaced by a nominalization (Smith researched... to Smith's research), a referring pronoun, in the same context, becomes referential.
→ We see only a weak bias for the non-referential class on these examples, reflecting our classifier's uncertainty. It would likely be possible to improve accuracy on these cases by encoding the presence or absence of preceding nominalizations as a feature of our classifier.

38 Discussion 2 & Error Analysis 3
- Another false non-referential decision is for the phrase "... machine he had installed it on". The it is actually referential, but the extracted patterns (e.g. "he had install * on") are nevertheless usually filled with it.
→ This example also suggests using filler counts for the word "the" as a feature when it is the last word in the pattern.
→ It might be possible to fix such examples by leveraging the preceding discourse, such as the fact that the noun phrase before the context is the word "software."

39 Conclusion
- We have presented an approach to detecting nonreferential pronouns in text based on the distribution of the pronoun's context.
- We extract the surrounding textual context of the pronoun and gather, from a large corpus, the distribution of words that occur within that context.
- We learn to reliably classify these distributions as representing either referential or non-referential pronoun instances.
- Experimental results on classifying the English pronoun it show the system achieves the highest performance yet attained on this important task.

40 Comparison: Definition
(Papers: [1] Evans, 2001; [2] Muller, 2006; [3] Bergsma et al., 2008)
- nominal anaphoric: anaphoric [1]
- clause anaphoric: anaphoric [1]; non-anaphoric, part of extrapos it [2]; non-anaphoric [3]
- proaction: anaphoric [1]
- cataphoric: anaphoric (forward-search) [1]
- discourse topic: non-anaphoric [1]; anaphoric, part of vague [2]
- pleonastic: non-anaphoric [1]; non-anaphoric, part of prop-it [2]; non-anaphoric [3]
- idiomatic/stereotypic: non-anaphoric [1]; non-anaphoric, called other [2]; non-anaphoric [3]
- discarded: non-anaphoric [2]

41 Comparison: Features
- Evans (2001) [1]: position, context number, context lemmas, context POS, context pattern, context grammar
- Muller (2006) [2]: syntactic patterns, lexical features, distance features, oblique, seem list, discard
- Bergsma et al. (2008) [3]: 5-gram count features, 4-gram count features, indicator features

42 Comparison: Method and Result
- Evans (2001) [1]: k-NN (k = 15) with gain ratio as the distance metric; non-anaphoric (pleonastic) precision 73.38, recall 69.25
- Muller (2006) [2]: JRip, the WEKA reimplementation of Ripper; non-anaphoric precision 80, recall 60.9
- Bergsma et al. (2008) [3]: n-gram context features + maximum entropy model; non-anaphoric precision 81.3, recall 73.4

