
Learning Event Durations from Event Descriptions. Feng Pan, Rutu Mulkar, Jerry R. Hobbs, University of Southern California. ACL ’06.


1 Learning Event Durations from Event Descriptions. Feng Pan, Rutu Mulkar, Jerry R. Hobbs, University of Southern California. ACL ’06

2 Introduction Example: George W. Bush met with Vladimir Putin in Moscow. How long was the meeting? Most people would say the meeting lasted between an hour and three days. This research is potentially very important for applications in which the time course of events is to be extracted from news.

3

4 Inter-Annotator Agreement The kappa statistic (Krippendorff, 1980; Carletta, 1996) has become the de facto standard to assess inter-annotator agreement. It is computed as κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement among the annotators and P(E) is the expected agreement, i.e., the probability that the annotators agree by chance.
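As a small illustration of the kappa statistic, the sketch below uses its standard definition, κ = (P(A) − P(E)) / (1 − P(E)). The function name `kappa` and the observed-agreement value in the usage line are hypothetical and not taken from the slides; only P(E) ≈ 0.15 appears in the presentation.

```python
def kappa(p_observed: float, p_expected: float) -> float:
    """Kappa statistic: agreement beyond chance, normalised by the
    maximum possible agreement beyond chance."""
    if p_expected >= 1.0:
        raise ValueError("expected agreement must be below 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical observed agreement, combined with an expected agreement of 0.15.
example_kappa = kappa(0.575, 0.15)
```

A kappa of 0 means the annotators agree no more often than chance, and a kappa of 1 means perfect agreement.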

5 What Should Count as Agreement? Determining what should count as agreement is not only important for assessing inter-annotator agreement, but is also crucial for evaluation. We use the normal distribution (i.e., Gaussian distribution) to model our duration distributions.

6 What Should Count as Agreement? If the area between the lower and upper bounds covers 80% of the entire distribution area, the bounds are each 1.28 standard deviations from the mean. With this data model, the agreement between two annotations can be defined as the overlapping area between two normal distributions. The agreement among many annotations is the average of all the pairwise overlapping areas.
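The overlap-based agreement can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: it assumes durations are given in seconds, that the normal distribution is fitted on the natural-log scale with the bounds at ±1.28 standard deviations, and that the overlap is integrated numerically with a trapezoid rule. All function names are hypothetical.

```python
import itertools
import math

Z_80 = 1.28  # bounds covering 80% of a normal lie 1.28 sd from the mean


def to_normal(lower_s: float, upper_s: float) -> tuple[float, float]:
    """Map a (lower, upper) duration in seconds to (mean, sd) on the log scale."""
    if not 0 < lower_s < upper_s:
        raise ValueError("need 0 < lower < upper")
    lo, hi = math.log(lower_s), math.log(upper_s)
    return (lo + hi) / 2.0, (hi - lo) / (2.0 * Z_80)


def pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def overlap(a: tuple[float, float], b: tuple[float, float], steps: int = 20000) -> float:
    """Area under min(pdf_a, pdf_b), integrated with the trapezoid rule."""
    (m1, s1), (m2, s2) = to_normal(*a), to_normal(*b)
    left = min(m1 - 8 * s1, m2 - 8 * s2)
    right = max(m1 + 8 * s1, m2 + 8 * s2)
    h = (right - left) / steps
    total = 0.0
    for i in range(steps + 1):
        x = left + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * min(pdf(x, m1, s1), pdf(x, m2, s2))
    return total * h


def mean_pairwise_overlap(annotations: list[tuple[float, float]]) -> float:
    """Average overlap over all pairs of annotations of the same event."""
    pairs = list(itertools.combinations(annotations, 2))
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


HOUR, DAY = 3600.0, 86400.0
# Hypothetical annotations of one event by three annotators.
agreement = mean_pairwise_overlap([(HOUR, 3 * DAY), (2 * HOUR, DAY), (HOUR, 2 * DAY)])
```

Identical annotations give an overlap close to 1, and annotations whose ranges are far apart give an overlap close to 0, so the value can be used directly as the observed agreement P(A).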

7 A logarithmic scale is used for the output

8 Expected Agreement

9 There are two peaks in this distribution. One is from 5 to 7 in the natural logarithmic scale, which corresponds to about 1.5 minutes to 30 minutes. The other is from 14 to 17 in the natural logarithmic scale, which corresponds to about 8 days to 6 months. We also compute the distribution of the widths (i.e., X_upper − X_lower) of all the annotated durations.

10

11 Expected Agreement Two different methods were used to compute the expected agreement (baseline), both yielding nearly equal results. These are described in detail in (Pan et al., 2006). For both, P(E) is about 0.15.

12 Features: Local Context, Syntactic Relations, WordNet Hypernyms

13 Local Context A window of n tokens to the left and right of the event is used. The best n, determined via cross-validation, turned out to be 0, i.e., the event itself with no local context. We also present results for n = 2 to evaluate the utility of local context. For each token, three features are included: its original form, its lemma (or root form), and its part-of-speech (POS) tag.
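A minimal sketch of the local-context features described above, assuming each token is already available as a (form, lemma, POS) triple. The lemmas and POS tags in the example sentence are assigned by hand for illustration, and the `<PAD>` placeholder for positions outside the sentence is an assumption, since the slides do not say how sentence boundaries are handled.

```python
Token = tuple[str, str, str]  # (original form, lemma, POS tag)

PAD: Token = ("<PAD>", "<PAD>", "<PAD>")  # hypothetical boundary filler


def local_context_features(tokens: list[Token], event_index: int, n: int) -> list[str]:
    """Return form, lemma and POS for the event and n tokens on each side."""
    features: list[str] = []
    for i in range(event_index - n, event_index + n + 1):
        form, lemma, pos = tokens[i] if 0 <= i < len(tokens) else PAD
        features.extend([form, lemma, pos])
    return features


# Sentence from the introduction; tags are hand-assigned, not parser output.
sentence: list[Token] = [
    ("George", "George", "NNP"), ("W.", "W.", "NNP"), ("Bush", "Bush", "NNP"),
    ("met", "meet", "VBD"), ("with", "with", "IN"),
    ("Vladimir", "Vladimir", "NNP"), ("Putin", "Putin", "NNP"),
    ("in", "in", "IN"), ("Moscow", "Moscow", "NNP"), (".", ".", "."),
]

event_only = local_context_features(sentence, 3, 0)   # n = 0: the event itself
with_window = local_context_features(sentence, 3, 2)  # n = 2: two tokens each side
```

With n = 0 the vector holds only the three features of the event word; with n = 2 it holds 15 values, three for each of the five tokens in the window.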

14 Local Context

15 Syntactic Relations For a given event, both the head of its subject and the head of its object are extracted from the parse trees generated by the CONTEX parser. In sentence (1), the head of the event's subject is “presidents” and the head of its object is “plan”, so the feature vector is [presidents, president, NNS, plan, plan, NN].

16 WordNet Hypernyms Events with the same hypernyms may have similar durations. Hypernyms are only extracted for the events and their subjects and objects, not for the local context words. A word sense disambiguation module might improve the learning performance. But since the features we need are the hypernyms, not the word sense itself, the hypernyms of the first word sense can still be good enough in many cases, even when that sense is not the correct one.

17 WordNet Hypernyms

18 Experiments The corpus that we have annotated currently contains all 48 non-Wall-Street-Journal (non-WSJ) news articles (a total of 2132 event instances), as well as 10 WSJ articles (156 event instances), from the TimeBank corpus annotated in TimeML (Pustejovsky et al., 2003). The non-WSJ articles (mainly political and disaster news) include both print and broadcast news from a variety of news sources, such as ABC, AP, and VOA.

19 Experiments Annotators were instructed to provide lower and upper bounds on the duration of the event, encompassing 80% of the possibilities, and taking the entire context of the article into account. In our first machine learning experiment, we tried to learn this coarse-grained event duration information as a binary classification task.

20 Data For each event annotation, the most likely (mean) duration is calculated first by averaging (the logs of) its lower and upper bound durations. If its most likely (mean) duration is less than a day (about 11.4 in the natural logarithmic scale), it is assigned to the “ short ” event class, otherwise it is assigned to the “ long ” event class.
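The labelling rule above can be sketched in a few lines, assuming the bounds are given in seconds (which is consistent with one day being about 11.4 on the natural-log scale). The function names are hypothetical.

```python
import math

ONE_DAY_LOG = math.log(24 * 60 * 60)  # ln(86400), about 11.37


def mean_log_duration(lower_s: float, upper_s: float) -> float:
    """Most likely duration: the average of the logs of the two bounds."""
    return (math.log(lower_s) + math.log(upper_s)) / 2.0


def duration_class(lower_s: float, upper_s: float) -> str:
    """'short' if the most likely duration is under a day, otherwise 'long'."""
    return "short" if mean_log_duration(lower_s, upper_s) < ONE_DAY_LOG else "long"


HOUR, DAY = 3600, 86400
meeting = duration_class(HOUR, 3 * DAY)  # an hour to three days
visit = duration_class(DAY, 7 * DAY)     # a day to a week (hypothetical event)
```

Averaging the logs amounts to taking the geometric mean of the two bounds, so a meeting annotated as lasting an hour to three days has a most likely duration of roughly 8.5 hours and falls in the “short” class.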

21 Data We divide the total annotated non-WSJ data (2132 event instances) into two data sets: a training data set with 1705 event instances (about 80% of the total non-WSJ data) and a held-out test data set with 427 event instances (about 20% of the total non-WSJ data). The WSJ data (156 event instances) is kept for further testing.

22 Learning Algorithms Three learning algorithms are used: Support Vector Machines (SVM), Naïve Bayes (NB), and Decision Trees (C4.5). Since 59.0% of the total data is “long” events, the baseline performance (always predicting “long”) is 59.0%.

23 Experimental Results (non-WSJ)

24 Feature Evaluation

25 We can see that most of the performance comes from the event word itself. A significant improvement above that is due to the addition of “Syn”. Local context and hypernyms do not seem to help. In the “Syn+Hyper” cases, the learning algorithm with and without local context gives identical results, probably because the other features dominate.

26 Experimental Results (WSJ) The precision (75.0%) is very close to the test performance on the non-WSJ data.

27 Learning the Most Likely Temporal Unit There are seven classes (second, minute, hour, day, week, month, and year). However, human agreement on this more fine-grained task is low (44.4%), so “approximate agreement” is computed for the most likely temporal unit of events. In approximate agreement, temporal units are considered to match if they are the same temporal unit or an adjacent one.
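Approximate agreement over the seven temporal units can be sketched as below. The function names and the label pairs in the example are hypothetical; only the unit ordering and the "same or adjacent" rule come from the slide.

```python
UNITS = ["second", "minute", "hour", "day", "week", "month", "year"]


def exact_match(a: str, b: str) -> bool:
    """Two labels match only if they are the same temporal unit."""
    return a == b


def approximate_match(a: str, b: str) -> bool:
    """Two labels match if they are the same unit or neighbours on the scale."""
    return abs(UNITS.index(a) - UNITS.index(b)) <= 1


def agreement_rate(pairs: list[tuple[str, str]], match) -> float:
    """Fraction of label pairs that count as matching under `match`."""
    return sum(match(a, b) for a, b in pairs) / len(pairs)


# Hypothetical label pairs from two annotators.
pairs = [("hour", "hour"), ("week", "month"), ("day", "month"), ("minute", "hour")]
exact_rate = agreement_rate(pairs, exact_match)         # only the first pair matches
approx_rate = agreement_rate(pairs, approximate_match)  # first, second and fourth match
```

Under this rule a prediction of “month” also counts as correct for “week” and “year”, which is why a single middle class can cover several adjacent classes at once.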

28 Learning the Most Likely Temporal Unit Human agreement rises to 79.8% under approximate agreement. Since the “week”, “month”, and “year” classes together take up the largest portion (51.5%) of the data, the baseline always predicts the “month” class, which is adjacent to both “week” and “year”.

29 Conclusion We have addressed the problem of extracting information about event durations encoded in event descriptions. We have described a method for measuring inter-annotator agreement when the judgments are intervals on a scale. We have shown that machine-learning techniques achieve impressive results on this task.

