Semi-automatic Annotation of the Romanian TimeBank 1.2 (CALP07, RANLP)


Corina Forăscu, Radu Ion, Dan Tufiş
Faculty of Computer Science, Al.I. Cuza University of Iasi, Romania & Research Institute for Artificial Intelligence of the Romanian Academy
{radu,

Outline
1. Fundamentals
2. TimeML & TimeBank
3. Corpus processing
   1. Translation
   2. Pre-processing
   3. Alignment
   4. Annotation import
4. Conclusions

Fundamentals
Temporal information in natural language:
1. Time-denoting expressions: references to a calendar or clock system, expressed by NPs, PPs, or AdvPs.
   Examples: the 23rd of May, 1998; Monday; tomorrow; the second semester.
2. Event-denoting expressions: references to an event, expressed by
   1. sentences, more precisely their syntactic head, the main verb: John listens to the music.
   2. noun phrases: Israel will ask the USA to delay a military strike against Iraq.

Motivation (1)
NLP applications that benefit:
- lexicon induction and linguistic investigation, using very large annotated corpora;
- question answering (questions such as when, how often, or how long);
- information extraction and information retrieval;
- machine translation (translated and normalized temporal references; mappings between the different behavior of tenses from language to language);
- discourse processing: temporal structure of discourse, and summarization.

Motivation (2)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.
Now he realised that it was exactly because of this incident that he had suddenly decided to come home and begin his journal exactly today.

Motivation (3)
Acum îşi dădea seama că tocmai din cauza acestui incident se hotărâse el brusc să vină acasă şi să-şi înceapă jurnalul taman astăzi.

State of the Art
1947: Reichenbach: The tenses of verbs
1998: MUC TIMEX
2001: ACL: Temporal and Spatial Information Processing; STAG (Setzer); TIDES 2001: TIMEX2 v
2002: LREC 2002 Annotation Standards for Temporal Information in Natural Language; DAML-Time; TERQAS: TimeML v.1.0
2004: ACE-TERN: TIMEX2 v.1.1; TARSQI: TimeML v
2005: ACE-TERN: TIMEX2 v.1.2; ACL 2005: TARSQI system
2006: ACL-COLING WS: ARTE (Annotating and Reasoning about Time and Events); Time Symposium

TERQAS
TimeML v.1.0: a metadata standard for marking events, their temporal anchoring, and links in news articles
+ TimeBank corpus v
+ guidelines for temporal annotation

TimeML v.1.2
A metadata standard developed especially for news articles, for marking:
- events: EVENT, MAKEINSTANCE
- temporal anchoring of events: TIMEX3, SIGNAL
- links between events and/or timexes: TLINK, ALINK, SLINK

Events (1)
- Situations that happen or occur; states or circumstances in which something obtains or holds true.
- Expressed by tensed verbs, adjectives, nominalizations.
Example: The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share.
7 classes of EVENTs: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE, I_ACTION

Events (2)
The oat-bran craze [e190] has cost [e189] the world's largest cereal maker market share.
Analysts say [e28] much of Kellogg's erosion [e204] has been in such core brands as Corn Flakes,...

Instances
- Based on the event annotation: how many different instances or realizations a given event has (at least one).
- An instance carries the tense and aspect of the verb-denoted event.
Example: John learns [e1] twice on Monday.
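
The event/instance distinction can be sketched in Python. This is only an illustrative data model, not TimeML's XML serialization: the class names `Event` and `Instance`, the helper `instances_of`, and the attribute values (event class, tense, aspect) are assumptions chosen for the example sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    eid: str    # event ID, e.g. "e1"
    text: str   # the event-denoting word
    cls: str    # one of the seven EVENT classes (assumed value below)


@dataclass(frozen=True)
class Instance:
    eiid: str     # instance ID, e.g. "ei1"
    event: Event  # the event this instance realizes
    tense: str    # tense and aspect live on the instance, not on the event
    aspect: str


# "John learns twice on Monday": one event, two realizations of it.
learns = Event(eid="e1", text="learns", cls="OCCURRENCE")
instances = [
    Instance(eiid=f"ei{i}", event=learns, tense="PRESENT", aspect="NONE")
    for i in (1, 2)
]


def instances_of(event, all_instances):
    """Return every instance that realizes the given event."""
    return [inst for inst in all_instances if inst.event == event]
```

Every event has at least one instance, so `instances_of` should never return an empty list for an annotated event.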

Temporal expressions: TIMEX3 (1)
Explicit & implicit temporal expressions:
- Times: 11 o'clock; midnight
- Dates:
  - Fully specified (May 23, 2006; winter, 2005)
  - Underspecified (Monday; next week; last month; two years ago)
- Durations: two months; three hours
- Sets: every week; every Tuesday

Temporal expressions: TIMEX3 (2)
Examples: 10/30/89; the next two years or so; soon

Temporal signals: SIGNAL
Function words that indicate how temporal objects are to be related to each other:
- temporal prepositions, conjunctions and/or modifiers: on, in, at, from, to, before, after, during; before, after, while, when
- negative expressions
- modal verbs
- prepositions signaling modality ("to")
- special characters denoting ranges in temporal expressions: "-" and "/"

Dependencies: LINKs
- Temporal relations: TLINK
  - Anchors to time
  - Orders between times and events
- Aspectual relations: ALINK
  - Phases of an event
- Subordinating relations: SLINK
  - Events that syntactically subordinate other events

Temporal relations: TLINK (1)
- A temporal relation between two temporal elements (event-event, event-timex).
- EVENTs take part through their INSTANCEs.
- 13 relTypes, as in Allen's interval relations:
  - Simultaneous
  - Identical
  - One before (/after) the other
  - One immediately before (/after) the other
  - One including / being included in the other
  - One holding during the duration of the other
  - One being the beginning (/ending) of the other
  - One being begun (/ended) by the other
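
Most of the relation types listed above come in converse pairs, so a link stated in one direction can be rewritten in the other. The following is a minimal sketch of that idea. The upper-case relType spellings are assumed to match the TimeML values and should be checked against the specification; the `(source, relType, target)` triple and the function `invert` are hypothetical conveniences, not part of TimeML.

```python
# Converse pairs among the 13 relation types (assumed relType spellings).
INVERSE = {
    "BEFORE": "AFTER",
    "IBEFORE": "IAFTER",        # immediately before / immediately after
    "INCLUDES": "IS_INCLUDED",
    "BEGINS": "BEGUN_BY",
    "ENDS": "ENDED_BY",
}
# Add the reverse direction of every pair.
INVERSE.update({v: k for k, v in INVERSE.items()})
# Symmetric relations are their own converse.
INVERSE["SIMULTANEOUS"] = "SIMULTANEOUS"
INVERSE["IDENTITY"] = "IDENTITY"

# DURING has no converse among the 13 types listed, so it is kept separate.
REL_TYPES = sorted(set(INVERSE) | {"DURING"})


def invert(tlink):
    """Rewrite a (source, relType, target) triple in the opposite direction."""
    source, rel, target = tlink
    if rel not in INVERSE:
        raise ValueError(f"no converse relation recorded for {rel}")
    return (target, INVERSE[rel], source)
```

For instance, `invert(("ei1", "BEFORE", "ei2"))` yields `("ei2", "AFTER", "ei1")`; the instance IDs here are made up.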

Temporal relations: TLINK (2)
The oat-bran craze [e190/ei1994] has cost [e189/ei1995] the world's largest cereal maker market share. The company's president quit [e3/ei1996] suddenly.
[Diagram: craze (ei1994), cost (ei1995), quit (ei1996), 10/30/89 (t192)]

Temporal relations: TLINK (3)
[Diagram: craze (ei1994), cost (ei1995), quit (ei1996), 10/30/89 (t192)]

Aspectual relations: ALINK
Relationship between an aspectual event and its argument event:
- Initiation: John started [ei5] to read [ei6].
- Culmination: John finished [ei5] assembling [ei6] the table.
- Termination: John stopped talking.
- Continuation: John kept talking.

Subordination relations: SLINK
For contexts introducing relations between two events, of type:
- Modal: John should have bought some wine.
- Factive: John forgot that he was in Boston yesterday.
- Counterfactive: John prevented the divorce.
- Evidential: John said he bought some wine.
- Negative evidential: John denied he bought only beer.
- Conditional: If John leaves today, Mary will cry.

TimeBank
- English news report documents, TimeML annotated, distributed through LDC
- 4715 sentences with unique lexical units, from a total of lexical units
- Non-TimeML markup in TimeBank 1.1:
  - structure information: header
  - named entity recognition:,,
  - sentence boundary information:

TimeBank 1.2
events      7935
instances   7940
timexes     1414
signals      688
alinks       265
slinks      2932
tlinks      6418
TOTAL      27592

Translation
2 "trained translators"; one final correction
Translation desiderata:
- 1-1 sentence alignment
- POS preserved
- Verb tense mapped onto Romanian
- Format of dates, moments of the day, and numbers conforms to the norms of written Romanian
4715 sentences (translation units), lexical tokens, including punctuation marks, representing lexical types

Preprocessing the corpus
- Tokenisation: MtSeg, with idiomatic expressions and clitic splitting
- POS-tagging: TnT, adapted and improved to determine the POS of unknown words
- Lemmatisation: probabilistic, based on a lexicon
- Chunking: REs over POS tags to determine non-recursive NPs, APs, AdvPs, PPs
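
The chunking step, regular expressions over POS tags, can be illustrated with a small Python sketch. It is not the chunker used for the corpus: the tag names, the one-letter codes in `CODE`, the `NP` pattern, and the function `np_chunks` are all assumptions made for the example. Each tag is mapped to a single character so that regex match offsets coincide with token positions.

```python
import re

# Hypothetical one-letter codes for a toy tag set (not the corpus tag set).
CODE = {"DET": "D", "ADJ": "A", "NOUN": "N", "PREP": "P", "VERB": "V", "ADV": "R"}

# A non-recursive NP: optional determiner, adjectives, one or more nouns,
# optional trailing adjectives. No nesting, so a plain regex suffices.
NP = re.compile(r"D?A*N+A*")


def np_chunks(tagged):
    """Return the token lists of all non-recursive NPs in a tagged sentence.

    tagged is a list of (word, tag) pairs; unknown tags map to "X",
    which never matches the NP pattern.
    """
    codes = "".join(CODE.get(tag, "X") for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in NP.finditer(codes)
    ]


tagged = [
    ("the", "DET"), ("largest", "ADJ"), ("cereal", "NOUN"), ("maker", "NOUN"),
    ("has", "VERB"), ("cost", "VERB"), ("market", "NOUN"), ("share", "NOUN"),
]
```

On the toy sentence, `np_chunks(tagged)` groups "the largest cereal maker" and "market share" as two separate chunks. Patterns for APs, AdvPs, and PPs would follow the same scheme with their own expressions.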

Alignment
YAWA: 4 stages, evaluated over the data in the Shared Task on Word Alignment, Romanian-English track, organized at ACL 2005
Current: P = 88.80%, R = 74.83%, F = 81.22%
alignments, manually checked, out of which are NULL-alignments

Alignment
1. Content-word alignment: based on the translation lexicons; P = 94.08%, R = 34.99%, F = 51.00%.
2. Inside-chunks alignment: simple empirical rules to align the words within the corresponding chunks; P = 89.90%, R = 53.90%, F = 67.40%.
3. Alignment in contiguous sequences of unaligned words: using the POS-affinities of the unaligned words and their relative positions.
4. Correction phase: the wrong links, introduced mainly in stage 3, are removed.
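
The reported F values are consistent, up to rounding, with the balanced F-measure, the harmonic mean of precision and recall. That the slides use this balanced form is an assumption. The function name `f_measure` is introduced here only for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Both arguments use the same scale (here percentages), and the
    result is on that scale too.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs reported for the full aligner and for stages 1 and 2.
reported = {
    "current": (88.80, 74.83),
    "stage 1": (94.08, 34.99),
    "stage 2": (89.90, 53.90),
}
scores = {name: f_measure(p, r) for name, (p, r) in reported.items()}
```

For the current system this gives about 81.22, matching the slide. For stages 1 and 2 the recomputed values differ from the slide figures only in the second decimal, which is expected when P and R are themselves rounded.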

Alignment
The parallel corpus = 183 files in XCES format
Example pair:
Pe_de_altă_parte, se dovedeşte a fi altă săptămână financiară foarte proastă …
On_the_other_hand, it 's turning out to be another very bad financial week …

Annotation import
Based on the Romanian-English lexical alignment

Annotation import
For every pair of sentences Sro and Sen from the TimeBank parallel corpus, together with Ten, the equivalent English sentence:
1. Construct a list E of pairs of English text fragments with sequences of English indexes from Sen and Ten. E = {,,,,, }.

Annotation import
2. Add to every element of E the XML context in which that text fragment appeared in the original English TimeBank. E' = {,, …}
3. Construct the list RW of Romanian words along with the transferred XML contexts, using E' and the lexical alignment between Sro and Sen. If a word in Sro is not aligned, the top context for it, namely s, is used. RW = {,, …}.

Annotation import
4. Construct the final list R of Romanian text fragments from RW by conflating adjacent elements of RW that appear in the same XML context. Output the list in XML format.
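
Steps 3 and 4 can be sketched in Python under simplifying assumptions. Here an XML context is represented as a plain string path such as `"s/TIMEX3#t1"`, the alignment is a dictionary from Romanian to English token indexes, and each Romanian word has at most one English counterpart. The function names `project_contexts` and `conflate` and the sample data are hypothetical; the actual data structures of the system are not given in the slides.

```python
from itertools import groupby


def project_contexts(ro_words, en_contexts, alignment, top="s"):
    """Step 3: pair each Romanian word with an XML context.

    en_contexts[j] is the context of the j-th English token;
    alignment maps a Romanian index to an English index.
    Unaligned words receive the top context (the sentence element).
    """
    rw = []
    for i, word in enumerate(ro_words):
        j = alignment.get(i)
        context = en_contexts[j] if j is not None else top
        rw.append((word, context))
    return rw


def conflate(rw):
    """Step 4: merge adjacent words that share the same XML context."""
    return [
        (" ".join(word for word, _ in group), context)
        for context, group in groupby(rw, key=lambda pair: pair[1])
    ]


# Toy data: English "in the next two years" and a Romanian equivalent.
en_contexts = ["s", "s/TIMEX3#t1", "s/TIMEX3#t1", "s/TIMEX3#t1", "s/TIMEX3#t1"]
ro_words = ["în", "următorii", "doi", "ani"]
alignment = {0: 0, 1: 2, 2: 3, 3: 4}

fragments = conflate(project_contexts(ro_words, en_contexts, alignment))
```

On the toy data the three Romanian words aligned to the English temporal expression end up as one fragment in the `TIMEX3` context. Keeping the element ID in the context string prevents two neighbouring but distinct annotations of the same type from being merged.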

Annotation import
Offline markup (MAKEINSTANCE, ALINK, TLINK and SLINK tags): the transfer kept only those XML tags from the English version whose IDs belong to XML structures that have been transferred to Romanian.
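
One possible reading of this rule, sketched in Python: an instance is kept if its event was transferred, and a link is kept if everything it refers to is either a transferred structure or a kept instance. The dictionary layout, the `refs` key, and the function `filter_offline` are assumptions made for illustration; only the attribute names `eiid` and `eventID` are meant to echo TimeML.

```python
def filter_offline(instances, links, transferred_ids):
    """Keep only offline tags that refer to structures present in Romanian.

    instances: MAKEINSTANCE-like dicts with "eiid" and "eventID".
    links: link-like dicts with "lid" and "refs" (the IDs they point to).
    transferred_ids: IDs of inline structures (events, timexes, signals)
    that survived the transfer.
    """
    kept_instances = [mi for mi in instances if mi["eventID"] in transferred_ids]
    # Links point at instances and timexes, so both count as known IDs.
    known = set(transferred_ids) | {mi["eiid"] for mi in kept_instances}
    kept_links = [ln for ln in links if all(ref in known for ref in ln["refs"])]
    return kept_instances, kept_links


# Toy data: event e2 was lost in transfer, so its instance and link go too.
instances = [
    {"eiid": "ei1", "eventID": "e1"},
    {"eiid": "ei2", "eventID": "e2"},
]
links = [
    {"lid": "l1", "refs": ["ei1", "t1"]},
    {"lid": "l2", "refs": ["ei1", "ei2"]},
]
kept_instances, kept_links = filter_offline(instances, links, {"e1", "t1"})
```

Under this reading, losing a single inline tag removes every offline tag that depends on it, which is one way transfer rates for links could fall below those for events and timexes.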

Annotation import
[Table: percentage of TimeML tags transferred, by tag type: events, instances, timexes, signals, alinks, slinks, tlinks, TOTAL]

Conclusions & future work
- improve and evaluate the annotation transfer
- assess the adequacy of temporal theories for Romanian
- (semi-)automatic mark-up of the temporal information in Romanian texts (news + literature)

Thank you! (Temporal) Questions?