A Bilingual Corpus of Inter-linked Events Tommaso Caselli♠, Nancy Ide ♣, Roberto Bartolini ♠ ♠ Istituto di Linguistica Computazionale – ILC-CNR Pisa ♣

Presentation transcript:

A Bilingual Corpus of Inter-linked Events Tommaso Caselli♠, Nancy Ide ♣, Roberto Bartolini ♠ ♠ Istituto di Linguistica Computazionale – ILC-CNR Pisa ♣ Department of Computer Science, Vassar College, USA LREC 08 – Marrakech, 30 May 2008

Outline
- Motivations
- A (gentle) introduction to TimeML, TimeBank and Italian TimeBank Corpora
- The Bilingual Corpus: linking events in TimeBank & Italian TimeBank by means of the Inter-Lingual Index
- Evaluation and Experiments:
  - similar events in the two corpora;
  - the ILI: a bootstrapping device for creating comparable corpora
- Conclusion

Motivations
- Retrieving the temporal relations between events in texts is required to improve the performance of I.R. and Open Domain Q.A. systems.
- One of the most challenging tasks is event identification:
  - can we facilitate event recognition by linking two corpora in two different languages, comparable in size, content and annotation, by means of the Inter-Lingual Index (ILI), which links IWN & WN?
  - are events encoded in the same way in Italian and English?
  - can we import layers of annotation from one corpus into another in a different language by exploiting the ILI?

TimeML, TimeBank & Italian TimeBank
TimeML (Pustejovsky et al., 2003) is a specification language for annotating the core elements of a temporal framework:
- temporal expressions (TIMEX3); e.g.: December 1st, three years;
- a wide range of linguistic expressions (verbs, nouns, nominalizations, stative adjectives...) realizing eventualities (EVENT), i.e. events and states, which are classified into 7 classes (ASPECTUAL, REPORTING, I_ACTION, I_STATE, PERCEPTION, STATE, OCCURRENCE) according to semantic and syntactic criteria;
- connectives and temporal prepositions (SIGNAL), which make explicit the relation holding between two entities;
- dependencies between events (TLINK, SLINK, ALINK) and between events and times (TLINK).

TimeML, TimeBank & Italian TimeBank (2)
TimeBank 1.2:
- first available corpus annotated with TimeML;
- 183 news articles from different sources, including the Penn TreeBank2 Wall Street Journal, for a total of 61K words;
- 7,935 events; K=0.81 on partial match for event identification & K=0.67 for event classification.
Italian TimeBank:
- Italian corpus comparable in size (62K words), content and annotation to TB 1.2 (171 articles from the Italian TreeBank and the PAROLE corpus);
- under development: >13K words annotated, 1,755 events;
- customization of TimeML to Italian (ISO-TimeML): imperfect value for TENSE; two new attributes, V_FORM & MOOD, for the EVENT tag; modification of tag text span;
- mapping of the 7 TimeML event classes to the SIMPLE Ontology to improve event classification (K=0.84).

The Bilingual Corpus: Linking Events
- Linkage between the TimeBank (TB) & Italian TimeBank is accomplished through the Inter-Lingual Index (ILI), developed in the EuroWordNet Project (1999).
- The ILI is effectively an unstructured version of WN, used as a "hub" through which WN synsets are associated with synsets in the WNs of other languages.
- In IWN the ILI is augmented with several semantic relations, such as eq_synonym, eq_hyperonym, eq_cause..., which give specific information on the relations between English and Italian synsets.
Sense annotation figures (as listed on the slide):
- 1,835 events (1,777 verbs & 658 nominalizations); manual annotation of WN 2.0 senses by 2 native speakers; 91% inter-annotator agreement;
- 1,686 events;
- 1,253 events (778 verbs & 462 nominalizations and nouns); semi-automatic annotation of IWN senses.

The Bilingual Corpus: Linking Events (2)
[Diagram: WN 2.0 and IWD 1.5 are connected by an auto-generated mapping from WN 2.0 to IWD 1.5. The "augmented" TimeBank carries a SENSE from WN 2.0 and the Italian TimeBank carries a SENSE from IWN; each WN sense and each IWN sense points to an ILI record, and matching records give the ILI link.]
- The ILI link is automatically determined and restricted to the eq_synonym and eq_near_synonym relations, so it covers only events with exactly or approximately the same meaning.
- 1,103 events in TB with 115 event synsets & 1,250 events in Italian TB with 653 event synsets.
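The linking step can be illustrated with a minimal sketch. It assumes each WN sense maps to one ILI record and each IWN sense is attached to an ILI record through a typed equivalence relation. The ILI identifiers, the data layout, the function name `link_senses` and the relation types assigned to individual pairs are hypothetical and chosen only for illustration; the sense pairs reuse examples mentioned in the slides.

```python
from collections import defaultdict

# Hypothetical toy data: (IWN sense, ILI record id, equivalence relation).
# The ILI ids and most relation types are made up for illustration.
IWN_TO_ILI = [
    ("cercare#2", "ili-0001", "eq_synonym"),
    ("intesa#3", "ili-0002", "eq_synonym"),
    ("accordo#3", "ili-0002", "eq_synonym"),
    ("segnale#1", "ili-0003", "eq_near_synonym"),
    ("movimento#4", "ili-0004", "eq_hyperonym"),  # dropped by the filter
]

# Hypothetical toy data: WN 2.0 sense -> ILI record id.
WN_TO_ILI = {
    "seek#3": "ili-0001",
    "agreement#1": "ili-0002",
    "indication#1": "ili-0003",
    "movement#4": "ili-0004",
}

# Only these relations count as "exactly or approximately the same meaning".
ALLOWED = {"eq_synonym", "eq_near_synonym"}


def link_senses(wn_to_ili, iwn_to_ili, allowed=ALLOWED):
    """Map each WN sense to the IWN senses that share its ILI record."""
    by_ili = defaultdict(list)
    for it_sense, ili, relation in iwn_to_ili:
        if relation in allowed:
            by_ili[ili].append(it_sense)
    return {
        en_sense: sorted(by_ili[ili])
        for en_sense, ili in wn_to_ili.items()
        if ili in by_ili
    }


links = link_senses(WN_TO_ILI, IWN_TO_ILI)
for en_sense, it_senses in links.items():
    print(en_sense, "->", it_senses)
```

Filtering on the relation type before building the index keeps looser relations such as eq_hyperonym out of the bilingual links altogether, which mirrors the restriction described on the slide.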

Evaluation: Similar Events
- To what extent is the introduction of WN senses useful for event identification?
- Verify the Semantic Homogeneity Hypothesis: events with (almost) the same meaning are assigned the same TimeML class, i.e. they are semantically homogeneous.
- Automatic extraction of all events (nouns and verbs) with the same ILI from both corpora:
  - 56 common event synsets: data sparseness;
  - 35 common event synsets for verbs vs. 11 common event synsets for nouns.
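A minimal sketch of how the hypothesis could be checked once both corpora carry ILI identifiers: collect the TimeML classes observed for each ILI record in each corpus, keep the records found in both, and compare the class sets. The event records, ILI ids and function names below are hypothetical; only the seek/cercare class mismatch is taken from the slides.

```python
from collections import defaultdict


def classes_by_ili(events):
    """Group the TimeML classes observed for each ILI record."""
    grouped = defaultdict(set)
    for event in events:
        grouped[event["ili"]].add(event["cls"])
    return grouped


def homogeneity_report(tb_events, it_events):
    """For every ILI record found in both corpora, report whether the
    two corpora assign the same set of TimeML classes."""
    tb = classes_by_ili(tb_events)
    it = classes_by_ili(it_events)
    shared = sorted(tb.keys() & it.keys())
    return {ili: tb[ili] == it[ili] for ili in shared}


# Hypothetical toy events; ILI ids are invented for illustration.
tb_events = [
    {"lemma": "seek", "ili": "ili-0001", "cls": "I_ACTION"},
    {"lemma": "say", "ili": "ili-0010", "cls": "REPORTING"},
    {"lemma": "rise", "ili": "ili-0011", "cls": "OCCURRENCE"},
]
it_events = [
    {"lemma": "cercare", "ili": "ili-0001", "cls": "I_STATE"},
    {"lemma": "dire", "ili": "ili-0010", "cls": "REPORTING"},
]

report = homogeneity_report(tb_events, it_events)
support = sum(report.values()) / len(report)
print(report, f"{support:.0%} of shared synsets are homogeneous")
```

Synsets flagged `False` are candidates for manual inspection rather than proven counterexamples, since a mismatch may come from annotation inconsistencies instead of a real cross-lingual difference.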

Evaluation: Similar Events – Verbs
- Analysis of common event synsets with a significant number of occurrences in both languages: 25 event synsets, each with at least 5 occurrences.
- For each event token we analyzed its TimeML class and its semantic pattern:
  - basic argument structure; e.g. [ARG0] E [ARG1] [ARG2];
  - semantic class of each argument and thematic role; e.g. [ARG0:Person:Agent];
  - subvalency features; e.g. [Person:Agent: Def_Np] E [Event:Theme:Clause].
- 30 different patterns have been identified for the 25 common synsets; 5 cases are instances of event subcategorization, i.e. more than one pattern.
- 93.22% of cases support the Semantic Homogeneity Hypothesis: same meaning, same semantic pattern, same TimeML class.

Evaluation: Similar Events – Verbs (2)
- < 10% of cases seem to question the validity of Semantic Homogeneity.
- Example: ILI = ; WN seek#3 – IWN cercare#2; same semantic pattern: [person/organization] E [event]; TimeBank class: I_ACTION – Italian TB class: I_STATE.
- NOT A COUNTEREXAMPLE: the inconsistency of the data is due to the exploitation of the SIMPLE–TimeML Mapping and Heuristics (Caselli et al. 2007):
  - SIMPLE–TimeML Mapping: SIMPLE Semantic_type Modal Event : I_STATE
  - cercare#2 = Modal Event : I_STATE
  - Purpose Act : I_ACTION
- All other possible counterexamples we have found can be explained by factors other than a real difference between event realizations in the 2 languages.

Evaluation: Similar Events – Nouns
- All 11 common types have been analyzed. They are all nominalizations of a corresponding event verb.
- The presence of WN senses is useful for identifying incorrect or inconsistent annotations in the source and target corpora, and for more easily identifying those instances which satisfy the criteria for an event in TimeML.
- Incorrect annotations in Italian TB: missing semantic types in SIMPLE; e.g. aumento_n has 3 senses in IWN but 1 semantic type in SIMPLE.
- Incorrect annotations in TB: over-extension of the notion "nominalization = event"; e.g. for payment_n, 8/10 occurrences are marked as EVENT when their meaning is "a sum of money".
- BUT: WN senses are not always sufficient to determine whether a nominal realizes an event, because the lexicon contains cases where the (non-)eventive reading is, somehow, always possible.

Experiments: the ILI as Bootstrapping Device
- Can the ILI and wordnet senses be used as a bootstrapping strategy for the creation of comparable corpora?
- Key idea: if the Semantic Homogeneity Hypothesis holds, it enables the import of one layer of annotation from a source corpus into a target one.
- To verify this hypothesis we developed a system which takes as input the TB events augmented with WN senses and gives as output an additional layer of annotation, i.e. it creates the EVENT tag in Italian.
[Diagram: (TB + WN sense) and (Italian Corpus + IWN sense) are matched on the ILI & P.O.S. of the TB EVENT, producing Italian Corpus + IWN sense + (partial) EVENT annotation.]
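The projection step can be sketched as follows, assuming events and tokens are already linked to ILI records. This is not the authors' system: the record layout, the ILI ids and the function name `bootstrap_events` are hypothetical, and the separate "probable events" category that needs human post-processing is not modelled.

```python
def bootstrap_events(tb_events, italian_tokens):
    """Project EVENT annotation onto Italian tokens that share the
    ILI record and part of speech of an event annotated in TB."""
    known = {(e["ili"], e["pos"]): e["cls"] for e in tb_events}
    tagged = []
    for token in italian_tokens:
        event_class = known.get((token["ili"], token["pos"]))
        # event_class stays None when no TB event matches this token.
        tagged.append({**token, "event_class": event_class})
    return tagged


# Hypothetical toy input; ILI ids are invented for illustration.
tb_events = [
    {"lemma": "seek", "pos": "V", "ili": "ili-0001", "cls": "I_ACTION"},
]
italian_tokens = [
    {"lemma": "cercare", "pos": "V", "ili": "ili-0001"},
    {"lemma": "segnale", "pos": "N", "ili": "ili-0003"},
]

tagged = bootstrap_events(tb_events, italian_tokens)
for token in tagged:
    print(token["lemma"], token["event_class"])
```

Keying the lookup on both the ILI record and the part of speech follows the "ILI & P.O.S. of TB EVENT" matching shown in the slide diagram; whether the TimeML class should also be imported, as done here, is an assumption.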

Experiments: the ILI as Bootstrapping Device (2)
- To evaluate the reliability of this approach we used the entire corpus of the Italian TreeBank, where a total of 62,522 words (9,832 verbs and 44,957 nouns) are manually assigned a sense from IWD.
- Our system identified 3,700 events (6.7%), 1,183 of which are considered "probable events" which need human post-processing.
- 58 new event synsets have been retrieved.
Findings:
- identification of annotation inconsistencies, i.e. over-extension of the notion of event for nominalizations (e.g. movement#4 = social movement);
- sense assignment is not sufficient to disambiguate the eventive/non-eventive reading of nominals, e.g. indication#1 – segnale#1;
- partial matches occur due to the way sense annotation is performed with WN;
- significant reduction of manual effort: only the set of probable events requires validation, and it is restricted to those words whose event reading is not present in WN senses.
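The slide does not state the denominator behind the 6.7% figure. The short check below suggests, as an assumption, that it is the number of sense-tagged verbs and nouns rather than the full word count; the variable names are illustrative.

```python
verbs, nouns, all_words = 9_832, 44_957, 62_522
events, probable = 3_700, 1_183

# Share of events among sense-tagged verbs and nouns (assumed denominator).
share_verbs_nouns = 100 * events / (verbs + nouns)
# Share of events over all words, for comparison.
share_all_words = 100 * events / all_words
# Portion of identified events that still needs human validation.
share_probable = 100 * probable / events

print(f"{share_verbs_nouns:.2f}% of verbs and nouns")
print(f"{share_all_words:.2f}% of all words")
print(f"{share_probable:.1f}% of events are 'probable events'")
```

Under this reading, roughly a third of the identified events require manual validation, which is the basis for the reduction of manual effort claimed on the slide.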

Conclusion
- Identification of a new methodology to link comparable corpora in different languages by means of WN senses and the ILI.
- Data from the resulting resource can be used for contrastive analysis of events as well as multilingual temporal analysis of texts.
- There is semantic homogeneity between similar events in different languages, including semantic preferences for thematic roles and TimeML classes.
- Sense assignment to events improves annotation accuracy, in particular for event identification, and is useful for revealing inconsistencies and errors.
- A modification to TimeML is suggested: the introduction of a tag for ambiguous cases where a double reading (eventive/non-eventive) is always possible.
- The ILI can be used as a semi-automatic bootstrapping device to create resources by importing layers of annotation for words with similar senses.

Thank You!!

Experiments and Evaluation: Similar Events – Nouns (2)
- Identification of the senses is not enough to determine whether a nominal may realize an event.
- The pair "agreement#1 eq_synonym intesa#3, accordo#3" does not have a clear-cut eventive sense in either wordnet, BUT:
  - in TB 31/32 occurrences are tagged as events: over-extension of the event reading;
  - in Italian TB only 7/16 occurrences are tagged as events: in Italian, intesa#3 and accordo#3 cannot be systematically interpreted as events;
  - no difference between the eventive and non-eventive readings is signalled in the WN & IWN senses.
- This calls for a refinement of annotation schemes for events, to provide explicit means to mark ambiguous cases where the double reading is, somehow, always possible.