A resource and tool for Super Sense Tagging of Italian Texts LREC 2010, Malta – 19-21/05/2010 Giuseppe Attardi* Alessandro Lenci* + Stefano Dei Rossi*

Slides:



Advertisements
Similar presentations
Three Basic Problems Compute the probability of a text: P m (W 1,N ) Compute maximum probability tag sequence: arg max T 1,N P m (T 1,N | W 1,N ) Compute.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio.
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
A Bilingual Corpus of Inter-linked Events Tommaso Caselli♠, Nancy Ide ♣, Roberto Bartolini ♠ ♠ Istituto di Linguistica Computazionale – ILC-CNR Pisa ♣
TÍTULO GENÉRICO Concept Indexing for Automated Text Categorization Enrique Puertas Sanz Universidad Europea de Madrid.
Part of Speech Tagging with MaxEnt Re-ranked Hidden Markov Model Brian Highfill.
Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.
Simple Features for Chinese Word Sense Disambiguation Hoa Trang Dang, Ching-yi Chia, Martha Palmer, Fu- Dong Chiou Computer and Information Science University.
Collective Word Sense Disambiguation David Vickrey Ben Taskar Daphne Koller.
Dept. of Computer Science & Engg. Indian Institute of Technology Kharagpur Part-of-Speech Tagging and Chunking with Maximum Entropy Model Sandipan Dandapat.
Machine Learning in Natural Language Processing Noriko Tomuro November 16, 2006.
A Framework for Named Entity Recognition in the Open Domain Richard Evans Research Group in Computational Linguistics University of Wolverhampton UK
ELN – Natural Language Processing Giuseppe Attardi
Natural Language Processing Assignment Group Members: Soumyajit De Naveen Bansal Sanobar Nishat.
Aiding WSD by exploiting hypo/hypernymy relations in a restricted framework MEANING project Experiment 6.H(d) Luis Villarejo and Lluís M à rquez.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
Some Advances in Transformation-Based Part of Speech Tagging
CLEF – Cross Language Evaluation Forum Question Answering at CLEF 2003 ( Bridging Languages for Question Answering: DIOGENE at CLEF-2003.
Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.
CLEF Ǻrhus Robust – Word Sense Disambiguation exercise UBC: Eneko Agirre, Oier Lopez de Lacalle, Arantxa Otegi, German Rigau UVA & Irion: Piek Vossen.
ML-based approaches to Named Entity Recognition for German newspaper texts ESSLLI 02 – Workshop on ML Aproaches for CL Marc Rössler University of Duisburg.
NERIL: Named Entity Recognition for Indian FIRE 2013.
Survey of Semantic Annotation Platforms
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
Graphical models for part of speech tagging
Comparative study of various Machine Learning methods For Telugu Part of Speech tagging -By Avinesh.PVS, Sudheer, Karthik IIIT - Hyderabad.
1 Named Entity Recognition based on three different machine learning techniques Zornitsa Kozareva JRC Workshop September 27, 2005.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Complex Linguistic Features for Text Classification: A Comprehensive Study Alessandro Moschitti and Roberto Basili University of Texas at Dallas, University.
L’età della parola Giuseppe Attardi Dipartimento di Informatica Università di Pisa ESA SoBigDataPisa, 24 febbraio 2015.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
1 Query Operations Relevance Feedback & Query Expansion.
27/03/01CROSSMARC kick-off meeting LTG Background XML-based Processing –Several years of experience in developing XML-based software –LT XML Tools –Pipeline.
ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong
Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.
SYMPOSIUM ON SEMANTICS IN SYSTEMS FOR TEXT PROCESSING September 22-24, Venice, Italy Combining Knowledge-based Methods and Supervised Learning for.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
A Language Independent Method for Question Classification COLING 2004.
10/22/2015ACM WIDM'20051 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis Voutsakis.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
Semiautomatic domain model building from text-data Petr Šaloun Petr Klimánek Zdenek Velart Petr Šaloun Petr Klimánek Zdenek Velart SMAP 2011, Vigo, Spain,
CS774. Markov Random Field : Theory and Application Lecture 19 Kyomin Jung KAIST Nov
14/12/2009ICON Dipankar Das and Sivaji Bandyopadhyay Department of Computer Science & Engineering Jadavpur University, Kolkata , India ICON.
Spanish FrameNet Project Autonomous University of Barcelona Marc Ortega.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I Introduction Tagsets Approaches.
CLEF Kerkyra Robust – Word Sense Disambiguation exercise UBC: Eneko Agirre, Arantxa Otegi UNIPD: Giorgio Di Nunzio UH: Thomas Mandl.
NUMBER OR NUANCE: Factors Affecting Reliable Word Sense Annotation Susan Windisch Brown, Travis Rood, and Martha Palmer University of Colorado at Boulder.
Hedge Detection with Latent Features SU Qi CLSW2013, Zhengzhou, Henan May 12, 2013.
POS Tagger and Chunker for Tamil
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
1 Fine-grained and Coarse-grained Word Sense Disambiguation Jinying Chen, Hoa Trang Dang, Martha Palmer August 22, 2003.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
2/10/2016Semantic Similarity1 Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web Giannis Varelas Epimenidis.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Twitter as a Corpus for Sentiment Analysis and Opinion Mining
Learning Attributes and Relations
Date of Inception: 21st July 2012
Giuseppe Attardi Dipartimento di Informatica Università di Pisa
Text Analytics Giuseppe Attardi Università di Pisa
WordNet: A Lexical Database for English
Intent-Aware Semantic Query Annotation
WordNet WordNet, WSD.
A method for WSD on Unrestricted Text
CS224N Section 3: Corpora, etc.
Presentation transcript:

A resource and tool for Super Sense Tagging of Italian Texts LREC 2010, Malta – 19-21/05/2010 Giuseppe Attardi* Alessandro Lenci* + Stefano Dei Rossi* Simonetta Montemagni + Giulia Di Pietro* Maria Simi* * Universit à di Pisa + ILC - CNR, Pisa

Summary Why Super Sense tagging Preliminary results Improving an existing resource Building a new resource A new tagger for the task Discussion on the results Future work

Semantic tagging Named Entity Recognition (NER)  Simple ontologies: person, organization, location …  Limited semantic/syntactic coverage  High accuracy Word Sense Disambiguation  Identifying WordNet senses  tens of thousands of specific “word senses”  all open class words covered, domain-independent  inadeguate performance

Super Senses  Introduced by Ciaramita and Altun (2006) WordNet super senses  Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories)  26 noun categories; 15 verb categories Example: “Clara Harris person, one of the guests person in the box artifact, stood up motion and demanded communication water substance ”

Super Sense Tagging For English (Ciaramita and Altun, 2006)  training on SemCor (Senseval-3)  discriminative HMM, trained with an average perceptron algorithm  average F-Score on 41 categories: For Italian (Picca, Gliozzo, Ciaramita, 2008)  trained on MultiSemCor (Bentivoglio et al.)  average F-Score on 41 categories: 62,90

Improving MultiSemCor Problems  Smaller size (64% of English corpus)  Incomplete alignment (sense in Eng., no sense in Ita.)  PoS coarseness  Word by word translation Stategy  Retagging, adding morphology Results  average F-Score: 64,95 (same algorithm; 45 categories)

Further work Our requirements  Integration of a SST tagger in the TANL pipeline  Useful model for annotating realistic Italian texts Two directions for improvement  A brand new resource for SST  A new algorithm for SST, based on Maximum Entropy

Building the new resource ISST - Italian Syntactic-Semantic Treebank  305,547 tokens  81,236 content words annotated at the lexico- semantic level, including IWN senses  ILI* mapping from IWN to WN senses * Inter Linguistic Index WordNet IWN ILI Supersense ISST Corpus

From Italian senses to English super senses Starting from sense S i : 1. If S i is in ILI, return che corresponding S e 2. If not, look for the first hyperonym in the ILI and return the corresponding S e In both cases return the super sense of S e in WN ItalWordNet Synset ILI n#24931n # WordNet Supersense Synset ILI n#16564 – Synset ILI n#12484 # Hyperonym Token L’ atmosfera di festa ISST Corpus

ISST-SST after mapping Tokens with super-sense Tokens with ambiguous super-sense Tokens without super-sense direct ILIILI from hyp noun verb adjective adverb Total

Revision Mapping of adverbs in adv.all (~ 10,000) Listing of possible super senses  Alternative for nouns: 2-6  Alternative for verbs: 3-10 An ad-hoc tool for revision Difficulties  Aspectual verbs: “continuare a …”, “stare per …”  Support verbs: “prestare attenzione a …”, “ dar una mano …”

ISST-SST after revision Tokens with super sense Tokens without super sense noun69,36011,545 verb27,6677,075 adjective17,4784,649 adverb12,2321,596 Total126,73724,865

Super Sense Tagger Adapting a generic chunker, part of the Tanl pipeline Maximum Entropy classifier  Effective for chunking since it does not assume independence of features Dynamic programming to select sequences of tags with higher probability The tagger is flexibile and customizable for different tasks  specialization of class FeatureExtractor

Features No external resources, no first sense heuristics Local features  Token Attribute features POSTAG CPOSTAG-1 0  Form features FORM ^\p{Lu} FORM ^\p{Lu}*$ 0 Global features  Whether a word in the document was previously annotated with a given tag

Detailed results

Results for Italian Improvement for Italian due to:  new corpus  the different algorithm and the tuning of features PrecisionRecallF1 Italian Picca et al ,90 Italian our

Analysis of improvement Improvement due to new corpus  MultiSemCor vs ISST-SST, ME tagger  about +4.5 on the F1 score Improvement due to new algorithm and features  Ciaramita-Altun tagger vs ME tagger, on ISST-SST  about +10 on the F1 score

Conclusions Significant improvement in accuracy for SS tagging The tagger has been used to annotate the Italian Wikipedia Examples of queries made possible on the semantic index  Who proves emotions? (the subj of a verb.emotion)  What did Edison invent/create/discover …? (Edison as the subject of a verb.creation) Completion of the ISST-ISST resource can further improve accuracy