A resource and tool for Super Sense Tagging of Italian Texts LREC 2010, Malta – 19-21/05/2010 Giuseppe Attardi* Alessandro Lenci* + Stefano Dei Rossi* Simonetta Montemagni + Giulia Di Pietro* Maria Simi* * Universit à di Pisa + ILC - CNR, Pisa
Summary Why Super Sense tagging Preliminary results Improving an existing resource Building a new resource A new tagger for the task Discussion on the results Future work
Semantic tagging Named Entity Recognition (NER) Simple ontologies: person, organization, location … Limited semantic/syntactic coverage High accuracy Word Sense Disambiguation Identifying WordNet senses tens of thousands of specific “word senses” all open class words covered, domain-independent inadeguate performance
Super Senses Introduced by Ciaramita and Altun (2006) WordNet super senses Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories) 26 noun categories; 15 verb categories Example: “Clara Harris person, one of the guests person in the box artifact, stood up motion and demanded communication water substance ”
Super Sense Tagging For English (Ciaramita and Altun, 2006) training on SemCor (Senseval-3) discriminative HMM, trained with an average perceptron algorithm average F-Score on 41 categories: For Italian (Picca, Gliozzo, Ciaramita, 2008) trained on MultiSemCor (Bentivoglio et al.) average F-Score on 41 categories: 62,90
Improving MultiSemCor Problems Smaller size (64% of English corpus) Incomplete alignment (sense in Eng., no sense in Ita.) PoS coarseness Word by word translation Stategy Retagging, adding morphology Results average F-Score: 64,95 (same algorithm; 45 categories)
Further work Our requirements Integration of a SST tagger in the TANL pipeline Useful model for annotating realistic Italian texts Two directions for improvement A brand new resource for SST A new algorithm for SST, based on Maximum Entropy
Building the new resource ISST - Italian Syntactic-Semantic Treebank 305,547 tokens 81,236 content words annotated at the lexico- semantic level, including IWN senses ILI* mapping from IWN to WN senses * Inter Linguistic Index WordNet IWN ILI Supersense ISST Corpus
From Italian senses to English super senses Starting from sense S i : 1. If S i is in ILI, return che corresponding S e 2. If not, look for the first hyperonym in the ILI and return the corresponding S e In both cases return the super sense of S e in WN ItalWordNet Synset ILI n#24931n # WordNet Supersense Synset ILI n#16564 – Synset ILI n#12484 # Hyperonym Token L’ atmosfera di festa ISST Corpus
ISST-SST after mapping Tokens with super-sense Tokens with ambiguous super-sense Tokens without super-sense direct ILIILI from hyp noun verb adjective adverb Total
Revision Mapping of adverbs in adv.all (~ 10,000) Listing of possible super senses Alternative for nouns: 2-6 Alternative for verbs: 3-10 An ad-hoc tool for revision Difficulties Aspectual verbs: “continuare a …”, “stare per …” Support verbs: “prestare attenzione a …”, “ dar una mano …”
ISST-SST after revision Tokens with super sense Tokens without super sense noun69,36011,545 verb27,6677,075 adjective17,4784,649 adverb12,2321,596 Total126,73724,865
Super Sense Tagger Adapting a generic chunker, part of the Tanl pipeline Maximum Entropy classifier Effective for chunking since it does not assume independence of features Dynamic programming to select sequences of tags with higher probability The tagger is flexibile and customizable for different tasks specialization of class FeatureExtractor
Features No external resources, no first sense heuristics Local features Token Attribute features POSTAG CPOSTAG-1 0 Form features FORM ^\p{Lu} FORM ^\p{Lu}*$ 0 Global features Whether a word in the document was previously annotated with a given tag
Detailed results
Results for Italian Improvement for Italian due to: new corpus the different algorithm and the tuning of features PrecisionRecallF1 Italian Picca et al ,90 Italian our
Analysis of improvement Improvement due to new corpus MultiSemCor vs ISST-SST, ME tagger about +4.5 on the F1 score Improvement due to new algorithm and features Ciaramita-Altun tagger vs ME tagger, on ISST-SST about +10 on the F1 score
Conclusions Significant improvement in accuracy for SS tagging The tagger has been used to annotate the Italian Wikipedia Examples of queries made possible on the semantic index Who proves emotions? (the subj of a verb.emotion) What did Edison invent/create/discover …? (Edison as the subject of a verb.creation) Completion of the ISST-ISST resource can further improve accuracy