Semantic Annotation Eckhard Bick Institute of Language and Communication University of Southern Denmark & GrammarSoft ApS

Higher-level Corpus Annotation ● Why annotate? – structurally and semantically annotated data provide a better basis for linguistic research than raw text data – lexicography: grammatical patterns rather than simple collocations (verb complementation, modifiers etc.) – training/learning of automatic NLP systems ● What to annotate? – Layered annotation: PoS (word class) & morphology -> syntactic function & dependency -> semantic annotation ● How to annotate? – In our approach, with rule-based CG parsers, i.e. all information encoded as tags on tokens or as treebanks – progressively, deducing syntactic structures from morphology & PoS, and semantic structures from lexical types

Semantic Annotation ● Semantic vs. syntactic annotation – semantic sentence structure, defined as a dependency tree of semantic roles, provides a more stable alternative to syntactic surface tags ● “Comprehension” of sentences – semantic role tags can help identify linguistically encoded information for applications like dialogue systems, IR, IE and MT ● Less consensus on categories – The higher the level of annotation, the lower the consensus on categories. Thus, a semantic role set has to be defined carefully, providing well-defined category tests, and allowing the highest possible degree of filtering compatibility

The annotation method ● Information encoding – token-based – one tag per feature/attribute ● Process: progressive modular annotation levels – morphological multi-tagging -> disambiguation – syntactic function mapping -> disambiguation – structural annotation (dependency, constituent trees) – semantic annotation, higher-level structural annotation ● Constraint Grammar – rule-based parser – context-conditioned removal, selection, addition or substitution of tags
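The context-conditioned removal and selection of tags described above can be illustrated with a minimal Python sketch. It is not the PALAVRAS grammar or the CG-3 engine: the `Token` class, the helper names and the English toy example are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    readings: list[set[str]]  # one tag set per competing reading


def remove(tokens, i, tag, context):
    """REMOVE-style rule: drop readings carrying `tag` if `context` holds.

    As in Constraint Grammar, the last remaining reading is never removed.
    """
    if not context(tokens, i):
        return
    kept = [r for r in tokens[i].readings if tag not in r]
    if kept:
        tokens[i].readings = kept


def select(tokens, i, tag, context):
    """SELECT-style rule: keep only readings carrying `tag` if `context` holds."""
    if not context(tokens, i):
        return
    kept = [r for r in tokens[i].readings if tag in r]
    if kept:
        tokens[i].readings = kept


def prev_is_unambiguous(tag):
    """Context test: the token to the left has exactly one reading, with `tag`."""
    def context(tokens, i):
        if i == 0:
            return False
        prev = tokens[i - 1].readings
        return len(prev) == 1 and tag in prev[0]
    return context


# Toy input (hypothetical tags): "run" is ambiguous between noun and verb.
sentence = [
    Token("the", [{"DET"}]),
    Token("run", [{"N", "SG"}, {"V", "INF"}, {"V", "PR"}]),
]

# Remove verb readings directly after an unambiguous determiner.
for i in range(len(sentence)):
    remove(sentence, i, "V", prev_is_unambiguous("DET"))
```

After the loop, "run" keeps only its noun reading, while "the" is untouched because it has no left context.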

Integrating structure and lexicon: 2 different layers of semantic information ● (a) "lexical perspective": contextual selection of – a (noun) sense (WordNet style), or – semantic prototype (SIMPLE style) ● (b) "structural perspective": thematic/semantic roles reflecting the semantics of verb argument frames – Fillmore 1968: case roles – Jackendoff 1972: Government & Binding theta roles – Foley & van Valin 1984, Dowty 1987: ● universal functors postulated ● feature precedence postulated (+HUM, +CTR)

Lexico-semantic tags in Constraint Grammar ● secondary: semantic tags employed to aid disambiguation and syntactic annotation (traditional CG):,,,, ● primary: semantic tags as the object of disambiguation ● existing applications using lexical semantic tags – Named Entity classification (Nomen Nescio, HAREM) – semantic prototype tagging for treebanks (Floresta, Arboretum) – semantic tag-based applications ● Machine translation (GramTrans) ● QA, library IE, sentiment surveys, referent identification (anaphora)

Semantic argument slots ● the semantics of a noun can be seen as a "compromise" between its lexical potential or “form” (e.g. prototypes) and the projection of a syntactic-semantic argument slot by the governing verb (“function”) ● e.g. (country, town) prototype – (a) location, origin, destination slots (adverbial argument of movement verbs) – (b) agent or patient slots (subject of cognitive or agentive verbs) ● Rather than hypothesize different senses or lexical types for these cases, a role annotation level can be introduced as a bridge between syntax and true semantics

PALAVRAS Semantic prototypes ● ca. 160 types for ~ nouns ● SIMPLE- and cross-VISL-compatible (7 languages), thus a possibility for integration across languages ● Ontology with umbrella classes and subcategories, e.g. – :,,,,... – :,,,,.... – :,,,... ● allows composite “ambiguous” tags: – ( + ), ( + ) ● metaphors and systematic category inheritance are underspecified: -> ● prototypes expressed as bundles of atomic semantic features, e.g. (vehicle) = +concrete, -living, -human, +movable, +moving, -location, -time...
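The idea of a prototype as a bundle of atomic features can be sketched as follows. Only the vehicle bundle is taken from the slide; the function names and the constraint format are hypothetical and do not reflect how the PALAVRAS lexicon is stored.

```python
def parse_bundle(spec: str) -> dict[str, bool]:
    """Turn '+concrete, -living' into {'concrete': True, 'living': False}."""
    bundle = {}
    for item in spec.split(","):
        item = item.strip()
        if not item or item[0] not in "+-":
            raise ValueError(f"expected +feature or -feature, got {item!r}")
        bundle[item[1:]] = item[0] == "+"
    return bundle


def satisfies(bundle: dict[str, bool], constraint: dict[str, bool]) -> bool:
    """True if every feature in `constraint` has the same value in `bundle`.

    Features missing from the bundle count as not satisfied.
    """
    return all(bundle.get(name) == value for name, value in constraint.items())


# The vehicle prototype, as listed on the slide.
VEHICLE = parse_bundle(
    "+concrete, -living, -human, +movable, +moving, -location, -time"
)

# Hypothetical slot restrictions expressed over atomic features.
needs_movable_non_human = parse_bundle("+movable, -human")
needs_human = parse_bundle("+human")

vehicle_fits_first = satisfies(VEHICLE, needs_movable_non_human)
vehicle_fits_second = satisfies(VEHICLE, needs_human)
```

Working on atomic features rather than whole prototype names lets one restriction cover every prototype that shares the relevant feature values.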

what is a semantic prototype? ● semantic prototype classes perceived as distinctors rather than semantic definitions ● intended to at the same time – capture semantically motivated regularities and relations in syntax by similarity- lumping (syntax-restrictions, IR, anaphora) – distinguish different senses (polysemy) – select different translation equivalents in MT ● prototypes seen as (idealized) best instance of a given class of entities (Rosch 1978) ● but: class hypernym tags used ( for “land animal”) rather than low-level-prototypes ( or )

Disambiguation of semantic prototype bubbles by dimensional downscaling (lower-dimension projections) (town, country) +LOC (person) -LOC +HUM e.g. “Washington”

feature inheritance

Feature bundles in prototype based polysemy resolution

prototypes or atomic features? ● Rioting continued in Paris. The town imposed a curfew – anaphoric relations visible as tag – not visible after HUM/PLACE disambiguation ● Paris +PLACE -HUM, due to “in” ● town -PLACE +HUM, due to “impose” ● The Itamarati announced new taxes, but voters may not allow the government to go ahead. – semantic context projection announce) used to mark metaphorical transfer --> allow reference between the government and its seat (place name)
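The trade-off in the Paris example can be made concrete with a small sketch. The prototype label "town/country", the context keys and the `project` function are hypothetical. The sketch only shows why two mentions that share a prototype no longer share a feature once each has been disambiguated by its context.

```python
# Potential features of an underspecified prototype (hypothetical label).
PROTOTYPE_FEATURES = {"town/country": {"PLACE", "HUM"}}

# Which feature a given context projects onto its argument (toy table).
CONTEXT_PROJECTION = {
    ("prep", "in"): "PLACE",           # "in Paris"
    ("subject_of", "impose"): "HUM",   # "The town imposed a curfew"
}


def project(prototype, context):
    """Narrow a prototype to the feature its context projects, if compatible.

    Unknown contexts leave the prototype underspecified.
    """
    potential = PROTOTYPE_FEATURES[prototype]
    feature = CONTEXT_PROJECTION.get(context)
    return {feature} if feature in potential else set(potential)


paris = project("town/country", ("prep", "in"))
town = project("town/country", ("subject_of", "impose"))

# Same prototype before disambiguation, disjoint features after it.
shared_after_disambiguation = paris & town
```

Here `paris` is narrowed to PLACE and `town` to HUM, so the intersection is empty. A referent-identification step that relies on matching tags would therefore need the prototype level, not only the disambiguated features.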

Semantic role annotation for Portuguese ● inspired by the Spanish 3LB-LEX project (Taulé et al. 2005) ● allows, together with the syntactic function annotation (ARG structure), a mapping onto PropBank argument frames (Palmer et al. 2005) ● will allow the extraction of argument frames from treebanks (i.e. the 400kW Floresta Sintá(c)tica) ● manual vs. automatic: due to the quality of the syntactic parser and the existence of the prototype lexicon, a bootstrapping approach is envisioned, where – syntactic valency is exploited in conjunction with the prototype lexicon (ontology) to create semantic role annotation, – which in turn provides "semantic valency frames" – which are then used to improve the semantic role annotation

Semantic role granularity ● 52 semantic roles (15 core argument roles and 37 minor and “adverbial” roles) ● Covering the major categories of the tectogrammatical layer of the PDT (Hajicova et al. 2000) ● ARG structure (a la PropBank, Palmer et al 2005) can be added without information loss by combining roles and syntactic function tags ● all clause level constituents are tagged, and where the same categories can be used for group-level annotation, this is annotated, too ● semantic heads: np heads, pp dependents

The semantic role inventory

Exploiting lexical semantic information through syntactic links ● corpus information on verb complementation: – CG set definitions – e.g. V-SPEAK = “contar” “dizer” “falar”... – MAP (§SP) (p V-SPEAK) ● ~ 160 semantic prototypes from the PALAVRAS lexicon – e.g. N-LOC =.. combined with destination prepositions PRP-DES = “até” “para”... – MAP (§DES) (0 N-LOC LINK p PRP-DES) ● Needs dependency trees as input, created with the syntactic levels of the PALAVRAS parser
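A rough Python analogue of the two MAP rules above may help readers unfamiliar with CG notation. It is a sketch under assumptions: the `Node` class, the function labels and the toy sentences are hypothetical, the speaker rule is restricted to subjects (the rule target is not shown on the slide), and a single "LOC" tag stands in for the elided N-LOC prototype set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    form: str
    lemma: str
    func: str = ""                            # syntactic function label
    sem: set = field(default_factory=set)     # semantic prototype tags
    parent: Optional["Node"] = None           # dependency head ("p" in CG)
    roles: set = field(default_factory=set)   # mapped semantic roles


# Set definitions; the lemma lists are the ones shown on the slide.
V_SPEAK = {"contar", "dizer", "falar"}
PRP_DES = {"até", "para"}
N_LOC = {"LOC"}  # hypothetical stand-in for the location prototypes


def map_roles(nodes):
    """Add role tags based on a node's own tags and its dependency parent."""
    for node in nodes:
        head = node.parent
        if head is None:
            continue
        # ~ MAP (§SP) (p V-SPEAK), assumed here to target subjects only
        if node.func == "SUBJ" and head.lemma in V_SPEAK:
            node.roles.add("§SP")
        # ~ MAP (§DES) (0 N-LOC LINK p PRP-DES)
        if node.sem & N_LOC and head.lemma in PRP_DES:
            node.roles.add("§DES")


# "Maria falou" and "viajou para Lisboa" as two toy dependency trees.
falou = Node("falou", "falar", func="PRED")
maria = Node("Maria", "Maria", func="SUBJ", sem={"HUM"}, parent=falou)
viajou = Node("viajou", "viajar", func="PRED")
para = Node("para", "para", func="ADVL", parent=viajou)
lisboa = Node("Lisboa", "Lisboa", func="PREP-ARG", sem={"LOC"}, parent=para)

map_roles([falou, maria, viajou, para, lisboa])
```

Because each rule looks only at the node and its parent, the same condition holds whether the argument stands to the left or to the right of its head.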

Dependency trees: example tree for “El Ministerio_de_Salud_Pública organizará un programa y una fiesta para sus trabajadores en su propio edificio” (The Ministry of Health will organize a program and a party for its employees in their own building); role labels shown in the tree: AG, EV, BEN, LOC

Source format

Inferring semantic roles from verb classes and syntactic function and dependency (p, c and s) ● subjects of ergatives: MAP (§PAT) (p LINK NOT ; ● the give sb-DAT s.th.-ACC frame: MAP (§TH) ; ● implicit inference of semantics: syntactic function and valency potential (e.g. ditransitive ) are not semantic by themselves, but help restrict the range of possible argument roles (e.g. §BEN

Inferring semantic roles from semantic prototype sets using syntactic function and dependency (p, c and s) ● explicit use of lexical semantics: semantic prototypes such as (human professional), (ideology-follower), (nationality)... restrict the role range by themselves, but are ultimately still dependent on verb argument frames ● (a) "Genitivus objectivus/subjectivus" ● MAP (§PAT) (p PRP-AF LINK p N-VERBAL) ; # the destruction of the town ● MAP (§AG) TARGET (p N-ACT) ; # The government's release of new data ● MAP (§PAT) TARGET (p N-HAPPEN) ; # The collapse of the economy

● Agent: "he was chased by three police cars" MAP (§AG) (p LINK p PAS) (0 N-HUM OR N-VEHICLE) ; ● Possessor: "the painter's brush" MAP (§POS) N) (p N-OBJECT) ; ● Instrumental: “destroy the piano with a hammer” MAP (§INS) (0 N-TOOL) (p PRP-MED ; ● Content: “a bottle of wine" MAP (§CONT) ) ; ● Attribute: “a statue of gold” MAP (§ATR) + N-MAT (p ("of") ; ● Location: “live in a big house” MAP (§LOC) + N-LOC (p PRP-LOC LINK ● Origin: “send greetings from Athens”, “drive all the way from the border” MAP (§ORI) (0 N-LOC) (p PRP-FRA LINK LINK p V-MOVE/TR) ; ● Temporal extension: “The session lasted 4 hours” MAP (§EXT-TMP) (0 N-DUR) ;

Integrating the lexical resources in the grammar: Set definitions for rule use ● N-HUM =... ● N-LOC =... ● N-TOOL =.. ● N-OBJECT = ● verb semantics, largely defined as lists in the grammar itself: – V-SPEAK, V-MOVE, V-MOVE-TR... – with valency: VT-ACC-HUM, VT-ACC-NON-HUM, VT-ACC-ORG, VT-ACC-TITLE, V-SUBJ-HUM, VT-SUBJ-HUM (2.000 senses) ● adjective semantics (in part matching noun prototypes): – (1.300 senses),,...

Grammar profile ● About 500 CG role mapping rules, with a backbone of unambiguous mapping rules ● enhanced rule economy by using semantic tags and sets rather than individual lexemes, for both targets and contexts ● processes live input from the PALAVRAS parser ● uses a variety of syntactic and semantic markers – dependency structures & syntactic function – noun prototypes, semantic verb classes, adjective domain ● Pipeline diagram: Text -> syntax parser -> Role CG (CG3), with semantic noun lexicon and semantic verb lexicon -> Annotated Corpus

Advantages of the dependency-based method in semantic annotation ● in a traditional CG (without dependency), about 3 times as many rules would be needed: – left-pointing arguments vs. right-pointing arguments – special rules for embedded clauses (e.g. relative clauses after subjects) ● greater context precision of dependency-based rules – safer mapping rules – less need for ambiguous mapping followed by REMOVE/SELECT disambiguation ● dependency rules are simpler and less error-prone – no need for BARRIER contexts and "negative" sets to handle/delimit constituent boundaries

Semantic role tagging performance on CG-revised Floresta + live dependency + live prototype tagging R=86.8%, P=90.5%, F=88.6%
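As a consistency check, the reported F value matches the balanced F-measure (harmonic mean of precision and recall). That the slide uses this balanced variant is an assumption, but the arithmetic agrees with it:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.905 \times 0.868}{0.905 + 0.868}
  = \frac{1.57108}{1.773}
  \approx 0.886
```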

Corpus results from a recent Spanish sister project ● compilation and annotation of a Spanish internet corpus (11.2 million words) ● to infer tendencies about the relationship between semantic roles and other grammatical categories:

● smallest syntactic “spread”: §AG, §COG, §SP (subject and agent of passive) ● easy: §SP and §COG, inferable from the verb alone ● difficult: §TH, covers a wide range of verb types and semantic features match >= 20 roles, but unevenly ● human roles tend to appear left, others right

● Problems – interdependence between syntactic and semantic annotation – multi-dimensionality of prototypes (e.g.,, ) – a certain gradual nature of role definitions – the verb frame bottleneck ● Plans: – annotate what is possible, one argument at a time, use function generalisation and noun types where verb frames are not available – Bootstrap a frame lexicon from automatically role-annotated text corpora ● Diagram labels (bootstrapping cycle): good role annotation grammar, annotated data, frequency-based frame extraction, human postrevision, Port. FrameNet, Port. PropBank

Portuguese parser: beta.visl.sdu.dk Corpora: corp.hum.sdu.dk

A usage example: Using “deeply” annotated corpora for “live” lexicography: DeepDict 1) annotate corpora with dependency, syntactic function and semantic classes – Linguateca's Público and Folha corpora (CETEMPúblico, CETENFolha) – ca M words – Portuguese Wikipedia (Nov. 2005) – ca. 8 M words – Portuguese section of Europarl – ca. 27 M words 2) extract dependency pairs of word + complements – N A + ADJ (gravemente doente) – V (ganhar ktp, V + PRP (pensar em) 3) identify statistically significant relations (not n-grams but “dep-grams”!): log(p(AB)^2 / (p(A) * p(B)))

4) normalisation (lemma) and classification 5) build a graphical interface * semantic prototypes will allow generalisations and cross-language comparisons * uses: advanced learner's dictionaries, production dictionaries, language and linguistics teaching...
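The association score in step 3 can be computed from raw counts as sketched below. The function name, the counts and the use of the natural logarithm are assumptions; the slide does not state the log base or how the probabilities are estimated.

```python
import math


def depgram_score(count_ab: int, count_a: int, count_b: int, total: int) -> float:
    """log(p(AB)^2 / (p(A) * p(B))) with relative frequencies as probabilities.

    count_ab: occurrences of the dependency pair A-B
    count_a, count_b: occurrences of A and of B in the relation
    total: number of dependency pairs in the corpus
    """
    if min(count_ab, count_a, count_b, total) <= 0:
        raise ValueError("all counts must be positive")
    p_ab = count_ab / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log(p_ab ** 2 / (p_a * p_b))


# Hypothetical counts: two pairs with the same marginals but different
# co-occurrence frequency.
strong = depgram_score(count_ab=40, count_a=50, count_b=50, total=1000)
weak = depgram_score(count_ab=5, count_a=50, count_b=50, total=1000)
```

Squaring p(AB) weights frequent pairs more heavily than plain pointwise mutual information would, so rare pairs with small marginals do not dominate the ranking.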

DeepDict pipeline (diagram): raw text (Wikipedia, newspaper, Internet, Europarl) -> corpus (encoding cleaning, sentence separation, id-marking) -> DanGram, EngGram... (comp. lexica, CG grammars, dep grammar) -> annotated corpora -> dep-pair extractor -> statistical database (norm frequencies) -> CGI -> friendly user interface (example concordance)

DeepDict user interface

DeepDict: explicit semantics (semantic prototypes)

DeepDict: sources and sizes