A FrameNet for Danish Eckhard Bick University of Southern Denmark

A FrameNet for Danish Eckhard Bick University of Southern Denmark eckhard.bick@mail.dk

Verb classification
lexical classification is central to linguistics, and to computational linguistics in particular
as the pivot of the sentence, verbs play a special integrative role in lexical ontologies, quite different from other word classes
● nouns are accessible to (relatively) easy ISA classification
● for verbs, structural aspects are meshed with semantics
● verb-complement relations are subject to complex combinatorial restrictions
● verbs integrate phrasal material, especially adverbs and prepositions, but also nouns, in support constructions

The roots of the FrameNet concept
Levin 1993 - semantic verb classes
VerbNet (Kipper et al. 2006)
● non-np complements, 23 thematic roles, 94 sem. predicates
FrameNet (Baker et al. 1998, Johnson & Fillmore 2000)
● semantic frames -> corpus examples, e.g. Commerce: Buyer+Seller+Goods+Money
● sense distinctions implied by frame membership
● roles, morphosyntactic restrictions, ontological slot information
PropBank (Palmer et al. 2005)
● annotated corpora -> verb frames for each given sentence
● roles, arg structure, morphosyntactic restrictions

Danish verb classification projects
DanNet (Pedersen et al. 2008), modelled on Princeton WordNet
● 3000 verbs with 6000 senses
● 80 top classes, e.g. BoundedEvent+Physical+Location
● particle incorporation and reflexivity as sense discriminators, but no frame roles or systematic selection restrictions
STO database (Braasch & Olsen 2004)
● 6000 verbal entries, 80% with syntactic, 20% with semantic information
Odense Valency Dictionary (Schösler & Kirchmeier-Andersen 1997)
● 4000 verbs
● classification of verbal argument semantics through the semantics of pronoun complements

Danish FrameNet: idea
valency as a stepping stone for semantics: turn ”syntactic frames” into semantic frames
assumption: in most cases, knowing form, function and complement semantic class is enough to distinguish verbal subsenses and to assign a full thematic role frame
manual assignment of verb classes and frames to each valency ”sense” in the DanGram (parser) dictionary, adding further distinctions where necessary
semi-automatic assignment of syntactic function and form restrictions for complements (implicit in the valency tags)
frames checked in corpus examples, facilitated by the fact that all syntactic complementation patterns were already available in DanGram-annotated corpora

Danish FrameNet: status
high lexicon coverage, fairly stable at 6825 lexemes
● ~11,000 valency patterns, ~12,075 verb frames
1.77 frames/lexeme, 1.46 senses/lexeme
494 verb categories
● 200 coarser categories with hypernym mapping, to allow generalisation in CG rules and to facilitate cross-language comparison (transfer?)
● grouped using Levin senses, but modified & expanded
– use real hypernym verb names where possible, avoiding both example-based category names (common in VerbNet) and abstract concept names (common in FrameNet)
– (unlike WordNet and VerbNet) class distinction for polarity antonyms (increase - decrease, like - dislike) and self/other (move_selv, move_other)
● hierarchy as flat as possible (unlike WordNet), in order to facilitate using categories as corpus annotation tags or CG disambiguation tags
● avoid large underspecified classes, e.g. VerbNet change_of_state --> heat - cool, activate - deactivate, open - close, leaving only a ”wastebin” rest category

Frame distinctor
skeleton: valency derived from DanGram lexicon, e.g.
● monotransitive
● ditransitive
● prepositional ditransitive with the preposition ”på” and a verb-incorporated 'ind'-adverb
each valency frame is assigned at least one verb sense, with its own semantic frame
several valency or semantic frames may share the same verb sense (e.g. variations in number of obligatory arguments)
two different verb senses will almost always differ in at least one syntactic or semantic aspect of their argument frame
● --> guaranteeing that all senses can in principle be disambiguated by exploiting a parser's argument tags and dependency links

Frame information fields
Thematic role (case/semantic role, Fillmore 1968)
Syntactic function
Morphosyntactic form (PoS and phrases)
Typical semantic prototype slot filler (np's only)
English gloss / skeleton sentence
(46%: best-guess DanNet link based on semi-automatic matches for adverb incorporation and hypernym classification)
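The fields listed above could be held in a record like the following. This is a minimal Python sketch, not the framenet.dk storage format: the class names, field names, the class label "consist" and the noun class labels are all illustrative assumptions. Only the role assignment for "bestå af" (@SUBJ -> §HOL, @PIV -> §PART) follows the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FrameElement:
    role: str          # thematic role, e.g. "HOL"
    function: str      # syntactic function, e.g. "SUBJ"
    form: str          # morphosyntactic form, e.g. "np" (labels are assumptions)
    noun_classes: List[str] = field(default_factory=list)  # prototype slot fillers


@dataclass
class Frame:
    lemma: str
    verb_class: str                      # frame category (hypothetical label below)
    elements: List[FrameElement]
    gloss: str = ""                      # English gloss / skeleton sentence
    dannet_link: Optional[str] = None    # best-guess link, present for some frames only


def roles_by_function(frame: Frame) -> Dict[str, str]:
    """Map each syntactic function of a frame to its thematic role."""
    return {e.function: e.role for e in frame.elements}


# Illustrative entry for "bestå af" (consist of); values other than the
# function -> role pairs are assumptions made for this sketch.
bestaa = Frame(
    lemma="bestå",
    verb_class="consist",
    elements=[
        FrameElement("HOL", "SUBJ", "np", ["object"]),
        FrameElement("PART", "PIV", "pp-af", ["material"]),
    ],
    gloss="X consists of Y",
)

print(roles_by_function(bestaa))
```

Keeping the function-to-role mapping per frame is what later allows a role to be looked up from an argument's syntactic tag alone.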

Thematic roles
§STI - Stimulus
§REFL - Reflexive
§DON - Donor
§PATH - Path
§ORI - Origin
§EXT - Extension
§VAL - Value
§EXT-TMP - Duration
§MES - Message
§TP - Topic
§SOA - State of Affairs
§CAU - Cause
§ROLE - Role
§INS - Instrument
§MNR - Manner
§FIN - Purpose
§COMP - Comparison
§HOL - Whole
§PART - Part
§POSS - Possessor
§ASS - Asset
§CONT - Content
§COM - Co-role
§INC - Incorporated

Syntactic functions

Semantic (prototype) noun class

Form types

Verb incorporations
adverb incorporates @MV<
● kaste op (vomit), slå fra (deactivate), komme ind på (discuss)
noun incorporates @ACC §INC (syntactic object, semantically part of the predicator)
● holde kæft (shut up), have brug for (need)
● verbal dependency convention is preserved for incorporations too (rather than making nominal incorporated material the head for real arguments)
pp incorporates @PIV §INC
● træde i kraft (take effect)
frozen pp expressions, e.g. with otherwise nonexistent dative --> fused by preprocessor
● have i sinde (intend), være på færde (be going on)

Frame annotation
assumption: FrameNet + CG parser = frame annotator for running text (i.e. verb senses, frame element roles)
experiment: CG implementation of a frame annotator
● tag compatibility with DanGram output as input
● rule-based system allows later fine tuning and contextual exceptions

Frame annotator: step 1
converter program (framenet2cgrules.pl) turns each frame into a verb sense mapping rule
argument checking = LINKed dep context in CG
● SUBSTITUTE (V) ( V) TARGET ("bestå" V) (1 (*) LINK *-1 VFIN LINK c @SUBJ LINK 0 ) (c @PIV LINK 0 ("af") LINK c @P OR )
– frame class (implicitly: sense)
– argument relations: @SUBJ -> §HOL, @PIV -> §PART/MAT
– syntactic conditions: PRP 'af' for the @PIV object
– semantic conditions: (object) for the subject, or (material) for the object
grammar section with set (LIST) definitions, to be used by the conversion rules
● LIST = r r r r ; (subtypes, clothing, containers, fruits, furniture, tools, vehicles)
● LIST = r ; (materials, chemicals, mass nouns)
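The conversion step can be pictured as plain string generation from a frame description. The sketch below is Python, not the Perl of framenet2cgrules.pl, and it is only an illustration: the tag `<fn:consist>`, the set names `N-OBJECT` and `N-MATERIAL`, the `@P<` tag and the simplified context layout are assumptions, and the emitted string is not claimed to match the rules the real converter produces.

```python
def frame_to_rule(lemma, frame_class, args):
    """Build one SUBSTITUTE-style rule string from a frame description.

    args is a list of (function, preposition_or_None, semantic_set_or_None).
    All tag and set names passed in are hypothetical.
    """
    contexts = []
    for function, prep, sem_set in args:
        parts = [f"c @{function}"]           # the verb must have this child
        if prep:
            # prepositional argument: check the preposition itself ...
            parts.append(f'LINK 0 ("{prep}")')
            if sem_set:
                # ... then the semantic class of the preposition's dependent
                parts.append(f"LINK c @P< LINK 0 {sem_set}")
        elif sem_set:
            parts.append(f"LINK 0 {sem_set}")
        contexts.append("(" + " ".join(parts) + ")")
    return (f"SUBSTITUTE (V) (<fn:{frame_class}> V) "
            f'TARGET ("{lemma}" V) ' + " ".join(contexts) + " ;")


rule = frame_to_rule(
    "bestå", "consist",
    [("SUBJ", None, "N-OBJECT"), ("PIV", "af", "N-MATERIAL")],
)
print(rule)
```

Generating rules from the lexicon, rather than writing them by hand, keeps the grammar in step with the frame inventory: a changed frame only needs a rerun of the converter.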

Frame annotator: step 2
methods for assigning thematic roles to arguments:
● (a) MAP on multiple (argument) contexts
● (b) MAP on individual arguments, unify their functions with the verb's new tag and retrieve (map) the correct role from the latter
neither (a) nor (b) was supported in any CG compiler
● new feature for CG3: allow unification between tag-internal string variables and ordinary tag and map sets
● MAP KEEPORDER (VSTR:§$1) TARGET @SUBJ (*p V LINK -1 (*) LINK *1 ( r) LINK 0 PAS LINK 0 ( r)) ;
– $1 variable extracted from and MAPped on the subject (@SUBJ), if the verb is in the passive voice
complete rules also contain negative (cautionary) contexts, e.g. ruling out object daughters for intransitive valency frames
rule order is important, since the first matching rule will ”win”. The default for a given lexical rule batch is the first intransitive or monotransitive frame in the lexicon.
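Method (b) amounts to reading the role for each argument off a tag that step 1 left on the verb. The following Python sketch mimics that lookup outside CG3. The tag format `<r:FUNCTION:ROLE>` and the example words are assumptions made for illustration; the actual CG3 rules use string variables as described above.

```python
import re

ROLE_TAG = re.compile(r"<r:([A-Z]+):([A-Z-]+)>")  # hypothetical tag format


def map_roles(verb_tags, dependents):
    """Assign thematic roles to a verb's dependents.

    verb_tags:  tags on the verb, some of them role-bearing
    dependents: list of (word, syntactic_function) pairs
    Dependents whose function has no role tag on the verb are left out.
    """
    role_for = {}
    for tag in verb_tags:
        m = ROLE_TAG.fullmatch(tag)
        if m:
            # the first role listed for a function wins
            role_for.setdefault(m.group(1), m.group(2))
    return [(w, f, "§" + role_for[f]) for w, f in dependents if f in role_for]


verb_tags = ["<fn:consist>", "<r:SUBJ:HOL>", "<r:PIV:PART>"]
dependents = [("udvalget", "SUBJ"), ("af", "PIV"), ("nu", "ADVL")]
print(map_roles(verb_tags, dependents))
```

Because the role is retrieved from the verb's tag rather than hard-coded per rule, one generic mapping step can serve every frame in the lexicon.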

Corpus - grammar interaction
using ”typical” semantic noun classes for argument slots can help disambiguate frames, but
● does this mean we will only find (frame-semantic) corpus examples our grammar has already incorporated?
● does this decrease robustness in the face of rare cases and metaphor?
Solution: run all rules twice, first with semantic noun class restrictions in place, then - if necessary - without
● this provides a ”syntax-only” (semantics-free) skeletal annotation for backup frame assignment
● and allows corpus-based extension of semantic noun class restrictions without fear of circularity
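The two-pass solution can be sketched as follows. This is a simplified Python illustration, not the CG implementation: the frame names, noun class labels and the dictionary layout are assumptions, and real rules check far richer contexts.

```python
def match_frame(frames, arguments, use_semantics):
    """Return the name of the first frame whose conditions hold, else None.

    frames:    list of {"name": str, "args": {function: set of noun classes}}
    arguments: {function: noun class} as found in the parsed sentence
    """
    for frame in frames:
        ok = True
        for function, allowed in frame["args"].items():
            if function not in arguments:
                ok = False          # required argument missing
                break
            if use_semantics and allowed and arguments[function] not in allowed:
                ok = False          # noun class restriction violated
                break
        if ok:
            return frame["name"]
    return None


def assign_frame(frames, arguments):
    """First pass with noun class restrictions, second pass syntax-only."""
    name = match_frame(frames, arguments, use_semantics=True)
    if name is not None:
        return name, "semantic"
    name = match_frame(frames, arguments, use_semantics=False)
    if name is not None:
        return name, "syntax-only"
    return None, "unmatched"


# Two hypothetical frames for one verb; labels are illustrative only.
frames = [
    {"name": "establish", "args": {"SUBJ": {"institution"}, "ACC": {"group"}}},
    {"name": "decrease", "args": {"SUBJ": set(), "ACC": {"amount", "feeling"}}},
]
print(assign_frame(frames, {"SUBJ": "institution", "ACC": "group"}))
print(assign_frame(frames, {"SUBJ": "institution", "ACC": "vehicle"}))
```

Recording which pass produced the frame makes it possible to collect the syntax-only matches afterwards and inspect them as candidates for extending the noun class restrictions.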

example: 2 frame senses for 'nedsætte'
(Literally: Now establishes government-the a commission that shall investigate how...)
(Literally: In Odense's Vollsmose is it first of all the environment's lacking standing, that decreases expectations-the and increases problems-the.)

Evaluation
2.4 million words (newspaper Information) annotated with DanGram dependency parser (Bick 2005) + automatically derived FrameNet conversion rules
98.8% of main verbs were assigned a frame sense
● 19.2% default assignments (lack of matchable surface arguments)
● 15.0% subject-less infinitive and participle constructions
– of these, 2/3 (10.9%) had other, non-subject arguments to support frame assignment
4051 verb lexeme types, assigned 9195 frame types and 5929 verb sense types
● types: 2.26 frames/verb, 1.46 senses/verb (~ same as in the lexicon itself)
● tokens: ambiguity twice as high (multi-sense verbs are more frequent)

Surface expressions of arguments
dative (DAT) objects are least obligatory (subject count suffers from non-finite clauses)
prepositional objects (PIV) have almost as high an expressivity as predicatives, simply because most verbs have alternative valency frames of lower order (intransitive or monotransitive accusative) that the tagger would have chosen in the absence of a PIV
--> PIVs are strong sense markers, and their sense will rarely be false positive

Frame tagger performance
manual evaluation (”inspection method”) of a random frame-annotated 5000 word chunk with 562 main verbs
● DanGram: 0.7% false positive verbs (4), 0.5% lemma errors (3)
● 561 frames, 478 correct (missed 3, over-tagged 2)
only 1 verb not covered --> high raw lexicon coverage 99.82%
Comparison:
● Shi & Mihalcea (2004), FrameNet-derived rules for English, F=74.5
● Gildea & Jurafsky (2002), statistical tagger, F=80.4 (frame roles), 82.1 (abstract thematic roles)
● Johansson & Nugues (2006), English support constructions, F=71-73
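The counts above can be turned into precision, recall and F-score with a few lines of Python. The sketch rests on one assumption that the slide does not spell out: that the number of gold frames equals the assigned frames minus the over-tagged ones plus the missed ones (561 - 2 + 3 = 562). Under that reading, the F-score comes out close to the overall figure of 85.12 given in the conclusion.

```python
def precision_recall_f(assigned, correct, missed, over_tagged):
    """Compute precision, recall and balanced F-score from raw counts.

    The gold count is reconstructed as assigned - over_tagged + missed,
    which is an assumption about how the evaluation was counted.
    """
    gold = assigned - over_tagged + missed
    precision = correct / assigned
    recall = correct / gold
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


p, r, f = precision_recall_f(assigned=561, correct=478, missed=3, over_tagged=2)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```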

Error types
39% of false positive errors (5.7% of all frames) were cases where the human ”gold sense” was not on the list of possible senses. 1/3 of these got a default tag
13.3% of false positives were caused by parsing stage errors (wrong lemma, auxiliary or syntactic tag) --> correct input would raise precision (F=86.20)
frequency influence:
● frequent verbs have a higher sense ambiguity (i.e. in-chunk)
● verbs with a high sense ambiguity are more error prone than 1-sense verbs

framenet.dk example: valency-differentiated frames

framenet.dk example: same-valency subframes

Conclusion & future work
Current status:
● 12,000 frames, high lexical coverage
● 94.3% frame coverage (i.e. only 5.7% matchless default mappings)
● overall F-Score of 85.12
Since 2/5 of frame tagging errors were due to missing frame senses, the current framenet should be checked against larger amounts of corpus data to identify senses not covered by our valency-based approach (especially noun incorporations, which DanGram mostly regards as objects: finde sted - take place)
Frametagger improvements:
● ordering or modifying the frame-derived MAP rules
● more complex context conditions where necessary
Alternative tagging approach (for comparison with the CG conversion approach)
● scoring method where frame conditions are matched individually against the parse tree
Creation of a manually revised, frame-annotated corpus (later allowing hybridization with a statistical frame tagger)

www.framenet.dk eckhard.bick@mail.dk

Bibliography
