Shallow semantic parsing: Making most of limited training data. Katrin Erk, Sebastian Pado (Saarland University)
Introduction Frame semantics: –“Who does what to whom” analysis: senses and roles –Cross-lingual appeal (Boas 2005) Prerequisite for use in NLP: Automatic, robust, accurate methods for analysis of free text Predominant machine learning paradigm: Supervised classification –Learn relation between features and classes from training corpus; guess classes in test corpus –Gildea and Jurafsky (2002) and many since
Frame-semantic analysis Step 1: Frame disambiguation –WSD-style classification of predicate in terms of frames Step 2: Role assignment –Classification of nodes in terms of role labels
Frame-semantic analysis Creeping in its shadow I reached a point whence I could look straight through the uncurtained window. (A. Conan Doyle, The Hound of the Baskervilles)
Problems of supervised learning setting Coverage: –lemmas may be missing –frames may be missing Languages other than English: –Training data may not be available –Can we take advantage of existing resources for English?
Today’s talk Shalmaneser: a system for automatic frame-semantic analysis Unknown sense detection: dealing with missing frames Annotation projection for cross-lingual data creation Summary
Shalmaneser: Automatic frame-semantic analysis Assignment of –senses (frames) to predicates –semantic roles Aim: easy use, for exploring applications of frame-semantic analysis –Input: plain text –Syntactic preprocessing integrated –Visualization with SALTO tool
Shalmaneser: Automatic frame-semantic analysis Semantic analysis as supervised learning tasks –Pre-trained classifiers available for English (FrameNet) and German (SALSA) Performance of English models: –Frame assignment: accuracy 0.93, baseline 0.89 High baseline because some senses are missing –Role assignment: Role recognition F-score 0.75 Role labeling Accuracy 0.78 –Not top-scoring, but okay. Focus on ease of use and on flexibility.
Shalmaneser: Flexibility Processing steps linked only by interface format: Salsa/Tiger XML (Erk & Pado 2004) –Adding a module: just needs to speak Salsa/Tiger XML Model features specified in experiment file, can be changed easily Adding new parser by instantiating an interface class New language: only syntactic preprocessing changes
Detecting unknown word senses (frames) Example: Conan Doyle, The Hound of the Baskervilles. Syntax: Collins parser; Semantics: Shalmaneser Unseen senses: a normal WSD approach will assign a wrong sense Can we automatically detect senses we haven’t seen before?
Unknown sense detection as outlier detection Outlier detection: detect occurrences of previously unseen events (overview articles: Markou & Singh 2003a,b) –training data: positive cases only. Derive model of “normal” cases –test data: positive and negative cases (Figure legend: training items, test items)
A Nearest Neighbor-based outlier detection method Tax and Duin (2000): simple method, easy to implement Given test point x and its nearest training neighbor t: Is x closer to t than t is to its own nearest neighbor t'? –Test point x, nearest training neighbor t, nearest neighbor t' of t, (Euclidean) distances d: $p_{NN}(x) = d(x, t) / d(t, t')$ –Accept x if $p_{NN}(x)$ is below a given threshold
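The nearest-neighbor test can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: it assumes the score is the distance ratio d(x, t) / d(t, t'), and the function names, the toy points, and the default threshold of 1.0 are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, candidates):
    """Return the candidate closest to point."""
    return min(candidates, key=lambda c: euclidean(point, c))

def p_nn(x, training):
    """Assumed score: d(x, t) / d(t, t'), where t is the training point
    nearest to x and t' is the training point nearest to t."""
    if len(training) < 2:
        raise ValueError("need at least two training points")
    t = nearest(x, training)
    others = [p for p in training if p is not t]  # exclude t itself
    d_xt = euclidean(x, t)
    d_tt = euclidean(t, nearest(t, others))
    if d_tt == 0.0:  # t has an exact duplicate in the training set
        return 0.0 if d_xt == 0.0 else math.inf
    return d_xt / d_tt

def is_known(x, training, threshold=1.0):
    """Accept x as a known case if its score is below the threshold."""
    return p_nn(x, training) < threshold

# Toy data (hypothetical): three training points, two test points.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
near_point = (0.5, 0.0)
far_point = (5.0, 5.0)
```

Here `is_known(near_point, training)` accepts the point that lies between two training items, while `far_point` gets a score well above 1 and is flagged as an outlier. In practice the threshold would be tuned on held-out data.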
Unknown sense detection: Results Evaluation (Erk NAACL 2006): –Use FrameNet data –Treat one sense of a lemma as pseudo-unknown (iterate over all senses) Results (assignment of label “unknown”): –Tax & Duin’s method, one lemma at a time: Prec 0.70, Rec 0.35 –More data: all data for a frame, not just that of one lemma Prec 0.77, Rec 0.82
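The pseudo-unknown protocol and the precision/recall figures for the label "unknown" can be illustrated as follows. This is a sketch under assumptions: the slides do not say how the known test items are sampled, so the split below only separates the held-out sense from the rest, and all names and toy data are hypothetical.

```python
from collections import defaultdict

def pseudo_unknown_splits(instances):
    """For each sense of a lemma, hold that sense out as pseudo-unknown.

    instances: list of (features, sense) pairs for one lemma.
    Yields (held_out_sense, training_items, unknown_items).
    """
    by_sense = defaultdict(list)
    for features, sense in instances:
        by_sense[sense].append(features)
    for held_out in by_sense:
        train = [f for s, items in by_sense.items()
                 if s != held_out for f in items]
        yield held_out, train, by_sense[held_out]

def precision_recall(gold_unknown, predicted_unknown):
    """Precision and recall for the label 'unknown' (True = unknown)."""
    pairs = list(zip(gold_unknown, predicted_unknown))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data (hypothetical): one lemma with two senses.
instances = [("a", "S1"), ("b", "S1"), ("c", "S2")]
splits = list(pseudo_unknown_splits(instances))

gold = [True, True, False, False, True]
pred = [True, False, False, True, True]
precision, recall = precision_recall(gold, pred)
```

Each split trains only on the remaining senses, so every instance of the held-out sense should ideally be labeled "unknown" by the outlier detector.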
Results What features are important? 1. Best: just context words 2. Almost as good: features of 1, 3, 4 together 3. Just the subcategorization frame: high precision, low recall 4. Subcat frame, plus headwords of arguments: in between 3 and 2, but obviously too sparse
Unknown sense detection as outlier detection: The bigger picture Why assume missing word senses in the sense inventory and in the training data? –Growing, unfinished resources, like FrameNet –Domain-specific senses may be missing from general-purpose sense inventories Outlier detection method presented here: applicable to any resource that groups words into senses, e.g. WordNet Using outlier detection to detect occurrences of nonliteral use?
Motivation Definitions, Role set: Language-independent Predicate classes: Language-specific Annotated Sentences: Specific, too
Agenda For new language, induce: 1. Frame-semantic predicate classification 2. Corpus with frame-semantic annotation Method: Annotation projection in parallel corpus –Word alignments approximate semantic equivalence Corresponding word pairs (predicates) Corresponding constituents Evaluation: Study on EUROPARL corpus (De/En/Fr)
An idealised example English: “Peter comes home”; French: “Pierre revient à la maison”; Frame: Arriving
Frame-semantic classes Idea: For each frame, construct list of predicates in new language occurring aligned to predicates of this frame => FEEs (frame-evoking elements) for new languages Main obstacle: Translational divergence –Corresponding predicates don’t evoke same frame Address by shallow, language-independent filtering (Pado and Lapata AAAI 2005) –Important: Distributional patterns Evaluation: Can obtain predicate classes for German and French with precision of 65-70% –Main remaining problem: English polysemy not covered by FrameNet
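The basic idea of collecting target-language predicates per frame can be sketched as below. The count cutoff `min_count` is a hypothetical stand-in for filtering and is not the method of Pado and Lapata (AAAI 2005); the function name, the toy alignments, and the toy lexicon are likewise assumptions made for illustration.

```python
from collections import Counter, defaultdict

def induce_frame_lexicon(aligned_pairs, english_frames, min_count=2):
    """Collect target-language predicates aligned to English predicates
    of each frame, keeping those seen at least min_count times.

    aligned_pairs: iterable of (english_predicate, target_predicate).
    english_frames: dict mapping English predicate -> list of frames.
    """
    counts = defaultdict(Counter)
    for en_pred, tgt_pred in aligned_pairs:
        for frame in english_frames.get(en_pred, ()):
            counts[frame][tgt_pred] += 1
    return {
        frame: sorted(p for p, n in counter.items() if n >= min_count)
        for frame, counter in counts.items()
    }

# Toy data (hypothetical): English-French predicate alignments.
aligned_pairs = [
    ("come", "revenir"), ("come", "revenir"),
    ("arrive", "arriver"), ("arrive", "arriver"),
    ("come", "venir"),  # seen only once, dropped by the cutoff
]
english_frames = {"come": ["Arriving"], "arrive": ["Arriving"]}
lexicon = induce_frame_lexicon(aligned_pairs, english_frames)
```

A pure frequency cutoff like this does nothing about translational divergence or English polysemy, which is why a more informed filter is needed in practice.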
Role annotations (I) Idea: For each sentence, transfer semantic role annotation onto translated sentence Obstacle 1: Frame divergence –Role projection only sensible if frames match –Good news: In En-De test corpus (Pado and Lapata HLT/EMNLP 2005), 70% of frames match Obstacle 2: Role divergence –Even if frames are parallel, do roles match? –Good news: In En-De test corpus, matching frames show 90% role matches Remaining cases mostly elisions (e.g. passive)
Role annotations (II) Obstacle 3: Errors/omissions in automatically induced word alignments –Can be overcome by using bracketing information (chunks / constituents) –Induction of cross-lingual correspondences as graph optimisation problem (Pado and Lapata ACL 2006) Evaluation (all exact match F-score): –Word-based projection: 0.50 –Constituent-based: 0.75 –Upper limit: 0.85 Remaining errors mostly parsing-related
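Word-based projection, the simplest of the strategies compared above, can be sketched as follows. This is an illustrative toy, not the system evaluated in the talk: the token indices, the alignment links, and the role labels Theme and Goal for the frame Arriving are assumptions chosen for the example.

```python
def project_roles(source_roles, alignment):
    """Project role spans through word alignment links.

    source_roles: dict mapping role name -> set of source token indices.
    alignment: set of (source_index, target_index) links.
    Returns a dict mapping role name -> set of target token indices.
    """
    projected = {}
    for role, src_tokens in source_roles.items():
        tgt_tokens = {t for s, t in alignment if s in src_tokens}
        if tgt_tokens:  # roles with no aligned tokens are dropped
            projected[role] = tgt_tokens
    return projected

def exact_matches(projected, gold):
    """Roles whose projected span equals the gold span exactly."""
    return {role for role, span in projected.items()
            if gold.get(role) == span}

# Toy sentence pair (hypothetical alignment):
#   0 Peter  1 comes  2 home
#   0 Pierre 1 revient 2 à 3 la 4 maison
source_roles = {"Theme": {0}, "Goal": {2}}
alignment = {(0, 0), (1, 1), (2, 4)}
gold_target = {"Theme": {0}, "Goal": {2, 3, 4}}

projected = project_roles(source_roles, alignment)
matched = exact_matches(projected, gold_target)
```

In this toy case the Goal role lands only on "maison" and misses "à la", so it fails the exact-match criterion. Extending the projected span to the enclosing chunk or constituent is the kind of repair that bracketing information makes possible.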
Summary Frame-semantic analysis potentially interesting for many NLP applications –Goal of Shalmaneser: flexible and easy-to-use system Address incompleteness in resources –Unknown sense detection as outlier detection Porting Frame Semantics to new languages –Parallel corpora for automatic annotation projection