Supertagging
CMSC Natural Language Processing
January 31, 2006
Roadmap
- Motivation
- Tagging, Parsing & Lexicalization
- Supertags: Definition & Examples
- Parsing as disambiguation
- Structural filters
- Unigram & N-gram models
- Results & Discussion
Motivation: Good, Robust Parsing
Main approaches:
- Finite-state parsers
  - Shallow parsing; longest match resolves ambiguity
  - Hand-crafted; partitioned into domain-independent and domain-dependent parts
- Statistical parsers
  - Assign some structure to any string, with a probability
  - Automatically extracted from manual annotations
  - Not linguistically transparent, hard to modify
Lexicalization
Lexicalization:
- Integrates the syntax and semantics of lexical items
- Enforces subcategorization and semantic constraints
- Finite set of elementary structures (trees, strings, etc.), each anchored on a lexical item
- Each lexical item is associated with at least one elementary grammar structure
- A finite set of operations combines the elementary structures
Framework
FB-LTAG: Feature-Based, Lexicalized Tree-Adjoining Grammar
Elementary trees:
- Lexical item (anchor) on the frontier
- Provide a complex description of the anchor
- Specify the domain of locality for syntactic/semantic constraints
- Initial (non-recursive) and auxiliary (recursive) trees
- Combined by substitution and adjunction to build derived trees
Supertags
Inspired by POS taggers:
- Locally resolve much POS ambiguity before parsing (96-97% accurate)
- Use limited context, e.g. a trigram window
Elementary trees localize dependencies:
- All and only the dependent elements are in the tree
Supertag = elementary structure:
- Highly ambiguous: one supertag per lexical item per distinct use
- A word with a single POS (probably) still has many supertags, e.g. a word that is always a verb has many subcategorization frames (illustrated below)
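A hypothetical lexicon entry (Python) to make the ambiguity concrete; the word and the supertag names are purely illustrative, not taken from the slides or from XTAG:

```python
# Hypothetical lexicon entry: one word, one POS, many supertags.
supertags = {
    "sell": [                    # always a verb, but many distinct uses
        "t_transitive",          # "NP sells NP"
        "t_ditransitive",        # "NP sells NP NP"
        "t_passive",             # "NP is sold (by NP)"
        "t_object_extraction",   # "what NP sells"
        "t_relative_clause",     # "the car that NP sells"
        # ... one supertag per additional syntactic environment
    ],
}
```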
Extended Domain of Locality
- Each supertag must contain all and only the arguments of its anchor within the structure
- For each lexical item, the grammar must contain a supertag for each syntactic environment in which the item can appear
Factoring Recursion
- Recursive constructs are represented as auxiliary trees / supertags
- Initial supertags define domains of locality for agreement and subcategorization
- Auxiliary trees can capture long-distance dependencies via adjunction
Supertags: Ambiguity and Parsing
- A complete parse uses one supertag per word, so supertags must be selected before parsing
- Problem: massive ambiguity
- Solution: manage it with local disambiguation
  - Supertags localize dependencies
  - Apply local n-gram constraints before parsing
- POS disambiguation makes parsing easier; supertag disambiguation makes it trivial
  - Parsing reduces to combining the selected structures
Example
Structural Filtering
Simple local tests for whether a supertag can be used (see the sketch after this list):
- Span of supertag: the minimum number of lexical items it covers can't be larger than the input string
- Left/right span constraint: the span to the left or right of the anchor can't be longer than the corresponding part of the input
- Lexical items: a supertag can't be used if its terminals don't appear in the input
- Features, etc.
Reduces ambiguity by about 50% before parsing
- Verbs, especially light verbs, are worst: even after a 50% reduction, still > 250 supertags remain per POS
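A minimal Python sketch of these filters, assuming a hypothetical Supertag record that exposes the spans and terminals the tests need; the names and fields are illustrative:

```python
# A minimal sketch of the structural filters, under assumed data structures.
from dataclasses import dataclass

@dataclass
class Supertag:
    anchor: str           # the word this supertag is anchored on
    min_span: int         # minimum number of lexical items the tree covers
    left_span: int        # minimum items required to the left of the anchor
    right_span: int       # minimum items required to the right of the anchor
    terminals: frozenset  # other terminal words the tree mentions (e.g. "by")

def passes_filters(tag: Supertag, sentence: list[str], anchor_pos: int) -> bool:
    """Return True if the supertag survives the simple structural tests."""
    # Span filter: the tree cannot cover more items than the sentence has.
    if tag.min_span > len(sentence):
        return False
    # Left/right span filter: material required on either side of the anchor
    # cannot exceed what the input actually provides on that side.
    if tag.left_span > anchor_pos or tag.right_span > len(sentence) - anchor_pos - 1:
        return False
    # Lexical-item filter: every terminal the tree mentions must occur in the input.
    if not tag.terminals <= set(sentence):
        return False
    return True
```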
N-gram Supertagging: Initial Models
- Trigram model over (POS, supertag) pairs, trained on 5K WSJ sentences
  - 68% accuracy: small corpus, little smoothing
- Dependency model
  - Avoids a fixed context length: a dependency holds if one tree substitutes or adjoins into another
  - Limited by the size of the LTAG-parsed corpus
  - Too much like regular parsing
Smoothed N-gram Models: Data
Training data from two sources:
- XTAG parses of WSJ, IBM, and ATIS sentences
  - Small corpus, but clean TAG derivations
- Converted Penn Treebank WSJ sentences
  - Each lexical item is associated with a supertag using its parse
  - Requires heuristics over local tree contexts: labels of dominating nodes, siblings, and the parent's siblings
  - Only approximate
Smoothed N-gram Models: Unigram
Disambiguation redux: assume structural filters have been applied
Unigram approach (sketched below):
- Select the supertag for a word by its lexical preference: Pr(t|w) = freq(t,w) / freq(w)
  - i.e. the most frequent supertag for the word in the training data
- Unknown word: back off to the most frequent supertag for the word's POS
Results: 73-77% top-1 accuracy, vs. 91% for POS tagging
Errors: verb subcategorization, PP attachment, NP head/modifier
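A minimal Python sketch of the unigram model, assuming (word, POS, supertag) training triples; function and variable names are illustrative:

```python
# A minimal sketch of the unigram supertagger with POS backoff for unknown words.
from collections import Counter, defaultdict

def train_unigram(tagged_corpus):
    """tagged_corpus: iterable of (word, pos, supertag) triples."""
    word_tag = defaultdict(Counter)   # freq(t, w)
    pos_tag = defaultdict(Counter)    # freq(t, pos), used for backoff
    for word, pos, tag in tagged_corpus:
        word_tag[word][tag] += 1
        pos_tag[pos][tag] += 1
    return word_tag, pos_tag

def unigram_supertag(word, pos, word_tag, pos_tag):
    """Pick argmax_t Pr(t|w) = freq(t,w)/freq(w); back off to the word's POS."""
    if word in word_tag:
        return word_tag[word].most_common(1)[0][0]
    # Unknown word: most frequent supertag for its part of speech.
    return pos_tag[pos].most_common(1)[0][0]
```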
Smoothed N-gram Models: Trigram
Enhances the previous models:
- Lexicalized over (word, supertag) pairs, like the trigram model above
- Adds context beyond the unigram model
Ideally: T = argmax_T P(T1,...,Tn) * P(W1,...,Wn | T1,...,Tn)
In practice: a trigram approximation with Good-Turing smoothing and Katz backoff (see the decoding sketch below)
Results: up to 92% accuracy with 1M words of training data
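A Viterbi-style decoding sketch for the trigram model, assuming smoothed transition and emission probability functions are supplied externally; the function signatures and start-symbol padding are assumptions, not from the slides:

```python
# A minimal Viterbi decoder for the trigram supertag model.
import math

def viterbi_trigram(words, candidate_tags, p_trans, p_emit):
    """words: list of tokens; candidate_tags(w) returns the supertags that survive
    the structural filters. Returns argmax_T P(T1..Tn) * P(W1..Wn | T1..Tn)."""
    # Chart keyed by the last two supertags; value: (log-prob, tag sequence so far).
    chart = {("<s>", "<s>"): (0.0, [])}
    for w in words:
        next_chart = {}
        for (t2, t1), (score, path) in chart.items():
            for t in candidate_tags(w):
                # p_trans(t, t2, t1) ~ P(t | t2, t1); p_emit(w, t) ~ P(w | t),
                # both assumed smoothed (e.g. Good-Turing counts, Katz backoff).
                s = score + math.log(p_trans(t, t2, t1)) + math.log(p_emit(w, t))
                if (t1, t) not in next_chart or s > next_chart[(t1, t)][0]:
                    next_chart[(t1, t)] = (s, path + [t])
        chart = next_chart
    # Best complete supertag sequence over all final two-tag states.
    return max(chart.values(), key=lambda v: v[0])[1]
```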
Supertagging & Parsing
Supertagger as a front-end to the parser, analogous to POS tagging; key step: disambiguate supertags
A) Pick the single most probable supertag per word by the trigram method
   - 4 vs. 120 seconds per sentence to parse
   - If a tag is wrong, the parse is wrong
B) Pick the n-best supertags per word (sketched below)
   - Recovers many more parses, but is slower
   - Still fails on ill-formed input
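A small sketch of option B, assuming a probability function over the filtered candidates for each word; the names and default n are illustrative:

```python
# Keep the n most probable supertags per word instead of committing to one,
# then hand all of them to the parser (more coverage, less speed than 1-best).
import heapq

def n_best_supertags(word, candidates, prob, n=3):
    """Return the n highest-probability supertags for `word`."""
    return heapq.nlargest(n, candidates, key=lambda t: prob(t, word))
```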
Conclusion
- Integrates a lightweight n-gram approach with a rich, lexicalized representation
- N-gram supertagging produces an "almost parse"
- Good tagging and parsing effectiveness: > 90%
- Faster than XTAG parsing by a factor of 30
- Remaining issues: treebank conversion, ill-formed input, etc.
Applications
- Lightweight dependency analysis
- Enhanced information retrieval
  - Exploits the syntactic structure of the query
  - Can improve precision from 33% to 79%
- Supertagging in other formalisms, e.g. CCG
Lightweight Dependency Analysis
Heuristic, linear-time, deterministic (sketched below):
- Pass 1: modifier supertags; Pass 2: non-modifier supertags
- Computing dependencies for supertag s (with anchor w):
  - For each frontier node d in s, connect a word to the left or right of w, chosen by position
  - Label the arc to d with the internal node label
- Precision = 82.3, Recall = 93.8
- Robust to sentence fragments
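A minimal Python sketch of the two-pass analysis, assuming each token carries its selected supertag's frontier slots and modifier status; the exact linking rule shown (attach to the adjacent word on the given side) is a simplification of the positional heuristic, and all names are illustrative:

```python
# Two-pass lightweight dependency analysis over selected supertags.
from dataclasses import dataclass, field

@dataclass
class Token:
    word: str
    is_modifier: bool                           # modifier (auxiliary) supertag or not
    slots: list = field(default_factory=list)   # frontier slots, e.g. [("left", "NP")]

def lda(tokens):
    """Return dependency arcs as (head_index, dependent_index, label) triples."""
    arcs = []
    def link(i, tok):
        for direction, label in tok.slots:
            # Simplification: connect to the adjacent word on the given side.
            j = i - 1 if direction == "left" else i + 1
            if 0 <= j < len(tokens):
                arcs.append((i, j, label))
    # Pass 1: modifier supertags; Pass 2: non-modifier supertags.
    for i, tok in enumerate(tokens):
        if tok.is_modifier:
            link(i, tok)
    for i, tok in enumerate(tokens):
        if not tok.is_modifier:
            link(i, tok)
    return arcs
```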