NAACL 2007 Treebank-Based Acquisition of Multilingual LFG Resources
Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer
Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland
Treebank Workshop, NAACL 2007
Lexical-Functional Grammar (LFG)
“Shallow” grammar: defines a language (a set of strings)
“Deep” grammar: as above, plus a mapping from strings to a “meaning” representation (predicate-argument structure, dependencies, simple logical form, …); usually involves some form of long-distance dependency (LDD) resolution
Deep grammars (HPSG, LFG, CCG, TAG, …) are usually hand-crafted
Very difficult and expensive to scale to unrestricted text
Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)!
LFG: [Kaplan and Bresnan, 82; Dalrymple, 2001; Bresnan, 2001]
–Constraint-based (“unification”), lexicalised
–c(onstituent)-structure and f(unctional)-structure
–c-str: surface configuration (CFG trees)
–f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)
–f-str: AVM (feature-structure) encoding of dependencies/pred-arg structure
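
As a rough illustration of the two levels of representation, the sketch below writes a c-structure as nested tuples and an f-structure as a nested dictionary standing in for an AVM. The sentence, the feature names other than SUBJ and OBJ, and the helpers `yield_of` and `dependencies` are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch (assumed example): c-structure vs. f-structure for "John saw Mary".

# c-structure: a plain CFG tree, written as (label, child, ...) tuples.
c_structure = (
    "S",
    ("NP", ("NNP", "John")),
    ("VP", ("VBD", "saw"), ("NP", ("NNP", "Mary"))),
)

# f-structure: an attribute-value matrix, here a nested dict.
# TENSE and NUM are illustrative features chosen for this example.
f_structure = {
    "PRED": "see",
    "TENSE": "past",
    "SUBJ": {"PRED": "John", "NUM": "sg"},
    "OBJ": {"PRED": "Mary", "NUM": "sg"},
}


def yield_of(tree):
    """Return the words at the leaves of a c-structure, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(yield_of(child))
    return words


def dependencies(fs):
    """Flatten an f-structure into (head PRED, attribute, value) triples."""
    triples = []
    for attr, value in fs.items():
        if attr == "PRED":
            continue
        if isinstance(value, dict):
            triples.append((fs["PRED"], attr, value["PRED"]))
            triples.extend(dependencies(value))
        else:
            triples.append((fs["PRED"], attr, value))
    return triples


print(yield_of(c_structure))
print(dependencies(f_structure))
```

The c-structure keeps word order and constituency, while the f-structure abstracts away from both and exposes the grammatical functions directly as dependency-like triples.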
Lexical-Functional Grammar (LFG)
Lexical-Functional Grammar (LFG)
Treebank: trees
How do we get from trees to f-structures?
What’s missing is the equations!
Automatic f-structure annotation algorithm
–Traverses the tree and assigns LFG equations
–Principle-based c-str/f-str interface
F-Structure Annotation Algorithm
Algorithm exploits:
–Categorial information (NP, VP, VBZ, …)
–Configurational information: local head, left/right of head. Leftmost NP sister to the right of the V(erbal) head: (↑ OBJ)=↓
–Morphological information: Him: (↑ OBJ)=↓
–“Functional” tag information: -LGS (↑ PASSIVE)=+, -SBJ, -CLR, …
–Trace/co-indexation information: translate traces + co-indexation into the corresponding re-entrancies at f-structure
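
The configurational and functional-tag cues above can be sketched as a function that annotates the daughters of one local tree. This is a heavily simplified illustration, not the actual annotation algorithm: the function name `annotate_local_tree`, the three rules it applies, and the adjunct default for everything else are assumptions made for the example.

```python
# Minimal sketch (assumed rules): annotate one local tree with LFG equations,
# using the head position, left/right context and Penn-style functional tags.

def annotate_local_tree(parent, daughters, head_index):
    """Return (category, equation) pairs for the daughters of `parent`.

    `daughters` is a list of category labels such as "NP-SBJ" or "VBD";
    `head_index` is the position of the head daughter (assumed to be known).
    """
    annotations = []
    obj_assigned = False
    for i, cat in enumerate(daughters):
        base, _, tag = cat.partition("-")
        if i == head_index:
            equation = "↑=↓"                # the head shares the mother's f-structure
        elif tag == "SBJ":
            equation = "(↑ SUBJ)=↓"         # functional tag information
        elif parent == "VP" and i > head_index and base == "NP" and not obj_assigned:
            equation = "(↑ OBJ)=↓"          # leftmost NP to the right of the verbal head
            obj_assigned = True
        else:
            equation = "↓∈(↑ ADJN)"         # assumed default: treat as adjunct
        annotations.append((cat, equation))
    return annotations


print(annotate_local_tree("S", ["NP-SBJ", "VP"], 1))
print(annotate_local_tree("VP", ["VBD", "NP", "PP"], 0))
```

A real annotation algorithm would need many more rules per category, plus the morphological and trace information listed above; the point here is only that each equation is chosen from the daughter's category, its tag, and its position relative to the head.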
F-Structure Annotation Algorithm
Head-Lexicalization [Magerman, 1994]
Left-Right Context Annotation Principles
Coordination Annotation Principles
Catch-All and Clean-Up
Traces
Proto F-Structures
Proper F-Structures
Lemmatization + Macros
Lexical Entries
Defaults – “Functional Tags”
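
Head-lexicalization requires picking a head daughter in every local tree. The sketch below shows the general shape of priority-based head rules; the `HEAD_RULES` table is a small illustrative assumption and not the published rule set, and `find_head` is a hypothetical helper name.

```python
# Minimal sketch (assumed rule table): priority-based head finding.

# For each mother category: a search direction and a priority list of head categories.
# These entries are illustrative only.
HEAD_RULES = {
    "S": ("right", ["VP", "S"]),
    "VP": ("left", ["VBD", "VBZ", "VBP", "VB", "MD", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left", ["IN", "TO"]),
}


def find_head(parent, daughters):
    """Return the index of the head daughter of a local tree."""
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    indices = list(range(len(daughters)))
    if direction == "right":
        indices.reverse()
    # Try each candidate head category in priority order, scanning in the given direction.
    for cat in priorities:
        for i in indices:
            if daughters[i].split("-")[0] == cat:
                return i
    # Fallback: first daughter in the scan direction.
    return indices[0]


print(find_head("VP", ["VBD", "NP", "PP"]))
print(find_head("S", ["NP-SBJ", "VP"]))
```

Once the head is known, the remaining daughters can be classified by whether they sit to its left or its right, which is what the left-right context annotation principles rely on.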
Treebank Annotation: Control & Wh-Rel. LDD
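
A re-entrancy means that two paths in the f-structure lead to one and the same value. The sketch below illustrates this for a control construction, assuming a proto f-structure in which a co-indexed trace is stored as a `{"TRACE": i}` placeholder; that encoding, the example sentence and the helper `resolve_traces` are assumptions made for illustration.

```python
# Minimal sketch (assumed encoding): turning a co-indexed trace into a re-entrancy.
# Example: "John tries to leave", where the subject of "leave" is controlled by "John".

john = {"PRED": "John"}

# Proto f-structure: the embedded subject is still an unresolved trace with index 1.
proto = {
    "PRED": "try",
    "SUBJ": john,
    "XCOMP": {"PRED": "leave", "SUBJ": {"TRACE": 1}},
}


def resolve_traces(fs, antecedents):
    """Replace {"TRACE": i} placeholders by the antecedent f-structure with index i."""
    for attr, value in list(fs.items()):
        if isinstance(value, dict):
            if "TRACE" in value:
                fs[attr] = antecedents[value["TRACE"]]   # share the object, do not copy it
            else:
                resolve_traces(value, antecedents)
    return fs


proper = resolve_traces(proto, {1: john})

# Both paths now point at the very same f-structure.
print(proper["XCOMP"]["SUBJ"] is proper["SUBJ"])
```

Sharing the object instead of copying it is the design choice that matters: any information later added to the matrix subject is automatically visible at the embedded subject position as well.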
Multilingual Treebank-Based LFG Resources
English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB resources (QuestionBank), transfer
Pilots/proof of concept for multilingual treebank-based LFG acquisition:
–German: TIGER (Cahill et al 2003, 2005)
–Chinese: CTB (Burke et al 2004)
–Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)
GramLab Project ( ): Chinese, Japanese, Arabic, Spanish, French and German
Multilingual Treebank-Based LFG Resources

Language   Treebank    Size              Coding/Data
English    Penn-II     50,000            CFG+traces+FT
Chinese    CTB 5.1     18,000            CFG+traces+FT
Japanese   KTC 4.0     38,000            Dep (+traces)
German     TIGER 2.0   50,000            Graphs+CFG+Dep
German     TüBa-D/Z    22,000            CFG+Dep+f-traces
Spanish    Cast3LB     3,500             CFG+Dep+f-traces
Arabic     ATB         300,000 (words)
French     P7T         20,000            CFG+Dep+f-traces
                       > 200,000
Q2
What was missing in the TB resource?
–F-structures, pred-argument structure, dependencies => f-structure annotation algorithm
–Limited domain in Penn-II (most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)
–GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …
What was done by hand?
–F-structure annotation algorithm (principle-based c-/f-str interface)
–No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG – but see P7T)
–No manual additions (unlike CCG/HPSG/TAG)
–Future work …
Q3
Methodological Issues - Quality Assurance:
Evaluation against hand-crafted/corrected Gold Standard DepBanks
–PARC 700
–CBS 500
–PropBank
–Own Gold Standard DepBanks for: English, Chinese, Japanese, German, Arabic, Spanish, French ( )
CCG-style evaluation against automatically annotated Gold (Silver-) Standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)
Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
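
Evaluation against a dependency bank is typically reported as precision, recall and f-score over dependency triples. The sketch below shows that computation under simplifying assumptions: triples are treated as sets (duplicates are ignored), the example triples are invented, and the function name `evaluate_triples` is hypothetical.

```python
# Minimal sketch (assumed set-based matching): precision/recall/f-score over
# dependency triples of the form (relation, head, dependent).

def evaluate_triples(gold, test):
    """Return (precision, recall, f_score) of `test` triples against `gold` triples."""
    gold, test = set(gold), set(test)
    matched = len(gold & test)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented example: the parser gets the adjunct attachment wrong.
gold = [("SUBJ", "see", "John"), ("OBJ", "see", "Mary"),
        ("TENSE", "see", "past"), ("ADJN", "see", "yesterday")]
test = [("SUBJ", "see", "John"), ("OBJ", "see", "Mary"),
        ("TENSE", "see", "past"), ("ADJN", "Mary", "yesterday")]

print(evaluate_triples(gold, test))
```

Scores can also be broken down per relation (SUBJ, OBJ, ADJN, …) by filtering both triple sets on the relation name before calling the function.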
Q4
Phrase Structure or Dependencies? Both!
Why?
Phrase structure is good for parsing and generation => tap into lots of mature, efficient and well-understood technology (but see dependency parsing)
Dependencies are close to f-structures/predicate-argument structures …
–Penn-II: CFG trees + traces/co-indexation + “functional” labels/tags
–TIGER: graphs + CFG categories + grammatical function labels + LDDs through crossing edges
–Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths
Q5 & Q6
Pros/Cons of a Formalism-Specific Treebank?
–Formalism-specific treebank? Bad! Limits usefulness/user group/…
–Better to have a generic TB with CFG + Dep labels + LDDs + other feature labels (as required), and then extract LFG/HPSG/CCG/TAG/Dependency Grammars
Grammar First vs. Treebank First?
–Depends on what you want to do …
–If you want high-quality, wide-coverage resources (that can parse unrestricted text), then it’s definitely better to do treebanking first (or use bootstrapping)
–Problem: many traditionally trained linguists see treebanking as a menial task
–It is a highly qualified and interesting task: empirical linguistics, confronting rather than inventing data
–Sociological task: how to make treebanking/bootstrapping sexy?
Some Resources
ESSLLI 2006 course material: Treebank-Based Acquisition of LFG, HPSG and CCG Resources. J. van Genabith, Y. Miyao and J. Hockenmaier
LFG parser demo:
A. Cahill and J. van Genabith. Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations. COLING/ACL 2006, Sydney, Australia
J. Judge, A. Cahill and J. van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions. COLING/ACL 2006, Sydney, Australia
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks. Computational Linguistics, 2005
A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Kluwer Academic Press, 2005
R. O'Donovan, A. Cahill, J. van Genabith and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way. Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. Proceedings of the PACLING-18 Conference, Waseda University, Tokyo, Japan, pages , 2004
A. Cahill, M. Burke, R. O'Donovan, J. van Genabith and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of ACL-04, pp , Barcelona, Spain, 2004
A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway-King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA., pp