A Cascaded Finite-State Parser for German
Michael Schiehlen
Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
EACL 2003, Budapest
April 17th, 2003

Dependency-Based Evaluation
- every word either depends on another word (the head) or is independent
- parsing seen as a classification task (Lin:95)
- measured in (labelled) precision and recall: assign to every word either a pair of head and grammatical role or the marker TOP (for independent words)
- unlabelled precision and recall: neglect the grammatical role, i.e. only assign the head or TOP
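
The evaluation scheme above can be sketched in a few lines. This is only an illustration under assumptions: the function name `precision_recall`, the dictionary representation, and the role names `SUBJ` and `OBJ` are hypothetical and do not come from the talk.

```python
TOP = "TOP"  # marker for independent words

def precision_recall(gold, predicted, labelled=True):
    """gold and predicted map a token index to (head_index, role) or TOP.

    Tokens missing from `predicted` count as not assigned, so they lower
    recall but not precision.
    """
    def key(value):
        if value == TOP:
            return TOP
        head, role = value
        # Unlabelled evaluation neglects the grammatical role.
        return (head, role) if labelled else head

    correct = sum(
        1 for tok, value in predicted.items()
        if tok in gold and key(value) == key(gold[tok])
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical three-token sentence; role names are made up for the example.
gold = {0: (1, "SUBJ"), 1: TOP, 2: (1, "OBJ")}
pred = {0: (1, "OBJ"), 1: TOP}          # token 2 is left unassigned

labelled_scores = precision_recall(gold, pred)
unlabelled_scores = precision_recall(gold, pred, labelled=False)
```

In the example, token 0 has the right head but the wrong role, so it only counts as correct in the unlabelled scores.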

Dependency Structure (Details)
- PPs: headed by the internal argument (NP), not by the preposition
- coordination: multi-headed constituent
  - every conjunct is a head
  - conjunction only linked to final conjunct
- verb complex (auxiliary verbs + full verb): abstraction over verb complexes
  - all attachments into the verb complex are counted as correct (Lin:95)

Test Environment
- tokenized version of the NEGRA treebank
- ca. 340,000 tokens in 19,547 sentences
- investigated effect of POS tagging quality
  - I: ideal tags from the treebank
  - L: lexicon tags from a tagger trained on the treebank
  - T: tagger tags as determined by a tagger trained on an independent corpus

Baseline: Tagging Approach
- determine dependency tuples directly
- used TreeTagger (Schmid:94) on tag trigrams
- three approaches to encode the head
  - exact position of head: pos_head
  - distance of head from dependent: pos_head - pos_dep
  - nth-tag method (Lin:95), e.g. <<<N (third noun to the left)
    - category of head
    - direction in which to find head from token
    - number of words with same category between token and head
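
The distance and nth-tag encodings can be illustrated as follows. This is a minimal sketch, not the code used in the talk: the function names are hypothetical, the tags are simplified one-letter categories rather than the treebank tag set, and it assumes that the number of arrows gives the rank of the head among same-category words in the given direction (so `<<<N` is the third noun to the left).

```python
def encode_distance(dep_pos, head_pos):
    """Distance encoding: pos_head - pos_dep."""
    return head_pos - dep_pos

def encode_nth_tag(tags, dep_pos, head_pos):
    """nth-tag encoding: direction arrows followed by the head's category."""
    category = tags[head_pos]
    if head_pos < dep_pos:
        span, arrow = tags[head_pos:dep_pos], "<"
    else:
        span, arrow = tags[dep_pos + 1:head_pos + 1], ">"
    # Count same-category words from the dependent up to and including the head.
    return arrow * span.count(category) + category

def decode_nth_tag(tags, dep_pos, code):
    """Find the head position for a code such as '<<<N'; None if not found."""
    arrow = code[0]
    n = len(code) - len(code.lstrip(arrow))
    category = code[n:]
    step = -1 if arrow == "<" else 1
    pos = dep_pos + step
    while 0 <= pos < len(tags):
        if tags[pos] == category:
            n -= 1
            if n == 0:
                return pos
        pos += step
    return None  # no head found: the token stays unassigned

# Simplified, made-up tag sequence.
tags = ["N", "V", "N", "ADJ", "N", "P"]
code = encode_nth_tag(tags, dep_pos=5, head_pos=0)   # '<<<N'
head = decode_nth_tag(tags, dep_pos=5, code=code)    # 0
```

When decoding fails, the sketch returns `None`, which corresponds to a token that receives no head.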

Tagging Approach (contd.)
- hybrid method: choose between nth-tag and distance result on the basis of POS tag
  - build decision list greedily so as to optimize F-value in training set (using 10-fold cross-validation)
- all results achieved by 10-fold cross-validation
- if no head is found, the token counts as not assigned (=> precision usually higher than recall)
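
One way to picture the greedy construction of the decision list is sketched below. The details are assumptions, not the procedure from the talk: the sketch starts from nth-tag as the default, repeatedly adds the single rule "POS tag -> distance" that raises the F-value most, and stops when no rule helps. The function names, the statistics format, and all counts are invented for illustration, and cross-validation is left out.

```python
def f_value(correct, assigned, total):
    """Harmonic mean of precision (correct/assigned) and recall (correct/total)."""
    if correct == 0:
        return 0.0
    p, r = correct / assigned, correct / total
    return 2 * p * r / (p + r)

def build_decision_list(stats, totals, default="nth-tag", other="distance"):
    """stats[tag][method] = (correct, assigned); totals[tag] = number of tokens."""
    choice = {tag: default for tag in stats}

    def score(ch):
        correct = sum(stats[t][m][0] for t, m in ch.items())
        assigned = sum(stats[t][m][1] for t, m in ch.items())
        return f_value(correct, assigned, sum(totals.values()))

    rules, best = [], score(choice)
    while True:
        # Try switching each remaining tag to the other method.
        candidates = [
            (score({**choice, tag: other}), tag)
            for tag in stats if choice[tag] == default
        ]
        if not candidates:
            break
        new_best, tag = max(candidates)
        if new_best <= best:
            break  # no single rule improves the F-value any further
        choice[tag] = other
        rules.append((tag, other))
        best = new_best
    return rules, best

# Invented training counts per POS tag and method: (correct, assigned).
stats = {
    "NN":  {"nth-tag": (80, 90),  "distance": (70, 95)},
    "ART": {"nth-tag": (60, 100), "distance": (90, 100)},
    "ADV": {"nth-tag": (40, 50),  "distance": (45, 60)},
}
totals = {"NN": 100, "ART": 100, "ADV": 60}

rules, best_f = build_decision_list(stats, totals)
```

With these invented counts, the list first switches ART and then ADV to the distance method, while NN keeps the nth-tag result.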

Results for Tagging Approach

Overview of Finite-State Parser

Recognition Phase
- consists of cascaded deterministic transducers (like Abney:97)
- noun chunker also recognizes nested noun phrases (`full noun chunks')
- inflectional information checked on-line
- clause chunker recognizes complete clauses, not simplex clauses as in Abney:97

Example Output of Noun Chunker

Example Output of Clause Chunker

Rule Interpretation
- inserts
  - syntactic structure (AdjP, coordinated VP or Prep)
  - grammatical roles (13 different roles)
- recognition grammar generated from interpretation grammar by removing semicolon symbols, e.g.
    det ;SPR ( ;[ADJP ( adv ;ADJ )* adja ;HD ;]ADJP )* nn ;HD FINAL:NP
- nondeterministic transducer (like Abney:97)
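
The derivation of a recognition rule from an interpretation rule can be illustrated on the example rule above. The sketch assumes that rule symbols are separated by whitespace and that "semicolon symbols" are exactly the symbols starting with `;`. The function name `recognition_rule` is hypothetical.

```python
def recognition_rule(interpretation_rule):
    """Drop every whitespace-separated symbol that starts with ';'."""
    return " ".join(
        symbol for symbol in interpretation_rule.split()
        if not symbol.startswith(";")
    )

rule = "det ;SPR ( ;[ADJP ( adv ;ADJ )* adja ;HD ;]ADJP )* nn ;HD FINAL:NP"
stripped = recognition_rule(rule)
print(stripped)  # det ( ( adv )* adja )* nn FINAL:NP
```

What remains is a plain pattern over POS tags, while the removed symbols carry the structure and role annotations.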

Example Output of Rule Interpreter

Subcat Frame Recognition
- deterministic transducer to find lexically given subcategorization frames
- fine-grained distinction of complements (61 additional roles), partially disambiguates between adjuncts and complements
- if no corresponding frame is found, the unspecified role (CMP, ACMP) remains
  - only correct in half-labelled precision and recall
- several frames can be encoded at once

Example Output of Frame Recognizer

Conversion into Dependency Tuples
- explicit representation of ambiguities (subcat roles and attachment) with context variables
- measuring performance of parsers with underspecified output (Riezler et al.:02)
  - lower bound: random disambiguation
  - upper bound: ideal disambiguation
- also heuristic disambiguation: choose highest attachment and most frequent subcat frame
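
The two bounds can be sketched for precision as follows. This is an illustration under assumptions, not the evaluation code from the talk: each token is given a list of distinct candidate analyses, the upper bound counts a token as correct if any candidate matches the gold analysis, and the lower bound is taken as the expected score of picking one candidate uniformly at random. The function name `precision_bounds`, the data format, and the gold analyses are hypothetical.

```python
def precision_bounds(gold, candidates):
    """gold: token -> analysis; candidates: token -> list of distinct analyses.

    Returns (lower, upper) precision over the tokens in `candidates`.
    """
    hits = [opts for tok, opts in candidates.items() if gold.get(tok) in opts]
    upper = len(hits)                            # ideal disambiguation
    lower = sum(1 / len(opts) for opts in hits)  # expected under random choice
    n = len(candidates)
    return lower / n, upper / n

# Hypothetical fragment: analyses are (head position, role) pairs.
gold = {0: (1, "NPnom"), 2: (5, "SPR"), 7: (5, "ADJ")}
candidates = {
    0: [(1, "NPnom"), (1, "NPakk")],   # ambiguous subcat role
    2: [(5, "SPR")],                   # unambiguous
    7: [(1, "ADJ"), (5, "ADJ")],       # ambiguous attachment
}
lower, upper = precision_bounds(gold, candidates)
```

Because the lower bound is an expectation, treating tokens independently gives the same value even when one context variable ties several choices together.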

Example Output: Dependency Tuples

  Udo/0     kennt/1   [1a]:NPnom,[1b]:NPakk
  kennt/1   TOP
  eine/2    Frau/5    SPR
  sehr/3    nette/4   ADJ
  nette/4   Frau/5    ADJ
  Frau/5    kennt/1   [1a]:NPakk,[1b]:NPnom
  aus/6     Rio/7     MRK
  Rio/7     kennt/1   ADJ [1A0]
            Frau/5    ADJ [1A1]
  ./8       TOP

Results for Finite-State Parser

Conclusion
- two approaches to partial parsing: tagger, finite-state parser
- hybrid model of nth-tag tagging and finite-state parsing achieves % on I-tags (gain of 4.8% in lower and 1% in upper bound)
- some constructions not yet handled in parser
  - attachment of extraposed relative clauses and noun-complement clauses
  - distribution of constituents in the middle field under VP coordination