
2 Experiments in German Noun Chunking
Michael Schiehlen
Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
mike@ims.uni-stuttgart.de
COLING 2002, Taipei, August 27th, 2002

3 Definition of Chunks (Abney, 1993)
IMS Stuttgart COLING 2002 August 27th, 2002 © Michael Schiehlen
- chunk: maximal string containing a major head, dominated by the root, not contained in another chunk
- major head: a content word not between a function word f and the word selected by f
- root: highest node with the major head as semantic head

4 Base Noun Chunks
A base noun chunk is a chunk with a noun as major head. (Base noun chunks are core NPs and PPs.)
The underlying grammar assumes null determiners and empty nouns:
- null determiner: "Ø poor people" forms a base noun chunk
- empty noun: "the poor Ø" forms a base noun chunk

5 Problems with Coordination
- if chunks may be multi-headed
- if conjunctions are excluded from chunks by Abney's definition

6 Disambiguation in a Cascaded Finite-State Parser
- POS ambiguities are resolved by the POS tagger.
- PP attachment ambiguities are kept underspecified.
- All other ambiguities are resolved using the longest-match criterion (Abney, 1993): chunks should be as long as possible.
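
The longest-match criterion can be illustrated with a small sketch. It is not the grammar used in the talk: the chunk pattern (optional article, any number of attributive adjectives, one or more nouns) and the STTS-style tag names are assumptions chosen for illustration only.

```python
# Toy longest-match noun chunker over POS tags.
# Assumed pattern (hypothetical, not the talk's grammar): ART? ADJA* (NN|NE)+

def match_noun_chunk(tags, start):
    """Return the exclusive end index of the longest chunk starting at
    `start`, or `start` itself if no chunk begins there."""
    i = start
    if i < len(tags) and tags[i] == "ART":      # optional determiner
        i += 1
    while i < len(tags) and tags[i] == "ADJA":  # any number of adjectives
        i += 1
    first_noun = i
    while i < len(tags) and tags[i] in ("NN", "NE"):  # consume all nouns
        i += 1
    return i if i > first_noun else start       # at least one noun required


def chunk_longest_match(tagged):
    """Scan left to right, always taking the longest chunk at each position."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    chunks, pos = [], 0
    while pos < len(tags):
        end = match_noun_chunk(tags, pos)
        if end > pos:
            chunks.append(words[pos:end])
            pos = end          # deterministic: never revisit consumed tokens
        else:
            pos += 1
    return chunks


sentence = [("der", "ART"), ("große", "ADJA"), ("Hund", "NN"),
            ("kennt", "VVFIN"), ("Anna", "NE")]
print(chunk_longest_match(sentence))  # [['der', 'große', 'Hund'], ['Anna']]
```

Each token is inspected a bounded number of times and no alternatives are stored, which is the property that keeps a parser of this kind deterministic and linear.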

7 System Overview

8 Determining Predicate-Argument Structure in German
In German, case is important, not position!
    Der/Den Hund kennt Anna.
    the dog knows Anna
    (The dog knows Anna. / Anna knows the dog.)
Case is determined jointly by determiners, adjectives and nouns.
- der/den hohen Schäden (the heavy damages, gen.pl/dat.pl)
- der große/großen Felsen (the large rock(s), nom.sg/gen.pl)
- den großen Stein/Steinen (the large stone(s), acc.sg/dat.pl)
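
One way to picture how determiner, adjective and noun jointly fix case is to intersect the readings each word form allows. The sketch below does this for the first example. The mini-lexicon is hypothetical and simplified: it records only case and number and ignores gender and declension class, which a real morphology would need.

```python
# Hypothetical toy lexicon: word form -> possible "case.number" readings.
# Simplification (assumption): gender and strong/weak declension are ignored.
LEXICON = {
    "der": {"nom.sg", "gen.sg", "dat.sg", "gen.pl"},
    "den": {"acc.sg", "dat.pl"},
    "hohen": {"acc.sg", "gen.sg", "dat.sg",
              "nom.pl", "acc.pl", "gen.pl", "dat.pl"},
    "Schäden": {"nom.pl", "acc.pl", "gen.pl", "dat.pl"},
}


def agree(words, lexicon=LEXICON):
    """Intersect the readings of all words; an empty set means the
    words cannot form an agreeing noun chunk."""
    readings = set(lexicon[words[0]])
    for word in words[1:]:
        readings &= lexicon[word]
    return readings


print(agree(["der", "hohen", "Schäden"]))  # {'gen.pl'}
print(agree(["den", "hohen", "Schäden"]))  # {'dat.pl'}
```

No single word is unambiguous here; only the combination leaves exactly one reading, matching the gen.pl/dat.pl contrast on the slide.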

9 Problem: Center-Embedding
The words needed to compute case may be separated by other (embedded) noun chunks.
    [der/den [mehrere Milliarden Euro] hohen Schäden]
    the several billions Euro high damages
    (the damages amounting to several billion Euro)
Base noun chunks may be ungrammatical.
    {die} {im Alter} {nachlassenden Kräfte}
    the in-the age diminishing forces
    (the strength diminishing in old age)

10 Definition of Full Noun Chunks (1)
Part of the NP between determiner and (first) head noun (Schmid and Schulte im Walde, 2000)
- includes names: {the discoverer Christopher Columbus}
- but not coordinated NPs: {parts} {of Scotland} and {Northern Ireland}
- and not appositions: {Christopher Columbus}, {the famous discoverer},

11 Definition of Full Noun Chunks (2)
NP/PP stripped of adverbials at the front and of PPs and relative clauses at the back (Brants, 1999)
- coordinations (attachment ambiguity!) and appositions
    {? parts {? of Scotland and Northern Ireland}
- pre- and postnominal genitives
    {{Marias} Version {der Geschichte}}
    Mary's version of the story
- measure phrases
    {{20 Dollar} Strafe}
    20 dollars penalty (a penalty of 20 dollars)

12 Recognizing Full Noun Chunks (1)
Explicit representation of ambiguities (potential noun chunks)
- used in previous work on full noun chunking (Brants, 1999; Schmid and Schulte im Walde, 2000; Kermes and Evert, 2002)
- drawback: requires search
  - The parser is no longer deterministic.
  - Linear complexity is lost.

13 Recognizing Full Noun Chunks (2)
A new method retaining determinism and linear complexity:
- recognize base noun chunks that could form the beginning, middle or end of a full noun chunk
- discard those noun chunks (monotonicity lost!)
- re-apply the original noun chunk transducer

14 Recognizing Recursive NPs by Non-Monotonic Cascades

15 Three Approaches to Agreement Checking in FS Parsers
1. add agreement info to POS tags and compile the grammar out (drawback: explosion of the transition table)
2. postpone agreement check until after chunk recognition (Abney, 1997)
3. interleave agreement checking with chunking (Neumann et al., 2000); problems with subcategorizing multi-words:
    um Gottes willen (for God's sake)
    um takes acc., um-willen takes gen.!

16 Evaluation: Test Data
Gold standard: NEGRA tree bank
- 321,000 tokens
- 100,974 base noun chunks
- 78,942 full noun chunks
Structure of full noun chunks not considered.
Extracted agreement information not considered.
Same test data were used by Brants (1999) and Kermes and Evert (2002).

17 Evaluation: Baseline
Baseline: statistical knowledge-free method of Ramshaw and Marcus (1995)
Instead of the Brill tagger, the tree tagger (Schmid, 1994) was used.
- precision/recall on R&M's test data with the tree tagger: 90.7/91.2% (R&M got 91.8/92.3%)
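
For readers who want to relate precision/recall pairs to the F-values reported on later slides, here is a minimal sketch. It assumes the balanced F-measure (harmonic mean of precision and recall) and exact-match scoring of chunk spans; the slides do not state which variant was used, and the span data below is invented for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_chunks(gold, predicted):
    """Exact-match scoring: a predicted chunk counts as correct only if
    its (start, end) span also appears in the gold standard."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Precision/recall pairs quoted on the baseline slide, in percent.
print(round(f_measure(90.7, 91.2), 2))  # 90.95
print(round(f_measure(91.8, 92.3), 2))  # 92.05

# Invented spans (token offsets), purely to show the scoring.
gold_spans = [(0, 3), (4, 5), (6, 8)]
predicted_spans = [(0, 3), (4, 6)]
print(score_chunks(gold_spans, predicted_spans))
```

Under exact matching, a chunk with one wrong boundary counts as both a false positive and a missed gold chunk, so boundary errors are penalized in precision and in recall.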

18 Parameters Tested
- agreement checking online or offline
- left-to-right or right-to-left traversal
    {der 14} {Jahre} {alte Junge} (the 14-year-old boy)
    {der} {14 Jahre} {alte Junge}
- quality of POS tagging
  - POS-I(deal): POS tags from tree bank
  - POS-L(exicon): from tree tagger trained on tree bank
  - POS-T(agger): from tree tagger
  - POS-C(hunker): POS tags disambiguated by chunker

19 F-Values for Base Noun Chunks

20 Discussion
- English is harder than German: German nouns are less ambiguous than English nouns.
- POS-I > POS-L > POS-T > POS-C
- Tags from the chunker (POS-C) are worse than baseline.
- Using a POS tagger is a good idea.
- Direction of processing makes no difference.
- Checking agreement yields a small improvement.

21 F-Values for Full Noun Chunks
- maximum entropy model
- PCFG model

22 Discussion
- Online agreement checking pays (see next slide).
- Better results with right-to-left parsing are mainly due to a heuristic which could only be incorporated in the right-to-left parser: prefer shortest match with conjunct attachment
    {? The presidents of {? France and the U.S.A.} met.

23 Online Agreement Checking (+)
Errors avoided:
- genitives (case mismatch)
    {in John's} {house}
- conjunction attachment (case mismatch)
    {das Leben {von Schauspielern} und Zirkusleuten}
    the life (nom;acc) of actors and circus people (dat)
- adjacent NPs (adjective declension)
    {diese beiden ähnliche Erfolge}
    those two (weak) similar (strong) successes

24 Online Agreement Checking (-)
Some grammar errors become visible only with agreement checking.
N coordination is missing.
    {die nachlassenden Kräfte}
    the diminishing strength
    {die Verletzungen} und {nachlassenden Kräfte}
    the injuries and diminishing strength
    no noun chunk!

25 Conclusion (1)
- Writing a finite-state grammar is worth the effort: the FS method performs better than the statistical method.
- The noun chunker is not very good at determining POS tags.
- Online agreement checking improves performance.
- Shortest match is better than longest match for conjunction attachment.

26 Conclusion (2)
- Two chunkers have been implemented (base noun chunker, full noun chunker). Both are completely deterministic.
- On a SUN Ultra-250, the base noun chunker processes 12,500 words per second; the full noun chunker achieves 5,200 wps.
- Plans for the future: extend the system to recognize predicate-argument structure for Information Extraction.

