Coping with Problems in Grammars Automatically Extracted from Treebanks Carlos A. Prolo Computer and Info. Science Dept. University of Pennsylvania
Context ● The extraction of a Tree Adjoining Grammar (TAG) ● From the Penn Treebank (English WSJ corpus) ● Using Xia's extraction tool + other stuff Focus ● Extraction problems: – Some case studies
Teaser ● The business of grammar extraction from corpora is intended to produce a grammar with “full” coverage of the constructions in a language ● But we know we don't know how to model many syntactic phenomena ● So, what are we doing? ● We have to start looking, pragmatically, at the quality of the extracted grammars we produce
Sources of extraction problems 1 Lack of proper linguistic account 2 Treebank annotation style 3 Extraction tool/process itself 4 Unsuitability of the language model 5 Unsuitability of the grammar formalism 6 Annotation errors 7 ... and, of course, inability on the part of the grammar developers
Sources of extraction problems 1 Lack of proper linguistic account 2 Treebank annotation style 3 X Extraction tool/process itself 4 Unsuitability of the language model 5 Unsuitability of the grammar formalism 6 X Annotation errors
Lexicalized Tree Adjoining Grammar (LTAG) [Figure: example elementary trees; node labels include S, NP, N, V, VP, Adv, and the foot node VP*]
LTAG: combining trees [Figure: the example elementary trees (node labels S, NP, N, V, VP, Adv, foot node VP*) combined into a single derived tree rooted in S]
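The two combining operations, substitution and adjunction, can be sketched in a few lines of Python. This is only an illustration and not the tool used in this work: the nested-list tree encoding, the `!` suffix for substitution slots, the `*` suffix for foot nodes, the function names, and the example words are all assumptions made for the sketch.

```python
# Trees are nested lists: [label, child, child, ...]; leaves are plain strings.
# Assumed markers (hypothetical): "NP!" is a substitution slot, "VP*" a foot node.

def replace_leaf(tree, leaf, subtree):
    """Copy `tree`, replacing the leftmost leaf equal to `leaf` with `subtree`."""
    done = False

    def walk(t):
        nonlocal done
        if isinstance(t, str):
            if t == leaf and not done:
                done = True
                return subtree
            return t
        return [t[0]] + [walk(k) for k in t[1:]]

    return walk(tree)


def substitute(tree, initial):
    """Substitution: plug `initial` into the leftmost slot of its root category."""
    return replace_leaf(tree, initial[0] + "!", initial)


def adjoin(tree, aux):
    """Adjunction: splice `aux` in at the first node (preorder) carrying its root label."""
    label = aux[0]
    done = False

    def walk(t):
        nonlocal done
        if isinstance(t, str):
            return t
        if t[0] == label and not done:
            done = True
            # The original subtree moves down to the foot node of the auxiliary tree.
            return replace_leaf(aux, label + "*", t)
        return [t[0]] + [walk(k) for k in t[1:]]

    return walk(tree)


def leaves(t):
    """Left-to-right yield of a tree."""
    return [t] if isinstance(t, str) else [w for k in t[1:] for w in leaves(k)]


# Hypothetical elementary trees (the words are invented for the example).
verb = ["S", "NP!", ["VP", ["V", "sleeps"]]]
noun = ["NP", ["N", "John"]]
adv = ["VP", "VP*", ["Adv", "soundly"]]

derived = adjoin(substitute(verb, noun), adv)
print(derived)
print(" ".join(leaves(derived)))
```

Substitution fills a frontier slot with an initial tree, while adjunction rewrites an internal node, which is why the adverb tree can attach to the VP without the verb tree reserving a position for it.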
Automatic TAG extraction Figure is thanks to Fei Xia
A few selected problem cases 1 (PTB) Extraction of Free Relatives 2 Wh percolation up 3 “Unlike Coordinated Phrases” (UCP) 4 Extraposition (Verb Subcategorization) 5 Parentheticals 6 VP topicalization 7 X (PTB) Projection of Parts-of-speech
Extraction of Free Relatives (problem due to PTB annotation style)
PTB annotation: (S-3 (NP-SBJ (PRP We)) (VP (VBP make) (SBAR-NOM (WHNP-1 (WP what)) (S we know how to make))))
After the change: (S-3 (NP-SBJ (PRP We)) (VP (VBP make) (NP (NP (WP what)) (SBAR (WHNP-1 (-NONE- 0)) (S we know how to make)))))
● Problem: Free relatives are annotated as wh sentential complements. The verb is extracted with the wrong argument category: “S (SBAR)”
● Solution: Change the free relatives to NP (the relative clause has an empty wh: “head” account – Bresnan 78)
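The rewrite of free relatives into NPs can be sketched as a small tree transformation. The code below is a hedged illustration, not the actual preprocessing used here: the parser, the function names, and the rule that any `SBAR-NOM` whose first child is a `WHNP` counts as a free relative are assumptions made for the example.

```python
import re

def parse(s):
    """Parse a bracketed Treebank string into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        kids = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                kids.append(node())
            else:
                kids.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return [label] + kids

    return node()


def to_str(t):
    """Render a nested-list tree back into bracketed form."""
    if isinstance(t, str):
        return t
    return "(" + " ".join([t[0]] + [to_str(k) for k in t[1:]]) + ")"


def fix_free_relatives(t):
    """Rewrite (SBAR-NOM (WHNP-i wh) rest) as (NP (NP wh) (SBAR (WHNP-i (-NONE- 0)) rest))."""
    if isinstance(t, str):
        return t
    label, kids = t[0], [fix_free_relatives(k) for k in t[1:]]
    is_free_relative = (
        label.startswith("SBAR-NOM")
        and kids
        and not isinstance(kids[0], str)
        and kids[0][0].startswith("WHNP")
    )
    if is_free_relative:
        wh, rest = kids[0], kids[1:]
        head_np = ["NP"] + wh[1:]               # the wh-word becomes the NP head
        empty_wh = [wh[0], ["-NONE-", "0"]]     # the relative clause keeps an empty wh
        return ["NP", head_np, ["SBAR", empty_wh] + rest]
    return [label] + kids


src = "(S-3 (NP-SBJ (PRP We)) (VP (VBP make) (SBAR-NOM (WHNP-1 (WP what)) (S we know how to make))))"
fixed = to_str(fix_free_relatives(parse(src)))
print(fixed)
```

With this rewrite applied before extraction, the verb would see an NP argument rather than an SBAR, which is the category change the solution calls for.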
Wh percolation up
(NP (NP (DT the) (NNS researchers)) (SBAR (WHNP-3 (WP who)) (S (NP-SBJ (-NONE- *T*-3)) (VP (VBD studied) (NP (DT the) (NNS workers))))))
[Figure: extracted trees; node labels include WHNP, WP, NP, NNS]
Wh percolation up
(SBARQ (WHNP-46 (WP What) (NN sector)) (SQ (VBZ is) (NP-SBJ-2 (-NONE- *T*-46)) (VP (VBG stepping) (ADVP-DIR (RB forward)))))
[Figure, built up over several slides: extracted trees for “What sector”; node labels include NP, NN, WP, WHNP, and the foot nodes NP* and WHNP*] (Vijay-Shanker et al.)
Wh percolation up
(SBARQ (WHNP-46 (WP What) (NN sector)) (SQ (VBZ is) (NP-SBJ-2 (-NONE- *T*-46)) (VP (VBG stepping) (ADVP-DIR (RB forward)))))
[Figure: a WHNP tree anchored by WP, to be combined (+) with a tree still to be determined (?)]
Wh percolation up
(NP-SBJ (NP The bid) (PP for Great Northern) (,,) (SBAR (WHNP-1 (NP (DT a) (NN notice)) (WHPP (IN of) (WHNP (WDT which)))) (S *T* appears in an advertisement)))
[Figure: a WHNP tree anchored by WP]
Wh percolation up
(SBARQ (WHNP-46 (WP What) (NN sector)) (SQ (VBZ is) (NP-SBJ-2 (-NONE- *T*-46)) (VP (VBG stepping) (ADVP-DIR (RB forward)))))
[Figure: proposed trees; node labels include WHNP, NN, WP, IN, WHPP, and the foot node WHNP*]
Unlike Coordinated Phrases (UCP)
(NP (UCP (NN construction) (CC and) (JJ commercial)) (NNS loans))
(VP (VB be) (UCP-PRD (NP (CD 35)) (CC or) (ADJP (JJR older))))
(VP (VB take) (NP (NN effect)) (UCP-TMP (ADVP 96 days later) (,,) (CC or) (PP in early February)))
[Figure: extracted trees; an NP auxiliary tree (foot node NP*) with UCP, NN, CC, and JJ nodes, and an S tree for [be] with NP, VB, VP, and UCP nodes]
Unlike Coordinated Phrases (UCP) ● We give the UCP the status of an independent nonterminal, as if it had some intrinsic categorial significance ● Multiple conjuncts: it is enough for one of them to be of a distinct category to turn the entire constituent into a UCP
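The “one distinct conjunct is enough” behaviour can be made concrete with a tiny labelling function. This is a hedged sketch of the annotation convention as described on this slide, not Treebank tooling: the function names and the choice to ignore function tags such as `-PRD` are assumptions.

```python
def base_category(label):
    """Drop function tags, e.g. 'ADJP-PRD' -> 'ADJP' (assumed convention)."""
    return label.split("-")[0]


def coordination_label(conjunct_labels):
    """Return the shared category of the conjuncts, or 'UCP' if any one differs."""
    cats = {base_category(l) for l in conjunct_labels}
    return cats.pop() if len(cats) == 1 else "UCP"


print(coordination_label(["NP", "NP", "NP"]))      # like conjuncts
print(coordination_label(["NP", "ADJP"]))          # "35 or older"
print(coordination_label(["ADVP", "PP"]))          # "96 days later, or in early February"
print(coordination_label(["NP", "NP", "ADJP"]))    # a single odd conjunct is enough
```

Because the label depends only on whether the set of categories has more than one member, UCP says nothing about which categories were coordinated, which is the sense in which it has no intrinsic categorial significance.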
Unlike Coordinated Phrases (UCP): as the head of a constituent (S (NP-SBJ-1 The Series 1989 B bonds) (VP (VBP are) (VP (VBN rated) (S *-1 double-A)))) (S (NP-SBJ-1 The Series 1989 B bonds) (VP (VBP are) (UCP-PRD (ADJP-PRD (JJ uninsured)) (CC and) (VP (VBN rated) (S *-1 double-A)))))
Extraposition (“it” extraposition)
(S (NP-SBJ-1 (NP (PRP it)) (S (-NONE- *EXP*-2))) (VP (MD would) (ADVP-TMP (RB no) (RBR longer)) (VP (VB be) (ADJP-PRD (JJ possible)) (S-2 (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB win) (NP (NN reinstatement))))))))
[Figure: a VP auxiliary tree (foot node VP*) with an S node, anchored by [win]]
Extraposition (relative clause)
(S (ADVP-TMP (RB Soon)) (,,) (NP-SBJ (NP (NNS T-shirts)) (SBAR (-NONE- *ICH*-1))) (VP (VBD appeared) (PP-LOC (IN in) (NP (DT the) (NNS corridors))) (SBAR-1 (WHNP-2 (WDT that)) (S (NP-SBJ (-NONE- *T*-2)) (VP (VBD carried) (NP (NP the school 's familiar logo) (PP-LOC on the front) ))))))
[Figure: a VP auxiliary tree (foot node VP*) with an SBAR node, anchored by [carried]]
Extraposition (Object)
(S (NP-SBJ Mr. Peters) (VP (VBZ says) (PP-LOC in his affidavit) (SBAR (IN that) (S (NP the movie 's staff) (VP (VBD was) (VP (VBN told) (NP (-NONE- *-1)) (NP-TMP last week) (SBAR that Warner was...)
[Figure: extracted trees; an S tree for [says] (NP, VBZ, SBAR, VP), an S tree for [told] (NP, VBD, NP, VP, SBAR), and VP auxiliary trees (foot node VP*) for the PP [in] and the NP [week]]
Note: Chiang 2000 (sister adjunction)
Extraposition (Object)
(S (NP-SBJ Mr. Peters) (VP (VBZ says) (PP-LOC in his affidavit) (SBAR (IN that) (S (NP the movie 's staff) (VP (VBD was) (VP (VBN told) (NP (-NONE- *-1)) (NP-TMP last week) (SBAR that Warner was...)
[Figure: alternative extraction; S trees for [says] and [told] without the SBAR node, VP auxiliary trees (foot node VP*) for the PP [in] and the NP [week], and a separate VP auxiliary tree carrying the SBAR [S compl]]
Extraposition (Object) [Figure: two S trees, for [says] (VBZ) and [told] (VBD), each with NP, VP, and SBAR nodes] Note: Multi-component TAGs (Bleam & Xia, TAG+ 2000)
Parentheticals (non-lexicalized trees!!)
(NP (NP the 3 billion New Zealand dollars) (PRN (-LRB- -LRB-) (NP US$ 1.76 billion *U*) (-RRB- -RRB-)))
(S (NP-SBJ The total relationship) (PRN (,,) (SBAR-ADV as Mr. Lee sees it) (,,)) (VP (VBZ is)...))
[Figure: non-lexicalized auxiliary trees; a VP tree (foot node VP*) with PRN and SBAR nodes, and an NP tree (foot node NP*) with PRN and NP nodes]
VP Topicalization
(S (NP-SBJ-1 investments in...) (VP (MD will) (VP (VB be) (VP (VBN excluded)))))
[Figure: two alternative extractions, Lexical Head versus Syntactic Head, with trees anchored by [be] and [excluded]; node labels include S, NP, V, VBN, VP, and the foot node VP*]
VP Topicalization
(SINV (ADVP (RB Also)) (VP-TPC-2 (VBN excluded)) (VP (MD will) (VP (VB be) (VP (-NONE- *T*-2)))) (NP-SBJ-1 investments in...))
[Figure: two alternative extractions, Lexical Head versus Syntactic Head, with trees anchored by [be] and [excluded]; node labels include S, NP, V, VBN, VP, and the foot node VP*]
Projections of Parts-of-speech
(NP (DT a) (JJR stronger) (NN argument))
(NP (DT an) (ADJP (RB even) (JJR stronger)) (NN argument))
[Figure: two different NP auxiliary trees (foot node NP*) for [stronger], one with JJR attached directly and one with JJR under ADJP]
Projections of Parts-of-speech
(NP-SBJ-1 (NNP October) (NN weather))
(NP-SBJ-1 (NP (JJ late) (NNP October)) (NN weather))
[Figure: two different NP auxiliary trees (foot node NP*) for [October], one with NNP attached directly and one with NNP under NP]
Forced Projections of Parts-of-speech
PROJECTED            PROJECTION
NN, NNP, PRP, EX     NP
JJ, JJR, JJS         ADJP
RB, RBR, RBS         ADVP
S, SINV              SBAR
SQ                   SBARQ
WP                   WHNP
WRB                  ADVP
CD                   QP
QP                   NP
UH                   INTJ
LS                   LST
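The projection table can be applied mechanically before extraction. The sketch below is an assumption-laden illustration rather than the actual procedure: the nested-list encoding and the function names are invented, only one projection step is applied per call (whether rows such as CD to QP and QP to NP should chain is not stated on the slide), and the caller decides which child to project.

```python
# Projection table copied from the slide; keys are projected labels.
PROJECTION = {
    "NN": "NP", "NNP": "NP", "PRP": "NP", "EX": "NP",
    "JJ": "ADJP", "JJR": "ADJP", "JJS": "ADJP",
    "RB": "ADVP", "RBR": "ADVP", "RBS": "ADVP",
    "S": "SBAR", "SINV": "SBAR",
    "SQ": "SBARQ",
    "WP": "WHNP",
    "WRB": "ADVP",
    "CD": "QP",
    "QP": "NP",
    "UH": "INTJ",
    "LS": "LST",
}


def force_projection(node):
    """Wrap `node` ([label, child, ...]) in its projection, if the table has one."""
    parent = PROJECTION.get(node[0])
    return [parent, node] if parent else node


def project_child(tree, index):
    """Copy `tree`, projecting the child at position `index` (0-based among children)."""
    kids = list(tree[1:])
    kids[index] = force_projection(kids[index])
    return [tree[0]] + kids


np_bare = ["NP", ["DT", "a"], ["JJR", "stronger"], ["NN", "argument"]]
np_proj = project_child(np_bare, 1)
print(np_proj)
```

After projection the modifier “stronger” sits under ADJP in both “a stronger argument” and “an even stronger argument”, so the two contexts would no longer yield two different trees for the same word.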
Conclusion ● Full coverage of language is (currently) utopian ● Grammar extraction can/should be used to search for solutions to grammar development problems ● We presented a few selected problems in grammar extraction and discussed solutions with various degrees of acceptability (using TAGs) ● There are more and harder ones where these came from ● Question: how would these problems be handled: – By other grammar formalisms? – By other linguistic approaches using the TAG formalism?
LTAG Verb Trees [Figure: verb trees rooted in S with NP, V or VBN, and VP nodes, including variants with SBAR and WHNP nodes and an NP auxiliary tree with foot node NP*]
Wh percolation up
(NP-SBJ (NP The bid) (PP for Great Northern) (,,) (SBAR (WHNP-1 (NP (DT a) (NN notice)) (WHPP (IN of) (WHNP (WDT which)))) (S *T* appears in an advertisement)))
(NP-PRD (NP (NNS hitches)) (,,) (SBAR (RB not) (WHNP-17 (NP (DT the) (JJS least)) (WHPP (IN of) (WHNP (WDT which)))) (S *T* was that... )
Extraposition (S (NP-SBJ (NP (PRP it)) (S (-NONE- *EXP*-1))) (VP (VBZ is) (ADJP-PRD (JJ unjust)) (S-1 (NP-SBJ (-NONE- *)) (VP (TO to) (VP (VB reprove) (NP (NNP China)) (PP-PRP (IN for) (NP (PRP it))))))))