Extracted TAGs and Aspects of Their Use in Stochastic Modeling


Extracted TAGs and Aspects of Their Use in Stochastic Modeling
John Chen, Department of Computer Science, Columbia University

Motivation (1/3)
Lexicalized stochastic models are important for NLP:
- Parsing (Collins 99; Charniak 00)
- Summarization (McKeown et al. 01)
- Machine translation (Berger et al. 94)

Motivation (2/3)
- Tree-Adjoining Grammar (TAG) is a lexicalized formalism (Joshi et al. 75; Schabes et al. 88)...
- ...but until recently, not much work on stochastic modeling of TAG (Srinivas 00; Chiang 00)
[Tree diagram: TAG derivation of "Creationism lost credibility"]

Motivation (3/3)
Problem: lack of large-scale corpora with TAG annotations
- The Penn Treebank (Marcus et al. 93) exists, but there is no comparable TAG-annotated corpus
[Tree diagrams: Penn Treebank bracketing vs. TAG derivation of "Bell increased its earnings"]

Introduction (1/2)
Approach: automatically extract a TAG from the Penn Treebank
- Given a bracketed sentence, derive a set of TAG trees out of which it is composed
- Extracted TAGs should conform to the principles that guide the formation of hand-crafted TAGs
[Tree diagrams: Penn Treebank bracketing and extracted TAG trees for "Bell increased its earnings"]

Introduction (2/2)
Uses:
- To estimate parameters for statistical TAG models
- To avoid having to hand-craft your own grammar
- To evaluate TAGs extracted using different design methodologies
- To improve a hand-crafted grammar
- To do a comparative evaluation of grammars extracted from different kinds of treebanks, of different languages, or of different sublanguages

Outline
- Motivation, Introduction
- Extraction of a TAG from a Treebank
  - Tree-Adjoining Grammars
  - Extraction Procedure
  - Variations on Extraction
  - Evaluation
  - Experiment to increase coverage
- Smoothing Models for TAG
- Using Extracted TAG Features to Predict Semantic Roles
- Conclusions and Future Work

Tree-Adjoining Grammar (TAG)
- A TAG is a set of lexicalized trees
  - Lexicalized tree == TAG elementary tree
  - Anchor of an elementary tree == lexical item
- Operations combine lexicalized trees into parse trees
[Tree diagrams: elementary trees for "wet" and "paint" combining into a parse of "wet paint"]

Kinds of Trees in TAG
- Lexicalized tree vs. tree frame
  - A tree frame (supertag) is a lexicalized tree with its lexical anchor removed
- Initial tree vs. auxiliary tree
  - Initial trees may contain substitution nodes
  - Auxiliary trees contain a foot node
[Tree diagrams: lexicalized trees and tree frames anchored by "wet", "paint", "enjoys", "thinks"; initial trees with substitution nodes and auxiliary trees with foot nodes]

TAG Operations
- Substitution: an initial tree is plugged into a matching substitution node
- Adjoining: an auxiliary tree is inserted at an internal node, with the original subtree attaching below the foot node
[Tree diagrams: substitution and adjoining in the derivations of "Terry enjoys pea soup" and "Who everyone thinks enjoys pea soup"]
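
To make the two operations concrete, here is a minimal Python sketch (not from the original presentation); the Node class, the example trees, and the helper names are illustrative assumptions.

```python
# Minimal sketch of TAG elementary trees and the substitution/adjoining
# operations. The Node class, helper names, and example trees are
# illustrative assumptions, not the thesis implementation.
from copy import deepcopy

class Node:
    def __init__(self, label, children=None, subst=False, foot=False, anchor=None):
        self.label = label              # syntactic category, e.g. "NP"
        self.children = children or []
        self.subst = subst              # substitution node
        self.foot = foot                # foot node (NP* on the slides)
        self.anchor = anchor            # lexical anchor, e.g. "paint"

# Initial tree for "paint": NP -> N(paint)
paint = Node("NP", [Node("N", anchor="paint")])
# Auxiliary tree for "wet": NP -> A(wet) NP*  (root and foot share the label NP)
wet = Node("NP", [Node("A", anchor="wet"), Node("NP", foot=True)])

def substitute(tree, initial):
    """Plug a copy of an initial tree into the first matching substitution node."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = deepcopy(initial)
            return True
        if substitute(child, initial):
            return True
    return False

def adjoin(tree, aux, target_label):
    """Adjoin an auxiliary tree at the first node labeled target_label;
    the original subtree at that node ends up under the foot node."""
    if tree.label == target_label and not tree.subst and not tree.foot:
        new = deepcopy(aux)
        _replace_foot(new, tree)
        return new
    tree.children = [adjoin(c, aux, target_label) for c in tree.children]
    return tree

def _replace_foot(node, subtree):
    for i, child in enumerate(node.children):
        if child.foot:
            node.children[i] = subtree
            return True
        if _replace_foot(child, subtree):
            return True
    return False

# "wet paint": adjoin the auxiliary tree for "wet" at the NP root of "paint"
wet_paint = adjoin(deepcopy(paint), wet, "NP")
```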

Principles of TAG Formation (1/2)
Localization of dependencies:
- Project the lexical head to include all of its complements
[Tree diagrams: elementary trees anchored by "nectarines", "beside", "plays", "gives", and "put", each projecting the head together with its complements]

Principles of TAG Formation (2/2)
Factoring recursion:
- Modifier auxiliary trees, e.g. "The messenger ran [PP between the cars] [PP across the street] [PP towards the police station]."
- Predicative auxiliary trees, e.g. "What [S everyone thought] [S the manager believed] [S the employees imagined] to be a time-saver."
[Tree diagrams: modifier auxiliary tree anchored by "between"; predicative auxiliary tree anchored by "thought"]

Extraction Procedure (1/4)
(Chen 01; cf. Xia 99, Chiang 00)
Extraction of a particular tree, Step 1: determine the path of projection of the TAG tree
- How far to go up, starting from the lexical item?
- Find out using a head percolation table (Magerman 95)
- Heuristics look at the relationships between a parent and its children
[Tree diagrams: determining the path of projection for "disputed" in the bracketing of "Mr. Lane vehemently disputed those estimates"]
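
A rough Python sketch of how a head percolation table can drive Step 1; the table entries and the (label, children) tree encoding are simplified assumptions, not Magerman's actual table.

```python
# Illustrative head-finding with a small head percolation table
# (after Magerman 95). The entries and tree encoding are simplified
# assumptions for the sketch.
HEAD_TABLE = {
    # parent label: (search direction, preferred child labels in order)
    "S":  ("left",  ["VP", "S"]),
    "VP": ("left",  ["VBD", "VBN", "VBZ", "VBP", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}

def head_child_index(label, children):
    """children: list of (label, subtree) pairs; return index of the head child."""
    direction, preferences = HEAD_TABLE.get(label, ("left", []))
    order = range(len(children)) if direction == "left" else range(len(children) - 1, -1, -1)
    for preferred in preferences:
        for i in order:
            if children[i][0] == preferred:
                return i
    return 0 if direction == "left" else len(children) - 1

# The path of projection for a lexical item follows head children upward,
# e.g. disputed/VBD -> VP -> S in the example bracketing.
```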

Extraction Procedure (2/4)
Extraction of a particular tree (continued), Step 2: distinguish complements from adjuncts
- Heuristics determine which sisters are complements and which are adjuncts
- Complements become substitution nodes
- Adjuncts become modifier auxiliary trees
[Tree diagrams: elementary tree for "disputed" with its NP complements as substitution nodes, and a modifier auxiliary tree for "vehemently"]
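
A sketch of the flavor of Step 2's heuristics; the specific function-tag lists below are illustrative guesses, not the exact CA1 (Chen, Vijay-Shanker 99) or CA2 (Xia 99) rule sets.

```python
# Illustrative complement/adjunct heuristics over Penn Treebank labels.
# The tag lists are assumptions for the sketch, not the actual CA1/CA2 rules.
ADJUNCT_TAGS = {"TMP", "LOC", "ADV", "MNR", "PRP", "DIR"}     # adjunct-like function tags
COMPLEMENT_TAGS = {"SBJ", "CLR", "PRD", "DTV"}                # complement-like function tags

def is_complement(sister_label, parent_label):
    """Decide whether a sister of the head (e.g. "NP-SBJ", "ADVP-MNR") is a complement."""
    base, _, function = sister_label.partition("-")
    if function in COMPLEMENT_TAGS:
        return True
    if function in ADJUNCT_TAGS:
        return False
    # bare NP/S/SBAR sisters under a verbal projection default to complements
    return parent_label == "VP" and base in {"NP", "S", "SBAR"}

# Complements become substitution nodes in the extracted elementary tree;
# adjuncts are factored out into modifier auxiliary trees.
```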

Extraction Procedure (3/4)
Other aspects of the extraction procedure:
- Extracting predicative auxiliary trees
- Localizing traces with their landing sites
- Detecting and extracting appropriate conjunction trees
- Extracting TAG trees containing multiple lexical items

Extraction Procedure (4/4)
[Tree diagrams: extraction of a predicative auxiliary tree for "seems" and localization of the wh-trace with its landing site (example with "what", "Jon seems", and "eat")]

Variations on Extracted Grammars
The extraction procedure can be parameterized. We want to study:
- Effects of parameterization on the resulting extracted grammars
- Effects on a statistical model based on different extracted grammars
Kinds of variation of extracted grammars:
- Detection of complements
- Empty elements
- Label set

Variation in Detection of Complements (1/2)
- Recall that an important principle of TAG formation is that an elementary tree includes a lexical head and all of its complements: the principle of "domain of locality"
- The notion of complement of a lexical head is fuzzy
- One way to extract grammars with different domains of locality is to vary the way complements are detected

Variation in Detection of Complements (2/2)
Two ways to detect complements:
- CA1 (Chen, Vijay-Shanker 99): more nodes are complements
- CA2 (Xia 99): more nodes are adjuncts
[Tree diagrams: CA1 vs. CA2 elementary trees for "Pierre Vinken joined the board as an executive director"]

Variation in Treatment of Empty Elements
Kinds of empty elements:
- The Penn Treebank has many different kinds of empty elements
- Standard TAG analyses treat only a certain subset of these
Different treatments of empty elements:
- ALL: include all empty elements in the Penn Treebank in the extracted grammar
- SOME: include only those empty elements that do not violate TAG's domain of locality

Variation in Label Set
Kinds of label sets:
- The Penn Treebank has a detailed label set, especially for parts of speech
- Standard TAG analyses assume a simplified label set
Extracted grammars based on different label sets:
- FULL: elementary trees labeled with the Penn Treebank label set
- MERGED: elementary trees labeled with the (simplified) XTAG label set

Evaluation of Extracted Grammars
Different ways to evaluate extracted grammars:
- Size
- Coverage
- Supertagging accuracy
- Trace localization
Each grammar variation is extracted from PTB Sections 02-21.

Size of Grammar (1/3)
Ways to measure size:
- Number of lexicalized trees
- Number of tree frames
Importance of size:
- Efficiency of statistical models
- Impact on the sparse data problem
[Tree diagrams: a lexicalized tree anchored by "disputed" and the corresponding tree frame]

Size of Grammar (2/3)

Comp  Empty  Label   #Frames  #LexTrees
CA1   ALL    FULL     8675     113456
CA1   ALL    MERGE    5953     109774
CA1   SOME   FULL     7446     110134
CA1   SOME   MERGE    5053     106457
CA2   ALL    FULL     6488     110034
CA2   ALL    MERGE    4358     107285
CA2   SOME   FULL     4723     106422
CA2   SOME   MERGE    3075     102679

Change in #Frames > change in #LexTrees.

Size of Grammar (3/3)
(Same table as on the previous slide.)
- Variance(#Frames): Label > Comp = Empty
- Variance(#LexTrees): Empty > Label > Comp

Coverage of Grammar (1/3)
Measuring coverage:
- Extract grammar G from the training corpus
- Extract grammar G' from the test corpus (PTB Section 23)
- Compute the percentage of instances of (lexicalized tree / tree frame) in G' that are also in G
Importance of coverage:
- Impact on the sparse data problem
- Measure of the amount of linguistic generalization
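
A minimal sketch of this coverage measure, assuming grammar items are represented as hashable objects (tree frames, or (word, frame) pairs for lexicalized trees); the function and variable names are illustrative.

```python
# Sketch of the coverage measure: the percentage of grammar-item instances
# in the test corpus that were also extracted from the training corpus.
def coverage(train_items, test_items):
    seen = set(train_items)
    covered = sum(1 for item in test_items if item in seen)
    return 100.0 * covered / len(test_items)

# Hypothetical usage:
# frame_coverage = coverage(train_frames, test_frames)          # > 99% in the thesis
# lextree_coverage = coverage(train_lextrees, test_lextrees)    # ~ 92%
```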

Coverage of Grammar (2/3)

Comp  Empty  Label   %FramesSeen  %LexTreesSeen
CA1   ALL    FULL       99.56        92.04
CA1   ALL    MERGE      99.69        92.35
CA1   SOME   FULL       99.65        92.41
CA1   SOME   MERGE      99.77        92.74
CA2   ALL    FULL         --         92.26
CA2   ALL    MERGE      99.76        92.59
CA2   SOME   FULL       99.81        92.76
CA2   SOME   MERGE      99.88        93.08

- Frame coverage > 99%: the extraction procedure is making good syntactic generalizations
- LexTree coverage is poor: not surprising, given the number of (word x frame) combinations

Coverage of Grammar (3/3)

Comp  Empty  Label   %(W,T seen separately)  %(W or T not seen)
CA1   ALL    FULL        63.94                  36.06
CA1   ALL    MERGE       63.92                  36.08
CA1   SOME   FULL        63.24                  36.76
CA1   SOME   MERGE       63.09                  36.91
CA2   ALL    FULL        64.00                  36.00
CA2   ALL    MERGE       63.83                  36.17
CA2   SOME   FULL        63.54                  36.46
CA2   SOME   MERGE       62.86                  37.14

We can recover about 2/3 of the missing lexicalized-tree coverage if we can guess "valid" (word x frame) combinations from words and frames found in the training corpus.

Supertagging Accuracy (1/5)
- Input: the words of a sentence
- Output: each word associated with a tree frame
[Tree diagrams: the words "wet paint" each assigned a tree frame (supertag)]

Supertagging Accuracy (2/5)
Supertagging as "almost parsing"
[Tree diagrams: the supertags assigned to "wet paint" compose almost directly into the parse tree]

Supertagging Accuracy (3/5)
Importance of supertagging accuracy:
- Measures the impact of different kinds of grammars on a statistical model
Experimental design:
- Trigram model of supertagging (Srinivas 97; cf. Chen et al. 99)
- Training set: PTB Sections 02-21
- Test set: PTB Section 23

Supertagging Accuracy (4/5)

Comp  Empty  Label   %correct supertags
CA1   ALL    FULL         78.55
CA1   ALL    MERGE        79.23
CA1   SOME   FULL         79.34
CA1   SOME   MERGE        80.09
CA2   ALL    FULL         79.07
CA2   ALL    MERGE        79.65
CA2   SOME   FULL         80.03
CA2   SOME   MERGE        80.62

- Relatively low accuracy: sparse data problem with extracted grammars
- Variance(%correct): Empty > Label > Comp

Supertagging Accuracy (5/5)
Correlation between supertagging accuracy and other measures:
- Weak correlation between an extracted grammar's supertagging accuracy and its size in tree frames
- Very strong correlation (R=0.98) between an extracted grammar's supertagging accuracy and its size in lexicalized trees
- When designing a TAG to be modeled stochastically, it may therefore be a good idea to minimize in particular the number of lexicalized trees

Representation of Empty Elements in Extracted Grammars (1/3)
Kinds of empty elements in linguistic theory:
- Traces
- Null elements
Recall:
- TAG analyses typically include certain kinds of empty elements
- The Penn Treebank has these and other kinds of empty elements

Representation of Empty Elements in Extracted Grammars (2/3)
We measure:
- The number of traces localized with their landing sites
- The number of null elements
Importance:
- Theoretical: which kinds of traces can and cannot be localized by the TAG formalism
- Practical: some kinds of localization may improve the performance of statistical models (cf. Collins 99), and localization can ease the interface between the extracted TAG and semantics

Representation of Empty Elements in Extracted Grammars (3/3)

Comp  Empty  #Trace Types  #Null Types  #Trace Tokens (% of all trace tokens)  #Null Tokens
CA1   ALL        2381          2560          21508 (60%)                          42593
CA1   SOME       1258          1392          18394 (50%)                          25305
CA2   ALL        1847          2130          21458 (59%)                          42643
CA2   SOME        589           655          16153 (44%)                          24035

- Variance(%trace types): Comp > Empty
- Examples of non-localizable traces:
  - In TAG: traces across coordination
  - CA1 vs. CA2: complements far from the head
  - ALL vs. SOME: adverbial movement

Evaluation Reveals a Sparse Data Problem with Extracted Grammars
- Lexicalized tree coverage is generally bad for extracted grammars
- This is one major reason for poor supertagging accuracy

                    #LexTrees   %LexTree coverage on unseen   %supertag accuracy
Extracted Grammar    113456            92.04                        78.55

Feature Vector Decomposition of Extracted Grammars (1/3)
Motivation:
- Can help ameliorate the extracted grammar's sparse data problem
- Can help map the extracted grammar onto semantics and onto other grammars
Feature vector description:
- POS, subcat frame, modifyee, direction, co-anchors, root
- Transformations: declarative, subject-aux inversion, topicalization, wh-movement, complement, etc.
Example: POS VB, Subcat {NP}, Modifyee S, Direction left, Compl? Yes, ...
[Tree diagram: tree frame anchored by "hurt" with its feature-vector description]

Feature Vector Decomposition of Extracted Grammars (2/3)
Detection of features is based on pattern matching of structural relationships (after linguistic theory):
- POS: the preterminal above the lexical item
- Subcat: the sister substitution nodes of the preterminal
- Compl?: detected from a structural pattern over the frame's root (see diagram)
[Tree diagram: tree frame anchored by "hurt" with POS VB, Subcat {NP}, Compl? Yes]
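
As a concrete picture of the decomposition, a feature vector for the example frame might be represented as below; the dictionary keys follow the slides' feature names, while the exact values are illustrative assumptions.

```python
# Illustrative feature-vector representation of the example tree frame
# (anchored by "hurt"). Keys mirror the features named on the slides;
# the concrete values are assumptions for the sketch.
hurt_frame_features = {
    "pos": "VB",            # preterminal above the lexical anchor
    "subcat": {"NP"},       # sister substitution nodes of the preterminal
    "modifyee": "S",        # category that the frame attaches to
    "direction": "left",    # side on which it attaches
    "co_anchors": [],       # other lexical items in the frame, if any
    "root": "S",            # root label of the tree frame
    "compl": True,          # the frame is a complement clause ("Compl? Yes")
    "transformations": [],  # e.g. subj-aux inversion, topicalization, wh-movement
}
```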

Feature Vector Decomposition of Extracted Grammars (3/3)
- Determining the feature vector information allows annotation of tree frames with deep (syntactic) role information
- Examples of roles: subject (0), object (1)
- The passive transformation is activated for the example tree frame; therefore, the deep roles of its nodes differ from their surface roles
[Tree diagram: passive tree frame anchored by "bitten ... by", with surface role 0 / deep role 1 on the subject NP and surface role 1 / deep role 0 on the by-phrase NP]

Procedure to Increase Coverage of Extracted Grammar (1/3)
Step 1: Induce tree families from the feature vector representation of the extracted grammar
- A tree family (XTAG-Group 2001) is a set of tree frames having the same POS and subcat features
- Tree families represent predicate-argument structure
[Tree diagrams: tree frames belonging to one tree family]

Procedure to Increase Coverage of Extracted Grammar (2/3)
Step 2: Augment the extracted grammar using tree families
[Tree diagrams: the tree family containing a frame seen with the anchor "hurt" supplies that anchor's missing frames, augmenting the extracted grammar]

Procedure to Increase Coverage of Extracted Grammar (3/3)
Results:
- 26% reduction in misses in overall coverage
- 62% reduction in misses in verb-only coverage

Outline
- Motivation, Introduction
- Extraction of TAG from a Treebank
- Smoothing Models for TAG
  - Sparse data problem using extracted grammars
  - Supertagging
  - Baselines for supertagging
  - Smoothing approaches for supertagging
  - Future work
- Using Extracted TAG Features to Predict Semantic Roles
- Conclusions and Future Work

Sparse Data in Statistical Models Using Extracted Grammars
- Sparse data in supertagging
- Sparse data in other kinds of stochastic modeling as well
- Focus here on smoothing supertagging models, but the procedure is applicable to other models

                    #LexTrees   %LexTree coverage on unseen   %supertag accuracy
Extracted Grammar    113456            92.04                        78.55

Recall: Supertagging
- Input: the words of a sentence
- Output: each word associated with a tree frame
[Tree diagrams: supertags assigned to "wet paint"]

Trigram Model for Supertagging
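
The equation on this slide is not in the transcript; the standard trigram supertagging model of (Srinivas 97), from which the P(t_i | t_i-1, t_i-2) and P(w_i | t_i) terms on the next slide come, chooses the supertag sequence as:

```latex
\hat{t}_1^n \;=\; \arg\max_{t_1 \ldots t_n} \; \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\; P(w_i \mid t_i)
```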

Smoothing the Trigram Model
Probability distributions to smooth in the trigram supertagging model:
- P(t_i | t_i-1, t_i-2): use Katz backoff
- P(w_i | t_i): the focus of consideration here
Characterization of the different kinds of P(w_i | t_i):
- w unseen: smooth as in (Weischedel et al.)
- w and t seen, but not together: the focus of smoothing here (recall: these are the majority of the missed cases)

Experimental Setup
- Grammar: CA1-SOME-FULL
- Train: PTB Sections 02-21; Development: Section 22; Test: Section 23

Two Supertagging Baselines

Trigram supertagging accuracy:
             Overall   w,t together   w,t separate
No smooth     79.24%      84.96%          0%
Baseline 2    85.60%      87.36%         53.19%

- Baseline 2: train on Sections 02-21 and 23 for p(w|t) only
- Note the low score for Baseline 2 on the w,t-separate cases

Smoothing Equations
[The slide's equations for the smoothed and unsmoothed probability estimates are not recoverable from the transcript.]

Smoothing using Part of Speech

             Overall   w,t together   w,t separate
No smooth     79.24%      84.96%          0%
Baseline 2    85.60%      87.36%         53.19%
POS smooth    79.34%      85.00%          1.36%

Issues:
- Correct prediction is hampered by the flatness of the POS probabilities
- This also causes efficiency problems

Smoothing using Tree Families (1/2)
Training:
- Tree families are defined as before
- FAMILY-tag each word in training as follows:
  - A word that is part of a tree family is tagged with POS+SUBCAT features
  - Otherwise, the word is tagged with its POS feature only
- Example: The//DT cat//NN eats//VB_NP lettuce//NN
- Compute p(w|FAMILY) given this markup
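
A small sketch of estimating p(w|FAMILY) from FAMILY-tagged text, using the slide's example markup; the word//TAG parsing and counting code are illustrative.

```python
# Sketch of p(w|FAMILY) estimation from FAMILY-tagged training text.
# Words in a tree family carry POS+SUBCAT tags (e.g. VB_NP); other words
# carry POS only, as in the example markup on the slide.
from collections import Counter, defaultdict

tagged_text = "The//DT cat//NN eats//VB_NP lettuce//NN".split()

family_counts = defaultdict(Counter)
for token in tagged_text:
    word, family = token.rsplit("//", 1)
    family_counts[family][word.lower()] += 1

def p_word_given_family(word, family):
    counts = family_counts[family]
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

# e.g. p_word_given_family("cat", "NN") == 0.5 on this toy corpus
```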

Smoothing using Tree Families (2/2)

             Overall   w,t together   w,t separate
No smooth     79.24%      84.96%          0%
Baseline 2    85.60%      87.36%         53.19%
POS smooth    79.34%      85.00%          1.36%
Tree Family   79.46%      85.10%          1.92%

Only a bit better:
- Can't smooth two supertags that are both not in any tree family (the more common case)
- Can't leverage the fact that one non-tree-family supertag can be evidence for the existence of a tree-family supertag, and vice versa
- Flatness of the probability distribution (though less flat than POS)

Smoothing using Distributional Similarity (1/4)
- Distributional similarity smoothing (Dagan et al.) was originally used for predicting the next word in a sentence given the current word
- Here: approximate P_SIM(w|t) using P_MLE(w|t') for t' "close to" t
- SIM(t,t') is a distance between the distribution of words over t and the distribution of words over t'
- Conjecture: if these two distributions are about the same, then t and t' belong in the same tree family
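
A rough sketch of similarity-based smoothing of P(w|t) in this spirit; the Jensen-Shannon-style divergence and the exponential weighting are illustrative assumptions, not the exact metric used in the thesis.

```python
# Illustrative similarity-based smoothing of P(w|t): estimate the word
# distribution of supertag t from supertags t' whose word distributions
# are close to t's. The divergence and weighting below are assumptions.
import math
from collections import Counter

def _kl(p, q, vocab):
    return sum(p.get(w, 1e-12) * math.log(p.get(w, 1e-12) / q.get(w, 1e-12)) for w in vocab)

def js_divergence(p, q):
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * _kl(p, m, vocab) + 0.5 * _kl(q, m, vocab)

def mle_dist(word_counts):
    total = sum(word_counts.values())
    return {w: c / total for w, c in word_counts.items()}

def p_sim(word, tag, tag_word_counts, beta=5.0):
    """Similarity-weighted estimate of P(word | tag).
    tag_word_counts: dict mapping each supertag to a Counter of its words."""
    p_t = mle_dist(tag_word_counts[tag])
    total_weight, weighted_prob = 0.0, 0.0
    for other, counts in tag_word_counts.items():
        if other == tag:
            continue
        p_o = mle_dist(counts)
        sim = math.exp(-beta * js_divergence(p_t, p_o))  # closer supertags weigh more
        total_weight += sim
        weighted_prob += sim * p_o.get(word, 0.0)
    return weighted_prob / total_weight if total_weight else 0.0
```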

Smoothing using Distributional Similarity (2/4)

Trigram supertagging accuracy:
             Overall   w,t together   w,t separate
No smooth     79.24%      84.96%          0%
Baseline 2    85.60%      87.36%         53.19%
POS smooth    79.34%      85.00%          1.36%
Tree Family   79.46%      85.10%          1.92%
DS-smooth     80.65%      85.39%         21.34%

- Baseline 2: train on Sections 02-21 and 23 for p(w|t) only
- DS-smooth: about a 7% reduction in error overall (statistically significant)

Smoothing using Distributional Similarity (3/4)
The most similar tree frames form automatically induced tree families.
[Tree diagrams: the tree frames most similar to frame a3 according to the a-skew metric (a4, a5, a6)]

Smoothing using Distributional Similarity (4/4)
Error analysis:
- Low-frequency supertags are not smoothed well
  - Obviously, because of the way we smooth
  - There are a lot of these cases (tree frames have a Zipfian distribution)
- Errors due to high-frequency supertags are dramatically reduced, but much error persists
  - Reasons for these errors tend to be idiosyncratic
  - Mistaking capitalization for the start of a sentence instead of a headline
  - Errors in Penn Treebank annotation

Towards Improving Smoothing
Smoothing using distributional similarity:
- Works well, especially for high- and medium-frequency supertags
- Problem with low-frequency supertags (and there are a lot of these cases): low-frequency supertags are never smoothed together
Smoothing using tree families:
- Low-frequency supertags will be smoothed together
- But they will be given too much probability mass on average (flatness of the distribution)
Handling low-frequency supertags:
- Have the tree-families approach suggest supertags when they are low frequency (if high frequency, use distributional similarity)
- Make the distribution less flat by taking more context into consideration (p(w|bigger context), not p(w|t))

Outline
- Motivation, Introduction
- Extraction of TAG from a Treebank
- Smoothing Models for TAG
- Using Extracted TAG Features to Predict Semantic Roles (joint work with Owen Rambow)
  - PropBank semantic annotation
  - Extracted TAG features in prediction models
- Conclusions and Future Work

Motivation
Syntactic information in the form of TAG is useful for natural language applications:
- A predicate is localized with its arguments
- Relations between words are disambiguated
[Tree diagrams: TAG derivation of "Mitsubishi increased its sales of automobiles"]

Motivation
- Sometimes, a purely syntactic annotation is insufficient
- The subject argument in "WindowsXP broke" stands in a different relationship with the predicate broke than the subject argument in "Hackers broke WindowsXP"
[Tree diagrams: parses of "WindowsXP broke" and "Hackers broke WindowsXP"]

Adding Semantic Labels to Arguments
We can solve the problem by labeling each argument with how it relates semantically to its predicate.
Kinds of semantic labels:
- Domain specific
  - Flight travel: ORIG-CITY, DEST-CITY, ...
  - Terrorism: PERPETRATOR, POLITICAL-GROUP, ...
- Semantic roles are more general
  - AGENT (0): entity performing some action
  - PATIENT (1): entity being acted upon
  - Etc.

Example of Annotating Arguments with Semantic Roles
Semantic role information reifies the similarity between the subject of broke in the first sentence and the object of broke in the second sentence.
[Tree diagrams: "WindowsXP broke" (WindowsXP: sem-role 1) and "Hackers broke WindowsXP" (Hackers: sem-role 0, WindowsXP: sem-role 1)]

PropBank (Kingsbury et al. 02)
PropBank adds a layer of semantic annotation to the Penn Treebank.
Semantic information in the PropBank:
- Each predicate is annotated with:
  - Sense (word sense)
  - Roleset (the set of semantic roles associated with this predicate)
- Each argument is labeled with its semantic role

Incomplete State of PropBank Annotation
- The initial release of the PropBank is scheduled for June 2003; we used a pre-release version
- Not all predicates and arguments in the Penn Treebank are annotated, though the most frequently occurring ones are
- Predicates are not annotated for word sense, so we focus on semantic roles
  - Word senses are also needed for semantic interpretation, but...
  - 65% of predicate tokens in PropBank have only one sense
  - In another 7%, the semantic roles on the arguments completely disambiguate the word sense

Our Problem: Predicting Semantic Roles
Goal: predict the semantic role of an argument given syntactic and lexical information
[Tree diagrams: parse of "WindowsXP broke" with the semantic role (1) of the subject to be predicted]

Previous Work (Gildea, Palmer 02)
Predict the semantic role given either a gold-standard parse or an automatic parse.
Syntactic features:
- Phrase
- Direction
- Path
- Voice
Lexical features:
- Predicate headword
- Argument headword
[Tree diagram: for the subject of "WindowsXP broke" -- Phrase: NP, Direction: left, Path: V-VP-S-NP, Voice: active]
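
A small sketch of computing the path feature for the slide's example; the tree encoding and the hyphen-joined path format (matching the slide rather than Gildea and Palmer's arrow notation) are illustrative assumptions.

```python
# Illustrative extraction of the "path" feature: the chain of constituent
# labels from the predicate up to the lowest common ancestor and down to
# the argument. The Tree class and path format are assumptions.
class Tree:
    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path_feature(pred, arg):
    up, down = ancestors(pred), ancestors(arg)
    common = next(n for n in up if n in down)          # lowest common ancestor
    up_part = [n.label for n in up[: up.index(common) + 1]]
    down_part = [n.label for n in reversed(down[: down.index(common)])]
    return "-".join(up_part + down_part)

# "WindowsXP broke": S(NP(N WindowsXP), VP(V broke))
n = Tree("N"); np = Tree("NP", [n])
v = Tree("V"); vp = Tree("VP", [v])
s = Tree("S", [np, vp])
print(path_feature(v, np))  # V-VP-S-NP, as in the slide example
```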

Some Results of Previous Work
Task: given automatically parsed text, mark
- the boundary of each argument in the input sentence
- the semantic role of each argument
Results:
- Recall: 50.0%
- Precision: 57.7%

One Problem with Previous Work
- (Gildea, Palmer 02) note that sparse data afflicts their path feature
- They try to get around it by modifying the path feature, but with little improvement
- Example of how sparse data can be exacerbated: adding an adverb changes the path from V-VP-S-NP in "WindowsXP broke" to V-VP-VP-S-NP in "WindowsXP broke repeatedly"
[Tree diagrams: parses of "WindowsXP broke" and "WindowsXP broke repeatedly" with their path features]

Conjecture
- Surface-syntax features like path have limitations in identifying the semantic role
- Features based on TAG ameliorate some of these limitations, because TAG localizes the syntactically relevant information (see the previous example)
- Deep-syntax features available in our extracted TAG can also help, because they abstract away from less relevant aspects of surface syntax (cf. the use of path versus the use of path+voice)

Our Deep-Syntax Features from Extracted TAG Feature Vectors
- Features: deep-role, deep-subcat
- Example annotated with surface- and deep-syntax features:
  - Surface: Phrase: NP, Direction: left, Path: V-VP-S-NP, Voice: active
  - Deep: deep-role: 0, deep-subcat: NP0
[Tree diagrams: parse and tree frame for "WindowsXP broke" with surface- and deep-syntax feature annotations]

Prediction using Features Based on Gold-Standard Parses
Corpora:
- Training: Sections 02-21 of the PropBank
- Test: Section 00 of the PropBank
- C4.5 is used to train the models

Feature Set                                        %Accuracy
Pred_hw + Arg_hw + Deep_role + Deep_subcat            93.2
Pred_hw + Arg_hw + Deep_role                           92.0
Pred_hw + Arg_hw + Phrase + Dir + Path + Voice         86.2
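
The experiments use C4.5; as a rough stand-in, an entropy-criterion decision tree from scikit-learn over the same kinds of features might look like the sketch below (DecisionTreeClassifier is not C4.5, and the toy rows and role labels are made up for illustration).

```python
# Rough stand-in for the C4.5 experiments: an entropy-based decision tree
# over categorical features. The toy rows below are invented for the sketch;
# the slides write the roles as 0 and 1 rather than ARG0/ARG1.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

train_rows = [
    {"pred_hw": "broke", "arg_hw": "hackers",   "deep_role": "0", "deep_subcat": "NP0_NP1"},
    {"pred_hw": "broke", "arg_hw": "windowsxp", "deep_role": "1", "deep_subcat": "NP0_NP1"},
    {"pred_hw": "broke", "arg_hw": "windowsxp", "deep_role": "0", "deep_subcat": "NP0"},
]
train_labels = ["ARG0", "ARG1", "ARG1"]  # PropBank-style semantic role labels

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(train_rows)
classifier = DecisionTreeClassifier(criterion="entropy").fit(X, train_labels)

test_row = {"pred_hw": "broke", "arg_hw": "windowsxp", "deep_role": "1", "deep_subcat": "NP0_NP1"}
print(classifier.predict(vectorizer.transform([test_row]))[0])
```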

Prediction using Features Based on LDA (1/2)
- LDA (Srinivas 97) is a deterministic partial parser that uses supertagging as a first step
Procedure to predict semantic roles:
- Partially parse raw input text using LDA
- For each deep-syntax argument that LDA identifies:
  - Extract the features corresponding to that argument
  - Run the features through a C4.5-trained model to get the corresponding semantic role

Prediction using Features Based on LDA (2/2)
- Results are close to (Gildea, Palmer 02) (0.50 R / 0.58 P)
- Feature set: Pred_hw + Arg_hw + Deep_role + Deep_subcat

Task                          Recall   Precision
Sem_role + Arg_hw              0.64      0.74
Sem_role + Bnd                 0.50      0.58
Sem_role + Bnd + Arg_hw        0.49      0.57

Conclusions (1/3)
Extraction of TAG from a Treebank:
- A procedure to extract a linguistically motivated TAG
  - An error-free resource for statistical TAG models
- Evaluation of variations in the extraction procedure
  - Trade-offs between localizing dependencies and grammar size / supertagging accuracy
- Feature vector decomposition of extracted TAGs
  - For increasing grammar coverage
  - For mapping extracted TAGs onto semantics and other grammars

Conclusions (2/3)
Smoothing Models for TAG:
- Distributional similarity smoothing
  - Significantly increases supertagging accuracy
  - The similarity metric automatically induces tree families
- Generally, TAG suffers from greater sparse data problems than other grammars (e.g., Collins 99), but from the TAG perspective we can try different smoothing techniques from those usually employed

Conclusions (3/3)
Using Extracted TAG Features to Predict Semantic Roles:
- TAG might help because it localizes dependencies
- Deep-syntax features from our extracted TAG improve prediction over surface-syntax features
- They may help to such an extent that a partial parser (LDA) can be used to annotate raw text with semantic roles with accuracy comparable to that of a full parser

Future Work (1/3)
PropBank-inspired work:
- Compare using deep-syntax versus surface-syntax features over LDA output
- Wait for the final version of PropBank to run experiments on predicting word senses
- Extract a TAG with a semantic rather than syntactic domain of locality

Future Work (2/3)
Extraction of TAG from a Treebank:
- Replace heuristics with a statistical approach
  - Learn that V projects to VP by looking at the distribution of VPs in the treebank
  - Try to minimize the number of lexicalized trees produced, to optimize the stochastic model on which the resulting TAG is based
- Detailed comparison of the extracted TAG and a hand-written TAG
  - Pinpoint missing constructions in the hand-written TAG
  - Find errors in treebank annotation
- Examine differences between TAGs extracted from different sublanguage corpora
[Tree diagrams: alternative projections of V, e.g. VP -> V NP vs. VP -> V S]

Future Work (3/3)
Smoothing models for TAG:
- Combine smoothing using distributional similarity with smoothing using tree families
- Smoothing using feature vectors, beyond tree families
- Compare these smoothing methods to the smoothing traditionally employed for LCFGs
- Smooth probability distributions other than P(w|t)