Treebanks are Not Naturally Occurring Data Choices in Treebank Design and What They Mean for Natural Language Processing Owen Rambow Columbia University.

Goal of this Talk
In natural language processing, we all use syntactic representations (treebanks). In theoretical syntax, they use syntactic representations. Problems in both communities:
– Little reflection on what these syntactic representations mean
– The representation is confused with the phenomenon which is the object of scientific investigation: natural language syntax
Goal: increase meta-scientific understanding of syntactic representations for NLP

Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

Acknowledgements Rajesh Bhatt and Fei Xia – Bhatt, Rambow and Xia: Linguistic Phenomena, Analyses, and Representations: Understanding Conversion between Treebanks, IJCNLP 2011 Hindi-Urdu Treebank Project – Dipti Misra Sharma (IIIT Hyderabad) – Martha Palmer (Colorado) – NSF Grant All errors and infelicitous choices in this presentation are my own

What is a Syntactic Representation?
1. Syntactic phenomena, e.g.:
– Subject of a verb
– Relative clause
– Small clause
Linguists tend to agree on what phenomena exist
2. Mathematical representation type, e.g.:
– Phrase structure tree
– Dependency tree
– Or something more complicated: LFG, TAG, …
3. Formal syntactic description:
a. Mapping from phenomena to representations (in a particular type)
b. The representation chosen for a specific phenomenon is also called an analysis
c. The phenomena extracted from a representation are its interpretation
d. A formal description is a syntactic theory if it makes predictions

Example: Small Clauses
Hindi
– आतिफ ने सीमा को बेवकूफ समझा
– Atif ne Seema ko bewakuuf samjhaa
– Atif Erg Seema Acc stupid consider.Pfv
– 'Atif considered Seema stupid.'
English
– Atif considered Seema stupid
– Atif considered her stupid

What is the Phenomenon?
Syntactically and semantically, consider takes a clausal complement
– Atif considered [clause that she is stupid]
– Atif considered [clause her stupid]
But two problems:
– No verb
– her is semantically the subject of stupid but has accusative case, which is unusual (subjects are usually nominative)
So: Atif considered [small clause her stupid]

What is the Representation Type? For this example, we will show dependency trees and phrase structure trees

Analysis 1a for Small Clauses: No Accusative Case Marking
Structure represents her as subject, but not the accusative case marking of her
DS: [considers [Atif:Subj] [stupid:Obj [her:Subj]]]

Analysis 1b for Small Clauses: Exceptional Case Marking
Structure represents her as subject and the accusative case marking through node label
DS: [considers [Atif:Subj] [stupid:Obj-ECM [her:Subj]]]

Analysis 1a for Small Clauses: No Accusative Case Marking
Structure represents her as subject, but not the accusative case marking of her
DS: [considers [Atif:Subj] [stupid:Obj [her:Subj]]]
PS: (S (NP Atif) (VP considers (S her (VP (AdjP stupid)))))

Analysis 1b for Small Clauses: Exceptional Case Marking
Structure represents her as subject and the accusative case marking through node label
DS: [considers [Atif:Subj] [stupid:Obj-ECM [her:Subj]]]
PS: (S (NP Atif) (VP considers (SC her (VP (AdjP stupid)))))
Close to the analysis adopted in Chomsky (1981)

Note on DS and PS
These analyses are intuitively very similar
Formal notion: "consistency" (Fei Xia; see Bhatt, Rambow & Xia 2011)
– Intuition: a very simple and general algorithm can transform a consistent DS to PS and vice versa
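A toy sketch of what such a "simple and general" conversion might look like, assuming a consistent pair in which every DS node projects exactly one maximal projection; the category naming scheme (POS plus "P") is my own illustrative choice, not the algorithm from Xia's work, and word order is ignored (the head is simply placed first).

```python
# Toy DS node encoding: (word, POS, [dependent triples]).
def ds_to_ps(word, pos, deps):
    """Map a consistent DS to a bracketed PS string: each DS node becomes
    one maximal projection dominating the head word and one projection
    per dependent."""
    parts = [f"({pos} {word})"] + [ds_to_ps(*dep) for dep in deps]
    return f"({pos}P {' '.join(parts)})"

ds = ("eats", "V", [("Rajesh", "N", []), ("laddus", "N", [])])
print(ds_to_ps(*ds))
# (VP (V eats) (NP (N Rajesh)) (NP (N laddus)))
```

Going back from PS to DS is equally mechanical for trees of this shape: collapse each projection onto its head word. It is intermediate projections, discussed later in the talk, that complicate this picture.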

Analysis 2a for Small Clauses: General Monoclausal Analysis
Structure represents the accusative case marking of her (as object of the matrix verb), but not her as semantic subject
DS: [considers [Atif:Subj] [her:Obj] [stupid:Obj2]]

Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis
Structure represents the accusative case marking of her (as object of the matrix verb) and her as semantic subject using node label
DS: [considers [Atif:Subj] [her:Obj] [stupid:ObjPred]]

Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis
Structure represents the accusative case marking of her (as object of the matrix verb) and her as semantic subject using node label
DS: [considers [Atif:k1] [her:k2] [stupid:k2s]]
Neo-Paninian analysis from IIIT Hyderabad, used for DS in the Hindi-Urdu Treebank

Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis
Structure represents the accusative case marking of her (as object of the matrix verb) and her as semantic subject using node label
DS: [समझा [आतिफ ने:k1] [सीमा को:k2] [बेवकूफ:k2s]]
Neo-Paninian analysis from IIIT Hyderabad, used for DS in the Hindi-Urdu Treebank

Analysis 2a for Small Clauses: General Monoclausal Analysis
Structure represents the accusative case marking of her (as object of the matrix verb), but not her as semantic subject
DS: [considers [Atif:Subj] [her:Obj] [stupid:Obj2]]
PS: (S (NP Atif) (VP considers her (AdjP stupid)))

Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis
Structure represents the accusative case marking of her (as object of the matrix verb) and her as semantic subject using node label
DS: [considers [Atif:k1] [her:k2] [stupid:k2s]]
PS: (S (NP Atif) (VP considers (SC her (AdjP stupid))))

Analysis 3 for Small Clauses: Raising to Object
Structure represents the accusative case marking of her and her as semantic subject, but requires an empty category
DS: [considers [Atif:Subj] [her1:Obj] [stupid:Obj-Pred [e1:Subj]]]

Analysis 3 for Small Clauses: Raising to Object
Structure represents the accusative case marking of her and her as semantic subject, but requires an empty category
DS: [considers [Atif:Subj] [her1:Obj] [stupid:Obj-Pred [e1:Subj]]]
PS: (S (NP Atif) (VP considers her1 (S e1 (VP (AdjP stupid)))))
Analysis used for PS in the Hindi-Urdu Treebank

Comparison of Representations
Less information: Tree 1a (no case marking) and Tree 2a (no semantic subject)
Same information, represented differently: Tree 1b, Tree 2b, Tree 3

Initial Summary: Syntactic Phenomena, Representation Types, Analyses
Syntactic phenomena are the empirical data of syntax as part of the science of language
– Can be very similar across languages
There can be several possible analyses
– Some have less information
– But there can be different analyses that represent the same information differently
The analyses can be similar in DS and PS
Lots of choices in treebank design!

Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

Representation Types: Dependency and Phrase Structure
Dependency Tree (DS):
– One label alphabet: words (= words in the sentence)
– All nodes labeled with words or empty strings
Phrase Structure Tree (PS):
– Two disjoint label alphabets: terminals (= words in the sentence) and nonterminals
– All and only interior nodes are labeled with nonterminals
– Leaves are labeled with terminals or empty strings
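These definitions can be made concrete as minimal data structures. A Python sketch (class and field names are my own, not from the talk), using the small-clause sentence under the flat monoclausal analysis:

```python
# Minimal sketch of the two tree types. A DS node carries a word (the
# single label alphabet); a PS node carries a nonterminal if it is
# interior and a terminal (word) if it is a leaf.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DepNode:
    word: str                  # every node is labeled with a word (or empty string)
    deprel: str = ""           # dependency label on the incoming arc, e.g. "Subj"
    children: List["DepNode"] = field(default_factory=list)

@dataclass
class PSNode:
    label: str                 # nonterminal on interior nodes, terminal on leaves
    children: List["PSNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

# DS for "Atif considers her stupid" (general monoclausal analysis 2a):
ds = DepNode("considers", children=[
    DepNode("Atif", "Subj"),
    DepNode("her", "Obj"),
    DepNode("stupid", "Obj2"),
])

# PS for the same sentence, with a flat VP:
ps = PSNode("S", [
    PSNode("NP", [PSNode("Atif")]),
    PSNode("VP", [
        PSNode("considers"),
        PSNode("her"),
        PSNode("AdjP", [PSNode("stupid")]),
    ]),
])
```

Note how the difference is purely in the labeling discipline: the DS has no node that is not a word, while the PS reserves its interior nodes for the second alphabet.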

Non-Differences Between DS and PS
WRONG: "DS can't have empty categories"
Counterexample, DS: [considers [Atif:Subj] [her1:Obj] [stupid:Obj-Pred [e1:Subj]]]

Non-Differences Between DS and PS
WRONG: "DS must be unordered, PS must be ordered"
Counterexample: an ordered DS for the (German-like, verb-final) clause "that the dancer a flower the violinist gives", headed by gives

Non-Differences Between DS and PS WRONG: “PS can’t be non-projective/have crossing arcs”

Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

The Double Life of Dependency and of Constituency
There are two meanings to "dependency":
– A type of tree (no nonterminal labels)
– A type of syntactic phenomenon, namely a relation between words (e.g., "subjecthood")
There are two meanings to "constituency":
– A type of tree (terminal labels on leaf nodes; = PS tree)
– A type of syntactic phenomenon, namely a grouping of words into phrases
Mathematical objects = tools for linguists to represent data
Empirical facts = data for linguists
Don't confuse data with its representation!

Constituency in PS Representation but not in Syntax
VP premodifiers in the PTB can be either sister to VP or inside VP; there is no difference in meaning
– (S (NP-SBJ Sandy) (ADVP-TMP often) (VP throws (NP curves)))
– (S (NP-SBJ Sandy) (VP (ADVP-TMP often) throws (NP curves)))
"The variation is in general free" (PTB Guidelines p. 137)

Constituency in Syntax but not in PS Representation
Flat NP in PTB
– Syntax of Adj-N-N sequences: (green (baby bassinet)), ((small business) administration)
– Representation in the Penn Treebank: (green baby bassinet), (small business administration)
Conscious decision not to represent NP-internal pre-nominal structure

Dependency in DS Representation but not in Syntax
"---" links in the CATiB dependency treebank of Arabic
– Signal that the connected words are not analyzed by annotators for dependency
Pof link in the DS part of the Hindi-Urdu Treebank
– Used for multiword expressions which form a unit syntactically (and thus have no internal structure) but may occur apart in the sentence

Dependency in Syntax but not in DS Representation
Small clauses in the DS for the Hindi-Urdu Treebank: the dependency between the embedded predicate and its semantic subject is not marked as a dependency link in the tree
DS: [considers [Atif:k1] [her:k2] [stupid:k2s]]

Aren't DS and PS Representations Complementary?
NO! Syntactic dependency can be encoded in PS, and typically is
Usual convention: attachment within the projection of a head shows the type of dependency on that head

Aren't DS and PS Representations Complementary?
NO! Syntactic constituency is represented in DS
Usual convention: each node stands both for the word and for the phrase consisting of the word and all its descendants
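The convention above can be made concrete: under it, the constituents of a sentence are simply the subtree yields of the DS. A minimal sketch (the tuple encoding of a DS as (head word, [dependent subtrees]) is my own illustrative choice):

```python
# DS encoded as (head_word, [dependent subtrees]).
def phrase(tree):
    """The phrase headed by a word: the word plus all its descendants."""
    head, deps = tree
    words = {head}
    for dep in deps:
        words |= phrase(dep)
    return words

def constituents(tree):
    """One constituent per DS node: the yield of that node's subtree."""
    _head, deps = tree
    result = [phrase(tree)]
    for dep in deps:
        result += constituents(dep)
    return result

ds = ("eats", [("Rajesh", []), ("happily", []), ("laddus", [])])
# constituents(ds) recovers the whole clause plus one single-word
# phrase per dependent, i.e. the constituency implicit in the DS.
```

This is exactly the sense in which a DS "represents" constituency: no extra annotation is needed, only the reading convention.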

Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

But What About Intermediate Projections?
If a PS:
– only has preterminals (e.g. V) and maximal projections (e.g. S);
– and no attachment happens at the preterminal;
then we have a trivial correspondence to DS (consistency in the formal sense of Xia)
Then PS and DS must have the exact same expressive power

No Intermediate Projections: DS and PS Representationally Equivalent
PS: (S (N Rajesh) (Adv happily) (V eats) (N laddus))
DS: [eats [Rajesh] [happily] [laddus]]
What about the POS tags in the DS?

No Intermediate Projections: DS and PS Representationally Equivalent
PS: (S (N Rajesh) (Adv happily) (V eats) (N laddus))
DS: [eats/V [Rajesh/N] [happily/Adv] [laddus/N]]
What about grammatical functions?

No Intermediate Projections: DS and PS Representationally Equivalent
PS: (S (N-Subj Rajesh) (Adv-Adj happily) (V eats) (N-Obj laddus))
DS: [eats/V [Rajesh/N:Subj] [happily/Adv:Adj] [laddus/N:Obj]]

Intermediate Projections
If PS only has preterminals (e.g. V) and maximal projections (e.g. S), we have a trivial correspondence to DS
No direct equivalent of intermediate projections (e.g. VP) in DS

Intermediate Projections: DS and PS Different!
PS: (S (NP (N Rajesh)) (VP (AdvP (Adv happily)) (V eats) (NP (N laddus))))
DS: [eats/V [Rajesh/N:Subj] [happily/Adv:Adj] [laddus/N:Obj]]
Note: X – X' – XP, where XP = maximal projection; EXCEPT V – VP – S, where VP = intermediate projection

Intermediate Projections
If PS only has preterminals (e.g. V) and maximal projections (e.g. S), we have a trivial correspondence to DS
No direct equivalent of intermediate projections (e.g. VP) in DS: do we have a problem?
What is important: how intermediate projections are used in the formal syntactic description
A node by itself is not information about the syntax; only a node together with its interpretation is

Why Have Intermediate Projections?
Evidence for IntProj as a constituent
– VP fronting in English
Use of a VP to represent semantic facts
– Interpretation of adverbs in English
Use of IntProj to represent syntactic facts
– Subject and object of verbs in English
– Genitive versus adjectival modification in Arabic
– Also: arguments of verbs in Hindi; verb second in German

Intermediate Projection as Constituent: VP Fronting in English
VP can be moved as a constituent
– Rajesh said he would [VP eat laddus], and [VP eat laddus] Rajesh did

VP Fronting in English
[PS tree for "eat laddus Rajesh did": the fronted VP (V eat, NP laddus) attaches under Sbar, above the S containing NP Rajesh and Aux did]

What can Dependency Do? ε1ε1 laddus Rajeshdid Aux Obj Subj eat 1 Top did laddus Rajesh Obj Subj eat Top did Rajesh Subj laddus Obj eat Comp eat didladdus ObjAux Rajesh Subj If auxiliary did is head (basically, we have a VP!) If eat is head Issue: two possible analyses of auxiliary-verb structures in DS

Intermediate Projections and Semantic Scope
Possible claim:
– VP adverbs characterize the manner of the event: John quickly/happily played chess
– S adverbs characterize the speaker's attitude towards the event: Happily/fortunately (enough), John lost against me at chess
Problem: the interpretation is independent of word order
– Quickly/happily John played chess
– John happily/oddly (enough) lost against me at chess
Conflict between using VP to identify the subject and using VP to describe the semantics of adverbs
– Would need traces to show scope
In fact, the Penn Treebank for English does not use VPs for scope

Use of VP to Indicate Subject
DS: [considers [Atif:Subj] [her1:Obj] [stupid:Obj-Pred [e1:Subj]]]
PS: (S (NP Atif) (VP considers her1 (S e1 (VP (AdjP stupid)))))

Projections: Noun Modification in the Arabic Treebank
Idafa (genitive) construction:
– (NP (N بيت /byt/house) (NP (N المصري /AlmSry/the-Egyptian)))
– the house of the Egyptian
Modification construction:
– (NP (N الوزير /Alwzyr/the-minister) (Adj المصري /AlmSry/the-Egyptian))
– the Egyptian minister
Syntactic difference signaled by projection to X or XP
Can't do this in DS – need labels
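The point of this slide is that DS needs labels where PS can use projection level. A toy sketch of labels doing exactly that work (the arc-label names "idafa" and "mod" are my own illustrative choices, not the Arabic Treebank's):

```python
# DS as (head_word, [(dependent_word, arc_label), ...]): the same
# head-dependent geometry, disambiguated only by the arc label.
idafa = ("byt", [("AlmSry", "idafa")])     # 'the house of the Egyptian'
attrib = ("Alwzyr", [("AlmSry", "mod")])   # 'the Egyptian minister'

def construction(tree):
    """Read the construction type off the arc label of the dependent."""
    _head, deps = tree
    label = deps[0][1]
    return {"idafa": "genitive (idafa)", "mod": "adjectival modification"}[label]

print(construction(idafa))   # genitive (idafa)
print(construction(attrib))  # adjectival modification
```

Structurally the two trees are identical; all the information that PS carries via N vs. NP projection sits in the label alphabet of the arcs.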

Conclusion from Intermediate Projections Intermediate projections (eg VP, N’) are specific to PS They can be used to encode syntactic or semantic information But: DS can express the same information! Conclusion: the existence of intermediate projections in PS is not a reason to say that PS is more expressive than DS

Overview What is a Syntactic Representation? Dependency and Phrase Structure – Definitions – Syntactic and representational constituency and dependency – Intermediate projections What does this Mean for NLP? Takehome lessons

What Does This Mean for NLP?
Treebanks are not naturally occurring data
The guidelines are painstakingly produced by linguists and represent a formal description of the language
Annotators understand a sentence, determine what syntactic phenomena exist, and use the guidelines to choose an analysis for the sentence (a structure)
Users of the treebank can use the guidelines to interpret the structures and get back the syntactic phenomena present
These phenomena, and not their representation in the treebank, can be used for NLP in whatever representation is chosen by the researcher!
There is already lots of linguistics in our resources; we just need to make use of that linguistic information!

Conclusion 4 takehome lessons

Lesson 1: There are All Sorts of Trees Fallacy: DS trees are unordered, have no empty categories,… and PS trees are ordered and … Truth: DS and PS trees are graphs that can have all sorts of properties

Lesson 2: Trees Need an Interpretation Fallacy: Trees have intrinsic syntactic meaning Truth: Trees only gain syntactic meaning through interpretation; in treebanks: guidelines

Lesson 3: Conversion between Formats Depends on Interpretation Fallacy: It is easier/harder to convert DS to PS than vice versa Truth: conversion means preserving interpretation; thus it depends on the two representations between which conversion happens

Lesson 4: There are two Meanings to “Constituency” and “Dependency” Fallacy: DS trees only show dependency and PS trees only show constituency Truth: there are distinct mathematical (data structure) notions and linguistic notions of “dependency” and “phrase structure”; each tree type can show each

Thank you!