GALE Banks 11/9/06 1 Parsing Arabic: Key Aspects of Treebank Annotation Seth Kulick Ryan Gabbard Mitch Marcus.

Outline
- Summary of recent results
- Part of Speech/Treebank "mismatches"
- Components of Flat NPs
- Test and Train Results
- Conclusion

Recent Results
- Effect of sentence splitting
  - S -> S (wa) S (wa) S
  - Breaking these improves F-measure by 1.25%
  - Investigating automatic accuracy of S splitting
- Effect of "spurious NPs" in coordination
  - (NP (NP x) and (NP y) and (NP z)) changed to (NP x and y and z)
  - Improves F-measure by 0.5%
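The "spurious NP" change above can be sketched as a small tree transformation. This is only an illustration, not the authors' code: the `parse`, `show`, and `flatten_coordination` helpers are hypothetical, and it assumes that conjunctions are tagged `CONJ` and that only conjuncts which are already flat (POS-word children only) are spliced up.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def show(t):
    """Render a nested-list tree back into bracketed form."""
    return t if isinstance(t, str) else "(" + " ".join(show(c) for c in t) + ")"


def is_preterminal(t):
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


def is_flat_np(t):
    return (isinstance(t, list) and not is_preterminal(t)
            and t[0] == "NP" and all(is_preterminal(c) for c in t[1:]))


def is_conj(t):
    return is_preterminal(t) and t[0] == "CONJ"  # assumed conjunction tag


def flatten_coordination(t):
    """Splice flat NP conjuncts into their parent NP (assumed criterion)."""
    if not isinstance(t, list) or is_preterminal(t):
        return t
    kids = [flatten_coordination(c) for c in t[1:]]
    coordinated = any(is_conj(k) for k in kids)
    if t[0] == "NP" and coordinated and all(is_conj(k) or is_flat_np(k) for k in kids):
        spliced = []
        for k in kids:
            spliced.extend(k[1:] if is_flat_np(k) else [k])
        kids = spliced
    return [t[0]] + kids


before = parse("(NP (NP (NOUN x)) (CONJ wa) (NP (NOUN y)) (CONJ wa) (NP (NOUN z)))")
after = flatten_coordination(before)
print(show(after))
```

NPs whose conjuncts have internal structure are left untouched here; whether those should also be flattened is a guideline question the slides leave open.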

POS/Treebank Mismatches
- "Ideal": XP projection headed by X
  - Ideal and reality in the PTB and ATB
- Ambiguities for (POS word) make the parser's job harder

VP headed by noun
- 6% of VPs in ATB have a nonverbal head
- Changed heads to have new POS tag: "DV"
- Temporary approximation to current annotation changes
- 0.7 increase in F-measure

(VP (NOUN mugAdar+at+i- [departure])
    (NP-SBJ (POSS_PRON -hi [his]))
    (NP-OBJ (DET+NOUN Al+bayot+a [the house])
            (DET+ADJ Al+>aboyaD+a [the white])))
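A rough sketch of the "DV" retagging described above. The slides do not say how the head was located or which tags count as verbal, so both are assumptions here: the head is taken to be the first POS-word child of the VP, and a tag counts as verbal if it starts with one of the hypothetical prefixes in `VERBAL_PREFIXES`.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def is_preterminal(t):
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


# Assumption: tags beginning with these prefixes are verbal.
VERBAL_PREFIXES = ("PV", "IV", "CV", "VERB")


def relabel_dv(t):
    """Retag the (assumed) head of a VP as DV when its tag is not verbal."""
    if not isinstance(t, list) or is_preterminal(t):
        return t
    kids = [relabel_dv(c) for c in t[1:]]
    if t[0].split("-")[0] == "VP":
        for i, k in enumerate(kids):
            if is_preterminal(k):  # assumed head: first POS-word child
                if not k[0].startswith(VERBAL_PREFIXES):
                    kids[i] = ["DV", k[1]]
                break
    return [t[0]] + kids


nominal_vp = parse("(VP (NOUN mugAdar+at+i-) (NP-SBJ (POSS_PRON -hi)) "
                   "(NP-OBJ (DET+NOUN Al+bayot+a) (DET+ADJ Al+>aboyaD+a)))")
verbal_vp = parse("(VP (IV ta+Eomal+a) (NP-ADV (ADJ dA}im+AF)))")
print(relabel_dv(nominal_vp)[1])
print(relabel_dv(verbal_vp)[1])
```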

NP headed by adj - #1

(S (NP-SBJ (PRON_1S -niy [I]))
   (NP-PRD (ADJ saEiyd+N [happy])))

ADJ heads NP-PRD, elsewhere ADJP-PRD

(VP (PV+PVSUFF_SUBJ kAn+a [be+he])
    (NP-SBJ-1 (-NONE- *T*))
    (ADJP-PRD (ADJ saEiyd+AF [happy])
              (PP … [with the voting])))

NP headed by adj - #2

(VP (IV ta+Eomal+a [they work])
    (NP-SBJ rAbiT+ap+u Al+maxAtyr+i [league of the mukhtars (village chiefs)])
    (NP-ADV (ADJ dA}im+AF [always])))

ADJ heads NP-ADV, elsewhere ADVP, ADJP

(VP (IV na+>omal+a [we hope for])
    (NP-SBJ (-NONE- *))
    (ADVP (ADJ dA}im+AF [always])))

(VP (IV ya+SiH~+u [he/it+be correct])
    (NP-SBJ-1 (-NONE- *T*))
    (ADJP (ADJ dA}im+AF [always])))

ADJP headed by noun

(S (NP-SBJ (NOUN >um~ah+At+u- [mothers])
           (POSS_PRON_3P -hum [their]))
   (ADJP-PRD (NOUN >amiyrokiy~+At+N [American])))

Also as ADJ

(NP (NOUN >um~ah+At+K [mothers])
    (ADJ >amiyrokiy~+At+K [American]))

ADVP headed by conj

(S (ADVP (FOCUS_PART >am~A [as_for/concerning]))
   (NP-TPC-1 Haqiyb+ap+u Al+xArijiy~+ap+I [the foreign ministry’s portfolio])
   (ADVP (CONJ fa- [and/so]))
   (VP ….

(CONJ fa-) also as child of S

(S (S …) (PUNC ,) (CONJ fa- [and/so]) (S …))

Mismatches in ATB and PTB

        ATB3      PTB2.0
VP      6.0%      0.5%
NP      5.0%      1.6%
ADJP    7.3%      23.4%
ADVP    45.37%    8.0%
PP      0.8%      1.8%
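Mismatch rates of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the procedure behind the table: the head is taken to be the first POS-word child that is not an empty category, and the `EXPECTED` table of tag prefixes per phrase type is hypothetical.

```python
from collections import Counter


def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def is_preterminal(t):
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


# Assumption: which head tag prefixes count as matching each phrase type.
EXPECTED = {
    "VP": ("PV", "IV", "CV", "VERB"),
    "NP": ("NOUN", "DET+NOUN", "PRON"),
    "ADJP": ("ADJ", "DET+ADJ"),
    "ADVP": ("ADV",),
    "PP": ("PREP",),
}


def mismatch_rates(trees):
    """Fraction of XPs whose (assumed) head tag is not of category X."""
    total, bad = Counter(), Counter()

    def visit(t):
        if not isinstance(t, list) or is_preterminal(t):
            return
        cat = t[0].split("-")[0]  # strip function tags such as -SBJ
        head = next((k for k in t[1:]
                     if is_preterminal(k) and k[0] != "-NONE-"), None)
        if cat in EXPECTED and head is not None:
            total[cat] += 1
            if not head[0].startswith(EXPECTED[cat]):
                bad[cat] += 1
        for k in t[1:]:
            visit(k)

    for t in trees:
        visit(t)
    return {cat: bad[cat] / total[cat] for cat in total}


trees = [
    parse("(S (NP-SBJ (PRON_1S -niy)) (NP-PRD (ADJ saEiyd+N)))"),
    parse("(VP (PV+PVSUFF_SUBJ kAn+a) (NP-SBJ-1 (-NONE- *T*)) "
          "(ADJP-PRD (ADJ saEiyd+AF)))"),
]
rates = mismatch_rates(trees)
print(rates)
```

Phrases with no overt POS-word child are skipped rather than counted, which is one of several choices that would shift the percentages.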

XP/X mismatches - Summary
- This matters:
  - Headless VPs to "DV" modification: +0.7%
  - PTB: 23.4% mismatch for ADJP
  - Overall: ADJP:
- Real-life linguistic complexity
  - Need guidelines – visual prop time
  - Some automatic changes likely
- No guarantee of level of improvement, but:
  - Should be a priority

Flat NPs
- Flat NPs: only (POS word) children
- Experiment: evaluate with Flat NPs as a different bracket
  - Affects overall score

(Gold)
(NP (NOUN -<ijorA’+i [conducting])
    (NP (NOUN {inotixAb+At+K [elections])
        (ADJ niyAbiy~+ap+K [representative])))

Flat NPs

(Gold)
(NP (NOUN -<ijorA’+i [conducting])
    (NP (NOUN {inotixAb+At+K [elections])
        (ADJ niyAbiy~+ap+K [representative])))

(Test)
(NP (NN -<ijorA’+i [conducting])
    (NNS {inotixAb+At+K [elections])
    (JJ niyAbiy~+ap+K [representative]))

Under regular evaluation, top NPs match

Flat NPs

(Gold)
(NP (NOUN -<ijorA’+i [conducting])
    (FLATNP (NOUN {inotixAb+At+K [elections])
            (ADJ niyAbiy~+ap+K [representative])))

(Test)
(FLATNP (NN -<ijorA’+i [conducting])
        (NNS {inotixAb+At+K [elections])
        (JJ niyAbiy~+ap+K [representative]))

With FlatNP evaluation, no match
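The contrast between the two evaluations can be reproduced with a short sketch. It is illustrative only: brackets are compared as (label, start, end) sets, which is a simplification of standard labeled-bracket scoring, and the helper names are hypothetical.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def is_preterminal(t):
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


def relabel_flat(t):
    """Rename NPs that have only (POS word) children to FLATNP."""
    if is_preterminal(t):
        return t
    kids = [relabel_flat(k) for k in t[1:]]
    flat = t[0].split("-")[0] == "NP" and all(is_preterminal(k) for k in kids)
    return ["FLATNP" if flat else t[0]] + kids


def brackets(t):
    """Collect (label, start, end) spans for every phrase in the tree."""
    out = set()

    def walk(n, start):
        if is_preterminal(n):
            return start + 1
        end = start
        for k in n[1:]:
            end = walk(k, end)
        out.add((n[0], start, end))
        return end

    walk(t, 0)
    return out


gold = parse("(NP (NOUN <ijorA'+i) (NP (NOUN {inotixAb+At+K) (ADJ niyAbiy~+ap+K)))")
test = parse("(NP (NN <ijorA'+i) (NNS {inotixAb+At+K) (JJ niyAbiy~+ap+K))")

regular = brackets(gold) & brackets(test)
flatnp = brackets(relabel_flat(gold)) & brackets(relabel_flat(test))
print("regular matches:", regular)
print("FlatNP matches:", flatnp)
```

Under the regular labels the top NP spanning all three words counts as correct; once flat NPs carry their own label, the test tree's single bracket no longer matches anything in the gold tree.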

Flat NPs
- Importance of Flat NPs
  - 30% of brackets are Flat NPs
  - Errors percolate up
- ATB3 score on Flat NPs not good enough
- Unclear why, but need some things from ATB

[Table: Flat NP and overall scores for PTB and ATB; values missing from the transcript]

Flat NPs
- Clear statement of what can go in flat NPs
- Regular expressions for each head
- Certain things fall out:
  - Questionable categories, e.g. (DET+NOUN DET+NOUN)
    (NP Al+baHor+i [the sea] Al+>aHomar+i [the red])
  - Nouns that occur before a head noun are limited to a small class: quantifiers
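One way to read "regular expressions for each head" is as patterns over the tag sequence of a flat NP. The patterns below are hypothetical and far smaller than any real guideline would need; they only illustrate how a questionable sequence such as (DET+NOUN DET+NOUN) falls out as unmatched.

```python
import re

# Hypothetical patterns over space-joined POS tags, one per head type.
# The optional leading NOUN stands in for a prenominal quantifier; a tag-level
# pattern cannot by itself check that the noun really is a quantifier.
FLAT_NP_PATTERNS = {
    "definite noun head": re.compile(r"(NOUN )?DET\+NOUN( DET\+ADJ)*"),
    "indefinite noun head": re.compile(r"(NOUN )?NOUN( ADJ)*"),
}


def classify_flat_np(tags):
    """Return the name of the first pattern matching the tag list, else None."""
    sequence = " ".join(tags)
    for name, pattern in FLAT_NP_PATTERNS.items():
        if pattern.fullmatch(sequence):
            return name
    return None


print(classify_flat_np(["NOUN", "DET+NOUN", "DET+ADJ"]))
print(classify_flat_np(["NOUN", "ADJ"]))
print(classify_flat_np(["DET+NOUN", "DET+NOUN"]))
```

Flat NPs that match no pattern are candidates for review rather than automatic errors.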

Flat NPs

Quantifier as prenominal modifier in flat NP:
(NP (NOUN kul~+a [every/all/each_one])
    (DET+NOUN Al+nuSuws+I [the texts])
    (DET+ADJ Al+tijAriy~+ap+I [the business]))

Quantifier as taking NP complement:
(NP (NOUN kul~+a [every/all/each_one])
    (NP (DET+NOUN Al+duwal+i [the countries])
        (DET+ADJ A+Earabiy~+ap+I [the Arabic])))

Quantifiers take NP complement: 15%

Flat NPs - Summary
- Real-life linguistic complexity
  - Need guidelines for NP structure, quantifiers
  - Some automatic changes likely
  - Maybe different POS tag for NOUNs with different distribution?
- No guarantee of level of improvement, but:
  - Should be a priority

Test on Train
- ATB3 lower, but not so much
- Analysis of dependency errors

[Table: test-on-train scores for PTB and ATB, over all sentences and sentences of length <= 40; values missing from the transcript]

Dependency Analysis

                    PTB2.0               ATB3
Dependency          % all     F-meas     % all     F-meas
NPB -> head mod     31.08%    99.19%     16.33%    95.83%
NP -> head NP       0.0%      N/A        10.13%    97.08%

- % all = % of all dependencies
- NPB = "base NP", non-recursive NP
- More evidence that minimal NPs matter a lot

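Dependency Analysis

[Table: % of all dependencies and F-measure in PTB2.0 and ATB3 for PP attached under NP next to an NPB and PP attached under NP next to an NP. Only two values survive in the transcript: 5.23% and 65.08.]

- Why the difference in PP adjoining to NP, and not just NPB?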

PP attachment in PTB
- Adjuncts at the same level
  - Okay:     (NP (NP ….) (PP ….) (PP …))
  - Not okay: (NP (NP (NP …) (PP …)) (PP …))
- This is true for ATB also

PP attachment in PTB

(NP (NP streets)
    (PP of (NP (NP the city)
               (PP of (NP Long Beach))
               (PP in (NP the state…)))))

(NP (NP streets)
    (PP of (NP (NP (NP the city)
                   (PP of (NP Long Beach)))
               (PP in (NP the state…)))))

- First is okay, second is not
- PPs in PTB do not adjoin to recursive NPs
- PPs in ATB do, because of Al<DAfp
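The PTB constraint above lends itself to an automatic consistency check. The sketch below is an assumption about how such a check might look, not a tool from the talk: it flags any NP that has a PP child alongside an NP child which itself contains an NP. POS tags have been added to the slide's example so the trees parse.

```python
def parse(s):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def is_preterminal(t):
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


def cat(t):
    return t[0].split("-")[0]  # strip function tags such as -SBJ


def is_recursive_np(t):
    """An NP is treated as recursive if one of its children is an NP."""
    return (not is_preterminal(t) and cat(t) == "NP"
            and any(cat(g) == "NP" for g in t[1:]))


def pp_on_recursive_np(t, found=None):
    """Collect NPs in which a PP is sister to a recursive NP."""
    if found is None:
        found = []
    if is_preterminal(t):
        return found
    kids = t[1:]
    if (cat(t) == "NP" and any(cat(k) == "PP" for k in kids)
            and any(is_recursive_np(k) for k in kids)):
        found.append(t)
    for k in kids:
        pp_on_recursive_np(k, found)
    return found


OKAY = parse("(NP (NP (NN streets)) (PP (IN of) (NP "
             "(NP (DT the) (NN city)) "
             "(PP (IN of) (NP (NNP Long) (NNP Beach))) "
             "(PP (IN in) (NP (DT the) (NN state))))))")
NOT_OKAY = parse("(NP (NP (NN streets)) (PP (IN of) (NP "
                 "(NP (NP (DT the) (NN city)) "
                 "(PP (IN of) (NP (NNP Long) (NNP Beach)))) "
                 "(PP (IN in) (NP (DT the) (NN state))))))")

print(len(pp_on_recursive_np(OKAY)), len(pp_on_recursive_np(NOT_OKAY)))
```

For the ATB the same configuration is legitimate, so there the function would serve as a counter of the construction rather than an error detector.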

PP attachment in PTB and ATB

(NP (NP streets)
    (PP of (NP (NP (NP the city)
                   (PP of (NP Long Beach)))
               (PP in (NP the state…)))))

(NP ($awAriE [streets])
    (NP (NP (NP madinyn+ap [the city])
            (NP luwnog byt$ [Long Beach]))
        (PP fiy [in] (NP wilAy+ap [the state]..))))

- PTB: PP adjoining to recursive NP: bad structure
- ATB: PP adjoining to recursive NP: good structure

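Dependency Analysis

[Table: % of all dependencies and F-measure in PTB2.0 and ATB3 for PP attached under NP next to an NPB and PP attached under NP next to an NP. Only two values survive in the transcript: 5.23% and 65.08.]

- Parser distinguishes NPB, helps for PTB
- A wider range of attachment possibilities for ATB
- Challenge for the parser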

Conclusion
- We need guidelines
  - We need to create the guidelines
- Interaction - Parsing and Treebank
  - Identify useful consistency checks
  - Run as part of each release
- Better understanding of problematic areas
  - What sort of changes are necessary?
  - Parsing: automatic transformations
  - Treebank: POS changes, etc.
- Proper time allocation?