CIS 630 1 Treebanks, Trees, Querying, QC, etc. Seth Kulick Linguistic Data Consortium University of Pennsylvania

Slides:



Advertisements
Similar presentations
Feature Structures and Parsing Unification Grammars Algorithms for NLP 18 November 2014.
Advertisements

Statistical NLP: Lecture 3
GRAMMAR & PARSING (Syntactic Analysis) NLP- WEEK 4.
Drexel – 4/22/13 1/39 Treebank Analysis Using Derivation Trees Seth Kulick
Probabilistic Parsing: Enhancements Ling 571 Deep Processing Techniques for NLP January 26, 2011.
Introduction to treebanks Session 1: 7/08/
PCFG Parsing, Evaluation, & Improvements Ling 571 Deep Processing Techniques for NLP January 24, 2011.
Albert Gatt LIN3022 Natural Language Processing Lecture 8.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
DS-to-PS conversion Fei Xia University of Washington July 29,
June 7th, 2008TAG+91 Binding Theory in LTAG Lucas Champollion University of Pennsylvania
1 Introduction to Computational Linguistics Eleni Miltsakaki AUTH Fall 2005-Lecture 2.
1/17 Probabilistic Parsing … and some other approaches.
1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.
Workshop on Treebanks, Rochester NY, April 26, 2007 The Penn Treebank: Lessons Learned and Current Methodology Ann Bies Linguistic Data Consortium, University.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
Probabilistic Parsing Ling 571 Fei Xia Week 5: 10/25-10/27/05.
Thoughts on Treebanks Christopher Manning Stanford University.
1 Basic Parsing with Context Free Grammars Chapter 13 September/October 2012 Lecture 6.
Overview of Search Engines
SI485i : NLP Set 9 Advanced PCFGs Some slides from Chris Manning.
11 CS 388: Natural Language Processing: Syntactic Parsing Raymond J. Mooney University of Texas at Austin.
PFA Node Alignment Algorithm Consider the parse trees of a Chinese-English parallel pair of sentences.
Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
GALE Banks 11/9/06 1 Parsing Arabic: Key Aspects of Treebank Annotation Seth Kulick Ryan Gabbard Mitch Marcus.
A Z Approach in Validating ORA-SS Data Models Scott Uk-Jin Lee Jing Sun Gillian Dobbie Yuan Fang Li.
Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi and introduced in Tree-adjoining grammars are somewhat similar to context-free.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Chapter 13 Query Processing Melissa Jamili CS 157B November 11, 2004.
SI485i : NLP Set 8 PCFGs and the CKY Algorithm. PCFGs We saw how CFGs can model English (sort of) Probabilistic CFGs put weights on the production rules.
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.
LING 581: Advanced Computational Linguistics Lecture Notes February 19th.
11 Chapter 14 Part 1 Statistical Parsing Based on slides by Ray Mooney.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
What you have learned and how you can use it : Grammars and Lexicons Parts I-III.
CSA2050 Introduction to Computational Linguistics Parsing I.
Iceland 5/30-6/1/07 1 Parsing with Morphological Information for Treebank Construction Seth Kulick University of Pennsylvania.
Supertagging CMSC Natural Language Processing January 31, 2006.
Natural Language Processing Lecture 15—10/15/2015 Jim Martin.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
Building Sub-Corpora Suitable for Extraction of Lexico-Syntactic Information Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL.
Arabic Syntactic Trees Zdeněk Žabokrtský Otakar Smrž Center for Computational Linguistics Faculty of Mathematics and Physics Charles University in Prague.
Handling Unlike Coordinated Phrases in TAG by Mixing Syntactic Category and Grammatical Function Carlos A. Prolo Faculdade de Informática – PUCRS CELSUL,
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
Concepts and Realization of a Diagram Editor Generator Based on Hypergraph Transformation Author: Mark Minas Presenter: Song Gu.
 2003 CSLI Publications Ling 566 Oct 17, 2011 How the Grammar Works.
1 Dictionary priorities, e- dictionaries of compounds, morphological mode Cvetana Krstev & Duško Vitas.
Costas Busch - LSU1 Parsing. Costas Busch - LSU2 Compiler Program File v = 5; if (v>5) x = 12 + v; while (x !=3) { x = x - 3; v = 10; } Add v,v,5.
Chapter 11 Indexing And Hashing (1) Yonsei University 1 st Semester, 2016 Sanghyun Park.
Dependency Parsing Niranjan Balasubramanian March 24 th 2016 Credits: Many slides from: Michael Collins, Mausam, Chris Manning, COLNG 2014 Dependency Parsing.
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
LING 581: Advanced Computational Linguistics Lecture Notes March 2nd.
Natural Language Processing Vasile Rus
CSC 594 Topics in AI – Natural Language Processing
Treebanks, Trees, Querying, QC, etc.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Database Management System
Statistical NLP: Lecture 3
Basic Parsing with Context Free Grammars Chapter 13
Construct State Modification in the Arabic Treebank
LING/C SC 581: Advanced Computational Linguistics
LING 581: Advanced Computational Linguistics
CSCI 5832 Natural Language Processing
Parsing and More Parsing
Constraining Chart Parsing with Partial Tree Bracketing
Parsing Costas Busch - LSU.
Chapter 11 Indexing And Hashing (1)
Lec00-outline May 18, 2019 Compiler Design CS416 Compiler Design.
Presentation transcript:

CIS Treebanks, Trees, Querying, QC, etc. Seth Kulick Linguistic Data Consortium University of Pennsylvania

CIS Outline  GURT 2010/TLT 2009 talk Elementary Tree decomposition of Arabic Treebank Search over Parallel trees for Parser Analysis, Inter- annotator agreement  More recent stuff Elementary Tree Decomposition of PPCEME Search over Elementary Trees with lexical constraints Search over Derivation Tree  Other stuff we want to do - maybe some projects

CIS A Quantitative Analysis of Syntactic Constructions in the Arabic Treebank Seth Kulick, Ann Bies, Mohamed Maamouri Linguistic Data Consortium University of Pennsylvania

CIS Overview  Arabic Treebank (ATB3) Morphological and syntactic annotations Trees can get big  Problems in Treebank construction and use Look for linguistic structures of interest Want to compare two sets of trees Two or more annotators, or Parser/Gold  Solution: Decompose ATB into smaller trees Based on Tree Adjoining Grammar

CIS Arabic Treebank Example NP 5 6 PP 7 NP 8 NP- LOC 9 NP 10 4 PP- LOC NP- SBJ 23 VP 1 S jarat Almasiyrapu Al>uxoraY fiy muxay~ami jabAliyA li lAji}iyna $amAl gaz~api occur the-march the-another in camp Jabaliya for refugees north Gaza جرت المسيرة الأخرى في مخيم جباليا للاجئين شمال غزة

CIS Our approach  Decompose the trees into core syntactic units Inspired by Tree Adjoining Grammar Modification and recursion factored out from main syntactic relations Idafa structures kept as single elementary trees.  Two key properties: Resulting “elementary trees” are the meaningful syntactic units to search and score upon Elementary trees are “anchored” by >=1 word(s). We exploit this for the searching and scoring.

CIS Arabic Treebank Example NP 5 6 PP 7 NP 8 NP- LOC 9 NP 10 4 PP- LOC NP- SBJ 23 VP 1 S jarat Almasiyrapu Al>uxoraY fiy muxay~ami jabAliyA li lAji}iyna $amAl gaz~api occur the-march the-another in camp Jabaliya for refugees north Gaza جرت المسيرة الأخرى في مخيم جباليا للاجئين شمال غزة

CIS Database of Elementary Trees  Each elementary tree instance consists of: “template” – the tree structure token “anchors” in this instance  Treebank “database normalization” of a sort. Repeated templates are extracted out and represented by integers in a MySQL database. Saves space, makes search faster  Massive structural repetition in (a) treebank: ATB3: 402,292 tokens, 321,404 etree instances 2805 etree templates

CIS Searching Treebank for Structures  Idafa complexity Frequency of different idafa levels Idafa levelCount  Verb frames – SVO,VSO, etc. – under study  Search across pairs of trees, for comparison of this and more complex verbal structure and idafa info

CIS Comparing Two Sets of Trees on the Same Text  Can arise in two ways: Two or more annotators: QC for corpus construction Parser/Gold: to analyze parser errors  … but in a meaningful way Want measures based on same syntactic units annotators make decisions on Want correlation between two sets of trees on same data  Unsatisfied with existing scoring and search programs (evalb, etc.)

CIS Parallel Annotation – How to Score? NP- LOC PP 9NP NP PP 9NP NP li lAji}+iyna $amAl gaz~+ap for refugees North Gaza GOLD: PARSED: Gold has 2-level idafa, parsed has 3-level. Gold has NP-LOC, parsed doesn’t. للاجئين شمال غزة

CIS Query Examples  Queries are specified over etree templates 1. (NP-LOC A$) – anchor projects to a NP-LOC 2. (NP-DIR A$) – anchor projects to a NP-DIR 3. (NP-ADV A$) – anchor projects to a NP-ADV 4. (NP A (NP A (NP A$))) – a three-level idafa  Queries grouped into confusion matrices Queries 1,2,3 Query 4 by itself, showing presence or absence  Each token is associated with query results for the etree instance that token anchors

CIS Query Processing  The etree templates are searched to determine which (likely >1) match query.  Etree instances are searched for which take a template that satisfies query.  Since each word is linked to an etree instance, its status for a query is now easy to determine.  Corresponding words in the two sets of trees are gone through in sequential order, each pair checked for which queries they satisfy in their set of trees.

CIS Query Results - 1  “N” indicates absence of query 4  (4,4) – 677 cases where trees agree on 3-level idafa  (N,4) – 194 cases where gold does not have query 4 for a token, but parse tree does Gold\ParsedNQuery 4 N194 Query  4. (NP A (NP A (NP A$))) – a three-level idafa

CIS Parallel Annotation – How to Score? Entry in (N,4) cell for query 4. Gold etree instance for token does not have 3-level idafa, but parsed tree does. NP PP 9NP NP GOLD: PARSED: NP- LOC PP 9NP NP للاجئين شمال غزة li lAji}+iyna $amAl gaz~+ap for refugees North Gaza

CIS Query Results - 2  (1,N) – 16 cases in which gold tree token projects to NP-LOC while parsed tree does not project to NP-LOC,NP-DIR,NP-ADV Gold\ParsedNQuery 1Query 2Query 3 N6013 Query Query Query  1. (NP-LOC A$) – anchor projects to a NP-LOC 2. (NP-DIR A$) – anchor projects to a NP-DIR 3. (NP-ADV A$) - anchor projects to a NP-ADV

CIS Parallel Annotation – How to Score? Entry in (1,N) cell for query set (1,2,3) Gold etree instance for token has NP-LOC, but parsed tree does not. NP PP 9NP 7 8 GOLD: PARSED: 10 GOLD: NP- LOC PP 9NP NP للاجئين شمال غزة li lAji}+iyna $amAl gaz~+ap for refugees North Gaza

CIS Conclusions and Future Work  Can search over small tree fragments in entire treebank or in two sets of trees on same data  Currently used for parser analysis Gold/Parsed, plan to use for InterAnnotator Agreement.  Focus of ongoing work: Now have searches over “derivation tree”, which encodes how the etree instances combine together For example, for attachment issues – where annotators agree on the basic syntactic units, but not on how they are combined

CIS More Recent Stuff  Now working with PPCEME – Penn-Helsinki Parsed Corpus of Early Modern English 1.8 Million Words – better speed test than PTB Use of Corpus Search – better test for queries  Elementary tree searches with lexical constraints  Derivation tree searches over etrees.  First speed results

CIS PPCEME Tree Decomposition  1,876,675 tokens  1,743,932 etree instances (some have multiple anchors)  6025 templates but not including all function tags yet Function tags in PPCEME very different than PTB

CIS Example of Tree Decomposition in PPCEME

CIS Derivation Tree

CIS Another Example Tree Decomposition

CIS And Derivation Tree

CIS Etree and Dtree Queries

CIS Query Steps  Etree Template Search Finds (etree query, etree template) matches Stores lexical restriction or dta (derivation tree address) info for each match  Example: query: (IP, (NP[t]-OB1^{dta:1} NP[t]-SBJ^)) Templates: (IP NP[t]-OB1^ NP[t]-SBJ^ A1) (CP (C (-NONE- 0)) (IP (NP[t]-OB1^ NP[t]-SBJ^ A1)) For first, stores address 1.1, for second 1.2.1

CIS Derivation Tree Search  Starts at root of Dtree query  Find all etree instances that satisfy etree query at root of dtree query. Uses (etree query id, template) info plus check for lexical constraints.  Find etree instance children of each such etree instance that substitutes or adjoins in the derivation tree at the {dta} address for the parent template

CIS Some Speed Results  All dtree queries of just a single etree query with no lexical constraints extremely fast  With lexical constraints, still very fast  Some derivation tree searches still very fast (sub{dta:1} E1 E2) NP tree headed by PRO substituting into NP[t]-OB1^  But this one is not: (adj{dta:2} E5 E3) CP with empty WHNP,C adjoining at head of NP  Reason: finds all etrees with head of NP first! 500K instances. Need to flip order.

CIS Other Stuff we Want to Do – 1  Issues with Tree Extraction Representation of empty categories across trees? Where is standard adjunction needed? Complete implementation of function tags Verify correctness – put back together  Parse with extracted trees already Issue with multiple anchor trees?

CIS Other Stuff we Want to Do – 2  More query testing – parallel or not Compare against full use of corpus search Complexity of searching on derivation tree – ever need more than two nodes? Yes, but how often?  Treebank Verification – Well-formedness of elementary trees (Xia, 2000) How elementary trees combine  Treebank construction using TAG  Make query search really fast, put on web

CIS Example of Tree Decomposition S PP -TMP NP -SBJ VP hadVP NP TurkeyIn July 2005 signed A protocol

CIS Example of Tree Decomposition S PP -TMP NP -SBJ VP hadVP NP Turkey In July 2005 signed A protocol  One-level relations (as in Collins 2003) can be helpful but not sufficient.

CIS Example of Tree Decomposition S PP -TMP NP -SBJ VP hadVP NP Turkey In July 2005 signed A protocol

CIS Template with two substitution nodes. This template occurs many times, but with different anchors. Example of One Etree Instance S NP -SBJ ^ VP hadVP NP^ signed Anchors

CIS Test Case  Test case is the Penn Arabic Treebank Continuously under construction We work on it – really want something automated  Using gold/parser files More files available right now Same basic problem as Inter-Annotator Agreement  Linguistic Decisions on Tree Extraction “idafa” construction as etree templates (NP NOUN (NP NOUN)) 2-level idafa (NP NOUN (NP NOUN (NP NOUN))) 3-level idafa

CIS  Blah  Long history for NLP, massively underutilized for corpus creation (Xia, 2001)

CIS Arabic Treebank Example NP 5 6 PP 7 NP 8 NP- LOC 9 NP 10 muxay~am refugee camp jabAliyA Jabaliya li for/to lAji}+iyna refugees $amAl North gaz~+ap Gaza 5,6 and 9,10 are idafas, extracted as a single etree