Language Data Resources Treebanks. A treebank is a … database of syntactic trees corpus annotated with morphological and syntactic information segmented,

Slides:



Advertisements
Similar presentations
Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Introduction to Syntax, with Part-of-Speech Tagging Owen Rambow September 17 & 19.
Syntax and Context-Free Grammars Julia Hirschberg CS 4705 Slides with contributions from Owen Rambow, Kathy McKeown, Dan Jurafsky and James Martin.
Layering Semantics (Putting meaning into trees) Treebank Workshop Martha Palmer April 26, 2007.
June 6, 20073rd PIRE Meeting1 Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank Lucie Mladová Silvie Cinková, Kristýna.
The SALSA experience: semantic role annotation Katrin Erk University of Texas at Austin.
Statistical NLP: Lecture 3
10/9/01PropBank1 Proposition Bank: a resource of predicate-argument relations Martha Palmer University of Pennsylvania October 9, 2001 Columbia University.
PropBanks, 10/30/03 1 Penn Putting Meaning Into Your Trees Martha Palmer Paul Kingsbury, Olga Babko-Malaya, Scott Cotton, Nianwen Xue, Shijong Ryu, Ben.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
1 Words and the Lexicon September 10th 2009 Lecture #3.
Probabilistic Parsing: Enhancements Ling 571 Deep Processing Techniques for NLP January 26, 2011.
Introduction to treebanks Session 1: 7/08/
PCFG Parsing, Evaluation, & Improvements Ling 571 Deep Processing Techniques for NLP January 24, 2011.
Albert Gatt LIN3022 Natural Language Processing Lecture 8.
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.
Introduction to Computational Linguistics Lecture 2.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
1/13 Parsing III Probabilistic Parsing and Conclusions.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
1 Introduction to Computational Linguistics Eleni Miltsakaki AUTH Fall 2005-Lecture 2.
1/17 Probabilistic Parsing … and some other approaches.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Probabilistic Parsing Ling 571 Fei Xia Week 5: 10/25-10/27/05.
10/9/01PropBank1 Proposition Bank: a resource of predicate-argument relations Martha Palmer, Dan Gildea, Paul Kingsbury University of Pennsylvania February.
Tips and Tricks … with INTEX/NOOJ Tamás Váradi Institute for Linguistics Research Hungarian Academy of Sciences Max Silberztein University.
LING/C SC/PSYC 438/538 Lecture 27 Sandiway Fong. Administrivia 2 nd Reminder – 538 Presentations – Send me your choices if you haven’t already.
Probabilistic Parsing Reading: Chap 14, Jurafsky & Martin This slide set was adapted from J. Martin, U. Colorado Instructor: Paul Tarau, based on Rada.
Chapter 10: Compilers and Language Translation Invitation to Computer Science, Java Version, Third Edition.
Syntactically annotated corpora of Estonian Heli Uibo Institute of Computer Science University of Tartu
Learning a token classification from a large corpus (A case study in abbreviations) Petya Osenova & Kiril Simov BulTreeBank Project (
Natural Language Processing Lecture 6 : Revision.
10/12/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 10 Giuseppe Carenini.
The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.
CS : Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Constituent Parsing and Algorithms (with.
AQUAINT Workshop – June 2003 Improved Semantic Role Parsing Kadri Hacioglu, Sameer Pradhan, Valerie Krugler, Steven Bethard, Ashley Thornton, Wayne Ward,
Albert Gatt Corpora and Statistical Methods Lecture 11.
Parsing with Context-Free Grammars for ASR Julia Hirschberg CS 4706 Slides with contributions from Owen Rambow, Kathy McKeown, Dan Jurafsky and James Martin.
CPE 480 Natural Language Processing Lecture 4: Syntax Adapted from Owen Rambow’s slides for CSc Fall 2006.
Tokenization & POS-Tagging
CSA2050 Introduction to Computational Linguistics Parsing I.
1 Context Free Grammars October Syntactic Grammaticality Doesn’t depend on Having heard the sentence before The sentence being true –I saw a unicorn.
Natural Language Processing
CPSC 503 Computational Linguistics
LING 001 Introduction to Linguistics Spring 2010 Syntactic parsing Part-Of-Speech tagging Apr. 5 Computational linguistics.
March 2006Introduction to Computational Linguistics 1 CLINT Tokenisation.
NLP. Introduction to NLP Background –From the early ‘90s –Developed at the University of Pennsylvania –(Marcus, Santorini, and Marcinkiewicz 1993) Size.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
Natural Language Processing Lecture 14—10/13/2015 Jim Martin.
ARDA Visit 1 Penn Lexical Semantics at Penn: Proposition Bank and VerbNet Martha Palmer, Dan Gildea, Paul Kingsbury, Olga Babko-Malaya, Bert Xue, Karin.
◦ Process of describing the structure of phrases and sentences Chapter 8 - Phrases and sentences: grammar1.
Arabic Syntactic Trees Zdeněk Žabokrtský Otakar Smrž Center for Computational Linguistics Faculty of Mathematics and Physics Charles University in Prague.
Text segmentation Amany AlKhayat. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation,
CSC312 Automata Theory Lecture # 26 Chapter # 12 by Cohen Context Free Grammars.
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 17 th.
10/31/00 1 Introduction to Cognitive Science Linguistics Component Topic: Formal Grammars: Generating and Parsing Lecturer: Dr Bodomo.
Natural Language Processing Vasile Rus
An Introduction to the Government and Binding Theory
[A Contrastive Study of Syntacto-Semantic Dependencies]
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
LING/C SC 581: Advanced Computational Linguistics
Probabilistic and Lexicalized Parsing
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 26
Chapter 10: Compilers and Language Translation
Artificial Intelligence 2004 Speech & Natural Language Processing
Presentation transcript:

Language Data Resources Treebanks

A treebank is a … database of syntactic trees corpus annotated with morphological and syntactic information segmented, part-of-speech tagged, and fully bracketed corpus (typically hand-built) collection of natural language utterances and associated linguistic analyses collection of morphologically, syntactically and semantically annotated sentences database essential for the study of the language due to it provides analyzed/annotated examples of real language large collection of sentences which have been parsed by hand

Parsed?

Penn Treebank Release MW English newspaper texts; the larges subpart taken from the Wall Street Journal Bracketing style (constituent syntax)‏ Structural reconstructions (traces)‏

A PennTreeBanked Sentence Analysts S NP-SBJ VP have VP beenVP expecting NP a GM-Jaguar pact NP that SBAR WHNP-1 *T*-1 S NP-SBJ VP would VP give the US car maker NP an eventual 30% stake NP the British company NP PP- LOC in (S (NP-SBJ Analysts)‏ (VP have (VP been (VP expecting (NP (NP a GM-Jaguar pact)‏ (SBAR (WHNP-1 that)‏ (S (NP-SBJ *T*-1)‏ (VP would (VP give (NP the U.S. car maker)‏ (NP (NP an eventual (ADJP 30 %) stake)‏ (PP-LOC in (NP the British company))))))))))))‏ Analysts have been expecting a GM-Jaguar pact that would give the U.S. car maker an eventual 30% stake in the British company.

Penn Treebank – značky terminálů (Part-of-speech tags)‏

Penn Treebank – značky neterminálů Penn Treebank – funkční značky

PropBank „adding a layer of semantic annotation to the Penn Treebank“ Rozlišení arguments-adjuncts Ukázka anotace: Mr.Bush met him privately, in the White House, on Thursday. Rel: met Arg0: Mr.Bush Arg1: him ArgM-MNR: privately ArgM-LOC: in the White House ArgM-TMP: on Thursday

The same sentence, PropBanked Analysts have been expecting a GM-Jaguar pact Arg0 Arg1 (S Arg0 (NP-SBJ Analysts)‏ (VP have (VP been (VP expecting Arg1 (NP (NP a GM-Jaguar pact)‏ (SBAR (WHNP-1 that)‏ (S Arg0 (NP-SBJ *T*-1)‏ (VP would (VP give Arg2 (NP the U.S. car maker)‏ Arg1 (NP (NP an eventual (ADJP 30 %) stake)‏ (PP-LOC in (NP the British company))))))))))))‏ that would give *T*-1 the US car maker an eventual 30% stake in the British company Arg0 Arg2 Arg1 expect(Analysts, GM-J pact)‏ give(GM-J pact, US car maker, 30% stake)‏

NEGRA NEGRA corpus version 2 consists of 355,096 tokens (20,602 sentences) of German newspaper text, taken from the Frankfurter Rundschau

NEGRA – příklady stromů

NEGRA – značky terminálů

NEGRA - značky neterminálů NEGRA - značky hran

NEGRA Export Format

TIGER The TIGER Treebank (Version 1) consists of app. 700,000 tokens (40,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. Rozdíl oproti Negra –Velikost –Lematizace+morfologie –Sekundární hrany

TIGER

TIGER-XML format

BulTreeBank (I)‏ HPSG-based treebank of Bulgarian at Bulgarian Academy of Sciences detailed described in a sequence of technical reports in 2004 ( source of material - BulTreeBank corpus –archive of texts converted from HTML and RTF documents: 90 MW –morphologically analyzed corpus: 10 MW –disambiguation by hand: 1 MW the treebank itself: 200 kW –1.500 kS extracted from Bulgarian grammars to show the variety of syntactic structures –10 kS from the corpus (newspapers, government documents, prose) to show their distribution

BulTreeBank (II) Morphosyntactic Tagset BTB-TS –derived from MULTEXT –positional ("positional") - the first letter (POS) specifies how many positions follow and what categories they express –examples: Ncmsd (noun common masc. sing. def.)‏ Amsi (adjective masc. sing. indef.)‏

BulTreeBank (III) Syntactic Structures constituency trees 4 types of elements –lexical elements (N,V,Prep,...)‏ –phrasal elements (VPA(djunct),NPC(omplement)...)‏ –functional elements (Conj,ConjArg,...)‏ –textual elements - the actual strings (inc.punctuation)‏ She is angry whatever you tell her.

BulTreeBank (IV) Discontinuities

Szeged Treebank (I)‏ syntactic structures for Hungarian developed at University of Szeged 82 kS, 1.2 MW( punctuation marks) 5 types of text material: fiction, compositions of year-old students, newspaper articles, IT texts, law manual morphological disambiguation and syntactic annotation

Szeged Treebank (II)‏ following generative syntax freer word-order in Hungarian -> many traces

Penn Chinese Treebank

Constituency vs. Dependency ?

Applications of existing treebanks Applications of existing treebanks fall into two broad categories: (i) use of an annotated corpus in empirical linguistics as a source of structured language data and distributional patterns and (ii) use of the treebank for the acquisition (e.g. using stochastic or machine learning approaches) and evaluation of parsing systems.

Conversion of phrase structures into dependencies relatively easy recursive procedure: –1. Mark the head child of each non-terminal node –2. In the dependency structure, make the head of each non-head child depend on the head of the head child