PDT 2.0 Prague Dependency Treebank 2.0 Zdeněk Žabokrtský Dept. of Formal and Applied Linguistics Charles University, Prague.

Outline of the talk
- Introduction
- Layers of annotation
- Data
- Software tools
- Documentation
- Tour through the CD-ROM
- Final remarks

Introduction
- treebank: a syntactically annotated corpus (a “bank” of syntactic trees)
- Prague Dependency Treebank:
  - collection of linguistically annotated Czech texts (2 MW), software tools and documentation
  - morphological, surface-syntactic and deep-syntactic dependency-oriented sentence analyses

About Czech
- belongs to the western group of Slavic languages
- rich inflectional morphology
- (relatively) free word order
- Latin alphabet extended with accents (příliš žluťoučký kůň)
- spoken in the Czech Republic
- 10+ million speakers

Historical background and development of PDT
- 1920’s – Prague Linguistic Circle founded
- ’s – influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer
- mid 1960’s – Petr Sgall’s Functional Generative Description
- 1992 – Penn Treebank
- 1994 – Czech National Corpus
- 1995 – PDT started
- 1998 – PDT 0.5 pre-release
- 2001 – PDT 1.0 released by LDC
- 2006 – PDT 2.0 to be released by LDC

Layered annotation scheme
- tectogrammatical layer: deep-syntactic dependency tree
- analytical layer: surface-syntactic dependency tree
- morphological layer: morphological lemma and tag associated with each token
- word layer: original text, segmented on word boundaries
- example sentence: He would have gone intoforest.

M-layer
- sentence represented as a sequence of tokens
- each token lemmatized and tagged (attributes lemma and tag)
- 15-character positional morphological tag:
  1. (main) POS
  2. detailed POS
  3. gender
  4. number
  5. case
  ...
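The positional tag can be read by indexing into the string. The following is a minimal Python sketch, not part of the PDT tools. It names only the five positions listed on the slide and leaves positions 6 to 15 generic. The function name, the field names and the example tag value are illustrative assumptions.

```python
# Only the first five positions are named on the slide; the remaining ten
# are kept under generic keys rather than guessed.
NAMED_POSITIONS = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def split_positional_tag(tag):
    """Split a 15-character positional tag into one field per position."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    # zip stops after the five named positions
    fields = dict(zip(NAMED_POSITIONS, tag))
    for i in range(len(NAMED_POSITIONS), TAG_LENGTH):
        fields[f"position_{i + 1}"] = tag[i]
    return fields


# Illustrative tag value (an assumption, not taken from the slides).
example = split_positional_tag("NNIS4-----A----")
print(example["pos"], example["detailed_pos"], example["case"])
```

Because every category has a fixed position, a single character comparison is enough to filter tokens by, say, case or number.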

A-layer (1) - nodes and edges
- sentence represented as a rooted ordered tree with labeled nodes and edges
- edges labeled with analytical functions:
  - dependency relations (Sb, Obj, Adv, Atr)
  - non-dependency relations (Coord)
  - auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions...)
- special treatment of coordination constructions

A-layer (2) - coordination
- intricate interplay between dependency and coordination relations
- PDT solution: both conjuncts (members of coordination) and shared modifiers are attached below the coordination conjunction (but distinguished from each other by a special attribute is_member)
- direct parent vs. effective parent
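The difference between the direct parent and the effective parent can be sketched in Python. This is a simplified illustration under stated assumptions, not the algorithm used in the PDT tools. The class `ANode` and the functions `expand_members` and `effective_parents` are hypothetical names. Only the `is_member` attribute and the `Coord` label come from the slides.

```python
class ANode:
    """Toy a-layer node; only the attributes needed for the sketch."""

    def __init__(self, form, afun, is_member=False):
        self.form = form
        self.afun = afun
        self.is_member = is_member
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def expand_members(node):
    """Replace a coordination node by its conjuncts (recursively)."""
    if node.afun != "Coord":
        return [node]
    result = []
    for child in node.children:
        if child.is_member:
            result.extend(expand_members(child))
    return result


def effective_parents(node):
    """Climb over coordination nodes, then expand a coordinated parent."""
    current = node
    # A conjunct's direct parent is the conjunction; skip it.
    while current.is_member and current.parent is not None:
        current = current.parent
    if current.parent is None:
        return []
    return expand_members(current.parent)


# "old men and women sleep": the conjunction governs both conjuncts
# and the shared modifier "old".
sleep = ANode("sleep", "Pred")
conj = sleep.add(ANode("and", "Coord"))
men = conj.add(ANode("men", "Sb", is_member=True))
women = conj.add(ANode("women", "Sb", is_member=True))
old = conj.add(ANode("old", "Atr"))

print([p.form for p in effective_parents(men)])
print([p.form for p in effective_parents(old)])
```

In this toy tree the direct parent of both `men` and `old` is the conjunction, while the effective parent of `men` is the verb and the effective parents of the shared modifier `old` are the two conjuncts.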

T-layer (1) - nodes
- t-nodes:
  - complex typed feature structures
  - nodes represent autosemantic words
  - functional words do not have nodes of their own
  - artificially added nodes (e.g. for pro-drops)
- node attributes:
  - tectogrammatical lemma
  - dependency relation – functor and subfunctor
  - grammateme attributes (representing morphological meanings)
  - attributes for topic-focus articulation
  - attributes for coreference relations

T-layer (2) - dependency relations
- according to FGD, two types of functors:
  - actants (arguments): ACT – actor, PAT – patient, ADDR – addressee, EFF – effect, ORIG – origin
  - free modifiers (adjuncts):
    - various types of temporal modifiers – TWHEN, TTILL, TSIN...
    - spatial and directional modifiers – LOC, DIR1, DIR2, DIR3
    - MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition...
- additional functors for representing non-dependency relations:
  - coordinations – CONJ, DISJ, ADVS...
  - appositions – APPS
  - parenthetical constructions – PAR
  - expressions in foreign language – FPHR

T-layer (3) - valency
- all occurrences of all verbs in t-trees are interlinked with the valency lexicon PDT-VALLEX
- individual valency frames roughly correspond to individual senses of the given verb
- valency frame ~ a sequence of frame slots; for each slot, its functor, obligatoriness and possible surface realizations are specified
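The description of a valency frame maps naturally onto a small data structure. The Python sketch below is illustrative only. The class names, the `missing_obligatory` helper, the notation for surface realizations and the frame contents are all assumptions made for the example; none of them are taken from PDT-VALLEX.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FrameSlot:
    functor: str                   # e.g. ACT, PAT, ADDR
    obligatory: bool               # obligatoriness of the slot
    realizations: Tuple[str, ...]  # possible surface forms (hypothetical notation)


@dataclass(frozen=True)
class ValencyFrame:
    verb_lemma: str
    slots: Tuple[FrameSlot, ...]

    def missing_obligatory(self, functors_present):
        """Return functors of obligatory slots not found among the given ones."""
        present = set(functors_present)
        return [s.functor for s in self.slots
                if s.obligatory and s.functor not in present]


# Invented frame for illustration; not an entry copied from the lexicon.
give = ValencyFrame(
    verb_lemma="give",
    slots=(
        FrameSlot("ACT", True, ("nominative",)),
        FrameSlot("PAT", True, ("accusative",)),
        FrameSlot("ADDR", True, ("dative",)),
    ),
)

print(give.missing_obligatory(["ACT", "PAT"]))
```

A check of this kind is one way to compare the children of a verb node with the frame it is linked to.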

T-layer (4) - coreference
- two types of coreference according to FGD:
  - grammatical (verbs of control, relative clauses, reflexive pronouns...)
  - textual (personal pronouns, incl. elided ones)
- coreference in PDT:
  - binary relation between t-nodes
  - depicted as a “non-tree” arc (arrow)

T-layer (5) - grammatemes
- grammatemes: t-node attributes representing morphological meanings
- motivation
- number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality...

T-layer (6) - node typing
- presence/absence of a given attribute? → the need for node typing
- two-level hierarchy of t-layer node types used in PDT 2.0

Interlinking the layers
- any unit at any layer has a PDT-wide unique ID
- neighboring layers connected by top-down pointers
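The top-down pointers can be followed from a t-node down to the original word forms. The Python sketch below uses plain dictionaries. The identifiers, the dictionary layout and the function `surface_forms` are hypothetical and do not reproduce the actual PDT data format.

```python
# Each layer is a dict from unit ID to unit; IDs here are invented.
w_layer = {"w1": "He", "w2": "would", "w3": "have", "w4": "gone"}

# m-layer units point down to w-layer units.
m_layer = {
    "m1": {"lemma": "he", "w": ["w1"]},
    "m2": {"lemma": "will", "w": ["w2"]},
    "m3": {"lemma": "have", "w": ["w3"]},
    "m4": {"lemma": "go", "w": ["w4"]},
}

# a-layer nodes point down to m-layer units.
a_layer = {"a1": {"m": "m1"}, "a2": {"m": "m2"},
           "a3": {"m": "m3"}, "a4": {"m": "m4"}}

# t-layer nodes point down to a-layer nodes; auxiliaries have no
# t-node of their own, so one t-node may cover several a-nodes.
t_layer = {
    "t1": {"t_lemma": "he", "a": ["a1"]},
    "t2": {"t_lemma": "go", "a": ["a2", "a3", "a4"]},
}


def surface_forms(t_id):
    """Follow pointers t -> a -> m -> w and collect the word forms."""
    forms = []
    for a_id in t_layer[t_id]["a"]:
        m_id = a_layer[a_id]["m"]
        for w_id in m_layer[m_id]["w"]:
            forms.append(w_layer[w_id])
    return forms


print(surface_forms("t2"))
```

Since the pointers only go downwards, moving from a lower layer to a higher one would need a reverse index built from these links.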

Sources of text
- texts provided by the Czech National Corpus
- 7000 articles (or article fragments) from Czech newspapers and journals:
  - Lidové noviny (daily newspaper)
  - Mladá fronta Dnes (daily newspaper)
  - Českomoravský profit (business weekly)
  - Vesmír (scientific journal)

Amount of annotated data (MW = million words, kS = thousand sentences)
- m-layer data: 1.96 MW in 116 kS
- a-layer data (75 % of m-layer): 1.5 MW in 88 kS
- t-layer data (59 % of a-layer): 0.8 MW in 49 kS

Division into files
- 1 XML file per document and annotation layer

Train/test data
- train : devtest : evaltest = 8 : 1 : 1
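An 8 : 1 : 1 division can be produced in many ways. The Python sketch below shows one simple deterministic option, a round-robin assignment over sorted document names. The slides do not say how the actual PDT 2.0 split was made, so the procedure, the function `split_documents` and the document names are assumptions.

```python
def split_documents(doc_ids):
    """Assign documents to train/devtest/evaltest in an 8:1:1 ratio."""
    parts = {"train": [], "devtest": [], "evaltest": []}
    for index, doc_id in enumerate(sorted(doc_ids)):
        bucket = index % 10
        if bucket < 8:
            parts["train"].append(doc_id)
        elif bucket == 8:
            parts["devtest"].append(doc_id)
        else:
            parts["evaltest"].append(doc_id)
    return parts


# Hypothetical document names.
docs = [f"doc{i:03d}" for i in range(20)]
split = split_documents(docs)
print({name: len(ids) for name, ids in split.items()})
```

Splitting by whole documents rather than by sentences keeps all sentences of one article in the same portion.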

Full vs. sample data
- sample data:
  - 500 sentences
  - a freely available subset of the full data
  - also converted to HTML (can be viewed in any WWW browser, no tree editor needed)
- the whole PDT 2.0 except for the full data (but including the sample data, all tools and docs) is available on the web
- the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium

Tree editor TrEd
- general customizable tree editor
- implemented in Perl
- the main editing and browsing tool in the PDT project

Batch processing of the data
- btred – batch processing version of tred
- ntred – networked (parallelized) version of btred

    $ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep{$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q

Netgraph
- client-server application for on-line PDT search
- implemented in Java

Tools for post-annotation consistency checking
- hundreds of btred scripts of various types:
  - technical tests, e.g.
    - each sentence contains at least one token
    - all identifiers are unique, all referred identifiers exist...
  - m-layer tests
    - locative (6th case) cannot occur without a preposition
    - improbable word forms (e.g. imperatives haš, tel)
  - a-layer tests
    - no more than one subject in a clause
    - attributes (afun Atr) should not appear directly below verbs
  - t-layer tests
    - surface forms of verb arguments match the specifications in the valency lexicon
    - relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
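The two technical tests on identifiers are easy to illustrate. The project scripts were written for btred; the following is only a Python sketch of the same idea on toy data. The function `check_identifiers`, the unit layout and the IDs are assumptions.

```python
from collections import Counter


def check_identifiers(units):
    """Report duplicated IDs and references to IDs that do not exist.

    units: list of dicts with an 'id' key and an optional 'refs' list.
    """
    counts = Counter(unit["id"] for unit in units)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    known = set(counts)
    dangling = sorted({ref
                       for unit in units
                       for ref in unit.get("refs", [])
                       if ref not in known})
    return duplicates, dangling


# Toy data with one duplicated ID and one dangling reference.
units = [
    {"id": "a1", "refs": ["m1"]},
    {"id": "m1"},
    {"id": "m1"},
    {"id": "a2", "refs": ["m9"]},
]
print(check_identifiers(units))
```

Tests of this kind depend only on the identifiers, so the same check can run over every annotation layer.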

Tools for automatic annotation
- chain of tools for automatic text processing (from raw text to a-layer trees):
  1. sentence segmentation and tokenization
  2. morphological analysis
  3. morphological disambiguation
  4. dependency parsing (adapted Collins parser)
  5. analytical function assignment

Tools for format conversions
- conversion not only between PDT data formats, but also from other treebanks’ formats
- example: constituency trees from Negra in TrEd

PDT 2.0 Documentation
- PDT Guide
  - overview of all parts of PDT 2.0
  - mirrors the directory structure of the PDT 2.0 CD-ROM
- Annotation guidelines
  - m-layer (~100 pages)
  - a-layer (~250 pages)
  - t-layer (~800 pages)
- Publications: conference and journal papers, technical reports, theses...
- Technical documentation (software tools and data formats)

Want to experiment with...
- tagging?
- dependency parsing?
- semantic-role labeling?
- frame semantics?
- word-sense disambiguation?
- anaphora resolution?
- information structure?
- ...
Use PDT 2.0, it’s all there!!!

Annotation scheme not limited to Czech
- T-layer in English
- T-layer in German
- A-layer in German
- A-layer in Arabic
- A-layer in Slovene
- A-layer in Romanian