Prague Dependency Treebank 2.0
Zdeněk Žabokrtský
Dept. of Formal and Applied Linguistics, Charles University, Prague
zabokrtsky@ufal.mff.cuni.cz
http://ufal.mff.cuni.cz/pdt2.0
Outline of the talk
  - Introduction
  - Layers of annotation
  - Data
  - Software tools
  - Documentation
  - Tour through the CD-ROM
  - Final remarks
Introduction
  - treebank: a syntactically annotated corpus (a “bank” of syntactic trees)
  - Prague Dependency Treebank:
      - collection of linguistically annotated Czech texts (2 MW), software tools and documentation
      - morphological, surface-syntactic and deep-syntactic dependency-oriented sentence analyses
About Czech
  - belongs to the western group of Slavic languages
  - rich inflectional morphology
  - (relatively) free word order
  - Latin alphabet extended with accents (příliš žluťoučký kůň)
  - spoken in the Czech Republic
  - 10+ million speakers
Historical background and development of PDT
  - 1920s: Prague Linguistic Circle founded
  - 1930s-1950s: influential dependency-oriented works of Lucien Tesnière and Vladimír Šmilauer
  - mid 1960s: Petr Sgall’s Functional Generative Description (FGD)
  - 1992: Penn Treebank
  - 1994: Czech National Corpus
  - 1995: PDT started
  - 1998: PDT 0.5 pre-release
  - 2001: PDT 1.0 released by LDC
  - 2006: PDT 2.0 to be released by LDC
Layered annotation scheme
  - tectogrammatical layer (t-layer): deep-syntactic dependency tree
  - analytical layer (a-layer): surface-syntactic dependency tree
  - morphological layer (m-layer): morphological lemma and tag associated with each token
  - word layer (w-layer): original text, segmented on word boundaries
  - example sentence: He would have gone intoforest.
M-layer
  - sentence represented as a sequence of tokens
  - each token lemmatized and tagged (attributes lemma and tag)
  - 15-character positional morphological tag:
      1. (main) POS
      2. detailed POS
      3. gender
      4. number
      5. case
      ...
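
A positional tag can be read by splitting it into one character per position. The sketch below is only an illustration: the helper name split_tag, the generic keys for positions 6 to 15 (which the slide does not name), and the example tag are assumptions, not part of the PDT tools.

```python
# Names of the first five positions come from the slide; the remaining
# ten positions are deliberately left unnamed here.
POSITION_NAMES = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def split_tag(tag: str) -> dict:
    """Split a 15-character positional tag into one value per position."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    fields = {}
    for index, value in enumerate(tag):
        # Positions beyond the named ones get a generic key such as "position_6".
        if index < len(POSITION_NAMES):
            name = POSITION_NAMES[index]
        else:
            name = f"position_{index + 1}"
        fields[name] = value
    return fields


# Illustrative tag (an assumption, not taken from the slides).
example = split_tag("NNFS1-----A----")
print(example["pos"], example["gender"], example["number"], example["case"])
```

Because every category has a fixed position, a single character comparison is enough to select, for instance, all tokens whose fifth position marks a given case.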
A-layer (1) - nodes and edges
  - sentence represented as a rooted ordered tree with labeled nodes and edges
  - edges labeled with analytical functions:
      - dependency relations (Sb, Obj, Adv, Atr)
      - non-dependency relations (Coord)
      - auxiliary (functional) nodes (AuxP for prepositions, AuxC for subordinating conjunctions, ...)
  - special treatment of coordination constructions
A-layer (2) - coordination
  - intricate interplay between dependency and coordination relations
  - PDT solution: both conjuncts (members of the coordination) and shared modifiers are attached below the coordination conjunction, but distinguished from each other by a special attribute is_member
  - direct parent vs. effective parent
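
The difference between the direct and the effective parent can be sketched on a toy tree. This is a simplified illustration under stated assumptions: the Node class, the attach helper and both functions are hypothetical names, auxiliary nodes (AuxP, AuxC) and word order are ignored, and the real PDT tools may resolve effective parents differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    form: str
    afun: str
    is_member: bool = False
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def attach(self, child: "Node") -> "Node":
        """Hang child below this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def conjuncts(node: Node) -> List[Node]:
    """Return the node itself, or its conjuncts if it heads a coordination."""
    if node.afun != "Coord":
        return [node]
    members: List[Node] = []
    for child in node.children:
        if child.is_member:
            members.extend(conjuncts(child))  # handles nested coordination
    return members


def effective_parents(node: Node) -> List[Node]:
    """Skip coordination nodes on the way up, expand them on the way down."""
    current = node
    # A conjunct inherits the attachment of the whole coordination.
    while current.is_member and current.parent is not None:
        current = current.parent
    if current.parent is None:
        return []
    # A shared modifier below a coordination depends on every conjunct.
    return conjuncts(current.parent)


# Toy tree for "old men and women came" (hypothetical example).
came = Node("came", "Pred")
and_ = came.attach(Node("and", "Coord"))
men = and_.attach(Node("men", "Sb", is_member=True))
women = and_.attach(Node("women", "Sb", is_member=True))
old = and_.attach(Node("old", "Atr"))  # shared modifier, is_member stays False

print(men.parent.form, [p.form for p in effective_parents(men)])
print(old.parent.form, [p.form for p in effective_parents(old)])
```

In this toy tree the direct parent of both "men" and "old" is the conjunction, while the effective parent of "men" is the verb and the effective parents of "old" are the two conjuncts.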
T-layer (1) - nodes
  - t-nodes are complex typed feature structures
  - nodes represent autosemantic words
  - functional words do not have nodes of their own
  - artificially added nodes (e.g. for pro-drops)
  - node attributes:
      - tectogrammatical lemma
      - dependency relation: functor and subfunctor
      - grammateme attributes (representing morphological meanings)
      - attributes for topic-focus articulation
      - attributes for coreference relations
T-layer (2) - dependency relations
  - according to FGD, two types of functors:
      - actants (arguments): ACT (actor), PAT (patient), ADDR (addressee), EFF (effect), ORIG (origin)
      - free modifiers (adjuncts):
          - various types of temporal modifiers: TWHEN, TTIL, TSIN, ...
          - spatial and directional modifiers: LOC, DIR1, DIR2, DIR3
          - MEANS, BENeficiary, CAUSe, REGard, EXTent, MATerial, CONDition, ...
  - additional functors for representing non-dependency relations:
      - coordinations: CONJ, DISJ, ADVS, ...
      - appositions: APPS
      - parenthetical constructions: PAR
      - expressions in a foreign language: FPHR
T-layer (3) - valency
  - all occurrences of all verbs in t-trees are interlinked with the valency lexicon PDT-VALLEX
  - individual valency frames roughly correspond to individual senses of the given verb
  - valency frame ~ a sequence of frame slots; for each slot, its functor, its obligatoriness and its possible surface realizations are specified
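
A valency frame as described above can be modeled as a small data structure. The following is a minimal sketch, not the PDT-VALLEX format: the class names, the violations helper, the frame for "give" and the way surface realizations are written (plain case names) are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FrameSlot:
    functor: str                   # e.g. ACT, PAT, ADDR
    obligatory: bool               # must the slot be filled?
    realizations: Tuple[str, ...]  # allowed surface forms (hypothetical notation)


@dataclass(frozen=True)
class ValencyFrame:
    lemma: str
    slots: Tuple[FrameSlot, ...]


def violations(frame: ValencyFrame, observed: Dict[str, str]) -> List[str]:
    """Compare observed functor -> surface form pairs with one frame."""
    problems = []
    for slot in frame.slots:
        form = observed.get(slot.functor)
        if form is None:
            if slot.obligatory:
                problems.append(f"missing:{slot.functor}")
        elif form not in slot.realizations:
            problems.append(f"bad-form:{slot.functor}")
    return problems


# Hypothetical frame, not copied from PDT-VALLEX.
give = ValencyFrame(
    lemma="give",
    slots=(
        FrameSlot("ACT", True, ("nominative",)),
        FrameSlot("PAT", True, ("accusative",)),
        FrameSlot("ADDR", True, ("dative",)),
    ),
)

print(violations(give, {"ACT": "nominative", "PAT": "accusative", "ADDR": "dative"}))
print(violations(give, {"ACT": "nominative", "PAT": "dative"}))
```

Keeping one frame per verb sense means that linking a verb occurrence to a frame also records which sense was chosen.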
T-layer (3) - coreference
  - two types of coreference according to FGD:
      - grammatical (verbs of control, relative clauses, reflexive pronouns, ...)
      - textual (personal pronouns, incl. elided ones)
  - coreference in PDT:
      - binary relation between t-nodes
      - depicted as a “non-tree” arc (arrow)
T-layer (4) - grammatemes
  - t-node attributes representing morphological meanings
  - motivation
  - number for nouns, tense for verbs, degree for adjectives, deontic/verb/sentence modality, ...
T-layer (5) - node typing
  - whether a given attribute is present or absent depends on the kind of node, hence the need for node typing
  - two-level hierarchy of t-layer node types used in PDT 2.0
Interlinking the layers
  - every unit at every layer has an ID that is unique within PDT
  - neighboring layers are connected by top-down pointers
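
The pointer chain can be pictured with plain dictionaries keyed by unit ID. This is only a sketch: the ID strings, the attribute names a_refs, m_ref and w_ref, and the toy sentence fragment are assumptions made for the example and do not reproduce the actual PDT data format.

```python
# Each layer maps a unique ID to a unit; pointers always lead one layer down.
w_layer = {"w-1": "He", "w-2": "would", "w-3": "have", "w-4": "gone"}

m_layer = {
    "m-1": {"lemma": "he", "w_ref": "w-1"},
    "m-2": {"lemma": "will", "w_ref": "w-2"},
    "m-3": {"lemma": "have", "w_ref": "w-3"},
    "m-4": {"lemma": "go", "w_ref": "w-4"},
}

a_layer = {
    "a-1": {"afun": "Sb", "m_ref": "m-1"},
    "a-2": {"afun": "AuxV", "m_ref": "m-2"},
    "a-3": {"afun": "AuxV", "m_ref": "m-3"},
    "a-4": {"afun": "Pred", "m_ref": "m-4"},
}

# A t-node may point to several a-nodes (the content word plus auxiliaries).
t_layer = {
    "t-1": {"t_lemma": "he", "a_refs": ["a-1"]},
    "t-2": {"t_lemma": "go", "a_refs": ["a-2", "a-3", "a-4"]},
}


def surface_forms(t_id: str) -> list:
    """Follow the pointers t -> a -> m -> w and collect the word forms."""
    forms = []
    for a_id in t_layer[t_id]["a_refs"]:
        m_id = a_layer[a_id]["m_ref"]
        w_id = m_layer[m_id]["w_ref"]
        forms.append(w_layer[w_id])
    return forms


print(surface_forms("t-2"))
```

Since the pointers only lead downwards, a lower layer can be distributed and processed without any of the layers above it.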
Sources of text
  - texts provided by the Czech National Corpus
  - 7000 articles (or article fragments) from Czech newspapers and journals:
      - Lidové noviny (daily newspaper)
      - Mladá fronta Dnes (daily newspaper)
      - Českomoravský profit (business weekly)
      - Vesmír (scientific journal)
Amount of annotated data (MW = million words, kS = thousand sentences)
  - m-layer data: 1.96 MW in 116 kS
  - a-layer data (75 % of m-layer): 1.5 MW in 88 kS
  - t-layer data (59 % of a-layer): 0.8 MW in 49 kS
Division into files
  - 1 XML file per document and annotation layer
Train/test data
  - train : devtest : evaltest = 8 : 1 : 1
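
One simple way to obtain an 8 : 1 : 1 ratio is to deal documents into ten buckets in turn. The slides do not say how PDT 2.0 documents were actually assigned, so the round-robin scheme, the function name and the document IDs below are assumptions for illustration only.

```python
def split_documents(doc_ids):
    """Assign documents to train/devtest/evaltest in an 8 : 1 : 1 ratio."""
    parts = {"train": [], "devtest": [], "evaltest": []}
    for index, doc_id in enumerate(doc_ids):
        bucket = index % 10  # ten buckets: 0-7 train, 8 devtest, 9 evaltest
        if bucket < 8:
            parts["train"].append(doc_id)
        elif bucket == 8:
            parts["devtest"].append(doc_id)
        else:
            parts["evaltest"].append(doc_id)
    return parts


# Hypothetical document identifiers.
docs = [f"doc{n:03d}" for n in range(100)]
parts = split_documents(docs)
print({name: len(ids) for name, ids in parts.items()})
```

Splitting by whole documents rather than by sentences keeps all sentences of one article in the same portion.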
Full vs. sample data
  - sample data:
      - 500 sentences
      - a freely available subset of the full data
      - also converted to HTML (can be viewed in any WWW browser, no tree editor needed)
  - the whole PDT 2.0 except for the full data (but including the sample data, all tools and documentation) is available on the web
  - the full data will be available only to licensed users who obtain the CD from the Linguistic Data Consortium
Tree editor TrEd
  - general customizable tree editor
  - implemented in Perl
  - the main editing and browsing tool in the PDT project
Batch processing of the data
  - btred: batch-processing version of tred
  - ntred: networked (parallelized) version of btred
  - example (prints the t-lemma of every node directly below the root that has a child whose functor starts with DIR):
    $ btred -TNe 'print "$this->{t_lemma}\n" if $this->parent==$root and grep{$_->{functor}=~/^DIR/} $this->children()' data/sample/*.t.gz -q
Netgraph
  - client-server application for on-line PDT search
  - implemented in Java
Tools for post-annotation consistency checking
  - hundreds of btred scripts of various types:
      - technical tests, e.g.
          - each sentence contains at least one token
          - all identifiers are unique, all referred identifiers exist, ...
      - m-layer tests
          - locative (6th case) cannot occur without a preposition
          - improbable word forms (e.g. imperatives haš, tel)
      - a-layer tests
          - not more than one subject in a clause
          - attributes (afun Atr) should not appear directly below verbs
      - t-layer tests
          - surface forms of verb arguments match the specifications in the valency lexicon
          - relative pronouns in relative clauses should be in agreement with their antecedent (in the sense of grammatical coreference)
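
The technical tests listed above are easy to picture on toy data. The sketch below is not one of the btred scripts: it assumes a simplified in-memory representation (dictionaries with hypothetical keys id, tokens and parent) and only illustrates the three technical checks.

```python
def check_sentences(sentences):
    """Return a list of problems found by three simple technical tests."""
    problems = []
    seen = set()
    # Pass 1: non-empty sentences and unique token identifiers.
    for sentence in sentences:
        if not sentence["tokens"]:
            problems.append(f"{sentence['id']}: sentence has no tokens")
        for token in sentence["tokens"]:
            if token["id"] in seen:
                problems.append(f"{token['id']}: duplicate identifier")
            seen.add(token["id"])
    # Pass 2: every referred identifier must exist somewhere in the data.
    for sentence in sentences:
        for token in sentence["tokens"]:
            ref = token.get("parent")
            if ref is not None and ref not in seen:
                problems.append(f"{token['id']}: unknown reference {ref}")
    return problems


# Toy data with one problem of each kind (all IDs are hypothetical).
data = [
    {"id": "s1", "tokens": [
        {"id": "s1w1", "parent": None},
        {"id": "s1w2", "parent": "s1w1"},
        {"id": "s1w2", "parent": "s1w9"},
    ]},
    {"id": "s2", "tokens": []},
]

for problem in check_sentences(data):
    print(problem)
```

Collecting all identifiers in a first pass lets the second pass validate references regardless of the order in which units appear.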
Tools for automatic annotation
  - chain of tools for automatic text processing (from raw text to a-layer trees):
      1. sentence segmentation and tokenization
      2. morphological analysis
      3. morphological disambiguation
      4. dependency parsing (adapted Collins parser)
      5. analytical function assignment
Tools for format conversions
  - conversion not only between PDT data formats, but also from other treebanks’ formats
  - example: constituency trees from Negra displayed in TrEd
PDT 2.0 Documentation
  - PDT Guide
      - overview of all parts of PDT 2.0
      - mirrors the directory structure of the PDT 2.0 CD-ROM
  - Annotation guidelines
      - m-layer (~100 pages)
      - a-layer (~250 pages)
      - t-layer (~800 pages)
  - Publications
      - conference and journal papers, technical reports, theses, ...
  - Technical documentation (software tools and data formats)
Want to experiment with...
  - tagging?
  - dependency parsing?
  - semantic-role labeling?
  - frame semantics?
  - word-sense disambiguation?
  - anaphora resolution?
  - information structure?
  - ...
Use PDT 2.0, it’s all there!!!
Annotation scheme not limited to Czech
  - T-layer in English
  - T-layer in German
  - A-layer in German
  - A-layer in Arabic
  - A-layer in Slovene
  - A-layer in Romanian
Some of those involved
Thank you! BTW, anyone interested in beta-testing?