Published by Elizabeth Manning
1
The Prague (Czech-)English Dependency Treebank
Jan Hajič
Charles University in Prague, Computer Science School, Institute of Formal and Applied Linguistics
Major contributions by:
E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek
2
June 8, 2009, Dependency Workshop, Boulder, CO: Czech-English Dependency Treebank
Today...
The family of Prague Dependency Treebanks
  – Incl. the Prague (Czech-)English Dependency Treebank
English “Tectogrammatical Representation” (TR)
  – Annotation layers
  – From Penn Treebank (et al.) to PDT-style English tectogrammatics
  – TR annotation of 5 interesting English phenomena
The annotation process
  – TrEd, EngVallex and the current status
To take home + pointers
3
The Family of Prague Dependency Treebanks
Prague Dependency Treebank (Czech)
  – 2001: version 1.0 (no deep syntax/semantics)
  – 2006: version 2.0 (w/ deep syntax, semantics)
Prague Czech-English Dependency TB 1.0
  – 2004: automatic annotation
  – English: PTB, Czech: 1/3rd of PTB translated
Prague Arabic Dependency Treebank 1.0
  – 2004: ~ PDT 1.0 (no deep syntax)
4
The Prague Czech-English Dependency Treebank
Penn Treebank
+ PropBank
+ BBN (co-reference and Named Entities)
+ NP structure (D. Vadas, J. R. Curran, ACL’07)
+ “Czech-like” tectogrammatics
Translation to Czech
  – Manual annotation (with auto pre-annotation)
    Morphology, Syntax, Tectogrammatics (TR)
5
Example: English TR
Figure labels: Words, Dependencies, Sem. function, Valency (predicates), Coref (BBN), Named Entities (BBN)
6
Layers of Annotation
t-layer
  – tectogrammatics
a-layer
  – (surface) syntax
m-layer
  – morphology (POS)
w-layer
  – words (tokens)
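One way to picture the four layers is as linked records, each pointing down to the layer below. The sketch below is only an illustration under stated assumptions: the class and field names (`WToken`, `MNode`, `ANode`, `TNode`, `surface_forms`) are hypothetical and do not reproduce the treebank's actual file format, and the `AuxV`/`Pred` labels are illustrative analytic functions. It shows a t-layer node for "join" linked to the a-layer nodes of both "will" and "join", since the auxiliary gets no t-layer node of its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WToken:                      # w-layer: the token as it appears in the text
    form: str

@dataclass
class MNode:                       # m-layer: lemma and POS tag for one token
    w: WToken
    lemma: str
    pos: str

@dataclass
class ANode:                       # a-layer: surface-syntax node with its function
    m: MNode
    afun: str
    parent: Optional["ANode"] = None

@dataclass
class TNode:                       # t-layer: deep-syntax node, may cover several a-nodes
    t_lemma: str
    functor: str
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None

def surface_forms(t: TNode) -> List[str]:
    """Follow the links t-layer -> a-layer -> m-layer -> w-layer."""
    return [a.m.w.form for a in t.a_nodes]

# Hypothetical fragment "will join": the auxiliary has no t-node of its own.
join_a = ANode(MNode(WToken("join"), "join", "VB"), afun="Pred")
will_a = ANode(MNode(WToken("will"), "will", "MD"), afun="AuxV", parent=join_a)
join_t = TNode(t_lemma="join", functor="PRED", a_nodes=[will_a, join_a])

print(surface_forms(join_t))
```

Keeping the links one-directional (upper layer to lower layer) means every layer stays readable on its own, while a parser or generator can still walk from a t-node down to the tokens it covers.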
7
English Surface Syntax
From PTB:
  – Form
  – POS tag
  – Function label
  – (Structure)
Added:
  – Lemma
  – Heads
8
Head Determination Rules
Exhaustive set of rules
  – By J. Eisner + M. Čmejrek/J. Cuřín
  – 4000 rules (non-terminal based)
    Ex.: (S (NP-SBJ VP .)) → VP
  – Additional rules
    Coordination, Apposition
    Punctuation (end-of-sentence, internal)
Original idea (possibility of conversion)
  – J. Robinson (1960s)
9
Example: Head Determination Rules
Rules:
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
Head words shown on the tree nodes (figure): (board) (the) (join) (will) (join)
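The four rules on this slide can be tried out on a small tree. The following is a minimal sketch, not the actual conversion tool: the rule table holds only the slide's examples, the function names (`strip_tag`, `head_child`, `find_head`) are made up for illustration, and the tree for the fragment "will join the board" is an assumed bracketing. Each constituent picks one child as its head; every other child's head word becomes a dependent of that head word.

```python
# Hypothetical mini rule table: (parent label, child labels) -> index of the head child.
# Only the slide's examples are covered; the real rule set is far larger.
HEAD_RULES = {
    ("NP", ("DT", "NN")): 1,   # (NP (DT NN)) -> NN
    ("VP", ("VB", "NP")): 0,   # (VP (VB NP)) -> VB
    ("VP", ("MD", "VP")): 1,   # (VP (MD VP)) -> VP
}

def strip_tag(label):
    """Drop PTB functional tags, e.g. NP-SBJ -> NP."""
    return label.split("-")[0]

def head_child(label, children):
    """Return the index of the head child of a constituent."""
    kids = tuple(strip_tag(child[0]) for child in children)
    if (label, kids) in HEAD_RULES:
        return HEAD_RULES[(label, kids)]
    if label == "S" and "VP" in kids:      # (S (... VP ...)) -> VP
        return kids.index("VP")
    raise KeyError(f"no head rule for {label} -> {kids}")

def find_head(tree, deps):
    """Return the head word of `tree`, appending (dependent, head) pairs to `deps`."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                 # preterminal such as ("NN", "board")
    heads = [find_head(child, deps) for child in children]
    h = head_child(strip_tag(label), children)
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))
    return heads[h]

# Assumed bracketing of the fragment "will join the board".
tree = ("VP", ("MD", "will"),
              ("VP", ("VB", "join"),
                     ("NP", ("DT", "the"), ("NN", "board"))))
deps = []
root = find_head(tree, deps)
print(root, deps)
```

On this fragment the head words percolate as on the slide: "board" heads the NP, and "join" heads both VPs, so "the" depends on "board" while "board" and "will" depend on "join".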
10
Conversion: Analytic Structure, Functions
Syntactic function assignment (conversion)
Rules
  – Based on PTB functional tags:
    PTB tag   Analytic function
    -SBJ      Sb
    -PRD      Pnom
    -BNF      Obj
    -DTV      Obj
    -LGS      Obj
    -ADV      Adv
    -DIR      Adv
    -EXT      Adv
    -LOC      Adv
    -MNR      Adv
    -PRP      Adv
    -PUT      Adv
    -TMP      Adv
  – Ad-hoc rules (if functional tags missing)
  – Lemmatization (years → year)
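The tag-to-function table above amounts to a lookup. Here is a minimal sketch of such a lookup, assuming labels of the form `NP-SBJ` or `PP-LOC-1`; the function name `analytic_function` and the `fallback` parameter are hypothetical, and the fallback merely stands in for the ad-hoc rules, which the slide does not spell out.

```python
# Mapping taken from the slide: PTB functional tag -> analytic function.
PTB_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}

def analytic_function(ptb_label, fallback=None):
    """Return the analytic function for a PTB label such as 'NP-SBJ'.

    The first functional tag found in the table wins; `fallback` is a
    placeholder for the ad-hoc rules used when no functional tag is present.
    """
    for tag in ptb_label.split("-")[1:]:   # skip the syntactic category itself
        if tag in PTB_TO_AFUN:
            return PTB_TO_AFUN[tag]
    return fallback

print(analytic_function("NP-SBJ"))      # Sb
print(analytic_function("PP-LOC-1"))    # Adv (the co-index "1" is ignored)
print(analytic_function("NP"))          # None, left to the ad-hoc rules
```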
11
Syntactic Structure, Functions: PTB to P(E)DT
Penn Treebank structure (with heads added) → PDT-like Analytic Representation → PDT-like Tectogrammatic Representation (automatic pre-annotation)
Figure labels: (board) (the) (join) (will) (join); PRED.Fut, PAT
12
English TR I: Predicative Complement
Free (non-valency) modification (of both a noun and a verb)
  – attribute compl.rf (green arrow to the noun)
13
English TR II: Which + Relative Clause
Example: We have not answered your question completely, for which we apologize.
14
English TR III: Coordination
15
English TR IV: Comparison
16
English TR V: Restriction (“Exclusion”)
except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides
17
English TR: (manual) annotation
TrEd
  – Pre-annotated
  – Graphical TR dep. tree is primary
  – Text + TR
  – Czech translation
Valency (a.k.a. “propbanking”)
  – During TR annotation
  – Propbank origins and examples
    Linked, displayed
18
EngVallex (give)
19
EngVallex Format (admit)
20
Interannotator Agreement
2007-2009:
- New annotators (lower numbers)
- Annotation “by phenomenon”
- Restarting now
21
Prague English Dependency Treebank
Availability
  – Version 1.0 now (PTB license needed)
    250k words
  – Full version (parallel with Czech): late 2010
Size
  – Full WSJ portion of PTB (2312 files)
  – 49208 sentences, 1253013 tokens
  – Now: 17210 sentences (34.97%), 439983 tokens (35.11%)
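The coverage percentages follow directly from the counts on the slide. A quick check, using only those four numbers (the variable names are just for illustration):

```python
# Counts as given on the slide.
total_sentences, total_tokens = 49208, 1253013
done_sentences, done_tokens = 17210, 439983

# Share of the WSJ portion annotated so far, in percent.
sentence_pct = 100 * done_sentences / total_sentences
token_pct = 100 * done_tokens / total_tokens

print(f"{sentence_pct:.2f}% of sentences")   # 34.97% of sentences
print(f"{token_pct:.2f}% of tokens")         # 35.11% of tokens
```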
22
Czech PDT-style Annotation
All layers
  – morphology, syntax, tectogrammatical
So far…
  – Automatic (many tools by many authors)
Manual annotation
  – In progress (28124 sentences/639326 words)
  – Top-down
    Tectogrammatical first (lower layers automatically)
    … then syntactic structure and morphology
23
Summary
PDT is/has (a)…
  – (Family of) dependency-based treebanking project(s)
    Czech (English, Arabic, ...)
  – ~ 1 mil. words
    sufficient size for ML experiments
  – 4 interlinked layers of annotation
    (token, morphology, syntax, deep syntax/semantics++)
    independent and “full” information at all levels
    interlinked (for the development of parsers/generators)
  – Parallel corpus
    Czech-English → Machine Translation
24
Pointers, Acknowledgements
http://ufal.mff.cuni.cz/pedt
http://ufal.mff.cuni.cz/pdt2.0
http://ufal.mff.cuni.cz/~pajas/tred
Acknowledgements
  – FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
  – LC536 (Center for Computational Linguistics)
  – GAČR 405/06/0589 (Speech and deep syntax)
  – MŠMT: MSM0021620838, ME838, ME09008